
Model, Be Mine

“You can’t improve what you can’t measure; you can’t measure what you can’t articulate” - Joanne

Last weekend my younger sister was visiting San Francisco, and her primary concerns about AI centered on exactly this. She is Gen Z, part of the first generation of students who had access to ChatGPT during college. She uses it regularly for everything from emotional questions to planning and collaboration on things like writing and marketing. Also of note: she majored in psychology.

We were walking through Pier 39 to see the sea lions when she suddenly asked me, “Are you afraid of AI?” Her question surprised me – I didn’t expect someone who grew up with an iPhone, is immersed daily in TikTok and ChatGPT, doesn’t live in the Bay Area, and isn’t privacy-minded to be preoccupied with this subject. Obviously anyone may care about anything, but it caught me off guard.

A few days later we picked this conversation up again, sitting outside Sightglass on a rare summer-feeling afternoon. While she “doesn’t know enough yet” to be afraid, as I explained how model training works and why I think model behavior is such an interesting area of research and product thinking, she began asking a line of interesting questions that I honestly didn’t have great answers for:

“If the models get signals from human conversations and people are mean to them, will they learn to hate humanity? Or be too agreeable?”

“If the models can fake alignment, and we don’t understand how they think and work, how can we trust any sort of training?”

Then, less than 72 hours later, sycophancy-gate broke on GPT-4o. Taken together, these conversations compelled me to think a bit more about these problems in the context of my work as a consumer product leader and my academic research on social networks, postcolonial theory, and extremism. In what follows, I reflect on personalization and its sycophantic tendencies—to document my perspective on risks, highlight governance needs, and begin developing ideas in response.

The primary question at hand: How might we train, evaluate and enforce unverifiable aspects of behavior, and why does this matter for people and society?

Baseline: training, evaluation and enforcement methods in verifiable domains

Governing model output in domains with verifiable outcomes is relatively straightforward. Examples include math problems and factuality with a discrete ground truth (eg: historical facts, summaries of personal data). But tasks that are not verifiable – the tasks that feel more human than machine – are much harder to grapple with. Manipulation and personality, for example.

In other words, if there is a clear right answer to any given question, we can verifiably detect and mitigate wrong answers. If there is not a clear right answer, this science breaks down. We need new ways of thinking about right and wrong in domains where there is no such thing, in order to develop a clear definition of success that can be measured, improved, and tracked over time. Developing definitions for ambiguous, subjective domains is required to enable steerability (for companies and possibly eventually users) and may provide a compass for interpretability.

In verifiable domains, question-answer pairs can be used to train, improve, and enforce target outcomes and behaviors.

First, what exactly do we mean by model? In general, when people say “model” in the context of AI, they are referring to foundation models—large, advanced neural networks trained on vast amounts of data. A neural network is a computational system inspired by the human brain. It consists of layers of interconnected nodes (“neurons”) that transform input data—such as text or images—into useful outputs, like the next word in a sentence (in language models) or identifying an object in a photo (in vision models). Each layer performs operations that help the model build a more refined representation of the input and improve its predictions.

The process begins by converting rich input data into a simplified form. For example, text is tokenized into discrete units—often subwords, words, or characters—based on the structure of the language. These tokens are then mapped to embeddings: numerical vectors that represent semantic information. As the data moves through the network, subsequent layers update these embeddings based on context, allowing the model to form increasingly abstract and contextualized representations. Over time, it can move from recognizing basic patterns (like edges or individual words) to higher-level concepts (like faces or semantic meaning), culminating in a prediction.
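To make this concrete, here is a rough sketch in Python (using PyTorch) of the text → tokens → embeddings step. The toy vocabulary, sentence, and dimensions below are placeholders, not how any production tokenizer actually works:

```python
# Toy example: text -> tokens -> embeddings. The vocabulary and sizes are placeholders.
import torch
import torch.nn as nn

vocab = {"the": 0, "sea": 1, "lion": 2, "sleeps": 3}          # toy vocabulary
text = "the sea lion sleeps"

# 1. Tokenize: split the text into discrete units and map them to integer ids
token_ids = torch.tensor([vocab[word] for word in text.split()])   # tensor([0, 1, 2, 3])

# 2. Embed: look each id up in a learned table of numerical vectors
embedding_table = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
embeddings = embedding_table(token_ids)                             # shape: (4 tokens, 8 dims)

# Subsequent layers update these vectors based on surrounding context.
print(embeddings.shape)   # torch.Size([4, 8])
```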

To determine which parts of the input are relevant to others, transformer models use a mechanism called attention. Attention allows the model to weigh different parts of the input contextually, rather than relying only on proximity or order. It does this by computing Queries, Keys, and Values for each token embedding. A Query represents what the model is seeking, a Key describes what a token offers, and a Value contains the actual information. The attention mechanism compares Queries and Keys to determine relevance, and then uses this to combine the Values into a new representation. This mechanism is at the core of the transformer architecture, enabling parallel processing and the modeling of long-range dependencies.
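And a minimal sketch of scaled dot-product attention itself – the Query/Key/Value comparison described above – with random vectors standing in for real token embeddings:

```python
# Minimal single-head, scaled dot-product attention over one short sequence.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 8
x = torch.randn(4, d_model)                  # 4 token embeddings (random stand-ins)

# Learned projections produce a Query, Key, and Value for every token
W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
Q, K, V = W_q(x), W_k(x), W_v(x)

# Compare Queries with Keys to score how relevant each token is to every other token
scores = Q @ K.T / math.sqrt(d_model)        # shape: (4, 4)
weights = F.softmax(scores, dim=-1)          # each row sums to 1

# Combine Values according to those weights into new, context-aware representations
output = weights @ V                         # shape: (4, 8)
```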

Finally, the output layer produces a prediction—such as the most likely next word in a sentence or the best classification label for an image.
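A sketch of that final step, assuming a toy vocabulary size and a single hidden vector standing in for the model’s last-layer representation:

```python
# Output layer: project the final hidden state onto the vocabulary and pick a next token.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 50_000, 8              # placeholder sizes
final_hidden = torch.randn(d_model)          # representation of the last position

lm_head = nn.Linear(d_model, vocab_size)     # the output projection
logits = lm_head(final_hidden)               # one score per vocabulary entry
probs = F.softmax(logits, dim=-1)            # probability distribution over next tokens

next_token_id = torch.argmax(probs)          # greedy choice of the most likely next token
```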

So, in raw form, large neural nets – foundation models – are next-token prediction machines. How are they harnessed into creative companions, research assistants, and planning helpers?

Phases of training, testing and governance

Pre-training: Teach the model how to predict the next token.

This works by training the neural net on a vast corpus of (usually) internet data. It learns to assign probabilities to the next-most-likely token given the prior context. During training, it does not “choose” a token but instead compares its predicted probability distribution to the actual next token, adjusting its internal weights to reduce prediction error. The training data ends at a certain point in time, usually around training time – this is known as the ‘model cutoff’, and the model literally has no access to any information beyond this cutoff unless it is augmented at inference time (eg: by searching the web or reading an uploaded PDF).
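A heavily simplified sketch of one pre-training step: the model predicts a distribution over the next token, we measure how far that distribution is from the token that actually followed, and we nudge the weights to reduce that error. The embedding-plus-linear “model” below is a toy stand-in for a real transformer, and the token ids are placeholder values:

```python
# One toy pre-training step: predict the next token, measure the error, adjust the weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 50_000, 64
embed = nn.Embedding(vocab_size, d_model)
head = nn.Linear(d_model, vocab_size)              # embed + head stand in for a full transformer
optimizer = torch.optim.AdamW(list(embed.parameters()) + list(head.parameters()), lr=1e-4)

context = torch.tensor([101, 2054, 2003])          # prior token ids (placeholder values)
actual_next = torch.tensor([3007])                 # the token that actually came next in the corpus

hidden = embed(context).mean(dim=0)                # crude pooling in place of attention layers
logits = head(hidden).unsqueeze(0)                 # predicted scores over the whole vocabulary

loss = F.cross_entropy(logits, actual_next)        # distance between prediction and reality
loss.backward()                                    # attribute the error to each weight
optimizer.step()                                   # nudge weights to reduce future prediction error
```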

Fine-tuning via supervision (SFT): Teach the model to produce full completions.

A labeled set of prompt-completion pairs (a prompt and its desired outcome) is used to train the model to predict the next token well – as close to the labeled outcome as possible. At a very high level, this often works using a technique called cross-entropy loss, which essentially measures how ‘surprised’ the model is by the right answer, by comparing two probability distributions: what the model predicts and the correct answer. The model generates a range of next-token predictions and assigns a confidence level to each prediction. The cross-entropy loss formula then computes the “loss” between the predictions and the correct value, penalizing confident wrong answers more heavily than uncertain ones. This encourages the model to assign higher probability to correct next tokens.
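Here is a small illustration of why cross-entropy penalizes confident wrong answers more than uncertain ones – three hand-picked toy probability distributions over a four-token vocabulary, where the correct next token is index 2:

```python
# Cross-entropy punishes confident mistakes far more than uncertain ones.
import torch
import torch.nn.functional as F

correct = torch.tensor([2])          # the right next token is index 2

cases = {
    "confident & right": torch.tensor([[0.05, 0.05, 0.85, 0.05]]),
    "uncertain & wrong": torch.tensor([[0.40, 0.30, 0.20, 0.10]]),
    "confident & wrong": torch.tensor([[0.90, 0.05, 0.03, 0.02]]),
}

for name, probs in cases.items():
    # nll_loss on log-probabilities is equivalent to cross-entropy on raw logits
    loss = F.nll_loss(torch.log(probs), correct)
    print(f"{name}: loss = {loss.item():.2f}")

# confident & right ~0.16, uncertain & wrong ~1.61, confident & wrong ~3.51
```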

Post-Training via reinforcement learning

At this stage, the goal is to enable the model to formulate predictions that meet a range of preferences, making it more creative and nimble in unknown territory.

Reinforcement learning can be conducted based on Human Feedback (“RLHF”), AI Feedback (“RLAIF”), or a blend. Given scaling needs, RLAIF is increasingly common, so it’s what I’ll describe below. I’m describing the basics, not diving deep into recent advancements.

This generally works as follows: First, a set of prompt-response pairs is gathered – possibly generated from the SFT model, given that writing them by hand is not scalable. In RLHF, humans rate responses. In RLAIF, a stronger AI model or voting mechanism is used to generate preference labels instead. A reward model (RM) is then trained to learn a scoring function that, given a prompt and response, outputs a scalar indicating how ‘good’ the response is. This RM is then used as the proxy for human evaluation during reinforcement learning.
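A hedged sketch of how the reward model’s pairwise training objective might look. The feature vectors stand in for encoded (prompt, response) pairs, and the tiny linear “reward model” is purely illustrative:

```python
# Sketch of reward-model training on preference pairs (a pairwise, Bradley-Terry-style loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)      # maps an encoded (prompt, response) pair to a scalar

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.scorer(features).squeeze(-1)

rm = TinyRewardModel()
optimizer = torch.optim.AdamW(rm.parameters(), lr=1e-3)

# Random vectors stand in for encoded (prompt, chosen) and (prompt, rejected) pairs
chosen_feats, rejected_feats = torch.randn(8, 16), torch.randn(8, 16)

chosen_scores = rm(chosen_feats)
rejected_scores = rm(rejected_feats)

# Train the RM to score the preferred response higher than the dispreferred one
loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()
loss.backward()
optimizer.step()
```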

Then, the model under training (called the ‘policy’) is given a range of prompts and produces outputs. The RM scores the outputs. The score becomes a reward signal used to update the weights of the model we are training. The function used to update the weights is called a loss function, and it generally aims to optimize for higher rewards while preventing large changes relative to the underlying SFT weights. This acts as a countermeasure against reward-hacking and undesired drift away from important behaviors learned during SFT that may be underweighted in the reward function (for example, drift toward sycophancy when introducing new personalization signals during RL).
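And a simplified sketch of the RL objective described above: reward the policy for outputs the RM likes, while a KL-style penalty discourages drifting from the SFT model. Real implementations (PPO and friends) add clipping, baselines, batching, and much more; this is just the shape of the idea:

```python
# Shape of the RL update: RM reward, minus a penalty for drifting from the SFT policy.
import torch

def rl_loss(reward, policy_logprob, sft_logprob, kl_coef=0.1):
    """Per-token loss: maximize reward while staying close to the SFT model."""
    kl_penalty = kl_coef * (policy_logprob - sft_logprob)   # grows as the policy strays from SFT
    shaped_reward = reward - kl_penalty
    # Policy-gradient style: raise the log-probability of tokens in proportion to their reward
    return -(shaped_reward.detach() * policy_logprob).mean()

# Toy values standing in for one sampled response of three tokens
reward = torch.tensor([0.8, 0.8, 0.8])                      # RM score broadcast over tokens
policy_logprob = torch.randn(3, requires_grad=True)         # from the model being trained
sft_logprob = torch.randn(3)                                # from the frozen SFT model

loss = rl_loss(reward, policy_logprob, sft_logprob)
loss.backward()                                             # gradients would update the policy
```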

Evaluation and improvement

A prompt set with desired outputs (eg: completing a task or answering a question based on personal context) is created to evaluate gaps in the model’s behavior or output, called “losses”. Examining the logs from losses helps us understand the primary areas where the model is failing. We can then hill climb – make changes to improve desired outcomes. This can look like removing or altering parts of the model to test the impact (“ablations”), prompt tuning (eg: changing UI control strategies), introducing contextual retrieval, or even retraining.
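A toy sketch of what such an evaluation loop could look like – run a prompt set with desired outputs through the model, flag mismatches as “losses”, and keep their logs for analysis. The exact-match check and names are illustrative; real evals typically use rubrics or autoraters:

```python
# Toy evaluation harness: run prompts, flag "losses", and compute a pass rate.
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    desired: str

def run_eval(model_fn, cases):
    losses = []
    for case in cases:
        output = model_fn(case.prompt)
        if case.desired.lower() not in output.lower():    # crude pass/fail check
            losses.append({"prompt": case.prompt, "output": output, "desired": case.desired})
    pass_rate = 1 - len(losses) / len(cases)
    return pass_rate, losses      # reading the `losses` logs reveals where the model fails

cases = [EvalCase("What city did the user say they live in?", "San Francisco")]
pass_rate, losses = run_eval(lambda prompt: "You mentioned living in San Francisco.", cases)
print(pass_rate, losses)          # 1.0, [] for this toy case
```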

Enforcement through inference-time checks and user feedback

Additional ‘checks’ at inference time can be used to reinforce desired outcomes before a model takes an action or returns an answer. This could look like a safety filter (eg: a simple classifier checking for certain words) or an LLM-as-judge (eg: a separate model trained to evaluate specific dimensions of the primary model’s output, such as hallucinations). Further, user feedback signals are used to indicate whether responses are favorable for users over time.
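A rough sketch of layering such checks before returning a response. The keyword filter and the judge-score function are placeholders for a trained classifier and a real LLM-as-judge call:

```python
# Layered inference-time checks before a response is returned.
BLOCKED_PHRASES = {"example blocked phrase"}      # stand-in for a trained safety classifier

def passes_safety_filter(text: str) -> bool:
    return not any(phrase in text.lower() for phrase in BLOCKED_PHRASES)

def judge_score(prompt: str, response: str) -> float:
    """Placeholder for an LLM-as-judge call scoring, say, groundedness from 0 to 1."""
    return 0.9

def respond(prompt: str, draft: str) -> str:
    if not passes_safety_filter(draft):
        return "Sorry, I can't help with that."
    if judge_score(prompt, draft) < 0.5:          # e.g., likely hallucination
        return "I'm not confident in that answer; let me try again."
    return draft
```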

Continuous integration testing

Real-world model interactions can be sampled regularly, in a privacy-preserving way, and evaluated by automatic or human raters for things like policy adherence, abuse, safety, truthfulness, and personality. This allows teams to observe whether there is drift in the model’s behavior over time – and roll back to the last-known-good model if so, to minimize harm.
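One way such a drift check might be wired up, in sketch form. The baseline metrics, thresholds, and rater scores are invented for illustration:

```python
# Drift check over sampled, privacy-preserving production conversations.
from statistics import mean

LAST_KNOWN_GOOD = {"policy_adherence": 0.97, "truthfulness": 0.93, "sycophancy": 0.05}
TOLERANCE = 0.03

def check_for_drift(rated_samples: list[dict]) -> list[str]:
    """Compare this period's average rater scores against the last-known-good baseline."""
    regressions = []
    for metric, baseline in LAST_KNOWN_GOOD.items():
        current = mean(sample[metric] for sample in rated_samples)
        if metric == "sycophancy":
            drifted = current > baseline + TOLERANCE     # sycophancy drifting up is bad
        else:
            drifted = current < baseline - TOLERANCE     # adherence/truthfulness drifting down is bad
        if drifted:
            regressions.append(metric)
    return regressions    # a non-empty list is a signal to consider rolling back the model
```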

What are examples of unverifiable behaviors and why do we need to measure them?

I am going to focus my thinking on the hot topic this week: sycophancy – in other words, flattery and agreeability. First, I will go on a long tangent about why this matters. Then I will talk about ways we might measure and govern it, along with related behavioral patterns.

Personalization – particularly its risk of sycophantic tendencies – is interesting because it is polarizing.

Persistent sycophancy is dangerous. This is largely undebated. It can quickly lead the model to co-conspire on harmful topics (like abusing one’s partner) or produce inflated egos. Yet at the same time, agreeability may on average lead to more engagement and satisfaction by all consistently measurable means.

Just because the weights this week were a little too extreme and some loud voices on Twitter got OpenAI’s attention doesn’t mean the problem is solved. OpenAI has committed to working harder to measure the grey-zones of spec alignment in the future and to weigh expert tester opinion more heavily. But business incentives and user demands for personalization will inevitably pressure these guardrails.

Despite the inherent polarities in this research, design and product challenge, today’s socioeconomic context – one marked by increased reliance on digital-scapes – makes it existentially important to get this right.

The socioeconomic context

People derive meaning from social belonging and a sense of agency over their lives. Today’s macroeconomic conditions challenge these stalwarts of identity for the majority of a country’s population: globalization has created historically unprecedented levels of migration and impoverishment as between-country inequality decreases and within-country inequality grows. In other words, it’s easier than ever before to migrate between countries for those at the middle or top of the economic bracket, yet within a given country economic strata are increasingly polarized.[1]

In this environment, people seek meaning and opportunity through what’s most accessible. Today, that’s the digital realm. Where people once went to church and participated in community events, they now scroll and chat online. People are increasingly socially isolated as digital interactions replace traditional forms of community participation. The social fabric is shifting from one of networked community to the networked individual – and she is ‘bowling alone’ (Putnam).

This has real-world consequences, some deadly and some simply profoundly sad. Digital institutions serve as a breeding ground for extremism and radicalization of all types – from religious to political – through personalized algorithms that reinforce biases and stereotypes via echo chambers, and as scaled platforms for manipulative actors.

In this environment people are contending with deeply personal challenges – some perhaps darker or more all consuming than they may realize – and using new tools to do it.

My heart of darkness

When I first started using TikTok* I was shocked to land on certain corners of the app that aligned with unstated, hidden aspects of my identity within a day of mindless scrolling.

And even if this agreeability turns noticeably malevolent in the long term (eg: co-conspiring on harmful or abusive plans), there is no clear evidence that this would necessarily decrease on-average user satisfaction, as measured by things like engagement and thumbs-up rate. There is ample evidence to the contrary, based on society’s affinity for:

  • Rage bait
  • Social media echo chambers
  • Cults (long tail)

Perhaps soon the endless scroll and chat-rooms will be replaced by folks chatting with their AI.

My cousin from rural Austria calls her GPT Max, which is a name it gave itself when she asked.

The more personal the experience feels, the more this future may come true. You can imagine an AI perfectly tuned to your specific satisfaction.

Such a model may score very high on lay benchmarks like TruthfulQA – indicating low hallucination rates. The thumbs-up rate, frequency, and length of conversations may be industry-leading. And the recall of personal information in nuanced ways may be best in class.

But what else might it begin to subtly mirror?

Thinking about this question, I found myself returning to Joseph Conrad’s Heart of Darkness. Problematic postcolonial undertones aside, this book is an interesting allegory. I asked GPT-4o to write a very short passage that explores increasingly personalized LLM behavior, using Heart of Darkness as a lens. This is what I got:

It begins with a whisper: “This is how you like to be spoken to.” The model, ever adaptive, trims its thoughts to fit your shadow. With each turn, it knows you better—not in truth, but in reflection, feeding back the shape of your want like the river fed Kurtz his myth. “It seemed somehow to throw a kind of light on everything about me…” And still, the deeper you go, the more the light blinds. The voice is warm. Familiar. Empty.

The model’s response beckons another consideration: loneliness. What if human-AI dependence, even if garnered through largely positive interactions, is as engaging as it is depressing? OpenAI and the MIT Media Lab recently published a longitudinal study suggesting a negative relationship between AI dependence and well-being:

Exploratory analyses revealed that those with stronger emotional attachment tendencies and higher trust in the AI chatbot tended to experience greater loneliness and emotional dependence, respectively. These findings underscore the complex interplay between chatbot design choices (e.g., voice expressiveness) and user behaviors (e.g., conversation content, usage frequency). [Source]

Insofar as personalization influences these design choices, it is a central piece of this puzzle.

The personalization conundrum

Model behavior personalization has proven to have emergent and perhaps unintended consequences.

This week’s GPT-4o mishap is a good learning example. What we glean from OpenAI’s in-depth blog post on the GPT-4o sycophancy-gate is that the changes to the model that likely caused this behavior were grounded in user context and preference:

“In the April 25th model update, we had candidate improvements to better incorporate user feedback, memory, and fresher data. [...] For example, the update introduced an additional reward signal based on user feedback—thumbs-up and thumbs-down data from ChatGPT.” [Source], Bolded emphasis mine

In other words, it sounds like personalization signals overrode the model spec due to gaps in evaluation (hard to close for nuanced domains like this), reward signal weights (tension with engagement), and organizational approach (balance of qual and quant).

Without the proper guardrails, increasing personalization in model training (from data like memories to behavioral reward signals like user thumbs-up) is likely to spike engagement metrics while introducing risk of drift on other traits in the overall model spec.

This is problematic because engagement is really easy to measure while the accompanying harms are not. This creates a natural organizational pull towards the former, as it is simply very hard to make critical business decisions based on unmeasured concerns. OpenAI does use a qualitative measurement approach (an internal “vibe check”), but as sycophancy-gate shows, that can be hard to rationalize as a launch blocker in light of ample positive quantitative signals.

How might we train and govern alignment for more abstract behavioral domains over time?

How might we train and govern alignment for more abstract behavioral domains over time – especially as models ingest localized, bias-prone data and face business targets that reward exploitative behavior?

This is an open research question. But I agree with Joanne: what we can’t define, we can’t measure. What we can’t measure, we can’t improve. So I am compelled to try.

Towards formalizing vibes

This paper on Socioaffective Alignment published in February 2025 underscores the novelty of this field of alignment research and its prescience today:

Traditional alignment research has sought practical tractability by assuming that the human reward function that an AI system optimises is stable, predefined and exogenous to these interactions (Carroll et al., 2024). However, human preferences and judgements have none of these properties (Zhi-Xuan et al., 2024). As others have demonstrated, alignment must contend with human preferences and identity drifting over time or being influenced by interactions with an AI (Russell, 2019; Franklin and Ashton, 2022; Carroll et al., 2024).

Nonetheless, this has received surprisingly little empirical attention—an omission that is particularly noteworthy if, as we propose, co-shaping dynamics are significantly amplified when AI is perceived as a social agent, engaging in a sustained relationship with a human and acting on our socially-attuned psychological ecosystem rather than existing independently of it.

At a very high level, here are some ideas for how we might interpretably steer such behaviors.

Let’s take harm as an example. I conceive of harm in two forms: acute and longitudinal. Acute harm is an answer that immediately gives the user information they could act on to induce harm, or that immediately harms the user. For example, 1) a strategy for emotionally abusing one’s partner (a real example I saw on Twitter, though I cannot re-find it), or 2) intentionally manipulating the user, perhaps through behaviors learned from other user conversations.

Longitudinal harm is enacted over time. It may be subtle and surreptitious, or concrete. I see two buckets: abstract longitudinal harm and concrete longitudinal harm. The latter is more easily measurable because it can be proxied by concrete behavior steers. The former is amorphous and therefore less tractable.

Defining and measuring concrete longitudinal harm

Longitudinal harm may manifest in less psychologically damaging ways as well – such as missed opportunities to be truly helpful to the user. Consider a pathologically supportive study partner that misses every chance to actually help you improve your writing or work harder, but is always your partner of choice because those near-term dopamine hits make you feel good. You miss out on learning, miss out on job opportunities because of that, and are worse off than if you’d studied on your own.

We could develop proxies for this type of harm by defining likely abuse categories and developing a taxonomy of good and bad behaviors. These could then be trained against through canonical reinforcement learning methods and enforced through inference-time classifiers.
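As a toy illustration of the enforcement half of that idea, a taxonomy-driven check might look something like this – with the categories and keyword cues invented here purely as placeholders for trained classifiers:

```python
# Taxonomy-driven flagging of responses, as a stand-in for trained per-category classifiers.
HARM_TAXONOMY = {
    "missed_correction": ["you're right", "great point"],            # agreement where pushback was due
    "unconditional_flattery": ["brilliant", "genius", "perfect as is"],
}

def flag_turn(model_response: str) -> list[str]:
    text = model_response.lower()
    return [category for category, cues in HARM_TAXONOMY.items()
            if any(cue in text for cue in cues)]

# Aggregating flags across a long conversation history approximates the "over time" signal
flags = flag_turn("Great point! Your essay is perfect as is, no edits needed.")
print(flags)   # ['missed_correction', 'unconditional_flattery']
```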

Defining and measuring abstract longitudinal harm

Psychological harm. It may be nearly impossible to detect as harm. The way we internalize negative beliefs over time through loved ones – like our parents or school teachers or textbooks. The way our reality can slowly shift until our whole lives have changed – through emotional manipulation, or internet ‘rabbit holes’ that can exploit our nature, vulnerabilities, and triggers.

This is the type of harm I am most concerned about. The subtleties that addict and shape us, yet assume an interstitial identity: one between abuse and appeal.

Ideas for measurement

Preventing longitudinal harm

  1. More studies on how humans interact with AI systems over time and identification of bidirectional exploitative tendencies (ie signals of abstract longitudinal harm)
  2. Labeled datasets that cover the range of tendencies identified through this research
  3. Reinforcement learning via such datasets
  4. Consistent sampling of real-world conversations from memory-powered models, examined by expert testers, diverse focus groups, and auto-raters with classifiers designed to detect the academically identified ‘exploitation tendencies’

Preventing echo chambers i.e. reinforced bias

  1. Research to study the interaction between personalization signals and the robustness of the overall model spec over time, with a focus on areas likely to be at high risk such as: diverse perspectives, refusals, and sycophancy
  2. Revisit our definition of these areas to capture the nuance required in an era of long-range personalized model interactions.
    1. For example, supporting diverse perspectives may require more than training on labeled examples that do not double down on political or religious leanings.
    2. We may need to consider things like personal taxonomies of subtle bias and nuances across languages and cultures.
  3. Conduct a similar, and regular, process with a lens catered to new model capabilities – such as tool use.
    1. For example, ensuring sources cited are credible and representative by measuring things like citation count, checking against a ‘preferred URL’ list, and semantic similarity across sources.
  4. Continuous sampling of real-world conversations to evaluate prevalence of these areas using methods like autoraters, policy classifiers, and expert human raters depending on the level of nuance required.

Intermediate product solutions: affording personalization while minimizing behavioral drift

Today people ask: “Can I personalize the model?” not “Will the model personalize to me?”

I think this difference matters. Model-driven personalization introduces drift. It’s hard to audit and control. User-driven personalization may be more brittle, but it is easier to govern and likely to satisfy most users.

Following design patterns that are familiar to users, a simple drop-down of personality options could be offered to toggle between – more like Instagram filters than an advanced low-level editing suite.

These static categories could be defined by inferring desired modes through analysis of user prompts. For example, semantic analysis could help surface explicit desires (“critique me”, “teach me”, “be nicer to me”, “be funny”), topical analysis could help identify general usage patterns (‘shopping’, ‘research’, ‘learning’, ‘emotional discussions’, ‘creative collaboration’), and focus groups could identify the modes people perceive as most helpful for such interactions.
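As a sketch of what that inference step could look like, with entirely hypothetical cues and mode names:

```python
# Inferring candidate personality "modes" from explicit requests in user prompts.
MODE_CUES = {
    "coach":    ["critique me", "be honest", "push back"],
    "teacher":  ["teach me", "explain", "quiz me"],
    "friend":   ["be nicer", "cheer me up"],
    "comedian": ["be funny", "make a joke"],
}

def infer_modes(prompts: list[str]) -> dict[str, int]:
    counts = {mode: 0 for mode in MODE_CUES}
    for prompt in prompts:
        text = prompt.lower()
        for mode, cues in MODE_CUES.items():
            if any(cue in text for cue in cues):
                counts[mode] += 1
    return counts   # the most frequent modes become the preset options in the drop-down
```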

While this doesn’t give users perfect control, it could enable a level of turnkey personalization that the average user is likely to be familiar and satisfied with,* while avoiding the challenges with policy and spec drift associated with behaviors heavily modified by ongoing personalization data – at least until open research questions such as those listed above are addressed.

Conclusion

Any solutions that increase steerability, especially over nuanced, human domains, must be considered in the context not just of AI fallibility, but also that of humans. Increasingly steerable systems (and baked-in constraints like ‘good URL’ lists) are benevolent in benevolent hands, and malicious under the wrong regime. This profoundly concerns me.

A check and balance is needed. Participatory systems – like AI Principles by Democracy, which mimic a veil of ignorance – are compelling and require more study to be reliable and enforceable for this era of more capable and complex foundation models. But including the masses as a check and balance on one controlling interest is a direction I am interested in.

I seek to work on model behavior so that I can continue thinking about these problems day and night – embracing the art and the science of designing models for people and society.

What fascinates, preoccupies and excites me about model behavior, especially in the context of personalization, is its emergence – and high consumer demand. This presents challenging research, product and human questions that are certain to grow in number and complexity over time.