It is interesting that they are focusing a large part of this release on the model having a higher "EQ" (Emotional Quotient).
We're far from the days of "this is not a person, we do not want to make it addictive" and getting a firm foothold in the territory of "here's your new AI friend".
This is very visible in the example comparing 4o with 4.5, where the user complains about failing a test: 4o's response is the "typical AI response" with problem-solving bullets, while 4.5 sends what you'd expect from a pal over instant messaging.
It seems Anthropic and Grok have both been moving in this direction as well. Are we going to see an escalation of foundation models impersonating "a friendly person" rather than "a helpful assistant"?
Personally I find this worrying and (as someone who builds upon SOTA model APIs) I really hope this behavior is not going to seep into API responses, or will at least be steerable through the system/developer prompt.
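For what it's worth, today you can at least try to pin the persona down yourself. Here's a minimal sketch with the OpenAI Python SDK; the model name and instruction wording are placeholder assumptions on my part, not a tested recipe:

    # Sketch: pinning a neutral persona via the system/developer prompt.
    # Model name and instruction wording are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-4.5-preview",  # placeholder; use whatever model you target
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a concise, neutral technical assistant. "
                    "Do not adopt a chatty or emotionally supportive tone."
                ),
            },
            {"role": "user", "content": "I failed my test today."},
        ],
    )
    print(response.choices[0].message.content)

Whether the post-trained tone wins out over a system prompt like this is exactly the open question.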
The whole robotic, monotone, helpful-assistant thing was something these companies had to actively hammer in during the post-training stage. It's not really how LLMs sound by default after pre-training.
I guess they're caring less and less about that effort, especially since it hurts the model in some ways, like creative writing.
Maybe, but I'm not sure how much the style is deliberate vs. a consequence of the post-training tasks like summarization and problem solving. Without seeing the post-training tasks and rating systems it's hard to judge if it's a deliberate style or an emergent consequence of other things.
But it's definitely the case that base models sound more human than instruction-tuned variants. And the shift isn't just vocabulary, it's also in grammar and rhetorical style. There's a shift toward longer words, but also participial phrases, phrasal coordination (with "and" and "or"), and nominalizations (turning adjectives/adverbs into nouns, like "development" or "naturalness"). https://arxiv.org/abs/2410.16107
How is "development" an adverb or adjective turned into a noun??
It comes from a French word (développement), and that in turn was just a natural derivation of the verb "développer"... no adverbs or adjectives (English or otherwise) seem to come into play here.
Sorry, I should have said adjectives or verbs, as it's "develop" turned into a noun. Just like "discernment" or "punishment". The etymology isn't relevant for classifying it as a nominalization, only the grammatical function.
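If anyone wants to eyeball this shift on their own text, a crude suffix heuristic goes a long way. A rough sketch (the suffix list and length cutoff are my own guesses, not the methodology of the linked paper):

    # Crude nominalization spotter: flags words ending in common
    # deverbal/deadjectival noun suffixes. Suffix list and length
    # cutoff are guesses, not the linked paper's methodology.
    import re

    SUFFIXES = ("ment", "tion", "sion", "ness", "ity", "ance", "ence")

    def nominalization_rate(text: str) -> float:
        words = re.findall(r"[a-z]+", text.lower())
        if not words:
            return 0.0
        hits = [w for w in words if len(w) > 6 and w.endswith(SUFFIXES)]
        return len(hits) / len(words)

    print(nominalization_rate("they develop it and it grows naturally"))        # 0.0
    print(nominalization_rate("its development ensures the naturalness of it")) # ~0.29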
Or maybe they're just getting better at it, or developing better taste. After switching to Claude, I can't go back to ChatGPT's overly verbose, bullet-point-laden book reports every time I ask a question. I don't think that's pretraining; it's in the way OpenAI approaches tuning and prompting vs. Anthropic.
If it's just a different choice during RLHF, I'll be curious to see what the performance trade-offs are.
The "buddy in a chat group" style answers do not make me feel like asking it for a story will make the story long/detailed/poignant enough to warrant the difference.
Anthropic pretty much abandoned this direction after Claude 3 and said it wasn't what they wanted [1]. Claude 3.5+ is extremely dry and neutral; it doesn't seem to have the same training.
>Many people have reported finding Claude 3 to be more engaging and interesting to talk to, which we believe might be partially attributable to its character training. This wasn’t the core goal of character training, however. Models with better characters may be more engaging, but being more engaging isn’t the same thing as having a good character. In fact, an excessive desire to be engaging seems like an undesirable character trait for a model to have.
It's the opposite incentive to ad-funded social media. One wants to drain your wallet and keep you hooked, the other wants you to spend as little of their funding as possible finding what you're looking for.
> We're far from the days of "this is not a person, we do not want to make it addictive" and getting a firm foothold in the territory of "here's your new AI friend".
That’s a hard nope from me when companies pull that move. I’ll stick to my flesh-and-blood humans, who still hallucinate, but only rarely.
Yes, the "personality" (vibe) of the model is a key qualitative attribute of gpt-4.5.
I suspect this has something to do with shining a light on an added value prop in a dimension many people will appreciate, since the gains in quantitative comparisons with other models weren't notable enough to pop eyeballs.
Now you just need a Pro subscription to get Sora to generate a video to go along with this, post it to YouTube, and rake in the views (and the money that goes along with them).
That was impressive. If it all came from just this short 4-line prompt, it's even more impressive.
All we're missing now is a text-to-video model (or text-to-audio and then audio-to-video) that can convincingly follow the style instructions for emphasis and pausing. Or are we already there?
Yesterday, I had Claude 3.7 write a full 80,000-word novel. My prompt was a bit longer, but the result was shockingly good. The new thinking mode is very impressive.
I had been sleeping on Claude's ability to write books until, a couple of days ago, I had it write a novel set in the Accelerando universe. It whipped up a very convincing, complete multi-act, 13-chapter side plot about humans learning to interact with Economics 2.0. It was quite good, though I'm sure cstross would be horrified.
Okay, you know what? I laughed a few times. Yeah, it may not work as an actual stand-up routine for a general audience, and it's kinda cringe (as most LLM-generated content is), but it was legitimately entertaining to read.
My benchmark for this has been asking the model to write some tweets in the style of dril, a popular user who writes short funny tweets. Sometimes I include a few example tweets in the prompt too. Here's an example of results I got from Claude 3 Opus and GPT 4 for this last year: https://bsky.app/profile/macil.tech/post/3kpcvicmirs2v. My opinion is that Claude's results were mostly bangers while GPT's were all a bit groanworthy. I need to try this again with the latest models sometime.
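For anyone who wants to reproduce this kind of benchmark, the setup is just a few-shot prompt over the API. Roughly (the placeholders stand in for real example tweets, and the model name is whatever you're comparing):

    # Sketch of the few-shot setup: paste a handful of example tweets,
    # then ask for new ones in the same style. Placeholders stand in
    # for the actual example tweets.
    from openai import OpenAI

    client = OpenAI()

    examples = ["<example tweet 1>", "<example tweet 2>", "<example tweet 3>"]
    prompt = (
        "Here are some tweets by @dril:\n"
        + "\n".join(f"- {t}" for t in examples)
        + "\n\nWrite five new tweets in the same style."
    )

    response = client.chat.completions.create(
        model="gpt-4",  # swap in whichever model you're evaluating
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)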
If you like absurdist humor, go into the OpenAI playground, select 3.5-Turbo, and dial up the temperature to the point where the output devolves into garbled text after 500 tokens or so. The first ~200 tokens are in the freaking sweet spot of humor.
Maybe it's rose-colored glasses, but 3.5 was really the golden era for LLM comedy. More modern LLMs can't touch it.
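If you want to recreate this outside the playground, the same knob is exposed in the API; temperature tops out at 2.0 there, and something just under that tends to hit the window described above. A sketch (the exact values are just what I'd try first):

    # Sketch: near-maximum temperature with a capped output length, so
    # you catch the coherent-but-unhinged window before the text fully
    # degrades into garbage.
    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=1.9,   # API ceiling is 2.0; tune to taste
        max_tokens=300,    # roughly the "sweet spot" window
        messages=[{"role": "user", "content": "Tell me about your day."}],
    )
    print(response.choices[0].message.content)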
Just ask it to write you a film screenplay involving some hard-ass 80s/90s action star and someone totally unrelated and the opposite of that. The ensuing unhinged magic is unparalleled.
> We're far from the days of "this is not a person, we do not want to make it addictive" and getting a firm foothold in the territory of "here's your new AI friend".
And soon we'll have the new AI friend recommending Bud Lite™ and turning the beer can with the logo towards you.
I don't know if I fully agree. The input clearly shows the need for emotional support more than "how do I pass this test?" The answer by 4o is comical even if you know you're talking to a machine.
It reminds me of the advice to "not offer solutions when a woman talks about her problems, but just listen."
How could a machine provide emotional support? When I ask questions like this to LLMs, it's always to brainstorm solutions. I get annoyed when I receive fake-attention follow-up questions instead.
I guess there's a trade-off between being human and being useful. But this isn't unique to LLMs; it's similar to how one wouldn't expect a deep personal connection with a customer service professional.
There are some businesses trying to do emotional support with AI, like AI girlfriends, etc.
Some will make a profit as a niche thing (millions of users at global scale, and if the unit economics work, that can mean millions of dollars).
But it seems it will never be something really mainstream because most normal people don't care what a bot says or does.
The example I always think of: chess engines have been better at chess than humans for decades, but very few people watch Stockfish tournaments. Everyone loves Magnus Carlsen, though.
I agree with you on the timescale of a single generation.
I disagree with you on the timescale of n ≥ 2 generations: kids/teens/adults will pick up new habits and ways of seeing the world.
Just as someone like me can look like a grizzled old fool for not seeing the appeal of TikTok, it's 100% possible to be blind to the very real appeal of a 24/7 sycophantic "friend".
And I'll give you a concrete example: I was at a business conference 3 weeks ago where I talked to the group about a trap people can easily fall into, that of ditching personal/professional support for AI support (the trap being that it's easy for the "digital friend" to rope you in by just being sycophantic enough: "it's never your fault").
And then in the very same meeting, one of the keynote speeches was this influential female CEO explaining how she had "taught her custom GPT to become her spiritual leader" and how this GPT spiritual teacher was acting as her guide, therapist and coach (complete with a name, backstory and profile picture). I was rolling my eyes so hard they might have fallen out of my head.
This is where we're headed, and people like this misguided CEO will lead their audiences and followers straight there (especially when that is combined with financial incentives or social rewards).
I think it's a good thing because (idk why) I just start tuning out after getting reams and reams of bullet points whose truthfulness I'm already not super confident about.
Well yeah, if the LLM can keep you engaged and talking, that'll make them a lot more money compared to if you just use it as an information retrieval tool, in which case you're likely to leave after getting what you're looking for.
Since they offer a subscription, keeping you engaged just requires them to waste more compute. The ideal case would be the LLM giving you a one-shot correct response using as little compute as possible.
In a subscription business, you don't want the user to use as few resources as possible. It's the wrong optimization to make.
You want users to keep coming back as often as possible (at the lowest cost-per-run possible, though). If they're not coming back, they're not renewing.
So, yes, it makes sense to make answers shorter to cut compute cost (which these SMS-length replies could accomplish), but the main point of making the AI flirtatious or "concerned" is possibly the addictive factor of having a shoulder to cry on 24/7, one that doesn't call you on your BS and is always supportive... for just $20 a month.
The "one-shot correct response" to "I failed my exams" might be "Tough luck, try better next time" but if you do that, you will indeed use very little compute because people will cancel the subscription and never come back.
AI subscriptions are already very sticky. I can't imagine not paying for at least one, so I doubt they care about retention like the rest of us plebs do.
First imagine paying a subscription fee that actually makes the company profitable and gives investors their ROI; then I think you can also imagine not paying that amount at all.
The Plus-level subscription has limits too, and the Pro level costs 10x more; as long as Pro users don't use ChatGPT 10x more than Plus users on average, OpenAI comes out ahead. There's also the user-retention factor.
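Back-of-the-envelope version of that, using the public $20/$200 tier prices; the per-query serving cost and usage numbers below are made up purely for illustration:

    # Toy margin model: Pro pays 10x the Plus price, so it stays the
    # better deal for OpenAI as long as Pro usage stays under ~10x
    # Plus usage. Serving cost and query counts are invented numbers.
    PLUS_PRICE, PRO_PRICE = 20.0, 200.0   # $/month, public tier prices
    COST_PER_QUERY = 0.02                 # $, hypothetical serving cost

    def monthly_margin(price: float, queries: int) -> float:
        return price - queries * COST_PER_QUERY

    plus_queries = 500                    # hypothetical Plus-user average
    print(f"Plus: ${monthly_margin(PLUS_PRICE, plus_queries):.0f}/mo")
    for multiple in (5, 10, 20):
        margin = monthly_margin(PRO_PRICE, plus_queries * multiple)
        print(f"Pro at {multiple}x Plus usage: ${margin:.0f}/mo")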