Sounds like it's a distill of O1? After R1, I don't care that much about non-reasoning models anymore. They don't even seem excited about it on the livestream.

I want tiny, fast and cheap non-reasoning models I can use in APIs and I want ultra smart reasoning models that I can query a few times a day as an end user (I don't mind if it takes a few minutes while I refill a coffee).

Oh, and I want that advanced voice mode that's good enough at transcription to serve as a babelfish!

After that, I guess it's pretty much all solved until the robots start appearing in public.



It isn't even vaguely a distill of o1. The reasoning models are, from what we can tell, relatively small. This model is massive and they probably scaled the parameter count to improve factual knowledge retention.

They also mentioned developing some new techniques for training small models and then incorporating those into the larger model (probably to help scale across datacenters), so I wonder if they're doing a bit of what people think MoE is but actually isn't: pre-train a smaller model, specialize it on specific domains, then use it to generate synthetic data for training the larger model on those domains. A rough sketch of that loop is below.
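
Purely to illustrate that loop (this is a guess at the shape of it, not OpenAI's actual pipeline; the model name and prompts are made up, the Hugging Face calls are real):

    # Illustrative only: a small domain-specialist model samples synthetic
    # text that gets folded into the big model's training mix.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    SPECIALIST = "my-org/small-chemistry-model"  # hypothetical specialist model

    tok = AutoTokenizer.from_pretrained(SPECIALIST)
    model = AutoModelForCausalLM.from_pretrained(SPECIALIST)

    def synthesize(prompts, max_new_tokens=512):
        """Sample domain-focused completions to mix into the large run's corpus."""
        samples = []
        for p in prompts:
            ids = tok(p, return_tensors="pt").input_ids
            out = model.generate(ids, max_new_tokens=max_new_tokens,
                                 do_sample=True, temperature=0.8)
            samples.append(tok.decode(out[0], skip_special_tokens=True))
        return samples

    # Downstream, `samples` would be filtered/deduplicated and appended to
    # the pretraining data for the larger model.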


You can 'distill' a smaller, better model into a larger, shittier one by training on its outputs. It doesn't matter which one is bigger. This is what they said they did on the livestream.


I have distilled models before; I know how it works. They may have used o1 or o3 to create some of the synthetic data for this one, but they clearly did not try to create any self-reflective reasoning in this model whatsoever.


My impression is that it’s a massive increase in the parameter count. This is likely the spiritual successor to GPT4 and would have been called GPT5 if not for the lackluster performance. The speculation is that there simply isn’t enough data on the internet to support yet another 10x jump in parameters.

O1-mini is a distill of O1. This definitely isn’t the same thing.


Probably not a distill of o1, since o1 is a reasoning model and GPT4.5 is not. Also, OpenAI has been claiming that this is a very large model (and it's 2.5x more expensive than even OG GPT-4) so we can assume it's the biggest model they've trained so far.

They'll probably distill this one into GPT-4.5-mini or such, and have something faster and cheaper available soon.


There are plenty of distills of reasoning models now, and they said in the livestream they used training data from "smaller models" - which is probably every model ever, considering how expensive this one is.


Knowledge distillation is literally by definition teaching a smaller model from a big one, not the opposite.

Generating outputs from existing (therefore smaller) models to train the largest model of all time would simply be called "using synthetic data". These are not the same thing at all.

Also, if you were to distill a reasoning model, the goal would be to get a (smaller) reasoning model, because you're teaching your new model to mimic outputs that show a reasoning/thinking trace. E.g., that's what all of those "local" DeepSeek models are: small Llama models distilled from the big R1; a process which "taught" Llama-8B to show reasoning steps before coming up with a final answer.
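
For reference, the textbook logit-matching version of distillation looks roughly like the sketch below. It's a generic PyTorch example, not anything DeepSeek or OpenAI have published; the R1 "distills" mentioned above were reportedly plain supervised fine-tuning on sampled reasoning traces rather than logit matching.

    import torch
    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, T=2.0):
        """Hinton-style distillation: train the student to match the
        teacher's softened output distribution."""
        # Temperature T > 1 softens both distributions so the student also
        # learns the teacher's relative rankings of wrong answers.
        s_logp = F.log_softmax(student_logits / T, dim=-1)
        t_prob = F.softmax(teacher_logits / T, dim=-1)
        # KL divergence; the T^2 factor keeps gradient magnitudes stable.
        return F.kl_div(s_logp, t_prob, reduction="batchmean") * (T * T)

    # Usage: run the same batch through a frozen teacher and the student,
    # then backprop only through the student.
    # with torch.no_grad():
    #     teacher_logits = teacher(batch)
    # loss = distill_loss(student(batch), teacher_logits)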



