Sounds like it's a distill of O1? After R1, I don't care that much about non-reasoning models anymore. They don't even seem excited about it on the livestream.

I want tiny, fast and cheap non-reasoning models I can use in APIs and I want ultra smart reasoning models that I can query a few times a day as an end user (I don't mind if it takes a few minutes while I refill a coffee).

Oh, and I want that advanced voice mode that's good enough at transcription to serve as a babelfish!

After that, I guess it's pretty much all solved until the robots start appearing in public.



It isn't even vaguely a distill of o1. The reasoning models are, from what we can tell, relatively small. This model is massive and they probably scaled the parameter count to improve factual knowledge retention.

They also mentioned developing some new techniques for training small models and then incorporating those into the larger model (probably to help scale across datacenters), so I wonder if they're doing a bit of what people think MoE is but actually isn't: pre-train a smaller model, specialize it on specific domains, then use it to generate synthetic data for training the larger model on those domains. A rough sketch of that loop is below.
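
Purely to illustrate that loop (this is a guess at the shape of it, not OpenAI's actual pipeline; the model name and prompts are made up, the Hugging Face calls are real):

    # Illustrative only: a small domain-specialist model samples synthetic
    # text that gets folded into the big model's training mix.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    SPECIALIST = "my-org/small-chemistry-model"  # hypothetical specialist model

    tok = AutoTokenizer.from_pretrained(SPECIALIST)
    model = AutoModelForCausalLM.from_pretrained(SPECIALIST)

    def synthesize(prompts, max_new_tokens=512):
        """Sample domain-focused completions to mix into the large run's corpus."""
        samples = []
        for p in prompts:
            ids = tok(p, return_tensors="pt").input_ids
            out = model.generate(ids, max_new_tokens=max_new_tokens,
                                 do_sample=True, temperature=0.8)
            samples.append(tok.decode(out[0], skip_special_tokens=True))
        return samples

    # Downstream, `samples` would be filtered/deduplicated and appended to
    # the pretraining data for the larger model.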


You can 'distill' a smaller, better model into a larger, shittier one by training on its outputs. It doesn't matter which one is bigger. This is what they said they did on the livestream.


I have distilled models before; I know how it works. They may have used o1 or o3 to create some of the synthetic data for this one, but they clearly did not try to create any self-reflective reasoning in this model whatsoever.


My impression is that it’s a massive increase in the parameter count. This is likely the spiritual successor to GPT4 and would have been called GPT5 if not for the lackluster performance. The speculation is that there simply isn’t enough data on the internet to support yet another 10x jump in parameters.

O1-mini is a distill of O1. This definitely isn’t the same thing.


Probably not a distill of o1, since o1 is a reasoning model and GPT4.5 is not. Also, OpenAI has been claiming that this is a very large model (and it's 2.5x more expensive than even OG GPT-4) so we can assume it's the biggest model they've trained so far.

They'll probably distill this one into GPT-4.5-mini or such, and have something faster and cheaper available soon.


There are plenty of distills of reasoning models now, and they said in the livestream they used training data from "smaller models" - which is probably every model ever, considering how expensive this one is.


Knowledge distillation is literally by definition teaching a smaller model from a big one, not the opposite.

Generating outputs from existing (therefore smaller) models to train the largest model of all time would simply be called "using synthetic data". These are not the same thing at all.

Also, if you were to distill a reasoning model, the goal would be to get a (smaller) reasoning model, because you're teaching your new model to mimic outputs that show a reasoning/thinking trace. E.g., that's what all of those "local" DeepSeek models are: small Llama models distilled from the big R1; a process which "taught" Llama-8B to show reasoning steps before coming up with a final answer.
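
For reference, the textbook logit-matching version of distillation looks roughly like the sketch below. It's a generic PyTorch example, not anything DeepSeek or OpenAI have published; the R1 "distills" mentioned above were reportedly plain supervised fine-tuning on sampled reasoning traces rather than logit matching.

    import torch
    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, T=2.0):
        """Hinton-style distillation: train the student to match the
        teacher's softened output distribution."""
        # Temperature T > 1 softens both distributions so the student also
        # learns the teacher's relative rankings of wrong answers.
        s_logp = F.log_softmax(student_logits / T, dim=-1)
        t_prob = F.softmax(teacher_logits / T, dim=-1)
        # KL divergence; the T^2 factor keeps gradient magnitudes stable.
        return F.kl_div(s_logp, t_prob, reduction="batchmean") * (T * T)

    # Usage: run the same batch through a frozen teacher and the student,
    # then backprop only through the student.
    # with torch.no_grad():
    #     teacher_logits = teacher(batch)
    # loss = distill_loss(student(batch), teacher_logits)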



