It seems to me that the software is occasionally doing better than the supposed “ground truth” (who annotated that?), and I don't understand why the authors are blindly following the latter, and the reviewers apparently approved that.
In Figure 1 the authors complain that Gemini “misreads 'ss ety!' as 'ness ety!'”, but even a casual look at the image reveals that Gemini's reading is correct.
In Figure 11, they state that Claude is “altering the natural sequence of ideas in the ground truth”, except that the sequence in the ground truth makes no sense, while Claude's order does (only the initial “the” is misplaced).
I think the goal here was to convince the AI to actually read characters ("OCR") rather than speculate about what might be written on the paper/in the image. Hence the ground truth explicitly removes the letters and word parts that are obscured, even when they can be guessed.
TBH, I'm not sure it's a good test. I can somewhat see the argument against "BASELINE" for ground truth - the underlying text might have been BASE(IAKS), for all we know. But IMO the ground truth should have been "Direction & ess" at the very least. And, more significantly: it's an artificial scenario that we don't care about in practice. Why use that? Use invoices with IDs that sound like words but aren't. Use license plates and stuff like that. Heck, use large prints of random characters, mixed with handwritten gibberish.
For at least some of the images that they used, the expectation from a good text reader is actually to understand context and not blindly OCR. Take "Trader Joe's": we *know* that's an 's', but only from outside context; from OCR alone, it might've been an 8 - there's really no way to tell. Why accept the "s" in ground truth, but reject the full word "Coconut" (which is obviously what is written on the can, even if partially obscured)? Furthermore, a human would know what kind of products are sold by Trader Joe's, and coupling that with the visible tops of the letters "M I L", would deduce that it's Coconut Milk. So really, Claude nailed that one.
I think there are multiple possible goals we could imagine in text recognition tasks. Should the AI guess the occluded text? That could be really helpful in some instances. But if the goal is OCR, then it should only recognize characters optically, and any guessing at occluded characters is undesired.
Maybe a better goal is some representation for "COCONUT [with these 3 letters occluded]". Then the consumer might combine this with other evidence about the occluded parts, or review it if questions come up about how accurate the OCR was in this case.
This looks a lot like "compared to a bunch of people who are 10 years behind (non-transformer, vision-only models), and people who aren't trying (aren't optimizing for OCR) Google is doing real well"
EasyOCR is LSTM-CTC from 2007, RapidOCR is a ConvNet approach from 2021; both are focused on speed. Both will vastly outperform almost any transformer model - and certainly a big one - on speed and memory usage, but they aren't state of the art on accuracy. This has been well known for a decade at this point - two decades for LSTM-CTC.
Plus, I must say the GPT-4o results look a lot saner. "COCONUT" (GPT-4o) vs "CONU CNBC" (Gemini) vs Ground Truth "C CONU CNBC". And, obviously the ground truth should be "COCONUT MILK" (the word milk is almost entirely out of the picture, but is still the right answer that a human would give). The "C CONU" comes from the first O of COCONUT being somewhat obscured by a drawing of ... I don't know what the hell that is. It's still very obvious it's meant to be "COCONUT MILK", so the GPT-4o answer is still not quite perfect, but heaps better than all the others.
Now this looks very much like it might be temperature-related, and I can find nothing in the paper about changing the temperature, which is IMHO a very big gap. Temperature gives transformer models more freedom to choose more creative answers; the better performance of GPT-4o might well be the result of such a more creative choice, and it might also explain why Gemini is trying so hard to stay so very close to the ground truth. (It's still quite the accomplishment to succeed, but GPT-4o is still better.)
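For context on the mechanism: temperature rescales the model's output logits before sampling, so higher values flatten the distribution and make less-likely tokens (the more "creative" readings) more probable. A toy illustration, not the paper's setup - the logit values are made up:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to sampling probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate readings of the same region.
logits = [3.0, 1.0, 0.5]

low_t = softmax_with_temperature(logits, temperature=0.2)   # near-greedy
high_t = softmax_with_temperature(logits, temperature=2.0)  # more exploratory

# At low temperature the top candidate dominates; at high temperature
# probability mass spreads toward the alternatives.
print(low_t[0] > high_t[0])
```

This is why an unreported temperature setting makes the comparison hard to interpret: the same model can behave conservatively or creatively depending on it.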
> And, obviously the ground truth should be "COCONUT MILK" (the word milk is almost entirely out of the picture, but is still the right answer that a human would give).
Maybe? Seems application-dependent to me.
If you're OCRing checks or invoices or car license plates or tables in PDF documents, you might prefer a model that's more conservative when it comes to filling in the blanks!
And even when recognising packaged coconut products, you've also got your organic coconut oil, organic coconut milk with reduced fat, organic coconut cream, organic coconut flakes, organic coconut desiccated chips, organic coconut and strawberry bites, organic coconut milk powder, organic coconut milk block, organic coconut milk 9% fat, organic coconut yoghurt, organic coconut milk long life barista-style drink, organic coconut kefir, organic coconut banana and pear baby food pouches, organic coconut banana and pineapple smoothie, organic coconut scented body wash and so on.
>The "C CONU" comes from the first O of COCONUT being somewhat obscured by a drawing of ... I don't know what the hell that is.
It's clearly the stem from the bell pepper in front of the can. You're complaining that the software is lesser than a human, yet it appears your human needs better training in understanding context too.
Yup, definitely the human needs better context training. Then again, for an account that's only 6 months old, it's possible you're not really a human.
Edit to insert: WHAT DRAWING? There's a can of coconut milk that is turned so the word coconut is not fully visible. In front of that can is a real red bell pepper with a green stem still attached that is partially obstructed by the bowls in the foreground. What you're attempting to claim as a drawing is just a real-life object in the table-top setup. Since this is a CNBC branding image, I'm assuming it's a still frame from a video clip; if so, this view probably changes over time, with different things being obstructed/revealed by the camera's movement.
Your RLHF could really use some improvement. To be this argumentative when you're clearly wrong is quite amusing, but not in an entertaining way. It just reinforces my sentiments towards the joke the industry has become
The question is: what is OCR for? If it's to answer questions and work with a document, then VLMs do actually contain self-correcting mechanisms. That is, the end-to-end image + text input to text output is statistically grounded, by training.
So the question to ask is: what do you need OCR for? Feeding an LLM? Then feed it to the VLM instead. Some other usage? Well, to be decided.
But for now, CTC and LSTMs are done with, because VLMs do everything: finding the area to read, reading, embedding, and answering.
OCR was a mid-step, it's going away.
It's not obvious at all—it depends on the use case.
You also didn’t really counter the paper. Sure, the OCR models are old, but what should they have tested instead? Are there better open-source OCR models available that would have made for a fairer comparison?
This is what's so terrifying about uses of "AI". People's idea of accuracy being "tell me what I think is there", not "tell me what's there". The can in this image probably says "coconut milk", but the image certainly doesn't.
I think it's useful to add the context that CNBC is correct and does appear at the top right of that picture. CNBC is not a mis-transcribing of MILK, and the letters M, I, L and K are not actually visible in the picture.
So, I did some OCR research early last year, that didn't include any VLMs, on some 1960s era English scanned documents with a mix of typed and handwritten (about 80/20), and here's what I found (in terms of cosine similarity):
The overall score is a weighted average of the handwritten and typed scores. I also computed Jaccard similarity and Levenshtein distance, but the results were similar enough that I'm leaving them out for the sake of space.
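For reference, the string metrics mentioned can be sketched in a few lines. These are illustrative implementations, not the ones used in the study:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (two rolling rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over word sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

print(levenshtein("kitten", "sitting"))          # character-level edit distance
print(jaccard("coconut milk", "coconut cnbc"))   # word-overlap similarity
```

Cosine similarity would additionally need an embedding or bag-of-words vectorization step, which is why results across the three metrics can diverge on short texts.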
Overall, if you want the best: if you're an enterprise, just use whatever AWS/GCP/Azure you're on; if you're an individual, pick between those. While some of the open source solutions do quite well, surya took 188 seconds to process 88 pages on my RTX 3080, while the cloud ones took a few seconds to upload the docs and download the results. But if you do want open source, seriously consider surya, tesseract, and nougat depending on your needs. Surya is the best overall, while nougat was pretty good at handwriting. Tesseract is just blazingly fast - 121-200 seconds depending on whether you use tessdata-fast or tessdata-best - but that's CPU-based and trivially parallelizable, and on my 5950X using all the cores it took only 10 seconds to run through all 88 pages.
But really, you need to generate some of your own sample test data/examples and run them through the models to see what's best. Given frankly how little this paper tested, I really should redo my study, add VLMs, and write a small blog/paper, been meaning to for years now.
For handwritten texts, the tool that works best for me is Qwen2.5-VL-72b [0]. It is also available online [1]. I'm surprised that it is not mentioned in the article since even the previous model (Qwen2-VL-72b) was better than the other VLMs I tried for OCR on handwritten texts.
Not GP but it depends what you mean by accuracy. If you want inference like the 'coconut milk' described then obviously an LLM. If you want accurate as-written transcription, then I don't know the state of the art, but it'll be something purpose built for CV & handwriting recognition.
It'll also depend if you care about tabular data, whether a 'minor' numerical error (like 0 & 8 mismatched sometimes) is significantly worse than a 'typo' as it were in recognising a word, etc.
Accuracy should always be measured against the answer you want, which is the most useful answer for the application. That is "coconut milk", not "coconut cnbc". Maybe "cnbc" should even be included, but it definitely shouldn't replace the word "milk" in that location.
Lots of factors to rank on but generally speaking I don't find any of the open source options usable. They all take either a long time to tune or are just not accurate enough. Commercial services from one of cloud players has hit the sweet spot for me.
The paper says, "GPT-4o achieves the highest overall accuracy, while Gemini-1.5 Pro demonstrates the lowest word error rate." Saying Gemini "beats everyone" in this benchmark is misleading.
The systems they tested against the LLMs are mostly used as part of a larger system. A fairer comparison would use something like MinerU [1] and a proper benchmark like OHR-Bench [2] and Reducto's table bench [3]. This paper is really bad...
> Three state of the art VLMs - Claude-3, Gemini-1.5, and GPT-4o
Literally none of those are state of the art. Academia is completely unprepared to deal with the speed AI develops.
This is extremely common in research papers.
That's literally in the abstract. If I can see a completely wrong sentence 5 seconds into reading the paper, why should I read the rest?
What models would you recommend instead, for sophisticated OCR applications?
Honestly I thought Claude-3 and GPT-4o were some of the newest major models with vision support, and that models like o1 and deepseek were more reasoning-oriented than OCR-oriented.
My anecdotal tests and several benchmarks suggest that Qwen2-VL-72b [0] is better than the tested models (even better than Claude 3.5 Sonnet), notably for OCR applications. It has been available since October 2024.
For Google, definitely flash-2.0; It's a way better model.
GPT-4o is kinda dated now. o1 is the one I'd pick for OpenAI. It's basically their "main" model now.
I'm not that familiar with Claude for vision. I don't think Anthropic focuses on that. But the 3.5 family of models is way better; if 3.5 Sonnet supports vision, that's what I'd use.
It was literally launched February 5th, ~10 days ago. I'm no researcher, and I know "academia moves slow" is of course true too, but I don't think we can expect research papers to include things that were launched probably after they finished the reviews of said paper.
Maybe papers aren't the right approach here at all, but I don't feel like it's a fair complaint they don't include models released less than 2 weeks ago.
Honestly? I don't know how long it's been available.
But I do know it's been out for some time already. Enough to be aware of it when posting this on arXiv.
I'm not even disagreeing that it takes time to write papers, and it's "common" for this to happen.
But it's just more evidence for what I said in my original comment:
> Academia is completely unprepared to deal with the speed AI develops
Sure, but they posted this 4 days ago.
The minimum I'd expect for quality research is for them to skim the abstract before posting and change that line to:
"Models from leading AI labs" or similar. Leaving it as-is signals either sloppiness or dishonesty.
The speed of publishing is just too slow. If you want to apply any kind of scientific rigor and have your peers check what you're doing (not even doing a full peer review), things take more time than just posting on blogs and iterating.
As someone building in this space, we've found that raw OCR accuracy is just one piece (and it's becoming a commodity).
The real challenge is building reliable and accurate ETL pipelines (document ingestion from web, OCR, classification, validation, etc.) that work at scale in production.
The best products will be defined by everything "non-AI": UX, performance, and a human-in-the-loop feedback process for non-techies.
Avoiding over-reliance on specific models also helps. With good internal eval data and benchmarks, you can easily switch or fine-tune models.
That’s the point of using AI in the first place. If your product is just a polished interface on top of a prompt, then your moat isn’t that strong, and chances are your product will be commoditized soon.
By building a good UX and integrating it with other processes that require traditional collaboration, you increase the chances that replicating your secret sauce is either infeasible or too difficult for newcomers to bother.
This looks very interesting. I conducted some explorations of whether LLMs can be used to extract information from hand-written forms [0][1]. Such a system could allow users to snap pictures of forms and other legal documents, automatically extract structured information, and use this information to e.g. automatically fill out new forms or determine whether the user has the right to a government benefit.
The initial results were quite promising, as GPT-4o could reliably identify the correct place in the form for the information, and moderately reliably extract the values, even if the image was blurry or the text was sloppily written. Excited to see how Gemini 2.0 would do on this task!
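One way to harden "moderately reliably extract the values" in such a pipeline is to validate the model's structured output before acting on it, routing anything suspect to a human. A hypothetical sketch - the field names and rules are made up, not from the linked experiments:

```python
import re

# Expected shape of the fields the VLM is asked to extract from a form.
# Each entry maps a field name to a regex the extracted value must match.
FIELD_RULES = {
    "full_name": re.compile(r"[A-Za-z][A-Za-z .'-]+"),
    "date_of_birth": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "case_number": re.compile(r"[A-Z]{2}-\d{6}"),
}

def validate_extraction(fields: dict) -> dict:
    """Return a per-field status: 'ok', 'invalid', or 'missing'.

    Anything not 'ok' gets sent to a human reviewer instead of being
    auto-filled into the next form.
    """
    report = {}
    for name, rule in FIELD_RULES.items():
        value = fields.get(name)
        if value is None:
            report[name] = "missing"
        elif rule.fullmatch(value.strip()):
            report[name] = "ok"
        else:
            report[name] = "invalid"
    return report

# Simulated VLM output: a plausible name, a malformed date, a missing field.
extracted = {"full_name": "M. Smith", "date_of_birth": "13/40/1990"}
print(validate_extraction(extracted))
```

Regex checks only catch format errors, of course; semantic validation (real dates, known case numbers) would be a further layer.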
I have lots of customer files, and I've looked around at all these AI tools for something - paid, self-hosted, whatever - where I point it at a folder of xlsx and pdf files and can then query "What's the end date of M Smith's contract" or "How much does M Smith still owe". I've been very disappointed: it's either very complicated, or the tools break down on non-text-based PDFs, or...
It feels to me that if you need to provide a schema and preprocess the data and this and that, then in the end all AI provides is a way to do some SQL in natural language. Yes, that's better, but it doesn't remove the actual pain point if you're a tech user.
Then again maybe I'm wrong, didn't find the right tool or didn't understand it.
Is what I'm looking for something that actually exists (and works, not just on simple cases)?
I worked on this a bit 1-2 years ago. Back then, LLMs weren't really up to the task, but I found them OK for suggestions that a human double-checks. That brings us to the Ironies of Automation, though (human oversight of automation with a review process doesn't really work; it's a paper worth reading).
We tried several dedicated services for extracting structured data and factoids like that from documents: First Google Document AI, then a dedicated provider focusing solely on our niche. Back then, that gave the best results.
There wasn't enough budget to go deeper into this and we just reverted to doing it manually. But I think a really cool way to do this would be to make a user friendly UI where they can see suggestions and the text snippets they were extracted from as they skim through the document, with a simple way to modify and accept these. I think that'd work to scale the process quite a bit. Focusing the attention of the human at the relevant parts of the document basically.
Haven't worked on this space since then, but I'm pretty bearish on fully automated fact extraction. Getting stuff in contracts and invoices wrong is typically not acceptable. I think a solid human in the loop approach is probably still the way to go.
I'm not completely up to date but a few months ago Qwen2-VL (runnable locally) was able to perfectly read text from images. So I'd say you would still need to preprocess that folder to texts to get any reasonable speed for queries but after that if you feed the data to a LLM with long enough context it should just work. If on the other hand it's too much data and the LLM is required to use tools then it is indeed still too soon. But it is coming.
"Perform OCR on this image. Return only the text found in the image as a single continuous string without any newlines, additional text, or commentary. Separate words with single spaces. For any truncated, partially visible, or occluded text, include only the visible portions without attempting to complete or guess the full text. If no text is present, return empty double quotes."
TL;DR: For original-object truth rather than image truth, this paper shows VLMs are superior, even though the prompt shows the authors are "holding it wrong".
Yet another paper where the authors don't address what tokens are. It's like publishing Rolling pin fails at math or Calculator fails to turn dough ball into round pizza.
While I can understand where they're coming from in a desire to avoid hallucination when doing some letter for letter transcription from an image, certainly most times you reach for OCR you want the original copy, despite damage to its representation (paper tears, coffee stains, hands in front of it). Turns out token conjunction probability conjectures come in handy here!
Whether the image of an object, or the object itself, is "Ground Truth" is an exercise left to the user's goal. Almost all use cases would want what was originally written on the object, not its present occluded representation.
People say CPU benchmarks are meaningless (what does even 10-15% better mean in practice?), but LLM benchmarks are even more of a mystery. The same LLM will produce a novel output every time you give it the exact same prompt.
Really? This surprises me, because I use OpenAI Pro for $200 per month and I still fall back to my $20 per month Gemini account a lot these days. I like the new 2.0 experimental speed and how it defaults to diving into producing usable code immediately, whereas OpenAI's pro mode will spend a few minutes giving me an initial answer that beats around the bush at a much higher level.

So my workflow has evolved: I use Gemini to iterate my initial thinking and frame out requirements and first-draft code. Then, when I have about 2,000-3,000 lines for a detailed initial pro mode prompt, I send that to OpenAI pro mode, and that's where it shines. But I really like starting with the Gemini 2.0 model first.

The main thing I dislike about Gemini is that I often need to tell it “please continue” when it reaches its output limit. But it nearly always picks up right where it left off and continues its output. This is critical in using Gemini.
It's not surprising that google has such a huge mote with their highly illegal and unethical activity of scanning and digitizing billions of pages of copyrighted work to train their models. Oh wait, google books search was fair use. I got it confused with LLMs.
> It's not surprising that google has such a huge mote with their highly illegal and unethical activity of scanning and digitizing billions of pages of copyrighted work to train their models.
Excellent Freudian slip (proverb allusion suggesting Google has a blind spot, while discussing OCR).