Tesseract.js: Pure JavaScript OCR for 100 Languages (projectnaptha.com)
364 points by petercooper on Dec 20, 2019 | 77 comments


In case it's not clear, Tesseract has been developed by Google since 2006, having been started at HP in 1985 and open-sourced by HP in 2005. [1]

As far as I know, it powers all OCR at Google (e.g. in Keep, Docs, etc.).

This (Tesseract.js) is a WASM port of the project by a separate group of people.

I investigated using this port a couple of years ago, but as you can see from the demo, it's fairly slow to initialize and run, so I never found a practical case for running OCR client-side rather than server-side. I still think it's tremendously cool, though.

In case anyone's interested (shameless plug): because I do a lot of academic research that involves tons of copying from webpages, PDFs, and screenshots and pasting into notes documents, I created a tool at https://pastemagic.com that selectively removes rich-text formatting, strips line breaks, and runs OCR on screenshots and camera photos. Setting up Tesseract on my server and creating a simple HTTP endpoint for it took less than an hour, and for free I had OCR as powerful as Google's. Pretty cool, I thought.

[1] https://github.com/tesseract-ocr/tesseract


Disclaimer: I was the original author of Tesseract.js, though all the hard work nowadays is done by Jerome Wu. If you're interested in supporting the project, consider backing the OpenCollective (https://opencollective.com/tesseractjs).

All of which is to say: I've picked up a decent amount of fun OCR trivia over the past few years.

Firstly, the engine that powers Google Cloud Vision is almost certainly an entirely independent codebase from Tesseract, built on neural networks. In fact, the most recent major version of Tesseract (4.0) rewrote the core of the engine around bidirectional LSTMs, bringing it closer to the modern OCR pipelines that systems like GCV use.

The original Tesseract algorithm dates back to a previous AI spring: the 1980s, when neural networks were cool (before they were uncool, and then subsequently cool again). The core of the original algorithm involved fitting polygons to character shapes in order to generate features that could be matched by a kind of rudimentary neural network.

One of the primary authors of Tesseract is Ray Smith (at Google), who gave a presentation a few years ago about the history of OCR, though I can't quite find a link to it at the moment.

OCR actually predates electronic computers. In 1929, someone invented a machine that would shine a bright light on a single printed letter and pass it through a carousel of letter masks in front of an (effectively single-pixel) photosensor. When a mask came into alignment with the printed letter, the drop in brightness registered that that particular letter had been seen!

OCR was used by the US Postal Service for sorting mail as early as 1965, but it wasn't until 1976 that any system could reasonably support more than a handful of hard-coded fonts (fun fact: this omni-font system was invented by Ray Kurzweil, the "Singularity Is Near" guy).


First of all thank you for all your hard work.

Major question:

Why isn't Tesseract using neural networks? I know it just introduced LSTM-based models, but they suck.

Why is the GCP Vision text recognition API so much better than open-source alternatives?!


There are open-source versions of everything done within a GCP API call, but building a model as fast and accurate as GCP's takes multiple machines and lots of data, and cloud computing is relatively new compared to OCR.


There are? Can you give a list of pointers or what to look for?

I was looking for an OCR that can do license plates while the car is moving, for a hobby project. The image quality is less than perfect, the lighting is never very good, and since the camera is mounted on my side window, all plates have a perspective transformation applied (e.g., the topline and baseline are essentially never parallel).

Tesseract fails miserably. Trying to help it along, I haven't found a good open-source project that will consistently binarize color pictures to black-and-white; sometimes there's a shadow on the plates that foils all the simple approaches.
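
For reference, the standard trick for uneven lighting is adaptive thresholding: compare each pixel against its local neighborhood instead of one global cutoff. A sketch with opencv.js (the canvas ids, block size of 31, and constant of 10 are illustrative):

  // Sketch: adaptive thresholding, so a shadow over half the plate
  // doesn't flip that whole region to black.
  const src = cv.imread('plateCanvas');   // <canvas> holding the plate crop
  cv.cvtColor(src, src, cv.COLOR_RGBA2GRAY, 0);
  const dst = new cv.Mat();
  cv.adaptiveThreshold(src, dst, 255, cv.ADAPTIVE_THRESH_GAUSSIAN_C,
                       cv.THRESH_BINARY, 31, 10);
  cv.imshow('outputCanvas', dst);         // binarized result for the OCR step
  src.delete(); dst.delete();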

And yet, GCV needs no parameters and seems to do this perfectly on the images I've tried.

So, assuming I'm willing to put in the time: how do I build my own GCV, even if it's just for the hobby use case of reading license plates (and, as the next stage, house numbers, which GCV does reasonably well even though it's a much, much harder problem)?


I had some good luck with https://github.com/sergiomsilva/alpr-unconstrained/blob/mast... as long as the images were high enough resolution. You might want to check it out; it comes with trained models.


Thanks!


Training the model would be computationally intensive, but deploying it with TensorFlow.js and predicting a single data point in the browser shouldn't be, right?


There are ML models so computationally intensive that they can't reasonably run on the edge. AI accelerator chips obviously help move the line, but accelerators benefit the cloud, too. Furthermore, models can be tens to hundreds of megabytes in size: okay for the cloud, not okay for WASM running in the browser.


Also, AFAIK GCV uses techniques beyond better OCR that greatly help accuracy: image correction, boundary detection, NLP, spellcheck, etc.


> As far as I know, it powers all OCR at Google (e.g. in Keep, Docs, etc.).

Tesseract is acceptable only if the text is neatly laid out in more-or-less straight parallel lines, or at the very least in a consistent orientation close to horizontal.

Google Cloud Vision, however, can read any orientation and any font, through perspective distortion, and doesn't need the different text blobs in the image to be consistent in any way. It's superior in every way to plain Tesseract (and if it is Tesseract after preprocessing, the magic is in that preprocessing more than in Tesseract).

I would actually be very surprised to hear GCV uses Tesseract; and if they don't, why would they use something inferior for other products?


> As far as I know, it powers all OCR at Google (e.g. in Keep, Docs, etc.).

AFAIK Google no longer uses Tesseract for any of its products. Google's Cloud OCR is much better than Tesseract.

I think Google devs still work on Tesseract, but only as a side project (not sure about this, obviously).


IME Google's OCR is much more accurate than Tesseract. I doubt they still use it.


Oh, very interesting. I'd verified the output was identical a couple of years ago, and that Keep and Docs in production were using the 4.0 beta release at the time. But if Cloud OCR is better, it makes sense they would have switched since then.

Tesseract 4.0 has a brand-new neural engine that totally supersedes the earlier engine, however -- I wonder if there's any relation between that and Cloud OCR?


“Cloud OCR” is an interface. Something is still doing the OCR behind the scenes (and that may indeed not be Tesseract).


It's probably a detection neural net (such as Faster R-CNN) for putting bounding boxes around words, complicated by the fact that it can predict polygons in any orientation, followed by an LSTM-CRF layer for text transcription. It's a good generalist OCR but often has sub-par results for specific types of input; it tends to miss single letters surrounded by whitespace.


> As far as I know, it powers all OCR at Google (e.g. in Keep, Docs, etc.).

Tesseract is the shittiest OCR and Google doesn't use it internally. Their cloud OCR offering is much more performant.


I've been working on an editor on top of ProseMirror to support saving web content as rich text with predefined schemas. Given your academic research in this area of web OCR, what's the current literature (or tooling) on capturing web content using both the HTML and visual cues from the rendered page? For example, both <figure><img><figcaption> and <div><img><p> visually look like captioned images but are represented differently in HTML. Is there a way to parse both into a simple [figure, [img], [figcaption]] schema?
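
For concreteness, the kind of naive structural heuristic I'm imagining (the function name and schema shape here are illustrative):

  // Sketch: treat any element whose children are an <img> followed by a short
  // text block as a captioned figure, regardless of the tags actually used.
  function asFigure(el) {
    const kids = Array.from(el.children);
    if (kids.length !== 2 || kids[0].tagName !== 'IMG') return null;
    const caption = kids[1].textContent.trim();
    if (!caption || caption.length > 200) return null;  // too long to be a caption
    return ['figure', ['img', { src: kids[0].src }], ['figcaption', caption]];
  }

  // <figure><img><figcaption> and <div><img><p> both normalize the same way:
  document.querySelectorAll('figure, div').forEach(el => {
    const node = asFigure(el);
    if (node) console.log(node);
  });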


Can you share more information about how you created an HTTP endpoint? Or code? Glancing at the docs I only saw a command line or C bindings.


Oh sure, on my webserver I just wrote the POSTed image to a temp file, called the command-line utility from within my code, and captured the stdout to return. The command-line utility initializes very quickly, so the performance was fine.

The only semi-tricky bits were parsing stderr if anything went wrong (to distinguish warnings from actual errors), and the fact that Tesseract doesn't respect the JPEG orientation flag (a big problem with iPhone camera photos), so I check for that and manually rotate the JPEG first if necessary (you get gibberish otherwise).
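
Roughly, the flow looks like this; a sketch in Node with Express and multer (the framework choice is illustrative, the "tesseract <image> stdout" CLI usage is the real one):

  // Sketch: OCR endpoint that shells out to the tesseract CLI.
  const express = require('express');
  const multer = require('multer');
  const { execFile } = require('child_process');

  const app = express();
  const upload = multer({ dest: '/tmp' });  // writes the POSTed image to a temp file

  app.post('/ocr', upload.single('image'), (req, res) => {
    // "tesseract <image> stdout" prints the recognized text to stdout
    execFile('tesseract', [req.file.path, 'stdout'], (err, stdout, stderr) => {
      // stderr needs care: Tesseract prints warnings there even on success
      if (err) return res.status(500).send(stderr);
      res.type('text/plain').send(stdout);
    });
  });

  app.listen(3000);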


Do you have it somewhere on github? I could see hosting this on my own server for personal use, maybe trying to contribute back as well.


pastemagic is actually a really cool way to read Wikipedia articles. The link density of the usual Wikipedia article is very high and I find it very distracting when some significant portion of the text is in a different color. Combined with the nice typesetting in the output, it makes for a very pleasant reading experience.


You might also like Wikiwand - https://www.wikiwand.com


Tesseract simply cannot power the GCP OCR API because Tesseract sucks super bad and the GCP API is mint.


Added your tool to my Firefox new tab. If others say that Google OCR is "way better", why don't you implement that behind your endpoint?


I'll have to look into it! But right now it's a free tool so I can't afford to set up paid OCR :)


Somewhat off-topic: do you know of a library that would allow me to select an area of a PDF through a GUI and only read the text within those coordinates?


The Tesseract CLI (and I'm sure the library as well) will give you hOCR output: an HTML format containing the text along with bounding boxes around paragraphs, words, and individual characters.

https://github.com/tesseract-ocr/tesseract/wiki/Command-Line...

It's not quite what you want, but I think you could filter the output to the selected region and pretty quickly get what you're after.
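
For example, something along these lines (a sketch; the .ocrx_word class and the bbox title format are real hOCR conventions, the selection shape is up to you):

  // Sketch: keep only hOCR words whose bounding boxes fall inside a selection.
  function textInRegion(hocrHtml, sel) {  // sel = {x0, y0, x1, y1} in image pixels
    const doc = new DOMParser().parseFromString(hocrHtml, 'text/html');
    const words = [];
    for (const span of doc.querySelectorAll('.ocrx_word')) {
      // hOCR titles look like: "bbox 36 92 96 116; x_wconf 96"
      const m = /bbox (\d+) (\d+) (\d+) (\d+)/.exec(span.getAttribute('title') || '');
      if (!m) continue;
      const [x0, y0, x1, y1] = m.slice(1).map(Number);
      if (x0 >= sel.x0 && y0 >= sel.y0 && x1 <= sel.x1 && y1 <= sel.y1) {
        words.push(span.textContent.trim());
      }
    }
    return words.join(' ');
  }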


Try tabula[0]

It is open source and runs on Java. You can also extract the areas of interest in the PDF and run it via the command line[1]. You can get more details, if required, on my blog[2].

[0]https://tabula.technology/

[1]https://github.com/tabulapdf/tabula-java/wiki/Using-the-comm...

[2]https://narayanansiyer.com/Tabula/tabula/


I think the Project Naptha extension by the folks that wrote this library will do that, no? https://projectnaptha.com/

Not sure if it only reads at those coordinates vs. OCRing the whole thing (for example if you were legally prohibited from OCRing content outside a certain coordinate space), but it is selectable.


You could simply pipe an area screenshot to tesseract, discard the input image and get the tesseract output, am I wrong?


That sounds like a valid approach; any idea what tools I could use to define the area and get the screenshot?


You possibly have one installed. Mine comes with my desktop (Xfce), and gives me a GUI and a CLI to take screenshots of the full desktop, any window, or a particular area defined by crosshairs.

There's a very popular, minimalist CLI called scrot that I think would be ideal... well, scratch that; I did a search and this question has already been asked and answered:

https://askubuntu.com/questions/280475/how-can-instantaneous...

https://stackoverflow.com/questions/21497447/ocr-on-a-screen...


If I remember correctly, I did it with the ImageMagick "import" command. I found I had to add a wide white border, as Tesseract got confused near the edges of the image (this was over 10 years ago though).


I'm not sure if there's a non-GUI interface for it, but zathura does this for PDFs.


Commercial or open source? PDFTron can do it, but they’re not an open source project.


I prefer something I can install locally (it doesn't need to be open source). I'm trying to extract text from a PDF at a certain position; the PDF is indeed text, not an image, so OCR isn't strictly needed.

The goal is to draw a box using a GUI, then use those coordinates to extract text from several homogeneous pages.

I also have a different goal of trying to interpret structure of a PDF that has visual structure (headers, sections and subsections all numbered). But that seems to lend itself to some sort of text parsing.


> I also have a different goal of trying to interpret structure of a PDF that has visual structure (headers, sections and subsections all numbered). But that seems to lend itself to some sort of text parsing.

Some reading here: https://stackoverflow.com/questions/53219016/detecting-secti...


PDFTron provides an SDK and isn't really meant as a plug-and-play end-user application. But it can accomplish what you're looking for.

Here's how to extract text from a PDF based on coordinates (this explains how to do it on the web, but it's also possible on other platforms):

https://groups.google.com/d/msg/pdfnet-webviewer/h2W3VksbQUI...

Here's how to extract a PDF's logical structure:

https://www.pdftron.com/documentation/samples/#logicalstruct...


PDF.js and filtering the output. Or Par.sr, with the right input module configuration.


Curious, does it use Deep Learning techniques, and for what tasks?


> Tesseract is developed by Google since 2006, having been started at HP in 1985 and open-sourced by HP in 2005.

Fun fact: it actually started as a US national defence initiative in the last big AI hype bubble in the 1980s.

While it isn't in the wiki, probably for intellectual property reasons, I'm almost positive Cuneiform originated as the Soviet version of the same thing (I have evidence in notebooks somewhere): OCR was something needed for "AI" of the 1980s in the Soviet system as well.

Had Cuneiform been further developed by Microsoft ... or Apple, it would, in contrast to Tesseract, have continued to be the official OCR of an opposing world-historical system.


I recently did a project where I OCR'd a very rare book, which I could only find in the Library of Congress, so I could read it on my Kindle.

Tesseract was amazingly powerful and accurate, but it seemed to struggle if the page was warped or tilted even a little. I had to preprocess the images heavily to dewarp the natural spine curvature, and even then it only got about 99% accuracy (which sounds like a lot, but consider a book where every 100th letter is wrong; I basically flagged the errors on my Kindle as I went along and manually corrected them later).

I guess the point of this comment is that, in my experience, Tesseract.js is probably going to need an accompanying PageDewarp.js to be of use for scanning books. Not everyone has access to a right-angle scanner or can slice off the spine to get perfectly straight high-res scans.


That's very interesting, given that Tesseract uses Leptonica. I'm not sure if they use it for dewarping, but all of my little projects with Leptonica worked really well: dewarping, binarizing, extracting individual elements, etc.

https://github.com/DanBloomberg/leptonica


Maybe I wasn't using Tesseract to its fullest potential, but I had a really hard time getting it to do accurate OCR on warped pages; straight pages worked perfectly.


Also, the last time I checked, Tesseract liked input at 200-300 dpi. (You don't have to scan at that resolution, but it helps if you scale the image up.)
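
Something like this is usually enough for the upscaling (a sketch with the sharp library; the 2x factor is illustrative, aim for whatever lands you in that dpi range):

  // Sketch: double a low-dpi scan's pixel dimensions before handing it to Tesseract.
  const sharp = require('sharp');
  (async () => {
    const { width } = await sharp('scan.png').metadata();
    await sharp('scan.png').resize({ width: width * 2 }).toFile('scan-2x.png');
  })();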

Years ago, I dug up a couple of papers on the spine thing, but never got around to implementing it. I think you can estimate the curvature based on the shadow and dewarp.

I was just scanning some recipes to save typing, so it wasn't really worth the effort.


I ended up using this guy's tool[1] for dewarping, which worked pretty well. The tool was more a prototype than anything, but it was enough to finish my project.

[1] https://mzucker.github.io/2016/08/15/page-dewarping.html


I'm curious what the book was and the context for why you were reading it.


Why didn't you combine it with a dictionary and some ML to correct letters based on context?


I tested this by taking a screenshot of the introduction blurb. This is what it came up with:

  Tesseract s is 2 pure Javascript port of the popular Tesseract OCR engine.

  This library supports more than 100 languages, automatic text orientation and script detection, a simple interface for reading paragraph, word, and character bounding boxes. Tesseract s can run either in 2 browser znd on & server with NodeJS.

Not bad, but far from useful.


The reason I posted this is that they've just released v2.0. This isn't highlighted on the homepage, but I assume it has some significance to the project overall.


Wow, I was wondering why there were so many new GitHub stars yesterday, and now I've found the reason. :)

Thanks for being interested in tesseract.js; it makes all the work worthwhile. And I have to thank @antimatter15 for creating this library; without him we couldn't have come this far.

I have read all the comments, and here are my two cents on some of the questions:

1. Is tesseract.js pure JavaScript?

Yes, it is 100% JavaScript, and it leverages a WebAssembly port of the original tesseract-ocr (meaning we compile the C source code to WebAssembly, powered by Emscripten).

2. The accuracy of tesseract.js is poor.

In my experience, it is hard to get perfect results without applying additional techniques to your source images. You may need to do some preprocessing, and sometimes train custom traineddata. It is not easy, but it is the price of high accuracy.

3. Cloud OCR services are much more accurate

Yes, that's true. But tesseract.js provides an in-browser, offline option for OCR, which is useful for scenarios like PWAs and highly confidential image content (which you don't want to send to a server). Tesseract.js is not a silver bullet, but it is handy sometimes.
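
For anyone who wants to try the offline option, basic usage of the v2 worker API looks roughly like this ('page.png' is a placeholder for your own image):

  const { createWorker } = require('tesseract.js');

  (async () => {
    const worker = createWorker();
    await worker.load();               // load the WASM core
    await worker.loadLanguage('eng');  // fetch the traineddata
    await worker.initialize('eng');
    const { data: { text } } = await worker.recognize('page.png');
    console.log(text);
    await worker.terminate();
  })();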

Hope you enjoy this library, and feel free to leave us any comments!


Ugh why does open source OCR still suck so bad?!

Why isn’t there an open source OCR engine even half as powerful as the Google Cloud Platform API?!


I investigated tesseract.js for turning images of spreadsheets into data. I didn't mind the initial startup time or run time, but unfortunately I wasn't able to get good enough accuracy for my case. It seemed to work really well with plain English text, though.




I'm curious about the performance tradeoffs of JavaScript versus native code.

A few weeks ago, I tried writing something long-running, CPU-intensive, etc., in a web worker. It was so darn slow that I switched to a native language. (I hope I didn't do something silly that made my code run more slowly than it should.)

I see some mention of running in WASM. Does this do something like compile ordinary Tesseract to WebAssembly and then fall back to JavaScript?


A while back I ported several C++ libraries with Emscripten, and the result is usually just 2-10x slower than native. It gets maybe another order of magnitude worse if you're porting a library that relies heavily on vectorization, which isn't available on the web.


What's slightly weird is that the Chinese "example" text mis-reads a character in the first line. The image shows:

冬 日 平 泉 路 晚 归

But the OCR reports:

冬 日 平 柳 路 晚 归

(Note the different character right in the middle)


I tried using Tesseract around a year ago to recognize digits only, in very clean images, with no weird fonts. I had thousands of images, and it failed on around 3% of them. It was so weird, as it would recognize the same digits in other images just fine.

I tried 4 or 5 different OCR programs, and none of them worked well enough for my case.

I was actually surprised, I thought OCR was a solved problem with ridiculously low error rates.


In my experience Tesseract can get confused near the edges of images, and padding with a wide white border helped a lot. Strange that none of the OCR programs were good enough, though.


Even though it says it supports 100 languages, I cannot find the list of supported languages. I am mostly trying to find out if it supports Indic languages.


https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#d...

Seems so; it lists at least Hindi, Urdu, Bengali, Sanskrit, Nepali, Marathi, Sinhala, and Punjabi!


Nice project! Tried it with a screenshot from an eBook. Unfortunately the Os became Qs and the Is became |s.


You never realize how similar characters are until you start an OCR project.

ec

ij

tf

Il|1

hb

OQ

etc.

Even the tiniest addition or subtraction of printed ink can transform one character in one of the above rows into any other character in the row. Throw in page tilt/warp/etc., and OCR can frequently confuse them unless you train it specifically on your text. The pipeline I've found that works best is:

image -> upscale -> dewarp -> OCR -> spellcheck -> grammar check


I too have been very disappointed in Tesseract for "simple" OCR (converting subtitles).

In communications, turbo codes[1], for example, have the decoder produce an integer value for each bit, rather than just a bit. The value is a measure of how likely the bit is to be 0 or 1.

This is then combined with previous bit values, which include parity data, to make a "hard" decision.

I wonder if something similar has been tried for OCR? I imagine the OCR front end could feed a number of probable hits, along with confidences, into a spellchecker, or something along those lines (see the sketch below).

[1]: https://en.wikipedia.org/wiki/Turbo_code
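
Tesseract does expose a step in this direction: the result carries a confidence score per word (and the engine can emit alternative choices), which you could route into a spellchecker instead of trusting the hard decision. A sketch with tesseract.js, where the 75 cutoff and suggestCorrection are placeholders:

  const { createWorker } = require('tesseract.js');

  (async () => {
    const worker = createWorker();
    await worker.load();
    await worker.loadLanguage('eng');
    await worker.initialize('eng');
    const { data } = await worker.recognize('subtitle-frame.png');
    for (const word of data.words) {
      if (word.confidence < 75) {
        // "Soft" decision: too uncertain, so defer to a dictionary/spellchecker.
        console.log(word.text, '=>', suggestCorrection(word.text)); // placeholder
      }
    }
    await worker.terminate();
  })();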


Does it work only on books and magazines, or would it work on a driver's license or ID card as well?


Before OCR'ing, it converts the image to black-and-white using a brightness threshold. Keeping the background evenly lit is particularly important, because otherwise a shadowed area can easily fall below the threshold and come out all black.

A license or ID will almost certainly have medium-contrast elements in the background that will show up as dark. But if you can adjust the contrast/brightness appropriately in advance, you could probably get it to work.
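
A sketch of that kind of pre-binarization with the sharp library (the 128 cutoff is illustrative; normalise() stretches the contrast first so shadowed regions are less likely to fall below it):

  // Sketch: grayscale, stretch contrast, then hard-threshold to black and white.
  const sharp = require('sharp');
  sharp('id-card.jpg')
    .greyscale()
    .normalise()      // spread the histogram before applying the cutoff
    .threshold(128)   // pixels >= 128 become white, the rest black
    .toFile('id-card-bw.png')
    .then(() => console.log('wrote id-card-bw.png'));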


Tesseract is optimized for images with white backgrounds. ID cards or movie screenshots do not work well.


I have used Tesseract OCR combined with ImageMagick and ffmpeg to great success for video text extraction.


Can you share your script/pipeline? I haven't had much success (though I only ran ffmpeg's built-in Tesseract OCR[0], with no ImageMagick or other processing in between).

[0] https://ffmpeg.org/ffmpeg-all.html#ocr


There is another post about Tesseract, but via Python. Same question here, though I guess this one is closer... is it compatible with TensorFlow.js, or is it OK to run them alongside each other? Or can the model be "simplified" to run client-side, like MobileNet?


This is just amazing. I wonder how much work and how many brilliant people it took to develop this. Congratulations to everyone involved. I'm sure a lot of cool stuff will come from this tool.


Ooh, exciting! The main reason I've needed to fall back on the GCP Vision API is the orientation limitations of local OCR.

I'll test and migrate to this soon depending on accuracy. Great job so far.


> Tesseract.js wraps an emscripten port of the Tesseract OCR Engine.

So it is a wrapper library around a C++ project, which is cool. But calling it "pure" JavaScript is purely misleading.


I remember reading that Tesseract is older tech that does worse than the current ML-based stuff. But since no one gives out their training data, it's all you get.


Your knowledge is out of date. Tesseract 4 added a new OCR engine based on LSTM neural networks.



