Linux, OCR and PDF – problem solved
Tuesday, January 19th, 2010 | Author: Konrad Voelkel
Imagine you've scanned some book into a PDF file on Linux, such that every pdf-page contains two book-pages and there is a lot of additional white-space and maybe the page orientation is wrong. And, worst of all, there is no full-text search, thus no full-text indexing for desktop search engines. I consider this problem solved on Linux!
[UPDATE 2013-03-29] Three years later, I wrote up a better solution to OCR scans on Linux. The explanations and comments from this thread might still be helpful.
In general, the first thing you'll have to do is to convert the PDF (or, basically, any file format you've scanned to, like TIFF or PNG or DjVu ...) into the hOCR format, which is the extracted text in an XHTML file with layout annotations. Then you can apply a program that converts the hOCR file back into a searchable PDF.
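Schematically, an hOCR fragment looks something like this (a made-up example; the exact classes and bounding-box values depend on the OCR engine):

<div class='ocr_page' title='bbox 0 0 2500 2000'>
  <span class='ocr_line' title='bbox 120 80 2350 130'>
    <span class='ocrx_word' title='bbox 120 80 310 130'>Imagine</span>
    <span class='ocrx_word' title='bbox 330 80 560 130'>some</span>
    <span class='ocrx_word' title='bbox 580 80 790 130'>text</span>
  </span>
</div>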
There are several approaches to solve this problem (though you can download my solution directly, if you wish). To extract the text from a scan, you have to use OCR software such as gocr, ocrad, tesseract or cuneiform. I have achieved the best results with tesseract and the worst with gocr; however, the most convenient way to produce hOCR files was Cuneiform. Cuneiform is Russian software, once among the best proprietary OCR packages in the world, and it has now been ported to Linux.
In the future (maybe in two years), the OCRopus project will have a nice UI; then that may be another good way to do OCR on Linux.
To convert a hOCR file into a searchable, indexable PDF, I only know of hocr2pdf from the ExactImage package.
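The core of that step is just two commands per page, roughly like this (file names and the language code are only examples):

cuneiform -l eng -f hocr -o page.hocr page.bmp
hocr2pdf -i page.bmp -s -o page.pdf < page.hocr

The first call OCRs one page image into hOCR, the second wraps the image plus the recognized text layer into a single-page PDF.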
The best results are obtained when each pdf-page is cropped and split in two, so that the files processed by the OCR program are PNG files that contain exactly one book-page and nothing else (graphics are OK). To do this, you need some batch processing.
To get fast results without much work, I wrote a shell script that calls pdf-to-image converters, OCR software and hocr2pdf in the right sequence with the right command-line options. The shell script is neither perfect nor beautiful, but maybe you can use it as a model for your own script to suit your needs.
What the script does:
- splits one pdf into many (one pdf-file per pdf-page) via pdftk
- converts each pdf-page into a monochrome image with 300dpi via ImageMagick & ghostscript
- converts each pdf-page into two images for each book-page (after rotating & cropping the pdf-page appropriately) via ImageMagick
- OCRs each book-page via Cuneiform
- converts each book-page into PDF format via ExactImage
- merges all book-pages into one PDF file via pdfjam (& LaTeX)
- writes metadata (optionally) via pdftk
So the dependencies are convert (ImageMagick), ghostscript, pdftk, pdfjam, hocr2pdf (ExactImage) and cuneiform. In Ubuntu, you can run
sudo apt-get install imagemagick ghostscript pdftk pdfjam exactimage
to get most of the dependencies. Cuneiform, however, must be installed by hand (grab the .tar.bz2 file from its Launchpad page and read the installation instructions in the readme.txt; you may also need sudo apt-get install cmake).
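If you have never built anything with cmake before: the Cuneiform installation boils down to roughly the following (the version number is just an example; check the readme.txt for the details):

tar xjf cuneiform-linux-1.1.0.tar.bz2
cd cuneiform-linux-1.1.0
mkdir builddir && cd builddir
cmake ..
make
sudo make install
sudo ldconfig    # see the libpuma.so note below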
The dependency on ImageMagick could be dropped because ExactImage provides the same tools (and ExactImage is even faster).
There are three minor issues to discuss:
- After installing Cuneiform, it will complain about a missing library "libpuma.so" even if it's there. Solution:
sudo ldconfig
- If you're looking for the program hocr2pdf, it's in the debian package "exactimage".
- You will get many warning messages because of malformed PDFs. This is not really a problem and will be fixed in future versions of pdftk.
You can either download the script to make PDFs searchable and indexable under Linux here, or copy & paste the code below:
#!/bin/bash
echo "usage: pdfocr.sh document.pdf orientation split left top right bottom lang author title"
# where orientation is one of 0,1,2,3, meaning the amount of rotation by 90°
# and split is either 0 (already single-paged) or 1 (2 book-pages per pdf-page)
# and (left top right bottom) are the coordinates to crop (after rotation!)
# and lang is a language as in "cuneiform -l".
# and author,title are used for the PDF metadata
# all values relative to a resolution of 300dpi
#
# usage examples:
# ./pdfocr.sh SomeFile.pdf 0 0 0 0 2500 2000 ger SomeAuthor SomeTitle
# will process a PDF with one page per pdf-page, cropping to width 2500 and height 2000
pdftk "$1" burst dont_ask
for f in pg_*.pdf
do
echo "pre-processing $f ..."
convert -quiet -rotate $[90*$2] -monochrome -normalize -density 300 "$f" "$f.png"
convert -quiet -crop $6x$7+$4+$5 "$f.png" "$f.png"
if [ "1" = "$3" ];
then
convert -quiet -crop $[$6/2]x$7+0+0 "$f.png" "$f.1.png"
convert -quiet -crop 0x$7+$[$6/2]+0 "$f.png" "$f.2.png"
rm -f "$f.png"
else
echo no splitting
fi
rm -f "$f"
done
for f in pg_*.png
do
echo "processing $f ..."
convert "$f" "$f.bmp"
cuneiform -l $8 -f hocr -o "$f.hocr" "$f.bmp"
convert -blur 0.4 "$f" "$f.bmp"
hocr2pdf -i "$f.bmp" -s -o "$f.pdf" < "$f.hocr" rm -f "$f" "$f.bmp" "$f.hocr" done echo "InfoKey: Author" > in.info
echo "InfoValue: $9" >> in.info
echo "InfoKey: Title" >> in.info
echo "InfoValue: $10" >> in.info
echo "InfoKey: Creator" >> in.info
echo "InfoValue: PDF OCR scan script" >> in.info
pdfjoin --fitpaper --tidy --outfile "$1.ocr1.pdf" "pg_*.png.pdf"
rm -f pg_*.png.pdf
pdftk "$1.ocr1.pdf" update_info doc_data.txt output "$1.ocr2.pdf"
pdftk "$1.ocr2.pdf" update_info in.info output "$1-ocr.pdf"
rm -f "$1.ocr1.pdf" "$1.ocr2.pdf" doc_data.txt in.info
rm -rf pg_*_files
And if you think something in this code is wrong, bad or ugly, you can write me an email with corrections.
Happy scanning & searching in PDFs!
2010-01-27 (27. January 2010)
Very nice and works very well. Would you also know of an approach where the result would be placed into a Word/OpenOffice document?
Having one script that could do either or both would be interesting, so that the "original" pdf could be used, or portions of the document that were NOT recognized could be changed.
2010-01-27 (27. January 2010)
Have you tried skipping the hocr2pdf step and looking instead at the .hocr files with a web browser? These are actually HTML files, so you could try to use the HTML-import feature of your Office application.
But I haven't tried this yet and I guess the result won't be very usable. For the purpose of editing text, I would use a simpler approach, not using hOCR but directly converting the pdf-files to pure text files with Tesseract. Scripts to do this can be found elsewhere.
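A minimal sketch of that simpler approach, assuming Tesseract is installed (file names and the language code are only examples):

convert -density 300 -compress none scan.pdf page_%03d.tif   # render each pdf-page to an uncompressed TIFF
for f in page_*.tif; do
  tesseract "$f" "${f%.tif}" -l eng                          # writes page_NNN.txt
done
cat page_*.txt > scan.txt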
2010-10-01 (1. October 2010)
I have spent the last few hours doing some heavy scanning and OCRing and I found your page. First of all, thanks a lot - I didn't use your script, but I semi-manually performed the same steps, and it was a big help.
Secondly, I wanted to make an observation. I might have side-stepped this problem if I'd used your script, but I am not sure, and having finally figured it out I had to share this somewhere. :-) If your pages are monochrome, it's important that they are in a 1bpp file format when submitted to hocr2pdf - it's not smart enough (and I can't really blame it) to notice this by itself. I suspect your script works fine - I hit this because my inputs were PDFs produced by gscan2pdf, so I burst them into individual files for cleanup and OCR using pdftoppm, which produced 24bpp output. As a result my PDF file ballooned from 15MB before bursting-and-OCRing to 114MB when reassembled. Converting the image files to PBM before giving them to hocr2pdf reduced my final PDF size to 14MB. (I assume I saved some extra space as a result of cleaning up noise on the bitmaps after bursting.)
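For reference, the conversion that made the difference for me, as a rough sketch (resolution and threshold are just starting points):

pdftoppm -mono -r 300 input.pdf pg          # bursts straight to 1bpp PBM pages
# or, to fix an existing 24bpp image before feeding it to hocr2pdf:
convert page.ppm -threshold 50% page.pbm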
2010-12-23 (23. December 2010)
Thanks for an excellent and detailed solution to this problem.
The next problem is a command-line search facility. I have a few hundred OCR-ed PDF files that I would like to search with regex strings, and have some program return a list of filename and page # combinations where the search string was found.
Does such a program exist?
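If nothing ready-made turns up, something along these lines would probably do it for PDFs that already have a text layer (an untested sketch around pdftotext, checking one page at a time):

pattern="$1"; shift
for f in "$@"; do
  pages=$(pdfinfo "$f" | awk '/^Pages:/ {print $2}')
  for ((p=1; p<=pages; p++)); do
    pdftotext -q -f "$p" -l "$p" "$f" - | grep -E -q "$pattern" && echo "$f: page $p"
  done
done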
2011-01-07 (7. January 2011)
This is fantastic, thank you!
I've been waiting for years to be able to do something like this without making the files immense (there is a program on the Mac that I have used, but it bloats the files 5-10x and isn't very accurate).
I adapted this script to use Tesseract 3.00 for English text on Ubuntu 10.04.1: http://wwww.ubuntuforums.org/showpost.php?p=10327088&postcount=5
The accuracy is pretty good, too, at least in English.
2011-02-20 (20. February 2011)
Hi there,
I have a general problem with the ocr-ing step.
I have a perfectly readable page here that stubbornly resists any attempt at being OCRed on either Linux or Mac OS X, using your proposal (cuneiform), but also tesseract and ocroscript.
I tried increasing the resolution up to 1200 in the convert-to-PNG step, but to no avail.
The OCR step either fails completely (Mac OS X: segmentation fault in cuneiform) or only produces gibberish.
Adobe Acrobat Professional OCRs this file perfectly.
Maybe you have an idea how to tweak some of your parameters to get this thing OCRed? You can find the file here:
files.me.com/bjrnfrdnnd2/3sg35f
2011-02-20 (20. February 2011)
I looked at your file.
With 1200 dpi, a cuneiform bug prevents OCRing, because the file is too large. See https://bugs.launchpad.net/cuneiform-linux/+bug/349110
Your problem seems to be related to font rendering issues. Using the -monochrome switch in convert produces an image with bad fonts which are unlikely to be OCRed properly. I think it has nothing to do with the rendering resolution (however, I guess 600dpi is fine for your data).
Using GIMP, I converted the image to a high-contrast monochrome image which worked well in cuneiform (at 300 dpi). The OCR quality was not perfect, but tuning the parameters you'll get much better results.
2011-02-20 (20. February 2011)
Thanks for your message.
I also tried 300, 600 and 1200 dpi using your script, but none worked. Your solution seems logical (using high contrast), but using GIMP means leaving the world of batch processing (and you wouldn't want to start GIMP for every one of the 100 pages you want to scan, would you?).
So while gimp certainly is a workaround for one page, do you have an idea which of the parameters in the convert step in your script to tune in order to get cuneiform to work on the file?
2011-02-20 (20. February 2011)
Gimp can be used for batch-processing and convert has lots of options which might help as well. Just take a look at their manpages ("man convert") or google for more documentation.
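For example, something along these lines keeps everything on the command line; the threshold value is only a starting point and needs tuning per scan:

convert -density 300 page.pdf -colorspace Gray -normalize -threshold 60% page.png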
2011-02-20 (20. February 2011)
I actually already tried that: edge detection, thresholding and sharpening with convert. I never succeeded in producing anything that cuneiform was able to OCR. I was even unable to find any suitable GIMP operation that would produce an image cuneiform could OCR.
So if you still remember what your successful gimp operation was, please tell me.
2011-06-09 (9. June 2011)
I get
" pdfjam ERROR: pg_*.png.pdf not found"
Any idea?
2011-06-09 (9. June 2011)
@Drew
it seems the hocr2pdf step failed. Can you check manually which step fails? Maybe I should add that you should execute this script in a shell, in the folder where your file is.
Anyway, I have no idea what's wrong - try debugging ;-)
2011-06-15 (15. June 2011)
@Drew
If you edit the shell script at line 54, removing the quotation marks around "pg_*.png.pdf", that is, changing "pg_*.png.pdf" to pg_*.png.pdf, the filename expansion should work properly. At least it worked for me; I had the same problem. But after that, I'm getting the following error on the pdfjoin call:
" pdfjam: Effective call for this run of pdfjam:
/usr/bin/pdfjam --fitpaper 'true' --rotateoversize 'true' --suffix joined --fitpaper '--no-tidy' --outfile 0010775_Manual_Eletronico_H61H2_M2_RevC.pdf.ocr1.pdf -- pg_0001.pdf.png.pdf - pg_0002.pdf.png.pdf - pg_0003.pdf.png.pdf - pg_0004.pdf.png.pdf - pg_0005.pdf.png.pdf - pg_0006.pdf.png.pdf - pg_0007.pdf.png.pdf - pg_0008.pdf.png.pdf - pg_0009.pdf.png.pdf - pg_0010.pdf.png.pdf - pg_0011.pdf.png.pdf - pg_0012.pdf.png.pdf - pg_0013.pdf.png.pdf - pg_0014.pdf.png.pdf - pg_0015.pdf.png.pdf - pg_0016.pdf.png.pdf - pg_0017.pdf.png.pdf - pg_0018.pdf.png.pdf - pg_0019.pdf.png.pdf - pg_0020.pdf.png.pdf - pg_0021.pdf.png.pdf - pg_0022.pdf.png.pdf - pg_0023.pdf.png.pdf - pg_0024.pdf.png.pdf - pg_0025.pdf.png.pdf - pg_0026.pdf.png.pdf - pg_0027.pdf.png.pdf - pg_0028.pdf.png.pdf -
pdfjam: Calling pdflatex...
pdfjam: FAILED.
The call to 'pdflatex' resulted in an error.
If '--no-tidy' was used, you can examine the
log file at
/var/tmp/pdfjam-sVj7OI/a.log
to try to diagnose the problem.
pdfjam ERROR: Output file not written
"
2011-06-15 (15. June 2011)
So, what does the logfile tell?
2011-06-15 (15. June 2011)
Beats me. There is no /var/tmp/pdfjam-sVj7OI/a.log file, unfortunately.
2011-06-18 (18. June 2011)
I had the same missing log file problem. I substituted the pdfjoin line with
pdfjam --outfile $1.ocr1.pdf --a4paper pg_*.png.pdf
which made the script finish successfully.
In spite of this I unfortunately get no satisfying results. The resulting PDF is virtually unsearchable. When running pdftotext on it, most of the text is missing and the recognized text is of poor quality.
But this is not due to bad original scans! If I directly do the OCR with cuneiform on JPGs, the plain text output is superb. My problem is the conversion from recognized text to a searchable PDF.
Any ideas to that?
2011-06-21 (21. June 2011)
First, here is my own version of the script. In it, right, left, bottom and top are how many pixels to crop from the image, and middle is how much to remove from the middle when splitting.
http://pastebin.com/pEx1zmCn
But I have some problems with hocr2pdf. I have checked the hocr files and they contain almost all the text. But after hocr2pdf has run, the resulting pdf contains garbage. The lines are cropped and some text is too large.
http://peecee.dk/upload/view/313549
2011-06-22 (22. June 2011)
Solved my problem. Apparently hocr2pdf works better with tiff.
So I edited my earlier version of the script. In a small benchmark with four two-sided pages it was 3.5x faster. Furthermore, it "handles" errors where cuneiform fails to read any text and crashes, by just using the page from the original pdf.
http://pastebin.com/6ag39WnW
2011-07-12 (12. July 2011)
New version, more use of ghostscript.
http://pastebin.com/FKz6LRs7
2011-09-12 (12. September 2011)
I am looking for a program which adjusts (rotates, etc.) scanned text pages automatically, i.e. analyses the scan for the best rotation angle and the best trapezoidal correction of each scanned page. Such a solution should exist, as Google's scans of old books are always corrected very well for these problems.
2011-09-28 (28. September 2011)
I'm hopeful that this feature will be built into OCRopus directly and better:
http://code.google.com/p/ocropus/issues/detail?id=146&q=searchable%20pdf#makechanges
Adobe Acrobat (and our printers at work) can take a PDF scan and 'hide' OCR text behind the words, so that it looks like the scan, but lets you search or select text from the OCR layer. This is what I'm really wanting.
I haven't tried it yet, but my reading of your script is that it will only show the OCR layer in the resulting PDF, rather than the scan with the OCR hidden behind. Is that correct?
2011-09-28 (28. September 2011)
The script does the same, i.e. "hiding" the OCRed text behind an image of the scanned page. I hope OCRopus will supersede something like my script soon :-)
2012-02-08 (8. February 2012)
Hope this is of use to someone - I make extensive use of creating PDFs from scanned, OCRed images, and though I would rather use open source, I simply haven't found that any of the options perform well enough. There are, however, two solutions I'd recommend that work under Linux (with Wine or Crossover), even if they're not open source. The first is PDF XChange Viewer from Tracker (portable and installable versions), which works well under Wine. It does occasionally crash if you move too quickly through a large PDF for a few hundred pages by pressing page down, but otherwise seems rock solid. Its latest version has OCR (hidden but aligned text layer under the image) built in, and remains free (beer). The other is very much not free (beer) but it is really, really effective: ReadIris (I have the corporate edition), which, although they don't support it for Linux in any way, runs perfectly well under Crossover (I've OCRed tens of thousands of pages with no crashes) and produces very compact PDFs.
I keep watching the open source options - when they get close I'll be thrilled to switch, but they're not there yet for me.
One other extremely useful tool, potentially, is Infix PDF Editor (also works in Wine) - its killer feature for me is that it will happily edit even PDFs created with Adobe's Clearscan function, something Acrobat itself can't do, so it's very handy for correcting OCR output.
2012-02-08 (8. February 2012)
P.S. whatever OCR software you're using, results can be much improved if you have the time to run the original page images through ScanTailor (open source/multi platform) - it's voodoo magic freakily good software! (P.P.S. I have no connection with any of the above, it's just the workflow I've arrived at through several years of experimentation)
2012-02-15 (15. February 2012)
Hello,
when I use your script, I obtain this:
pdfjam: This is pdfjam version 2.05.
pdfjam: Reading any site-wide or user-specific defaults...
(none found)
pdfjam ERROR: pg_*.png.pdf not found
Error: Failed to open PDF file:
2010_01_00.pdf.ocr1.pdf
Errors encountered. No output created.
Done. Input errors, so no output created.
Error: Failed to open PDF file:
2010_01_00.pdf.ocr2.pdf
Errors encountered. No output created.
Done. Input errors, so no output created.
Why?
Thanks.
2012-04-22 (22. April 2012)
I need to extract text from pdf files.
Isn't there anything recognizable as text in the pdf encoding?
The browser (any) recognizes the text portions of the pdf screen display. How does it do that?
I just think that an OCR approach is not very reliable.
I need to do the extractions in an automated fashion - i.e. via a php script with no manual intervention.
2012-04-24 (24. April 2012)
If I load a pdf file with a browser (like Firefox) and do a Ctrl-A, Ctrl-C and a Ctrl-V into a Notepad or WordPad screen, I end up with the document converted to readable text. Isn't there some way to automate that process?
I tried, and I just don't understand all this stuff about loading and executing OCR software programmatically.
2012-04-25 (25. April 2012)
I have looked through this and I'm afraid it's too complex and involves several elements I'm not familiar with. How can I find someone who can help me for a fee?
I just need a function callable from php to extract text from simple pdf files.
2012-04-25 (25. April 2012)
Dear Ron Z, if you can extract text from a PDF with your browser as you described, then the text is already "in the PDF", so no OCR is necessary. There are tools to get the text out of a PDF in this case, like "pdftotext" or similar. I don't know whether something like that is available in PHP, but surely it is for Unix/Linux. For real OCR, maybe there is some pay-for webservice that can help. For example, the new Google Drive (previously Google Docs) can do OCR and they seem to have open APIs. Good luck!
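On the command line this is essentially a one-liner (pdftotext ships with poppler-utils on most distributions), which a PHP script could call through shell_exec or similar:

pdftotext -layout document.pdf document.txt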
2012-08-20 (20. August 2012)
Sounds good but so many tools to use!
For the last 3 steps, you can use a PDF software called PDF Studio to import image files as a PDF and add document metadata. It would be nice if PDF Studio had some OCR integration with Cuneiform.
2012-11-13 (13. November 2012)
I made a pdf-to-text script with OCR, based on what the article taught me, in case someone wants one ready-made:
https://github.com/cirosantilli/bash/blob/master/bin/pdfocr2txt.sh
2012-11-13 (13. November 2012)
pdf2txt with OCR based on your article, in case someone wants one ready-made:
https://github.com/cirosantilli/bash/blob/2c7cfd1fb77e8fafab66c229067d30994d16b3f9/pdfocr2txt.sh
2013-07-30 (30. July 2013)
Thank you very much for this great job!
I'd like to point out that Cuneiform is now also available from the Ubuntu repos. So,
$ sudo apt-get install cuneiform
... should work.
Regards!
2013-11-13 (13. November 2013)
Very Nice!
How useful is it for mathematics PDFs (with a lot of math formulas)?
Thanks
2013-11-14 (14. November 2013)
Formulas don't parse correctly (most of the time). This is a highly non-trivial problem, much harder than plain Latin-character OCR.
2014-02-09 (9. February 2014)
Hi Konrad,
First of all, I owe you a huge thanks. Your detailed posts about OCR on Linux saved me some precious time.
As your script wasn't working on my machine, due to a bug in pdfjam when it calls pdflatex, I decided to make my own simplified version. You can find it at: https://gist.github.com/dllud/8892741
Main improvements/differences:
- Dropped the use of pdfjam. Instead I use pdftk to merge PDFs (see the one-liner after this list), which besides being bug-free is faster and does not recompress the PDFs (as nate suggested).
- Added support for doing OCR with tesseract besides cuneiform. As you explain on "Linux, OCR and PDF: Scan to PDF/A", tesseract gives the best results (also true for me).
- Removed the option for cropping the PDF pages. Besides being confusing when one first approaches the script (it took me some time to check the size of my PDF pages in pixels), I found little use for it. Most unOCRed PDFs I get need no cropping. Now the division of pages in half, when split is set, is done using relative (%) sizes.
- Removed the option to rotate the pages. Again, found little use for it.
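The merge step then reduces to a single pdftk call, roughly like this (the output name is just an example):

pdftk pg_*.png.pdf cat output merged.pdf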
I hope it is useful for someone.
Regards!
2014-08-31 (31. August 2014)
I had to add a "+matte -compress none" here:
...
echo "processing $f ..."
convert +matte -compress none "$f" "$f.bmp"
Because I got this error:
X.pdf.png.bmp is a compressed BMP. Only uncompressed BMP files are supported
2016-01-06 (6. January 2016)
Now that tesseract can directly export pdf files, some of this script can easily be refactored out. I hope you update the article with this information.
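For example, with a recent Tesseract the image-to-searchable-PDF step becomes a single call (language code and file names are just examples):

tesseract page.png page -l eng pdf    # writes page.pdf with a hidden text layer over the image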
2016-01-24 (24. January 2016)
I don't have time right now ... but why don't you try to do this? I'll be happy to link to your blog then.