Linux, OCR and PDF – problem solved – Konrad Voelkel's Blog

mathematics, life, science, software, philosophy, juggling and nonsense


Linux, OCR and PDF – problem solved

19 January, 2010 in the category english, photos, web by Konrad.

Imagine you’ve scanned some book into a PDF file on Linux, such that every pdf-page contains two book-pages and there is a lot of additional whitespace and maybe the page orientation is wrong. And, worst of all, there is no full-text search, thus no full-text indexing for desktop search engines. I consider this problem solved on Linux!

In general, the first thing you’ll have to do is to convert the PDF (or basically any file format you’ve scanned to, like TIFF, PNG or DjVu) into the hOCR format, which is the extracted text in an XHTML file with layout annotations. Then you can apply a program to convert the hOCR file into a searchable PDF again.
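
For a single page image, the two steps can be sketched roughly like this. The cuneiform and hocr2pdf options are the ones used in the script further down; the helper name ocr_page and the DRY_RUN switch are my own assumptions for illustration, not part of either tool.

```shell
#!/bin/bash
# Sketch: OCR one page image to hOCR, then turn image + hOCR into a PDF.
# ocr_page is a hypothetical helper. With DRY_RUN=1 it only prints the
# commands it would run, so the sequence can be inspected without the tools.
ocr_page() {
    local image="$1" lang="$2"
    local hocr="${image%.*}.hocr"   # page.bmp -> page.hocr
    local pdf="${image%.*}.pdf"     # page.bmp -> page.pdf
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "cuneiform -l $lang -f hocr -o $hocr $image"
        echo "hocr2pdf -i $image -s -o $pdf < $hocr"
        return 0
    fi
    # Step 1: recognise the text and write it with layout annotations (hOCR).
    cuneiform -l "$lang" -f hocr -o "$hocr" "$image" || return 1
    # Step 2: combine the page image and the recognised text into a PDF.
    hocr2pdf -i "$image" -s -o "$pdf" < "$hocr"
}
```

Calling `ocr_page page.bmp ger` would then leave page.hocr and page.pdf next to the image, assuming both programs are installed.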

There are several approaches to solve this problem (however, you can download my solution directly, if you wish). To extract the text from a scan, you have to use OCR software such as gocr, ocrad, tesseract or cuneiform. I achieved the best results with tesseract and the worst with gocr, but the most convenient way to produce hOCR files was Cuneiform. Cuneiform is a Russian program that was once one of the best proprietary OCR packages in the world and has since been ported to Linux.
In the future (maybe in two years), the OCRopus project will have a nice UI, and then it may be another good way to do OCR on Linux.

To convert a hOCR file into a searchable, indexable PDF, I only know of hocr2pdf from the ExactImage package.

I got the best results when each pdf-page was cropped and split in two, so that the files processed by the OCR program are PNG files that contain exactly one book-page and nothing else (graphics are OK). To do this, you need some batch-processing.
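
The split itself comes down to computing two crop rectangles. Here is a minimal sketch, assuming the page has already been rotated and cropped to a known width and height in pixels. The function name split_geometries is hypothetical; the output uses ImageMagick's WIDTHxHEIGHT+X+Y geometry notation.

```shell
#!/bin/bash
# split_geometries WIDTH HEIGHT
# Prints two crop geometries, one per line: the left half, then the right half.
split_geometries() {
    local width="$1" height="$2"
    local half=$(( width / 2 ))
    # Left book-page: starts at x=0 and is half the width.
    echo "${half}x${height}+0+0"
    # Right book-page: starts at x=half and takes the remaining width.
    echo "$(( width - half ))x${height}+${half}+0"
}

split_geometries 2500 2000
# prints:
# 1250x2000+0+0
# 1250x2000+1250+0
```

Each line can then be handed to something like `convert -crop 1250x2000+0+0 page.png left.png`.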


To get fast results without much work, I wrote a shell-script that calls pdf-to-image converters, OCR software and hocr2pdf in the right sequence with the right command-line options. The shell-script is neither perfect nor beautiful, but maybe you can use it as a model for your own shell-script to suit your needs.

What the script does:

  • splits one pdf into many (one pdf-file per pdf-page) via pdftk
  • converts each pdf-page into a monochrome image with 300dpi via ImageMagick & ghostscript
  • converts each pdf-page into two images for each book-page (after rotating & cropping the pdf-page appropriately) via ImageMagick
  • OCRs each book-page via Cuneiform
  • converts each book-page into PDF format via ExactImage
  • merges all book-pages into one PDF file via pdfjam (& LaTeX)
  • writes metadata (optionally) via pdftk

So the dependencies are convert (ImageMagick), ghostscript, pdftk, pdfjam, hocr2pdf (ExactImage) and cuneiform. In Ubuntu, you can run sudo apt-get install imagemagick ghostscript pdftk pdfjam exactimage to get most of the dependencies. Cuneiform, however, must be installed by hand (grab the .tar.bz2 file from their launchpad website and read the readme.txt installation instructions; maybe run sudo apt-get install cmake).
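
Before starting a long run, it can help to check that all of these programs are actually on the PATH. A small sketch follows; missing_tools is a hypothetical helper, and the command names in the usage line (convert, gs, pdftk, pdfjoin, hocr2pdf, cuneiform) are the binaries I assume the packages above provide.

```shell
#!/bin/bash
# missing_tools NAME...
# Prints every NAME that cannot be found as a command, one per line.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage: warn about anything the OCR script would need but cannot find.
missing="$(missing_tools convert gs pdftk pdfjoin hocr2pdf cuneiform)"
if [ -n "$missing" ]; then
    echo "missing tools:" $missing >&2
fi
```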

The dependency on ImageMagick could be dropped because ExactImage provides the same tools (and ExactImage is faster).

There are three minor issues to discuss:

You can either download the script to make PDFs searchable and indexable under Linux here or copy and paste the code below:
#!/bin/bash
echo "usage: pdfocr.sh document.pdf orientation split left top right bottom lang author title"
# where orientation is one of 0,1,2,3, meaning the amount of rotation by 90°
# and split is either 0 (already single-paged) or 1 (2 book-pages per pdf-page)
# and (left top right bottom) are the coordinates to crop (after rotation!)
# and lang is a language as in "cuneiform -l".
# and author,title are used for the PDF metadata
# all values relative to a resolution of 300dpi
#
# usage examples:
# ./pdfocr.sh SomeFile.pdf 0 0 0 0 2500 2000 ger SomeAuthor SomeTitle
# will process a PDF with one page per pdf-page, cropping to width 2500 and height 2000
pdftk "$1" burst dont_ask
for f in pg_*.pdf
do
echo "pre-processing $f ..."
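# rotate by a multiple of 90°, rasterize at 300dpi as a monochrome image
convert -quiet -rotate $((90*$2)) -monochrome -normalize -density 300 "$f" "$f.png"
# crop to the given box (width x height + left + top)
convert -quiet -crop "${6}x${7}+${4}+${5}" "$f.png" "$f.png"
if [ "1" = "$3" ];
then
# split into a left and a right half, one image per book-page
convert -quiet -crop "$(($6/2))x${7}+0+0" "$f.png" "$f.1.png"
convert -quiet -crop "0x${7}+$(($6/2))+0" "$f.png" "$f.2.png"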
rm -f "$f.png"
else
echo no splitting
fi
rm -f "$f"
done
for f in pg_*.png
do
echo "processing $f ..."
convert "$f" "$f.bmp"
cuneiform -l $8 -f hocr -o "$f.hocr" "$f.bmp"
convert -blur 0.4 "$f" "$f.bmp"
hocr2pdf -i "$f.bmp" -s -o "$f.pdf" < "$f.hocr"
rm -f "$f" "$f.bmp" "$f.hocr"
done
echo "InfoKey: Author" > in.info
echo "InfoValue: $9" >> in.info
echo "InfoKey: Title" >> in.info
echo "InfoValue: ${10}" >> in.info
echo "InfoKey: Creator" >> in.info
echo "InfoValue: PDF OCR scan script" >> in.info
pdfjoin --fitpaper --tidy --outfile "$1.ocr1.pdf" pg_*.png.pdf
rm -f pg_*.png.pdf
pdftk "$1.ocr1.pdf" update_info doc_data.txt output "$1.ocr2.pdf"
pdftk "$1.ocr2.pdf" update_info in.info output "$1-ocr.pdf"
rm -f "$1.ocr1.pdf" "$1.ocr2.pdf" doc_data.txt in.info
rm -rf pg_*_files
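
A note on the title argument: it is the tenth positional parameter, and in bash `$10` is read as `$1` followed by a literal `0`, so parameters beyond the ninth need braces. A minimal illustration (show_args is a hypothetical helper, not part of the script):

```shell
#!/bin/bash
# Demonstrates the difference between $10 and ${10}.
show_args() {
    echo "without braces: $10"    # expands $1, then appends "0"
    echo "with braces: ${10}"     # expands the tenth argument
}

show_args a b c d e f g h i j
# prints:
# without braces: a0
# with braces: j
```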

And if you think something with this code is wrong, not good or ugly, you can write me an email with corrections.

Happy scanning & searching in PDFs!

Comments

Pingback by » Managing the paper’s metadata (Konrad Voelkel's Blog)
Time: 2010-01-25 at 19:09

[...] The general solution to get full-text search and indexable documents (with Linux) is to look at another article on this blog: Linux, OCR and PDF – problem solved [...]

Comment by Mark Johnson
Time: 2010-01-27 at 07:09

Very nice and works very well. Would you also know of an approach that the result would be placed into a Word/OpenOffice Document?
Having one Script that could do either or both would be interesting so that the “original” pdf could be used or to change portions of the document that were NOT recognized.

Comment by Konrad
Time: 2010-01-27 at 07:13

Have you tried skipping the “hocr2pdf” step and looking at the .hocr files with a web browser instead? These are actually HTML files, so you could try to use the HTML-import feature of your Office application.

But I haven’t tried this yet and I guess the result won’t be very usable. For the purpose of editing text, I would use a simpler approach, not using hOCR but directly converting the pdf-files to pure text files with Tesseract. Scripts to do this can be found elsewhere.
