I bet creating searchable PDFs has been done many times over; even so, I’d like to share the way I did it recently with strictly open source tools. The pipeline is simple: Ghostscript (gs) splits the PDF into one image per page, Tesseract OCR extracts the text from each image, hocr2pdf combines each image with its text into a searchable PDF page, and Ghostscript bundles the pages back into a single PDF. If you’re creating a PDF from scanned books, this project may also be of help: unpaper
Edit 5/21/2014: I’ve had good experience with Scantailor, which is available on Homebrew for the Mac. I’ve also submitted hocr2pdf to Homebrew as part of the exact-image library (the formula is named “exact-image”).
A script
Please excuse the Bash, but DOS or other types of scripts should work similarly.
#!/bin/sh
# bash tut: http://linuxconfig.org/bash-scripting-tutorial
# Linux PDF,OCR: http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/

# resolve the input to an absolute path, because we cd into a work directory below
case "$1" in
  /*) y="$1" ;;
  *)  y="$(pwd)/$1" ;;
esac
echo "Will create a searchable PDF for $y"
x=$(basename "$y")
name=${x%.*}
mkdir "$name"
cd "$name" || exit 1

# splitting to individual pages
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -o out_%04d.jpg -f "$y"

# process each page
for f in *.jpg; do
  # extract text
  # (newer Tesseract releases expect --psm and may write .hocr instead of .html)
  tesseract -l eng -psm 3 "$f" "${f%.*}" hocr
  # remove the "<?xml" line, it disturbed hocr2pdf
  grep -v "<?xml" "${f%.*}.html" > "${f%.*}.noxml"
  rm "${f%.*}.html"
  # create a searchable page
  hocr2pdf -i "$f" -s -o "${f%.*}.pdf" < "${f%.*}.noxml"
  rm "${f%.*}.noxml"
  rm "$f"
done

# combine all pages back to a single file
# from http://www.ehow.com/how_6874571_merge-pdf-files-ghostscript.html
gs -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -q -sDEVICE=pdfwrite -sOutputFile="../${name}_searchable.pdf" *.pdf
cd ..
rm -rf "$name"
Usage is quite simple:
./make_searchable.sh my_non_searchable.pdf
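To process a whole folder of scans, the same call can be wrapped in a loop. The following is only a sketch, not part of the original script: the function name batch_searchable and the folder name scans are made up for illustration, and it assumes make_searchable.sh sits in the current directory and writes its result there.

```shell
#!/bin/sh
# Sketch: run make_searchable.sh on every PDF in a folder.
# batch_searchable is a hypothetical helper name, not part of the script above.
#
# usage: batch_searchable DIR [SCRIPT]
#   DIR     folder containing the non-searchable PDFs
#   SCRIPT  converter to call per file (assumed default: ./make_searchable.sh)
batch_searchable() {
    dir=$1
    script=${2:-./make_searchable.sh}
    for pdf in "$dir"/*.pdf; do
        # an empty folder leaves the pattern unexpanded; skip it
        [ -e "$pdf" ] || continue
        # skip files produced by an earlier run
        case $pdf in
            *_searchable.pdf) continue ;;
        esac
        "$script" "$pdf"
    done
}
```

After defining the function (for example by pasting it into a script of your own), calling batch_searchable scans converts every PDF in the scans folder, one after the other. The *_searchable.pdf results end up in the directory you run it from.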
6 replies on “Creating a searchable PDF with opensource tools ghostscript, hocr2pdf and tesseract-ocr”
Hello, is it possible to select a folder containing many “my_non_searchable.pdf” files, so that every PDF becomes searchable at the end?
Document -> Scanner -> Dropbox -> Raspberry -> NAS folder (non_searchable) -> Raspberry 2 OCR -> NAS folder (searchable).
Hey Roy,
Thank you so much! With this script I was able to help a friend of mine who studies the history of psychology!
Hi, I tried this several times with different variants of parameters but I always get results where the hidden text does not match position with the image.
Hello Roy, I tried that after changing the language to por (Portuguese), but it doesn’t work. It takes a long time and creates the new PDF, but there is no “text” inside it.
The same happens even if I process it in English. Have you had an issue like that?
Tks!
@Lucas
You should make sure Tesseract has the Portuguese language files from here