Recently I had to extract text from PDF files so that the content could be indexed with Apache Lucene. Apache PDFBox was the obvious choice of Java library.
Apache PDFBox is an open-source Java library for working with PDF files. It supports creating new PDF documents, manipulating existing documents, and extracting content from documents. PDFBox also includes several command-line utilities.
No recent build of PDFBox is available. SourceForge has only very old binaries, and that old version fails on files that use the PDF 1.5 specification, so one needs to compile the latest code from SVN.
I am sharing the latest jar file built from SVN here.
The following example shows how to extract the text from a PDF file using PDFBox.
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class PDFTextParser {

        // Extract text from a PDF document; returns null if the file cannot be read
        static String pdftoText(String fileName) {
            PDFParser parser;
            String parsedText = null;
            PDFTextStripper pdfStripper = null;
            PDDocument pdDoc = null;
            COSDocument cosDoc = null;
            File file = new File(fileName);
            if (!file.isFile()) {
                System.err.println("File " + fileName + " does not exist.");
                return null;
            }
            try {
                parser = new PDFParser(new FileInputStream(file));
            } catch (IOException e) {
                System.err.println("Unable to open PDF Parser. " + e.getMessage());
                return null;
            }
            try {
                parser.parse();
                cosDoc = parser.getDocument();
                pdfStripper = new PDFTextStripper();
                pdDoc = new PDDocument(cosDoc);
                // Only pages 1 to 5 are extracted
                pdfStripper.setStartPage(1);
                pdfStripper.setEndPage(5);
                parsedText = pdfStripper.getText(pdDoc);
            } catch (Exception e) {
                System.err.println("An exception occurred in parsing the PDF Document. "
                        + e.getMessage());
            } finally {
                // Release the underlying document resources
                try {
                    if (cosDoc != null)
                        cosDoc.close();
                    if (pdDoc != null)
                        pdDoc.close();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
            return parsedText;
        }

        public static void main(String args[]) {
            System.out.println(pdftoText("/home/santhosh/pdfbox/test.pdf"));
        }
    }
More details on the APIs can be read from here
Hi there,
Super post, Need to mark it on Digg
Elcoj
I used the above program and am getting the error below. I would appreciate any help.
java.lang.NoClassDefFoundError: org/apache/fontbox/cmap/CMapParser
Hi, I’ve got the same problem. Have you solved it in any way?
Thanks
The error is caused by missing libs. As it tells you,
org/apache/fontbox/cmap/CMapParser is missing.
You can use findjar to find a lib that includes the class:
http://www.findjar.com/jar/org.fontbox/jars/fontbox-0.1.0.jar.html?all=true
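Before hunting for jars, it can help to confirm which classes the JVM can actually see. The sketch below is only an illustration: `ClasspathCheck` and `isClassAvailable` are hypothetical names, not part of PDFBox or FontBox, and the two class names are simply the ones that appear in the errors reported in this thread.

```java
// Hypothetical diagnostic helper; not part of PDFBox or FontBox.
final class ClasspathCheck {

    // Returns true if the named class can be located on the current classpath.
    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class names taken from the NoClassDefFoundError messages in the comments
        String[] required = {
            "org.apache.pdfbox.pdmodel.PDDocument",
            "org.apache.fontbox.cmap.CMapParser"
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isClassAvailable(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same classpath you use for the extraction program; any class reported as MISSING points to a jar that still needs to be added.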
Hi, can you please explain to me, as you would to a baby, how to compile and run this source?
I’ve downloaded pdfbox-0.8.0-incubating.ja but I don’t know how to use it together with “javac”.
Thanks
Hello,
I tried to run the above code using the pdfbox jar and fontbox jar mentioned in earlier replies, but the code still blows up with the exception below. Has anyone faced this issue before? If so, could you tell me how you fixed it?
org.apache.pdfbox.exceptions.WrappedIOException
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:125)
    at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:120)
    at PDFTextParser1.pdftoText(PDFTextParser1.java:33)
    at PDFTextParser1.main(PDFTextParser1.java:56)
Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph cannot be cast to org.apache.pdfbox.util.operator.OperatorProcessor
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:119)
    ... 3 more
Thanks in advance,
You are most likely mixing 0.8.0 and 0.7.3 versions in the classpath.
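One way to check for mixed versions is to print where each class is loaded from. This is a rough sketch using only the standard `ProtectionDomain` and `CodeSource` APIs; `JarLocator` and `locationOf` are made-up names for illustration, and the default class names are the two that appear in the stack trace above.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of PDFBox.
final class JarLocator {

    // Returns the jar or directory a class was loaded from, or a note for JDK classes.
    static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap / JDK class)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        // Defaults come from the stack trace in this thread; pass other names as arguments
        String[] names = args.length > 0 ? args : new String[] {
            "org.apache.pdfbox.util.PDFTextStripper",
            "org.pdfbox.util.operator.ShowTextGlyph"
        };
        for (String name : names) {
            try {
                Class<?> cls = Class.forName(name, false, JarLocator.class.getClassLoader());
                System.out.println(name + " <- " + locationOf(cls));
            } catch (ClassNotFoundException | LinkageError e) {
                System.out.println(name + " <- not on classpath");
            }
        }
    }
}
```

If both the `org.pdfbox` and the `org.apache.pdfbox` class resolve to a jar, two PDFBox generations are on the classpath together, and removing the older jar is the likely fix.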
How can you extract bold text from a PDF if there is one?
There is a small defect in your code. You have to replace
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
WITH
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;
I agree with Hrishi. But changing the import path from org.apache.pdfbox.* to org.pdfbox.* will not help, because all the class files inside the jar are packaged under org.apache.pdfbox.* only.
Moreover, the class PDFont imports from the fontbox jar as org.apache.fontbox.*, while fontbox-0.1.0.jar defines the package as org.fontbox.*.
So it is really difficult to use these two jars, which have different packages.
Use fontbox-0.8.0-incubating.jar instead of fontbox-0.1.0.jar. You also need one more jar, common-logging-4.0.6.jar.
I am using the 1.1.0 versions of everything (i.e. pdfbox, fontbox, jempbox, commons logging) from here: http://pdfbox.apache.org/download.html. It is working without any problems.
Sometimes I see this warning:
Jun 14, 2010 6:35:55 PM org.apache.pdfbox.pdmodel.font.PDFontFactory createFont
WARNING: Failed to create Type1C font. Falling back to Type1 font
java.lang.NullPointerException
but extraction is successful
Works, but the title and footer of the PDF document are extracted incompletely. Do you know why?
Hi all,
I have successfully read the PDF, but my problem is that I want to display that PDF form in an HTML page.
My requirements are:
parsing PDF forms (to parse textbox/textarea/checkbox/radio button/numerical data and print to screen)
printing the result into an HTML page
Please help me.
I am using pdfbox-app-1.2.1.jar, jempbox-1.2.1.jar, fontbox-1.2.1.jar and the above code. Working fine. Good work, thanks for the code.
Hi nithya,
Could you please briefly explain to me how it works? I am new to the Java environment and to this tool. I have downloaded pdfbox-1.2.1, JDK 1.6 and Apache Maven. Please help me with how to use it.
I am not getting the text as is. The header part is displayed last and the table content at the top. How do I get the text as is from the PDF into a text file?
Hi,
I’m trying the same (the above code and jars).
I’m getting the error in the title. How do I resolve the issue?
10-07 12:43:42.456: java.lang.NoClassDefFoundError: org.apache.pdfbox.pdmodel.PDDocument
I’m using JDK 1.5 with Eclipse, and pdf 1.2 to check.
Hi,
The program works perfectly, but I am facing issues when extracting from PDFs that have text in Hindi, Telugu and Malayalam fonts. The output is a garbled set of characters. Please help.
Great work. Thank you. This is exactly what I was searching for.
Can we read the PDF line by line? Kindly suggest a method.
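One possible approach, sketched here with plain Java only, is to take the string returned by `pdftoText` and split it on line breaks. `TextLines` and `nonBlankLines` are hypothetical names, and the sample string in `main` is a stand-in for real extracted text. Note that the line breaks are whatever the extractor emitted, which may not match the visual lines in the PDF.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for walking extracted text one line at a time.
final class TextLines {

    // Splits already-extracted text into lines, skipping blank ones.
    static List<String> nonBlankLines(String text) {
        List<String> lines = new ArrayList<String>();
        if (text == null) {
            return lines; // pdftoText returns null when extraction fails
        }
        // Accept both Unix (\n) and Windows (\r\n) line endings
        for (String line : text.split("\\r?\\n")) {
            if (!line.trim().isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for: String text = PDFTextParser.pdftoText("...");
        String text = "Title\n\nFirst line\r\nSecond line\n";
        int number = 1;
        for (String line : nonBlankLines(text)) {
            System.out.println(number++ + ": " + line);
        }
    }
}
```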
hi,
Now I want to ask how to read a PDF file by its sections. I mean, if I have a PDF file and I don’t want to read all of its contents, can you help me read it by its sections? For example, I just want to get the content of the Introduction section, or maybe I want to read only the table of contents.
I’ve tried the code, and it works.
Thank you.
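The extracted text carries no section structure, so any section-level reading has to be done on the string afterwards. A minimal sketch, assuming you already know the heading text that starts the section and the heading that follows it: `SectionExtractor` and `sectionBetween` are hypothetical names, and the sample text is invented for illustration.

```java
// Hypothetical helper that cuts a section out of already-extracted text.
final class SectionExtractor {

    // Returns the text between the first occurrence of startHeading and the next
    // occurrence of endHeading. Returns null if startHeading is absent; runs to
    // the end of the text if endHeading is absent.
    static String sectionBetween(String text, String startHeading, String endHeading) {
        int start = text.indexOf(startHeading);
        if (start < 0) {
            return null;
        }
        start += startHeading.length();
        int end = text.indexOf(endHeading, start);
        if (end < 0) {
            end = text.length();
        }
        return text.substring(start, end).trim();
    }

    public static void main(String[] args) {
        // Invented sample standing in for the output of pdftoText
        String text = "Abstract\nShort summary.\nIntroduction\nPDFBox extracts text.\nConclusion\nDone.";
        System.out.println(sectionBetween(text, "Introduction", "Conclusion"));
    }
}
```

A heading that also appears in the table of contents will be matched there first, so real documents may need a stricter rule than a plain `indexOf`.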
it really works.
thanks for your work.
I have used your program and it really works for me. Thanks for sharing this code. I hope to see more useful code like this.
Thanks, really works!
I’ve tried to extract text from a particular region within a PDF with the current PDFBox build available on the Apache website and had no luck. I hope to do this with your jar file.
Thank you. It worked for all my PDFs except this one. Can you please check what the issue is here?
http://cid-a3aa7f7d9888874d.office.live.com/self.aspx/Public/getting%5E_started%5E_with%5E_Flex3.pdf
its awesome..
Thanks for sharing..
I have to extract only the line data (position, thickness, width, height) from a PDF into a text file. If any of you know how, please help me.
Hi Everyone,
I have the same problem as Patrik:
org.apache.pdfbox.exceptions.WrappedIOException
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:125)
    at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:120)
    at PDFTextParser1.pdftoText(PDFTextParser1.java:33)
    at PDFTextParser1.main(PDFTextParser1.java:56)
Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph cannot be cast to org.apache.pdfbox.util.operator.OperatorProcessor
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:119)
    ... 3 more
Does anyone know how to solve this? I am running on windows xp. Thank you very much in advance for helping me sort this out.
Thanks for the nice explanation of this utility. I tried it and got very good results with English text, but when it came to extracting Unicode/Indian language text from PDFs, the output was recognizable in many of the well-known fonts. This must be a well-known issue. Can you please suggest a solution?
I made a mistake in the sentence above: the output for Indian languages is unrecognizable, not recognizable. Thanks.