PDFBox : Extract Text from PDF

Recently I had to extract text from PDF files so that the content could be indexed with Apache Lucene. Apache PDFBox was the obvious choice of Java library.

Apache PDFBox is an open source Java library for working with PDF files. It allows creation of new PDF documents, manipulation of existing documents, and extraction of content from documents. PDFBox also includes several command-line utilities.

There is no recent build available for PDFBox. SourceForge has only very old binaries, and the old version fails to work with the PDF 1.5 specification, so one needs to compile the latest code from SVN.

I am sharing the latest jar file built from SVN here

The following example shows how to extract the text from a PDF file using PDFBox.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFTextParser {

	// Extract text from a PDF document.
	// Returns null if the file is missing or cannot be parsed.
	static String pdftoText(String fileName) {
		PDFParser parser;
		String parsedText = null;
		PDFTextStripper pdfStripper = null;
		PDDocument pdDoc = null;
		COSDocument cosDoc = null;
		FileInputStream input = null;
		File file = new File(fileName);
		if (!file.isFile()) {
			System.err.println("File " + fileName + " does not exist.");
			return null;
		}
		try {
			input = new FileInputStream(file);
			parser = new PDFParser(input);
		} catch (IOException e) {
			System.err.println("Unable to open PDF Parser. " + e.getMessage());
			closeQuietly(input);
			return null;
		}
		try {
			parser.parse();
			cosDoc = parser.getDocument();
			pdfStripper = new PDFTextStripper();
			pdDoc = new PDDocument(cosDoc);
			// Page numbers are 1-based; only pages 1 to 5 are extracted here
			pdfStripper.setStartPage(1);
			pdfStripper.setEndPage(5);
			parsedText = pdfStripper.getText(pdDoc);
		} catch (Exception e) {
			System.err.println("An exception occurred in parsing the PDF Document. "
					+ e.getMessage());
		} finally {
			try {
				// Closing the PDDocument also closes the COSDocument it wraps
				if (pdDoc != null)
					pdDoc.close();
				else if (cosDoc != null)
					cosDoc.close();
			} catch (Exception e) {
				e.printStackTrace();
			}
			closeQuietly(input);
		}
		return parsedText;
	}

	// Close the input stream, ignoring any failure
	private static void closeQuietly(FileInputStream in) {
		if (in == null)
			return;
		try {
			in.close();
		} catch (IOException ignored) {
			// Nothing useful can be done at this point
		}
	}

	public static void main(String[] args) {
		System.out.println(pdftoText("/home/santhosh/pdfbox/test.pdf"));
	}

}

More details on the APIs can be read from here

Posted in Misc.


31 Responses

  1. Elcoj says

    Hi there,
    Super post, Need to mark it on Digg
    Elcoj

  2. Neel says

    I have used the above program and am getting the error below. Appreciate any help.
    java.lang.NoClassDefFoundError: org/apache/fontbox/cmap/CMapParser

  3. Giordano says

    Hi, can you please explain to me, like to a baby, how to compile and run this source?
    I’ve downloaded pdfbox-0.8.0-incubating.ja but I don’t know how to use it together with “javac”

    Thanks

  4. Pratik says

    Hello,

    I tried to run the above code using the pdfbox jar and fontbox jar mentioned in earlier replies, but the code still blows up with the exception below. Has anyone faced this issue before? If yes, could you tell me how you fixed it?

    org.apache.pdfbox.exceptions.WrappedIOException
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:125)
    at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:120)
    at PDFTextParser1.pdftoText(PDFTextParser1.java:33)
    at PDFTextParser1.main(PDFTextParser1.java:56)
    Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph cannot be cast to org.apache.pdfbox.util.operator.OperatorProcessor
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:119)
    … 3 more

    Thanks in advance,

    • Szemere says

      You are most likely mixing 0.8.0 and 0.7.3 versions in the classpath.
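
A quick way to test that diagnosis is to ask the JVM which jar each class is loaded from. The sketch below is only an illustration: the class `ClassOrigin` and its method `originOf` are hypothetical names, not part of PDFBox, and the class names it probes are the ones that appear in the stack trace and comments on this page. It uses only standard JDK calls.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of PDFBox.
class ClassOrigin {

    // Returns the jar or directory a class is loaded from, or a short
    // explanatory string when the class is missing or has no code source.
    static String originOf(String className) {
        try {
            // 'false' avoids running static initializers of the probed class
            Class<?> c = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource source = c.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return "(no code source; likely a JDK class)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        } catch (LinkageError e) {
            return "(found but failed to load: " + e + ")";
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.pdfbox.util.PDFTextStripper", // package used by the code in this post
            "org.pdfbox.util.PDFTextStripper",        // older package named in the stack trace
            "org.apache.fontbox.cmap.CMapParser"      // FontBox class from the NoClassDefFoundError
        };
        for (String name : names) {
            System.out.println(name + " -> " + originOf(name));
        }
    }
}
```

If both the `org.apache.pdfbox` and the `org.pdfbox` class resolve to a jar, two PDFBox versions are on the classpath at once, and removing the older jar is the likely fix.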

  5. Ching says

    How can you extract bold text from a PDF if there is one?

  6. Hrishi says

    There is a small defect in your code. You have to replace
    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    WITH
    import org.pdfbox.cos.COSDocument;
    import org.pdfbox.pdfparser.PDFParser;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    • Pooja says

      I agree with Hrishi. But changing the import path from org.apache.pdfbox.* to org.pdfbox.* will not help, as all the class files inside the jar will contain the path as org.apache.pdfbox.* only.
      Moreover, the class PDFont imports from the fontbox jar file as org.apache.fontbox.*, while the fontbox-0.1.0.jar file has the path defined as org.fontbox.*.
      So it is really difficult to use these two jars with different packages.

  7. Pooja says

    Use fontbox-0.8.0-incubating.jar instead of fontbox-0.1.0.jar. You also need one more jar, common-logging-4.0.6.jar

  8. Ravi says

    I am using the 1.1.0 versions of all of them (i.e. pdfbox, fontbox, jempbox, commons logging) from http://pdfbox.apache.org/download.html. It is working without any problems.

    Sometimes I see this warning:
    Jun 14, 2010 6:35:55 PM org.apache.pdfbox.pdmodel.font.PDFontFactory createFont
    WARNING: Failed to create Type1C font. Falling back to Type1 font
    java.lang.NullPointerException

    but extraction is successful

  9. Art says

    Works, but the title and footer of the PDF document are extracted incompletely. Do you know why?

  10. jyothi says

    Hi all,
    I have successfully read the PDF, but my problem is that I want to display that PDF form in an HTML page.
    My requirement is:
    parsing PDF forms (to parse textbox/textarea/checkbox/radio button/numerical data and print to screen)
    and printing them into an HTML page.
    Please help me.

  11. Nithyananthakumar says

    I am using pdfbox-app-1.2.1.jar, jempbox-1.2.1.jar, fontbox-1.2.1.jar and the above code. Working fine. Good work. Thanks for the code.

    • Arun says

      Hi nithya,
      could you please briefly explain to me how it works? I am new to the Java environment and to this tool. I have downloaded pdfbox-1.2.1, and I have downloaded JDK 1.6 and Apache Maven. Please help me with how to use it.

  12. rag says

    I am not getting the text as is. The header part is displayed last and the table content at the top. How do I get the text as is from the PDF into a text file?

  13. bhavani says

    Hi,
    I am trying the same (above code and jars).
    I am getting the error as in the title. How do I resolve the issue? 10-07 12:43:42.456: java.lang.NoClassDefFoundError: org.apache.pdfbox.pdmodel.PDDocument
    I am using JDK 1.5 with Eclipse, and pdf 1.2 to check.

  14. roger says

    Hi,
    The program works perfectly, but I am facing issues when translating PDFs that have text in Hindi, Telugu and Malayalam fonts. The output is a garbled set of characters. Please help.

  15. Linkesh says

    Great work. Thank you. This is exactly what I was searching for.

  16. Awais says

    Can we read the PDF line by line? Kindly tell me of any method.
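
One simple approach, sketched here under assumptions, is to extract the text first and then walk through the resulting string line by line. The class `ExtractedLines` and its method `toLines` are hypothetical names and not part of PDFBox; the sample string merely stands in for whatever `pdftoText` returns. The line breaks are the ones the text extractor emits, which may not match the visual lines of every PDF.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: splits already-extracted text into lines.
class ExtractedLines {

    // Returns the lines of the given text; handles \n, \r and \r\n line endings.
    static List<String> toLines(String text) {
        List<String> lines = new ArrayList<String>();
        if (text == null) {
            return lines;
        }
        BufferedReader reader = new BufferedReader(new StringReader(text));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Not expected when reading from an in-memory string
            throw new IllegalStateException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by pdftoText(...)
        String parsedText = "First line\r\nSecond line\nThird line";
        List<String> lines = toLines(parsedText);
        for (int i = 0; i < lines.size(); i++) {
            System.out.println((i + 1) + ": " + lines.get(i));
        }
    }
}
```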

  17. arlisa says

    hi,
    I’ve tried the code, and it works :) Now I want to ask how to read a PDF file by its sections. I mean, if I have a PDF file and I don’t want to read all of its contents, can you help me read it section by section? For example, I just want to get the content of the Introduction section, or maybe I want to read only the table of contents.
    Thank you.

  18. Sakil Imran says

    it really works. :) thanks for your work.

  19. Best hosting service says

    I have used your program and it really works for me. Thanks for sharing this code. I hope to see more useful code of this type.

  20. nik says

    Thanks, really works!

    I’ve tried to extract text from a particular region within a PDF with the current PDFBox build available on the Apache website and had no luck….hope to do this with your jar file)

  21. dazzle says

    Thank you. Everything worked for me except this PDF. Can you please check what the issue is here?
    http://cid-a3aa7f7d9888874d.office.live.com/self.aspx/Public/getting%5E_started%5E_with%5E_Flex3.pdf

  22. vinz says

    its awesome..

    Thanks for sharing..

  23. gayathri says

    I have to extract only the line data (position, thickness, width, height) from a PDF into a text file. If anyone knows how, please help me.

  24. NAB says

    Hi Everyone,

    I have the same problem as Pratik:

    org.apache.pdfbox.exceptions.WrappedIOException
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:125)
    at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:120)
    at PDFTextParser1.pdftoText(PDFTextParser1.java:33)
    at PDFTextParser1.main(PDFTextParser1.java:56)
    Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph cannot be cast to org.apache.pdfbox.util.operator.OperatorProcessor
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:119)
    … 3 more

    Does anyone know how to solve this? I am running on windows xp. Thank you very much in advance for helping me sort this out.

  25. Prakash says

    Thanks for the nice explanation of this utility. I tried this and got very good results with English text, but when it came to extracting Unicode/Indian language text from PDF, the output was recognizable in many of the well known fonts. This must be a well known issue; can you please suggest a solution?

  26. Prakash says

    The output for Indian languages is unrecognizable (not recognizable); I made a mistake in writing the sentence above. Thanks.


