PDFBox: Extract Text from PDF

Recently I had to extract text from PDF files in order to index their content with Apache Lucene. Apache PDFBox was the obvious choice of Java library for the job.

Apache PDFBox is an open source Java library for working with PDF files. It lets you create new PDF documents, manipulate existing ones, and extract content from them. PDFBox also includes several command-line utilities.

No recent build of PDFBox is available for download. SourceForge has only very old binaries, and that old version fails to work with files that follow the PDF 1.5 specification. So one needs to compile the latest code from SVN.

I am sharing the latest jar file, built from SVN, here.

The following example shows how to extract the text from a PDF file using PDFBox.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFTextParser {

	// Extract text from a PDF document. Returns null if the file is
	// missing or cannot be parsed.
	static String pdftoText(String fileName) {
		String parsedText = null;
		FileInputStream input = null;
		PDDocument pdDoc = null;
		COSDocument cosDoc = null;
		File file = new File(fileName);
		if (!file.isFile()) {
			System.err.println("File " + fileName + " does not exist.");
			return null;
		}
		try {
			input = new FileInputStream(file);
			PDFParser parser = new PDFParser(input);
			parser.parse();
			cosDoc = parser.getDocument();
			pdDoc = new PDDocument(cosDoc);
			PDFTextStripper pdfStripper = new PDFTextStripper();
			// Only pages 1 to 5 are extracted; adjust or remove these
			// two calls to change the page range.
			pdfStripper.setStartPage(1);
			pdfStripper.setEndPage(5);
			parsedText = pdfStripper.getText(pdDoc);
		} catch (Exception e) {
			System.err.println("An exception occurred while parsing the PDF document: "
					+ e.getMessage());
		} finally {
			try {
				// Closing the PDDocument also closes the COSDocument it wraps,
				// so the COSDocument is closed directly only if no PDDocument
				// was created.
				if (pdDoc != null) {
					pdDoc.close();
				} else if (cosDoc != null) {
					cosDoc.close();
				}
				if (input != null) {
					input.close();
				}
			} catch (IOException e) {
				e.printStackTrace();
			}
		}
		return parsedText;
	}

	public static void main(String[] args) {
		// Path to the PDF to read; change it to a file on your machine.
		System.out.println(pdftoText("/home/santhosh/pdfbox/test.pdf"));
	}

}
 
 

More details on the APIs can be found here.
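
Several of the errors reported in the comments below (a NoClassDefFoundError for org/apache/fontbox/cmap/CMapParser, and a ClassCastException between org.pdfbox and org.apache.pdfbox classes) look like classpath problems rather than problems in the example itself. The following is a minimal sketch of a check you could run with the same classpath as the example. It uses only JDK reflection and does not call PDFBox. The class name PdfBoxClasspathCheck and the helper locate are made up for illustration, and the list of class names is simply taken from the imports and stack traces on this page.

```java
import java.security.CodeSource;

public class PdfBoxClasspathCheck {

	// Class names copied from the imports and error messages on this page.
	// The list is an assumption about what is worth checking, not an
	// official list of PDFBox dependencies.
	static final String[] CANDIDATES = {
		"org.apache.pdfbox.util.PDFTextStripper", // package used by the example
		"org.pdfbox.util.PDFTextStripper",        // older package name
		"org.apache.fontbox.cmap.CMapParser"      // class named in the NoClassDefFoundError
	};

	// Returns a description of where the class was loaded from,
	// or null if the class cannot be found on the classpath.
	static String locate(String className) {
		try {
			// initialize=false: look the class up without running
			// its static initializers.
			Class<?> c = Class.forName(className, false,
					PdfBoxClasspathCheck.class.getClassLoader());
			CodeSource src = c.getProtectionDomain().getCodeSource();
			if (src == null || src.getLocation() == null) {
				// Classes that ship with the JDK have no code source.
				return "(no code source; JDK class)";
			}
			return src.getLocation().toString();
		} catch (ClassNotFoundException e) {
			return null;
		} catch (LinkageError e) {
			// For example, a dependency of the class is itself missing.
			return null;
		}
	}

	public static void main(String[] args) {
		for (String name : CANDIDATES) {
			String where = locate(name);
			System.out.println(name + " -> "
					+ (where == null ? "NOT FOUND" : where));
		}
	}
}
```

If both the org.pdfbox and the org.apache.pdfbox class are reported as found, two different PDFBox versions are probably on the classpath at once. If the FontBox class is not found, the matching fontbox jar is probably missing.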



15 Responses

  1. Elcoj says

    Hi there,
    Super post, I need to mark it on Digg.
    Elcoj

  2. Neel says

    I have used the above program and am getting the error below. I would appreciate any help.
    java.lang.NoClassDefFoundError: org/apache/fontbox/cmap/CMapParser

  3. Giordano says

    Hi, could you please explain to me, as you would to a baby, how to compile and run this source?
    I’ve downloaded pdfbox-0.8.0-incubating.ja but I don’t know how to use it together with “javac”.

    Thanks

  4. Pratik says

    Hello,

    I tried to run the above code using the pdfbox jar and fontbox jar mentioned in earlier replies, but the code still blows up with the exception below. Has anyone faced this issue before? If so, could you tell me how you fixed it?

    org.apache.pdfbox.exceptions.WrappedIOException
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:125)
    at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:120)
    at PDFTextParser1.pdftoText(PDFTextParser1.java:33)
    at PDFTextParser1.main(PDFTextParser1.java:56)
    Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph cannot be cast to org.apache.pdfbox.util.operator.OperatorProcessor
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:119)
    … 3 more

    Thanks in advance,

    • Szemere says

      You are most likely mixing 0.8.0 and 0.7.3 versions in the classpath.

  5. Ching says

    How can you extract bold text from a PDF if there is one?

  6. Hrishi says

    There is a small defect in your code. You have to replace
    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    WITH
    import org.pdfbox.cos.COSDocument;
    import org.pdfbox.pdfparser.PDFParser;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    • Pooja says

      I agree with Hrishi. But changing the import path from org.apache.pdfbox.* to org.pdfbox.* will not help, as all the class files inside the jar use the path org.apache.pdfbox.* only.
      Moreover, the class PDFont imports from the fontbox jar as org.apache.fontbox.*, while the fontbox-0.1.0.jar file has the path defined as org.fontbox.*.
      So it is really difficult to use these two jars, which have different packages.

  7. Pooja says

    Use fontbox-0.8.0-incubating.jar instead of fontbox-0.1.0.jar. You also need one more jar, common-logging-4.0.6.jar.

  8. Ravi says

    I am using the 1.1.0 versions of everything (i.e. pdfbox, fontbox, jempbox, commons logging) from here: http://pdfbox.apache.org/download.html. It's working without any problems.

    Sometimes I see this warning:
    Jun 14, 2010 6:35:55 PM org.apache.pdfbox.pdmodel.font.PDFontFactory createFont
    WARNING: Failed to create Type1C font. Falling back to Type1 font
    java.lang.NullPointerException

    but the extraction is successful.

  9. Art says

    Works, but the title and footer of the PDF document are extracted incompletely. Do you know why?

  10. jyothi says

    Hi all,
    I have successfully read the PDF, but my problem is that I want to display that PDF form in an HTML page.
    My requirement is:
    parsing PDF forms (to parse textbox/textarea/checkbox/radio button/numerical data and print to screen)
    to print into an HTML page.
    Please help me.

  11. Nithyananthakumar says

    I am using pdfbox-app-1.2.1.jar, jempbox-1.2.1.jar, fontbox-1.2.1.jar and the above code. Working fine … good work. Thanks for the code…


