PDFBox : Extract Text from PDF

Recently I had to extract text from PDF files so that the content could be indexed with Apache Lucene. Apache PDFBox was the obvious choice of Java library.

Apache PDFBox is an open source Java library for working with PDF files. It supports creating new PDF documents, manipulating existing documents, and extracting content from documents. PDFBox also includes several command-line utilities.

There is no recent binary build available for PDFBox. SourceForge has only very old binaries, and that old version fails to work with the PDF 1.5 specification. So one needs to compile the latest code from SVN.

I am sharing the latest jar file built from SVN here.

The following example shows how to extract the text from a PDF file using PDFBox.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFTextParser {

	// Extract text from PDF Document
	static String pdftoText(String fileName) {
		PDFParser parser;
		String parsedText = null;
		PDFTextStripper pdfStripper = null;
		PDDocument pdDoc = null;
		COSDocument cosDoc = null;
		File file = new File(fileName);
		if (!file.isFile()) {
			System.err.println("File " + fileName + " does not exist.");
			return null;
		}
		try {
			parser = new PDFParser(new FileInputStream(file));
		} catch (IOException e) {
			System.err.println("Unable to open PDF Parser. " + e.getMessage());
			return null;
		}
		try {
			parser.parse();
			cosDoc = parser.getDocument();
			pdfStripper = new PDFTextStripper();
			pdDoc = new PDDocument(cosDoc);
			pdfStripper.setStartPage(1);
			pdfStripper.setEndPage(5);
			parsedText = pdfStripper.getText(pdDoc);
		} catch (Exception e) {
			System.err.println("An exception occurred in parsing the PDF Document. "
					+ e.getMessage());
		} finally {
			try {
				if (cosDoc != null)
					cosDoc.close();
				if (pdDoc != null)
					pdDoc.close();
			} catch (Exception e) {
				e.printStackTrace();
			}
		}
		return parsedText;
	}
	public static void main(String args[]){
		System.out.println(pdftoText("/home/santhosh/pdfbox/test.pdf"));
	}

}
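
The listing needs the PDFBox jar, and any jars it depends on, on the classpath at both compile time and run time. As a rough diagnostic, the following sketch uses only the JDK to report whether a few classes can be found and which jar each one was loaded from. The class `ClasspathCheck` and its `locate` method are hypothetical names. The class names it probes are the `PDDocument` import from the listing, the FontBox class named in the `NoClassDefFoundError` reported in the comments, and the older `org.pdfbox` package name that some commenters mention.

```java
import java.security.CodeSource;

class ClasspathCheck {

    // Returns the location a class was loaded from, "(JDK runtime)" for
    // classes that ship with the JDK, or null if the class cannot be found.
    static String locate(String className) {
        try {
            // initialize=false: look the class up without running its static initializers
            Class<?> c = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null ? "(JDK runtime)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.pdfbox.pdmodel.PDDocument", // imported by the listing
            "org.apache.fontbox.cmap.CMapParser",   // named in a reader's NoClassDefFoundError
            "org.pdfbox.pdmodel.PDDocument"         // package name used by older jars
        };
        for (String name : names) {
            String where = locate(name);
            System.out.println(name + " -> " + (where == null ? "NOT FOUND" : where));
        }
    }
}
```

Run it with the same classpath as the extraction program. A class reported as NOT FOUND suggests a missing jar; if both the `org.apache.pdfbox` and the `org.pdfbox` names resolve, two generations of the library are probably mixed on the classpath.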

More details on the APIs can be read from here

53 thoughts on “PDFBox : Extract Text from PDF”

  1. I have used the above program and am getting the error below. I would appreciate any help.
    java.lang.NoClassDefFoundError: org/apache/fontbox/cmap/CMapParser

  2. Hi, could you please explain to me, as if to a baby, how to compile and run this source?
    I’ve downloaded pdfbox-0.8.0-incubating.ja but I don’t know how to use it together with “javac”.

    Thanks

  3. Hello,

    I tried to run the above code using the pdfbox jar and fontbox jar mentioned in earlier replies, but the code still blows up with the exception below. Has anyone faced this issue before? If yes, could you tell me how you fixed it?

    org.apache.pdfbox.exceptions.WrappedIOException
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:125)
    at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:120)
    at PDFTextParser1.pdftoText(PDFTextParser1.java:33)
    at PDFTextParser1.main(PDFTextParser1.java:56)
    Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph cannot be cast to org.apache.pdfbox.util.operator.OperatorProcessor
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:119)
    … 3 more

    Thanks in advance,

  4. There is a small defect in your code. You have to replace
    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    WITH
    import org.pdfbox.cos.COSDocument;
    import org.pdfbox.pdfparser.PDFParser;
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    1. I agree with Hrishi. But changing the import path from org.apache.pdfbox.* to org.pdfbox.* will not help, as all the class files inside the jar use the path org.apache.pdfbox.* only.
      Moreover, the class PDFont imports from the fontbox jar as org.apache.fontbox.*, while the fontbox-0.1.0.jar file has the path defined as org.fontbox.*.
      So it is really difficult to use these two jars, which have different packages.

  5. Use fontbox-0.8.0-incubating.jar instead of fontbox-0.1.0.jar. You also need one more jar, common-logging-4.0.6.jar.

  6. I am using the 1.1.0 versions of everything (i.e. PDFBox, FontBox, JempBox and Commons Logging) from http://pdfbox.apache.org/download.html. It is working without any problems.

    Sometimes I see the warning
    Jun 14, 2010 6:35:55 PM org.apache.pdfbox.pdmodel.font.PDFontFactory createFont
    WARNING: Failed to create Type1C font. Falling back to Type1 font
    java.lang.NullPointerException

    but the extraction is successful.

  7. Hi all,
    I have successfully read the PDF, but my problem is that I want to display the PDF form in an HTML page.
    My requirement is:
    parsing PDF forms (to parse textbox/textarea/checkbox/radio button/numerical data and print to screen)
    and printing them into an HTML page.
    Please help me.

  8. I am using pdfbox-app-1.2.1.jar, jempbox-1.2.1.jar, fontbox-1.2.1.jar and the above code. Working fine. Good work, thanks for the code.

    1. Hi nithya,
      could you please briefly explain to me how it works? I am new to the Java environment and to this tool. I have downloaded pdfbox-1.2.1, and I have downloaded JDK 1.6 and Apache Maven. Please help me with how to use it.

  9. I am not getting the text as is. The header part is displayed last and the table content at the top. How do I get the text as is from the PDF into a text file?

  10. Hi,
    I’m trying the same (above code and jars).
    I’m getting the error as in the title. How do I resolve the issue?
    10-07 12:43:42.456: java.lang.NoClassDefFoundError: org.apache.pdfbox.pdmodel.PDDocument
    I’m using JDK 1.5 with Eclipse, and PDF 1.2 to check.

  11. Hi,
    The program works perfectly, but I am facing issues when extracting text from PDFs that use Hindi, Telugu and Malayalam fonts. The output is a garbled set of characters. Please help.

  12. hi,
    I’ve tried the code, and it works 🙂 Now I want to ask how to read a PDF file by its sections. I mean, if I have a PDF file and I don’t want to read all of its contents, can you help me read it by its sections? For example, I just want to get the content of the Introduction section, or maybe read only the table of contents.
    Thank you.
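
One way to approach the question above is to extract the full text first and then slice it by heading. The sketch below works on the `String` returned by `pdftoText` and uses only the JDK. It assumes that each section heading survives extraction as a line of its own and that the heading text is known in advance, which will not hold for every PDF; numbered headings such as "1 Introduction" would need the number included or a looser match. `SectionSlicer` and `section` are hypothetical names, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

class SectionSlicer {

    // Returns the lines between the heading `start` and the heading `end`,
    // excluding both headings. If `end` never appears, everything after
    // `start` is returned. Returns "" if `start` is not found.
    static String section(String text, String start, String end) {
        List<String> kept = new ArrayList<>();
        boolean inside = false;
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!inside) {
                // Still searching for the opening heading
                inside = trimmed.equalsIgnoreCase(start);
            } else if (trimmed.equalsIgnoreCase(end)) {
                break; // reached the next heading
            } else {
                kept.add(line);
            }
        }
        return String.join("\n", kept).trim();
    }

    public static void main(String[] args) {
        // Stand-in for the text returned by pdftoText(...)
        String text = "Title\nIntroduction\nPDFBox reads PDF files.\n"
                + "It can extract text.\nConclusion\nDone.";
        System.out.println(section(text, "Introduction", "Conclusion"));
    }
}
```

With the sample text in `main`, this prints the two lines that sit between the "Introduction" and "Conclusion" headings.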

  13. Thanks, really works!

    I’ve tried to extract text from a particular region within a PDF with the current PDFBox build available on the Apache website and had no luck. I hope to do this with your jar file.

  14. I have to extract only the line data (position, thickness, width, height) from a PDF into a text file. If anyone knows how, please help me.

  15. Hi Everyone,

    I have the same problem as Patrik:

    org.apache.pdfbox.exceptions.WrappedIOException
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:125)
    at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:120)
    at PDFTextParser1.pdftoText(PDFTextParser1.java:33)
    at PDFTextParser1.main(PDFTextParser1.java:56)
    Caused by: java.lang.ClassCastException: org.pdfbox.util.operator.ShowTextGlyph cannot be cast to org.apache.pdfbox.util.operator.OperatorProcessor
    at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:119)
    … 3 more

    Does anyone know how to solve this? I am running on Windows XP. Thank you very much in advance for helping me sort this out.

  16. Thanks for the nice explanation of this utility. I tried it and got very good results with English text, but when it came to extracting Unicode/Indian language text from PDF, the output was unrecognizable for many of the well-known fonts. This must be a well-known issue. Can you please suggest a solution?

  17. Also, is there any way to get the exact location of the text in the PDF? I mean the X,Y coordinates of each string or line. Please help me with this.

  18. Hi,
    I have a PDF file and I don’t want to read all of its contents. Can you help me read it by its sections? For example, I just want to get the content of the Introduction section, or maybe read only the table of contents. Also, please tell me how to run it.
    Thank you.

  19. I followed every step given above
    and still got the error below:
    Error: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1069)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:712)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
    at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

    1. I followed every step given above and still got the same error shown above.

      Can someone help with this?

    2. Hi, it has been years since I wrote this blog post and tutorial, so it is quite possible that the above steps won’t work with modern versions of the libraries. Unfortunately I cannot help much, since I am in a different profession now and the last time I did anything with Java was in 2011. 🙁

  20. Hi,
    I am Ajay. When I try to read a PDF in Hindi, I do not get the Hindi text in the right form. Can anyone please tell me the reason?
    An example is given below:
    input:akabar birbal in hindi
    output:अकबर ब रब क

  21. Hi, I am getting the error below:
    Exception in thread "main" java.lang.ClassCastException: java.io.FileInputStream cannot be cast to org.apache.pdfbox.io.RandomAccessRead
    at testsel.Test46.main(Test46.java:28)

    for the statement below:

    parser = new PDFParser((RandomAccessRead) new FileInputStream(file));

    Please suggest..
    Thanks

  22. hi,

    I am also facing the same issue: I am not able to get the Hindi text.

    example : क्लस्टर िंग, प्रतिगमन, समय श्रिंखला, तिभाजन क्लस्टर िंग, प्रतिगमन, समय श्रिंखला, तिभाजन
    क्लस्टर िंग, प्रतिगमन, समय श्रिंखला, तिभाजन ग, प्रतिगमन, समय श्रिंखला, तिभाजन

    giving output as : –
    ??????? ???, ????????, ??? ????????, ??????
    ??????? ???, ????????, ??? ????????, ??????

    ??????? ???, ????????, ??? ????????, ??????
    ?, ????????, ??? ????????, ??????

    please help me..
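
When extracted text shows up as question marks, one possible cause is that the console or the platform default charset cannot represent the characters, so they are replaced when the string is printed. The following sketch uses only the JDK and writes a string through a `PrintStream` that encodes as UTF-8, to the console and to a file. `Utf8Output`, `printUtf8` and the file name `extracted.txt` are hypothetical, and the sample string is a stand-in for the value returned by `pdftoText`. This only addresses the output side; it cannot help if the extracted string itself is already wrong.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Output {

    // Writes the text to the sink encoded as UTF-8, regardless of the
    // platform default charset. The sink is flushed but not closed.
    static void printUtf8(String text, OutputStream sink) {
        try {
            PrintStream out = new PrintStream(sink, true, "UTF-8");
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8
            throw new IllegalStateException("UTF-8 is not available", e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for extracted text: four Devanagari letters
        String text = "\u0905\u0915\u092c\u0930\n";
        printUtf8(text, System.out);
        // A file avoids any dependence on what the console can display
        try (OutputStream file = new FileOutputStream("extracted.txt")) {
            printUtf8(text, file);
        }
    }
}
```

If the file opens correctly in a UTF-8 aware editor while the console still shows question marks, the problem is likely in the console settings rather than in the extraction.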

  23. code is :-

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class pdftest1 {
        static String pdftoText(String fileName) throws IOException {
            PDFParser parser;
            String parsedText = null;
            //PDFTextStripper pdfStripper = null;
            PDFTextStripper pdfStripper = new PDFTextStripper("UTF-8");
            PDDocument pdDoc = null;
            COSDocument cosDoc = null;
            File file = new File(fileName);
            if (!file.isFile()) {
                System.out.println("File " + fileName + " does not exist.");
                return null;
            }
            try {
                parser = new PDFParser(new FileInputStream(file));
            } catch (IOException e) {
                System.out.println("Unable to open PDF Parser. " + e.getMessage());
                return null;
            }
            try {
                parser.parse();
                cosDoc = parser.getDocument();
                pdfStripper = new PDFTextStripper();
                pdDoc = new PDDocument(cosDoc);
                pdfStripper.setStartPage(1);
                pdfStripper.setEndPage(5);
                parsedText = pdfStripper.getText(pdDoc);
            } catch (Exception e) {
                e.printStackTrace();
                System.out.println("An exception occured in parsing the PDF Document." + e.getMessage());
            } finally {
                try {
                    if (cosDoc != null)
                        cosDoc.close();
                    if (pdDoc != null)
                        pdDoc.close();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
            return parsedText;
        }

        public static void main(String args[]) throws IOException {
            System.out.println(pdftoText("C:/Users/AB850924/Desktop/AbdulData/PDFData/Hindi.pdf"));
        }
    }

  24. If I have a PDF file containing questions and answers, how can I separate out an image of a particular question and answer with some margin (height, width)?

  25. Hi,
    and compliments on this code.
    I have a question: is it possible to print the PDF page on which a string I search for is found?
    Help me 🙂 I am a beginner in Java!
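
For the last question, a simple approach is to extract the text one page at a time and then check each page for the search term. The sketch below shows only the search step, using the JDK alone. It assumes the per-page strings were obtained by calling `getText` once per page with `setStartPage(p)` and `setEndPage(p)` set to the same page number, as the article's listing does for a page range. `PageSearch` and `pagesContaining` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class PageSearch {

    // Returns the 1-based numbers of the pages whose text contains the
    // search term, ignoring case.
    static List<Integer> pagesContaining(List<String> pageTexts, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (pageTexts.get(i).toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(i + 1); // report pages starting at 1, like setStartPage(1)
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-ins for the text of pages 1, 2 and 3
        List<String> pages = Arrays.asList("Apache PDFBox", "Lucene indexing", "pdfbox again");
        System.out.println(pagesContaining(pages, "PDFBox")); // prints [1, 3]
    }
}
```

The returned page numbers can then be passed back to `setStartPage` and `setEndPage` to extract or print just those pages.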
