Java -

PDFBox : Extract Text from PDF

Posted on June 24, 2009 | Santhosh Thottingal

Recently I had to extract text from PDF files for indexing the content using Apache Lucene. Apache PDFBox was the obvious choice for the java library to be used. Apache PDFBox is an opensource java library for working with PDF files. The PDFBox library allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. PDFBox also includes several command line utilities. There is no latest build available for PDFBox. [Read More]

java pdf