Extract Images and Words with coordinates and sizes from PDF

Ask Time:2011-11-23T19:52:05         Author:Alex

I've read a lot about PDF extraction and libraries (such as iText), but I haven't found a solution for extracting images and text (with coordinates) from a PDF.

The task is to scan a PDF containing a product catalog and extract each image. An image code is printed next to each image, along with a list of product codes for the products shown in that image.

I know that there is no way to extract structured info from a PDF like this, but with the coordinates of all image and text objects I could write code to identify linked text by its distance from the image. Then I could split the text using a RegExp and work out which part is a product code, which is an image code, and so on.
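
A minimal Python sketch of that distance-based idea, assuming the image and text bounding boxes have already been extracted by some tool. The `Box` type, the 50-unit distance threshold, and both code patterns (`IMG-` plus digits, two letters plus four digits) are hypothetical placeholders, not anything taken from a real catalog.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box with an attached label (image id or text)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    label: str = ""


def box_distance(a, b):
    """Shortest gap between two boxes; 0.0 if they touch or overlap."""
    dx = max(a.x_min - b.x_max, b.x_min - a.x_max, 0.0)
    dy = max(a.y_min - b.y_max, b.y_min - a.y_max, 0.0)
    return math.hypot(dx, dy)


def link_text_to_images(images, texts, max_distance=50.0):
    """Attach each text box to its nearest image, if it is close enough."""
    linked = {img.label: [] for img in images}
    for text in texts:
        nearest = min(images, key=lambda img: box_distance(img, text), default=None)
        if nearest is not None and box_distance(nearest, text) <= max_distance:
            linked[nearest.label].append(text.label)
    return linked


# Hypothetical code formats; replace with the patterns used in the catalog.
IMAGE_CODE = re.compile(r"^IMG-\d+$")
PRODUCT_CODE = re.compile(r"^[A-Z]{2}\d{4}$")


def classify(tokens):
    """Sort linked tokens into image codes, product codes, and everything else."""
    result = {"image_codes": [], "product_codes": [], "other": []}
    for token in tokens:
        if IMAGE_CODE.match(token):
            result["image_codes"].append(token)
        elif PRODUCT_CODE.match(token):
            result["product_codes"].append(token)
        else:
            result["other"].append(token)
    return result
```

Measuring the gap between box edges rather than between box centers keeps a wide caption close to the image it sits under, which seems to fit the catalog layout described here.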

Could you recommend a good and working solution for the task?

Author: Alex, reproduced under the CC BY-SA 4.0 license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/8241724/extract-images-and-words-with-coordinates-and-sizes-from-pdf
Balamurugan Muthiah :

Use XPDF (http://www.foolabs.com/xpdf/)

It can extract all the characters in the PDF with coordinates (pdftotext -bbox [sourcefile] [outputfile]) and also all the images and SVGs in the PDF.

It's open source (GPLv2) and supports a lot of additional extraction functionality as well.
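
A rough Python sketch of reading that output, assuming a `pdftotext` build (such as Poppler's) whose `-bbox` mode writes XHTML containing `<page>` elements with `<word xMin=… yMin=… xMax=… yMax=…>` children; other builds may produce something different. `parse_bbox_words` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parse_bbox_words(xhtml_text):
    """Return one dict per <word> element: page number, text, and bounding box."""
    root = ET.fromstring(xhtml_text)
    words = []
    page_no = 0
    # iter() walks the tree in document order, so each <page> is seen
    # before the <word> elements it contains.
    for elem in root.iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # drop any XML namespace prefix
        if tag == "page":
            page_no += 1
        elif tag == "word":
            words.append({
                "page": page_no,
                "text": (elem.text or "").strip(),
                "x_min": float(elem.get("xMin")),
                "y_min": float(elem.get("yMin")),
                "x_max": float(elem.get("xMax")),
                "y_max": float(elem.get("yMax")),
            })
    return words
```

Typical use would be to run `pdftotext -bbox catalog.pdf words.html` (file names are placeholders) and then call `parse_bbox_words(open("words.html", encoding="utf-8").read())`.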
2015-01-23T10:28:40
mark stephens :

Several Java libraries can do this. Have you looked at JPedal or PdfBox?
2011-11-23T14:24:23