How to extract images from a scanned pdf

Asked: 2017-11-06T16:57:48 Author: Plouf

I use Tesseract to extract text from scanned PDFs. Some of these files also contain images. Is there a way to get those images?

I prepare my scanned PDFs for Tesseract by converting them to TIFF files. But I can't find any command-line tool to extract images from them, as pdfimages would do for "text" PDFs.

Any idea of a tool (or a combination of tools) that would help me do the job?

Author: Plouf, reproduced under the CC BY-SA 4.0 license with a link to the original source and this disclaimer.
Link to original article: https://stackoverflow.com/questions/47133072/how-to-extract-images-from-a-scanned-pdf

user5509289 :

You won't be able to use Tesseract OCR for images, as that's not what it was designed to do. Best to use a tool to extract the images beforehand, and then get the text later using Tesseract.

You may get some use out of pdfimages, by Xpdf.

http://www.xpdfreader.com/pdfimages-man.html

You will need to download R, RStudio, xPDFreader, and PDFtools to accomplish this. Make sure your program files can be found in "Environment Variables" (if using Windows) so that R can find the programs.

Then do something like this to convert it. See the options in the documentation for help on pdfimages. This is just how the syntax will be (specifically after paste0). Note the placement of the options. They have to come before the input file name:

    # PDF to PPM
    # `dest` is the folder that holds the PDFs
    files <- tools::file_path_sans_ext(list.files(path = dest, pattern = "pdf", full.names = TRUE))
    lapply(files, function(i){
      shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 300 ", i, ".pdf", " ", i)))
    })

You could also just use the CMD prompt and type

    pdftoppm -f 1 -l 10 -r 300 stuff.pdf stuff

The last argument is an output prefix: pdftoppm appends the page number and the .ppm extension itself.
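For readers who would rather not install R, the same loop can be sketched in Python. This is only an illustration under stated assumptions: `build_pdftoppm_command` and `convert_folder` are hypothetical helper names, and `pdftoppm` is assumed to be on the PATH. The flags `-f`, `-l`, and `-r` are the ones used in the command above, and they are placed before the input file name.

```python
import shlex
import shutil
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, first_page=1, last_page=10, dpi=300):
    """Return the pdftoppm argument list for one PDF (hypothetical helper).

    Options come first, then the input PDF, then the output prefix.
    pdftoppm appends the page number and extension to the prefix itself.
    """
    pdf = Path(pdf_path)
    output_prefix = pdf.with_suffix("")  # e.g. stuff.pdf -> stuff
    return [
        "pdftoppm",
        "-f", str(first_page),
        "-l", str(last_page),
        "-r", str(dpi),
        str(pdf),
        str(output_prefix),
    ]


def convert_folder(folder):
    """Run pdftoppm on every PDF in `folder` (hypothetical helper)."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm was not found on PATH")
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # check=True raises if pdftoppm exits with an error
        subprocess.run(build_pdftoppm_command(pdf), check=True)


if __name__ == "__main__":
    # Show the command that would be run for a single file
    print(shlex.join(build_pdftoppm_command("stuff.pdf")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.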

2017-11-07T20:13:09

JKAbrams :

1. Extract the images using pdfimages (the last argument is the prefix for the output file names):

    pdfimages mydoc.pdf images

2. Use the following extraction script:

    ./extractImages.py images*

Find your cut-out images in a new images folder.
Look at what was done in the tracing folder to make sure no images were missed.

Operation

It will process all images and look for shapes inside the images. If a shape is found and is larger than a configurable size, it will figure out the maximum bounding box, cut out the image, and save it in the new images folder. In addition, it will create a folder named tracing where it shows all the bounding boxes.

If you want to find smaller images, just decrease minimumWidth and minimumHeight. However, if you set them too low, it will find each character.

In my tests it works extremely well; it just finds a few too many images.

extractImages.py

    #!/usr/bin/env python3

    import cv2
    import numpy as np
    import os
    from pathlib import Path

    def extractImagesFromFile(inputFilename, outputDirectory, tracing=False, tracingDirectory=""):

        # Settings:
        minimumWidth = 100
        minimumHeight = 100
        greenColor = (36, 255, 12)
        traceWidth = 2

        # Load image, grayscale, Otsu's threshold
        image = cv2.imread(inputFilename)
        if image is None:
            # cv2.imread returns None when the file cannot be decoded
            print("Could not read image: {}".format(inputFilename))
            return
        original = image.copy()
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

        # Find contours, obtain bounding box, extract and save ROI
        ROI_number = 1
        cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        # findContours returns 2 or 3 values depending on the OpenCV version
        cnts = cnts[0] if len(cnts) == 2 else cnts[1]
        for c in cnts:
            x, y, w, h = cv2.boundingRect(c)
            if w >= minimumWidth and h >= minimumHeight:
                cv2.rectangle(image, (x, y), (x + w, y + h), greenColor, traceWidth)
                ROI = original[y:y+h, x:x+w]
                outImage = os.path.join(outputDirectory, '{}_{}.png'.format(Path(inputFilename).stem, ROI_number))
                cv2.imwrite(outImage, ROI)
                ROI_number += 1
        if tracing:
            outImage = os.path.join(tracingDirectory, Path(inputFilename).stem + '_trace.png')
            cv2.imwrite(outImage, image)

    def main(files):

        tracingEnabled = True
        outputDirectory = 'images'
        tracingDirectory = 'tracing'

        # Create the output directory if it does not exist
        outputPath = Path.cwd() / outputDirectory
        outputPath.mkdir(exist_ok=True)

        if tracingEnabled:
            tracingPath = Path.cwd() / tracingDirectory
            tracingPath.mkdir(exist_ok=True)

        for f in files:
            print("Processing {}".format(f))
            if Path(f).is_file():
                extractImagesFromFile(f, outputDirectory, tracingEnabled, tracingDirectory)
            else:
                print("Invalid file: {}".format(f))

    if __name__ == "__main__":
        import argparse
        from glob import glob
        parser = argparse.ArgumentParser()
        parser.add_argument("fileNames", nargs='*')
        args = parser.parse_args()
        fileNames = list()
        for arg in args.fileNames:
            fileNames += glob(arg)
        main(fileNames)

Credit

The basic algorithm was provided by nathancy as an answer to this question:
Extract all bounding boxes using OpenCV Python
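To get a feel for how minimumWidth and minimumHeight change the result, here is a minimal sketch of the size filter on its own, without OpenCV. The function name `keep_large_boxes` and the box coordinates are made up for illustration; the boxes stand in for what cv2.boundingRect might return on one scanned page.

```python
def keep_large_boxes(boxes, minimum_width=100, minimum_height=100):
    """Keep (x, y, w, h) boxes that meet both size thresholds."""
    return [
        (x, y, w, h)
        for (x, y, w, h) in boxes
        if w >= minimum_width and h >= minimum_height
    ]


# Hypothetical bounding boxes from one scanned page, as (x, y, w, h)
boxes = [
    (120, 200, 640, 480),  # a photograph
    (900, 150, 220, 180),  # a small diagram
    (100, 900, 18, 24),    # a single character
    (130, 900, 400, 30),   # a line of text merged into one contour
]

default_result = keep_large_boxes(boxes)
loose_result = keep_large_boxes(boxes, minimum_width=10, minimum_height=10)

print(len(default_result), "boxes kept with the 100x100 thresholds")
print(len(loose_result), "boxes kept with 10x10 thresholds")
```

With the default thresholds only the photograph and the diagram survive. Lowering both thresholds to 10 also lets through the character and the text line, which is the "finds each character" effect described above.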

2020-10-11T23:38:58
