Python 2.7 example: reading PDF files

  • 2020-06-12 09:51:18
  • OfStack

This article shows how to read PDF files with Python 2.7. It is shared here for your reference; the details are as follows:

The example code in this article uses Python 2.7 and requires the PDFMiner package, which can be downloaded from http://www.unixuser.org/~euske/python/pdfminer/. The installation method is described at that address, so I will not repeat it here. Note that PDFMiner works only with Python 2 and cannot be used with Python 3; for Python 3, use PDFMiner3K, available at https://pypi.python.org/pypi/pdfminer3k/. The two packages are used in broadly the same way. Here I take Python 2 with PDFMiner as the example. The code is as follows:


#!/usr/bin/env python
#-*- coding:utf-8 -*-
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
# Open the PDF in binary mode; replace algorithm.pdf with the name of your own file.
fp = open("algorithm.pdf", "rb")
# Create a parser associated with the file.
parser = PDFParser(fp)
# Create the PDF document object.
doc = PDFDocument(parser)
# Link the parser and the document object.
parser.set_document(doc)
# doc.set_parser(parser)
# Initialize the document.
# doc.initialize("")
# Create the PDF resource manager.
resource = PDFResourceManager()
# Create the layout analysis parameters.
laparam = LAParams()
# Create an aggregator device.
device = PDFPageAggregator(resource, laparams=laparam)
# Create the PDF page interpreter.
interpreter = PDFPageInterpreter(resource, device)
# Use the document object to iterate over its pages.
for page in PDFPage.create_pages(doc):
  # Process the page with the page interpreter.
  interpreter.process_page(page)
  # Get the page layout from the aggregator.
  layout = device.get_result()
  for out in layout:
    # Only layout objects that carry text have a get_text() method.
    if hasattr(out, "get_text"):
      print out.get_text()
# Close the file once all pages have been read.
fp.close()
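
The inner loop of the listing can also be factored into a small helper that returns the text instead of printing it. What follows is only a minimal sketch and not part of PDFMiner: the name `collect_text` and the two stand-in classes are hypothetical, introduced here for illustration. It assumes nothing beyond the check the listing already uses (`hasattr(out, "get_text")`), so it can be tried without a PDF file and should run on both Python 2.7 and Python 3.

```python
def collect_text(layout):
    """Join the text of every object in a page layout that has get_text().

    `layout` is any iterable of layout objects, such as the value returned
    by device.get_result() in the listing above. Objects without a
    get_text() method (for example figures or lines) are skipped.
    """
    pieces = []
    for out in layout:
        if hasattr(out, "get_text"):
            pieces.append(out.get_text())
    return "".join(pieces)


# Hypothetical stand-in classes, used only to demonstrate the helper
# without opening a real PDF file.
class FakeTextBox(object):
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class FakeFigure(object):
    pass


sample_layout = [FakeTextBox("Hello\n"), FakeFigure(), FakeTextBox("World\n")]
sample_text = collect_text(sample_layout)
```

Inside the page loop of the listing, the inner `for` loop could then be replaced by a single call such as `text = collect_text(layout)`, which makes it easier to save the text to a file or process it further.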

For more information about Python, please refer to Python Files and Directories, Python Data Structures and Algorithms, Python Functions, Python String Manipulation, and Python Introductory and Advanced Classic.

I hope this article helps you with your Python programming.
