
Parsing a PDF into a Text File with Python


I. Parsing the PDF

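The PDF file is parsed with pdfminer. The classes used from its layout module are LAParams (the layout analysis parameters) and the layout object types LTTextBox, LTTextLine, LTFigure, LTImage, and LTChar.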
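To see which of these layout classes actually occur on a page, it can help to tally the objects by class name before deciding which ones to handle. The sketch below is only an illustration: `count_layout_types` is a hypothetical helper, and the `LTTextBoxHorizontal` and `LTFigure` classes defined here are empty stand-ins so that the snippet runs without a PDF. With pdfminer, the argument would be the page layout returned by `device.get_result()`, which can be iterated over to reach its top-level objects.

```python
from collections import Counter


def count_layout_types(layout):
    """Tally the top-level objects of a page layout by class name."""
    return Counter(type(obj).__name__ for obj in layout)


# Empty stand-ins for pdfminer layout classes (assumption: they exist only so
# this sketch runs without a PDF; real objects come from device.get_result()).
class LTTextBoxHorizontal:
    pass


class LTFigure:
    pass


fake_page = [LTTextBoxHorizontal(), LTTextBoxHorizontal(), LTFigure()]
counts = count_layout_types(fake_page)
print(counts)
```

A page that reports many `LTFigure` objects and few text boxes is a hint that most of its text sits inside figures rather than at the top level.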

Example 1: Parsing LTTextBox

    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBoxHorizontal
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
    from pdfminer.pdfparser import PDFParser


    def parse(pdf_file, save_name):
        """Append the text of every horizontal text box in pdf_file to save_name."""
        parser = PDFParser(pdf_file)
        document = PDFDocument(parser)
        # Stop if the document does not permit text extraction.
        if not document.is_extractable:
            print('error')
            raise PDFTextExtractionNotAllowed
        rsrcmgr = PDFResourceManager()
        laparams = LAParams()
        # The aggregator collects the layout objects of each processed page.
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        with open(save_name, 'a', encoding='utf-8') as f:
            for page in PDFPage.create_pages(document):
                interpreter.process_page(page)
                layout = device.get_result()
                for x in layout:
                    # Only top-level horizontal text boxes are written out.
                    if isinstance(x, LTTextBoxHorizontal):
                        results = x.get_text()
                        print(results)
                        f.write(results + "\n")


    if __name__ == '__main__':
        with open('/local/mnt/workspace/PycharmProject/demo/src/tmp/2019.pdf', 'rb') as pdf_file:
            parse(pdf_file, '/local/mnt/workspace/PycharmProject/demo/src/tmp/1.txt')
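Example 1 only inspects the top-level objects of each page, so text nested inside a container such as LTFigure is not written out. One way to reach it is to walk the layout tree recursively. The following is a minimal sketch and not part of the original example: `iter_text` is a hypothetical helper that relies on duck typing (anything with a `get_text()` method is treated as text, anything else that is iterable is treated as a container), and the `FakeText` and `FakeContainer` classes are stand-ins so that it runs without a PDF. With pdfminer, the `layout` returned by `device.get_result()` would be passed in instead.

```python
def iter_text(obj):
    """Yield text from a layout object, descending into containers."""
    if hasattr(obj, "get_text"):
        # Text objects (text boxes, text lines, characters) expose get_text().
        yield obj.get_text()
    elif hasattr(obj, "__iter__"):
        # Containers (pages, figures) are iterable over their children.
        for child in obj:
            yield from iter_text(child)
    # Anything else (for example an image) carries no text and is skipped.


class FakeText:
    """Stand-in for a text object such as LTTextBox (assumption)."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class FakeContainer:
    """Stand-in for a container such as a page or LTFigure (assumption)."""

    def __init__(self, *children):
        self._children = children

    def __iter__(self):
        return iter(self._children)


page = FakeContainer(
    FakeText("Title\n"),
    FakeContainer(FakeText("caption\n")),  # plays the role of an LTFigure
    object(),  # plays the role of an LTImage
)
texts = list(iter_text(page))
print(texts)
```

Checking for `get_text()` first means a text box is emitted once as a whole rather than line by line. Text found inside a figure may come back one character at a time, so the pieces may need to be joined before being written to the file.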

Example 2: Parsing More Layout Types

    #!/usr/bin/python
    import sys
    import os
    from binascii import b2a_hex

    ###
    ### pdf-miner requirements
    ###
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure, LTImage, LTChar


    def with_pdf(pdf_doc, fn, pdf_pwd, *args):
        """Open the pdf document, and apply the function, returning the results"""