Python Automation Series: Extracting Text and Images from PDFs (Installing pdfplumber for Python 3)

Author: 二进制舞者 | 2024-01-31 19:18:46



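Python has many open-source libraries for working with PDF documents. The most commonly used one, PyPDF2, can read, merge, and split PDF documents, but it has a limitation:

It cannot reliably extract the text in a document

Extracting text from a PDF calls for a different library, such as pdfplumber
Extracting images from a PDF calls for the fitz library (installed as the PyMuPDF package)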
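
For the image side, a minimal sketch with fitz might look like the following. It assumes a reasonably recent PyMuPDF release in which `Page.get_images` and `Document.extract_image` are available; the helper names `image_filename` and `extract_images`, and the `pageN_imgM` naming scheme, are illustrative and not part of the library.

```python
from pathlib import Path


def image_filename(page_number, image_number, ext):
    """Build an output name such as 'page3_img1.png' (illustrative naming scheme)."""
    return f"page{page_number}_img{image_number}.{ext}"


def extract_images(pdf_path, out_dir):
    """Save every embedded image in the PDF to out_dir and return the saved paths."""
    import fitz  # provided by the PyMuPDF package (pip install PyMuPDF)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with fitz.open(pdf_path) as doc:
        for page_index in range(len(doc)):
            page = doc[page_index]
            # Each entry describes one image; its first item is the image xref
            for image_index, info in enumerate(page.get_images(full=True), start=1):
                image = doc.extract_image(info[0])  # dict with raw bytes and extension
                target = out / image_filename(page_index + 1, image_index, image["ext"])
                target.write_bytes(image["image"])
                saved.append(target)
    return saved
```

Calling `extract_images('H:/test_w.pdf', 'H:/images')` would write one file per embedded image; the output directory here is only an example.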

Extracting text with pdfplumber

pdfplumber is an open-source tool for parsing PDF files and extracting their text content, headings, tables, and more.
Source code: https://github.com/jsvine/pdfplumber
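
As a rough illustration of the text-extraction use case, the sketch below collects the text of every page into one string. It assumes pdfplumber has been installed with pip; `join_page_texts` and `extract_all_text` are hypothetical helper names. Pages without extractable text may yield `None` or an empty string depending on the pdfplumber version, so both are skipped.

```python
def join_page_texts(page_texts, separator="\n\n"):
    """Join per-page text, skipping pages that yielded None or only whitespace."""
    return separator.join(text.strip() for text in page_texts if text and text.strip())


def extract_all_text(pdf_path):
    """Return the text of every page in the PDF as a single string."""
    import pdfplumber  # third-party; imported here so the helper above works without it

    with pdfplumber.open(pdf_path) as pdf:
        # The generator is consumed inside the with-block, while the file is still open
        return join_page_texts(page.extract_text() for page in pdf.pages)
```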

Install pdfplumber:

pip install pdfplumber

Import it:

import pdfplumber
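
If the import fails, the package is most likely missing from the active environment. One way to check before importing is sketched below using only the standard library; `is_installed` is a hypothetical helper name.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("pdfplumber"):
    print("pdfplumber is not installed; run: pip install pdfplumber")
```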

A simple usage example:

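filepath = 'H:/test_w.pdf'

def extract_text_info(filepath):
    """
    Extract the text and tables from one page of a PDF.
    @param filepath: path to the PDF file
    @return: None (the results are printed)
    """
    with pdfplumber.open(filepath) as pdf:
        # pdf.pages is zero-indexed, so index 3 is the fourth page
        page = pdf.pages[3]
        print(page.extract_text())  # extract the text
        tables = page.extract_tables()  # extract every table on the page
        print(tables)
        for table in tables:
            # The original listing is cut off at this loop;
            # printing each row is an assumed body
            for row in table:
                print(row)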
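
Each table returned by `extract_tables()` is a list of rows, and each row is a list of cell strings in which empty cells may be `None`. The sketch below shows one possible way to tidy such a table and turn it into CSV text. The helper names and the sample data are made up for illustration; they are not output from a real PDF.

```python
import csv
import io


def clean_row(row):
    """Replace None cells with '' and strip surrounding whitespace from the rest."""
    return ["" if cell is None else str(cell).strip() for cell in row]


def table_to_csv(table):
    """Serialise one table (a list of rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(clean_row(row) for row in table)
    return buffer.getvalue()


# Hand-written sample in the same shape as one entry of page.extract_tables()
sample_table = [["Name", "Qty"], ["Widget ", None], ["Gadget", "3"]]
print(table_to_csv(sample_table))
```

Going through the `csv` module rather than joining cells with commas by hand keeps cells that themselves contain commas or quotes correctly escaped.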
