GPT-4o Is Live | Using It to Extract Structured Information from Images

Author: 小桥流水78 | 2024-08-09 17:33:19

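OpenAI recently released GPT-4o, billed as its best AI model to date, at half the price of GPT-4 Turbo. The new model offers real-time multimodal capabilities across text, vision, and audio. According to OpenAI, it matches the intelligence of GPT-4 Turbo while being more efficient: latency is lower, text generation is about 2x faster, and, importantly, it costs half as much as GPT-4 Turbo.

Motivation

If you need to analyze images to collect structured information, you have come to the right place. In this article, we will quickly walk through how to use GPT-4o, OpenAI's newest model at the time of writing, to extract structured information from images.

Workflow

First, import the relevant libraries: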


import base64  # encode image bytes as text
import json  # parse the model's JSON reply
import os  # directory listing and path handling

import pandas as pd  # tabular output
import openai
from openai import OpenAI  # API client

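In this article, I will use the following image for the demonstration. You can apply the same approach to your own images and ask any question you like:

Figure: the image used for information extraction

Step 1: Load and encode the image

There are two main ways to provide an image to the model: pass a link to the image, or pass the base64-encoded image directly in the request. Base64 is an encoding scheme that represents binary data, such as an image file, as a plain ASCII string. You can pass images in user, system, and assistant messages.

Since my image is stored locally, let's encode the local image as a base64 data URL. The following function reads the image file, guesses its MIME type from the file extension, and encodes it as a base64 data URL suitable for sending to the API: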
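
To make the two options concrete, here is a minimal sketch of the user message for each one. The helper names `image_part_from_url`, `image_part_from_bytes`, and `build_user_message`, as well as the example link, are hypothetical and exist only for illustration; the `image_url` content-part shape is the same one used in the request later in this article.

```python
import base64


def image_part_from_url(url: str) -> dict:
    """Content part that points the model at an image link."""
    return {"type": "image_url", "image_url": {"url": url}}


def image_part_from_bytes(data: bytes, mime_type: str = "image/png") -> dict:
    """Content part that embeds the image itself as a base64 data URL."""
    encoded = base64.b64encode(data).decode("utf-8")
    data_url = f"data:{mime_type};base64,{encoded}"
    return {"type": "image_url", "image_url": {"url": data_url}}


def build_user_message(question: str, image_part: dict) -> dict:
    """Combine a text question and one image part into a user message."""
    return {
        "role": "user",
        "content": [{"type": "text", "text": question}, image_part],
    }


# Option 1: pass a link (placeholder URL, assumed to be publicly reachable)
by_link = build_user_message(
    "What is on this menu?",
    image_part_from_url("https://example.com/menu.png"),
)

# Option 2: pass the image bytes inline (placeholder bytes, not a real image)
inline = build_user_message(
    "What is on this menu?",
    image_part_from_bytes(b"placeholder image bytes"),
)
```

Both messages have the same structure; only the value of `url` differs, which is why the rest of the pipeline does not need to know where the image came from.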


import base64
from mimetypes import guess_type


# Function to encode a local image into a data URL
def local_image_to_data_url(image_path):
    # Guess the MIME type of the image based on the file extension
    mime_type, _ = guess_type(image_path)
    if mime_type is None:
        mime_type = "application/octet-stream"  # Default MIME type if none is found

    # Read and encode the image file
    with open(image_path, "rb") as image_file:
        base64_encoded_data = base64.b64encode(image_file.read()).decode("utf-8")

    # Construct the data URL
    return f"data:{mime_type};base64,{base64_encoded_data}"
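
As a sanity check, a data URL of this form can be decoded back into the original bytes. The sketch below is not part of the article's pipeline; `parse_data_url` is a hypothetical helper, and the byte string stands in for real image data.

```python
import base64


def parse_data_url(data_url: str) -> tuple[str, bytes]:
    """Split a base64 data URL into (mime_type, raw_bytes)."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


# Build a data URL the same way the encoder does, then reverse it
original = b"fake image bytes"
url = f"data:image/jpeg;base64,{base64.b64encode(original).decode('utf-8')}"
mime_type, decoded = parse_data_url(url)
print(mime_type, decoded == original)  # image/jpeg True
```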

Step 2: Set up the API client and model

To build the pipeline, first set up the OpenAI API client with your API key:


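# Read the key from the OPENAI_API_KEY environment variable if it is set;
# otherwise fall back to a placeholder that you must replace with your own key
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "your api key"))
model = "gpt-4o"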

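This step initializes the OpenAI client so that we can interact with the vision-capable GPT-4o model. We are now ready to run the pipeline.

Step 3: Run the GPT-4o pipeline to process the image and get a response

In this code, we iterate over the images in a given directory, encode each one, and send a request to the GPT-4o model. The model is instructed to analyze the image and extract structured information, and it should reply in JSON format. Note that the `[:1]` slice limits the loop to the first file in the directory.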


# Define the base directory for the images, encode each image as a base64
# data URL, and ask the model to extract structured information.
base_dir = "data/"
responses = []

SYSTEM_PROMPT = """
You are `gpt-4o`, the latest OpenAI model that can interpret images and can
describe images provided by the user in detail. The user has attached an image
to this message for you to answer a question, there is definitely an image
attached, you will never reply saying that you cannot see the image because
the image is absolutely and always attached to this message. Answer the
question asked by the user based on the image provided. Do not give any
further explanation. Do not reply saying you can't answer the question. The
answer has to be in a JSON format. If the image provided does not contain the
necessary data to answer the question, return 'null' for that key in the JSON
to ensure consistent JSON structure.
"""

# The keys below match the task (drinks), not appetizer dishes
USER_PROMPT = """
You are tasked with accurately interpreting detailed charts and text from the
images provided. You will focus on extracting the price for all the DRINKS
from the menu.
Guidelines:
- Include all the drinks in the menu
Response Format:
The output must be in JSON format, with the following structure and keys
strictly adhered to:
- drink: the name of the drink
- price: the price of the drink
- currency: the currency
"""

path = os.path.join(base_dir, "figures")
if os.path.isdir(path):  # Ensure it's a directory
    # Sort for a stable order; [:1] keeps the demo to the first file only
    for image_file in sorted(os.listdir(path))[:1]:
        image_path = os.path.join(path, image_file)
        try:
            print(f"processing image: {image_path}")
            data_url = local_image_to_data_url(image_path)
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": USER_PROMPT},
                            {"type": "image_url", "image_url": {"url": data_url}},
                        ],
                    },
                ],
                max_tokens=3000,
            )
            content = response.choices[0].message.content
            responses.append({"image": image_file, "response": content})
        except Exception as e:
            print(f"error processing image {image_path}: {e}")
responses
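
One caveat with a loop over `os.listdir`: it returns every entry in the folder, not only images. A minimal sketch of a filter follows, assuming a hypothetical helper named `list_image_files` that keeps only regular files whose extension maps to an `image/*` MIME type. The file names in the demo are made up.

```python
import os
import tempfile
from mimetypes import guess_type


def list_image_files(directory: str) -> list[str]:
    """Return the sorted names of files in `directory` that look like images."""
    names = []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        mime_type, _ = guess_type(full_path)
        is_image = mime_type is not None and mime_type.startswith("image/")
        if os.path.isfile(full_path) and is_image:
            names.append(name)
    return names


# Demo on a throwaway directory containing two images and one text file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("menu.png", "notes.txt", "photo.jpg"):
        open(os.path.join(tmp, name), "wb").close()
    found = list_image_files(tmp)

print(found)  # ['menu.png', 'photo.jpg']
```

The check is based on the file extension only, the same heuristic `guess_type` applies when building the data URL, so it does not verify that the file contents are a valid image.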

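The model's reply, stored in `responses`, comes back as JSON in the structure requested by the prompt, and its contents match the reference image.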


```json
[
    {
        "drink": "Purified Water",
        "price": 3.99,
        "currency": "$"
    },
    {
        "drink": "Sparkling Water",
        "price": 3.99,
        "currency": "$"
    },
    {
        "drink": "Soda In A Bottle",
        "price": 4.50,
        "currency": "$"
    },
    {
        "drink": "Orange Juice",
        "price": 6.00,
        "currency": "$"
    },
    {
        "drink": "Fresh Lemonade",
        "price": 7.50,
        "currency": "$"
    }
]
```
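
Because the model is only asked, not forced, to follow this structure, it can be worth checking the reply before using it. The following is a rough sketch under the assumption that every record should carry the three keys shown above; `validate_records` and `REQUIRED_KEYS` are hypothetical names, and a `null` price is allowed because the system prompt tells the model to return `null` for missing data.

```python
import json

REQUIRED_KEYS = {"drink", "price", "currency"}


def validate_records(raw_json: str) -> list[dict]:
    """Parse a JSON array of records and check keys and price types."""
    records = json.loads(raw_json)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"record {i} is not a JSON object")
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
        price = record["price"]
        if price is not None and not isinstance(price, (int, float)):
            raise ValueError(f"record {i} has a non-numeric price: {price!r}")
    return records


sample = '[{"drink": "Orange Juice", "price": 6.00, "currency": "$"}]'
records = validate_records(sample)
print(records[0]["drink"], records[0]["price"])  # Orange Juice 6.0
```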

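Step 4: Convert the JSON output to a DataFrame

You can also parse the JSON output into a pandas DataFrame to get a structured, tabular format, using the following code: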


def format_output_qa(output, debug=False):
    """Parse the model's JSON reply into a DataFrame (or a one-row error DataFrame)."""
    print(f"Raw model output: {output}")
    try:
        # Clean up the JSON output: drop newlines and Markdown code-fence markers
        output_text = output.replace("\n", "")
        output_text = output_text.replace("```json", "")
        output_text = output_text.replace("```", "")
        if debug:
            # In debug mode, return the cleaned string instead of a DataFrame
            return output_text
        # Load it into a Python object (a list of dicts for this prompt)
        output_dict = json.loads(output_text)
        # Create a DataFrame
        df = pd.DataFrame(output_dict)
    except Exception as e:
        print(f"Error processing output: {e}")
        df = pd.DataFrame({"error": str(e)}, index=[0])
    return df


# Now process each response in the list
frames = []
for response_dict in responses:
    df = format_output_qa(response_dict["response"])
    df["image"] = response_dict["image"]
    frames.append(df)

# Concatenate once at the end instead of growing a DataFrame inside the loop
df_output = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
df_output

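The result is a DataFrame with one row per extracted drink and the columns `drink`, `price`, `currency`, and `image`.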
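
The clean-up in `format_output_qa` removes every newline and every fence marker in the reply, which works when the model returns nothing but a fenced JSON block. A slightly more tolerant variant is sketched below: it pulls out only the fenced payload when a fence is present and otherwise parses the reply as-is. `extract_json` is a hypothetical helper, not the author's method, and the sample reply is made up.

```python
import json
import re

FENCE = "`" * 3  # the three-backtick Markdown fence marker

# Match an optional "json" language tag, then capture everything up to the closing fence
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(reply: str):
    """Parse JSON from a reply that may or may not be wrapped in a code fence."""
    match = FENCED_BLOCK.search(reply)
    payload = match.group(1) if match else reply
    return json.loads(payload.strip())


fenced_reply = (
    "Here is the data:\n"
    + FENCE + "json\n"
    + '[{"drink": "Tea", "price": 2.5, "currency": "$"}]\n'
    + FENCE
)
print(extract_json(fenced_reply))
print(extract_json('{"drink": "Tea"}'))
```

The parsed list can be passed straight to `pd.DataFrame`, exactly as in `format_output_qa`.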

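Conclusion

This is a simple yet powerful pipeline that uses GPT-4o's vision capabilities to extract data from images, turning unstructured visual data into structured data for further analysis. Whether you are collecting visual content or analyzing it, these steps provide a solid foundation for integrating image processing into your AI workflow.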
