Video is one of the most common content formats of the mobile internet era, playing an increasingly important role in shopping, live streaming, short video, and social media. Text in video is a salient feature and a key carrier of information. Born-digital video, as opposed to natural-scene video, is content that is typically produced and post-processed before distribution; the animations, special effects, and on-screen captions we see every day are prime examples. Text in born-digital video therefore appears more frequently than text in natural-scene video, and is placed far more deliberately.
Video text question answering asks and answers questions about the text that appears in a video.
Traditional intelligent QA tasks usually take text as input and generate an answer. The video setting is far more complex: a video combines images, text, and speech. In this series of projects we will explore this scenario step by step.
Part of the dataset used in this project comes from Track 2 (Video Text Question Answering) of the ICDAR 2023 Competition on Born-Digital Video Text Question Answering. A quick analysis of the sample data shows that the speech in the videos is not only English but also Chinese, Japanese, and other languages, while the questions, subtitles, and on-screen text are all in English; adding speech recognition plus translation would complicate things considerably. Following an easy-things-first approach, we start with pure image OCR.
# Install dependencies
!pip install paddlenlp --upgrade
!pip install paddleocr --upgrade
!pip install paddlespeech --upgrade
import os
import cv2
import numpy as np
from tqdm import tqdm
from pprint import pprint
from paddlenlp import Taskflow
from IPython.display import Video
from paddleocr import PaddleOCR, draw_ocr
Among the solutions PaddleNLP provides, UIE adapts well to the challenges of varied domains, diverse tasks, and scarce data. In particular, uie-x-base is an extractive model for plain-text and document scenarios that supports end-to-end information extraction from Chinese and English documents, images, and tables.
Take the video below as an example. If we sample frames from it and feed the resulting images directly into the **uie-x-base** model for information extraction, then because the knowledge presented in the video is clearly structured, the model can give fairly accurate answers to questions such as "what exactly does a given step contain".
Video('video01-clip.mp4')
# Define the schema for extraction — i.e., the video QA question
schema = ['what is the 3rd step']
ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", ocr_lang="en", schema_lang="en")
[2023-02-05 19:22:16,119] [ INFO] - We are using <class 'paddlenlp.transformers.ernie_layout.tokenizer.ErnieLayoutTokenizer'> to load '/home/aistudio/.paddlenlp/taskflow/information_extraction/uie-x-base'.
src_video = cv2.VideoCapture('video01-clip.mp4')
fps = int(src_video.get(cv2.CAP_PROP_FPS))
total_frame = int(src_video.get(cv2.CAP_PROP_FRAME_COUNT))  # total number of frames
prob = 0
output = ''
for i in tqdm(range(total_frame)):
    success, frame = src_video.read()  # read the next frame from the video
    if i % (fps) == 10:  # sample one frame per second
        if success:
            # save the frame as an image
            cv2.imwrite(str(i) + '.jpg', frame)
            # run UIE document information extraction on the frame
            result = ie({"doc": str(i) + '.jpg'})
            if len(result[0]) > 0:
                # keep only the highest-confidence result seen so far
                if result[0][schema[0]][0]['probability'] > prob:
                    prob = result[0][schema[0]][0]['probability']
                    output = result[0][schema[0]][0]['text']
                    # print the result
                    pprint(result[0][schema[0]][0])
 34%|███▎      | 172/510 [00:01<00:01, 183.01it/s]
{'bbox': [[594, 30, 724, 80]], 'end': 8, 'probability': 0.8937306903884945, 'start': 2, 'text': 'UNPACK'}
 74%|███████▍  | 379/510 [00:02<00:00, 169.88it/s]
{'bbox': [[603, 138, 810, 183]], 'end': 32, 'probability': 0.9051069707893973, 'start': 20, 'text': 'SAFETy CHECK'}
100%|██████████| 510/510 [00:02<00:00, 175.77it/s]
The reference answer for the video question above is:
Q: What is the third step?
A: safety check
# Remove the leftover images
!rm *.jpg
Let's briefly summarize the process above. It consists of the following steps:

1. Sample roughly one frame per second from the input video.
2. Save each sampled frame as an image and feed it into the uie-x-base model, using the question as the extraction schema.
3. Across all sampled frames, keep the extraction result with the highest confidence as the final answer.
Next, we wrap this idea into a video QA function and verify that it works.
def get_video_info(video_path, question):
    # define the extraction schema
    schema = [question]
    ie = Taskflow("information_extraction", schema=schema, model="uie-x-base", ocr_lang="en", schema_lang="en")
    src_video = cv2.VideoCapture(video_path)
    fps = int(src_video.get(cv2.CAP_PROP_FPS))
    total_frame = int(src_video.get(cv2.CAP_PROP_FRAME_COUNT))  # total number of frames
    prob = 0
    output = ''
    pre_frame = 10  # tracks the frame kept as the previous best result
    for i in tqdm(range(total_frame)):
        success, frame = src_video.read()
        if i % (fps) == 10:  # sample one frame per second
            if success:
                cv2.imwrite(str(i) + '.jpg', frame)
                result = ie({"doc": str(i) + '.jpg'})
                if len(result[0]) > 0:
                    if result[0][schema[0]][0]['probability'] > prob:
                        # a new best frame: delete the image of the previous best
                        if os.path.exists(str(pre_frame) + '.jpg'):
                            os.remove(str(pre_frame) + '.jpg')
                        prob = result[0][schema[0]][0]['probability']
                        output = result[0][schema[0]][0]['text']
                        pprint(result[0][schema[0]][0])
                        pre_frame = i
                    else:
                        os.remove(str(i) + '.jpg')
                elif i != 10:
                    os.remove(str(i) + '.jpg')
    return output
# Display the video to be queried
Video('video03-clip.mp4')
The reference answer:
Q: What is the purpose of the red laser sights?
A: Help you accurately aim at the target
get_video_info('video03-clip.mp4', 'What is the purpose of the red laser sights?')
[2023-02-05 22:00:07,586] [    INFO] - We are using <class 'paddlenlp.transformers.ernie_layout.tokenizer.ErnieLayoutTokenizer'> to load '/home/aistudio/.paddlenlp/taskflow/information_extraction/uie-x-base'.
 28%|██▊       | 126/450 [00:03<00:04, 72.93it/s]
{'bbox': [[528, 21, 649, 44], [274, 55, 695, 77]], 'end': 53, 'probability': 0.9742606425060707, 'start': 17, 'text': 'HELP YOUACCURATELY AIM AT THE TARGET'}
 38%|███▊      | 172/450 [00:03<00:02, 101.23it/s]
{'bbox': [[528, 21, 649, 44], [274, 55, 695, 77]], 'end': 53, 'probability': 0.974278300524599, 'start': 17, 'text': 'HELP YOUACCURATELY AIM AT THE TARGET'}
 43%|████▎     | 195/450 [00:04<00:02, 112.72it/s]
{'bbox': [[528, 21, 649, 44], [274, 54, 694, 75]], 'end': 52, 'probability': 0.9762005052161093, 'start': 17, 'text': 'HELP YOUACCURATELY AIMAT THE TARGET'}
100%|██████████| 450/450 [00:05<00:00, 83.42it/s]
'HELP YOUACCURATELY AIMAT THE TARGET'
Looking at the QA extraction result for video03-clip.mp4, we can see that although the recognition is essentially correct, the raw OCR output still needs text correction before it can be used directly (note the missing spaces in 'HELP YOUACCURATELY AIMAT THE TARGET').
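As a minimal sketch of such post-correction, one can normalize case and greedily re-insert the spaces OCR dropped, matching against a known vocabulary. Both the helper function and the word list below are hypothetical illustrations, not part of the pipeline above; a real system would need a much larger dictionary or a statistical word segmenter.

```python
def correct_ocr_text(raw, vocab):
    """Normalize case and re-insert spaces dropped by OCR,
    using a greedy longest-match over a known vocabulary."""
    text = raw.upper()
    words = []
    for chunk in text.split():
        # split each glued chunk into the longest known words it contains
        while chunk:
            for n in range(len(chunk), 0, -1):
                if chunk[:n] in vocab or n == 1:
                    words.append(chunk[:n])
                    chunk = chunk[n:]
                    break
    return ' '.join(words)

vocab = {'HELP', 'YOU', 'ACCURATELY', 'AIM', 'AT', 'THE', 'TARGET'}
print(correct_ocr_text('HELP YOUACCURATELY AIMAT THE TARGET', vocab))
```

Greedy longest-match can mis-segment when one vocabulary word is a prefix of another, which is another reason to treat this only as a baseline cleanup step.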
ERNIE-Layout is the industry-leading multilingual cross-modal document pre-training model open-sourced through PaddleNLP. Built on the ERNIE text foundation model, it fuses text, image, and layout information for cross-modal joint modeling, innovatively introduces layout knowledge enhancement, proposes self-supervised pre-training tasks such as reading-order prediction and fine-grained image-text matching, and upgrades the spatially decoupled attention mechanism, achieving substantial improvements across datasets.
Reference: ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding
ERNIE-Layout can likewise be invoked in one line via Taskflow.
from pprint import pprint
from paddlenlp import Taskflow
docprompt = Taskflow("document_intelligence", lang='en')
pprint(docprompt([{"doc": "217.jpg", "prompt": ["What is the purpose of the red laser sights?"]}]))
[2023-02-05 21:49:22,279] [ INFO] - We are using <class 'paddlenlp.transformers.ernie_layout.tokenizer.ErnieLayoutTokenizer'> to load 'ernie-layoutx-base-uncased'.
[2023-02-05 21:49:22,283] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/vocab.txt
[2023-02-05 21:49:22,285] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/sentencepiece.bpe.model
[2023-02-05 21:49:22,932] [ INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/tokenizer_config.json
[2023-02-05 21:49:23,014] [ INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/special_tokens_map.json
[{'prompt': 'What is the purpose of the red laser sights?',
'result': [{'end': 17,
'prob': 0.97,
'start': 9,
'value': 'ACCURATELY AIM AT THE TARGET'}]}]
def get_docprompt(video_path, question):
    # define the prompt for document intelligence
    schema = [question]
    ie = Taskflow("document_intelligence", lang='en')
    src_video = cv2.VideoCapture(video_path)
    fps = int(src_video.get(cv2.CAP_PROP_FPS))
    total_frame = int(src_video.get(cv2.CAP_PROP_FRAME_COUNT))  # total number of frames
    prob = 0
    output = ''
    pre_frame = 10  # tracks the frame kept as the previous best result
    for i in tqdm(range(total_frame)):
        success, frame = src_video.read()
        if i % (fps) == 10:  # sample one frame per second
            if success:
                cv2.imwrite(str(i) + '.jpg', frame)
                result = ie([{"doc": str(i) + ".jpg", "prompt": schema}])
                if len(result[0]) > 0:
                    if result[0]['result'][0]['prob'] > prob:
                        # a new best frame: delete the image of the previous best
                        if os.path.exists(str(pre_frame) + '.jpg'):
                            os.remove(str(pre_frame) + '.jpg')
                        prob = result[0]['result'][0]['prob']
                        output = result[0]['result'][0]['value']
                        pprint(result[0]['result'][0])
                        pre_frame = i
                    else:
                        os.remove(str(i) + '.jpg')
                elif i != 10:
                    os.remove(str(i) + '.jpg')
    return output
get_docprompt('video01-clip.mp4', 'What is the third step?')
[2023-02-05 21:59:20,521] [    INFO] - We are using <class 'paddlenlp.transformers.ernie_layout.tokenizer.ErnieLayoutTokenizer'> to load 'ernie-layoutx-base-uncased'.
[2023-02-05 21:59:20,525] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/vocab.txt
[2023-02-05 21:59:20,527] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/sentencepiece.bpe.model
[2023-02-05 21:59:21,160] [    INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/tokenizer_config.json
[2023-02-05 21:59:21,163] [    INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/special_tokens_map.json
 45%|████▌     | 231/510 [00:00<00:00, 347.51it/s]
{'end': 11, 'prob': 1.0, 'start': 9, 'value': 'SAFETy CHECK'}
100%|██████████| 510/510 [00:01<00:00, 270.32it/s]
'SAFETy CHECK'
get_docprompt('video03-clip.mp4', "What is the purpose of the red laser sights?")
[2023-02-05 21:57:12,703] [    INFO] - We are using <class 'paddlenlp.transformers.ernie_layout.tokenizer.ErnieLayoutTokenizer'> to load 'ernie-layoutx-base-uncased'.
[2023-02-05 21:57:12,707] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/vocab.txt
[2023-02-05 21:57:12,709] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/sentencepiece.bpe.model
[2023-02-05 21:57:13,338] [    INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/tokenizer_config.json
[2023-02-05 21:57:13,341] [    INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/special_tokens_map.json
  0%|          | 1/450 [00:00<01:04, 6.91it/s]
{'end': 7, 'prob': 0.86, 'start': 5, 'value': 'RANGEFINDER RETICLE'}
  2%|▏         | 11/450 [00:00<00:10, 42.43it/s]
{'end': 7, 'prob': 0.87, 'start': 7, 'value': 'RANGEFINDER'}
  8%|▊         | 34/450 [00:00<00:03, 106.16it/s]
{'end': 7, 'prob': 0.88, 'start': 7, 'value': 'RANGEFINDER'}
 13%|█▎        | 57/450 [00:00<00:02, 141.83it/s]
{'end': 7, 'prob': 0.89, 'start': 7, 'value': 'RANGEFINDER'}
 28%|██▊       | 126/450 [00:00<00:01, 191.73it/s]
{'end': 17, 'prob': 0.96, 'start': 9, 'value': 'ACCURATELY AIM AT THE TARGET'}
 33%|███▎      | 149/450 [00:00<00:01, 194.31it/s]
{'end': 17, 'prob': 0.97, 'start': 9, 'value': 'ACCURATELY AIM AT THE TARGET'}
100%|██████████| 450/450 [00:02<00:00, 212.87it/s]
'ACCURATELY AIM AT THE TARGET'
Although ERNIE-Layout and the information-extraction approach give much the same answers on video01-clip.mp4 and video03-clip.mp4, and the answer for video03-clip.mp4 even falls slightly short of the reference answer, readers who compare the results on video02-clip.mp4 and video07-clip.mp4 below will find that ERNIE-Layout is clearly stronger at genuine context understanding.
# Display the video to be queried
Video('video02-clip.mp4')
Q1: How many bolts are there?
A1: 8
get_docprompt('video02-clip.mp4', "How many bolts are there?")
[2023-02-05 22:08:41,260] [    INFO] - We are using <class 'paddlenlp.transformers.ernie_layout.tokenizer.ErnieLayoutTokenizer'> to load 'ernie-layoutx-base-uncased'.
[2023-02-05 22:08:41,265] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/vocab.txt
[2023-02-05 22:08:41,267] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/sentencepiece.bpe.model
[2023-02-05 22:08:41,890] [    INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/tokenizer_config.json
[2023-02-05 22:08:41,893] [    INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/special_tokens_map.json
  2%|▏         | 11/537 [00:00<00:10, 49.61it/s]
{'end': 0, 'prob': 0.98, 'start': 0, 'value': '8SOLIDSTEELLOCKINGBOLTS'}
 23%|██▎       | 126/537 [00:00<00:02, 192.12it/s]
{'end': 0, 'prob': 0.99, 'start': 0, 'value': '8SOLIDSTEELLOCKING'}
100%|██████████| 537/537 [00:02<00:00, 201.76it/s]
'8SOLIDSTEELLOCKING'
Some videos come with subtitles, and the answer appears only in the subtitles, while text elsewhere in the frame causes serious interference. In this case, restricting the region read from each frame to the subtitle band can substantially improve answer accuracy.
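The crop used in the function below relies on NumPy row slicing: for a frame of shape (H, W, 3), `frame[-180:-30]` keeps the horizontal band from 180 down to 30 pixels above the bottom edge, which is where subtitles typically sit. The band limits (and the dummy 720p frame here) are assumptions to tune per video:

```python
import numpy as np

# a dummy 720x1280 BGR frame standing in for a real decoded video frame
frame = np.zeros((720, 1280, 3), dtype=np.uint8)

# keep only the subtitle band: rows H-180 up to (but not including) H-30
subtitle_band = frame[-180:-30]
print(subtitle_band.shape)  # (150, 1280, 3)
```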
def get_docprompt_v2(video_path, question):
    # define the prompt for document intelligence
    schema = [question]
    ie = Taskflow("document_intelligence", lang='en')
    src_video = cv2.VideoCapture(video_path)
    fps = int(src_video.get(cv2.CAP_PROP_FPS))
    total_frame = int(src_video.get(cv2.CAP_PROP_FRAME_COUNT))  # total number of frames
    prob = 0
    output = ''
    pre_frame = 10  # tracks the frame kept as the previous best result
    for i in tqdm(range(total_frame)):
        success, frame = src_video.read()
        if i % (fps) == 10:  # sample one frame per second
            if success:
                # restrict the frame to the subtitle band only
                cv2.imwrite(str(i) + '.jpg', frame[-180:-30:])
                result = ie([{"doc": str(i) + ".jpg", "prompt": schema}])
                if len(result[0]) > 0:
                    if result[0]['result'][0]['prob'] > prob:
                        # a new best frame: delete the image of the previous best
                        if os.path.exists(str(pre_frame) + '.jpg'):
                            os.remove(str(pre_frame) + '.jpg')
                        prob = result[0]['result'][0]['prob']
                        output = result[0]['result'][0]['value']
                        pprint(result[0]['result'][0])
                        pre_frame = i
                    else:
                        os.remove(str(i) + '.jpg')
                elif i != 10:
                    os.remove(str(i) + '.jpg')
    return output
# Display the video to be queried
Video('video07-clip.mp4')
Q: What does Treasure Race mean?
A: The hunt for the treasure of Gold Roger.
get_docprompt_v2('video07-clip.mp4', "What does Treasure Race mean?")
[2023-02-05 22:23:35,360] [    INFO] - We are using <class 'paddlenlp.transformers.ernie_layout.tokenizer.ErnieLayoutTokenizer'> to load 'ernie-layoutx-base-uncased'.
[2023-02-05 22:23:35,364] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/vocab.txt
[2023-02-05 22:23:35,366] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/sentencepiece.bpe.model
[2023-02-05 22:23:36,011] [    INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/tokenizer_config.json
[2023-02-05 22:23:36,013] [    INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/ernie-layoutx-base-uncased/special_tokens_map.json
 22%|██▏       | 133/597 [00:00<00:01, 255.63it/s]
{'end': 13, 'prob': 0.95, 'start': 0, 'value': 'The hunt for the treasureof Gold Roger!'}
100%|██████████| 597/597 [00:01<00:00, 305.63it/s]
'The hunt for the treasureof Gold Roger!'
An even more complex situation arises when answering the question requires combining the temporal order in which on-screen text or subtitles appear.
# Display the video to be queried
Video('video05-clip.mp4')
Q: What is the first step to do a fast healing?
A: Clean the cut or scrape.
In the video above, the question asks about the steps of fast healing, but in the frames the steps appear only as text, without the 1, 2, 3, 4… numbering of the first video. Document extraction or ERNIE-Layout is helpless here: no single image contains enough information to answer the question. Instead, we need to concatenate all the recognized text, producing temporally ordered textual information.
ocr = PaddleOCR(use_angle_cls=False, lang="en")
similarity = Taskflow(task="text_similarity", mode="fast", max_seq_len=16, lang="en")
src_video = cv2.VideoCapture('video05-clip.mp4')
fps = int(src_video.get(cv2.CAP_PROP_FPS))
total_frame = int(src_video.get(cv2.CAP_PROP_FRAME_COUNT))  # total number of frames
save_text0 = []
for i in tqdm(range(total_frame)):
    success, frame = src_video.read()
    if i % (fps) == 10:  # sample one frame per second
        line_text = []
        if success:
            # exclude distracting text by cropping to the relevant band
            result = ocr.ocr(frame[30:180:], cls=True)
            for idx in range(len(result)):
                res = result[idx]
                for line in res:
                    if len(line[1][0]) > 1:
                        line_text.append(line[1][0])
            line_res = ' '.join(line_text)
            save_text0.append(line_res)
save_text = []
for i in save_text0:
    if i != '':
        save_text.append(i)
# de-duplicate the results while preserving first-occurrence order
final_text = list(set(save_text))
final_text.sort(key=save_text.index)
final_text = ','.join(final_text)
final_text
'3 Steps to Fast Healing,Clean the cut or scrape,Treat the Wound with a topical antibiotic,Cover the cut or scrape'
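The set-then-sort de-duplication above can be written more directly with `dict.fromkeys`, since dict keys are unique and preserve insertion order in Python 3.7+ (the sample list below is an illustrative stand-in for the real `save_text`):

```python
save_text = ['3 Steps to Fast Healing',
             'Clean the cut or scrape',
             'Clean the cut or scrape',
             'Treat the Wound with a topical antibiotic',
             'Cover the cut or scrape']

# dict keys are unique and ordered, so this de-duplicates
# while keeping each entry at its first occurrence
final_text = ','.join(dict.fromkeys(save_text))
print(final_text)
```

Unlike `sort(key=save_text.index)`, this variant is linear-time and avoids repeated `list.index` scans.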
With the work above complete, we can finally run QA over the final_text string. However, the pretrained information-extraction model still produces no result on the raw text; a small tweak of adding ordinal keywords ('first', 'second', 'then') makes it work. Once the competition's training set is released and the model is fine-tuned on it, recognition is bound to improve considerably.
# Define the extraction schema
schema = ['What is the first step to do a healing?']
ie = Taskflow("information_extraction", schema=schema, model="uie-x-base")
ie('3 Steps to Fast Healing, first Clean the cut or scrape,second Treat the Wound with a topical antibiotic,then Cover the cut or scrape')
[2023-02-05 23:13:12,541] [ INFO] - We are using <class 'paddlenlp.transformers.ernie_layout.tokenizer.ErnieLayoutTokenizer'> to load '/home/aistudio/.paddlenlp/taskflow/information_extraction/uie-x-base'.
[{'What is the first step to do a healing?': [{'text': 'Clean the cut or scrape',
'start': 31,
'end': 54,
'probability': 0.9410384130303413}]}]
In this project, we used the sample data from Track 2 (Video Text Question Answering) of the ICDAR 2023 Born-Digital Video Text QA competition and the pretrained models provided by PaddleNLP to complete an initial exploration of video QA. The experiments show that although the video QA scenario is quite complex, with the right processing, the cross-modal QA capabilities of UIE and ERNIE-Layout are remarkably strong. We have good reason to believe that fine-tuning UIE and ERNIE-Layout will push the results further still.
Of course, as the analysis above also shows, video text QA is a genuinely complex scenario. From a competition standpoint, how to classify the different types of video data and process each with the best-suited method may well be decisive for the final ranking.