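[Dataset Analysis] NYT-Wiki Relation Extraction Dataset Analysis (1): Understanding a Single Instance
[Dataset Analysis] NYT-Wiki Relation Extraction Dataset Analysis (2): Counting Classes and Instances
[Dataset Analysis] NYT-Wiki Relation Extraction Dataset Analysis (3): Plotting the Relation Distribution
I recently got hold of a relation extraction dataset, nyt-wiki, and analyzed its distributions, overlaps, and so on. This post shares the analysis approach and the code.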
['/m/0124gn1g', '/m/02lx2r', 'trick', 'album', 'instance_of', 'utah saints pulls off a similar trick on its hit single, "something good," the opening track on its eponymous debut album (london/plg 828 374-2; cd and cassette).', '###END###']
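As shown, a single instance is laid out as:
[head entity id, tail entity id, head entity, tail entity, relation name, sentence, end marker]
That is seven parts in total, of which we need the first six.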
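As a minimal sketch of this layout, the six useful fields can be given names so that later code does not depend on list indices. It assumes the raw file is tab-separated, as the conversion script in this post does. RawInstance and parse_line are hypothetical names introduced only for illustration, and the sentence is shortened for brevity.

```python
from typing import NamedTuple


class RawInstance(NamedTuple):
    """The six useful fields of one raw line (hypothetical helper type)."""
    head_id: str
    tail_id: str
    head: str
    tail: str
    relation: str
    sentence: str


def parse_line(line: str) -> RawInstance:
    """Split one tab-separated line and keep only the first six fields."""
    fields = line.strip().split("\t")
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}")
    # The seventh field is the end marker, which is dropped here.
    return RawInstance(*fields[:6])


# Example line; the sentence is a shortened version of the one above.
raw = "\t".join([
    "/m/0124gn1g", "/m/02lx2r", "trick", "album", "instance_of",
    "utah saints pulls off a similar trick on its hit single.",
    "###END###",
])
inst = parse_line(raw)
print(inst.relation, inst.head, inst.tail)
```

Named fields serve the same purpose as the dict conversion described next: they take away the need to remember which index holds which value.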
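It is hard to remember which list index holds which value, so we convert each instance from a list to a dict, which makes the data much easier to access. The dict has the following keys: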
{"text": , "relation": , "h": {"id": , "name": , "pos": }, "t": {"id": , "name": , "pos": }}
With this layout, instance["text"] retrieves the sentence directly, which is very convenient. After converting the list to a dict, an instance looks like this:
{"text": "utah saints pulls off a similar trick on its hit single, \"something good,\" the opening track on its eponymous debut album (london/plg 828 374-2; cd and cassette).", "relation": "instance_of", "h": {"id": "/m/0124gn1g", "name": "trick", "pos": [32, 37]}, "t": {"id": "/m/02lx2r", "name": "album", "pos": [116, 121]}}
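The pos values are character offsets [start, end) into text, so slicing the sentence with them should give back the entity name. A small sanity check along these lines, using the instance above (spans_match is a hypothetical helper name, not part of the dataset tooling):

```python
def spans_match(instance: dict) -> bool:
    """Return True if both pos spans slice the entity names out of the text."""
    text = instance["text"]
    for key in ("h", "t"):
        start, end = instance[key]["pos"]
        if text[start:end] != instance[key]["name"]:
            return False
    return True


instance = {
    "text": 'utah saints pulls off a similar trick on its hit single, '
            '"something good," the opening track on its eponymous debut '
            'album (london/plg 828 374-2; cd and cassette).',
    "relation": "instance_of",
    "h": {"id": "/m/0124gn1g", "name": "trick", "pos": [32, 37]},
    "t": {"id": "/m/02lx2r", "name": "album", "pos": [116, 121]},
}
print(spans_match(instance))
```

Running a check like this over the whole converted file is a cheap way to catch offsets that drifted during conversion.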
NOTE:
The values of h and t are themselves dicts, each with three parts: {id, name, pos}. A dict converts to and from JSON, which keeps storing and loading the data well-structured.

```python
import json

# Convert the tab-separated train.txt into one JSON object per line.
with open("train.txt", "r", encoding="utf-8") as f, \
        open("nytwiki_train.txt", "w", encoding="utf-8") as f_op:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        fields = line.strip().split("\t")
        h_id, t_id, h_name, t_name, relation, text = fields[:6]
        # Character span [start, end) of the head (tail) entity in the sentence.
        # str.index returns the first occurrence and raises ValueError if absent.
        h_start = text.index(h_name)
        t_start = text.index(t_name)
        instance = {
            "text": text,
            "relation": relation,
            "h": {"id": h_id, "name": h_name, "pos": [h_start, h_start + len(h_name)]},
            "t": {"id": t_id, "name": t_name, "pos": [t_start, t_start + len(t_name)]},
        }
        # json.dump writes to a file object, json.dumps returns a string;
        # likewise json.load reads a file object, json.loads parses a string.
        json.dump(instance, f_op)
        f_op.write("\n")
```
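Since the script writes one JSON object per line, the file can be read back by applying json.loads to each line. A sketch assuming that layout: read_instances is a hypothetical helper, and the demo writes a temporary file with a made-up, shortened instance rather than touching nytwiki_train.txt.

```python
import json
import os
import tempfile


def read_instances(path: str) -> list:
    """Load a file with one JSON object per line into a list of dicts."""
    instances = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # ignore blank lines
                instances.append(json.loads(line))
    return instances


# Demo instance with a shortened sentence (illustrative only).
demo = {
    "text": "a similar trick on its debut album",
    "relation": "instance_of",
    "h": {"id": "/m/0124gn1g", "name": "trick", "pos": [10, 15]},
    "t": {"id": "/m/02lx2r", "name": "album", "pos": [29, 34]},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(demo, f)
        f.write("\n")
    loaded = read_instances(path)

print(loaded[0]["h"]["name"], loaded[0]["h"]["pos"])
```

Reading line by line keeps each instance independent, so one malformed line can be located and handled on its own.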