The Terminology-Dictionary-Intervened Machine Translation Challenge uses English as the source language and Chinese as the target language. In addition to English-Chinese bilingual data, the competition provides an English-Chinese terminology dictionary. Participating teams need to build and train a machine translation model on the provided training data, and then, using the test set together with the terminology dictionary, submit their final translations. Task description and data:
https://challenge.xfyun.cn/topic/info?type=machine-translation-2024&option=tjjg&ch=dw24_AtTCK9
Task2: Getting started with deep learning through a detailed walkthrough of the baseline code - Feishu Docs (feishu.cn)
Notes for getting the baseline to run:
When installing spaCy, pay attention to which model file you download:
en_core_web_trf is a fairly large file (436 MB); the official download link is below (install it via pip):
English · spaCy Models Documentation
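For reference, a quick way to check that the pipeline was installed correctly (a minimal sketch; `python -m spacy download` lets pip resolve the right wheel for your spaCy version, or you can install the wheel downloaded from the page above):

```python
# Quick installation check for the en_core_web_trf pipeline.
# Install it first in a shell, e.g.:
#   python -m spacy download en_core_web_trf
# or: pip install <path to the en_core_web_trf wheel downloaded from the page above>
import spacy

nlp = spacy.load("en_core_web_trf")    # raises OSError if the model is not installed
print(nlp("There's a dog.")[0].text)   # spaCy splits "There's" -> "There", "'s"; prints "There"
```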
jieba is used to tokenize the Chinese text and spaCy to tokenize the English text. The vocabularies also include several special tokens: <PAD> marks padding; <SOS> (Sequence Start) and <EOS> (Sequence End) help the model recognize where a sequence begins and ends; and <UNK> (Unknown) lets the model handle words it has never seen. Thanks to the experts in the study group for sharing~
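A rough sketch of how these pieces can fit together; the helper names and the vocabulary construction below are illustrative and may differ from the baseline's exact code:

```python
# Illustrative tokenizers and special tokens, assuming jieba and spaCy are installed.
import jieba
import spacy

nlp_en = spacy.load("en_core_web_trf")

def en_tokenizer(text):
    # spaCy tokenization for English (tokenizer only, no tagging/parsing)
    return [tok.text for tok in nlp_en.tokenizer(text)]

def zh_tokenizer(text):
    # jieba word segmentation for Chinese
    return list(jieba.cut(text))

# Special tokens reserved at the start of both vocabularies
SPECIAL_TOKENS = ['<PAD>', '<SOS>', '<EOS>', '<UNK>']
PAD_IDX, SOS_IDX, EOS_IDX, UNK_IDX = 0, 1, 2, 3

def build_vocab_from_pairs(token_pairs):
    """Toy vocab builder: map every seen token to an index after the special tokens."""
    en_vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    zh_vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for en_tokens, zh_tokens in token_pairs:
        for tok in en_tokens:
            en_vocab.setdefault(tok, len(en_vocab))
        for tok in zh_tokens:
            zh_vocab.setdefault(tok, len(zh_vocab))
    return en_vocab, zh_vocab
```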
# Data cleaning
from typing import List, Tuple
import re
import unicodedata

import contractions

def unicodeToAscii(text):
    # Strip accents by decomposing characters and dropping combining marks
    return ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')

def preprocess_en(text):
    text = unicodeToAscii(text.strip())
    text = contractions.fix(text)                   # expand contractions: there's -> there is
    text = re.sub(r'\([^)]*\)', '', text)           # drop parenthesized asides
    text = re.sub(r"[^a-zA-Z0-9.!?]+", r" ", text)  # keep letters, digits and .!? only
    return text

def preprocess_zh(text):
    # Remove parenthesized stage directions (laughter, applause, etc.) so they
    # do not end up in the training targets
    patterns_to_replace = [
        "(笑声)", "(掌声)", "(口哨声)", "口哨声)", "(音乐)", "(鼓掌)", "(笑)", "(众笑)",
        "(视频):", "(大笑)", "(录音)", "(消音)", "(欢呼)", "(视频)", "(叫声)", "(录像):",
        "(录像)", "(拍手)", "(大喊)", "(吟唱)", "(噪音)", "(铃声)", "(尖叫)", "(影片)",
        "(声音)", "(喇叭)", "(齐唱)", "(混音)", "(音频)", "(影视)", "(噪声)", "(口哨)",
        "(击掌)", "(铃铛)", "(小号)", "(歌声)", "(狂笑)", "(演唱)", "(喝彩)", "(配乐)",
        "(调音)", "(笑话)", "(叹气)", "(鸟鸣)", "(鸟鸣)", "(爆炸)", "(枪声)", "(爆笑)",
        "(滑音)", "(音调)", "(游戏)", "(笑)", "(淫笑)", "(音译)", "(笑♫)", "(音乐)",
        "(咳嗽)", "(咳嗽)", "(马嘶声)", "(音乐声)", "(鼓掌声)", "(众人笑)", "(喇叭声)", "(钢琴声)",
        "(吹口哨)", "(尖叫声)", "(大家笑)", "(重击声)", "(呼吸声)", "(感叹声)", "(敲打声)", "(背景音)",
        "(噼啪声)", "(观众笑)", "(爆炸声)", "(歌词:)", "(敲椅声)", "(滋滋声)", "(静电声)", "(笑~~)",
        "(喝彩声)", "(抨击声)", "(咳嗽声)", "(喊叫声)", "(风雨声)", "(哭泣声)", "(大笑声)", "(欢呼声)",
        "(嘀嘀声)", "(闹铃声)", "(拍手声)", "(讨论声)", "(鼓掌♫)", "(喘息声)", "(打呼声)", "(惊叫声)",
        "(议论声)", "(音乐起)", "(小提琴)", "(拍巴掌)", "(众鼓掌)", "(众人鼓掌)", "(众人欢呼)", "(观众笑声)",
        "(观众掌声)", "(热烈鼓掌)", "(哄堂大笑)", "(警报噪声)", "(掌声♫♪)", "(按喇叭声)", "(众人大笑)", "(现场笑声)",
        "(限频音乐)", "(音乐响起)", "(掌声。 )", "(观众鼓掌)", "(电话铃声)", "(又是狂笑)", "(电话铃响)", "(音乐和声)",
        "(笑声,掌声)", "(频率的声音)", "(众笑+鼓掌)", "(相机快门声)", "(音乐录影带)", "(诺基亚铃声)", "(听众的笑声)",
        "(无意义的声音)", "(笑+鼓掌♫♫)", "(发射时的噪音)", "(人群的欢呼声)", "(打喷嚏的声音)",
    ]
    pattern = "|".join(map(re.escape, patterns_to_replace))
    pattern1 = r'((.*?))'  # alternative: strip anything inside fullwidth parentheses (defined but not applied below)
    text = re.sub(pattern, "", text)
    return text

sen = "我们管它叫做 一个情感工程 它使用最新的 十七世纪的技术- (笑声) 来把名词 变成动词"
text = preprocess_zh(sen)
print(text)

sen = "there's a dog"
text = preprocess_en(sen)
print(text)

# Data preprocessing function
def preprocess_data(en_data: List[str], zh_data: List[str]) -> List[Tuple[List[str], List[str]]]:
    processed_data = []
    for en, zh in zip(en_data, zh_data):
        en = preprocess_en(en)  # expand English contractions (there's -> there is) and clean punctuation
        zh = preprocess_zh(zh)  # remove Chinese stage directions such as (笑声)
        en_tokens = en_tokenizer(en.lower())[:MAX_LENGTH]
        zh_tokens = zh_tokenizer(zh)[:MAX_LENGTH]
        if en_tokens and zh_tokens:  # make sure neither sequence is empty
            processed_data.append((en_tokens, zh_tokens))
    return processed_data
# Terminology dictionary loading (newly added)
from torch.utils.data import DataLoader, Subset

def load_terminology_dictionary(dict_file):
    terminology = {}
    with open(dict_file, 'r', encoding='utf-8') as f:
        for line in f:
            en_term, ch_term = line.strip().split('\t')
            terminology[en_term] = ch_term
    return terminology

# Data loading function
# read_data, build_vocab, TranslationDataset, collate_fn, BATCH_SIZE and
# terminology_path are defined elsewhere in the baseline.
def load_data(train_path: str, dev_en_path: str, dev_zh_path: str, test_en_path: str):
    # Read the training data
    train_data = read_data(train_path)
    train_en, train_zh = zip(*(line.split('\t') for line in train_data))

    # Read the dev and test sets
    dev_en = read_data(dev_en_path)
    dev_zh = read_data(dev_zh_path)
    test_en = read_data(test_en_path)

    # Append every terminology pair to the training data so the model sees the required translations
    terminology = load_terminology_dictionary(terminology_path)
    for en_term, zh_term in terminology.items():
        train_en += (en_term,)
        train_zh += (zh_term,)

    # Preprocess the data
    train_processed = preprocess_data(train_en, train_zh)
    dev_processed = preprocess_data(dev_en, dev_zh)
    test_processed = [(en_tokenizer(en.lower())[:MAX_LENGTH], []) for en in test_en if en.strip()]

    # Build the vocabularies
    global en_vocab, zh_vocab
    en_vocab, zh_vocab = build_vocab(train_processed)

    # Create the datasets
    train_dataset = TranslationDataset(train_processed, en_vocab, zh_vocab)
    dev_dataset = TranslationDataset(dev_processed, en_vocab, zh_vocab)
    test_dataset = TranslationDataset(test_processed, en_vocab, zh_vocab)

    # Suppose you have 10000 samples but only want to use the first N of them
    # for a quick test run (N must be defined beforehand, e.g. N = 1000)
    indices = list(range(N))
    train_dataset = Subset(train_dataset, indices)

    # Create the data loaders
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn, drop_last=True)
    dev_loader = DataLoader(dev_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn, drop_last=True)
    test_loader = DataLoader(test_dataset, batch_size=1, collate_fn=collate_fn, drop_last=True)

    return train_loader, dev_loader, test_loader, en_vocab, zh_vocab
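For reference, a hypothetical call to load_data; the file names and terminology_path below are placeholders for illustration, so substitute the actual file names from the competition data package:

```python
# Placeholder paths -- replace with the real file names from the competition data package.
terminology_path = 'dataset/en-zh.dic'   # terminology dictionary used inside load_data (assumed name)
N = 1000                                 # number of training samples kept by the Subset above

train_loader, dev_loader, test_loader, en_vocab, zh_vocab = load_data(
    'dataset/train.txt',     # tab-separated English/Chinese sentence pairs (assumed name)
    'dataset/dev_en.txt',
    'dataset/dev_zh.txt',
    'dataset/test_en.txt',
)
```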
While training the model, the val loss first decreases and then climbs back up, which usually points to the model overfitting the training data.
Although the BLEU score looks fairly high,
the final translations are clearly wrong.
The issues above are still being worked on…
I can't help wondering: is it the data, the model, or me? (T_T)
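One common mitigation for a val loss that drops and then rises is early stopping: keep the checkpoint with the best val loss and stop once it has not improved for a few epochs. A minimal sketch, assuming train_one_epoch and evaluate helpers that each return an average loss (the names here are illustrative, not the baseline's exact API):

```python
import copy

# Minimal early-stopping loop; train_one_epoch and evaluate are assumed helpers
# that return the average training and validation loss for one epoch.
def train_with_early_stopping(model, train_loader, dev_loader, optimizer, criterion,
                              max_epochs=30, patience=3):
    best_val_loss = float('inf')
    best_state = copy.deepcopy(model.state_dict())
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_loss = train_one_epoch(model, train_loader, optimizer, criterion)
        val_loss = evaluate(model, dev_loader, criterion)
        print(f"epoch {epoch}: train_loss={train_loss:.4f}, val_loss={val_loss:.4f}")

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print("val loss stopped improving, stopping early")
                break

    # Restore the checkpoint that generalized best before overfitting set in
    model.load_state_dict(best_state)
    return model
```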