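Alibaba DAMO Academy's open-source large-scale end-to-end speech recognition toolkit, FunASR:
FunASR provides models trained on large-scale industrial corpora, along with the means to deploy them in applications. The toolkit's core model is Paraformer, a non-autoregressive end-to-end speech recognition model trained on a manually annotated Mandarin speech recognition dataset containing 60,000 hours of speech. To improve Paraformer's performance, the authors add timestamp prediction and hotword customization on top of the standard Paraformer. To make deployment easier, they also open-source a voice activity detection model based on the feedforward sequential memory network (FSMN-VAD) and a text post-processing punctuation model based on the controllable time-delay Transformer (CT-Transformer), both trained on industrial corpora. These modules provide a solid foundation for building high-accuracy speech recognition services for long audio, and Paraformer outperforms other models trained on public datasets. For Chinese speech transcription, FunASR gives better results than Whisper.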
https://github.com/modelscope/FunASR
conda create -n funasr python=3.9
conda activate funasr
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -U funasr
pip install -U modelscope huggingface_hub
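Before loading any model, it can help to confirm that the packages from the install commands are visible to the active interpreter. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the list of import names are assumptions made for illustration, not part of FunASR.

```python
import importlib.util

# Import names of the packages installed above (assumed to match the pip/conda names).
REQUIRED = ["torch", "torchvision", "torchaudio", "funasr", "modelscope", "huggingface_hub"]


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("missing:", ", ".join(missing))
    else:
        import torch  # safe to import now that it was found

        print("all packages found; CUDA available:", torch.cuda.is_available())
```

If the last line reports that CUDA is not available, the `device="cuda:0"` setting used later would need to be changed, for example to `"cpu"`.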
The model needs to be downloaded:
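from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"

# Load SenseVoiceSmall together with the FSMN-VAD model, which splits long audio into segments.
model = AutoModel(
    model=model_dir,
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},  # longest VAD segment, in ms
    device="cuda:0",
)

# Transcribe the English example audio in the model directory.
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",  # or "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,  # include punctuation and inverse text normalization
    batch_size_s=60,  # dynamic batching: total audio duration per batch, in seconds
    merge_vad=True,  # merge short segments produced by the VAD model
    merge_length_s=15,  # target length of merged segments, in seconds
)
# Convert the raw model output into readable text.
text = rich_transcription_postprocess(res[0]["text"])
print(text)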
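The `merge_vad` and `merge_length_s` options control how short VAD segments are combined before recognition. The sketch below shows one plausible greedy merging rule in plain Python. It is an illustration of the idea only, not FunASR's actual implementation, and the function name `merge_segments` and the sample segment list are hypothetical.

```python
def merge_segments(segments_ms, merge_length_s=15):
    """Greedily merge consecutive [start_ms, end_ms] segments.

    A segment joins the current group while the group's total span stays
    within merge_length_s seconds; otherwise it starts a new group.
    """
    max_len_ms = merge_length_s * 1000
    merged = []
    for start, end in segments_ms:
        if merged and end - merged[-1][0] <= max_len_ms:
            merged[-1][1] = end  # extend the current group
        else:
            merged.append([start, end])  # start a new group
    return merged


# Hypothetical VAD output: four speech segments, in milliseconds.
segments = [[0, 4000], [4500, 9000], [9500, 16000], [16500, 20000]]
print(merge_segments(segments))  # [[0, 9000], [9500, 20000]]
```

With a 15-second limit, the first two segments fit in one group, the third would push that group past 15 seconds and so opens a new one, and the fourth fits into that second group.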

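English recognition: the SenseVoiceSmall script above transcribes the en.mp3 example from the model directory.
Chinese recognition, using the streaming Paraformer model: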
import os

import soundfile
from funasr import AutoModel

chunk_size = [0, 10, 5]  # [0, 10, 5] = 600 ms, [0, 8, 4] = 480 ms
encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention

model = AutoModel(model="iic/paraformer-zh-streaming")

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # samples per 600 ms chunk

cache = {}  # carries the streaming state from one chunk to the next
# Number of chunks, rounded up so that a final partial chunk is included.
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1  # marks the last chunk so the remaining output is flushed
    res = model.generate(
        input=speech_chunk,
        cache=cache,
        is_final=is_final,
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back,
        decoder_chunk_look_back=decoder_chunk_look_back,
    )
    print(res)
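The chunk arithmetic in the streaming loop can be checked without loading a model. The sketch below assumes, as the 600 ms and 480 ms comments imply, that one `chunk_size` unit is 60 ms, which is 960 samples at 16 kHz. The helper `chunk_bounds` is hypothetical and only reproduces the slicing boundaries.

```python
def chunk_bounds(num_samples, chunk_size=(0, 10, 5), samples_per_unit=960):
    """Return (start, end) sample indices for each streaming chunk."""
    stride = chunk_size[1] * samples_per_unit  # 10 * 960 = 9600 samples = 600 ms
    total = (num_samples + stride - 1) // stride  # ceiling division
    return [(i * stride, min((i + 1) * stride, num_samples)) for i in range(total)]


# 20000 samples (1.25 s at 16 kHz): two full chunks plus one partial chunk.
print(chunk_bounds(20000))  # [(0, 9600), (9600, 19200), (19200, 20000)]

# An exact multiple of the stride yields no empty trailing chunk.
print(len(chunk_bounds(19200)))  # 2
```

Using a ceiling division means the last chunk is always non-empty, so the chunk flagged with `is_final=True` still carries audio.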

In \funasr_samples\samples\python:
Run the server:
python funasr_wss_server.py
Run the client, after which the microphone can be used for real-time transcription:
python funasr_wss_client.py