diff --git a/README.md b/README.md
index 9815deb..bbad9dc 100644
--- a/README.md
+++ b/README.md
@@ -1,14 +1,318 @@
 ---
-frameworks:
-- Pytorch
-license: Apache License 2.0
 tasks:
 - auto-speech-recognition
+domain:
+- audio
+model-type:
+- autoregressive
+frameworks:
+- pytorch
+backbone:
+- transformer/conformer
+metrics:
+- CER
+license: Apache License 2.0
+language:
+- multilingual
+tags:
+- FunASR
+- Whisper
+datasets:
+  train:
+  - 680,000 hours
+  test:
+  - test
+indexing:
+  results:
+  - task:
+      name: Automatic Speech Recognition
+    dataset:
+      name: 680,000 hours
+    metrics:
+    - type: CER
+      value: 8.53% # float
+      description: greedy search, without LM, avg.
+      args: default
+    - type: RTF
+      value: 0.0251 # float
+      description: GPU inference on V100
+      args: batch_size=1
+widgets:
+  - task: auto-speech-recognition
+    model_revision: v1.0.0
+    inputs:
+      - type: audio
+        name: input
+        title: Audio
+    examples:
+      - name: 1
+        title: Example 1
+        inputs:
+          - name: input
+            data: git://example/asr_example.wav
+    inferencespec:
+      cpu: 8 # number of CPUs
+      memory: 4096
 ---
-###### For model files and weights, please visit the "Files" page.
-###### The contributor of this model has not provided a more detailed introduction, but you can download the model with the git clone command below or via the ModelScope SDK.
-###### Clone with HTTP
+
+
+# Whisper Model Introduction
+
+## [ModelScope-FunASR](https://github.com/alibaba-damo-academy/FunASR)
+[FunASR](https://github.com/alibaba-damo-academy/FunASR) aims to build a bridge between academic research and industrial applications of speech recognition. By supporting the training and fine-tuning of the industrial-grade speech recognition models released on ModelScope, it lets researchers and developers study and productionize speech recognition models more conveniently, and promotes the development of the speech recognition ecosystem.
+
+[**What's New**](https://github.com/alibaba-damo-academy/FunASR#whats-new)
+| [**Installation**](https://github.com/alibaba-damo-academy/FunASR#installation)
+| [**Documentation**](https://alibaba-damo-academy.github.io/FunASR/en/index.html)
+| [**Tutorial (Chinese)**](https://github.com/alibaba-damo-academy/FunASR/wiki#funasr%E7%94%A8%E6%88%B7%E6%89%8B%E5%86%8C)
+| [**Runtime Deployment**](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime)
+| [**Model Zoo**](https://github.com/alibaba-damo-academy/FunASR/blob/main/docs/model_zoo/modelscope_models.md)
+| [**Contact Us**](https://github.com/alibaba-damo-academy/FunASR#contact)
+
+
+## Inference with ModelScope
+
+- The following audio input formats are supported:
+  - wav file path, e.g.: data/test/audios/asr_example.wav
+  - pcm file path, e.g.: data/test/audios/asr_example.pcm
+  - wav file URL, e.g.: https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav
+  - wav binary data of type bytes, e.g. bytes read directly from a wav file or recorded from a microphone.
+  - decoded audio, e.g.: audio, rate = soundfile.read("asr_example_zh.wav"); the type is numpy.ndarray or torch.Tensor.
+  - a wav.scp file, which must follow this format:
+
+```sh
+cat wav.scp
+asr_example1 data/test/audios/asr_example1.wav
+asr_example2 data/test/audios/asr_example2.wav
+...
 ```
-git clone https://www.modelscope.cn/iic/Whisper-large-v3.git
+
+- If the input is a wav file URL, the API can be called as in the following example:
+
+```python
+from modelscope.pipelines import pipeline
+from modelscope.utils.constant import Tasks
+
+inference_pipeline = pipeline(
+    task=Tasks.auto_speech_recognition,
+    model='iic/Whisper-large-v3')
+
+rec_result = inference_pipeline(input='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav', language=None)
+print(rec_result)
 ```
-###### If you are the contributor of this model, we invite you to complete the model card according to the [model contribution guide](https://www.modelscope.cn/docs/%E5%A6%82%E4%BD%95%E6%92%B0%E5%86%99%E5%A5%BD%E7%94%A8%E7%9A%84%E6%A8%A1%E5%9E%8B%E5%8D%A1%E7%89%87).
\ No newline at end of file
+
+- If the input audio is in pcm format, pass the sampling rate parameter fs when calling the API, e.g.:
+
+```python
+rec_result = inference_pipeline(input='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.pcm', fs=16000)
+```
+
+- If the input audio is in wav format, the API can be called as in the following example:
+
+```python
+rec_result = inference_pipeline(input='asr_example_zh.wav')
+```
+
+- If the input is a wav.scp file (note: the file name must end with .scp), an output_dir parameter can be added to write the recognition results to files:
+
+```python
+inference_pipeline(input="wav.scp", output_dir='./output_dir')
+```
+The recognition results are written with the following directory structure:
+
+```sh
+tree output_dir/
+output_dir/
+└── 1best_recog
+    ├── score
+    └── text
+
+1 directory, 2 files
+```
+score: scores of the recognition paths
+
+text: the speech recognition result file
+
+
+- If the input is decoded audio, the API can be called as in the following example:
+
+```python
+import soundfile
+
+waveform, sample_rate = soundfile.read("asr_example_zh.wav")
+rec_result = inference_pipeline(input=waveform)
+```
+
+- Free combination of ASR, VAD and PUNC models
+
+The VAD and PUNC (punctuation) models can be freely combined as needed, as follows:
+```python
+inference_pipeline = pipeline(
+    task=Tasks.auto_speech_recognition,
+    model='iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', model_revision="v2.0.4",
+    vad_model='iic/speech_fsmn_vad_zh-cn-16k-common-pytorch', vad_model_revision="v2.0.4",
+    punc_model='iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch', punc_model_revision="v2.0.4",
+    # spk_model="iic/speech_campplus_sv_zh-cn_16k-common",
+    # spk_model_revision="v2.0.2",
+)
+```
+To disable the PUNC model, set punc_model="" or omit the punc_model parameter. To add an LM model, set lm_model='damo/speech_transformer_lm_zh-cn-common-vocab8404-pytorch' and configure the lm_weight and beam_size parameters.
+
+## Inference with FunASR
+
+Below is a quick-start tutorial; test audio ([Chinese](https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/vad_example.wav), [English](https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_en.wav))
+
+### Command-line usage
+Run in a terminal:
+
+```shell
+funasr ++model=paraformer-zh ++vad_model="fsmn-vad" ++punc_model="ct-punc" ++input=vad_example.wav
+```
+
+Note: both a single audio file and a file list are supported; the list is a kaldi-style wav.scp: `wav_id wav_path`
+
+### Python examples
+#### Non-streaming speech recognition
+```python
+from funasr import AutoModel
+# paraformer-zh is a multi-functional asr model
+# use vad, punc, spk or not as you need
+model = AutoModel(model="paraformer-zh", model_revision="v2.0.4",
+                  vad_model="fsmn-vad", vad_model_revision="v2.0.4",
+                  punc_model="ct-punc-c", punc_model_revision="v2.0.4",
+                  # spk_model="cam++", spk_model_revision="v2.0.2",
+                  )
+res = model.generate(input=f"{model.model_path}/example/asr_example.wav",
+                     batch_size_s=300,
+                     hotword='魔搭')
+print(res)
+```
+Note: `model_hub` specifies the model hub; `ms` downloads from ModelScope, `hf` downloads from Hugging Face.
+
+#### Streaming speech recognition
+
+```python
+from funasr import AutoModel
+
+chunk_size = [0, 10, 5]  # [0, 10, 5] 600ms, [0, 8, 4] 480ms
+encoder_chunk_look_back = 4  # number of chunks to look back at for encoder self-attention
+decoder_chunk_look_back = 1  # number of encoder chunks to look back at for decoder cross-attention
+
+model = AutoModel(model="paraformer-zh-streaming", model_revision="v2.0.4")
+
+import soundfile
+import os
+
+wav_file = os.path.join(model.model_path, "example/asr_example.wav")
+speech, sample_rate = soundfile.read(wav_file)
+chunk_stride = chunk_size[1] * 960  # 600ms
+
+cache = {}
+total_chunk_num = int((len(speech) - 1) / chunk_stride) + 1
+for i in range(total_chunk_num):
+    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
+    is_final = i == total_chunk_num - 1
+    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size, encoder_chunk_look_back=encoder_chunk_look_back, decoder_chunk_look_back=decoder_chunk_look_back)
+    print(res)
+```
+
+Note: `chunk_size` is the streaming-latency configuration. `[0,10,5]` means the real-time display granularity is `10*60=600ms` and the lookahead is `5*60=300ms`. Each inference call takes `600ms` of input (`16000*0.6=9600` samples) and outputs the corresponding text; for the last speech segment, set `is_final=True` to force the final text to be output.
+
+#### Voice activity detection (non-streaming)
+```python
+from funasr import AutoModel
+
+model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")
+
+wav_file = f"{model.model_path}/example/asr_example.wav"
+res = model.generate(input=wav_file)
+print(res)
+```
+
+#### Voice activity detection (streaming)
+```python
+from funasr import AutoModel
+
+chunk_size = 200  # ms
+model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")
+
+import soundfile
+
+wav_file = f"{model.model_path}/example/vad_example.wav"
+speech, sample_rate = soundfile.read(wav_file)
+chunk_stride = int(chunk_size * sample_rate / 1000)
+
+cache = {}
+total_chunk_num = int((len(speech) - 1) / chunk_stride) + 1
+for i in range(total_chunk_num):
+    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
+    is_final = i == total_chunk_num - 1
+    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size)
+    if len(res[0]["value"]):
+        print(res)
+```
+
+#### Punctuation restoration
+```python
+from funasr import AutoModel
+
+model = AutoModel(model="ct-punc", model_revision="v2.0.4")
+
+res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
+print(res)
+```
+
+#### Timestamp prediction
+```python
+from funasr import AutoModel
+
+model = AutoModel(model="fa-zh", model_revision="v2.0.4")
+
+wav_file = f"{model.model_path}/example/asr_example.wav"
+text_file = f"{model.model_path}/example/text.txt"
+res = model.generate(input=(wav_file, text_file), data_type=("sound", "text"))
+print(res)
+```
+
+More detailed usage ([examples](https://github.com/alibaba-damo-academy/FunASR/tree/main/examples/industrial_data_pretraining))
+
+
+## Fine-tuning
+
+Detailed usage ([examples](https://github.com/alibaba-damo-academy/FunASR/tree/main/examples/industrial_data_pretraining))
+
+## Usage and applicable scope
+
+Supported platforms
+- Runs on Linux-x86_64, macOS and Windows.
+
+Usage
+- Direct inference: decode the input audio directly and output the target text.
+
+Scope and target scenarios
+- Suitable for offline speech recognition scenarios.
+
+
+## Model limitations and possible bias
+
+Differences in the feature extraction pipeline and tools, as well as in training tools, can introduce small differences in CER (<0.1%); differences in the GPU inference environment can lead to differences in the RTF value.
+
+
+
+## Related papers and citation
+
+```BibTeX
+@inproceedings{radford2023robust,
+  title={Robust speech recognition via large-scale weak supervision},
+  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
+  booktitle={International Conference on Machine Learning},
+  pages={28492--28518},
+  year={2023},
+  organization={PMLR}
+}
+```
+
+
+
+
diff --git a/config.yaml b/config.yaml
new file mode 100644
index 0000000..887a298
--- /dev/null
+++ b/config.yaml
@@ -0,0 +1,38 @@
+# network architecture
+model: WhisperWarp
+model_conf:
+  lsm_weight: 0.1
+  length_normalized_loss: true
+  hub: funasr  # openai
+
+
+
+# only used for hub == funasr;
+# if hub == openai, dims are downloaded automatically
+dims:
+  n_mels: 128
+  n_vocab: 51866
+  n_audio_ctx: 1500
+  n_audio_state: 1280
+  n_audio_head: 20
+  n_audio_layer: 32
+  n_text_ctx: 448
+  n_text_state: 1280
+  n_text_head: 20
+  n_text_layer: 32
+
+# frontend related
+frontend: WhisperFrontend
+frontend_conf:
+  fs: 16000
+  n_mels: ${dims.n_mels}
+  do_pad_trim: true
+
+tokenizer: WhisperTokenizer
+tokenizer_conf:
+  language: null
+  task: transcribe
+  is_multilingual: true
+  num_languages: 100
+
+scope_map: [none, "model."]
\ No newline at end of file
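
As a sanity check on the streaming `chunk_size` arithmetic described in the README above (60 ms per chunk unit at the 16 kHz sampling rate given by `fs` in `frontend_conf`), here is a small sketch. The helper `chunk_params` is hypothetical, not part of FunASR; it only reproduces the latency, lookahead, and samples-per-call numbers from the README's note.

```python
# Hypothetical helper (not part of FunASR): relate the streaming chunk_size
# triple to display latency, lookahead, and samples per generate() call,
# assuming 16 kHz audio and 60 ms per chunk unit as stated in the README.
def chunk_params(chunk_size, fs=16000, unit_ms=60):
    latency_ms = chunk_size[1] * unit_ms    # on-screen text granularity
    lookahead_ms = chunk_size[2] * unit_ms  # future context used
    stride = fs * latency_ms // 1000        # samples fed per inference call
    return latency_ms, lookahead_ms, stride

print(chunk_params([0, 10, 5]))  # (600, 300, 9600)
print(chunk_params([0, 8, 4]))   # (480, 240, 7680)
```

This matches `chunk_stride = chunk_size[1] * 960` in the streaming example, since 960 samples correspond to one 60 ms chunk unit at 16 kHz.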