---
tasks:
- auto-speech-recognition
domain:
- audio
model-type:
- autoregressive
frameworks:
- pytorch
backbone:
- transformer/conformer
metrics:
- CER
license: Apache License 2.0
language:
- multilingual
tags:
- FunASR
- Whisper
datasets:
  train:
  - 680,000 hour
  test:
  - test
indexing:
  results:
  - task:
      name: Automatic Speech Recognition
    dataset:
      name: 680,000 hour
    metrics:
    - type: CER
      value: 8.53% # float
      description: greedy search, without lm, avg.
      args: default
    - type: RTF
      value: 0.0251 # float
      description: GPU inference on V100
      args: batch_size=1
widgets:
- task: auto-speech-recognition
  model_revision: v1.0.0
  inputs:
  - type: audio
    name: input
    title: Audio
  examples:
  - name: 1
    title: Example 1
    inputs:
    - name: input
      data: git://example/asr_example.wav
  inferencespec:
    cpu: 8 # number of CPUs
    memory: 4096
---
###### For model files and weights, please visit the "Files" page.

###### The contributor has not provided a more detailed model introduction, but you can download the model with the git clone command below, or via the ModelScope SDK.

###### Clone with HTTP
# Introduction to the Whisper Model

## <strong>[ModelScope-FunASR](https://github.com/alibaba-damo-academy/FunASR)</strong>

<strong>[FunASR](https://github.com/alibaba-damo-academy/FunASR)</strong> aims to build a bridge between academic research and industrial applications in speech recognition. By supporting the training and fine-tuning of the industrial-grade speech recognition models released on ModelScope, it lets researchers and developers study and productionize speech recognition models more conveniently, and promotes the growth of the speech recognition ecosystem.

[**What's New**](https://github.com/alibaba-damo-academy/FunASR#whats-new)
| [**Installation**](https://github.com/alibaba-damo-academy/FunASR#installation)
| [**Documentation**](https://alibaba-damo-academy.github.io/FunASR/en/index.html)
| [**Tutorial (Chinese)**](https://github.com/alibaba-damo-academy/FunASR/wiki#funasr%E7%94%A8%E6%88%B7%E6%89%8B%E5%86%8C)
| [**Runtime Deployment**](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime)
| [**Model Zoo**](https://github.com/alibaba-damo-academy/FunASR/blob/main/docs/model_zoo/modelscope_models.md)
| [**Contact Us**](https://github.com/alibaba-damo-academy/FunASR#contact)
## Inference with ModelScope

- Supported audio input formats:
    - wav file path, e.g. data/test/audios/asr_example.wav
    - pcm file path, e.g. data/test/audios/asr_example.pcm
    - wav file URL, e.g. https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav
    - wav binary data of type bytes, e.g. bytes read directly from a file or recorded from a microphone.
    - decoded audio, e.g. audio, rate = soundfile.read("asr_example_zh.wav"), of type numpy.ndarray or torch.Tensor.
    - a wav.scp file, which must follow the format below:

```sh
cat wav.scp
asr_example1 data/test/audios/asr_example1.wav
asr_example2 data/test/audios/asr_example2.wav
...
```
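Each line of a wav.scp file pairs an utterance ID with an audio path, separated by whitespace. A minimal sketch of parsing that format (the helper name is hypothetical, not part of ModelScope or FunASR):

```python
def parse_wav_scp(text):
    """Parse kaldi-style wav.scp lines of the form `<utt_id> <wav_path>`."""
    entries = {}
    for line in text.strip().splitlines():
        utt_id, path = line.split(maxsplit=1)
        entries[utt_id] = path
    return entries

scp = """asr_example1 data/test/audios/asr_example1.wav
asr_example2 data/test/audios/asr_example2.wav"""
print(parse_wav_scp(scp))
```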
To clone the model with git:

```shell
git clone https://www.modelscope.cn/iic/Whisper-large-v3.git
```

- If the input is a wav file URL, the API can be called as in the following example:
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='iic/Whisper-large-v3')

rec_result = inference_pipeline(input='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav', language=None)
print(rec_result)
```
- If the input audio is in pcm format, pass the sample-rate parameter fs when calling the API, e.g.:

```python
rec_result = inference_pipeline(input='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.pcm', fs=16000)
```
- If the input audio is in wav format, the API can be called as in the following example:

```python
rec_result = inference_pipeline(input='asr_example_zh.wav')
```
- If the input is a wav.scp file (note: the file name must end in .scp), you can add an output_dir parameter to write the recognition results to files; the API can be called as in the following example:

```python
inference_pipeline(input="wav.scp", output_dir='./output_dir')
```

The recognition results are written out with the following directory structure:

```sh
tree output_dir/
output_dir/
└── 1best_recog
    ├── score
    └── text

1 directory, 2 files
```

score: recognition path scores

text: speech recognition result file
- If the input audio is already decoded, the API can be called as in the following example:

```python
import soundfile

waveform, sample_rate = soundfile.read("asr_example_zh.wav")
rec_result = inference_pipeline(input=waveform)
```
- Freely combining ASR, VAD, and PUNC models

The VAD and PUNC (punctuation) models can be freely combined as needed:

```python
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', model_revision="v2.0.4",
    vad_model='iic/speech_fsmn_vad_zh-cn-16k-common-pytorch', vad_model_revision="v2.0.4",
    punc_model='iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch', punc_model_revision="v2.0.4",
    # spk_model="iic/speech_campplus_sv_zh-cn_16k-common",
    # spk_model_revision="v2.0.2",
)
```

To disable the PUNC model, set punc_model="" or simply omit the punc_model argument. To add an LM model, set lm_model='damo/speech_transformer_lm_zh-cn-common-vocab8404-pytorch' and configure the lm_weight and beam_size parameters.
## Inference with FunASR

Below is a quick-start tutorial. Test audio: [Chinese](https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/vad_example.wav), [English](https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_en.wav)

### Command-line usage

Run in a terminal:

```shell
funasr ++model=paraformer-zh ++vad_model="fsmn-vad" ++punc_model="ct-punc" ++input=vad_example.wav
```

Note: this supports recognizing a single audio file, as well as a file list in kaldi-style wav.scp format: `wav_id wav_path`
### Python examples

#### Non-streaming speech recognition

```python
from funasr import AutoModel

# paraformer-zh is a multi-functional asr model
# use vad, punc, spk or not as you need
model = AutoModel(model="paraformer-zh", model_revision="v2.0.4",
                  vad_model="fsmn-vad", vad_model_revision="v2.0.4",
                  punc_model="ct-punc-c", punc_model_revision="v2.0.4",
                  # spk_model="cam++", spk_model_revision="v2.0.2",
                  )
res = model.generate(input=f"{model.model_path}/example/asr_example.wav",
                     batch_size_s=300,
                     hotword='魔搭')
print(res)
```

Note: `model_hub` selects the model repository: `ms` downloads from ModelScope, `hf` downloads from Hugging Face.
#### Streaming speech recognition

```python
from funasr import AutoModel

chunk_size = [0, 10, 5]  # [0, 10, 5] 600ms, [0, 8, 4] 480ms
encoder_chunk_look_back = 4  # number of chunks to look back at for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back at for decoder cross-attention

model = AutoModel(model="paraformer-zh-streaming", model_revision="v2.0.4")

import soundfile
import os

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # 600ms

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size, encoder_chunk_look_back=encoder_chunk_look_back, decoder_chunk_look_back=decoder_chunk_look_back)
    print(res)
```

Note: `chunk_size` configures the streaming latency. `[0,10,5]` means the real-time display granularity is `10*60=600ms`, with `5*60=300ms` of lookahead. Each inference step takes `600ms` of input (`16000*0.6=9600` samples) and outputs the corresponding text. The last speech chunk must be passed with `is_final=True` to force output of the final word.
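The latency arithmetic above can be checked directly: at a 16 kHz sample rate, one 60 ms chunk unit is 960 samples, so a `[0, 10, 5]` configuration yields a 9600-sample (600 ms) stride with 300 ms of lookahead:

```python
SAMPLE_RATE = 16000      # Hz, as in the example above
FRAME_MS = 60            # one chunk unit corresponds to 60 ms of audio

chunk_size = [0, 10, 5]  # [past, current, future] in 60 ms units

samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000  # samples per 60 ms unit
chunk_stride = chunk_size[1] * samples_per_frame    # samples per inference step
latency_ms = chunk_size[1] * FRAME_MS               # real-time display granularity
lookahead_ms = chunk_size[2] * FRAME_MS             # future context

print(samples_per_frame, chunk_stride, latency_ms, lookahead_ms)  # 960 9600 600 300
```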
#### Voice activity detection (non-streaming)

```python
from funasr import AutoModel

model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")

wav_file = f"{model.model_path}/example/asr_example.wav"
res = model.generate(input=wav_file)
print(res)
```
#### Voice activity detection (streaming)

```python
from funasr import AutoModel

chunk_size = 200  # ms
model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")

import soundfile

wav_file = f"{model.model_path}/example/vad_example.wav"
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = int(chunk_size * sample_rate / 1000)

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size)
    if len(res[0]["value"]):
        print(res)
```
#### Punctuation restoration

```python
from funasr import AutoModel

model = AutoModel(model="ct-punc", model_revision="v2.0.4")

res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res)
```
#### Timestamp prediction

```python
from funasr import AutoModel

model = AutoModel(model="fa-zh", model_revision="v2.0.4")

wav_file = f"{model.model_path}/example/asr_example.wav"
text_file = f"{model.model_path}/example/text.txt"
res = model.generate(input=(wav_file, text_file), data_type=("sound", "text"))
print(res)
```

For more detailed usage, see the [examples](https://github.com/alibaba-damo-academy/FunASR/tree/main/examples/industrial_data_pretraining).
## Fine-tuning

For detailed usage, see the [examples](https://github.com/alibaba-damo-academy/FunASR/tree/main/examples/industrial_data_pretraining).

## Usage and Scope of Application

Supported platforms
- Linux-x86_64, macOS, and Windows.

Usage
- Direct inference: decode input audio directly and output the target text.

Scope and target scenarios
- Suitable for offline (non-streaming) speech recognition scenarios.

## Model Limitations and Possible Biases

Differences in the feature extraction pipeline and in training tools can introduce small differences in the measured CER (<0.1%); differences in the GPU inference environment lead to differences in the measured RTF.
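RTF (real-time factor) here is the usual ratio of decoding time to audio duration, which is why it depends directly on the inference hardware. A minimal sketch of the generic definition (not FunASR's actual benchmarking code); with the RTF of 0.0251 reported above, decoding 100 s of audio would take about 2.51 s:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent decoding / duration of the decoded audio.
    RTF < 1 means faster than real time."""
    return processing_seconds / audio_seconds

print(real_time_factor(2.51, 100.0))
```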
## Related Papers and Citation

```BibTeX
@inproceedings{radford2023robust,
  title={Robust speech recognition via large-scale weak supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  booktitle={International Conference on Machine Learning},
  pages={28492--28518},
  year={2023},
  organization={PMLR}
}
```
## config.yaml

```yaml
# network architecture
model: WhisperWarp
model_conf:
  lsm_weight: 0.1
  length_normalized_loss: true
  hub: funasr # openai

# only used when hub == funasr;
# if hub == openai, dims are downloaded automatically
dims:
  n_mels: 128
  n_vocab: 51866
  n_audio_ctx: 1500
  n_audio_state: 1280
  n_audio_head: 20
  n_audio_layer: 32
  n_text_ctx: 448
  n_text_state: 1280
  n_text_head: 20
  n_text_layer: 32

# frontend related
frontend: WhisperFrontend
frontend_conf:
  fs: 16000
  n_mels: ${dims.n_mels}
  do_pad_trim: true

tokenizer: WhisperTokenizer
tokenizer_conf:
  language: null
  task: transcribe
  is_multilingual: true
  num_languages: 100

scope_map: [none, "model."]
```
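The `${dims.n_mels}` value under frontend_conf is an OmegaConf-style interpolation that resolves against the top-level dims section (to 128) when the config is loaded. A minimal pure-Python sketch of how such a reference resolves (a hypothetical helper, not FunASR's actual config loader):

```python
import re

def resolve(value, config):
    """Resolve `${a.b.c}` references in a string against a nested dict."""
    def lookup(match):
        node = config
        for key in match.group(1).split("."):
            node = node[key]
        return str(node)
    return re.sub(r"\$\{([^}]+)\}", lookup, value)

config = {"dims": {"n_mels": 128}}
print(resolve("${dims.n_mels}", config))  # "128"
```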