xuqichen 2025-09-05 13:45:03 +08:00
parent 4e5a083fa3
commit 7d211949fc
3 changed files with 100 additions and 0 deletions

60
README.md Normal file

@ -0,0 +1,60 @@
---
license: CC BY NC 4.0
# User-defined tags
tags:
- finetune
- alpaca
- gpt4
text:
TextGeneration:
样本规模:
- 10k-100k
language:
- zh
语言:
- 中文
---
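## Dataset Description
This is a Chinese-language dataset generated by GPT-4, intended for instruction fine-tuning and reinforcement learning of LLMs.
### Loading the Dataset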
```python
from modelscope.msdatasets import MsDataset
ds = MsDataset.load("alpaca-gpt4-data-zh", namespace="AI-ModelScope", split="train")
print(next(iter(ds)))
```
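### Data Splits
The dataset ships with a predefined `train` split.
## Dataset License
The dataset is open source under the CC BY NC 4.0 license and is for non-commercial use only. If any use violates the relevant terms, contact ModelScope at any time to have it removed.
## Citation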
```
@article{peng2023gpt4llm,
title={Instruction Tuning with GPT-4},
author={Baolin Peng and Chunyuan Li and Pengcheng He and Michel Galley and Jianfeng Gao},
journal={arXiv preprint arXiv:2304.03277},
year={2023}
}
```
## Reference Links
```
https://huggingface.co/datasets/c-s-ale/alpaca-gpt4-data-zh
https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM
```
### Clone with HTTP
```bash
git clone https://www.modelscope.cn/datasets/AI-ModelScope/alpaca-gpt4-data-zh.git
```
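Once the repository is cloned, the records can also be read without the ModelScope SDK. The sketch below is illustrative only: it assumes the `columns` entry in `dataset_info.json` maps a role name (`prompt`, `response`) to the field name used in `train.json` (`instruction`, `output`). The helper names `load_records` and `apply_columns` are hypothetical, and the sample answer text is a placeholder rather than real dataset content.

```python
import json
from pathlib import Path


def load_records(repo_dir):
    """Read the data file and column mapping from a local clone of the repo."""
    repo = Path(repo_dir)
    info = json.loads((repo / "dataset_info.json").read_text(encoding="utf-8"))
    entry = info["default"]
    records = json.loads((repo / entry["file_name"]).read_text(encoding="utf-8"))
    return records, entry["columns"]


def apply_columns(records, columns):
    """Re-key each record by role name.

    Assumption: `columns` maps role name -> field name in the data file.
    """
    return [
        {role: record[field] for role, field in columns.items()}
        for record in records
    ]


# In-memory demonstration that does not need the cloned files.
columns = {"prompt": "instruction", "response": "output"}
sample = [{"instruction": "三原色是什么?", "output": "<placeholder answer>"}]
pairs = apply_columns(sample, columns)
print(pairs[0]["prompt"])
```

With a clone in the current directory, `load_records("alpaca-gpt4-data-zh")` would return the raw records together with the mapping, which can then be passed to `apply_columns`.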

9
dataset_info.json Normal file

@ -0,0 +1,9 @@
{
"default": {
"file_name": "train.json",
"columns": {
"prompt": "instruction",
"response": "output"
}
}
}

31
train.json Normal file
View File

@ -0,0 +1,31 @@
[
{
"instruction": "保持健康的三个提示。",
"output": "\n\n1. \n\n2. >\n\n3. 7-8
"
},
{
"instruction": "三原色是什么?",
"output": "绿RGB
, 绿绿绿
\n\nRYB>"
},
{
"instruction": "描述原子的结构。",
"output": "\n\n>使
\n\n28
\n\n使
\n\n"
},
{
"instruction": "如何减少空气污染?",
"output": ":\n\n1. : 使\n2.
: 使\n3. : 使\n4. 绿:
\n5. : 使\n6. : \n\n>"
},
{
"instruction": "描述一次你不得不做出困难决定的经历。",
"output": "\n\n>\n\n
\n\n"
}
]