---
displayName: SAMSum Corpus
labelTypes:
- Classification
license:
- CC BY-NC-ND 4.0
mediaTypes:
- Text
paperUrl: https://arxiv.org/pdf/1911.12237v2.pdf
publishDate: "2019"
publishUrl: https://github.com/huggingface/datasets/tree/master/datasets/samsum
publisher:
- Samsung R&D Institute Poland
tags:
- Text
taskTypes:
- Text Summarization/Simplification
- Federated Learning
- Abstractive Text Summarization
---
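# Dataset Introduction
## Overview
The SAMSum dataset contains about 16k messenger-like conversations with summaries. The conversations were created and written down by linguists fluent in English. The linguists were asked to create conversations similar to those they write every day, reflecting the proportion of topics in their real-life messenger conversations. Style and register vary: conversations can be informal, semi-formal, or formal, and they may contain slang, emoticons, and typos. The conversations were then annotated with summaries, on the assumption that a summary should be a concise, third-person brief of what people talked about in the conversation. The SAMSum dataset was prepared by Samsung R&D Institute Poland and is distributed for research purposes under a non-commercial license (CC BY-NC-ND 4.0).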
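To make the record shape concrete, here is a minimal Python sketch of splitting one messenger-style conversation into speaker turns. It assumes each record has a `dialogue` string with one `Speaker: text` line per message and a `summary` string; these field names, the `Turn` class, and the `parse_dialogue` helper are illustrative assumptions, not something this README specifies. The sample record is made up and not taken from the corpus.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    utterance: str


def parse_dialogue(dialogue: str) -> list[Turn]:
    """Split a 'Speaker: text' transcript into turns, one per non-empty line."""
    turns: list[Turn] = []
    for line in dialogue.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so colons inside the message survive
        # (emoticons such as ":)" or times such as "8:30").
        speaker, sep, utterance = line.partition(":")
        if not sep:
            # No speaker prefix: treat the line as a continuation of the
            # previous turn, or skip it if there is no previous turn.
            if turns:
                turns[-1].utterance += " " + line
            continue
        turns.append(Turn(speaker.strip(), utterance.strip()))
    return turns


# Made-up record for illustration only; field names are assumed.
record = {
    "dialogue": "Anna: are you coming tonite? :)\nTom: yes, see you at 8:30",
    "summary": "Tom will meet Anna tonight at 8:30.",
}

turns = parse_dialogue(record["dialogue"])
for turn in turns:
    print(f"{turn.speaker} -> {turn.utterance}")
```

Splitting on the first colon only is a deliberate choice here, since informal messages often contain colons of their own.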
## Citation
```
@article{gliwa2019samsum,
  title={SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization},
  author={Gliwa, Bogdan and Mochol, Iwona and Biesek, Maciej and Wawer, Aleksander},
  journal={arXiv preprint arXiv:1911.12237},
  year={2019}
}
```
## Download dataset
:modelscope-code[]{type="git"}