The General Language Generation Evaluation (GLGE) benchmark is a new multi-task benchmark for evaluating the generalization capabilities of natural language generation (NLG) models across eight language generation tasks.
GLGE covers eight tasks: Abstractive Text Summarization (CNN/DailyMail, Gigaword, XSUM, MSNews), Answer-aware Question Generation (SQuAD 1.1, MSQG), Conversational Question Answering (CoQA), and Personalizing Dialogue (PersonaChat). To offer a wider range of difficulty, each task comes in three versions: GLGE-Easy, GLGE-Medium, and GLGE-Hard.
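For orientation, the sketch below writes out the task/difficulty grid as a small Python structure. This is purely illustrative; the short task keys are informal labels of ours, not official identifiers:

```python
# Illustrative map of the GLGE task grid.
# The short keys are informal labels, not official identifiers.
TASKS = {
    "cnndm":       "Abstractive Text Summarization (CNN/DailyMail)",
    "gigaword":    "Abstractive Text Summarization (Gigaword)",
    "xsum":        "Abstractive Text Summarization (XSUM)",
    "msnews":      "Abstractive Text Summarization (MSNews)",
    "squadqg":     "Answer-aware Question Generation (SQuAD 1.1)",
    "msqg":        "Answer-aware Question Generation (MSQG)",
    "coqa":        "Conversational Question Answering (CoQA)",
    "personachat": "Personalizing Dialogue (PersonaChat)",
}
DIFFICULTIES = ("easy", "medium", "hard")

# 8 tasks x 3 difficulty versions = 24 dataset variants in total.
for level in DIFFICULTIES:
    for key, name in TASKS.items():
        print(f"glge-{level}/{key}: {name}")
```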
The GLGE datasets are intended for non-commercial research purposes only, to promote advancement in the field of artificial intelligence and related areas, and are made available free of charge without extending any license or other intellectual property rights. The datasets are provided "as is" without warranty, and using the data carries risk, since we may not own the underlying rights in the documents. We are not liable for any damages related to use of the datasets. Feedback is voluntarily given and can be used as we see fit. Upon violation of any of these terms, your rights to use the datasets end automatically.
If you have questions about use of the dataset or any research outputs in your products or services, we encourage you to undertake your own independent legal review. For other questions, please feel free to contact us.
GLGE-Easy leaderboard. In the per-task columns, the summarization tasks (CNN/DailyMail, Gigaword, XSUM, MSNews) report ROUGE-1/ROUGE-2/ROUGE-L, the question generation tasks (SQuAD 1.1, MSQG) report ROUGE-L/BLEU-4/METEOR, CoQA reports F1, and PersonaChat reports BLEU-1/BLEU-2/Distinct-1/Distinct-2.

Model | Submission Date | GLGE Score | CNN/DailyMail | Gigaword | XSUM | MSNews | SQuAD 1.1 | MSQG | CoQA | PersonaChat |
---|---|---|---|---|---|---|---|---|---|---|
CTRLgen (42Maru) | 2021-03-08 | 37.6 | 46.2/22.6/43.1 | 39.2/20.0/36.5 | 45.7/22.4/37.3 | 45.4/26.1/41.5 | 51.3/23.0/26.8 | 39.8/10.6/24.3 | 75.1 | 51.0/40.2/1.1/6.8 |
P3LM (JD AI Research) | 2021-05-11 | 37.4 | 44.3/21.0/41.4 | 39.6/20.2/36.8 | 45.3/22.3/37.3 | 44.6/25.0/40.8 | 51.6/23.0/26.6 | 39.5/11.0/23.6 | 75.3 | 48.8/39.4/1.7/13.7 |
ProphetNet-large (GLGE Team) | 2020-11-24 | 36.5 | 44.2/21.1/41.3 | 39.5/20.4/36.6 | 44.4/21.3/36.4 | 44.1/24.4/40.2 | 51.5/22.5/26.0 | 38.3/9.6/23.3 | 73.0 | 46.7/39.0/1.3/7.5 |
BART-large (GLGE Team) | 2020-11-24 | 35.8 | 44.1/21.2/40.9 | 38.1/18.4/34.9 | 45.1/22.2/37.2 | 43.8/24.0/39.2 | 50.3/22.0/26.4 | 38.8/9.2/24.3 | 68.6 | 49.9/40.0/1.3/8.0 |
MASS-middle (GLGE Team) | 2020-11-24 | 34.3 | 42.9/19.8/39.8 | 38.9/20.2/36.2 | 39.1/16.5/31.4 | 40.4/21.5/36.8 | 49.9/21.3/25.2 | 38.9/9.5/23.5 | 67.6 | 46.0/38.2/1.2/6.2 |
ProphetNet-base (GLGE Team) | 2020-11-24 | 33.8 | 42.5/19.7/39.5 | 38.9/19.9/36.0 | 39.8/17.1/32.0 | 40.6/21.6/37.0 | 48.0/19.5/23.9 | 37.1/9.3/22.7 | 65.3 | 46.0/38.4/1.3/7.3 |
MASS-base (GLGE Team) | 2020-11-24 | 33.6 | 42.1/19.5/39.0 | 38.7/19.7/35.9 | 39.7/17.2/31.9 | 39.4/21.0/36.1 | 49.4/20.1/24.4 | 38.9/10.2/23.3 | 65.4 | 41.0/35.7/1.4/6.9 |
Transformer (GLGE Team) | 2020-11-24 | 21.9 | 39.5/16.7/36.7 | 37.1/18.4/34.5 | 30.5/10.4/24.2 | 33.0/15.4/30.0 | 30.7/4.8/10.9 | 29.3/5.1/16.6 | 15.7 | 38.3/33.6/0.2/0.7 |
LSTM (GLGE Team) | 2020-11-24 | 20.0 | 37.3/15.7/34.4 | 34.2/16.0/31.8 | 25.1/6.9/19.9 | 30.0/14.6/27.7 | 27.2/3.8/8.9 | 25.3/3.5/14.1 | 15.1 | 42.2/35.9/0.2/0.7 |
ExtBART + DABS (42Maru) | 2020-12-02 | - | 45.0/21.5/42.0 | - | - | - | - | - | - | - |
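The overall GLGE Score is consistent with a simple macro-average: average the metrics within each task, then average the eight task scores. A minimal sketch reproducing the ProphetNet-large row above; the function name is ours, not an official script:

```python
# GLGE Score as a macro-average: mean over metrics within each task,
# then mean over the eight task scores. The function name is ours;
# the computation matches the leaderboard rows above.
def glge_score(per_task_metrics):
    task_scores = [sum(m) / len(m) for m in per_task_metrics]
    return sum(task_scores) / len(task_scores)

# ProphetNet-large on GLGE-Easy, copied from the table above:
prophetnet_easy = [
    [44.2, 21.1, 41.3],      # CNN/DailyMail (ROUGE-1/2/L)
    [39.5, 20.4, 36.6],      # Gigaword (ROUGE-1/2/L)
    [44.4, 21.3, 36.4],      # XSUM (ROUGE-1/2/L)
    [44.1, 24.4, 40.2],      # MSNews (ROUGE-1/2/L)
    [51.5, 22.5, 26.0],      # SQuAD 1.1 (ROUGE-L/BLEU-4/METEOR)
    [38.3, 9.6, 23.3],       # MSQG (ROUGE-L/BLEU-4/METEOR)
    [73.0],                  # CoQA (F1)
    [46.7, 39.0, 1.3, 7.5],  # PersonaChat (BLEU-1/BLEU-2/Distinct-1/2)
]
print(round(glge_score(prophetnet_easy), 1))  # 36.5, matching the table
```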
GLGE-Medium leaderboard (metrics as above).

Model | Submission Date | GLGE Score | CNN/DailyMail | Gigaword | XSUM | MSNews | SQuAD 1.1 | MSQG | CoQA | PersonaChat |
---|---|---|---|---|---|---|---|---|---|---|
ProphetNet-large (GLGE Team) | 2020-11-24 | 35.5 | 43.1/20.3/40.1 | 39.1/19.8/36.1 | 41.8/18.7/33.8 | 43.3/23.5/39.4 | 50.4/21.9/25.8 | 39.3/10.0/23.7 | 72.3 | 42.0/36.4/1.4/7.8 |
BART-large (GLGE Team) | 2020-11-24 | 35.3 | 42.8/20.1/39.1 | 38.0/18.3/34.7 | 43.1/19.5/34.1 | 43.4/23.6/38.9 | 49.7/21.6/25.9 | 38.4/9.5/24.0 | 69.4 | 50.4/39.1/1.2/7.4 |
MASS-middle (GLGE Team) | 2020-11-24 | 33.6 | 41.5/19.0/38.5 | 38.3/19.1/35.4 | 38.4/15.8/30.7 | 39.6/20.9/36.0 | 49.3/20.4/24.4 | 38.3/9.9/22.7 | 67.2 | 44.0/37.3/1.3/6.1 |
MASS-base (GLGE Team) | 2020-11-24 | 33.0 | 41.2/18.8/38.2 | 37.9/19.1/35.2 | 37.4/14.9/29.8 | 38.9/20.5/35.6 | 48.9/20.0/24.3 | 38.2/9.5/22.8 | 65.0 | 42.8/36.7/1.3/6.2 |
ProphetNet-base (GLGE Team) | 2020-11-24 | 32.6 | 41.6/19.2/38.7 | 38.6/19.6/35.7 | 37.8/15.3/30.4 | 39.0/20.4/35.7 | 46.4/17.9/22.5 | 37.0/8.7/22.3 | 62.5 | 45.4/37.7/1.4/7.3 |
Transformer (GLGE Team) | 2020-11-24 | 19.5 | 35.0/11.0/32.4 | 36.7/18.1/34.1 | 27.5/8.3/21.8 | 26.8/9.7/24.3 | 28.3/4.1/9.8 | 27.0/4.2/15.0 | 14.2 | 37.7/29.6/0.2/0.7 |
LSTM (GLGE Team) | 2020-11-24 | 18.1 | 35.3/14.1/32.8 | 33.3/15.2/31.1 | 21.5/4.6/17.1 | 27.0/12.1/24.9 | 26.6/3.5/8.2 | 18.6/1.7/9.5 | 12.9 | 41.3/35.3/0.1/0.5 |
GLGE-Hard leaderboard (metrics as above).

Model | Submission Date | GLGE Score | CNN/DailyMail | Gigaword | XSUM | MSNews | SQuAD 1.1 | MSQG | CoQA | PersonaChat |
---|---|---|---|---|---|---|---|---|---|---|
BART-large (GLGE Team) | 2020-11-24 | 31.0 | 41.7/19.1/37.9 | 33.0/13.6/30.0 | 39.7/16.1/30.9 | 40.8/20.8/36.4 | 45.9/18.1/23.7 | 35.1/8.5/20.7 | 53.5 | 48.3/37.3/1.3/7.2 |
ProphetNet-large (GLGE Team) | 2020-11-24 | 30.5 | 41.2/18.7/38.0 | 32.4/13.7/29.9 | 39.4/16.1/31.6 | 40.3/20.5/36.4 | 46.4/17.0/22.1 | 34.0/8.2/19.0 | 54.1 | 40.5/35.2/1.8/9.2 |
MASS-middle (GLGE Team) | 2020-11-24 | 29.1 | 41.1/18.5/38.0 | 32.2/13.5/29.9 | 34.9/12.5/27.6 | 36.6/18.0/33.4 | 45.1/16.0/21.3 | 34.3/8.0/19.0 | 51.2 | 41.4/35.4/1.5/7.6 |
MASS-base (GLGE Team) | 2020-11-24 | 28.2 | 40.4/18.0/37.3 | 32.2/13.6/29.5 | 33.7/11.6/26.7 | 35.4/17.0/32.4 | 42.8/13.4/19.0 | 34.1/7.5/18.6 | 50.2 | 40.1/34.9/1.6/7.8 |
ProphetNet-base (GLGE Team) | 2020-11-24 | 28.0 | 40.9/18.4/37.7 | 32.0/13.5/29.5 | 34.2/11.6/26.8 | 35.2/17.0/32.1 | 41.6/13.4/18.9 | 32.3/7.2/18.0 | 48.5 | 41.6/35.5/1.6/8.3 |
Transformer (GLGE Team) | 2020-11-24 | 14.4 | 28.3/6.2/25.8 | 28.6/10.8/26.5 | 23.0/5.3/18.3 | 18.0/3.5/16.2 | 25.9/1.1/7.0 | 17.0/1.3/8.2 | 9.9 | 30.0/29.7/0.1/0.2 |
LSTM (GLGE Team) | 2020-11-24 | 12.6 | 26.2/6.8/24.2 | 26.3/9.2/24.6 | 17.8/2.4/14.3 | 8.2/0.9/7.6 | 27.3/1.0/6.7 | 12.5/0.4/5.0 | 10.3 | 36.8/28.7/0.1/0.4 |
To submit your predictions for evaluation, first validate your setup by evaluating against the development file (a minimal pre-submission sanity check is sketched below, after the submission notes). Once you have built a model that meets your expectations on the dev set, you can submit your test predictions for official evaluation. To ensure the integrity of the official test results, we do not release the correct answers for the test set to the public. To submit your model for official evaluation on the test set, email your submission; we will reply with your model's performance.
Your email should include:
To avoid "P-hacking" we discourage too many submissions from the same group in a short period of time.
If you use GLGE in your research, please cite:
@article{Liu2020GLGE,
title={{GLGE}: A New General Language Generation Evaluation Benchmark},
author={Liu, Dayiheng and Yan, Yu and Gong, Yeyun and Qi, Weizhen and Zhang, Hang and Jiao, Jian and Chen, Weizhu and Fu, Jie and Shou, Linjun and Gong, Ming and Wang, Pengcheng and Chen, Jiusheng and Jiang, Daxin and Lv, Jiancheng and Zhang, Ruofei and Wu, Winnie and Zhou, Ming and Duan, Nan},
journal={arXiv},
year={2020},
volume={abs/2011.11928}
}
Additionally, since GLGE is built on top of six existing datasets, please ensure you cite all of them.
An example: We evaluate our model using the GLGE benchmark \cite{Liu2020GLGE}, a general language generation evaluation benchmark consisting of CNN/DailyMail \cite{hermann2015cnndm} \cite{see2017get}, Gigaword \cite{rush2015neural} \cite{graff2003gigaword}, XSum \cite{narayan2018don}, MSNews, SQuAD 1.1 \cite{rajpurkar2016squad}, MSQG, CoQA \cite{reddy2019coqa}, and PersonaChat \cite{zhang2018personalizing}.
@inproceedings{hermann2015cnndm,
title={Teaching machines to read and comprehend},
author={Hermann, Karl Moritz and Kocisky, Tomas and Grefenstette, Edward and Espeholt, Lasse and Kay, Will and Suleyman, Mustafa and Blunsom, Phil},
booktitle={NIPS},
pages={1693--1701},
year={2015}
}
@inproceedings{see2017get,
title={Get to the point: Summarization with pointer-generator networks},
author={See, Abigail and Liu, Peter J and Manning, Christopher D},
booktitle={ACL},
pages={1073--1083},
year={2017}
}
@inproceedings{rush2015neural,
title={A neural attention model for abstractive sentence summarization},
author={Rush, Alexander M and Chopra, Sumit and Weston, Jason},
booktitle={EMNLP},
pages={379--389},
year={2015}
}
@article{graff2003gigaword,
title={English gigaword},
author={Graff, David and Kong, Junbo and Chen, Ke and Maeda, Kazuaki},
journal={Linguistic Data Consortium, Philadelphia},
volume={4},
number={1},
pages={34},
year={2003}
}
@inproceedings{narayan2018don,
title={Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization},
author={Narayan, Shashi and Cohen, Shay B and Lapata, Mirella},
booktitle={EMNLP},
pages={1797--1807},
year={2018}
}
@inproceedings{rajpurkar2016squad,
title={{SQuAD}: 100,000+ questions for machine comprehension of text},
author={Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy},
booktitle={EMNLP},
pages={2383--2392},
year={2016}
}
@article{reddy2019coqa,
title={{CoQA}: A conversational question answering challenge},
author={Reddy, Siva and Chen, Danqi and Manning, Christopher D},
journal={TACL},
volume={7},
pages={249--266},
year={2019}
}
@inproceedings{zhang2018personalizing,
title={Personalizing dialogue agents: I have a dog, do you have pets too?},
author={Zhang, Saizheng and Dinan, Emily and Urbanek, Jack and Szlam, Arthur and Kiela, Douwe and Weston, Jason},
booktitle={ACL},
pages={2204--2213},
year={2018}
}