The General Language Generation Evaluation (GLGE) benchmark is a new multi-task benchmark for evaluating the generalization capabilities of natural language generation (NLG) models across eight language generation tasks.

GLGE contains 8 language generation tasks: Abstractive Text Summarization (CNN/DailyMail, Gigaword, XSUM, MSNews), Answer-aware Question Generation (SQuAD 1.1, MSQG), Conversational Question Answering (CoQA), and Personalizing Dialogue (PersonaChat). To offer a broader range of difficulty, each task comes in 3 versions: GLGE-Easy, GLGE-Medium, and GLGE-Hard.

Important Updates

  • 2021-05-11: We fixed an issue where inconsistent post-processing led to low ROUGE scores for our baselines on the Gigaword test set.
  • 2020-11-24: We updated the evaluation scripts and ProphetNet baselines; see the Relevant Links section below.

Terms and Conditions

The GLGE datasets are intended for non-commercial research purposes only, to promote advancement in the field of artificial intelligence and related areas, and are made available free of charge without extending any license or other intellectual property rights. The datasets are provided “as is” without warranty, and use of the data carries risk, since we may not own the underlying rights in the documents. We are not liable for any damages related to use of the dataset. Feedback is voluntarily given and can be used as we see fit. Upon violation of any of these terms, your rights to use the dataset will end automatically.



If you have questions about use of the dataset or any research outputs in your products or services, we encourage you to undertake your own independent legal review. For other questions, please feel free to contact us.

GLGE Dataset and Leaderboard

Tasks

  1. CNN/DailyMail
  2. Gigaword
  3. XSUM
  4. MSNews
  5. SQuAD 1.1
  6. MSQG
  7. CoQA
  8. PersonaChat

Relevant Links

  • GLGE Submission Guideline / GitHub
  • GLGE Paper
  • ProphetNet Baseline

Leaderboard (11/24/2020-Present) ranked by GLGE-Easy Score (average score on 8 NLG tasks)

| Model | Submission Date | GLGE Score | CNN/DailyMail | Gigaword | XSUM | MSNews | SQuAD 1.1 | MSQG | CoQA | PersonaChat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CTRLgen (42Maru) | 2021-03-08 | 37.6 | 46.2/22.6/43.1 | 39.2/20.0/36.5 | 45.7/22.4/37.3 | 45.4/26.1/41.5 | 51.3/23.0/26.8 | 39.8/10.6/24.3 | 75.1 | 51.0/40.2/1.1/6.8 |
| P3LM (JD AI Research) | 2021-05-11 | 37.4 | 44.3/21.0/41.4 | 39.6/20.2/36.8 | 45.3/22.3/37.3 | 44.6/25.0/40.8 | 51.6/23.0/26.6 | 39.5/11.0/23.6 | 75.3 | 48.8/39.4/1.7/13.7 |
| ProphetNet-large (GLGE Team) | 2020-11-24 | 36.5 | 44.2/21.1/41.3 | 39.5/20.4/36.6 | 44.4/21.3/36.4 | 44.1/24.4/40.2 | 51.5/22.5/26.0 | 38.3/9.6/23.3 | 73.0 | 46.7/39.0/1.3/7.5 |
| BART-large (GLGE Team) | 2020-11-24 | 35.8 | 44.1/21.2/40.9 | 38.1/18.4/34.9 | 45.1/22.2/37.2 | 43.8/24.0/39.2 | 50.3/22.0/26.4 | 38.8/9.2/24.3 | 68.6 | 49.9/40.0/1.3/8.0 |
| MASS-middle (GLGE Team) | 2020-11-24 | 34.3 | 42.9/19.8/39.8 | 38.9/20.2/36.2 | 39.1/16.5/31.4 | 40.4/21.5/36.8 | 49.9/21.3/25.2 | 38.9/9.5/23.5 | 67.6 | 46.0/38.2/1.2/6.2 |
| ProphetNet-base (GLGE Team) | 2020-11-24 | 33.8 | 42.5/19.7/39.5 | 38.9/19.9/36.0 | 39.8/17.1/32.0 | 40.6/21.6/37.0 | 48.0/19.5/23.9 | 37.1/9.3/22.7 | 65.3 | 46.0/38.4/1.3/7.3 |
| MASS-base (GLGE Team) | 2020-11-24 | 33.6 | 42.1/19.5/39.0 | 38.7/19.7/35.9 | 39.7/17.2/31.9 | 39.4/21.0/36.1 | 49.4/20.1/24.4 | 38.9/10.2/23.3 | 65.4 | 41.0/35.7/1.4/6.9 |
| Transformer (GLGE Team) | 2020-11-24 | 21.9 | 39.5/16.7/36.7 | 37.1/18.4/34.5 | 30.5/10.4/24.2 | 33.0/15.4/30.0 | 30.7/4.8/10.9 | 29.3/5.1/16.6 | 15.7 | 38.3/33.6/0.2/0.7 |
| LSTM (GLGE Team) | 2020-11-24 | 20.0 | 37.3/15.7/34.4 | 34.2/16.0/31.8 | 25.1/6.9/19.9 | 30.0/14.6/27.7 | 27.2/3.8/8.9 | 25.3/3.5/14.1 | 15.1 | 42.2/35.9/0.2/0.7 |
| ExtBART + DABS (42Maru) | 2020-12-02 | - | 45.0/21.5/42.0 | - | - | - | - | - | - | - |
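For reference, the leaderboard's GLGE Score appears to be a simple macro-average: each task is first reduced to the plain mean of its reported metric values, and the eight task scores are then averaged. The minimal sketch below reproduces the ProphetNet-large GLGE-Easy entry under that assumption; it is illustrative only, and the official evaluation scripts in the GLGE repository remain the source of truth.

# Minimal sketch (assumption: GLGE Score = mean over tasks of the mean of each
# task's reported metric values). Numbers copied from the ProphetNet-large row
# of the GLGE-Easy leaderboard above.
from statistics import mean

task_metrics = {
    "cnndm":       [44.2, 21.1, 41.3],
    "gigaword":    [39.5, 20.4, 36.6],
    "xsum":        [44.4, 21.3, 36.4],
    "msnews":      [44.1, 24.4, 40.2],
    "squadqg":     [51.5, 22.5, 26.0],
    "msqg":        [38.3, 9.6, 23.3],
    "coqa":        [73.0],                 # single overall score
    "personachat": [46.7, 39.0, 1.3, 7.5],
}

task_scores = {task: mean(values) for task, values in task_metrics.items()}
glge_score = mean(task_scores.values())
print(round(glge_score, 1))  # -> 36.5, matching the leaderboard entry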

Leaderboard (11/24/2020-Present) ranked by GLGE-Medium Score (average score on 8 NLG tasks)

| Model | Submission Date | GLGE Score | CNN/DailyMail | Gigaword | XSUM | MSNews | SQuAD 1.1 | MSQG | CoQA | PersonaChat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ProphetNet-large (GLGE Team) | 2020-11-24 | 35.5 | 43.1/20.3/40.1 | 39.1/19.8/36.1 | 41.8/18.7/33.8 | 43.3/23.5/39.4 | 50.4/21.9/25.8 | 39.3/10.0/23.7 | 72.3 | 42.0/36.4/1.4/7.8 |
| BART-large (GLGE Team) | 2020-11-24 | 35.3 | 42.8/20.1/39.1 | 38.0/18.3/34.7 | 43.1/19.5/34.1 | 43.4/23.6/38.9 | 49.7/21.6/25.9 | 38.4/9.5/24.0 | 69.4 | 50.4/39.1/1.2/7.4 |
| MASS-middle (GLGE Team) | 2020-11-24 | 33.6 | 41.5/19.0/38.5 | 38.3/19.1/35.4 | 38.4/15.8/30.7 | 39.6/20.9/36.0 | 49.3/20.4/24.4 | 38.3/9.9/22.7 | 67.2 | 44.0/37.3/1.3/6.1 |
| MASS-base (GLGE Team) | 2020-11-24 | 33.0 | 41.2/18.8/38.2 | 37.9/19.1/35.2 | 37.4/14.9/29.8 | 38.9/20.5/35.6 | 48.9/20.0/24.3 | 38.2/9.5/22.8 | 65.0 | 42.8/36.7/1.3/6.2 |
| ProphetNet-base (GLGE Team) | 2020-11-24 | 32.6 | 41.6/19.2/38.7 | 38.6/19.6/35.7 | 37.8/15.3/30.4 | 39.0/20.4/35.7 | 46.4/17.9/22.5 | 37.0/8.7/22.3 | 62.5 | 45.4/37.7/1.4/7.3 |
| Transformer (GLGE Team) | 2020-11-24 | 19.5 | 35.0/11.0/32.4 | 36.7/18.1/34.1 | 27.5/8.3/21.8 | 26.8/9.7/24.3 | 28.3/4.1/9.8 | 27.0/4.2/15.0 | 14.2 | 37.7/29.6/0.2/0.7 |
| LSTM (GLGE Team) | 2020-11-24 | 18.1 | 35.3/14.1/32.8 | 33.3/15.2/31.1 | 21.5/4.6/17.1 | 27.0/12.1/24.9 | 26.6/3.5/8.2 | 18.6/1.7/9.5 | 12.9 | 41.3/35.3/0.1/0.5 |

Leaderboard (11/24/2020-Present) ranked by GLGE-Hard Score (average score on 8 NLG tasks)

| Model | Submission Date | GLGE Score | CNN/DailyMail | Gigaword | XSUM | MSNews | SQuAD 1.1 | MSQG | CoQA | PersonaChat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BART-large (GLGE Team) | 2020-11-24 | 31.0 | 41.7/19.1/37.9 | 33.0/13.6/30.0 | 39.7/16.1/30.9 | 40.8/20.8/36.4 | 45.9/18.1/23.7 | 35.1/8.5/20.7 | 53.5 | 48.3/37.3/1.3/7.2 |
| ProphetNet-large (GLGE Team) | 2020-11-24 | 30.5 | 41.2/18.7/38.0 | 32.4/13.7/29.9 | 39.4/16.1/31.6 | 40.3/20.5/36.4 | 46.4/17.0/22.1 | 34.0/8.2/19.0 | 54.1 | 40.5/35.2/1.8/9.2 |
| MASS-middle (GLGE Team) | 2020-11-24 | 29.1 | 41.1/18.5/38.0 | 32.2/13.5/29.9 | 34.9/12.5/27.6 | 36.6/18.0/33.4 | 45.1/16.0/21.3 | 34.3/8.0/19.0 | 51.2 | 41.4/35.4/1.5/7.6 |
| MASS-base (GLGE Team) | 2020-11-24 | 28.2 | 40.4/18.0/37.3 | 32.2/13.6/29.5 | 33.7/11.6/26.7 | 35.4/17.0/32.4 | 42.8/13.4/19.0 | 34.1/7.5/18.6 | 50.2 | 40.1/34.9/1.6/7.8 |
| ProphetNet-base (GLGE Team) | 2020-11-24 | 28.0 | 40.9/18.4/37.7 | 32.0/13.5/29.5 | 34.2/11.6/26.8 | 35.2/17.0/32.1 | 41.6/13.4/18.9 | 32.3/7.2/18.0 | 48.5 | 41.6/35.5/1.6/8.3 |
| Transformer (GLGE Team) | 2020-11-24 | 14.4 | 28.3/6.2/25.8 | 28.6/10.8/26.5 | 23.0/5.3/18.3 | 18.0/3.5/16.2 | 25.9/1.1/7.0 | 17.0/1.3/8.2 | 9.9 | 30.0/29.7/0.1/0.2 |
| LSTM (GLGE Team) | 2020-11-24 | 12.6 | 26.2/6.8/24.2 | 26.3/9.2/24.6 | 17.8/2.4/14.3 | 8.2/0.9/7.6 | 27.3/1.0/6.7 | 12.5/0.4/5.0 | 10.3 | 36.8/28.7/0.1/0.4 |

GLGE Submission Instructions

To submit your predictions for evaluation, first validate your pipeline against the development set. Once your model meets your expectations on the dev set, email us your test-set predictions for official evaluation; we will reply with your model's performance. To preserve the integrity of the official results, we do not release the reference outputs for the test set to the public. To submit your model for official evaluation on the test set, follow the steps below:

  1. Generate your prediction output for the dev set.
  2. Run the official evaluation scripts found in the task-specific git repo and verify that your system behaves as expected (a quick local sanity-check sketch is given at the end of this section).
  3. Generate your prediction output for the test set and submit the following information by email.

Your email should include:

  1. Prediction results on the test set. Please create a single folder that contains all prediction files (see the submission examples for reference). Each prediction file should be named with the format {task}.{version}.test, where {version} is the difficulty version (easy, medium, or hard) and {task} is the task name (cnndm, gigaword, xsum, msnews, squadqg, msqg, coqa, or personachat); a folder-layout sketch is given after this list. [Required]
  2. Prediction results on dev set. [Optional]
  3. Individual/Team Name: Name of the individual or the team to appear in the leaderboard. [Required]
  4. Individual/Team Institution: Name of the institution of the individual or the team to appear in the leaderboard. [Optional]
  5. Model code: Training code for the model. [Recommended]
  6. Model information: Name of the model/technique to appear in the leaderboard. [Required]
  7. Paper Information: Name, citation, and URL of the paper (if the model comes from published work) to appear in the leaderboard. [Optional]
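As a purely illustrative sketch of item 1, the snippet below assembles a submission folder that follows the {task}.{version}.test naming scheme. The source directory layout, the .pred suffix, and the function name are hypothetical (your own pipeline will differ); only the target file names come from the guideline.

# Hypothetical helper for laying out prediction files as {task}.{version}.test.
# Task and version names mirror the guideline above; paths are illustrative.
from pathlib import Path

TASKS = ["cnndm", "gigaword", "xsum", "msnews", "squadqg", "msqg", "coqa", "personachat"]
VERSIONS = ["easy", "medium", "hard"]

def collect_predictions(pred_dir: str, out_dir: str) -> None:
    """Copy per-task test predictions into a single submission folder."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for task in TASKS:
        for version in VERSIONS:
            src = Path(pred_dir) / task / f"{version}.test.pred"   # your own layout
            dst = out / f"{task}.{version}.test"                   # required file name
            dst.write_text(src.read_text(encoding="utf-8"), encoding="utf-8")

# Example usage (hypothetical paths):
# collect_predictions("outputs", "glge_submission")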

To avoid "p-hacking", we discourage a large number of submissions from the same group within a short period of time.
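For step 2 above, a quick local sanity check on a dev-set file can catch formatting mistakes before you email anything. The sketch below uses the open-source rouge_score package rather than the official GLGE evaluation scripts, so the numbers are only approximate (tokenization and post-processing in the official scripts take precedence), and the file names in the example call are hypothetical.

# Rough dev-set sanity check for a summarization-style task (e.g. gigaword).
# Uses the `rouge_score` package, NOT the official GLGE scripts; approximate only.
from rouge_score import rouge_scorer

def rough_rouge(pred_file: str, ref_file: str) -> dict:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    with open(pred_file, encoding="utf-8") as p, open(ref_file, encoding="utf-8") as r:
        preds, refs = p.read().splitlines(), r.read().splitlines()
    assert len(preds) == len(refs), "prediction/reference line counts differ"
    for pred, ref in zip(preds, refs):
        scores = scorer.score(ref, pred)        # (target, prediction)
        for key in totals:
            totals[key] += scores[key].fmeasure
    return {key: 100 * value / len(preds) for key, value in totals.items()}

# Example usage (hypothetical file names):
# print(rough_rouge("gigaword.easy.dev.pred", "gigaword.easy.dev.ref"))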

About GLGE

The General Language Generation Evaluation (GLGE) benchmark is a new multi-task benchmark for evaluating the generalization capabilities of natural language generation (NLG) models across eight language generation tasks.

How to cite

@article{Liu2020GLGE,
  title={GLGE: A New General Language Generation Evaluation Benchmark},
  author={Dayiheng Liu and Yu Yan and Yeyun Gong and Weizhen Qi and Hang Zhang and Jian Jiao and Weizhu Chen and Jie Fu and Linjun Shou and Ming Gong and Pengcheng Wang and Jiusheng Chen and Daxin Jiang and Jiancheng Lv and Ruofei Zhang and Winnie Wu and Ming Zhou and Nan Duan},
  journal={arXiv},
  year={2020},
  volume={abs/2011.11928}
}

Additionally, since GLGE is built on top of 6 existing datasets, please ensure you cite all of them.

An example: We evaluate our model using the GLGE benchmark \cite{Liu2020GLGE}, a general language generation evaluation benchmark consisting of CNN/DailyMail \cite{hermann2015cnndm} \cite{see2017get}, Gigaword \cite{rush2015neural} \cite{graff2003gigaword}, XSum \cite{narayan2018don}, MSNews, SQuAD 1.1 \cite{rajpurkar2016squad}, MSQG, CoQA \cite{reddy2019coqa}, and PersonaChat \cite{zhang2018personalizing}.

@inproceedings{hermann2015cnndm,
  title={Teaching machines to read and comprehend},
  author={Hermann, Karl Moritz and Kocisky, Tomas and Grefenstette, Edward and Espeholt, Lasse and Kay, Will and Suleyman, Mustafa and Blunsom, Phil},
  booktitle={NIPS},
  pages={1693--1701},
  year={2015}
}

@inproceedings{see2017get,
  title={Get to the point: Summarization with pointer-generator networks},
  author={See, Abigail and Liu, Peter J and Manning, Christopher D},
  booktitle={ACL},
  pages={1073--1083},
  year={2017}
}

@inproceedings{rush2015neural,
  title={A neural attention model for abstractive sentence summarization},
  author={Rush, Alexander M and Chopra, Sumit and Weston, Jason},
  booktitle={EMNLP},
  pages={379--389},
  year={2015}
}

@article{graff2003gigaword,
  title={English gigaword},
  author={Graff, David and Kong, Junbo and Chen, Ke and Maeda, Kazuaki},
  journal={Linguistic Data Consortium, Philadelphia},
  volume={4},
  number={1},
  pages={34},
  year={2003}
}

@inproceedings{narayan2018don,
  title={Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization},
  author={Narayan, Shashi and Cohen, Shay B and Lapata, Mirella},
  booktitle={EMNLP},
  pages={1797--1807},
  year={2018}
}

@inproceedings{rajpurkar2016squad,
  title={Squad: 100,000+ questions for machine comprehension of text},
  author={Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy},
  booktitle={EMNLP},
  pages={2383--2392},
  year={2016}
}

@article{reddy2019coqa,
  title={Coqa: A conversational question answering challenge},
  author={Reddy, Siva and Chen, Danqi and Manning, Christopher D},
  journal={TACL},
  volume={7},
  pages={249--266},
  year={2019}
}

@inproceedings{zhang2018personalizing,
  title={Personalizing dialogue agents: I have a dog, do you have pets too?},
  author={Zhang, Saizheng and Dinan, Emily and Urbanek, Jack and Szlam, Arthur and Kiela, Douwe and Weston, Jason},
  booktitle={ACL},
  pages={2204--2213},
  year={2018}
}