GEM (A General Evaluation Benchmark on Multi-modal Tasks) is a new benchmark dataset for evaluating the performance of cross-modal pre-trained models on both understanding and generation tasks.
The current version of GEM is composed of 8 tasks. For each task, training and validation sets are provided. GEM is not only the largest vision-language dataset covering image-language and video-language tasks at the same time, but it is also labeled in multiple languages.
The GEM datasets are intended for non-commercial research purposes only, to promote advancement in the field of artificial intelligence and related areas, and are made available free of charge without extending any license or other intellectual property rights. The datasets are provided "as is" without warranty, and use of the data carries risk since we may not own the underlying rights in the documents. We will not be liable for any damages related to use of the dataset. Feedback is voluntarily given and can be used as we see fit. Upon violation of any of these terms, your rights to use the dataset will end automatically.
If you have questions about use of the dataset or any research outputs in your products or services, we encourage you to undertake your own independent legal review. For other questions, please feel free to contact us.
| Rank | Model | Submission Date | QI_mR | QIT_mR | IQ_RougeL | GEM_I Score |
|---|---|---|---|---|---|---|
| 1 | M3P Baseline (GEM Team) | 2021-06-01 | 19.85 | 79.74 | 5.61 | 35.07 |
| Rank | Model | Submission Date | QV_mR | QVT_mR | VQ_RougeL | TQ_RougeL | VTQ_RougeL | GEM_V Score |
|---|---|---|---|---|---|---|---|---|
| 1 | m-UniVL Baseline (GEM Team) | 2021-06-01 | 15.44 | 58.83 | 6.89 | 29.15 | 29.67 | 28.00 |
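In both leaderboards, the aggregate GEM score appears to be the unweighted arithmetic mean of the per-task metrics in that table (e.g., (19.85 + 79.74 + 5.61) / 3 ≈ 35.07 and (15.44 + 58.83 + 6.89 + 29.15 + 29.67) / 5 ≈ 28.00). The sketch below illustrates that aggregation under this assumption; the function name and dictionary layout are illustrative only and are not part of an official GEM API.

```python
# Minimal sketch: aggregate per-task metrics into an overall GEM score.
# Assumption (inferred from the leaderboard tables above): the GEM_I / GEM_V
# score is the unweighted arithmetic mean of the listed per-task metrics.

def gem_score(task_metrics: dict[str, float]) -> float:
    """Return the mean of the per-task metric values."""
    return sum(task_metrics.values()) / len(task_metrics)

# GEM_I (image-language tasks), values from the M3P baseline row above.
gem_i = gem_score({"QI_mR": 19.85, "QIT_mR": 79.74, "IQ_RougeL": 5.61})
print(round(gem_i, 2))  # 35.07

# GEM_V (video-language tasks), values from the m-UniVL baseline row above.
gem_v = gem_score({
    "QV_mR": 15.44, "QVT_mR": 58.83,
    "VQ_RougeL": 6.89, "TQ_RougeL": 29.15, "VTQ_RougeL": 29.67,
})
print(round(gem_v, 2))  # 28.0, matching the 28.00 in the table
```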
Once you have built a model that meets your expectations when evaluated on the dev set, you can submit your test results for official evaluation on the test set. To ensure the integrity of the official results, we do not release the correct answers for the test set to the public. To submit your model for official evaluation on the test set, follow the steps below:
Your email should include:
To avoid "P-hacking" we discourage too many submissions from the same group in a short period of time.
If you use GEM in your research, please cite the following paper:
@inproceedings{lin2021gem,
  title     = {{GEM}: A General Evaluation Benchmark for Multimodal Tasks},
  author    = {Lin Su and Nan Duan and Edward Cui and Lei Ji and Chenfei Wu and Huaishao Luo and Yongfei Liu and Ming Zhong and Taroon Bharti and Arun Sacheti},
  booktitle = {Findings of the Association for Computational Linguistics},
  year      = {2021}
}