GEM (A General Evaluation Benchmark on Multi-modal Tasks) is a new benchmark dataset for evaluating the performance of cross-modal pre-trained models on both understanding and generation tasks.
The current version of GEM is composed of 8 tasks. For each task, training and validation sets are provided. GEM is not only the largest vision-language dataset covering image-language and video-language tasks at the same time, but it is also labeled in multiple languages.
The GEM datasets are intended for non-commercial research purposes only, to promote advancement in the field of artificial intelligence and related areas, and are made available free of charge without extending any license or other intellectual property rights. The datasets are provided "as is" without warranty, and use of the data carries risk since we may not own the underlying rights in the documents. We will not be liable for any damages related to use of the dataset. Feedback is voluntarily given and can be used as we see fit. Upon violation of any of these terms, your rights to use the dataset will end automatically.
If you have questions about use of the dataset or any research outputs in your products or services, we encourage you to undertake your own independent legal review. For other questions, please feel free to contact us.
| Rank | Model | Submission Date | QI_mR | QIT_mR | IQ_RougeL | GEM_I Score |
|---|---|---|---|---|---|---|
| 1 | M3P Baseline (GEM Team) | 2021-06-01 | 19.85 | 79.74 | 5.61 | 35.07 |
| Rank | Model | Submission Date | QV_mR | QVT_mR | VQ_RougeL | TQ_RougeL | VTQ_RougeL | GEM_V Score |
|---|---|---|---|---|---|---|---|---|
| 1 | m-UniVL Baseline (GEM Team) | 2021-06-01 | 15.44 | 58.83 | 6.89 | 29.15 | 29.67 | 28.00 |
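In both leaderboards, the aggregate GEM score appears to be the unweighted arithmetic mean of the per-task metrics in that table (e.g., (19.85 + 79.74 + 5.61) / 3 ≈ 35.07 and (15.44 + 58.83 + 6.89 + 29.15 + 29.67) / 5 ≈ 28.00). The sketch below illustrates that aggregation under this assumption; the function name and dictionary layout are illustrative only and are not part of an official GEM API.

```python
# Minimal sketch: aggregate per-task metrics into an overall GEM score.
# Assumption (inferred from the leaderboard tables above): the GEM_I / GEM_V
# score is the unweighted arithmetic mean of the listed per-task metrics.

def gem_score(task_metrics: dict[str, float]) -> float:
    """Return the mean of the per-task metric values."""
    return sum(task_metrics.values()) / len(task_metrics)

# GEM_I (image-language tasks), values from the M3P baseline row above.
gem_i = gem_score({"QI_mR": 19.85, "QIT_mR": 79.74, "IQ_RougeL": 5.61})
print(round(gem_i, 2))  # 35.07

# GEM_V (video-language tasks), values from the m-UniVL baseline row above.
gem_v = gem_score({
    "QV_mR": 15.44, "QVT_mR": 58.83,
    "VQ_RougeL": 6.89, "TQ_RougeL": 29.15, "VTQ_RougeL": 29.67,
})
print(round(gem_v, 2))  # 28.0, matching the 28.00 in the table
```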
Once you have built a model that meets your expectations when evaluated on the dev set, you can submit your test results for official evaluation on the test set. To ensure the integrity of the official results, we do not release the correct answers for the test set to the public. To submit your model for official evaluation on the test set, follow the steps below:
Your email should include:
To avoid "P-hacking" we discourage too many submissions from the same group in a short period of time.
If you use GEM in your research, please cite the following paper:
@inproceedings{lin2021gem,
  title     = {{GEM}: A General Evaluation Benchmark for Multimodal Tasks},
  author    = {Lin Su and Nan Duan and Edward Cui and Lei Ji and Chenfei Wu and Huaishao Luo and Yongfei Liu and Ming Zhong and Taroon Bharti and Arun Sacheti},
  booktitle = {Findings of the Association for Computational Linguistics},
  year      = {2021}
}