GEM (A General Evaluation Benchmark on Multi-modal Tasks) is a new benchmark dataset for evaluating the performance of cross-modal pre-trained models on both understanding and generation tasks.

The current version of GEM is composed of 8 tasks. For each task, training and validation sets are provided. GEM is not only the largest vision-language dataset covering image-language and video-language tasks at the same time, but it is also labeled in multiple languages.

Terms and Conditions

The GEM datasets are intended for non-commercial research purposes only, to promote advancement in the field of artificial intelligence and related areas, and are made available free of charge without extending any license or other intellectual property rights. The datasets are provided “as is” without warranty; using the data carries risk, since we may not own the underlying rights in the documents. We are not liable for any damages related to use of the dataset. Feedback is voluntarily given and can be used as we see fit. Upon violation of any of these terms, your rights to use the dataset will end automatically.

If you have questions about use of the dataset or any research outputs in your products or services, we encourage you to undertake your own independent legal review. For other questions, please feel free to contact us.

GEM Dataset and Leaderboard

Tasks

  1. QI_mR: Query-Image Retrieval
  2. QIT_mR: Query-Image-Title Retrieval
  3. IQ_RougeL: Image-Query Captioning
  4. QV_mR: Query-Video Retrieval
  5. QVT_mR: Query-Video-Title Retrieval
  6. VQ_RougeL: Video-Query Captioning
  7. TQ_RougeL: Title-Query Captioning
  8. VTQ_RougeL: Video-Title-Query Captioning
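
For reference, the *_mR tasks are retrieval tasks, while the *_RougeL tasks are captioning tasks scored with Rouge-L, an F-measure based on the longest common subsequence (LCS). The official scoring scripts live in the task-specific repos; the Python sketch below only illustrates the Rouge-L computation, and its recall weight beta is a value commonly used in captioning evaluation, not a GEM-specific constant.

def lcs_length(ref_tokens, hyp_tokens):
    # Longest common subsequence length via dynamic programming.
    m, n = len(ref_tokens), len(hyp_tokens)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == hyp_tokens[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

def rouge_l(reference, hypothesis, beta=1.2):
    # Rouge-L F-measure between a gold query and a generated query.
    # beta > 1 weights recall over precision; 1.2 is an assumption here.
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

print(rouge_l("red dress for women", "red dress women"))  # ~0.84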

Relevant Links

GEM Submission Guideline / GitHub
GEM Paper
M3P Baseline
UniVL Baseline

GEM-I Leaderboard (06/01/2021-Present), ranked by GEM-I Score (average score over the 3 image tasks)

Rank  Model                     Submission Date  QI_mR  QIT_mR  IQ_RougeL  GEM-I Score
1     M3P Baseline (GEM Team)   2021-06-01       19.85  79.74   5.61       35.07

GEM-V Leaderboard (06/01/2021-Present), ranked by GEM-V Score (average score over the 5 video tasks)

Rank  Model                         Submission Date  QV_mR  QVT_mR  VQ_RougeL  TQ_RougeL  VTQ_RougeL  GEM-V Score
1     m-UniVL Baseline (GEM Team)   2021-06-01       15.44  58.83   6.89       29.15      29.67       28.00
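
Both leaderboard scores are the unweighted mean of the per-task numbers in the corresponding row, which can be verified directly from the baseline rows above:

# GEM-I Score: mean of the 3 image-task scores (M3P baseline row)
print(round((19.85 + 79.74 + 5.61) / 3, 2))                   # 35.07

# GEM-V Score: mean of the 5 video-task scores (m-UniVL baseline row)
print(round((15.44 + 58.83 + 6.89 + 29.15 + 29.67) / 5, 2))   # 28.0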

GEM Submission Instructions

Once you have built a model that meets your expectations in evaluation on the dev set, you can submit your test results for official evaluation on the test set. To preserve the integrity of the official test results, we do not release the correct answers for the test set to the public. To submit your model for official evaluation on the test set, follow the steps below:

  1. Generate your prediction output for the dev set.
  2. Run the official evaluation methodologies found in the task-specific git repo and verify that your system behaves as expected (a sanity-check sketch appears at the end of this section).
  3. Generate your prediction output for the test set and submit the following information by email.

Your email should include:

  1. Prediction results on test set. [Required]
  2. Prediction results on dev set. [Recommended]
  3. Individual/Team Name: Name of the individual or the team to appear in the leaderboard. [Required]
  4. Individual/Team Institution: Name of the institution of the individual or the team to appear in the leaderboard. [Optional]
  5. Model code: Training code for the model. [Recommended]
  6. Model information: Name of the model/technique to appear in the leaderboard. [Required]
  7. Paper Information: Name, Citation, URL of the paper if model is from a published work to appear in the leaderboard. [Optional]

To avoid "p-hacking", we discourage excessive submissions from the same group within a short period of time.
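
The exact prediction file formats are defined in the task-specific repos, so the following is only a hypothetical sanity check for a captioning task, assuming one JSON object per line with invented field names and reusing the rouge_l sketch from the Tasks section:

import json

def score_dev_predictions(path):
    # Hypothetical layout: {"id": ..., "reference": ..., "prediction": ...}
    # per line; the real formats are defined in the task-specific repos.
    scores = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            scores.append(rouge_l(example["reference"], example["prediction"]))
    return 100 * sum(scores) / len(scores)  # leaderboard numbers use a 0-100 scale

print(score_dev_predictions("iq_dev_predictions.jsonl"))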

About GEM

GEM (A General Evaluation Benchmark on Multi-modal Tasks) is a new benchmark dataset for evaluating the performance of cross-modal pre-trained models on both understanding and generation tasks.

How to cite

@inproceedings{lin2021gem,
  title={{GEM}: A General Evaluation Benchmark for Multimodal Tasks},
  author={Lin Su and Nan Duan and Edward Cui and Lei Ji and Chenfei Wu and Huaishao Luo and Yongfei Liu and Ming Zhong and Taroon Bharti and Arun Sacheti},
  booktitle={Findings of the Association for Computational Linguistics},
  year={2021}
}


Current Team

Edward Cui
Principal Software Engineering Manager

Lin Su
Senior Software Engineer

Chenfei Wu
Researcher

Ming Zhong
Software Engineer II