Eureka ML Insights
Evaluating and Understanding Large Foundation Models
Eureka is an open-source framework for standardizing evaluations of large foundation models, going beyond single-score reporting and rankings. We report an in-depth evaluation and analysis of 12 state-of-the-art models across a collection of language and multimodal benchmarks. These benchmarks test fundamental but overlooked capabilities that remain challenging even for the most capable models.
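To illustrate what reporting beyond a single score can look like, here is a minimal Python sketch (not the Eureka API; the model names, capability labels, and scores are hypothetical) that aggregates results per capability so each model's strengths and gaps stay visible instead of being collapsed into one leaderboard number:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-example results: (model, capability, score) triples.
results = [
    ("model_a", "instruction_following", 0.91),
    ("model_a", "long_context_qa", 0.62),
    ("model_b", "instruction_following", 0.88),
    ("model_b", "long_context_qa", 0.74),
]

# Aggregate per capability rather than averaging everything together,
# so a weakness in one capability cannot hide behind strength in another.
by_model = defaultdict(lambda: defaultdict(list))
for model, capability, score in results:
    by_model[model][capability].append(score)

for model, capabilities in by_model.items():
    print(model)
    for capability, scores in capabilities.items():
        print(f"  {capability}: {mean(scores):.2f}")
```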
Read the full report · GitHub
Language Evaluation
Evaluation through Eureka shows that state-of-the-art models have made important advances in the language capabilities of instruction following, long-context question answering, information retrieval, and safety. The analysis also reveals major differences and gaps between models in robustness to context length, factuality and grounding for information retrieval, and refusal behavior.
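As one illustration of a context-length robustness analysis like the one described above, the sketch below (hypothetical data, not the Eureka implementation) buckets question-answering accuracy by context length, exposing degradation that a single aggregate score would hide:

```python
from statistics import mean

# Hypothetical per-example records: context length in tokens and whether
# the model answered correctly.
records = [
    {"context_tokens": 2_000, "correct": True},
    {"context_tokens": 2_000, "correct": True},
    {"context_tokens": 64_000, "correct": True},
    {"context_tokens": 64_000, "correct": False},
    {"context_tokens": 128_000, "correct": False},
]

# Group correctness by context length, then report accuracy per bucket.
buckets = {}
for record in records:
    buckets.setdefault(record["context_tokens"], []).append(record["correct"])

for length in sorted(buckets):
    print(f"{length:>7} tokens: accuracy {mean(buckets[length]):.2f}")
```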
Multimodal Evaluation
State-of-the-art models remain fairly limited in their multimodal abilities, particularly in detailed image understanding. For example, they struggle with object localization, geometric and spatial reasoning, and navigation, capabilities that matter most in truly multimodal scenarios requiring physical awareness, visual grounding, and localization.
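For reference, object localization is commonly scored with intersection-over-union (IoU) between predicted and ground-truth boxes. The sketch below shows the standard computation (the boxes and the 0.5 threshold are illustrative choices, not Eureka's code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Overlap is zero when the boxes do not intersect.
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0

# A prediction typically counts as a correct localization when IoU with
# the ground truth exceeds a threshold; 0.5 is a common choice.
predicted = (40, 30, 120, 110)
ground_truth = (50, 40, 130, 120)
score = iou(predicted, ground_truth)
print(f"IoU = {score:.2f}, hit = {score >= 0.5}")
```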