Project Eureka | Project Eureka

Language Evaluation

The evaluation through Eureka shows that there have been important advances from state-of-the-art models in the language capabilities of instruction following, long context question answering, information retrieval, and safety. The analysis also discovers major differences and gaps between models related to robustness to context length, factuality and grounding for information retrieval, and refusal behavior.

01Faster improvements for instruction following across all model families

02All models' performance in question answering drops with longer context

03Major gaps in factuality and grounding for information retrieval from parametric knowledge or input context

04High refusal rates. Lower accuracy in detecting toxic content vs. neutral content for most models.

Multimodal Evaluation

State-of-the-art models are still fairly limited in their multimodal abilities, specifically when it comes to detailed image understanding. For example, these models struggle with localizing objects, geometric and spatial reasoning, and navigation, which are all examples of capabilities that are most needed in truly multimodal scenarios that require physical awareness, visual grounding and localization.

01State-of-the-art multimodal models struggle with geometric reasoning

02Multimodal capabilities lag language capabilities

03Complementary performance across models for fundamental multimodal skills

Eureka ML Insights

Evaluating and Understanding Large Foundation Models