Challenge 06 - LLM Evaluation and Quality Gates

< Previous Challenge - Home - Next Challenge >

Introduction

You can’t just ship AI without testing. What if the agent returns a non-existent destination? What if the itinerary is way too long or short? What if recommendations are unsafe (war zones, extreme weather)? What if the response includes toxicity or negativity?

In this challenge, you’ll build an automated quality gate for your AI agents using New Relic’s AI Monitoring platform. Quality gates ensure that only high-quality travel plans reach your customers.

Description

Your goal is to implement a comprehensive evaluation and quality assurance system for your AI-generated travel plans. This involves several layers of evaluation working together.

Layer 1: Custom Events for New Relic AI Monitoring

OpenTelemetry defines an Event as a LogRecord with a non-empty EventName. Custom Events are a core signal in the New Relic platform. However, despite sharing a name, OpenTelemetry Events and New Relic Custom Events are not identical concepts:

Because of these differences, OpenTelemetry Events are ingested as New Relic Logs: most of the time, an OpenTelemetry Event is more similar to a New Relic Log than to a New Relic Custom Event.

However, you can explicitly signal that an OpenTelemetry LogRecord should be ingested as a Custom Event by adding an entry to LogRecord.attributes of the form newrelic.event.type=<EventType>.

For example, a LogRecord with attribute newrelic.event.type=MyEvent will be ingested as a Custom Event with type=MyEvent, and accessible via NRQL with: SELECT * FROM MyEvent.

The foundation of enterprise AI evaluation is capturing AI interactions as structured events. New Relic’s AI Monitoring uses a special attribute newrelic.event.type that automatically populates:

You need to emit three custom events after each LLM interaction:

Layer 2: Rule-Based Evaluation

Implement deterministic checks against business rules:

Layer 3: Integration into Your Application

Integrate the evaluation system into your Flask application:

Layer 4: User Feedback Collection

Capture real user feedback to measure actual satisfaction with AI-generated travel plans:

This feedback data will help you:

Layer 5: LLM-Based Quality Evaluation (Optional)

Use another LLM to evaluate responses for:

Accessing New Relic AI Monitoring

Once you emit the custom events, you can access New Relic’s curated AI Monitoring experience:

Hint: You may need to pin the “AI Monitoring” section in New Relic’s sidebar via “All capabilities” to see it. (Screenshot: AI Monitoring sidebar)

Success Criteria

To complete this challenge successfully, you should be able to:

Learning Resources

Tips

Advanced Challenges (Optional)