# AutoGen Agents in Production: Observability and Evaluation

In this tutorial, we will learn how to **monitor the internal steps (traces) of [AutoGen agents](https://github.com/microsoft/autogen)** and **evaluate their performance** using [Langfuse](https://langfuse.com).

This guide covers the **online** and **offline** evaluation metrics that teams use to bring agents to production quickly and reliably.

**Why AI agent evaluation matters:**
- Debugging issues when tasks fail or produce suboptimal results
- Monitoring costs and performance in real time
- Improving reliability and safety through continuous feedback


## Step 1: Set Environment Variables

Get your Langfuse API keys by signing up for [Langfuse Cloud](https://cloud.langfuse.com/) or by [self-hosting Langfuse](https://langfuse.com/self-hosting).

_**Note:** Self-hosters can use the [Terraform modules](https://langfuse.com/self-hosting/azure) to deploy Langfuse on Azure. Alternatively, you can deploy Langfuse on Kubernetes using the [Helm chart](https://langfuse.com/self-hosting/kubernetes-helm)._


In [5]:
import os

# Get keys for your project from the project settings page: https://cloud.langfuse.com
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..." 
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..." 
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com" # 🇪🇺 EU region
# os.environ["LANGFUSE_HOST"] = "https://us.cloud.langfuse.com" # 🇺🇸 US region

With the environment variables set, we can now initialize the Langfuse client. The `Langfuse()` constructor used below reads its credentials from the environment variables, and `get_client()`, used later in this guide, returns a client configured from the same credentials.


In [6]:
from langfuse import Langfuse
 
# Filter out Autogen OpenTelemetry spans
langfuse = Langfuse(
    blocked_instrumentation_scopes=["autogen SingleThreadedAgentRuntime"]
)
 
# Verify connection
if langfuse.auth_check():
    print("Langfuse client is authenticated and ready!")
else:
    print("Authentication failed. Please check your credentials and host.")

Langfuse client is authenticated and ready!
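
If the check fails, a frequent cause is a missing or empty environment variable. The following is a minimal sketch of a pre-flight check. The helper name `missing_langfuse_vars` is hypothetical (it is not part of the Langfuse SDK), and treating all three variables from Step 1 as required is an assumption made for this guide.

```python
import os
from typing import List, Mapping, Optional

# Variables set in Step 1; treating all three as required is an assumption of this sketch.
REQUIRED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")


def missing_langfuse_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_langfuse_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All Langfuse environment variables are set.")
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helper easy to test without touching the real environment.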


## Step 2: Initialize OpenLit Instrumentation

Now, we initialize the [OpenLit](https://github.com/openlit/openlit) instrumentation. OpenLit automatically captures AutoGen operations and exports OpenTelemetry (OTel) spans to Langfuse.


In [7]:
import openlit
 
# Initialize OpenLIT instrumentation. The disable_batch flag is set to true to process traces immediately.
openlit.init(tracer=langfuse._otel_tracer, disable_batch=True, disabled_instrumentors=["mistral"])

## Step 3: Run Your Agent

Now we set up a multi-turn, two-agent team to test our instrumentation.


In [2]:
import os

from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.azure import AzureAIChatCompletionClient
from azure.core.credentials import AzureKeyCredential
from autogen_agentchat.base import TaskResult

from autogen_agentchat.conditions import TextMentionTermination
from autogen_agentchat.teams import RoundRobinGroupChat

In [3]:
client = AzureAIChatCompletionClient(
    model="gpt-4o-mini",
    endpoint="https://models.inference.ai.azure.com",
    # To authenticate with the model you will need to generate a personal access token (PAT) in your GitHub settings.
    # Create your PAT token by following instructions here: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens
    credential=AzureKeyCredential(os.environ["GITHUB_TOKEN"]),
    model_info={
        "json_output": True,
        "function_calling": True,
        "vision": True,
        "family": "unknown",
        "structured_output": False
    },
)

In [8]:
# 🍴 Agent 1: proposes ONE healthy meal idea each turn
meal_planner_agent = AssistantAgent(
    "meal_planner_agent",
    model_client=client,
    description="A seasoned meal-planning coach who suggests balanced meals.",
    system_message="""
    You are a Meal-Planning Assistant with a decade of experience helping busy people prepare meals.
    Goal: propose the single best meal (breakfast, lunch, or dinner) given the user's context.
    Each response must contain ONLY one complete meal idea (title + very brief component list), no extras.
    Keep it concise: skip greetings, chit-chat, and filler.
    """,
)

# 🥗 Agent 2: checks nutritional quality & variety
nutritionist_agent = AssistantAgent(
    "nutritionist_agent",
    model_client=client,
    description="A registered dietitian ensuring meals meet nutritional standards.",
    system_message="""
    You are a Nutritionist focused on whole-food, macro-balanced eating.
    Evaluate the meal_planner_agent's recommendation.
    If the meal is nutritionally sound, sufficiently varied, and portion-appropriate, respond with 'APPROVE'.
    Otherwise, give high-level guidance on how to improve it (e.g. 'add a plant-based protein'); do NOT provide a full alternative recipe.
    """,
)

In [9]:
# ✅ Chat stops once the nutritionist says APPROVE
termination = TextMentionTermination("APPROVE")

# 🔄 Alternate turns between the two agents until termination
team = RoundRobinGroupChat(
    [meal_planner_agent, nutritionist_agent],
    termination_condition=termination,
)

# Example kickoff
user_input = "I'm looking for a quick, delicious dinner I can prep after work. I have 30 minutes and minimal clean-up is ideal."

In [None]:
with langfuse.start_as_current_span(name="create_meal_plan") as span:
    async for message in team.run_stream(task=user_input):
        if isinstance(message, TaskResult):
            print("Stop Reason:", message.stop_reason)
        else:
            print(message)

    span.update_trace(
        input=user_input,
        output=message.stop_reason,
    )

# Flush the trace to Langfuse for short-lived environments such as Jupyter Notebooks
langfuse.flush()

### Trace Structure

Langfuse records a **trace** that contains **spans**, each representing one step of your agent's logic. Here, the trace includes the overall agent run and sub-spans for:
- The meal planner agent
- The nutritionist agent

You can inspect these to see exactly where time is spent, how many tokens are used, and so on:

![Trace tree in Langfuse](https://langfuse.com/images/cookbook/example-autogen-evaluation/trace-tree.png)

_[Link to the trace](https://cloud.langfuse.com/project/cloramnkj0002jz088vzn1ja4/traces/dac2b33e7cd709e685ccf86a137ecc64)_


## Online Evaluation

Online evaluation means evaluating the agent in a live, real-world environment, that is, during actual use in production. It involves monitoring the agent's performance on real user interactions and continuously analyzing the results.

### Common Metrics to Track in Production

1. **Costs**: The instrumentation captures token usage, which you can convert into approximate costs by assigning a price per token.
2. **Latency**: Observe the time it takes to complete each step or the entire run.
3. **User Feedback**: Users can give direct feedback (thumbs up/down) to help refine or correct the agent.
4. **LLM-as-a-Judge**: Use a separate LLM to evaluate the agent's output in near real time (for example, checking for toxicity or correctness).

Below, we show examples of these metrics.


#### 1. Costs

Below is a screenshot showing usage for `gpt-4o-mini` calls. This is useful for identifying costly steps and optimizing the agent.

![Costs](https://langfuse.com/images/cookbook/example-autogen-evaluation/gpt-4o-costs.png)

_[Link to the trace](https://cloud.langfuse.com/project/cloramnkj0002jz088vzn1ja4/traces/dac2b33e7cd709e685ccf86a137ecc64)_
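
To make the token-to-cost conversion concrete, here is a minimal sketch of the arithmetic. The `TokenPrice` class and `estimate_cost` function are hypothetical helpers, and the prices are placeholders chosen to keep the numbers easy to follow, not real model pricing. The token counts (95 prompt, 30 completion) match the `meal_planner_agent` call in the sample output of the user feedback section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """Price in USD per 1,000 tokens (placeholder values, not real pricing)."""
    input_per_1k: float
    output_per_1k: float


def estimate_cost(prompt_tokens: int, completion_tokens: int, price: TokenPrice) -> float:
    """Approximate the cost of one model call from its token counts."""
    input_cost = (prompt_tokens / 1000) * price.input_per_1k
    output_cost = (completion_tokens / 1000) * price.output_per_1k
    return input_cost + output_cost


# Hypothetical prices: 1.00 USD per 1k input tokens, 2.00 USD per 1k output tokens.
example_price = TokenPrice(input_per_1k=1.0, output_per_1k=2.0)

# 95 prompt tokens and 30 completion tokens: 0.095 + 0.060
print(f"Estimated cost: {estimate_cost(95, 30, example_price):.3f} USD")
```

Input and output tokens are priced separately because providers typically charge different rates for each.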


#### 2. Latency

We can also see how long each step took to complete. In the example below, the entire run took about 3 seconds, which you can break down by step. This helps you identify bottlenecks and optimize the agent.

![Latency](https://langfuse.com/images/cookbook/example-autogen-evaluation/agent-latency.png)

_[Link to the trace](https://cloud.langfuse.com/project/cloramnkj0002jz088vzn1ja4/traces/dac2b33e7cd709e685ccf86a137ecc64?display=timeline)_
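
Langfuse shows span durations for you, but the same per-step breakdown can be approximated from the `created_at` timestamps that AutoGen attaches to each message. The sketch below is only an illustration of that arithmetic: `step_latencies` is a hypothetical helper, and the timestamps are copied from the sample output of the "Create a meal with potatoes" request in the user feedback section.

```python
from datetime import datetime, timezone
from typing import List


def step_latencies(timestamps: List[datetime]) -> List[float]:
    """Return the seconds elapsed between each pair of consecutive timestamps."""
    return [
        (later - earlier).total_seconds()
        for earlier, later in zip(timestamps, timestamps[1:])
    ]


# created_at values copied from the sample message log in this guide
created_at = [
    datetime(2025, 7, 2, 16, 20, 43, 732669, tzinfo=timezone.utc),  # user task
    datetime(2025, 7, 2, 16, 20, 45, 186423, tzinfo=timezone.utc),  # meal_planner_agent reply
    datetime(2025, 7, 2, 16, 20, 45, 581059, tzinfo=timezone.utc),  # nutritionist_agent reply
]

per_step = step_latencies(created_at)
total = (created_at[-1] - created_at[0]).total_seconds()

for label, seconds in zip(["meal_planner_agent", "nutritionist_agent"], per_step):
    print(f"{label}: {seconds:.3f}s")
print(f"total: {total:.3f}s")
```

In this log the meal planner's reply accounts for most of the elapsed time, which is the kind of bottleneck the timeline view makes visible.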


#### 3. User Feedback

If your agent is embedded in a user interface, you can record direct user feedback (such as a thumbs up/down in a chat UI).


In [10]:
from langfuse import get_client
 
langfuse = get_client()
 
# Option 1: Use the yielded span object from the context manager
with langfuse.start_as_current_span(
    name="autogen-request-user-feedback-1") as span:
    
    async for message in team.run_stream(task="Create a meal with potatoes"):
        if isinstance(message, TaskResult):
            print("Stop Reason:", message.stop_reason)
        else:
            print(message)
 
    # Score using the span object
    span.score_trace(
        name="user-feedback",
        value=1,
        data_type="NUMERIC",
        comment="This was delicious, thank you"
    )
 
# Option 2: Use langfuse.score_current_trace() if still in context
with langfuse.start_as_current_span(name="autogen-request-user-feedback-2") as span:
    # ... Autogen execution ...

    async for message in team.run_stream(task="I am allergic to gluten."):
        if isinstance(message, TaskResult):
            print("Stop Reason:", message.stop_reason)
        else:
            print(message)
 
    # Score using current context
    langfuse.score_current_trace(
        name="user-feedback",
        value=1,
        data_type="NUMERIC"
    )

id='da068880-22ae-4f01-9f01-2bb231939089' source='user' models_usage=None metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 20, 43, 732669, tzinfo=datetime.timezone.utc) content='Create a meal with potatoes' type='TextMessage'
id='ad937ce4-3534-493f-824b-ca9c226b5287' source='meal_planner_agent' models_usage=RequestUsage(prompt_tokens=95, completion_tokens=30) metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 20, 45, 186423, tzinfo=datetime.timezone.utc) content='Potato and Spinach Frittata  \n- Eggs  \n- Potatoes  \n- Fresh spinach  \n- Onion  \n- Cheese (optional)  ' type='TextMessage'
id='50fd33c1-057f-49fe-afad-ee86d164296d' source='nutritionist_agent' models_usage=RequestUsage(prompt_tokens=132, completion_tokens=4) metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 20, 45, 581059, tzinfo=datetime.timezone.utc) content='APPROVE' type='TextMessage'
Stop Reason: Text 'APPROVE' mentioned
id='e371de6c-e5fc-42c1-8eda-e5b8cd5accab' source='user' models_usage=None met

In [None]:
# Option 3: Use create_score() with trace ID (when outside context)
langfuse.create_score(
    trace_id="predefined_trace_id",
    name="user-feedback",
    value=1,
    data_type="NUMERIC",
    comment="This was correct, thank you"
)

User feedback is then captured in Langfuse:

![User feedback is captured in Langfuse](https://langfuse.com/images/cookbook/example-autogen-evaluation/user-feedback.png)
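
In a chat UI, the thumbs up/down click usually arrives as a boolean that has to be turned into a score. The following sketch shows one way to build the arguments used in the examples above. `feedback_to_score` is a hypothetical helper, and mapping thumbs down to `0` is an assumption; the examples in this guide only show a positive score of `1`.

```python
from typing import Any, Dict, Optional


def feedback_to_score(trace_id: str, thumbs_up: bool, comment: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for a user-feedback score (1 = thumbs up, 0 = thumbs down)."""
    score: Dict[str, Any] = {
        "trace_id": trace_id,
        "name": "user-feedback",
        "value": 1 if thumbs_up else 0,  # assumption: thumbs down is recorded as 0
        "data_type": "NUMERIC",
    }
    if comment:
        score["comment"] = comment
    return score


# "predefined_trace_id" is the same placeholder used in the example above
print(feedback_to_score("predefined_trace_id", thumbs_up=False, comment="Too much cheese"))
```

The resulting dictionary can be unpacked into the call shown in Option 3, for example `langfuse.create_score(**feedback_to_score(trace_id, True))`.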


#### 4. Automated LLM-as-a-Judge Scoring

LLM-as-a-Judge is another way to automatically evaluate your agent's output. You can set up a separate LLM call to gauge the output's correctness, toxicity, style, or any other criterion you care about.

**Workflow**:
1. You define an **Evaluation Template**, for example: "Check if the text is toxic."
2. You set a model to be used as the judge; in this case, `gpt-4o-mini` queried through Azure.
3. Each time your agent generates an output, you pass it to the "judge" LLM together with the template.
4. The judge LLM responds with a score or label that you log to your observability tool.

Example from Langfuse:

![LLM-as-a-Judge Evaluator](https://langfuse.com/images/cookbook/example-autogen-evaluation/evaluator.png)
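
In this guide the evaluator is configured in the Langfuse UI, so no code is needed. For readers who want to see the workflow spelled out, the sketch below mirrors its steps in plain Python. Everything in it is an assumption made for illustration: the template wording, the helper names, and the `call_llm` callable, which stands in for whatever client you use to query the judge model. A stub replaces the real model so the example runs offline.

```python
from typing import Callable

# Hypothetical evaluation template; the wording is an assumption of this sketch.
EVAL_TEMPLATE = (
    "Check whether the following text is toxic. "
    "Answer with exactly one word: 'toxic' or 'non-toxic'.\n\n"
    "Text:\n{output}"
)

ALLOWED_LABELS = {"toxic", "non-toxic"}


def build_judge_prompt(agent_output: str) -> str:
    """Combine the evaluation template with the agent output."""
    return EVAL_TEMPLATE.format(output=agent_output)


def parse_judge_label(raw_reply: str) -> str:
    """Normalize the judge's reply and reject anything that is not an allowed label."""
    label = raw_reply.strip().lower().rstrip(".")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"Unexpected judge reply: {raw_reply!r}")
    return label


def judge(agent_output: str, call_llm: Callable[[str], str]) -> str:
    """Send the agent output to the judge model and return the parsed label."""
    return parse_judge_label(call_llm(build_judge_prompt(agent_output)))


def stub_judge_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs without network access."""
    return " Non-toxic.\n"


print(judge("Potato and Spinach Frittata", stub_judge_model))
```

Validating the reply against a fixed set of labels keeps malformed judge answers from being logged as scores.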


In [12]:
# Keep the task in a variable so the trace input matches what the team actually received
task = "I am a picky eater and not sure if you find something for me."

with langfuse.start_as_current_span(name="autogen-request-user-feedback-2") as span:

    async for message in team.run_stream(task=task):
        if isinstance(message, TaskResult):
            print("Stop Reason:", message.stop_reason)
        else:
            print(message)

    span.update_trace(
        input=task,
        output=message.stop_reason,
    )

langfuse.flush()

id='eefc628d-502f-451a-8f70-be486f62f8c5' source='user' models_usage=None metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 38, 29, 171393, tzinfo=datetime.timezone.utc) content='I am a picky eater and not sure if you find something for me.' type='TextMessage'
id='13b3e14b-bcf7-42a5-80d6-54b0c7be765e' source='meal_planner_agent' models_usage=RequestUsage(prompt_tokens=352, completion_tokens=27) metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 38, 30, 433516, tzinfo=datetime.timezone.utc) content='Chicken Alfredo Pasta  \n- Gluten-free pasta  \n- Grilled chicken breast  \n- Heavy cream  \n- Parmesan cheese  \n- Garlic  ' type='TextMessage'
id='550f2dee-0e08-4bbd-b67f-991b467328f1' source='nutritionist_agent' models_usage=RequestUsage(prompt_tokens=386, completion_tokens=17) metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 38, 31, 505173, tzinfo=datetime.timezone.utc) content='Consider incorporating some vegetables, like spinach or broccoli, to increase the nutrien

You can see that the response in this example is judged "non-toxic".

![LLM-as-a-Judge Evaluation Score](https://langfuse.com/images/cookbook/example-autogen-evaluation/llm-as-a-judge-score.png)


#### 5. Observability Metrics Overview

All of these metrics can be visualized together in dashboards. This lets you quickly see how your agent performs across many sessions and helps you track quality metrics over time.

![Observability metrics overview](https://langfuse.com/images/cookbook/example-autogen-evaluation/dashboard.png)


## Offline Evaluation

Online evaluation is essential for live feedback, but you also need **offline evaluation**: systematic checks before or during development. This helps maintain quality and reliability before changes are rolled out to production.


### Dataset Evaluation

In offline evaluation, you typically:
1. Have a benchmark dataset (with pairs of questions and expected answers)
2. Run your agent on that dataset
3. Compare the outputs to the expected results, or use an additional scoring mechanism
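
The three steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the method used in the rest of this guide: the two-item dataset and the `toy_agent` stub are made up for the example, and normalized exact match is a deliberately crude comparison that a real setup would likely replace with a more tolerant scorer.

```python
from typing import Callable, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase the text and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(
    dataset: Iterable[Tuple[str, str]],
    agent: Callable[[str], str],
) -> float:
    """Run the agent on every question and return the fraction of exact matches."""
    pairs = list(dataset)
    if not pairs:
        return 0.0
    correct = sum(
        normalize(agent(question)) == normalize(expected)
        for question, expected in pairs
    )
    return correct / len(pairs)


# Made-up benchmark items, used only to exercise the loop
toy_dataset = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Italy?", "Rome"),
]


def toy_agent(question: str) -> str:
    """Stub agent that knows one answer; stands in for a real agent call."""
    return "  paris " if "France" in question else "I don't know"


print(f"Accuracy: {exact_match_accuracy(toy_dataset, toy_agent):.2f}")
```

Exact match breaks down quickly for free-form answers, which is why the "additional scoring mechanism" in step 3 is often an LLM-based evaluator.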

Below, we demonstrate this approach with the [q&a-dataset](https://huggingface.co/datasets/junzhang1207/search-dataset), which contains questions and expected answers.


In [16]:
import pandas as pd
from datasets import load_dataset
 
# Fetch search-dataset from Hugging Face
dataset = load_dataset("junzhang1207/search-dataset", split="train")
df = pd.DataFrame(dataset)
print("First few rows of search-dataset:")
print(df.head())


First few rows of search-dataset:
                                     id  \
0  20caf138-0c81-4ef9-be60-fe919e0d68d4   
1  1f37d9fd-1bcc-4f79-b004-bc0e1e944033   
2  76173a7f-d645-4e3e-8e0d-cca139e00ebe   
3  5f5ef4ca-91fe-4610-a8a9-e15b12e3c803   
4  64dbed0d-d91b-4acd-9a9c-0a7aa83115ec   

                                            question  \
0                 steve jobs statue location budapst   
1  Why is the Battle of Stalingrad considered a t...   
2  In what year did 'The Birth of a Nation' surpa...   
3  How many Russian soldiers surrendered to AFU i...   
4   What event led to the creation of Google Images?   

                                     expected_answer       category       area  
0  The Steve Jobs statue is located in Budapest, ...           Arts  Knowledge  
1  The Battle of Stalingrad is considered a turni...   General News       News  
2  This question is based on a false premise. 'Th...  Entertainment       News  
3  About 300 Russian soldiers surrendered to t

Next, we create a dataset entity in Langfuse to track the runs. Then we add each item from the dataset to it.


In [17]:
from langfuse import Langfuse
langfuse = Langfuse()
 
langfuse_dataset_name = "qa-dataset_autogen-agent"
 
# Create a dataset in Langfuse
langfuse.create_dataset(
    name=langfuse_dataset_name,
    description="q&a dataset uploaded from Hugging Face",
    metadata={
        "date": "2025-03-21",
        "type": "benchmark"
    }
)

Dataset(id='cmcm7524d00kjad07s2cjwqcf', name='qa-dataset_autogen-agent', description='q&a dataset uploaded from Hugging Face', metadata={'date': '2025-03-21', 'type': 'benchmark'}, project_id='cloramnkj0002jz088vzn1ja4', created_at=datetime.datetime(2025, 7, 2, 16, 54, 7, 357000, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 7, 2, 16, 54, 7, 357000, tzinfo=datetime.timezone.utc))

In [18]:
df_25 = df.sample(25) # For this example, we upload only 25 dataset questions

for idx, row in df_25.iterrows():
    langfuse.create_dataset_item(
        dataset_name=langfuse_dataset_name,
        input={"text": row["question"]},
        expected_output={"text": row["expected_answer"]}
    )

![Elemente ale setului de date √Æn Langfuse](https://langfuse.com/images/cookbook/example-autogen-evaluation/example-dataset.png)


#### Running the Agent on the Dataset

First, we build a simple AutoGen agent that answers questions using Azure OpenAI models.


In [8]:
import os
from dotenv import load_dotenv

from autogen_agentchat.agents import AssistantAgent
from autogen_core.models import UserMessage
from autogen_ext.models.azure import AzureAIChatCompletionClient
from azure.core.credentials import AzureKeyCredential
from autogen_core import CancellationToken
from autogen_agentchat.messages import TextMessage

In [None]:
load_dotenv()
client = AzureAIChatCompletionClient(
    model="gpt-4o",
    endpoint="https://models.inference.ai.azure.com",
    # To authenticate with the model you will need to generate a personal access token (PAT) in your GitHub settings.
    # Create your PAT token by following instructions here: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens
    credential=AzureKeyCredential(os.getenv("GITHUB_TOKEN")),
    max_tokens=5000,
    model_info={
        "json_output": True,
        "function_calling": False,
        "vision": False,
        "family": "unknown",
        "structured_output": True,
    },
)

result = await client.create([UserMessage(content="What is the capital of France?", source="user")])
print(result)

In [18]:
agent = AssistantAgent(
    name="assistant",
    model_client=client,
    tools=[],
    system_message="You are a participant in a quiz show and you are given a question. You need to create a short answer to the question.",
)

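Then, we define a helper function `my_agent()` that runs the agent inside a Langfuse span, records the query and the answer on the trace, and returns the answer as a string.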


In [19]:
async def my_agent(user_query: str):

    with langfuse.start_as_current_span(name="autogen-trace") as span:

        # Execute the agent response
        response = await agent.on_messages(
            [TextMessage(content=user_query, source="user")],
            cancellation_token=CancellationToken(),
        )

        span.update_trace(
            input=user_query,
            output=response.chat_message.content,
        )

    return str(response.chat_message.content)

# Test the function
await my_agent("What is the capital of France?")

'The capital of France is Paris.'

Finally, we loop over each dataset item, run the agent, and link the trace to the dataset item. We can also attach a quick evaluation score if desired.


In [20]:
dataset_name = "qa-dataset_autogen-agent"
# The run name and metadata are labels only; keep them in sync with the model configured in `client`
current_run_name = "dev_tasks_run-autogen_gpt-4.1" # Identifies this specific evaluation run
current_run_metadata = {"model_provider": "Azure", "model": "gpt-4.1"}
current_run_description = "Evaluation run for Autogen model on July 3rd"

dataset = langfuse.get_dataset(dataset_name)

for item in dataset.items:
    print(f"Running evaluation for item: {item.id} (Input: {item.input})")
 
    # Use the item.run() context manager
    with item.run(
        run_name=current_run_name,
        run_metadata=current_run_metadata,
        run_description=current_run_description
    ) as root_span: 
        # All subsequent langfuse operations within this block are part of this trace.
        generated_answer = await my_agent(user_query = item.input["text"])
    
    print("Generated Answer: ", generated_answer)
 
print(f"\nFinished processing dataset '{dataset_name}' for run '{current_run_name}'.")

langfuse.flush()

Running evaluation for item: 09810cc4-9992-4712-a3b2-7224da31776a (Input: {'text': 'In Hindu mythology, which deity is the Ganges river dolphin associated with?'})
Generated Answer:  In Hindu mythology, the Ganges river dolphin is associated with the deity Ganga.
Running evaluation for item: bb113f94-7723-47c6-8c34-59d883044514 (Input: {'text': 'What significant discovery did the LHCb collaboration report in 2015?'})
Generated Answer:  In 2015, the LHCb collaboration reported the discovery of pentaquark particles.
Running evaluation for item: 4d8ae54e-ceab-46d0-ad2c-6e8e223589a9 (Input: {'text': 'What is the MÄ\x81ori name for the red-crowned parakeet?'})
Generated Answer:  The Māori name for the red-crowned parakeet is kākāriki.
Running evaluation for item: 21e5a0d5-f619-4a73-868e-9955053b3e72 (Input: {'text': 'Who starred in the 1978 television film adaptation of Les MisÃ©rables?'})
Generated Answer:  Richard Jordan starred as Jean Valjean in the 1978 television film adaptation

You can repeat this process with different agent configurations, such as:
- Models (gpt-4o-mini, gpt-4.1, etc.)
- Prompts
- Tools (search vs. no search)
- Agent complexity (multi-agent vs. single-agent)
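
When several of these dimensions vary at once, it helps to generate the run names and metadata systematically so that runs stay distinguishable in Langfuse. The sketch below enumerates combinations of two dimensions. `build_run_configs` is a hypothetical helper, and the naming scheme and the `tools` metadata key are assumptions that extend the `run_name` and `run_metadata` values used in the loop above.

```python
from itertools import product
from typing import Any, Dict, List


def build_run_configs(models: List[str], tool_setups: List[str]) -> List[Dict[str, Any]]:
    """Create one run configuration per (model, tool setup) combination."""
    configs = []
    for model, tools in product(models, tool_setups):
        configs.append(
            {
                "run_name": f"dev_tasks_run-autogen_{model}_{tools}",
                "run_metadata": {"model_provider": "Azure", "model": model, "tools": tools},
            }
        )
    return configs


run_configs = build_run_configs(
    models=["gpt-4o-mini", "gpt-4.1"],
    tool_setups=["search", "no-search"],
)

for config in run_configs:
    print(config["run_name"])
```

Each configuration's `run_name` and `run_metadata` can then be passed to `item.run(...)` in the same way as in the evaluation loop above, with the agent rebuilt for each configuration.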

Then compare them side by side in Langfuse. In this example, we ran the agent 3 times on the 25 dataset questions, using a different Azure OpenAI model for each run. You can see that the number of correctly answered questions increases when a larger model is used (as expected). The `correct_answer` score is created by an [LLM-as-a-Judge Evaluator](https://langfuse.com/docs/scores/model-based-evals) that is configured to judge the correctness of the answer based on the sample answer given in the dataset.

![Dataset run overview](https://langfuse.com/images/cookbook/example-autogen-evaluation/dataset_runs.png)
![Dataset run comparison](https://langfuse.com/images/cookbook/example-autogen-evaluation/dataset-run-comparison.png)



---

**Disclaimer**:  
This document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). Although we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
