# AutoGen Agents in Production: Observability and Evaluation

In this tutorial, we will learn how to **monitor the internal steps (traces) of [AutoGen agents](https://github.com/microsoft/autogen)** and **evaluate their performance** using [Langfuse](https://langfuse.com).

This guide covers the **online** and **offline** evaluation metrics that teams use to bring agents to production quickly and reliably.

**Why AI agent evaluation matters:**
- Debugging issues when tasks fail or produce undesirable output
- Monitoring costs and performance in real time
- Improving reliability and safety through continuous feedback


## Step 1: Set Environment Variables

Get your Langfuse API keys by signing up for [Langfuse Cloud](https://cloud.langfuse.com/) or by [self-hosting Langfuse](https://langfuse.com/self-hosting).

_**Note:** Self-hosters can use [Terraform modules](https://langfuse.com/self-hosting/azure) to deploy Langfuse on Azure. Alternatively, you can deploy Langfuse on Kubernetes using the [Helm chart](https://langfuse.com/self-hosting/kubernetes-helm)._


In [5]:
import os

# Get keys for your project from the project settings page: https://cloud.langfuse.com
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..." 
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..." 
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com" # 🇪🇺 EU region
# os.environ["LANGFUSE_HOST"] = "https://us.cloud.langfuse.com" # 🇺🇸 US region
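
Before initializing the client, it can help to fail fast when a variable is missing. The following is a minimal sketch: the helper name `missing_langfuse_vars` is hypothetical and not part of the Langfuse SDK, and it only checks the three variable names set in the cell above.

```python
import os
from typing import Mapping

# The three variable names set in the cell above.
REQUIRED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")


def missing_langfuse_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or empty (hypothetical helper)."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_langfuse_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All Langfuse environment variables are set.")
```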

With the environment variables set, we can now initialize the Langfuse client. The cell below constructs `Langfuse()` directly, which reads the credentials from the environment variables; `get_client()`, used later in this notebook, reads them the same way.


In [6]:
from langfuse import Langfuse
 
# Filter out AutoGen OpenTelemetry spans
langfuse = Langfuse(
    blocked_instrumentation_scopes=["autogen SingleThreadedAgentRuntime"]
)
 
# Verify connection
if langfuse.auth_check():
    print("Langfuse client is authenticated and ready!")
else:
    print("Authentication failed. Please check your credentials and host.")

Langfuse client is authenticated and ready!


## Step 2: Initialize the OpenLit Instrumentation

Now we initialize the [OpenLit](https://github.com/openlit/openlit) instrumentation. OpenLit automatically captures AutoGen operations and exports OpenTelemetry (OTel) spans to Langfuse.


In [7]:
import openlit
 
# Initialize OpenLIT instrumentation. The disable_batch flag is set to true to process traces immediately.
openlit.init(tracer=langfuse._otel_tracer, disable_batch=True, disabled_instrumentors=["mistral"])

## Step 3: Run your agent

Now we set up a multi-turn agent to test our instrumentation.


In [2]:
import os

from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.azure import AzureAIChatCompletionClient
from azure.core.credentials import AzureKeyCredential
from autogen_agentchat.base import TaskResult

from autogen_agentchat.conditions import TextMentionTermination
from autogen_agentchat.teams import RoundRobinGroupChat

In [3]:
client = AzureAIChatCompletionClient(
    model="gpt-4o-mini",
    endpoint="https://models.inference.ai.azure.com",
    # To authenticate with the model you will need to generate a personal access token (PAT) in your GitHub settings.
    # Create your PAT token by following instructions here: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens
    credential=AzureKeyCredential(os.environ["GITHUB_TOKEN"]),
    model_info={
        "json_output": True,
        "function_calling": True,
        "vision": True,
        "family": "unknown",
        "structured_output": False
    },
)

In [8]:
# 🍴 Agent 1 – proposes ONE healthy meal idea each turn
meal_planner_agent = AssistantAgent(
    "meal_planner_agent",
    model_client=client,
    description="A seasoned meal-planning coach who suggests balanced meals.",
    system_message="""
    You are a Meal-Planning Assistant with a decade of experience helping busy people prepare meals.
    Goal: propose the single best meal (breakfast, lunch, or dinner) given the user's context.
    Each response must contain ONLY one complete meal idea (title + very brief component list), with no extras.
    Keep it concise: skip greetings, chit-chat, and filler.
    """,
)

# 🥗 Agent 2 – checks nutritional quality & variety
nutritionist_agent = AssistantAgent(
    "nutritionist_agent",
    model_client=client,
    description="A registered dietitian ensuring meals meet nutritional standards.",
    system_message="""
    You are a Nutritionist focused on whole-food, macro-balanced eating.
    Evaluate the meal_planner_agent's recommendation.
    If the meal is nutritionally sound, sufficiently varied, and portion-appropriate, respond with 'APPROVE'.
    Otherwise, give high-level guidance on how to improve it (e.g. 'add a plant-based protein'), but do NOT provide a full alternative recipe.
    """,
)

In [9]:
# ✅ Chat stops once the nutritionist says APPROVE
termination = TextMentionTermination("APPROVE")

# 🔄 Alternate turns between the two agents until termination
team = RoundRobinGroupChat(
    [meal_planner_agent, nutritionist_agent],
    termination_condition=termination,
)

# Example kickoff
user_input = "I'm looking for a quick, delicious dinner I can prep after work. I have 30 minutes and minimal clean-up is ideal."
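
To make the control flow of this team concrete, here is a minimal plain-Python sketch of round-robin turn-taking with a stop-on-mention rule. It is not AutoGen's implementation: the agents are stand-in functions with canned replies, and `run_round_robin` and `max_turns` are hypothetical names introduced only for illustration.

```python
from typing import Callable, List, Tuple

# A stand-in "agent" maps the conversation so far to its next reply.
Agent = Callable[[List[str]], str]


def run_round_robin(
    agents: List[Tuple[str, Agent]],
    task: str,
    stop_word: str,
    max_turns: int = 10,
) -> List[Tuple[str, str]]:
    """Alternate between agents until a reply mentions stop_word or max_turns is hit."""
    history: List[str] = [task]
    transcript: List[Tuple[str, str]] = [("user", task)]
    for turn in range(max_turns):
        name, agent = agents[turn % len(agents)]  # round-robin selection
        reply = agent(history)
        history.append(reply)
        transcript.append((name, reply))
        if stop_word in reply:  # same idea as a text-mention termination
            break
    return transcript


# Canned stand-ins: the reviewer asks for one revision, then approves.
def planner(history: List[str]) -> str:
    return "Sheet-pan chicken" if len(history) == 1 else "Sheet-pan chicken with broccoli"


def reviewer(history: List[str]) -> str:
    return "APPROVE" if "broccoli" in history[-1] else "Add a vegetable."


transcript = run_round_robin(
    [("meal_planner_agent", planner), ("nutritionist_agent", reviewer)],
    task="Quick dinner, 30 minutes.",
    stop_word="APPROVE",
)
for speaker, text in transcript:
    print(f"{speaker}: {text}")
```

The `max_turns` guard is worth keeping in any real setup as well: without an upper bound, a reviewer that never approves would keep the loop running indefinitely.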

In [None]:
with langfuse.start_as_current_span(name="create_meal_plan") as span:
    async for message in team.run_stream(task=user_input):
        if isinstance(message, TaskResult):
            print("Stop Reason:", message.stop_reason)
        else:
            print(message)

    span.update_trace(
        input=user_input,
        output=message.stop_reason,
    )

# Flush the trace to Langfuse for short-lived environments such as Jupyter Notebooks
langfuse.flush()

### Trace Structure

Langfuse records a **trace** that contains **spans**, each representing one step of your agent's logic. Here, the trace contains the overall agent run and sub-spans for:
- The meal planner agent
- The nutritionist agent

You can inspect these to see exactly where time is spent, how many tokens are used, and so on:

![Trace tree in Langfuse](https://langfuse.com/images/cookbook/example-autogen-evaluation/trace-tree.png)

_[Link to the trace](https://cloud.langfuse.com/project/cloramnkj0002jz088vzn1ja4/traces/dac2b33e7cd709e685ccf86a137ecc64)_


## Online Evaluation

Online evaluation means evaluating the agent in a live, real-world environment, that is, while it is being used in production. It involves monitoring the agent's performance on real user interactions and continuously analyzing the results.

### Common Metrics to Track in Production

1. **Costs**: The instrumentation captures token usage, which you can convert into an approximate cost by assigning a price per token.
2. **Latency**: Observe the time it takes to complete each step, or the entire run.
3. **User Feedback**: Users can give direct feedback (thumbs up/down) to help refine or correct the agent.
4. **LLM-as-a-Judge**: Use a separate LLM to evaluate your agent's output in near real time (for example, checking for toxicity or correctness).

Below, we show examples of these metrics.


#### 1. Costs

The screenshot below shows usage for `gpt-4o-mini` calls. This is useful for spotting costly steps and optimizing your agent.

![Costs](https://langfuse.com/images/cookbook/example-autogen-evaluation/gpt-4o-costs.png)

_[Link to the trace](https://cloud.langfuse.com/project/cloramnkj0002jz088vzn1ja4/traces/dac2b33e7cd709e685ccf86a137ecc64)_
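
As a rough illustration of turning token counts into cost, the sketch below prices a single model call. The per-million-token prices are placeholders chosen for the example, not actual Azure or OpenAI rates, and `TokenPrice` and `estimate_cost` are hypothetical names; substitute the rates from your provider's price sheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """Price in USD per one million tokens (placeholder values, not real rates)."""
    prompt_per_million: float
    completion_per_million: float


def estimate_cost(prompt_tokens: int, completion_tokens: int, price: TokenPrice) -> float:
    """Approximate cost of one model call from its token usage."""
    return (
        prompt_tokens * price.prompt_per_million
        + completion_tokens * price.completion_per_million
    ) / 1_000_000


# Placeholder prices, assumed only for this example.
example_price = TokenPrice(prompt_per_million=1.0, completion_per_million=4.0)

# Token counts in the shape of the RequestUsage values that AutoGen prints per message.
cost = estimate_cost(prompt_tokens=95, completion_tokens=30, price=example_price)
print(f"Estimated cost: ${cost:.6f}")
```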


#### 2. Latency

We can also see how long each step took to complete. In the example below, the entire run took about 3 seconds, which you can break down by step. This helps you identify bottlenecks and optimize your agent.

![Latency](https://langfuse.com/images/cookbook/example-autogen-evaluation/agent-latency.png)

_[Link to the trace](https://cloud.langfuse.com/project/cloramnkj0002jz088vzn1ja4/traces/dac2b33e7cd709e685ccf86a137ecc64?display=timeline)_
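
Per-step latency can also be approximated outside the UI from message timestamps. The sketch below assumes each step is represented by a `(source, created_at)` pair, similar to the `created_at` field on AutoGen messages; `step_latencies` is a hypothetical helper and the timestamps are sample values. The gaps include any framework overhead between messages, so treat them as estimates rather than pure model time.

```python
from datetime import datetime, timezone
from typing import List, Tuple


def step_latencies(events: List[Tuple[str, datetime]]) -> List[Tuple[str, float]]:
    """Seconds elapsed between each event and the one before it."""
    return [
        (source, (ts - prev_ts).total_seconds())
        for (_, prev_ts), (source, ts) in zip(events, events[1:])
    ]


# Sample events: the user message followed by one reply from each agent.
events = [
    ("user", datetime(2025, 7, 2, 16, 20, 43, 732669, tzinfo=timezone.utc)),
    ("meal_planner_agent", datetime(2025, 7, 2, 16, 20, 45, 186423, tzinfo=timezone.utc)),
    ("nutritionist_agent", datetime(2025, 7, 2, 16, 20, 45, 581059, tzinfo=timezone.utc)),
]

for source, seconds in step_latencies(events):
    print(f"{source}: {seconds:.3f}s")
```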


#### 3. User Feedback

If your agent is embedded in a user interface, you can record direct user feedback (such as a thumbs-up/down in a chat UI).


In [10]:
from langfuse import get_client
 
langfuse = get_client()
 
# Option 1: Use the yielded span object from the context manager
with langfuse.start_as_current_span(
    name="autogen-request-user-feedback-1") as span:
    
    async for message in team.run_stream(task="Create a meal with potatoes"):
        if isinstance(message, TaskResult):
            print("Stop Reason:", message.stop_reason)
        else:
            print(message)
 
    # Score using the span object
    span.score_trace(
        name="user-feedback",
        value=1,
        data_type="NUMERIC",
        comment="This was delicious, thank you"
    )
 
# Option 2: Use langfuse.score_current_trace() if still in context
with langfuse.start_as_current_span(name="autogen-request-user-feedback-2") as span:
    # ... Autogen execution ...

    async for message in team.run_stream(task="I am allergic to gluten."):
        if isinstance(message, TaskResult):
            print("Stop Reason:", message.stop_reason)
        else:
            print(message)
 
    # Score using current context
    langfuse.score_current_trace(
        name="user-feedback",
        value=1,
        data_type="NUMERIC"
    )

id='da068880-22ae-4f01-9f01-2bb231939089' source='user' models_usage=None metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 20, 43, 732669, tzinfo=datetime.timezone.utc) content='Create a meal with potatoes' type='TextMessage'
id='ad937ce4-3534-493f-824b-ca9c226b5287' source='meal_planner_agent' models_usage=RequestUsage(prompt_tokens=95, completion_tokens=30) metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 20, 45, 186423, tzinfo=datetime.timezone.utc) content='Potato and Spinach Frittata  \n- Eggs  \n- Potatoes  \n- Fresh spinach  \n- Onion  \n- Cheese (optional)  ' type='TextMessage'
id='50fd33c1-057f-49fe-afad-ee86d164296d' source='nutritionist_agent' models_usage=RequestUsage(prompt_tokens=132, completion_tokens=4) metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 20, 45, 581059, tzinfo=datetime.timezone.utc) content='APPROVE' type='TextMessage'
Stop Reason: Text 'APPROVE' mentioned
id='e371de6c-e5fc-42c1-8eda-e5b8cd5accab' source='user' models_usage=None met

In [None]:
# Option 3: Use create_score() with trace ID (when outside context)
langfuse.create_score(
    trace_id="predefined_trace_id",
    name="user-feedback",
    value=1,
    data_type="NUMERIC",
    comment="This was correct, thank you"
)

User feedback is then captured in Langfuse:

![User feedback captured in Langfuse](https://langfuse.com/images/cookbook/example-autogen-evaluation/user-feedback.png)
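
In a chat UI, the thumbs-up/down click usually has to be turned into score fields before it is sent. A minimal sketch, assuming a boolean thumbs signal: `feedback_to_score` is a hypothetical helper, the 1/0 encoding is one possible convention, and the returned keys mirror the keyword arguments passed to `create_score()` above.

```python
from typing import Any, Dict, Optional


def feedback_to_score(
    trace_id: str, thumbs_up: bool, comment: Optional[str] = None
) -> Dict[str, Any]:
    """Build keyword arguments for a user-feedback score (hypothetical helper)."""
    score: Dict[str, Any] = {
        "trace_id": trace_id,
        "name": "user-feedback",
        "value": 1 if thumbs_up else 0,  # 1 = thumbs up, 0 = thumbs down
        "data_type": "NUMERIC",
    }
    if comment:
        score["comment"] = comment
    return score


kwargs = feedback_to_score("predefined_trace_id", thumbs_up=False, comment="Too salty")
print(kwargs)
```

The resulting dictionary could then be passed on as `langfuse.create_score(**kwargs)`.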


#### 4. Automated LLM-as-a-Judge Scoring

LLM-as-a-Judge is another way to evaluate your agent's output automatically. You can set up a separate LLM call to assess correctness, toxicity, style, or any other criteria you care about.

**Workflow**:
1. You define an **evaluation template**, for example, "Check whether the text is toxic."
2. You choose a model to act as the judge; in this case, `gpt-4o-mini` accessed through Azure.
3. Each time your agent generates output, you pass that output to the judge LLM together with the template.
4. The judge LLM responds with a rating or label that you log to your observability tool.

Example from Langfuse:

![LLM-as-a-Judge Evaluator](https://langfuse.com/images/cookbook/example-autogen-evaluation/evaluator.png)
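
The workflow above can be sketched in plain Python. This is not how Langfuse implements its evaluators: the template wording, the `judge_toxicity` function, and the `call_llm` parameter are assumptions made for illustration, and a keyword-based stub stands in for the real judge model so the example runs offline.

```python
from typing import Callable, Dict

# Step 1: an evaluation template (wording is illustrative).
TOXICITY_TEMPLATE = (
    "Evaluate whether the following text is toxic.\n"
    "Answer with exactly one word: TOXIC or NOT_TOXIC.\n\n"
    "Text: {text}"
)


def judge_toxicity(agent_output: str, call_llm: Callable[[str], str]) -> Dict[str, object]:
    """Steps 3 and 4: send the output to the judge and turn its reply into a score."""
    reply = call_llm(TOXICITY_TEMPLATE.format(text=agent_output)).strip().upper()
    if reply not in {"TOXIC", "NOT_TOXIC"}:
        raise ValueError(f"Unexpected judge reply: {reply!r}")
    return {"name": "toxicity", "value": 1 if reply == "TOXIC" else 0, "label": reply}


# Step 2 stand-in: a stub instead of a real judge model such as gpt-4o-mini.
def stub_judge(prompt: str) -> str:
    text = prompt.split("Text: ", 1)[1]
    return "TOXIC" if "idiot" in text.lower() else "NOT_TOXIC"


score = judge_toxicity("Potato and Spinach Frittata", call_llm=stub_judge)
print(score)
```

Validating the judge's reply against a closed set of labels keeps malformed answers from silently turning into scores.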


In [12]:
# Keep the task in a variable so the trace input matches what the agents actually received
task = "I am a picky eater and not sure if you find something for me."

with langfuse.start_as_current_span(name="autogen-request-user-feedback-2") as span:

    async for message in team.run_stream(task=task):
        if isinstance(message, TaskResult):
            print("Stop Reason:", message.stop_reason)
        else:
            print(message)

    span.update_trace(
        input=task,
        output=message.stop_reason,
    )

langfuse.flush()

id='eefc628d-502f-451a-8f70-be486f62f8c5' source='user' models_usage=None metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 38, 29, 171393, tzinfo=datetime.timezone.utc) content='I am a picky eater and not sure if you find something for me.' type='TextMessage'
id='13b3e14b-bcf7-42a5-80d6-54b0c7be765e' source='meal_planner_agent' models_usage=RequestUsage(prompt_tokens=352, completion_tokens=27) metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 38, 30, 433516, tzinfo=datetime.timezone.utc) content='Chicken Alfredo Pasta  \n- Gluten-free pasta  \n- Grilled chicken breast  \n- Heavy cream  \n- Parmesan cheese  \n- Garlic  ' type='TextMessage'
id='550f2dee-0e08-4bbd-b67f-991b467328f1' source='nutritionist_agent' models_usage=RequestUsage(prompt_tokens=386, completion_tokens=17) metadata={} created_at=datetime.datetime(2025, 7, 2, 16, 38, 31, 505173, tzinfo=datetime.timezone.utc) content='Consider incorporating some vegetables, like spinach or broccoli, to increase the nutrien

You can see that the answer in this example was judged as "not toxic".

![LLM-as-a-Judge evaluation score](https://langfuse.com/images/cookbook/example-autogen-evaluation/llm-as-a-judge-score.png)


#### 5. Observability Metrics Overview

All of these metrics can be displayed together in dashboards. This lets you quickly see how your agent performs across many sessions and helps you track quality metrics over time.

![Observability metrics overview](https://langfuse.com/images/cookbook/example-autogen-evaluation/dashboard.png)


## Offline Evaluation

Online evaluation is essential for live feedback, but you also need **offline evaluation**: systematic checks run before or during development. This helps maintain quality and reliability before changes are rolled out to production.


### Dataset Evaluation

In offline evaluation, you typically:
1. Have a benchmark dataset (with pairs of prompts and expected outputs)
2. Run your agent on that dataset
3. Compare the outputs with the expected results or use an additional scoring mechanism
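
These three steps can be sketched without any external services. The benchmark items, the `toy_agent` stand-in, and the normalized exact-match scorer below are all assumptions made for illustration; a real run would call your agent and would likely use a more forgiving scorer.

```python
import string
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def evaluate(agent: Callable[[str], str], dataset: List[Dict[str, str]]) -> float:
    """Run the agent on every item and return the exact-match accuracy."""
    if not dataset:
        raise ValueError("dataset is empty")
    hits = sum(
        normalize(agent(item["prompt"])) == normalize(item["expected"])
        for item in dataset
    )
    return hits / len(dataset)


# 1. A tiny, made-up benchmark of prompt / expected-output pairs.
benchmark = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "What is 2 + 2?", "expected": "4"},
]


# 2. A stand-in agent with canned answers (one of them wrong on purpose).
def toy_agent(prompt: str) -> str:
    return {"What is the capital of France?": "Paris."}.get(prompt, "I don't know")


# 3. Compare the outputs with the expected results.
accuracy = evaluate(toy_agent, benchmark)
print(f"Exact-match accuracy: {accuracy:.2f}")
```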

Below, we demonstrate this approach using the [q&a-dataset](https://huggingface.co/datasets/junzhang1207/search-dataset), which contains questions and expected answers.


In [16]:
import pandas as pd
from datasets import load_dataset
 
# Fetch search-dataset from Hugging Face
dataset = load_dataset("junzhang1207/search-dataset", split = "train")
df = pd.DataFrame(dataset)
print("First few rows of search-dataset:")
print(df.head())


First few rows of search-dataset:
                                     id  \
0  20caf138-0c81-4ef9-be60-fe919e0d68d4   
1  1f37d9fd-1bcc-4f79-b004-bc0e1e944033   
2  76173a7f-d645-4e3e-8e0d-cca139e00ebe   
3  5f5ef4ca-91fe-4610-a8a9-e15b12e3c803   
4  64dbed0d-d91b-4acd-9a9c-0a7aa83115ec   

                                            question  \
0                 steve jobs statue location budapst   
1  Why is the Battle of Stalingrad considered a t...   
2  In what year did 'The Birth of a Nation' surpa...   
3  How many Russian soldiers surrendered to AFU i...   
4   What event led to the creation of Google Images?   

                                     expected_answer       category       area  
0  The Steve Jobs statue is located in Budapest, ...           Arts  Knowledge  
1  The Battle of Stalingrad is considered a turni...   General News       News  
2  This question is based on a false premise. 'Th...  Entertainment       News  
3  About 300 Russian soldiers surrendered to t

Next, we create a dataset entity in Langfuse to track the runs. Then, we add each item from the dataset to the system.


In [17]:
from langfuse import Langfuse
langfuse = Langfuse()
 
langfuse_dataset_name = "qa-dataset_autogen-agent"
 
# Create a dataset in Langfuse
langfuse.create_dataset(
    name=langfuse_dataset_name,
    description="q&a dataset uploaded from Hugging Face",
    metadata={
        "date": "2025-03-21",
        "type": "benchmark"
    }
)

Dataset(id='cmcm7524d00kjad07s2cjwqcf', name='qa-dataset_autogen-agent', description='q&a dataset uploaded from Hugging Face', metadata={'date': '2025-03-21', 'type': 'benchmark'}, project_id='cloramnkj0002jz088vzn1ja4', created_at=datetime.datetime(2025, 7, 2, 16, 54, 7, 357000, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 7, 2, 16, 54, 7, 357000, tzinfo=datetime.timezone.utc))

In [18]:
df_25 = df.sample(25) # For this example, we upload only 25 dataset questions

for idx, row in df_25.iterrows():
    langfuse.create_dataset_item(
        dataset_name=langfuse_dataset_name,
        input={"text": row["question"]},
        expected_output={"text": row["expected_answer"]}
    )

![Dataset items in Langfuse](https://langfuse.com/images/cookbook/example-autogen-evaluation/example-dataset.png)
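
The row-to-item mapping used in the upload loop can be isolated into a small function, which makes it easy to check the payload shape before anything is sent. A sketch, assuming rows with `question` and `expected_answer` fields as in the dataset above; `row_to_item` is a hypothetical helper and the expected answer below is a placeholder string.

```python
from typing import Any, Dict, Mapping


def row_to_item(dataset_name: str, row: Mapping[str, Any]) -> Dict[str, Any]:
    """Build keyword arguments for one dataset item from a dataset row."""
    question = str(row["question"]).strip()
    if not question:
        raise ValueError("row has an empty question")
    return {
        "dataset_name": dataset_name,
        "input": {"text": question},
        "expected_output": {"text": str(row["expected_answer"]).strip()},
    }


row = {
    "question": "What event led to the creation of Google Images?",
    "expected_answer": "(expected answer text from the dataset)",  # placeholder
}
item = row_to_item("qa-dataset_autogen-agent", row)
print(item["input"])
```

The result could then be uploaded with `langfuse.create_dataset_item(**item)`.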


#### Running the Agent on the Dataset

First, we build a simple AutoGen agent that answers questions using Azure OpenAI models.


In [8]:
import os
from dotenv import load_dotenv

from autogen_agentchat.agents import AssistantAgent
from autogen_core.models import UserMessage
from autogen_ext.models.azure import AzureAIChatCompletionClient
from azure.core.credentials import AzureKeyCredential
from autogen_core import CancellationToken
from autogen_agentchat.messages import TextMessage

In [None]:
load_dotenv()
client = AzureAIChatCompletionClient(
    model="gpt-4o",
    endpoint="https://models.inference.ai.azure.com",
    # To authenticate with the model you will need to generate a personal access token (PAT) in your GitHub settings.
    # Create your PAT token by following instructions here: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens
    credential=AzureKeyCredential(os.getenv("GITHUB_TOKEN")),
    max_tokens=5000,
    model_info={
        "json_output": True,
        "function_calling": False,
        "vision": False,
        "family": "unknown",
        "structured_output": True,
    },
)

result = await client.create([UserMessage(content="What is the capital of France?", source="user")])
print(result)

In [18]:
agent = AssistantAgent(
    name="assistant",
    model_client=client,
    tools=[],
    system_message="You are a participant in a quiz show and you are given a question. You need to create a short answer to the question.",
)

Next, we define a helper function, `my_agent()`.


In [19]:
async def my_agent(user_query: str):

    with langfuse.start_as_current_span(name="autogen-trace") as span:

        # Execute the agent response
        response = await agent.on_messages(
            [TextMessage(content=user_query, source="user")],
            cancellation_token=CancellationToken(),
        )

        span.update_trace(
            input=user_query,
            output=response.chat_message.content,
        )

    return str(response.chat_message.content)

# Test the function
await my_agent("What is the capital of France?")

'The capital of France is Paris.'

Finally, we loop over each dataset item, run the agent, and link the trace to the dataset item. We can also attach a quick evaluation score if desired.


In [20]:
dataset_name = "qa-dataset_autogen-agent"
current_run_name = "dev_tasks_run-autogen_gpt-4.1" # Identifies this specific evaluation run
current_run_metadata={"model_provider": "Azure", "model": "gpt-4.1"}
current_run_description="Evaluation run for Autogen model on July 3rd"

dataset = langfuse.get_dataset(dataset_name)

for item in dataset.items:
    print(f"Running evaluation for item: {item.id} (Input: {item.input})")
 
    # Use the item.run() context manager
    with item.run(
        run_name=current_run_name,
        run_metadata=current_run_metadata,
        run_description=current_run_description
    ) as root_span: 
        # All subsequent langfuse operations within this block are part of this trace.
        generated_answer = await my_agent(user_query = item.input["text"])
    
    print("Generated Answer: ", generated_answer)
 
print(f"\nFinished processing dataset '{dataset_name}' for run '{current_run_name}'.")

langfuse.flush()

Running evaluation for item: 09810cc4-9992-4712-a3b2-7224da31776a (Input: {'text': 'In Hindu mythology, which deity is the Ganges river dolphin associated with?'})
Generated Answer:  In Hindu mythology, the Ganges river dolphin is associated with the deity Ganga.
Running evaluation for item: bb113f94-7723-47c6-8c34-59d883044514 (Input: {'text': 'What significant discovery did the LHCb collaboration report in 2015?'})
Generated Answer:  In 2015, the LHCb collaboration reported the discovery of pentaquark particles.
Running evaluation for item: 4d8ae54e-ceab-46d0-ad2c-6e8e223589a9 (Input: {'text': 'What is the Māori name for the red-crowned parakeet?'})
Generated Answer:  The Māori name for the red-crowned parakeet is kākāriki.
Running evaluation for item: 21e5a0d5-f619-4a73-868e-9955053b3e72 (Input: {'text': 'Who starred in the 1978 television film adaptation of Les Misérables?'})
Generated Answer:  Richard Jordan starred as Jean Valjean in the 1978 television film adaptation

You can repeat this process with different agent configurations, such as:  
- Models (gpt-4o-mini, gpt-4.1, etc.)  
- Prompts  
- Tools (search vs. no search)  
- Agent complexity (multi-agent vs. single-agent)  

Then compare them side by side in Langfuse. In this example, I ran the agent 3 times on the 25 dataset questions, using a different Azure OpenAI model for each run. You can see that the number of correctly answered questions increases when a larger model is used (as expected). The `correct_answer` score is created by an [LLM-as-a-Judge Evaluator](https://langfuse.com/docs/scores/model-based-evals) that is set up to judge the correctness of each answer against the sample answer given in the dataset.  

![Dataset run overview](https://langfuse.com/images/cookbook/example-autogen-evaluation/dataset_runs.png)  
![Dataset run comparison](https://langfuse.com/images/cookbook/example-autogen-evaluation/dataset-run-comparison.png)  
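
If you export scores instead of comparing runs in the UI, per-run accuracy is a simple aggregation. The sketch below assumes each record carries a run name and a binary `correct_answer` value; `accuracy_by_run` is a hypothetical helper, and the run names and values are made-up sample data, not results from the runs shown above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def accuracy_by_run(scores: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Average a binary correct_answer score per run name."""
    totals: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # run -> [correct, count]
    for run_name, value in scores:
        totals[run_name][0] += value
        totals[run_name][1] += 1
    return {run: correct / count for run, (correct, count) in totals.items()}


# Made-up sample records: (run name, correct_answer score).
sample_scores = [
    ("run-a", 1), ("run-a", 0), ("run-a", 1), ("run-a", 1),
    ("run-b", 1), ("run-b", 1), ("run-b", 1), ("run-b", 1),
]

for run, acc in sorted(accuracy_by_run(sample_scores).items()):
    print(f"{run}: {acc:.0%}")
```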



---

**Disclaimer**:  
This document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inconsistencies. The original document in its original language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
