Task 01: Monitoring Your Prompt Flows
Description
In this task, you will create and configure a monitoring solution to track the quality and token usage of your deployed prompt flow application.
Success Criteria
- Deploy your prompt flow with tracing enabled
- Configure a monitor for generation quality
- Consume monitoring results
Solution
01: Deploy the RAG flow with tracing enabled
In this section, you will deploy your flow again, but this time with tracing enabled.
- Delete the dist directory created during the previous deployment:

rm -rf dist
- Package your flow as a Docker image. This process will create a Dockerfile for your flow.
  - Open a terminal in the root directory of your project.
  - Run the following command to build your flow and create a Docker image:

pf flow build --source src --output dist --format docker

This command packages your flow and outputs it to the dist directory in Docker format.
- Enable tracing in your flow.
  - Open the following file: dist/flow/flow.flex.yaml
  - Update its content with:

app_insights_enabled: true
inputs:
  question:
    type: string
  chat_history:
    type: object
entry: chat_request:get_response

  - Save the file.
- To ensure Python can locate the modules in the flow source directory, set the PYTHONPATH environment variable. In your terminal, run the following command:

export PYTHONPATH=./src:$PYTHONPATH

Skipping this step will result in a ModuleNotFoundError: No module named 'chat_request'.

Now you're ready to deploy your flow.
- Open the deployment script: util/deploy_moe.py
- Navigate to line 77 and update it with the following code to configure your flow to work with the AI Foundry Test Chat interface:

model = Model(
    name="ragwithtrace",
    path=flow_path,  # path to the promptflow folder
    properties=[
        # enables the chat interface in the endpoint test tab
        ["azureml.promptflow.source_flow_id", "ragwithtrace"],
        ["azureml.promptflow.mode", "chat"],
        ["azureml.promptflow.chat_input", "question"],
        ["azureml.promptflow.chat_output", "answer"]
    ]
),
- In the terminal, run the following command, replacing XXXX in both the endpoint-name and deployment-name with a unique four-digit number of your choice:

python util/deploy_moe.py --endpoint-name rag-XXXX-endpoint --deployment-name rag-XXXX-deployment

Use different endpoint and deployment names than those used in the previous exercise.

This may take around 20 minutes to deploy.
- Upon completion, you should see output confirming the endpoint and deployment in your terminal (an optional verification sketch follows this step).

If you encounter the error "Key based authentication is not permitted on this storage account," enable the Allow storage account key access option in the Configuration section of your storage account in the Azure portal.
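Before moving on, you can optionally verify the deployment from code and retrieve the scoring URI and key for the later test steps. The following is a minimal sketch using the azure-ai-ml SDK; the subscription, resource group, project, and endpoint names are placeholders you must replace with your own values.

# Hypothetical verification sketch (not part of the lab files): confirm the
# endpoint exists and print its scoring URI and primary key.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="[your_subscription_id]",
    resource_group_name="[your_resource_group_id]",
    workspace_name="[your_ai_studio_project_name]",
)

endpoint = ml_client.online_endpoints.get(name="rag-XXXX-endpoint")
keys = ml_client.online_endpoints.get_keys(name="rag-XXXX-endpoint")

print("Provisioning state:", endpoint.provisioning_state)
print("Scoring URI:", endpoint.scoring_uri)
print("Primary key:", keys.primary_key)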
02: Create a monitor configuration
- Create a monitor.py file in the local folder and add the following content, updating the "Update your Azure resources details" section with your deployment details:
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    MonitorSchedule,
    CronTrigger,
    MonitorDefinition,
    ServerlessSparkCompute,
    MonitoringTarget,
    AlertNotification,
    GenerationSafetyQualityMonitoringMetricThreshold,
    GenerationSafetyQualitySignal,
    BaselineDataRange,
    LlmData,
)
from azure.ai.ml.entities._inputs_outputs import Input
from azure.ai.ml.constants import MonitorTargetTasks, MonitorDatasetContext
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Update your Azure resources details
subscription_id = "[your_subscription_id]"
resource_group = "[your_resource_group_id]"
aoai_deployment_name = "gpt-4"
aoai_connection_name = "aoai-connection"
project_name = "[your_ai_studio_project_name]"  # Ex: ai-project-lh7b37cbhixdq
endpoint_name = "[your_endpoint_name]"  # Ex: rag-PCLN-endpoint
deployment_name = "[your_deployment_name]"  # Ex: rag-PCLN-deployment

# These variables can be renamed, but it's not necessary
app_trace_name = "app_traces"
app_trace_version = "1"
monitor_name = "gen_ai_monitor_generation_quality"
defaultgsqsignalname = "gsq-signal"

# Set the frequency and notification emails for the monitor
trigger_schedule = CronTrigger(expression="*/5 * * * *")
notification_emails_list = ["test@example.com", "def@example.com"]

ml_client = MLClient(
    credential=credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=project_name,
)

spark_compute = ServerlessSparkCompute(instance_type="standard_e4s_v3", runtime_version="3.3")

monitoring_target = MonitoringTarget(
    ml_task=MonitorTargetTasks.QUESTION_ANSWERING,
    endpoint_deployment_id=f"azureml:{endpoint_name}:{deployment_name}",
)

# Set thresholds (0.7 = 70%)
aggregated_groundedness_pass_rate = 0.7
aggregated_relevance_pass_rate = 0.7
aggregated_coherence_pass_rate = 0.7
aggregated_fluency_pass_rate = 0.7

# Create a Generation Safety Quality (GSQ) signal
generation_quality_thresholds = GenerationSafetyQualityMonitoringMetricThreshold(
    groundedness={"aggregated_groundedness_pass_rate": aggregated_groundedness_pass_rate},
    relevance={"aggregated_relevance_pass_rate": aggregated_relevance_pass_rate},
    coherence={"aggregated_coherence_pass_rate": aggregated_coherence_pass_rate},
    fluency={"aggregated_fluency_pass_rate": aggregated_fluency_pass_rate},
)

input_data = Input(
    type="uri_folder",
    path=f"{endpoint_name}-{deployment_name}-{app_trace_name}:{app_trace_version}",
)
data_window = BaselineDataRange(lookback_window_size="P7D", lookback_window_offset="P0D")
production_data = LlmData(
    data_column_names={"prompt_column": "question", "completion_column": "answer", "context_column": "context"},
    input_data=input_data,
    data_window=data_window,
)

gsq_signal = GenerationSafetyQualitySignal(
    connection_id=f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}/providers/Microsoft.MachineLearningServices/workspaces/{project_name}/connections/{aoai_connection_name}",
    metric_thresholds=generation_quality_thresholds,
    production_data=[production_data],
    sampling_rate=1.0,
    properties={
        "aoai_deployment_name": aoai_deployment_name,
        "enable_action_analyzer": "false",
        "azureml.modelmonitor.gsq_thresholds": '[{"metricName":"average_fluency","threshold":{"value":4}},{"metricName":"average_coherence","threshold":{"value":4}}]',
    },
)

monitoring_signals = {
    defaultgsqsignalname: gsq_signal,
}

monitor_settings = MonitorDefinition(
    compute=spark_compute,
    monitoring_target=monitoring_target,
    monitoring_signals=monitoring_signals,
    alert_notification=AlertNotification(emails=notification_emails_list),
)

model_monitor = MonitorSchedule(
    name=monitor_name,
    trigger=trigger_schedule,
    create_monitor=monitor_settings,
)

ml_client.schedules.begin_create_or_update(model_monitor)
- Now, run the monitor.py program to create your monitor configuration (an optional verification tweak is shown after these steps):

python local/monitor.py
Your monitoring configuration is now set up.
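Optionally, to confirm that the schedule was accepted, you could replace the final begin_create_or_update line in monitor.py with a variant that waits for the operation and prints the result. This is a hypothetical tweak that reuses the ml_client and model_monitor objects already defined in the script.

# Hypothetical replacement for the last line of monitor.py: wait for the
# schedule create/update operation to finish, then print its state.
poller = ml_client.schedules.begin_create_or_update(model_monitor)
created = poller.result()  # blocks until the operation completes
print(f"Monitor schedule '{created.name}' created; enabled: {created.is_enabled}")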
03: Send some requests to your endpoint to generate monitoring data
- Update the local/test.py program created earlier with the new endpoint and API key (a minimal sketch of this program is shown after these steps).
- Save the file.
- Run the local/test.py program multiple times to generate some requests:

python local/test.py

You can vary the question content for each request by modifying it on line 19 of the test.py program.
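For reference, the following is a minimal sketch of what local/test.py might look like; the scoring URL, API key, deployment name, and question text are assumptions for this exercise and may differ from the file you created earlier.

# Hypothetical sketch of local/test.py -- replace the URL, key, and deployment
# name with the values from your own endpoint.
import json
import urllib.request

endpoint_url = "https://rag-XXXX-endpoint.<region>.inference.ml.azure.com/score"
api_key = "<your-endpoint-api-key>"

# The question (around line 19 in the real test.py) is what you vary per run.
body = {"question": "What tents are recommended for cold weather?", "chat_history": []}

request = urllib.request.Request(
    endpoint_url,
    data=json.dumps(body).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
        # Routes the call to a specific deployment behind the endpoint.
        "azureml-model-deployment": "rag-XXXX-deployment",
    },
)

with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))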
04: Consume monitoring results
Once you've set up your monitor, it runs according to the schedule you set in the monitoring configuration.
- Go to the Monitoring (preview) tab within the deployment to view the monitoring results.
- Go to the Generation quality tab to monitor the quality of your application over time. Metrics shown include:
  - Violation count: Sum of violations for a given metric (e.g., Fluency) during the time window.
  - Average score: Average score for all instances (or requests) over the time window.

The Generation quality violations card shows the violation rate: the number of violations divided by the total number of possible violations (for example, 3 violations across 20 evaluated requests is a 15% violation rate). Adjust thresholds and computation frequency (default: daily) in the settings.
- Go to the Operational tab to view near real-time operational metrics for the deployment, including:
  - Request count
  - Latency
  - Error rate
05: Enable tracing in Application Insights
Because the flow was deployed with app_insights_enabled: true, tracing data is sent to the Application Insights resource associated with your workspace. Open that resource to explore the traces and metrics.
- Open Application Insights and select Usage and estimated costs under Configure from the left menu.
- Select Custom metrics (Preview).
- Select With dimensions, then select OK to save changes.
- Select Metrics under Monitoring from the left menu.
- Select Promptflow standard metrics from the Metric Namespace dropdown, and explore the metrics using different aggregation methods (a programmatic alternative is sketched after these steps).
- Select Transaction search from the left menu to view tracing in the transactional data.
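If you prefer to pull these metrics programmatically rather than through the portal, a minimal sketch using the azure-monitor-query package is shown below. The resource ID, metric namespace, and metric name are assumptions; replace them with the exact values shown in your Application Insights Metrics blade.

# Hypothetical sketch: query Application Insights metrics with azure-monitor-query.
# The resource ID, namespace, and metric name below are placeholders/assumptions --
# check the Metrics blade in the portal for the exact names in your workspace.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

client = MetricsQueryClient(DefaultAzureCredential())

app_insights_resource_id = (
    "/subscriptions/[your_subscription_id]/resourceGroups/[your_resource_group_id]"
    "/providers/microsoft.insights/components/[your_app_insights_name]"
)

response = client.query_resource(
    app_insights_resource_id,
    metric_names=["Tokens"],                         # assumed metric name
    metric_namespace="promptflow standard metrics",  # assumed namespace
    timespan=timedelta(hours=24),
    granularity=timedelta(hours=1),
    aggregations=["Total"],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(metric.name, point.timestamp, point.total)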