< Previous Challenge - Home - Next Challenge >
This challenge assumes that all requirements for Challenges 01 were successfully completed.
As a major employer on the Contoso Islands, Contoso Yachts, Inc has also expanded into the education vertical, extending their Citrus Bus application with advanced AI services to help educators on the island grade student exams.
The Citrus Bus application has an Azure storage account with a Blob Store containing a large number of documents and Images of handwritten essays from students in various file formats including but not limited to PDF, PNG, TIFF and JPEG.
Often, an organization’s data is not readily consumable by the app. It needs to be digested, parsed and simplified to extract the data in a format that can be readily leveraged by the language model. The Azure AI platform has services that help with these tasks such as Azure Document Intelligence, Azure AI Vision, Azure Custom Vision, and Azure Video Indexer.
In this challenge, you will focus on just one of these parsers, Azure Document Intelligence, to parse records uploaded to Azure Blob Storage and Azure Cosmos DB.
The goal of this challenge is to observe the extraction of the school district, school name, student id, student name and question answers from civics/social studies exams from students in the country. You will also see the parsing of the activity preferences and advance requests from the tourists visiting Contoso Islands.
The Citrus Bus application has a pipeline that can process all of the historical PDF and PNG files for these exam submissions and activity preferences stored in blob storage.
There are 21 sample documents in the sub-folders under the /artifacts/contoso-education
folder:
/F01-Civics-Geography and Climate
/F02-Civics-Tourism and Economy
/F03-Civics-Government and Politics
/F04-Activity-Preferences
Each folder contains 5 samples (except for /F01-Civics-Geography and Climate
which has 6) that you will use for training the custom classifier and extractor.
In Azure Blob Storage you should see a container called classifications
. There should be a total of 21 samples from the 4 classes or categories inside the classifications
container. They were also copied to these containers in Azure Blob Storage:
f01-geo-climate
f02-tour-economy
f03-gov-politics
f04-activity-preferences
At runtime in the automated data pipeline, the app will invoke the custom classifier from Azure Document Intelligence to recognize which document type it has encountered and then it will call the corresponding custom extractor model to parse the document and extract the relevant fields.
You will need to create one Classifier Project which will give you one Classification Model to process the 4 different types of documents we have. When you create your Model, make sure the name matches the value of the DOCUMENT_INTELLIGENCE_CLASSIFIER_MODEL_ID
setting in your applications settings config file local.settings.json
.
NOTE: You may need to use the Settings
icon in the Azure Portal to switch directories if your Entra ID belongs to more than one Azure tenant.
The custom classifier helps you to automate the recognition of the different document types or classes in your knowledge store
Use these directions Building a Custom Classifier Model for how to train the custom classifier in Azure Document Intelligence to recognize the following 4 categories of documents:
f01-geography-climate
f02-tourism-economy
f03-geography-politics
f04-activity-preferences
Rename your document-intelligence-dictionary.json.example
to document-intelligence-dictionary.json
.
When creating your extraction models in document intelligence studio after labelling, please ensure that you use the names in the JSON as the model IDs. Changing the variables file for document intelligence (document-intelligence-dictionary.json
) should not be necessary but if you name your extraction models differently than what is specified in the JSON you should update the corresponding variables accordingly. Ensure that the classifier_document_type
in your dictionary configuration matches what you have in your Document Intelligence Studio.
Use these directions for Building a Custom Extractor Model to build and train each of the extractor models for the 4 document types. You will need 4 projects (1 for each category of documents that references each Azure Blob storage container listed above).
Make sure that the extractor_model_name
field in your application config document-intelligence-dictionary.json
matches what you have in Document Intelligence Studio
f01-extractor
f02-extractor
f03-extractor
f04-extractor
Also ensure that the field names such as q1
and q5
matches exactly what you have in Document Intelligence Studio.
The first 3 extractor models a straightforward. However in the 4th document type, we have tables, signatures and checkboxes.
document-intelligence-dictionary.json
[
{
"classifier_document_type": "f03-geography-politics",
"extractor_model_name": "f03-extractor",
"model_description": "This is used to extract the contents of Form 03 using Azure Doc Intelligence",
"fields": {
"student_id": "student_id",
"student_name": "student_name",
"school_district": "school_district",
"school_name": "school_name",
"exam_date": "exam_date",
"question_1": "q1",
"question_2": "q2",
"question_3": "q3",
"question_4": "q4",
"question_5": "q5"
}
},
{
"classifier_document_type": "f04-activity-preferences",
"extractor_model_name": "f04-extractor2",
"model_description": "This is used to extract the contents of Form 04 (Activity Preferences) using Azure Doc Intelligence",
"fields": {
"guest_full_name": "guest_full_name",
"guest_email_address": "email_address",
"checkbox_contoso_floating_museums": "contoso_floating_museums",
"checkbox_contoso_solar_yachts": "contoso_solar_yachts",
"checkbox_contoso_beachfront_spa": "contoso_beachfront_spa",
"checkbox_contoso_dolphin_turtle_tour": "contoso_dolphin_turtle_tour",
"table_name": "activity_requests",
"table_column_header_experience_name": "experience_name",
"table_column_header_preferred_time": "preferred_time",
"table_column_header_party_size": "party_size",
"signature_field_name": "guest_signature",
"signature_date_field_name": "signature_date"
}
}
]
After training your models, you can test the form processing pipeline by uploading the files located locally in /artifacts/contoso-education/submissions
to the submissions
container in your storage account. Refer back to CH0 for uploading local files into your storage account. This will trigger Azure Functions, which have been created for you in the backend. Azure Functions will classify, extract, and store the results in Cosmos DB.
You will be able to observe that the solution should:
You can go to the data explorer for the Cosmos DB to verify that the exam submissions have loaded successfully into the examsubmissions
collection.
The graded exams corresponding to each submission will reside in the grades collection in Cosmos DB
For the activity preferences for each customer uploaded, these are parsed and reside in the activitypreferences
Cosmos DB container.
Just like how you uploaded yacht records and modified the yacht records via the http client, use the rest-api-students-management.http http client to upload student records to Cosmos DB. The AI assistant will only respond to queries from students registered in the Cosmos DB database.
Once you have verified that these documents have been parsed and the data has been extracted into the corresponding containers, you can use the Murphy and Priscilla AI Assistants to query the database to get the answers from these data stores.
You will need to configure the assistant tools for each AI assistant to ensure that the correct function is executed when the student or parent needs to retrieve the grades for the exam submissions or when a guest needs to get recommendations for activities during their trip on the island.
During this challenge you will: