Task 03 - Prepare datasets

Introduction

Now that you have deployed and configured the necessary Azure resources, the next step is to prepare the datasets that you will use in this training. In this scenario, Zava stores its product catalog data in Cosmos DB. You will then use Azure AI Search to create a search index and import the data from Cosmos DB.

Description

In this task, you will prepare the dataset needed for the rest of this training. This includes loading data into Cosmos DB and creating an Azure AI Search index that imports the data.

Success Criteria

  • You have loaded data into Cosmos DB.
  • You have created a search index in Azure AI Search and imported data into it.

Learning Resources

Key Tasks

01: Create a Cosmos DB container

The data that you will use in this training is located in src/data/updated_product_catalog(in).csv. This CSV file contains product catalog data for Zava, including product names, descriptions, categories, prices, and image URLs. You will load this data into the Cosmos DB instance that you created in the first task of this exercise.

Expand this section to view the solution

Navigate to the Azure portal and open the Cosmos DB account that you created in the first task of this exercise. Then, navigate to the Data Explorer section of your Cosmos DB account from the left-hand menu. In the Data Explorer, you can see a zava database that was created as part of the deployment process. Select the ellipsis (…) next to the zava database and choose New Container.

Create a new Cosmos DB container.

In the New Container pane, ensure that the database ID is zava and enter product_catalog for the container ID. Then, enter /ProductID for the partition key. Set the container throughput to Manual and enter 400 RU/s.

Configure the new Cosmos DB container.

Then, select the OK button to create the container.
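If you prefer to script this step instead of using the portal, the same container can be created with the Azure CLI. This is a sketch only: the account and resource group names are placeholders that you must replace with your own values.

```shell
# Create the product_catalog container in the zava database.
# Replace <your-cosmos-account> and <your-resource-group> with your own names.
az cosmosdb sql container create \
  --account-name <your-cosmos-account> \
  --resource-group <your-resource-group> \
  --database-name zava \
  --name product_catalog \
  --partition-key-path "/ProductID" \
  --throughput 400
```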

02: Create a virtual environment and install dependencies

Before you can run the script to load data into Cosmos DB, you should create a Python virtual environment. This will allow you to install the necessary Python packages without affecting your global Python installation.

Expand this section to view the solution

Open a terminal and navigate to the root directory of the repository that you cloned in the first task. Then, run the following commands to create a virtual environment and activate it:

# Navigate to the /src/ directory
cd src

# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows
venv\Scripts\activate.bat
# On Windows (PowerShell)
venv\Scripts\Activate.ps1
# On macOS/Linux
source venv/bin/activate

Once the virtual environment is activated, you can install the necessary dependencies using pip:

pip install -r requirements.txt

You may receive an error message whose final two lines are assert os.path.exists(pyc_path) and AssertionError. If you do receive this error message, try running the command again. You may additionally wish to upgrade pip to the latest version by running pip install --upgrade pip and then try installing the requirements again.

03: Import data into the Cosmos DB container

Now that you have created the Cosmos DB container, the next step is to import the product catalog data into the container. There is a Python script in src/pipelines/ingest_to_cosmos.py that you can use to load the data from the CSV file into the Cosmos DB container.

Expand this section to view the solution

Use the same terminal window where you created and activated the virtual environment in the previous step. Ensure that you are still in the src directory of the repository and that your virtual environment is active. Then, run the following command to execute the ingestion script:

python pipelines/ingest_to_cosmos.py
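For reference, the core CSV-to-document transformation that an ingestion script like this typically performs can be sketched as follows. This is not the actual implementation of ingest_to_cosmos.py, and the column names (ProductID, Name, Price) are illustrative assumptions that may not match the real CSV headers.

```python
import csv
import io

def rows_to_documents(csv_text: str) -> list[dict]:
    """Convert product catalog CSV rows into Cosmos DB documents.

    Cosmos DB requires a string 'id' on every document; here we derive it
    from the ProductID column, which is also the partition key property.
    """
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = dict(row)
        doc["id"] = str(row["ProductID"])  # Cosmos DB item id must be a string
        docs.append(doc)
    return docs

# Illustrative sample data; the real file has more columns and rows.
sample = "ProductID,Name,Price\n1,Paint Roller,12.99\n2,Step Ladder,89.00\n"
docs = rows_to_documents(sample)
print(docs[0]["id"], docs[1]["Name"])  # → 1 Step Ladder
```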

You may receive an error message stating "Local Authorization is disabled. Use an AAD token to authorize all requests." There are two common causes. The first is that you are not logged into the Azure CLI; ensure that you are logged in with the correct account and try again. The second is that public network access to your Cosmos DB account is disabled, so your Codespace VM or local machine cannot reach the database. To resolve this, navigate to the Cosmos DB account in the Azure portal, select Networking from the Settings menu, and ensure that Public network access is set to All networks. After saving this change, give the service a few minutes to update, and then run the ingestion script again.

04: Confirm that the data was imported

After the ingestion script has completed, you can confirm that the data was successfully imported into the Cosmos DB container by using the Data Explorer in the Azure portal.

Expand this section to view the solution

Navigate to the Azure portal and open the Cosmos DB account that you created in the first task of this exercise. Then, navigate to the Data Explorer section from the left-hand menu.

In the Data Explorer, you should see the zava database. Inside it is the product_catalog container that you created earlier. You can expand the container to view the imported data.

Select the Items option under the product_catalog container to view the documents that were imported. You should see multiple documents representing the products in the catalog.

Review the imported documents in the product_catalog container.
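If you prefer to verify the import from code rather than the portal, a count query over the container can be sketched as below. This assumes the azure-cosmos and azure-identity packages are installed in your virtual environment, that you are logged into the Azure CLI, and that the endpoint placeholder is replaced with your own account's endpoint.

```python
from azure.cosmos import CosmosClient
from azure.identity import DefaultAzureCredential

# Assumed placeholder endpoint; copy yours from the Cosmos DB account's
# Overview page in the Azure portal.
client = CosmosClient(
    "https://<your-cosmos-account>.documents.azure.com:443/",
    credential=DefaultAzureCredential(),
)
container = client.get_database_client("zava").get_container_client("product_catalog")

# SELECT VALUE COUNT(1) returns a single scalar: the number of documents.
count = list(container.query_items(
    query="SELECT VALUE COUNT(1) FROM c",
    enable_cross_partition_query=True,
))[0]
print(f"product_catalog contains {count} documents")
```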

05: Create an Azure AI Search index and import data

The final step in preparing the dataset is to create an Azure AI Search index and import the data from the Cosmos DB container into the search index. This will allow you to use Azure AI Search to query the product catalog data.

It is also possible to use Cosmos DB's built-in vector search capabilities for this scenario. However, for the purposes of this training, we will use Azure AI Search to demonstrate how to integrate it with Microsoft Foundry. The benefit of using Azure AI Search is that it provides additional features such as hybrid search, and it can serve as a central search service shared across multiple data sources and applications.

Expand this section to view the solution

Navigate to the Azure portal and open the Azure AI Search service that you created in the first task. Then, select the Import data (new) option from the central menu.

Import data into Azure AI Search.

Choose Azure Cosmos DB from the list of data sources.

Select Azure Cosmos DB as the data source.

Select RAG as the scenario to target.

Select RAG as the scenario.

In the Connect to your data section, choose your Cosmos DB account from the drop-down list. Then, select the zava database and the product_catalog container. Additionally, select the Authenticate using managed identity option and ensure that the managed identity type is System-assigned. After that, select Next to continue.

Connect to the Cosmos DB data source.

On the Vectorize your text page, select content_for_vector from the drop-down list as the column to vectorize. Then, choose Azure AI Foundry (Preview) as the Kind and select your Microsoft Foundry project from the drop-down list. In the Model deployment drop-down list, choose text-embedding-3-large. Then, select System assigned identity as the authentication type. After that, tick the checkbox indicating that connecting to a Foundry project will incur additional costs and select Next to continue.

Configure the vectorization settings.

Leave the Advanced settings page with the default settings and select Next to continue.

For the Objects name prefix, enter zava-product-catalog. Then, select Create to import the data.

After the import process is complete, navigate to the Indexes section from the Search management menu. You should see a new index named zava-product-catalog. Select this index to view its details and confirm that the index contains 54 documents. Indexing may take several minutes, so be patient until the document count reaches 54.

Review the created search index.
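You can also check the document count programmatically. This sketch assumes the azure-search-documents and azure-identity packages are installed, that the endpoint placeholder is replaced with your search service's endpoint, and that your identity has a data-plane read role (such as Search Index Data Reader) on the service.

```python
from azure.identity import DefaultAzureCredential
from azure.search.documents import SearchClient

# Assumed placeholder endpoint; copy yours from the search service's
# Overview page in the Azure portal.
search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="zava-product-catalog",
    credential=DefaultAzureCredential(),
)
print(search_client.get_document_count())  # expect 54 once indexing finishes
```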

You may receive an error message indicating Could not parse document. Invalid document key: 'B+wRAKOYLyUBAAAAAAAAAA=='. Keys can only contain letters, digits, underscore (_), dash (-), or equal sign (=). If you receive this error message, delete the Cosmos DB zava database, re-create the zava database, re-create the container, re-load the data, delete the Azure AI Search indexer and its related assets, and try again. This issue happens sporadically when Cosmos DB assigns _rid values that contain a + sign, which Azure AI Search cannot accept as a document key.
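The failing key in the example above is a base64-encoded Cosmos DB _rid that happens to contain a + character. As a quick local check, a small sketch like this can validate whether a document key uses only the characters that Azure AI Search allows:

```python
import re

# Azure AI Search document keys may contain only letters, digits,
# underscore (_), dash (-), or equal sign (=).
VALID_KEY = re.compile(r"^[A-Za-z0-9_\-=]+$")

def is_valid_search_key(key: str) -> bool:
    """Return True if the key contains only characters Azure AI Search accepts."""
    return bool(VALID_KEY.match(key))

print(is_valid_search_key("B-wRAKOYLyUBAAAAAAAAAA=="))  # → True
print(is_valid_search_key("B+wRAKOYLyUBAAAAAAAAAA=="))  # → False: '+' not allowed
```

Note that this only detects the problem; it does not fix it, because _rid values are assigned by Cosmos DB. Re-creating the database and re-loading the data generates new _rid values, which is why the workaround above resolves the issue.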