Survey Transformation Basics

This notebook walks through survey transformation in three steps: the main input, the structured outputs, and the practical benefits.

1. Setup

Import the pandas extension and set the model once, at the top of the notebook.

In [1]:

import os

import pandas as pd
from pydantic import BaseModel, Field

import openaivec
from openaivec import pandas_ext

assert os.getenv("OPENAI_API_KEY") or os.getenv("AZURE_OPENAI_BASE_URL"), (
    "Set OPENAI_API_KEY or Azure OpenAI environment variables before running this notebook."
)

openaivec.set_responses_model("gpt-5.4")

2. Input: survey responses DataFrame

Each row contains one free-text survey response.

In [2]:

survey_df = pd.DataFrame(
    {
        "response_id": ["RESP_001", "RESP_002", "RESP_003", "RESP_004"],
        "response": [
            "I am a 24-year-old student in Seattle and enjoy gaming and anime.",
            "I am a 41-year-old manager in New York, interested in fitness and business books.",
            "I am a 33-year-old software engineer in Austin who likes hiking and coffee.",
            "I am retired in Denver and spend time gardening and local community events.",
        ],
    }
)

survey_df

Out[2]:

	response_id	response
0	RESP_001	I am a 24-year-old student in Seattle and enjo...
1	RESP_002	I am a 41-year-old manager in New York, intere...
2	RESP_003	I am a 33-year-old software engineer in Austin...
3	RESP_004	I am retired in Denver and spend time gardenin...

3. Output A: structured profile per response

Convert each free-text response into a typed profile.

In [3]:

class SurveyProfile(BaseModel):
    age_group: str = Field(description="One of: 18-25, 26-35, 36-45, 46-55, 56+")
    occupation_category: str = Field(description="Short category such as student, technology, business")
    location: str = Field(description="City or region")
    interests: list[str] = Field(description="Top interests")


profiles = survey_df["response"].ai.responses(
    instructions=(
        "Extract age group, occupation category, location, and interests "
        "from each survey response. Keep labels concise."
    ),
    response_format=SurveyProfile,
)

survey_df.assign(profile=profiles)[["response_id", "profile"]]

Out[3]:

	response_id	profile
0	RESP_001	age_group='18-25' occupation_category='student...
1	RESP_002	age_group='36-45' occupation_category='busines...
2	RESP_003	age_group='26-35' occupation_category='technol...
3	RESP_004	age_group='56+' occupation_category='retired' ...
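The `age_group` field above is a plain `str`, so the allowed labels exist only in the field description. A minimal sketch of a stricter variant, assuming pydantic v2, uses `typing.Literal` so that an unexpected label fails validation. `StrictSurveyProfile` and `AgeGroup` are hypothetical names, not part of the notebook, and whether `ai.responses` accepts this schema unchanged as `response_format` is an assumption to verify before relying on it.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError

# Hypothetical alias: the five labels listed in the original field description.
AgeGroup = Literal["18-25", "26-35", "36-45", "46-55", "56+"]


class StrictSurveyProfile(BaseModel):
    """Illustrative variant of SurveyProfile with a constrained age_group."""

    age_group: AgeGroup = Field(description="Age bracket of the respondent")
    occupation_category: str = Field(description="Short category such as student, technology, business")
    location: str = Field(description="City or region")
    interests: list[str] = Field(description="Top interests")


# A value from the allowed set validates normally.
ok = StrictSurveyProfile(
    age_group="18-25",
    occupation_category="student",
    location="Seattle",
    interests=["gaming", "anime"],
)

# A label outside the allowed set raises ValidationError instead of passing through.
try:
    StrictSurveyProfile(
        age_group="twenties",
        occupation_category="student",
        location="Seattle",
        interests=["gaming", "anime"],
    )
    rejected = False
except ValidationError:
    rejected = True
```

A constrained field can keep later `value_counts()` buckets stable, since free-form labels such as "mid-20s" would be rejected rather than counted as a separate group.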

4. Output B: analysis-ready columns

Expand structured profiles into regular columns for aggregation.

In [4]:

analysis_df = survey_df[["response_id"]].join(profiles.rename("profile").ai.extract())

analysis_df

Out[4]:

	response_id	profile_age_group	profile_occupation_category	profile_location	profile_interests
0	RESP_001	18-25	student	Seattle	[gaming, anime]
1	RESP_002	36-45	business	New York	[fitness, business books]
2	RESP_003	26-35	technology	Austin	[hiking, coffee]
3	RESP_004	56+	retired	Denver	[gardening, community events]
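To make the flattening step less opaque, the sketch below approximates its result with plain pandas and pydantic v2: each model is dumped to a dict and the resulting columns are prefixed with `profile_`. This is an illustration of the output shape shown above, not the implementation of `ai.extract`, and the two profile objects are hand-built stand-ins copied from the table rather than model output.

```python
import pandas as pd
from pydantic import BaseModel


class SurveyProfile(BaseModel):
    age_group: str
    occupation_category: str
    location: str
    interests: list[str]


# Hand-built stand-ins for two extracted profiles (values taken from Out[4]).
profiles = [
    SurveyProfile(
        age_group="18-25",
        occupation_category="student",
        location="Seattle",
        interests=["gaming", "anime"],
    ),
    SurveyProfile(
        age_group="36-45",
        occupation_category="business",
        location="New York",
        interests=["fitness", "business books"],
    ),
]

# Dump each model to a dict, build one row per profile, then prefix the columns.
flat = pd.DataFrame([p.model_dump() for p in profiles]).add_prefix("profile_")
```

List-valued fields such as `interests` stay as Python lists inside a single column, which matches the `profile_interests` column in the output above.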

In [5]:

print("Age group distribution")
print(analysis_df["profile_age_group"].value_counts())

print("\nOccupation category distribution")
print(analysis_df["profile_occupation_category"].value_counts())

Age group distribution
profile_age_group
18-25    1
36-45    1
26-35    1
56+      1
Name: count, dtype: int64

Occupation category distribution
profile_occupation_category
student       1
business      1
technology    1
retired       1
Name: count, dtype: int64
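The counts above cover the scalar columns. The `profile_interests` column holds lists, and lists are not hashable, so counting that column directly generally fails. One possible approach, sketched here with a hand-built copy of the relevant columns from Out[4], is to use `DataFrame.explode` to get one row per interest before counting.

```python
import pandas as pd

# Hand-built copy of the relevant columns from Out[4]; no API call is made here.
analysis_df = pd.DataFrame(
    {
        "response_id": ["RESP_001", "RESP_002", "RESP_003", "RESP_004"],
        "profile_age_group": ["18-25", "36-45", "26-35", "56+"],
        "profile_interests": [
            ["gaming", "anime"],
            ["fitness", "business books"],
            ["hiking", "coffee"],
            ["gardening", "community events"],
        ],
    }
)

# One row per (response, interest) pair; the other columns are repeated.
interest_rows = analysis_df.explode("profile_interests")

# Interests are now plain strings, so they can be counted or grouped.
interest_counts = interest_rows["profile_interests"].value_counts()
print(interest_counts)
```

The exploded frame keeps `response_id` and `profile_age_group` on every row, so the same table can also feed a group-by on age group and interest.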

5. Benefits

Main input

- Free-text survey responses (response column)
- Extraction schema (SurveyProfile)

Main output

- Structured profile objects (ai.responses)
- Flat analysis columns (ai.extract)

Why this helps

- Turns qualitative text into queryable fields
- Keeps transformation logic explicit and reproducible
- Makes downstream segmentation and BI reporting easier