Survey Transformation Basics
This notebook walks through a survey transformation in order: the main input, the structured outputs, and the practical benefits.
1. Setup
Import the openaivec pandas extension and set the responses model once, before any other cell runs.
In [1]:
import os

import pandas as pd
from pydantic import BaseModel, Field

from openaivec import pandas_ext

assert os.getenv("OPENAI_API_KEY") or os.getenv("AZURE_OPENAI_BASE_URL"), (
    "Set OPENAI_API_KEY or Azure OpenAI environment variables before running this notebook."
)

pandas_ext.set_responses_model("gpt-5.2")
2. Input: survey responses DataFrame
Each row contains one free-text survey response.
In [2]:
survey_df = pd.DataFrame(
    {
        "response_id": ["RESP_001", "RESP_002", "RESP_003", "RESP_004"],
        "response": [
            "I am a 24-year-old student in Seattle and enjoy gaming and anime.",
            "I am a 41-year-old manager in New York, interested in fitness and business books.",
            "I am a 33-year-old software engineer in Austin who likes hiking and coffee.",
            "I am retired in Denver and spend time gardening and local community events.",
        ],
    }
)
survey_df
Out[2]:
| | response_id | response |
|---|---|---|
| 0 | RESP_001 | I am a 24-year-old student in Seattle and enjo... |
| 1 | RESP_002 | I am a 41-year-old manager in New York, intere... |
| 2 | RESP_003 | I am a 33-year-old software engineer in Austin... |
| 3 | RESP_004 | I am retired in Denver and spend time gardenin... |
3. Output A: structured profile per response
Convert each free-text response into a typed profile.
In [3]:
class SurveyProfile(BaseModel):
    age_group: str = Field(description="One of: 18-25, 26-35, 36-45, 46-55, 56+")
    occupation_category: str = Field(description="Short category such as student, technology, business")
    location: str = Field(description="City or region")
    interests: list[str] = Field(description="Top interests")


profiles = survey_df["response"].ai.responses(
    instructions=(
        "Extract age group, occupation category, location, and interests "
        "from each survey response. Keep labels concise."
    ),
    response_format=SurveyProfile,
)

survey_df.assign(profile=profiles)[["response_id", "profile"]]
Out[3]:
| | response_id | profile |
|---|---|---|
| 0 | RESP_001 | age_group='18-25' occupation_category='student... |
| 1 | RESP_002 | age_group='36-45' occupation_category='managem... |
| 2 | RESP_003 | age_group='26-35' occupation_category='technol... |
| 3 | RESP_004 | age_group='56+' occupation_category='retired' ... |
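The age_group labels come from the model, so a deterministic cross-check can be useful when a response states an explicit age. The sketch below is plain Python and is not part of openaivec: `AGE_GROUPS` copies the labels from the `SurveyProfile` field description, while `age_to_group` and `is_valid_group` are hypothetical helpers introduced only for illustration.

```python
# Allowed labels, copied from the SurveyProfile.age_group description.
AGE_GROUPS = ("18-25", "26-35", "36-45", "46-55", "56+")


def age_to_group(age: int) -> str:
    """Map an integer age to one of the AGE_GROUPS labels (hypothetical helper)."""
    if age < 18:
        raise ValueError(f"age {age} is below the youngest bucket")
    if age >= 56:
        return "56+"
    # Every remaining label has the form "low-high" with inclusive bounds.
    for label in AGE_GROUPS[:-1]:
        low, high = (int(part) for part in label.split("-"))
        if low <= age <= high:
            return label
    raise AssertionError("unreachable: the buckets cover ages 18 to 55")


def is_valid_group(label: str) -> bool:
    """Return True when a label is one of the allowed age groups."""
    return label in AGE_GROUPS


# Ages stated explicitly in the first three sample responses.
for age in (24, 41, 33):
    print(age, "->", age_to_group(age))
```

For the sample data, these three ages map to the same buckets shown in Out[3]. The fourth response states no age, so only `is_valid_group` applies to it.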
4. Output B: analysis-ready columns
Expand structured profiles into regular columns for aggregation.
In [4]:
analysis_df = survey_df[["response_id"]].join(profiles.rename("profile").ai.extract())
analysis_df
Out[4]:
| | response_id | profile_age_group | profile_occupation_category | profile_location | profile_interests |
|---|---|---|---|---|---|
| 0 | RESP_001 | 18-25 | student | Seattle | [gaming, anime] |
| 1 | RESP_002 | 36-45 | management | New York | [fitness, business books] |
| 2 | RESP_003 | 26-35 | technology | Austin | [hiking, coffee] |
| 3 | RESP_004 | 56+ | retired | Denver | [gardening, community events] |
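The column names in Out[4] follow a prefix-plus-field pattern (`profile_age_group`, `profile_location`, and so on). The sketch below illustrates that flattening idea in plain Python, under the assumption that each field becomes one prefixed column. It is not openaivec's implementation: the `Profile` dataclass and the `flatten` function are hypothetical stand-ins.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Profile:
    """Hypothetical stand-in for the SurveyProfile schema."""

    age_group: str
    occupation_category: str
    location: str
    interests: list[str] = field(default_factory=list)


def flatten(prefix: str, profile: Profile) -> dict[str, object]:
    """Turn one nested profile into a flat mapping of prefixed column names."""
    return {f"{prefix}_{name}": value for name, value in asdict(profile).items()}


# First row of the sample data, taken from the Out[4] table.
row = flatten("profile", Profile("18-25", "student", "Seattle", ["gaming", "anime"]))
print(row)
```

Each key in `row` matches one of the column headers in Out[4], which is what makes the result directly usable for grouping and counting.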
In [5]:
print("Age group distribution")
print(analysis_df["profile_age_group"].value_counts())
print("\nOccupation category distribution")
print(analysis_df["profile_occupation_category"].value_counts())
Age group distribution
profile_age_group
18-25    1
36-45    1
26-35    1
56+      1
Name: count, dtype: int64

Occupation category distribution
profile_occupation_category
student       1
management    1
technology    1
retired       1
Name: count, dtype: int64
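The profile_interests column holds a list per row, so it has to be flattened before individual interests can be tallied. A minimal plain-Python sketch, with the rows copied by hand from the Out[4] table (the `rows` and `interest_counts` names are illustrative only):

```python
from collections import Counter

# Interests per response, copied from the Out[4] table.
rows = [
    {"response_id": "RESP_001", "profile_interests": ["gaming", "anime"]},
    {"response_id": "RESP_002", "profile_interests": ["fitness", "business books"]},
    {"response_id": "RESP_003", "profile_interests": ["hiking", "coffee"]},
    {"response_id": "RESP_004", "profile_interests": ["gardening", "community events"]},
]

# Flatten the per-row lists into one stream of interests and count each one.
interest_counts = Counter(
    interest for row in rows for interest in row["profile_interests"]
)

for interest, count in interest_counts.most_common():
    print(interest, count)
```

On the DataFrame itself, calling `Series.explode()` on the column and then `value_counts()` should give the same tally.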
5. Benefits
Main input
- Free-text survey responses (response column)
- Extraction schema (SurveyProfile)
Main output
- Structured profile objects (ai.responses)
- Flat analysis columns (ai.extract)
Why this helps
- Turns qualitative text into queryable fields
- Keeps transformation logic explicit and reproducible
- Makes downstream segmentation and BI reporting easier