Intelligent Data Completion with openaivec¶
This notebook demonstrates how to use openaivec
's fillna
functionality to solve common missing data problems in marketing and retail industries.
Unlike traditional statistical methods (mean, median, mode), AI-powered completion enables contextually-aware, natural value imputation by understanding the underlying data relationships.
Setup¶
First, let's import the necessary libraries and configure the OpenAI client.
import pandas as pd
from openai import OpenAI
from openaivec import pandas_ext
# Enable openaivec
pandas_ext.use(OpenAI())
# Configure pandas display options for better table formatting
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)
Case Study 1: Customer Database Missing Value Completion¶
E-commerce customer databases often contain missing values due to incomplete registration forms or data lost during system integrations. Let's use AI to intelligently fill these gaps.
# E-commerce customer data with missing values
customer_data = pd.DataFrame({
"customer_id": [1001, 1002, 1003, 1004, 1005],
"name": ["John Smith", "Sarah Johnson", None, "Michael Brown", "Emily Davis"],
"age": [35, None, 28, 45, None],
"occupation": ["Software Engineer", "Marketing Manager", "UX Designer", None, "Graduate Student"],
"annual_income": [75000, 65000, None, 95000, 25000],
"preferred_category": ["Electronics", None, "Fashion", "Books", "Beauty"]
})
print("Customer data with missing values:")
display(customer_data)
print(f"\nTotal missing values: {customer_data.isnull().sum().sum()}")
Customer data with missing values:
customer_id | name | age | occupation | annual_income | preferred_category | |
---|---|---|---|---|---|---|
0 | 1001 | John Smith | 35.0 | Software Engineer | 75000.0 | Electronics |
1 | 1002 | Sarah Johnson | NaN | Marketing Manager | 65000.0 | None |
2 | 1003 | None | 28.0 | UX Designer | NaN | Fashion |
3 | 1004 | Michael Brown | 45.0 | None | 95000.0 | Books |
4 | 1005 | Emily Davis | NaN | Graduate Student | 25000.0 | Beauty |
Total missing values: 6
Income Data Completion¶
Using occupation information to estimate reasonable income values. AI understands the relationship between occupation and age to propose appropriate values.
# Fill missing income values
filled_income = customer_data.ai.fillna("annual_income")
print("Income completion results:")
display(filled_income[["name", "occupation", "age", "annual_income"]])
# Before/after comparison for better visualization
print("\nBefore/After Comparison:")
comparison_df = pd.DataFrame({
'Before': customer_data["annual_income"],
'After': filled_income["annual_income"],
'Name': customer_data["name"],
'Occupation': customer_data["occupation"]
})
display(comparison_df)
Income completion results:
name | occupation | age | annual_income | |
---|---|---|---|---|
0 | John Smith | Software Engineer | 35.0 | 75000.0 |
1 | Sarah Johnson | Marketing Manager | NaN | 65000.0 |
2 | None | UX Designer | 28.0 | 60000.0 |
3 | Michael Brown | None | 45.0 | 95000.0 |
4 | Emily Davis | Graduate Student | NaN | 25000.0 |
Before/After Comparison:
Before | After | Name | Occupation | |
---|---|---|---|---|
0 | 75000.0 | 75000.0 | John Smith | Software Engineer |
1 | 65000.0 | 65000.0 | Sarah Johnson | Marketing Manager |
2 | NaN | 60000.0 | None | UX Designer |
3 | 95000.0 | 95000.0 | Michael Brown | None |
4 | 25000.0 | 25000.0 | Emily Davis | Graduate Student |
Preferred Category Prediction¶
Based on customer occupation, age, and income, predict the product categories they're likely interested in. This can be used as foundational data for recommendation systems.
# Fill missing preferred category values
filled_category = customer_data.ai.fillna("preferred_category")
print("Preferred category completion results:")
display(filled_category[["name", "occupation", "age", "annual_income", "preferred_category"]])
# Before/after comparison for better visualization
print("\nBefore/After Comparison:")
comparison_df = pd.DataFrame({
'Before': customer_data["preferred_category"],
'After': filled_category["preferred_category"],
'Name': customer_data["name"],
'Occupation': customer_data["occupation"]
})
display(comparison_df)
Preferred category completion results:
name | occupation | age | annual_income | preferred_category | |
---|---|---|---|---|---|
0 | John Smith | Software Engineer | 35.0 | 75000.0 | Electronics |
1 | Sarah Johnson | Marketing Manager | NaN | 65000.0 | Electronics |
2 | None | UX Designer | 28.0 | NaN | Fashion |
3 | Michael Brown | None | 45.0 | 95000.0 | Books |
4 | Emily Davis | Graduate Student | NaN | 25000.0 | Beauty |
Before/After Comparison:
Before | After | Name | Occupation | |
---|---|---|---|---|
0 | Electronics | Electronics | John Smith | Software Engineer |
1 | None | Electronics | Sarah Johnson | Marketing Manager |
2 | Fashion | Fashion | None | UX Designer |
3 | Books | Books | Michael Brown | None |
4 | Beauty | Beauty | Emily Davis | Graduate Student |
Case Study 2: Product Data Auto-Completion¶
Marketing departments often face challenges with incomplete product databases. New products or externally sourced data frequently lack marketing copy and target demographic information.
# Product data with missing marketing information
product_data = pd.DataFrame({
"product_id": ["P001", "P002", "P003", "P004", "P005"],
"product_name": ["Wireless Earbuds Pro", "Organic Face Moisturizer", "Business Planner 2024", "Protein Shaker Bottle", "Essential Oil Diffuser"],
"category": ["Electronics", "Beauty", "Stationery", "Fitness", "Home & Garden"],
"price": [199.99, 34.99, 24.99, 19.99, 89.99],
"target_age": ["25-40", None, "30-50", "20-35", None],
"marketing_copy": [None, "100% natural ingredients, perfect for sensitive skin", None, "Maximize your workout efficiency", None]
})
print("Product data with missing information:")
display(product_data)
print(f"\nTotal missing values: {product_data.isnull().sum().sum()}")
Product data with missing information:
product_id | product_name | category | price | target_age | marketing_copy | |
---|---|---|---|---|---|---|
0 | P001 | Wireless Earbuds Pro | Electronics | 199.99 | 25-40 | None |
1 | P002 | Organic Face Moisturizer | Beauty | 34.99 | None | 100% natural ingredients, perfect for sensitiv... |
2 | P003 | Business Planner 2024 | Stationery | 24.99 | 30-50 | None |
3 | P004 | Protein Shaker Bottle | Fitness | 19.99 | 20-35 | Maximize your workout efficiency |
4 | P005 | Essential Oil Diffuser | Home & Garden | 89.99 | None | None |
Total missing values: 5
Automatic Marketing Copy Generation¶
Generate compelling marketing copy from product name, category, and price. This helps streamline marketing team workflows.
# Generate marketing copy automatically
filled_marketing_copy = product_data.ai.fillna("marketing_copy")
print("Marketing copy generation results:")
display(filled_marketing_copy[["product_name", "category", "price", "marketing_copy"]])
print("\nGenerated Marketing Copy Details:")
for idx, row in filled_marketing_copy.iterrows():
if pd.isna(product_data.loc[idx, "marketing_copy"]): # Originally missing rows
print(f"✨ {row['product_name']}")
print(f" Generated copy: \"{row['marketing_copy']}\"")
print(f" Category: {row['category']}, Price: ${row['price']}")
print()
Marketing copy generation results:
product_name | category | price | marketing_copy | |
---|---|---|---|---|
0 | Wireless Earbuds Pro | Electronics | 199.99 | Experience unparalleled sound quality and comf... |
1 | Organic Face Moisturizer | Beauty | 34.99 | 100% natural ingredients, perfect for sensitiv... |
2 | Business Planner 2024 | Stationery | 24.99 | Stay organized and achieve your goals with our... |
3 | Protein Shaker Bottle | Fitness | 19.99 | Maximize your workout efficiency |
4 | Essential Oil Diffuser | Home & Garden | 89.99 | Transform your space with soothing aromas and ... |
Generated Marketing Copy Details: ✨ Wireless Earbuds Pro Generated copy: "Experience unparalleled sound quality and comfort with our Wireless Earbuds Pro, designed for the modern lifestyle." Category: Electronics, Price: $199.99 ✨ Business Planner 2024 Generated copy: "Stay organized and achieve your goals with our Business Planner 2024, perfect for professionals on the go." Category: Stationery, Price: $24.99 ✨ Essential Oil Diffuser Generated copy: "Transform your space with soothing aromas and ambiance using our Essential Oil Diffuser, ideal for relaxation." Category: Home & Garden, Price: $89.99
Target Age Group Estimation¶
Estimate appropriate target age groups based on product characteristics. This can be used for marketing strategy development.
# Estimate target age groups
filled_target_age = product_data.ai.fillna("target_age")
print("Target age group estimation results:")
display(filled_target_age[["product_name", "category", "price", "target_age"]])
print("\nGenerated Target Age Details:")
for idx, row in filled_target_age.iterrows():
if pd.isna(product_data.loc[idx, "target_age"]): # Originally missing rows
print(f"📊 {row['product_name']}")
print(f" Estimated target age: {row['target_age']}")
print(f" Category: {row['category']}, Price: ${row['price']}")
print()
print("=" * 50)
print("🎯 Marketing Benefits")
print("=" * 50)
print("✅ Complete missing data with natural, contextual values")
print("✅ Product-aware appropriate targeting")
print("✅ Reduced manual effort for marketing teams")
print("✅ Support for data-driven decision making")
Target age group estimation results:
product_name | category | price | target_age | |
---|---|---|---|---|
0 | Wireless Earbuds Pro | Electronics | 199.99 | 25-40 |
1 | Organic Face Moisturizer | Beauty | 34.99 | 25-40 |
2 | Business Planner 2024 | Stationery | 24.99 | 30-50 |
3 | Protein Shaker Bottle | Fitness | 19.99 | 20-35 |
4 | Essential Oil Diffuser | Home & Garden | 89.99 | 30-50 |
Generated Target Age Details: 📊 Organic Face Moisturizer Estimated target age: 25-40 Category: Beauty, Price: $34.99 📊 Essential Oil Diffuser Estimated target age: 30-50 Category: Home & Garden, Price: $89.99 ================================================== 🎯 Marketing Benefits ================================================== ✅ Complete missing data with natural, contextual values ✅ Product-aware appropriate targeting ✅ Reduced manual effort for marketing teams ✅ Support for data-driven decision making
Case Study 3: Multiple Column Chain Completion¶
In real business scenarios, you often need to complete multiple missing values at once. openaivec
allows efficient processing using method chaining.
# Complete multiple columns at once (method chaining)
complete_data = (customer_data
.ai.fillna("annual_income")
.ai.fillna("preferred_category")
.ai.fillna("name"))
print("Fully completed customer data:")
display(complete_data)
print("\nCompletion Summary:")
# Create summary for the relevant columns that had missing values
summary_data = []
for col in customer_data.columns:
summary_data.append({
'Column': col,
'Missing Before': customer_data[col].isnull().sum(),
'Missing After': complete_data[col].isnull().sum()
})
summary_df = pd.DataFrame(summary_data)
display(summary_df)
print(f"\nTotal missing values before: {customer_data.isnull().sum().sum()}")
print(f"Total missing values after: {complete_data.isnull().sum().sum()}")
print("✅ All missing values have been naturally completed!")
Fully completed customer data:
customer_id | name | age | occupation | annual_income | preferred_category | |
---|---|---|---|---|---|---|
0 | 1001 | John Smith | 35.0 | Software Engineer | 75000.0 | Electronics |
1 | 1002 | Sarah Johnson | NaN | Marketing Manager | 65000.0 | Electronics |
2 | 1003 | Alice Johnson | 28.0 | UX Designer | 70000.0 | Fashion |
3 | 1004 | Michael Brown | 45.0 | None | 95000.0 | Books |
4 | 1005 | Emily Davis | NaN | Graduate Student | 25000.0 | Beauty |
Completion Summary:
Column | Missing Before | Missing After | |
---|---|---|---|
0 | customer_id | 0 | 0 |
1 | name | 1 | 0 |
2 | age | 2 | 2 |
3 | occupation | 1 | 1 |
4 | annual_income | 1 | 0 |
5 | preferred_category | 1 | 0 |
Total missing values before: 6 Total missing values after: 3 ✅ All missing values have been naturally completed!
Summary¶
openaivec
's fillna
functionality enables contextually-aware, natural data completion that was impossible with traditional statistical methods.
Business Benefits¶
- Improved Data Quality: Complete missing values with appropriate values to enhance analysis accuracy
- Operational Efficiency: Significantly reduce manual data entry tasks
- Enhanced Marketing Precision: Enable more accurate customer segmentation
- Accelerated Decision Making: Quick analysis with complete datasets
Applicable Scenarios¶
- Customer database maintenance
- Automated product information completion
- Marketing campaign targeting
- Sales forecast accuracy improvement
- A/B test result analysis
Leverage openaivec
to achieve data-driven business growth!