Intelligent Data Completion with openaivec¶

This notebook demonstrates how to use openaivec's fillna functionality to solve common missing data problems in marketing and retail industries.

Unlike traditional statistical methods (mean, median, mode), AI-powered completion enables contextually-aware, natural value imputation by understanding the underlying data relationships.

Setup¶

First, let's import the necessary libraries and configure the OpenAI client.

In [ ]:

Copied!





import pandas as pd
from openai import OpenAI

from openaivec import pandas_ext

# Enable openaivec
pandas_ext.set_client(OpenAI())

# Configure pandas display options for better table formatting
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)
import pandas as pd
from openai import OpenAI

from openaivec import pandas_ext

# Enable openaivec
pandas_ext.set_client(OpenAI())

# Configure pandas display options for better table formatting
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

Case Study 1: Customer Database Missing Value Completion¶

E-commerce customer databases often contain missing values due to incomplete registration forms or data lost during system integrations. Let's use AI to intelligently fill these gaps.

In [2]:

Copied!





# E-commerce customer data with missing values
customer_data = pd.DataFrame({
    "customer_id": [1001, 1002, 1003, 1004, 1005],
    "name": ["John Smith", "Sarah Johnson", None, "Michael Brown", "Emily Davis"],
    "age": [35, None, 28, 45, None],
    "occupation": ["Software Engineer", "Marketing Manager", "UX Designer", None, "Graduate Student"],
    "annual_income": [75000, 65000, None, 95000, 25000],
    "preferred_category": ["Electronics", None, "Fashion", "Books", "Beauty"]
})

print("Customer data with missing values:")
display(customer_data)
print(f"\nTotal missing values: {customer_data.isnull().sum().sum()}")
# E-commerce customer data with missing values
customer_data = pd.DataFrame({
    "customer_id": [1001, 1002, 1003, 1004, 1005],
    "name": ["John Smith", "Sarah Johnson", None, "Michael Brown", "Emily Davis"],
    "age": [35, None, 28, 45, None],
    "occupation": ["Software Engineer", "Marketing Manager", "UX Designer", None, "Graduate Student"],
    "annual_income": [75000, 65000, None, 95000, 25000],
    "preferred_category": ["Electronics", None, "Fashion", "Books", "Beauty"]
})

print("Customer data with missing values:")
display(customer_data)
print(f"\nTotal missing values: {customer_data.isnull().sum().sum()}")

Customer data with missing values:

	customer_id	name	age	occupation	annual_income	preferred_category
0	1001	John Smith	35.0	Software Engineer	75000.0	Electronics
1	1002	Sarah Johnson	NaN	Marketing Manager	65000.0	None
2	1003	None	28.0	UX Designer	NaN	Fashion
3	1004	Michael Brown	45.0	None	95000.0	Books
4	1005	Emily Davis	NaN	Graduate Student	25000.0	Beauty

Total missing values: 6

Income Data Completion¶

Using occupation information to estimate reasonable income values. AI understands the relationship between occupation and age to propose appropriate values.

In [3]:

Copied!





# Fill missing income values
filled_income = customer_data.ai.fillna("annual_income")

print("Income completion results:")
display(filled_income[["name", "occupation", "age", "annual_income"]])

# Before/after comparison for better visualization
print("\nBefore/After Comparison:")
comparison_df = pd.DataFrame({
    'Before': customer_data["annual_income"],
    'After': filled_income["annual_income"],
    'Name': customer_data["name"],
    'Occupation': customer_data["occupation"]
})
display(comparison_df)
# Fill missing income values
filled_income = customer_data.ai.fillna("annual_income")

print("Income completion results:")
display(filled_income[["name", "occupation", "age", "annual_income"]])

# Before/after comparison for better visualization
print("\nBefore/After Comparison:")
comparison_df = pd.DataFrame({
    'Before': customer_data["annual_income"],
    'After': filled_income["annual_income"],
    'Name': customer_data["name"],
    'Occupation': customer_data["occupation"]
})
display(comparison_df)

Income completion results:

	name	occupation	age	annual_income
0	John Smith	Software Engineer	35.0	75000.0
1	Sarah Johnson	Marketing Manager	NaN	65000.0
2	None	UX Designer	28.0	60000.0
3	Michael Brown	None	45.0	95000.0
4	Emily Davis	Graduate Student	NaN	25000.0

Before/After Comparison:

	Before	After	Name	Occupation
0	75000.0	75000.0	John Smith	Software Engineer
1	65000.0	65000.0	Sarah Johnson	Marketing Manager
2	NaN	60000.0	None	UX Designer
3	95000.0	95000.0	Michael Brown	None
4	25000.0	25000.0	Emily Davis	Graduate Student

Preferred Category Prediction¶

Based on customer occupation, age, and income, predict the product categories they're likely interested in. This can be used as foundational data for recommendation systems.

In [4]:

Copied!





# Fill missing preferred category values
filled_category = customer_data.ai.fillna("preferred_category")

print("Preferred category completion results:")
display(filled_category[["name", "occupation", "age", "annual_income", "preferred_category"]])

# Before/after comparison for better visualization
print("\nBefore/After Comparison:")
comparison_df = pd.DataFrame({
    'Before': customer_data["preferred_category"],
    'After': filled_category["preferred_category"],
    'Name': customer_data["name"],
    'Occupation': customer_data["occupation"]
})
display(comparison_df)
# Fill missing preferred category values
filled_category = customer_data.ai.fillna("preferred_category")

print("Preferred category completion results:")
display(filled_category[["name", "occupation", "age", "annual_income", "preferred_category"]])

# Before/after comparison for better visualization
print("\nBefore/After Comparison:")
comparison_df = pd.DataFrame({
    'Before': customer_data["preferred_category"],
    'After': filled_category["preferred_category"],
    'Name': customer_data["name"],
    'Occupation': customer_data["occupation"]
})
display(comparison_df)

Preferred category completion results:

	name	occupation	age	annual_income	preferred_category
0	John Smith	Software Engineer	35.0	75000.0	Electronics
1	Sarah Johnson	Marketing Manager	NaN	65000.0	Electronics
2	None	UX Designer	28.0	NaN	Fashion
3	Michael Brown	None	45.0	95000.0	Books
4	Emily Davis	Graduate Student	NaN	25000.0	Beauty

Before/After Comparison:

	Before	After	Name	Occupation
0	Electronics	Electronics	John Smith	Software Engineer
1	None	Electronics	Sarah Johnson	Marketing Manager
2	Fashion	Fashion	None	UX Designer
3	Books	Books	Michael Brown	None
4	Beauty	Beauty	Emily Davis	Graduate Student

Case Study 2: Product Data Auto-Completion¶

Marketing departments often face challenges with incomplete product databases. New products or externally sourced data frequently lack marketing copy and target demographic information.

In [5]:

Copied!





# Product data with missing marketing information
product_data = pd.DataFrame({
    "product_id": ["P001", "P002", "P003", "P004", "P005"],
    "product_name": ["Wireless Earbuds Pro", "Organic Face Moisturizer", "Business Planner 2024", "Protein Shaker Bottle", "Essential Oil Diffuser"],
    "category": ["Electronics", "Beauty", "Stationery", "Fitness", "Home & Garden"],
    "price": [199.99, 34.99, 24.99, 19.99, 89.99],
    "target_age": ["25-40", None, "30-50", "20-35", None],
    "marketing_copy": [None, "100% natural ingredients, perfect for sensitive skin", None, "Maximize your workout efficiency", None]
})

print("Product data with missing information:")
display(product_data)
print(f"\nTotal missing values: {product_data.isnull().sum().sum()}")
# Product data with missing marketing information
product_data = pd.DataFrame({
    "product_id": ["P001", "P002", "P003", "P004", "P005"],
    "product_name": ["Wireless Earbuds Pro", "Organic Face Moisturizer", "Business Planner 2024", "Protein Shaker Bottle", "Essential Oil Diffuser"],
    "category": ["Electronics", "Beauty", "Stationery", "Fitness", "Home & Garden"],
    "price": [199.99, 34.99, 24.99, 19.99, 89.99],
    "target_age": ["25-40", None, "30-50", "20-35", None],
    "marketing_copy": [None, "100% natural ingredients, perfect for sensitive skin", None, "Maximize your workout efficiency", None]
})

print("Product data with missing information:")
display(product_data)
print(f"\nTotal missing values: {product_data.isnull().sum().sum()}")

Product data with missing information:

	product_id	product_name	category	price	target_age	marketing_copy
0	P001	Wireless Earbuds Pro	Electronics	199.99	25-40	None
1	P002	Organic Face Moisturizer	Beauty	34.99	None	100% natural ingredients, perfect for sensitiv...
2	P003	Business Planner 2024	Stationery	24.99	30-50	None
3	P004	Protein Shaker Bottle	Fitness	19.99	20-35	Maximize your workout efficiency
4	P005	Essential Oil Diffuser	Home & Garden	89.99	None	None

Total missing values: 5

Automatic Marketing Copy Generation¶

Generate compelling marketing copy from product name, category, and price. This helps streamline marketing team workflows.

In [6]:

Copied!





# Generate marketing copy automatically
filled_marketing_copy = product_data.ai.fillna("marketing_copy")

print("Marketing copy generation results:")
display(filled_marketing_copy[["product_name", "category", "price", "marketing_copy"]])

print("\nGenerated Marketing Copy Details:")
for idx, row in filled_marketing_copy.iterrows():
    if pd.isna(product_data.loc[idx, "marketing_copy"]):  # Originally missing rows
        print(f"✨ {row['product_name']}")
        print(f"   Generated copy: \"{row['marketing_copy']}\"")
        print(f"   Category: {row['category']}, Price: ${row['price']}")
        print()
# Generate marketing copy automatically
filled_marketing_copy = product_data.ai.fillna("marketing_copy")

print("Marketing copy generation results:")
display(filled_marketing_copy[["product_name", "category", "price", "marketing_copy"]])

print("\nGenerated Marketing Copy Details:")
for idx, row in filled_marketing_copy.iterrows():
    if pd.isna(product_data.loc[idx, "marketing_copy"]):  # Originally missing rows
        print(f"✨ {row['product_name']}")
        print(f"   Generated copy: \"{row['marketing_copy']}\"")
        print(f"   Category: {row['category']}, Price: ${row['price']}")
        print()

Marketing copy generation results:

	product_name	category	price	marketing_copy
0	Wireless Earbuds Pro	Electronics	199.99	Experience unparalleled sound quality and comf...
1	Organic Face Moisturizer	Beauty	34.99	100% natural ingredients, perfect for sensitiv...
2	Business Planner 2024	Stationery	24.99	Stay organized and achieve your goals with our...
3	Protein Shaker Bottle	Fitness	19.99	Maximize your workout efficiency
4	Essential Oil Diffuser	Home & Garden	89.99	Transform your space with soothing aromas and ...

Generated Marketing Copy Details:
✨ Wireless Earbuds Pro
   Generated copy: "Experience unparalleled sound quality and comfort with our Wireless Earbuds Pro, designed for the modern lifestyle."
   Category: Electronics, Price: $199.99

✨ Business Planner 2024
   Generated copy: "Stay organized and achieve your goals with our Business Planner 2024, perfect for professionals on the go."
   Category: Stationery, Price: $24.99

✨ Essential Oil Diffuser
   Generated copy: "Transform your space with soothing aromas and ambiance using our Essential Oil Diffuser, ideal for relaxation."
   Category: Home & Garden, Price: $89.99

Target Age Group Estimation¶

Estimate appropriate target age groups based on product characteristics. This can be used for marketing strategy development.

In [7]:

Copied!





# Estimate target age groups
filled_target_age = product_data.ai.fillna("target_age")

print("Target age group estimation results:")
display(filled_target_age[["product_name", "category", "price", "target_age"]])

print("\nGenerated Target Age Details:")
for idx, row in filled_target_age.iterrows():
    if pd.isna(product_data.loc[idx, "target_age"]):  # Originally missing rows
        print(f"📊 {row['product_name']}")
        print(f"   Estimated target age: {row['target_age']}")
        print(f"   Category: {row['category']}, Price: ${row['price']}")
        print()

print("=" * 50)
print("🎯 Marketing Benefits")
print("=" * 50)
print("✅ Complete missing data with natural, contextual values")
print("✅ Product-aware appropriate targeting")
print("✅ Reduced manual effort for marketing teams")
print("✅ Support for data-driven decision making")
# Estimate target age groups
filled_target_age = product_data.ai.fillna("target_age")

print("Target age group estimation results:")
display(filled_target_age[["product_name", "category", "price", "target_age"]])

print("\nGenerated Target Age Details:")
for idx, row in filled_target_age.iterrows():
    if pd.isna(product_data.loc[idx, "target_age"]):  # Originally missing rows
        print(f"📊 {row['product_name']}")
        print(f"   Estimated target age: {row['target_age']}")
        print(f"   Category: {row['category']}, Price: ${row['price']}")
        print()

print("=" * 50)
print("🎯 Marketing Benefits")
print("=" * 50)
print("✅ Complete missing data with natural, contextual values")
print("✅ Product-aware appropriate targeting")
print("✅ Reduced manual effort for marketing teams")
print("✅ Support for data-driven decision making")

Target age group estimation results:

	product_name	category	price	target_age
0	Wireless Earbuds Pro	Electronics	199.99	25-40
1	Organic Face Moisturizer	Beauty	34.99	25-40
2	Business Planner 2024	Stationery	24.99	30-50
3	Protein Shaker Bottle	Fitness	19.99	20-35
4	Essential Oil Diffuser	Home & Garden	89.99	30-50

Generated Target Age Details:
📊 Organic Face Moisturizer
   Estimated target age: 25-40
   Category: Beauty, Price: $34.99

📊 Essential Oil Diffuser
   Estimated target age: 30-50
   Category: Home & Garden, Price: $89.99

==================================================
🎯 Marketing Benefits
==================================================
✅ Complete missing data with natural, contextual values
✅ Product-aware appropriate targeting
✅ Reduced manual effort for marketing teams
✅ Support for data-driven decision making

Case Study 3: Multiple Column Chain Completion¶

In real business scenarios, you often need to complete multiple missing values at once. openaivec allows efficient processing using method chaining.

In [8]:

Copied!





# Complete multiple columns at once (method chaining)
complete_data = (customer_data
                .ai.fillna("annual_income")
                .ai.fillna("preferred_category")
                .ai.fillna("name"))

print("Fully completed customer data:")
display(complete_data)

print("\nCompletion Summary:")
# Create summary for the relevant columns that had missing values
summary_data = []
for col in customer_data.columns:
    summary_data.append({
        'Column': col,
        'Missing Before': customer_data[col].isnull().sum(),
        'Missing After': complete_data[col].isnull().sum()
    })

summary_df = pd.DataFrame(summary_data)
display(summary_df)

print(f"\nTotal missing values before: {customer_data.isnull().sum().sum()}")
print(f"Total missing values after: {complete_data.isnull().sum().sum()}")
print("✅ All missing values have been naturally completed!")
# Complete multiple columns at once (method chaining)
complete_data = (customer_data
                .ai.fillna("annual_income")
                .ai.fillna("preferred_category")
                .ai.fillna("name"))

print("Fully completed customer data:")
display(complete_data)

print("\nCompletion Summary:")
# Create summary for the relevant columns that had missing values
summary_data = []
for col in customer_data.columns:
    summary_data.append({
        'Column': col,
        'Missing Before': customer_data[col].isnull().sum(),
        'Missing After': complete_data[col].isnull().sum()
    })

summary_df = pd.DataFrame(summary_data)
display(summary_df)

print(f"\nTotal missing values before: {customer_data.isnull().sum().sum()}")
print(f"Total missing values after: {complete_data.isnull().sum().sum()}")
print("✅ All missing values have been naturally completed!")

Fully completed customer data:

	customer_id	name	age	occupation	annual_income	preferred_category
0	1001	John Smith	35.0	Software Engineer	75000.0	Electronics
1	1002	Sarah Johnson	NaN	Marketing Manager	65000.0	Electronics
2	1003	Alice Johnson	28.0	UX Designer	70000.0	Fashion
3	1004	Michael Brown	45.0	None	95000.0	Books
4	1005	Emily Davis	NaN	Graduate Student	25000.0	Beauty

Completion Summary:

	Column	Missing Before	Missing After
0	customer_id	0	0
1	name	1	0
2	age	2	2
3	occupation	1	1
4	annual_income	1	0
5	preferred_category	1	0

Total missing values before: 6
Total missing values after: 3
✅ All missing values have been naturally completed!

Summary¶

openaivec's fillna functionality enables contextually-aware, natural data completion that was impossible with traditional statistical methods.

Business Benefits¶

Improved Data Quality: Complete missing values with appropriate values to enhance analysis accuracy
Operational Efficiency: Significantly reduce manual data entry tasks
Enhanced Marketing Precision: Enable more accurate customer segmentation
Accelerated Decision Making: Quick analysis with complete datasets

Applicable Scenarios¶

Customer database maintenance
Automated product information completion
Marketing campaign targeting
Sales forecast accuracy improvement
A/B test result analysis

Leverage openaivec to achieve data-driven business growth!