Data Preparation & Setup for Causal Inference
Overview
Successful causal inference starts with properly structured data. Unlike simple correlation analysis, causal methods require specific data characteristics and measurement approaches. This guide covers essential data preparation steps before implementing any causal inference method.
Understanding Your Data Requirements
The Gold Standard: Panel Data Structure
The ideal dataset for causal analysis is panel data: observations of the same individuals over multiple time periods. This structure lets us observe changes within individuals and control for time-invariant characteristics.
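To make the structure concrete, here is a toy panel with synthetic values: two users, each observed over the same three weekly periods, with user u1 adopting Copilot in the second week:
import pandas as pd

# Synthetic illustration of a panel layout (values are made up)
panel_example = pd.DataFrame({
    'user_id':     ['u1', 'u1', 'u1', 'u2', 'u2', 'u2'],
    'time_period': pd.to_datetime(['2024-01-01', '2024-01-08', '2024-01-15'] * 2),
    'treatment':   [0, 1, 1, 0, 0, 0],                   # u1 adopts in week 2; u2 never treated
    'outcome':     [10.0, 12.5, 13.0, 9.0, 9.5, 9.0],    # e.g., tickets resolved per week
})
print(panel_example)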
Before implementing any causal inference method, ensure your data meets these requirements:
import pandas as pd
import numpy as np

# Required data structure for causal analysis
required_columns = {
    'user_id': 'str',           # Unique identifier for each user
    'time_period': 'datetime',  # Time dimension (weekly/monthly)
    'treatment': 'int',         # Copilot usage (0/1 or continuous)
    'outcome': 'float',         # Productivity metric
    'confounders': 'various'    # Job role, tenure, team size, etc.
}

# Example data validation function
def validate_causal_data(df):
    """
    Comprehensive validation of data quality for causal inference.
    Returns a dictionary of checks and recommendations.
    """
    checks = {
        'temporal_ordering': df['time_period'].is_monotonic_increasing,
        'treatment_variation': df['treatment'].nunique() > 1,
        'outcome_completeness': df['outcome'].notna().mean() > 0.8,
        'sufficient_sample': len(df) > 100,
        'panel_balance': df['user_id'].nunique() * df['time_period'].nunique() == len(df)
    }

    # Generate recommendations based on failed checks
    recommendations = []
    if not checks['temporal_ordering']:
        recommendations.append("Sort data by time_period to ensure proper temporal ordering")
    if not checks['treatment_variation']:
        recommendations.append("Treatment variable shows no variation - check data quality")
    if not checks['outcome_completeness']:
        recommendations.append("High missing data in outcome - consider imputation or data cleaning")
    if not checks['sufficient_sample']:
        recommendations.append("Sample size may be too small for reliable causal inference")
    if not checks['panel_balance']:
        recommendations.append("Unbalanced panel - some users missing observations in some periods")

    return {'checks': checks, 'recommendations': recommendations}

# Run validation
validation_results = validate_causal_data(df)
print("Data Quality Checks:", validation_results['checks'])
if validation_results['recommendations']:
    print("Recommendations:", validation_results['recommendations'])
Understanding the Output:
- temporal_ordering: Ensures time flows correctly (critical for causal ordering)
- treatment_variation: Confirms we have both treated and untreated observations
- outcome_completeness: High missing data can bias results
- sufficient_sample: Small samples lead to unreliable estimates
- panel_balance: Balanced panels enable stronger causal identification
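If you do not yet have real telemetry wired up, you can exercise the validator on a small synthetic panel first. This is a minimal sketch with made-up values, shaped to match the schema above:
rng = np.random.default_rng(42)
n_users, n_periods = 50, 8

# Balanced synthetic panel: every user appears in every period
df = pd.DataFrame({
    'user_id': np.repeat([f'u{i}' for i in range(n_users)], n_periods),
    'time_period': list(pd.date_range('2024-01-01', periods=n_periods, freq='W')) * n_users,
    'treatment': rng.integers(0, 2, n_users * n_periods),
    'outcome': rng.normal(10, 2, n_users * n_periods),
}).sort_values('time_period').reset_index(drop=True)

print(validate_causal_data(df)['checks'])  # all five checks should pass on this toy panel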
Defining Treatment and Outcome Variables
Treatment Variable Design
How you define your treatment variable fundamentally affects your causal interpretation. Consider these different approaches:
# Treatment variable definitions
def define_treatment_binary(df, threshold=1):
    """
    Binary treatment: Used Copilot (1) vs Not Used (0).

    This is the simplest approach - users either used Copilot or they didn't.
    Good for: Initial adoption analysis, policy evaluation
    Interpretation: "The effect of any Copilot usage vs none"
    """
    return (df['copilot_actions'] >= threshold).astype(int)

def define_treatment_continuous(df):
    """
    Continuous treatment: Number of Copilot actions per week.

    Captures intensity of usage, not just adoption.
    Good for: Understanding dose-response relationships
    Interpretation: "The effect of each additional Copilot action per week"
    """
    return df['copilot_actions_per_week']

def define_treatment_categorical(df):
    """
    Categorical treatment: Usage intensity levels.

    Balances simplicity with intensity measurement.
    Good for: Identifying thresholds, communicating to stakeholders
    Interpretation: "The effect of different usage intensities"
    """
    # right=False makes the bins [0, 1), [1, 10), [10, 50), [50, inf),
    # so zero actions is correctly labeled 'None' instead of falling outside the bins
    return pd.cut(df['copilot_actions'],
                  bins=[0, 1, 10, 50, np.inf],
                  labels=['None', 'Light', 'Moderate', 'Heavy'],
                  right=False)

# Example of how the choice affects interpretation
df['treatment_binary'] = define_treatment_binary(df)
df['treatment_continuous'] = define_treatment_continuous(df)
df['treatment_categorical'] = define_treatment_categorical(df)

print("Treatment variable distributions:")
print("Binary:", df['treatment_binary'].value_counts())
print("Continuous:", df['treatment_continuous'].describe())
print("Categorical:", df['treatment_categorical'].value_counts())
Outcome Variable Selection
Your outcome choice determines what causal effect you're measuring:
def productivity_metrics(df):
    """
    Various productivity outcome measures with different interpretations
    """
    outcomes = {
        # Volume metrics
        'tickets_resolved': df['tickets_closed'] / df['time_period_weeks'],
        'emails_sent': df['emails_sent'] / df['time_period_weeks'],

        # Quality metrics
        'case_resolution_time': df['avg_case_hours'],
        'customer_satisfaction': df['csat_score'],

        # Business metrics
        'revenue_per_user': df['deals_closed'] * df['avg_deal_size'],
        'cost_savings': df['hours_saved'] * df['hourly_wage']
    }

    # Add interpretation guide
    interpretations = {
        'tickets_resolved': 'Additional tickets resolved per week due to Copilot',
        'case_resolution_time': 'Hours saved per case due to Copilot',
        'customer_satisfaction': 'CSAT point improvement due to Copilot',
        'revenue_per_user': 'Additional revenue per user per period due to Copilot'
    }

    return outcomes, interpretations

# Choose your outcome based on business priorities
outcomes, interpretations = productivity_metrics(df)
chosen_outcome = 'tickets_resolved'
print(f"Analyzing: {interpretations[chosen_outcome]}")
Identifying and Measuring Confounders
Confounders are variables that influence both treatment assignment and the outcome; for example, tenure may drive both who adopts Copilot and baseline productivity. Proper identification and measurement of confounders is crucial for causal inference.
Common Confounders in Copilot Analytics
def identify_potential_confounders():
    """
    Comprehensive list of potential confounders for Copilot analysis
    """
    confounders = {
        # Individual characteristics
        'demographic': [
            'tenure_months',
            'job_level',
            'department',
            'role_type',
            'education_level'
        ],
        # Work characteristics
        'work_context': [
            'team_size',
            'manager_span',
            'workload_intensity',
            'meeting_hours_per_week',
            'project_complexity'
        ],
        # Historical performance
        'baseline_performance': [
            'tickets_resolved_baseline',
            'performance_rating_last_period',
            'training_completed',
            'tool_adoption_history'
        ],
        # Technology context
        'technology': [
            'system_access_level',
            'other_tools_used',
            'technical_skill_level',
            'device_type'
        ]
    }
    return confounders

# Create confounder dataset
def prepare_confounders(df):
    """
    Prepare and validate confounder variables
    """
    confounders = identify_potential_confounders()

    # Check availability of confounders in data
    available_confounders = []
    missing_confounders = []

    for category, variables in confounders.items():
        for var in variables:
            if var in df.columns:
                available_confounders.append(var)
            else:
                missing_confounders.append(var)

    print(f"Available confounders: {len(available_confounders)}")
    print(f"Missing confounders: {len(missing_confounders)}")

    if missing_confounders:
        print(f"Consider collecting: {missing_confounders[:5]}...")

    return available_confounders

# Prepare confounder list
available_confounders = prepare_confounders(df)
Confounder Quality Assessment
def assess_confounder_quality(df, confounders, treatment_col, outcome_col):
    """
    Assess the quality and relevance of potential confounders
    """
    assessments = {}

    for confounder in confounders:
        if confounder not in df.columns:
            continue

        # Missing data rate
        missing_rate = df[confounder].isnull().mean()

        # Correlation with treatment and outcome
        if df[confounder].dtype in ['int64', 'float64']:
            treatment_corr = df[confounder].corr(df[treatment_col])
            outcome_corr = df[confounder].corr(df[outcome_col])
        else:
            # For categorical variables, use association measures instead
            # (see the Cramér's V sketch below)
            treatment_corr = "categorical"
            outcome_corr = "categorical"

        # Variance check
        if df[confounder].dtype in ['int64', 'float64']:
            has_variance = df[confounder].var() > 0
        else:
            has_variance = df[confounder].nunique() > 1

        assessments[confounder] = {
            'missing_rate': missing_rate,
            'treatment_correlation': treatment_corr,
            'outcome_correlation': outcome_corr,
            'has_variance': has_variance,
            'data_type': str(df[confounder].dtype)
        }

    return assessments

# Assess confounder quality
confounder_assessments = assess_confounder_quality(
    df, available_confounders, 'copilot_usage', 'tickets_per_week'
)

# Print assessment summary
for var, assessment in confounder_assessments.items():
    if assessment['missing_rate'] > 0.2:
        print(f"⚠️ {var}: High missing data ({assessment['missing_rate']:.1%})")
    elif not assessment['has_variance']:
        print(f"⚠️ {var}: No variance in data")
    else:
        print(f"✅ {var}: Good quality confounder")
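The assessment above only flags categorical confounders as "categorical" rather than scoring their association. One common way to fill that gap is Cramér's V; the following is a sketch, not part of the pipeline above, and assumes scipy is installed and that a categorical column such as 'department' exists in your data:
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V association between two categorical series (0 = none, 1 = perfect)."""
    contingency = pd.crosstab(x, y)
    chi2 = chi2_contingency(contingency)[0]
    n = contingency.to_numpy().sum()
    min_dim = min(contingency.shape) - 1
    return np.sqrt(chi2 / (n * min_dim)) if min_dim > 0 else np.nan

# Hypothetical usage: association of a categorical confounder with treatment
print(cramers_v(df['department'], df['copilot_usage']))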
Data Quality Checks and Validation
Comprehensive Data Quality Assessment
def comprehensive_data_quality_check(df, outcome_col, treatment_col, confounder_cols):
    """
    Comprehensive data quality assessment for causal inference
    """
    quality_report = {
        'sample_size': len(df),
        'time_periods': df['time_period'].nunique() if 'time_period' in df.columns else 1,
        'unique_users': df['user_id'].nunique() if 'user_id' in df.columns else len(df),
        'treatment_balance': df[treatment_col].mean(),
        'outcome_distribution': {
            'mean': df[outcome_col].mean(),
            'std': df[outcome_col].std(),
            'skewness': df[outcome_col].skew(),
            'outliers': ((df[outcome_col] - df[outcome_col].mean()).abs()
                         > 3 * df[outcome_col].std()).sum()
        },
        'missing_data': {
            'outcome': df[outcome_col].isnull().mean(),
            'treatment': df[treatment_col].isnull().mean(),
            'confounders': {col: df[col].isnull().mean()
                            for col in confounder_cols if col in df.columns}
        }
    }

    # Data quality warnings
    warnings = []
    if quality_report['sample_size'] < 500:
        warnings.append("Small sample size may lead to unreliable estimates")
    if quality_report['treatment_balance'] < 0.1 or quality_report['treatment_balance'] > 0.9:
        warnings.append("Imbalanced treatment assignment - consider different specification")
    if quality_report['outcome_distribution']['outliers'] > len(df) * 0.05:
        warnings.append("Many outliers detected - consider winsorization or robust methods")
    if any(rate > 0.1 for rate in quality_report['missing_data']['confounders'].values()):
        warnings.append("High missing data in confounders - consider imputation")

    quality_report['warnings'] = warnings
    return quality_report

# Run comprehensive quality check
quality_report = comprehensive_data_quality_check(
    df, 'tickets_per_week', 'copilot_usage', available_confounders
)

# Print quality report
print("=== DATA QUALITY REPORT ===")
print(f"Sample size: {quality_report['sample_size']}")
print(f"Treatment balance: {quality_report['treatment_balance']:.1%}")
print(f"Outcome mean: {quality_report['outcome_distribution']['mean']:.2f}")

if quality_report['warnings']:
    print("\n⚠️ WARNINGS:")
    for warning in quality_report['warnings']:
        print(f"  - {warning}")
Data Preprocessing for Causal Analysis
from scipy.stats import mstats

def preprocess_for_causal_analysis(df, outcome_col, treatment_col, confounder_cols):
    """
    Standardized preprocessing pipeline for causal inference
    """
    processed_df = df.copy()

    # 1. Handle missing data
    # Simple imputation for demonstration - use more sophisticated methods in practice
    for col in confounder_cols:
        if col in processed_df.columns:
            if processed_df[col].dtype in ['int64', 'float64']:
                processed_df[col] = processed_df[col].fillna(processed_df[col].median())
            else:
                processed_df[col] = processed_df[col].fillna(processed_df[col].mode().iloc[0])

    # 2. Outlier treatment (winsorization: clip the top and bottom 1%)
    for col in [outcome_col] + confounder_cols:
        if col in processed_df.columns and processed_df[col].dtype in ['int64', 'float64']:
            processed_df[col] = mstats.winsorize(processed_df[col], limits=[0.01, 0.01])

    # 3. Create categorical variables from continuous ones if needed
    if 'tenure_months' in processed_df.columns:
        processed_df['tenure_category'] = pd.cut(
            processed_df['tenure_months'],
            bins=[0, 12, 36, 60, np.inf],
            labels=['Junior', 'Mid', 'Senior', 'Expert']
        )

    # 4. Validate final dataset
    final_validation = validate_causal_data(processed_df)

    print("=== PREPROCESSING COMPLETE ===")
    print(f"Final sample size: {len(processed_df)}")
    print(f"Data quality checks passed: "
          f"{sum(final_validation['checks'].values())}/{len(final_validation['checks'])}")

    return processed_df, final_validation

# Preprocess data
processed_df, validation_results = preprocess_for_causal_analysis(
    df, 'tickets_per_week', 'copilot_usage', available_confounders
)
Creating Analysis-Ready Datasets
Standard Dataset Formats
def create_analysis_datasets(df, outcome_col, treatment_col, confounder_cols):
    """
    Create standardized datasets for different causal inference methods
    """
    datasets = {}

    # 1. Cross-sectional dataset (for regression, propensity scores)
    if 'time_period' in df.columns:
        # Take the latest period for each user
        datasets['cross_sectional'] = (
            df.sort_values('time_period')
              .groupby('user_id')
              .last()
              .reset_index()
        )
    else:
        datasets['cross_sectional'] = df.copy()

    # 2. Panel dataset (for DiD, fixed effects)
    if 'time_period' in df.columns and 'user_id' in df.columns:
        datasets['panel'] = df.copy()

        # Ensure balanced panel for DiD
        user_periods = df.groupby('user_id')['time_period'].nunique()
        complete_users = user_periods[user_periods == df['time_period'].nunique()].index
        datasets['panel_balanced'] = df[df['user_id'].isin(complete_users)].copy()

    # 3. Before/After dataset (for DiD with a clear intervention point)
    if 'time_period' in df.columns:
        # Assume intervention at the midpoint for demonstration
        intervention_date = df['time_period'].median()
        df_did = df.copy()
        df_did['post_treatment'] = (df_did['time_period'] >= intervention_date).astype(int)
        datasets['before_after'] = df_did

    # Print dataset summaries
    for name, dataset in datasets.items():
        n_users = dataset['user_id'].nunique() if 'user_id' in dataset.columns else 'N/A'
        print(f"{name}: {len(dataset)} observations, {n_users} users")

    return datasets

# Create analysis-ready datasets
analysis_datasets = create_analysis_datasets(
    processed_df, 'tickets_per_week', 'copilot_usage', available_confounders
)
Best Practices Summary
Data Quality Checklist
Before proceeding with causal analysis, ensure:
- ✅ Temporal ordering: Treatment occurs before outcome measurement
- ✅ Sufficient variation: Both treated and untreated observations exist
- ✅ Complete outcomes: Low missing data in key variables (<10%)
- ✅ Adequate sample size: At least 100 observations, preferably 500+
- ✅ Balanced treatment: Treatment prevalence between 10% and 90%
- ✅ Measured confounders: Key confounding variables are captured
- ✅ Data consistency: Variables measured consistently across time/groups
Common Pitfalls to Avoid
- Selection bias in data collection: Ensure representative sampling
- Measurement error: Validate that variables capture intended constructs
- Time-varying confounders: Account for confounders that change over time
- Spillover effects: Consider whether treatment affects control group
- Survivorship bias: Account for users who drop out of the analysis (a quick check is sketched below)
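Survivorship bias in particular can be checked directly from the panel itself. A minimal sketch, assuming the user_id / time_period / treatment layout used throughout this guide:
def check_attrition(df):
    """
    Compare dropout rates between ever-treated and never-treated users.
    A large gap suggests the retained sample is not representative.
    """
    n_periods = df['time_period'].nunique()
    per_user = df.groupby('user_id').agg(
        periods_observed=('time_period', 'nunique'),
        ever_treated=('treatment', 'max'),
    )
    per_user['dropped_out'] = per_user['periods_observed'] < n_periods
    return per_user.groupby('ever_treated')['dropped_out'].mean()

print(check_attrition(df))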
Next Steps
Once your data is prepared and validated:
- Choose your causal method based on data structure and research question
- Start with simple methods (regression adjustment) before moving to complex ones
- Always check assumptions specific to your chosen method
- Validate results through robustness testing and sensitivity analysis
Continue to: Regression Adjustment Method →
Proper data preparation is the foundation of reliable causal inference. Take time to understand your data structure and validate quality before proceeding with analysis.