Causal Inference Technical Implementation Guide
Overview
This technical guide provides detailed implementation instructions for applying causal inference methods to Copilot analytics data. It includes code examples, methodological details, and practical considerations for data scientists and analysts.
What you'll learn:
- How to structure and validate data for causal analysis
- Step-by-step implementation of five major causal inference methods
- How to interpret statistical outputs and translate them into business insights
- Best practices for robust analysis and common pitfall avoidance
For a conceptual introduction to causal inference in Copilot analytics, see our main guide.
Method-Specific Implementation Guides
Each causal inference method has its own detailed implementation guide with code examples, diagnostics, and interpretation guidance:
Data Preparation & Setup
Essential data validation, variable definition, and quality checks before running any causal analysis.
Method 1: Regression Adjustment
Linear regression-based causal inference with interaction effects and heterogeneous treatment analysis.
Method 2: Propensity Score Methods
Propensity score estimation, matching, and inverse probability weighting for balanced comparisons.
Method 3: Difference-in-Differences
Temporal analysis using before/after comparisons with parallel trends testing.
Method 4: Instrumental Variables
Two-stage least squares and instrumental variable analysis for unobserved confounding.
Method 5: Doubly Robust Methods
Machine learning enhanced causal inference with protection against model misspecification.
Validation & Robustness Testing
Sensitivity analysis, placebo tests, and robustness checks for causal estimates.
Quick Method Selection Guide
| Method | Best For | Key Assumption | Data Requirements |
|---|---|---|---|
| Regression Adjustment | Few confounders, linear relationships | No unmeasured confounders | Cross-sectional or panel |
| Propensity Scores | Many confounders, binary treatment | No unmeasured confounders | Cross-sectional or panel |
| Difference-in-Differences | Policy interventions, natural experiments | Parallel trends | Panel data with pre/post periods |
| Instrumental Variables | Unobserved confounding suspected | Valid instrument exists | Cross-sectional or panel |
| Doubly Robust | Complex relationships, many variables | Either outcome or PS model correct | Large sample size |
Understanding the Causal Analysis Workflow
Before diving into specific methods, it's important to understand the general workflow for causal analysis:
- Problem Definition: Clearly articulate your causal question
- Data Preparation: Structure and validate your dataset
- Method Selection: Choose the appropriate causal inference technique
- Assumption Checking: Verify that method assumptions are met
- Implementation: Run the analysis with proper validation
- Results Interpretation: Translate statistical outputs into business insights
- Robustness Testing: Validate findings through sensitivity analysis
Each method we'll cover follows this general pattern, but with different technical requirements and assumptions.
Getting Started
- Start with Data Preparation to ensure your dataset meets causal analysis requirements
- Choose your method based on your research question and data characteristics
- Follow the method-specific guide for detailed implementation
- Validate your results using robustness testing techniques
- Interpret and communicate findings using business translation frameworks
Resources and Libraries
Essential Python Libraries
# Core causal inference libraries
pip install dowhy causalml econml causallib
# Statistical analysis
pip install statsmodels linearmodels
# Machine learning and data manipulation
pip install scikit-learn pandas numpy matplotlib seaborn plotly
R Libraries
# Core causal inference packages
install.packages(c("causalTree", "grf", "MatchIt", "WeightIt"))
# Panel data and econometrics
install.packages(c("plm", "fixest", "did"))
# Data manipulation and visualization
install.packages(c("dplyr", "ggplot2", "tidyr"))
This guide provides the foundation for implementing rigorous causal inference methods with Copilot data. Always validate assumptions, check robustness, and interpret results in business context.
Pre-Analysis Setup
Understanding Your Data Requirements
Successful causal inference starts with properly structured data. Unlike simple correlation analysis, causal methods require specific data characteristics and measurement approaches.
The Gold Standard: Panel Data Structure
The ideal dataset for causal analysis is panel data - observations of the same individuals over multiple time periods. This structure allows us to observe changes within individuals and control for time-invariant characteristics.
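To make the snippets in this guide easy to try end to end, here is a minimal, hedged sketch that generates a small synthetic panel using the column names assumed throughout (user_id, time_period, copilot_usage, tickets_per_week, and a few confounders). It is illustrative only; replace it with your actual Copilot analytics extract.

```python
# Hedged sketch: synthetic panel data so the later snippets have a concrete df.
# All column names and effect sizes here are illustrative assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_users, n_periods = 200, 8
periods = pd.date_range('2024-01-01', periods=n_periods, freq='W')

panel = pd.DataFrame({
    'user_id': np.repeat([f'u{i:03d}' for i in range(n_users)], n_periods),
    'time_period': np.tile(periods, n_users),
    'tenure_months': np.repeat(rng.integers(3, 120, n_users), n_periods),
    'job_level': np.repeat(rng.integers(1, 6, n_users), n_periods),
    'team_size': np.repeat(rng.integers(4, 15, n_users), n_periods),
    'manager_span': np.repeat(rng.integers(2, 12, n_users), n_periods),
})

# Treatment probability depends on job level, creating realistic confounding
p_treat = 1 / (1 + np.exp(-0.5 * (panel['job_level'] - 3)))
panel['copilot_usage'] = rng.binomial(1, p_treat)

# Outcome with a built-in treatment effect of roughly +2 tickets/week
panel['tickets_per_week'] = (
    10 + 0.02 * panel['tenure_months'] + 0.5 * panel['job_level']
    + 2.0 * panel['copilot_usage'] + rng.normal(0, 2, len(panel))
)
df = panel
```

With a dataset in this shape, the validation function below can be applied directly.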
Before implementing any causal inference method, ensure your data meets these requirements:
import pandas as pd
import numpy as np
# Required data structure for causal analysis
required_columns = {
'user_id': 'str', # Unique identifier for each user
'time_period': 'datetime', # Time dimension (weekly/monthly)
'treatment': 'int', # Copilot usage (0/1 or continuous)
'outcome': 'float', # Productivity metric
'confounders': 'various' # Job role, tenure, team size, etc.
}
# Example data validation function
def validate_causal_data(df):
"""
Comprehensive validation of data quality for causal inference
Returns a dictionary of checks and recommendations
"""
checks = {
'temporal_ordering': df['time_period'].is_monotonic_increasing,
'treatment_variation': df['treatment'].nunique() > 1,
'outcome_completeness': df['outcome'].notna().mean() > 0.8,
'sufficient_sample': len(df) > 100,
'panel_balance': len(df['user_id'].unique()) * len(df['time_period'].unique()) == len(df)
}
# Generate recommendations based on failed checks
recommendations = []
if not checks['temporal_ordering']:
recommendations.append("Sort data by time_period to ensure proper temporal ordering")
if not checks['treatment_variation']:
recommendations.append("Treatment variable shows no variation - check data quality")
if not checks['outcome_completeness']:
recommendations.append("High missing data in outcome - consider imputation or data cleaning")
if not checks['sufficient_sample']:
recommendations.append("Sample size may be too small for reliable causal inference")
if not checks['panel_balance']:
recommendations.append("Unbalanced panel - some users missing observations in some periods")
return {'checks': checks, 'recommendations': recommendations}
# Run validation
validation_results = validate_causal_data(df)
print("Data Quality Checks:", validation_results['checks'])
if validation_results['recommendations']:
print("Recommendations:", validation_results['recommendations'])
Understanding the Output:
- temporal_ordering: Ensures time flows correctly (critical for causal ordering)
- treatment_variation: Confirms we have both treated and untreated observations
- outcome_completeness: High missing data can bias results
- sufficient_sample: Small samples lead to unreliable estimates
- panel_balance: Balanced panels enable stronger causal identification
Defining Treatment and Outcome Variables
Treatment Variable Design
How you define your treatment variable fundamentally affects your causal interpretation. Consider these different approaches:
# Treatment variable definitions
def define_treatment_binary(df, threshold=1):
"""
Binary treatment: Used Copilot (1) vs Not Used (0)
This is the simplest approach - users either used Copilot or they didn't.
Good for: Initial adoption analysis, policy evaluation
Interpretation: "The effect of any Copilot usage vs none"
"""
return (df['copilot_actions'] >= threshold).astype(int)
def define_treatment_continuous(df):
"""
Continuous treatment: Number of Copilot actions per week
Captures intensity of usage, not just adoption.
Good for: Understanding dose-response relationships
Interpretation: "The effect of each additional Copilot action per week"
"""
return df['copilot_actions_per_week']
def define_treatment_categorical(df):
"""
Categorical treatment: Usage intensity levels
Balances simplicity with intensity measurement.
Good for: Identifying thresholds, communicating to stakeholders
Interpretation: "The effect of different usage intensities"
"""
return pd.cut(df['copilot_actions'],
bins=[0, 1, 10, 50, np.inf],
labels=['None', 'Light', 'Moderate', 'Heavy'])
# Example of how choice affects interpretation
df['treatment_binary'] = define_treatment_binary(df)
df['treatment_continuous'] = define_treatment_continuous(df)
df['treatment_categorical'] = define_treatment_categorical(df)
print("Treatment variable distributions:")
print("Binary:", df['treatment_binary'].value_counts())
print("Continuous:", df['treatment_continuous'].describe())
print("Categorical:", df['treatment_categorical'].value_counts())
Outcome Variable Selection
Your outcome choice determines what causal effect you're measuring:
def productivity_metrics(df):
"""
Various productivity outcome measures with different interpretations
"""
outcomes = {
# Volume metrics
'tickets_resolved': df['tickets_closed'] / df['time_period_weeks'],
'emails_sent': df['emails_sent'] / df['time_period_weeks'],
# Quality metrics
'case_resolution_time': df['avg_case_hours'],
'customer_satisfaction': df['csat_score'],
# Business metrics
'revenue_per_user': df['deals_closed'] * df['avg_deal_size'],
'cost_savings': df['hours_saved'] * df['hourly_wage']
}
# Add interpretation guide
interpretations = {
'tickets_resolved': 'Additional tickets resolved per week due to Copilot',
'case_resolution_time': 'Hours saved per case due to Copilot',
'customer_satisfaction': 'CSAT point improvement due to Copilot',
'revenue_per_user': 'Additional revenue per user per period due to Copilot'
}
return outcomes, interpretations
# Choose your outcome based on business priorities
outcomes, interpretations = productivity_metrics(df)
chosen_outcome = 'tickets_resolved'
print(f"Analyzing: {interpretations[chosen_outcome]}")
Method 1: Regression Adjustment
Conceptual Foundation
Regression adjustment is the most straightforward causal inference method. It estimates the treatment effect by controlling for confounding variables through linear regression. The key insight is that if you can measure and control for all variables that influence both treatment and outcome, the remaining correlation between treatment and outcome represents the causal effect.
When to use: When you have a good understanding of confounders and they're mostly captured in your data.
Assumption: No unmeasured confounders (strong assumption!)
Basic Implementation
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler
def regression_adjustment(df, outcome_col, treatment_col, confounder_cols):
"""
Simple regression adjustment for causal inference
This function implements the fundamental equation:
Y = α + β*Treatment + γ₁*X₁ + γ₂*X₂ + ... + ε
The coefficient β represents our causal estimate of treatment effect.
"""
# Prepare formula
confounders_str = ' + '.join(confounder_cols)
formula = f'{outcome_col} ~ {treatment_col} + {confounders_str}'
print(f"Running regression: {formula}")
# Fit model
model = smf.ols(formula, data=df).fit()
# Extract treatment effect
treatment_effect = model.params[treatment_col]
confidence_interval = model.conf_int().loc[treatment_col]
results = {
'ate': treatment_effect,
'ci_lower': confidence_interval[0],
'ci_upper': confidence_interval[1],
'p_value': model.pvalues[treatment_col],
'r_squared': model.rsquared,
'model': model
}
return results
# Example usage with interpretation
confounders = ['tenure_months', 'job_level', 'team_size', 'manager_span']
result = regression_adjustment(df, 'tickets_per_week', 'copilot_usage', confounders)
print(f"\n=== CAUSAL EFFECT ESTIMATE ===")
print(f"Treatment Effect: {result['ate']:.2f} tickets per week")
print(f"95% Confidence Interval: ({result['ci_lower']:.2f}, {result['ci_upper']:.2f})")
print(f"P-value: {result['p_value']:.4f}")
print(f"Model R²: {result['r_squared']:.3f}")
# Interpret the results
if result['p_value'] < 0.05:
print("\n✅ SIGNIFICANT EFFECT DETECTED")
if result['ate'] > 0:
print(f"Copilot usage INCREASES productivity by {result['ate']:.2f} tickets/week")
else:
print(f"Copilot usage DECREASES productivity by {abs(result['ate']):.2f} tickets/week")
else:
print("\n❌ NO SIGNIFICANT EFFECT DETECTED")
print(f"Cannot conclude that Copilot has a causal effect on productivity")
Understanding the Output:
- Treatment Effect (ATE): The average causal effect across all users
- Confidence Interval: Range of plausible true effects (95% confidence)
- P-value: Probability of seeing this effect if true effect were zero
- R²: How much variation the model explains (model fit quality)
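One practical refinement: with panel data, repeated observations from the same user are correlated, and plain OLS standard errors tend to be too narrow. Below is a hedged sketch of cluster-robust standard errors, clustering on the user_id column assumed earlier; the outcome and confounder names follow the example above.

```python
# Hedged sketch: re-fit the same regression with standard errors clustered by
# user, accounting for repeated observations per user in panel data.
# Assumes no missing rows; otherwise align the groups with the model's rows.
import statsmodels.formula.api as smf

formula = ('tickets_per_week ~ copilot_usage + tenure_months + job_level '
           '+ team_size + manager_span')
clustered_model = smf.ols(formula, data=df).fit(
    cov_type='cluster',
    cov_kwds={'groups': df['user_id']}
)
print(clustered_model.summary().tables[1])  # coefficients with clustered SEs
```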
Advanced Regression with Interaction Effects
Sometimes the treatment effect differs across subgroups. Interaction terms help us identify these heterogeneous effects - crucial for targeted deployment strategies.
def regression_with_interactions(df, outcome_col, treatment_col, confounder_cols, interaction_vars=None):
"""
Regression adjustment with interaction terms for heterogeneous effects
Interaction terms allow the treatment effect to vary by subgroup:
Y = α + β₁*Treatment + β₂*Group + β₃*(Treatment × Group) + controls
The interaction coefficient β₃ tells us how the treatment effect
differs between groups.
"""
base_formula = f'{outcome_col} ~ {treatment_col} + {" + ".join(confounder_cols)}'
# Add interaction terms
if interaction_vars:
interactions = [f'{treatment_col}:{var}' for var in interaction_vars]
formula = base_formula + ' + ' + ' + '.join(interactions)
else:
formula = base_formula
print(f"Running interaction model: {formula}")
model = smf.ols(formula, data=df).fit()
# Calculate marginal effects for different groups
effects = {}
main_effect = model.params[treatment_col]
if interaction_vars:
for var in interaction_vars:
interaction_term = f'{treatment_col}:{var}'
if interaction_term in model.params:
interaction_coeff = model.params[interaction_term]
# Calculate effects for each level of the interaction variable
for level in df[var].unique():
if pd.api.types.is_numeric_dtype(df[var]):
# For continuous variables, show effect at different values
marginal_effect = main_effect + (interaction_coeff * level)
effects[f'{var}_{level}'] = marginal_effect
else:
# For categorical variables
if level == df[var].mode().iloc[0]: # Reference category
effects[f'{var}_{level}'] = main_effect
else:
# This is simplified - proper categorical interactions need dummy encoding
effects[f'{var}_{level}'] = main_effect + interaction_coeff
return model, effects
# Example: Different effects by job level
model, effects = regression_with_interactions(
df, 'tickets_per_week', 'copilot_usage',
confounders, interaction_vars=['job_level']
)
print(f"\n=== HETEROGENEOUS TREATMENT EFFECTS ===")
for group, effect in effects.items():
print(f"{group}: {effect:.2f} tickets/week")
# Interpret interaction effects
print(f"\n=== INTERACTION INTERPRETATION ===")
print("These results show how Copilot's effect varies by job level.")
print("Use this to target deployment to groups with highest returns.")
Business Value of Interaction Analysis:
- Targeted Deployment: Focus Copilot rollout on highest-impact groups
- Training Customization: Adjust training programs for different user types
- ROI Optimization: Allocate licenses where they generate most value
Method 2: Propensity Score Methods
Propensity score methods address selection bias by balancing groups on observed characteristics. Instead of controlling for confounders directly in the outcome model, we match treated and control units with similar probabilities of receiving treatment.
Key Advantages:
- Dimension Reduction: Summarizes many confounders into a single score
- Transparency: Shows which units are truly comparable
- Assumption Testing: Forces explicit consideration of overlap
When to Use Propensity Scores:
- High-dimensional confounding (many covariates)
- Need to demonstrate comparable groups to stakeholders
- Concerns about model specification in outcome analysis
Propensity Score Estimation
The propensity score e(x) = P(Treatment = 1 | X = x) is the probability of treatment given observed characteristics. A well-estimated propensity score should achieve balance and have sufficient overlap between groups.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
def estimate_propensity_scores(df, treatment_col, confounder_cols, method='logistic'):
"""
Estimate propensity scores with comprehensive diagnostics
The propensity score summarizes the probability of treatment assignment.
Good propensity score models should:
1. Achieve covariate balance between groups
2. Have sufficient overlap (common support)
3. Be well-calibrated
"""
X = df[confounder_cols]
y = df[treatment_col]
print(f"=== PROPENSITY SCORE ESTIMATION ===")
print(f"Method: {method}")
print(f"Sample size: {len(df)} ({y.sum()} treated, {len(y) - y.sum()} control)")
print(f"Covariates: {len(confounder_cols)}")
if method == 'logistic':
# Standardize features for logistic regression
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_scaled, y)
propensity_scores = model.predict_proba(X_scaled)[:, 1]
# Model performance
cv_scores = cross_val_score(model, X_scaled, y, cv=5, scoring='roc_auc')
print(f"Cross-validation AUC: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
elif method == 'random_forest':
# Random forest for complex interactions
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
propensity_scores = model.predict_proba(X)[:, 1]
# Feature importance
feature_importance = pd.DataFrame({
'feature': confounder_cols,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print(f"Top predictors: {', '.join(feature_importance.head(3)['feature'].tolist())}")
# Add scores to dataframe
df_with_ps = df.copy()
df_with_ps['propensity_score'] = propensity_scores
# Comprehensive overlap diagnostics
overlap_diagnostics = check_overlap_detailed(propensity_scores, y)
return df_with_ps, model, overlap_diagnostics
def check_overlap_detailed(ps, treatment):
"""
Comprehensive overlap diagnostics for propensity scores
Poor overlap indicates limited comparability between groups.
"""
treated_ps = ps[treatment == 1]
control_ps = ps[treatment == 0]
print(f"\n=== OVERLAP DIAGNOSTICS ===")
print(f"Treated group: min={treated_ps.min():.3f}, max={treated_ps.max():.3f}, mean={treated_ps.mean():.3f}")
print(f"Control group: min={control_ps.min():.3f}, max={control_ps.max():.3f}, mean={control_ps.mean():.3f}")
# Common support region
common_min = max(treated_ps.min(), control_ps.min())
common_max = min(treated_ps.max(), control_ps.max())
print(f"Common support: [{common_min:.3f}, {common_max:.3f}]")
# Overlap quality metrics
in_common_support = (ps >= common_min) & (ps <= common_max)
overlap_rate = in_common_support.mean()
print(f"Proportion in common support: {overlap_rate:.1%}")
# Extreme propensity scores (problematic regions)
extreme_high = (ps > 0.9).sum()
extreme_low = (ps < 0.1).sum()
print(f"Extreme scores: {extreme_high} above 0.9, {extreme_low} below 0.1")
# Balance assessment
overlap_stats = {
'treated_range': (treated_ps.min(), treated_ps.max()),
'control_range': (control_ps.min(), control_ps.max()),
'common_support': (common_min, common_max),
'overlap_rate': overlap_rate,
'extreme_count': extreme_high + extreme_low
}
# Warnings for poor overlap
if overlap_rate < 0.8:
print("⚠️ Warning: Poor overlap - less than 80% of units in common support")
if extreme_high + extreme_low > len(ps) * 0.05:
print("⚠️ Warning: Many extreme propensity scores suggest poor model fit")
if common_max - common_min < 0.6:
print("⚠️ Warning: Narrow common support region")
return overlap_stats
# Estimate propensity scores with diagnostics
df_with_ps, ps_model, overlap_stats = estimate_propensity_scores(
df, 'copilot_usage', confounders, method='logistic'
)
Interpreting Propensity Score Quality:
- AUC 0.6-0.8: Good discrimination, sufficient variation in treatment probability
- AUC > 0.9: Excellent prediction but may indicate poor overlap
- Common Support: Should include most observations (>80%)
- Extreme Scores: Values near 0 or 1 suggest deterministic assignment
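These diagnostics are easier to judge visually. A minimal sketch that overlays the propensity score distributions of the two groups, reusing df_with_ps from the estimation step above:

```python
# Hedged sketch: histogram overlay of propensity scores by treatment group.
import matplotlib.pyplot as plt

treated_ps = df_with_ps.loc[df_with_ps['copilot_usage'] == 1, 'propensity_score']
control_ps = df_with_ps.loc[df_with_ps['copilot_usage'] == 0, 'propensity_score']

plt.figure(figsize=(8, 4))
plt.hist(control_ps, bins=30, alpha=0.5, density=True, label='Control')
plt.hist(treated_ps, bins=30, alpha=0.5, density=True, label='Treated')
plt.xlabel('Propensity score')
plt.ylabel('Density')
plt.title('Propensity Score Overlap by Group')
plt.legend()
plt.show()
```

Substantial overlap between the two histograms supports comparability; long non-overlapping tails flag regions where causal comparisons are unreliable.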
Propensity Score Matching
Once we have good propensity scores, we match treated units to similar control units. This creates a balanced dataset where treatment assignment appears "as good as random" within matched pairs.
from scipy.spatial.distance import cdist
import warnings
def propensity_score_matching(df, ps_col, treatment_col, outcome_col, ratio=1, caliper=0.1):
"""
Nearest neighbor propensity score matching with comprehensive analysis
Matching Process:
1. For each treated unit, find nearest control unit(s)
2. Apply caliper constraint to ensure good matches
3. Calculate treatment effect from matched pairs
Parameters:
- ratio: Number of control units to match to each treated unit
- caliper: Maximum allowed difference in propensity scores
"""
treated = df[df[treatment_col] == 1].copy()
control = df[df[treatment_col] == 0].copy()
print(f"=== PROPENSITY SCORE MATCHING ===")
print(f"Treated units: {len(treated)}")
print(f"Control units: {len(control)}")
print(f"Matching ratio: 1:{ratio}")
print(f"Caliper: {caliper}")
matched_pairs = []
unmatched_treated = []
used_controls = set()
for _, treated_unit in treated.iterrows():
# Calculate distances to all unused control units
available_controls = control[~control.index.isin(used_controls)]
if len(available_controls) == 0:
unmatched_treated.append(treated_unit.name)
continue
distances = np.abs(available_controls[ps_col] - treated_unit[ps_col])
# Apply caliper constraint
valid_matches = distances <= caliper
if not valid_matches.any():
unmatched_treated.append(treated_unit.name)
continue
# Find closest matches
n_matches = min(ratio, valid_matches.sum())
closest_indices = distances[valid_matches].nsmallest(n_matches).index
for idx in closest_indices:
matched_pairs.append({
'treated_id': treated_unit.name,
'control_id': idx,
'treated_outcome': treated_unit[outcome_col],
'control_outcome': control.loc[idx, outcome_col],
'treated_ps': treated_unit[ps_col],
'control_ps': control.loc[idx, ps_col],
'ps_distance': distances[idx]
})
used_controls.add(idx)
# Create matched dataset analysis
matched_df = pd.DataFrame(matched_pairs)
if len(matched_df) == 0:
print("❌ No valid matches found - consider relaxing caliper or improving propensity score model")
return None
# Matching quality assessment
print(f"\n=== MATCHING RESULTS ===")
print(f"Successful matches: {len(matched_df)}")
print(f"Unmatched treated units: {len(unmatched_treated)}")
print(f"Match rate: {len(matched_df)/len(treated):.1%}")
print(f"Average PS distance: {matched_df['ps_distance'].mean():.4f}")
print(f"Max PS distance: {matched_df['ps_distance'].max():.4f}")
# Calculate treatment effect
individual_effects = matched_df['treated_outcome'] - matched_df['control_outcome']
ate = individual_effects.mean()
ate_se = individual_effects.std() / np.sqrt(len(individual_effects))
print(f"\n=== TREATMENT EFFECT ===")
print(f"Average Treatment Effect: {ate:.3f}")
print(f"Standard Error: {ate_se:.3f}")
print(f"95% CI: [{ate - 1.96*ate_se:.3f}, {ate + 1.96*ate_se:.3f}]")
# Effect size interpretation
outcome_std = df[outcome_col].std()
cohens_d = ate / outcome_std
print(f"Effect size (Cohen's d): {cohens_d:.3f}")
if abs(cohens_d) < 0.2:
effect_size = "small"
elif abs(cohens_d) < 0.5:
effect_size = "medium"
else:
effect_size = "large"
print(f"Effect magnitude: {effect_size}")
# Create matched dataset for further analysis
matched_treated = df.loc[matched_df['treated_id']]
matched_control = df.loc[matched_df['control_id']]
matched_dataset = pd.concat([matched_treated, matched_control])
results = {
'ate': ate,
'se': ate_se,
'matched_pairs': matched_df,
'matched_dataset': matched_dataset,
'unmatched_treated': unmatched_treated,
'match_quality': {
'match_rate': len(matched_df)/len(treated),
'avg_distance': matched_df['ps_distance'].mean(),
'max_distance': matched_df['ps_distance'].max()
}
}
return results
# Perform matching analysis
matching_results = propensity_score_matching(
df_with_ps, 'propensity_score', 'copilot_usage', 'tickets_per_week',
ratio=1, caliper=0.05
)
if matching_results:
print(f"\n=== BUSINESS INTERPRETATION ===")
effect = matching_results['ate']
if effect > 0:
print(f"Copilot users complete {effect:.1f} more tickets per week")
print(f"Monthly impact: ~{effect * 4:.1f} additional tickets per user")
else:
print(f"Copilot users complete {abs(effect):.1f} fewer tickets per week")
print("Consider investigating implementation or training issues")
Key Matching Diagnostics:
- Match Rate: Percentage of treated units successfully matched (aim for >80%)
- Caliper Choice: Balance between match quality and sample size
- PS Distance: Average distance should be small (< 0.05)
- Effect Size: Cohen's d helps interpret practical significance
Propensity Score Weighting (IPW)
Instead of matching, we can weight observations by their inverse propensity scores. This uses all data while balancing the groups - treated units get weight 1/e(x) and controls get weight 1/(1-e(x)).
def inverse_propensity_weighting(df, ps_col, treatment_col, outcome_col, trim_weights=True):
"""
Inverse Propensity Weighting (IPW) for causal inference
IPW reweights the sample to create balance between groups:
- Treated units: weight = 1 / propensity_score
- Control units: weight = 1 / (1 - propensity_score)
This creates a "pseudo-population" where treatment assignment
is independent of confounders.
"""
df_ipw = df.copy()
# Calculate IPW weights
weights = np.where(
df[treatment_col] == 1,
1 / df[ps_col], # Treated units
1 / (1 - df[ps_col]) # Control units
)
df_ipw['ipw_weight'] = weights
print(f"=== INVERSE PROPENSITY WEIGHTING ===")
print(f"Sample size: {len(df_ipw)}")
print(f"Weight statistics:")
print(f" Mean: {weights.mean():.2f}")
print(f" Std: {weights.std():.2f}")
print(f" Min: {weights.min():.2f}")
print(f" Max: {weights.max():.2f}")
# Check for extreme weights
extreme_threshold = 10
extreme_weights = (weights > extreme_threshold).sum()
if extreme_weights > 0:
print(f"⚠️ {extreme_weights} observations have weights > {extreme_threshold}")
print(" Consider trimming or improving propensity score model")
# Optional weight trimming
if trim_weights:
# Trim weights at 1st and 99th percentiles
lower_bound = np.percentile(weights, 1)
upper_bound = np.percentile(weights, 99)
weights_trimmed = np.clip(weights, lower_bound, upper_bound)
df_ipw['ipw_weight_trimmed'] = weights_trimmed
print(f"\nWeight trimming:")
print(f" Trimmed at [{lower_bound:.2f}, {upper_bound:.2f}]")
print(f" {(weights != weights_trimmed).sum()} weights modified")
# Use trimmed weights for analysis
analysis_weights = weights_trimmed
weight_col = 'ipw_weight_trimmed'
else:
analysis_weights = weights
weight_col = 'ipw_weight'
# Calculate weighted treatment effect
treated_outcome = df_ipw[df_ipw[treatment_col] == 1][outcome_col]
control_outcome = df_ipw[df_ipw[treatment_col] == 0][outcome_col]
treated_weights = df_ipw[df_ipw[treatment_col] == 1][weight_col]
control_weights = df_ipw[df_ipw[treatment_col] == 0][weight_col]
# Weighted means
weighted_treated_mean = np.average(treated_outcome, weights=treated_weights)
weighted_control_mean = np.average(control_outcome, weights=control_weights)
ate_ipw = weighted_treated_mean - weighted_control_mean
# Robust standard errors (simplified)
# For production use, consider sandwich estimators
treated_var = np.average((treated_outcome - weighted_treated_mean)**2, weights=treated_weights)
control_var = np.average((control_outcome - weighted_control_mean)**2, weights=control_weights)
se_treated = np.sqrt(treated_var / len(treated_outcome))
se_control = np.sqrt(control_var / len(control_outcome))
se_ate = np.sqrt(se_treated**2 + se_control**2)
print(f"\n=== IPW TREATMENT EFFECT ===")
print(f"Weighted treated mean: {weighted_treated_mean:.3f}")
print(f"Weighted control mean: {weighted_control_mean:.3f}")
print(f"Average Treatment Effect: {ate_ipw:.3f}")
print(f"Standard Error: {se_ate:.3f}")
print(f"95% CI: [{ate_ipw - 1.96*se_ate:.3f}, {ate_ipw + 1.96*se_ate:.3f}]")
return {
'ate': ate_ipw,
'se': se_ate,
'treated_mean': weighted_treated_mean,
'control_mean': weighted_control_mean,
'weights': df_ipw[weight_col],
'effective_n': len(df_ipw)
}
# Perform IPW analysis
ipw_results = inverse_propensity_weighting(
df_with_ps, 'propensity_score', 'copilot_usage', 'tickets_per_week',
trim_weights=True
)
IPW Advantages vs. Matching:
- Efficiency: Uses entire sample, more statistical power
- Transparency: Clear weighting interpretation
- Flexibility: Easier to incorporate into complex models
IPW Diagnostics to Monitor:
- Weight Distribution: Extreme weights indicate poor overlap
- Effective Sample Size: Heavy weighting reduces effective N
- Balance Check: Weighted covariates should be balanced
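The balance check can be made concrete with standardized mean differences (SMD). Below is a hedged sketch that reuses df_with_ps, confounders, and ipw_results from the earlier steps; an absolute SMD below roughly 0.1 is a common rule of thumb for adequate balance.

```python
# Hedged sketch: standardized mean differences before vs. after IPW weighting.
def standardized_mean_diff(x, treatment, weights=None):
    """SMD between treated and control groups, optionally weighted."""
    if weights is None:
        weights = np.ones(len(x))
    t_mask, c_mask = treatment == 1, treatment == 0
    t_mean = np.average(x[t_mask], weights=weights[t_mask])
    c_mean = np.average(x[c_mask], weights=weights[c_mask])
    pooled_sd = np.sqrt((x[t_mask].var() + x[c_mask].var()) / 2)  # unweighted pooled SD
    return (t_mean - c_mean) / pooled_sd if pooled_sd > 0 else 0.0

for cov in confounders:
    raw_smd = standardized_mean_diff(df_with_ps[cov].values,
                                     df_with_ps['copilot_usage'].values)
    wtd_smd = standardized_mean_diff(df_with_ps[cov].values,
                                     df_with_ps['copilot_usage'].values,
                                     weights=ipw_results['weights'].values)
    print(f"{cov:>15}: SMD unweighted={raw_smd:+.3f}, weighted={wtd_smd:+.3f}")
```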
Method 3: Difference-in-Differences
Difference-in-differences (DiD) exploits temporal variation to identify causal effects. It compares changes over time between treatment and control groups, controlling for time-invariant confounders through the "double difference."
DiD Formula:
Effect = (Treatment_After - Treatment_Before) - (Control_After - Control_Before)
Key Assumptions:
- Parallel Trends: Without treatment, both groups would follow similar trends
- No Spillovers: Treatment doesn't affect control group
- Stable Composition: The makeup of the treatment and control groups doesn't change over time
When to Use DiD:
- Natural experiments with clear before/after periods
- Policy rollouts with staggered implementation
- A/B tests where you can observe pre-treatment periods
Basic Difference-in-Differences
def difference_in_differences(df, outcome_col, treatment_col, time_col, unit_col):
"""
Canonical difference-in-differences estimation
This method compares the change in outcomes over time between
treatment and control groups. The key insight is that time-invariant
confounders are differenced out.
Model: Y_it = α + β*Treatment_i + γ*Post_t + δ*(Treatment_i × Post_t) + ε_it
Where δ is the DiD treatment effect.
"""
print(f"=== DIFFERENCE-IN-DIFFERENCES ANALYSIS ===")
# Create post-treatment indicator
pre_period = df[df[time_col] == 'pre'].copy()
post_period = df[df[time_col] == 'post'].copy()
print(f"Pre-period observations: {len(pre_period)}")
print(f"Post-period observations: {len(post_period)}")
# Calculate group means by period
group_means = df.groupby([treatment_col, time_col])[outcome_col].agg(['mean', 'count', 'std']).round(3)
print(f"\n=== GROUP MEANS BY PERIOD ===")
print(group_means)
# Extract the four key quantities for DiD
try:
control_pre = group_means.loc[(0, 'pre'), 'mean']
control_post = group_means.loc[(0, 'post'), 'mean']
treated_pre = group_means.loc[(1, 'pre'), 'mean']
treated_post = group_means.loc[(1, 'post'), 'mean']
# Calculate differences
control_change = control_post - control_pre
treated_change = treated_post - treated_pre
did_effect = treated_change - control_change
print(f"\n=== DIFFERENCE-IN-DIFFERENCES CALCULATION ===")
print(f"Control group change: {control_change:.3f}")
print(f"Treatment group change: {treated_change:.3f}")
print(f"DiD Effect: {did_effect:.3f}")
# Interpret the effect
if did_effect > 0:
print(f"\nInterpretation: Treatment caused a {did_effect:.3f} unit increase")
print(f"This represents the effect above what would have happened anyway")
else:
print(f"\nInterpretation: Treatment caused a {abs(did_effect):.3f} unit decrease")
except KeyError as e:
print(f"Error: Missing group in data - {e}")
return None
# Regression-based DiD for standard errors
df_reg = df.copy()
df_reg['post'] = (df_reg[time_col] == 'post').astype(int)
df_reg['interaction'] = df_reg[treatment_col] * df_reg['post']
# Run DiD regression: Y = α + β*Treatment + γ*Post + δ*(Treatment×Post)
did_formula = f"{outcome_col} ~ {treatment_col} + post + interaction"
did_model = smf.ols(did_formula, data=df_reg).fit()
print(f"\n=== REGRESSION RESULTS ===")
print(f"DiD Coefficient: {did_model.params['interaction']:.3f}")
print(f"Standard Error: {did_model.bse['interaction']:.3f}")
print(f"P-value: {did_model.pvalues['interaction']:.3f}")
print(f"95% CI: [{did_model.conf_int().loc['interaction', 0]:.3f}, "
f"{did_model.conf_int().loc['interaction', 1]:.3f}]")
# Effect size
outcome_std = df[outcome_col].std()
cohens_d = did_effect / outcome_std
print(f"Effect size (Cohen's d): {cohens_d:.3f}")
results = {
'did_effect': did_effect,
'model': did_model,
'group_means': group_means,
'control_change': control_change,
'treated_change': treated_change
}
return results
# Example: Copilot rollout analysis
# Assumes we have data from before and after Copilot deployment
did_results = difference_in_differences(
df_longitudinal, 'tickets_per_week', 'copilot_group', 'period', 'employee_id'
)
Business Value of DiD:
- Causal Clarity: Clear before/after comparison removes many confounders
- Policy Evaluation: Perfect for evaluating intervention impacts
- Trend Analysis: Shows whether effects persist over time
Testing Parallel Trends Assumption
The parallel trends assumption is crucial for DiD validity. We test this by examining whether treatment and control groups had similar trends before the intervention.
def test_parallel_trends(df, outcome_col, treatment_col, time_col, pre_periods):
"""
Test parallel trends assumption using pre-treatment data
Strategy:
1. Use only pre-treatment periods
2. Test for differential time trends between groups
3. Visualize trends and test statistically
"""
# Filter to pre-treatment periods only
pre_data = df[df[time_col].isin(pre_periods)].copy()
print(f"=== PARALLEL TRENDS TEST ===")
print(f"Pre-treatment periods: {pre_periods}")
print(f"Pre-treatment observations: {len(pre_data)}")
# Create time trend variable (assuming ordered periods)
time_mapping = {period: i for i, period in enumerate(sorted(pre_periods))}
pre_data['time_trend'] = pre_data[time_col].map(time_mapping)
# Test for differential trends: Y = α + β*Treatment + γ*Time + δ*(Treatment×Time)
pre_data['treat_time_interaction'] = pre_data[treatment_col] * pre_data['time_trend']
trend_formula = f"{outcome_col} ~ {treatment_col} + time_trend + treat_time_interaction"
trend_model = smf.ols(trend_formula, data=pre_data).fit()
# The interaction coefficient tests parallel trends
interaction_coeff = trend_model.params['treat_time_interaction']
interaction_pvalue = trend_model.pvalues['treat_time_interaction']
print(f"\n=== STATISTICAL TEST ===")
print(f"Differential trend coefficient: {interaction_coeff:.4f}")
print(f"P-value: {interaction_pvalue:.4f}")
if interaction_pvalue < 0.05:
print("❌ VIOLATION: Significant differential pre-trends detected")
print(" DiD estimates may be biased. Consider:")
print(" - Group-specific time trends")
print(" - Alternative identification strategies")
else:
print("✅ PASSED: No evidence against parallel trends assumption")
print(" DiD identification appears valid")
# Calculate group-specific trends
group_trends = pre_data.groupby([treatment_col, time_col])[outcome_col].mean().reset_index()
# Visual trend analysis
print(f"\n=== TREND ANALYSIS ===")
for group in [0, 1]:
group_data = group_trends[group_trends[treatment_col] == group]
trend_slope = np.polyfit(group_data['time_trend'] if 'time_trend' in group_data.columns
else range(len(group_data)), group_data[outcome_col], 1)[0]
group_name = "Treatment" if group == 1 else "Control"
print(f"{group_name} group trend: {trend_slope:.3f} per period")
return {
'model': trend_model,
'interaction_coeff': interaction_coeff,
'p_value': interaction_pvalue,
'parallel_trends_valid': interaction_pvalue >= 0.05,
'group_trends': group_trends
}
# Test parallel trends assumption
# Assumes multiple pre-treatment periods
parallel_trends_test = test_parallel_trends(
df_longitudinal, 'tickets_per_week', 'copilot_group', 'period',
pre_periods=['2023-Q1', '2023-Q2', '2023-Q3'] # Pre-rollout quarters
)
if not parallel_trends_test['parallel_trends_valid']:
print("\n⚠️ Consider robustness checks:")
print("- Synthetic control methods")
print("- Regression discontinuity (if applicable)")
print("- Alternative control groups")
Parallel Trends Interpretation:
- Non-significant interaction: Groups had similar pre-trends (good for DiD)
- Significant interaction: Differential pre-trends suggest bias
- Visual inspection: Plot trends to understand patterns
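For the visual inspection mentioned above, here is a minimal plotting sketch using the group_trends table returned by test_parallel_trends; column names follow the earlier call (copilot_group, period, tickets_per_week).

```python
# Hedged sketch: plot pre-treatment group means by period to eyeball parallel trends.
import matplotlib.pyplot as plt

trends = parallel_trends_test['group_trends']
for group, label in [(0, 'Control'), (1, 'Treatment')]:
    sub = trends[trends['copilot_group'] == group]
    plt.plot(sub['period'], sub['tickets_per_week'], marker='o', label=label)
plt.xlabel('Period')
plt.ylabel('Average tickets per week')
plt.title('Pre-Treatment Trends by Group')
plt.legend()
plt.show()
```

Roughly parallel lines support the DiD design; visibly diverging lines echo the statistical warning above.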
Panel Regression DiD Implementation
The same DiD logic extends to richer panel data using two-way fixed effects (entity and time) with clustered standard errors via linearmodels.
import pandas as pd
from linearmodels import PanelOLS
def difference_in_differences(df, unit_col, time_col, treatment_col, outcome_col, post_col):
"""
Standard difference-in-differences estimation
"""
# Create interaction term
df['treatment_post'] = df[treatment_col] * df[post_col]
# Prepare data for panel regression
df_panel = df.set_index([unit_col, time_col])
# Two-way fixed effects DiD: the treatment and post main effects are absorbed
# by the entity and time fixed effects, so only the interaction term is estimated
formula = f'{outcome_col} ~ treatment_post + EntityEffects + TimeEffects'
model = PanelOLS.from_formula(formula, df_panel).fit(cov_type='clustered', cluster_entity=True)
# Treatment effect is the coefficient on interaction term
ate = model.params['treatment_post']
return {
'ate': ate,
'std_error': model.std_errors['treatment_post'],
'p_value': model.pvalues['treatment_post'],
'confidence_interval': model.conf_int().loc['treatment_post'],
'model': model
}
# Example usage for phased Copilot rollout
df['post_rollout'] = (df['date'] >= '2024-01-01').astype(int)
df['treated_department'] = (df['department'].isin(['Sales', 'Support'])).astype(int)
did_result = difference_in_differences(
df, 'user_id', 'date', 'treated_department', 'tickets_per_week', 'post_rollout'
)
print(f"DiD ATE: {did_result['ate']:.2f} ± {1.96 * did_result['std_error']:.2f}")
Event Study Design
def event_study_analysis(df, unit_col, time_col, treatment_col, outcome_col, event_time_col):
"""
Event study design to test parallel trends and estimate dynamic effects
"""
# Create relative time indicators
relative_times = []
for t in range(-6, 7): # 6 periods before to 6 periods after
indicator = f'rel_time_{t}'
df[indicator] = ((df[event_time_col] == t) & (df[treatment_col] == 1)).astype(int)
if t != -1: # Omit t=-1 as reference period
relative_times.append(indicator)
# Prepare panel data
df_panel = df.set_index([unit_col, time_col])
# Event study regression
formula = f'{outcome_col} ~ {" + ".join(relative_times)} + EntityEffects + TimeEffects'
model = PanelOLS.from_formula(formula, df_panel).fit(cov_type='clustered', cluster_entity=True)
# Extract coefficients and plot
coefficients = {}
for var in relative_times:
time_point = int(var.split('_')[-1])
coefficients[time_point] = {
'estimate': model.params[var],
'std_error': model.std_errors[var]
}
return coefficients, model
# Plotting function for event study
def plot_event_study(coefficients):
import matplotlib.pyplot as plt
times = sorted(coefficients.keys())
estimates = [coefficients[t]['estimate'] for t in times]
std_errors = [coefficients[t]['std_error'] for t in times]
plt.figure(figsize=(10, 6))
plt.errorbar(times, estimates, yerr=[1.96*se for se in std_errors],
marker='o', capsize=5)
plt.axhline(y=0, color='red', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='red', linestyle='-', alpha=0.5)
plt.xlabel('Relative Time (Periods)')
plt.ylabel('Treatment Effect')
plt.title('Event Study: Dynamic Treatment Effects')
plt.grid(True, alpha=0.3)
plt.show()
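A hedged usage sketch tying the two functions together. It assumes df_longitudinal contains a hypothetical event_time column measuring periods relative to each unit's Copilot rollout (negative values are pre-rollout); compute it as the difference between each observation's period and the unit's rollout period.

```python
# Hedged sketch: run the event study and plot dynamic effects.
# 'event_time' is a hypothetical column name introduced for illustration.
coefficients, es_model = event_study_analysis(
    df_longitudinal, 'employee_id', 'period',
    'copilot_group', 'tickets_per_week', 'event_time'
)
plot_event_study(coefficients)
# Pre-rollout coefficients near zero support parallel trends;
# post-rollout coefficients show how the effect builds or fades over time.
```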
Method 4: Instrumental Variables
Two-Stage Least Squares
from linearmodels.iv import IV2SLS
def instrumental_variables_analysis(df, outcome_col, treatment_col, instrument_col, confounder_cols):
"""
Two-stage least squares with instrumental variables
"""
# linearmodels IV formula: outcome ~ 1 + exogenous confounders + [endogenous treatment ~ excluded instruments]
confounders_str = ' + '.join(confounder_cols)
formula = f'{outcome_col} ~ 1 + {confounders_str} + [{treatment_col} ~ {instrument_col}]'
# First stage: Check instrument strength
first_stage_formula = f'{treatment_col} ~ {instrument_col} + {confounders_str}'
first_stage = smf.ols(first_stage_formula, data=df).fit()
f_stat = first_stage.fvalue  # overall first-stage F-statistic; a rough proxy for instrument strength
# Second stage: IV estimation
iv_model = IV2SLS.from_formula(formula, df).fit()
# Weak instrument test
weak_instrument = f_stat < 10 # Rule of thumb: F-stat should be > 10
return {
'ate': iv_model.params[treatment_col],
'std_error': iv_model.std_errors[treatment_col],
'confidence_interval': iv_model.conf_int().loc[treatment_col],
'first_stage_f_stat': f_stat,
'weak_instrument_warning': weak_instrument,
'model': iv_model
}
# Example: Using random license assignment as instrument
# (This requires that licenses were randomly assigned initially)
iv_result = instrumental_variables_analysis(
df, 'tickets_per_week', 'copilot_usage', 'license_assigned', confounders
)
if iv_result['weak_instrument_warning']:
print("Warning: Weak instrument detected!")
print(f"IV ATE: {iv_result['ate']:.2f}")
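To make the two-stage logic explicit, here is a hedged manual 2SLS sketch. It is for intuition only: the second-stage OLS standard errors are not valid because the predicted treatment is itself estimated, so use IV2SLS above for inference. Column names follow the earlier example.

```python
# Hedged sketch: manual two-stage least squares (illustration only).
confounders_str = ' + '.join(confounders)

# Stage 1: regress treatment on the instrument plus confounders
stage1 = smf.ols(f'copilot_usage ~ license_assigned + {confounders_str}', data=df).fit()
df['copilot_usage_hat'] = stage1.fittedvalues

# Stage 2: regress the outcome on the *predicted* treatment plus confounders
stage2 = smf.ols(f'tickets_per_week ~ copilot_usage_hat + {confounders_str}', data=df).fit()
print(f"Manual 2SLS point estimate: {stage2.params['copilot_usage_hat']:.2f}")
```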
Method 5: Doubly Robust Methods
AIPW (Augmented Inverse Probability Weighting)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
def doubly_robust_estimation(df, outcome_col, treatment_col, confounder_cols):
"""
Augmented Inverse Probability Weighting (AIPW) - Doubly Robust method
"""
X = df[confounder_cols]
T = df[treatment_col]
Y = df[outcome_col]
# Step 1: Estimate propensity scores
ps_model = LogisticRegression(random_state=42)
propensity_scores = cross_val_predict(ps_model, X, T, method='predict_proba')[:, 1]
# Step 2: Estimate outcome regression for both treatment levels
# Outcome model for treated units
treated_idx = T == 1
outcome_model_1 = RandomForestRegressor(random_state=42)
outcome_model_1.fit(X[treated_idx], Y[treated_idx])
mu_1 = outcome_model_1.predict(X)
# Outcome model for control units
control_idx = T == 0
outcome_model_0 = RandomForestRegressor(random_state=42)
outcome_model_0.fit(X[control_idx], Y[control_idx])
mu_0 = outcome_model_0.predict(X)
# Step 3: AIPW estimation
# Treated units component
treated_component = (T * (Y - mu_1) / propensity_scores) + mu_1
# Control units component
control_component = ((1 - T) * (Y - mu_0) / (1 - propensity_scores)) + mu_0
# Average treatment effect
ate = np.mean(treated_component - control_component)
# Bootstrap confidence intervals
n_bootstrap = 1000
bootstrap_ates = []
for _ in range(n_bootstrap):
bootstrap_idx = np.random.choice(len(df), size=len(df), replace=True)
bootstrap_ate = np.mean(
treated_component.values[bootstrap_idx] - control_component.values[bootstrap_idx]
)
bootstrap_ates.append(bootstrap_ate)
ci_lower = np.percentile(bootstrap_ates, 2.5)
ci_upper = np.percentile(bootstrap_ates, 97.5)
return {
'ate': ate,
'confidence_interval': (ci_lower, ci_upper),
'propensity_scores': propensity_scores,
'outcome_predictions': {'mu_0': mu_0, 'mu_1': mu_1}
}
# Example usage
dr_result = doubly_robust_estimation(
df, 'tickets_per_week', 'copilot_usage', confounders
)
print(f"Doubly Robust ATE: {dr_result['ate']:.2f} "
f"({dr_result['confidence_interval'][0]:.2f}, {dr_result['confidence_interval'][1]:.2f})")
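For a library-based alternative, EconML's LinearDML implements a closely related doubly robust, debiased machine learning estimator with cross-fitting built in. A hedged sketch follows; the exact argument names may vary across econml versions, so treat it as a starting point rather than a recipe.

```python
# Hedged sketch: doubly robust estimation via econml's LinearDML (assumed API).
from econml.dml import LinearDML
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

dml_est = LinearDML(
    model_y=RandomForestRegressor(random_state=42),   # outcome model
    model_t=RandomForestClassifier(random_state=42),  # treatment model
    discrete_treatment=True,
    random_state=42
)
dml_est.fit(df['tickets_per_week'], df['copilot_usage'], X=None, W=df[confounders])
print(f"LinearDML ATE: {dml_est.ate():.2f}")
print(f"95% CI: {dml_est.ate_interval()}")
```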
Sensitivity Analysis and Robustness Checks
Sensitivity to Unobserved Confounding
def sensitivity_analysis_unobserved_confounding(base_estimate, confounder_strength_range):
"""
Assess sensitivity to potential unobserved confounders
"""
sensitivity_results = []
for gamma in confounder_strength_range:
# Rosenbaum bounds calculation
# This is a simplified version - full implementation requires more complex calculations
lower_bound = base_estimate / (1 + gamma)
upper_bound = base_estimate / (1 - gamma) if gamma < 1 else np.inf
sensitivity_results.append({
'confounder_strength': gamma,
'lower_bound': lower_bound,
'upper_bound': upper_bound,
'significant': lower_bound > 0 or upper_bound < 0
})
return sensitivity_results
# Example usage
gamma_range = np.arange(0, 0.5, 0.05)
sensitivity = sensitivity_analysis_unobserved_confounding(result['ate'], gamma_range)
Placebo Tests
def placebo_tests(df, outcome_col, treatment_col, confounder_cols, method='regression'):
"""
Conduct placebo tests using pre-treatment periods or alternative outcomes
"""
placebo_results = {}
# Temporal placebo: Use pre-treatment period
pre_treatment = df[df['time_period'] < df['treatment_start_date']]
if len(pre_treatment) > 0:
if method == 'regression':
placebo_result = regression_adjustment(
pre_treatment, outcome_col, treatment_col, confounder_cols
)
placebo_results['temporal'] = placebo_result
# Outcome placebo: Use unrelated outcomes
unrelated_outcomes = ['vacation_days', 'sick_days', 'training_hours']
for outcome in unrelated_outcomes:
if outcome in df.columns:
placebo_result = regression_adjustment(
df, outcome, treatment_col, confounder_cols
)
placebo_results[f'outcome_{outcome}'] = placebo_result
return placebo_results
# Example usage
placebo_results = placebo_tests(df, 'tickets_per_week', 'copilot_usage', confounders)
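A complementary robustness check is a permutation (randomization) placebo: shuffle the treatment labels, re-estimate, and confirm the real estimate stands out from the distribution of placebo effects. A hedged sketch reusing regression_adjustment and the result object from Method 1:

```python
# Hedged sketch: permutation placebo test. Shuffled treatment labels should
# produce effects centered on zero; the real estimate should sit in the tail.
# Note: regression_adjustment prints its formula on each run; lower
# n_permutations for a quick check.
def permutation_placebo(df, outcome_col, treatment_col, confounder_cols, n_permutations=200):
    placebo_ates = []
    df_perm = df.copy()
    for _ in range(n_permutations):
        df_perm[treatment_col] = np.random.permutation(df[treatment_col].values)
        fit = regression_adjustment(df_perm, outcome_col, treatment_col, confounder_cols)
        placebo_ates.append(fit['ate'])
    return np.array(placebo_ates)

placebo_dist = permutation_placebo(df, 'tickets_per_week', 'copilot_usage', confounders)
real_ate = result['ate']  # estimate from Method 1
p_perm = (np.abs(placebo_dist) >= abs(real_ate)).mean()
print(f"Permutation p-value: {p_perm:.3f}")
```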
Results Interpretation and Reporting
Effect Size Calculation and Business Translation
def interpret_causal_results(ate, baseline_mean, confidence_interval, outcome_unit='tickets'):
"""
Translate causal estimates into business-relevant terms
"""
# Calculate effect sizes
absolute_effect = ate
percentage_effect = (ate / baseline_mean) * 100
# Statistical significance
is_significant = not (confidence_interval[0] <= 0 <= confidence_interval[1])
# Business translation examples
if outcome_unit == 'tickets':
monthly_effect = absolute_effect * 4.33 # weeks to months
annual_effect = absolute_effect * 52 # weeks to year
business_translation = {
'weekly_increase': f"{absolute_effect:.1f} additional tickets per week",
'monthly_increase': f"{monthly_effect:.1f} additional tickets per month",
'annual_increase': f"{annual_effect:.0f} additional tickets per year",
'percentage_improvement': f"{percentage_effect:.1f}% improvement"
}
return {
'absolute_effect': absolute_effect,
'percentage_effect': percentage_effect,
'is_significant': is_significant,
'business_translation': business_translation,
'confidence_interval': confidence_interval
}
# Example usage
baseline_productivity = df[df['copilot_usage'] == 0]['tickets_per_week'].mean()
interpretation = interpret_causal_results(
ate=2.5,
baseline_mean=baseline_productivity,
confidence_interval=(1.8, 3.2),
outcome_unit='tickets'
)
print(f"Copilot causes {interpretation['business_translation']['percentage_improvement']}")
print(f"That's {interpretation['business_translation']['annual_increase']}")
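Effect estimates become decisions when paired with costs. Below is a hedged ROI sketch; the license cost, per-ticket value, and working weeks are hypothetical placeholders to be replaced with your organization's actual figures.

```python
# Hedged sketch: rough ROI translation. All dollar figures are hypothetical.
annual_license_cost = 360      # hypothetical cost per user per year
value_per_ticket = 15          # hypothetical value of one resolved ticket
working_weeks_per_year = 48    # hypothetical

annual_value_per_user = (interpretation['absolute_effect']
                         * working_weeks_per_year * value_per_ticket)
roi = (annual_value_per_user - annual_license_cost) / annual_license_cost
print(f"Estimated annual value per user: ${annual_value_per_user:,.0f}")
print(f"Estimated ROI on the license: {roi:.0%}")
```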
Heterogeneous Treatment Effects Analysis
def analyze_heterogeneous_effects(df, outcome_col, treatment_col, confounder_cols, subgroup_vars):
"""
Analyze how treatment effects vary across different subgroups
"""
subgroup_effects = {}
for var in subgroup_vars:
subgroup_effects[var] = {}
for level in df[var].unique():
subset = df[df[var] == level]
if len(subset) > 50: # Ensure sufficient sample size
effect = regression_adjustment(
subset, outcome_col, treatment_col, confounder_cols
)
subgroup_effects[var][level] = effect
return subgroup_effects
# Example usage
subgroup_vars = ['job_level', 'department', 'tenure_category']
heterogeneous_effects = analyze_heterogeneous_effects(
df, 'tickets_per_week', 'copilot_usage', confounders, subgroup_vars
)
# Print results
for var, levels in heterogeneous_effects.items():
print(f"\nHeterogeneous effects by {var}:")
for level, effect in levels.items():
print(f" {level}: {effect['ate']:.2f} ({effect['ci_lower']:.2f}, {effect['ci_upper']:.2f})")
Best Practices and Common Pitfalls
Data Quality Checks
def causal_inference_data_quality_check(df, outcome_col, treatment_col, confounder_cols):
"""
Comprehensive data quality assessment for causal inference
"""
quality_report = {}
# 1. Missing data assessment
missing_rates = df[confounder_cols + [outcome_col, treatment_col]].isnull().mean()
quality_report['missing_data'] = missing_rates[missing_rates > 0.05]
# 2. Treatment variation
treatment_balance = df[treatment_col].value_counts(normalize=True)
quality_report['treatment_balance'] = treatment_balance
# 3. Outcome distribution
quality_report['outcome_stats'] = df[outcome_col].describe()
# 4. Temporal coverage
if 'time_period' in df.columns:
quality_report['temporal_coverage'] = {
'start_date': df['time_period'].min(),
'end_date': df['time_period'].max(),
'n_periods': df['time_period'].nunique()
}
# 5. Sample size by subgroup
if len(confounder_cols) > 0:
sample_sizes = df.groupby(confounder_cols[0]).size()
quality_report['min_subgroup_size'] = sample_sizes.min()
return quality_report
# Example usage
quality_check = causal_inference_data_quality_check(
df, 'tickets_per_week', 'copilot_usage', confounders
)
Model Diagnostics
def model_diagnostics(model, df, residual_threshold=3):
"""
Run diagnostics on causal inference models
"""
diagnostics = {}
# Residual analysis (outliers counted on standardized residuals)
residuals = model.resid
standardized_residuals = (residuals - residuals.mean()) / residuals.std()
diagnostics['residual_stats'] = {
'mean': residuals.mean(),
'std': residuals.std(),
'outliers': np.sum(np.abs(standardized_residuals) > residual_threshold)
}
# Homoscedasticity test (Breusch-Pagan)
from statsmodels.stats.diagnostic import het_breuschpagan
lm, lm_pvalue, fvalue, f_pvalue = het_breuschpagan(residuals, model.model.exog)
diagnostics['homoscedasticity'] = {
'test_statistic': lm,
'p_value': lm_pvalue,
'homoscedastic': lm_pvalue > 0.05
}
# Normality test
from scipy.stats import shapiro
shapiro_stat, shapiro_p = shapiro(residuals)
diagnostics['normality'] = {
'test_statistic': shapiro_stat,
'p_value': shapiro_p,
'normal': shapiro_p > 0.05
}
return diagnostics
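A hedged usage sketch, applying the diagnostics to the regression model fitted in Method 1 (result['model']):

```python
# Hedged sketch: run diagnostics on the Method 1 regression model.
diagnostics = model_diagnostics(result['model'], df)
print("Residual outliers:", diagnostics['residual_stats']['outliers'])
print("Homoscedastic errors:", diagnostics['homoscedasticity']['homoscedastic'])
print("Approximately normal residuals:", diagnostics['normality']['normal'])
```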
Resources and Further Reading
Recommended Libraries
# Essential libraries for causal inference in Python
libraries = {
'causal_inference': [
'dowhy', # Microsoft's causal inference library
'causalml', # Uber's causal ML library
'econml', # Microsoft's econometrics + ML
'causallib', # IBM's causal inference library
],
'statistical_models': [
'statsmodels', # Statistical modeling
'linearmodels', # Panel data and IV models
'scikit-learn', # Machine learning
],
'data_manipulation': [
'pandas', # Data manipulation
'numpy', # Numerical computing
],
'visualization': [
'matplotlib', # Basic plotting
'seaborn', # Statistical visualization
'plotly', # Interactive plots
]
}
# Installation command
install_command = """
pip install dowhy causalml econml causallib statsmodels linearmodels
pip install scikit-learn pandas numpy matplotlib seaborn plotly
"""
Example Analysis Workflow
def complete_causal_analysis_workflow(df, outcome_col, treatment_col, confounder_cols):
"""
Complete workflow for causal analysis of Copilot data
"""
results = {}
# Step 1: Data quality check
quality_check = causal_inference_data_quality_check(df, outcome_col, treatment_col, confounder_cols)
results['data_quality'] = quality_check
# Step 2: Multiple estimation methods
methods = {
'regression': lambda: regression_adjustment(df, outcome_col, treatment_col, confounder_cols),
'propensity_matching': lambda: propensity_score_matching(estimate_propensity_scores(df, treatment_col, confounder_cols)[0], 'propensity_score', treatment_col, outcome_col),
'doubly_robust': lambda: doubly_robust_estimation(df, outcome_col, treatment_col, confounder_cols)
}
estimates = {}
for method_name, method_func in methods.items():
try:
estimates[method_name] = method_func()
except Exception as e:
print(f"Warning: {method_name} failed with error: {e}")
results['estimates'] = estimates
# Step 3: Sensitivity analysis
if 'regression' in estimates:
base_ate = estimates['regression']['ate']
sensitivity = sensitivity_analysis_unobserved_confounding(base_ate, np.arange(0, 0.3, 0.05))
results['sensitivity'] = sensitivity
# Step 4: Business interpretation
baseline_mean = df[df[treatment_col] == 0][outcome_col].mean()
for method, estimate in estimates.items():
if 'ate' in estimate:
interpretation = interpret_causal_results(
estimate['ate'], baseline_mean,
estimate.get('confidence_interval', (estimate.get('ci_lower', np.nan), estimate.get('ci_upper', np.nan)))
)
results[f'{method}_interpretation'] = interpretation
return results
# Run complete analysis
complete_results = complete_causal_analysis_workflow(
df, 'tickets_per_week', 'copilot_usage', confounders
)
This technical guide provides the foundation for implementing rigorous causal inference methods with Copilot data. Always validate assumptions, check robustness, and interpret results in business context.