Copilot Causal Toolkit: Interpretation Guide

Now that you have successfully run the Copilot Causal Toolkit notebooks, you should see results appear in your Jupyter notebook and in your output directory (and with no errors). This document provides detailed guidance on how to interpret every output the toolkit produces, and what actions you should consider as a result.

For instructions on setting up and running the toolkit, see the main Copilot Causal Toolkit page.


What you should see in the outputs

After completing a notebook run, the toolkit produces two categories of outputs:

  1. In-notebook outputs β€” plots, tables, and summary statistics displayed inline as you run each cell
  2. Saved files β€” CSV, PNG, JSON, and TXT files written to the output/ directory

Output directory structure

Your output/ directory will follow this structure:

output/
└── Subgroup Analysis - [YOUR COMPANY]/
    β”œβ”€β”€ significant_subgroups_[timestamp].csv
    β”œβ”€β”€ all_significant_subgroups_effects_[timestamp].png
    β”œβ”€β”€ sensitivity_analysis_[timestamp].png
    β”œβ”€β”€ sensitivity_analysis_results_[timestamp].json
    β”‚
    β”œβ”€β”€ Subgroup_1_[Name]/
    β”‚   β”œβ”€β”€ ate_plot_Total_Copilot_actions_taken_[timestamp].png
    β”‚   β”œβ”€β”€ ate_results_Total_Copilot_actions_taken_[timestamp].csv
    β”‚   β”œβ”€β”€ subgroup_definition.txt
    β”‚   └── transition_matrix_Total_Copilot_actions_taken_[timestamp].csv
    β”‚
    β”œβ”€β”€ Subgroup_2_[Name]/
    β”‚   └── (same files as above)
    β”‚
    β”œβ”€β”€ Subgroup_3_[Name]/
    β”œβ”€β”€ Subgroup_4_[Name]/
    └── Subgroup_5_[Name]/

Each [timestamp] follows the format YYYYMMDD_HHMM (e.g., 20260116_1137) and corresponds to when you ran the analysis, so you can track multiple runs.


Interpreting results from the notebook

As you run the notebook cell by cell, several key outputs appear inline. This section walks through what to look for at each stage.

Data loading and filtering summary

Early cells print a summary of your data after loading and filtering. Look for:

What to check: Verify the number of unique individuals is reasonable for your organization. If the Copilot user percentage is very low (e.g., <5%), the analysis may have limited statistical power.

Exploratory plots

The notebook generates three exploratory visualizations before the causal analysis:

  1. Bar chart of the outcome variable by Organization β€” shows which organizational units have the highest/lowest outcome values (e.g., after-hours collaboration hours)
  2. Time trend of the outcome variable β€” shows how the outcome has changed week-over-week across the analysis period
  3. Time trend of the treatment variable β€” shows how Copilot adoption has changed over time

What to check: Look for obvious data quality issues such as sudden spikes or drops, missing weeks, or unreasonably large or small values. If the outcome or treatment variable is flat over time, the analysis may have limited variation to work with.

Data aggregation

The notebook aggregates the longitudinal (person-week) data into a cross-sectional (person-level) snapshot by averaging all numeric metrics across weeks for each person. The output confirms:

What to check: Ensure all expected variables are present and no critical variables are listed as missing.

Subgroup combinations

The toolkit creates all pairwise combinations of the organizational attributes you specified in SUBGROUP_VARS (e.g., Organization Γ— FunctionType, Organization Γ— IsManager, etc.) and shows:

What to check: If very few subgroups meet the minimum size threshold, consider reducing min_group_size or using fewer SUBGROUP_VARS.

CATE model results

The CausalForestDML model estimates an individual treatment effect for each person. The notebook displays:

These individual effects are then used to identify which subgroups show the strongest effects.

Top 5 subgroups

The notebook identifies and prints the top 5 subgroups with the most significant treatment effects. For each subgroup, it shows:

The direction of β€œtop” depends on the FIND_NEGATIVE_EFFECTS toggle:


Interpreting significant subgroups (.csv)

File: significant_subgroups_[timestamp].csv

This is the most important summary file. It lists all statistically significant subgroups (p < 0.05), sorted by effect size.

Column descriptions

Column Description
name The subgroup identifier, combining two organizational attributes (e.g., Organization__Aggregated__Sales__and__FunctionType__Aggregated__IC)
var1 The first grouping variable name
val1 The value of the first grouping variable for this subgroup
var2 The second grouping variable name
val2 The value of the second grouping variable for this subgroup
mean_effect The average treatment effect (ATE) in hours β€” the estimated change in the outcome when moving from 0 to the average Copilot usage level
std_effect Standard deviation of individual treatment effects within the subgroup
p_value Statistical significance (values < 0.05 are considered significant)
n_observations Number of person-level observations in this subgroup
n_users Number of unique individuals (typically equals n_observations since data is aggregated per-person)
significant Boolean flag β€” True if p_value < 0.05

How to read mean_effect

The mean_effect represents the estimated causal change in the outcome variable (in hours per week) when an individual moves from zero Copilot usage to the dataset’s average usage level.

Example from a real output:

Subgroup mean_effect p_value n_users Interpretation
Global Delivery -0.2236 0.0000 652 Each user’s after-hours collaboration reduced by ~0.22 hours/week (~13 min/week)
Commerce -0.1752 0.0042 59 ~0.18 hours/week reduction (~11 min/week)
Testing +0.0583 0.0477 113 ~0.06 hours/week increase (~3.5 min/week)

What to look for

  1. Effect direction consistency: Are most subgroups showing effects in the same direction, or is it mixed? Mixed results may indicate heterogeneity worth investigating.
  2. Effect magnitude: Consider whether the effect sizes are practically meaningful. A statistically significant reduction of 0.05 hours/week (3 minutes) may not warrant action, while 0.5 hours/week (30 minutes) is substantial.
  3. Sample sizes: Larger subgroups (higher n_users) produce more reliable estimates. Be cautious with subgroups that barely meet the minimum threshold.
  4. p-values: Lower p-values indicate stronger statistical evidence. Effects with p < 0.001 are particularly strong.
  5. Opposing effects: Pay close attention to subgroups that show effects in the opposite direction to the majority. These may represent populations where Copilot has unintended consequences, or they may reflect confounding.

Interpreting the subgroup comparison bar chart (.png)

File: all_significant_subgroups_effects_[timestamp].png

This horizontal bar chart visualizes the mean_effect for all statistically significant subgroups side by side. It provides the quickest visual summary of your results.

Reading the chart

Summary statistics box

In the top-left corner, a text box shows:

What to look for

  1. Distribution of effects: Are most bars on one side of zero, or evenly split? A lopsided distribution suggests a general trend, while an even split suggests heterogeneity.
  2. Outlier subgroups: Any bars that are substantially longer than the others deserve attention β€” these are the populations most affected by Copilot.
  3. Contrast with non-plotted subgroups: Remember that this chart only shows statistically significant subgroups. Non-significant subgroups (which are not plotted) may simply lack sufficient sample size.

Interpreting each subgroup’s detailed results

For each of the top 5 subgroups, the toolkit generates a dedicated folder with four files. This section explains how to read each one.

Subgroup definition (subgroup_definition.txt)

Example content:

Organization__Aggregated_=Global Delivery, FunctionType__Aggregated_=#ERROR

This text file records which two organizational attribute values define this subgroup. It is deliberately simple so you can quickly reference which population was analyzed.

How to use: When presenting findings to stakeholders, translate the machine-readable format into business language. For example:

If a value appears as #ERROR or looks unexpected, this typically means the attribute had missing or null values in the source data. You may want to verify the data quality for that field.


ATE results table (ate_results_Total_Copilot_actions_taken_[timestamp].csv)

This file contains the detailed dose-response estimates β€” how the outcome changes at each level of Copilot usage (0, 1, 2, 3, … actions per week).

Column descriptions

Column Description
Treatment The Copilot usage level (actions per week). Ranges from 0 to the maximum observed in the subgroup.
ATE_Featurized Estimated treatment effect at this usage level from the featurized model (uses spline transformation to capture non-linear dose-response patterns).
ATE_Baseline Estimated treatment effect at this usage level from the baseline model (assumes a simple linear effect).
CI_Lower_Featurized Lower bound of the 95% confidence interval for the featurized model.
CI_Upper_Featurized Upper bound of the 95% confidence interval for the featurized model.
CI_Lower_Baseline Lower bound of the 95% confidence interval for the baseline model.
CI_Upper_Baseline Upper bound of the 95% confidence interval for the baseline model.
P_Value_Featurized Statistical significance at this treatment level for the featurized model.
P_Value_Baseline Statistical significance at this treatment level for the baseline model.

Understanding the two models

The toolkit fits two DML models for each subgroup, and for good reason:

  1. Featurized model (SplineTransformer with 4 knots, degree 3): This model can capture non-linear dose-response patterns β€” for example, diminishing returns at high usage levels, or a threshold effect where benefits only kick in above a certain usage level. The spline transformation allows the effect curve to bend and flex.

  2. Baseline model (no featurization): This model assumes a constant, linear effect β€” each additional Copilot action has the same impact regardless of the starting level. It serves as a simpler benchmark.

When the two models agree (their curves are close together), the effect is likely linear and straightforward. When they diverge, the relationship is non-linear, and the featurized model is usually more informative.

How to read the table

Each row represents a treatment level. The ATE_Featurized and ATE_Baseline columns tell you the estimated change in the outcome (in hours) when a person moves from 0 actions/week to that treatment level.

Example interpretation:

Treatment ATE_Featurized ATE_Baseline Interpretation
0 0.000 0.000 Baseline (no Copilot usage) β€” effect is always 0 at Treatment=0
5 0.387 -0.099 Featurized model estimates +0.39 hours change; baseline estimates -0.10 hours
10 0.376 -0.198 Featurized effect has plateaued; baseline continues linearly
15 -0.067 -0.297 Featurized effect turns negative at higher usage levels
20 -0.558 -0.397 Both models now show negative effects

In this example, the featurized model reveals a non-linear pattern: moderate Copilot usage (5-10 actions/week) has a positive effect, but very high usage (15+ actions) reverses the effect. The baseline model, constrained to be linear, misses this nuance entirely.

Confidence intervals and significance

What to look for

  1. Where is the effect strongest? Identify the treatment level(s) where ATE_Featurized is largest in absolute terms.
  2. Are there diminishing returns? If the effect flattens or reverses at higher treatment levels, there may be an optimal β€œsweet spot” for Copilot usage.
  3. Do the two models agree? If the featurized and baseline models diverge substantially, the non-linear pattern is important.
  4. Where are confidence intervals narrowest? This is where the estimate is most reliable, typically at treatment levels near the population median.

ATE dose-response plot (ate_plot_Total_Copilot_actions_taken_[timestamp].png)

This plot is a visual representation of the ATE results table. It is the single most important visual output for each subgroup.

Reading the plot

Interpreting common curve shapes

1. Steadily declining curve (negative slope throughout)

The outcome decreases at all Copilot usage levels. For after-hours collaboration, this means Copilot consistently reduces evening/weekend work. The stronger the slope, the stronger the reduction per additional action.

2. Steadily rising curve (positive slope throughout)

The outcome increases at all usage levels. For external collaboration, this means Copilot consistently increases time with customers/partners.

3. Inverted-U shape (rises then falls)

The outcome increases at low-to-moderate usage levels, then decreases at high usage. This suggests an optimal usage range. Users below the peak may benefit from increased adoption; users above the peak might be experiencing diminishing or negative returns.

4. U shape (falls then rises)

The outcome decreases at low usage levels, then increases at high levels. This is less common but may indicate that heavy Copilot users are in demanding roles that naturally involve more of the outcome behavior.

5. Flat curve (near zero throughout)

No meaningful effect at any usage level. If the confidence band is narrow and covers zero, you can be fairly confident there is no effect. If the confidence band is wide, the sample may be too small to detect an effect.

Confidence band interpretation


Transition matrix (transition_matrix_Total_Copilot_actions_taken_[timestamp].csv)

This file compares raw outcome differences between groups of people at different Copilot usage levels. It serves as a simple, descriptive cross-check on the causal model results.

Column descriptions

Column Description
Bucket_i The lower treatment intensity bucket (e.g., β€œBottom 25%”)
Bucket_i_T_Range The range of Copilot usage values in bucket i (e.g., β€œ(2.0, 7.5]”)
Users_in_Bucket_i Number of users in the lower bucket
Bucket_j The higher treatment intensity bucket (e.g., β€œTop 10%”)
Bucket_j_T_Range The range of Copilot usage values in bucket j
Users_in_Bucket_j Number of users in the higher bucket
dYij The raw mean outcome difference between bucket j and bucket i (in hours). Positive = higher-usage group has a higher outcome; Negative = higher-usage group has a lower outcome.
P_Value P-value from a two-sample t-test comparing outcomes between the two buckets

Treatment buckets

The toolkit divides people into five treatment intensity buckets based on their average Copilot usage:

Bucket Definition
Bottom 25% 0th–25th percentile of usage
25–50% 25th–50th percentile
50–75% 50th–75th percentile
75–90% 75th–90th percentile
Top 10% 90th–100th percentile

Each row in the table compares two buckets (all 10 possible pairwise comparisons).

How to interpret

Example row:

Bucket_i Bucket_i_T_Range Users_in_Bucket_i Bucket_j Bucket_j_T_Range Users_in_Bucket_j dYij P_Value
Bottom 25% (2.0, 7.5] 163 Top 10% (28.1, 53.0] 66 0.096 0.867

This means: People in the Top 10% of Copilot usage (28-53 actions/week) have 0.096 hours/week more after-hours collaboration than people in the Bottom 25% (2-7.5 actions/week), but this difference is not statistically significant (p = 0.867).

Important caveats

What to look for

  1. Monotonic trends in dYij: If dYij consistently increases or decreases as you compare increasingly different buckets (e.g., Bottom 25% vs. 25-50%, vs. 50-75%, etc.), this supports a dose-response relationship.
  2. Consistency with the DML results: Do the raw differences point in the same direction as the causal estimates? If not, confounders may be playing a significant role (which is exactly why we use DML).
  3. Bucket sizes: Verify that each bucket has a reasonable number of users. Very small buckets (e.g., <20 users) produce unreliable comparisons.

Interpreting the sensitivity analysis

The sensitivity analysis assesses the robustness of results to potential unmeasured confounders β€” factors that might affect both Copilot usage and the outcome but were not included in the model. This is critical because no observational study can guarantee that all confounders have been measured.

Sensitivity analysis results file (sensitivity_analysis_results_[timestamp].json)

This JSON file contains the complete sensitivity analysis output. It has three main sections:

1. Overall analysis (overall_analysis)

Results for the entire population (not subgroup-specific):

Field Description
sample_size Total number of individuals analyzed
mean_effect Overall average treatment effect
standard_error Standard error of the mean effect
evalue_point E-value for the point estimate
evalue_ci E-value for the confidence interval (more conservative)
rosenbaum_critical_gamma Critical Gamma from Rosenbaum bounds (null if beyond tested range)
original_p_value P-value from the Wilcoxon signed-rank test

2. Subgroup analysis (subgroup_analysis)

An array of results for each of the top 5 subgroups:

Field Description
subgroup_name Name of the subgroup
rank Rank (1 = strongest effect)
mean_effect Mean treatment effect for this subgroup
p_value Statistical significance
sample_size Number of individuals
evalue_point E-value for the point estimate
evalue_ci E-value for the confidence interval
rosenbaum_gamma Critical Gamma (null if beyond tested range)
robustness_score Combined robustness measure: the minimum of evalue_ci and rosenbaum_gamma

3. Robustness categories (robustness_categories)

Each subgroup is classified into one of three categories:

Category Criteria Meaning
Very Robust robustness_score β‰₯ 2.0 An unmeasured confounder would need very strong associations with both treatment and outcome to explain away the result
Moderately Robust robustness_score 1.5–2.0 Moderate confounding could partially explain the result
Potentially Fragile robustness_score < 1.5 Relatively weak confounding could explain the result β€” interpret with caution

Sensitivity analysis plot (sensitivity_analysis_[timestamp].png)

This figure contains two side-by-side plots:

Left panel: Treatment Effect vs E-value

Interpretation: Points above the red line are robust to moderate unmeasured confounding. Points below are more sensitive. Ideally, subgroups with the largest effects should also have the highest E-values.

Right panel: Rosenbaum Gamma by subgroup

Interpretation: Taller green bars are better. A subgroup whose bar reaches above the red dashed line has results that would survive even substantial hidden bias in treatment assignment.


Understanding E-values

The E-value (VanderWeele & Ding, 2017) answers the question:

β€œHow strong would an unmeasured confounder need to be to completely explain away this effect?”

Specifically, the E-value is the minimum risk ratio that an unmeasured confounder would need to have with both the treatment (Copilot usage) and the outcome (e.g., after-hours collaboration) to fully explain away the observed effect, conditional on all the covariates already in the model.

E-value thresholds

E-value Robustness level What it means
< 1.5 Potentially fragile A relatively weak confounder (e.g., a factor that increases both Copilot usage and after-hours work by 50%) could explain the result
1.5 – 2.0 Moderate Would require a moderately strong confounder
2.0 – 3.0 Strong Would require a confounder with strong associations to both treatment and outcome
> 3.0 Very strong Highly unlikely that any single unmeasured confounder could explain the result

Two E-values are reported

Always focus on the CI E-value when assessing robustness.

Contextualizing E-values

To give E-values practical meaning, compare them against plausible confounders in your data:


Understanding Rosenbaum bounds (Gamma)

The Rosenbaum bounds test (Rosenbaum, 2002) answers a slightly different question:

β€œHow much hidden bias in treatment assignment could exist before the result becomes statistically non-significant?”

The Critical Gamma (Ξ“) is the maximum factor by which two otherwise identical individuals could differ in their odds of treatment (due to unobserved factors) while the result still retains statistical significance at p < 0.05.

Gamma thresholds

Ξ“ value Robustness level What it means
< 1.5 Fragile Even modest differences in treatment assignment odds could eliminate statistical significance
1.5 – 2.0 Moderate Tolerates moderate hidden bias
2.0 – 3.0 Strong Results survive substantial hidden bias
> 3.0 (or null) Very robust Either beyond the tested range (Ξ“ > 5.0) or the result is extremely robust

When Gamma is null

A null value for rosenbaum_gamma can mean:

Check the original_p_value field to distinguish: if p is very small (e.g., < 0.001), a null Gamma usually means the result is very robust.

E-value vs Rosenbaum Gamma: Key distinction

Metric What it tests Focus
E-value Could confounding explain away the magnitude of the effect? Effect size
Rosenbaum Ξ“ Could hidden bias eliminate statistical significance? P-value

Both matter. A result can have a low E-value but high Gamma (significant but potentially confounded) or a high E-value but low Gamma (robust in magnitude but marginally significant). Ideally, both should be high.


Interpreting the robustness score

The robustness_score is a single summary metric: the minimum of the CI E-value and the Critical Gamma. It represents the β€œweakest link” β€” whichever sensitivity measure is most concerning.

Robustness score Verdict
β‰₯ 2.0 Confident β€” results are robust
1.5 – 2.0 Cautiously optimistic β€” acknowledge limitations
< 1.5 Exercise caution β€” results may be sensitive to unmeasured confounding

How to report sensitivity analysis results

Template for robust results (score β‰₯ 2.0):

β€œThe sensitivity analysis suggests these findings are robust to unmeasured confounding. An unmeasured confounder would need to be associated with both Copilot usage and the outcome by a risk ratio of at least [E-value] to explain away the observed effect. Additionally, hidden bias in treatment assignment would need to exceed Ξ“ = [value] to overturn the statistical significance.”

Template for moderate results (score 1.5–2.0):

β€œThe results show moderate robustness. While statistical significance is maintained under moderate hidden bias assumptions (Ξ“ = [value]), the effect magnitude could potentially be explained by a moderately strong unmeasured confounder (E-value = [value]). Plausible confounders such as [job complexity, manager expectations] should be considered when interpreting these findings.”

Template for fragile results (score < 1.5):

β€œThese findings should be interpreted with caution. The sensitivity analysis indicates that relatively modest unmeasured confounding (E-value = [value]) could potentially explain the observed effect. While the direction of the effect is suggestive, these results are best treated as hypothesis-generating rather than confirmatory. We recommend validating with alternative study designs.”


Putting it all together: A structured review process

For a systematic review of your results, we recommend the following workflow:

Step 1: Start with the summary

Open significant_subgroups_[timestamp].csv and answer:

Step 2: Examine the bar chart

Open all_significant_subgroups_effects_[timestamp].png and look for:

Step 3: Dive into the top subgroups

For each of the top 5 subgroups, review:

  1. subgroup_definition.txt β€” Who is in this group?
  2. ate_plot_[...].png β€” What does the dose-response curve look like? Linear or non-linear? Where is the effect strongest?
  3. ate_results_[...].csv β€” At what treatment level is the effect statistically significant? What’s the precise estimate at key usage levels (e.g., 5, 10, 15 actions/week)?
  4. transition_matrix_[...].csv β€” Do the raw outcome differences support the causal findings?

Step 4: Check robustness

Open sensitivity_analysis_results_[timestamp].json and sensitivity_analysis_[timestamp].png:

Step 5: Synthesize and communicate

Prepare your findings for stakeholders:

  1. Lead with the business question β€” β€œDoes Copilot usage affect [outcome]?”
  2. State the overall finding β€” β€œAcross [N] employees, we found [direction] effects on [outcome] associated with increased Copilot usage.”
  3. Highlight key subgroups β€” β€œThe strongest effects were observed in [subgroup], where [describe the dose-response pattern].”
  4. Acknowledge limitations β€” β€œSensitivity analysis classified [N] of 5 top subgroups as [robustness level]. Plausible unmeasured confounders include [examples].”
  5. Recommend next steps β€” Based on the findings, suggest targeted interventions, additional data collection, or follow-up analyses.

Glossary

Term Definition
ATE Average Treatment Effect β€” the estimated causal effect of treatment on the outcome, averaged across all individuals
CATE Conditional Average Treatment Effect β€” the treatment effect conditional on individual characteristics (allows for heterogeneity)
CausalForestDML A machine learning-based causal inference estimator that discovers heterogeneous treatment effects across subgroups
Confidence Interval (CI) A range of values within which the true effect is likely to fall (95% of the time at the 95% confidence level)
Confounder A variable that affects both the treatment (Copilot usage) and the outcome, which must be controlled for to avoid bias
DML Double Machine Learning β€” a framework that uses ML to flexibly model confounding while maintaining valid causal inference
Dose-response curve A curve showing how the outcome changes across different treatment levels (rather than just comparing treated vs. untreated)
E-value The minimum strength of confounding (on the risk ratio scale) needed to explain away the observed effect
Featurized model A DML model that uses spline transformations to capture non-linear dose-response patterns
LinearDML A DML estimator that models treatment effects as a linear function after residualizing confounders
Rosenbaum Gamma (Ξ“) The maximum odds ratio of treatment assignment due to unobserved factors that still preserves statistical significance
Robustness score The minimum of the E-value (CI) and Rosenbaum Gamma β€” represents overall sensitivity to unmeasured confounding
SplineTransformer A feature transformation that models non-linear relationships using piecewise polynomials with smooth joins
Transition matrix A descriptive comparison of raw outcome differences across treatment intensity buckets
Winsorization Capping extreme values at a percentile threshold (e.g., 95th) to reduce the influence of outliers