Survival Curves with create_survival()

create_survival() produces a Kaplan–Meier survival curve: a non-parametric estimate of the probability that a given event has not yet occurred by each point in time, across a population. In a workforce analytics context, typical events include:

  • First week of after-hours collaboration

  • First week a collaboration or network metric crosses a threshold

  • First observed use of a new tool

Four exported functions cover the full workflow:

Function                 What it does
create_survival_prep()   Converts panel Person Query data into person-level survival format
create_survival_calc()   Computes Kaplan–Meier curves per group
create_survival_viz()    Renders step-function curves from a pre-computed table
create_survival()        End-to-end wrapper (calc + viz)

Reframing “survival” for workforce contexts

In classical survival analysis the event is typically something negative — death, equipment failure — and the y-axis probability represents “still alive”. In workforce analytics the event is usually a positive milestone: first adoption of a tool, first week as a power user, first time a metric crosses a meaningful threshold.

The terminology inverts: “surviving” means not yet having reached the milestone, and the event means success — the person converted. It is often more intuitive to read the chart as a time-to-adoption curve or a conversion curve:

  • A curve that drops steeply early → most people reached the milestone quickly.

  • A curve that stays high → many people had not yet converted by the end of the observation window.

  • The y-axis → the share of people who have not yet experienced the milestone.
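The inversion described above is simple arithmetic: the share that has already converted is one minus the survival probability. A minimal sketch, using made-up survival values rather than output from the package:

```python
# Hypothetical survival probabilities by week (illustrative values only).
survival_by_week = {1: 0.80, 2: 0.55, 3: 0.40, 4: 0.40}

# Share that has already reached the milestone = 1 - share still "surviving".
adoption_by_week = {
    week: round(1 - s, 2) for week, s in survival_by_week.items()
}

for week, share in adoption_by_week.items():
    print(f"Week {week}: {share:.0%} converted")
```

Reading the curve this way turns a falling "survival" line into a rising cumulative adoption line without changing the underlying estimate.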

[ ]:
import warnings
warnings.filterwarnings('ignore')

import vivainsights as vi

pq_data = vi.load_pq_data()

Step 1 — Prepare person-level survival data

pq_data is a panel dataset with one row per person per week. create_survival_prep() collapses it to one row per person, recording:

  • time: the week number at which the event first occurred, or the total number of observed weeks if the event never occurred (censored).

  • event: 1 if the condition was met in at least one week, 0 otherwise.
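The collapse from panel rows to one row per person can be sketched in plain Python. This is an illustration of the idea, not the package's implementation: the helper name collapse_to_person_level and the toy rows are hypothetical, and it assumes weeks are numbered 1, 2, ... within each person in date order.

```python
from collections import defaultdict

# Toy panel: (person_id, week_date, metric_value); values are illustrative.
panel = [
    ("A", "2024-01-07", 0.0), ("A", "2024-01-14", 1.5), ("A", "2024-01-21", 0.0),
    ("B", "2024-01-07", 0.0), ("B", "2024-01-14", 0.0),
    ("C", "2024-01-07", 2.0), ("C", "2024-01-14", 3.0),
]


def collapse_to_person_level(rows, condition):
    """Return {person: (time, event)} from panel rows (hypothetical helper)."""
    by_person = defaultdict(list)
    for person, week, value in rows:
        by_person[person].append((week, value))

    result = {}
    for person, weeks in by_person.items():
        weeks.sort()  # ISO date strings sort chronologically
        hits = [i for i, (_, v) in enumerate(weeks, start=1) if condition(v)]
        if hits:
            result[person] = (hits[0], 1)     # first week the event occurred
        else:
            result[person] = (len(weeks), 0)  # censored at last observed week
    return result


person_level = collapse_to_person_level(panel, lambda v: v > 0)
print(person_level)
```

Person B never meets the condition, so their time is the number of weeks observed and their event flag is 0.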

Here we define the event as a person’s first week with any after-hours collaboration (After_hours_collaboration_hours > 0):

[ ]:
surv_data = vi.create_survival_prep(
    data=pq_data,
    metric="After_hours_collaboration_hours",
    event_condition=lambda x: x > 0,
    hrvar="Organization",
)

surv_data.head(10)
[ ]:
# Event rate and time distribution
n_total = len(surv_data)
event_rate = surv_data["event"].mean() * 100
time_range = surv_data["time"].agg(["min", "max"])

print(f"Total persons:  {n_total}")
print(f"Event rate:     {event_rate:.1f}%")
print(f"Weeks observed: {time_range['min']}–{time_range['max']}")

surv_data.groupby("event").size().rename("n").reset_index().assign(
    pct=lambda d: (d["n"] / d["n"].sum() * 100).round(1),
    label=lambda d: d["event"].map({1: "Had event", 0: "Censored"}),
)

Persons who never had any after-hours work during the observation window appear as censored (event = 0). The survival curve accounts for this: they contribute information up to their last observed week.
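To make the role of censoring concrete, here is a minimal pure-Python sketch of the Kaplan–Meier product-limit estimate. It is not the package's implementation; the function name kaplan_meier and the toy data are illustrative assumptions.

```python
def kaplan_meier(times, events):
    """Return (time, at_risk, events, survival) rows at each distinct event time."""
    at_risk = len(times)
    survival = 1.0
    rows = []
    for t in sorted(set(times)):
        # Events and total departures (events + censorings) at time t.
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)
        if d > 0:
            survival *= 1 - d / at_risk  # product-limit step
            rows.append((t, at_risk, d, survival))
        at_risk -= leaving  # censored persons leave the risk set without a step
    return rows


# Six people: time in weeks, event = 1 (had event) or 0 (censored).
times = [1, 2, 2, 3, 4, 4]
events = [1, 1, 0, 1, 0, 0]

km_rows = kaplan_meier(times, events)
for time, n, d, s in km_rows:
    print(f"week {time}: at_risk={n}, events={d}, survival={s:.3f}")
```

In this toy data half of the six people never had the event, so a naive "share without the event" would be 0.50. The Kaplan–Meier estimate at week 3 is lower (about 0.44) because the person censored at week 2 has already left the risk set by then.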


Step 2 — Plot the Kaplan–Meier curve

Pass the person-level data to create_survival(), specifying the time and event columns and the grouping variable:

[ ]:
vi.create_survival(
    data=surv_data,
    time_col="time",
    event_col="event",
    hrvar="Organization",
    mingroup=5,
)

Reading the chart:

  • Y-axis — probability of not yet having reached the milestone (the “survival” probability, or equivalently, the share yet to convert).

  • X-axis — week number (time since the start of observation).

  • Each step down marks one or more conversions in that group at that week.

  • Curves that drop quickly indicate groups where most people reached the milestone early (fast adoption).

  • Curves that stay high indicate groups where many people had not yet converted by the end of the window.


Overall curve (no grouping)

Set hrvar=None to estimate a single curve across the whole population:

[ ]:
vi.create_survival(
    data=surv_data,
    time_col="time",
    event_col="event",
    hrvar=None,
    mingroup=5,
)

Returning the survival table

Set return_type="table" to get the underlying long-format data frame. Each row represents one event time within one group:

[ ]:
surv_tbl = vi.create_survival(
    data=surv_data,
    time_col="time",
    event_col="event",
    hrvar="Organization",
    mingroup=5,
    return_type="table",
)

surv_tbl.head(12)

Columns:

Column         Description
Organization   Group identifier
time           Week number
survival       Estimated survival probability at that time
at_risk        Persons still in the risk set (event not yet occurred)
events         Events occurring at this time point

Extracting median event times

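The median survival time is the first week at which the estimated survival probability falls to 0.5 or below, that is, the week by which an estimated half of the group has experienced the event: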

[ ]:
# First time per group that survival drops to or below 0.5.
# Sort explicitly so that .first() picks the earliest qualifying week.
median_times = (
    surv_tbl[surv_tbl["survival"] <= 0.5]
    .sort_values(["Organization", "time"])
    .groupby("Organization", as_index=False)
    .first()[["Organization", "time", "survival"]]
    .sort_values("time")
    .rename(columns={"time": "median_time"})
)

median_times

Groups with a smaller median_time value reach the milestone sooner. Groups whose curve never falls to 0.5 within the observation window have no median and do not appear in this table.
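The same lookup can be written so that groups without a median are reported explicitly rather than dropped. A minimal sketch in plain Python, with a hypothetical helper median_time and made-up curves:

```python
def median_time(rows):
    """rows: (time, survival) pairs. Return first time with survival <= 0.5, else None."""
    for time, survival in sorted(rows):
        if survival <= 0.5:
            return time
    return None  # curve never reached 0.5 in the observation window


# Illustrative curves for two hypothetical groups.
curves = {
    "Fast": [(1, 0.70), (2, 0.45), (3, 0.30)],
    "Slow": [(1, 0.95), (2, 0.80), (3, 0.65)],
}

medians = {group: median_time(rows) for group, rows in curves.items()}
print(medians)
```

A None result means that fewer than half of the group had converted by the end of the window, which is itself a useful finding.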


Grouping by a different HR variable

Any categorical (string) column can be used as the grouping variable. Here we compare adoption by LevelDesignation:

[ ]:
surv_level = vi.create_survival_prep(
    data=pq_data,
    metric="After_hours_collaboration_hours",
    event_condition=lambda x: x > 0,
    hrvar="LevelDesignation",
)

vi.create_survival(
    data=surv_level,
    time_col="time",
    event_col="event",
    hrvar="LevelDesignation",
    mingroup=5,
)

Changing the event definition

The event_condition argument in create_survival_prep() accepts any callable that takes a pandas Series and returns a boolean Series. This makes it easy to explore different thresholds without modifying your data.
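When several thresholds are being compared, small factory functions can keep the conditions readable. The helpers exceeds and within below are hypothetical names, not part of the package. They are demonstrated on plain numbers here, and rely on the same comparison operators that pandas applies element-wise to a Series.

```python
def exceeds(threshold):
    """Build a condition that is true where the metric is above `threshold`."""
    return lambda x: x > threshold


def within(low, high):
    """Build a condition that is true where the metric lies in [low, high]."""
    return lambda x: (x >= low) & (x <= high)


over_two_hours = exceeds(2)
moderate_range = within(1, 5)

print(over_two_hours(3), over_two_hours(2))
print(moderate_range(4), moderate_range(6))
```

A condition built this way could then be passed as event_condition in place of an inline lambda.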

Higher after-hours threshold

[ ]:
# Event: first week with more than 2 hours of after-hours collaboration
surv_high = vi.create_survival_prep(
    data=pq_data,
    metric="After_hours_collaboration_hours",
    event_condition=lambda x: x > 2,
    hrvar="Organization",
)

print(f"Event rate (> 2 h after-hours): {surv_high['event'].mean() * 100:.1f}%")

vi.create_survival(
    data=surv_high,
    time_col="time",
    event_col="event",
    hrvar="Organization",
    mingroup=5,
)

Network growth milestone

[ ]:
# Event: first week where internal network size exceeds 10 contacts
surv_net = vi.create_survival_prep(
    data=pq_data,
    metric="Internal_network_size",
    event_condition=lambda x: x > 10,
    hrvar="Organization",
)

print(f"Event rate (network > 10): {surv_net['event'].mean() * 100:.1f}%")

vi.create_survival(
    data=surv_net,
    time_col="time",
    event_col="event",
    hrvar="Organization",
    mingroup=5,
)

Privacy filtering

Groups below mingroup unique persons are removed before the curve is estimated. Increase the threshold to be more conservative:

[ ]:
surv_strict = vi.create_survival(
    data=surv_data,
    time_col="time",
    event_col="event",
    hrvar="Organization",
    mingroup=20,
    return_type="table",
)

# Which groups remained after stricter filtering?
surv_strict[["Organization", "at_risk"]].groupby("Organization")["at_risk"].max().sort_values(ascending=False)
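The filtering rule itself can be sketched independently of the package. The helper filter_small_groups and the records below are hypothetical; the sketch assumes that a group with exactly mingroup unique persons is kept, following the statement that only groups below the threshold are removed.

```python
from collections import defaultdict


def filter_small_groups(records, mingroup):
    """records: (person_id, group) pairs. Keep groups with >= mingroup unique persons."""
    persons = defaultdict(set)
    for person, group in records:
        persons[group].add(person)
    keep = {g for g, members in persons.items() if len(members) >= mingroup}
    return [(p, g) for p, g in records if g in keep]


# Illustrative person-level records.
records = [
    ("p1", "Sales"), ("p2", "Sales"), ("p3", "Sales"),
    ("p4", "Legal"), ("p5", "Legal"),
]

kept = filter_small_groups(records, mingroup=3)
print(sorted({group for _, group in kept}))
```

With mingroup=3 only the three-person group survives the filter; the two-person group is dropped before any curve would be estimated.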

Using return_type to retrieve the table or figure

create_survival() accepts a return_type argument:

  • return_type="table" — returns the long-format Kaplan–Meier table so you can inspect survival probabilities directly.

  • return_type="plot" (default) — returns the matplotlib Figure for further customisation (e.g., adding reference lines).

There is no need to call create_survival_calc() or create_survival_viz() directly in typical usage.

[ ]:
survival_long = vi.create_survival(
    data=surv_data,
    time_col="time",
    event_col="event",
    hrvar="Organization",
    mingroup=5,
    return_type="table",
)

# Inspect group sizes (max at_risk per group = initial size after filtering)
counts = survival_long.groupby("Organization")["at_risk"].max()
print("Group sizes after privacy filtering:")
print(counts)

survival_long.head(10)
[ ]:
import matplotlib.pyplot as plt

fig = vi.create_survival(
    data=surv_data,
    time_col="time",
    event_col="event",
    hrvar="Organization",
    mingroup=5,
    title="Time to first after-hours week",
)

# Add a 50% threshold line to read off median times
ax = fig.get_axes()[0]
ax.axhline(y=0.5, linestyle="--", color="grey", linewidth=1, alpha=0.7)
ax.text(0.5, 0.52, "50% threshold", fontsize=8, color="grey")

plt.show()