{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Survival Curves with `create_survival()`\n", "\n", "`create_survival()` produces a **Kaplan–Meier survival curve** — a non-parametric estimate of how quickly a specific event occurs across a population. In a workforce analytics context, typical events include:\n", "\n", "- First week of after-hours collaboration\n", "- First week a collaboration or network metric crosses a threshold\n", "- First observed use of a new tool\n", "\n", "Four exported functions cover the full workflow:\n", "\n", "| Function | What it does |\n", "|---|---|\n", "| `create_survival_prep()` | Converts panel Person Query data into person-level survival format |\n", "| `create_survival_calc()` | Computes Kaplan–Meier curves per group |\n", "| `create_survival_viz()` | Renders step-function curves from a pre-computed table |\n", "| `create_survival()` | End-to-end wrapper (calc + viz) |\n", "\n", "### Reframing \"survival\" for workforce contexts\n", "\n", "In classical survival analysis the *event* is typically something negative — death, equipment failure — and the y-axis probability represents \"still alive\". In workforce analytics the event is usually **a positive milestone**: first adoption of a tool, first week as a power user, first time a metric crosses a meaningful threshold.\n", "\n", "The terminology inverts: \"surviving\" means *not yet having reached the milestone*, and the event means *success — the person converted*. It is often more intuitive to read the chart as a **time-to-adoption curve** or a **conversion curve**:\n", "\n", "- **A curve that drops steeply early** → most people reached the milestone quickly.\n", "- **A curve that stays high** → many people had not yet converted by the end of the observation window.\n", "- **The y-axis** → the share of people who have *not yet* experienced the milestone." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "import vivainsights as vi\n", "\n", "pq_data = vi.load_pq_data()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Step 1 — Prepare person-level survival data\n", "\n", "`pq_data` is a panel dataset with one row per person per week. `create_survival_prep()` collapses it to one row per person, recording:\n", "\n", "- **`time`**: the week number at which the event first occurred, or the total number of observed weeks if the event never occurred (censored).\n", "- **`event`**: 1 if the condition was met in at least one week, 0 otherwise.\n", "\n", "Here we define the event as a person's **first week with any after-hours collaboration** (`After_hours_collaboration_hours > 0`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "surv_data = vi.create_survival_prep(\n", " data=pq_data,\n", " metric=\"After_hours_collaboration_hours\",\n", " event_condition=lambda x: x > 0,\n", " hrvar=\"Organization\",\n", ")\n", "\n", "surv_data.head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Event rate and time distribution\n", "n_total = len(surv_data)\n", "event_rate = surv_data[\"event\"].mean() * 100\n", "time_range = surv_data[\"time\"].agg([\"min\", \"max\"])\n", "\n", "print(f\"Total persons: {n_total}\")\n", "print(f\"Event rate: {event_rate:.1f}%\")\n", "print(f\"Weeks observed: {time_range['min']} – {time_range['max']}\")\n", "\n", "surv_data.groupby(\"event\").size().rename(\"n\").reset_index().assign(\n", " pct=lambda d: (d[\"n\"] / d[\"n\"].sum() * 100).round(1),\n", " label=lambda d: d[\"event\"].map({1: \"Had event\", 0: \"Censored\"}),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Persons who never had any after-hours collaboration during the observation window 
appear as censored (`event = 0`). The survival curve accounts for this: they contribute information up to their last observed week.\n", "\n", "---\n", "\n", "## Step 2 — Plot the Kaplan–Meier curve\n", "\n", "Pass the person-level data to `create_survival()`, specifying the time and event columns and the grouping variable:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vi.create_survival(\n", " data=surv_data,\n", " time_col=\"time\",\n", " event_col=\"event\",\n", " hrvar=\"Organization\",\n", " mingroup=5,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Reading the chart:**\n", "\n", "- **Y-axis** — probability of *not yet* having reached the milestone (the \"survival\" probability, or equivalently, the share yet to convert).\n", "- **X-axis** — week number (time since the start of observation).\n", "- **Each step down** marks one or more conversions in that group at that week.\n", "- **Curves that drop quickly** indicate groups where most people reached the milestone early (fast adoption).\n", "- **Curves that stay high** indicate groups where many people had not yet converted by the end of the window.\n", "\n", "---\n", "\n", "## Overall curve (no grouping)\n", "\n", "Set `hrvar=None` to estimate a single curve across the whole population:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vi.create_survival(\n", " data=surv_data,\n", " time_col=\"time\",\n", " event_col=\"event\",\n", " hrvar=None,\n", " mingroup=5,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Returning the survival table\n", "\n", "Set `return_type=\"table\"` to get the underlying long-format data frame. 
Each row represents one event time within one group:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "surv_tbl = vi.create_survival(\n", " data=surv_data,\n", " time_col=\"time\",\n", " event_col=\"event\",\n", " hrvar=\"Organization\",\n", " mingroup=5,\n", " return_type=\"table\",\n", ")\n", "\n", "surv_tbl.head(12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Columns:\n", "\n", "| Column | Description |\n", "|---|---|\n", "| `Organization` | Group identifier |\n", "| `time` | Week number |\n", "| `survival` | Estimated survival probability at that time |\n", "| `at_risk` | Persons still in the risk set (event not yet occurred) |\n", "| `events` | Events occurring at this time point |\n", "\n", "### Extracting median event times\n", "\n", "The **median survival time** is the first week at which the estimated survival probability falls to 0.5 or below, that is, the week by which at least half of the group has experienced the event. Groups whose curve never reaches 0.5 within the observation window have no median and are absent from the result:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First time per group that survival drops to or below 0.5\n", "# (sort by time first so that .first() picks the earliest qualifying week)\n", "median_times = (\n", " surv_tbl[surv_tbl[\"survival\"] <= 0.5]\n", " .sort_values([\"Organization\", \"time\"])\n", " .groupby(\"Organization\", as_index=False)\n", " .first()[[\"Organization\", \"time\", \"survival\"]]\n", " .sort_values(\"time\")\n", " .rename(columns={\"time\": \"median_time\"})\n", ")\n", "\n", "median_times" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Groups with a smaller `median_time` reach the milestone sooner: half of their members had experienced the event by that week.\n", "\n", "---\n", "\n", "## Grouping by a different HR variable\n", "\n", "Any categorical HR attribute column can be used as the grouping variable. 
Here we compare adoption by `LevelDesignation`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "surv_level = vi.create_survival_prep(\n", " data=pq_data,\n", " metric=\"After_hours_collaboration_hours\",\n", " event_condition=lambda x: x > 0,\n", " hrvar=\"LevelDesignation\",\n", ")\n", "\n", "vi.create_survival(\n", " data=surv_level,\n", " time_col=\"time\",\n", " event_col=\"event\",\n", " hrvar=\"LevelDesignation\",\n", " mingroup=5,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Changing the event definition\n", "\n", "The `event_condition` argument in `create_survival_prep()` accepts any callable that takes a pandas Series and returns a boolean Series. This makes it easy to explore different thresholds without modifying your data.\n", "\n", "### Higher after-hours threshold" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Event: first week with more than 2 hours of after-hours collaboration\n", "surv_high = vi.create_survival_prep(\n", " data=pq_data,\n", " metric=\"After_hours_collaboration_hours\",\n", " event_condition=lambda x: x > 2,\n", " hrvar=\"Organization\",\n", ")\n", "\n", "print(f\"Event rate (> 2 h after-hours): {surv_high['event'].mean() * 100:.1f}%\")\n", "\n", "vi.create_survival(\n", " data=surv_high,\n", " time_col=\"time\",\n", " event_col=\"event\",\n", " hrvar=\"Organization\",\n", " mingroup=5,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Network growth milestone" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Event: first week where internal network size exceeds 10 contacts\n", "surv_net = vi.create_survival_prep(\n", " data=pq_data,\n", " metric=\"Internal_network_size\",\n", " event_condition=lambda x: x > 10,\n", " hrvar=\"Organization\",\n", ")\n", "\n", "print(f\"Event rate (network > 10): 
{surv_net['event'].mean() * 100:.1f}%\")\n", "\n", "vi.create_survival(\n", " data=surv_net,\n", " time_col=\"time\",\n", " event_col=\"event\",\n", " hrvar=\"Organization\",\n", " mingroup=5,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Privacy filtering\n", "\n", "Groups with fewer than `mingroup` unique persons are removed before the curves are estimated. Raise the threshold for stricter privacy protection:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "surv_strict = vi.create_survival(\n", " data=surv_data,\n", " time_col=\"time\",\n", " event_col=\"event\",\n", " hrvar=\"Organization\",\n", " mingroup=20,\n", " return_type=\"table\",\n", ")\n", "\n", "# Which groups remained after stricter filtering?\n", "surv_strict[[\"Organization\", \"at_risk\"]].groupby(\"Organization\")[\"at_risk\"].max().sort_values(ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n\n## Using `return_type` to retrieve the table or figure\n\n`create_survival()` accepts a `return_type` argument:\n\n- `return_type=\"table\"` — returns the long-format Kaplan–Meier table so you can inspect survival probabilities directly.\n- `return_type=\"plot\"` (default) — returns the matplotlib Figure for further customisation (e.g., adding reference lines).\n\nIn typical usage there is no need to call `create_survival_calc()` or `create_survival_viz()` directly." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "survival_long = vi.create_survival(\n data=surv_data,\n time_col=\"time\",\n event_col=\"event\",\n hrvar=\"Organization\",\n mingroup=5,\n return_type=\"table\",\n)\n\n# Inspect group sizes (max at_risk per group = initial size after filtering)\ncounts = survival_long.groupby(\"Organization\")[\"at_risk\"].max()\nprint(\"Group sizes after privacy filtering:\")\nprint(counts)\n\nsurvival_long.head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nfig = vi.create_survival(\n data=surv_data,\n time_col=\"time\",\n event_col=\"event\",\n hrvar=\"Organization\",\n mingroup=5,\n title=\"Time to first after-hours week\",\n)\n\n# Add a 50% threshold line to read off median times\nax = fig.get_axes()[0]\nax.axhline(y=0.5, linestyle=\"--\", color=\"grey\", linewidth=1, alpha=0.7)\nax.text(0.5, 0.52, \"50% threshold\", fontsize=8, color=\"grey\")\n\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.13.0" }, "nbsphinx": { "execute": "never" } }, "nbformat": 4, "nbformat_minor": 5 }