{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Survival Curves with `create_survival()`\n",
    "\n",
    "`create_survival()` produces a **Kaplan–Meier survival curve** — a non-parametric estimate of how quickly a specific event occurs across a population. In a workforce analytics context typical events include:\n",
    "\n",
    "- First week of after-hours collaboration\n",
    "- First week a collaboration or network metric crosses a threshold\n",
    "- First observed use of a new tool\n",
    "\n",
    "Three exported functions cover the full workflow:\n",
    "\n",
    "| Function | What it does |\n",
    "|---|---|\n",
    "| `create_survival_prep()` | Converts panel Person Query data into person-level survival format |\n",
    "| `create_survival_calc()` | Computes Kaplan–Meier curves per group |\n",
    "| `create_survival_viz()` | Renders step-function curves from a pre-computed table |\n",
    "| `create_survival()` | End-to-end wrapper (calc + viz) |\n",
    "\n",
    "### Reframing \"survival\" for workforce contexts\n",
    "\n",
    "In classical survival analysis the *event* is typically something negative — death, equipment failure — and the y-axis probability represents \"still alive\". In workforce analytics the event is usually **a positive milestone**: first adoption of a tool, first week as a power user, first time a metric crosses a meaningful threshold.\n",
    "\n",
    "The terminology inverts: \"surviving\" means *not yet having reached the milestone*, and the event means *success — the person converted*. It is often more intuitive to read the chart as a **time-to-adoption curve** or a **conversion curve**:\n",
    "\n",
    "- **A curve that drops steeply early** → most people reached the milestone quickly.\n",
    "- **A curve that stays high** → many people had not yet converted by the end of the observation window.\n",
    "- **The y-axis** → the share of people who have *not yet* experienced the milestone."
   ]
  },
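  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the curve described above is the textbook Kaplan–Meier (product-limit) estimator. The notation below is the standard one rather than anything taken from the `vivainsights` source, so treat it as a sketch of what the curve represents, not a statement of the package's exact implementation. Let $t_i$ be the distinct weeks in which at least one event occurs, $d_i$ the number of events in week $t_i$, and $n_i$ the number of persons still at risk just before $t_i$:\n",
    "\n",
    "$$\n",
    "\\hat{S}(t) = \\prod_{t_i \\le t} \\left(1 - \\frac{d_i}{n_i}\\right)\n",
    "$$\n",
    "\n",
    "As a made-up illustration (not taken from `pq_data`): with 10 persons at risk and 2 events in week 1, then 8 at risk and 2 events in week 2,\n",
    "\n",
    "$$\n",
    "\\hat{S}(1) = 1 - \\frac{2}{10} = 0.8, \\qquad \\hat{S}(2) = 0.8 \\times \\left(1 - \\frac{2}{8}\\right) = 0.6\n",
    "$$\n",
    "\n",
    "Censored persons reduce $n_i$ in later weeks without adding to $d_i$, which is how they contribute information only up to their last observed week."
   ]
  },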
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "import vivainsights as vi\n",
    "\n",
    "pq_data = vi.load_pq_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Step 1 — Prepare person-level survival data\n",
    "\n",
    "`pq_data` is a panel dataset with one row per person per week. `create_survival_prep()` collapses it to one row per person, recording:\n",
    "\n",
    "- **`time`**: the week number at which the event first occurred, or the total number of observed weeks if the event never occurred (censored).\n",
    "- **`event`**: 1 if the condition was met in at least one week, 0 otherwise.\n",
    "\n",
    "Here we define the event as a person's **first week with any after-hours collaboration** (`After_hours_collaboration_hours > 0`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "surv_data = vi.create_survival_prep(\n",
    "    data=pq_data,\n",
    "    metric=\"After_hours_collaboration_hours\",\n",
    "    event_condition=lambda x: x > 0,\n",
    "    hrvar=\"Organization\",\n",
    ")\n",
    "\n",
    "surv_data.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Event rate and time distribution\n",
    "n_total = len(surv_data)\n",
    "event_rate = surv_data[\"event\"].mean() * 100\n",
    "time_range = surv_data[\"time\"].agg([\"min\", \"max\"])\n",
    "\n",
    "print(f\"Total persons:  {n_total}\")\n",
    "print(f\"Event rate:     {event_rate:.1f}%\")\n",
    "print(f\"Weeks observed: {time_range['min']} – {time_range['max']}\")\n",
    "\n",
    "surv_data.groupby(\"event\").size().rename(\"n\").reset_index().assign(\n",
    "    pct=lambda d: (d[\"n\"] / d[\"n\"].sum() * 100).round(1),\n",
    "    label=lambda d: d[\"event\"].map({1: \"Had event\", 0: \"Censored\"}),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Persons who never had any after-hours work during the observation window appear as censored (`event = 0`). The survival curve accounts for this: they contribute information up to their last observed week.\n",
    "\n",
    "---\n",
    "\n",
    "## Step 2 — Plot the Kaplan–Meier curve\n",
    "\n",
    "Pass the person-level data to `create_survival()`, specifying the time and event columns and the grouping variable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vi.create_survival(\n",
    "    data=surv_data,\n",
    "    time_col=\"time\",\n",
    "    event_col=\"event\",\n",
    "    hrvar=\"Organization\",\n",
    "    mingroup=5,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Reading the chart:**\n",
    "\n",
    "- **Y-axis** — probability of *not yet* having reached the milestone (the \"survival\" probability, or equivalently, the share yet to convert).\n",
    "- **X-axis** — week number (time since the start of observation).\n",
    "- **Each step down** marks one or more conversions in that group at that week.\n",
    "- **Curves that drop quickly** indicate groups where most people reached the milestone early (fast adoption).\n",
    "- **Curves that stay high** indicate groups where many people had not yet converted by the end of the window.\n",
    "\n",
    "---\n",
    "\n",
    "## Overall curve (no grouping)\n",
    "\n",
    "Set `hrvar=None` to estimate a single curve across the whole population:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vi.create_survival(\n",
    "    data=surv_data,\n",
    "    time_col=\"time\",\n",
    "    event_col=\"event\",\n",
    "    hrvar=None,\n",
    "    mingroup=5,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Returning the survival table\n",
    "\n",
    "Set `return_type=\"table\"` to get the underlying long-format data frame. Each row represents one event time within one group:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "surv_tbl = vi.create_survival(\n",
    "    data=surv_data,\n",
    "    time_col=\"time\",\n",
    "    event_col=\"event\",\n",
    "    hrvar=\"Organization\",\n",
    "    mingroup=5,\n",
    "    return_type=\"table\",\n",
    ")\n",
    "\n",
    "surv_tbl.head(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Columns:\n",
    "\n",
    "| Column | Description |\n",
    "|---|---|\n",
    "| `Organization` | Group identifier |\n",
    "| `time` | Week number |\n",
    "| `survival` | Estimated survival probability at that time |\n",
    "| `at_risk` | Persons still in the risk set (event not yet occurred) |\n",
    "| `events` | Events occurring at this time point |\n",
    "\n",
    "### Extracting median event times\n",
    "\n",
    "The **median survival time** is the week at which 50% of the group has experienced the event:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First time per group that survival drops to or below 0.5\n",
    "median_times = (\n",
    "    surv_tbl[surv_tbl[\"survival\"] <= 0.5]\n",
    "    .groupby(\"Organization\", as_index=False)\n",
    "    .first()[[\"Organization\", \"time\", \"survival\"]]\n",
    "    .sort_values(\"time\")\n",
    "    .rename(columns={\"time\": \"median_time\"})\n",
    ")\n",
    "\n",
    "median_times"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Groups with a smaller `median_time` value reach the event threshold faster on average.\n",
    "\n",
    "---\n",
    "\n",
    "## Grouping by a different HR variable\n",
    "\n",
    "Any character column can be used as the grouping variable. Here we compare adoption by `LevelDesignation`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "surv_level = vi.create_survival_prep(\n",
    "    data=pq_data,\n",
    "    metric=\"After_hours_collaboration_hours\",\n",
    "    event_condition=lambda x: x > 0,\n",
    "    hrvar=\"LevelDesignation\",\n",
    ")\n",
    "\n",
    "vi.create_survival(\n",
    "    data=surv_level,\n",
    "    time_col=\"time\",\n",
    "    event_col=\"event\",\n",
    "    hrvar=\"LevelDesignation\",\n",
    "    mingroup=5,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Changing the event definition\n",
    "\n",
    "The `event_condition` argument in `create_survival_prep()` accepts any callable that takes a pandas Series and returns a boolean Series. This makes it easy to explore different thresholds without modifying your data.\n",
    "\n",
    "### Higher after-hours threshold"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Event: first week with more than 2 hours of after-hours collaboration\n",
    "surv_high = vi.create_survival_prep(\n",
    "    data=pq_data,\n",
    "    metric=\"After_hours_collaboration_hours\",\n",
    "    event_condition=lambda x: x > 2,\n",
    "    hrvar=\"Organization\",\n",
    ")\n",
    "\n",
    "print(f\"Event rate (> 2 h after-hours): {surv_high['event'].mean() * 100:.1f}%\")\n",
    "\n",
    "vi.create_survival(\n",
    "    data=surv_high,\n",
    "    time_col=\"time\",\n",
    "    event_col=\"event\",\n",
    "    hrvar=\"Organization\",\n",
    "    mingroup=5,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Network growth milestone"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Event: first week where internal network size exceeds 10 contacts\n",
    "surv_net = vi.create_survival_prep(\n",
    "    data=pq_data,\n",
    "    metric=\"Internal_network_size\",\n",
    "    event_condition=lambda x: x > 10,\n",
    "    hrvar=\"Organization\",\n",
    ")\n",
    "\n",
    "print(f\"Event rate (network > 10): {surv_net['event'].mean() * 100:.1f}%\")\n",
    "\n",
    "vi.create_survival(\n",
    "    data=surv_net,\n",
    "    time_col=\"time\",\n",
    "    event_col=\"event\",\n",
    "    hrvar=\"Organization\",\n",
    "    mingroup=5,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Privacy filtering\n",
    "\n",
    "Groups below `mingroup` unique persons are removed before the curve is estimated. Increase the threshold to be more conservative:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "surv_strict = vi.create_survival(\n",
    "    data=surv_data,\n",
    "    time_col=\"time\",\n",
    "    event_col=\"event\",\n",
    "    hrvar=\"Organization\",\n",
    "    mingroup=20,\n",
    "    return_type=\"table\",\n",
    ")\n",
    "\n",
    "# Which groups remained after stricter filtering?\n",
    "surv_strict[[\"Organization\", \"at_risk\"]].groupby(\"Organization\")[\"at_risk\"].max().sort_values(ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n\n## Using `return_type` to retrieve the table or figure\n\n`create_survival()` accepts a `return_type` argument:\n\n- `return_type=\"table\"` — returns the long-format Kaplan–Meier table so you can inspect survival probabilities directly.\n- `return_type=\"plot\"` (default) — returns the matplotlib Figure for further customisation (e.g., adding reference lines).\n\nThere is no need to call `create_survival_calc` or `create_survival_viz` directly in typical usage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "survival_long = vi.create_survival(\n    data=surv_data,\n    time_col=\"time\",\n    event_col=\"event\",\n    hrvar=\"Organization\",\n    mingroup=5,\n    return_type=\"table\",\n)\n\n# Inspect group sizes (max at_risk per group = initial size after filtering)\ncounts = survival_long.groupby(\"Organization\")[\"at_risk\"].max()\nprint(\"Group sizes after privacy filtering:\")\nprint(counts)\n\nsurvival_long.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n\nfig = vi.create_survival(\n    data=surv_data,\n    time_col=\"time\",\n    event_col=\"event\",\n    hrvar=\"Organization\",\n    mingroup=5,\n    title=\"Time to first after-hours week\",\n)\n\n# Add a 50% threshold line to read off median times\nax = fig.get_axes()[0]\nax.axhline(y=0.5, linestyle=\"--\", color=\"grey\", linewidth=1, alpha=0.7)\nax.text(0.5, 0.52, \"50% threshold\", fontsize=8, color=\"grey\")\n\nplt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.13.0"
  },
  "nbsphinx": {
   "execute": "never"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}