Repository Cohorts

What is this page?

This page is a very brief demonstration of the concept of repository cohorts. It is designed to act as a companion to a talk given at Open Source Summit North America 2024 titled "Repository Cohorts: How OSPOs Can Programmatically Categorize All Their Repositories". The presentation slides are online.

What to know about the data being shown here?

Data on this page is loaded from a CSV file and has not been updated since 2024-03-19. It is a fixed snapshot in time that is now out of date.

It shows data from real repositories. These repositories are all from GitHub organizations that Microsoft either owns or has a hand in governing, and therefore collects data on. All the data shown is public data that we collect via the GitHub API, just as anyone else could.

Contributing (or adapting for your own purposes)

This demo is built from code in the repository_cohorts/framework directory of the https://github.com/microsoft/OSPO repository.

To see the code, look at the cohort.js file and index.md file in the repository that builds this page. The open source Observable Framework is used to generate the data visualization static site.

How each repository cohort is defined is described in the const jsonThatDescribesCohortsToCreate. For each repository cohort, that JSON data structure specifies which function is used to create it and all the required function arguments. If you have a repository cohort of your own making that relies on standard GitHub metadata, please share it with others by submitting a pull request.
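To make that more concrete, here is a minimal sketch of the kind of structure jsonThatDescribesCohortsToCreate holds. The cohort names, helper function names, and thresholds below are illustrative placeholders, not the actual contents of cohort.js.

```js
// Illustrative sketch only; the real definitions live in cohort.js.
// Helper function names and thresholds here are hypothetical.
const jsonThatDescribesCohortsToCreate = [
  {
    cohortColumnName: "cohortHighlyForked",   // boolean column added to every row
    functionName: "isGreaterThanThreshold",   // which helper builds the column
    functionArguments: { column: "forks_count", threshold: 100 }
  },
  {
    cohortColumnName: "cohortSample",
    functionName: "textColumnsContainAnyTerm",
    functionArguments: {
      columns: ["full_name", "description"],
      terms: ["sample", "demo", "example", "tutorial"]
    }
  }
];
```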

Data transformation steps

Internally, Microsoft uses Kusto Query Language to generate the cohort columns. The logic used on this page is identical, but implemented in JavaScript.

The initial data processing steps are: load the CSV file, rename several of the columns as needed, then calculate additional columns, including the cohort columns. The data is then ready for further analysis to generate insights.
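As a rough illustration of those steps, the sketch below loads a CSV, renames columns with a lookup map, and appends a boolean cohort column. It uses d3-dsv for parsing; the file name, original column headers, and the fork threshold are hypothetical, not the ones used by this page.

```js
import { readFileSync } from "node:fs";
import { csvParse } from "d3-dsv";

// Step 1: load the CSV snapshot (hypothetical file name).
const csvText = readFileSync("repos_snapshot.csv", "utf8");

// Step 2: rename keys to ecosyste.ms style names (hypothetical original headers).
const renameMap = { Stars: "stargazers_count", Forks: "forks_count" };
function renameKeys(row) {
  const out = {};
  for (const [key, value] of Object.entries(row)) {
    out[renameMap[key] ?? key] = value;
  }
  return out;
}

// Step 3: append calculated columns, including boolean cohort columns.
function addCohortColumns(row) {
  return { ...row, cohortHighlyForked: Number(row.forks_count) > 100 };
}

const repos = csvParse(csvText).map(renameKeys).map(addCohortColumns);
```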

Unprocessed raw data

Keys are renamed to mirror ecosyste.ms key names so this example code can be reused more easily.

The renamed key mapping:

The data table with renamed keys:

Calculated columns are created, including cohort columns. These are added to the existing columns.

Table showing all repositories with calculated cohort columns.

Scroll to the far right to see the cohort columns.


What problems do repository cohorts solve?

No one wants to read 3,000 READMEs

OSPOs often have to manage, enforce compliance for, and recommend best practices for many hundreds or thousands of repositories. This page shows over 10,000 repositories. That is far too many to read, so there is a strong tendency for OSPOs to treat every repository the same and make policy and best practice recommendations for an average repository, or more accurately, an average top-of-mind repository.

There are many situations where this pattern is less than ideal. Many times it would be useful to know the distribution of different types of repositories, make different best practice recommendations based on repository characteristics, or know more about the communities building open source that will be impacted by a potential future policy change in order to design effective communication and execution measures.

Metadata but make it more easily reusable

Metadata reduces the need to read thousands of repositories by letting OSPOs understand repositories according to their easily measurable characteristics. Examples include the size of the contributing community, amount of recent activity, whether the repository is a sample based on key terms in repository name or description, count of stars, count of forks, presence of key files like CONTRIBUTING.md, etc.

Working with raw repository metadata fields, however, requires thought about how to combine raw metadata, where to set thresholds, and so on. For example, how many forks is a high number of forks? If this is done again and again for each potential use case of the metadata, it imposes time and cognitive burdens that limit how often the metadata is used to make data-informed decisions. Additionally, small differences in where to set a cutoff for categories such as highly forked versus normally forked repositories can make it inefficient to apply the learnings from one project to another.

Repository cohorts attempt to solve these problems by acting as standardized labels for repositories. They have meanings that are easy to remember and can be reused with very little effort because they become additional columns in the repository metadata table.

Repository cohort structure

Repository cohorts are either true or false for each repository. There can be groups of cohorts that split a dimension. For example, there can be cohorts of repository age: baby, toddler, teenager, adult, and senior. Each repository, or row in the table, will be true for exactly one of these cohorts and false for all the other cohorts in the group.

This makes filtering on cohorts or combining cohorts cognitively easier than working with raw metadata: it becomes a matter of combining them with AND or OR statements, which are both easier to think about and easier to remember than cutoffs in raw metadata values.
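For instance, here is a hedged sketch of what a mutually exclusive cohort group and a cohort-based filter might look like; the age boundaries, column names, and sample row are illustrative, not the ones used on this page.

```js
// Exactly one of the five age cohort columns is true per repository.
// Boundaries below are made up for illustration.
function addAgeCohorts(row) {
  const ageInDays = (Date.now() - new Date(row.created_at)) / 86_400_000;
  return {
    ...row,
    cohortAgeBaby: ageInDays < 90,
    cohortAgeToddler: ageInDays >= 90 && ageInDays < 365,
    cohortAgeTeenager: ageInDays >= 365 && ageInDays < 3 * 365,
    cohortAgeAdult: ageInDays >= 3 * 365 && ageInDays < 8 * 365,
    cohortAgeSenior: ageInDays >= 8 * 365
  };
}

// Filtering becomes plain AND / OR logic instead of remembering raw cutoffs.
const repos = [
  { full_name: "contoso/widget", created_at: "2023-11-01", cohortHighlyForked: true }
].map(addAgeCohorts);

const youngAndPopular = repos.filter(
  (r) => (r.cohortAgeBaby || r.cohortAgeToddler) && r.cohortHighlyForked
);
```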

Benefits of repository cohorts

The characteristics of repository cohorts reduce the complexity, time, and cognitive burden of using metadata to analyze large numbers of repositories. By reducing these burdens, they make it more likely that data-driven insights will be leveraged in OSPO operations and more likely that OSPOs can deliver fit-for-purpose guidance and compliance experiences rather than everything being one-size-fits-all.

What is shown in this demo is a snapshot of a few cohorts based on easily collected metadata that everyone will have. Other cohorts possible with additional data include whether a repository builds a package, whether it uses GitHub Actions, cohorts based on the ratio of company contributors to members of the public, and so on.


Initial visualizations of main repository cohorts

Note that the tooltips have a limit on how many repositories they will show information for.

What is a Nadia Cohort?

The Nadia cohort group attempts to capture a description of community user / contributor patterns that reveal something about the structure of the community that builds a repository. This idea comes from Nadia Asparouhova's book "Working in Public: The Making and Maintenance of Open Source Software" (Stripe Press, 2020, pp. 59-65), which categorizes open source project communities as federations, clubs, stadiums, or toys. No exact metrics are given for defining the boundaries; rather, they are described using a matrix of high or low contributor growth and high or low user growth.

|                         | HIGH USER GROWTH | LOW USER GROWTH |
| ----------------------- | ---------------- | --------------- |
| HIGH CONTRIBUTOR GROWTH | Federations      | Clubs           |
| LOW CONTRIBUTOR GROWTH  | Stadiums         | Toys            |

For repository cohorts, we have taken these ideas and modified them to work with easily available repository metadata. We have also added a category of "middle" or "mid" repositories that are not quite one of those four but somewhere in between them. There is also a "missing data" Nadia cohort for when we lack the repository metadata needed to calculate a Nadia cohort.

Metadata thresholds for Nadia community cohorts

The metadata thresholds used for what we are calling Nadia cohorts are based on a combination of community size, stargazer count, and the ratio of stargazers to committers.

|                        | ratioStargazersVsCommitters > 2 | ratioStargazersVsCommitters < 2 |
| ---------------------- | ------------------------------- | ------------------------------- |
| > 60 contributors      | Federation cohort               | Club cohort                     |
| 6 < contributors < 60  | Mid cohort                      | Mid cohort                      |

|                  | stargazers_count > 100 | stargazers_count < 100 |
| ---------------- | ---------------------- | ---------------------- |
| < 6 contributors | Stadium cohort         | Toy cohort             |
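Translated directly into code, the classification might look like the sketch below. The column names (committers_count, stargazers_count) and the handling of the exact boundary values (6, 60, 100, and the ratio of 2) are assumptions on our part, not copied from cohort.js.

```js
// Assumed column names and boundary handling; see the threshold tables above.
function nadiaCohort(repo) {
  const committers = repo.committers_count;
  const stargazers = repo.stargazers_count;
  if (committers == null || stargazers == null) return "missing data";
  if (committers > 60) {
    const ratioStargazersVsCommitters = stargazers / committers;
    return ratioStargazersVsCommitters > 2 ? "federation" : "club";
  }
  if (committers >= 6) return "mid"; // the "middle" cohort between the four
  return stargazers > 100 ? "stadium" : "toy";
}
```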

These thresholds seem to work well for Microsoft repositories. We would be interested in learning what other organizations use if they model similar cohorts, so please add an issue with any feedback or comments.

To describe these cohorts in plain English:

  1. Federations are built by a large community with a much larger silent group of watchers.
  2. Clubs are built by a large community where the number of silent watchers is closer to the size of the contributor community.
  3. Stadiums are built by a small community with a larger proportion of silent watchers.
  4. Toys are built by a small community with a small number of silent watchers.
  5. Mid repositories are built by a moderately sized community that sits between the other cohorts in terms of community size.

Note how it is cognitively easier to reason about the data in the chart where age is shown as a cohort versus a continuous value. It would also be easier to describe to others.


These user inputs are used in a SQL query below and filter ALL tables and plots below
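As a hedged sketch, inputs like these are typically declared in an Observable Framework page with view() and Observable Inputs; the labels, ranges, and defaults below are illustrative, and only the variable names are taken from the SQL query shown further down.

```js
// Illustrative only; the real ranges, labels, and defaults on this page may differ.
const sizeMin = view(Inputs.range([0, 1_000_000], { label: "Minimum repo size", step: 1, value: 0 }));
const stargazer_count_min = view(Inputs.range([0, 10_000], { label: "Minimum stargazers", step: 1, value: 0 }));
const archived = view(Inputs.select([false, true], { label: "Archived" }));
```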

Most repositories are in the 'microsoft', 'Azure', or 'Azure-Samples' GitHub organizations.

The SQL command used to create the table and visualizations below based on the user-selected filters:


SELECT * FROM reposSQL
WHERE size > ${sizeMin}
  AND size < ${sizeMax}
  AND stargazers_count > ${stargazer_count_min}
  AND daysSinceUpdated < ${max_days_since_updated}
  AND daysSinceUpdated > ${min_days_since_updated}
  AND Archived == ${archived}
LIMIT ${limitNumberRowsToShow}

Number of repositories selected:

Looking only at repositories in the Nadia federation cohort, if they exist

Looking only at repositories in the sample cohort, if they exist

This cohort is made of repositories with "sample", "demo", "example", or "tutorial" in either the organization name, repository name, or repository description. It is a simplistic way to identify repositories that are created mostly for the purpose of being samples rather than to be a website or a package. It will contain some false positives and miss some true samples.
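A minimal sketch of that keyword check is below; the exact fields searched and the renamed key names (owner, full_name, description) are assumptions.

```js
const sampleTerms = ["sample", "demo", "example", "tutorial"];

// True when any term appears in the organization name, repository name, or description.
function isSampleCohort(repo) {
  const haystack = [repo.owner, repo.full_name, repo.description]
    .filter(Boolean)
    .join(" ")
    .toLowerCase();
  return sampleTerms.some((term) => haystack.includes(term));
}

// e.g. isSampleCohort({ owner: "contoso", full_name: "contoso/widget-samples", description: null }) // true
```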

Number of repositories in the sample cohort:

The distribution of Nadia community types in repositories that are in the sample cohort is surprisingly similar to the distribution of Nadia community types when not filtering to sample repositories.

Epilogue

Small win

If you've reached the end and you're left thinking "Wait, that isn't that complicated and it doesn't offer anything radically different", you're right. The benefit of the repository cohorts concept isn't that it is extremely novel, uses a sophisticated algorithm, or predicts something extremely hard to figure out. It is simply that it makes it a little easier for an OSPO to leverage metadata for typical Open Source Program Office tasks and more likely that work is sharable and reusable.

On metadata collection

The complicated part of all this is often collecting metadata and keeping it up to date. Microsoft has a centralized team responsible for collecting accurate metadata across all of its code platforms. This isn't possible for every organization. An alternative worth considering is the Centers for Medicare and Medicaid Services OSPO's approach using Augur and GitHub Actions, as seen in their metrics repository, which builds their metrics site.