Tutorial - NER Statistics

Getting statistics about your NER data is extremely helpful throughout the annotation process. It helps you ensure that you're spending time on the right annotations, that you have enough examples for each type, and that you have enough examples with NO ENTITIES at all (this is often overlooked but VERY important for building a model that generalizes well).

Once you have your data loaded, either on its own as a list of Examples or as part of a Dataset, you can easily get statistics using the stats.get_ner_stats function.

The stats.get_ner_stats function expects a List[Example] as its input parameter and returns a serializable response with information about your data. Let's see how this works on the provided example data.

Example

Create a file main.py with:

from pathlib import Path

import typer
from recon.loaders import read_jsonl
from recon.stats import get_ner_stats


def main(data_file: Path):
    data = read_jsonl(data_file)
    print(get_ner_stats(data, serialize=True))


if __name__ == "__main__":
    typer.run(main)

Run the application with the example data and you should see the following results.

$ python main.py ./examples/data/skills/train.jsonl
{
    "n_examples":106,
    "n_examples_no_entities":29,
    "n_annotations":243,
    "n_annotations_per_type":{
        "SKILL":197,
        "PRODUCT":33,
        "JOB_ROLE":10,
        "skill":2,
        "product":1
    },
    "examples_with_type":null
}

Great! We have some basic stats about our data, but we can already see some issues. It looks like some of our examples are annotated with lowercase labels. These are obviously mistakes, and we'll see how to fix them shortly.
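If you want to surface the offending examples right away, a plain-Python filter over the raw JSONL records does the trick. Note the dict shape below (a "spans" list with "label" keys) is an assumption mirroring typical JSONL NER data, not necessarily Recon's Example model:

```python
def examples_with_lowercase_labels(examples):
    """Return examples that have at least one annotation whose label
    is not fully uppercase (e.g. "skill" instead of "SKILL")."""
    return [
        ex for ex in examples
        if any(not span["label"].isupper() for span in ex.get("spans", []))
    ]


# Toy data in the assumed shape: one bad example, one good one.
data = [
    {"text": "python", "spans": [{"label": "skill"}]},
    {"text": "Python", "spans": [{"label": "SKILL"}]},
]
bad = examples_with_lowercase_labels(data)  # only the first example matches
```

Labels with underscores like "JOB_ROLE" still pass the `isupper()` check, so only the genuinely lowercase annotations are flagged.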

But first, stats on just your train data aren't all that helpful on their own. And it'd be really annoying to have to call the same function on each list of examples:

train = read_jsonl(train_file)
print(get_ner_stats(train, serialize=True))

dev = read_jsonl(dev_file)
print(get_ner_stats(dev, serialize=True))

test = read_jsonl(test_file)
print(get_ner_stats(test, serialize=True))

Next Steps

In the next step of this tutorial we'll introduce the core containers Recon uses for managing examples and state:

  1. Dataset - A Dataset has a name and holds a list of examples. It's also responsible for tracking any mutations done to its internal data through Recon operations. (More on this later)

  2. Corpus - A Corpus is a wrapper around a set of datasets that represent a typical train/eval or train/dev/test split. Using a Corpus allows you to gain insights into how well your train set represents your dev/test sets.
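To make the idea concrete before the next step, here's a toy sketch of what a corpus-like wrapper buys you. These `MiniDataset`/`MiniCorpus` classes and the `apply` method are illustrative inventions, much simpler than Recon's real Dataset and Corpus APIs:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MiniDataset:
    """A named list of examples (toy version of the idea above)."""
    name: str
    data: List[dict] = field(default_factory=list)


@dataclass
class MiniCorpus:
    """Wraps a train/dev/test split so one call covers every dataset."""
    name: str
    train: MiniDataset
    dev: MiniDataset
    test: MiniDataset

    def apply(self, fn: Callable) -> Dict[str, object]:
        # Run the same function over every split, keyed by dataset name.
        return {ds.name: fn(ds.data) for ds in (self.train, self.dev, self.test)}


corpus = MiniCorpus(
    "skills",
    MiniDataset("train", [{"spans": [{"label": "SKILL"}]}]),
    MiniDataset("dev", []),
    MiniDataset("test", []),
)
counts = corpus.apply(len)  # {"train": 1, "dev": 0, "test": 0}
```

With a wrapper like this, comparing stats across splits becomes one call instead of three, which is the core convenience a Corpus provides.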