Tutorial - Using Dataset.apply

In the previous step, we used the stats.get_ner_stats function to get some stats on our train_data. Now, we want to be able to get these same stats across our train/dev/test split.
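To see why a helper is useful, here's a rough sketch of what doing this by hand might look like. It's only a sketch: it assumes the splits live in per-split JSONL files under ./examples/data and that ReconNER's read_jsonl loader is available; neither detail comes from this tutorial.

from recon.loaders import read_jsonl
from recon.stats import get_ner_stats

# Load each split separately and compute its stats one at a time.
# The file names below are an assumption about the data layout.
for split in ("train", "dev", "test"):
    examples = read_jsonl(f"./examples/data/{split}.jsonl")
    print(split)
    print(get_ner_stats(examples, serialize=True, no_print=True))

That works, but it repeats the loading logic and still doesn't give you the combined numbers unless you concatenate the splits yourself.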

So ReconNER's Dataset class provides a useful method called apply that takes a Callable that can run on a list of Examples (e.g. stats.get_ner_stats) and runs it over each of the datasets as well as a concatenation of all of them, so you get the full picture.

Tip

You can pass arbitrary *args and **kwargs to Dataset.apply and they will be passed along to the callable you provide as the required argument.
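For instance, here's a minimal sketch of forwarding a keyword argument to a custom stats function. The count_dense_examples helper and its min_ents parameter are made up for illustration, and the Example.spans attribute access is an assumption about the Example type; only the calling convention comes from this tutorial.

from pathlib import Path
from typing import List

from recon.dataset import Dataset
from recon.types import Example


def count_dense_examples(data: List[Example], min_ents: int = 2) -> int:
    """Count examples that have at least min_ents annotated spans."""
    # Assumes each Example exposes its annotations via a spans attribute.
    return sum(1 for example in data if len(example.spans) >= min_ents)


ds = Dataset.from_disk(Path("./examples/data"))
# min_ents is forwarded to count_dense_examples for every dataset.
print(ds.apply(count_dense_examples, min_ents=3))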

Update script to use Dataset.apply

Let's edit that main.py file you created in the previous step to utilize the Dataset.apply method.

from pathlib import Path

import typer
from recon.dataset import Dataset
from recon.stats import get_ner_stats


def main(data_dir: Path):
    # Load the train/dev/test datasets from the data directory
    ds = Dataset.from_disk(data_dir)
    # Run get_ner_stats over each dataset plus the concatenation of all of
    # them, returning serialized stats rather than printing them directly
    ds_stats = ds.apply(get_ner_stats, serialize=True, no_print=True)
    for name, stats in ds_stats.items():
        print(name)
        print("=" * 50)
        print(stats)


if __name__ == "__main__":
    typer.run(main)

Run the application

Now, if you run your script you should get the following output:

$ python main.py ./examples/data
train
==================================================
{
    "n_examples":102,
    "n_examples_no_entities":29,
    "ents_per_type":{
        "SKILL":191,
        "PRODUCT":34,
        "JOB_ROLE":5
    }
}
dev
==================================================
{
    "n_examples":110,
    "n_examples_no_entities":49,
    "ents_per_type":{
        "SKILL":159,
        "PRODUCT":20,
        "JOB_ROLE":1
    }
}
test
==================================================
{
    "n_examples":96,
    "n_examples_no_entities":38,
    "ents_per_type":{
        "PRODUCT":35,
        "SKILL":107,
        "JOB_ROLE":2
    }
}
all
==================================================
{
    "n_examples":308,
    "n_examples_no_entities":116,
    "ents_per_type":{
        "SKILL":457,
        "PRODUCT":89,
        "JOB_ROLE":8
    }
}

Analyzing the results

Now that we have a good understanding of how the labels are distributed across our train/dev/test split, as well as the combined numbers in the "all" data, we can start to see some issues.

1. Not enough JOB_ROLE annotations

We clearly don't have enough annotations of the JOB_ROLE label in our data. There's no way an NER model could learn to capture JOB_ROLE in a generic way from only 8 total annotations.

2. Barely enough PRODUCT annotations

We're also a little low on our PRODUCT label, though not nearly as badly, as the quick calculation below shows.
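To put numbers on that imbalance, here's a small standalone snippet that uses the ents_per_type counts from the "all" output above (hard-coded here for illustration) to compute each label's share of the total annotations:

# Entity counts copied from the "all" stats printed above.
ents_per_type = {"SKILL": 457, "PRODUCT": 89, "JOB_ROLE": 8}
total = sum(ents_per_type.values())  # 554 annotations in total

for label, count in sorted(ents_per_type.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {count} ({count / total:.1%})")

# SKILL: 457 (82.5%)
# PRODUCT: 89 (16.1%)
# JOB_ROLE: 8 (1.4%)

Roughly 1 annotation in 70 is a JOB_ROLE, which is why it tops the list of things to fix.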

What to do from here

We want our final model to be equally good at extracting all 3 labels (SKILL, PRODUCT, and JOB_ROLE), so we now know exactly where to invest more time in our annotation effort: getting more examples of JOB_ROLE.

Note

This is a VERY small dataset sampled from a much larger NER dataset that's powering part of our work on the new v3 Text Analytics Cognitive Service. So here's your glimpse into how we work with data at Microsoft. Until we fix the lack of annotations for the JOB_ROLE label, we won't be launching it in production.

Next Steps

We've only scratched the surface of ReconNER. It's great to have these global stats about our dataset so we can track trends and make sure we're moving in the right direction as we annotate more data. But these stats don't help us debug the data we already have: 34 of our 191 SKILL annotations in the train set might actually be instances where JOB_ROLE or PRODUCT is more appropriate.

We might have subsets of our data annotated by different people who had a slightly different understanding of the annotation requirements.

In the next step of this tutorial we'll dive into the insights module of ReconNER to examine the quality of our existing annotations.