Loading your data

Recon NER expects your data to be in the most basic Prodigy Annotation Format.

A single example in this format looks like:

{
  "text": "Apple updates its analytics service with new metrics",
  "spans": [{ "start": 0, "end": 5, "label": "ORG" }]
}

Recon needs each example to have the tokens property set. If tokens don't already exist, Recon will add them, and it will try to resolve any tokenization errors in your data for you. If your data has already been tokenized (which is true if you used the ner.manual Prodigy recipe), Recon will skip the tokenization step.
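To make the tokens property concrete, here is a minimal sketch of adding tokens to a raw example and checking that each span lines up with token boundaries. It is not Recon's implementation: the helpers add_tokens and span_is_aligned are hypothetical names, and the whitespace split is an assumption that stands in for whatever tokenizer your pipeline really uses. The token dicts follow Prodigy's text/start/end/id layout.

```python
import re


def add_tokens(example):
    """Return a copy of the example with a tokens list (hypothetical helper).

    Assumption: a plain whitespace split, used here only for illustration.
    Examples that already carry tokens are returned unchanged.
    """
    if "tokens" in example:
        return example
    tokens = [
        {"text": m.group(), "start": m.start(), "end": m.end(), "id": i}
        for i, m in enumerate(re.finditer(r"\S+", example["text"]))
    ]
    return {**example, "tokens": tokens}


def span_is_aligned(span, tokens):
    """Check that a span starts and ends exactly on token boundaries."""
    starts = {t["start"] for t in tokens}
    ends = {t["end"] for t in tokens}
    return span["start"] in starts and span["end"] in ends


raw = {
    "text": "Apple updates its analytics service with new metrics",
    "spans": [{"start": 0, "end": 5, "label": "ORG"}],
}

tokenized = add_tokens(raw)
aligned = [span_is_aligned(s, tokenized["tokens"]) for s in tokenized["spans"]]
```

A span that fails this check (for example one ending in the middle of a word) is the kind of tokenization error that is worth reviewing before training.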

Recon expects your data to be in a collection in a JSONL or JSON file.
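The difference between the two file layouts can be sketched with the standard library alone: a JSONL file holds one JSON object per line, while a JSON file holds a single array of objects. The helpers write_jsonl_raw and load_jsonl_raw below are hypothetical and only illustrate the layouts; they are not Recon's loaders and return plain dicts.

```python
import json
import tempfile
from pathlib import Path

examples = [
    {
        "text": "Apple updates its analytics service with new metrics",
        "spans": [{"start": 0, "end": 5, "label": "ORG"}],
    },
    {"text": "No entities in this sentence", "spans": []},
]


def write_jsonl_raw(path, records):
    """Write one JSON object per line (the JSONL layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def load_jsonl_raw(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write the same collection in both layouts to a temporary directory.
tmp_dir = Path(tempfile.mkdtemp())
jsonl_path = tmp_dir / "train.jsonl"
json_path = tmp_dir / "train.json"

write_jsonl_raw(jsonl_path, examples)
json_path.write_text(json.dumps(examples), encoding="utf-8")

from_jsonl = load_jsonl_raw(jsonl_path)
from_json = json.loads(json_path.read_text(encoding="utf-8"))
```

Both files decode to the same list of examples; only the on-disk layout differs.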

Note

More loaders for other file types (such as CONLL) will be added in future versions.

Loaders

Recon comes with two loaders, read_jsonl and read_json. Both are simple: they load the data from disk and create an instance of the strongly typed Example class for each raw example.

The Example class provides some basic validation that ensures all spans have a text property (which they don't if you're using newer versions of Prodigy and the ner.manual recipe for annotation).
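As an illustration of what this kind of validation amounts to, the sketch below fills in a missing text property on each span by slicing the example text with the span's start and end offsets. It works on plain dicts and is only an assumption about the idea, not the Example class's real code; add_span_text is a hypothetical helper.

```python
def add_span_text(example):
    """Return a copy of the example where every span has a text property.

    Hypothetical helper: the span text is taken from the character offsets,
    and spans that already have text are left as they are.
    """
    text = example["text"]
    spans = []
    for span in example.get("spans", []):
        span = dict(span)  # copy so the input example is not mutated
        span.setdefault("text", text[span["start"]:span["end"]])
        spans.append(span)
    return {**example, "spans": spans}


raw = {
    "text": "Apple updates its analytics service with new metrics",
    "spans": [{"start": 0, "end": 5, "label": "ORG"}],
}

fixed = add_span_text(raw)
```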

Everything in Recon is built to run on a single Example or a List[Example].

However, the goal of Recon is to provide insights across all of your annotated examples, not just one. For this, we need a wrapper around a set of examples. This is called a Dataset.

Let's use the read_jsonl loader to load some annotated data created with Prodigy.

Tip

If you don't have any data available, you can use the data in the examples folder here. We'll be using this data for the rest of the tutorial.

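from recon.loaders import read_jsonl
from recon.types import Example


# read_jsonl creates one Example per raw example in the file
data = read_jsonl('examples/data/skills/train.jsonl')

# data is a collection of examples, so check an item rather than the collection
assert isinstance(data[0], Example)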

Now that we have some examples to work with, we can start examining our data.

Next Steps

Once you have your data loaded, you can run other Recon functions on top of it to gain insights into the quality and completeness of your NER data. You can also start correcting the inconsistently annotated examples you almost certainly have. (Don't worry, that's fine! Messy data is everywhere, even at Microsoft.)