Introduction

In Recon, a Dataset has a few responsibilities (sketched in code below).

  • Store examples
  • Store the state of every mutation made to it via Recon operations
  • Provide an easy interface for applying functions and pipelines to the dataset's data
  • Easily serialize and deserialize from/to disk to track the state of your data across the duration of an annotation project
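
Concretely, those responsibilities look roughly like this. This is a minimal sketch: from_disk and apply are covered later in this guide, while to_disk is assumed here as the serialization counterpart of from_disk.

from recon.dataset import Dataset
from recon.stats import get_ner_stats

# Load examples from disk into a named Dataset
ds = Dataset("train").from_disk("./examples/data/skills/train.jsonl")

# Apply a function over the Dataset's examples
stats = ds.apply(get_ner_stats)

# Serialize the data (and the state of any operations) back to disk
ds.to_disk("./examples/data/skills/train_out.jsonl")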

Getting Started with Datasets

The easiest way to get started with a Dataset is using the from_disk method.

The following example starts by initializing a Dataset with a name ("train") and loading the train.jsonl data for the skills example dataset.

Replace the code in main.py with the following:

from pathlib import Path

import typer
from recon.dataset import Dataset
from recon.stats import get_ner_stats


def main(data_file: Path):
    # Load the examples from disk into a named Dataset
    ds = Dataset("train").from_disk(data_file)
    print(get_ner_stats(ds.data, serialize=True))


if __name__ == "__main__":
    typer.run(main)

and run it with the same command. You should see exactly the same result as you did without using a Dataset. That's because Dataset.from_disk calls read_jsonl.
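
To make that equivalence concrete, here's a minimal sketch. It assumes read_jsonl is importable from recon.loaders, as in the earlier loading step of this tutorial.

from recon.dataset import Dataset
from recon.loaders import read_jsonl

# Both paths produce the same list of examples;
# from_disk simply delegates the file reading to read_jsonl
examples = read_jsonl("./examples/data/skills/train.jsonl")
ds = Dataset("train").from_disk("./examples/data/skills/train.jsonl")

assert len(ds.data) == len(examples)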

$ python main.py ./examples/data/skills/train.jsonl
{
    "n_examples":106,
    "n_examples_no_entities":29,
    "n_annotations":243,
    "n_annotations_per_type":{
        "SKILL":197,
        "PRODUCT":33,
        "JOB_ROLE":10,
        "skill":2,
        "product":1
    },
    "examples_with_type":null
}

Applying functions to Datasets

In the previous example we called the get_ner_stats function on the data from the train Dataset. Dataset provides a utility method called apply. Dataset.apply takes any function that operates on a List of Examples and runs it on the Dataset's internal data.

from pathlib import Path

import typer
from recon.dataset import Dataset
from recon.stats import get_ner_stats


def main(data_file: Path):
    ds = Dataset("train").from_disk(data_file)
    print(ds.apply(get_ner_stats, serialize=True))


if __name__ == "__main__":
    typer.run(main)

This might not look that interesting on its own (it doesn't save you much code), but Dataset.apply can accept either a function or the name of a registered Recon operation. All functions are registered in a Recon registry.

All functions packaged with recon have "recon.vN..." as a prefix.
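
Conceptually, a registry is just a mapping from string names to functions. The following is an illustrative stand-in, not Recon's actual implementation, to show how a name like "recon.v1.get_ner_stats" can resolve to a callable.

from typing import Callable, Dict

# Illustrative only: a registry maps string names to callables
registry: Dict[str, Callable] = {}

def register(name: str) -> Callable:
    def decorator(func: Callable) -> Callable:
        registry[name] = func
        return func
    return decorator

@register("recon.v1.get_ner_stats")
def get_ner_stats(data, serialize=False):
    ...

# Dataset.apply can then resolve a string name to the registered
# function before running it on the Dataset's internal data
func = registry["recon.v1.get_ner_stats"]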

So the above example can be converted to:

from pathlib import Path

import typer
from recon.dataset import Dataset


def main(data_file: Path):
    ds = Dataset("train").from_disk(data_file)
    print(ds.apply("recon.v1.get_ner_stats", serialize=True))


if __name__ == "__main__":
    typer.run(main)

This means you don't have to import the get_ner_stats function. For a full list of operations, see the operations API guide.

All of these examples should return the exact same response. See for yourself:

$ python main.py ./examples/data/skills/train.jsonl
{
    "n_examples":106,
    "n_examples_no_entities":29,
    "n_annotations":243,
    "n_annotations_per_type":{
        "SKILL":197,
        "PRODUCT":33,
        "JOB_ROLE":10,
        "skill":2,
        "product":1
    },
    "examples_with_type":null
}

Next Steps

It's great that we can manage our data operations using a Dataset and named functions, but our data is still messy. We still have those pesky lowercased labels "skill" and "product" that should clearly be "SKILL" and "PRODUCT" respectively. In the next step of the tutorial we'll learn how to run operations that mutate a Dataset, and everything Recon does to keep track of these operations for you.
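
As a quick preview, fixing those labels is a one-line mutation. Treat this as a hedged sketch: the apply_ in-place variant and the recon.v1.upcase_labels operation name are drawn from Recon's operations docs and may differ in your version.

from recon.dataset import Dataset

ds = Dataset("train").from_disk("./examples/data/skills/train.jsonl")

# In-place counterpart of apply: mutates the Dataset's examples
# and records the operation in the Dataset's state
ds.apply_("recon.v1.upcase_labels")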