In [1]:
# Load from parent directory if not installed
import importlib.util
import os

if not importlib.util.find_spec("sammo"):
    import sys

    sys.path.append("../")
os.environ["CACHE_FILE"] = "cache/working_with_data.tsv"

# Data & Templating

SAMMO uses `DataTables` as a thin wrapper around lists of dictionaries. They also keep your input data separate from the desired or actual output.


## Loading data
First, let's download the implicatures dataset from BIG-bench as an example:

In [2]:
import requests
import json

URL = "https://github.com/google/BIG-bench/raw/main/bigbench/benchmark_tasks/implicatures/task.json"

task = json.loads(requests.get(URL).content)
# convert label to single string
for x in task["examples"]:
    x["output"] = max(x["target_scores"], key=x["target_scores"].get)
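
To make the label conversion concrete, here is a minimal sketch on a single hand-written record. The record is made up for illustration and only mirrors the `target_scores` layout that the loop above relies on.

```python
# Hypothetical record mirroring the "target_scores" layout used above.
example = {
    "input": "Speaker 1: 'Are you coming?' Speaker 2: 'I have to work.'",
    "target_scores": {"yes": 0.0, "no": 1.0},
}

# max() iterates over the dict's keys; key=...get ranks each key by its score,
# so the result is the label with the highest score.
label = max(example["target_scores"], key=example["target_scores"].get)
print(label)
```

If two labels tie for the highest score, `max` returns the first one it encounters in the dictionary's iteration order.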

With `DataTables`, there are two kinds of information: inputs and outputs. Inputs are treated as immutable while outputs are mutable. This protects against accidentally changing the starting data. To build the `DataTable`, we need to specify which fields should be used as inputs and which as outputs.
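
The split between read-only inputs and writable outputs can be pictured with a small stand-in class. This is only a sketch of the idea, assuming a tuple for inputs and a list for outputs; `TinyTable` is a hypothetical name and is not how SAMMO implements `DataTable`.

```python
# Hypothetical stand-in for the idea described above, not SAMMO's code.
class TinyTable:
    def __init__(self, records, input_field="input", output_field="output"):
        # Inputs are frozen into a tuple so items cannot be reassigned.
        self._inputs = tuple(r[input_field] for r in records)
        # Outputs stay in a plain list so they can be overwritten later.
        self.outputs = [r.get(output_field) for r in records]

    @property
    def inputs(self):
        # Read-only property: there is no setter for inputs.
        return self._inputs


table = TinyTable([{"input": "a", "output": "yes"}, {"input": "b", "output": "no"}])
table.outputs[0] = "no"  # allowed: outputs are mutable

try:
    table.inputs[0] = "changed"  # tuples do not support item assignment
except TypeError as err:
    print("inputs are read-only:", err)
```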

In [3]:
from sammo.data import DataTable

mydata = DataTable.from_records(
    task["examples"],
    input_fields="input",
    constants={"instructions": task["task_prefix"]},
)
mydata[:3]

+-------------------------------------------------------------+----------+
| input                                                       | output   |
| Speaker 1: 'But aren't you afraid?' Speaker 2: 'Ma'am,      | no       |
| sharks never attack anybody.'                               |          |
+-------------------------------------------------------------+----------+
| Speaker 1: 'Do you want to quit?' Speaker 2: 'I've never    | no       |
| been the type of person who throws in the towel when things |          |
| get tough.'                                                 |          |
+-------------------------------------------------------------+----------+
| Speaker 1: 'Should I convince these clients?' Speaker 2:    | yes      |
| 'These are really important clients with deep pockets.'     |          |
+-------------------------------------------------------------+----------+
Constants: {'instructions': "Does Speaker 2's answer mean yes or no? "}

Much easier to read! We also added task instructions as a constant.

There are other ways to construct `DataTables`, e.g., from a pandas DataFrame. 

In [4]:
import pandas as pd

df = pd.DataFrame(task["examples"])
mydata = DataTable.from_pandas(df, input_fields="input", constants={"instructions": task["task_prefix"]})
mydata[:3]

+-------------------------------------------------------------+----------+
| input                                                       | output   |
| Speaker 1: 'But aren't you afraid?' Speaker 2: 'Ma'am,      | no       |
| sharks never attack anybody.'                               |          |
+-------------------------------------------------------------+----------+
| Speaker 1: 'Do you want to quit?' Speaker 2: 'I've never    | no       |
| been the type of person who throws in the towel when things |          |
| get tough.'                                                 |          |
+-------------------------------------------------------------+----------+
| Speaker 1: 'Should I convince these clients?' Speaker 2:    | yes      |
| 'These are really important clients with deep pockets.'     |          |
+-------------------------------------------------------------+----------+
Constants: {'instructions': "Does Speaker 2's answer mean yes or no? "}

## Indexing

Outputs can be assigned new values using the usual slicing syntax:

In [5]:
cloned = mydata.copy()
cloned.outputs[:] = "yes"
cloned.outputs.unique()

['yes']

If inputs are dictionaries, we can use the `.field()` method to access individual keys.

In [6]:
struc_dt = DataTable([{"one": 1, "two": 2}])
print(struc_dt)
print(struc_dt.inputs.field("one"))

+----------------------+----------+
| input                | output   |
| {'one': 1, 'two': 2} | None     |
+----------------------+----------+
Constants: None
+---------+----------+
| input   | output   |
| 1       | None     |
+---------+----------+
Constants: None


We can also query inputs and outputs individually, for example to keep only the positive instances.

In [7]:
mydata.outputs.filtered_on(lambda x: x == "yes")

+--------------------------------------------------------------+----------+
| input                                                        | output   |
| Speaker 1: 'Should I convince these clients?' Speaker 2:     | yes      |
| 'These are really important clients with deep pockets.'      |          |
+--------------------------------------------------------------+----------+
| Speaker 1: 'You have it, then?' Speaker 2: 'I had to slit a  | yes      |
| few throats to get it.'                                      |          |
+--------------------------------------------------------------+----------+
| Speaker 1: 'Do they fight?' Speaker 2: 'They fight like cats | yes      |
| and dogs.'                                                   |          |
+--------------------------------------------------------------+----------+
| Speaker 1: 'Do you want to come out for a juice?' Speaker 2: | yes      |
| 'I am so thirsty that my throat is as dry as a bone.'        |          |
+-----------
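
For readers who want to see what these two queries amount to, the sketch below reproduces field access and output filtering on a plain list of dictionaries. It does not use SAMMO at all; the rows and the helper `is_positive` are hypothetical and exist only for illustration.

```python
# Hypothetical rows; each pairs a dictionary input with an output label.
rows = [
    {"input": {"one": 1, "two": 2}, "output": "yes"},
    {"input": {"one": 3, "two": 4}, "output": "no"},
    {"input": {"one": 5, "two": 6}, "output": "yes"},
]


def is_positive(label):
    """Predicate applied to each output value."""
    return label == "yes"


# Field access: pull a single key out of every dictionary input.
ones = [row["input"]["one"] for row in rows]

# Filtering on outputs: keep whole rows whose output passes the predicate.
positives = [row for row in rows if is_positive(row["output"])]

print(ones)
print([row["input"] for row in positives])
```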

## Templating

Let's annotate 10 examples from the implicatures dataset. Below, we initialize our runner as before.


In [8]:
# %load -r 3:25 _init.py
import pathlib
import sammo
from sammo.runners import OpenAIChat
from sammo.base import Template, EvaluationScore
from sammo.components import Output, GenerateText, ForEach, Union
from sammo.extractors import ExtractRegex
from sammo.data import DataTable
import json
import requests
import os

if 'OPENAI_API_KEY' not in os.environ:
    raise ValueError("Please set the environment variable 'OPENAI_API_KEY'.")

_ = sammo.setup_logger("WARNING")  # we're only interested in warnings for now

runner = OpenAIChat(
    model_id="gpt-3.5-turbo",
    api_config={"api_key": os.environ['OPENAI_API_KEY']},
    cache=os.getenv("CACHE_FILE", "cache.tsv"),
    timeout=30,
)

To refer to fields in the DataTable, SAMMO automatically recognizes the names `constants` and `input` (or `inputs`, if minibatching is activated).
Other than that, it follows the standard [Handlebars.js syntax](https://handlebarsjs.com/guide/expressions.html).
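
To get a feel for how the placeholders resolve, here is a minimal sketch that substitutes dotted names such as `constants.instructions` from a nested dictionary. The `render` function is a hypothetical helper built on a regular expression; it handles only simple `{{name}}` expressions and is neither a Handlebars implementation nor SAMMO's `Template`.

```python
import re


def render(template: str, context: dict) -> str:
    """Replace {{a.b}} placeholders with values looked up in nested dicts."""

    def lookup(match: re.Match) -> str:
        value = context
        # Walk the dotted path one key at a time.
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)


context = {
    "constants": {"instructions": "Does Speaker 2's answer mean yes or no? "},
    "input": "Speaker 1: 'Do they fight?' Speaker 2: 'They fight like cats and dogs.'",
}
prompt = render(
    "Instructions:{{constants.instructions}}\nOutput labels: yes, no\nInput: {{input}}\nOutput:",
    context,
)
print(prompt)
```

Each row of the table supplies its own `input`, while `constants` stays the same for every row.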

In [9]:
labeling_prompt = GenerateText(
    Template(
        "Instructions:{{constants.instructions}}\nOutput labels: yes, no\nInput: {{input}}\nOutput:"
    )
)
sample = mydata.sample(10, seed=42)
result = Output(labeling_prompt).run(runner, sample)
result

minibatches[#################################################################################]10/10[00:00<??:??, 0.00it/s]


+--------------------------------------------------------------+----------+
| input                                                        | output   |
| Speaker 1: 'You do this often?' Speaker 2: 'It's my first    | no       |
| time.'                                                       |          |
+--------------------------------------------------------------+----------+
| Speaker 1: 'Are you trying to make me mad?' Speaker 2: 'I'm  | no       |
| just saying, I'd understand if you were upset. '             |          |
+--------------------------------------------------------------+----------+
| Speaker 1: 'You want answers?!' Speaker 2: 'I want the       | no       |
| truth.'                                                      |          |
+--------------------------------------------------------------+----------+
| Speaker 1: 'Are you able to carry the box?' Speaker 2: 'It   | yes      |
| is as light as a feather.'                                   |          |
+-----------

Outputs have three different access methods. First, if we only want the list of final values, we can use the `.values` property. These are also the values shown by default.

In [10]:
y_pred = result.outputs.values
y_pred[:5]

['no', 'no', 'no', 'yes', 'yes']

If we want lower-level access to the result objects, we can use `.raw_values`.

In [13]:
result.outputs.raw_values

[LLMResult(value='no', parent=TextResult),
 LLMResult(value='no', parent=TextResult),
 LLMResult(value='no', parent=TextResult),
 LLMResult(value='yes', parent=TextResult),
 LLMResult(value='yes', parent=TextResult),
 LLMResult(value='no', parent=TextResult),
 LLMResult(value='no', parent=TextResult),
 LLMResult(value='yes', parent=TextResult),
 LLMResult(value='no', parent=TextResult),
 LLMResult(value='yes', parent=TextResult)]

This returns the list of underlying result objects, which also keep track of the entire chain of calls.

Finally, we can use `.normalized_values()` to apply common post-processing steps, e.g., replacing missing values or converting the values to a list.
This can be useful when computing metrics, e.g., accuracy below:

In [15]:
y_true = sample.outputs.normalized_values(on_empty="")
n_correct = sum(y_p == y_t for y_p, y_t in zip(y_pred, y_true))
accuracy = n_correct / len(y_true)
accuracy

0.8
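
To see why filling in missing values before scoring helps, consider this plain-Python sketch. The `normalize` helper and the toy labels are hypothetical; it only imitates the "replace missing values" step described above and makes no claim about what `.normalized_values()` does internally.

```python
def normalize(values, on_empty=""):
    """Replace missing (None) entries with a placeholder so comparisons are safe."""
    return [on_empty if v is None else v for v in values]


# Hypothetical gold labels and predictions, for illustration only.
gold = normalize(["yes", None, "no", "yes"])
predicted = ["yes", "no", "no", "no"]

# A missing gold label becomes "" and therefore never matches a prediction.
toy_correct = sum(p == g for p, g in zip(predicted, gold))
toy_accuracy = toy_correct / len(gold)
print(toy_accuracy)
```

This toy data is only illustrative; the 0.8 reported above is the figure for the actual sample.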

Not bad, but it still seems to be a hard task. Let's see what we can do with basic prompt engineering in the next section.