7. Instruction Optimization
We use the term instruction optimization to refer to the problem of finding the task instructions that maximize some target metric (e.g., accuracy).
Note
We work with an extremely small number of data instances here to show the general flow. In practice, we recommend using 100+ examples each for training and testing.
We start by initializing things as before.
# %load -r 3:25 _init.py
import pathlib
import sammo
from sammo.runners import OpenAIChat
from sammo.base import Template, EvaluationScore
from sammo.components import Output, GenerateText, ForEach, Union
from sammo.extractors import ExtractRegex
from sammo.data import DataTable
import json
import requests
import os

if "OPENAI_API_KEY" not in os.environ:
    raise ValueError("Please set the environment variable 'OPENAI_API_KEY'.")

_ = sammo.setup_logger("WARNING")  # we're only interested in warnings for now

runner = OpenAIChat(
    model_id="gpt-3.5-turbo-16k",
    api_config={"api_key": os.getenv("OPENAI_API_KEY")},
    cache=os.getenv("CACHE_FILE", "cache.tsv"),
    timeout=30,
)
# %load -s load_data,accuracy _init.py
def load_data(
    url="https://github.com/google/BIG-bench/raw/main/bigbench/benchmark_tasks/implicatures/task.json",
):
    task = json.loads(requests.get(url).content)
    # convert label to single string
    for x in task["examples"]:
        x["output"] = max(x["target_scores"], key=x["target_scores"].get)
    return DataTable.from_records(
        task["examples"],
        input_fields="input",
        constants={"instructions": task["task_prefix"]},
    )


def accuracy(y_true: DataTable, y_pred: DataTable) -> EvaluationScore:
    y_true = y_true.outputs.normalized_values()
    y_pred = y_pred.outputs.normalized_values()
    n_correct = sum([y_p == y_t for y_p, y_t in zip(y_pred, y_true)])
    return EvaluationScore(n_correct / len(y_true))
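The max(..., key=...) line in load_data collapses BIG-bench's per-label target_scores into the single highest-scoring label. In isolation, with a made-up example record:

```python
# A made-up record in the BIG-bench implicatures format: each example
# carries a score per candidate label.
example = {
    "input": "Speaker 1: 'Did you like it?' Speaker 2: 'It was great.'",
    "target_scores": {"yes": 1.0, "no": 0.0},
}

# Pick the label with the highest score, as load_data does above.
label = max(example["target_scores"], key=example["target_scores"].get)
print(label)  # -> yes
```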
7.1. Step 1: Defining the set of initial candidates
Our plan is to use beam search with mutation operators to refine a set of initial candidates. As with grid search before, we can use the same syntax to define a parametric set of initial candidates.
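Before wiring this up in SAMMO, it may help to see the shape of the loop in isolation. Below is a toy sketch, not SAMMO's implementation: candidates are plain strings, and the two mutators and the vowel-counting objective are made-up stand-ins for paraphrasing and accuracy.

```python
def beam_search(initial, mutators, objective, depth=3, beam_width=2):
    # Keep the best beam_width candidates from the initial set.
    beams = sorted(initial, key=objective, reverse=True)[:beam_width]
    for _ in range(depth):
        candidates = list(beams)  # like add_previous=True: parents stay eligible
        for beam in beams:
            for mutate in mutators:
                candidates.append(mutate(beam))
        # Re-rank and keep the top beam_width candidates.
        beams = sorted(candidates, key=objective, reverse=True)[:beam_width]
    return beams[0]


best = beam_search(
    initial=["label the input", "find output"],
    mutators=[lambda s: s + "a", lambda s: s + "!"],  # toy "mutations"
    objective=lambda s: sum(c in "aeiou" for c in s),  # toy "accuracy"
)
print(best)  # -> label the inputaaa
```

The real search below works the same way, except that candidates are metaprompts, mutators are LLM-backed operators, and the objective is the accuracy metric defined earlier.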
7.1.1. Using Callables to bind static values
A very common problem is having a set of static values, e.g., configuration or input datasets, that are needed when constructing a metaprompt.
To bind these static values, we recommend using callables. These are objects that behave like functions but can be initialized with the static values for the task. In essence, they behave like partially bound functions but offer a cleaner interface.
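As a standalone illustration of the pattern (the names here are made up, not part of SAMMO's API): the static value is stored once at construction time, and every call can then use it.

```python
class PromptFactory:
    """Callable that binds a static list of labels at construction time."""

    def __init__(self, labels):
        self.labels = labels  # static value bound once

    def __call__(self):
        # Used on every call, without having to pass labels around.
        return f"Output labels: {', '.join(self.labels)}"


factory = PromptFactory(["no", "yes"])
print(factory())  # -> Output labels: no, yes
```

Compared to functools.partial, a class-based callable leaves room for helper methods and richer state, which is why we use one below.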
Below, we show how we can bind the training dataset to the search space object so we can use its values during the construction of the initial candidate space.
from sammo.instructions import MetaPrompt, Section, Paragraph, InputData
from sammo.dataformatters import PlainFormatter
from sammo.search_op import one_of


class InititialCandidates:
    def __init__(self, dtrain):
        self.dtrain = dtrain

    def __call__(self):
        example_formatter = PlainFormatter(
            all_labels=self.dtrain.outputs.unique(), orient="item"
        )

        labels = self.dtrain.outputs.unique()
        instructions = MetaPrompt(
            [
                Paragraph("Instructions: "),
                Paragraph(
                    one_of(
                        [
                            self.dtrain.constants["instructions"],
                            "",
                            "Find the best output label given the input.",
                            self.dtrain.constants["instructions"] * 2,
                        ]
                    ),
                    reference_id="instructions",
                ),
                Paragraph("\n"),
                Paragraph(
                    f"Output labels: {', '.join(labels)}\n" if len(labels) <= 10 else ""
                ),
                Paragraph(InputData()),
                Paragraph("Output: "),
            ],
            render_as="raw",
            data_formatter=example_formatter,
        )

        return Output(
            instructions.with_extractor("raise"),
            minibatch_size=1,
            on_error="empty_result",
        )
7.2. Step 2: Define a set of mutation operators
In each step of the beam search, SAMMO samples a set of mutation operators and applies them to the current set of active candidates (beams).
from sammo.mutators import BagOfMutators, InduceInstructions, Paraphrase

mydata = load_data()
d_train = mydata.sample(10, seed=42)

mutation_operators = BagOfMutators(
    InititialCandidates(d_train),
    InduceInstructions("#instructions", d_train),
    Paraphrase("#instructions"),
    sample_for_init_candidates=False,
)
Above, we have defined the set of mutators to be applied. We initialize with our previously defined InititialCandidates set and allow two different mutation operations: inducing new instructions from labeled samples, and paraphrasing existing ones. To specify which part of the metaprompt a mutator should operate on, we pass a path descriptor; here, "#instructions" refers to the paragraph declared with reference_id="instructions".
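Schematically, a path descriptor like "#instructions" selects the component whose reference_id matches the part after the "#". A toy lookup (an illustration only, not SAMMO's actual resolver) might look like this:

```python
def find_by_path(nodes, path):
    """Return the nodes whose reference_id matches the '#...' descriptor."""
    target = path.lstrip("#")
    return [n for n in nodes if n.get("reference_id") == target]


# Dict stand-ins for the Paragraph components of the metaprompt above.
metaprompt = [
    {"content": "Instructions: "},
    {"content": "Find the best output label.", "reference_id": "instructions"},
    {"content": "Output: "},
]
print(find_by_path(metaprompt, "#instructions"))
# -> [{'content': 'Find the best output label.', 'reference_id': 'instructions'}]
```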
7.3. Step 3: Run the beam search
Let's set up our beam search and run it.
from sammo.search import BeamSearch

prompt_optimizer = BeamSearch(
    runner,
    mutation_operators,
    accuracy,
    maximize=True,
    depth=3,
    mutations_per_beam=2,
    n_initial_candidates=4,
    beam_width=4,
    add_previous=True,
)
prompt_optimizer.fit(d_train)
prompt_optimizer.show_report()
search depth[############]3/3[00:00<00:00] >> eval[#################################]8/8 >> tasks[######]80/80[00:00<00:00, 62.50it/s]
Fitting log:
iteration action objective costs parse_errors prev_actions
----------- ------------------ ----------- ----------------------------- -------------- --------------------------------------------------
-1 init 0.8 {'input': 466, 'output': 10} 0.0 ['init']
-1 init 0.8 {'input': 546, 'output': 10} 0.0 ['init']
-1 init 0.5 {'input': 576, 'output': 10} 0.0 ['init']
-1 init 0.5 {'input': 686, 'output': 10} 0.0 ['init']
0 Paraphrase 0.6 {'input': 636, 'output': 10} 0.0 ['Paraphrase', 'init']
0 Paraphrase 0.0 {'input': 516, 'output': 282} 0.0 ['Paraphrase', 'init']
0 InduceInstructions 0.6 {'input': 796, 'output': 10} 0.0 ['InduceInstructions', 'init']
0 Paraphrase 0.8 {'input': 566, 'output': 10} 0.0 ['Paraphrase', 'init']
0 Paraphrase 0.6 {'input': 586, 'output': 10} 0.0 ['Paraphrase', 'init']
0 Paraphrase 0.6 {'input': 586, 'output': 10} 0.0 ['Paraphrase', 'init']
0 InduceInstructions 0.8 {'input': 926, 'output': 10} 0.0 ['InduceInstructions', 'init']
0 Paraphrase 0.3 {'input': 696, 'output': 17} 0.0 ['Paraphrase', 'init']
1 InduceInstructions 0.7 {'input': 706, 'output': 10} 0.0 ['InduceInstructions', 'Paraphrase', 'init']
1 Paraphrase 0.8 {'input': 566, 'output': 10} 0.0 ['Paraphrase', 'Paraphrase', 'init']
1 InduceInstructions 0.8 {'input': 586, 'output': 10} 0.0 ['InduceInstructions', 'InduceInstructions',
'init']
1 InduceInstructions 0.9 {'input': 1136, 'output': 10} 0.0 ['InduceInstructions', 'InduceInstructions',
'init']
1 Paraphrase 0.3 {'input': 516, 'output': 137} 0.0 ['Paraphrase', 'init']
1 InduceInstructions 0.7 {'input': 546, 'output': 10} 0.0 ['InduceInstructions', 'init']
1 InduceInstructions 0.7 {'input': 546, 'output': 10} 0.0 ['InduceInstructions', 'init']
1 Paraphrase 0.8 {'input': 566, 'output': 10} 0.0 ['Paraphrase', 'init']
2 InduceInstructions 0.8 {'input': 626, 'output': 10} 0.0 ['InduceInstructions', 'InduceInstructions',
'InduceInstructions', 'init']
2 InduceInstructions 0.8 {'input': 856, 'output': 10} 0.0 ['InduceInstructions', 'InduceInstructions',
'InduceInstructions', 'init']
2 Paraphrase 0.8 {'input': 566, 'output': 10} 0.0 ['Paraphrase', 'Paraphrase', 'Paraphrase', 'init']
2 Paraphrase 0.8 {'input': 566, 'output': 10} 0.0 ['Paraphrase', 'Paraphrase', 'Paraphrase', 'init']
2 InduceInstructions 0.7 {'input': 636, 'output': 10} 0.0 ['InduceInstructions', 'InduceInstructions',
'InduceInstructions', 'init']
2 Paraphrase 0.9 {'input': 586, 'output': 10} 0.0 ['Paraphrase', 'InduceInstructions',
'InduceInstructions', 'init']
2 Paraphrase 0.8 {'input': 566, 'output': 10} 0.0 ['Paraphrase', 'Paraphrase', 'init']
2 Paraphrase 0.8 {'input': 566, 'output': 10} 0.0 ['Paraphrase', 'Paraphrase', 'init']
Action stats:
action stats
------------------ -----------------------------
Paraphrase {'chosen': 14, 'improved': 0}
InduceInstructions {'chosen': 10, 'improved': 1}
Great! Our best prompt gets an accuracy of 0.9 on the training set. Let's see what it came up with.
print(prompt_optimizer.best_prompt)
Output(
child = StripWhitespace(
child = GenerateText(
child = MetaPrompt(
structure = [
0 : Paragraph(
content = 'Instructions: ',
id = None
),
1 : Paragraph(
content = 'The instruction given is to determine whether the second speaker's response indicates agreement or disagreement with the first speaker's statement. If the second speaker's response supports or aligns with the first speaker's statement, the output is "yes." If the second speaker's response contradicts or opposes the first speaker's statement, the output is "no."',
id = 'instructions'
),
2 : Paragraph(
content = '
',
id = None
),
3 : Paragraph(
content = 'Output labels: no, yes
',
id = None
),
4 : Paragraph(
content = InputData(
id_offset = 0,
name = None
),
id = None
),
5 : Paragraph(
content = 'Output: ',
id = None
)
],
render_as = 'raw',
data_formatter = PlainFormatter(
all_labels = [
0 : 'no',
1 : 'yes'
],
orient = 'item'
),
name = None,
seed = 0
),
name = None,
system_prompt = None,
history = None,
seed = 0,
randomness = 0,
max_tokens = None,
on_error = 'empty_result'
),
on_error = 'raise',
flatten = True
),
minibatch_size = 1,
on_error = 'empty_result'
)