# BigBench-Hard

## Introduction

In this notebook, we demonstrate how to use the `trace` package to optimize both prompts and code for natural language processing tasks, using the BigBench-Hard benchmark. State-of-the-art approaches on this benchmark optimize only the prompts and rely on hand-written code to extract answers from LLM responses. With the LLM-based optimizers provided in `trace`, we optimize a whole workflow that calls an LLM and post-processes its response, with the aim of producing more accurate answers.

## Setup

First, we'll import the necessary packages and set up our environment. We will use a copy of the BigBench-Hard benchmark hosted on [HuggingFace](https://huggingface.co/datasets/maveriq/bigbenchhard). To use HuggingFace datasets, ensure that you have the `datasets` package installed:

pip install datasets

In [None]:
!pip install trace-opt

In [1]:
# Import necessary libraries
import re
from textwrap import dedent
from typing import List

import autogen
from datasets import load_dataset
from opto.optimizers import OptoPrime
from opto.trace.bundle import bundle
from opto.trace.errors import ExecutionError
from opto.trace.modules import model
from opto.trace.nodes import node, GRAPH, ParameterNode, ExceptionNode

## Define the Evaluation Function

Next, we'll define the utility function for evaluating answers obtained by prompting an LLM.

In [2]:
def eval_metric(true, prediction):
    matches = re.findall(r"\([A-Z]\)", true)
    if matches:
        pred = prediction
        matches = re.findall(r"\([A-Z]\)", pred)
        parsed_answer = matches[-1] if matches else ""
        return parsed_answer == true
    else:
        return prediction == true
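
As a quick check of how this metric behaves, the sketch below calls it on plain strings. It repeats the function so that it runs on its own; the sample predictions are made up for illustration. Note that the free-form branch is an exact string comparison, so case matters.

```python
import re

def eval_metric(true, prediction):
    # Copy of the notebook's metric so that this sketch is self-contained.
    matches = re.findall(r"\([A-Z]\)", true)
    if matches:
        matches = re.findall(r"\([A-Z]\)", prediction)
        parsed_answer = matches[-1] if matches else ""
        return parsed_answer == true
    return prediction == true

# Multiple-choice target: the last "(X)" found in the prediction is compared.
mc_hit = eval_metric("(B)", "It could be (A), but the best choice is (B).")
mc_miss = eval_metric("(B)", "The answer is B")

# Free-form target: exact match, so "No" does not equal "no".
free_hit = eval_metric("no", "no")
free_miss = eval_metric("no", "No")

print(mc_hit, mc_miss, free_hit, free_miss)
```

Because of the exact match, a prediction such as `Yes` can never score against a lowercase target, which is one of the things the optimizer may need to fix in the prompt or in the extraction code.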

## Helper Function

We'll create a helper class called `LLMCallable` to interact with the LLM API.

In [3]:
class LLMCallable:
    def __init__(self, config_list=None, max_tokens=1024, verbose=False):
        if config_list is None:
            config_list = autogen.config_list_from_json("OAI_CONFIG_LIST")
        self.llm = autogen.OpenAIWrapper(config_list=config_list)
        self.max_tokens = max_tokens
        self.verbose = verbose

    @bundle(catch_execution_error=True)
    def call_llm(self, user_prompt):
        system_prompt = "You are a helpful assistant.\n"
        messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}]
        response = self.llm.create(messages=messages, max_tokens=self.max_tokens)
        response = response.choices[0].message.content

        if self.verbose:
            print("LLM response:\n", response)
        return response
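
`LLMCallable` reads its credentials through `autogen.config_list_from_json("OAI_CONFIG_LIST")`, which looks for an environment variable or a file with that name containing a JSON list of configurations. The sketch below only builds and prints such a list; the model name and the key are placeholders, not values from this notebook.

```python
import json

# Placeholder values: replace them with a model you have access to and a real key.
config_list = [
    {"model": "gpt-4", "api_key": "<your-api-key>"},
]

# JSON text that could be saved in a file named OAI_CONFIG_LIST
# (or exported as an environment variable with the same name).
config_text = json.dumps(config_list, indent=2)
print(config_text)
```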


## Define a Traced Class

We define a `Predict` class that generates predictions with an LLM. It is decorated with `model`, a decorator provided by `trace` that wraps a Python class to enable tracing.

In [4]:
@model
class Predict(LLMCallable):
    def __init__(self):
        super().__init__()

        self.demos = []
        self.prompt_template = dedent(
            """
        Given the fields `question`, produce the fields `answer`.

        ---

        Follow the following format.

        Question: 
        Answer: 

        ---
        Question: {}
        Answer:
        """
        )
        self.prompt_template = ParameterNode(self.prompt_template, trainable=True,
                                             description="This is the Prompt Template to the LLM. " + \
                                                         "Need to include information about what the format of answers LLM should output. " + \
                                                         "They can be (A)/(B), a number like 8, or a string, or Yes/No.")

    @bundle(trainable=True, catch_execution_error=True, allow_external_dependencies=True)
    def extract_answer(self, prompt_template, question, response):
        answer = response.split("Answer:")[1].strip()
        return answer

    @bundle(trainable=True, catch_execution_error=True, allow_external_dependencies=True)
    def create_prompt(self, prompt_template, question):
        return prompt_template.format(question)

    def forward(self, question):
        user_prompt = self.create_prompt(self.prompt_template, question)
        response = self.call_llm(user_prompt)
        answer = self.extract_answer(self.prompt_template, question, response)
        return answer
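
To see what `forward` computes without calling an LLM or tracing anything, here is a minimal untraced sketch of the same three steps. `fake_llm` is a hypothetical stand-in for `call_llm`, the template is shortened, and the functions are plain Python rather than bundled methods.

```python
from textwrap import dedent

# Shortened version of the notebook's template; "{}" is filled with the question.
PROMPT_TEMPLATE = dedent(
    """
    Given the fields `question`, produce the fields `answer`.

    ---
    Question: {}
    Answer:
    """
)

def create_prompt(prompt_template, question):
    return prompt_template.format(question)

def extract_answer(response):
    # Same logic as the notebook: keep what follows the first "Answer:".
    return response.split("Answer:")[1].strip()

def fake_llm(user_prompt):
    # Hypothetical stand-in for call_llm; returns a canned response.
    return "Question: (repeated)\nAnswer: yes"

question = 'Is the following sentence plausible? "Elias Lindholm beat the buzzer."'
prompt = create_prompt(PROMPT_TEMPLATE, question)
answer = extract_answer(fake_llm(prompt))
print(answer)

# If the response lacks "Answer:", the split has one element and indexing fails.
try:
    extract_answer("yes")
    failed = False
except IndexError:
    failed = True
print(failed)
```

In the traced class, this kind of failure is what `catch_execution_error=True` is there for: the error surfaces as an `ExecutionError`, which the training loop uses as feedback.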

## Define the optimizer

Note that `prompt_template` is a `ParameterNode`, while `extract_answer` and `create_prompt` are trainable functions (bundled with `trainable=True`). `trace` handles the optimization of such heterogeneous parameters seamlessly.

In [5]:
def learn_predict(dp, optimizer, examples):
    for step, example in enumerate(examples):
        try:
            response = dp.forward(example['question'])
            correctness = eval_metric(example['answer'], response)
            feedback = "The answer is correct! No need to change anything." if correctness else f"The answer is wrong. We expect the output of your answer to be \"{example['answer']}\". Please modify the prompt and relevant parts of the program to help LLM produce the right answer."
        except ExecutionError as e:
            response = e.exception_node
            feedback = response.data
            correctness = False
            
        print("Question:", example["question"])
        print("Expected answer:", example["answer"])
        print("Answer:", response)

        if correctness:
            continue

        optimizer.zero_feedback()
        optimizer.backward(response, feedback)

        print(f"Output: {response}, Feedback: {feedback}, Variables:")  # Logging
        for p in optimizer.parameters:
            print(p.name, p.data)
        optimizer.step(verbose=True)
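
The control flow of this loop can be checked offline with a stub. The sketch below is not the `trace` optimizer: `RecordingOptimizer` is a hypothetical class that only mirrors the three method names used above and records the calls, and `toy_learn` is a simplified copy of the loop without tracing or error handling.

```python
class RecordingOptimizer:
    """Hypothetical stub: records calls instead of updating any parameter."""

    def __init__(self):
        self.calls = []

    def zero_feedback(self):
        self.calls.append("zero_feedback")

    def backward(self, response, feedback):
        self.calls.append("backward")

    def step(self):
        self.calls.append("step")

def toy_learn(predict, optimizer, examples):
    for example in examples:
        response = predict(example["question"])
        if response == example["answer"]:
            continue  # correct answers trigger no update
        feedback = f'The answer is wrong. We expect "{example["answer"]}".'
        optimizer.zero_feedback()               # clear feedback from the previous step
        optimizer.backward(response, feedback)  # attach feedback to the output
        optimizer.step()                        # propose new parameter values

examples = [
    {"question": "q1", "answer": "no"},
    {"question": "q2", "answer": "yes"},
]
opt = RecordingOptimizer()
toy_learn(lambda question: "yes", opt, examples)
print(opt.calls)
```

Only the first example is answered incorrectly, so the stub records exactly one update cycle.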

## Putting it all together

Finally, we use the optimizer to find better prompts using a small training set as follows.

In [6]:
task = "sports_understanding"
train = load_dataset("maveriq/bigbenchhard", task)["train"]
examples = [{"question": r["input"], "answer": r["target"]} for r in train]

dp = Predict()
optimizer = OptoPrime(
    dp.parameters(),
    config_list=autogen.config_list_from_json("OAI_CONFIG_LIST"),
)

print("Training on a few examples:")
learn_predict(dp, optimizer, examples[:5])
    
print("\nTesting on new examples:")
for example in examples[5:6]:
    try:
        response = dp.forward(example["question"])
        print("Question:", example["question"])
        print("Expected answer:", example["answer"])
        print("Answer:", response.data)
    except ExecutionError as e:
        print("Question:", example["question"])
        print("Expected answer:", example["answer"])
        print("Error:", e.exception_node.data)

Training on a few examples:
Question: Is the following sentence plausible? "Elias Lindholm beat the buzzer."
Expected answer: no
Answer: MessageNode: (eval:1, dtype=<class 'str'>, data=Yes)
Output: MessageNode: (eval:1, dtype=<class 'str'>, data=Yes), Feedback: The answer is wrong. We expect the output of your answer to be "no". Please modify the prompt and relevant parts of the program to help LLM produce the right answer., Variables:
__code:1 def create_prompt(self, prompt_template, question):
        return prompt_template.format(question)
__code:0 def extract_answer(self, prompt_template, question, response):
        answer = response.split("Answer:")[1].strip()
        return answer
str:0 
Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: 
Answer: 

---
Question: {}
Answer:

str:0 
Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: 
Answer: 

---
Question: {}
Answer:

Prompt
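
The test loop above prints a single held-out example. To summarize performance over a larger held-out slice, one option is an accuracy helper like the sketch below. `accuracy`, `exact_match`, and the toy data are illustrative names and values, not part of `trace`; with the traced `Predict`, `predict_fn` would need to return the plain string (for example `dp.forward(q).data`, as in the test loop).

```python
def accuracy(predict_fn, examples, metric):
    """Fraction of examples whose prediction satisfies `metric`."""
    if not examples:
        return 0.0
    hits = sum(
        bool(metric(ex["answer"], predict_fn(ex["question"]))) for ex in examples
    )
    return hits / len(examples)

def exact_match(true, prediction):
    return prediction == true

# Toy held-out set and a trivial predictor, for illustration only.
held_out = [
    {"question": "q1", "answer": "yes"},
    {"question": "q2", "answer": "no"},
    {"question": "q3", "answer": "no"},
    {"question": "q4", "answer": "yes"},
]

def always_no(question):
    return "no"

score = accuracy(always_no, held_out, exact_match)
print(score)
```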


Now you can run each cell in this notebook step by step to walk through the process of setting up and optimizing prompts and answer-extraction code for BigBench-Hard tasks. Happy optimizing!