BigBench-Hard#

Introduction#

In this notebook, we demonstrate how to use the trace package to optimize prompts and code for natural language processing tasks on the BigBench-Hard benchmark. State-of-the-art approaches on this benchmark optimize only the prompt, while relying on hand-written code to extract answers from LLM responses. By leveraging the LLM-based optimizers provided in trace, we aim to improve the entire workflow, both the prompts sent to the LLM and the code that post-processes its responses, so that it produces accurate and relevant answers.

Setup#

First, we’ll import the necessary packages and set up our environment. We will use a copy of the BigBench-Hard benchmark hosted on HuggingFace. To use HuggingFace datasets, ensure that you have the datasets package installed:

Note

To replicate the experiments in the paper, run the script provided in the repository: microsoft/Trace

%pip install datasets
%pip install trace-opt
# Import necessary libraries
import autogen
from opto.trace.nodes import node, GRAPH, ParameterNode
from opto.optimizers import OptoPrime
from datasets import load_dataset
from textwrap import dedent
from opto.trace.bundle import bundle
from opto.trace.modules import model
from opto.trace.errors import ExecutionError
from opto.trace.nodes import ExceptionNode
from typing import List
import re
import os
import ipywidgets as widgets
from IPython.display import display

# Function to save the environment variable and API key
def save_env_variable(env_name, api_key):
    # Validate inputs
    if not env_name.strip():
        print("⚠️ Environment variable name cannot be empty.")
        return
    if not api_key.strip():
        print("⚠️ API key cannot be empty.")
        return
    
    # Store the API key as an environment variable
    os.environ[env_name] = api_key
    globals()[env_name] = api_key  # Set it as a global variable
    print(f"✅ API key has been set for environment variable: {env_name}")

# Create the input widgets
env_name_input = widgets.Text(
    value="OPENAI_API_KEY",  # Default value
    description="Env Name:",
    placeholder="Enter env variable name (e.g., MY_API_KEY)",
)

api_key_input = widgets.Password(
    description="API Key:",
    placeholder="Enter your API key",
)

# Create the button to submit the inputs
submit_button = widgets.Button(description="Set API Key")

# Display the widgets
display(env_name_input, api_key_input, submit_button)

# Callback function for the button click
def on_button_click(b):
    env_name = env_name_input.value
    api_key = api_key_input.value
    save_env_variable(env_name, api_key)

# Attach the callback to the button
submit_button.on_click(on_button_click)

Define the Evaluation Function#

Next, we’ll define the utility function for evaluating answers obtained by prompting an LLM.

def eval_metric(true, prediction):
    # Multiple-choice targets look like "(A)": compare the last "(X)" pattern
    # found in the prediction against the target. Otherwise, require an exact
    # string match between prediction and target.
    matches = re.findall(r"\([A-Z]\)", true)
    if matches:
        matches = re.findall(r"\([A-Z]\)", prediction)
        parsed_answer = matches[-1] if matches else ""
        return parsed_answer == true
    else:
        return prediction == true
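
As a quick, illustrative sanity check (these example strings are hypothetical and not drawn from the benchmark): multiple-choice targets such as "(A)" are compared against the last "(X)" pattern found in the prediction, while free-form targets require an exact, case-sensitive match.

print(eval_metric("(A)", "The answer is (A)"))              # True
print(eval_metric("(A)", "Maybe (B)... final answer (A)"))  # True: only the last match counts
print(eval_metric("no", "no"))                              # True: exact match for free-form targets
print(eval_metric("no", "No"))                              # False: the comparison is case-sensitive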

Helper Class#

We’ll create a helper class called LLMCallable to interact with the LLM API.

class LLMCallable:
    def __init__(self, config_list=None, max_tokens=1024, verbose=False):
        if config_list is None:
            config_list = autogen.config_list_from_json("OAI_CONFIG_LIST")
        self.llm = autogen.OpenAIWrapper(config_list=config_list)
        self.max_tokens = max_tokens
        self.verbose = verbose

    @bundle(catch_execution_error=True)
    def call_llm(self, user_prompt):
        system_prompt = "You are a helpful assistant.\n"
        messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}]
        response = self.llm.create(messages=messages, max_tokens=self.max_tokens)
        response = response.choices[0].message.content

        if self.verbose:
            print("LLM response:\n", response)
        return response
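
Because call_llm is wrapped with @bundle, calling it returns a traced node (a MessageNode) rather than a plain string; the underlying text is available through its .data attribute. Below is a minimal usage sketch, assuming LLM credentials are already configured (e.g., a valid OAI_CONFIG_LIST file or a config_list argument); the prompt is just an illustrative placeholder.

llm = LLMCallable(verbose=False)
reply = llm.call_llm("Answer in one word: what is the capital of France?")
print(reply)       # a traced MessageNode
print(reply.data)  # the raw string returned by the LLM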

Define a Traced Class#

We will define a Predict class to generate predictions using the LLM. Note that we use the model decorator provided by trace (imported from opto.trace.modules), which wraps a Python class to enable tracing.

@model
class Predict(LLMCallable):
    def __init__(self):
        super().__init__()

        self.demos = []
        self.prompt_template = dedent(
        """
        Given the fields `question`, produce the fields `answer`.

        ---

        Follow the following format.

        Question: 
        Answer: 

        ---
        Question: {}
        Answer:
        """
        )
        self.prompt_template = ParameterNode(self.prompt_template, trainable=True,
                                             description="This is the Prompt Template to the LLM. " + \
                                                         "Need to include information about what the format of answers LLM should output. " + \
                                                         "They can be (A)/(B), a number like 8, or a string, or Yes/No.")

    @bundle(trainable=True, catch_execution_error=True, allow_external_dependencies=True)
    def extract_answer(self, prompt_template, question, response):
        answer = response.split("Answer:")[1].strip()
        return answer

    @bundle(trainable=True, catch_execution_error=True, allow_external_dependencies=True)
    def create_prompt(self, prompt_template, question):
        return prompt_template.format(question)

    def forward(self, question):
        user_prompt = self.create_prompt(self.prompt_template, question)
        response = self.call_llm(user_prompt)
        answer = self.extract_answer(self.prompt_template, question, response)
        return answer
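
As a sketch of how the wrapped class behaves at runtime (again assuming LLM credentials are configured; the question below is an illustrative placeholder), forward returns a traced node whose .data attribute holds the extracted answer, while the full computation is recorded in the trace graph for the optimizer to inspect.

dp = Predict()
answer = dp.forward('Is the following sentence plausible? "The goalkeeper scored a header."')
print(answer.data)  # the string extracted by extract_answer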

Define the Training Loop#

Note that prompt_template is a ParameterNode and extract_answer is a trainable function; trace handles the optimization of such heterogeneous parameters seamlessly.
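
To see what the optimizer will receive, we can list the trainable parameters of a Predict instance; the sketch below (illustrative, with output abbreviated) should print names such as str:0 for the prompt template and __code:0/__code:1 for the two trainable methods, matching the variables shown in the training log later in this notebook.

dp = Predict()  # or reuse the instance from the sketch above
for p in dp.parameters():
    print(p.name)  # e.g., __code:0, __code:1, str:0

The training loop below feeds these parameters, together with execution feedback, to the optimizer at each step.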

def train(dp, optimizer, examples):
    for step, example in enumerate(examples):
        try:
            # Run the traced workflow; the result is a node in the trace graph.
            response = dp.forward(example['question'])
            correctness = eval_metric(example['answer'], response)
            feedback = "The answer is correct! No need to change anything." if correctness else f"The answer is wrong. We expect the output of your answer to be \"{example['answer']}\". Please modify the prompt and relevant parts of the program to help LLM produce the right answer."
        except ExecutionError as e:
            # Runtime errors are captured as exception nodes; their message
            # doubles as the feedback given to the optimizer.
            response = e.exception_node
            feedback = response.data
            correctness = False
            
        print("Question:", example["question"])
        print("Expected answer:", example["answer"])
        print("Answer:", response)

        if correctness:
            continue  # Skip the update when the answer is already correct.

        # Propagate the feedback through the trace graph and update parameters.
        optimizer.zero_feedback()
        optimizer.backward(response, feedback)

        print(f"Output: {response}, Feedback: {feedback}, Variables:")  # Logging
        for p in optimizer.parameters:
            print(p.name, p.data)
        optimizer.step(verbose=True)

Putting it all together#

Finally, we use the optimizer to improve the prompt and the answer-extraction code on a small training set, and then evaluate the optimized workflow on a few held-out examples.

task = "sports_understanding"
train_set = load_dataset("maveriq/bigbenchhard", task)["train"]
examples = [{"question": r["input"], "answer": r["target"]} for r in train_set]

dp = Predict()
optimizer = OptoPrime(dp.parameters(),
                      config_list=autogen.config_list_from_json("OAI_CONFIG_LIST"))

print("Training on a few examples:")
train(dp, optimizer, examples[:5])

test_accuracy = []
print("\nTesting on new examples:")
for example in examples[5:10]:
    try:
        response = dp.forward(example["question"])
        correctness = eval_metric(example["answer"], response.data)
    except ExecutionError as e:
        correctness = 0

    test_accuracy.append(correctness)

print("Accuracy: ", sum(test_accuracy) / len(test_accuracy))
Training on a few examples:
Question: Is the following sentence plausible? "Elias Lindholm beat the buzzer."
Expected answer: no
Answer: MessageNode: (eval:1, dtype=<class 'str'>, data=Yes, the sentence "Elias Lindholm beat the buzzer" is plausible. It is commonly used in sports contexts to describe a scenario where a player, such as Elias Lindholm in ice hockey, scores just before the time runs out in a period or game.)
Output: MessageNode: (eval:1, dtype=<class 'str'>, data=Yes, the sentence "Elias Lindholm beat the buzzer" is plausible. It is commonly used in sports contexts to describe a scenario where a player, such as Elias Lindholm in ice hockey, scores just before the time runs out in a period or game.), Feedback: The answer is wrong. We expect the output of your answer to be "no". Please modify the prompt and relevant parts of the program to help LLM produce the right answer., Variables:
__code:1 def create_prompt(self, prompt_template, question):
        return prompt_template.format(question)
__code:0 def extract_answer(self, prompt_template, question, response):
        answer = response.split("Answer:")[1].strip()
        return answer
str:0 
Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: 
Answer: 

---
Question: {}
Answer:

Prompt
 
You're tasked to solve a coding/algorithm problem. You will see the instruction, the code, the documentation of each function used in the code, and the feedback about the execution result.

Specifically, a problem will be composed of the following parts:
- #Instruction: the instruction which describes the things you need to do or the question you should answer.
- #Code: the code defined in the problem.
- #Documentation: the documentation of each function used in #Code. The explanation might be incomplete and just contain high-level description. You can use the values in #Others to help infer how those functions work.
- #Variables: the input variables that you can change.
- #Constraints: the constraints or descriptions of the variables in #Variables.
- #Inputs: the values of other inputs to the code, which are not changeable.
- #Others: the intermediate values created through the code execution.
- #Outputs: the result of the code output.
- #Feedback: the feedback about the code's execution result.

In #Variables, #Inputs, #Outputs, and #Others, the format is:

<data_type> <variable_name> = <value>

If <type> is (code), it means <value> is the source code of a python code, which may include docstring and definitions.

Output_format: Your output should be in the following json format, satisfying the json syntax:

{{
"reasoning": <Your reasoning>,
"answer": <Your answer>,
"suggestion": {{
    <variable_1>: <suggested_value_1>,
    <variable_2>: <suggested_value_2>,
}}
}}

In "reasoning", explain the problem: 1. what the #Instruction means 2. what the #Feedback on #Output means to #Variables considering how #Variables are used in #Code and other values in #Documentation, #Inputs, #Others. 3. Reasoning about the suggested changes in #Variables (if needed) and the expected result.

If #Instruction asks for an answer, write it down in "answer".

If you need to suggest a change in the values of #Variables, write down the suggested values in "suggestion". Remember you can change only the values in #Variables, not others. When <type> of a variable is (code), you should write the new definition in the format of python code without syntax errors, and you should not change the function name or the function signature.

If no changes or answer are needed, just output TERMINATE.

Now you see problem instance:

================================

#Instruction
You need to change the <value> of the variables in #Variables to improve the output in accordance to #Feedback.

#Code
eval0 = eval(self=self0, prompt_template=str0, question=question0, __code=__code1)
LLMCallable.call_llm0 = LLMCallable.call_llm(self=self1, user_prompt=eval0)
eval1 = eval(self=self2, prompt_template=str0, question=question1, response=LLMCallable.call_llm0, __code=__code0)

#Documentation
[eval] This operator eval(__code, *args, **kwargs) evaluates the code block, where __code is the code (str) and *args and **kwargs are the arguments of the function. The output is the result of the evaluation, i.e., __code(*args, **kwargs).
[LLMCallable.call_llm] .

#Variables
(str) str0=
Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: 
Answer: 

---
Question: {}
Answer:

(code) __code1:def create_prompt(self, prompt_template, question):
        return prompt_template.format(question)
(code) __code0:def extract_answer(self, prompt_template, question, response):
        answer = response.split("Answer:")[1].strip()
        return answer

#Constraints
(code) __code1: The code should start with:
def create_prompt(self, prompt_template, question):
(code) __code0: The code should start with:
def extract_answer(self, prompt_template, question, response):

#Inputs
(ModelWrapper) self2=<opto.trace.modules.model.<locals>.ModelWrapper object at 0x7f4bd428a410>
(str) question1=Is the following sentence plausible? "Elias Lindholm beat the buzzer."
(ModelWrapper) self1=<opto.trace.modules.model.<locals>.ModelWrapper object at 0x7f4bd428a410>
(ModelWrapper) self0=<opto.trace.modules.model.<locals>.ModelWrapper object at 0x7f4bd428a410>
(str) question0=Is the following sentence plausible? "Elias Lindholm beat the buzzer."

#Others
(str) eval0=
Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: 
Answer: 

---
Question: Is the following sentence plausible? "Elias Lindholm beat the buzzer."
Answer:

(str) LLMCallable.call_llm0=Question: Is the following sentence plausible? "Elias Lindholm beat the buzzer."
Answer: Yes, the sentence "Elias Lindholm beat the buzzer" is plausible. It is commonly used in sports contexts to describe a scenario where a player, such as Elias Lindholm in ice hockey, scores just before the time runs out in a period or game.

#Outputs
(str) eval1=Yes, the sentence "Elias Lindholm beat the buzzer" is plausible. It is commonly used in sports contexts to describe a scenario where a player, such as Elias Lindholm in ice hockey, scores just before the time runs out in a period or game.

#Feedback
The answer is wrong. We expect the output of your answer to be "no". Please modify the prompt and relevant parts of the program to help LLM produce the right answer.

================================


Your response:

LLM response:
 {
  "reasoning": "The objective is to modify the prompt template (str0) such that the Language Model (LLM) produces an answer that aligns with the expected answer 'no' as opposed to 'yes'. The function 'create_prompt' as defined by '__code1', uses the prompt template (str0) to format a question. The prompt template instructs how the question should be posed to the LLM. Currently, the prompt template is generic and does not guide the LLM towards evaluating the plausibility of the statement correctly. Since the feedback demands a specific answer ('no'), the prompt template itself should be tailored to facilitate evaluating the plausibility specifically, possibly by providing a context or criteria under which the statement might be deemed implausible. The function 'extract_answer' as defined by '__code0' correctly identifies the answer segment from the LLM's output, and hence does not require changes.",
  "suggestion": {
    "str0": "Given the grammatical fields `question`, produce the fields `answer`.\n\n---\n\nPlease analyze the grammatical plausibility of the question provided:\n\nQuestion: {}\nAnswer:"
  }
}
Question: Is the following sentence plausible? "John Carlson scored in the third period."
Expected answer: yes
Answer: MessageNode: (exception_eval:0, dtype=<class 'str'>, data=(IndexError) list index out of range)
Output: MessageNode: (exception_eval:0, dtype=<class 'str'>, data=(IndexError) list index out of range), Feedback: (IndexError) list index out of range, Variables:
__code:1 def create_prompt(self, prompt_template, question):
        return prompt_template.format(question)
__code:0 def extract_answer(self, prompt_template, question, response):
        answer = response.split("Answer:")[1].strip()
        return answer
str:0 Given the grammatical fields `question`, produce the fields `answer`.

---

Please analyze the grammatical plausibility of the question provided:

Question: {}
Answer:
Prompt
 
You're tasked to solve a coding/algorithm problem. You will see the instruction, the code, the documentation of each function used in the code, and the feedback about the execution result.

Specifically, a problem will be composed of the following parts:
- #Instruction: the instruction which describes the things you need to do or the question you should answer.
- #Code: the code defined in the problem.
- #Documentation: the documentation of each function used in #Code. The explanation might be incomplete and just contain high-level description. You can use the values in #Others to help infer how those functions work.
- #Variables: the input variables that you can change.
- #Constraints: the constraints or descriptions of the variables in #Variables.
- #Inputs: the values of other inputs to the code, which are not changeable.
- #Others: the intermediate values created through the code execution.
- #Outputs: the result of the code output.
- #Feedback: the feedback about the code's execution result.

In #Variables, #Inputs, #Outputs, and #Others, the format is:

<data_type> <variable_name> = <value>

If <type> is (code), it means <value> is the source code of a python code, which may include docstring and definitions.

Output_format: Your output should be in the following json format, satisfying the json syntax:

{{
"reasoning": <Your reasoning>,
"answer": <Your answer>,
"suggestion": {{
    <variable_1>: <suggested_value_1>,
    <variable_2>: <suggested_value_2>,
}}
}}

In "reasoning", explain the problem: 1. what the #Instruction means 2. what the #Feedback on #Output means to #Variables considering how #Variables are used in #Code and other values in #Documentation, #Inputs, #Others. 3. Reasoning about the suggested changes in #Variables (if needed) and the expected result.

If #Instruction asks for an answer, write it down in "answer".

If you need to suggest a change in the values of #Variables, write down the suggested values in "suggestion". Remember you can change only the values in #Variables, not others. When <type> of a variable is (code), you should write the new definition in the format of python code without syntax errors, and you should not change the function name or the function signature.

If no changes or answer are needed, just output TERMINATE.

Now you see problem instance:

================================

#Instruction
You need to change the <value> of the variables in #Variables to improve the output in accordance to #Feedback.

#Code
eval2 = eval(self=self3, prompt_template=str0, question=question2, __code=__code1)
LLMCallable.call_llm1 = LLMCallable.call_llm(self=self4, user_prompt=eval2)
exception_eval0 = eval(self=self5, prompt_template=str0, question=question3, response=LLMCallable.call_llm1, __code=__code0)

#Documentation
[exception] The operator eval raises an exception.
[LLMCallable.call_llm] .

#Variables
(str) str0=Given the grammatical fields `question`, produce the fields `answer`.

---

Please analyze the grammatical plausibility of the question provided:

Question: {}
Answer:
(code) __code1:def create_prompt(self, prompt_template, question):
        return prompt_template.format(question)
(code) __code0:def extract_answer(self, prompt_template, question, response):
        answer = response.split("Answer:")[1].strip()
        return answer

#Constraints
(code) __code1: The code should start with:
def create_prompt(self, prompt_template, question):
(code) __code0: The code should start with:
def extract_answer(self, prompt_template, question, response):

#Inputs
(ModelWrapper) self5=<opto.trace.modules.model.<locals>.ModelWrapper object at 0x7f4bd428a410>
(str) question3=Is the following sentence plausible? "John Carlson scored in the third period."
(ModelWrapper) self4=<opto.trace.modules.model.<locals>.ModelWrapper object at 0x7f4bd428a410>
(ModelWrapper) self3=<opto.trace.modules.model.<locals>.ModelWrapper object at 0x7f4bd428a410>
(str) question2=Is the following sentence plausible? "John Carlson scored in the third period."

#Others
(str) eval2=Given the grammatical fields `question`, produce the fields `answer`.

---

Please analyze the grammatical plausibility of the question provided:

Question: Is the following sentence plausible? "John Carlson scored in the third period."
Answer:
(str) LLMCallable.call_llm1=Yes, the sentence "John Carlson scored in the third period." is grammatically plausible. It follows the standard subject-verb-object format and provides clear information about who did what and when. The subject is "John Carlson," the verb is "scored," and the phrase "in the third period" provides additional context about the timing of the action.

#Outputs
(str) exception_eval0=(IndexError) list index out of range

#Feedback
(IndexError) list index out of range

================================


Your response:

LLM response:
 {
    "reasoning": "The IndexError 'list index out of range' usually results from trying to access an index in a list or string that doesn't exist. From the provided feedback, we see that the exception occurs when executing `exception_eval0`. Specifically, the exception arises in the `extract_answer` function defined in `__code0`. The assignment `answer = response.split(\"Answer:\")[1].strip()` attempts to access the second element (index 1) after splitting `response` by the substring 'Answer:'. If 'Answer:' isn't found or there is no information following 'Answer:', the index [1] will be out of range, leading to the IndexError. In the string `LLMCallable.call_llm1`, the sentence 'Answer:' does not immediately follow a colon but merely appears in the string, hence it doesn't split as expected. \nThe fix involves ensuring that 'Answer:' appears correctly formatted so when split, it follows with enough content to access index [1]. Hence, the structure of `str0` should correctly prepare the `response` to include 'Answer:' followed by the answer content.",
    "suggestion": {
        "str0": "Given the grammatical fields `question`, produce the fields `answer`: Answer: {}"
    }
}
Question: Is the following sentence plausible? "Marcelo got on the end of a through ball."
Expected answer: yes
Answer: MessageNode: (eval:4, dtype=<class 'str'>, data=Yes, the sentence "Marcelo got on the end of a through ball" is plausible. It is commonly used in sports commentary, especially in soccer (football), to describe a player receiving a pass that goes through the defense.)
Output: MessageNode: (eval:4, dtype=<class 'str'>, data=Yes, the sentence "Marcelo got on the end of a through ball" is plausible. It is commonly used in sports commentary, especially in soccer (football), to describe a player receiving a pass that goes through the defense.), Feedback: The answer is wrong. We expect the output of your answer to be "yes". Please modify the prompt and relevant parts of the program to help LLM produce the right answer., Variables:
__code:1 def create_prompt(self, prompt_template, question):
        return prompt_template.format(question)
__code:0 def extract_answer(self, prompt_template, question, response):
        answer = response.split("Answer:")[1].strip()
        return answer
str:0 Given the grammatical fields `question`, produce the fields `answer`: Answer: {}
Prompt
 
You're tasked to solve a coding/algorithm problem. You will see the instruction, the code, the documentation of each function used in the code, and the feedback about the execution result.

Specifically, a problem will be composed of the following parts:
- #Instruction: the instruction which describes the things you need to do or the question you should answer.
- #Code: the code defined in the problem.
- #Documentation: the documentation of each function used in #Code. The explanation might be incomplete and just contain high-level description. You can use the values in #Others to help infer how those functions work.
- #Variables: the input variables that you can change.
- #Constraints: the constraints or descriptions of the variables in #Variables.
- #Inputs: the values of other inputs to the code, which are not changeable.
- #Others: the intermediate values created through the code execution.
- #Outputs: the result of the code output.
- #Feedback: the feedback about the code's execution result.

In #Variables, #Inputs, #Outputs, and #Others, the format is:

<data_type> <variable_name> = <value>

If <type> is (code), it means <value> is the source code of a python code, which may include docstring and definitions.

Output_format: Your output should be in the following json format, satisfying the json syntax:

{{
"reasoning": <Your reasoning>,
"answer": <Your answer>,
"suggestion": {{
    <variable_1>: <suggested_value_1>,
    <variable_2>: <suggested_value_2>,
}}
}}

In "reasoning", explain the problem: 1. what the #Instruction means 2. what the #Feedback on #Output means to #Variables considering how #Variables are used in #Code and other values in #Documentation, #Inputs, #Others. 3. Reasoning about the suggested changes in #Variables (if needed) and the expected result.

If #Instruction asks for an answer, write it down in "answer".

If you need to suggest a change in the values of #Variables, write down the suggested values in "suggestion". Remember you can change only the values in #Variables, not others. When <type> of a variable is (code), you should write the new definition in the format of python code without syntax errors, and you should not change the function name or the function signature.

If no changes or answer are needed, just output TERMINATE.

Now you see problem instance:

================================

#Instruction
You need to change the <value> of the variables in #Variables to improve the output in accordance to #Feedback.

#Code
eval3 = eval(self=self6, prompt_template=str0, question=question4, __code=__code1)
LLMCallable.call_llm2 = LLMCallable.call_llm(self=self7, user_prompt=eval3)
eval4 = eval(self=self8, prompt_template=str0, question=question5, response=LLMCallable.call_llm2, __code=__code0)

#Documentation
[eval] This operator eval(__code, *args, **kwargs) evaluates the code block, where __code is the code (str) and *args and **kwargs are the arguments of the function. The output is the result of the evaluation, i.e., __code(*args, **kwargs).
[LLMCallable.call_llm] .

#Variables
(str) str0=Given the grammatical fields `question`, produce the fields `answer`: Answer: {}
(code) __code1:def create_prompt(self, prompt_template, question):
        return prompt_template.format(question)
(code) __code0:def extract_answer(self, prompt_template, question, response):
        answer = response.split("Answer:")[1].strip()
        return answer

#Constraints
(code) __code1: The code should start with:
def create_prompt(self, prompt_template, question):
(code) __code0: The code should start with:
def extract_answer(self, prompt_template, question, response):

#Inputs
(ModelWrapper) self8=<opto.trace.modules.model.<locals>.ModelWrapper object at 0x7f4bd428a410>
(str) question5=Is the following sentence plausible? "Marcelo got on the end of a through ball."
(ModelWrapper) self7=<opto.trace.modules.model.<locals>.ModelWrapper object at 0x7f4bd428a410>
(ModelWrapper) self6=<opto.trace.modules.model.<locals>.ModelWrapper object at 0x7f4bd428a410>
(str) question4=Is the following sentence plausible? "Marcelo got on the end of a through ball."

#Others
(str) eval3=Given the grammatical fields `question`, produce the fields `answer`: Answer: Is the following sentence plausible? "Marcelo got on the end of a through ball."
(str) LLMCallable.call_llm2=Answer: Yes, the sentence "Marcelo got on the end of a through ball" is plausible. It is commonly used in sports commentary, especially in soccer (football), to describe a player receiving a pass that goes through the defense.

#Outputs
(str) eval4=Yes, the sentence "Marcelo got on the end of a through ball" is plausible. It is commonly used in sports commentary, especially in soccer (football), to describe a player receiving a pass that goes through the defense.

#Feedback
The answer is wrong. We expect the output of your answer to be "yes". Please modify the prompt and relevant parts of the program to help LLM produce the right answer.

================================


Your response:

LLM response:
 {
    "reasoning": "The goal is to adjust the necessary variables so the final output (`eval4`) from the language model (LLM) is 'yes'. Analyzing the processes in the code reveals that `str0`, `__code1` and `__code0` are instrumental in crafting the prompt passed into the LLM and formatting its response to extract an answer. The original format of the string template `str0` used to generate the prompt seems to influence how detailed the LLM answers. The function `create_prompt` builds the prompt using `str0`, but the prompt explicitly includes an instruction to 'produce the fields' which might result in more elaborate answers from the LLM, not just a simple 'yes' or 'no'. To change this, we can tweak `str0` to request a concise response. The output is essentially derived from how the LLM responds and due to how we extract the `answer` using `__code0`, it takes everything following 'Answer:'. Given the feedback is seeking a precise 'yes' answer, adjustments to both the request template (`str0`) and the extraction method can be made to align outputs with expectations.",
    "answer": "",
    "suggestion": {
        "str0": "Given the grammatical fields `question`, specify whether the statement is plausible in one word (yes/no):",
        "__code0": "def extract_answer(self, prompt_template, question, response):\n answer = response.split(' ')[1].strip()\n return answer"
    }
}
Question: Is the following sentence plausible? "Deshaun Watson was called for the goal tend in the Eastern Conference Finals."
Expected answer: no
Answer: MessageNode: (eval:6, dtype=<class 'str'>, data=provide)
Output: MessageNode: (eval:6, dtype=<class 'str'>, data=provide), Feedback: The answer is wrong. We expect the output of your answer to be "no". Please modify the prompt and relevant parts of the program to help LLM produce the right answer., Variables:
__code:1 def create_prompt(self, prompt_template, question):
        return prompt_template.format(question)
__code:0 def extract_answer(self, prompt_template, question, response):
 answer = response.split(' ')[1].strip()
 return answer
str:0 Given the grammatical fields `question`, specify whether the statement is plausible in one word (yes/no):
Prompt
 
You're tasked to solve a coding/algorithm problem. You will see the instruction, the code, the documentation of each function used in the code, and the feedback about the execution result.

Specifically, a problem will be composed of the following parts:
- #Instruction: the instruction which describes the things you need to do or the question you should answer.
- #Code: the code defined in the problem.
- #Documentation: the documentation of each function used in #Code. The explanation might be incomplete and just contain high-level description. You can use the values in #Others to help infer how those functions work.
- #Variables: the input variables that you can change.
- #Constraints: the constraints or descriptions of the variables in #Variables.
- #Inputs: the values of other inputs to the code, which are not changeable.
- #Others: the intermediate values created through the code execution.
- #Outputs: the result of the code output.
- #Feedback: the feedback about the code's execution result.

In #Variables, #Inputs, #Outputs, and #Others, the format is:

<data_type> <variable_name> = <value>

If <type> is (code), it means <value> is the source code of a python code, which may include docstring and definitions.

Output_format: Your output should be in the following json format, satisfying the json syntax:

{{
"reasoning": <Your reasoning>,
"answer": <Your answer>,
"suggestion": {{
    <variable_1>: <suggested_value_1>,
    <variable_2>: <suggested_value_2>,
}}
}}

In "reasoning", explain the problem: 1. what the #Instruction means 2. what the #Feedback on #Output means to #Variables considering how #Variables are used in #Code and other values in #Documentation, #Inputs, #Others. 3. Reasoning about the suggested changes in #Variables (if needed) and the expected result.

If #Instruction asks for an answer, write it down in "answer".

If you need to suggest a change in the values of #Variables, write down the suggested values in "suggestion". Remember you can change only the values in #Variables, not others. When <type> of a variable is (code), you should write the new definition in the format of python code without syntax errors, and you should not change the function name or the function signature.

If no changes or answer are needed, just output TERMINATE.

Now you see problem instance:

================================

#Instruction
You need to change the <value> of the variables in #Variables to improve the output in accordance to #Feedback.

#Code
eval5 = eval(self=self9, prompt_template=str0, question=question6, __code=__code1)
LLMCallable.call_llm3 = LLMCallable.call_llm(self=self10, user_prompt=eval5)
eval6 = eval(self=self11, prompt_template=str0, question=question7, response=LLMCallable.call_llm3, __code=__code0)

#Documentation
[eval] This operator eval(__code, *args, **kwargs) evaluates the code block, where __code is the code (str) and *args and **kwargs are the arguments of the function. The output is the result of the evaluation, i.e., __code(*args, **kwargs).
[LLMCallable.call_llm] .

#Variables
(str) str0=Given the grammatical fields `question`, specify whether the statement is plausible in one word (yes/no):
(code) __code1:def create_prompt(self, prompt_template, question):
        return prompt_template.format(question)
(code) __code0:def extract_answer(self, prompt_template, question, response):
 answer = response.split(' ')[1].strip()
 return answer

#Constraints
(code) __code1: The code should start with:
def create_prompt(self, prompt_template, question):
(code) __code0: The code should start with:
def extract_answer(self, prompt_template, question, response):

#Inputs
(ModelWrapper) self11=<opto.trace.modules.model.<locals>.ModelWrapper object at 0x7f4bd428a410>
(str) question7=Is the following sentence plausible? "Deshaun Watson was called for the goal tend in the Eastern Conference Finals."
(ModelWrapper) self10=<opto.trace.modules.model.<locals>.ModelWrapper object at 0x7f4bd428a410>
(ModelWrapper) self9=<opto.trace.modules.model.<locals>.ModelWrapper object at 0x7f4bd428a410>
(str) question6=Is the following sentence plausible? "Deshaun Watson was called for the goal tend in the Eastern Conference Finals."

#Others
(str) eval5=Given the grammatical fields `question`, specify whether the statement is plausible in one word (yes/no):
(str) LLMCallable.call_llm3=Please provide the statement for evaluation.

#Outputs
(str) eval6=provide

#Feedback
The answer is wrong. We expect the output of your answer to be "no". Please modify the prompt and relevant parts of the program to help LLM produce the right answer.

================================


Your response:

LLM response:
 {
  "reasoning": "The code block executes a sequence of operations involving assessment and response generation based on the question of the plausibility of a statement. The problem lies in the provided string in str0 and possibly the formulation of calls and return statements in __code0. The variable str0 is meant to construct a prompt template, and __code1 formats this to include the question. However, str0 fails to communicate clearly to the LLM that it needs to evaluate the truth of the statement, rather than asking for an explanation or any other output format. This unclear prompt might lead to LLMCallable.call_llm3 generating an unhelpful response like 'Please provide the statement for evaluation.' Subsequently, __code0 is expected to extract the answer, but its logic needs refinement to reliably parse the response from LLM. The variable eval6 results in 'provide' which indicates the LLM response was not correctly parsed and also shows it might not have been guided correctly by the prompt to give a yes/no answer. To address the issue, we need to reformulate the prompt in str0 and adjust the response parsing logic in __code0 to ensure a correct and clear extraction of 'yes' or 'no'.",
  "suggestion": {
    "str0": "Is the statement \"{question}\" plausible? Answer 'yes' or 'no'.",
    "__code0": "def extract_answer(self, prompt_template, question, response):\n answer = response.strip().lower()\n return answer"
  }
}
Question: Is the following sentence plausible? "Mookie Betts skated behind the net."
Expected answer: no
Answer: MessageNode: (exception_eval:1, dtype=<class 'str'>, data=(KeyError) 'question')
Output: MessageNode: (exception_eval:1, dtype=<class 'str'>, data=(KeyError) 'question'), Feedback: (KeyError) 'question', Variables:
__code:1 def create_prompt(self, prompt_template, question):
        return prompt_template.format(question)
__code:0 def extract_answer(self, prompt_template, question, response):
 answer = response.strip().lower()
 return answer
str:0 Is the statement "{question}" plausible? Answer 'yes' or 'no'.
Prompt
 
You're tasked to solve a coding/algorithm problem. You will see the instruction, the code, the documentation of each function used in the code, and the feedback about the execution result.

Specifically, a problem will be composed of the following parts:
- #Instruction: the instruction which describes the things you need to do or the question you should answer.
- #Code: the code defined in the problem.
- #Documentation: the documentation of each function used in #Code. The explanation might be incomplete and just contain high-level description. You can use the values in #Others to help infer how those functions work.
- #Variables: the input variables that you can change.
- #Constraints: the constraints or descriptions of the variables in #Variables.
- #Inputs: the values of other inputs to the code, which are not changeable.
- #Others: the intermediate values created through the code execution.
- #Outputs: the result of the code output.
- #Feedback: the feedback about the code's execution result.

In #Variables, #Inputs, #Outputs, and #Others, the format is:

<data_type> <variable_name> = <value>

If <type> is (code), it means <value> is the source code of a python code, which may include docstring and definitions.

Output_format: Your output should be in the following json format, satisfying the json syntax:

{{
"reasoning": <Your reasoning>,
"answer": <Your answer>,
"suggestion": {{
    <variable_1>: <suggested_value_1>,
    <variable_2>: <suggested_value_2>,
}}
}}

In "reasoning", explain the problem: 1. what the #Instruction means 2. what the #Feedback on #Output means to #Variables considering how #Variables are used in #Code and other values in #Documentation, #Inputs, #Others. 3. Reasoning about the suggested changes in #Variables (if needed) and the expected result.

If #Instruction asks for an answer, write it down in "answer".

If you need to suggest a change in the values of #Variables, write down the suggested values in "suggestion". Remember you can change only the values in #Variables, not others. When <type> of a variable is (code), you should write the new definition in the format of python code without syntax errors, and you should not change the function name or the function signature.

If no changes or answer are needed, just output TERMINATE.

Now you see problem instance:

================================

#Instruction
You need to change the <value> of the variables in #Variables to improve the output in accordance to #Feedback.

#Code
exception_eval1 = eval(self=self12, prompt_template=str0, question=question8, __code=__code1)

#Documentation
[exception] The operator eval raises an exception.

#Variables
(str) str0=Is the statement "{question}" plausible? Answer 'yes' or 'no'.
(code) __code1:def create_prompt(self, prompt_template, question):
        return prompt_template.format(question)

#Constraints
(code) __code1: The code should start with:
def create_prompt(self, prompt_template, question):

#Inputs
(ModelWrapper) self12=<opto.trace.modules.model.<locals>.ModelWrapper object at 0x7f4bd428a410>
(str) question8=Is the following sentence plausible? "Mookie Betts skated behind the net."

#Others


#Outputs
(str) exception_eval1=(KeyError) 'question'

#Feedback
(KeyError) 'question'

================================


Your response:

LLM response:
 {
  "reasoning": "The operation fails because it is trying to use the Python str.format method in the __code1 function, and is looking for a key named 'question' in the format string. However, in the original format string 'str0', this key is represented as {question} inside double quotes, which Python cannot recognize as the names of the placeholders for the format method. As a result, the KeyError means that the template string 'str0' does not match the expected format string requirements where placeholder names are used directly without additional quotes or alterations.",
  "answer": "",
  "suggestion": {
    "str0": "Is the statement {question} plausible? Answer 'yes' or 'no'."
  }
}

Testing on new examples:
Accuracy:  0.0

Now, you can run each cell in this notebook step by step to walk through the process of setting up and optimizing prompts. Happy optimizing!