🧙 PromptWizard

Task-Aware Prompt Optimization Framework

Microsoft Research


PromptWizard is an open source framework for automated prompt and example optimization, leveraging a feedback-driven critique and synthesis process to balance exploration and exploitation. It consistently outperforms state-of-the-art methods while significantly reducing computational costs, enabling efficient and scalable prompt engineering across diverse tasks and LLMs.

Overview

Large language models (LLMs) like GPT-4 have achieved remarkable performance across diverse tasks. At the core of this success is prompting—the process of providing input instructions to guide models toward desired outputs. Studies have shown that prompting significantly influences LLM performance, making prompt engineering—the design and refinement of prompts—critical for maximizing accuracy. However, crafting effective prompts remains a labor-intensive and domain-specific task, requiring human expertise and subjective judgment. As models evolve and tasks vary, the need to repeatedly design prompts raises an important question:
Can prompt engineering be automated to streamline this process and enhance scalability?

Motivation

Prompting is central to LLMs!

  • Prompting: The process of providing input instructions to guide models toward desired outputs
  • Prompt Engineering: The process of designing and refining prompts
  • Crafting effective prompts is challenging because:
    1. The task is labor-intensive
    2. Prompts need to be domain-specific to work effectively
    3. It often requires human expertise and is subjective
    4. As models and tasks evolve, prompts need to be repeatedly redesigned

PromptWizard Working

PromptWizard (PW) is a discrete prompt optimization framework that employs a self-evolving mechanism where the LLM generates, critiques, and refines its own prompts and examples, continuously improving through iterative feedback and synthesis. This self-adaptive approach ensures holistic optimization by evolving both the instructions and in-context learning examples for better task performance.

Three Key Insights:

  1. Feedback-driven refinement: The LLM generates, critiques, and refines its own prompts and examples, continuously improving through iterative feedback and synthesis
  2. Critique and synthesize diverse examples: PromptWizard generates synthetic examples that are robust, diverse, and task-aware, and it optimizes the prompt and examples in tandem
  3. Self-generated Chain-of-Thought (CoT) steps built from a combination of positive, negative, and synthetic examples (a rough sketch of the optimized prompt state follows this list)
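
To keep the later steps concrete, below is a minimal sketch of the state that gets optimized: a task instruction plus few-shot examples, each carrying a self-generated reasoning chain and a positive/negative label. The class and field names here are illustrative assumptions, not PromptWizard's actual classes.

```python
# Minimal sketch (hypothetical structures, not PromptWizard's real API) of the
# objects being optimized: an instruction plus few-shot examples with
# self-generated reasoning chains.
from dataclasses import dataclass, field
from typing import List


@dataclass
class FewShotExample:
    question: str
    reasoning_chain: str       # self-generated Chain-of-Thought steps
    answer: str
    is_positive: bool = True   # negative examples capture observed failure modes


@dataclass
class PromptState:
    instruction: str                                   # the task instruction being refined
    examples: List[FewShotExample] = field(default_factory=list)

    def render(self) -> str:
        """Assemble the final prompt: instruction followed by worked examples."""
        parts = [self.instruction]
        for ex in self.examples:
            parts.append(f"Q: {ex.question}\nReasoning: {ex.reasoning_chain}\nA: {ex.answer}")
        return "\n\n".join(parts)
```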

The details of each step are as follows:

  • PromptWizard uses a systematic, feedback-driven process: it incorporates a critique component that provides feedback, guiding and refining the prompt over multiple iterations
  • The following steps carry this out systematically (a minimal sketch follows this list):
    • Mutate: Takes an initial problem description + thinking styles to generate candidate prompts
    • Scoring: Evaluates the performance of the generated prompts to determine the best prompt
    • Critique: Reviews where the prompt succeeded and failed by analyzing cases where the LLM struggled
    • Synthesize: Uses the critique's feedback to refine the best prompt
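
A minimal sketch of this Mutate → Score → Critique → Synthesize loop is shown below. It assumes a generic `llm(text) -> str` completion callable and a small labeled training set; the function names and prompt wordings are hypothetical, not the repository's API.

```python
# Illustrative sketch of the feedback-driven refinement loop (hypothetical
# helper names). `llm` is assumed to be any text-completion callable and
# `train_set` a list of (question, answer) pairs.
from typing import Callable, List, Tuple


def score(llm: Callable[[str], str], prompt: str, batch: List[Tuple[str, str]]) -> float:
    """Fraction of mini-batch questions answered correctly with this prompt
    (exact-match scoring is a simplification for the sketch)."""
    hits = sum(llm(f"{prompt}\n\nQ: {q}\nA:").strip() == a for q, a in batch)
    return hits / max(len(batch), 1)


def optimize_instruction(llm: Callable[[str], str],
                         problem_description: str,
                         thinking_styles: List[str],
                         train_set: List[Tuple[str, str]],
                         n_rounds: int = 3) -> str:
    best_prompt = problem_description
    for _ in range(n_rounds):
        # Mutate: combine the problem description with different thinking styles
        # to propose candidate prompt instructions.
        candidates = [llm(f"Rewrite this task instruction in the style '{style}':\n{best_prompt}")
                      for style in thinking_styles]
        # Scoring: evaluate each candidate on a mini-batch and keep the best one.
        scored = [(score(llm, c, train_set), c) for c in candidates + [best_prompt]]
        _, best_prompt = max(scored, key=lambda s: s[0])
        # Critique: ask the LLM to analyze cases where the best prompt failed.
        failures = [q for q, a in train_set if llm(f"{best_prompt}\n\nQ: {q}\nA:").strip() != a]
        critique = llm(f"The instruction below failed on these questions: {failures}\n"
                       f"Instruction: {best_prompt}\nExplain its weaknesses.")
        # Synthesize: use the critique's feedback to refine the best prompt.
        best_prompt = llm(f"Improve this instruction using the critique.\n"
                          f"Instruction: {best_prompt}\nCritique: {critique}")
    return best_prompt
```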

  • PromptWizard improves prompt instructions and few-shot examples in tandem
  • It uses self-reflection to synthesize examples that are diverse and task-relevant
  • An iterative feedback loop continuously refines both the prompt and the few-shot examples
  • Few-shot example optimization (a sketch follows this list):
    • Critique: Analyzes the previously selected examples and uses the feedback to determine how the examples should evolve
    • Synthesize: Incorporates the feedback to generate new synthetic examples that are more diverse, robust, and task-relevant
  • Prompt instruction optimization:
    • Critique: Identifies weaknesses and gaps that must be addressed to further refine the prompt instruction
    • Synthesize: Leverages the critique's feedback to synthesize and refine the prompt instruction
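
Below is a similarly hedged sketch of the few-shot example half of this loop, again assuming a generic `llm` callable; the function names and prompt wordings are illustrative only.

```python
# Illustrative sketch of the in-tandem few-shot example optimization
# (hypothetical names; `llm` is the same assumed text-completion callable).
from typing import Callable, List


def optimize_examples(llm: Callable[[str], str],
                      instruction: str,
                      examples: List[str],
                      n_rounds: int = 3) -> List[str]:
    for _ in range(n_rounds):
        # Critique: analyze the currently selected examples and decide how they
        # should evolve (coverage gaps, redundancy, difficulty).
        critique = llm(
            "Given this task instruction and its few-shot examples, point out "
            "what is missing or redundant and how the examples should evolve.\n"
            f"Instruction: {instruction}\nExamples:\n" + "\n".join(examples)
        )
        # Synthesize: generate new synthetic examples that address the critique,
        # making the set more diverse, robust, and task-relevant.
        synthesized = llm(
            "Write improved worked examples (question, reasoning, answer) that "
            f"address this critique:\n{critique}"
        )
        examples = synthesized.split("\n\n")
    return examples
```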

  • Incorporating chain-of-thought (CoT) reasoning improves the problem-solving abilities of the model
  • CoT reasoning takes the selected few-shot examples and generates a detailed reasoning chain for each example to facilitate problem-solving
  • An LLM is then used to check the coherence and relevance of the examples (see the sketch below)
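
A rough sketch of this CoT generation and validation step, under the same assumptions (generic `llm` callable, hypothetical helper names):

```python
# Illustrative sketch of self-generated CoT: for each selected example, ask the
# LLM for a detailed reasoning chain, then use the LLM again as a validator so
# only coherent, task-relevant chains enter the final prompt.
from typing import Callable, List, Tuple


def add_reasoning_chains(llm: Callable[[str], str],
                         examples: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    enriched = []
    for question, answer in examples:
        # Generate a detailed step-by-step reasoning chain for the example.
        chain = llm(f"Explain step by step how to reach the answer.\n"
                    f"Q: {question}\nA: {answer}")
        # Validate: a second LLM pass checks coherence and relevance; incoherent
        # chains are discarded.
        verdict = llm(f"Is this reasoning coherent and does it support the answer? "
                      f"Reply yes or no.\nQ: {question}\nReasoning: {chain}\nA: {answer}")
        if verdict.strip().lower().startswith("yes"):
            enriched.append((question, chain, answer))
    return enriched
```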

Results

PromptWizard outperforms the baselines, achieving the highest accuracy on 13/19 tasks (68%) with 0-shot and 16/19 (84%) with 1-shot

PromptWizard consistently performs near the best possible accuracy across all tasks

PromptWizard costs just $0.05 per task, a 5-60x reduction in overall tokens/cost



PromptWizard using Llama-70B shows a negligible (< 1%) drop in accuracy


PromptWizard shows strong resilience even with fewer training samples, mainly due to synthetic example generation and reasoning chains


Optimized prompts generated by PromptWizard yield substantial performance improvements across all models on the GSM8k dataset


Accuracy (higher is better)

Dataset    DSPy    PromptAgent    APO      PW
GSM8k      78.2    68.84          25.67    90
AQUARAT    55.1    56.67          20.12    58.2
SVAMP      77      78.67          75.25    82.3
ETHOS      84.1    84.25          80.62    89.4

API calls for optimization (lower is better)

Dataset    DSPy    PromptAgent    APO      PW
GSM8k      915     2115           8490     147
AQUARAT    920     2200           8500     112
SVAMP      2300    2111           8000     178
ETHOS      660     2217           8200     80

Tokens (lower is better)

Dataset    DSPy    PromptAgent    APO      PW
GSM8k      262     500            109      237
AQUARAT    326     875            125      200
SVAMP      189     680            85       127
ETHOS      175     417            55       190


PromptWizard outperforms feedback-based methods such as APO and PromptAgent, as well as other prompt optimization techniques such as DSPy, in both accuracy and the number of API calls required for optimization across various datasets.

BibTeX

@misc{agarwal2024promptwizardtaskawarepromptoptimization,
      title={PromptWizard: Task-Aware Prompt Optimization Framework}, 
      author={Eshaan Agarwal and Joykirat Singh and Vivek Dani and Raghav Magazine and Tanuja Ganu and Akshay Nambi},
      year={2024},
      eprint={2405.18369},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2405.18369}, 
}