🧙 PromptWizard

Task-Aware Prompt Optimization Framework

Microsoft Research


PromptWizard is an open source framework for automated prompt and example optimization, leveraging a feedback-driven critique and synthesis process to balance exploration and exploitation. It consistently outperforms state-of-the-art methods while significantly reducing computational costs, enabling efficient and scalable prompt engineering across diverse tasks and LLMs.

Overview

Large language models (LLMs) like GPT-4 have achieved remarkable performance across diverse tasks. At the core of this success is prompting—the process of providing input instructions to guide models toward desired outputs. Studies have shown that prompting significantly influences LLM performance, making prompt engineering—the design and refinement of prompts—critical for maximizing accuracy. However, crafting effective prompts remains a labor-intensive and domain-specific task, requiring human expertise and subjective judgment. As models evolve and tasks vary, the need to repeatedly design prompts raises an important question:
Can prompt engineering be automated to streamline this process and enhance scalability?

Motivation

Prompting is central to LLMs!

  • Prompting: The process of providing input instructions to guide models toward desired outputs
  • Prompt Engineering: The process of designing and refining prompts
  • Crafting effective prompts is challenging because:
    1. The task is labor-intensive
    2. Prompts need to be domain-specific to work effectively
    3. It often requires human expertise and is subjective
    4. As models and tasks evolve, prompts need to be repeatedly redesigned

PromptWizard Working

PromptWizard (PW) is a discrete prompt optimization framework that employs a self-evolving mechanism where the LLM generates, critiques, and refines its own prompts and examples, continuously improving through iterative feedback and synthesis. This self-adaptive approach ensures holistic optimization by evolving both the instructions and in-context learning examples for better task performance.

Three Key Insights:

  1. Feedback-driven refinement: The LLM generates, critiques, and refines its own prompts and examples, continuously improving through iterative feedback and synthesis
  2. Critique and synthesize diverse examples: PromptWizard generates synthetic examples that are robust, diverse, and task-aware, and it optimizes the prompt and examples in tandem
  3. Self-generated Chain-of-Thought (CoT) steps built from a combination of positive, negative, and synthetic examples (a rough sketch of the optimized prompt state follows this list)
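
To keep the later steps concrete, below is a minimal sketch of the state that gets optimized: a task instruction plus few-shot examples, each carrying a self-generated reasoning chain and a positive/negative label. The class and field names here are illustrative assumptions, not PromptWizard's actual classes.

```python
# Minimal sketch (hypothetical structures, not PromptWizard's real API) of the
# objects being optimized: an instruction plus few-shot examples with
# self-generated reasoning chains.
from dataclasses import dataclass, field
from typing import List


@dataclass
class FewShotExample:
    question: str
    reasoning_chain: str       # self-generated Chain-of-Thought steps
    answer: str
    is_positive: bool = True   # negative examples capture observed failure modes


@dataclass
class PromptState:
    instruction: str                                   # the task instruction being refined
    examples: List[FewShotExample] = field(default_factory=list)

    def render(self) -> str:
        """Assemble the final prompt: instruction followed by worked examples."""
        parts = [self.instruction]
        for ex in self.examples:
            parts.append(f"Q: {ex.question}\nReasoning: {ex.reasoning_chain}\nA: {ex.answer}")
        return "\n\n".join(parts)
```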

The details of each step are as follows:

  • PromptWizard uses a systematic, feedback-driven process: it incorporates a critique component that provides feedback, guiding and refining the prompt over multiple iterations
  • The following steps carry this out systematically (a minimal sketch follows this list):
    • Mutate: Takes an initial problem description + thinking styles to generate candidate prompts
    • Scoring: Evaluates the performance of the generated prompts to determine the best prompt
    • Critique: Reviews where the prompt succeeded and failed by analyzing cases where the LLM struggled
    • Synthesize: Uses the critique's feedback to refine the best prompt
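
A minimal sketch of this Mutate → Score → Critique → Synthesize loop is shown below. It assumes a generic `llm(text) -> str` completion callable and a small labeled training set; the function names and prompt wordings are hypothetical, not the repository's API.

```python
# Illustrative sketch of the feedback-driven refinement loop (hypothetical
# helper names). `llm` is assumed to be any text-completion callable and
# `train_set` a list of (question, answer) pairs.
from typing import Callable, List, Tuple


def score(llm: Callable[[str], str], prompt: str, batch: List[Tuple[str, str]]) -> float:
    """Fraction of mini-batch questions answered correctly with this prompt
    (exact-match scoring is a simplification for the sketch)."""
    hits = sum(llm(f"{prompt}\n\nQ: {q}\nA:").strip() == a for q, a in batch)
    return hits / max(len(batch), 1)


def optimize_instruction(llm: Callable[[str], str],
                         problem_description: str,
                         thinking_styles: List[str],
                         train_set: List[Tuple[str, str]],
                         n_rounds: int = 3) -> str:
    best_prompt = problem_description
    for _ in range(n_rounds):
        # Mutate: combine the problem description with different thinking styles
        # to propose candidate prompt instructions.
        candidates = [llm(f"Rewrite this task instruction in the style '{style}':\n{best_prompt}")
                      for style in thinking_styles]
        # Scoring: evaluate each candidate on a mini-batch and keep the best one.
        scored = [(score(llm, c, train_set), c) for c in candidates + [best_prompt]]
        _, best_prompt = max(scored, key=lambda s: s[0])
        # Critique: ask the LLM to analyze cases where the best prompt failed.
        failures = [q for q, a in train_set if llm(f"{best_prompt}\n\nQ: {q}\nA:").strip() != a]
        critique = llm(f"The instruction below failed on these questions: {failures}\n"
                       f"Instruction: {best_prompt}\nExplain its weaknesses.")
        # Synthesize: use the critique's feedback to refine the best prompt.
        best_prompt = llm(f"Improve this instruction using the critique.\n"
                          f"Instruction: {best_prompt}\nCritique: {critique}")
    return best_prompt
```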

  • PromptWizard improves prompt instructions and few-shot examples in tandem
  • It uses self-reflection to synthesize examples that are diverse and task-relevant
  • An iterative feedback loop continuously refines both the prompt and the few-shot examples
  • Few-shot example optimization (a sketch follows this list):
    • Critique: Analyzes the previously selected examples and uses the feedback to determine how the examples should evolve
    • Synthesize: Incorporates the feedback to generate new synthetic examples that are more diverse, robust, and task-relevant
  • Prompt instruction optimization:
    • Critique: Identifies weaknesses and gaps that must be addressed to further refine the prompt instruction
    • Synthesize: Leverages the critique's feedback to synthesize and refine the prompt instruction
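
Below is a similarly hedged sketch of the few-shot example half of this loop, again assuming a generic `llm` callable; the function names and prompt wordings are illustrative only.

```python
# Illustrative sketch of the in-tandem few-shot example optimization
# (hypothetical names; `llm` is the same assumed text-completion callable).
from typing import Callable, List


def optimize_examples(llm: Callable[[str], str],
                      instruction: str,
                      examples: List[str],
                      n_rounds: int = 3) -> List[str]:
    for _ in range(n_rounds):
        # Critique: analyze the currently selected examples and decide how they
        # should evolve (coverage gaps, redundancy, difficulty).
        critique = llm(
            "Given this task instruction and its few-shot examples, point out "
            "what is missing or redundant and how the examples should evolve.\n"
            f"Instruction: {instruction}\nExamples:\n" + "\n".join(examples)
        )
        # Synthesize: generate new synthetic examples that address the critique,
        # making the set more diverse, robust, and task-relevant.
        synthesized = llm(
            "Write improved worked examples (question, reasoning, answer) that "
            f"address this critique:\n{critique}"
        )
        examples = synthesized.split("\n\n")
    return examples
```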

  • Incorporating chain-of-thought (CoT) reasoning improves the problem-solving abilities of the model
  • CoT reasoning takes the selected few-shot examples and generates a detailed reasoning chain for each example to facilitate problem-solving
  • An LLM is then used to check the coherence and relevance of the examples (see the sketch below)
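
A rough sketch of this CoT generation and validation step, under the same assumptions (generic `llm` callable, hypothetical helper names):

```python
# Illustrative sketch of self-generated CoT: for each selected example, ask the
# LLM for a detailed reasoning chain, then use the LLM again as a validator so
# only coherent, task-relevant chains enter the final prompt.
from typing import Callable, List, Tuple


def add_reasoning_chains(llm: Callable[[str], str],
                         examples: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    enriched = []
    for question, answer in examples:
        # Generate a detailed step-by-step reasoning chain for the example.
        chain = llm(f"Explain step by step how to reach the answer.\n"
                    f"Q: {question}\nA: {answer}")
        # Validate: a second LLM pass checks coherence and relevance; incoherent
        # chains are discarded.
        verdict = llm(f"Is this reasoning coherent and does it support the answer? "
                      f"Reply yes or no.\nQ: {question}\nReasoning: {chain}\nA: {answer}")
        if verdict.strip().lower().startswith("yes"):
            enriched.append((question, chain, answer))
    return enriched
```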

Results

PromptWizard outperforms the baselines, achieving the highest accuracy on 13/19 tasks (68%) with 0-shot and 16/19 (84%) with 1-shot

PromptWizard consistently performs near the best possible accuracy across all tasks

PromptWizard costs just $0.05 per task, a 5-60x reduction in overall tokens/cost



PromptWizard using Llama-70B shows a negligible (< 1%) drop in accuracy


PromptWizard shows strong resilience even with fewer training samples, mainly due to synthetic example generation and reasoning chains


Optimized prompts generated by PromptWizard yield substantial performance improvements across all models on the GSM8k dataset


Accuracy (higher is better)

Dataset    DSPy    PromptAgent    APO      PW
GSM8k      78.2    68.84          25.67    90
AQUARAT    55.1    56.67          20.12    58.2
SVAMP      77      78.67          75.25    82.3
ETHOS      84.1    84.25          80.62    89.4

API calls for optimization (lower is better)

Dataset    DSPy    PromptAgent    APO      PW
GSM8k      915     2115           8490     147
AQUARAT    920     2200           8500     112
SVAMP      2300    2111           8000     178
ETHOS      660     2217           8200     80

Tokens (lower is better)

Dataset    DSPy    PromptAgent    APO      PW
GSM8k      262     500            109      237
AQUARAT    326     875            125      200
SVAMP      189     680            85       127
ETHOS      175     417            55       190


PromptWizard outperforms feedback-based methods such as APO and PromptAgent, as well as other prompt optimization techniques such as DSPy, in both accuracy and the number of API calls required for optimization across various datasets.

BibTeX

@misc{agarwal2024promptwizardtaskawarepromptoptimization,
      title={PromptWizard: Task-Aware Prompt Optimization Framework}, 
      author={Eshaan Agarwal and Joykirat Singh and Vivek Dani and Raghav Magazine and Tanuja Ganu and Akshay Nambi},
      year={2024},
      eprint={2405.18369},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2405.18369}, 
}