Stanford and OpenAI have proposed meta-prompting, arguably the strongest zero-shot prompting technique yet.
The latest generation of language models (notably GPT-4, PaLM, and LLaMa) has pushed the boundaries of natural language processing and generation. These large-scale models can handle many different tasks, from writing Shakespearean sonnets to summarizing complex medical reports and solving competition-level programming problems. Although they can solve a wide variety of problems, they are not always correct and sometimes produce inaccurate, misleading, or contradictory responses.
As these models become ever cheaper to run, it is natural to ask whether a scaffolding system that issues multiple language model queries could be used to refine and improve the accuracy and robustness of their outputs.
This research from Stanford and OpenAI proposes a new technique for improving the functionality and performance of language models, called meta-prompting.
- Paper title: Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding
- Paper address: https://arxiv.org/abs/2401.12954
- Project address: https://github.com/suzgunmirac/meta-prompting
This technique involves constructing a high-level "meta" prompt that instructs the language model to do the following (a rough sketch of such a prompt is given after the list):
1. Break down complex tasks or problems into smaller sub-tasks that are easy to solve;
2. Assign these subtasks to specialized “expert” models using appropriate and detailed natural language instructions;
3. Supervise communication between these expert models;
4. Apply its own critical thinking, reasoning, and verification skills throughout this process.
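As a concrete illustration, a minimal sketch of such a high-level meta prompt might look like the snippet below, written here as a Python string. The wording, the expert block markers, and the ">> FINAL ANSWER:" tag are illustrative assumptions, not the paper's exact Figure 3 instructions.

```python
# A hypothetical meta prompt sketch; the actual instructions in the paper's Figure 3 differ.
META_SYSTEM_PROMPT = '''\
You are Meta-Expert, a conductor model that solves problems by consulting experts.

1. Break the given task into smaller, easier sub-tasks.
2. To consult an expert, write a block of the form:
   Expert <Name>:
   """
   <detailed, self-contained instructions for that expert>
   """
3. Read each expert reply critically, verify it, and consult another expert
   (or the Python interpreter) whenever you are unsure.
4. Once you are confident, present the result as:
   >> FINAL ANSWER: <answer>
'''
```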
When a language model is invoked with meta-prompting, it acts as a conductor. It outputs a message history (or narrative) composed of responses from multiple expert models. The language model is first responsible for generating the conductor's portion of the message history, which includes selecting experts and constructing specific instructions for them. However, the same language model also acts as each independent expert in its own right, generating output based on the expertise and information the conductor selects for each specific query.
This approach allows a single unified language model to maintain a coherent line of reasoning while also drawing on a variety of expert roles. Because the context of each prompt is selected dynamically, these experts bring a fresh perspective to the process, while the conductor model keeps a bird's-eye view of the complete history and maintains coordination.
This approach therefore allows a single black-box language model to effectively act both as the central conductor and as a cast of different experts, resulting in more accurate, reliable, and coherent responses.
The newly proposed meta-prompting technique combines and extends many different prompting ideas from recent research, including high-level planning and decision-making, dynamic persona assignment, multi-agent debate, self-debugging, and self-reflection.
A key aspect of meta-prompting is its property of being task-agnostic.
Unlike traditional scaffolding methods, which require specific instructions or examples tailored to each task, meta-prompting uses the same set of high-level instructions across many tasks and inputs. This versatility is especially beneficial for users who want to avoid hassle, since it eliminates the need to provide detailed examples or specific instructions for every individual task.
For example, for a one-time request like "Write a Shakespearean sonnet about taking a selfie," users don't need to supplement it with high-quality examples of neoclassical poetry.
Meta-prompting can improve the usefulness of language models by providing a broad and flexible framework without compromising their specificity or relevance. In addition, to demonstrate the versatility and integration capabilities of the method, the team enhanced the system so that it can call a Python interpreter. This allows the technique to support more dynamic and comprehensive applications, further increasing its potential to handle a wide range of tasks and queries efficiently.
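This summary does not spell out how the interpreter is wired in, but a minimal sketch under assumptions of mine could look like the following, where code emitted by a hypothetical "Expert Python" is run in a subprocess with a timeout; the function name and error format are illustrative only.

```python
import subprocess
import sys

def run_python_expert(code: str, timeout_s: int = 10) -> str:
    """Execute code produced by a hypothetical 'Expert Python' and return its output.

    Running model-generated code is risky; a real system would add stronger
    sandboxing (containers, resource limits) than a bare subprocess call.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return proc.stdout if proc.returncode == 0 else f"Error:\n{proc.stderr}"
    except subprocess.TimeoutExpired:
        return "Error: execution timed out."
```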
Figure 2 shows an example of a meta-prompting session flow.
It depicts the process by which the Meta Model (the conductor) interprets its own output together with the inputs and outputs of multiple specialized expert models or code executions. This configuration makes meta-prompting a nearly universal tool: it aggregates the interactions and computations of multiple language models into a single, coherent narrative. What sets meta-prompting apart is that it leaves the decision of which prompts to use and which code snippets to execute to the language model itself.
The team conducted comprehensive experiments using GPT-4 as the base language model, comparing meta-prompting with other task-agnostic scaffolding methods.
The experiments found that meta-prompting not only improves overall performance but also often sets new best results across several different tasks. Its flexibility is particularly noteworthy: the conductor model can call on expert models (which are essentially itself, given different instructions) to perform a variety of functions. These may include reviewing previous output, selecting specific AI personas for particular tasks, refining generated content, and ensuring that the final output meets the required standards in both substance and form.
As shown in Figure 1, compared with various previous methods, the new method has obvious improvements.
Meta-prompting
Intuition and abstract overview. Meta-prompting works by using one model to coordinate and execute multiple independent queries and then combining their responses to produce a final answer. In principle, this mechanism takes an ensemble approach, drawing on the strength and diversity of independent specialized models to collaboratively tackle multi-faceted tasks or problems.
At the heart of the meta-prompting strategy is its structure: a single model (called the meta-model) serves as the central, authoritative entity.
This prompting structure resembles an orchestra, in which the role of the conductor is played by the meta-model and each individual musician corresponds to a different domain-specific model. Just as a conductor coordinates many instruments to produce a harmonious melody, the meta-model combines the answers and insights of multiple models into accurate and comprehensive answers to complex questions or tasks.
Conceptually, within this framework, domain-specific experts can take many forms, such as language models fine-tuned for specific tasks, specialized APIs for handling particular types of queries, or computational tools such as a calculator or a Python interpreter for executing code. These functionally diverse experts are instructed and unified under the supervision of the meta-model and cannot directly interact or communicate with each other.
Algorithmic Procedure. Algorithm 1 gives the pseudocode of the newly proposed meta-prompting method.
To briefly summarize, it starts by transforming the input so that it conforms to the appropriate template; it then runs the following loop: (a) prompt the meta-model, (b) consult a domain-specific expert model if necessary, (c) return the final response, and (d) handle errors.
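Since the pseudocode itself is not reproduced here, the following is a rough Python sketch of that loop under stated assumptions: call_model is a placeholder for a chat-completion call, and the "Expert <Name>:" block format and the ">> FINAL ANSWER:" marker are illustrative conventions rather than the paper's exact templates.

```python
import re

# Assumed conventions for illustration; the paper's actual templates differ in detail.
EXPERT_BLOCK = re.compile(r'Expert ([^:\n]+):\s*"""(.*?)"""', re.DOTALL)
FINAL_MARKER = ">> FINAL ANSWER:"
MAX_ROUNDS = 15

def call_model(system_prompt: str, history: list[dict]) -> str:
    """Placeholder for a chat-completion call; conductor and experts share one LM."""
    raise NotImplementedError

def meta_prompting(task: str, meta_system_prompt: str) -> str:
    history = [{"role": "user", "content": task}]               # format the input
    for _ in range(MAX_ROUNDS):
        output = call_model(meta_system_prompt, history)         # (a) prompt the meta-model
        history.append({"role": "assistant", "content": output})
        if FINAL_MARKER in output:                               # (c) return the final response
            return output.split(FINAL_MARKER, 1)[1].strip()
        for name, instructions in EXPERT_BLOCK.findall(output):  # (b) consult experts if requested
            expert_system = f"You are an expert {name.strip()}. Follow the instructions exactly."
            expert_reply = call_model(
                expert_system, [{"role": "user", "content": instructions.strip()}]
            )
            history.append(
                {"role": "user", "content": f"Expert {name.strip()} replied:\n{expert_reply}"}
            )
    return "Error: no final answer produced within the round limit."  # (d) error handling
```

The key design point is that the same underlying model is queried for every role; only the system prompt and the visible context differ between the conductor and each expert.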
It should be noted that the meta-model and the expert models used in the team's experiments are all GPT-4. The difference in their roles is determined by the instructions each receives: the meta-model follows the set of instructions given in Figure 3, while the expert models follow instructions determined dynamically by the meta-model at inference time.
Experimental setup
Baselines
The team compared meta-prompting with task-agnostic, zero-shot versions of the following prompting methods:
- Standard prompting
- Zero-shot chain-of-thought prompting
- Expert prompting
- Multi-persona prompting
Datasets and tasks
The team's experiments used a variety of tasks and datasets that demand a range of different abilities, such as mathematical and algorithmic reasoning, domain-specific knowledge, and literary creativity. These datasets and tasks include:
- Game of 24: The goal is to use four given values (each used only once) to construct an arithmetic expression that results in 24.
- Three BIG-Bench Hard (BBH) tasks: Geometric Shapes, MultiStep Arithmetic Two, and Word Sorting; plus a reasoning task, Checkmate-in-One, taken directly from the BIG-Bench suite.
- Python Programming Puzzles (P3): Python programming challenges spanning multiple difficulty levels.
- Multilingual Grade School Math: a multilingual version of the GSM8K dataset, with languages including Bengali, Japanese, and Swahili.
- Shakespearean Sonnet Writing: a new task created by the team. The goal is to write a sonnet that strictly follows the rhyme scheme "ABAB CDCD EFEF GG" and contains, verbatim, each of three provided words.
Answer extraction and evaluation protocol
As shown in Figure 3, for the newly proposed meta-prompting method, the system instructions encourage the meta-model to present its final answer in a specific format.
For evaluation, one of the following three metrics is used, depending on the nature and form of the task (a rough sketch follows the list):
- Exact Match (EM)
- Soft Match (SM)
- Functionally Correct (FC)
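The paper defines these metrics precisely; the sketch below only illustrates the general idea under assumptions of mine: exact match compares normalized strings, soft match checks whether the normalized gold answer is contained in the prediction, and functional correctness runs a task-specific checker (shown here for a hypothetical Game of 24 grader).

```python
import re

def normalize(text: str) -> str:
    return " ".join(text.lower().strip().split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

def soft_match(prediction: str, gold: str) -> bool:
    # One plausible reading of "soft match": the normalized gold answer
    # appears somewhere inside the normalized prediction.
    return normalize(gold) in normalize(prediction)

def functionally_correct_game_of_24(expression: str, numbers: list[int]) -> bool:
    # Hypothetical checker: the expression must use each given number exactly
    # once and evaluate to 24.
    used = sorted(int(n) for n in re.findall(r"\d+", expression))
    if used != sorted(numbers):
        return False
    try:
        return abs(eval(expression) - 24) < 1e-6  # trusted, test-side evaluation
    except (SyntaxError, NameError, ZeroDivisionError):
        return False
```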
Models and inference
The team's main experiments all used GPT-4 (gpt-4-32k); some supplementary experiments used GPT-3.5 (gpt-35-turbo). Whether GPT-3.5 or GPT-4, the models were guided by the instructions described below; no fine-tuning is involved, and the roles are set purely through prompting.
The parameters and system instructions used by the meta-model were the same across all experiments: the temperature was set to 0, top-p to 0.95, and the maximum number of tokens to 1024.
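As a sketch of how those decoding settings map onto an API call using the current OpenAI Python client; the model name below is a placeholder rather than the exact deployment the team used.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query(system_prompt: str, user_message: str, model: str = "gpt-4") -> str:
    """Single chat-completion call with the decoding settings reported in the paper."""
    response = client.chat.completions.create(
        model=model,  # placeholder; the paper used a gpt-4-32k deployment
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        temperature=0,
        top_p=0.95,
        max_tokens=1024,
    )
    return response.choices[0].message.content
```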
Main results and discussion
Table 1 summarizes the experimental results, which show the advantage of the newly proposed meta-prompting.
Looking at the overall performance of these methods on all tasks, we can see that meta-prompting brings significant improvements in accuracy, especially when assisted by the Python interpreter tool.
Specifically, the meta-prompting method outperforms standard prompting by 17.1%, expert (dynamic) prompting by 17.3%, and multi-persona prompting by 15.2%.
In addition, Figures 4 and 5 show that integrating the Python interpreter improves overall performance across tasks by 11.5% compared with meta-prompting without it.
The team also discussed in depth in the paper the key insights gained from the experiments, including the performance superiority of meta-prompting, zero-shot decomposition capabilities, error detection, information aggregation, and code execution. We won’t go into details here, but the concept of Fresh Eyes is worth introducing.
Fresh Eyes, that is, looking at a problem with another pair of eyes, helps alleviate a well-known weakness of language models: compounding an early mistake through the rest of their reasoning while remaining overconfident.
Fresh Eyes is a key difference between meta-prompting and multi-persona prompting, and the experimental results demonstrate its advantages. In meta-prompting, experts (or personas) can be brought in to re-evaluate the problem, offering a chance to gain new insights and to uncover errors that earlier steps failed to detect.
Findings from cognitive psychology suggest that a fresh perspective leads to more creative problem solving and better error detection.
The following example demonstrates the benefits of Fresh Eyes in practice. Suppose the task is Game of 24, the values provided are 6, 11, 12, and 13, and the goal is to construct an arithmetic expression that evaluates to 24, using each number exactly once. The message history might unfold as follows:
1. The meta-model proposes consulting expert models for solving mathematical problems and for programming in Python. It emphasizes the need for accuracy and compliance with the constraints and recommends involving another expert if necessary.
2. One expert proposes a solution, but another expert judges it to be wrong, so the meta-model suggests writing a Python program to find a valid solution.
3. A programming expert is consulted and asked to write the program.
4. Another programming expert finds an error in the script, corrects it, and executes the corrected script.
5. A math expert is consulted to verify the solution output by the program.
6. Once verification is complete, the meta-model outputs it as the final answer.
This example shows how meta-prompting can incorporate new insights at every step, not only to find answers, but also to effectively identify and correct errors.
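For reference, the kind of program such a programming expert might produce is a brute-force search like the sketch below; it is an illustration rather than the transcript's actual code. For 6, 11, 12, and 13 it finds, for example, 6 * (13 - 11) + 12 = 24.

```python
from itertools import permutations, product

def solve_24(numbers, target=24, eps=1e-6):
    """Brute-force search over orderings, operators, and parenthesizations."""
    ops = "+-*/"
    for a, b, c, d in permutations(map(float, numbers)):
        for o1, o2, o3 in product(ops, repeat=3):
            # The five shapes below cover every way to parenthesize four operands.
            candidates = [
                f"(({a} {o1} {b}) {o2} {c}) {o3} {d}",
                f"({a} {o1} ({b} {o2} {c})) {o3} {d}",
                f"({a} {o1} {b}) {o2} ({c} {o3} {d})",
                f"{a} {o1} (({b} {o2} {c}) {o3} {d})",
                f"{a} {o1} ({b} {o2} ({c} {o3} {d}))",
            ]
            for expr in candidates:
                try:
                    if abs(eval(expr) - target) < eps:
                        return expr
                except ZeroDivisionError:
                    continue
    return None

print(solve_24([6, 11, 12, 13]))  # prints one valid expression evaluating to 24
```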
The team concludes by discussing several other questions related to meta-prompting, including an analysis of the types of experts used, the number of dialogue rounds needed to reach a final result, and how to handle cases such as unsolvable problems. Please refer to the original paper for details.