Generate from Constraints
Using Prompt-Hoisting for GPT-based Code Generation
As society deals with the broader implications of generative pre-trained transformer (GPT) technology, developers are trying to understand how it will impact software development. While there are a wide variety of ways to leverage it, the easiest, and most common, is code generation.
Today, there are a number of GPT-based tools that analyze comments and code and suggest completions as you type. You can also prompt them to generate or transform code. In either case, you can choose to accept the changes (or not), but you have to be careful. GPT-based tools are prone to the hallucination problem – they are indeterminate. The randomness they use to generate creative output and avoid being “stuck” can lead to too much creativity — situations where generated code is subtly wrong or not quite what you had intended.
When we use AI for code generation, quality assurance becomes much more important.
During a recent conference, I was taken aback when a speaker suggested using AI to handle tasks disliked by developers, such as writing unit tests. The recommendation shocked me because it overlooked the inherent uncertainty associated with how transformers operate. Asking an AI-based tool to write tests of correctness for our code seems like a good idea, but how do we know whether those tests check the behavior we intended? The situation is worse when we are using AI to generate the code that it is testing. The generated code could behave differently than we intend and the tests it generates might wrongly interpret these differences as the correct behavior.
Let’s make this concrete.
Here is a natural workflow using today’s AI tools:
- Prompt the tool to generate a method for you (or start to type it and have the tool complete it).
- Review the generated code (because it might not be what you intended).
- Prompt the tool to write tests for you.
- Review the generated tests (because they might not check the exact behavior you intended either).
The review steps are there because of the indeterminacy that the tool adds to the process. They are additional work for us. Worse, when we use one indeterminacy to check another we could be compounding any errors we miss in review.
How can we move past this to a better workflow?
One thing that we can do is look for leverage. Is there a way to use something that we are doing to keep the indeterminacy in check?
There is. We can write our prompts as tests. They can serve as input to a tool but also as a check on its output.
Here is an example, a test I wrote to start work on a tiny To-do list app:
@Test
public void renderSessionWithOneTask() {
    Session session = new Session();
    session.add(new Task("task 1"));
    assertEquals(" 0: task 1\n", session.render());
}
Then, I used the test as part of my prompt and asked the tool to write code to satisfy it.
It did. I ran the test and it passed.
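To give a sense of scale, here is a minimal sketch of code that satisfies that first test. The class and method names are dictated by the test itself; the bodies are illustrative, not the tool's actual output, and they do only what the single test demands.

import java.util.ArrayList;
import java.util.List;

class Task {
    private final String description;

    Task(String description) {
        this.description = description;
    }

    String description() {
        return description;
    }
}

class Session {
    private final List<Task> tasks = new ArrayList<>();

    void add(Task task) {
        tasks.add(task);
    }

    String render() {
        // Just enough to make renderSessionWithOneTask pass.
        return " 0: " + tasks.get(0).description() + "\n";
    }
}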
Next, I added another test:
@Test
public void renderSessionWithSeveralTasks() {
    Session session = new Session();
    session.add(new Task("task 1"));
    session.add(new Task("task 2"));
    session.add(new Task("task 3"));
    assertEquals(" 0: task 1\n 1: task 2\n 2: task 3\n", session.render());
}
When I prompted the tool to change the code so that it would satisfy both tests, it did the work and the tests passed.
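In this case, the change amounts to generalizing render() over the whole task list. Here is a sketch of a replacement for render() in the earlier Session sketch that satisfies both tests; again, this is illustrative rather than the tool's literal output.

String render() {
    // Render every task as " <index>: <description>\n".
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < tasks.size(); i++) {
        out.append(" ").append(i).append(": ")
           .append(tasks.get(i).description()).append("\n");
    }
    return out.toString();
}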
Sometimes generated code doesn’t pass the tests. At other times, it passes them but it takes the design in a direction that I disagree with. That’s okay; I’m using the tests in my prompt as a way of checking the output of the tool. If the tool misses, I can ask it to try again, or I can write the code by hand to satisfy the prompt the way that I want to.
Many people reading this probably recognize this process as a variant of Test-Driven Development. It is, but it is also a generalization of it. When we write prompts, we can hoist them above the generation process by using them as automated checks for tool output.
Note that I’m using the word prompt in an expansive way – the prompt is the code generation request along with a set of executable descriptions (tests) that we can use to verify the output.
Our new workflow looks like this:
- Write a prompt that is executable as a test.
- Use the prompt to ask a tool to generate code that passes it.
- Run the prompt as a test.[1]
- Review code that passes the test to see whether it passes in a way that advances your design.
Instead of having the two review steps we had in our original workflow, we have just one – a review for style and design, not correctness. Prompts, repurposed as tests, simplify our workflow and drive behavioral indeterminacy out of the process.
This workflow is definitely practicable today, but there are some rough edges. I often have to tell the tool that I want it to pass the tests in the simplest way possible without introducing new behavior. Current tools seem to be eager to give us more than we ask for. That just leads to more review and possible errors. Another thing that can get in the way is fatigue. If a tool “misses” too often when it makes suggestions, development becomes more like debugging — far more exhausting than just writing the code.
Prompt-hoisting can be seen as just Test-Driven Development in an AI context, but the strategy of using prompts as constraints can be used for many other tasks. As an example, imagine asking a tool to generate recipes based on a list of ingredients and appliances you have in your kitchen. You prompt it with the list and receive a recipe as output. You could have an external checker that takes the list you used as a prompt, along with the produced recipe, and verifies that you can prepare it with the list items. Extending this idea to any other kind of assembly task or generative engineering task is rather straightforward.
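A checker like that could be only a few lines of code. In the sketch below, the Recipe type and its fields are hypothetical, invented for illustration, and the recipe is assumed to reduce to sets of ingredient and appliance names.

import java.util.Set;

// Hypothetical types, invented for illustration.
record Recipe(Set<String> ingredients, Set<String> appliances) {}

class RecipeChecker {
    // The generated recipe passes the check only if it uses nothing
    // beyond what the prompt listed as available.
    static boolean isPreparableWith(Recipe recipe,
                                    Set<String> availableIngredients,
                                    Set<String> availableAppliances) {
        return availableIngredients.containsAll(recipe.ingredients())
            && availableAppliances.containsAll(recipe.appliances());
    }
}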
Generative AI leads us to a "generate and check" paradigm. When we are ideating or doing free-form creative work, we can check the output manually and use our qualitative judgment. In an engineering context, we need ways of constraining generation and making sure that it satisfies our requirements — the "must-haves." Making our prompts dual-purpose is one way to achieve this. They can specify what needs to be generated and automatically verify its adequacy. This strategy can help us manage AI's tendency to "hallucinate" and allow us to introduce precision when needed.
[1] This step has an inner cycle. If the test fails, you can try generating code again or modify the generated code to make it pass. I often find that no matter how I write a test, some generation tasks are beyond the capability of the tool I’m using. I take this as a hint that I need to back off and write a test for some smaller piece of behavior that can be used as a building block toward the original behavior.