<?xml version="1.0" encoding="UTF-8"?>
<rss version='2.0' xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Michael Feathers</title>
    <description></description>
    <link>https://michaelfeathers.silvrback.com/feed</link>
    <atom:link href="https://michaelfeathers.silvrback.com/feed" rel="self" type="application/rss+xml"/>
    <category domain="michaelfeathers.silvrback.com">Content Management/Blog</category>
    <language>en-us</language>
      <pubDate>Tue, 11 Jul 2023 12:44:36 -0300</pubDate>
    <managingEditor>michael.feathers@gmail.com (Michael Feathers)</managingEditor>
      <item>
        <guid>https://michaelfeathers.silvrback.com/prompt-hoisting-for-gpt-based-code-generation#55231</guid>
          <pubDate>Tue, 11 Jul 2023 12:44:36 -0300</pubDate>
        <link>https://michaelfeathers.silvrback.com/prompt-hoisting-for-gpt-based-code-generation</link>
        <title>Generate from Constraints</title>
        <description>Using Prompt-Hoisting for GPT-based Code Generation</description>
        <content:encoded><![CDATA[<p>As society deals with the broader implications of generative pre-trained transformer (GPT) technology, developers are trying to understand how it will impact software development. While there are a wide variety of ways to leverage it, the easiest, and most common, is code generation.</p>

<p>Today, there are a number of GPT-based tools that analyze comments and code and suggest completions as you type. You can also prompt them to generate or transform code. In either case, you can choose to accept the changes (or not) but you have to be careful. GPT-based tools are prone to the hallucination problem – they are indeterminate. The randomness they use to generate creative output and avoid being “stuck” can lead to too much creativity — situations where generated code is subtly wrong or not quite what you had intended.</p>

<blockquote>
<p>When we use AI for code generation, quality assurance becomes much more important.</p>
</blockquote>

<p>During a recent conference, I was taken aback when a speaker suggested using AI to handle tasks disliked by developers, such as writing unit tests. The recommendation shocked me because it overlooked the inherent uncertainty associated with how transformers operate. Asking an AI-based tool to write tests of correctness for our code seems like a good idea, but how do we know whether those tests check the behavior we intended? The situation is worse when we are using AI to generate the code that it is testing. The generated code could behave differently than we intend and the tests it generates might wrongly interpret these differences as the correct behavior.</p>

<p>Let’s make this concrete. </p>

<p>Here is a natural workflow using today’s AI tools:</p>

<ul>
<li>Prompt the tool to generate a method for you (or start to type it and have the tool complete it).</li>
<li><em>Review</em> the generated code (because it might not be what you intended).</li>
<li>Prompt the tool to write tests for you.</li>
<li><em>Review</em> the generated tests (because they might not check the exact behavior you intended either).</li>
</ul>

<p>The review steps are there because of the indeterminacy that the tool adds to the process. They are additional work for us. Worse, when we use one indeterminacy to check another we could be compounding any errors we miss in review.</p>

<p>How can we move past this to a better workflow?</p>

<p>One thing that we can do is look for leverage. Is there a way to use something that we are doing to keep the indeterminacy in check? </p>

<p>There is. We can write our prompts as tests. They can serve as input to a tool but also as a check on its output.</p>

<p>Here is an example, a test I wrote to start work on a tiny To-do list app:</p>
<div class="highlight"><pre><span></span><span class="nd">@Test</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">renderSessionWithOneTask</span><span class="o">()</span> <span class="o">{</span>
    <span class="n">Session</span> <span class="n">session</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Session</span><span class="o">();</span>
    <span class="n">session</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="k">new</span> <span class="n">Task</span><span class="o">(</span><span class="s">&quot;task 1&quot;</span><span class="o">));</span>

    <span class="n">assertEquals</span><span class="o">(</span><span class="s">&quot; 0: task 1\n&quot;</span><span class="o">,</span> <span class="n">session</span><span class="o">.</span><span class="na">render</span><span class="o">());</span>
<span class="o">}</span>
</pre></div>
<p>Then, I used the test as part of my prompt and asked the tool to write code to satisfy it.</p>

<p>It did. I ran the test and it passed.</p>

<p>Next, I added another test:</p>
<div class="highlight"><pre><span></span><span class="nd">@Test</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">renderSessionWithSeveralTasks</span><span class="o">()</span> <span class="o">{</span>
    <span class="n">Session</span> <span class="n">session</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Session</span><span class="o">();</span>
    <span class="n">session</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="k">new</span> <span class="n">Task</span><span class="o">(</span><span class="s">&quot;task 1&quot;</span><span class="o">));</span>
    <span class="n">session</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="k">new</span> <span class="n">Task</span><span class="o">(</span><span class="s">&quot;task 2&quot;</span><span class="o">));</span>
    <span class="n">session</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="k">new</span> <span class="n">Task</span><span class="o">(</span><span class="s">&quot;task 3&quot;</span><span class="o">));</span>

    <span class="n">assertEquals</span><span class="o">(</span><span class="s">&quot; 0: task 1\n 1: task 2\n 2: task 3\n&quot;</span><span class="o">,</span> <span class="n">session</span><span class="o">.</span><span class="na">render</span><span class="o">());</span>
<span class="o">}</span>
</pre></div>
<p>When I prompted the tool to change the code so that it would satisfy both tests, it did the work and the tests passed.</p>
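<p>For reference, here is a hand-written sketch of <code>Task</code> and <code>Session</code> classes that satisfy both tests. This is my own illustration of one shape the output could take, not the tool’s actual output:</p>

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a Task: wraps a task description.
class Task {
    private final String name;
    Task(String name) { this.name = name; }
    String name() { return name; }
}

// Sketch of a Session: holds tasks and renders them as a numbered list,
// one line per task, matching the format the tests specify.
class Session {
    private final List<Task> tasks = new ArrayList<>();

    void add(Task task) {
        tasks.add(task);
    }

    String render() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tasks.size(); i++) {
            sb.append(" ").append(i).append(": ")
              .append(tasks.get(i).name()).append("\n");
        }
        return sb.toString();
    }
}
```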

<p>Sometimes generated code doesn’t pass the tests. At other times, it passes them but it takes the design in a direction that I disagree with. That’s okay; I’m using the tests in my prompt as a way of checking the output of the tool. If the tool misses, I can ask it to try again, or I can write the code by hand to satisfy the prompt the way that I want to.</p>

<p>Many people reading this probably recognize this process as a variant of Test-Driven Development. It is, but it is also a generalization of it. When we write prompts, we can hoist them above the generation process by using them as automated checks for tool output. </p>

<p>Note that I’m using the word prompt in an expansive way – the prompt is the code generation request along with a set of executable descriptions (tests) that we can use to verify the output.</p>

<p>Our new workflow looks like this:</p>

<ul>
<li>Write a prompt that is executable as a test.</li>
<li>Use the prompt to ask a tool to generate code that passes it.</li>
<li>Run the prompt as a test <sup id="fnref1"><a href="#fn1" rel="footnote">1</a></sup>.</li>
<li><em>Review</em> code that passes the test to see if it passes in a way which advances your design.</li>
</ul>

<p>Instead of having the two review steps we had in our original workflow, we have just one – a review for style and design, not correctness. Prompts, repurposed as tests, simplify our workflow and drive behavioral indeterminacy out of the process.</p>

<p>This workflow is definitely practicable today, but there are some rough edges. I often have to tell the tool that I want it to pass the tests in the simplest way possible without introducing new behavior. Current tools seem to be eager to give us more than we ask for. That just leads to more review and possible errors. Another thing that can get in the way is fatigue. If a tool “misses” too often when it makes suggestions, development becomes more like debugging — far more exhausting than just writing the code.</p>

<p>Prompt-hoisting can be seen as just Test-Driven Development in an AI context, but the strategy of using prompts as constraints can be used for many other tasks. As an example, imagine asking a tool to generate recipes based on a list of ingredients and appliances you have in your kitchen. You prompt it with the list and receive a recipe as output. You could have an external checker that takes the list you used as a prompt, along with the produced recipe, and verifies that you can prepare it with the list items. Extending this idea to any other kind of assembly task or generative engineering task is rather straightforward.</p>
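<p>Such an external checker can be sketched in a few lines. The class and method names here are hypothetical, and a real checker would also need to handle quantities, appliances, and ingredient synonyms:</p>

```java
import java.util.List;
import java.util.Set;

// Hypothetical external checker: verifies that every ingredient a
// generated recipe calls for appears in the list of ingredients
// that was given in the prompt.
class RecipeChecker {
    static boolean preparable(Set<String> available, List<String> required) {
        // The recipe is preparable only if the prompt's ingredient
        // list covers everything the recipe requires.
        return available.containsAll(required);
    }
}
```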

<p>Generative AI leads us to a &quot;generate and check&quot; paradigm. When we are ideating or doing free-form creative work, we can check the output manually and use our qualitative judgment. In an engineering context, we need ways of constraining generation and making sure that it satisfies our requirements — the &quot;must-haves.&quot; Making our prompts dual-purpose is one way to achieve this. They can specify what needs to be generated and automatically verify its adequacy. This strategy can help us manage AI&#39;s tendency to &quot;hallucinate&quot; and allow us to introduce precision when needed.</p>

<div class="footnotes">
<hr>
<ol>

<li id="fn1">
<p>This step has an inner cycle. If the test fails, you can try generating code again or modify the generated code to make it pass. I often find that no matter how I write a test, some generation tasks are beyond the capability of the tool I’m using. I take this as a hint that I need to back off and write a test for some smaller piece of behavior that can be used as a building block toward the original behavior.&nbsp;<a href="#fnref1" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
]]></content:encoded>
      </item>
      <item>
        <guid>https://michaelfeathers.silvrback.com/possible-ai-impacts-on-development-practice#54763</guid>
          <pubDate>Thu, 06 Apr 2023 20:47:03 -0300</pubDate>
        <link>https://michaelfeathers.silvrback.com/possible-ai-impacts-on-development-practice</link>
        <title>(Possible) AI Impacts on Development Practice</title>
        <description>The View from April 2023</description>
        <content:encoded><![CDATA[<p><em>We can’t accurately predict AI’s impact but we can explore some possible disruption points.</em></p>

<h6 id="source-code">Source Code:</h6>

<ul>
<li><p>Currently most software systems are developed using source code – a human readable, (hopefully) deterministic specification of system behavior. Development tools that allow end-user programming often bypass the need for source code by using record/playback and visual programming-style interaction modes. When behavior needs to be altered, a source code representation is useful. It aids system understanding and enables fine-grained, parametric control[1] of behavior.</p></li>
<li><p>There is a broad class of systems for which most behavior is determined by variation in data. This class includes traditional database systems as well as ML and NN-based systems which create internal (usually non-human-readable) representations that drive behavior. As useful as these systems are, they are not transparent or easily modifiable in a parametric way. We sacrifice the rapid, fine-grained control of behavior that source code provides in order to do things not easily achievable otherwise. Source code is a compact way of parameterizing behavior.</p></li>
<li><p>Programming languages often build on top of each other. Compilers and translators for new programming languages often target other high-level languages. When they do this, they are able to leverage existing development platforms and ecosystems. This bootstrapping process has the nice side effect of giving us a human-readable representation (the generated source code) but in most cases the build pipeline eventually collapses, removing the use of source as an intermediate language. Example: C++ was originally implemented as a preprocessor for C. Eventually, C++ compilers compiled directly to lower-level representations such as object code. </p></li>
<li><p>Today, GPT-based systems are being used to produce source code as an intermediate representation. We review and then feed that generated code into the build process. It’s worth considering whether this is necessary. The near term answer is: maybe. The output of GPT-based systems is probability-based, alterable by a “temperature” that produces novelty in solution generation but also the chance that a solution may “miss”, be wrong, or have bad consequences (the “hallucination problem”). Human validation is currently a key part of the process and source code is a traceable medium. </p></li>
<li><p>As development moves to a generate-and-test paradigm, quality assurance will have renewed importance.</p></li>
<li><p>There is a strong possibility that prompting replaces source code to some extent; however, ‘temperature’ is a problem. To the degree that AI produces novelty, prompts are no longer deterministic specifications of behavior. Regardless, we may see the development of prompting languages with the aim of specifying must-have behaviors and placing constraints on stochasticity. It is likely these will be developed <em>ad hoc</em> in GPT sessions and later standardized.</p></li>
</ul>

<h6 id="modularity">Modularity:</h6>

<ul>
<li><p>Will systems continue to be built from modular pieces? Let’s look at two common ways of composing systems from parts. The first is gluing pieces together. The other is creating pieces with ‘holes’ or extension points. In loose terms, the former is the <em>library composition</em> approach and the latter is the <em>framework</em> approach. We can view pieces ‘with and without holes’ as modules.</p></li>
<li><p>As the sizes of AI context windows increase, the size of generated modules can increase. These modules can be read by AI, modified through prompts and re-generated on demand.</p></li>
<li><p>Most systems that generate code have something I call the ‘escape hatch’ problem: Invariably, there are things that cannot easily be done with the generator. Traditionally, the way to handle this has been to provide a way to ‘drop down’ into a lower-level representation and fill in the details. To the degree that this problem persists, generated modules can provide ‘holes’ as ‘escape hatches.’</p></li>
<li><p>AI enables a third way of composing software. A module with extensive functionality can be read by an AI and then rewritten with only the subset of functionality and any additional tailoring needed for a particular use. This is like an automated private fork. High value code bases can be reused in this way.</p></li>
<li><p>Another possible scenario for development is the use of peer AI systems that negotiate feature sets with each other in order to arrive at a system design. It is hard to know what this means for modularity but it is likely that <a href="https://en.wikipedia.org/wiki/Conway%27s_law">Conway&#39;s Law</a> will hold in many cases with the modularization mirroring the interactions of the peer systems responsible for producing the design.</p></li>
</ul>
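<p>The idea of generated modules providing ‘holes’ as ‘escape hatches’ can be sketched in the classic template-method shape. The names here are hypothetical, purely for illustration:</p>

```java
import java.util.List;

// Hypothetical generated module: the overall rendering flow is fixed
// by the generator, while formatLine() is a 'hole' left open for
// hand-written detail.
abstract class ReportModule {
    // Generated skeleton: iterates the rows and delegates the
    // per-line detail to the extension point.
    final String render(List<String> rows) {
        StringBuilder sb = new StringBuilder();
        for (String row : rows) {
            sb.append(formatLine(row)).append("\n");
        }
        return sb.toString();
    }

    // The escape hatch: developers drop down here to fill in details
    // the generator cannot easily produce.
    protected abstract String formatLine(String row);
}
```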

<h6 id="closing-thoughts">Closing Thoughts:</h6>

<ul>
<li><p>Source code seems to have value for the immediate future. We could arrive at trained AIs that, through interaction and learning, “understand” a domain and its constraints so well that they can generate large target systems on request. If that happens, it might make sense to simply persist context as a set of embeddings, bypassing the need for source code.</p></li>
<li><p>The relatively small size of context windows that we have today seems like it would be the primary motivator for the continued use of modularity in software development, but all of the uses of modularity outlined by Baldwin and Clark[2] remain. Modularity isn’t just a software concept; it’s a strategy used in nearly all systems of scale.</p></li>
</ul>

<p>This is just the way things appear to me now. It will be interesting to see how it all plays out in the near and long term.</p>

<hr>

<p>[1] Parametric control - a system behavior is parametrically controllable to the degree that it can be changed concisely and deterministically with specificity. </p>

<p>[2] <a href="https://direct.mit.edu/books/book/1856/Design-RulesThe-Power-of-Modularity">Baldwin and Clark - Design Rules: The Power of Modularity</a> </p>

<p>No AI were harmed in the writing of this article.</p>
]]></content:encoded>
      </item>
      <item>
        <guid>https://michaelfeathers.silvrback.com/gateway-teams#52453</guid>
          <pubDate>Mon, 23 Aug 2021 16:36:08 -0300</pubDate>
        <link>https://michaelfeathers.silvrback.com/gateway-teams</link>
        <title>Gateway Teams</title>
        <description>First experiences of your organization matter</description>
        <content:encoded><![CDATA[<p>One of the things that is hard to appreciate in complex systems is <a href="https://en.wikipedia.org/wiki/Path_dependence">path-dependence</a> — the fact that most systems have memory. What we see today is a consequence of what came before. This is a very simple thing to say but even when we know it, we forget it and we don’t really think about its ramifications.  If we want to change things, it helps to be <em>upstream</em> of the change. The earliest decisions are often the most significant ones. The things we <em>have to</em> react to are often most determinative of how we work.</p>

<p>Practice change and organizational learning are two areas where this is important. </p>

<p>Teams that want to adopt different practices often try them for a while and then revert to the old way of doing things. It&#39;s easy to see why. When we are surrounded by artifacts and systems that represent an earlier way of doing things, it is almost like they have a magnetic pull. What we believe is possible depends upon what we’ve experienced and it’s unfortunate that many people haven’t experienced great culture, teams, or technical practice. Training, workshops and books can help but they aren’t real experience; they are borrowed experience. They aren’t the same as a real work environment. </p>

<p>If we accept that the earliest experiences we have are most significant and that the best learning happens <em>in the work</em>, it becomes important to find the teams that are doing well. Hire people into these teams and let that be their first work experience in your company. These teams become gateways for participation in the rest of the organization. </p>

<p>Because work is a socio-technical system, it isn’t just the team that matters. Do you have a few products with exemplary code, great continuous delivery practice and people who are kind to each other? If you do, bring people into your organization through those products&#39; teams. At the very least, having worked with one of those teams for a few months, they will know what “good” looks like. It’s the early experience that helps them see what is possible and what to strive for in culture and practice.</p>

<p>You might be worried that once people go through the gateways they’ll be despondent if they end up working on products that aren’t doing as well. It’s going to happen. Support them and help them make it all better. Remember that it is better than bringing people directly into teams without a <em>north star</em>, or guiding experience of what your organization can be.</p>

<p>Give people the important experiences first. It&#39;s a way to curate culture and practice as you grow.</p>
]]></content:encoded>
      </item>
  </channel>
</rss>