I’ve been doing a bunch of coding with AI assistance, ranging from souped-up auto-complete to full-on vibe coding. I’m learning a ton and am blogging about my AI coding projects here.
Three thoughts about software development with AI that I wanted to get down on paper: the return to waterfall, the perils of “product” and the primacy of evaluation.
Return to Waterfall
Many vibe coding best practices are a regression to waterfall software development. Waterfall is known for its sequential approach: an idea is translated into a detailed product specification, which then serves as the basis for implementation. Harper Reed’s excellent AI codegen process is pretty much what I use, and it is a good example of this type of thinking. A thorough specification is developed through conversation with an LLM. Then a set of prompts is developed from the specification. Then the spec and prompts are used by the codegen AI for implementation. This contrasts with more modern agile development processes that integrate requirements gathering, design, and implementation.
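In code terms, the shape of that process looks roughly like the sketch below. This is a minimal illustration of the spec → prompts → implementation pipeline, not Harper Reed’s actual tooling; call_llm, the prompt templates, and the step-splitting logic are all hypothetical placeholders.

```python
# A minimal sketch of the spec -> prompts -> implementation pipeline described
# above. `call_llm` is a hypothetical placeholder for whatever model API you use;
# the prompt templates are illustrative, not anyone's actual wording.

SPEC_PROMPT = (
    "Ask me questions one at a time until the requirements for this idea are "
    "fully specified, then write a thorough spec.\n\nIdea: {idea}"
)
PLAN_PROMPT = (
    "Break this specification into small, ordered implementation prompts for a "
    "code-generation model. One prompt per line.\n\nSpec:\n{spec}"
)
STEP_PROMPT = "Specification:\n{spec}\n\nImplement this step:\n{step}"


def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (OpenAI, Anthropic, a local model, etc.)."""
    raise NotImplementedError


def waterfall_codegen(idea: str) -> list[str]:
    # Stage 1: develop a thorough spec through conversation (condensed to one call here).
    spec = call_llm(SPEC_PROMPT.format(idea=idea))
    # Stage 2: derive a sequence of implementation prompts from the spec.
    plan = call_llm(PLAN_PROMPT.format(spec=spec))
    steps = [line for line in plan.splitlines() if line.strip()]
    # Stage 3: hand the spec plus each prompt to the codegen model, in order.
    return [call_llm(STEP_PROMPT.format(spec=spec, step=step)) for step in steps]
```

The point of the sketch is the ordering: each stage is finished before the next begins, which is exactly the sequential structure that makes it waterfall.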
Is this a step backward? Perhaps the speed of iteration means that the waterfall cycles become quick enough to look more agile? Or maybe agile can’t work if the implementer is an LLM? Or maybe the overwhelming difficulty of keeping AI codegen programs on task and executing within scope merits more specification and less iterative adaptation?
In any case, the problems of waterfall need to be considered in the development process. In my experimentation, AI codegen often gets hung up on a way of doing things that is consistent with the spec but not with reality (just as in other waterfall processes). In response, I usually scrap the whole attempt and start again at the beginning of the waterfall with a new specification.
Perils of “Product”
With waterfall development also comes thinking about software as a “product” that gets “finished,” as opposed to a service that gets continuously maintained and improved. While AI one-shots get a lot of coverage, much less attention is devoted to the more difficult proposition of working on an established code base or, even more importantly, AI coding on top of AI coding to continuously maintain and improve a code base.* Thinking of software as a product has all sorts of pitfalls, not least of which is the fact that almost no software of any import stays the same, because how it is used changes over time and the world does too. Maintenance and constant evolution are the more important pattern. I’m excited that I’m starting to see more work on those patterns. How we think about AI-aided maintenance of existing code bases will be extremely important. Knowing the current problems of AI codegen, I’m pretty worried about AI maintenance of AI code.
Primacy of Evaluation
That ties neatly into the importance of being able to evaluate AI-coded changes. Lili Jiang gave a great talk on the subject at the O’Reilly Coding with AI Conference. She also has a Medium post that is well worth a read. She highlights that, for software that incorporates AI functionality, evaluation is a bigger part of building great software, that comparing changes to benchmarks is key, and that human evaluation is also important. A big part of the greater importance of eval is the shift from relatively deterministic approaches to automation to non-deterministic ones. While you might evaluate a deterministic system on the basis of the correctness of its algorithm or output, non-deterministic systems can frustrate that approach and call for more investment in evaluation. This is especially true with relatively opaque non-deterministic systems. That has some significant ramifications for policy, e.g. maybe the FDA is a better model than the FTC. And, though the thrust of Ms. Jiang’s arguments is about software that incorporates AI functionality, her prescriptions also apply to coding with AI.
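To make the deterministic/non-deterministic contrast concrete, here is a toy sketch (my own illustration, not drawn from Ms. Jiang’s talk) of benchmark-style evaluation: rather than asserting a single correct output, you sample the system repeatedly and track a score against a reference set.

```python
# Toy benchmark-style evaluation for a non-deterministic system. The `system`
# and `score` functions are hypothetical; scoring might be exact match, a
# similarity metric, or a human/LLM judge in practice.

import statistics
from typing import Callable

Benchmark = list[tuple[str, str]]  # (input, reference answer) pairs


def evaluate(system: Callable[[str], str],
             benchmark: Benchmark,
             score: Callable[[str, str], float],
             samples: int = 5) -> float:
    """Average score over the benchmark, sampling each case several times."""
    case_scores = []
    for prompt, reference in benchmark:
        runs = [score(system(prompt), reference) for _ in range(samples)]
        case_scores.append(statistics.mean(runs))
    return statistics.mean(case_scores)
```

A deterministic system could be checked with a single assert per case; here the strongest claim you can usually make is that a change scores no worse than the previous version did on the benchmark.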
My key takeaway is to front-load project elements that enable human evaluation of progress. This is similar to the agile concept of getting to an MVP fast, but it means that I intentionally front-load human-readable output and evaluations that have straightforward answers while delaying the harder-to-evaluate pieces. It also means that human-readable hooks are important. I don’t just develop an API with a test suite; instead, I make sure that I can use the API and see its output. This is one protection against AI coding assistants’ constant reversion to gaming tests to make them pass. If I can see what is going on, it is easier for me to catch it. If all of that happens earlier in the process, not only do I not waste a bunch of time and dollars, but I also don’t have a codebase that has grown from a flawed premise.
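Here is a small sketch of what one of those human-readable hooks can look like: a tiny driver, built early, that exercises the API and prints output a person can eyeball. The module and function names (myproject, load_records, summarize) are hypothetical placeholders for whatever is actually being built.

```python
# A small driver built early so a human can run the API and inspect its output
# directly, rather than relying only on the test suite.
# `myproject`, `load_records`, and `summarize` are hypothetical placeholders.

import json
import sys

from myproject import load_records, summarize


def main(path: str) -> None:
    records = load_records(path)
    summary = summarize(records)
    # Print something a human can sanity-check, not just a pass/fail bit.
    print(f"Loaded {len(records)} records from {path}")
    print(json.dumps(summary, indent=2))


if __name__ == "__main__":
    main(sys.argv[1])
```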
It is worth noting, however, that this distrust of AI agentic testing puts a lot of extra burden on the human evaluator. In regular coding I would never simply trust an evaluation based on seeing the output; I would want an excellent suite of tests with good test coverage. If you are developing something real, that’s still going to be the right approach, and you’ll need to be able to understand and verify the tests. That job will likely include humans who code for a while longer, or maybe always.