Some Current AI Coding Thoughts

I've been doing a bunch of coding with AI assistance, ranging from souped-up auto-complete to full-on vibe coding. I'm learning a ton and am blogging AI coding projects here.

Three thoughts about software development with AI that I wanted to get down on paper: the return to waterfall, the perils of “product” and the primacy of evaluation.


Return to Waterfall


Many vibe coding best practices are a regression to waterfall code development. Waterfall is known for its sequential thinking about project development: an idea is translated into a detailed product specification, which is then used as the basis for implementation. Harper Reed's excellent AI codegen process is pretty much what I use, and it is a good example of this type of thinking. A thorough specification is developed through conversation with an LLM. A set of prompts is then developed from the specification. The spec and prompts are then used by the codegen AI for implementation. This contrasts with more modern agile development processes that integrate requirements gathering, design, and implementation.


Is this a step backward? Perhaps the speed of iteration means that the waterfall cycles become quick enough to look more agile? Or maybe agile can't work if the implementer is an LLM? Or maybe the overwhelming difficulty of keeping AI codegen programs on task and executing within scope merits more specification and less iterative adaptation?


In any case, the problems of waterfall need to be considered in the development process. In my experimentation, AI codegen often gets hung up on a way of doing things that is consistent with the spec but not with reality (just as happens in other waterfall processes). In response, I usually scrap the whole attempt and start again at the beginning of the waterfall with a new specification. 


Perils of “Product”


With waterfall development also comes thinking about software as a “product” that gets “finished,” as opposed to a service that gets continuously maintained and improved. While AI one-shots get a lot of coverage, much less attention is devoted to the more difficult proposition of working on an established code base or, even more importantly, AI coding on top of AI coding to continuously maintain and improve a code base.* Thinking of software as a product has all sorts of pitfalls, not least of which is the fact that almost no software of any import stays the same: how it is used changes over time, and so does the world. Maintenance and constant evolution are the more important pattern. I'm excited that I'm starting to see more work on those patterns. How we think about AI-aided maintenance of existing code bases will be extremely important. Knowing the current problems of AI codegen, I'm pretty worried about AI maintenance of AI code.


Primacy of Evaluation


That ties neatly into the importance of being able to evaluate AI-coded changes. Lili Jiang gave a great talk on the subject at the O'Reilly Coding with AI Conference. She also has a Medium post that is well worth a read. She highlights that, for software that incorporates AI functionality, evaluation is a bigger part of building great software, that comparing changes to benchmarks is key, and that human evaluation is also important. A big part of the greater importance of eval is the shift from relatively deterministic approaches to automation to non-deterministic ones. While you might evaluate a deterministic system on the basis of the correctness of its algorithm or output, non-deterministic systems frustrate that approach and call for more investment in evaluation. This is especially true with relatively opaque non-deterministic systems. That has some significant ramifications for policy, e.g. maybe the FDA is a better model than the FTC. And, though the thrust of Ms. Jiang's arguments is about software that incorporates AI functionality, her prescriptions also apply to coding with AI.


My key takeaway is to front-load project elements that enable human evaluation of progress. This is similar to the agile concept of getting to an MVP fast, but it means that I intentionally front-load human-readable output and evaluations that have straightforward answers while delaying the harder-to-evaluate pieces. It also means that human-readable hooks are important. I don't just develop an API with a test suite; instead, I make sure that I can use the API and see its output. This is one protection against AI coding assistants' constant reversion to gaming tests to make them pass. If I can see what is going on, it is easier for me to catch it. If all of that happens earlier in the process, not only do I not waste a bunch of time and dollars, but I also don't have a codebase that has grown from a flawed premise.
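
As an illustration, here is a minimal Python sketch of that pattern. The search_listings function and its sample data are hypothetical; the point is just that the automated test sits next to a human-readable entry point.

    # Minimal sketch (hypothetical API and data): keep an automated test with a
    # straightforward answer AND a human-readable hook, so you can eyeball the
    # output yourself rather than only trusting a green test suite.
    import sys


    def search_listings(query: str) -> list[dict]:
        """Pretend API under development: return listings whose title matches."""
        catalog = [
            {"title": "Hamlet", "venue": "Barbican"},
            {"title": "The Tempest", "venue": "Globe"},
        ]
        return [item for item in catalog if query.lower() in item["title"].lower()]


    def test_search_listings():
        # Automated check with an unambiguous expected answer.
        assert search_listings("hamlet") == [{"title": "Hamlet", "venue": "Barbican"}]


    if __name__ == "__main__":
        # Human-readable hook: `python listings.py tempest` prints what the API returns.
        query = sys.argv[1] if len(sys.argv) > 1 else ""
        for item in search_listings(query):
            print(f"{item['title']} @ {item['venue']}")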


It is worth noting, however, that this distrust of AI agentic testing puts a lot of extra burden on the human evaluator. In regular coding I would never simply trust an evaluation based on seeing the output; I would want an excellent suite of tests with good coverage. If you are developing something real, that's still going to be the right approach, and you'll need to be able to understand and verify the tests. That job will likely include humans who code for a while longer, or maybe always.


* And yes, that sounds like guaranteed full employment for human coders to me.

Understanding Claude Code Sessions

Claude Code logs a bunch of stuff as JSONL. The logs are a little hard to read but include all the requests a user makes and a bunch of other information (such as tool calls). I made a rough-and-ready parser that shows the logs in a more human-readable form. It can also show git commits in the log timeline so that you can see which changes to the code correspond to a given stretch of Claude Code work.

https://github.com/amac0/ClaudeCodeJSONLParser
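
The parser itself is a single HTML/JavaScript file, but the core idea is simple enough to sketch in a few lines of Python. Treat the field names below ("type", "timestamp", "message") as assumptions about the log schema rather than a spec; the real entries carry more detail and the names can vary across Claude Code versions.

    # Rough sketch of walking a Claude Code session log (one JSON object per line).
    # Field names are assumptions; missing keys are handled gracefully.
    import json
    import sys
    from pathlib import Path


    def summarize_session(path: str) -> None:
        for line in Path(path).read_text().splitlines():
            if not line.strip():
                continue
            entry = json.loads(line)
            kind = entry.get("type", "unknown")
            timestamp = entry.get("timestamp", "")
            message = entry.get("message")
            content = message.get("content", "") if isinstance(message, dict) else ""
            # Content may be a plain string or a list of blocks; flatten for display.
            if isinstance(content, list):
                content = " ".join(
                    block.get("text", "") for block in content if isinstance(block, dict)
                )
            print(f"[{timestamp}] {kind}: {str(content)[:120]}")


    if __name__ == "__main__":
        summarize_session(sys.argv[1])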

All coded via LLM, mostly o3; when o3 had trouble, I switched to Claude 4 Sonnet (in Claude Code). It is very easy to install: just download the HTML file and open it locally in your browser.

One big learning from this one (it is mostly JavaScript, which I don't know that well) was the importance of making the LLMs send a lot of debugging information to the console so that I could see what was going on.
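
In the project itself that meant lots of console.log calls in the JavaScript; the same habit, sketched in Python for illustration, looks something like this (parse_commits and the log messages are made up):

    # Illustration of the habit (hypothetical function): ask the LLM to log
    # generously so a human can follow what the generated code actually did.
    import logging

    logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(name)s: %(message)s")
    log = logging.getLogger("timeline")


    def parse_commits(raw: str) -> list[str]:
        lines = raw.splitlines()
        commits = [line for line in lines if line.startswith("commit ")]
        log.debug("saw %d lines, kept %d commit lines", len(lines), len(commits))
        return commits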

Claude Code + Theater Scraper

I'm playing around with AI coding, which is fun and frustrating and very educational.

I'm going to post some of my attempts and observations.

I use Harper's excellent LLM Codegen Workflow (see also this followup). For this experiment, I used Claude Code. I did this in late Feb 2025 (sorry for the delay in writing it up).

My basic project was to write a scraper that would get theater listings from a bunch of London theaters and send me an email daily with new listings. I thought this was a good experiment because it was a relatively easy project in a domain I know well enough to do myself.

Here is the Spec that I came up with in a back-and-forth with ChatGPT 4o. It really wanted to expand the scope and to use Selenium (to simulate web browser requests) even though it wasn't needed.

I then asked for a specific set of prompts and a todo list for those prompts (I think this was with o1, but it could have been o3-mini). The result was pretty good and I felt ready to go.

I was trying to minimally intervene in the code generation process. 

Today the results might be different (I have since borrowed a better Claude.md, thanks Jessie Vincent, and both Claude Code and Claude itself have gotten better), but the first attempt was a disaster. Claude Code kept trying to do all of the prompts at once but, more importantly, it seemed completely lost when it came to the actual work of figuring out how to get the right information from the web pages. I spent a bunch of time and Claude $ but eventually scrapped that attempt entirely and started fresh with one big change: I manually went out and downloaded every single page I wanted to be able to scrape and put them in a folder (tests/fixtures).

That one change really made a big difference. Claude Code still wanted to do everything all at once, but now I could push it towards correct answers about what to look for in the HTML and what the outputs of scraping the fixtures should be. The result is something that is useful and seems to be working.
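
Here is a sketch of what that fixtures setup enables, assuming Python, pytest, and BeautifulSoup; the file name, CSS selector, and expected title are all made up:

    # Sketch of the fixtures approach (hypothetical file name, selector, and title):
    # parse a locally saved copy of the listings page so the test is deterministic
    # and a human can check the expected values against the real site by hand.
    from pathlib import Path

    from bs4 import BeautifulSoup

    FIXTURES = Path(__file__).parent / "fixtures"


    def parse_listings(html: str) -> list[str]:
        """Pull show titles out of a theater's listings page."""
        soup = BeautifulSoup(html, "html.parser")
        return [el.get_text(strip=True) for el in soup.select(".event-title")]


    def test_parse_example_theater():
        html = (FIXTURES / "example_theater.html").read_text()
        titles = parse_listings(html)
        # Expected value comes from reading the saved page by hand.
        assert "Hamlet" in titles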

My big takeaways were: 

  1. be prepared to throw everything out (also, learn & incorporate git);
  2. make the spec and the prompts simple -- no, simpler than that; 
  3. anything you would want to have at your disposal when coding, make sure Claude Code has and knows it has; 
  4. stop Claude Code often to point out obvious things -- "that is out of scope for this step", "mocking the test result doesn't mean you passed the test", "yes the tests are important, you should still do the tests and not move on if some are failing.";
  5. pay attention to Claude Code and intervene;
  6. Claude Code will do better in areas that you know because you'll be able to tell when it is not doing good stuff and stop/redirect it;
  7. this type of coding is a bit like social media scrolling in its dopamine-slot-machine quality (someone at the Coding with AI conference said this and I agree, but I forgot who).