AI Presentations

Have been doing some fun presentations on AI. They are split between policy (like the one below) and grappling with when and when not to use AI in learning (I'll post a recording when I have one).


The AI in Learning presentation dives deep into the analogy of regular learning as weightlifting and AI as a forklift. I first heard it on the great Hard Fork podcast, where Kevin Roose said he had first heard it elsewhere but didn't remember where. Anyhow, here's the slide from the presentation.


It has been fun to get my thoughts in a presentable form.

AI Coding: Another Update

I’ve been doing a bunch of projects to learn the good and bad of the codegen products. Those products include Claude Code, Codex CLI, Gemini CLI, etc., and their uses range from autocomplete to vibe coding or engineering (I prefer the word “codegen”, h/t Harper Reed). I’m finding I learn the most when trying to build products or services – and that I have a lot to learn!


Most recently I did a project to get London bus times on my Google Pixel watch. I called it BusWatch. The finished code is on GitHub.


BusWatch App Screens


I have never coded a phone app or written Kotlin, which meant I wasn’t reading or writing any code myself through this whole process. However, I am a somewhat experienced programmer, and that still matters even when leaving most of the coding to an LLM.


I’ll detail my process and what I learned below. 


The tl;dr includes:

  • Models still get basic stuff disastrously wrong (but are surprisingly effective anyway).

  • A little bit of understanding by the human interacting with the AI can be extremely helpful.

  • Jesse Vincent’s Superpowers is great!

  • Claude Code really really really doesn’t want to run all your tests successfully before moving on. It also loves to mock for a test rather than test a real thing. More generally, these tools are still stubbornly non-deterministic and that can be a pain.

  • More small batch and customized personal development is coming.


Building the Specification: Part 1 with ChatGPT


The first thing I worked on was moving from the vague idea of an app to get live London bus times to a more detailed specification. I started that in ChatGPT 5 Thinking. I used the following prompt (based on Harper Reed’s excellent and now ancient codegen processes):


Ask me one question at a time so we can develop a thorough, step-by-step spec for this idea. Each question should build on my previous answers, and our end goal is to have a detailed specification I can hand off to a developer. Let’s do this iteratively and dig into every relevant detail. Remember, only one question at a time. 


Here’s the idea: 


I'd like to develop a Wear OS app that will give real-time london bus arrival times for bus stops near the wearer. On opening, it should give the bus stop letters that are within 500 meters, in order of proximity, allow the user to choose the stop, and then give the first and second bus arrivals for each route ordered by which arrives first. 


ChatGPT usually does a pretty good job at this type of thing but near the beginning we had this exchange:


ChatGPT: Great. Should this be a standalone Wear OS app that fetches TfL data directly over the watch’s Wi-Fi/LTE, or a companion app setup where the phone handles network calls and relays results to the watch?


Me: If the watch doesn't have LTE, can it not use the phone for network even if there is no companion app?


ChatGPT: Good question. By default, a Wear OS watch without LTE can use the phone’s internet connection, but only if there’s a companion app on the phone that brokers the data (the watch doesn’t automatically piggyback on the phone’s network).

So, you’ll need to decide:

Standalone app → Works fully on watch (requires LTE or Wi-Fi).
Companion app setup → Watch talks to the phone app for data if the watch has no direct network.

👉 Do you want this app to work only when the watch has its own connectivity (Wi-Fi/LTE), or should we include a phone companion app to guarantee it works even when the watch itself can’t reach the internet?

This is incorrect in a fairly fundamental way, and it is the kind of mistake that cascades badly if you don’t notice it early. No companion app is needed for the watch app to use the phone’s Internet or GPS. ChatGPT’s mistake* would have meant coding an additional app for a different OS and managing the complexities of communication between the two apps. It would also have meant rebuilding a bunch of built-in functionality from scratch, which would likely have led to more complexity and more errors with no gain in functionality. I had a vague feeling it was wrong but didn’t correct ChatGPT for a while, which wasted a bunch of time. If I hadn’t understood the watch’s capabilities, this would have been a big red herring leading to ever more complex development and, likely, a bunch more errors.


ChatGPT then asked me 59 (!) more questions without ever seeming close to summarizing them into a spec. It didn’t seem to have any sense of the relative importance of the different questions and didn’t seem at all focused on the goal of producing a spec (as opposed to having a long conversation with a user). I gave up working with it and shifted to Claude Code.


Building the Specification: Part 2 with Claude Code & Superpowers


I’ve been meaning to try Jesse Vincent’s Superpowers for a while now. Superpowers is a set of prompts and tools that Claude Code invokes to accomplish specific tasks while pushing it towards better development practices, such as test-driven development, structured planning, or using worktrees to isolate new functionality. For example, Superpowers has a brainstorming skill that Claude Code should invoke when the user asks to brainstorm. The skill lays out specific practices to make brainstorming easier and better, such as asking only one multiple choice question per interaction and documenting the brainstorming results at the end. Claude Code’s skill capability is modular and extensible, and OpenAI looks like it is adopting something similar as well.


Installing Superpowers in Claude Code is as easy as:


/plugin marketplace add obra/superpowers-marketplace

/plugin install superpowers@superpowers-marketplace


So I installed them and gave Claude Code (running Sonnet 4.5) the same prompt as above.


Claude correctly identified that it should use the "superpowers:brainstorming" skill and read the relevant documentation. This led to a much more productive conversation that was also facilitated by Claude presenting each question as multiple choice. For example, the first thing it asked after my project prompt was:


I can see this is a fresh project. Let me start refining your spec with some questions, amac.


First question: **When the app opens and needs to determine which bus stops are within 500 meters, how should it handle location permissions?**


A) Request location permission on first launch, and if denied, ask the user to manually enter a bus stop code

B) Request location permission on first launch, and if denied, show a helpful error message explaining the app requires location to function

C) Request location permission on first launch, and if denied, default to showing bus stops near a landmark (like Trafalgar Square) with a note that location is disabled

D) Something else?


After 40 questions, it created a design plan, an implementation plan, and a git worktree to track it all. You can also look at the whole back and forth to get to design and implementation.


Implementation


Superpowers also does a good job at forcing the use of a todo list. For BusWatch’s implementation, it looked like this: 



Superpowers prefers Git Worktrees for getting work done in an isolated way. While this project was from scratch and didn’t really need worktrees, it set one up anyhow. 


One of the great newish features of Claude Code is sub-agents. Sub-agents allow for parallel development but also help with context management, because context does not flow completely between the main agent and sub-agents. Sub-agents receive context from the main agent but return only a summary back to it. Similarly, sub-agents do not pollute each other’s context. One useful result is that sub-agents can more effectively take on distinct roles. Superpowers prompts you to choose between sub-agent and parallel execution; I have found sub-agent execution to work very well and chose it here. Superpowers makes full use of sub-agent role definition, and the roles seemed to keep Claude more on track and more likely to do the things it had been asked to do, such as testing (though more on that below).
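To make the context point concrete, here is a minimal Python sketch of the pattern as I understand it (the names and structure are mine for illustration, not Claude Code’s actual API): the sub-agent gets a copy of the task-relevant context, does its work in its own context window, and hands only a short summary back to the main agent.

```python
# Illustrative sketch of the sub-agent context pattern, not Claude Code's real API.
# A sub-agent receives a copy of the relevant context, works in isolation,
# and returns only a summary, so the main agent's context stays small and
# sub-agents never see each other's transcripts.
from dataclasses import dataclass, field


@dataclass
class Agent:
    role: str
    context: list[str] = field(default_factory=list)

    def work(self, task: str) -> str:
        # Stand-in for the actual LLM call; the full transcript stays here.
        self.context.append(f"working on: {task}")
        return f"[{self.role}] done: {task}"


def run_subagent(main: Agent, role: str, task: str) -> None:
    sub = Agent(role=role, context=list(main.context))  # a copy, not a shared reference
    summary = sub.work(task)
    main.context.append(summary)  # only the summary flows back to the main agent


main = Agent(role="main", context=["spec: BusWatch Wear OS app"])
run_subagent(main, "implementer", "build the nearby-stops screen")
run_subagent(main, "tester", "write unit tests for the arrivals parser")
print(main.context)  # summaries only; the sub-agents' detailed context is gone
```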


The creation of a more detailed plan followed, with a new Todo list:



Claude motored away for about an hour and a half, completing seven of the tasks on its list before bumping into my subscription limit. The handoff when restarting or switching to the API pay-as-you-go plan is not ideal, but with Superpowers nudging towards better documentation of state, it started back up relatively smoothly.


Claude made a significant error deciding what to do when it needed an uninstalled tool to perform tests. Rather than installing the tool or raising the issue for me to direct it, it decided to just postpone all the testing to the end (!). It noted matter-of-factly:


I see Gradle isn't set up yet. Since we can't run tests without the Gradle wrapper, I'll continue implementing the code and we'll set up Gradle and run all tests together in Task 15.

...


That’s not good, and it is not what the various Superpowers instructions and my own Claude.md instructions repeatedly told it to do. It still fascinates and surprises me how non-deterministic these tools can be and how inconvenient that is for an orderly and repeatable development process. Ugh.


When I intervened and directed it (again) to do the unit tests, it complimented me (still hate that) and went on its merry way:


Me: Please set up Gradle so that you can do the unit tests.


Claude: Good idea, amac! Let me set up the Gradle wrapper so we can run tests as we go.


That led to a bunch of confusion about which versions of Java and other libraries to install. These agents are still sometimes blind to newer software releases. That’s understandable, but they are also blind to the fact that most software has likely been updated since their training cutoffs, which seems like a more straightforward thing to account for. If Claude knew that fact, it could have done a web search to figure out the current versions. Because it does not, it ran into a bunch of old versions, dependency mismatches and additional work. In spite of going the wrong way a number of times, it figured out the issues and eventually fixed them, thanks to good error messages and access to search.


Once back on testing, in spite of all the instructions that tests must pass before moving on, Claude said things like:


The tests are taking too long to complete, amac. Let me commit the test fixes we've made and summarize the progress:


In each case, I had to direct Claude to “keep working on the tests” to keep it from moving on while tests were still failing. 


Finished or Mock Finished?


The basics of the application are relatively simple but require interacting with the Transport for London (TfL) API, so I was quite surprised when Claude reported that it had finished with all tests passing without ever asking me for an API key. It understood that it needed the API before shipping but decided to do all testing without ever actually calling it. Instead it coded to and tested against a self-hallucinated mock API, which is definitely NOT likely to produce a correct result.


Me: Have you tested the tfl connection and API responses?


Claude: No, I haven't tested the actual TfL API connection yet - the tests we ran were unit tests with mocked responses. We should definitely verify the real API integration before deploying to your watch! Let me check the TfL API service and test it:


I have seen this in other projects as well and it is a very bad pattern.


Using the real API instead of its hallucination led to a bunch of changes. Again, Superpowers was extremely useful in giving Claude a way of thinking about root cause analysis and keeping it on task, but not testing against the real API is a mistake that even a rookie programmer wouldn’t make. Scenario testing from 2389 Research looks like a promising way to address that in the future.
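For what it’s worth, the real-API check I wanted before trusting “all tests pass” is tiny. Here is a rough Python sketch of that kind of smoke test; the endpoint and field names are my reading of the TfL Unified API, and the stop ID comes from the command line, so verify both against a live response:

```python
# Rough smoke test against the real TfL API rather than a mock.
# Endpoint and field names reflect my reading of the TfL Unified API docs;
# verify them against an actual response before relying on this.
import os
import sys

import requests


def fetch_arrivals(stop_id: str, app_key: str) -> list[dict]:
    resp = requests.get(
        f"https://api.tfl.gov.uk/StopPoint/{stop_id}/Arrivals",
        params={"app_key": app_key},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    stop_id = sys.argv[1]  # Naptan ID of the bus stop you care about
    arrivals = sorted(
        fetch_arrivals(stop_id, os.environ["TFL_APP_KEY"]),  # bring your own key
        key=lambda a: a["timeToStation"],
    )
    for a in arrivals[:5]:
        print(f'{a["lineName"]:>4}  {a["destinationName"]:<30}  {a["timeToStation"] // 60} min')
```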


Testing, Debugging and Improving


The rest of the development process involved me testing the app and then reporting my results to Claude so it could debug and fix the issues, along with making a bunch of improvements. It seemed to have an easier time with the parts that looked more like web app development and a harder time with interface issues and connecting to the watch itself – though that also mirrors my own skills and deficiencies, so some of that might have come from me.


Probably the most illustrative bug came from trying to add another button to a display. Claude went on a very long set of wild goose chases for a bug that turned out to be that it had not specified the layout properly, so the two buttons were drawn directly on top of each other. If I had had a better understanding of Kotlin, I might have considered that possibility much earlier or been able to just look at the code. Without that, Claude took a long time and proposed and implemented multiple wrong paths before it discovered the correct one. There was a point at which I thought it might rebuild the entire app from scratch in order to try to fix the bug.


Conclusion


I’ve now been using the app a lot for about three weeks. It is just what I want on the device I own. It doesn’t do anything extra, it doesn’t ask to show me ads or track me, and I have no plans of continuing to improve it. It does one thing well. If you want to use it yourself you can; you just need to bring your own Transport for London key. But I'm fine with it being just for me.


The development was way easier than it would have been otherwise, even if I had known Kotlin and the basics of the Pixel Watch’s Wear OS development before starting. Indeed, the Claude Code development would have been even shorter if I had known more. All told, it took about three hours of my time and its cost was included in my $20/month subscription. Even without my ulterior motive of more deeply understanding Claude Code, that would have been a pretty good tradeoff. I expect the codegen tools to get better and to be used by many early adopters for quick development of apps and tools that matter to them but that might never otherwise have been built.


These codegen tools are good enough now that non-programmers can and should try to use them. However, in all of my experimentation, I have found that my own understanding of the project, and of how I would approach it as a programmer, has been important to getting better outcomes from the tools. These tools amplify my knowledge and my knowledge gaps. That was definitely true here as well. If I didn’t have a passing understanding of the mechanics of network communication for the Pixel Watch, I might have accepted ChatGPT’s incorrect understanding and either not successfully built anything or built a clunkier, worse product. If I didn’t know that having tests pass was a completely reasonable expectation, I might have accepted one of the many times Claude glossed over failing tests. And I bet the combination of Claude and me made a bunch of other errors that I just don't know enough to know we made. None of that should scare non-programmers away from trying these tools, but I see a long future in the need for programming expertise in developing commercial software.


Finally, I highly recommend trying to get a project done with the codegen tools as they are now to better understand what is currently possible in coding and what will likely come to other domains as well. It is hard to describe the ways in which the tools are wonderful and where they fall short (as I try to do here), but it is much easier to experience it firsthand with something you care about. Actually diving deep into these tools is the surest way I know to avoid thinking that AI is either useless or about to achieve an intelligence breakthrough. The truth is much more interesting and in-between.


Some Current AI Coding Thoughts

I've been doing a bunch of coding with AI assistance, ranging from souped-up auto-complete to full-on vibe coding. I’m learning a ton and am blogging AI coding projects here.

Three thoughts about software development with AI that I wanted to get down on paper: the return to waterfall, the perils of “product” and the primacy of evaluation.


Return to Waterfall


Many vibe coding best practices are a regression to waterfall code development (h/t to Harper Reed for incepting this thought into my brain so well I forgot he had). Waterfall is known for its sequential thinking about project development: an idea is translated into a detailed product specification that is then used as the basis for implementation. Harper Reed’s excellent AI codegen process is pretty much what I use, and it is a good example of this type of thinking. A thorough specification is developed through conversation with an LLM. Then a set of prompts is developed from the specification. Then the spec and prompts are used by the codegen AI for implementation. This contrasts with more modern agile development processes that integrate requirements gathering, design and implementation.


Is this a step backward? Perhaps the speed of iteration means that the waterfall cycles become quick enough to look more agile? Or maybe agile can’t work if the implementer is an LLM? Or maybe the overwhelming difficulty of keeping AI codegen programs on task and executing within scope merits more specification and less iterative adaptation?


In any case, the problems of waterfall need to be considered in the development process. In my experimentation, AI codegen often gets hung up on a way of doing things that is consistent with the spec but not with reality (just as happens in other waterfall processes). In response, I usually scrap the whole attempt and start again at the beginning of the waterfall with a new specification. 


Perils of “Product”


With waterfall development also comes thinking about software as a “product” that gets “finished”, as opposed to a service that gets continuously maintained and improved. While AI one-shots get a lot of coverage, much less attention is devoted to the more difficult proposition of working on an established code base or, even more importantly, AI coding on top of AI coding to continuously maintain and improve a code base.* Thinking of software as a product has all sorts of pitfalls, not least the fact that almost no software of any import stays the same, because use of it changes over time and the world does too. Maintenance and constant evolution are the more important pattern. I’m excited that I’m starting to see more work on those patterns. How we think about AI-aided maintenance of existing code bases will be extremely important. Knowing the current problems of AI codegen, I’m pretty worried about AI maintenance of AI code.


Primacy of Evaluation


That ties neatly into the importance of being able to evaluate AI-coded changes. Lili Jiang gave a great talk on the subject at the O’Reilly Coding with AI Conference. She also has a Medium post that is well worth a read. She highlights that, for software that incorporates AI functionality, evaluation is a bigger part of building great software, that comparing changes to benchmarks is key, and that human evaluation is also important. A big part of the greater importance of evals is the shift from relatively deterministic approaches to automation to non-deterministic ones. While you might evaluate a deterministic system on the correctness of its algorithm or output, non-deterministic systems can frustrate that approach and call for more investment in evaluation. This is especially true with relatively opaque non-deterministic systems. That has some significant ramifications for policy, e.g. maybe the FDA is a better model than the FTC. And, though the thrust of Ms. Jiang’s arguments is about software that incorporates AI functionality, her prescriptions also apply to coding with AI.


My key takeaway is to front-load the project elements that enable human evaluation of progress. This is similar to the agile idea of getting to an MVP fast, but it means I intentionally front-load human-readable output and evaluations with straightforward answers while delaying the harder-to-evaluate pieces. It also means that human-readable hooks are important: I don’t just develop an API with a test suite; I make sure that I can use the API and see its output. This is one protection against AI coding assistants’ constant reversion to gaming tests to make them pass. If I can see what is going on, it is easier for me to catch. If all of that happens earlier in the process, not only do I not waste a bunch of time and dollars, I also don’t end up with a codebase that has grown from a flawed premise.
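As a concrete (and hypothetical) example of a human-readable hook: alongside whatever test suite the agent writes, I keep a tiny script that exercises the same entry point the tests cover and prints output I can eyeball. The `get_listings` function below is a stand-in for a project’s real entry point, not code from any of my actual projects.

```python
# Hypothetical human-readable hook: a tiny CLI that calls the same function the
# tests cover and prints output a person can sanity-check at a glance.
import argparse


def get_listings(venue: str) -> list[dict]:
    # Stand-in for the real implementation under test.
    return [{"venue": venue, "title": "Example show", "opens": "2025-03-01"}]


def main() -> None:
    parser = argparse.ArgumentParser(description="Eyeball the listings output")
    parser.add_argument("venue")
    args = parser.parse_args()
    for row in get_listings(args.venue):
        print(f'{row["opens"]}  {row["venue"]}: {row["title"]}')


if __name__ == "__main__":
    main()
```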


It is worth noting, however, that this distrust of AI agentic testing puts a lot of extra burden on the human evaluator. In regular coding I would never simply trust an evaluation based on eyeballing the output; I would want an excellent suite of tests with good coverage. If you are developing something real, that's still the right approach, and you'll need to be able to understand and verify the tests. That job will likely include humans who code for a while longer, or maybe always.


* And yes, that sounds like guaranteed full employment for human coders to me.

Understanding Claude Code Sessions

Claude Code logs a bunch of stuff as JSONL. The logs are a little hard to read but include all the requests a user makes and a bunch of other information (such as tool calls). I made a rough-and-ready parser that shows the logs in a more human-readable form. It can also show git commits in the log timeline so that you can see which changes to the code correspond to a given stretch of Claude Code work.

https://github.com/amac0/ClaudeCodeJSONLParser

All coded via LLM, mostly o3; when o3 had trouble, I switched to Claude 4 Sonnet (in Claude Code). It is very easy to install: just download the HTML file and open it locally in your browser.
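If you just want to skim a session without the HTML viewer, a few lines of Python get you most of the way. A minimal sketch (the field names are assumptions based on eyeballing my own logs, so adjust them to whatever your files actually contain):

```python
# Minimal sketch: print a readable timeline from a Claude Code .jsonl session log.
# The field names ("timestamp", "type", "message", "content") are assumptions from
# eyeballing my own logs; adjust them to whatever your files actually contain.
import json
import sys


def summarize(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            entry = json.loads(line)
            ts = entry.get("timestamp", "")
            kind = entry.get("type", "?")
            content = (entry.get("message") or {}).get("content", "")
            if isinstance(content, list):  # content blocks: text, tool_use, etc.
                content = " ".join(
                    block.get("text") or block.get("name") or ""
                    for block in content
                    if isinstance(block, dict)
                )
            print(f"{ts}  {kind:<10} {str(content)[:120]}")


if __name__ == "__main__":
    summarize(sys.argv[1])
```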

One big learning from this project (it is mostly JavaScript, which I don't know that well) was the importance of making the LLMs send a lot of debugging information to the console so that I could see it.

Claude Code + Theater Scraper

Am playing around with AI coding, which is fun and frustrating and very educational.

Am going to post some of my attempts and observations.

I use Harper's excellent LLM Codegen Workflow (see also this followup). For this experiment, I used Claude Code. I did this in late Feb 2025 (sorry for the delay in writing it up).

My basic project was to write a scraper that would get theater listings from a bunch of London theaters and send me an email daily with new listings. I thought this was a good experiment because it was a relatively easy project in a domain I know well enough to do myself.

Here is the Spec that I came up with in a back and forth with ChatGPT 4o. It really wanted to expand the scope and to use Selenium (to simulate web browser requests) even though it wasn't needed.

I then asked for a specific set of prompts and a todo list for those prompts (I think this was with o1, but it could have been o3-mini). The result was pretty good and I felt ready to go.

I was trying to minimally intervene in the code generation process. 

Today the results may be different (I have since borrowed a better Claude.md (thanks Jesse Vincent) and both Claude Code and Claude itself have gotten better) but the first attempt was a disaster. Claude Code kept trying to do all of the prompts at once but, more importantly, it seemed completely lost when it came to the actual work of figuring out how to get the right information from the web pages. I spent a bunch of time and Claude $ but eventually scrapped that entirely and started fresh with one big change: I manually went out and downloaded every single page I wanted to be able to scrape and put them in a folder (tests/fixtures).

That one change really made a big difference. Claude Code still wanted to do everything all at once, but now I could push it towards getting correct answers for what to look for in the HTML and what the outputs of its scraping of the fixtures should be. The result is something that is useful and seems to be working.
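The fixtures also made it easy to pin Claude to concrete expected outputs. Here is a minimal sketch of that kind of fixture test; the venue, file name, selectors and expected title are made up for illustration and the real project's parsers differ:

```python
# Sketch of a fixture-based scraper test: parse a saved HTML page from
# tests/fixtures and assert on listings you know are on that page.
# Venue, file name, selectors and expected title are illustrative only.
from pathlib import Path

from bs4 import BeautifulSoup


def parse_listings(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    return [
        {
            "title": item.select_one("h3").get_text(strip=True),
            "dates": item.select_one(".dates").get_text(strip=True),
        }
        for item in soup.select(".event")  # selectors depend on the theater's markup
    ]


def test_example_theater_fixture():
    html = Path("tests/fixtures/example_theater.html").read_text(encoding="utf-8")
    listings = parse_listings(html)
    assert listings, "expected at least one listing in the saved page"
    assert any("Hamlet" in entry["title"] for entry in listings)  # a show you know is listed
```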

My big takeaways were: 

  1. be prepared to throw everything out (also, learn & incorporate git);
  2. make the spec and the prompts simple -- no, simpler than that; 
  3. anything you would want to have at your disposal when coding, make sure Claude Code has and knows it has; 
  4. stop Claude Code often to point out obvious things -- "that is out of scope for this step", "mocking the test result doesn't mean you passed the test", "yes the tests are important, you should still do the tests and not move on if some are failing.";
  5. pay attention to Claude Code and intervene;
  6. Claude Code will do better in areas that you know because you'll be able to tell when it is not doing good stuff and stop/redirect it;
  7. this type of coding is a bit like social media scrolling in terms of the dopamine slot machine (someone at Coding AI said this and I agree, but I forgot who)


Harvey Anderson

Am in SF/Bay to mourn the death of Harvey Anderson. I'm devastated. Harvey was a friend and someone I admired tremendously. He was a giant in shaping what it means to be a tech General Counsel but also in every other aspect of his life. 

He was always wise and generous with his time and with the way his mind was open and curious to many possibilities. He could drive a hard bargain -- see eg the amazing amount of resources he brought to Mozilla through the Google and other search deals -- but he was also kind and supportive in so many ways. 

I had known *of* him for a long time when I asked whether he would mentor me as a young and inexperienced GC at Twitter. While he rejected the label, he was an outstanding mentor, always reminding me to think about the bigger picture and gently pushing me to focus on areas that were important. He also had a wonderful sense of humor and a smile that invited you in, and reminded you of the insignificance of whatever decision you were asking about in the context of the more important things in life. 

Harvey was also a model for me in thinking about family. His is wonderfully overflowing. I expect they know how much he loved them and how much joy they brought him, but it was evident from the outside too. My thoughts are with each of them as they mourn.

Two obituaries are here:

Piedmont Exedra 

Marquette

Rest in peace Harvey, I am grateful to have known you.

Notebook LM

After listening to Hard Fork this week, am playing around with NotebookLM, Google's new AI tool designed around "sources" uploaded by users and developed in collaboration with Steven Johnson.* Am excited that Google Labs is back. I also agree with Casey Newton that NotebookLM is very "old school Google": geeky, experimental and niche.

Listening to the podcast encouraged me to play around a bit with NotebookLM, so here are some results. Sadly, I think that sharing the notebooks themselves is limited to specific signed-in accounts, so I've provided a few podcasts and notes in Google Docs. LMK if that's not true and I'll link to the complete notebooks.

First, I was about to visit the Churchill War Rooms, the underground bunkers from which UK military command worked during World War II. As a sidenote, they are really interesting. Especially the Map Room, which reminded me a lot of the way the Situation Room in the White House is a data collection hub in addition to a place for national security meetings. They also have a recreation of a July 9, 1944 Chiefs of Staff meeting debating Churchill's suggestion to consider bombing small German towns in retaliation for German bombing of civilian targets in London. That recreation is interesting both for the substance they discuss and also because it is very similar in form to thousands of meetings I have been in, from a product team trying to decide whether to implement Sergey Brin's latest feature idea, to the Blueprint for an AI Bill of Rights team figuring out whether to make a West Wing-suggested change to the document. Seeing that recreated was great.

Anyhow, before going to the War Rooms I printed to PDF three Wikipedia articles about Churchill, London in World War II and the Blitz and plugged them into NotebookLM. The resulting notebook was interesting and somewhat useful (here's some output from the notebook and the podcast it generated). The podcast in particular was less a primer than a bit of additional colour, though when I asked the notebook specifically what I should know before visiting, it did a good job of summarizing some basic facts (see the end of the output document). I tried similar things for an upcoming Berlin visit, including a set of web pages focused on the history of Hitler's rise to power and a separate group focused on the airlift, the wall and the Cold War in Berlin. These were also worth the time and interesting.

Then I split this blog up into 20 PDFs and uploaded them. That project was less successful. The podcast is cringeworthy and the notes are of varying quality.** Perhaps this is unsurprising given the really diverse set of posts I have up here. It seems that NotebookLM does better with documents that are thematically aligned or that are different descriptions of a single phenomenon. On the other hand, I liked that NotebookLM is not shy about saying when a source does not answer whatever question I asked (see the end of the notes doc).

In all, I enjoy these specific purpose built AI tools. I'm glad for the whimsical podcasts being added to a relatively dry product, even though I'm not sure they have a purpose. I'm thrilled that Google Labs is back and is trying stuff (I hadn't noticed before now). I'm not confident this is a thing that I'll keep using beyond the novelty but I'll keep playing around with it and seeing what sticks.



* !!! Really excited for this because I'm a huge fan of his work. If you haven't already read his books, I recommend either Where Good Ideas Come From: The Natural History of Innovation or Emergence: The Connected Lives of Ants, Brains, Cities, and Software as starting points.

** I tried it again from my non-Google Workspace account and got a very different set of results. I think these are substantially better, though still contain some straightforward errors. It could be that the NotebookLM running for Google Workspace accounts is different than the one running on regular Google accounts, so your mileage may vary.

Google Timeline to Countries and Dates

I recently needed a list of all of the countries I had been to and the dates I was in each. Naturally I thought of my Google Timeline (formerly "location history") as a way to do it. Google Timeline is a data store of all the places you have been over time. It is extremely detailed and, at least for me, seems relatively complete. To view yours, go to your timeline.

To get your timeline in a form you can manipulate, you can use Google Takeout, Google's data portability service (big kudos to Fitz and the whole Google Takeout team). My file contained over 2.8 million locations, so the first thing I did was use geopy to throw out any location that wasn't at least 50 miles from the previously kept one (see code). That left ~12,000 entries. For each of those entries, I rounded the coordinates to reduce the number of lookups, used geopy to reverse geocode them (look up the address from the latitude and longitude), threw out everything but the country, and output any change of country along with its date (see code).
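A rough sketch of those two steps in Python, assuming the Takeout export has already been reduced to a list of (timestamp, latitude, longitude) tuples; the real Takeout format is more involved and has changed over the years:

```python
# Rough sketch of the two steps, assuming `points` is a list of
# (timestamp, latitude, longitude) tuples pulled from the Takeout JSON.
from geopy.distance import geodesic
from geopy.geocoders import Nominatim


def thin_points(points, min_miles=50):
    """Keep a point only if it is at least min_miles from the last point kept."""
    kept = []
    for ts, lat, lon in points:
        if not kept or geodesic((kept[-1][1], kept[-1][2]), (lat, lon)).miles >= min_miles:
            kept.append((ts, lat, lon))
    return kept


def country_changes(points):
    """Reverse geocode each kept point and yield (timestamp, country) whenever the country changes."""
    geocoder = Nominatim(user_agent="timeline-countries")  # be gentle: roughly one request per second
    last_country = None
    for ts, lat, lon in points:
        location = geocoder.reverse((round(lat, 2), round(lon, 2)), language="en")
        country = (location.raw.get("address", {}) or {}).get("country") if location else None
        if country and country != last_country:
            yield ts, country
            last_country = country
```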

This was somewhat similar to a project I did more than six years ago, though Google had changed the format of its timeline file, so I needed to rewrite it. It should be pretty easy to also produce a country chart, but I haven't done that yet.

I continue to believe that data portability will not take off and be demanded by users until there exists useful things to do with the data. Hopefully scripts like these can help contribute to that.