Two Steps Forward, One Step Back

Posted on Jan 26, 2025

Building software with generative AI

Before we get down to business, I would like to add a little preface to set the context: my reasons, constraints, and goals for experimenting with generative AI. My approach, though not unique, is certainly less common: I set out to build software MVPs exclusively using GPT, from scratch, in contrast to organizations and individuals using GPT on existing, large, and complex codebases. I vowed not to write a single line of code myself.

Why?

My goal in building software products with GPT was multifaceted. Through this experience I hoped to

  1. Learn how to work with GenAI through hands-on application and, where possible, abstract patterns from that experience. (When I finished, it made me think deeply about another problem; I write about that later.)
  2. Build and ship software (always fun!).
  3. Understand how real the hype is.

Read more in my previous blog post here.

How?

When learning something new, my Type-A brain has always found it best to make an ordered list of the topics I want to cover. Which is exactly what I did! I created the following list of the types of apps (both web and mobile) I wanted to build, in order of increasing complexity. Keeping an eye on the field helped me come up with this fairly quickly.

  1. Vanilla web/mobile app
  2. LLM wrapper app
  3. LLM app with RAG
  4. Agentic app
  5. Self-hosted LLM app with RAG
  6. LLM on edge app

I wanted to build fun and useful, but most importantly, fun products. I spent the first few weeks of my sabbatical brainstorming with friends and documenting potential ideas I’d like to work on. Below are the ones that made it past the idea phase.

  1. [Vanilla Web App] Polyglotrot: Travelling? Get the top phrases from the local language at your fingertips!
  2. [LLM Agent App] Sheety: Chat with your spreadsheet in natural language and uncover insights faster!
  3. [LLM Wrapper App] Potluru: Report pothole sightings and explore your daily commute’s worst enemies’ fan page!
  4. [LLM Wrapper App] FIFAWall: Immortalize yourself and your FIFA battles on a public wall!

Although I am only halfway through this list, I have built and deployed the aforementioned four web applications over the past couple of months. I finally felt I had enough experience working with GPT to put my thoughts together while continuing to build.

Background

I also think it is only fair that I provide some background about myself. I’ll only include details that I think might be relevant to this post. I am a former entrepreneur who has worked in product management for the past decade. I have a decent understanding of software engineering practices, working with Python, databases, developing APIs with lightweight frameworks like Falcon, Flask, etc., but little to no understanding of JavaScript or other front-end frameworks. Another way of putting it, which is what I often resort to, is that I am skilled enough to write hacky, (barely) production-ready, but non-scalable code.

About This Post

I didn’t actively write this during the experience. I jotted down a few notes here and there, but nothing substantial. So, in a way, this is an output of my reflection on this period of my professional life. On the one hand, this post had the benefit of hindsight, but on the other, I think a lot of nuance got lost along the way. Mental note taken :)

The Beginning

We’ve been telling machines what to do for a long time. In fact, we’ve been writing code for some 80 years: the first programs instructing a modern electronic computer were developed during World War II. In the decades since, programming has been widely adopted and applied. Even then, only a few (as a fraction of the total population) with access to resources and training were able to do so, keeping the community close-gated. Unlocking this capability through natural language holds the promise of mass adoption and large-scale impact: the promise that anyone with agency can now build software solutions to a problem they may be facing.

The tech world is abuzz with claims that generative AI will soon completely replace entry-level software engineers. Countless articles suggest that junior engineering roles are becoming obsolete and that opportunities for new developers are rapidly disappearing. Based on my experience of working as a PM with a dedicated team of engineers and working with GPT to build software, I’ve tried to draw parallels between GPT and a junior software engineer in the post below.

This may feel like trivializing software engineering, but it is anything but. Software engineering is an art; only someone who has never contributed to software would argue otherwise. But as GPTs generate works of art, and as we ponder whether art is something intrinsically human, you realize that our world has permanently changed!

Since it’s always changing, what’s new, you might ask. For one thing, the entropy of this change, and therefore its impact radius, is massive. Without knowing much about anthropology, philosophy, or even biology, I had liked to believe that creating something, anything, that another being could share in was fundamentally human. I am sure I will end up questioning that belief.

Stack

Before I started building, I had to choose my toolbox. What may have seemed like an innocuous decision quickly turned into an extremely overwhelming experience. There are layers upon layers of software written on top of one another, all designed to work with LLMs, all solving different problems in the SDLC. There are IDEs, managed hosting/infrastructure platforms, low/no-code platforms, open-source libraries, and more, all leveraging AI to optimize and enhance specific parts of the lifecycle. I guess it is turtles all the way down! There is Cursor, Bolt, Copilot, Replit, Streamlit, Hugging Face, Embedchain, LangChain, Lovable, just to name a few. What adds to the complexity is the overlap: more and more tools are expanding their value proposition to cover the full stack, which only makes it harder to choose. If you’re someone who could easily spend hours in a brick-and-mortar store browsing the aisles without making a decision, I’ll save you some time: go with the noise. As in the gold rush days, make the easy choices influenced by what the community is talking about the most, good or bad; it is a self-reinforcing cycle. Regardless of what I say, this is a deeply personal decision that should be driven by your constraints, goals, and skill set. After spending some time understanding these tools, I narrowed it down to the following.

  • IDE: Cursor
  • LLM: Claude 3.5 Sonnet
  • UIDE: v0.dev by Vercel
  • Backend: Django, FastAPI
  • Database: SQLite (development) and MySQL (production)
  • Frontend: Next.js, React, and TypeScript
  • Cloud: AWS (Lightsail)
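As an aside, the SQLite-in-development, MySQL-in-production split usually comes down to a single switch in configuration. A minimal sketch of the pattern, assuming hypothetical environment-variable names of my own choosing (this is not code from any of the MVPs):

```python
import os

def database_url() -> str:
    """Pick the database by environment: SQLite locally, MySQL in production.

    DATABASE_URL, APP_ENV, DB_* are illustrative names; any env-var
    convention works just as well.
    """
    explicit = os.environ.get("DATABASE_URL")
    if explicit:
        return explicit
    if os.environ.get("APP_ENV") == "production":
        # In a real deployment these would come from secrets management.
        user = os.environ.get("DB_USER", "app")
        password = os.environ.get("DB_PASSWORD", "")
        host = os.environ.get("DB_HOST", "localhost")
        name = os.environ.get("DB_NAME", "app")
        return f"mysql://{user}:{password}@{host}/{name}"
    # Default: a local SQLite file, zero setup.
    return "sqlite:///./dev.db"
```

Both Django and FastAPI (via SQLAlchemy) can consume a URL like this, which keeps the dev/prod difference in one place.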

I’ve tried to bring about some semblance of structure to my experience by abstracting themes that I observed while working with GPT.

WOW

In the beginning, I found myself saying that out loud, a lot. I mean, it is exciting to see your brief translated into working code in a matter of seconds. I’d call over my wife, friends sitting nearby, and let them marvel at what GPT had managed to produce. Be it

  1. Building and maintaining responsive designs for multiple devices from the start, which any engineer will tell you is a pain in the ass.
  2. Building a fully functional search with a seamless UX by simply specifying that I wanted to give users the ability to search from a list of categorical options.
  3. Building a client-side image compression/downsampling utility as a drop-in replacement for my existing codebase.
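For context, the search in point 2 boils down to logic along these lines; the function and field names here are mine, for illustration, not GPT’s actual output:

```python
def search(items, query="", categories=None):
    """Filter records by selected categories, then by a free-text query.

    `items` are dicts with hypothetical "name" and "category" keys,
    mirroring the brief I gave: search within a list of categorical
    options.
    """
    query = query.strip().lower()
    selected = set(categories or [])
    results = []
    for item in items:
        # Category filter: only applies when the user picked categories.
        if selected and item["category"] not in selected:
            continue
        # Free-text filter: case-insensitive substring match on the name.
        if query and query not in item["name"].lower():
            continue
        results.append(item)
    return results
```

The trivial part is the filtering; what GPT earned the “wow” for was wiring this into a seamless UX from a one-line brief.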

However, I noticed an underlying pattern in the input when the output was categorized as “wow”. More often than not, the requirement was commoditized and not novel. GPT had seen tens of thousands of similar implementations. To give it its due, GPT saved hours setting up authentication, integrating with external services, adding code comments, explaining what the code did, building commoditized features, and more. Like everything else in life, I was seduced by the promise of AI, only to see the shine fade a while later.

I have come to realize that one key to GPT-driven development is striking a balance in how finely you break a complex task into chunks. Go too granular and you may end up doing more work than you intended; skim the surface and you invite systematic and structural defects downstream.

System Design

On more than one occasion, I have realized that a major refactor was the only way to move forward. Worse, two out of three times this was a realization that I, not GPT, brought to the table. When trying to build something, GPT was incapable of analyzing the codebase for systemic defects unless explicitly told to do so. Even then, half the time it didn’t come back with an appropriate answer. Most of these were related to the definition, structure, interaction, and data flow of components on the front-end, and the database design on the back-end.

It took me hours and sometimes days to get back to where I was. Even though my codebase was by no means large (given that I was building MVPs), there was no way a request like “Help me refactor my current codebase to support X while preserving all current features, functionality and UI/UX” was going to work.

Is this what I would expect when working with a junior engineer? Yes, I would, and it is perfectly acceptable.

Nitwit Misses

Defining variables/functions without invoking them, invoking variables/functions without defining them, suggesting unnecessary code refactors, missing requirements partially or completely, code that led to countless TypeScript build errors, creating new components/APIs for every additional requirement, poor cleanup of unused variables and libraries; the list goes on and on. All of this comes with the full package that GPT promises. I definitely don’t expect 90% of these from a junior engineer. However, most of them don’t end up being headaches because they are either easily fixable or can be ignored, but every now and then you find yourself saying, “How stupid is that!”

One Step Forward, Two Steps Back

Contrary to the title, there was a persistent feeling of taking one step forward and two steps back. Testing: the source of my enduring frustration. Since I wasn’t following rigorous software development practices like TDD (I didn’t write a single test case), and given the complexity of writing functional test cases, it was almost too easy for GPT to break anything existing that had even a minor structural overlap. On countless occasions, developing something new

  • for which a brief had already been shared to influence the design, and
  • that had a small overlap with existing functionality

resulted in breaking existing user flows, UI, and UX. There were many times when I had to explicitly ask GPT not to break existing flows with its proposed code changes. I’d usually do this after going through a few chat sessions and following its suggestions, only to find that it had irreversibly broken some of my core functionality. Later, I started doing this proactively.

At no point in development would I want something I spent countless hours on to break when I built something new. I do have to admit that writing functional test cases backwards from the user is not easy in any language. My expectation of the slew of dev tools is that writing, maintaining, and updating test cases will be deeply integrated, with the codebase fed to the LLM as context, when building software with LLMs. Any major (I say major with caution) implementation should go through a regression test suite to ensure that nothing existing breaks as a result of the proposed changes.
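To make that concrete, here is the kind of functional regression check I wish the tooling would generate and maintain automatically. The helper and its contract are hypothetical, purely for illustration; the point is pinning down current behaviour so a refactor that changes it fails loudly:

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical existing business logic a new feature must not break."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Integer arithmetic: floor division avoids float rounding surprises.
    return price_cents * (100 - percent) // 100

def test_discount_regression():
    # Pin down the behaviour users already rely on, so any structural
    # change GPT proposes fails this suite instead of failing in prod.
    assert apply_discount(1000, 0) == 1000
    assert apply_discount(1000, 25) == 750
    assert apply_discount(999, 50) == 499  # floors, by design
```

A runner like pytest would pick up `test_discount_regression` automatically; the hard part, which I wanted the tools to own, is writing and updating dozens of these as the product evolves.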

Would I expect this from a junior engineer? Absolutely not!

Product Design

Creating something from nothing is always exciting. I found myself marveling at GPT’s ability to create clean, consistent, and aesthetically pleasing visual designs from just a few words of prompting. Heck, they were even responsive from the first iteration. Designs that serve their purpose in the context of building an MVP. AI almost always did a great job of understanding the overall vibe/theme of the envisioned product. However, I had bad experiences trying to generate specific visual elements that were more art than code, and found real limits on what I could generate. My thesis is that since most of the tools I used (v0.dev by Vercel and LLMs like ChatGPT and Claude Sonnet) have to generate working code for the visual design, their approach is code-first instead of design-backwards.

I was extremely disappointed with GPTs, even those trained to produce art, when trying to produce something useful for digital products such as illustrations and logos. This was almost always the case when I had a specific theme or concept in mind. I spent a couple of days trying to get these graphics right when building the first two MVPs, only to abandon them and use off-the-shelf alternatives. I am still baffled by the fact that I cannot find a single good logo generator that takes a brief as input and generates high quality logos for digital products. But that’s a problem for another day.

Inexpensive iterations turned out to be more expensive and painful than expected. For small iterations involving spacing, element styles, and other visual details, I found it harder to work with GPT than with a colleague. I often found myself over-contextualizing and explaining the same approach in different ways to get the desired results. I am sure such discussions would have been a breeze with a product designer. I guess GPT has yet to catch up with the level of visual and product context a colleague can understand. Then again, this may turn out to be largely a function of Cursor not being able to hear me speak or see my screen. This is where Gemini’s latest release and other products coming up in this space might make a splash.

Limited (No?) Line of Sight

Through no fault of its own, Cursor proved poor in the following areas where it had limited or no line of sight.

  • Maintaining context of the development history: I found that a proper understanding of what had already been developed, especially user flows and product decisions that had been made, was lacking. This forced me to provide more context about the development process than I would have liked. A likely contributing factor was my iterative approach to building the product. I saw my thought process evolve from unstructured to structured as I iteratively built a project.
  • DevOps: If I had a rupee for every time I faced a new challenge when deploying applications to the public cloud, I’d be a rich man. Perhaps a tall order for GPT, but critical to software deployment: I don’t see AI being able to predict and solve challenges such as serving static/UGC images, getting Caddy/nginx configs working, file permission issues, missing packages, dependencies, and more any time soon. Problems that are as unique as they come. I know this is where tools like Bolt come in handy, but I seriously doubt how well they solve unique user requirements given constraints like infrastructure, cost, flexibility, etc.
  • Ability to simulate: I found GPT’s ability to predict, extrapolate, and simulate conditions outside its current environment to be poor. It failed to anticipate, debug, and fix problems in known but different environments, even when explicitly told to do so. A plethora of problems with CI/CD and TypeScript errors when building Next.js projects in production and build environments led me to abandon some projects and push through others by being persistent and, in some sense, thick-skinned.

Conclusion

I read somewhere that GPT is like an overexcited junior engineer who types really fast. Having mentored similar people, I thought this would either be a great asset or the cause of a major headache. Turns out, like everything in life, it is a little bit of both. GPT may not be the smart intern you thought it would be, but rather a midwit you have to actively manage.

It is true that what might take a software engineer a few days to solve and implement, GPT can do in a matter of hours. The only difference is that you’ll have to spend those hours giving it context, debating multiple approaches, and guiding it to define a solution that actually works, over and over again.

I felt that identity and creativity are what GPTs lack today. And building software/writing code is inherently a creative endeavor. I’ve had to explain the same thing 4 different ways, absorbing changes from each iteration, to get where I want to go. Maybe it is the way I communicate, or maybe it is GPT’s inability to put the current version of the product, the codebase, into context to solve a novel problem. Most of the time GPT took a shot and missed. I kept at it and was able to bend it to my will.

GPT beats around the bush and is easily trapped by its own structure. Once trapped, it has a hard time breaking free; it just becomes more and more verbose and stupid, apologizing profusely. GPT acts less like a sounding board and more like a conformist, ready to hail any idea I put forward as revolutionary. Except for a few occasions, I can’t recall it giving me any actionable insights or helping me discover blind spots in my approach. It seems willing to pander, to please, and to chase positive feedback instead of maintaining the truth-seeking nature you would expect from a partner, confidant, or virtuous colleague.

GPT is great at glossing over the details. I have found that it is unable to quickly develop sophisticated business/software logic. If you’re a product manager who’s really passionate about your craft, you’ll need a lot of persistence and a working knowledge of software development to get things right with GPT. Even more so if you believe in getting things right rather than merely getting them done.

To be fair, building software is a challenging endeavor. Writing code has never been the hardest part of building software; managing, mutating, and extending code in the face of change and emergent behavior is. In the long run, I am not sure how code written by GPT will age. Can engineering stand without a true understanding of product, society, and culture? Only time will tell.

It is critical to note that all of this is the child of an unplanned process. Although I described each prompt in great detail, the building process has always been iterative. I am confident that the results could be better with more planning. However, this is exactly why I am reluctant to work on PRDs, test cases, and better planning in general with GPT. I believe that for anything non-commoditized or novel, the amount of context I’ll need to provide GPT to produce a good document will be equivalent in effort, if not more. But I understand the appeal of using it, since a large percentage of any product will be non-novel. Also, maybe GPT will just surprise me!

As I continue to tinker with generative AI, I’ve been thinking about finding the easiest and fastest way to build quality software products using only AI. Maybe even a template that you can collaborate on with an agent and then let it run in the background.

Overall, in my opinion, building software with GPT leaves you feeling like you are taking two steps forward and one step back!