I spent a month building two applications using Cursor, forcing myself to let the system drive as much of the development as possible. I wanted to see how far a modern AI agent could take software development if I stayed in a strictly supervisory role. The applications had to matter to me, and they had to matter to other people; otherwise the experiment would retreat into the novelty that earns vibe coding its pejorative reputation. Stakes change how software behaves, or at least how it feels when it behaves.
My first application, Until Debt Do We Part, came from attending a season of weddings and noticing how people playfully predicted what would happen next—over/unders on the length of vows, whether the bouquet would be caught on the first toss, how long the officiant would talk. Combine that with the broader "bet on everything" moment we're living in, and suddenly there was a lightweight mobile-web prediction game for weddings.
A colleague of mine, Tristan, and his now-wife, Stefanie, kindly let me test the app at their wedding. People took it far more seriously than I expected. The most ambitious play came from Devan, who bet heavily against the best man, Roshan, on his reception speech and won the entire game outright. Tristan's dad came in second with well-placed bets. One wonders if he had insider information.
Knowing the stakes of building a wedding app, and having opened my mouth about it, forced Cursor (er, myself?) to produce something more than a prototype. The game itself failed during the wedding dinner, sending me scrambling back to my room for a 30-minute fight with Cursor over an emergency patch. In a way, it remained a demonstration after all.
After the wedding, the feedback was great, and I decided to create a design system for the application, a place where we could store and test function and form before elevating changes into the development environment. This approach was part of a broader attempt to rein in generative volatility in my prompt requests, and it is a strategy I will maintain for all future applications built with a generative system. You can see Until Debt's design system here.
The second application was a personal events and attendance manager, boldly named Attending. It helps manage tickets to sporting events, concerts, and other shows at the event, ticket, and attendee levels: friends can join me, or claim tickets outright if I can't make it, with alerts sent along the way. Its smaller, more straightforward scope was easier for Cursor to handle, and it carried a unique constraint: keep the codebase under 1M tokens so the whole application could fit into the context window of an agent powered by Claude's flagship models. I kept everything inside a single long-lived Cursor agent because I wanted to test whether a continuous working state could model the way a human developer accumulates context over time. If the agent could maintain coherence across a large sequence of changes, that would imply something meaningful about its ability to support ground-up application development. The main failures here were related to managing the different statuses for events and tickets. For reasons I never pinned down, Cursor repeatedly generated non-transactional write patterns for these status updates, which led to many headaches.
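The status bugs traced back to multi-item writes being issued as independent calls. Here is a minimal sketch of the transactional alternative, with hypothetical table and attribute names standing in for the app's real schema; the returned object matches the input shape of DynamoDB's TransactWriteItems, so a ticket claim and its attendance record succeed or fail together.

```typescript
// Hypothetical sketch: claim a ticket and record the attendee atomically.
// Table and attribute names are stand-ins, not the app's real schema.
type TransactItem = {
  Update?: {
    TableName: string;
    Key: Record<string, string>;
    UpdateExpression: string;
    ConditionExpression: string;
    ExpressionAttributeNames: Record<string, string>;
    ExpressionAttributeValues: Record<string, string>;
  };
  Put?: {
    TableName: string;
    Item: Record<string, string>;
    ConditionExpression: string;
  };
};

function buildClaimTicketTransaction(
  ticketId: string,
  userId: string,
): { TransactItems: TransactItem[] } {
  return {
    TransactItems: [
      {
        // Flip the ticket to "claimed" only if it is still open; the
        // condition makes two simultaneous claimants impossible.
        Update: {
          TableName: "tickets",
          Key: { ticketId },
          UpdateExpression: "SET #s = :claimed, claimedBy = :user",
          ConditionExpression: "#s = :open",
          ExpressionAttributeNames: { "#s": "status" },
          ExpressionAttributeValues: {
            ":claimed": "claimed",
            ":open": "open",
            ":user": userId,
          },
        },
      },
      {
        // Record the attendee inside the same all-or-nothing unit.
        Put: {
          TableName: "attendance",
          Item: { ticketId, userId, status: "confirmed" },
          ConditionExpression: "attribute_not_exists(ticketId)",
        },
      },
    ],
  };
}
```

Issued as two separate writes, a crash between them strands a claimed ticket with no attendance record; inside one transaction, DynamoDB rolls both back.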
Each application gave me the motivation and context to test how far AI could generate software I would meaningfully use in the workflows of my life. Before this, my experiments consisted of static sites, code snippets, and complex spreadsheet queries, use cases that have steadily improved since 2021 and are now nearly solved.
// Stack: ChatGPT for PRD, Cursor for construction
// Infrastructure: AWS (us-east-1), Twilio for alerts
// Deployment: SST, then CDK after SST failures
My approach to application design began with ChatGPT, which outlined the PRD and initial technical approach. Everything then moved into Cursor for construction. We chose AWS because setup was fast and the CLI gave Cursor a broadly accessible surface, which sped up everything from application development through deployment. With the exception of Twilio, everything lived in us-east-1 (again, Cursor's preferred selection).
It was remarkable how much Cursor was able to build in a single 60-second pass of generation once dependency updates and secrets were in place. Scaffolds, schema definitions, API layers, early frontends, flow logic: the application needed little iteration. Furthermore, an unexpected amount of the early system architecture became "locked in" because the initial pass captured the base-level concepts and operations.
The problems, however, began almost immediately after the initial build. To stabilize anything, I had to adapt my workflow away from long and specific prompts. I created a docs folder containing the API schema, data model, and a development protocol detailing how features are to be extended, validated, and subsequently deployed. Every prompt into the system started with a reference to that folder as a context-setter. Without these guardrails, the agent would reinvent the structure it had built a minute before.
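The guardrail itself was nothing more than a fixed preamble. A trivial sketch, with hypothetical file names standing in for my actual docs folder:

```typescript
// Hypothetical file names; the real docs folder held the API schema,
// data model, and development protocol described above.
const CONTEXT_DOCS = [
  "docs/api-schema.md",
  "docs/data-model.md",
  "docs/development-protocol.md",
];

// Prefix every request with the same context-setting instructions so the
// agent re-reads its own constraints before touching code.
function withContext(request: string): string {
  const header = CONTEXT_DOCS.map(
    (path) => `Read and follow ${path} before making any changes.`,
  ).join("\n");
  return `${header}\n\n${request}`;
}
```

The point is not the code but the discipline: the agent never received a raw request, only one anchored to the documents it had previously agreed to follow.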
After hundreds of cycles of fixing failures, I tried to improve my feature-extension success rate by building a self-improving loop. After each prompt, I asked Cursor to evaluate what went well, what broke, what it misunderstood, and what could be added to the documentation. The updated docs would feed into the next prompt, theoretically reducing repeat mistakes. Unfortunately, in practice, fewer than half of my requests yielded something reliable enough to elevate to the production environment.
Even with the improved success rate, there were ten- to twelve-hour days of prompting, waiting, deploying, validating, and debugging what the agent generated.
The process shifted my programming workflow. Historically, the key was to get locked in with context and tasks, what we would call flow states. In this case, the process was repeatedly fragmented, and I kept YouTube videos up to pass the time between deployments. The culture might call this "vibe coding." And if the vibe is frustration, then sure.
Cursor repeatedly selected GraphQL as the API layer, insisting it was the right approach for the application. That's where the cracks began. Cursor struggled with schema evolution, resolver logic, and safe updates. The same feature would be implemented in different ways across prompts, highlighting how next-token prediction inside a bounded context window falls short of the contextual, experiential intelligence a human developer accumulates. Index configurations changed silently. Fields appeared or vanished without corresponding DynamoDB logic updates.
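One cheap guard against that drift would have been to diff the schema's declared fields against the registered resolvers on every change. A sketch with hand-written field lists (a real version would derive them from the SDL with a parser such as graphql-js):

```typescript
// Hypothetical drift check: compare the fields a GraphQL type declares
// with the resolvers actually registered for it.
function findResolverDrift(
  schemaFields: string[],
  resolverFields: string[],
): { missingResolvers: string[]; orphanResolvers: string[] } {
  const schema = new Set(schemaFields);
  const resolvers = new Set(resolverFields);
  return {
    // Declared in the schema but never implemented.
    missingResolvers: schemaFields.filter((f) => !resolvers.has(f)),
    // Implemented but no longer in the schema (a vanished field).
    orphanResolvers: resolverFields.filter((f) => !schema.has(f)),
  };
}
```

Run as a pre-deploy assertion, a check like this would have caught the vanishing fields before they shipped instead of after they broke.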
When I asked Cursor to audit the entire application, the audits sounded competent and well-researched (consolidate schema logic, move operations into transactions, refactor resolver flows), but implementing those recommendations produced downstream failures the agent could not anticipate. It could design. It could build. It could not understand what it had built. In a way, this made sense: each prompt was effectively memoryless. There was something beautiful in it never being scarred by a mistake, the same short memory that is essential in the best athletes in the world (see Stephen Curry and in-game shooting slumps).
The data-modeling issues amplified the errors. Cursor was not only inconsistent; at times it behaved as if it didn't know why the schema existed at all. DynamoDB became an unplanned crash course for me: I learned it deeply because Cursor kept breaking it. The agent would rewrite DB operations unpredictably, change field structures, forget implicit migration histories, and produce failing index strategies. A newcomer to software wouldn't even know which layer had failed, let alone how to diagnose the failure.
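Much of the index breakage came from key construction being re-invented prompt by prompt. One mitigation, shown here with a hypothetical single-table layout rather than the app's real one, is to derive every item's keys from a single function the agent is instructed never to bypass:

```typescript
// Hypothetical single-table key layout for events and tickets.
// Centralizing key construction makes it a stable invariant instead of
// something each prompt can quietly rewrite.
type EntityKeys = { PK: string; SK: string; GSI1PK: string; GSI1SK: string };

function ticketKeys(
  eventId: string,
  ticketId: string,
  ownerId: string,
): EntityKeys {
  return {
    PK: `EVENT#${eventId}`, // group tickets under their event
    SK: `TICKET#${ticketId}`,
    GSI1PK: `USER#${ownerId}`, // index: all tickets a given user holds
    GSI1SK: `EVENT#${eventId}`,
  };
}
```

With every read and write funneled through one key builder, a silently changed index configuration shows up as a single diff rather than a scattering of mismatched string literals.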
Frontend behavior had its own invisible traps. Cursor chose to lean on localStorage for session-state management, which worked fine until it did not. Safari and iOS aggressively expire localStorage as part of Apple's privacy-first model; Intelligent Tracking Prevention can purge script-writable storage after roughly seven days without user interaction. Cursor and I did not know that. Parts of the applications silently broke and resolved only after a lot of inspection-tool debugging and console copy-pasting. These are the kinds of failure modes you recognize only if you've been around software long enough to expect them. Someone new to programming would never guess this was even a category of issue, particularly when Cursor would confidently declare it had found and fixed the root cause.
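In hindsight, a defensive wrapper would have turned the silent breakage into graceful degradation. A sketch, assuming nothing about the real session code, that treats localStorage as an unreliable cache backed by in-memory state:

```typescript
// Hypothetical wrapper: localStorage may be absent (non-browser contexts),
// throw (private-mode quotas), or have been purged by the platform, so
// every read and write also maintains an in-memory copy.
const memoryFallback = new Map<string, string>();

function readSession(key: string): string | null {
  try {
    if (typeof localStorage !== "undefined") {
      const value = localStorage.getItem(key);
      if (value !== null) return value;
    }
  } catch {
    // storage blocked; fall through to memory
  }
  return memoryFallback.get(key) ?? null;
}

function writeSession(key: string, value: string): void {
  memoryFallback.set(key, value); // always keep the in-memory copy
  try {
    if (typeof localStorage !== "undefined") {
      localStorage.setItem(key, value);
    }
  } catch {
    // storage full or blocked; the in-memory copy still serves this session
  }
}
```

The in-memory copy only survives the current page session, so the real fix for durable state is server-side persistence; the wrapper just stops a purged localStorage from presenting as a mystery crash.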
Deployment and infrastructure management were a different class of difficulty. Cursor initially chose SST, and I followed its recommendation. After a series of deployment failures, it felt necessary to move the entire application over to CDK with custom management, raising the complexity beyond what I could understand. CDK exposed far more of AWS's underlying primitives, which meant Cursor generated infrastructure with a level of explicitness I couldn't interpret or maintain. Even when deployments worked, they didn't stay working for long, despite a documented, comprehensive step-by-step plan and deployment script. Cache clearing, bundling, and dependency installation would fail right after working. The churn blew through my monthly GitHub Actions allocation in a few days. I'd get a few hours where everything worked, and then a small change would cause the whole flow to collapse. It felt like watching an engineer build a bridge and then forget what a bridge was supposed to do.
The emotional layer of working this way is something I haven't heard many people talk about. Cursor was always available (if you had tokens to spend), incredibly confident, and endlessly willing to try anything I asked. And it was deeply inconsistent. That combination is disastrous for your patience (of which mine is underdeveloped), your sense of momentum, and your mood. There were days I'd watch in awe as it produced something brilliant in one pass. Other days, I'd burn an entire night on a minor update. I do not have children, but the mix of frustration and awe felt akin to what I imagine it must be like watching a child attempt something ambitious without the skills to back it up. I canceled many social plans because I was emotionally exhausted and underslept.
Despite my curt personality, I am a natural optimist. In the exciting (for me) moments of showing these applications to friends and family, I thought about what might happen when programming agents become reliable. The people who benefit may not only be existing or aspiring software engineers. It opens the door for the domain experts who have spent decades inside their fields to build the tools they need. These are the people who know where workflows break down inside the constraints of their own systems, people, and markets. They know where their incumbent vendors support their business and where they slow them down. What they have lacked is the translation layer between domain knowledge and software implementation.
Domain expertise remains the rarest and most irreplaceable form of knowledge in any market.
If programming becomes accessible at the level of workflow expression, these experts will no longer be constrained by the bandwidth, priorities, or blind spots of technologists who sit outside their world.
The reinvention of industries through digital modernization is an ongoing effort to bridge technologists and domain experts. The former have been prized in this moment because learning to write and apply software was deemed harder than accruing domain expertise. In a way, the modernization of industries was not so much supported by technologists as hijacked by them. If generated-programming tools stabilize further, domain experts will be able to prototype directly as an alternative to vendor or consultant education, making the Build-vs-Buy discussion a far more nuanced and value-laden opportunity than it is today. The psychological limit of "I can't build software" will fall first, long before the technical limits do.
While there are valid fears of AI as a replacement rather than a complement, I do not fear AI surpassing me at programming or other skills; it is already a better programmer than I am. The important shift is the opportunity to move the center of gravity toward the practitioners, the people who built their industries before software outsiders moved in. Silicon Valley appears to follow the Armageddon (1998) playbook of teaching astronauts to drill rather than teaching drillers to become astronauts. In reality, there are not enough software engineers embedded in their respective markets to keep advancing the world's workflows; giving domain experts programming capability will be the breakthrough that finally matches the pace at which these industries aspire, and need, to evolve.
On the flip side, the recent automated cyberattack executed through misuse of Anthropic's API is a reminder that capability scales in every direction at once. The same class of tool that struggled with my deployments powered something far more sophisticated, and far more nefarious; capability reflects whoever wields it. The gaps I fought through are irrelevant in the hands of an operator who knows what they're trying to achieve.
Beyond tokens and context windows, there is a lot to be gained from better prompt processing and programming-specific model improvements, both of which will undoubtedly increase the effectiveness of the user; how much more is a question with a wide range of expectations. There are additional gains to come from compute-service vendors evolving their systems to better cooperate with agent-driven programming. What we want is better outcomes at a lower cost in energy, time, and maintenance. It's not clear to me there will be a singular breakthrough as opposed to a general maturing, though releases like Gemini 3 and its coding variants may nullify my hypothesis.
Beyond pontification, though, I know that building these applications made programming feel novel again. And while I was dragged through learning journeys for specific technologies against my will, the agent's capability and continuity showed me great promise. Inside that promise is an early outline of a future where someone with exceptional domain expertise might directly build the tools they need at the many layers of their organization and market, without waiting for a technologist or incumbent to finally prioritize their enhancement. There's a larger essay coming about that future possibility. This was the part I could see up close.