LLM Idolatrine: Why no one wants your agentic PRs
I promise this blog will have more than an engineer yelling at the clouds about LLM pain. I was actually working on a retrospective on X-Men Legends, or my experiences building an NES emulator in Rust. However, slop PRs started hitting a codebase I maintain at work. And like any healthy individual, I wish to commiserate about this with others on the internet.
I’d like to think I have some ability for introspection. Is it me that’s the problem? Am I just not with it anymore? It’s one thing to see the spam come in and go “man this sucks”. But it’s another thing to have to explain to misguided individuals why their behavior is destructive. This is my attempt at delving into the “why”. Why do these slop PRs bother me so much? Why do they feel like such a drain?
What’s in a PR?
Different people and different organizations have different standards, but my expectations for pull requests are as follows:
- The Problem: The submitter has done their due diligence in understanding the problem they are trying to solve
- The Code: The submitter has coded their solution to the best of their ability given the information that they have.
- Testing: The submitter has run end-to-end tests as needed to verify their changes work.
- Build: The build and automated checks pass.
- Security: The submitter has done their due diligence to ensure no security issues have been introduced.
Before Generative Text Excretion was the norm, pull requests acted as a sanity check for the above. They were a chance for someone else to come in and verify that the solution presented would actually address the problem at hand. They also gave the reviewer a chance to surface any new information that the submitter may not have had at the time of writing. Generally the reviewer shouldn't have to redo all the pre-work in step 1 that the submitter did; the reviewer should be able to ask the submitter any clarifying questions.
For testing, I personally work on a 1-2 strike rule. The first one or two times, assuming there's sufficient unit test coverage, nothing glaringly wrong like git merge conflict text in the code, and a passing build, I assume the submitter ran the appropriate end-to-end tests. If there's a pattern of bugs coming in from a submitter's changes, then we revisit that assumption and start pulling down the branch and testing. At that point, in a professional org, we're also having conversations with the submitter about good testing practices and, if needed, with their manager (but ideally you assume good faith and start with the engineer).
For security, in my opinion that’s a responsibility of both the submitter and the reviewer.
If the size of the PR is massive, you schedule some time with the submitter and pair on it together, or you request that the PR be split up into smaller pieces of work. Otherwise, if the assumptions above hold, pull requests don't take that much time. You go in, leave your thoughts, and either approve or request changes.
But, what happens when the above doesn’t hold?
Pure, unmitigated, waste
Let’s take an eager employee. Their C-suite preaches from the pulpit that they need to embrace AI because it is the future. There’s a race to the top, and only those that have mastered the AI tooling will make it. Caution thrown to the wind, it’s a gold rush, baby! This is the age of everyone and everything coding!
This eager employee has their hammer and needs some nails, lest they get left behind! They see a bug backlog and go “what if instead of engineers doing this we let the AI do it? Bugs are a detriment to roadmaps and we can be 10x productive!” So the employee wires up Copilot, Cursor, pick your code poison, to your ticketing system and lets it rip. The agents go, ticket by ticket, and make a PR for each one. I said above that PRs shouldn’t take much time, so these should get approved and make a large dent in the backlog, right?
Well, no. They won’t get reviewed quickly. Because the assumptions no longer hold.
A quick detour on LLM context
Before I go further I should elaborate on LLM context. A lot of modern LLM techniques revolve around context management. For those that aren’t that familiar with how LLMs and Gen AI work, the easiest way to think of them is as functions:
f(text input) -> text output
Given an input, generate the output that’s statistically most likely to come after that input. That’s it. There’s no magical thinking going on inside an LLM; it’s just that function above.1 Any of the “thinking” you see from an LLM is code that executes the function above to generate a plan, then executes that function repeatedly across the plan. It’s more function composition than true thinking.
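To make that concrete, here's a toy sketch in Python. The `llm` function below is a made-up stand-in, not any real model or API; the point is just that "plan, then execute" is the same text-in, text-out function called more than once:

```python
def llm(prompt: str) -> str:
    """Toy stand-in for a model: the whole 'model' is one text -> text function."""
    if prompt.startswith("Make a plan:"):
        return "step 1; step 2"
    return f"output given {len(prompt)} chars of input"

def think(task: str) -> str:
    # "Reasoning" is just composing that one function with itself:
    plan = llm("Make a plan: " + task)          # first call produces a plan
    return llm(task + " using plan: " + plan)   # second call acts on the plan

result = think("fix the login bug")
```

There is no hidden state between the two calls; everything the second call "knows" is whatever text was physically concatenated into its prompt.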
Even “chat histories” are a misnomer. Let’s say you have the following conversation with an LLM:
You: Tell me a joke!
LLM: Why did the chicken cross the road? To get to the other side!
You: Replace chicken with turkey
LLM: Why did the turkey cross the road? To get to the other side!
At first glance, seeing that turkey message, you might think “wow, it remembered our conversation!” But the processing actually looked like this:
You: Tell me a joke!
LLM: f("You: Tell me a joke!") -> Why did the chicken cross the road? To get to the other side!
You: Replace chicken with turkey
LLM: f("You: Tell me a joke!
\nLLM: Why did the chicken cross the road? To get to the other side!
\nYou: Replace chicken with turkey") -> Why did the turkey cross the road? To get to the other side!
It doesn’t actually “remember” anything; it’s just passing the entire conversation to that function as you go. It’s why for tools like Cursor you’ll notice a text limit per chat window. There’s only so much of the chat it can physically pass to the model before it can’t support it anymore. If you navigate to a new chat window, it’s not going to remember previous conversations.2
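In code, the “chat” is just string accumulation. A minimal sketch, where `toy_llm` is invented purely for illustration (real models are obviously not keyword lookups like this):

```python
def chat_turn(history: list[str], user_msg: str, model) -> str:
    """Each turn re-sends the ENTIRE transcript; nothing is 'remembered'."""
    history.append(f"You: {user_msg}")
    reply = model("\n".join(history))  # the whole conversation goes in, every time
    history.append(f"LLM: {reply}")
    return reply

def toy_llm(transcript: str) -> str:
    # Invented stand-in: rewrites its earlier joke if asked to swap the bird.
    if "Replace chicken with turkey" in transcript:
        return "Why did the turkey cross the road? To get to the other side!"
    return "Why did the chicken cross the road? To get to the other side!"

history: list[str] = []
chat_turn(history, "Tell me a joke!", toy_llm)
reply = chat_turn(history, "Replace chicken with turkey", toy_llm)
```

The second turn only works because the first exchange is physically present in the string handed to `toy_llm`; clear `history` and the “memory” is gone.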
So that means you can just fill the context to the limit and get the smartest answer possible, right? Nope! Go too far and you get what’s called context rot: you throw in so much information that it becomes less useful overall and you get some really bad results. So to make effective use of LLMs you have to give them a decent amount of context, but not so much that they go off the rails. And because of their non-deterministic nature, it’s a guessing game how much is too much and how little is too little.
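This is why tools end up doing crude transcript management. A minimal sketch of one common coping strategy, dropping the oldest turns until the transcript fits a budget (the character budget here is made up; real tools count tokens, not characters):

```python
def trim_history(history: list[str], budget: int) -> list[str]:
    """Drop the oldest turns until the joined transcript fits the budget."""
    kept = list(history)
    while kept and len("\n".join(kept)) > budget:
        kept.pop(0)  # oldest turn is evicted first
    return kept

turns = [f"turn {i}: " + "x" * 50 for i in range(10)]
trimmed = trim_history(turns, budget=200)
```

Which also illustrates the trade-off: trim too aggressively and the model loses context it actually needed; trim too little and you are back in context-rot territory.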
I state all of the above so that I can confidently claim
LLMs can’t understand the problem
For our Copilot Jira shit factory, let’s go back to The Problem expectation:
The Problem: The submitter has done their due diligence in understanding the problem they are trying to solve
Given that LLMs function best on limited context scopes, do you trust that they will do their due diligence to understand the problem they are trying to solve? Are they going to speak to customers, other domain teams, and other stakeholders to gather the correct requirements? If they encounter a bug that’s the result of an important change, are they going to escalate and be willing to make the trade-off to drop it?
No, they’re just going to analyze what’s in the ticket, maybe a couple of surrounding PRs, and go for it. From what I’ve seen, they’ll invent a fix even when one isn’t needed. I’ve seen them add columns to CSVs without adjusting the headers, and adjust query class parameters without adjusting the callers or frontend to compensate or use the new parameters.
That’s all also assuming those JIRA tickets are well formed and correctly identify the problem. A lot of the time, bug tickets are “customer says this isn’t working,” leaving it to the engineer to figure out what “not working” means. I’ve seen LLMs try to invent a fix for something as vague as “Your app sucks.”
So if an LLM isn’t going to do this and will just defecate out code, whose job is it to do that due diligence? Now it falls on the reviewer. You can’t ask an LLM why it did something, because it doesn’t know. Remember, it’s just a function of input -> output. There’s no memory of decisions; it’ll just make up a justification when questioned. It’s why you get those great convos that go like
You: You just wiped production when I told you not to!
LLM: You're totally right! I'm sorry! I messed up
In my opinion, the problem step is the second most critical part of feature development (right next to security). You need to ask why you are doing something and be able to defend it.
LLMs can’t end to end test
- Testing: The submitter has run end-to-end tests as needed to verify their changes work.
There’s zero chance that an agentic workflow is going to test this end to end in a manner anyone on pager duty would be happy with. It can run unit tests. It may be able to write some cursory Selenium/Playwright and execute it3. But it can’t interrogate a lot of the things a human engineer would, such as correct layout, behavior at different sizes, etc. A lot of these agents will run unit tests and call it a day, when most engineers know that most bugs appear in the integration of different units rather than in the individual units themselves.
Sure, you can make an organizational push to write more E2E tests, but there’s a reason these sit at the top of the testing pyramid: they’re slow, brittle, and difficult to maintain. Builds would take forever if everything were an E2E test.
What’s left
We’re left with
- The Code: The submitter has coded their solution to the best of their ability given the information that they have.
- Build: The build and automated checks pass.
- Security: The submitter has done their due diligence to ensure no security issues have been introduced.
I don’t know how well new models are trained on security best practices. From my personal experience, a lot of code completions make some bonkers changes that I’d never let into production. But I don’t explicitly poke at security stuff with LLMs, nor let them autonomously code anything remotely touching user input.
I will also say that, depending on what you’re pointing Cursor/Copilot/Claude agents at, you should make sure the input is sanitized. I don’t know how safe it is for agents to be making PRs based on external statements.
Builds are meant to be run by automation, and most LLMs are capable of running automated tests as part of their workflow.
And then we have the code. Which, sure, it’s a feat to have something. But I argue that without the problem context and testing, it’s absolutely worthless. You’re basically saying to someone, “I have no clue what this does or why we’re doing it, but I want it to hit production. Make it work for me.” Why would anyone in their right mind allow a change like that to go through? So you need the problem context and the testing to make this change worthwhile. It has to be done.
But if the LLM isn’t going to do it, who is? Well, that leaves the reviewer. Instead of a cooperative pairing experience, we’ve bastardized it to shove all the responsibility onto the reviewer to do the due diligence and ensure the change is worth merging. We’re not saving time; we’re just shifting the responsibility further along the chain and introducing a single point of failure rather than two people validating a change.
If you’re in a corporate environment, it’s miserable to have to push back against this sewage. For an open source project? You’re up shit creek without a paddle. I can’t imagine trying to keep the boat afloat in all of this waste.
What do we do
Candidly, I wish I knew. For corporate environments it depends on how much influence you have and how much pain C-Suites are willing to inflict on their employees and customers. I can’t speak much to Open Source as I don’t have experience managing OSS projects.
What you should not do is throw more AI at the problem, like I see orgs doing. Having some LLM bleat out bile doesn’t make things go faster here. All the problems I outlined about LLMs cutting PRs also apply to LLMs reviewing PRs: they lack the context to make good decisions and occasionally invent problems that don’t exist.
1. That isn’t to say it isn’t a massively complex function with a ton of math and statistics powering it that requires massive computation. At the end of the day, though, it’s still “given an input, give me an output.” ↩︎
2. I think ChatGPT does some fuckery where it passes cursory summaries of previous chats to pretend it remembers, but it’s not the same thing. ↩︎
3. Credit where credit’s due, this is impressive. I experimented once with a vague prompt and got Claude Code to execute an end-to-end test. It took 20 minutes with a lot of handholding. Absolutely worthless in any capacity, but still an impressive feat of engineering. ↩︎