wg0 21 hours ago [-]
I write detailed specs. Multifile with example code. In markdown.
Then hand over to Claude Sonnet.
Even with hard requirements listed, I found that the generated code missed requirements, contained duplicate code, and even did unnecessary data wrangling (mapping objects into new objects of narrower types that were never needed), along with tests that faked results and worked around failures just to pass.
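To give a flavor of the unnecessary wrangling (a contrived Python sketch, not my actual spec; `User` and `UserSummary` are invented names):

```python
from dataclasses import dataclass

@dataclass
class User:
    id: int
    name: str
    email: str

@dataclass
class UserSummary:
    # Narrower type the model invents even though every caller
    # could just read .id and .name off User directly.
    id: int
    name: str

def summaries(users: list[User]) -> list[UserSummary]:
    # Pure wrangling: an extra allocation and a second type to
    # maintain, with no requirement asking for it.
    return [UserSummary(id=u.id, name=u.name) for u in users]
```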
So it turns out that I'm not writing code; I'm reading lots of code.
One thing I knew first hand before Gen AI: writing code is the easy part. It is reading code, understanding it, and building a mental model of it that is far more labour intensive.
Therefore I need more time and effort with Gen AI than I needed before, because I have to read a lot of code, understand it, and ensure it adheres to the mental model I have.
Hence, at the price point Anthropic offers, Gen AI is a net negative for me. I am not vibe coding; I'm building real software that real humans depend upon, and my users deserve better attention and focus from me. So I'll be cancelling my subscription shortly.
gwerbin 21 hours ago [-]
Or just don't use AI to write code. Use it as a code reviewer assistant along with your usual test-lint development cycle. Use it to help evaluate 3rd party libraries faster. Use it to research new topics. Use it to help draft RFCs and design documents. Use it as a chat buddy when working on hard problems.
I think the AI companies all stink to high heaven and the whole thing being built on copyright infringement still makes me squirm. But the latest models are stupidly smart in some cases. It's starting to feel like I really do have a sci-fi AI assistant that I can just reach for whenever I need it, either to support hard thinking or to speed up or entirely avoid drudgery and toil.
You don't have to buy into the stupid vibecoding hype to get productivity value out of the technology.
You of course don't have to use it at all. And you don't owe your money to any particular company. Heck for non-code tasks the local-capable models are great. But you can't just look at vibecoding and dismiss the entire category of technology.
onlyrealcuzzo 20 hours ago [-]
> Or just don't use AI to write code.
Anecdata, but I'm still finding CC to be absolutely outstanding at writing code.
It's regularly writing systems-level code that would take me months to write by hand in hours, with minimal babysitting, basically no "specs" - just giving it coherent sane direction: like making sure it tests things in several different ways, for several different cases, including performance, and comparing directly to similar implementations (and constantly triple-checking that it actually did what you asked after it said "done").
For $200/mo, I can still run 2-3 clients almost 24/7 pumping out features. I rarely clear my session. I haven't noticed quality declines.
Though, I will say, one random day - I'm not sure if it was dumb luck - or if I was in a test group, CC was literally doing 10x the amount of work / speed that it typically does. I guess strange things are bound to happen if you use it enough?
Related anecdata: IME, there has been a MASSIVE decline in the quality of claude.ai (the chatbot interface). It is so different recently. It feels like a wannabe, crappier version of ChatGPT, instead of what it used to be: something that tried to be factual and useful rather than conversational, addictive, and sycophantic.
mlinsey 20 hours ago [-]
My anecdata is that it heavily depends on how much of the relevant code and instructions it can fit in the context window.
A small app, or a task that touches one clear smaller subsection of a larger codebase, or a refactor that applies the same pattern independently to many different spots in a large codebase - the coding agents do extremely well, better than the median engineer I think.
Basically "do something really hard on this one section of code, whose contract of how it interacts with other code is clear, documented, and respected" is an ideal case for these tools.
As soon as the codebase is large and there are gotchas, edge cases where one area of the code affects the other, or old requirements - things get treacherous. It will forget something was implemented somewhere else and write a duplicate version, it will hallucinate what the API shapes are, it will assume how a data field is used downstream based on its name and write something incorrect.
IMO you can still work around this and move net-faster, especially with good test coverage, but you certainly have to pay attention. Larger codebases also work better when you started them with CC from the beginning, because then its older code is more likely to actually work the way it expects/hallucinates.
onlyrealcuzzo 19 hours ago [-]
> My anecdata is that it heavily depends on how much of the relevant code and instructions it can fit in the context window.
Agreed, but I'm working on something >100k lines of code total (a new language and a runtime).
It helps when you can implement new things as if they're green-field-ish AND THEN integrate and plumb them in later.
antonvs 10 hours ago [-]
In a well-designed system, you can point an agent at a module of that system and it's perfectly capable of dealing with it. Humans also have a limited context window, and divide and conquer is always how we've dealt with it. The same approach works for agents.
janalsncm 19 hours ago [-]
How can a person reconcile this comment with the one at the root of this thread? One person says Claude struggles to even meet the strict requirements of a spec sheet, another says Claude is doing a great job and doesn’t even need specific specs?
I have my own anecdata but my comment is more about the dissonance here.
oefrha 14 hours ago [-]
One aspect you have to consider is the differences among the human beings doing the evaluation. I had a coworker/report who would hand me obvious garbage-tier code with glaring issues even in its output, and it would take multiple iterations to address very specific review comments. (Once, in frustration, I showed a snippet of their output to my nontechnical mom, and even my mom wtf'ed and pointed out the problem unprompted.) I'm sure all the AI-generated code I painstakingly spec, review, and fix is totally amazing to them and needs very little human input. Not saying that must be the case here; that was extreme. But it's a very likely factor.
rhubarbtree 7 hours ago [-]
This is plausible. Assuming it’s true, we would see the adoption of vibe coding at a faster rate amongst inexperienced developers. I think that’s true.
A counterpoint is Google saying the vast majority of their code is written by AI. The developers at Google are not inexperienced. They build complex critical systems.
But it still feels odd to me, this contradiction. Yes there’s some skill to using AI but that doesn’t feel enough to explain the gap in perception. Your point would really explain it wonderfully well, but it’s contradicted by pronouncements by major companies.
One thing I would add is that code quality is absolutely tanking. PG mentioned that YC companies adopted AI-generated code at Google levels years ago. Yesterday I was using the software of one such company and it has "Claude Code" levels of bugginess. I see it in a bunch of startups. One of the tells is that they seem to experience regressions, which is bizarre. I guess that indicates bugs in their AI-generated tests.
SpaceNoodled 30 minutes ago [-]
You don't think Sundar would do that, just go on the Internet and tell lies?
tclancy 12 hours ago [-]
This is magical because you are both on the exact right path and not right. My theory is that there's a sort of skill to teasing code from AI (or maybe not, and it's alchemy all over again), and this is all new enough, and we so lack a common vocabulary for it, that it's hard for one person who is having a good experience and one who is not to meaningfully sort out what they are doing differently.
Alternatively, it could be that there's a large swath of people out there so stupid that they are proud of code which a nontechnical mom can somehow review and suggest improvements to.
DennisP 13 hours ago [-]
I just read Steve Yegge's book Vibe Coding, and he says learning to use AI effectively is a skill of its own, and takes about a year of solid work to get good at it. It will sometimes do a good job and other times make a mess, and he has a lot of tips on how to get good results, but also says a lot of it is just experience and getting a good feel for when it's about to go haywire.
sarchertech 18 hours ago [-]
One person is rigorously checking to see if Claude is actually following the spec and one person isn’t?
hunterpayne 17 hours ago [-]
One is getting paid by a marketing department's program and the other isn't. Remember how much has been spent making LLMs; the vendors have now decided that coding is their moneymaker. I expect any negative comment on LLM coding to be replied to by at least 2 different puppets or bots.
riquito 17 hours ago [-]
Then you should expect any positive comment to be replied to negatively by a competitor's puppet or bot too.
SpaceNoodled 27 minutes ago [-]
Not necessarily; rising tide and all that. When a new scam like this emerges, it behooves all of the grifters to cooperate and not muddy the waters with distrust.
flyinglizard 17 hours ago [-]
... or one person has a very strong mental model of what he expects to do, but the LLM has other ideas. FWIW I'm very happy with CC and Opus, but I don't treat it as a subordinate but as a peer; I leave it enough room to express what it thinks is best and guide later as needed. This may not work for all cases.
sarchertech 17 hours ago [-]
If you don't have a very strong mental model of what you are working on, Claude can very easily guide you into building the wrong thing.
For example I’m working on a huge data migration right now. The data has to be migrated correctly. If there are any issues I want to fail fast and loud.
Claude hates that philosophy. No matter how many different ways I add my reasoning, and instructions to stop, to the context, it will constantly push me towards removing crashes and replacing them with “graceful error handling”.
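To illustrate the clash (a contrived sketch, not my actual migration code; the field names are invented):

```python
def migrate_row(row: dict) -> dict:
    # Fail fast and loud: a malformed row should crash the
    # migration immediately, not be papered over with a default.
    if "account_id" not in row:
        raise ValueError(f"row {row.get('id')!r} is missing account_id")
    return {"account": int(row["account_id"]), "legacy_id": row["id"]}

# The "graceful" version the model keeps pushing instead -- it logs,
# returns None, and silently drops data:
#
#     try:
#         return {"account": int(row["account_id"]), "legacy_id": row["id"]}
#     except Exception:
#         logger.warning("skipping bad row")
#         return None
```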
If I didn’t have a strong idea about what I wanted, I would have let it talk me into building the wrong thing.
Claude has no taste and its opinions are mostly those of the most prolific bloggers. Treating Claude like a peer is a terrible idea unless you are very inexperienced. And even then I don’t know if that’s a good idea.
timr 14 hours ago [-]
> Claude has no taste and its opinions are mostly those of the most prolific bloggers.
I often think that LLMs are like a reddit that can talk. The more I use them, the more I find this impression to be true - they have encyclopedic knowledge at a superficial level, the approximate judgement and maturity of a teenager, and the short-term memory of a parakeet. If I ask for something, I get the statistical average opinion of a bunch of goons, unconstrained by context or common sense or taste.
That’s amazing and incredible, and probably more knowledgeable than the median person, but would you outsource your thinking to reddit? If not, then why would you do it with an LLM?
prmph 6 hours ago [-]
> they have encyclopedic knowledge at a superficial level, the approximate judgement and maturity of a teenager, and the short-term memory of a parakeet. If I ask for something, I get the statistical average opinion of a bunch of goons, unconstrained by context or common sense or taste.
Love this paragraph; it's exactly how I feel about LLMs. Unless you really know what you are doing, they will produce very sub-optimal code, architecturally speaking. I feel like strong acumen for proper software architecture is one of the main things that defines the most competent engineers, along with naming things properly. LLMs are a long, long way from having architectural taste.
flyinglizard 5 hours ago [-]
Try asking it to review your code as if it were Linus Torvalds. No, really.
datavirtue 4 hours ago [-]
Holding it wrong.
oops 16 hours ago [-]
That’s interesting to hear, as for me Claude has been quite good about writing code that fails fast and loud, and it has specifically called this out more than once. It has also flagged code that does not fail early in reviews.
justinclift 10 hours ago [-]
> it will constantly push me towards removing crashes and replacing them with “graceful error handling”.
Is it generating JS code for that?
aforwardslash 14 hours ago [-]
Have you created a plan where the requirement is not to bother you with x and y, and to use some predetermined approach? What you describe sometimes happens to me too, but it happens less when it's part of the spec.
flyinglizard 5 hours ago [-]
You're right, data migration is a specific case where you have a very strong set of constraints.
I, on the other hand, am doing a new UI for an existing system, which is exactly where you want more freedom and experimentation. It's great for that!
mojuba 8 hours ago [-]
I think it depends on both the complexity and the quality bars set by the engineer.
From my observations, generally AI-generated code is average quality.
Even with average quality it can save you a lot of time on some narrowly specialized tasks that would otherwise take a lot of research and understanding. For example, you can code some deep DSP thingie (say audio) without understanding much of what it does or how.
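For example, the kind of specialized snippet I mean (a toy single-pole IIR low-pass filter; a real audio job would need proper filter design, this is just the shape of it):

```python
import math

def one_pole_lowpass(samples, cutoff_hz, sample_rate):
    # Single-pole IIR low-pass: y[n] = y[n-1] + a * (x[n] - y[n-1]),
    # with the coefficient a derived from cutoff frequency and rate.
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    y, out = 0.0, []
    for x in samples:
        y += a * (x - y)
        out.append(y)
    return out
```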
For simpler things like backend or frontend code that doesn't require any special knowledge beyond basic backend or frontend skills, this is where the bars of quality come into play. Some people will be more than happy with AI-generated code and others won't be, depending on their experience and also their requirements (speed of shipping vs. quality, which almost always resolves to speed).
justinclift 10 hours ago [-]
Note that one person is mentioning they use Claude Sonnet, which is less capable than the higher tiers (Opus, etc).
sameerds 12 hours ago [-]
It could just be that each of the two reviewers is merely focusing on different sides of the same coin? I use Claude all the time. It saves me a lot of effort that I would have otherwise spent looking up specific components. The magically autocompleted pieces of boilerplate are a tangible relief. It also catches issues that I missed. But when it is wrong, it can be subtly, embarrassingly, or spectacularly wrong depending on the situation.
aforwardslash 14 hours ago [-]
It boils down to scope. I use CC in both very specific one-language systems and broad backend-frontend-db-cache systems. You can guess where the difficulty lies. (Hint: it's the stuff with at least 3 distinct languages.)
ghurtado 20 hours ago [-]
> basically no "specs" - just giving it coherent sane direction
This is one variable I almost always see in this discussion: the more strict the rules that you give the LLM, the more likely it is to deeply disappoint you
The earlier in the process you use it (ie: scaffolding) the more mileage you will get out of it
It's about accepting fallibility and working with it, rather than trying to polish it away with care
phatskat 19 hours ago [-]
To me this still feels like it would be a net negative. I can scaffold most any project with a language/stack specific CLI command or even just checking out a repo.
And sure, AI could “scaffold” further into controllers and views and maybe even some models, and they'd probably work OK. It's when they don't, or when I need something tweaked, that the worry becomes: “do I really understand what's going on under the hood? Is the time to understand it worth it? Am I going to run across a small thread that I end up pulling until my 80%-done sweater is 95% loose yarn?”
To me the trade-off hasn't proven worth it yet. Maybe for a personal pet project, and even then I don't like the idea of letting something else nondeterministically touch my system. “But use a VM!” they say, but that's more overhead than I care for. Just researching the safest way to bootstrap this feels like more effort than value to me.
Lastly, I think that a big part of why I like programming is that I like the act of writing code, understanding how it works, and building something I _know_.
michaelmrose 15 hours ago [-]
A lot of the benefit of scaffolding is building basic context, which you can also build by feeding it the files produced by whatever CLI tool and talking through them, forcing it to "think", for lack of a better word, about your design. You can also force-feed it design and API documentation. If you think you have given it too much, you are almost certainly wrong.
If it's doing nonsensical things with a library, feed it the documentation; if it's still busted, make it read the source.
prmph 19 hours ago [-]
But, how do you know the code is good?
If you do spot checks, that is woefully inadequate. I have lost count of the number of times when, poring over code a SOTA LLM has produced, I noticed a lot of subtle but major issues (and many glaring ones as well), issues a cursory look is unlikely to pick up on. And if you are spending more time going over the code, how is that the massive speed improvement you make it seem?
And what do you even mean by 10x the amount of work? I keep saying that anybody who starts to spout these sorts of anecdotes absolutely does NOT understand real-world, production-level, serious software engineering.
Is the model doing 10x the amount of simplification, refactoring, and code pruning an effective senior level software engineer and architect would do? Is it doing 10x the detailed and agonizing architectural (re)work that a strong developer with honed architectural instincts would do?
And if you tell me it's all about accepting the LLM being in the driver's seat and embracing vibe coding: it absolutely does NOT work for anything exceeding a moderate level of complexity. I tried that several times. Up to now, no model has been able to write a simple markdown viewer with certain specific features I have wanted for a long time. I really doubt the stories people tell about creating whole compilers with vibe coding.
If all you see and appreciate is that it's pumping out 10x the features, 10x more code, you are missing the whole point. In my experience you are actually producing a ton of sh*t, sorry.
datavirtue 4 hours ago [-]
Way better than the random India dev output. I seriously don't know what everyone around here is doing. All I see are complaints while I produce the output of ten devs. Clean code, solid design.
Spend a few hours writing context files. Spend the rest of the week sipping bourbon.
sarchertech 2 hours ago [-]
So what have you released?
10x means you could have built something that would have taken 4 or 5 years in the time you've had since Opus 4.5 came out.
Where's your operating system, game engine, new programming language, or complex SaaS app?
hirvi74 18 hours ago [-]
> But, how do you know the code is good?
Honestly, this more of a question about scope of the application and the potential threat vectors.
If the GP is creating software that will never leave their machine(s) and is for personal usage only, I'd argue the code quality likely doesn't matter. If it's some enterprise production software that hundreds to millions of users depend on, software that manages sensitive data, etc., then I would argue code quality should asymptotically approach perfection.
However, I have many moons of programming under my belt, and I would honestly say that I am not sure what good code even is. Good to whom? Good for what? Good how?
I truly believe that most competent developers (however one defines competent) would be utterly appalled at the quality of the human-written code on some of the services they frequently use.
I apply the Herbie Hancock philosophy when defining good code. When once asked what is Jazz music, Herbie responded with, "I can't describe it in words, but I know it when I hear it."
sarchertech 17 hours ago [-]
> I apply the Herbie Hancock philosophy when defining good code. When once asked what is Jazz music, Herbie responded with, "I can't describe it in words, but I know it when I hear it."
That’s the problem. If we had an objective measure of good code, we could just use that instead of code reviews, style guides, and all the other things we do to maintain code quality.
> I truly believe that most competent developers (however one defines competent) would be utterly appalled at the quality of the human-written code on some of the services they frequently use.
Not if you have more than a few years of experience.
But what your point is missing is the reason that software keeps working in the first place, or stays in a good enough state that development doesn’t grind to a halt.
There are people working on those code bases who are constantly at war with the crappy code. At every place I’ve worked over my career, there have been people quietly and not so quietly chipping away at the horrors. My concern is that with AI those people will be overwhelmed.
They can use AI too, but in my experience, the tactical tornadoes get more of a speed boost than the people who care about maintainability.
hirvi74 11 hours ago [-]
I had a long reply to your comment, then decided it was not truly worth reading. However, I do have one question remaining:
> the tactical tornadoes get more of a speed boost than the people who care about maintainability.
Why are these not the same people? In my job, I am handed a shovel. Whatever grave I dig, I must lie in. Is that not common? Seriously, I am not being facetious. I've had the same job for almost a decade.
sarchertech 5 hours ago [-]
That’s because you’ve been there a decade. It’s very common for people to skip jobs every 2 years so that they never end up seeing the long term consequences of their actions.
The other common pattern I’ve seen goes something like this.
Product asks Tactical Tornado if he can build something. TT says sure, it will take 6 weeks. TT doesn’t push back or ask questions; he builds exactly what product asked for in an enormous feature branch.
At the end of 6 weeks he tries to merge it and he gets pushback from one or more of the maintainability people.
Then he tells management that he’s being blocked. The feature is already done and it works. Also, the concerns other engineers raised “can’t be addressed because those are product requirements”. He’ll revisit it later to improve on it. He never does, because he’s on to the next feature.
Here’s the thing. A good engineer would have worked with product to tweak the feature up front so that it’s maintainable, performant etc…
This guy uses product requirements (many that aren’t actually requirements) and deadlines to shove his slop through.
At some companies management will catch on and he’ll get pushed out. At other companies he’ll be praised as a high performer for years.
Peritract 4 hours ago [-]
> It's regularly writing systems-level code that would take me months to write by hand in hours, with minimal babysitting
Has your output kept pace with the code? Because "months in hours" means, even pushing those ratios quite far, years of work in days.
Has your roadmap accelerated multiple years in the last few months in terms of verifiable results?
sameerds 12 hours ago [-]
> I can still run 2-3 clients almost 24/7 pumping out features.
Honest question: how does one do that? My workflow is to create one git worktree per feature and start one session per worktree. And then I spend two hours in a worktree talking to Opus and reviewing what it is doing.
kobe_bryant 18 hours ago [-]
months you say? how incredible. it beggars belief in fact
hirvi74 18 hours ago [-]
Not sure about ChatGPT, but Claude was (is still?) an absolute ripper at cracking some software if one has even a little bit of experience/low-level knowledge. At least, that's what my friend told me... I would personally never ever violate any software ToS.
buredoranna 20 hours ago [-]
> the whole thing being built on copyright infringement
I am not a lawyer, but am generally familiar with two "is it fair use" tests.
1. Is it transformative?
I take a picture, I own the copyright. You can't sell it. But if you take a copy, and literally chop it to pieces, reforming it into a collage, you can sell that.
2. Does the alleged infringing work devalue the original?
If I have a conversation with AI about "The Lord of the Rings", even if it reproduces good chunks of the original, it does not devalue the original... in fact, I would argue, it enhances it.
Have I failed to take into account additional arguments and/or scenarios? Probably.
But, in my opinion, AI passes these tests. AI output is transformative, and in general, does not devalue the original.
taikahessu 19 hours ago [-]
In order for an LLM to be useful, you need to copy and steal all of the work. Yes, you can argue you don't need the whole work, but that's what they took and fed in.
And they are making money off of other people's work. Sure, you can use mental jiujutsu to make it fair use. But fair use for LLMs means you basically copy the whole thing. All of it. It sounds more like a total use to me.
I hope the free market and technology catches up and destroys the VC backed machinery. But only time will tell.
ragequittah 19 hours ago [-]
I always wonder if anyone out there thinks they're not making money off of other people's work. If you're coding, writing a fantasy novel, taking a photograph, or drawing a picture from first principles you came up with yourself, I applaud you though.
taikahessu 18 hours ago [-]
You are absolutely right.
Seriously though, I do think that is the case. It would be self-righteous to argue otherwise. It's just the scale and the nature of this that makes it so repulsive. For my taste, copying something without permission is stealing. I don't care what a judge somewhere thinks of it. Using someone's goodwill for profit is disgusting. And I hope we all get to profit from it someday, not just a select few. But that is just my opinion.
IcyWindows 15 hours ago [-]
This kind of thinking seems like a road to people having to pay a license for the rest of their lives, after going to school, for the knowledge they "stole" from their textbooks.
taikahessu 13 hours ago [-]
Except the school paid royalties for that specific book. Every book. The money was distributed. Writers, publishers and so on. The normal stuff.
Or if you had to buy the book yourself, same thing, distributed, royalties paid.
IcyWindows 13 hours ago [-]
So your complaint is that they didn't pay for training data by buying every book found online?
That does seem more reasonable, but it makes public libraries evil too.
taikahessu 4 hours ago [-]
Except that libraries pay for the books, they only serve a dedicated local region of people, and when you borrow a book, you know who its author is.
For LLMs the transformative part is then removing the copyright info and serving it to you as OpenAI whatever.
Sure, you can query multiple books at the same time and the technology is godlike. But the underlying issue remains: without the original content, the LLM is useless. Someone took all the books, fed them in, and didn't pay anything back to the authors.
I'm not sure whether you're arguing in good faith here. This information you could easily check for yourself. The problem is not the information itself. It's the massive machinery that steals all the works, until one day we are staring at a paywall. And the artists are still not funded. I'd rather just do something nice offline in the future.
ragequittah 9 hours ago [-]
I understand, but I think this will seem quite a quaint idea soon, in all honesty. Imagine these things are able to advance the world of science, math, physics, and whatever else (they already are), and we stopped them because someone didn't make enough royalties first. That to me would be more repulsive: we stop/slow the progress of all humanity because there wasn't enough temporary gain for individual x who wrote book y. And if it all turns out to be bogus nonsense, then I doubt individual x who wrote book y loses much in the process anyway.
taikahessu 4 hours ago [-]
Yeah, it's not an easy puzzle piece. How far are we going to go in the name of science and progress again? Are you buying it, that it's all for the greater good? Quite a lot of money involved here. Everyone wants a piece of it. But I digress. Dropping the big bomb, stealing the lands and riches of the natives, using slaves and colonies to power the whole civilization into a new era might be powerful and efficient. But it doesn't make it right. I don't buy the narrative. Do no evil until you can no longer say no?
jjwiseman 19 hours ago [-]
And in Bartz v. Anthropic, the court found that Anthropic training their LLMs on books was "highly transformative."
verve_rat 10 hours ago [-]
The US is not the only legal jurisdiction these services are being sold in.
Madmallard 17 hours ago [-]
What in the mental gymnastics?
They just stole everyone's hard work over decades to make this or it wouldn't have been useful at all.
NewsaHackO 14 hours ago [-]
That's a statement. The comment you are replying to had actual reasoning behind its claim. Do you have any actual reasoning behind yours?
Madmallard 3 hours ago [-]
Let's not ignore the entirety of reality and what has been going on for the last few years to defend a pestilence on mankind you probably have stock invested in. I'm not going to acknowledge how insane the argument you're making is. It's as if you've heard of zero leaks, zero lawsuits, zero open-source complaints. Zero anything. Just either intentionally or unintentionally astroturfing.
Thanks.
idiotsecant 18 hours ago [-]
This is a tiresome and well trod road.
The fact of the matter is that for profit corporations consumed the sum knowledge of mankind with the intent to make money on it by encoding it into a larger and better organized corpus of knowledge. They cited no sources and paid no fees (to any regular humans, at least).
They are making enormous sums of money (and burning even more, ironically) doing this.
If that doesn't violate copyright, it violates some basic principle of decency.
michaelmrose 15 hours ago [-]
You are assuming intellectual property has an intrinsic basis when it's at best functional, not foundational. It's only useful if the net value to society is positive, which is extremely dubious.
idiotsecant 10 hours ago [-]
I'm assuming human creativity has intrinsic value, or what's the point of being human?
Aurornis 21 hours ago [-]
Writing detailed specs and then giving them to an AI is not the optimal way to work with AI.
That's vibecoding with an extra documentation step.
Also, Sonnet is not the model you'd want to use if you want to minimize cleanup. Use the best available model at the time if you want to attempt this, but even those won't vibecode everything perfectly for you. This is the reality of AI, but at least try to use the right model for the job.
> Therefore I need more time and effort with Gen AI than I needed before
Stop trying to use it as all-or-nothing. You can still make the decisions, call the shots, write code where AI doesn't help and then use AI to speed up parts where it does help.
That's how most non-junior engineers settle into using AI.
Ignore all of the LinkedIn and social media hype about prompting apps into existence.
EDIT: Replaced a reference to Opus and GPT-5.5 with "best available model at the time" because it was drawing a lot of low-effort arguments
wg0 21 hours ago [-]
> Writing detailed specs and then giving them to an AI is not the optimal way to work with AI.
It is NOT the way to work with humans, basically because most software engineers I worked with in my career were incredibly smart and damn good at identifying edge cases and weird scenarios, even when they were not told and the domain wasn't theirs to begin with. You didn't need to write lengthy, several-page-long Jira tickets. Just a brief paragraph, and that's it.
With AI, you need to spell everything out in detail. But that's NO guarantee either because these models are NOT deterministic in their output. Same prompt different output each time. That's why every chat box has that "Regenerate" button. So your output with even a correct and detailed prompt might not lead to correct output. You're just literally rolling a dice with a random number generator.
Lastly - no matter how smart and expensive the model is, the underlying working principles are the same as GPT-2. Same transformers with RL on top, same random seed, same list of probabilities of tokens and same temperature to select randomly one token to complete the output and feedback in again for the next token.
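To make that concrete, here's a toy sketch of the decoding loop described above, with a stubbed model and made-up logits (real inference differs in scale, not in shape):

```python
import math
import random

def sample_next(logits, temperature=0.8, rng=random):
    """Softmax over temperature-scaled logits, then sample one token id."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r, acc = rng.random(), 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r < acc:
            return i
    return len(logits) - 1

def generate(step_fn, prompt_ids, n_tokens, temperature=0.8):
    """Autoregressive loop: each sampled token is fed back in as context."""
    ids = list(prompt_ids)
    for _ in range(n_tokens):
        ids.append(sample_next(step_fn(ids), temperature))
    return ids
```

With temperature near 0 this collapses to argmax (greedy and repeatable); with any real temperature, two runs over the same prompt can diverge at the very first sampled token.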
aforwardslash 14 hours ago [-]
> It is NOT the way to work with humans basically because most software engineers I worked with in my career were incredibly smart and were damn good at identifying edge cases and weird scenarios even when they were not told and the domain wasn't theirs to begin with.
I have no clue what AI you're using, but with both Claude and Codex, you just explain the outcome and they are pretty smart at figuring out stuff in complex codebases. You don't even need a paragraph; just say "doing this I got an error".
> NO guarantee either because these models are NOT deterministic in their output. Same prompt different output each time.
So, exactly like humans. But a bit more predictable and way more reliable.
> That's why every chat box has that "Regenerate" button.
If you're using the chat box to write code, that's a human error, not an LLM one. Don't blame "AI" for your ignorance.
> no matter how smart and expensive the model is, the underlying working principles are the same as GPT-2.
Sure. Every machine is a smoke machine if operated wrong enough. This tells me you should not get your insight from random YT videos. As a bit of a nugget, some of the underlying working principles of the chat system also powered search engines; and their engineers also drank water, like Hitler.
throwaway7783 18 hours ago [-]
This is not true in my experience at all. I never write such detailed spec for AI - and that is my value as the human in the loop - to be iterative, to steer and make decisions. The AI in fact catches more edge cases than I do, and can point me to things that I never considered myself. Our productivity has increased manyfold, and code quality has increased significantly because writing tests is no longer a chore or an afterthought, or the biggest one for us - "test setup is too complicated". All of that is gone. And it is showing in a decrease in customer reported issues
snarkconjecture 20 hours ago [-]
> the underlying working principles are the same as GPT-2
I don't think anyone was claiming otherwise. Sonnet is still better at writing code than GPT-2, and worse than Opus. Workflows that work with Opus won't always work with Sonnet, just as you can't use GPT-2 in place of Sonnet to do code autocomplete.
jonas21 19 hours ago [-]
> That's why every chat box has that "Regenerate" button.
Wait, are you doing this in the web chat interface?!
That's definitely not a good way. You need to be using a harness (like Claude Code) where the agent can plan its work, explore the codebase, execute code, run tests, etc. With this sort of setup, your prompts can be short (like 1 to 5 sentences) and still get great results.
wg0 18 hours ago [-]
I use the Claude CLI or OpenCode. The "Regenerate" example is just to illustrate that the same prompt produces different output each time. You're rolling dice.
hnfong 5 hours ago [-]
Not sure what your point is.
Sure, AI output is kind of random.
But that's also basically true for humans. It's harder to "prove" humans are random, but wouldn't you think a person would do things slightly differently when given the same tasks but on different days? People change their minds a lot, it's just that there's no "reconsider" button for people so you feel a bit of social friction if you pester somebody to rethink an issue. But it's no different.
I'd be really surprised if your point is that humans, unlike AI, are super deterministic and that's why they are so much more trustworthy and smarter than AI...
rafram 21 hours ago [-]
> Opus or GPT-5.5 are the only ways to even attempt this.
It’s pretty funny to claim that a model released 22 hours ago is the bare minimum requirement for AI-assisted programming. Of course the newest models are best at writing code, but GPT-* and Claude have written pretty decent systems for six months or so, and they’ve been good at individual snippets/edits for years.
Aurornis 21 hours ago [-]
> It’s pretty funny to claim that a model released 22 hours ago is the bare minimum requirement for AI-assisted programming.
Not what I said.
The OP was trying to write specs and have an AI turn it into an app, then getting frustrated with the amount of cleanup.
If you want the AI to write code for you and minimize your cleanup work, you have to use the latest models available.
They won't be perfect, but they're going to produce better results than using second-tier models.
rafram 21 hours ago [-]
Is it actually the case that 5.5 is that much better at implementing specs than its very capable predecessor released a month ago? Just seems like a baseless and silly claim about a model that has barely been out long enough for anyone to do serious work with it.
Aurornis 20 hours ago [-]
> Is it actually the case that 5.5 is that much better at implementing specs than its very capable predecessor released a month ago?
The OP comment was talking about Claude Sonnet. I was comparing to that.
I should have just said "use the best model available"
ghurtado 20 hours ago [-]
> Is it actually the case that 5.5 is that much better
Nobody was talking about how much better it is until you wrote this though
It's like you're building your own windmills brick by brick
munk-a 21 hours ago [-]
> Stop trying to use it as all-or-nothing. You can still make the decisions, call the shots, write code where AI doesn't help and then use AI to speed up parts where it does help.
You're assuming that finding the places where AI needs help isn't already a larger task than just writing it yourself. AI can be helpful in development in very limited scenarios, but the main thrust of the comment above yours is that it takes longer to read and understand code than to write it, and AI tooling is currently focused on writing code.
We're optimizing the easy part at the expense of the difficult part - in many cases it simply isn't worth the trouble (cases where it is helpful, imo, exist when AI is helping with code comprehension but not new code production).
Aurornis 21 hours ago [-]
> You're assuming that finding the places where AI needs help isn't already a larger task than just writing it yourself.
Not assuming anything, I'm well versed in how to do this.
Anyone who defers to having AI write massive blocks of code they don't understand is going to run into this.
You have to understand what you want and guide the AI to write it.
The AI types faster than me. I can have the idea and the understanding, then tell the LLM to rearrange the code or do the boring work faster than I could type it myself.
Exoristos 20 hours ago [-]
The number of devs I've worked with who can't touch-type and don't use or know their way around a proper IDE is depressingly large.
Aurornis 20 hours ago [-]
Same with debuggers. I run into people with 10 years of experience who are still trying to printf debug complex problems that would be easy with 5 minutes in a debugger.
I think we're seeing something similar with AI: There are devs who spend a couple days trying to get AI to magically write all of their code for them and then swear it off forever, thinking they're the only people who see the reality of AI and everyone else is wrong.
munk-a 17 hours ago [-]
At the same time - there are devs that spend two days setting up a debugger for a simple problem that would be easy with five minutes and printf. AI is a tool and it's a useful tool - it's not always the best tool for the job and the real skill is in knowing when you use it and when not to.
It's a sort of fact of life that the easy problems are solved - those where an extreme answer is always correct are things we no longer even consider problems... most of the options that remain have their advantages and disadvantages, so the true answer is somewhere in the middle.
hunterpayne 16 hours ago [-]
Right, but then the AI doesn't have a positive ROI. In all fairness, it never has had a positive ROI, but now it's much more negative - to the point that the accountants will put an end to the experiment after year-end reveals how negative it really is.
throwuxiytayq 16 hours ago [-]
This isn't about touch typing or IDE tricks. I'm an IDE power user and - reasoning aside - I used to run circles around my peers when it comes to raw code editing efficiency. This is increasingly an obsolete workflow. LLMs can execute codebase-wide refactors in seconds. You can use them as a (foot-)shotgun, or as a surgical tool.
Exoristos 16 hours ago [-]
So many are masters of AI marketing, it's thinkable one of them has mastered AI.
ryan_n 20 hours ago [-]
You've come full circle and are essentially just describing what the OP was saying in their initial post lol.
kakacik 20 hours ago [-]
If you are trying to sell it, you are doing a poor job and effectively siding with OP while desperately trying to write the opposite.
Juniors mostly behave better than what you describe; I certainly never had to correct as much after any junior as OP describes. If you have 'boring code' in your codebase, maybe that signals not-so-great architecture (and I presume we aren't talking about the codegens that have existed since at least the '90s).
Also, any senior worth their salt wants to intimately understand their code - the only way you can guarantee correctness at all. Man, I could go on and pick your statements apart one by one, but that would take long.
_puk 20 hours ago [-]
The problem I have with this take is it's focused on solving the right now problem.
Yes, it's quicker to do it yourself this time. But if we build out the artifacts to do a good enough job this time, then next time the AI will have all the context it needs to take a good shot at it - and if AI does overtake manual coding in the meantime, you've got an insane head start.
Which side of history are you betting on?
munk-a 19 hours ago [-]
I don't believe that investing more of my time in a slower process now would result in an advantage if that other process was refined. I've toyed around with these tools and know enough to get an environment up and running so what would I gain from using them more right now if those tools may significantly change before they're adapted to more efficient usage?
I'm okay not being at the bleeding edge - I can see the remains of the companies that aggressively switch to the new best thing. Sometimes it'll pay off and sometimes it won't. I am comfortable being a person that waits until something hits a 2.0 and the advantages and disadvantages are clear before seriously considering a migration.
torben-friis 14 hours ago [-]
If you don't do it yourself and you don't get overtaken by AI, you've lost the head start to be better next time - humans learn, and they atrophy as well.
afro88 20 hours ago [-]
> Writing detailed specs and then giving them to an AI is not the optimal way to work with AI.
> That's vibecoding with an extra documentation step.
Read uncharitably, yeah. But you're making a big assumption that the writing of spec wasn't driven by the developer, checked by developer, adjusted by developer. Rewritten when incorrect, etc.
> You can still make the decisions, call the shots
One way to do this is to do the thinking yourself, tell it what you want it to do specifically and... get it to write a spec. You get to read what it thinks it needs to do, and then adjust or rewrite parts manually before handing off to an agent to implement. It depends on task size of course - if small or simple enough, no spec necessary.
It's a common pattern to hand off to a good instruction following model - and a fast one if possible. Gemini 3 Flash is very good at following a decent spec for example. But Sonnet is also fine.
> Stop trying to use it as all-or-nothing
Agree. Some things just aren't worth chasing at the moment. For example, in native mobile app development, it's still almost impossible to get accurate idiomatic UI that makes use of native components properly and adheres to HIG etc
yonaguska 14 hours ago [-]
This is my workflow: converse with it to write a spec, reviewing the spec myself. Ask it to trace out how it would implement it. I know the codebase because it was originally written mostly by hand. Correct it with my best practices. Have it challenge my assumptions and read the code to do so. Then it's usually good enough to go on its own. The beauty of having a well-defined spec is that once it's done, I can have another agent review the work, and it generates good feedback if anything deviates from the spec.
I'm unsure if this is actually faster than me writing it myself, but it certainly expends less mental energy for me personally.
The real gains I'm getting are with debugging prod systems. Where normally I would have to touch five different interfaces to track down an issue, I've encompassed it all within an MCP and direct my agent through the debugging steps (check these logs, check this in the DB, etc.).
ost-ing 8 hours ago [-]
> That's vibecoding with an extra documentation step.
This sounds like an LLM talking.
Either you're a bot, or our human languages are being modified in realtime by the influence of these tools.
mandeepj 21 hours ago [-]
Sure, Opus is a level above Sonnet, but it still doesn't free OP from these handcuffs: it is reading the code, understanding it, and building a mental model that's way more labour intensive.
Aurornis 21 hours ago [-]
The OP's problem was treating the situation as two extremes: Either write everything myself, or defer entirely to the AI and be forced to read it later.
I was trying to explain that this isn't how successful engineers use AI. There is a way to understand the code and what the AI is doing as you're working with it.
Writing a spec, submitting it to the AI (a second-tier model at that) and then being disappointed when it didn't do exactly what you wanted in a perfect way is a tired argument.
hunterpayne 15 hours ago [-]
Is doing that faster than just writing it by hand? Remember to include the time you need to review the code afterwards. The research so far says it isn't faster. Yet people keep doubling down on it and thinking winning an Internet argument is going to matter when it hits the fan in the near future.
WesolyKubeczek 21 hours ago [-]
But when you write code by hand, you at least are there as it’s happening, which makes reading and understanding way easier.
elAhmo 21 hours ago [-]
Funny hearing you’re saying only GPT 5.5 (and Opus) can do this, having in mind that it came out last night.
Aurornis 21 hours ago [-]
To be clear, I'm not saying that they can do this.
I'm saying that if you're trying to have AI write code for you and you want to do as little cleanup as possible, you have to use the best model available.
ForOldHack 19 hours ago [-]
"Writing detailed specs and then giving them to an AI is not the optimal way to work with AI." Perfect. I loosely define things, then correct it and tell it to make the corrections, and it gets trained - but you have to constantly watch it. It's like a glorified auto-typer.
"Ignore all of the LinkedIn and social media hype about prompting apps into existence." Absolutely. It's not hype, it's pure marketing bullshitzen.
scuderiaseb 21 hours ago [-]
I must be doing something very different from everyone else, but I write what I want and how I want it, and Opus 4.7 plans it for me; then I carefully review. Often I need to validate and check things, and sometimes I’ve revised the plan multiple times. Then implementation which I still use Opus for because I get a warning that my current model holds the cache so Sonnet shouldn’t implement. And honestly, I’m mostly within my Pro subscription. Granted, I also have ChatGPT Plus, but I’ve mostly only used that as the chat/quick-reference model. But yeah, it takes some time to read and understand everything, and a lot of the time I make manual edits too.
wg0 21 hours ago [-]
>Then implementation which I still use Opus for because I get a warning that my current model holds the cache so Sonnet shouldn’t implement.
This is based on the premise that, given a detailed plan, the model will produce exactly the same thing - that the model is deterministic in nature - which is NOT the case. These models are NOT deterministic, no matter how detailed a plan you feed them. If you doubt it, give the model the same plan twice and watch something different get churned out each time.
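A toy illustration of the point, with a fake one-step "model" and an invented vocabulary (this is about sampling in general, not any particular vendor):

```python
import random

# Toy stand-in for a model: the prompt maps to a fixed distribution over
# next words, and we sample from it, just as an LLM samples tokens.
VOCAB = ["refactor", "rewrite", "wrap", "duplicate"]
WEIGHTS = [0.4, 0.3, 0.2, 0.1]

def complete(prompt, rng):
    word = rng.choices(VOCAB, weights=WEIGHTS, k=1)[0]
    return f"{prompt} -> {word}"

# Two independently seeded runs over the identical prompt can differ;
# only pinning the seed (or greedy/temperature-0 decoding) makes the
# mapping from prompt to output repeatable.
run_a = complete("apply the plan", random.Random(1))
run_b = complete("apply the plan", random.Random(7))
```

Same prompt, two draws from the same distribution: sometimes they agree, often they don't, and nothing in the plan itself changes that.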
> And honestly, I’m mostly within my Pro subscription, granted I also have ChatGPT Plus but I’ve mostly only used that as the chat/quick reference model. But yeah takes some time to read and understand everything, a lot of the time I make manual edits too.
I do not know how you can do it on a Pro plan with Claude Opus 4.7, which is 7.5x more in terms of limit consumption. Any small-to-medium codebase would easily consume up to 50% of your limits in the planning phase alone, in a single prompt, on a Pro plan (the $20/month one that they are planning to eliminate).
scuderiaseb 19 hours ago [-]
> I do not know how you can do it on a Pro plan with Claude Opus 4.7 which is 7.5x more in terms of limit consumption and any small to medium size codebase would easily consume your limits in just the planning phase up to 50% in a single prompt
I also don’t understand, because all I ever hear is people saying the $100 Max plan is the minimum for serious work. I made 3-4 plans today; I’m familiar with the codebase and pointed the LLM in the direction it needed to go. I described the functionality I wanted, which wasn’t a huge rewrite - it touched like 4 files, of which one was just a module of Pydantic models. But one plan was 30% of usage, and I had this over two sessions because I got a reset. I did read and understand every line of code, so that part takes me some time.
aforwardslash 14 hours ago [-]
One of the simple "reasons" is to keep context clean: if you're doing planning, you're not loading source code, just the plan. Also, if you're running parallel manual sessions, the cache expires after 1h, so a prompt on an idle session will re-trigger re-evaluation of the whole context (something quite heavy with a 1M context window). This burns a lot of credit.
_puk 21 hours ago [-]
Rather than vibe, write your thoughts and get the model to challenge you / flesh it out is my preferred approach.
Get it to write a context capsule of everything we've discussed.
Chuck that in another model and chat around it, flesh out the missing context from the capsule. Do that a couple of times.
Now I have an artifact I can use to one-shot a hell of a lot of things.
This is amazing for 0-1.
For brown field development, add in a step to verify against the current code base, capture the gotchas and bounds, and again I've got something an agent has a damn good chance of one-shotting.
coldtea 16 hours ago [-]
>I write detailed specs. Multifile with example code. In markdown.
Then hand over to Claude Sonnet. With hard requirements listed, I found out that the generated code missed requirements, had duplicate code or even unnecessary code wrangling data (mapping objects into new objects of narrower types when won't be needed) along with tests that fake and work around to pass.
Stop doing that. Micromanage it instead. Don't give it the specs for the system, design the system yourself (can use it for help doing that), inform it of the general design, but then give it tasks, ONE BY ONE, to do for fleshing it out. Approve each one, ask for corrections if needed, go to the next.
Still faster than writing each of those parts yourself (a few minutes instead of multiple hours), but much more accurate.
dieortin 16 hours ago [-]
Might as well just write the code yourself at that point. And as a bonus, end up with a much better understanding of the codebase (and way better code)
coldtea 16 hours ago [-]
>Might as well just write the code yourself at that point
"We have this thing that can speed your code writing 10x"
"If it isn't 1000x and it doesn't give me a turnkey end to end product might as well write the whole thing myself"
People have forgotten balance. Which is funny, because the inability of the AI to just do the whole thing end to end correctly is what stands between 10 developers having a job versus 1 developer having a job telling 10 or 20 agents what to do end to end and collecting the full results in a few hours.
And if you do it the way I describe you get to both use AI, AND have "a much better understanding of the codebase (and way better code)".
dieortin 6 hours ago [-]
Writing the code is usually not the bottleneck, so you don’t gain that much speeding it up. And as I said, you lose a lot of knowledge about the code when you don’t write it yourself.
Unless coding is most of your job, which is rare, you’re giving up really knowing what your software does in order to achieve a very minor speed up. Just to end up having to spend way more time later trying to understand the AI generated code when inevitably something breaks.
> And if you do it the way I describe you get to both use AI, AND have "a much better understanding of the codebase (and way better code)".
Using AI is not a goal in itself, so I don’t care about “getting to use AI”. I care about doing my job as efficiently as possible, considering all parts of my job, not just coding.
Aurornis 15 hours ago [-]
The goalposts move every month. We’re at the stage where handing an entire specification to a mid-tier AI and walking away while it does all the work and then being disappointed that it wasn’t perfect means it’s useless.
hintymad 21 hours ago [-]
> With hard requirements listed, I found out that the generated code missed requirements,
This is hardly a surprise, no? No matter how much training we run, we are still producing a generative model. And a generative model doesn't understand your requirements and cross them off. It predicts the next most likely token from a given prompt. If the most statistically plausible way to finish a function looks like a version that ignores your third requirement, the model will happily follow through. There are really no rules in your requirements doc. They are just the conditioning events X in a glorified P(Y|X). I'd venture to guess that sometimes missing a requirement may increase the probability of the generated tokens, so the model will happily allow the miss. Actually, "allow" is too strong a word. The model does not allow shit. It just generates.
teucris 20 hours ago [-]
But agents do keep task lists and check the tasks off as they go. Of course it’s not perfect either but it’s MUCH better than an LLM can offer on its own.
If you are seeing an agent missing tasks, work with it to write down the task list first and then hold it accountable to completing them all. A spec is not a plan.
mathisfun123 20 hours ago [-]
Bro, do you really not understand that that's a game played for your sake? It checks boxes, yes, but you have no idea what effect the checking of the boxes actually has. Do you not realize that Anthropic/OpenAI are baking this kind of stuff into models/UI/UX to give the sensation of rigor?
jwitthuhn 17 hours ago [-]
The checkboxes inform the model as well as the user, and you can observe this yourself. For example in a C++ project with MyClass defined in MyClass.cpp/h:
I ask the model to rename MyClass to MyNewClass. It will generate a checklist like:
- Rename references in all source files
- Rename source/header files
- Update build files to point at new source files
Then it will do those things in that order.
Now you can re-run it but inject the start of the model's response with the order changed in that list. It will follow the new order. The list plainly provides real information that influences future predictions and isn't just a facade for the user.
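The conditioning effect is easy to see in a deliberately tiny autoregressive toy, where each "token" is predicted from the previous one (the step names are invented for the rename example):

```python
# Tiny "model": the next token depends only on the last token in context.
# Anything already emitted (including a checklist) conditions what follows.
BIGRAMS = {
    "plan:": "rename-refs",
    "rename-refs": "rename-files",
    "rename-files": "update-build",
    "update-build": "done",
    "files-first:": "rename-files",   # an injected, reordered prefix
}

def continue_from(context):
    out = list(context)
    while out[-1] in BIGRAMS:
        out.append(BIGRAMS[out[-1]])
    return out
```

Starting from "plan:" walks the steps in the listed order; injecting the alternate "files-first:" prefix changes every subsequent prediction. That's the sense in which the checklist is information for the model, not just UI decoration.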
dragandj 6 hours ago [-]
And when it doesn't, it politely apologizes, at least :)
_puk 20 hours ago [-]
Not to knee jerk on a bro comment, but, bro..
Are you seriously saying that breaking a large complex problem down into its constituent steps, and then trying to solve each one of them as an individual problem, is just a sensation of rigour?
stvltvs 19 hours ago [-]
I believe they're saying that the checkboxes are window dressing, not an accurate reflection of what the LLM has done.
kazinator 20 hours ago [-]
To some extent, I could agree with that idea. One purpose of that process is to match the impedance between the problem, and human cognition. But that presumes problem solving inherently requires human cognition, which is false; that's just the tool that we have for problem solving. When the problem-solving method matches the cognitive strengths and weaknesses of the problem solvers, they do have a certain sensation of having an upper hand over the problem. Part of that comes from the chunking/division allowing the problem solvers to more easily talk about the problem; have conversations and narratives around it. The ability to spin coherent narratives feels like rigor.
mathisfun123 19 hours ago [-]
I'm saying that's not what the stupid bot is actually doing; it's what Anthropic added to the TUI to make you feel good in your feelies about what the bot is actually doing (spamming).
Edit: I'll give you another example that I realized because someone pointed it out here: when the stupid bot tells you why it fucked up, it doesn't actually understand anything about itself - it's just generating the most likely response given the enormous amount of pontification on the internet about this very subject...
_puk 19 hours ago [-]
I'm not disagreeing in principle, but the detritus left after an anthropic outage is usually quite usable in a completely fresh session. The amount of context pulled and stored in the sandbox is quite hefty.
Whilst I can't usually start from the exact same point in the decisioning, I can usually bootstrap a new session. It's not all ephemeral.
To your edit: that's the most galling thing - finding out about the thinking being discarded at cache clear.
Reconstructing the logical route it took to get to the end state is just not the same as the step-by-step process it took in the first place, which again I feel counters your "feelies".
mathisfun123 19 hours ago [-]
> I find that the most galling thing about finding out about the thinking being discarded at cache clear
There's a really simple solution to this galling sensation: simply always keep in mind it's a stupid GenAI chat bot.
bmurphy1976 18 hours ago [-]
I'm starting to think a lot of the problem people are having is just that they have unrealistic expectations.
I'm not having the same problem as you and I follow a very similar methodology. I'm producing code faster and at much higher quality with a significant reduction in strain on my wrists. I doubt I'm typing that much less, but what I am typing is prose which is much more compatible with a standard QWERTY keyboard.
I think part of it is that I'm not running forward as fast as I can and I keep scope constrained and focused. I'm using the AI as a tool to help me where it can, and using my brain and multiple decades of experience where it can't.
Maybe you're expecting too much and pushing it too hard/fast/prematurely?
I don't find the code that hard to read, but I'm also managing scope and working diligently on the plans to ensure it conforms to my goals and taste. A stream of small well defined and incremental changes is quite easy to evaluate. A stream of 10,000 line code dumps every day isn't.
I bet if you find that balance you will see value, but it might not be as fast as you want, just as fast as is viable which is likely still going to be faster than you doing it on your own.
dragandj 6 hours ago [-]
If the main problem is the incompatibility of programming languages with QWERTY, that problem was solved many decades ago: programmers can switch to Colemak and save many trillions of dollars of AI expenses.
linsomniac 16 hours ago [-]
>Then hand over to Claude Sonnet.
Have you tried Opus 4.6 with "/effort max" in Claude Code? That's pretty much all I use these days, and it is, honestly, doing a fantastic job. The code it's writing looks quite good to me. Doesn't seem to matter if it's greenfield or existing code.
If code is harder to read than to write, you're doing yourself a disservice by having the output stage not be top shelf.
dragandj 6 hours ago [-]
I find it works even better with "/effort ultra".
rsanek 19 hours ago [-]
I'm confused. If you have detailed, specific expectations, why aren't you using the best model available? Even if you were using Opus 4.7, I would ask whether you're using high/xhigh effort by default.
Feels crazy to me for people to use anything other than the best available.
xpe 18 hours ago [-]
I also have the same question. That said, for some problems, at least over the last week or so, I sometimes got better results from lower-effort Opus or even Sonnet. Sometimes I get (admittedly, this is by feel) a better experience from voice mode, which uses Haiku. This is somewhat surprising in some ways but maybe not in others. Some possible explanations include: (a) bugs relating to Anthropic's recent post-mortem [1], or (b) a tendency for a more loquacious Claude to wander off into the weeds rather than offering concise answers that invite short back-and-forth conversation and iteration.
> Feels crazy to me for people to use anything other than the best available.
Not everyone has unlimited budgets to burn on tokens.
MattRix 2 hours ago [-]
Yeah but in a discussion about technology it’s a little silly. It’s like someone complaining about their phone and then finding out they still use a Nokia.
eweise 19 hours ago [-]
I give Claude small incremental tasks to do and it usually does them flawlessly. I know how to design the software and break into incremental tasks. Claude does the work. The productivity increase has been incredible. I think I'll be able to bootstrap a single person lifestyle business just using Claude.
jwpapi 14 hours ago [-]
I have the same feeling.
Like there's no way in the world that Gen AI is faster than an actual cracked coder shooting the exact bash/SQL commands he needs to explore and writing a proper intent-communicating abstraction.
I'm thinking the difference is orders of magnitude.
On top of that it adds context loss, risk of distraction, the extra work of reading after the job is done + you’ll have less of a mental model no matter how good you read, because active > passive.
Man, it was really the weirdest thing that Claude Code started hiding more and more changes. That's what you need: staying closely in the loop.
throwaway7783 19 hours ago [-]
I don't know. I don't write detailed specs, but make it very iterative, with two sessions. One for coding and one for reviews at various levels.
On its own, the coding session makes mistakes, duplicates code, and doesn't follow the patterns. The reviewer catches most of this, and the coder fixes it all after first rationalizing it.
Works pretty well for me. This model is somewhat institutionalized in my company as well.
I use CC Opus 4.7 or Codex GPT 5.4 High (more and more codex off late).
meroes 19 hours ago [-]
This is how I feel with AI math proofs. I’m not sure where they’re at now, but a year ago it took so much more time to check if an LLM proof was technically correct even if hard to understand, compared to a well structured human proof.
Maybe it was Timothy Gowers who commented on this.
Lots of human proofs have the unfortunate "creative leap" that isn't fully explained but carries some detectable subtlety. LLMs end up making large leaps too, but too often the subtle ways mathematicians think and communicate are lost, and so the proof becomes much more laborious to check.
Like, you don't always see how a mathematician came up with some move or object to "try", and to an LLM it appears that random large creative leaps are the way to write proofs.
baranul 15 hours ago [-]
Now that there is Claw Code[1], it seems like many of these cancellations are easier to make.
I use open spec to negotiate requirements before the handoff; it's helped me a lot. You could also use GSD2, Amazon's Kiro, or Spec Kit, but I find they have too many stages and waste tokens.
abustamam 19 hours ago [-]
This may be a bit silly, but I do what you do and then I tell Claude to review the code it wrote and compare it to the specs. It will often find issues and fix them. Then I review the reviewed code, and it's leagues better than the pre-review code.
This may be worth trying out.
moribunda 10 hours ago [-]
And it leaves 25 TODO comments in code silently, reporting to you that everything is done.
dannersy 20 hours ago [-]
Beautifully stated and I couldn't agree more. This is my experience.
GoToRO 17 hours ago [-]
you are holding it wrong. For real this time.
rob 20 hours ago [-]
I use the "Superpowers" plugin that creates an initial spec via brainstorming together, and then takes that spec and creates an implementation spec file based on your initial spec. It also has other agents make sure the spec doesn't drift between those two stages and does its own self-reviews. Almost every time, it finds and fixes a bunch of self-review issues before writing the final plan. Then I take that final plan and run it through the actual execution phase that does its own reviews after everything.
Just saying that I know a lot of people like to raw dog it and say plugins and skills and other things aren't necessary, but in my case I've had good success with this.
hirvi74 18 hours ago [-]
That is why I still use the chatbots and not the CLI/desktop tools. I am in 100% control. I mainly ask questions about syntax in languages I am not well experienced in, snippets/examples, and sometimes feedback on certain bits of logic.
I feel like I have easily multiplied my productivity because I do not really have to read more than a single chat response at a time, and I am still familiar with everything in my apps because I wrote everything.
I've been working on Window Manager + other nice-to-haves for macOS 26. I do not need a model to one-shot the program for me. However, I am thrilled to get near instantaneous answers to questions I would generally have to churn through various links from Google/StackOverflow for.
tengbretson 20 hours ago [-]
> or even unnecessary code wrangling data (mapping objects into new objects of narrower types when won't be needed)
Dude! The amount of ad-hoc, interface-specific DTOs that LLM coding agents define drives me up the wall. Just use the damn domain models!
varispeed 18 hours ago [-]
You can quickly get something "working" until you realise it has a ton of subtle bugs that make it unusable in the long run.
You then spend months cleaning it up.
Could just have written it by hand from scratch in the same amount of time.
But the benefit is not having to type code.
CamperBob2 19 hours ago [-]
Then hand over to Claude Sonnet.
Well, there's your problem. Why aren't you using the best tool for the job?
xpe 20 hours ago [-]
I very much value and appreciate the first four paragraphs! [3] This is my favorite kind of communication in a social setting like this: it reads more like anthropology and less like judgment or overgeneralization.
The last two paragraphs, however, show what happens when people start trying to use inductive reasoning -- and that part is really hard: ...
> Therefore I need more time and effort with Gen AI than I needed before because I need to read a lot of code, understand it and ensure it adheres to what mental model I have.
I don't disagree that the above is reasonable to say. But it isn't all -- not even enough -- of what needs to be said. The rate of change is high, and the amount of adaptation required is large. This, in a nutshell, is why asking humans to adapt to AI is going to feel harder and harder. I'm not criticizing people for feeling this. But I am criticizing the one-sided logic people often reach for.
We have a range of options in front of us:
A. sharing our experience with others
B. adapting
C. voting with your feet (cancelling a subscription)
D. building alternatives to compete
E. organizing at various levels to push back
(A) might start by sounding like venting. Done well it progresses into clearer understanding and hopefully even community building towards action plans: [1]
> Hence Gen AI at this price point which Anthropic offers is a net negative for me because I am not vibe coding, I'm building real software that real humans depend upon and my users deserve better attention and focus from me hence I'll be cancelling my subscription shortly.
The above quote is only valid under some pretty strict (implausible) assumptions: (1) "GenAI" is a valid generalization for what is happening here; (2) the person cannot learn and adapt; (3) the technology won't get better.
[1]: I'm at heart more of a "let's improve the world" kind of person than "I want to build cool stuff" kind of person. This probably causes some disconnect in some interactions here. I think some people primarily have other motives.
Some people cancel their subscriptions and kind of assume "the market and public pushback will solve this". The market's reaction might be too slow or too slight to actually help much. Some people put blind faith into markets helping people on some particular time scales. This level of blind faith reminds me of Parable of the Drowning Man. In particular, markets often send pretty good signals that mean, more or less, "you need to save yourself, I'm just doing my thing." Markets are useful coordinating mechanisms in the aggregate when functioning well. One of the best ways to use them is to say "I don't have enough of a cushion or enough skills to survive what the market is coordinating" so I need a Plan B!
Some people go further and claim markets are moral by virtue of their principles; this becomes moral philosophy, and I think that kind of moral philosophy is usually moral confusion. Broadly speaking, in practice, morality is a complex human aspiration. We probably should not abdicate our moral responsibilities and delegate them to markets any more than we would say "Don't worry, people who need significant vision correction (or face some other barrier to modern life)... evolution will 'take care' of you."
One subscription cancellation is a start (if you actually have a better alternative, and if that alternative is better for the world... which is debatable given the current set of alternatives!)
Talking about it, i.e. here on HN, might be one place to start. But HN is also kind of a "where frustration turns into entertainment, not action" kind of place, unfortunately. Voting is cheap. Karma sometimes feels more like a measure of conformity than of quality thinking. I often feel like I am doing better when I write thoughtfully and still get downvotes -- maybe it means I got some people out of their comfort zone.
Here's what I try to do (but fail often): Do the root cause analysis, vent if you need to, and then think about what is needed to really fix it.
rectang 23 hours ago [-]
I feel like I'm using Claude Opus pretty effectively and I'm honestly not running up against limits in my mid-tier subscriptions. My workflow is more "copilot" than "autopilot", in that I craft prompts for contained tasks and review nearly everything, so it's pretty light compared to people doing vibe coding.
The market-leading technology is pretty close to "good enough" for how I'm using it. I look forward to the day when LLM-assisted coding is commoditized. I could really go for an open source model based on properly licensed code.
Retr0id 22 hours ago [-]
I also use it this way and I'm overall pretty happy with it, but it feels like they really want us to use it in "autopilot" mode. It's like they have two conflicting priorities of "make people use more tokens so we can bill them more" and "people are using more tokens than expected, our pricing structure is no longer sustainable"
(but I guess they're not really conflicting, if the "solution" involves upgrading to a higher plan)
fluidcruft 22 hours ago [-]
I feel like they are making it harder to use it this way. Encouraging autonomous is one thing, but it really feels more like they are handicapping engaged use. I suspect it reflects their own development practices and needs.
freedomben 22 hours ago [-]
This is something I've thought about as well. The way the caps are implemented really disincentivizes engaged use. The 5-hour window especially is very awkward and disruptive. The net result is that I have to somewhat plan my day around when the 5-hour window will affect it. That by itself is a powerful disincentive from using Claude. It has also caused me to use different tools for things I previously would have used Claude for. For example, for detailed plans I use Codex now rather than Claude, because I hit the limit way too fast when doing documentation work. It certainly doesn't hurt that Codex seems to be better at it, but I wouldn't even have a Codex subscription if it weren't for Claude's usage limits.
j3g6t 16 hours ago [-]
Wow, weird to see someone mirror my experience so closely. At the $100 plan my day was being warped around how to maximize multiple 5-hour sessions so that it felt worth it. Dropped down to the $20 plan and stopped playing the game, as I know I'll just consume the weekly usage in the few days I have free. Meanwhile Codex gave me a free month; their 5-hour-window:weekly-window ratio feels way better balanced and I get way more work done from it. Similar to you, any task involving reading/reviewing docs (or code reviews) now insta-nukes Claude's usage. My record is 12 minutes so far...
Retr0id 22 hours ago [-]
Another big one for me is that they dropped the cache TTLs. It is normal for me to come back to a session an hour later, but someone "autopilot"-ing won't have such gaps.
p_stuart82 21 hours ago [-]
Not just the cache, though. Every time you stop and come back, it basically reloads the whole session. If you just let it keep going, it counts as one smooth run. You hit the wall faster for actually checking its work.
fluidcruft 18 hours ago [-]
It was probably the bug where the cache got purged after 5 minutes rather than 1 hour. You can review things pretty well within an hour. 5 minutes is a real crunch; it doesn't mix with multitasking or getting interrupted.
dandaka 22 hours ago [-]
autopilot (yolo mode) is amazing and feels great, truly delegate instead of hand-holding on every step
dutchCourage 21 hours ago [-]
Do you have any good resources on how to work like that? I made the move from "auto complete on steroids" to "agents write most of my code". But I can't imagine running agents unchecked (and in parallel!) for any significant amount of time.
dandaka 1 hours ago [-]
It will come naturally! I have started with autocomplete as well. I was stumbling upon different problems and was fixing them by implementing best practices. Current stack is:
1/ Claude Code with yolo mode
2/ superpowers plugin
3/ red/green tdd
4/ a lot of planning and requirements before writing any code
It feels like you are always touching the edge of what the models and your current workflow are capable of. Delegate a more complex task and the system fails; delegate a simpler one and it works great. Improve your workflow and move that complexity to a higher level.
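The red/green step in the list above is classic TDD: write a failing test first, then the minimal code that makes it pass. A toy sketch (the `slugify` helper is a made-up example, not something from the thread):

```python
# Red: write the test first; it fails until slugify exists and behaves.
def test_slugify():
    assert slugify("Hello World") == "hello-world"

# Green: the minimal implementation that makes the test pass.
def slugify(text: str) -> str:
    return text.lower().replace(" ", "-")

test_slugify()  # passes silently; a failure would raise AssertionError
```

The point of doing this with an agent is that the test is the contract you actually review; the implementation can be regenerated freely as long as the red test goes green.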
sroerick 20 hours ago [-]
Right now, I'm finding a decent rhythm in running 10-20 prompts and then kind of checking the results a few different ways. I'll ask the agent to review the code, I'll go through myself, I'll do some usability and gut checks.
This seems to be a good window where I can implement a pretty large feature, and then go through and address structural issues. Goofy things like the agent adding an extra database, weird fallback logic where it ends up building multiple systems in parallel, etc.
Currently, I find multiple agents in parallel on the same project to be not super functional. There are just a lot of weird things: agents get confused about worktrees, git conflicts abound, and I found the administrative overhead to be too heavy. I think plenty of people are working on streamlining the orchestration issue.
In the meantime, I combat the ADD by working on a few projects in parallel. This seems to work pretty well for now.
It's still cat herding, but the thing is that refactors are now pretty quick. You just have to have awareness of them
I was thinking it'd be cool to have an IDE that did coloring of, say, the last 10 git commits to a project so you could see what has changed. I think robust static analysis and code as data tools built into an IDE would be powerful as well.
The agents basically see your codebase fresh every time you prompt. And with code changes happening much more regularly, I think devs have to build tools with the same perspective.
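Until an IDE does that last-10-commits coloring natively, plain git gets part of the way there. A rough sketch (the throwaway repo setup is only to make the snippet self-contained; in a real project you would just run the two commands at the bottom):

```shell
# Throwaway demo repo so the snippet runs anywhere.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email demo@example.com
git config user.name demo
for i in $(seq 1 11); do
  echo "$i" > "f$i.txt"
  git add "f$i.txt"
  git commit -qm "change $i"
done

git log --oneline -10            # the last 10 commits at a glance
git diff --stat HEAD~10..HEAD    # per-file churn across those commits
```

`git diff --stat` over a commit range is a crude stand-in for the "what has changed recently" view the comment wishes for.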
mescalito 21 hours ago [-]
I would also be interested in resources on "agents write most of your code" if you can share some.
hypercube33 10 hours ago [-]
For me: you open a markdown editor and draft a code plan with the details of what you'd do as a coder at a high level, then bust into whatever tool in planning mode (I usually fire this into the Opus 4.5 model) and have it break the plan down into concise steps, then hand it off to a simpler model (GPT Spark, Sonnet, Composer, or whatever) to execute. When I feel frisky I'll just have Opus one-shot it, and it can be done in a few minutes.
nurettin 20 hours ago [-]
Same here, especially when I keep catching things like DRY violations and a lack of overall architecture. Everything feels tacked on.
To give them the benefit of the doubt, perhaps these people provide such detailed specs that they are basically writing code in natural language.
8ytecoder 19 hours ago [-]
I use Claude “on the web” or Google Jules. Essentially everything happens in a sandbox - so yolo isn’t a huge risk. You can even box its network access. You review the PR at the end or steer it if it’s veering off course.
naravara 22 hours ago [-]
I think the culty element of AI development is really blinding a lot of these companies to what their tools are actually useful for. They're genuinely great productivity enhancers, but the boosters are constantly going on about how it's going to replace all your employees and it's just... not good for that! And I don't mean "not yet", I mean I don't see it ever getting there barring some major breakthrough on the order of inventing a room-temp superconductor.
dasil003 21 hours ago [-]
I agree with you, the "replacing people" narrative is not only wrong, it's inflammatory and brand suicide for these AI companies who don't seem to realize (or just don't care) the kind of buzz saw of public opinion they're walking straight towards.
That said, looking at the way things work in big companies, AI has definitely made it so one senior engineer with decent opinions can outperform a mediocre PM plus four engineers who just do what they're told.
raincole 21 hours ago [-]
> the day when LLM-assisted coding is commoditized
Like yesterday? LLM-assisted coding is $100/mo. It looks very commoditized when most households in the developed world pay more than that for electricity.
My definition of LLM-assisted coding is that you fully understand every change and every single line of the code. Otherwise it's vibe coding. And I believe that if one is honest about this principle, it's very hard to deplete the quota of the $100 tier.
windexh8er 21 hours ago [-]
> Like yesterday? LLM-assisted coding is $100/mo. It looks very commoditized when most houses in developed world pay more for electricity than that.
But, it's not $100/mo. I think the best showcase of where AI is at is on the generative video side. Look at players like Higgsfield. Check out their pricing and then go look at Reddit for actual experiences. With video generation the results are very easy to see. With code generation the results are less clear for many users. Especially when things "just work".
Again, it's not $100/month for Anthropic to serve most users. These costs are still being subsidized, and as more expensive plans roll out with access to "better" models and "more" tokens and context, the true cost per user is slowly starting to be exposed. I routinely hit limits with Anthropic that I hadn't been hitting for the same (and even less) utilization. I dumped the Pro Max account recently because the value wasn't there anymore. I am convinced that Opus 3 was Anthropic's pinnacle at this point, and while the SotA models of today are good, they're tuned to push people toward paying for overages at a significantly faster consumption rate than a right-sized plan for their usage.
The reality is that nobody can afford to continue to offer these models at the current price points and be profitable at any time in the near future. And it's becoming more and more clear that Google is in a great position to let Anthropic and OAI duke it out with other people's money while they have the cash, infrastructure and reach to play the waiting game of keeping up but not having to worry about all of the constraints their competitors do.
But I'd argue that nothing has been commoditized as we have no clue what LLMs cost at scale and it seems that nobody wants to talk about that publicly.
KaiserPro 20 hours ago [-]
> I think the best showcase of where AI is at is on the generative video side. Look at players like Higgsfield. Check out their pricing and then go look at Reddit for actual experiences. With video generation the results are very easy to see
Video is a different ballgame entirely: it's less than realtime on _large_ GPUs. Moreover, because of the inter-frame consistency, it's really hard to transfer and keep context.
Running inference on text is, or can be, very profitable. It's research and dev that's expensive.
windexh8er 19 hours ago [-]
My point wasn't the delta in work between video and text generation. It was that the degradation of a prompt is much more visible (because: literal). But, generally agree on the research/dev part.
sidrag22 21 hours ago [-]
> fully understand every change and every single line of the code.
I'm probably just not being charitable enough to what you mean, but that's an absurd bar that almost nobody conforms to, even if the code is fully handwritten. Nothing would get done if they did. But again, my emphasis is on the fact that I'm probably just not being charitable to what you mean.
Maxatar 21 hours ago [-]
You're most likely being pedantic, like when someone says they understand every single line of this code:
    x = 0
    for i in range(1, 10):
        x += i
    print(x)
They don't mean they understand the silicon substrate of the microprocessor executing microcode, or the CMOS sense amplifiers reading the SRAM cells caching the loop variable.
They just mean they can more or less follow along with what the code is doing. You don't need to be very charitable to understand what he genuinely meant, and understanding the code one writes is how many (but not all) professional software developers who didn't just copy and paste stuff from Stack Overflow used to carry out their work.
sidrag22 20 hours ago [-]
You drew it to its most uncharitable conclusion for sure, but yeah, that's pretty much the point I was making.
How deeply do I need to understand range() or print() to utilize either, on the slightly less extreme end of the spectrum?
But yeah, I'm pretty sure it's a point that maybe I coulda kept to myself and been charitable instead.
_puk 19 hours ago [-]
"Understand your code" in this day and age likely means hitting the point of deterministic evaluation.
print(X) is a great example.
That's going to print X. Every time.
Agent.print(x) is pretty likely to print X every time. But hey, who knows, maybe it's having an off day.
thomasmg 21 hours ago [-]
Well, that is how it mostly worked until recently... unless the developer copied and pasted from Stack Overflow without understanding much. Which did happen.
satvikpendem 21 hours ago [-]
How is that an absurd bar? If you're handwriting code, you'd need to know what you actually want to write in the first place, hence you understand all the code you write. Therefore the code the AI produces should also be understood by you. Anything else than that is indeed vibe coding.
Maxatar 21 hours ago [-]
A lot of developers don't actually understand the code they write. Sure nowadays a lot of code is generated by LLMs, but in the past people just copied and pasted stuff off of blogs, Stack Overflow, or whatever other resources they could find without really understanding what it did or how it worked.
Jeff Atwood, along with numerous others (who Atwood cites on his blog [1]), were not exaggerating when they observed that the majority of candidates with existing professional experience, and even MSc degrees, were unable to code very simple solutions to trivial problems.
It's an absurd bar if you are being an uncharitable jerk like I was; the layers go deep, and technically I can claim I have never fully grasped any of my code. It is likely just a dumb point to bring up, tbh.
satvikpendem 16 hours ago [-]
I saw your reply to another comment [0]; I see what you mean now. By "understand each line of code" I meant that one would know how the for loop works, not the underlying levels of the language's implementation. I replied initially because lots of vibe-coding devs in fact do not read all the code before submitting, much less actually review it line by line and understand each line.
Could they have meant "every line of code being committed by the LLM" within the current scope of work?
That's how I read it, and I would agree with that.
andrewjvb 21 hours ago [-]
It's a good point. To me this really comes down to the economics of the software being written.
If it's low-stakes, then the required depth to accept the code is also low.
hunterpayne 15 hours ago [-]
I do. If you don't, maybe you shouldn't be writing software professionally. And yes, I've written both DBs and compilers so I do understand what is happening down to the CMOS. I think what you are doing is just cope.
raincole 20 hours ago [-]
I mean "understanding it just like when you hand wrote the code in 2019."
Obviously I don't mean "understanding it so you can draw the exact memory layout on the white board from memory."
torben-friis 20 hours ago [-]
You don't understand every change you make in the PRs you offer for review?
rectang 20 hours ago [-]
Commoditization will be complete for my purposes when an LLM trained on a legitimately licensed corpus can achieve roughly what Opus 4.5+ or the highest powered GPTs can today.
I anticipate a Napster-style reckoning at some point when there's a successful high-profile copyright suit around obviously derivative output. It will probably happen in video or imagery first.
fsckboy 19 hours ago [-]
>LLM-assisted coding is $100/mo. It looks very commoditized when most houses in developed world pay more for electricity than that.
This is a small nit, but you still have to pay your electric bill; the $100/mo is on top of that. If you're doing cost accounting you don't want to neglect any costs. Just because you can afford to lease a car doesn't mean you can afford to lease a second car.
BowBun 20 hours ago [-]
In industry, the cost is more than $100/mo per engineer. With increased adoption and what I know now, I expect full-time devs to rack up $500-$2000 usage bills if they're going full parallel agentic dev. Personal usage for projects and non-production software is not a benchmark, IMO.
mchusma 20 hours ago [-]
I work with a lot of full-time devs, and it is very hard to go beyond the $200 max plan. If you use API credits, and I think the enterprise plan kind of forces you to do this, you can definitely incur this much, particularly if you're not using prompt caching and things like that.
But I and others in my company have very heavy usage. We only rarely, with parallel agentic processes, run out of the $200 a month plan.
And what do I mean by "hard"? I mean, it requires a lot of active thinking to think about how you can actively max it out. I'm sure there's some use cases where maybe it is not hard to do this, but in general, I find most devs can't even max out the $100 a month plan, because they haven't quite figured out how to leverage it to that degree yet.
(Again, if someone is using the API instead of subscription, I wouldn't be surprised to see $2,000 bills.)
ebiester 19 hours ago [-]
Business/Enterprise accounts are billed at $20/seat + API prices, not subscription prices. You can give them a monthly dollar quota or let them go unlimited, but they're not being subsidized like in team. And team can't get a 20x plan from what I can tell.
adastra22 20 hours ago [-]
I routinely use $4k to $5k worth of tokens a month on my $200/mo Max subscription. I don't even code every day.
You can use a Max subscription for work, btw.
hunterpayne 15 hours ago [-]
You do understand the concept of a subsidy right?
adastra22 5 hours ago [-]
I do. Do you? A company providing a cheaper subscription plan is not a subsidy.
I assume you meant loss-leader. We can’t know that without knowing their financials. The actual marginal cost of inference is demonstrably less than $200/mo though, so it’s not clear whether they are operating at a loss. Without seeing their books we can’t know.
goalieca 22 hours ago [-]
Similar with the copilot, not autopilot, usage. I find it's the best of them all. Mostly I just use it as an occasional search engine. I've never found LLMs to be efficient at actually doing work. I do miss the days when tech docs were usable. Claude seems like a crutch for gaps in developer experience more than anything.
llm_nerd 22 hours ago [-]
I have Max 5x and use only Claude Opus on xhigh mode. I don't use agents, or even MCPs, and stick to Claude Code.
I find it incredibly difficult to saturate my usage. I'm ending the average week at 30-ish percent, despite this thing doing an enormous amount of work for (with?) me.
Now I will say that with pro I was constantly hitting the limit -- like comically so, and single requests would push me over 100% for the session and into paying for extra usage -- and max 5x feels like far more than 5x the usage, but who knows. Anthropic is extremely squirrely about things like surge rates, and so on.
I'm super skeptical of the influx of "DAE think Opus sucks now. Let's all move to Codex!" nonsense that has flooded HN. Part of it is the ex-girlfriend thing, where people are angry about something and try to force-multiply their disagreement, but some of it legitimately smells like astroturfing. Like OpenAI just got done paying $100M for some unknown podcaster and started hiring people to write this stuff online.
pixelpoet 21 hours ago [-]
I was in the same boat until the last few days, when just a handful of queries was enough to saturate my 5h session in about 30 minutes.
Recently I've gotten Qwen 3.6 27b working locally and it's pretty great, but it still doesn't match Opus; I've got to check out that new Deepseek model sometime.
NewsaHackO 22 hours ago [-]
Yea, I never got how people are even able to hit the weekly limits so consistently. Maybe it's because they use it for work? But in that case, you would expect the employer to cover it so idk.
>I'm super skeptical of the influx of "DAE think Opus sucks now. Let's all move to Codex!" nonsense that has flooded HN. Part of it is the ex-girlfriend thing, where people are angry about something and try to force-multiply their disagreement, but some of it legitimately smells like astroturfing. Like OpenAI just got done paying $100M for some unknown podcaster and started hiring people to write this stuff online.
A lot of people are angry about the whole openclaw situation. They are especially bitter that, when they attempted to justify exfiltrating the OAuth token to use for openclaw, nobody agreed that they had the right to do so, and people sided with Claude that different limits for first-party use are standard. So they create threads like this and complain about some opaque reason why Anthropic is finished (while still keeping their subscription, of course).
RealStupidity 21 hours ago [-]
If only OpenAI spent a significant amount of money on some kind of generative software that was predominantly trained on internet comments that'd be able to do all the astroturfing for them...
llm_nerd 20 hours ago [-]
A bunch of green accounts would be a bit of a tell. They need to use established accounts, ideally pre-llm, for astroturfing. This is going to be increasingly true.
dwedge 21 hours ago [-]
This kind of "if only" sarcastic comment belongs on reddit from 5 years ago
dboreham 21 hours ago [-]
Same. Never hit a limit. Use it heavily for real work. Never even thought of firing off an LLM for hours of...something. Seems like a recipe for wasting my time figuring out what it did and why.
taytus 22 hours ago [-]
I'd recommend Kimi k2.6 for your use. It is an excellent model at a fraction of the cost, and you can use Claude Code with it.
I did a 1:1 map of all my Claude Code skills, and it feels like I never left Opus.
Super happy with the results.
wolttam 22 hours ago [-]
I was saying the same until DeepSeek v4 this morning... sorry, Kimi. The competition is intense!
Aldipower 21 hours ago [-]
Fascinating, but a bummer that DeepSeek does not offer a DPA or a training opt-out. This renders it unusable for my use cases, unfortunately. At least z.ai GLM has something of a DPA in Singapore.
wolttam 20 hours ago [-]
The weights are open and you can use the model with any third party provider that gives you the DPA you want.
For my use-case, I want the providers to get my tokens as long as they plan to keep releasing open-weight models
folmar 21 hours ago [-]
If you don't use a lot of quota the cheapest monthly Claude Code is $20, Kimi Code is $19, i.e. the cost difference is minuscule.
Kimi wants my phone number on signup so a no-go for me.
ramoz 22 hours ago [-]
What provider do you use for Kimi
skippyboxedhero 21 hours ago [-]
The provider is a massive issue. People moving off Claude tend to assume this is solved.
Claude's uptime is terrible. The uptime of most other providers is even worse...and you get all the quantization, don't know what model you are actually getting, etc.
Leynos 18 hours ago [-]
Kimi 2.5 was like using Sonnet 4 on a flaky ADSL line. I haven't tried K2.6 yet, but the physical unreliability of the connection was too off-putting.
bigethan 20 hours ago [-]
OpenRouter and I'm toying around with Hermes. Seems good so far, but haven't really gotten into anything heavy yet. Though the "freedom" of not sweating the token pause and the costs not being too high is real.
taytus 21 hours ago [-]
Straight from them, but I know other providers like io.net can be faster but I like to directly support the project.
subscribed 18 hours ago [-]
Thx. I'll try it with my personal projects (because due to the data collection and ToS, most providers are forbidden at my company), if I can opt out of training on my input.
I'm just getting a bit tired of using Opus 2.6, which eats my whole allowance and then some £££ going through the 4kB prompt to review a ~13kB text file twice - and that's on top of the sometimes utterly bonkers, bad, lazy answers that I don't get even from the local Gemma 4 E4B.
spaceman_2020 18 hours ago [-]
did you just copy-paste or is there a difference in the way kimi uses skills?
taytus 17 hours ago [-]
I don’t have the prompt at hand but basically I told Kimi (paraphrasing): I have these Claude code skills, and I know it uses different tool calls than you but read them and re-write them as your own tools.
I also created a mini framework so it can test that the skills are actually working after implementation.
Everything runs perfectly.
cyanydeez 22 hours ago [-]
Honestly, it sounds like, assuming you have no ethical qualms, you could get by with a Mac or AMD 395+ and the newest models, specifically QWEN3.5-Coder-Next. It does exactly as you describe. It maxes out around 85k context, which if you do a good job providing guard rails, etc, is the length of a small-medium project.
It does seem like the sweet spot between WallE and the destroyed earth in WallE.
ethicalqualms 22 hours ago [-]
Sorry, out of the loop. Which ethical qualms are you referring to?
kbelder 22 hours ago [-]
Using a Mac, obviously.
rectang 20 hours ago [-]
I have ethical qualms to varying degrees with most LLMs, primarily because of copyright laundering.
I'm a BSD-style Open Source advocate who has published a lot of Apache-licensed code. I have never accepted that AI companies can just come in and train their models on that code without preserving my license, while allowing their users to claim copyright on generated output and take it proprietary or do whatever.
I would actually not mind licensing my work in an LLM-friendly way, contributing towards a public pool from which generated output would remain in that pool. Perhaps there is opportunity for Open Source organizations to evolve licenses to facilitate such usage.
For what it's worth, I would be happy to pay for a commercial LLM trained on public domain or other properly licensed works whose output is legitimately public domain.
folkrav 22 hours ago [-]
My guess - China.
hadlock 17 hours ago [-]
Seems like AMD 395+ is only about 16 tokens/s which is 25-33% the speed of SOTA models. Break even on a $3000 machine is ~15 months
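The 15-month figure presumably comes from dividing the hardware cost by a cloud subscription price; a minimal sketch, assuming a $200/mo plan (the comment doesn't state which plan it's comparing against):

```python
# Break-even point for local hardware vs. a cloud subscription.
# Assumptions (not stated in the comment): $3000 machine, $200/mo cloud plan.
hardware_cost = 3000
monthly_subscription = 200

months_to_breakeven = hardware_cost / monthly_subscription
print(months_to_breakeven)  # 15.0
```

Note the throughput penalty isn't priced in here; at 25-33% of cloud speed, the effective break-even stretches out further if your time has any value.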
cyanydeez 16 hours ago [-]
that's pessimistic. do the calc assuming cloud provider X changes your nondeterministic output every Y months with probability Z and increases prices by 10% every 6 months.
slow and steady is worth exponentials. keep sloppin' it, my boid.
djyde 18 hours ago [-]
[dead]
boxingdog 22 hours ago [-]
[dead]
kashunstva 27 minutes ago [-]
I’m sympathetic to the author’s complaints about Anthropic’s support, though I would go further. It doesn’t exist.
For reasons that continue to elude me, almost exactly one year ago, Anthropic cancelled my Claude Pro plan. To appeal, you must fill out a Google Docs form. And wait. In my case, I've waited for about one year. Once I managed to reach a human by email, but they quickly plugged that hole with a chatbot that sends you back to their never-to-be-reviewed form. No route to escalate.
A year gives one a long time to think about things. Maybe it was because I was on a VPN temporarily. Otherwise, no clue. I’m a hobbyist embedded developer. That’s it.
So no, Anthropic support isn’t just poor; it’s nonexistent.
janwillemb 23 hours ago [-]
This is what worries me. People become dependent on these GenAI products that are proprietary, not transparent, and need a subscription. People build on it like it is a solid foundation. But all of a sudden the owner just pulls the foundation from under your building.
jjfoooo4 22 hours ago [-]
But these products are all drop in replacements for each other. I've recently favored Codex more than CC, just because rate limits got mildly annoying. I really didn't have to change anything about my workflow in doing that.
Capricorn2481 22 hours ago [-]
> But these products are all drop in replacements for each other
For now. That doesn't really change the risk, that just means they are all hyper competitive right this moment, and so they are comparable. If one of them becomes king of the hill, nothing stops them from silently degrading or jacking prices.
The only shield is to not be dependent in the first place. That means keeping your skills sharp and being willing to pass on your knowledge to juniors, so they aren't dependent on these things.
Of course, many people are building their business on huge AI scaffolding. There's nothing they can do.
conrs 22 hours ago [-]
I'm curious - why for now? This stuff is practically commoditized. Trying to think of anything that ever successfully got back into proprietary land from there.
zozbot234 21 hours ago [-]
The thing is that AI is still more akin to a glorified autocomplete than something that can really supersede your skills. Proprietary model suppliers are constantly trying to obscure this basic underlying fact, without much success (much of the unpredictable shifts you see in proprietary AI behavior ultimately boils down to this); so it becomes far more crystal-clear when using open models that really are a pure commodity.
conrs 21 hours ago [-]
yeah, I think there's the marketing and then there's the actual true utility. AI isn't a better computer program. It's not going to be able to do everything you want autonomously. But, it's pretty good at some stuff!
Capricorn2481 21 hours ago [-]
It doesn't look commoditized to me, it looks subsidized. It looks like everyone is trying to be "the one" and running as competitively as possible until the others fail. Commoditized would imply these services are all going to mellow into a stable state and mostly compete on price. I don't think that's happening. These aren't paper clips, they are courting governments and trying to pull the ladder up behind them. That's why both Anthropic and OpenAI are preaching doomsday and trying to build a moat with regulations.
conrs 21 hours ago [-]
Fair. I have high hope for local inference, feel like right now it is simply cost prohibitive to get the hardware. It will be interesting to see what happens.
GaryBluto 23 hours ago [-]
Luckily local AI is becoming more feasible every day.
Someone1234 23 hours ago [-]
It feels more and more like OpenAI/Anthoropic aren't the future but Qwen, Kimi, or Deepseek are. You can run them locally, but that isn't really the point, it is about democratization of service providers. You can run any of them on a dozen providers with different trade-offs/offerings OR locally.
They won't ever be SOTA due to money, but "last year's SOTA" when it costs 1/4 or less, may be good enough. More quantity, more flexibility, at lower edge quality. It can make sense. A 7% dumber agent TEAM Vs. a single objectively superior super-agent.
That's the most exciting thing going on in that space. New workflows opening up not due to intelligence improvements but cost improvements for "good enough" intelligence.
2ndorderthought 18 hours ago [-]
You can run local models on junker laptops for specific tasks that are about as good as last years SOTA. If the manufactured compute hardware shortage wasn't happening a lot more people would be running two months ago SOTA locally right now. Funny thoughts...
echelon 22 hours ago [-]
Open Source isn't even within 50% of what the SOTA models are. Benchmarks are toys, real world use is vastly different, and that's where they seriously lag.
Why should anyone waste time on poorer results? I'd rather pay my $200/mo because my time matters. I'm not a poor college student anymore, and I need more return on my time.
I'm not shitting on open weights here - I want open source to win. I just don't see how that's possible.
It's like Photoshop vs. Gimp. Not only is the Gimp UX awful, but it didn't even offer (maybe still doesn't?) full bit depth support. For a hacker with free time, that's fine. But if my primary job function is to transform graphics in exchange for money, I'm paying for the better tool. Gimp is entirely a no-go in a professional setting.
Or it's like Google Docs / Microsoft Office vs. LibreOffice. LibreOffice is still pretty trash compared to the big tools. It's not just that Google and Microsoft have more money, but their products are involved in larger scale feedback loops that refine the product much more quickly.
But with weights it's even worse than bad UX. These open weights models just aren't as smart. They're not getting RLHF'd on real world data. The developers of these open weights models can game benchmarks, but the actual intelligence for real world problems is lacking. And that's unfortunately the part that actually matters.
Again, to be clear: I hate this. I want open. I just don't see how it will ever be able to catch up to full-featured products.
twobitshifter 22 hours ago [-]
Unless you are getting outside of your comfort zone and taking a month off from your $200 subscription every other month, I can't see how you can make the universal claim that the open-weights models are all 50% as good. Just today, DeepSeek released a new model, so nobody knows how that will compare; a week ago it was Gemma 4, etc. I'm okay with you making a comparison, but state the model and the timeframe in which it was tested that you are basing your conclusions on.
MostlyStable 22 hours ago [-]
I think that there will come a point when open source models are "good enough" for many tasks (they probably already are for some tasks; or at least, some small number of people seem happy with them), but, as you suggest, it will likely always (for the forseeable future at least) be the case that closed SOTA models are significantly ahead of open models, and any task which can still benefit from a smarter model (which will probably always remain some large subset of tasks) will be better done on a closed model.
The trick is going to be recognizing tasks which have some ceiling on what they need and which will therefore eventually be doable by open models, and those which can always be done better if you add a bit more intelligence.
oceanplexian 22 hours ago [-]
> Benchmarks are toys, real world use is vastly different, and that's where they seriously lag.
I'm not disagreeing per-se but if you think the benchmarks are flawed and "my real world usage" is more reflective of model capabilities, why not write some benchmarks of your own?
You stand to make a lot of money and gain a lot of clout in the industry if you've figured out a better way to measure model capability, maybe the frontier labs would hire you.
bachmeier 22 hours ago [-]
> Benchmarks are toys, real world use is vastly different...Why should anyone waste time on poorer results? I'd rather pay my $200/mo because my time matters.
This kind of rhetoric is not helpful. If you want to make a point, then make one, but this adds nothing to the conversation. Maybe open source models don't work for you. They work very well for me.
lelanthran 18 hours ago [-]
> Open Source isn't even within 50% of what the SOTA models are.
The gap has been shrinking with each release, and the SOTA has already run into diminishing returns for each extra unit of data+computation it uses.
Do you really want to bet that the gap will not eventually be a hair's breadth?
bandrami 22 hours ago [-]
> Why should anyone waste time on poorer results?
Because in almost no real-world project is "programming time" the limiting factor?
dymk 21 hours ago [-]
No, it's rate at which you can solve problems, and weaker models waste your time because they don't solve problems at the same speed.
hunterpayne 14 hours ago [-]
No, it's the number of debug cycles you need to solve said problems. That's the major attribute that controls dev time. And models require far more than I do. You are paying money to take longer and produce worse code. If it's different for you, that's a you problem.
bdangubic 14 hours ago [-]
amazing how often this is repeated on here as some sort of gospel SWEs pass down to one another to continue this charade. I have worked in this industry for 30+ years on countless projects, the last decade+ as a consultant - at every single project (every single one) programming time was the limiting factor. there is a whole industry inside our industry dealing with “processes” and “how to estimate” (apparently we are incapable of doing that) and whatnot, all because actual programming time is always the limiting factor and there isn't even a close 2nd
hypnoce_fr 7 hours ago [-]
What counts as programming time? Writing? Reviewing? Compiling? Debugging? It also depends on the industry. From idea to production, the limiting factor is not always writing the code, and in my experience (15 years in fintech) it almost never has been. Discussion, alignment, compilation, heavy testing pipelines, shipping, all of this on a 30-million-line monorepo.
On a greenfield 10k line repo, yes, AI really shines. In other cases, it’s currently just a helper on very specific narrow tasks, that is not always programming.
bandrami 13 hours ago [-]
That's just not my experience. Making the software in the first place is never even the cost center.
Someone1234 22 hours ago [-]
> Open Source isn't even within 50% of what the SOTA models are.
When was the last time you used any of them? Because, a lot of people are actively using them for 9-5 work today, I count myself in that group. That opinion feels outdated, like it was formed a year ago+ and held onto. Or based on highly quantized versions and or small non-Thinking models.
Do you really think Qwen3.6 for a specific example is "50%" as good as Opus4.7? Opus4.7 is clearly and objectively better, no debate on that, but the gap isn't anywhere near that wide. I'd call "20%" hyperbole, the true difference is difficult to exactly measure but sub-10% for their top-tier Thinking models is likely.
cwnyth 21 hours ago [-]
Their opinion is also behind on LibreOffice, too. I won't defend GIMP's monstrosity, but I finished a whole dissertation, do all my regular spreadsheet work (that isn't done via R), and have created plenty of visual mockups with LibreOffice. Plus, I don't have to deal with a spammy Windows environment.
Sure, we use Google Drive, too, but that's just for sharing documents across offices, not for everyday use. For that, the open source model is a clear winner in my book.
vlovich123 22 hours ago [-]
Qwen3.6 at which model size and quantization? I already think Opus 4.6 is usable but still dumb as bricks. A 20% cut off that feels like it would still be unusable. And that's not even getting to the annoyance of setting everything up to run locally & getting HW that can run it locally which basically looks like a Macbook M4 these days as the x86 side is ridiculously pricey to get decent performance out of models.
Someone1234 19 hours ago [-]
At their highest model size and quant. We are discussing price and quality at the top, not what you can run on the lower end.
So the starting point is Opus 4.7 pricing and we're contrasting alternatives near the top end (offered across multiple providers).
Also I said 20% was hyperbole, meaning far too high.
vlovich123 18 hours ago [-]
That makes no sense because the largest Qwen models are not even open weight so I’m not sure how that’s any different.
Someone1234 15 hours ago [-]
Right, which isn't what we're discussing, since I mentioned "across multiple providers" in every comment about this topic.
Those closed weight models aren't available like we're discussing. They're only available from the vendor that created them.
vlovich123 9 hours ago [-]
The largest qwen model is similar so I’m not sure what point you’re trying to make. The only ones available are the open weight ones which are the smaller variants and nowhere near within 20% of the closed frontier models.
Someone1234 2 hours ago [-]
The largest open models are within 20%; they're likely within 10%. Go actually try them and stop making outdated assumptions. You don't need to invest a lot of money either, just pick your favorite vendor, and send out a few prompts.
conrs 21 hours ago [-]
IMO It's a different and new model. We're engineers, and we're rich. It's not going to be good enough for us. But the much larger market by far is all the people who used to HAVE to work with engineers. They now have optionality; the pendulum is going to swing.
swader999 21 hours ago [-]
Also, this space will (and perhaps already is for some of us) be an arms race. Sure you can go local but hosted will always be able to offer more and if you want to be competitive, you'll need to be using the most capable.
nancyminusone 21 hours ago [-]
People pirate Photoshop and Office if they don't want to pay for them, making them as "free" as GIMP. If there is a free option, people will use it. Never underestimate the cheapskates.
kube-system 22 hours ago [-]
There's going to be a day when we look back at $200/mo price tags and say "wow that was cheap".
The breakeven at this price is 6 minutes of productivity per work day for an engineer making $200k.
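As a sanity check on the 6-minute figure, a rough sketch (assuming 250 work days a year and 8-hour days, which the comment doesn't spell out):

```python
# Back-of-envelope breakeven for a $200/mo subscription vs. a $200k salary.
# Assumptions (not in the comment): 250 work days/year, 8-hour days.
annual_cost = 200_000
minutes_per_year = 250 * 8 * 60                   # 120,000 working minutes
cost_per_minute = annual_cost / minutes_per_year  # ~$1.67/min

monthly_subscription = 200
work_days_per_month = 250 / 12                    # ~20.8 days
daily_sub_cost = monthly_subscription / work_days_per_month  # ~$9.60/day

breakeven_minutes_per_day = daily_sub_cost / cost_per_minute
print(round(breakeven_minutes_per_day, 1))        # ~5.8 minutes/day
```

Which rounds to the roughly 6 minutes of saved labor per work day claimed above.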
cheschire 21 hours ago [-]
Okay, but then by that logic a person making only $20k would break even at about an hour.
Are you suggesting that someone making $20k should be spending $200/mo on Claude?
kube-system 20 hours ago [-]
I'm talking about the cost of labor.
If you pay someone $20,000 for labor, and they save 65 minutes worth of labor per day using a $200/mo Claude subscription, you are better off buying the Claude subscription.
dragandj 4 hours ago [-]
Who's gonna pay $20,000 for labor that can be done by anyone with a $200/mo subscription?
kube-system 1 hours ago [-]
Nobody, but that doesn’t exist yet. Currently these solutions enhance the productivity of workers, but it can’t quite replace them.
kuboble 10 hours ago [-]
I think if you (a company) pay someone for labor, your workers can't use a personal subscription and you have to pay the considerably higher API prices.
hrimfaxi 42 minutes ago [-]
Most companies don't provide a corporate cell phone and have no problems with answering emails from a personal account. Can't have it both ways.
echelon 21 hours ago [-]
Everyone is arguing why I'm wrong or that I should have presented more data.
You've got the real insight with this claim.
This is the way the world is moving. Open source isn't even going where the ball is being tossed. There is no leadership here.
You're spot on.
If the cost to deliver a unit of business automation is:
A. $1M with human labor
B. $700k human labor + open source models
C. $500k human labor + $10,000 in claude code max (duration of project)
D. $250k with humans + $200k claude code "mythos ultra"
The one that will get picked is option "D".
Your poor college students and hobbyists will be on option "B". But this won't be as productive as evidenced by the human labor input costs.
Option "C" will begin to disappear as models/compute get more expensive and capable.
Option "A" will be nonviable. Humans just won't be able to keep up.
Open source strictly depends on models decreasing their capability gap. But I'm not seeing it.
Targeting home hardware is the biggest smell. It's showing that this is non-serious, hobby tinkery and has no real role in business.
For open source to work and not to turn into a toy, the models need to target data center deployment.
hunterpayne 14 hours ago [-]
You are assuming (imagining) a cost relationship which doesn't exist and when researched was the opposite of what you claim.
brazukadev 2 hours ago [-]
This is you playing with imaginary numbers, like Sam Altman is doing for a long time. It won't end well.
kube-system 20 hours ago [-]
Yeah, I don't wanna shit on open source, there will certainly be uses for all different kinds of models.
The real money in this market, though, is going to be made in the C suite, and they don't really care about the model. They don't care if it's open source, closed source, or what it is. They don't want to buy a model. They're interested in buying a solution to their problems. They're not going to be afraid of a software price tag -- any number they spend on labor is far more.
Labor is something like 50%+ of the Fortune 500's operating expenses -- capturing any chunk of this is a ridiculous sum of money.
kardos 20 hours ago [-]
If sharing all of your code with the closed providers is OK then it works. If that is a blocker, open weights becomes much more compelling...
joquarky 20 hours ago [-]
What will you do when they stop burning cash and the $200 plan becomes $2000?
brazukadev 22 hours ago [-]
> Open Source isn't even within 50% of what the SOTA models are
Who said so? GLM 5.1 is 90% Opus, at least. Some people quite happy with Kimi 2.6 too. I did not try Deepseek 4 yet but also hearing it is as good as Opus. You might be confusing open source models with local models. It is not easy to run a 1.6T model locally, but they are not 50% of SOTA models.
jawilson2 18 hours ago [-]
I think the problem is that we're all waiting for the patented Silicon Valley Rug Pull and ensuing enshittification, where there are a dozen tiers of products, you need 4 of them, and they now cost $2000/month. I want to hedge against that.
fourside 23 hours ago [-]
Maybe for folks who are deep into this, but it’s not exactly accessible. I tried reading up on it a couple of months ago, but parsing through what hardware I needed, the model and how to configure it (model size vs quantization), how I’d get access to the hardware (which for decent results in coding, new hardware runs $4k-$10k last I checked)—it had a non trivial barrier of entry. I was trying to do this over a long weekend and ran out of time. I’ll have to look into it again because having the local option would be great.
Edit: the replies to my comment are great examples of what I’m talking about when I say it’s hard to determine what hardware I’d need :).
jonaustin 21 hours ago [-]
Just get a decent macbook, use LM Studio or OMLX and the latest qwen model you can fit in unified ram.
Hooking up Claude Code to it is trivial with omlx.
For me the big hangup is the hardware. If I could find a simple guide to putting together a machine that I can run off an outlet in my home, I am sold. The problem is that I haven't found this yet (though I suppose I haven't looked very hard either).
root_axis 22 hours ago [-]
> new hardware runs $4k-$10k last I checked
Starting closer to 40k if you want something that's practical. 10k can't run anything worthwhile for SDLC at useful speeds.
zozbot234 22 hours ago [-]
$10K should be enough to pay for a 512GB RAM machine which in combination with partial SSD offload for the remaining memory requirements should be able to run SOTA models like DS4-Pro or Kimi 2.6 at workable speed. It depends whether MoE weights have enough locality over time that the SSD offload part is ultimately a minor factor.
(If you are willing to let the machine work mostly overnight/unattended, with only incidental and sporadic human intervention, you could even decrease that memory requirement a bit.)
SwellJoe 21 hours ago [-]
You can't put "SSD offload" and "workable speed" in the same sentence.
zozbot234 21 hours ago [-]
As a typical example DeepSeek v4-pro has 59B active params at mostly FP4 size, so it needs to "find" around 30GB worth of params in RAM per inferred token. On a 512GB total RAM machine, most of those params will actually be cached in RAM (model size on disk is around 862GB), so assuming for the sake of argument that MoE expert selection is completely random and unpredictable, around 15GB in total have to be fetched from storage per token. If MoE selection is not completely random and there's enough locality, that figure actually improves quite a bit and inference becomes quite workable.
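Those numbers pencil out roughly like this (a sketch of the comment's back-of-envelope reasoning; the 862 GB disk size, FP4 weights, and uniformly random expert selection are assumptions taken from the comment, and real MoE locality would shrink the cached-miss fraction):

```python
# Rough per-token SSD traffic for a large MoE model with partial RAM caching.
active_params = 59e9        # active params per token (from the comment)
bytes_per_param = 0.5       # FP4 = 4 bits = half a byte
model_on_disk_gb = 862      # full model size on disk (from the comment)
ram_gb = 512                # RAM available to cache weights

touched_gb = active_params * bytes_per_param / 1e9   # ~29.5 GB touched per token
cached_fraction = ram_gb / model_on_disk_gb          # ~0.59 if selection is uniform
ssd_fetch_gb = touched_gb * (1 - cached_fraction)    # ~12 GB read from SSD per token
print(round(touched_gb, 1), round(ssd_fetch_gb, 1))
```

That lands near the comment's ~15 GB estimate; either way, per-token SSD reads in the tens of GB are why "workable speed" hinges entirely on how much locality MoE expert selection actually has.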
SwellJoe 14 hours ago [-]
I've never seen reports of this kind of setup being able to deliver more than low single-digit tokens per second. That's certainly not usable interactively, and only of limited utility for "leave it to think overnight" tasks. Am I missing something?
Also, I don't know of a general solution to streaming models from disk. Is there an inference engine that has this built-in in a way that is generally applicable for any model? I know (I mean, I've seen people say it, I haven't tried it) you can use swap memory with CPU offloading in llama.cpp, and I can imagine that would probably work...but definitely slowly. I don't know if it automatically handles putting the most important routing layers on the GPU before offloading other stuff to system RAM/swap, though. I know system RAM would, over time, come to hold the hottest selection of layers most of the time as that's how swap works. Some people seem to be manually splitting up the layers and distributing them across GPU and system RAM.
Have you actually done this? On what hardware? With what inference engine?
nozzlegear 22 hours ago [-]
I've been using local AI via LM Studio ever since I canceled my Claude subscription. It's obviously slower than Claude on my M1 Studio[†], but like someone else said, I use AI more like a copilot than an autopilot. I'm pretty enthused that I can give it a small task and let it churn through it for a few minutes, while I work on something alongside – all for free with no goddamned arbitrary limits.
[†] The latest Qwen 3.6 whatever has been a noticeable improvement, and I'm not even at the point where I tweak settings like sampling, temperature, etc. No idea what that stuff does, I just use the staff picks in LM Studio and customize the system prompts.
politelemon 23 hours ago [-]
Feasibility on commodity hardware would be the true watermark. Running high end computers is the only way to get decent results at the moment, but if we can run inference on CPUs, NPUs, and GPUs on everyday hardware, the moat should disappear.
zozbot234 22 hours ago [-]
You can already run inference on ordinary hardware but if you want workable throughput you're limited to small models, and these have very poor world-knowledge.
aleqs 23 hours ago [-]
Indeed, I feel like we are in the early computer equivalent phase of AI, where giant expensive hardware is still required for frontier models. In 5 years I bet there will be fully open models we'll be able to run on a few $1000 of consumer hardware with equivalent performance to opus 4.7/4.6.
whattheheckheck 22 hours ago [-]
You'll never have the power of what they have though. Cloud capital is insane.
So you can run 1 agent locally on $1k to $3k hardware
They can run a fleet of thousands
nozzlegear 21 hours ago [-]
But does one individual need a fleet of thousands of agents?
aleqs 22 hours ago [-]
I think intelligence per compute will go up significantly in the coming years, while the cost per compute will drop significantly. No way to know for sure, so I guess we'll see
andyfilms1 22 hours ago [-]
Sure, but local AI is still a black box. They can be influenced by training data selection, poisoning, hidden system prompts, etc. That recent Wordpress supply chain hack goes to show that the rug can still be pulled even if the software is FOSS.
ModernMech 23 hours ago [-]
I love how it's just a tacit understanding that these companies' entire MO is to carve out a territory, get everyone hooked on the good stuff and then jack up the price when they're addicted and captured -- literally the business plan of crack dealers, and it's just business as usual in the tech industry.
strbean 22 hours ago [-]
I was recently introduced to the term "vcware", ala shareware or vaporware, to describe these products. "Don't use that, it's vcware, enshitification is coming soon."
Not really. The hardware requirements remain indefinitely out of reach.
Yes, it's possible to run tiny quantized models, but you're working with extremely small context windows and tons of hallucinations. It's fun to play with them, but they're not at all practical.
ac29 21 hours ago [-]
The memory requirements aren't that intense. You can run useful (not frontier) models on a $2-5K machine at reasonable speeds. The capabilities of Qwen3.6 27B or 35B-A3B are dramatically better than what was available even a few months ago.
Practical? Maybe not (unless you highly value privacy) because you can get better models and better performance with cheap API access or even cheaper subscriptions. As you said, this may indefinitely be the case.
root_axis 19 hours ago [-]
> The capabilities of Qwen3.6 27B or 35B-A3B are dramatically better than what was available even a few months ago.
Yes, a lot better, but still terribly unreliable and far less capable than the big unquantized models.
SwellJoe 22 hours ago [-]
At least some of the investors in this tech are hoping for a monopoly position. They'd like to outspend the competition to get an insurmountable lead, at which point they can set their price.
But, so far, competition remains fierce. Anthropic still has the best tools for writing code. That lead is smaller than it's ever been, though. But, honestly, Opus 4.5 is when it got Good Enough. If Anthropic suddenly increased prices beyond what I'm willing to pay, any model that gives me Opus 4.5 or better performance is good enough for the vast majority of the work I do with agents. And, there are a bunch of models at that level, now maybe including some discount Chinese models. Certainly Gemini Pro 3.1 is on par with Opus 4.5. Current Codex is better than Opus 4.5 and close to Opus 4.7 (though I won't use OpenAI because I don't trust them to be the dominant player in AI).
I often switch agents/models on the same project because I like tinkering with self-hosted and I like to keep an eye on the most efficient way to work...which models wastes less of my time on silly stuff. Switching is literally nothing; I run `gemini` or `copilot` or `hermes` instead of `claude`. There's simply no deep dependency on a specific model or agent. They're all trying to find ways to make unique features for people to build a dependence on, of course, but the top models are all so fucking smart you can just tell them to do whatever thing it is that you need done. That feature could probably be a skill, whatever it is, and the model can probably write the skill. Or, even better, it could be actual software, also written by the model, rather than a set of instructions for the model to interpret based on the current random seed.
Currently, the only consistent moat is making the best model. Anthropic makes the best model and tools for coding, but that's a pretty shallow moat...I could live with several other models for coding. I'll gladly pay a premium for the best model and tools for coding, but I also won't be devastated if I suddenly don't have Claude Code tomorrow. Even open models I can host myself are getting very close to Good Enough.
gip 23 hours ago [-]
True. That is why it is critically important to have open-source and sovereign models that will be accessible to all and always on / local.
Competition (OpenAI vs Anthropic is fun to watch) and open source will get us there soon I think.
tetha 22 hours ago [-]
The owner rug-pulls, or Broadcom buys the owner and starts squeezing.
pmarsh 13 hours ago [-]
For the sake of argument if you build on AWS is that any more of a solid foundation? You're beholden to Amazon, unless you have the bandwidth to be able to DR immediately to another provider.
blueone 21 hours ago [-]
Anthropic sells due to unrelenting pressure and unachievable demand > new owner cuts costs > models become worse > new owner sells > the capitalistic cycle wins > we, the people, suffer
_the_inflator 17 hours ago [-]
“In the future there might be the possibility that catastrophic event A could happen.”
Not the best argument.
Also there is nothing without dependencies. Loose coupling means coupling.
sdevonoes 23 hours ago [-]
The sooner you cancel the sooner you become independent of them
derektank 21 hours ago [-]
You could say the same thing about your mobile phone bill. Most people still consider the benefits of roaming access to the internet greater than the downsides of being dependent on it.
zdragnar 15 hours ago [-]
There's very few, if any, alternatives to roaming internet access.
AI tools... do what you already do, sometimes faster, sometimes worse, usually both depending on the task.
There's a massive gap of necessity between them.
agumonkey 22 hours ago [-]
Some people are so dependent on it they can't even say it without twisting words to hide the fact that they're now stuck at zero
fortyseven 21 hours ago [-]
This is why, despite enjoying all of this, I really want to focus on locally hosted models. If we don't host the technology ourselves, we're setting ourselves up for a hard fall down the line.
Until very recently, local models have been little more than brittle toys in my experience, if you're trying to use them for coding.
But lately I've been running Pi (minimal coding agent harness) with Gemma4 and Qwen3.6 and I've been blown away by how capable and fast they are compared to other models of their size. (I'm using the biggest that can fit into 24gb, not the smaller ones.) In fact, I don't really need to reach for Claude and friends much of the time (for my use cases at least).
2ndorderthought 18 hours ago [-]
Imagine if Anthropic and OpenAI went bankrupt in the next 2 years. If you look at their financials it's a real possibility.
wongarsu 23 hours ago [-]
[dead]
wood_spirit 22 hours ago [-]
So many of my coworkers and I have been struggling with a big cognitive decline in Claude over the last two months. 4.5 was useful and 4.6 was great. I had my own little benchmark: 4.5 could just about keep track of a two-way pointer merge loop, whereas 4.6 managed a 3-way and the 1M context managed k-way. And this ability to track braids directly helped it understand real production code and make changes and be useful etc.
but then two months ago 4.6 started getting forgetful and making very dumb decisions and so on. Everyone started comparing notes and realising it wasn’t “just them”. And 4.7 isn’t much better and the last few weeks we keep having to battle the auto level of effort downgrade and so on. So much friction as you think “that was dumb” and have to go check the settings again and see there has been some silent downgrade.
We all miss the early days of 4.6, which just shows you can have a good, useful model. LLMs can be really powerful, but in delivering them to the mass market Anthropic throttles and downgrades them into something that isn't.
My thinking is that soon deepseek reaches the more-than-good-enough 4.6+ level and everyone can get off the Claude pay-more-for-less trajectory. We don’t need much more than we’ve already had a glimpse of and now know is possible. We just need it in our control and provisioned not metered so we can depend upon it.
hungryhobbit 22 hours ago [-]
This was a real issue, and Anthropic recently acknowledged it:
Of course, it sucks when companies screw up ... but at the same time, they "paid everyone back" by removing limits for awhile, and (more importantly to me) they were transparent about the whole thing.
I have a hard time seeing any other major AI provider being this transparent, so while I'm annoyed at Claude ... I respect how they handled it.
swdunlop 20 hours ago [-]
Amusingly, when a coworker was looking for this postmortem, they found a different postmortem of three Claude issues that caused decay. This one was in the platform, not in Claude Code:
I think there's a certain amount of running with scissors going on here. I appreciate the transparency, but the time to remediation here seems pretty long compared to the rate of new features.
wood_spirit 21 hours ago [-]
Yes that was one issue. It’s not the general degradation I have been talking about though, which is ongoing.
I recall reading similar tales of woe with other providers here on HN. I think the gradual dialling back of capability as capacity becomes strained as users pile on is part of the MO of all the big AI companies.
did you set your 4.7 to xhigh or max effort? anything else is basically not worth your time...
Flavius 18 hours ago [-]
Why would I set 4.7 to xhigh or max when the original 4.6 was doing just fine with medium and high?
shockleyje 3 hours ago [-]
Inflation
wilbur_whateley 23 hours ago [-]
Claude with Sonnet medium effort just used 100% of my session limit, some extra dollars, thought for 53 minutes, and said:
API Error: Claude's response exceeded the 32000 output token maximum. To configure this behavior, set the CLAUDE_CODE_MAX_OUTPUT_TOKENS environment variable.
amarcheschi 23 hours ago [-]
And on the seventh day, API Error: Claude's response exceeded the 32000 output token maximum
Oras 21 hours ago [-]
More on the 7th minute if you’re using opus
couchdb_ouchdb 22 hours ago [-]
I don't think i'd let it think more than 5 minutes without killing the process.
deckar01 21 hours ago [-]
They changed it to do all of the changes in a virtual cloud environment, then dump the final result at the end of the response. Before, it would stream changes, so if it made a minimal fix and then decided to go off on a tangent, you could stop it quickly. Now you have to wait 5+ minutes to get a single line of code out of it, just to find out it also refactored everything and burned a stack of tokens. No amount of prompting seems to force it to make incremental changes locally.
thepasch 19 hours ago [-]
> They changed it do all of the changes in a virtual cloud environment, then dump the final result at the end of the response.
That’s a hallucination. All they did was hide thinking by default. Quick Google search should easily teach you how to turn it back on (I literally have it enabled in my harness).
VertanaNinjai 17 hours ago [-]
Is anything that might be wrong or misinformation now a “hallucination”?
reddozen 15 hours ago [-]
Can you blame them for believing thinking tokens are completely hidden now? Anthropic has changed the way to see it 3 times in 3 months with no warnings or visible upgrade path. First it was shown by default, then you had to press control+o, then control+t, then it got locked behind a settings.json, then you had to manually enable with --verbose, now it's some random ENV var.
Whoever is their product manager should be embarrassed at the UX they provide.
jdiff 13 hours ago [-]
Product managers reduce velocity. The behavior changes every time another instance of Claude Code thinks something else would be a marginal improvement, with no further oversight or thought put into it.
2ndorderthought 17 hours ago [-]
I hope this doesn't come out wrong but. When this happens do agentic/vibe coders message their boss and say "sorry can't work until tomorrow?"
zulban 17 hours ago [-]
People hired to do jobs they cannot do have many, many more methods than that. For thousands of years.
shepherdjerred 17 hours ago [-]
I write down the time I run out of tokens each day and pray my employer will pay for more
jansenmac 22 hours ago [-]
Just copy and paste the error back to Claude and you will be able to continue. I have seen this many times over the past few months. I thought it was related to the AWS Bedrock setup I have been using - but probably not.
jasonlotito 23 hours ago [-]
Just curious, what version of Max are you on: 5x or 20x?
giancarlostoro 23 hours ago [-]
You're using it within their high-usage window. I hope you're aware of this: if you use it outside the high-usage window it's supposed to consume less, but it does seem a little odd that Sonnet uses so much, even on Medium.
drunken_thor 23 hours ago [-]
Ah so we are only supposed to use this work tool outside of work hours?
giancarlostoro 20 hours ago [-]
If you're on a personal tier, they prioritize those on the business tier yes.
ModernMech 23 hours ago [-]
No, you're supposed to make all your hours work hours. This is the way of AI.
isjcjwjdkwjxk 23 hours ago [-]
“Work tool”
Please. This is a toy. A novel little tech-toy. If you depend on it now for doing your job then, frankly, you deserve to have your rug pulled now and then.
subscribed 21 hours ago [-]
If you haven't found a way to use the tool constructively, keep trying.
If you didn't try to use it to work for you, that's okay, but maybe try once more? It does work and adds value. It's a non-standard and weirdly flexible tool with limitations.
...but in retrospect, seeing how you finished your comment, maybe you really want to remain angry and misinformed.
anonyfox 22 hours ago [-]
My max20 sub has been sitting mostly unused since April; codex with 5.4 (and now 5.5), even with fast mode (= double token costs), is night and day. Opus produces convincing failures: it either forgets half the important details or silently takes "pragmatic" shortcuts (read: technical debt bandaids or worse) and claims success even with everything crashing and burning after the changes, and if you point out the errors it makes even more messes. Opus works really well for one-shotting greenfield scopes, but for iterating on it later or doing complex integrations it's just unusable, even harmfully bad.
GPT 5.4+ takes its time, unprompted considers edge cases that turn out to be correct, saves me subsequent error-hunting turns, and finally delivers. Plus no "this doesn't look like malware" or "actually wait" thinking loops for minutes over a one-liner script change.
fluidcruft 22 hours ago [-]
My mental model for LLM is I don't expect them to chew gum and walk at the same time. Cleaning code up is a different task from building new functionality.
GLM always feels like it's doing things smarter, until you actually review the code. So you still need the build/prune cycle. That's my experience anyway.
jorjon 21 hours ago [-]
Can I get that max20 if you are not using it?
cmrdporcupine 20 hours ago [-]
Most "productive" flow I found was when I had both memberships and let Claude do the "I go yeet your feature" side and Codex do the "WTF bro, that's full of race conditions!" review phase.
But now I just use Codex. Claude is unreliable and leaves data races all over and leaves, as you say, negative conditions unhandled fairly consistently.
zkmon 23 hours ago [-]
Yesterday was a realization point for me. I gave a simple extraction task to Claude Code with a local LLM and it "whirred" and "purred" for 10 minutes. Then I submitted the same data and prompt directly to the model via the llama_cpp chat UI and the model single-shotted it in under a minute. So obviously something is wrong with the coding agent or the way it is talking to the LLM.
Now I'm looking for an extremely simple open-source coding agent. Nanocoder doesn't seem to install on my Mac and it brings node-modules bloat, so no. Opencode seems not quite open-source. For now, I'm doing the work of the coding agent myself and using the llama_cpp web UI. It's chugging along fine.
syhol 23 hours ago [-]
https://pi.dev/ seems popular. What's not open source about opencode? The repo has an MIT License.
xlii 16 hours ago [-]
+1 for pi. I used claude and opencode but pi is the first agent tool that made me excited about the whole thing.
tfrancisl 22 hours ago [-]
Some people believe only copyleft licenses are open source. They're right on principle, wrong in (legal) practice.
Probably a silly idea, but I'll throw it into the mix - have your current AI build one for you. You can have exactly the coding agent you want, especially if you're looking for "extremely simple".
I got annoyed enough with Anthropic's weird behavior this week to actually try this, and got something workable up & running in a few days. My case was unique: there's no Claude Code for BeOS, or my older / ancient Macs, so it was easier to bootstrap & stitch something together if I really wanted a coding agent on those platforms. You'll learn a lot about how models actually work in the process too, and how much crazy ridiculous bandaid patching is happening in Claude Code. Though you might also appreciate some of the difficulties that the agent / harnesses have to solve too. (And to be clear, I'm still using CC when I'm on a platform that supports it.)
As for the llama_cpp vs Claude Code delays - I've run into that too. My theory is that the API is prioritized over Claude Code subscription traffic; the API certainly feels way faster. But you're also paying significantly more.
appcustodian2 23 hours ago [-]
Just in case it didn't occur to you already, you can just build whatever coding agent you want. They're pretty simple
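For anyone skeptical that "pretty simple" is literal: the core of a coding agent is just a loop that asks the model what to do, runs the requested tool, and feeds the result back. Here's a minimal sketch in Python; the tool names, the reply shape, and `ask_model` itself are made up for illustration (in practice `ask_model` would POST to whatever chat-completions endpoint you run locally).

```python
import subprocess

# Two tools are usually enough for a first pass: read a file, run a command.
def run_tool(name, args):
    if name == "read_file":
        with open(args["path"]) as f:
            return f.read()
    if name == "run_command":
        result = subprocess.run(
            args["command"], shell=True, capture_output=True, text=True, timeout=60
        )
        return result.stdout + result.stderr
    return f"unknown tool: {name}"

def agent_loop(ask_model, user_goal, max_steps=10):
    """ask_model: callable mapping a message list to the model's reply dict.
    The reply is either {"tool": ..., "args": ...} or {"answer": ...}."""
    messages = [{"role": "user", "content": user_goal}]
    for _ in range(max_steps):
        reply = ask_model(messages)
        if "answer" in reply:
            return reply["answer"]
        # Run the requested tool and feed the observation back to the model.
        observation = run_tool(reply["tool"], reply["args"])
        messages.append({"role": "tool", "content": observation})
    return "step limit reached"
```

Everything else in a production harness (context compaction, permission prompts, diffs, retries) is layered on top of a loop shaped like this.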
btbuildem 19 hours ago [-]
You'd figure by now we would have something between a TUI and an IDE.
btbuildem 22 hours ago [-]
You can run CC with local models, it's pretty straightforward. I've done this with vLLM + a thin shim to change the endpoint syntax.
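The shim is mostly payload translation. Here's a rough sketch of the request-body half, my guess at the minimal mapping from an Anthropic Messages-style body to the OpenAI-style one vLLM serves; streaming, tool calls, and translating the response back are where the real work is.

```python
def anthropic_to_openai(payload):
    """Translate an Anthropic Messages-style request body into an
    OpenAI chat-completions-style one. Sketch only: streaming and
    tool-call translation are deliberately left out."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI-style APIs expect it as the first message.
    if "system" in payload:
        messages.append({"role": "system", "content": payload["system"]})
    for m in payload["messages"]:
        content = m["content"]
        # Anthropic content may be a list of typed blocks; flatten text blocks.
        if isinstance(content, list):
            content = "".join(
                b.get("text", "") for b in content if b.get("type") == "text"
            )
        messages.append({"role": m["role"], "content": content})
    return {
        "model": payload["model"],
        "max_tokens": payload.get("max_tokens", 1024),
        "messages": messages,
    }
```

Wrap that in any small HTTP server that forwards the translated body to the vLLM endpoint and you have the "thin shim."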
jedisct1 23 hours ago [-]
Swival is not bloated and was specifically made for local agents: https://swival.dev
pferdone 23 hours ago [-]
pi.dev as well
banditelol 22 hours ago [-]
what model you used with llama_cpp?
zkmon 22 hours ago [-]
Qwen3.6-35B quant-4 gguf
enraged_camel 23 hours ago [-]
I use both Cursor and Claude Code, and yes, the latter is noticeably slower with the same model at the same settings.
However, it's hard to justify Cursor's cost. My bill was $1,500/mo at one point, which is what encouraged me to give CC a try.
drunken_thor 23 hours ago [-]
AI services are only weakly incentivized to reduce token usage. They want high token usage; it makes you pay more. They are going to keep testing where the limit is: what is the maximum token usage before you get angry? All AI companies will continue to trade places on token use and cost as costs increase. We are frogs in tepid water, pretending it's a bath and that we aren't about to be boiled.
jedberg 22 hours ago [-]
People said this about AWS too. "Why would they save you money??". It turns out that every time they reduce prices, they make more money, because more people use their services.
AI companies have the same incentive. Make it cheaper and people will use it more, making you more money (assuming your price is still above cost). And of course they have every reason to reduce their on costs.
esperent 12 hours ago [-]
AWS is notorious for being extremely expensive, so it's not like they became cheap. They just reduce prices from extremely expensive to slightly less extremely expensive, and that makes more people decide to start using it.
Since the price they are charging is still way, way above their operating costs there's no surprise really that they end up making more from small price reductions.
If competition drove them to reduce costs to the point where their operating costs started to be a large factor, the paradox would disappear.
zormino 21 hours ago [-]
jevons paradox
minimaxir 22 hours ago [-]
To an extent. That economic incentive stops making sense when a) capacity is an actual constraint and b) Anthropic is not a monopoly and is subject to pressure from competitors who are more user-friendly.
GodelNumbering 21 hours ago [-]
I am betting on the fact that people will get increasingly frustrated at closed agent lock-ins. I built (cline fork) and open-sourced https://github.com/dirac-run/dirac with the sole focus on token efficiency expecting that the closed-lock-in vendors will do enough to frustrate their users over time. Looking for contributors
y42 22 hours ago [-]
That's what I am thinking, too. It sounds like a conspiracy theory, but in the end Anthropic et al. benefit from models that don't finish their jobs. I recently read about this "over-editing phenomenon": the machine is never done. It doesn't want to be.
It's like dating apps. They don't want you to find a good match, because then you cancel the subscription.
biglyburrito 22 hours ago [-]
Which works fine, right up until China releases a new DeepSeek model that's 85% as capable as an Anthropic or OpenAI premium model but costs a fraction of what either of those US companies are charging.
Up to a point. There is incentive when they get to the point where they literally can't serve their userbase and customers start leaving.
estimator7292 19 hours ago [-]
I severely doubt it. Token spend translates to real cost for the provider. Each token involves real and expensive compute. They aren't free monopoly money you get billed arbitrarily for. You're paying for electricity and infrastructure involved in generating each token.
Less spend means less real cost to the provider while your flat monthly subscription stays the same price. As well, reducing token use per customer means you can over-subscribe even harder, allowing for more flat monthly subscriptions.
Less tokens = more free capacity = more subscription income.
zzzeek 22 hours ago [-]
Well, that's why threads like this are important to upvote. On Hacker News, they're angry!
bryan0 16 hours ago [-]
I see a lot of people struggling to work with agents. This post has a good example:
> “you can’t be serious — is this how you fix things? just WORKAROUNDS????”
If this is how you're interacting with your agents, I think you're in for a world of disappointment. An important part of working with agents is providing specific feedback, and beyond that, making sure this feedback is actually available in their context when relevant.
I will ask them why they made a decision and review alternatives with them. These learnings will aid both you and the agent in the future.
aulin 15 hours ago [-]
After you see it skip reasoning so many times, announcing "actually the simplest fix is" and then doing the laziest thing possible, you get kind of tired of babysitting it.
areoform 22 hours ago [-]
I've noticed that the same Claude model will make logical errors at some times but not others. Claude's performance is highly time-dependent. There's even a graph! https://marginlab.ai/trackers/claude-code/
I haven't seen anyone mention this publicly, but I've noticed that the same model will give wildly different results depending on the quantization. 4-bit is not the same as 8-bit and so on in compute requirements and output quality. https://newsletter.maartengrootendorst.com/p/a-visual-guide-...
I'm aware that frontier models don't work in the same way, but I've often wondered if there's a fidelity dial somewhere that's being used to change the amount of memory / resources each model takes during peak hours v. off hours. Does anyone know if that's the case?
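Whether providers actually turn such a dial is unknown to me, but the precision gap itself is easy to demonstrate with a toy round-trip quantizer. This is not how production serving quantizes (that involves per-channel/per-group scales and calibration); it's just an illustration of why 4-bit loses measurably more information than 8-bit.

```python
import random

def quantize_dequantize(weights, bits):
    # Symmetric per-tensor quantization: scale floats onto the signed
    # integer grid for `bits` bits, round, then map back to floats.
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

random.seed(0)
w = [random.gauss(0, 1) for _ in range(10_000)]
for bits in (8, 4):
    recon = quantize_dequantize(w, bits)
    err = sum(abs(a - b) for a, b in zip(w, recon)) / len(w)
    print(f"{bits}-bit mean abs reconstruction error: {err:.5f}")
```

The 4-bit grid has only 15 levels to cover the same range that 8-bit covers with 255, so its average rounding error comes out roughly an order of magnitude larger, which is the kind of quality gap the linked guide describes.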
8organicbits 22 hours ago [-]
I'm not sure that graph shows a time-based correlation. The 60% line stays inside the 95% confidence interval. Is that not just a measurement of noise?
lukaslalinsky 20 hours ago [-]
I feel like Opus 4.5 was the peak in Claude Code usefulness. It was smart, it was interactive, it was precise. In 4.6 and 4.7, it spends a long time thinking and I don't know what's happening, often hits a dead-end and just continues. For a while I was setting Opus 4.5 in Claude Code, but it got reset often. I just canceled my Max plan, don't know where to look for alternatives.
petterroea 23 hours ago [-]
Looking at Anthropic's new products I think they understand they don't really have a cutting edge other than the brand.
I tried Kimi 2.6 and it's almost comparable to Opus. Anthropic lost the ball. I hope this is a sign the we are moving towards a future where model usage is a commodity with heavy competition on price/performance
mmonaghan 21 hours ago [-]
Kimi nowhere close to opus on extended use but definitely highly competitive with sonnet. I will probably end up using kimi for personal stuff when I find some time to get it running or get a non-anthropic/openai harness set up on my personal machine.
jetbalsa 21 hours ago [-]
I've been mostly using Kimi as a hacker of sorts, putting it in places where I want to attach AI directly, since the API terms on their plans are not completely user-hostile. Need to do OCR for scanning Magic: The Gathering cards? Sure! Have it attached to X4: Foundations as an AI manager for some stuff? Sounds fun. Can't really do that with Claude.
alex-onecard 21 hours ago [-]
How are you using kimi 2.6? I am considering their coding plan to replace my claude max 5x but I am worried about privacy and security.
ac29 21 hours ago [-]
I'm using it via OpenCode Go, which claims to only use Zero Data Retention providers.
How much you trust any particular provider's claim to not retain data is subjective though.
petterroea 20 hours ago [-]
I'm only using it for a project I'm already expecting to open source later. Don't think I'm comfortable using it for more private work
ChicagoDave 20 hours ago [-]
I think there’s a clear split amongst GenAI developers.
One group is consistently trying to play whack-a-mole with different models/tools and prompt engineering and has shown a sine-wave of success.
The other group, seemingly made up of architects and Domain-Driven Design adherents has had a straight-line of high productivity and generating clean code, regardless of model and tooling.
I have consistently advised all GenAI developers to align with that second group, but it’s clear many developers insist on the whack-a-mole mentality.
I have even wrapped my advice in https://devarch.ai/ which has codified how I extract a high level of quality code and an ability to manage a complex application.
Anthropic has done some goofy things recently, but they cleaned it up because we all reported issues immediately. I think it’s in their best interests to keep developers happy.
My two cents.
joquarky 19 hours ago [-]
I kind of wonder if people with ADHD tend to fall into the latter group, as we are used to setting guardrails to keep us aligned to a goal.
ChicagoDave 13 hours ago [-]
ADHD may be a part of it but mentoring and some experience with TDD or strong unit testing is also a part.
camel_Snake 20 hours ago [-]
FYI that prominent link to your sharpee repo on GitHub 404s
ChicagoDave 20 hours ago [-]
OMG (fixed) - I updated devarch to check all html links for accuracy. This came from a recent update to the website to simplify it and of course Claude completely hallucinated that URL.
You can NEVER stop being vigilant. This is why I still have no faith in things like OpenClaw. Letting an AI just run off unsupervised makes me sweat.
rglover 20 hours ago [-]
Dead on. Any company not thinking about this like the 2nd group is setting themselves up for a bad time (and sadly, anecdotally, that seems to be an emerging majority).
ChicagoDave 13 hours ago [-]
Sadly, I agree that the majority are undisciplined.
estimator7292 18 hours ago [-]
IME it seems that output quality is directly proportional to the amount of engineering effort you put in. If a bug happens and you just tell the model to fix it over and over with no critical thinking, you end up with an 800 line shell script meant to change the IP address on an interface (real example). If you stop and engage your brain to reason about bugs and explain the problem, the model can fix it in an acceptable manner.
If you want to get good results, you still have to be an engineer about it. The model multiplies the effort you put in. If your effort and input is near zero, you get near zero quality out. If you do the real work and relegate the model to coloring inside the lines, you get excellent results.
ChicagoDave 13 hours ago [-]
Even my guardrails can’t replace experience. You have to pay attention. This is exactly how some devs land in whack-a-mole loops.
cbg0 23 hours ago [-]
I've been a fan since the launch of the first Sonnet model and big props for standing up to the government, but you can sure lose that good faith fast when you piss off your paying customers with bad communication, shaky model quality and lowered usage limits.
jp0001 2 hours ago [-]
Max x20 user here. As long as Opus 4.6 is available and they fix Opus 4.7, I'll stay with Anthropic. Tho, I'd imagine in 5 years we'll have Opus 4.6 equivalent performance available in an at home consumer model.
stldev 23 hours ago [-]
Same, after being a long-time proponent too.
First was the CC adaptive thinking change, then 4.7. Even with `/effort max` and keeping under 20% of 1M context, the quality degradation is obvious.
I don't understand their strategy here.
siliconc0w 23 hours ago [-]
Shameless self plug but also worried about the silent quality regressions, I started building a tool to track coding agent performance over time.. https://github.com/s1liconcow/repogauge
Here is a sample report that tries out the cheaper models + the newest Kimi2.6 model against the 5.4 'gold' testcases from the repo: https://repogauge.org/sample_report.
conception 23 hours ago [-]
This is cool - just wanted to note https://marginlab.ai is one that has been around for a while.
aleksiy123 22 hours ago [-]
are there any tools anyone knows to collect this kind of telemetry while using the tools instead of offline evals.
running evals seems like it may be a bit too expensive as a solo dev.
taffydavid 8 hours ago [-]
I know this thread is likely full of similar anecdotes, but I also want to share.
My experience very suddenly and very clearly degraded over the last few days.
Today I was trying to build a simple chess game. Previous one-shots were HTML; this gave me a jsx file. I asked it to port it to HTML and it absolutely devoured my credits doing so; I had to abort and do it manually. The resulting app didn't work, and it had decided that multiplayer could work by storing the game state only in local storage, without the clients communicating at all.
binaryturtle 22 hours ago [-]
I have a simple rule: I won't pay for that stuff. First they steal all my work to feed into those models, afterwards I shall pay for it? No way!
I use AI, but only what is free-of-charge, and if that doesn't cut it, I just do it like in the good old times, by using my own brain.
joquarky 18 hours ago [-]
[flagged]
mrinterweb 22 hours ago [-]
My recent frustration with Claude has been that it feels like I'm waiting on responses more. I don't have historical latency to compare this with, but I feel like it has been getting slower. I may be wrong, and maybe it's just spending more time thinking than it used to. My guess is Anthropic is having capacity issues. I hope I'm wrong because I don't want to switch.
hu3 1 hours ago [-]
About slowdowns... I have this theory that if they sneak some sleep(1) calls while processing medium to complex prompts they can serve more clients.
But I think "context switching" between 2 different prompts might be too expensive for GPUs to be worth it for LLM providers. Who knows.
janalsncm 21 hours ago [-]
There was a really good point in this podcast episode about the speed of LLMs. They are so slow that all of the progress messages and token streaming are necessary. But the core problem is that the technology is so darn slow.
As someone who both uses and builds this technology I think this is a core UX issue we’re going to be improving for a while. At times it really feels like a choose 2+ of: slow, bad, and expensive.
pram 23 hours ago [-]
I’ve noticed most of the complaints are about the Pro plan. Anecdotally I pay for the $200 Max plan and haven’t noticed anything radically different re: tokens or thinking time (availability is still a crapshoot)
I am certainly not saying people should “spend more money,” more like the Claude Code access in the Pro plan seems kind of like false advertising. Since it’s technically usable, but not really.
swiftcoder 22 hours ago [-]
> I am certainly not saying people should “spend more money,” more like the Claude Code access in the Pro plan seems kind of like false advertising
It's particularly noticeable when for a long time you could work an 8-hour day in codex on ChatGPT's $20/month plan (though they too started tightening the screws a couple of weeks back)
thebitguru 22 hours ago [-]
My guess is that the higher plans will be next, especially as more people upgrade to those and maximize their usage.
bauerd 23 hours ago [-]
They can't afford to care about individual customers because enterprise demand exploded and they're short on compute
stan_kirdey 22 hours ago [-]
I also cancelled my subscription. The $20 Pro plan has become completely unusable for any real work. What is especially frustrating is that Claude Chat and Claude Code now share the exact same usage limits — it makes zero sense from a product standpoint when the workflows are so different. Even the $200 Max plan got heavily nerfed. What used to easily last me a full week (or more) of solid daily use now burns out in just a few days. Combined with the quality drop and unpredictable token consumption, it simply stopped being worth it.
algoth1 23 hours ago [-]
Doesn't "poor support" imply that there is some sort of support? Shouldn't it be "no support"?
And by crikey do I empathise with the poor support in this article. Nothing has soured me on Anthropic more than their attitude.
Great AI engineers. Questionable command line engineers (but highly successful.) Downright awful to their customers.
lanthissa 23 hours ago [-]
For all the drama, it's pretty clear that OpenAI, Google, and Anthropic have all had to degrade some of their products because of a lack of supply.
There's really no immediate solution to this other than letting the price float or limiting users; as capacity is built out, this gets better.
PeterStuer 21 hours ago [-]
I'm on max x5. No limit problems, but I am definitely feeling the decline. Early stopping and a hellbent insistence on shortcuts are the main culprits, closely followed by over-optimistic (stale) caching (audit your hooks!).
All mostly mitigatable by rigorous audits and steering, but man, it should not have to be.
aucisson_masque 17 hours ago [-]
First ever time I used ai to code was a week ago, went with the Claude pro because I didn't want to commit.
The $20 plan has incredible value, but the limits are just way too tight.
I'm glad Claude made me discover the strength of ai, but now it's time to poke around and see what is more customer friendly. I found deepseek V4 to be extremely cheap and also just as good.
Plus I get the benefit to use it in vs code instead of using Claude proprietary app.
I think that once people get over the hype and social pressure, Anthropic will lose quite a lot of customers.
torstenvl 21 hours ago [-]
I feel like almost everyone using AI for support systems is utterly failing at the same incredibly obvious place.
The first job of any support system—both in terms of importance and chronologically—is triage. This is not a research issue and it's not an interaction issue. It's at root a classification problem and should be trained and implemented as such.
There are three broad categories of interaction: cranks, grandmas, and wtfs.
Cranks are the people opening a support chat to tell you they have vital missing information about the Kennedy Assassination or they want your help suing the government for their exposure to Agent Orange when they were stationed at Minot. "Unfortunately I can't help with that. We are a website that sells wholesale frozen lemonade. Good luck!"
Grandma questions are the people who can't navigate your website. (This isn't meant to be derogatory, just vivid; I have grandma questions often enough myself.) They need to be pointed toward some resource: a help page, a kb article, a settings page, whatever. These are good tasks for a human or LLM agent with a script or guideline and excellent knowledge/training on the support knowledge base.
WTFs are everything else. Every weird undocumented behavior, every emergent circumstance, every invalid state, etc. These are your best customers and they should be escalated to a real human, preferably a smart one, as soon as realistically possible. They're your best customers because (a) they are investing time into fixing something that actually went wrong; (b) they will walk you through it in greater detail than a bug report, live, and help you figure it out; and (c) they are invested, which means you have an opportunity for real loyalty and word-of-mouth gains.
What most AI systems (whether LLMs or scripts) do wrong is that they treat WTFs like they're grandmas. They're spending significant money on building these systems just to destroy the value they get from the most intelligent and passionate people in their customer base doing in-depth production QC/QA.
dboreham 21 hours ago [-]
This rings true. However I have used one AI automated support chat that didn't behave that way. I wish I could remember the vendor but I do remember being blown away when it said something like "that sounds like a real problem would you like me to open a support ticket for this?". Which it then did and subsequently a human addressed my issue.
lawrence1 21 hours ago [-]
The timeline doesn't make any sense. How can you subscribe a couple of weeks ago when the problem started three weeks ago, and yet things also went well for the first few weeks? Was this written by GPT 5.5?
wg0 21 hours ago [-]
The author is not a native English speaker it seems.
They might mean "a few weeks ago": in their mind, "a couple of weeks ago" may translate "Vor ein paar Wochen," which means "a few weeks ago" rather than exactly two.
The rest of the prose in the article seems to support that assumption.
The post is handwritten with no LLMs involved.
fortynights 13 hours ago [-]
Seriously, I’ve been subscribed to Claude for 18 months now ($20/mo), during which time I’ve seen the hype around other models come and go in a matter of weekends. I’ve just come to accept that everything is largely a commodity and sometimes it’s worth taking a longer view (provided one can afford it).
arikrahman 15 hours ago [-]
I use Aider nowadays, and will probably cancel my GitHub multi-AI bundle subscription due to the new training policy. I find that using Aider with the new open models, and using Open Spec to negotiate requirements before the handoff, has helped me a lot.
zulban 17 hours ago [-]
Curious. Not my experience whatsoever.
I tried Claude recently and it was able to one-shot fixes on 9/9 of the bugs I gave it on my large and older Unity C# project. Only 2/9 needed minor tweaks for personal style (functionally the same).
Maybe it helps that I separately have a CLI with very extensive unit tests. Or that I just signed up. Or that I use Claude late in the evenings (off hours). I also give it very targeted instructions and if it's taking longer than a couple minutes - I abort and try a different or more precise prompt. Maybe the backend recognizes that I use it sparingly and I get better service.
The author describes what sounds like very large tasks that I'd never hand off to an AI to run wild in 2026.
Anyway I thought I'd give a different perspective than this thread.
joozio 21 hours ago [-]
Funny. I thought I was the only one. Then I found more people and now you wrote about that. Just this week I also wrote about Claude Opus 4.7 and how I came back to Codex after that: https://thoughts.jock.pl/p/opus-4-7-codex-comeback-2026
y42 18 hours ago [-]
I like your blog and I can totally relate to this article - it's like something I wanted to write about for a couple of weeks now. :D
Wait, weren't there posts in the not-too-distant past where everyone was singing the praises of Claude and wondering how OpenAI would catch up?
swader999 21 hours ago [-]
Yep. I think the sentiment here isn't lagging too much in terms of the day to day experience of what is being offered. Kind of makes HN very useful in this regard.
cyanydeez 22 hours ago [-]
Wait, are SaaS companies fundamentally shifting business models, seeking to maximize the value extracted from a product at the expense of the customer over time?
Strange how things can change!
Capricorn2481 21 hours ago [-]
We've seen this sentiment shift on HN like 20 times in the past year, too often for it to be a real reflection of service quality. Feels more like people rooting for sports teams.
The services (OpenAI, Anthropic) are not wildly changing that much. People are just using LLMs more and getting frustrated because they were told it would change the world, and then they take it out on their current patron. Give it a month and we'll be hearing how far OpenAI has fallen behind.
vivin 19 hours ago [-]
This is interesting to me, because Claude has been a net-positive for me. I haven't faced token issues or declining quality. I generally work with Claude as an assistant -- I may have it do planning and have it "one shot" a thing, but it's usually a personal tool or a utility that I want it to write.
For actual code that goes out to production, I generally figure out how I would solve the problem myself (but will use Claude to bounce ideas and approaches -- or as a search engine) and then have Claude do the boring bits.
Recently I had to migrate a rules-engine into an FSM-based engine. I already had my plan and approach. I had Claude do the boring bits while I implemented the engine myself. I find that Claude does best when you give it small, focused, incremental tasks.
isjcjwjdkwjxk 23 hours ago [-]
Oh no, the unreliable product people pretend is the next coming of Jesus turned out to be thoroughly unreliable. Who coulda thunk it.
easythrees 23 hours ago [-]
I have to say, this has been the opposite of my experience. If anything, I have moved over more work from ChatGPT to Claude.
kleene_op 23 hours ago [-]
Same. I am getting crazy good value from Claude at work, on both scientific applications and deployment environments.
There is one caveat, and that is you have to give the model well thought out constraints to guide it properly, and absolutely take the time to read all the thinking it's doing and not be afraid to stop the process whenever things go sideway.
People who just let Claude roam free on their repository deserve everything they end up with.
burnJS 20 hours ago [-]
My experience is that Claude and others are good at writing methods and smaller units, because you can dictate what the code should do in fewer tokens and easily read the result. This closes the feedback loop for me.
I occasionally ask AI to write lots of code such as a whole feature (>= medium shirt size) or sometimes even bigger components of said feature and I often just revert what it generated. It's not good for all the reasons mentioned.
Other times I accept its output as a rough draft and then tell it how to refactor its code from mid to senior level.
I'm sure it will get better but this is my trust level with it. It saves me time within these confines.
Edit: it is a valuable code reviewer for me, especially as a solo stealth startup.
brunooliv 16 hours ago [-]
I still haven’t seen any other models be as complete as Claude inside Claude Code. I bet Anthropic knows this and they turn the knobs and see people’s reactions… I have been planning with Qwen3.6 Max inside opencode, absolutely game changer.
Opus can then follow the plan quite detailed and like this I can make progress on my toy apps on Pro plan at 20/mo.
For work, unlimited usage via Bedrock.
Yes I’d like to get more usage out of my personal sub, but at 20/mo no complains
throwaway2027 23 hours ago [-]
Same. I think one of the issues is that Claude reached a threshold where I could just rely on it being good and had to manually fix things up less and less, while other models hadn't reached that point yet, so I was aware of that and knew I had to fix things up or do a second pass or more. Other providers also move you to a worse model after you run out, which is key in setting expectations as well. Developers knew that was the trade-off.
I think even with the worse limits people still hated it, but when you start to either deliberately or inadvertently make the model dumber, that's when there's really no reason to keep using Claude anymore.
duxup 20 hours ago [-]
I’ve definitely encountered a drop in Claude quality.
Even a simple prompt focused on two files I told Claude to do a thing to file A and not change file B (we were using it as a reference).
Claude’s plan was to not touch file B.
First thing it did was alter file B. Astonishing simple task and total failure.
It was all of one prompt, simple task, it failed outright.
I also had it declare that some function did not have a default value and then explain what the function does and how it defaults to a specific value…
Fundamentally absurd failures that have seriously impacted my level of trust with Claude.
nikolay 22 hours ago [-]
I can agree. ChatGPT 5.5 made this a no-brainer choice. Anthropic are idiots removing Claude Code from the Pro plan. They need to ask Claude if what they did was a natural intelligence bug! Greed kills companies, too!
Capricorn2481 22 hours ago [-]
> I can agree. ChatGPT 5.5 made this a no-brainer choice
The new model that came out less than 24 hours ago made this obvious? This feels like when a new video game comes out and there's 1,000 steam reviews glazing it in the first hours of release. Don't you think you should use it for longer than a day before declaring it a game changer?
nikolay 21 hours ago [-]
Plus, I don't like Anthropic allowing "chosen people" to access models - OpenAI would offer anything to anybody who's willing to pay for it, and that's a real business. Anthropic claims some moral superiority, but their model is used to kill civilians in Iran and to spy on all of us, so I don't buy their BS!
nikolay 21 hours ago [-]
Yes, even 5.4 was better.
robotnikman 19 hours ago [-]
>removing Claude Code from the Pro plan
Wait really? I wanted to give it a try, but for $200 a month no way am I paying that for something I just want to experiment around with
nikolay 19 hours ago [-]
It was all over Hacker News, but they do have a $100/mo Max plan. But so does OpenAI now, too. I guess companies assume people are willing to cough up hundreds of dollars per month, and we are. I need to review all my AI subscriptions, as it's over $1,000 per month now: $250 for Gemini Ultra (although it comes with tons and tons of other benefits, plus it can be shared with my family; other companies live in a world where families don't exist and people don't share a laptop for both personal and corporate stuff). Add $200 for Claude Max and $100 for ChatGPT Pro. Add Cursor, add Zed, add Lovable, add a bunch of other things I'm experimenting with... It's getting expensive!
nikolay 19 hours ago [-]
Let me take this back: you are probably grandfathered into having Claude Code on your Pro plan. Now I feel sorry I upgraded to Max, as I could've kept Pro just in case, but now I will just have to cancel it if I want to move to ChatGPT, or to Gemini 4 when it's out, if it's worth it. They need to consult with AI when making such stupid choices, honestly!
airbreather 18 hours ago [-]
I am sort of in the same place, it seems to have lost enough of the magic that I might be better trying to do more with running local LLMs on my 4090.
The thing is, running local LLMs gives some kind of reliability and fixed expectations, which saves a lot of time. Sure, Claude might be fantastic one day, but what do I do when the same workload churns out shit the next, and I am halfway through updating and referencing a 500-document set?
Better the devil you know and all that.
0xchamin 17 hours ago [-]
One of the biggest problems with Claude is that it tries to do things I don't even ask for. I really like to have full control over what I do. I feel that sometimes Claude has an urge to keep going with what it is hard-coded for instead of waiting for my feedback. It looks like Claude considers everything to be a one-shot. I may be wrong; this is my personal experience.
olcay_ 14 hours ago [-]
Claude Code has something about picking sensible choices instead of asking questions in the system prompt, that's probably the problem.
dostick 20 hours ago [-]
The discussion about Claude always omit the important context - which language/platform you’re using it for. It is best trained for web languages and has most up to date knowledge for that.
If you use it for Swift, it was trained on a whole landfill of code, which gives you a strong bias toward pre-Swift 6 output. Imagine you gave Claude requirements for a web app and it implemented it all in jQuery. That's what happens with other platforms.
adamors 20 hours ago [-]
It’s not omitted; the OP clearly talks about editing JavaScript.
chaosprint 17 hours ago [-]
I bought a Claude membership a few days ago. I asked him to fix a React issue—a very simple UI modification with almost no logic. He still failed to understand it. And after three attempts, the 5-hour limit was reached. This was a disaster. I had to immediately buy a CodeX membership and also tried Image2. I won't give Claude another chance.
jryio 17 hours ago [-]
I find it strange that you've anthropomorphized Claude but not ChatGPT seemingly based on one having a human name and the other not
lawrence1 21 hours ago [-]
The timeline of the first few sentences doesn't add up. How can you subscribe two weeks ago when the problem started three weeks ago?
(I am just learning that "a couple of weeks" apparently means "2 weeks"...)
rurban 19 hours ago [-]
That's bad for him, because he already had a cheap plan. Now he won't get it back that easily.
Pro is gone. OpenAI plans are more expensive. He can only buy a Kimi plan, which is at least better than Sonnet. But frontier for cheap is gone. Even Copilot business plans are getting very expensive soon, also switching to API usage only.
kx_x 19 hours ago [-]
After the fixes in Claude Code, Opus 4.6/4.7 have been performing well.
Before the fixes, they were complete trash and I was ready to cancel this month.
Now, I'm feeling like the AI wars are back -- GPT 5.5 and Opus 4.7 are both really good. I'm no longer feeling like we're using nerfed models (knock on wood)!
binyu 22 hours ago [-]
I feel like Anthropic is forcing their new model (Opus 4.7) to do much less guesswork when making architectural choices; instead it prefers to defer decisions back to the user. This is likely done to mine sessions for reinforcement-learning signals, which are then used to make their future models even smarter.
On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode. This was the wrong tradeoff. We reverted this change on April 7 after users told us they'd prefer to default to higher intelligence and opt into lower effort for simple tasks. This impacted Sonnet 4.6 and Opus 4.6.
On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6.
On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7.
binyu 21 hours ago [-]
How does this address the point I raised, specifically?
exabrial 20 hours ago [-]
It's bad, really bad.
The filesystem tool cannot edit XML files with <name></name> elements in them.
Animats 11 hours ago [-]
Support? You expected support? Live support?
Most of this is about the billing system, which is apparently broken.
elevaet 21 hours ago [-]
I've been very happy using Codex in the VScode extension. Very high quality coding and generous token limits. I've been running Claude in the CLI over the last couple of months to compare and overall I prefer Codex, but would be happy with either.
hybrid_study 22 hours ago [-]
Sometimes it feels like Anthropic uses token processing as a throttling tool, to their advantage.
_pdp_ 19 hours ago [-]
Sign up for all major providers (pro plan) and round-robin between them. This is the only way to protect against losing access to all of these heavily subsidized subscriptions. See what happened to Copilot.
yalogin 22 hours ago [-]
If someone wants to move off Claude what are the alternatives? More importantly can another system pick up from where Claude left off or is there some internal knowledge Claude keeps in their configuration that I need to extract before canceling?
janalsncm 22 hours ago [-]
Opencode is a great cli for driving a coding agents.
Like 3 weeks ago Qwen3-coder was the best coding LLM to run locally. I haven’t spent time since to figure out if anything is better.
You can also power Opencode with OpenRouter which lets you pay for any LLM à la carte.
y42 21 hours ago [-]
I have been trying Qwen3.5-9B-Claude-4.6 locally for a couple of days now, coming from OMLX, either via Hermes or Continue in VS Code. It's okay-ish, even performance-wise.
These changes fixed some of the token issues, but token bloat is an intrinsic problem with the model, and Anthropic's solution of defaulting to xhigh reasoning for Opus 4.7 just means you'll go through tokens faster anyway.
sgt 23 hours ago [-]
I'm worried anything less than xhigh is insufficient though. What do you do?
giancarlostoro 23 hours ago [-]
The problem is they changed people's default settings, and if you're like me, you keep a Claude Code session open for days, maybe weeks and even a month, and just come back to it and keep going. I wouldn't be surprised if there's hundreds if not thousands of people still on these broken configurations / models.
Dear Anthropic:
Please, for the love of all things holy, NEVER change someone's defaults without INFORMING the end user first, because you will wind up with people confused, upset, and leaving your service.
sreekanth850 20 hours ago [-]
The biggest issue I see is that models are not getting more efficient. This is nowhere near getting commoditized. There is a limit to how long you can burn money at subsidized cost.
mattas 21 hours ago [-]
I've seen a post like this every week for the last two years. Are these models actually getting worse? Or do folks start noticing the cracks as they use them more and more?
DeathArrow 23 hours ago [-]
I use Claude Code with GLM, Kimi and MiniMax models. :)
I was worried about Anthropic models quality varying and about Anthropic jacking up prices.
I don't think Claude Code is the best agent orchestrator and harness in existence but it's most widely supported by plugins and skills.
droidjj 22 hours ago [-]
Where are you getting inference from? I'm overwhelmed by the options at the moment.
alex-onecard 21 hours ago [-]
I am also curious. Considering the kimi coding plan but I'm worried about data privacy and security.
DeathArrow 21 hours ago [-]
I don't send much data to cloud, mostly code. And I don't believe in security by obscurity, if I need high security I do proper implementation.
DeathArrow 21 hours ago [-]
I am using Ollama Cloud and Moonshot Ai.
brachkow 19 hours ago [-]
Like many others, I've had a negative (not as good as before) feeling about Claude Code lately.
What I don't understand is these loud "voting with my money" comments. What they are canceling is a heavily subsidized plan for something that delivers a lot of value.
There are only two providers that can offer this level of model at such a subsidized price: Anthropic and OpenAI. Both of them are bad in terms of reliability.
So I wonder what these people do after they "cancel" both of them? Do they see producing fewer results at the same hourly rate as everyone else on the market as a viable option?
giancarlostoro 23 hours ago [-]
I'm torn because I use it in my spare time, so I've missed some of these issues; I don't use it 9 to 5, but I've built some amazing things. When 1 million tokens dropped, that was peak Claude Code for me; it was also when I suspect their issues started. I've built some things I'd been drafting in my head for ages but never had time for, and I can review the code and refine it until it looks good.
I'm debating trying out Codex; from some people I hear it's "uncapped," from others I hear they reached limits in short spans of time.
There's also the really obnoxious "trust me bro" documentation update from OpenClaw where they claim Anthropic is allowing OpenClaw usage again, with no official statement.
Dear Anthropic:
I would love to build a custom harness that just uses my Claude Code subscription, I promise I wont leave it running 24/7, 365, can you please tell me how I can do this? I don't want to see some obscure tweet, make official blog posts or documentation pages to reflect policies.
Can I get whitelisted for "sane use" of my Claude Code subscription? I would love this. I am not dropping $2400 in credits for something I do for fun in my free time.
fluidcruft 23 hours ago [-]
It sounds like we have very similar usage/projects. codex had been essentially uncapped (via combination of different x-factors between Plus and Pro and promotions) until very recently when they copied Anthropic's notes.
Plus is still very usable for me though. I have not tried Claude Pro in quite a while and if people are complaining about usage limits I know it's going to be a bad time for me. I had to move up from Claude Pro when the weekly limits were introduced because it was too annoying to schedule my life around 5hr windows.
I started using codex around December when I started to worry I was becoming too dependent on Claude and need to encourage competition. codex wasn't particularly competitive with Claude until 5.4 but has grown on me.
The only thing I really care about is that whatever I'm using "just works" and doesn't hurt limits and Claude code has been flaky as all hell on multiple fronts ever since everyone discovered it during the Pentagon flap. So I tend to reach for ChatGPT and codex at the moment because it will "just work" and there's a good chance Claude will not.
scottyah 23 hours ago [-]
Don't forget, Openclaw was basically bought by OpenAI so there's only incentive to use it as a wedge to pry people off Anthropic.
dheera 23 hours ago [-]
Claude Code now has an official telegram plugin and cron jobs and can do 80% of the things people used OpenClaw for if you just give it access to tools and run it with --dangerously-skip-permissions.
giancarlostoro 20 hours ago [-]
I don't use OpenClaw is what I'm saying though, I use Claude Code for coding, and would like to better equip Claude by a custom coding harness that has superior tooling out of the box, but that is fair.
Der_Einzige 22 hours ago [-]
The /loop command, which is supposed to be the equivalent of heartbeat.md, is EXTREMELY unreliable/shitty.
giancarlostoro 19 hours ago [-]
I use it sparingly with my guardrails project. I basically tell it to:
Check for any remaining tasks if it's not currently working on one, and continue until it finishes; dismiss the reminder if it's done; ensure it runs unit tests / confirms the project builds before moving on to the next task; compact the context when moving to the next one; and once it's exhausted all remaining tasks, close the loop.
Works for me for my side projects, I can leave it running for a bit until it exhausts all remaining tasks.
hedgehog 23 hours ago [-]
I used Opus via Copilot until December and then largely switched over to Claude Code. I'm not sure what the difference is but I haven't seen any of these issues in daily use.
nickdothutton 22 hours ago [-]
Switched to local models after quality dropped off a cliff and token consumption seemed to double. Having some success with Qwen+Crush and have been more productive.
tfrancisl 22 hours ago [-]
Would love some more info on how you got any local model working with Crush. Love charmbracelet but the docs are all over the place on linking into arbitrary APIs.
porkloin 21 hours ago [-]
Assuming you have a locally running llama-server or llama-swap, just drop this into your crush.json with your setup details/local addresses, etc.:
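A minimal sketch of the kind of crush.json provider block being described, assuming Crush's OpenAI-compatible provider config; the provider name, model IDs, addresses, and context sizes here are placeholders for whatever your llama-server/llama-swap instance actually serves:

```json
{
  "providers": {
    "llamaswap": {
      "type": "openai",
      "base_url": "http://localhost:8080/v1",
      "api_key": "none",
      "models": [
        { "id": "qwen3-coder-30b", "name": "Qwen3 Coder", "context_window": 65536 },
        { "id": "qwen3-4b", "name": "Qwen3 4B", "context_window": 32768 }
      ]
    }
  },
  "models": {
    "large": { "provider": "llamaswap", "model": "qwen3-coder-30b" },
    "small": { "provider": "llamaswap", "model": "qwen3-4b" }
  }
}
```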
Obviously the context window settings are going to depend on what you've got set on the llama-server/llama-swap side. Multiple models on the same server like I have in the config snippet above is mostly only relevant if you're using llama-swap.
TL;DR is you need to set up a provider for your local LLM server, then set at least one model on that server, then set the large and small models that crush actually uses to respond to prompts to use that provider/model combo. Pretty straightforward but agree that their docs could be better for local LLM setups in particular.
For me, I've got llama-swap running and set up on my tailnet as a [tailscale service](https://tailscale.com/docs/features/tailscale-services) so I'm able to use my local LLMs anywhere I would use a cloud-hosted one, and I just set the provider baseurl in crush.json to my tailscale service URL and it works great.
chadleriv 18 hours ago [-]
Off topic: this model-switching content feels very circa-2010 "I'm quitting Facebook."
r0fl 13 hours ago [-]
I hope codex doesn’t decline the same way
I’m blown away by how good it is lately
datavirtue 4 hours ago [-]
I have enterprise plans for all AI services except Google. GitHub Copilot in VS Code is the best I have used so far. I hear a lot of complaints from people who are holding it wrong. In a single day I can have a beautiful greenfield app deployed. One dev. One day. Something that would have taken weeks with two teams bumping into each other. It's fully documented. Beautiful code. I read the reasoning prompts as it flows by to get an idea of what is going on. I work in phases and review the code and working product quickly after that. Minimal issues.
I'm an executive, the devs complaining are getting retrained or put on the chopping block.
My rockstars are now random contractor devs from Vietnam. The aloof FTE greybeards saying "I don't know, it doesn't work very well on X" are getting a talking-to or being sidelined/canned. So far most of my greybeards are adapting pretty well.
I'm not waiting on people to write code any more. No way in hell.
sfmike 22 hours ago [-]
I ran prompts, used up a ton of usage, and got no return; it just showed an error.
I asked support: hey, I got nothing back; I tried prompting several times, used a ton of usage, and it gave no response. I'd just like the usage back. What I paid for, I never got.
Just a bot response: we don't do refunds, no exceptions. Even in the case where they don't serve you what your plan should give you.
caycep 22 hours ago [-]
If all Claude does is automate mundane code, why not just make a "meta library" of said common mundane code snippets?
twobitshifter 22 hours ago [-]
maybe make it so that when you start typing it completes the snippet?
queuebert 22 hours ago [-]
Like Stack Overflow?
caycep 22 hours ago [-]
You still have to search Stack Overflow and sift, but I'm surprised no one has made a TLDR-style, shortcut-expanding product out of it where you can just pop in code for this or that use case, versus the current product spending a few datacenters' worth of energy to give you an answer that's only correct x% of the time.
aleqs 23 hours ago [-]
The usage metering is just so incredibly inconsistent: sometimes 4 parallel Opus sessions for 3 hours straight on max effort only use up 70% of a session; other times 20 minutes / 3 prompts in one session completely max it out. (Max x20 plan)
Is this just a bug on anthropic side or is the usage metering just completely opaque and arbitrary?
unshavedyak 22 hours ago [-]
It's strange, because I never have these issues. I often run two in parallel (though not all day), and generally have something running anytime I look at my laptop to advance the steps/tasks/etc. Usually I struggle to hit 50% on my Max20.
Heck two weeks ago i tried my hardest to hit my limit just to make use of my subscription (i sometimes feel like i'm wasting it), and i still only managed to get to 80% for the week.
I generally prune my context frequently though, each new plan is a prune for example, because i don't trust large context windows and degradation. My CLAUDE.md's are also somewhat trim for this same fear and i don't use any plugins, and only a couple MCPs (LSP).
No idea why everyone seems to be having such wildly different experiences on token usage.
subscribed 19 hours ago [-]
Maybe you should try running exactly the same prompts in exactly the same settings?
Chances are one of you has been drafted into an unpleasant experiment.
AJRF 17 hours ago [-]
We are in the 'we need to IPO so screw our customers' phase of the cycle
SwellJoe 22 hours ago [-]
I don't get it. I use Claude Code every day, what I would consider pretty heavy usage...at least as heavy as I can use it while actually paying attention to what it's producing and guiding it effectively into producing good software. I literally never run into usage limits on the $100 plan, even when the bugs related to caching, etc. were happening that led to inflated token usage.
WTF are y'all doing that chews tokens so fast? I mean, sure, I could spin up Gas Town and Beads and produce infinite busy work for the agents, but that won't make useful software, because the models don't want anything. They don't know what to build without pretty constant guidance. Left to their own devices, they do busy work. The folks who "set and forget" on AI development are producing a whole lot of code to do nothing that needed doing. And, a lot of those folks are proud of their useless million lines of code.
I'm not trying to burn as many tokens as a possible, I'm trying to build good software. If you're paying attention to what you're building, there's so many points where a human is in the loop that it's unusual to run up against token limits.
Anyway, I assume that at some point they have to make enough money to pay the bills. Everything has been subsidized by investors for quite some time, and while the cost per token is going down with efficiency gains in the models/harnesses and with newer compute hardware tuned for these workloads, I think we're all still enjoying subsidized compute at the moment. I don't think Anthropic is making much profit on their plans, especially with folks who somehow run right at the edge of their token limit 24/7. And, I would guess OpenAI is running an even lossier balance sheet (they've raised more money and their prices are lower).
I dunno. I hear a lot of complaining about Claude, but it's been pretty much fine for me throughout 4.5, 4.6 and 4.7. It got Good Enough at 4.5, and it's never been less than Good Enough since. And, when I've tried alternatives, they usually proved to be not quite Good Enough for some reason, sometimes non-technical reasons (I won't use OpenAI, anymore, because I don't trust OpenAI, and Gemini is just not as good at coding as Claude).
tacker2000 20 hours ago [-]
I would say most people are complaining about the $20 plan, which is now actually indistinguishable from the free one. I tested it and ran into limits immediately. With Max, i can work properly again.
antics9 21 hours ago [-]
From the comments here it seems like people are stress prompting their builds instead of planning and reviewing.
If one model seems to be a bit off during a session I just switch to another (Opencode) and plan and review from there.
dboreham 21 hours ago [-]
Same here. I've asked this question before. Haven't received an answer yet.
dannypostma 17 hours ago [-]
When I saw the German screenshot it all made sense to me.
captainregex 20 hours ago [-]
anyone remember the whole “delete uber” thing from 2017ish? good times
bad_haircut72 23 hours ago [-]
Waiting 60s every time I send a message really kills the UX of Claude.
spaceman_2020 18 hours ago [-]
4.7 is the breaking point for me
It's almost unusable
postepowanieadm 21 hours ago [-]
Yeah, session limits are kinda show stoppers.
zh_code 22 hours ago [-]
I just cancelled my Max20 plan yesterday.
smashah 9 hours ago [-]
Did the same with Google AI Ultra. They rug-pulled the subscribers. They changed the deal, we cancel. Simple.
varispeed 23 hours ago [-]
It also seems to me they route prompts to cheaper, dumber models that present themselves as e.g. Opus 4.7. Perhaps that's what "adaptive reasoning" is: we'll route your request to something like Qwen while calling it Opus. Sometimes I get a good model, so I've found I'll ask a difficult question first, and if the answer is dumb, I terminate the session and start again, and only then go with the real prompt. But there is no guarantee the model won't be downgraded mid-session. I wish they just charged the real price and stopped these shenanigans. It wastes so much time.
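The probe-first workflow described here can be sketched as follows; `new_session` is a hypothetical stand-in for however you open a fresh chat session (it returns a callable that takes a prompt and returns the reply text), and the canary question is anything you know the full-strength model reliably answers:

```python
def looks_smart(ask) -> bool:
    """Send a canary question whose answer the full-strength model
    reliably gets right; reject the session if the reply misses it."""
    reply = ask("What is the determinant of [[2, 1], [1, 2]]?")
    return "3" in reply  # det = 2*2 - 1*1 = 3

def get_good_session(new_session, max_retries=5):
    # Keep re-opening sessions until one passes the canary check,
    # then hand that session the real prompt.
    for _ in range(max_retries):
        session = new_session()
        if looks_smart(session):
            return session
    raise RuntimeError("no session passed the canary check")
```

As the comment notes, this only screens the session at the start; it can't detect a mid-session downgrade.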
dswalter 23 hours ago [-]
You're describing a Taravangian prompt situation (a character in a book series who wakes up with a different/random intelligence level each day and has a series of tests for himself to determine which kind of decisions he's capable of that day). https://coppermind.net/wiki/Taravangian
r00t- 21 hours ago [-]
Same, it's a mess.
danjl 22 hours ago [-]
This sounds just like all my neighbors complaining about their internet provider.
gizmodo59 22 hours ago [-]
Codex is becoming such a good product. I have the $100 Pro lite. I still have Claude, but at $20, and I rarely use it. Let's see if they give generous limits and, more importantly, a model that's better than 5.5. The mythos fear mongering did not give me a good impression that they care about the average developer.
tamimio 14 hours ago [-]
Very similar experience, although I didn't use Claude for anything in production. I did run some tests on a few topics I know well, and while it initially works very well, as soon as you dive deeper you get all sorts of extra nonsense that was never asked for and isn't useful: workarounds on top of workarounds on top of duct-tape solutions. Several times I would say "no, why are you introducing xyz, that will cause this and that," only to get the usual "thanks for pushing back, you are right bla bla".
We probably hit peak generative AI last year. Now they probably use AI to improve the AI, so it's garbage in, garbage out; or maybe Anthropic is deprioritizing individual users while favoring enterprise or even government, where it provides better quality for higher-value contracts.
queuebert 22 hours ago [-]
Maybe this is an unpopular opinion, but I think choosing which companies to support during this period of pre-alignment is one way to vote which direction this all goes. I'm happy to accept a slightly worse coding agent if it means I don't get exterminated someday.
johanneskanybal 15 hours ago [-]
It's not magic, but for me Claude is definitely the way to go. I'm not expecting magic; it's just another level of non-slop compared to the rest I've tried.
drivebyhooting 22 hours ago [-]
Imagine vibe coding your core consumer application and associated backend…
Oh wait, I don’t have to imagine. That’s what Anthropic does. A nice preview for what is in store for those who chose to turn off their brains and turn on their AI agents.
kissgyorgy 19 hours ago [-]
I cancelled the minute my subscription stopped working in Pi. Not going back to the slopfest that Claude Code is.
gverrilla 19 hours ago [-]
My main problem with claude code right now is observability. I've been experimenting a lot with vibe coding, but nowadays I can't even tell what it's doing. It's still delivering me value, but the trust on the company is going down and I've already started looking for alternatives.
josefritzishere 19 hours ago [-]
AI has a lot of future potential but at every level... it's still not very good. And certainly not good enough to validate the expense, let alone what the actual cost would be were it profitable.
wslh 20 hours ago [-]
Anthropic is astroscaling. We're essentially buying into a loop where speed and iteration take precedence over stability and support. If you view them as an experimental lab undergoing rapid atmospheric friction rather than a company, the "unreliability" is just the cost of being at the frontier. This is not an endorsement for Anthropic, just imagining their craziness on how you "can" grow in a fraction of time.
shevy-java 21 hours ago [-]
Those AI-using software developers are beginning to show signs of addiction: from "yay, Claude is awesome" to "damn, it sucks". It's like withdrawal symptoms now.
My approach is much easier: I'll stay the oldschool way, avoid AI, and come up with other solutions. I am definitely slower, but I reason that the quality FOR other humans will be better.
ForOldHack 21 hours ago [-]
I have token issues three times a day, and I just upgraded to Pro... and now this... now I cancel. My workflow was Copilot to Gemini to Claude Code, and the bottleneck was always CC. Always. I am done. It should be pretty easy to replace CC.
AI used to be the punched-card replicator... it's all replaceable.
moralestapia 21 hours ago [-]
The midwit curve of LLMs has OpenAI on both ends.
docheinestages 21 hours ago [-]
Me too.
estimator7292 21 hours ago [-]
I just noticed today that it doesn't warn about approaching limits and just blows straight into billing extra tokens.
I'm pretty sure it used to warn when you got close to your 5hr limit, but no, it happily billed extra usage. Granted only about $10 today, but over the span of like 45 minutes. Not super pleased.
GrumpyGoblin 23 hours ago [-]
Cool
gexla 13 hours ago [-]
We can't do it. We standardized. They got us.
semiinfinitely 20 hours ago [-]
Absolute garbage support was the reason I canceled. Who would have thought that an AI company has only bots as support agents?
whalesalad 22 hours ago [-]
I've spent thousands of dollars on API tokens in the last few months. Out of my own pocket, as an indie contractor. I used the API specifically instead of Pro/Max/Plus/Silver/Gold/Platinum/Diamond to avoid all of the mess there regarding usage resets and potential hidden routing to worse models. It worked great for months, I got a ton of shit done, shipped a bunch of features. I really began to rely on the tech. I was not happy about the cost, but the value proposition was there.
Then within the last few months everything changed and went to shit. My trust was lost. Behavior became completely inconsistent.
During the height of Claude's mental decline (now finally acknowledged by the creators) I had an incident where CC ran a query against an unpartitioned/massive BQ table that resulted in $5,000 in extra spend, because it scanned a table that should have been daily-partitioned 30 times, at 27 TB per scan. I recall going over and over the setup, exhaustively refining my confidence. After I realized this blunder, I referred to it in the same CC session: "jesus fucking christ, I flagged this issue earlier". It responded, "you did. you called out the string types and full table scans and I said 'let's do it later.' That was wrong. I should have prioritized it when you raised it". Now obviously this is MY fault. I fucked up here, because I am the operator, and the buck stops with me. But this incident really drove home that the Claude I had come to vibe with so well over the last N months was entirely gone.
We all knew it was making mistakes. We all felt and flagged this. When Anthropic came out and said, "yeah... you guys are using it wrong, it's a skill issue," I knew the honeymoon was over. Then recently, when they finally acknowledged more of the issues (while somehow still glossing over how badly they fucked up?), it was the final nail. I'm done spending money on the Anthropic ecosystem. I signed up for OpenAI Pro at $200/mo and will continue working on my own local inference in the meantime.
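The $5,000 scan above is exactly what a partition filter prevents. A minimal, hypothetical sketch (the table name is made up; `_PARTITIONDATE` assumes an ingestion-time daily-partitioned BigQuery table), pinning a query to one partition so BigQuery only scans and bills that day's data:

```python
from datetime import date

def partitioned_scan_query(table: str, day: date) -> str:
    """Build a query constrained to a single daily partition.

    Without the _PARTITIONDATE filter, BigQuery scans (and bills)
    the entire table; with it, only one day's partition is read.
    """
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE _PARTITIONDATE = DATE '{day.isoformat()}'"
    )

# Hypothetical table; only the 2025-03-01 partition would be scanned.
q = partitioned_scan_query("project.dataset.events", date(2025, 3, 1))
```

Adding a guard like this to a CLAUDE.md rule ("every BQ query must carry a partition filter") is cheap insurance against an agent re-running full-table scans.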
whalesalad 22 hours ago [-]
The thing that irks me the most is that I was paying full price. Literally spreading my cheeks wide open and saying, "here ya go Anthropic, all yours." And for close to a year it was great. Looking back, I see I began using it in March of 2025. That's roughly a year of integrating it into my day to day, pairing with it constantly, and, as I said, shipping really well-engineered, meaningful work.
They could have just kept doing this - literally printing money. Literally: do absolutely nothing, go on vacation, profit $$$. So why did so much change? I think that the issue is they were trying to optimize CC for the monthly plan folks, the ones who are likely losing the company money, but API users became collateral damage.
scuff3d 22 hours ago [-]
Welcome to the future. Anthropic is currently speed running it but this is what all LLM tools are going to look like in the next few years, once they turn the enshitification corner.
rvz 23 hours ago [-]
The great de-skilling programme continues in Anthropic's casino. They completely want you dependent on gambling tokens on their slot machines with extortionate prices, fees and limits.
Anthropic can't even scale their own infrastructure operations, because it does not exist and they do not have the compute; even when they are losing tens of billions and can nerf models when they feel like it.
Once again, local models are the answer, but Anthropic keeps getting you addicted to its casino instead of letting you run your own, cheaper slot machine and save your money.
Every time you go to Anthropic's casino, the house always wins.
system2 23 hours ago [-]
Same here. The single prompt burnt all my tokens in 3 minutes for the day. What happened to Claude in the last 2 months? I was happy with what they were providing and was happy to pay whatever for it. Why did they mess with it? Why are they destroying the tool we all loved?
I hate enshittification and I hate seeing this happening to Claude Code right now.
mwigdahl 20 hours ago [-]
What level of subscription are you on? If you're complaining about running out of tokens but are happy to pay "whatever" for it, it should be Max 20x, yes? And one prompt drained all your tokens for the day on Max 20x?
system2 19 hours ago [-]
I am on max 10x ($100 a month). I asked it to summarize a small codebase (2000 lines). Instead, it ran 4 agents in the background, and those agents went nuts, started reading everything and related dependencies, and sucked all my daily quota, forcing me to wait till 2 AM to continue using.
Up until last month, a $100 plan was more than enough, and it was difficult to run out of tokens per day for me. Something fundamentally changed, and Claude started making more mistakes and using more tokens. I know I am not tripping because I used it for over a year; this is absolutely new.
mwigdahl 5 hours ago [-]
Agreed, and sorry for the snarky tone in my reply. The $100 / month plan is what I use at home and it has always been more than sufficient for something like what you describe. It does sound like maybe you have MCPs or skills that might be changing the default behavior as the other poster suggests, or the agent somehow interpreted your request as wanting it to read node_modules in addition to the code or something. In any event I hope you or Anthropic or both figures out what the issue was.
gitaarik 12 hours ago [-]
Maybe you need to clean up MCP tools? My codebase is very large (hundreds of files, hundreds of lines per file), and I am managing fine on a 5x subscription, chatting all day.
Some time ago I cleaned up a bunch of MCP tools I had installed some time ago and that did make a significant impact on token usage.
system2 10 hours ago [-]
Thanks, I will definitely check them out.
jwaldrip 23 hours ago [-]
I would love to just say that if you are using Claude Code, you should not be on Pro. I feel like all the people complaining are complaining that an agent can't handle the work of a developer for $20/mo. Get on at least Max 5x; it's a world of difference.
Larrikin 23 hours ago [-]
It's impossible to justify the jump in the expense unless you are directly working on something that makes you money. Messing around on a hobby project, doing some quick research, and getting personalized notifications was a no brainer for 200 a year.
The product keeps getting worse so I will definitely evaluate options and possibly switch if management keeps screwing up the product.
terrut 23 hours ago [-]
I found the perfect sweet spot for my hobby development. I pay 7 euros for Gemini plus and use it for creating the architecture and technical specs. Those are fed to Sonnet on Pro that just implements the instructions. This gives plenty of space to do long sessions several times a week.
willio58 23 hours ago [-]
True that.
Max 5, sonnet for 95% of things. I never run out of tokens in a week and I use it for ~5-6 hours a day.
y42 22 hours ago [-]
I dare to call myself a senior dev, so I don't need a replacement, I need a tool.
subscribed 21 hours ago [-]
I'm not a vibe coder or software manufacturer.
I just need a convenient command-line tool to sometimes analyse the repo and answer a few questions about it.
Am I unworthy of using CC then? Until now I thought Pro entitles me to doing so.
LOL, the elitism is through the roof.
kin 21 hours ago [-]
For some reason you are being downvoted but I wanted to echo your sentiment. As someone who tries to switch things up for every next task, the productivity of Claude Max is worth every penny.
And I actually read the output to fix what I don't like and ever since Opus 4.5, I've had to less and less. 4.6 had issues at the beginning but that's because you have to manually make sure you change the effort level.
Anecdata, but I'm still finding CC to be absolutely outstanding at writing code.
It regularly writes, in hours, systems-level code that would take me months to write by hand, with minimal babysitting and basically no "specs": just coherent, sane direction, like making sure it tests things in several different ways, for several different cases, including performance, comparing directly to similar implementations (and constantly triple-checking that it actually did what you asked after it said "done").
For $200/mo, I can still run 2-3 clients almost 24/7 pumping out features. I rarely clear my session. I haven't noticed quality declines.
Though, I will say, one random day - I'm not sure if it was dumb luck - or if I was in a test group, CC was literally doing 10x the amount of work / speed that it typically does. I guess strange things are bound to happen if you use it enough?
Related anecdata: IME, there has been a MASSIVE decline in the quality of claude.ai (the chatbot interface). It is so different recently. It feels like a wanna-be, crappier version of ChatGPT, instead of what it used to be, which was something that tried to be factual and useful rather than conversational, addictive, and sycophantic.
A small app, or a task that touches one clear smaller subsection of a larger codebase, or a refactor that applies the same pattern independently to many different spots in a large codebase - the coding agents do extremely well, better than the median engineer I think.
Basically "do something really hard on this one section of code, whose contract of how it intereacts with other code is clear, documented, and respected" is an ideal case for these tools.
As soon as the codebase is large and there are gotchas, edge cases where one area of the code affects the other, or old requirements - things get treacherous. It will forget something was implemented somewhere else and write a duplicate version, it will hallucinate what the API shapes are, it will assume how a data field is used downstream based on its name and write something incorrect.
IMO you can still work around this and move net-faster, especially with good test coverage, but you certainly have to pay attention. Larger codebases also work better when you started them with CC from the beginning, because its older code is more likely to actually work the way it expects/hallucinates.
Agreed, but I'm working on something >100k lines of code total (a new language and a runtime).
It helps when you can implement new things as if they're green-field-ish AND THEN integrate and plumb them in later.
I have my own anecdata but my comment is more about the dissonance here.
A counterpoint is Google saying the vast majority of their code is written by AI. The developers at Google are not inexperienced. They build complex critical systems.
But it still feels odd to me, this contradiction. Yes there’s some skill to using AI but that doesn’t feel enough to explain the gap in perception. Your point would really explain it wonderfully well, but it’s contradicted by pronouncements by major companies.
One thing I would add is that code quality is absolutely tanking. PG mentioned YC companies adopted AI generated code at Google levels years ago. Yesterday I was using the software of one such company and it has “Claude code” levels of bugginess. I see it in a bunch of startups. One of the tells is they seem to experience regressions, which is bizarre. I guess that indicates bugs with their AI generated tests.
Alternatively, it could be there’s a large swath of people out there so stupid they are proud of code your mom can somehow review and suggest improvements in despite being nontechnical.
For example I’m working on a huge data migration right now. The data has to be migrated correctly. If there are any issues I want to fail fast and loud.
Claude hates that philosophy. No matter how many different ways I add my reasons and instructions to stop it to the context, it will constantly push me towards removing crashes and replacing them with “graceful error handling”.
If I didn’t have a strong idea about what I wanted, I would have let it talk me into building the wrong thing.
Claude has no taste and its opinions are mostly those of the most prolific bloggers. Treating Claude like a peer is a terrible idea unless you are very inexperienced. And even then I don’t know if that’s a good idea.
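The fail-fast philosophy above is easy to state in code: validate every row and raise on the first anomaly instead of catching, logging, and continuing with silently wrong data. A minimal sketch (the field names and shapes are invented for illustration, not from any real migration):

```python
def migrate_rows(rows):
    """Fail fast and loud: raise on the first malformed row.

    A bad row means the migration itself is wrong, so there is no
    "graceful" fallback here; the whole run stops with a clear error.
    """
    migrated = []
    for i, row in enumerate(rows):
        if row.get("id") is None:
            # Deliberately a crash, not a warning: partial/quiet
            # migrations are worse than no migration.
            raise ValueError(f"row {i} missing id: {row!r}")
        migrated.append({"user_id": row["id"], "name": row.get("name", "")})
    return migrated
```

The point is that the `raise` is a feature, which is exactly the line an LLM will keep trying to wrap in a try/except unless you push back.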
I often think that LLMs are like a reddit that can talk. The more I use them, the more I find this impression to be true - they have encyclopedic knowledge at a superficial level, the approximate judgement and maturity of a teenager, and the short-term memory of a parakeet. If I ask for something, I get the statistical average opinion of a bunch of goons, unconstrained by context or common sense or taste.
That’s amazing and incredible, and probably more knowledgeable than the median person, but would you outsource your thinking to reddit? If not, then why would you do it with an LLM?
Love this paragraph; it's exactly how I feel about the LLMs. Unless you really know what you are doing, they will produce very sub-optimal code, architecturally speaking. I feel like a strong acumen for proper software architecture is one of the main things that defines the most competent engineers, along with naming things properly. LLMs are a long, long way from having architectural taste
Is it generating JS code for that?
I, on the other hand, am doing a new UI for an existing system, which is exactly where you want more freedom and experimentation. It's great for that!
From my observations, generally AI-generated code is average quality.
Even with average quality it can save you a lot of time on some narrowly specialized tasks that would otherwise take you a lot of research and understanding. For example, you can code some deep DSP thingie (say audio) without understanding much what it does and how.
For simpler things like backend or frontend code that doesn't require any special knowledge other than basic backend or frontend - this is where the bars of quality come into play. Some people will be more than happy with AI generated code, others won't be, depending on their experience, also requirements (speed of shipping vs. quality, which almost always resolves to speed) etc.
This is one variable I almost always see in this discussion: the more strict the rules that you give the LLM, the more likely it is to deeply disappoint you
The earlier in the process you use it (ie: scaffolding) the more mileage you will get out of it
It's about accepting fallability and working with it, rather than trying to polish it away with care
And sure, AI could “scaffold” further into controllers and views and maybe even some models, and they probably work ok. It’s then when they don’t, or when I need something tweaked, that the worry becomes “do I really understand what’s going on under the hood? Is the time to understand that worth it? Am I going to run across a small thread that I end up pulling until my 80% done sweater is 95% loose yarn?”
To me the trade-off hasn’t proven worth it yet. Maybe for a personal pet project, and even then I don’t like the idea of letting something else undeterministically touch my system. “But use a VM!” they say, but that’s more overhead than I care for. Just researching the safest way to bootstrap this feels like more effort than value to me.
Lastly, I think that a big part of why I like programming is that I like the act of writing code, understanding how it works, and building something I _know_.
Doing nonsensical things with a library? Feed it the documentation. Still busted? Make it read the source.
If you do spot checks, that is woefully inadequate. I have lost count of the number of times when, poring over code a SOTA LLM has produced, I notice a lot of subtle but major issues (and many glaring ones as well), issues a cursory look is unlikely to pick up on. And if you are spending more time going over the code, how is that a massive speed improvement like you make it seem?
And, what do you even mean by 10x the amount of work? I keep saying anybody that starts to spout these sort of anecdotes absolutely does NOT understand real world production level serious software engineering.
Is the model doing 10x the amount of simplification, refactoring, and code pruning an effective senior level software engineer and architect would do? Is it doing 10x the detailed and agonizing architectural (re)work that a strong developer with honed architectural instincts would do?
And if you tell me it's all about accepting the LLM being in the driver's seat and embracing vibe coding, it absolutely does NOT work for anything exceeding a moderate level of complexity. I have tried that several times. To this day, no model has been able to write a simple markdown viewer with certain specific features I have wanted for a long time. I really doubt the stories people tell about creating whole compilers with vibe coding.
If all you see is and appreciate that it is pumping out 10x features, 10x more code, you are missing the whole point. In my experience you are actually producing a ton of sh*t, sorry.
Spend a few hours writing context files. Spend the rest of the week sipping bourbon.
10x means you could have built something that would have taken 4 or 5 years in the time you've had since Opus 4.5 came out.
Where's your operating system, game engine, new programming language, or complex SaaS app?
Honestly, this is more of a question about the scope of the application and the potential threat vectors.
If the GP is creating software that will never leave their machine(s) and is for personal usage only, I'd argue the code quality likely doesn't matter. If it's some enterprise production software that hundreds to millions of users depend on, software that manages sensitive data, etc., then I would argue code quality should asymptotically approach perfection.
However, I have many moons of programming under my belt. I would honestly say that I am not sure what good code even is. Good to who? Good for what? Good how?
I truly believe that most competent developers (however one defines competent) would be utterly appalled at the quality of the human-written code on some of the services they frequently use.
I apply the Herbie Hancock philosophy when defining good code. When once asked what is Jazz music, Herbie responded with, "I can't describe it in words, but I know it when I hear it."
That’s the problem. If we had an objective measure of good code, we could just use that instead of code reviews, style guides, and all the other things we do to maintain code quality.
> I truly believe that most competent developers (however one defines competent) would be utterly appalled at the quality of the human-written code on some of the services they frequently use.
Not if you have more than a few years of experience.
But what your point is missing is the reason that software keeps working in the first place, or stays in a good enough state that development doesn't grind to a halt.
There are people working on those code bases who are constantly at war with the crappy code. At every place I’ve worked over my career, there have been people quietly and not so quietly chipping away at the horrors. My concern is that with AI those people will be overwhelmed.
They can use AI too, but in my experience, the tactical tornadoes get more of a speed boost than the people who care about maintainability.
> the tactical tornadoes get more of a speed boost than the people who care about maintainability.
Why are these not the same people? In my job, I am handed a shovel. Whatever grave I dig, I must lay in. Is that not common? Seriously, I am not being factious. I've had the same job for almost a decade.
The other common pattern I’ve seen goes something like this.
Product asks Tactical Tornado if he can build something. TT says sure, it will take 6 weeks. TT doesn't push back or ask questions; he builds exactly what product asks for in an enormous feature branch.
At the end of 6 weeks he tries to merge it and he gets pushback from one or more of the maintainability people.
Then he tells management that he’s being blocked. The feature is already done and it works. Also the concerns other engineers have can’t be addressed because “those are product requirements”. He’ll revisit it later to improve on it. He never does because he’s onto the next feature.
Here’s the thing. A good engineer would have worked with product to tweak the feature up front so that it’s maintainable, performant etc…
This guy uses product requirements (many that aren’t actually requirements) and deadlines to shove his slop through.
At some companies management will catch on and he’ll get pushed out. At other companies he’ll be praised as a high performer for years.
Has your output kept pace with the code? Because months-in-hours means, even pushing those ratios quite far, years in days.
Has your roadmap accelerated multiple years in the last few months in terms of verifiable results?
Honest question: how does one do that? My workflow is to create one git worktree per feature and start one session per worktree. And then I spend two hours in a worktree talking to Opus and reviewing what it is doing.
I am not a lawyer, but am generally familiar with two "is it fair use" tests.
1. Is it transformative?
I take a picture, I own the copyright. You can't sell it. But if you take a copy, and literally chop it to pieces, reforming it into a collage, you can sell that.
2. Does the alleged infringing work devalue the original?
If I have a conversation with ai about "The Lord of the Rings". Even if it reproduces good chunks of the original, it does not devalue the original... in fact, I would argue, it enhances it.
Have I failed to take into account additional arguments and/or scenarios? Probably.
But, in my opinion, AI passes these tests. AI output is transformative, and in general, does not devalue the original.
And they are making money off of other people's work. Sure, you can use mental jiujutsu to make it fair use. But fair use for LLMs means you basically copy the whole thing. All of it. It sounds more like a total use to me.
I hope the free market and technology catches up and destroys the VC backed machinery. But only time will tell.
Seriously though, I do think that is the case. It would be self-righteous to argue otherwise. It's just the scale and the nature of this, that makes it so repulsive. For my taste, copying something without permission, is stealing. I don't care what a judge somewhere thinks of it. Using someone's good will for profit is disgusting. And I hope we all get to profit from it someday, not just a select few. But that is just my opinion.
Or if you had to buy the book yourself, same thing, distributed, royalties paid.
That does seem more reasonable, but makes public libraries also evil.
For LLMs the transformative part is then removing the copyright info and serving it to you as OpenAI whatever.
Sure, you can query multiple books at the same time and the technology is godlike. But the underlying issue remains. Without the original content, the LLM is useless. Someone took all the books, fed them in, and didn't pay anything back to the authors.
I'm not sure whether you're arguing in good faith here. You could easily check this information for yourself, too. The problem is not the information itself. It's the massive machinery that steals all the works, until one day we are staring at the paywall and the artists are still not funded. I'd rather just do something nice offline in the future.
They just stole everyone's hard work over decades to make this or it wouldn't have been useful at all.
Thanks.
The fact of the matter is that for profit corporations consumed the sum knowledge of mankind with the intent to make money on it by encoding it into a larger and better organized corpus of knowledge. They cited no sources and paid no fees (to any regular humans, at least).
They are making enormous sums of money (and burning even more, ironically) doing this.
If that doesn't violate copyright, it violates some basic principle of decency.
That's vibecoding with an extra documentation step.
Also, Sonnet is not the model you'd want to use if you want to minimize cleanup. Use the best available model at the time if you want to attempt this, but even those won't vibecode everything perfectly for you. This is the reality of AI, but at least try to use the right model for the job.
> Therefore I need more time and effort with Gen AI than I needed before
Stop trying to use it as all-or-nothing. You can still make the decisions, call the shots, write code where AI doesn't help and then use AI to speed up parts where it does help.
That's how most non-junior engineers settle into using AI.
Ignore all of the LinkedIn and social media hype about prompting apps into existence.
EDIT: Replaced a reference to Opus and GPT-5.5 with "best available model at the time" because it was drawing a lot of low-effort arguments
It is NOT the way to work with humans basically because most software engineers I worked with in my career were incredibly smart and were damn good at identifying edge cases and weird scenarios even when they were not told and the domain wasn't theirs to begin with. You didn't need to write lengthy several page long Jira tickets. Just a brief paragraph and that's it.
With AI, you need to spell everything out in detail. But even that is NO guarantee, because these models are NOT deterministic in their output: same prompt, different output each time. That's why every chat box has that "Regenerate" button. So even a correct and detailed prompt might not lead to correct output. You're literally rolling dice with a random number generator.
Lastly, no matter how smart and expensive the model is, the underlying working principles are the same as GPT-2: the same transformers with RL on top, the same list of token probabilities, and the same temperature-based random sampling to pick one token to extend the output, which is fed back in to produce the next.
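The sampling mechanics described above can be sketched in a few lines. This is a toy illustration of temperature sampling in general, not any vendor's actual decoder, and the logit values are made up:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, seed=None):
    """Turn raw scores into a probability distribution and draw one token id."""
    rng = random.Random(seed)
    scaled = [l / temperature for l in logits]   # low temperature sharpens
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):                # inverse-CDF sampling
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.5]  # made-up scores for a 3-token vocabulary
# A fixed seed makes the draw reproducible; deployed chat systems
# effectively use a fresh seed per request, which is why the same
# prompt can yield a different token sequence each time.
print(sample_next_token(logits, temperature=0.8, seed=42))
```

As temperature approaches zero the distribution collapses onto the highest-scoring token, which is why "temperature 0" modes are close to (though in practice not perfectly) deterministic.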
I have no clue what AI you're using, but with both Claude and Codex you just explain the outcome, and they are pretty smart at figuring out stuff in complex codebases. You don't even need a paragraph; just say "doing this I got an error".
> NO guarantee either because these models are NOT deterministic in their output. Same prompt different output each time.
So, exactly like humans. But a bit more predictable and way more reliable.
> That's why every chat box has that "Regenerate" button.
If you're using the chat box to write code, that's a human error, not an LLM one. Don't blame "AI" for your ignorance.
> no matter how smart and expensive the model is, the underlying working principles are the same as GPT-2.
Sure. Every machine is a smoke machine if operated wrongly enough. This tells me you should not get your insight from random YT videos. As a nugget: some of the underlying working principles of the chat system also powered search engines; and their engineers also drank water, like Hitler.
I don't think anyone was claiming otherwise. Sonnet is still better at writing code than GPT-2, and worse than Opus. Workflows that work with Opus won't always work with Sonnet, just as you can't use GPT-2 in place of Sonnet to do code autocomplete.
Wait, are you doing this in the web chat interface?!
That's definitely not a good way. You need to be using a harness (like Claude Code) where the agent can plan its work, explore the codebase, execute code, run tests, etc. With this sort of set up, your prompts can be short (like 1 to 5 sentences) and still get great results.
Sure, AI output is kind of random.
But that's also basically true for humans. It's harder to "prove" humans are random, but wouldn't you think a person would do things slightly differently when given the same tasks but on different days? People change their minds a lot, it's just that there's no "reconsider" button for people so you feel a bit of social friction if you pester somebody to rethink an issue. But it's no different.
I'd be really surprised if your point is that humans, unlike AI, are super deterministic and that's why they are so much more trustworthy and smarter than AI...
It’s pretty funny to claim that a model released 22 hours ago is the bare minimum requirement for AI-assisted programming. Of course the newest models are best at writing code, but GPT-* and Claude have written pretty decent systems for six months or so, and they’ve been good at individual snippets/edits for years.
Not what I said.
The OP was trying to write specs and have an AI turn it into an app, then getting frustrated with the amount of cleanup.
If you want the AI to write code for you and minimize your cleanup work, you have to use the latest models available.
They won't be perfect, but they're going to produce better results than using second-tier models.
The OP comment was talking about Claude Sonnet. I was comparing to that.
I should have just said "use the best model available"
Nobody was talking about how much better it is until you wrote this though
It's like you're building your own windmills brick by brick
You're assuming that finding the places where AI needs help isn't already a larger task than just writing it yourself. AI can be helpful in development in very limited scenarios, but the main thrust of the comment above yours is that it takes longer to read and understand code than to write it, and AI tooling is currently focused on writing code.
We're optimizing the easy part at the expense of the difficult part - in many cases it simply isn't worth the trouble (cases where it is helpful, imo, exist when AI is helping with code comprehension but not new code production).
Not assuming anything, I'm well versed in how to do this.
Anyone who defers to having AI write massive blocks of code they don't understand is going to run into this.
You have to understand what you want and guide the AI to write it.
The AI types faster than me. I can have the idea and understand and then tell the LLM to rearrange the code or do the boring work faster than I can type it.
I think we're seeing something similar with AI: There are devs who spend a couple days trying to get AI to magically write all of their code for them and then swear it off forever, thinking they're the only people who see the reality of AI and everyone else is wrong.
It's a sort of fact of life that the easy problems are already solved; those where an extreme answer is always correct are things we no longer even consider problems. Most of the options that remain have both advantages and disadvantages, so the true answer is somewhere in the middle.
Juniors mostly behave better than what you describe; I certainly never had to correct any junior as much as the OP describes correcting the AI. And if you have 'boring code' in your codebase, maybe that signals not-so-great architecture (and I presume we aren't talking about codegens, which have existed since the 90s at least).
Also, any senior worth their salt wants to intimately understand their code; it's the only way you can guarantee correctness at all. Man, I could go on and on and pick your statements apart one by one, but that would take long.
Yes, it's quicker to do it yourself this time, but if we build out the artifacts to do a good enough job this time, next time it'll have all the context it needs to take a good shot at it, and if you get overtaken by AI in the meantime you've got an insane head start.
Which side of history are you betting on?
I'm okay not being at the bleeding edge; I can see the remains of the companies that aggressively switched to the newest best thing. Sometimes it'll pay off and sometimes it won't. I'm comfortable being a person who waits until something hits 2.0 and the advantages and disadvantages are clear before seriously considering a migration.
Read uncharitably, yeah. But you're making a big assumption that the writing of spec wasn't driven by the developer, checked by developer, adjusted by developer. Rewritten when incorrect, etc.
> You can still make the decisions, call the shots
One way to do this is to do the thinking yourself, tell it what you want it to do specifically and... get it to write a spec. You get to read what it thinks it needs to do, and then adjust or rewrite parts manually before handing off to an agent to implement. It depends on task size of course - if small or simple enough, no spec necessary.
It's a common pattern to hand off to a good instruction following model - and a fast one if possible. Gemini 3 Flash is very good at following a decent spec for example. But Sonnet is also fine.
> Stop trying to use it as all-or-nothing
Agree. Some things just aren't worth chasing at the moment. For example, in native mobile app development, it's still almost impossible to get accurate idiomatic UI that makes use of native components properly and adheres to HIG etc
I'm unsure if this is actually faster than me writing it myself, but it certainly expends less mental energy for me personally.
The real gains I'm getting are with debugging prod systems, where normally I would have to touch five different interfaces to track down an issue, I've just encompassed it all within an mcp and direct my agent on the debugging steps(check these logs, check this in the db, etc)
This sounds like an LLM talking.
Either you're a bot, or our human languages are being modified in realtime by the influence of these tools.
I was trying to explain that this isn't how successful engineers use AI. There is a way to understand the code and what the AI is doing as you're working with it.
Writing a spec, submitting it to the AI (a second-tier model at that) and then being disappointed when it didn't do exactly what you wanted in a perfect way is a tired argument.
I'm saying that if you're trying to have AI write code for you and you want to do as little cleanup as possible, you have to use the best model available.
"Ignore all of the LinkedIn and social media hype about prompting apps into existence." Absolutely. It's not hype, it's pure marketing bullshitzen.
This is based on the premise that, given a detailed plan, the model will produce exactly the same thing because the model is deterministic in nature, which is NOT the case. These models are NOT deterministic, no matter how detailed a plan you feed them. If you doubt it, give the model the same plan twice and watch something different get churned out each time.
> And honestly, I’m mostly within my Pro subscription, granted I also have ChatGPT Plus but I’ve mostly only used that as the chat/quick reference model. But yeah takes some time to read and understand everything, a lot of the time I make manual edits too.
I do not know how you can do it on a Pro plan with Claude Opus 4.7, which is 7.5x more in terms of limit consumption; any small-to-medium codebase would easily consume up to 50% of your limits in a single prompt just in the planning phase on a Pro plan (the $20/month one that they are planning to eliminate).
I also don't understand, because all I ever hear is people saying the $100 Max plan is the minimum for serious work. I made 3-4 plans today; I'm familiar with the codebase and pointed the LLM in the direction it needed to go. I described the functionality I wanted, which wasn't a huge rewrite; it touched like 4 files, of which one was just a module of pydantic models. But one plan was 30% of usage, and I had this over two sessions because I got a reset. I did read and understand every line of code, so that part takes me some time.
Get it to write a context capsule of everything we've discussed.
Chuck that in another model and chat around it, flesh out the missing context from the capsule. Do that a couple of times.
Now I have an artifact I can use to one-shot a hell of a lot of things.
This is amazing for 0-1.
For brown field development, add in a step to verify against the current code base, capture the gotchas and bounds, and again I've got something an agent has a damn good chance of one-shotting.
Stop doing that. Micromanage it instead. Don't give it the specs for the system, design the system yourself (can use it for help doing that), inform it of the general design, but then give it tasks, ONE BY ONE, to do for fleshing it out. Approve each one, ask for corrections if needed, go to the next.
Still faster than writing each of those parts yourself (a few minutes instead of multiple hours), but much more accurate.
"We have this thing that can speed your code writing 10x"
"If it isn't 1000x and it doesn't give me a turnkey end to end product might as well write the whole thing myself"
People have forgotten balance. Which is funny, because the inability of the AI to just do the whole thing end to end correctly is what stands between 10 developers having a job versus 1 developer having a job telling 10 or 20 agents what to do end to end and collecting the full results in a few hours.
And if you do it the way I describe you get to both use AI, AND have "a much better understanding of the codebase (and way better code)".
Unless coding is most of your job, which is rare, you’re giving up really knowing what your software does in order to achieve a very minor speed up. Just to end up having to spend way more time later trying to understand the AI generated code when inevitably something breaks.
> And if you do it the way I describe you get to both use AI, AND have "a much better understanding of the codebase (and way better code)".
Using AI is not a goal in itself, so I don’t care about “getting to use AI”. I care about doing my job as efficiently as possible, considering all parts of my job, not just coding.
This is hardly a surprise, no? No matter how much training we run, we are still producing a generative model. And a generative model doesn't understand your requirements and cross them off. It predicts the next most likely token from a given prompt. If the most statistically plausible way to finish a function looks like a version that ignores your third requirement, the model will happily follow through. There's really no rules in your requirements doc. They are just the conditional events X in a glorified P(Y|X). I'd venture to guess that sometimes missing a requirement may increase the probability of the generated tokens, so the model will happily allow the miss. Actually, "allow" is too strong a word. The model does not allow shit. It just generates.
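The "glorified P(Y|X)" point can be made concrete with a toy next-word model. This is a deliberately tiny bigram counter, purely illustrative; real models condition on vastly longer contexts:

```python
from collections import Counter, defaultdict

# Toy training corpus; raw counts stand in for a learned distribution.
corpus = "the model predicts the next token and the next token only".split()

transitions = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    transitions[cur][nxt] += 1            # count bigram transitions

def most_likely_next(word):
    """Greedy decode: emit the statistically most plausible continuation."""
    return transitions[word].most_common(1)[0][0]

# "the" is followed by "next" twice and "model" once in the corpus,
# so the model emits "next" -- not because it satisfies any requirement,
# but because that continuation has the highest conditional probability.
print(most_likely_next("the"))
```

If the statistically plausible continuation happens to drop your third requirement, nothing in this machinery objects; that's the point above, scaled down to a dozen words.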
If you are seeing an agent missing tasks, work with it to write down the task list first and then hold it accountable to completing them all. A spec is not a plan.
I ask the model to rename MyClass to MyNewClass. It will generate a checklist like:
- Rename references in all source files
- Rename source/header files
- Update build files to point at new source files
Then it will do those things in that order.
Now you can re-run it but inject the start of the model's response with the order changed in that list. It will follow the new order. The list plainly provides real information that influences future predictions and isn't just a facade for the user.
Are you seriously saying that breaking a large, complex problem down into its constituent steps, and then trying to solve each one as an individual problem, is just a sensation of rigour?
Edit: I'll give you another example that I realized because someone pointed it out here: when the stupid bot tells you why it fucked up, it doesn't actually understand anything about itself - it's just generating the most likely response given the enormous amount of pontification on the internet about this very subject...
Whilst I can't usually start from the exact same point in the decisioning, I can usually bootstrap a new session. It's not all ephemeral.
To your edit: I find that the most galling thing about finding out about the thinking being discarded at cache clear. Reconstruction of the logical route it took to get to the end state is just not the same as the step by step process it took in the first place, which again I feel counters your "feelies".
There's a really simple solution to this galling sensation: simply always keep in mind it's a stupid GenAI chat bot.
I'm not having the same problem as you and I follow a very similar methodology. I'm producing code faster and at much higher quality with a significant reduction in strain on my wrists. I doubt I'm typing that much less, but what I am typing is prose which is much more compatible with a standard QWERTY keyboard.
I think part of it is that I'm not running forward as fast as I can and I keep scope constrained and focused. I'm using the AI as a tool to help me where it can, and using my brain and multiple decades of experience where it can't.
Maybe you're expecting too much and pushing it too hard/fast/prematurely?
I don't find the code that hard to read, but I'm also managing scope and working diligently on the plans to ensure it conforms to my goals and taste. A stream of small well defined and incremental changes is quite easy to evaluate. A stream of 10,000 line code dumps every day isn't.
I bet if you find that balance you will see value, but it might not be as fast as you want, just as fast as is viable which is likely still going to be faster than you doing it on your own.
Have you tried Opus 4.6 with "/effort max" in Claude Code? That's pretty much all I use these days, and it is, honestly, doing a fantastic job. The code it's writing looks quite good to me. Doesn't seem to matter if it's greenfield or existing code.
If code is harder to read than to write, you're doing yourself a disservice by having the output stage not be top shelf.
Feels crazy to me for people to use anything other than the best available.
Not everyone has unlimited budgets to burn on tokens.
Like, there is no way in the world that Gen AI is faster than an actual cracked coder shooting off the exact bash/sql commands he needs to explore and writing a proper intent-communicating abstraction.
I’m thinking the difference is in order of magnitudes.
On top of that it adds context loss, risk of distraction, the extra work of reading after the job is done + you’ll have less of a mental model no matter how good you read, because active > passive.
Man, it was really the weirdest thing when Claude Code started hiding more and more changes. That's what you need: staying closely in the loop.
The coding window alone makes mistakes, duplicates code, and doesn't follow the patterns. The reviewer catches most of this, and the coder fixes them all after rationalizing them.
Works pretty well for me. This model is somewhat institutionalized in my company as well.
I use CC Opus 4.7 or Codex GPT 5.4 High (more and more codex off late).
Maybe it was Timothy Gowers who commented on this.
Lots of human proofs have the unfortunate "creative leap" that isn't fully explained but has some detectable subtlety. LLMs end up making large leaps too, but too often the subtle ways mathematicians think and communicate are lost, and so the proof becomes much more laborious to check.
Like you don’t always see how a mathematician came up with some move or object to “try”, and to an LLM it appears random large creative leaps are the way to write proofs.
[1]: https://github.com/ultraworkers/claw-code
This may be worth trying out.
Just saying that I know a lot of people like to raw dog it and say plugins and skills and other things aren't necessary, but in my case I've had good success with this.
I feel like I have easily multiplied my productivity because I do not really have to read more than a single chat response at a time, and I am still familiar with everything in my apps because I wrote everything.
I've been working on Window Manager + other nice-to-haves for macOS 26. I do not need a model to one-shot the program for me. However, I am thrilled to get near instantaneous answers to questions I would generally have to churn through various links from Google/StackOverflow for.
Dude! The amount of ad-hoc, interface-specific DTOs that LLM coding agents define drives me up the wall. Just use the damn domain models!
You then spend months cleaning it up.
Could just have written it by hand from scratch in the same amount of time.
But the benefit is not having to type code.
Well, there's your problem. Why aren't you using the best tool for the job?
The last two paragraphs, however, show what happens when people start trying to use inductive reasoning -- and that part is really hard: ...
> Therefore I need more time and effort with Gen AI than I needed before because I need to read a lot of code, understand it and ensure it adheres to what mental model I have.
I don't disagree that the above is reasonable to say. But that isn't all -- not even most -- of what needs to be said. The rate of change is high, and the amount of adaptation required is hard. This, in a nutshell, is why asking humans to adapt to AI is going to feel harder and harder. I'm not criticizing people for feeling this. But I am criticizing the one-sided logic people often reach for.
We have a range of options in front of us:
(A) might start by sounding like venting. Done well it progresses into clearer understanding and hopefully even community building towards action plans: [1]> Hence Gen AI at this price point which Anthropic offers is a net negative for me because I am not vibe coding, I'm building real software that real humans depend upon and my users deserve better attention and focus from me hence I'll be cancelling my subscription shortly.
The above quote is only valid under some pretty strict (implausible) assumptions: (1) "GenAI" is a valid generalization for what is happening here; (2) the person cannot learn and adapt; (3) the technology won't get better.
[1]: I'm at heart more of a "let's improve the world" kind of person than "I want to build cool stuff" kind of person. This probably causes some disconnect in some interactions here. I think some people primarily have other motives.
Some people cancel their subscriptions and kind of assume "the market and public pushback will solve this". The market's reaction might be too slow or too slight to actually help much. Some people put blind faith into markets helping people on some particular time scales. This level of blind faith reminds me of Parable of the Drowning Man. In particular, markets often send pretty good signals that mean, more or less, "you need to save yourself, I'm just doing my thing." Markets are useful coordinating mechanisms in the aggregate when functioning well. One of the best ways to use them is to say "I don't have enough of a cushion or enough skills to survive what the market is coordinating" so I need a Plan B!
Some people go further and claim markets are moral by virtue of their principles; this becomes moral philosophy, and I think that kind of moral philosophy is usually moral confusion. Broadly speaking, in practice, morality is a complex human aspiration. We probably should not abdicate our moral responsibilities and delegate them to markets, any more than we would say "Don't worry, people who need significant vision correction (or face some other barrier to modern life)... evolution will 'take care' of you."
One subscription cancellation is a start (if you actually have a better alternative, and that alternative is better for the world... which is debatable given the current set of alternatives!)
Talking about it, e.g. here on HN, might be one place to start. But HN is also kind of a "where frustration turns into entertainment, not action" place, unfortunately. Voting is cheap. Karma sometimes feels more like a measure of conformance than of quality thinking. I often feel like I am doing better when I write thoughtfully and still get downvotes -- maybe it means I got some people out of their comfort zone.
Here's what I try to do (but fail often): Do the root cause analysis, vent if you need to, and then think about what is needed to really fix it.
[2]: https://en.wikipedia.org/wiki/Parable_of_the_drowning_man
[3]: The first four are:
The market-leading technology is pretty close to "good enough" for how I'm using it. I look forward to the day when LLM-assisted coding is commoditized. I could really go for an open source model based on properly licensed code.
(but I guess they're not really conflicting, if the "solution" involves upgrading to a higher plan)
1/ Claude Code with yolo mode
2/ superpowers plugin
3/ red/green tdd
4/ a lot of planning and requirements before writing any code
It feels like you're always touching this edge between the capability of the models and your current workflow. Delegate a more complex task and the system fails; delegate a simpler one and the system works great. Improve your workflow, and you move this complexity to a higher level.
This seems to be a good window where I can implement a pretty large feature and then go through and address structural issues. Goofy things like the agent adding an extra database, weird fallback logic where it ends up building multiple systems in parallel, etc.
Currently, I find multiple agents in parallel on the same project not super functional. There are just a lot of weird things: agents get confused about worktrees, git conflicts abound, and I found the administrative overhead too heavy. I think plenty of people are working on streamlining the orchestration issue.
In the mean time, I combat the ADD by working on a few projects in parallel. This seems to work pretty well for now.
It's still cat herding, but the thing is that refactors are now pretty quick. You just have to have awareness of them
I was thinking it'd be cool to have an IDE that did coloring of, say, the last 10 git commits to a project so you could see what has changed. I think robust static analysis and code as data tools built into an IDE would be powerful as well.
The agents basically see your codebase fresh every time you prompt. And with code changes happening much more regularly, I think devs have to build tools with the same perspective.
To give them the benefit of doubt, perhaps these people provide such detailed spec that they basically write code in natural language.
That said, looking at the way things work in big companies, AI has definitely made it so one senior engineer with decent opinions can outperform a mediocre PM plus four engineers who just do what they're told.
Like yesterday? LLM-assisted coding is $100/mo. It looks very commoditized when most households in the developed world pay more than that for electricity.
My definition of LLM-assisted coding is that you fully understand every change and every single line of the code. Otherwise it's vibe coding. And I believe if one is honest to this principle, it's very hard to deplete the quota of the $100 tier.
But, it's not $100/mo. I think the best showcase of where AI is at is on the generative video side. Look at players like Higgsfield. Check out their pricing and then go look at Reddit for actual experiences. With video generation the results are very easy to see. With code generation the results are less clear for many users. Especially when things "just work".
Again, it's not $100/month for Anthropic to serve most uses. These costs are still being subsidized, and as more expensive plans roll out with access to "better" models and *more* tokens and context, the true cost per user is slowly starting to be exposed. I routinely hit limits with Anthropic that I hadn't been hitting for the same (and even less) utilization. I dumped the Pro Max account recently because the value wasn't there anymore. I'm convinced that Opus 3 was Anthropic's pinnacle at this point, and while today's SotA models are good, they're tuned to push people towards paying for overages at a significantly faster consumption rate than a right-sized plan for their usage would.
The reality is that nobody can afford to continue to offer these models at the current price points and be profitable at any time in the near future. And it's becoming more and more clear that Google is in a great position to let Anthropic and OAI duke it out with other people's money while they have the cash, infrastructure and reach to play the waiting game of keeping up but not having to worry about all of the constraints their competitors do.
But I'd argue that nothing has been commoditized as we have no clue what LLMs cost at scale and it seems that nobody wants to talk about that publicly.
Video is a different ballgame entirely: it's slower than realtime even on _large_ GPUs. Moreover, because of the inter-frame consistency, it's really hard to transfer and keep context.
Running inference on text is, or can be, very profitable. It's the research and dev that's expensive.
I'm probably just not being charitable enough to what you mean, but that's an absurd bar that almost nobody conforms to, even for fully handwritten code. Nothing would get done if they did. But again, my emphasis is on the fact that I'm probably just not being charitable to what you mean.
They just mean they can more or less follow along with what the code is doing. You don't need to be very charitable in order to understand what he genuinely meant, and understanding code that one writes is how many (but not all) professional software developers who didn't just copy and paste stuff from Stackoverflow used to carry out their work.
How deeply do I need to understand range() or print() to utilize either, on the slightly less extreme end of the spectrum?
But yeah, I'm pretty sure it's a point that maybe I could have kept to myself and been charitable instead.
print(X) is a great example. That's going to print X. Every time.
Agent.print(x) is pretty likely to print X every time. But hey, who knows, maybe it's having an off day.
Jeff Atwood, along with numerous others (whom Atwood cites on his blog [1]), was not exaggerating when he observed that the majority of candidates with existing professional experience, and even MSc degrees, were unable to code very simple solutions to trivial problems.
[1] https://blog.codinghorror.com/why-cant-programmers-program/
That's how I read it, and I would agree with that.
If it's low-stakes, then the required depth to accept the code is also low.
Obviously I don't mean "understanding it so you can draw the exact memory layout on the white board from memory."
I anticipate a Napster-style reckoning at some point when there's a successful high-profile copyright suit around obviously derivative output. It will probably happen in video or imagery first.
this is a small nit, but you still have to pay your electric bill, the $100/mo is on top of that. if you're doing cost accounting you don't want to neglect any costs. Just because you can afford to lease a car, doesn't mean you can afford to lease a 2nd car.
But I and others in my company have very heavy usage. We only rarely, with parallel agentic processes, run out of the $200 a month plan.
And what do I mean by "hard"? I mean, it requires a lot of active thinking to think about how you can actively max it out. I'm sure there's some use cases where maybe it is not hard to do this, but in general, I find most devs can't even max out the $100 a month plan, because they haven't quite figured out how to leverage it to that degree yet.
(Again, if someone is using the API instead of subscription, I wouldn't be surprised to see $2,000 bills.)
You can use a Max subscription for work, btw.
I assume you meant loss-leader. We can’t know that without knowing their financials. The actual marginal cost of inference is demonstrably less than $200/mo though, so it’s not clear whether they are operating at a loss. Without seeing their books we can’t know.
I find it incredibly difficult to saturate my usage. I'm ending the average week at 30-ish percent, despite this thing doing an enormous amount of work for (with?) me.
Now I will say that with pro I was constantly hitting the limit -- like comically so, and single requests would push me over 100% for the session and into paying for extra usage -- and max 5x feels like far more than 5x the usage, but who knows. Anthropic is extremely squirrely about things like surge rates, and so on.
I'm super skeptical of the influx of "DAE think Opus sucks now. Let's all move to Codex!" nonsense that has flooded HN. A part of it is the ex-girlfriend thing where people are angry about something and try to force-multiply their disagreement, but some of it legitimately smells like astroturfing. Like OpenAI got done paying $100M for some unknown podcaster and started hiring people to write this stuff online.
Recently I've gotten Qwen 3.6 27b working locally and it's pretty great, but it still doesn't match Opus; I've got to check out that new Deepseek model sometime.
>I'm super skeptical of the influx of "DAE think Opus sucks now. Let's all move to Codex!" nonsense that has flooded HN. A part of it is the ex-girlfriend thing where people are angry about something and try to force-multiply their disagreement, but some of it legitimately smells like astroturfing. Like OpenAI got done pay $100M for some unknown podcaster and start hiring people to write this stuff online.
A lot of people are angry about the whole openclaw situation. They are especially bitter that when they attempted to justify exfiltrating the OAuth token to use for openclaw, nobody agreed that they had the right to do so, and people sided with Claude that different limits for first-party use are standard. So they create threads like this and complain about some opaque reason why Anthropic is finished (while still keeping their subscription, of course).
I did a 1:1 map of all my Claude Code skills, and it feels like I never left Opus.
Super happy with the results.
For my use-case, I want the providers to get my tokens as long as they plan to keep releasing open-weight models
Kimi wants my phone number on signup so a no-go for me.
Claude's uptime is terrible. The uptime of most other providers is even worse...and you get all the quantization, don't know what model you are actually getting, etc.
I'm just getting a bit tired of using Opus 2.6, which eats my whole allowance and then some £££, going through the 4kB prompt to review a ~13 kB text file twice - and that's on top of the sometimes utter bonkers, bad, lazy answers I'm not getting even from the local Gemma 4 E4B.
I also created a mini framework so it can test that the skills are actually working after implementation.
Everything runs perfectly.
It does seem like the sweet spot between WallE and the destroyed earth in WallE.
I'm a BSD-style Open Source advocate who has published a lot of Apache-licensed code. I have never accepted that AI companies can just come in and train their models on that code without preserving my license, just allowing their users to claim copyright on generated output and take it proprietary or do whatever.
I would actually not mind licensing my work in an LLM-friendly way, contributing towards a public pool from which generated output would remain in that pool. Perhaps there is opportunity for Open Source organizations to evolve licenses to facilitate such usage.
For what it's worth, I would be happy to pay for a commercial LLM trained on public domain or other properly licensed works whose output is legitimately public domain.
slow and steady is worth exponentials. keep slopppping it my boid.
For reasons that continue to elude me, almost exactly one year ago, Anthropic cancelled my Claude Pro plan. To appeal, you must fill out a Google Docs form. And wait. In my case, I’ve waited for about one year. Once I managed to email with a human, but they quickly plugged that hole with a chatbot that sends you back to their never-to-be-reviewed form. No route to escalate.
A year gives one a long time to think about things. Maybe it was because I was on a VPN temporarily. Otherwise, no clue. I’m a hobbyist embedded developer. That’s it.
So no, Anthropic support isn’t just poor; it’s nonexistent.
For now. That doesn't really change the risk, that just means they are all hyper competitive right this moment, and so they are comparable. If one of them becomes king of the hill, nothing stops them from silently degrading or jacking prices.
The only shield is to not be dependent in the first place. That means keeping your skills sharp and being willing to pass on your knowledge to juniors, so they aren't dependent on these things.
Of course, many people are building their business on huge AI scaffolding. There's nothing they can do.
They won't ever be SOTA due to money, but "last year's SOTA" when it costs 1/4 or less, may be good enough. More quantity, more flexibility, at lower edge quality. It can make sense. A 7% dumber agent TEAM Vs. a single objectively superior super-agent.
That's the most exciting thing going on in that space. New workflows opening up not due to intelligence improvements but cost improvements for "good enough" intelligence.
Why should anyone waste time on poorer results? I'd rather pay my $200/mo because my time matters. I'm not a poor college student anymore, and I need more return on my time.
I'm not shitting on open weights here - I want open source to win. I just don't see how that's possible.
It's like Photoshop vs. Gimp. Not only is the Gimp UX awful, but it didn't even offer (maybe still doesn't?) full bit depth support. For a hacker with free time, that's fine. But if my primary job function is to transform graphics in exchange for money, I'm paying for the better tool. Gimp is entirely a no-go in a professional setting.
Or it's like Google Docs / Microsoft Office vs. LibreOffice. LibreOffice is still pretty trash compared to the big tools. It's not just that Google and Microsoft have more money, but their products are involved in larger scale feedback loops that refine the product much more quickly.
But with weights it's even worse than bad UX. These open weights models just aren't as smart. They're not getting RLHF'd on real world data. The developers of these open weights models can game benchmarks, but the actual intelligence for real world problems is lacking. And that's unfortunately the part that actually matters.
Again, to be clear: I hate this. I want open. I just don't see how it will ever be able to catch up to full-featured products.
The trick is going to be recognizing tasks which have some ceiling on what they need and which will therefore eventually be doable by open models, and those which can always be done better if you add a bit more intelligence.
I'm not disagreeing per se, but if you think the benchmarks are flawed and "my real world usage" is more reflective of model capabilities, why not write some benchmarks of your own?
You stand to make a lot of money and gain a lot of clout in the industry if you've figured out a better way to measure model capability, maybe the frontier labs would hire you.
This kind of rhetoric is not helpful. If you want to make a point, then make one, but this adds nothing to the conversation. Maybe open source models don't work for you. They work very well for me.
The gap has been shrinking with each release, and the SOTA has already run into diminishing returns for each extra unit of data+computation it uses.
Do you really want to bet that the gap will not eventually be a hairs breadth?
Because in almost no real-world project is "programming time" the limiting factor?
When was the last time you used any of them? Because a lot of people are actively using them for 9-5 work today; I count myself in that group. That opinion feels outdated, like it was formed a year or more ago and held onto. Or based on highly quantized versions and/or small non-Thinking models.
Do you really think Qwen3.6 for a specific example is "50%" as good as Opus4.7? Opus4.7 is clearly and objectively better, no debate on that, but the gap isn't anywhere near that wide. I'd call "20%" hyperbole, the true difference is difficult to exactly measure but sub-10% for their top-tier Thinking models is likely.
Sure, we use Google Drive, too, but that's just for sharing documents across offices, not for everyday use. For that, the open source model is a clear winner in my book.
So the starting point is Opus 4.7 pricing and we're contrasting alternatives near the top end (offered across multiple providers).
Also I said 20% was hyperbole, meaning far too high.
Those closed weight models aren't available like we're discussing. They're only available from the vendor that created them.
The breakeven at this price is 6 minutes of productivity per work day for an engineer making $200k.
Are you suggesting that someone making $20k should be spending $200/mo on Claude?
If you pay someone $20,000 for labor, and they save 65 minutes worth of labor per day using a $200/mo Claude subscription, you are better off buying the Claude subscription.
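That arithmetic is easy to sanity-check. A sketch, assuming roughly 250 work days a year and 8-hour days (assumptions of mine, not stated in the thread):

```python
# Breakeven minutes of saved labor per work day for a $200/mo subscription.
# Assumes 250 work days/year and 8-hour days; these are illustrative numbers.
SUB_PER_MONTH = 200
WORK_DAYS = 250
HOURS_PER_DAY = 8

def breakeven_minutes(annual_salary):
    """Minutes of labor per work day the subscription must save to pay for itself."""
    hourly = annual_salary / (WORK_DAYS * HOURS_PER_DAY)  # $/hour
    sub_per_day = SUB_PER_MONTH * 12 / WORK_DAYS          # $/work day
    return sub_per_day / hourly * 60

print(round(breakeven_minutes(200_000), 1))  # 5.8 -- the "~6 minutes" figure
print(round(breakeven_minutes(20_000), 1))   # 57.6 -- roughly the "65 minutes" claim
```

The small gap between 57.6 and the quoted 65 minutes just reflects slightly different work-day assumptions.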
You've got the real insight with this claim.
This is the way the world is moving. Open source isn't even going where the ball is being tossed. There is no leadership here.
You're spot on.
If the cost to deliver a unit of business automation is: (A) human labor, (B) cheap or free hobbyist-grade models, (C) mid-priced models, or (D) expensive frontier models...
The one that will get picked is option "D". Your poor college students and hobbyists will be on option "B". But this won't be as productive, as evidenced by the human labor input costs.
Option "C" will begin to disappear as models/compute get more expensive and capable.
Option "A" will be nonviable. Humans just won't be able to keep up.
Open source strictly depends on models decreasing their capability gap. But I'm not seeing it.
Targeting home hardware is the biggest smell. It shows that this is non-serious hobby tinkering and has no real role in business.
For open source to work and not to turn into a toy, the models need to target data center deployment.
The real money in this market, though, is going to be made in the C suite, and they don't really care about the model. They don't care if it's open source, closed source, or what it is. They don't want to buy a model. They're interested in buying a solution to their problems. They're not going to be afraid of a software price tag -- any number they spend on labor is far more.
Labor is something like 50%+ of the Fortune 500's operating expenses -- capturing any chunk of this is a ridiculous sum of money.
Who said so? GLM 5.1 is 90% of Opus, at least. Some people are quite happy with Kimi 2.6 too. I did not try Deepseek 4 yet but I'm also hearing it is as good as Opus. You might be confusing open source models with local models. It is not easy to run a 1.6T model locally, but they are not 50% of SOTA models.
Edit: the replies to my comment are great examples of what I’m talking about when I say it’s hard to determine what hardware I’d need :).
Hooking up Claude Code to it is trivial with omlx.
https://github.com/jundot/omlx
Starting closer to 40k if you want something that's practical. 10k can't run anything worthwhile for SDLC at useful speeds.
(If you are willing to let the machine work mostly overnight/unattended, with only incidental and sporadic human intervention, you could even decrease that memory requirement a bit.)
Also, I don't know of a general solution to streaming models from disk. Is there an inference engine that has this built-in in a way that is generally applicable for any model? I know (I mean, I've seen people say it, I haven't tried it) you can use swap memory with CPU offloading in llama.cpp, and I can imagine that would probably work...but definitely slowly. I don't know if it automatically handles putting the most important routing layers on the GPU before offloading other stuff to system RAM/swap, though. I know system RAM would, over time, come to hold the hottest selection of layers most of the time as that's how swap works. Some people seem to be manually splitting up the layers and distributing them across GPU and system RAM.
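For a rough sense of the manual splitting people are doing, here is a toy estimate of how many layers fit in VRAM. Every number is invented for illustration, and real layers are not uniformly sized (embeddings, router layers, and KV cache all complicate the picture):

```python
def gpu_layer_split(model_gb, n_layers, vram_gb, overhead_gb=2.0):
    """Estimate how many transformer layers fit in VRAM, assuming uniformly
    sized layers and reserving headroom for KV cache and activations."""
    per_layer_gb = model_gb / n_layers
    usable = max(vram_gb - overhead_gb, 0)
    return min(n_layers, int(usable // per_layer_gb))

# Hypothetical 40 GB quantized model with 60 layers on a 24 GB card:
print(gpu_layer_split(40, 60, 24))  # 33
```

In llama.cpp the resulting count is roughly what you'd pass to `--n-gpu-layers` (`-ngl`), then adjust downward if you OOM.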
Have you actually done this? On what hardware? With what inference engine?
[†] The latest Qwen 3.6 whatever has been a noticeable improvement, and I'm not even at the point where I tweak settings like sampling, temperature, etc. No idea what that stuff does, I just use the staff picks in LM Studio and customize the system prompts.
So you can run 1 agent locally on $1k to $3k hardware
They can run a fleet of thousands
Yes, it's possible to run tiny quantized models, but you're working with extremely small context windows and tons of hallucinations. It's fun to play with them, but they're not at all practical.
Practical? Maybe not (unless you highly value privacy) because you can get better models and better performance with cheap API access or even cheaper subscriptions. As you said, this may indefinitely be the case.
Yes, a lot better, but still terribly unreliable and far less capable than the big unquantized models.
But, so far, competition remains fierce. Anthropic still has the best tools for writing code. That lead is smaller than it's ever been, though. But, honestly, Opus 4.5 is when it got Good Enough. If Anthropic suddenly increased prices beyond what I'm willing to pay, any model that gives me Opus 4.5 or better performance is good enough for the vast majority of the work I do with agents. And, there are a bunch of models at that level, now maybe including some discount Chinese models. Certainly Gemini Pro 3.1 is on par with Opus 4.5. Current Codex is better than Opus 4.5 and close to Opus 4.7 (though I won't use OpenAI because I don't trust them to be the dominant player in AI).
I often switch agents/models on the same project because I like tinkering with self-hosted and I like to keep an eye on the most efficient way to work...which model wastes less of my time on silly stuff. Switching is literally nothing; I run `gemini` or `copilot` or `hermes` instead of `claude`. There's simply no deep dependency on a specific model or agent. They're all trying to find ways to make unique features for people to build a dependence on, of course, but the top models are all so fucking smart you can just tell them to do whatever thing it is that you need done. That feature could probably be a skill, whatever it is, and the model can probably write the skill. Or, even better, it could be actual software, also written by the model, rather than a set of instructions for the model to interpret based on the current random seed.
Currently, the only consistent moat is making the best model. Anthropic makes the best model and tools for coding, but that's a pretty shallow moat...I could live with several other models for coding. I'll gladly pay a premium for the best model and tools for coding, but I also won't be devastated if I suddenly don't have Claude Code tomorrow. Even open models I can host myself are getting very close to Good Enough.
Competition (OpenAI vs Anthropic is fun to watch) and open source will get us there soon I think.
Not the best argument.
Also there is nothing without dependencies. Loose coupling means coupling.
AI tools... do what you already do, sometimes faster, sometimes worse, usually both depending on the task.
There's a massive gap of necessity between them.
Until very recently, local models have been little more than brittle toys in my experience, if you're trying to use them for coding.
But lately I've been running Pi (minimal coding agent harness) with Gemma4 and Qwen3.6 and I've been blown away by how capable and fast they are compared to other models of their size. (I'm using the biggest that can fit into 24gb, not the smaller ones.) In fact, I don't really need to reach for Claude and friends much of the time (for my use cases at least).
but then two months ago 4.6 started getting forgetful and making very dumb decisions and so on. Everyone started comparing notes and realising it wasn’t “just them”. And 4.7 isn’t much better and the last few weeks we keep having to battle the auto level of effort downgrade and so on. So much friction as you think “that was dumb” and have to go check the settings again and see there has been some silent downgrade.
We all miss the early days of 4.6, which just shows you can have a good, useful model. LLMs can be really powerful, but in delivering it to the mass market Anthropic throttles and downgrades it into something not useful.
My thinking is that soon deepseek reaches the more-than-good-enough 4.6+ level and everyone can get off the Claude pay-more-for-less trajectory. We don’t need much more than we’ve already had a glimpse of and now know is possible. We just need it in our control and provisioned not metered so we can depend upon it.
https://www.anthropic.com/engineering/april-23-postmortem
Of course, it sucks when companies screw up ... but at the same time, they "paid everyone back" by removing limits for a while, and (more importantly to me) they were transparent about the whole thing.
I have a hard time seeing any other major AI provider being this transparent, so while I'm annoyed at Claude ... I respect how they handled it.
https://www.anthropic.com/engineering/a-postmortem-of-three-...
I think there's a certain amount of running with scissors going on here. I appreciate the transparency, but the time to remediation here seems pretty long compared to the rate of new features.
I recall reading similar tales of woe with other providers here on HN. I think the gradual dialling back of capability as capacity becomes strained as users pile on is part of the MO of all the big AI companies.
API Error: Claude's response exceeded the 32000 output token maximum. To configure this behavior, set the CLAUDE_CODE_MAX_OUTPUT_TOKENS environment variable.
That’s a hallucination. All they did was hide thinking by default. Quick Google search should easily teach you how to turn it back on (I literally have it enabled in my harness).
Whoever is their product manager should be embarrassed at the UX they provide.
Please. This is a toy. A novel little tech-toy. If you depend on it now for doing your job then, frankly, you deserve to have your rug pulled now and then.
If you didn't try to use it to work for you, that's okay, but maybe try once more? It does work and adds value. It's a non-standard and weirdly flexible tool with limitations.
...but in retrospect, seeing how you finished your comment, maybe you really want to remain angry and misinformed.
GPT 5.4+ takes its time, unprompted considers edge cases that in fact turn out to be correct, saves me subsequent error-hunting turns, and finally delivers. Plus no "this doesn't look like malware" or "actually wait" thinking loops for minutes over a one-liner script change.
GLM always feels like it's doing things smarter, until you actually review the code. So you still need the build/prune cycle. That's my experience anyway.
But now I just use Codex. Claude is unreliable and leaves data races all over and leaves, as you say, negative conditions unhandled fairly consistently.
Now I'm looking for an extremely simple open-source coding agent. Nanocoder doesn't seem to install on my Mac and it brings node-modules bloat, so no. Opencode seems not quite open-source. For now, I'm doing the work of the coding agent myself and using the llama_cpp web UI. Chugging along fine.
Even the FSF recognizes that non-copyleft licenses still follow the Freedoms, and therefore are still Free Software.
On launch, it checks for updates and autoupdates.
I got annoyed enough with Anthropic's weird behavior this week to actually try this, and got something workable up & running in a few days. My case was unique: there's no Claude Code for BeOS, or my older / ancient Macs, so it was easier to bootstrap & stitch something together if I really wanted a coding agent on those platforms. You'll learn a lot about how models actually work in the process too, and how much crazy ridiculous bandaid patching is happening in Claude Code. Though you might also appreciate some of the difficulties that the agent / harnesses have to solve too. (And to be clear, I'm still using CC when I'm on a platform that supports it.)
As for the llama_cpp vs Claude Code delays - I've run into that too. My theory is API is prioritized over Claude Code subscription traffic. API certainly feels way faster. But you're also paying significantly more.
However, it's hard to justify Cursor's cost. My bill was $1,500/mo at one point, which is what encouraged me to give CC a try.
AI companies have the same incentive. Make it cheaper and people will use it more, making you more money (assuming your price is still above cost). And of course they have every reason to reduce their own costs.
Since the price they are charging is still way, way above their operating costs there's no surprise really that they end up making more from small price reductions.
If competition drove them to reduce costs to the point where their operating costs started to be a large factor, the paradox would disappear.
It's like dating apps. They don't want you to find a good match, because then you cancel the subscription.
Speaking of which:
https://www.cnbc.com/2026/04/24/deepseek-v4-llm-preview-open...
Less spend means less real cost to the provider while your flat monthly subscription stays the same price. As well, reducing token use per customer means you can over-subscribe even harder, allowing for more flat monthly subscriptions.
Fewer tokens = more free capacity = more subscription income.
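The incentive is easy to see in a toy model (all numbers invented for illustration): with a flat subscription price and a per-token serving cost, any reduction in per-user token spend drops straight to the margin.

```python
def monthly_margin(subscribers, price, tokens_per_user, cost_per_mtok):
    """Margin for a flat-price subscription with per-token serving costs."""
    revenue = subscribers * price
    cost = subscribers * (tokens_per_user / 1e6) * cost_per_mtok
    return revenue - cost

# Hypothetical: 1000 subscribers at $200/mo, $3 per million tokens served.
print(monthly_margin(1000, 200, 50_000_000, 3.0))  # 50000.0
print(monthly_margin(1000, 200, 25_000_000, 3.0))  # 125000.0 with half the tokens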
> “you can’t be serious — is this how you fix things? just WORKAROUNDS????”
If this is how you're interacting with your agents I think you're in for a world of disappointment. An important part of working with agents is providing specific feedback. And beyond that, making sure this feedback is actually available to them in their context when relevant.
I will ask them why they made a decision and review alternatives with them. These learnings will aid both you and the agent in the future.
I haven't seen anyone mention this publicly, but I've noticed that the same model will give wildly different results depending on the quantization. 4-bit is not the same as 8-bit and so on in compute requirements and output quality. https://newsletter.maartengrootendorst.com/p/a-visual-guide-...
I'm aware that frontier models don't work in the same way, but I've often wondered if there's a fidelity dial somewhere that's being used to change the amount of memory / resources each model takes during peak hours v. off hours. Does anyone know if that's the case?
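On the first point, the memory difference alone is easy to quantify. A rough lower bound on weight footprint (real formats like GGUF k-quants store per-block scales, so actual files are somewhat larger):

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB: params x bits / 8.
    Ignores quantization metadata and the KV cache."""
    return params_billion * bits_per_weight / 8

# A hypothetical 70B-parameter model at three common quantizations:
for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: ~{weights_gb(70, bits):.0f} GB")
```

So halving the bit width halves the footprint, which is exactly why a provider under load would be tempted by that dial, whatever the quality cost.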
I tried Kimi 2.6 and it's almost comparable to Opus. Anthropic dropped the ball. I hope this is a sign that we are moving towards a future where model usage is a commodity with heavy competition on price/performance
How much you trust any particular provider's claim to not retain data is subjective though.
One group is consistently trying to play whack-a-mole with different models/tools and prompt engineering and has shown a sine-wave of success.
The other group, seemingly made up of architects and Domain-Driven Design adherents has had a straight-line of high productivity and generating clean code, regardless of model and tooling.
I have consistently advised all GenAI developers to align with that second group, but it’s clear many developers insist on the whack-a-mole mentality.
I have even wrapped my advice in https://devarch.ai/ which has codified how I extract a high level of quality code and an ability to manage a complex application.
Anthropic has done some goofy things recently, but they cleaned it up because we all reported issues immediately. I think it’s in their best interests to keep developers happy.
My two cents.
You can NEVER stop being vigilant. This is why I still have no faith in things like OpenClaw. Letting an AI just run off unsupervised makes me sweat.
If you want to get good results, you still have to be an engineer about it. The model multiplies the effort you put in. If your effort and input is near zero, you get near zero quality out. If you do the real work and relegate the model to coloring inside the lines, you get excellent results.
First was the CC adaptive thinking change, then 4.7. Even with `/effort max` and keeping under 20% of 1M context, the quality degradation is obvious.
I don't understand their strategy here.
Here is a sample report that tries out the cheaper models + the newest Kimi2.6 model against the 5.4 'gold' testcases from the repo: https://repogauge.org/sample_report.
Running evals seems like it may be a bit too expensive for a solo dev.
My experience very suddenly and very clearly degraded over the last few days.
Today I was trying to build a simple chess game. Previous one-shots were HTML; this gave me a jsx file. I asked it to put it in HTML and it absolutely devoured my credits doing so; I had to abort and do it manually. The resulting app didn't work, and it had decided that multiplayer could work by storing the game state only in local storage, without the clients communicating at all.
I use AI, but only what is free-of-charge, and if that doesn't cut it, I just do it like in the good old times, by using my own brain.
But I think "context switching" between 2 different prompts might be too expensive for GPUs to be worth it for LLM providers. Who knows.
https://podcasts.apple.com/us/podcast/this-episode-is-a-cogn...
As someone who both uses and builds this technology I think this is a core UX issue we’re going to be improving for a while. At times it really feels like a choose 2+ of: slow, bad, and expensive.
I am certainly not saying people should “spend more money,” more like the Claude Code access in the Pro plan seems kind of like false advertising. Since it’s technically usable, but not really.
It's particularly noticeable when for a long time you could work an 8-hour day in Codex on ChatGPT's $20/month plan (though they too started tightening the screws a couple of weeks back).
And by crikey do I empathise with the poor support in this article. Nothing has soured me on Anthropic more than their attitude.
Great AI engineers. Questionable command line engineers (but highly successful.) Downright awful to their customers.
There's really no immediate solution to this other than letting the price float or limiting users; as capacity is built out, this gets better.
All mostly mitigatable by rigorous audits and steering, but man, it should not have to be.
The $20 plan has incredible value, but the limits are just way too tight.
I'm glad Claude made me discover the strength of ai, but now it's time to poke around and see what is more customer friendly. I found deepseek V4 to be extremely cheap and also just as good.
Plus I get the benefit of using it in VS Code instead of Claude's proprietary app.
I think that when people get over the hype and social pressure, Anthropic will lose quite a lot of customers.
The first job of any support system—both in terms of importance and chronologically—is triage. This is not a research issue and it's not an interaction issue. It's at root a classification problem and should be trained and implemented as such.
There are three broad categories of interaction: cranks, grandmas, and wtfs.
Cranks are the people opening a support chat to tell you they have vital missing information about the Kennedy Assassination or they want your help suing the government for their exposure to Agent Orange when they were stationed at Minot. "Unfortunately I can't help with that. We are a website that sells wholesale frozen lemonade. Good luck!"
Grandma questions are the people who can't navigate your website. (This isn't meant to be derogatory, just vivid; I have grandma questions often enough myself.) They need to be pointed toward some resource: a help page, a kb article, a settings page, whatever. These are good tasks for a human or LLM agent with a script or guideline and excellent knowledge/training on the support knowledge base.
WTFs are everything else. Every weird undocumented behavior, every emergent circumstance, every invalid state, etc. These are your best customers and they should be escalated to a real human, preferably a smart one, as soon as realistically possible. They're your best customers because (a) they are investing time into fixing something that actually went wrong; (b) they will walk you through it in greater detail than a bug report, live, and help you figure it out; and (c) they are invested, which means you have an opportunity for real loyalty and word-of-mouth gains.
What most AI systems (whether LLMs or scripts) do wrong is that they treat WTFs like they're grandmas. They're spending significant money on building these systems just to destroy the value they get from the most intelligent and passionate people in their customer base doing in-depth production QC/QA.
They might mean "a few weeks ago"; in their mind, "couple of weeks ago" may map to the German "Vor ein paar Wochen" ("a few weeks ago") rather than exactly two weeks.
Rest of the prose in the article seems to support the assumption.
The post is handwritten with no LLMs involved.
I tried Claude recently and it was able to one-shot fixes on 9/9 of the bugs I gave it on my large and older Unity C# project. Only 2/9 needed minor tweaks for personal style (functionally the same).
Maybe it helps that I separately have a CLI with very extensive unit tests. Or that I just signed up. Or that I use Claude late in the evenings (off hours). I also give it very targeted instructions and if it's taking longer than a couple minutes - I abort and try a different or more precise prompt. Maybe the backend recognizes that I use it sparingly and I get better service.
The author describes what sounds like very large tasks that I'd never hand off to an AI to run wild in 2026.
Anyway I thought I'd give a different perspective than this thread.
https://thoughts.jock.pl/p/adhd-ai-agent-personal-experience...
Strange how things can change!
The services (OpenAI, Anthropic) are not wildly changing that much. People are just using LLMs more and getting frustrated because they were told it would change the world, and then they take it out on their current patron. Give it a month and we'll be hearing how far OpenAI has fallen behind.
For actual code that goes out to production, I generally figure out how I would solve the problem myself (but will use Claude to bounce ideas and approaches -- or as a search engine) and then have Claude do the boring bits.
Recently I had to migrate a rules-engine into an FSM-based engine. I already had my plan and approach. I had Claude do the boring bits while I implemented the engine myself. I find that Claude does best when you give it small, focused, incremental tasks.
There is one caveat, and that is you have to give the model well thought out constraints to guide it properly, and absolutely take the time to read all the thinking it's doing and not be afraid to stop the process whenever things go sideway.
People who just let Claude roam free on their repository deserve everything they end up with.
I occasionally ask AI to write lots of code such as a whole feature (>= medium shirt size) or sometimes even bigger components of said feature and I often just revert what it generated. It's not good for all the reasons mentioned.
Other times I accept its output as a rough draft and then tell it how to refactor its code from mid to senior level.
I'm sure it will get better but this is my trust level with it. It saves me time within these confines.
Edit: it is a valuable code reviewer for me, especially as a solo stealth startup.
For work, unlimited usage via Bedrock.
Yes I’d like to get more usage out of my personal sub, but at 20/mo no complains
Even with the worse limits people still hated it, but once you start making the model dumber, whether on purpose or inadvertently, there's really no reason to keep using Claude anymore.
Even a simple prompt focused on two files I told Claude to do a thing to file A and not change file B (we were using it as a reference).
Claude’s plan was to not touch file B.
First thing it did was alter file B. Astonishing simple task and total failure.
It was all of one prompt, simple task, it failed outright.
I also had it declare that some function did not have a default value and then explain what the function does and how it defaults to a specific value...
Fundamentally absurd failures that have seriously impacted my level of trust with Claude.
The new model that came out less than 24 hours ago made this obvious? This feels like when a new video game comes out and there's 1,000 steam reviews glazing it in the first hours of release. Don't you think you should use it for longer than a day before declaring it a game changer?
Wait really? I wanted to give it a try, but for $200 a month no way am I paying that for something I just want to experiment around with
The thing is running local LLMs will give some kind of reliability and fixed expectations that saves a lot of time - yeah sure Claude might be fantastic one day, but what do I do when the same workload churns out shit the next and I am halfway thru updating and referencing a 500 document set?
Better the devil you know and all that.
(I am just learning that "a couple of weeks" apparently means "2 weeks"...)
Pro is gone. OpenAI plans are more expensive. He can only buy a Kimi plan, which is at least better than Sonnet. But frontier for cheap is gone. Even copilot business plans are getting very expensive soon, also switching to API usage only.
Before the fixes, they were complete trash and I was ready to cancel this month.
Now, I'm feeling like the AI wars are back -- GPT 5.5 and Opus 4.7 are both really good. I'm no longer feeling like we're using nerfed models (knock on wood)!
On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode. This was the wrong tradeoff. We reverted this change on April 7 after users told us they'd prefer to default to higher intelligence and opt into lower effort for simple tasks. This impacted Sonnet 4.6 and Opus 4.6.
On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6.
On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7.
The filesystem tool cannot edit XML files with <name></name> elements in them
Most of this is about the billing system, which is apparently broken.
Like 3 weeks ago Qwen3-coder was the best coding LLM to run locally. I haven’t spent time since to figure out if anything is better.
You can also power Opencode with OpenRouter which lets you pay for any LLM à la carte.
[1] https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-R...
https://www.anthropic.com/engineering/april-23-postmortem
Dear Anthropic:
Please, for the love of all things holy, NEVER change someone's defaults without INFORMING the end user first, because you will wind up with people confused, upset, and leaving your service.
I was worried about Anthropic models quality varying and about Anthropic jacking up prices.
I don't think Claude Code is the best agent orchestrator and harness in existence but it's most widely supported by plugins and skills.
What I don't understand is these loud "voting with money" comments. What they're canceling is a heavily subsidized plan for something that delivers a lot of value.
There are only two providers that can offer models at this level at such a subsidized price: Anthropic and OpenAI. Both of them are bad in terms of reliability.
So I wonder what these people do after they "cancel" both of them. Do they see producing less output at the same hourly rate as everyone else on the market as a viable option?
I'm debating trying out Codex; from some people I hear it's "uncapped", from others I hear they reached limits in short spans of time.
There's also the really obnoxious "trust me bro" documentation update from OpenClaw where they claim Anthropic is allowing OpenClaw usage again, but no official statement?
Dear Anthropic:
I would love to build a custom harness that just uses my Claude Code subscription, I promise I won't leave it running 24/7, 365, can you please tell me how I can do this? I don't want to see some obscure tweet; make official blog posts or documentation pages to reflect policies.
Can I get whitelisted for "sane use" of my Claude Code subscription? I would love this. I am not dropping $2400 in credits for something I do for fun in my free time.
Plus is still very usable for me though. I have not tried Claude Pro in quite a while and if people are complaining about usage limits I know it's going to be a bad time for me. I had to move up from Claude Pro when the weekly limits were introduced because it was too annoying to schedule my life around 5hr windows.
I started using codex around December when I started to worry I was becoming too dependent on Claude and needed to encourage competition. codex wasn't particularly competitive with Claude until 5.4 but has grown on me.
The only thing I really care about is that whatever I'm using "just works" and doesn't hurt limits and Claude code has been flaky as all hell on multiple fronts ever since everyone discovered it during the Pentagon flap. So I tend to reach for ChatGPT and codex at the moment because it will "just work" and there's a good chance Claude will not.
Check for any remaining tasks if it's not currently working on one, continue until it finishes, dismiss the reminder if it's done, and ensure it runs unit tests / confirms the project builds before moving on to the next task. Compact the context when it moves to the next one. Once it's exhausted all remaining tasks, close the loop.
Works for me for my side projects, I can leave it running for a bit until it exhausts all remaining tasks.
Edit: i forgot HN doesn't do code fences. See https://pastebin.com/2rQg0r2L
Obviously the context window settings are going to depend on what you've got set on the llama-server/llama-swap side. Multiple models on the same server like I have in the config snippet above is mostly only relevant if you're using llama-swap.
TL;DR is you need to set up a provider for your local LLM server, then set at least one model on that server, then set the large and small models that crush actually uses to respond to prompts to use that provider/model combo. Pretty straightforward but agree that their docs could be better for local LLM setups in particular.
For me, I've got llama-swap running and set up on my tailnet as a [tailscale service](https://tailscale.com/docs/features/tailscale-services) so I'm able to use my local LLMs anywhere I would use a cloud-hosted one, and I just set the provider baseurl in crush.json to my tailscale service URL and it works great.
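To make the "provider → model → large/small" wiring described above concrete, here's a minimal crush.json sketch for a local llama-swap endpoint. The field names and overall shape here are from memory and may not match the current crush schema exactly, so treat it as a starting point and check the docs:

```json
{
  "providers": {
    "llama-swap": {
      "type": "openai",
      "base_url": "http://localhost:8080/v1",
      "api_key": "none",
      "models": [
        { "id": "qwen3-coder", "name": "Qwen3 Coder", "context_window": 65536 }
      ]
    }
  },
  "models": {
    "large": { "provider": "llama-swap", "model": "qwen3-coder" },
    "small": { "provider": "llama-swap", "model": "qwen3-coder" }
  }
}
```

With a tailnet setup like the one described above, base_url would instead point at the tailscale service URL.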
I’m blown away by how good it is lately
I'm an executive, the devs complaining are getting retrained or put on the chopping block.
My rockstars are now random contractor devs from Vietnam. The aloof FTE grey beards saying "I don't know, it doesn't work very good on X." Are getting a talking to or being sidelined/canned. So far most of my grey beards are adapting pretty well.
I'm not waiting on people to write code any more. No way in hell.
I asked support: hey, I got nothing back; I tried prompting several times, used a ton of usage, and it gave no response. I'd just like the usage back. What I paid for I never got.
Just a bot response: we don't do refunds, no exceptions. Even in the case where they don't serve you what your plan should give you.
Heck two weeks ago i tried my hardest to hit my limit just to make use of my subscription (i sometimes feel like i'm wasting it), and i still only managed to get to 80% for the week.
I generally prune my context frequently though, each new plan is a prune for example, because i don't trust large context windows and degradation. My CLAUDE.md's are also somewhat trim for this same fear and i don't use any plugins, and only a couple MCPs (LSP).
No idea why everyone seems to be having such wildly different experiences on token usage.
Chances are one of you has been drafted into an unpleasant experiment.
WTF are y'all doing that chews tokens so fast? I mean, sure, I could spin up Gas Town and Beads and produce infinite busy work for the agents, but that won't make useful software, because the models don't want anything. They don't know what to build without pretty constant guidance. Left to their own devices, they do busy work. The folks who "set and forget" on AI development are producing a whole lot of code to do nothing that needed doing. And, a lot of those folks are proud of their useless million lines of code.
I'm not trying to burn as many tokens as possible, I'm trying to build good software. If you're paying attention to what you're building, there are so many points where a human is in the loop that it's unusual to run up against token limits.
Anyway, I assume that at some point they have to make enough money to pay the bills. Everything has been subsidized by investors for quite some time, and while the cost per token is going down with efficiency gains in the models/harnesses and with newer compute hardware tuned for these workloads, I think we're all still enjoying subsidized compute at the moment. I don't think Anthropic is making much profit on their plans, especially with folks who somehow run right at the edge of their token limit 24/7. And, I would guess OpenAI is running an even lossier balance sheet (they've raised more money and their prices are lower).
I dunno. I hear a lot of complaining about Claude, but it's been pretty much fine for me throughout 4.5, 4.6 and 4.7. It got Good Enough at 4.5, and it's never been less than Good Enough since. And, when I've tried alternatives, they usually proved to be not quite Good Enough for some reason, sometimes non-technical reasons (I won't use OpenAI, anymore, because I don't trust OpenAI, and Gemini is just not as good at coding as Claude).
If one model seems to be a bit off during a session I just switch to another (Opencode) and plan and review from there.
It's almost unusable
We probably hit peak generative AI last year. Now they probably use AI to improve the AI, so it's kinda garbage in, garbage out; or maybe Anthropic is deprioritizing users while favoring enterprise, or even government, where it provides better quality for bigger contracts.
Oh wait, I don’t have to imagine. That’s what Anthropic does. A nice preview for what is in store for those who chose to turn off their brains and turn on their AI agents.
From "yay, claude is awesome" to "damn, it sucks". This is like with withdrawal symptoms now.
My approach is much easier: I'll stay old school, avoid AI, and come up with other solutions. I am definitely slower, but I reason that the quality FOR other humans will be better.
AI used to be the punched-card replicator... it's all replaceable.
I'm pretty sure it used to warn when you got close to your 5hr limit, but no, it happily billed extra usage. Granted only about $10 today, but over the span of like 45 minutes. Not super pleased.
Then within the last few months everything changed and went to shit. My trust was lost. Behavior became completely inconsistent.
During the height of Claude's mental retardation (now finally acknowledged by the creators) I had an incident where CC ran a query against an unpartitioned/massive BQ table that resulted in $5,000 in extra spend, because it scanned, 30 times, a table which should have been daily partitioned. 27 TB per scan. I recall going over and over the setup and exhaustively refining confidence. After I realized this blunder, I referred to it in the same CC session: "jesus fucking christ, I flagged this issue earlier" -- it responded, "you did. you called out the string types and full table scans and I said 'let's do it later.' That was wrong. I should have prioritized it when you raised it". Now obviously this is MY fault. I fucked up here, because I am the operator, and the buck stops with me. But this incident really drove home that the Claude I had come to vibe with so well over the last N months was entirely gone.
We all knew it was making mistakes, becoming fully retarded. We all felt and flagged this. When Anthropic came out and said, "yeah ... you guys are using it wrong, it's a skill issue," I knew the honeymoon was over. Then recently, when they finally came out and ack'd more of the issues (while somehow still glossing over how badly they fucked up?), it was the final nail. I'm done spending $ on the Anthropic ecosystem. I signed up for OpenAI Pro at $200/mo and will continue working on my own local inference in the meantime.
They could have just kept doing this - literally printing money. Literally: do absolutely nothing, go on vacation, profit $$$. So why did so much change? I think that the issue is they were trying to optimize CC for the monthly plan folks, the ones who are likely losing the company money, but API users became collateral damage.
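For what it's worth, the $5,000 figure in that incident is plausible on a back-of-the-envelope basis. The per-TB rate below is my assumption (BigQuery on-demand pricing varies by region and over time), not something stated in the comment:

```python
# Rough cost check for 30 full scans of a 27 TB table that should have
# been hit with a daily partition filter instead.
tb_per_scan = 27      # size of each accidental full scan
scans = 30            # one scan per day that should have been one partition
price_per_tb = 6.25   # assumed on-demand rate in USD per TB scanned

cost = tb_per_scan * scans * price_per_tb
print(f"~${cost:,.0f}")  # lands right around the $5,000 mentioned above
```

Setting require_partition_filter on the table would have made BigQuery reject those queries outright instead of silently scanning everything.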
Anthropic can't even scale its own infrastructure operations, because the capacity doesn't exist and they don't have the compute, even while losing tens of billions and nerfing models whenever they feel like it.
Once again, local models are the answer, and Anthropic keeps getting you addicted to its casino instead of you running your own, cheaper slot machine and keeping your money.
Every time you go to Anthropic's casino, the house always wins.
I hate enshittification and I hate seeing this happening to Claude Code right now.
Up until last month, a $100 plan was more than enough, and it was difficult to run out of tokens per day for me. Something fundamentally changed, and Claude started making more mistakes and using more tokens. I know I am not tripping because I used it for over a year; this is absolutely new.
A while ago I cleaned up a bunch of MCP tools I had installed, and that made a significant impact on token usage.
The product keeps getting worse so I will definitely evaluate options and possibly switch if management keeps screwing up the product.
Max 5, sonnet for 95% of things. I never run out of tokens in a week and I use it for ~5-6 hours a day.
I just need a convenient command-line tool to sometimes analyse the repo and answer a few questions about it.
Am I unworthy of using CC then? Until now I thought Pro entitled me to do so.
LOL, the elitism is through the roof.
And I actually read the output to fix what I don't like, and ever since Opus 4.5 I've had to do that less and less. 4.6 had issues at the beginning, but that's because you have to manually make sure you change the effort level.