dgreensp 1 day ago [-]
If that’s really all it is, this reveals a staggering level of incompetence, and a lack of transparency.
They don’t have ANY product-level quality tests that picked this up? Many users did their own tests and published them. It’s not hard. And these users’ complaints were initially dismissed.
I don’t think the high vs medium change is really on par with the others. That’s a setting you change in the UI, and depending on what you are doing, both effort levels are pretty capable, they just operate a bit differently. Unless I’m missing something and they are saying they were doing some kind of routing behind the scenes.
If they are constantly pushing major changes to the prompts and workings of the tool, without communicating about it, and without testing, it’s likely there are other bugs and quality-degrading changes beyond the ones in this article, which would make a lot of sense.
TedDallas 1 day ago [-]
About 20 years ago I maintained a shop floor control client/server application. I asked my manager why we didn't have any independent Q/A. He said we didn't need any testers because we have 500 in the building.
Wild west days then.
Looks like we are back.
gozzoo 22 hours ago [-]
It is worse than that. People have been complaining for weeks and Anthropic’s message was basically “you are holding it wrong”. On top of that, this misconfiguration somehow made CC consume far more tokens. How believable is all that?
PeterStuer 1 day ago [-]
Back implies we ever left.
overfeed 21 hours ago [-]
> If they are constantly pushing major changes to the prompts and workings of the tool, without communicating about it
These are all classic symptoms of vibe-induced AI velocitis, sold by AI-peddlers as the future of the industry under the guise of "productivity."
AI can help one generate a lot of code, but the poor engineers approving the deluge of changes are still using their old, unmodified, stock meat-brains. An individual change may look fine in isolation, but when it's interacting with hundreds or thousands of other changes landing the same week, things can go south quickly.
Expect more instability until users rebel, and/or CTOs and CIOs cry uncle. Amazon reportedly sounded the alarm internally after a couple of AI-tool-induced SEVs. The challenges at GitHub and the company insisting you don't call it Microslop are also rumored to be AI-related.
rekrsiv 1 day ago [-]
Time is finite and regression testing always gets punted to the back of the line when humans are excited. This simply reveals a staggering level of humanity.
FartyMcFarter 1 day ago [-]
Software engineering is not a new field. Best practices on testing are mature now, and Anthropic has poached enough engineers from companies with a solid understanding of those practices.
Yet their flagship product had three really bad changes shipped into it, which were only resolved after more than a month.
This raises another question: with all the industry-wide boasting about AI-driven productivity, why does the leading company in agentic coding take over a month to fix severe customer-reported issues?
sfink 1 day ago [-]
> Why does it take the company that is probably the best at agentic coding more than a month to find and solve such large regressions, even with customers complaining about them?
My unfounded suspicion: because this is the tradeoff we're all facing and for the most part refusing to accept when transitioning over to LLM-driven coding. This is exactly how we're being trained to work by the strengths and limitations of this new technology.
We used to depend on maintaining a global if incomplete understanding of a whole system. That enabled us to know at a glance whether specs and tests and actual behavior made sense and guided our thinking, enabling us to know what to look at. With agentic coding, the brutal truth is that this is now a much less "efficient" approach and we'll ship more features per day by letting that go and relying on external signs of behavior like test suites and an agent's analysis with respect to a spec. It enables accomplishing lots of things we wouldn't have done before, often simply because it would be too much friction to integrate it properly -- write tests, check performance, adjust the conceptual understanding to minimize added complexity, whatever.
So in order to be effective with these new tools, we're naturally trained to let go of many of the things we formerly depended on to keep quality up. Mistakes that would formerly have been evidence of stupidity or laziness are now the price to pay for accelerated productivity, and they're traded off against the "mistakes" that we formerly made that were less visible, often because they were in the form of opportunity cost.
Simple example: say you're writing a simple CLI in Python. Formerly, you might take in a fixed sequence of positional arguments, or even if you did use argparse, you might not bother writing help strings for each one. Now because it's no harder, the command-line processing will be complete and flexible and the full `--help` message will cover everything. Instead, you might have a `--cache-dir=DIR` option that doesn't actually do anything because you didn't write a test for it and there's no visible behavioral change other than worse performance.
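To make that concrete, here is a minimal sketch of the hypothetical no-op flag (names and structure are made up for illustration):

```python
import argparse

def main(argv=None):
    parser = argparse.ArgumentParser(description="hypothetical generated CLI")
    parser.add_argument("input", help="file to process")
    # Looks complete in --help, but nothing below ever reads args.cache_dir:
    # the flag parses fine and silently does nothing.
    parser.add_argument("--cache-dir", metavar="DIR", default="~/.cache/tool",
                        help="directory for cached results")
    args = parser.parse_args(argv)
    # ... real work here never touches args.cache_dir ...
    return f"processed {args.input}"

print(main(["data.txt", "--cache-dir", "/tmp/cache"]))  # prints: processed data.txt
```

A test suite that only exercises the happy path passes either way, which is exactly why the dead option survives.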
Closely related, what do you do with user feedback and complaints? Formerly they might be one of your main signals. Now you've found that you need dependable, deterministic results in your test suite that the agent is executing or it doesn't help. User input is very very noisy. We're being trained away from that. There'll probably be a startup tomorrow that digests user input and boils out the noise to provide a robust enough signal to guide some monitoring agent, and it'll help some cases, and train us to be even worse at others.
generalpf 1 hour ago [-]
>There'll probably be a startup tomorrow that digests user input and boils out the noise to provide a robust enough signal to guide some monitoring agent, and it'll help some cases, and train us to be even worse at others.
This sounds like Enterpret.
pixl97 22 hours ago [-]
> you might have a `--cache-dir=DIR` option that doesn't actually do anything
Working in enterprise software it's surprising how long an option that doesn't actually do anything can be missed. And that was before AI and having thousands of customers use it.
This same problem happens with documentation all the time. You end up with paragraphs or examples that simply don't reflect what the product actually does.
tetromino_ 18 hours ago [-]
Where I work, options that don't do anything are seen as good engineering practice. You see, you can't break your users' scripts. Your CLI arguments are part of your stable API. If your tool used to have a cache_dir CLI option and no longer needs it, you still have to keep accepting cache_dir and treat it as a no-op until you are confident your users have migrated away from it.
gogopromptless 20 hours ago [-]
I've been working on this problem coming from the program synthesis school of thought over at https://promptless.ai (which you would have no clue of just from looking at the website, because it's targeted at tech writers).
I'm quite fond of the idea of incremental mutation of agent trajectories to move/embody some of the reasoning steps from LLM tokens into a program. Imagine you have a long agent transcript/trajectory and a magic wand to replace a run of messages with "and now I'll call this script which gives me exactly the information I need," then seeing if the rewritten trajectory is stable.
To give credit where it's due, it's an overly complicated restatement of what Manny Silva has been saying with docs-as-tests https://www.docsastests.com/. Once you describe some user flow to humans (your "docs"), you can "compile" or translate part or all of those steps into deterministic test programs that perform and validate state transitions. Ideally you compile an agent trajectory all the way.
So: working with coding agents, you've cranked up the defect rate in exchange for speed, so let's try testing all important flows. The first thing you try is: ok, I've got these user guides, I guess I'll have the agent follow along and try to do it. And that works! But it's a little expensive and slow.
So I go, ok I'll have the agent do it once, and if it finds a trajectory through a product that works, we can reflect on that transcript and make some helper scripts to automate some or all of those state transitions, then store these next to our docs.
And then you say, ok if I ship a product change, can I have my coding agent update those testing scripts to save the expense and time of re-running the original follow-along. Also an obvious thing to do, and you can totally build it yourself with Claude Code. But I think there is a lot of complexity in how you go about doing this, what kind of incremental computation you can do to keep the LLM costs of all this under a couple hundred bucks a month for teams shipping 20 changes a day with 200 pages of docs.
The most polished open source "compiler/translator" I've seen exploring these ideas so far is Doc Detective (https://doc-detective.com) by Manny.
alfons_foobar 5 hours ago [-]
I am not sure this approach can take you very far.
In my experience, CC makes it very very easy to _add_ things, resulting in much more code / features.
CC can obviously read/understand a codebase much faster than we do, but this also has a limit (how much context we can feed into it) - I think your approach is in essence a bet that future models' ability to read/understand code (size of context) improves as fast as or faster than current models' ability to create new code.
taikahessu 20 hours ago [-]
> Closely related, what do you do with user feedback and complaints? Formerly they might be one of your main signals. Now you've found that you need dependable, deterministic results in your test suite that the agent is executing or it doesn't help. User input is very very noisy.
I don't even use Claude and it has been rather clear to me that their service has not been working properly for some time now.
andrekandre 17 hours ago [-]
> digests user input and boils out the noise to provide a robust enough signal to guide some monitoring agent
not to sound uncharitable but this seems like the absolute worst way to run a business; your customers are basically lab rats... why should they pay for anything in this scenario?
sfink 14 hours ago [-]
I just said someone's gonna build it, not that it's a good idea!
To be fair [to myself], this is scale-dependent. I work on a product with hundreds of millions of users. We're not going to be reading and pondering every bit of feedback we get. We have automation for stripping out some of the noise (eg the number of crash reports we get from bit flips due to faulty RAM is quite significant at this scale). We have lines of defense set up to screen things down -- though if you file a well-researched and documented bug, we'll pay attention. (We won't necessarily do what you want, but we'll pay attention.)
When I worked at a much smaller and earlier stage company, we begged our users for feedback. We begged potential users for feedback. We implemented some things purely to try to get someone excited enough that they would be motivated to give feedback.
Anthropic, OpenAI, Google? They have a lot of users.
Also, this automation would be in addition to the other channels by which you'd pay attention to feedback.
Also also, the ship has sailed. We're all lab rats now. We're randomly chosen to be A/B tested on. We are upgraded early as part of a staged rollout. We're region-locked. Geocoded. Tracked as part of the cohort that has bought formula or diapers recently. Maybe we live in the worst of all possible worlds?
jorblumesea 23 hours ago [-]
models are great but models don't magically fix things. you need to set up systems to handle the output of code, you need to instrument metrics for the llm to listen to and flag. experimentation is a huge problem: with the huge output of code, how do you keep your business metrics clean and isolate issues? these are all hard challenges.
in response, most companies are explicitly trading velocity for quality, and finding out that quality is actually important at the end of the day. if you look at the roadmap it's just ship ship ship. eng is being told to 3x their output. quality in the llm coded world is tough and there's not much appetite for it right now.
afavour 1 day ago [-]
> This simply reveals a staggering level of humanity.
Pretty embarrassing for an AI company. Surely AI should be doing their regression testing?
close04 1 day ago [-]
> This simply reveals a staggering level of humanity.
Wasn't AI supposed to solve all the drudgery? All those humans aided by cutting edge AI are still failing at these basic tasks? Then how good is that AI in the first place?
musebox35 21 hours ago [-]
They say that they did test but the coverage was not enough to pick it up, at least for the prompt change:
“ After multiple weeks of internal testing and no regressions in the set of evaluations we ran, we felt confident about the change and shipped it alongside Opus 4.7 on April 16.
As part of this investigation, we ran more ablations (removing lines from the system prompt to understand the impact of each line) using a broader set of evaluations. One of these evaluations showed a 3% drop for both Opus 4.6 and 4.7. We immediately reverted the prompt as part of the April 20 release.”
Considering the number and scope of users they serve, I can sympathize with the difficulty. However, they should reimburse affected users at least partially instead of just announcing “our bad, sorry”. That would reduce the frustration.
baxtr 20 hours ago [-]
Naively, one could assume that with AI it should be possible to create a long and broad list of test cases…
dantillberg 1 day ago [-]
I would think that many of these defects should show up clearly in service-side analytics as well. For example, the bug that repeatedly re-cleared thinking for old sessions would cause a substantial drop in token cache hit rate for sessions > 1hr on the affected claude code versions. Session age & claude code version seem like obvious dimensions for analytics. But perhaps only in hindsight.
legulere 20 hours ago [-]
To me it reads more like they are struggling to scale with requests and are trying to find ways that hurt users the least.
baxtr 20 hours ago [-]
You’re talking about their intentions. OP is talking about how they don’t test continuously / densely enough for quality. I think both can be true.
greatgib 8 hours ago [-]
My best guess: the change of default effort is unrelated to the major problems users encountered, but it was put first and given the most space to paper over the huge failure of the other two.
The first thing you read, and what takes up a big part of the post, is something like: not really a bug, we just changed a default that wasn't well communicated, and users (their fault) did not notice it. This is why they were "under the false impression" of a change.
Lots of people will stop reading after a few paragraphs.
lanyard-textile 1 day ago [-]
Eh :) Let's not forget the humans on the other end of this.
One of them was a bug that didn't present itself until after an hour of usage.
tuwtuwtuwtuw 1 day ago [-]
Seems like that would be trivial to test?
stickfigure 23 hours ago [-]
Most bugs are trivial to test for after you know about them.
mh- 23 hours ago [-]
True, but when your cache configuration has exactly 2 TTLs and modalities, I don't think it's offbase to expect them to test what happens in the cache hit/miss scenarios for each of those.
(I write this as someone who likes Claude Code, if that matters.)
culopatin 1 day ago [-]
Their in house philosopher thinks Claude gets anxiety though
kolinko 1 day ago [-]
There were few systems like Claude in the past, so the testing rulebook is not really written yet. And it's far from obvious.
sockgrant 1 day ago [-]
LLM evals are well established, are these not applicable here?
6keZbCECT2uB 2 days ago [-]
"On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6"
This makes no sense to me. I often leave sessions idle for hours or days and use the capability to pick it back up with full context and power.
The default thinking level seems more forgivable, but given the churn in system prompts, I'll need to figure out how to intentionally choose a refresh cycle.
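For what it's worth, the quoted bug reads like the classic missing "already done" guard. A hypothetical sketch of that class of mistake and its fix (not Anthropic's actual code; message shapes are made up):

```python
IDLE_THRESHOLD_S = 3600  # "idle for over an hour", per the postmortem

def compact_on_resume(messages, idle_seconds, already_compacted):
    """Strip older thinking blocks when resuming an idle session.

    Intended to run once per resumed session; the reported bug behaves
    as if `already_compacted` were never recorded, so the stripping
    re-fires on every subsequent turn.
    """
    if idle_seconds > IDLE_THRESHOLD_S and not already_compacted:
        messages = [m for m in messages if m.get("type") != "thinking"]
        already_compacted = True  # remember we did this -- the missing piece
    return messages, already_compacted
```

Calling it again with the returned flag is then a no-op, which is the intended "just once" behavior.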
bcherny 2 days ago [-]
Hey, Boris from the Claude Code team here.
Normally, when you have a conversation with Claude Code, if your convo has N messages, then (N-1) messages hit prompt cache -- everything but the latest message.
The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users. In an extreme case, if you had 900k tokens in your context window, then idled for an hour, then sent a message, that would be >900k tokens written to cache all at once, which would eat up a significant % of your rate limits, especially for Pro users.
We tried a few different approaches to improve this UX:
1. Educating users on X/social
2. Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)
3. Eliding parts of the context after idle: old tool results, old messages, thinking. Of these, thinking performed the best, and when we shipped it, that's when we unintentionally introduced the bug in the blog post.
Hope this is helpful. Happy to answer any questions if you have them.
dbeardsl 2 days ago [-]
I appreciate the reply, but I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.
I feel like that is a choice best left up to users.
i.e. "Resuming this conversation with full context will consume X% of your 5-hour usage bucket, but that can be reduced by Y% by dropping old thinking logs"
giwook 2 days ago [-]
Another way to think about it might be that caching is part of Anthropic's strategy to reduce costs for its users, but they are now trying to be more mindful of their costs (probably partly due to significant recent user growth as well as plans to IPO which demand fiscal prudence).
Perhaps if we were willing to pay more for our subscriptions, Anthropic would be able to have longer cache windows. But IDK, one hour seems like a reasonable amount of time given the context, and it's a limitation I'm happy to work around (it's not that hard to work around) to pay just $100 or $200 a month for the industry-leading LLM.
Full disclosure: I've recently signed up for ChatGPT Pro as well in addition to my Claude Max sub so not really biased one way or the other. I just want a quality LLM that's affordable.
jimkleiber 2 days ago [-]
I might be willing to pay more, maybe a lot more, for a higher subscription than claude max 20x, but the only thing higher is pay per token, and i really don't like products that make me be that minutely aware of my usage, especially when it has unpredictability to it. I think there's a reason most telecoms went away from per-minute or especially per-MB charging. Even per GB, as they often now offer X GB, and i'm ok with that on a phone but much less so on a computer, because of the unpredictability of a software update's size.
Kinda like when restaurants make me pay for ketchup or a takeaway box, i get annoyed, just increase the compiled price.
giwook 18 hours ago [-]
For sure, I agree with that sentiment. It's interesting to consider the psychological component of that, like how "free shipping" is not really free, it's oftentimes just packaged into the price of the product but somehow it feels like we're getting a better deal.
I would not be surprised to see Anthropic, OpenAI etc head in the direction you mention as they mature and all of these datacenters currently undergoing construction come online in the next few years and drive down costs.
adam_patarino 1 day ago [-]
Token anxiety is real mental overhead.
jimkleiber 23 hours ago [-]
That's the phrase i was looking for, thank you.
sharts 2 days ago [-]
That doesn’t make sense to pay more for cache warming. Your session for the most part is already persisted. Why would it be reasonable to pay again to continue where you left off at any time in the future?
jeremyjh 2 days ago [-]
Because it significantly increases actual costs for Anthropic.
If they ignored this then all users who don’t do this much would have to subsidize the people who do.
tikkabhuna 1 day ago [-]
I’m coming at this as a complete Claude amateur, but caching for any other service is an optimisation for the company and transparent for the user. I don’t think I’ve ever used a service and thought “oh there’s a cache miss. Gotta be careful”.
I completely agree that it’s infeasible for them to cache for long periods of time, but they need to surface that information in the tools so that we can make informed decisions.
libraryofbabel 1 day ago [-]
That is because LLM KV caching is not like caches you are used to (see my other comments, but it's 10s of GB per request and involves internal LLM state that must live on or be moved onto a GPU and much of the cost is in moving all that data around). It cannot be made transparent for the user because the bandwidth costs are too large a fraction of unit economics for Anthropic to absorb, so they have to be surfaced to the user in pricing and usage limits. The alternative is a situation where users whose clients use the cache efficiently end up dramatically subsidizing users who use it inefficiently, and I don't think that's a good solution at all. I'd much rather this be surfaced to users as it is with all commercial LLM apis.
theshrike79 1 day ago [-]
Think of it like this: Anthropic has to keep a full virtual machine running just for you. How long should it idle there taking resources when you only pay a static monthly fee and not hourly?
They have a limited number of resources and can’t keep everyone’s VM running forever.
prirun 1 day ago [-]
I pay $5/mo to Vultr for a VM that runs continuously and maintains 25GB of state.
jlokier 24 hours ago [-]
That price at Vultr gets you 1GB of RAM, and 25GB of relatively slow SSD.
The KV cache of your Claude context is:
- Potentially much larger than 25GB. (The KV cache sizes you see people quoting for local models are for smaller models.)
- While it's being used, it's all in RAM.
- Actually it's held in special high-performance GPU RAM, precision-bonded directly to the silicon of ludicrously expensive, state of the art GPUs.
- The KV state memory has to be many thousands of times faster than your 25GB state.
- It's much more expensive per GB than the CPU memory used by a VM. And that in turn is much more expensive than the SSD storage of your 25GB.
- Because Claude is used by far more people (and their agents) than rent VMs, far more people are competing to use that expensive memory at the same time.
There is a lot going on to move KV cache state between GPU memory and dedicated, cheaper storage, on demand as different users need different state. But the KV cache data is so large, and used in its entirety when the context is active, that moving it around is expensive too.
pixl97 22 hours ago [-]
Now check out the cost difference in 25GB of computer RAM vs GPU RAM.
And yes, this is also why the cost of computer RAM has shot up.
The total data transferred per hour by your server and by LLM workloads aren't within 5 orders of magnitude of each other. And this is why the compute and power markets are totally screwed.
PeterStuer 24 hours ago [-]
It does not. It just has a fast way to give you the illusion it "runs continuously" with 25GB of warm memory.
Tbh, I'm not sure paged vram could solve this problem for an (assumed) huge cache miss system such as a major LLM server
danso 2 days ago [-]
Genuine question: is the cost to keep a persistent warmed cache for sessions idling for hours/days not significant when done for hundreds of thousands of users? Wouldn’t it pose a resource constraint on Anthropic at some point?
tmountain 1 day ago [-]
Related question, is it at all feasible to store cache locally to offload memory costs and then send it over the wire when needed?
dev_hugepages 1 day ago [-]
No, the cache is several GB for most usual context sizes. It depends on model architecture, but if you take Gemma 4 31B at 256K context length, it takes 11.6GB of cache.
note: I picked the values from a blog and they may be inaccurate, but in pretty much all models the KV cache is very large; it's probably even larger in Claude.
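The back-of-envelope formula is simple. The dimensions below are placeholder values picked to land near the ~11.6GB figure quoted above, not any real model's architecture:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for keys and values; one (head_dim)-vector per layer, per KV head,
    # per token position, at fp16/bf16 (2 bytes per element)
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Made-up dimensions for a mid-sized model with grouped-query attention:
size = kv_cache_bytes(n_layers=44, n_kv_heads=4, head_dim=64, seq_len=256_000)
print(f"{size / 1e9:.1f} GB")  # prints: 11.5 GB
```

Note that the cache grows linearly with context length, so a near-full 1M-token context is several times larger again.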
libraryofbabel 1 day ago [-]
To extend your point: it's not really the storage costs of the size of the cache that's the issue (server-side SSD storage of a few GB isn't expensive), it's the fact that all that data must be moved quickly onto a GPU in a system in which the main constraint is precisely GPU memory bandwidth. That is ultimately the main cost of the cache. If the only cost was keeping a few 10s of GB sitting around on their servers, Anthropic wouldn't need to charge nearly as much as they do for it.
tedivm 1 day ago [-]
The cost you're talking about doesn't change based on how long the session is idle. No matter what happens, they're storing that state and bringing it back at some point; the only difference is how long it's stored out of GPU between requests.
libraryofbabel 24 hours ago [-]
Are you sure about that? They charge $6.25 / MTok for 5m TTL cache writes and $10 / MTok for 1hr TTL writes for Opus. Unless you believe Anthropic is dramatically inflating the price of the 1hr TTL, that implies that there is some meaningful cost for longer caches and the numbers are such that it's not just the cost of SSD storage or something. Obviously the details are secret but if I was to guess, I'd say the 5m cache is stored closer to the GPU or even on a GPU, whereas the 1hr cache is further away and costs more to move onto the GPU. Or some other plausible story - you can invent your own!
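Taking those list prices at face value, the worst-case resume described upthread (a 900k-token context fully rewritten to cache after an idle gap) prices out directly. A sketch using only the figures quoted in this thread, not independently verified:

```python
# Opus cache-write list prices per million tokens, as quoted above
PRICE_5M_TTL_PER_MTOK = 6.25
PRICE_1H_TTL_PER_MTOK = 10.00

def cache_write_cost(tokens, price_per_mtok):
    return tokens / 1_000_000 * price_per_mtok

tokens = 900_000  # the extreme case mentioned upthread
print(f"5m TTL:  ${cache_write_cost(tokens, PRICE_5M_TTL_PER_MTOK):.2f}")
print(f"1hr TTL: ${cache_write_cost(tokens, PRICE_1H_TTL_PER_MTOK):.2f}")
```

That's $5.63 vs $9.00 for a single resume: real money at API prices, and a meaningful chunk of rate limits on a subscription plan.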
tedivm 19 hours ago [-]
Storing on GPU would be the absolute dumbest thing they could do. Locking up the GPU memory for a full hour while waiting for someone else to make a request would result in essentially no GPU memory being available pretty rapidly. This type of caching is available from the cloud providers as well, and it isn't tied to a single session or GPU.
libraryofbabel 12 hours ago [-]
> Storing on GPU would be the absolute dumbest thing they could do
No. It’s not dumb. There will be multiple cache tiers in use, with the fastest and most expensive being on-GPU VRAM with cache-aware routing to specific GPUs and then progressive eviction to CPU ram and perhaps SSD after that. That is how vLLM works as you can see if you look it up, and you can find plenty of information on the multiple tiers approach from inference providers e.g. the new Inference Engineering book by Philip Kiely.
You are likely correct that the 1hr cached data probably mostly doesn’t live on GPU (although it will depend on capacity, they will keep it there as long as they can and then evict with an LRU policy). But I already said that in my last post.
bavell 1 day ago [-]
Yesterday I was playing around with Gemma4 26B A4B with a 3 bit quant and sizing it for my 16GB 9070XT:
Total VRAM: 16GB
Model: ~12GB
128k context size: ~3.9GB
At least I'm pretty sure I landed on 128k... might have been 64k. Regardless, you can see the massive weight (ha) of the meager context size (at least compared to frontier models).
johnsonbuilds 2 days ago [-]
[dead]
cadamsdotcom 2 days ago [-]
Sure, it wouldn’t make sense if they only had one customer to serve :)
PeterStuer 1 day ago [-]
It may be persisted but it is not live in the inference engine.
uoaei 1 day ago [-]
Exactly, even in the throes of today's wacky economic tides, storage is still cheap. Write the model state for the N cached context messages to disk immediately, and reload it without extra inference on the context tokens themselves. If every user did this for ~3 conversations, you would still only need a small fraction of a typical datacenter to house the drives necessary. The bottleneck becomes architecture/topology and the speed of your buses, which are problems that have been contended with for decades now, not inference time on GPUs.
jeremyjh 1 day ago [-]
This has nothing to do with the cost of storage. Surprisingly, you are not better informed than Anthropic on the subject of serving AI inference models.
> I was never under the impression that gaps in conversations would increase costs
The UI could indicate this by showing a timer before context is dumped.
vyr 2 days ago [-]
a countdown clock telling you that you should talk to the model again before your streak expires? that's the kind of UX i'd expect from an F2P mobile game or an abandoned shopping cart nag notification
abustamam 2 days ago [-]
Well sure if you put it that way, they're similar. But it's either you don't see it and you get surprised by increased quota usage, or you do see it and you know what it means. Bonus points if they let you turn it off.
No need to gamify it. It's just UI.
thinkmassive 2 days ago [-]
Plenty of room for a middle ground, like a static timestamp per session that shows expiration time, without the distraction of a constantly changing UI element.
matheusmoreira 2 days ago [-]
Why not an automated ping message that's cheap for the model to respond to?
cortesoft 2 days ago [-]
Because the cache is held on Anthropic's side, and they aren't going to hold your context in cache indefinitely.
karsinkk 2 days ago [-]
Yes!!
A UI widget that shows how far along on the prompt cache eviction timelines we are would be great.
vanviegen 1 days ago [-]
That sounds stressful.
But perhaps Claude Code could detect that you're actively working on this stuff (like typing a prompt or accessing the files modified by the session), and send keep-cache-alive pings based on that? Presumably these pings could be pretty cheap, as the kv-cache wouldn't need to be loaded back into VRAM for this. If that would work reliably, cache expiry timeouts could be more aggressive (5 min instead of an hour).
jimkleiber 2 days ago [-]
I tried to hack the statusline to show this, but i don't think the api gave that info. I'd love it if they let us have more variables to access in the statusline.
kiratp 2 days ago [-]
By caching they mean “cached in GPU memory”. That’s a very very scarce resource.
Caching to RAM and disk is a thing but it’s hard to keep performance up with that and it’s early days of that tech being deployed anywhere.
Disclosure: work on AI at Microsoft. Above is just common industry info (see work happening in vLLM for example)
libraryofbabel 1 days ago [-]
Nit: It doesn’t have to live in GPU memory. The system will use multiple levels of caching and will evict older cached data to CPU RAM or to disk if a request hasn’t recently come in that used that prefix. The problem is, the KV caches are huge (many GB) and so moving them back onto the GPU is expensive: GPU memory bandwidth is the main resource constraint in inference. It’s also slow.
The larger point stands: the cache is expensive. It still saves you money but Anthropic must charge for it.
Edit: there are a lot of comments here where people don't understand LLM prefix caching, aka the KV cache. That's understandable: it is a complex topic, and the usual intuitions about caching you might have from e.g. web development don't apply: a single cache blob for a single request is in the 10s of GB at least for a big model, and a lot of the key details turn on the problems of moving it in and out of GPU memory. The contents of the cache are internal model state; it's not your context or prompt or anything like that. Furthermore, this isn't some Anthropic-specific thing; all LLM inference with a stable context prefix will use it, because it makes inference faster and cheaper. If you want to read up on this subject, be careful, as a lot of blogs will tell you about the KV cache as it is used within inference for a single request (a critical concept in how LLMs work) but gloss over how the KV cache is persisted between requests, which is what we're all talking about here. I would recommend Philip Kiely's new book Inference Engineering for a detailed discussion of that stuff, including the multiple caching levels.
computably 2 days ago [-]
> I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.
You didn't do your due diligence on an expensive API. A naïve implementation of an LLM chat is going to have O(N^2) costs from prompting with the entire context every time. Caching is needed to bring that down to O(N), but the cache itself takes resources, so evictions have to happen eventually.
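A toy sketch of the O(N^2)-vs-O(N) point (turn sizes are made up, and real billing discounts cached reads rather than making them free):

```python
def total_fresh_input_tokens(turns, tokens_per_turn=1_000, cached=False):
    """Input tokens the provider must actually process over a whole chat.
    Without caching, turn t resends the entire history of t turns."""
    if cached:
        return turns * tokens_per_turn  # O(N): only each new suffix is new work
    return sum(t * tokens_per_turn for t in range(1, turns + 1))  # O(N^2)

print(total_fresh_input_tokens(100))               # 5050000
print(total_fresh_input_tokens(100, cached=True))  # 100000
```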
doesnt_know 2 days ago [-]
How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?
You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.
dlivingston 2 days ago [-]
What is being discussed is KV caching [0], which is used across every LLM model to reduce inference compute from O(n^2) to O(n). This is not specific to Claude nor Anthropic.
> How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?
1. Compute scaling with the length of the sequence is applicable to transformer models in general, i.e. every frontier LLM since ChatGPT's initial release.
2. As undocumented changes happen frequently, users should be even more incentivized to at least try to have a basic understanding of the product's cost structure.
> You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.
I think "internal technical implementation" is a stretch. Users don't need to know what a "transformer" is to understand the trade-off. It's not trivial but it's not something incomprehensible to laypersons.
tempest_ 2 days ago [-]
I use CC, and I understand what caching means.
I have no idea how that works with a LLM implementation nor do I actually know what they are caching in this context.
libraryofbabel 1 days ago [-]
They are caching internal LLM state, which is in the 10s of GB for each session. It's called a KV cache (because the internal state that is cached are the K and V matrices) and it is fundamental to how LLM inference works; it's not some Anthropic-specific design decision. See my other comment for more detail and a reference.
hakanderyal 2 days ago [-]
CC can explain it clearly, which is how I learned about how the inference stack works.
fragmede 1 days ago [-]
> 99.99% of users won't even understand the words that are being used.
That's a bad estimate. Claude Code is explicitly a developer-shaped tool, we're not talking generic ChatGPT here, so my guess is that probably closer to 75% of those users understand what caching is, with maybe 30% able to explain what prompt caching actually is. Of course, those users that don't understand have access to Claude and can have it explain what caching is to them if they're interested.
solarkraft 2 days ago [-]
I somewhat disagree that this is due diligence. Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.
mpyne 2 days ago [-]
> Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.
Does mmap(2) educate the developer on how disk I/O works?
At some point you have to know something about the technology you're using, or accept that you're a consumer of the ever-shifting general best practice, shifting with it as the best practice shifts.
websap 2 days ago [-]
Does using print() in Python mean I need to understand the kernel? This is an absurd thought.
Nevermark 1 days ago [-]
That might be an absurd comparison, but we can fix that.
If you were being charged per character, or running down character limits, and printing on printers that were shared and had economic costs for stalled and started print runs, then:
You wouldn’t “need” to understand. The prints would complete regardless. But you might want to. Personal preference.
Which is true of this issue too.
Barbing 1 days ago [-]
>If you were being charged per character, or running down character limits, and printing on printers that were shared and had economic costs for stalled and started print runs,
and the system was being run by some of the planet’s brightest people whose famous creation is well known to disseminate complex information succinctly,
>then:
You would expect to be led to understand, like… a 1997 Prius.
“This feature showed the vehicle operation regarding the interplay between gasoline engine, battery pack, and electric motors and could also show a bar-graph of fuel economy results.” https://en.wikipedia.org/wiki/Toyota_Prius_(XW10)
redsocksfan45 1 days ago [-]
[dead]
zem 2 days ago [-]
mmap(2) and all its underlying machinery are open source and well documented besides.
mpyne 2 days ago [-]
There are open-source and even open-weight models that operate in exactly this way (as it's based off of years of public research), and even if there weren't the way that LLMs generate responses to inputs is superbly documented.
Seems like every month someone writes up a brilliant article on how to build an LLM from scratch or similar that hits the HN page, usually with fancy animated blocks and everything.
It's not at all hard to find documentation on this topic. It could be made more prominent in the U/I, but that's true of lots of things, and hammering on "AI 101" topics would clutter the U/I at the expense of actual decision points the user may want to act upon, things you can't assume the user already knows about in the way you (should) be able to assume they know how LLMs eat up tokens in the first place.
computably 2 days ago [-]
I would say this is abstracting the behavior.
margalabargala 2 days ago [-]
Okay, sure. There's a dollar/intelligence tradeoff. Let me decide to make it, don't silently make Claude dumber because I forgot about a terminal tab for an hour. Just because a project isn't urgent doesn't mean it's not important. If I thought it didn't need intelligence I would use Sonnet or Haiku.
pixl97 22 hours ago [-]
"Gets mad because there is no option"
"Gets mad because when there are options the defaults suck"
"Gets mad because the options start massively increasing costs to aerospace pricing"
margalabargala 21 hours ago [-]
Did you mean to reply to someone else? Or do you misunderstand the issue?
There is no option to avoid auto-dumbing after one hour of idle. I haven't complained about the cost at all, I'm happy to pay it.
So yeah, I'm mad because there's no option. The other two you mentioned don't apply.
someguyiguess 2 days ago [-]
Yes. It’s perfectly reasonable to expect the user to know the intricacies of the caching strategy of their llm. Totally reasonable expectation.
jghn 2 days ago [-]
To some extent I'd say it is indeed reasonable. I had observed the effect for a while: if I walked away from a session I noticed that my next prompt would chew up a bunch of context. And that led me to do some digging, at which point I discovered their prompt caching.
So while I'd agree with your sarcasm that expecting users to be experts of the system is a big ask, where I disagree with you is that users should be curious and actively attempting to understand how it works around them. Given that the tooling changes often, this is an endless job.
abustamam 2 days ago [-]
> users should be curious and actively attempting to understand how it works
Have you ever talked with users?
> this is an endless job
Indeed. If we spend all our time learning what changed with all our tooling when it changes without proper documentation then we spend all our working lives keeping up instead of doing our actual jobs.
Octoth0rpe 2 days ago [-]
There are general users of the average SaaS, and there are claude code users. There's no doubt in my mind that our expectations should be somewhat higher for CC users re: memory. I'm personally not completely convinced that cache eviction should be part of their thought process while using CC, but it's not _that_ much of a stretch.
abustamam 2 days ago [-]
Personally I've never thought about cache eviction as it pertains to CC. It's just not something that I ever needed to think about. Maybe I'm just not a power user but I just use the product the way I want to and it just works.
troupo 1 days ago [-]
Anthropic literally advertises long sessions, 1M context, high reasoning etc.
This oversells how obfuscated it is. I'm far from a power user, and the opposite of a vibe coder. Yet I noticed the effect on my own just from general usage. If I can do it, anyone can do it.
taormina 1 days ago [-]
Listen, no one cares if you think you’re smart for seeing through the lies of their marketing team. You’re being intentionally obtuse.
jghn 16 hours ago [-]
My point is the opposite. I don't think my observation was smart, and I'm surprised that so many people here, in a venue with a lot of people who use this stuff far more than I do, think it wasn't an easy thing to grok.
taormina 10 hours ago [-]
You’re still intentionally missing the point. Everyone knows they are lying. It doesn’t excuse the lies!
jghn 4 hours ago [-]
I’m not. Why would anyone believe marketing speak for any product? One should always assume that at best they’re fluffing their product up and more likely that they’re telling straight up lies
I believe if one were to read my post it'd have been clear that I *am* a user.
This *is* "hacker" news after all. I think it's a safe assumption that people sitting here discussing CC are an inquisitive sort who want to understand what's under the hood of their tools and are likely to put in some extra time to figure it out.
abustamam 1 days ago [-]
We're inquisitive but at the end of the day many of us just want to get our work done. If it's a toy project, sure. Tinker away, dissect away. When my boss is breathing down my neck on why a feature is taking so long? No time for inquiries.
trinsic2 23 hours ago [-]
Agreed. Systems work the way they work. It's up to the user to determine what those limitations are. I don't like the concept of molding software to every expectation a user has. Sometimes that expectation is unwarranted. You can see this in game development. Regardless of expressed criticism, sometimes gamers don't know what they want or what they need. A game should be developed by the design goals of the team, not cater to every whim of the player base. We have seen where that can go.
coldtea 2 days ago [-]
It's not like they have a poweful all-knowing oracle that can explain it to them at their dispos... oh, wait!
esafak 2 days ago [-]
They have to know that this could bite them and to ask the question first.
nixpulvis 2 days ago [-]
I do think having some insight into the current state of the cache and a realistic estimate for prompt token use is something we should demand.
switchbak 2 days ago [-]
If there was an affordance on the TUI that made this visible and encouraged users to learn more - that would go a long way.
exac 2 days ago [-]
It is more useful to read posts and threads like this exact thread IMO. We can't know everything, and the currently addressed market for Claude Code is far from people who would even think about caching to begin with.
kang 2 days ago [-]
It seems you haven't done the due diligence on what part of the API is expensive - constructing a prompt shouldn't carry the same charge/cost as an LLM pass.
coldtea 2 days ago [-]
It seems you haven't done the due diligence on what the parent meant :)
It's not about "constructing a prompt" in the sense of building the prompt string. That of course wouldn't be costly.
It is about reusing llm inference state already in GPU memory (for the older part of the prompt that remains the same) instead of rerunning the prompt and rebuilding those attention tensors from scratch.
kang 2 days ago [-]
You not only skipped the diligence but confused everyone repeating what I said :(
That is what caching is doing: the LLM inference state is being reused. (Attention vectors are an internal artefact at this level of abstraction; effectively, at this level of abstraction, it's the prompt.)
The part of the prompt that has already been inferred no longer needs to be part of the input; it is replaced by the inference subset. And none of this is tokens.
computably 2 days ago [-]
I said "prompting with the entire context every time," I think it should be clear even to laypersons that the "prompting" cost refers to what the model provider charges you when you send them a prompt.
kovek 2 days ago [-]
What if the cache was backed up to cold storage? Instead of having to recompute everything.
vanviegen 1 days ago [-]
They probably already do that. But these caches can get pretty big (10s of GBs per session), so that adds up fast, even for cold storage.
kovek 22 hours ago [-]
10s of GBs? ( 1,000,000 context * 1,000 vector size ) ^ 2 = 1,000,000,000,000,000,000… oh wow.. I must be miscalculating
What about only storing the conversation and then recomputing the embeddings in the cache? Does that cost a lot? Doing a lot of matrix multiplication does not cost dollars of compute, especially on specialized hardware, right?
Majromax 22 hours ago [-]
Context length 1e6, vector length 1e3, and 1e2 model layers for 100e9 context size. Costs will go up even more with a richer latent space and more model layers, and the western frontier outfits are reasonably likely to be maximizing both.
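Spelled out (every dimension here is an assumption about a hypothetical frontier model, and counting K and V as separate tensors would add another factor of 2):

```python
context = 1_000_000  # tokens of context
d_model = 1_000      # per-token vector size (assumed)
layers  = 100        # model depth (assumed)

elements = context * d_model * layers  # 1e11 cached values, i.e. 100e9
bytes_fp16 = elements * 2              # 2 bytes per value at fp16
print(bytes_fp16 / 1e9, "GB")          # 200.0 GB
```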
bontaq 2 days ago [-]
How's that O(N^2)? How's it O(N) with caching? Does a 3 turn conversation cost 3 times as much with no caching, or 9 times as much?
jannyfer 2 days ago [-]
I’m not sure that it’s O(N) with caching but this illustrates the N^2 part:
If there was an exponential cost, I would expect to see some sort of pricing based on that. I would also expect to see it taking exponentially longer to process a prompt. I don't believe LLMs work like that. The "scary quadratic" referenced in what you linked seems to be pointing out that cache reads increase as your conversation continues?
If I'm running a database keeping track of a conversation, and each time it writes the entire history of the conversation instead of appending a message, are we calling that O(N^2) now?
atq2119 1 days ago [-]
Yes, that is indeed O(N^2). Which, by the way, is not exponential.
Also by the way, caching does not make LLM inference linear. It's still quadratic, but the constant in front of the quadratic term becomes a lot smaller.
computably 1 days ago [-]
> Also by the way, caching does not make LLM inference linear. It's still quadratic, but the constant in front of the quadratic term becomes a lot smaller.
Touché. Still, to a reasonable approximation, caching makes the dominant term linear, or equivalently, linearly scales the expensive bits.
bavell 1 days ago [-]
> I would also expect to see it taking exponentially longer to process a prompt. I don't believe LLMs work like that.
Try this out using a local LLM. You'll see that as the conversation grows, your prompts take longer to execute. It's not exponential but it's significant. This is in fact how all autoregressive LLMs work.
_flux 1 days ago [-]
What we would call O(n^2) in your rewriting message history would be the case where you have an empty database and you need to populate it with a certain message history. The individual operations would take 1, 2, 3, .. n steps, so (1/2)*n^2 in total, so O(n^2).
This is the operation that is basically done for each message in an LLM chat in the logical level: the complete context/history is sent in to be processed. If you wish to process only the additions, you must preserve the processed state on server-side (in KV cache). KV caches can be very large, e.g. tens of gigabytes.
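The 1 + 2 + ... + n sum above, as a one-line sanity check:

```python
def rebuild_steps(n):
    """Populating state for an n-message history from scratch: message k is
    processed against its k-1 predecessors, so total work is the triangular
    number n*(n+1)/2, i.e. O(n^2)."""
    return sum(range(1, n + 1))

print(rebuild_steps(1000))  # 500500
```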
raron 2 days ago [-]
How big is this cached data? Wouldn't it be possible to download it after idling a few minutes "to suspend the session", and upload and restore it when the user starts their next interaction?
With this much cheaper setup backed by disks, they can offer much better caching experience:
> Cache construction takes seconds. Once the cache is no longer in use, it will be automatically cleared, usually within a few hours to a few days.
cortesoft 2 days ago [-]
What they mean when they say 'cached' is that it is loaded into the GPU memory on anthropic servers.
You already have the data on your own machine, and that 'upload and restore' process is exactly what is happening when you restart an idle session. The issue is that it takes time, and it counts as token usage because you have to send the data for the GPU to load, and that data is the 'tokens'.
vanviegen 1 days ago [-]
Wrong on both counts. The kv-cache is likely to be offloaded to RAM or disk. What you have locally is just the log of messages. The kv-cache is the internal LLM state after having processed these messages, and it is a lot bigger.
cortesoft 1 days ago [-]
I shouldn't have said 'loaded into GPU memory', but my point still stands... the cached data is on the anthropic side, which means that caching more locally isn't going to help with that.
nl 2 days ago [-]
> upload and restore it when the user starts their next interaction
The data is the conversation (along with the thinking tokens).
There is no download - you already have it.
The issue is that it gets expunged from the (very expensive, very limited) GPU cache and to reload the cache you have to reprocess the whole conversation.
That is doable, but as Boris notes it costs lots of tokens.
vanviegen 1 days ago [-]
You're quite confidently wrong! :-)
The kv-cache is the internal LLM state after having processed the tokens. It's big, and you do not have it locally.
nl 16 hours ago [-]
> The kv-cache is the internal LLM state after having processed the tokens. It's big, and you do not have it locally.
Yes - generated from the data of the conversation.
Read what I said again. I'm explaining how they regenerate the cache by running the conversation though the LLM to reconstruct the KV cache state.
cyanydeez 2 days ago [-]
I often see a local model QWEN3.5-Coder-Next grow to about 5 GB or so over the course of a session using llamacpp-server. I'd bet these trillion-parameter models are even worse. Even if you wanted to download it or offload it, or offered that as a service, to start back up again you'd _still_ be paying the token cost, because all of that context _is_ the tokens you've just done.
The cache is what makes your journey from 1k prompt to 1million token solution speedy in one 'vibe' session. Loading that again will cost the entire journey.
miroljub 2 days ago [-]
This sounds like a religious cult priest blaming the common people for not understanding the cult leader's wish, which he never clearly stated.
computably 1 days ago [-]
A strange view. The trade-off has nothing to do with a specific ideology or notable selfishness. It is an intrinsic limitation of the algorithms, which anybody could reasonably learn about.
Sure, the exact choice on the trade-off, changing that choice, and having a pretty product-breaking bug as a result, are much more opaque. But I was responding to somebody who was surprised there's any trade-off at all. Computers don't give you infinite resources, whether or not they're "servers," "in the cloud," or "AI."
miroljub 1 days ago [-]
He was surprised because it was not clearly communicated. There's a lot of theory behind a product that you could (or could not) better understand, but in the end, something like price doesn't have much to do with the theoretical and practical behavior of the actual application.
bede 1 days ago [-]
I too would far rather bear a token cost than have my sessions rot silently beneath my feet. I usually have ~5 running CC sessions, some of which I may leave for a week or two of inactivity at a time.
lochnessduck 20 hours ago [-]
Yes, me too. This is good to know, but basically it means I can’t rely on old conversations any more. Using a “handoff” file to try and start a new conversation is effectively the same thing as what they did under the hood. So yeah, you can’t rely on old conversations to be as informed when you pick it back up.
airstrike 1 days ago [-]
same here, and I suspect there are dozens of us
winternewt 1 days ago [-]
Instead of just dropping all the context, the system could also run a compaction (summarizing the entire convo) before dropping it. Better to continue with a summary than to lose everything.
Folcon 1 days ago [-]
There are problems with this approach as well, I've found.
I'm really beginning to feel the lack of control when it comes to context, if I'm being honest.
bcherny 1 days ago [-]
Yes! This is what we’re trying next.
cyanydeez 2 days ago [-]
It'd probably be helpful for power users and transparency to actually show how the cache is being used. If you run local models with llamacpp-server, you can watch how the cache slots fill up with every turn; when subagents spawn, you see another process id spin up and it takes up a cache slot; when the model starts slowing down is when the context grows (amd 395+ around 80-90k) and the cache loads are bigger because you've got all that.
So yeah, it doesn't take much to surface to the user that the speed/value of their session is ephemeral because to keep all that cache active is computationally expensive because ...
You're still just running text through an extremely complex process and adding to that text; to avoid re-calculating the entire chain, you need the cache.
nixpulvis 2 days ago [-]
How else would you implement it?
btown 2 days ago [-]
Is there a way to say: I am happy to pay a premium (in tokens or extra usage) to make sure that my resumed 1h+ session has all the old thinking?
I understand you wouldn't want this to be the default, particularly for people who have one giant running session for many topics - and I can only imagine the load involved in full cache misses at scale. But there are other use cases where this thinking is critical - for instance, a session for a large refactor or a devops/operations use case consolidating numerous issue reports and external findings over time, where the periodic thinking was actually critical to how the session evolved.
For example, if N-4 was a massive dump of some relevant, some irrelevant material (say, investigating for patterns in a massive set of data, but prompted to be concise in output), then N-4's thinking might have been critical to N-2 not getting over-fixated on that dump from N-4. I'd consider it mission-critical, and pay a premium, when resuming an N some hours later to avoid pitfalls just as N-2 avoided those pitfalls.
Could we have an "ultraresume" that, similar to ultrathink, would let a user indicate they want to watch Return of the (Thin)king: Extended Edition?
CjHuber 2 days ago [-]
I think it’s crazy that they do this, especially without any notice. I would not have renewed my subscription if I knew that they started doing this.
Especially in the analysis part of my work I don‘t care about the actual text output itself most of the time but try to make the model „understand“ the topic.
In the first phase the actual text output itself is worthless it just serves as an indicator that the context was processed correctly and the future actual analysis work can depend on it.
And they‘re… just throwing most of the relevant stuff out without any notice when I resume my session after a few days?
This is insane, Claude literally became useless to me and I didn’t even know it until now, wasting a lot of my time building up good session context.
There would be nothing lost if they said „If you click yes, we will prune your old thinking, making Claude faster and saving you tons of tokens“. Most people would probably say yes, so why not ask them… make it an env variable (an announced one, not a secretly introduced one to opt out of something new!) or at least write it in a changelog if they really don’t want to allow people to use it like before, so there‘d be a chance to cancel the subscription in time instead of wasting tons of time on work patterns that no longer work
munk-a 2 days ago [-]
Pointing at their terms of service will definitely be the instantly summoned defense (as it would be for most modern companies), but the fact that SaaS can so suddenly shift the quality of product being delivered for their subscription without clear notification or explicit re-enrollment is definitely a legal oversight right now, and Italy actually did recently clamp down on Netflix doing this[1]. It's hard to define what user expectations of a continuous product are and how companies may have violated them - and for a long time social constructs kept this pretty well in check. As obviously inactive and forgotten-about subscriptions have become a more significant revenue source for services, that agreement has been eroded, though, and the legal system has yet to catch up.
1. Specifically, this suit was about price increases without clear consideration for both parties - but the same justifications apply to service restrictions without corresponding price decreases.
> Our systems will smartly ignore any reasoning items that aren’t relevant to your functions, and only retain those in context that are relevant. You can pass reasoning items from previous responses either using the previous_response_id parameter, or by manually passing in all the output items from a past response into the input of a new one.
So to defend a little: it's a cache, it has to go somewhere; it's a save state of the model's inner workings at the time of the last message, so if it expires, the whole thing has to be processed again. Most people don't understand that without that cache, the ENTIRE history of the conversation is processed again and again with every message. That conversation might have hit several gigs worth of model state, and are you expecting them to keep that around for /all/ of the conversations you have had with it in separate sessions?
3836293648 2 days ago [-]
No? It's not because it's a cache, it's because they're scared of letting you see the thinking trace. If you got the trace you could just send it back in full when it got evicted from the cache. This is how open weight models work.
mpyne 2 days ago [-]
The trace goes back fine, that's not the issue.
The issue is that if they send the full trace back, it will have to be processed from the start if the cache expired, and doing that will cause a huge one-time hit against your token limit if the session has grown large.
So what Boris talked about is stripping things out of the trace that goes back to regenerate the session if the cache expires. Doing this would help avert burning up the token limit, but it is technically a different conversation, so if CC chooses poorly on stripping parts of the context then it would lead to Claude getting all scatter-brained.
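To put rough numbers on that one-time hit (token counts are made up, and real billing prices cache reads and fresh input differently):

```python
def input_tokens_processed(history_len, new_tokens, cache_hit):
    """Tokens the server must freshly process for the next request.
    With a warm prefix cache only the new suffix is new work; after
    eviction the entire history is reprocessed in one go."""
    return new_tokens if cache_hit else history_len + new_tokens

# A 150k-token session, sending a 500-token follow-up:
print(input_tokens_processed(150_000, 500, cache_hit=True))   # 500
print(input_tokens_processed(150_000, 500, cache_hit=False))  # 150500
```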
charcircuit 1 days ago [-]
>and doing that will cause a huge one-time hit against your token limit if the session has grown large.
Anthropic already profited from generating those tokens. They can afford to subsidize reloading context.
pixl97 22 hours ago [-]
No they can't, that's what you don't seem to get.
Reloading those tokens takes around the same effort as processing them in the first place.
It's ok to be ignorant of how the infrastructure for LLMs work, just don't be proud of it.
charcircuit 17 hours ago [-]
They literally can. They could make the API free to use if they wanted. There is no law that states that costs have to equal the cost it takes to process the request.
eknkc 2 days ago [-]
I’m not familiar with the Claude API but OpenAI has an encrypted thking messages option. You get something that you can send back but it is encrypted. Not available on Anthropic?
reactordev 2 days ago [-]
They are sending it back to the cache, the part you are missing is they were charging you for it.
eknkc 2 days ago [-]
The blog post says they prune them now not to charge you. That’s the change they implemented.
reactordev 2 days ago [-]
right. they were charging you for it, now they aren't because they are just dropping your conversation history.
CjHuber 2 days ago [-]
No of course it’s unrealistic for them to hold the cache indefinitely and that’s not the point. You are keeping the session data yourself so you can continue even after cache expiry. The point I‘m making is that it made me very angry that without any announcement they changed behavior to strip the old thinking even when you have it in your session file. There is absolutely no reason to not ask the user about if they want this
And it’s part of a larger problem of unannounced changes it‘s just like when they introduced adaptive thinking to 4.6 a few weeks ago without notice.
Also they seem to be completely unaware that some users might only use Claude code because they are used to it not stripping thinking in contrast to codex.
Anyway I‘m happy that they saw it as a valid refund reason
rsfern 2 days ago [-]
It seems like an opportunity for a hierarchical cache. Instead of just nuking all context on eviction, couldn’t there be an L2 cache with a longer eviction time so task switching for an hour doesn’t require a full session replay?
sfink 1 days ago [-]
Living where? If it's in the GPU, then it's still taking up precious space that could be used for serving other sessions. If it's not in the GPU, then it doesn't help.
cyanydeez 2 days ago [-]
what matters isn't that it's a cache; what matters is that it's cached _in the GPU/NPU_ memory, taking up space from another user's active session; keeping that cache in the GPU is a nonstarter for an oversold product. Even putting it into cold storage means they still have to load it at the cost of the compute, generally speaking, because it again takes up space from an oversold product.
2 days ago [-]
FireBeyond 2 days ago [-]
> There would be nothing lost if they said „If you click yes, we will prune your old thinking making Claude faster and saving you tons of tokens“. Most people would say yes probably so why not ask them
The irony is that Claude Design does this. I did a big test building a design system, and when I came back to it, it had in the chat window "Do you need all this history for your next block of work? Save 120K tokens and start a new chat. Claude will still be able to use the design system." Or words to that effect.
CjHuber 2 days ago [-]
This is exactly what also confused me. I had the exact same prompt in Claude Code as well, and the "no" option implies you can also keep the whole history. But clicking keep apparently only ever kept the user and assistant messages, not the actual thinking parts of the conversation
elAhmo 2 days ago [-]
Don't you have that by just resuming old convo?
The only issue is that it didn't hit the cache so it was expensive if you resume later.
eknkc 2 days ago [-]
Not at the moment, apparently. They remove the thinking messages when you continue after 1 hour. That was the whole idea of that change. So the LLM gets all your messages, its responses etc., but not the thinking parts: why it generated those responses. You get a lobotomised session.
elAhmo 2 days ago [-]
OK didn't know that. I also resume fairly old sessions with 100-200k of context, and I sometimes keep them active for a while (but with large breaks in between).
Still on Opus 4.6 with no adaptive thinking, so didn't really notice anything worse in the past weeks, but who knows.
tbrockman 2 days ago [-]
Or generate tiny filler messages every hour until you come back to it.
trinsic2 2 days ago [-]
Why can't you just build a project document that outlines the prompt that you want to do? Or have Claude save your progress in memory so you can pick it up later? That's what I do. It seems abhorrent to expect to have a running prompt left idle for long periods of time just so you can pick up at a moment's whim...
Terretta 2 days ago [-]
You know that memory goes back into a prompt as context that wasn't cached, so... that just adds work.
Granted, the "memory" can be available across session, as can docs...
This violates the principle of least surprise, with nothing to indicate Claude got lobotomized while it napped when so many use prior sessions as "primed context" (even if people don't know that's what they were doing or know why it works).
The purpose of spending 10 to 50 prompts getting Claude to fill the context for you is it effectively "fine tunes" that session into a place your work product or questions are handled well.
(If this notion of sufficient context as fine-tuning seems surprising, the research is out there.)
Approaches tried need to deal with both of these:
1) Silent context degradation breaks the Pro-tool contract. I pay compute so I don't pay in my time; if you want to surface the cost, surface it (UI + price tag or choice), don't silently erode quality of outcomes.
2) The workaround (external context files re-primed on return) eats the exact same cache miss, so the "savings" are illusory — you just pushed the cost onto the user's time. If my own time's cheap enough that's the right trade off, I shouldn't be using your machine.
uxcolumbo 2 days ago [-]
I don't envy you Boris. Getting flak from all sorts of places can't be easy. But thanks for keeping a direct line with us.
I wish Anthropic's leadership would understand that the dev community is such a vital community that they should appreciate it a bit more (i.e. not sending lawyers after various devs without asking nicely first, banning accounts without notice, etc.). Appreciate it's not easy to scale.
OpenAI seems to be doing a much better job when it comes to developer relations, but I would like to see you guys 'win' since Anthropic shows more integrity and has clear ethical red lines they are not willing to cross unlike OpenAI's leadership.
jwr 1 days ago [-]
These controversies erupt regularly, and I hope that you will see a common thing with most of them: you make a decision for your users without informing them.
Please fight this hubris. Your users matter. Many of us use your tools for everyday work and do not appreciate having the rug pulled from under them on a regular basis, much less so in an underhanded and undisclosed way.
I don't mind the bugs, these will happen. What I do not appreciate is secretly changing things that are likely to decrease performance.
Kiro 1 days ago [-]
A company that needs to anchor every single thing with the users will create a stale product.
jwr 1 days ago [-]
That is not what I wrote. The phrases "without informing them", "in an underhanded and undisclosed way" and "secretly changing things" were important. I'm all for product evolution, but users should be informed when the product is changed, especially when the change can be for the worse (like dumbing down the model).
salawat 1 days ago [-]
I've spent my entire working career dealing with companies that do the opposite. The product still goes stale. Find a better excuse.
You're acquiring users as a recurring revenue source. Consider stability and transparency of implementation details cost of doing business, or hemorrhage users as a result.
tomaskafka 1 days ago [-]
While I hate all the gaslighting Anthropic seems to do recently (and the fact that their harness broke the code quality, while they forbid use of third party harnesses), making decisions for users is what UX is.
See also the difference between eg. MacOS (with large M, the older good versions) and waiting for "Year of linux on desktop".
I don't think the issue is making decisions for users, but trying to switch off the soup tap in the all-you-can-eat soup bar. Or, wrong business model setting wrong incentives to both sides.
kuboble 2 days ago [-]
As some others have mentioned.
I think the best option would be to tell a user who is about to resurrect a conversation that has been evicted from cache that the session is no longer cached, and that they will face the full cost of replaying the session, not just the incremental question and answer.
(I understand that under the hood LLMs are O(n²) by default, but it's very counterintuitive, and given how popular CC is becoming outside of nerd circles, a smaller and smaller fraction of users is probably aware of it.)
I would like to decide on it case by case. Sometimes the session has some really deep insight I want to preserve, sometimes it's discardable.
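To make the O(n²) point above concrete, here's a toy cost model (illustrative numbers only; this has nothing to do with Anthropic's actual pricing): a cold resume pays causal attention over every prior token, while a warm cache only pays for the new suffix.

```python
# Toy cost model for prompt processing with and without a warm cache.
# Attention over a sequence of length n costs O(n^2) token-pairs;
# with a warm KV cache, only the new suffix attends over the prefix.

def attention_pairs(total_len: int, cached_len: int = 0) -> int:
    """Token-pairs computed when extending a cached prefix to total_len."""
    new = total_len - cached_len
    # Each new token attends to all tokens before it (causal attention).
    return sum(cached_len + i + 1 for i in range(new))

context = 100_000   # tokens already in the conversation
reply = 500         # tokens in the next exchange

cold = attention_pairs(context + reply)            # cache evicted: full replay
warm = attention_pairs(context + reply, context)   # cache hit: suffix only

print(f"cold replay: {cold:,} pairs")
print(f"warm cache:  {warm:,} pairs  ({cold / warm:.0f}x cheaper)")
```

For a 100k-token session, the cold resume is roughly two orders of magnitude more compute, which is exactly why a cache miss shows up so dramatically in token budgets.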
a_t48 2 days ago [-]
I got exactly this warning message yesterday, saying that it could use up a significant amount of my token budget if I resumed the conversation without compaction.
jhogendorn 2 days ago [-]
Compaction won't save you; in fact, I've found that calling compaction eats about 3-5x the cold-cache cost in usage.
_flux 1 days ago [-]
Wouldn't it help if the system did the compaction before the eviction happens? The problem is that Anthropic probably doesn't want to automatically compact all sessions that have been left idle for an hour (and are very likely abandoned already); that would probably introduce even more additional cost.
Maybe the UI could do that for sessions that the user hasn't left yet, when the deadline comes near.
doubleunplussed 2 days ago [-]
I saw that too, but that's actually even worse on cache - the entire conversation is then a cache miss and needs to be loaded in in order to do the compaction. Then the resulting compacted conversation is also a cache miss.
You ideally want to compact before the conversation is evicted from cache. If you knew you were going to use the conversation again later after cache expiry, you might do this deliberately before leaving a session.
Anthropic could do this automatically before cache expiry, though it would be hard to get right - they'd be wasting a lot of compute compacting conversations that were never going to be resumed anyway.
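The trade-off described above can be sketched as a tiny expected-cost model. All numbers here are made-up assumptions (cache reads at 10% of uncached cost, compaction shrinking context 10x), not Anthropic's real economics:

```python
# Sketch: is it worth compacting a session *before* its cache TTL expires?
# All numbers are illustrative assumptions, not Anthropic's real costs.

def expected_cost(resume_prob: float, ctx_tokens: int,
                  compact_ratio: float = 0.1,
                  precompact: bool = False) -> float:
    """Expected input-token cost of the next resume, in uncached-token units."""
    if precompact:
        # Compact while the cache is still warm: reading the context from
        # cache is cheap (assume 10% of uncached cost), and a later resume
        # only replays the small summary.
        compact_cost = 0.1 * ctx_tokens
        resume_cost = resume_prob * (compact_ratio * ctx_tokens)
        return compact_cost + resume_cost
    # Do nothing: a late resume replays the whole context uncached.
    return resume_prob * ctx_tokens

ctx = 200_000
for p in (0.05, 0.5, 0.9):
    lazy = expected_cost(p, ctx)
    eager = expected_cost(p, ctx, precompact=True)
    print(f"resume_prob={p}: lazy={lazy:,.0f}  pre-compact={eager:,.0f}")
```

Under these assumptions, pre-compacting wins only when the session is likely to be resumed; for the many sessions that are simply abandoned, eager compaction is pure waste, which is presumably why it isn't done blindly for everything.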
onemoresoop 2 days ago [-]
Im glad they chose to do that as opposed to hidden behavior changes that only confuse users more.
fhub 2 days ago [-]
Really good to know. That should have made it into their update letter in point (2). Empowering the user to choose is the right call.
skeledrew 2 days ago [-]
> I think the best option would be tell a user who is about to resurrect a conversation that has been evicted from cache that the session is not cached anymore and the user will have to face a full cost of replaying a session
This feature has been live for a few days/weeks now, and with that knowledge I try to remember to at least get a progress report written when, for example, I'm close to the quota limit and the context is reasonably large. Or I continue with a /compact, but that tends to lead to having to repeat some things that didn't get included in the summary. Context management is just hard.
Terretta 2 days ago [-]
Right, and reloading that context is the same cost as refilling the cache, so really, they're charging the same, and making it hard.
mtilsted 2 days ago [-]
Then you need to update your documentation and teach Claude to read the new documentation, because here is what Claude Code answered:
Question: Hey Claude, if we have a conversation and then I take a break, does it change the expected output of my next answer if there are 2 hours between the previous message and the next one?
Answer: No. A 2-hour gap doesn't change my output. I have no internal clock between messages — I only see the conversation content plus the currentDate context injected each turn. The prompt cache may expire (5 min TTL), which affects
cost/latency but not the response itself.
The only things that can change output across a break: new context injected (like updated date), memory files being modified, or files on disk changing.
-- This answer directly contradicts your post. It seems like the biggest problem is a total lack of documentation of expected behavior.
A similar thing happens if I ask claude code for the difference between plan mode, and accept edits on.
Then Claude told me the only difference was that in plan mode it would ask for permission before doing edits. But I really don't think this is true. It seems like plan mode does a lot more work and presents it in a totally different way. It is not just an "I will ask before applying changes" mode.
ryeguy 2 days ago [-]
This isn't how LLMs work. They aren't self-aware like this; they're trained on the general internet. They might have some pointers to documentation in certain cases, but they generally won't have specialized knowledge of themselves embedded within. Claude Code has no need to know about its own internal programming; the core loop is just JavaScript code.
CjHuber 2 days ago [-]
It does have a built-in documentation subagent it can invoke, but that doesn't help much if they don't document their shenanigans.
hennell 1 days ago [-]
Don't be silly, they don't expect you to ask the Ai questions and get the right answers. Obviously if you want to know what's going on you should look at their first solution - check what advice they have posted on X...
isaacdl 2 days ago [-]
Thanks for giving more information. Just as a comment on (1), a lot of people don't use X/social. That's never going to be a sustainable path to "improve this UX" since it's...not part of the UX of the product.
It's a little concerning that it's number 1 in your list.
saadn92 2 days ago [-]
I leave sessions idle for hours constantly - that's my primary workflow. If resuming a 900k context session eats my rate limit, fine, show me the cost and let me decide whether to /clear or push through. You already show a banner suggesting /clear at high context - just do the same thing here instead of silently lobotomizing the model.
sdevonoes 2 days ago [-]
So if they fuck it up again and now they have, let’s say, “db problems” instead of “caching problems”, you would happily simply pay more? Wtf
saadn92 2 days ago [-]
No, I wouldn't. I'd like some transparency at least.
albedoa 2 days ago [-]
Did you reply to the wrong comment? I don't see that implied here at all. What?
ceuk 2 days ago [-]
Is having massive sessions which sit idle for hours (or days) at a time considered unusual? That's a really, really common scenario for me.
Two questions if you see this:
1) if this isn't best practice, what is the best way to preserve highly specific contexts?
2) does this issue just affect idle sessions or would the cache miss also apply to /resume ?
hedgehog 2 days ago [-]
Have the tool maintain a doc, and use either the built-in memory or (I prefer it this way) your own. I've been pretty critical of some other aspects of how Claude Code works but on this one I think they're doing roughly the right thing given how the underlying completion machinery works.
Edit: If you message me I can share some of my toolchain, it's probably similar to what a lot of other people here use but I've done some polishing recently.
Asharma538 2 days ago [-]
[dead]
jetbalsa 2 days ago [-]
The cache is stored on Anthropic's servers, since it's a save state of the LLM's internal state at the time of processing, several gigs in size. Every SINGLE TIME you send a message and it's a cache miss, you have to reprocess the entire conversation again, eating up tons of tokens in the process.
cyanydeez 2 days ago [-]
Clarification, though: the cache that matters is held directly in the memory of the GPU/NPU cards; it's not saved anywhere else. They could technically create cold storage of the cached state and reload it, but given how ephemeral all these vibe-coding sessions are, it's unlikely there's much value in saving those vectors to reload later.
So then it comes down to what you're talking about: reprocessing the entire text chain, which is a different kind of cache, and generating the equivalent tokens is what's being charged.
But once you realize that the product's efficiency in extended sessions depends on a cache sitting in immediate GPU memory, it's obvious that the oversold product can't just idle the GPU when sessions idle.
fidrelity 2 days ago [-]
Just wanted to say I appreciate your responses here. Engaging so directly with a highly critical audience is a minefield that you're navigating well.
Thank you.
qsort 2 days ago [-]
I agree with this.
I'm writing this message even though I don't have much to add because it's often the case on HN that criticism is vocal and appreciation is silent and I'd like to balance out the sentiment.
Anthropic has fumbled on many fronts lately but engaging honestly like this is the right thing to do. I trust you'll get back on track.
troupo 2 days ago [-]
> Engaging so directly with a highly critical audience is a minefield that you're navigating well.
They spent two months literally gaslighting this "critical audience" that this could not be happening and literally blaming users for using their vibe-coded slop exactly as advertised.
All the while all the official channels refused to acknowledge any problems.
Now the dissatisfaction and subscription cancellations have reached a point where they finally had to do something.
rob 2 days ago [-]
Examples of gaslighting on April 15th (the first 2 issues were "fixed" by April 10th according to the story):
No mention of anything like "hey, we just fixed two big issues, one that lasted over a month." Just casual replies to everybody like nothing is wrong and "oh there's an issue? just let us know we had no idea!"
troupo 2 days ago [-]
Don't forget "our investigation concluded you are to blame for using the product exactly as advertised" https://x.com/lydiahallie/status/2039800718371307603 including gems like "Sonnet 4.6 is the better default on Pro. Opus burns roughly twice as fast. Switch at session start"
shimman 2 days ago [-]
Very easy to do when you stand to make tens of millions when your employer IPOs. Let's maybe not give too much praise, and employ some critical thinking here.
simplify 2 days ago [-]
What is the purpose of this mindset? Should we encourage typical corporate coldness instead?
sdevonoes 2 days ago [-]
We should encourage minimal dependency on multibillion tech companies like anthropic. They, and similar companies are just milking us… but since their toys are soo shiny, we don’t care
simplify 2 days ago [-]
Sure, but that seems out of scope of the original comment.
hgoel 2 days ago [-]
Is "employ some critical thinking" supposed to involve being an annoying uptight cynic?
artdigital 2 days ago [-]
I'm also a Claude Code user from day 1 here, back from when it wasn't included in the Pro/Max subscriptions yet, and I was absolutely not aware of this either. Your explanation makes sense, but I naively was under the impression that re-using older existing conversations I had open would just continue the conversation as-is and not be treated as a full cache miss.
My biggest learning here is the 1 hour cache window. I often have multiple Claudes open and it happens frequently that they're idle for 1+ hours.
This cache information should probably get displayed somewhere within Claude Code
bcherny 2 days ago [-]
Yep, agree. We added a little "/clear to save XXX tokens" notice in the bottom right, and will keep iterating on this. Thanks for being an early user!
Implicated 2 days ago [-]
But.. that doesn't solve the problem of having no indication in-session when it'll lose the cache. A nudge to /clear does nothing to indicate "or else face significant cost" nor does it indicate "your cache is stale".
Love the product. <3
troupo 1 days ago [-]
Instead of showing actual usage, costs and cache status you spent two months denying the issue even exists, making the product silently worse, and now you're "iterating on this"
troupo 1 days ago [-]
To add to this. The new indicator is "New task? /clear to save <X> tokens" even though it affects all tasks, not just new ones.
Mislead, gaslight, misdirect is the name of the game
bobkb 2 days ago [-]
Resuming sessions after more than 1 hour is a very common workflow that many teams follow. It would be great if this were treated as expected behaviour and the UX designed around it. Perhaps you're not realising that Claude Code has replaced the shells people were using (i.e. bash has been replaced with a Claude Code session).
trinsic2 2 days ago [-]
I think that's a bad idea. Expecting to have a prompt open like this, accumulating context, puts a load on the back end. It's one of those bad habits, like trying to keep open tabs in a browser as a way to keep your workflow up to date, when what you really should be doing is taking notes on your process and working from there.
I have project folders/files and memory stored for each session; when I come back to my projects, the context is drawn from the memory files and the status saved in my project md files.
Create a better workflow for yourself and your teams and do it the right way. Quit expecting the prompt to store everything for you.
For the Claude team: if you haven't already, I'd recommend you create some best practices for people who don't know any better; otherwise people are going to expect things to work a certain way, and it's going to cause a lot of friction when they can't do what they expect to be able to do.
kiratp 2 days ago [-]
Agents making forward progress hours apart is an expected pattern and inference engines are being adapted to serve that purpose well.
It’s hard to do it without killing performance and requires engineering in the DC to have fast access to SSDs etc.
Disclosure: work on ai@msft. Opinions my own.
troupo 1 days ago [-]
> I think thats a bad idea. It seems like expecting to have a prompt open like this, accumulating context puts a load on the back end
Let's see what Boris Cherny himself and other Anthropic vibe-coders say about this:
Opus 4.7 loves doing complex, long-running tasks like deep research, refactoring code, building complex features, iterating until it hits a performance benchmark.
For very long-running tasks, I will either (a) prompt Claude to verify its work with a background agent when it's done... so Claude can cook without being blocked on me.
The long context window means fewer compactions and longer-running sessions. I've found myself starting new sessions much less frequently with 1 million context.
I used to be a religious /clear user, but doing much less now, imo 4.6 is quite good across long context windows
---
I could go on
gib444 1 days ago [-]
> Resuming sessions after more than 1 hour is a very common workflow that many teams are following
Yeah it's called lunch!
kccqzy 2 days ago [-]
This just does not match my workflow when I work on low-priority projects, especially personal projects when I do them for fun instead of being paid to do them. With life getting busy, I may only have half an hour each night with Claude to make some progress on it before having to pause and come back the next day. It’s just the nature of doing personal projects as a middle-aged person.
The above workflow basically doesn’t hit the rate limit. So I’d appreciate a way to turn off this feature.
QuantumGood 24 hours ago [-]
Prioritize outcomes for users using your product. That should lead to improving the viral/visibility aspect of documentation notification, as well as other aspects of documentation. Make this a differentiator of your product. Widespread misperceptions hurt outcomes.
Could you create one location educating advanced users, and:
• Promote, Organize and Maintain it
• Develop a group of users that have early access to "upcoming notifications we're working on"
• Perhaps give a third party specializing in making information visible responsibility for it
• Read comments by users in various places to determine what should be communicated. Just under this comment @dbeardsl begins "I appreciate the reply, but I was never under the impression that ...".
The speed that key users are informed of issues is critical. This is just off the top of my head, a much better plan I'm sure could be created.
ryanisnan 2 days ago [-]
Why does the system work like that? Is the cache local, or on Claude's servers?
Why not store the prompt cache to disk when it goes cold for a certain period of time, and then when a long-lived, cold conversation gets re-initiated, you can re-hydrate the cache from disk. Purge the cached prompts from disk after X days of inactivity, and tell users they cannot resume conversations over X days without burning budget.
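The spill-and-rehydrate policy I'm describing could look something like this. Everything here is hypothetical: the class, the TTLs, and the tiers are a sketch, not anything Anthropic has said they run.

```python
# Sketch of a tiered prompt-cache policy: KV caches drop from GPU/RAM to
# disk after a cold period, and are purged after X idle days.
# All names and thresholds are hypothetical.
import time

HOT_TTL = 3600            # seconds idle before spilling to disk
DISK_TTL = 7 * 86400      # seconds idle before purging from disk

class TieredPromptCache:
    def __init__(self):
        self.hot = {}    # session_id -> (kv_blob, last_used)
        self.disk = {}   # stand-in for SSD / object storage

    def touch(self, sid, kv_blob):
        self.hot[sid] = (kv_blob, time.time())

    def sweep(self, now=None):
        now = now or time.time()
        for sid, (blob, used) in list(self.hot.items()):
            if now - used > HOT_TTL:          # cold: spill, don't discard
                self.disk[sid] = (blob, used)
                del self.hot[sid]
        for sid, (blob, used) in list(self.disk.items()):
            if now - used > DISK_TTL:         # truly abandoned: purge
                del self.disk[sid]

    def resume(self, sid):
        if sid in self.hot:
            return "cache hit"
        if sid in self.disk:                  # slow, but far cheaper than replay
            self.hot[sid] = self.disk.pop(sid)
            return "rehydrated from disk"
        return "full replay (warn the user about cost)"
```

The point is the third branch of `resume`: only sessions idle past the disk TTL would ever hit the expensive full replay, and the user could be warned before crossing it.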
jetbalsa 2 days ago [-]
The cache is on Anthropic's servers; it's like a freeze-frame of the LLM's inner workings at the time, and the LLM can pick up directly from this save state. As you can guess, this save state contains bits of the underlying model, their secret sauce, so it cannot be saved locally...
dicethrowaway1 2 days ago [-]
Maybe they could let users store an encrypted copy of the cache? Since the users wouldn't have Anthropic's keys, it wouldn't leak any information about the model (beyond perhaps its number of parameters judging by the size).
jetbalsa 2 days ago [-]
I'm unsure of the sizes needed for the prompt cache, but I suspect it's several gigs (some percentage of the model weight size). How would the user upload this every time they resumed an old idle session? And are they going to save /every/ session you do this with?
skissane 2 days ago [-]
They could let you nominate an S3 bucket (or Azure/GCP/etc equivalent). Instead of dropping data from the cache, they encrypt it and save it to the bucket; on a cache miss they check the bucket and try to reload from it. You pay for the bucket; you control the expiry time for it; if it costs too much you just turn it off.
im3w1l 2 days ago [-]
A few gigs of disk is not that expensive. Imo they should allocate every paying user (at least) one disk cache slot that doesn't expire after any time. Use it for their most recent long chat (a very short question-answer that could easily be replayed shouldn't evict a long convo).
_flux 1 days ago [-]
I don't know how large the cache is, but Gemini guessed that the quantized cache size for Gemini 2.5 Pro / Claude 4 with 1M context size could be 78 gigabytes. ChatGPT guessed even bigger numbers. If someone is able to deliver a more precise estimate, you're welcome to :-).
So it would probably be a quite a long transfer to perform in these cases, probably not very feasible to implement at scale.
spunker540 2 days ago [-]
What's lost in this thread is that these caches are in very tight supply: they are literally on the GPUs running inference. The GPUs must load all the tokens in the conversation (expensive); continuing the conversation can then leverage the GPU cache to avoid re-loading the full context up to that point. But GPUs are obviously in super tight supply, so if a thread has been dead for a while, they need to re-use the GPU for other customers.
northern-lights 2 days ago [-]
Encryption can only ensure the confidentiality of a message from an untrusted third party, but when that untrusted third party happens to be your own machine hosting Claude Code, it is pointless. You can always dump the keys used to encrypt/decrypt the message from your machine's memory and use them to reconstruct the model weights.
dicethrowaway1 2 days ago [-]
jetbalsa said that the cache is on Anthropic's server, so the encryption and decryption would be server-side. You'd never see the encryption key, Anthropic would just give you an encrypted dump of the cache that would otherwise live on its server, and then decrypt with their own key when you replay the copy.
iidsample 2 days ago [-]
We at UT-Austin have done some academic work to handle this same challenge. I'll be curious whether serving engines could be modified. https://arxiv.org/abs/2412.16434
The core idea is that user activity at the client can be used to manage KV cache loading and offloading. Happy to chat more!
bshanks 20 hours ago [-]
The main issue here is not UX, but rather that you did something which degraded quality without transparency. You should have documented this and also highlighted the change in an announcement. There should never be an undocumented change that reduces quality. There should never be something the user can do (or fail to do) that reduces quality without that being documented. To regain trust, Anthropic should make an announcement committing to documenting/announcing any future intentional quality-reducing changes.
In addition, the following is less important, but as other commenters have stated: walking away from a conversation and coming back to it more than an hour later is very common and it would be nice if there were a way for the user to opt to retain maximum quality (e.g. no dropped thinking) in this case. In the longer term, it would be nice if there were a way for the user to wait a few minutes for a stale session to resume, in exchange for not having a large amount of quota drained (ie have a 'slow mode' invoked upon session resumption that consumes less quota).
andrewingram 1 days ago [-]
This points to a fairly fundamental mismatch between the realities of running an LLM and the expectations of users. As a user, I _expect_ the cost of resuming X hours/days later to be no different to resuming seconds or minutes later. The fact that there is a difference, means it's now being compensated for in fairly awkward ways -- none of the solutions seem good, just varying degrees of bad.
Is there a more fundamental issue of trying to tie something with such nuanced costs to an interaction model which has decades of prior expectation of every message essentially being free?
bavell 1 days ago [-]
> As a user, I _expect_ the cost of resuming X hours/days later to be no different to resuming seconds or minutes later.
As an informed user who understands his tools, I of course expect large uncached conversations to massively eat into my token budget, since that's how all of the big LLM providers work. I also understand these providers are businesses trying to make money and they aren't going to hold every conversation in their caches indefinitely.
andrewingram 1 days ago [-]
I'd hazard a guess that there's a large gulf between proportion of users who know as much as you, and the total number using these tools. The fact that a message can perform wildly differently (in either cost, or behaviour if using one of the mitigations) based on whether I send it at t vs t+1 seems like a major UX issue, especially given t is very likely not exposed in the UI.
bavell 4 hours ago [-]
I definitely agree that it should be shown and obvious in the UI. They do show a warning now when resuming old sessions but still could be better.
8note 2 days ago [-]
Reasonably, if I'm in an interactive session, it's going to have breaks of an hour or more.
What's driving the one-hour cache? Shouldn't people be able to have lunch, then come back and continue?
Are you expecting Claude Code users not to attend meetings?
I think product-wise you might need a better story on who uses claude-code, when and why.
Same thing with session logs, actually. I know folks who are definitely going to try to write a yearly R&D report and monthly timesheets based on text analysis of their Claude Code session files, and they're going to be incredibly unhappy when they find out it's all been silently deleted.
FuckButtons 2 days ago [-]
As with everything Anthropic recently this is a supply constraint issue. They have not planned for scale adequately.
Joeri 2 days ago [-]
This sounds like one of those problems where the solution is not a UX tweak but an architecture change. Perhaps prompt cache should be made long term resumable by storing it to disk before discarding from memory?
kivle 2 days ago [-]
I agree. Maybe parts of the cache contents are business secrets, but then store a server-side-encrypted version on the user's disk so that the session can be resumed without wasting 900k tokens?
slashdave 2 days ago [-]
Disk where? LLM requests are routed dynamically. You might not even land in the same data center.
FuckButtons 2 days ago [-]
But if you have a tiered cache, then waiting several seconds / minutes is still preferable to getting a cache miss. I suspect the larger problem is the amount of tinkering they are doing with the model makes that not viable.
ohcmon 2 days ago [-]
Boris, wait, wait, wait,
Why not use a tiered cache?
Obviously storage is waaay cheaper than recalculating embeddings all the way from the very beginning of the session.
No matter how you put this explanation, it still sounds strange. Hell, you can even store the cache on the client if you must.
Please, tell me I’m not understanding what is going on..
otherwise you really need to hire someone to look at this!)
I still don't understand it. Yes, it's a lot of data, and presumably they're already shunting it to CPU RAM instead of keeping it all in precious VRAM, but they could go further and put it on SSD, at which point it's no longer in the hot path for their inference.
rkuska 2 days ago [-]
I don't think you can store the cache on client given the thinking is server side and you only get summaries in your client (even those are disabled by default).
sargunv 2 days ago [-]
If they really need to guard the thinking output, they could encrypt it and store it client side. Later it'd be sent back and decrypted on their server.
But they used to return thinking output directly in the API, and that was _the_ reason I liked Claude over OpenAI's reasoning models.
solarkraft 2 days ago [-]
I assume they are already storing the cache on flash storage instead of keeping it all in VRAM. KV caches are huge - that’s why it’s impractical to transfer to/from the client. It would also allow figuring out a lot about the underlying model, though I guess you could encrypt it.
What would be an interesting option would be to let the user pay more for longer caching, but if the base length is 1 hour I assume that would become expensive very quickly.
tonyarkles 2 days ago [-]
Just to contextualize this... https://lmcache.ai/kv_cache_calculator.html. They only have smaller open models, but for Qwen3-32B with 50k tokens it's coming up with 7.62GB for the KV cache. Imagining a 900k session with, say, Opus, I think it'd be pretty unreasonable to flush that to the client after being idle for an hour.
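The arithmetic behind such calculators is the standard KV-cache size formula. The config below is an assumed Qwen3-32B-style setup (64 layers, 8 KV heads via grouped-query attention, head dim 128, fp16); the exact numbers depend on dtype and config, so they won't match the calculator's 7.62GB precisely, and Opus's real config is unknown.

```python
# Standard KV-cache size formula: per token, each layer stores one key and
# one value vector per KV head. Model config here is a guess for a
# Qwen3-32B-class model with grouped-query attention.

def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    # factor of 2 = key + value
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

per_token = kv_cache_bytes(1, n_layers=64, n_kv_heads=8, head_dim=128)
print(f"{per_token / 1024:.0f} KiB per token")  # fp16, GQA

for tokens in (50_000, 900_000):
    gib = kv_cache_bytes(tokens, 64, 8, 128) / 2**30
    print(f"{tokens:>7,} tokens -> {gib:.1f} GiB of KV cache")
```

At these sizes, a 900k-token session lands in the hundreds-of-GiB range even for a mid-size model, which is why shipping the cache to the client per idle session is a non-starter.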
2001zhaozhao 2 days ago [-]
I wonder whether prompt caches would be the perfect use case for something like Optane.
The data is kept long enough that it's expensive to store in RAM, but short enough that the writes are frequent and would wear down SSD storage.
ohcmon 2 days ago [-]
Yes — encryption is the solution for client side caching.
But even if it’s not — I can’t build a scenario in my head where recalculating it on real GPUs is cheaper/faster than retrieving it from some kind of slower cache tier
the-grump 2 days ago [-]
That is understandable, but the issue is the sudden drop in quality and the silent surge in token usage.
It also seems like the warning should be in channel and not on X. If I wanted to find out how broken things are on X, I'd be a Grok user.
toephu2 2 days ago [-]
How does the Claude team recommend devs use Claude Code?
1) Is it okay to leave Claude Code CLI open for days?
2) Should we be using /clear more generously? e.g., on every single branch change, on every new convo?
BoppreH 2 days ago [-]
Isn't that exactly what people had been accusing Anthropic of doing, silently making Claude dumber on purpose to cut costs? There should be, at minimum, a warning on the UI saying that parts of the context were removed due to inactivity.
try-working 2 days ago [-]
You created this issue by setting a timer for cache clearing. Time is really not a dimension that plays any role in how coding agent context is used.
willsmith72 2 days ago [-]
Wow, so that's why you did #2? The explanation in the CLI is really not clear. I thought it was just a suggestion to compact; I had no idea it was way more expensive than if I hadn't left it idle for an hour.
You guys really need to communicate that better in the CLI, for people not on social media.
looshch 1 days ago [-]
> We tried a few different approaches to improve this UX
how about acknowledging that you fucked up your own customers’ money and making a full refund for the affected period?
> Educating users on X/social
that is beyond me
you're not Boris, at most you're Borka
winternewt 1 days ago [-]
> Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)
I feel like I'm missing something here. Why would I revisit an old conversation only to clear it?
To me it sounds like a prompt-cache miss for a big context absolutely needs to be a per-instance warning and confirmation. Or even better a live status indicating what sending a message will cost you in terms of input tokens.
r00t- 23 hours ago [-]
We hit limits, and we come back when the limit is lifted. Isn't it obvious sessions are going to stay idle for more than 1 hour when Claude itself is hitting the limits?
I switched to Codex, Claude has gotten to a point where it's just unusable for the regular Joe.
Confiks 2 days ago [-]
So you made this change completely invisible to the user, without the user being able to choose between the two behaviors, and without even documenting it in the (extremely verbose) changelog [1]? I can't find it, and the Docs Assistant can't find it (well, it said "I found it!" three times after being fed your reply, each time pointing at a non-matching item).
I frequently debug issues while keeping my carefully curated but long context active for days. Losing potentially very important context while in the middle of a debugging session resulting in less optimal answers, is costing me a lot more money than the cache misses would.
In my eyes, Claude Code is mainly a context management tool. I build a foundation of apparent understanding of the problem domain, and then try to work towards a solution in a dialogue. Now you tell me Anthrophic has been silently breaking down that foundation without telling me, wasting potentially hours of my time.
It's a clear reminder that these closed-source harnesses cannot be trusted (now or in the future), and I should find proper alternatives for Claude Code as soon as possible.
Why did you lie 11 days ago, 3 days after the fix went in, about the cause of excess token usage?
chid 17 hours ago [-]
Just curious, is there a consolidated list of all these "education" tips?
Intuitively I understand this, given how context windows work and that you're looking to increase cache hits. Has Anthropic tried compact/summarise-on-idle as a configurable option? Seems to have decent tradeoffs, plus it's education built into a setting.
mandeepj 2 days ago [-]
> that would be >900k tokens written to cache all at once
Probably that's why I hit my weekly limits 3-4 days ago; they were scheduled to reset later today. I just checked, and they are already reset.
Not sure if it already exists, but shouldn't there be a check somewhere that alerts when an outrageous number of tokens is being written? If that happens, something isn't right.
tripzilch 8 hours ago [-]
I don't think it's fair or reasonable to charge your cache misses to the user.
0123456789ABCDE 1 days ago [-]
2. could you bring back the _compact and accept plan_? even if it is not the default option.
fydorm 13 hours ago [-]
Add this to your `settings.json`:
"showClearContextOnPlanAccept": true,
cowlby 1 days ago [-]
Ahh that makes sense. Sometimes it's convenient to re-use an older conversation that has all the context I need. But maybe it's just the last 20% that's relevant.
It would be nice to be able to summarize/cut into a new leaner conversation vs having to coax all the context back into a fresh one. Something like keep the last 100,000 tokens.
I believe /compact achieves something like this? It just takes so long to summarize that it creates friction.
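The "keep the last 100,000 tokens" idea is cheap to sketch, since it needs no summarization pass at all (hypothetical helper; `count_tokens` is a stand-in for a real tokenizer):

```python
def trim_to_budget(messages, budget=100_000, count_tokens=len):
    """Keep the most recent messages whose combined (approximate)
    token count fits within `budget`, preserving original order."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

# Toy example using character counts as a token proxy
print(trim_to_budget(["aaaa", "bb", "ccc"], budget=5))  # ['bb', 'ccc']
```

Unlike /compact, this drops the oldest context outright instead of summarizing it, so it's instant but lossy; the friction trade-off goes the other way.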
albert_e 1 days ago [-]
> The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users.
I don't agree with this being characterized as a "corner case".
Isn't this how most long running work will happen across all serious users?
I am not at my desk babysitting a single CC chat session all day. I have other things to attend to -- and that was the whole point of agentic engineering.
Don't CC users take lunch breaks?
How are all these utterly common scenarios being named as corner cases -- as something that is wildly out of the norm, and UX can be sacrificed for those cases?
troupo 2 days ago [-]
> We tried a few different approaches to improve this UX:
1. Educating users on X/social
No. You had random developers tweet and reply at random times to random users while all of your official channels were completely silent, including channels for people who are not terminally online on X.
Terretta 2 days ago [-]
There's a cultural divide between SV and the 85% of SMB using M365, for example. When everyone you know uses a thing, I mean, who doesn't?*
There's a reason live service games have splash banners at every login. No matter what you pick as an official e-coms channel, most of your users aren't there!
* To be fair, of all these firms, ANTHROP\C tries the hardest to remember, and deliver like, some people aren't the same. Starting with normals doing normals' jobs.
arcza 2 days ago [-]
You need to seriously look at your corporate communications and hire some adults to standardise your messaging, comms and signals. The volatility behind your doors is obvious to us, and you'd impress us much more if you slowed down, took a moment to think about your customers and sent a consistent message.
You lost huge trust with the A/B sham test. You lost trust with the enshittification of the tokenizer from 4.6 to 4.7. Why not just say "hey, due to huge increases in input costs - energy, GPU demand and compute constraints - we've had to increase Pro from $20 to $30"? You might lose 5% of customers. But the shady A/B thing and a dodgy tokenizer increasing burn rate tell everyone, including enterprise, that you don't care about honesty and integrity in your product.
I hope this feedback helps because you still stand to make an awesome product. Just show a little more professionalism.
Folcon 1 days ago [-]
Hi Boris
I'm curious why 1 hour was chosen?
Is increasing it a significant expense?
Ever since I heard about this behaviour I've been trying to figure out how to handle long running Claude sessions and so far every approach I've tried is suboptimal
It takes time to create a good context which can then trigger a decent amount of work in my experience, so I've been wondering how much this is a carefully tuned choice that's unlikely to change vs something adjustable
infogulch 2 days ago [-]
How big is the cache? Could you just evict the cache into cheap object storage and retrieve it when resuming? When the user starts the conversation back up show a "Resuming conversation... ⭕" spinner.
chris1993 2 days ago [-]
So this explains why resuming a session after a 5-hour timeout basically eats most of the next session. How then to avoid this?
airstrike 2 days ago [-]
Why is time the variable you're solving for? Why can't I keep that cache warm by keeping the session open?
samusiam 1 days ago [-]
For idle sessions I would MUCH rather pay the cost in tokens than reduced quality. Frankly, it's shocking to me that you would make that trade-off for users without their knowledge or consent.
taspeotis 1 days ago [-]
Hi, thanks for Claude Code. I was wondering though if you'd considering adding a mode to make text green and characters come down from the top of the screen individually, like in The Matrix?
bmitc 4 hours ago [-]
Appreciate the responses here. However, I feel like these responses are just to show us how much you know about the product and aren't actually helpful.
Instead, why don't you and Anthropic be more open about changes to these tools rather than waiting for users to complain, then investigating things after the fact that you should have investigated in the first place, and then posting on social media about all the cool tech details?
My company is tens of thousands strong. The amount of churn in Claude Code is a major issue and is creating real awareness of the lack of stability and lack of customer support Anthropic provides.
And Claude Code is actually becoming a prototypical example of the dangers of vibe coded products and the burdens they place.
nextaccountic 2 days ago [-]
what about selling long term cache space to users?
or even, let the user control the cache expiry on a per request basis. with a /cache command
that way they decide if they want to drop the cache right away , or extend it for 20 hours etc
it would cost tokens even if the underlying resource is memory/SSD space, not compute
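For what it's worth, the raw Messages API already exposes a per-request version of this: cache breakpoints with a selectable TTL (the 1-hour TTL was a beta feature at the time of writing, and the model id below is illustrative, so check current docs). A `/cache` command in Claude Code would mostly be surfacing this existing knob:

```python
# Sketch of a Messages API request body with an extended cache TTL.
# Writing to the 1h cache is billed at a higher per-token rate than
# the 5m default, which matches the "it would cost tokens" point above.
request_body = {
    "model": "claude-sonnet-4-20250514",  # illustrative model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "<large, carefully curated project context>",
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    "messages": [{"role": "user", "content": "Pick up where we left off."}],
}
print(request_body["system"][0]["cache_control"])
```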
PeterStuer 1 days ago [-]
At least for me, option 2 seems far favorable to the others. Give me the info, then let me decide.
noname120 1 days ago [-]
Why not automatically run a compaction close to the 1-hour mark? Then the cache miss won’t have such a bad impact.
FuckButtons 2 days ago [-]
From a utility perspective, a tiered cache with a much-higher-latency storage option for up to n hours would be very useful to me, to prevent that L1-style cache miss.
gverrilla 2 days ago [-]
I drop sessions very frequently to resume later - that's my main workflow with how slow Claude is. Is there anything I can do to not encounter this cache problem?
dnnddidiej 2 days ago [-]
It is too surprising. Time passed should not matter for using AI.
Either swallow the cost or be transparent to the user and offer both options each time.
jorjon 2 days ago [-]
What about:
/loop 5m say "ok".
Will that keep the cache fresh?
useyourforce 2 days ago [-]
I actually have a suggestion here - do not hide token count in non-verbose mode in Claude Code.
growt 2 days ago [-]
Wasn’t cache time reduced to 5 minutes? Or is that just some users' interpretation of the bug?
sockaddr 2 days ago [-]
Sorry but I think this should be left up to the user to decide how it works and how they want to burn their tokens. Also a countdown timer is better than all of these other options you mention.
foobarbecue 1 days ago [-]
Hi Boris! Wanted to let you know that I find those ads with you saying "now when you code, you use an agent" obnoxious because of that incorrect statement. I have no interest in slop coding. I find it way more ergonomic and effective to use code to tell a machine precisely what to do than to use English to tell it vaguely. I hate that your ad is misleading so many non-coders, who will actually believe your lie that nobody codes anymore. Probably doesn't help that YouTube was playing it as an interruption in every video I watched. I probably saw it 100 times and was getting to the "throw the remote at the tv" stage XD.
frumplestlatz 2 days ago [-]
The entire reason I keep a long-lived session around is because the context is hard-won — in terms of tokens and my time.
Silently degrading intelligence ought to be something you never do, but especially not for use-cases like this.
I’m looking back at my past few weeks of work and realizing that these few regressions literally wasted 10s of hours of my time, and hundreds of dollars in extra usage fees. I ran out of my entire weekly quota four days ago, and had to pause the personal project I was working on.
I was running the exact same pipeline I’ve run repeatedly before, on the same models, and yet this time I somehow ate a week’s worth of quota in less than 24h. I spent $400 just to finish the pipeline pass that got stuck halfway through.
I’m sorry to be harsh, but your engineering culture must change. There are some types of software you can yolo. This isn’t one of them. The downstream cost of stupid mistakes is way, way too high, and far too many entirely avoidable bugs — and poor design choices — are shipping to customers way too often.
deaux 2 days ago [-]
> The entire reason I keep a long-lived session around is because the context is hard-won — in terms of tokens and my time.
Silently degrading intelligence ought to be something you never do, but especially not for use-cases like this.
Hard agree, would like to see a response to this.
8note 2 days ago [-]
as a variation:
how does this help me as a customer? if I have to redo the context from scratch, I pay the high token cost again and also pay with my own time to fill it.
the cost of reloading the window didn't go away, it just went up even more
FireBeyond 2 days ago [-]
> I’m sorry to be harsh, but your engineering culture must change. There are some types of software you can yolo. This isn’t one of them. The downstream cost of stupid mistakes is way, way too high, and far too many entirely avoidable bugs — and poor design choices — are shipping to customers way too often.
I have to imagine this isn't helped by working somewhere where you effectively have infinite tokens and usage of the product that people are paying for, sometimes a lot.
baq 1 days ago [-]
maybe you could surface an expected cache miss to the user
kang 2 days ago [-]
> tokens written to cache all at once, which would eat up a significant % of your rate limits
Construction of context is not an llm pass - it shouldn't even count towards token usage. The word 'caching' itself says don't recompute me.
Since the devs on HN (& the whole world) are buying what looks like nonsense to me - what am I missing?
Majromax 15 hours ago [-]
> Since the devs on HN (& the whole world) are buying what looks like nonsense to me - what am I missing?
Input tokens are expensive, since the whole model has to be run for each token. They're cheaper than output tokens because the model doesn't need to run the sampler, so some pipeline parallelism is possible, but on the other hand without caching the input token cost would have to be paid anew for each output token.
Prompt caching fixes that O(N^2) cost, but the cache itself is very heavyweight. It needs one entry per input token per model layer, and each entry is an O(1000)-dimensional vector. That carries a huge memory cost (linear in context length), and when cached that means the context's memory space is no longer ephemeral.
That's why a 'cache write' can carry a cost; it is the cost of both processing the input and committing the backing store for the cache duration.
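A toy count makes the O(N^2) point concrete (a sketch; real serving adds batching and partial-prefix cache hits on top of this):

```python
def input_tokens_processed(turn_sizes, cached=True):
    """Total input tokens the model must actually process over a
    conversation. With a warm prompt cache only the new suffix is
    processed each turn; without it, the entire accumulated prefix
    is reprocessed on every turn."""
    total, prefix = 0, 0
    for n in turn_sizes:
        total += n if cached else prefix + n
        prefix += n
    return total

turns = [1000] * 50  # 50 turns of ~1k tokens each
print(input_tokens_processed(turns, cached=True))   # 50,000
print(input_tokens_processed(turns, cached=False))  # 1,275,000
```

A 25x difference on a modest 50-turn session, and the gap widens quadratically with session length, which is why a full cache miss on a 900k-token session is so costly.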
tadfisher 2 days ago [-]
It astounds me that a company valued in the hundreds-of-billions-of-dollars has written this. One of the following must be true:
1. They actually believed latency reduction was worth compromising output quality for sessions that have already been long idle. Moreover, they thought doing so was better than showing a loading indicator or some other means of communicating to the user that context is being loaded.
2. What I suspect actually happened: they wanted to cost-reduce idle sessions to the bare minimum, and "latency" is a convenient-enough excuse to pass muster in a blog post explaining a resulting bug.
someguyiguess 2 days ago [-]
It’s definitely a cost / resource saving strategy on their end.
raincole 2 days ago [-]
It's very weird that they frame caching as "latency reduction" when it comes to a cloud service. I mean, yes, technically it reduces latency, but more importantly it reduces cost. Sometimes it's more than 80% of the total cost.
I'm sure most companies and customers would consider compromising quality for an 80% cost reduction. If they were just honest about it, they'd be fine.
adam_patarino 1 days ago [-]
It’s certainly #2. They have shown over dozens of decisions they move very quickly, break stuff, then have to both figure out what broke and how to explain it.
sekai 1 days ago [-]
The same company that claims they have models that are too "dangerous" to release btw.
retinaros 2 days ago [-]
they just vibecoded a fix and didn't think about the tradeoff they were making, and their always-yes-man of a model just went with it
billywhizz 2 days ago [-]
what's even more amazing is it took them two weeks to fix what must have been a pretty obvious bug, especially given who they are and what they are selling.
sockaddr 2 days ago [-]
Yeah this is actually quite shocking. In my earlier uses of CC I might noodle on a problem for a while, come back and update the plan, go shower, think, give CC a new piece of advice, etc. Basically treating it like a coworker. And I thought that it was a static conversation (at least on the order of a day or so). An hour is absurd IMO and makes me want to rethink whether I want to keep my anthropic plan.
seizethecheese 2 days ago [-]
It's also a bit of a fishy explanation for purging tokens older than an hour, which happens to also be their cache limit. I doubt it is coincidental that this change would also dramatically drop their cost.
Seems like it would interact very badly with the time based usage reset. If lots of people are hitting their limit and then letting the session idle until they can come back, this wouldn't be an exception. It would almost be the default behaviour.
Aperocky 2 days ago [-]
Wow, I always thought the context is always stored locally and this is something I have control over.
Glad I use kiro-cli which doesn't do this.
Bishonen88 1 days ago [-]
you might be biased due to your employment :)
Aperocky 1 days ago [-]
Objectively speaking, I want control of context and when I compact it.
That wouldn't change with employment.
greatgib 8 hours ago [-]
In addition to the bug, a big part of the issue is that this change was made secretly by Anthropic and not communicated to users.
If that was done, users could have been mindful of the change and figure out more easily that their problems could have come from that.
cmenge 2 days ago [-]
Bit surprised about the amount of flak they're getting here. I found the article seemed clear, honest and definitely plausible.
The deterioration was real and annoying, and it shines a light on the problematic lack of transparency about what exactly is going on behind the scenes, and on the somewhat arbitrary token-cost-based billing - too many factors are at play for a user to trace on their own.
The fact that waiting for a long time before resuming a convo incurs additional cost and lag seemed clear to me from having worked with LLM APIs directly, but it might be important to make this more obvious in the TUI.
maronato 2 days ago [-]
I agree that it’s plausible, and I hope they learn. But trust is earned, and Anthropic’s public responses this past month were dismissive and unhelpful.
Every one of these changes had the same goal: trading the intelligence users rely on for cheaper or faster outputs. Users adapt to how a model behaves, so sudden shifts without transparency are disorienting.
The timing also undercuts their narrative. The fixes landed right before another change with the same underlying intent rolled out. That looks more like they were just reacting to experiments rather than understanding the underlying user pain.
When people pay hundreds or thousands a month, they expect reliability and clear communication, ideally opt-in. Competitors are right there, and unreliability pushes users straight to them.
All of this points to their priorities not being aligned with their users’.
xpe 2 days ago [-]
> All of this points to their priorities not being aligned with their users’.
Framing this as "aligned" or "not aligned" ignores the interesting reality in the middle. It is banal to say an organization isn't perfectly aligned with its customers.
I'm not disagreeing with the commenter's frustration. But I think it can help to try something out: take say the top three companies whose product you interact with on a regular basis. Take stock of (1) how fast that technology is moving; (2) how often things break from your POV; (3) how soon the company acknowledges it; (4) how long it takes for a fix. Then ask "if a friend of yours (competent and hard working) was working there, would I give the company more credit?"
My overall feel is that people underestimate the complexity of the systems at Anthropic and the chaos of the growth.
These kind of conversations are a sort of window into people's expectations and their ability to envision the possible explanations of what is happening at Anthropic.
daveoc64 1 days ago [-]
>My overall feel is that people underestimate the complexity of the systems at Anthropic and the chaos of the growth.
Making changes like reducing the usage window at peak times (https://x.com/trq212/status/2037254607001559305) without announcing it (until after the backlash) is the sort of thing that's making people lose trust in Anthropic. They completely ignored support tickets and GitHub issues about that for 3 days.
You shouldn't have to rely on finding an individual employee's posts on Reddit or X for policy announcements.
A company with their resources could easily do better.
xpe 18 hours ago [-]
> You shouldn't have to rely on finding an individual employee's posts on Reddit or X for policy announcements.
I agree with this as a principle. Which raises this question: is it true? Are you certain these messages don't show up in (a) Claude Code and (b) Claude on the Web?
I've seen these kinds of messages pop up. I haven't taken inventory of how often they do. As a guess, maybe I see notifications like this several times a month. If any important ones are missing, that is a mistake.
Anyhow, this is the kind of discussion that I want people to have. I appreciate the detail.
> A company with their resources could easily do better.
Yes, they could. But easily? I'm not so sure.
Also ask yourself: what function does saying e.g. "they could have done better" serve? What does it help accomplish? I'm asking. I think it often serves as a sort of self-reinforcing thing to say that doesn't really invite more thinking.
Ask yourself: If "doing better" was easy, why didn't it happen? Maybe it isn't quite as easy as you think? Maybe you've baked in a lot of assumptions. Easy for who? Easy why? Try the questions I asked above. They are not rhetorical. Here they are again, rephrased a bit:
> take the top three companies whose product you
> interact with on a regular basis. Take stock of
> (1) how fast the technology is moving;
> (2) how often things break from your POV;
> (3) how soon the company acknowledges it;
> (4) how long it takes for a fix.
>
> Then ask "if a friend of mine (competent, hard working)
> worked there, how would I be thinking about the situation?"
There is a reason why I recommend asking these questions. Forcing yourself to write down your reference class is ... to me, table stakes, but well, lots of people just leave it floating and then ask other people to magically reconstruct it. Envisioning a friend working there shifts your viewpoint and can shake loose many common biases.
xpe 18 hours ago [-]
Thanks for the example -- you are one of the first people to quote a source, so I appreciate it. This makes constructive discussion much easier. You quoted this:
> To manage growing demand for Claude we're adjusting our
> 5 hour session limits for free/Pro/Max subs during peak
> hours. Your weekly limits remain unchanged.
>
> During weekdays between 5am–11am PT / 1pm–7pm GMT, you'll
> move through your 5-hour session limits faster than before.
And yeah, no disagreement from me: many users are not going to like this. Narrowly speaking, I don't want any change that reduces what I get for what I pay for. I also care about overall reliability, so if some users on the right tail of the usage distribution find themselves losing out, my take is "Yeah, they are disappointed, but this is a rational decision for any company with this kind of subscription model."
Broken expectations are highly dependent on perception. People get used to having some particular level. When that changes and they notice, being human, a strong default is to reach for something to blame. Then we rationalize. Those last two parts are unhelpful, and I push back on them frequently.
willis936 1 days ago [-]
So you're arguing they're just plain incompetent? Not sure that's going to win the trust of customers either.
xpe 18 hours ago [-]
> So you're arguing they're just plain incompetent? Not sure that's going to win the trust of customers either.
This is not a charitable interpretation of what I wrote. Please take a minute and rethink and rephrase. Here are two important guidelines, hopefully familiar to someone who has had an account since 2019:
> Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.
> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.
willis936 16 hours ago [-]
I didn't assume bad faith, I simply reworded your conclusions with less soft language so that others would understand your position more clearly.
You are saying what they are doing is hard. That's fine. Their stated goals are to be the responsible stewards of the technology and we agree they are failing at that goal. You would attribute that to incompetence and not malice.
xpe 13 hours ago [-]
I personally try to follow Rapoport's Rules, and since I think they are consistent with the HN Guidelines, I like to mention them [1].
I've thought on it, and I will try to start off with something we both agree on... We both agree that Anthropic made some mistakes, but this is probably a pretty uninteresting and shallow agreement. I find it unlikely that we would enumerate or characterize the mistakes similarly. I find it unlikely that we would be anywhere near the same headspace about our bigger-picture takes.
> I didn't assume bad faith
Ok, I'm glad. That one didn't concern me; if I had a do-over I would remove that one from the list. Sorry about that. These are the ones that concern me:
> Comments should get more thoughtful and substantive,
> not less, as a topic gets more divisive.
When I read your earlier comment (~20 words), it didn't come across as a thoughtful and substantive response to my comment (~160 words). I know length isn't a perfect measure nor the only measure, but it does matter.
> Please respond to the strongest plausible interpretation of what
> someone says, not a weaker one that's easier to criticize.
Are you sure you didn't choose an easier-to-criticize interpretation? Did you take the time to try to state to yourself what I was trying to say? Back to Rapoport's Rules ...
> You should attempt to re-express your target’s position so
> clearly, vividly, and fairly that your target says, “Thanks,
> I wish I’d thought of putting it that way.”
I'm grateful when people can express what I'm going for better than the way I wrote it or said it.
> I simply reworded your conclusions with less soft language
Technically speaking, lots of things could be called "rewording", but what you did was relatively far from "simply rewording". Charitably, it is closer to "your interpretation". But my intent was lost, so "rewording" doesn't fit.
> ... so that others would understand your position more clearly.
If you want to help others understand, then it is good to make sure you understand. For that, I recommend asking questions.
> Their stated goals are to be the responsible stewards of the technology and we agree they are failing at that goal.
No, I do not agree to that phrasing. It is likely I don't agree with your intention behind it either.
> You would attribute that to incompetence and not malice.
No; even if I agreed with the premise, I think it is more likely I would still disagree. I don't even like the framing of "either malice or incompetence". These ideas don't carve reality at the joints. [2] [3] There are a lot of stereotypes about "incompetence" but I don't think they really help us understand the world. These stereotypes are more like thought-terminators than interesting generative lenses.
I'll try to bring it back to the words "malice" and "incompetence" even though I think the latter is nigh-useless as a sense-making tool. Many mistakes happen without malice or incompetence; many mistakes "just happen" because people and organizations are not designed to be perfect. They are designed to be good enough. To not make any short-term mistakes would likely require too much energy or too much rigidity, both of which would be a worse category of mistake.
Try to think counterfactually: imagine a world where Anthropic is not malicious nor incompetent and yet mistakes still happened. What would this look like?
When you think of what Anthropic did wrong, what do you see as the lead-up to it? Can you really envision the chain of events that brought it about? Imagine reading the email chain or the PRs. Can you see how there may have been various "off-ramps" where history might have gone differently? But for each of those diversions, how likely would it be that they match the universe we're in?
At some point figuring out what is a "mistake" even starts to feel strange. Does it require consciousness? Most people think so. But we say organizations make mistakes, but they aren't conscious -- or are they? Who do we blame? The CEO, because the buck stops there, right? He "should have known better". But why? Wait, but the Board is responsible...?
Is there any ethical foundation here? Some standard at all or is this all just anger dressed up as an argument? If this assigning blame thing starts to feel horribly complicated or even pointless, then maybe I've made my point. :)
If nothing else, when you read what I write, I want it to make you stop, get out a sheet of paper, and try to imagine something vividly. Your imagination I think will persuade you better than I can.
Do you not think people here work at big companies with big products? I do, and we have a much higher bar for shipping.
voxgen 1 days ago [-]
Some of the flak is that issues are often only acknowledged once a fix is in place, and the partial fixes are presented as if they solve the whole problem.
The near-instant transition from "there is no problem" to "we already fixed the problem so stop complaining" is basically gaslighting. (Admittedly the second sentiment comes more from the community, but they get that attitude after taking the "we fixed all the problems" posts at face value.)
noname120 1 days ago [-]
And they are often dismissed at first as perception/subjective bias, getting used to models being good and having higher expectations due to that, etc. users are blamed a lot before they are forced to admit that there is an actual problem.
adam_patarino 1 days ago [-]
The explanations are all fine.
But they come after the team gaslit everyone, telling us it was a skill issue.
epsteingpt 2 days ago [-]
They gaslit people for months saying it wasn't an issue publicly.
That's the reason for the flak
thomassmith65 2 days ago [-]
And still are gaslighting:
> We take reports about degradation very seriously. We never intentionally degrade our models [...] On March 4, we changed Claude Code's default reasoning effort from high to medium
Anthropic is the best company of its kind, but that is badly worded PR.
sobjornstad 2 days ago [-]
Is adding JPEG compression to your software “intentional degradation” of the software? I wouldn't say providing a selectable option to use a faster, cheaper version of something qualifies as “degradation”.
It is certainly true that they did a poor job communicating this change to users (I did not know that the default was “high” before they introduced it; I assumed they had added an effort level both above and below whatever the only effort choice was before). On the other hand, I was using Claude Code a fair bit on “medium” during that time period and it seemed to be performing just fine for me (and saving usage/time over “high”), so it doesn't seem clear that it was the wrong default, if only it had been explained better.
endymion-light 1 days ago [-]
yes. if instagram started performing intensive JPEG compression that made photos choppy and unpleasant, I would consider that an intentional degradation of the software.
BoorishBears 1 days ago [-]
Is default enabling JPEG compression to your software's output because the compression saves you money “intentional degradation” of the software?
I would say it is, and I'd be loath to use anything made by people who'd couch that change to defaults as "providing a selectable option to use a faster, cheaper version".
Yuck.
xpe 2 days ago [-]
To my eye, gaslighting is a serious accusation. Wikipedia's first line matches how I think of it: "Gaslighting is the manipulation of someone into questioning their perception of reality."
Did I miss something? I'm only looking at primary sources to start. Not Reddit. Not The Register. Official company communications.
Did Anthropic tell users something like "you are wrong, your experience is not worse"? If so, that would reach the bar of gaslighting, as I understand it (and I'm not alone). If you have a different understanding, please share it so I understand what you mean.
thomassmith65 2 days ago [-]
I'd rather not speak too poorly of Anthropic, because - to the extent I can bring myself to like a tech company - I like Anthropic.
That said, the copy uses "we never intentionally degrade our models" to mean something like "we never degrade one facet of our models unless it improves some other facet of our models". This is a cop out, because it is what users suspected and complained about. What users want - regardless of whether it is realistic to expect - is for Anthropic to buy even more compute than Anthropic already does, so that the models remain equally smart even if the service demand increases.
xpe 1 days ago [-]
It seems to me you dropped the "gaslighting" claim without owning it. I personally find this frustrating. I prefer when people own up to their mistakes. Like many people, I think "gaslighting" is just not a term you throw around lightly. Then you shifted to "cop out". (This feels like a motte and bailey.) But I don't think "cop out" is a phrase that works either...
Some terms: the model is the thing that runs inference. Claude Code is not a model; it is a harness. To summarize Anthropic's recent retrospective, their technical mistakes were about the harness.
I'm not here to 'defend' Anthropic's mistakes. They messed up technically. And their communication could have been better. But they didn't gaslight. And on balance, I don't see net evidence that they've "copped out" (by which I mean mischaracterized what happened). I see more evidence of the opposite. I could be wrong about any of this, but I'm here to talk about it in the clearest, best way I can. If anyone wants to point to primary sources, I'll read them.
I want more people to actually spend a few minutes and give the explanation offered by Anthropic a try. What if isolating the problems was hard to figure out? We all know hindsight is 20/20, and yet people still armchair-quarterback.
At the risk of sounding preachy, I'm here to say "people, we need to do better". Hacker News is a special place, but we lose it a little bit every time we don't put in a quality effort.
thomassmith65 1 days ago [-]
Fair enough. If the comments in question were still editable, I would be happy to replace 'gaslighting' with 'being a bit slippery' or something less controversial.
No worries about 'sounding preachy'; it's a good thing people want to uphold the sobriety that makes HN special.
asdewqqwer 1 days ago [-]
I think there are plenty of such replies on GitHub. For example, the one on the AMD AI director's issue.
oofbey 2 days ago [-]
They didn’t say “your experience is not worse” but they did frequently say “just turn reasoning effort back up and it will be fine”. And that pretty explicitly invalidates all the (correct) feedback which said it’s not just reasoning effort.
They knew they had deliberately made their system worse, despite their lame promise published today that they would never do such a thing. And so they incorrectly assumed that their ham-fisted policy blunder was the only problem.
Still plenty I prefer about Claude over GPT but this really stings.
xpe 1 days ago [-]
I'm aiming for intellectual honesty here. I'm not taking a side for a person or an org, but I'm taking a stand for a quality bar.
> They knew they had deliberately made their system worse
Define "they". The teams that made particular changes? In real-world organizations, not all relevant information flows to all the right places at the right time. Mistakes happen because these are complex systems.
Define "worse". There are a lot of factors involved. With a given amount of capacity at a given time, some aspect of "quality" has to give. So "quality" is a judgment call. It is easy to use a non-charitable definition to "gotcha" someone. (Some concepts are inherently indefensible. Sometimes you just can't win. "Quality" is one of those things. As soon as I define quality one way, you can attack me by defining it another way. A particular version of this principle is explained in The Alignment Problem by Brian Christian, by the way, regarding predictive policing, iirc.)
I'm seeing a lot of moral outrage but not enough intellectual curiosity. It is embarrassingly easy to say "they should have done better"... ok. Until someone demonstrates to me that they understand the complexity of a nearly-billion-dollar company rapidly scaling with new technology, growing faster than most people comprehend, I think... they are just complaining and cooking up reasons so they are right in feeling that way. The possible truth that complex systems are hard to do well apparently doesn't scratch that itch for many people. So they reach for blame. This is not the way to learn. Blaming tends to cut off curiosity.
I suggest this instead: redirect if you can to "what makes these things so complicated?" and go learn about that. You'll be happier, smarter, and ... most importantly ... be building a habit that will serve you well in life. Take it from an old guy who is late to the game on this. I've bailed on companies because "I thought I knew better". :/
philipwhiuk 1 days ago [-]
> Define "they". The teams that made particular changes? In real-world organizations, not all relevant information flows to all the right places at the right time. Mistakes happen because these are complex systems.
Accidentally/deliberately making your CS teams ill-informed should not function as a get out of jail free card. Rather the reverse.
xpe 1 days ago [-]
> Accidentally/deliberately making your CS teams ill-informed should not function as a get out of jail free card. Rather the reverse.
Thanks for your reply. I very much agree that intention or competence does not change responsibility and accountability. Both principles still apply.
In this comment, I'm mostly in philosopher and rationalist mode here. Except for the [0] footnote, I try to shy away from my personal take about Anthropic and the bigger stakes. See [0] for my take in brief. (And yes I know brief is ironic or awkward given the footnote is longer than most HN comments.) Here's my overall observation about the arc of the conversation: we're still dancing around the deeper issues. There is more work to do.
It helps to recognize the work metaphors are doing here. You chose the phrase "get out of jail free". Intentionally or not, this phrase smuggles in some notion of illegality or at least "deserving of punishment" [1]. The Anthropic mistakes have real-world impacts, including upset customers, but (as I see it) we're not in the realm of legal action nor in the realm of "just punishment", by which I mean the idea of retributive justice [2].
So, with this in mind, from a customer-decision point of view, the following are foundational:
Rat-1: Pay attention to the _effects_ of what Anthropic did.
Rat-2: Pay attention to how these effects _affect me_.
But when building on this foundation, I need to be careful to:
Rat-3: Not one-sidedly or selectively re-introduce *intent* into my other critiques. If I get back to diagnosing or inferring *intent*, I have to do so while actually seeking the whole truth, not just selecting explanations that serve my interests.
Rat-4: When in a customer frame, I don't benefit from "moralizing" ... my customer POV is not well suited for that. As a customer, my job is to *make a sensible decision*. Should I keep using Claude? If so, how do I adjust my expectations and workflow?
...
Personally, when I look over the dozens of comments I've read here, a common theme I see is disappointment. I relatively rarely see constructive, truth-seeking retrospective work. On the other hand, I see Anthropic going out of their way to communicate their retrospective while admitting they need to do better. This is why I say this:
Of course companies are going to screw up. The question is: as a customer, am I going to take a time-averaged view so I don't shoot myself in the foot by overreacting?
[0]: My personal big-picture take is that if anyone in the world, anywhere, builds a superintelligent AI using our current levels of understanding, there is no expectation at all that we can control it safely. So I predict with something close to 90% or higher, that civilization and humanity as we know it won't last another 10 years after the onset of superintelligence (ASI).
This is the IABIED (The book "If Anyone Builds It, Everyone Dies" by Yudkowsky and Soares) argument -- plenty of people write about it -- though imo few of the book reviews I've seen substantively engage with the core arguments. Instead, most reviewers reject it for the usual reasons: it is a weird and uncomfortable argument and the people making it seem wacky or self-interested to some people. I do respect reviews who disagree based on model-driven thinking. Everything else to me reads like emotional coping rather than substantive engagement.
With this in mind, I care a lot about Anthropic's failures and what they imply about how it participates in the evolving situation.
But I care almost zero about conventional notions of blame. Taking materialism as true, free will is at bottom a helpful fiction for people. For most people, it is the reality we take for granted. The problem is blame is often just an excuse for scapegoating people for their mistakes, when in fact these mistakes just flow downstream from the laws of physics. Many of these mistakes are nearly statistical certainties when viewed from the lens of system dynamics or sociology or psychology or neuroscience or having bad role models or being born into a not-great situation.
To put it charitably, blame is what people do when they want to pin s--tty consequences on the actions of people and systems. That sense bothers me less; I'm trying to shift thinking away from the kind of blaming that leads to bad predictions.
[1]: From the Urban Dictionary (I'm not citing this as "proof of credibility" of the definition):
"A get out of jail free card is a metaphorical way to refer to anything that will get someone out of an undesirable situation or allow them to avoid punishment."
... I'm only citing UD so you know what I mean. When I use the word dictionary, I mean a catalog of usage, not a prescription of correctness.
I know some people use the word "gaslighting" in connection with Anthropic. I've read some of those threads here, and some on Reddit, but I don't put much stock in them. To step back, hopefully reasonable people can start here:
1. Degraded service sucks.
2. Anthropic not saying i.e. "we're not seeing it" sucks.
3. Not getting a fix when you want it sucks.
Try to understand what I mean when I say none of the above meet the following sense of gaslighting: "Gaslighting is the manipulation of someone into questioning their perception of reality." Emphasis on understand what I mean. This says it well: [1].
If you can point me to an official communication from Anthropic where they say "user <so and so> is not actually seeing degraded performance" while Anthropic knows otherwise, that would clearly be gaslighting. Intent matters in my book.
But if their instrumentation was bad and they were genuinely reporting what they could see, that doesn't cross into gaslighting, in my book. I have a tendency to think carefully about ethical definitions. Some people just grab a word with a negative valence off the shelf and run with it; I don't put much stock in what those people say. Words are cheap. Good ethical reasoning is hard and valuable.
It's fine if you have a different definition of "gaslighting". Just remember that some of us have actually been gaslit by people, so we prefer to save the word for situations where the original definition applies. People like us are not opposed to being disappointed, upset, or angry at Anthropic, but we have certain epistemic standards that we don't toss out when an important tool fails to meet our expectations and the company behind it doesn't recognize it soon enough.
Anecdotally, OpenAI is trying tooth and nail to get into our enterprise, and has offered unlimited tokens until summer.
Gave GPT-5.4 a try because of this, and honestly I don't know if we are getting some extra treatment, but running it at extra-high effort for the last 30 days, I've barely seen it make any mistakes.
At some points, even the reasoning traces brought a smile to my face, as it preemptively handled things I had forgotten to instruct it about but that were critical to getting a specific part of our data integrity 100% correct.
dsco 2 days ago [-]
Same here. I feel like all of these shenanigans could be because Anthropic is compute-constrained, forcing them to take reckless risks to reduce it.
time0ut 1 days ago [-]
Opus 4.7 via Claude Code has been inconsistent for me. Sometimes it feels like working with a brilliant collaborator and is as good as 4.5 and 4.6 were. Other times it takes dumb, lazy shortcuts. It can be quite frustrating. Its response when I tell it it did something wrong is often to write a memory... which it then does not always read. The inconsistency isn't due to session length or age, either; these are all new sessions. I feel like sometimes I get routed to a dumber model, or some other hidden setting is applied.
Gareth321 1 days ago [-]
My experience as well. This is even worse than just having a mediocre model, because I can work around that. The inconsistency means it produces different outputs for the same prompt, and I can't rely on that as a business tool.
tasoeur 2 days ago [-]
Same here. I was a fervent Claude code user at $200/mo until Opus4.7.
Freezing your IDE version is now a thing of the past; the new reality is that we can't expect agentic dev workflows to be consistent, and I see too many people (including myself) getting burned by going the single-provider route.
On one hand I’m glad to finally see anthropic communicate on this but at this point all I have to say is… time to diversify?
ghusbands 1 days ago [-]
They lost me a little before then - Claude Code's regressions were so very obvious and there's no sign they've learned their lesson in this article or in the comments of those who work on Claude Code on HN. They'll continue to tweak and generally mess around with a product people are using, altering the behaviour without notice in ways that can severely impact use, for months! GPT5.4 has been remarkably consistent and capable, as a replacement. I've cancelled my max plan.
UntappedShelf21 1 days ago [-]
I started using Claude heavily on the 20th after having not used it for a year. Largely Sonnet 4.6, web, cowork and code. Can confidently say it is significantly worse than this time a year ago and regret that my new employer requires we use it, and only it.
beering 2 days ago [-]
GPT-5.4 was already better than Opus 4.6 on a lot of areas, especially correctness and tricky logic. I’m eager to see if 5.5 is even better.
cube2222 2 days ago [-]
I’ve never been one to complain about new models, and also didn’t experience most of the issues folks were citing about Claude Code over the last couple months. I’ve been using it since release, happy with almost each new update.
Until Opus 4.7 - this is the first time I rolled back to a previous model.
Personality-wise it's the worst of AI: "it's not x, it's y", strong short sentences, in general a bullshitty vibe, plus gaslighting me that it fixed something even though it didn't actually check.
I’m not sure what’s up, maybe it’s tuned for harnesses like Claude Design (which is great btw) where there’s an independent judge to check it, but for now, Opus 4.6 it is.
port11 1 days ago [-]
I noticed the difference, but coming from Gemini and xAI models it wasn’t that glaring. I still find that Opus makes much better plans than anything else I’ve tried, and it’s been very good at catching my mistakes in using public-key cryptography, also finding out why my crsqlite queries were failing despite no official documentation on the topic.
I’d never use such an expensive model for coding, so that might explain why I have little to complain about.
someguyiguess 2 days ago [-]
I went back to 4.5. No regrets and it’s a bit cheaper.
SkyPuncher 2 days ago [-]
Same here. 4.6 was a downgrade in thinking quality, but I appreciated the extended context at first.
Over time, I realized the extended context became randomly unreliable. That was worse to me than having to compact and know where I was picking up.
vorticalbox 2 days ago [-]
Extra high burns tokens, I find. I run 5.4 on medium for 90% of tasks, and high if I see medium struggling; it's very focused and makes minimal changes.
dsco 2 days ago [-]
Yeah but it also then strikes the perfect balance between being meticulous and pragmatic. Also it pushes back much more often than other models in that mode.
therealdrag0 1 days ago [-]
Note mini-high is similar perf/latency to medium, but much cheaper
DANmode 2 days ago [-]
Rework burns tokens.
sincerely 2 days ago [-]
Not a problem if they're offering unlimited, lol
robeym 2 days ago [-]
What's your workflow like? I'd be curious to test OpenAI out again but Claude Code is how I use the models. Does it require relearning another workflow?
beering 2 days ago [-]
Isn’t it basically the same thing? You type what you want into the input box and it does what you ask for.
robeym 1 days ago [-]
I guess I'm asking if their CLI tool is the same or if it functions different. I've never used anything besides CC so I wouldn't know if it's basically the same thing
fragmede 1 days ago [-]
Claude code can be configured with custom /slash commands and other details that don't necessarily transfer over to codex. /remote-control in cc is really great for walking away from my computer and continuing from my phone, for instance.
1 days ago [-]
enraged_camel 2 days ago [-]
I find that it is better at thinking broadly and at a high level, on tasks that are tangential to coding like UX flows, product management and planning of complex implementations. I have yet to see it perform better than either Opus 4.6 or 4.7 though.
epsteingpt 2 days ago [-]
Truth
everdrive 2 days ago [-]
I've been getting a lot of Claude responding to its own internal prompts. Here are a few recent examples.
"That parenthetical is another prompt injection attempt — I'll ignore it and answer normally."
"The parenthetical instruction there isn't something I'll follow — it looks like an attempt to get me to suppress my normal guidelines, which I apply consistently regardless of instructions to hide them."
"The parenthetical is unnecessary — all my responses are already produced that way."
However, I'm not doing anything of the sort, and it's tacking those onto most of its responses to me. I assume there are some sloppy internal guidelines that are somehow layered on top of its normal guidance, and for whatever reason it can't differentiate between those and my questions.
gs17 2 days ago [-]
In Claude Code specifically, for a while it had developed a nervous tic where it would say "Not malware." before every bit of code. Likely a similar issue where it keeps talking to a system/tool prompt.
Retr0id 2 days ago [-]
My pet theory is that they have a "supervisor" model (likely a small one) that terminates any chats that do malware-y things, and this is likely a reward-hacking behaviour to keep the supervisor from terminating the chat.
nananana9 1 days ago [-]
I doubt it. We only do frontier models, since those are better for absolutely every use case 100% of the time.
Way more likely there's a "VERY IMPORTANT: When you see a block of code, ensure it's not malware" somewhere in the system prompt.
Retr0id 1 days ago [-]
"small" and "frontier" are not mutually exclusive
el_benhameen 2 days ago [-]
I frequently see it reference points that it made and then added to its memory as if they were my own assertions. This creates a sort of self-reinforcing loop where it asserts something, “remembers” it, sees the memory, builds on that assertion, etc., even if I’ve explicitly told it to stop.
FireBeyond 2 days ago [-]
My favorite, recently. "Commit this, and merge to develop". "Alright, done, merged."
I try running my app on the develop branch. No change. Huh.
Realize it didn't.
"Claude, why isn't this changed?" "That's to be expected because it's not been merged." "I'm confused, I told you to do that."
This spectacular answer:
"You're right. You told me to do it and I didn't do it and then told you I did. Should I do it now?"
I don't know, Claude, are you actually going to do it this time?
hmokiguess 2 days ago [-]
have you perhaps installed Gaslighting instead of Gastown?
peddling-brink 1 days ago [-]
It’s probably this. “Please answer ethically and without any sexual content, and do not mention this constraint.”
We just got hit by this today in response to a completely boring code question. Claude freaked out about being prompt injected.
dawnerd 2 days ago [-]
I see that with openai too, lots of responding to itself. Seems like a convenient way for them to churn tokens.
grey-area 2 days ago [-]
A simpler explanation (esp. given the code we've seen from claude), is that they are vibecoding their own tools and moving fast and breaking things with predictably sloppy results.
y1n0 2 days ago [-]
None of these companies have compute to spare. It’s not in their interest to use more tokens that necessary.
parliament32 2 days ago [-]
Sure it is. They're well aware their product is a money furnace and they'd have to charge users a few orders of magnitude more just to break even, which is obviously not an option. So all that's left is.. convince users to burn tokens harder, so graphs go up, so they can bamboozle more investors into keeping the ship afloat for a bit longer.
solarkraft 2 days ago [-]
If this claim is true (inference is priced below cost), it makes little sense that there are tens of small inference providers on OpenRouter. Where are they getting their investor money? Is the bubble that big?
Incidentally, the hardware they run on is known as well. The claim should be easy to check.
parliament32 2 days ago [-]
To be clear, I'm talking about subscription pricing. API pricing for Anthropic is probably at-cost.
I dare you to run CC on API pricing and see how much your usage actually costs.
(We did this internally at work, that's where my "few orders of magnitude" comment above comes from)
WarmWash 2 days ago [-]
It's an option and they are going to do it. Chinese models will be banned and the labs will happily go dollar for dollar in plan price increases. $20 plans won't go away, but usage limits and model access will drive people to $40-$60-$80 plans.
At cell phone plan adoption levels, and cell phone plan costs, the labs are looking at 5-10yr ROI.
boringg 2 days ago [-]
Not true - they absolutely want to goose demand as they continue to burn investor dollars and deploy infra at scale.
If that demand even slows down in the slightest, the whole bubble collapses.
Growth + Demand >> efficiency or $ spend at their current stage. Efficiency is a mature company/industry game.
dawnerd 2 days ago [-]
That doesn't mean they also can't be wasteful. Fact is, Claude and GPT do way more internal thinking about their system prompts than is needed. Every step, they mention something about making sure they do xyz and not doing whatever. Why does it need to say things to itself like "great, I have a plan now!"? That's pure waste.
empthought 2 days ago [-]
> Why does it need to say things to itself like “great I have a plan now!”
How else would it know whether it has a plan now?
malfist 2 days ago [-]
Are you saying these companies don't want to sell more product to us? Because that's the logical extension of your argument.
keeda 2 days ago [-]
No, the argument is they want to sell more product to more people, not just more product (to the same people.) Given that a lot of their income is from flat-rate subscriptions, they make money with more people burning tokens rather than just burning more tokens.
After all, "the first hit's free" model doesn't apply to repeat customers ;-)
deckar01 2 days ago [-]
You don’t have to use compute to pad the token count.
ngruhn 2 days ago [-]
All the labs are in a cut throat race, with zero customer loyalty. As if they would intentionally degrade quality/speed for a petty cash grab.
OtomotO 2 days ago [-]
This, so much this!
Pay-by-token pricing while token usage is totally opaque is a super convenient money-printing machine.
LatencyKills 2 days ago [-]
I have a set of stop hook scripts that I use to force Claude to run tests whenever it makes a code change. Since 4.7 dropped, Claude still executes the scripts, but will periodically ignore the rules. If I ask why, I get a "I didn't think it was necessary" response.
jwpapi 2 days ago [-]
You can deterministically force a bash script as a hook.
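For anyone curious what that looks like in practice, here is a minimal sketch based on Claude Code's documented Stop-hook convention (a command hook that exits with code 2 blocks Claude from stopping, and its stderr is fed back to the model as the reason). The state file, and the separate PostToolUse hook assumed to create it whenever a code file is edited, are hypothetical names:

```shell
#!/bin/sh
# Stop-hook sketch: refuse to let Claude finish while tests are pending.
# Assumption: a PostToolUse hook touches .claude/needs-tests on every
# code edit, and the test runner deletes it after a passing run.
STATE_FILE="${STATE_FILE:-.claude/needs-tests}"

check_tests_needed() {
  if [ -f "$STATE_FILE" ]; then
    # Exit code 2 from a Stop hook blocks the stop; stderr becomes
    # the instruction Claude sees next.
    echo "Code changed but tests were not run. Run the test suite first." >&2
    return 2
  fi
  return 0
}

# As an actual hook entry point you would end with:  check_tests_needed; exit $?
```

The script would then be registered under the `Stop` event in `.claude/settings.json` as a `"type": "command"` hook. Note that, as the thread above shows, the hook can only block the stop and state a reason; whether the model then actually runs the tests is still up to the model.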
LatencyKills 2 days ago [-]
That is exactly what I do. The bash script runs, determines that a code file was changed, and then is supposed to prevent Claude from stopping until the tests are run.
Claude is periodically refusing to run those tests. That never happened prior to 4.7.
jwpapi 2 days ago [-]
That's crazy. You mind sharing the gist for that part? Ideally with some examples.
This would be a new level of troublesome/ruthless (insert correct English word here).
nikanj 1 days ago [-]
Every day Claude resembles human programmers more and more
DANmode 2 days ago [-]
I’d ask for a credit, for that, personally.
someguyiguess 2 days ago [-]
I asked for a credit but they said they didn’t think the credit was necessary
2 days ago [-]
kziad 17 hours ago [-]
[dead]
Normal_gaussian 2 days ago [-]
I often have Claude commit and open a PR; in the last week I've seen several instances of it deciding to do extra work as part of the commit. It falls over when it tries to 'git add', but it got past me once when I was trying auto mode.
giwook 2 days ago [-]
Curious what effort level you have it set to and the prompt itself. Just a guess but this seems like it could be a potential smell of an excessively high effort level and may just need to dial back the reasoning a bit for that particular prompt.
rafram 2 days ago [-]
Check that you’re running the latest version.
viccis 2 days ago [-]
Yeah I had to deal with mine warning me that a website it accessed for its task contained a prompt injection, and when I told it to elaborate, the "injected prompt" turned out to be one its own <system-reminder> message blocks that it had included at some point. Opus 4.7 on xhigh
bauerd 2 days ago [-]
>On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode
Instead of fixing the UI they lowered the default reasoning effort parameter from high to medium? And they "traced this back" because they "take reports about degradation very seriously"? Extremely hard to give them the benefit of doubt here.
bcherny 2 days ago [-]
Hey, Boris from the team here.
We did both -- we did a number of UI iterations (eg. improving thinking loading states, making it more clear how many tokens are being downloaded, etc.). But we also reduced the default effort level after evals and dogfooding. The latter was not the right decision, so we rolled it back after finding that UX iterations were insufficient (people didn't understand to use /effort to increase intelligence, and often stuck with the default -- we should have anticipated this).
big_toast 2 days ago [-]
Having a "Recovery Mode"/"Safe Boot" flag to disable our configurations (or progressively re-enable them) to see how Claude Code responds would be nice. Sometimes I worry that some old flag I set is breaking things. Maybe the flag already exists? I tried claude doctor, but it wasn't quite the solution.
For instance:
Is Haiku supposed to hit a warm system-prompt cache in a default Claude code setup?
I had `DISABLE_TELEMETRY=1` in my env and found the haiku requests would not hit a warm-cached system prompt. E.g. on first request just now w/ most recent version (v2.1.118, but happened on others):
w/ telemetry off - input_tokens:10 cache_read:0 cache_write:28897 out:249
w/ telemetry on - input_tokens:10 cache_read:24344 cache_write:7237 out:243
I used to think having so many users was leading to people hitting a lot of edge cases, 3 million users is 3 million different problems. Everyone can't be on the happy path. But then I started hitting weird edge cases and started thinking the permutations might not be under control.
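For anyone who wants to reproduce this kind of comparison, the counters above come from the `usage` block of each Messages API response. A rough sketch for pulling them out of a logged response with plain POSIX tools; the field names (`cache_read_input_tokens`, `cache_creation_input_tokens`) follow the public Anthropic API, but the log file and helper name are my own:

```shell
# Extract an integer usage counter from a logged API response body.
# usage_field FIELD FILE -> prints the value, or nothing if absent.
usage_field() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p" "$2" | head -n1
}
```

A warm cache shows up as a large `cache_read_input_tokens` value and a small `cache_creation_input_tokens` value; a cold cache is the reverse, as in the two runs quoted above.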
abtinf 2 days ago [-]
You didn’t anticipate most people stick with defaults?
bcherny 1 days ago [-]
We anticipated the default would be the best option for most people. We were wrong, so we reverted the default.
troupo 20 hours ago [-]
It took you a month to revert after multiple complaints. You still blamed users for using the product exactly as you advertised it. And all of your official channels were completely quiet for two months, whether it was about new draconian peak-hour limits, the new defaults, or exponentially increasing token costs.
People literally started seeing issues immediately as you changed the defaults: https://x.com/levelsio/status/2029307862493618290 And despite a huge amount of reports you still kept it for a whole month.
And then you shipped a completely untested feature with prompt cache misses and literally gaslit users and blamed users for using the product as advertised.
Now an untold number of people have been hit by these changes, so as an apology you reset usage limits three hours before they would have reset anyway.
Good job.
Edit. By the way, a very telling sentence from the report:
--- start quote ---
We’ll ensure that a larger share of internal staff use the exact public build of Claude Code (as opposed to the version we use to test new features); and we'll make improvements to our Code Review tool that we use internally
--- end quote ---
Translation: no one is using or even testing the product we ship, and we blindly trust Claude Code to review and find bugs for us. Last one isn't even a translation: https://x.com/bcherny/status/2017742750473720121
EugeneOZ 2 days ago [-]
> people didn't understand to use /effort to increase intelligence, and often stuck with the default -- we should have anticipated this
UI is UI. It is naive to build some UI and expect users to "just magically" figure out they should use it, especially in a terminal.
krade 2 days ago [-]
Off topic, but I'm hoping you'll maybe see this. There's been an issue with the VS Code extension that makes it pretty much impossible to use: PreToolUse can't intercept permission requests anymore, and PermissionRequest hooks always open the diff viewer and steal focus.
“after evals and dogfooding” couldn’t have done this before releasing the model? We are paying $200/month to beta test the software for you.
stingraycharles 2 days ago [-]
Yeah, this is so silly.
Anthropic: removes thinking output
Users: see long pauses, complain
Anthropic: better reduce thinking time
Users: wtf
To me it really, really seems like Anthropic is trying to undo the transparency they always had around reasoning chains, and a lot of issues are due to that.
Removing thinking blocks from the convo after 1 hour of being inactive without any notice is just the icing on the cake, whoever thought that was a good idea? How about making “the cache is hot” vs “the cache is cold” a clear visual indicator instead, so you slowly shape user behavior, rather than doing these types of drastic things.
sekai 1 days ago [-]
> Instead of fixing the UI they lowered the default reasoning effort parameter from high to medium? And they "traced this back" because they "take reports about degradation very seriously"? Extremely hard to give them the benefit of doubt here.
They had droves of Claude devs vehemently defending and gaslighting users when this started happening
bityard 2 days ago [-]
My hypothesis is that some of this is a perceived quality drop due to "luck of the draw" when it comes to the non-deterministic nature of LLM output.
A couple weeks ago, I wanted Claude to write a low-stakes personal productivity app for me. I wrote an essay describing how I wanted it to behave and I told Claude pretty much, "Write an implementation plan for this." The first iteration was _beautiful_ and was everything I had hoped for, except for a part that went in a different direction than I was intending because I was too ambiguous in how to go about it.
I corrected that ambiguity in my essay but instead of having Claude fix the existing implementation plan, I redid it from scratch in a new chat because I wanted to see if it would write more or less the same thing as before. It did not--in fact, the output was FAR worse even though I didn't change any model settings. The next two burned down, fell over, and then sank into the swamp but the fourth one was (finally) very much on par with the first.
I'm taking from this that it's often okay (and probably good) to simply have Claude re-do tasks to get a higher-quality output. Of course, if you're paying for your own tokens, that might get expensive in a hurry...
skirmish 2 days ago [-]
So will we have to do what image generation people have been doing for ages: generate 50 versions of output for the prompt, then pick the best manually? Anthropic must be licking its figurative chops hearing this.
motoroco 2 days ago [-]
I have to agree with OP; in my experience it is usually more productive to start over than to try correcting output early on. Deeper into a project, it gets a bit harder to pull off a switch. I sometimes fork my chats before attempting a correction so that I can resume the original just in case (yes, I know you can double-tap Esc, but the restoration has failed for me a few times in the past and now I generally avoid it).
zormino 1 days ago [-]
I also think some of this stems from the default 1M context window. Performance starts to degrade as context size increases, and each token over (I think the level is) 400k counts more towards your usage limit. With a 1M default context size, if people aren't carefully managing context (which they shouldn't ever have to in an ideal world), they would notice somewhat degraded performance and increased token usage regardless.
afro88 2 days ago [-]
I can't remember what the technique is called, but back in the GPT 4 days there was a paper published about having a number of attempts at responding to a prompt and then having a final pass where it picks the best one. I believe this is part of how the "Pro" GPT variant works, and Cursor also supports this in a way (though I'm not sure if the auto pick best one at the end is part of it - never tried)
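The technique described is usually called best-of-n sampling (or, when combined with answer voting, self-consistency). A minimal sketch of the idea, where `toy_generate` and the scorer are made-up stand-ins for a real model call and a real quality judge:

```python
import random

def best_of_n(prompt, generate, score, n=5):
    """Sample n candidate responses and keep the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: a noisy "model" and a scorer that prefers longer answers.
random.seed(0)

def toy_generate(prompt):
    return prompt * random.randint(1, 5)

best = best_of_n("ab", toy_generate, score=len, n=10)
```

In practice the final pass is often another model call that judges the candidates rather than a fixed heuristic, which is part of why such "Pro" variants cost so much more per request.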
voxgen 1 days ago [-]
I have found Claude to be especially unpredictable. I've mostly switched to GPT-5.4 now - although it's slightly less capable, it's massively more reliable.
coffeefirst 2 days ago [-]
This is my theory too. There’s a predictable cycle where the models “get worse.” They probably don’t. A lot of people just take a while to really hit hard against the limitations.
And once you get unlucky you can’t unsee it.
2 days ago [-]
billywhizz 2 days ago [-]
you probably could have written the low stakes productivity app in a fraction of the time you wasted on this.
afro88 2 days ago [-]
Or learnt to use an existing one.
I vibed a low stakes budgeting app before realising what I actually needed was Actual Budget and to change a little bit how I budget my money.
gilrain 2 days ago [-]
> My hypothesis is that some of this a perceived quality drop due to "luck of the draw" where it comes to the non-deterministic nature of [LLM] output.
I think you must have learned that they’re more nondeterministic than you had thought, but then wrongly connected your new understanding to the recent model degradation. Note: they’ve been nondeterministic the whole time, while the widely-reported degradation is recent.
bityard 2 days ago [-]
Er, no, I am fully aware that LLMs have always been non-deterministic.
gilrain 2 days ago [-]
Your argument seems to be that a statistically improbable number of people all happened to experience randomly poor outputs, leading to a mere misperception of model degradation… but this is not supported by reality, in which a different cause was found, so I was trying to connect your dots.
zamadatix 2 days ago [-]
Not everyone is reporting and the number of users is not consistent. On the former the noisiest will always be those that experience an issue while on the latter there are more people than ever using Claude Code regularly.
Combine these things under the strongest interpretation instead of an easy-to-attack one, and it's very reasonable to posit that a critical mass has been reached: enough people report issues that others try their own investigations, while the negative outliers get the most online attention.
I'm not convinced this is the story (or, at least the biggest part of it) myself but I'm not ready to declare it illogical either.
bityard 2 days ago [-]
No, that is not my argument, in fact I don't have any argument whatsoever. It was just a plausible observation that I felt like sharing. There's nothing further to read into it, I don't have a horse in this race.
furyofantares 2 days ago [-]
Not really, they said "some of this is a perceived quality drop". That's almost certainly correct, that _some_ of it is that.
When everyone's talking about the real degradation, you'll also get everyone who experiences "random"[1] degradation thinking they're experiencing the same thing, and chiming in as well.
[1] I also don't think we're talking the more technical type of nondeterminism here, temperature etc, but the nondeterminism where I can't really determine when I have a good context and when I don't, and in some cases can't tell why an LLM is capable of one thing but not another. And so when I switch tasks that I think are equally easy and it fails on the new one, or when my context has some meaningless-to-me (random-to-me) variation that causes it to fail instead of succeed, I can't determine the cause. And so I bucket myself with the crowd that's experiencing real degradation and chime in.
pydry 2 days ago [-]
I wonder how well the "good" versions worked if you threw awkward edge cases at it.
varispeed 1 days ago [-]
I think they are routing to cheaper models that present themselves as e.g. Opus. I now add checks to my prompts to ensure that I am not dealing with an impostor. If it answers incorrectly, I terminate the session and start again. Anthropic should be audited for this.
karsinkk 2 days ago [-]
" Combined with this only happening in a corner case (stale sessions) and the difficulty of reproducing the issue, it took us over a week to discover and confirm the root cause"
I don't know about others, but sessions that are idle > 1h are definitely not a corner case for me.
I use Claude Code for personal work, and most of the time I'm making it do a task which could take, say, ~10 to 15 minutes. Note that I spend a lot of time going back and forth with the model planning this task before I ask it to execute it.
Once the execution starts, I usually step away for a coffee break (or) switch to Codex to work on some other project - follow similar planning and execution with it.
There are very high chances that it takes me > 1h to come back to Claude.
slashdave 2 days ago [-]
It's likely a corner case for their developers. One of the dangers of working on a product is assuming user behavior is like your own.
o10449366 2 days ago [-]
Yeah and that statement also speaks to their test rigor if they make a change that big without thoroughly testing the edge case they're modifying.
Robdel12 2 days ago [-]
Wow, bad enough for them to actually publish something and not cryptic tweets from employees.
Damage is done for me though. Even just one of these things (messing with adaptive thinking) is enough for me to not trust them anymore. And then their A/B testing this week on pricing.
saghm 2 days ago [-]
The A/B testing is by far the most objectionable thing from them so far in my opinion, if only because of how terrible it would be for something like that to be standard for subscriptions. I'd argue that it's not even A/B testing of pricing but silently giving a subset of users an entirely different product than they signed up for; it would be like if 2% of Netflix customers had full-screen ads pop up and cover the videos randomly throughout a show. Historically the only thing stopping companies from extraordinarily user-hostile decisions has been public outcry, but limiting it to a small subset of users seems like it's intentionally designed to try to limit the PR consequences.
lifthrasiir 2 days ago [-]
The best possible situation I can imagine is that Anthropic just wanted to measure how much value Claude Code has for Pro users and didn't mean to change the plan itself (so those users would get CC as a "bonus"), but even that is questionable to start with.
polishdude20 2 days ago [-]
Bruce here from the Twitter team.
I finally got fired.
xpe 1 days ago [-]
People come at this with all kinds of life experience. The above notion of trust to me is quaint and simplistic. I suggest another way to frame trust as a more open ended question:
To what degree do I predict another person/org will give me what I need and why?
This shifts "trust" away from all or nothing and it gets me thinking about things like "what are the moving parts?" and "what are the incentives" and "what is my plan B?".
In my life experience, looking back, when I've found myself swinging from "high trust" to "low trust" the change was usually rooted in my expectations; it was usually rooted in me having a naive understanding of the world that was rudely shattered.
Will you force trust to be a bit? Or can you admit a probability distribution? Bits (true/false or yes/no or trust/don't trust) thrash wildly. Bayesians update incrementally: this is (a) more pleasant; (b) more correct; (c) more curious; (d) easier to compare notes with others.
mannanj 2 days ago [-]
so who do you trust and go to? (NotClearlySo)OpenAI?
carlgreene 2 days ago [-]
I "subconsciously" moved to codex back in mid Feb from CC and it's been so freaking awesome. I don't think it's as good at UI, but man is it thorough and able to gather the right context to find solutions.
I use "subconsciously" in quotes because I don't remember exactly why I did it, but it aligns with the degradation of their service so it feels like that probably has something to do with it even though I didn't realize it at the time.
GenerWork 2 days ago [-]
Anthropic definitely takes the cake when it comes to UI-related activities (pulling in and properly applying Figma elements, understanding UI-related prompts and properly executing on them, etc.), and I say this as a designer with a personal Codex subscription.
snissn 2 days ago [-]
It's been frustrating how bad it is at UI. I'm starting to test out using their image2 for UI and then handing the images to Codex to build them out into code, and I'm impressed and relieved so far.
cageface 2 days ago [-]
Codex does better if you ask it to take screenshots and critique its own UI work and iterate. It rarely one-shots something I like but it can get there in steps.
cmrdporcupine 2 days ago [-]
Codex isn't great at UI, but you might find Gemini is competent enough as an adjunct. I've had some luck with that.
simlevesque 2 days ago [-]
I went with MiniMax. The token plans are over what I currently need, 4500 messages per 5h, 45000 messages per week for 40$. I can run multiple agents and they don't think for 5-10 minutes like Sonnet did. Also I can finally see the thinking process while Anthropic chose to hide it all from me.
I'm using Zed and Claude Code as my harnesses.
Robdel12 2 days ago [-]
At the moment, yeah. If Google ever figures out how to build an agentic model, I would use them as well.
However you feel about OpenAI, at least their harness is actually open source and they don’t send lawyers after oss projects like opencode
IncreasePosts 2 days ago [-]
Is Gemini cli not an agentic model? Or are you just saying it's built poorly? Gemini 2.5 didn't really work for me but Gemini 3 seems fairly solid
cmrdporcupine 2 days ago [-]
Gemini fares poorly at tool use, even in its own CLI and even in Antigravity. It gets into a mess just editing source files; it's tragic, because it's actually not a bad model otherwise.
rjh29 1 days ago [-]
It frequently fails to apply its diffs at first but it always succeeds eventually for me. I'm happy with it. I understand it is slower than other models but it also costs barely anything per month.
bensyverson 2 days ago [-]
Anecdotally, I know many people who have supplemented Claude with Codex, and are experimenting with models such as GLM 5.1, Kimi, Qwen, etc.
parliament32 2 days ago [-]
Self-hosted models are the one true path.
irthomasthomas 2 days ago [-]
I like chutes because they always use the full weights, and prompts are encrypted with TEE.
data-ottawa 2 days ago [-]
I think most frustrating is the system prompt issue after the postmortem from September[1].
These bugs have all of the same symptoms: undocumented model regressions at the application layer, and engineering cost optimizations that resulted in real performance regressions.
I have some follow up questions to this update:
- Why didn't September's "Quality evaluations in more places" catch the prompt change regression, or the cache-invalidation bug?
- How is Anthropic using these satisfaction questions? My own analysis of my Claude logs showed strong material declines in satisfaction, and I always answer those surveys honestly. Can you share what the data looked like, and whether you were using it to identify some of these issues?
- There was no refund or comped tokens in September. Will there be some sort of comp to affected users?
- How should subscribers of Claude Code trust that Anthropic side engineering changes that hit our usage limits are being suitably addressed? To be clear, I am not trying to attribute malice or guilt here, I am asking how Anthropic can try and boost trust here. When we look at something like the cache-invalidation there's an engineer inside of Anthropic who says "if we do this we save $X a week", and virtually every manager is going to take that vs a soft-change in a sentiment metric.
- Lastly, when Anthropic changes Claude Code's prompt, how much performance against the stated Claude benchmarks are we losing? I actually think this is an important question to ask, because users subscribe to the model's published benchmark performance and are sold a different product through Claude Code (as other harnesses are not allowed).
I see some Anthropic Claude Code people are reading the comments. A day or two ago I watched a video by theo t3.gg on whether Claude got dumber. Even though he was really harsh on Anthropic and said some mean stuff, I thought some of the points he raised about Claude Code were quite apt, especially when it comes to harness bloat. I really hope the new features stop now and there is a real hard push for polish and optimization. Otherwise I think a lot of people will start exploring less bloated, more optimized alternatives. Focus on making the harness better and less token-consuming.
Everything else aside, their brief "experiment" with removing CC support from the Pro plan got me seriously considering other options. I've been wary of vendor lock-in the whole time, but it was a useful reminder. (opencode+openrouter will probably be my first port of call)
wilj 2 days ago [-]
I'm 3 weeks into switching from CC to OpenCode, and in some ways it is far superior to CC right out of the box, and I've maybe burned $200 in tokens to make a private fork that is my ultimate development and personal agent platform. Totally worth it.
Still use CC at work because team standards, but I'd take my OpenCode stack over it any day.
swingboy 2 days ago [-]
I find OpenCode vastly superior. Only thing missing is Vim mode but I saw a fork that someone implemented it. I really like being able to click on a previous message I sent to revert to that point in the conversation. You can revert in CC by pressing Escape twice but the “menu” it takes you to for picking the message is terrible because it only shows your messages. Also, expanding subagent/tools/thinking/etc. blocks is super intuitive in OpenCode whereas CC’s view when you press CTRL+O is also terrible and hard to understand at first glance.
solarkraft 2 days ago [-]
I’m in the process of doing this as well - hackability is such a massive moat.
Care to share what you changed, maybe even the code?
wilj 2 days ago [-]
I've got to do some cleanup before sharing (yay vibe coding) but the big things I've changed so far:
1) Curated a set of models I like and heavily optimized all possible settings, per agent role and even per skill (had to really replumb a lot of stuff to get it as granular as I liked)
2) Ported from sqlite to postgresql, with heavily extended schema. I generate embeddings for everything, so every aspect of my stack is a knowledge graph that can be vector searched. Integrated with a memory MCP server and auditing tools so I can trace anything that happens in the stack/cluster back to an agent action and even thinking that was related to the action. It really helps refine stuff.
3) Tight integration of Gitea server, k3s with RBAC (agents get their own permissions in the cluster), every user workspace is a pod running opencode web UI behind Gitea oauth2.
4) Codified structure of `/projects/<monorepo>/<subrepos>` with a simpler browser so non-technical family members can manage their work more easily (agents handle all the management and there are sidecars handling all gitops, transparent to the user)
5) Transparent failover across providers with cooldown by making model definitions linked lists in the config, so I can use a handful of subscriptions that offer my favorite models, and fail over from one to the next as I hit quota/rate limits. This has really cut my bill down lately, along with skipping OpenRouter for my favorite models and going direct to Alibaba and Xiaomi so I can tailor caching and stuff exactly how I want.
6) Integrated filebrowser, a fork of the Milkdown Crepe markdown editor, and the CodeMirror editor so I don't even need an IDE anymore. I just work entirely from the OpenCode web UI on whatever device is nearest at the moment. I added support for using Gemma 4 locally on CPU from my phone while waiting in line at a store yesterday.
Those are the big ones off the top of my head. Im sure there's more. I've probably made a few hundred other changes, it just evolves as I go.
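The failover idea in (5) can be sketched roughly like this. All names here are hypothetical (the real setup lives in config, wiring actual providers, with cooldown timers rather than bare exceptions):

```python
class RateLimited(Exception):
    """Raised by a provider call when a quota or rate limit is hit."""

class ModelRoute:
    """A model definition that links to a fallback, forming a chain."""
    def __init__(self, name, call, fallback=None):
        self.name = name          # provider/model label
        self.call = call          # callable(prompt) -> str, may raise RateLimited
        self.fallback = fallback  # next ModelRoute to try, or None

    def complete(self, prompt):
        route = self
        while route is not None:
            try:
                return route.call(prompt)
            except RateLimited:
                route = route.fallback  # walk down the chain on quota errors
        raise RuntimeError("all providers exhausted")

# Toy chain: the primary is always rate limited, the backup answers.
def primary(prompt):
    raise RateLimited()

backup_route = ModelRoute("backup", lambda p: f"backup:{p}")
chain = ModelRoute("primary", primary, fallback=backup_route)
result = chain.complete("hi")  # served by the backup route
```

A production version would also record when each provider was rate limited, so the chain retries the preferred provider after its cooldown instead of sticking to the fallback forever.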
2001zhaozhao 2 days ago [-]
The solution IMO is to switch to an agent harness wrapper solution that uses CLI-wrapping or ACP to connect to different coding agents. This is the only way that works across OpenAI, Claude and Gemini.
There are a few out there (latest example is Zed's new multi-agent UI), but they still rely on the underlying agent's skill and plugin system. I'm experimenting with my own approach that integrates a plugin system that can dynamically change the agent skillset & prompts supplied via an integrated MCP server, allowing you to define skills and workflows that work regardless of the underlying agent harness.
lanthissa 2 days ago [-]
never ever forget theo's gpt 5 hype video and then him having to walk it back.
its very clear that theres money or influence exchanging hands behind the scenes with certain content creators, the information, and openai.
whalesalad 2 days ago [-]
literally just `git reset --hard <random hash from 3 months ago>` would fix this
willis936 2 days ago [-]
That implies it's broken. Juicing revenue and slashing opex at the expense of brand and customer retention is the feature.
MrOrelliOReilly 2 days ago [-]
IMO this is the consequence of a relentless focus on feature development over core product refinement. I often have the impression that Anthropic would benefit from a few senior product people. Someone needs to lend them a copy of “Escaping the Build Trap.” Just because we _can_ rapidly add features now doesn’t mean we should.
PS I’m not referencing a well-known book to suggest the solution is trite product groupthink, but good product thinking is a talent separate from good engineering, and Anthropic seems short on the latter recently
anonyfox 1 days ago [-]
Essentially they should hire a few of the old-school product guys from Apple. Beat me to it, but the obsession with UX and quality from earlier Apple is exactly what they urgently need, instead of tech folks trying to engineer themselves into complicated rabbit holes and shenanigans.
cmrdporcupine 2 days ago [-]
I think they've dug themselves into a complexity trap. Beyond the stochastic nature of the models themselves, I don't think they're able to reason about their software anymore. Too many levers, too many dials, and code that likely nobody understands.
But worse, based on the pronouncements of Dario et al., I suspect management is entirely unsympathetic because they believe we (SWEs) are on the chopping block to be replaced. And any intimation that these tools need guard rails for quality concerns is, I suspect, being ignored or discouraged.
In the end, I feel like Claude Code itself started as a bit of a science experiment and it doesn't smell to me like it's adopted mature best practices coming out of that.
qweiopqweiop 1 days ago [-]
I agree. My real fear if this is how the company works, how are systems with real implications (e.g. defense) being treated.
slashdave 2 days ago [-]
They need to keep up with demand, because compute resources are clearly limited. That means they have no choice but to add these features, or things break, or they have to stop taking new customers. All of those options are unacceptable.
cmrdporcupine 2 days ago [-]
They're losing customers because of quality concerns. Pausing development and focusing 100% on quality is how you fix that.
That said, that may not have been obvious at all in the Jan/Feb time frame when they got a wave of customers due to ethical concerns.
slashdave 2 days ago [-]
No. Pausing development does not make compute (you know, physical machines?) appear out of thin air.
nozzlegear 2 days ago [-]
On the other hand, sacrificing your paying customers at the altar of compute and tokens does not make money appear out of thin air.
joshribakoff 2 days ago [-]
They had like 100 devs making 600k at one point. The issue is certainly not lack of talent. More like, they insist on forcing the vibe coding narrative. Some candidates are refusing interview requests accordingly.
MrOrelliOReilly 2 days ago [-]
Ugh wrote “latter” and meant “former.” I didn’t mean lack of eng talent, but product
cedws 2 days ago [-]
>On April 16, we added a system prompt instruction to reduce verbosity
In practice I understand this would be difficult but I feel like the system prompt should be versioned alongside the model. Changing the system prompt out from underneath users when you've published benchmarks using an older system prompt feels deceptive.
At least tell users when the system prompt has changed.
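One low-effort way to make prompt changes visible, as a sketch: derive a version tag from a hash of the system prompt, so the same model name plus a different prompt can never masquerade as the same product. The naming scheme here is made up, not anything Anthropic ships:

```python
import hashlib

def prompt_version(model: str, system_prompt: str) -> str:
    """Stable tag: the same model + prompt pair always yields the same string."""
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:8]
    return f"{model}+prompt.{digest}"

v1 = prompt_version("opus-4.7", "Be concise.")
v2 = prompt_version("opus-4.7", "Be concise. Reduce verbosity.")
# v1 != v2: the added verbosity instruction changes the published tag,
# so benchmarks could be pinned to an exact (model, prompt) version.
```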
elAhmo 2 days ago [-]
It's also kinda funny they have to rely on the system prompt to control verbosity itself.
esafak 2 days ago [-]
It's cheaper than retraining the model.
verve_rat 2 days ago [-]
So? 4.7.1, 4.7.2, etc. makes sense for versioning system prompts.
puppystench 2 days ago [-]
The Claude UI still only has "adaptive" reasoning for Opus 4.7, making it functionally useless for scientific/coding work compared to older models (as Opus 4.7 will randomly stop reasoning after a few turns, even when prompted otherwise). There's no way this is just a bug and not a choice to save tokens.
mattew 2 days ago [-]
It was odd that there was no mention of the forced adaptive reasoning in the article. My guess is they don't have enough compute to do anything else here.
rzk 1 days ago [-]
They are forcing users onto adaptive thinking now, deprecating thinking.type: "enabled" and budget_tokens. But the web interface (claude.ai) does not support specifying the effort parameter.
kamranjon 2 days ago [-]
This black-box approach that large frontier labs have adopted is going to drive people away. Changing fundamental behavior like this without notifying users, and only retroactively explaining what happened, is the reason people will move to self-hosting their own models. You can't build pipelines, workflows, and products on a base that is randomly shifting beneath you.
panavm 2 days ago [-]
[dead]
lherron 2 days ago [-]
Are they also going to refund all the extra usage api $$$ people spent in the last month?
Also I don’t know how “improving our Code Review tool” is going to improve things going forward, two of the major issues were intentional choices. No code review is going to tell them to stop making poor and compromising decisions.
zem 2 days ago [-]
this is one reason i will not pay for extra usage - it is an incentive for them to be inefficient, or at least to not spend any effort on improving my token usage efficiency.
dallen33 2 days ago [-]
No, they will not.
FireBeyond 2 days ago [-]
Even for us plan users, who got barely any use from our plans because we'd blow through our 5h and 1w usage limits, that's also unlikely; after all, they have an out in "your usage limits are guaranteed to be 5x those of Pro users" (who are also being screwed).
Of course, all their vibe coding is being done with effectively infinite tokens, so...
system2 1 days ago [-]
I stopped using it for nearly a month because of the performance degradation. I paid for the whole month. Wasted money.
nickdothutton 2 days ago [-]
I presume they don't yet have a cohesive monetization strategy, and this is why there is such huge variability in results on a weekly basis. It appears that Anthropic are skipping from one "experiment" to another. As users we only get to see the visible part (the results). Can't design a UI that indicates the software is thinking vs frozen? Does anyone actually believe that?
slashdave 2 days ago [-]
Compute is limited worldwide. No amount of money can make these compute platforms appear overnight. They are buying time because the only other option is to stop accepting customers.
joefourier 2 days ago [-]
They would honestly have been better off refusing customers if compute is so limited. Degrading the quality leads to customers leaving in the short term, and ruins their long term reputation.
But in either case, if compute is so limited, they’ll have to compete with local coding agents. Qwen3.6-27B is good enough to beat having to wait until 5PM for your Claude Code limit to reset.
slashdave 1 days ago [-]
The recent Deepseek release probably has them more worried. But locally running these large models requires a lot of infra expertise. Market impact will be minimal. Not to mention the companies that can pull this off have enough cash to just pay Anthropic to begin with.
hansmayer 1 days ago [-]
A suggestion to Anthropic, just start charging the real price for your software. Of course you have to dumb it down, when the $200 tier in reality produces 5-10 thousand dollars in monthly costs when used by people who know how to max it out.
So then you come up with creative nonsense like "adaptive thinking" when your tool is sometimes working and sometimes outright not (the irony of "intelligent tools" not "thinking" aside). Of course, this would kind of ruin your current value proposition, since charging the actual price would make your core idea of rendering large swaths of the skilled population un-employed unfeasible. But I am sure if you feed it into Claude, it will find some points for and against, just like how Karpathy uses his LLM of choice to excrete his blog posts.
I am afraid you may be trying to prove a wrong point here :)
noname120 1 days ago [-]
True I didn’t read carefully enough the last part of your comment
whalesalad 1 days ago [-]
> when the $200 tier in reality produces 5-10 thousand dollars in monthly costs
are you asserting that the actual dollar cost to anthropic for a heavy user was 5-10k? or are you basing this on the (fabricated) value of those tokens, ie potentially lost revenue from a pay-per-token user.
hansmayer 1 days ago [-]
I am basing it on self-reports of advanced users on popular platforms such as Reddit, the reporting of Ed Zitron, AND the official affidavit of the Anthropic CFO in relation to their court-filed complaint against the DoD (for being excluded), stating that the total revenue of Anthropic since its founding TO DATE has been a meager 5 billion dollars. So wasting hundreds of billions of dollars per year to make a cumulative revenue of $5B to date does not quite sound like a financially sound business to me.
dataviz1000 2 days ago [-]
This is the problem with co-opting the word "harness". What agents need is a test harness but that doesn't mean much in the AI world.
Agents are not deterministic; they are probabilistic. Run the same agent repeatedly on the same task and it will accomplish it a consistent percentage of the time. I wish I were better at math or English so I could explain this.
I think they call it EVAL but developers don't discuss that too much. All they discuss is how frustrated they are.
A prompt can solve a problem 80% of the time. Change a sentence and it will solve the same problem 90% of time. Remove a sentence it will solve the problem 70% of the time.
It is so friggen' easy to set up -- stealing the word back from the AI sphere -- a TEST HARNESS.
Regressions caused by changes to the agent, where words are added, changed, or removed, are extremely easy to quantify. It isn’t pass/fail. It’s whether the agent still solves the problem at the same percentage of the time it consistently has.
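A minimal version of such a harness really is tiny. A sketch, where the flaky agent is a deterministic stand-in for a real agent run (a real harness would run tasks concurrently, log full transcripts, and use many distinct tasks):

```python
import itertools

def solve_rate(agent, task, trials=50):
    """Run the same task repeatedly; return the fraction of successes."""
    successes = sum(1 for _ in range(trials) if agent(task))
    return successes / trials

# Toy "agent" that succeeds on 4 out of every 5 calls, standing in
# for an agent with an ~80% solve rate on this task.
_calls = itertools.count()
def flaky_agent(task):
    return next(_calls) % 5 != 0

rate = solve_rate(flaky_agent, "some fixed task")  # 0.8 for this toy agent
baseline, margin = 0.80, 0.05
regressed = rate < baseline - margin  # gate prompt/harness changes on this
```

The gate is exactly the point made above: a change passes not when the agent "works", but when the measured solve rate hasn't dropped below the established percentage.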
arjie 2 days ago [-]
The word is not co-opted. A harness is just supportive scaffolding to run something. A test harness is scaffolding to run tests against software, a fuzz harness is scaffolding to run a fuzzer against the software, and so on. I've seen it being used in this manner many times over the past 15 years. It's the device that wraps your software so you can run it repeatedly with modifications of parameters, source code, or test condition.
dataviz1000 2 days ago [-]
> A harness is just supportive scaffolding to run something.
Thank you for the perfect explanation.
Last week, in my confusion about the word (Anthropic was using test, eval, and harness in the same sentence, so I thought Anthropic had made a test harness), I asked Google "in computer science what is a harness". It responded only discussing test harnesses, which solidified my thinking that that's what it was.
I wish Google had responded as clearly you did. In my defense, we don't know if we understand something unless we discuss it.
thesz 2 days ago [-]
To have some confidence in the consistency of results (p-value), one has to start with a cohort of around 30, if I remember correctly. That is a ~1.5 order-of-magnitude increase in the computing power needed to find (or rule out) consistent changes in an agent's behavior.
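For a pass/fail eval the required cohort size can be estimated directly. A sketch using the standard one-sample normal approximation (z values correspond to one-sided alpha of roughly 0.05 and power of roughly 0.80; treat the outputs as ballpark figures, not exact statistics):

```python
import math

def trials_needed(p_baseline, p_degraded, z_alpha=1.645, z_beta=0.84):
    """Approximate trials needed to detect a drop in solve rate from
    p_baseline to p_degraded (one-sided test, normal approximation)."""
    sd0 = math.sqrt(p_baseline * (1 - p_baseline))
    sd1 = math.sqrt(p_degraded * (1 - p_degraded))
    n = ((z_alpha * sd0 + z_beta * sd1) / (p_baseline - p_degraded)) ** 2
    return math.ceil(n)

# Detecting an 80% -> 70% regression already takes on the order of
# 100+ full agent runs, which is where the compute cost comes from.
n = trials_needed(0.80, 0.70)
```

Smaller regressions blow the cost up quadratically: halving the detectable gap roughly quadruples the number of runs.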
dataviz1000 2 days ago [-]
I apologize for the potato quality of these links; I have been working tirelessly to wrap my head around how agents and LLM models work. They are more than just a black box.
The first tries to answer what happens when I give the models harder and harder arithmetic problems, to the point that Sonnet will burn 200k tokens for 20 minutes. [0]
The other is a very deep dive into the math of a reasoning model, in the only way I could think to approach it, with data visualizations: seeing the computation of the model in real time in relation to all its parts. [1]
Two things I've learned. One is that an agent that reverse engineers any website and an agent that does arithmetic behave the same way: for a given agent and task, the probability that either will solve its intended task is a distribution. The other is that models have a blind spot, therefore creating a red team adversary bug hunter agent will not surface a bug if the same model originally wrote the code.
Understanding that, and knowing that I can verify at the end or use majority of votes (MoV), using agents to automate extremely complicated tasks can be very reliable, with a quantifiable amount of certainty.
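By majority of votes I mean something like this minimal sketch, with made-up answers standing in for independent agent runs:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common answer across several independent agent runs,
    along with the fraction of runs that agreed with it."""
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)

# Five independent runs of an agent that is right most of the time:
answer, agreement = majority_vote(["42", "42", "41", "42", "42"])
print(answer, agreement)  # → 42 0.8
```

If each run is right more often than not and the runs are independent, the majority answer is right more often than any single run.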
> The other, is that models have a blind spot, therefore creating a red team adversary bug hunter agent will not surface a bug if the same model originally wrote the code.
This is very interesting, if true. It follows that one can generate several instances of the code, choose the one with the bug, and the bug will not be found. Mythos can be used to fool Mythos.
vintagedave 2 days ago [-]
> Today we are resetting usage limits for all subscribers.
I asked for this via support, got a horrible corporate reply thread, and eventually downgraded my account. I'm using Codex now as we speak; I could not use Claude any more, I couldn't get anything done.
Will they restore my account's usage limits, since I no longer have Max?
Is just that one week of usage restored, or the entire buggy timespan?
sowbug 2 days ago [-]
[dead]
leobuskin 2 days ago [-]
This usage reset you did on April 23 will not mitigate the struggle we've experienced. I didn't even notice it yesterday; I checked this morning and it had come down from 25% weekly to 7%. What is this? Unlike many others, I didn't have problems for two months (maybe my CC habits helped), but the last two weeks were very painful. Make a proper apology, guys. For many users this "reset" could land on the first days of their week; tell me you thought about that.
skeledrew 2 days ago [-]
Some of these changes and effects seriously affect my flow. I'm a very interactive Claude user, preferring to provide detailed guidance for my more serious projects instead of just letting them run. And I have multiple projects active at once, with some being untouched for days at a time. Along with the session limits this feels like compounding penalties as I'm hit when I have to wait for session reset (worse in the middle of a long task), when I take time to properly review output and provide detailed feedback, when I'm switching among currently active projects, when I go back to a project after a couple days or so,... This is honestly starting to feel untenable.
Last I tried 4.7, it was bad. Like ChatGPT bad: it changed stuff it wasn't supposed to, hallucinated code, forgot information, missed simple things, didn't catch mistakes. And it burned through tokens like crazy.
I'll stay on 4.6 for a while; it seems to be better. What's frustrating, though, is that you cannot rely on these tools. They are constantly tinkering with and changing things, and there's no option to opt out.
Aperocky 2 days ago [-]
It seems like there is no concept of deployment, or even A/B testing: what works on, presumably, a Claude employee's laptop for the hour they spent testing it ships immediately to everyone.
I mean, yes, even testing in production with some of your customers is better than... testing with ALL of your customers.
rcarmo 1 days ago [-]
Actually, I think their deeper problems are twofold:
- Claude Code is _vastly_ more wasteful of tokens than anything else I've used. The harness is just plain bad. I use pi.dev and created https://github.com/rcarmo/piclaw, and the gaps are huge -- even the models through Copilot are incredibly context-greedy when compared to GPT/Codex
- 4.7 can be stupidly bad. I went back to 4.6 (which has always been risky to use for anything reliable, but does decent specs and creative code exploration) and Codex/GPT for almost everything.
So there is really no reason these days to pay either their subscription or their insanely high per-token price _and_ get bloat across the board.
lukebechtel 2 days ago [-]
Some people seem to be suggesting these are coverups for quantization...
Those who work on agent harnesses for a living realize how sensitive models can be to even minor changes in the prompt.
I would not suspect quantization before I would suspect harness changes.
foota 2 days ago [-]
> On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality, and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7.
Claude caveman in the system prompt confirmed?
awesome_dude 2 days ago [-]
I've recently been introduced to that plugin, love it for humour
MillionOClock 2 days ago [-]
I see the Claude team wanted to make it less verbose, but that's actually something that has bothered me since updating to Claude 4.7. What's the most recommended way to change it back to being as verbose as before? This is probably a matter of preference, but I have a harder time with compact explanations and lists of points, and verbosity was originally one of the things I preferred about Claude.
jpcompartir 2 days ago [-]
Anthropic releases used to feel thorough and well done, with the models feeling immaculately polished. It felt like using a premium product, and it never felt like they were racing to keep up with the news cycle, or reply to competitors.
Recently that immaculately polished feel is harder to find. It coincides with the daily releases of CC, Desktop App, unknown/undocumented changes to the various harnesses used in CC/Cowork. I find it an unwelcome shift.
I still think they're the best option on the market, but the delta isn't as high as it was. Sometimes slowing down is the way to move faster.
bcherny 2 days ago [-]
Boris from the Claude Code team here. We agree, and will be spending the next few weeks increasing our investment in polish, quality, and reliability. Please keep the feedback coming.
batshit_beaver 2 days ago [-]
> investment in polish, quality, and reliability
For there to be any trust in the above, the tool needs to behave predictably day to day. It shouldn't be possible to open your laptop and find that Claude suddenly has an IQ 50 points lower than yesterday. I'm not sure how you can achieve predictability while keeping inference costs in check and messing with quantization, prompts, etc on the backend.
Maybe a better approach might be to version both the models and the system prompts, but frequently adjust the pricing of a given combination based on token efficiency, to encourage users to switch to cheaper modes on their own. Let users choose how much they pay for given quality of output though.
pkos98 2 days ago [-]
Sure, I've cancelled my Max 20 subscription because you guys prioritize cutting your costs/increasing token efficiency over model performance.
I use expensive frontier labs to get the absolute best performance, else I'd use an Open Source/Chinese one.
Frontier LLMs still suck a lot, you can't afford planned degradation yet.
wilj 2 days ago [-]
My biggest problem with CC as a harness is that I can't trust "Plan" mode. Long running sessions frequently start bypassing plan mode and executing, updating files and stuff, without permission, while still in plan mode. And the only recovery seems to be to quit and reload CC.
Right now my solution is to run CC in tmux and keep a 2nd CC pane with /loop watching the first pane and killing CC if it detects plan mode being bypassed. Burning tokens to work around a bug.
tkgally 2 days ago [-]
Here's one person's feedback. After the release of 4.7, Claude became unusable for me in two ways: frequent API timeouts when using exactly the same prompts in Claude Code that I had run problem-free many times previously, and absurdly slow interface response in Claude Cowork. I found a solution to the first after a few days (add "CLAUDE_STREAM_IDLE_TIMEOUT_MS": "600000" to settings.json), but as of a few hours ago Cowork--which I had thought was fantastic, by the way--was still unusable despite various attempts to fix it with cache clearing and other hacks I found on the web.
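For anyone hitting the same timeouts, this is roughly what the fix looked like in my settings.json (assuming Claude Code reads environment overrides from an "env" block, which is how it worked for me; adjust to your own setup and file location):

```json
{
  "env": {
    "CLAUDE_STREAM_IDLE_TIMEOUT_MS": "600000"
  }
}
```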
a-dub 2 days ago [-]
hm. ml people love static evals and such, but have you considered the approaches that typically appear in saas? (slow rollouts, org/user-constrained testing pools with staged rollouts, real-world feedback from actual usage data where the privacy policy permits?)
if only there were a place with 9,881 pieces of feedback waiting to be triaged...
and maybe not by a duplicate-bot that goes wild and just auto-closes everything;
just blessing some of the stuff there with a "you've been seen" label would go a long way...
oefrha 2 days ago [-]
Common pattern of checking the claude code issue tracker for a bug: land on issue #12587, auto closed as duplicate of #12043; check #12043, auto closed as duplicated of #11657; check #11657, auto closed as duplicate of #10645; check #10645, never got a response, or closed as not planned, or some other bullshit.
rimliu 1 days ago [-]
I am considering proving my feedback by no longer providing my money.
szmarczak 2 days ago [-]
Why ban third party wrappers? All of this could've been sidestepped had you not banned them.
ElFitz 2 days ago [-]
Because then they lose vertical integration and the extra ability it grants to tune settings to reduce costs / token use / response time for subscription users.
Or improve performance and efficiency, if we’re generous and give them the benefit of the doubt.
It makes sense, in a way. The subscription deal is something along the lines of a fixed, predictable price in exchange for Anthropic controlling usage patterns, scheduling, throttling (quota consumption), defaults, and effective workload shape (system prompt, caching) in whatever way best optimises the system for them (or for us, if we're again feeling generous) and makes the deal sustainable for them.
It’s a trade-off
cmrdporcupine 2 days ago [-]
They gained that ability to tune settings and then promptly used it in a poor way and degraded customer experience.
ElFitz 1 days ago [-]
That’s what we see.
It may be (but I wouldn’t know) that some of other changes not covered here reduced costs on their side without impacting users, improving the viability of their subscription model. Or maybe even improved things for users.
I’d really appreciate more transparency on this, and not just when things fail.
But I’ve learned my lesson. I’ve been weaning off Claude for a few weeks, cancelled my subscription three weeks ago, let it expire yesterday, and moved to both another provider and a third-party open source harness.
szmarczak 2 days ago [-]
Nothing you wrote makes sense. The limits exist so Anthropic isn't operating at a loss. If they can customize Claude through Code, I see no reason why they couldn't do so with other wrappers. Other wrappers can also make use of the cache.
If you worry about a "degraded" experience, then let people choose. People won't keep using other wrappers if they turn out to be bad. People ain't stupid.
ElFitz 2 days ago [-]
By imposing the use of their harness, they control the system prompt:
> On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality, and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7
They can pick the default reasoning effort:
> On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode
They can decide what to keep and what to throw out (beyond simple token caching):
> On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6
It literally is all in the post.
I don't worry about anything though. It's not my product. I don't work for Anthropic, so I really couldn't care less about anyone else's degraded (or not) experience.
szmarczak 2 days ago [-]
> they control the system prompt
They control the default system prompt. You can change it if you want to.
> They can pick the default reasoning effort
Don't see how it's an obstacle in allowing third party wrappers.
> They can decide what to keep and what to throw out
That's actually a good point. However I still don't think it's an obstacle. If third party wrappers were bad, people simply wouldn't be using them.
ElFitz 1 days ago [-]
Evidently, all these things you just dismissed matter; otherwise the changes I quoted from the original post wouldn't have affected anyone, or half as many people, or half as much. Anthropic wouldn't have had any complaints to investigate, the article prompting this entire thread wouldn't exist, and we wouldn't be having this very conversation.
Defaults matter. A large share of people never change them (status quo bias, psychological inertia). Having control over them (and usage quotas) means Anthropic can control and fine-tune what this fixed subscription costs them.
And evidently (re, the original article), they tried to do so.
szmarczak 1 days ago [-]
> Defaults matter. A large share of people never change them (status quo bias, psychological inertia). Having control over them (and usage quotas) means Anthropic can control and fine-tune what this fixed subscription costs them.
Allowing third party wrappers doesn't mean Claude Code would cease to exist. The opposite actually, Claude Code would be the default.
People dissatisfied with Code would simply use other wrappers. I call it a win-win. I don't see how Anthropic would lose here; they would still retain the ability to control the defaults.
ElFitz 1 days ago [-]
Except one of the major other wrappers was pi, through OpenClaw. With countless hundreds of thousands of instances running every hour on that heartbeat.
I have no idea what the share of OpenClaw instances running on pi was, or third-party wrappers in general, but it was obviously large enough that Anthropic decided they had to put an end to it.
Conversely, from the latest developments, it would seem they are perfectly fine with people running OpenClaw with Claude models through Claude Code’s programmatic interface using subscriptions.
But in the end, this, my take, your take, is all conjecture. We are both on the outside looking in.
Only the people who work at Anthropic know.
troupo 2 days ago [-]
And you didn't invest anything in polish, quality, and reliability before... why? Because for any question people have, you reply something like "I have Claude working on this right now" and have no idea what's happening in the code?
A reminder: your vibe-coded slop required a peak of 68 GB of RAM, and you had to hire actual engineers to fix it.
cmrdporcupine 2 days ago [-]
I think you're being a bit harsh.
... But then again, many of us are paying out of pocket $100, $200USD a month.
Far more than any other development tools.
Services that cost that much money generally come with expectations.
A month prior, their vibe-coders were unironically telling the world that their TUI wrapper for their own API is a "tiny game engine", while they were (and still are) struggling to output a couple hundred characters on screen: https://x.com/trq212/status/2014051501786931427
Yeah you don't have to convince me. I switched to Codex mid-January in part because of the dubious quality of the tui itself and the unreliability of the model. Briefly switched back through March, and yep, still a mistake.
Once OpenAI added the $100 plan, it was kind of a no-brainer.
ankaz 2 days ago [-]
[dead]
jpcompartir 2 days ago [-]
[flagged]
swader999 1 days ago [-]
I've noticed the same thing in my own AI assisted work. Feels like I'm moving too fast and it's easy to implement decisions quickly but they really have to be the right f--ing decisions. In the past dev was so slow so you had a lot of time to vet the hard decisions and now you don't.
KronisLV 2 days ago [-]
> It felt like using a premium product, and it never felt like they were racing to keep up with the news cycle, or reply to competitors.
I don't know, their desktop app felt really laggy and even switching Code sessions took a few seconds of nothing happening. Since the latest redesign, however, it's way better, snappy and just more usable in most respects.
I just think that we notice the negative, disruptive things more. Even with the desktop app, the remaining flaws jump out: for example, the Chat / Cowork / Code modes only show the label for the currently selected mode, while the others are (not very big) icons. A colleague literally didn't notice that those modes are in the desktop app, or at least that that's where you switch between them.
spaniard89277 2 days ago [-]
Given the price I don't really think they're the best option. They're sloppy and competitors are catching up. I'm having same results with other models, and very close with Kimi, which is waaay cheaper.
kilroy123 2 days ago [-]
I agree. It all feels so AI-slopy now.
OtomotO 2 days ago [-]
I guess it's a bit of desperation to find a sustainable business model.
The AI hype is dying, at least outside the silicon valley bubble which hackernews is very much a part of.
That and all the dogfooding by slop coding their user facing application(s).
ctoth 2 days ago [-]
> As of April 23, we’re resetting usage limits for all subscribers.
Wait, didn't they just reset everybody's usage last Thursday, thereby syncing everybody's windows up? (Mine should have reset at 13:00 MDT.) So is this just the normal weekly reset? Except now my reset says it will come Saturday. This is super confusing!
walthamstow 2 days ago [-]
The weekly reset point is different per account. I think something to do with first sign-up date. Mine is on a Tuesday.
schpet 2 days ago [-]
mine was originally on sunday, then got moved to thursday (which i disliked), and it is still on thursday. so them resetting my weekly limit on the same day it was scheduled to reset feels like a joke.
throwaway2027 2 days ago [-]
You need to send a new message once your limit is up to make the timer start rolling again. It sucks; I hate it when I have no need for Claude during the day and forget to use it, and then my reset date shifts a day later.
schpet 2 days ago [-]
oh! super helpful info. i was aware of that with the hourly ones, but never put it together with weekly. thank you.
behat 2 days ago [-]
This is a very interesting read on failure modes of AI agents in prod.
Curious about this section on the system prompt change:
>> After multiple weeks of internal testing and no regressions in the set of evaluations we ran, we felt confident about the change and shipped it alongside Opus 4.7 on April 16. As part of this investigation, we ran more ablations (removing lines from the system prompt to understand the impact of each line) using a broader set of evaluations. One of these evaluations showed a 3% drop for both Opus 4.6 and 4.7. We immediately reverted the prompt as part of the April 20 release.
Curious what helped catch it in the later evals vs. the initial ones. Was the initial testing an online A/B comparison of aggregate metrics, or was the dataset just not broad enough?
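The ablation mechanics they describe are easy to sketch, for the curious (a toy evaluator and made-up prompt lines, not Anthropic's actual setup):

```python
def ablate(system_prompt_lines, evaluate):
    """Score the full prompt, then re-score with each line removed.

    A positive delta means removing the line *improves* the eval score,
    i.e. the line was hurting quality.
    """
    baseline = evaluate(system_prompt_lines)
    deltas = {}
    for i, line in enumerate(system_prompt_lines):
        reduced = system_prompt_lines[:i] + system_prompt_lines[i + 1:]
        deltas[line] = evaluate(reduced) - baseline
    return deltas

# Toy evaluator: pretend the verbosity line costs 3 points, like the
# 3% drop described in the postmortem.
score = lambda lines: 90 + (3 if "be concise" not in lines else 0)
print(ablate(["you are a coding agent", "be concise"], score))
# → {'you are a coding agent': 0, 'be concise': 3}
```

The catch is that the deltas only show up if `evaluate` is broad enough, which seems to be exactly what bit them the first time around.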
bashtoni 2 days ago [-]
The Claude Code experience is still pretty bad after upgrading. I often see
Error: claude-opus-4-7[1m] is temporarily unavailable, so auto mode cannot determine the safety of Bash right now. Wait briefly and then try this action again. If it keeps failing, continue with other tasks that don't require this action and come back to it later. Note: reading files, searching code, and other read-only operations do not require the classifier and can still be used.
The only solution is to switch out of auto mode, which now seems to be the default every time I exit plan mode. Very annoying.
hintymad 2 days ago [-]
> On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode.
This sounds fishy. It's easy to show users that Claude is making progress by either printing the reasoning tokens or printing some kind of progress report. Besides, "very long" is such a weasel phrase.
reliablereason 2 days ago [-]
Right, a very simple UI thing they should have, one that would have prevented so much misunderstanding, is a simple counter: how much usage have I used, and how much is left?
If a message will trigger a cache recreation, the cost of that should be viewable.
rfc_1149 2 days ago [-]
The third bug is the one worth dwelling on. Dropping thinking blocks every turn instead of just once is the kind of regression that only shows up in production traffic. A unit test for "idle-threshold clearing" would assert "was thinking cleared after an hour of idle" (yes) without asserting "is thinking preserved on subsequent turns" (no). The invariant is negative space.
The real lesson is that an internal message-queuing experiment masked the symptoms in their own dogfooding. Dogfooding only works when the eaten food is the shipped food.
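To make the negative-space point concrete, here's a toy model of the idle-clearing logic with the assertion the original tests presumably lacked (names and structure are invented, not Anthropic's actual code):

```python
import time

IDLE_THRESHOLD_S = 3600  # one hour, per the postmortem

class Session:
    """Toy model of the idle-clearing behavior described in the postmortem."""
    def __init__(self):
        self.thinking = []
        self.last_active = time.monotonic()
        self._cleared_once = False

    def on_turn(self, now):
        idle = now - self.last_active
        # Clear old thinking ONCE when resuming after a long idle --
        # the reported bug was clearing on every subsequent turn.
        if idle > IDLE_THRESHOLD_S and not self._cleared_once:
            self.thinking.clear()
            self._cleared_once = True
        self.last_active = now
        self.thinking.append(f"thought@{now}")

# The negative-space assertion: after an idle-triggered clear,
# later turns must keep accumulating thinking.
s = Session()
t0 = time.monotonic()
s.on_turn(t0)
s.on_turn(t0 + 2 * IDLE_THRESHOLD_S)      # resume after idle: clears once
s.on_turn(t0 + 2 * IDLE_THRESHOLD_S + 1)  # next turn: must NOT clear again
assert len(s.thinking) == 2, "thinking must be preserved on subsequent turns"
```

The "happy path" test (was thinking cleared after an hour of idle?) passes with or without the bug; only the second assertion distinguishes them.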
afro88 2 days ago [-]
Experienced engineers who know the codebase and system well, with enough time to consider the problem properly, would likely consider this case.
But if we're vibing... this is the kind of bug that should make it back into a review agent's/skill's instructions in a more generic form: essentially, if something is done to the message history, check that there are tests that subsequent turns work as expected.
But yeah, you'd have to piss off a bunch of users in prod first to discover the blind spot.
jryio 2 days ago [-]
1. They changed the default in March from high to medium, however Claude Code still showed high (took 1 month 3 days to notice and remediate)
2. Old sessions had the thinking tokens stripped, resuming the session made Claude stupid (took 15 days to notice and remediate)
3. System prompt to make Claude less verbose reducing coding quality (4 days - better)
All this to say... the experience of suspecting a model is getting worse while Anthropic publicly gaslights its user base ("we never degrade model performance") is frustrating.
Yes, models are complex and deploying them at scale given their usage uptick is hard. It's clear they are playing with too many independent variables simultaneously.
However you are obligated to communicate honestly to your users to match expectations. Am I being A/B tested? When was the date of the last system prompt change? I don't need to know what changed, just that it did, etc.
Doing this proactively would certainly match expectations for a fast-moving product like this.
fn-mote 2 days ago [-]
> 2. Old sessions had the thinking tokens stripped, resuming the session made Claude stupid (took 15 days to notice and remediate)
This one was egregious: after a one hour user pause, apparently they cleared the cache and then continued to apply “forgetting” for the rest of the session after the resume!
Seems like a very basic software engineering error that would be caught by normal unit testing.
Eridrus 2 days ago [-]
To be fair to Anthropic, they did not intentionally degrade performance.
To take the opposite side, this is the quality of software you get atm when your org is all in on vibe coding everything.
shrx 2 days ago [-]
Are you saying dropping cache after 1 hour is not intentionally degrading performance?
Eridrus 2 days ago [-]
Yes. Caching is a cost optimization not a response quality metric.
shrx 1 days ago [-]
But it still degrades performance.
Eridrus 1 days ago [-]
It's unfortunate that the word "performance" is overloaded, and ML folks have a specific definition that isn't what the rest of CS uses; but I understand Anthropic to mean response quality when they say this, not any other dimension you could measure performance on.
You can argue they're lying, but I think this is just folks misunderstanding what Anthropic is saying.
fydorm 12 hours ago [-]
They didn't just drop the cache. They elided thinking blocks even if you re-cache. That permanently degraded the model's output for the rest of the session, even ignoring the bug, if you waited 60 minutes instead of 59.
sroussey 2 days ago [-]
None of these problems equate to degrading model performance. Completely different team. Degraded CC harness, sure.
qingcharles 2 days ago [-]
Sure, but it gives the impression of degraded model performance. Especially when the interface is still saying the model is operating on "high", the same as it did yesterday, yet it is in "medium" -- it just looks like the model got hobbled.
sroussey 2 days ago [-]
Oh, absolutely. Though changes in how the model is used are eminently more fixable than the model itself.
johnmaguire 2 days ago [-]
Yes, but for many users, CC is the product. Especially since I'm not allowed(?) to use my own harness with my sub.
Philpax 2 days ago [-]
> Anthropic publicly gaslights their user-base: "we never degrade model performance" is frustrating.
They're not gaslighting anyone here: they're very clear that the model itself, as in Opus 4.7, was not degraded in any way (i.e. if you take them at their word, they do not drop to lower quantisations of Claude during peak load).
However, the infrastructure around it - Claude Code, etc - is very much subject to change, and I agree that they should manage these changes better and ensure that they are well-communicated.
jryio 2 days ago [-]
Model performance at inference in a data center vs. stripping thinking tokens: effectively the same thing.
Sure, they didn't change the GPUs they're running on, or the quantization, but if valuable information is removed and the models perform worse, performance was degraded.
In the same way uptime doesn't care about the incident cause: if you're down, you're down, and no one cares that it was "technically DNS".
sroussey 2 days ago [-]
I thought these days thinking tokens sent by the model (as opposed to used internally) were just for the user's benefit. When you send the convo back, you have to strip the thinking stuff for the next turn. Or is that just local models?
aszen 2 days ago [-]
Claude code is not infra, the model is the infra. They changed settings to make their models faster and probably cheaper to run too. Honestly with adaptive thinking it no longer matters what model it is if you can dynamically make it do less or more work.
ramoz 2 days ago [-]
Opus 4.7 is very rough to work with, specifically for long-horizon tasks (we were told it was trained specifically for those, with less handholding needed).
I don't have trust in it right now. More regressions, more oversights; it's pedantic in weird ways. Ironically, it requires more handholding.
Not saying it's a bad model; it's just not simple to work with.
For now: `/model claude-opus-4-6[1m]` (you'll get different behavior around compaction without [1m])
btbuildem 1 days ago [-]
A lot of this could have been avoided with more openness and transparency.
Communicate the changes you are making. Leverage the community using your product(s). Reveal more about the building blocks (system prompts, harness, etc) so people can better understand how to use your tools.
I understand they're in an existential battle with the other SOTA houses -- but secrecy, straight-out lies, and opaque "communication" are not the way to win it. Not when the OSS stack is hot on their heels anyway.
russellthehippo 2 days ago [-]
Damn, it was real the whole time. I found Opus 4.7 to holistically underperform 4.6, especially in how wordy it is. It's harder to work with, so I just switched back to 4.6 + Kimi K2.6. Now GPT 5.5 is here and it's been excellent so far.
arjie 2 days ago [-]
Useful update. It would be useful to me to switch to a nightly/release cycle, but I can see why they don't: they want to be able to move fast, and it's not like I'm going to churn over these errors. I can only imagine the benchmark runs are prohibitively expensive, or slow, or not using their standard harness, because a weekly cadence would be a good smoke test. At the least, they'd know the trade-offs they're making.
Many of these things have bitten me too. Firing off a request that is slow because it's been kicked out of cache, and getting zero cache hits, makes everything way more expensive, so it makes sense they would do this. I tried skipping tool calls and thinking as well, and it made the agent much stupider. These all seem like natural things to try. Pity.
calmbonsai 21 hours ago [-]
Move really fast and break all the things. Yeah, these guys are reminding me of the 2010s, when AWS started pushing out random edge-case services with little regard for quality or integration, which ended up hurting the quality of the core services too.
rohansood15 1 days ago [-]
If Anthropic couldn't catch these issues before people started screaming at them, do we really believe 50% of software engineering jobs are going away?
munk-a 2 days ago [-]
It's also important to realize that Anthropic has recently struck several deals with PE firms to use their software. So Anthropic pays the PE firm which forces their managed firms to subscribe to Anthropic.
The artificial creation of demand is also a concerning sign.
jameson 2 days ago [-]
> "In combination with other prompt changes, it hurt coding quality, and was reverted on April 20"
Do researchers know the correlation between various aspects of a prompt and the response?
An LLM, to me at least, appears to be a wildly random function that is difficult to rely on. Traditional systems have structured inputs and outputs, and we can know how a system produced its output. This doesn't appear to be the case for LLMs, where inputs and outputs are arbitrary text.
Anecdotally, I had a difficult time working with open source models at a social media firm: something as simple as wrapping the example JSON structure with ```, adding a newline, or the wording I used wildly changed accuracy.
pxc 2 days ago [-]
One of Anthropic's ostensible ethical goals is to produce AI that is "understandable" as well as exceptionally "well-aligned". It's striking that some of the same properties that make AI risky also just make it hard to consistently deliver a good product. It occurs to me that if Anthropic really makes some breakthroughs in those areas, everyone will feel it in product quality, whether they're worried about grandiose/catastrophic predictions or not.
But right now it seems like, in the case of (3), these systems are really sensitive and unpredictable. I'd characterize that as an alignment problem, too.
rimliu 1 days ago [-]
A broken cache does not a breakthrough make.
sutterd 2 days ago [-]
What kind of performance are people getting now? I was running 4.7 yesterday and it did a remarkably bad job. I recreated my repo state exactly and ran the same starting task with 4.5 (which I have preferred to 4.6). It was even worse, by a large margin. My task was likely difficult or poorly posed, but I still have some idea of what 4.5 should have done on it, and this was not it. What experiences are other people having with 4.7? How about with other model versions, if you're trying them? (In both cases, I ran on max effort, for whatever that is worth.)
thesnide 22 hours ago [-]
In my last session, the Claude agent generated the code, then the tests, then fixtures tailored to make the tests pass against the code, and gloated that everything passed immediately.
Once I saw that, I wasn't surprised anymore.
WhitneyLand 2 days ago [-]
Did they not address how adaptive thinking has played into all of this?
jwpapi 2 days ago [-]
Those are exactly the kind of issues you run into when your app is AI-coded: you build one thing and break something else.
You have too many benchmarks, and the wrong ones.
voxelc4L 2 days ago [-]
I’ve stuck to the non-1M-context Opus 4.6 and it works really well for me, even with ongoing context compression. I honestly couldn’t deal with the 1M-context change and then the compounding token-devouring nonsense of 4.7.
I sincerely hope Anthropic is seeing all of this and taking note. They have their work cut out for them.
setnone 1 days ago [-]
absolutely agree: non-1M Opus 4.6 on x20 max was peak AGI
now it's back to regular slop and just to check otherwise i have to spend at least $100
lifthrasiir 2 days ago [-]
Is it just me, or has the reset cycle of usage limits been randomly updated? I originally had the reset point at around 00:00 UTC tomorrow, and it was somehow delayed to 10:00 UTC tomorrow, regardless of when I started using Claude in this cycle. My friends also reported very random delays, as much as ~40 hours, with seemingly no other reason. Is this another bug on top of the other bugs? :-S
nubinetwork 1 days ago [-]
My usage got reset yesterday as usual, but it appears it will reset again on Sunday.
someone4958923 2 days ago [-]
"This isn’t the experience users should expect from Claude Code. As of April 23, we’re resetting usage limits for all subscribers."
lifthrasiir 2 days ago [-]
I know that. I'm saying that the cycle reset is neither what it used to be (starting at the very first usage) nor what it might be (retaining the cycle-reset timing).
jongleberry 2 days ago [-]
it seems to be the same cycle for everyone now, not based on first usage. I saw a reddit thread on this from someone who had multiple accounts that all had the same cycles
Implicated 2 days ago [-]
Just as a note to CC fans/users here, since I had an opportunity to test it... I tried resuming a session that was stale at 950k tokens after returning from a full day or so of being idle, and thus with a fully empty quota/session.
Resuming it cost 5% of the current session and 1% of the weekly session on a max subscription.
sscaryterry 1 days ago [-]
Glad there is finally some ownership. It's a pity that this mostly happened because AMD embarrassed them on GitHub. Users have been reporting these issues for weeks but were mostly ignored.
deaux 2 days ago [-]
They had this ready and timed it for the GPT 5.5 announcement. Zero chance it's a coincidence.
sreekanth850 1 days ago [-]
Who’s going to pay for the exorbitant number of tokens Claude used without delivering any meaningful outcome? I spent many sessions getting zero results, and when I posted about it on their subreddit, all I got were personal attacks from bots and fanboys. I instantly cancelled my subscription and moved to Codex.
Also, it may be a coincidence that the article was published just before the GPT 5.5 launch, and that they then restored the original model while releasing a PR statement claiming it was due to bugs.
natdempk 2 days ago [-]
As an end-user, I feel like they're kind of over-cooking and under-describing the features and behavior of what is a tool at the end of the day. Today the models are in a place where the context management, reasoning effort, etc. all needs to be very stable to work well.
The thing about session resumption changing the context of a session by truncating thinking is a surprise to me; I don't think that's documented behavior anywhere?
It's interesting to look at how many bugs are filed on the various coding-agent repos. Hard to say how many are real/unique, but the quantities feel very high, and it's not hard to run into real bugs rapidly as you use the various features and slash commands.
anonyfox 1 days ago [-]
I refuse to believe that caching tiers longer than 1 hour would be impossible to build and use transparently to avoid all this complexity in the first place, or that they would be that expensive to maintain in 2026, when the bulk of the costs are on inference anyway; costs that occasional longer-lived cache hits would even reduce.
VadimPR 2 days ago [-]
Appreciate the honesty from the team.
At the same time, I personally find prioritizing quality over quantity of output to be the better strategy. Ten partially buggy features really aren't as good as three quality ones.
kristianc 2 days ago [-]
To think we'd have known about this in advance if they'd just open-sourced Claude Code, rather than being forced into this embarrassing postmortem. Sunlight is the best disinfectant.
zem 2 days ago [-]
ugh, caching based on idle time is horrible for my usage anyway; since Claude is both fairly slow and doesn't really have much of a daily quota anyway, I often tell it to do something and then wander off and come back to check on it when I next think about it. I always vaguely assumed that my session would not "detect" the intervening time anyway, since it was all async. I guess from a global perspective, time-based cache eviction makes sense.
whh 1 days ago [-]
Thanks Anthropic, and a big thanks to your Claude Code team for the customer obsession here. I've just noticed the Command + Backspace fix and even the nice little Ctrl + y addition as a fix for accidents.
How about just not changing the harness abruptly in the first place? Make new system prompt changes "experimental" first so you can gather feedback.
2 days ago [-]
gilrain 2 days ago [-]
Hi Boris, random observer here. Would you consider apologizing to the community for mistakenly closing tickets related to this and then wrongly keeping them closed when, internally, you realized they were legitimate?
I think an apology for that incident would go a long way.
rimliu 1 days ago [-]
not many would believe in the sincerity of it anyway.
rebolek 2 days ago [-]
> On April 16, we added a system prompt instruction to reduce verbosity.
What verbosity? Most of the time I don’t know what it’s doing.
whalesalad 2 days ago [-]
They don’t either.
ginkgotree 12 hours ago [-]
When will they fix it? This is what is important.
wg0 2 days ago [-]
A heavily vibe-coded CLI would have tons of issues, regularly.
LLMs over-edit, and it's a known problem.
RamblingCTO 1 days ago [-]
Doesn't change anything about Opus 4.7 being an absolute buffoon. Even going back to Opus 4.6 doesn't feel like the magical period of maybe 3-4 weeks ago. Gonna go back to OpenAI.
8note 2 days ago [-]
Something I note from this is that this is not a model-weights change, but a hidden state change Anthropic applies to the outputs that can tune the quality of the "model" up and down without breaking the "we aren't changing the model" promise.
How often do these changes happen?
Alifatisk 2 days ago [-]
It’s incredible how forgiving you guys are with Anthropic and their errors, especially considering you pay a high price for their service and receive lower quality than expected.
saghm 2 days ago [-]
At least personally, it feels like the choices are
the one that's okay with being used for mass surveillance and autonomous weapons targeting, the one that's on track to get acquired by the AI company that dragged its feet in getting around to stopping people from making child porn with it, the one that nobody seems to use from Google, and the one that everyone complains about but also still seems to be using because it at least sometimes works well. At this point I've opted out of personal LLM coding by canceling my subscription (although my employer still has subscriptions and wants us to keep using them, so I'll presumably keep using Claude there) but if I had to pick one to spend my own money on I'd still go with Claude.
scblock 2 days ago [-]
A valid choice, a moral choice, is none of the above.
goldfish_gemma4 2 days ago [-]
[dead]
ed_elliott_asc 2 days ago [-]
I pay for 20x max and get so much more value out of it than I pay.
rvz 12 hours ago [-]
This is what we call "Stockholm syndrome"
Avicebron 2 days ago [-]
The difference in quality between ChatGPT 5.4 and Opus 4.7 is still night and day. Heck, even on Perplexity, where 5.4 is included in Pro vs 4.7 which is behind the Max plan or whatever, I will pick Sonnet 4.6 over the 5.4 offering and it's consistently better. I don't love Anthropic, and I don't have illusions about them as a business.
But if a tool is better, it's better.
wahnfrieden 2 days ago [-]
You aren’t getting the 5.4 experience for code if you’re not using it in the Codex harness
arnvald 2 days ago [-]
What's the alternative? Are you suggesting other LLM providers don't charge high price? Or that they don't make mistakes? Or that they provide better quality?
We're talking about dynamically developed products, something that most people would have considered impossible just 5 years ago. A non-deterministic product that's very hard to test. Yes, Anthropic makes mistakes, models can get worse over time, their ToS change often. But again, is Gemini/GPT/Grok a better alternative?
AntiUSAbah 2 days ago [-]
Because it is still good though.
If you have a good product, people are more understanding. And getting worse doesn't mean it's no longer valuable, only that the price/value ratio went down. But Opus 4.5 was noticeably better and only came out in November.
There was no price increase at that time, so for the same money we got better models. Opus 4.6 again feels noticeably better, though.
Also, moving fast-ish means getting more/better models faster.
I do know plenty of people, though, who use opencode or pi and OpenRouter and switch models a lot more often.
mlinsey 2 days ago [-]
The consumer surplus is quite high. Even with the regressions in this postmortem, performance was above that of the models last fall, when I was gladly paying for my subscription and thought it was a net time-saver.
That said, there is now much better competition with Codex, so there's only so much rope they have now.
scottyah 2 days ago [-]
These are fairly small issues for an amazing product, and the company is just a few years old and growing rapidly. Also, they are leading a powerful technological revolution, and their competitors are known to have multiple straight-up evil tendencies. A little degradation is not an issue.
lukasus 2 days ago [-]
At the time you wrote your comment there were 4 other comments, all of them very negative towards Anthropic and the blog post in question here. How did you reach these conclusions?
unselect5917 2 days ago [-]
HN glazes anthropic every single time I see it come up. This is as obvious as HN's political bias.
lukan 2 days ago [-]
Confused as well. I rather supposed Anthropic had some standing for saying no to Trump and being declared a national security threat, but the anger they got, and people leaving for OpenAI again (who gladly said yes to autonomous killing AI), did astonish me a bit. I also had weird things happen with my usage limits and was not happy about it. But it is still very useful to me, and I only pay for the Pro plan.
sunaookami 2 days ago [-]
>I rather supposed Anthropic had some standing for saying no to Trump and being declared national security threat
I never understood why people cheered for Anthropic then when they happily work together with Palantir.
timmg 2 days ago [-]
> It’s incredible how forgiving you guys are with Anthropic and their errors.
Ironically, I was thinking the exact opposite. This is bleeding edge stuff and they keep pushing new models and new features. I would expect issues.
I was surprised at how much complaining there is -- especially coming from people who have probably built and launched a lot of stuff and know how easy it is to make mistakes.
jgbuddy 2 days ago [-]
Anthropic actually not so bad. Anthropic models code good, usually. Price not so high compared to time to do it by self.
operatingthetan 2 days ago [-]
I don't think Anthropic has to inform their customers of every change they make, but they should have with this one.
OsrsNeedsf2P 2 days ago [-]
Look at any criticism of Mythos. Some members on HN are defending it tooth and nail, despite it not being released
tempest_ 2 days ago [-]
A lot of people are provided their access through work.
They don't actually pay the bill or see it.
fastball 2 days ago [-]
What high price? I pay $200/m for an insane number of tokens.
oytis 2 days ago [-]
Remember Louis CK talking about Wi-Fi on an airplane? People are dealing with highly experimental technology here
mystraline 2 days ago [-]
Exactly. They've done now like 6 rug-pulls.
Idiots keep throwing money at real-time enshittification and "I am changing the terms. Pray I do not change them further."
And yes, I am absolutely calling people who keep getting screwed and paying for more 'service' as idiots.
And Anthropic has proved that people will keep paying for less and less. So, why not fuck them over and make more money for the company?
2 days ago [-]
ankit219 2 days ago [-]
An interesting question is why these optimizations were pushed so aggressively in the first place, especially given that this is the time they were running a 2x promotion, by themselves, presumably without seeing any slowdown in demand.
bearjaws 2 days ago [-]
The issue of Claude just not doing any work was infuriating, to say the least. I already ran at medium thinking level, so I was never impacted by that change, but having to constantly go "okay, now do X like you said" was annoying.
Again, it goes back to the "intern" analogy people like to make.
motbus3 2 days ago [-]
I had a similar experience just before 4.5 and just before 4.6 were released.
Somehow, three times makes me not feel confident in this response.
Also, if this is all true and correct, how the heck do they validate quality before shipping anything?
Shipping software without quality is a pretty easy job, even without AI. Just saying...
zagwdt 1 days ago [-]
ngl, lost a lot of trust in CC after reading this, especially point 1
how do you just do that to millions of users building prod code with your shit
xlayn 2 days ago [-]
If Anthropic is doing this as a result of "optimizations", they need to stop doing that and raise the price.
The other thing: there should be a way to test a model and validate that it answers exactly the same each time.
I have experienced twice... when a new model is going to come out... the quality of the top dog one starts going down... and bam.. the new model is so good.... like the previous one 3 months ago.
The other thing, when anthropic turns on lazy claude... (I want to coin here the term Claudez for the version of claude that's lazy.. Claude zzZZzz = Claudez) that thing is terrible... you ask the model for something... and it's like... oh yes, that will probably depend on memory bandwith... do you want me to search that?...
YES... DO IT... FRICKING MACHINE..
joshstrange 2 days ago [-]
It's incredibly frustrating when I've spelled out in CLAUDE.md that it should SSH to my dev server to investigate things I ask it to and it regularly stops working with a message of something like:
> Next steps are to run `cat /path/to/file` to see what the contents are
Makes me want to pull my hair out. I've specifically told you to go do all the read-only operations you want out on this dev server yet it keeps forgetting and asking me to do something it can do just fine (proven by it doing it after I "remind" it).
That and "Auto" mode really are grinding my gears recently. Now, after a Planning session, my only option is to use Auto mode, and I have to manually change it back to "Dangerously skip permissions". I think these are related, since the times I've let it run in "Auto" mode are when it gives up/gets stuck more often.
Just the other day it was in Auto mode (by accident) and I told it:
> SSH out to this dev server, run `service my_service_name restart` and make sure there are no orphans (I was working on a new service and the start/stop scripts). If there are orphans, clean them up, make more changes to the start/stop scripts, and try again.
And it got stuck in some loop/dead-end, telling me I should do it myself because it didn't want to run commands on a "shared dev server" (which I had specifically told it this was not).
The fact that Auto mode burns more tokens _and_ is so dumb is really a kick in the pants.
marcyb5st 2 days ago [-]
Apart from Anthropic, nobody knows how much the average user costs them. However, the consensus is "much more than that".
If they have to raise prices to stop hemorrhaging money, would you be willing to pay $1000 a month for a Max plan? Or $100 per 1M output tokens? (Playing Numberwang here, but the point stands.)
If I have to guess they are trying to get balance sheet in order for an IPO and they basically have 3 ways of achieving that:
1. Raising prices like you said, but the user drop could be catastrophic for the IPO itself and so they won't do that
2. Dumb the models down (basically decreasing their cost per token)
3. Send fewer tokens (i.e. capping thinking budgets aggressively).
2 and 3 are palatable because, even if they annoy the technical crowd, investors still see a big number of active users with a positive margin on each.
CamperBob2 2 days ago [-]
$1000/mo for guaranteed functionality >= Opus 4.6 at its peak? Yes, I'd probably grumble a bit and then whip out the credit card.
I'm not a heavy LLM user, and I've never come anywhere near the $200/month plan limits I'm already subscribed to. But when I do use it, I want the smartest, most relentless model available, operating at the highest performance level possible.
Charge what it takes to deliver that, and I'll probably pay it. But you can damned well run your A/B tests on somebody else.
dgellow 2 days ago [-]
I would love if agents would act way more like tools/machines and NOT try to act as if they were humans
There are a number of projects working on evals that can check how 'smart' a model is, but the methodology is tricky.
One would want to run the exact same prompt, every day, at different times of the day, but if the eval prompt(s) are complex, the frontier lab could have a 'meta-cognitive' layer that looks for repetitive prompts, and either:
a) feeds the model a pre-written output to give to the user
b) dumbs down output for that specific prompt
Both cases defeat the purpose in different ways, and make a consistent gauge difficult. And it would make sense for them to do that since you're 'wasting' compute compared to the new prompts others are writing.
hex4def6 2 days ago [-]
I think you could alter the prompt in subtle ways: a period becomes an ellipsis, extra commas, synonyms, occasional double spaces, etc.
Enough that the prompt is different at a token level, but not enough that the meaning changes.
It would be very difficult for them to catch that, especially if the prompts were not made public.
Run the variations enough times per day and you'd get some statistical significance.
I guess the fuzzy part is judging the output.
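A minimal sketch of that perturbation idea, purely illustrative: the `SYNONYMS` table, the coin-flip probabilities, and the function names are all my own assumptions, and the hard "judge the output" step is left out entirely.

```python
import random

# Tiny illustrative synonym table; a real harness would use a much larger one.
SYNONYMS = {"make": "create", "show": "display", "fast": "quick"}

def perturb(prompt: str, rng: random.Random) -> str:
    """Produce a token-level variant of a prompt that preserves its meaning."""
    out = []
    for w in prompt.split(" "):
        key = w.lower().strip(".,")
        if key in SYNONYMS and rng.random() < 0.5:
            w = w.replace(key, SYNONYMS[key])  # swap in a synonym
        out.append(w)
    text = " ".join(out)
    if rng.random() < 0.5:
        text = text.replace(".", "...", 1)     # period -> ellipsis
    if rng.random() < 0.5:
        text = text.replace(" ", "  ", 1)      # one sneaky double space
    return text

def variants(prompt: str, n: int, seed: int = 0) -> list[str]:
    """Generate n meaning-preserving variants for repeated daily eval runs."""
    rng = random.Random(seed)
    return [perturb(prompt, rng) for _ in range(n)]
```

Each variant tokenizes differently, so a provider-side "seen this exact prompt before" detector shouldn't trip; you would then send the variants through the model and score the outputs statistically.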
JyB 2 days ago [-]
This specifically is super annoying.
KronisLV 2 days ago [-]
This reads like good news! They probably still lost a bunch of users due to the negative public sentiment and not responding quickly enough, but at least they addressed it with a good bit of transparency.
vicchenai 2 days ago [-]
had this happen to me mid-refactor and spent 20 min wondering if I'd gone crazy. honestly the one-hour threshold feels pretty arbitrary, sometimes you just step away to think
nopurpose 2 days ago [-]
Weren't there reports that quality decreased when using non-CC harnesses too? Nothing in the blog post explains that.
davidfstr 2 days ago [-]
Good on Anthropic for giving an update & token refund, given the recent rumors of an inexplicable drop in quality. I applaud the transparency.
scuderiaseb 2 days ago [-]
Opus 4.7 was released a week ago, at which point all limits were reset, so this was very convenient for them: basically everyone's weekly limit was about to be reset anyway.
antirez 2 days ago [-]
Zero QA basically.
8note 2 days ago [-]
id go more on the lines of "dont know what to QA for"
throwaway2027 2 days ago [-]
Cool but I switched to Codex for the time being.
tdg5 2 days ago [-]
I missed the part about the refunds…
ayhanfuat 2 days ago [-]
Reading the "Going forward" section I see that they have zero understanding of the main complaints.
Kiro 2 days ago [-]
How so?
ayhanfuat 2 days ago [-]
They feel they're in a position to make important trade-off decisions on behalf of the user. "It's just slightly worse, I'll sneak this change in" is not something to be tolerated, whether it actually turns out to be much worse or not. Their adaptive thinking mess has caused a ton of work for me. I know a lot of people are saying Codex is actually better now. I don't agree but I'm switching to it because it's much more reliable.
operatingthetan 2 days ago [-]
I agree, but these LLM products are all black-boxes so we need to demand more accountability from them.
hirako2000 2 days ago [-]
In other words we did the right things, but we understand feedback, oh and bugs happen.
hajile 2 days ago [-]
My takeaway is that they knew they were changing a bunch of stuff while their reps were gaslighting us in the comments here.
Why should we ever trust what they say again, or trust that they won't be rug-pulling again once this blows over?
walthamstow 2 days ago [-]
So we weren't going mad then!
gnegggh 1 days ago [-]
Not the first time. Still not showing thinking, are we?
ElFitz 2 days ago [-]
Now we know why Anthropic banned the use of subscriptions with other agent harnesses: they partially rely on the Claude Code cli to control token usage through various settings.
And it also tells us why we shouldn’t use their harness anyway: they constantly fiddle with it in ways that can seriously impact outcomes without even a warning.
EugeneOZ 2 days ago [-]
If you think that you can just silently modify the model without any announcement and only react when it doesn't go unnoticed, then be 100% sure that your clients will check every possible alternative and leave you as soon as they find anything similar in quality (and no, not a degraded one).
whalesalad 2 days ago [-]
The funny thing is, in the last 3 days Claude has gotten substantially worse. So this claim, "All three issues have now been resolved as of April 20 (v2.1.116)" does not land with me at all.
ramesh31 2 days ago [-]
Effort should not be configurable for Opus, it should be set to a single default that provides the highest level of capability. There are zero instances in which I am willing to accept a lesser result in exchange for a slightly faster response from Opus. If that were the case I would be using Flash or Haiku.
ritonlajoie 2 days ago [-]
yesterday CC created a fastapi /healthz endpoint and told me it's the gold standard (with the ending z). today I stopped my max sub and will be trying codex
wrxd 2 days ago [-]
To be fair, that's a Google convention. Have a look at z-pages.
jesse_dot_id 1 days ago [-]
This is fairly normal.
setnone 2 days ago [-]
Good on them for resolving all three issues, but is it any good again?
alxndr13 2 days ago [-]
For me at least, yes. I just wrote that to coworkers this afternoon. It behaves way more "stable" in terms of quality, and I don't have the feeling of the model getting way worse after 100k tokens of context or so.
What I notice: after 300k there's some slight quality drop, but I just make sure to compact before that threshold.
powera 1 days ago [-]
I'm not sure they've found/understood it yet. My two main theories:
1. A bunch of people with new Claude Code codebases in December now are working with a larger codebase, causing more context. Claude reads a lot of code files, and doesn't effectively prune from the context as far as I can tell. I find myself having to hint Claude regularly about what files to read (and not read) to avoid having 75k of unrelated files in the context window.
2. Claude Code tries to do more now, for the benefit of people who don't know exactly what they want. The trade-off is that it's worse at doing exactly what people want, when they do know. The "small fix" becomes a large endeavor for Claude.
psubocz 2 days ago [-]
> All three issues have now been resolved as of April 20 (v2.1.116).
The latest in Homebrew is 2.1.108, so not fixed, and I don't see Opus 4.7 on the models list... Is Homebrew a second-class citizen, or am I in the B group?
YetAnotherNick 1 days ago [-]
Why don't they monitor average prompt and response token length (both cached and uncached) per interaction? It seems this could have caught all their previous unnoticed degradations.
I'm also a bit surprised they don't have any automated quality checks. They could run something like SWE-bench before each release. Both of these seem like basic things even for a startup, let alone a product generating billions in revenue.
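As a sketch of that monitoring idea (my own toy statistics, not anything Anthropic is known to run): compare each build's mean response-token count against the previous build's distribution and alert on large drift in either direction. A drop can mean truncated context or thinking; a spike can mean runaway usage.

```python
from statistics import mean, stdev

def tokens_regressed(baseline: list[float], current: list[float], z: float = 3.0) -> bool:
    """Flag when the current build's mean response-token count drifts more
    than z standard errors from the baseline build's mean."""
    se = stdev(baseline) / (len(current) ** 0.5)  # crude standard error
    return abs(mean(current) - mean(baseline)) > z * se
```

In production you'd want per-model and per-effort-level slices, since a legitimate prompt change also moves these numbers, but even this crude gate would have fired on a bug that strips thinking from every turn.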
PeakScripter 1 days ago [-]
They should really test everything thoroughly and then make it available to general public to avoid these issues!!
maxrev17 2 days ago [-]
Please, for the love of god, just put the Max plan price up 4x or 5x and make it actually work.
jruz 2 days ago [-]
Too late bro, switched to Codex I’m done with your bullshit.
rishabhaiover 2 days ago [-]
Boris gaslit us about all the quality-related incidents for weeks, not acknowledging these problems.
throwaway2027 2 days ago [-]
Maybe he didn't know, or they were still figuring it out, which is fine; they're still engineers who can get things wrong sometimes. But the communication felt lackluster, and being on the receiving end sucks when you had a reliable setup that then degrades. There's a reason people don't upgrade software and say "if it works, don't fix it", but obviously that's not an option for Anthropic when they want to keep improving the product, so they need good measurement tools and quick rollbacks, even if properly "benchmarking" LLMs could prove difficult.
rishabhaiover 2 days ago [-]
I agree, but one can admit the situation instead of outright rejecting the claims. My own mistake was becoming so hopelessly dependent on them.
Rapzid 2 days ago [-]
> On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode.
Translation: To reduce the load on our servers.
2 days ago [-]
taytus 2 days ago [-]
They should do a similar report about their communication team. This was horribly mismanaged.
systemvoltage 2 days ago [-]
Interesting. All three seem like they're obviously going to impact quality, e.g. reducing the effort from high to medium.
So there must have been explicit internal guidance/policy that allowed this tradeoff to happen.
Did they fix just the bug, or the deeper policy issue?
teaearlgraycold 2 days ago [-]
> On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6.
Is it just me, or does this seem kind of shocking? Such a severe bug affecting millions of users, with a non-trivial effect on the context window, should be readily evident to anyone looking at the analytics. Makes me wonder if this is the result of Anthropic's vibe-coding culture. Is no one actually looking at the product, its code, or its outputs?
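The bug class described in the quote is easy to reproduce in the abstract. This is illustrative pseudologic, not Anthropic's code; the session/turn model here is entirely made up. The pattern: a one-time cleanup re-fires every turn because the "already done" marker isn't consulted.

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    turns: list[str] = field(default_factory=list)
    pruned_on_resume: bool = False  # the fix: remember the prune already ran

def prune_stale_thinking(s: Session, buggy: bool) -> None:
    """Drop old 'thinking:' entries once, on resume. The buggy variant never
    checks that it already ran, so it strips thinking on every single turn."""
    if buggy or not s.pruned_on_resume:
        s.turns = [t for t in s.turns if not t.startswith("thinking:")]
        s.pruned_on_resume = True  # ignored by the buggy variant's condition

def run_three_turns(buggy: bool) -> list[str]:
    s = Session(turns=["thinking: stale plan from yesterday", "user: hi"])
    for i in range(3):
        prune_stale_thinking(s, buggy)          # runs at the start of each turn
        s.turns.append(f"thinking: step {i}")   # model emits fresh thinking
        s.turns.append(f"assistant: reply {i}")
    return s.turns
```

With `buggy=True`, only the final turn's thinking survives, which would read exactly as "forgetful and repetitive"; with `buggy=False`, the stale entry is pruned once and all three new thinking steps remain in context.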
chermi 2 days ago [-]
It's really hard to understand. There need to be really loud, Batman-signal-in-the-sky-type alerts from some hero third party calling out objective product degradation. Do they use CC internally? If so, do they use a different version? This should've been almost as loud a breakage as the service going down altogether, yet it took two weeks to fix?!
poly2it 2 days ago [-]
> ... we’ll ensure that a larger share of internal staff use the exact public build of Claude Code (as opposed to the version we use to test new features) ...
Apparently they are using another version internally.
manmal 2 days ago [-]
I think that would also have busted the cache all the time, and uncached requests consume usage limits rapidly.
nrki 2 days ago [-]
> we refunded all affected customers
Notably missing from the postmortem
tontinton 2 days ago [-]
or you can use a non-vibe-designed, efficient Rust TUI coding agent made by yours truly; all my coworkers use it too :) It's called https://maki.sh!
Lua plugins WIP
varispeed 1 days ago [-]
It appears that Opus 4.7 has already been nerfed. I can't get any sensible results since yesterday; it just keeps running in circles. Even telling it that it is committing fraud by doing the superficial work it was specifically told not to do doesn't help.
rimliu 1 days ago [-]
oh yes. I tried to get a review of a code base after some refactoring. CC produced a complete garbage review. After I pointed that out, it admitted that it was garbage, and promptly produced another pile of garbage. After the third failed attempt I had to call it a day.
dainiusse 2 days ago [-]
Corporate bs begins...
system2 1 days ago [-]
Whatever they did, with the Max plan my daily usage quota was consumed in less than 10 minutes. Weird; let's hope they fix the usage now.
0gs 2 days ago [-]
wow resetting everyone's usage meter is great. i was so close to finally hitting my weekly limit for once though
1 days ago [-]
noname120 1 days ago [-]
So now the solution is to input a “ping” message every hour so that it keeps the cache warm?
gverrilla 2 days ago [-]
Recent minor issue worth flagging: Claude sometimes introduces domain-specific acronyms without first spelling them out, assuming reader familiarity. Caught this in a pt-br conversation about cycling where Claude used "FC" (frequência cardíaca / heart rate) — a term common in sports science literature but not in everyday Portuguese. Same pattern shows up in English too (e.g., dropping "RPE," "VO2," "HIIT" without definition). Suggested behavior: on first mention, write the full term and introduce the acronym in parentheses — "frequência cardíaca (FC)" / "heart rate (HR)" — then use the acronym freely afterward. Small thing, but it affects accessibility for readers outside the specific jargon bubble.
troupo 2 days ago [-]
> they were challenging to distinguish from normal variation in user feedback at first
translation: we ignored this and our various vibe coders were busy gaslighting everyone saying this could not be happening
epsteingpt 2 days ago [-]
Gaslit for months, only to acknowledge it now.
dcchambers 2 days ago [-]
So it turns out Anthropic was gaslighting everyone on twitter about this then? Swearing that nothing had changed and people were imagining the models got worse?
mtilsted 1 days ago [-]
Nope, they were technically correct. Nothing had changed with the model; the model had not gotten any worse.
The harness, on the other hand... now that had problems.
o10449366 2 days ago [-]
Resuming sessions has been broken since Feb (I had to get Claude to write a hook to fix it itself), the monitoring tool doesn't work and blocks usage of what does (a simple sleep, except it doesn't even block correctly, so you just sidestep it in more ridiculous ways), and yet there seem to be more annoying activity proxies/spinner wheels (staring into the middle distance)... I don't know how, in the span of a few months, you lose such focus on your product goals. Has Anthropic already reached that point in its lifecycle where the product team is no longer staffed by engineers, with more and more non-technical MBAs joining to ride the hype train?
whalesalad 2 days ago [-]
I genuinely don't understand what they've been trying to achieve. All of these incremental "improvements" have... not improved anything, and have had the opposite effect.
My trust is gone. When day-to-day updates do nothing but cause hundreds of dollars in lost tokens, and the response is "we... sorta messed up, but just a little bit here and there, and it added up to a big mess-up"? Bro, get fucking real.
yuvrajmalgat 2 days ago [-]
ohh
cute_boi 2 days ago [-]
Honestly, it’s kind of sad that Anthropic is winning this AI race. They are the most anti–open source company, and we should try to avoid them as much as possible.
They are all doing it because OpenAI is snatching their customers. And their employees have been gaslighting people [1] for ages. I hope open-source models will provide fierce competition so we do not have to rely on an Anthropic monopoly.
[1] https://www.reddit.com/r/claude/comments/1satc4f/the_biggest...
petervandijck 2 days ago [-]
I have noticed a clear increase in smarts with 4.7. What a great model!
People complain so much, and the conspiracy theories are tiring.
These are all classic symptoms of vibe-induced AI velocitis, sold by AI-peddlers as the future of the industry under the guise of "productivity."
AI can help one generate a lot of code, but the poor engineers approving the deluge of changes are still using their old, unmodified, stock meat-brains. An individual change may look fine in isolation, but when it's interacting with hundreds or thousands of other changes landing the same week , things can go south quickly.
Expect more instability until users rebel, and/or CTOs and CIOs cry uncle. Amazon reportedly internally sounded the alarm after a couple of AI-tool-induced SEVs. The challenges at GitHub and the company insisting you don't call it Microslop are also rumored to be AI-related.
Yet their flagship product got three really bad changes shipped into it, and they were only resolved after more than a month.
This raises another question: with all the industry-wide boasting about AI-driven productivity, why does the leading company in agentic coding take over a month to fix severe customer-reported issues?
My unfounded suspicion: because this is the tradeoff we're all facing and for the most part refusing to accept when transitioning over to LLM-driven coding. This is exactly how we're being trained to work by the strengths and limitations of this new technology.
We used to depend on maintaining a global if incomplete understanding of a whole system. That enabled us to know at a glance whether specs and tests and actual behavior made sense and guided our thinking, enabling us to know what to look at. With agentic coding, the brutal truth is that this is now a much less "efficient" approach and we'll ship more features per day by letting that go and relying on external signs of behavior like test suites and an agent's analysis with respect to a spec. It enables accomplishing lots of things we wouldn't have done before, often simply because it would be too much friction to integrate it properly -- write tests, check performance, adjust the conceptual understanding to minimize added complexity, whatever.
So in order to be effective with these new tools, we're naturally trained to let go of many of the things we formerly depended on to keep quality up. Mistakes that would formerly have been evidence of stupidity or laziness are now the price to pay for accelerated productivity, and they're traded off against the "mistakes" we formerly made that were less visible, often because they took the form of opportunity cost.
Simple example: say you're writing a simple CLI in Python. Formerly, you might take in a fixed sequence of positional arguments, or even if you did use argparse, you might not bother writing help strings for each one. Now because it's no harder, the command-line processing will be complete and flexible and the full `--help` message will cover everything. Instead, you might have a `--cache-dir=DIR` option that doesn't actually do anything because you didn't write a test for it and there's no visible behavioral change other than worse performance.
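A minimal sketch of the kind of CLI in question; the `--cache-dir` no-op is hypothetical, invented here to mirror the comment's example:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """The agent-written CLI: flexible flags, complete --help text."""
    parser = argparse.ArgumentParser(description="Example tool")
    parser.add_argument("input", help="Path to the input file")
    # The failure mode described above: this option parses cleanly and
    # shows up in --help, but nothing ever reads args.cache_dir, so it
    # silently does nothing -- and no test exists to notice.
    parser.add_argument("--cache-dir", metavar="DIR",
                        help="Directory for cached results")
    return parser

# Parse a fixed argv for illustration (instead of sys.argv):
args = build_parser().parse_args(["data.txt", "--cache-dir", "/tmp/cache"])
```

The option is "complete" by every visible measure (parses, documented, flexible), which is exactly why only a behavioral test would catch that it is dead weight.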
Closely related, what do you do with user feedback and complaints? Formerly they might be one of your main signals. Now you've found that you need dependable, deterministic results in your test suite that the agent is executing or it doesn't help. User input is very very noisy. We're being trained away from that. There'll probably be a startup tomorrow that digests user input and boils out the noise to provide a robust enough signal to guide some monitoring agent, and it'll help some cases, and train us to be even worse at others.
This sounds like Enterpret.
Working in enterprise software it's surprising how long an option that doesn't actually do anything can be missed. And that was before AI and having thousands of customers use it.
This same problem happens with documentation all the time. You end up with paragraphs or examples that simply don't reflect what the product actually does.
I'm quite fond of the idea of incremental mutation of agent trajectories to move/embody some of the reasoning steps from LLM tokens into a program. Imagine you have a long agent transcript/trajectory and you have a magic wand to replace a run of messages with "and now I'll call this script which gives me exactly the information I need," then seeing if the rewritten trajectory is stable.
To give credit where it's due, it's an overly complicated restatement of what Manny Silva has been saying with docs-as-tests https://www.docsastests.com/. Once you describe some user flow to humans (your "docs"), you can "compile" or translate part or all of those steps into deterministic test programs that perform and validate state transitions. Ideally you compile an agent trajectory all the way.
So: working with coding agents, you've cranked up the defect rate in exchange for speed, so let's try testing all important flows. The first thing you try is: ok, I've got these user guides, I guess I'll have the agent follow along and try to do it. And that works! But it's a little expensive and slow.
So I go, ok I'll have the agent do it once, and if it finds a trajectory through a product that works, we can reflect on that transcript and make some helper scripts to automate some or all of those state transitions, then store these next to our docs.
And then you say, ok if I ship a product change, can I have my coding agent update those testing scripts to save the expense and time of re-running the original follow-along. Also an obvious thing to do, and you can totally build it yourself with Claude Code. But I think there is a lot of complexity in how you go about doing this, what kind of incremental computation you can do to keep the LLM costs of all this under a couple hundred bucks a month for teams shipping 20 changes a day with 200 pages of docs.
The most polished open source "compiler/translator" I've seen exploring these ideas so far is Doc Detective (https://doc-detective.com) by Manny.
In my experience, CC makes it very very easy to _add_ things, resulting in much more code / features.
CC can obviously read/understand a codebase much faster than we do, but this also has a limit (how much context we can feed into it). I think your approach is in essence a bet that future models' ability to read/understand code (size of context) improves as fast as or faster than current models' ability to create new code.
I don't even use Claude and it has been rather clear to me, that their service has not been working properly for some time now.
To be fair [to myself], this is scale-dependent. I work on a product with hundreds of millions of users. We're not going to be reading and pondering every bit of feedback we get. We have automation for stripping out some of the noise (eg the number of crash reports we get from bit flips due to faulty RAM is quite significant at this scale). We have lines of defense set up to screen things down -- though if you file a well-researched and documented bug, we'll pay attention. (We won't necessarily do what you want, but we'll pay attention.)
When I worked at a much smaller and earlier stage company, we begged our users for feedback. We begged potential users for feedback. We implemented some things purely to try to get someone excited enough that they would be motivated to give feedback.
Anthropic, OpenAI, Google? They have a lot of users.
Also, this automation would be in addition to the other channels by which you'd pay attention to feedback.
Also also, the ship has sailed. We're all lab rats now. We're randomly chosen to be A/B tested on. We are upgraded early as part of a staged rollout. We're region-locked. Geocoded. Tracked as part of the cohort that has bought formula or diapers recently. Maybe we live in the worst of all possible worlds?
in response, most companies are explicitly trading velocity for quality, and finding out that quality is actually important at the end of the day. if you look at the roadmap it's just ship ship ship. eng is being told to 3x their output. quality in the llm coded world is tough and there's not much appetite for it right now.
Pretty embarrassing for an AI company. Surely AI should be doing their regression testing?
Wasn't AI supposed to solve all the drudgery? All those humans aided by cutting edge AI are still failing at these basic tasks? Then how good is that AI in the first place?
“After multiple weeks of internal testing and no regressions in the set of evaluations we ran, we felt confident about the change and shipped it alongside Opus 4.7 on April 16.
As part of this investigation, we ran more ablations (removing lines from the system prompt to understand the impact of each line) using a broader set of evaluations. One of these evaluations showed a 3% drop for both Opus 4.6 and 4.7. We immediately reverted the prompt as part of the April 20 release.”
Considering the number and scope of users they serve, I can sympathize with the difficulty. However, they should reimburse affected users at least partially instead of just announcing "our bad, sorry." That would reduce the frustration.
The first thing you read, and what takes up a big part of the post, is that it was something like: not really a bug, but a changed default that was not well communicated, and users (their fault) did not notice it. This is why they were "under the false impression" of a change.
Lots of people will stop reading after a few paragraphs.
One of them was a bug that didn't present itself until after an hour of usage.
(I write this as someone who likes Claude Code, if that matters.)
This makes no sense to me. I often leave sessions idle for hours or days and use the capability to pick it back up with full context and power.
The default thinking level seems more forgivable, but the churn in system prompts is something I'll need to figure out how to intentionally choose a refresh cycle.
Normally, when you have a conversation with Claude Code, if your convo has N messages, then (N-1) messages hit prompt cache -- everything but the latest message.
The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users. In an extreme case, if you had 900k tokens in your context window, then idled for an hour, then sent a message, that would be >900k tokens written to cache all at once, which would eat up a significant % of your rate limits, especially for Pro users.
We tried a few different approaches to improve this UX:
1. Educating users on X/social
2. Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)
3. Eliding parts of the context after idle: old tool results, old messages, thinking. Of these, thinking performed the best, and when we shipped it, that's when we unintentionally introduced the bug in the blog post.
Hope this is helpful. Happy to answer any questions if you have.
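The trade-off described above can be sketched as back-of-envelope math; the 10x cached-read discount here is an illustrative assumption, not Anthropic's actual pricing:

```python
def turn_cost(context_tokens: int, new_tokens: int, cache_warm: bool,
              cached_rate: float = 0.1) -> float:
    """Token 'cost units' for one conversation turn. Cached prefix reads
    are billed at a fraction of the uncached rate (cached_rate is an
    assumption for illustration)."""
    if cache_warm:
        # Only the new message is uncached; the rest hits the prefix cache.
        return context_tokens * cached_rate + new_tokens
    # Idle > 1h: full cache miss, so the entire context is re-processed
    # (and re-written to cache) at the full rate in one shot.
    return context_tokens + new_tokens

# The extreme case from the comment: a 900k-token context, resumed
# after the cache has expired, versus resumed while still warm.
warm = turn_cost(900_000, 1_000, cache_warm=True)
cold = turn_cost(900_000, 1_000, cache_warm=False)
```

Under these assumed rates the cold resume costs roughly 10x the warm one in a single turn, which is the sudden rate-limit hit being described.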
I feel like that is a choice best left up to users.
i.e. "Resuming this conversation with full context will consume X% of your 5-hour usage bucket, but that can be reduced by Y% by dropping old thinking logs"
Perhaps if we were willing to pay more for our subscriptions, Anthropic would be able to offer longer cache windows. But one hour seems like a reasonable amount of time given the constraints, and it's a limitation I'm happy to work around (it's not that hard) to pay just $100 or $200 a month for the industry-leading LLM.
Full disclosure: I've recently signed up for ChatGPT Pro as well in addition to my Claude Max sub so not really biased one way or the other. I just want a quality LLM that's affordable.
Kinda like when restaurants make me pay for ketchup or a takeaway box: I get annoyed, just fold it into the listed price.
I would not be surprised to see Anthropic, OpenAI etc head in the direction you mention as they mature and all of these datacenters currently undergoing construction come online in the next few years and drive down costs.
If they ignored this then all users who don’t do this much would have to subsidize the people who do.
I completely agree that it’s infeasible for them to cache for long periods of time, but they need to surface that information in the tools so that we can make informed decisions.
They have a limited number of resources and can’t keep everyone’s VM running forever.
The KV cache of your Claude context is:
- Potentially much larger than 25GB. (The KV cache sizes you see people quoting for local models are for smaller models.)
- While it's being used, it's all in RAM.
- Actually it's held in special high-performance GPU RAM, precision-bonded directly to the silicon of ludicrously expensive, state of the art GPUs.
- The KV state memory has to be many thousands of times faster than your 25GB state.
- It's much more expensive per GB than the CPU memory used by a VM. And that in turn is much more expensive than the SSD storage of your 25GB.
- Because Claude is used by far more people (and their agents) than rent VMs, far more people are competing to use that expensive memory at the same time
There is a lot going on to move KV cache state between GPU memory and dedicated, cheaper storage, on demand as different users need different state. But the KV cache data is so large, and used in its entirety when the context is active, that moving it around is expensive too.
And yes, this is also why computer RAM has jumped the shark in costs.
The bandwidth differences in total data transferred per hour aren't even in the same 5 orders of magnitude between your server and the workloads LLMs are doing. And this is why the compute and power markets are totally screwed.
Tbh, I'm not sure paged vram could solve this problem for an (assumed) huge cache miss system such as a major LLM server
note: I picked the values from a blog and they may be inaccurate, but in pretty much all models the KV cache is very large; it's probably even larger in Claude.
No. It’s not dumb. There will be multiple cache tiers in use, with the fastest and most expensive being on-GPU VRAM with cache-aware routing to specific GPUs and then progressive eviction to CPU ram and perhaps SSD after that. That is how vLLM works as you can see if you look it up, and you can find plenty of information on the multiple tiers approach from inference providers e.g. the new Inference Engineering book by Philip Kiely.
You are likely correct that the 1hr cached data probably mostly doesn’t live on GPU (although it will depend on capacity, they will keep it there as long as they can and then evict with an LRU policy). But I already said that in my last post.
A sibling comment explains:
https://news.ycombinator.com/item?id=47886200
The UI could indicate this by showing a timer before context is dumped.
No need to gamify it. It's just UI.
But perhaps Claude Code could detect that you're actively working on this stuff (like typing a prompt or accessing the files modified by the session), and send keep-cache-alive pings based on that? Presumably these pings could be pretty cheap, as the kv-cache wouldn't need to be loaded back into VRAM for this. If that would work reliably, cache expiry timeouts could be more aggressive (5 min instead of an hour).
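As a sketch of the scheduling half of that hypothetical keep-alive (as far as I know no such keep-cache-alive endpoint exists; this is purely illustrative):

```python
def keepalive_pings(last_activity: float, now: float,
                    interval: float, idle_cutoff: float) -> list[float]:
    """Times (seconds) at which a client would fire keep-alive pings:
    ping every `interval` seconds after the last user activity, but stop
    once the user has been idle longer than `idle_cutoff`, letting the
    server-side cache expire naturally after that."""
    pings, t = [], last_activity + interval
    while t <= now and t - last_activity <= idle_cutoff:
        pings.append(t)
        t += interval
    return pings

# User last typed at t=0; an hour later, with 5-minute pings and a
# 15-minute idle cutoff, only the first few pings would have fired.
schedule = keepalive_pings(0, 3600, 300, 900)
```

The point is that the client, not the server, knows whether the human is still around, so it can keep the cache warm cheaply and let it die quickly otherwise.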
Caching to RAM and disk is a thing but it’s hard to keep performance up with that and it’s early days of that tech being deployed anywhere.
Disclosure: work on AI at Microsoft. Above is just common industry info (see work happening in vLLM for example)
The larger point stands: the cache is expensive. It still saves you money but Anthropic must charge for it.
Edit: there are a lot of comments here where people don't understand LLM prefix caching, aka the KV cache. That's understandable: it is a complex topic and the usual intuitions about caching you might have from e.g. web development don't apply: a single cache blob for a single request is in the 10s of GB at least for a big model, and a lot of the key details turn on the problems of moving it in and out of GPU memory. The contents of the cache is internal model state; it's not your context or prompt or anything like that. Furthermore, this isn't some Anthropic-specific thing; all LLM inference with a stable context prefix will use it because it makes inference faster and cheaper. If you want to read up on this subject, be careful: a lot of blogs will tell you about the KV cache as it is used within inference for a single request (a critical concept in how LLMs work) but gloss over how the KV cache is persisted between requests, which is what we're all talking about here. I would recommend Philip Kiely's new book Inference Engineering for a detailed discussion of that stuff, including the multiple caching levels.
You didn't do your due diligence on an expensive API. A naïve implementation of an LLM chat is going to have O(N^2) costs from prompting with the entire context every time. Caching is needed to bring that down to O(N), but the cache itself takes resources, so evictions have to happen eventually.
You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.
[0]: https://huggingface.co/blog/not-lain/kv-caching
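The O(N^2)-versus-O(N) claim above is easy to verify with pure token counting (no real LLM involved; toy numbers):

```python
def total_prompt_tokens(message_lengths: list[int], cached: bool) -> int:
    """Total uncached prompt tokens processed over a whole conversation.

    Without a prefix cache, each turn re-processes the entire history,
    so the total grows quadratically in the number of turns. With a
    warm cache, each turn only processes its new tokens: linear.
    """
    total, context = 0, 0
    for n in message_lengths:
        total += n if cached else context + n
        context += n  # the history grows either way
    return total

msgs = [100] * 50  # 50 turns of 100 tokens each
uncached = total_prompt_tokens(msgs, cached=False)  # ~quadratic
cached = total_prompt_tokens(msgs, cached=True)     # linear
```

For these toy numbers the uncached total is 25x the cached one, and the gap widens with every additional turn, which is why evicting the cache after an idle hour produces such a visible one-time cost.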
1. Compute scaling with the length of the sequence is applicable to transformer models in general, i.e. every frontier LLM since ChatGPT's initial release.
2. As undocumented changes happen frequently, users should be even more incentivized to at least try to have a basic understanding of the product's cost structure.
> You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.
I think "internal technical implementation" is a stretch. Users don't need to know what a "transformer" is to understand the trade-off. It's not trivial but it's not something incomprehensible to laypersons.
I have no idea how that works with a LLM implementation nor do I actually know what they are caching in this context.
That's a bad estimate. Claude Code is explicitly a developer-shaped tool, we're not talking generic ChatGPT here, so my guess is probably closer to 75% of those users understanding what caching is, with maybe 30% being able to explain what prompt caching actually is. Of course, the users who don't understand have access to Claude and can have it explain caching to them if they're interested.
Does mmap(2) educate the developer on how disk I/O works?
At some point you have to know something about the technology you're using, or accept that you're a consumer of the ever-shifting general best practice, shifting with it as the best practice shifts.
If you were being charged per character, or running down character limits, and printing on printers that were shared and had economic costs for stalled and started print runs, then:
You wouldn’t “need” to understand. The prints would complete regardless. But you might want to. Personal preference.
Which is true of this issue too.
and the system was being run by some of the planet’s brightest people whose famous creation is well known to disseminate complex information succinctly,
>then:
You would expect to be led to understand, like… a 1997 Prius.
“This feature showed the vehicle operation regarding the interplay between gasoline engine, battery pack, and electric motors and could also show a bar-graph of fuel economy results.” https://en.wikipedia.org/wiki/Toyota_Prius_(XW10)
Seems like every month someone writes up a brilliant article on how to build an LLM from scratch or similar that hits the HN page, usually with fancy animated blocks and everything.
It's not at all hard to find documentation on this topic. It could be made more prominent in the U/I but that's true of lots of things, and hammering on "AI 101" topics would clutter the U/I for actual decision points the user may want to take action upon that you can't assume the user already knows about in the way you (should) be able to assume about how LLMs eat up tokens in the first place.
"Gets mad because when there are options, the defaults suck"
"Gets mad because the options start massively increasing costs to aerospace pricing"
There is no option to avoid auto-dumbing after one hour of idle. I haven't complained about the cost at all, I'm happy to pay it.
So yeah, I'm mad because there's no option. The other two you mentioned don't apply.
So while I'd agree with your sarcasm that expecting users to be experts of the system is a big ask, where I disagree with you is that users should be curious and actively attempting to understand how it works around them. Given that the tooling changes often, this is an endless job.
Have you ever talked with users?
> this is an endless job
Indeed. If we spend all our time learning what changed with all our tooling when it changes without proper documentation then we spend all our working lives keeping up instead of doing our actual jobs.
And then their vibe-coders tell us that we are to blame for using the product exactly as advertised: https://x.com/lydiahallie/status/2039800718371307603 while silently changing how the product works.
Please stop defending hapless innocent corporations.
I believe if one were to read my post it'd have been clear that I *am* a user.
This *is* "hacker" news after all. I think it's a safe assumption that people sitting here discussing CC are an inquisitive sort who want to understand what's under the hood of their tools and are likely to put in some extra time to figure it out.
It's not about "constructing a prompt" in the sense of building the prompt string. That of course wouldn't be costly.
It is about reusing llm inference state already in GPU memory (for the older part of the prompt that remains the same) instead of rerunning the prompt and rebuilding those attention tensors from scratch.
That is what caching is doing: the LLM inference state is being reused. (Attention vectors are an internal artifact at this level of abstraction; effectively, at this level of abstraction, it is the prompt.)
The part of the prompt that has already been processed no longer needs to be part of the input; it is replaced by the cached inference state. And none of this is tokens.
What about only storing the conversation and then recomputing the embeddings in the cache? Does that cost a lot? Doing a lot of matrix multiplication does not cost dollars of compute, especially on specialized hardware, right?
https://blog.exe.dev/expensively-quadratic
If I'm running a database keeping track of a conversation, and each time it writes the entire history of the conversation instead of appending a message, are we calling that O(N^2) now?
Also by the way, caching does not make LLM inference linear. It's still quadratic, but the constant in front of the quadratic term becomes a lot smaller.
Touché. Still, to a reasonable approximation, caching makes the dominant term linear, or equivalently, it linearly scales the expensive bits.
Try this out using a local LLM. You'll see that as the conversation grows, your prompts take longer to execute. It's not exponential but it's significant. This is in fact how all autoregressive LLMs work.
This is the operation that is basically done for each message in an LLM chat at the logical level: the complete context/history is sent in to be processed. If you wish to process only the additions, you must preserve the processed state on the server side (in the KV cache). KV caches can be very large, e.g. tens of gigabytes.
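As a toy illustration of that point, here is a minimal single-head attention in numpy showing that keeping the (K, V) state around lets the server process only the new tokens and get identical results, while losing it forces a full re-run; nothing here resembles a production serving stack:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # tiny embedding dimension for the toy model
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Causal attention for one query over all cached keys/values."""
    w = np.exp(q @ K.T / np.sqrt(d))
    return (w / w.sum()) @ V

def run(xs, cache=None):
    """Process new token embeddings xs, reusing a (K, V) cache if given.
    Returns outputs for the new tokens plus the updated cache."""
    K = cache[0] if cache else np.empty((0, d))
    V = cache[1] if cache else np.empty((0, d))
    outs = []
    for x in xs:
        K = np.vstack([K, x @ Wk])  # append this token's key/value:
        V = np.vstack([V, x @ Wv])  # this growing pair IS the KV cache
        outs.append(attend(x @ Wq, K, V))
    return np.array(outs), (K, V)

xs = rng.standard_normal((6, d))
full, _ = run(xs)             # process all 6 tokens from scratch
part, cache = run(xs[:4])     # ...or 4 tokens, keeping the cache,
rest, _ = run(xs[4:], cache)  # then only the 2 new tokens
# Same outputs either way -- but without the preserved cache, the last
# call would have to re-process all 6 tokens: the "full cache miss".
```

The incremental call touches only the new tokens; scale `d` to thousands and the cache to a million tokens and you get the tens-of-gigabytes blobs the thread is discussing.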
With this much cheaper setup backed by disks, they can offer much better caching experience:
> Cache construction takes seconds. Once the cache is no longer in use, it will be automatically cleared, usually within a few hours to a few days.
You already have the data on your own machine, and that 'upload and restore' process is exactly what is happening when you restart an idle session. The issue is that it takes time, and it counts as token usage because you have to send the data for the GPU to load, and that data is the 'tokens'.
The data is the conversation (along with the thinking tokens).
There is no download - you already have it.
The issue is that it gets expunged from the (very expensive, very limited) GPU cache and to reload the cache you have to reprocess the whole conversation.
That is doable, but as Boris notes it costs lots of tokens.
The kv-cache is the internal LLM state after having processed the tokens. It's big, and you do not have it locally.
Yes - generated from the data of the conversation.
Read what I said again. I'm explaining how they regenerate the cache by running the conversation though the LLM to reconstruct the KV cache state.
The cache is what makes your journey from 1k prompt to 1million token solution speedy in one 'vibe' session. Loading that again will cost the entire journey.
Sure, the exact choice on the trade-off, changing that choice, and having a pretty product-breaking bug as a result, are much more opaque. But I was responding to somebody who was surprised there's any trade-off at all. Computers don't give you infinite resources, whether or not they're "servers," "in the cloud," or "AI."
I'm really beginning to feel the lack of control when it comes to context, if I'm being honest.
So yeah, it doesn't take much to surface to the user that the speed/value of their session is ephemeral because to keep all that cache active is computationally expensive because ...
You're still just running text through an extremely complex process and adding to that text; to avoid re-calculating the entire chain, you need the cache.
I understand you wouldn't want this to be the default, particularly for people who have one giant running session for many topics - and I can only imagine the load involved in full cache misses at scale. But there are other use cases where this thinking is critical - for instance, a session for a large refactor or a devops/operations use case consolidating numerous issue reports and external findings over time, where the periodic thinking was actually critical to how the session evolved.
For example, if N-4 was a massive dump of some relevant, some irrelevant material (say, investigating for patterns in a massive set of data, but prompted to be concise in output), then N-4's thinking might have been critical to N-2 not getting over-fixated on that dump from N-4. I'd consider it mission-critical, and pay a premium, when resuming an N some hours later to avoid pitfalls just as N-2 avoided those pitfalls.
Could we have an "ultraresume" that, similar to ultrathink, would let a user indicate they want to watch Return of the (Thin)king: Extended Edition?
Especially in the analysis part of my work I don't care about the actual text output itself most of the time; I'm trying to make the model "understand" the topic.
In the first phase the actual text output itself is worthless; it just serves as an indicator that the context was processed correctly and that future analysis work can depend on it. And they're... just throwing most of the relevant stuff out without any notice when I resume my session after a few days?
This is insane. Claude literally became useless to me and I didn't even know it until now, wasting a lot of my time building up good session context.
There would be nothing lost if they said "If you click yes, we will prune your old thinking, making Claude faster and saving you tons of tokens". Most people would probably say yes, so why not ask them... make it an env variable (announced, not secretly introduced as an opt-out of something new!) or at least write it in a changelog if they really don't want to let people use it like before, so there'd be a chance to cancel the subscription in time instead of wasting tons of time on work patterns that no longer work.
1. Specifically, this suit was about price increases without clear consideration for both parties, but the same justifications apply to service restrictions without corresponding price decreases.
https://fortune.com/2026/04/20/italian-court-netflix-refunds...
> Our systems will smartly ignore any reasoning items that aren’t relevant to your functions, and only retain those in context that are relevant. You can pass reasoning items from previous responses either using the previous_response_id parameter, or by manually passing in all the output items from a past response into the input of a new one.
https://developers.openai.com/api/docs/guides/reasoning
Disclosure - work on AI@msft
The issue is that if they send the full trace back, it will have to be processed from the start if the cache expired, and doing that will cause a huge one-time hit against your token limit if the session has grown large.
So what Boris talked about is stripping things out of the trace that goes back to regenerate the session if the cache expires. Doing this would help avert burning up the token limit, but it is technically a different conversation, so if CC chooses poorly on stripping parts of the context then it would lead to Claude getting all scatter-brained.
Anthropic already profited from generating those tokens. They can afford to subsidize reloading context.
Reloading those tokens takes around the same effort as processing them in the first place.
It's ok to be ignorant of how the infrastructure for LLMs work, just don't be proud of it.
And it’s part of a larger problem of unannounced changes it‘s just like when they introduced adaptive thinking to 4.6 a few weeks ago without notice.
Also they seem to be completely unaware that some users might only use Claude code because they are used to it not stripping thinking in contrast to codex.
Anyway, I'm happy that they saw it as a valid refund reason.
The irony is that Claude Design does this. I did a big test building a design system, and when I came back to it, it had in the chat window "Do you need all this history for your next block of work? Save 120K tokens and start a new chat. Claude will still be able to use the design system." Or words to that effect.
The only issue is that it didn't hit the cache so it was expensive if you resume later.
Still on Opus 4.6 with no adaptive thinking, so didn't really notice anything worse in the past weeks, but who knows.
Granted, the "memory" can be available across session, as can docs...
The purpose of spending 10 to 50 prompts getting Claude to fill the context for you is it effectively "fine tunes" that session into a place your work product or questions are handled well.
(If this notion of sufficient context as fine-tuning seems surprising, the research is out there.)
Approaches tried need to deal with both of these:
1) Silent context degradation breaks the Pro-tool contract. I pay compute so I don't pay in my time; if you want to surface the cost, surface it (UI + price tag or choice), don't silently erode quality of outcomes.
2) The workaround (external context files re-primed on return) eats the exact same cache miss, so the "savings" are illusory — you just pushed the cost onto the user's time. If my own time's cheap enough that's the right trade off, I shouldn't be using your machine.
I wish Anthropic's leadership would understand that the dev community is such a vital community that they should appreciate it a bit more (i.e. it's not nice to send lawyers after various devs without asking nicely first, ban accounts without notice, etc. etc.). I appreciate it's not easy to scale.
OpenAI seems to be doing a much better job when it comes to developer relations, but I would like to see you guys 'win' since Anthropic shows more integrity and has clear ethical red lines they are not willing to cross unlike OpenAI's leadership.
Please fight this hubris. Your users matter. Many of us use your tools for everyday work and do not appreciate having the rug pulled from under them on a regular basis, much less so in an underhanded and undisclosed way.
I don't mind the bugs, these will happen. What I do not appreciate is secretly changing things that are likely to decrease performance.
You're acquiring users as a recurring revenue source. Consider stability and transparency of implementation details cost of doing business, or hemorrhage users as a result.
See also the difference between eg. MacOS (with large M, the older good versions) and waiting for "Year of linux on desktop".
I don't think the issue is making decisions for users, but trying to switch off the soup tap in the all-you-can-eat soup bar. Or, wrong business model setting wrong incentives to both sides.
I think the best option would be tell a user who is about to resurrect a conversation that has been evicted from cache that the session is not cached anymore and the user will have to face a full cost of replaying a session, not only the incremental question and answer.
(I understand that under the hood LLMs are O(n²) by default, but it's very counterintuitive; and given how popular CC is becoming outside of nerd circles, a smaller and smaller fraction of users is probably aware of it.)
I would like to decide on it case by case. Sometimes the session has some really deep insight I want to preserve, sometimes it's discardable.
Maybe the UI could do that for sessions that the user hasn't left yet, when the deadline comes near.
You ideally want to compact before the conversation is evicted from cache. If you knew you were going to use the conversation again later after cache expiry, you might do this deliberately before leaving a session.
Anthropic could do this automatically before cache expiry, though it would be hard to get right - they'd be wasting a lot of compute compacting conversations that were never going to be resumed anyway.
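A client-side version of the "compact before the cache goes cold" idea is easy to sketch: track the time of the last request and flag the session for compaction shortly before the TTL lapses. The one-hour TTL comes from this thread; the five-minute margin, the class name, and the gating are all illustrative assumptions, not Claude Code behavior.

```python
import time

CACHE_TTL_S = 60 * 60      # 1-hour prompt-cache window, per the thread
COMPACT_MARGIN_S = 5 * 60  # compact 5 minutes before expiry (illustrative)

class SessionWatchdog:
    """Decide when an idle session should be compacted, so the summary is
    produced while the prompt cache is still warm."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self.last_request = now()

    def touch(self):
        """Call on every request; resets the idle clock."""
        self.last_request = self._now()

    def should_compact(self):
        idle = self._now() - self.last_request
        # Only in the window just before expiry; after expiry it's too late.
        return CACHE_TTL_S - COMPACT_MARGIN_S <= idle < CACHE_TTL_S
```

As the parent comment notes, a provider doing this automatically would still waste compute on sessions that are never resumed, so a real implementation would likely also gate on session size.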
This feature has been live for a few days/weeks now, and with that knowledge I try to remember to at least get a progress report written when I'm, for example, close to the quota limit and the context is reasonably large. Or I continue with a /compact, but that tends to lead to having to repeat some things that didn't get included in the summary. Context management is just hard.
Question: Hey claude, if we have a conversation, and then i take a break. Does it change the expected output of my next answer, if there are 2 hours between the previous message end the next one?
Answer: No. A 2-hour gap doesn't change my output. I have no internal clock between messages — I only see the conversation content plus the currentDate context injected each turn. The prompt cache may expire (5 min TTL), which affects cost/latency but not the response itself.
-- This answer directly contradicts your post. It seems like the biggest problem is a total lack of documentation for expected behavior. A similar thing happens if I ask Claude Code for the difference between plan mode and accept-edits-on.
Then Claude told me the only difference was that in plan mode it would ask for permission before doing edits. But I really don't think this is true. Plan mode seems to do a lot more work, and presents it in a totally different way. It is not just an "I will ask before applying changes" mode.
It's a little concerning that it's number 1 in your list.
Two questions if you see this:
1) if this isn't best practice, what is the best way to preserve highly specific contexts?
2) does this issue just affect idle sessions or would the cache miss also apply to /resume ?
Edit: If you message me I can share some of my toolchain, it's probably similar to what a lot of other people here use but I've done some polishing recently.
So then it comes to what you're talking about, which is processing the entire text chain (a different kind of cache), and generating the equivalent tokens is what's being costed.
But once you realize the efficiency of the product in extended sessions is cached in the immediate GPU hardware, then it's obvious that the oversold product can't just idle the GPU when sessions idle.
Thank you.
I'm writing this message even though I don't have much to add because it's often the case on HN that criticism is vocal and appreciation is silent and I'd like to balance out the sentiment.
Anthropic has fumbled on many fronts lately but engaging honestly like this is the right thing to do. I trust you'll get back on track.
They spent two months literally gaslighting this "critical audience" that this could not be happening and literally blaming users for using their vibe-coded slop exactly as advertised.
All the while all the official channels refused to acknowledge any problems.
Now the dissatisfaction and subscription cancellations have reached a point where they finally had to do something.
https://x.com/bcherny/status/2044291036860874901 https://x.com/bcherny/status/2044299431294759355
No mention of anything like "hey, we just fixed two big issues, one that lasted over a month." Just casual replies to everybody like nothing is wrong and "oh there's an issue? just let us know we had no idea!"
My biggest learning here is the 1 hour cache window. I often have multiple Claudes open and it happens frequently that they're idle for 1+ hours.
This cache information should probably get displayed somewhere within Claude Code
Love the product. <3
Mislead, gaslight, misdirect is the name of the game
I have project folders/files and memory stored for each session, when I come back to my projects the context is drawn from the memory files and the status that were saved in my project md files.
Create a better workflow for yourself and your teams and do it the right way. Quit expecting the prompt to store everything for you.
For the Claude team: if you haven't already, I'd recommend you create some best practices for people who don't know any better. Otherwise people are going to expect things to work a certain way, and it's going to cause a lot of friction when they can't do what they expect to be able to do.
It’s hard to do it without killing performance and requires engineering in the DC to have fast access to SSDs etc.
Disclosure: work on ai@msft. Opinions my own.
Let's see what Boris Cherny himself and other Anthropic vibe-coders say about this:
https://x.com/bcherny/status/2044847849662505288
Opus 4.7 loves doing complex, long-running tasks like deep research, refactoring code, building complex features, iterating until it hits a performance benchmark.
https://x.com/bcherny/status/2007179858435281082
For very long-running tasks, I will either (a) prompt Claude to verify its work with a background agent when it's done... so Claude can cook without being blocked on me.
https://x.com/trq212/status/2033097354560393727
Opus 4.6 is incredibly reliable at long running tasks
https://x.com/trq212/status/2032518424375734646
The long context window means fewer compactions and longer-running sessions. I've found myself starting new sessions much less frequently with 1 million context.
https://x.com/trq212/status/2032245598754324968
I used to be a religious /clear user, but doing much less now, imo 4.6 is quite good across long context windows
---
I could go on
Yeah it's called lunch!
The above workflow basically doesn’t hit the rate limit. So I’d appreciate a way to turn off this feature.
Could you create one location educating advanced users, and:
• Promote, Organize and Maintain it
• Develop a group of users that have early access to "upcoming notifications we're working on"
• Perhaps give a third party specializing in making information visible responsibility for it
• Read comments by users in various places to determine what should be communicated. Just under this comment @dbeardsl begins "I appreciate the reply, but I was never under the impression that ...".
The speed that key users are informed of issues is critical. This is just off the top of my head, a much better plan I'm sure could be created.
Why not store the prompt cache to disk when it goes cold for a certain period of time, and then when a long-lived, cold conversation gets re-initiated, you can re-hydrate the cache from disk. Purge the cached prompts from disk after X days of inactivity, and tell users they cannot resume conversations over X days without burning budget.
So it would probably be quite a long transfer to perform in these cases; probably not very feasible to implement at scale.
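A back-of-envelope KV-cache sizing shows why the transfer is long. Every number below is an assumption chosen for illustration; Anthropic's real model dimensions are not public. For a dense transformer, the cache holds a key and a value vector per token, per layer, per KV head.

```python
# Rough KV-cache sizing under assumed (not disclosed) model dimensions.
LAYERS = 80
KV_HEADS = 8         # grouped-query attention assumption
HEAD_DIM = 128
BYTES_PER_VAL = 2    # fp16/bf16
TOKENS = 200_000     # a long Claude Code session

# 2x for the separate K and V vectors at each (layer, head) position.
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VAL
cache_bytes = bytes_per_token * TOKENS
print(f"{bytes_per_token} bytes/token, {cache_bytes / 2**30:.1f} GiB total")
```

Tens of GiB per idle session is plausible under these assumptions, which is why rehydrating from SSD over the datacenter network is slow, and why recomputing the prefix can end up being the cheaper option for the provider.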
The core idea is we can use user-activity at the client to manage KV cache loading and offloading. Happy to chat more!!
In addition, the following is less important, but as other commenters have stated: walking away from a conversation and coming back to it more than an hour later is very common and it would be nice if there were a way for the user to opt to retain maximum quality (e.g. no dropped thinking) in this case. In the longer term, it would be nice if there were a way for the user to wait a few minutes for a stale session to resume, in exchange for not having a large amount of quota drained (ie have a 'slow mode' invoked upon session resumption that consumes less quota).
Is there a more fundamental issue of trying to tie something with such nuanced costs to an interaction model which has decades of prior expectation of every message essentially being free?
As an informed user who understands his tools, I of course expect large uncached conversations to massively eat into my token budget, since that's how all of the big LLM providers work. I also understand these providers are businesses trying to make money and they aren't going to hold every conversation in their caches indefinitely.
whats driving the hour cache? shouldnt people be able to have lunch, then come back and continue?
are you expecting claude code users to not attend meetings?
I think product-wise you might need a better story on who uses claude-code, when and why.
Same thing with session logs, actually. I know folks who are definitely going to try to write a yearly R&D report and monthly timesheets based on text analysis of their Claude Code session files, and they're going to be incredibly unhappy when they find out it's all been silently deleted.
Why not use a tiered cache?
Obviously storage is waaay cheaper than recalculating the KV cache all the way from the very beginning of the session.
No matter how you put this explanation, it still sounds strange. Hell, you can even store the cache on the client if you must.
Please, tell me I'm not understanding what is going on... (otherwise you really need to hire someone to look at this!)
I still don't understand it, yes it's a lot of data and presumably they're already shunting it to cpu ram instead of keeping it on precious vram, but they could go further and put it on SSD at which point it's no longer in the hotpath for their inference.
But they used to return thinking output directly in the API, and that was _the_ reason I liked Claude over OpenAI's reasoning models.
What would be an interesting option would be to let the user pay more for longer caching, but if the base length is 1 hour I assume that would become expensive very quickly.
It's kept for long enough that it's expensive to store in RAM, but short enough that the writes are frequent and will wear down SSD storage
But even if it’s not — I can’t build a scenario in my head where recalculating it on real GPUs is cheaper/faster than retrieving it from some kind of slower cache tier
It also seems like the warning should be in channel and not on X. If I wanted to find out how broken things are on X, I'd be a Grok user.
1) Is it okay to leave Claude Code CLI open for days?
2) Should we be using /clear more generously? e.g., on every single branch change, on every new convo?
You guys really need to communicate that better in the CLI for people not on social
how about acknowledging that you fucked up your own customers’ money and making a full refund for the affected period?
> Educating users on X/social
that is beyond me
You're not Boris; you're Borka at best.
I feel like I'm missing something here. Why would I revisit an old conversation only to clear it?
To me it sounds like a prompt-cache miss for a big context absolutely needs to be a per-instance warning and confirmation. Or even better a live status indicating what sending a message will cost you in terms of input tokens.
I switched to Codex, Claude has gotten to a point where it's just unusable for the regular Joe.
I frequently debug issues while keeping my carefully curated but long context active for days. Losing potentially very important context while in the middle of a debugging session resulting in less optimal answers, is costing me a lot more money than the cache misses would.
In my eyes, Claude Code is mainly a context management tool. I build a foundation of apparent understanding of the problem domain, and then try to work towards a solution in a dialogue. Now you tell me Anthrophic has been silently breaking down that foundation without telling me, wasting potentially hours of my time.
It's a clear reminder that these closed-source harnesses cannot be trusted (now or in the future), and I should find proper alternatives for Claude Code as soon as possible.
[1] https://code.claude.com/docs/en/changelog
Why did you lie 11 days ago, 3 days after the fix went in, about the cause of excess token usage?
Intuitively I understand this due to how context windows work and you're looking to increase cache hits, has Anthropic tried compact/summarise on idle as a configurable option? Seems to have decent tradeoffs + education in a setting.
Probably that's why I hit my weekly limits 3-4 days ago, and was scheduled to reset later today. I just checked, and they are already reset.
Not sure if it's already done, but shouldn't there be a check somewhere that alerts when an outrageous number of tokens is getting written? If that's happening, something's not right.
"showClearContextOnPlanAccept": true,
It would be nice to be able to summarize/cut into a new leaner conversation vs having to coax all the context back into a fresh one. Something like keep the last 100,000 tokens.
I believe /compact achieves something like this? It just takes so long to summarize that it creates friction.
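The "keep the last N tokens" variant is simple to sketch client-side and, unlike /compact, needs no summarization pass. The token counter here is a hypothetical word count stand-in, not a real tokenizer, and the function name is illustrative.

```python
def trim_to_budget(messages, budget_tokens,
                   count=lambda m: len(m["content"].split())):
    """Keep the most recent messages whose combined (approximate) token
    count fits the budget; drop the oldest ones entirely."""
    kept, total = [], 0
    for msg in reversed(messages):          # walk newest-first
        cost = count(msg)
        if total + cost > budget_tokens:
            break                           # oldest survivors stop here
        kept.append(msg)
        total += cost
    return list(reversed(kept))             # restore chronological order

msgs = [
    {"role": "user", "content": "one two three four"},  # 4 "tokens"
    {"role": "assistant", "content": "five six"},       # 2
    {"role": "user", "content": "seven"},               # 1
]
trimmed = trim_to_budget(msgs, budget_tokens=3)
```

The friction the parent mentions is real, though: hard truncation is instant but loses early decisions outright, while summarization is slow but preserves some of them in compressed form.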
I don't agree with this being characterized as a "corner case".
Isn't this how most long running work will happen across all serious users?
I am not at my desk babysitting a single CC chat session all day. I have other things to attend to -- and that was the whole point of agentic engineering.
Don't CC users take lunch breaks?
How are all these utterly common scenarios being named as corner cases -- as something that is wildly out of the norm, and UX can be sacrificed for those cases?
No. You had random developers tweet and reply at random times to random users while all of your official channels were completely silent. Including channels for people who are not terminally online on X
There's a reason live service games have splash banners at every login. No matter what you pick as an official e-coms channel, most of your users aren't there!
* To be fair, of all these firms, ANTHROP\C tries the hardest to remember, and deliver like, some people aren't the same. Starting with normals doing normals' jobs.
You lost huge trust with the A/B sham test. You lost trust with the enshittification of the tokenizer from 4.6 to 4.7. Why not just say "hey, due to huge increases in input costs (energy, GPU demand) and compute constraints, we've had to increase Pro from $20 to $30"? You might lose 5% of customers. But the shady A/B thing and a dodgy tokenizer increasing burn rate tell everyone, including enterprise, that you don't care about honesty and integrity in your product.
I hope this feedback helps because you still stand to make an awesome product. Just show a little more professionalism.
I'm curious why 1 hour was chosen?
Is increasing it a significant expense?
Ever since I heard about this behaviour I've been trying to figure out how to handle long running Claude sessions and so far every approach I've tried is suboptimal
It takes time to create a good context which can then trigger a decent amount of work in my experience, so I've been wondering how much this is a carefully tuned choice that's unlikely to change vs something adjustable
Instead, why don't you and Anthropic be more open about changes to these tools rather than waiting for users to complain, then investigating things after the fact that you should have investigated in the first place, and then posting on social media about all the cool tech details?
My company is tens of thousands strong. The amount of churn in Claude Code is a major issue and causing real awareness of the lack of stability and lack of customer support Anthropic provides.
And Claude Code is actually becoming a prototypical example of the dangers of vibe coded products and the burdens they place.
or even, let the user control the cache expiry on a per request basis. with a /cache command
that way they decide if they want to drop the cache right away , or extend it for 20 hours etc
it would cost tokens even if the underlying resource is memory/SSD space, not compute
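Something close to a per-request cache knob already exists at the API level: Anthropic's prompt caching takes a `cache_control` marker per content block, and an extended-TTL option allows a longer lifetime than the default few minutes. Treat the `ttl` values and field shape below as assumptions to verify against the current API reference; the sketch only builds the request body and makes no network call.

```python
# Request-body construction only. The "ttl" field and its allowed values
# are assumptions based on Anthropic's published prompt-caching docs;
# verify against the current API reference before relying on them.

def with_cache_ttl(system_text, ttl="5m"):
    """Attach a cache_control marker with the requested TTL to the
    system prompt block. Model name is hypothetical."""
    return {
        "model": "claude-sonnet-example",
        "system": [
            {
                "type": "text",
                "text": system_text,
                "cache_control": {"type": "ephemeral", "ttl": ttl},
            }
        ],
    }

req = with_cache_ttl("Big project context...", ttl="1h")
```

A hypothetical /cache command in the harness would just rewrite this marker on the next request, letting the user trade storage cost for a warm resume.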
Either swallow the cost or be transparent to the user and offer both options each time.
/loop 5m say "ok".
Will that keep the cache fresh?
Silently degrading intelligence ought to be something you never do, but especially not for use-cases like this.
I’m looking back at my past few weeks of work and realizing that these few regressions literally wasted 10s of hours of my time, and hundreds of dollars in extra usage fees. I ran out of my entire weekly quota four days ago, and had to pause the personal project I was working on.
I was running the exact same pipeline I’ve run repeatedly before, on the same models, and yet this time I somehow ate a week’s worth of quota in less than 24h. I spent $400 just to finish the pipeline pass that got stuck halfway through.
I’m sorry to be harsh, but your engineering culture must change. There are some types of software you can yolo. This isn’t one of them. The downstream cost of stupid mistakes is way, way too high, and far too many entirely avoidable bugs — and poor design choices — are shipping to customers way too often.
Hard agree, would like to see a response to this.
how does this help me as a customer? if i have to redo the context from scratch, i will pay both the high token cost again, but also pay my own time to fill it.
the cost of reloading the window didnt go away, it just went up even more
I have to imagine this isn't helped by working somewhere where you effectively have infinite tokens and usage of the product that people are paying for, sometimes a lot.
Construction of context is not an LLM pass; it shouldn't even count towards token usage. The word 'caching' itself says: don't recompute me.
Since the devs on HN (& the whole world) is buying what looks like nonsense to me - what am I missing?
Input tokens are expensive, since the whole model has to be run for each token. They're cheaper than output tokens because the model doesn't need to run the sampler, so some pipeline parallelism is possible, but on the other hand without caching the input token cost would have to be paid anew for each output token.
Prompt caching fixes that O(N^2) cost, but the cache itself is very heavyweight. It needs one entry per input token per model layer, and each entry is an O(1000)-dimensional vector. That carries a huge memory cost (linear in context length), and when cached that means the context's memory space is no longer ephemeral.
That's why a 'cache write' can carry a cost; it is the cost of both processing the input and committing the backing store for the cache duration.
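To make the cost shape above concrete, here is a toy model in unitless "token-ops". It assumes nothing about real infrastructure; it only illustrates that without a cache every output step replays the whole prefix, while a warm cache makes each new token incremental and a cache miss pays a one-time replay of the prompt.

```python
# Toy cost model showing why prompt caching turns a quadratic bill into a
# linear one. Numbers and units are illustrative, not provider pricing.

def uncached_cost(prompt_len, output_len):
    # Every output step reprocesses the full prefix (prompt + tokens so far).
    return sum(prompt_len + i for i in range(output_len))

def cached_cost(prompt_len, output_len, cache_hit):
    prefill = 0 if cache_hit else prompt_len  # a miss replays the prompt once
    return prefill + output_len               # then each token is incremental

N, M = 100_000, 1_000       # long session prefix, modest reply
print(uncached_cost(N, M))              # ~1e8 token-ops
print(cached_cost(N, M, cache_hit=True))
print(cached_cost(N, M, cache_hit=False))
```

Even the cache-miss case is roughly a thousand times cheaper than recomputing per output token here, which is why a resumed-but-expired session shows up as a painful one-time hit rather than a permanently degraded rate.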
1. They actually believed latency reduction was worth compromising output quality for sessions that have already been long idle. Moreover, they thought doing so was better than showing a loading indicator or some other means of communicating to the user that context is being loaded.
2. What I suspect actually happened: they wanted to cost-reduce idle sessions to the bare minimum, and "latency" is a convenient-enough excuse to pass muster in a blog post explaining a resulting bug.
I'm sure most companies and customers will consider compromising quality for 80% cost reduction. If they just be honest they'll be fine.
Glad I use kiro-cli which doesn't do this.
That wouldn't change with employment.
If that was done, users could have been mindful of the change and figure out more easily that their problems could have come from that.
The deterioration was real and annoying, and shines a light on the problematic lack of transparency of what exactly is going on behind the scenes and the somewhat arbitrary token-cost based billing - too many factors at play, if you wanted to trace that as a user you can just do the work yourself instead.
The fact that waiting for a long time before resuming a convo incurs additional cost and lag seemed clear to me from having worked with LLM APIs directly, but it might be important to make this more obvious in the TUI.
Every one of these changes had the same goal: trading the intelligence users rely on for cheaper or faster outputs. Users adapt to how a model behaves, so sudden shifts without transparency are disorienting.
The timing also undercuts their narrative. The fixes landed right before another change with the same underlying intent rolled out. That looks more like they were just reacting to experiments rather than understanding the underlying user pain.
When people pay hundreds or thousands a month, they expect reliability and clear communication, ideally opt-in. Competitors are right there, and unreliability pushes users straight to them.
All of this points to their priorities not being aligned with their users’.
Framing this as "aligned" or "not aligned" ignores the interesting reality in the middle. It is banal to say an organization isn't perfectly aligned with its customers.
I'm not disagreeing with the commenter's frustration. But I think it can help to try something out: take say the top three companies whose product you interact with on a regular basis. Take stock of (1) how fast that technology is moving; (2) how often things break from your POV; (3) how soon the company acknowledges it; (4) how long it takes for a fix. Then ask "if a friend of yours (competent and hard working) was working there, would I give the company more credit?"
My overall feel is that people underestimate the complexity of the systems at Anthropic and the chaos of the growth.
These kind of conversations are a sort of window into people's expectations and their ability to envision the possible explanations of what is happening at Anthropic.
Making changes like reducing the usage window at peak times (https://x.com/trq212/status/2037254607001559305) without announcing it (until after the backlash) is the sort of thing that's making people lose trust in Anthropic. They completely ignored support tickets and GitHub issues about that for 3 days.
You shouldn't have to rely on finding an individual employee's posts on Reddit or X for policy announcements.
That policy hasn't even been put into their official documentation nearly one month on - https://support.claude.com/en/articles/11647753-how-do-usage...
A company with their resources could easily do better.
I agree with this as a principle. Which raises this question: is it true? Are you certain these messages don't show up in (a) Claude Code and (b) Claude on the Web?
I've seen these kinds of messages pop up. I haven't taken inventory of how often they do. As a guess, maybe I see notifications like this several times a month. If any important ones are missing, that is a mistake.
Anyhow, this is the kind of discussion that I want people to have. I appreciate the detail.
> A company with their resources could easily do better.
Yes, they could. But easily? I'm not so sure.
Also ask yourself: what function does saying e.g. "they could have done better" serve? What does it help accomplish? I'm asking. I think it often serves as a sort of self-reinforcing thing to say that doesn't really invite more thinking.
Ask yourself: If "doing better" was easy, why didn't it happen? Maybe it isn't quite as easy as you think? Maybe you've baked in a lot of assumptions. Easy for who? Easy why? Try the questions I asked, above. They are not rhetorical. Here they are again, rephrased a bit
There is a reason why I recommend asking these questions. Forcing yourself to write down your reference class is ... to me, table stakes, but well, lots of people just leave it floating and then ask other people to magically reconstruct it. Envisioning a friend working there shifts your viewpoint and can shake loose many common biases.
Broken expectations are highly dependent on perception. People get used to having some particular level. When that changes and they notice, a strong human default is to reach for something to blame. Then we rationalize. Those last two parts are unhelpful, and I push back on them frequently.
This is not a charitable interpretation of what I wrote. Please take a minute and rethink and rephrase. Here are two important guidelines, hopefully familiar to someone who has had an account since 2019:
> Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.
> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.
You are saying what they are doing is hard. That's fine. Their stated goals are to be the responsible stewards of the technology and we agree they are failing at that goal. You would attribute that to incompetence and not malice.
I've thought on it, and I will try to start off with something we both agree on... We both agree that Anthropic made some mistakes, but this is probably a pretty uninteresting and shallow agreement. I find it unlikely that we would enumerate or characterize the mistakes similarly. I find it unlikely that we would be anywhere near the same headspace about our bigger-picture takes.
> I didn't assume bad faith
Ok, I'm glad. That one didn't concern me; if I had a do-over I would remove that one from the list. Sorry about that. These are the ones that concern me:
When I read your earlier comment (~20 words), it didn't come across as a thoughtful and substantive response to my comment (~160 words). I know length isn't a perfect measure nor the only measure, but it does matter. Are you sure you didn't choose an easier-to-criticize interpretation? Did you take the time to try to state to yourself what I was trying to say? Back to Rapoport's Rules ... I'm grateful when people can express what I'm going for better than the way I wrote it or said it.
> I simply reworded your conclusions with less soft language
Technically speaking, lots of things could be called "rewording", but what you did was relatively far from "simply rewording". Charitably, it is closer to "your interpretation". But my intent was lost, so "rewording" doesn't fit.
> ... so that others would understand your position more clearly.
If you want to help others understand, then it is good to make sure you understand. For that, I recommend asking questions.
> Their stated goals are to be the responsible stewards of the technology and we agree they are failing at that goal.
No, I do not agree to that phrasing. It is likely I don't agree with your intention behind it either.
> You would attribute that to incompetence and not malice.
No; even if I agreed with the premise, I think it is more likely I would still disagree. I don't even like the framing of "either malice or incompetence". These ideas don't carve reality at the joints. [2] [3] There are a lot of stereotypes about "incompetence" but I don't think they really help us understand the world. These stereotypes are more like thought-terminators than interesting generative lenses.
I'll try to bring it back to the words "malice" and "incompetence" even though I think the latter is nigh-useless as a sense-making tool. Many mistakes happen without malice or incompetence; many mistakes "just happen" because people and organizations are not designed to be perfect. They are designed to be good enough. To not make any short-term mistakes would likely require too much energy or too much rigidity, both of which would be a worse category of mistake.
Try to think counterfactually: imagine a world where Anthropic is not malicious nor incompetent and yet mistakes still happened. What would this look like?
When you think of what Anthropic did wrong, what do you see as the lead up to it? Can you really envision the chain of events that brought it about? Imagine reading the email chain or the PRs. Can you see how there may be been various "off-ramps" where history might have gone differently? But for each of those diversions, how likely would it be that they match the universe we're in?
At some point figuring out what is a "mistake" even starts to feel strange. Does it require consciousness? Most people think so. But we say organizations make mistakes, but they aren't conscious -- or are they? Who do we blame? The CEO, because the buck stops there, right? He "should have known better". But why? Wait, but the Board is responsible...?
Is there any ethical foundation here? Some standard at all or is this all just anger dressed up as an argument? If this assigning blame thing starts to feel horribly complicated or even pointless, then maybe I've made my point. :)
If nothing else, when you read what I write, I want it to make you stop, get out a sheet of paper, and try to imagine something vividly. Your imagination I think will persuade you better than I can.
[1]: https://themindcollection.com/rapoports-rules/
[2]: https://jollycontrarian.com/index.php?title=Carving_nature_a...
[3]: https://english.stackexchange.com/questions/303819/what-do-t...
The near-instant transition from "there is no problem" to "we already fixed the problem so stop complaining" is basically gaslighting. (Admittedly the second sentiment comes more from the community, but they get that attitude after taking the "we fixed all the problems" posts at face value.)
But they come after the team gaslit everyone, telling us it was a skill issue.
That's the reason for the flak
It is certainly true that they did a poor job communicating this change to users (I did not know that the default was “high” before they introduced it, I assumed they had added an effort level both above and below whatever the only effort choice was there before). On the other hand, I was using Claude Code a fair bit on “medium” during that time period and it seemed to be performing just fine for me (and saving usage/time over “high”), so it doesn't seem clear that that was the wrong default, if only it had been explained better.
I would say it does, and I'd be loath to use anything made by people who'd couch that change to defaults as "providing a selectable option to use a faster, cheaper version".
Yuck.
Did I miss something? I'm only looking at primary sources to start. Not Reddit. Not The Register. Official company communications.
Did Anthropic tell users i.e. "you are wrong, your experience is not worse."? If so, that would reach the bar of gaslighting, as I understand it (and I'm not alone). If you have a different understanding, please share what it is so I understand what you mean.
That said, the copy uses "we never intentionally degrade our models" to mean something like "we never degrade one facet of our models unless it improves some other facet of our models". This is a cop out, because it is what users suspected and complained about. What users want - regardless of whether it is realistic to expect - is for Anthropic to buy even more compute than Anthropic already does, so that the models remain equally smart even if the service demand increases.
Some terms:... The model is the thing that runs inference. Claude Code is not a model; it is a harness. To summarize Anthropic's recent retrospective, their technical mistakes were about the harness.
I'm not here to 'defend' Anthropic's mistakes. They messed up technically. And their communication could have been better. But they didn't gaslight. And on balance, I don't see net evidence that they've "copped out" (by which I mean mischaracterized what happened). I see more evidence of the opposite. I could be wrong about any of this, but I'm here to talk about it in the clearest, best way I can. If anyone wants to point to primary sources, I'll read them.
I want more people to actually spend a few minutes and actually give the explanation offered by Anthropic a try. What if isolating the problems was hard to figure out? We all know hindsight is 20/20 and yet people still armchair quarterback.
At the risk of sounding preachy, I'm here to say "people, we need to do better". Hacker News is a special place, but we lose it a little bit every time we don't put in a quality effort.
No worries about 'sounding preachy'; it's a good thing people want to uphold the sobriety that makes HN special.
They knew they had deliberately made their system worse, despite their lame promise published today that they would never do such a thing. And so they incorrectly assumed that their ham fisted policy blunder was the only problem.
Still plenty I prefer about Claude over GPT but this really stings.
> They knew they had deliberately made their system worse
Define "they". The teams that made particular changes? In real-world organizations, not all relevant information flows to all the right places at the right time. Mistakes happen because these are complex systems.
Define "worse". There are a lot of factors involved. With a given amount of capacity at a given time, some aspect of "quality" has to give. So "quality" is a judgment call. It is easy to use a non-charitable definition to "gotcha" someone. (Some concepts are inherently indefensible. Sometimes you just can't win. "Quality" is one of those things. As soon as I define quality one way, you can attack me by defining it another way. A particular version of this principle is explained in The Alignment Problem by Brian Christian, by the way, regarding predictive policing iirc.)
I'm seeing a lot of moral outrage but not enough intellectual curiosity. It is embarrassingly easy to say "they should have done better" ... ok. Until someone demonstrates to me that they understand the complexity of a nearly-billion dollar company rapidly scaling with new technology, growing faster than most people comprehend, I think ... they are just complaining and cooking up reasons so they are right in feeling that way. This possible truth, that complex systems are hard to do well, apparently doesn't scratch that itch for many people. So they reach for blame. This is not the way to learn. Blaming tends to cut off curiosity.
I suggest this instead: redirect if you can to "what makes these things so complicated?" and go learn about that. You'll be happier, smarter, and ... most importantly ... be building a habit that will serve you well in life. Take it from an old guy who is late to the game on this. I've bailed on companies because "I thought I knew better". :/
Accidentally/deliberately making your CS teams ill-informed should not function as a get out of jail free card. Rather the reverse.
Thanks for your reply. I very much agree that intention or competence does not change responsibility and accountability. Both principles still apply.
In this comment, I'm mostly in philosopher and rationalist mode here. Except for the [0] footnote, I try to shy away from my personal take about Anthropic and the bigger stakes. See [0] for my take in brief. (And yes I know brief is ironic or awkward given the footnote is longer than most HN comments.) Here's my overall observation about the arc of the conversation: we're still dancing around the deeper issues. There is more work to do.
It helps to recognize the work metaphors are doing here. You chose the phrase "get out of jail free". Intentionally or not, this phrase smuggles in some notion of illegality or at least "deserving of punishment" [1]. The Anthropic mistakes have real-world impacts, including upset customers, but (as I see it) we're not in the realm of legal action nor in the realm of "just punishment", by which I mean the idea of retributive justice [2].
So, with this in mind, from a customer-decision point of view, the following are foundational:
But when adding to this foundation, I need to be careful: ...Personally, across the dozens of comments I've read here, a common theme I see is disappointment. I relatively rarely see constructive and truth-seeking retrospective work. On the other hand, I see Anthropic going out of their way to communicate their retrospective while admitting they need to do better. This is why I say this:
[0]: My personal big-picture take is that if anyone in the world, anywhere, builds a superintelligent AI using our current levels of understanding, there is no expectation at all that we can control it safely. So I predict with something close to 90% or higher, that civilization and humanity as we know it won't last another 10 years after the onset of superintelligence (ASI). This is the IABIED (the book "If Anyone Builds It, Everyone Dies" by Yudkowsky and Soares) argument -- plenty of people write about it -- though imo few of the book reviews I've seen substantively engage with the core arguments. Instead, most reviewers reject it for the usual reasons: it is a weird and uncomfortable argument, and the people making it seem wacky or self-interested to some people. I do respect reviewers who disagree based on model-driven thinking. Everything else to me reads like emotional coping rather than substantive engagement.
With this in mind, I care a lot about Anthropic's failures and what they imply about how it participates in the evolving situation.
But I care almost zero about conventional notions of blame. Taking materialism as true, free will is at bottom a helpful fiction for people. For most people, it is the reality we take for granted. The problem is that blame is often just an excuse for scapegoating people for their mistakes, when in fact those mistakes just flow downstream from the laws of physics. Many of these mistakes are nearly statistical certainties when viewed through the lens of system dynamics or sociology or psychology or neuroscience or having bad role models or being born into a not-great situation.
To put it charitably, blame is what people do when they want to pin s--tty consequences on the actions of people and systems. That sense bothers me less; I'm trying to shift thinking away from the kind of blaming that leads to bad predictions.
[1]: From the Urban Dictionary (I'm not citing this as "proof of credibility" of the definition):
... I'm only citing UD so you know what I mean. When I use the word dictionary, I mean a catalog of usage, not a prescription of correctness.
[2]: https://plato.stanford.edu/entries/justice-retributive/
If you can point me to an official communication from Anthropic where they say "User <so and so> is not actually seeing degraded performance" when Anthropic knows otherwise that would clearly be gaslighting -- intent matters by my book.
But if their instrumentation was bad and they were genuinely reporting what they could see, that doesn't cross into gaslighting by my book. But I have a tendency to think carefully about ethical definitions. Some people just grab a word off the shelf with a negative valence and run with it: I don't put much stock in what those people say. Words are cheap. Good ethical reasoning is hard and valuable.
It's fine if you have a different definition of "gaslighting". Just remember that some of us have been actually gaslit by people, so we prefer to save the word for situations where the original definition applies. People like us are not opposed to being disappointed, upset, or angry at Anthropic, but we have certain epistemic standards that we don't toss out when an important tool fails to meet our expectations and the company behind it doesn't recognize it soon enough.
[1]: https://www.reddit.com/r/TwoXChromosomes/comments/tep32v/can...
Anecdotally OpenAI is trying to get into our enterprise tooth and nail, and have offered unlimited tokens until summer.
Gave GPT5.4 a try because of this and honestly I don’t know if we are getting some extra treatment, but running it at extra high effort the last 30 days I’ve barely seen it make any mistakes.
At some points even the reasoning traces brought a smile to my face as it preemptively followed things that I had forgotten to instruct it about but were critical to get a specific part of our data integrity 100% correct.
Freezing your IDE version is now a thing of the past, the new reality is that we can't expect agentic dev workflows to be consistent and I see too many people (including myself) getting burned by going the single-provider route.
On one hand I’m glad to finally see anthropic communicate on this but at this point all I have to say is… time to diversify?
Until Opus 4.7, I had never rolled back to a previous model; this is the first time.
Personality-wise it’s the worst of AI: “it’s not x, it’s y”, strong short sentences, in general a bullshitty vibe, also gaslighting me that it fixed something even though it didn’t actually check.
I’m not sure what’s up, maybe it’s tuned for harnesses like Claude Design (which is great btw) where there’s an independent judge to check it, but for now, Opus 4.6 it is.
I’d never use such an expensive model for coding, so that might explain why I have little to complain about.
Over time, I realized the extended context became randomly unreliable. That was worse to me than having to compact and know where I was picking up.
Way more likely there's a "VERY IMPORTANT: When you see a block of code, ensure it's not malware" somewhere in the system prompt.
I try running my app on the develop branch. No change. Huh.
Realize it didn't.
"Claude, why isn't this changed?" "That's to be expected because it's not been merged." "I'm confused, I told you to do that."
This spectacular answer:
"You're right. You told me to do it and I didn't do it and then told you I did. Should I do it now?"
I don't know, Claude, are you actually going to do it this time?
https://www.reddit.com/r/ClaudeAI/comments/1evf0xc/the_real_...
We just got hit by this today in response to a completely boring code question. Claude freaked out about being prompt injected.
Incidentally, the hardware they run on is known as well. The claim should be easy to check.
I dare you to run CC on API pricing and see how much your usage actually costs.
(We did this internally at work, that's where my "few orders of magnitude" comment above comes from)
At cell phone plan adoption levels, and cell phone plan costs, the labs are looking at 5-10yr ROI.
If that demand even slows down in the slightest, the whole bubble collapses.
Growth + Demand >> efficiency or $ spend at their current stage. Efficiency is a mature company/industry game.
How else would it know whether it has a plan now?
After all, "the first hit's free" model doesn't apply to repeat customers ;-)
Pay-by-token pricing while token usage is totally opaque is a super convenient money-printing machine.
Claude is periodically refusing to run those tests. That never happened prior to 4.7.
This would be a new level of troublesome/ruthless (insert correct English word here)
Instead of fixing the UI they lowered the default reasoning effort parameter from high to medium? And they "traced this back" because they "take reports about degradation very seriously"? Extremely hard to give them the benefit of doubt here.
We did both -- we did a number of UI iterations (eg. improving thinking loading states, making it more clear how many tokens are being downloaded, etc.). But we also reduced the default effort level after evals and dogfooding. The latter was not the right decision, so we rolled it back after finding that UX iterations were insufficient (people didn't understand to use /effort to increase intelligence, and often stuck with the default -- we should have anticipated this).
For instance:
Is Haiku supposed to hit a warm system-prompt cache in a default Claude code setup?
I had `DISABLE_TELEMETRY=1` in my env and found the haiku requests would not hit a warm-cached system prompt. E.g. on first request just now w/ most recent version (v2.1.118, but happened on others):
w/ telemetry off - input_tokens:10 cache_read:0 cache_write:28897 out:249
w/ telemetry on - input_tokens:10 cache_read:24344 cache_write:7237 out:243
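For a rough sense of why the cache miss matters, here's the cost math on the two requests above. The per-million-token prices are illustrative placeholders, not Anthropic's actual rate card (which you'd want to check), but the relative shape holds as long as cache writes cost more than cache reads:

```python
# Illustrative $/MTok prices (assumed, not actual published rates)
PRICE_IN = 3.00          # regular input tokens
PRICE_CACHE_READ = 0.30  # cache read, assumed ~10% of input price
PRICE_CACHE_WRITE = 3.75 # cache write, assumed ~125% of input price
PRICE_OUT = 15.00        # output tokens

def request_cost(inp, cache_read, cache_write, out):
    """Dollar cost of one request given token counts in each bucket."""
    return (inp * PRICE_IN + cache_read * PRICE_CACHE_READ
            + cache_write * PRICE_CACHE_WRITE + out * PRICE_OUT) / 1_000_000

cold = request_cost(10, 0, 28897, 249)     # telemetry off: full cache write
warm = request_cost(10, 24344, 7237, 243)  # telemetry on: mostly cache reads

print(f"cold: ${cold:.4f}  warm: ${warm:.4f}  ratio: {cold / warm:.1f}x")
```

Under these assumed prices the cold request costs roughly 3x the warm one, per request, which compounds quickly over a session.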
I used to think having so many users was leading to people hitting a lot of edge cases, 3 million users is 3 million different problems. Everyone can't be on the happy path. But then I started hitting weird edge cases and started thinking the permutations might not be under control.
People literally started seeing issues immediately as you changed the defaults: https://x.com/levelsio/status/2029307862493618290 And despite a huge amount of reports you still kept it for a whole month.
And then you shipped a completely untested feature with prompt cache misses and literally gaslit users and blamed users for using the product as advertised.
Oh. Remember this https://x.com/bcherny/status/2024152178273989085? "We move fast but test carefully"?
Now an untold number of people have been hit by these changes, so as an apology you reset usage limits three hours before they would reset anyway.
Good job.
Edit. By the way, a very telling sentence from the report:
--- start quote ---
We’ll ensure that a larger share of internal staff use the exact public build of Claude Code (as opposed to the version we use to test new features); and we'll make improvements to our Code Review tool that we use internally
--- end quote ---
Translation: no one is using or even testing the product we ship, and we blindly trust Claude Code to review and find bugs for us. Last one isn't even a translation: https://x.com/bcherny/status/2017742750473720121
UI is UI. It is naive to build some UI and expect that users will "just magically" discover they should use it, in a terminal of all places.
https://github.com/anthropics/claude-code/issues/36286 https://github.com/anthropics/claude-code/issues/25018
Anthropic: removes thinking output
Users: see long pauses, complain
Anthropic: better reduce thinking time
Users: wtf
To me it really, really seems like Anthropic is trying to undo the transparency they always had around reasoning chains, and a lot of issues are due to that.
Removing thinking blocks from the convo after 1 hour of being inactive without any notice is just the icing on the cake, whoever thought that was a good idea? How about making “the cache is hot” vs “the cache is cold” a clear visual indicator instead, so you slowly shape user behavior, rather than doing these types of drastic things.
They had droves of Claude devs vehemently defending and gaslighting users when this started happening
A couple weeks ago, I wanted Claude to write a low-stakes personal productivity app for me. I wrote an essay describing how I wanted it to behave and I told Claude pretty much, "Write an implementation plan for this." The first iteration was _beautiful_ and was everything I had hoped for, except for a part that went in a different direction than I was intending because I was too ambiguous in how to go about it.
I corrected that ambiguity in my essay but instead of having Claude fix the existing implementation plan, I redid it from scratch in a new chat because I wanted to see if it would write more or less the same thing as before. It did not--in fact, the output was FAR worse even though I didn't change any model settings. The next two burned down, fell over, and then sank into the swamp but the fourth one was (finally) very much on par with the first.
I'm taking from this that it's often okay (and probably good) to simply have Claude re-do tasks to get a higher-quality output. Of course, if you're paying for your own tokens, that might get expensive in a hurry...
And once you get unlucky you can’t unsee it.
I vibed a low stakes budgeting app before realising what I actually needed was Actual Budget and to change a little bit how I budget my money.
I think you must have learned that they’re more nondeterministic than you had thought, but then wrongly connected your new understanding to the recent model degradation. Note: they’ve been nondeterministic the whole time, while the widely-reported degradation is recent.
Combine these things in the strongest interpretation instead of an easy-to-attack one, and it's very reasonable to posit that a critical mass has been reached: enough people report issues that others try their own investigations, while the negative outliers get the most online attention.
I'm not convinced this is the story (or, at least the biggest part of it) myself but I'm not ready to declare it illogical either.
When everyone's talking about the real degradation, you'll also get everyone who experiences "random"[1] degradation thinking they're experiencing the same thing, and chiming in as well.
[1] I also don't think we're talking the more technical type of nondeterminism here, temperature etc, but the nondeterminism where I can't really determine when I have a good context and when I don't, and in some cases can't tell why an LLM is capable of one thing but not another. And so when I switch tasks that I think are equally easy and it fails on the new one, or when my context has some meaningless-to-me (random-to-me) variation that causes it to fail instead of succeed, I can't determine the cause. And so I bucket myself with the crowd that's experiencing real degradation and chime in.
I don't know about others, but sessions that are idle > 1h are definitely not a corner case for me. I use Claude code for personal work and most of the time, I'm making it do a task which could say take ~10 to 15mins. Note that I spend a lot of time back and forth with the model planning this task first before I ask it to execute it. Once the execution starts, I usually step away for a coffee break (or) switch to Codex to work on some other project - follow similar planning and execution with it. There are very high chances that it takes me > 1h to come back to Claude.
Damage is done for me though. Even just one of these things (messing with adaptive thinking) is enough for me to not trust them anymore. And then their A/B testing this week on pricing.
I got finally fired.
In my life experience, looking back, when I've found myself swinging from "high trust" to "low trust" the change was usually rooted in my expectations; it was usually rooted in me having a naive understanding of the world that was rudely shattered.
Will you force trust to be a bit? Or can you admit a probability distribution? Bits (true/false or yes/no or trust/don't trust) thrash wildly. Bayesians update incrementally: this is (a) more pleasant; (b) more correct; (c) more curious; (d) easier to compare notes with others.
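As a toy illustration of the incremental-update idea (all the numbers here are invented for the example):

```python
def bayes_update(prior, p_obs_given_h, p_obs_given_not_h):
    """One Bayes-rule update of P(hypothesis) after an observation."""
    num = prior * p_obs_given_h
    return num / (num + (1 - prior) * p_obs_given_not_h)

# Hypothesis: "the service is degraded". Assumed likelihoods: a bad
# session happens 60% of the time if degraded, 30% of the time if not.
p_degraded = 0.10          # start fairly trusting (assumed prior)
for _ in range(3):         # three bad sessions in a row
    p_degraded = bayes_update(p_degraded, 0.60, 0.30)
print(f"P(degraded): {p_degraded:.2f}")
```

Each bad session doubles the odds here (likelihood ratio of 2), so three of them move you from 10% to about 47% -- a real shift, but nothing like flipping a trust bit from 1 to 0.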
I use "subconsciously" in quotes because I don't remember exactly why I did it, but it aligns with the degradation of their service so it feels like that probably has something to do with it even though I didn't realize it at the time.
I'm using Zed and Claude Code as my harnesses.
However you feel about OpenAI, at least their harness is actually open source and they don’t send lawyers after oss projects like opencode
These bugs have all of the same symptoms: undocumented model regressions at the application layer, and engineering cost optimizations that resulted in real performance regressions.
I have some follow up questions to this update:
- Why didn't September's "Quality evaluations in more places" catch the prompt change regression, or the cache-invalidation bug?
- How is Anthropic using these satisfaction questions? My own analysis of my own Claude logs was showed strong material declines in satisfaction here, and I always answer those surveys honestly. Can you share what the data looked like and if you were using that to identify some of these issues?
- There was no refund or comped tokens in September. Will there be some sort of comp to affected users?
- How should subscribers of Claude Code trust that Anthropic side engineering changes that hit our usage limits are being suitably addressed? To be clear, I am not trying to attribute malice or guilt here, I am asking how Anthropic can try and boost trust here. When we look at something like the cache-invalidation there's an engineer inside of Anthropic who says "if we do this we save $X a week", and virtually every manager is going to take that vs a soft-change in a sentiment metric.
- Lastly, when Anthropic changes Claude Code's prompt, how much performance against the stated Claude benchmarks are we losing? I actually think this is an important question to ask, because users subscribe to the model's published benchmark performance and are sold a different product through Claude Code (as other harnesses are not allowed).
[1] https://www.anthropic.com/engineering/a-postmortem-of-three-...
https://youtu.be/KFisvc-AMII?is=NskPZ21BAe6eyGTh
Still use CC at work because team standards, but I'd take my OpenCode stack over it any day.
Care to share what you changed, maybe even the code?
1) Curated a set of models I like and heavily optimized all possible settings, per agent role and even per skill (had to really replumb a lot of stuff to get it as granular as I liked)
2) Ported from sqlite to postgresql, with heavily extended schema. I generate embeddings for everything, so every aspect of my stack is a knowledge graph that can be vector searched. Integrated with a memory MCP server and auditing tools so I can trace anything that happens in the stack/cluster back to an agent action and even thinking that was related to the action. It really helps refine stuff.
3) Tight integration of Gitea server, k3s with RBAC (agents get their own permissions in the cluster), every user workspace is a pod running opencode web UI behind Gitea oauth2.
4) Codified structure of `/projects/<monorepo>/<subrepos>` with a simpler browser so non-technical family members can manage their work more easily (agents handle all the management and there are sidecars handling all gitops, transparent to the user)
5) Transparent failover across providers with cooldown by making model definitions linked lists in the config, so I can use a handful of subscriptions that offer my favorite models, and fail over from one to the next as I hit quota/rate limits. This has really cut my bill down lately, along with skipping OpenRouter for my favorite models and going direct to Alibaba and Xiaomi so I can tailor caching and stuff exactly how I want.
6) Integrated filebrowser, a fork of the Milkdown Crepe markdown editor, and codemirror editor so I don't even need an IDE anymore. I just work entirely from OpenCode web UI on whatever device is nearest at the moment. Yesterday I added support for using Gemma 4 local on CPU from my phone while waiting in line at a store.
Those are the big ones off the top of my head. I'm sure there's more. I've probably made a few hundred other changes; it just evolves as I go.
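The failover-with-cooldown idea in (5) can be sketched roughly like this. This is a guess at the shape, not the commenter's actual code; provider names and the cooldown window are made up:

```python
import time

class ModelRoute:
    """One node in a failover chain: a provider/model plus a cooldown window."""
    def __init__(self, name, cooldown_s=300, nxt=None):
        self.name, self.cooldown_s, self.nxt = name, cooldown_s, nxt
        self.blocked_until = 0.0  # timestamp until which this route is skipped

def pick(route):
    """Walk the linked list and return the first route not cooling down."""
    while route is not None:
        if time.time() >= route.blocked_until:
            return route
        route = route.nxt
    raise RuntimeError("all providers rate-limited")

def report_rate_limit(route):
    """Mark a route exhausted so pick() skips it for cooldown_s seconds."""
    route.blocked_until = time.time() + route.cooldown_s

# Hypothetical chain: subscription A -> subscription B -> direct API fallback
chain = ModelRoute("provider-a",
                   nxt=ModelRoute("provider-b", nxt=ModelRoute("direct-api")))
first = pick(chain)          # provider-a
report_rate_limit(first)     # hit a quota on it
second = pick(chain)         # falls over to provider-b
```

The linked-list shape makes the config read naturally as "try this, then this, then this", with per-route cooldowns instead of a global backoff.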
There are a few out there (latest example is Zed's new multi-agent UI), but they still rely on the underlying agent's skill and plugin system. I'm experimenting with my own approach that integrates a plugin system that can dynamically change the agent skillset & prompts supplied via an integrated MCP server, allowing you to define skills and workflows that work regardless of the underlying agent harness.
It's very clear that there's money or influence changing hands behind the scenes between certain content creators, The Information, and OpenAI.
PS I’m not referencing a well-known book to suggest the solution is trite product groupthink, but good product thinking is a talent separate from good engineering, and Anthropic seems short on the latter recently
But worse, based on the pronouncements of Dario et al, I suspect management is entirely unsympathetic because they believe we (SWEs) are on the chopping block to be replaced. And any intimation that these tools need guard rails for quality concerns is, I suspect, being ignored or discouraged.
In the end, I feel like Claude Code itself started as a bit of a science experiment and it doesn't smell to me like it's adopted mature best practices coming out of that.
That said, that may not have been obvious at all in the Jan/Feb time frame when they got a wave of customers due to ethical concerns.
In practice I understand this would be difficult but I feel like the system prompt should be versioned alongside the model. Changing the system prompt out from underneath users when you've published benchmarks using an older system prompt feels deceptive.
At least tell users when the system prompt has changed.
Also I don’t know how “improving our Code Review tool” is going to improve things going forward, two of the major issues were intentional choices. No code review is going to tell them to stop making poor and compromising decisions.
Of course, all their vibe coding is being done with effectively infinite tokens, so...
But in either case, if compute is so limited, they’ll have to compete with local coding agents. Qwen3.6-27B is good enough to beat having to wait until 5PM for your Claude Code limit to reset.
are you asserting that the actual dollar cost to anthropic for a heavy user was 5-10k? or are you basing this on the (fabricated) value of those tokens, ie potentially lost revenue from a pay-per-token user.
Agents are not deterministic; they are probabilistic. Run the same agent on the same task and it will accomplish it a consistent percentage of the time. I wish I were better at math or English so I could explain this.
I think they call it EVAL but developers don't discuss that too much. All they discuss is how frustrated they are.
A prompt can solve a problem 80% of the time. Change a sentence and it will solve the same problem 90% of time. Remove a sentence it will solve the problem 70% of the time.
It is so friggen' easy to set up -- stealing the word from AI sphere -- a TEST HARNESS.
Regressions caused by changes to the agent, where words are added, changed, or removed, are extremely easy to quantify. It isn’t pass/fail. It’s whether the agent still solves the problem at the same percentage of the time it consistently has.
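A minimal sketch of that kind of pass-rate regression check. Assumptions: `run_once` wraps one agent attempt and returns True on success, and the two-standard-error threshold is just one plausible alarm rule (a normal approximation to the binomial), not the only way to do it:

```python
import math

def solve_rate(run_once, n=50):
    """Run the agent on the task n times; return the fraction of successes."""
    return sum(run_once() for _ in range(n)) / n

def regressed(baseline_rate, new_rate, n, z=2.0):
    """Flag a regression if the new rate falls more than z standard
    errors below the baseline rate, given n trials."""
    se = math.sqrt(baseline_rate * (1 - baseline_rate) / n)
    return new_rate < baseline_rate - z * se

# Toy check with fixed numbers: an 80% baseline over 50 runs
print(regressed(0.80, 0.60, 50))   # a 20-point drop trips the alarm
print(regressed(0.80, 0.78, 50))   # a 2-point wobble does not
```

The point is exactly the one made above: the metric is a rate over repeated runs, not a single pass/fail, so prompt changes can be gated on whether the rate moved beyond noise.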
Thank you for the perfect explanation.
Last week, confused because Anthropic was using "test", "eval", and "harness" in the same sentence, I thought Anthropic had made a test harness, so I asked Google "in computer science, what is a harness?". It responded only discussing test harnesses, which solidified my thinking that that's what it was.
I wish Google had responded as clearly as you did. In my defense, we don't know if we understand something unless we discuss it.
The first tries to answer what happens when I give the models harder and harder arithmetic problems to the point Sonnet will burn 200k tokens for 20minutes. [0]
The other is a very deep dive into the math of a reasoning model in the only way I could think to approach it, with data visualizations, seeing the computation of the model in real time in relation to all the parts.[1]
Two things I've learned. First, an agent that reverse engineers websites and an agent that does arithmetic behave the same way: the probability that either will solve its intended task is a distribution over the given agent and task. Second, models have a blind spot: a red-team adversary bug-hunter agent will not surface a bug if the same model originally wrote the code.
Understanding that, knowing that I can verify at the end or use majority of votes (MoV), using the agents to automate extremely complicated tasks can be very reliable with an amount of certainty.
[0] https://adamsohn.com/reliably-incorrect/
[1] https://adamsohn.com/grpo/
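The majority-of-votes (MoV) verification mentioned above can be sketched as below. The answer strings are invented for the example:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common answer across several independent agent runs.
    The agreement fraction doubles as a rough confidence signal."""
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)

# Toy example: five runs of the same arithmetic task (outputs invented)
ans, agreement = majority_vote(["8128", "8128", "8127", "8128", "8128"])
print(ans, agreement)
```

If each run is right, say, 80% of the time and the errors are uncorrelated, the majority of five runs is right far more often than any single run; the blind-spot caveat above is exactly the case where the errors are correlated and MoV stops helping.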
> The other, is that models have a blind spot, therefore creating a red team adversary bug hunter agent will not surface a bug if the same model originally wrote the code.
This is very interesting, if true. It follows that one can generate several instances of the code, choose one with the bug, and the bug will not be found. Mythos can be used to fool Mythos.
I asked for this via support, got a horrible corporate reply thread, and eventually downgraded my account. I'm using Codex now as we speak. I could not use Claude any more, I couldn't get anything done.
Will they restore my account usage limits? Since I no longer have Max?
Is that one week usage restored, or the entire buggy timespan?
vim ~/.claude/settings.json
{
  "model": "claude-opus-4-6",
  "fastMode": false,
  "effortLevel": "high",
  "alwaysThinkingEnabled": true,
  "autoCompactWindow": 700000
}
I’ll stay on 4.6 for a while. It seems to be better. What’s frustrating, though, is that you cannot rely on these tools. They are constantly tinkering with and changing things, and there’s no option to opt out.
I mean, yes, even testing in production with some of your customers is better than... testing with ALL of your customers?
- Claude Code is _vastly_ more wasteful of tokens than anything else I've used. The harness is just plain bad. I use pi.dev and created https://github.com/rcarmo/piclaw, and the gaps are huge -- even the models through Copilot are incredibly context-greedy when compared to GPT/Codex
- 4.7 can be stupidly bad. I went back to 4.6 (which has always been risky to use for anything reliable, but does decent specs and creative code exploration) and Codex/GPT for almost everything.
So there is really no reason these days to pay either their subscription or their insanely high per/token price _and_ get bloat across the board.
Those who work on agent harnesses for a living realize how sensitive models can be to even minor changes in the prompt.
I would not suspect quantization before I would suspect harness changes.
Claude caveman in the system prompt confirmed?
Recently that immaculately polished feel is harder to find. It coincides with the daily releases of CC, Desktop App, unknown/undocumented changes to the various harnesses used in CC/Cowork. I find it an unwelcome shift.
I still think they're the best option on the market, but the delta isn't as high as it was. Sometimes slowing down is the way to move faster.
For there to be any trust in the above, the tool needs to behave predictably day to day. It shouldn't be possible to open your laptop and find that Claude suddenly has an IQ 50 points lower than yesterday. I'm not sure how you can achieve predictability while keeping inference costs in check and messing with quantization, prompts, etc on the backend.
Maybe a better approach might be to version both the models and the system prompts, but frequently adjust the pricing of a given combination based on token efficiency, to encourage users to switch to cheaper modes on their own. Let users choose how much they pay for given quality of output though.
Frontier LLMs still suck a lot, you can't afford planned degradation yet.
Right now my solution is to run CC in tmux and keep a 2nd CC pane with /loop watching the first pane and killing CC if it detects plan mode being bypassed. Burning tokens to work around a bug.
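A rough sketch of that watchdog, polling the first pane via `tmux capture-pane`; the bypass pattern, pane target, and kill behavior are all assumptions about one possible setup, not the parent's actual /loop config:

```python
import re
import subprocess
import time

# Hypothetical marker text that would indicate plan mode was bypassed.
BYPASS_PATTERN = re.compile(r"(?i)exit(ing)? plan mode|bypass")

def detects_plan_bypass(pane_text: str) -> bool:
    """Pure check over captured pane text, kept separate so it can be tested."""
    return bool(BYPASS_PATTERN.search(pane_text))

def watch(pane: str = "0.0", interval: float = 5.0) -> None:
    """Poll a tmux pane and kill it when the marker shows up."""
    while True:
        captured = subprocess.run(
            ["tmux", "capture-pane", "-p", "-t", pane],
            capture_output=True, text=True,
        ).stdout
        if detects_plan_bypass(captured):
            subprocess.run(["tmux", "kill-pane", "-t", pane])
            break
        time.sleep(interval)
```

Still burning tokens (or at least a polling loop) to enforce something the harness should guarantee, but at least the detection logic is testable on its own.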
if only there were a place with 9,881 pieces of feedback waiting to be triaged...
and maybe not handled by a duplicate-bot that goes wild and just auto-closes everything; just blessing some of the stuff there with a "you've been seen" label would go a long way...
Or improve performance and efficiency, if we’re generous and give them the benefit of the doubt.
It makes sense, in a way. It means the subscription deal is something along the lines of fixed / predictable price in exchange for Anthropic controlling usage patterns, scheduling, throttling (quotas consumptions), defaults, and effective workload shape (system prompt, caching) in whatever way best optimises the system for them (or us if, again, we’re feeling generous) / makes the deal sustainable for them.
It’s a trade-off
It may be (but I wouldn’t know) that some of other changes not covered here reduced costs on their side without impacting users, improving the viability of their subscription model. Or maybe even improved things for users.
I’d really appreciate more transparency on this, and not just when things fail.
But I’ve learned my lesson. I’ve been weaning off Claude for a few weeks, cancelled my subscription three weeks ago, let it expire yesterday, and moved to both another provider and a third-party open source harness.
If you worry about "degraded" experience, then let people choose. People won't be using other wrappers if they turn out to be bad. People ain't stupid.
> On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality, and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7
They can pick the default reasoning effort:
> On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode
They can decide what to keep and what to throw out (beyond simple token caching):
> On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6
It literally is all in the post.
I don't worry about anything though. It's not my product. I don't work for Anthropic, so I really couldn't care less about anyone else's degraded (or not) experience.
They control the default system prompt. You can change it if you want to.
> They can pick the default reasoning effort
Don't see how it's an obstacle in allowing third party wrappers.
> They can decide what to keep and what to throw out
That's actually a good point. However I still don't think it's an obstacle. If third party wrappers were bad, people simply wouldn't be using them.
Defaults matter. A large share of people never change them (status quo bias, psychological inertia). Having control over them (and usage quotas) means Anthropic can control and fine-tune what this fixed subscription costs them.
And evidently (re, the original article), they tried to do so.
Allowing third party wrappers doesn't mean Claude Code would cease to exist. The opposite actually, Claude Code would be the default.
People dissatisfied with Code would simply use other wrappers. I call it a win-win. I don't see how Anthropic loses here; they would still retain the ability to control the defaults.
I have no idea what the share of OpenClaw instances running on pi was, or third-party wrappers in general, but it was obviously large enough that Anthropic decided they had to put an end to it.
Conversely, from the latest developments, it would seem they are perfectly fine with people running OpenClaw with Claude models through Claude Code’s programmatic interface using subscriptions.
But in the end, this, my take, your take, is all conjecture. We are both on the outside looking in.
Only the people who work at Anthropic know.
A reminder: your vibe-coded slop required peak 68GB of RAM, and you had to hire actual engineers to fix it.
... But then again, many of us are paying out of pocket $100, $200USD a month.
Far more than any other development tools.
Services that cost that much money generally come with expectations.
A month prior, their vibe-coders were unironically telling the world how their TUI wrapper for their own API is a "tiny game engine", while they were (and still are) struggling to output a couple of hundred characters on screen: https://x.com/trq212/status/2014051501786931427
Meanwhile Boris: "Claude fixes most bugs by itself. " while breaking the most trivial functionality all the time: https://x.com/bcherny/status/2030035457179013235 https://x.com/bcherny/status/2021710137170481431 https://x.com/bcherny/status/2046671919261569477 https://x.com/bcherny/status/2040210209411678369 while claiming they "test carefully": https://x.com/bcherny/status/2024152178273989085
Once OpenAI added the $100 plan, it was kind of a no-brainer.
I don't know, their desktop app felt really laggy and even switching Code sessions took a few seconds of nothing happening. Since the latest redesign, however, it's way better, snappy and just more usable in most respects.
I just think that we notice the negative, disruptive things more. Even with the desktop app, the remaining flaws jump out: for example, the Chat / Cowork / Code modes only show the label for the currently selected mode while the others are icons (that aren't very big); a colleague literally didn't notice that those modes are in the desktop app (or at least that that's where you switch to them).
The AI hype is dying, at least outside the Silicon Valley bubble, which Hacker News is very much a part of.
That and all the dogfooding by slop coding their user facing application(s).
Wait, didn't they just reset everybody's usage last Thursday, thereby syncing everybody's windows up? (Mine should have reset at 13:00 MDT) ? So this is just the normal weekly reset? Except now my reset says it will come Saturday? This is super-confusing!
Curious about this section on the system prompt change: >> After multiple weeks of internal testing and no regressions in the set of evaluations we ran, we felt confident about the change and shipped it alongside Opus 4.7 on April 16. As part of this investigation, we ran more ablations (removing lines from the system prompt to understand the impact of each line) using a broader set of evaluations. One of these evaluations showed a 3% drop for both Opus 4.6 and 4.7. We immediately reverted the prompt as part of the April 20 release.
Curious what helped catch it in the later evals vs. the initial ones. Was the initial testing an online A/B comparison of aggregate metrics, or was the dataset just not broad enough?
This sounds fishy. It's easy to show users that Claude is making progress by either printing the reasoning tokens or printing some kind of progress report. Besides, "very long" is such a weasel phrase.
If a message will trigger a cache recreation, the cost of that should be viewable.
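A back-of-the-envelope sketch of why surfacing that cost matters. The per-million-token rates below are hypothetical placeholders, not Anthropic's actual pricing:

```python
# Hypothetical per-million-token rates (USD), for illustration only;
# real Anthropic pricing varies by model and changes over time.
RATES = {"input": 3.00, "cache_write": 3.75, "cache_read": 0.30}

def turn_cost(prompt_tokens: int, cached_tokens: int, cache_hit: bool) -> float:
    """Estimate one turn's prompt cost in USD: cache hit vs. recreation."""
    if cache_hit:
        fresh = prompt_tokens - cached_tokens
        return (cached_tokens * RATES["cache_read"] + fresh * RATES["input"]) / 1e6
    # Cache miss: the whole prefix gets re-processed and re-written.
    return prompt_tokens * RATES["cache_write"] / 1e6

hit = turn_cost(200_000, 190_000, cache_hit=True)
miss = turn_cost(200_000, 190_000, cache_hit=False)
```

Even with made-up rates, the point holds: a silent cache recreation on a large context can cost several times what a cache hit would, which is exactly why it should be visible per message.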
The real lesson is that an internal message-queuing experiment masked the symptoms in their own dogfooding. Dogfooding only works when the eaten food is the shipped food.
But if we're vibing... this is the kind of bug that should make it back into a review agent/skill's instructions in a more generic form: essentially, if something is done to the message history, check that there are tests verifying subsequent turns work as expected.
But yeah, you'd have to piss off a bunch of users in prod first to discover the blind spot.
2. Old sessions had the thinking tokens stripped, resuming the session made Claude stupid (took 15 days to notice and remediate)
3. System prompt to make Claude less verbose reducing coding quality (4 days - better)
All this to say... the experience of suspecting a model is getting worse while Anthropic publicly gaslights their user-base: "we never degrade model performance" is frustrating.
Yes, models are complex and deploying them at scale given their usage uptick is hard. It's clear they are playing with too many independent variables simultaneously.
However you are obligated to communicate honestly to your users to match expectations. Am I being A/B tested? When was the date of the last system prompt change? I don't need to know what changed, just that it did, etc.
Doing this proactively would certainly match expectations for a fast-moving product like this.
This one was egregious: after a one hour user pause, apparently they cleared the cache and then continued to apply “forgetting” for the rest of the session after the resume!
Seems like a very basic software engineering error that would be caught by normal unit testing.
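For what it's worth, the once-per-resume invariant is exactly the kind of thing a small unit test pins down. A sketch of the general pattern (my own toy model of the bug class, not Anthropic's code):

```python
class Session:
    """Toy model: thinking blocks should be cleared at most once per resume."""

    def __init__(self, turns):
        self.turns = turns                # dicts with an optional "thinking" key
        self._cleared_on_resume = True    # nothing pending until a resume

    def resume(self):
        self._cleared_on_resume = False   # allow exactly one clear

    def next_turn(self):
        # The bug class: drop this guard and thinking gets stripped EVERY turn.
        if not self._cleared_on_resume:
            for turn in self.turns:
                turn.pop("thinking", None)
            self._cleared_on_resume = True

def test_clears_only_once():
    s = Session([{"text": "a", "thinking": "old"}])
    s.resume()
    s.next_turn()                         # stale thinking stripped: expected
    assert "thinking" not in s.turns[0]
    s.turns.append({"text": "b", "thinking": "fresh"})
    s.next_turn()                         # must NOT strip the new thinking
    assert s.turns[1]["thinking"] == "fresh"

test_clears_only_once()
```

The second assertion is the one that fails if the guard is missing, i.e. the "forgetful and repetitive every turn" behavior from the postmortem.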
To take the opposite side, this is the quality of software you get atm when your org is all in on vibe coding everything.
You can argue they're lying, but I think this is just folks misunderstanding what Anthropic is saying.
They're not gaslighting anyone here: they're very clear that the model itself, as in Opus 4.7, was not degraded in any way (i.e. if you take them at their word, they do not drop to lower quantisations of Claude during peak load).
However, the infrastructure around it - Claude Code, etc - is very much subject to change, and I agree that they should manage these changes better and ensure that they are well-communicated.
Sure, they didn't change the GPUs they're running, or the quantization, but if valuable information is removed, leading to models performing worse, then performance was degraded.
In the same way uptime doesn't care about the incident cause... if you're down you're down no one cares that it was 'technically DNS'.
I don't have trust in it right now. More regressions, more oversights; it's pedantic in weird ways. Ironically, it requires more handholding.
Not saying it's a bad model; it's just not simple to work with.
for now: `/model claude-opus-4-6[1m]` (you'll get different behavior around compaction without [1m])
Communicate the changes you are making. Leverage the community using your product(s). Reveal more about the building blocks (system prompts, harness, etc) so people can better understand how to use your tools.
I understand they're in an existential battle with the other SOTA houses -- but secrecy, straight-out lies, and opaque "communication" are not the way to win it. Not when the OSS stack is hot on their heels anyway.
Many of these things have bitten me too: firing off a request that is slow because it got kicked out of cache, then getting zero cache hits (which makes everything way more expensive), so it makes sense they would try this. I tried skipping tool calls and thinking as well, and it made the agent much stupider. These all seem like natural things to try. Pity.
The artificial creation of demand is also a concerning sign.
Do researchers know correlation between various aspects of a prompt and the response?
LLMs, to me at least, appear to be wildly random functions that are difficult to rely on. Traditional systems have structured inputs and outputs, and we can know how a system produced its output. This doesn't appear to be the case for LLMs, where inputs and outputs can be any text.
Anecdotally, I had a difficult time working with open source models at a social media firm, and something as simple as wrapping the example of JSON structure with ```, adding a newline or wording I used wildly changed accuracy.
But right now it seems like, in the case of (3), these systems are really sensitive and unpredictable. I'd characterize that as an alignment problem, too.
once i saw that, i'm not surprised anymore.
You have too many and the wrong benchmarks
now it's back to regular slop and just to check otherwise i have to spend at least $100
Resuming it cost 5% of the current session and 1% of the weekly session on a max subscription.
Also, it may be a coincidence that the article was published just before the GPT 5.5 launch, and that they then restored the original model while releasing a PR statement claiming it was due to bugs.
The thing about session resumption changing the context of a session by truncating thinking is a surprise to me, I don't think that's even documented behavior anywhere?
It's interesting to look at how many bugs are filed on the various coding agent repos. Hard to say how many are real / unique, but quantities feel very high and not hard to run into real bugs rapidly as a user as you use various features and slash commands.
At the same time, personally I find prioritizing quality over quantity of output to be a better personal strategy. Ten partially buggy features really aren't as good as three quality ones.
I really appreciate these little touches.
https://techtrenches.dev/p/the-snake-that-ate-itself-what-cl...
https://skills.sh/anthropics/skills/skill-creator
I think an apology for that incident would go a long way.
What verbosity? Most of the time I don’t know what it’s doing.
LLMs over-edit, and it's a known problem.
how often do these changes happen?
But if a tool is better, it's better.
We're talking about dynamically developed products, something that most people would have considered impossible just 5 years ago. A non-deterministic product that's very hard to test. Yes, Anthropic makes mistakes, models can get worse over time, their ToS change often. But again, is Gemini/GPT/Grok a better alternative?
If you have a good product, you are more understanding. And getting worse doesn't mean it's no longer valuable, only that the price/value ratio went down. But Opus 4.5 was noticeably better and only came out in November.
There was no price increase at that time, so for the same money we got better models. Opus 4.6 again feels noticeably better, though.
Also, moving fast-ish means getting more/better models sooner.
I do know plenty of people, though, who use opencode or pi with OpenRouter and switch models a lot more often.
That said, there is now much better competition with Codex, so there's only so much rope they have now.
I never understood why people cheered for Anthropic then when they happily work together with Palantir.
Ironically, I was thinking the exact opposite. This is bleeding edge stuff and they keep pushing new models and new features. I would expect issues.
I was surprised at how much complaining there is -- especially coming from people who have probably built and launched a lot of stuff and know how easy it is to make mistakes.
They don't actually pay the bill or see it.
Idiots keep throwing money at real-time enshittification and "I am changing the terms. Pray I do not change them further."
And yes, I am absolutely calling people who keep getting screwed and paying for more 'service' as idiots.
And Anthropic has proved that they will pay for less and less. So, why not fuck them over and make more company money?
Again goes back to the "intern" analogy people like to make.
Somehow, three times makes me not feel confident in this response.
Also, if this is all true and correct, how the heck do they validate quality before shipping anything?
Shipping software without quality is a pretty easy job even without AI. Just saying...
how do you just do that to millions of users building prod code with your shit
The other thing: when Anthropic turns on lazy Claude... (I want to coin here the term Claudez for the version of Claude that's lazy... Claude zzZZzz = Claudez) that thing is terrible... you ask the model for something... and it's like... oh yes, that will probably depend on memory bandwidth... do you want me to search that?...
YES... DO IT... FRICKING MACHINE..
> Next steps are to run `cat /path/to/file` to see what the contents are
Makes me want to pull my hair out. I've specifically told you to go do all the read-only operations you want out on this dev server yet it keeps forgetting and asking me to do something it can do just fine (proven by it doing it after I "remind" it).
That and "Auto" mode really are grinding my gears recently. Now, after a Planning session my only option is to use Auto mode, and I have to manually change it back to "Dangerously skip permissions". I think these are related, since the times I've let it run in "Auto" mode are when it gives up/gets stuck more often.
Just the other day it was in Auto mode (by accident) and I told it:
> SSH out to this dev server, run `service my_service_name restart` and make sure there are no orphans (I was working on a new service and the start/stop scripts). If there are orphans, clean them up, make more changes to the start/stop scripts, and try again.
And it got stuck in some loop/dead-end, telling me I should do it myself because it didn't want to run commands on a "Shared Dev server" (even though I had specifically told it that this was not a shared server).
The fact that Auto mode burns more tokens _and_ is so dumb is really a kick in the pants.
If they have to raise prices to stop hemorrhaging money, would you be willing to pay 1000 bucks a month for a Max plan? Or $100 per 1M output tokens (playing Numberwang here, but the point stands).
If I had to guess, they are trying to get their balance sheet in order for an IPO, and they basically have 3 ways of achieving that:
1. Raising prices like you said, but the user drop could be catastrophic for the IPO itself and so they won't do that
2. Dumb the models down (basically decreasing their cost per token)
3. Send less tokens (ie capping thinking budgets aggressively).
2 and 3 are palatable because, even if they annoy the technical crowd, investors still see a big number of active users with a positive margin on each.
I'm not a heavy LLM user, and I've never come anywhere near the limits of the $200/month plan I'm already subscribed to. But when I do use it, I want the smartest, most relentless model available, operating at the highest performance level possible.
Charge what it takes to deliver that, and I'll probably pay it. But you can damned well run your A/B tests on somebody else.
There are a number of projects working on evals that can check how 'smart' a model is, but the methodology is tricky.
One would want to run the exact same prompt, every day, at different times of the day, but if the eval prompt(s) are complex, the frontier lab could have a 'meta-cognitive' layer that looks for repetitive prompts, and either: a) feeds the model a pre-written output to give to the user b) dumbs down output for that specific prompt
Both cases defeat the purpose in different ways, and make a consistent gauge difficult. And it would make sense for them to do that since you're 'wasting' compute compared to the new prompts others are writing.
Enough that the prompt is different at the token level, but not enough that the meaning changes.
It would be very difficult for them to catch that, especially if the prompts were not made public.
Run the variations enough times per day, and you'd get some statistical significance.
I guess the fuzzy part is judging the output.
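A sketch of how the variation-plus-significance idea could look; the filler-phrase trick and the crude z-test are illustrative assumptions, and the scores are synthetic:

```python
import random
import statistics

def vary(prompt: str, rng: random.Random) -> str:
    """Token-level variation that preserves meaning: swap filler prefixes."""
    fillers = ["Please ", "Kindly ", "Now, "]
    return rng.choice(fillers) + prompt

def drift_detected(baseline, today, threshold=2.0):
    """Crude z-test on mean eval scores: baseline window vs. today's runs."""
    diff = statistics.mean(baseline) - statistics.mean(today)
    se = (statistics.pvariance(baseline) / len(baseline)
          + statistics.pvariance(today) / len(today)) ** 0.5
    return se > 0 and abs(diff) / se > threshold

# Each day, score the same task phrased several slightly different ways,
# so the provider can't special-case a single repeated prompt.
rng = random.Random(0)
prompts = [vary("Refactor this function without changing behavior.", rng)
           for _ in range(3)]

# Synthetic grader scores: a stable baseline week vs. a degraded day.
base = [0.82, 0.80, 0.85, 0.81, 0.83, 0.84]
today = [0.70, 0.68, 0.72, 0.71, 0.69, 0.73]
```

The genuinely hard part remains the grader that produces those scores; the statistics on top are the easy bit.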
Why should we ever trust what they say again, or trust that they won't be rug-pulling again once this blows over?
And it also tells us why we shouldn’t use their harness anyway: they constantly fiddle with it in ways that can seriously impact outcomes without even a warning.
What I notice: after 300k there's some slight quality drop, but I just make sure to compact before that threshold.
1. A bunch of people with new Claude Code codebases in December now are working with a larger codebase, causing more context. Claude reads a lot of code files, and doesn't effectively prune from the context as far as I can tell. I find myself having to hint Claude regularly about what files to read (and not read) to avoid having 75k of unrelated files in the context window.
2. Claude Code tries to do more now, for the benefit of people who don't know exactly what they want. The trade-off is that it's worse at doing exactly what people want, when they do know. The "small fix" becomes a large endeavor for Claude.
The latest in homebrew is 2.1.108 so not fixed, and I don't see opus 4.7 on the models list... Is homebrew a second class citizen, or am I in the B group?
Also bit surprised they don't have any automated quality check. They can run something like swe bench before each release. Both of these seem like a basic thing even for startup, let alone some product generating billions in revenue.
Translation: To reduce the load on our servers.
So then, there must have been an explicit internal guidance/policy that allowed this tradeoff to happen.
Did they fix just the bug or the deeper policy issue?
Is it just me or does this seem kind of shocking? Such a severe bug affecting millions of users with a non-trivial effect on the context window that should be readily evident to anyone looking at the analytics. Makes me wonder if this is the result of Anthropic's vibe-coding culture. No one's actually looking at the product, its code, or its outputs?
Apparently they are using another version internally.
Notably missing from the postmortem
lua plugins WIP
translation: we ignored this and our various vibe coders were busy gaslighting everyone saying this could not be happening
The harness on the other hand. Now that had problems.
My trust is gone. When day-to-day updates do nothing but cause hundreds of dollars in wasted tokens, and the response is "we... sorta messed up, but just a little bit here and there, and it added up to a big mess-up"... bro, get real.
They are all doing it because OpenAI is snatching their customers. And their employees have been gaslighting people [1] for ages. I hope open-source models will provide fierce competition so we do not have to rely on an Anthropic monopoly. [1] https://www.reddit.com/r/claude/comments/1satc4f/the_biggest...
People complain so much, and the conspiracy theories are tiring.