The real score should be around 50% or less. The scoring system reads like it was done as a joke without much thought, and it compares a lot of apples to oranges. For example, "aw my balls" gets scored as equal to Jackass: even when the answer describes what's different about them, they count as equal.
A Costco degree is not equal to a Microsoft degree, etc.
The tests should have negative weights based on how often each issue is encountered and its impact. The SPI test (#2) should carry something like 8 negative points out of 10, since it's the most common blocker. And the whole test should use an inverse score.
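Something like this rough sketch of what I mean (test names and weights are made up for illustration, not taken from the actual tool):

```python
# Rough sketch of frequency/impact-weighted inverse scoring.
# Each failed test subtracts its weight from the maximum; a full score
# means no blockers were hit, zero means the worst ones all failed.
# All names and numbers below are hypothetical.

WEIGHTS = {
    "spi": 8.0,      # most common blocker, so it dominates the score
    "other_a": 1.5,  # hypothetical rarer issue
    "other_b": 0.5,  # hypothetical minor issue
}

def inverse_score(failures: set[str]) -> float:
    """Start from the max score and subtract the weight of each failed test."""
    max_score = sum(WEIGHTS.values())
    penalty = sum(WEIGHTS[name] for name in failures if name in WEIGHTS)
    return round(max_score - penalty, 2)

print(inverse_score(set()))    # no failures -> 10.0
print(inverse_score({"spi"}))  # the big blocker failed -> 2.0
```

That way a single common blocker like SPI drags the score down far more than a handful of cosmetic issues would.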
Yeah, good call, we're on the same page about that. I designed this tool (agentreadingtest.com) to raise awareness of these issues in a more general way, so people can point agents at it and see how it performs for them. Separately, I maintain a related tool that can actually assess these issues in documentation sites: https://afdocs.dev/
I've tried to weight things appropriately in assessing actual sites, but for the test here, I more wanted to just let people see for themselves what types of failures can occur.
I only understood half of the tech jargon in your answer. If I understood it all I’d probably run it myself.
If someone less knowledgeable than me is your customer, you need to explain it in simpler terms!
Fair enough! The simple answer is: we did a lot of work to make the model better at coding without requiring complicated installation or configuration. One command to install and run.
All the benefits of Claude Code, without any of the limitations or rug pulls.
Since Tor has become increasingly susceptible to state monitoring of exit nodes, making your app rely on Tor potentially compromises your future users.
Look into I2P or another protocol that's actually anonymous.
Yes, I agree Tor is not the best anonymity service these days. I2P was my first choice, but the performance was just awful. I haven't fully given up on the I2P idea (maybe it was just a bad day), so I'll give it a second chance and maybe add a third mode, or fully replace Tor. Not sure, since a lot of people are familiar with Tor and not I2P.
Who are you trying to please with Tor support? You have the advantage of not being a commercial product driven by recognition, so you're free to build on the next, better thing.
The post reminded me of how I investigated a similar issue with no idea what I was doing. Using Claude or GPT to investigate this kind of hardware issue is fast and easy: it gives you the next command to try, then the next one, and you end up with a similar summary. I wouldn't be surprised if the author didn't know anything about displays before this.
So that's what it is! I was wondering why it still makes mistakes and forgets the steering even after I reduce the context and summarise, and I couldn't find an explanation for why it starts ignoring instructions when the context isn't anywhere near full.
How did you find out that tool calls are what degrades it?
Isn't this the biggest problem there is, and not just a "design tension"?
That's quite weak confidence in their own platform's security if finding a root-level vulnerability is not a one-off event, but a program expected to have multiple people routinely finding them.
It's not quite clear which part this project is. There's no single program called "Claude Code": there's the TUI/GUI app, the harness, the prompts, and the LLM. So is this the harness part?