Hacker News | dostick's comments

The real score should be around 50% or less. The scoring system seems to have been put together as a joke, without much thought, and it compares a lot of apples to oranges. For example, “aw my balls” is scored as equal to Jackass even when the entry itself describes how the two differ. A Costco degree is not equal to a Microsoft degree, etc.

The tests should have negative weights based on how often each issue is encountered and how much impact it has. Test 2 (the SPA test) should carry something like 8 negative points out of 10, since it's the most common blocker. And the whole test should be scored inversely.

Yeah, good call, we're on the same page about that. I designed this tool (agentreadingtest.com) to raise awareness of these issues in a more general way, so people can point agents at it and see how they perform. Separately, I maintain a related tool that can actually assess these issues in documentation sites: https://afdocs.dev/

My weighting system there scores the number of pages affected by SPA and caps the possible score at a "D" or "F" depending on the proportion of pages affected: https://afdocs.dev/interaction-diagnostics.html#spa-shells-i...
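The cap behaviour described above can be sketched roughly like this (a hypothetical function with made-up thresholds, not the actual afdocs.dev scoring code):

```python
def grade_spa_impact(pages_affected: int, total_pages: int) -> str:
    """Hypothetical sketch: cap the letter grade based on the
    proportion of pages served as empty SPA shells."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    ratio = pages_affected / total_pages
    if ratio == 0:
        return "A"   # no SPA-shell pages found: no cap applied
    if ratio < 0.5:
        return "D"   # some pages unreadable to agents: cap at D
    return "F"       # majority of pages unreadable: cap at F

print(grade_spa_impact(3, 10))
```

The real tool presumably blends this with other diagnostics; the point is just that SPA shells set a hard ceiling on the grade rather than contributing a small linear penalty.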

I've tried to weight things appropriately in assessing actual sites, but for the test here, I more wanted to just let people see for themselves what types of failures can occur.


Is mentioning ULTRATHINK in the prompt the equivalent of /effort max?

Yes, but only for the message that includes it, whereas /effort max keeps it at max effort for the entire conversation, to my knowledge.

How is your offering different from local ollama?

It's batteries-included. No config.

We also fine-tuned and did RL on our model, developed a custom context engine, trained an embedding model, and modified MLX to improve inference.

Everything is built to work together. So it's more like an Apple product than Linux: less config, but better optimized for the task.


I only understood half of the tech jargon in your answer. If I understood it all, I'd probably run it myself. If someone less knowledgeable than me is your customer, you need to explain it in simpler terms!

Fair enough! The simple answer is: we did a lot of work to make the model better at coding without requiring complicated installation or configuration. One command to install and run.

All the benefits of Claude Code, without any of the limitations or rug pulls.


Since Tor has become increasingly susceptible to state monitoring of exit nodes, making the app rely on Tor potentially compromises your future users. Look into I2P or another protocol that's genuinely anonymous.

Yes, I agree Tor is not the best anonymity service these days. I2P was my first choice, but the performance was just awful. I haven't fully given up on the I2P idea (maybe it was just a bad day), so I'll give it a second chance and maybe add a third mode or fully replace Tor. I'm not sure, since a lot of people are familiar with Tor and not with I2P.

Who do you want to please with Tor support? You have the advantage of not being a commercial product driven by recognition, so you're free to base it on the next, better thing.

“Go on” works fine too

The post reminded me of how I investigated a similar issue with no idea what I was doing. Using Claude or GPT to investigate this kind of hardware issue is fast and easy: it gives you the next command to try, then the next one, and you end up with a similar summary. I wouldn't be surprised if the author didn't know anything about displays before this.

So that's what it is! I was wondering why reducing context and summarising still left it making mistakes and forgetting the steering, and I couldn't find an explanation for why it starts ignoring instructions when the context isn't anywhere near full. How did you find out that tool calls are what degrade it? Isn't this the biggest problem there is, and not just a "design tension"?

That's quite weak confidence in their own platform's security if finding a root-level vulnerability is not a one-off event, but rather a program where multiple people are expected to routinely find them.

Well, it's selection bias.

If an athlete breaks a world record, they're likely to do it again. Even though it's incredibly hard to break a world record.


It's not quite clear what part of the stack this project is. There's no single program called "Claude Code": there's a TUI/GUI app, a harness, prompts, and an LLM. So is this the harness part?

It's the harness/orchestration layer — the part that runs the agent loop, dispatches tool calls, and manages context.
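In rough pseudocode terms, that orchestration layer looks something like the loop below (a minimal illustrative sketch; `toy_model`, `run_tool`, and the message shapes are stand-ins I made up, not this project's actual API):

```python
def run_tool(name, args, tools):
    """Dispatch a tool call to the registered tool function."""
    return tools[name](**args)

def agent_loop(call_model, tools, messages, max_turns=20):
    """Harness loop: call the model, execute any tool calls it
    requests, feed the results back into context, and repeat
    until the model returns a final answer."""
    for _ in range(max_turns):
        reply = call_model(messages)
        messages.append(reply)
        if not reply.get("tool_calls"):
            return reply["content"]   # no tools requested: done
        for call in reply["tool_calls"]:
            result = run_tool(call["name"], call["args"], tools)
            messages.append({"role": "tool",
                             "name": call["name"],
                             "content": result})
    raise RuntimeError("max turns exceeded")

# Toy model: requests one tool call, then answers.
def toy_model(messages):
    if not any(m.get("role") == "tool" for m in messages):
        return {"role": "assistant", "content": None,
                "tool_calls": [{"name": "read_file",
                                "args": {"path": "a.txt"}}]}
    return {"role": "assistant", "content": "done", "tool_calls": []}

tools = {"read_file": lambda path: f"<contents of {path}>"}
print(agent_loop(toy_model, tools, [{"role": "user", "content": "hi"}]))
# prints: done
```

Context management (summarising or trimming `messages` when they grow too large) would slot into the same loop between turns.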
