Hacker News | measurablefunc's comments

I recently wrote a simple interpreter for a stack-based virtual machine for a Firefox extension to do some basic runtime programming, b/c extensions can't generate & evaluate JavaScript at runtime. None of the consumer AIs could generate code of even moderate complexity for the stack VM, even though the language specification fits on a single page.

We don't have real AI & no one is anywhere near anything that can consistently generate code of moderate complexity w/o bugs or accidents like deleting files during basic data processing (something I ran into recently while writing a local semantic search engine for some of my PDFs using open-source neural networks).


I am building an assembler+compiler+VM for a python-like statically typed language with monomorphized generics and Erlang-style concurrency. Claude Sonnet/Kimi/Gemini Pro (and even ChatGPT on occasion) are able to handle the task reasonably well because I give them specs for the VM that have been written and rewritten 200+ times to remove any ambiguities and make things as clear as possible.

I go subsystem by subsystem.

Writing an interpreter for a stack VM is about as simple as it gets.
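To illustrate the point, here is a minimal sketch of such an interpreter (the opcode names and program encoding are made up for illustration, not from the actual extension):

```python
# Minimal stack-based VM: a list serves as the operand stack and a chain of
# if/elif branches dispatches opcodes. All names here are hypothetical.

def run(program):
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(args[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack

# Computes (2 + 3) * 4, leaving the result on the stack.
print(run([("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]))
```

A real VM would add control flow, a call stack, and so on, but the core loop stays this small.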


At which point you tell them they are being extremely reckless but subtly mention that something new & even scarier is being developed internally that's going to blow everything else out of the water.

I'm wondering when people are going to figure out the doom marketing playbook.

We know it, we’re just susceptible to it. Like not eating for an extended time, you know you’ll get hungry and then you do. There’s a very basic but powerful response to danger and a need for safety.

Yes, yes, it's the end of the world, so on and so forth.

Written by AI.

A more accurate title would say it is a tail-call-optimized interpreter. Tail calls alone aren't special b/c what matters is that the compiler or runtime reuses the caller's frame instead of pushing another call frame & growing the stack.

Maybe; it probably depends on how you're looking at it. The optimization itself is obvious, and I expect any optimizing compiler will TCO the naive tail calls. But the trouble in Rust or C++ or a dozen other languages is that you can easily write code which you think can be optimized, yet the compiler either can't see how, or can see that it's not possible, and (without this keyword) you never find out, because growing the stack is a valid implementation of what you wrote even though it's not what you meant.

The "become" keyword allows us to express our meaning: we want the tail call. Of course the compiler will optimize it if it can be a tail call, but now the compiler is also authorized to say "Sorry Dave, that's not possible" rather than grow the stack. Most often you wrote something silly: "Oh, the debug logging happens after the call, that's never going to work, I'll shuffle things around."


I wouldn't call it optimized, since that implies the interpreter gains performance from the tail calls but would still work without them; here the tail calls are integral to its function. It simply wouldn't work if the compiler couldn't be forced to emit them.

What I wrote is standard nomenclature:

> Tail calls can be implemented without adding a new stack frame to the call stack. Most of the frame of the current procedure is no longer needed, and can be replaced by the frame of the tail call, modified as appropriate (similar to overlay for processes, but for function calls). The program can then jump to the called subroutine. Producing such code instead of a standard call sequence is called tail-call elimination or tail-call optimization. (https://en.wikipedia.org/wiki/Tail_call)
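Concretely, the elimination described above turns the recursive call into frame reuse, which in a language without guaranteed TCO you would write by hand as a loop. A toy example of the transformation:

```python
# A tail-recursive sum: the recursive call is the last thing that happens,
# so no live state in the caller's frame is needed after the call.
def sum_rec(n, acc=0):
    if n == 0:
        return acc
    return sum_rec(n - 1, acc + n)  # tail position; Python still grows the stack

# What tail-call elimination effectively produces: the "call" becomes a
# rebinding of the parameters plus a jump back to the top, i.e. a loop.
def sum_tce(n, acc=0):
    while n != 0:
        n, acc = n - 1, acc + n
    return acc

assert sum_rec(100) == sum_tce(100) == 5050
# sum_rec(1_000_000) would blow Python's recursion limit; sum_tce would not.
```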


Questioning standard nomenclature is useful too, as long as it provides insight and is not just bike-shedding. "optimization" (in the context of an optimizing compiler) is generally expected not to alter the semantics of a program.

> but the tail calls are integral to the function of the interpreter

Not really: a trampoline could emulate them, keeping the stack from growing, at the cost of an extra function call for every opcode dispatch. Tail calls just optimize away this dispatch loop (or tail call back into the trampoline, however you want to set it up).
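A sketch of that trampoline pattern (handler names are hypothetical): each handler returns the next handler instead of calling it, and a driver loop invokes them, so the native stack stays flat:

```python
# Trampoline dispatch: handlers return (next_handler, state) instead of
# tail-calling the next handler directly, so the call stack never grows.

def op_decrement(state):
    state["n"] -= 1
    return (op_check, state)

def op_check(state):
    if state["n"] <= 0:
        return (None, state)          # None signals "halt"
    return (op_decrement, state)      # the "tail call", expressed as a return

def trampoline(handler, state):
    # This loop is exactly the dispatch overhead that real tail calls
    # (or a `become`-style keyword) would optimize away.
    while handler is not None:
        handler, state = handler(state)
    return state

# Runs a million "opcodes" with no deep recursion.
print(trampoline(op_check, {"n": 1_000_000})["n"])
```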


Yup, standard practice for interpreters in languages that don't have tail call optimization.

There are no secrets when you are using AI providers. They track all interactions b/c that's valuable information for improving their models.

I'm talking about sharing things publicly that you are trying to claim as your own

It doesn't matter. If someone has the same idea then they can use AI the same way you did to recreate it. Keeping it a secret benefits no one other than the AI providers b/c now they can charge money for giving someone else "your" code. The AI providers don't care about license restrictions so it's the perfect way to launder code. If you want credit for something then you'll have to claim it publicly b/c the AI providers sure as hell are not going to give you any credit.

Strange downvotes. Not only do these services allow anyone with money to copy their competitors if those competitors use the same services, but in the long run Anthropic could very well be the competition, trained on the corporations that use Claude. Why would this startup be any different from Google or Microsoft in the long run? People can't seem to learn their lesson.

People are very naive about how technology companies operate.


Even if you believe the "we don't train on your data" claim/lie, that leaves a whole lot of things they can do with it besides training directly on it.

Analytics can be run on it, they can run it through their own models, synthetic training data can be derived from it, it can be used to build profiles on you/your business, they could harvest trade/literal secrets from it, they could store derivatives of your data to one day sell to competitors/compete themselves, they can use it to gauge just how dependent you've made yourself/business on their LLMs and price accordingly, etc.


No. Your data or any derivative of it does not leave RAM unless you are detected as doing something that qualifies as abuse, then it is retained for 30 days.

Even the process of deciding what "qualifies as abuse" does exactly what I'm talking about: they're analyzing your data with their own models and doing whatever they want with the results, including storing them, using them to ban you from the product you paid for, and calling the police on you.

Either way, I don't believe it.


You are a Star Wars Rebel fighting Darth Vader. Good job!

Thanks

That's about the API. It doesn't say anything about their other products like Codex. Moreover, even for the API it says you have to qualify for zero-retention policies. They retain the data for as long as each jurisdiction requires, & they are always improving their abuse detection using the retained data.

> Our use of content. We may use Content to provide, maintain, develop, and improve our Services, comply with applicable law, enforce our terms and policies, and keep our Services safe. If you're using ChatGPT through Apple's integrations, see this Help Center article (opens in a new window) for how we handle your Content.

> Opt out. If you do not want us to use your Content to train our models, you can opt out by following the instructions in this article. Please note that in some cases this may limit the ability of our Services to better address your specific use case.

https://openai.com/policies/row-terms-of-use/
https://openai.com/policies/how-your-data-is-used-to-improve...


Codex just talks to the responses API with store=false. So unless the model detects you are doing something that qualifies as abuse, nothing is retained.

Alright, good luck to you. I'm not really interested in talking to people who think they're lawyers for AI providers. If you think they don't keep any of the data & don't use it for training then you are welcome to continue believing that. It makes no difference to me either way.

> Alright, good luck to you. I'm not really interested in talking to people who think they're lawyers for AI providers.

Codex is open source, you can inspect it yourself, but let's not let facts ruin your David vs Goliath fantasy.


And you believe them?

Yes. That's the rational position.

This is a lot of useful data for the next iteration of Claude: not only does Anthropic have the final artifacts, they also saw the entire workflow from start to finish, & Facebook paid them for the privilege of handing over all of that training data.

Only if you assume they don't honor their enterprise agreements.

I assume all chat logs are used for training in one way or another because it would be foolish to not do that.

More training data at this point yields marginal improvements; the curve is flattening. So the advantage is low, especially when Anthropic definitely has the budget and talent to carry out the same study.

On the other hand, having it leak that you train on your customers' data, ignoring the opt-out, is probably existential when close alternatives exist in the market.


You probably also thought Anthropic did not use pirated PDFs. You don't know how these companies actually operate & you don't know what weasel language they use in their contracts to get away w/ exactly what I assume to be the case.

There is no AI; all these companies have is the chat logs. So unless you have further evidence about what they do or don't do behind the scenes, I recommend you take a more conservative approach in your assumptions about what they use or don't use for training.


No, why would they care about using pirated PDFs? Did you actually read/understand what I wrote? Violating their customers comes with risk for them. Violating the copyright of unrelated textbook authors does not. If that's even what they did.

They are currently paying book authors over a billion dollars in damages. You're out of your depth in this discussion so further engagement is not going to be fruitful for anyone involved. Good luck.

Oh no, not 0.2% of their valuation! The end is near for Anthropic. Humanity is saved. By the copyright lobby, of all people.

Yes, it's well known that money & prices are what make people act rationally. We'd still be slinging mud & rocks if it wasn't for money & prices.

Tangentially related from something I'm currently reading¹:

> This is the reality of twenty-first-century resource exploitation: reducing vast quantities of rock into granules and chemically processing what remains. It is both awe inspiring and disturbing. One risk is that the cyanide and mercury used in the method could escape into the surrounding ecosystem. After all, while miners like Barrick insist they follow all the rules laid down by the US Environmental Protection Agency (EPA), campaigners warn that pollution often finds its way out of the mine. Indeed, a few years earlier the EPA had fined Barrick and another nearby miner $618,000 for failing to report the release of toxic chemicals including cyanide, lead and mercury. But the main thing I was struck by as I observed each stage in this process was just how far we will go these days to secure a tiny shred of shiny metal.

> The scale, for one thing, was mind-boggling. As I looked down into the pit I could just about make out some trucks on the bottom, but only when they emerged at the top did I realise that they were bigger than three-storey buildings; the tyres alone were the size of a double-decker bus. How much earth do you have to remove to produce a gold bar? I asked my minders. They didn’t know, but they did know that in a single working day those trucks would shift rocks equivalent to the weight of the Empire State Building.

¹ Material World: A Substantial Story of Our Past and Future by Ed Conway


> in a single working day those trucks would shift rocks equivalent to the weight of the Empire State Building.

Oh. My. God.

