Hacker News | wietze's comments

From a living-off-the-land perspective, having to symlink/hardlink/alias a command is much noisier - and thus easier to detect. So although you are right in saying it wouldn't completely solve the problem, making it a system responsibility would still significantly reduce the scope for abuse.


For busybox/toybox the argv[0] thing is great, and seems to be the prime example of why argv[0] shouldn't go - yet it is a bit of an anomaly in how argv[0] is used.

If there really is a need for having one executable that comprises multiple commands, is `busybox whoami` instead of `whoami` so much more effort? To me, that would make more sense in terms of what is going on; aliases could be used if one-word commands are preferred. In most non-busybox contexts, argv[0] is just an unnecessary addition that, as the linked article shows, can introduce weirdness.
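For readers unfamiliar with the mechanism being debated: a multi-call binary dispatches on the name it was invoked under. A minimal Python sketch of the idea (the applet table and its behaviour are invented for illustration, not busybox's actual code):

```python
import os

# Hypothetical applet table; real busybox bundles hundreds of these.
APPLETS = {
    "whoami": lambda args: "demo-user",
    "echo": lambda args: " ".join(args),
}

def dispatch(argv):
    """Dispatch the way a multi-call binary does: first on the name
    the program was invoked under (argv[0]), then on a subcommand."""
    name = os.path.basename(argv[0])
    if name in APPLETS:                        # invoked via symlink, e.g. /bin/echo
        return APPLETS[name](argv[1:])
    if len(argv) > 1 and argv[1] in APPLETS:   # invoked as `busybox echo ...`
        return APPLETS[argv[1]](argv[2:])
    raise SystemExit(f"unknown applet: {argv}")
```

Dropping the first branch is exactly the change being proposed here: only the explicit `busybox echo ...` form would remain, with symlinks/aliases as optional sugar.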

It's clear from the comments there are still many who think argv[0] is a good thing, which is great - I'm glad the post sparked this debate.


> is `busybox whoami` instead of `whoami` so much more effort?

It's not the "more effort" that is the deal breaker here. It is a matter of compliance with specs and user expectations. What you're suggesting would make Busybox very non-POSIXy, very non-Unixy. All scripts written over the last many decades would need to be updated to call `busybox ls` instead of `ls`? How is that a viable solution?

> I'm glad the post sparked this debate.

This is a very strange way to deflect concerns about quality of the article!


Yeah. The whole point of busybox is to provide the POSIX commands in one compact executable. Making things work any other way defeats the entire purpose of busybox.


In other words: `busybox` is primarily an implementation of a _standard library_ and only secondarily a command line tool, so it _must_ use the standard names.


Given that 'alias' is in POSIX, would a combination of

(a) the hypothetical non-argv0 busybox being discussed, and

(b) a POSIX shell of the maintainer's choice, with built-in aliases for 'ls=busybox ls'

be sufficient to make the system POSIX compliant?


Aliases are not inherited by subprocesses, unfortunately! So the alias solution would not work when a shell script launches other shell scripts. It wouldn't work in a wide range of other scenarios either, like Makefiles, bespoke build tools, binary executables that do execve("/usr/bin/cmp", ...), etc.
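This is easy to demonstrate. A sketch assuming bash is available; `greet` is a made-up alias name:

```python
import subprocess

# An alias lives only in the shell process that defines it; a child
# shell started via fork/execve begins with a clean slate.
outer = """
shopt -s expand_aliases
alias greet='echo from-alias'
greet
bash -c 'greet' 2>/dev/null || echo child-missed
"""
result = subprocess.run(["bash", "-c", outer],
                        capture_output=True, text=True)
# stdout shows the alias working in the defining shell but not
# in the child shell it launches.
```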


In addition to the already-raised issue of subprocesses not inheriting aliases, I'd also be worried about aliases inherently being specific to particular shells. I'd hate to have to redefine those aliases for sh, csh, zsh, fish, and Lord knows what else. It'd also be an issue for invoking those tools without going through a shell in the first place - as is common for programs launching external programs as subprocesses.

That's indeed why I personally don't use shell aliases at all, instead opting for actual shell scripts in my $PATH. Those will work no matter what shell I'm using (if any).


`busybox whoami` is probably fine, but having to write `busybox ls`, `busybox grep`, `busybox cp` etc. would get tedious quickly.

Shell aliases don't solve all problems, even if you do:

    alias rm="busybox rm"
    alias xargs="busybox xargs"
    # etc.
you still have to write `xargs busybox rm`, because xargs won't use the shell alias.

But the main problem with this approach is that POSIX and LSB require certain binaries to be available at certain paths. When they're not, most shell scripts will just break.

The minimal standard solution is probably to create shell scripts for all of these, e.g. in /bin/ls:

    #!/bin/sh
    exec /bin/busybox ls "$@"
But this both adds runtime overhead (on every invocation!) and is quite wasteful in terms of disk space. Busybox boasts over 400 tools. At 4 KB per file, that's 1.6 MiB of just shell scripts. Of course that can be less if the file system uses some type of compression, which is common on embedded systems where storage is small, but it still seems to defeat the purpose of using busybox to create a minimal system.
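The disk-space arithmetic is easy to check. A sketch that generates such wrappers into a temp directory (the tool names are stand-ins, not real applets):

```python
import os
import tempfile

tools = [f"tool{i}" for i in range(400)]  # stand-ins for busybox applets
with tempfile.TemporaryDirectory() as d:
    for name in tools:
        path = os.path.join(d, name)
        with open(path, "w") as f:
            f.write(f'#!/bin/sh\nexec /bin/busybox {name} "$@"\n')
        os.chmod(path, 0o755)
    # The apparent size is tiny, but each file still occupies at least
    # one filesystem block (commonly 4 KiB) - hence the ~1.6 MiB figure.
    apparent = sum(os.path.getsize(os.path.join(d, t)) for t in tools)
```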


Well /bin/sh is also busybox, so I think you'd need

    #!/bin/busybox sh
    exec /bin/busybox ls "$@"

?


Great point!

Actually this observation invalidates the whole setup. Because even though you could define /bin/sh itself as:

    #!/bin/busybox sh
    exec /bin/busybox sh "$@"
you still cannot use #!/bin/sh in any other shell scripts, because for historical reasons the interpreter of a script is not allowed to be another interpreted script; it must be a binary. So /bin/sh pretty much has to be an actual binary.


Yes. Anybody who has shipped software would say so.

I really don’t think it is a debate. The usage of argv[0] is massively understated by the article. Just go look at gcc or any modern-day compiler. It's used so much that the question of whether we should has been hashed out by many different groups, yet they still chose to implement it.

The security concerns are a non-issue, as argv[0] was not the problem. It was a lack of technical knowledge of how systems work, and a flaw in the security application.


I think you’re both forgetting that bash has been using this trick for decades.

Bash has an sh compatibility mode that runs when you invoke it as sh.
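You can observe this directly: bash inspects the name it was started under and enables POSIX mode when that name is `sh`. A sketch assuming bash is on the PATH and symlinks are available:

```python
import os
import shutil
import subprocess
import tempfile

bash = shutil.which("bash")
with tempfile.TemporaryDirectory() as d:
    # Same binary, different argv[0]: a symlink named `sh`.
    sh_link = os.path.join(d, "sh")
    os.symlink(bash, sh_link)
    # `shopt -o posix` reports whether POSIX mode is active.
    as_bash = subprocess.run([bash, "-c", "shopt -o posix"],
                             capture_output=True, text=True).stdout
    as_sh = subprocess.run([sh_link, "-c", "shopt -o posix"],
                           capture_output=True, text=True).stdout
```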


Well, of course it's not only a matter of interactive usage (especially since the busybox shell itself could do the conversion). The problem is scripts, or worse, programs that invoke commands as subprocesses (programs whose source code you may not even have access to!).

What do you do? Replace every single occurrence of each command by prefixing it with `busybox`? Not ideal at all...


https://pubs.opengroup.org/onlinepubs/9699919799/

You appear not to realize that busybox is an essential component of a POSIX-like system.


That's fine for when users are interactively typing commands, but it doesn't work when the command is being run by a non-busybox program which expects commands to exist in the standard locations.


Although "eager" isn't called out, a recent study of academic publications shows that the use of LLMs can be measured through word frequency analysis [1], finding certain words are disproportionately represented:

> We study vocabulary changes in 14 million PubMed abstracts from 2010–2024, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words.

1: https://arxiv.org/html/2406.07016v1
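The measurement itself is simple to reproduce in spirit: compare the rate of "marker" words across corpora. A toy sketch (the marker list and sentences below are invented for illustration, not the paper's actual data or method):

```python
import re

# Invented marker list; the paper derives its list statistically
# from frequency shifts across 14 million abstracts.
MARKERS = {"delve", "intricate", "pivotal", "underscore"}

def marker_rate(text):
    """Occurrences of marker words per 1,000 words of text."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = sum(1 for w in words if w in MARKERS)
    return 1000 * hits / max(len(words), 1)
```

An abrupt jump in this rate for post-2022 abstracts is, roughly, the signal the study reports.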


I don't want to look for the source of the analysis right now, but I recall reading a study demonstrating that a large part, if not most, of the word frequency shift was caused by RLHF training done on data predominantly generated by people hired from lower-income English-speaking countries, which simply have a different dialect of English with a noticeably different frequency of certain phrases and expressions. So, for example, at least some versions of ChatGPT got RLHF-trained to speak more in a Nigerian English dialect.

Since there isn't a single English (learners are generally informed about the choice of UK vs US English only, but most English is spoken outside the UK and USA, in other places and other dialects) but multiple different Englishes, any English speaker will probably find something to be surprised by. And there is an economic incentive to get data from people other than the relatively expensive native speakers of UK or US English.


There wasn't a study or analysis. It was just lazy speculation that felt good because it could be bound up in an "evil white countries exploiting the developing world" narrative. Where exploiting was "paying to do a job".

It was submitted as https://news.ycombinator.com/item?id=40623629

Again, there is effectively zero real data showing this. Further, RLHF isn't likely to reinforce such word selection regardless.

A more logical, likely scenario is that training data is biased heavily towards higher grade level material, so word selection veers towards writings that you find in those realms.


> It was just lazy speculation that felt good because it could be bound up in an "evil white countries exploiting the developing world" narrative. Where exploiting was "paying to do a job".

Exploitation like that is in fact happening (see pretty much everything having to do with social media content moderation and RLHF to avoid disturbing content).

Also "paying to do a job" is not the moral panacea you seem to think it is.


Tinfoil hat theory: they implanted watermarks already, so that AI-generated text can be flagged for future training runs or as a service, such that some phrases are coaxed into becoming statistical beacons.


That's not really a tinfoil hat theory. That's been possible for some years and OpenAI reportedly does watermark their outputs, and can detect it. They just haven't released it as a service because it'd annoy all the users who are using it for cheating :)


I believe that if that was possible to do on purpose, they wouldn’t have so much trouble preventing the LLMs from talking about things they shouldn’t.


Yeah I would like to see some evidence of this too. It's just asserted as truth in the article. Delve doesn't seem like a particularly unusual word to me, especially in the context of scientific abstracts, and LLMs could totally learn random weird things. How common is "it's important to remember" in Nigeria?


Wait, why wouldn’t RLHF influence word choices?


I didn't say it wouldn't (or rather couldn't), I said it was unlikely for the selected hypothesis given standard training data vs RLHF iterations.


then again, most history consists of whitewashing back when northern countries were exploiting everywhere else in various ways: imperialism, colonialism, neocolonialism, capitalism, financialization,...

typical people prefer to pretend this is simply "order" and "progress"; seemingly blind to their own ideological baggage like fish in water


Yeah, right. ChatGPT was trained on Pidgin English dialect.

Have a look at BBC's Pidgin translation to get a taste, and tell me it's not a hoax: https://www.bbc.com/pidgin


The window of time where the word frequency of ChatGPT's favourites and the usage of ChatGPT are closely related is rather small, I think. Academic language has a number of 'marker' words that are basically just style and will be more or less copied once you read many papers. 'Rigorous' is a general example, but most fields have their own. If many papers you read while writing your own paper use words like 'delve', you will be much more likely to use it yourself.

On another note, while the paper itself is pretty cool, in discussions on it I thought people were kind of looking down on using LLMs to help you write. There's a philistine moat in many fields around writing style. While writing well is, in my experience, correlated with paper quality, quality is not predicated on it. And introducing tools that help people write more readable papers is probably a net benefit overall.


I wonder why some words are overrepresented. Isn't the whole idea of language models to model the word distribution as closely as possible? Does it have something to do with RLHF? Or is it the training data?


Language models would be fairly useless for most people if they accurately modelled the source distribution, no better than autocomplete. In fact, they were fairly useless when they modelled the source distribution; that's why ChatGPT was an instant hit whereas GPT-3 was mainly only interesting to other AI researchers.

What made LLMs suddenly interesting was that the responses were much more like answers and much less like additional questions in the same vein as the prompt.


> In fact, they were fairly useless when they modelled the source distribution; that's why ChatGPT was an instant hit whereas GPT-3 was mainly only interesting to other AI researchers.

I had a bot which used the original GPT3 (i.e. the completion model, not the chat model) and its answers were pretty decent (with the right prompting). Often even better than GPT3.5, whose answers were overly formulaic in comparison ("as an AI language model...", "it's important to ..." all the time)


I think that means you would count as "another AI developer" ^_^;


To what extent can this style be overcome by prompting?

If it can be overcome in existing models, it’s probably going to involve different aspects including vocabulary, style, and organization.


The most important import doesn't work, unfortunately:

  import antigravity


At least all is not lost:

  Welcome to PyPy.js!
  >>> from __future__ import braces
    File "<console>", line 1
  SyntaxError: not a chance
  >>>


That was the first thing I tried :)


For how Report URI works in your browser, see the original RFC [1] from 2015. The OWASP recommendation for HTTP Security Headers [2] gives some useful extra information on how the HTTP security headers hang together.

1: https://tools.ietf.org/html/rfc7469#section-2.1.4

2: https://www.owasp.org/index.php/OWASP_Secure_Headers_Project...


It's worth noting that the timeline on the left hand side suggests that the list was last updated September 2015.

This would explain why some more recent cryptocurrencies such as Zcash (ZEC) are missing.


Bootstrap V4 introduced spacing utility classes (like a class `m-t-1` to get `margin-top: 1rem !important`), which inspired others to create this great universal.css project: https://github.com/marmelab/universal.css


Don't play with our feelings that way...

    Is this a joke?

    Of course it's a joke. Use semantic CSS class names.


I was terrified for most of the way through that page that it was serious.


The first time I saw this project I went on an emotional roller coaster from shock & horror, through to laughing until tears fell out of my eyes :)


Laughed at this one:

  styleElement.appendChild(document.createTextNode(styleContent.replace(/;/g, ' !important;')));


Let me be the guy that brings this article in: http://journals.plos.org/plosone/article?id=10.1371/journal....

"LaTeX users were slower than Word users, wrote less text, made more typesetting, grammatical, and formatting errors"


Also, the usual rebuttal to that article. http://tex.stackexchange.com/a/219581. Anything that did not perfectly reproduce the typesetting of a reference document, they counted as a typesetting error. Any word placement at the end of one line as opposed to the start of the next, for example. The sort of thing that LaTeX handles automatically, and better than humans.


Murphy's law in action: there was a correction made to the paper you quote, so I clicked through to see what was corrected:

Notice of Republication

This article was republished on March 30, 2015, to correct the sizing and placement of the figures; none of the article content was changed. The publisher apologizes for the original layout errors. Please download this article again to view the corrected version. The originally published, uncorrected article and the republished, corrected article are provided here for reference.


> made more typesetting

Agh, I keep making the mistake of doing homework directly in LaTeX. It seems like you're able to think directly on the page, but it ultimately slows you down. I know way too much about mathematical typesetting as it is -- before long I'm making up rules about how Expectation should be \mathsf{E} and writing huge custom command files.

For one exam, I started doing all the exercises in the book in LaTeX. As an aspiring mathematician, I need to learn to better appreciate pencil and paper.

