Hacker News | jamesdutc's comments

I just could not disagree more.

This kind of rigid, singular view of operational workflows built on precomposed automations not only breaks constantly but also inevitably introduces huge inefficiencies.

I posted a longer comment on lobste.rs: https://lobste.rs/s/azpsqe/vertical_integration_is_only_thin...


yes! couldn't agree more with your long post. Especially this part: "(This is exacerbated when components of the automation require internal-only tooling—the poor data scientist now needs to go read through a bunch of half-written, out-of-date documentation about tools they simply don't care about to do a task that is not a core responsibility for them.)"

"Vertical integration" in my experience has just been turning a group of simple tools into a complex monolith that no one understands and is extremely difficult to debug


It's also very easy to complain about how employees take a résumé-focused approach to their work: “why should I bother to learn some internal-only tooling that I'll never use anywhere else (for a task that I don't really even care that much about…)?”

But, to borrow a line from Warren VanderBurgh's ‘Children of the Magenta’: “(in the industry) we created you like this.”

Another key flaw of precomposed automations for rigidly-defined work-flows is that they usually exist in precisely the circumstances that give rise to their own subversion. (I might even go so far as to suggest that the circumstances are the cause of both the mistake and the maladaptive behaviours that address the mistake…)

Ultimately, a deep stack of tightly-integrated components forming a precomposed automation that enacts some work-flow—“vertical integration” as the post frames it—is an obvious enough idea that it seems every big company tries it… only to fail in basically the same ways every time.


Love seeing Children of the Magenta come up!



This post contains a key misconception about the Python builtin data structures. Pointing it out may seem like sophistry, but it is key to understanding the semantics (and, thus, the most fluent use) of these tools.

All of the Python builtin data structures are ordered.

The distinction we should make is not between ordered and unordered data structures. Instead, we should distinguish between human ordered and machine ordered data structures.

In the former, the data structure maintains an ordering that a human being can use as part of their understanding of the programme. A `list` is human-ordered (and its order typically connotes “processing” order,) a `tuple` is human-ordered (and its order typically connotes “semantic” ordering, which is why `sorted(…)` and `reversed(…)` are rarely meaningful operations on it,) a `str` is human-ordered, and an `int` is human-ordered (if we consider `int` in Python to be a container type, despite our inability to easily iterate over its contents. Whether `complex` is a container or not is pushing this idea a bit too far, in part because I don't think anyone really uses `complex`, since NumPy dtype='complex128' is likely to be far more useful in circumstances where we're working with complex values.)
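
To make that distinction concrete, here's a quick sketch (the variable names are purely illustrative):

    # A list's order typically connotes processing order: sorting it is meaningful.
    scores = [88, 95, 72]
    print(sorted(scores))        # [72, 88, 95] -- same data, reordered for processing

    # A tuple's order typically connotes semantic position: sorting it destroys meaning.
    point = (3.5, -2.0)          # (x, y)
    print(tuple(sorted(point)))  # (-2.0, 3.5) -- no longer an (x, y) point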

In the latter, the data structure maintains an ordering that a human being cannot use as part of their understanding of a programme (usually as a consequence of a mechanism that the machine uses as part of its execution of the programme.) A `set` is machine-ordered, not unordered. If we iterate over a `set` multiple times in a row, we see the same ordering (even though we cannot predict this ordering.) In fact, the ordering of a `set` is intentionally made difficult for a human being to predict or use, by means of hash “salting”/seeding (that can only be controlled externally via, e.g., the https://docs.python.org/3.3/using/cmdline.html#envvar-PYTHON... `PYTHONHASHSEED` environment variable.)
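
A quick sketch of what I mean (assuming CPython; the particular ordering you see will differ, and `demo.py` is just whatever you save this as):

    s = {"alpha", "beta", "gamma", "delta"}

    # Within a single interpreter run, the machine ordering is consistent...
    assert list(s) == list(s)
    print(list(s))

    # ...but across runs it is deliberately unpredictable for str elements,
    # because string hashing is salted at interpreter start-up. Compare, e.g.:
    #   PYTHONHASHSEED=1 python demo.py
    #   PYTHONHASHSEED=2 python demo.py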

Historically, the Python `dict` was machine ordered. If we looped over a `dict` multiple times in a row (without changes made in between,) we were guaranteed a consistent ordering. In fact, for `dict`, the guarantee of consistency in this ordering was actually useful: we were guaranteed that `list(d)` and `list(d.values())` on a `dict` (with no intervening changes) would maintain the same correspondence order (thus `list(zip(d, d.values()))` would match `list(d.items())` exactly!)
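
Concretely, that correspondence guarantee is:

    d = {"a": 1, "b": 2, "c": 3}

    # Keys and values iterate in corresponding order (guaranteed, provided the
    # dict is not modified in between), so these two constructions agree:
    assert list(zip(d, d.values())) == list(d.items())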

When the compact-`dict` implementation was added to CPython, the Python `dict` became a very interesting structure. Note that, from a semantic perspective, there are actually two distinct uses of `dict` that we see in practice: as a “structural” entity or as a “data” entity. (Ordering is largely meaningless for the former, so we'll ignore it for this discussion.) With that implementation, the underlying storage for the `dict` became two separate C-level blocks of contiguous memory, one of which was machine-ordered (in hash-subject-to-seeding-and-probing/perturbation order) and one of which was human-ordered (in insertion order.) (From this perspective, we could argue that a `dict` is both human- and machine-ordered, though the only useful artefact we see of the latter is in `__eq__` behaviour, which this article discusses. Since “human ordering” is a guarantee, it supersedes “machine ordering.”)
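
Concretely, the `__eq__` point: two dicts built in different insertion orders compare equal, even though the human (insertion) ordering we observe when iterating differs:

    d1 = {"a": 1, "b": 2}
    d2 = {"b": 2, "a": 1}

    assert d1 == d2              # equality ignores insertion order
    assert list(d1) != list(d2)  # but the human-visible (insertion) ordering differs
    print(list(d1), list(d2))    # ['a', 'b'] ['b', 'a']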


This is a genuine concern, since it hinders our ability to port over high-quality, high-performance hash table implementations from other languages (which often do not preserve any human ordering.)

However, the ship has already sailed here. Once insertion ordering became the standard, it created a guarantee that we can't easily back down from.


The Python dict() is already high-quality and high-performance. You're probably not going to be able to do much better than the current state. If you want faster, you end up moving your data into an environment that lets you make more assumptions and be more restrictive with your data and operate on it there. It's how numpy, polars, and pandas work.

Everything in Python is a dict; there's no data structure that's been given more attention.

https://m.youtube.com/watch?v=p33CVV29OG8


> The Python dict() is already high-quality and high-performance

Yes, the CPython `PyDictObject` has been subject to a lot of optimisation work, and it is both high-quality and high-performance. I should not have implied that this is not the case.

However, there's a lot of ongoing research into even further improving the performance of hash tables, and there are regular posts discussing the nature of these kinds of improvements: e.g., https://news.ycombinator.com/item?id=17176713

I have colleagues who have wanted to improve the performance of their use of `dict` (within the parts of their code that are firmly within the structural/Python domain,) and who have wanted to integrate these alternate implementations. For the most part, these implementations do not guarantee “human ordering,” which means that they can be provided only as supplements to (and not replacements for) the Python built-in `dict`.

> moving your data into an environment that lets you make more assumptions and be more restrictive with your data and operate on it there. It's how numpy, polars, and pandas work.

Yes, the idea of a heterogeneous topology of Python code, wherein “programme structuring” is done in pure Python, and “computation” is done in aggregate types that lower to C/C++/Rust/Zig (thus eliminating dynamic dispatch, ensuring contiguity, &c.) is common in Python. As you note, this is the pattern that we see with NumPy, Polars, pandas, and other tools. We might put a name to this pattern: the idea of a “restricted computation domain.” (I believe I introduced this terminology into wide use within the Python community.)
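
A minimal sketch of the pattern (assuming NumPy is installed; the function names are purely illustrative):

    import numpy as np

    def mean_square_pure_python(xs):
        # "programme structuring" domain: fully general, dynamic dispatch per element
        return sum(x * x for x in xs) / len(xs)

    def mean_square_restricted_domain(xs):
        # "restricted computation domain": homogeneous dtype, contiguous storage,
        # loops lowered to compiled code
        a = np.asarray(xs, dtype=np.float64)
        return float(np.mean(a * a))

    data = list(range(1, 1_001))
    assert abs(mean_square_pure_python(data) - mean_square_restricted_domain(data)) < 1e-6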

However, not all code shows such a stark division between work that can be done at high generality (and, correspondingly, low performance) in pure Python and work that can be done at low generality (but very high performance) within a “restricted computation domain.” There are many types of problems wherein the line between these is blurred, and it is in this region that improvements to the performance of pure Python code may be desired.


> they can provide these tools only as supplements (and not replacements) of the Python built-in `dict`.

Maybe they’ll walk back on that if there is a compelling new implementation that isn’t ordered. Or they keep dict as is and use the better implementation for internal dict-like structures.


or add an unordered_dict that people can opt in to if they want the higher performance and don't need the order guarantees. I think having the ordered dict be the default is more useful in the general case.


> Malaysian guy has to watch a BBC cook make rice (https://youtu.be/53me-ICi_f8?si=0AaZ82dk_AYFqJAx&t=226)

Of course, this video is just stupid accent comedy, but we should be careful not to draw too much from it. (Let's also set aside the specifics of making fried rice.) The implication of the section of the clip you linked is that the presenter (Hersha Patel) does not know how to make rice properly, and this is evidenced by her cooking it in too much water and draining it.

But this is not correct.

There are, in fact, many different varieties of rice, different cuisines that incorporate rice as a major component, and different styles of cooking rice. Cooking (certain varieties of long-grained? rice) in an open vessel, cooking with an excess of water, and draining the water afterwards is an extremely common and popular way to prepare it for use in some cuisines: e.g., https://youtu.be/TARO_R4cE24?t=420

When this video first made the rounds some years ago, it was surprising to see how confidently people would weigh-in on this topic, despite demonstrating very little background or knowledge. (There's a big difference between saying “that's not the appropriate way to do this in this circumstance” and “that's completely wrong,” and the former creates space to derive knowledge. After all, the dish in the video is a popular one, even in cultures that predominantly eat jasmine or basmati rice, and there are interesting variations in technique and flavour that arise as a consequence!)

> Mexican moms react to Rachael Ray trying to cook (https://www.youtube.com/watch?v=zFN2g1FBgVA)

I similarly do not understand why these kinds of reaction videos are popular. There are slightly better versions of this format (e.g., https://youtu.be/DsyfYJ5Ou3g?t=182) but they are drowned out by this kind of fluff. What does one really gain from interacting with such criticism?

Perhaps there is something to be learnt from these situations: ones where, equipped with just a little bit of knowledge, we derive unearned confidence, and use this confidence not to venture forth more boldly in search of knowledge, but to convince ourselves of our own superiority.


Congratulations to the creators for successfully releasing this product!

(I'm trying to start with a positive tone, since I have only negative things to say about the site itself. I want to make sure that I'm coming across as critical without coming across as mean.)

I spent a few minutes generating a couple of sample stories using their prompts for the pair that I'm most qualified to evaluate, “English”→“Chinese (Traditional)”, and just wasn't very impressed. Honestly, I think the approach is largely a dead-end.

Let's set aside that “Chinese (Traditional)” is not a language, and that someone with experience learning or teaching Chinese ought to know this (and, as I will argue, knowing this is critical to producing high-quality educational materials!) That the creators of this tool aren't particularly familiar with the languages themselves is probably much less consequential than that they don't really appear to be familiar with the pedagogy of teaching or learning languages.

One would anticipate that the languages that most learners want to learn are subject to broad market forces, and that, as a consequence, these languages already have a variety of high-quality, human-written primary texts and educational texts (many of which may even be free-to-access!) For the language pair I tested, this is definitely true, and I would encourage every learner to start with those materials (and to avoid anything AI-generated.)

(Of course, if I wanted to learn a less-common language where materials are hard to find this might be marginally useful—e.g., Telugu probably has more total speakers than Italian, but my local high school probably has an Italian class—but I would wonder whether the training set would be good enough to accurately reproduce the language. I suppose if I wanted to learn an endangered language, where there may simply not be enough native speakers to maintain a rich catalogue of written language, then someone could train an AI to reproduce this language to aid in learning, but a similar question arises as to whether this kind of preservation or reconstruction is sufficiently “faithful.”)

It's absolutely the case that AI tools are at a point where (for common languages) they are able to reliably generate grammatically accurate language, independent of its factual accuracy. Indeed, while I could spot fluency issues in the sample stories I reviewed (since, of course, “Chinese (Traditional)” is not a language,) I could not spot outright grammatical errors. (This is an impressive accomplishment for AI models!)

But this is really a solution looking for a problem (and, in my opinion, finding the most obvious but also least useful.)

Contrast these randomly generated stories with the equivalent from a human-generated educational resource. In the case of a human-generated educational resource, the quality of language may actually be worse than that in the AI-generated resource (even given how sloppy AI writing tends to be!) In fact, in the case of Chinese (“Traditional” or otherwise,) this is absolutely guaranteed to be the case for an introductory text. Almost all introductory texts will be written in a very choppy, repetitive style: e.g., 「那隻狗很可愛。我養的狗也很可愛。」 (“That dog is very cute. The dog I keep is also very cute.”)

(It's likely the case that even intermediate and advanced learning materials will not resemble actual primary texts. e.g., I was reading the news the other day and came across the sentence 「北捷重申,無論任何年齡,各車站閘門前的黃色標線內一律禁止喝水等飲食行為,除非是身體不適或母乳哺育」 (“Taipei Metro reiterates that, regardless of age, eating and drinking (even drinking water) is prohibited within the yellow lines in front of the gates at every station, except in cases of physical discomfort or breastfeeding”), which is perfectly appropriate for an intermediate learner… except 「閘門」 (“fare gate”) is simply not useful or appropriate textbook vocabulary!)

So why is the human-generated educational material better? Well, there's a lot of design to writing these kinds of materials. How do we teach and reïterate the most broadly useful grammatical structures and vocabulary? How do we teach this in a way that maximises retention? (And, often, how do we expose the learner to useful cultural background that will help them when they visit a region where the language is spoken?)

All of this is visible in human-generated materials, yet none of this is evident in these AI-generated materials. It is, in fact, this design that makes these materials useful in the first place. In the absence of it, we end up with vocabulary lists that define 「狗:dog」 next to 「呈現:to emerge」 where a human educator would align the difficulty of these terms to the order and process in which a human learner would learn them. Similarly, a human educator knows how to evolve a student's fluency with language and understanding of tone and register, taking them from 「媽媽: mother」 to 「母親: mother」 perhaps even strategically including 「媽咪: mommy」 or even 「阿母 a-bú: mother (台)」 to engage the student. (Real educators do this very often, and students tend to really like it when they get “fun fact”-style local flavour!) I have not seen anyone attempt to introduce any of this design into AI-generated learning materials, and I suspect this is why they always come across as being so bland and mushy. Instead, the AI-generated materials are creating only rote practice items (which is why their prompts typically include things like “limit the generated text to use only vocabulary as published in the prep materials for such-and-such language proficiency exam.”) This kind of practice is, indeed, useful, but it's debatable whether it's measurably more useful than just spaced-repetition with flashcards.

Now, contrast these materials with primary texts (i.e., written language artefacts produced for an audience of native speakers.) Primary texts are often very difficult to incorporate into language learning, especially for languages like Chinese. This is probably because at the introductory level, the materials simply aren't dense enough for an adult learner, and at the advanced level, probably because these materials are far too challenging given the amount of specialised terminology and vocabulary used. (There are, in fact, very appropriate materials that sit between these extremes, such as news magazines or short stories written for middle schoolers, but these materials can be hard to access.)

The benefit of the primary text is that it is very close to the actual goal of the learner: I really don't want to read a story about a lost dog, and I only do it because with enough practice reading such drivel, I might eventually read ‘Dream of the Red Mansion’ or ‘Red Sorghum.’ As a consequence, what most learners will reach for are “graded readers,” which are adaptations of well-known works with simplified language and grammar. I'm on the fence about how well AI can create these for us. On the one hand, there is a pedagogical and creative dimension to producing a good graded reader. The former may be possible to approximate with additional prompting (“use only vocabulary from this list; use only grammatical structures familiar to a learner at this tested level,”) but I'm not sure about the latter. The reader is probably losing a lot when we simplify Gandalf to ‘Run away now!’

So while I'm quite hopeful that AI technologies can improve language learning, this kind of tool just doesn't seem to add anything to what already exists and is already much better.

The approach is just too obvious. I think it's too focused on finding a way to adapt something we know that AI can do well (generate grammatically correct text) to something we want to be able to do more cheaply or effectively (teach language learners how to read) without really considering how to solve this problem.


I totally disagree. We know AI can code well, and still Cursor and Windsurf are worth billions of dollars, because it works.


Agreed.

I have first-hand experience across five distinct AMD 7840U and AMD 8840U devices showing that near-perfect, out-of-the-box Linux support (with stock kernels and no dodgy kernel flags!) is possible. This includes support for S0ix suspend.

https://news.ycombinator.com/item?id=43083669

I don't doubt it when people recount their bad experiences with AMD devices; however, my experience should serve as an existence proof that it's not a universal experience.

In the case of each device mentioned in the comment above, I followed a standard installation procedure from an Arch installer USB. I use only stock kernels: linux, linux-lts, and linux-zen. For almost all of the devices, the only kernel flags I pass are for enabling hibernate or handling FDE. (In one or two cases, the devices have portrait displays that have been installed for use in landscape-orientation. These need an `fbcon=rotate:…` kernel flag.)

In all but one case (the OneXPlayer X1 Ryzen) everything (except fingerprint readers) works flawlessly. In the case of the OneXPlayer X1 Ryzen, there is an intermittent issue with hang on suspend, but that may have gone away with a recent kernel update. If not, I'll probably come back to this blog post and see what I can do…


It can be really hit-or-miss, and it can be really hard to debug errors like in the post.

A lot of workarounds that are suggested for various issues are also not really viable. Some of the workarounds involve turning off different power-saving modes; however, the point of enabling sleep is often to increase the amount of usable time between charges, and turning off these power-saving modes can often dramatically shorten battery life.

But getting sleep to work (even S0ix!) is not impossible.

I have a bunch of handheld AMD 7840U and AMD 8840U devices that I have installed Arch Linux on: GPD Win Max 2, GPD Win Mini, GPD Win 4, Minisforum V3, OneXPlayer X1 Ryzen. These devices were not designed with Linux support in mind. I would be very surprised if the companies that made them ever tested them with Linux. Yet with just a small amount of work (generally fiddling with `/proc/acpi/wakeup` and `/sys/devices/*/*/*/power/wakeup` to disable sources of spurious wakeups,) I have gotten essentially flawless S0ix support (… on all but the newest OneXPlayer X1 Ryzen.)
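
For the curious, the fiddling amounts to something like the following (run as root; `XHC0` here is only an example device name, since the actual offenders vary per machine, so read the list first and experiment):

    from pathlib import Path

    wakeup = Path("/proc/acpi/wakeup")

    # Each line looks roughly like: "XHC0  S4  *enabled  pci:0000:c4:00.3"
    print(wakeup.read_text())

    # Writing a device name toggles its wakeup state (enabled <-> disabled).
    # XHC0 is only an example; substitute whichever device causes spurious wakeups.
    wakeup.write_text("XHC0")

    # The per-device knobs under /sys take "enabled"/"disabled" directly, e.g.:
    #   Path("/sys/devices/…/power/wakeup").write_text("disabled")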

(In general, out-of-the-box stock Linux kernel support on these devices is fantastic. Touchscreens work, pen input works, wifi and Bluetooth work well. The only gap I've seen is fingerprint reader support.)

I suspect that given how small these manufacturers are (and how small their production batches must be,) there's much less extreme-customization and tight-integration of components. This is visibly evident in the form-factors of these devices, which are many millimeters thicker than they might otherwise be. (Of course, these devices are primarily advertised to a gaming audience who are eager to avoid the thermal-throttling that happens with ultra-thin devices like Surface Pro…) I partially suspect that the lack of extreme-customization, the lack of tight-integration, and the smaller production batches mean that the manufacturers make much more conservative choices in components. Maybe this explains the exceptional Linux support?


Hi, do you have these tweaks published somewhere? I'm particularly interested in knowing your GPD Win Mini tweaks.

Thanks


This is not a comment that correctly describes how these two entities realistically operate and interact. I don't know why people keep repeating this as though it were insightful.

Whatever your own position on this matter may be, it is important that we factually describe the positions of the parties directly involved.

Hopefully we can use this as an opportunity to spread a more accurate description of the dynamics at play.

For reference, here is a speech by William 賴清德 (https://en.wikipedia.org/wiki/Lai_Ching-te) outlining the position under which he operates. The phrasing he uses is one that is consistent across all of his public remarks; consistent with remarks by Louise 蕭美琴 (https://en.wikipedia.org/wiki/Hsiao_Bi-khim) and other close associates of William 賴清德; consistent with remarks made by 蔡英文 (https://en.wikipedia.org/wiki/Tsai_Ing-wen); and consistent with how the structures in Taiwan have operated over at least the last decade.

《賴清德就職演說:兩岸「互不隸屬」》 (“Lai Ching-te's inaugural address: the two sides of the strait are ‘not subordinate to each other’”): https://youtu.be/oLO5bYF8lDs?t=139

The relevant remark is: 「由此可見,中華民國與中華人民共和國『互不隸屬』」

Here is my (manual) translation: “From this it can be seen that the Republic of China and the People's Republic of China are ‘not subordinate to each other.’”

Here is a Whisper-generated transcript of the entire speech, with an OpenAI-generated translation inline. I skimmed the translation, and it adequately conveys the speaker's meaning and intention. (It does fail to convey the delicacy, careful phrasing, and specific rhetorical choices made by the speaker that are extremely clearly visible in the Chinese. However, these aspects are harder to convey if you don't have at least some prior knowledge of this topic.) https://pastebin.com/fGxHUpXN

It is true that there are historical positions, (historical) on-paper claims, and even a variety of differing positions and lively debate on this issue across all of the populations involved. But the conclusion one might draw from the comment above is wholly incorrect. It simply isn't the framing within which William 賴清德 and associated parties are actually navigating this issue.


I think you may want to clear the environment (e.g., of `SSH_AUTH_SOCK`) and isolate in a PID namespace as well. I also reflexively pass `--as-pid-1 --die-with-parent`.

    bwrap --dev-bind / / --clearenv --tmpfs ~ --unshare-pid --as-pid-1 --die-with-parent ssh terminal.shop
(The `bwrap` manpage says “you are unlikely to use it directly from the commandline,” yet I use it like this all the time. If you do, too, then we should be friends!)

