I really enjoyed the article, reading it more from the perspective of what 21st-century lexicography could be, less as a customer of a word game however thoughtfully designed. As a Wiktionary editor (and Android user who's also grown out of bare word-relationship puzzle games) though, it's sad that there seems to be no way to just use the end-product network as a reference, which I would love to do, but I suppose they did spend a million bucks on it.
I'll also use this post to wish that more people would edit Wiktionary. It has such a good mission (information on all words) and yet there are only like 80 people editing on any given day or whatever. In some languages, it's even the best or most updated dictionary available. The barriers to entry and bureaucracy are really not high for HN audience types.
> it's sad that there seems to be no way to just use the end-product network as a reference, which I would love to do, but I suppose they did spend a million bucks on it.
From the OP: "This research and computational scale was made possible by $295k NSF SBIR seed funding (#2329817) and $150k Microsoft Azure compute resources." Does that NSF funding mean it's open source? Also, I'm not 100% sure that the quote applies to all the research rather than just one component of it.
> I'll also use this post to wish that more people would edit Wiktionary. It has such a good mission (information on all words) ...
I support open source, contribute to it, and love the spirit of Wiktionary, but I don't understand the practical reality of applying 'wisdom of the crowds' to a dictionary, especially the English edition, for two reasons:
Definitions are highly accurate (complete, correct, consistent), highly precise things - otherwise, what is their value? Assuming Wiktionary is descriptive - reporting the words' actual usage - it takes quite a bit of scholarship, skill, and editorial resources not to mislead people. I can't just write what I think it means - the meaning to me might not match the meaning to the person at the next desk. It takes quite a bit of research, using powerful (and sometimes expensive) tools, and understanding of lexicography to be complete and also precisely correct, including usages in places and times that are mostly unknown to any particular author. Also, writing definitions is tricky: You are using words - which have those aforementioned problems with meaning - to define words. Also, any writing anywhere can be easily misinterpreted - skill and editors are needed to avoid misunderstanding. How is the accuracy and precision problem solved?
Also, in English there are already many authoritative sources, many with a century of professional lexicography behind them by the best in the business. Some are free. There are also meta-lookup engines such as Wordnik and OneLook. Why use Wiktionary? The few times I've compared definitions or etymologies, the authoritative sources almost always exceed or equal Wiktionary (though online copies of older print editions suffer from the minimalism caused by the constraint of printing costs). Arguably, there is nothing else both unabridged and free: Oxford unabridged costs $, so does Merriam-Webster (the free edition is abridged); American Heritage is free, but has the minimalism issue I mentioned above.
I can answer that one. I have free access to the Oxford English Dictionary (OED), which is brilliant and generally more detailed and reliable than Wiktionary when it has the word I'm looking for, but their login page is so awful that I sometimes use en.wiktionary.org instead just to save my time and temper. Also, en.wiktionary.org has proper nouns, other languages, and occasionally it has some recent or technical English word that OED does not have. So if I'm doing some serious amateur research: OED. But if I'm doing a crossword and want to check that a word exists and is spelt how I think it is: Wiktionary.
I'm one of those people who says, unironically, "words have meanings." I readily argue with people who present "language is living and evolves" - sure, but in order to communicate we have to agree on a decent subset of overall definitions.
I enjoy etymology, maybe too much. It's like magic, finding out what a barrow was, or how filibuster has a direct lineage to pirates (freebooters... In Dutch.)
I can't afford, really, the nicer Old English, Scandi, Frisian, Norse, etc. etymology dictionaries. I have incomplete scans that were printed and bound of some of them. I still have 6 etymology dictionaries, so I can be about as quick getting a dictionary as getting on the computer and going to !eo.
> in order to communicate we have to agree on a decent subset of overall definitions.
Sociologically speaking, however, it is precisely that agreement that evolves, alongside changes in spelling and pronunciation (and occasionally "new" words).
>I'm one of those people who says, unironically, "words have meanings." I readily argue with people who present "language is living and evolves" - sure, but in order to communicate we have to agree on a decent subset of overall definitions.
A few things.
>we have to agree on a decent subset of overall definitions.
Yes but we should fairly obviously understand that a word can have multiple, often competing meanings, and make an effort to learn the new ones as they become available.
As language shifts, and it's shifted rapidly in my own lifetime, you can either make an effort to keep up, or be a sourpuss and refuse to understand changes in language.
It seems to me there's usually a political dimension to people who refuse to understand what people mean, because it's easier to denigrate people if they cling to definitions that aren't intended by their political opponents' use of a word.
I see this shit constantly, mind. Gender. Liberty. Capitalism. Communism. People get stuck fighting useless battles over the right to define a word instead of just learning and embracing their opponents' intention.
> It seems to me there's usually a political dimension to people who refuse to understand what people mean, because its easier to denigrate people if they cling to definitions that aren't intended by their political opponents use of a word.
and to an extent, the rest of your comment - the solution, according to my PhD friend, is to establish the framing of the argument before you actually have the argument. It's more fun to not establish framing, but it's more effective to establish framing first. I wonder if I have the publication (thesis?) he made on my NAS.
I don't think definitions "are" highly accurate precise things. Sometimes yes. The same scholarship, skill, and need to not mislead also applies for so many other things: encyclopedic articles, taxonomies, news, maps, operating systems. Do people still question the value of Wikipedia, OpenStreetMap? Yeah, there are problems with them, and with peer review. Using fuzzy words (or fuzzy phonetic symbols, fuzzy categories, fuzzy semantic links…) to define words is a problem (if at all) of literally any dictionary. I don't see any of these as particularly unique obstacles for Wiktionary.
Unabridged dictionaries take decades to release new editions and are still navigating transition into the exploding digital age. They are so expansive in scope, while often so limited in resources, and barely accept any crowd contributions. Such deliberately slow-going is often a good thing, but words also change quite quickly and these sources are now playing a very long game of catch-up. (Yesterday I tried to verify the latter English senses of "fandango" on Wiktionary with other dictionaries; OED's entry has not been touched for 131 years! What am I going to do with that, I need to use / understand the word now!)
Wiktionary is the big web-native word-resource (and is not cluttered with commercial junk) – allowing links, expandable quotes, images, diagrams, etc. that print's minimalism suffers from as you mention. When someone in 2025 wants information on a word, they'll likely use a search engine and click a link to Wiktionary (from which Google's blurbs lift some of their data). Maybe they are a student wanting to confirm their nonstandard pronunciation with the IPA (still rarely used in mainstream English dictionaries) or if it's recognized in their own dialect (mainstream dictionaries rarely provide more than UK and US pronunciations) – if enough people have the same question, Wiktionary seems like the best place to put the answer – or see an accessible etymology tree. While you probably know this, it's also worth remembering that English Wiktionary isn't just for English words: it is a dictionary of all languages' words, which is written in English. It has metadata and links connecting languages' words that you can't find elsewhere.
Yes, I indeed do want people to just write what they think a word means – as a starting point in a collaborative refining process. I believe the number of word-users in the world with valuable potential contributions is a lot closer to a billion than the thousand gatekeepers working hard on classical dictionaries. The barrier to entry is really low, but the tooling could still be much better. This is one reason I'm putting my appeal under this article - because I think (professional) lexicography can stand to evolve more in the 21st century. (And are people today really buying enough dictionaries to sustain a professional version of Wiktionary, or even a professional dictionary offered in structured data form?) If we don't contribute to a crowdsourced dictionary, then we won't have any such thing.
(Meta-lookup sites are link/search engines, not dictionaries and IME really don't do a good job synthesizing their information or conventions.)
Wiktionary can be of great value without denigrating others.
> Unabridged dictionaries take decades to release new editions and are still navigating transition into the exploding digital age.
OED is now a 100% online service - a website - that releases updates every quarter, like much software. I don't see them 'still navigating' at all.
> barely accept any crowd contributions.
OED is famous for being arguably the first crowd-sourced research project. James Murray, the first great editor and driving force behind the first edition, solicited contributions from the public of usages of words and had a massive filing system of slips with all the contributions.
"Dictionary work relied on so much correspondence that a post box was installed right outside Murray’s Oxford home ...". "His children (eventually there were eleven) were paid pocket money to sort the dictionary slips into alphabetical order upon arrival." [0]
Today OED still solicits contributions, including specific appeals to the public. Every entry in the OED has a 'Contribute' button.
> (Yesterday I tried to verify the latter English senses of "fandango" on Wiktionary with other dictionaries; OED's entry has not been touched for 131 years! What am I going to do with that, I need to use / understand the word now!)
You are misunderstanding what 'revise' means to the OED (which is unnecessarily confusing); they still update entries without a full revision. If you look at the entry history:
fandango, n. was first published in 1894; not yet revised.
fandango, n. was last modified in March 2025.
> I don't think definitions "are" highly accurate precise things. Sometimes yes. The same scholarship, skill, and need to not mislead also applies for so many other things: encyclopedic articles, taxonomies, news, maps, operating systems. Do people still question the value of Wikipedia, OpenStreetMap?
I think there's a difference between requirements - or expectations - for a dictionary and Wikipedia:
My guess is that people don't question Wikipedia because they have different expectations for it: They don't expect accuracy, as defined by the Three Cs: Completeness, Correctness, Consistency. Wikipedia is more the accumulation of information generally believed about a topic (with some standards, imperfectly followed, for secondary source support - but secondary sources reflect general, consensus belief). It's not expected to be Complete; no encyclopedia can completely cover any topic - the point is to be a starting place, a summary - and anyway Wikipedia is a sort of work in progress. It's not expected to be Correct; it's what people generally believe. And Consistency is tough with so many authors. It's really a product of the post-truth era; that's what people want - just try questioning it.
People's expectation for dictionaries - or my expectation at least :) - is not a starting point but the final word. Almost always I already have an idea of what the word means - from partial knowledge, from experience, from context, from its components. I'm expecting the Three Cs from the dictionary, to put a fine point on my understanding and use of the word, to fill in my blind spots - including knowledge of how others have been understanding and using the word.
Maybe Wiktionary just isn't for me. But I worry that people do assume it's CCC - many people believe anything they read is accurate, especially something from an authoritative-looking source - and are confused by it.
Could I make a plea to make a Wiktionary export easier to find/use? Assuming I can even find the magical page which hosts them, Wikipedia dumps are terribly documented and seem to incorporate shorthand which I do not recognize.
And they are full of wiki markup, templates, and inconsistent formatting. A human brain can easily understand it, but automated parsing is impossible (pre LLM).
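To make the parsing complaint concrete, here is a minimal sketch of what naive cleanup looks like. The sample line is made up for illustration (it is not copied from a real entry), and the two regexes only cover flat `{{...}}` templates and `[[...]]` links. Anything a template encodes, such as a usage label, is simply thrown away, which is exactly the kind of loss that makes rule-based parsing of the dumps so unsatisfying.

```python
import re

# Matches an innermost template such as {{lb|en|informal}} (no nested braces inside).
TEMPLATE = re.compile(r"\{\{([^{}]*)\}\}")
# Matches [[target]] or [[target|label]] and captures the visible text.
LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")


def strip_wikitext(line: str) -> str:
    """Reduce one definition line to plain text, discarding template content."""
    text = line.lstrip("#").strip()
    # Remove templates repeatedly so nested ones collapse from the inside out.
    previous = None
    while previous != text:
        previous = text
        text = TEMPLATE.sub("", text)
    # Keep only the visible part of each link.
    text = LINK.sub(r"\1", text)
    # Collapse the whitespace left behind by removed markup.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical definition line, written in the general style of wiki markup.
sample = "# {{lb|en|informal}} A lively [[dance]] in [[triple time|triple]] metre."
print(strip_wikitext(sample))
```

Real entries mix many templates with their own argument conventions, so a sketch like this breaks down quickly; it is only meant to show where the difficulty starts.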
Which words should be attested? Presumably only uncommon ones? And how is it done, is the "quotes" section the attestation? Is there vandalism to clean up, like people adding their own names to define themselves as awesome? Wiktionary seems to "just work", and I don't really understand what holds it together.
I have a feeling that LLM model collapse will be accelerated as humans lose control of smaller Wiki projects like Wiktionary.
They’ll be unable to effectively patrol or prevent generative updates to the project, and for all intensive porpoises, humans will be unwilling to step foot into disputes, and AI will have free reign to redefine all human knowledge.
I second that! I have edited a few Wiktionary pages myself, and find it's a better overall environment than Wikipedia, if you can find something meaningful to add.
Quora and Pinterest are particularly routine spam sites in my search results.
They rank just below word reference site spam, like dictionaries, thesauruses, or translation dictionaries (sites which I do benefit occasionally from), and below Wikipedia mirrors (which I feel has become so bad that I can't even get legitimate results talking about the problem itself! Try searching something like: search results spam wikipedia mirror "revolvy" "wikiwand").
But for me, the worst (and most obvious!) offenders by far are "pronunciation guide" spam sites. Just a few examples:
plus the scourge of 16-second YouTube videos on channels with names like Pronunciation Guide or Emma Saying.
(If you search for something like "Deidesheimer pronunciation" or "pronounce Canynge" on Google, the vast majority of results will be those spam sites, plus maybe an ancient forum thread from 2004 that veered off topic before anyone even tried to give a serious yet uninformed answer.)
These ad-infested spam sites purport to teach you how to pronounce an unfamiliar name or tricky word (an important and underappreciated service that many people use!). But usually they merely contain computer-generated bullshit, as if fed directly into all available text-to-speech algorithms. Even the ostensibly human-generated recordings and sites are often flagrantly wrong, unsourced, and untrustworthy.
There are a few legitimate sites (such as Forvo, Youglish, etc.), but too often they are woefully incomplete (by nature of their being crowdsourced). Forvo even contributes to the spam with "do you know how to pronounce this word?" false positives.
I once blocked all of the spam sites when the domain-blocking feature you mentioned was built into Google Search; then had to do it once again when I needed a browser add-on to replace the removed feature (which naturally only worked on desktop); and recently I was astonished to find that the add-on also stopped working! The spam never ends.
I know, right? How hard would it be for someone to make a wiki-style site with UGC, a reputation system (I speak this language natively and vouch / do not vouch for this content), a tracker for trending words, fun articles for Llanfairpwllgwyngyll and the like? The Web I used to know seems to be gone or at least in steep decline, supplanted by garbage like this.
There are also the White pages duplicates. Search for a phone number or some digits that resemble a phone number and the first 5 pages of results are all autogenerated reverse lookups under various domains catering to the different plausible personas.
Try searching for “1549 USD in EUR”, and you'll find links to sites that auto-generate individual pages for every amount.
One might argue that the SEO garbage here is less bad, since there really isn't any alternative site they're stealing hits from, but it's still a sign that shows just how horrible the web has become.
The execution of this visualization was rather disappointing.
I didn’t like the overly cute text (the description of the Simpleton algorithm was almost incomprehensible), the low-contrast captions and colorblind-unfriendly color scheme, and the limited navigation (there was no way to go to the previous slide within a chapter, for example).
But more importantly: If you are going to design an entire interactive exercise like this, graphs are a much better way to explore the effects of varying different parameters. Trying to experiment (as instructed) with different parameters by watching animations in the various chapters and the "sandbox mode" included in this simulation was not only tedious, but prevented effective comparisons. If you just run each iterated tournament (from chapter 4 and onwards) by pressing the "Start" button, there is too much going on simultaneously at a high speed to follow along – I would recommend a sorted table or bar chart rather than many multi-digit numbers arranged in a circle – while stepping through is too slow to keep everything in your head.
I noticed that some of the other "explorable explanations" by the same creator include graphs; I think omitting them from this visualization was a mistake. http://explorableexplanations.com/
Do you know of any existing forced alignment tools that work well with live audio (microphone) input? I would like to create a live stream in which the words of a known text are displayed as they are being spoken into a microphone.
For sure aeneas is not suitable, since it requires all the text and all the audio in advance.
ASR-based tools would in theory allow such an operation mode, but I have not seen aligners that read from the mic buffer directly or have a built-in option/CLI for it.
Knowing the text in advance basically means that you can train your own language (textual) model adapted to that exact text, and then use the (standard) acoustic model for your language and aligning procedure as usual. Hence, I am quite sure you can tweak e.g. CMU Sphinx or Kaldi to do it. Perhaps gentle (which is based on Kaldi) is worth looking into.
I looked into gentle a few weeks ago and did notice that it seems to use an online algorithm. It doesn’t have built-in support for live audio input unfortunately, but it may be tweakable as you say (such as reimplementing it to use audio streams that work with either static or real-time input). I guess there’s no other way to find out than just try it myself.
Another possibility is to just run an automatic speech recognition system (e.g. Sphinx or PocketSphinx can read from the mic input), and align its output with the ground truth text.
You need to deal with imperfect matching because the ASR might produce a text slightly different from the ground truth, but if you want to chunk e.g. at sentence granularity (and then move on to the next sentence), you should be able to do it in real time.
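A rough sketch of that sentence-level idea, using only Python's standard `difflib` for the fuzzy matching. The ASR output is simulated with plain strings here; hooking up a real recognizer and microphone is left out. The class name, the 0.6 threshold, and the word normalization are all assumptions chosen for illustration, not part of any existing aligner.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and strip basic punctuation so ASR output compares fairly."""
    return [w.strip(".,;:!?\"'").lower() for w in text.split()]


class SentenceTracker:
    """Tracks which sentence of a known text is currently being spoken."""

    def __init__(self, sentences, threshold=0.6):
        self.sentences = [normalize(s) for s in sentences]
        self.threshold = threshold  # assumed similarity needed to move on
        self.current = 0            # index of the sentence being displayed
        self.heard = []             # recognized words since the last advance

    def feed(self, asr_text):
        """Add newly recognized words and return the current sentence index."""
        self.heard.extend(normalize(asr_text))
        if self.current < len(self.sentences):
            target = self.sentences[self.current]
            # Word-level similarity tolerates a few misrecognized words.
            ratio = SequenceMatcher(None, self.heard, target).ratio()
            if ratio >= self.threshold:
                self.current += 1
                self.heard = []
        return self.current


tracker = SentenceTracker(["The quick brown fox jumps.", "Over the lazy dog."])
print(tracker.feed("the quick"))          # still on the first sentence
print(tracker.feed("brown box jumps"))    # one wrong word, but close enough
print(tracker.feed("over the lazy dog"))  # second sentence finished
```

This only tracks progress at sentence granularity; word-level highlighting would need the timing information that a proper forced aligner provides.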
I bet it would look really cool to have 3D primitives that can be oriented with respect to the time dimension. So for example, a spherical primitive would be a circle that grows and shrinks as frames progress.
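For the spherical case the geometry is simple: slicing a sphere of radius R centred on frame t0 at frame t gives a circle of radius sqrt(R^2 - (t - t0)^2). A minimal sketch, assuming time is measured in frames and uses the same units as the sphere's radius (the function and parameter names are made up for illustration):

```python
import math


def cross_section_radius(sphere_radius, center_frame, frame):
    """Radius of the circle where a space-time sphere intersects one frame."""
    offset = frame - center_frame
    if abs(offset) > sphere_radius:
        return 0.0  # the sphere does not reach this frame
    return math.sqrt(sphere_radius ** 2 - offset ** 2)


# The circle appears, grows to full size at the centre frame, then shrinks.
for frame in range(5, 16):
    print(frame, round(cross_section_radius(5, 10, frame), 2))
```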
> Well, a problem could be that you get a lot of noise if all the frames are rendered independently.
The site shows an example of what a still image looks like if rendered with the same parameters several times; it provides an interesting animated effect. I think that might produce an interesting animation style.
But for a more consistent style between frames, it'd help to have a way to seed the RNG consistently.
I think there is more to it than just seeding the RNG consistently.
For example: what if you have a minor change in the area of the screen that got the first primitive (and hence used the first few RNG numbers)? Then all the remaining primitives will use different RNG numbers.
I think the problem becomes more difficult: you want to find the primitives that result in the minimal change wrt the previous frame.
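The stream-shift issue is easy to demonstrate. The sketch below is not how the tool in question works; it just compares one shared RNG stream against a separate stream per primitive, seeded from a base seed plus the primitive's index (an assumed scheme). It shows that a single extra draw by the first primitive changes everything after it in the shared case but not in the per-primitive case. It does nothing for the harder problem of picking primitives that minimize the change from the previous frame.

```python
import random


def shared_stream(seed, draws_per_primitive):
    """All primitives pull numbers from one RNG, in order."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n)] for n in draws_per_primitive]


def per_primitive_streams(seed, draws_per_primitive):
    """Each primitive gets its own RNG, seeded from the base seed and its index."""
    result = []
    for index, n in enumerate(draws_per_primitive):
        rng = random.Random(f"{seed}:{index}")
        result.append([rng.random() for _ in range(n)])
    return result


# Frame B differs from frame A only in that primitive 0 needed one extra draw.
frame_a = [2, 2, 2]
frame_b = [3, 2, 2]

shared_a, shared_b = shared_stream(42, frame_a), shared_stream(42, frame_b)
own_a, own_b = per_primitive_streams(42, frame_a), per_primitive_streams(42, frame_b)

print("shared stream, primitive 1 unchanged:", shared_a[1] == shared_b[1])
print("own streams, primitive 1 unchanged:  ", own_a[1] == own_b[1])
```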
It's not ready for prime time. I thought I'd posted some sample code. I'll dig for it. That was while I was doing anonymous codepens, so they're not on my profile there.
Searching "css sidenote edward morbius" on G+ might turn up something and basic CSS.
I don't like this assumption of copying. I've been writing multiplayer tron-like games as a "hello-world" style programming exercise for 20 years now and have done numerous variations including curves, and wiimotes and the like. I don't think there's sufficient novel content in curvefever (well executed as it is) to really be accusing anyone of shamelessly copying it.
Yea, but this one has more similarities than just being another tron-like game: all the items that can be picked up are the same, and the ability to add a local player and the arrow that marks you at the start are the same too. It's not a big deal since it's free and open-source.