The annoying part in written Thai is thattherearenospacesbetweenwords.

seanmcdirmid · on May 8, 2023

Spaces between words is a relatively recent Irish invention (7th or 8th century) in western written language, so it’s not like it’s an obvious thing to have.

thaumasiotes · on May 9, 2023

> Spaces between words is a relatively recent Irish invention (7th or 8th century) in western written language, so it’s not like it’s an obvious thing to have.

Perhaps, but interpuncts between words are several centuries older than that and occur as natural developments in e.g. the Roman Empire. https://loeb-art-center.vassarspaces.net/wp-content/gallery/...

The concept of word separation is an obvious thing to have. Whether the separator is empty space is unimportant.

postcynical · on May 8, 2023

12 centuries should be plenty of time for a simple upgrade to improve the UX of a language.

sundarurfriend · on May 9, 2023

The UX of a language for most people for most of that time was speech.

aksss · on May 9, 2023

Very true, and we should remember that in a lot of (all?) cultures across time, literacy (learning to read and write) was a marker of class, and/or a protected trade, and/or considered sacred, and/or considered profane.

In other words, there wasn’t much of an incentive or recognized need to make the scribe’s job easy to pick up.

Caesar (I think) tells us that the druids of the Celts did not allow members of their tradition to write down their beliefs, traditions, etc. Writing in that context (prior to 50 BCE) was profane.

Of course, those of us in America are familiar with slaves being prevented from learning to read. Forced illiteracy in this context was a tool of oppression. [1]

I think in one of S.M. Stirling’s fictional books (On the Oceans of Eternity?? Part of the Island in the Sea of Time series, anyway) there’s a great exchange with a Babylonian scribe who laughs at the simplistic alphabet of the American, condescendingly remarking that a child could learn that, to which the guy replied, yeah, that’s entirely the point!

1. https://docsouth.unc.edu/neh/singleton/singleton.html

I can’t help but link this recollection by William Henry Singleton. He recounts being whipped as a child because it was thought that he had merely opened a book, but the whole pamphlet he authored(!) (and available to read in full at that link) after becoming free, fighting in the War, and learning to read/write, is an utterly fascinating account from a primary source spanning from his experience being born into slavery in about 1830 to the point where he authored this in about 1920. It’s too easy to understate but this man saw a lot of change in a momentous century, first-hand.

fomine3 · on May 9, 2023

It was not much needed until computer text processing became a thing.

mahkeiro · on May 9, 2023

Latin used dots to separate words.

xvilka · on May 8, 2023

Same with Chinese language, thus lexing and parsing requires knowing many more words than in languages with spaces between words.

zdragnar · on May 9, 2023

As an English native speaker who learned Mandarin, I really didn't find the lack of spaces harmful to learning the language.

Since each character represents a syllable, rather than a specific sound, and the written language is essentially not phonetic, reading the characters is an entirely different experience.

OTOH, you have English and German and others that frequently use compound words, and the use of spaces becomes really important to understanding the writing.

I have zero experience with Thai.

hibbelig · on May 9, 2023

> OTOH, you have English and German and others that frequently use compound words, and the use of spaces becomes really important to understanding the writing.

Schleifmaschinenverleih would like to have a word with you.

This is parsed as Schleif-Maschinen-Verleih. Verleih means a rental company. The middle one is machine, and the first one I find both sanding and whetting as translations, not sure which one it is. So you can rent sanding and/or whetting machines there.

There are cases where it's ambiguous but for the most part the lack of spaces in compound nouns in German is not an issue.

A somewhat infamous example is Rohrohrzucker which should be parsed as Roh-Rohr-Zucker (raw cane sugar), but Rohr-Ohr-Zucker is also possible (pipe ear sugar). It's pretty clear when it happens that you got the wrong parsing but it takes a while to figure out what the right parsing is :-)
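The two readings of Rohrohrzucker can be reproduced mechanically. This is only a minimal sketch: the four-entry word list is a toy assumption made up for illustration, the input is lowercased for simplicity, and `segmentations` is a hypothetical helper name, not anything from a real library.

```python
def segmentations(text, wordlist):
    """Return every way to split `text` into words taken from `wordlist`."""
    if not text:
        return [[]]  # one valid parse of the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in wordlist:
            # Keep this prefix and recurse on whatever is left.
            for rest in segmentations(text[end:], wordlist):
                results.append([head] + rest)
    return results


# Toy word list (assumption): just the parts needed for this one compound.
words = {"roh", "rohr", "ohr", "zucker"}

for parse in segmentations("rohrohrzucker", words):
    print("-".join(parse))
# prints:
# roh-rohr-zucker
# rohr-ohr-zucker
```

Both splits are valid as far as the word list is concerned; picking raw cane sugar over pipe ear sugar takes knowledge that is not in the spelling.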

As far as speaking is concerned: I guess the extra spaces in English don't necessarily translate to pauses, do they? Is plugin pronounced differently from plug-in due to the hyphen?

zdragnar · on May 9, 2023

To clarify, I'm not saying that compound words are difficult to parse due to a lack of spaces. I'm saying that without any spaces in a sentence at all, it's harder to differentiate between compound and non-compound words.

blueberry vs. blue berry

stand up vs. standup

online vs. on line (northeast US term for queueing)

cartwheel vs. cart wheel

Stick compound words in a sentence that doesn't have any spaces at all, and you either have to pause to grok context, or context won't even help you (blue berry vs blueberry). At least German capitalizes all of its nouns, which would certainly help.

Compare this with Chinese, Korean, Japanese or other similar languages that don't use spaces at all (except perhaps after punctuation).

dumbotron · on May 9, 2023

> As an English native speaker who learned Mandarin, I really didn't find the lack of spaces harmful to learning the language.

Definitely. The logograms and being in a completely different language family are the real hurdles.

eric-hu · on May 9, 2023

No, it's different between Chinese and Thai.

Lexing is very clear in Chinese. It's never the case that you look at a Chinese sentence and don't know where a character ends and another begins. Take this sentence in both languages: "good morning, how are you"

早安，你好吗

This sentence clearly has "spaces" and I'm pretty sure any person illiterate in Chinese could tell you there are 5 characters / words. Technically the third character is composed of 人 and 尔 but I don't know that anyone, even kids or beginners, would mistake those as _not_ going together.

สวัสดีตอนเช้าคุณเป็นอย่างไรบ้าง

In contrast, Thai is as you say: lexing and parsing bleed together. There are 7 words in this sentence, but you need to lex the 10 syllables and run them through your mental dictionary to recognize the possible words they could be. My Thai is very limited, but there are examples of sentences out there that actually have multiple valid readings with different semantic meanings, depending on how you group sounds together.
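A rough way to see the difference at the character level, using only Python's standard `unicodedata` module. This is a sketch under assumptions: grouping a base code point with the nonspacing marks (category `Mn`) that follow it is only an approximation of what a reader sees as one written unit, and `visual_units` is a hypothetical helper name.

```python
import unicodedata


def visual_units(text):
    """Attach each nonspacing mark (category Mn) to the code point before it."""
    units = []
    for ch in text:
        if units and unicodedata.category(ch) == "Mn":
            units[-1] += ch  # e.g. a Thai vowel sign written above a consonant
        else:
            units.append(ch)
    return units


chinese = "早安，你好吗"
# Keep only letters (category Lo), which drops the full-width comma.
han = [u for u in visual_units(chinese) if unicodedata.category(u[0]) == "Lo"]
print(len(han))  # 5

thai = "สวัสดี"  # the first word of the Thai sentence above
print(len(thai), len(visual_units(thai)))  # 6 4
```

For the Chinese sentence, one code point is one character, so the count falls out directly. For Thai, the six code points collapse into four written units, and nothing at this level says where a syllable or a word ends; that still needs a dictionary.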

Anon1096 · on May 9, 2023

早安 is made up of 2 characters but is a single word. If you fall into the trap of thinking 1 character = 1 word, you won't understand a thing. In this case you'd have thought it meant "early safe" instead of "good morning".

eric-hu · on May 9, 2023

Okay, you make a good point. Let's look back at the GGP's comment though:

> Same with Chinese language, thus lexing and parsing requires knowing many more words than in languages with spaces between words.

In English, can you get away with knowing the meaning of "good" and "morning" and not "good morning", and know that I'm greeting you instead of commenting on the quality of this morning?

likpok · on May 9, 2023

Good morning is a bad example because it has a colloquial meaning that is at least a little idiomatic. Most other words/phrases in English don’t have this effect, while many Chinese words are like 早安. 了解, for example, can’t even be pronounced without correctly parsing the word.

eric-hu · on May 9, 2023

Okay, I concede that I may have forgotten that Chinese has its exceptions too. 了解 is indeed a good example. There are plenty in English though. Even with context, sometimes I have to really pause and think whether to pronounce read as red or reed (I read it just fine, I read English just fine).

Where I've had pain specifically with Thai is that I can't even know where a syllable begins and ends until I read a few "syllables" together and decide whether some vowels go with the consonant in front or behind it, and whether an -ar should be pronounced as an -aan.

adastra22 · on May 9, 2023

Chinese has pretty regular rules about grouping characters into words though, as most compounds are 2-characters, or a 4-character idiomatic phrase. Even if I know only half the characters in a sentence, I can usually guess the word boundaries correctly. It's not 100% reliable, but good enough to avoid confusion.

hnfong · on May 9, 2023

I guess it really depends on "dialect". Try that with Cantonese :)

As mentioned in another comment, single syllable words are much more common in Cantonese, and word combinations are much more "free" in the sense that there is a lot more ambiguity as to what counts as a "word" and what is merely two single-character-words idiomatically used together. There are also cases where grammatical constructs (and also foul words) are inserted in between a two-character word/idiomatic combo, and sometimes the characters are reversed, to the extent that it used to be a meme: https://evchk.fandom.com/zh/wiki/Y%E5%B7%B2x

It's gotten to a point where, after thinking about it for a couple years, I've come to believe that segmentation on Cantonese is a fool's errand...

Of course, there's also classical Chinese where most of the time a character is a word.

deadfoxygrandpa · on May 9, 2023

i think you're on to something about cantonese, but it's also true of mandarin. segmentation of words in chinese in general seems inherently messier than segmentation in english. also look at stuff like abbreviations: is 北大 one word? is it an abbreviation for 北京大学 the same way Caltech is an abbreviation for california institute of technology? is it just two single character words, each of which is an abbreviation? i think its much less clear than english

hnfong · on May 10, 2023

Segmentation in Mandarin is easier due to the tendency of the language to use 2+ characters for words. With a high quality wordlist you will go a long way.

The problem with proper nouns is that they don't end up in dictionaries, same with slang and other terms that for reasons don't end up in dictionaries.

The additional problem with Cantonese is that there's a larger class of words where the constituent characters can move around as if they were words themselves. Even for a native speaker with some experience in lexicography, it can be difficult to determine word boundaries as there are many cases where a word with characters X+Y can be interpreted as just word X and word Y with some idiomatic meaning. This issue is more pronounced in Cantonese because there are more single character words in active use.

I've actually done this before. My experience is that naive segmentation on Mandarin text with a wordlist is probably 80+% accurate, while using the same algorithm on Cantonese text (with a Cantonese wordlist) will definitely end up "wtf".
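For readers wondering what naive wordlist segmentation might look like, here is one common form, greedy forward longest-match. It is an assumption that this resembles the algorithm meant above; the comment does not say which one was used. The wordlists are tiny toys for illustration and `max_match` is a hypothetical name.

```python
def max_match(text, wordlist):
    """Greedy segmentation: at each position take the longest listed word.

    Characters that start no listed word are emitted one at a time.
    """
    longest = max((len(w) for w in wordlist), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in wordlist:
                out.append(piece)
                i += size
                break
    return out


print(max_match("早安，你好吗", {"早安", "你好"}))
# ['早安', '，', '你好', '吗']
print(max_match("中国共产党", {"中国", "共产", "共产党"}))
# ['中国', '共产党']
```

The approach never backtracks and has no notion of context, so it depends on words being mostly multi-character and listed. When many single characters are words in their own right, a greedy pass has little to hold on to.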

adastra22 · on May 9, 2023

The same problem exists in Japanese FWIW, whose speakers like to make the same sorts of abbreviations despite not having a bisyllabic meter like Mandarin does. Japanese is somewhat helped by having multiple orthographies, however.

dumbotron · on May 9, 2023

Hence expertsexchange.com

qingcharles · on May 9, 2023

Do any ideographic languages use spaces?

I'm used to it in Asian languages but it still does my head in when I try to read older Latin documents.

soundnote · on May 9, 2023

With the kind of mixed script used in Japan and that used to be used in Korea, they're not exactly necessary (still useful, but not necessary). Neither language uses prefixes much, so a sinograph is a pretty reliable indicator of the beginning of a word, followed by the inflection written out in a phonetic script like hiragana or hangeul. In Japanese's case, a switch from hiragana to katakana also indicates a word boundary and highlights that the word's likely a nonsinitic loan or the name of a plant or animal species or other technical term.

Say, for example:

"Korean people eat kimchi"

In Japanese/Korean, the structure would be:

Korean-person-topic marker-kimchi-object marker-eat-present tense.

In Japanese mixed script, that looks like:

韓国人はキムチを食べます。and would be read as "kankokujinwa kimuchiwo tabemasu".

Splitting it with spaces:

韓国人は キムチを 食べます。

The heftier kanji denoting "Korean person" and at the start of "eat" should be clear even to the untrained eye, while people who've studied the language can easily tell that キムチ is "kimuchi" written in katakana. The sentence is pretty easy to parse without spaces, at the cost of using one of the most insane writing systems in the world.

Now, what if we wrote the entire thing in hiragana instead?

かんこくじんはきむちをたべます。

... yyeaahh. Spaces. Please.

かんこくじんは きむちを たべます。 There, much better, though almost no one fluent in Japanese has practice reading stuff like that.
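The script-switch cue can be sketched as a heuristic. Assumptions: the three Unicode block ranges below are a rough classification, hiragana is simply glued onto a preceding kanji or katakana run (covering inflections and particles), and `script_of` / `chunk_by_script` are hypothetical names. It is nowhere near a real tokenizer.

```python
def script_of(ch):
    """Classify a character by Unicode block (rough ranges, an assumption)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def chunk_by_script(text):
    """Start a new chunk on a script change, except kanji/katakana -> hiragana."""
    chunks = []
    prev = None
    for ch in text:
        cur = script_of(ch)
        attaches = cur == prev or (
            cur == "hiragana" and prev in ("kanji", "katakana")
        )
        if chunks and attaches:
            chunks[-1] += ch
        else:
            chunks.append(ch)
        prev = cur
    return chunks


print(chunk_by_script("韓国人はキムチを食べます。"))
# ['韓国人は', 'キムチを', '食べます', '。']
print(chunk_by_script("かんこくじんはきむちをたべます。"))
# ['かんこくじんはきむちをたべます', '。']
```

In mixed script the chunks line up with where the spaces would go; in all-hiragana the cue disappears and the whole sentence comes back as one run.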

In Korean, without spaces we'd have:

한국사람들은김치를먹어요. Again, similar problems. Korea has adopted spaces now that they don't use sinographs, so we'd have:

한국 사람들은 김치를 먹어요. (han'guk sa'ram'deul'eun kim'chi'reul mog'o'yo)

If we wrote "Korean person" with the same Sinitic loans in the Japanese sentence, we might get:

한국인들은 김치를 먹어요. (han'gug'in'deul'eun kim'chi'reul mog'o'yo)

Spaces clearly do help.

throwaway2037 · on May 9, 2023

How many (modern, written) "ideographic languages" exist? I can think of two: Chinese and Japanese. Old Korean and Vietnamese used some Chinese characters, but the modern languages use none.

It is interesting to me when written Chinese and Japanese use commas. It is pretty much never required, but pure style. It does help to break up a complex sentence, similar to phonetic languages.

hnfong · on May 9, 2023

Comma is required in modern Chinese. Nobody will bother reading your text if you don't at least put some of them in the right places...

(I don't know Japanese so I can't speak to that)

throwaway2037 · on May 9, 2023

Nice post. Can you give a simple example sentence where it is required? (I believe you.) I studied Mandarin and Cantonese for a few years, but I never got to the level where I thought commas were required.

hnfong · on May 10, 2023

Just a random text I had handy:

起初我見到一碟芽菜同西芹，以為佢上錯菜，用刀叉掘咗幾下，終於見到埋咗響底嘅叉燒。

Nobody writes it like this:

起初我見到一碟芽菜同西芹以為佢上錯菜用刀叉掘咗幾下終於見到埋咗響底嘅叉燒。

Can people parse the latter (with some difficulty)? Sure. The commas are not required to the extent that you can butcher the sentence even more without losing its essential meaning, but why stop there? You can remove even more stuff from it and still retain most of the meaning:

初我見碟芽菜同西芹以為上錯用刀叉掘幾下終見埋底嘅叉燒

But that's not how people write.

ngcc_hk · on May 9, 2023

Japanese has a lot of “hint” on word (nouns) ending. And they use “full stop” plus one space to end a sentence as comma is not really needed in most cases. This is unlike chinese.

wincy · on May 9, 2023

I’ve noticed in things translated from Japanese (video games, anime) there’s two features that seem constant and don’t seem to come from other languages. They seem to constantly say “in other words” and restate and clarify topics, and they put “quotation marks” around things that don’t seem to need quotation marks. I’ve always assumed these oddities would make more sense if I learned to speak or write Japanese.

throwaway2037 · on May 9, 2023

    Japanese has a lot of “hint” on word (nouns) ending.

Can you give a simple example sentence to demonstrate your point? (I believe you.)

    This is unlike [C]hinese.

Can you give a simple example sentence to demonstrate your point? (I believe you.)

geomark · on May 9, 2023

yougetusedtoitafterawhile

The thing that bugs me about written Thai is that there are spaces now and then and you would expect them to be at sentence breaks but they seem to be randomly placed throughout the text, almost as if that's where the writer felt like he needed to take a breath instead of where one sentence ends and another begins.

deadfoxygrandpa · on May 9, 2023

idk the more chinese i learn, the more im convinced that the very concept of individual words is blurred and not quite the same because of the way the writing system works

中国共产党, is that one word? should you break it up as 中国 共产党? what about 中国 共产 党? i dont think its nearly as clear which of these is correct as it is in english