
>Apple's word segmentation

Unless they changed it, it's probably similar to CFStringTokenizer, which used ICU Boundary Analysis (and maybe MeCab for Japanese).
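
For anyone curious, here's roughly what driving it from Swift looks like. A minimal sketch — the sample string and the ja_JP locale are my own choices, not anything specific to Apple's setup:

    import CoreFoundation

    // Sample string is mine; any unspaced CJK text works.
    let text = "日本語の分かち書きのテスト" as CFString
    let range = CFRange(location: 0, length: CFStringGetLength(text))

    // Word-unit tokenization is where dictionary-based breaking
    // applies, since these scripts don't separate words with spaces.
    let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault,
        text,
        range,
        kCFStringTokenizerUnitWordBoundary,
        CFLocaleCreate(kCFAllocatorDefault, "ja_JP" as CFString)
    )

    // Advance token by token until the tokenizer reports no more tokens.
    while CFStringTokenizerAdvanceToNextToken(tokenizer).rawValue != 0 {
        let tokenRange = CFStringTokenizerGetCurrentTokenRange(tokenizer)
        if let token = CFStringCreateWithSubstring(kCFAllocatorDefault, text, tokenRange) {
            print(token as String)
        }
    }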



Thank you! The ICU Boundary Analysis documentation says it uses a dictionary to split Chinese, Japanese, Thai, or Khmer text.

https://unicode-org.github.io/icu/userguide/boundaryanalysis...

Is that the same as the macOS dictionary being parsed here? It seems like a pretty big file to grep every time!
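
(A quick way to poke at this from Swift: as far as I know, Foundation's word enumeration on Apple platforms goes through the same tokenizer. The Thai sample is my own — Thai writes no spaces between words, so any boundaries printed here have to come from the dictionary-based break engine:)

    import Foundation

    let s = "สวัสดีครับผมชื่อจอห์น"
    // .byWords asks for word-level substrings; boundaries for Thai
    // can only come from dictionary lookup, not whitespace.
    s.enumerateSubstrings(in: s.startIndex..<s.endIndex, options: .byWords) { word, _, _, _ in
        if let word = word { print(word) }
    }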


No, the ICU dictionaries can be seen at: https://github.com/unicode-org/icu/tree/main/icu4c/source/da...

I assume they're converted at compile time to a more efficient query format.
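
That's right as far as I can tell: ICU ships a gendict tool that compiles those plain-text word lists into binary trie data (the .dict files), so lookup is a prefix walk rather than a scan. A toy sketch of what dictionary-based segmentation looks like in Swift — the mini word list is made up, I use a plain Set instead of a trie, and the real break engines score whole segmentation paths rather than matching greedily:

    // Toy longest-match segmenter over a set-based dictionary.
    let dictionary: Set<String> = ["日本", "日本語", "の", "テスト", "です"]  // made-up mini word list
    let maxWordLength = dictionary.map { $0.count }.max() ?? 1

    func segment(_ text: String) -> [String] {
        var result: [String] = []
        var i = text.startIndex
        while i < text.endIndex {
            var match: String? = nil
            // Try the longest candidate first, shrinking until we hit a dictionary word.
            var j = text.index(i, offsetBy: maxWordLength, limitedBy: text.endIndex) ?? text.endIndex
            while j > i {
                let candidate = String(text[i..<j])
                if dictionary.contains(candidate) { match = candidate; break }
                j = text.index(before: j)
            }
            if let word = match {
                result.append(word)
                i = text.index(i, offsetBy: word.count)
            } else {
                // Unknown character: emit it as a single-character token.
                result.append(String(text[i]))
                i = text.index(after: i)
            }
        }
        return result
    }

    print(segment("日本語のテストです"))  // ["日本語", "の", "テスト", "です"]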



