
>Apple's word segmentation

Unless they changed it, it's probably similar to CFStringTokenizer, which used ICU Boundary Analysis (and maybe MeCab for Japanese).
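
For anyone curious, here's roughly what driving it from Swift looks like. A minimal sketch — the sample string and the ja_JP locale are my own choices, not anything specific to Apple's setup:

    import CoreFoundation

    // Sample string is mine; any unspaced CJK text works.
    let text = "日本語の分かち書きのテスト" as CFString
    let range = CFRange(location: 0, length: CFStringGetLength(text))

    // Word-unit tokenization is where dictionary-based breaking
    // applies, since these scripts don't separate words with spaces.
    let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault,
        text,
        range,
        kCFStringTokenizerUnitWordBoundary,
        CFLocaleCreate(kCFAllocatorDefault, "ja_JP" as CFString)
    )

    // Advance token by token until the tokenizer reports no more tokens.
    while CFStringTokenizerAdvanceToNextToken(tokenizer).rawValue != 0 {
        let tokenRange = CFStringTokenizerGetCurrentTokenRange(tokenizer)
        if let token = CFStringCreateWithSubstring(kCFAllocatorDefault, text, tokenRange) {
            print(token as String)
        }
    }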



Thank you! The ICU Boundary Analysis documentation says it uses a dictionary to split Chinese, Japanese, Thai, or Khmer text.

https://unicode-org.github.io/icu/userguide/boundaryanalysis...

Is that the same as the macOS dictionary being parsed here? It seems like a pretty big file to grep every time!
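
(A quick way to poke at this from Swift: as far as I know, Foundation's word enumeration on Apple platforms goes through the same tokenizer. The Thai sample is my own — Thai writes no spaces between words, so any boundaries printed here have to come from the dictionary-based break engine:)

    import Foundation

    let s = "สวัสดีครับผมชื่อจอห์น"
    // .byWords asks for word-level substrings; boundaries for Thai
    // can only come from dictionary lookup, not whitespace.
    s.enumerateSubstrings(in: s.startIndex..<s.endIndex, options: .byWords) { word, _, _, _ in
        if let word = word { print(word) }
    }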


No, the ICU dictionaries can be seen at: https://github.com/unicode-org/icu/tree/main/icu4c/source/da...

I assume they're converted at compile time to a more efficient query format.
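
That's right as far as I can tell: ICU ships a gendict tool that compiles those plain-text word lists into binary trie data (the .dict files), so lookup is a prefix walk rather than a scan. A toy sketch of what dictionary-based segmentation looks like in Swift — the mini word list is made up, I use a plain Set instead of a trie, and the real break engines score whole segmentation paths rather than matching greedily:

    // Toy longest-match segmenter over a set-based dictionary.
    let dictionary: Set<String> = ["日本", "日本語", "の", "テスト", "です"]  // made-up mini word list
    let maxWordLength = dictionary.map { $0.count }.max() ?? 1

    func segment(_ text: String) -> [String] {
        var result: [String] = []
        var i = text.startIndex
        while i < text.endIndex {
            var match: String? = nil
            // Try the longest candidate first, shrinking until we hit a dictionary word.
            var j = text.index(i, offsetBy: maxWordLength, limitedBy: text.endIndex) ?? text.endIndex
            while j > i {
                let candidate = String(text[i..<j])
                if dictionary.contains(candidate) { match = candidate; break }
                j = text.index(before: j)
            }
            if let word = match {
                result.append(word)
                i = text.index(i, offsetBy: word.count)
            } else {
                // Unknown character: emit it as a single-character token.
                result.append(String(text[i]))
                i = text.index(after: i)
            }
        }
        return result
    }

    print(segment("日本語のテストです"))  // ["日本語", "の", "テスト", "です"]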



