
"The right thing" for filesystem entries is to copy transparently and not evaluate. A file path is an opaque, length-delimited block of bytes that you mem-copy and never mangle. If you must mangle it, touch only the necessary areas as directed (e.g. join with os.sep and do not normalize anything).

Want to offer Unicode validation? Sure, having that as an OPTION is fine. Forcing it means I can't rely on that tool to handle real-world data which happens not to be valid Unicode but is still a valid file-system address.
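
A minimal Python sketch of that "copy, do not evaluate" approach, assuming paths are carried as bytes. The helper names join_raw and is_valid_utf8 are hypothetical, and validation is a separate opt-in call rather than part of joining:

```python
import os

def join_raw(*parts: bytes) -> bytes:
    """Join path components with the OS separator and nothing else.

    No decoding and no normalization: b".." and invalid UTF-8 pass through untouched.
    """
    return os.sep.encode("ascii").join(parts)

def is_valid_utf8(path: bytes) -> bool:
    """Optional check; the caller decides whether an invalid name matters."""
    try:
        path.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# b"\xe9" is a Latin-1 byte, so this name is not valid UTF-8,
# yet it is a perfectly usable name on most POSIX filesystems.
raw = join_raw(b"data", b"caf\xe9", b"..", b"notes.txt")
```

The joined value can be handed to os.open or shutil.copyfile as-is, since those functions accept bytes paths.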



One thing I've noticed is that ext, xfs, btrfs and zfs all explicitly store a length field alongside the filename. There's nothing inherent in the disk layout of these filesystems preventing them from supporting filenames with embedded slash and nul characters - those limitations are imposed by the kernel's VFS implementation. It would be nice to have a special version of open, exec etc. where one could specify a filepath as a length-prefixed array of length-prefixed strings.
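
To make the idea concrete, here is a rough sketch of what such a path argument could look like on the wire. No such system call exists; the layout (little-endian 32-bit counts) and the names encode_path and decode_path are assumptions made up for illustration:

```python
import struct
from typing import List

def encode_path(components: List[bytes]) -> bytes:
    """Serialize a path as: component count, then (length, bytes) per component."""
    out = [struct.pack("<I", len(components))]
    for comp in components:
        out.append(struct.pack("<I", len(comp)))
        out.append(comp)
    return b"".join(out)

def decode_path(blob: bytes) -> List[bytes]:
    """Inverse of encode_path; raises ValueError on truncated input."""
    (count,) = struct.unpack_from("<I", blob, 0)
    offset = 4
    components = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        comp = blob[offset:offset + length]
        if len(comp) != length:
            raise ValueError("truncated path component")
        components.append(comp)
        offset += length
    return components

# Components may contain "/" and NUL because no byte acts as a delimiter.
blob = encode_path([b"usr", b"a/b", b"nul\x00here"])
```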


ODBC (ISO 9075-3) got it right 30 years ago: all strings are accompanied by a length argument, which also accepts the sentinel value SQL_NTS when you mean a null-terminated string.
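
The convention can be sketched in a few lines of Python. This is not the ODBC API; read_string is a hypothetical helper, and the sentinel NTS is assumed to be -3, mirroring the value SQL_NTS has in common ODBC headers:

```python
NTS = -3  # assumption: same value as ODBC's SQL_NTS; any negative sentinel would do

def read_string(buf: bytes, length: int) -> bytes:
    """Return the string held in buf, given an explicit length or the NTS sentinel."""
    if length == NTS:
        # Caller says "null-terminated": stop at the first NUL, if there is one.
        end = buf.find(b"\x00")
        return buf if end == -1 else buf[:end]
    if length < 0:
        raise ValueError("length must be non-negative or NTS")
    # Explicit length: embedded NUL bytes are ordinary data.
    return buf[:length]
```

With an explicit length the same buffer can carry embedded NUL bytes, which is exactly what a terminator-only API cannot express.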


I once thought about using JSON arrays for filepaths so all possible strings would be valid filenames.


That's the beauty of it with Python:

- if you want to treat paths like Unicode strings, you can. Which is great for simple scripts where you don't want to deal with complexity. And 99% of the time, it's enough with modern OSes.

- if you want to treat paths as bags of raw bytes, you can. Which is necessary to transparently copy and not evaluate, as you said, for covering edge cases.

- if you need to actually deal with those as strings but don't want to lose data for edge cases, so a mix of the 2 above, you can use surrogate escapes.
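
A short sketch of the third option, using the surrogateescape error handler explicitly so it behaves the same on every platform (on POSIX, os.fsdecode and os.fsencode apply the same handler). The file name is made up:

```python
raw = b"report-\xff.txt"  # 0xff can never appear in valid UTF-8

# Strict decoding refuses the name outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles each undecodable byte through as a lone surrogate
# (0xff becomes U+DCFF), so ordinary str operations still work.
text = raw.decode("utf-8", errors="surrogateescape")
renamed = text.replace("report", "summary")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = renamed.encode("utf-8", errors="surrogateescape")
```

The resulting str is not valid Unicode text, so a strict encode of it (for example when printing or writing to a log) still raises and has to be handled.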


Where does that filepath come from? If it comes from a config file, are you going to do your text processing, and your interaction with other modules, in byte[] arrays in Python 3+?

Python 2's Unicode model was _closer_ to correct, with its trivial coercion between byte[] and Unicode.

Conversion also shouldn't imply, force, or check validation or normalization. Labeling a bytestream with an encoding, and validating / normalizing that encoding, should be options. Operations on bytestreams with encoding-related attributes should set them either to 'unknown' or to a proper output type if they're aware the manipulations will still yield a valid encoding.
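
One way to picture that design is a byte string that carries an optional encoding label. This is a hypothetical sketch, not an existing Python type; LabeledBytes and its methods are invented names, and it assumes stateless encodings such as UTF-8 where concatenation preserves validity:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LabeledBytes:
    data: bytes
    encoding: Optional[str] = None  # None stands for 'unknown'

    def concat(self, other: "LabeledBytes") -> "LabeledBytes":
        # The label survives only when both sides agree on it.
        same = self.encoding is not None and self.encoding == other.encoding
        return LabeledBytes(self.data + other.data, self.encoding if same else None)

    def slice(self, start: int, stop: int) -> "LabeledBytes":
        # A byte-level cut may split a multi-byte sequence, so the label is dropped.
        return LabeledBytes(self.data[start:stop], None)

    def is_valid(self) -> bool:
        # Validation is opt-in: nothing above ever calls it.
        if self.encoding is None:
            return False
        try:
            self.data.decode(self.encoding)
        except UnicodeDecodeError:
            return False
        return True

name = LabeledBytes("caf\u00e9".encode("utf-8"), "utf-8")
joined = name.concat(LabeledBytes(b".txt", "utf-8"))
cut = name.slice(0, 4)  # ends in the middle of the two-byte "é"
```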

Normalization is more complex, since Unicode strings can be normalized in different ways, then combined, and still be a valid string but no longer uniformly normalized.
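
A small demonstration of that last point with Python's unicodedata module; the sample strings are arbitrary:

```python
import unicodedata

a = unicodedata.normalize("NFC", "caf\u00e9")      # "é" as one precomposed code point
b = unicodedata.normalize("NFD", "\u00e9t\u00e9")  # each "é" as "e" + combining acute

combined = a + b  # still well-formed Unicode

# Normalizing again changes the string under both forms,
# so the concatenation is neither NFC nor NFD.
is_nfc = unicodedata.normalize("NFC", combined) == combined
is_nfd = unicodedata.normalize("NFD", combined) == combined
```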



