"The right thing" for filesystem entries is transparently copy, do not evaluate....

10000truths · on Oct 20, 2021

One thing I've noticed is that ext, xfs, btrfs and zfs all explicitly store a length field alongside the filename. There's nothing inherent in the disk layout of these filesystems preventing them from supporting filenames with embedded slash and nul characters - those limitations are imposed by the kernel's VFS implementation. It would be nice to have a special version of open, exec etc. where one could specify a filepath as a length-prefixed array of length-prefixed strings.
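
To make the "length-prefixed array of length-prefixed strings" idea concrete, here is a minimal sketch of one possible wire format. The function names `encode_path` and `decode_path` and the choice of little-endian 32-bit lengths are assumptions made for illustration. No existing kernel or libc accepts this format.

```python
import struct

def encode_path(components):
    """Pack byte-string components as: count, then (length, bytes) per component."""
    out = [struct.pack("<I", len(components))]
    for comp in components:
        out.append(struct.pack("<I", len(comp)))
        out.append(comp)
    return b"".join(out)

def decode_path(buf):
    """Inverse of encode_path; raises ValueError on truncated input."""
    (count,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    components = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        if offset + length > len(buf):
            raise ValueError("truncated component")
        components.append(buf[offset:offset + length])
        offset += length
    return components

# Components may contain '/' and NUL because no byte acts as a delimiter.
path = [b"home", b"a/b", b"nul\x00inside"]
encoded = encode_path(path)
roundtrip = decode_path(encoded)
```

Since every length is explicit, the separator and terminator bytes carry no special meaning, which is the whole point of the proposal.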

GoblinSlayer · on Oct 21, 2021

ODBC (ISO 9075-3) got it right 30 years ago: all strings are accompanied by the length argument, which also accepts the sentinel value NTS, when you mean a null-terminated string.

gjvnq · on Oct 21, 2021

I once thought about using JSON arrays for filepaths so all possible strings would be valid filenames.
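
A rough sketch of what that could look like, using Python's standard `json` module. The variable names are illustrative only. Note one caveat: JSON strings are Unicode text, so a filename made of arbitrary non-UTF-8 bytes would still need an extra escaping step that this sketch does not cover.

```python
import json

# Each array element is one path component; '/' and NUL are ordinary characters.
components = ["home", "a/b", "nul\x00inside"]

encoded = json.dumps(components)   # NUL is written as the escape \u0000
decoded = json.loads(encoded)
```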

BiteCode_dev · on Oct 21, 2021

That's the beauty of it with Python:

- if you want to treat paths like unicode strings, you can. Which is great for simple scripts where you don't want to deal with complexity. And 99% of the time, it's enough with modern OSes.

- if you want to treat paths as bags of raw bytes, you can. That is necessary to transparently copy and not evaluate, as you said, to cover edge cases.

- if you need to actually deal with those as strings but don't want to lose data in edge cases (a mix of the two above), you can use surrogate escapes.
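
A small sketch of the third option, assuming a filename whose bytes are Latin-1 rather than valid UTF-8. It uses the `surrogateescape` error handler directly so it does not depend on the platform's filesystem encoding. On POSIX systems `os.fsdecode` and `os.fsencode` apply the same handler using the filesystem encoding.

```python
# b"\xe9" is "é" in Latin-1 but is not valid UTF-8 on its own.
raw = b"caf\xe9.txt"

# Undecodable bytes become lone surrogates (0xE9 -> U+DCE9) instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# Ordinary string operations work on the result.
renamed = text.replace(".txt", ".bak")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
renamed_raw = renamed.encode("utf-8", errors="surrogateescape")
```

The escaped string cannot be encoded as strict UTF-8, so printing it or writing it to a text file still needs care.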

mjevans · on Oct 21, 2021

Where does that filepath come from? If it comes from a config file, are you going to do your text processing, and your interaction with other modules, in byte[] arrays in Python 3+?

Python 2's Unicode model was _closer_ to correct, with its trivial coercion between byte[] and Unicode.

Conversion also shouldn't imply, force, or check validation or normalization. Labeling a bytestream with an encoding, and validating or normalizing that encoding, should be options. Operations on bytestreams with encoding-related attributes should set them either to an 'unknown' result or to a proper output type if they know the manipulation will still yield a valid encoding.

Normalization is more complex, since Unicode strings can be normalized in different ways, then combined, and still be a valid string but no longer uniformly normalized.
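
A short illustration of that last point with Python's standard `unicodedata` module. The strings are made up for the example: one "é" in NFC form concatenated with one in NFD form gives a valid string that is in neither normal form.

```python
import unicodedata

composed = unicodedata.normalize("NFC", "\u00e9")    # single code point U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9")  # "e" followed by U+0301

# Both halves are individually normalized, but the concatenation is mixed.
combined = composed + decomposed

is_nfc = unicodedata.normalize("NFC", combined) == combined
is_nfd = unicodedata.normalize("NFD", combined) == combined
```

Both checks come out false, so any "normalized" attribute attached to the inputs cannot simply be carried over to the result.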