I'm amazed at how "we" have managed to turn such a simple idea as "128 bits is a large enough address space for uncoordinated generation to be essentially collision free" into such a "heavy" concept with 7 different versions
If you want to do something smart like encoding your node ID within the value, or prefixing a timestamp for sortability, then sure, do that in your application. No one else really needs to care how you produced your 16 bytes. Just do some napkin math to make sure you're keeping sufficient entropy
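To make the napkin math concrete, here's a rough sketch (Python; the 48-bit millisecond prefix and the million-IDs-per-millisecond rate are illustrative numbers I'm assuming, not anything prescribed):

    import os
    import time

    def sortable_id() -> bytes:
        """16-byte ID: 48-bit Unix-millisecond prefix + 80 random bits."""
        ts_ms = int(time.time() * 1000) & ((1 << 48) - 1)
        return ts_ms.to_bytes(6, "big") + os.urandom(10)

    # Napkin math: with 80 random bits left, the birthday bound puts the
    # collision probability among n IDs sharing the same millisecond at
    # roughly n^2 / 2^81.
    n_per_ms = 1_000_000
    print(f"{n_per_ms**2 / 2**81:.1e}")  # ~4e-13, even at a million IDs per ms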
I'm not sure "UUID" even needed to be a column type, versus an "INT16" and some string formatting/parsing functions for the conventional representation (should you choose to use that in your application). You could also put IPv6 addresses in the same type. Though I guess this depends on how much you think the database should encode intention versus raw storage in types
> No one else really needs to care how you produced your 16 bytes
UUID isn't about how it's done, it's about what it is.
Instead of everyone doing something differently, everyone can just comply with UUID.
Instead of having to repeat it across your docs that the IDs of this entity are sortable, you can just say they are UUIDv7. If someone wants to extract the timestamp from your ID, they don't need to figure out which (32, 48, 50?) bits are the timestamp nor what resolution the timestamp has because you can tell them UUIDv7.
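For example, here's a minimal sketch of pulling the timestamp back out of a UUIDv7 with Python's standard uuid module (per RFC 9562 the 48 most significant bits are a Unix timestamp in milliseconds; the sample value below is just an arbitrary example):

    import uuid
    from datetime import datetime, timezone

    def uuid7_timestamp(u: uuid.UUID) -> datetime:
        """Read the 48-bit millisecond timestamp from the top of a UUIDv7."""
        if u.version != 7:
            raise ValueError("not a UUIDv7")
        ms = u.int >> 80          # keep the 48 most significant of 128 bits
        return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

    print(uuid7_timestamp(uuid.UUID("01912d68-783e-7a03-8467-5661c1243ad4")))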
You don't have to write your own validation functions because you can tell the database that this hex string is a UUID and it can do it for you.
You're probably making the case for Sqlite here, which is very minimal, but for something more full-featured like Postgres, I prefer these conveniences. I can tell because whenever I use Sqlite in a case where I could've used Postgres, I regret it!
But I feel the point is: none of that is a concern IDs should take on.
Most functional things, e.g. embedding the record creation time within the ID, are one of those "that's cool, but I've never seen anyone do it" kind of things. If you need to sort records by when they were created, there are probably three or four happened_at fields on the record you'd use (created_at in this case). If you need the exact time, those are there for that.
Counter-argument: Well, you can save a few bytes on every record by getting rid of the created_at field and just using a UUIDv7. Maybe, but I've never seen anyone do it. What if you need to change the time the record was created? Are you planning to explain to all your integration providers the process of extracting a timestamp from a UUIDv7? What if you need to run complex SQL timestamp functions on created_at? Etc. It's cool, but it never actually happens.
Once we enter the domain of "using the node id or timestamp or something to reduce the probability of ID collision", that's a totally reasonable responsibility within an ID's set of concerns. But, that's a very different need.
> You don't have to write your own validation functions
Why are we validating IDs?
> but something more full-featured like Postgres, I prefer these conveniences.
Agreed. I am a vocal UUID hyper-hater. UUIDs should be destroyed, and humanity would be (oh so slightly) better off if they had never existed. But, they're still a thing, and I think it's cool that databases have hyper-specific types like this.
My wish is that Postgres would have other, more sane automatic ID types and generation capabilities, in addition to uuid & autoincrement.
The point of the timestamp in UUIDv7 is not to encode creation time, it is to provide some (coarse-grained) chronological sortability.
Random primary keys are bad, but exposing incremental indexes to the public is also bad, and hacking on a separate unique UUID for public use is also bad. UUIDs are over-engineered for historical reasons, and UUIDv7 as raw 128 bits without the version encoding would be nicer.
But, to the end-user it's just a few lost bits in a 128-bit ID with an odd standard for hyphenation. The standardization means you know what to expect as developer, instead of every DB rolling their own unique 128-bit ID system with its own guarantees and weirdnesses.
But my point is: When is that standardization actually leveraged? Literally, tactically, what does "you know what to expect as a developer" mean? When is this standardization used in a fashion that enables more capability than just "the ID is a string don't worry about it"?
The realistic answer is: it isn't, because pre-UUIDv7 there was literally nothing about the UUID spec that conferred more capability than just a random string. And, truly, people used them as "just gimme a random string" all the flipping time. The pipes of the internet are filled with JSON that contains UUIDs-in-a-string-field: 4 bytes wasted to hyphens, 1 byte wasted to a version number, and none of that is in service to anyone or anything.
1. The other UUID versions are actually used. However, the expectation is in what the developer gets when generating it. Even "random ID" can be messed up if the author tries to be smart - e.g., rolling their own secret chronological-sortability hack for their database but not telling you how much entropy and collision resistance you have left, or hacking in multi-server collision resistance by making some bits a static server ID.
People have reason to do those things, and oh boy do you want to know that it's happening. With UUID, over-engineered as it may be, you know what you're asking for and can see what you're getting - truly random, server namespaced, or chronologically sortable.
2. Being upset over 4 bytes wasted to hyphens but not being upset about JSON itself seems hypocritical. JSON is extremely wasteful on the wire, and if you switch to something more efficient you also get to just send the UUID as 16 bytes. That's a lot more than 4 bytes saved.
Over JSON you can still base64 encode the UUID if it's not meant to be user-facing.
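Roughly like this, as a sketch (Python; URL-safe base64 without the padding gets the 16 bytes down to 22 characters, versus 36 for the hyphenated hex form):

    import base64
    import uuid

    u = uuid.uuid4()

    text = str(u)                                             # 36 chars of hex + hyphens
    compact = base64.urlsafe_b64encode(u.bytes).rstrip(b"=")  # 22 ASCII bytes

    print(len(text), len(compact))   # 36 22
    # Round-trips: re-pad, decode, and rebuild the same UUID.
    assert uuid.UUID(bytes=base64.urlsafe_b64decode(compact + b"==")) == u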
What do you even mean by "Why are we validating IDs"?
zzzzyyyy-zyzy-zyzy-zzyyzzyyzzyy: does this look like a valid ID? I could totally store this in the database if there were no validation involved.
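To make it concrete, this is roughly the parse/validation step a typed column gives you for free (a Python sketch; the second value is just an arbitrary well-formed example):

    import uuid

    for candidate in ("zzzzyyyy-zyzy-zyzy-zzyyzzyyzzyy",
                      "0e3f8c9a-5b1d-4f6e-9a2b-7c8d9e0f1a2b"):
        try:
            uuid.UUID(candidate)   # requires 32 hex digits, rejects anything else
            print(candidate, "-> valid")
        except ValueError:
            print(candidate, "-> rejected")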
From the GP's (and my) perspective, the useful part of UUID is that it's 16 bytes. This is usually formatted as 32 hex digits with dashes in specific places.
The version/variant bits are the pointless part. Of course if you put the 16 bytes on the wire you would still have some encoding (perhaps 22 base64 characters?) that requires decoding/validation, but in memory and in your DB it's just 16 bytes of opaque data.
The UUID specs are still confusing (or at least were to me lol) because the words "version" and "variant" both just say that something changes, not what is changing or why it's changing.
version from Latin vertere "to turn, turn back, be turned; convert, transform, translate; be changed"
variant from Latin variare "change, alter, make different,"
> 4.1.1 The variant field determines the layout of the UUID. That is, the interpretation of all other bits in the UUID depends on the setting of the bits in the variant field. As such, it could more accurately be called a type field; we retain the original term for compatibility.
> 4.1.3 The version number is in the most significant 4 bits of the time stamp (bits 4 through 7 of the time_hi_and_version field). The following table lists the currently-defined versions for this UUID variant. The version is more accurately a sub-type; again, we retain the term for compatibility.
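In concrete terms, here's a small sketch of where those bits sit, using Python's uuid module (the shift amounts just pick out the nibble and bits the RFC is describing):

    import uuid

    u = uuid.uuid4()

    # Version: the high nibble of octet 6 (the "V" in xxxxxxxx-xxxx-Vxxx-...).
    version = (u.int >> 76) & 0xF         # 4 for a uuid4()

    # Variant: the top bits of octet 8 (the "N" group); 0b10 is the RFC 4122 layout.
    variant_bits = (u.int >> 62) & 0b11   # 0b10

    print(version, bin(variant_bits), u.version, u.variant)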
It's recognized in the RFC, and all you've done is break compatibility for fashion.
In practice, UUIDs are treated as an opaque 128-bit field. In any sufficiently complex system, there is no practical way to standardize on a single blessed version. Furthermore, all of the standardized UUIDs are deficient in various ways for some use cases, so there are a large number of UUID-like types used in large enterprises that are not "standard" UUIDs but which are better fit for purpose. This is deemed okay because there is no way to even standardize on a single UUID version among the official ones. Furthermore, there are environments where some subset of the UUID standard types (including each of v3/v4/v5 in various contexts) are strictly forbidden for valid security reasons.
The practical necessity of mixing UUID versions, along with other 128-bit UUID-like values, means that the collision probabilities are far higher in many non-trivial systems than in the ideal case of having a single type of 128-bit identifier. There is a whole separate element of UUID-like type engineering that happens around trying to mitigate collision probabilities when using 128-bit identifiers from different sources, some of which you may not control.
Having 128 bits is the only common thread across these identifiers which everyone seems to agree on.
I think you're drastically underestimating the purpose and management of UUIDs in large scale systems.
If you're building for a single application or data type, sure do your thing, have at it. If you're trying to coordinate UUID spaces and generation across thousands of different applications and data types, like large data pipelines, then this matters a lot.
Also, having native database support (like indexing, filtering, etc.) improves efficiency for these types of workloads.
Because it turns out that trying to index/sort things by UUID doesn't work great. UUID, at least somewhere after version one, isn't just some large number. Different parts of the field have different meanings depending on the specification.