New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Name-based UUID from RQL #4636
Name-based UUID from RQL #4636
Conversation
This is then passed through to `uuid_u::from_hash` when present. A global RQL UUID namespace is hard-coded.
+1000 for this, my system uses SHA-256 primary keys everywhere because of this constraint. Hashing the id directly on RethinkDB would let me cut some microservices running only for this purpose. |
Also, why not SHA-256? :) |
+1 |
5 similar comments
+1 |
+1 |
+1 |
+1 |
+1 |
One thing to keep in mind is that |
Totally okay with that |
Thanks for the pull request @captainpete ! Tagging this as a ReQL proposal since it changes the API and putting it into 2.2 since so many people seem to want it. Generally the way we handle API changes is we have a discussion period at the start of every release (to avoid distracting people in the middle of a release). |
Using the existing |
+1 |
My only concern is how this relates to #117. For example, if we add The options I can think of:
@mlucy @danielmewes -- what do you think? |
I think it might be better to add new terms for schemes that are very different. There is another part to #117 which is supporting a more compact representation of UUIDs (i.e. not as a human-readable string, but something shorter). That part would still work well with this patch, assuming the representation will be changed through an optarg. It's a bit hard to say given that #117 is still in flux, but as things are now, I have a slight preference for keeping the patch as it is. It feels nicer to write |
r.uuid(string, optional_optarg_with_scheme) would be nice |
@coffeemug -- I think we should merge it, in the worst case where we decide that we want |
This is pretty minor, but could we have the server automatically coerce the |
If it's just data that we're using as a seed, I don't see any reason why we can't accept arbitrary data as a seed instead of just strings. We should try to have the property that if two values compare unequal in ReQL they count as different seeds. (That might be too complicated for an initial implementation though.) |
With using arbitrary objects as the seed, we probably need some predetermined and unchanging way to generate the final string that gets hashed to a uuid. So we'd need to provide a contract along the lines of "The json serialized version of the object, with keys alphabetized and pseudotypes serialized like so...". Alternately, we could define it as however they're serialized now when we order things lexicographically for indexes. |
We currently don't allow objects as keys, but I don't think it would be too hard to define a deterministic object -> string transformation (the only worry would be that we could never change the field ordering even if RethinkDB later becomes more Unicode-aware). |
Yeah, it's the "whatever we decide now is locked in forever" aspect that is troubling. I'd rather not throw it in as an add-on even though it would be super convenient. For now users can do something like |
Is |
It is, but I don't believe there's any guarantee it will never change. If we're saying "generate your primary keys with this" we're making a commitment that if you query with |
@marshall007 It is, but we have changed our JSON encoder in the past. So this limits our options for changing those details in the future. Another problem that's specific to allowing objects is that it could clash with later adding optargs to |
Ok, I remove my objection. |
As long as I can replace all those "crypto.createHash()/hash.update(key)/hash.digest('hex')" in my code, manually coercing is nothing :) |
Alright, since there are complications let's just support it for strings in the first version. |
Marking as settled from the API perspective, as implemented in the PR (i.e. |
@danielmewes -- do we still plan to merge this PR for 2.2? Is there a reason we haven't merged it yet (other than test time/scheduling)? |
Sorry for the delay. The patch looks good, merging this now. |
Motivation
In RethinkDB one of the modelling constraints is that primary keys are limited to single attributes.
In DV modelling a similar constraint is present, and it's common practice to generate surrogate keys with hashing algorithms using one or more natural keys.
Hashed keys are useful for knowing the id of a record (that may not exist yet) if we have the natural key values (useful for parallel/partial loading). Primary surrogate keys derived from hashed natural keys are also useful in the situation when replacing or updating documents matching multiple values (a kind of soft uniqueness constraint for atomic updates by key vector).
All this hashing can be handled client-side, but ETL style insert-from-select queries are a lot faster and will run in parallel. Having a deterministic hashing function (something like
r.sha1
) in RQL allows for some pretty cool relational modelling based on existing natural keys.The patch
Instead of introducing new hashing terms into RQL, existing UUID code allows for generation of name-based UUIDs (RFC4122 4.3) using SHA1. I've modified the
r.uuid
term allowing an optional value to be passed in. This also matches the existing default primary key convention of a UUID.Possible improvements
r.uuid
, call it something else that produces a UUID.is_deterministic
should return true if we're using a name.Haven't written anything in C++ for many years but the patch is pretty straight-forward.