Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Name-based UUID from RQL #4636

Merged
merged 1 commit into from Sep 26, 2015
Merged

Conversation

captainpete
Copy link
Contributor

Motivation

In RethinkDB one of the modelling constraints is that primary keys are limited to single attributes.
In DV modelling a similar constraint is present, and it's common practice to generate surrogate keys with hashing algorithms using one or more natural keys.

Hashed keys are useful for knowing the id of a record (that may not exist yet) if we have the natural key values (useful for parallel/partial loading). Primary surrogate keys derived from hashed natural keys are also useful in the situation when replacing or updating documents matching multiple values (a kind of soft uniqueness constraint for atomic updates by key vector).

All this hashing can be handled client-side, but ETL style insert-from-select queries are a lot faster and will run in parallel. Having a deterministic hashing function (something like r.sha1) in RQL allows for some pretty cool relational modelling based on existing natural keys.

The patch

Instead of introducing new hashing terms into RQL, existing UUID code allows for generation of name-based UUIDs (RFC4122 4.3) using SHA1. I've modified the r.uuid term allowing an optional value to be passed in. This also matches the existing default primary key convention of a UUID.

Possible improvements

  • I've used a global base UUID, allow specifying the base (for example, a table name).
  • Don't overload r.uuid, call it something else that produces a UUID.
  • is_deterministic should return true if we're using a name.

Haven't written anything in C++ for many years but the patch is pretty straight-forward.

This is then passed through to `uuid_u::from_hash` when present.
A global RQL UUID namespace is hard-coded.
@thelinuxlich
Copy link

+1000 for this, my system uses SHA-256 primary keys everywhere because of this constraint. Hashing the id directly on RethinkDB would let me cut some microservices running only for this purpose.

@thelinuxlich
Copy link

Also, why not SHA-256? :)

@lucaspalencia
Copy link

+1

5 similar comments
@gustavopalencia
Copy link

+1

@guerreiro
Copy link

+1

@rksasaki
Copy link

rksasaki commented Aug 6, 2015

+1

@feliperoberto
Copy link

+1

@simaodeveloper
Copy link

+1

@deontologician
Copy link
Contributor

One thing to keep in mind is that generated_keys in the response will not return these values, like it will for default key generation. So to get the keys for a document you'll have to tell it to return changes like r.table('foo').insert(..., return_changes=True).run(...)

@thelinuxlich
Copy link

Totally okay with that

@mlucy mlucy added this to the 2.2 milestone Aug 6, 2015
@mlucy
Copy link
Member

mlucy commented Aug 6, 2015

Thanks for the pull request @captainpete !

Tagging this as a ReQL proposal since it changes the API and putting it into 2.2 since so many people seem to want it. Generally the way we handle API changes is we have a discussion period at the start of every release (to avoid distracting people in the middle of a release).

@danielmewes
Copy link
Member

Using the existing uuid format for this sounds neat.
Let's give this a bit of discussion for 2.2. I definitely like the basic idea.

@renanvaz
Copy link

renanvaz commented Aug 6, 2015

+1

@coffeemug
Copy link
Contributor

My only concern is how this relates to #117. For example, if we add r.uuid(type='scheme1 | scheme2 | etc'), passing a name to r.uuid won't make sense in all cases.

The options I can think of:

  • Just merge the patch, and then add another term for other types of key generation later.
  • Adjust the patch to make the string a named optarg.
  • Merge the patch; when we add other schemes we can throw if the user passes a string.
  • merge the patch; when we add other schemes, break the API and make the string a named optarg then.

@mlucy @danielmewes -- what do you think?

@danielmewes
Copy link
Member

I think it might be better to add new terms for schemes that are very different.
"uuid" is usually associated with a specific set standards for ID generation (https://en.wikipedia.org/wiki/Universally_unique_identifier#Standards), so I think things like the 'ordered' UUID that was proposed in #117 (comment) might not be a great fit for the r.uuid term.

There is another part to #117 which is supporting a more compact representation of UUIDs (i.e. not as a human-readable string, but something shorter). That part would still work well with this patch, assuming the representation will be changed through an optarg.

It's a bit hard to say given that #117 is still in flux, but as things are now, I have a slight preference for keeping the patch as it is. It feels nicer to write r.uuid(name) than r.uuid({name: name}).

@thelinuxlich
Copy link

r.uuid(string, optional_optarg_with_scheme) would be nice

@mlucy
Copy link
Member

mlucy commented Aug 27, 2015

@coffeemug -- I think we should merge it, in the worst case where we decide that we want r.uuid to support a generation scheme that doesn't make sense when passed a constant we can just error in that case.

@marshall007
Copy link
Contributor

This is pretty minor, but could we have the server automatically coerce the name argument to a string? As @captainpete's motivation for this PR implies, r.uuid(<array>) and r.uuid(<object>) are probably the cases where this feature becomes most useful, so not having to chain that extra .coerceTo('string') on the argument would be great!

@mlucy
Copy link
Member

mlucy commented Aug 27, 2015

If it's just data that we're using as a seed, I don't see any reason why we can't accept arbitrary data as a seed instead of just strings. We should try to have the property that if two values compare unequal in ReQL they count as different seeds. (That might be too complicated for an initial implementation though.)

@deontologician
Copy link
Contributor

With using arbitrary objects as the seed, we probably need some predetermined and unchanging way to generate the final string that gets hashed to a uuid. So we'd need to provide a contract along the lines of "The json serialized version of the object, with keys alphabetized and pseudotypes serialized like so...".

Alternately, we could define it as however they're serialized now when we order things lexicographically for indexes.

@mlucy
Copy link
Member

mlucy commented Aug 27, 2015

We currently don't allow objects as keys, but I don't think it would be too hard to define a deterministic object -> string transformation (the only worry would be that we could never change the field ordering even if RethinkDB later becomes more Unicode-aware).

@deontologician
Copy link
Contributor

Yeah, it's the "whatever we decide now is locked in forever" aspect that is troubling. I'd rather not throw it in as an add-on even though it would be super convenient. For now users can do something like r.uuid(blah.coerceTo('string')) if they need this functionality

@marshall007
Copy link
Contributor

Is <object>.coerceTo('string') not already deterministic? It appears to already sort keys lexicographically before serializing to JSON, at least.

@deontologician
Copy link
Contributor

It is, but I don't believe there's any guarantee it will never change. If we're saying "generate your primary keys with this" we're making a commitment that if you query with table.get(r.uuid(my_real_query_object)) on your data in the future, it will not suddenly stop working because we changed what fields we add to pseudo-types or something

@danielmewes
Copy link
Member

@marshall007 It is, but we have changed our JSON encoder in the past. So this limits our options for changing those details in the future.

Another problem that's specific to allowing objects is that it could clash with later adding optargs to r.uuid.

@coffeemug
Copy link
Contributor

Ok, I remove my objection.

@thelinuxlich
Copy link

As long as I can replace all those "crypto.createHash()/hash.update(key)/hash.digest('hex')" in my code, manually coercing is nothing :)

@mlucy
Copy link
Member

mlucy commented Aug 27, 2015

Alright, since there are complications let's just support it for strings in the first version.

@danielmewes
Copy link
Member

Marking as settled from the API perspective, as implemented in the PR (i.e. r.uuid accepts an optional string argument).

@coffeemug
Copy link
Contributor

@danielmewes -- do we still plan to merge this PR for 2.2? Is there a reason we haven't merged it yet (other than test time/scheduling)?

@danielmewes
Copy link
Member

Sorry for the delay. The patch looks good, merging this now.
It will ship with RethinkDB 2.2.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet