
Fixed #24604 -- Added JSONField to contrib.postgres. #4427

Closed
wants to merge 3 commits

Conversation

mjtamlyn
Member

@mjtamlyn mjtamlyn commented Apr 1, 2015

TODO:

  • Docs

This adds a JSONField to django.contrib.postgres using the underlying jsonb data type available on Postgres 9.4+. We do not support the json data type because it lacks an equality operator and is otherwise of limited use.

The implementation seems to be remarkably simple all things considered, making good use of existing code for hstore and a bit of transform magic.

JSONField.register_lookup(lookups.DataContains)
JSONField.register_lookup(lookups.ContainedBy)
JSONField.register_lookup(lookups.HasKey)
JSONField.register_lookup(lookups.HasKeys)
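For reference, these registered lookups map onto Postgres jsonb operators. The sketch below shows the intended ORM usage (the `Dog` model is hypothetical) and approximates the `@>` containment semantics in pure Python for flat dicts, purely to illustrate what `data__contains` matches:

```python
# Hypothetical model usage for the four lookups registered above:
#   Dog.objects.filter(data__contains={'breed': 'collie'})     # jsonb @>
#   Dog.objects.filter(data__contained_by={'breed': 'collie',
#                                          'owner': 'Bob'})    # jsonb <@
#   Dog.objects.filter(data__has_key='owner')                  # jsonb ?
#   Dog.objects.filter(data__has_keys=['breed', 'owner'])      # jsonb ?&

def jsonb_contains(doc, sub):
    # Pure-Python approximation of @> containment for flat dicts:
    # every key/value pair in `sub` must appear in `doc`.
    return all(doc.get(k) == v for k, v in sub.items())

dog = {"breed": "collie", "owner": "Bob"}
matches = jsonb_contains(dog, {"breed": "collie"})
```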
Contributor


There's also another operator related to HasKeys: ?|, which matches when at least one of the given keys is present.

I guess that could be inferred from the type of the lookup argument: if it's a list/tuple, then use the ?| operator, else use the ? operator.
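That inference could look something like the following sketch (the helper name is hypothetical; it only shows the operator-selection logic being suggested):

```python
def jsonb_key_operator(rhs):
    # Hypothetical helper: pick the Postgres operator based on the type
    # of the lookup argument, as suggested above.
    if isinstance(rhs, (list, tuple)):
        return '?|'  # true if the jsonb value has ANY of these keys
    return '?'       # true if the jsonb value has this single key

op_single = jsonb_key_operator('owner')    # single key -> '?'
op_many = jsonb_key_operator(['a', 'b'])   # list of keys -> '?|'
```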

Member Author


True, it also applies to hstore where we don't implement it. We might as well for completeness I guess.

Contributor


So, override has_key, or has_any_key?

Member Author


I think has_any_key or has_some_keys would be appropriate. any probably makes the most sense, as it mimics Python's any() builtin.

@schinckel
Contributor

One of the things that I have had many feature requests for (and implemented in both JSON fields I've built) is the ability to customise encoding and decoding.

I've done this in the most recent one by allowing __init__ kwargs on the field itself for encode_kwargs and decode_kwargs, but I wonder if there might be better ways of doing this.

Related to this, I have been playing with sane defaults for these. My gut feeling is that we should use DjangoJSONEncoder as the default encoder (encode_kwargs={'cls': DjangoJSONEncoder}), and perhaps use decimal.Decimal as the default parse_float in decode_kwargs.

I'm not sure it will be possible to do this per field though: http://initd.org/psycopg/docs/extras.html#json-adaptation

It seems that overriding the dumping function might be possible, but the loading function is global.
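The parse_float idea above can be demonstrated with the stdlib json module alone; with Decimal, the exact numeric literal survives decoding instead of being coerced to a lossy binary float:

```python
import json
from decimal import Decimal

# decode_kwargs-style hook: parse JSON floats as Decimal so the exact
# literal from the database round-trips.
doc = json.loads('{"price": 1.10}', parse_float=Decimal)
# doc["price"] is Decimal('1.10'), not the float 1.1
```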

@schinckel
Contributor

Aha. As of psycopg2 2.5.4, JSONB is automatically registered (and converted by psycopg2) into dict/list structures.

Is this something we want to override, to allow the flexibility of parsing per-field? I don't see that it will be a performance hit, as it will be basically the same code, just running in the context of the Django field rather than the psycopg2 layer.

It will also mean that the JSON field could be used (without the lookup functionality) for other database backends.

@schinckel
Contributor

Hmm. It seems that the use_decimal flag for json.dumps is simplejson-only.

I'm not sure if there is any nice way to handle this.

@mjtamlyn
Member Author

mjtamlyn commented Apr 5, 2015

At present, I'd just let psycopg2 do its thing. The use of DjangoJSONEncoder is an interesting point; sensible serialisation of datetimes is super useful.

We can handle the dumping on a field by field basis quite easily as we wrap the objects in a psycopg2 json object which can take a dumping function, but it's more difficult to customise the loading on a field by field basis as this is registered in psycopg2 against the field type, not the column specifically. I'm not familiar enough with the internals of psycopg2 to know how this works. It may be worth investigating the performance difference. If it's insignificant we're ok.

@mjtamlyn
Member Author

mjtamlyn commented Apr 8, 2015

I looked a bit more into implementing something with DjangoJSONEncoder, then I realised there's a clue in its name - it doesn't do decoding at all. When loading back into models from a dumped JSON fixture this is fine, because model creation is happy to understand the string versions in its to_python method, but this won't be sufficient for loading back into a JSON based field. It is important that saving and loading the model field value should always give the original result back.
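The asymmetry is easy to see with a stdlib stand-in for DjangoJSONEncoder's datetime handling: encoding a datetime is one-directional, and the round trip comes back as a plain string.

```python
import json
import datetime

class ISOEncoder(json.JSONEncoder):
    # Stand-in for DjangoJSONEncoder's datetime handling, to show the
    # asymmetry: encoding to ISO 8601 is easy, but json.loads has no
    # way to know the string was ever a datetime.
    def default(self, o):
        if isinstance(o, datetime.datetime):
            return o.isoformat()
        return super().default(o)

original = {"when": datetime.datetime(2015, 4, 8, 12, 0)}
loaded = json.loads(json.dumps(original, cls=ISOEncoder))
# loaded["when"] is now the string '2015-04-08T12:00:00', not a datetime
```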

I think there is a possible solution to override the encoders on a field by field basis, but I am concerned about the performance implications. Django+postgres is almost completely free of "converter functions", which are a known performance hit (due to a second loop through all the rows after psycopg2). That said, loading large json blobs for a large number of rows will have a performance hit no matter how you look at it, and I think that using .defer() won't make any difference if we allow the conversion to happen at the psycopg2 level.

Nonetheless, I feel this whole issue can be delayed for future consideration. We should get a basic functioning version in and then we can add improvements like this later.

@schinckel
Contributor

I've been thinking about this too (whilst playing with CompositeField stuff, which is only related in the sense that it deals with conversion of data).

It's always going to be problematic converting data back into whatever it was, as things like datetime are indistinguishable from a string that happens to be in ISO 8601 format, for instance.

Perhaps a better solution is to allow passing in a "converter" as needed, that will be applied in the new from_db_value() method. That is, don't even try to inject ourselves into the psycopg2 machinery, but just let django handle it.

I think there's still value in allowing an override of the encoder kwargs, as that can be done on a per-field basis, as it's still done by django.

@mjtamlyn
Member Author

mjtamlyn commented Apr 8, 2015

Given the difficulty of decoding, I'd be inclined to suggest that someone who wants to do it can subclass JSONField to add their own from_db_value() method which can further decode parts of the JSON it expects to be strings in a particular format. For custom encoding they would currently also need to override get_prep_value to wrap the object in psycopg2.extras.JSON with a different encoder passed.
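A converter of the kind such a subclass might call from from_db_value() could look like this sketch (the helper, the "_at" key convention, and the sample row are all hypothetical, only illustrating caller-driven decoding):

```python
import json
import datetime

def decode_timestamps(obj):
    # Hypothetical converter a JSONField subclass might apply in
    # from_db_value(): re-inflate values the caller knows should be
    # datetimes (here, anything under a key ending in "_at").
    return {
        k: datetime.datetime.fromisoformat(v) if k.endswith("_at") else v
        for k, v in obj.items()
    }

row = json.loads('{"name": "Fido", "adopted_at": "2015-04-08T12:00:00"}')
converted = decode_timestamps(row)
# converted["adopted_at"] is a real datetime again
```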

If this becomes a common request we could easily add a utility to deal with it; personally, right now I'm not completely sure it's a Good Idea ™️

@mjtamlyn mjtamlyn changed the title [WIP] Added JSONField to contrib.postgres. Added JSONField to contrib.postgres. Apr 8, 2015
@schinckel
Contributor

Yes, that was my other approach. Perhaps a documented example of how to do that, too.

@mjtamlyn
Member Author

mjtamlyn commented Apr 8, 2015

Docs added now, should be ready for review. I'll squash together after integrating feedback from a detailed review.

@mjtamlyn mjtamlyn changed the title Added JSONField to contrib.postgres. [Fixed #24604] Added JSONField to contrib.postgres. Apr 8, 2015
@mjtamlyn mjtamlyn changed the title [Fixed #24604] Added JSONField to contrib.postgres. Fixed #24604 -- Added JSONField to contrib.postgres. Apr 8, 2015
def to_python(self, value):
    try:
        return json.loads(value)
    except ValueError:
Member


If this is loading user-supplied data, it can fairly easily be made to raise other kinds of errors, which would then propagate upwards. For example:

>>> s = '['*1000 + ']'*1000
>>> json.loads(s)
...
RuntimeError: maximum recursion depth exceeded while decoding a JSON array from a byte string

There are probably other kinds of sneaky things we can do here, too. There's an argument to be made that this is OK, that a server crash is the right thing to do here, but I thought I'd point it out.

Member Author


It is handling user supplied data. I'd argue a 500 isn't the worst thing here, as it's receiving weird data designed to break the site (or just weird data!). If there was a general security issue here then anyone using json.loads(request.body) would have a problem. It may be we want to be more careful, but IANASE (I am not a security expert)

Member


One thing we might want to think about: using json.loads() before actually saving the object to the database. Otherwise malicious data could prevent your page from showing up if the item is loaded as part of an evaluated queryset.

Member


Yeah that seems reasonable. I'm wondering now if json.loads is really safe for user-supplied data -- in theory it ought to be, but see the history of JSON vulnerabilities in Ruby for how "theory" and "practice" differ. This is probably something we should think about further, and maybe test -- this could be an RCE waiting to happen.

Maybe we should include some sort of security note in the docs about being careful about feeding raw JSON directly to a JSON field?

/cc @PaulMcMillan any thoughts here?

Member Author


If this is a genuine security concern, then we shouldn't ship an actual JSONField at all, but then we make ourselves vulnerable to someone not checking user data at all and we get into the problem Markus has suggested.

Markus - the form layer would normally do this kind of validation I think, which is what's going on here. The model layer will receive the decoded JSON object to save to the database always. (If it got a string, this could just be a valid "JSON string" i.e. a string literal).

Member


If this is a genuine security concern

I don't know that it is; it just triggered my spidey-sense. Again in theory this should be OK, since JSON is designed to be safe in this way (unlike, say, YAML). I'm doing some digging right now, but having a look at the Ruby vulns I mentioned, they're usually related to "extra things" that various folks added on top, not the core JSON functionality.

I don't think this should block merge, but it is something I want to investigate further.

Contributor


Shouldn't we only be attempting to json.loads(value) if value is a string object?

From https://docs.djangoproject.com/en/1.8/howto/custom-model-fields/#converting-values-to-python-objects:

As a general rule, to_python() should deal gracefully with any of the following arguments:

  • An instance of the correct type (e.g., Hand in our ongoing example).
  • A string
  • None (if the field allows null=True)

Member Author


This is the to_python of the form field, not the model field. It would under normal circumstances always receive a string. We do need to handle empty values though, I'll add that.
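The empty-value handling mentioned there can be sketched without Django; only the control flow is shown here, and the real form field would raise ValidationError rather than let a bare ValueError escape:

```python
import json

def to_python(value):
    # Sketch of a form field's to_python with empty-value handling:
    # treat '' and None as "no value" rather than as invalid JSON.
    # (Django's actual form field wraps decode failures in a
    # ValidationError; that machinery is omitted here.)
    if value in ('', None):
        return None
    return json.loads(value)
```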

@timgraham
Member

Could you add a mention in the 1.9 release notes (contrib.postgres section)? I only had minor edits to the docs: http://dpaste.com/21RWTQ3

When I submit an invalid value in the admin, the value that's displayed on the subsequent form rendering is escaped with backslashes: "{\"breed\": \"collie\"". Is it a bug?

I also updated Jenkins to PG 9.4.

@jamesbrewerdev

What is the status of this PR? Has it been abandoned? #24604 says it needs improvement.

@cancan101

Any way to allow this field to work with non-Postgres DBs, using a text field when the DB doesn't support jsonb? This would be along the lines of UUIDField, which uses the native type where available.

@mjtamlyn
Member Author

Needs some tweaks, have been busy sorry. Will definitely get done by end of DjangoConEU.

As for non-PG dbs, all you'd get is the ability to store a JSON string and load it again; the querying operations would not be fast enough to be useful, and would require extensive SQL scripting for each database. Consequently, you may as well do this yourself with a TextField (and multiple third-party solutions do this).

@cancan101

Even though the querying operations would be slow / non-existent, having the ability to quickly run unit tests using SQLite on my local dev box is still helpful.

@jamesbrewerdev

@mjtamlyn No worries! Was just discussing an internal implementation of the same with a coworker and came across this PR.

@mjtamlyn mjtamlyn force-pushed the dcp-jsonfield branch 2 times, most recently from eb83d78 to b32cf8e Compare May 30, 2015 20:39
@mjtamlyn mjtamlyn force-pushed the dcp-jsonfield branch 2 times, most recently from 38eb618 to aa009ec Compare May 30, 2015 21:20
@mjtamlyn
Member Author

Merged in 33ea472
