Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support null characters in strings #3163

Closed
danielmewes opened this issue Oct 9, 2014 · 19 comments
Closed

Support null characters in strings #3163

danielmewes opened this issue Oct 9, 2014 · 19 comments
Assignees
Milestone

Comments

@danielmewes
Copy link
Member

The discussion came up in #2699 (among others)

We should consider allowing null characters in strings.

There are likely going to be some technical obstacles:

This is pretty low priority.

@nviennot
Copy link
Contributor

What's the difference between a string type with NULLs and the binary type?

@mlucy
Copy link
Member

mlucy commented Oct 23, 2014

The string type would probably continue to be UTF-8 only, because we support string operations that need to know an encoding (like .match). Binary is just dumb data storage with no encoding; not many terms can be called on it, and it's serialized specially to the clients since a lot of JSON libraries can't handle binary data in strings.

@srh
Copy link
Contributor

srh commented Oct 23, 2014

What's the difference between a string type with NULLs and the binary type?

A string is a sequence of unicode code points.

Edit: Strings do not have an encoding.

@gchpaco
Copy link
Contributor

gchpaco commented Oct 23, 2014

Strings currently have an implicit encoding of UTF-8. This is not enforced by anything and in fact we violate it in several places; not all byte strings represent valid Unicode code points. In particular the high bit of the first byte cannot be set (it indicates a multibyte code point). All bits 0 is a valid UTF-8 byte (if the previous byte does not indicate a multibyte code point) encoding U+0000; in particular it is a control character, and possesses the (deprecated) Unicode 1.0 character name NULL. This obviously causes problems. In so-called "Modified UTF-8" U+0000 is encoded as the byte sequence 0xC0, 0x80; in Modified UTF-8, all bits 0 can continue to be used as a string terminator.

@gchpaco
Copy link
Contributor

gchpaco commented Oct 23, 2014

Which I suppose is to say "a string is a sequence of bytes with a coding (possibly implicit)". There are things that we permit in strings with the implicit UTF-8 coding right now that we should not (for example, bytes beginning 10xxxxxx in initial position; overlong encodings; values greater than U+10FFFF) and things we do not permit in strings that we should possibly consider (U+0000). These actually have security implications; decoders that attempt to decode anything can be used to smuggle U+0000, /, ', and other dangerous characters past (admittedly pretty mediocre) security firewalls in IIS and Tomcat before. Accordingly RFC 3629 requires that decoders trigger protect against seeing any of these oddities; early decoders would trigger exceptions, but this had denial of service implications, and more recent ones have largely gone in for replacing them with some other character, like the replacement character U+FFFD.

There are additional complications—for example Oracle and MySQL will actually use a related codec called CESU-8 when outputting UTF-8, and in Java internal serialization uses the Modified UTF-8. It is generally expected that the wire protocol be true UTF-8, however.

The easiest way to do this would probably be to alter cJSON to use std::string instead of char * everywhere. This is a lot of obnoxious grunt work but not, I think, actually difficult.

Things get extremely complex when supporting arbitrary encodings. I recommend that we not even try. With r.binary you could make a document that has a coding key and a data key and do your own internal parsing. Nothing else is remotely safe; non Unicode encodings don't even encode the same code point for the same string of characters (for example, Shift JIS which is insane).

@deontologician
Copy link
Contributor

related: #2950

@gchpaco
Copy link
Contributor

gchpaco commented Oct 23, 2014

FWIW: http://w3techs.com/technologies/history_overview/character_encoding indicates that Shift JIS represents about 1.3% of the web as of today. GB2312 and GBK are both Chinese, Windows-1251 is commonly used for Cyrillic character sets, and Windows-1252 is a superset of ISO 8859-1 that is important for legacy stupid reasons. I bring this up because it means that any effort to "do the right thing" with non-UTF-8 strings will have to deal with Shift JIS, which is one of the least sane character codings ever devised.

@coffeemug
Copy link
Contributor

Moving to subsequent and marking high priority -- this came up in some internal conversations with customers.

@coffeemug coffeemug modified the milestones: subsequent, backlog Apr 10, 2015
@danielmewes
Copy link
Member Author

I'll check if RapidJSON supports this from the JSON side... Then we would just need to make sure that btree serialization handles it.

@danielmewes
Copy link
Member Author

Null bytes are now supported in the branch daniel_rapidjson as far as the JSON side is concerned (will be merged as part of #3844).

However datums still disallow null bytes, and we need to escape strings when converting them to btree keys similar to what we do with binary data.
The slightly tricky thing will be how to migrate old primary keys that contain strings with 1 bytes (these would need to be escaped too, since the escaping mechanism uses 1 bytes as escape bytes to escape null bytes).
I think we can get away with continuing to disallow null bytes in primary keys of existing tables, and require creating a new table to start using them.

Another minor thing left is that utf8 validation is currently called on a C-string in some places, but that should be easy to change (I believe the code already supports other string types).

@danielmewes danielmewes modified the milestones: 2.1-polish, subsequent Apr 10, 2015
@mlucy
Copy link
Member

mlucy commented Apr 10, 2015

@danielmewes -- we could use a different type tag in the serialization format. S means string and S+ means string with escaped NULL bytes.

@danielmewes
Copy link
Member Author

@mlucy That wouldn't sort correctly if you had a mixture of both though.

@mlucy
Copy link
Member

mlucy commented Apr 10, 2015

Er, right, forgot. I guess it probably isn't worth implementing non-lexicographic key comparison just for this.

@encryptio encryptio self-assigned this Sep 10, 2015
@encryptio
Copy link
Contributor

As discussed with @danielmewes offline, we're going to allow indexing strings containing null bytes for new secondary indexes only (disallowed in primary indexes.) Datums in general will be unrestricted wrt strings containing null bytes.

Reasoning: Adding support for null bytes in primary keys would require adding versioning support to the primary index, which gets to be a nasty change for many reasons. We can do that later if need be.

@encryptio
Copy link
Contributor

In review with @mlucy, CR 3250

@coffeemug
Copy link
Contributor

👏 🍰

encryptio added a commit that referenced this issue Sep 29, 2015
@mlucy
Copy link
Member

mlucy commented Oct 1, 2015

Re-opening: this change broke the polyglot/datum/object tests in next because it changed some error messages.

@mlucy mlucy reopened this Oct 1, 2015
@mlucy
Copy link
Member

mlucy commented Oct 1, 2015

Also broke polyglot/regression/1132.

@encryptio
Copy link
Contributor

Fixed the test failures in f510b09 now in next, reviewed by @danielmewes in person.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants