New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support null characters in strings #3163
Comments
What's the difference between a string type with NULLs and the binary type? |
The string type would probably continue to be UTF-8 only, because we support string operations that need to know an encoding (like |
A string is a sequence of unicode code points. Edit: Strings do not have an encoding. |
Strings currently have an implicit encoding of UTF-8. This is not enforced by anything and in fact we violate it in several places; not all byte strings represent valid Unicode code points. In particular the high bit of the first byte cannot be set (it indicates a multibyte code point). All bits 0 is a valid UTF-8 byte (if the previous byte does not indicate a multibyte code point) encoding U+0000; in particular it is a control character, and possesses the (deprecated) Unicode 1.0 character name NULL. This obviously causes problems. In so-called "Modified UTF-8" U+0000 is encoded as the byte sequence |
Which I suppose is to say "a string is a sequence of bytes with a coding (possibly implicit)". There are things that we permit in strings with the implicit UTF-8 coding right now that we should not (for example, bytes beginning There are additional complications—for example Oracle and MySQL will actually use a related codec called CESU-8 when outputting UTF-8, and in Java internal serialization uses the Modified UTF-8. It is generally expected that the wire protocol be true UTF-8, however. The easiest way to do this would probably be to alter cJSON to use Things get extremely complex when supporting arbitrary encodings. I recommend that we not even try. With |
related: #2950 |
FWIW: http://w3techs.com/technologies/history_overview/character_encoding indicates that Shift JIS represents about 1.3% of the web as of today. GB2312 and GBK are both Chinese, Windows-1251 is commonly used for Cyrillic character sets, and Windows-1252 is a superset of ISO 8859-1 that is important for legacy stupid reasons. I bring this up because it means that any effort to "do the right thing" with non-UTF-8 strings will have to deal with Shift JIS, which is one of the least sane character codings ever devised. |
Moving to subsequent and marking high priority -- this came up in some internal conversations with customers. |
I'll check if RapidJSON supports this from the JSON side... Then we would just need to make sure that btree serialization handles it. |
Null bytes are now supported in the branch However datums still disallow null bytes, and we need to escape strings when converting them to btree keys similar to what we do with binary data. Another minor thing left is that utf8 validation is currently called on a C-string in some places, but that should be easy to change (I believe the code already supports other string types). |
@danielmewes -- we could use a different type tag in the serialization format. |
@mlucy That wouldn't sort correctly if you had a mixture of both though. |
Er, right, forgot. I guess it probably isn't worth implementing non-lexicographic key comparison just for this. |
As discussed with @danielmewes offline, we're going to allow indexing strings containing null bytes for new secondary indexes only (disallowed in primary indexes.) Datums in general will be unrestricted wrt strings containing null bytes. Reasoning: Adding support for null bytes in primary keys would require adding versioning support to the primary index, which gets to be a nasty change for many reasons. We can do that later if need be. |
In review with @mlucy, CR 3250 |
👏 🍰 |
Re-opening: this change broke the |
Also broke |
Fixed the test failures in f510b09 now in |
The discussion came up in #2699 (among others)
We should consider allowing null characters in strings.
There are likely going to be some technical obstacles:
This is pretty low priority.
The text was updated successfully, but these errors were encountered: