[root@adam ccpp-2013-03-16-13:53:55-21780]# abrt-cli report ./
Server side error: 'Validation failed: error validating 'os': error validating 'version': string "" contains illegal characters'

File 'os-release' reads:

Fedora release 19 (Schrödinger's Cat)

If I edit os-release to read:

Fedora release 19 (Schrodingers Cat)

I can get past the error. (Didn't check if it's the ' or the ö that causes the problem, sorry.)
It's the ' that causes the trouble.
Actually, it's both.
The problem is in libreport src/lib/problem_data.c, function is_text_file(). The file containing "Schrödinger's Cat" is considered binary although it should be text (because of both ö and ').
Ah, same here - no reports for F19. I can't change the name.
Quoting "man os-release":

"The /etc/os-release file contains operating system identification data."

"The basic file format of os-release is a newline-separated list of environment-like shell-compatible variable assignments."

[...snip...]

"Variable assignment values should be enclosed in double or single quotes if they include spaces, semicolons or other special characters outside of A-Z, a-z, 0-9."

[...snip...]

"All strings should be in UTF-8 format"

[...snip...]

"If double or single quotes or backslashes are to be used within variable assignments they should be escaped with backslashes, following shell style"
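Following those rules, the Fedora 19 name is a perfectly legal value. A hypothetical os-release style assignment, shown here only to illustrate the quoting (the variable name is illustrative, not quoted from the actual file):

```shell
# The value contains spaces, an apostrophe and a non-ASCII (UTF-8)
# character, so it is enclosed in double quotes; inside double quotes
# the apostrophe needs no escaping, following shell style.
PRETTY_NAME="Fedora release 19 (Schrödinger's Cat)"
echo "$PRETTY_NAME"
```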
Fixed in upstream git:

commit ea1f6ed6b1d742eca33ce9a598fd0b568e0fb0cb
Author: Denys Vlasenko <dvlasenk>
Date:   Mon Mar 18 22:20:06 2013 +0100

    is_text_file(): bump allowable non-ASCII chars from 2% to 10%. Closes rhbz#922433

This change was tested to treat os_release = "Fedora release 19 (Schrödinger's Cat)" as text.

I feel this is a stop-gap measure, though...
Why don't you just require UTF8 like the spec says?
Yeah, just based on the info presented here, that sounds like all kinds of crazy. There are many, many non-ASCII characters which are 'text'. Any function called 'is_text_file' should be perfectly happy with a string/file/whatever that contains nothing but Unicode. Okay, Unicode text characters, if that's a concept that's easily expressed/consumed.
Ok, to make this clearer for everyone who might be interested in why this happened to abrt: the function is_text_file() is used to determine whether a file is text or binary, and abrt uses this information when deciding how to treat the file - text files are usually shown in comments, and binaries (if sent) are attachments. In this case the heuristics in is_text_file() failed, marked the file containing the os release as binary, and didn't send it to the abrt server. I think that using heuristics is the wrong idea, because deciding whether a file is a unicode 'text' file or a binary which happens to also be valid unicode can't be 100% reliable.
You shouldn't just be checking for bytes >= 0x80. You should be checking for bytes >= 0x80 *which are not part of a valid UTF-8 byte sequence*. And your percentage threshold can probably go back to 2% or even lower. In the 21st century it's no longer such a gamble. In the olden days you had to allow a certain percentage to account for false positives from genuine non-ASCII characters in the legacy 8-bit encodings. But with the above change, those non-ASCII characters will no longer cause false positives.
The canonical "is a file text or binary?" used throughout GNU is simply "does it have NUL (=0x00) bytes?". E.g. http://www.gnu.org/software/diffutils/manual/html_node/Binary.html .
(In reply to comment #11)
> The canonical "is a file text or binary?" used throughout GNU is simply
> "does it have NUL (=0x00) bytes?". E.g.
> http://www.gnu.org/software/diffutils/manual/html_node/Binary.html .

It really depends on what you plan to do with that file / sequence of bytes. For handling it with C libraries, ASCIIZ is a good criterion. If you need to decide whether to Base64 that string or not when sending it as part of some XML(RPC) somewhere, the criterion has to be much stricter, because XML is based on Unicode and not all sequences of non-0x00 bytes are valid.
I'd probably avoid reinventing the wheel and rely on something which already works just fine:

$ cat /etc/system-release
Fedora release 19 (Schrödinger's Cat)

$ file /etc/system-release --dereference --mime --brief
text/plain; charset=utf-8
Just use this one instead, then:

Schrödinger’s Cat

That’s even typographically better…
(In reply to comment #11)
> The canonical "is a file text or binary?" used throughout GNU is simply
> "does it have NUL (=0x00) bytes?". E.g.
> http://www.gnu.org/software/diffutils/manual/html_node/Binary.html .

And it has been terminally unreliable since Unicode started to be used. It triggers so often on a system using a non-English locale that people routinely ignore it as a nuisance in most parts of the world.
(In reply to comment #13)
> I'd probably avoid reinventing the wheel and rely on something which already
> works just fine:
>
> $ cat /etc/system-release
> Fedora release 19 (Schrödinger's Cat)
>
> $ file /etc/system-release --dereference --mime --brief
> text/plain; charset=utf-8

We tried that. A string of just one character, "0", isn't considered text by the file tool (or by its underlying library):

# echo -n 0 >0
# file 0
0: very short file (no magic)
(In reply to comment #11)
> The canonical "is a file text or binary?" used throughout GNU is simply
> "does it have NUL (=0x00) bytes?". E.g.
> http://www.gnu.org/software/diffutils/manual/html_node/Binary.html .

We can adopt something like that.

Currently, NULs are taken as "this file is binary", as well as excessive amounts of chars > 0x7e. Just dropping the second condition would do what you propose. But it may become too permissive (too biased to declare stuff to be "text").

How about this: not only NUL, but any control char that is not whitespace leads to a decision that the file is "binary".
(In reply to comment #17)
>
> We can adopt something like that.
> Currently, NULs are taken as "this file is binary", as well as excessive
> amounts of chars > 0x7e.
> Just dropping the second condition would do what you propose.
> But it may become too permissive (too biased to declare stuff to be "text").
>
> How about this: not only NUL, but any control char that is not whitespace
> leads to a decision that the file is "binary".

Before you modify the algorithm -- what is the purpose of this "string is binary" test? What does the code do differently based on that test? Without knowing what codepaths your code will start taking if you modify the test, it's pretty dangerous to make the change.
(In reply to comment #18)
> Before you modify the algorithm -- what is the purpose of this "string is
> binary" test? What does the code do differently based on that test? Without
> knowing what codepaths your code will start taking if you modify the test,
> it's pretty dangerous to make the change.

The code shows text data in GUI textboxes, puts it into emails it composes, and prints it on stdout when asked to show detailed problem data. Binary files are excluded from all of these uses. When reporting bugs, they are uploaded as attachments, not as parts of textual comments.
(In reply to comment #19)
> The code shows text data in GUI textboxes, puts it into emails it composes,
> and prints it on stdout when asked to show detailed problem data.

Which means that you don't care so much about the text / binary distinction as about valid UTF-8 versus everything else. You can well have a valid ISO-8859-1 text string, but if you handle it as text and put it into the GUI without conversion, the result will be wrong.
(In reply to comment #7)
> Why don't you just require UTF8 like the spec says?

To reiterate Ajax's question: why were you trying to guess text vs binary? The spec says that this file is UTF-8, and it is.
commit ed2911f292275e879c80c7147fcd13a79f4e00b7
Author: Denys Vlasenko <dvlasenk>
Date:   Mon Mar 25 11:22:43 2013 +0100

    improve is_text_file() to not treat valid Unicode as bad_chars

    In valid Unicode, first non-ASCII byte is always 11xxxxxx and
    the following bytes are always 10xxxxxx. Random binary data
    is likely to violate this rule *a lot*. This patch makes
    bad_chars++ decision only if it sees a non-ASCII byte which
    violates the rule described above.

    Also, it makes immediate decision "it's binary" not only on NULs,
    but on any non-whitespace control chars.

    This hopefully improves handling of the problem identified in rhbz#922433

    Signed-off-by: Denys Vlasenko <dvlasenk>
    Signed-off-by: Jakub Filak <jfilak>

diff --git a/src/lib/problem_data.c b/src/lib/problem_data.c
index 2e5e55f..77c46dd 100644
--- a/src/lib/problem_data.c
+++ b/src/lib/problem_data.c
@@ -288,7 +288,7 @@ static char* is_text_file(const char *name, ssize_t *sz)
     }
     lseek(fd, 0, SEEK_SET);
-    char *buf = (char*)xmalloc(*sz);
+    unsigned char *buf = xmalloc(*sz);
     ssize_t r = full_read(fd, buf, *sz);
     close(fd);
     if (r < 0)
@@ -306,7 +306,7 @@ static char* is_text_file(const char *name, ssize_t *sz)
     {
         base++;
         if (is_in_string_list(base, (char**)always_text_files))
-            return buf;
+            return (char*)buf;
     }

     /* Every once in a while, even a text file contains a few garbled
@@ -316,28 +316,44 @@ static char* is_text_file(const char *name, ssize_t *sz)
      * os_release = "Schrödinger's Cat". Bumped to 10%.
      * Alternatives: add os_release to always_text_files[]
      * or add "if it is valid Unicode, then it's text" check here.
+     *
+     * Replaced crude "buf[r] > 0x7e is bad" logic with
+     * "if it is a broken Unicode, then it's bad".
      */
     const unsigned RATIO = 10;
     unsigned total_chars = r + RATIO;
     unsigned bad_chars = 1; /* 1 prevents division by 0 later */
-    while (--r >= 0)
+    bool prev_was_unicode = 0;
+    ssize_t i = -1;
+    while (++i < r)
     {
-        if (buf[r] >= 0x7f
-        /* among control chars, only '\t','\n' etc are allowed */
-         || (buf[r] < ' ' && !isspace(buf[r]))
-        ) {
-            if (buf[r] == '\0')
-            {
-                /* We don't like NULs very much. Not text for sure! */
-                free(buf);
-                return NULL;
-            }
+        /* Among control chars, only '\t','\n' etc are allowed */
+        if (buf[i] < ' ' && !isspace(buf[i]))
+        {
+            /* We don't like NULs and other control chars very much.
+             * Not text for sure!
+             */
+            free(buf);
+            return NULL;
+        }
+        if (buf[i] == 0x7f)
             bad_chars++;
+        else if (buf[i] > 0x7f)
+        {
+            /* We test two possible bad cases with one comparison:
+             * (1) prev byte was unicode AND cur byte is 11xxxxxx:
+             *  BAD - unicode start byte can't be in the middle of unicode char
+             * (2) prev byte wasnt unicode AND cur byte is 10xxxxxx:
+             *  BAD - unicode continuation byte can't start unicode char
+             */
+            if (prev_was_unicode == ((buf[i] & 0x40) == 0x40))
+                bad_chars++;
         }
+        prev_was_unicode = (buf[i] > 0x7f);
     }

     if ((total_chars / bad_chars) >= RATIO)
-        return buf; /* looks like text to me */
+        return (char*)buf; /* looks like text to me */

     free(buf);
     return NULL; /* it's binary */
(In reply to comment #21)
> (In reply to comment #7)
> > Why don't you just require UTF8 like the spec says?
>
> To reiterate Ajax's question: why were you trying to guess text vs binary?
> The spec says that this file is UTF-8, and it is.

We (ABRT) don't have hardcoded knowledge that /etc/fedora-release is UTF-8. It's just copied along with other information to /var/tmp/abrt/<somedir> and processed when the user decides to report the problem. By that time abrt doesn't know where the file came from, so it tries to guess whether it's text and uses this information when creating the report (binary files are sent as attachments, and text files usually end up in comment fields).
(In reply to comment #23)
> We (ABRT) don't have hardcoded knowledge that /etc/fedora-release is
> UTF-8. It's just copied along with other information to /var/tmp/abrt/<somedir>
> and processed when the user decides to report the problem. By that time abrt
> doesn't know where the file came from, so it tries to guess whether it's text
> and uses this information when creating the report (binary files are sent as
> attachments, and text files usually end up in comment fields).

Am I right in understanding that you're *not* tracking where all the files came from? That sounds like something that should be fixed - it seems like extremely pertinent information to me.
(In reply to comment #23)
> We (ABRT) don't have hardcoded knowledge that /etc/fedora-release is
> UTF-8. It's just copied along with other information to /var/tmp/abrt/<somedir>
> and processed when the user decides to report the problem. By that time abrt
> doesn't know where the file came from, so it tries to guess whether it's text
> and uses this information when creating the report (binary files are sent as
> attachments, and text files usually end up in comment fields).

In that case, 'text' and 'utf-8' are pretty much synonymous, surely? If it's legacy 20th-century 8-bit crap like ISO8859-1, it wants to be an attachment rather than wrongly included in a "text" part of the email anyway.