Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

import_logs.py crashes on invalid requests #4352

Closed
anonymous-matomo-user opened this issue Nov 27, 2013 · 10 comments
Closed

import_logs.py crashes on invalid requests #4352

anonymous-matomo-user opened this issue Nov 27, 2013 · 10 comments
Assignees
Labels
Bug For errors / faults / flaws / inconsistencies etc.
Milestone

Comments

@anonymous-matomo-user
Copy link

We have lots of log lines with invalid requests, for example:

www.example.com 70.117.169.113 - - [26/Nov/2013:01:41:01 -0500] "\x80w\x01\x03\x01" 400 226 "-" "-"

(these are drive-by exploit attempts).

It looks like the script wants both the request method and the HTTP version, since if I modify the line above it works:

www.example.com 70.117.169.113 - - [26/Nov/2013:01:41:01 -0500] "GET \x80w\x01\x03\x01 HTTP/1.1" 400 226 "-" "-"

We can't stop these requests, so I guess the parser should at least parse any old garbage that shows up in the path?

@mattab
Copy link
Member

mattab commented Nov 28, 2013

Thanks for the report. We'll try to reproduce and add a test case showing the failure and that it's fixed.

@anonymous-matomo-user
Copy link
Author

Thanks, it looks like the problem is that the path regex was expecting two spaces that weren't present, between where e.g. "GET" and "HTTP/1.1". I came up with a patch that works locally and made a pull request:

#159

I confess I couldn't get the test suite to run, so it might need some work still.

@mattab
Copy link
Member

mattab commented Dec 8, 2013

(these are drive-by exploit attempts).

Piwik puts them as "invalid" which I think they really are invalid request. Marking as wontfix. If you think it really should be fixed let us know why (since they appear to be invalid requests)

@anonymous-matomo-user
Copy link
Author

Replying to matt:

(these are drive-by exploit attempts).

Piwik puts them as "invalid"

It looks like I left out something important -- these are often the first requests in the logs!

$ git clone -q https://github.com/piwik/piwik.git
$ cat bad.log 
www.example.com 70.117.169.113 - - [26/Nov/2013:01:41:01 -0500] "\x80w\x01\x03\x01" 400 226 "-" "-"
$ cp $MYCFG piwik/config/config.ini.php
$ python2.7 ./piwik/misc/log-analytics/import_logs.py --url $MYURL bad.log 
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log bad.log...
Fatal error: cannot determine the log format using the first line of the log file. Try removing it or specifying the format with the --log-format-name command line argument.

You could search through the entire file looking for a line that matches the regex, but that can waste tons of time on log files with an unknown format. The fix in the pull request instead allows the regex to match even when the request is garbage.

I am planning to add a test case to the pull request, but will be super busy for another 1.5 weeks.

@mattab
Copy link
Member

mattab commented Dec 9, 2013

Sorry I missed the "Crashed" keyword!

@diosmosis
Copy link
Member

Fixed in 3572ef7.

@anonymous-matomo-user
Copy link
Author

Replying to capedfuzz:

Fixed in 3572ef7.

Thanks for the quick fix! I just ran it on last night's logs and it made it all the way through.

I would beware of a potential Heisenbug, though: we have a ton of sites that basically no real human ever visits. It's totally possible that one day we could have a log containing nothing but invalid requests (e.g. the bad.log case above). In that case, the entire import will still crash.

For a workaround in a similar vein, maybe if you run through the entire file and find no valid log lines, it shouldn't error out? I suppose to be reliable, the number of log lines (1000) would need to be adjustable. I would personally let it check the whole file for a match.

Does Piwik record the invalid requests that it finds once it knows the log format? I ask because you should get the same results from one file containing,

invalid
invalid
invalid
invalid
valid

as you do if you split that into two files,

invalid
invalid
invalid
invalid
valid

Only one invalid line would be parsed with the files split, since no format would be found in the first file. But if the invalid lines are dropped anyway, it's a moot point.

Thanks again.

@mattab
Copy link
Member

mattab commented Dec 9, 2013

In bb477d7: Refs #4352 add comments

@mattab
Copy link
Member

mattab commented Dec 9, 2013

In d549d2c: Refs #4352 increase to 100,000, it takes ~ 6 seconds on my box on a file with 100k lines, so still fast enough to fail. If a file has more than 100k wrong lines there's defnitely something wrong with it.

@anonymous-matomo-user
Copy link
Author

Agreed, thanks a million guys.

@anonymous-matomo-user anonymous-matomo-user added this to the 2.0 - Piwik 2.0 milestone Jul 8, 2014
sabl0r pushed a commit to sabl0r/piwik that referenced this issue Sep 23, 2014
sabl0r pushed a commit to sabl0r/piwik that referenced this issue Sep 23, 2014
…box on a file with 100k lines, so still fast enough to fail. If a file has more than 100k wrong lines there's defnitely something wrong with it.
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug For errors / faults / flaws / inconsistencies etc.
Projects
None yet
Development

No branches or pull requests

3 participants