
Ad Hoc Data Analysis From The Unix Command Line

Once upon a time, I was working with a colleague who needed to do some quick data analysis to get a handle on the scope of a problem. He was considering importing the data into a database or writing a program to parse and summarize that data. Either of these options would have taken hours at least, and possibly days. I wrote this on his whiteboard:

Your friends: cat, find, grep, wc, cut, sort, uniq

These simple commands can be combined to quickly answer the kinds of questions for which most people would turn to a database, if only the data were already in a database. You can quickly (often in seconds) form and test hypotheses about virtually any record-oriented data source.
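For example, a sketch of the kind of one-liner these commands make possible (the file name access.log and the field position are hypothetical; adjust them for your own data):

  # count how often each value appears in the third whitespace-separated
  # field of a log file, and show the most frequent values first
  cut -d' ' -f3 access.log | sort | uniq -c | sort -rn | head

Each command does one small job: cut picks out a field, sort groups identical values together, uniq -c counts each group, and the final sort -rn and head surface the biggest counts. The rest of this book walks through these pieces in turn.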

Intended audience

You've logged into a Unix box of some flavor and run basic commands like ls, cd, and cat. If you don't know what the ls command does, you need a more basic introduction to Unix than I'm going to give here.

Table of Contents

  1. Preliminaries
  2. Standard Input, Standard Output, Redirection and Pipes
  3. Counting Part 1 - grep and wc
  4. Picking The Data Apart With cut
  5. Joining The Data Together With join
  6. Counting Part 2 - sort and uniq
  7. Rewriting The Data With Inline perl
  8. Quick Plotting With gnuplot
  9. Appendices