Processing log files on AWS S3

Amazon Web Services makes it very easy to log lots of data. For most of their services you can enable logging just by indicating which bucket to log to. Some examples include logging requests to an S3 bucket, CloudFront access logs, and CloudTrail logging.

It's easy enough to enable these and move on. Inevitably, though, you run into an issue or need to calculate a usage metric, and you have to access these log files and interpret them. Sounds easy enough, right? Normally, you'd have to:

  • Navigate the S3 bucket to find the appropriate files. (Easier said than done.)
  • Unzip the files if necessary.
  • Search, grep based on position, or manually inspect.

For example, you might need to check the HTTP response codes for some requests through CloudFront.

Example list of CloudFront logs. Hundreds are created per day, so scrolling to a recent date through the S3 console could take forever.
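
Once the right files are found, the remaining manual steps (unzip, then pick a field by position) could be sketched in Python like this. It is only an illustration of the by-hand approach: it assumes the CloudFront standard log layout, where fields are tab-separated and sc-status is the ninth field, and the sample lines are made up with a shortened field list.

```python
import gzip
from collections import Counter

# Assumption: sc-status is the ninth tab-separated field (index 8) in the
# CloudFront standard log layout. Check the "#Fields:" header of your own logs.
STATUS_INDEX = 8


def count_by_position(gz_bytes, index=STATUS_INDEX):
    """Decompress one gzipped log file and tally the value at a fixed column."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    counts = Counter()
    for line in text.splitlines():
        # Skip blank lines and the "#Version" / "#Fields" header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) > index:
            counts[fields[index]] += 1
    return counts


# Made-up sample data standing in for a downloaded .gz log file.
sample_lines = [
    "#Version: 1.0",
    "#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status",
    "2014-04-25\t01:00:00\tIAD12\t1024\t192.0.2.10\tGET\texample.cloudfront.net\t/index.html\t200",
    "2014-04-25\t01:00:05\tIAD12\t512\t192.0.2.11\tGET\texample.cloudfront.net\t/missing.png\t404",
    "2014-04-25\t01:00:09\tIAD12\t2048\t192.0.2.10\tGET\texample.cloudfront.net\t/app.js\t200",
]
sample_gz = gzip.compress("\n".join(sample_lines).encode("utf-8"))

print(count_by_position(sample_gz))
```

The hard-coded column index is what makes this approach brittle: it has to be looked up again for every log format.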

It's not impossible, but this process could be better, especially if it is something you find yourself doing often.

Why don't we just set up a log processing pipeline that streams the logs into a Map/Reduce or Loggly-like solution? Well, in short, we do for some of our logs! However, there are still plenty of times when we just want to access the logs and quickly review a specific timeframe or run a count.

Introducing Spotcheck, a command-line utility to quickly download and query log files stored on AWS. It requires you to provide the S3 bucket, prefix, log format, and an optional date. With that, it works out the full names of the log files, downloads them, unzips them, concatenates them, and converts them into JSON.
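
To make the step of working out log file names from a date more concrete, here is a minimal sketch of how a key prefix for one day could be built. It assumes CloudFront's naming pattern of <optional prefix><distribution ID>.YYYY-MM-DD-HH.<unique ID>.gz. The function names, the prefix cf-logs/, and the distribution ID EXAMPLEDISTID are hypothetical, and this is not necessarily how Spotcheck does it.

```python
from datetime import date


def key_prefix_for_day(prefix, distribution_id, day):
    """Build the S3 key prefix shared by all of one day's CloudFront log files."""
    return f"{prefix}{distribution_id}.{day.isoformat()}"


def keys_for_day(all_keys, prefix, distribution_id, day):
    """Filter an iterable of S3 keys down to the gzipped logs for one day."""
    # The trailing "-" separates the date from the two-digit hour.
    wanted = key_prefix_for_day(prefix, distribution_id, day) + "-"
    return sorted(k for k in all_keys if k.startswith(wanted) and k.endswith(".gz"))


# Made-up keys standing in for a bucket listing.
sample_keys = [
    "cf-logs/EXAMPLEDISTID.2014-04-24-23.aaaa1111.gz",
    "cf-logs/EXAMPLEDISTID.2014-04-25-00.bbbb2222.gz",
    "cf-logs/EXAMPLEDISTID.2014-04-25-13.cccc3333.gz",
]

print(key_prefix_for_day("cf-logs/", "EXAMPLEDISTID", date(2014, 4, 25)))
print(keys_for_day(sample_keys, "cf-logs/", "EXAMPLEDISTID", date(2014, 4, 25)))
```

Passing a prefix like this to an S3 listing call narrows the listing to a single day, which avoids paging through the whole bucket.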

$ spotcheck download config/cloudfront.json --date 2014-04-25
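
The conversion to JSON can be pictured with the sketch below. It is an illustration under assumptions rather than Spotcheck's actual implementation: it assumes the log text carries a "#Fields:" header naming the columns and uses tab-separated values, as CloudFront standard logs do. The function name and sample lines are made up.

```python
import json


def log_text_to_records(text):
    """Turn '#Fields:'-headed, tab-separated log text into a list of dicts."""
    field_names = []
    records = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            # Field names in the header are separated by spaces.
            field_names = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and field_names:
            # Pair each tab-separated value with its field name.
            records.append(dict(zip(field_names, line.split("\t"))))
    return records


# Made-up sample with a shortened field list.
sample_text = "\n".join([
    "#Version: 1.0",
    "#Fields: date time cs-method cs-uri-stem sc-status",
    "2014-04-25\t01:00:00\tGET\t/index.html\t200",
    "2014-04-25\t01:00:05\tGET\t/missing.png\t404",
])

print(json.dumps(log_text_to_records(sample_text), indent=2))
```

Reading the field names from the header, rather than hard-coding positions, lets the same code handle logs whose column set differs.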

Now that the log is in JSON format, it's a little easier to review and parse manually. You might even try JSONSelect or jq. Spotcheck also provides a quick glance at the contents of the file by outputting counts of each of the JSON properties.

$ spotcheck report my-cloudfront-log.json --field sc-status
Example report output.
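
For a sense of what such a count involves, here is a small Python sketch that tallies the values of one field across JSON records. The shape of Spotcheck's JSON file is not shown in this post, so a flat list of objects is assumed. The function name and the sample records are hypothetical.

```python
import json
from collections import Counter


def count_values(records, field):
    """Count how often each value of `field` appears across JSON records."""
    return Counter(str(r[field]) for r in records if field in r)


# Made-up records standing in for the contents of a converted log file.
records = json.loads("""
[
  {"cs-uri-stem": "/index.html",  "sc-status": "200"},
  {"cs-uri-stem": "/missing.png", "sc-status": "404"},
  {"cs-uri-stem": "/app.js",      "sc-status": "200"}
]
""")

# Print values from most to least frequent.
for value, n in count_values(records, "sc-status").most_common():
    print(value, n)
```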

Contributions and suggestions welcome!