How to split a csv file by date, using Bash and Node

Introduction

I faced a situation recently in my work at Fluencia where I needed to split up a csv file. I had a report with about 60 rows, each from a different date. What I needed was 60 reports each with only one row. And I needed each file to be named after the date of the data it contained, in the format 20150708_iad_extracted.csv for a file containing data from July 8th, 2015.

I've been spending a lot of time over the last few months improving my bash skills, so this was a great tool for the job.

(This article requires a basic understand of Unix streams and command line bash programming, but don't be scared!)

csvkit

I found this python tool to be extremely useful for this task. Though it can be great to work with tabular data using a GUI like Excel or Numbers when you're a business user, I find that as a programmer I often want to interact with data trough the command line. Even more, I want to be able to read data and pipe it through different commands in a Unix-like fashion, like I would with other new-line-separated text. Installing csvkit is a breeze and gives you a ton of useful commands. The two I used most are:

- `csvgrep` — allows you to filter a data set to only those that match a value you specify in a column you specify
- `csvcut` — allows you to trim a data set to only specific columns you would like.

Splitting the file, in Bash

I like to build bash commands by starting with a small command and gradually appending to it, verifying the behavior (and my bash syntax) at each step.

First, I read the report into memory and log it to the screen.

cat all.csv

Then I pipe the contents of the report into csvcut. Adding the flag -c 1 tells the command that I would like only the results of the first column (which are 1 indexed, not 0 indexed).

cat all.csv | csvcut -c 1

After that I need to perform an action multiple times, since I need to create a new file for each row. The command xargs reads standard input (stdin) and performs a command once for each line. The normal use of xargs runs whatever command you pass it using that line as its stdin, but I find using the -I {} flag to often be more flexible. It allows you to take that new line and substitute it into the command you use. I usually use it in conjunction with bash -c which allows you to call a string as a command. That means that you can build up a second command with pipes in it and they won't interfere with any pipes you have on the outside.

To verify this works like I expect, I would run:

cat all.csv | csvcut -c 1 | xargs -I {} bash -c 'echo {}.csv'

Now we can get the 60 files names, but we still need to get the contents for each one. We can do that by reading the file again and using csvgrep to select the row that matches the date we already have. Using the -c 1 flag tells the command that we're matching a pattern in the first column and using the -m some_string flag tells it what we want that pattern to be. Note that if you are searching for a pattern that contains spaces, you can wrap that string in quotes, like -m "some string".

cat all.csv | csvcut -c 1 | xargs -I {} bash -c 'cat all.csv | csvgrep -c 1 -m "{}"'

Finally we just write the standard output (stdout) of that command into a file with the name we had earlier.

cat all.csv | csvcut -c 1 | xargs -I {} bash -c 'cat all.csv | csvgrep -c 1 -m "{}" > "{}".csv'

One thing to note is that this solution has a complexity of O(n^2), so it might not be the best for large files. I tried to make a similar solution with O(n), but ran into problems witht the fact that the stdout passed by xargs is not csv escaped. If you have an O(n) solution or would be interested in one, please let me know in the comments.

Renaming the files, in node.js

I chose to do this step in node because I was confident I could easily convert a string like "Tuesday, August 13th, 2015" to "20150813" using moment.js. The latter isn't very nice to read, but sometimes it's not up to us to choose the conventions expected by our systems.

I used two other libraries for this script: fs from node core and Bluebird, which is an awesome promise library. If you're not familiar with Bluebird, the biggest reason for using it is that it makes it very easy to convert non-promise code into promise code. For this I passed the whole fs library to Bluebird.promisify(), which created new promise-returning functions for all the library's callback-accepting functions (now with "Async" appended to their name).

Our convention at Fluencia is to work with promises by first defining a set of functions we are going to use, then calling them each by name in a promise chain, avoiding anonymous functions as much as possible.

First I define a function to read all files in a directory:

function readFiles(dir) {
    return fs.readdirAsync(dir);
}

Then a function to filter only files which end in .csv:

function onlyCsv(fileName) {
    return fileName.match('.csv');
}

Then the real action, a function that will read the date from the file name, then convert it to the new format, and rename the file. For more details on how moment.js does formatting, take a look at their docs here:

function rename(fileName) {
    var date = fileName.replace('.csv', '');
    date = moment(date, 'dddd, MMMM DD, YYYY');
    var newName = date.format('YYYY-MM-DD') + '_iad_extracted.csv';
    return fs.renameAsync('./' + fileName, './' + newName);
}

And finally I call all those functions in a chain:

readFiles('./')
    .filter(onlyCsv)
    .map(rename)

Conclusion

Bash commands are extremely powerful and flexible tools. They make the job of splitting one csv file into many fairly easy and repeatable. However for more complicated programming tasks, even just changing a date from one format to another, it can be much nicer to use javascript for it's many easy-to-use libraries.