How to split a csv file by date, using Bash and Node

Introduction

I faced a situation recently in my work at Fluencia where I needed to split up a csv file. I had a report with about 60 rows, each from a different date. What I needed was 60 reports each with only one row. And I needed each file to be named after the date of the data it contained, in the format 20150708_iad_extracted.csv for a file containing data from July 8th, 2015.

I've been spending a lot of time over the last few months improving my bash skills, so this was a great tool for the job.

(This article requires a basic understanding of Unix streams and command-line bash programming, but don't be scared!)

csvkit

I found this Python tool to be extremely useful for this task. Though it can be great to work with tabular data in a GUI like Excel or Numbers when you're a business user, as a programmer I often want to interact with data through the command line. Even more, I want to be able to read data and pipe it through different commands in a Unix-like fashion, just as I would with other newline-separated text. Installing csvkit is a breeze and gives you a ton of useful commands. The two I used most are:

- `csvgrep` — allows you to filter a data set to only the rows that match a value you specify in a column you specify
- `csvcut` — allows you to trim a data set to only the columns you want (both are shown on a quick example below)
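
For instance, assuming the report lives in a file called all.csv with the date in the first column (the file name here is just for illustration):

csvcut -c 1 all.csv

csvgrep -c 1 -m "some date" all.csv

The first prints only the date column; the second prints only the rows whose first column matches the string you pass to -m.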

Splitting the file, in Bash

I like to build bash commands by starting with a small command and gradually appending to it, verifying the behavior (and my bash syntax) at each step.

First, I read the report into memory and log it to the screen.

cat all.csv

Then I pipe the contents of the report into csvcut. Adding the -c 1 flag tells the command that I would like only the first column (columns are 1-indexed, not 0-indexed).

cat all.csv | csvcut -c 1

After that I need to perform an action multiple times, since I need to create a new file for each row. The xargs command reads standard input (stdin) and runs a command based on what it reads. By default it appends the lines it reads as arguments to the command you give it, but I often find the -I {} flag more flexible: it runs the command once per line and substitutes that line into the command wherever {} appears. I usually use it in conjunction with bash -c, which lets you pass a string to be run as a command. That means you can build up a second command containing pipes, and they won't interfere with any pipes you have on the outside.

To verify this works like I expect, I would run:

cat all.csv | csvcut -c 1 | xargs -I {} bash -c 'echo {}.csv'

Now we can get the 60 file names, but we still need the contents for each one. We can do that by reading the file again and using csvgrep to select the row that matches the date we already have. The -c 1 flag tells the command that we're matching a pattern in the first column, and the -m some_string flag tells it what we want that pattern to be. Note that if you are searching for a pattern that contains spaces, you can wrap the string in quotes, like -m "some string".

cat all.csv | csvcut -c 1 | xargs -I {} bash -c 'cat all.csv | csvgrep -c 1 -m "{}"'

Finally we just write the standard output (stdout) of that command into a file with the name we had earlier.

cat all.csv | csvcut -c 1 | xargs -I {} bash -c 'cat all.csv | csvgrep -c 1 -m "{}" > "{}".csv'

One thing to note is that this solution has a complexity of O(n^2), so it might not be the best for large files. I tried to make a similar solution with O(n) complexity, but ran into problems with the fact that the stdout passed along by xargs is not csv-escaped. If you have an O(n) solution or would be interested in one, please let me know in the comments.

Renaming the files, in node.js

I chose to do this step in node because I was confident I could easily convert a string like "Tuesday, August 13th, 2015" to "20150813" using moment.js. The latter isn't very nice to read, but sometimes it's not up to us to choose the conventions expected by our systems.

I used two other libraries for this script: fs from node core and Bluebird, which is an awesome promise library. If you're not familiar with Bluebird, the biggest reason for using it is that it makes it very easy to convert non-promise code into promise code. For this I passed the whole fs library to Bluebird.promisifyAll(), which created promise-returning versions of all the library's callback-accepting functions (with "Async" appended to their names).
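
That setup looks something like this (a minimal sketch; the exact requires in the original script may differ):

var Promise = require('bluebird');
var moment = require('moment');
// promisifyAll adds promise-returning *Async variants of fs's callback-style functions
var fs = Promise.promisifyAll(require('fs'));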

Our convention at Fluencia is to work with promises by first defining a set of functions we are going to use, then calling them each by name in a promise chain, avoiding anonymous functions as much as possible.

First I define a function to read all files in a directory:

function readFiles(dir) {
    return fs.readdirAsync(dir);
}

Then a function to filter only files which end in .csv:

function onlyCsv(fileName) {
    // keep only names that end in .csv
    return /\.csv$/.test(fileName);
}

Then the real action: a function that reads the date from the file name, converts it to the new format, and renames the file. For more details on how moment.js handles formatting, take a look at their docs.

function rename(fileName) {
    // file names look like "Tuesday, August 13th, 2015.csv"
    var date = fileName.replace('.csv', '');
    date = moment(date, 'dddd, MMMM Do, YYYY');
    var newName = date.format('YYYYMMDD') + '_iad_extracted.csv';
    return fs.renameAsync('./' + fileName, './' + newName);
}

And finally I call all those functions in a chain:

readFiles('./')
    .filter(onlyCsv)
    .map(rename)

Conclusion

Bash commands are extremely powerful and flexible tools. They make the job of splitting one csv file into many fairly easy and repeatable. However, for more complicated programming tasks, even just changing a date from one format to another, it can be much nicer to use JavaScript for its many easy-to-use libraries.

Ghost on Elastic Beanstalk

At Fluencia, we wanted to build a new SpanishDict blog that would empower our team to create engaging and exciting content, and to do so quickly and easily. Ghost is an amazing blogging platform, both because of its easy, intuitive interface and its built-in collaboration tools. It also allows us to take advantage of technologies our engineering team is already skilled with, including Node, Ember.js, and Handlebars templates.

When bringing our new blog application to production, we had several engineering priorities:

  • Easy deployment — Our team follows an Agile process with short development sprints and frequent feature deployment. We wanted an infrastructure that would facilitate this approach instead of hindering it.
  • Zero downtime — SpanishDict.com receives several million requests daily. The new blog would be integrated into the main landing page for the site, and visible to all of our traffic, so we needed an approach that would keep our content live even as changes were being rolled out.
  • Simple source control — Ghost is under active development and we wanted to continue to integrate new features as they were released. Likewise, we wanted to minimize the manual work needed for developers to get any of our three environments (development, staging, and production) up and running.

Given our experience with Amazon Web Services, we chose Elastic Beanstalk as the home for our new blog.

Starting point

We followed a very helpful guide to get Ghost deployed on Elastic Beanstalk. That guide is great for getting started; however, its final setup didn't leave us with a production-ready environment, for the following reasons:

  • Deploying a new application version had to be done manually from the EB web console.
  • Changes to a custom theme were not propagated to the EB application.
  • The app experienced ~15 seconds of downtime during deploys.

The content directory

We focused most of our efforts on Ghost's content/ directory. It contains four directories:

  • data/ — which we could ignore, since all our data was going to be stored on AWS RDS instead of using SQLite;
  • apps/ — which is going to be part of future development plans for Ghost, but we had no need for right now;
  • themes/ — which we wanted to update on each deploy;
  • images/ — which we wanted to persist across deploys and between instances.

Using AWS S3 to host our images was a perfect solution for this, and using S3FS we could point the application's images/ directory at our S3 bucket.

S3FS credentials

S3FS requires a file /etc/passwd-s3fs containing your AWS credentials. You can create files on an Elastic Beanstalk instance by adding a *.config file to an .ebextensions/ directory. However, since we wanted those config files in source control, we didn't want to include any credentials in them.

Elastic Beanstalk allows you to configure environment variables through the console, so that was a logical candidate for storing our credentials. However, the password file itself can't resolve environment variables, so our solution was instead to configure a container command that reads them and writes the password file.

Here's what that solution looks like:

# .ebextensions/97_s3fs-password.config
files:
    "/etc/passwd-s3fs":
        mode: "000640"
        owner: root
        group: root
        content: |
            # placeholder text
container_commands:
    01-create_s3fs_file:
        command: "echo ${AWS_CREDS} | tee /etc/passwd-s3fs"
        cwd: /

Mounting S3FS

After installing S3FS, we needed to mount the directory on the file system. Mounting the directory into content/images/ caused us some trouble since Elastic Beanstalk completely wipes your application directory on each deploy, and it throws an error if it tries to delete a folder connected to S3FS.

Our solution was to mount our bucket in a different location entirely, outside the application directory, then create a symlink to that location from content/images/. Doing so still allowed Ghost to reach all the images it needed, and removing the symlink on each deploy caused no problems.

We did that with the following two config files:

# .ebextensions/98_mount-s3fs.config
files:
    "/var/local/mount_s3_bucket.sh":
        mode: "000777"
        owner: root
        group: root
        content: |
            #!/usr/bin/env bash

            ## Mount S3FS, if it isn't already mounted
            if mount | grep s3fs > /dev/null;
            then :;
            else
                mkdir /var/local/images
                /usr/bin/s3fs $S3_BUCKET /var/local/images -o allow_other -o use_cache=/tmp -o nonempty
            fi

container_commands:
    02-mount-s3fs:
        command: sh /var/local/mount_s3_bucket.sh
        cwd: /

Note that when mounting S3FS, we specify which bucket to mount using an environment variable, allowing us to effortlessly use a different S3 bucket in production and staging.

For creating the symlink:

# .ebextensions/99_create-symlink.config
files:
    "/var/local/create_s3_symlink.sh":
        mode: "000755"
        owner: root
        group: root
        content: |
            #!/usr/bin/env bash

            # Delete any images you uploaded
            rm -rf /tmp/deployment/application/content/images
            # Create symlink
            ln -s /var/local/images /tmp/deployment/application/content/images

container_commands:
    03-create_s3_symlink:
        command: sh /var/local/create_s3_symlink.sh
        cwd: /

One thing to note here is that we do not make a symlink in the actual application directory /var/app/current/. At the point at which container commands execute, Elastic Beanstalk is building up your application in the /tmp/deployment/application/ directory. It then moves that whole directory to its final location. Any changes to the application directory would be overridden by what is in the temp directory.

100% uptime during deployment

With these changes in place, all we had to do was configure batched application deployments. We did this by going to Configuration > Updates and Deployments > Application Deployments in the Elastic Beanstalk console for our environment and setting "Batch type" to "Fixed" and "Batch Size" to "1". In the Autoscaling section below, we also needed to set a minimum instance count of at least 2, so that at least one instance is always serving traffic while another is being updated.
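
If you would rather keep these settings in source control as well, the equivalent configuration can be expressed in an .ebextensions config. Here's a sketch (we set ours through the console, so treat this as illustrative rather than the exact file we used):

# .ebextensions/96_rolling-deploys.config
option_settings:
    - namespace: aws:elasticbeanstalk:command
      option_name: BatchSizeType
      value: Fixed
    - namespace: aws:elasticbeanstalk:command
      option_name: BatchSize
      value: 1
    - namespace: aws:autoscaling:asg
      option_name: MinSize
      value: 2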

Conclusion

Now, using the Elastic Beanstalk CLI, deploying to each of our environments is a single command: $ eb deploy sd-blog-prod. It creates an archive from the current directory, uploads it to each instance of the environment, and uses it to build the application directory.

Note that if you want to exclude certain files from this process, you can do so with an .ebignore file.

With a little bit of configuration, our deploy process is now quick and painless. Hosting a Ghost blog on Elastic Beanstalk is a great option for a large scale production site.

Speed Up Table Rendering in Ember by Preventing Layout Thrashing

When implementing an infinite scroll table, we want users to experience smooth scrolling even as new table rows are rendered. Many things can cause a table to stutter; one that is often overlooked is layout thrashing. In this article, we will showcase an issue we encountered, and the solution we found, while developing PubNation.com with Ember.

The Table and the Issue

The table is simple: scroll to the bottom and it loads data and renders more rows. When we were implementing the design, we felt the scrolling speed was a little bit off. It wasn't very obvious, but we pulled up the Chrome Dev Tools Timeline panel and made a recording while scrolling.

We noticed that the frame rate was particularly low while loading data. Part of this was expected, since we are dumping 25 records onto the DOM at once, but it made us wonder which part was particularly slow. We zoomed in around the XHR calls, and a line of purple squares caught our eye.

It looked like we were forcing the browser to read from and write to the DOM repeatedly in a short period of time. After checking the source code on the left, I found this line in the table row component:

didInsertElement: function () {
    ...
    this.set('enableTooltip', this.element.scrollWidth > this.element.clientWidth);
}

So we were querying the dimensions of each row's element as the row was inserted into the DOM. I remembered seeing an article about layout thrashing: if we read a layout property immediately after writing to the DOM, the browser has to stop and recalculate the up-to-date value of that property, adding a lot more computation to an operation that's already very heavy.
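
Here's a minimal illustration of the pattern (not our actual code; rows and tableBody are placeholders):

// Thrashing: each read forces the browser to recalculate layout,
// because a write has invalidated it since the last read.
rows.forEach(function (row) {
    tableBody.appendChild(row);                            // write
    var needsTooltip = row.scrollWidth > row.clientWidth;  // read, forces layout
});

// Better: do all the writes first, then all the reads,
// so the browser only has to lay out once.
rows.forEach(function (row) {
    tableBody.appendChild(row);                            // writes only
});
var needsTooltips = rows.map(function (row) {
    return row.scrollWidth > row.clientWidth;              // reads only
});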

The Solution

When I looked at this enableTooltip property, I asked myself: do we need to know this right now? In this case the answer was no, so I turned it into an on-demand computed property:

enableTooltip: computed(function () {
    ...
    return this.element.scrollWidth > this.element.clientWidth;
})

This way, we don't have to query the DOM while it's busy inserting our data. I did a new recording of the timeline:

And the frame rate did go above 60FPS most of the time. The scrolling felt smoother. Looking at our XHR calls again:

It’s a lot cleaner this time. Cheers! (We might have run into a garbage collection problem along the way, but that’s another blog post :P) We successfully relieved the browser of repeatedly recomputing the layout, and users can enjoy a smoother table.

Conclusion

By working on this table scroll issue, we learned a few things:

  1. Modern browsers are not exempt from performance concerns. Sometimes simple operations can have a large impact on performance if not treated carefully.
  2. It’s not hard to solve performance issues. Debugging performance with Chrome Dev Tools is not scary at all.
  3. When manipulating the DOM, batching operations together is a fast and simple way to maintain performance.

Let us know if you have any tips or experiences to share for speeding up Ember!

Switching from private Github repos to npm private modules

Following software engineering best practice, the team strives to write code that is modular and self-contained, and thus reusable across multiple projects. We have tended to move away from writing monolithic applications, towards building services that do one thing really well and 'library' repositories with code shared between these services. As a result our work is spread across multiple repositories: one repo per service, and multiple versioned library repositories in use by those services.

One challenge with this approach has been installing packages from private repositories, which are only accessible to users with Github collaborator or owner access to those projects. Because npm did not have a private registry for us to publish to (short of buying npm Enterprise or some other private registry), we fell back on a makeshift approach:

  • Specify the package as a dependency from Github rather than the npm package registry.
  • Check in the package to source control so npm won't refetch it, and our deploy scripts and Travis CI don’t need our git credentials.
  • Use the git commit hash, not a tag or branch, to specify the package. Later versions of npm always refetch any packages specified by git tags or git branches, but not those specified by git commits.

This has caused headaches for the team, because for clarity we much prefer to use tags to specify versions, and we've also been forced to pin our npm version because git support has changed in recent releases.

So we were very excited when npm finally released support for private modules. Now we can specify the exact package version we want to use in package.json instead of using the git hash, so:

"@spanishdict/lib": "1.0.0"

instead of

"lib": "git+ssh://github.com/spanishdict/lib#longmeaninglesshash"

Overall the process to switch from this approach to using private modules was simple, but some of the details weren't so obvious. Here's how we did it and some things we learned along the way.

1. Sign up for private modules

The first step is to sign up for npm private modules. If you're working under a Github organization, npm recommends creating a separate npm user for your organization, which can later be upgraded to an organization account when support for that is ready.

We created a spanishdict npm user for this purpose.

2. Upgrade npm

npm private modules depend on scoped packages, which were introduced during npm 2, so any earlier version will result in a 404 when npm can't find your scoped private modules. I recommend upgrading to the latest npm if you can, which is 2.9.0 at the time of writing.

$ npm install -g npm@2.9.0

In fact, we ran into an issue with npm 1.4.x, which interprets scoped packages as github shorthand. This means that if your scoped package name is “@spanishdict/pkg” and it lives in a github repo called github.com/spanishdict/pkg, then someone with npm 1.4.x who has access to the github repo will successfully install it from github. This caused us some confusion, so we reached out to npm about it; they are considering a patch release of 1.4.x to prevent this behavior.

3. Publish a private module

Log in to npm on the command line using npm login or npm adduser, with the account that you used to sign up for private modules. npm whoami is useful for verifying that you've switched users correctly.

Navigate to the root of the repository you plan to publish to the npm private registry. Update the name field in the project's package.json by prefixing it with the name of your private modules account. For example, we updated the name field to:

"name": "@spanishdict/module-name",

If you have specified "private": true, remove that line before publishing, as npm will give a warning about it. Scoped modules are published privately by default, but you can explicitly specify the access level when you publish to the npm registry:

$ npm publish --access restricted

Navigate to e.g. https://www.npmjs.com/~spanishdict to see the newly published private module.

4. Install your new private module

Navigate to the repository that will start consuming the new private module you just published. We had to remove the old checked-in module from source control; you won't have to do this if you're using this module for the first time.

$ npm uninstall module-name --save

But git caches committed files, so in addition to removing module-name from .gitignore, you have to delete the files from git's cache:

$ git rm --cached -r node_modules/module-name

Now install your module from the npm private registry. You must be logged in as a user with access to the private module (either as the owner that published the package, or as a collaborator that's been added to the package).

$ npm install @spanishdict/module-name --save

You'll see the new line in package.json with slightly different syntax:

"@spanishdict/module-name": "1.0.0",

5. Use your new private module

Now you can use the new module in code. Note that the require syntax changes slightly from require('module-name') to require('@spanishdict/module-name').

And that's it! Make a pull request to switch this repo to using private modules and feel free to link to this article so your reviewer can follow along.

Why are we checking these modules in to source control?

There's a lot of discussion about best practice surrounding checking in node_modules, but for us it comes down to how we handle deployments:

  • We don't need to put access credentials on external machines because the install step doesn't need to fetch private repositories.
  • It removes npm or any other registry as a point of failure and a dependency for deploying code. For instance, if npm goes down for whatever reason, we can still deploy our code to users. This has been less of a concern lately since npm has become much more stable, but we try to cut down on external dependencies wherever possible.

That last point is the deciding factor for us, so for now we're sticking to checking in node_modules.

What do you think about npm private modules? Do you have a better way to handle private dependencies? Let us know in the comments below!