Elm => javascript

Introduction

Elm is a fantastic programming language. Through functional programming and the
Elm architecture, it facilitates a style of writing web applications that allows
for code that is modular, reliable, and easy to refactor.

If you are starting a new project that requires a fast interactive front-end and
you are willing to learn a new language and ecosystem, I would strongly
encourage you to choose Elm.

But if you don't have that flexibility, say you want to stay inside the
javascript world and borrow some of the fantastic ideas behind Elm, this blog
post is an attempt to show you how.

In this blog post, we'll build something similar to fantasy football, where
there is a list of all available players and you can draft them onto your
fantasy team, but for the US Congress and Senate. The application we'll
be building is available here and the source code is available on GitHub.

The most important tool to achieve the architecture we are going for is
functional programming. If you're unfamiliar with FP, I would point you towards
this book as a wonderful introduction.

We will use a library called Ramda, which I find to be the swiss army knife of
functional javascript. It builds on the ideas of utility libraries like Lodash
by exposing methods that are "functional first".
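
To give a flavor of what "functional first" means: Ramda's functions take the
data last and are automatically curried, so you can build new functions by
partial application. A quick sketch (not from the original application):

var R = require('ramda');

// R.map is curried, so supplying only the function returns a new function
// that is still waiting for its data.
var double = R.map(function(x) { return x * 2; });

double([1, 2, 3]); // => [2, 4, 6]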

Wiring the application

The essential wiring we'll use to emulate the Elm architecture comes from a
module called main-loop.

Main loop gives us a mechanism for taking the state of our application,
represented as a javascript object, and running it through a render function
that turns that state into a virtual dom node. We can then take the result of
that operation and append it to the page, which puts the visual representation
of the state on the page.

var mainLoop = require('main-loop');
var vdom = require('virtual-dom'); // provides the create/diff/patch functions main-loop expects

var render = function(state) {
  // Coming up soon!
}

var initialState = {
  title: 'test'
};

var loop = mainLoop(initialState, render, vdom);

document.querySelector('#content').appendChild(loop.target);

Where the magic happens is that we can call the update method on loop at any
point and pass it a new state. When we do that, the virtual-dom will take care
of figuring out which minimal updates need to be applied to the dom to produce
the new representation we need.

loop.update({
  title: 'new title'
});

Hyperscript Helpers

One of my favorite features of Elm is the ability to represent html in normal
everyday functions. That allows you to run any normal operations (like map) to
calculate your views. It also allows you to unit test your views just like any
other function.

titleView state =
  h1 [] [ text state.title ]

Luckily there's a JS library called
hyperscript-helpers that brings
that functionality!
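
Before the helpers can be used, hyperscript-helpers needs to be handed a
hyperscript implementation. A minimal setup sketch, assuming virtual-dom's h
as that implementation:

var h = require('virtual-dom/h');
var hh = require('hyperscript-helpers')(h);

// Pull out the tag helpers we want to use as plain functions.
var div = hh.div;
var h1 = hh.h1;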

var titleView = function(state) {
  return h1('.title', state.title);
};

Rendering arbitrary lists

Initial State

Our initial state will have two properties, both arrays of objects,
availableLegislators and selectedLegislators.

var initialState =  {
  selectedLegislators: [{
    firstName: 'Juan',
    lastName: 'Caicedo'
  }, {
    firstName: 'Carson',
    lastName: 'Banov'
  }],
  availableLegislators: [{
    firstName: 'Senator',
    lastName: 'One'
  }, {
    firstName: 'Congresswoman',
    lastName: 'Two'
  }]
};

This is the initial state that we will pass to the mainLoop function, along
with the virtual-dom module and the rendering function we will be describing
next.

Render functions

Let's start out with the main render function. This function only needs to
create a main div and then it can delegate the rendering of each of the two
arrays in the state.

var render = function (state) {
  return div('.container', [
    legislatorTableView('Your Team', state.selectedLegislators),
    legislatorTableView('Available', state.availableLegislators)
  ]);
};

Both of the children of this main div will be structurally identical; the only
differences will be the title rendered above the table and the actual contents
of the table. We can encapsulate these similarities by defining a new function,
legislatorTableView, and passing the differences in as parameters.

var legislatorTableView = function (title, legislators) {
  return div('.col-xs-6', [
    h1(title),
    table('.table.table-striped', [
      tbody(
        R.map(legislatorView, legislators)
      )
    ])
  ]);
};

The legislatorTableView renders another div (adding some bootstrap styling
to make it take up half of the screen), renders the title we passed in,
initiates another table, and delegates the rendering of each row to another
function legislatorView.

var legislatorView = function (legislator) {
  return tr('.container-fluid', [
    td('.col-xs-6', legislator.firstName),
    td('.col-xs-6', legislator.lastName)
  ]);
};

This function legislatorView is the lowest level of our rendering, so it
just takes care of turning a single object into a row of a table with two cells.

With all those view functions, we now have a way of rendering our initial state
into a page with two tables side by side, each with two rows.

Dealing with updates

We would now like to add functionality so that if you click on a legislator, it
will move that legislator from one table to the other. hyperscript-helpers
exposes a simple way to wire up onclick handlers, but that means that we need
to be able to trigger an update from inside the rendering code.

main-loop gives us the .update function, but we need to first instantiate
the loop before we can call it. Since we have to pass the render function in
order to instantiate the loop, there's no way the render function can call
loop.update directly.

Actions

In Elm, updating is done through an indirect, but wonderfully elegant mechanism.
Views have access to an "address" to which they can pass an "action". An action
is just a structure that has a type and some data associated with it. These
actions all find their way eventually to an update function that knows how to
calculate a new state given the old state and an action.

We can emulate this mechanism by first creating an action function to
instantiate this data structure.

var action = function(type, data) {
  return {
    type: type,
    data: data
  };
};

Update events

Next we will need to wire up some unavoidable stateful code. This code lives in
the "unsafe" section of our application, alongside things like instantiating the
main loop.

We instantiate a new event emitter. Here we're doing it with Node's built-in
EventEmitter, since we'll be using browserify to build a client-side bundle,
but the same could be done with the DOM events or with jQuery's events.

var EventEmitter = require('events').EventEmitter;
var emitter = new EventEmitter();

The address function here just takes care of triggering an event of type
'update' and sending the action it gets along as data with the event. It's
worth noting that the addresses used in the Elm architecture are part of a more
complex (and therefore much more robust) update wiring, but for these purposes,
a simple address illustrates the idea.

function address(action) {
  emitter.emit('update', action);
};

Then we register an event listener for any update event. It will call update,
passing it the current state of loop and the action, to calculate a new state.

emitter.on('update', function(action) {
  var newState = update(loop.state, action);
  loop.update(newState);
});

Changing state

update simply needs to be a function that checks the type of the action and
uses its data to calculate a new state. This is not very complicated at all:
just appending to one list and removing from the other, depending on the
direction. Though it would be possible to return only these two properties as a
new state, I prefer to use R.merge, which mirrors how Elm "modifies" records.
It creates a new object from the first object and the properties of the second
object.
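
For example, R.merge leaves both of its arguments untouched and returns a brand
new object:

R.merge({ title: 'test', count: 1 }, { title: 'new title' });
// => { title: 'new title', count: 1 }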

var update = function (state, action) {
  var newSelected;
  var newAvailable;

  // fallback case
  var newState = state;

  if (action.type === 'Drop') {
    newSelected = R.reject(R.equals(action.data), state.selectedLegislators);
    newAvailable = R.append(action.data, state.availableLegislators);

    newState = R.merge(state, {
      selectedLegislators: newSelected,
      availableLegislators: newAvailable
    });
  } else if (action.type === 'Select') {
    newSelected = R.append(action.data, state.selectedLegislators);
    newAvailable = R.reject(R.equals(action.data), state.availableLegislators);

    newState = R.merge(state, {
      selectedLegislators: newSelected,
      availableLegislators: newAvailable
    });
  }
  return newState;
};

One great feature of this update mechanism is that it does not ever reference
any state outside the function definition and it is therefore very easy to test!
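
For instance, here is a minimal sketch of such a test, using Node's built-in
assert and assuming update and action are exported so the test can require
them:

var assert = require('assert');

var legislator = { firstName: 'Senator', lastName: 'One' };
var state = {
  selectedLegislators: [],
  availableLegislators: [legislator]
};

var newState = update(state, action('Select', legislator));

assert.equal(newState.selectedLegislators.length, 1);
assert.equal(newState.availableLegislators.length, 0);
// The original state object was never mutated.
assert.equal(state.availableLegislators.length, 1);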

Triggering updates from views

We can change the definition of render to now also take address as a
parameter. We will also curry the whole function so that we can specify
address without needing to specify state (as that will be supplied when
main-loop calls the function). We will then pass the address function, as well
as the type of the action to send whenever a user clicks on a row, down to
legislatorTableView.

var render = R.curry(function (address, state) {
  return div('.container', [
    legislatorTableView(address, 'Drop', 'Your Team', state.selectedLegislators),
    legislatorTableView(address, 'Select', 'Available', state.availableLegislators)
  ]);
});

This also means that we should change how we instantiate the main loop since we
changed the definition of render.

var loop = mainLoop(initialState, render(address), vdom);

The definition of legislatorTableView will also change, but not by much. Now
we will just add the two new arguments, and we will pass them to
legislatorView. Note that we will also be currying legislatorView, so we can
call it with address and type to partially apply those arguments.

var legislatorTableView = function (address, type, title, legislators) {
  return div('.col-xs-6', [
    h1(title),
    table('.table.table-striped', [
      tbody(
        R.map(legislatorView(address, type), legislators)
      )
    ])
  ]);
};

And now down at the legislatorView level, we will actually wire up a call to
address. hyperscript-helpers allows us to pass an object to any element we are
instantiating and specify attributes that element should have. We can specify an
onclick handler that, when called, simply calls address, passing it an
action built from our type and the current legislator. Then address will trigger
an update event and our update function!
an update event and our update function!

var legislatorView = R.curry(function (address, type, legislator) {
  return tr('.container-fluid', {
    onclick: function(ev) {
      address(action(type, legislator));
    }
  }, [
    td('.col-xs-6', legislator.firstName),
    td('.col-xs-6', legislator.lastName)
  ]);
});

Requesting data asynchronously

Tasks

Another fantastic feature of Elm is how it represents asynchronous actions as
data. It does that through a data structure known as a Task. Tasks are somewhat
similar to promises, except that they are completely stateless, and in order to
execute a task you have to specify what to do if it succeeds and what to do if
it fails.

This design is extremely useful because it allows you to express the logic of
your stateful asynchronous actions in stateless functions. This lets you
isolate the effects of those operations, keeping the majority of your code
easy to test and refactor.

// Task here comes from a library such as data.task; any implementation with
// this constructor/fork shape will do.
var getJSON = function (url, params) {
  return new Task(function(reject, result) {
    $.getJSON(url, params, result).fail(reject);
  });
};

You can then save the task and execute it in a controlled environment when you
are ready to deal with its results.

var myTask = getJSON('url', {apikey: 'key'});

myTask.fork(
  function(err) {
    // specify what to do if the task fails
  },
  function(data) {
    // specify what to do if the task succeeds
  });

Nested actions

To wire this asynchronous data into our app, let's first enable our update code
and render code to deal with nested actions. This will allow us to have multiple
top-level actions, each divided into subactions.

First let's change legislatorView to send a Toggle action with two
subactions, Drop and Select.

var legislatorView = R.curry(function (address, choice, legislator) {
  return tr('.container-fluid', {
    onclick: function(ev) {
      address(action('Toggle', action(choice, legislator)));
    }
  }, [
    td('.col-xs-6', legislator.firstName),
    td('.col-xs-6', legislator.lastName)
  ]);
});

And then let's change update to handle these nested actions.

var update = function (state, action) {
  var newSelected;
  var newAvailable;

  // fallback case
  var newState = state;

  if (action.type === 'Toggle') {
    if (action.data.type === 'Drop') {
      newSelected = R.reject(R.equals(action.data.data), state.selectedLegislators);
      newAvailable = R.append(action.data.data, state.availableLegislators);

      newState = R.merge(state, {
        selectedLegislators: newSelected,
        availableLegislators: newAvailable
      });
    } else if (action.data.type === 'Select') {
      newSelected = R.append(action.data.data, state.selectedLegislators);
      newAvailable = R.reject(R.equals(action.data.data), state.availableLegislators);

      newState = R.merge(state, {
        selectedLegislators: newSelected,
        availableLegislators: newAvailable
      });
    }
  }
  return newState;
};

Task-based JSON request

We discussed a task version of getJSON earlier, so now let's go ahead and make
a function for calling it.

var fetchLegislators = function(address, jsonTask) {
  jsonTask.fork(
    function(error) {
      address(
        action(
          'PopulateAvailableLegislators',
          action(
            'Error',
            error
          )
        )
      );
    },
    function(response) {
      address(
        action(
          'PopulateAvailableLegislators',
          action(
            'Success',
            R.map(decodeLegislator, response.results)
          )
        )
      );
    }
  );
};

fetchLegislators is not really a stateless function: it has side-effects
because it is executing an asynchronous task and it doesn't have a return value,
but writing it will still allow us to keep our code more organized.

Note that although the function might seem complicated (25 lines of code!), it's
really just saying "execute a json task, then send an action if it's successful
and a different one if it fails". We pass the results of the response through
a decoder function just because we would like the names of the properties to
match those of our existing legislators (which is not what the api returns).

var decodeLegislator = function (legislator) {
  return {
    firstName: legislator.first_name,
    lastName: legislator.last_name
  };
};

Update based off ajax response

Let's now add another section to our update function to handle the action
sent by the JSON request.

var update = function (state, action) {
  var newSelected;
  var newAvailable;

  // fallback case
  var newState = state;

  if (action.type === 'Toggle') {
    /* ... */
  } else if (action.type === 'PopulateAvailableLegislators') {
    if (action.data.type === 'Success') {
      newState = R.merge(state, {
        availableLegislators: action.data.data
      });
    } else if (action.data.type === 'Error') {
      console.log('Error', action.data.data);
    }
  }
  return newState;
};

Note that we are explicitly choosing not to do anything with the error, but it
flows through the application just like other actions and we could instead
attach it to the state and use it to render an error message for the user.
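
For illustration, a sketch of what that alternative could look like
(errorMessage is a hypothetical property, a view function would still need to
render it, and the Toggle branch is omitted for brevity):

var update = function (state, action) {
  var newState = state;

  if (action.type === 'PopulateAvailableLegislators') {
    if (action.data.type === 'Success') {
      newState = R.merge(state, { availableLegislators: action.data.data });
    } else if (action.data.type === 'Error') {
      // Attach the error to the state so a view can display it.
      newState = R.merge(state, { errorMessage: 'Could not load legislators' });
    }
  }
  return newState;
};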

Triggering the JSON request

Now we can add a few more lines to the stateful section of our application to
trigger the request.

var dataUrl = ''; // url to request data from
var dataParams = {}; // api key and other query params
var jsonTask = getJSON(dataUrl, dataParams);
fetchLegislators(address, jsonTask);

We instantiate a task to make the json request and then we execute it by passing
it to fetchLegislators.

Conclusion

Elm is a fantastic programming language, and the Elm architecture is a fantastic
way to structure web applications. If it's not possible for you to start using
Elm for your application, you can still get some of the benefits by copying some
of its ideas into your javascript code.

If you like the ideas in this blog post, you should consider looking at
olmo which is a very thorough look at
porting the Elm architecture to javascript. And if you would like to build a
full web application in this style, I would encourage you to look at
Redux, which is a Flux implementation that borrows
heavily from the Elm architecture.

Better JSON through streams

Web applications that handle large amounts of data are not easy. Receiving
large amounts of data is slow, cumbersome, and prone to failure, which makes
for a challenging user experience. Sending large amounts of data is also
difficult, particularly when you start introducing complexities into how that
data needs to be processed before it can be sent.

In this blog post, we're going to look at some of those challenges and model
them with a simple web application.

Difficulties

Slow and coupled server responses

Assembling a big HTML page server-side and sending it to the user is slow. Worse
than the actual speed of it is the perception of speed. If a user requests a
page and then has to sit around waiting before they see anything on the screen,
they get unhappy.

Speeding up the initial server response should always be a performance goal,
but it becomes difficult to achieve when you think about the fact that our
response is dictated by the slowest operation we need to perform. If assembling
our large chunk of data involves making a request to another service, our
response is going to be at least as slow as our request/response cycle with
that service.

Big AJAX requests are slow and brittle

This problem of slow server responses is in part why AJAX is such a useful
tool. Instead of needing to wait for a large amount of data to be compiled
before responding, we can respond with a minimal page very quickly, and
then make a request for the data we will need. Once we have that data, we can
render it to the user.

However, moving the data operation to the front-end doesn't necessarily make it
any faster. In fact, it will likely make it slower, since most clients are
less powerful than most servers, and they are likely to be on lower quality,
slower connections. The user will still have to wait for all of the data to be
ready before getting to see it, and any connection problems will lead to all of
the data being unavailable.

Multiple AJAX requests increase complexity

A better option could be to break up that large AJAX request into many smaller
requests, then, as each one of those responses finishes, plot its results on
the page. This incremental approach generally leads to a much better user
experience because users can start seeing and interacting with data as soon as
it's available instead of having to wait for all of it. It is also much more
robust to lost connections, as each request can fail individually without
affecting data which has already been successfully loaded.

Handling many more requests is no simple feat though. Our client-side logic has
to become more complex, because it now needs to know how much data to request.
The server-side logic will also need to be more complex because it will need a
mechanism to tell which data the client has and hasn't received.

What are we going to build?

We're going to model these difficulties through a web app that renders an
adorable 8-bit pixel art cat with some sun shining down on it.

The image is represented in JSON. It is an array of objects that include three
properties, an X position, a Y position, and a color.

Our page will have one big div (representing a grid) broken up into many other
divs (representing rows) broken up into many other divs (representing cells).
For each point, we'll want the cell that corresponds to its X and Y position to
have a CSS class to style it with a background color.

Constraints

Multiple data sources

The points for the cat are going to be stored in a JSON file on our server. The
points for the sun are going to be stored in a JSON file that we will fetch
over HTTP from GitHub.

Huge response

Our JSON is going to be impractically big. Unnecessarily big. In fact, each
point in the cat JSON is going to contain 100 unneeded properties just to make
the file pretty big, about 500k. If that doesn't seem unreasonable, don't worry,
the sun points are going to have 600 unneeded properties, making that JSON about
1mb.

Bad connections

We want people to be able to use this app even if they have a bad connection.

Our solution

We are going to build a client-side that makes a single request to the server.
It will have a single event handler that tells it how to handle a single point
when it receives one.

Then we're going to build a server-side that streams a JSON response to the client,
making it so that we don't have to wait for all the data to be ready before we
start responding.

And we're going to get the data for that response, all the cat and sun points
we need to send, as a single data stream. We'll obtain that by getting a
separate stream for each of those two sources and then merging them into one.
This will abstract away the fact that one of them is fast (read from disk) and
one of them is slow (over the network).

Tools

Oboe.js, event-driven JSON parsing

This is a tool that allows you to parse JSON and use parts of it before the
whole document has been parsed.

var source;  // a url or a readable stream
var pattern; // a string representing a node in the tree

oboe(source)
  .node(pattern, function(data) {
    // you can use the data however you want
  });

You register callbacks which will fire when certain nodes of the tree have been
parsed. For example, you can register a callback that will fire for each
object in an array, and when it fires you can use that object.

Highland.js, a utility belt for streaming

I've always found initiating my own streams to be quite a hassle, and have been
confused about how to work with them once I have one. Highland is a library that
makes it easy to initiate, manage, and manipulate streams.

var data = ['one', 'two', 'three'];
highland(data)
    .pipe(anotherStream)

We'll introduce more highland utility functions as we need them.

Some Node.js usual suspects

Some other tools we'll use which are fairly commonplace:

  • Node.js, v4.2.6
  • Express.js
  • request

Receiving a stream response

We will start at the client side and work backwards. Our response is going to be
one big JSON document, but we don't want to wait for all of it to be ready
before we start using it.

We give Oboe a url to make a request to. Oboe will make that request and begin
parsing the response.

oboe('http://localhost:3000/data')

We then register two listeners on that response. The first
will look for any object inside of that JSON response that has three properties,
an x position, a y position, and a color. When it finds an object like that
(in this case it will only find it inside the pixels array, but it could be
anywhere in the response), it will run a helper function to fetch the DOM
element corresponding to those coordinates in our grid and then it will add
a CSS class to it to style it with that color.

.node('{x y color}', function(point) {
  var grid = document.querySelector('.grid');
  var cell = getCell(grid, point.x, point.y);
  cell.classList.add(point.color);
})
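
The getCell helper is just DOM lookup. A minimal sketch of one possible
implementation, assuming the grid is a div containing one div per row, each
containing one div per cell, in order:

function getCell(grid, x, y) {
  var row = grid.children[y];  // the y-th row div inside the grid
  return row.children[x];      // the x-th cell div inside that row
}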

The second listener will fire once the whole response has been parsed, and it
will simply give the user some visual feedback that all the data has been
loaded.

.done(function(){
  var element = document.querySelector('#status-message');
  element.textContent = 'All data Loaded!';
})

Sending a streamed response

The server-side application will have a route to serve data to the client. We
will need a template for our response, which can have whatever metadata or
other non-stream properties we might want to send. The important thing is that
when we get to the pixels array, there's a placeholder string that we will be
able to identify even after the object has been JSON stringified.

router.get('/data', function(req, res) {
  var response = {
    exif: {
      software: 'http://make8bitart.com',
      dateTime: '2015-11-07T15:35:13.415Z',
      dateTimeOriginal: '2015-11-07T00:24:05.776Z'
    },
    pixif: {
      pixels: ["#{pixels}"]
    },
    end: 'test'
  };

The res object that express gives us is a writable stream, which means that you
can pipe a readable stream directly to it and the data will be sent to the
client. However, it should be fed a stream of strings, so we need to convert
our response to that format. We can do so by stringifying our template.

var json = JSON.stringify(response);

After stringifying, we will split the response so that we have the section
before the #{pixels} placeholder and after it. This will be useful later when
adding data to our response.

var parts = json.split('"#{pixels}"');

Now we can make a new highland stream out of the response before the placeholder
and after it.

highland([
  parts[0], // before placeholder
  parts[1]  // after placeholder
])

We will call invoke on that stream to tell each element of the stream to call
its split method with the argument ''. This means that all elements that flow
through our stream will get split up into a stream of characters.

.invoke('split', [''])

Then we will tell highland that we want it to read each of its elements
sequentially. That is, it will wait for all of the "before" stream to be read
before it starts processing any of the "after" stream. This is extremely
important so that our valid JSON ends up as still-valid JSON; otherwise we'd
get a stream of seemingly random characters.

.sequence()

Now we can send that stream through to the response!

.pipe(res);

Adding data to the response

Now that the client is receiving a streamed response, we can proceed to add
some data to that response. For now let's just add some static data to make
sure this works as expected.

var points = [{
  x: 1,
  y: 2,
  color: 'orange'
}, {
  x: 2,
  y: 2,
  color: 'orange'
}];

We can make a new highland stream from this array.

var pointStream = highland(points)

We will want these to be strings as well, so we can tell our stream to map
each element through JSON.stringify.

.map(point => JSON.stringify(point))

This new stream will be fit into where our "#{pixels}" placeholder used to
be, as the elements inside of an array. Since we want to end up with valid
JSON, we will need the elements of this stream to be comma separated.
We can achieve that by telling our stream to put a comma in between all the
elements that flow through it.

.intersperse(',');

Now we can fit this new stream in between our "before" and "after" streams, and
everything will proceed just like we would expect.

highland([
    parts[0],
    pointStream,
    parts[1]
  ])
  .invoke('split', [''])
  .sequence()
  .pipe(res);

Getting data from another module

We can reduce the complexity of our route by moving all the code for getting
our point stream to another module.

var points = require('../data/points');

/* inside the /data route */
var pointStream = points.getStream()

This means that no matter what that module does, as long as it returns a stream
of objects, the rest of our response assembly will still work the same.
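
Putting the pieces from the previous sections together, the route now looks
roughly like this (the exif metadata is trimmed here for brevity):

var points = require('../data/points');

router.get('/data', function(req, res) {
  var response = {
    pixif: {
      pixels: ['#{pixels}']
    }
  };
  var parts = JSON.stringify(response).split('"#{pixels}"');

  var pointStream = points.getStream()
    .map(point => JSON.stringify(point))
    .intersperse(',');

  highland([
      parts[0],
      pointStream,
      parts[1]
    ])
    .invoke('split', [''])
    .sequence()
    .pipe(res);
});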

Reading cat points from a file

Let's now work towards using real data sources as opposed to hard-coded objects.
First we will read just the cat points from disk and send them back out of the
points module.

First, we will read the data in cat-points.json into memory as a stream.

function getDataStream() {
  var catPath = path.resolve(__dirname, './cat-points.json');
  var catSource = fs.createReadStream(catPath);

However, the contents of the file will come into our application as a stream of
strings, and ultimately we need a stream of objects. Let's abstract that
conversion into another function called getPointStream.

  var catStream = getPointStream(catSource);

  return catStream;
}

To do this we will leverage Oboe again. Doing so will allow us to read the file
and act on each element of its pixel array without waiting for the full file
to be read. We will also need to use a new technique for creating a highland
stream. We will call highland and pass it a function taking one argument,
push. Inside this function, we call push, passing it an error or null
and a piece of data, as many times as we want, and eventually pass it the value
highland.nil. Each push will push that data through our stream, and nil
will signal that the stream is over.

function getPointStream(sourceStream) {
  return highland(function(push) {
    oboe(sourceStream)
      .node('{x y color}', function(point) {
        push(null, point);
      })
      .done(function() {
        push(null, highland.nil);
      });
  });
}

Reading sun points over the network

The final piece of the puzzle is to add the sun points to our response. Luckily,
the work we've done to abstract our data sources will make this pretty
straightforward.

In getDataStream, we will add another readable point stream like our cat
stream, but with a different origin. We will use the request library to
request another file from GitHub. This will be a much slower operation both to
start and to complete than reading a file from disk.

Luckily, request returns a readable stream, so we can pass it to
getPointStream, and Oboe will handle it the same way it handles the result of
fs.createReadStream.

function getDataStream() {
  /* get cat stream */

  var sunUrl = 'https://raw.githubusercontent.com/JuanCaicedo/better-json-through-streams/master/data/sun-points.json';
  var sunSource = request(sunUrl);
  var sunStream = getPointStream(sunSource);

Now we have two streams, catStream and sunStream, that we would like to
send back out to another module. We can call highland with both of these
streams and then call merge to end up with a single stream we can return.
This stream will emit an event whenever any of the streams it contains emits
one. Unlike sequence, it will pass along each element as soon as it is ready,
mixing them together. This means that cat points will get passed along as soon
as possible and will not be held up by however long the sun points take.

return highland([
    catStream,
    sunStream
  ]).merge();
}

Conclusion

Designing an application with streaming as the main strategy for data transfer
is extremely powerful. It allows us to create a client-side focusing on how to
handle a single unit of data without worrying about the full response. It
allows us to create a server-side that begins sending data to the client as
quickly as possible, without waiting to have all the data ready up front. And
finally, it allows us to create modules that abstract away the source of our
data, allowing us to focus on its similarities as opposed to its differences.

Adding Cache-Control with S3FS

Caching is a big deal at SpanishDict. We cache our rendered jade views, all of our database lookups, all of our client-side code, and every single image that we show to our users. We serve all of these assets through CDNs to further improve performance on the site. Then we encourage browsers to cache these assets for as long as possible.

Except, that is, for the SpanishDict blog. Well, until this week!

The SpanishDict blog

Our blog is very important at SpanishDict. We have a fantastic team that develops original content for it. It's featured on our homepage. It's the subject of a previous blog post, where we describe the architecture we used to reach zero-downtime deployments.

Here's the short version:

  • Lives on Elastic Beanstalk
  • Stores posts in an RDS instance
  • Thinks it saves images locally on the file system
  • Which is actually an S3 bucket, mounted with s3fs

Following a microservice architecture, the blog is a completely different world from the rest of SpanishDict. When the homepage wants to feature four articles, it makes a request to the blog's RSS feed (the result of which is cached, of course), which tells it what post titles to render and what images to fetch to display along with them.

Cache-Control

When you request images from a Ghost instance, it nicely sets a header on all of them, which tells the browser to cache those images for as long as possible. However, to lessen the load on the blog and increase speed, we wanted the homepage to instead request the images from Cloudfront, an AWS offering which acts as a CDN on top of S3.

If you request an asset from Cloudfront, the response it sends back will only have the headers that are specified on that asset in S3. That meant that to get our Cache-Control header, it had to be sent along with the file when we first uploaded it to S3. Hopefully s3fs has a way to do that...

ahbe_conf

It does! When you mount a directory with s3fs, you can pass it a flag like -o ahbe_conf=file.conf. In this file you can configure all additional headers you would like to send with your uploads. More details on how to do that here.

In Elastic Beanstalk terms, this meant adding the following to our YAML files in .ebextensions/:

"/home/ec2-user/s3fs-fuse-1.77/caching_ahbe.conf":
    owner: root
    group: root
    content: |
        # Send custom headers to s3 for caching these files as long as possible
        Cache-Control public, max-age=31536000

And changing our mounting command to:

/usr/bin/s3fs $S3_BUCKET /var/local/images -o allow_other -o use_cache=/tmp -o nonempty -o ahbe_conf=/home/ec2-user/s3fs-fuse-1.77/caching_ahbe.conf

Now all our new images will have a Cache-Control header when served through Cloudfront!

What about our old images?

You got me: old images will still have the same headers you originally uploaded them with. Luckily, updating them is easy, using s3cmd.
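
I won't claim this is the exact invocation we used (treat the flags as an assumption and check the docs for your s3cmd version), but the command looks something like:

s3cmd modify --recursive \
  --add-header="Cache-Control:public, max-age=31536000" \
  s3://your-bucket/path/to/images/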

Building elasticsearch queries and the case for abstraction

This post discusses why writing elasticsearch queries can be hard, and why we
wrote bodybuilder to abstract away this concern.

Elasticsearch is a versatile tool that enables very powerful searches and
aggregations over large datasets. It lets us ask questions like:

  • How many users complained about ads on SpanishDict in the past 30 days?
  • Which complaints were categorized as inappropriate, with the user's message
    including the phrase “victoria’s secret”?
  • How many resources does a particular ad load, and how does this compare to historical data?

But with great power can come great complexity, and we have found that writing
queries like these in code can quickly become unwieldy. Take this filtered query
as an example:

// Return documents from the last 24 hours, from projects '12' or '24', where the advertiser was 'adx'.
{
  filter: {
    bool: {
      must: [
        {
          range: {
            timestamp: {
              gt: 'now-24h'
            }
          }
        },
        {
          nested: {
            path: 'ad_impression_ids',
            filter: {
              term: {
                'ad_impression_ids.advertiser_name': 'adx'
              }
            }
          }
        }
      ],
      should: [
        {
          term: {
            project_id: '12'
          }
        },
        {
          term: {
            project_id: '24'
          }
        }
      ]
    }
  }
};

If this were a one-off query I think we could handle writing queries like these
manually. But because we are writing an API on top of our elasticsearch dataset,
we construct these queries in code. This becomes very complex because:

  • We can't always construct an entire query in one function; different pieces of our application have information needed for different parts of the query.
  • It is not always obvious how to combine multiple queries, filters, and aggregations together to get the expected result.
  • Writing large blocks of JSON causes clutter in code.
  • We almost always have to crack open the elasticsearch docs to relearn the syntax for different queries and filters.
  • We have to write unit tests for these queries, because we need to be confident that the JSON is constructed properly.
  • Building queries is not a core concern of our application, and it distracts us from getting things done.

This is a good candidate for abstraction. We should delegate the task of writing
complex queries to someone else so we can focus on answering important questions
about our data.

Building upon Patterns

The key to building out this abstraction is to identify and exploit patterns:

  • The elasticsearch query DSL consists of clauses that are combined in a predictable way, following a well-defined pattern.
  • Our abstraction should provide a simple, predictable interface on top of this pattern.

The concept of abstracting away the business of building complex queries is not new,
so we can find and follow an established pattern for this task. The
Builder pattern is an
established design pattern for "constructing complex objects by separating
construction and representation," and fits our use case well.

Moreover, query builders are common, especially for SQL -- sequelize, knex.js,
for example. There are also a few libraries out there for building elasticsearch
queries: elastic.js and
esq; unfortunately, these have received little maintenance recently. However,
we can look to these libraries for examples of how to write our own query
builder.

Bodybuilder

Following the Builder pattern, the problem of building a complex elasticsearch
query body boils down to:

  1. Defining functions that return an object representing a single query clause.
  2. Combining multiple query clauses together.
  3. Defining a builder class that provides methods for calling these functions, and stores the changing state of our elasticsearch query body.

For example, a query clause such as a match query can be represented as a
function:

{
  "match" : {
    "message" : "this is a test"
  }
}

This object is a template that we can create a function for, e.g.

matchQuery("message", "this is a test")
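
A hypothetical version of that template function (a sketch for illustration,
not bodybuilder's actual internals) could be as simple as:

function matchQuery(field, value) {
  var clause = { match: {} };
  clause.match[field] = value;
  return clause;
}

matchQuery('message', 'this is a test');
// => { match: { message: 'this is a test' } }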

Next, we should be able to call these template functions on our builder class. We can specify the query type we want by passing its name
as the first argument, e.g.

var body = new Builder()
body.query("match", "message", "this is a test")

This class should also hold on to the current state of our query body in a private variable:

body.query("match", "message", "this is a test")
// -> { "match" : { "message" : "this is a test" } }

It should also know how to combine multiple queries together without exposing
the details of how this is done:

// Combines two query clauses using a Boolean query clause.
body.query("term", "user", "herald") // -> 'must' clause
    .orQuery("term", "user", "johnny") // -> 'should' clause

And finally, the builder must provide a method for spitting out the constructed
object:

body.build() // -> returns constructed query body
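
To make the shape of the Builder concrete, here is a deliberately tiny toy
version (for illustration only; it handles a single query clause and none of
the combining logic that bodybuilder actually implements):

function Builder() {
  var body = {}; // the changing state of the query body, kept private

  this.query = function(type, field, value) {
    var clause = {};
    clause[field] = value;
    body[type] = clause;
    return this; // chainable
  };

  this.build = function() {
    return body;
  };
}

new Builder().query('match', 'message', 'this is a test').build();
// => { match: { message: 'this is a test' } }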

Putting all of this together, the adx query from the beginning of this post
can be rewritten much more simply as:

var body = new Builder()
  .filter('range', 'timestamp', {gt: 'now-24h'})
  .filter('nested', 'ad_impression_ids', 'term', 'advertiser_name', 'adx')
  .orFilter('term', 'project_id', '12')
  .orFilter('term', 'project_id', '24')
  .build()

Bodybuilder
is a new project and any feedback or contributions in the way of
pull requests or issues are welcome.


Abstracting out the business of building complex elasticsearch query bodies was a clear win for us. It helped isolate the challenge of writing clean, testable code around elasticsearch queries, and cleared the path for building out an awesome product. As a team we continually look for opportunities to separate concerns by pulling out modules from our code base and iterating on them separately. And by releasing this as an open-source project we hope to bring together other contributors who can help make the code even better.

Do you have a solution for building complex queries programmatically in code? Has this been a sticking point in any of your projects? Share your thoughts in the comments below!

How to split a csv file by date, using Bash and Node

Introduction

I faced a situation recently in my work at Fluencia where I needed to split up a csv file. I had a report with about 60 rows, each from a different date. What I needed was 60 reports each with only one row. And I needed each file to be named after the date of the data it contained, in the format 20150708_iad_extracted.csv for a file containing data from July 8th, 2015.

I've been spending a lot of time over the last few months improving my bash skills, so this seemed like a great job for those tools.

(This article requires a basic understanding of Unix streams and command-line bash programming, but don't be scared!)

csvkit

I found this python tool to be extremely useful for this task. Though it can be great to work with tabular data using a GUI like Excel or Numbers when you're a business user, I find that as a programmer I often want to interact with data through the command line. Even more, I want to be able to read data and pipe it through different commands in a Unix-like fashion, like I would with other newline-separated text. Installing csvkit is a breeze and gives you a ton of useful commands. The two I used most are:

- `csvgrep` — allows you to filter a data set to only those that match a value you specify in a column you specify
- `csvcut` — allows you to trim a data set to only specific columns you would like.

Splitting the file, in Bash

I like to build bash commands by starting with a small command and gradually appending to it, verifying the behavior (and my bash syntax) at each step.

First, I read the report into memory and log it to the screen.

cat all.csv

Then I pipe the contents of the report into csvcut. Adding the flag -c 1 tells the command that I would like only the results of the first column (columns are 1-indexed, not 0-indexed).

cat all.csv | csvcut -c 1

After that I need to perform an action multiple times, since I need to create a new file for each row. The command xargs reads standard input (stdin) and performs a command once for each line. The normal use of xargs runs whatever command you pass it with each input line appended as arguments, but I often find the -I {} flag to be more flexible. It allows you to take that line and substitute it into the command you run. I usually use it in conjunction with bash -c, which allows you to call a string as a command. That means that you can build up a second command with pipes in it and they won't interfere with any pipes you have on the outside.

To verify this works like I expect, I would run:

cat all.csv | csvcut -c 1 | xargs -I {} bash -c 'echo {}.csv'

Now we can get the 60 file names, but we still need to get the contents for each one. We can do that by reading the file again and using csvgrep to select the row that matches the date we already have. Using the -c 1 flag tells the command that we're matching a pattern in the first column and using the -m some_string flag tells it what we want that pattern to be. Note that if you are searching for a pattern that contains spaces, you can wrap that string in quotes, like -m "some string".

cat all.csv | csvcut -c 1 | xargs -I {} bash -c 'cat all.csv | csvgrep -c 1 -m "{}"'

Finally we just write the standard output (stdout) of that command into a file with the name we had earlier.

cat all.csv | csvcut -c 1 | xargs -I {} bash -c 'cat all.csv | csvgrep -c 1 -m "{}" > "{}".csv'

One thing to note is that this solution has a complexity of O(n^2), so it might not be the best for large files. I tried to make a similar solution with O(n), but ran into problems with the fact that the stdout passed by xargs is not csv escaped. If you have an O(n) solution or would be interested in one, please let me know in the comments.

Renaming the files, in node.js

I chose to do this step in node because I was confident I could easily convert a string like "Tuesday, August 13th, 2015" to "20150813" using moment.js. The latter isn't very nice to read, but sometimes it's not up to us to choose the conventions expected by our systems.

I used two other libraries for this script: fs from node core and Bluebird, which is an awesome promise library. If you're not familiar with Bluebird, the biggest reason for using it is that it makes it very easy to convert non-promise code into promise code. For this I passed the whole fs library to Promise.promisifyAll(), which created new promise-returning functions for all the library's callback-accepting functions (now with "Async" appended to their names).
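
The setup described above might look roughly like this (a sketch, with the usual caveat that your require paths may differ):

var Promise = require('bluebird');
var moment = require('moment');

// promisifyAll adds *Async versions of fs's callback-based functions,
// e.g. fs.readdirAsync and fs.renameAsync.
var fs = Promise.promisifyAll(require('fs'));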

Our convention at Fluencia is to work with promises by first defining a set of functions we are going to use, then calling them each by name in a promise chain, avoiding anonymous functions as much as possible.

First I define a function to read all files in a directory:

function readFiles(dir) {
    return fs.readdirAsync(dir);
}

Then a function to filter only files which end in .csv:

function onlyCsv(fileName) {
    // keep only files whose names end in .csv
    return /\.csv$/.test(fileName);
}

Then the real action, a function that will read the date from the file name, then convert it to the new format, and rename the file. For more details on how moment.js does formatting, take a look at their docs here:

function rename(fileName) {
    var date = fileName.replace('.csv', '');
    date = moment(date, 'dddd, MMMM Do, YYYY');
    var newName = date.format('YYYYMMDD') + '_iad_extracted.csv';
    return fs.renameAsync('./' + fileName, './' + newName);
}

And finally I call all those functions in a chain:

readFiles('./')
    .filter(onlyCsv)
    .map(rename)

Conclusion

Bash commands are extremely powerful and flexible tools. They make the job of splitting one csv file into many fairly easy and repeatable. However for more complicated programming tasks, even just changing a date from one format to another, it can be much nicer to use javascript for it's many easy-to-use libraries.