Neil’s Notes and Thoughts – Building a Realtime Messaging Platform with Erlang

Building a Realtime Messaging Platform with Erlang

April 21, 2013

Recently at work, we launched a realtime messaging feature. We used Erlang to build a central part of this feature's server infrastructure. This post goes into some detail about building this infrastructure and what I learned along the way.

Designing the Platform

The big goals of the server platform were highly-availablity and scalability; allowing us to potentially handle millions of concurrent connections. We used a lot of Python elsewhere, so a Python-based solution was preferable. But, I strongly believe in using the right tool for the job. After doing some prototyping, talking to a few friends and seeing this presentation, Erlang felt like the right tool for the job. I was also well aware of Erlang's success at Facebook and MochiMedia, where I worked previously. At MochiMedia, I had written some Erlang, but I was far from an Erlang expert and this project would be the first Erlang-based service I was building from the ground up.

Once I decided on Erlang, there were two other important choices to be made:

Roll our own protocol or use a pre-existing one.
Roll our own server or use one off-the-shelf, preferably one that is open source.

Our project team was relatively small (4 engineers total) and for most of the project I was the engineer responsible for building the server infrastructure. Thankfully both XMPP, an open messaging protocol, and ejabberd, an open source XMPP Server, existed. Using XMPP would allow us to spend more time adding features (sending media messages, locations messages, etc.) and tuning the user experience rather than designing network protocols and building the accompanying libraries.

We found mature open source XMPP libraries for both iOS and Android client platforms quickly. On the server-side, ejabberd is a highly-performant Erlang-based XMPP server with a pluggable module system allowing lots of customizations. ejabberd uses Erlang's distribution protocol, so we could just attached ejabberd nodes together to scale the platform horizontally. We built a number of custom ejabberd modules including:

An auth module to verify authentication credentials stored in existing MongoDB collections.
An offline module to send Apple and/or Android push notifications to 'offline' users.
A pubsub module to send and receive pubsub data (we use XMPP pubsub for group chats) to a Python-based web service. We called this the 'storage service layer' and built it using Flask and Gevent with data stored in MongoDB.

By default, ejabberd uses mnesia to store pubsub data which is replicated to each node (ie. share everything). After doing some simple data usage projections, we realized that our pubsub data would not be a good fit for mnesia. So, we built the storage service layer to manage the pubsub data. This layer also uses write-through caching to minimize database queries. The caching allowed us to avoid problems with database replication lag, high query load, and the strange and unexpected ways we have seen MongoDB fail. The ejabberd changes to use this service were isolated to the custom pubsub module. Ejabberd was still using mnesia ram_copies to store session data. When a new ejabberd nodes starts it would first sync this session data before handling user requests. The session data seemed small enough (~50MB) that replicating it across the cluster did not warrant concern. Ideally, we wanted to move ejabberd to a share-nothing architecture to make it faster to add nodes to the cluster. But, to hit our shipping deadline it seemed reasonable to defer moving session data out of mnesia.

Post-launch Lessons

When the big green button was pressed (ie. we launched), we quickly found using mnesia for sessions was horrible. Our mobile clients typically connect to ejabberd for very short durations (2-5 minutes); making session data extremely volatile. Profiling ejabberd, showed certain Erlang processes spending most of their time reading from and writing to mnesia tables. By default, ejabberd uses synchronous mnesia operations to guarantee data consistency across the cluster. Under peak usage, the synchronous operations sometimes slowed ejabberd response times to 10 seconds. During a long caffeine-fueled night, I completely rewrote the ejabberd session manager using Redis as the data store. Once I rolled out this change, our problems disappeared. Additionally, the ejabberd cluster was now running in a completely share-nothing architecture; we could now add ejabberd nodes instantly and readily scale horizontally as growth demanded.

Over the subsequent few weeks, we fixed a number of bugs. Most of these bug fixes were rolled out using Erlang's code loading and without restarting ejabberd or dropping connected clients. This is a huge benefit of using Erlang to build applications and I have yet to find any other language or runtime with this ability.

Final Thoughts

Erlang and ejabberd have scaled impressively well; each ejabberd node is now handling more than twice as many users as we anticipated on the same provisioned hardware!

A large part of this project's success was choosing Erlang and ejabberd. ejabberd provides good client handling and is relatively simple to run and operate. But it has its warts — it feels over-engineered at times, the code is not ready to read or maintain and many parts of it are not built using OTP. Mongoose, an ejabberd fork, fixed the last part among others but I have not tried it as a replacement yet.

Additionally, there are a number of things we did not have to worry about because of Erlang:

Concurrency — while ejabberd manages tens of thousands of concurrent client connections, the Erlang VM is fully utilizing the multiprocessor hardware on the host. All applications built with Erlang get this for free; the VM schedules your Erlang processes across all CPU cores without you, the programmer, worrying about writing multi-threaded or multi-process code. In Erlang, it is really easy and powerful to make a CPU or IO-intensive bit of code in the same program run concurrently and then notify some other bit of code with a message when it is done. In other languages or runtimes, the programmer typically has to move the work to a separate process or thread.
Supervision — ejabberd processes are supervised; this means if there is an unhandled error in a function it is logged by the VM with the complete stack trace and input parameter (great for debugging) and more importantly the crashed process is restarted by a supervising process. As an example, let’s imagine the Erlang ejabberd server is making a request to an external database server which almost never crashes. However, one day the database server crashes, restarts and recovers quickly. When this happens the ejabberd server will only restart the processes that are communicating with the database; all other processes, for example this managing the connections to users’ devices continue running. The makes Erlang servers very resilient to non-systemic errors.
Distribution and Clustering — ejabberd nodes uses Erlang’s distribution protocol to connect nodes; this makes it easy to add nodes to the cluster. In general, the distribution protocol makes it easy to write clustered applications.

Finally, a note about learning Erlang. I know a lot of programmers pass on Erlang because of its strange syntax even if it is a better tool for their job; yes, it is very different but it takes less than 2 weeks to get used to it and maybe even love it; pattern matching is immensely useful. The more important and perhaps more subtle part of learning Erlang is starting to think in Erlang; thinking of an application as a set of lightweight concurrent and isolated processes (actors) communicating using messages. Once you start modeling and seeing your applications this way, writing and debugging Erlang applications become much clearer.

Tags: startup, erlang, functional, programming, languages, realtime, xmpp, ejabberd, scalability, python, gevent, messaging