By Zef Hemel — Dec 13, 2012

Pick Your Battles

So, you decided to build a real application. Not a toy. Not a hobby project. Something that’s supposed to last, supposed to scale, supposed to work and remain reliable.

If you’re in any way like me, I bet right now you’re browsing the interwebs like crazy to find the hottest new technology you can get your hands on to use in this new venture. This is a once in a lifetime opportunity! You can choose anything you like. Erlang, Clojure, Ruby, Node.js: the sky is the limit! NoSQL databases are hot. Let’s use MongoDB. Redis. Cassandra. Let’s make the front-end super fancy and reactive with Backbone and EmberJS!

This is what Cloud9 did. Cloud9 picked the hottest thing at the time, node.js, and built its entire back-end using it. JavaScript front to back! Database? Redis. Super fast. Nicely scalable, ready for the future! A dream technology stack.

Here are some of the things we ran into:

“Oh, how do we organize a big (200k+ lines of code) JavaScript code base?”
“How do we roll out releases without bringing the site down?”
“How do we test this thing?”
“Oh, when an exception occurs in a node.js program, it crashes the entire server process.”
“Oh, our node.js server processes seem to freeze up for a long time (seconds) from time to time, why does that happen?”
“Oh, why does the cluster module not distribute load over all processes equally under load?”
“Oh, now we have all this data in Redis, but at some point data started to get inconsistent, when did that happen? How do I fix that?”
“Oh, from time to time Redis completely blocks, not responding to queries for seconds at a time, what’s going on?”
“Oh, now I’d like to know something from the database that I cannot find out with simple get and put-style queries. Do I really have to write a script now that pulls in half the database to get an answer to my question?”

All of these are technically interesting problems. If you want to pioneer, this is the way to do it. Often it turns out you’re the first person to run into such a problem, because nobody pushed the technology to the limit before. It’s challenging, but not always pure fun: many of these problems tend to be discovered at 2am on a Friday night and they have to be solved quickly. “No pressure, but our entire site is down and people cannot do their daily work.”

Are any of these problems unsolvable? Of course not. We figured most of them out, but it did take time and investment.

The approach taken for a large part of our infrastructure is to adopt every piece of code used as our own, and make sure the team can debug issues and fix problems at every level of the stack (at least all the “new” tech). We hired a few node.js core developers (actually, over time we ended up hiring 75% of the core node.js team) — they were incredibly helpful. We also, over time, replaced a lot of parts of our infrastructure with custom built parts, because parts we reused from third-party developers were sometimes of poor quality or simply not maintainable.

What I learned is that you have to figure out how many simultaneous challenges you’d like to handle. Does the project require a new technology that may result in problems down the road, or is making the project successful by itself enough of a challenge? You have to pick your battles carefully.

If a new technology is the only way to solve a problem you have, it’s a no-brainer.

If a new technology is going to give you the edge to push the competition out of the market, it’s a no-brainer.

If it’s new technology for the sake of new technology, think twice.

If you’re a startup, or are launching a new product inside a bigger company, you have to find “product-market fit”: you have to prove that your idea is viable and that there are people who are willing to use it, even pay for it. You have to make sure you grow, and eventually grow exponentially. More often than not, that is a huge challenge in itself. It requires focus on what matters: finding out and building what your users need. Anything other than that is a distraction. Being called out of bed at 2am because the US woke up and started to use your application, and your language runtime of choice appears to have problems under load and now stuff breaks: that’s a distraction.

So, here’s my advice: go and build amazing applications. Build them with the most boring technology you can find. The stuff that has been in use for years and years. Where every edge case has been covered. Where every library you will ever need has been in production for years. Where every part of the release cycle has been ironed out. Where the best practices on how to do testing are known.

Use PHP. Use Java. Use Python. Use Ruby on Rails. Use .NET. Use jQuery. The language may be more verbose, and the framework may be years old, but at least googling your error message will return you results you can use. You don’t have to invent everything yourself. You’re not going to be the first to hit a certain limit.

Pick your battles.

About a decade ago I was really into forum software. I built forum software in Perl and when I was bored with it, I rebuilt it in PHP and then in Java. I knew what a forum needed to do, that was not a challenge. The challenge was in the technology. If the software you’re building is a snooze, a no-brainer, you can possibly afford some technology risk to spice up your life. If it’s not: don’t take too many chances.

No sequel
Everybody seems to be moving to NoSQL databases, because, you know, “MySQL doesn’t scale! My app will outgrow MySQL!” That’s what we call premature optimisation. First, prove you even have to scale, and if you do, that it’s your database that has scaling issues. Facebook uses MySQL to keep most of its data; are you going to get bigger than them? Accept that you simply do not yet know your technological challenges. At Cloud9, more often than not we predicted our bottlenecks wrong. Dead wrong.

I in no way mean to badmouth NoSQL databases. They have use cases, but you have to make sure you hit them. Redis is an amazing piece of engineering. It’s simple and its performance is unbelievable. There have been cases where a bug in one of our scripts would effectively launch a DoS attack on our Redis server, executing queries like crazy, but Redis wouldn’t break a sweat. Many tens of thousands of requests per second on a single box — no problem.

However, much of the Cloud9 data is very relational: we store users, workspaces, workspace members. A user has many workspaces, a workspace has many workspace members. Everything is encoded in clever ways with keys, hashtables and sets. Sadly, a bug or crash could easily make the data inconsistent, and Redis has no way to enforce consistency for you. In addition, the way data is stored in Redis has to be optimised for the type of queries that need to be performed. If, later on, you find out you need other ways to get your data out, you have a problem. For instance, you want to know which user has the most workspaces, or what workspace type is most popular. With SQL that would be a single query. With Redis, because you didn’t plan for this type of query, you have to iterate over every single user and count the number of members in its “workspaces” set. Finding out what workspace type is most popular would require pulling in every single workspace record (hundreds of thousands of them), one query at a time, and then checking the value of the “property” field.
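To make that difference concrete, here is a rough sketch of the two questions asked both ways. It is not Cloud9's actual schema or code: the table layout, the key names, the sample data and the "type" field are all made up for illustration, SQLite stands in for a SQL database, and a small in-memory class that counts reads stands in for a key-value store like Redis.

```python
import sqlite3
from collections import Counter

# Hypothetical sample data: (owner, workspace id, workspace type).
ROWS = [
    ("alice", "ws1", "node"),
    ("alice", "ws2", "python"),
    ("alice", "ws3", "node"),
    ("bob", "ws4", "node"),
    ("carol", "ws5", "php"),
    ("carol", "ws6", "python"),
]

# --- Relational layout: each question is a single query. ---
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE workspaces ("
    "id TEXT PRIMARY KEY, owner TEXT NOT NULL, type TEXT NOT NULL)"
)
db.executemany(
    "INSERT INTO workspaces (owner, id, type) VALUES (?, ?, ?)", ROWS
)

sql_top_user = db.execute(
    "SELECT owner FROM workspaces "
    "GROUP BY owner ORDER BY COUNT(*) DESC LIMIT 1"
).fetchone()[0]
sql_top_type = db.execute(
    "SELECT type FROM workspaces "
    "GROUP BY type ORDER BY COUNT(*) DESC LIMIT 1"
).fetchone()[0]


# --- Key-value layout: an in-memory stand-in, NOT a real Redis client. ---
class CountingStore:
    """Tiny key-value store that counts how many reads it serves."""

    def __init__(self):
        self.data = {}
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.data[key]


store = CountingStore()
store.data["users"] = set()
for owner, ws_id, ws_type in ROWS:
    store.data["users"].add(owner)
    store.data.setdefault(f"user:{owner}:workspaces", set()).add(ws_id)
    store.data[f"workspace:{ws_id}"] = {"owner": owner, "type": ws_type}

# Most workspaces: one read per user to get the size of each set.
counts = {
    user: len(store.get(f"user:{user}:workspaces"))
    for user in store.get("users")
}
kv_top_user = max(counts, key=counts.get)

# Most popular type: one read per workspace record.
types = Counter()
for user in store.get("users"):
    for ws_id in store.get(f"user:{user}:workspaces"):
        types[store.get(f"workspace:{ws_id}")["type"]] += 1
kv_top_type = types.most_common(1)[0][0]

print("SQL:      ", sql_top_user, sql_top_type)
print("Key-value:", kv_top_user, kv_top_type, f"({store.reads} reads)")
```

Both approaches arrive at the same answers, but the key-value version needs a read for every user and every workspace. With six workspaces that is harmless; with hundreds of thousands of records it becomes the script that pulls in half the database.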

If this is a trade-off you make to be able to scale to the millions of users you have: great. If you pick this because it’s “cool”: don’t torture yourself.

I always used to be — and in many ways still am — an early adopter. New technology gets me excited, makes me want to play with it and use it for everything. I’ve since learned that this attitude can work, but comes at a cost.

In summation:

Don’t underestimate the value of mature technology. Things will break, and you will have to fix them. It’s bad enough if your own code is the problem; it’s worse when the problem is a poorly understood “feature” of your platform of choice.
Don’t optimise prematurely. Don’t choose C because it’s faster. Don’t use MongoDB because, supposedly, it scales better. Don’t cache until you’re sure you have to.
Be pragmatic. Technologies like node.js and Redis have many great uses. If you hit one of their use cases, limit their scope to what makes sense. There’s no need to go all-in at all costs.

I know it’s exciting to plan for a system that’s ready for Google-like traffic — but seriously, focus on something people want first. You will have to go through multiple refactors of your infrastructure. There’s not a chance in the world you will get it right the first time, so don’t assume you will.

Update: There are Reddit and Hacker News discussions about this post.

There’s also a video of this as a talk.
