The GoLD Stack: A Management Perspective

I only heard the term “GoLD stack” a few weeks ago. As far as I’ve been able to trace it back, it was coined in a tweet (of course) by Santiago Martinez Q.

It stuck with me for three reasons:

  1. It’s a short, catchy name.
  2. It subtly alludes to being a silver bullet (which we all know doesn’t exist, but which I still like to romanticize).
  3. We actually use it at OLX, and have had a positive experience with it in our teams focused on our Jobs category.

Some of the people on my team have already written over time about how to use this stack in practice and why we’re excited about it.

Realizing there’s a catchy name for this stack now, I thought it would be good to jump on this “branding” opportunity. I mildly pushed Paweł to write it up, and he wrote a nice introductory article on the GoLD stack as a result, focused on the technical aspects.

Now, in turn, let me do my part and give you the management perspective of why I’m excited about GoLD.

In my mind the GoLD stack could become the LAMP of the 2020s.

Management people like to think in dimensions, so here are the four dimensions I will cover to show you I’m right (in case you’d even doubt me):

  1. Scaling and cost transparency
  2. Time to value
  3. Onboarding
  4. Culture and recruitment

To be clear, while GoLD nicely captures the core of the stack, it doesn’t describe a full one. In our context, in addition to using Go to write our Lambdas and DynamoDB as the main data store, we deploy using the Serverless Framework, and are heavy users of SNS, SQS, Kinesis, and S3 as well.

Scaling and cost transparency

When AWS launched 14 years ago, one of its selling points was “pay for what you use.” In those early days, with S3 and EC2 as the primary services on offer, this meant: you pay for the data you store and for the hours you run your VMs. Still, for the compute part it didn’t really matter whether you used all of the CPU cycles, memory, or allocated disk space: you picked your instance type and paid for the whole thing, independent of “internal” utilization. If you outgrew your instance, you either scaled horizontally by adding another instance, or vertically by spinning up a bigger one. Over the past 14 years the AWS offering has evolved a lot, along two dimensions (yes, dimensions again):

  1. A suite of “serverful” services, where visibly you spin up a cluster of EC2 instances (and associated services) that are preconfigured and managed by AWS to run some specific service, such as most of RDS, ElastiCache, Redshift, ECS, and EKS. AWS takes a lot of the operational burden (maintenance, backups, sometimes upgrades, sometimes scaling) but the way you provision these is still at the level of EC2 instances, so you select instance types, cluster sizes, scaling policies, availability zones etc.
  2. A suite of “serverless” services, where the level of abstraction is higher and you tend to pay per request, messages passed, CPU cycles consumed, storage used, and are pretty much oblivious to the resources that power them behind the scenes — services like this include SNS, SQS, S3, Lambda, DynamoDB, and API gateway.

Now intuitively, while you lose some level of control with the serverless options, the good news is that there’s a very close correlation between actual use and cost. If your lambda is never invoked, your cost will be close to $0; if you never put an object in S3, you pay nothing. And if you do see use, it’s fairly easy to calculate and track what the cost is and exactly where the money goes.
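To make that transparency concrete, here’s a rough back-of-the-envelope sketch in Go of how a single Lambda function’s monthly bill can be estimated. The list prices are approximate figures for a typical region and will drift over time, so treat them as illustrative assumptions, not authoritative numbers.

```go
package main

import "fmt"

// Approximate AWS Lambda list prices (illustrative; check the AWS
// pricing page for current, region-specific figures).
const (
	pricePerRequest  = 0.20 / 1_000_000 // USD per invocation
	pricePerGBSecond = 0.0000166667     // USD per GB-second of compute
)

// lambdaMonthlyCost estimates the monthly bill for one function from
// its invocation count, average duration (ms) and memory setting (MB).
func lambdaMonthlyCost(invocations int, avgMs, memoryMB float64) float64 {
	gbSeconds := float64(invocations) * (avgMs / 1000) * (memoryMB / 1024)
	return float64(invocations)*pricePerRequest + gbSeconds*pricePerGBSecond
}

func main() {
	// 10M invocations a month at 50 ms average on 128 MB.
	fmt.Printf("$%.2f\n", lambdaMonthlyCost(10_000_000, 50, 128))
	// prints $3.04
}
```

The point is less the exact figure than that the formula is this simple: every term maps directly to something you can see on the bill.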

With many of the “serverful” services, this relationship is much more implicit: you pay for, say, a 3-node DB cluster, which will initially be underutilized, eventually reach capacity, and then need to be scaled. The granularity is always measured in nodes, though, not in the number of read or write operations on particular tables, as with DynamoDB.

Aren’t serverless solutions more expensive at scale, though? I think there are cases where they may be, but in our use and experience thus far at OLX, the conclusion has consistently been that it’s surprisingly affordable, and cheaper than the serverful options. In all transparency, though: we’re still early in the journey of pushing GoLD into our highest-traffic areas.

Time to value

Projects built on the GoLD stack seem to see their first production release more quickly than non-GoLD projects, and to iterate more quickly afterwards. This means a short time-to-value cycle, which I’ve written about in the past. To be honest, I don’t have a large set of data points to prove this, but it is my impression. It’s not clear to what extent this has technological reasons or more of a cultural background (I’ll get to culture later).

From a technical perspective, it makes sense. Assuming you have an AWS account ready to go, all you need to do is install the Serverless Framework, look at an example serverless.yml file with Go handlers, perhaps add a DynamoDB table definition, run sls deploy, and you essentially have a production-ready setup, with no additional infrastructure (such as Kubernetes, ECS, EB, or database clusters) to prepare.
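For illustration, a minimal serverless.yml along those lines might look something like this; the service name, handler path, and table definition are placeholders made up for the example, not our actual configuration:

```yaml
# Sketch: one Go Lambda behind an HTTP endpoint plus a DynamoDB table.
service: hello-gold

provider:
  name: aws
  runtime: go1.x
  region: eu-west-1

functions:
  hello:
    handler: bin/hello          # the compiled Go binary
    events:
      - http:
          path: /hello
          method: get

resources:
  Resources:
    HelloTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: hello-table
        BillingMode: PAY_PER_REQUEST   # pay per read/write, no capacity planning
        AttributeDefinitions:
          - AttributeName: pk
            AttributeType: S
        KeySchema:
          - AttributeName: pk
            KeyType: HASH
```

One file like this, plus a compiled binary, is the entire deployment unit.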

Because solutions built on the GoLD stack (or serverless in general) tend to involve a number of AWS services that are hard to fully emulate locally, testing in “production” (actual AWS) is largely inevitable. Sure, you write most of your automated tests at the lambda level, but to truly test the integrated whole, the best way to get confidence is to push it to a staging environment on AWS. And since that “staging environment” is effectively the same thing as production, why not push it to production as well? Feature flags control access to features anyway, right? You are using feature flags, right?

If it’s technically easy to push to production, you just do it more often, because why not? It adds to the level of confidence you have that things actually work the way they should, even if customers don’t get to interact with all of it yet.

Another hint about why the GoLD stack may have a shorter Time to Value is nicely visualized in this graphic (under “Serverless”):

You simply get a larger part of your stack “for free,” which saves time. To read more about why serverless makes a lot of business sense, read Łukasz’s excellent post on the topic.

Onboarding

As alluded to, deploying something basic on a GoLD stack is quite trivial, and an engineer can likely do this on her first day even if starting out with an empty AWS account.

Go as a language is relatively simple. To some it may feel unnecessarily low-level, but once you get over the fact that you have pointer and non-pointer types, people with a solid programming background tend to be productive in Go in a matter of days. Quickly, you simply know the entire language, and it becomes a matter of learning the “Go way” to solve problems. Go is very explicit, there’s little magic, and there are relatively few “gotchas” compared to other languages. There’s also little room for clever APIs that let you write that for-loop in a single line, the way Kotlin, Ruby, Rust, and Scala would let developers do. Instead, you just have to suck it up and write out the for-loop every time, but at least anybody new to the project (and even new to the language) will instantly understand what’s going on, without being aware of magic provided by macros, reflection, aspects, or dynamic code generation.
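To make the point tangible, here is the kind of written-out loop Go pushes you towards (the function and data are made up for illustration); in Kotlin or Ruby this would be a one-line filter/map chain:

```go
package main

import (
	"fmt"
	"strings"
)

// upperShortNames keeps names of at most four characters and
// upper-cases them. No combinators, no magic: just the loop,
// readable by anyone on their first day with the language.
func upperShortNames(names []string) []string {
	var result []string
	for _, name := range names {
		if len(name) <= 4 {
			result = append(result, strings.ToUpper(name))
		}
	}
	return result
}

func main() {
	fmt.Println(upperShortNames([]string{"go", "lambda", "dynamo", "sqs"}))
	// prints [GO SQS]
}
```

More keystrokes than the one-liner, but zero surprises when you read it back.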

From a management perspective, there’s a lot to like about Go as well: it’s a relatively mature language with a mature ecosystem and solid backing from Google and others; it has proven to scale to large code bases (types help) should that ever be required (ideally it wouldn’t be; lambdas ought to be small); and it comes with a toolchain that eliminates the need for unproductive discussions, such as:

  • How code ought to be formatted (go fmt decides)
  • What package manager to use (go mod)
  • How to keep code clean and consistent: some things, like the capitalization of type and function names, change semantics; various kinds of dead code, like unused imports and unused variable declarations, simply result in compile-time errors; and go vet checks for various other pitfalls

Of course, the default toolchain doesn’t predefine everything, but it’s a solid baseline.

On AWS Lambda, Go is a first-class citizen: you don’t need custom runtimes to run it, and the Go AWS SDK is well maintained.

Lambda as a runtime is easy to grasp. The event model may be a bit of a mind-bender if you’re not used to thinking this way (but that may be a good thing; I’ll talk about cultural aspects later). Conceptually, though, there’s not that much you need to understand to write lambdas: you write Go code, compile it, and zip up the binary (which the serverless framework will upload to S3); when certain events occur (such as SNS notifications, SQS messages, incoming API Gateway calls, or DynamoDB stream events), your code is pulled down and run in an environment where your only real lever is how much memory you allocate to it (although this has implications for CPU as well, which is a bit of a gotcha).

Running Go on lambdas is a good fit, because Go compiles into a single binary, which is quick for the lambda environment to pull down and boot. Also, since Go processes require relatively little memory, you can run them in low-memory (and therefore cheaper) configurations if that makes sense. There’s a lot of ado about lambda cold starts (whether your lambda invocation boots up a new lambda process or reuses an already running one, the latter obviously being more performant), but if you write in Go this is of relatively little concern; in this sense it’s probably one of the fastest options. There’s a story (not sure about the source) that the original API Gateway implementation ran on AWS Lambda with the logic implemented on top of the Java runtime, which has serious cost implications (because Java cold starts are slower and the JVM consumes quite a bit of memory). The new HTTP API offering is supposedly implemented in Go, and as a result significantly cheaper. Needs citation.

DynamoDB as a database is probably the tougher case in terms of onboarding. A little while ago The DynamoDB Book came out, essentially a must-read if you would like to use this service. Going through it, my main learning is that you need to be willing to unlearn almost everything you learned about database design, assuming you grew up with relational databases (as I did). Which, for some, is going to be exciting, but for others it may be too far out of their comfort zone, and not a good fit.
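As a taste of what that unlearning leads to, here is a sketch in the single-table style the book teaches: different entity types share one table and encode their type into composite keys, so a single query on the partition key can fetch a user together with their orders. The key formats below are purely illustrative, not a real schema:

```go
package main

import "fmt"

// Composite keys for a hypothetical single-table design: a user's
// profile and orders share a partition key, so one query on
// pk = "USER#<id>" fetches them together.
func userKey(userID string) (pk, sk string) {
	return "USER#" + userID, "PROFILE"
}

func orderKey(userID, orderDate string) (pk, sk string) {
	// ISO dates in the sort key keep orders chronologically ordered.
	return "USER#" + userID, "ORDER#" + orderDate
}

func main() {
	pk, sk := userKey("42")
	fmt.Println(pk, sk) // USER#42 PROFILE
	pk, sk = orderKey("42", "2020-06-01")
	fmt.Println(pk, sk) // USER#42 ORDER#2020-06-01
}
```

If joins and normalization are your instincts, keys like these feel backwards at first, which is exactly the unlearning the book is about.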

Once you understand how to model things properly, DynamoDB is flexible enough, and it scales like butter. You pay only for the data stored and the operations performed.

Which naturally transitions us to the last dimension: culture and recruitment.

Culture and recruitment

There are a few ways in which I see the GoLD stack affecting engineering culture:

  1. Cost awareness — because of the pay-for-use model, engineers become very sensitive and aware of cost and closely monitor it.
  2. DevOps built-in — we have one SRE for two teams, but much of the infrastructure work (in serverless.yml) is done by engineers. There’s just less to learn and worry about than with some alternatives, like Kubernetes, so even simpleton engineers can manage the infra 😛.
  3. Focus on monitoring — without proper monitoring, it’s rather hard to even debug basic issues on this stack. Since there are so many interconnected but loosely coupled parts, it is simply impractical not to invest in monitoring early. Luckily, AWS publishes plenty of CloudWatch metrics about every deployed resource and aggregates logs from lambdas automatically; all that remains is putting it together with dashboards and alerts.
  4. The type of people attracted to this stack is a bit different — let me expand on that in the context of recruitment.

On recruitment: let’s be realistic. GoLD may be cool, but it’s far from mainstream. The chances that you’ll be able to recruit people with many years of GoLD experience on their CVs are very low.

This is a clear disadvantage compared to recruiting for, say, a LAMP stack or a Java/Spring stack, where the population is significantly larger. But it does provide an opportunity to recruit for a different mindset and culture: one that is less focused on reusing stack-specific knowledge acquired over the years, and more comfortable stepping out of its comfort zone and struggling a bit while learning something different. Something quite different along the language, architecture, and data-model dimensions.

However, to be able to support such a thing, you do need to bootstrap the process somehow. It happened to be me (humility alert) who brought the GoLD stack (then still unnamed) to OLX, with no experience in this stack whatsoever. Over time we made some key hires that compensated for this sheer lack of experience, and worked our way out of all my rookie mistakes (most of the code I wrote two years ago, when we started this journey, has been rewritten by now).

The result, though, is that by now we have a solid foundation for how to structure our code, write tests, model data, and structure our pipelines, as well as some reusable libraries used across projects. With this in place, scaling in terms of teams can happen. A few months ago we onboarded an existing team (with no previous Go, DynamoDB, or serverless background) to the GoLD stack, and within a quarter of close collaboration with a more experienced team, it was perfectly able to stand on its own feet. We have also hired some new people, and moved people internally from different roles (such as front-end) with limited prior experience, and onboarded them quickly. We’re pretty confident we can scale further when we need to. But the openness and the mindset need to be there.

We’re not in a hyper growth mode in terms of scaling teams right now (also due to COVID), so we still have to see how easy it will be to recruit more people with this profile, but I’m hopeful.

The management perspective

In management there are a lot of angles (I won’t use the word “dimensions” again, so let me switch to synonyms) to consider when deciding on technology stacks: maturity of the ecosystem, productivity, reliability, and the availability of talent able, and willing, to work on the stack in the short but also the long term. There’s no silver bullet solution to this challenge. However… luckily there’s a GoLD bullet solution 🙃