Scalability Talk at International Week of Technological Innovation

Posted on April 19, 2010

Erik Schultink, Chief Technical Officer

On Wednesday (21.04.2010), I’m giving a talk about scalability at Tuenti at the “International Week of Technological Innovation”, hosted by Universidad Europea de Madrid. In prepping that talk over the weekend, I put together some very interesting data about the work our team has done over the last 6 months here at Tuenti. This data shows the hard-won gains from months of applying the same approach: partition, archive, optimize – then profile/monitor and repeat. I’m pretty proud of that work and want to highlight some of it.

As I’ll speak about in my talk, I define scaling as maintaining acceptable performance under increasing amounts of load. I think of the performance of the system as a graph like the one shown below:

The x-axis is request rate (e.g. requests/sec); the y-axis is response time (ms). We care about the total throughput of the system – the total number of requests that can be served in a unit of time – while ensuring that every response is generated faster than some upper bound on response time (dashed red line). You can think of the “capacity” of the system as the request rate beyond which response time rises above this threshold (i.e. performance is unacceptable) – the intersection of the red and blue lines. Scaling is moving this point farther and farther to the right, through actions such as optimizing, re-architecting, and adding infrastructure. All of these actions shift and re-shape the performance curve of the system – hopefully for the better.
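To make that definition of capacity concrete, here is a minimal sketch in Python. It assumes the performance curve is available as a list of (request rate, response time) points. The function name `capacity`, the sample numbers, and the choice of linear interpolation between measured points are illustrative assumptions, not a description of Tuenti's tooling.

```python
def capacity(curve, threshold_ms):
    """Estimate the request rate at which response time crosses threshold_ms.

    curve: iterable of (requests_per_sec, response_ms) points (hypothetical data).
    Returns 0.0 if even the lowest measured rate is over the threshold,
    or None if the threshold is never crossed within the measured range.
    """
    points = sorted(curve)
    prev_rate, prev_ms = points[0]
    if prev_ms > threshold_ms:
        return 0.0
    for rate, ms in points[1:]:
        if ms > threshold_ms:
            # Linearly interpolate between the last acceptable point
            # and the first unacceptable one.
            frac = (threshold_ms - prev_ms) / (ms - prev_ms)
            return prev_rate + frac * (rate - prev_rate)
        prev_rate, prev_ms = rate, ms
    return None


# Made-up example curve: response time climbs as request rate grows.
curve = [(100, 50), (200, 60), (300, 100), (400, 200)]
print(capacity(curve, threshold_ms=150))  # 350.0
```

In these terms, scaling work is anything that makes the value returned for a fixed threshold larger over time.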

What does that look like in practice? At Tuenti, we profile a portion of requests to our system and from that data, I produced the following performance curves:

Each curve in that graph is from a sample dataset, each of which was taken about 2 months apart. As in the theoretical graph, the x-axis is request-rate and the y-axis is average response time for those requests. For full disclosure, I scaled the data and excluded some outliers to get curves that overlaid nicely on each other – but these curves remain quite representative of the performance profiles of our system and, despite some extrapolation, don’t mask any bottlenecks lurking within the range of the x-axis.
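The post does not describe exactly how the curves were computed. As a rough sketch of one way to do it, the Python below groups profiled samples into request-rate buckets and averages the response time in each bucket. The sample format, the `performance_curve` name, the bucket width, and the numbers are all hypothetical.

```python
from collections import defaultdict


def performance_curve(samples, bucket_width):
    """Average response time per request-rate bucket.

    samples: iterable of (requests_per_sec, response_ms) pairs, one per
             profiled request (hypothetical format).
    Returns a list of (bucket_start_rate, mean_response_ms), sorted by rate.
    """
    buckets = defaultdict(list)
    for rate, ms in samples:
        buckets[int(rate // bucket_width)].append(ms)
    return [
        (idx * bucket_width, sum(values) / len(values))
        for idx, values in sorted(buckets.items())
    ]


# Made-up profiled samples.
samples = [(105, 40), (130, 60), (210, 80), (290, 100), (95, 30)]
print(performance_curve(samples, bucket_width=100))
# [(0, 30.0), (100, 50.0), (200, 90.0)]
```

Averaging per bucket hides outliers within each bucket, which is one reason a noisy dataset can still produce a readable curve.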

These curves tell a very interesting story. In October, our system was clearly much inferior to what it is today. Although that dataset is quite noisy, it is clear that performance degraded rapidly at a much lower range of request rates than in later months. I don’t recall a particular bottleneck we were facing at that time – but likely it’s explained by bumping into CPU and DB contention.

Two months later, in December, we had flattened this curve substantially. Although one could complain that some outliers at the left extreme are forcing a very generously fitting trendline, it’s pretty clear that we had better performance in December at high request-rates than in October. Note that, interestingly enough, response time at lower load levels is actually worse in December than in October – we had traded about 10 ms in best-case performance for increased scalability, but that’s a trade I’ll take any day. Overall throughput of the system is more important than the response time of any single request.

After another two months of work, in February, we had reclaimed that 10 ms while further flattening the curve. The dataset also looks much more stable, with less noise. In April, hard work brought response times down another 10 ms while maintaining a very healthy looking curve and stable dataset.

Overall, I think this graph gives a fantastic representation of 6 months of work scaling a Web 2.0 system – maintaining and improving performance in the face of significant growth and new feature launches. Those response time figures are totals – including CPU time rendering the page, as well as cache access and DB queries. Such work involves a lot of different teams: our backend scalability team of course, but also our backend framework and systems teams. And whatever optimizations those teams make, we still count on our product development teams to write new features in ways that don’t abuse our frameworks, DBs, or CPUs.

Interested in pushing this curve farther? Check out jobs.tuenti.com.
