August 24th, 2011

Doing the right thing

The recent outage of SoundCloud was the result of everybody doing the right thing. This totally jibes with John Allspaw’s message that looking for a root cause will lead to you finding smart people simply doing their jobs.

This is what happened.

 

The technologies at play

The number of interactions that escalated to the outage should be interesting for other Rails/MySQL shops out there. We’re using many idiomatic patterns within Rails that ended up having devastating consequences.

Cache keys that avoid invalidation

It’s best practice to form your cache keys in a way that doesn’t require you to issue cache key deletions, and rather let normal evictions reclaim garbage keys. To do this we use the ‘updated_at’ column on some of our models in the cache key so that if the model updates, we know we’ll get a new key.
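
As a minimal sketch of the idea in plain Ruby (the `Topic` struct and the `cache_key_for` helper are hypothetical stand-ins, not our actual models or Rails’ own key format), a key derived from ‘updated_at’ changes as soon as the record does, so the old entry is simply never asked for again:

```ruby
# Hypothetical stand-in for an ActiveRecord model; only the fields needed here.
Topic = Struct.new(:id, :updated_at)

# Build a cache key that changes whenever the record's updated_at changes,
# so stale entries are never read again and are left for normal eviction.
def cache_key_for(record)
  stamp = record.updated_at.utc.strftime("%Y%m%d%H%M%S")
  "#{record.class.name.downcase}/#{record.id}-#{stamp}"
end

topic = Topic.new(42, Time.utc(2011, 8, 23, 13, 0, 0))
old_key = cache_key_for(topic)

# Simulate a model update: updated_at moves forward by a few seconds.
topic.updated_at = Time.utc(2011, 8, 23, 13, 0, 5)
new_key = cache_key_for(topic)

puts old_key
puts new_key
```

No delete is ever issued for the old key; it just stops being requested and ages out of the cache.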

ActiveRecord::Timestamp#touch

There is an innocuous method called ‘touch’ that simply bumps updated_at and saves the record. This is quite convenient to call on the container of a has_many association, such as a forum topic and its posts. With a recent ‘updated_at’ and a consistent way to keep it recent via ‘touch’, the cache strategy is cleanly decoupled from the business modeling whenever you want the container’s version to be as recent as its most recent addition.

ActiveRecord::Associations::BelongsToAssociation :counter_cache

In the absence of the indexed expressions and multi-index merges that some databases have, MySQL and InnoDB leave it to the application to keep lookups of counts efficient. When dealing with tables of multiple millions of rows, a simple query like:

select count(*) from things 

Could take tens of seconds as InnoDB actually needs to traverse all primary keys and literally count the rows that exist which are visible in the current transaction.

class Foo < ActiveRecord::Base
  belongs_to :thing, :counter_cache => true
end

Is a simple and convenient workaround that avoids the ‘count(*)’ overhead: when a new Foo is created on a thing, that thing’s ‘foos_count’ is incremented by one in the database. When the Foo is removed, ‘foos_count’ is decremented by one.
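
To make the bookkeeping concrete, here is a rough plain-Ruby simulation of what the counter cache does. The `Thing` and `Foo` classes and their methods below are hypothetical in-memory stand-ins; in Rails the equivalent increments and decrements are issued as UPDATE statements for you.

```ruby
# Hypothetical in-memory stand-ins; with :counter_cache, Rails issues the
# equivalent UPDATE ... SET foos_count = foos_count +/- 1 on your behalf.
class Thing
  attr_reader :foos_count

  def initialize
    @foos_count = 0
  end

  def increment_foos_count
    @foos_count += 1
  end

  def decrement_foos_count
    @foos_count -= 1
  end
end

class Foo
  def initialize(thing)
    @thing = thing
    @thing.increment_foos_count   # the bookkeeping done when a Foo is created
  end

  def destroy
    @thing.decrement_foos_count   # the bookkeeping done when a Foo is destroyed
  end
end

thing = Thing.new
foos = 3.times.map { Foo.new(thing) }
foos.first.destroy

# Reading the stored counter replaces a count(*) over the foos.
puts thing.foos_count
```

The point to notice is that every create or destroy of a child also writes to the parent row.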

ActiveRecord::Associations::Association :dependent => :destroy

What better way to maintain all the business rules on deletion than to make sure your model’s callbacks fire when a container is destroyed? When your business consistency is maintained in the ORM, this is also the best place to ensure proper business rules on removal.

This simply does the following for every association:

thing.foos.each(&:destroy) 

Before:

thing.destroy 

It’s best to have the behavior declared where the association is declared and let the framework make sure it’s not forgotten when actually performing the destruction.

More than 140 characters

Some of the features of SoundCloud deserve more than 140 characters to do them justice. Even more than 255 characters. The tools that Rails gives you out of the box when you need more than 255 characters on a string in your data layer are limited; you’re left with the ‘text’ type in your schema definition. In MySQL this translates to a “TEXT” column.

A TEXT column with a maximum length of 65,535 (2^16 - 1) characters. 

Well, in most cases we wouldn’t need more than 1000 characters for much on our site, but in the spirit of “deployed or it didn’t happen”, our early database schemas were mostly driven by the simplest option that ActiveRecord::Schema offered. (We also have way too many NULLable columns)

Trade space for time or time for space

In the early days, SoundCloud ran on 3 servers. CPU was precious, so for some of the HTML conversion tasks we traded space for time by storing a cached version of the HTML for some of the longer text fields in the same row as the record. This was the right choice at the time for the request load and available CPU.

(time is usually easier to scale than space)

Tune your DBs for your workload

We separate reads from writes on our data layer, and we also have slightly different tunings on the DB slaves that accept reads. We have also seen statement-based asynchronous replication break the replication thread because of these different tunings on different hardware.

We use row-based (actually mixed) replication between our masters and slaves because it’s as close as you’ll get to the storage engines speaking directly to each other, minimizing the risk of differences in hardware/tuning interfering with the replication thread.

Alert on disk space thresholds

We have a massive amount of Nagios checks on all our systems, including host-based partition %free checks. When any partition on a host reaches a threshold of free space, an alert is sent.

Separate data and write ahead log physical devices

Most OLTP databases have data that is bound by random read/writes, whereas binary logs are fundamentally sequential writes. You want these workloads on different spindles when using rotating disks because a transaction cannot complete without first being written to the binary log. If you need to move the drive head for your binlog, you’ve just added milliseconds to all your transactions.

Clean up after yourself

Periodically there are administrative tasks that need to be performed on the site like mass takedowns of inappropriate content. The Rails console is amazing when you need to work directly with your business domain. Fire it up, get it done. For one-off maintenance tasks this is a life saver.

Add Spam, Mix, Bake and we’ve been Served

This all adds up. Reviewing the good parts:

  • Abstract away bookkeeping in your domain model
  • Leverage existing patterns to get the job done quickly
  • Tune and monitor your DBs
  • Hand administer your site via your domain model

If you haven’t noticed yet, there have been some incorrigible entrepreneurs using some groups to advertise their pharmaceuticals distribution businesses. They have very thorough descriptions (5-50KB worth), and unprecedented activity with tens to thousands of posts in their own groups.

At 1:00pm yesterday, we were working through our cleanups, and cranked open a console with a large list of confirmed groups to remove. With Rails this is as simple as:

Group.find(ids).each(&:destroy) 

Looks innocent enough.

What the database sees

From the database’s perspective, all of the automated bookkeeping and business-domain abstraction behind each individual destroy turns into SQL statements. It ended up looking something like this:


DELETE post
UPDATE group SET posts_count = posts_count - 1
UPDATE group SET updated_at = now()
... x 1-5000
DELETE track_group_contribution
UPDATE group SET tracks_count = tracks_count - 1
UPDATE group SET updated_at = now()
... x 1-5000
DELETE user_group_membership
UPDATE group SET members_count = members_count - 1
UPDATE group SET updated_at = now()
... x 1-5000

So we’re seeing 2N+1 updates on a group, where N is the total number of associated objects.
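
The arithmetic can be sketched as follows. The association sizes are made-up illustrative numbers, and reading the “+1” as the final DELETE of the group row itself is an assumption on top of the listing above:

```ruby
# Count the statements issued when one group is destroyed, following the
# pattern in the listing above: each associated row costs one DELETE plus
# two UPDATEs on the group row (counter decrement and updated_at touch).
# The association names and sizes passed in are illustrative assumptions.
def statements_for_group_destroy(association_sizes)
  n = association_sizes.values.sum
  {
    deletes_on_children: n,
    updates_on_group: 2 * n,
    delete_of_group: 1   # assumed: the final removal of the group row itself
  }
end

counts = statements_for_group_destroy(posts: 500, tracks: 20, members: 300)
on_group_row = counts[:updates_on_group] + counts[:delete_of_group]

puts counts.inspect
puts "statements touching the group row: #{on_group_row}"
```

With 820 associated objects, the single group row is written 1,641 times before it disappears.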

What replication sees

When using row-based replication, any change to a row gets that entire row added to the binary log. Some of these groups had over 100KB worth of text columns and hundreds of associated posts. When we parsed a given binlog, these group updates accounted for over 90% of the replication events being sent to the slaves.
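
A back-of-the-envelope estimate shows how quickly this multiplies. The row size and update count below are illustrative assumptions rather than measurements from the incident, and the estimate ignores event headers and any additional row images the server writes, so treat it as a lower bound:

```ruby
# Rough estimate of binlog volume when every UPDATE logs the full row.
# Figures are illustrative assumptions, not measurements from the incident.
def binlog_bytes(row_bytes, updates)
  row_bytes * updates
end

row_bytes = 100 * 1024   # a group row carrying about 100KB of text columns
updates   = 2 * 500      # two group updates per associated row, 500 rows
total     = binlog_bytes(row_bytes, updates)

puts format("%.1f MB of binlog for one group", total / (1024.0 * 1024))
```

Under these assumptions a single spam group produces close to 100MB of replication traffic for what looks like one line in the console.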

What the binlog partition sees

This is what finally brought us down. We were producing over 3GB/min of binlogs to be replicated to our slaves from these many group updates. Our binlog partition filled up from 10GB to 100GB in a matter of 30 minutes.
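
Those numbers line up: a quick check in Ruby, assuming a 100GB partition and a constant write rate (the helper name is hypothetical), gives the half hour we observed.

```ruby
# Minutes until a partition fills at a constant write rate.
def minutes_until_full(capacity_gb, used_gb, rate_gb_per_min)
  (capacity_gb - used_gb) / rate_gb_per_min.to_f
end

lead_time = minutes_until_full(100, 10, 3)
puts "#{lead_time} minutes until the binlog partition is full"
```

The same calculation is worth running against your own partitions and worst-case write rates to see how much reaction time an alert threshold really buys you.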

The MySQL docs are clear about what happens with a full data partition. When data cannot be written, MySQL just waits. The behavior around the binlog partition wasn’t as clear. That last binlog event had a partial write. When the disk filled, the binlog corrupted. When that last event in the last binlog was attempted to be sent to the slaves, it failed and the slaves stopped replicating.

 Got fatal error 1236 from master when reading data from binary log:
'log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master' 

Our max_allowed_packet is big enough for any of our rows.

How we recovered

We had a master with live queries that were not coming back from a ‘killed’ state. We scaled the binlog LVM partition so that it could accommodate new writes now, but the DB was not budging. We had no idea how to get it to start writing again so we began a failover process.

All our slaves were at the same position just before the corrupted event, so we grabbed one, confirmed the tuning was good and then promoted it. We went through the many other slaves and reconfigured them and we were good, consistent to the last event. All we lost were the few processes that were waiting to write to the full partition.

Ironically, 2 of our team members are currently in the datacenter recabling some of our racks. We also have 4 swanky new DB class machines powered on, but we were a day or two away from getting them networked and integrated. We were just short of having that excess capacity to accept the spike in load from return visitors after the tweet “We’re back!”.

Forensics

We tried to understand the cause of the sudden binlog growth so we could safely enable the site without a replay of what had just happened. We expanded some of the logs with ‘mysqlbinlog --verbose’, which showed that they were filled with group spam. To confirm that it was group activity, we compared the replication data volume per table of a well-sized binlog file with that of an abnormally large one, using the following awk program:

    for log in mysql-bin.03980{5..7}; do
      mysqlbinlog $log |\
      awk '
        /Table_map/ { name = $9 }
        /BINLOG/ { bytes = 0; col = 1 }
        { if (col) bytes += length($0) }
        /\*!\*/ { if (col) sizes[name] += bytes; col = 0 }
        END { for (i in sizes) print sizes[i], i }
      ' |\
      sort -n |\
      tee /tmp/$log.sizes &
    done

This created a list of byte sizes of base64 encoded row data to table name. Groups took 820MB compared to the next largest table at 30MB.

Put this script in your toolbox. It’s also great for getting an idea of which tables are your hottest under normal operations.
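
If you prefer Ruby over awk, the same aggregation can be sketched like this. It is meant to mirror the awk rules above line for line; the function name and the sample input are hypothetical, and the sample only imitates the shape of ‘mysqlbinlog’ output rather than reproducing a real log:

```ruby
# Sum the bytes of base64 row data per table from `mysqlbinlog` text output.
# Mirrors the awk program above; `lines` is any enumerable of output lines.
def binlog_sizes_by_table(lines)
  sizes = Hash.new(0)
  name = nil
  bytes = 0
  collecting = false

  lines.each do |raw|
    line = raw.chomp
    name = line.split[8] if line.include?("Table_map")   # awk's $9
    if line.include?("BINLOG")
      bytes = 0
      collecting = true
    end
    bytes += line.length if collecting
    if line.include?("*!*") && collecting
      sizes[name] += bytes
      collecting = false
    end
  end
  sizes
end

# Hypothetical sample shaped like mysqlbinlog output for one row event.
sample = [
  "#110823 13:00:01 server id 1  end_log_pos 500 \tTable_map: `sc`.`groups` mapped to number 33",
  "BINLOG '",
  "AAAAAAAAAAAAAAAAAAAA",
  "'/*!*/;"
]

sizes = binlog_sizes_by_table(sample)
sizes.sort_by { |_table, size| size }.each { |table, size| puts "#{size} #{table}" }
```

To run it against a real log you could pass `IO.popen(["mysqlbinlog", path])` as `lines`, where `path` is the binlog file to inspect.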

We also used our recently finished HDFS-based log aggregation system to run map reduce jobs over our web and app logs to identify any possible abuse vectors around groups that were coming from the outside.

What we learned

When maintenance around abusive usage is also a part of your business, think about the data and impact on your running system for all maintenance work.

Cut your losses early and resist the temptation to find the “root cause” during an outage incident. Fail forward. Save what you can for forensics after you’re back up.

Uncleared acknowledged alerts are alertable offenses. A big question during the incident was, “where were the alerts?”. It turned out that the host-based disk space check was previously acknowledged because we had an unrelated partition fill on the same host within expectation. This acknowledgment wasn’t cleared before the binlog partition filled so we didn’t get the 20 minutes of lead time we could have had.

In the heat of the moment, put your heads together, make a plan for the next X minutes, execute, and repeat until you’re back up. All the engineers were at battle stations during this outage. This incident blindsided us and all kinds of theories were thrown around. When we focused on do X within Y minutes we got down to time boxing research to be able to take our next action. This worked very well.

Pay off your technical debt. In the past, we took a loan on the future for trading space for time. Paying off these kinds of debts is easy to defer, until the collector comes to visit. Keep a list of the debts chosen, and the debts discovered. Even without an estimated cost to fix, your debt is a learning tool for others for where and when measured choices towards the road of delivery are made.

Be as specific as you can with your expected data types. We never expected group descriptions to be larger than 2KB. We should have encoded that expectation loud and clear in the data and business layers.
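
As a minimal sketch of encoding that expectation in the business layer: in a Rails model this would normally be a length validation paired with a sized column, and the plain Ruby below only illustrates the check itself. The constant, the error class and the helper name are hypothetical.

```ruby
# Hypothetical limit taken from the 2KB expectation described above.
MAX_DESCRIPTION_BYTES = 2 * 1024

class DescriptionTooLong < StandardError; end

# Reject oversized descriptions before they ever reach the database.
def validate_description!(text)
  size = text.to_s.bytesize
  if size > MAX_DESCRIPTION_BYTES
    raise DescriptionTooLong, "description is #{size} bytes, limit is #{MAX_DESCRIPTION_BYTES}"
  end
  text
end

validate_description!("A group about field recordings.")

begin
  validate_description!("x" * 50_000)   # roughly the size of a spam description
rescue DescriptionTooLong => e
  puts "rejected: #{e.message}"
end
```

Checking the byte size rather than the character count keeps the limit aligned with what the storage layer actually has to write.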

Doing the right thing

We are all working with the best practices in mind, yet the combination of all that we were doing correctly ended up with this outage nobody expected. Yesterday was an incredible learning experience for everyone at SoundCloud, and the entire team joined together with a positive spirit and heartfelt passion to restore service as quickly as possible. I’m quite proud to work with everyone here.

Sean Treadway
August 22nd, 2011

SoundCloud mobile – Proxies

The Problem

The mobile version of SoundCloud is a consumer of our own API dog food. That decision was made with the intention to deploy a self-sufficient client application that depends only on a static provider. Our early experiments showed that this approach had some downsides. For example, the implementation of redirects in CORS does not behave properly and therefore can’t be used with many of the endpoints in our API where we rely on correct handling. Classic XHR communication with the API is not an option either, because the same origin policy applies even across subdomains.

Read the rest of this entry »

alx
August 2nd, 2011

Building the SoundCloud mobile site using backbone.js

Until early this year, there was a gap. A gap between the desktop-targeted main SoundCloud site, what we call the ‘mothership’, and the native iOS (iPhone, iPod touch) and Android applications. A common and frustrating use case was mobile Twitter: someone would share a new favorite or upload on Twitter, you would tap on it, and it would try to load the regular site on your tiny smartphone screen. Pushing the whole desktop site over a mobile connection would be a waste of precious bandwidth if you only want to check out a track. Alternatively we could try to redirect to our native apps, but there’s no guarantee that the user has them installed, and the mobile vendors don’t offer any APIs for verifying that in advance.

With that in mind, back in December 2010, we set off to build SoundCloud Mobile, targeting the mobile browsers of iOS and Android. The analytics of the existing site told us that these two platforms make up the overwhelming majority of our users, so we started there. As a mid-term goal, we decided to expand our support to other devices, as long as they have a browser capable of streaming audio.

For the architecture of the site we decided to make it a SoundCloud API client, eating our own dogfood just like the native iOS and Android apps already do. With that in mind, we considered the option of building a single-page web application (vs classic serverside rendered pages). To figure out how viable that option is, we spent a week building a prototype based on jQuery Mobile. The prototype included a start page with hot tracks, a basic search, people and track pages and basic audio streaming. The lists used the theme provided by jQuery Mobile, everything else was barely styled HTML. This prototype helped a lot in making several important decisions:

  • Building a single-page app was feasible, with the client side application as the direct API client. Later we had to back away a bit from that, introducing a proxy to decorate the API (and work around WebKit bugs), but overall most of the action is still happening on the client.
  • jQuery Mobile works great for a fixed number of preloaded pages and an infinite number of server-generated pages, but not for our use case of generating all pages on the fly based on API results. We needed much more flexible routing with HTML5 history.pushState support, so that we could support the same URL sets as the main site.
  • On a similar note, jQuery Mobile’s theming system allowed us to build a pretty prototype in no time, but wasn’t a good fit for the completely customized UI that we wanted.
  • Audio streaming on mobile is still very immature. Even with support for only iOS and Android, plenty of workarounds are required for a somewhat consistent experience.

After throwing away the first prototype, we moved on to create our own basic framework. It described the domain classes like ‘track’ and ‘user’ as global singleton objects. Our ‘router’ object was responsible for passing the model data on to the responsible controller method. Soon we could see that the approach wouldn’t scale that well, especially when simultaneous instances of a class were required on the same page.

After dismissing a few bigger client-side MVC frameworks, we stumbled upon Backbone.js, which was compact, easily extendable and depended only on Underscore.js. Backbone sets up only the application structure and offers a multitude of convenient methods that can be used while building your app. It doesn’t dictate how the application UX works, nor does it prescribe how the templates have to be structured. While that still left a lot of open questions for us to answer, it also didn’t impose too much unwanted structure.

Backbone.js lets you choose your own templating engine, and we went with the jquery-tmpl plugin. We restricted our template usage to output and iteration within the template, both to give us the option of switching to another template engine (e.g. handlebars.js) and to keep our sanity. To implement the remaining presentation logic, we used the route suggested by Backbone.js, preparing the data for output in the Model’s toJSON method. This also has the advantage of keeping the model itself clean, making it easy to update the model and send it back to the server. In addition to that we added a decoration step, modifying the template output before inserting it into the DOM. This includes adding additional classes or removing empty nodes.

When we started using Backbone.js, it supported only hash-based history (what Twitter does today when it redirects twitter.com/ericw to twitter.com/#!/ericw). We wanted support for history.pushState to map URLs from soundcloud.com to m.soundcloud.com by only prepending the ‘m.’. We extended Backbone.history for that, while also triggering a custom event. The latter can be used by the Google Analytics tracker or any other component that has to get an update on the current page state.

We also extended the regular Backbone.sync method, used by all Models and Collections to exchange data with the server, to add a client side cache, backed by the HTML5 sessionStorage. That way we didn’t have to keep any pages in memory, but could instead rerender them from scratch in milliseconds, as the underlying data is still available in the cache.

With those components in place, a click (or rather, tap) on any internal link caused the following actions:

  • The click/tap event is handled: the default browser action is prevented and history.pushState is used instead to update the current address. Backbone.router is then told that the page changed.
  • Backbone.router maps the URL to a controller method, which creates the model for that URL, e.g. initializing the User model with the username parsed from the URL. It then creates the view and passes the model to that view.
  • The view tells the model to fetch its data. Once done, with data loaded from the server or from the client side cache, it passes the model to a template, decorates the result and inserts it into the DOM.
  • The view also initializes event handlers (via event delegation) to handle all interactions within that view, e.g. a click event on the ‘Play’ button to start streaming audio.

This turned out to be a very solid application architecture which we continued to fine-tune after the first public launch of the mobile site in March, when we redirected iOS and Android traffic from the main site. Since then we have continued to add features and improve the site, watching the traffic almost double every month.

Along with this new client side architecture we also experimented with alternatives for development and production. The node.js-based development and production server, including the API-proxy is covered in detail by our node ninja Alexander Simmerl. In the upcoming post we’ll also talk about our approach to testing with QUnit and PhantomJS.

matas
July 29th, 2011

Velocity Conference 2011 – Europe

The venerable O’Reilly Velocity Conference is coming to Berlin on the 8th and 9th of November. SoundCloud is doing what we can to help organize the speaker program and we want YOU to speak.

  • Have you turned an operations challenge into an opportunity?
  • Does web performance matter to you, and you’ve done something to prove it?
  • Do you transform your running systems’ raw data into meaty information?

This is an amazing opportunity to share your world class experience with European peers!

Achtung! The call for proposals ends August 9th so hurry to send your proposals in to the official website.

 

http://www.flickr.com/photos/givingkittensaway/85477841/

Sean Treadway
July 5th, 2011

MySQL for Statistics – Old Faithful

MySQL turns out to be a good Swiss Army Knife for persistence, if used wisely. Understanding disk access patterns driven by your storage engine is key. Choosing a read or write optimized disk layout will get you very far. We chose a read-optimized disk layout using InnoDB and MySQL for statistics.

Read the rest of this entry »

Sean Treadway
May 11th, 2011

Experiment 02: Destroying SoundCloud & Instagram

I was fortunate enough to be given the opportunity to help Moby premiere his new record via SoundCloud. I didn’t know what to expect from @TheLittleIdiot‘s latest piece of work. However, I soon learned he had created a phenomenal album with a perfectly crafted and inspiring theme: Destroyed

i don’t sleep very well when i travel. and as a result, i tend to be awake in cities when everyone else is asleep. that’s where this album, and the pictures that accompany it come from. it was primarily written late at night in cities when i felt like i was the only person awake (or alive), a soundtrack for empty cities at 2 a.m, at least that’s how i hear it. the pictures were taken on tour while i was writing the album. i wanted to show a different side of touring and traveling. a side that is often mundane, disconcerting, and occasionally beautiful.

ME == PUMPED

Read the rest of this entry »

Lee
May 4th, 2011

Introducing the Large Hadron Migrator

Rails style database migrations are a useful way to evolve your data schema in an agile manner. Most Rails projects start like this, and at first, making changes is fast and easy.

That is until your tables grow to millions of records. At this point, the locking nature of ALTER TABLE may take your site down for an hour or more while critical tables are migrated. In order to avoid this, developers begin to design around the problem by introducing join tables or moving the data into another layer. Development gets less and less agile as tables grow and grow. To make the problem worse, adding or changing indices to optimize data access becomes just as difficult.

Side effects may include black holes and universe implosion.

There are a few things that can be done at the server or engine level. It is possible to change default values in an ALTER TABLE without locking the table. The InnoDB Plugin provides facilities for online index creation, which is great if you are using this engine, but only solves half the problem.

At SoundCloud we started having migration pains quite a while ago, and after looking around for third party solutions [0] [2], we decided to create our own. We called it Large Hadron Migrator, and it is a gem for online ActiveRecord migrations.

LHC
The Large Hadron collider at CERN

The idea

Read the rest of this entry »

Rany
April 28th, 2011

Web Scale Statistics – Failing with MongoDB

As SoundCloud rapidly grows, our initial systems need an overhaul. Our scaling strategy has been very realistic: design for 10x our current usage. Our initial statistics system, found under http://soundcloud.com/you/stats, was built when we had 100k users and has lived long past its expiration date.

Background

About a year ago we started off with the goal of redesigning the statistics pages to support 500 playbacks a second. We knew that this would be a write-heavy workload and that to sustain bursts of writes, we’d need decent partitioning. Coming from a successful experience moving the Dashboard feature to Cassandra 0.6 we started out prototyping a design that would be easily partitioned.

The write side of this story went very well: Cassandra could keep up with everything we threw at it. However, naively pulling out all the aggregates we were collecting took hundreds of queries to the cluster. Cassandra didn’t have atomic counters at the time, so we had a lot of individual counts that needed to be summed on the client. (This is changing with the much anticipated upcoming 0.8 release!)

In a one-night experiment, we re-implemented the Cassandra based prototype to be backed by MongoDB. Not only could this quick prototype consume events as fast as Cassandra, there were some server side features in MongoDB that we could use to simplify a few of the queries that we had for the stats like atomic inplace insert/updates (upserts) to use fewer documents and secondary indexes to build the time series. Plus it was web scale.

Read the rest of this entry »

Sean Treadway
April 28th, 2011

Marbleo.us

Greetings! I’m Robb and this is my first SoundCloud Backstage blog post. During the day I’m a developer working on the Mac App here in the SoundCloud office, but I’m also a university student. It was through Uni that I found out about and entered the annual informatiCup competition with my friend Simon.

Although we didn’t make it into the final round, I consider the project we made – a web-based marble run simulator called Marbleo.us – to be a success. Here’s a fun example map to try it out.

So let me give you a quick overview of what we did, how we did it and what I’ve learnt building Marbleo.us.

Read the rest of this entry »

Robb
March 1st, 2011

Experiment 01: Puzzle To Unlock

This “Puzzle to Unlock” concept came to me straight from Manchester Orchestra’s wonderful label/management team, and we were able to pull it together very quickly with SoundCloud, jQuery, and SoundManager2.

The band wanted to build a bit of excitement around the premiere of their new single “Simple Math.” So we developed a way to tease the song with dialogue from the artist and actual clips of audio released as “pieces” to a puzzle that will unlock both the album cover and track. Cool, right?

Interaction was created using these basic jQuery commands: draggable and droppable.

Every time you drag the puzzle piece from its container at the bottom, a hidden droppable div moves to the right pixel location where the piece will fit. If you find that location, the piece will glow and dropping it will lock the piece into place.

A successful drop will queue up a SoundManager2 powered / SoundCloud hosted track and morph the piece’s container into a nice little play/pause button for replaying purposes.

Once all of these pieces are released and placed correctly, you’ll be able to hear the full version of “Simple Math,” and trust me – it’s worth building a puzzle for. Enjoy!

Lee