December 9th, 2011

Stop! Hacker Time.

At SoundCloud we like to invent new ideas. But we’re not adverse to implementing really great tried and tested ideas like the 20% time concept made famous by Google.

We’re calling it Hacker Time. We’re still very much in start-up mode so we’re keen to nurture the spirit of hacking. We’ve been testing out Hacker Time for a few weeks now and we’re excited about its potential, from industry-changing initiatives like “Are we playing yet” to unusual passion projects like the “Owl Octave”.

Why we’re doing it

When I arrived at SoundCloud I gathered all the engineers and developers in a room and asked them what they wanted more of. This was one of the top requests:

Like every other fast-paced tech company out there, the everyday demands of development leave little time for pet projects or invention outside of the roadmap. But we shouldn’t let great ideas slip away: it’s wasteful of talent and it’s frustrating for developers themselves. Hacker Time is an attempt to keep both the ideas and the employees focused on making the product even better.

I don’t think there’s a single developer at SoundCloud who isn’t a sound creator. Or at least they quickly become one once they’ve joined. We’re a company of electronic music producers, acoustic musicians, field recordists and social sound fanatics. Of course we’re developing the product for 9 million people, but we’re also developing something we want to use ourselves.

Ironically we’re not trying to invent anything new by initiating Hacker Time. Aside from Google, we’ve been watching the Atlassian experiment and are taking a similar approach; we’ll start with a simple set of rules and adapt to what works best for us. Like Atlassian, we’ll be blogging about our experiences too.

Which projects are engineers allowed to work on?

We’ve compiled a list which is not meant to be restrictive but rather should be a guideline:

  • Pet features/improvements that never made it onto the roadmap
  • Apps based on the Soundcloud API
  • Hack Days
  • Anything for SoundCloudLabs.com
  • “this always annoyed me” bug-fixes or architectural improvements
  • Integration of some technology-du-jour with Soundcloud
  • Contributing to OSS used at Soundcloud
  • Other cool projects using SoundCloud
  • Conferences
  • Writing Blog Posts, Technical Articles

How will time be allocated?

The big decision is how to allocate time to these projects. We don’t want to cannibalize valuable product development time but we do want to give people as much freedom as possible,

There are lots of approaches out there – from reserving one day a week, accumulating time to set aside whole weeks or handling that time like vacation. We’ve decided to put the decision-making in the hands of each team, allowing them to allocate Hacker Time according to each team’s workstyle. It’s an experiment, and something we’ll be reviewing in the early stages.

Demo it!

We’re pretty disciplined about demoing work to the whole company. Hacker Time projects will be included in the demos (which happen every 2 weeks at SoundCloud), giving developers a chance to showcase their projects and sparking interest in their hacks from everyone in the organization.

Watch this blog for further reports!

Many thanks to

Atlassian for blogging about their experiences

Simon Stewart from Google
Jim Webber from Neo4J
Jan Lehnhardt from Couchbase
Stefan Roock from it-agile

for sharing their experiences with us.

Alexander Grosse
November 21st, 2011

Front-end JavaScript bug tracking

Proper and effective error tracking is a common issue for front-end JavaScript code compared to back-end environments.

We felt this pain as well and experimented with different solutions over the past months on the SoundCloud Mobile site.

Analytics

The first approach we had was to track errors with Google Analytics. Their library permits to fire custom events and whenever an ajax error would occur, we would log it.

The biggest benefit of this tool is to monitor the stability of the site and its evolution in longer periods as you can easily go back a few weeks or months to see which events were triggered. Also, it is easy to implement – almost a one-liner!

The drawback, at least for Google Analytics, is that this tool is not meant to track bugs. There is no way to add custom data to these events to get more insight about why and how an error happened, it also doesn’t work in real-time, and you obviously want that when you debug.

So we kept Analytics in place for a long-term view, but took a look at other options for real-time and in-depth tracking.

Airbrake

In our pursuit of getting more insight, we decided to take a look at Airbrake because we were already using it to track back-end errors on our main site.

Our mobile site runs on Node.js, the first thing we did was to integrate an existing plugin for it to handle error tracking on the back-end as well.

Looking a little further we found a front-end notifier, which would catch errors that would fire on window.onerror, but there was no way to report any custom errors.

We decided to take a day to hack this on our own since their API is public and easy to implement.

The benefits of Airbrake were instant. We could see what triggered which error, how, why, in which context, which browser, etc… in real-time!


It also counts errors, which can help you prioritize and include fixes in your roadmap.

However, the lack of filtering, grouping and custom sorting made it difficult to work with. There was also no sense of time or progress, as everything just gets dumped into a single list ordered by time. We needed something a little better than that.

BugSense

That’s when our Android team showed us their BugSense implementation.
BugSense seemed to address all of these issues we had with Airbrake: grouping is more effective, searching and filtering is possible, charts of errors are drawn as well.

There is one more benefit over Airbrake… JSON. No need to convert objects to XML strings anymore!

If you are interested in our BugSense notifier you can find the source on github.

Conclusion

There is still a lot of work needed to make front-end JS debugging as easy as it is for regular back-end environments.
For example, stack traces today aren’t that useful, because of anonymous objects and minified code, but hopefully browser vendors will tackle these issues soon. Maybe Source Maps could be the first milestone in this quest.

At SoundCloud, we will continue to use a combination of these tools because of the different strengths outlined above, but there are also other tools we didn’t try out yet like getexceptional or errorception. If you have tried these, or if you have any suggestion on this subject we’d like to get your feedback in the comments below.

Happy debugging!

Yves
November 9th, 2011

SoundCloud launches the HTML5 Audio Improvement Initiative

We at SoundCloud want to build the best sound player for the web, and we want to do that using the Open Web standards. While working on the native audio features on our mobile site and new widgets, or even as an experiment on the main site, we have discovered that the HTML5 Audio standard is not equally well implemented across all modern browsers and some decisions can be made that would benefit the web audio users and web developers alike. Soundcloud launches the “Are We Playing Yet?” project, which aims to raise the awareness about the state of HTML5 Audio implementations in the web browsers.

AreWePlayingYet? 2014 A pragmatic HTML5 Audio test suite

We have decided to help the parties involved and collect the issues in one place, document them, provide the code and add interactive tests that will show the implementation progress. We understand how the software development works, and that a few iterations are needed until something is fully done. We hope ”Are We Playing Yet?” can function as a handy development and quality monitoring tool.

Issues - soundcloud/areweplayingyet - GitHub

“Are We Playing Yet?” was started by SoundCloud but it’s open to all companies and developers who care about the state of HTML5 audio and want to build applications based on this Web standard. You can get the project source on GitHub, contribute tests and fixes via the pull requests or Issue Tracker, and connect to the people involved via @areweplayingyet on Twitter.

matas
October 27th, 2011

Velocity Europe Birds of a Feather Meetup

With excitement building for the Velocity Conference in Berlin, we are happy to announce a pre-event meet up on Monday, November 7 at the Betahaus in Berlin.

The event will be a great opportunity for visitors of the Velocity conference to beat the jet lag with a day of networking and relaxed sessions and talks. To get everyone in the Berlin-state-of-mind, the evening will be capped with a party.

Berlin has quickly established itself as one of Europe’s main technology hubs. There are a lot of web companies in Berlin which are part of a constantly growing and vibrant local web community. The location for the BoF, Betahaus, is a shared-office innovation-inducing place that is used by small and large startups as well as established organisations.

Participation is free for Velocity Europe attendees. Non-Velocity attendees can pay 50 Euros to take part in the BoF event. The evening party is open to all and does not require registration.

The pre-event will be organized as an open space or unconference with on-spot planning of the sessions. The event is limited to 150 participants who must register in advance. If the event is not sold out in advance, participants can show up at the door.

At the Betahaus, we have 2 larger rooms and 4 smaller meeting rooms. We will hold presentations in the larger rooms and hold discussions in the meeting rooms. The meeting rooms for the discussion can hold up to 10 people. Anyone interested in speaking at the event can contact Eliot (eliot(at)soundcloud.com)  to submit a discussion topic.

The day will close with an open party . Event participants will enjoy a dinner buffet & free drinks at the after-event party.

You can register for the event here: http://veubof2011.eventbrite.co.uk/ (free for Velocity attendees / 50 Euros at the door for non-Velocity attendees)

Outline agenda (subject to change):

  • 13:00: Arrive at Betahaus
  • 14:00: Event starts with welcoming remarks and the session planning
  • 15:15 – 16:15: Sessions 1 & 2, each 30 minutes
  • 16:15 – 16:45: Coffee break
  • 16:45 – 17:45: Sessions 3 & 4, each 30 minutes
  • 17:45 – 18:15: Wrap up
  • 18:30: Dinner buffet for event participants
  • 19:30: Open party

Event location:
Betahaus
Prinzessinnenstraße 19-20
10969 Berlin
Betahaus on Google Maps

 

 

Eliot
October 14th, 2011

SoundCloud Signs Apache Corporate Contributor License Agreement

We just signed the corporate contributor license agreement (CCLA).
SoundCloud always was big on open source – we nearly exclusively use open source software in our company and use a lot of Apache projects like Hadoop, Solr, Flume, Zookeeper and Cassandra on our large scale production site.

As SoundCloud is using a lot of Apache projects and started to contribute to project we decided to sign the CCLA and enable all our developers to contribute to Apache projects even during work time if that project is used by SoundCloud.

The first project we will commit code to is Flume, we hope there are several more coming.

Apache, keep up the great work and we will support you wherever possible!

Alexander Grosse
September 12th, 2011

Mobile: Unit Testing

When we started the Mobile project early 2011, unit testing JavaScript was one of the goals to tackle on the technical side. The history of custom JavaScript code at SoundCloud up until then rarely included unit tests, so providing references and the necessary ground research was important for both the project at hand as well as for other projects at SoundCloud.

This articles aims to provide an overview of the tools we use, what worked well and what we need to improve.

Tools

When we started the Mobile project, there were just two developers on the team, Matas and Jörn. With Jörn already maintaining and supporting QUnit for three years, this particular choice was an easy one. If you haven’t yet heard of it: Among available unit testing frameworks, QUnit is among the most popular ones. There’s a comprehensive tutorial over at ScriptJunkie.

As we were building an API client in the browser, mocking API requests was really important for us. We didn’t want to depend on the API being available, both to be able to work offline and to not depend on data that changes all the time. At the start of the project, jQuery 1.5 and its ajax extension points like custom transports weren’t available yet, so we went with mockjax, a library adding mocking on top of jQuery’s ajax module.

To run tests in continuous integration systems (at SoundCloud, on Jenkins), we looked at quite a lot of options. Jörn has some slides that give an overview of that research. Other teams at SoundCloud use Selenium, which wasn’t an option for us due to the lack of support for Chrome or Safari (which is still a work in progress). In the end we went with PhantomJS. PhantomJS is built on top of Qt-WebKit, provides a reasonable browser-like environment and enough API to run our unit tests and report back results.

We considered using TestSwarm to distribute running of our unit tests to regular desktop browsers as well as mobile devices. The lack of a Jenkins-TestSwarm plugin (now actually available) as well as tools for managing VMs, browsers, simulators and emulators (or even managing mobile devices) was enough of a hurdle that we skipped this. Until we get this in place, we won’t know how many bugs we could have catched earlier with this additional setup.

The Good

QUnit does a pretty good job. The few small issues we encountered were swiftly fixed upstream. We ended up customizing the module-method quite heavily, mostly to integrate Mockjax. Overall, Mockjax also did a pretty good job, once we figured out a pattern that worked for us. Here’s a typical module-call for testing Backbone Views and Models that fetch their data from the API:

module("user", {
  "/users/183/tracks": "/fixtures/forss-tracks.json",
  "/users/183/playlists": "/fixtures/forss-playlists.json",
  "/users/183/favorites": "/fixtures/forss-favorites.json",
  "/users/183/groups": "/fixtures/forss-groups.json",
});

We still call the module-method with the module-name as the first argument. The second argument can contain setup- and teardown-properties, just like QUnit expects it. In addition, we pass url-mock pairs, which are passed on to $.mockjax. In addition to those, we define a catch-all to make sure that no test ever ends up calling the actual API. And we have a global timeout for each test to ensure a broken async test never prevents the suite from finishing.

var testTimeout;
module = function(name, mocks) {
  QUnit.module(name, {
    setup: function() {
      if (mocks) {
        if (mocks.setup) {
          mocks.setup.apply(this, arguments);
        }
        $.each(mocks, function(url, mock) {
          if (/setup|teardown/.test(url)) {
            return;
          }
          if ( $.type(mock) === "string" ){
            $.mockjax({
              url: "/_api" + url,
              proxy: mock,
              responseTime: 1
            });
          } else {
            $.mockjax($.extend(mock,{url: "/_api" + url}));
          }
        });
      }
      $.mockjax({
        url: "/_api*",
        responseTime: 1,
        response: function(obj){
          var message = "Mockjax caught unmocked API call for url: " + obj.url
          if (obj.modelType) {
            message += ", from component " + obj.modelType;
          }
          ok( false, message );
        }
      });

      testTimeout = setTimeout(function() {
        equal( true, false, "test timeout (5s)" );
        // could involve multiple stop calls, reset
        QUnit.config.semaphore = 1;
        start();
      }, 5000);
    },
    teardown: function() {
      clearTimeout(testTimeout);
      $.mockjaxClear();
      if (mocks && mocks.teardown) {
        mocks.teardown.apply(this, arguments);
      }
    }
  });
};

The problem with this design was the lack of a $.mockjaxClear(url) method – you can’t remove an existing handler or replace it (mockjaxClear(index) is supported, but didn’t help us). We needed that to test error conditions, for example, when the API returned a 404 when asking if a particular track was a favorite of a user. In some cases, we could just mix it with other mocks. In other cases, we grouped these tests into a separate module-call (with the same name):

module("user", {
  "/users/183/playlists": {
    responseStatus: 500,
    responseText: "servererror",
    responseTime: 1
  }
});

With that, we did the regular tests in one place, the error conditions in the other.

The Bad

An interesting QUnit feature, inspired by Kent Beck’s work on JUnit MAX, is its built-in reordering. It basically records the results of one test run in sessionStorage, then looks at those results during the next run. If a test failed before, its scheduled to run first. All that happens without changing the order of the result output. If it works, you can get the relevant test results much faster then for regular sequential runs, as its likely that tests that failed before will fail again, while passing tests are a lot less likely to start failing.

The problem with that reordering for us was that with all the asynchronous tests in our suite, sometimes tests would have side effects on other tests. As long as they ran in a fixed order, those effects weren’t noticeable. Instead of addressing the actual side effects, we ended up disabling the reordering. Its on the pile of chores to still address.

Overall, the unit tests did a good job, though its not quite clear how much value they actually provided. Most bug reports are about visual issues, sometimes small glitches, often enough device specific issues. As a mobile web developer, Android, or Andy as we started to call it, becomes kind of an IE6. It gets updated only with the OS, the OS isn’t updated, so we’re stuck with this browser that was okay a year ago, but is a real pain today. On Android 2.1, you even have the same issue as on IE6: HTML5 elements like ‘header’ or ‘article’ aren’t styled. At least on IE6, there’s a workaround…

Anyway, the other category of bugs were reported much less frequently, and unit testing didn’t help there either. We learned that client-side error logging is extremely valuable. Tools like Airbrake and Bugsense still have a long way to go, but writing a single-page web application without logging of client side errors means you never know about the thousands of errors your users get to see. Expect another post on that topic.

The Ugly

As long as mockjax did its job, we were happy with it. When it didn’t, we had to look at the source, and we weren’t happy anymore. The whole thing is quite a mess and in dire need of some good refactorings. Still, in terms of features, alternatives like jQuery 1.5 custom transports or sinon.js just aren’t on par, so we stuck with mockjax.

What we now mostly gave up on is PhantomJS. The Jenkins-job that ran our QUnit tests using PhantomJS is currently disabled, as it kept failing for months. We spent overall several days trying to find the source of the one failing test, giving up at the end. We still don’t know why it was failing, and there were several hurdles that made it difficult to debug:

  • It failed only on our Jenkins server. Running the tests locally, using the same PhantomJS version, worked fine. The difference was the enviroment, with mostly OSX running on developer machines, but Debian Lenny on the Jenkins box. Sure, that’s a problem, but the point of the tool is to provide a browser-like enviroment, it shouldn’t matter what system its running on.
  • We were stuck with PhantomJS 1.1, even after 1.2.x was out for several months. While we could adapt to the completely backwards incompatible API changes from 1.1 to 1.2, we didn’t find any way around PhantomJS just crashing on our testsuite, with no useful output. If you’re interested, you can find the debugging process somewhat documented on this Google Groups thread. Even debugging with gdb proved to be a waste of time. The unhelpfulness of PhantomJS when failing to load a page is stunning.

So as nice as PhantomJS is, the combination of not being able to upgrade and not being able to fix the existing build forced us to abandon it. TestSwarm is a lot more interesting now with the existing Jenkins plugin. And with Chrome support upcoming in Selenium, that is an attractive short term solution as well.

Epilog

As you can see, this story isn’t over yet. It seems to share a common theme with other developer tools, be that editors, bug tracking or testing tools: most of them do their job, but we aren’t satisfied with any of them.

What are your experiences? What tools would you like to see improved, replaced or invented?

Jörn
September 2nd, 2011

Hack Your Way to Berlin

Berlin is playing host to the oh-so-sold-out JSConf.eu conference happening October 1-2. Some members of the SoundCloud engineering crew well be on-site and we thought it would be great if you could come too.

We have an extra ticket for the JSConf.eu conference that we are going to raffle away in a hack competition. The rules of engagement are simple: use this podcast on the periodic table from Roman Mars and build a page that presents it in an awesome way. Be it with a very creative playback method, aggregating content around the sound or building the nicest design we’ve ever seen. The only constraints are that you use JavaScript and that the page is served from GitHub Pages. We want to see the most interesting and incredible page that you can build to play this track.

The page with the most creativity and “pop” will win the JSConf.eu ticket. We will get you from where you are in Europe to Berlin for the conference and find a nice place for you to sleep.

Send your entry to eliot@soundcloud.com by 23:59:59 Central European Time on September 18.

http://jsconf.eu/2011/

Contest requirements:
- You reside in Europe
- You are at least 18 years old
- You are allowed to travel to Germany
- You want to come to the JSConf.eu conference

tl;dr:
- Build awesome page with JavaScript highlighting this track
- Deploy to GitHub Pages
- Send a link to your page & repo to eliot@soundcloud.com until September 18
- Win JSConf.eu ticket, flight & accommodation for 3 nights

Judging will be done by SoundCloud staff by September 22. Have questions? Send them to eliot@soundcloud.com

Eliot
August 28th, 2011

SoundCloud Hits the Road

A quick virtual heads-up for developers, hackers and coding enthusiasts in the Berlin-area: SoundCloud is very happy to support the upcoming Hack and Tell event at the C-base bar on Tuesday, August 30. The Hack and Tell is a monthly meet up where participants have the chance to get on stage for five minutes present their personal or professional projects. For those not presenting, it’s a great evening to mix it up with Berlin-based hackers, meet some members of the SoundCloud development team and eat copious amounts of pizza. You can get a full rundown of the event here.

Berlin’s too far or you don’t like pizza? Here are a handful of the coming non-pizza-centric events where you can catch SoundCloud engineers offline:

Vision Sound Music
London, September 2-4

SF MusicTech
San Francisco, September 12

Mobile TechCon
Mainz, Germany, September 12-14

Golden Gate Ruby Conference
San Francisco, September 16-17

JSConf.eu
Berlin, October 1-2

Goto Aarhus
Aarhus, Denmark, October 13-14

Paris Web
Paris, October 13-15

Velocity
Berlin, November 8-9

Eliot
August 24th, 2011

Doing the right thing

The recent outage of SoundCloud was the result of everybody doing the right thing. This totally jives with John Allspaw’s message that looking for a root cause will lead to you finding smart people simply doing their jobs.

This is what happened.

 

The technologies at play

The number of interactions that escalated to the outage should be interesting for other Rails/MySQL shops out there. We’re using many idiomatic patterns within Rails that ended up having devastating consequences.

Cache keys that avoid invalidation

It’s best practice to form your cache keys in a way that doesn’t require you to issue cache key deletions, and rather let normal evictions reclaim garbage keys. To do this we use the ‘updated_at’ column on some of our models in the cache key so that if the model updates, we know we’ll get a new key.

ActiveRecord::Timestamp#touch

There is an innocuous method called ‘touch’ that will simply bump the updated_at and save that record. This is quite convenient to call on containers of has_many things like a forum topic is to its posts. With a recent ‘updated_at’ and consistent way to keep that recent via ‘touch’, there is a clean decoupling of cache strategy from business modeling when you intend to communicate the container’s version is as recent as the most recent addition.

ActiveRecord::Associations::BelongsToAssociation :counter_cache

In the absence of indexed expressions and multi index merges that some databases have, MySQL and InnoDB leaves it to the application to keep lookups of counts efficient. When dealing with tables of multiple millions of rows, a simple query like:

select count(*) from things 

Could take tens of seconds as InnoDB actually needs to traverse all primary keys and literally count the rows that exist which are visible in the current transaction.

class Foo belongs_to :things, :counter_cache => true end 

Is a simple and convenient workaround to avoid the ‘count(*)’ overhead where when a new Foo is created on a thing, that thing’s ‘foos_count’ would get a database increment by one. When it’s removed, the ‘foos_count’ would be decremented by one.

ActiveRecord::Associations::Association :dependent => :destroy

What better way to maintain all the business rules on deletion than to make sure your model’s callbacks fire when a container is destroyed. When your business consistency is maintained in the ORM, this is also the best place to ensure proper business rules on removal.

This simply does the following for every association:

thing.foos.each(&:destroy) 

Before:

thing.destroy 

It’s best to have the behavior declared where the association is declared and let the framework make sure it’s not forgotten when actually performing the destruction.

More than 140 characters

Some of the features of SoundCloud deserve more than 140 characters to do them justice. Even more than 255 characters. The tools that Rails gives you out of the box when you need more than 255 characters on a string in your data layer are limited; you’re left with the ‘text’ type in your schema definition. In MySQL this translates to a “TEXT” column.

A TEXT column with a maximum length of 65,535 (216 - 1) characters. 

Well, in most cases we wouldn’t need more than 1000 characters for much on our site, but in the spirit of “deployed or it didn’t happen”, our early database schemas were mostly driven by the simplest option that ActiveRecord::Schema offered. (We also have way too many NULLable columns)

Trade space for time or time for space

The early days, SoundCloud ran on 3 servers. CPU was precious so for some of the HTML conversion tasks we traded space for time and are storing a cached version of the HTML for some of the longer text fields in the same row as the record. This was the right choice at the time for the request load and available CPU.

(time is usually easier to scale than space)

Tune your DBs for your workload

We separate reads from writes on our data layer, and we also have slightly different tunings on the DB slaves that accept reads. We also have experienced statement-based asynchronous slave replication breaking the replication thread due to these different tunings on different hardware.

We use row-based (actually mixed) replication between our masters and slaves because it’s as close as you’ll get to the storage engines speaking directly to each other, minimizing the risk of differences in hardware/tuning interfering with the replication thread.

Alert on disk space thresholds

We have a massive amount of Nagios checks on all our systems, including host-based partition %free checks. When any partition on a host reaches a threshold of free space, an alert is sent.

Separate data and write ahead log physical devices

Most OLTP databases have data that is bound by random read/writes, whereas binary logs are fundamentally sequential writes. You want these workloads on different spindles when using rotating disks because a transaction cannot complete without first being written to the binary log. If you need to move the drive head for your binlog, you’ve just added milliseconds to all your transactions.

Clean up after yourself

Periodically there are administrative tasks that need to be performed on the site like mass takedowns of inappropriate content. The Rails console is amazing when you need to work directly with your business domain. Fire it up, get it done. For one-off maintenance tasks this is a life saver.

Add Spam, Mix, Bake and we’ve been Served

This all adds up, reviewing the good parts:

  • Abstract away bookkeeping in your domain model
  • Leverage existing patterns to get the job done quickly
  • Tune and monitor your DBs
  • Hand administer your site via your domain model

If you haven’t noticed yet, there have been some incorrigible entrepreneurs using some groups to advertise their pharmaceuticals distribution businesses. They have very thorough descriptions (5-50KB worth), and unprecedented activity with tens to thousands of posts in their own groups.

At 1:00pm yesterday, we were working through our cleanups, and cranked open a console with a large list of confirmed groups to remove. With Rails this is as simple as:

Group.find(ids).each(&:destroy) 

Looks innocent enough.

What the database sees

From the database perspective all of the automated bookkeeping and business domain extraction of individual destroys creates statements, this ended up looking something like this:


DELETE post
UPDATE group SET posts_count = posts_count - 1
UPDATE group SET updated_at = now()
... x 1-5000
DELETE track_group_contribution
UPDATE group SET tracks_count = tracks_count - 1
UPDATE group SET updated_at = now()
... x 1-5000
DELETE user_group_membership
UPDATE group SET members_count = members_count - 1
UPDATE group SET updated_at = now()
... x 1-5000

So we’re seeing 2N+1 number of updates on a group where N is the sum of associated objects.

What replication sees

When using row-based replication, any change to a row gets that entire row added to the binary log. Some of these groups had over 100k worth of text columns and hundreds of associated posts. When parsing a given binlog, these group updates were taking over 90% of the replication events being sent to the slaves.

What the binlog partition sees

This is what finally brought us down. We were producing over 3GB/min of binlogs to be replicated to our slaves from these many group updates. Our binlog partition filled up from 10GB to 100GB in a matter of 30 minutes.

The MySQL docs are clear about what happens with a full data partition. When data cannot be written, MySQL just waits. The behavior around the binlog partition wasn’t as clear. That last binlog event had a partial write. When the disk filled, the binlog corrupted. When that last event in the last binlog was attempted to be sent to the slaves, it failed and the slaves stopped replicating.

 Got fatal error 1236 from master when reading data from binary log:
'log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master' 

Our max_allowed_packet is big enough for any of our rows.

How we recovered

We had a master with live queries that were not coming back from a ‘killed’ state. We scaled the binlog LVM partition so that it could accommodate new writes now, but the DB was not budging. We had no idea how to get it to start writing again so we began a failover process.

All our slaves were at the same position just before the corrupted event, so we grabbed one, confirmed the tuning was good and then promoted it. We went through the many other slaves and reconfigured them and we were good, consistent to the last event. All we lost were the few processes that were waiting to write to the full partition.

Ironically, 2 of our team members are currently in the datacenter recabling some of our racks. We also have 4 swanky new DB class machines powered on, but we were a day or two away from getting them networked and integrated. We were just short of having that excess capacity to accept the spike in load from return visitors after the tweet “We’re back!”.

Forensics

We tried to understand the cause of the sudden binlog growth so we could safely enable the site without a replay of what just had happened. We expanded out some of the logs with ‘mysqlbinlog –verbose’ to show that it was filled with group spam. To confirm that it was group activity, we compared the replication data volume per table of a well sized binlog file with an abnormally large with the following awk program:

    for log in mysql-bin.03980{5..7}; do
      mysqlbinlog $log |\
      awk '
        /Table_map/ { name = $9 }
        /BINLOG/ { bytes = 0; col = 1 }
        { if (col) bytes += length($0) }
        /\*!\*/ { if (col) sizes[name] += bytes; col = 0 }
        END { for (i in sizes) print sizes[i], i }
      ' |\
      sort -n |\
      tee /tmp/$log.sizes &
    done

This created a list of byte sizes of base64 encoded row data to table name. Groups took 820MB compared to the next largest table at 30MB.

Put this script in your toolbox, it’s also great to use for getting an idea of which tables are your hottest under normal operations.

We also used our recently finished HDFS-based log aggregation system to run map reduce jobs over our web and app logs to identify any possible abuse vectors around groups that were coming from the outside.

What we learned

When maintenance around abusive usage is also a part of your business, think about the data and impact on your running system for all maintenance work.

Cut your losses early and resist the temptation to find the “root cause” during an outage incident. Failforward. Save what you can for forensics after you’re back up.

Uncleared acknowledged alerts are alertable offenses. A big question during the incident was, “where were the alerts?”. It turned out that the host-based disk space check was previously acknowledged because we had an unrelated partition fill on the same host within expectation. This acknowledgment wasn’t cleared before the binlog partition filled so we didn’t get the 20 minutes of lead time we could have had.

In the heat of the moment, put your heads together, make a plan for the next X minutes, execute, and repeat until you’re back up. All the engineers were at battle stations during this outage. This incident blindsided us and all kinds of theories were thrown around. When we focused on do X within Y minutes we got down to time boxing research to be able to take our next action. This worked very well.

Pay off your technical debt. In the past, we took a loan on the future for trading space for time. Paying off these kinds of debts is easy to defer, until the collector comes to visit. Keep a list of the debts chosen, and the debts discovered. Even without an estimated cost to fix, your debt is a learning tool for others for where and when measured choices towards the road of delivery are made.

Be as specific as you can with your expected data types. We never expected group descriptions to be larger than 2KB. We should have encoded that expectation loud and clear in the data and business layers.

Doing the right thing

We are all working with the best practices in mind, yet the combination of all that we were doing correctly ended up with this outage nobody expected. Yesterday was an incredible learning experience for everyone at SoundCloud, and the entire team joined together with a positive spirit and heartfelt passion to restore service as quickly as possible. I’m quite proud to work with everyone here.

Sean Treadway
August 22nd, 2011

SoundCloud mobile – Proxies

The Problem

The mobile version of SoundCloud is a consumer of our own API dog food. That decision was made with the intention to deploy a self-sufficient client application that depends only on a static provider. Our early experiements showed that the attempt we made had some downsides. For example, the implementation of redirects in CORS is not behaving properly and therefore can’t be used with many of the endpoints in our API where we rely on the correct handling. Also classic XHR communication with the API is not an option due to the same origin policy implications that apply even on subdomains.

Read the rest of this entry »

alx