August 30th, 2012

Evolution of SoundCloud’s Architecture

This is a story of how we adapted our architecture over time to accommodate growth.

Scaling is a luxury problem and surprisingly has more to do with organization than implementation. For each change we addressed the next order of magnitude of users we needed to support, starting in the thousands; now we’re designing for the hundreds of millions. We identified our bottlenecks and addressed them as simply as possible by introducing clear integration points in our infrastructure to divide and conquer each problem individually.

By identifying and extracting points of scale into smaller problems and having well defined integration points when the time arrived, we are able to grow organically.

Product conception

From day one, we had the simple need of getting each idea out of our heads and in front of eyeballs as quickly as possible. During this phase, we used a very simple setup:

Apache was serving our image/style/behavior resources, and Rails backed by MySQL provided an environment where almost all of our product could be modeled, routed and rendered quickly. Most of our team understood this model and could work well together, delivering a product that is very similar to what we have today.

We consciously chose not to implement high availability at this point, knowing what it would take when that time hopefully arrived. At this point we left our private beta, revealing SoundCloud to the public.

Our primary cost optimization was for opportunity, and anything that got in the way of us developing the concepts behind SoundCloud was avoided. For example, when a new comment was posted, we blocked until all followers were notified, knowing that we could make that asynchronous later.

In the early stages we were conscious to ensure we were not only building a product, but also a platform. Our Public API was developed alongside our website from the very beginning. We’re now driving the website with the same API we were offering to 3rd party integrations.

Growing out of Apache

Apache served us well, but we were running Rails app servers on multiple hosts, and the routing and virtual host configuration in Apache was cumbersome to keep in sync between development and production.

The Web tier’s primary responsibility is to manage and dispatch incoming web requests, as well as to buffer outbound responses so as to free up an application server for the next request as quickly as possible. This meant that the better our connection pooling and content-based routing configuration, the stronger this tier would be.

At this point we replaced Apache with Nginx and reduced our web tier’s configuration complexity, but our architecture didn’t change.

Load distribution and a little queue theory

Nginx worked great, but as we were growing, we found that some workloads took significantly more time than others (on the order of hundreds of milliseconds).

When you’re working on a slow request and a fast request arrives, the fast request has to wait until the slow request finishes. This is called the “head of the line blocking” problem. We had multiple application servers, each with its own listen socket backlog, which is analogous to a grocery store where you inevitably stand at one register and watch all the other registers move faster than your own.

Around 2008 when we first developed the architecture, concurrent request processing in Rails and ActiveRecord was fairly immature. Even though we felt confident that we could audit and prepare our code for concurrent request processing, we did not want to invest the time to audit our dependencies. So we stuck with the model of a single concurrency per application server process and ran multiple processes per host.

In Kendall’s notation, once we’ve sent a request from the web server to the application server, the request processing can be modeled by an M/M/1 queue. The response time of such a queue depends on all prior requests, so if we drastically increase the average work time of one request, the average response time also drastically increases.

Of course, the right thing to do is to make sure our work times are consistently low for any web request, but we were still in the period of optimizing for opportunity, so we decided to continue with product development and solve this problem with better request dispatching.

We looked at the Phusion Passenger approach of using multiple child processes per host but felt that we could easily fill each child with long-running requests. This is like having many queues with a few workers on each queue, simulating concurrent request processing on a single listen socket.

This changed the queue model from M/M/1 to M/M/c, where c is the number of child processes for every dispatched request. This is like the queue system found in a post office, or a “take a number, the next available worker will help you” kind of queue. This model reduces the response time by a factor of c for any job that was waiting in the queue, which is better, but assuming we had 5 children, we would just be able to accept an average of 5 times as many slow requests. We were already seeing a factor of 10 growth in the upcoming months, and had limited capacity per host, so adding only 5 to 10 workers was not enough to address the head of the line blocking problem.
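To put rough numbers on the difference, here is a minimal sketch of the textbook mean queue wait for M/M/1 and M/M/c queues (the Erlang C formula). The arrival and service rates are illustrative assumptions, not measurements from our system.

```javascript
// Mean time a request spends waiting in the queue (excluding its own work time)
// for an M/M/c queue. With c = 1 this reduces to the M/M/1 result.
//   lambda: arrival rate (requests/s)
//   mu:     service rate of one worker (requests/s)
//   c:      number of workers sharing a single queue
function meanQueueWait(lambda, mu, c) {
  const a = lambda / mu;            // offered load in Erlangs
  const rho = a / c;                // utilization per worker
  if (rho >= 1) return Infinity;    // the queue grows without bound

  // Accumulate sum_{k=0}^{c-1} a^k / k!; afterwards term holds a^c / c!.
  let sum = 0;
  let term = 1;
  for (let k = 0; k < c; k++) {
    sum += term;
    term = (term * a) / (k + 1);
  }

  // Erlang C: probability that an arriving request has to wait at all.
  const tail = term / (1 - rho);
  const pWait = tail / (sum + tail);

  return pWait / (c * mu - lambda);
}

// Assumed workload: each worker handles 10 requests/s and runs at 80% utilization.
const separate = meanQueueWait(8, 10, 1);   // five workers, each with its own backlog
const shared = meanQueueWait(40, 10, 5);    // the same five workers behind one queue
console.log(separate, shared);
```

Under these assumptions the shared queue waits roughly 0.055 s on average against 0.4 s for the separate backlogs, which is the intuition behind wanting c as large as possible.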

We wanted a system that never queued, but if it did queue, the wait time in the queue was minimal. Taking the M/M/c model to the extreme, we asked ourselves “how can we make c as large as possible?”

To do this, we needed to make sure that a single Rails application server never received more than one request at a time. This ruled out TCP load balancing because TCP has no notion of an HTTP request/response. We also needed to make sure that if all application servers were busy, the request would be queued for the next available application server. This meant we must maintain complete statelessness between our servers. We had the latter, but didn’t have the former.

We added HAProxy into our infrastructure, configuring each backend with a maximum connection count of 1 and added our backend processes across all hosts, to get that wonderful M/M/c reduction in resident wait time by queuing the HTTP request until any backend process on any host becomes available. HAProxy entered as our queuing load balancer that would buffer any temporary back-pressure by queuing requests from the application or dependent backend services so we could defer designing sophisticated queuing in other components in our request pipeline.
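The effect of a single queue in front of backends that each accept one request at a time can be seen in a toy model. This is a deterministic sketch, not HAProxy itself: the job list and the two dispatch functions are hypothetical, and all requests are assumed to arrive at the same instant.

```javascript
// jobs: work times in ms, listed in arrival order; all arrive at t = 0.

// Each backend has its own backlog and requests are assigned round-robin.
function roundRobinWaits(jobs, backends) {
  const freeAt = new Array(backends).fill(0);
  return jobs.map((work, i) => {
    const b = i % backends;         // fixed assignment, regardless of load
    const wait = freeAt[b];
    freeAt[b] += work;
    return wait;
  });
}

// One central queue; a request is handed to whichever backend frees up first,
// and every backend holds at most one request at a time.
function centralQueueWaits(jobs, backends) {
  const freeAt = new Array(backends).fill(0);
  return jobs.map((work) => {
    let b = 0;
    for (let i = 1; i < backends; i++) {
      if (freeAt[i] < freeAt[b]) b = i;
    }
    const wait = freeAt[b];
    freeAt[b] += work;
    return wait;
  });
}

// One slow request followed by five fast ones, on two backends.
const jobs = [1000, 10, 10, 10, 10, 10];
console.log(roundRobinWaits(jobs, 2));    // fast requests get stuck behind the slow one
console.log(centralQueueWaits(jobs, 2));  // fast requests drain through the free backend
```

With per-backend backlogs, half of the fast requests wait a full second behind the slow one; with the central queue, none of them waits longer than 40 ms.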

I heartily recommend Neil J. Gunther’s work Analyzing Computer System Performance with Perl::PDQ to brush up on queue theory and strengthen your intuition on how to model and measure queuing systems from HTTP requests all the way down to your disk controllers.

Going asynchronous

One class of request that took a long time was the fan-out of notifications from social activity. For example, when you upload a sound to SoundCloud, everyone that follows you will be notified. For people with many followers, if we were to do this synchronously, the request times would exceed tens of seconds. We needed to queue a job that would be handled later.

Around the same time we were considering how to manage our storage growth for sounds and images, and had chosen to offload storage to Amazon S3 keeping transcoding compute in Amazon EC2.

To coordinate these subsystems, we needed some middleware that would reliably queue, acknowledge and re-deliver job tickets on failure. We went through a few systems, but in the end settled on AMQP because of its programmable topology, as implemented by RabbitMQ.

To keep the same domain logic that we had in the website, we loaded up the Rails environment and built a lightweight dispatcher class with one queue per concern. The queues had a namespace that describes estimated work times. By starting one dispatcher process for each class of work, bound to multiple queues in that work class, we created a priority system in our asynchronous workers without adding the complexity of message priorities to the broker. Most of our queues for asynchronous work performed by the application are namespaced with either “interactive” (under 250ms work time) or “batch” (any work time). Other namespaces were specific to each application.
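A minimal sketch of the namespacing idea, using in-memory arrays in place of broker queues. The queue names and the `queuesFor` and `drain` helpers are hypothetical; only the “interactive” and “batch” namespaces come from the description above.

```javascript
// Hypothetical queue names of the form "<namespace>.<concern>".
const queues = {
  "interactive.comment_notification": [],
  "interactive.follow_notification": [],
  "batch.transcode": [],
  "batch.fanout": []
};

// A dispatcher for one class of work consumes every queue in its namespace.
function queuesFor(namespace, allQueues) {
  return Object.keys(allQueues).filter((name) => name.startsWith(namespace + "."));
}

// Process everything currently waiting in one namespace; returns the job count.
function drain(namespace, allQueues, handler) {
  let handled = 0;
  for (const name of queuesFor(namespace, allQueues)) {
    while (allQueues[name].length > 0) {
      handler(name, allQueues[name].shift());
      handled++;
    }
  }
  return handled;
}
```

Running more dispatcher processes for the “interactive” namespace than for “batch” then gives short jobs priority without the broker knowing anything about priorities.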

Caching

When we approached the mark of hundreds of thousands of users, we saw we were burning too much CPU in the application tier, mostly spent in the rendering engine and Ruby runtime.

Instead of introducing Memcached to alleviate IO contention in the database like most applications, we aggressively cached partial DOM fragments and full pages. This turned into an invalidation problem, which we solved by maintaining in Memcached the reverse index of cache keys that also needed invalidation on model changes.

Our highest volume request was one specific endpoint that was delivering data for the widget. We created a special route for that endpoint in Nginx and added proxy caching to that stack, but wanted to generalize caching to the point where any endpoint could produce proper HTTP/1.1 cache control headers and would be treated well by an intermediary we control. Now our widget content is served entirely from our public API.
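The reverse index can be sketched as follows. Two in-memory maps stand in for Memcached here, and the key formats and function names are assumptions made for illustration.

```javascript
// In-memory stand-ins for the cache store.
const cache = new Map();         // cache key -> rendered fragment or page
const reverseIndex = new Map();  // model key -> Set of cache keys rendered from it

// Store a fragment and record which models it was rendered from.
function cacheFragment(cacheKey, html, modelKeys) {
  cache.set(cacheKey, html);
  for (const modelKey of modelKeys) {
    if (!reverseIndex.has(modelKey)) reverseIndex.set(modelKey, new Set());
    reverseIndex.get(modelKey).add(cacheKey);
  }
}

// On a model change, drop every fragment that depended on that model.
// Returns the number of cache keys that were indexed for the model.
function invalidateModel(modelKey) {
  const keys = reverseIndex.get(modelKey) || new Set();
  for (const cacheKey of keys) cache.delete(cacheKey);
  reverseIndex.delete(modelKey);
  return keys.size;
}
```

A page rendered from a user and one of their tracks would be registered under both model keys, so a change to either one removes it from the cache.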

We added Memcached and much later Varnish to our stack to handle backend partially rendered template caching and mostly read-only API responses.

Generalization

Our worker pools grew, handling more asynchronous tasks. The programming model was similar for all of them: take a domain model and schedule a continuation with that model state to be processed at a later time.

Generalizing this pattern, we leveraged the after-save hooks in ActiveRecord models in a way we call ModelBroadcast. The principle is that when the business domain changes, events are dropped on the AMQP bus with that change for any asynchronous client that is interested in that class of change. This technique of decoupling the write path from the readers enables the next evolution of growth by accommodating integrations we hadn’t foreseen.

after_create do |r|
  broker.publish("models", "create.#{r.class.name}", r.attributes.to_json)
end

after_save do |r|
  broker.publish("models", "save.#{r.class.name}", r.changes.to_json)
end

after_destroy do |r|
  broker.publish("models", "destroy.#{r.class.name}", r.attributes.to_json)
end

This isn’t perfect, but it added a much needed non-disruptive, generalized, out-of-app integration point in the course of a day.
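On the consuming side, an interested client binds a queue with a pattern over these routing keys. The sketch below assumes the “models” exchange is an AMQP topic exchange, which the source does not state, and implements the standard topic matching rules so the bindings can be reasoned about without a broker; `topicMatches` is a hypothetical helper.

```javascript
// AMQP topic matching: words are separated by ".", "*" matches exactly one word
// and "#" matches zero or more words.
function topicMatches(pattern, routingKey) {
  const p = pattern.split(".");
  const k = routingKey.split(".");

  function match(i, j) {
    if (i === p.length) return j === k.length;
    if (p[i] === "#") {
      // Try letting "#" consume zero, one, two, ... of the remaining words.
      for (let n = j; n <= k.length; n++) {
        if (match(i + 1, n)) return true;
      }
      return false;
    }
    if (j === k.length) return false;
    return (p[i] === "*" || p[i] === k[j]) && match(i + 1, j + 1);
  }

  return match(0, 0);
}

// A search indexer might care about every change to tracks,
// while an audit log might care about every destroy.
console.log(topicMatches("*.Track", "save.Track"));
console.log(topicMatches("destroy.*", "destroy.User"));
```

New consumers can be added this way by binding another queue, without touching the application that publishes the events.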

Dashboard

Our most rapid data growth was the result of our Dashboard. The Dashboard is a personalized materialized index of activities inside of your social graph and the primary place to personalize your incoming sounds from the people you follow.

We have always had a storage and access problem with this component. Looking at the read and write paths separately, the read path needs to be optimized for sequential access per user over a time range. The write path needs to be optimized for random access where one event may affect millions of users’ indexes.

The solution required a system that could reorder writes from random to sequential, store them in a sequential format for reads, and grow to multiple hosts. Sorted string tables are a perfect fit for the persistence format, and with the promise of free partitioning and scaling added to the mix, we chose Cassandra as the storage system for the Dashboard index.

The intermediary steps started with the model broadcast and used RabbitMQ as a queue for staged processing, in three major steps: fan-out, personalization, and serialization of foreign key references to our domain models.

  • Fan-out finds the areas of the social graph where an activity should propagate.
  • Personalization looks at the relationship between the originator and destination users as well as other signals to annotate or filter the index entry.
  • Serialization persists the index entry in Cassandra for later lookup and joining against our domain models for display or API representations.
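The three stages can be sketched as plain functions. Everything here is a hypothetical stand-in: the follower and mute data are invented, a `Map` plays the role of Cassandra, and in the real pipeline each stage would be a separate consumer on the broker.

```javascript
// Hypothetical social graph and personalization signal.
const followers = { forss: ["alice", "bob", "carol"] };
const muted = { bob: ["forss"] };

// Stand-in for the Dashboard index: user -> entries ordered by time.
const dashboardIndex = new Map();

// Fan-out: find the users whose dashboards the activity should reach.
function fanOut(activity) {
  return (followers[activity.actor] || []).map((user) => ({ user, activity }));
}

// Personalization: filter entries based on the originator/destination relationship.
function personalize(entries) {
  return entries.filter(({ user, activity }) =>
    !(muted[user] || []).includes(activity.actor));
}

// Serialization: persist only foreign keys, to be joined against domain models on read.
function serialize(entries) {
  for (const { user, activity } of entries) {
    const row = dashboardIndex.get(user) || [];
    row.push({ at: activity.at, type: activity.type, trackId: activity.trackId });
    row.sort((a, b) => a.at - b.at);   // keep each user's entries sequential in time
    dashboardIndex.set(user, row);
  }
}

function handleActivity(activity) {
  serialize(personalize(fanOut(activity)));
}
```

One write by a popular user turns into many small index writes, which is why the write path and the per-user sequential read path are treated separately.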

Search

Our search is conceptually a back-end service that exposes a subset of data store operations over an HTTP interface for queries. Index updates are handled similarly to the Dashboard: via ModelBroadcast, with some enhancement from database replicas, and with index storage managed by Elastic Search.

Notifications and Stats

To make sure users are properly notified when their dashboard updates, whether this is over iOS/Android push notifications, email or other social networks, we simply added another stage in the Dashboard workflow that receives messages when a dashboard index is updated. Agents can get that completion event routed to their own AMQP queues via the message bus to initiate their own logic. Reliable messages at the completion of persistence are part of the eventual consistency we work with throughout our system.

Our statistics, offered to logged-in users at http://soundcloud.com/you/stats, also integrate via the broker, but instead of using ModelBroadcast, we emit special domain events that are queued up in a log and then rolled up into a separate database cluster for fast access across the various time ranges.
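A roll-up of this kind can be sketched as counting logged events per time bucket. The event shape (`trackId`, `type`, `at` in milliseconds) and the key format are assumptions made for illustration.

```javascript
// Count events per (track, type, time bucket).
//   events:   array of { trackId, type, at } with `at` in milliseconds
//   bucketMs: bucket width, e.g. one hour or one day
function rollUp(events, bucketMs) {
  const counts = new Map();
  for (const e of events) {
    const bucket = Math.floor(e.at / bucketMs) * bucketMs;  // start of the bucket
    const key = e.trackId + ":" + e.type + ":" + bucket;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}
```

Running the same log through several bucket widths gives precomputed counts for each time range, so a stats page reads a handful of rows instead of scanning raw events.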

What’s next

We have established some clear integration points in the broker for asynchronous write paths and in the application for synchronous read and write paths to backend services.

Over time, the application server’s codebase has collected both integration and functional responsibilities. As the product development settles, we have much more confidence now to decouple the function from the integration to be moved into backend services that can be consumed à la carte by not only the application but by other backend services, each with a private namespace in the persistence layer.

 

The way we develop SoundCloud is to identify the points of scale then isolate and optimize the read and write paths individually, in anticipation of the next magnitude of growth.

At the beginning of the product, our read and write scaling limitations were consumer eyeballs and developer hours. Today, we’re engineering for the realities of limited IO, network and CPU. We have the integration points set up in our architecture, all ready for the continued evolution of SoundCloud!

Sean Treadway
July 24th, 2012

Go at SoundCloud

SoundCloud is a polyglot company, and while we’ve always operated with Ruby on Rails at the top of our stack, we’ve got quite a wide variety of languages represented in our backend. I’d like to describe a bit about how—and why—we use Go, an open-source language that recently hit version 1.

It’s in our company DNA that our engineers are generalists, rather than specialists. We hope that everyone will be at least conversant about every part of our infrastructure. Even more, we encourage engineers to change teams, and even form new ones, with as little friction as possible. An environment of shared code ownership is a perfect match for expressive, productive languages with low barriers to entry, and Go has proven to be exactly that.

Go has been described by several engineers here as a WYSIWYG language. That is, the code does exactly what it says on the page. It’s difficult to overemphasize how helpful this property is toward the unambiguous understanding and maintenance of software. Go explicitly rejects “helper” idioms and features like the Uniform Access Principle, operator overloading, default parameters, and even exceptions, on the basis that they create more problems through ambiguity than they solve in expressivity. There’s no question that these decisions carry a cost of keystrokes—especially, as most new engineers on Go projects lament, during error handling—but the payoff is that those same new engineers can easily and immediately build a complete mental model of the application. I feel confident in saying that time from zero to productive commits is faster in Go than any other language we use; sometimes, dramatically so.

Go’s strict formatting rules and its “only one way to do things” philosophy mean we don’t waste much time bikeshedding about style. Code reviews on a Go codebase tend to be more about the problem domain than the intricacies of the language, which everyone appreciates.

Further, once an engineer has a working knowledge of Effective Go, there seems to be very little friction in moving from “how the application behaves today” to “how the application should behave in the ideal case.” Should a slow response from this backend abort the entire request? Should we retry exactly once, and then serve partial results? This agent has been acting strangely: can we install a 250ms timeout? Every high-level scenario in the behavior of a system can be expressed in a straightforward and idiomatic implementation, without the need for libraries or frameworks. Removing layers of abstraction reduces complexity; plainly stated, simpler code is better code.

Go has some other nice properties that we’ve taken advantage of. Static typing and fast compilation enable us to do near-realtime static analysis and unit testing during development. It also means that building, testing and rolling out Go applications through our deployment system is as fast as it gets.

In fact, fast builds, fast tests, fast peer-reviews and fast deployment mean that some ideas can go from the whiteboard to running in production in less than an hour. For example, the search infrastructure on Next is driven by Elastic Search, but managed and interfaced with the rest of SoundCloud almost exclusively through Go services. During validation testing, we realized that we needed the ability to mark indexes as read-only in certain circumstances, and needed the indexing applications to detect and respect this new dimension of index-state. Adding the abstraction in the code, polling a new endpoint to reliably detect the state, changing the relevant indexing behaviors, and writing tests for them, all took half an afternoon. By the evening, the changes had been deployed and running under load for hours. That kind of velocity, especially in a statically-typed, natively-compiled language, is exhilarating.

I mentioned our build and deployment system. It’s called Bazooka, and it’s designed to be a platform for managing the deployment of internal services. (We’ll be open-sourcing it pretty soon; stay tuned!) Scaling 12-Factor apps over a heterogeneous network can be thought of as one large, complex state machine, full of opportunities for inconsistency and race conditions. Go was a natural choice for this kind of job. Idiomatic Go is safely concurrent by default; Bazooka developers can reason about the complexity of their problem without being distracted by the complexity of their tools. And Bazooka makes use of Doozer to coordinate its shared state, which—in addition to being the only open-source implementation of Paxos in the wild (that we’re aware of)—is also written in Go.

All together, SoundCloud maintains about half a dozen services and over a dozen repositories written entirely in Go. And we’re increasingly turning to Go when spinning up new backend projects.

Interested in writing Go to solve real problems and build real products? We’d love to hear from you!

Peter Bourgon
November 21st, 2011

Front-end JavaScript bug tracking

Proper and effective error tracking is a much more common problem for front-end JavaScript code than for back-end environments.

We felt this pain as well and experimented with different solutions over the past months on the SoundCloud Mobile site.

Analytics

The first approach we had was to track errors with Google Analytics. Their library lets you fire custom events, and whenever an Ajax error occurred, we logged it.

The biggest benefit of this tool is that you can monitor the stability of the site and its evolution over longer periods, as you can easily go back a few weeks or months to see which events were triggered. Also, it is easy to implement – almost a one-liner!

The drawback, at least for Google Analytics, is that this tool is not meant to track bugs. There is no way to add custom data to these events to get more insight into why and how an error happened. It also doesn’t work in real-time, and you obviously want that when you debug.

So we kept Analytics in place for a long-term view, but took a look at other options for real-time and in-depth tracking.

Airbrake

In our pursuit of getting more insight, we decided to take a look at Airbrake because we were already using it to track back-end errors on our main site.

Our mobile site runs on Node.js, so the first thing we did was to integrate an existing plugin for it to handle error tracking on the back-end as well.

Looking a little further we found a front-end notifier, which would catch errors that would fire on window.onerror, but there was no way to report any custom errors.

We decided to take a day to hack this on our own since their API is public and easy to implement.
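The general shape of such a notifier can be sketched like this. It is not the code we wrote for Airbrake: the `send` callback, the payload fields and `createNotifier` itself are hypothetical, and the actual transport to the error service is left out.

```javascript
// Build a notifier around a `send` function that delivers one report.
// `context` carries whatever page-level data should accompany every error.
function createNotifier(send, context) {
  function notify(error, extra) {
    send({
      message: error && error.message ? error.message : String(error),
      stack: error && error.stack ? error.stack : null,
      url: context.url,
      userAgent: context.userAgent,
      extra: extra || {}            // custom data, e.g. which model or view failed
    });
  }

  return {
    // For errors caught explicitly, such as failed Ajax calls.
    notify: notify,
    // For uncaught errors; the signature follows window.onerror(message, url, line).
    onerror: function (message, url, line) {
      notify({ message: message }, { file: url, line: line });
      return false;                 // do not suppress the browser's own error handling
    }
  };
}
```

In a browser this would be wired up with `window.onerror = notifier.onerror`, while application code calls `notifier.notify(error, { ... })` wherever it catches something worth reporting.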

The benefits of Airbrake were instant. We could see what triggered which error, how, why, in which context, which browser, etc… in real-time!


It also counts errors, which can help you prioritize and include fixes in your roadmap.

However, the lack of filtering, grouping and custom sorting made it difficult to work with. There was also no sense of time or progress, as everything just gets dumped into a single list ordered by time. We needed something a little better than that.

BugSense

That’s when our Android team showed us their BugSense implementation.
BugSense seemed to address all of these issues we had with Airbrake: grouping is more effective, searching and filtering is possible, charts of errors are drawn as well.

There is one more benefit over Airbrake… JSON. No need to convert objects to XML strings anymore!

If you are interested in our BugSense notifier, you can find the source on GitHub.

Conclusion

There is still a lot of work needed to make front-end JS debugging as easy as it is for regular back-end environments.
For example, stack traces today aren’t that useful, because of anonymous objects and minified code, but hopefully browser vendors will tackle these issues soon. Maybe Source Maps could be the first milestone in this quest.

At SoundCloud, we will continue to use a combination of these tools because of the different strengths outlined above, but there are also other tools we haven’t tried yet, like getexceptional or errorception. If you have tried these, or if you have any suggestions on this subject, we’d like to get your feedback in the comments below.

Happy debugging!

Yves
November 9th, 2011

SoundCloud launches the HTML5 Audio Improvement Initiative

We at SoundCloud want to build the best sound player for the web, and we want to do that using Open Web standards. While working on the native audio features on our mobile site and new widgets, or even as an experiment on the main site, we have discovered that the HTML5 Audio standard is not equally well implemented across all modern browsers and that some decisions can be made that would benefit web audio users and web developers alike. SoundCloud is launching the “Are We Playing Yet?” project, which aims to raise awareness of the state of HTML5 Audio implementations in web browsers.

AreWePlayingYet? - A pragmatic HTML5 Audio test suite

We have decided to help the parties involved and collect the issues in one place, document them, provide the code and add interactive tests that will show the implementation progress. We understand how software development works, and that a few iterations are needed until something is fully done. We hope “Are We Playing Yet?” can function as a handy development and quality monitoring tool.

Issues - soundcloud/areweplayingyet - GitHub

“Are We Playing Yet?” was started by SoundCloud but it’s open to all companies and developers who care about the state of HTML5 audio and want to build applications based on this Web standard. You can get the project source on GitHub, contribute tests and fixes via the pull requests or Issue Tracker, and connect to the people involved via @areweplayingyet on Twitter.

matas
October 14th, 2011

SoundCloud Signs Apache Corporate Contributor License Agreement

We just signed the corporate contributor license agreement (CCLA).
SoundCloud has always been big on open source – we nearly exclusively use open source software in our company and use a lot of Apache projects like Hadoop, Solr, Flume, ZooKeeper and Cassandra on our large scale production site.

As SoundCloud is using a lot of Apache projects and has started to contribute to them, we decided to sign the CCLA and enable all our developers to contribute to Apache projects, even during work time, if that project is used by SoundCloud.

The first project we will commit code to is Flume, and we hope several more will follow.

Apache, keep up the great work and we will support you wherever possible!

Alexander Grosse
September 12th, 2011

Mobile: Unit Testing

When we started the Mobile project in early 2011, unit testing JavaScript was one of the goals to tackle on the technical side. The history of custom JavaScript code at SoundCloud up until then rarely included unit tests, so providing references and the necessary ground research was important for both the project at hand and other projects at SoundCloud.

This article aims to provide an overview of the tools we use, what worked well and what we need to improve.

Tools

When we started the Mobile project, there were just two developers on the team, Matas and Jörn. With Jörn already maintaining and supporting QUnit for three years, this particular choice was an easy one. If you haven’t yet heard of it: QUnit is among the most popular unit testing frameworks available. There’s a comprehensive tutorial over at ScriptJunkie.

As we were building an API client in the browser, mocking API requests was really important for us. We didn’t want to depend on the API being available, both to be able to work offline and to not depend on data that changes all the time. At the start of the project, jQuery 1.5 and its ajax extension points like custom transports weren’t available yet, so we went with mockjax, a library adding mocking on top of jQuery’s ajax module.

To run tests in continuous integration systems (at SoundCloud, on Jenkins), we looked at quite a lot of options. Jörn has some slides that give an overview of that research. Other teams at SoundCloud use Selenium, which wasn’t an option for us due to the lack of support for Chrome or Safari (which is still a work in progress). In the end we went with PhantomJS. PhantomJS is built on top of Qt-WebKit, provides a reasonable browser-like environment and enough API to run our unit tests and report back results.

We considered using TestSwarm to distribute running of our unit tests to regular desktop browsers as well as mobile devices. The lack of a Jenkins-TestSwarm plugin (now actually available) as well as tools for managing VMs, browsers, simulators and emulators (or even managing mobile devices) was enough of a hurdle that we skipped this. Until we get this in place, we won’t know how many bugs we could have caught earlier with this additional setup.

The Good

QUnit does a pretty good job. The few small issues we encountered were swiftly fixed upstream. We ended up customizing the module-method quite heavily, mostly to integrate Mockjax. Overall, Mockjax also did a pretty good job, once we figured out a pattern that worked for us. Here’s a typical module-call for testing Backbone Views and Models that fetch their data from the API:

module("user", {
  "/users/183/tracks": "/fixtures/forss-tracks.json",
  "/users/183/playlists": "/fixtures/forss-playlists.json",
  "/users/183/favorites": "/fixtures/forss-favorites.json",
  "/users/183/groups": "/fixtures/forss-groups.json"
});

We still call the module-method with the module-name as the first argument. The second argument can contain setup- and teardown-properties, just like QUnit expects it. In addition, we pass url-mock pairs, which are passed on to $.mockjax. In addition to those, we define a catch-all to make sure that no test ever ends up calling the actual API. And we have a global timeout for each test to ensure a broken async test never prevents the suite from finishing.

var testTimeout;
module = function(name, mocks) {
  QUnit.module(name, {
    setup: function() {
      if (mocks) {
        if (mocks.setup) {
          mocks.setup.apply(this, arguments);
        }
        $.each(mocks, function(url, mock) {
          if (/setup|teardown/.test(url)) {
            return;
          }
          if ( $.type(mock) === "string" ){
            $.mockjax({
              url: "/_api" + url,
              proxy: mock,
              responseTime: 1
            });
          } else {
            $.mockjax($.extend(mock,{url: "/_api" + url}));
          }
        });
      }
      $.mockjax({
        url: "/_api*",
        responseTime: 1,
        response: function(obj){
          var message = "Mockjax caught unmocked API call for url: " + obj.url;
          if (obj.modelType) {
            message += ", from component " + obj.modelType;
          }
          ok( false, message );
        }
      });

      testTimeout = setTimeout(function() {
        equal( true, false, "test timeout (5s)" );
        // could involve multiple stop calls, reset
        QUnit.config.semaphore = 1;
        start();
      }, 5000);
    },
    teardown: function() {
      clearTimeout(testTimeout);
      $.mockjaxClear();
      if (mocks && mocks.teardown) {
        mocks.teardown.apply(this, arguments);
      }
    }
  });
};

The problem with this design was the lack of a $.mockjaxClear(url) method – you can’t remove an existing handler or replace it (mockjaxClear(index) is supported, but didn’t help us). We needed that to test error conditions, for example, when the API returned a 404 when asking if a particular track was a favorite of a user. In some cases, we could just mix it with other mocks. In other cases, we grouped these tests into a separate module-call (with the same name):

module("user", {
  "/users/183/playlists": {
    responseStatus: 500,
    responseText: "servererror",
    responseTime: 1
  }
});

With that, we did the regular tests in one place, the error conditions in the other.

The Bad

An interesting QUnit feature, inspired by Kent Beck’s work on JUnit MAX, is its built-in reordering. It basically records the results of one test run in sessionStorage, then looks at those results during the next run. If a test failed before, it’s scheduled to run first. All that happens without changing the order of the result output. If it works, you can get the relevant test results much faster than for regular sequential runs, as it’s likely that tests that failed before will fail again, while passing tests are a lot less likely to start failing.

The problem with that reordering for us was that with all the asynchronous tests in our suite, sometimes tests would have side effects on other tests. As long as they ran in a fixed order, those effects weren’t noticeable. Instead of addressing the actual side effects, we ended up disabling the reordering. It’s on the pile of chores to still address.

Overall, the unit tests did a good job, though it’s not quite clear how much value they actually provided. Most bug reports are about visual issues, sometimes small glitches, often enough device specific issues. For a mobile web developer, Android, or Andy as we started to call it, becomes kind of an IE6. Its browser gets updated only with the OS, and the OS isn’t updated, so we’re stuck with this browser that was okay a year ago, but is a real pain today. On Android 2.1, you even have the same issue as on IE6: HTML5 elements like ‘header’ or ‘article’ aren’t styled. At least on IE6, there’s a workaround…
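The idea behind the reordering fits in a few lines. This is a simplified sketch, not QUnit’s implementation; `reorder` and its arguments are hypothetical.

```javascript
// tests:            test names in their declared order
// previousFailures: names of tests that failed in the last run
// Returns the order to execute in: earlier failures first, everything else
// after them, with the declared order preserved inside each group.
function reorder(tests, previousFailures) {
  const failed = tests.filter((name) => previousFailures.includes(name));
  const rest = tests.filter((name) => !previousFailures.includes(name));
  return failed.concat(rest);
}
```

Any test that silently depends on an earlier one having run will break as soon as it is moved to the front, which is the kind of side effect described above. In QUnit itself, the feature is switched off with `QUnit.config.reorder = false`.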

Anyway, the other category of bugs was reported much less frequently, and unit testing didn’t help there either. We learned that client-side error logging is extremely valuable. Tools like Airbrake and Bugsense still have a long way to go, but writing a single-page web application without logging client-side errors means you never learn about the thousands of errors your users get to see. Expect another post on that topic.
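As a rough illustration of what client-side error logging can look like, here is a minimal sketch built on window.onerror. It is not what we run in production: createErrorReporter, the report fields, the cap of 10 reports and the console.log transport are all assumptions made for the example.

```javascript
// Hypothetical reporter: `send` and the report shape are assumptions.
function createErrorReporter(send, maxReports) {
  let sent = 0;
  return function onError(message, source, line) {
    // Cap reports per page load so one broken loop can't flood the log.
    if (sent >= maxReports) {
      return false;
    }
    sent += 1;
    send({
      message: String(message),
      source: source || "unknown",
      line: line || 0
    });
    // Returning false keeps the browser's default error handling.
    return false;
  };
}

// In a browser, install the handler globally.
if (typeof window !== "undefined") {
  window.onerror = createErrorReporter(function (report) {
    // Placeholder transport; a real setup would send this to a server.
    console.log("client error", JSON.stringify(report));
  }, 10);
}
```

Keeping the transport as a parameter makes the handler easy to exercise in a unit test without a network.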

The Ugly

As long as mockjax did its job, we were happy with it. When it didn’t, we had to look at the source, and we weren’t happy anymore. The whole thing is quite a mess and in dire need of some good refactorings. Still, in terms of features, alternatives like jQuery 1.5 custom transports or sinon.js just aren’t on par, so we stuck with mockjax.

What we have now mostly given up on is PhantomJS. The Jenkins job that ran our QUnit tests using PhantomJS is currently disabled, as it kept failing for months. Overall, we spent several days trying to find the source of the one failing test, and gave up in the end. We still don’t know why it was failing, and there were several hurdles that made it difficult to debug:

  • It failed only on our Jenkins server. Running the tests locally, using the same PhantomJS version, worked fine. The difference was the environment, with mostly OS X running on developer machines, but Debian Lenny on the Jenkins box. Sure, that’s a problem, but the point of the tool is to provide a browser-like environment, so it shouldn’t matter what system it’s running on.
  • We were stuck with PhantomJS 1.1, even after 1.2.x had been out for several months. While we could adapt to the completely backwards-incompatible API changes from 1.1 to 1.2, we didn’t find any way around PhantomJS just crashing on our test suite, with no useful output. If you’re interested, you can find the debugging process somewhat documented on this Google Groups thread. Even debugging with gdb proved to be a waste of time. The unhelpfulness of PhantomJS when failing to load a page is stunning.

So as nice as PhantomJS is, the combination of not being able to upgrade and not being able to fix the existing build forced us to abandon it. TestSwarm is a lot more interesting now that there is a Jenkins plugin for it. And with Chrome support upcoming in Selenium, that is an attractive short-term solution as well.

Epilog

As you can see, this story isn’t over yet. It seems to share a common theme with other developer tools, be that editors, bug tracking or testing tools: most of them do their job, but we aren’t satisfied with any of them.

What are your experiences? What tools would you like to see improved, replaced or invented?

Jörn
May 4th, 2011

Introducing the Large Hadron Migrator

Rails style database migrations are a useful way to evolve your data schema in an agile manner. Most Rails projects start like this, and at first, making changes is fast and easy.

That is, until your tables grow to millions of records. At this point, the locking nature of ALTER TABLE may take your site down for an hour or more while critical tables are migrated. In order to avoid this, developers begin to design around the problem by introducing join tables or moving the data into another layer. Development gets less and less agile as tables grow and grow. To make the problem worse, adding or changing indices to optimize data access becomes just as difficult.

Side effects may include black holes and universe implosion.

There are few things that can be done at the server or engine level. It is possible to change default values in an ALTER TABLE without locking the table. The InnoDB Plugin provides facilities for online index creation, which is great if you are using this engine, but only solves half the problem.

At SoundCloud we started having migration pains quite a while ago, and after looking around for third party solutions [0] [2], we decided to create our own. We called it Large Hadron Migrator, and it is a gem for online ActiveRecord migrations.

The Large Hadron Collider at CERN

The idea

Read the rest of this entry »

Rany
October 26th, 2010

Let’s Git it On

Being both a mediocre biz dev guy and a nerd means I get to post on the Developer blog as well as our Company blog, and today I’d like to talk to you about Git.

What is Git?

Git is a free and open source version control system that, when used in conjunction with social coding websites such as GitHub, can greatly improve your efficiency on both small and large projects by keeping track of changes, collaborators, and more. You’ll be wondering how you ever lived without it.

I’ve had to teach myself the basics of Git from day one with SoundCloud and I recently graduated from clueless to novice with my first successful Fork > Edit > Pull.

Fork What?

Projects on Git are stored in repositories, and when a repository is made public, anyone can Fork it (make a copy of your project), Edit it, and Send a Pull Request (notify you of the changes). Should you choose to Pull it, those changes will be added to your project. This is the definition of social coding. Killer.

I recently did this dance with a great new Ruby gem called OmniAuth from Intridea.

What’s OmniAuth?

“OmniAuth is a new Rack-based authentication system for multi-provider external authentication.” – which allows you as a developer to roll out a login system consisting of any number of 3rd party providers, such as Twitter & Facebook, in no time at all.

OmniAuth just had one thing missing: SoundCloud support! So I said, “Fork This,” (laughing all by myself at home) and forked the project. Within an hour or so I had a working implementation and sent a Pull request to Intridea. And then just this morning, SoundCloud support was approved and added to the gem. Victory!

Here’s a link to the latest gem: github.com/intridea/omniauth.

If you haven’t already, I invite you to dig deeper into the world of Git and open source. While it can be a bit daunting at first for a novice (like myself), once you’ve learned it, you’ll never go back. Plus, you’ll have a better understanding of our own open-source offerings available from github.com/soundcloud.

If you have any questions or comments, I’m lee@soundcloud.com, @leemartin on Twitter, and leemartin on GitHub. Happy Hacking!

Lee