December 4th, 2012

Architecture behind our new Search and Explore experience

Search is front-and-center in the new SoundCloud, key to the consumer experience. We’ve made the search box one of the first things you see, and beefed it up with suggestions that allow you to jump directly to people, sounds, groups, and sets of interest. We’ve also added a brand-new Explore section that guides you through the huge and dynamic landscape of sounds on SoundCloud. We’ve also completely overhauled our search infrastructure, which helps us provide more relevant results, scale with ease, and experiment quickly with new features and models.

In the beginning

In SoundCloud’s startup days, when we were growing our product at super speed, we spent just two days implementing a straightforward search feature, integrating it directly into our main app. We used Apache Solr, as was the fashion at the time, and built a master-slave cluster with the common semantics: writes to the master, reads from the slaves. Besides a separate component for processing deletes, our indexing logic was pull-based: the master Solr instance used a Solr DataImportHandler to poll our database for changes, and the slaves polled the master.

At first, this worked well. But as time went on, our data grew, our entities became more complex, and our business rules multiplied. We began to see problems.

The problems

  • The index was only getting updated on the read slaves about every fifteen minutes. Being real-time is crucial in the world of sound, as it’s important that our users’ sounds are searchable immediately after upload.
  • A complete re-index of the master node from the database took up to 24 hours, which made making a simple schema change for a new feature or bugfix a Sisyphean task.
  • Worse, during those complete-reindex periods, we couldn’t do incremental indexing because of limitations in the DataImportHandler. That meant the search index was essentially frozen in time: not a great user experience.
  • Because the search infrastructure was tightly coupled in the main application, low-level Solr or Lucene knowledge leaked everywhere. If we wanted to make a slight change, such as to a filter, we had to touch 25 parts of the codebase in the main application.
  • Because the application directly interacted with the master, if that single-point-of-failure failed, we lost information about updates, and our only reliable recovery was a complete reindex, which took 24 hours.
  • Since Solr uses segment or bulk replication, our network was flooded when the slaves pulled updates from the master, which caused additional operational issues.
  • Solr felt like it was built in a different age, for a different use-case. When we had performance problems, it was opaque and difficult to diagnose.

In early 2012, we knew we couldn’t go forward with many new features and enhancements on the existing infrastructure. We formed a new team and decided we should rebuild the infrastructure from scratch, learning from past lessons. In that sense, the Solr setup was an asset: it could continue to serve site traffic, while we were free to build and experiment with a parallel universe of search. Green-field development, unburdened by legacy constraints: every engineer’s dream.

The goals

We had a few explicit goals with the new infrastructure.

  • Velocity: we want high velocity when working on schema-affecting bugs and features, so a complete reindex should take on the order of an hour.
  • Reliability: we want to grow to the next order of magnitude and beyond, so scaling the infrastructure—horizontally and vertically—should require minimum effort.
  • Maintainability: we don’t want to leak implementation details beyond our borders, so we should strictly control access through APIs that we design and control.

After a survey of the state of the art in search, we decided to abandon Solr in favor of ElasticSearch, which would theoretically address our reliability concerns in their entirety. We then just had to address our velocity and maintenance concerns with our integration with the rest of the SoundCloud universe. To the whiteboard!

The plan

On the read side, everything was pretty straightforward. We’d already built a simple API service between Solr and the main application, but had hit some roadblocks when integrating it. We pivoted and rewrote the backend of that API to make ElasticSearch queries, instead: an advantage of a lightweight service-oriented architecture is that refactoring services in this way is fast and satisfying.

On the write side—conceptually, at least—we had a straightforward task. Search is relatively decoupled from other pieces of business and application logic. To maintain a consistent index, we only need an authoritative source of events, ie. entity creates, updates, and deletes, and some method of enriching those events with their searchable metadata. We had that in the form of our message broker. Every event that search cared about was being broadcast on a well-defined set of exchanges: all we had to do was listen in.

But depending on broker events to materialize a search index is dangerous. We would be susceptible to schema changes: the producers of those events could change the message content, and thereby break search without knowing, potentially in very subtle ways. The simplest, most robust solution was to ignore the content of the messages, and perform our own hydration. That carries a runtime cost, but it would let us better exercise control over our domain of responsibility. We believed the benefits would outweigh the costs.

Using the broker as our event source, and the database as our ground truth, we sketched an ingestion pipeline for the common case. Soon, we realized we had to deal with the problem of synchronization: what should we do if an event hadn’t yet been propagated to the database slave we were querying? It turned out that if we used the timestamp of the event as the “external” version of the document in ElasticSearch, we could lean on ElasticSearch’s consistency guarantees to detect, and resolve, problems of sync.

Once it hit the whiteboard, it became clear we could re-use the common-case pipeline for the bulk-index workflow; we just had to synthesize creation events for every entity in the database. But could we make it fast enough? It seemed at least possible. The final box-diagram looked like this:


Implementation and optimization began. Over a couple of weeks, we iterated over several designs for the indexing service. We eked out a bit more performance in each round, and in the end, we got where we wanted: indexing our entire corpus against a live cluster took around 2 hours, we had no coupling in the application, and ElasticSearch itself gave us a path for scaling.

The benefits

We rolled out the new infrastructure in a dark launch on our beta site. Positive feedback on the time-to-searchability was immediate: newly-posted sounds were discoverable in about 3 seconds (post some sounds and try it out for yourself!). But the true tests came when we started implementing features that had been sitting in our backlog. We would start support for a new feature in the morning, make schema changes into a new index over lunch, and run A/B tests in the afternoon. If all lights were green, we could make an immediate live swap. And if we had any problems, we could rollback to the previous index just as fast.

There’s no onerous, manual work involved in bootstrapping or maintaining nodes: everything is set up with ElasticSearch’s REST API. Our index definitions are specified in flat JSON files. And we have a single dashboard with all the metrics we need to know how search is performing, both in terms of load and search quality.

In short, the new infrastructure has been an unqualified success. Our velocity is through the roof, we have a clear path for orders-of-magnitude growth, and we have a stable platform for a long time to come.

Enter DiscoRank

It’s not the number of degrees of separation between you and John Travolta: DiscoRank is our modification of the PageRank algorithm to provide more relevant search results: short for Discovery Rank. We match your search against our corpus and use DiscoRank to rank the results.

The DiscoRank algorithm involves creating a graph of searchable entities, in which each activity on SoundCloud is an edge connecting two nodes. It then iteratively computes the ranking of each node using the weighted graph of entities and activities. Our graph has millions and millions of nodes and a lot more edges. So, we did a lot of optimizations to adapt PageRank to our use case, to be able to keep the graph in memory, and to recalculate the DiscoRank quickly, using results from previous runs of the algorithm as priors. We keep versioned copies of the DiscoRank so that we can swap between them when testing things out.

How do we know we’re doing better?

Evaluating the relevance of search results is a challenge. You need to know what people are looking for (which is not always apparent from their query) and you need to get a sample that’s representative enough to judge improvement. Before we started work on the new infrastructure, we did research and user testing to better understand how and why people were using search on SoundCloud.

Our evaluation baseline is a set of manually-tagged queries. To gather the first iteration of that data, we built an internal evaluation framework for SoundCloud employees: thumbs up, thumbs down on results. In addition, we have metrics like click positions, click engagement, precision, and recall, all constantly updating in our live dashboard. This allows us to compare ranking algorithms and other changes to the way we construct and execute queries. And we have mountains of click log data that we need to analyse.

We have some good improvements so far, but there’s still a lot of tuning that can be done.

Search as navigation

The new Suggest feature lets you jump straight to the sound, profile, group, or set you’re looking for. We’ll trigger your memory if you don’t remember how something is spelled, or exactly what it’s called. Since we know the types of the results returned, we can send you straight to the source. This ability to make assumptions and customizations is a consequence of knowing the structure and semantics of our data, and gives a huge advantage in applications like this.

Architecture-wise, we decided to separate the suggest infrastructure from the main search cluster, and build a custom engine based on Apache Lucene’s Finite State Transducers. As ever, delivering results fast is crucial. Suggestions need to show up right after the keystroke. It turned out we are competing with the speed of light for this particular use-case.

The fact that ElasticSearch doesn’t come with a suggest engine turned out to be a non-issue and rather forced us to build this feature in isolation. This separation proved to be a wise decision, since update-rate, request patterns and customizations are totally different, and would have made a built-in solution hard to maintain.

Search as a crystal ball

Similar to suggest, Explore is based on our new search infrastructure, but with a twist: our main goal for Explore is to showcase the sounds in our community at this very moment. This led to a time-sensitive ranking, putting more emphasis on the newness of a sound.

So far, so good

We hope you enjoy using the new search features of SoundCloud as much as we enjoyed building them. Staying focused on a re-architecture is painful when you see your existing infrastructure failing more and more each day, but we haven’t looked back—especially since our new infrastructure allows us to deliver a better user experience much faster. We now can roll out new functionality with ease, and the new Suggest and Explore along with an improved Search itself are just the first results.

Keep an eye out for more improvements to come!

Apache Lucene and Solr are trademarks of the Apache Software Foundation

 

Petar Djekic
August 30th, 2012

Evolution of SoundCloud’s Architecture

This is a story of how we adapted our architecture over time to accomodate growth.

Scaling is a luxury problem and surprisingly has more to do with organization than implementation. For each change we addressed the next order of magnitude of users we needed to support, starting in the thousands and now we’re designing for the hundreds of millions.  We identify our bottlenecks and addressed them as simply as possible by introducing clear integration points in our infrastructure to divide and conquer each problem individually.

By identifying and extracting points of scale into smaller problems and having well defined integration points when the time arrived, we are able to grow organically.

Product conception

From day one, we had the simple need of getting each idea out of our heads and in front of eyeballs as quickly as possible. During this phase, we used a very simple setup:

Apache was serving our image/style/behavior resources, and Rails backed by MySQL provided an environment where almost all of our product could be modeled, routed and rendered quickly. Most of our team understood this model and could work well together, delivering a product that is very similar to what we have today.

We consciously chose not to implement high availability at this point, knowing what it would take when that time hopefully arrived. At this point we left our private beta, revealing SoundCloud to the public.

Our primary cost optimization was for opportunity, and anything that got in the way of us developing the concepts behind SoundCloud were avoided. For example, when a new comment was posted, we blocked until all followers were notified knowing that we could make that asynchronous later.

In the early stages we were conscious to ensure we were not only building a product, but also a platform. Our Public API was developed alongside our website from the very beginning. We’re now driving the website with the same API we were offering to 3rd party integrations.

Growing out of Apache

Apache served us well, but we were running Rails app servers on multiple hosts, and the routing and virtual host configuration in Apache was cumbersome to keep in sync between development and production.

The Web tier’s primary responsibility is to manage and dispatch incoming web requests, as well as buffering outbound responses so to free up an application server for the next request as quickly as possible. This meant the better connection pooling and content based routing configuration we had, the stronger this tier would be.

At this point we replaced Apache with Nginx and reduced our web tier’s configuration complexity, but our architecture didn’t change.

Load distribution and a little queue theory

Nginx worked great, but as we were growing, we found that some workloads took significantly more time compared to others (in the order of hundreds of milliseconds).

When you’re working on a slow request when a fast request arrives, the fast request will have to wait until the slow request finishes, called “head of the line blocking problem”. When we had multiple applications servers each with its own listen socket backlog, analogous to a grocery store, where you inevitably stand at one register and watch all the other registers move faster than your own.

Around 2008 when we first developed the architecture, concurrent request processing in Rails and ActiveRecord was fairly immature. Even though we felt confident that we could audit and prepare our code for concurrent request processing, we did not want to invest the time to audit our dependencies. So we stuck with the model of a single concurrency per application server process and ran multiple processes per host.

In Kendall’s notation once we’ve sent a request from the web server to the application server, the request processing can be modeled by a M/M/1 queue. The response time of such a queue depends on all prior requests, so if we drastically increase the average work time of one request the average response time also drastically increases.

Of course, the right thing to do is to make sure our work times are consistently low for any web request, but we were still in the period of optimizing for opportunity, so we decided to continue with product development and solve this problem with better request dispatching.

We looked at the Phusion passenger approach of using multiple child processes per host but felt that we could easily fill each child with long-running requests. This is like having many queues with a few workers on each queue, simulating concurrent request processing on a single listen socket.

This changed the queue model from M/M/1 to M/M/c where c is the number of child processes for every dispatched request. This is like the queue system found in a post office, or a “take a number, the next available worker will help you” kind of queue. This model reduces the response time by a factor of c for any job that was waiting in the queue which is better, but assuming we had 5 children, we would just be able to accept an average of 5 times as many slow requests. We were already seeing a factor of 10 growth in the upcoming months, and had limited capacity per host, so adding only 5 to 10 workers was not enough address the head of the line blocking problem.

We wanted a system that never queued, but if it did queue, the wait time in the queue was minimal. Taking the M/M/c model to the extreme, we asked ourselves “how can we make c as large as possible?”

To do this, we needed to make sure that a single Rails application server never received more than one request at a time. This ruled out TCP load balancing because TCP has no notion of an HTTP request/response. We also needed to make sure that if all application servers were busy, the request would be queued for the next available application server. This meant we must maintain complete statelessness between our servers. We had the latter, but didn’t have former.

We added HAProxy into our infrastructure, configuring each backend with a maximum connection count of 1 and added our backend processes across all hosts, to get that wonderful M/M/c reduction in resident wait time by queuing the HTTP request until any backend process on any host becomes available. HAProxy entered as our queuing load balancer that would buffer any temporary back-pressure by queuing requests from the application or dependent backend services so we could defer designing sophisticated queuing in other components in our request pipeline.

I heartily recommend Neil J. Gunther’s work Analyzing Computer System Performance with Perl::PDQ to brush up on queue theory and strengthen your intuition on how to model and measure queuing systems from HTTP requests all the way down to your disk controllers.

Going asynchronous

One class of request that took a long time was the fan-out of notifications from social activity. For example, when you upload a sound to SoundCloud, everyone that follows you will be notified. For people with many followers, if we were to do this synchronously, the request times would exceed the tens of seconds. We needed to queue a job that would be handled later.

Around the same time we were considering how to manage our storage growth for sounds and images, and had chosen to offload storage to Amazon S3 keeping transcoding compute in Amazon EC2.

Coordinating these subsystems, we needed some middleware that would reliably queue, acknowledge and re-deliver job tickets on failure. We went through a few systems, but in the end settled on AMQP because of having a programmable topology, implemented by RabbitMQ.

To keep the same domain logic that we had in the website, we loaded up the Rails environment and built a lightweight dispatcher class with one queue per concern.  The queues had a namespace that describes estimated work times. This created a priority system in our asynchronous workers without requiring adding the complexity of message priorities to the broker by starting one dispatcher process for each class of work that bound to multiple queues in that work class. Most of our queues for asynchronous work performed by the application are namespaced with either “interactive” (under 250ms work time) or “batch” (any work time). Other namespaces were used specific to each application.

Caching

When we approached the hundreds of thousands user mark, we saw we were burning too much CPU in the application tier, mostly spent in the rendering engine and Ruby runtime.

Instead of introducing Memcached to alleviate IO contention in the database like most applications, we aggressively cached partial DOM fragments and full pages. This turned into an invalidation problem which we solved by maintaining the reverse index of cache keys that also needed invalidation on model changes in memcached.

Our highest volume request was one specific endpoint that was delivering data for the widget. We created a special route for that endpoint in nginx and added proxy caching to that stack, but wanted to generalize caching to the point where any end point could produce proper HTTP/1.1 cache control headers and would be treated well by an intermediary we control. Now our widget content is served entirely from our public API.

We added Memcached and much later Varnish to our stack to handle backend partially rendered template caching and mostly read-only API responses.

Generalization

Our worker pools grew, handling more asynchronous tasks. The programming model was similar for all of them: take a domain model and schedule a continuation with that model state to be processed at a later state.

Generalizing this pattern, we leveraged the after-save hooks in ActiveRecord models in a way we call ModelBroadcast. The principle is that when the business domain changes, events are dropped on the AMQP bus with that change for any asynchronous client that is interested in that class of change. This technique of decoupling the write path from the readers enables the next evolution of growth by accommodating integrations we hadn’t foreseen.

after_create do |r|
  broker.publish("models", "create.#{r.class.name}",  r.attributes.to_json)
end

after_save do |r|
  broker.publish("models", "save.#{r.class.name}", r.changes.to_json)
end

after_destroy do |r|
  broker.publish("models", "destroy.#{r.class.name}", r.attributes.to_json)
end

This isn’t perfect, but it added a much needed non-disruptive, generalized, out-of-app integration point in the course of a day.

Dashboard

Our most rapid data growth was the result of our Dashboard. The Dashboard is a personalized materialized index of activities inside of your social graph and the primary place to personalize your incoming sounds from the people you follow.

We have always had a storage and access problem with this component. Looking at the read and write paths separately, the read path needs to be optimized for sequential access per user over a time range. The write path needs to be optimized for random access where one event may affect millions of users’ indexes.

The solution required a system that could reorder writes from random to sequential and store in sequential format for read that could be grown to multiple hosts. Sorted string tables are a perfect fit for the persistence format, and add the promise of free partitioning and scaling in the mix, we chose Cassandra as the storage system for the Dashboard index.

The intermediary steps started with the model broadcast and used RabbitMQ as a queue for staged processing, in three major steps: fan-out, personalization, and serialization of foreign key references to our domain models.

  • Fan-out finds the areas of the social graph where an activity should propagate.
  • Personalization looks at the relationship between the originator and destination users as well as other signals to annotate or filter the index entry.
  • Serialization persists the index entry in Cassandra for later lookup and joining against our domain models for display or API representations.

Search

Our search is conceptually a back-end service that exposes a subset of data store operations over an HTTP interface for queries. Updating of the index is handled similarly to the dashboard via ModelBroadcast with some enhancement from database replicas with index storage managed by Elastic Search.

Notifications and Stats

To make sure users are properly notified when their dashboard updates, whether this is over iOS/Android push notifications, email or other social networks we simply added another stage in the Dashboard workflow that receives messages when a dashboard index is updated. Agents can get that completion event routed to their own AMQP queues via the message bus to initiate their own logic. Reliable messages at the completion of persistence is part of the eventual consistency we work with throughout our system.

Our statistics offered to logged in users at http://soundcloud.com/you/stats also integrates via the broker, but instead of using ModelBroadcast, we emit special domain events that are queued up in a log then rolled up into a separate database cluster for fast access across the various time ranges.

What’s next

We have established some clear integration points in the broker for asynchronous write paths and in the application for synchronous read and write paths to backend services.

Over time, the application server’s codebase has collected both integration and functional responsibilities. As the product development settles, we have much more confidence now to decouple the function from the integration to be moved into backend services that can be consumed à la carte by not only the application but by other backend services, each with a private namespace in the persistence layer.

 

The way we develop SoundCloud is to identify the points of scale then isolate and optimize the read and write paths individually, in anticipation of the next magnitude of growth.

At the beginning of the product, our read and write scaling limitations were consumer eyeballs and developer hours. Today, we’re engineering for the realities of limited IO, network and CPU. We have the integration points set up in our architecture, all ready for the continued evolution of SoundCloud!

Sean Treadway
August 14th, 2012

Shoot yourself in the foot with iptables and kmod auto-loading

As some of you might know, we had an outage yesterday.
We believe that in every mistake there is something to learn from, so after each outage we are writing post-mortems. Usually we do this internally because the issues we run into are very specific to our infrastructure.

This time we ran into a quite nasty issue which could affect everyone running a linux system with a lot sessions on it and we thought you might be interested to know about that pitfall.

What happened?

At 4:40pm CEST, we got reports about Yikes (503/504 errors) on SoundCloud. Around the same time, our monitoring alerted for a high amount of 503s at our caching layer and right after that one of our L7 routing nginx instances was reported down.

We were still able to log into that system. dmesg showed:

Aug 13 14:46:52 ams-mid006.int.s-cloud.net kernel: [8623919.136122] nf_conntrack: table full, dropping packet.
Aug 13 14:46:52 ams-mid006.int.s-cloud.net kernel: [8623919.136138] nf_conntrack: table full, dropping packet.

N.B.: Our systems are set to UTC timezone

That wasn’t expected. The first thought was: “Someone must have changed the sysctl tunings for that”. Then we realized that this system has no need for connection tracking, so nf_conntrack shouldn’t be loaded at all. As a quick contermeasure we raised net.ipv4.netfilter.ip_conntrack_max. This fixed that situation and brought the service back up.

Why did it happen?

After bringing the site back up, we investigated what caused the kernel to enable connection tracking. Doing a lsmod showed that connection tracking and iptables modules were actually loaded. Another look into dmesg revealed that right before the outage the ip_tables netfilter module was loaded:

Aug 13 14:38:27 ams-mid006.int.s-cloud.net kernel: [8623415.007818] ip_tables: (C) 2000-2006 Netfilter Core Team
Aug 13 14:38:35 ams-mid006.int.s-cloud.net kernel: [8623422.444931] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)

So what happened? One of our enginners was doing some preparations for scaling that layer of our infrastructure. To verify we don’t use any specific iptable rules on that system, he did:

iptables -L iptables -t nat -L

Those commands themself are pretty harmless. They will just list configured iptables rules. The first one rules in the filter table, the second one in the nat table. Nothing which should change any system configuration, right? Nope.
Let’s try to reproduce it. Just boot up some system (I’ve tried it on my Ubuntu Laptop). No iptables module should be loaded:

root@apollon:~# lsmod|grep ipt

Now just list your iptable rules:

iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination

Chain FORWARD (policy ACCEPT)
target prot opt source destination

Chain OUTPUT (policy ACCEPT)
target prot opt source destination

And check again for loaded modules:

root@apollon:~# lsmod|grep ipt
iptable_filter 12810 0
ip_tables      27473 1 iptable_filter
x_tables       29846 2 iptable_filter,ip_tables

Okay, that loaded some iptables module to make it possible to add filter rules via iptables. This shouldn’t cause any problems, since without any actual rules the impact on the kernel is negligible. But now check your nat table:

root@apollon:~# iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target prot opt source destination 

Chain INPUT (policy ACCEPT)
target prot opt source destination

Chain OUTPUT (policy ACCEPT)
target prot opt source destination

Chain POSTROUTING (policy ACCEPT)
target prot opt source destination

Completely empty as well. But now look at your kernel modules:

root@apollon:~# lsmod|grep ipt
iptable_nat 13229 0
nf_nat 25891 1 iptable_nat
nf_conntrack_ipv4 19716 3 iptable_nat,nf_nat
nf_conntrack 81926 3 iptable_nat,nf_nat,nf_conntrack_ipv4
iptable_filter 12810 0
ip_tables 27473 2 iptable_nat,iptable_filter
x_tables 29846 3 iptable_nat,iptable_filter,ip_tables

By just listing your iptable rules for the nat table, the kernel loaded nf_conntrack which enabled connection tracking. See dmesg:

[75024.007681] nf_conntrack version 0.5.0 (16384 buckets, 65536 max

On your Laptop you probably don’t care – it’s even quite convenient. On a production server that handles a large number of connections the fairly small default nf_conntrack table will overflow quite fast and cause dropped connections.

How do we prevent it?

iptables doesn’t load the nf_conntrack itself, it only loads ip_tables which again loads modules it depends on via the kernel’s kmod facility.

But since that module loader uses the modprobe user-space helpers like modprobe, the auto-loading process will honour modprobe.d/ settings. Unfortunatelly there is no easy way to disable loading of a module altogether, but there is a workaround for that.

Since we don’t need iptables at all on that system, we’ve created a /etc/modprobe.d/netfilter.conf like this:

alias ip_tables off
alias iptable off
alias iptable_nat off
alias iptable_filter off
alias x_tables off
alias nf_nat off
alias nf_conntrack_ipv4 off
alias nf_conntrack off

This will make modprobe load off instead of the actual kernel module.

Trying to run any iptables command now, should now give you now:
iptables -t nat -L
FATAL: Module off not found. iptables v1.4.12: can't initialize iptables table `nat': Table does not exist (do you need to insmod?) Perhaps iptables or your kernel needs to be upgraded.

Johannes 'fish' Ziemke
July 24th, 2012

Go at SoundCloud

SoundCloud is a polyglot company, and while we’ve always operated with Ruby on Rails at the top of our stack, we’ve got quite a wide variety of languages represented in our backend. I’d like to describe a bit about how—and why—we use Go, an open-source language that recently hit version 1.

It’s in our company DNA that our engineers are generalists, rather than specialists. We hope that everyone will be at least conversant about every part of our infrastructure. Even more, we encourage engineers to change teams, and even form new ones, with as little friction as possible. An environment of shared code ownership is a perfect match for expressive, productive languages with low barriers to entry, and Go has proven to be exactly that.

Go has been described by several engineers here as a WYSIWYG language. That is, the code does exactly what it says on the page. It’s difficult to overemphasize how helpful this property is toward the unambiguous understanding and maintenance of software. Go explicitly rejects “helper” idioms and features like the Uniform Access Principle, operator overloading, default parameters, and even exceptions, on the basis that they create more problems through ambiguity than they solve in expressivity. There’s no question that these decisions carry a cost of keystrokes—especially, as most new engineers on Go projects lament, during error handling—but the payoff is that those same new engineers can easily and immediately build a complete mental model of the application. I feel confident in saying that time from zero to productive commits is faster in Go than any other language we use; sometimes, dramatically so.

Go’s strict formatting rules and its “only one way to do things” philosophy mean we don’t waste much time bikeshedding about style. Code reviews on a Go codebase tend to be more about the problem domain than the intricacies of the language, which everyone appreciates.

Further, once an engineer has a working knowledge of Effective Go, there seems to be very little friction in moving from “how the application behaves today” to “how the application should behave in the ideal case.” Should a slow response from this backend abort the entire request? Should we retry exactly once, and then serve partial results? This agent has been acting strangely: can we install a 250ms timeout? Every high-level scenario in the behavior of a system can be expressed in a straightforward and idiomatic implementation, without the need for libraries or frameworks. Removing layers of abstraction reduces complexity; plainly stated, simpler code is better code.

Go has some other nice properties that we’ve taken advantage of. Static typing and fast compilation enable us to do near-realtime static analysis and unit testing during development. It also means that building, testing and rolling out Go applications through our deployment system is as fast as it gets.

In fact, fast builds, fast tests, fast peer-reviews and fast deployment means that some ideas can go from the whiteboard to running in production in less than an hour. For example, the search infrastructure on Next is driven by Elastic Search, but managed and interfaced with the rest of SoundCloud almost exclusively through Go services. During validation testing, we realized that we needed the ability to mark indexes as read-only in certain circumstances, and needed the indexing applications to detect and respect this new dimension of index-state. Adding the abstraction in the code, polling a new endpoint to reliably detect the state, changing the relevant indexing behaviors, and writing tests for them, all took half an afternoon. By the evening, the changes had been deployed and running under load for hours. That kind of velocity, especially in a statically-typed, natively-compiled language, is exhilarating.

I mentioned our build and deployment system. It’s called Bazooka, and it’s designed to be a platform for managing the deployment of internal services. (We’ll be open-sourcing it pretty soon; stay tuned!) Scaling 12-Factor apps over a heterogeneous network can be thought of as one large, complex state machine, full of opportunities for inconsistency and race conditions. Go was a natural choice for this kind of job. Idiomatic Go is safely concurrent by default; Bazooka developers can reason about the complexity of their problem without being distracted by the complexity of their tools. And Bazooka makes use of Doozer to coordinate its shared state, which—in addition to being the only open-source implementation of Paxos in the wild (that we’re aware of)—is also written in Go.

All together, SoundCloud maintains about half a dozen services and over a dozen repositories written entirely in Go. And we’re increasingly turning to Go when spinning up new backend projects.

Interested in writing Go to solve real problems and build real products? We’d love to hear from you!

Peter Bourgon
June 14th, 2012

Building The Next SoundCloud

This article is also available in Serbo-Croatian: Pravljenje novog SoundCloud.

The front-end team at SoundCloud has been building upon our experiences with the HTML5 widget to make the recently-released Next SoundCloud beta as solid as possible. Part of any learning also includes sharing your experiences, so here we outline the front-end architecture of the new site.

Building a single-page application

One of the core features of Next SoundCloud is continuous playback, which allows users to start listening to a sound and continue exploring without ever breaking the experience. Since this really encourages lots of navigation around the site, we also wanted to make that as fast and smooth as possible. These two factors were enough in themselves for us to decide to build Next as a single-page Javascript application. Data is drawn from our public API, but all rendering and navigation happens in the browser for near-instant navigation without the need to make round-trip requests to the server.

As a basis for this style of application, we have used the massively popular Backbone.js. What attracted us to Backbone (apart from the fact that we’re already using it for our Mobile site and the Widget), is that it doesn’t prescribe too much about how it should be used. Backbone still provides a solid basis for working with views, data models and collections, but leaves lots of questions unanswered, and this is where its flexibility and strength really lies.

For rendering the views on the front end, we use the Handlebars templating system. We evaluated several other templating engines, but settled on Handlebars for a few reasons:

  • No logic is performed inside the templates, which enforces good separation of concerns.
  • It can be precompiled before deployment which results both in faster rendering and a smaller payload that needs to be sent to clients (the runtime library is only 3.3kb even before gzip).
  • It allows for custom helpers to be defined.

Modular code

One technique we used with the Widget which ended up being a great success was to write our code in modules and declare all dependencies explicitly.

When we write code, we write in CommonJS-style modules which are converted to AMD modules when they’re executed in the browser. There are some reasons we decided to have this conversion step, possibly best explained by seeing what each style looks like:

// CommonJS module ////////////////
var View = require('lib/view'),
    Sound = require('models/sound'),
    MyView;

MyView = module.exports = View.extend({
    // ...
});

// Equivalent AMD module //////////
define(['require', 'exports', 'module', 'lib/view', 'models/sound'],
  function () {
    var View = require('lib/view'),
        Sound = require('models/sound'),
        MyView;

    MyView = module.exports = View.extend({
        // ...
    });
  }
);
  • The extra define boilerplate is tedious to write
  • Duplication of module dependencies is also tedious and error-prone
  • Conversion from CommonJS to AMD is easily automated, so why not?

During local development, we convert to AMD modules on-the-fly and use RequireJS to load them individually. This makes development quite frictionless as we can just save and refresh to see the updates, however it’s not so great for production since this method creates hundreds of HTTP requests. Instead, the modules are concatenated into several packages, and we drop RequireJS for the super-lightweight module loader AlmondJS.

CSS and Templates as dependencies

Since we’re already including all of our code by defining explicit dependencies, we thought it made sense to also include the CSS and template for a view in the same way. Doing this for templates was rather straight-forward since Handlebars compiles the templates into a Javascript function anyway. For the CSS, it was a new paradigm:

Views define which CSS files are needed for it to display properly, just the same as you would define the javascript modules you need to execute. Only when the view is rendered do we insert its CSS to the page. Of course, there are some common global styles, but mostly, each view has its own small CSS file which just defines the styles for that view.

When writing the code, we write in plain vanilla CSS (without help of preprocessors such as SCSS or LESS), but since they are being included by Require/Almond they need to be converted into modules as well. This is done with a build step which wraps the styles into a function which returns a <style> DOM element. Here’s an example of how it looks in essence:

Input is plain CSS

.myView {
  padding: 5px;
  color: #f0f;
}
.myView__foo {
  border: 1px solid #0f0;
}

Result is an AMD module

define("views/myView.css", [...], function (...) {
  var style = module.exports = document.createElement('style');
  style.appendChild(
    document.createTextNode('.myView { padding: 5px; color ... }');
  );
})

Views as components

A central concept in developing Next is that of treating views as independent and reusable components. Each view can include other ‘sub’ views, which can themselves include subviews and so on. The effect of this is that some views are merely composites of other views and cover an entire page, whereas others can be as small as a button, or even a label in some cases.

Keeping these views independent is very important. Each view is responsible for its own setup, events, data, and clean up. Views are ‘banned’ from modifying the behaviour or appearance of their subviews, or even making assumptions about how or where this view itself is being included. By removing these external factors, it means that each view can be included in any context with absolute minimum fuss, and we can be sure that it will work as it is supposed to.

As an example, the ‘play’ button on Next is a view. To include one anywhere on the site, all that we need to do is create an instance of the button, and tell it the id of the sound it should play. Everything else is handled internally by the button itself.

To actually create these subviews, most of the time they are created inside the template of the parent view. This is done by use of a custom Handlebars helper. Here is a snippet from a template which uses the view helper:

<div class="listenNetwork__creator">
  {{view "views/user/user-badge" resource_id=user.id}}
</div>

As you can see, adding a subview is as simple as specifying the module name of the view and passing some minimal instantiation variables. What actually happens behind the scenes goes like this:

When a view is rendered, the template must return a string. When the view helper is invoked, it pushes the attributes passed to it, plus a reference to the requested view class, into a temporary object with an id, and outputs a placeholder element (we use <view data-id="xxxx">). The id is just a unique, incrementing number. After a template is rendered, the output would be a string which might look something like:

<div class="foo">
  <view data-id="123"></view>
</div>

Then we find the placeholders and replace those elements with the subview’s element which it automatically creates for itself. In essence, the code does this:

parentView.$('view').each(function () {
  var id = this.getAttribute('data-id'),
      attrs = theTemporaryObject[id],
      SubView = attrs.ViewClass,
      subView = new SubView(attrs);
  subView.render(); // repeat the process again
  $(this).replaceWith(subView.el);
});

Sharing Models between Views

So, we now have a system where there will be dozens of views on the screen at one time, many of which will be views of the same model. Take, for example, a “listen” page:

There would be a view for the play button, the title of the sound, the waveform, the time since the sound was uploaded (this dynamically updates itself, which is why it is a view), and so on. Each of these views are of the same sound model, but we wouldn’t want each to duplicate the data. Instead we need to find a way to share the model.

Also remember that each of these views has to handle the case where there is no data yet. Almost all views are instantiated only with the id of its model, so it’s quite possible that the data for that model hasn’t been loaded yet.

To solve this, we use a construct we call the instance store. This store is an object which is implicitly accessed and modified each time a constructor for a model is called. When a model is constructed for the first time, it injects itself into the store, using its id as a unique key. If the same model constructor is called with the same id, then the original instance is returned.

var s1 = new Sound({id: 123}),
    s2 = new Sound({id: 123});

s1 === s2; // true, these are the exact same object.

This works because of a surprisingly little-known feature of Javascript. If a constructor returns an object, then that is the value assigned. Therefore, if we return a reference to the instance created earlier, we get the desired behaviour. Behind the scenes, the constructor is basically doing this:

var store = {};

function Sound(attributes) {
    var id = attributes.id;

    // check if this model has already been created
    if (store[id]) {
        // if yes, return that
        return store[id];
    }
    // otherwise, store this instance
    store[id] = this;
}

This is not a particularly new pattern: it’s simply just the Factory Method Pattern wrapped up into the constructor. It could have been written as Sound.create({id: 123}), but since Javascript gives us this expressive ability, it makes sense to use it.

So, this feature means that it’s completely simple for views to share the same instance of a model without knowing anything about the other views, simply by calling the constructor with a single id. We can then use this shared instance as a localised ‘event bus’ to facilitate communication and synchronisation of the views. Usually this is in the form of listening to changes to the model’s data. If the views subscribe to the ‘change’ events which affect it, then they will be notified immediately upon change and the page can be kept up-to-date with very little effort required by the developer.

This also is how we solve the issue of there being no data on the model. On the first pass, several views might have a reference to a model which only contains an id and no other attributes. When the first view is rendered, it can detect that the view does not have enough information and so it would ask the model to fetch its data from the API. The model keeps track of this request, so that when the other views also ask it to fetch, we do nothing and avoid duplicate requests. When the data comes back from the server, the attributes of the model will be updated, causing ‘change’ events which then notify all the views.

Making full use of data

A common feature of many APIs is that when one particular resource is requested, other related resources are included in the response. For example, on SoundCloud, when you request information about a sound, included in the response is a representation of the user who created that sound.

/* api.soundcloud.com/tracks/49931 */
{
  "id": 49931,
  "title": "Hobnotropic",
  ...
  "user": {
    "id": 1433,
    "permalink": "matas",
    "username": "matas",
    "avatar_url": "http://i1.soundc..."
  }
}

Rather than let this extra data go to waste, each model is aware of which ‘sub-resources’ it can expect in its responses. These sub-resources are inserted into the instance store in case any views need to use the data. This means we can save a lot of extra trips to the API and display the views much faster.

So, for our example above, the Sound model would know that in its property “user” it has a representation of a User model. When that data is fetched, then two models are created and populated on the client side:

var sound = new Sound({id: 49931 });

sound
  .fetch()             // get the data
  .done(function () {  // and when it's done
    var user = new User({id: sound.user.id });
    user.get('username'); // 'matas' -- we already have the model data
  });

What’s important to remember is that because there’s only ever one instance of each model, even pre-existing instances are updated. Here’s the same example from above, but note when the User is created.

var sound = new Sound({id: 49931 }),
    user = new User({id: 1433 });

user.get('username'); // undefined -- we haven't fetched anything yet.
sound
  .fetch()
  .done(function () {
    user.get('username'); // 'matas' -- the previous instance is updated
  });

Letting go

Holding on to every instance of every model forever isn’t feasible, especially for the Next SoundCloud. Because of the nature of the site, it’s quite possible that a user might go for several hours without ever performing a page load. During this time, the memory consumption of the application would just continue to grow and grow. Therefore, at some point we need to flush some models out of the instance store. To decide when it is safe to do this, the instance store increments a usage counter each time an instance has been requested, and views can ‘release’ a model when it is no longer needed and the count is decremented.

Periodically, we check the store to see if there are any models with a count of zero, and they’re purged from the store, allowing the browser’s garbage collector to free up the memory. This usage count is encapsulated in the store object, but in essence it’s something like this:

var store = {},
    counts = {};

function Sound(attributes) {
  var id = attributes.id;
  if (store[id]) {
    counts[id]++;
    return store[id];
  }
  store[id] = this;
  counts[id] = 1;
}

Sound.prototype.release = function () {
  counts[this.id]--;
}

The reason for performing the cleanup on a timer, rather than whenever a usage count hits zero is so that the model stays in the store when you switch views. If you navigate to another page, there will be a single moment between cleaning up the existing views and setting up the new ones when every single model’s count is zero. The new page might actually contain views of one or more of these models, so it’d be quite wasteful to remove them instantly.

A long journey…

This has been a brief introduction into some of the methods and concepts we’re using to create Next SoundCloud, but it’s just the beginning. There are plenty more features which we have yet to build and therefore plenty more challenges to tackle. If you want join us along the way, remember that we’re always hiring!

Nick Fisher
March 19th, 2012

Hacker Time – the first 3 months

Over 3 months ago we started the “Hacker Time” initiative (see here) and now it is time for a recap what’s happened so far.
From the outset, people outside of the engineering department were very forthcoming in suggesting ideas for Hacker Time projects. While it was great to see so much interest, the essence of Hacker Time is that engineers create the projects that they’re personally motivated to develop. (It’s absolutely key to keep things separate from the general backlog!).
Once that clicked, our engineers started initiating and working on their passion projects. Hacker Time has found momentum; currently around 50% of our engineers (who have been in the company longer than 6 months) have a project they are currently working on. These projects range from doing remote online courses to participating in competitions to cool hacks (which sometimes start at “Hack Days” and are then continued within Hacker Time).
Some of the highlights are:

  • Instasound lets you record your sounds, apply a range of filters in real time, and instantly share to SoundCloud – video
  • Large Hadron Migrator – a tool to migrate database tables online without downtime
  • HyperSpec – HyperSpec provides a Ruby DSL for testing HTTP APIs from the “outside”
  • Story Wheel lets you record a story around your Instagram pictures and share it on the web as a nostalgic slideshow.
  • Massive Site Map – a ruby gem to build painfree google sitemaps for webpages with millions of pages. Differential updates keep generation time short and reduces load on DB.
  • KDD Cup 2012 – a team of 5 SoundCloud engineers is participating
  • Some engineers are taking online courses from Stanford, e.g. the machine learning online course

We’ve decided not to track the aggregate hours spent on Hacker Time projects, but to leave it to the teams to organize. It’s an approach that’s working well, because it reinforces the spontaneity and informality that this kind of initiative needs in order to be credible.

So what about the 50% of engineers that haven’t yet worked on their own Hacker Time project? Well, no big surprises here, there are two main reasons: Either they feel they have don’t have enough time or they haven’t found a topic yet.

Overall we’ve seen the initial concept successfully initiated and it’s begun to get into full swing. We’re really pleased with the output so far and will have more updates soon – some engineers will present their projects in more detail!

Alexander Grosse
March 4th, 2012

How we do… Retrospectives

SoundCloud started its agile journey with Scrum and eventually moved to an approach based on Kanban (more on that in one of the next blog posts). Regardless of the methodology we follow, though, we strongly believe that the most important tools an agile team can use to practice continuous improvement are retrospectives.
You might ask why we consider the retrospective to be the most important one? We think it is key to constantly learn and to improve based on our experiences. The best way to achieve this is to look back at our work in regular intervals and to give every team member the opportunity to give input on the past work – and to make sure we have actually improved something. This is done by reviewing action items from the last retrospective as the first step.

What kind of retrospectives?

The most frequent retrospective meeting is done after each iteration – in our context this means every two weeks. Every three months we do a big retrospective meeting with all of the engineers in one room to find common patterns in our development department. One major result of these big retrospectives is our “hacker time” project – see here for more details.
The length of the retrospective depends on the size of the team and the time between retrospectives. We usually schedule 30 minutes for the short one after each iteration and 90 minutes for big ones.

How is the retrospective done?

Pre-requisites:

You need only a few things to do a good retrospective – you can even do one only with Google Docs – but it works best with the following things:

  1. Post-its in 3 different colors (optimal is green, red, yellow)
  2. Enough pens so everybody has one – rather use big pens than ball pens so the content is short and everybody in the room can read it
  3. A big wall to put the post-its on, a whiteboard will do
  4. And someone to facilitate the retrospective, in Scrum it should be the Scrum Master. Really experienced teams do not necessarily need a facilitator. The facilitator keeps track of the time and leads the grouping of items.

The procedure

1) You put three columns on the wall with the title: less – same – more. Ideally you label each columns with one of the three differently colored post-it’s (red= less, green=same, yellow=more).

So, all items put under “less” are things which either should be done less or should be stopped completely.
“Same” is obvious, and “more” are items which should be done more or should be started.

Why? This scheme avoids the bad/good categories often used and allows an open communication where putting a sticker somewhere does not mean something is “bad” – it allows a more nuanced conversation.

2) Collecting items: everybody gets around 5-10 minutes to write down their points. During that time no discussions are allowed, it is only about collecting items.

Why? Everybody can give their input, often some people tend to dominate a discussion and the other people are not heard as much. This approach enables everybody to give their input.

People are free to put post-it’s on the wall (on the three columns) anytime. Latest after 10 minutes or when nobody is writing anymore the next phase starts.

3) Group items

The facilitator starts grouping items into clusters to find common themes. Usually there always are topics which a lot of people mention. Every item is read out to the group and if everybody agrees put to a cluster.

4) Find good names for the clusters

This is the final test if the grouping made sense – if you cannot agree on a name the cluster has too many different items and has to be split.

5) Vote

To identify the most pressing issues (number of post-its in a cluster is also a pretty meaningful indicator) everybody gets three votes and can give one vote to three different clusters. The top three items should be addressed during the next iteration.

6) Discuss

If there is time left start discussing the clusters starting with the one which got the most votes. Try to get action items out of it.

7) Action-items

A retrospective tends to be useless if it is “just talk” and no actions follow. So the key ending of a retrospective is to to collect action-items to work on at least the top voted clusters. The next retrospective starts then with a review of the action-items from the last retrospective.

We have been doing it that way for the last 8 months with pretty good results. There are around 8 engineering teams at SoundCloud and 45 engineers overall. The key challenge is here to continue doing retrospectives and to follow up on the action items.

Alexander Grosse
December 9th, 2011

Stop! Hacker Time.

At SoundCloud we like to invent new ideas. But we’re not adverse to implementing really great tried and tested ideas like the 20% time concept made famous by Google.

We’re calling it Hacker Time. We’re still very much in start-up mode so we’re keen to nurture the spirit of hacking. We’ve been testing out Hacker Time for a few weeks now and we’re excited about its potential, from industry-changing initiatives like “Are we playing yet” to unusual passion projects like the “Owl Octave”.

Why we’re doing it

When I arrived at SoundCloud I gathered all the engineers and developers in a room and asked them what they wanted more of. This was one of the top requests:

Like every other fast-paced tech company out there, the everyday demands of development leave little time for pet projects or invention outside of the roadmap. But we shouldn’t let great ideas slip away: it’s wasteful of talent and it’s frustrating for developers themselves. Hacker Time is an attempt to keep both the ideas and the employees focused on making the product even better.

I don’t think there’s a single developer at SoundCloud who isn’t a sound creator. Or at least they quickly become one once they’ve joined. We’re a company of electronic music producers, acoustic musicians, field recordists and social sound fanatics. Of course we’re developing the product for 9 million people, but we’re also developing something we want to use ourselves.

Ironically we’re not trying to invent anything new by initiating Hacker Time. Aside from Google, we’ve been watching the Atlassian experiment and are taking a similar approach; we’ll start with a simple set of rules and adapt to what works best for us. Like Atlassian, we’ll be blogging about our experiences too.

Which projects are engineers allowed to work on?

We’ve compiled a list which is not meant to be restrictive but rather should be a guideline:

  • Pet features/improvements that never made it onto the roadmap
  • Apps based on the Soundcloud API
  • Hack Days
  • Anything for SoundCloudLabs.com
  • “this always annoyed me” bug-fixes or architectural improvements
  • Integration of some technology-du-jour with Soundcloud
  • Contributing to OSS used at Soundcloud
  • Other cool projects using SoundCloud
  • Conferences
  • Writing Blog Posts, Technical Articles

How will time be allocated?

The big decision is how to allocate time to these projects. We don’t want to cannibalize valuable product development time but we do want to give people as much freedom as possible,

There are lots of approaches out there – from reserving one day a week, accumulating time to set aside whole weeks or handling that time like vacation. We’ve decided to put the decision-making in the hands of each team, allowing them to allocate Hacker Time according to each team’s workstyle. It’s an experiment, and something we’ll be reviewing in the early stages.

Demo it!

We’re pretty disciplined about demoing work to the whole company. Hacker Time projects will be included in the demos (which happen every 2 weeks at SoundCloud), giving developers a chance to showcase their projects and sparking interest in their hacks from everyone in the organization.

Watch this blog for further reports!

Many thanks to

Atlassian for blogging about their experiences

Simon Stewart from Google
Jim Webber from Neo4J
Jan Lehnhardt from Couchbase
Stefan Roock from it-agile

for sharing their experiences with us.

Alexander Grosse
November 21st, 2011

Front-end JavaScript bug tracking

Proper and effective error tracking is a common issue for front-end JavaScript code compared to back-end environments.

We felt this pain as well and experimented with different solutions over the past months on the SoundCloud Mobile site.

Analytics

The first approach we had was to track errors with Google Analytics. Their library permits to fire custom events and whenever an ajax error would occur, we would log it.

The biggest benefit of this tool is to monitor the stability of the site and its evolution in longer periods as you can easily go back a few weeks or months to see which events were triggered. Also, it is easy to implement – almost a one-liner!

The drawback, at least for Google Analytics, is that this tool is not meant to track bugs. There is no way to add custom data to these events to get more insight about why and how an error happened, it also doesn’t work in real-time, and you obviously want that when you debug.

So we kept Analytics in place for a long-term view, but took a look at other options for real-time and in-depth tracking.

Airbrake

In our pursuit of getting more insight, we decided to take a look at Airbrake because we were already using it to track back-end errors on our main site.

Our mobile site runs on Node.js, the first thing we did was to integrate an existing plugin for it to handle error tracking on the back-end as well.

Looking a little further we found a front-end notifier, which would catch errors that would fire on window.onerror, but there was no way to report any custom errors.

We decided to take a day to hack this on our own since their API is public and easy to implement.

The benefits of Airbrake were instant. We could see what triggered which error, how, why, in which context, which browser, etc… in real-time!


It also counts errors, which can help you prioritize and include fixes in your roadmap.

However, the lack of filtering, grouping and custom sorting made it difficult to work with. There was also no sense of time or progress, as everything just gets dumped into a single list ordered by time. We needed something a little better than that.

BugSense

That’s when our Android team showed us their BugSense implementation.
BugSense seemed to address all of these issues we had with Airbrake: grouping is more effective, searching and filtering is possible, charts of errors are drawn as well.

There is one more benefit over Airbrake… JSON. No need to convert objects to XML strings anymore!

If you are interested in our BugSense notifier you can find the source on github.

Conclusion

There is still a lot of work needed to make front-end JS debugging as easy as it is for regular back-end environments.
For example, stack traces today aren’t that useful, because of anonymous objects and minified code, but hopefully browser vendors will tackle these issues soon. Maybe Source Maps could be the first milestone in this quest.

At SoundCloud, we will continue to use a combination of these tools because of the different strengths outlined above, but there are also other tools we didn’t try out yet like getexceptional or errorception. If you have tried these, or if you have any suggestion on this subject we’d like to get your feedback in the comments below.

Happy debugging!

Yves
November 9th, 2011

SoundCloud launches the HTML5 Audio Improvement Initiative

We at SoundCloud want to build the best sound player for the web, and we want to do that using the Open Web standards. While working on the native audio features on our mobile site and new widgets, or even as an experiment on the main site, we have discovered that the HTML5 Audio standard is not equally well implemented across all modern browsers and some decisions can be made that would benefit the web audio users and web developers alike. Soundcloud launches the “Are We Playing Yet?” project, which aims to raise the awareness about the state of HTML5 Audio implementations in the web browsers.

AreWePlayingYet? 2014 A pragmatic HTML5 Audio test suite

We have decided to help the parties involved and collect the issues in one place, document them, provide the code and add interactive tests that will show the implementation progress. We understand how the software development works, and that a few iterations are needed until something is fully done. We hope ”Are We Playing Yet?” can function as a handy development and quality monitoring tool.

Issues - soundcloud/areweplayingyet - GitHub

“Are We Playing Yet?” was started by SoundCloud but it’s open to all companies and developers who care about the state of HTML5 audio and want to build applications based on this Web standard. You can get the project source on GitHub, contribute tests and fixes via the pull requests or Issue Tracker, and connect to the people involved via @areweplayingyet on Twitter.

matas