April 28th, 2011

Web Scale Statistics – Failing with MongoDB

As SoundCloud rapidly grows our initial systems need an overhaul. Our scaling strategy has been very realistic, design for 10x our current usage. Our initial statistics system found under http://soundcloud.com/you/stats was made when we were 100k users, living long past its expiration date.

Background

About a year ago we started off with the goal of redesigning the statistics pages to support 500 playbacks a second. We knew that this would be a write-heavy workload and that to sustain bursts of writes, we’d need decent partitioning. Coming from a successful experience moving the Dashboard feature to Cassandra 0.6 we started out prototyping a design that would be easily partitioned.

The write side of this story went very well, Cassandra could keep up with everything we threw at it, however to naively pull out all the aggregates we were collecting took hundreds of queries to the cluster. Cassandra didn’t have atomic counters at the time, so we had a lot of individual counts that needed to be summed on the client. (This is changing with the much anticipated upcoming 0.8 release!)

In a one-night experiment, we re-implemented the Cassandra based prototype to be backed by MongoDB. Not only could this quick prototype consume events as fast as Cassandra, there were some server side features in MongoDB that we could use to simplify a few of the queries that we had for the stats like atomic inplace insert/updates (upserts) to use fewer documents and secondary indexes to build the time series. Plus it was web scale.


To answer the partitioning problem for MongoDB, we decided to lean on the automatic sharding router that was under development. This was built to automate the rebalancing of data between replication pairs and keep a cluster nice and healthy without much data administration.

Away we went, implementing a correct, distributed key-value oriented MongoDB based statistics backend. We even deployed it into a cluster of Amazon EC2 instances and hooked it up to the website to start tracking our statistics alongside our existing solution.

What we observed was disconcerting. On a 36GB RAM machine with a working set under 100 GB, the system performed better than needed. A single node could process thousands of plays per second. Once the working set approached 300 GB, the throughput dropped down to between 100 and 200 plays per second. The disk utilization of one of the shards in the cluster stayed at 100% and we were seeing a IO service latency of around 5ms and MongoDB latency spiking upwards to 15 seconds. This all pointed to becoming bound on disk seeks.

Seeking the answer

Being bound on IO was something that we anticipated as we would not need to keep the entire working set resident, but not this bad. Whatever that MongoDB’s sharding was doing was causing a single node in the cluster to bottleneck. Could it be those poor disk heads bouncing back and forth to support our write load?

For any workload, a disk seek is the worst thing one could be spending time on. In “Numbers Everyone Should Know” it’s obvious we can do better:

execute typical instruction 1/1,000,000,000 sec = 1 nanosec
fetch from L1 cache memory 0.5 nanosec
branch misprediction 5 nanosec
fetch from L2 cache memory 7 nanosec
Mutex lock/unlock 25 nanosec
fetch from main memory 100 nanosec
send 2K bytes over 1Gbps network 20,000 nanosec
read 1MB sequentially from memory 250,000 nanosec
fetch from new disk location (seek) 8,000,000 nanosec
read 1MB sequentially from disk 20,000,000 nanosec
send packet US to Europe and back 150 milliseconds = 150,000,000 nanosec

The stress to get the stats released was mounting. Our existing solution was unusable for almost all of our active users and was actually causing a ripple effect in the databases that caused site performance degradation when anyone went to visit their stats page. And we had overshot our release goal already by 2 months.

We struggled hard to identify the root cause for why we had a hotspot in the MongoDB cluster. If we could just distribute the seek-bound workload, we could at least release this iteration of the stats while we dug into the root cause of being seek bound in the first place.

To identify the root cause, we tried confirming various theories about the workload we were introducing and the behavior of MongoDB. Each experiment was time consuming but we built up an understanding of how the system was (mis)behaving.

Death by design

The design of this system was fairly simple – create a document for each time aggregate that we needed to display in a time series. Use varying time spans to reduce the number of aggregates we would need to fetch and sum in the client. We used year, month, day and hour aggregation documents for each dimension of data we wished to track. This includes the total play counts, by user, country and source URL. Send these aggregates to a [mongos][mongo-mongos] node that would then shard the documents and distribute them to one of 4 backing replication pairs.

The last theory is that when a track is first recorded, the 4 aggregate documents (year, month, day, hour) are written to the end of each collection. All 4 documents would end up on the right side of the tree and the right side on disk, because our understanding of the MongoDB disk layout goes as far as being a memory mapped BTree of the actual BSON documents. As time passes, the longer duration documents, year and month, would start to become relatively shifted to the left of the disk storage, where the newer hours and days would end up on the right. Since MongoDB’s persistence was essentially a bunch of memory mapped files, this could have caused all the page faults loading up cold pages from the middle/left of the tree when attempting to update all aggregation buckets.

We are still uncertain about how the documents were layed out in memory and on disk, how it grew, how it was mapped to disk and how the indexes resolved to documents. If the on-disk layout was by the random ID then we were screwed on writes and reads because they would both be random. If it was by time, we were screwed on reads as we’d get sparse reads over the time series, if it was on our bucket organization we were screwed in the future as our buckets locality would get sparser over time.

Coming to terms

Not reasoning hard about this system’s IO patterns was a very late, and very obvious oversight. Even if we could partition the writes, we had designed a system that would require close to 100% in-memory residence to be performant over time. This is a very expensive proposal for mostly stale and very large data.

Lessons learned, we faced and made the difficult decision to cancel this project as it could not meet our long-term goals under real workload.

Yet the stress is still there, the SoundCloud statistics must work for all users, especially the people with millions of plays.

Up soon – what we made instead (for now).

Sean Treadway
  • http://www.mongodb.org ehwizard

    Hey there,
    Would love to understand more about the shard keys and indexes you had on those collections.
    If you had indexes where the keys were always increasing, the working set should have remained very small.
    -Eliot

  • http://www.couchbase.com Perry Krug

    Might be worth checking out Membase Server from Couchbase (www.couchbase.com)…it’s got very robust (and running in production) automatic sharding as well as atomic counters and a much more advanced memory management system (not just m-mapped). It may sound like a shameless plug (and maybe it is) but I’d love the opportunity to take on this challenge with you!

  • Dreflydre

    I’m glad someone is having a good chuckle .

  • Anonymous

    Hey @ehwizard:disqus

    I believe that Torsten discussed this with you a bit before, I’m sure he can chime in when he’s back from a break.

    What we settled on was the track_id for track specific counters and user_id for user counters. We also tried the UUID and saw the same behavior.

    None of these keys are monotonically increasing. We intentionally did not shard on an increasing key like the event timestamp, or aggregated counter timstamp (year,month,day) because we thought that would create a rotating hotspot as all writes would end up on the same shard until a new chunk was created.

    With these shard keys, we thought that our writes would get balanced across all shards because the distribution of the shard key values would be random and were thinking along the lines of “consistent hashing”. What we saw was that our first replica set was taking all the writes then it looked like mongos would kick off a background rebalancing to move the chunk to a different replica set.

  • Anonymous

    @202cbbf282c581c26d97a79f0fcfec5f:disqus I’ve heard great things about membase from Jan Lehnhardt. But what we’ve found is that a pure key-value store assumes totally random IO and builds its backing store based off that assumption.

    For our statistics, we take subsets of the time series and post-aggregate sums over the counts using a sliding date range picker over arbitrary date ranges. This means the counts inside of the time series for a track or a user should be optimized for sequential reads.

  • http://www.mongodb.org ehwizard

    How much data did you have?
    Sounds like it hadn’t balanced yet, or you needed to pre-split to bootstrap.

    Take a look at pre-splitting here: http://www.mongodb.org/display/DOCS/Splitting+Chunks

    Also, for cases like there where you want to limit the index, a good strategy is to make a shard key like YEARMONTHWEEK. That way you can distribute writes evenly, but not have to keep the entire index in ram.

  • Anonymous

    We had over 300GB of documents each usually under 100bytes.

    We had found that pre-splitting was necessary for our workload. A discussion of our data importing can be found here: http://groups.google.com/group/mongodb-user/browse_thread/thread/63d13fbf6d699594/71bc6453a514eb81

    The mongos processes and shard servers were doing fine. The primary unresolved issue was thrashing our disks on the master of the first replica set under normal workload.

  • http://www.mongodb.org ehwizard

    Sounds like something wasn’t balanced right, or key distribution wasn’t even.
    Which UUID algorithm did you use?

  • http://twitter.com/jswanhart Justin Swanhart

    Hi.

    I believe that you can do what you want with Flexviews and/or Shard-Query.

    Check out:
    http://flexvie.ws – materialized views (these can maintain your aggregate counts)
    http://code.google.com/p/shard-query (distributed query engine)

    Both are open source.  Percona provides services around them as well if you would like help building a system using these tools.

  • http://www.couchbase.com Perry Krug

    Makes a lot of sense.  We’ve actually got quite a bit of experience doing things like that within the memcached protocol (our engineers wrote >80% of the memcached codebase over the last 6 years or so) and with the CouchDB integration, we’ll be able to do mapreduce and more extensive querying/indexing.  It’s definitely a matter of finding the right tool for the job.  The problems you’ve indicated in terms of scaling and sharding are definitely fixed by Membase.  Hit me up at perry@202cbbf282c581c26d97a79f0fcfec5f:disqus -at- couchbase -dot- com.  I’d be really interested to see if we can help, and if we can’t, understand why.

  • http://backstage.soundcloud.com/2011/07/mysql-stats-old-faithful/ MySQL for Statistics – Old Faithful « « SoundCloud Backstage SoundCloud Backstage

    [...] our wheels were spinning trying to find out why our statistics storage patterns were causing MongoDB to thrash our disks, we started looking for an emergency alternative with the technology that we already had: [...]