Archive for the ‘Post Mortem’ Category You are currently browsing the archives for the Post Mortem category.
August 14th, 2012

Shoot yourself in the foot with iptables and kmod auto-loading

As some of you might know, we had an outage yesterday.
We believe that in every mistake there is something to learn from, so after each outage we are writing post-mortems. Usually we do this internally because the issues we run into are very specific to our infrastructure.

This time we ran into a quite nasty issue which could affect everyone running a linux system with a lot sessions on it and we thought you might be interested to know about that pitfall.

What happened?

At 4:40pm CEST, we got reports about Yikes (503/504 errors) on SoundCloud. Around the same time, our monitoring alerted for a high amount of 503s at our caching layer and right after that one of our L7 routing nginx instances was reported down.

We were still able to log into that system. dmesg showed:

Aug 13 14:46:52 ams-mid006.int.s-cloud.net kernel: [8623919.136122] nf_conntrack: table full, dropping packet.
Aug 13 14:46:52 ams-mid006.int.s-cloud.net kernel: [8623919.136138] nf_conntrack: table full, dropping packet.

N.B.: Our systems are set to UTC timezone

That wasn’t expected. The first thought was: “Someone must have changed the sysctl tunings for that”. Then we realized that this system has no need for connection tracking, so nf_conntrack shouldn’t be loaded at all. As a quick contermeasure we raised net.ipv4.netfilter.ip_conntrack_max. This fixed that situation and brought the service back up.

Why did it happen?

After bringing the site back up, we investigated what caused the kernel to enable connection tracking. Doing a lsmod showed that connection tracking and iptables modules were actually loaded. Another look into dmesg revealed that right before the outage the ip_tables netfilter module was loaded:

Aug 13 14:38:27 ams-mid006.int.s-cloud.net kernel: [8623415.007818] ip_tables: (C) 2000-2006 Netfilter Core Team
Aug 13 14:38:35 ams-mid006.int.s-cloud.net kernel: [8623422.444931] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)

So what happened? One of our enginners was doing some preparations for scaling that layer of our infrastructure. To verify we don’t use any specific iptable rules on that system, he did:

iptables -L iptables -t nat -L

Those commands themself are pretty harmless. They will just list configured iptables rules. The first one rules in the filter table, the second one in the nat table. Nothing which should change any system configuration, right? Nope.
Let’s try to reproduce it. Just boot up some system (I’ve tried it on my Ubuntu Laptop). No iptables module should be loaded:

root@apollon:~# lsmod|grep ipt

Now just list your iptable rules:

iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination

Chain FORWARD (policy ACCEPT)
target prot opt source destination

Chain OUTPUT (policy ACCEPT)
target prot opt source destination

And check again for loaded modules:

root@apollon:~# lsmod|grep ipt
iptable_filter 12810 0
ip_tables      27473 1 iptable_filter
x_tables       29846 2 iptable_filter,ip_tables

Okay, that loaded some iptables module to make it possible to add filter rules via iptables. This shouldn’t cause any problems, since without any actual rules the impact on the kernel is negligible. But now check your nat table:

root@apollon:~# iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target prot opt source destination 

Chain INPUT (policy ACCEPT)
target prot opt source destination

Chain OUTPUT (policy ACCEPT)
target prot opt source destination

Chain POSTROUTING (policy ACCEPT)
target prot opt source destination

Completely empty as well. But now look at your kernel modules:

root@apollon:~# lsmod|grep ipt
iptable_nat 13229 0
nf_nat 25891 1 iptable_nat
nf_conntrack_ipv4 19716 3 iptable_nat,nf_nat
nf_conntrack 81926 3 iptable_nat,nf_nat,nf_conntrack_ipv4
iptable_filter 12810 0
ip_tables 27473 2 iptable_nat,iptable_filter
x_tables 29846 3 iptable_nat,iptable_filter,ip_tables

By just listing your iptable rules for the nat table, the kernel loaded nf_conntrack which enabled connection tracking. See dmesg:

[75024.007681] nf_conntrack version 0.5.0 (16384 buckets, 65536 max

On your Laptop you probably don’t care – it’s even quite convenient. On a production server that handles a large number of connections the fairly small default nf_conntrack table will overflow quite fast and cause dropped connections.

How do we prevent it?

iptables doesn’t load the nf_conntrack itself, it only loads ip_tables which again loads modules it depends on via the kernel’s kmod facility.

But since that module loader uses the modprobe user-space helpers like modprobe, the auto-loading process will honour modprobe.d/ settings. Unfortunatelly there is no easy way to disable loading of a module altogether, but there is a workaround for that.

Since we don’t need iptables at all on that system, we’ve created a /etc/modprobe.d/netfilter.conf like this:

alias ip_tables off
alias iptable off
alias iptable_nat off
alias iptable_filter off
alias x_tables off
alias nf_nat off
alias nf_conntrack_ipv4 off
alias nf_conntrack off

This will make modprobe load off instead of the actual kernel module.

Trying to run any iptables command now, should now give you now:
iptables -t nat -L
FATAL: Module off not found. iptables v1.4.12: can't initialize iptables table `nat': Table does not exist (do you need to insmod?) Perhaps iptables or your kernel needs to be upgraded.

Johannes 'fish' Ziemke
August 24th, 2011

Doing the right thing

The recent outage of SoundCloud was the result of everybody doing the right thing. This totally jives with John Allspaw’s message that looking for a root cause will lead to you finding smart people simply doing their jobs.

This is what happened.

 

The technologies at play

The number of interactions that escalated to the outage should be interesting for other Rails/MySQL shops out there. We’re using many idiomatic patterns within Rails that ended up having devastating consequences.

Cache keys that avoid invalidation

It’s best practice to form your cache keys in a way that doesn’t require you to issue cache key deletions, and rather let normal evictions reclaim garbage keys. To do this we use the ‘updated_at’ column on some of our models in the cache key so that if the model updates, we know we’ll get a new key.

ActiveRecord::Timestamp#touch

There is an innocuous method called ‘touch’ that will simply bump the updated_at and save that record. This is quite convenient to call on containers of has_many things like a forum topic is to its posts. With a recent ‘updated_at’ and consistent way to keep that recent via ‘touch’, there is a clean decoupling of cache strategy from business modeling when you intend to communicate the container’s version is as recent as the most recent addition.

ActiveRecord::Associations::BelongsToAssociation :counter_cache

In the absence of indexed expressions and multi index merges that some databases have, MySQL and InnoDB leaves it to the application to keep lookups of counts efficient. When dealing with tables of multiple millions of rows, a simple query like:

select count(*) from things 

Could take tens of seconds as InnoDB actually needs to traverse all primary keys and literally count the rows that exist which are visible in the current transaction.

class Foo belongs_to :things, :counter_cache => true end 

Is a simple and convenient workaround to avoid the ‘count(*)’ overhead where when a new Foo is created on a thing, that thing’s ‘foos_count’ would get a database increment by one. When it’s removed, the ‘foos_count’ would be decremented by one.

ActiveRecord::Associations::Association :dependent => :destroy

What better way to maintain all the business rules on deletion than to make sure your model’s callbacks fire when a container is destroyed. When your business consistency is maintained in the ORM, this is also the best place to ensure proper business rules on removal.

This simply does the following for every association:

thing.foos.each(&:destroy) 

Before:

thing.destroy 

It’s best to have the behavior declared where the association is declared and let the framework make sure it’s not forgotten when actually performing the destruction.

More than 140 characters

Some of the features of SoundCloud deserve more than 140 characters to do them justice. Even more than 255 characters. The tools that Rails gives you out of the box when you need more than 255 characters on a string in your data layer are limited; you’re left with the ‘text’ type in your schema definition. In MySQL this translates to a “TEXT” column.

A TEXT column with a maximum length of 65,535 (216 - 1) characters. 

Well, in most cases we wouldn’t need more than 1000 characters for much on our site, but in the spirit of “deployed or it didn’t happen”, our early database schemas were mostly driven by the simplest option that ActiveRecord::Schema offered. (We also have way too many NULLable columns)

Trade space for time or time for space

The early days, SoundCloud ran on 3 servers. CPU was precious so for some of the HTML conversion tasks we traded space for time and are storing a cached version of the HTML for some of the longer text fields in the same row as the record. This was the right choice at the time for the request load and available CPU.

(time is usually easier to scale than space)

Tune your DBs for your workload

We separate reads from writes on our data layer, and we also have slightly different tunings on the DB slaves that accept reads. We also have experienced statement-based asynchronous slave replication breaking the replication thread due to these different tunings on different hardware.

We use row-based (actually mixed) replication between our masters and slaves because it’s as close as you’ll get to the storage engines speaking directly to each other, minimizing the risk of differences in hardware/tuning interfering with the replication thread.

Alert on disk space thresholds

We have a massive amount of Nagios checks on all our systems, including host-based partition %free checks. When any partition on a host reaches a threshold of free space, an alert is sent.

Separate data and write ahead log physical devices

Most OLTP databases have data that is bound by random read/writes, whereas binary logs are fundamentally sequential writes. You want these workloads on different spindles when using rotating disks because a transaction cannot complete without first being written to the binary log. If you need to move the drive head for your binlog, you’ve just added milliseconds to all your transactions.

Clean up after yourself

Periodically there are administrative tasks that need to be performed on the site like mass takedowns of inappropriate content. The Rails console is amazing when you need to work directly with your business domain. Fire it up, get it done. For one-off maintenance tasks this is a life saver.

Add Spam, Mix, Bake and we’ve been Served

This all adds up, reviewing the good parts:

  • Abstract away bookkeeping in your domain model
  • Leverage existing patterns to get the job done quickly
  • Tune and monitor your DBs
  • Hand administer your site via your domain model

If you haven’t noticed yet, there have been some incorrigible entrepreneurs using some groups to advertise their pharmaceuticals distribution businesses. They have very thorough descriptions (5-50KB worth), and unprecedented activity with tens to thousands of posts in their own groups.

At 1:00pm yesterday, we were working through our cleanups, and cranked open a console with a large list of confirmed groups to remove. With Rails this is as simple as:

Group.find(ids).each(&:destroy) 

Looks innocent enough.

What the database sees

From the database perspective all of the automated bookkeeping and business domain extraction of individual destroys creates statements, this ended up looking something like this:


DELETE post
UPDATE group SET posts_count = posts_count - 1
UPDATE group SET updated_at = now()
... x 1-5000
DELETE track_group_contribution
UPDATE group SET tracks_count = tracks_count - 1
UPDATE group SET updated_at = now()
... x 1-5000
DELETE user_group_membership
UPDATE group SET members_count = members_count - 1
UPDATE group SET updated_at = now()
... x 1-5000

So we’re seeing 2N+1 number of updates on a group where N is the sum of associated objects.

What replication sees

When using row-based replication, any change to a row gets that entire row added to the binary log. Some of these groups had over 100k worth of text columns and hundreds of associated posts. When parsing a given binlog, these group updates were taking over 90% of the replication events being sent to the slaves.

What the binlog partition sees

This is what finally brought us down. We were producing over 3GB/min of binlogs to be replicated to our slaves from these many group updates. Our binlog partition filled up from 10GB to 100GB in a matter of 30 minutes.

The MySQL docs are clear about what happens with a full data partition. When data cannot be written, MySQL just waits. The behavior around the binlog partition wasn’t as clear. That last binlog event had a partial write. When the disk filled, the binlog corrupted. When that last event in the last binlog was attempted to be sent to the slaves, it failed and the slaves stopped replicating.

 Got fatal error 1236 from master when reading data from binary log:
'log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master' 

Our max_allowed_packet is big enough for any of our rows.

How we recovered

We had a master with live queries that were not coming back from a ‘killed’ state. We scaled the binlog LVM partition so that it could accommodate new writes now, but the DB was not budging. We had no idea how to get it to start writing again so we began a failover process.

All our slaves were at the same position just before the corrupted event, so we grabbed one, confirmed the tuning was good and then promoted it. We went through the many other slaves and reconfigured them and we were good, consistent to the last event. All we lost were the few processes that were waiting to write to the full partition.

Ironically, 2 of our team members are currently in the datacenter recabling some of our racks. We also have 4 swanky new DB class machines powered on, but we were a day or two away from getting them networked and integrated. We were just short of having that excess capacity to accept the spike in load from return visitors after the tweet “We’re back!”.

Forensics

We tried to understand the cause of the sudden binlog growth so we could safely enable the site without a replay of what just had happened. We expanded out some of the logs with ‘mysqlbinlog –verbose’ to show that it was filled with group spam. To confirm that it was group activity, we compared the replication data volume per table of a well sized binlog file with an abnormally large with the following awk program:

    for log in mysql-bin.03980{5..7}; do
      mysqlbinlog $log |\
      awk '
        /Table_map/ { name = $9 }
        /BINLOG/ { bytes = 0; col = 1 }
        { if (col) bytes += length($0) }
        /\*!\*/ { if (col) sizes[name] += bytes; col = 0 }
        END { for (i in sizes) print sizes[i], i }
      ' |\
      sort -n |\
      tee /tmp/$log.sizes &
    done

This created a list of byte sizes of base64 encoded row data to table name. Groups took 820MB compared to the next largest table at 30MB.

Put this script in your toolbox, it’s also great to use for getting an idea of which tables are your hottest under normal operations.

We also used our recently finished HDFS-based log aggregation system to run map reduce jobs over our web and app logs to identify any possible abuse vectors around groups that were coming from the outside.

What we learned

When maintenance around abusive usage is also a part of your business, think about the data and impact on your running system for all maintenance work.

Cut your losses early and resist the temptation to find the “root cause” during an outage incident. Failforward. Save what you can for forensics after you’re back up.

Uncleared acknowledged alerts are alertable offenses. A big question during the incident was, “where were the alerts?”. It turned out that the host-based disk space check was previously acknowledged because we had an unrelated partition fill on the same host within expectation. This acknowledgment wasn’t cleared before the binlog partition filled so we didn’t get the 20 minutes of lead time we could have had.

In the heat of the moment, put your heads together, make a plan for the next X minutes, execute, and repeat until you’re back up. All the engineers were at battle stations during this outage. This incident blindsided us and all kinds of theories were thrown around. When we focused on do X within Y minutes we got down to time boxing research to be able to take our next action. This worked very well.

Pay off your technical debt. In the past, we took a loan on the future for trading space for time. Paying off these kinds of debts is easy to defer, until the collector comes to visit. Keep a list of the debts chosen, and the debts discovered. Even without an estimated cost to fix, your debt is a learning tool for others for where and when measured choices towards the road of delivery are made.

Be as specific as you can with your expected data types. We never expected group descriptions to be larger than 2KB. We should have encoded that expectation loud and clear in the data and business layers.

Doing the right thing

We are all working with the best practices in mind, yet the combination of all that we were doing correctly ended up with this outage nobody expected. Yesterday was an incredible learning experience for everyone at SoundCloud, and the entire team joined together with a positive spirit and heartfelt passion to restore service as quickly as possible. I’m quite proud to work with everyone here.

Sean Treadway