About an hour ago you may have experienced what seemed like sporadic downtime as one of our slices went down. We were out of the office, but luckily our Twitter-based SMS alerts let us know. During the drive back, before I could get to a computer, I called one of our users and asked him to run some diagnostics. His tests seemed to work fine, which meant either our alert script was broken or something had happened that we hadn’t considered when building the automatic alert tests: just one of the slices being down.
Sure enough, one of the slices had stopped working because of a memory leak in logging. It was a slow leak, and it only surfaced because we’ve had such consistent uptime recently. Luckily, we’re refactoring out the component that had the memory leak, so we don’t expect to see this happen again for a bit.
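For anyone curious about the gap in the alert tests, here is a minimal sketch of a check that probes each slice on its own instead of only testing the site as a whole. This is not our actual alert script: the function names, the health URLs, and the idea that each slice exposes its own reachable URL are all assumptions made for illustration.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, List


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False


def check_slices(
    slice_urls: Iterable[str],
    probe: Callable[[str], bool] = http_probe,
) -> Dict[str, bool]:
    """Probe every slice individually and map each URL to up (True) or down (False)."""
    return {url: probe(url) for url in slice_urls}


def down_slices(
    slice_urls: Iterable[str],
    probe: Callable[[str], bool] = http_probe,
) -> List[str]:
    """List the slices that failed their probe, in the order they were given."""
    return [url for url, ok in check_slices(slice_urls, probe).items() if not ok]


if __name__ == "__main__":
    # Hypothetical per-slice health URLs; replace with real ones.
    urls = [
        "http://slice1.example.invalid/health",
        "http://slice2.example.invalid/health",
    ]
    for url in down_slices(urls):
        print(f"ALERT: {url} is not responding")
```

Because every slice is probed directly, a single dead slice shows up by name even when a user's request happens to land on a healthy one.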