Multiple host node outage
Incident Report for Binary Lane
Postmortem

On Monday, 18 June 2018, BinaryLane experienced kernel panics on seventeen of its host nodes. The outages started just after 20:00 AEST and were fully resolved by 22:00 AEST. The outage duration on individual host nodes varied between approximately 20 and 55 minutes.

Background

BinaryLane’s host nodes are Linux servers. From time to time, new releases of the Linux kernel address exploits and other issues that could affect the security of our systems. Typically, installing an updated Linux kernel requires a full reboot of the host node, which is undesirable.

KernelCare is a third-party service that uses a proprietary “hot-patching” mechanism to update the Linux kernel without rebooting the host node. We have used KernelCare across our fleet of host nodes for several years.

Timeline

At approximately 20:10 our internal monitoring indicated a service fault. Initial investigation of the fault found that host node bnecompute03 was offline. IPMI access was used to determine that a kernel panic had occurred, which is unusual but not unheard of.

While waiting for that host node to power-cycle, we observed at 20:25 that host nodes bnecompute09 and melcompute01 were showing the same fault, followed by sydcompute03 at 20:30.

While working to restore service, we checked KernelCare to see if any updates had been made recently; a new update had been released that day (https://groups.google.com/forum/#!topic/kernelcare-ubuntu/r-vaQn3Aeb8), and at 20:35 we were working on the premise that the KernelCare update was faulty.

By 20:45 we had deleted the cron task that performs the update process and then by 21:00 had instructed KernelCare to “unload” on all active host nodes.

From 21:00 no further kernel panics were experienced, and we focused purely on restoring service for affected customers.

The individual host node outage start and approximate end times were as follows (AEST):

Host          Start  End
bnecompute03  20:02  20:25
bnecompute04  20:57  21:19
bnecompute05  20:20  21:14
bnecompute08  20:55  21:37
bnecompute09  20:01  20:34
bnecompute10  20:55  21:36
sydcompute01  20:55  21:36
sydcompute02  20:37  21:05
sydcompute03  20:20  20:55
sydcompute04  20:55  21:28
sydcompute05  20:57  21:29
sydcompute06  20:57  21:30
sydcompute07  20:55  21:38
sydcompute09  20:55  21:38
sydcompute10  20:36  21:14
sydcompute18  20:58  21:24
melcompute01  20:02  20:38
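
For readers who want to check the per-host durations, the table above can be turned into minutes with a short Python sketch. The host names and times are copied from the table; the variable and function names are our own and purely illustrative.

```python
from datetime import datetime

# (host, start, end) in AEST, copied from the table above.
OUTAGES = [
    ("bnecompute03", "20:02", "20:25"),
    ("bnecompute04", "20:57", "21:19"),
    ("bnecompute05", "20:20", "21:14"),
    ("bnecompute08", "20:55", "21:37"),
    ("bnecompute09", "20:01", "20:34"),
    ("bnecompute10", "20:55", "21:36"),
    ("sydcompute01", "20:55", "21:36"),
    ("sydcompute02", "20:37", "21:05"),
    ("sydcompute03", "20:20", "20:55"),
    ("sydcompute04", "20:55", "21:28"),
    ("sydcompute05", "20:57", "21:29"),
    ("sydcompute06", "20:57", "21:30"),
    ("sydcompute07", "20:55", "21:38"),
    ("sydcompute09", "20:55", "21:38"),
    ("sydcompute10", "20:36", "21:14"),
    ("sydcompute18", "20:58", "21:24"),
    ("melcompute01", "20:02", "20:38"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


durations = {host: minutes_between(start, end) for host, start, end in OUTAGES}
shortest = min(durations, key=durations.get)
longest = max(durations, key=durations.get)

print(len(durations), "hosts")
print("shortest:", shortest, durations[shortest], "minutes")
print("longest:", longest, durations[longest], "minutes")
```

On these figures the shortest outage is 22 minutes (bnecompute04) and the longest is 54 minutes (bnecompute05). The end times are approximate, so the durations are too.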

Root Cause

During the incident we hypothesised that the root cause was the latest update available from KernelCare; removing KernelCare from our host nodes resolved the incident.

It was later confirmed by KernelCare themselves (https://groups.google.com/forum/#!topic/kernelcare-ubuntu/CwwHvtUd8Z8) that the latest update was at fault and they have since withdrawn that specific version from the service.

The severity of the fault was exacerbated by the default configuration of KernelCare’s cron entry, which at the time we installed it looked like this:

20 */4 * * * root /usr/bin/kcarectl --auto-update

If you are unfamiliar with cron syntax, this says: every 4 hours, at 20 minutes past the hour (i.e. at 00:20 / 04:20 / 08:20 / 12:20 / 16:20 / 20:20), execute the KernelCare update process. The minutes-past-the-hour value (“20” in this example) is randomly selected during the installation process.
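
As a rough illustration of how that schedule expands, the Python sketch below handles only the two cron field forms used in this entry (a plain number and the "*/N" step form), plus a bare "*". It is not a general cron parser, and the function names are our own.

```python
def expand_field(field: str, low: int, high: int) -> list[int]:
    """Expand a cron field given as '*', '*/N' or a single number."""
    if field == "*":
        return list(range(low, high + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(low, high + 1, step))
    return [int(field)]


def daily_run_times(minute_field: str, hour_field: str) -> list[str]:
    """List the HH:MM times at which the entry fires each day."""
    minutes = expand_field(minute_field, 0, 59)
    hours = expand_field(hour_field, 0, 23)
    return [f"{h:02d}:{m:02d}" for h in hours for m in minutes]


run_times = daily_run_times("20", "*/4")
print(run_times)

# Whatever minute the installer picks, one run always lands in the 20:00 hour.
always_hits_8pm = all(
    any(t.startswith("20:") for t in daily_run_times(str(minute), "*/4"))
    for minute in range(60)
)
print(always_hits_8pm)
```

The first print lists the six daily run times given above. The second confirms that randomising only the minute still leaves every host with a run between 20:00 and 20:59.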

Because only the minute is randomised and the hours are fixed, every server in our fleet would have tried to pull down the update between 8:00PM and 9:00PM that day. In investigating this fault we have learned that the latest version of the installer instead creates a cron entry that looks like this:

20 */4 * * * root /usr/bin/kcarectl --auto-update --gradual-rollout=auto

The “gradual-rollout” setting apparently instructs the KernelCare software to deploy new versions at a random time within a 12-hour window. While we would still have experienced host outages with this setting, the number of affected hosts would have been greatly reduced.

This is new functionality that was not available when we deployed KernelCare. We were also unaware that, while the kernel updates themselves are automated, installing updates to the “client” software is a manual process.
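
To give a feel for the difference, here is a deliberately simplified Python model. It assumes each host applies an update at a uniformly random moment within the rollout window, and that a faulty update is stopped everywhere a fixed time after release. Both assumptions are ours; this is not a description of KernelCare's actual rollout algorithm, and the one-hour response time is an illustrative figure only.

```python
def expected_fraction_updated(response_minutes: float, window_minutes: float) -> float:
    """Expected share of hosts that apply an update before it is stopped.

    Assumes update times are spread uniformly over `window_minutes` and the
    update is stopped `response_minutes` after release (illustrative model).
    """
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return min(1.0, max(0.0, response_minutes) / window_minutes)


# Assumed for illustration: roughly one hour to detect and disable the update.
RESPONSE_MINUTES = 60

single_hour = expected_fraction_updated(RESPONSE_MINUTES, 60)        # all hosts within one hour
twelve_hours = expected_fraction_updated(RESPONSE_MINUTES, 12 * 60)  # spread over 12 hours

print(f"one-hour window: {single_hour:.0%} of hosts")
print(f"12-hour window:  {twelve_hours:.0%} of hosts")
```

Under these assumptions a one-hour response catches every host when all updates fall within a single hour, but only about one host in twelve when updates are spread over 12 hours.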

Mitigation

In reviewing the latest KernelCare documentation, we observed that a number of additions have been made since we first deployed the product to provide greater control over the updating process.

After discussing the options internally, we have decided to use the new “delayed feed” functionality going forward, which is a mechanism to instruct the KernelCare client to temporarily ignore new updates.

Our test servers will be left on the undelayed feed, so that we can identify any potential issues prior to production deployment.

Additionally, we will restrict our KernelCare updates to occur gradually between midnight and 8:00AM. In the event that a faulty release is not detected by our test servers or by other KernelCare customers, any resulting outage should occur out of hours.
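
One way to picture spreading updates across such a window is to give each host a stable slot derived from its hostname. The Python sketch below is a hypothetical illustration only: the hashing approach, the constants and the function name are our own assumptions and do not describe how KernelCare or our own tooling schedules updates.

```python
import hashlib

WINDOW_START_HOUR = 0  # midnight
WINDOW_END_HOUR = 8    # 8:00AM (exclusive)


def update_slot(hostname: str) -> str:
    """Map a hostname to a stable HH:MM slot inside the update window."""
    window_minutes = (WINDOW_END_HOUR - WINDOW_START_HOUR) * 60
    # Hash the hostname so the same host always gets the same slot.
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    offset = int.from_bytes(digest[:4], "big") % window_minutes
    hour = WINDOW_START_HOUR + offset // 60
    minute = offset % 60
    return f"{hour:02d}:{minute:02d}"


for host in ("bnecompute03", "sydcompute01", "melcompute01"):
    print(host, update_slot(host))
```

Because the slot depends only on the hostname, it does not change between runs, and every slot falls between 00:00 and 07:59.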

Posted 3 months ago. Jun 21, 2018 - 09:20 AEST

Resolved
Platform has been stable for an hour now, so we believe disabling KernelCare has resolved the issue.

If you have not already done so, please check your VPS as soon as convenient to ensure it is functioning correctly following the host power-cycle. The following host nodes were affected by this incident:

- bnecompute03
- bnecompute04
- bnecompute05
- bnecompute08
- bnecompute09
- bnecompute10
- sydcompute01
- sydcompute02
- sydcompute03
- sydcompute04
- sydcompute05
- sydcompute06
- sydcompute07
- sydcompute09
- sydcompute10
- sydcompute18
- melcompute01

We will provide an incident post-mortem later in the week.
Posted 3 months ago. Jun 18, 2018 - 23:03 AEST
Monitoring
We have disabled KernelCare on all host nodes and have not seen any further kernel faults. We will continue to monitor the platform but at this point believe that this incident has been resolved.

An incident post-mortem will be available later in the week.
Posted 3 months ago. Jun 18, 2018 - 22:05 AEST
Update
We are continuing to disable KernelCare and restart affected host nodes.
Posted 3 months ago. Jun 18, 2018 - 21:19 AEST
Update
We believe this issue relates to a KernelCare update released today and are in the process of disabling it.
Posted 3 months ago. Jun 18, 2018 - 20:41 AEST
Update
Host nodes bnecompute03, bnecompute09, melcompute01 are affected.
Posted 3 months ago. Jun 18, 2018 - 20:25 AEST
Identified
This host node experienced a kernel fault requiring a reboot to correct. We have done so and are currently bringing customer VPS back online.
Posted 3 months ago. Jun 18, 2018 - 20:19 AEST
Investigating
We are currently investigating this issue.
Posted 3 months ago. Jun 18, 2018 - 20:13 AEST
This incident affected: Compute, mPanel, and Website.