On Monday, 18th June 2018, BinaryLane experienced kernel panics on seventeen of its host nodes. The outages started just after 20:00 AEST and were fully resolved by 22:00 AEST. The outage duration on individual host nodes varied between approximately 30 and 45 minutes.
BinaryLane’s host nodes are Linux servers. From time to time, new releases of the Linux kernel are made that address exploits and other issues that could affect the security of our systems. Typically, installing an updated Linux kernel requires a full reboot of the host node, which is undesirable.
A third-party service called KernelCare employs a proprietary “hot-patching” mechanism to update the Linux kernel without rebooting the host node. We have used KernelCare across our fleet of host nodes for several years.
At approximately 20:10 our internal monitoring indicated a service fault. Initial investigation of the fault found that host node bnecompute03 was offline. IPMI access was used to determine that a kernel panic had occurred, which is unusual but not unheard of.
While waiting for that host node to power-cycle, we observed at 20:25 that host nodes bnecompute09 and melcompute01 were showing the same fault, followed by sydcompute23 at 20:30.
While working to restore service, we checked whether KernelCare had released any updates recently. A new update had been released that day (https://groups.google.com/forum/#!topic/kernelcare-ubuntu/r-vaQn3Aeb8), and from 20:35 we were working on the premise that the KernelCare update was faulty.
By 20:45 we had deleted the cron task that performs the update process, and by 21:00 we had instructed KernelCare to “unload” on all active host nodes.
From 21:00 no further kernel panics were experienced, and we focused purely on restoring service for affected customers.
The individual host node outage start and approximate end times were as follows (AEST):
During the incident we hypothesised that the root cause was the latest update available from KernelCare; removing KernelCare from our host nodes resolved the incident.
It was later confirmed by KernelCare themselves (https://groups.google.com/forum/#!topic/kernelcare-ubuntu/CwwHvtUd8Z8) that the latest update was at fault and they have since withdrawn that specific version from the service.
The severity of the fault was exacerbated by the default configuration of KernelCare’s cron entry, which at the time we installed it looked like this:
20 */4 * * * root /usr/bin/kcarectl --auto-update
If you are unfamiliar with cron syntax, this says: every 4 hours, at 20 minutes past the hour (i.e. at 00:20 / 04:20 / 08:20 / 12:20 / 16:20 / 20:20), execute the KernelCare update process. The minutes-past-the-hour value (“20” in this example) is randomly selected during the installation process.
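For readers who would like to check the schedule, here is a minimal Python sketch. It only models entries of the form "<minute> */4 * * *" and is not a general cron parser; the helper name run_times is ours, purely for illustration. It lists the daily run times for the entry above, then confirms that whichever minute the installer happened to pick, a host has exactly one run somewhere in the 20:00 hour.

```python
from datetime import time


def run_times(minute, hour_step=4):
    """Daily run times for a cron entry of the form '<minute> */<hour_step> * * *'."""
    return [time(hour, minute) for hour in range(0, 24, hour_step)]


# The entry quoted above: minute 20, every 4th hour.
schedule = [t.strftime("%H:%M") for t in run_times(20)]
print(schedule)

# Try every minute value the installer could have picked (0-59) and count
# how many runs each one produces during the 20:00 hour.
runs_in_evening_hour = {
    minute: [t for t in run_times(minute) if t.hour == 20]
    for minute in range(60)
}
every_host_runs_once = all(len(runs) == 1 for runs in runs_in_evening_hour.values())
print(every_host_runs_once)
```

The random minute spreads hosts across a single hour, but the hours themselves are the same on every host.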
Because only the minute differs from host to host, every server in our fleet would have tried to pull down the update between 20:00 and 21:00 that day. In investigating this fault we have learned that the latest version of the installer instead creates a cron entry that looks like this:
20 */4 * * * root /usr/bin/kcarectl --auto-update --gradual-rollout=auto
The “gradual-rollout” setting apparently instructs the KernelCare software to deploy new versions at a random time within a 12-hour window. While we would still have experienced host outages with this setting, the number of affected hosts would have been greatly reduced.
This is new functionality that was not available when we deployed KernelCare. We were unaware that, while the kernel updates themselves are automated, installing updates to the “client” software is a manual process.
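As a rough illustration of why a wider window helps, the Python sketch below compares a 1-hour and a 12-hour rollout window. It rests on assumptions of ours rather than anything in KernelCare's documentation: each host applies the update at a uniformly random time within the window, and no further hosts apply it once the update process is disabled. The 45-minute figure is taken from the timeline above (outages from about 20:00, cron task deleted by 20:45), and fraction_exposed is a hypothetical helper.

```python
def fraction_exposed(response_minutes, window_minutes):
    """Share of hosts expected to apply an update before it is blocked.

    Assumes (our simplification) that each host picks a uniformly random
    time within the rollout window and that updates stop after
    response_minutes.
    """
    return min(1.0, response_minutes / window_minutes)


response = 45  # minutes until the update cron task was deleted (from the timeline)

one_hour_window = fraction_exposed(response, 60)          # original cron entry
twelve_hour_window = fraction_exposed(response, 12 * 60)  # gradual rollout

print(f"1-hour window:  {one_hour_window:.1%} of hosts exposed")
print(f"12-hour window: {twelve_hour_window:.1%} of hosts exposed")
```

These figures describe only the simplified model, not the number of hosts that would actually have failed.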
In reviewing the latest KernelCare documentation, we observed that a number of additions have been made since we first deployed the product to provide greater control over the updating process.
After discussing the options internally, we have decided to use the new “delayed feed” functionality going forward. This is a mechanism that instructs the KernelCare client to temporarily ignore new updates.
Our test servers will be left on the undelayed feed, so that we can identify any potential issues prior to production deployment.
Additionally, we will restrict our KernelCare updates to occur gradually between midnight and 8:00AM. In the event that a faulty release is not detected by our test servers or by other KernelCare customers, any outage caused by the updated kernel should occur out-of-hours.