CloudLinux - CloudLinux Blog - Issues caused by the latest KernelCare update and what we will do to ensure it never happens again

Issues caused by the latest KernelCare update and what we will do to ensure it never happens again

UPDATE: Mar 30, 10am Pacific time. The 24h feed was updated with the same issue because a technician incorrectly removed an "at" job. This was fixed shortly afterward, but some systems were affected.

We want to apologize for the KernelCare incident that affected some of our customers yesterday. Unfortunately, a bug in the POSIX ACL patch for CVE-2016-7097 was not caught by our test system.

We spent all of last night fixing the issue and re-releasing the patches that address the local privilege escalation vulnerability CVE-2017-2647.

To avoid such incidents going forward, we are implementing the following:

Currently, our test system runs a number of synthetic tests for 4 hours on each kernel. The test suite consists of LTP tests as well as our own set of tests. Clearly, those tests have limitations, so we are planning to add xfstests, avocado, and other synthetic test suites to our test process. We are also planning to add generic workloads to our test suite.

Upon investigating the issue, we also realized that part of the problem could be attributed to our deployment process. We typically release all patches together for most distributions and all kernels in order to get security fixes to you as soon as possible. Normally, no problems arise from this process. Yet we are only human, and errors are possible. We currently find out about errors directly from you, as we don’t yet have a process that notifies us when a patch has caused a crash.

As such, we are rebuilding the deployment system as follows:

  • We will modify the client-side script (kcarectl) to detect and notify us whether the server was patched successfully and did not crash within 1, 2, 5, 15, and 60 minutes of patching.
  • We will release patches separately for each distribution and kernel version, with a delay between each release, starting with the least popular kernels.
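As a rough illustration of the check-in idea in the first bullet, the sketch below turns the 1, 2, 5, 15, and 60 minute marks into waits between reports. It is not kcarectl code: `run_checkins`, `gaps_in_seconds`, and the `report` callback are hypothetical names, and the assumption is simply that a host which crashes never sends its remaining reports.

```python
import time
from typing import Callable, Iterable, List

# Check-in points from the post: minutes elapsed since the patch was applied.
CHECKIN_MINUTES = (1, 2, 5, 15, 60)


def gaps_in_seconds(checkin_minutes: Iterable[int]) -> List[int]:
    """Convert cumulative check-in times into the waits between check-ins."""
    gaps = []
    previous = 0
    for minute in checkin_minutes:
        gaps.append((minute - previous) * 60)
        previous = minute
    return gaps


def run_checkins(
    report: Callable[[int], None],
    sleep: Callable[[float], None] = time.sleep,
    checkin_minutes: Iterable[int] = CHECKIN_MINUTES,
) -> None:
    """Wait until each check-in point, then report that the host is still up.

    `report` is a hypothetical callback (for example, a request to the
    vendor's server). A host that crashes never sends its later reports,
    so a missing report is the signal that something went wrong.
    """
    minutes = list(checkin_minutes)
    for minute, gap in zip(minutes, gaps_in_seconds(minutes)):
        sleep(gap)
        report(minute)
```

Passing `sleep` as a parameter keeps the schedule testable without actually waiting an hour.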

Our deployment system will automatically check whether there were any problems with the deployment, and in those rare cases, it will stop the deployment of new patches immediately and roll back the one that was already deployed.

The goal is to stop the deployment process as soon as possible, often after the first crash, so that a newly released patch never crashes more than one or two servers out of all the servers that run KernelCare across all of our customers.
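The staged roll-out described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual deployment system: `staged_rollout`, `rollout_order`, and the `deploy`, `crashes_after`, and `rollback` callbacks are hypothetical, and the sketch assumes that a halt rolls the patch back on every kernel that has already received it.

```python
from typing import Callable, Dict, List, Tuple


def rollout_order(server_counts: Dict[str, int]) -> List[str]:
    """Order kernels from least to most popular (fewest servers first)."""
    return sorted(server_counts, key=lambda kernel: (server_counts[kernel], kernel))


def staged_rollout(
    server_counts: Dict[str, int],
    deploy: Callable[[str], None],
    crashes_after: Callable[[str], int],
    rollback: Callable[[str], None],
    max_crashes: int = 0,
) -> Tuple[List[str], List[str]]:
    """Release kernel by kernel; halt and roll back once crashes exceed the limit.

    Returns (kernels still patched, kernels rolled back).
    """
    deployed: List[str] = []
    for kernel in rollout_order(server_counts):
        deploy(kernel)
        deployed.append(kernel)
        # `crashes_after` stands in for waiting out the delay between
        # releases and counting crash reports for this kernel.
        if crashes_after(kernel) > max_crashes:
            # Assumption: undo the patch everywhere it went, newest first.
            rolled_back = list(reversed(deployed))
            for released in rolled_back:
                rollback(released)
            return [], rolled_back
    return deployed, []
```

Starting with the least popular kernels means that a bad patch is most likely to be caught while only a small number of servers have received it.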

This trickle-down roll-out process with automated safety checks might take up to 12 hours to reach all of the 100,000+ servers running KernelCare, but we believe it is the right decision: it will ensure that no customer ever has widespread issues from released patches again, and that incidents affecting multiple customers become a thing of the past.

We estimate it will take us about a month to implement the new deployment system. Until then, you may choose to use the delayed feed which ensures that your servers will receive patches 24 hours after the release.

To enable the delayed feed, add PREFIX=24h to /etc/sysconfig/kcare/kcare.conf
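For anyone scripting this change across many servers, here is a small sketch that adds the setting without creating duplicate entries. It assumes the file is a plain list of KEY=VALUE lines, which is not confirmed by this post; `set_prefix` is a hypothetical helper, not part of KernelCare.

```python
def set_prefix(config_text: str, value: str = "24h") -> str:
    """Return the config text with exactly one PREFIX=<value> line."""
    new_line = f"PREFIX={value}"
    out = []
    replaced = False
    for line in config_text.splitlines():
        if line.strip().startswith("PREFIX="):
            if not replaced:
                # Overwrite the first PREFIX line in place.
                out.append(new_line)
                replaced = True
            # Any further PREFIX lines are duplicates and are dropped.
            continue
        out.append(line)
    if not replaced:
        # No PREFIX line existed, so append one at the end.
        out.append(new_line)
    return "\n".join(out) + "\n"
```

To apply it, read /etc/sysconfig/kcare/kcare.conf, pass its contents through `set_prefix`, and write the result back with root privileges. Running it twice leaves the file unchanged.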

This is only the beginning of the work we came up with during a night of thinking about this incident. We take it very seriously. It is our first major incident of this kind since the launch of KernelCare nearly 3 years ago, and we will take every precaution to make sure it is our last. We will continue working out how to prevent such issues in the future and will implement those measures one by one. This means re-allocating significant development resources from other projects and investing in new ways to test, deploy, automate, and gather feedback for the product.

We will also search for other novel ways to prevent such issues and we welcome any ideas you might have on what else we can do to further protect our customers.

Once again, please accept our sincerest apologies and be assured that we have a plan for preventing this from happening ever again.

Igor Seletskiy,
CloudLinux CEO
 
