
Issues caused by the latest KernelCare update and what we will do to ensure it never happens again

UPDATE: Mar 30, 10am Pacific time. The 24h feed was affected by the same issue after a technician incorrectly removed an "at" job. This was fixed shortly afterward, but some systems were affected.

We want to apologize for the KernelCare incident that affected some of our customers yesterday. Unfortunately, a bug in the POSIX ACL patch for CVE-2016-7097 was not caught by our test system.

We have spent all of last night fixing the issue and re-releasing the patches to address the local privilege escalation vulnerability CVE-2017-2647.

To avoid these incidents going forward we are implementing the following:

Currently, our test system uses a number of synthetic tests and runs them for 4 hours on each kernel. The test suite consists of LTP tests as well as our own set of tests. Clearly, there are limitations to those tests, so we are planning to add xfstests, avocado, and other synthetic test suites to our test process. We are also planning to add generic workloads to our test suite.

Upon investigating the issue, we also realized that part of the problem could be attributed to our deployment process. We typically release all patches together for most distributions and all kernels in order to get security fixes to you as soon as possible, without any delay. Normally, no problems arise from this process. Yet we are only human, and errors are possible. Today we find out about errors directly from you, as we don't yet have a process that notifies us when a patch has caused a crash.

As such, we are rebuilding the deployment system in this way:

  • We will modify the client-side script (kcarectl) to detect whether the server was patched successfully and did not crash, and to notify us at the 1, 2, 5, 15, and 60 minute marks after patching.
  • We will release patches separately, on a per-distribution / per-kernel-version basis, with a delay between each release, starting with the least popular kernels first.

Our deployment system will automatically check whether there were any problems with a deployment, and in those rare cases it will immediately stop the roll-out of new patches and roll back the one that was already deployed.

The goal is to be able to stop the deployment process as soon as possible, ideally after the first crash, so that a newly released patch never crashes more than one or two servers out of all the servers running KernelCare across all of our customers.

This trickle-down patch roll-out process with automated safety checks might take up to 12 hours to reach all of the 100,000+ servers running KernelCare, but we believe it is the right decision: it will ensure that no customer ever again experiences widespread issues from a released patch, and that incidents affecting multiple customers become a thing of the past.
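To illustrate the idea, here is a minimal sketch of that roll-out loop. The command names (release_patch, rollback_patch, crash_reports_since) and the one-hour delay are hypothetical placeholders for illustration only, not real KernelCare tooling:

# illustrative only: release one kernel flavor at a time, least popular first,
# and stop/roll back as soon as a crash report arrives
for flavor in $(cat kernels-least-popular-first.txt); do
    release_patch "$flavor"                      # hypothetical release step
    sleep 3600                                   # delay before the next flavor
    if [ "$(crash_reports_since 1h)" -gt 0 ]; then
        rollback_patch "$flavor"                 # hypothetical rollback step
        echo "crash reported - deployment halted" >&2
        break
    fi
done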

We estimate it will take us about a month to implement the new deployment system. Until then, you may choose to use the delayed feed which ensures that your servers will receive patches 24 hours after the release.

To implement the delayed feed, add PREFIX=24h to /etc/sysconfig/kcare/kcare.conf
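For example, a minimal way to do that from the shell (assuming PREFIX is not already set elsewhere in the file):

echo "PREFIX=24h" >> /etc/sysconfig/kcare/kcare.conf   # switch this server to the 24h delayed feed
grep PREFIX /etc/sysconfig/kcare/kcare.conf            # verify the setting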

This is just the beginning of the plan we put together during a night of thinking about this incident that affected some of our customers. We take this very seriously. It is our first major incident of this kind since the launch of KernelCare nearly 3 years ago, and we will take every precaution to make sure it is our last. We will continue figuring out how to prevent such issues in the future and will implement those measures one by one. This means re-allocating significant development resources from other projects and investing in new ways to test, deploy, automate, and gather feedback for the product.

We will also search for other, novel ways to prevent such issues, and we welcome any ideas you might have about what else we can do to further protect our customers.

Once again, please accept our sincerest apologies and be assured that we have a plan for preventing this from happening ever again.

Igor Seletskiy,
CloudLinux CEO
 

Comments 15

Guest - Marco on Thursday, 30 March 2017 09:44

Hello,

you should have a system where we can manage our servers,
so we can switch all our servers to manual updating instead of automatic patching.

Then we could log in to your system and see which patches are available for each server,
decide in a GUI to install the latest patches on servers 1, 2 and 3,
run those systems for, say, 24 hours and see if everything is fine,
and after that install the same patch on all the other systems from the GUI.

This would be nice. That way customers (we) have more control over what is patched, and we do not get an outage on all servers at the same time like yesterday.

Thanks

Igor Seletskiy on Thursday, 30 March 2017 11:34

Thank you for the suggestion. We will implement such global controls.
Right now this can be done using config file & AUTO_UPDATE settings http://docs.kernelcare.com/index.html?config_options.htm
Yet, now I clearly see it has to be done better.
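For example, a minimal sketch of that manual workflow (the option name is taken from the comment above; please double-check the exact values against the linked documentation):

# in /etc/sysconfig/kcare/kcare.conf - disable automatic patch application
AUTO_UPDATE=False

# then apply patches manually, on your own schedule
kcarectl --update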

Guest - Pissed Customer on Thursday, 30 March 2017 09:52

Hello

we're not happy about your kind of communication!
We noticed server crashes and were NOT informed by you that we only had to restart the servers.

WHY ARE YOU NOT SENDING AN EMAIL TO INFORM YOUR CUSTOMERS ABOUT THIS??

We had a lot of work to do, offline services, cancelled contracts, because of your incompetence. The biggest issue on your side:
you have not emailed your customers. That's a no-go in our industry!

Is it more important to party at WHD?

Igor Seletskiy on Thursday, 30 March 2017 11:45

I want to personally apologize for that. It was suggested during the first half hour by one of our team members, but I chose not to for a bunch of wrong reasons: we don't have a good mailing list for that, nor the right tool to do it on the spot, nor did we know who was affected... wrong reasons.
At first we didn't realize it was so widespread, and by the time we did we had already rolled everything back and were working on figuring out what happened, why, and how to solve it. The situation was so new to us that we (I personally) screwed up.

Either way, we were unprepared. We thought something like that could never happen, and we didn't have a plan.
We will be making a plan for probable events now, and it will include a communications plan as well.

So, I am really sorry about that, and I will not let it happen again.

PS: WHD takes quite a lot of energy, and I did go to the WHD party, but I went to it knowing that only two clients were affected and that we had rolled back all the patches, so other people shouldn't be affected. We didn't realize the issue was big and affected multiple clients. I did warn support about the issue and told them to contact me right away if there were other reports. As soon as I knew about the second client, I left the party, and together with other members of the company we worked out a way to deal with the problem.

Guest - DJPRMF on Thursday, 30 March 2017 12:05

We are all human. Errors can and will always happen.

It isn't about the problems, it's about how the company responds to them. And I must say that the KernelCare team has done a great job with the "aftermath" of this issue.

Guest - The A on Thursday, 30 March 2017 21:04

Yes, we're all human, but it is not right to say they handled the aftermath well.
We asked several times what we had to do. Nothing came back.
We had a big downtime affecting thousands of clients, a long night.

Such a service must run under the special supervision of people who know what they are doing. We are located in Germany, not the USA, which means that data security and professional work are very important. I am surprised how easily KernelCare can crash nodes; I hadn't considered this scenario: KernelCare gets hacked and the hackers can control our nodes.

A NO-GO in this industry. @CEO: as IT professionals, you MUST think about everything that can happen. For me, you are taking this too lightly now. Our level is enterprise, not a kitchen hoster or similar.

WHD is for children and low-level industry. Wake up please! This situation is very bad for your reputation. In Germany, your standing is very bad!

Guest - Mad Dog on Friday, 31 March 2017 00:26

"Wé are located in Germany, not USA, which means, that data security, professional work is very important."

Hey there, Hans, are you trying to imply that professionalism and data security are not something Americans strive for or find important? You should reconsider insulting an entire country over the mistake of one company, especially after we saved your ass once, and are going to have to do it again once that wicked devil Merkel forces you to become a caliphate.

"Wé are located in Germany, not USA, which means, that data security, professional work is very important." Hey there, Hans, are you trying to imply that professionalism and data security are not something Americans strive for or find important? You should reconsider insulting an entire country over the mistake of one company, especially after we saved your ass once, and are going to have to do it again once that wicked devil Merkel forces you to become a caliphate.
Guest - Somebody on Thursday, 30 March 2017 22:28

This problem hit us very hard too, but Igor communicates well and doesn't try to hide the problems. I find him trustworthy, honest, and sympathetic. He takes this problem seriously and wants to take steps so that it doesn't happen again.
On the other hand (I respect a lot that you communicate so openly!), according to your update it just happened again, which of course undermines the trust you had just started to rebuild.

As DJPRMF already said, humans make errors. Of course it should not happen, but it happens. What matters now is what you do in the future to prevent this. Your plans sound legitimate and reasonable.

In my case I also suffered huge damage (financially as well), and I spent the whole night and day bringing everything back up. But it doesn't help to cry like 'The A'; it simply has no effect. KernelCare helps a lot, and it has saved our tech team a lot of time in the past.

Guest - Tommy on Friday, 31 March 2017 00:38

Because it's Igor, I have no doubt this will lead to something much better. CloudLinux hasn't made a lot of mistakes in my opinion, and mistakes are bound to happen sooner or later when people are involved.

I think you should consider adding an enterprise-level support option; it seems you have at least one customer for that. Don't sacrifice the great support you are providing for the rest of us, though.

WHD might be for lesser companies, but I guess the bulk of CloudLinux customers probably fits within the WHD-attending group.

I'm also curious what the other commenters, at the enterprise level, would like to have seen in the response from Igor. For my own purposes, the response has been OK.

Guest - Ryan MacDonald on Friday, 31 March 2017 02:27

Thank you Igor for the blog post and your very honest assessment of the situation.

At the end of the day, these things happen. It is important for each service provider that uses KernelCare to assess the risks to their systems of automatic updates, whether they come from the upstream distribution, KernelCare, a control panel, or similar. This is certainly the first time in 3 years that I have heard of any issues with KernelCare, and in that time I recall many incidents involving cPanel, Plesk, retracted RHN updates, and the like.

Mistakes happen. Period. If you have a system that requires absolute integrity, delay your updates and manually batch them out after your own internal testing. Further, there are mitigating measures anyone can apply that would have prevented the hard-lock panics, simply by setting appropriate sysctl values to force a reboot on panic.

e.g: sysctl.conf
# reboot on panic after 90s
kernel.panic = 90
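
To apply that without waiting for a reboot, the same value can also be set at runtime:

sysctl -w kernel.panic=90    # takes effect immediately
sysctl -p                    # reload /etc/sysctl.conf after editing it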

We look forward to continued use of KernelCare going forward. Be mindful, folks, that there are only two players in the entire Linux ecosystem providing this service: Ksplice and KernelCare. Only one of them, KernelCare, is under meaningful active maintenance with prompt turnaround on patches for 0-days.

Let's be supportive and constructive. Yes, we are paying for a service, but it is incumbent on us as users to assist in making the product better through meaningful feedback.

Thanks Igor!
