
Issues caused by the latest KernelCare update and what we will do to ensure it never happens again


UPDATE: Mar 30, 10 AM Pacific time. The 24h feed was hit by the same issue because a technician incorrectly removed an "at" job. This was fixed shortly after, but some systems were affected.

We want to apologize for the KernelCare incident that affected some of our customers yesterday. Unfortunately, a bug in the POSIX ACL patch for CVE-2016-7097 wasn't caught by our test system.

We have spent all of last night fixing the issue and re-releasing the patches to address the local privilege escalation vulnerability CVE-2017-2647.

To avoid these incidents going forward we are implementing the following:

Currently, our test system uses a number of synthetic tests and runs them for 4 hours against each kernel. The test suite consists of LTP tests as well as our own set of tests. Clearly, there are limitations to those tests, so we are planning to add xfstests, avocado, and other synthetic test suites to our test process. We are also planning to add generic workloads to our test suite.
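For illustration, a time-boxed synthetic test pass along these lines can be driven from the command line; the paths, duration, and test groups below are assumptions for the example, not our exact internal configuration:

# Linux Test Project (LTP) syscall suite, time-boxed to 4 hours
cd /opt/ltp && ./runltp -f syscalls -t 4h -l /var/log/ltp-run.log

# xfstests "quick" group (assumes TEST_DEV/SCRATCH_DEV are configured in local.config)
cd /var/lib/xfstests && ./check -g quick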

Upon investigating the issue, we also realized that part of the problem could be attributed to our deployment process. We typically release all patches together for most distributions and all kernels in order to get security fixes to you ASAP, without any delay. Normally, no problems arise from this process. Yet we are only human, and errors are possible. We find out about errors directly from you, as we don't yet have a process to be notified when a patch has caused a crash.

As such, we are rebuilding the deployment system in this way:

  • We will modify the client-side script (kcarectl) to detect and report whether the server was patched successfully and didn't crash, checking at 1, 2, 5, 15, and 60 minute intervals (see the sketch after this list).
  • We will be releasing separate patches on a per-distribution/kernel-version basis, with a delay between each release, starting with the least popular kernels first.
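As a rough sketch of the client-side part, the check could look something like the following; the reporting endpoint, identifier, and exact timings are hypothetical placeholders, not the actual kcarectl internals. If a server never checks in after a patch, the deployment side treats the silence as a possible crash and halts the rollout:

#!/bin/bash
# Hypothetical post-patch check-in: report "still alive" at increasing intervals.
SERVER_ID=$(hostname -f)                               # placeholder identifier
REPORT_URL="https://patches.example.com/heartbeat"     # placeholder endpoint

for delay in 60 120 300 900 3600; do                   # 1, 2, 5, 15 and 60 minutes
    (
        sleep "$delay"
        # report uptime so the server side can tell a clean run from a crash/reboot loop
        curl -fsS --data "server=${SERVER_ID}" --data "uptime=$(cut -d' ' -f1 /proc/uptime)" \
            "$REPORT_URL" >/dev/null
    ) &
done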

Our deployment system will automatically check whether there were any problems with the deployment, and in those rare cases, it will stop the deployment of new patches immediately and roll back the one that was already deployed.

The goal is to be able to stop the deployment process as soon as possible, often after the first crash, so that a newly released patch never crashes more than one or two servers out of all the servers running KernelCare across all of our customers.

This trickle-down patch rollout process with automated safety checks might take up to 12 hours to reach all 100,000+ servers running KernelCare, but we believe it is the right decision: it will ensure that no customer ever has widespread issues from released patches again, and that incidents affecting multiple customers are a thing of the past.

We estimate it will take us about a month to implement the new deployment system. Until then, you may choose to use the delayed feed, which ensures that your servers receive patches 24 hours after release.

To implement the delayed feed, add PREFIX=24h to /etc/sysconfig/kcare/kcare.conf
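For example, assuming a standard KernelCare install, the change is a single line in the config file:

# /etc/sysconfig/kcare/kcare.conf
PREFIX=24h

After that, the next automatic update (or a manual kcarectl --update) pulls patches from the 24-hour delayed feed.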

This is just the beginning of the work we planned during a night spent thinking about this incident that affected some of our customers. We take this very seriously. It is our first major incident of this kind since the launch of KernelCare nearly 3 years ago, and we will take every precaution to make sure it is our last. We will keep figuring out how to prevent such issues in the future and implement the fixes one by one. This means re-allocating significant development resources from other projects and investing in new ways we test, deploy, automate, and gather feedback for the product.

We will also search for other novel ways to prevent such issues and we welcome any ideas you might have on what else we can do to further protect our customers.

Once again, please accept our sincerest apologies and be assured that we have a plan for preventing this from happening ever again.

Igor Seletskiy,
CloudLinux CEO
 


Comments (15)

Hello,

you should have a system where we can manage our servers, so we can switch all of them to manual updating instead of automatic patching.

We could then log in to your server, see which patches are available for each server, decide in a GUI to install the latest patches for servers 1, 2, and 3, run those systems for, say, 24 hours to see if they're fine, and after that install the same patch on all the other systems from the GUI.

This would be nice. Customers (we) would have more control over what is patched, and we would not have an outage on all servers at the same time like yesterday.

Thanks


Thank you for the suggestion. We will implement such global controls.
Right now this can be done using the config file & AUTO_UPDATE settings: http://docs.kernelcare.com/index.html?config_options.htm
Yet I now clearly see it has to be done better.
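For reference, a minimal way to take manual control today with the existing config options (see the docs link above for the full list) looks roughly like this:

# /etc/sysconfig/kcare/kcare.conf
AUTO_UPDATE=False        # disable unattended patching

# then, on servers you have chosen to patch, apply the latest patch manually:
kcarectl --update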


Hello,

we're not happy with your kind of communication!
We noticed server crashes and were NOT informed by you that we only had to restart the servers.

WHY ARE YOU NOT SENDING AN EMAIL TO INFORM YOUR CUSTOMERS ABOUT THIS??

We had a lot of work to do, offline services, cancelled contracts. Because of your incompetence. The biggest issue on your side:
you did not email your customers. That's a no-go in our industry!

Is it more important to party at WHD?


I want to personally apologize for that. Sending an email was suggested during the initial half hour by one of our team members, but I chose not to, for a bunch of wrong reasons: we don't have a good mailing list for that, nor the right tool to do it on the spot, nor did we know who was affected... wrong reasons.
At first, we didn't realize that it was so widespread, and by the time we did, we had already rolled back everything and were working on figuring out what happened, why, and how to solve it. The situation was so new to us that we (I personally) screwed up.

Either way, we were unprepared. We thought something like that could never happen, and we didn't have a plan.
We will be making a plan for probable events now, and it will include a communications plan as well.

So, I am really sorry about that, and I will not let it happen again.

PS: WHD takes quite a lot of energy, and I did go to the WHD party, but I went knowing that only two clients were affected and that we had rolled back all the patches, so other people shouldn't have been affected. We didn't realize the issue was big and affected multiple clients. I did warn support about the issue and told them to contact me right away if there were other reports. As soon as I knew about the second client, I left the party, and together with other members of the company we worked on a way to deal with the problem.


We are all human. Errors can and will always happen.

It isn't about the problems, it's how the company responds to them. And I must say that the KernelCare team has done great work with the aftermath of this issue.


Yes, we're all human, but it is not right to say they handled the aftermath well at all.
We asked several times what we had to do. Nothing came back.
We had a big downtime affecting thousands of clients, a long night.

Such a service must run under the special supervision of humans who know what they are doing. We are located in Germany, not the USA, which means that data security and professional work are very important. I am surprised how easily KernelCare can crash nodes; I hadn't even considered this scenario: KernelCare gets hacked and hackers can control our nodes.

A NO-GO in this industry. @CEO: As a professional IT company, you MUST think about everything that can happen. To me, you are taking this too lightly now. Our level is Enterprise, not a kitchen hoster or similar.

WHD is for children and the low-level industry. Wake up, please! This situation is very bad for your reputation. In Germany, your standing is very bad!


"Wé are located in Germany, not USA, which means, that data security, professional work is very important."

Hey there, Hans, are you trying to imply that professionalism and data security are not something Americans strive for or find important? You should reconsider insulting an entire country over the mistake of one company, especially after we saved your ass once, and are going to have to do it again once that wicked devil Merkel forces you to become a caliphate.

"Wé are located in Germany, not USA, which means, that data security, professional work is very important." Hey there, Hans, are you trying to imply that professionalism and data security are not something Americans strive for or find important? You should reconsider insulting an entire country over the mistake of one company, especially after we saved your ass once, and are going to have to do it again once that wicked devil Merkel forces you to become a caliphate.

This problem hit us very hard too, but Igor communicates well and doesn't try to hide the problems. I find him trustworthy, honest, and sympathetic. He takes this problem seriously and wants to take steps so that it doesn't happen again.
On the other hand (I respect a lot that you communicate so openly!), according to your update it just happened again, which of course damages the trust that was just being rebuilt..

As DJPRMF already said, humans make errors. Of course it should not happen, but it happens. Now what matters is what you do in the future to prevent this. Your plans sound legitimate and reasonable.

In my case I also suffered huge damage (financially as well), and I spent the whole night and day bringing everything back up. But crying about it won't help; it will simply have no effect. KernelCare helps a lot, and it has saved our tech team a lot of time in the past.


Because it's Igor, I have no doubt this will lead to something much better. CloudLinux hasn't made many mistakes in my opinion, and mistakes are bound to happen sooner or later if people are involved.

I think you should consider adding an enterprise-level support option; it seems that you have at least one customer for that. Don't sacrifice the great support you are providing for the rest of us, though.

WHD might be for lesser companies, but I guess the bulk of CloudLinux customers probably fits within the WHD-attending group.

Also curious what the other commenters, at the enterprise level, would have liked to see in the response from Igor. For my own purposes, the response has been OK.


Thank you Igor for the blog post and your very honest assessment of the situation.

At the end of the day, these things happen. It is important for each service provider that uses KernelCare to assess the risks to their systems of automatic updates, be they from the upstream distribution, KernelCare, a control panel, or similar. This is certainly the first time in 3 years that I have heard of any issues with KernelCare, and in that time I recall many instances of cPanel, Plesk, retracted RHN updates, and similar.

Mistakes happen. Period. If you have a system that requires you to have absolute integrity, delay your updates and manually batch them out after your own internal testing. Further, there are mitigating features anyone can apply that would have prevented the hard lock panics by simply setting appropriate sysctl values to force reboots on panic.

e.g. /etc/sysctl.conf:
# reboot on panic after 90s
kernel.panic = 90
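For anyone applying that suggestion, the setting can also be activated on a running system without a reboot (standard sysctl usage, run as root):

# apply immediately on the running kernel
sysctl -w kernel.panic=90

# after adding the line to /etc/sysctl.conf, reload it to persist across reboots
sysctl -p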

We look forward to continued use of KernelCare going forward. Be very mindful, folks, that there are only two players in the entire Linux ecosystem providing this service, Ksplice and KernelCare, and only one of them is under meaningful active maintenance with prompt turnaround on patches for 0-days: KernelCare.

Let's be supportive and constructive. Yes, we are paying for a service, but it is incumbent on us as users to also assist in making the product better through meaningful feedback.

Thanks Igor!


Hello,

yesterday we switched all our servers to PREFIX=24h.
And now you are telling us that you had the same problem on this feed. Most of our servers crashed again.
Our customers are so angry.

Please send me your email address, so we can send you an invoice for two rounds of technician work and for the customers we have now lost.
It is absolutely unbelievable what is happening.
Once is already too much, but now it has happened twice.
NO, that's not acceptable.
Send me an email address where we can send the invoice.


I am finally back in the US, and that makes things a little easier from a logistics standpoint.

To all those who are rightfully angry at me:
* Yes, it is my fault, and this disaster was preventable. I didn't expect it. It is easy for me to see where I was mistaken in the aftermath, but it was difficult to figure out beforehand.
* I learned important lessons, including the fact that relying on procedures is not enough. Automation & safety checks are required, especially in situations that might turn stressful.

My focus for the next month will be on getting things right.

So far I have made a few minor deployment adjustments that should prevent widespread issues like this one. Starting Monday, three of my best developers will start working on various components of the client software, testing tools, and deployment framework. Some improvements will be done within a week; some will take up to a month. We will be adding staff as needed to speed things up.
I think within a month we should have a robust deployment/rollback system and a greatly improved test system.
We will also add a regular review of procedures, as they get outdated with time and might cause additional issues.

Once again, my sincere apologies for letting you down.


Hello, my fellow datacenter geeks.

We all have many bad and hard experiences in this never-ending race to keep systems up forever. This is what we do, and we all know it is not easy at all. Sometimes something goes wrong, so we learn and fix, then we keep moving. That is the spirit. That is what it takes to be a part of this complex Internet digital machine. So we don't cry; we take responsibility and we work together as a community.

I know the CloudLinux team takes their job very seriously; they do their best and they do it with passion. I have no regrets: since we started working with Igor and the smart and talented CloudLinux team, we have built a stronger platform and we feel well supported.

These issues have to be managed with intelligence. I'm sure Igor and his team will deliver a better and safer update process, and it will be better than ever.

I trust the CloudLinux team; I'm sure this will be fixed and improved very soon!

Best regards,

Roy Zderich


Hi,

Was there ever a follow-up on this about what changed in the rollout scheme and so on?


We are now working on gradual rollout for KernelCare patches. The client part is ready (you may have noticed the --gradual-rollout=auto option in the cron entry for kcare). The server part is in beta now and is being tested before we make it available to all KernelCare customers. Delayed feed options are also available now, as described in http://docs.kernelcare.com/delayed_feed.htm
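For the curious, the cron entry referred to looks roughly like this; the file path and schedule below are illustrative and vary by version, the relevant part is the --gradual-rollout=auto flag:

# /etc/cron.d/kcare (illustrative path and schedule)
*/4 * * * * root /usr/bin/kcarectl --auto-update --gradual-rollout=auto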
