
Throwaway account...

CrowdStrike in this context is an NT kernel loadable module (a .sys file) which does syscall-level interception and logs them to a separate process on the machine. It can also STOP syscalls from working if they are trying to connect out to other nodes or access files they shouldn't be touching (using some drunk-ass heuristics).

What happened here was they pushed a new kernel driver out to every client without authorization, to fix an issue with slowness and latency in the previous Falcon sensor product. They have a staging system which is supposed to give clients control over this, but they pissed all over everyone's staging rules and just pushed this straight to production.

This has taken us out, and we have 30 people currently doing recovery and DR. Most of our nodes are boot-looping with blue screens, and in the cloud that's not something you can fix by hitting F8 and removing the driver. We have to literally take each node down, attach its disk to a working node, delete the .sys file, and bring it back up. Either that, or bring up a new node entirely from a snapshot.
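For what it's worth, the per-node loop described above maps onto a handful of EC2 API calls. This is a hedged sketch using the boto3 EC2 client interface; the instance/volume IDs and device names are hypothetical, the client is injected so the flow can be exercised against a stub, and the actual deletion of the .sys file happens out-of-band on the rescue node's filesystem.

```python
def rescue_node(ec2, broken_id, volume_id, rescue_id):
    """Sketch of the recovery loop: force-stop the boot-looping node,
    move its root volume to a healthy rescue node (where the bad
    CrowdStrike .sys file gets deleted from the mounted disk), then
    move the volume back and boot the original node again."""
    ec2.stop_instances(InstanceIds=[broken_id], Force=True)
    ec2.detach_volume(VolumeId=volume_id, InstanceId=broken_id)
    ec2.attach_volume(VolumeId=volume_id, InstanceId=rescue_id,
                      Device="/dev/sdf")     # secondary disk on the rescue node
    # ... operator mounts the disk and deletes the offending driver file ...
    ec2.detach_volume(VolumeId=volume_id, InstanceId=rescue_id)
    ec2.attach_volume(VolumeId=volume_id, InstanceId=broken_id,
                      Device="/dev/sda1")    # back in place as the root device
    ec2.start_instances(InstanceIds=[broken_id])
```

In practice each call needs a waiter in between (volume state transitions aren't instant), which is part of why this drags so badly when EBS is under contention.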

This is fine but EC2 is rammed with people doing this now so it's taking forever. Storage latency is through the roof.

I fought for months to keep this shit out of production because of this reason. I am now busy but vindicated.

Edit: to all the people moaning about Windows: we've had no problems with Windows. This is not a Windows issue. This is a third-party security vendor shitting in the kernel.




I did approximately this recently, but on a Linux machine on GCP. It sucked far worse than it should have: apparently GCP cannot reliably “stop” a VM in a timely manner. And you can’t detach a boot disk from a VM that isn’t “stopped”, nor can you multi-attach it, nor can you (AFAICT) convince a VM to boot off an alternate disk.

I used to have this crazy idea that fancy cloud vendors had competent management tools. Like maybe I could issue an API call to boot an existing instance from an alternate disk or HTTPS netboot URL. Or to insta-stop a VM and get block-level access to its disk via API, even if I had to pay for the instance while doing this.

And I’m not sure that it’s possible to do this sort of recovery at all without blowing away local SSD. There’s a “preview” feature for this on GCP, which seems to be barely supported, and I bet it adds massive latency to the process. Throwing away one’s local SSD on every single machine in a deployment sounds like a great way to cause potentially catastrophic resource usage when everything starts back up.

Hmm, I wonder if you’re even guaranteed to be able to get your instance back after stopping it.

WTF. Why can’t I have any means to access the boot disk of an instance, in a timely manner? Or any better means to recover an instance?

Is AWS any better?


AWS is really not any better on this. In fact, two years ago (to the day!) we had a complete AZ outage in our local AWS region. This resulted in their control plane going nuts and being unable to shut down or start new instances. Then came the capacity problems.


That's happened several times, actually; that's probably just the latest one. The really fun one was when S3 went down in Virginia in 2017. It caused global outages of multiple services, because most services were housed out of Virginia, and when EC2 and other services went offline due to their dependency on S3, everything cascade-failed across multiple regions (in terms of start/stop/delete... i.e. API actions. Stuff that was already running was, for the most part, still working in some places).

...I remember that day pretty well. It was a busy day.


> apparently GCP cannot reliably “stop” a VM in a timely manner.

In OCI we made a decision years ago that after 15 minutes from sending an ACPI shutdown signal, the instance should be hard powered off. We do the same for VM or BM. If you really want to, we take an optional parameter on the shutdown and reboot commands to bypass this and do an immediate hard power off.

So worst case scenario here, 15 minutes to get it shut down and be able to detach the boot volume to attach to another instance.
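The policy above is simple to state precisely. Here's a minimal model of it (my reading of the parent, not OCI's actual code; the constant is just the 15 minutes quoted):

```python
SOFT_SHUTDOWN_TIMEOUT_S = 15 * 60  # the 15-minute grace window described above

def shutdown_action(elapsed_s, guest_halted, force=False):
    """What the control plane does at a given moment after sending the
    ACPI shutdown signal. Illustrative model only."""
    if force:
        return "hard-power-off"      # optional parameter: skip the grace window
    if guest_halted:
        return "power-off-clean"     # guest completed an orderly shutdown
    if elapsed_s >= SOFT_SHUTDOWN_TIMEOUT_S:
        return "hard-power-off"      # worst case: grace window expired
    return "wait"                    # keep waiting for the guest
```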


I had this happen to one of my VMs: I was trying to compile something and ran out of memory, then tried to stop the VM, and it only came back after 15 minutes. I think it's a good compromise: long enough to give a clean reboot a chance, but short enough to prevent longer downtimes.

I’m just a free tier user but OCI is quite powerful. It feels a bit like KDE to me where sometimes it takes a while to find out where some option is, but I can always find it somewhere, and in the end it beats feeling limited by lack of options.


We've tried shorter time periods, back in the earlier days of our platform. Unfortunately, the few times we've tried to lower it from 15 minutes, we've ended up with Windows users experiencing corrupt drives. Our best blind interpretation is that some things common enough on Windows can take up to 14 minutes to shut down under the worst circumstances. So 15 minutes it is!


This sounds appealing. Is OCI the only cloud to offer this level of control?


Based on your description, AWS has another level of stop, the "force stop", which one can use in such cases. I don't have statistics on the time, so I don't know if that meets your criteria of "timely", but I believe it's quick enough (sub-minute, I think).
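The graceful-then-force pattern reads roughly like this in code. A sketch against the boto3 EC2 client (`stop_instances` does take a `Force` flag); the polling probe is injected so the logic stays testable, and the timeout values are made up:

```python
import time

def stop_with_escalation(ec2, instance_id, is_stopped, timeout_s=60, poll_s=5):
    """Try a graceful stop first; if it doesn't land within the timeout,
    escalate to a force stop. `is_stopped` is a caller-supplied probe
    (e.g. wrapping describe_instances) so this sketch needs no live AWS."""
    ec2.stop_instances(InstanceIds=[instance_id])
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_stopped(instance_id):
            return "stopped"
        time.sleep(poll_s)
    ec2.stop_instances(InstanceIds=[instance_id], Force=True)
    return "force-stopped"
```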


There is a way with AWS, but it carries risk. You can force detach an instance's volume while it's in the shutting down state, but if you re-attach it to another machine, you risk the possibility of a double-write/data corruption while the instance is still shutting down.

As for "throwing away local SSD": on AWS that only happens with instance store volumes (formerly called ephemeral volumes), where the storage is directly attached to the host you're running on. If you did a stop/start of an EBS-backed instance, you were likely to get sent to a different host (vs. a restart API call, which issues an ACPI soft command and, after a duration... I think it was 5 minutes, iirc... the hypervisor would kill the instance and restart it on the same host).

When the instance would get sent to a different host, it would get different instance storage and the old instance storage would be wiped from the previous host and you'd be provisioned new instance storage on the new host.

However, with EBS volumes, those travel from host to host across stop/start cycles; they're attached over a very low-latency network connection from EBS servers and presented as a local block device to the instance. It's not quite as fast as local instance store, but it's fast enough for almost every use case if you get enough IOPS provisioned, either through direct provisioning plus the correct instance size, or through a large enough drive plus a large enough instance to maximize the connection to EBS (there's a table detailing IOPS, throughput, and instance size in the docs).

Also, support can detach the volume as well if the instance is stuck shutting down and doesn't get manually shut down by the API after a timeout.

None of this is by any means "ideal", but the complexity of these systems is immense and what they're capable of at the scale they operate is actually pretty impressive.

The key is...lots of the things you talk about are do-able at small scale, but when you add more and more operations and complexity to the tool stack on interacting with systems, you add a lot of back-end network overhead, which leads to extreme congestion, even in very high speed networks (it's an exponential scaling problem).

The "ideal" way to deal with these systems is to do regular interval backups off-host (ie. object/blob storage or NFS/NAS/similar) and then just blow away anything that breaks and do a quick restore to the new, fixed instance.

It's obviously easier said than done and most shops still on some level think about VMs/instances as pets, rather than cattle or have hurdles that make treating them as cattle much more challenging, but manual recovery in the cloud, in general, should just be avoided in favor of spinning up something new and re-deploying to it.


> There is a way with AWS, but it carries risk. You can force detach an instance's volume while it's in the shutting down state, but if you re-attach it to another machine, you risk the possibility of a double-write/data corruption while the instance is still shutting down.

This is absurd. Every BMC I’ve ever used has an option to turn off the power immediately. Every low level hypervisor can do this, too. (Want a QEMU guest gone? Kill QEMU.). Why on Earth can’t public clouds do it?

The state machine for a cloud VM instance should have a concept where all of the resources for an instance are still held and being billed, but the instance is not running. And one should be able to quickly transition between this state and actually running, in both directions.

Also, there should be a way to force stop an instance that is already stopping.
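Concretely, the lifecycle being asked for here is a small state machine with an explicit force-stop edge out of "stopping". States and events below are illustrative, not any vendor's actual model:

```python
# Minimal instance lifecycle with a force-stop edge from every live state.
TRANSITIONS = {
    ("running",  "stop"):       "stopping",
    ("running",  "force-stop"): "stopped",
    ("stopping", "force-stop"): "stopped",   # the missing edge the parent asks for
    ("stopping", "halted"):     "stopped",   # guest finished an orderly shutdown
    ("stopped",  "start"):      "running",   # resources still held, fast restart
}

def next_state(state, event):
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {event!r}")
```

The point being that "stopping" shouldn't be a trap state: there should always be a legal, immediate edge to "stopped".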


>This is absurd. Every BMC I’ve ever used has an option to turn off the power immediately. Every low level hypervisor can do this, too. (Want a QEMU guest gone? Kill QEMU.). Why on Earth can’t public clouds do it?

The issue is far more nuanced than that. These systems are very complex: a hypervisor with layers of applications and interfaces to allow scaling. In fact, the hosts all have BMCs (last I knew... though I know some wanted to get rid of the BMC because BMCs are unreliable, which is, yes, a real issue when you deal with scale. I've had to reset countless stuck BMCs and had some that were flat-out dead).

The hypervisor is certainly capable of killing an instance instantly, but the preferred method is an orderly shutdown. In the case of a reboot and a stop (and a terminate where the EBS volume is not also deleted on termination), it's preferred to avoid data corruption, so the hypervisor attempts an orderly shutdown, then after a timeout period, it will just kill it if the instance has not already shutdown in an orderly manner.

Furthermore, there's a lot more complexity to the problem than just "kill the guest". There are processes that manage the connection to the EBS backend that provides the interface for the EBS volume, as well as APIs and processes to manage network interfaces, firewall rules, monitoring, and a whole host of other things.

If the monitoring process gets stuck, it may not properly detect an unhealthy host, and external automated remediation may not take action. That same monitoring is often responsible for individual instance health and recovery (i.e. auto-recover), and if it's not functioning properly, it won't take remediation actions to kill the instance and start it up elsewhere. The hypervisor itself may also not be properly responsive, so a call from the API won't trigger a shutdown action.

If the control plane and the data plane (in this case, the hypervisor/host) are not syncing/communicating (particularly on a stop or terminate), the API needs to ensure that the state machine is properly preserved and the instance is not running in two places at once. You can then "force" stop or "force" terminate, and/or the control plane will update state in its database and the host will sync later. There is a possibility of data corruption or doubly sent/received data in the force case, which is why it's not preferred. Also, after the timeout (without the "force" flag), it will go ahead and mark the instance terminated/stopped and sync later; the "force" just tells the control plane to do it immediately, likely because you're not concerned with data corruption on the EBS volume, which may be double-mounted if you start up again while the old one is not fully terminated.

>The state machine for a cloud VM instance should have a concept where all of the resources for an instance are still held and being billed, but the instance is not running. And one should be able to quickly transition between this state and actually running, in both directions.

It does have a concept where all resources are still held and billed except CPU and memory; that's effectively what a reboot does. Same with a stop, except you're not billed for compute, and network usage will obviously be zero (though if you have an EIP, that still incurs charges). The transition between stopped and running is also fast; the only delays are in the control plane, either from capacity constraints making instance placement hard, or from the chosen host not communicating properly. In most cases it's a fast transition: I'm usually up and running in under 20 seconds when I start an existing instance from a stopped state. There's also now a hibernate/sleep state the instance can be put into via the API (if it's Windows), where the instance acts just like a regular Windows machine in sleep/hibernate.

>Also, there should be a way to force stop an instance that is already stopping.

There is; I believe I referred to it in my initial response. It's a flag you can throw in the API/SDK/CLI/web console when you select "terminate" or "stop". If the stop/terminate command doesn't execute in a timely manner, you can call the same thing again with a "force" flag, which tells the control plane to forcefully terminate: it marks the instance as terminated and asynchronously rectifies state when the hypervisor can execute commands again.

The control plane updates the state (though sometimes it can get stuck and require remediation by someone with operator-level access), is notified that you don't care about data integrity/orderly shutdown, and (once it's updated the state, regardless of the state of the data plane) marks the instance "stopped" or "terminated". Then you can either start again, which should kick you over to a different host (there are some exceptions), or, if you terminated, launch a new instance, attach the EBS volume (if you chose not to delete it on termination), and retrieve your data.

Almost all of that information is actually in the public docs; I only added a little color about how the backend operates. There are hundreds of programs that run to make sure the hypervisor and control plane are both in sync and able to manage resources, and if just a few of them hang or are unable to communicate, or the system runs out of resources (more of a problem on older, non-Nitro hosts, as that's a completely different architecture with completely different resource allocations), then the system can become partially functional... enough so that remediation automation won't step in, or can't step in because other guests appear to be functioning normally.

There are many different failure modes of varying degrees of "unhealthy", and many of them are undetectable or need manual remediation, but they are statistically rare, and by and large most hosts operate normally. On a normally operating host, forcing a shutdown/terminate works just fine and is fast. Even when some of the programs managing the host are not functioning properly, launch/terminate/stop/start/attach/detach all tend to keep working (along with the "force" on detach, terminate, stop), even if one or two functions of the host do not.

It's also possible (and has happened several times) that a particular resource vector is not functioning properly but the rest of the host is fine. In that case, the particular vector can be isolated and the rest of the host keeps working. It's literally these tiny edge cases that happen maybe 0.5% of the time that cause things to move slower, and at scale, a normal host with a normal BMC would have the same issues. I've had to clear stuck BMCs before on those hosts, and I've dealt with completely dead BMCs. When those states occur, if there's also a host problem, remediation can't go in and remedy host-level problems, which can lead to those control-plane delays as well as the need to call a "force".

Conclusion: it may SEEM like it should be super easy, but there are about a million different moving parts at cloud vendors, and it's not as simple as killing it with fire and vengeance (i.e. a QEMU guest kill). BMCs and hypervisors do have an instant kill switch (and guest kill is used on the hypervisor, as is a BMC power-off, in the right remediation circumstances), but you're assuming those things always work. BMCs fail. BMCs get stuck.

You likely haven't had the issue because you're not dealing with enough scale. I've had to reset BMCs manually more times than I can count, and I've also dealt with more than my fair share of dead ones. So "power off immediately" does not always work, which means a disconnect occurs between the control plane and the data plane. There are also delays in the remediation actions automation takes, to give things enough time to respond to the given commands, which leads to additional wait time.


I understand that this complexity exists. But in my experience with Google Compute, this isn’t a 1%-of-the-time problem with something getting stuck. It’s a “GCP lacks the capability” issue. Here’s the API:

https://cloud.google.com/compute/docs/reference/rest/v1/inst...

AWS does indeed seem more enlightened:

https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_S...


yeah, AWS rarely has significant capacity issues. While capacity utilization typically sits around 90% across the board, they're constantly landing new capacity, recovering broken capacity, and working to fix issues that cause things to get stuck (with lots of alarms and monitoring).

I worked there for just shy of 7 years and dealt with capacity tangentially (knew a good chunk of their team for a while and had to interact with them frequently) across both teams I worked on (support and then inside the EC2 org).

Capacity, while their methodologies for expanding it were in my opinion antiquated and unenlightened for a long time, was still managed rather effectively. I'm pretty sure that's why they never updated their algorithm for increasing capacity to be more JIT. They have a LOT more flexibility in capacity now that they have resource vectoring, because you no longer have hosts fixed to a single instance size for the entire host (homogeneous). You can now fit everything together like Lego as long as it's the same family (i.e. c4 with c4, m4 with m4, etc.), and there was additional work being done on cross-family resource vectoring as well, which was already in use.

Resource vectors took a LONG time for them to get in place and when they did, capacity problems basically went away.

The old way of doing it was if you wanted to have more capacity for, say, c4.xlarge, you'd either have to drop new capacity and build it out to where the entire host had ONLY c4.xlarge OR you would have to rebuild excess capacity within the c4 family in that zone (or even down to the datacenter-level) to be specifically built-out as c4.xlarge.

Resource vectors changed all that. DRAMATICALLY. Also, reconfiguring a host's recipe now takes minutes, rather than the hours needed to rebuild a host. So capacity is infinitely more fungible than it was when I started there.

Also, I think resource vectoring came on the scene around 2019 or so? I don't think it was there in 2018 when I went to work for EC2...but it was there for a few years before I quit...and I think it was in-use before the pandemic...so, 2019 sounds about right.

Prior to that, though, capacity was a much more serious issue and much more constrained on certain instance types.


I always said if you want to create real chaos, don't write malware. Get on the inside of a security product like this, and push out a bad update, and you can take most of the world down.


So… Write malware?


*Malicious code in legit software


> Most of our nodes are boot looping with blue screens which in the cloud is not something you can just hit F8 and remove the driver.

It took a bit to figure out with some customers, but we provide optional VNC access to instances at OCI, and with VNC the trick seems to be to hit Esc and then F8 at the right stage in the boot process. Timing is the devil in the details there, though; getting it right is frustrating, but people seem to be developing a knack for it.


> give clients control over this but they pissed over everyone's staging and rules and just pushed this to production.

Interesting..

> We have to literally take each node down, attach the disk to a working node..

Probably the easiest solution for you is to go back in time to a previous scheduled snapshot, if you have that setup already.


That would make sense, but it appears everyone in our regions is doing EBS snapshot restores like mad, so they aren't completing. Spoke to our AWS account manager (we are a big big big org) and they have contention issues everywhere.

I really want our cages, C7000's and VMware back at this point.


Netflix big? Bigger or Smaller?

I'm betting I have a good idea of one of the possible orgs you work for, since I used to work specifically with the largest 100 customers during my ~3yr stint in premium support


Netflix isn't really that big. Two organizations ago, our reverse proxy used 40k cores; Netflix's is less than 5k. Of course, that could just mean our nginx extensions are 8 times crappier than Netflix's.


Smaller. No one has heard of us :)


> Spoke to our AWS account manager (we are a big big big org)

Is this how you got the inside scoop on the rollout fiasco?


Beautiful


> This is not a windows issue.

Honest question, I've seen comments in these various threads about people having similar issues (from a few months/weeks back) with kernel extension based deployments of CrowdStrike on Debian/Ubuntu systems.

I haven't seen anything similar regarding Mac OS, which no longer allows kernel extensions.

Is Mac OS not impacted by these kinds of issues with CrowdStrike's product, or have we just not heard about it due to the small scale?

Personally, I think it's a shared-responsibility issue. MS should build a product that is "open to extension but closed for modification".

> they pissed over everyone's staging and rules and just pushed this to production.

I am guessing that act alone is going to create a massive liability for CrowdStrike over this issue. You've made other comments that your organization is actively removing CrowdStrike. I'm curious how this plays out. Did CrowdStrike just SolarWind themselves? Will we see their CISO/CTO/CEO do time? This is just the first part of this saga.


The issue is where it is integrated. You could arguably implement CrowdStrike in BPF on Linux. On NT they literally hook NT syscalls in the kernel from a driver they inject into kernel space which is much bad juju. As for macOS, you have no access to the kernel.

There is no shared responsibility. CrowdStrike pushed a broken driver out, then triggered the breakage, overriding customer requirement and configuration for staging. It is a faulty product with no viable security controls or testing.


Yep, it's extremely lame that CS has been pushing the "Windows" narrative to frame it as a Windows issue in the press, so everyone will just default blame Microsoft (which everyone knows) and not Crowdstrike (which only IT/cybersec people are familiar with).

And then you get midwits who blame Microsoft for allowing kernel access in the first place. Yes Apple deprecated kexts on macOS; that's a hell of a lot easier to do when you control the entire hardware ecosystem. Go ahead and switch to Apple then. If you want to build your own machines or pick your hardware vendor, guess what, people are going to need to write drivers, and they are probably going to want kernel mode, and the endpoint security people like CrowdStrike will want to get in there too because the threat is there.

There's no way for Microsoft or Linux for that matter to turn on a dime and deny kernel access to all the thousands upon thousands of drivers and system software running on billions of machines in billions of potential configurations. That requires completely reworking the system architecture.


> midwits

This midwit spent the day creating value for my customers instead of spinning in my chair creating value for my cardiologist.

Microsoft could provide adequate system facilities so that customers can purchase products that do the job without having the ability to crash the system this way. They choose not to make those investments. Their customers pay the price by choosing Microsoft. It's a shared responsibility between the parties involved, including the customers that selected this solution.

We all make bad decisions like this, but until customers start standing up for themselves with respect to Microsoft, they are going to continue to have these problems, and society is going to continue to pay the price all around.

We can and should do better as an industry. Making excuses for Microsoft and their customers doesn't get us there.


This midwit believes a half-decent operating system kernel would have a change-tracking system that can automatically roll back a change/update that breaks the boot process and causes a BSOD. We see this in Linux: multiple kernel boot options, failsafe mode, etc. It is trivial to implement driver/.sys tracking in the kernel that detects a failed boot and reverts to the previous good config. A well-designed kernel would have rollback, just like SQL.
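The "last known good" idea can be modeled in a few lines. A sketch only: the threshold and config representation are invented, and real implementations (boot counters kept by the bootloader, etc.) are far more involved:

```python
MAX_FAILED_BOOTS = 2  # invented threshold: consecutive bad boots before reverting

def select_boot_config(current, last_known_good, failed_boots):
    """Pick the driver configuration to boot with. After too many
    consecutive boot failures, fall back to the last known good set."""
    if failed_boots >= MAX_FAILED_BOOTS:
        return last_known_good
    return current

def after_boot(success, failed_boots):
    """Update the consecutive-failure counter after a boot attempt:
    reset on success, increment on failure."""
    return 0 if success else failed_boots + 1
```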


Windows does have that and does do that. Crowdstrike does stuff at UEFI level to install itself again.


Could Microsoft put pressure on UEFI vendors to coordinate a way for such reinstallation to be suppressed during this failsafe boot?


Not sure why you are being downvoted. Take a look at ChromeOS and MacOS to see how those mechanisms are implemented there.

They aren’t perfect, but they are an improvement over what is available on Windows. Microsoft needs to get moving in this same direction.


um.. don't have access to the kernel? What's with all the kexts then? [edit: just read that third parties don't get kexts on Apple Silicon. That's a step in the right direction, IMHO. I love to bitch about Mach/NeXTStep flaws, but happy to give them props when they do the right thing.]


Although it's a .sys file, it's not a device driver.

"Although Channel Files end with the SYS extension, they are not kernel drivers."

https://www.crowdstrike.com/blog/technical-details-on-todays...


Yeah it's a way of delivering a payload to the driver, which promptly crashed.

Which is horrible!


Horrible for sure, not least because hackers now know the channel-file parser is fragile and perhaps exploitable. I haven't seen any significant discussion of follow-on attacks; it's all been about rolling back the config file rather than addressing the root cause, which is the shonky device driver.


I suspect the wily hackors have known how fragile that code is for years.


But it is Windows because the kernel should be able to roll back a bad update, there should NEVER be BSODs.


Windows does do that. Crowdstrike sticks it back in at the UEFI level by the looks, because you know, "security".


pish! this isn't VM/SP! commodity OSes and hardware took over because customers didn't want to pay firms to staff people who grokked risk management. linux supplanted mature OSes because some dork implied even security bugs were shallow with all those billions of eyes. It's a weird world when MSFT does a security stand down in 2003 and in 2008 starts widening security holes because the new "secure" OS they wrote was a no-go for third parties who didn't want to pay $100 to hire someone who knew how to rub two primes together.

I miss my AS/400.

This might be a decent place to recount the experience I had when interviewing for Office security architect in 2003. My background is mainframe VM system design and large-system risk management modeling, which I had been doing since the late 80s at IBM, DEC, then Digital Switch and Bell Canada. My resume was pretty decent at the time. I don't like Python and I tell VPs of Engineering they have a problem when they can't identify benefits from JIRA/SCRUM, so I don't get a lot of job offers these days. Just a crusty greybeard bitching...

But anyway... so I'm up in Redmond, and I have a decent couple of interviews with people, and then the 3rd most senior dev in all of MSFT comes in and asks "how's your QA skills?" and I start to answer about how QA and safety/security/risk management are different things. QA is about ensuring the code does what it's supposed to; software security, et al., is about making sure the code doesn't do what it's not supposed to, and the philosophical sticky wicket you enter when trying to prove a negative (worth a deep dive if you're unfamiliar). Dude cuts me off and says "meh. security is stupid. in a month, Bill will end this stupid security stand down and we'll get back to writing code and I need to put you somewhere and I figured QA is the right place."

When I hear that MSFT has systems that expose inadequate risk management abstractions, I think of the culture that promoted that guy to his senior position... I'm sure he was a capable engineer, but the culture in Redmond discounts the business benefits of risk management (to the point they outsource critical system infrastructure to third parties) because senior engineers don't want to be bothered to learn new tricks.

Culture eats strategy for breakfast, and MSFT has been fed on a cultural diet of junk food for almost half a century. At least from the perspective of doing business in the modern world.


> ”This is not a windows issue. This is a third party security vendor shitting in the kernel.“

Sure, but Windows shares some portion of the blame for allowing third-party security vendors to “shit in the kernel”.

Compare to macOS which has banned third-party kernel extensions on Apple Silicon. Things that once ran as kernel extensions, including CrowdStrike, now run in userspace as “system extensions”.


Back in 2006, Microsoft agreed to allow kernel-level access for security companies due to an EU antitrust investigation. They were being sued by antivirus companies because they were blocking kernel access in the soon-to-be-released Vista.

https://arstechnica.com/information-technology/2006/10/7998/


Wow, that looks like a root cause


Wow! First cookie pop-ups, now Blue Friday...?


Sick and tired of EU meddling in tech. If third parties can muck around in the kernel, then there's nothing Microsoft can really do at that point. SMH


Can they simultaneously allow this, but recommend against it and deny support / sympathy if you do it to your OS?


Yes... in the same sense that if a user bricks their own system by deleting system32 then Windows shares some small sliver of the blame. In other words, not much.


Why should Windows let users delete system32? If they don't make it impossible to do so accidentally (or even maliciously), then I would indeed blame Windows.

On macOS you can't delete or modify critical system files without both a root password and enough knowledge to disable multiple layers of hardware-enforced system integrity protection.


And what do you think installing a deep level antivirus across your entire fleet is equivalent to?


lol. Never said they should, did I?


the difference is you can get most of the functionality you want without deleting system32, but if you want the super secure version of NT, you have to let idiots push untested code to your box.

linux, Solaris, BSD and macOS aren't without their flaws, but MSFT could have done a much better job with system design.


...but still, if the userspace process is broken, macOS will fail as well. Maybe it's a bit easier to recover, but any broken process with non-trivial privileges can disrupt the whole system.


It's certainly not supposed to work like that. In the kernel, a crash brings down the entire system by design. But in userspace, failed services can be restarted and continued without affecting other services.

If a failure in a userspace service can crash the entire system, that's a bug.


It's kind of inevitable that a security product can crash the system. It just needs to claim that one essential binary is infected with malware, and the system won't run.


Hello:

I'm a reporter with Bloomberg News covering cybersecurity. I'm trying to learn more about this CrowdStrike update potentially bypassing staging rules and would love to hear about your experience. Would you be open to a conversation?

I'm reachable by email at [email protected] or on Signal at JakeBleiberg.24. Here's my Bloomberg author page: https://www.bloomberg.com/authors/AWuCZUVX-Pc/jake-bleiberg.

Thank you.

Jake


Before reaching the "pushed out to every client without authorization" stage, a kernel driver/module should have been tested. Tested by Microsoft, not by "a third party security vendor shitting in the kernel" that some criminally negligent manager decided to trust.


> Tested by Microsoft

MS don't have testers any more. Where do you think CS learned their radically effective test-in-prod approach?


I think they learned it from Freedesktop developers.


Yeah, we have a staging and test process where we run their updated Falcon sensor releases.

They shit all over our controls and went to production.

This says we don't control it and should not trust it. It is being removed.


> It is being removed.

Congratulations on actually fixing the root cause, as opposed to hand wringing and hoping they don't break you again. I'm expecting "oh noes, better keep it on anyway to be safe" to be the popular choice.


yeah, I agree. I think most places will at least keep it until the existing contract comes up for renegotiation, and most will probably keep using CS.

It's far easier for IT departments to just keep using it than it is to switch and managers will complain about "the cost of migrating" and "the time to evaluate and test a new solution" or "other products don't have feature X that we need" (even when they don't need that feature, but THINK they do).


Why would Microsoft be required to test some third-party software? Maybe I misunderstood.


It's a shitty C++ hack job within CrowdStrike dereferencing a null pointer. Because the code runs in the kernel, the invalid memory access takes the whole system down. A simple unit test would have caught this, or any number of tools that look for null-pointer dereferences in C++, not even full QA. It's unbelievable incompetence.
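The parent's point, sketched in Python: a parser that validates its input instead of blindly dereferencing it, plus the kind of trivial unit test that would have flagged the failure. The file format, magic number, and function here are entirely hypothetical, not CrowdStrike's actual code.

```python
def parse_channel_file(data):
    """Return a parsed record list, or None if the blob is unusable.
    Rejecting null/truncated/garbage input up front is exactly the
    guard a one-line unit test would verify."""
    if not data or len(data) < 4:
        return None                  # null or truncated: refuse to parse
    if data[:4] != b"CHNL":          # hypothetical magic number
        return None                  # wrong format: refuse to parse
    return [data[4:]]                # "parsing" stands in for the real logic
```

The corresponding test suite is four asserts and runs in milliseconds, which is the commenter's whole point.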



