An update on last week's customer shutdown incident

_7siz · on June 5, 2019

That’s a really fair and reasonable response. Not sure what else people really expect here.

> The template used for response in account denial will be removed entirely. If account access is denied during an appeal, which often is the case as most appeals are true bad actors, the agent must create a reasoned response.

Glad this is seen as an issue and corrected.

IMO, this probably would have made this whole thing never escalate if a better response was previously in place for everyone.

Accidents, shotty support, whatever — all expected these days unless you have big cash money agreements in place.

But to kill an account of a responsive person with a gigantic middle finger email without reasoning was a pretty dumb process in place. You can see the email on the Twitter thread somewhere.

Glad it’s fixed! Still a DO fan here

Edit: TALKING ABOUT THIS: https://pbs.twimg.com/media/D76ocofXoAY_xB5.png

sergiosgc · on June 5, 2019

> That’s a really fair and reasonable response. Not sure what else people really expect here.

The root cause of suspension is incomprehensible to me. They were suspended because they launched a set of instances and these were using 100% CPU. How is that unreasonable and cause for suspension?

I'm not a Digital Ocean customer, but if I were, I'd expect to be able to use the resources I bought without risk of being suspended. This is the root cause. It was compounded by incompetent customer support, but I really do not understand the suspension cause.

The response tackles all secondary factors, but does not talk about the root cause. I'd expect it to.

falcor84 · on June 5, 2019

Agreed. They say in the postmortem they it was protection against crypto mining, but what kind of weird reason is that?! If I want to pay for 10 instances mining crypto, why the hell wouldn't I be allowed to do that? I don't see why they should block any workload as long as the credit card details are valid.

jarito · on June 5, 2019

It's a customer protection method. Most cryptominers are not using accounts they pay for. They compromise customer accounts and spin up resources. If you aren't proactive about communicating this to customers or blocking it, it can be quite some time before the customer notices and almost all customers will request a refund - even when the attack is a compromised password / successful phish on the customer's side.

Additionally, all cloud providers operate on various models of over-subscription. It is not in anyone's (customer / provider) interest to allow the full consumption of resources when the activity is fraudulent.

As you can see in the post-mortem, they are fine with the usage. They have a process and flag to allow legitimate customers to use their resources. However, based on previous experience at another cloud provider, I would bet that over 90% of those automated hits are correct.

This was bad support. They know that and they seem to be making the right moves to fix it. Fraud is bad for everyone and has to be combated. Not doing so can raise prices and kill a business like DO. I'm sure they feel awful that a customer was so poorly impacted, but the error wasn't in the first ban, it was everything after that.

coinbotvps · on June 5, 2019

Part of the whole issue here revolves around shared hosting in my opinion. Host hardware is so oversold that one customer utilizing 100% CPU is so impactful to a handful of other customers that it's not allowed at all. I have seen providers that has terminated services for less than 100% CPU usage, a constant 90% is enough on some of them. But due to the profit margins and shared hosting, providers are able to charge incredibly low prices per instance and be able to oversell their hardware sometimes as much as 10 to 20 times. That's as many as four hundred customers on a box that should maybe have 20 if it weren't oversold it all. In this case it really is an instance of you get what you pay for. The service we provide is no oversold hardware and all dedicated plans. Some people are initially very turned off by the pricing but the ability to allow customers to mine if they wanted to and not affect a single other customer on the platform giving each customer the same experience regardless of any other one images resource utilization, leaves too much happier customers even if smaller profit margins for us. At the end of the day customer experience and support provided are two of the most important factors in running a hosting provider. While I disagree with aspects of digital oceans business model as a shared hosting provider, I do think that the response to this was more than appropriate and better than would be expected of a lot of shared hosting providers, provided they actually implement any of the things talked about in the response.

jammygit · on June 7, 2019

When you say we/us as a more expensive, but dedicated alternative, what is the cost difference as a percentage for say a small project?

Edit: found your site, looks like you’re cheaper than aws at a glance

_bxg1 · on June 5, 2019

If it was just the first point, the customer should be able to confirm that the activity was intended without even going through human review. It should be like when your bank texts you to confirm an odd transaction. They don't simply lock your account.

It sounds much more like it was the second point, which is unsettling. It's one thing to plan your pricing based on the assumption that most customers won't maximally-utilize. It's another thing to enforce a soft-limit that's vague and below what was advertised. I'd much rather have a lowered, known limit than whatever this is.

oefrha · on June 5, 2019

I totally agree, but unfortunately my bank (a major U.S. bank) does block a transaction and sometimes lock my credit card completely when they think the transaction is suspicious. There’s no confirmation mechanism, I have to call them to get the card unlocked. Of course, this usually gets resolved within five minutes (except that one time when I had to renew a .ng ___domain, and the Nigeria-originated transaction got auto-blocked three times in a row, and eventually the case had to be escalated to override their security mechanism entirely), not 29 hours.

usefulcat · on June 5, 2019

> They don't simply lock your account.

Capital One did this to me once, and refused to restore the use of the blocked account even after I immediately called them and confirmed that the transaction that triggered the block was not fraudulent.

chewbacha · on June 5, 2019

It was in combination with a lack of payment history. So if they had been paying it would not have triggered but they had been working off of credits instead. I think this point addresses your concern that paying customers should be allowed to mine.

speleo_engr · on June 5, 2019

I ran some really long compute jobs on GCP (100% CPU for weeks across many vCPUs) with credits without getting flagged. I was evaluating FFTW performance for a project. Perhaps GCP could tell I was calling into FFTW and not mining so they decided it wasn't fraud?

jjeaff · on June 6, 2019

It makes sense for a company like DO to not allow crypto miners to use credits. Or else they would develop elaborate systems to create fake accounts and spin them up to mine.

Google can afford to eat the cost and perhaps has better heuristics to detect mining. And they definitely have better data to detect a single user signing up for multiple accounts.

chewbacha · on June 5, 2019

Perhaps, or they viewed credits as payment history? I’m not defending the algorithm as even DO has said it was a false-positive. I just wanted to point out that this wasn’t an attack by DO on paying for crypto. That it specifically was trying to look at non-payers.

numlock86 · on June 5, 2019

Having your instances run at 100% CPU pretty much raises a red flag at any cloud provider. Depending on your plan it either gets shut off (like in this case) or you get a notice about "suspicious" behavior and a bit of time to fix the "issue".

sergiosgc · on June 5, 2019

What's next? Having your disks use too much I/O causes the same response? Or actually using the RAM you pay for?

I run my own iron, with cloud only for elastic loads. Every time I launch a cloud instance, it will be using 100% CPU, otherwise I wouldn't launch it. It's unacceptable to label that profile as "suspicious". It never happened to me on AWS or Azure.

chewbacha · on June 5, 2019

> ...you pay for

The major indicator here was the lack of payment history, so they hadn’t paid for it but were working off of credit. I think it’s a nuance that’s very important.

sergiosgc · on June 5, 2019

I'm sorry to dig heels, but that's no excuse. If the credit they were given allowed them to use the resources, it follows that using the resources is not a breach of contract.

From the description I imagine Digital Ocean offers a free period or tier, to reduce friction in customer acquisition. This is a marketing tool, and must not, in any way, cause situations like the one described.

If a marketing tool induces service failure, it has no place in a professional setting.

chewbacha · on June 5, 2019

Credit and promo codes are also used extensively for fraud. If a business had been in operation for a while solely on credit, it may well generate a false positive in a fraud detection algorithm if it scaled dramatically.

But it is important to disconnect monetary spending from coupons or vouchers as they are not equivalent.

You mention free tier but that’s not what was at issue here. Also, 10 additional instances isn’t in the free tier of any cloud service I’ve used.

I’m not saying that DO is correct, but I believe the parent argument was a simplification if the events in question. Also, DOs handling of it via support was far worse than the initial algorithm, imo.

sergiosgc · on June 5, 2019

> But it is important to disconnect monetary spending from coupons or vouchers as they are not equivalent.

They must be. If they are not, then you've entered the territory I referred, where marketing actions are impacting service availability. This impact is not acceptable in professional services.

In this specific case, if voucher giveaways produce ingress of resource leeches (cryptominers that will never result in real customers), and if it is impossible to prevent this undesired ingress without impacting existing customers (which it is), then that marketing action must stop. This is the conclusion I expected from the post-mortem.

chewbacha · on June 5, 2019

Money is fungible and fiat while vouchers are vendor-locked and not fiat, that's why they can't be evaluated the same.

I won't try to argue whether they should be removed in their entirety, that's not even an option I had even considered until now.

vorpalhex · on June 5, 2019

This is confusing though, since Digital Ocean credit can mean like a referral, or by prepaying your account - something I do to prevent billing overages.

omginternets · on June 5, 2019

Hardly the point.

Using what you've rightfully obtained shouldn't be regarded with suspicion.

chewbacha · on June 5, 2019

That seems even more hyperbolic. Are you suggesting that no service should attempt to detect fraud?

omginternets · on June 5, 2019

Of course not.

Are you suggesting that 100% usage implies fraud?

There's a difference between suspecting fraud from high resource usage and equating high resource usage with fraud.

The latter is what is happening, here, and its outrageous.

chewbacha · on June 5, 2019

That's a simplification of what was happening. It was a combination of indicators that they list:

- A large increase in number of nodes

- All nodes using 100% of CPU

- AND a lack of payment history

I'm merely saying that the lack of payment history is an important indicator of suspicious activity. 100% usage by-itself was not the primary indicator that their article discusses.

yjftsjthsd-h · on June 5, 2019

I can assure you we run AWS instances at 100% with no problem at all. (Well, no problem from AWS; sometimes it's caused by a software bug.)

nbevans · on June 5, 2019

That's not true. A proper cloud provider (AWS or Azure) would not bat an eye because the CPU is frequently pegged at 100%.

arianvanp · on June 5, 2019

That sounds odd to me. Especially given that digitalocean is the default dynamic provider for Gitlab CI builds which _will_ run droplets at 100% CPU.

robteix · on June 5, 2019

From what I understand, they’re not saying you’re not allowed to use 100% for (what their user agreements define as) legitimate uses. They’re saying several droplets suddenly created and immediately going to 100% flags them as suspicious activity for human review. Looks like after such review, they would flag them as legitimate and all would be fine, 100% CPU or not.

They’ve botched that second step though.

mkl · on June 5, 2019

That doesn't make sense to me. You pay for the time you have the droplets running, so it seems kind of silly to have them sit idle for a bit before you give them work to do.

wgerard · on June 5, 2019

I don't work at a cloud provider, but I think the reasoning is:

It's a common pattern in malicious actors to immediately spin up several droplets and immediately peg the CPU on each one.

There are, obviously, non-malicious actors who do the same, but it's a bit like wearing a balaclava in public: Likely to raise some suspicion just because it's associated with malicious actors.

numpad0 · on June 5, 2019

Not sure what the materialization of that suspicion might look like -- competitors trying to crush DO's business? mass account creation or mass fraudulent logins? "mining crypto"? What I could come up felt quote-unquote legit grounds for a timed suspension but only instinctively so.

corobo · on June 5, 2019

> I'd expect to be able to use the resources I bought without risk of being suspended

They weren't bought resources at the time, they were on credit. In this case a false positive for sure.

In the case of an actual cryptominer it's more likely they'll just ditch the account when it comes to billing time. Even more likely is that it's a compromised account that someone else has to pay for

solidasparagus · on June 5, 2019

I can't really fault their postmortem or their response on HN. The corrections are all good, but the very fact that these things need to be corrected (automatically locking the entire account when there is a compute spike, having such a casual review process before permanently denying access to an account, not having 24/7 support after locking an account, etc.) makes you question their overall maturity as a B2B infrastructure provider.

TheOperator · on June 5, 2019

Sure it's better to never make a mistake but so long as they don't make a habit of things like this I'm not going to think anything of it until I see more cracks in the wall.

A screw up is inevitable. A mature response is not. So the fact they gave mature response goes a long way. Although it's unfortunate that social media seems to be their emergency support channel...

noselasd · on June 5, 2019

> Sure it's better to never make a mistake but so long as they don't make a habit of things like this

This is the thing - the customer that got locked out managed to get attention on HN, Reddit and other media - this seems to have prompted action from Digital Ocean.

How many have silently fallen victim before this ? We don't really know if this is a habit or not - we only know this one customer was corrected.

ensignavenger · on June 5, 2019

Based on this post, Digital Ocean is taking specific measures company wide to prevent similiar issues from affecting any customer in the future. So they did no just correct the situation for this one customer.

donmcronald · on June 5, 2019

> declined to activate it

Except they were declining to unlock it, right? I’m always shocked to see support that’s so pitiful they don’t even bother to have a correctly worded template for a common event.

The real problem is support reps that aren’t trained properly and don’t even care enough to apply a bit of common sense. Getting rid of a response template doesn’t automatically make the support reps care enough to apply common sense.

How about a “don’t fuck me” support tier where I can pay a one time $100-$250 fee for the sole purpose of getting a phone call before my account gets banned?

close04 · on June 5, 2019

The real problem is most definitely not the support rep. They don’t really go off book. This is the process as designed and approved by higher management, not by a low pay first level support (unless you assume they have some top level engineer doing this stuff).

And going off process could make it better... yay, self pat on the back. But it could make it worse in which case I see unemployment in the support rep’s future. So they won’t go that way very often.

Anyone who ever had such a low positioned job knows how it works. At that level your only freedom is to do what you’re told and follow company process.

No, this is the fault of the manager who asked for this process and their manager who approved it. Management isn’t just about picking up a higher paycheck, it’s also to take the accountability for the decisions made under your watch.

crankylinuxuser · on June 5, 2019

> That’s a really fair and reasonable response. Not sure what else people really expect here.

If you nuke VMs, under no circumstances do you also nuke access to data, backups, etc.

Because if it wasn't for "social escalation" (aka: mob justice via HN and Twitter), this 2 person company would have lost everything.

If you terminate a customer for $reasons, the data still belongs to the user, and not the company. And the company should still be legally required to provide the data on a reasonable timescale, like FTP access for 7 days.

corobo · on June 5, 2019

While you're swinging that legal word around, have an armchair lawyer skim DO's Terms of Service.

> 9.1 Subscriber is solely responsible for the preservation of Subscriber's data which Subscriber saves onto its virtual server (the "Data"). EVEN WITH RESPECT TO DATA AS TO WHICH SUBSCRIBER CONTRACTS FOR BACKUP SERVICES PROVIDED BY DIGITALOCEAN, TO THE EXTENT PERMITTED BY APPLICABLE LAW, DIGITALOCEAN SHALL HAVE NO RESPONSIBILITY TO PRESERVE DATA. DIGITALOCEAN SHALL HAVE NO LIABILITY FOR ANY DATA THAT MAY BE LOST, OR UNRECOVERABLE, BY REASON OF SUBSCRIBER'S FAILURE TO BACKUP ITS DATA OR FOR ANY OTHER REASON.

Summary: Do offsite backups n'all you dinguses

megous · on June 5, 2019

You're right. It's good to have some expectations of the company, but customers really need to take the TOS seriously.

crankylinuxuser · on June 5, 2019

It's yet again that the ToS is hidden crap that goes against the direct things, like "Backups".

Sure the ToS needs legalese crap for the lawyers, but a plain version also needs to be made. I'm certainly no lawyer, and nor are most people.

corobo · on June 5, 2019

It’s not really hidden and they do have an easy to read non-lawyer summary underneath the part I quoted

> In other words, we trust that you’ll be responsible and back up your own data. Things happen!

qes · on June 5, 2019

It doesn't take a college degree, in law or otherwise, to understand that data in one place - whether that's physically or under the umbrella of a single service provider - is subject to unpredictable, unexpected, total loss.

wwweston · on June 5, 2019

I agree it's a generally good response. There are a few more things I'd like to see more clearly addressed:

* While the removal of the account termination template is good, in conjunction with additional hiring to support more attention to any individual ticket, I can't tell by whose standards the "reasoned response" is gauged, or if the response is reviewable at all. I did note that they now want two human reviewers, but that's distinct from specifying a process in which a reasoned response is articulated and reviewed.

* More importantly, if the reasoned response doesn't pass muster with the customer, what's their resort? Still Twitter-shaming? I suppose that's legit if they'd rather their mistakes were public like this.

* The question of whether an account-wide lockout w/ no data retrieval is a necessary/proper consequence for those flagged for CPU abuse needs addressing -- ideally they should have a different policy that allows for data egress (with bandwidth fees, if necessary), but if not, a rationale and clear policy might be acceptable.

ensignavenger · on June 5, 2019

Back in the days before Twitter, folks wrote to the CEO or other senior executives as a last resort. Might still be effective in some cases.

Angostura · on June 5, 2019

> shotty support

"shoddy", for what it's worth.

_7siz · on June 5, 2019

Woop. TIL - Thanks!

EasyTiger_ · on June 5, 2019

How tremendously forgiving.

rphlx · on June 5, 2019

Their apparent conclusion that high CPU% for a few hours or half day or whatever means "cryptocurrency miner - ban ASAP!" is naive and flawed.

Compute offload is an ancient and fairly common use case for the public cloud; my VPS (or ten..) should be able to burn 100% CPU for many hours compiling a large project, even if it means they make less profit than they would have had I instead run a static web server that sleeps on IO, imposing nearly no CPU load.

At the very least they should provide some objective, quantitative guidance on exactly how many CPU-seconds-per-hour they consider acceptable/not-abuse (or, if not CPU-seconds, then increased host power consumption, or whatever they are ultimately trying to limit to ensure they can pack a few hundred near-zero-load servers onto the same host to make glorious truly massive profits all the time).

Don't make customers guess at whether their workload will trigger some opaque but hyper-aggressive abuse automation or not.

hamandcheese · on June 5, 2019

I think the heuristic they use is spike of high cpu + non-established billing history, not just CPU. That seems to me much more indicative of potential fraud, though by no means foolproof.

rphlx · on June 5, 2019

Indeed, but AFAICT there are apparently still some opaque, undefined CPU% limits for people paying with CC instead of free credits. They also mentioned elsewhere that customers paying via PO are exempt from the automated miner murderer, but that was was news to me and I guess just furthers my point: we shouldn't have to trawl HN threads to understand your CPU% abuse limits; they should be spelled out specifically and quantitatively in the main TOS, for each type of payment method, and any other factor(s) that effect them.

solidasparagus · on June 5, 2019

They point out that automatically terminating compute is a bad idea that they will no longer be doing in most cases.

"Services that result in the power down of resources will no longer automatically take action on any account, regardless of lack of payment history, for accounts that were engaged more than 90 days prior. These cases will be escalated for manual review"

rphlx · on June 5, 2019

That's a good idea, though I would still prefer to understand their detailed CPU% abuse criteria pre-deployment rather than via just-try-it-and-see-what-happens. Secret rules are a problem, no matter if the enforcement is automated, manual, or some hybrid of the two.

noncoml · on June 5, 2019

I think you are missing the point. They don’t care if you use 100% of CPU. What they don’t want you to do is use 100% CPU and not pay the bill.

Not that I am defending their actions and perma-ban.

rphlx · on June 5, 2019

> They don’t care if you use 100% of CPU

They clearly do, at least for some subset of customers meeting various quasi-secret criteria.

> they don’t want you to do is use 100% CPU and not pay the bill.

The account here was fully paid up, albeit via credits that they issued rather than via USD. Regardless, it was not past due, so, the high CPU% was the mortal sin.

marcinzm · on June 5, 2019

Reading this response it seems that crypto-mining is not allowed on digital ocean as they have checks against it. The TOS doesn't say so explicitly but does note that:

>violation of any of these Terms of Service or any law, or if you misuse system resources, such as, by employing programs that consume excessive network capacity, CPU cycles, or disk IO

By my reading that seems to mean that you're not allowed to use your VMs to their full capacity due to them being over-provisioned. This is in contrast to AWS who are more explicit on which instances (T instances) are over-provisioned and exactly how they're throttled.

bcooks · on June 5, 2019

If you want to do cryptocurrency mining on DO that is actually okay with us. Some of the other respondents are correct the behavior we were looking for was really around fraudulent accounts being created and performing cryptocurrency mining. This is why the trigger that flagged this account was using payment history as a key factor in the triggering.

weaksauce · on June 6, 2019

The thing that has me scratching my head is how this chain of events unfolded.

I get that your fraud algorithm flagged it because of lack of established payment. how is this possible if what the tweet referred to as "locking us out of all of our backups and work"? surely an account history of any significance would have an established payment record. From their tweets they mention that they had 5 droplets and some storage of a not insignificant number of records (~500k) and that a script is required to be run every 2-3 months to process some data and that script spins up 10 droplets during that time. seems like it will take 13 hours to process the data based on row count and per record time.[0] I am struggling to see how they didn't have payment history. can you elaborate?

In addition another thing I'd think would help assuage fears of a complete lockout is some process where you can request and download the db or a snapshot of the virtual machine.

[0] https://twitter.com/w3Nicolas/status/1134529322902007809

wyldfire · on June 5, 2019

> If you want to do cryptocurrency mining on DO that is actually okay with us

Do you disclose this anywhere? Are there any special steps one could take to avoid issues while doing legitimate mining?

chris_wot · on June 5, 2019

Your post-mortem implies this is not allowed at all.

Johnny555 · on June 5, 2019

Your post-mortem implies this is not allowed at all.

Not sure why you were downvoted, I had the same impression, after reading:

...an automated service that monitors for cryptocurrency mining activity (Droplet CPU loads and Droplet create behaviors). These signals, coupled with a number of account-level signals (including payment history and current run rate compared to total payments) are used to determine if automated action is warranted to minimize the impact of potential fraudulent high-cpu-loads on other customers

This sounds like they don't permit extended high CPU loads due to the impact it can have on other customers.

solidasparagus · on June 5, 2019

Cryptojacking is a well-known, major problem for cloud compute providers. Catching and squashing new exploits that allow people to create a fresh account, run up compute bills and then abandon the account without paying is very important.

My guess would be that this is such a well-known problem (within the field of cloud compute at least) that they just didn't think they had to state that normal crypto mining by paying customers is completely fine.

mr_toad · on June 5, 2019

Is ‘normal’ crypto mining in the cloud even profitable, compared to custom designed hardware?

AgentME · on June 5, 2019

Depending on the cryptocurrency's proof-of-work algorithm and new-ness, it can be profitable to mine in the cloud. I've done it briefly in the past. But generally it's not profitable.

In every cryptocurrency (the popular and functional ones anyway), there's a set global rate of mining rewards. All miners compete for a slice of that reward, so as more people mine, each individual miner gets less reward. (This causes an equilibrium to be reached where more people mine until it's no longer profitable for more people to start mining. If mining becomes unprofitable, some miners will drop out, and the remaining miners will each make a little more.) If masses of people realize that cloud mining for a particular cryptocurrency is profitable, then what generally happens is that lots of people pounce on cloud providers to mine, it becomes barely profitable, and then people operating their own hardware that's cheaper than cloud providers come in and push the mining rewards down to where it's no longer profitable for people to cloud mine.

Because cloud mining is never profitable in the long run, most cloud mining that happens is fraudulent activity using stolen cloud accounts or payment info. (If you're not paying for it, then making any amount of money from it is profitable.)

vthriller · on June 5, 2019

It depends on what crypto is going to be mined an how many accounts can be stolen given the fact that there is already plethora of bots that look all over GitHub for accidentally committed credentials. Heck, just a year ago people did scans for outdated WordPress installations to inject, among other things, some JavaScript (!) Monero miners [0]…

[0] https://arstechnica.com/information-technology/2018/01/more-...

nostrebored · on June 5, 2019

No. Cryptomining represents an arbitrage opportunity such that the spot price of the instances should be adjusted. In the long run it should not be profitable.

fhars · on June 5, 2019

No, that quite clearly states that they treat high CPU loads as suspect on accounts without an established good payment history or if it significantly deviates from previous usage patterns.

bgirard · on June 5, 2019

The keyword here is 'fraudulent'. High-cpu-loads is allowed, but an automated service monitors for fraudulent activity.

dlubarov · on June 5, 2019

What would "fraud" mean in this context? Are they talking about customers who don't pay their bill to DO? (If so, seems like the account should just be temporarily suspended until the bill is paid.) Or are they talking about fraud to other parties, like phishing sites? (If so, I don't see the connection to crypto mining.)

hoseja · on June 5, 2019

My understanding is that they're trying to prevent users from creating new accounts, running 100%CPU until it's time to pay the bill and then just not paying, moving on to another new account.

edit: from elsewhere ITT it seems they're doing this with stolen credit cards.

chris_wot · on June 5, 2019

Obviously they don't verify if the load is fraudulent. Otherwise this whole debacle couldn't have happened.

rajaganesh87 · on June 5, 2019

Blog post did mention that accounts with high CPU usage and payment history won't be flagged.

ac29 · on June 5, 2019

Here was my key takeaway:

"Cryptocurrency mining mitigation detects suspicious behavior, including very high CPU utilization on an account with no payment history, which results in an account lock"

Lots might have been done wrong here, but it sounds like they had an account with trial or promotional credits - I can see how this could easily be abused.

mdip · on June 5, 2019

Completely. At the same time, those promotional credits are going to be used by guys like me who will have to decide if their services are worth having to spend an extended period of time explaining why "we're not recommending AWS/Azure/Goober"

An account shutdown, or enough complaints from verifiable sources, and I'm not going to the trouble. Not to pull out an old trope "Nobody got fired for picking IBM". But that's the case: pulls a move like this and the entirety of the customer, who likely came in with AWS in mind (in some cases, was advised against it and insisted on it) is going to shrug their shoulders. Pick a provider that the customer hasn't heard of and I'm going to get a phone call that goes something like "You're the one that said we should use that basement-operation!" with raised voices. Heck, the last time there was an Azure outage, we didn't hear from most of our customers. It was so impacting that even customers well outside of software development/technology read news articles and connected the dots. I had one customer tell me he thought it was just their corporate internet connection; they assumed it was working[1].

[0] Plus, too lazy to put in the research; sorry.

[1] They were a customer who insisted on doing the app monitoring, themselves -- that guy was getting the alerts and similarly assumed it was the network since that happened regularly with another application they developed -- the monitoring server was on-prem.

phire · on June 5, 2019

Sounds like it's designed to counter stolen credit cards.

An attacker might load a stolen credit card number into an account and only use enough resources to generate a few dollars worth of billing. The owner of the credit card might not notice the small charge.

Then after a few months of low billing (to bypass a previous heuristic), they ramp up the utilization, mine a bunch of coins then the holder gets a massive bill.

The holder does a charge back and DO is left holding the bill.

kijin · on June 5, 2019

It's also designed to keep everyone relatively happy in a shared-CPU environment. "Standard" droplets share CPU with others on the same node, so one droplet pegging the CPU 24/7 can be problematic.

AWS doesn't have this problem because either your instance is allowed to use all the CPU that's allicated to it, or else (t2 & t3) the platform will automatically limit your CPU usage. You don't have to care about how your usage affects other people. It's one more thing that AWS abstracts away. DO's abstraction, on the other hand, is rather leaky in this area. That's a problem in and of itself, in addition to the matter of credit card fraud that every company has to deal with.

xvf22 · on June 5, 2019

I don't think that it isn't allowed but that they have seen that fraudulently acquired resources are generally used for crypto-mining so they felt that this was a good signal to look for.

duskwuff · on June 5, 2019

Right. Reading between the lines a bit, it's not the activity itself that DO is worried about, but a pattern of usage that suggests that the account may have been created fraudulently or compromised.

bcooks · on June 5, 2019

This is correct. This was the primary thing we were attempting to solve for in this case and the bug in the algorithm started the chain of events documented in the postmortem.

Twirrim · on June 5, 2019

Cryptocurrency mining is a favourite of fraudsters. Get a fake or stolen credit card number / identity, sign up for an account on a cloud provider, and spin up instances to do your mining. Depending on how quickly the provider reacts to the pattern of behaviour, or identifies the account as fraudulent etc, you can have earned a reasonable pay-off.

If the provider isn't paying attention and/or doesn't have good fraud detection in place, it may be a few months before your account and resources are terminated (assuming the payments eventually bounce, and the provider gives you a chance to fix it)

Shoop · on June 5, 2019

I believe the purpose of these checks is for when people's outdated wordpress install or whatever gets compromised by script kiddies. Generally, the scripts install crypto miners to mine for the hacker until your account gets shut down, running you up a huge bill in the process.

treis · on June 5, 2019

I think it's moreso people continually creating new accounts to get free credits. Then bitcoin mining turns the credits into actual money.

jimmaswell · on June 5, 2019

Anecdotally, I had a python script unintentionally using 100% for a month or two and never heard about it or noticed until I happened to look at top.

xnxn · on June 5, 2019

Counter-anecdote slash venting:

I recently logged into my long-dormant DO account to kick the tires on their now-GA managed Kubernetes service and to contribute some CPU cycles to a distributed computing project I'm interested in (not cryptocurrency, I swear).

I first requested and was approved for a droplet limit increase (to 25). I started 20 nodes, deployed my very CPU intensive workload, and 12 hours later my account was locked and my nodes all went NotReady.

I immediately replied to the abuse ticket explaining my usage. 4 days later, my account was unlocked and received the "allow high CPU" flag... but they billed me for the nodes as if they had been running that whole time. I asked for a credit (5 days ago) and they haven't replied yet. Probably a little busy over there right now.

So... I'm not too thrilled with DO. I get that cryptocurrency ruined everything, but this has been a frustrating experience and I'm glad I was only using it for a pet project.

zacharybk · on June 5, 2019

Hey xnxn - Thanks for raising this issue. I'm Zach, Director of Support at DO. Please send me an email with your account detalis, first name at, and I'll take care of this.

sneak · on June 5, 2019

Customers should not have to resort to shaming you publicly to get timely support responses.

Two points establish a line.

zacharybk · on June 5, 2019

Hey Sneak - I totally agree* :) We've already started efforts between Support and Security leadership to leverage the 24/7 structure of Support. Our goal is that no one will ever need to use social as an escalation path, and our new Support Engineers who are joining in mid-June and early July will be part of this a reality for our customers.

metildaa · on June 5, 2019

Fyi, I'm planning to migrate off of DO due to the crummy payment options. With every other recurring service I can have a PayPal subscription, DO doesn't support PayPal properly though.

Also, your billing system is incredibly spammy (1 to 2 emails a day at times). The bills I get are always for partial months despite the VMs on the account running for a full month.

zacharybk · on June 5, 2019

Hey metildaa - Zach here from DO support. It is accurate that that do not currently support PayPal subscriptions, this certainly isn't the overall experience that you should be having. Can you shoot me an email (first name at) and I'll look into this for you?

Thank you! Zach

xnxn · on June 5, 2019

Thanks Zach, will do.

zacharybk · on June 5, 2019

Email received, followed up on, and account credited.

-Zach

chris_wot · on June 5, 2019

Really? What is so bad about your support that you needed to hear about this from HN before you did anything?

heyoni · on June 5, 2019

The postmortem...they admit that they need to hire more support staff.

chris_wot · on June 5, 2019

That's my point. Their support got so bad that they had to be called out on in publicly before they did anything. That's hardly what you want to hear from a company like DO.

mschuster91 · on June 5, 2019

> That's my point. Their support got so bad that they had to be called out on in publicly before they did anything. That's hardly what you want to hear from a company like DO.

That's what you get with most providers these days. Ever tried reaching Google or Amazon?

glenneroo · on June 5, 2019

Actually I have complained to Amazon... and they have always responded within a couple days and in one case where I complained about Prime Video, they called me on a Sunday, which shocked me because everything is otherwise closed in Austria, except gas stations, restaurants and hospitals.

Google on the other hand, I make complaints or suggestions once every couple months and 99% of the time I don't even get a boilerplate response.

chris_wot · on June 5, 2019

I don't believe that Google or Amazon have ever done anything like what DO did before. However, if you notice their support guy Zach was responding to a completely different incident that is not related to the one being addressed with their post-mortem.

nostrebored · on June 5, 2019

If you pay for enterprise support with Amazon you can open a case with a less than 15 minute SLA [1]. With business that goes down to an hour SLA. With both plans you can create chats or calls that typically get answered within a matter of minute with and assigned to an engineer with a background in the service.

[1] https://aws.amazon.com/premiumsupport/plans/enterprise/

jasonlotito · on June 5, 2019

Amazon and AWS. I've always been happy with the support I've received from them. It's always been easy.

marcinzm · on June 5, 2019

Amazon support is from everything I've heard and experienced great.

So don't lump everyone together in an attempt to paint DO in a better light when it's not true.

andr · on June 5, 2019

I have no relation to DO, but I'm surprised by the negative responses in this thread. I can't think of any other major company conducting a public postmortem for a customer service failure (as opposed to networking/ops failure). Not only are they changing their policies across the board, taking on more risk to improve customer experience, but they are hiring extra people so it does not happen again. Kudos for that!

And of course DO will still retain the ability to suspend your account for suspected fraud - that is the case with any cloud services company, and any online business in general (check your ToS). Again, I can't think of any business that will en masse promise to never react to any fraudulent users. It's how this process is performed that matters and that's what they are improving.

dang · on June 5, 2019

> I'm surprised by the negative responses in this thread.

That is actually a general pattern. Negative, dismissive responses come fast because they're reflexive. They don't require processing significant information, nor reflection, nor thoughtful writing—all of which take time. Therefore they are the first to appear.

After that, a second wave of objections gets triggered because people read the first wave and are dismayed by how negative it was. That is the contrarian dynamic: https://hn.algolia.com/?query=%22contrarian%20dynamic%22&sor....

thaumaturgy · on June 5, 2019

Yep. I haven't had the best experience with DO previously (not terrible, just not great), and was vocal in the original thread about this, and I think this is one of the best reactions to a customer service failure that I've seen from a tech company.

This wasn't a failure that impacted thousands of customers [#]. DO could've just fixed things up for this customer and said not a word more, and changed nothing, and everyone would've forgotten about it by about ... now.

Instead they dedicated a nontrivial amount of resources to understanding what went wrong -- identifying not just a single cause, but several -- and publicly explained what happened, without a lot of weasel words, and what they're going to do to fix it.

It's an awesome response.

[#]: ...at that specific time. Yes, others have probably previously been impacted.

_skel · on June 5, 2019

This happens every time a post-mortem is posted on HN. It is often dismissed as a PR move, or ass-covering, or lies.

rlue · on June 5, 2019

I haven't been around the block enough to know, but I'll take your word for it.

What I don't understand is why anyone would complain about a post-mortem as PR. If that's what DO was up to here, I'm sold—it shows transparency, thoughtful problem analysis, and swift execution.

More than what this means to me as a customer, I could surely stand to learn a lot in my own work from the way they approached this situation.

dkersten · on June 5, 2019

A PR move isn't worth anything unless they actually follow through. I suppose many people are skeptical that things will actually improve as PR is often all talk and no actual action. Not saying that's the case here, but it certainly is difficult to trust a company purely by what they say for PR these days.

microcolonel · on June 5, 2019

There's no amount of money you can pay them to get decent support, that I'm aware; at least, if you could, you can't just pay them upfront. They screwed up their weird bespoke network configuration system (which they inexplicably use in place of DHCP) on FreeBSD, and their only support output is: lol blow down your instance and start a fresh one.

You tell me why a company like theirs, which should be mature, ditched their proprietary support and ticketing system (which actually worked) for an shoddy, misconfigured off-the-shelf product which has equally little excuse for being that bad.

I think it's customary here to be unfair to companies, even when they're doing "the right thing" as DO is right now; but DO has spent their customer patience budget elsewhere, so I'm not going to be surprised if people view the post mortem as an inadequate replacement for getting it right the first time, rather than a followup to an honest mistake they will actually try hard to avoid in the future.

julianlam · on June 5, 2019

> There's no amount of money you can pay them to get decent support, that I'm aware;

We at NodeBB have a high enough spend to have access to level 2 agents (their responses are around 1 hour turnaround, if not sooner).

We don't actually spend that much either, compared to what some of our clients pay AWS, etc...

Now, we certainly don't qualify for their highest tier with the dedicated support manager and slack access, but that's ok. DO's been amazing to us as a host.

microcolonel · on June 5, 2019

Yeah, I'd just like to be able to pay for the support, without having so much volume. I can't really consider sending them enough business that we'd qualify for level 2 agents if I run into regular technical issues with the services before I even get that chance.

Is it bad for image if they just have a "send a couple hundred bucks because you need to speak with someone real bad" button?

codegeek · on June 5, 2019

What is the high enough spend average to get Level 2 agents if you don't mind sharing ?

HappyDreamer · on June 5, 2019

I wonder this me too :-)

Their pricing page reads: "world-class technical support to all of our customers—around the clock" (https://www.digitalocean.com/pricing/#Included_services ) b.t.w. which doesn't seem totally accurate to me, with the 12h and 29h response times in this account banned case.

julianlam · on June 5, 2019

We qualify for their "Business Support" tier: https://www.digitalocean.com/support/#BusinessSupport

julianlam · on June 5, 2019

Hey, sorry for the delay -- the info sheet they provided us lists $500 monthly spend.

https://pages.news.digitalocean.com/n/b0000VI6mEn0DfeZ0305PT...

_Codemonkeyism · on June 5, 2019

DO has been amazing for us too.

pbreit · on June 5, 2019

I don’t understand how the account did not have payment history. Hadn’t it been in operation for quite some time?

rajaganesh87 · on June 5, 2019

I think the account used startup credit.

https://www.digitalocean.com/hatch/

manojlds · on June 5, 2019

The tweet thread suggested using for years. So either they are giving credits long time for startups they don't even know how and what they are doing or messed up a simple logic about history of the customer.

dosy · on June 5, 2019

Similarly, as I have skin in the game and didn't want to get blocked by DO, I was expecting a muddy response and ready to make a throwaway account and complain about "keeping processes opaque to give carte blanche to take whatever arbitrary action they like" in the normal complaint about deplatforming and tech censorship, but then I read their document and was surprised and encouraged by how transparent DO were. Congratulations DO!

computerex · on June 5, 2019

> I can't think of any other major company conducting a public postmortem for a customer service failure (as opposed to networking/ops failure).

There are many companies that have done this in the past. They are not doing this out of the goodness of their hearts, this is lip service for the fact that their mishap blew up in their faces on twitter. Do you really think they would have gone at length to highlight to the public this incident had it not gone viral?

> Not only are they changing their policies across the board, taking on more risk to improve customer experience, but they are hiring extra people so it does not happen again. Kudos for that!

There is no telling that they are actually going to follow through with anything. Mere lip service.

The bottom line is, people host their businesses and livelihoods on cloud providers and they (the cloud providers) should take the necessary care and precautions when taking destructive actions. Maybe err on the side of the customer instead of shutting down someone's entire business because of some automated heuristic. Maybe have a better response time than 29 hours. Maybe teach basic communication and develop processes so that care agents can see and react appropriately to recent activity on the account. These are not revolutionary concepts, they are simple things that demonstrate customer care, something DigitalOcean is sorely lacking.

rajaganesh87 · on June 5, 2019

> precautions when taking destructive actions.

No data was lost, it is not destructive in anyway.

> because of some automated heuristic.

If the customer had "payment history" none of this would have happened. Probably it was being used under "startup credits"

> people host their businesses and livelihoods on cloud providers

people shouldn't run entire operation on credits and blame DO in twitter.

Only issue is that DO took 29 hours, apart from that i see no problem with DO.

icebraining · on June 5, 2019

> people shouldn't run entire operation on credits

Why not? Until now, I wouldn't have considered that using credits might make me a second-class customer. They should at least be upfront about that.

dkersten · on June 5, 2019

> No data was lost, it is not destructive in anyway.

Except in the way that the guy may[1] have lost customers or revenue due to the downtime. Being offline, even without data loss, is very destructive for many businesses.

[1] I don't know anything about his business.

computerex · on June 5, 2019

> No data was lost, it is not destructive in anyway.

Tell that to the owner who was begging DO for their data back on Twitter. Again, had this not blown up on twitter nothing would have been done.

> If the customer had "payment history" none of this would have happened. Probably it was being used under "startup credits"

What's your point in saying this?? The fact is that the customer faced downtime because of a bug in DO's code.

> people shouldn't run entire operation on credits and blame DO in twitter.

Are you saying that customers on credits aren't subject to SLA's?

> Only issue is that DO took 29 hours, apart from that i see no problem with DO.

I think you seem very biased.

pbreit · on June 5, 2019

It should be pretty hard to shit down a legit biz. Seemed automatic in this case.

gavindean90 · on June 5, 2019

I mean it is. Unless your business is wholly dependent on a service from my business.

pbreit · on June 9, 2019

Edit: shut

scarejunba · on June 5, 2019

What? We use a few million dollars in GCP credits every month.

_Codemonkeyism · on June 5, 2019

"There are many companies that have done this in the past."

Haven't seen those, can you point me to them? I love companies doing this.

artificialLimbs · on June 5, 2019

There is also no telling if they're not going to follow through, and it is not mere lip service.

What cloud business do you run that does better, according to your standards?

jwr · on June 5, 2019

Key takeaway for me: DigitalOcean might still kill your account at any time, revoking your access to your data.

Refusing service is fine, but holding my data hostage and refusing access to it is not, so I am making a note to not consider DO for any kind of hosting.

Jedd · on June 5, 2019

I am curious which alternative service providers you use that offer guarantees to never kill your account, and to provide copies of data stored on their infrastructure.

(As noted by many observers during the initial event, anyone keeping their data exclusively within one organisation's walls is a profound mistake.)

solidasparagus · on June 5, 2019

I don't know if it's a promise, but I know from experience that AWS won't kill your account when they automatically detect fraud-like increased usage. They send warnings that the account will be shut down in several days.

Because shutting down running servers without warning is completely unacceptable for a B2B infrastructure provider.

Jedd · on June 5, 2019

> I don't know if it's a promise ...

I suspect that it almost definitely is not.

As IT professionals we should do better than use words like 'kill' to describe system and account changes.

As I understand it, Digital Ocean suspended the account, and because the (perceived) problem was related to excessive / potentially fraudulent CPU usage, they suspended the machines. The data contained on them was not deleted. Does that match your reading?

As someone who suffered from extremely noisy neighbours in AWS (in the very early days), risking significant damage to the performance guarantees to our customers, I'm actually cautiously happy with automated protection systems. Naturally I'd rather noisy neighbours were throttled so that I never heard them, and I expect that's closer to what happens these days

solidasparagus · on June 5, 2019

I think 'kill' is appropriate here. They shut down access to his company's account along with all running infrastructure and after escalating multiple times, their final official response (after 30hrs of silence) was that his company was permanently denied access to the account (DO calls it 'account termination' in their postmortem). The first tweet of the thread that went viral was 'How Digital Ocean just killed our company' [https://twitter.com/w3Nicolas/status/1134529316904153089].

I don't want to spent too much time dumping on them since they clearly know how badly this situation was handled, but this is an example of terrible automated protection leading to a company that's not enterprise-ready. As you say, AWS probably doesn't publicly promise not to terminate your account, but this would never happen because they understand that availability and security matter more than anything else when running B2B infrastructure.

Jedd · on June 5, 2019

On review, the wording is quite vague.

From TFA:

> Shortly thereafter, DigitalOcean investigated the issue and the Raisup account was unlocked and powered back on.

But it's not clear if any data was deleted by DigitalOcean.

The suggestion the account was unlocked rather than re-created suggests it was not, but OTOH there's no reference to erase, delete, restore, or indeed current state of customer data in that post mortem.

irishsultan · on June 5, 2019

The fact that data didn't get deleted is irrelevant if the customer can't actually get at that data.

That the customer got unlocked is of course a good thing, but for at least 30 hours the customer couldn't access their data, that's highly problematic.

bcooks · on June 5, 2019

Nothing was deleted or removed. The droplets were powered off and the access to them locked. once (way too long later) the unlock happened the customer had full control and access again.

Jedd · on June 6, 2019

Good to hear - that's how I'd assumed things went, but thank you for clarifying Barry.

callmeal · on June 5, 2019

>> I don't know if it's a promise ...

>I suspect that it almost definitely is not.

It's not a promise. But they don't shut it down without letting you know they're going to be shutting it down first.

bepvte · on June 5, 2019

Because I reinstalled my droplet with a different filesystem manually, the snapshot restore doesnt work. Support tells me they cant do anything, so my 2~ of chat logs are sitting in a disk image that they cant restore bc they need to mount it for some reason...

Sebguer · on June 5, 2019

Hey there! I would love to follow up on the issue you're describing here. It looks like you tried to bring a disk image over from a provider in a format that we don't support, and unfortunately there's nothing trivial we could do to get a working Droplet out of it (which is a requirement for us to expose the volume within the systems we have).

I can't promise a super fast resolution - but I'd be happy to work internally to see if there's any outside-the-ordinary workarounds we can supply here if you're willing to follow back up on the ticket.

bepvte · on June 5, 2019

I replied to my ticket (#2710287). Thank you so much for giving it a shot by the way.

ummonk · on June 5, 2019

If you have illegal content such as hacked / stolen data or child pornography on your account, they absolutely should revoke access to your data.

nkozyra · on June 5, 2019

I appreciate that they did this.

It's sad to me that your only chance in hell of getting huge companies to listen to you is by shamespamming across social media.

That, coupled with the clear issues following procedure from support, paint a clear picture: customer service is an area to skimp on for big tech.

zacharybk · on June 5, 2019

Hey Nkozyra - Zach here from DO Support. One of our remediation efforts, that is already underway, is that Support and Security Operations leadership will create new workflows to allow abuse-related events to leverage the 24/7 structure of Support.

On Support, we have additional Support Engineers joining our Developer Experience team in mid-June and early July. We will continually assess our ability to provide high-quality responses as fast as possible to all tickets. Our customer feedback will continue to be the measure of how well we're doing, but our goal is that no one will ever need to use social as an escalation path.

sladey · on June 5, 2019

I've been using DO spaces for about a year now, and for the later half of that time, my experience has been pretty terrible.

- Spaces throwing up errors that magically fix itself a couple of days later.

- Asking about the credits we were promised for when Spaces lost our files results in the question being ignored. Still haven't received the promised credits for 6+ months. I can't even look back at the tickets now as the support system has deleted all the tickets older than a month.

It's gotten to the point where we have started work on migrating off DO, which is unfortunate because DO's offerings looked very attractive.

zacharybk · on June 5, 2019

Hi Sladey - Zach here from DO. I'd like to help you out and investigate what might be going on. Can you send me an email (first name at) and I'll investigate right away?

Thank you, Zach

laughinghan · on June 5, 2019

In what way is DigitalOcean a "huge company"? At ~300 employees, it's closer to the SMB's definition of a small business (<250 employees) than mid-size (<500 employees).

In all fairness, having worked at a few tech startups, it can be hard to scale customer service to keep up with demand—you don't control how many support tickets come in, and it takes a lot more time to hire and train new customer service agents than it does to spin up new servers, and if you over-hire, it's a lot more costly than shutting down some servers.

omeid2 · on June 5, 2019

DO is claimed to be "third-largest hosting company in the world in terms of web-facing computers", so that should give you an idea of how many customers they have.

im3w1l · on June 5, 2019

How do they manage so many boxes with so few people? Do they rent metal and resell it with value add software?

breakingcups · on June 5, 2019

By skimping out on support, clearly.

laughinghan · on June 5, 2019

Oh interesting, hadn't seen that claim before

p1necone · on June 5, 2019

I feel like a better measure for the "size" of a tech company is number of customers, rather than number of employees. Considering software doesn't need a linearly increasing number of people to produce/support it as your share of the market gets bigger.

hyperpape · on June 5, 2019

It doesn’t scale linearly, but we’re all here discussing in this thread because scaling support is hard.

mdip · on June 5, 2019

I had wondered if they were a large company or not. I have used them in the past and there was a specific reason that we picked them over AWS/Azure/Goober, but those details escape me.

Based on the other comments, they don't sound like a huge company. Someone mentioned about 300 employees. I am not sure their revenue and chuckled at the reference that they're "third largest" given some specific criteria (of course, "largest" doesn't mean anything, either -- what we're really looking for is "third best" by some criteria that we've defined in our head)

I found this: https://www.canalys.com/newsroom/cloud-market-share-q4-2018-...

That jives more with other things I've read and anecdotal experience.

I think the size of the company is less relevant than the last point that you make about the clear issues around training/support. At a company that size, CSR training might be less formal than it needs to be. When a one-off like a customer who is legitimate but in every other way appears to be fraud might involve a slack messages to people with wrong information rather than clear guidance followed up with formal training.

It's difficult for a smaller (average cash-flow for their size) company to succeed in highly competitive markets that are, effectively, commodities. A larger company can handle being a loss leader to knock smaller competitors out of the market, providing excellent customer service. They don't. But it'd be easier for them to do so. :)

exabrial · on June 5, 2019

Digital Ocean: allow me to do some extended verification so you know exactly who I am and reduce your risk. In exchange, there is no automated locking, rather we are contacted and have 24 hours to mitigate the issue.

Requirements:

* 1 year continuous ontime payments at $250+/mo usage

* automatic billing is set up

* billing limits are set up and have been reviewed within the last year

* copy of our business insurance and license

* u2f on all accounts

Fair?

bcooks · on June 5, 2019

The first few items on your list are actually a part of what we meant by "having billing history with us". There are a number of things we look at in that bucket. We use these items as a part of validating users before taking any action (yes, we clearly failed on this account due to the credits which is a clear bug). As far as offering things like a copy of your business license or other means of verification that isn't a bad idea. As an example people paying with POs today are excluded from the algorithm already.

kijin · on June 5, 2019

Please make it official, so that people can have peace of mind knowing that they've got that "verified" badge. People hate having to wonder whether they're at risk of crossing an invisible, inscrutable, and constantly changing threshold. See: PayPal and AdSense account forfeitures. You could do so much better than that.

gr2020 · on June 5, 2019

Not sure why you’re being downvoted so hard - while I think your list of requirements perhaps aren’t the right ones, I like the idea in general. Have a process where I can make assurances of some form that we’re good guys, and treat me accordingly.

idlewords · on June 5, 2019

What is a "business license"?

exabrial · on June 5, 2019

Various businesses in the usa require licensure to operate. These are often done at state or even federal levels. Not every business has one, but even an EIN provides them more information about who's paying for the service.

idlewords · on June 5, 2019

Not every business has an EIN. I'm kind of surprised at the number of regulatory hoops you're willing to jump through to negotiate a minimal level of service.

dylan604 · on June 5, 2019

Looking at it from the point of operations of the affected company, this doesn't sound like minimum level of service. Instead, DO was the entire business structure of the company. As a company, if you're entire business plan depends on the services of a 3rd party, it is not unusual to go through extra steps to ensure 3rd party can't end your business. Running an internet driven company from a consumer service from ATT/Spectrum etc is suicidal. If your network goes down, they'll fix it when the get to it. With a business level account, they have much more guarantees to keep your signal hot. Running a POC off of a free tier of AWS/DO/etc is fine, but these guys were well past POC.

Aeolun · on June 5, 2019

As happy as I am to see this post itself, the mistakes made here are pretty appalling.

Killing customer accounts by automated action without any human check just seems like a recipe for disaster. Even if you can respond faster to crypto issues, the effects of a false positive are just unacceptable.

Though apparently the human checks at Digital Ocean don’t work either.

laughinghan · on June 5, 2019

According to the post, that's not what happened? The customer account wasn't terminated by the automated system, but rather by the second Abuse Ops agent.

Upon a second review by a different Abuse Operations agent [...] the agent fully denied access back into the account. This action triggered the final “access denied” communication to the customer.

mintplant · on June 5, 2019

That was after the automated process locked access to the account and powered off all associated machines.

computerex · on June 5, 2019

Read the post?

> The initial account lock and resource power down resulted from an automated service that monitors for cryptocurrency mining activity (Droplet CPU loads and Droplet create behaviors).

wvenable · on June 5, 2019

You're looking at the one false positive instead the potentially thousands of true positives. Those true positives are people using free DO credits to cryptomine on DO. And probably 99.99% of the time the system works as intended. Dealing with bad actors and abusers is not pleasant or easy.

This sort of 0.01% case is exactly what developers and sysadmins deal with on a regular basis -- some crazy scenario that lead to a bug you've never experienced before. The correct and only response is to fix the bug (whether it be software or process), offer whatever compensation to the injured parties, and move on.

Aeolun · on June 5, 2019

How much do you think this one instance will hurt DO’s bottom line?

I’m prepared to say the lost income potential is massive.

Who would ever use them for production if they can arbitrarily decide that your servers need to be powered down?

grzm · on June 4, 2019

Digital Ocean's follow-up to "DigitalOcean Killed Our Company" https://news.ycombinator.com/item?id=20064169

cannonedhamster · on June 5, 2019

We dropped DO from our company usage after similar issues, though honestly DO probably wasn't the right place for our product at that stage of development. What was meant as a POC became technical debt and an outage forced is to come to terms with the fact that by the time the issue happened we had more than enough of our own capacity to run on our own metal.

Kudos to DO for the open incident management. As someone who does this myself, these are often really painful and hard to get right.

caffeinewriter · on June 5, 2019

Good response from DO, but this line jumps out at me.

> Responses to account locks were not prioritized differently from a ticket management standpoint to be above less severe tickets.

That's arguably the biggest failure, IMO. The fact that an action which locks/terminates an account is not prioritized any different than a general ticket is pretty jaw-dropping, and I'm glad they're going to change that.

bcooks · on June 5, 2019

Yeah... that one was painful and we are fixing it. At least if the priority placed this at the top of the queue we could have acted faster. Probably the same outcome due to the other issues involved in this incident though.

caffeinewriter · on June 10, 2019

I appreciate all your transparency and engagement on this issue. It probably would have had the same outcome, yes, but potentially resolved much more quickly. Regardless, the fact that you're fixing it is music to my ears.

mtw · on June 5, 2019

The startup claimed they had all their backups on digitalocean, which contained data of Fortune 500 companies.

A startup who has fortune 500 clients must have history. I don't get then why digitalocean says they do not have payment history. Either the startup moved a few weeks ago - but then why don't they have offsite backups if they just moved. Or because they're french they did have payment history but did not have an American credit card to similar.. not sure what's up

heartbreak · on June 5, 2019

Their payment history was with DO credits according to the post.

heliodor · on June 5, 2019

Maybe one of the problems is DO's views of credits. Maybe things would work out better if they would treat credits like real money instead of phony money and tighten up how the credits are handed out.

__HYde · on June 5, 2019

Yes, DO have stated that how they viewed credits was a mistake and that they will be addressing it.

cls59 · on June 5, 2019

AWS is a 8,000 pound gorilla. This means throwing massive amounts of promotional credit at new accounts is often the only viable way to get them into the funnel.

jmhyer123 · on June 5, 2019

They we're running on DO account credit so they may have been using the service for a while but haven't needed to pay for the service yet.

gpvos · on June 5, 2019

Looks like a clear indicator for possible abuse to me. Apart from the long response times, I can't really blame DO.

EugeneOZ · on June 5, 2019

They were paying customers before, then got credits as startup. It's mentioned by DO in comments.

oblib · on June 5, 2019

I really cannot blame DO for this incident. These kinds of things must be handled in a learn as you go way and when I consider what one can do with a DO VPS (or 10) it's astonishing. I would expect them to automatically flag some uses.

A business "relationship" is a two way thing. You call and talk to people, and tell them what you want to do, and ask if it's ok.

When I've called and talked to DO reps about what I've wanted to do they have been very accommodating.

Animats · on June 8, 2019

Here's what's really wrong. This is a B2B service with B2C-grade terms of service. You don't want to base your business on one of those. Not one with a "sole discretion" termination clause. Those are for low-value consumer facing services only.

Compare, say, these terms of service from a major dedicated server hosting company.[1]

Either of the parties may terminate this Agreement (including all existing Orders) if: The other party breaches any material obligation under it (other than our obligations covered by an SLA), and fails to begin to cure such a breach within ten days of written notice of such a breach from the non-breaching party, or fails to completely cure such a breach within thirty days of the original written notice; OR a force majeure event continues for more than thirty days.

Now that's what a reasonable B2B contract looks like. That seems to be fairly standard for dedicated server hosting.

[1] https://info.codero.com/hubfs/Linked%20Assets/Legal%20Docume...

andrewstuart · on June 5, 2019

Nothing in the statement from Digital Ocean indicates that they won't kill your account or shutdown your systems - that's not the sort of cloud host any company can afford to use.

Cloud providers that kill accounts - or SAY they kill accounts, must be dropped and not used.

The worst thing that should be possible is for your account to be suspended.

AWS, if there is a billing issue, prevents you making changes to your infrastructure via the console until the billing issue is sorted out - this is good and reasonable.

""Peer review of account terminations. For any account appealing a lock, two agents will be required to review the submission prior to issuing a final deny.""

- I can imagine how this plays out:

(service agent 1 turns to next service agent along) 'This looks like a bad account - I think I should shut it down, what do you think buddy?'

(service agent 2) - 'Yep I trust you, shut it down.'

asdfasgasdgasdg · on June 5, 2019

> Nothing in the statement from Digital Ocean indicates that they won't kill your account or shutdown your systems - that's not the sort of cloud host any company can afford to use.

There is no cloud that will issue such a promise. If your criterion is that a cloud has to promise that you won't get shut down, you just can't use cloud hosting providers.

nothal · on June 5, 2019

The article discusses violating TOS, not a billing issue. I don't think it's unreasonable to disable an account in that circumstance but I agree that deleting images/resources without allowing customers to defend themselves and backup the systems would be unfair.

ben0x539 · on June 5, 2019

> The article discusses violating TOS, not a billing issue.

That's not the impression I got. It sounds like the issue was that a account with misinterpreted payment history was showing bitcoin-mining-like usage patterns. Mining is not against the terms of use, they were just erroneously convinced themselves that the customer was not going to pay for it.

ben0x539 · on June 5, 2019

I'd like to apologize for the typos in my previous comment that I neglected to notice before the edit window expired.

bcooks · on June 5, 2019

Correct.

newsoul2019 · on June 5, 2019

I have read and written similar RCA's in the past, this one is very good IMHO.

craftinator · on June 5, 2019

Barry Cooks did a phenomenal job with this after-action. He not only publicly accepted fault on DO's behalf (+1), not only stated the incident timeline clearly and without bias (+2), but also showed mitigation steps and procedural changes to avoid this in the future and prioritize customer business interests (+3). Many medium and larger sized companies should take note of this handling style (looking at you, Google and Facebook). I love that there was no generic PR "we're very sorry". Succinct, accurate, and without spin (+4).

lioeters · on June 5, 2019

I agree, the incident report was well done. The combination of factors that led to the issue was described in clear detail, and I was glad to see a concrete plan to improve various aspects to avoid future cases like this. It certainly helped to regain trust.

silversconfused · on June 5, 2019

I had been expecting a short blip of an update denying anything of consequence (a twitter post promised a status update, but well, you know...) but this transparency significantly exceeds expectations. Nicely done DO.

You may want to explain service credits in some light detail though, for those that are unfamiliar with them.

Ill_ban_myself · on June 4, 2019

That is everything I'd hoped to see as a developer and digital ocean customer. Good response.

muppetman · on June 5, 2019

Totally. I'm fairly new to DO and after seeing what happened was re-thinking my decision. But this is a solid followup, "we made a mistake" post so I think I can rest easy.

ncmncm · on June 5, 2019

They hoped so.

One wonders how many others didn't get enough Twitter cred, before. That some low-level ticket stamper (even a high-level ticket-stamper) had authority to deep-six a customer on no more say-so than high CPU usage tells us more about the company than an incident report massaged by marketing communication specialists. Simply, the latter sounds good because it has been made to sound good by sounds-good experts, and could say anything; but the event itself is ground truth.

They will need a lot more time and good behavior to live this down.

bcooks · on June 5, 2019

I agree on the twitter cred point. The fact that this happened in the end, personally I think it is a good thing as it highlighted a weakness we must fix.

We trust our people high-level, low-level whatever to make important decisions everyday. thats why they are here.

The "marketing communications specialists" are getting slammed a lot here, so I will just point out that they spend most of their time rolling their eyes at my crappy grammar, spelling and ludicrous number of comma splices. I don't think our goal was to sound like anything. We just wanted to lay out our investigation and the follow on work we are undertaking.

Totally agree with your point that trust is earned and we lost many peoples in the last few days. That will take time and as you say good behavior to earn back, but that is what we are committed to doing.