Hacker News new | past | comments | ask | show | jobs | submit login
Update on Samsung SSD Reliability (pugetsystems.com)
487 points by Akharin on Feb 3, 2023 | hide | past | favorite | 234 comments



Apparently a few months ago it became known on the Chinese internet that the 980 Pro, 970 Evo Plus with new controller, and OEM versions are prone to getting unreadable sectors, where SMART 'Media and Data Integrity Errors' increases on every read attempt.

https://www.reddit.com/r/buildapc/comments/x82mwe/samsung_ss... https://www.reddit.com/r/DataHoarder/comments/x8arle/psa_sam...

How I came across this: Ran into this last week(!) on a 6-month old drive -- but I'm not in China....hmm. Not just one bad batch? Interestingly, it's non deterministic - the data is backed up but trying ddrescue, it occasionally succeeds at reading a few kilobytes from the 5 MB of several runs of 512-16384 bytes that can't be read or written. Curious to see what happens with a firmware update and secure erase.


PS: I'm one of the victim with a 970 Evo Plus. The company that provided aftersell services, Lobcom, did not want to provide any RMA services and claimed nothing wrong is found.

The scamming company in question: https://zh.lobcomgroup.com/


My anecdata:

tl;dr: All 3 of my Samsung M.2 NVMe SSDs have failed in less than 3 years. 100% failure rate.

My first SSD was a 1TB Samsung 970 EVO. It failed after 2 years and 8 months. It was replaced under warranty with a 1TB 970 EVO Plus.

That replacement has now also failed after 1 year and 9 months.

I bought a 2nd 1TB 970 EVO Plus in May 2019. It has now also failed (2 years and 7 months).

Both are expected to be replaced under warranty.

The 2 970 EVO Plus SSDs clearly had hardware errors (that were not accurately reflected in SMART data) that caused everything from system hangs, game crashes to file corruption on OTHER drives. I couldn't believe it at first but after 5 days of testing and trial and error, I had it confirmed. As soon as I removed those SSDs, my PC was completely stable again.

In the meantime, I have bought a Kingston KC3000 1TB drive as I no longer trust Samsung M.2 NVMe SSDs. On the other hand, I have a Samsung EVO 850 SATA drive which has been rock-solid.


My anecdata, I have been running 4x 500GB Samsung 850 EVOs in Raid 0 continuously without failures since early 2015.


The article mentions issues with the 900-series drives. It seems like the 800-series are still rock solid (also been running them for s few years now without issue)


Unfortunately there have been recent issues with the 870 EVO series also: https://www.techpowerup.com/forums/threads/samsung-870-evo-b...

There may be multiple, different issues with Samsung parts at play here. The 900 series issues seem to have been addressed with a f/w update; the 870 EVO issues were - allegedly - caused by bad NAND and the devices needed to be replaced.

ofc part of the problem here is the lack of public acknowledgement / information from Samsung on these issues.


Similarly my M.2 NVMe 950 pro has been in an always on machine that gets a ton of use since 2016.


The parent posts mentioned 970 and 980, not 850.


Is it possible that your motherboard or PSU is killing the drives?

Could also just be sheer chance, of course.


How does this happen? Got any background info?


Poor voltage regulation from the motherboard or power supply could glitch the controller of the drive causing I/O errors or failures.


As an example, an old Asus board of mine has trouble with modern m2 drives. A PICe m2 adapter solved the problem and the Samsung ssd worked without issues thereafter.


I've bought 6-8 m.2 Samsung 970 EVO Plus and 980s since 2018, and none have failed to date.

Anecdata is the worst, I'm sorry to hear about this happening to you. It's surely frustrating and upsetting.


Worth checking if you have any thermal issues with it. Mine failed in a similar way due to presumably a rookie mistake of forgetting to remove the thermal pad tape on the mobo.


It's not likely that thermal issues would cause bad reliability on these things. At worst you could expect intermittently bad performance. You can check for this condition with `nvme smart-log`. If your device was often overheated, it would have "critical composite temperature time" non-zero. My Samsung that has been in service for years and has no thermal solution has a value of 1 minute and I happen to know that is because I heated it with a hair dryer to find out what would happen if it crossed the critical temperature.


"I happen to know that is because I heated it with a hair dryer to find out what would happen if it crossed the critical temperature."

Ah this is a fantastic and true hacker mindset :)

Willing to tamper with fairly expensive equipment just for the heck of it.


Ha, interesting! Makes sense, the drive is supposed to just throttle itself before it can reach unsafe temps. I’ll def try to check, didn’t know the drive recorded that - thanks for the tip. In any case, now I know RMA is in order


The controller is less thick than the NAND flash so don't make proper contact with the thermal pad. I just discovered mine is affected by this. After heeavy reading the controller is at 67C while the NAND is at 42C.

https://www.youtube.com/watch?v=I8Z09nU554Q


Hmm, that still seems like it should be ok. Tjmax is usually over 100C (though for NANDs they recommend 70C I think)


My anecdata, I've had 5 Samsung SSDs and they've all performed great.

I'd point the finger at your PSU or motherboard. That's way too many failures for it to be the SSDs.

Samsung couldn't stay in business if that was a normal failure rate.


> that caused everything from system hangs, game crashes to file corruption on OTHER drives.

Interesting. Maybe my M2 (WD 570) is the cause for the hangs in my system. Thank you very much!


I can second EVO 850 SATA. Mine has been rock-solid since 2015.


My anecdata, I have a 840 Pro, 850, 850 EVO, 970 and 980 Pro, all still running for years


My 980 pro failed witihn two months of purchasing it in late 2022


I wonder if Qvo are still subject to the same issues.


Hmm I'm going to need to check my Samsung ssd from oct 2021 that failed the first week of Jan 2023. I had started noticing some quirks in spring 2022 but it wasn't a super important drive so I ignored it.


I have similar issue. It started failing mid last year. Then it got more and more frequent toward the end of the year. Last month I got tired of reinstalling OS for the 4th time and got a new system.


In 2021 I bought at least ten 870 EVO 4 TB SATA and six 980 Pro 2 TB NVMe. All devices failed within 6 months on barely used systems. Find some smart data here:

<https://thomas.glanzmann.de/samsung/>

The pattern is always the same: I have them configured in a raid 1. Once a month debian does a raid check. During the raid check Debian reads all data from both devices. I get uncorrectable read errors. I no longer use Samsung SSDs and replaced them them with SSDSC2KB076T8, Micron SSDs and KC3000 Kingston NVMes. No failures since then. In 2021 I told a friend of mine about the issue. He also had a 870 EVO, issued a dd if=/dev/sdX of=/dev/null bs=8M and guess what, he got uncorrectable read errors. Due to running them in RAID 1 I caught the issue early and I had no data loss or downtime because the Linux software raid compensated for the bad hardware. However I replaced them in a hurry because I no longer trust Samsung SSDs. As you can see from the smart log they're barely used. Less than 4 months in service and 10 TB written. I also got uncorrectable read errors when evacuating data from the devices.


Similar experience for me. Four 2TB 980pro in RAID-Z2. Since introducing the drives in August 2021 I've had to replace them five times. I think none of the original SSDs are left. Only between 5 - 25TB written on average. Usually individual uncorrectable errors caught by ZFS, but one drive just straight up died. I keep a cold spare these days. On the bright side, if the cycle continues, I'll never run out of warranty from Samsung or the vendor.


> On the bright side, if the cycle continues, I'll never run out of warranty from Samsung or the vendor.

The warranty is from the date of sale of the original unit. Replacing one doesn't reset the warranty end date.


If this really becomes an issue due to repeated failure, lemon law activates for a refund.


I tested mine. 980 (non-Pro) 1TB with ~20TB read and ~13TB of written. No errors logged. SMART looks fine. No errors when issuing dd as suggested. Though sometimes I do get weird errors put I will put the blame on AMDGPU drivers and the mess that Gnome on Arch often is.


As a data point, the Linux kernel has a long list of workarounds for "ata" related devices (SSDs, HDDs, etc):

https://github.com/torvalds/linux/blob/69f2c9346313ba3d3dfa4...

Can be a bit eye opening to look down that and see equipment you're using listed. ;)


This seems to be the same thing for NVMe devices (eg Samsung, SK Hynix, Micron, Kingston, ADATA, Intel, (etc) NVMe drives):

https://github.com/torvalds/linux/blob/69f2c9346313ba3d3dfa4...

Not quite as easy to read and understand as the ata driver code though.

With the occasional further device specific workarounds in other parts of the code.

eg for specific Toshiba, LiteON, and Kioxia devices:

https://github.com/torvalds/linux/blob/69f2c9346313ba3d3dfa4...

While this seems to be special handling for Samsung X5 SSD external drives, and also Samsung 970 Evo Plus drives:

https://github.com/torvalds/linux/blob/69f2c9346313ba3d3dfa4...


What's horkage?


Brokenness.


For the same price, you can get twice the space for 1/4 the endurance, thrice the space for 1/8th the endurance, and now four times the space for 1/16th the endurance. Most people don't realise that is a horrible tradeoff, because NAND flash marketing and terminology like "TLC" or "QLC" is intentionally deceptive and manufacturers have been very secretive about the true endurance specifications, as well as trying to overprice SLC out of production. If more people knew the truth of what they were trying to do, we wouldn't be in this situation.


> as well as trying to overprice SLC out of production.

Is it even possible to buy SLC drives any more? For the past 5+ years the only outlet I've been able to find that even advertise SLC is https://www.delkin.com/, and you need to speak to sales to even get a price. I just assumed they and any other similar suppliers bought giant lots of chips at the tail end of SLC production and jack up the price on every new order as their supply dwindles. Or maybe they cobble together drives from the tiny SLC chips used for cache on modern SSDs?


Is it even possible to buy SLC drives any more?

Yes, small ones for industrial use. They're extremely expensive, however.

Looking at the raw NAND flash prices, SLC seems to still be around $4.60USD/GB, or roughly the same as it was over a decade ago, while MLC is already <$1USD/GB despite only a doubling in capacity. TLC and QLC seems to be down in the $0.10USD/GB. You can still buy raw SLC NAND flash in the smaller capacities of few GBs; this one is only 512MB, at the price mentioned above:

https://www.newark.com/micron/mt29f4g08abaeawp-it-e/flash-me...

If the pricing was sane, SLC drives would be only 4x more expensive as QLC ones for the same capacity, but that's not what we're seeing today.


Do you think the pricing difference could be due to economies of scale?


In theory an SSD could run TLC NAND in SLC mode with just a firmware change but I'm not aware of any such drives.


Enmotus partnered with Phison to produce a QLC SSD that mapped the first several GB of logical blocks to SLC. This was sold bundled with Enmotus's SSD caching software, but it could also be treated as simply having a SLC partition and a QLC partition that are largely independent.


Samsung 980 Pro, the original, does that when sufficiently empty, to offer around iirc 10% nominal capacity of SLC-mode write buffer (given they are TLC, that would be around 30% of the NAND).


Many TLC drives have SLC cache so if you limit usage of the drive you may effectively use it in SLC-only mode. This does not say anything about its endurance however. If you want endurance, better mirror your data to HDD.


Show me software that can mirror an SSD to HDD in realtime without the "backup" being affected by the slower write speeds. I was looking for a way to do this a few years ago, and couldn't find anything. I'd be very happy if I could pull my SSD out, change my boot drive, and boot into my existing OS without any issues. My understanding is that existing solutions for this delay write confirmation until both drives are complete, negating the speed advantage of the SSD.


Check out Syncthing. I personally run two instances of syncthing, synchronization happens very fast and as long as there isn't high write volumes it syncs pretty quickly. If you don't need realtime backup, rsync'ing in a loop would work.


If you want to use a drive as pure SLC, you're probably better off buying a QLC drive than a TLC drive. QLC drives are more reliant on SLC caching and tend to be more reluctant to migrate data from SLC to QLC when it isn't absolutely necessary.


For most people this isn't a horrible tradeoff. For example, my desktop which I use pretty heavily, I've averaged 24 GB writes per day. With 500 drive writes of a TLC drive, my ssd will last me roughly 50 years at current rate. If I had to chose between a 300 GB SSD that will last me 400 years or a 1TB ssd that lasts 50 years, I'll take the TB one any day of the week.


While NAND endurance has certainly gone down, FTLs got much better during the same time so that SSD endurance is still fine for most people. And if the stock endurance isn't enough, a little overprovisioning is probably better than dropping back to very expensive MLC.


If the number of firmware bugs in SSDs we've seen over time is any indication, I don't think things are really getting better...

SLC needs almost no FTL. 100K endurance. Very low raw error rate that can be handled with basic ECC.


Is it a horrible tradeoff though? I can think of many situations where that would be a somewhat compelling alternative to a spinning platter.


Can you explain your point further? Are you talking about competitors?


It isn't a bad tradeoff for most read workloads.


SSDs are complex beasts :-)

I've had an OCZ SSD in the past that also became read-only, in the sense that all changes after shutting down the computer were gone.

If you rebooted everything was fine, but as soon as the computer was shut down and the SSD also ran out of power, it converted to the state before turning on the computer. That was so bizarre. Once I had a hefty Windows upgrade installed and it was gone after the reboot - as if I had never installed it. It also took me a while to realize it because first of course you start to doubt your own memory, you don't realize that the sector- mapping on the SSD has suddenly become read-only

OCZ eventually solved this via an RMA and eventually OCZ also went upside down. The Time Warp bug it was called I think.


sounds ideal for a public terminal ;)


Marketing should make that a feature: hardware-guaranteed non-modifiable root filesystem! Better than Nix!


No updates! Ever!


Also, no malware!


There used to be a PCI module (iirc) that would assist in that. I believe it was called a Bo(u)rne Again module. It was used in school computers etc.


There was an IDE dongle which did this. It was at a school I was visiting (but not my regular school) and it was pretty cool because it allowed them to be pretty liberal with permissions on the computers since you couldn't really do anything terrible as a reboot would fix everything and they powered them off every night.


There was (is?) a software solution called Deep Freeze. They used it on computers at school when I was growing up.


I wish more independent review organizations would conduct destructive "write lifespan until ultimate failure" real world tests on SSDs. With a mixture of real world large contiguous files and small random writes.

Real ultimate write lifespan on 3-level-cell and QLC consumer grade SSDs varies wildly for things of the same capacity and similar price.

Such as this series of tests from 7 years ago: https://techreport.com/review/27909/the-ssd-endurance-experi...

It looks like the bar charts and other data in that URL are now broken, which is sad, because I recall reading it when it was first published and it shows some amazing differences between the drives that died first, and the ones that died last.

another similar: https://www.guru3d.com/news-story/endurance-test-of-samsung-...


I'd like to see this too!

The official line is that endurance should not matter for most people. For example the Samsung 990 Pro 4TB is rated for 2400TB TBW - which, if the drive has a service of 5 years, is 1.3TB of data written per day. The average user will need < 1% of that.

Where that falls down though of course is cases like this. The point of a review is to show when real-world performance doesn't match the marketing. Tech reviewers seem to be blindly trusting the marketing on this one. They're really dropping the ball.


What's even more important than durability is what the drive does when it runs out of write cycles.

They should just become read-only, but it seems that in the vast majority, the controller just shuts off and bricks the drive.


>destructive "write lifespan until ultimate failure" real world tests on SSDs

>from 7 years ago

It's from 7 years back for good reason. They stopped doing those tests when it became impractical as endurance increased. The drives are now good enough that you can't wear them out fast enough to make sense in a review setting

...unless fundamentally broken like these


The whole review industry just stopped scrutinizing SSDs several years ago, right around the time manufacturers started cutting features like power loss protection and DRAT/RZAT along with switching to TLC and QLC.

Funny how that worked out.


I find it highly improbable that you couldn't wear out a 3-level-cell or 4-level-cell consumer grade SSD which is capable of 300-500MB/s writes with a 24x7 automated test script in just a few months. Maybe even just a couple of weeks. The published total DWPD (drive-writes-per-day) endurance on these is not that great.

Even assuming a conservative 300MB per second, there's 86400 seconds in one day. That's 25920000 MB per day. Or close to 26TB per day. The samsung 960 Pro 2TB is rated by its manufacturer for a total 1200TB of write endurance lifespan.

Or at least leave it running for a couple of weeks and then see what the SMART-reported remaining write lifespan data reports it to be, versus the brand new out of box baseline.


>Or close to 26TB per day. The samsung 960 Pro 2TB is rated by its manufacturer for a total 1200TB of write endurance lifespan.

Right. So around a month and a half. In a world where hardware news drops simultaneous by multiple outlets literally within minutes of news embargoes being lifted.

That's a lot of time investment to get result that are boring AF ("we tested them. they work"). Have very little real life consumer relevance. And the manufacturer that sent you the review sample definitely doesn't want to see (focus on edge case negative).

>Or at least leave it running for a couple of weeks and then see what the SMART-reported remaining write lifespan data reports it to be

Yeah that would make a bit more sense. Run various units down to 95%. That said the resulting story would still have watch paint dry appeal only


> In a world where hardware news drops simultaneous by multiple outlets literally within minutes of news embargoes being lifted.

Any outlet interested in journalism rather purely PR can purchase a retail sample on release and publish an endurance report at a later date, however long it takes.

From my experience people buy storage whenever they have a need for something faster or larger, unlike CPUs and GPUs which have peak interest around their release dates. Storage is evergreen in that sense.

> That's a lot of time investment to get result that are boring AF ("we tested them. they work").

What's boring about that? This is extremely valuable information for any perspective buyer. Either way you gain reputation for being a trustworthy outlet that people can rely on for accurate information.

> Have very little real life consumer relevance.

I disagree and I think most consumers would be extremely interested in durability of their storage devices, especially since most of them rely on it in the absence of backups.

> And the manufacturer that sent you the review sample definitely doesn't want to see (focus on edge case negative).

That's hardly relevant. Informing potential customers about extremely serious flaws in the product is quite literally their job - at least if they wish to have any semblance of integrity, trustworthiness, and respect.

Many choose to sell out and simply echo the approved selling points they receive directly from the company, but not every outlet does this and it shouldn't be held up as something that tech journalists should aspire (or be allowed) to do.


> it became impractical as endurance increased

[citation needed]


I installed 2 x 980 Pro 2Tb in a laptop in Nov 2022. Running a daily Robocopy bat script to backup a folder in C: to D: would freeze a couple of times a week and lock the D: drive. After reboot, a drive check would find no errors and everything would work as normal. I've used the same script for years with no issues.

Since the firmware update last week Robocopy has not frozen the drive at all this week.


The freeze/reboot/fine cycle seems to be a common one for SSDs acting poorly, running out of blocks they want to use internally, or memory or cache or something, or just hanging in their own firmware for whatever other reason.

One of my earlier forays into switching to SSDs, I installed Intel... I think it was 525, 535, something like that, 2.5 inch SATA drives in several different machines. Every one has failed by now with this similar mode of (in)operation. On my desktop where I had one, it would simply bluescreen, but then come back fine until eventually reading certain parts of the disk would just always cause it to hang and it had to be replaced. Failed SSDs like this are interesting because Windows (and to a lesser extent Linux) really aren't prepared for the disk to just hang, so trying to recover anything off them can be a challenge.

Just recently I found out the last one I had around, in a little headless desktop server, was the cause of my problems with it where it would partially hang after a couple days of uptime. Having finally gotten around to having it hooked up to a display, I was treated to a sea of red dmesg errors from the disk.

I think ultimately part of the problem was new power-saving features Intel had tried to add for these disks, which would cause them to write to themselves a large amount and just eat through their useful lifetime much faster than you'd assume.

In almost every case, I replaced these with, of course... Samsungs. Though I believe I've been lucky enough not to choose any of their bad ones.


> The freeze/reboot/fine cycle seems to be a common one for SSDs acting poorly, running out of blocks they want to use internally, or memory or cache or something, or just hanging in their own firmware for whatever other reason.

Given that it's two gen4 drives in a laptop being subjected to a moderately heavy sustained workload, I'd also suspect a thermal problem or maybe even power delivery. Those two slots are probably being fed off the same 3.3V regulator.

Since the firmware bug appears to have caused catastrophic write amplification, what may seem to the user to be only a modest and reasonable workload may be causing the drive that is the backup destination to be running at full tilt doing a ton of writes to the flash and causing the drive to hit its peak power consumption and heat output.


I always suspected that it may be ability for laptop hardware to handle the second drive as performance was not as quite as performant as primary slot. Both slots are rated for PCIE 4 though.

It is strange though that after the firmware update there have been zero freezes.


Yeah I had an old intel ssd that would hang like this and I could never figure out wtf was going on


> the firmware update last week

Link to specific firmware version please?


The new firmware is version 5B2QGXA7, updated via magician on Windows. I didn't make a note of earlier firmware versions. It's still too soon to know if the ssd freeze will reoccur.


How odd, I've got a 980 Pro 2TB that I've had since mid last year, and checking I'm already on that firmware version.


thank you


Could you provide that batch script please? Like in a GitHub Gist or something similar.


Sure...

@echo off

pause

robocopy "C:\Users\o\Desktop\2023" "D:\2023" /e /mir /np /v /tee /r:0 /w:0 /log+:"C:\Users\o\Desktop\log_robocopy.txt"

pause

@echo on


You don’t need to turn echo back on at the end of your batch file. That line is pointless.


And don't need /e since /mir=/e /purge


Thanks. This script has adapted over time and I need to lookup most of the switches these days to remind me what they do.


what does the SMART data for your drive say?

I'm morbidly curious how much it reports lifespan remaining for its internal write-wear-leveling system.


I've only really experienced drive locking and freezing which resolves on a reboot, this is concerning enough. I haven't experienced any endurance issues.

SMART data is as follows (both since 2022-11-25)

Primary drive C: 5.1 TBW Model Name, Samsung SSD 980 PRO 2TB Serial Number, S***** Drive Type, NVMe Result,Byte End,Byte Start,Description,Raw Data,Status ,0,0,Critical Warning,0,OK ,2,1,Temperature (K),320,OK ,3,3,Available Spare,100,OK ,4,4,Available Spare Threshold,10,OK ,5,5,Percentage Used,0,OK ,47,32,Data Units Read,6465577,OK ,63,48,Data Units Written,10998930,OK ,79,64,Host Read Commands,150273501,OK ,95,80,Host Write Commands,157439035,OK ,111,96,Controller Busy Time,1083,OK ,127,112,Power Cycles,199,OK ,143,128,Power On Hours,571,OK ,159,144,Unsafe Shutdowns,12,OK ,175,160,Media Errors,0,OK ,191,176,Number of Error Information Log Entries,0,OK ,195,192,Warning Composite Temperature Time,0,OK ,199,196,Critical Composite Temperature Time,0,OK ,201,200,Temperature Sensor 1,320,OK ,203,202,Temperature Sensor 2,328,OK ,205,204,Temperature Sensor 3,0,OK ,207,206,Temperature Sensor 4,0,OK ,209,208,Temperature Sensor 5,0,OK ,211,210,Temperature Sensor 6,0,OK ,213,212,Temperature Sensor 7,0,OK ,215,214,Temperature Sensor 8,0,OK

Secondary drive D: 2.9 TBW Model Name, Samsung SSD 980 PRO 2TB Serial Number, S***** Drive Type, NVMe Result,Byte End,Byte Start,Description,Raw Data,Status ,0,0,Critical Warning,0,OK ,2,1,Temperature (K),320,OK ,3,3,Available Spare,100,OK ,4,4,Available Spare Threshold,10,OK ,5,5,Percentage Used,0,OK ,47,32,Data Units Read,4919136,OK ,63,48,Data Units Written,6128916,OK ,79,64,Host Read Commands,164977799,OK ,95,80,Host Write Commands,94324034,OK ,111,96,Controller Busy Time,78,OK ,127,112,Power Cycles,199,OK ,143,128,Power On Hours,538,OK ,159,144,Unsafe Shutdowns,20,OK ,175,160,Media Errors,0,OK ,191,176,Number of Error Information Log Entries,0,OK ,195,192,Warning Composite Temperature Time,56,OK ,199,196,Critical Composite Temperature Time,0,OK ,201,200,Temperature Sensor 1,320,OK ,203,202,Temperature Sensor 2,323,OK ,205,204,Temperature Sensor 3,0,OK ,207,206,Temperature Sensor 4,0,OK ,209,208,Temperature Sensor 5,0,OK ,211,210,Temperature Sensor 6,0,OK ,213,212,Temperature Sensor 7,0,OK ,215,214,Temperature Sensor 8,0,OK


Funny thing. This article prompted me to check the health of my two Samsung SSD's (a 250GB 850 EVO SATA III, and a 970 EVO Plus 1TB NVMe), which were fine.

But Samsung's Magician also listed my Seagate ST2000DM008-2FR102 2TB spinny disk. It found a SMART error. I ran a performance test and looked at SMART again, and the "Hardware ECC Recovered" value went from 80 to 81, with a threshold of 64. My other software labels this as "good". Nevertheless, this drive is now being replaced by a 4TB WD Blue. Thanks, article. Saved me some future troubles!


The value rising from 80 to 81 is an improvement. The calculated value decreases when the raw value of "Hardware ECC Recovered" worsens.


Oh, now I feel like a total idiot. Thank you for clarifying that. Good thing is, I still needed a bigger drive so now that's on the way :)


Is this the case for all SMART values? Higher=better?


Yes, the non-raw values are always reported with higher meaning better. The raw values are the raw measurements/counts and they each mean different things.


Is the number the percentage of a max 100%?


You did the right thing by replacing the Seagate ST2000DM008. I do data recovery professionally and that's not one of my favorite drives! Lots of issues IMHO.


No, sometimes it's in the 0-200 range. I believe the device always reports the maximum and minimum possible values.


Hardware ECC Recovered represents the amount of time between error correction events, so a higher number is better.


You might want to double-check that SMART status in CrystalDiskInfo

I don't trust Magician to report correctly on other vendor's storage


Yep, I was wrong on this. I should have used my google-fu instead of jumping to a conclusion. I guess I just really wanted an excuse to upgrade to a 4TB :)


How do I check health of a Samsung SSD on Linux?


LMGTFY sudo smartctl -t long -a /dev/sdX


That's not the way for a modern SSD. Try `sudo nvme smart-log /dev/nvmeN`


Seems to give the same output as the second section "smartctl --all" gives... so, less information.

Aside, any idea why it thinks my drive is 208% used?


No idea on that one. Mine are all three indicating 0%, but I've seen wacky stuff from SMART indicators over the years.


Completely normal


I've been trying to find a decent endurance NVME in the m.2 form factor for write-heavy applications and it appears that true 2-bit MLC has all but disappeared, replaced by 3-bit TLC and higher (with commensurate loss of endurance)

The high endurance SSDs appear to be only available in u.2\u.3\hhhl and god-help-me EDSFF form factors

Any suggestions? Micron's 7450 isn't readily available


They don't really make them anymore, but you can still get m.2 form factor Intel Optane SSDs in the 900/905P series, example[0]. They have insane endurance specs. Their performance is also still awesome, especially for random reads/writes[1]. I wish they had continued making them. Most PC builders just bought crappy Samsung SSDs this whole time, ignoring these awesome (and high priced) drives.

> Life Expectancy 1.6 million hours Mean Time Between Failures (MTBF)

> Lifetime Endurance4 10 Drive Writes per Day (DWPD)

There was one update, but I don't believe it's m.2? [2]

0: https://www.newegg.com/intel-optane-ssd-905p-series-380gb/p/...

1: https://ssd.userbenchmark.com/ (sort by "Avg Bench" and you'll see these old Optanes still in the top 10)

2: https://www.intel.com/content/www/us/en/products/docs/memory...


> Most PC builders just bought crappy Samsung SSDs this whole time, ignoring these awesome (and high priced) drives.

Ha, you can't blame the consumer.

Great tech but expensive and locked to intel. Nobody was going back to the blue evil just because they had a really random reads and writes for the enterprise market.


Isn't userbenchmark the site that had them artificially giving Intel/nvidia better scores on GPU and CPU metrics?


Same one. Assume malice, cluelessness, or both when someone cites "scores" from that place


Most PCs experience very little write load; I can imagine that many of them experience less than one full drive write per lifetime.

A database server box, or even a CI build box, is a whole different business.


WD RED SSDs are targeted at NAS use cases and claim endurance of 1PBW per TB, https://www.tomshardware.com/reviews/wd-red-sn700-review


You may also want to ask over on reddit in /r/newMaxx. Good place for SSD info and there is a pinned post for asking questions like this


NewMaxx SSD references: http://ssd.borecraft.com/


Why not get a U.2 drive and an adapter like this one https://www.startech.com/en-us/hdd/m2e4sff8643


Didn't know such a thing existed, thank you!


While it's true that MLC is mostly dead, you might want to consider a higher capacity TLC ssd. If you double your capacity, you double the endurance since SSD endurance is in drive writes per day, and a bigger SSD will likely have a bigger SLC cache to help with the write speed.


Yep, you can even adjust the overprovisioning manually (at least for Samsung). So if you need more endurance (and improved random write performance) just buy bigger capacity and increase the over provisioning allocation. Found this good summary about it with graphs showing the impact of it:

https://www.atpinc.com/blog/over-provisioning-ssd-benefits-e...


As long as you're using a filesystem Samsung understands? When I overprovisioned my 860, windows showed that the partition didn't fill the drive. When I decreased the size of the partition in windows, Samsung Magician showed that the overprovisioning had increased. While I'm not sure how the drive itself deals with overprovisioning, Magician certainly detects it based on the size of the partition.


2 -> 3 bit cells is a 1.5x capacity bump.

i would not be shocked to find tlc has a >1.5x impact on dwpd.


But it might be easier to find e.g. a 3x sized TLC compared to a 1x size MLC in the FF that GP wants.


For a simple example, the 970 pro (MLC) had a 1200 drive write warrenty, while the 980 pro (TLC) only has a 600 drive write warranty, but the 2TB 980 pro is cheaper than the 1 TB 970 pro so you can get the same endurance for less.


Seems weird that only 2TBs fail then


This sounds like a firmware bug that has nothing to do with endurance.


Use larger drives and use RAID to distribute writes over more drives. Accept that a TLC drive heavily used for writes is a consumable and act accordingly.


i recommend samsung pm9a3 versions. not as popular, but are enterprise products and also the endurance is like 3 times 980 pro i believe (please check it, not 100% sure).

Been using in my threadrupper workstation with a lot of vms which are put to sleep every day with around .25tb written and read each time the vms are started. keep in mind these are 22110 form factor


These are not available with retail support so if you manage to acquire one (which may have been pulled from service with an unknown level of previous wear) you will get ZERO support from Samsung no matter what goes wrong.


well i bought from digitec here in switzerland and they take care of warranty. other than that i don't expect any support from any ssd vendor.

I rely on the physical store i buy from where i live. that's the reason i only buy either physically or from amazon germnay (their support had been rock solid in the last 10 years i had been using them).



I have this exact model, 980 Pro 2TB.

It says to update firmware, but how can you do that from Linux? The instructions are all about some Windows program. Thanks!

EDIT: I'm quite happy with the warning from this article, fixed a potential future problem!


If you're on linux, you probably want to use fwupd[1]. You can check the existing version of your drive's firmware by running `fwupdmgr get-devices`. The version with the fix is 5B2QGXA7.

I'm on Arch and apparently I installed the update at some point in the past.

1. https://wiki.archlinux.org/title/fwupd


Samsung is publishing some firmware but not this. (https://fwupd.org/lvfs/vendors/#samsung, https://github.com/fwupd/fwupd/issues/5477)

  $ fwupdmgr get-updates
  Devices with no available firmware updates: 
   • SSD 980 PRO 2TB  
It would be good to put some pressure on Samsung to use the Linux Vendor Firmware Service. I just opened a support ticket about it.

fwupd is at least manually adding a warning about the affected firmware. https://github.com/fwupd/fwupd/pull/5481


Oops, you're correct. Looking through my shell history, it seems I manually downloaded and installed the firmware update in March of 2022. Here are the commands I ran:

    curl -O https://semiconductor.samsung.com/resources/software-resources/Samsung_SSD_980_PRO_5B2QGXA7.iso
    mkdir /mnt/iso
    sudo mount -v -o loop ./Samsung_SSD_980_PRO_5B2QGXA7.iso /mnt/iso/
    mkdir /tmp/fwupdate
    cd /tmp/fwupdate
    gzip -dc /mnt/iso/initrd | cpio -idv --no-absolute-filenames
    cd root/fumagician/
    sudo ./fumagician


A few months ago I already updated my Samsung SSD by following this procedure: https://askubuntu.com/a/1386451. Theoretically they provide an image to boot from to do the update, but the image seems very outdated and did not recognize my keyboard so it was unusable.


I now found and followed this:

https://blog.quindorian.org/2021/05/firmware-update-samsung-...

And it seems to have worked. After extracting this updater tool and running it, smartctl kept showing the old firmware version (3B2QGXA7), but after reboot it now shows the new version (5B2QGXA7).

I took the risk of running this while the OS (Archlinux) was running with the disk mounted (this is the OS install disk), and at first sight this didn't cause issues. But still do it at your own risk!!


Thanks, but that takes me to old firmware. I was however able to download the new firmware from https://semiconductor.samsung.com/consumer-storage/support/t... and use the same procedure and it worked:

    ├─SSD 980 PRO 2TB:
    │     Device ID:          03281da317dccd2b18de2bd1cc70a782df40ed7e
    │     Summary:            NVM Express solid state drive
    │     Current version:    5B2QGXA7

My home is on non-redundant stripe of two 980 Pro which both had the bad firmware, so I was obviously motivated, but not panicked as it's replicated hourly to spinning rust (and I have offsite backups). I treat Flash memory as dynamic ram with only slightly better retention.


The command line example in the link shows the old firmware, but they do say to go to the samsung website and get the latest one there. The 980 was the first one in the list there under the Firmware section.


I was trying to figure out why the update wasn't working on my Archlinux box. After a few attempts I barely caught a glimpse of an error message that flashed by: something along the lines of "unzip not found".

After installing unzip, the firmware updated successfully.


The command can be simplified as

  isoinfo -R -i xxx.iso -x /initrd | gzip -dc | cpio -idv --no-absolute-filenames "root/fumagician*"
if you don't want to go through the mounting and extracting everything.


For me `isoinfo` doesn't like the format of the .iso file ("CD-ROM is NOT in ISO 9660 format"). Using bsdtar worked though:

  bsdtar xOf xxx.iso initrd | gzip -dc | cpio -idv --no-absolute-filenames "root/fumagician*"
(And of course then running `root/fumagician/fumagician` in either case.)


This worked for me, thanks! I have a Samsung SSD 980 PRO 2TB which I upgraded from 2B2QGXA7 to 5B2QGXA7. I backed up and also made a copy of the output of `sudo smartctl -t long -a /dev/nvme0` before and after.


Also seems to have worked for me doing 3B2QGXA7→5B2QGXA7 in NixOS. Extract ISO, extract initrd, run fumagician, reboot.


I think I managed to upgrade the firmware from within Kubuntu (it wasn't the OS Nvme) by using this method

https://blog.quindorian.org/2021/05/firmware-update-samsung-...

I've done it some months ago, so I don't remember if it was that one exactly.

I'm now relieved to know that 5B2QGXA7 is still the current one.


an example:

    nvme fw-log /dev/nvme0
    nvme id-ctrl /dev/nvme0 -H | grep Firmware
    nvme fw-download -f firmware.ebin /dev/nvme0
    nvme fw-commit /dev/nvme0 -s 2 -a 3
    nvme fw-log /dev/nvme0

In an unlikely event, may need to change the slot (-s)


Following from my linux desktop. I lost a system SSD that was a 980 2 TB and recently reinstalled everything, thinking it was a fluke. Now worried it will happen again rapidly.


I just put the same drive in my PS5 in December. None of my Windows machines have spare NVMe slots, so updating this should be interesting


boot from Windows on a SATA drive?


From TFA it's not just the 980 Pro 2 TB but also all the newer 990, so it's problematic.


The article switches from the 980 issue to the 990 issue in a bit of an unclear way, but I think they're independent problems and the firmware update should fix the 980 one?


nvme drives can be updated from the command line. I've done it myself.

Extracting the actual firmware update from the files Samsung gives you might be an issue though.


nvme-cli can, in principle, upload firmware. I’ve never personally tried it.


every machine I have that can fit 2 SSDs (basically, everything except the very slim laptops) I have converted over to running a ZFS mirror as its root filesystem. NixOS makes this very easy to do because the grub.mirroredBoots option [0] removes the need for a separate "bootpool" with limited ZFS feature flags.

and crucially, I always make sure they're 2 drives from different manufacturers, so that a bug of this nature should never be able to take down both drives in a pool simultaneously.

I think of this as the "if you're going to go to the trouble of wearing a belt and suspenders, make sure to buy them from separate brands" principle.

0: https://search.nixos.org/options?channel=22.11&show=boot.loa...


Just a note as a happy customer of Puget Systems that my experience working with them has been excellent, they really seem to be expert in their field, and have been for many years.

Also, their submerged in mineral oil aquarium computer was really cool, back in the day.


I confirm. 2 Puget System workstations in this house. Just opened both of them up to add more hard drives for games and games work storage, and the cabling is so lovely. In my new gaming box I have a Samsung 980 Pro 1TB which according to this note is unaffected, and I couldn't find it on my motherboard. I created a support ticket and the support immediately responded with very clear explanation (it's under its own heatsink under a humongous heatsink/fan for the CPU. Duh!


I use my Puget for build performance for a large mono-repo at Adobe. I could set my watch to the consistency in memory and CPU usage during a full build of our products.


This posting came two months late for me.

My 980 2TB crossed the river styx over the holiday break. Failure mode exactly as described. Nice Christmas present for me. Took 3 weeks to get the warranty replacement from Samsung.


I wasn't aware the 980 Pro 2TB was also affected; I have four of those in a new machine I put together last year.

Time to install some bloatware and see about updating their firmwares, I guess...


Samsung 870 EVO drives were also known to fail early including my 2TB model.

https://www.techpowerup.com/forums/threads/samsung-870-evo-b...


Yes, this bit me just last month. Around October 2022 I purchased 3 Samsung 870 EVO 2TBs for use in a RAID array. By January 2023, all three of them failed within a week of each other!

Fortunately they failed one by one, so I was barely was able to recover my RAID array by pulling out one drive at a time, powering the computer off, and waiting for the RMA replacement to arrive.

But imagine my shock to see one drive fail... only to replace it with an RMA... and then days later, seeing the next drive fail... and the next!


I just had a 2TB 870 EVO fail too! There were SMART errors about uncorrectable errors, and I saw CRC errors in the OS. I lost some data.

This issue is all over the Internet, yet Samsung would not acknowledge it was a known issue. They also refused my RMA because the corner of one of the plastic port guides was chipped - we're talking about a miniscule chip, barely visible with the human eye. So despite the fact this drive was obviously defective, Samsung won't replace it. So I'm down £350 (£225 for the original drive, and £125 for the Crucial I had to buy to replace it), through no fault of my own.

I'm not buying Samsung SSDs again - problems are one thing, but how you deal with them is paramount.


when equipping a storage system, it does make sense to use disks from different manufacturers, different models, and different vintages (manufacturing batches).

equipping a storage system with disks all of the same make, model, and vintage is invoking the statistics gods to strike failure all at once (or close enough that you won't be able to keep up with the rate of failure and time to rebuild)

personal experience: attempting to rescue a failing 192 disk system containing disks all of the same make and model. wearisome.


A perfect example why RAID ain’t backup


Indeed! Fortunately the RAID was also backed-up offsite. But, the entire process was shocking in several ways.

At least Samsung was fairly speedy with the RMAs and it was basically no-questions-asked... because I imagine they're getting tons of these mailed back to them.


Did they require the failed drives to be returned? Did the secure erase function work on a failed drive? I guess this is a good reason for full disk encryption even on machines where I'm satisfied with their physical security.


Yes, they sent a prepaid label to return the bad drives. I did a secure erase both via Linux command line and via their proprietary software and it seemed to work. The whole process took about 3 weeks per drive.


I panicked for a second until I realized mine is a 860 2TB EVO and I bought it back in 2019. It still has zero block and other errors after 15k hours according to smart.

> seems to primarily affect drives produced in January/February 2021

That is interesting, I wonder if it's another one of those cases where the supply chain shortages forced them into respinning the boards with some slightly out of spec parts. It's certainly been a major problem for anyone making PCBs.


Yep, my 2TB also got hit by this - did an RMA in November

The replacement drive they sent me has behaved itself so far, touch wood


There is a chinese youtuber doing SSD durability tests on a few SSDs of different vendor. And one of the tested ssd is Samsung 980 pro.

What is the funny thing? The Samsung 980 died before the wear test even start.

https://youtu.be/tXYQZHz7u3w?t=898


The real problem is OS:es that write to disk for no good reason.

Windows 10 writes 100KB/s constantly.

That should be illegal.


Try running Windows 10 off of an older spinning laptop drive. It can take upwards of 40 minutes to display the desktop on the first boot after Windows Update runs. Even in normal operation those constant low level writes leave barely any breathing room for your actual applications. Full size hard drives do a bit better, but even then it can be pretty painful when the drive indexing service kicks off or .NET is updated.


Windows 10 on an SSD feels like Windows 7 on a spinning disk. Microsoft has wiped out the gains we got from SSDs.


This doesn't actually matter in a practical sense. Assuming 24/7, it's 3TB a year. Which is ~1% drive endurance.

Also, if you are worried about overwriting the same files over and over, it also doesn't matter. Block device addresses are not physical addresses, controller maps them to wear the drive evenly.


On the other hand, lots of tiny writes scattered all over will tend to produce much higher write amplification than large sequential writes. So you'll get more actual wear to the drive from the 3TB of constant background churn than if you copied in 3TB of movies.


Those writes would have to be significantly smaller than the SSD's page (sector) size which is 512 bytes or 4 KiB. And would have to be written to different pages in rapid succession (to be flushed apart) - a standard serial write wouldn't trigger this even if it's 1 byte at a time, the OS FS cache would buffer it.

It would have to be very misbehaving software or deliberate sabotage.


The logical block size presented by SSDs to the host system is 512 bytes or rarely 4kB. But the native page size of the flash memory itself is usually more like 16kB, and the erase block size is several MB at a minimum. Those larger sizes are why random writes (and especially random overwrites) can cause high write amplification within the SSD: because what looks like a series of single-sector writes to the host will at a minimum cause fragmentation within the SSD, and can easily cause large read-modify-write operations within the SSD.

Normally, SSDs and operating systems both use aggressive caching to combine writes. That's the only way a drive can turn in extremely high random write benchmark numbers. Consumer SSDs do this caching even though they do not have power loss protection capacitors to ensure that data cached in volatile SRAM will be flushed to the flash in an emergency. But it wouldn't be smart for the caching to wait forever for more writes to combine with a sub-page write, which is why I'd be concerned that a slow and steady trickle of write activity may be able to cause serious real write amplification.


I’m pretty sure SSDs can only do 4kib aligned writes regardless of the FS sector size (under the hood it’s a write amplification unless the OS or controller manage to coalesce them. But yea, it depends on how things are getting flushed, but generally I wouldn’t expect too much magic unless you get lucky. It sounds like a small bug in the OS (ie these kinds of wires should be matched in memory in the application).


Almost all SSDs internally track allocations in a 4kB granularity. That size is what leads to the convention of equipping the drive with 1GB of DRAM for every 1TB of NAND flash, when the drive is designed to hold the entire table of logical to physical address mappings in DRAM.

It's now common for consumer SSDs to have less DRAM than the normal 1GB per 1TB ratio, but they run their FTL with the same 4kB granularity and just don't have the full lookup table in RAM. There are at least a handful of special-purpose enterprise drives that use a larger sector size in their FTL, such as the 32kB used by WD's SN340: https://www.anandtech.com/show/14723


I do wonder if perhaps the good NVME SSD controllers come with magic. It would take a single instance of malware ruining SSD's with 4000x write amplification to taint some brands while aiding the marketing of others.


I thought some of them even do 8KB. I’ve seen ZFS tips that claim you should use 8KB blocks on things like an 850 Pro.


Not familiar with that. I know QLC disks have a block size of 64kib.


Except "evenly" is not a standard, or something that anyone other than the manufacturer can verify, it's hidden in the firmware so we have no idea really.


Do we know the ___location of those writes, perhaps they can be redirected to a ramdisk?


Sysinternals tools will show up the writes and what is causing them.

I've dug down and found random things doing dumb stuff in the past. Verbose logging turned on by default for some services, for example.


management/windows logs about >100 active logs. Performance/data collector set about >50 active Event Trace Sessions running.


browsers are much worse, both chrome based and firefox


Is there any way to update my SSD on Windows without downloading Samsung Magician? I have the Samsung Evo across multiple systems and my Linux ones are okay but my Windows machines aren't okay unfortunately, they have the bad firmware.


Tangent: what's the cheapest 4TB+ PCIe SSD without significant known issues but with hardware encryption? The SN850x seems to have its own issues, and beyond these everything else seems so expensive.


Is hardware encryption a widespread feature?

What are the benefits of it for you, compared to OS-level full disk encryption?


> The SN850x seems to have its own issues

Such as?



I have a puget system with the Samsung SSDs mentioned and it locks up on me every 2-4 weeks. This sounds like it would explain the problem. Puget sent out a message this week to upgrade the Samsung firmware, but I am at the latest. I’ll be contacting Puget support on Monday so I’m on their radar.

I will say I love my Puget, its performance has been killer other than these lockups. And I’ve heard only good things about Puget support. I should have reported this months ago, but it’s just now that I’m doing some critical work on Windows that it’s affecting me.


Would like to know more. Were failures in the field from wear-out or sudden death? Are the health indicators losing 1% per week consistent with the datasheet TBW, or worse?


We have Lenovo laptops at work with M.2’s that are OEM-branded Samsungs.

One bricked itself in to read only mode after a few months.

The other has been losing 1% health each week or so. I caught it losing 2% in just two days recently.

These drives are older than the 990 model mentioned in the article but I have my suspicions anyway they’re dud drives.

Nothing lost except time - they can be swapped under warranty. But I used to buy Intel exclusively before swapping to Samsung when Intel started selling rebranded drives.

I guess the search for a reliable vendor starts again…


These anecdotes are pretty frustrating without the other key piece of information. For the given lifetime indicators, how many writes were served? Are they wearing out faster than their TBW claims, or are they being written more than you expected?


It’s at 86% with 12.21TB written. Total power on time 68 days. Drive temp sits around 45 degrees celsius.

I’m not paying great attention to all the SMART counters day by day.

It’s for a dev workload so like … compiling code and stuff? I have the exact same workload on my desktop PC and its Samsung drive health is 99% after … years.


OK, so close to 1% lifetime per TBW, or lifetime approximately 100TBW. Thanks! That's consistent with their endurance claims for the smallest SSDs (128GB PM991 for example, or 256GB 960 Evo) but it would be poor for a larger one.


Cool well these are 2TB drives so…


Even Crucial are having tons of issues with current MX500 models.

I think SKHynix might be the last as they supply a lot of OEM over the years. But their consumer base is small so we don't have a huge sample size like Samsungs.


I can live with my batch of 980pros and 870 EVOs failing. Bad batches happen to every manufacturer.

However this seems like a systematic problem at Samsung.

What really stinks is that I've been recommending Samsung EVOs and PROs to relatives and friends for some years.

If I want to remain honest with myself I have to contact every one of them and have to run the SMART tests.

So now there are NO reputable SSD manufacturers left. Only reliable SSDs are MLCs from mid 201Xs especially Intel and Samsung.


Ask anyone with a circa-2014-ish Macbook Pro about Samsung SSD reliability.

The samsung-made drives lasted about 5-6 years. Everything seemed fine, and then one day you'd get a spinning pizza of death, power down, power it back on...and your SSD was...completely gone. Doesn't even enumerate on the PCIe bus. It's just gone.

Screw the SSD chipset manufacturers for not making sure that their controllers can at least a)still show up on the bus b)be read-only in some sort of recovery mode.


Not sure why downvoted.

A recovery mode seems absolutely possible and I don’t care if it’s 100kbit over SPI. It’s better than losing everything.


Ars had a piece covering this as well, and I do wonder if there is something going on somewhere else in the Samsung stack, not just the NVMe 900 series line. Pure anecdote, but two years ago I did a NAS for a client using 24x 2TB Samsung 870 Evo drives (they'd gotten some incentive deal for it). While it was all one type vs mixed, there was the "luxury" of time because at that point getting the system they wanted together had a significant lead time. So I did ensure that the drives were purchased over the course of around 7 months, from multiple different reputable sellers (B&H, CDW, Provantage etc) in separate batches. System was solid, an Epyc 2 based SuperMicro server, running TrueNAS.

And then last year with around 5500-7500 quite light hours of runtime (primarily reads, ~0.08 DWPD, well under official rating of 0.3 DWPD) drives started failing. These were definitely real failures, first indication came from regular automated ZFS scrubs and reporting increasing checksum errors and ATA errors. It was for so many drives and I'd always considered Samsung SSDs relatively reliable (even for consumer ones) that at first I thought it was a SATA controller failure, and our rep agreed and warranties back the server. They were great, gold plated support contracts pay off once in a while, and motherboard replacement and thorough testing later back in service. More drive problems. SMART short tests said everything was healthy, first longs did too. But then drives exceeded error limits and started getting faulted, and at last SMART long tests started failing. Digging in showed worrisome stats. So began swapping out and warrantying drives (cheers to the stress test to TrueNAS, in the end zero downtime or need to restore from backups). In the end, THIRTEEN (13) out of 24 failed. Brutal >50% dead drive rate. I talked to some others around and they'd seen <1 year rates also at 30-60%. Big :\. Rep also indicated they were hearing more about Samsung failures.

Anyway, it gave me a talking point going forward to really, really press management on "it's worth paying for drives from 3-4 brands, and maybe splurging for higher-rated vs consumer too", but it also did make me wonder if something is going on, or was (pandemic related?), at Samsung's storage division. It's definitely pure anecdote, but still: I spread those drive purchases out pretty aggressively, and they had radically different serial numbers. Same with other folks I know at other businesses using various Samsung drives; everyone has been going to real effort, following decent practices, to avoid buying drives all from a single lot. Even a 10% failure rate for consumer drives I could have seen, but 54%? And not a bathtub curve front-loaded in the first month or two, but after 7-11 months? That feels high. Samsung did replace them all no questions asked, and they paid for shipping too. I don't have any global insight into how this all looks, and it could be just plain bad luck for all of us in the region, but still.
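
For reference, the alerting that first flagged the drives was conceptually nothing more than this kind of scripted check (a sketch with made-up device names, assuming the usual zfs + smartmontools tooling is on the box):

    #!/usr/bin/env python3
    # Periodic check: complain if any pool is unhealthy after a scrub, or if a
    # drive fails its SMART overall-health self-assessment.
    import subprocess

    DRIVES = [f"/dev/da{i}" for i in range(24)]   # hypothetical device names

    # `zpool status -x` prints "all pools are healthy" when nothing is wrong
    pools = subprocess.run(["zpool", "status", "-x"],
                           capture_output=True, text=True).stdout.strip()
    if pools != "all pools are healthy":
        print("ZFS reports a problem:\n" + pools)

    for dev in DRIVES:
        health = subprocess.run(["smartctl", "-H", dev],
                                capture_output=True, text=True).stdout
        if "PASSED" not in health:
            print(f"{dev}: SMART overall-health check did not pass")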


> Anyway, it gave me a talking point going forward to really, really press management on "it's worth paying for drives from 3-4 brands, and maybe splurging for higher-rated vs consumer too", but it also did make me wonder if something is going on, or was (pandemic related?), at Samsung's storage division.

What's going on at Samsung is the same thing that's always been going on at Samsung: they use their own flash and their own controllers and they have their own sets of problems as a result. It was a selling point in the early days of Sandforce and other turds (leading to things like the OCZ Vertex series) but now the commodity market has caught up and Samsung doing their own thing is kind of a negative. Like yeah it's fine as long as they don't screw up, but they're screwing up a lot more than they used to.

I don't see any direct correlation between the various failures that have occurred over the years. The 840 Evo had something wrong with the NAND flash that caused cells to lose charge over time (leading to data loss), so they put out new firmware that just continuously rewrote old data in a tape-loop sort of deal (lol) so it never aged out. I don't count that as a controller flaw; that's a flash flaw that was fixed with a firmware patch.

870 Evo, 970 Evo, 970 Evo Plus, and 980 have all been accused of having problems over the years, in addition to the 980 Pro and now 990 Pro, but there's actually a pretty good variety of controller models as well as different flash types (from 64-layer to 236-layer) there. It's hard to know how much firmware they all share though... or whether it was more flash problems in certain batches, or what. But overall Samsung certainly has had a lot of failures in recent years and I think it all really comes back to the fact that they're using their own controllers and their own flash, while everyone else is pretty much commodity at this point... meaning they get their own bugs too.

https://docs.google.com/spreadsheets/d/1B27_j9NDPU3cNlj2HKcr...


If you're using Linux, you can follow these instructions to update the firmware:

https://wiki.archlinux.org/title/Solid_state_drive#Update_un...
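
However you apply the update (fwupd is one common route, and Samsung also ships a bootable ISO updater), a quick way to confirm it actually took is to compare the reported firmware revision before and after. A minimal sketch, assuming smartmontools 7+ for the JSON output flag:

    #!/usr/bin/env python3
    # Print the firmware revision of an NVMe drive, e.g. before and after
    # running `fwupdmgr refresh && fwupdmgr update`.
    import json
    import subprocess

    DEV = "/dev/nvme0"

    out = subprocess.run(["smartctl", "-i", "-j", DEV],
                         capture_output=True, text=True)
    info = json.loads(out.stdout)
    print("model   :", info.get("model_name"))
    print("firmware:", info.get("firmware_version"))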


> high failure rates in the field

Yikes. A lot of people were hoping that the high wear was just a reporting artifact.


I've had 2x Samsung 980 Pro 2TB fail in as many years. Last time I buy Samsung.


My Samsung 980s (1TB, so not affected per the article) are still going strong, but Samsung has been dropping the ball across the board for me in recent years. I had a bad experience with their Frame TV last year, their fridges are a nightmare to fix, and I'm now hearing bad things about their washing machines and their SSDs. I know these divisions are not related, but it's not a good look.


Samsung Magician's SMART status reports 0 'media errors' on my 980 2TB. Is that suspiciously low, or does it mean things are actually OK? Is there a different tool I should check with?


Anyone know if the failure mode is sudden total bricking or something more recoverable?


There is always the possibility that their S.M.A.R.T. implementation is borked...


The article does say they've seen "abnormally high failure rates in the field", so it's not just that.


If that's all it was, then it's likely a firmware update would not only prevent the issue, but also reverse it if the storage is actually healthy. That doesn't seem to be the case here, though.


Keep in mind that one of the things SSD firmware does is deliberately write-lock the drive if the media is too worn to erase reliably. So buggy firmware overestimating media wear is also likely to produce a failed drive.
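
The wear numbers the firmware bases that decision on are visible from the host side, so you can at least sanity-check them against how much has actually been written. A rough sketch, assuming smartmontools 7+ and an NVMe drive at /dev/nvme0:

    #!/usr/bin/env python3
    # Dump the wear-related fields from the NVMe health log. If "percentage
    # used" looks wildly out of line with the data actually written, suspect
    # the firmware's accounting rather than the NAND itself.
    import json
    import subprocess

    DEV = "/dev/nvme0"

    out = subprocess.run(["smartctl", "-j", "-A", DEV],
                         capture_output=True, text=True)
    log = json.loads(out.stdout).get("nvme_smart_health_information_log", {})

    # NVMe reports data units in multiples of 1000 * 512 bytes
    tb_written = log.get("data_units_written", 0) * 512_000 / 1e12
    print("percentage used :", log.get("percentage_used"))
    print("available spare :", log.get("available_spare"))
    print("TB written      :", round(tb_written, 2))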


Wouldn't be surprised if this is another pandemic supply chain casualty.


Sounds more like crappy firmware to me. This is not the first SSD to suffer from a crippling firmware flaw that is impossible to fix.

I lost almost an entire computer lab of Dells thanks to the goddamn Sandforce firmware, a flaw the company acknowledged but refused to lift a finger to fix. Luckily it is possible to fix these yourself despite the vendor's hostility towards the repair. Look how easy it is: https://computerlounge.it/how-to-unbrick-sandforce-ssd/


Had issues with a 2TB 980 Pro. Things have stabilized with recent updates.


I really want to see SSD manufacturers offer a decent warranty...

This drive costs $100, and will last 10 years or until 100TB has been written to it, as long as you keep it within the specified temperature/humidity/power conditions.

If it fails to do that, we will return $1000 to you.


This sounds like an SLA; it's very unlikely you'll get that for 100 bucks. Even if this manufacturer somehow perfected their process and had zero defects, they would still be taking on a 10-year liability for 100 dollars of revenue.


APC sells surge protectors with equipment protection insurance for less than $100. Apparently it's possible, even for products sold at $100.


Well, they sure market this, but it seems pretty difficult to collect. For the insurance to apply you have to register within 10 days of purchase with a list of what will be plugged into the surge protector. In case of damage, you have to pay to ship the surge protector to APC. If they determine it was damaged by power line transients, then you need to ship all damaged equipment somewhere to be evaluated. If it's determined that the equipment was damaged by power line transients, then APC chooses whether they'll pay for repairs or pay out the current fair market value.


Sure, it should be possible to sell you insurance like that.


In theory, third-party insurance equivalent to AppleCare could be constructed for some technology products, but this is hampered by short product lifecycles, lack of BOM transparency (e.g. components changed within a single product generation), and the ability of firmware updates to change product behavior and invalidate previously collected reliability data.

Open-source SSD firmware would provide more transparency on performance and reliability.


> Open-source SSD firmware would provide more transparency on performance and reliability.

This seems fantastic. Are you saying you could review the firmware source and know that the 980 Pro would lose ~1% of its endurance per week?


Lifetime warranties used to be commonplace. I wish we could return to those times, or at least to a time of repairability.


Lifetime warranty on a consumable product (SSDs have a limited number of writes) doesn't seem reasonable.


True. In my perfect world I would settle for a trade-in program: you would get some value for the failed unit so that you can upgrade, and the OEM could recycle the raw materials. If we are ever going to live in a sustainable society, we are going to need repairability and recycling programs for all consumer products.


Using up the flash doesn't hurt the controller, though. The controller still knows how much writing it's done even if the flash itself is toast; it's a totally different part of the drive.

And even then, you could construct the controller so that it burns e-fuses to record lifespan, with the fuses readable over JTAG. Short of complete controller death or lightning-strike-level surges (which you could legitimately argue are abuse and not warrantable), you could make that counter readable offline from an external device.

https://en.wikipedia.org/wiki/EFuse

The problems here are primarily economic/social, not technical. Companies don't want to hold warranty liability on their books for 10+ years, but they also don't really want to accept returns for defective products or other things either, and we make them do it anyway.

The EU is already pushing warranties to a minimum of two years for exactly this reason. Could it be 5 years, or 10 years? Sure, why not.

Companies will scream in the short term, of course. It's cheaper for them to push out crap that'll die and be in the trash in 3 years. Engineering products for longer lifespans would be a shift in engineering/design mindset. It probably would also push minimum device costs upwards at least a little bit, but that's not a bad thing either: the slogan is "reduce, reuse, recycle", in that order, and "reduce" there simply means buying less, or buying things that last longer. A shift away from planned obsolescence isn't the worst thing culturally; we don't want to encourage design-for-disposability.

Especially as Moore's Law slows, hardware stays relevant for longer and longer periods of time. For example, a lot of people are finding that their GPUs are dying before they're actually irrelevant as hardware. It's not just NVIDIA who had bumpgate; a ton of hardware from that era failed over time due to faulty solder and probably could have been fixed with an hour of a tech's work.

Even worse, they're often dropped from support. There's really nothing wrong with an R9 290X as a GPU, but AMD won't support it with software anymore, despite the fact that it basically works anyway and it's pretty much purely a software lockout (which third parties have hacked and bypassed), because they want you to buy the new one. Wouldn't it be nice if GPUs were just expected to work for 10 years from purchase, and that was covered by warranty and software support?

There are an increasing number of people who do hang onto hardware for 5-10 years because the relevant lifespan is getting longer and longer, and we should encourage that and require companies to support those consumption patterns. Just like not gluing together phones to make the battery irreplaceable, we really should be making sure electronics bumpouts don't fail in 3-5 years and that companies don't dump-and-run on the software.

Routers are another area where the software support is just egregious. How many rando Linksys or TP-Link boxes actually get an update when a bunch of new vulnerabilities in WPA or whatever are discovered? Not that many, and "just install OpenWRT" is not a society-level answer, especially when companies are locking down hardware.


It also used to be the case that a computer was basically ewaste within two or three years because a new one would be ten times faster.


Growing up poor, I was always a few generations behind, rocking a 486 DX2 when the PII and PIII were the latest and greatest, and a 33kbps modem when others had 56k. When I was 10-ish, my older brother and I would go to the thrift store and dig through the computer parts; it was an adventure.


> When I was 10-ish, my older brother and I would go to the thrift store and dig through the computer parts; it was an adventure.

I miss that too. Thrift stores suck now: they're pulling all the good clothes out and selling them to upcyclers, and pulling all the cool electronics, cameras, and other stuff and selling them on ShopGoodwill and eBay.

And ShopGoodwill is pretty absurd; almost everything is sold as-is and uninspected, and prices are just as high as eBay, if not sometimes higher.

The days of wandering through a goodwill and finding some neat stuff at a bargain price are gone now, unfortunately.


Back when HDDs used to fail a lot, warranties actually worked. I'd happily fill out an online form, Web 1.0 style, then send in my Seagate disks (I'm in Europe; I was sending them to the Netherlands IIRC), and a few weeks later I'd receive a new drive.

I probably still have a few screenshots of these forms somewhere.


I am not sure why you want a 10x refund, but it seems like your request is easily met by current warranties. A 1TB WD SN850X advertises 1200 TBW of endurance, rather more than you require.


https://www.law.cornell.edu/wex/punitive_damages

Seems clear the idea is to make sure that companies err well on the side of lifespan rather than designing something that fails a month after the warranty expires. Because if they're cutting it close, a decent number of units are going to fall under the warranty line and they'll be liable.

Even if a company is required to stand behind the product, a lot of consumers won't pursue it if it's not perceived to be worth the trouble. Do you care about the 120GB drive you bought in 2012? Not really. Do you care if you can get 10x the original ($1/GB) purchase price for it? Sure, $1200 is worth my trouble.

As they say - "A times B times C, if that's less than X, the cost of a recall, we don't do one".

I'm not OP and am not gonna die on this hill as a point of policy, but if 9/10 consumers just shrug their shoulders, accept that their 8-year-old drive has failed, and throw it in the garbage, that's still a bad thing at a society-wide level, where you want people to be using hardware for longer and longer periods of time. Especially as Moore's Law tapers down even further and hardware stays relevant for longer and longer periods of time - an R9 290X is still a pretty nice piece of hardware!

Michigan used to do something very similar with checkout price scanners - if the price coded in the system was more than advertised, you got 10 times the difference up to a limit. And the point was to get retailers to pay fucking attention because a 50 cent pricing error on a can of chili could cost them 5 bucks. Punitive damages, with citizens who spot the violations receiving the bounty.

https://www.canr.msu.edu/news/michigan_changed_item_pricing_...


The SN850X seems to have its own issues from what I've read (just Google it).


Perhaps an insurance agent can craft a policy to do that for you.

Failing that, maybe a bookmaker.


HPE does that for enterprise disks. But it ain’t free!



