Hacker News new | past | comments | ask | show | jobs | submit login
Update on Samsung SSD Reliability (pugetsystems.com)
487 points by Akharin on Feb 3, 2023 | hide | past | favorite | 234 comments



Apparently a few months ago it became known on the Chinese internet that the 980 Pro, 970 Evo Plus with new controller, and OEM versions are prone to getting unreadable sectors, where SMART 'Media and Data Integrity Errors' increases on every read attempt.

https://www.reddit.com/r/buildapc/comments/x82mwe/samsung_ss... https://www.reddit.com/r/DataHoarder/comments/x8arle/psa_sam...

How I came across this: Ran into this last week(!) on a 6-month old drive -- but I'm not in China....hmm. Not just one bad batch? Interestingly, it's non deterministic - the data is backed up but trying ddrescue, it occasionally succeeds at reading a few kilobytes from the 5 MB of several runs of 512-16384 bytes that can't be read or written. Curious to see what happens with a firmware update and secure erase.


PS: I'm one of the victim with a 970 Evo Plus. The company that provided aftersell services, Lobcom, did not want to provide any RMA services and claimed nothing wrong is found.

The scamming company in question: https://zh.lobcomgroup.com/


My anecdata:

tl;dr: All 3 of my Samsung M.2 NVMe SSDs have failed in less than 3 years. 100% failure rate.

My first SSD was a 1TB Samsung 970 EVO. It failed after 2 years and 8 months. It was replaced under warranty with a 1TB 970 EVO Plus.

That replacement has now also failed after 1 year and 9 months.

I bought a 2nd 1TB 970 EVO Plus in May 2019. It has now also failed (2 years and 7 months).

Both are expected to be replaced under warranty.

The 2 970 EVO Plus SSDs clearly had hardware errors (that were not accurately reflected in SMART data) that caused everything from system hangs, game crashes to file corruption on OTHER drives. I couldn't believe it at first but after 5 days of testing and trial and error, I had it confirmed. As soon as I removed those SSDs, my PC was completely stable again.

In the meantime, I have bought a Kingston KC3000 1TB drive as I no longer trust Samsung M.2 NVMe SSDs. On the other hand, I have a Samsung EVO 850 SATA drive which has been rock-solid.


My anecdata, I have been running 4x 500GB Samsung 850 EVOs in Raid 0 continuously without failures since early 2015.


The article mentions issues with the 900-series drives. It seems like the 800-series are still rock solid (also been running them for s few years now without issue)


Unfortunately there have been recent issues with the 870 EVO series also: https://www.techpowerup.com/forums/threads/samsung-870-evo-b...

There may be multiple, different issues with Samsung parts at play here. The 900 series issues seem to have been addressed with a f/w update; the 870 EVO issues were - allegedly - caused by bad NAND and the devices needed to be replaced.

ofc part of the problem here is the lack of public acknowledgement / information from Samsung on these issues.


Similarly my M.2 NVMe 950 pro has been in an always on machine that gets a ton of use since 2016.


The parent posts mentioned 970 and 980, not 850.


Is it possible that your motherboard or PSU is killing the drives?

Could also just be sheer chance, of course.


How does this happen? Got any background info?


Poor voltage regulation from the motherboard or power supply could glitch the controller of the drive causing I/O errors or failures.


As an example, an old Asus board of mine has trouble with modern m2 drives. A PICe m2 adapter solved the problem and the Samsung ssd worked without issues thereafter.


I've bought 6-8 m.2 Samsung 970 EVO Plus and 980s since 2018, and none have failed to date.

Anecdata is the worst, I'm sorry to hear about this happening to you. It's surely frustrating and upsetting.


Worth checking if you have any thermal issues with it. Mine failed in a similar way due to presumably a rookie mistake of forgetting to remove the thermal pad tape on the mobo.


It's not likely that thermal issues would cause bad reliability on these things. At worst you could expect intermittently bad performance. You can check for this condition with `nvme smart-log`. If your device was often overheated, it would have "critical composite temperature time" non-zero. My Samsung that has been in service for years and has no thermal solution has a value of 1 minute and I happen to know that is because I heated it with a hair dryer to find out what would happen if it crossed the critical temperature.


"I happen to know that is because I heated it with a hair dryer to find out what would happen if it crossed the critical temperature."

Ah this is a fantastic and true hacker mindset :)

Willing to tamper with fairly expensive equipment just for the heck of it.


Ha, interesting! Makes sense, the drive is supposed to just throttle itself before it can reach unsafe temps. I’ll def try to check, didn’t know the drive recorded that - thanks for the tip. In any case, now I know RMA is in order


The controller is less thick than the NAND flash so don't make proper contact with the thermal pad. I just discovered mine is affected by this. After heeavy reading the controller is at 67C while the NAND is at 42C.

https://www.youtube.com/watch?v=I8Z09nU554Q


Hmm, that still seems like it should be ok. Tjmax is usually over 100C (though for NANDs they recommend 70C I think)


My anecdata, I've had 5 Samsung SSDs and they've all performed great.

I'd point the finger at your PSU or motherboard. That's way too many failures for it to be the SSDs.

Samsung couldn't stay in business if that was a normal failure rate.


> that caused everything from system hangs, game crashes to file corruption on OTHER drives.

Interesting. Maybe my M2 (WD 570) is the cause for the hangs in my system. Thank you very much!


I can second EVO 850 SATA. Mine has been rock-solid since 2015.


My anecdata, I have a 840 Pro, 850, 850 EVO, 970 and 980 Pro, all still running for years


My 980 pro failed witihn two months of purchasing it in late 2022


I wonder if Qvo are still subject to the same issues.


Hmm I'm going to need to check my Samsung ssd from oct 2021 that failed the first week of Jan 2023. I had started noticing some quirks in spring 2022 but it wasn't a super important drive so I ignored it.


I have similar issue. It started failing mid last year. Then it got more and more frequent toward the end of the year. Last month I got tired of reinstalling OS for the 4th time and got a new system.


In 2021 I bought at least ten 870 EVO 4 TB SATA and six 980 Pro 2 TB NVMe. All devices failed within 6 months on barely used systems. Find some smart data here:

<https://thomas.glanzmann.de/samsung/>

The pattern is always the same: I have them configured in a raid 1. Once a month debian does a raid check. During the raid check Debian reads all data from both devices. I get uncorrectable read errors. I no longer use Samsung SSDs and replaced them them with SSDSC2KB076T8, Micron SSDs and KC3000 Kingston NVMes. No failures since then. In 2021 I told a friend of mine about the issue. He also had a 870 EVO, issued a dd if=/dev/sdX of=/dev/null bs=8M and guess what, he got uncorrectable read errors. Due to running them in RAID 1 I caught the issue early and I had no data loss or downtime because the Linux software raid compensated for the bad hardware. However I replaced them in a hurry because I no longer trust Samsung SSDs. As you can see from the smart log they're barely used. Less than 4 months in service and 10 TB written. I also got uncorrectable read errors when evacuating data from the devices.


Similar experience for me. Four 2TB 980pro in RAID-Z2. Since introducing the drives in August 2021 I've had to replace them five times. I think none of the original SSDs are left. Only between 5 - 25TB written on average. Usually individual uncorrectable errors caught by ZFS, but one drive just straight up died. I keep a cold spare these days. On the bright side, if the cycle continues, I'll never run out of warranty from Samsung or the vendor.


> On the bright side, if the cycle continues, I'll never run out of warranty from Samsung or the vendor.

The warranty is from the date of sale of the original unit. Replacing one doesn't reset the warranty end date.


If this really becomes an issue due to repeated failure, lemon law activates for a refund.


I tested mine. 980 (non-Pro) 1TB with ~20TB read and ~13TB of written. No errors logged. SMART looks fine. No errors when issuing dd as suggested. Though sometimes I do get weird errors put I will put the blame on AMDGPU drivers and the mess that Gnome on Arch often is.


As a data point, the Linux kernel has a long list of workarounds for "ata" related devices (SSDs, HDDs, etc):

https://github.com/torvalds/linux/blob/69f2c9346313ba3d3dfa4...

Can be a bit eye opening to look down that and see equipment you're using listed. ;)


This seems to be the same thing for NVMe devices (eg Samsung, SK Hynix, Micron, Kingston, ADATA, Intel, (etc) NVMe drives):

https://github.com/torvalds/linux/blob/69f2c9346313ba3d3dfa4...

Not quite as easy to read and understand as the ata driver code though.

With the occasional further device specific workarounds in other parts of the code.

eg for specific Toshiba, LiteON, and Kioxia devices:

https://github.com/torvalds/linux/blob/69f2c9346313ba3d3dfa4...

While this seems to be special handling for Samsung X5 SSD external drives, and also Samsung 970 Evo Plus drives:

https://github.com/torvalds/linux/blob/69f2c9346313ba3d3dfa4...


What's horkage?


Brokenness.


For the same price, you can get twice the space for 1/4 the endurance, thrice the space for 1/8th the endurance, and now four times the space for 1/16th the endurance. Most people don't realise that is a horrible tradeoff, because NAND flash marketing and terminology like "TLC" or "QLC" is intentionally deceptive and manufacturers have been very secretive about the true endurance specifications, as well as trying to overprice SLC out of production. If more people knew the truth of what they were trying to do, we wouldn't be in this situation.


> as well as trying to overprice SLC out of production.

Is it even possible to buy SLC drives any more? For the past 5+ years the only outlet I've been able to find that even advertise SLC is https://www.delkin.com/, and you need to speak to sales to even get a price. I just assumed they and any other similar suppliers bought giant lots of chips at the tail end of SLC production and jack up the price on every new order as their supply dwindles. Or maybe they cobble together drives from the tiny SLC chips used for cache on modern SSDs?


Is it even possible to buy SLC drives any more?

Yes, small ones for industrial use. They're extremely expensive, however.

Looking at the raw NAND flash prices, SLC seems to still be around $4.60USD/GB, or roughly the same as it was over a decade ago, while MLC is already <$1USD/GB despite only a doubling in capacity. TLC and QLC seems to be down in the $0.10USD/GB. You can still buy raw SLC NAND flash in the smaller capacities of few GBs; this one is only 512MB, at the price mentioned above:

https://www.newark.com/micron/mt29f4g08abaeawp-it-e/flash-me...

If the pricing was sane, SLC drives would be only 4x more expensive as QLC ones for the same capacity, but that's not what we're seeing today.


Do you think the pricing difference could be due to economies of scale?


In theory an SSD could run TLC NAND in SLC mode with just a firmware change but I'm not aware of any such drives.


Enmotus partnered with Phison to produce a QLC SSD that mapped the first several GB of logical blocks to SLC. This was sold bundled with Enmotus's SSD caching software, but it could also be treated as simply having a SLC partition and a QLC partition that are largely independent.


Samsung 980 Pro, the original, does that when sufficiently empty, to offer around iirc 10% nominal capacity of SLC-mode write buffer (given they are TLC, that would be around 30% of the NAND).


Many TLC drives have SLC cache so if you limit usage of the drive you may effectively use it in SLC-only mode. This does not say anything about its endurance however. If you want endurance, better mirror your data to HDD.


Show me software that can mirror an SSD to HDD in realtime without the "backup" being affected by the slower write speeds. I was looking for a way to do this a few years ago, and couldn't find anything. I'd be very happy if I could pull my SSD out, change my boot drive, and boot into my existing OS without any issues. My understanding is that existing solutions for this delay write confirmation until both drives are complete, negating the speed advantage of the SSD.


Check out Syncthing. I personally run two instances of syncthing, synchronization happens very fast and as long as there isn't high write volumes it syncs pretty quickly. If you don't need realtime backup, rsync'ing in a loop would work.


If you want to use a drive as pure SLC, you're probably better off buying a QLC drive than a TLC drive. QLC drives are more reliant on SLC caching and tend to be more reluctant to migrate data from SLC to QLC when it isn't absolutely necessary.


For most people this isn't a horrible tradeoff. For example, my desktop which I use pretty heavily, I've averaged 24 GB writes per day. With 500 drive writes of a TLC drive, my ssd will last me roughly 50 years at current rate. If I had to chose between a 300 GB SSD that will last me 400 years or a 1TB ssd that lasts 50 years, I'll take the TB one any day of the week.


While NAND endurance has certainly gone down, FTLs got much better during the same time so that SSD endurance is still fine for most people. And if the stock endurance isn't enough, a little overprovisioning is probably better than dropping back to very expensive MLC.


If the number of firmware bugs in SSDs we've seen over time is any indication, I don't think things are really getting better...

SLC needs almost no FTL. 100K endurance. Very low raw error rate that can be handled with basic ECC.


Is it a horrible tradeoff though? I can think of many situations where that would be a somewhat compelling alternative to a spinning platter.


Can you explain your point further? Are you talking about competitors?


It isn't a bad tradeoff for most read workloads.


SSDs are complex beasts :-)

I've had an OCZ SSD in the past that also became read-only, in the sense that all changes after shutting down the computer were gone.

If you rebooted everything was fine, but as soon as the computer was shut down and the SSD also ran out of power, it converted to the state before turning on the computer. That was so bizarre. Once I had a hefty Windows upgrade installed and it was gone after the reboot - as if I had never installed it. It also took me a while to realize it because first of course you start to doubt your own memory, you don't realize that the sector- mapping on the SSD has suddenly become read-only

OCZ eventually solved this via an RMA and eventually OCZ also went upside down. The Time Warp bug it was called I think.


sounds ideal for a public terminal ;)


Marketing should make that a feature: hardware-guaranteed non-modifiable root filesystem! Better than Nix!


No updates! Ever!


Also, no malware!


There used to be a PCI module (iirc) that would assist in that. I believe it was called a Bo(u)rne Again module. It was used in school computers etc.


There was an IDE dongle which did this. It was at a school I was visiting (but not my regular school) and it was pretty cool because it allowed them to be pretty liberal with permissions on the computers since you couldn't really do anything terrible as a reboot would fix everything and they powered them off every night.


There was (is?) a software solution called Deep Freeze. They used it on computers at school when I was growing up.


I wish more independent review organizations would conduct destructive "write lifespan until ultimate failure" real world tests on SSDs. With a mixture of real world large contiguous files and small random writes.

Real ultimate write lifespan on 3-level-cell and QLC consumer grade SSDs varies wildly for things of the same capacity and similar price.

Such as this series of tests from 7 years ago: https://techreport.com/review/27909/the-ssd-endurance-experi...

It looks like the bar charts and other data in that URL are now broken, which is sad, because I recall reading it when it was first published and it shows some amazing differences between the drives that died first, and the ones that died last.

another similar: https://www.guru3d.com/news-story/endurance-test-of-samsung-...


I'd like to see this too!

The official line is that endurance should not matter for most people. For example the Samsung 990 Pro 4TB is rated for 2400TB TBW - which, if the drive has a service of 5 years, is 1.3TB of data written per day. The average user will need < 1% of that.

Where that falls down though of course is cases like this. The point of a review is to show when real-world performance doesn't match the marketing. Tech reviewers seem to be blindly trusting the marketing on this one. They're really dropping the ball.


What's even more important than durability is what the drive does when it runs out of write cycles.

They should just become read-only, but it seems that in the vast majority, the controller just shuts off and bricks the drive.


>destructive "write lifespan until ultimate failure" real world tests on SSDs

>from 7 years ago

It's from 7 years back for good reason. They stopped doing those tests when it became impractical as endurance increased. The drives are now good enough that you can't wear them out fast enough to make sense in a review setting

...unless fundamentally broken like these


The whole review industry just stopped scrutinizing SSDs several years ago, right around the time manufacturers started cutting features like power loss protection and DRAT/RZAT along with switching to TLC and QLC.

Funny how that worked out.


I find it highly improbable that you couldn't wear out a 3-level-cell or 4-level-cell consumer grade SSD which is capable of 300-500MB/s writes with a 24x7 automated test script in just a few months. Maybe even just a couple of weeks. The published total DWPD (drive-writes-per-day) endurance on these is not that great.

Even assuming a conservative 300MB per second, there's 86400 seconds in one day. That's 25920000 MB per day. Or close to 26TB per day. The samsung 960 Pro 2TB is rated by its manufacturer for a total 1200TB of write endurance lifespan.

Or at least leave it running for a couple of weeks and then see what the SMART-reported remaining write lifespan data reports it to be, versus the brand new out of box baseline.


>Or close to 26TB per day. The samsung 960 Pro 2TB is rated by its manufacturer for a total 1200TB of write endurance lifespan.

Right. So around a month and a half. In a world where hardware news drops simultaneous by multiple outlets literally within minutes of news embargoes being lifted.

That's a lot of time investment to get result that are boring AF ("we tested them. they work"). Have very little real life consumer relevance. And the manufacturer that sent you the review sample definitely doesn't want to see (focus on edge case negative).

>Or at least leave it running for a couple of weeks and then see what the SMART-reported remaining write lifespan data reports it to be

Yeah that would make a bit more sense. Run various units down to 95%. That said the resulting story would still have watch paint dry appeal only


> In a world where hardware news drops simultaneous by multiple outlets literally within minutes of news embargoes being lifted.

Any outlet interested in journalism rather purely PR can purchase a retail sample on release and publish an endurance report at a later date, however long it takes.

From my experience people buy storage whenever they have a need for something faster or larger, unlike CPUs and GPUs which have peak interest around their release dates. Storage is evergreen in that sense.

> That's a lot of time investment to get result that are boring AF ("we tested them. they work").

What's boring about that? This is extremely valuable information for any perspective buyer. Either way you gain reputation for being a trustworthy outlet that people can rely on for accurate information.

> Have very little real life consumer relevance.

I disagree and I think most consumers would be extremely interested in durability of their storage devices, especially since most of them rely on it in the absence of backups.

> And the manufacturer that sent you the review sample definitely doesn't want to see (focus on edge case negative).

That's hardly relevant. Informing potential customers about extremely serious flaws in the product is quite literally their job - at least if they wish to have any semblance of integrity, trustworthiness, and respect.

Many choose to sell out and simply echo the approved selling points they receive directly from the company, but not every outlet does this and it shouldn't be held up as something that tech journalists should aspire (or be allowed) to do.


> it became impractical as endurance increased

[citation needed]


I installed 2 x 980 Pro 2Tb in a laptop in Nov 2022. Running a daily Robocopy bat script to backup a folder in C: to D: would freeze a couple of times a week and lock the D: drive. After reboot, a drive check would find no errors and everything would work as normal. I've used the same script for years with no issues.

Since the firmware update last week Robocopy has not frozen the drive at all this week.


The freeze/reboot/fine cycle seems to be a common one for SSDs acting poorly, running out of blocks they want to use internally, or memory or cache or something, or just hanging in their own firmware for whatever other reason.

One of my earlier forays into switching to SSDs, I installed Intel... I think it was 525, 535, something like that, 2.5 inch SATA drives in several different machines. Every one has failed by now with this similar mode of (in)operation. On my desktop where I had one, it would simply bluescreen, but then come back fine until eventually reading certain parts of the disk would just always cause it to hang and it had to be replaced. Failed SSDs like this are interesting because Windows (and to a lesser extent Linux) really aren't prepared for the disk to just hang, so trying to recover anything off them can be a challenge.

Just recently I found out the last one I had around, in a little headless desktop server, was the cause of my problems with it where it would partially hang after a couple days of uptime. Having finally gotten around to having it hooked up to a display, I was treated to a sea of red dmesg errors from the disk.

I think ultimately part of the problem was new power-saving features Intel had tried to add for these disks, which would cause them to write to themselves a large amount and just eat through their useful lifetime much faster than you'd assume.

In almost every case, I replaced these with, of course... Samsungs. Though I believe I've been lucky enough not to choose any of their bad ones.


> The freeze/reboot/fine cycle seems to be a common one for SSDs acting poorly, running out of blocks they want to use internally, or memory or cache or something, or just hanging in their own firmware for whatever other reason.

Given that it's two gen4 drives in a laptop being subjected to a moderately heavy sustained workload, I'd also suspect a thermal problem or maybe even power delivery. Those two slots are probably being fed off the same 3.3V regulator.

Since the firmware bug appears to have caused catastrophic write amplification, what may seem to the user to be only a modest and reasonable workload may be causing the drive that is the backup destination to be running at full tilt doing a ton of writes to the flash and causing the drive to hit its peak power consumption and heat output.


I always suspected that it may be ability for laptop hardware to handle the second drive as performance was not as quite as performant as primary slot. Both slots are rated for PCIE 4 though.

It is strange though that after the firmware update there have been zero freezes.


Yeah I had an old intel ssd that would hang like this and I could never figure out wtf was going on


> the firmware update last week

Link to specific firmware version please?


The new firmware is version 5B2QGXA7, updated via magician on Windows. I didn't make a note of earlier firmware versions. It's still too soon to know if the ssd freeze will reoccur.


How odd, I've got a 980 Pro 2TB that I've had since mid last year, and checking I'm already on that firmware version.


thank you


Could you provide that batch script please? Like in a GitHub Gist or something similar.


Sure...

@echo off

pause

robocopy "C:\Users\o\Desktop\2023" "D:\2023" /e /mir /np /v /tee /r:0 /w:0 /log+:"C:\Users\o\Desktop\log_robocopy.txt"

pause

@echo on


You don’t need to turn echo back on at the end of your batch file. That line is pointless.


And don't need /e since /mir=/e /purge


Thanks. This script has adapted over time and I need to lookup most of the switches these days to remind me what they do.


what does the SMART data for your drive say?

I'm morbidly curious how much it reports lifespan remaining for its internal write-wear-leveling system.


I've only really experienced drive locking and freezing which resolves on a reboot, this is concerning enough. I haven't experienced any endurance issues.

SMART data is as follows (both since 2022-11-25)

Primary drive C: 5.1 TBW Model Name, Samsung SSD 980 PRO 2TB Serial Number, S***** Drive Type, NVMe Result,Byte End,Byte Start,Description,Raw Data,Status ,0,0,Critical Warning,0,OK ,2,1,Temperature (K),320,OK ,3,3,Available Spare,100,OK ,4,4,Available Spare Threshold,10,OK ,5,5,Percentage Used,0,OK ,47,32,Data Units Read,6465577,OK ,63,48,Data Units Written,10998930,OK ,79,64,Host Read Commands,150273501,OK ,95,80,Host Write Commands,157439035,OK ,111,96,Controller Busy Time,1083,OK ,127,112,Power Cycles,199,OK ,143,128,Power On Hours,571,OK ,159,144,Unsafe Shutdowns,12,OK ,175,160,Media Errors,0,OK ,191,176,Number of Error Information Log Entries,0,OK ,195,192,Warning Composite Temperature Time,0,OK ,199,196,Critical Composite Temperature Time,0,OK ,201,200,Temperature Sensor 1,320,OK ,203,202,Temperature Sensor 2,328,OK ,205,204,Temperature Sensor 3,0,OK ,207,206,Temperature Sensor 4,0,OK ,209,208,Temperature Sensor 5,0,OK ,211,210,Temperature Sensor 6,0,OK ,213,212,Temperature Sensor 7,0,OK ,215,214,Temperature Sensor 8,0,OK

Secondary drive D: 2.9 TBW Model Name, Samsung SSD 980 PRO 2TB Serial Number, S***** Drive Type, NVMe Result,Byte End,Byte Start,Description,Raw Data,Status ,0,0,Critical Warning,0,OK ,2,1,Temperature (K),320,OK ,3,3,Available Spare,100,OK ,4,4,Available Spare Threshold,10,OK ,5,5,Percentage Used,0,OK ,47,32,Data Units Read,4919136,OK ,63,48,Data Units Written,6128916,OK ,79,64,Host Read Commands,164977799,OK ,95,80,Host Write Commands,94324034,OK ,111,96,Controller Busy Time,78,OK ,127,112,Power Cycles,199,OK ,143,128,Power On Hours,538,OK ,159,144,Unsafe Shutdowns,20,OK ,175,160,Media Errors,0,OK ,191,176,Number of Error Information Log Entries,0,OK ,195,192,Warning Composite Temperature Time,56,OK ,199,196,Critical Composite Temperature Time,0,OK ,201,200,Temperature Sensor 1,320,OK ,203,202,Temperature Sensor 2,323,OK ,205,204,Temperature Sensor 3,0,OK ,207,206,Temperature Sensor 4,0,OK ,209,208,Temperature Sensor 5,0,OK ,211,210,Temperature Sensor 6,0,OK ,213,212,Temperature Sensor 7,0,OK ,215,214,Temperature Sensor 8,0,OK


Funny thing. This article prompted me to check the health of my two Samsung SSD's (a 250GB 850 EVO SATA III, and a 970 EVO Plus 1TB NVMe), which were fine.

But Samsung's Magician also listed my Seagate ST2000DM008-2FR102 2TB spinny disk. It found a SMART error. I ran a performance test and looked at SMART again, and the "Hardware ECC Recovered" value went from 80 to 81, with a threshold of 64. My other software labels this as "good". Nevertheless, this drive is now being replaced by a 4TB WD Blue. Thanks, article. Saved me some future troubles!


The value rising from 80 to 81 is an improvement. The calculated value decreases when the raw value of "Hardware ECC Recovered" worsens.


Oh, now I feel like a total idiot. Thank you for clarifying that. Good thing is, I still needed a bigger drive so now that's on the way :)


Is this the case for all SMART values? Higher=better?


Yes, the non-raw values are always reported with higher meaning better. The raw values are the raw measurements/counts and they each mean different things.


Is the number the percentage of a max 100%?


You did the right thing by replacing the Seagate ST2000DM008. I do data recovery professionally and that's not one of my favorite drives! Lots of issues IMHO.


No, sometimes it's in the 0-200 range. I believe the device always reports the maximum and minimum possible values.


Hardware ECC Recovered represents the amount of time between error correction events, so a higher number is better.


You might want to double-check that SMART status in CrystalDiskInfo

I don't trust Magician to report correctly on other vendor's storage


Yep, I was wrong on this. I should have used my google-fu instead of jumping to a conclusion. I guess I just really wanted an excuse to upgrade to a 4TB :)


How do I check health of a Samsung SSD on Linux?


LMGTFY sudo smartctl -t long -a /dev/sdX


That's not the way for a modern SSD. Try `sudo nvme smart-log /dev/nvmeN`


Seems to give the same output as the second section "smartctl --all" gives... so, less information.

Aside, any idea why it thinks my drive is 208% used?


No idea on that one. Mine are all three indicating 0%, but I've seen wacky stuff from SMART indicators over the years.


Completely normal


I've been trying to find a decent endurance NVME in the m.2 form factor for write-heavy applications and it appears that true 2-bit MLC has all but disappeared, replaced by 3-bit TLC and higher (with commensurate loss of endurance)

The high endurance SSDs appear to be only available in u.2\u.3\hhhl and god-help-me EDSFF form factors

Any suggestions? Micron's 7450 isn't readily available


They don't really make them anymore, but you can still get m.2 form factor Intel Optane SSDs in the 900/905P series, example[0]. They have insane endurance specs. Their performance is also still awesome, especially for random reads/writes[1]. I wish they had continued making them. Most PC builders just bought crappy Samsung SSDs this whole time, ignoring these awesome (and high priced) drives.

> Life Expectancy 1.6 million hours Mean Time Between Failures (MTBF)

> Lifetime Endurance4 10 Drive Writes per Day (DWPD)

There was one update, but I don't believe it's m.2? [2]

0: https://www.newegg.com/intel-optane-ssd-905p-series-380gb/p/...

1: https://ssd.userbenchmark.com/ (sort by "Avg Bench" and you'll see these old Optanes still in the top 10)

2: https://www.intel.com/content/www/us/en/products/docs/memory...


> Most PC builders just bought crappy Samsung SSDs this whole time, ignoring these awesome (and high priced) drives.

Ha, you can't blame the consumer.

Great tech but expensive and locked to intel. Nobody was going back to the blue evil just because they had a really random reads and writes for the enterprise market.


Isn't userbenchmark the site that had them artificially giving Intel/nvidia better scores on GPU and CPU metrics?


Same one. Assume malice, cluelessness, or both when someone cites "scores" from that place


Most PCs experience very little write load; I can imagine that many of them experience less than one full drive write per lifetime.

A database server box, or even a CI build box, is a whole different business.


WD RED SSDs are targeted at NAS use cases and claim endurance of 1PBW per TB, https://www.tomshardware.com/reviews/wd-red-sn700-review


You may also want to ask over on reddit in /r/newMaxx. Good place for SSD info and there is a pinned post for asking questions like this


NewMaxx SSD references: http://ssd.borecraft.com/


Why not get a U.2 drive and an adapter like this one https://www.startech.com/en-us/hdd/m2e4sff8643


Didn't know such a thing existed, thank you!


While it's true that MLC is mostly dead, you might want to consider a higher capacity TLC ssd. If you double your capacity, you double the endurance since SSD endurance is in drive writes per day, and a bigger SSD will likely have a bigger SLC cache to help with the write speed.


Yep, you can even adjust the overprovisioning manually (at least for Samsung). So if you need more endurance (and improved random write performance) just buy bigger capacity and increase the over provisioning allocation. Found this good summary about it with graphs showing the impact of it:

https://www.atpinc.com/blog/over-provisioning-ssd-benefits-e...


As long as you're using a filesystem Samsung understands? When I overprovisioned my 860, windows showed that the partition didn't fill the drive. When I decreased the size of the partition in windows, Samsung Magician showed that the overprovisioning had increased. While I'm not sure how the drive itself deals with overprovisioning, Magician certainly detects it based on the size of the partition.


2 -> 3 bit cells is a 1.5x capacity bump.

i would not be shocked to find tlc has a >1.5x impact on dwpd.


But it might be easier to find e.g. a 3x sized TLC compared to a 1x size MLC in the FF that GP wants.


For a simple example, the 970 pro (MLC) had a 1200 drive write warrenty, while the 980 pro (TLC) only has a 600 drive write warranty, but the 2TB 980 pro is cheaper than the 1 TB 970 pro so you can get the same endurance for less.


Seems weird that only 2TBs fail then


This sounds like a firmware bug that has nothing to do with endurance.


Use larger drives and use RAID to distribute writes over more drives. Accept that a TLC drive heavily used for writes is a consumable and act accordingly.


i recommend samsung pm9a3 versions. not as popular, but are enterprise products and also the endurance is like 3 times 980 pro i believe (please check it, not 100% sure).

Been using in my threadrupper workstation with a lot of vms which are put to sleep every day with around .25tb written and read each time the vms are started. keep in mind these are 22110 form factor


These are not available with retail support so if you manage to acquire one (which may have been pulled from service with an unknown level of previous wear) you will get ZERO support from Samsung no matter what goes wrong.


well i bought from digitec here in switzerland and they take care of warranty. other than that i don't expect any support from any ssd vendor.

I rely on the physical store i buy from where i live. that's the reason i only buy either physically or from amazon germnay (their support had been rock solid in the last 10 years i had been using them).



I have this exact model, 980 Pro 2TB.

It says to update firmware, but how can you do that from Linux? The instructions are all about some Windows program. Thanks!

EDIT: I'm quite happy with the warning from this article, fixed a potential future problem!


If you're on linux, you probably want to use fwupd[1]. You can check the existing version of your drive's firmware by running `fwupdmgr get-devices`. The version with the fix is 5B2QGXA7.

I'm on Arch and apparently I installed the update at some point in the past.

1. https://wiki.archlinux.org/title/fwupd


Samsung is publishing some firmware but not this. (https://fwupd.org/lvfs/vendors/#samsung, https://github.com/fwupd/fwupd/issues/5477)

  $ fwupdmgr get-updates
  Devices with no available firmware updates: 
   • SSD 980 PRO 2TB  
It would be good to put some pressure on Samsung to use the Linux Vendor Firmware Service. I just opened a support ticket about it.

fwupd is at least manually adding a warning about the affected firmware. https://github.com/fwupd/fwupd/pull/5481


Oops, you're correct. Looking through my shell history, it seems I manually downloaded and installed the firmware update in March of 2022. Here are the commands I ran:

    curl -O https://semiconductor.samsung.com/resources/software-resources/Samsung_SSD_980_PRO_5B2QGXA7.iso
    mkdir /mnt/iso
    sudo mount -v -o loop ./Samsung_SSD_980_PRO_5B2QGXA7.iso /mnt/iso/
    mkdir /tmp/fwupdate
    cd /tmp/fwupdate
    gzip -dc /mnt/iso/initrd | cpio -idv --no-absolute-filenames
    cd root/fumagician/
    sudo ./fumagician


A few months ago I already updated my Samsung SSD by following this procedure: https://askubuntu.com/a/1386451. Theoretically they provide an image to boot from to do the update, but the image seems very outdated and did not recognize my keyboard so it was unusable.


I now found and followed this:

https://blog.quindorian.org/2021/05/firmware-update-samsung-...

And it seems to have worked. After extracting this updater tool and running it, smartctl kept showing the old firmware version (3B2QGXA7), but after reboot it now shows the new version (5B2QGXA7).

I took the risk of running this while the OS (Archlinux) was running with the disk mounted (this is the OS install disk), and at first sight this didn't cause issues. But still do it at your own risk!!


Thanks, but that takes me to old firmware. I was however able to download the new firmware from https://semiconductor.samsung.com/consumer-storage/support/t... and use the same procedure and it worked:

    ├─SSD 980 PRO 2TB:
    │     Device ID:          03281da317dccd2b18de2bd1cc70a782df40ed7e
    │     Summary:            NVM Express solid state drive
    │     Current version:    5B2QGXA7

My home is on non-redundant stripe of two 980 Pro which both had the bad firmware, so I was obviously motivated, but not panicked as it's replicated hourly to spinning rust (and I have offsite backups). I treat Flash memory as dynamic ram with only slightly better retention.


The command line example in the link shows the old firmware, but they do say to go to the samsung website and get the latest one there. The 980 was the first one in the list there under the Firmware section.


I was trying to figure out why the update wasn't working on my Archlinux box. After a few attempts I barely caught a glimpse of an error message that flashed by: something along the lines of "unzip not found".

After installing unzip, the firmware updated successfully.


The command can be simplified as

  isoinfo -R -i xxx.iso -x /initrd | gzip -dc | cpio -idv --no-absolute-filenames "root/fumagician*"
if you don't want to go through the mounting and extracting everything.


For me `isoinfo` doesn't like the format of the .iso file ("CD-ROM is NOT in ISO 9660 format"). Using bsdtar worked though:

  bsdtar xOf xxx.iso initrd | gzip -dc | cpio -idv --no-absolute-filenames "root/fumagician*"
(And of course then running `root/fumagician/fumagician` in either case.)


This worked for me, thanks! I have a Samsung SSD 980 PRO 2TB which I upgraded from 2B2QGXA7 to 5B2QGXA7. I backed up and also made a copy of the output of `sudo smartctl -t long -a /dev/nvme0` before and after.


Also seems to have worked for me doing 3B2QGXA7→5B2QGXA7 in NixOS. Extract ISO, extract initrd, run fumagician, reboot.


I think I managed to upgrade the firmware from within Kubuntu (it wasn't the OS Nvme) by using this method

https://blog.quindorian.org/2021/05/firmware-update-samsung-...

I've done it some months ago, so I don't remember if it was that one exactly.

I'm now relieved to know that 5B2QGXA7 is still the current one.


an example:

    nvme fw-log /dev/nvme0
    nvme id-ctrl /dev/nvme0 -H | grep Firmware
    nvme fw-download -f firmware.ebin /dev/nvme0
    nvme fw-commit /dev/nvme0 -s 2 -a 3
    nvme fw-log /dev/nvme0

In an unlikely event, may need to change the slot (-s)


Following from my linux desktop. I lost a system SSD that was a 980 2 TB and recently reinstalled everything, thinking it was a fluke. Now worried it will happen again rapidly.


I just put the same drive in my PS5 in December. None of my Windows machines have spare NVMe slots, so updating this should be interesting


boot from Windows on a SATA drive?


From TFA it's not just the 980 Pro 2 TB but also all the newer 990, so it's problematic.


The article switches from the 980 issue to the 990 issue in a bit of an unclear way, but I think they're independent problems and the firmware update should fix the 980 one?


nvme drives can be updated from the command line. I've done it myself.

Extracting the actual firmware update from the files Samsung gives you might be an issue though.


nvme-cli can, in principle, upload firmware. I’ve never personally tried it.


every machine I have that can fit 2 SSDs (basically, everything except the very slim laptops) I have converted over to running a ZFS mirror as its root filesystem. NixOS makes this very easy to do because the grub.mirroredBoots option [0] removes the need for a separate "bootpool" with limited ZFS feature flags.

and crucially, I always make sure they're 2 drives from different manufacturers, so that a bug of this nature should never be able to take down both drives in a pool simultaneously.

I think of this as the "if you're going to go to the trouble of wearing a belt and suspenders, make sure to buy them from separate brands" principle.

0: https://search.nixos.org/options?channel=22.11&show=boot.loa...


Just a note as a happy customer of Puget Systems that my experience working with them has been excellent, they really seem to be expert in their field, and have been for many years.

Also, their submerged in mineral oil aquarium computer was really cool, back in the day.


I confirm. 2 Puget System workstations in this house. Just opened both of them up to add more hard drives for games and games work storage, and the cabling is so lovely. In my new gaming box I have a Samsung 980 Pro 1TB which according to this note is unaffected, and I couldn't find it on my motherboard. I created a support ticket and the support immediately responded with very clear explanation (it's under its own heatsink under a humongous heatsink/fan for the CPU. Duh!


I use my Puget for build performance for a large mono-repo at Adobe. I could set my watch to the consistency in memory and CPU usage during a full build of our products.


This posting came two months late for me.

My 980 2TB crossed the river styx over the holiday break. Failure mode exactly as described. Nice Christmas present for me. Took 3 weeks to get the warranty replacement from Samsung.


I wasn't aware the 980 Pro 2TB was also affected; I have four of those in a new machine I put together last year.

Time to install some bloatware and see about updating their firmwares, I guess...


Samsung 870 EVO drives were also known to fail early including my 2TB model.

https://www.techpowerup.com/forums/threads/samsung-870-evo-b...


Yes, this bit me just last month. Around October 2022 I purchased 3 Samsung 870 EVO 2TBs for use in a RAID array. By January 2023, all three of them failed within a week of each other!

Fortunately they failed one by one, so I was barely was able to recover my RAID array by pulling out one drive at a time, powering the computer off, and waiting for the RMA replacement to arrive.

But imagine my shock to see one drive fail... only to replace it with an RMA... and then days later, seeing the next drive fail... and the next!


I just had a 2TB 870 EVO fail too! There were SMART errors about uncorrectable errors, and I saw CRC errors in the OS. I lost some data.

This issue is all over the Internet, yet Samsung would not acknowledge it was a known issue. They also refused my RMA because the corner of one of the plastic port guides was chipped - we're talking about a miniscule chip, barely visible with the human eye. So despite the fact this drive was obviously defective, Samsung won't replace it. So I'm down £350 (£225 for the original drive, and £125 for the Crucial I had to buy to replace it), through no fault of my own.

I'm not buying Samsung SSDs again - problems are one thing, but how you deal with them is paramount.


when equipping a storage system, it does make sense to use disks from different manufacturers, different models, and different vintages (manufacturing batches).

equipping a storage system with disks all of the same make, model, and vintage is invoking the statistics gods to strike failure all at once (or close enough that you won't be able to keep up with the rate of failure and time to rebuild)

personal experience: attempting to rescue a failing 192 disk system containing disks all of the same make and model. wearisome.


A perfect example why RAID ain’t backup


Indeed! Fortunately the RAID was also backed-up offsite. But, the entire process was shocking in several ways.

At least Samsung was fairly speedy with the RMAs and it was basically no-questions-asked... because I imagine they're getting tons of these mailed back to them.


Did they require the failed drives to be returned? Did the secure erase function work on a failed drive? I guess this is a good reason for full disk encryption even on machines where I'm satisfied with their physical security.


Yes, they sent a prepaid label to return the bad drives. I did a secure erase both via Linux command line and via their proprietary software and it seemed to work. The whole process took about 3 weeks per drive.


I panicked for a second until I realized mine is a 860 2TB EVO and I bought it back in 2019. It still has zero block and other errors after 15k hours according to smart.

> seems to primarily affect drives produced in January/February 2021

That is interesting, I wonder if it's another one of those cases where the supply chain shortages forced them into respinning the boards with some slightly out of spec parts. It's certainly been a major problem for anyone making PCBs.


Yep, my 2TB also got hit by this - did an RMA in November

The replacement drive they sent me has behaved itself so far, touch wood


There is a chinese youtuber doing SSD durability tests on a few SSDs of different vendor. And one of the tested ssd is Samsung 980 pro.

What is the funny thing? The Samsung 980 died before the wear test even start.

https://youtu.be/tXYQZHz7u3w?t=898


The real problem is OS:es that write to disk for no good reason.

Windows 10 writes 100KB/s constantly.

That should be illegal.


Try running Windows 10 off of an older spinning laptop drive. It can take upwards of 40 minutes to display the desktop on the first boot after Windows Update runs. Even in normal operation those constant low level writes leave barely any breathing room for your actual applications. Full size hard drives do a bit better, but even then it can be pretty painful when the drive indexing service kicks off or .NET is updated.


Windows 10 on an SSD feels like Windows 7 on a spinning disk. Microsoft has wiped out the gains we got from SSDs.


This doesn't actually matter in a practical sense. Assuming 24/7, it's 3TB a year. Which is ~1% drive endurance.

Also, if you are worried about overwriting the same files over and over, it also doesn't matter. Block device addresses are not physical addresses, controller maps them to wear the drive evenly.


On the other hand, lots of tiny writes scattered all over will tend to produce much higher write amplification than large sequential writes. So you'll get more actual wear to the drive from the 3TB of constant background churn than if you copied in 3TB of movies.


Those writes would have to be significantly smaller than the SSD's page (sector) size which is 512 bytes or 4 KiB. And would have to be written to different pages in rapid succession (to be flushed apart) - a standard serial write wouldn't trigger this even if it's 1 byte at a time, the OS FS cache would buffer it.

It would have to be very misbehaving software or deliberate sabotage.


The logical block size presented by SSDs to the host system is 512 bytes or rarely 4kB. But the native page size of the flash memory itself is usually more like 16kB, and the erase block size is several MB at a minimum. Those larger sizes are why random writes (and especially random overwrites) can cause high write amplification within the SSD: because what looks like a series of single-sector writes to the host will at a minimum cause fragmentation within the SSD, and can easily cause large read-modify-write operations within the SSD.

Normally, SSDs and operating systems both use aggressive caching to combine writes. That's the only way a drive can turn in extremely high random write benchmark numbers. Consumer SSDs do this caching even though they do not have power loss protection capacitors to ensure that data cached in volatile SRAM will be flushed to the flash in an emergency. But it wouldn't be smart for the caching to wait forever for more writes to combine with a sub-page write, which is why I'd be concerned that a slow and steady trickle of write activity may be able to cause serious real write amplification.


I’m pretty sure SSDs can only do 4kib aligned writes regardless of the FS sector size (under the hood it’s a write amplification unless the OS or controller manage to coalesce them. But yea, it depends on how things are getting flushed, but generally I wouldn’t expect too much magic unless you get lucky. It sounds like a small bug in the OS (ie these kinds of wires should be matched in memory in the application).


Almost all SSDs internally track allocations in a 4kB granularity. That size is what leads to the convention of equipping the drive with 1GB of DRAM for every 1TB of NAND flash, when the drive is designed to hold the entire table of logical to physical address mappings in DRAM.

It's now common for consumer SSDs to have less DRAM than the normal 1GB per 1TB ratio, but they run their FTL with the same 4kB granularity and just don't have the full lookup table in RAM. There are at least a handful of special-purpose enterprise drives that use a larger sector size in their FTL, such as the 32kB used by WD's SN340: https://www.anandtech.com/show/14723


I do wonder if perhaps the good NVME SSD controllers come with magic. It would take a single instance of malware ruining SSD's with 4000x write amplification to taint some brands while aiding the marketing of others.


I thought some of them even do 8KB. I’ve seen ZFS tips that claim you should use 8KB blocks on things like an 850 Pro.


Not familiar with that. I know QLC disks have a block size of 64kib.


Except "evenly" is not a standard, or something that anyone other than the manufacturer can verify, it's hidden in the firmware so we have no idea really.


Do we know the ___location of those writes, perhaps they can be redirected to a ramdisk?


Sysinternals tools will show up the writes and what is causing them.

I've dug down and found random things doing dumb stuff in the past. Verbose logging turned on by default for some services, for example.


management/windows logs about >100 active logs. Performance/data collector set about >50 active Event Trace Sessions running.


browsers are much worse, both chrome based and firefox


Is there any way to update my SSD on Windows without downloading Samsung Magician? I have the Samsung Evo across multiple systems and my Linux ones are okay but my Windows machines aren't okay unfortunately, they have the bad firmware.


Tangent: what's the cheapest 4TB+ PCIe SSD without significant known issues but with hardware encryption? The SN850x seems to have its own issues, and beyond these everything else seems so expensive.


Is hardware encryption a widespread feature?

What are the benefits of it for you, compared to OS-level full disk encryption?


> The SN850x seems to have its own issues

Such as?



I have a puget system with the Samsung SSDs mentioned and it locks up on me every 2-4 weeks. This sounds like it would explain the problem. Puget sent out a message this week to upgrade the Samsung firmware, but I am at the latest. I’ll be contacting Puget support on Monday so I’m on their radar.

I will say I love my Puget, its performance has been killer other than these lockups. And I’ve heard only good things about Puget support. I should have reported this months ago, but it’s just now that I’m doing some critical work on Windows that it’s affecting me.


Would like to know more. Were failures in the field from wear-out or sudden death? Are the health indicators losing 1% per week consistent with the datasheet TBW, or worse?


We have Lenovo laptops at work with M.2’s that are OEM-branded Samsungs.

One bricked itself in to read only mode after a few months.

The other has been losing 1% health each week or so. I caught it losing 2% in just two days recently.

These drives are older than the 990 model mentioned in the article but I have my suspicions anyway they’re dud drives.

Nothing lost except time - they can be swapped under warranty. But I used to buy Intel exclusively before swapping to Samsung when Intel started selling rebranded drives.

I guess the search for a reliable vendor starts again…


These anecdotes are pretty frustrating without the other key piece of information. For the given lifetime indicators, how many writes were served? Are they wearing out faster than their TBW claims, or are they being written more than you expected?


It’s at 86% with 12.21TB written. Total power on time 68 days. Drive temp sits around 45 degrees celsius.

I’m not paying great attention to all the SMART counters day by day.

It’s for a dev workload so like … compiling code and stuff? I have the exact same workload on my desktop PC and its Samsung drive health is 99% after … years.


OK, so close to 1% lifetime per TBW, or lifetime approximately 100TBW. Thanks! That's consistent with their endurance claims for the smallest SSDs (128GB PM991 for example, or 256GB 960 Evo) but it would be poor for a larger one.


Cool well these are 2TB drives so…


Even Crucial are having tons of issues with current MX500 models.

I think SKHynix might be the last as they supply a lot of OEM over the years. But their consumer base is small so we don't have a huge sample size like Samsungs.


I can live with my batch of 980pros and 870 EVOs failing. Bad batches happen to every manufacturer.

However this seems like a systematic problem at Samsung.

What really stinks is that I've been recommending Samsung EVOs and PROs to relatives and friends for some years.

If I want to remain honest with myself I have to contact every one of them and have to run the SMART tests.

So now there are NO reputable SSD manufacturers left. Only reliable SSDs are MLCs from mid 201Xs especially Intel and Samsung.


Ask anyone with a circa-2014-ish Macbook Pro about Samsung SSD reliability.

The samsung-made drives lasted about 5-6 years. Everything seemed fine, and then one day you'd get a spinning pizza of death, power down, power it back on...and your SSD was...completely gone. Doesn't even enumerate on the PCIe bus. It's just gone.

Screw the SSD chipset manufacturers for not making sure that their controllers can at least a)still show up on the bus b)be read-only in some sort of recovery mode.


Not sure why downvoted.

A recovery mode seems absolutely possible and I don’t care if it’s 100kbit over SPI. It’s better than losing everything.


Ars had a piece covering this as well, and I do wonder if there is something going on somewhere else in the Samsung stack, not just the NVMe 900 series line. Pure anecdote, but two years ago I did a NAS for a client using 24x 2TB Samsung 870 Evo drives (they'd gotten some incentive deal for it). While it was all one type vs mixed, there was the "luxury" of time because at that point getting the system they wanted together had a significant lead time. So I did ensure that the drives were purchased over the course of around 7 months, from multiple different reputable sellers (B&H, CDW, Provantage etc) in separate batches. System was solid, an Epyc 2 based SuperMicro server, running TrueNAS.

And then last year with around 5500-7500 quite light hours of runtime (primarily reads, ~0.08 DWPD, well under official rating of 0.3 DWPD) drives started failing. These were definitely real failures, first indication came from regular automated ZFS scrubs and reporting increasing checksum errors and ATA errors. It was for so many drives and I'd always considered Samsung SSDs relatively reliable (even for consumer ones) that at first I thought it was a SATA controller failure, and our rep agreed and warranties back the server. They were great, gold plated support contracts pay off once in a while, and motherboard replacement and thorough testing later back in service. More drive problems. SMART short tests said everything was healthy, first longs did too. But then drives exceeded error limits and started getting faulted, and at last SMART long tests started failing. Digging in showed worrisome stats. So began swapping out and warrantying drives (cheers to the stress test to TrueNAS, in the end zero downtime or need to restore from backups). In the end, THIRTEEN (13) out of 24 failed. Brutal >50% dead drive rate. I talked to some others around and they'd seen <1 year rates also at 30-60%. Big :\. Rep also indicated they were hearing more about Samsung failures.

Anyway, it gave me a talking point going forward to really, really press management on "it's worth paying for drives from 3-4 brands, and maybe splurging for higher-rated vs consumer too", but it also did make me wonder if something is going on, or was (pandemic related?), at Samsung's storage division. It's definitely pure anecdote, but still: I spread those drive purchases out pretty aggressively, and they had radically different serial numbers. Same with other folks I know at other businesses using various Samsung drives; everyone has been going to real effort, following decent practices, to avoid buying drives all from a single lot. Even a 10% failure rate for consumer drives I could have seen, but 54%? And not a bathtub curve front-loaded in the first month or two, but after 7-11 months? That feels high. Samsung did replace them all no questions asked, and they paid for shipping too. I don't have any global insight into how this all looks, and it could be just plain bad luck for all of us in the region, but still.
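
For reference, the alerting that first flagged the drives was conceptually nothing more than this kind of scripted check (a sketch with made-up device names, assuming the usual zfs + smartmontools tooling is on the box):

    #!/usr/bin/env python3
    # Periodic check: complain if any pool is unhealthy after a scrub, or if a
    # drive fails its SMART overall-health self-assessment.
    import subprocess

    DRIVES = [f"/dev/da{i}" for i in range(24)]   # hypothetical device names

    # `zpool status -x` prints "all pools are healthy" when nothing is wrong
    pools = subprocess.run(["zpool", "status", "-x"],
                           capture_output=True, text=True).stdout.strip()
    if pools != "all pools are healthy":
        print("ZFS reports a problem:\n" + pools)

    for dev in DRIVES:
        health = subprocess.run(["smartctl", "-H", dev],
                                capture_output=True, text=True).stdout
        if "PASSED" not in health:
            print(f"{dev}: SMART overall-health check did not pass")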


> Anyway, it gave me a talking point going forward to really, really press management on "it's worth paying for drives from 3-4 brands, and maybe splurging for higher-rated vs consumer too", but it also did make me wonder if something is going on, or was (pandemic related?), at Samsung's storage division.

What's going on at Samsung is the same thing that's always been going on at Samsung: they use their own flash and their own controllers and they have their own sets of problems as a result. It was a selling point in the early days of Sandforce and other turds (leading to things like the OCZ Vertex series) but now the commodity market has caught up and Samsung doing their own thing is kind of a negative. Like yeah it's fine as long as they don't screw up, but they're screwing up a lot more than they used to.

I don't see any direct correlation between the various failures that have occurred over the years. The 840 Evo had something wrong with the NAND flash that caused cells to lose charge over time (leading to data loss), so they put out new firmware that just continuously rewrote old data in a tape-loop sort of deal (lol) so it never aged out. I don't count that as a controller flaw; that's a flash flaw that was fixed with a firmware patch.

870 Evo, 970 Evo, 970 Evo Plus, and 980 have all been accused of having problems over the years, in addition to the 980 Pro and now 990 Pro, but there's actually a pretty good variety of controller models as well as different flash types (from 64-layer to 236-layer) there. It's hard to know how much firmware they all share though... or whether it was more flash problems in certain batches, or what. But overall Samsung certainly has had a lot of failures in recent years and I think it all really comes back to the fact that they're using their own controllers and their own flash, while everyone else is pretty much commodity at this point... meaning they get their own bugs too.

https://docs.google.com/spreadsheets/d/1B27_j9NDPU3cNlj2HKcr...


If you're using Linux, you can follow these instructions to update the firmware:

https://wiki.archlinux.org/title/Solid_state_drive#Update_un...
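
However you apply the update (fwupd is one common route, and Samsung also ships a bootable ISO updater), a quick way to confirm it actually took is to compare the reported firmware revision before and after. A minimal sketch, assuming smartmontools 7+ for the JSON output flag:

    #!/usr/bin/env python3
    # Print the firmware revision of an NVMe drive, e.g. before and after
    # running `fwupdmgr refresh && fwupdmgr update`.
    import json
    import subprocess

    DEV = "/dev/nvme0"

    out = subprocess.run(["smartctl", "-i", "-j", DEV],
                         capture_output=True, text=True)
    info = json.loads(out.stdout)
    print("model   :", info.get("model_name"))
    print("firmware:", info.get("firmware_version"))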


> high failure rates in the field

Yikes. A lot of people were hoping that the high wear was just a reporting artifact.


I've had 2x Samsung 980 Pro 2TB fail in as many years. Last time I buy Samsung.


My Samsung 980s (1TB, so not affected per the article) are still going strong, but Samsung has been dropping the ball across the board for me in recent years. I had a bad experience with their Frame TV last year, their fridges are a nightmare to fix, and I'm now hearing bad things about their washing machines and their SSDs. I know these divisions are not related, but it's not a good look.


Samsung Magician's SMART status reports 0 'media errors' on my 980 2TB. Is that suspiciously low, or does it mean things are actually OK? Is there a different tool I should check with?


Anyone know if the failure mode is sudden total bricking or something more recoverable?


There is always the possibility that their S.M.A.R.T. implementation is borked...


The article does say they've seen "abnormally high failure rates in the field", so it's not just that.


If that's all it was, then it's likely a firmware update would not only prevent the issue, but also reverse it if the storage is actually healthy. That doesn't seem to be the case here, though.


Keep in mind that one of the things SSD firmware does is deliberately write-lock the drive if the media is too worn to erase reliably. So buggy firmware overestimating media wear is also likely to produce a failed drive.
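
The wear numbers the firmware bases that decision on are visible from the host side, so you can at least sanity-check them against how much has actually been written. A rough sketch, assuming smartmontools 7+ and an NVMe drive at /dev/nvme0:

    #!/usr/bin/env python3
    # Dump the wear-related fields from the NVMe health log. If "percentage
    # used" looks wildly out of line with the data actually written, suspect
    # the firmware's accounting rather than the NAND itself.
    import json
    import subprocess

    DEV = "/dev/nvme0"

    out = subprocess.run(["smartctl", "-j", "-A", DEV],
                         capture_output=True, text=True)
    log = json.loads(out.stdout).get("nvme_smart_health_information_log", {})

    # NVMe reports data units in multiples of 1000 * 512 bytes
    tb_written = log.get("data_units_written", 0) * 512_000 / 1e12
    print("percentage used :", log.get("percentage_used"))
    print("available spare :", log.get("available_spare"))
    print("TB written      :", round(tb_written, 2))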


Wouldn't be surprised if this is another pandemic supply chain casualty.


Sounds more like crappy firmware to me. This is not the first SSD to suffer from a crippling firmware flaw that is impossible to fix.

I lost almost an entire computer lab of Dells thanks to the goddamn Sandforce firmware, a flaw the company acknowledged but refused to lift a finger to fix. Luckily it is possible to fix these yourself despite the vendor's hostility towards the repair. Look how easy it is: https://computerlounge.it/how-to-unbrick-sandforce-ssd/


Had issues with a 2TB 980 Pro. Things have stabilized with recent updates.


I really want to see SSD manufacturers offer a decent warranty...

This drive costs $100, and will last 10 years or until 100TB has been written to it, as long as you keep it within the specified temperature/humidity/power conditions.

If it fails to do that, we will return $1000 to you.


This sounds like an SLA; it's very unlikely you'll get that for 100 bucks. Even if this manufacturer somehow perfected their process and had zero defects, they would still be taking on a 10-year liability for 100 dollars of revenue.


APC sells surge protectors with equipment protection insurance for less than $100. Apparently it's possible, even for products sold at $100.


Well, they sure market this, but it seems pretty difficult to collect. For the insurance to apply you have to register within 10 days of purchase with a list of what will be plugged into the surge protector. In case of damage, you have to pay to ship the surge protector to APC. If they determine it was damaged by power line transients, then you need to ship all damaged equipment somewhere to be evaluated. If it's determined that the equipment was damaged by power line transients, then APC chooses whether they'll pay for repairs or pay out the current fair market value.


Sure, it should be possible to sell you insurance like that.


In theory, third-party insurance equivalent to AppleCare could be constructed for some technology products, but this is hampered by short product lifecycles, lack of BOM transparency (e.g. components changed within a single product generation), and the ability of firmware updates to change product behavior and invalidate previously collected reliability data.

Open-source SSD firmware would provide more transparency on performance and reliability.


> Open-source SSD firmware would provide more transparency on performance and reliability.

This seems fantastic. Are you saying you could review the firmware source and know that the 980 Pro would lose ~1% of its endurance per week?


Lifetime warranties used to be commonplace. I wish we could return to those times, or at least to a time of repairability.


Lifetime warranty on a consumable product (SSDs have a limited number of writes) doesn't seem reasonable.


True. In my perfect world I would settle for a trade-in program: you would get some value for the failed unit so that you can upgrade, and the OEM could recycle the raw materials. If we are ever going to live in a sustainable society, we are going to need repairability and recycling programs for all consumer products.


Using up the flash doesn't hurt the controller, though. The controller still knows how much writing it's done even if the flash itself is toast; it's a totally different part of the drive.

And even then, you could construct the controller so that it burns e-fuses to record lifespan, with the fuses readable over JTAG. Short of complete controller death or lightning-strike-level surges (which you could legitimately argue are abuse and not warrantable), you could make that counter readable offline from an external device.

https://en.wikipedia.org/wiki/EFuse

The problems here are primarily economic/social, not technical. Companies don't want to hold warranty liability on their books for 10+ years, but they also don't really want to accept returns for defective products or other things either, and we make them do it anyway.

The EU is already pushing warranties to a minimum of two years for exactly this reason. Could it be 5 years, or 10 years? Sure, why not.

Companies will scream in the short term, of course. It's cheaper for them to push out crap that'll die and be in the trash in 3 years. Engineering products for longer lifespans would be a shift in engineering/design mindset. It probably would also push minimum device costs upwards at least a little bit, but that's not a bad thing either: the slogan is "reduce, reuse, recycle", in that order, and "reduce" there simply means buying less, or buying things that last longer. A shift away from planned obsolescence isn't the worst thing culturally; we don't want to encourage design-for-disposability.

Especially as Moore's Law slows, hardware stays relevant for longer and longer periods of time. For example, a lot of people are finding that their GPUs are dying before they're actually irrelevant as hardware. It's not just NVIDIA who had bumpgate; a ton of hardware from that era failed over time due to faulty solder and probably could have been fixed with an hour of a tech's work.

Even worse, they're often dropped from support. There's really nothing wrong with an R9 290X as a GPU, but AMD won't support it with software anymore, despite the fact that it basically works anyway and it's pretty much purely a software lockout (which third parties have hacked and bypassed), because they want you to buy the new one. Wouldn't it be nice if GPUs were just expected to work for 10 years from purchase, and that was covered by warranty and software support?

There are an increasing number of people who do hang onto hardware for 5-10 years because the relevant lifespan is getting longer and longer, and we should encourage that and require companies to support those consumption patterns. Just like not gluing together phones to make the battery irreplaceable, we really should be making sure electronics bumpouts don't fail in 3-5 years and that companies don't dump-and-run on the software.

Routers are another area where the software support is just egregious. How many rando Linksys or TP-Link boxes actually get an update when a bunch of new vulnerabilities in WPA or whatever are discovered? Not that many, and "just install OpenWRT" is not a society-level answer, especially when companies are locking down hardware.


It also used to be the case that a computer was basically ewaste within two or three years because a new one would be ten times faster.


Growing up poor, I was always a few generations behind, rocking a 486 DX2 when the PII and PIII were the latest and greatest, and a 33kbps modem when others had 56k. When I was 10-ish, my older brother and I would go to the thrift store and dig through the computer parts; it was an adventure.


> When I was 10-ish, my older brother and I would go to the thrift store and dig through the computer parts; it was an adventure.

I miss that too. Thrift stores suck now: they're pulling all the good clothes out and selling them to upcyclers, and pulling all the cool electronics, cameras, and other stuff and selling them on ShopGoodwill and eBay.

And ShopGoodwill is pretty absurd; almost everything is sold as-is and uninspected, and prices are just as high as eBay, if not sometimes higher.

The days of wandering through a goodwill and finding some neat stuff at a bargain price are gone now, unfortunately.


Back when HDDs used to fail a lot, warranties actually worked. I'd happily fill out an online form, Web 1.0 style, then send in my Seagate disks (I'm in Europe; I was sending them to the Netherlands IIRC), and a few weeks later I'd receive a new drive.

I probably still have a few screenshots of these forms somewhere.


I am not sure why you want a 10x refund, but it seems like your request is easily met by current warranties. A 1TB WD SN850X advertises 1200 TBW of endurance, rather more than you require.


https://www.law.cornell.edu/wex/punitive_damages

Seems clear the idea is to make sure that companies err well on the side of lifespan rather than designing something that fails a month after the warranty expires. Because if they're cutting it close, a decent number of units are going to fall under the warranty line and they'll be liable.

Even if a company is required to stand behind the product, a lot of consumers won't pursue it if it's not perceived to be worth the trouble. Do you care about the 120GB drive you bought in 2012? Not really. Do you care if you can get 10x the original ($1/GB) purchase price for it? Sure, $1200 is worth my trouble.

As they say - "A times B times C, if that's less than X, the cost of a recall, we don't do one".

I'm not OP and am not gonna die on this hill as a point of policy, but if 9/10 consumers just shrug their shoulders, accept that their 8-year-old drive has failed, and throw it in the garbage, that's still a bad thing at a society-wide level, where you want people to be using hardware for longer and longer periods of time. Especially as Moore's Law tapers down even further and hardware stays relevant for longer and longer periods of time - an R9 290X is still a pretty nice piece of hardware!

Michigan used to do something very similar with checkout price scanners - if the price coded in the system was more than advertised, you got 10 times the difference up to a limit. And the point was to get retailers to pay fucking attention because a 50 cent pricing error on a can of chili could cost them 5 bucks. Punitive damages, with citizens who spot the violations receiving the bounty.

https://www.canr.msu.edu/news/michigan_changed_item_pricing_...


The SN850X seems to have its own issues from what I've read (just Google it).


Perhaps an insurance agent can craft a policy to do that for you.

Failing that, maybe a bookmaker.


HPE does that for enterprise disks. But it ain’t free!



