Storage Trends 2025

It’s been almost 15 months since I blogged about Storage Trends 2024 [1]. There hasn’t been much change in this time (in Australia at least – I’m not tracking prices in other countries). The change was so small I had to check how the Australian dollar has performed against other currencies to see if changes to currencies had countered changes to storage prices, but there has been little overall change when compared to the Chinese Yuan and the Australian dollar is only about 11% worse against the US dollar when compared to a year ago. Generally there’s a trend of computer parts decreasing in price by significantly more than 11% per annum.

Small Storage

The cheapest storage device from MSY now is a Patriot P210 128G SATA SSD for $19, cheaper than the $24 last year and the same price as the year before. So over the last 2 years there has been no change to the cheapest storage device on sale. It would almost never make sense to buy that as a 256G SATA SSD (also Patriot P210) is $25 and has twice the lifetime (120TBW vs 60TBW). There are also 256G NVMe devices for $29 and $30 which would be better options if the system has a NVMe socket built in.

The cheapest 500G devices are $42.50 for a 512G SATA SSD and $45 for a 500G NVMe. Last year the prices were $33 for SATA and $36 for NVMe in that size so there’s been a significant increase in price there. The difference is enough that if someone was on a tight budget they might reasonably decide to use smaller storage than they might have used last year!

2TB hard drives are still $89 the same price as last year! Last year a 2TB SATA SSD was $118 and a 2TB NVMe was $145, now a 2TB SATA SSD is $157 and a 2TB NVMe is $127. So NVMe has become cheaper than SATA in that segment but overall prices are higher than last year. Again for business use 2TB seems a sensible minimum for most systems if you are paying MSY rates (or similar rates from Amazon etc).

Medium Storage

Last year 4TB HDDs were $135, now they are $148. Last year the cheapest 4TB SSD was $299, now the cheapest is a $309 NVMe. While the prices have all gone up the price difference between hard drives and SSD has decreased in that size range. So for a small server (a lot of home servers and small business servers) 4TB of RAID-1 storage is all that’s needed and for that SSDs are the best option. The price difference between $296 for 4TB of RAID-1 HDDs and $618 for RAID-1 NVMe is small enough to be justified by the benefits of speed and being quiet for most small server uses.

In 2023 a 8TB hard drive cost $179 and a 8TB SSD cost $739. Last year a 8TB hard drive cost $239 and a 8TB SATA SSD cost, $899. Now a 8TB HDD costs $229 and MSY doesn’t sell 8TB SSDs but for comparison Amazon has a Samsung 8TB SATA SSD for $919. So for storing 8TB+ there are benefits of hard drives as SSDs are difficult to get in that size range and more expensive than they were before. It seems that 8TB SSDs aren’t used by enough people to have a large market in the home and small office space, so those of us who want the larger storage sizes will have to get second hand enterprise gear. It will probably be another few years before 8TB enterprise SSDs start appearing on the second hand market.

Serious Storage

Last year I wrote about the affordability of U.2 devices. I regret not buying some then as there are fewer on sale now and prices are higher.

For hard drives they still aren’t a good choice for most users because most users don’t have more than 4TB of data.

For large quantities of data hard drives are still a good option, a 22TB disk costs $899. For companies this is a good option for many situations. For home users there is the additional problem that determining whether a drive is Shingled Magnetic Recording which has some serious performance issues for some use and it’s very difficult to determine which drives use it.

Conclusion

For corporate purchases the options for serious storage are probably decent. But for small companies and home users things definitely don’t seem to have improved as much as we expect from the computer industry, I had expected 8TB SSDs to go for $450 by now and SSDs less than 500G to not even be sold new any more.

The prices on 8TB SSDs have gone up more in the last 2 yeas than the ASX 200 (index of 200 biggest companies in the Australian stock market). I would never recommend using SSDs as an investment, but in retrospect 8TB SSDs could have been a good one.

$20 seems to be about the minimum cost that SSDs approach while hard drives have a higher minimum price of a bit under $100 because they are larger, heavier, and more fragile. It seems that the market is likely to move to most SSDs being close to $20, if they can make 2TB SSDs cheaply enough to sell for about that price then that would cover the majority of the market.

I’ve created a table of the prices, I should have done this before but I initially didn’t plan an ongoing series of posts on this topic.

Jun 2020 Apr 2021 Apr 2023 Jan 2024 Apr 2025
128G SSD $49 $19 $24 $19
500G SSD $97 $73 $32 $33 $42.50
2TB HDD $95 $72 $75 $89 $89
2TB SSD $335 $245 $149
4TB HDD $115 $135 $148
4TB SSD $895 $349 $299 $309
8TB HDD $179 $239 $229
8TB SSD $949 $739 $899 $919
10TB HDD $549 $395

Storage Trends 2024

It has been less than a year since my last post about storage trends [1] and enough has changed to make it worth writing again. My previous analysis was that for <2TB only SSD made sense, for 4TB SSD made sense for business use while hard drives were still a good option for home use, and for 8TB+ hard drives were clearly the best choice for most uses. I will start by looking at MSY prices, they aren't the cheapest (you can get cheaper online) but they are competitive and they make it easy to compare the different options. I'll also compare the cheapest options in each size, there are more expensive options but usually if you want to pay more then the performance benefits of SSD (both SATA and NVMe) are even more appealing. All prices are in Australian dollars and of parts that are readily available in Australia, but the relative prices of the parts are probably similar in most countries. The main issue here is when to use SSD and when to use hard disks, and then if SSD is chosen which variety to use.

Small Storage

For my last post the cheapest storage devices from MSY were $19 for a 128G SSD, now it’s $24 for a 128G SSD or NVMe device. I don’t think the Australian dollar has dropped much against foreign currencies, so I guess this is partly companies wanting more profits and partly due to the demand for more storage. Items that can’t sell in quantity need higher profit margins if they are to have them in stock. 500G SSDs are around $33 and 500G NVMe devices for $36 so for most use cases it wouldn’t make sense to buy anything smaller than 500G.

The cheapest hard drive is $45 for a 1TB disk. A 1TB SATA SSD costs $61 and a 1TB NVMe costs $79. So 1TB disks aren’t a good option for any use case.

A 2TB hard drive is $89. A 2TB SATA SSD is $118 and a 2TB NVMe is $145. I don’t think the small savings you can get from using hard drives makes them worth using for 2TB.

For most people if you have a system that’s important to you then $145 on storage isn’t a lot to spend. It seems hardly worth buying less than 2TB of storage, even for a laptop. Even if you don’t use all the space larger storage devices tend to support more writes before wearing out so you still gain from it. A 2TB NVMe device you buy for a laptop now could be used in every replacement laptop for the next 10 years. I only have 512G of storage in my laptop because I have a collection of SSD/NVMe devices that have been replaced in larger systems, so the 512G is essentially free for my laptop as I bought a larger device for a server.

For small business use it doesn’t make sense to buy anything smaller than 2TB for any system other than a router. If you buy smaller devices then you will sometimes have to pay people to install bigger ones and when the price is $145 it’s best to just pay that up front and be done with it.

Medium Storage

A 4TB hard drive is $135. A 4TB SATA SSD is $319 and a 4TB NVMe is $299. The prices haven’t changed a lot since last year, but a small increase in hard drive prices and a small decrease in SSD prices makes SSD more appealing for this market segment.

A common size range for home servers and small business servers is 4TB or 8TB of storage. To do that on SSD means about $600 for 4TB of RAID-1 or $900 for 8TB of RAID-5/RAID-Z. That’s quite affordable for that use.

For 8TB of less important storage a 8TB hard drive costs $239 and a 8TB SATA SSD costs $899 so a hard drive clearly wins for the specific case of non-RAID single device storage. Note that the U.2 devices are more competitive for 8TB than SATA but I included them in the next section because they are more difficult to install.

Serious Storage

With 8TB being an uncommon and expensive option for consumer SSDs the cheapest price is for multiple 4TB devices. To have multiple NVMe devices in one PCIe slot you need PCIe bifurcation (treating the PCIe slot as multiple slots). Most of the machines I use don’t support bifurcation and most affordable systems with ECC RAM don’t have it. For cheap NVMe type storage there are U.2 devices (the “enterprise” form of NVMe). Until recently they were too expensive to use for desktop systems but now there are PCIe cards for internal U.2 devices, $14 for a card that takes a single U.2 is a common price on AliExpress and prices below $600 for a 7.68TB U.2 device are common – that’s cheaper on a per-TB basis than SATA SSD and NVMe! There are PCIe cards that take up to 4*U.2 devices (which probably require bifurcation) which means you could have 8+ U.2 devices in one not particularly high end PC for 56TB of RAID-Z NVMe storage. Admittedly $4200 for 56TB is moderately expensive, but it’s in the price range for a small business server or a high end home server. A more common configuration might be 2*7.68TB U.2 on a single PCIe card (or 2 cards if you don’t have bifurcation) for 7.68TB of RAID-1 storage.

For SATA SSD AliExpress has a 6*2.5″ hot-swap device that fits in a 5.25″ bay for $63, so if you have 2*5.25″ bays you could have 12*4TB SSDs for 44TB of RAID-Z storage. That wouldn’t be much cheaper than 8*7.68TB U.2 devices and would be slower and have less space. But it would be a good option if PCIe bifurcation isn’t possible.

16TB SATA hard drives cost $559 which is almost exactly half the price per TB of U.2 storage. That doesn’t seem like a good deal. If you want 16TB of RAID storage then 3*7.68TB U.2 devices only costs about 50% more than 2*16TB SATA disks. In most cases paying 50% more to get NVMe instead of hard disks is a good option. As sizes go above 16TB prices go up in a more than linear manner, I guess they don’t sell much volume of larger drives.

15.36TB U.2 devices are on sale for about $1300, slightly more than twice the price of a 16TB disk. It’s within the price range of small businesses and serious home users. Also it should be noted that the U.2 devices are designed for “enterprise” levels of reliability and the hard disk prices I’m comparing to are the cheapest available. If “NAS” hard disks were compared then the price benefit of hard disks would be smaller.

Probably the biggest problem with U.2 for most people is that it’s an uncommon technology that few people have much experience with or spare parts for testing. Also you can’t buy U.2 gear at your local computer store which might mean that you want to have spare parts on hand which is an extra expense.

For enterprise use I’ve recently been involved in discussions with a vendor that sells multiple petabyte arrays of NVMe. Apparently NVMe is cheap enough that there’s no need to use anything else if you want a well performing file server.

Do Hard Disks Make Sense?

There are specific cases like comparing a 8TB hard disk to a 8TB SATA SSD or a 16TB hard disk to a 15.36TB U.2 device where hard disks have an apparent advantage. But when comparing RAID storage and counting the performance benefits of SSD the savings of using hard disks don’t seem to be that great.

Is now the time that hard disks are going to die in the market? If they can’t get volume sales then prices will go up due to lack of economy of scale in manufacture and increased stock time for retailers. 8TB hard drives are now more expensive than they were 9 months ago when I wrote my previous post, has a hard drive price death spiral already started?

SSDs are cheaper than hard disks at the smallest sizes, faster (apart from some corner cases with contiguous IO), take less space in a computer, and make less noise. At worst they are a bit over twice the cost per TB. But the most common requirements for storage are small enough and cheap enough that being twice as expensive as hard drives isn’t a problem for most people.

I predict that hard disks will become less popular in future and offer less of a price advantage. The vendors are talking about 50TB hard disks being available in future but right now you can fit more than 50TB of NVMe or U.2 devices in a volume less than that of a 3.5″ hard disk so for storage density SSD can clearly win. Maybe in future hard disks will be used in arrays of 100TB devices for large scale enterprise storage. But for home users and small businesses the current sizes of SSD cover most uses.

At the moment it seems that the one case where hard disks can really compare well is for backup devices. For backups you want large storage, good contiguous write speeds, and low prices so you can buy plenty of them.

Further Issues

The prices I’ve compared for SATA SSD and NVMe devices are all based on the cheapest devices available. I think it’s a bit of a market for lemons [2] as devices often don’t perform as well as expected and the incidence of fake products purporting to be from reputable companies is high on the cheaper sites. So you might as well buy the cheaper devices. An advantage of the U.2 devices is that you know that they will be reliable and perform well.

One thing that concerns me about SSDs is the lack of knowledge of their failure cases. Filesystems like ZFS were specifically designed to cope with common failure cases of hard disks and I don’t think we have that much knowledge about how SSDs fail. But with 3 copies of metadata BTFS or ZFS should survive unexpected SSD failure modes.

I still have some hard drives in my home server, they keep working well enough and the prices on SSDs keep dropping. But if I was buying new storage for such a server now I’d get U.2.

I wonder if tape will make a comeback for backup.

Does anyone know of other good storage options that I missed?

BTRFS Rebuild Time

In February I replaced a Dell T320 server with a HP Z640 workstation for a home server/workstation [1]. The T320 has 8*3.5″ drive bays which I had used to put 3*4TB disks in a BTRFS RAID-10 array for 6TB of usable capacity. The Z640 has only 2*3.5″ bays and 4*2.5″ bays, so one option I could have taken was to buy a 4TB 2.5″ SSD and keep the same 3*4TB array as before. Instead I chose to use an 8TB disk I had spare in an array with one of the original 4TB disks and some extra on NVMe devices (the system has 2*1TB NVMe devices which are used as a 380G RAID-1 for the root filesystem and the rest for the storage array). It’s nice how BTRFS allows putting any storage you have in a RAID-10 configuration.

Unfortunately it seems that I chose the wrong 4TB disk to use for this as it failed three days ago. It gave thousands of read and write errors and Linux decided that the drive no longer existed. I tried rebooting the system to get it in the BTRFS array again but it failed again and failed so quickly that it wasn’t even possible to use the data on it as part of a RAID rebuild. So I removed that disk and put in one of the other 4TB disks.

As the array is comprised of an 8TB disk and 3 other devices that don’t add up to 8TB the layout is the 8TB disk having one copy of everything and the other devices having parts of it. So the rebuild process comprised of copying data from the 8TB disk to the 4TB disk. For a RAID array run in the manner of Linux software RAID the rebuild of a RAID-1 involves a linear copy of data which is the optimal case for hard disks, copying 4TB of data in that manner would have an average speed of a bit over 100MB/s and take about 11 hours. With BTRFS the source disk has to be updated for each block that is recreated so the process was bottlenecked on writing to the 8TB disk. It took 2 days 23 hours to complete. The process involved reading 3,478,031MB and writing 4,405,545MB. The system was live for the process and some cron jobs etc were writing to the array, but in the 12 hours since the rebuild completed the array has had 7,038MB written. So presumably during the rebuild time about 42G of actual data were written to the array and the other 4.3TB written to the 8TB disk were from the process of copying 3.5TB from it to another device. Iostat reported that there were 645.36 TPS for the duration of the rebuild which seems like a decent number for a hard drive, during the process iostat reported that the drive had 99%+ of IO capacity used for the duration.

While waiting for this to complete I wrote a blog post about storage trends [2]. One thing I didn’t mention in that post is that if you are the type of person who checks the rebuild process fifty times a day then that should be counted as part of the cost of using slow storage. If instead of an 8TB disk plus some SSD storage I had used 2*4TB disks and 1*4TB SSD as I had considered doing then instead of having 3.8TB on one device I would have had about 2.5TB and the reconstruct would have probably taken 2/3 of the time. If I had moved the array to 3*4TB SSDs then it would have taken a small fraction of the time.

One thing to note is that I made a mistake in this operation by removing the failed device instead of doing a “btrfs replace” operation which can be significantly faster. If I had correctly done this then I would have written a blog post about the rebuild taking 2 days or something, the issues of hard drives being slow and me compulsively checking the progress would still apply.

3

Storage Trends 2023

It’s been 2 years since my last blog post about storage trends [1].

Minimum Storage <=2TB

In 2021 I stated that as MSY had 2TB disks for $72 and 2TB SSD for $245 it was barely worth considering a 2TB disk and anything less than 2TB wasn’t worth considering. Now for 2TB storage from MSY NVMe starts at $129, SATA SSD starts at $143, and hard disks start at $75. I guess that NVMe is slightly cheaper due to some combination of economies of scale for manufacture/sales and having less postage costs. It really doesn’t make sense to consider hard disks for storing 2TB or less.

For storage for a small system (PC or laptop) the cheapest storage device is $19 for a 128G SATA SSD. But it wouldn’t make sense to buy that when you can get a 256G SATA SSD for $22 or a 240G NVMe device for $23, saving $3 on storage wouldn’t make any sense. For 512G of storage the prices are $32 for NVMe and $33 for SATA SSD. For 1TB of storage the prices start at $68 for SATA SSD and $74 for NVMe. Probably for the vast majority of home users 1TB of SATA SSD or NVMe is the minimum storage capacity to consider, the $50 price difference isn’t much when considering the entire price of a PC or laptop and anything less than 1TB will run out quickly with modern use.

Larger Storage 4TB+

The price for 4TB of storage from MSY is NVMe starting at $349, SATA SSD starting at $369, and hard disks starting at $115. If you need 4TB of RAID-1 storage then it might be worth saving $470 and getting hard drives for a home user. For business use it wouldn’t make sense. Some laptops have two NVMe sockets so 8TB of storage (or 4TB of RAID-1) in a laptop would be interesting.

For 8TB of storage the MSY prices are SATA SSD for $739 and hard drives starting at $179. Probably hard drives are the best choice for most situations where there is a need to store 8TB or more of data. But the prices are low enough to make 8TB SSD something that can be considered for home use, it doesn’t seem that long ago that the 4TB hard drives I bought for my home server were almost that expensive.

Big Storage

MSY doesn’t have 8TB NVMe, such devices are on eBay for $1700 for regular M.2 NVMe and just under $1000 for U.2 (server hot-swap devices). So if you need more than 8TB of NVMe storage then probably buying a server with U.2 built in is the correct solution.

For home users who need more than 8TB of storage hard drives are a good solution. One issue is that the more affordable and larger drives use Shingled Magnetic Recording (SMR) which has some different performance characteristics for certain workloads. Apparently SMR performs badly for anything other than large file storage.

Why MSY?

I primarily used MSY prices for this post because they are a reliable local store that has a list of prices that is easy to read. For everything in this post I can get better prices by using eBay, the StaticIce.com.au price comparison site [2], and the computing section of the OzBargain site that gamifies finding good prices [3]. But a good shopping strategy nowadays is to compare prices in a store to determine what items are in your price range and then shop around for price on the item you want. Checking all the different bargain sites for all these items would take much more time than I want to spend writing a blog post!

Conclusion

Hard drives don’t make sense for the vast majority of systems. Not for laptops, not for typical desktop PCs, and not for small business servers (say 8TB or less of RAID storage). Hard drives only make sense for dozens or hundreds of TB of storage and even then finding out how to deal with SMR issues is going to increase the pain of deployment. Maybe using a combination of SSD and hard drives to deal with the SMR issues is going to be a competitive advantage for NAS vendors in future.

NVMe looks like it’s on the way to being cheaper than SATA SSD. There is likely to be a good market for systems with NVMe as the only internal storage option.

The long term trend of systems without DVD drives and with maybe 2.5″ SATA devices but no 3.5″ SATA devices seems to lead to the GPU being the major part that needs to fit into a PC case that determines the overall size. Maybe there will be a new trend of GPUs connected to riser cards so they can be parallel to the motherboard for compact PCs.

For business desktop systems (IE low powered graphics hardware as it’s not for gaming) I expect that the trend will be towards NUC type devices which are already based around the M.2 as a storage device size.

Storage Trends 2021

The Viability of Small Disks

Less than a year ago I wrote a blog post about storage trends [1]. My main point in that post was that disks smaller than 2TB weren’t viable then and 2TB disks wouldn’t be economically viable in the near future.

Now MSY has 2TB disks for $72 and 2TB SSD for $245, saving $173 if you get a hard drive (compared to saving $240 10 months ago). Given the difference in performance and noise 2TB hard drives won’t be worth using for most applications nowadays.

NVMe vs SSD

Last year NVMe prices were very comparable for SSD prices, I was hoping that trend would continue and SSDs would go away. Now for sizes 1TB and smaller NVMe and SSD prices are very similar, but for 2TB the NVMe prices are twice that of SSD – presumably partly due to poor demand for 2TB NVMe. There are also no NVMe devices larger than 2TB on sale at MSY (a store which caters to home stuff not special server equipment) but SSDs go up to 8TB.

It seems that NVMe is only really suitable for workstation storage and for cache etc on a server. So SATA SSDs will be around for a while.

Small Servers

There are a range of low end servers which support a limited number of disks. Dell has 2 disk servers and 4 disk servers. If one of those had 8TB SSDs you could have 8TB of RAID-1 or 24TB of RAID-Z storage in a low end server. That covers the vast majority of servers (small business or workgroup servers tend to have less than 8TB of storage).

Larger Servers

Anandtech has an article on Seagates roadmap to 120TB disks [2]. They currently sell 20TB disks using HAMR technology

Currently the biggest disks that MSY sells are 10TB for $395, which was also the biggest disk they were selling last year. Last year MSY only sold SSDs up to 2TB in size (larger ones were available from other companies at much higher prices), now they sell 8TB SSDs for $949 (4* capacity increase in less than a year). Seagate is planning 30TB disks for 2023, if SSDs continue to increase in capacity by 4* per year we could have 128TB SSDs in 2023. If you needed a server with 100TB of storage then having 2 or 3 SSDs in a RAID array would be much easier to manage and faster than 4*30TB disks in an array.

When you have a server with many disks you can expect to have more disk failures due to vibration. One time I built a server with 18 disks and took disks from 2 smaller servers that had 4 and 5 disks. The 9 disks which had been working reliably for years started having problems within weeks of running in the bigger server. This is one of the many reasons for paying extra for SSD storage.

Seagate is apparently planning 50TB disks for 2026 and 100TB disks for 2030. If that’s the best they can do then SSD vendors should be able to sell larger products sooner at prices that are competitive. Matching hard drive prices is not required, getting to less than 4* the price should be enough for most customers.

The Anandtech article is worth reading, it mentions some interesting features that Seagate are developing such as having 2 actuators (which they call Mach.2) so the drive can access 2 different tracks at the same time. That can double the performance of a disk, but that doesn’t change things much when SSDs are more than 100* faster. Presumably the Mach.2 disks will be SAS and incredibly expensive while providing significantly less performance than affordable SATA SSDs.

Computer Cases

In my last post I speculated on the appearance of smaller cases designed to not have DVD drives or 3.5″ hard drives. Such cases still haven’t appeared apart from special purpose machines like the NUC that were available last year.

It would be nice if we could get a new industry standard for smaller power supplies. Currently power supplies are expected to be almost 5 inches wide (due to the expectation of a 5.25″ DVD drive mounted horizontally). We need some industry standards for smaller PCs that aren’t like the NUC, the NUC is very nice, but most people who build their own PC need more space than that. I still think that planning on USB DVD drives is the right way to go. I’ve got 4PCs in my home that are regularly used and CDs and DVDs are used so rarely that sharing a single DVD drive among all 4 wouldn’t be a problem.

Conclusion

I’m tempted to get a couple of 4TB SSDs for my home server which cost $487 each, it currently has 2*500G SSDs and 3*4TB disks. I would have to remove some unused files but that’s probably not too hard to do as I have lots of old backups etc on there. Another possibility is to use 2*4TB SSDs for most stuff and 2*4TB disks for backups.

I’m recommending that all my clients only use SSDs for their storage. I only have one client with enough storage that disks are the only option (100TB of storage) but they moved all the functions of that server to AWS and use S3 for the storage. Now I don’t have any clients doing anything with storage that can’t be done in a better way on SSD for a price difference that’s easy for them to afford.

Affordable SSD also makes RAID-1 in workstations more viable. 2 disks in a PC is noisy if you have an office full of them and produces enough waste heat to be a reliability issue (most people don’t cool their offices adequately on weekends). 2 SSDs in a PC is no problem at all. As 500G SSDs are available for $73 it’s not a significant cost to install 2 of them in every PC in the office (more cost for my time than hardware). I generally won’t recommend that hard drives be replaced with SSDs in systems that are working well. But if a machine runs out of space then replacing it with SSDs in a RAID-1 is a good choice.

Moore’s law might cover SSDs, but it definitely doesn’t cover hard drives. Hard drives have fallen way behind developments of most other parts of computers over the last 30 years, hopefully they will go away soon.

ZFS 2.0.0 Released

Version 2.0 of ZFS has been released, it’s now known as OpenZFS and has a unified release for Linux and BSD which is nice.

One new feature is persistent L2ARC (which means that when you use SSD or NVMe to cache hard drives that cache will remain after a reboot) is an obvious feature that was needed for a long time.

The Zstd compression invented by Facebook is the best way of compressing things nowadays, it’s generally faster while giving better compression than all other compression algorithms and it’s nice to have that in the filesystem.

The PAM module for encrypted home directories is interesting, I haven’t had a need for directory level encryption as block device encryption has worked well for my needs. But there are good use cases for directory encryption.

I just did a quick test of OpenZFS on a VM, the one thing it doesn’t do is let me remove a block device accidentally added to a zpool. If you have a zpool with a RAID-Z array and mistakenly add a new disk as a separate device instead of replacing a part of the RAID-Z then you can’t remove it. There are patches to allow such removal but they didn’t appear to get in to the OpenZFS 2.0.0 release.

For my use ZFS on Linux 0.8.5 has been working fairly well and apart from the inability to remove a mistakenly added device I haven’t had any problems with it. So I have no immediate plans to upgrade existing servers. But if I was about to install a new ZFS server I’d definitely use the latest version.

People keep asking about license issues. I’m pretty sure that Oracle lawyers have known about sites like zfsonlinux.org for years, the fact that they have decided not to do anything about it seems like clear evidence that it’s OK for me to keep using that code.

Storage Trends

In considering storage trends for the consumer side I’m looking at the current prices from MSY (where I usually buy computer parts). I know that other stores will have slightly different prices but they should be very similar as they all have low margins and wholesale prices are the main factor.

Small Hard Drives Aren’t Viable

The cheapest hard drive that MSY sells is $68 for 500G of storage. The cheapest SSD is $49 for 120G and the second cheapest is $59 for 240G. SSD is cheaper at the low end and significantly faster. If someone needed about 500G of storage there’s a 480G SSD for $97 which costs $29 more than a hard drive. With a modern PC if you have no hard drives you will notice that it’s quieter. For anyone who’s buying a new PC spending an extra $29 is definitely worthwhile for the performance, low power use, and silence.

The cheapest 1TB disk is $69 and the cheapest 1TB SSD is $159. Saving $90 on the cost of a new PC probably isn’t worth while.

For 2TB of storage the cheapest options are Samsung NVMe for $339, Crucial SSD for $335, or a hard drive for $95. Some people would choose to save $244 by getting a hard drive instead of NVMe, but if you are getting a whole system then allocating $244 to NVMe instead of a faster CPU would probably give more benefits overall.

Computer stores typically have small margins and computer parts tend to quickly either become cheaper or be obsoleted by better parts. So stores don’t want to stock parts unless they will sell quickly. Disks smaller than 2TB probably aren’t going to be profitable for stores for very long. The trend of SSD and NVMe becoming cheaper is going to make 2TB disks non-viable in the near future.

NVMe vs SSD

M.2 NVMe devices are at comparable prices to SATA SSDs. For some combinations of quality and capacity NVMe is about 50% more expensive and for some it’s slightly cheaper (EG Intel 1TB NVMe being cheaper than Samsung EVO 1TB SSD). Last time I checked about half the motherboards on sale had a single M.2 socket so for a new workstation that doesn’t need more than 2TB of storage (the largest NVMe that MSY sells) it wouldn’t make sense to use anything other than NVMe.

The benefit of NVMe is NOT throughput (even though NVMe devices can often sustain over 4GB/s), it’s low latency. Workstations can’t properly take advantage of this because RAM is so cheap ($198 for 32G of DDR4) that compiles etc mostly come from cache and because most filesystem writes on workstations aren’t synchronous. For servers a large portion of writes are synchronous, for example a mail server can’t acknowledge receiving mail until it knows that it’s really on disk, so there’s a lot of small writes that block server processes and the low latency of NVMe really improves performance. If you are doing a big compile on a workstation (the most common workstation task that uses a lot of disk IO) then the writes aren’t synchronised to disk and if the system crashes you will just do all the compilation again. While NVMe doesn’t give a lot of benefit over SSD for workstation use (I’ve uses laptops with SSD and NVMe and not noticed a great difference) of course I still want better performance. ;)

Last time I checked I couldn’t easily buy a PCIe card that supported 2*NVMe cards, I’m sure they are available somewhere but it would take longer to get and probably cost significantly more than twice as much. That means a RAID-1 of NVMe takes 2 PCIe slots if you don’t have an M.2 socket on the motherboard. This was OK when I installed 2*NVMe devices on a server that had 18 disks and lots of spare PCIe slots. But for some systems PCIe slots are an issue.

My home server has all PCIe slots used by a video card and Ethernet cards and the BIOS probably won’t support booting from NVMe. It’s a Dell server so I can’t just replace the motherboard with one that has more PCIe slots and M.2 on the motherboard. As it’s running nicely and doesn’t need replacing any time soon I won’t be using NVMe for home server stuff.

Small Servers

Most servers that I am responsible for have less than 2TB of storage. For my clients I now only recommend SSD storage for small servers and am recommending SSD for replacing any failed disks.

My home server has 2*500G SSDs in a BTRFS RAID-1 for the root filesystem, and 3*4TB disks in a BTRFS RAID-1 for storing big files. I bought the SSDs when 500G SSDs were about $250 each and bought 2*4TB disks when they were about $350 each. Currently that server has about 3.3TB of space used and I could probably get it down to about 2.5TB if I deleted things I don’t really need. If I was getting storage for that server now I’d use 2*2TB SSDs and 3*1TB hard drives for the stuff that doesn’t fit on SSDs (I have some spare 1TB disks that came with servers). If I didn’t have spare hard drives I’d get 3*2TB SSDs for that sort of server which would give 3TB of BTRFS RAID-1 storage.

Last time I checked Dell servers had a card for supporting M.2 as an optional extra so Dells probably won’t boot from NVMe without extra expense.

Ars Technica has an informative article about WD selling SMR disks as “NAS” disks [1]. The Shingled Magnetic Recording technology allows greater storage density on a platter which leads to either larger capacity or cheaper disks but at the cost of lower write performance and apparently extremely bad latency in some situations. NAS disks are supposed to be low latency as the expectation is that they will be used in a RAID array and kicked out of the array if they have problems. There are reports of ZFS kicking SMR disks from RAID sets. I think this will end the use of hard drives for small servers. For a server you don’t want to deal with this sort of thing, by definition when a server goes down multiple people will stop work (small server implies no clustering). Spending extra to get SSDs just to avoid the risk of unexpected SMR would be a good plan.

Medium Servers

The largest SSD and NVMe devices that are readily available are 2TB but 10TB disks are commodity items, there are reports of 20TB hard drives being available but I can’t find anyone in Australia selling them.

If you need to store dozens or hundreds of terabytes than hard drives have to be part of the mix at this time. There’s no technical reason why SSDs larger than 10TB can’t be made (the 2.5″ SATA form factor has more than 5* the volume of a 2TB M.2 card) and it’s likely that someone sells them outside the channels I buy from, but probably at a price higher than what my clients are willing to pay. If you want 100TB of affordable storage then a mid range server like the Dell PowerEdge T640 which can have up to 18*3.5″ disks is good. One of my clients has a PowerEdge T630 with 18*3.5″ disks in the 8TB-10TB range (we replace failed disks with the largest new commodity disks available, it used to have 6TB disks). ZFS version 0.8 introduced a “Special VDEV Class” which stores metadata and possibly small data blocks on faster media. So you could have some RAID-Z groups on hard drives for large storage and the metadata on a RAID-1 on NVMe for fast performance. For medium size arrays on hard drives having a “find /” operation take hours is not uncommon, for large arrays having it take days isn’t that uncommon. So far it seems that ZFS is the only filesystem to have taken the obvious step of storing metadata on SSD/NVMe while bulk data is on cheap large disks.

One problem with large arrays is that the vibration of disks can affect the performance and reliability of nearby disks. The ZFS server I run with 18 disks was originally setup with disks from smaller servers that never had ZFS checksum errors, but when disks from 2 small servers were put in one medium size server they started getting checksum errors presumably due to vibration. This alone is a sufficient reason for paying a premium for SSD storage.

Currently the cost of 2TB of SSD or NVMe is between the prices of 6TB and 8TB hard drives, and the ratio of price/capacity for SSD and NVMe is improving dramatically while the increase in hard drive capacity is slow. 4TB SSDs are available for $895 compared to a 10TB hard drive for $549, so it’s 4* more expensive on a price per TB. This is probably good for Windows systems, but for Linux systems where ZFS and “special VDEVs” is an option it’s probably not worth considering. Most Linux user cases where 4TB SSDs would work well would be better served by smaller NVMe and 10TB disks running ZFS. I don’t think that 4TB SSDs are at all popular at the moment (MSY doesn’t stock them), but prices will come down and they will become common soon enough. Probably by the end of the year SSDs will halve in price and no hard drives less than 4TB will be viable.

For rack mounted servers 2.5″ disks have been popular for a long time. It’s common for vendors to offer 2 versions of a rack mount server for 2.5″ and 3.5″ disks where the 2.5″ version takes twice as many disks. If the issue is total storage in a server 4TB SSDs can give the same capacity as 8TB HDDs.

SMR vs Regular Hard Drives

Rumour has it that you can buy 20TB SMR disks, I haven’t been able to find a reference to anyone who’s selling them in Australia (please comment if you know who sells them and especially if you know the price). I expect that the ZFS developers will soon develop a work-around to solve the problems with SMR disks. Then arrays of 20TB SMR disks with NVMe for “special VDEVs” will be an interesting possibility for storage. I expect that SMR disks will be the majority of the hard drive market by 2023 – if hard drives are still on the market. SSDs will be large enough and cheap enough that only SMR disks will offer enough capacity to be worth using.

I think that it is a possibility that hard drives won’t be manufactured in a few years. The volume of a 3.5″ disk is significantly greater than that of 10 M.2 devices so current technology obviously allows 20TB of NVMe or SSD storage in the space of a 3.5″ disk. If the price of 16TB NVMe and SSD devices comes down enough (to perhaps 3* the price of a 20TB hard drive) almost no-one would want the hard drive and it wouldn’t be viable to manufacture them.

It’s not impossible that in a few years time 3D XPoint and similar fast NVM technologies occupy the first level of storage (the ZFS “special VDEV”, OS swap device, log device for database servers, etc) and NVMe occupies the level for bulk storage with no space left in the market for spinning media.

Computer Cases

For servers I expect that models supporting 3.5″ storage devices will disappear. A 1RU server with 8*2.5″ storage devices or a 2RU server with 16*2.5″ storage devices will probably be of use to more people than a 1RU server with 4*3.5″ or a 2RU server with 8*3.5″.

My first IBM PC compatible system had a 5.25″ hard drive, a 5.25″ floppy drive, and a 3.5″ floppy drive in 1988. My current PC is almost a similar size and has a DVD drive (that I almost never use) 5 other 5.25″ drive bays that have never been used, and 5*3.5″ drive bays that I have never used (I have only used 2.5″ SSDs). It would make more sense to have PC cases designed around 2.5″ and maybe 3.5″ drives with no more than one 5.25″ drive bay.

The Intel NUC SFF PCs are going in the right direction. Many of them only have a single storage device but some of them have 2*M.2 sockets allowing RAID-1 of NVMe and some of them support ECC RAM so they could be used as small servers.

A USB DVD drive costs $36, it doesn’t make sense to have every PC designed around the size of an internal DVD drive that will probably only be used to install the OS when a $36 USB DVD drive can be used for every PC you own.

The only reason I don’t have a NUC for my personal workstation is that I get my workstations from e-waste. If I was going to pay for a PC then a NUC is the sort of thing I’d pay to have on my desk.

4

Finding Storage Performance Problems

Here are some basic things to do when debugging storage performance problems on Linux. It’s deliberately not an advanced guide, I might write about more advanced things in a later post.

Disk Errors

When a hard drive is failing it often has to read sectors several times to get the right data, this can dramatically reduce performance. As most hard drives aren’t monitored properly (email or SMS alerts on errors) it’s quite common for the first notification about an impending failure to be user complaints about performance.

View your kernel message log with the dmesg command and look in /var/log/kern.log (or wherever your system is configured to store kernel logs) for messages about disk read errors, bus resetting, and anything else unusual related to the drives.

If you use an advanced filesystem like BTRFS or ZFS there are system commands to get filesystem information about errors. For BTRFS you can run “btrfs device stats MOUNTPOINT” and for ZFS you can run “zpool status“.

Most performance problems aren’t caused by failing drives, but it’s a good idea to eliminate that possibility before you continue your investigation.

One other thing to look out for is a RAID array where one disk is noticeably slower than the others. For example if you have a RAID-5 or RAID-6 array every drive should have almost the same number of reads and writes, if one disk in the array is at 99% performance capacity and the other disks are at 5% then it’s an indication of a failing disk. This can happen even if SMART etc don’t report errors.

Monitoring IO

The iostat program in the Debian sysstat package tells you how much IO is going to each disk. If you have physical hard drives sda, sdb, and sdc you could run the command “iostat -x 10 sda sdb sdc” to tell you how much IO is going to each disk over 10 second periods. You can choose various durations but I find that 10 seconds is long enough to give results that are useful.

By default iostat will give stats on all block devices including LVM volumes, but that usually gives too much data to analyse easily.

The most useful things that iostat tells you are the %util (the percentage utilisation – anything over 90% is a serious problem), the reads per second “r/s“, and the writes per second “w/s“.

The parameters to iostat for block devices can be hard drives, partitions, LVM volumes, encrypted devices, or any other type of block device. After you have discovered which block devices are nearing their maximum load you can discover which of the partitions, RAID arrays, or swap devices on that disk are causing the load in question.

The iotop program in Debian (package iotop) gives a display that’s similar to that of top but for disk io. It generally isn’t essential (you can run “ps ax|grep D” to get most of that information), but it is handy. It will tell you which programs are causing IO on a busy filesystem. This can be good when you have a busy system and don’t know why. It isn’t very useful if you have a system that is used for one task, EG a database server that is known to be busy doing database stuff.

It’s generally a good idea to have sysstat and iotop installed on all systems. If a system is experiencing severe performance problems you might not want to wait for new packages to be installed.

In Debian the sysstat package includes the sar utility which can give historical information on system load. One benefit of using sar for diagnosing performance problems is that it shows you the time of day that has the most load which is the easiest time to diagnose performance problems.

Swap Use

Swap use sometimes confuses people. In many cases swap use decreases overall disk use, this is the design of the Linux paging algorithms. So if you have a server that accesses a lot of data it might swap out some unused programs to make more space for cache.

When you have multiple virtual machines on one system sharing the same disks it can be difficult to determine the best allocation for RAM. If one VM has some applications allocating a lot of RAM but not using it much then it might be best to give it less RAM and force those applications into swap so that another VM can cache all the data it accesses a lot.

The important thing is not the amount of swap that is allocated but the amount of IO that goes to the swap partition. Any significant amount of disk IO going to a swap device is a serious problem that can be solved by adding more RAM.

Reads vs Writes

The ratio of reads to writes depends on the applications and the amount of RAM. Some applications can have most of their reads satisfied from cache. For example an ideal configuration of a mail server will have writes significantly outnumber reads (I’ve seen ratios of 5:1 for writes to reads on real mail servers). Ideally a mail server will cache all new mail for at least an hour and as the most prolific users check their mail more frequently than that most mail will be downloaded before it leaves the cache. If you have a mail server with reads outnumbering writes then it needs more RAM. RAM is cheap nowadays so if you don’t want to compete with Gmail it should be cheap to buy enough RAM to cache all recent mail.

The ratio of reads to writes is important because it’s one way of quickly determining if you have enough RAM and adding RAM is often the cheapest way of improving performance.

Unbalanced IO

One common performance problem on systems with multiple disks is having more load going to some disks than to others. This might not be a problem (EG having cron jobs run on disks that are under heavy load while the web server accesses data from lightly loaded disks). But you need to consider whether it’s desirable to have some disks under more load than others.

The simplest solution to this problem is to just have a single RAID array for all data storage. This is also the solution that gives you the maximum available disk space if you use RAID-5 or RAID-6.

A more complex option is to use some SSDs for things that require performance and disks for things that don’t. This can be done with the ZIL and L2ARC features of ZFS or by just creating a filesystem on SSD for the data that is most frequently accessed.

What Did I Miss?

I’m sure that I missed something, please let me know of any other basic things to do – or suggestions for a post on more advanced things.

1

Booting GPT

I’m installing new 4TB disks on an older Dell server, it’s a PowerEdge T110 with a G6950 CPU so it’s not really old, but it’s a couple of generations behind the latest Dell servers.

I tried to enable UEFI booting, but when I turned that option on the system locked up during the BIOS process (wouldn’t boot from the CD or take keyboard input). So I had to make it boot with a BIOS compatible MBR and a GPT partition table.

Number  Start (sector)    End (sector)  Size      Code  Name
  1            2048            4095  1024.0 KiB  EF02  BIOS boot partition
  2            4096        25169919  12.0 GiB    FD00  Linux RAID
  3        25169920      7814037134  3.6 TiB    8300  Linux filesystem

After spending way to much time reading various web pages I discovered that the above partition table works. The 1MB partition is for GRUB code and needs to be enabled by a parted command such as the following:

parted /dev/sda set 1 bios_grub on

/dev/sda2 is a RAID-1 array used for the root filesystem. If I was installing a non-RAID system I’d use the same partition table but with a type of 8300 instead of FD00. I have a RAID-1 array over sda2 and sdb2 for the root filesystem and sda3, sdb3, sdc3, sdd3, and sde3 are used for a RAID-Z array. I’m reserving space for the root filesystem on all 5 disks because it seems like a good idea to use the same partition table and the 12G per disk that is unused on sdc, sdd, and sde isn’t worth worrying about when dealing with 4TB disks.

1

More BTRFS Fun

I wrote a BTRFS status report yesterday commenting on the uneventful use of BTRFS recently [1].

Early this morning the server that stores my email (which had 93 days uptime) had a filesystem related problem. The root filesystem became read-only and then the kernel message log filled with unrelated messages so there was no record of the problem. I’m now considering setting up rsyslogd to log the kernel messages to a tmpfs filesystem to cover such problems in future. As RAM is so cheap it wouldn’t matter if a few megs of RAM were wasted by that in normal operation if it allowed me to extract useful data when something goes really wrong. It’s really annoying to have a system in a state where I can login as root but not find out what went wrong.

After that I tried 2 kernels in the 3.14 series, both of which had kernel BUG assertions related to Xen networking and failed to network correctly, I filed Debian Bug #756714. Fortunately they at least had enough uptime for me to run a filesystem scrub which reported no errors.

Then I reverted to kernel 3.13.10 but the reboot to apply that kernel change failed. Systemd was unable to umount the root filesystem (maybe because of a problem with Xen) and then hung the system instead of rebooting, I filed Debian Bug #756725. I believe that if asked to reboot a system there is no benefit in hanging the system with no user space processes accessible. Here are some useful things that systemd could have done:

  1. Just reboot without umounting (like “reboot -nf” does).
  2. Pause for some reasonable amount of time to give the sysadmin a possibility of seeing the error and then rebooting.
  3. Go back to a regular runlevel, starting daemons like sshd.
  4. Offer a login prompt to allow the sysadmin to login as root and diagnose the problem.

Options 1, 2, and 3 would have saved me a bit of driving. Option 4 would have allowed me to at least diagnose the problem (which might be worth the drive).

Having a system on the other side of the city which has no remote console access just hang after a reboot command is not useful, it would be near the top of the list of things I don’t want to happen in that situation. The best thing I can say about systemd’s operation in this regard is that it didn’t make the server catch fire.

Now all I really know is that 3.14 kernels won’t work for my server, 3.13 will cause problems that no-one can diagnose due to lack of data, and I’m now going to wait for it to fail again. As an aside the server has ECC RAM and it’s hardware is known to be good, so I’m sure that BTRFS is at fault.