This series of posts covers APFS, Apple’s new filesystem announced at WWDC 2016. See the first post for the table of contents.
Data Integrity
Arguably the most important job of a file system is preserving data integrity. Here’s my data, don’t lose it, don’t change it. If file systems could be trusted absolutely then the “only” reason for backup would be the idiot operators (i.e. you and me). There are a few mechanisms that file systems employ to keep data safe.
Redundancy
APFS makes no claims with regard to data redundancy. As Apple’s Eric Tamura noted at WWDC, most Apple devices have a single storage device (i.e. one logical SSD) making RAID, for example, moot. Instead redundancy comes from lower layers such as Apple RAID (apparently a thing), hardware RAID controllers, SANs, or even the “single” storage devices themselves.
As an aside, note that SSDs in most Apple products where APFS will run include multiple more-or-less independent NAND chips. High-end SSDs do implement data redundancy within the device, but it comes at the price of reduced capacity and performance. As noted above, the “flash-optimization” of APFS doesn’t actually extend much below the surface of the standard block device interface, but the raw materials for innovation are there.
Also, APFS removes the most common way of a user achieving local data redundancy: copying files. A copied file in APFS actually creates a lightweight clone with no duplicated data. Corruption of the underlying device would mean that both “copies” were damaged whereas with full copies localized data corruption would affect just one.
Crash Consistency
Computer systems can fail at any time (crashes, bugs, power outages, etc.), so file systems need to anticipate and recover from these scenarios. The old old old school method is to plod along and then have a special utility to check and repair the file system during boot (fsck, short for file system check). More modern systems labor to achieve an always consistent format, or only narrow windows of inconsistency, obviating the need for the full, expensive fsck. ZFS, for example, builds up new state on disk and then transitions from the previous state to the new one with a single atomic operation.
Overwriting data creates the most obvious opening for inconsistency. If the file system needs to overwrite several regions there is a window where some regions represent the new state and some represent the former state. Copy-on-write (COW) is a method to avoid this by always allocating new regions and then releasing old ones for reuse rather than modifying data in-place. APFS claims to implement a “novel copy-on-write metadata scheme”; APFS lead developer Dominic Giampaolo emphasized the novelty of this approach without delving into the details. In conversation later, he made it clear that APFS does not employ the ZFS mechanism of copying all metadata above changed user data which allows for a single, atomic update of the file system structure.
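To make the copy-on-write idea concrete, here is a minimal sketch in Python. It is a toy in-memory model, not a description of how APFS or ZFS lays out data; the `CowStore` class and its method names are invented purely for illustration.

```python
class CowStore:
    """Toy copy-on-write store: existing blocks are never modified in place."""

    def __init__(self):
        self.blocks = {}   # block id -> immutable bytes
        self.next_id = 0
        self.root = {}     # name -> block id; the current consistent state

    def _allocate(self, data: bytes) -> int:
        # Always hand out a fresh block rather than reusing a live one.
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = data
        return block_id

    def update(self, changes: dict) -> None:
        # 1. Write every changed region into newly allocated blocks.
        new_root = dict(self.root)
        for name, data in changes.items():
            new_root[name] = self._allocate(data)
        old_root = self.root
        # 2. Switch from the old state to the new one in a single step.
        self.root = new_root
        # 3. Only now release blocks the new state no longer references.
        live = set(new_root.values())
        for block_id in old_root.values():
            if block_id not in live:
                del self.blocks[block_id]

    def read(self, name: str) -> bytes:
        return self.blocks[self.root[name]]


store = CowStore()
store.update({"a": b"1", "b": b"2"})
store.update({"a": b"3"})
print(store.read("a"), store.read("b"))
```

A crash before step 2 leaves the old root and all of its blocks untouched; a crash after it leaves the new state complete. There is no window in which a reader sees a mix of the two.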
It’s surprising to see that APFS includes fsck_apfs; even after asking Dominic I’m not sure why it would be necessary. For comparison I don’t believe there’s been an instance where fsck for ZFS would have found a problem that the file system itself didn’t already know how to detect. But Dominic was just as confused about why ZFS would forgo fsck, so perhaps it’s just a matter of opinion.
Checksums
Notably absent from the APFS intro talk was any mention of checksums. A checksum is a digest or summary of data used to detect (and correct) data errors. The story here is surprisingly nuanced. APFS checksums its own metadata but not user data. The justification for checksumming metadata is strong: there’s relatively little of it (so the checksums don’t consume much storage) and losing metadata can cast a potentially huge shadow of data loss. If, for example, metadata for a top level directory is corrupted then potentially all data on the disk could be rendered inaccessible. ZFS duplicates metadata (and triplicates top-level metadata) for exactly this reason.
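As a rough illustration of what a checksum buys you, the sketch below stores a digest next to a block and later notices a single flipped bit. SHA-256 is used only because it ships with Python; the helper names are made up, and nothing here reflects the checksum APFS actually uses for its metadata.

```python
import hashlib


def checksum(block: bytes) -> bytes:
    """Digest stored alongside the block when it is written."""
    return hashlib.sha256(block).digest()


def flip_bit(block: bytes, bit_index: int) -> bytes:
    """Simulate silent corruption by inverting one bit of the block."""
    corrupted = bytearray(block)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


block = bytes(4096)                 # a 4KB block of zeros
stored = checksum(block)            # recorded at write time
damaged = flip_bit(block, 12345)    # one bit changes on the device

print(checksum(block) == stored)    # True: intact block verifies
print(checksum(damaged) == stored)  # False: corruption is detected
```

Note that the digest on its own only detects the damage. Repairing it requires a second good copy of the data, which is why ZFS pairs checksums with duplicated metadata.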
Explicitly not checksumming user data is a little more interesting. The APFS engineers I talked to cited strong ECC protection within Apple storage devices. Both flash SSDs and magnetic media HDDs use redundant data to detect and correct errors. The engineers contend that Apple devices basically don’t return bogus data. NAND uses extra data, e.g. 128 bytes per 4KB page, so that errors can be corrected and detected. (For reference, ZFS uses a fixed size 32 byte checksum for blocks ranging from 512 bytes to megabytes. That’s small by comparison, but bear in mind that the SSD’s ECC is required for the expected analog variances within the media.) The devices have a bit error rate that’s tiny enough to expect no errors over the device’s lifetime. In addition, there are other sources of device errors where a file system’s redundant check could be invaluable. SSDs have a multitude of components, and in volume consumer products they rarely contain end-to-end ECC protection leaving the possibility of data being corrupted in transit. Further, their complex firmware can (does) contain bugs that can result in data loss.
The Apple folks were quite interested in my experience with regard to bit rot (aging data silently losing integrity) and other device errors. I’ve seen many instances where devices raised no error but ZFS (correctly) detected corrupted data. Apple has some of the most stringent device qualification tests for its vendors; I trust that they really do procure the best components. Apple engineers I spoke with claimed that bit rot was not a problem for users of their devices, but if your software can’t detect errors then you have no idea how your devices really perform in the field. ZFS has found data corruption on multi-million dollar storage arrays; I would be surprised if it didn’t find errors coming from TLC (i.e. the cheapest) NAND chips in some of Apple’s devices. Recall the (fairly) recent brouhaha regarding storage problems in the high capacity iPhone 6. At least some of Apple’s devices have been imperfect.
As someone who has data he cares about on a Mac, who has seen data lost from HFS, and who knows that even expensive, enterprise-grade equipment can lose data, I would gladly sacrifice 16 bytes per 4KB–less than 1% of my device’s size.
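The overhead figures are easy to check. The short calculation below uses only the sizes quoted in this post (16 or 32 bytes of checksum, and 128 bytes of NAND ECC, per 4KB); the function name is just for illustration.

```python
PAGE_BYTES = 4096  # one 4KB page


def overhead(extra_bytes: int, page_bytes: int = PAGE_BYTES) -> float:
    """Fraction of extra storage needed per page."""
    return extra_bytes / page_bytes


for label, extra in [
    ("16-byte checksum", 16),
    ("32-byte ZFS checksum", 32),
    ("128-byte NAND ECC", 128),
]:
    print(f"{label}: {overhead(extra) * 100:.4f}% of a 4KB page")
```

Either checksum size stays under 1%, well below what the NAND already spends on ECC.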
Scrub
As data ages you might occasionally want to check for bit rot. Likely fsck_apfs can accomplish this; as noted though there’s no data redundancy and no checksums for user data, so scrub would only help to find problems and likely wouldn’t help to correct them. And if it makes it any easier for Apple to reverse course, let’s say it’s for the el cheap-o drive I bought from Fry’s not for the gold-plated device I got from Apple.
Next in this series: Conclusions
42 Responses
The big question for Apple: if “Apple devices basically don’t return bogus data”, then why bother with checksums for metadata? Isn’t that just wasteful complexity? Why can we trust one case but not the other?
Besides, now that the only way to expand my Mac’s storage capacity is to buy an external third-party hard disk, I don’t see how it’s relevant what only Apple devices do. On my Mac, I store lots of data on non-Apple hardware.
I have a case for my iPhone because it has to deal with environments that are dirtier than an Apple Store, and I want checksums for my data because it has to deal with environments that are less reliable than a brand new Mac with a stock SSD.
I guess that at Apple, either each storage device will ship encased in a Faraday cage, or the posse there hasn’t yet heard of Solar flares and other forms of electromagnetic radiation which regularly penetrate our already weakened atmosphere.
I myself have had ZFS detect and automatically heal bit rot of a single bit in the storage pool, on a machine using expensive, enterprise grade SCSI drives. For the record, the system was underground, and the radiation went through a layer of both concrete and earth.
I also wonder what will happen when the users attach cheap consumer grade drives to their devices?
Luckily, there is a port of OpenZFS for OS X; I know what I have to do, but pity the duplication of effort, and all because Apple wouldn’t integrate OpenZFS into OS X.
Actually, the ZFS checksum is 32 bytes, usually four 64-bit fletcher4 words. And if that is not strong enough for you, there is the quite strong yet still very fast Edon-R now.
Yeah… not sure how I managed to get that wrong…
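For readers curious what the fletcher4 checksum mentioned above involves, here is a small Python sketch of the general scheme: four 64-bit running sums fed with 32-bit words. It assumes little-endian words and input whose length is a multiple of 4 bytes; it is meant to show why the result is 32 bytes, not to be bit-for-bit compatible with any ZFS build.

```python
import struct

MASK64 = (1 << 64) - 1  # emulate 64-bit wraparound arithmetic


def fletcher4(data: bytes) -> tuple:
    """Four cascading 64-bit sums over the 32-bit words of data."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d


sums = fletcher4(bytes(range(16)))
packed = struct.pack("<4Q", *sums)
print(len(packed))  # 32 bytes: four 64-bit words
```

The loop is just a handful of integer additions per word, which is why this kind of checksum is so cheap to compute.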
Couldn’t user data checksumming be done independently of the filesystem?
As long as you have some mechanism by which to read and write bytes on disk, even unreliably, any feature can theoretically be done at a higher level. It’s just a lot less convenient, to the point where nobody will actually use it.
For example, you could write a data checksum feature in your photo app. But if Apple doesn’t want to put data checksums in the filesystem, what are the odds their Photos.app team is going to implement it in their app?
This is kind of like multiprocessing in the OS 9 days. We had dual CPUs, and it was technically possible to use them both, but it just wasn’t practical for the 99.99% of programmers who weren’t working on Photoshop.
Checksums are worth an appeal or two to Apple to convince them. Their price is quite small.
The best way to deal with it might be as part of compression.
The lack of file checksums & some form of self-healing is very disappointing. I have had bit-rot on HFS+ and ZFS – and I’ve experienced it on my MacBook Pro’s SSD. Apple is wrong if claiming this isn’t – or can’t be – a potential problem for users. Apple products and their users are not magically immune to this phenomenon. Any self-respecting modern file system should address this. Like many others, I use many external HDs for data. It would be nice to have some sort of data integrity on these provided by the FS too. If not, I guess there is still ZFS.
@Aaron Meuerer,
Data checksumming is done independently of the filesystem. Hard disks have CRC error correcting codes that detect errors. In the specs of a SAS or FC Enterprise disk, you will typically see “1 uncorrectable error in 10^16 bits read”. Thus, these CRC codes are not always able to detect all errors. For instance, ECC RAM can detect and correct any 1-bit error, but it can only detect 2-bit errors, not correct them. There are probably other bit error configurations that are neither detectable nor correctable. Thus, ECC RAM does not protect 100% against bit corruption. Neither do CRC codes on disks. This shortcoming is because disks and ECC RAM use simple error correcting algorithms. ZFS uses a very advanced checksum to detect and correct all errors. For instance, you can tell ZFS to use SHA-256 to detect errors.
Before adding a checksum to filesystem data, Apple should ship every system with ECC memory.
If you read correctly from disk, a bit gets flipped in the memory buffer, you compute a checksum, then write correctly to disk, the checksum won’t detect the error because it was computed after the memory bit flip.
Adding ECC memory will likely eliminate most cases of undetected “bit rot”, which IMO, are actually memory parity errors that are written to disk. If a bit gets flipped on a disk, it’s EXTREMELY unlikely that it will pass the disk hardware ECC tests.
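The scenario described here can be shown in a few lines of Python. The write path below is hypothetical and drastically simplified: it checksums whatever is in the memory buffer at write time, so a bit that flipped in RAM beforehand is faithfully checksummed and then verifies cleanly forever after.

```python
import hashlib


def write_block(buffer: bytes):
    """Hypothetical write path: checksum whatever is in memory right now."""
    return buffer, hashlib.sha256(buffer).digest()


def verify(block: bytes, digest: bytes) -> bool:
    """Read-side check against the stored digest."""
    return hashlib.sha256(block).digest() == digest


original = b"important data"
in_memory = bytearray(original)   # data read correctly from disk
in_memory[0] ^= 0x01              # a bit flips in non-ECC RAM

block, digest = write_block(bytes(in_memory))

print(verify(block, digest))      # True: checksum matches the bad data
print(block == original)          # False: the data is nonetheless wrong
```

A filesystem checksum protects data from the moment the checksum is computed; anything that goes wrong earlier is outside its reach.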
Very good point, and it’s extremely unlikely that that will happen given the added costs. Perhaps consumers get the data integrity that they pay for and that likely won’t include ECC DRAM so shouldn’t include checksums.
I totally agree with you with respect to Apple shipping ECC memory with their systems, but in their defence the blame rests with Intel as (it is my understanding that) the iCore processors don’t support ECC memory–for that, one must use Xeons.
But yeah, in an ideal world, Intel would make iCore series processors with ECC support, and Apple would offer that at least as an option (I’d pay extra in a heartbeat for ECC!).
Are you implying that the engineers at Apple are basically in a cloud of ignorance to a problem that’s a hot topic amongst anyone who has spent a decent amount of time on data integrity even as a theoretical subject?
Maybe the reality distortion field is so strong that it works inwards now, because I have a hard time believing that highly-paid engineers would otherwise think bit rot not to be a problem on ANY kind of storage medium.
I really hope that bit was mostly PR, however it is quite interesting to see that checksumming apparently hasn’t been worked on.
Then again, they did state that they do leave parts of the filesystem flexible as to allow for features getting added later on, I assume well after release, so that a new iteration of features still provides an underlying FS that can be used on older devices.
I really do hope they integrate checksumming for all data.
Apple really does have a great qual and probably ships the best components you can buy. As a software guy who built a hardware product, I started out not understanding the failure modes of hardware. I’d wager the team will come around and APFS will include checksums.
You do not always know which components are actually good when they are fairly new. Apple has had bad storage components in their machines, e.g. https://www.apple.com/support/imac-harddrive-3tb/
True.
Same goes for most parts that exceed the complexity of hinges. 😛
I’ve had two iMacs and both had GPU faults that rendered the machines unusable.
Both AMD cards (or well, one being ATi to be precise)
The faults came to light a bit after warranty period, both had replacement programs available, I missed the time window on my first iMac by 3 months. Guess it wouldn’t have mattered much anyways as I had opened the machine myself before already to replace the faulty HDD… Oh look, another thing that inevitably eventually breaks and unfortunately it doesn’t break in a binary manner (works – works not), but it may hit data integrity first by writing poorly before it finally is inoperable.
Good luck making a last minute backup from files that already got corrupted and restoring them afterwards.
I own several Apple products myself and whilst their software and GUI department have tanked a considerable amount in recent times I believe their hardware components are – quality-wise – amazing.
However, we’re still talking about things made by humans, things that are subject to radiation and all sorts of other possible factors that lessen the meaning of having a great (still consumer-grade) storage device. (and well, even enterprise-grade wouldn’t be enough, ZFS doesn’t exist for nothing)
Extending on that, I really do wonder what Apple thinks they may get away with in their own data centers… Let’s just hope other people are working on their data center storage requirements.
It is unclear to me why data integrity for user data is the responsibility of the file system. Apple seems to argue that it is the job of the hardware, and one could also argue that it is the job of the user-level software. It also seems strange to me that the file system level would be a magical place that can do no wrong, while the disk-level software is seen as suspect. Who says that a detected error is the fault of the underlying layer and not the layer that detected the error?
What is remarkable though, is if they are doing PRNG xor-based encryption (seems likely, for example counter mode) and not using a MAC, since then they are open to selective data manipulation even though it is encrypted. If they are using MACs, then data integrity is solved by that.
“It is unclear to me why data integrity for user data is the responsibility of the file system.”
Whose responsibility do you think it is, then?
“Apple seems to argue that it is the job of the hardware,”
Well, they’re in the business of selling hardware. Unfortunately, most Apple users will end up putting data on a non-Apple disk at some point. And from what the ZFS people have found, even Apple disks are not immune to bit-rot.
“and one could also argue that it is the job of the user-level software.”
I don’t want to live in a world where my data integrity is the responsibility of each application, or my job as a developer is to continuously re-implement filesystem features that Apple didn’t want to. Does my text editor need to invent a new text format now, so that it can guarantee data integrity of text files? Are we going to make a new version of Git? What about data integrity of the applications themselves?
“It also seems strange to me that the file system level would be a magical place that can do no wrong, while the disk-level software is seen as suspect.”
Doesn’t seem strange to me at all. I upgraded my operating system last week. When’s the last time you upgraded the firmware in your hard disk? Yeah, me neither.
“Who says that a detected error is the fault of the underlying layer and not the layer that detected the error?”
Empirical evidence, for one. I can run my CPU all month and not get a wrong answer. That’s not true of my hard disk.
It’s a bit like asking why your TCP implementation needs to implement checksums, when Apple-brand ethernet cables are really good, and most of the modern layer-5 protocols support retries, anyway. We put checksums in TCP because we see a lot of transmission errors in practice, and that’s a layer where the data integrity implementation can be shared by almost all applications.
> Doesn’t seem strange to me at all. I upgraded my operating system last week. When’s the last time you upgraded the firmware in your hard disk? Yeah, me neither.
So the file system is so simple that it obviously doesn’t do anything wrong, but it needs to be updated to fix bugs?
I’m NOT arguing against file system level checksums, I’m arguing against the idea that file systems are infallible.
“It’s a bit like asking why your TCP implementation needs to implement checksums”
Well, if you don’t trust your disk drive’s checksumming, you definitely shouldn’t trust TCP’s. TCP checksumming is a joke. Ethernet, on the other hand, has pretty good checksumming.
> It is unclear to me why data integrity for user data is the responsibility of the file system.
You must not have studied under Keith Wesolowski, then (;-))
Very few hardware components are designed to be radiation resistant, and those are usually only designed for applications in space; not even enterprise grade hardware is normally designed to operate correctly in face of radiation, as designing and building such hardware is both expensive and hard.
Contemporary hardware relies on firmware, which is often (if not always) buggy, and most of it is opaque.
If one can then design software which can be mathematically proven to protect data, given above facts, one should design such software (ZFS), and one did.
At the end of the day, we all want and have the right to expect that our computers operate correctly; therefore, if the problem can not be practically mitigated with hardware, but can be with software (filesystem), then it should be.
> Very few hardware components are designed to be radiation resistant, and those are usually only designed for applications in space; not even enterprise grade hardware is normally designed to operate correctly in face of radiation, as designing and building such hardware is both expensive and hard.
There is nothing that says that file system code runs on hardware that is more fault resistant than the disk-level code.
We all agree that the disks don’t return errors all the time, right? But for some reason, when a file system detects an error, it cannot be that the file system makes a rare mistake, it has to be the disk?
There is nothing wrong with doing integrity checking in the file system, but it strikes me as wrong to suppose that just because it is done in kernel-level code it is automatically more fault-tolerant and bug-free than anything else.
Conclusion: APFS should use MACs, if it doesn’t already. Also, disks should be as fault-free as possible, and applications should be prepared to handle errors.
John Gruber suggests that for portable devices, checksums represent a problem for energy consumption.
True?
I would guess that any inefficiencies would come from having to read and process the whole block that the checksum is calculated over. Xor-based (like counter mode) encryption can be applied to arbitrarily small blocks and in parallel.
Simple checksums are pretty inexpensive to generate and validate. Once a block is in cache, simple integer operations are nearly free.
John Gruber is a mentally ill fanboy and will literally invent reasons even though he doesn’t have the technical understanding to say that behind it.
> But for some reason, when a file system detects an error, it cannot be that the file system makes a rare mistake, it has to be the disk?
Mathematical checksums are pretty trivial to implement, verify, and reason about; in contrast, contemporary firmware is not only complex, but opaque as well.
When there is a bug in a filesystem (ZFS in particular now), it can be debugged and fixed an order of magnitude easier than you will be able to debug and fix firmware in your storage area network controller, or in the electronics on the disk drive, or SSD for that matter.
If the failures and protection are easier to prove, debug, and understand by making the filesystem responsible for data integrity, then common sense tells me that is the layer which should be responsible for protecting data.
Of course, if you are of the belief that firmware is the correct place for data protection, by all means, don’t use ZFS and I wish you all the best going back and forth with the storage vendors, or reverse engineering the electronics and the firmware yourself. By all means, have at it!
You have obviously only read the part you are quoting, since I explicitly state that “There is nothing wrong with doing integrity checking in the file system” in the next sentence.
> There is nothing that says that file system code runs on hardware that is more fault resistant than the disk-level code.
Sorry, but the above quoted sentence makes no sense to me. I do not understand what you are trying to say with it. Are you saying that the filesystem should be implemented in hardware, and that there is nothing wrong with that? That is how I understand it, and that makes no sense to me.
I mean what I wrote. The file system code runs on the “main CPU” using the “main memory”.
You were implying that radiation resistance was a reason for doing the checks in the main CPU.
There is nothing that says that the main CPU and memory is more resistant to radiation than the disk CPU and memory.
I’m unfamiliar with how encryption and checksums work in detail, but given APFS will be mostly used with encryption, there will be a lot of computation going on for the whole IO stream anyway. Is there a way to integrate data integrity checksums into encryption processing and amortize its cost or are these two features in conflict?
TL;DR: Using checksums is more or less mandatory for cryptographic reasons.
You can use encryption in fundamentally two different ways: where each byte/word is dependent on the previous, and some sort of “random access” mode. When using it in the first way, you normally have to use an “initialization vector” to get good security, a start value that is unique for each block. This has to be stored somewhere, making the encrypted block larger than the cleartext block. You also have to read and decrypt the whole block each time you want any data in the block, and vice versa when you want to write.
If you use the random access way, you normally use the encryption key and a start value to feed a pseudo-random number generator, and then XOR the values from that onto the cleartext. One method is to use an initialization vector that is stored in the same way as above, but we still get the benefit of random access. Another way is to use the offset of the byte as the start value (counter mode). This means that nothing extra has to be stored for each block, but each block still has a unique start value. This makes it practical in file systems.
All XOR based systems are, however, vulnerable to attacks where the attacker wants to invert bits in your file. Since XOR operations are commutative, inverting a bit in the cryptotext automatically inverts the same bit in the cleartext, without changing anything else. This brings us to the second topic: checksums.
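The bit-inversion attack is easy to demonstrate. The Python sketch below uses a toy keystream (SHA-256 of the key plus a counter) standing in for a real cipher in counter mode; it is an assumption made for illustration and must not be used as actual encryption. The attacker never learns the key, yet changes the plaintext in a controlled way.

```python
import hashlib


def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: hash of key plus a block counter. Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_crypt(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt by XORing the data with the keystream."""
    return bytes(d ^ k for d, k in zip(data, keystream(key, len(data))))


key = b"secret key"
ciphertext = xor_crypt(key, b"PAY 100")

# The attacker knows the message layout but not the key, and flips
# exactly the bits that turn the character "1" into "9".
tampered = bytearray(ciphertext)
tampered[4] ^= ord("1") ^ ord("9")

print(xor_crypt(key, bytes(tampered)))  # b'PAY 900'
```

Decryption succeeds without any hint that the ciphertext was modified, which is the gap an integrity check has to close.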
There are three main types of checksums: non-cryptographic, cryptographic without key (normally called cryptographic hashes), and cryptographic with key (message authentication codes). Non-cryptographic (e.g. CRC) are used to detect errors, but cannot be used to detect malicious tampering. Cryptographic without key (e.g. SHA-256) can detect malicious tampering, but if the attacker can change the checksum, you are still vulnerable, since anyone can generate a new one. These are good when you store the checksum in a safe place. Finally, message authentication codes (MAC) can neither be verified nor generated without access to the key. This makes them ideal for cryptographic data integrity. The only real alternative is cryptographic signatures.
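The three kinds of checksum can be compared side by side with Python’s standard library. The key and the block contents below are placeholders; the point is only that an attacker who replaces a block can recompute a CRC or a plain SHA-256 digest, but cannot produce a valid MAC without the key.

```python
import hashlib
import hmac
import zlib


def make_mac(key: bytes, block: bytes) -> bytes:
    """Keyed checksum (HMAC-SHA256) over a block."""
    return hmac.new(key, block, hashlib.sha256).digest()


def verify_mac(key: bytes, block: bytes, tag: bytes) -> bool:
    return hmac.compare_digest(make_mac(key, block), tag)


key = b"filesystem secret"      # hypothetical key the attacker lacks
block = b"user data block"

crc = zlib.crc32(block)                      # non-cryptographic
digest = hashlib.sha256(block).digest()      # cryptographic, no key
tag = make_mac(key, block)                   # cryptographic, with key

# An attacker swaps the block and regenerates whatever they can.
forged = b"attacker's block"
forged_crc = zlib.crc32(forged)                   # trivially recomputed
forged_digest = hashlib.sha256(forged).digest()   # also recomputed
forged_tag = make_mac(b"attacker guess", forged)  # wrong key

print(verify_mac(key, block, tag))          # True
print(verify_mac(key, forged, forged_tag))  # False
```

If the CRC or digest is stored next to the data, the forged values would pass verification; only the keyed tag fails.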
All these checksums generate additional data per block, however, and have to be calculated over the whole block.
Take SHA-256 as an example: it is a 32 byte checksum. This means that a small block size of 32 bytes doubles the amount of data stored, while a large block size of 1MB has a small overhead, but forces the file system to read and verify that 1MB each time a byte is needed.
Regarding cost, counter mode can be implemented in a completely parallel way. Checksums have dependencies, so they cannot be completely computed in parallel, but certain checksumming systems, like Merkle trees, use log(n) time when having enough compute units. The checksums also have to be stored somewhere.
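A Merkle tree can be sketched in a few lines of Python. This version computes the tree sequentially, but every hash within one level is independent of the others, so with enough compute units each level could run in parallel and only the number of levels (about log2 of the block count) is inherently serial. Duplicating the last hash on odd-sized levels is one common convention, chosen here as an assumption.

```python
import hashlib


def merkle_root(blocks: list) -> bytes:
    """Hash blocks pairwise, level by level, until one root hash remains."""
    if not blocks:
        raise ValueError("need at least one block")
    # Leaf level: one hash per block, each computable independently.
    level = [hashlib.sha256(b).digest() for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pad odd levels by repeating the last hash
        # Each parent depends only on its own two children.
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
root = merkle_root(blocks)

changed = list(blocks)
changed[2] = b"block-2 (corrupted)"
print(merkle_root(changed) == root)  # False: any changed block alters the root
```

The intermediate hashes still have to be stored somewhere if you want to verify a single block without rereading all of them.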
Note: I have simplified some explanations, do not use this to implement your own cryptography.
To dispel anecdotal fear of bitrot, and whether it really necessitates ZFS-style checksumming, are there field/case studies of how much it occurs in the real world (of SSDs now)? Did Sun/Oracle instrument the checksum code to report actual bitrot rates, independent of manufacturer UBER stats at the level of 10E-14 or -15?
There are certainly studies for hard disks showing bit rot. See CERN’s from 2007 – http://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf and one a year later from Netapp/Wisconsin/Toronto https://www.usenix.org/conference/fast-08/analysis-data-corruption-storage-stack .
I’m a WAFL engineer and I’m as gobsmacked as you by the details you have outlined in this section. It is seriously trivial to just checksum everything (and we don’t stop there, either.) Can’t understand these choices.
Hardware -always- fails in ways you didn’t anticipate.
> If file systems could be trusted absolutely then the “only” reason for backup would be the idiot operators (i.e. you and me).
And failing hardware.
> Apple has some of the most stringent device qualification tests for its vendors
Source?
Personal communication with current and former Apple employees as well as those at various SSD vendors while I directed flash strategy at Sun and Oracle.