When you’ve used computers as a serious tool for as long as I have, you come to depend on their being reliable. For more than ten years I’ve stuck to a policy of staying away from bleeding-edge operating-system versions, organising disk partitions to keep my data separate from the OS, using each install for as long as possible before upgrading, and keeping every install backed up as a partition image. If something goes wrong, I can restore a backup and be using the machine again in about twenty minutes, without stopping to worry about losing data. This has paid off handsomely over the years by minimising the time I spend on IT gruntwork, despite my running quite a few machines.
While this policy was still taking shape, another serious reliability problem reared its head for the first time.
As related in a previous post, I once did a digital restoration job from a vinyl copy of Rickie Lee Jones’s eponymous debut album. I wrote a set of archive CDs using WinRAR and redundant volumes, added the results to my MP3 collection, and moved on.
Some time later, having some reason or other to want to restore this archive, I put the disc in and started extracting. I was annoyed when WinRAR reported integrity errors, but the next step was clear: get everything off the disc and use the redundant RAR volumes to undo the corruption.
It didn’t work! My only choice was to turn off WinRAR’s deletion of failed files and see what I could get back. There were noticeable glitches in one or two places, even on the original transfers.
Disgruntled, I e-mailed Eugene Roshal, WinRAR’s author, to politely explain that I seemed to have lost data to a bug. He replied before long, telling me (with what seemed like a hint of impatience) to check for faulty memory.
“Typical programmer,” I thought, “assuming there can’t be anything wrong with his software”. When I pressed him, he told me that he’d been through this several times before and it had always been faulty memory, not bugs in his software; apparently the kind of optimised bit-manipulation code found in WinRAR’s multimedia compression algorithms is uncommonly susceptible to hardware faults.
I stand (error-) corrected
Determined to prove him wrong, I looked into memory testing, found memtest86, burned a bootable CD of it, and left it running overnight on the machine concerned. By morning there were several test failures and I was forced to conclude that Mr Roshal knew what he was talking about. I immediately started a run on my other workstation.
I don’t remember whether one or two of the 256 MB DIMMs were faulty, but I immediately retired the offender(s) and distributed the remaining modules between the two workstations. I then tried the Rickie Lee Jones archive again. Sadly, the corruption had occurred while making the archive, and it was permanently broken.
A few years after that, CERN and others started seriously evaluating the causes of data corruption in their systems, with interesting results. Because their installations use ECC RAM, we have to interpret a little to apply their lessons to standard memory. Specifically, single-bit errors are corrected by ECC and do not show up directly: one has to query the memory controller via an out-of-band mechanism to discover them. Much like other such error-reporting systems (C2 on compact discs springs to mind), this is susceptible to corner-cutting in hardware and firmware implementations; in other words, it’s hard to be sure that all, or even a majority, of the single-bit errors are actually reported properly. CERN’s tests showed an unexpectedly low rate of single-bit errors alongside a much higher than expected incidence of double-bit (uncorrectable) errors, so under-reporting of single-bit errors is a plausible explanation.
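To make the single-bit versus double-bit distinction concrete, here is a toy sketch of a SECDED (single-error-correct, double-error-detect) code. It is an extended Hamming(8,4) code protecting four data bits; real ECC DIMMs typically protect 64-bit words with a wider code, and the function names `encode` and `decode` are my own, chosen purely for illustration.

```python
def encode(nibble):
    """Encode 4 data bits (0..15) as an 8-bit SECDED codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                       # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]    # parity over positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]    # parity over positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]    # parity over positions with bit 2 set
    code[0] = sum(code[1:]) % 2              # makes the whole word even-parity
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the word is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of set positions is 0 for a valid word
    odd_parity = sum(code) % 2           # 1 means an odd number of bits flipped
    fixed = list(code)
    if odd_parity:
        # Treated as a single-bit error: the syndrome names the flipped position
        # (0 means the overall parity bit itself was hit).
        fixed[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        # Even parity but non-zero syndrome: two bits flipped, cannot be repaired.
        return "uncorrectable", None
    else:
        status = "ok"
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble
```

Note that the "corrected" case hands back good data without any fuss, which is exactly why such errors stay invisible unless something counts and reports them.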
Even more worrying, they pointed out that the data flows inside the operating system for a simple streaming transfer involve around six copy operations per data block. Modern protected-memory operating systems are part of the problem here, because getting data between kernel-mode device drivers and user-mode processes involves crossing privilege barriers, a problem that most such OSes inexplicably solve by copying the data rather than remapping the physical memory it already occupies. Apart from magnifying the opportunity for bit rot by passing the data through several read/write cycles, this needlessly hurts I/O performance, but I digress.
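As a rough illustration of how repeated copying magnifies the risk, the arithmetic can be sketched as follows. This assumes each copy corrupts a block independently with the same probability, which is a simplification, and the per-copy probability is a made-up input rather than a measured figure.

```python
def chance_of_corruption(p_per_copy, copies):
    """Probability that a block is corrupted at least once across `copies`
    independent copy operations, each failing with probability p_per_copy."""
    return 1.0 - (1.0 - p_per_copy) ** copies
```

For small probabilities the result is close to `copies * p_per_copy`, so six copies per block means roughly six times the exposure of a single copy.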
So what are we to do? ECC RAM, being an “enterprise” feature, has always been too expensive to use for home computers. For the same reason, motherboards that support it are few and expensive, and it’s not a panacea in any case.
For over ten years, I have not allowed myself to use a machine for anything serious until I know that all of its installed memory has passed a rigorous memory test (one or two overnight runs) with memtest86 or, more recently, memtest86+.
This policy paid off in 2007 when I bought two pairs of differently-branded 1 GB DIMMs with which to build a new workstation. Both pairs failed because one module of each contained hardware faults. I sent the whole lot back for replacement, and the replacements were fine.
About 18 months ago I started discussing all this with a friend; he wasn’t very convinced that it was a real problem, mainly because he hadn’t fully realised that all data transfers in a machine, even between files on the same local disk, have to go through main memory, and that this necessarily involves lots of copying. Once I pointed this out, he understood my point that no computer with suspect memory can be trusted to handle data you care about.
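Since every copy passes through RAM, one cheap sanity check is to hash a file before and after copying it. The sketch below is a minimal example using Python's standard library; the helper names `sha256_of` and `copy_and_verify` are hypothetical. It is no substitute for a proper memory test: the hashing itself runs through the same RAM, the read-back may be served from the operating system's cache rather than the disk, and a mismatch tells you that something went wrong but not where.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst and raise OSError if the two hashes differ."""
    before = sha256_of(src)
    shutil.copyfile(src, dst)
    after = sha256_of(dst)
    if before != after:
        raise OSError(f"hash mismatch copying {src} to {dst}")
    return after
```

Keeping the returned hash alongside an archive also gives you something to compare against when you come back to it years later.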
We worked through the job of testing all his computers – several Macs and one Linux machine. Most of them were fine, but his MacBook Air fairly exploded with errors. The machine was nearly out of warranty, and the memory was built into the motherboard rather than slotted, so there was some concern that the expense of solving the problem would lead to Apple’s trying to weasel out of it, but Apple came through for him and replaced the motherboard. The replacement was fine.
Mention RAM errors to the average techie and you’ll hear the old chestnut about cosmic rays permanently damaging memory cells. In my experience, this is a load of hogwash, as is the general idea of age-related RAM degradation; the biggest problem with RAM is simply bad quality control.
If RAM does deteriorate with time, I’ve never seen it happen; I re-run my memory tests every year or three, and I’ve never once seen a good module go bad. One possible explanation is that the kind of RAM that does this is actually inferior to begin with – owing to bad fabrication tolerances – and would fail a proper test even in a new state, whereas RAM that survives such testing tends to stay good.
In any case, since I began my policy of putting new/untested modules through their paces, I’ve never had a repeat of the corrupted-archive incident.