Redupe: Forward Error Correction
By Robert Escriva

Forward error correction provides the ability to detect and correct corrupt data. At a high level, forward error correction works by adding structured redundancy to data, ensuring the original data can always be reconstructed so long as the corruption does not affect more than a configurable percentage of the overall data.
This post introduces the redupe and fecsum command-line tools for providing forward error correction to streams of data. The first tool, redupe, is modeled after compression tools like gzip or bzip2, but adds redundancy instead of eliminating it. Redupe operates on the data directly, transforming it into its redundant state by adding the redundant information inline with the original data. This works well for streaming or archival purposes, but is notably less convenient for multimedia use cases, where the data would have to be recovered prior to playback. For those use cases the package provides fecsum, short for "forward-error-correction checksum," which preserves the exact byte pattern of the input data by writing the forward error correction to a separate file.
Quick Overview
The redupe tool has two basic workflows. The most common (and most tested) workflow is to pass redupe a file or files to be encoded with error correction, like this (output modified for presentation):
    $ ls
    2018-05-15-home-backup.tar.gz
    $ redupe 2018-05-15-home-backup.tar.gz
    $ ls -l
    -rw------- 1 rescrv 5625162218 2018-05-15-home-backup.tar.gz
    -rw-r--r-- 1 rescrv 6433996800 2018-05-15-home-backup.tar.gz.rd
Notice that the original file was left intact and that we have a new file that is roughly 14% larger than the original. Out of an abundance of caution, redupe will never remove the original file. This is to encourage a workflow where the file can be checked end-to-end for accuracy:
    $ sha256sum 2018-05-15-home-backup.tar.gz 2018-05-15-home-backup.tar.gz.rd
    0335f60167c8276f13b0...  2018-05-15-home-backup.tar.gz
    1a878ad85dd45b2af18e...  2018-05-15-home-backup.tar.gz.rd
    $ reundupe < 2018-05-15-home-backup.tar.gz.rd | sha256sum
    0335f60167c8276f13b0...  -
    $ sha256sum 2018-05-15-home-backup.tar.gz.rd
    1a878ad85dd45b2af18e...  2018-05-15-home-backup.tar.gz.rd
This example shows us that there exists a tool called reundupe. The commands redupe and reundupe should link to the same binary; it will add or remove redundancy based upon which name is used to invoke the command. The other thing we can see is that reundupe (and, because it's the same binary, redupe) can operate within a UNIX pipeline. When passed no files and a stream on stdin, both commands will output the transformed stream on stdout.
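For instance, redundancy can be added while a backup is being created and stripped back off on the way out. This is a hypothetical invocation (the archive name and contents are made up), not output from a real session:

    $ tar cz ~/documents | redupe > 2018-05-16-docs.tar.gz.rd
    $ reundupe < 2018-05-16-docs.tar.gz.rd | tar tz >/dev/null && echo ok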
At a higher level, this is an extra-paranoid process for verifying our backup before shipping it off to long-term storage. First, we record the checksums of the original and redupe'd files. Then, we undo the redupe process and confirm the output matches the original file's checksum. Finally, we re-check the redupe'd file's checksum to confirm it is unchanged. While this process is not guaranteed to ensure there are no unrecoverable errors, it provides a high degree of confidence that the redundancy-infused file can reconstruct the original, and that the file is unlikely to have been corrupted during the verification process.
We could also have gone through a similar workflow with the fecsum tool:
    $ ls
    2018-05-15-home-backup.tar.gz
    $ fecsum create 2018-05-15-home-backup.tar.gz{,.fec}
    $ ls
    2018-05-15-home-backup.tar.gz  2018-05-15-home-backup.tar.gz.fec
    $ fecsum check 2018-05-15-home-backup.tar.gz{,.fec} || echo fail
If we were to discover that the last step failed, we could recover our original. It looks something like this:
    $ fecsum check 2018-05-15-home-backup.tar.gz{,.fec} || echo fail
    fail
    $ fecsum correct 2018-05-15-home-backup.tar.gz{,.fec,.orig} || echo fail
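My reading of the brace expansion above is that the corrected data lands in the .orig file. Assuming that's right, the recovered copy can be re-checked against the same .fec file before swapping it into place. A sketch:

    $ fecsum check 2018-05-15-home-backup.tar.gz.orig \
          2018-05-15-home-backup.tar.gz.fec \
        && mv 2018-05-15-home-backup.tar.gz{.orig,}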
Verification
I trust 100% of my backups to redupe. To gain confidence that I could do such a thing, I subjected redupe to literally years' worth of CPU time to verify that it would keep my data safe in the error-free case. I used the guacamole random generator to create approximately 128 different random streams, then ran each of these streams on an independent CPU core for somewhere between two weeks and a month of constant testing. I sought to confirm that a pipeline like this one performed correctly:
    $ guacamole-generate -s $seed \
        | redupe \
        | reundupe \
        | guacamole-compare -s $seed
Because guacamole is a deterministic random number generator, we should be able to generate data in guacamole-generate and recreate the exact same sequence of bytes in guacamole-compare, comparing against the stream of data that has been processed in both directions by the redupe software.
All told, this works out to somewhere between five and ten years of up-front verification. The guacamole generator was not well optimized at the time of this validation (2015) and produced only 200-400MB/s of random data. That was still enough to push more than a couple of petabytes through the algorithm for verification.
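The harness itself was not much more than a loop over seeds, one background job per core. Something like this sketch (the seed range and job control here are illustrative, not the original script):

    $ for seed in $(seq 1 128); do \
          ( guacamole-generate -s $seed | redupe | reundupe \
              | guacamole-compare -s $seed || echo "seed $seed failed" ) & \
      done; wait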
There was also approximately another week of follow-up validation with an altered pipeline that injected random corruption into the file. Unfortunately, the details of this testing escape my memory and have been lost to time.
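I can't reproduce that harness today, but the effect is easy to demonstrate by hand. A sketch (the file names are made up, and the 16 corrupted bytes are far below the redundancy threshold of any ~1MB chunk):

    $ cp backup.tar.gz.rd damaged.rd
    $ dd if=/dev/urandom of=damaged.rd bs=1 count=16 seek=2000000 conv=notrunc
    $ reundupe < damaged.rd | sha256sum    # should print the original digest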
Limitations
Redupe and fecsum were written largely as a fun weekend project to learn about Reed-Solomon codes. As it was a learning exercise, it may not follow best practices for forward error correction. Some foreseeable problems:
- The basic algorithm underlying the tool performs the forward error correction across groups of 255 pages, so that any single page or adjacent run of bytes can be lost or overwritten. This rounds every file up to a multiple of approximately 1MB (see the sketch after this list). I work around this by generally operating on files of hundreds of megabytes or more, so the waste is a trivial percentage of the data stored.
- There is an assumption that data corruption will be limited in its impact. When 12.5% redundancy is added, no single chunk of data (~1MB) can contain more than 12.5% corruption. If any single chunk exceeds the configured redundancy, the chunk is entirely unrecoverable.
- The fecsum API is ugly and subject to change. I usually don't notice this because I embed it within scripts and haven't changed the code much.
- As implemented, forward error correction will account for changed bytes, but the tool does nothing to account for shifts in the file, where bytes are inserted or removed rather than overwritten. There is no real work-around in redupe for such failures.
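As a rough illustration of the first two points, the sizes in the earlier example are consistent with 4KB pages in a classic RS(255, 223) arrangement: 223 data pages and 32 parity pages per group, so 12.5% of each group is spent on redundancy. That layout is my guess from the numbers, not something confirmed against the code:

    $ orig=5625162218                           # original file size from above
    $ group_data=$((223 * 4096))                # data bytes per group (assumed)
    $ groups=$(( (orig + group_data - 1) / group_data ))  # round up to whole groups
    $ echo $(( groups * 255 * 4096 ))           # bytes written: groups of 255 pages
    6432952320

That lands within one ~1MB group of the observed 6433996800 bytes; I'd guess the difference is framing or metadata, but again, that's a guess.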
Last, but certainly not least, and deserving of its own paragraph: this was a hobby project written to learn about forward error correction. Read the LICENSE distributed with redupe; it contains strong language about this software being provided as-is.
Getting redupe
Made it this far and want to install redupe? It is available in source form at my web page or on GitHub. This is a standard autotools package, so the usual ./configure && make && make install dance will suffice for installation. If installing from git, you'll need to prefix that with autoreconf -ivf. You will also likely need to install libpopt's development package and execute the make install step with root privileges.
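Concretely, a from-git build looks something like this (the sudo is only for the install step; your distribution's name for libpopt's development package may vary):

    $ autoreconf -ivf          # only needed when building from a git checkout
    $ ./configure
    $ make
    $ sudo make install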
If you have trouble with this, please don't hesitate to open a ticket on GitHub.
Summary
Redupe is a tool for providing forward error correction on the command line. It was developed to add a layer of safety for both tarball-oriented backups and my multimedia collection. While the tool has seen extensive use by me and has processed petabytes of synthetic data, I've never really taken the time to make it accessible to others, and I hope this blog post is the start of changing that.