Data Integrity at Home: Data Scrubbing and Hashdeep
OK, when you have terabytes of data at home, you definitely will have losses. I’ve had a bad RAID array that corrupted a bunch of JPEGs, and it is super sad. So what’s the solution to this?
- If you are using BTRFS, then you can turn on scrubbing. A scrub runs in the background and verifies the checksums of the data on disk. It’s not hard to turn on and highly recommended.
- Good copy checking. Some tools, like rsync and GoodSync, can compare checksums when they verify a copy.
- Above the file system, the big idea is to do user-level corruption checking. There is a utility called hashdeep that lets you do this.
Here is what hashdeep does. It can recursively go through a file system and generate a hash for each file. This lets you later audit a copy against that list of hashes, so that you can see what is corrupt and hopefully restore it from a backup.
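To make the idea concrete before bringing in hashdeep itself, here is a minimal sketch of the same hash-then-audit cycle using only find and a SHA-256 tool. It is not how hashdeep is implemented. The demo directory, file names and manifest name are made up for illustration, and it assumes either sha256sum (Linux) or shasum (macOS) is installed and that file names contain no newlines.

```shell
#!/bin/sh
# Sketch of "hash every file, then audit later" with standard tools.
# Assumption: sha256sum or shasum is available; all paths below are hypothetical.
LC_ALL=C; export LC_ALL   # keep the OK/FAILED messages untranslated

# Pick whichever SHA-256 tool exists on this machine.
if command -v sha256sum >/dev/null 2>&1; then
    sha256() { sha256sum "$@"; }
else
    sha256() { shasum -a 256 "$@"; }
fi

# Throwaway demo data standing in for a photo archive.
demo=$(mktemp -d)
mkdir -p "$demo/photos/2019"
printf 'holiday picture\n' > "$demo/photos/2019/beach.jpg"
printf 'family picture\n'  > "$demo/photos/family.jpg"

# 1. Walk the tree and record one hash per regular file, using relative paths.
cd "$demo/photos" || exit 1
find . -type f | sort | while IFS= read -r f; do
    sha256 "$f"
done > "$demo/manifest.sha256"

# 2. Simulate silent corruption: same length, different bytes.
printf 'family pikture\n' > family.jpg

# 3. Audit: re-hash every file named in the manifest and keep only the failures.
audit=$(sha256 -c "$demo/manifest.sha256" 2>/dev/null)
failed=$(printf '%s\n' "$audit" | grep 'FAILED$' || true)

echo "Files that no longer match their recorded hash:"
echo "$failed"
```

Because the manifest stores relative paths, the same audit can be run from the top of a copy mounted somewhere else, which is the job hashdeep handles with -l.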
Here’s the way to use it on a Mac:
    # Assuming that you have a Backup volume
    cd /Volumes/Backup
    brew install hashdeep
    # Generate a hash for every file under this directory
    # -l means print relative paths
    # -r means recursive
    hashdeep -l -r * > ~/Backup.hash
    # Now check it. An audit with -a only reports pass or fail,
    # so use -x with -k to list the files that match no known hash
    cd /Volumes/Backup
    hashdeep -l -r -x -k ~/Backup.hash *
    # It will tell you which files now hash differently
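Anyway, a nice (if slow) solution to this problem. The -l option is really nice, because relative paths let you check the same hash list against a copy mounted somewhere else.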
VPSInfo has a slightly different, and probably better, set of options. Using -c sha256 to get SHA-256 is good, and -o f means only do this for regular files, in case there are block devices in the tree.
    # Create a hash list using SHA-256, which is strong
    # do it recursively and only look at regular files
    hashdeep -c sha256 -r -o f /bin /boot /dev /etc /home /lib /lib64 /opt /root /sbin /srv /usr /var > ~/file_hashes
    # Now to do a check, compare it again
    # the -x means dump out the list of offending files
    hashdeep -c sha256 -k ~/file_hashes -s -x -r -o f /bin /boot /dev /etc /home /lib /lib64 /opt /root /sbin /srv /usr /var
And Super User has a good example of how to use this across file systems:
    # create a hash list in the source directory
    # -vvv means lots of debug output, -e means show a progress indicator
    # you will need the -vvv when you do the audit
    hashdeep -c sha256 -r -l -e -vvv * | tee ../hashlist.txt
    # now copy that over to the destination and use it as the comparison
    hashdeep -c sha256 -r -l -k ../hashlist.txt -a -e -vvv * | tee ../hashcompareresult.txt
Note that if you use -a alone, all you get is an error if something doesn’t match. You need to add -vv, which prints the discrepancies and tells you which files match and which do not. So the final fragment to use in a script looks like this:
    # on the source
    hashdeep -c sha256 -r -l -e * > ../hashdeep.txt
    # on the destination, after copying over hashdeep.txt
    hashdeep -c sha256 -r -l -a -vv -k ../hashdeep.txt * > ../hashdeep.compare.txt