Data Integrity at Home: Data Scrubbing and Hashdeep

OK, when you have terabytes of data at home, you definitely will have losses. I’ve had a bad RAID array that corrupted a bunch of JPEGs, and it was super sad. So what are the solutions?

  1. If you are using Btrfs, then you can turn on scrubbing. This runs in the background and verifies that every block on disk still matches its checksum. It’s not hard to turn on and highly recommended (see the sketch after this list).
  2. Good copy checking. Tools like rsync and GoodSync can verify each copy with a checksum rather than just comparing timestamps and sizes (also shown below).
  3. At the system level, the big idea is to do user-level corruption checking. There is a utility called hashdeep that lets you do this.
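
Here is a minimal sketch of the first two items, assuming a Btrfs pool mounted at /mnt/pool and a backup at /Volumes/Backup (both paths are just placeholders):

# 1. Btrfs scrubbing: reads every block and verifies it against its checksum
sudo btrfs scrub start /mnt/pool
# check progress and any checksum errors found so far
sudo btrfs scrub status /mnt/pool

# 2. Checksum verification of a copy with rsync:
# -n is a dry run, -c compares checksums instead of size/time,
# -i itemizes anything that differs, so matching trees print nothing
rsync -rnci /mnt/pool/photos/ /Volumes/Backup/photos/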

Here is what hashdeep does. It can recursively go through a file system and generate a hash for each file. This lets you later audit a copy against that file list so that you can see what is corrupt and hopefully restore it from a backup.

Here’s the way to use it on a Mac:

# Assuming that you have a Backup volume
cd /Volumes/Backup
brew install hashdeep
hashdeep -l -r * > ~/Backup.hash
# this generates a hash for every file in that directory
# -l means use relative paths in the output
# -r means recursive
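
The resulting file is plain text. By default hashdeep records the size, MD5, and SHA-256 of each file, so the output looks roughly like this (hashes shortened here):

%%%% HASHDEEP-1.0
%%%% size,md5,sha256,filename
## Invoked from: /Volumes/Backup
## $ hashdeep -l -r *
##
48231,9e107d9d...,e3b0c442...,./photos/img001.jpg

Because of -l the filenames are relative (./photos/...), which is what makes the list reusable against a copy on another volume.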

# Now audit against it: -x prints only the files that do NOT
# match the known hash list loaded with -k
# (-a instead gives a simple pass/fail answer)
cd /Volumes/Backup
hashdeep -l -r -x -k ~/Backup.hash *
# It will tell you which files now hash differently

Anyway, a nice (if slow) solution to this problem. The -l is really nice because it keeps the paths relative, so the hash list still works when you compare against a copy on another volume.

VPSInfo has a slightly different and probably better set of options. The -c sha256 to use SHA-256 is good, as is -o f, which hashes only regular files in case there are block devices or other special files in the tree.

# Create a hash using SHA-256 which is strong
# do it recursively and only look for regular files
hashdeep -c sha256 -r -o f /bin /boot /dev /etc /home /lib /lib64 /opt /root /sbin /srv /usr /var > ~/file_hashes

# Now to do a check, compare against it
# -x means dump out the list of offending files, -s suppresses error messages
hashdeep -c sha256 -k ~/file_hashes -s -x -r -o f /bin /boot /dev /etc /home /lib /lib64 /opt /root /sbin /srv /usr /var
 

And Superuser has a good example of how to use this across file systems

# create a hash in the source directory
# -vvv means lots of debug output, -e means show a progress indicator
# you will need the -vvv when you do the audit
hashdeep -c sha256 -r -l -e -vvv * | tee ../hashlist.txt
# now copy that over to the destination and use this there as the comparison
hashdeep -c sha256 -r -l -k ../hashlist.txt -a -e -vvv * | tee ../hashcompareresult.txt

So the final fragment to use in a script looks like this. Note that if you use -a alone, all you get is a pass/fail result; you need to add -vv, which prints the discrepancies so you can see which files failed to match.

# on the source
hashdeep -c sha256 -r -l -e * > ../hashdeep.txt
# on the destination, copy over hashdeep.txt first, then audit
hashdeep -c sha256 -r -l -a -vv -k ../hashdeep.txt * > ../hashdeep.compare.txt
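
If you want to drive this from a script, the audit’s exit status is the thing to test. A minimal sketch, assuming hashdeep exits non-zero when the audit fails (the paths and messages are placeholders):

#!/bin/sh
# Hypothetical wrapper: audit the current tree against ../hashdeep.txt
# and capture the discrepancy report for later inspection
if hashdeep -c sha256 -r -l -a -vv -k ../hashdeep.txt * > ../hashdeep.compare.txt
then
    echo "audit passed"
else
    # non-zero exit: at least one file failed to match
    echo "audit FAILED, see hashdeep.compare.txt" >&2
    exit 1
fi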
