In day-to-day operations, circumstances often arise where you need simple answers to fairly complicated situations. In the best scenario, the information is available to you in some structured way, like in a database, and you can come up with a query (e.g., “what percentage of our customers in January spent more than $7.50 on two consecutive Wednesdays” is something you could probably query). In other scenarios, the information is not as readily available or not in a structured format.
One nice thing about Linux and Unix-like operating systems is that the filesystem can be interrogated by chaining various tools to make it cough up the information you need.
For example, I needed to copy the assets from a digital asset management (DAM) system to a staging server to test a major code change. The wrinkle is that the DAM is located on a server with limited monthly bandwidth. So my challenge: what was the right number of files to copy down without exceeding the bandwidth cap?
So, to start out with, I use some simple commands to determine what I’m dealing with:
$ ls -1 asset_storage | wc -l
$ du -hs asset_storage
So that first command lists all the files in the asset_storage directory, with the -1 flag saying to list one file per line; that output is then piped into the word-count command, whose -l flag says to count lines. The second command tells me the storage requirement, with the -h flag asking for human-readable units.
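If you want a safe place to experiment before pointing these at real data, the same two commands work on any scratch directory. A quick sketch (the /tmp paths and file names here are hypothetical stand-ins, not part of the real DAM):

```shell
# Build a throwaway directory with a couple of files of known size.
mkdir -p /tmp/asset_demo
head -c 1024 /dev/zero > /tmp/asset_demo/small.img   # 1 KB file
head -c 4096 /dev/zero > /tmp/asset_demo/big.img     # 4 KB file

ls -1 /tmp/asset_demo | wc -l    # number of files: 2
du -hs /tmp/asset_demo           # total size, human-readable
```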
I’ve got a problem. Over 10,000 files totalling over 400G of storage, and say my data cap is 5G. The first instinct is to say, “well, the average file size is 40M, so I may only be able to copy 125 files.” However, we know that’s wrong. There are some big video files and many small image thumbnails in there. So what if I only copy the smaller files?
$ find asset_storage -size -10M -print0 | xargs -0 du -hc | tail -n1
Look at that beautiful sequence. Just look at it! The find command looks in the asset_storage directory for files smaller than 10M. The list it creates gets passed into the disk usage command via the super-useful xargs, which takes a list output by one command and uses it as input parameters to another command. To be safe with weird characters (i.e., things that could cause trouble by being interpreted by the shell, like single quotes or parens or dollar signs), we use the -print0 flag on find (which terminates each result with a null character) and the -0 flag on xargs, which tells it to expect those null terminators. The pipeline thus takes the list of small files and passes it to the disk usage command with the -h (human-readable) and -c (cumulative) flags. The du command gives output for each file and for the sum total, but we only want the sum, so we pipe it into the tail command with -n1 to give us just that last line.
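To see why tail -n1 does the right thing, here is the same pipeline run against a small scratch directory (hypothetical /tmp paths, just for illustration): du -hc prints one line per file and then a final line ending in “total”, and tail -n1 keeps only that last line.

```shell
mkdir -p /tmp/du_demo
head -c 2048 /dev/zero > /tmp/du_demo/a.jpg
head -c 1024 /dev/zero > /tmp/du_demo/b.jpg

# One size line per file, then a final cumulative "total" line:
find /tmp/du_demo -type f -size -10M -print0 | xargs -0 du -hc

# tail -n1 keeps only that last "total" line:
find /tmp/du_demo -type f -size -10M -print0 | xargs -0 du -hc | tail -n1
```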
So if we only include files under 10M, we can transfer them without getting close to our data cap. But what percentage of the files will be included?
$ find asset_storage -size -10M -print | wc -l
Again, the find command looks in the asset_storage directory for files smaller than 10M, and each line is passed into the word count as before. So if we include only files smaller than 10M, we get 7,708 of the 10,384 files, or just under 75% of them! Hooray!
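That percentage is just the two counts divided, which the shell can do without reaching for a calculator (bc adds a decimal place if you have it installed):

```shell
small=7708    # files under 10M
total=10384   # all files

# Integer percentage via shell arithmetic:
echo $(( small * 100 / total ))              # prints 74

# One decimal place, if bc is available:
echo "scale=1; $small * 100 / $total" | bc   # prints 74.2
```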
But when I started to create the tar file to transfer the files, something was wrong! The tar file was 2G and growing! Control C! Control C! What’s going on here?
What was wrong? Well, this is where it gets into the weeds a bit, and it took me longer than I’d like to admit to track down. The shell has limits on how long a command line can be, and xargs has its own limits. If the list it receives exceeds those limits, xargs splits the input and invokes the destination command multiple times, each time with a chunk of the list. So in my example above, the find command was overwhelming the xargs buffer and the du command was being called multiple times:
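You can watch xargs do this chunking with a toy list. Forcing tiny chunks with -n makes each separate invocation visible (toy values, purely for illustration); the same splitting happens automatically when the argument list exceeds the -s limit:

```shell
# Four null-terminated items, but echo gets invoked twice because
# -n2 caps each invocation at two arguments.
printf 'a\0b\0c\0d\0' | xargs -0 -n2 echo
# prints:
#   a b
#   c d
```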
$ find asset_storage -size -10M -print0 | xargs -0 du -hc | grep -i total
The tail command was seeing only that second total and missing the first one! To make the computation work the way I’d wanted, I had to allocate more command line length to xargs with its -s flag (the maximum size you can set is system dependent; GNU xargs will report its limits if you run xargs --show-limits):
$ find asset_storage -size -10M -print0 | xargs -0 -s2000000 du -hc | grep -i total
Playing with the file size threshold, I was finally able to determine that my ideal target was files under 5M, which still gave me 68% of the files and kept the final transfer down to about 3G.
In summary, do it this way:
$ find asset_storage -size -5M -print0 | xargs -0 -s2000000 du -hc | tail -n1
$ find asset_storage -size -5M -print | wc -l
$ find asset_storage -size -5M -print0 | xargs -0 -s2000000 tar czf dam_image_backup.tgz
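As an aside, if your tar is GNU tar, you can sidestep the xargs length limit for the archive step entirely: the --null and -T - options make tar read the null-delimited file list from stdin in a single process, so there is no chunking and no risk of a later invocation overwriting the archive. A sketch on a throwaway directory (hypothetical /tmp paths):

```shell
mkdir -p /tmp/tar_demo
head -c 1024 /dev/zero > /tmp/tar_demo/a.jpg
head -c 1024 /dev/zero > /tmp/tar_demo/b.jpg

# -type f keeps directories out of the list; otherwise tar would
# recurse into them and archive their contents a second time.
find /tmp/tar_demo -type f -size -5M -print0 \
  | tar -czf /tmp/dam_image_backup.tgz --null -T -

tar -tzf /tmp/dam_image_backup.tgz   # list what went into the archive
```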