Listing Directories by Size

I have a lot of files that I don't need anymore. Normally, I don't really care if they stay there. To an extent I believe that empty hard drive space is wasted hard drive space. I may not need a couple of pictures right now, but I may need to use them in a video project one day. Why delete the pictures? I have terabytes of space. Do I really need that 1M right now? Saving them may mean that I won't have to scour the Internet for them later.

This works to a point. Naturally, if my files are unorganized then I may as well just pull up Google Image Search. This also starts to work against me when I finally start running out of hard drive space.

I wanted to list my directories and large files to see what was devouring my hard drive. Going through them by hand is time-consuming. I have a command line; I'm going to make the PC do the work for me. I want to see the 10 largest hard-drive-hogging files and directories.

find . -maxdepth 1 \( -type d \! -name . \) -print0 -o \( -type f -size +1G \) -print0 | xargs -0 du -sh | sort -h | tail

This one-liner chains four commands together. Let's break it down to see how it works.

The find Command

find . -maxdepth 1 \( -type d \! -name . \) -print0 \
                -o \( -type f -size +1G \) -print0

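The first part tells find to look only in the current directory. To search a different directory, replace . with its path. Note that the \! -name . test only excludes a starting point that is literally named ., so with another path the starting directory itself will also show up in the results (adding -mindepth 1 avoids that). To see this directory and one level deeper into each subdirectory, replace -maxdepth 1 with -maxdepth 2.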
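
Here is a small sketch of what happens when the starting point is a path other than the current directory. The scratch directory and the photos and videos names are made up for the demonstration, and I am assuming a find that supports -mindepth (GNU and BSD find both do).

```shell
# Scratch directory standing in for "some other directory" (hypothetical layout).
target=$(mktemp -d)
mkdir "$target/photos" "$target/videos"

# With a path other than ".", the "-name ." test never matches,
# so the starting directory itself stays in the results.
with_name_test=$(find "$target" -maxdepth 1 -type d \! -name . | sort)

# -mindepth 1 skips the starting point no matter how it is spelled.
with_mindepth=$(find "$target" -mindepth 1 -maxdepth 1 -type d | sort)

printf 'with -name .:\n%s\n\nwith -mindepth 1:\n%s\n' \
    "$with_name_test" "$with_mindepth"

rm -rf "$target"
```

The first list has three entries (the scratch directory plus its two subdirectories); the second has only the two subdirectories.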

find . -maxdepth 1 \( -type d \! -name . \) -print0 \
                -o \( -type f -size +1G \) -print0

The two patterns I want are next. The (, ), and ! characters must be escaped using backslashes. Otherwise, the shell will try to interpret them. We don't want the shell to interpret them; we want find to interpret them. Backslashes will make the shell ignore them.

The first one says that we only want directories. However, exclude the current directory because it will always be the largest; it contains all the other entries that will be listed.

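The second pattern says that we only want files larger than 1 GiB. The plus sign is important. Without it, find would match only files whose size, rounded up to whole gibibytes, is exactly 1. A minus sign (-size -1G) is what selects files below the threshold, and because of that rounding GNU find would then match only empty files.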
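
The rounding is easier to see with small files. This sketch uses the k unit instead of G so that nothing large has to be created; the file names are made up, and the results assume GNU find, which rounds sizes up to whole units before comparing.

```shell
# Four scratch files of known sizes (names are hypothetical).
tmp=$(mktemp -d)
: > "$tmp/empty"                          # 0 bytes    -> 0 units of 1k
head -c 100  /dev/zero > "$tmp/tiny"      # 100 bytes  -> rounds up to 1 unit
head -c 1024 /dev/zero > "$tmp/exact"     # 1024 bytes -> exactly 1 unit
head -c 1025 /dev/zero > "$tmp/over"      # 1025 bytes -> rounds up to 2 units

bigger=$(find "$tmp" -type f -size +1k | sort)    # more than 1 unit
equal=$(find "$tmp" -type f -size 1k | sort)      # exactly 1 unit
smaller=$(find "$tmp" -type f -size -1k | sort)   # fewer than 1 unit

printf '+1k:\n%s\n\n1k:\n%s\n\n-1k:\n%s\n' "$bigger" "$equal" "$smaller"

rm -rf "$tmp"
```

With GNU find, +1k lists only over, 1k lists exact and tiny, and -1k lists only empty.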

find . -maxdepth 1 \( -type d \! -name . \) -print0 \
                -o \( -type f -size +1G \) -print0

"-o" means "or." This will join the two patterns together.

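The reason for the two -print0 arguments is slightly odd. When no action is given, find applies -print to the whole expression, so running the command without them would list everything that we are interested in. Once we add -print or -print0 ourselves, it applies only to the pattern it is attached to. The implicit "and" between tests binds more tightly than -o, so a single -print0 at the end would print only the large files and silently drop the directories. We need -print0 rather than the default -print because of how xargs handles file names, as explained below.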

We can wrap the entire statement in parentheses and use one -print0 at the end, but I find that too many escaped parentheses make the line hard to read.
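
For comparison, here is a sketch of the three spellings side by side: two -print0 actions, one -print0 after an extra pair of parentheses, and one -print0 with no extra parentheses. The scratch files are made up, and the threshold is lowered to 1k so the test file can stay small.

```shell
# Scratch directory with one subdirectory, one "large" file, one small file.
sandbox=$(mktemp -d)
mkdir "$sandbox/sub"
head -c 4096 /dev/zero > "$sandbox/big.bin"    # above the 1k threshold
head -c 10   /dev/zero > "$sandbox/small.txt"  # below it

# Turn NUL-delimited find output into sorted lines for display.
list() { tr '\0' '\n' | sort; }

# One -print0 per pattern, as in the one-liner.
two_prints=$(cd "$sandbox" && find . -maxdepth 1 \( -type d \! -name . \) -print0 \
    -o \( -type f -size +1k \) -print0 | list)

# Both patterns wrapped in one more pair of parentheses, single -print0.
one_print=$(cd "$sandbox" && find . -maxdepth 1 \
    \( \( -type d \! -name . \) -o \( -type f -size +1k \) \) -print0 | list)

# Single -print0 without the extra parentheses: it binds to the file pattern only.
no_parens=$(cd "$sandbox" && find . -maxdepth 1 \( -type d \! -name . \) \
    -o \( -type f -size +1k \) -print0 | list)

printf 'two -print0:\n%s\n\nwrapped:\n%s\n\nunwrapped:\n%s\n' \
    "$two_prints" "$one_print" "$no_parens"

rm -rf "$sandbox"
```

The first two forms print the same entries. The third prints only ./big.bin, because the directory pattern has no action attached to it.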

We now have a list of all the directories and files that we care about. Let's find out how big each one is.

The xargs and du Commands

The next command in the pipeline is actually two commands joined together.

xargs -0 du -sh

The xargs command takes all of the text piped to it and appends it to another command. Let's say that we want to copy a list of files to another directory. The cp command can only take a list of files; cp ignores piped input (STDIN). We have to ask xargs to attach the file list from find to the cp command string.

[contrapants@waronpants ~]$ find . -maxdepth 1 -name 'Inductor*.png'
./Inductor Voltage Divider Schematic.png
[contrapants@waronpants ~]$ mkdir pictemp
[contrapants@waronpants ~]$ find . -maxdepth 1 -name 'Inductor*.png' | xargs cp -vt pictemp/
cp: cannot stat `./Inductor': No such file or directory
cp: cannot stat `Voltage': No such file or directory
cp: cannot stat `Divider': No such file or directory
cp: cannot stat `Schematic.png': No such file or directory
[contrapants@waronpants ~]$

What happened? By default, xargs splits its input on whitespace, just as the shell treats spaces as separators when you give cp a list of files. So, when xargs attached the file name to cp, it effectively ran:

cp -vt pictemp/ ./Inductor Voltage Divider Schematic.png

According to cp, that's 4 files. We would normally write files with spaces as:

cp -vt pictemp/ "./Inductor Voltage Divider Schematic.png"

We can't easily modify the strings as they come out of find. Thankfully, find can end each file name with a null character instead of a newline using -print0. Adding -0 to xargs tells it to split its input only on those null characters, so each file name is passed to the command as a single argument, spaces and all. That has the same effect as the quotes we would have typed by hand.
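
One way to see the difference without copying anything is to count the arguments that xargs hands over. This is only a sketch: printf stands in for find, notes.txt is a made-up second file name, and sh -c 'echo "$#"' is just a stand-in command that prints how many arguments it received.

```shell
# Newline-delimited input, default xargs: names are split on every space.
newline_count=$(printf '%s\n' 'Inductor Voltage Divider Schematic.png' 'notes.txt' \
    | xargs sh -c 'echo "$#"' sh)

# NUL-delimited input with -0: one argument per name, spaces preserved.
nul_count=$(printf '%s\0' 'Inductor Voltage Divider Schematic.png' 'notes.txt' \
    | xargs -0 sh -c 'echo "$#"' sh)

echo "default xargs passed $newline_count arguments"
echo "xargs -0 passed $nul_count arguments"
```

The default run passes five arguments for two files; the -0 run passes two.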

[contrapants@waronpants ~]$ find . -maxdepth 1 -name 'Inductor*.png' -print0 | xargs -0 cp -vt pictemp/
`./Inductor Voltage Divider Schematic.png' -> `pictemp/Inductor Voltage Divider Schematic.png'
[contrapants@waronpants ~]$

xargs -0 du -sh

du is disk usage.

-s tells du to summarize each argument, i.e. print one total for each directory rather than a separate line for every subdirectory inside it.

-h tells du to make the output human-readable. This displays the output as kilobytes, megabytes, etc. instead of the number of blocks the file takes.
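
A quick sketch of what -s changes, using a made-up directory tree. The sizes du prints depend on the file system, so the example counts output lines instead of showing numbers.

```shell
# Hypothetical tree: top/ contains a/, which contains b/.
tree=$(mktemp -d)
mkdir -p "$tree/top/a/b"
head -c 2048 /dev/zero > "$tree/top/a/file1"
head -c 2048 /dev/zero > "$tree/top/a/b/file2"

# Without -s, du prints one line per directory in the tree.
per_dir_lines=$(du -h "$tree/top" | wc -l)

# With -s, du prints a single total for the argument.
summary_lines=$(du -sh "$tree/top" | wc -l)

echo "du -h printed $per_dir_lines lines; du -sh printed $summary_lines"

rm -rf "$tree"
```

Here du -h prints three lines (top/a/b, top/a, and top), while du -sh prints one.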

We now have a two column table. Column 1 is the file size. Column 2 is the file name. Next, let's sort the list by file size.

The sort Command

sort -h

sort normally sorts a list alphabetically. -h tells it that the beginning of the line is a human-readable file size; sort by that instead.
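
To show why -h matters, this sketch sorts a few made-up du-style lines both ways. It assumes a sort that supports -h, such as GNU sort.

```shell
# Sample "size<TAB>name" lines in the shape du -sh produces (values are invented).
sizes=$(printf '%s\t%s\n' 1.5G ./videos 200K ./notes 30M ./music 4.0K ./empty)

plain=$(printf '%s\n' "$sizes" | sort)      # compares text: 1.5G, 200K, 30M, 4.0K
human=$(printf '%s\n' "$sizes" | sort -h)   # compares sizes: 4.0K, 200K, 30M, 1.5G

printf 'sort:\n%s\n\nsort -h:\n%s\n' "$plain" "$human"
```

Plain sort puts 1.5G first because the character 1 sorts before 2, 3, and 4. With -h, the suffixes are taken into account and 1.5G ends up last.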

Now our table is sorted. The largest files and directories are at the bottom of the list. I only care about the largest 10. The table may contain thousands of entries. Let's trim that up a bit.

The tail Command

tail

tail filters out everything except the last 10 lines. Optionally, we can add -n to specify a different number. For example, if we wanted the last 20 lines, we would write tail -n 20.

Reversing the Order

If we wanted to list the largest file at the top of the list rather than the bottom, our one-liner would be:

find . -maxdepth 1 \( -type d \! -name . \) -print0 -o \( -type f -size +1G \) -print0 | xargs -0 du -sh | sort -hr | head

sort's -r switch reverses the sort order. head filters out all lines except the first 10.
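
The two endings pick the same entries in opposite order. A small sketch with invented sizes, keeping three lines instead of ten and again assuming a sort with -h:

```shell
# Invented human-readable sizes, one per line.
sizes=$(printf '%s\n' 12K 3.4G 560M 7.0K 89M)

# Ascending sort, keep the last three: largest at the bottom.
bottom_up=$(printf '%s\n' "$sizes" | sort -h | tail -n 3)

# Descending sort, keep the first three: largest at the top.
top_down=$(printf '%s\n' "$sizes" | sort -hr | head -n 3)

printf 'sort -h | tail -n 3:\n%s\n\nsort -hr | head -n 3:\n%s\n' \
    "$bottom_up" "$top_down"
```

The first list reads 89M, 560M, 3.4G; the second reads 3.4G, 560M, 89M.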

Removing the File Sizes

If we don't want to see the file sizes, we can display only the file names by adding the cut command.

find . -maxdepth 1 \( -type d \! -name . \) -print0 -o \( -type f -size +1G \) -print0 | xargs -0 du -sh | sort -h | tail | cut -f 2

-f 2 tells cut to display only field 2, i.e. column 2. This works because du separates its two columns with a tab, which is cut's default field delimiter. It is useful if we are building a file list for another command to process.
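
Because cut splits on tabs rather than spaces, file names with spaces come through intact. A tiny sketch with a made-up du-style line:

```shell
# One "size<TAB>name" line in the shape du -sh produces (the name is invented).
name=$(printf '4.0K\t./My Pictures\n' | cut -f 2)

echo "$name"
```

The output is ./My Pictures, with the space in the name preserved.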