Archiving Pidgin Logs by Year

I synchronize a lot of my programs among all of the computers that I work on. This isn't a problem, and I'm happy with my methods for doing so. However, there's one part that feels like an eternity when compared to the rest: my chat logs.

I have years of message logs. I do (rarely) look at them, kind of like text-based photographs, so I don't want to simply delete them. Something had to be done, though, because calculating checksums for more than 7000 files totaling a measly 1.5MB takes a long time, and (in theory) those files should never change. I decided to archive them by year. I pulled up a command line and typed out a quick one-liner. In a few minutes, it was done.

for i in `seq 2006 2010`; do find logs/ -name "$i-*" -print0 | xargs -0I {} sh -c "chmod 644 {} ; tar -rf logs$i.tar {} && rm {}" ; bzip2 -9 logs$i.tar ; done

This line relies on the file name convention that Pidgin uses: each log file's name starts with the year.

Instead of writing a huge one-liner, I could easily have written the find output to a file and passed that file to each of the other commands in the chain, but my mind seems to default to the most difficult way possible. I have one-liners and regular expressions that reach Randall Munroe-level complexity. On the plus side, though, you can enter the one-liner and walk away. Everything will be done in one shot without needing user intervention.

Let's break it down into segments to explain what it does.

for i in `seq 2006 2010`; do find logs/ -name "$i-*" -print0 | xargs -0I {} sh -c "chmod 644 {} ; tar -rf logs$i.tar {} && rm {}" ; bzip2 -9 logs$i.tar ; done

The For Loop

for i in `seq 2006 2010`;
do command;
done

This part of the one-liner creates the variable $i for use in command. seq 2006 2010 generates the sequence of numbers from 2006 to 2010. So, the command will run 5 times: once with $i equal to 2006, once with $i equal to 2007, and so on through 2010.
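To see the loop on its own, here is a minimal sketch that only collects the values of $i instead of running the archive commands. It uses $(...) in place of backticks, which behaves the same way here. The variable name years is made up for the demo.

```shell
# Collect each value the loop assigns to $i.
years=""
for i in $(seq 2006 2010); do
    years="$years $i"
done
echo "$years"
```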

The Find and Xargs Segment

find logs/ -name "$i-*" -print0 | xargs -0

This looks for all files whose names start with "2006-", "2007-", etc. anywhere under the directory "logs".

I already wrote a more in-depth article on find and xargs, so I won't cover it here.
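As a quick illustration, the sketch below runs the same find pattern against a throwaway directory. The directory layout and file names are invented for the demo, and basename stands in for the real command chain so that only the matched names are printed.

```shell
# Build a throwaway directory tree (all names are made up for the demo).
demo=$(mktemp -d)
mkdir -p "$demo/logs/aim/some buddy"
touch "$demo/logs/aim/some buddy/2006-01-01.txt" \
      "$demo/logs/aim/some buddy/2006-05-09.txt" \
      "$demo/logs/aim/some buddy/2007-02-03.txt"

i=2006
# -print0 and -0 delimit names with NUL bytes, so the space in
# "some buddy" does not split a path in two.
matches=$(find "$demo/logs/" -name "$i-*" -print0 | xargs -0 -n 1 basename | sort)
echo "$matches"
rm -r "$demo"
```

Only the two 2006 files should be listed. The 2007 file is left for the next pass of the loop.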

The Primary Command Chain

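The -I switch of xargs is used to replace the following string token with the passed-in file name. I chose {} because it's the conventional placeholder (find's -exec uses the same one). If $i is 2006 and the file name is "2006-01-01-blah.txt" (the directory part of the path is left out here for brevity), then: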

sh -c "chmod 644 {} ; tar -rf logs$i.tar {} && rm {}" ;

would expand to:

sh -c "chmod 644 2006-01-01-blah.txt ; tar -rf logs2006.tar 2006-01-01-blah.txt && rm 2006-01-01-blah.txt" ;

I do this because I need to run multiple commands on the same file. There's one problem with this, though. Whenever the -I switch is used, xargs processes one file at a time rather than batching them. Because I have 7000+ files, this means that chmod, tar, and rm will each be called 7000+ times. This can be solved by writing a quick script that takes a list of file names and passes them all to each command, but it wasn't necessary for my purposes.
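For comparison, a batched version might look something like the sketch below. It is not what I ran. It assumes GNU tar (for --null with -T), and the sandbox directory, file names, and the list variable are invented for the demo.

```shell
# Throwaway sandbox; the directory and file names are invented for the demo.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir -p logs/aim
touch logs/aim/2006-01-01.txt logs/aim/2006-01-02.txt logs/aim/2007-01-01.txt

i=2006
list="$demo/files-$i.list"
# Save the NUL-delimited list of matching files once.
find logs/ -type f -name "$i-*" -print0 > "$list"
# Each command now runs once over the whole batch.
xargs -0 chmod 644 < "$list"
# --null makes -T read NUL-delimited names (assumes GNU tar).
tar -cf "logs$i.tar" --null -T "$list" && xargs -0 rm < "$list"
rm "$list"

archived=$(tar -tf "logs$i.tar" | grep -c "$i-")
remaining=$(find logs/ -type f | grep -c .)
echo "archived=$archived remaining=$remaining"
```

Because the archive is written in a single pass, -c (create) is enough in this variant and -r (append) is no longer needed.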

This segment consists of 4 commands.

sh -c "chmod 644 {} ; tar -rf logs$i.tar {} && rm {}" ;

Cheating With a Subshell

sh -c "<commands>" ;

xargs only replaces {} in one command. Once the shell reaches ; or &&, nothing after that is replaced because it's no longer part of the xargs command. By putting the three commands in another shell using sh -c, xargs sees only one command. Therefore, all three instances of {} will be replaced.
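One caveat: because {} is spliced into the command string, the inner sh parses the file name again, so a path containing a space or a quote can break the command. A hedged alternative, sketched below, passes the file name as a positional parameter instead. The sandbox and the "some buddy" directory are invented for the demo.

```shell
# Throwaway sandbox with a space in the path (names invented for the demo).
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir -p "logs/aim/some buddy"
touch "logs/aim/some buddy/2006-01-01.txt"

i=2006
# {} is handed to sh as an argument, so it arrives in "$1" intact.
# The second "sh" only fills in $0; the archive name is passed as "$2".
find logs/ -type f -name "$i-*" -print0 |
    xargs -0 -I {} sh -c 'chmod 644 "$1" ; tar -rf "$2" "$1" && rm "$1"' sh {} "logs$i.tar"

archived=$(tar -tf "logs$i.tar" | grep -c 'some buddy/2006-01-01.txt')
leftover=$(find logs/ -type f | wc -l | tr -d ' ')
echo "archived=$archived leftover=$leftover"
```

The single quotes keep the outer shell from expanding "$1" and "$2", which is why the archive name has to be passed in as an extra argument.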

File Permissions

chmod 644 {} ;

For the most part, this isn't necessary. I do this because I synchronize with PCs that run Windows. Windows doesn't know about Linux file permissions, so all the text files that originated in Windows are set as executable by default on my Linux box. I want to reset the file permissions before I archive them. You can leave this command out if you don't need it. If you are archiving anything that is meant to be executed, e.g. a program, then you should leave this command out.
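A minimal sketch of what the reset does, using a made-up file name and chmod 755 to simulate a log that arrived with the executable bits set:

```shell
demo=$(mktemp -d)
f="$demo/2006-01-01.txt"
touch "$f"
chmod 755 "$f"   # simulate a text file that came over marked executable
chmod 644 "$f"   # owner read/write, everyone else read-only, no execute bits

if [ -x "$f" ]; then executable=yes; else executable=no; fi
echo "executable=$executable"
rm -r "$demo"
```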

Packaging Everything into One File

tar -rf logs$i.tar {}

This creates a temporary file called logs2006.tar, logs2007.tar, etc. If you haven't worked with tar too extensively, then you may never have seen the -r (append) switch before. -r will append to a tar file or create an archive if it's not already there. -z (gzip) and -j (bzip2) can't be used here; -r requires an uncompressed file.

Append is used because of how I'm calling tar. I'm adding thousands of files one at a time, so I need an uncompressed temporary archive to keep adding to.

If I used -c (create), then tar would overwrite the archive during every iteration. After adding all the files, I'd be left with an archive that contained only the last file added. Clearly, that's not what I want.
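The difference is easy to see in a small experiment. This sketch adds three made-up files one at a time, once with -c and once with -r, and then lists both archives.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
touch a.txt b.txt c.txt

for f in a.txt b.txt c.txt; do
    tar -cf created.tar "$f"    # -c starts a fresh archive every time
    tar -rf appended.tar "$f"   # -r adds to whatever is already there
done

created=$(tar -tf created.tar | tr '\n' ' ')
appended=$(tar -tf appended.tar | tr '\n' ' ')
echo "created:  $created"
echo "appended: $appended"
```

Only the archive built with -r should list all three files.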

Removing the Added File

&& rm {} ;

rm is used to delete the original file on the fly. I use && just in case there is an error. If tar fails for some reason, rm is skipped for that file, so the original is left in place rather than deleted without a copy in the archive.
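Here is a small sketch of that short-circuit behavior. The commands false and true stand in for a failing and a succeeding tar, and the file names are invented for the demo.

```shell
demo=$(mktemp -d)

# "false" stands in for a tar that failed, so rm is never reached.
touch "$demo/keep.txt"
false && rm "$demo/keep.txt"
if [ -e "$demo/keep.txt" ]; then kept=yes; else kept=no; fi

# "true" stands in for a tar that succeeded, so rm runs.
touch "$demo/gone.txt"
true && rm "$demo/gone.txt"
if [ -e "$demo/gone.txt" ]; then gone=no; else gone=yes; fi

echo "kept=$kept gone=$gone"
rm -r "$demo"
```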

tar does provide a switch called --remove-files that does this same operation. Every time I've used it, it has left files behind. Oddly, it's always the same files. I never figured out why, but I stopped trusting it. Calling rm isn't too difficult to do for peace of mind.

After all that, I have an archive full of a year's worth of log files and the original files are removed. It's time to compress.

Compressing the File

bzip2 -9 logs$i.tar

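bzip2 takes the temporary file, compresses it to originalfilename.bz2, and deletes the original file. -9 says to use the largest block size, which gives the strongest compression bzip2 offers. It is also bzip2's default, so the switch is mostly there to be explicit.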

Conclusion

Although this one-liner is very application-specific, I hope the concepts it uses will help someone. Let me know what you think in the comments below.