Archiving Pidgin Logs by Year
I synchronize a lot of my programs among all of the computers that I work on. This isn't a problem, and I'm happy with my methods for doing so. However, there's one part that feels like an eternity when compared to the rest: my chat logs.
I have years of message logs. I do (rarely) look at them, kind of like text-based photographs, so I don't want to simply delete them. Something had to be done, though: the logs total a measly 1.5MB, but they are spread across more than 7000 files, and calculating checksums for files that (in theory) should never change takes a long time. I decided to archive them by year. I pulled up a command line and typed out a quick line. In a few minutes, it was done.
for i in `seq 2006 2010`; do find logs/ -name "$i-*" -print0 | xargs -0I {} sh -c "chmod 644 {} ; tar -rf logs$i.tar {} && rm {}" ; bzip2 -9 logs$i.tar ; done
This line relies on the file name convention that Pidgin uses: each log file's name starts with the year.
Instead of writing a huge one-liner, this could easily be done by writing the find output to a file and passing that file to each of the other commands in the chain, but my mind seems to default to the most difficult way possible. I have one-liners and regular expressions that reach Randall Munroe-level complexity. On the plus side, though, you can enter the one-liner and walk away. Everything will be done in one shot without needing user intervention.
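For comparison, the file-list approach mentioned above might look something like the sketch below. It assumes GNU find, xargs, and tar. The scratch directory, the sample file names, and the list file name are all made up for illustration, so the block can be run without touching real logs.

```shell
# Sketch only: works on throwaway sample data in a temp directory.
set -e
work=$(mktemp -d)
cd "$work"
mkdir -p logs/aim/buddy
printf 'hello\n' > logs/aim/buddy/2006-01-01.120000.txt
printf 'again\n' > logs/aim/buddy/2006-02-01.120000.txt
printf 'later\n' > logs/aim/buddy/2007-01-01.120000.txt

i=2006
list="files$i.list"     # hypothetical name for the list file

# 1. Write the NUL-separated find output to a file once.
find logs/ -type f -name "$i-*" -print0 > "$list"

# 2. Hand the same list to each command, so each one runs on a whole batch.
xargs -0 chmod 644 < "$list"
tar -cjf "logs$i.tar.bz2" --null -T "$list"   # GNU tar: read NUL-separated names
xargs -0 rm < "$list"                         # only reached if tar succeeded (set -e)
rm "$list"

tar -tjf "logs$i.tar.bz2"                     # list what was archived
```

Because every file is added in a single tar call, the archive can be created and compressed in one step here, which the append-based one-liner cannot do.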
Let's break it down into segments to explain what it does.
for i in `seq 2006 2010`; do
find logs/ -name "$i-*" -print0 | xargs -0 \
-I {} sh -c "chmod 644 {} ; tar -rf logs$i.tar {} && rm {}" ;
bzip2 -9 logs$i.tar ;
done
The For Loop
for i in `seq 2006 2010`;
do command;
done
This part of the one-liner creates the variable $i, which is available inside command. seq 2006 2010 generates the sequence of integers from 2006 to 2010, inclusive. So, command will run 5 times: once with $i equal to 2006, once with $i equal to 2007, etc.
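To see the loop on its own, here is a tiny sketch that only prints what each pass would work on. It uses $(...) in place of backticks, which does the same thing; the count variable is just there for illustration.

```shell
count=0
for i in $(seq 2006 2010); do
    # Each pass sees a different year in $i.
    echo "year: $i, archive: logs$i.tar"
    count=$((count + 1))
done
echo "iterations: $count"
```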
The Find and Xargs Segment
find logs/ -name "$i-*" -print0 | xargs -0
This looks for all files whose names start with "2006-", "2007-", etc. anywhere under the directory "logs".
I already wrote a more in-depth article on find and xargs, so I won't cover it here.
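As a quick illustration of why -print0 and -0 are paired, the sketch below runs the same find pattern against a scratch directory whose names contain spaces. The directory and file names are invented for the example.

```shell
set -e
demo=$(mktemp -d)
mkdir -p "$demo/logs/some buddy"
: > "$demo/logs/some buddy/2006-03-04 chat.txt"
: > "$demo/logs/some buddy/2007-03-04 chat.txt"

i=2006
# -print0 ends each name with a NUL byte; xargs -0 splits on NUL only,
# so the spaces in the path do not break the name into pieces.
matched=$(find "$demo/logs/" -name "$i-*" -print0 | xargs -0 ls -1 | wc -l)
echo "files matched for $i: $matched"
```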
The Primary Command Chain
The -I switch of xargs is used to replace the following string token with the passed-in file name. I chose {} because it's the conventional token (it is the default for the older -i switch). If $i is 2006 and the file name is "2006-01-01-blah.txt", then:
sh -c "chmod 644 {} ; tar -rf logs$i.tar {} && rm {}" ;
would expand to:
sh -c "chmod 644 2006-01-01-blah.txt ; tar -rf logs2006.tar 2006-01-01-blah.txt && rm 2006-01-01-blah.txt" ;
I do this because I need to run multiple commands on the same file. There's one problem with this, though. Whenever the -I switch is used, xargs can only process one file at a time rather than batching them. Because I have 7000+ files, this means that chmod, tar, and rm will each be called 7000+ times. This can be solved by writing a quick script that takes a list of file names and passes them all to each command, but it wasn't necessary for my purposes.
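One way the batching idea could look, without a separate script file, is to drop -I and let the inner shell receive the names as "$@". This is a sketch under assumptions, not what I actually ran: the scratch directory and sample files are invented, and the archive name is passed as the first argument.

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
mkdir logs
for n in 01 02 03; do
    printf 'log %s\n' "$n" > "logs/2006-01-$n.txt"
done

i=2006
# Without -I, xargs appends as many names as fit onto one command line.
# The trailing "sh" fills $0, "logs$i.tar" becomes $1, and the file
# names follow. Single quotes keep "$@" for the inner shell.
find logs/ -type f -name "$i-*" -print0 |
    xargs -0 sh -c 'archive=$1; shift
                    chmod 644 "$@"
                    tar -rf "$archive" "$@" && rm "$@"' sh "logs$i.tar"

tar -tf "logs$i.tar"
```

Since the names arrive as quoted positional arguments instead of being pasted into the command string, this form also copes with spaces in file names.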
This segment consists of 4 commands.
sh -c "
chmod 644 {} ;
tar -rf logs$i.tar {}
&& rm {}
" ;
Cheating With a Subshell
sh -c "<commands>" ;
xargs only replaces {} in one command. Once the shell reaches ; or && the command ends, so nothing after that point is replaced because it's no longer part of the xargs command. By putting the three commands in another shell using sh -c, xargs sees only one command. Therefore, all three instances of {} will be replaced.
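The difference is easy to see with echo standing in for the real commands. In this sketch, a.txt is just a placeholder name fed to xargs on standard input.

```shell
# Without a subshell: the outer shell splits the line at ";", so xargs
# only ever sees "echo first {}". The second echo prints a literal {}.
plain=$(printf 'a.txt\n' | xargs -I {} echo first {} ; echo second {})

# With sh -c: the whole quoted string is a single argument to xargs,
# so every {} inside it is replaced.
wrapped=$(printf 'a.txt\n' | xargs -I {} sh -c "echo first {} ; echo second {}")

printf '%s\n' "--- plain ---" "$plain" "--- wrapped ---" "$wrapped"
```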
File Permissions
chmod 644 {} ;
For the most part, this isn't necessary. I do this because I synchronize with PCs that run Windows, and Windows doesn't know about Linux file permissions. On my box, all the text files that originated in Windows show up as executable in Linux by default, so I reset the file permissions before I archive them. You can leave this command out if you don't need it. If you are archiving anything that is meant to be executed (e.g. a program), then you should leave this command out.
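For anyone unsure what 644 does, the following sketch shows the mode before and after on a scratch file. It assumes GNU stat (for the -c %a format), and the file name is made up.

```shell
set -e
demo=$(mktemp -d)
f="$demo/2006-01-01.txt"     # hypothetical log file
: > "$f"
chmod 755 "$f"               # simulate a file that arrived marked executable

before=$(stat -c %a "$f")    # GNU stat: print the mode in octal
chmod 644 "$f"               # owner read/write, everyone else read-only
after=$(stat -c %a "$f")
echo "$before -> $after"
```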
Packaging Everything into One File
tar -rf logs$i.tar {}
This creates a temporary file called logs2006.tar, logs2007.tar, etc. If you haven't worked with tar too extensively, then you may never have seen the -r (append) switch before. -r will append to a tar file, or create the archive if it's not already there. -z (gzip) and -j (bzip2) can't be used here; -r requires an uncompressed file.
Append is used because of how I'm calling tar. I'm adding thousands of files one at a time. I need a temporary file for that.
If I used -c (create), then tar would overwrite the archive during every iteration. After adding all the files, I'd be left with an archive that contained only the last file added. Clearly, that's not what I want.
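The difference between -c and -r shows up with just two files. The sketch below uses placeholder files in a scratch directory and adds them one at a time, the same way the one-liner does.

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
printf 'one\n' > a.txt
printf 'two\n' > b.txt

# -c recreates the archive on every call, so only the last file survives.
for f in a.txt b.txt; do tar -cf created.tar "$f"; done

# -r appends (and creates the archive on first use), so both files are kept.
for f in a.txt b.txt; do tar -rf appended.tar "$f"; done

created=$(tar -tf created.tar)
appended=$(tar -tf appended.tar)
printf '%s\n' "created.tar:" "$created" "appended.tar:" "$appended"
```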
Removing the Added File
&& rm {} ;
rm is used to delete the original file on the fly. I use && just in case there is an error. If tar fails for some reason, the chain stops there and rm is never called.
tar does provide a switch called --remove-files that does this same operation. Every time I've used it, it has left files behind. Oddly, it's always the same files. I didn't figure out why, but I stopped trusting it. Calling rm isn't too difficult to do for peace of mind.
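Here is a small sketch of that safety net in action. The failure is forced by pointing tar at a directory that does not exist; the file and directory names are invented for the example.

```shell
demo=$(mktemp -d)
cd "$demo"
printf 'keep me\n' > 2006-01-01.txt

# Deliberately fail: no-such-dir does not exist, so tar cannot create
# the archive and exits non-zero. Because of &&, rm never runs.
tar -rf no-such-dir/logs2006.tar 2006-01-01.txt 2>/dev/null && rm 2006-01-01.txt
status=$?

echo "exit status: $status"
ls    # the original file is still here
```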
After all that, I have an archive full of a year's worth of log files and the original files are removed. It's time to compress.
Compressing the File
bzip2 -9 logs$i.tar
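bzip2 takes the temporary file, compresses it to originalfilename.bz2 (here, logs2006.tar.bz2), and deletes the original file. -9 asks for the largest block size and therefore the strongest compression bzip2 offers; it is also bzip2's default, so the switch mostly makes the intent explicit.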
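To check the result, and to peek inside an archive later without unpacking it, something like the sketch below should work. It builds a one-file archive in a scratch directory first; the file name is a placeholder.

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
printf 'some log text\n' > 2006-01-01.txt
tar -rf logs2006.tar 2006-01-01.txt

bzip2 -9 logs2006.tar                   # replaces logs2006.tar with logs2006.tar.bz2
bzip2 -t logs2006.tar.bz2               # integrity check; silent on success
listing=$(tar -tjf logs2006.tar.bz2)    # -j lets tar read through bzip2
echo "$listing"
```

To pull a single log back out, tar -xjf logs2006.tar.bz2 followed by the path shown in the listing extracts just that file.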
Conclusion
Although this one-liner is very application-specific, I hope the concepts it uses will help someone. Let me know what you think in the comments below.