Monday, 25 June 2012

Merge and sort many logfiles having multiline entries

logmerge is a small but powerful script that merges two or more log files so that multiline entries appear in the correct chronological order and are never split apart. Optional arguments add descriptive fields at the beginning of each line of the combined log file. Reading .gz and .bz2 files is supported.

I work with a complex Java application that produces many log files (each process of the application writes its own log file). At one point we needed to merge several log files into one file with the entries sorted chronologically. There were several problems to solve:
  1. Dates are stored in the log files in the format mm/dd/YYYY HH:MM:SS.ccc (ccc is milliseconds), which does not sort chronologically as plain text. It has to be converted to a more convenient format, for example YYYYmmddHHMMSSccc;
  2. Multiline entries. Some entries occupy more than one line of the log file (for example, the stack traces common in Java applications). In this case a plain line-by-line sort breaks the order of the lines within an entry;
  3. Lines with the same timestamp must keep their order within one input file. That is, if two log entries "B" and "A" share one timestamp and appear in that order, they must stay in that order after sorting.
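
To illustrate the first point, here is a minimal sketch of the timestamp conversion using sed. The function name to_sortable_key is hypothetical and is not part of logmerge; the sketch assumes the timestamp sits at the very start of the line and leaves lines without one unchanged.

```shell
# Hypothetical helper (not taken from logmerge): rewrite a leading
# "mm/dd/YYYY HH:MM:SS.ccc" timestamp as the sortable key YYYYmmddHHMMSSccc.
# Lines that do not start with such a timestamp pass through unchanged.
to_sortable_key() {
    sed -E 's#^([0-9]{2})/([0-9]{2})/([0-9]{4}) ([0-9]{2}):([0-9]{2}):([0-9]{2})\.([0-9]{3}).*#\3\1\2\4\5\6\7#'
}

# Example: prints 20120521215441070
printf '%s\n' '05/21/2012 21:54:41.070 Something happened' | to_sortable_key
```

Because every field in the key has a fixed width and runs from the most significant (year) to the least significant (milliseconds), comparing keys as plain strings gives chronological order.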
I found many solutions on the Internet, but none of them covered our requirements. Some tools are specific to web applications and do not understand multiline log entries, or would have to be modified to handle different timestamp formats. Other tools are written in Ruby or Python. That is not a good fit, because it requires installing software that neither we nor our customers would otherwise use. The problem should be solved using only tools native to the system; for example, all Unix systems have a shell and Perl. To close this short review of the existing tools, I should add that there are also commercial tools. That is the worst option.

Finally I developed a tool that covers all our requirements. It requires Bash and Perl, plus gzip/bzip2 for reading compressed files. All of these are native in the Unix world, and they are also available to Windows users who have installed Cygwin.

Let's look at some examples that show the main features of the tool.


1. Merge all Apache error files
./logmerge --apache-error ./error.log* > all.log
Merge all error.log* Apache files, including gzipped ones, and store the result in all.log. The --apache-error option assumes that each line looks like the example below and builds a marker containing the sortable timestamp 20100423221421 that corresponds to the original one:
[Fri Apr 23 22:14:21 2010] <the rest of the entry>
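
The mapping from the Apache error-log timestamp to a sortable marker can be sketched with awk as follows. This is only an illustration of the idea, not the code logmerge uses; the function name apache_error_key is hypothetical, and the sketch assumes the exact timestamp shape shown above (no fractional seconds).

```shell
# Hypothetical helper (not taken from logmerge): print YYYYmmddHHMMSS for
# every line that starts with "[Www Mmm dd HH:MM:SS YYYY]".
apache_error_key() {
    awk '
    BEGIN {
        # Map English month abbreviations to two-digit numbers.
        n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
        for (i = 1; i <= n; i++) month[names[i]] = sprintf("%02d", i)
    }
    /^\[/ {
        # Fields: $1="[Fri" $2="Apr" $3="23" $4="22:14:21" $5="2010]"
        hms = $4
        gsub(/:/, "", hms)
        print substr($5, 1, 4) month[$2] sprintf("%02d", $3) hms
    }'
}

# Example: prints 20100423221421
printf '%s\n' '[Fri Apr 23 22:14:21 2010] [error] something failed' | apache_error_key
```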
2. Merge all Apache access files
find /export/home/ -name 'access.log' | xargs ./logmerge -f -n --apache-access > all.log
Find the current access.log Apache file in every home directory under /export/home, merge them chronologically, and store the result in all.log. In addition, each line of the result begins with the file name and the line number within the original file. The utility assumes that Apache access logs consist of entries like the one below and transforms the timestamp it finds into the sortable form 20080215141849:
<the beginning of the entry> [15/Feb/2008:14:18:49 +0300] <the rest of the entry>
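
For the access-log format the timestamp is not at the start of the line, so it has to be located first. A rough awk sketch follows; the function name apache_access_key is hypothetical, and the sketch ignores the time zone offset, which is an assumption suggested by the sortable form having no zone component.

```shell
# Hypothetical helper (not taken from logmerge): find the first
# "[dd/Mmm/YYYY:HH:MM:SS " in a line and print YYYYmmddHHMMSS.
# The time zone offset is ignored (an assumption of this sketch).
apache_access_key() {
    awk '
    BEGIN {
        n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
        for (i = 1; i <= n; i++) month[names[i]] = sprintf("%02d", i)
    }
    match($0, /\[[0-9][0-9]\/[A-Za-z][A-Za-z][A-Za-z]\/[0-9][0-9][0-9][0-9]:[0-9][0-9]:[0-9][0-9]:[0-9][0-9] /) {
        # s holds the 20 characters "dd/Mmm/YYYY:HH:MM:SS" after the bracket.
        s = substr($0, RSTART + 1, 20)
        print substr(s, 8, 4) month[substr(s, 4, 3)] substr(s, 1, 2) \
              substr(s, 13, 2) substr(s, 16, 2) substr(s, 19, 2)
    }'
}
```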
3. Merge multiline entries from several files chronologically
./logmerge -f -n log/*.log | gzip -c > all.gz
Merge all .log files located in the log/ directory and pipe the result to gzip. The file name and the line number are added at the beginning of each line of the resulting file. By default the utility assumes that each log entry begins with a timestamp and may occupy more than one line (e.g. Java stack traces like the one below):
05/21/2012 21:54:41.070 <the rest of the entry>
        at boo.hoo.StackTrace.main(
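
The general idea behind merging multiline entries can be sketched with standard tools. This is a simplified illustration under stated assumptions, not the actual logmerge implementation: the function name merge_logs is hypothetical, only plain (uncompressed) files in the default mm/dd/YYYY HH:MM:SS.ccc format are handled, and file names and line numbers are not added. Every line is tagged with the sortable key of the entry it belongs to plus a running sequence number, so continuation lines travel with their entry and entries with equal timestamps keep their input order.

```shell
# Hypothetical sketch (not taken from logmerge): merge log files whose
# entries start with "mm/dd/YYYY HH:MM:SS.ccc" and may span several lines.
merge_logs() {
    tab=$(printf '\t')
    awk '
    # A line starting with a timestamp opens a new entry: compute its key.
    /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]\.[0-9][0-9][0-9]/ {
        key = substr($0, 7, 4) substr($0, 1, 2) substr($0, 4, 2) \
              substr($0, 12, 2) substr($0, 15, 2) substr($0, 18, 2) substr($0, 21, 3)
    }
    # Continuation lines inherit the key of the entry they belong to.
    # NR keeps counting across all input files, so it preserves input order.
    { printf "%s\t%d\t%s\n", key, NR, $0 }
    ' "$@" |
    LC_ALL=C sort -t "$tab" -k1,1 -k2,2n |   # by key, then by input order
    cut -f3-                                 # drop the two helper columns
}

# Usage (file names are only an example):
#   merge_logs log/a.log log/b.log > all.log
```

Using the sequence number as a secondary sort key is a design choice of this sketch: it gives the effect of a stable sort without depending on a stable-sort option of the sort utility.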

The project is hosted on Google Code and is available for download under the MIT license.
