d

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore.

15 St Margarets, NY 10033
(+381) 11 123 4567
ouroffice@aware.com

 

KMF

A Good Old-Fashioned Perl Log Analyzer

A recent Lobsters post lauding the virtues of AWK reminded me that although the language is powerful and lightning-fast, I usually find myself exceeding its capabilities and reaching for Perl instead. One such application is analyzing voluminous log files such as the ones generated by this blog. Yes, WordPress has stats, but I’ve never let reinvention of the wheel get in the way of a good programming exercise.

So I whipped this script up on Sunday night while watching RuPaul’s Drag Race reruns. It parses my Apache web server log files and reports on hits from week to week.

#!/usr/bin/env perl

use strict;
use warnings;
use Regexp::Log::Common;
use DateTime::Format::HTTP;
use List::Util 1.33 'any';
use Number::Format 'format_number';

my $parser = Regexp::Log::Common->new(
    format  => ':extended',
    capture => [qw<req ts status>],
);
my @fields      = $parser->capture;
my $compiled_re = $parser->regexp;

my @skip_uri_patterns = qw<
  ^/+robots.txt
  [-w]*sitemap[-w]*.xml
  ^/+wp-
  /feed/?$
  ^/+?rest_route=
>;

my ( %count, %week_of );
while (<>) {
    my %log;
    @log{@fields} = /$compiled_re/;

    # only interested in successful or cached requests
    next unless $log{status} =~ /^2/ or $log{status} = 304;

    my ( $method, $uri, $protocol ) = split ' ', $log{req};
    next unless $method eq 'GET';
    next if any { $uri =~ $_ } @skip_uri_patterns;

    my $dt  = DateTime::Format::HTTP->parse_datetime( $log{ts} );
    my $key = sprintf '%u-%02u', $dt->week;
    $week_of{$key} ||= $dt->date;    # get first date of each week
    $count{$key}++;
}

printf "Week of %s: % 10sn", $week_of{$_}, format_number( $count{$_} )
  for sort keys %count;

Here’s some sample output:

Week of 2021-07-31:      2,672
Week of 2021-08-02:     16,222
Week of 2021-08-09:     12,609
Week of 2021-08-16:     17,714
Week of 2021-08-23:     14,462
Week of 2021-08-30:     11,758
Week of 2021-09-06:     14,811
Week of 2021-09-13:        407

I first started prototyping this on the command line as if it were an awk one-liner by using the perl -n and -a flags. The former wraps code in a loop over the <> “diamond operator”, processing each line from standard input or files passed as arguments. The latter splits the fields of the line into an array named @F. It looked something like this while I was listing URIs (locations on the website):

gunzip -c ~/logs/phoenixtrap.com-ssl_log-*.gz | 
perl -anE 'say $F[6]'

But once I realized I’d need to filter out a bunch of URI patterns (listed on lines 18 through 22) and do some aggregation by date, I turned it into a script and turned to CPAN.

There I found Regexp::Log::Common and DateTime::Format::HTTP, which let me pull apart the Apache log format and its timestamp strings without having to write even more complicated regular expressions myself. (As noted above, this was already a wheel-reinvention exercise; no need to compound that further.)

Regexp::Log::Common builds a compiled regular expression based on the log format and fields you’re interested in, so that’s the constructor on lines 10 through 13. The expression then returns those fields as a list, which I’m assigning to a hash slice with those field names as keys in line 28. I then skip over requests that aren’t successful or browser cache hits (line 31), skip over requests that don’t GET web pages or other assets (e.g., POST s to forms or updating other resources), and skip over the URI patterns mentioned earlier.

(Those patterns are worth a mention: they include the and sitemap XML files used by search engine indexers, WordPress administration pages, files used by RSS newsreaders subscribed to my blog, and routes used by the Jetpack WordPress add-on. If you’re adapting this for your site you might need to customize this list based on what software you use to run it.)

Lines 37 and 38 parse the timestamp from the log into a DateTime object using DateTime::Format::HTTP and then build the key used to store the per-week hit count. The last lines of the loop then grab the first date of each new week (assuming the log is in chronological order) and increment the count. Once finished, lines 43 and 44 provide a report sorted by week, displaying it as a friendly “Week of date” and the hit counts aligned to the right with sprintf. Number::Format’s format_number function displays the totals with thousands separators.

Room for Improvement

DateTime is a very powerful module but this comes at a price of speed and memory. Something simpler like Date::WeekNumber should yield performance improvements, especially as my logs grow (here’s hoping). It requires a bit more manual massaging of the log dates to convert them into something the module can use, though:

#!/usr/bin/env perl

use strict;
use warnings;
use Syntax::Construct 'regex-named-capture-group';
use Regexp::Log::Common;
use Date::WeekNumber 'iso_week_number';
use List::Util 1.33 'any';
use Number::Format 'format_number';

my $parser = Regexp::Log::Common->new(
    format  => ':extended',
    capture => [qw<req ts status>],
);
my @fields      = $parser->capture;
my $compiled_re = $parser->regexp;

my @skip_uri_patterns = qw<
  ^/+robots.txt
  [-w]*sitemap[-w]*.xml
  ^/+wp-
  /feed/?$
  ^/+?rest_route=
>;

my %month = (
    Jan => '01', Feb => '02', Mar => '03', Apr => '04',
    May => '05', Jun => '06', Jul => '07', Aug => '08',
    Sep => '09', Oct => '10', Nov => '11', Dec => '12',
);

my ( %count, %week_of );
while (<>) {
    my %log;
    @log{@fields} = /$compiled_re/;

    # only interested in successful or cached requests
    next unless $log{status} =~ /^2/ or $log{status} = 304;

    my ( $method, $uri, $protocol ) = split ' ', $log{req};
    next unless $method eq 'GET';
    next if any { $uri =~ $_ } @skip_uri_patterns;

    # convert log timestamp to YYYY-MM-DD for Date::WeekNumber
    $log{ts} =~ m!^ (?<day>dd) / (?<month>...) / (?<year>d{4}) : !x;
    my $date = "$+{year}-$month{ $+{month} }-$+{day}";

    my $week = iso_week_number($date);
    $week_of{$week} ||= $date;
    $count{$week}++;
}

printf "Week of %s: % 10sn", $week_of{$_}, format_number( $count{$_} )
  for sort keys %count;

It looks almost the same as the first version, with the addition of a hash to convert month names to numbers in lines 26 through 30 and the actual conversion (using named regular expression capture groups for readability, using Syntax::Construct to check for that feature in line 5) in lines 45 and 46. On my server, this results in a ten- to eleven-second savings when processing two months of compressed logs.

What’s next? Pretty graphs? Drilling down to specific blog posts? Database storage for further queries and analysis? Perl and CPAN make it possible to go far beyond what you can do with AWK. What would you add or change? Let me know in the comments.

Credit: Source link

Previous Next
Close
Test Caption
Test Description goes like this