Better Living Through Thinking

Dumping HTML tables using HTML::Parser

Thu, 05 Oct 2006

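HTML::Parser is a powerful and mysterious module, which I have to re-learn every year or so whenever I need to do something with HTML. Here's a little program that extracts and prints every table in an HTML document, including nested tables. The trick is to swap handlers on the fly: when the start handler sees a <table> tag, it installs handlers that print everything up to the matching </table>, then puts the original start handler back.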

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

## print all tables in 'page.html'

my $p = HTML::Parser->new( api_version => 3,
                           start_h => [\&start_handler, "tagname,self,text"],
                           );

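## parse_file returns undef (and sets $!) if the file can't be opened
$p->parse_file('page.html') or die "Can't open page.html: $!";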
exit;

sub start_handler {
    my $tag  = shift;
    return unless $tag eq 'table';

    my $self = shift;
    print shift;
    my $nested_table = 0;   ## number of nested tables currently open

    ## we're inside a table now: print everything until its matching </table>
    $self->handler(   start => sub { print shift; $nested_table++ if shift eq 'table' }, "text,tagname");
    $self->handler( default => sub { print shift }, "text" );
    $self->handler(     end => sub { print shift;
        if( shift eq "table" ) {
            if( $nested_table ) { $nested_table-- }
            else {
                ## restore the old start handler, and remove our other handlers
                $self->handler( start => \&start_handler, "tagname,self,text" );
                $self->handler( default => undef );
                $self->handler( end => undef );
            }
        }
    }, "text,tagname" );
}
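
If you'd rather get the tables back as strings than print them, here's a rough sketch of one way to do it. Note that it is not the same technique as the program above: instead of swapping handlers, it keeps three fixed handlers and a depth counter. The name extract_tables is made up for this example; it is not part of HTML::Parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

## Return each top-level table (nested tables included) as one string.
## extract_tables() is a hypothetical helper, not an HTML::Parser method.
sub extract_tables {
    my ($html) = @_;
    my @tables;        ## finished top-level tables
    my $depth = 0;     ## 0 means we're outside any table
    my $buf   = '';    ## text of the table being collected

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tagname, $text) = @_;
            $depth++ if $tagname eq 'table';
            $buf .= $text if $depth;
        }, "tagname,text" ],
        end_h => [ sub {
            my ($tagname, $text) = @_;
            $buf .= $text if $depth;
            ## closing the outermost table finishes one result
            if ( $depth && $tagname eq 'table' && --$depth == 0 ) {
                push @tables, $buf;
                $buf = '';
            }
        }, "tagname,text" ],
        ## text, comments and everything else without its own handler
        default_h => [ sub { $buf .= shift if $depth }, "text" ],
    );

    $p->parse($html);
    $p->eof;           ## flush any pending text
    return @tables;
}
```

Calling extract_tables($html) on a string returns a list with one entry per top-level table; a nested table stays inside the entry for the table that contains it.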
[ category: /perl | link: html_parser_tables ]
