Better Living Through Thinking

Dumping HTML tables using HTML::Parser
Thu, 05 Oct 2006

HTML::Parser is a powerful and mysterious module, which I have to re-learn every year or so when I have to do something with HTML. Here's a little program to extract and print all tables in an HTML document, including nested tables:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

## print all tables in 'page.html'
my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [\&start_handler, "tagname,self,text"],
);
## parse_file returns undef if the file can't be opened
$p->parse_file('page.html') or die "can't open page.html: $!";
exit;

sub start_handler {
    my $tag = shift;
    return unless $tag eq 'table';
    my $self = shift;
    print shift;              ## the opening <table> tag itself
    my $nested_table = 0;     ## depth of tables nested inside this one

    ## setup a new start handler
    $self->handler( start => sub { print shift; $nested_table++ if shift eq 'table' }, "text,tagname" );
    ## print everything else (text, comments, ...) while inside the table
    $self->handler( default => sub { print shift }, "text" );
    $self->handler( end => sub {
        print shift;
        if( shift eq "table" ) {
            if( $nested_table ) { $nested_table-- }
            else {
                ## restore the old start handler, and remove our other handlers
                $self->handler( start => \&start_handler, "tagname,self,text" );
                $self->handler( default => undef );
                $self->handler( end => undef );
            }
        }
    }, "text,tagname" );
}
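
If you'd rather have the tables as strings than printed straight to STDOUT, here is a rough sketch of a variation. It is not the program above: instead of swapping handlers in and out, it keeps one fixed set of handlers and a depth counter, and it parses a string with parse() rather than a file. The names extract_tables, $depth and $buf are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

## Hypothetical helper (not part of the program above): returns one string
## per top-level table in $html; nested tables stay inside their parent.
sub extract_tables {
    my ($html) = @_;
    my @tables;        ## finished top-level tables
    my $depth = 0;     ## number of <table> elements currently open
    my $buf   = '';    ## source text of the table being collected

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tagname, $text) = @_;
            $depth++ if $tagname eq 'table';
            $buf .= $text if $depth;
        }, "tagname,text" ],
        end_h => [ sub {
            my ($tagname, $text) = @_;
            return unless $depth;          ## ignore anything outside a table
            $buf .= $text;
            if ( $tagname eq 'table' && --$depth == 0 ) {
                push @tables, $buf;        ## outermost table just closed
                $buf = '';
            }
        }, "tagname,text" ],
        ## text, comments, etc. are kept only while inside a table
        default_h => [ sub {
            my ($text) = @_;
            $buf .= $text if $depth;
        }, "text" ],
    );
    $p->parse($html);
    $p->eof;
    return @tables;
}

my $html = '<p>intro</p><table><tr><td>a<table><tr><td>b</td></tr></table>'
         . '</td></tr></table><p>outro</p>';
## prints the outer table once, with the inner table still inside it
print "$_\n" for extract_tables($html);
```

The counter does the same job as $nested_table in the original: only the closing tag that brings it back to zero ends the current table. An unclosed table is silently dropped here, which may or may not be what you want.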