Better Living Through Thinking
Parsing HTML tables with HTML::Parser
Mon, 09 Oct 2006

We can parse HTML tables into native Perl data structures (arrays of arrays) with HTML::Parser. Here is one way (maybe not the most HTML::Parser-ish way, but it works): we use a simple state variable to remember whether we're in a table, a row, or a table cell. While we're in a cell, we append the cell data found there to a variable; when the cell closes we push that variable onto the current row, and when the row closes we push the row onto the table.
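To see the state mechanism on its own, here is a minimal sketch that leaves HTML::Parser out entirely and replays a hand-written list of events through the same TABLE/TR/TD transitions. The helper name rows_from_events and the [type, value] event format are made up for this illustration; they are not part of HTML::Parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

## Hypothetical helper (not an HTML::Parser API): walks a list of
## [ type, value ] pairs, where type is 'start', 'text' or 'end'.
sub rows_from_events {
    my @events = @_;
    my $state = '';
    my $cell  = '';
    my @row   = ();
    my @table = ();
    for my $ev ( @events ) {
        my ( $type, $value ) = @$ev;
        if( $type eq 'start' ) {
            $state = 'TABLE' if $value eq 'table';
            $state = 'TR'    if $value eq 'tr';
            $state = 'TD'    if $value eq 'td';
        }
        elsif( $type eq 'text' ) {
            ## only text seen inside a cell is kept
            $cell .= $value if $state eq 'TD';
        }
        elsif( $type eq 'end' ) {
            if( $value eq 'td' ) {
                $state = 'TR';
                push @row, $cell;
                $cell = '';
            }
            if( $value eq 'tr' ) {
                $state = 'TABLE';
                push @table, [@row];
                @row = ();
            }
            $state = '' if $value eq 'table';
        }
    }
    return @table;
}

## one table, one row, two cells
my @demo = rows_from_events(
    [ start => 'table' ], [ start => 'tr' ],
    [ start => 'td' ], [ text => 'foo' ], [ end => 'td' ],
    [ start => 'td' ], [ text => 'bar' ], [ end => 'td' ],
    [ end => 'tr' ], [ end => 'table' ],
);
print join( ', ', @{ $demo[0] } ), "\n";   ## prints "foo, bar"
```

The real parser does the same thing; the only difference is that HTML::Parser generates the events for us and calls a handler for each one.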
Here's the code:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();
my $state = '';
my @table = (); ## @table = ( [foo,bar], [baz,blech] );
my @row = (); ## ("foo", "bar")
my $cell = ''; ## "foo"
my $p = HTML::Parser->new( api_version => 3 );
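$p->handler( start => sub {
    my $tag = shift;
    $state = 'TABLE' if $tag eq 'table';
    $state = 'TR' if $tag eq 'tr';
    ## a start tag seen while already inside a cell is nested markup
    ## (e.g. <b>), so keep its source text as part of the cell
    if( $state eq 'TD' ) {
        $cell .= shift;
    }
    ## set this last so the <td> tag itself is not added to the cell
    $state = 'TD' if $tag eq 'td';
}, "tagname,text" );
$p->handler( default => sub {
    ## text (and any other event without its own handler) lands here
    $cell .= shift if $state eq 'TD';
}, "text" );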
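$p->handler( end => sub {
    my( $tag, $text ) = @_;
    if( $tag eq 'td' ) {
        $state = 'TR';
        push @row, $cell;
        $cell = '';
    }
    elsif( $state eq 'TD' && $tag ne 'tr' && $tag ne 'table' ) {
        ## the start handler keeps opening tags of markup nested in a
        ## cell (e.g. <b>), so keep the matching closing tags too
        $cell .= $text;
    }
    if( $tag eq 'tr' ) {
        $state = 'TABLE';
        push @table, [@row];
        @row = ();
    }
    $state = '' if $tag eq 'table';
}, "tagname,text" );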
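## get the HTML table (slurp everything after __DATA__)
my $data = do { local $/; <DATA> };
## parse it (this calls each handler as necessary)
$p->parse($data);
## signal end of document so any buffered text is flushed
$p->eof;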
## now dump out the Perl structures
use Data::Dumper;
for my $row ( @table ) {
    print Dumper($row);
}
exit;
__DATA__
<table>
<tr>
<td colspan=2>fat and wide</td>
</tr>
<tr>
<td>foo</td>
<td>bar</td>
</tr>
<tr>
<td>baz</td>
<td>blech</td>
</tr>
</table>
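
Once the script has run, @table is an ordinary array of array references, one per row. As a minimal sketch of using it, the snippet below hard-codes the rows the script should produce for the sample table above (typed in by hand here rather than parsed) and prints them one row per line. The helper name rows_as_tsv is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

## Assumed result of parsing the sample table; note that the colspan=2
## cell still yields a single-element row, since attributes are ignored.
my @table = ( ['fat and wide'], ['foo', 'bar'], ['baz', 'blech'] );

## Hypothetical helper: turn each row into one tab-separated line
sub rows_as_tsv {
    my @rows = @_;
    return map { join( "\t", @$_ ) } @rows;
}

print "$_\n" for rows_as_tsv( @table );

## individual cells are reached by [row][column] index
print "second row, first cell: $table[1][0]\n";
```

Because a row with a colspan cell comes back shorter than the others, code that expects a fixed number of columns would need to check the length of each row first.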