Better Living Through Thinking

Parsing HTML tables with HTML::Parser

Mon, 09 Oct 2006

We can parse HTML tables into native Perl data structures (here, an array of array references) with HTML::Parser. Here is one way (maybe not the most HTML::Parser-ish way, but it works): we use a simple state variable to remember whether we're in a table, a row, or a table cell.

While we're in a cell, we append the cell data found there to a variable ($cell). Once we reach the end of a cell (a </td> tag), we push that cell onto @row. At the end of each row (a </tr> tag), we push a reference to a copy of @row onto @table and start a fresh row.
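
To make the bookkeeping concrete, here is a minimal sketch that leaves HTML::Parser out entirely. The @events list is a hypothetical, hand-written stand-in for the start/text/end callbacks a parser would make, and the %enter and %leave lookup hashes are illustrative names of mine, not part of any module. It only shows how the state and the three variables move together.

```perl
#!/usr/bin/perl
use strict;
use warnings;

## Hypothetical event stream: [ type, tag-or-text ], written by hand
my @events = (
    [ 'start', 'table' ],
    [ 'start', 'tr' ],
    [ 'start', 'td' ], [ 'text', 'foo' ],   [ 'end', 'td' ],
    [ 'start', 'td' ], [ 'text', 'bar' ],   [ 'end', 'td' ],
    [ 'end',   'tr' ],
    [ 'start', 'tr' ],
    [ 'start', 'td' ], [ 'text', 'baz' ],   [ 'end', 'td' ],
    [ 'start', 'td' ], [ 'text', 'blech' ], [ 'end', 'td' ],
    [ 'end',   'tr' ],
    [ 'end',   'table' ],
);

## Illustrative lookup tables: which state a tag puts us in
my %enter = ( 'table' => 'TABLE', 'tr' => 'TR',    'td'    => 'TD' );
my %leave = ( 'td'    => 'TR',    'tr' => 'TABLE', 'table' => ''   );

my $state = '';
my $cell  = '';
my @row   = ();
my @table = ();

for my $event (@events) {
    my ( $type, $value ) = @$event;

    if ( $type eq 'start' ) {
        $state = $enter{$value} if exists $enter{$value};
    }
    elsif ( $type eq 'text' ) {
        ## text only counts while we are inside a cell
        $cell .= $value if $state eq 'TD';
    }
    elsif ( $type eq 'end' ) {
        if ( $value eq 'td' ) { push @row,   $cell;  $cell = ''; }
        if ( $value eq 'tr' ) { push @table, [@row]; @row  = (); }
        $state = $leave{$value} if exists $leave{$value};
    }
}

## one line per row, cells joined by commas
print join( ',', @$_ ), "\n" for @table;
```

This prints "foo,bar" and then "baz,blech". Note the [@row] in the push: it stores a reference to a copy, so emptying @row afterwards does not wipe the row we just saved.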

Here's the code:

#!/usr/bin/perl
use strict;
use warnings;

use HTML::Parser ();

my $state = '';
my @table = ();  ## @table = ( [foo,bar], [baz,blech] );
my @row   = ();  ## ("foo", "bar")
my $cell  = '';  ## "foo"

my $p = HTML::Parser->new( api_version => 3 );

$p->handler( start => sub {
    my $tag = shift;

    $state = 'TABLE' if $tag eq 'table';
    $state = 'TR'    if $tag eq 'tr';

    if( $state eq 'TD' ) {
        $cell .= shift;
    }

    $state = 'TD'    if $tag eq 'td';

}, "tagname,text" );

$p->handler( default => sub {
    $cell .= shift if $state eq 'TD';
}, "text" );

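$p->handler( end => sub {
    my ( $tag, $text ) = @_;

    if( $tag eq 'td' ) {
        ## end of a cell: save it and drop back to the row
        $state = 'TR';
        push @row, $cell;

        $cell = '';
    }
    elsif( $state eq 'TD' ) {
        ## closing tag of markup inside a cell (e.g. </b>): keep it,
        ## since the start handler keeps the matching opening tag
        $cell .= $text;
    }

    if( $tag eq 'tr' ) {
        ## end of a row: save a copy and drop back to the table
        $state = 'TABLE';
        push @table, [@row];

        @row = ();
    }

    $state = ''      if $tag eq 'table';

}, "tagname,text" );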

## get the HTML table (slurp all of DATA; $/ is only changed inside the block)
my $data = do { local $/; <DATA> };

## parse it (this calls each handler as necessary)
$p->parse($data);
$p->eof;  ## signal end of document so any buffered text is flushed

## now dump out the Perl structures
use Data::Dumper;
for my $row ( @table ) {
    print Dumper($row);
}

exit;

__DATA__
<table>
  <tr>
    <td colspan=2>fat and wide</td>
  </tr>
  <tr>
    <td>foo</td>
    <td>bar</td>
  </tr>
  <tr>
    <td>baz</td>
    <td>blech</td>
  </tr>
</table>
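
Once the table is parsed, @table is a plain array of array references, so getting at the data is ordinary Perl. Here is a small sketch; the hard-coded @table is my assumption of what the listing above leaves behind for the sample data, and the @lines and @widths names are just for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

## Assumption: the contents of @table after parsing the sample data
my @table = (
    [ 'fat and wide' ],
    [ 'foo', 'bar' ],
    [ 'baz', 'blech' ],
);

## a single cell: row index first, then column index
print $table[1][0], "\n";

## every row as one tab-separated line
my @lines = map { join( "\t", @$_ ) } @table;
print "$_\n" for @lines;

## colspan is not expanded, so rows can differ in length
my @widths = map { scalar @$_ } @table;
print "cells per row: @widths\n";
```

The first row has only one element because the handlers never look at the colspan attribute. If you need rectangular rows, you would have to pad them yourself.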
[ category: /perl | link: html_parser_table_parsing ]
