HTML-TableExtract reviews

cpanratings
 


HTML-TableExtract (2.11) *****

This module is the reason I learned Perl.

When I wanted to convert a series of HTML table files to CSV, this class made the HTML parsing beautifully simple.

The module represents an HTML table as an array of arrays. By happy coincidence, Text::CSV_XS represents a CSV file in exactly the same way. To convert HTML to CSV, you just pass the table's rows to the csv function. It's that easy!

search.cpan.org/~hmbrand/Text-CSV_XS-...
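
The conversion described above might look roughly like the following. This is a minimal sketch, not the reviewer's own script: the sample HTML is made up, and it uses the combine and string methods of Text::CSV_XS rather than the csv function, so that the result ends up in a plain string.

```perl
use strict;
use warnings;
use HTML::TableExtract;
use Text::CSV_XS;

# Hypothetical input; in practice this would be read from a file.
my $html = <<'HTML';
<table>
  <tr><th>Name</th><th>Note</th></tr>
  <tr><td>Alice</td><td>likes "quotes", commas</td></tr>
</table>
HTML

# No constraints given, so every table in the document is extracted.
my $te = HTML::TableExtract->new;
$te->parse($html);
$te->eof;

my $csv = Text::CSV_XS->new({ binary => 1 });
my $out = '';
for my $table ($te->tables) {
    for my $row ($table->rows) {    # each row is an array reference
        # Cells that are absent from the HTML may come back undefined.
        my @cells = map { defined $_ ? $_ : '' } @$row;
        $csv->combine(@cells) or die "Cannot encode row as CSV";
        $out .= $csv->string . "\n";
    }
}
print $out;
```

Text::CSV_XS takes care of quoting cells that contain commas or double quotes, which is the part that is easy to get wrong by hand.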

As Graham Stead said, the documentation would benefit from more detail. For example, it's not clear how the superclass methods are overridden and what they return.

I'm more comfortable with Python, but I couldn't find an equivalent module that makes it so easy. Sebastien Sauvage's html2csv in Python is even simpler to use, but it's not in a package or actively maintained.

sebsauvage.net/python/html2csv.py

HTML-TableExtract (2.11) ****

Powerful - gives you access to HTML data inside table cells, unlike HTML::TableParser, but slightly more difficult to use than HTML::TableParser, and with fewer ways to select a table.

There is a comment from 2008 saying that the module didn't work on Cygwin or Windows - this is fortunately no longer the case, and the module works great.

HTML-TableExtract (2.10) *****

Makes an otherwise tedious task almost trivial. I had earlier used HTML-Tree to extract the data from a specific depth. However, every time the layout changed (which was often), the extraction code would break.

Extracting tables using the headers instead of some HTML properties (depth, element name, etc.) is simple and robust.

HTML-TableExtract (2.10) *****

An excellent module to handle the fairly common task of extracting data from an HTML table; no need for ugly scraping code, you can just tell this module "Here's some HTML; find me a table with headings named $headers, then get me the data." If the page changes, no problem - as long as the table still has the same column headings (even if their order changes), it'll still be found with no issues.

You can also identify the table you want by name/id or various attributes, if you need to.
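
As an illustration of header-based selection, here is a small sketch. The HTML, the column names and the id value are invented for the example; the headers and attribs constructor options are the ones the module documents.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical page: a layout table followed by the table we want,
# with its columns in a different order than we ask for them.
my $html = <<'HTML';
<table><tr><td>Layout table, ignored</td></tr></table>
<table id="prices">
  <tr><th>Price</th><th>Stock</th><th>Item</th></tr>
  <tr><td>1.50</td><td>12</td><td>Apple</td></tr>
  <tr><td>0.25</td><td>40</td><td>Plum</td></tr>
</table>
HTML

# Select the table by its column headings, in the order we want them.
# Selecting by attribute instead would look like:
#   HTML::TableExtract->new(attribs => { id => 'prices' });
my $te = HTML::TableExtract->new(headers => ['Item', 'Price']);
$te->parse($html);
$te->eof;

my @lines;
for my $table ($te->tables) {
    for my $row ($table->rows) {
        # With default settings only the requested columns are returned.
        my ($item, $price) = @$row;
        push @lines, "$item costs $price";
    }
}
print "$_\n" for @lines;
```

Because the match is on heading text, the layout table is skipped, and reordering or adding columns in the page should not require changing this code.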

An essential part of your toolkit for screen-scraping.

HTML-TableExtract (2.10)

Useful, but has flaws.

This module is very handy for getting the entries out of tables quickly. However, it has some flaws. For example, it's not possible to get the attributes of the <td> and other tags which form the table, so if you need to extract only the elements which have a certain name or class, you'll be stuck.

There is a way around the problem but it's complicated.

The other big problem with this module is that it's broken on Cygwin and Windows.

HTML-TableExtract (2.10) *****

Excellent module, much easier than Template::Extract and HTML::TreeBuilder for extracting data from web pages in many cases, and one doesn't even have to look into the HTML source being processed.

My only complaint is the encoding problem. When dealing with pages in non-ASCII, non-UTF-8 encodings like GB2312, it just refuses to match headers. I have to convert the HTML input to UTF-8 manually every time. I think it may be a problem on the HTML::Parser side... So UTF-8 is always my best friend. :)
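
One way to do that conversion is to decode the raw bytes into a Perl character string with the core Encode module before parsing. The sketch below assumes this is equivalent to the workaround described above. The sample table is invented, and 'euc-cn' is the Encode name used here for GB2312-encoded text.

```perl
use strict;
use warnings;
use utf8;                       # this source file contains UTF-8 literals
use Encode qw(decode encode);
use HTML::TableExtract;

# Simulate a page that arrives as GB2312 bytes (hypothetical content).
my $bytes = encode('euc-cn',
    '<table><tr><th>名称</th><th>价格</th></tr>'
  . '<tr><td>苹果</td><td>3</td></tr></table>');

# Bytes -> Perl characters, so the headers and the page text are
# compared in the same representation.
my $html = decode('euc-cn', $bytes);

my $te = HTML::TableExtract->new(headers => ['名称', '价格']);
$te->parse($html);
$te->eof;

my @rows = map { $_->rows } $te->tables;

binmode STDOUT, ':encoding(UTF-8)';
print join(',', @$_), "\n" for @rows;
```

If the page's encoding is not known in advance, it has to be taken from the HTTP headers or a meta tag first; that step is left out here.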

HTML-TableExtract (2.06) *****

This module helped me create a parser that I was struggling to build any other way. The headers feature is *very* handy and provides great basic functionality. If you need to go beyond this, be prepared to spend a bit of time understanding how things work; I found Matt's examples (www.mojotoad.com/sisk/projects/HTML-T...) to be helpful (and necessary). I give this module an overall high score because its great functionality trumps everything else. It would have been even better if it were more intuitive (granted, this is highly subjective), or if the off-line examples were referenced in the POD. Kudos!

HTML::TableExtract / HTML-TableExtract (1.07) ****

A must-have module for getting information out of any table-organized HTML page (you'll be surprised how many web pages this actually is true for).

The learning curve might be a little steep, but this is only due to its power, and the fact that extracting information out of nested tables is a daunting task.

Two small tips for getting your information:

1. Don't know where your table's at?

Construct the TableExtract without depth and count, and loop using:

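# $table is the HTML::TableExtract object, after parsing.
foreach my $ts ($table->table_states) {
    # coords gives the depth and count of each table found.
    warn "DEBUG: Table found at ", join(',', $ts->coords) if $DEBUG > 2;
}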

2. Replace IMG tags with their ALT attribute:

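Subclass TableExtract, overriding the start method:

package TableParser;

use base qw(HTML::TableExtract);

sub start {
    my $self = shift;
    # $_[0] is the tag name, $_[1] the attribute hash reference.
    # _in_a_table is an internal field, not part of the public API.
    if ($self->{_in_a_table} and $_[0] eq 'img') {
        # Fall back to an empty list, not {}, when there are no attributes.
        my %attrs = ref $_[1] ? %{$_[1]} : ();
        # Pass the ALT text on as if it were ordinary cell text.
        $self->text($attrs{alt}) if defined $attrs{alt};
    } else {
        $self->SUPER::start(@_);
    }
}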