HTML-Tidy reviews


HTML-Tidy (1.56) ***

HTML::Tidy just saved my day's work!
I was having a terrible time trying to parse HTML generated by a dumb application (one that was not designed to convert special characters found in its database to HTML entities) with HTML::TokeParser.

My parser is fine, but HTML::TokeParser gets lost when something looks like a token but is not one.

With 3 lines of code (create an HTML::Tidy object, invoke the clean() method and hand the result to HTML::TokeParser) I solved the issue.
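Those three steps might look roughly like the sketch below. The input string, the variable names and the choice of extracting `<p>` text are assumptions made for illustration; the reviewer's actual parser is not shown.

```perl
use strict;
use warnings;
use HTML::Tidy;
use HTML::TokeParser;

# Hypothetical input: a bare "&" and "<" that were never converted to entities.
my $raw_html = '<html><head><title>t</title></head>'
             . '<body><p>Fish & chips: 1 < 2</p></body></html>';

# Step 1: create the HTML::Tidy object (tidy_mark => 0 omits the generator tag).
my $tidy = HTML::Tidy->new( { tidy_mark => 0 } );

# Step 2: clean() returns the repaired markup as a string.
my $cleaned = $tidy->clean($raw_html);

# Step 3: hand the cleaned string (by reference) to HTML::TokeParser.
my $parser = HTML::TokeParser->new( \$cleaned )
    or die "Cannot create HTML::TokeParser object";

# Collect the text of every paragraph in the cleaned document.
my @paragraphs;
while ( $parser->get_tag('p') ) {
    push @paragraphs, $parser->get_trimmed_text('/p');
}
print "$_\n" for @paragraphs;
```

Passing a scalar reference tells HTML::TokeParser to parse the string itself rather than treat it as a file name.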

The only drawback when installing it on Windows 7 was a test that failed while trying to recognize the ' ' Unicode character. I forced the install and everything worked for me.

If you are in the same situation (bad HTML to parse), I highly recommend this module.

HTML-Tidy (1.54) ****

I use HTML::Tidy to convert HTML to XHTML, since at the moment it's the best tool for this job.

I found HTML::Tidy a bit of a pain to install, but once that's done it works seamlessly.

Here is how I do the conversion:

sub _tidy_html
  { my( $html, $options)= @_;

    my $TIDY_DEFAULTS= { output_xhtml    => 1,              # duh!
                         tidy_mark       => 0,              # do not add the "generated by tidy" comment
                         numeric_entities => 1,
                         char_encoding   => 'utf8',
                         bare            => 1,
                         clean           => 1,
                         doctype         => 'transitional',
                         fix_backslash   => 1,
                         merge_divs      => 0,
                         merge_spans     => 0,
                         sort_attributes => 'alpha',
                         indent          => 0,
                         wrap            => 0,
                         break_before_br => 0,
                       };

    # options passed by the caller override the defaults
    $options ||= {};
    my $tidy_options= { %$TIDY_DEFAULTS, %$options};

    my $tidy = HTML::Tidy->new( $tidy_options);

    # not clean, but any remaining error will be caught by the XML parsing
    $tidy->ignore( type => 1, type => 2 ); # 1 is TIDY_WARNING, 2 is TIDY_ERROR

    my $xml= $tidy->clean( $html );
    return $xml;
  }

HTML-Tidy (1.08) ***

Since I first used this module a few years ago it has been much improved: The documentation is much more useful and the ->clean() method makes it clearer as to how to use this module to tidy up HTML and return it to you. There are improvements that could be made, but I do think earlier reviews are not indicative of how much more usable HTML::Tidy has become.

HTML-Tidy (1.06) *

The documentation would have to improve considerably to be atrocious, the module has some head scratching limitations, and is slow.

Installation on Darwin 8.5.0/perl 5.8.6 was a nightmare of dependency resolution, whether by hand or by cpan. The author might have mentioned that the htmltidy source and headers have to be present before installation in the instructions. While the documentation does mention to "tell the makefile that you're using ranlib", that convoluted set of instructions doesn't actually address the problem I had.

That aside, at first glance there's really no way to tell if this module is useful, since the documentation doesn't actually say how to use it. Fortunately, there is some useful chatter on annocpan that gives hints.

Worst of all, the module forces you to load an external configuration file to make any change to formatting or tags. Not only is this slow, it defeats the purpose of altering the markup programmatically.

Any developer is probably better off doing a system or exec call to the tidy binary than trying to use this exceptionally aggravating waste of time.
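For comparison, shelling out to the tidy binary could be sketched as below. This is only an illustration: the helper name `tidy_via_binary` is hypothetical, it assumes a `tidy` executable is on the PATH, and the chosen flags are one plausible set rather than anything the reviewer specified.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: run the external "tidy" binary on a string
# and return the cleaned markup it writes to standard output.
sub tidy_via_binary {
    my ($html) = @_;

    # Write the input to a temporary file that is removed at exit.
    my ( $fh, $path ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
    binmode $fh, ':encoding(UTF-8)';
    print {$fh} $html;
    close $fh or die "Cannot write $path: $!";

    # List-form pipe open bypasses the shell, so nothing needs quoting.
    open my $out, '-|', 'tidy', '-q', '-utf8', '-asxhtml',
        '--tidy-mark', 'no', '--show-warnings', 'no', $path
        or die "Cannot run tidy: $!";
    binmode $out, ':encoding(UTF-8)';
    my $cleaned = do { local $/; <$out> };
    close $out;    # sets $? to the exit status of tidy

    # tidy exits with 0 on success, 1 for warnings, 2 for errors.
    my $status = $? >> 8;
    die "tidy reported errors (exit status $status)\n" if $status >= 2;

    return $cleaned;
}

print tidy_via_binary('<title>t</title><p>hello');
```

The trade-off is one extra process and a temporary file per document, in exchange for not having to compile the module against tidylib.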

HTML-Tidy (1.06) **

As above, the module may be nice but fails tests. I believe this is because the tidylib has been updated but the perl module hasn't kept in step.

The problem in the test is due to the address tag in HTML - AFAIK it can't be nested or have a <center> tag inside it but an older version of tidylib allowed this, and the enclosed test tickles this particular case quite hard.

Installed OK from source after compiling tidylib from source (CVS, 25jan06) and tweaking the venus.t to take out the extra address tags.

As per previous commenter, it's a reet pain that you can't specify parameters in any sane way and have to put them in an external file.

You basically get two methods: parse and clean.

parse takes a dummy $filename first parameter, which is a bit odd - it will only parse strings, and $tidy->parse( $string ) doesn't work. It returns true even if there are errors in the HTML, which is also odd (truth is based on the success or failure of being able to get libtidy to look at the string)

I found the following code to be useful, but it does use the private _tidy_messages interface...

my $errs = HTML::Tidy::_tidy_messages( $html );
if ($errs) {
    print "there were errors\n";
    # $errs is a listref of strings
}

The clean() method is much more sensible. my $cleaned = $tidy->clean( $html ); works as expected and returns the cleaned HTML (even though the docs say "Returns true if all went OK")
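A sketch of the same check through the public interface, assuming a release of HTML::Tidy that documents the messages() method (it may not match the 1.06 release reviewed here). The sample input and the document name 'inline-string' are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Tidy;

# Hypothetical input with an unclosed paragraph and no doctype.
my $html = '<title>t</title><p>unclosed paragraph';

my $tidy = HTML::Tidy->new;

# parse() takes a name for the document first; it only labels the messages.
$tidy->parse( 'inline-string', $html );

# messages() returns the HTML::Tidy::Message objects collected so far.
my @messages = $tidy->messages;
print $_->as_string, "\n" for @messages;

# clean() returns the repaired markup as a string.
my $cleaned = $tidy->clean($html);
print $cleaned;
```

Checking the message list, rather than the return value of parse(), is what tells you whether the HTML itself had problems.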

HTML-Tidy (1.06) ***

This would be a Blessed Module indeed if it did but compile on any system on which I slaveth. Oh, woe, for lack of the time that may allow a Sensible, nay, Noble, corrective contribution from your humble scribe. Alas, alack.

That was June.

Now it's three months and one thousand miles later, and I've been reduced to RedHat, where it compiles. It just doesn't pass the tests. Wonder what'll happen if I force it...?

So far so good; my only "issue" is that currently configuration options have to be loaded from a file. There is not yet an option to set them at the time of construction, which is a shame.
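For readers on newer releases, the two ways of configuring the module can be sketched side by side. This assumes a version that accepts both the config_file key and inline options in the constructor, as the 1.54 review above does; the option values and sample input are arbitrary.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::Tidy;

# File-based configuration: settings written in tidy's own config syntax.
my ( $fh, $config_path ) = tempfile( SUFFIX => '.conf', UNLINK => 1 );
print {$fh} "output-xhtml: yes\ntidy-mark: no\nwrap: 0\n";
close $fh or die "Cannot write $config_path: $!";

my $tidy_from_file = HTML::Tidy->new( { config_file => $config_path } );

# Constructor options: the same settings, with underscores instead of hyphens.
my $tidy_inline = HTML::Tidy->new(
    { output_xhtml => 1, tidy_mark => 0, wrap => 0 } );

my $html = '<title>t</title><p>hello';    # hypothetical input

my $from_file = $tidy_from_file->clean($html);
my $inline    = $tidy_inline->clean($html);

print $from_file;
print $inline;
```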