I wanted to check some cruddy old HTML pages for errors, so I started looking on CPAN. HTML::Lint was the second thing I came upon, after one other thing which looked very difficult to install.
It did a few useful things, like catching closing tags without an equivalent opening, or informing about img tags with no height, width, or alt text. But this module has a big problem if your web page contains any UTF-8 encoded Unicode characters. If you use its "parse_file" method, not only does it insist that you have to use HTML entities:
but even worse it doesn't take any notice of your encoding anyway, and it tells you to use entities for each byte, which is wrong and will result in a broken web page. If you read the text in yourself as UTF-8 and send it to this module via its "parse" method, things get even worse since it doesn't have any way of coping with these inputs. When I tried using "parse" on a Unicode-encoded string, I got streams of errors like this:
Use of uninitialized value $val in substitution (s///) at /home/ben/software/install/lib/perl5/site_perl/5.10.0/HTML/Lint/Error.pm line 112.
I tried fiddling with the three error switches provided, but it turned out that switching off the entity part also switches off the other parts which were useful to me. I guess one could just send the output of this module through 'grep -v "Invalid character"' to remove these bogus errors.