HTML-Obliterate reviews

cpanratings
 


HTML-Obliterate (0.3) *

This is a terrible implementation of a bad idea. Regular expressions for HTML parsing are extremely difficult to get right. This module erases content, including entities, instead of converting them, which is trivial with HTML::Entities. It also mangles markup like "1 < 2;". That markup is bad, but plenty of HTML in the wild contains stuff like this.
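To make that failure mode concrete, here is a minimal sketch in core Perl. The two regexes and the name naive_strip are assumptions made for illustration; they are not the module's actual code, only the kind of pattern a regex-based stripper tends to use.

```perl
use strict;
use warnings;

# Hypothetical naive stripper, NOT the real HTML::Obliterate implementation.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;     # drop anything that looks like a tag
    $html =~ s/&[#\w]+;//g;    # drop anything that looks like an entity
    return $html;
}

my $in  = '<p>Caf&eacute; rule: 1 < 2 and 3 > 2</p>';
my $out = naive_strip($in);

# The entity is erased rather than decoded, and everything between the
# stray "<" and the next ">" is swallowed as if it were a tag.
print "$out\n";    # prints "Caf rule: 1  2"
```

The "é" is gone and so is the whole comparison after "1", which is exactly the kind of silent content loss described above.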

It is trivial to strip HTML yourself with HTML::TokeParser or HTML::TokeParser::Simple or XML::LibXML or ...
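As a rough sketch of what that looks like with HTML::TokeParser, assuming the HTML-Parser distribution (which also ships HTML::Entities) is installed. The sub name strip_with_tokeparser is made up for this example.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: keep only text tokens and decode their entities.
sub strip_with_tokeparser {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot set up parser";
    my $text = '';
    while (my $token = $p->get_token) {
        next unless $token->[0] eq 'T';           # skip tags, comments, etc.
        $text .= decode_entities($token->[1]);    # convert entities, don't erase
    }
    return $text;
}

print strip_with_tokeparser('<p>a &amp; <b>b</b></p>'), "\n";    # prints "a & b"
```

This sketch still keeps the text inside <script> and <style> elements; skipping those would mean tracking the start and end tokens as well.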

The code makes it look like it should have been in the Acme namespace. It has 12 subs that are all synonyms for calling, via goto, the real routine.

Do *not* put this module into production code. This is the sort of code that gives Perl and the CPAN a bad name.

Update: I'm not hating, I'm protecting other users from taking this at face value or learning to prefer quick'n'dirty over correct. I posted this review after the module was recommended on the TT2 list, which is widely read.

Here's the real rub...

First, self-documenting and without misleading extra substitution modifiers; 40 characters long, without any dependencies:

$naively_stripped_html =~ s/<\w+[^>]+>//g;

Now, obfuscated by nature of being hidden in a module; 100+ chars, adding a pointless dependency to boot:

use HTML::Obliterate qw(extirpate_html);

my $html_less_version_of_string = extirpate_html( $html_code_string );

It's now 13 aliases to the same sub? Again, if it were an Acme module it would be amusing.

HTML-Obliterate (0.3)

> This is a terrible implementation of a bad idea. Regular expressions for HTML parsing
> are extremely difficult to get right.

It's not meant to be a comprehensive parser, nor does it claim to be one. All it does is "removes all HTML tags and entities".

> This module will erase content including entities, instead of converting them which is
> trivial with HTML::Entities, and markup like "1 < 2;" while bad, plenty of HTML in the
> wild contains stuff like this.

The only claim it makes is "removes all HTML tags and entities".

If you want to preserve certain entities, you should use HTML::Entities to decode (and therefore keep) the ones you want. (I'll add that to the POD in the next version.) Also, to strip tags along with what they contain, for example <script>code code code</script>, you'd need a more in-depth HTML::Parser-based module. (I'll also add that to the POD.)

> It is trivial to strip HTML yourself with HTML::TokeParser or HTML::TokeParser::Simple or XML::LibXML or ...

HTML::Strip is also available for a parse-into-a-tree-and-clean-up-the-tree approach.

Again though, this is not meant to be a convert-HTML-into-equivalent-text module but a quick "no HTML please", just as it says, so I'm pretty shocked at the intensity of your hatred.

> The code makes it look like it should have been in the Acme namespace. 12 subs all
> synonyms for calling, via goto, the real routine.

So no one is allowed to have fun with their code?

> Do *not* put this module into production code.

Sure, don't use it if it doesn't fill your need; use it if it does.

> This is the sort of code that gives Perl and the CPAN a bad name.

Haters like you give CPAN a bad name. Relax, have fun!