HTML-Strip reviews


HTML-Strip (2.10)

HTML::Strip completely strips HTML tags from a document and leaves one with just the text.

It removes the tags from HTML well enough, and it also removes the contents of <script>..</script> elements and HTML comments. Its whitespace behaviour is rather different from that of HTML::Restrict and HTML::Scrubber: it turns every opening and closing tag into a space character, which leaves stray spaces before punctuation, as in the lines marked < in the following diff:

101c96
< strokes in the correct order and direction .
---
> strokes in the correct order and direction.
216c211
< discussion forum .
---
> discussion forum.
221c216
< the KanjiVG project . The
---
> the KanjiVG project. The

All of the odd whitespace appears where there was an </a> tag in the original HTML. This is the documented behaviour of the module:

metacpan.org/pod/HTML::Strip#DESCRIPTION

"Anything that looks like a tag, or group of tags will be replaced with a single space character."

Oddly enough, though, neither HTML::Restrict nor HTML::Scrubber bothers to turn tags which should become whitespace into whitespace. In HTML::Restrict and HTML::Scrubber, the <br> tag is turned into nothing, rather than a space or a newline, which means that if you have a sentence broken up like this:

I wondered lonely as a cloud<br>That floats on high o'er vales and hills,

then the output from HTML::Restrict and HTML::Scrubber looks like this:

I wondered lonely as a cloudThat floats on high o'er vales and hills,

whereas HTML::Strip gives you

I wondered lonely as a cloud That floats on high o'er vales and hills,

So unfortunately none of these modules actually grapples with the problem of converting HTML tags into a reasonable whitespace equivalent.
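
As a rough illustration of what a "reasonable whitespace equivalent" might look like, here is a minimal sketch in plain Perl. It is not taken from any of the three modules; the helper name tags_to_whitespace and the choice of which tags count as block-level are assumptions made for this example. It uses crude regular expressions rather than a real HTML parser, so it does not remove <script> contents or cope with a ">" inside a comment or attribute value.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Strip, HTML::Restrict or
# HTML::Scrubber. Crude regex-based sketch, not a real HTML parser.
sub tags_to_whitespace {
    my ($html) = @_;
    # <br> and the closing tags of some common block elements become newlines.
    $html =~ s{<br\s*/?>|</(?:p|div|li|tr|h[1-6])\s*>}{\n}gi;
    # Every remaining tag (inline ones such as <a>, <b>, <em>) is removed
    # without inserting anything, so no stray space appears before a full stop.
    $html =~ s{<[^>]*>}{}g;
    return $html;
}

# The line break becomes a newline instead of vanishing or turning into a space.
print tags_to_whitespace(
    "I wondered lonely as a cloud<br>That floats on high o'er vales and hills,"
), "\n";

# The inline link leaves no gap before the punctuation.
print tags_to_whitespace('the <a href="x">KanjiVG project</a>. The'), "\n";
```

Under these assumptions the first call prints the two lines of verse on separate lines, and the second prints "the KanjiVG project. The" with no space before the full stop. A real solution would need to make the same block-versus-inline distinction inside a proper parser.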

HTML::Strip is much faster than HTML::Restrict and HTML::Scrubber. See my review of HTML::Scrubber for a comparison of performance.

For a list of similar modules and links to other reviews, please see my page at www.lemoda.net/perl/html-cleanup-modu...

HTML-Strip (1.06) *****

I've not looked under the hood. But it seems to work well and strip the markup from websites. So I'm satisfied with it.

HTML-Strip (1.06) *****

Well defined purpose, does what it says it does, easy to use. Probably worth explaining when and why the ->eof method should be called.

HTML-Strip (1.02) ****

Simple to use, fully commented code.