HTML::Strip strips all the HTML tags from a document, leaving just the text.
It works well enough at removing the tags, and it also removes the contents of <script>..</script> elements and HTML comments. Its handling of whitespace, however, is quite different from that of HTML::Restrict and HTML::Scrubber. It turns every opening and closing tag into a space character, which leaves stray spaces in the output. In the diff below, the lines marked < are HTML::Strip's output, with a space before the full stop, and the lines marked > show the same text without it:
< strokes in the correct order and direction .
> strokes in the correct order and direction.
< discussion forum .
> discussion forum.
< the KanjiVG project . The
> the KanjiVG project. The
All of the odd whitespace appears where there was an </a> tag in the original HTML. This is the documented behaviour of the module:
"Anything that looks like a tag, or group of tags will be replaced with a single space character."
Oddly enough, though, neither HTML::Restrict nor HTML::Scrubber bothers to turn tags which should become whitespace into whitespace. Both modules turn the <br> tag into nothing, rather than a space or a newline, which means that if you have a line broken up like this:
I wondered lonely as a cloud<br>That floats on high o'er vales and hills,
then the output from HTML::Restrict and HTML::Scrubber looks like this:
I wondered lonely as a cloudThat floats on high o'er vales and hills,
whereas HTML::Strip gives you
I wondered lonely as a cloud That floats on high o'er vales and hills,
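For reference, the HTML::Strip result can be reproduced with a few lines of Perl. This is only a minimal sketch using the module's new, parse and eof methods; the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

my $html = "I wondered lonely as a cloud<br>That floats on high o'er vales and hills,";

# parse() returns the text with anything that looks like a tag replaced by a space.
my $text = $hs->parse($html);

# eof() resets the parser state so the same object can be reused for another document.
$hs->eof;

print "$text\n";
```

Going by the documented behaviour quoted above, the <br> comes out as a single space between "cloud" and "That".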
So unfortunately none of these modules actually grapples with the problem of converting HTML tags into a reasonable whitespace equivalent.
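Until a module does handle this, one workaround for HTML::Strip's stray spaces is to tidy its output afterwards. The following is a rough sketch rather than anything HTML::Strip provides: the helper name tidy_punctuation is hypothetical, and it assumes that a space directly before closing punctuation is never wanted in your text.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Strip: remove spaces or tabs
# left in front of closing punctuation by the tag-to-space conversion.
# Assumption: whitespace before . , ; : ! ? is never intentional.
sub tidy_punctuation {
    my ($text) = @_;
    $text =~ s/[ \t]+([.,;:!?])/$1/g;
    return $text;
}

print tidy_punctuation("the KanjiVG project . The"), "\n";
# prints "the KanjiVG project. The"
```

This only repairs the punctuation case shown in the diff above. It does nothing for the opposite problem in HTML::Restrict and HTML::Scrubber, where the whitespace is missing altogether.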
* HTML::Strip is much faster than HTML::Restrict and HTML::Scrubber. See my review of HTML::Scrubber for a comparison of performance.