The other day I decided to finally bite the bullet and clean up the Page From Hell on a client’s web site. It’s a page that sees a lot of content churn, and I had been asked to insert a video feed near the bottom that would span two columns. But this page is the product of about eight years of questionable attention from a variety of people, including non-developers using FrontPage. The result is thoroughly muddled indentation, gratuitous quadruple-nested tables, and worse.
As a developer I live most of my life under the hood and usually do not have to venture out into 50K piles of smelly HTML. Call me sheltered, but I assumed I’d grab someone’s HTML parser and clean everything up so that I could turn it into a nice, valid XHTML document that I could actually work on.
Silly me. I turned first to HTMLTidy, but I should have known better than to start with a tool that has its origins in the W3C. It threw up its hands twenty lines into the page, saying there were too many errors. So I had a classic chicken-and-egg scenario: I wanted to clean up the page so it had valid HTML, but couldn’t clean it up even part way because the HTML wasn’t valid.
Browsers, of course, are incredibly forgiving, and somehow manage to render the page. I figured there had to be an easier way.
I eventually located an HTML/XML/XSLT beautifier called htb, which made a much better, more pragmatic effort. But it gave me no analytical tools to explain why tags were not closing properly near the end of the file, presumably because certain tags much further up had never been closed.
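The kind of analysis I was missing can be roughed out in a few lines. What follows is only a minimal sketch, not something htb or HTMLTidy provides: it uses Python's standard `html.parser` module, and the names `TagBalanceChecker` and `find_unclosed` are made up for illustration. It keeps a stack of open tags with the line each one was opened on, and reports tags that are still open when an outer tag closes or when the file ends. One caveat: it assumes every non-void tag needs an explicit close, so it will also flag tags such as `<p>` or `<li>` whose closing tag HTML lets you omit.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: tracks open tags and notes ones left unclosed."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of (tag, line) for tags still open
        self.problems = []    # human-readable findings

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        line = self.getpos()[0]
        # Look for the nearest matching open tag, innermost first.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                # Anything opened after the match was never closed.
                for skipped, opened_at in self.open_tags[i + 1:]:
                    self.problems.append(
                        f"<{skipped}> opened on line {opened_at} "
                        f"was still open at </{tag}> on line {line}")
                del self.open_tags[i:]
                return
        self.problems.append(
            f"</{tag}> on line {line} has no matching open tag")


def find_unclosed(html_text):
    """Return a list of messages describing unbalanced tags."""
    checker = TagBalanceChecker()
    checker.feed(html_text)
    checker.close()
    for tag, opened_at in checker.open_tags:
        checker.problems.append(
            f"<{tag}> opened on line {opened_at} was never closed")
    return checker.problems


if __name__ == "__main__":
    sample = "<table>\n<tr>\n<td><b>hi</td>\n</tr>\n</table>\n<div>\n"
    for message in find_unclosed(sample):
        print(message)
```

Run against a page like mine, a list of this sort would at least point to the lines where the damage starts, rather than to the end of the file where the symptoms show up.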
As I was stewing over all this, Jeff Atwood posted on this very topic and opined that “forgiveness by default is absolutely required for the kind of large-scale, worldwide adoption that the web enjoys.” His point is that if JavaScript and HTML errors punished us the way our compilers do, perhaps the majority of web pages would refuse to render.
I guess he has a point there, but in that case, I wish someone would harness the “forgiving” parser in, say, Firefox, wrap a nice API around it, and produce some useful tools. I would love to see the HTML that Firefox actually displays for my Page From Hell, as opposed to the “actual” HTML in the file, for instance. That could perhaps become the starting point for cleaning up what I have inherited.
In my case, I’ll probably just throw it all away and rewrite it. If anyone knows a good tool for these situations, I’m all ears.
1 comment
Your wish may already be true!
If you use the Web Developer Extension, you have a View Source menu with the ability to View Generated Source.
Try it, it’s very handy ;)
Bob responds: I think you are referring to the Web Developer Extension for Firefox (and Flock, Mozilla, and SeaMonkey). I have it installed but hadn’t fully explored it! View Generated Source doesn’t clean up formatting, but it might give htb or other beautifiers something more rational to work with. Thanks, I’ll experiment with it!