Long-term Memory

Parsing an HTML website in Perl

I always wanted to know how to extract pieces of information from the HTML bloat surrounding them. As a proof of concept, I wrote a Perl program that parses the Skynet Electronic Program Guide and turns it into an XMLTV file. XMLTV is a file format used by programs like MythTV to present the on-screen program guide and to schedule recordings.
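To give an idea of the output side, here is a minimal sketch of turning already-parsed records into XMLTV. It is not taken from grab.pl: the record layout, the sub names `xml_escape` and `to_xmltv`, and the sample channel and programme are assumptions made up for illustration. It uses only core Perl and emits only the basic `channel` and `programme` elements.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the five characters that are special in XML text and attributes.
sub xml_escape {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
                  '"' => '&quot;', "'" => '&apos;');
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Build an XMLTV document from channel and programme records.
# The record layout (id, name, start, stop, channel, title) is an assumption
# made for this sketch, not the structure used by grab.pl.
sub to_xmltv {
    my ($channels, $programmes) = @_;
    my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n<tv>\n};
    for my $c (@$channels) {
        $xml .= sprintf qq{  <channel id="%s">\n    <display-name>%s</display-name>\n  </channel>\n},
            xml_escape($c->{id}), xml_escape($c->{name});
    }
    for my $p (@$programmes) {
        $xml .= sprintf qq{  <programme start="%s" stop="%s" channel="%s">\n    <title>%s</title>\n  </programme>\n},
            map { xml_escape($p->{$_}) } qw(start stop channel title);
    }
    return $xml . "</tv>\n";
}

# Hypothetical data, as it might look after the HTML has been parsed.
my @channels   = ({ id => 'een.example', name => 'Channel One' });
my @programmes = ({
    start   => '20080821183000 +0200',   # XMLTV timestamps: YYYYMMDDhhmmss +ZZZZ
    stop    => '20080821190000 +0200',
    channel => 'een.example',
    title   => 'News & Weather',
});

print to_xmltv(\@channels, \@programmes);
```

Escaping matters here because titles scraped from HTML often contain ampersands, which would otherwise make the XMLTV file invalid.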

The Skynet site contains content that is protected by intellectual property rights. This program, however, is for educational purposes only; in most countries this is allowed under “fair use”. To ensure this (and to keep me out of legal trouble), I slowed the program down to an unusable speed.
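One simple way to slow a grabber down is to enforce a minimum delay between requests. The sketch below shows that idea only; the sub name `make_throttle` and the 0.5 second interval are assumptions for illustration, and grab.pl may well do it differently.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Return a closure that blocks until at least $min_interval seconds have
# passed since the previous call. The first call never waits.
sub make_throttle {
    my ($min_interval) = @_;
    my $last_call;
    return sub {
        if (defined $last_call) {
            my $wait = $min_interval - (time() - $last_call);
            sleep($wait) if $wait > 0;
        }
        $last_call = time();
        return;
    };
}

my $throttle = make_throttle(0.5);   # 0.5 s is an arbitrary example value
for my $page (1 .. 3) {
    $throttle->();
    print "would fetch page $page now\n";   # the real HTTP request would go here
}
```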

To figure out how it works, start by running “./grab.pl --help”; if that doesn’t answer your question, feel free to browse through the code.

Caching

When experimenting with the grabber, I found that the Skynet site requires the browser to revalidate every page it serves. Using a standard Perl HTTP-caching module was not an option: such modules honor the server’s headers, and the server disallows caching.

I decided to write my own caching plugin for Perl’s LWP module. It uses a special URI scheme, as in “cachedhttp://www.example.com”, and is controlled through extra headers in the request. For more information, use “perldoc cachedhttp.pm”.
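The core idea, a cache that deliberately ignores what the server says about caching, can be sketched without LWP at all. The following is not the code from cachedhttp.pm: the sub name `cached_get`, the MD5-based file naming, and the `$max_age` parameter are assumptions for illustration, and the download itself is passed in as a code reference so the sketch runs without network access.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Spec;
use File::Temp qw(tempdir);

# Fetch $url through a disk cache. $fetch is a code reference that performs
# the real download; injecting it keeps this sketch independent of LWP.
# Entries younger than $max_age seconds are served from disk, regardless of
# any no-cache headers the server sent.
sub cached_get {
    my ($cache_dir, $url, $max_age, $fetch) = @_;
    my $file = File::Spec->catfile($cache_dir, md5_hex($url));

    # Cache hit: the file exists and its modification time is recent enough.
    if (-e $file && time() - (stat $file)[9] < $max_age) {
        open my $in, '<:raw', $file or die "Cannot read $file: $!";
        local $/;                       # slurp mode
        my $body = <$in>;
        close $in;
        return $body;
    }

    # Cache miss or stale entry: download and store the body.
    my $body = $fetch->($url);
    open my $out, '>:raw', $file or die "Cannot write $file: $!";
    print {$out} $body;
    close $out or die "Cannot close $file: $!";
    return $body;
}

# Usage with a stand-in downloader that counts how often it is called.
my $dir        = tempdir(CLEANUP => 1);
my $calls      = 0;
my $fake_fetch = sub { $calls++; return "<html>page for $_[0]</html>" };

cached_get($dir, 'http://www.example.com/', 3600, $fake_fetch) for 1 .. 2;
print "downloads performed: $calls\n";
```

In this usage the second call should be answered from disk, so the stand-in downloader runs only once. Wrapping the same logic in an LWP protocol handler has the advantage that the rest of the program keeps using the normal LWP interface and only the URL changes.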

The code

grab.pl: the grabber and parser itself

URI/cachedhttp.pm: helper module to define the cachedhttp:// URI
LWP/Protocol/cachedhttp.pm: the real work is done here

I hereby release this code under the GPL.

This entry was posted by Niobos on 2008-08-21 at 18:30 under Uncategorized. Tagged EPG, MythTV, Perl.
