I have always wanted to know how to extract pieces of information from the HTML bloat surrounding them. As a proof of concept, I wrote a Perl program that parses the Skynet Electronic Program Guide and turns it into an XMLTV file. XMLTV is a file format used by programs like MythTV to present the on-screen program guide and to schedule recordings.
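For readers who have not seen XMLTV before, here is a minimal sketch of what such a file looks like and how a Perl script might emit one. It is not taken from grab.pl: the channel id, the programme data, and the helper names xml_escape and xmltv_document are made up for illustration, and a real grabber would fill the data structures from the parsed HTML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data, standing in for what the parser would extract.
my %channels   = ( 'een.example' => 'Een' );
my @programmes = (
    {
        channel => 'een.example',
        start   => '20080101200000 +0100',    # XMLTV time: YYYYMMDDhhmmss +ZZZZ
        stop    => '20080101203000 +0100',
        title   => 'News & Weather',
    },
);

# Escape the characters that are special in XML text and attribute values.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Build a small XMLTV document: all <channel> elements first, then the
# <programme> elements that refer to them by id.
sub xmltv_document {
    my ($channels, $programmes) = @_;
    my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n};
    $xml .= qq{<tv generator-info-name="grab-sketch">\n};
    for my $id (sort keys %$channels) {
        $xml .= sprintf qq{  <channel id="%s">\n    <display-name>%s</display-name>\n  </channel>\n},
            xml_escape($id), xml_escape($channels->{$id});
    }
    for my $p (@$programmes) {
        $xml .= sprintf qq{  <programme start="%s" stop="%s" channel="%s">\n    <title>%s</title>\n  </programme>\n},
            map { xml_escape($_) } @{$p}{qw(start stop channel title)};
    }
    $xml .= "</tv>\n";
    return $xml;
}

print xmltv_document(\%channels, \@programmes);
```

Building the string by hand keeps the sketch free of dependencies; for anything larger, a proper XML writer module would be the safer choice.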
The Skynet site contains content that is protected by intellectual property rights. This program, however, is for educational purposes only; in most countries this is allowed under “fair use”. To ensure this (and to keep me out of legal trouble), I slowed the program down to a speed that makes it impractical for regular use.
To figure out how it works, start by trying “./grab.pl --help”; if that doesn’t give you the answer, feel free to browse through the code.
Caching
When experimenting with the grabber, I found that the Skynet site requires your browser to revalidate each and every page it serves. Using a standard Perl HTTP-caching module was not an option, since such modules respect the server’s headers and the server disallows caching.
I decided to write my own caching plugin for Perl’s LWP module. It adds a special URL scheme, used as in “cachedhttp://www.example.com”, and is controlled through extra headers in the request. For more information use “perldoc cachedhttp.pm”.
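To give an idea of how a custom scheme can be hooked into LWP, here is a minimal sketch, assuming the URI and LWP (libwww-perl) modules are installed. It is not the code from cachedhttp.pm: the in-memory %CACHE hash is an assumption made to keep the example short, and the control headers of the real module are left out (see its perldoc for those). The cache is seeded by hand so that the example runs without touching the network.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# URI looks for a class named URI::<scheme>; inheriting from URI::http
# gives cachedhttp:// the same syntax and default port as http://.
package URI::cachedhttp;
use parent 'URI::http';

# LWP hands every request for the scheme to this class.
package LWP::Protocol::cachedhttp;
use parent 'LWP::Protocol';
use HTTP::Response;

our %CACHE;    # assumption: canonical URL => body, kept in memory only

sub request {
    my ($self, $request, $proxy, $arg, $size, $timeout) = @_;
    my $key = $request->uri->canonical->as_string;

    # Cache hit: answer locally, whatever the server says about caching.
    if (exists $CACHE{$key}) {
        return HTTP::Response->new(200, 'OK (from cache)',
            [ 'Content-Type' => 'text/html' ], $CACHE{$key});
    }

    # Cache miss: rewrite the URL to plain http:// and let LWP's own
    # HTTP implementation do the actual transfer.
    my $real = $request->clone;
    my $uri  = $real->uri->clone;
    $uri->scheme('http');
    $real->uri($uri);
    my $response = LWP::Protocol::create('http', $self->{ua})
        ->request($real, $proxy, $arg, $size, $timeout);

    # Only bodies collected in memory (no callback or file target) are stored.
    $CACHE{$key} = $response->content if $response->is_success;
    return $response;
}

package main;
use LWP::UserAgent;

# Register the handler explicitly, since it lives in this file.
LWP::Protocol::implementor('cachedhttp', 'LWP::Protocol::cachedhttp');

# Seed the cache by hand so the example works offline.
my $url = 'cachedhttp://www.example.com/guide.html';
$LWP::Protocol::cachedhttp::CACHE{ URI->new($url)->canonical->as_string }
    = '<html>cached copy</html>';

my $ua       = LWP::UserAgent->new;
my $response = $ua->get($url);
print $response->status_line, "\n", $response->content, "\n";
```

A cache that should survive between runs would have to store the bodies on disk instead of in a hash.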
The code
grab.pl: the grabber and parser itself
URI/cachedhttp.pm: helper module to define the cachedhttp:// URI
LWP/Protocol/cachedhttp.pm: the real work is done here
I hereby release this code under the GPL.