PHP5, DOM, and screen scraping
»From the php part of the brain.
I spent some time using PHP5 to screen scrape a web page so that I could generate an RSS feed from it.
TextDrive forbids the use of wget or fetch, so I had to grab the file directly using a socket connection:
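$host = "www.example.com";
$address = gethostbyname($host);
$port = getservbyname('www', 'tcp');
// Create a TCP socket first; socket_connect() needs one to work with.
$socket = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
$result = socket_connect($socket, $address, $port);
// Build a minimal HTTP/1.1 request and ask the server to close when done.
$in = "GET /index.html HTTP/1.1\r\n";
$in .= "Host: $host\r\n";
$in .= "Connection: Close\r\n\r\n";
socket_write($socket, $in, strlen($in));
// Read until the server closes the connection.
$html = '';
while ($out = socket_read($socket, 2048)) {
    $html .= $out;
}
socket_close($socket);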
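Keep in mind that $html now holds the whole HTTP response, status line and headers included, so the markup has to be separated out before anything else looks at it. Here is a minimal sketch of that step, assuming the server sends a plain (non-chunked) body. The function name split_http_response and the sample response are made up for this example:

```php
<?php
// Hypothetical helper: split a raw HTTP response into headers and body.
// Assumes the body is not sent with chunked transfer encoding.
function split_http_response($response)
{
    // The headers end at the first blank line (CRLF CRLF).
    $parts = explode("\r\n\r\n", $response, 2);
    $headers = explode("\r\n", $parts[0]);
    $body = isset($parts[1]) ? $parts[1] : '';
    return array($headers, $body);
}

// Sample response standing in for what the socket loop collected in $html.
$html = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html><body>hi</body></html>";
list($headers, $body) = split_http_response($html);
echo $headers[0], "\n"; // the status line
echo $body, "\n";       // the markup only
```

If the server does answer with chunked encoding, this will not be enough; requesting HTTP/1.0 instead is one way that might sidestep it.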
Strip the HTTP response headers, run the markup through tidy, and you’ve got yourself a clean document that the following three lines of code will load into a DOM tree and extract all <font> tags from.
$dom = new DOMDocument();
$dom->loadHTML($tidy_html_string);
$fonts = $dom->getElementsByTagName("font");
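getElementsByTagName() hands back a DOMNodeList, which can be walked with foreach to build the feed. What follows is a rough sketch of that last step, not the exact code behind my feed. It assumes each <font> tag holds one headline, and the sample markup, the feed title, and the example.com link are all placeholders:

```php
<?php
// Placeholder markup standing in for the tidied page.
$tidy_html_string = '<html><body><font>First post</font><font>Second &amp; last</font></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($tidy_html_string);

// Build the feed with DOM as well, so the text gets escaped for us.
$rss = new DOMDocument('1.0', 'UTF-8');
$root = $rss->appendChild($rss->createElement('rss'));
$root->setAttribute('version', '2.0');
$channel = $root->appendChild($rss->createElement('channel'));
$channel->appendChild($rss->createElement('title', 'Scraped feed'));
$channel->appendChild($rss->createElement('link', 'http://www.example.com/'));
$channel->appendChild($rss->createElement('description', 'Generated by screen scraping'));

// One <item> per <font> tag found in the page.
foreach ($dom->getElementsByTagName('font') as $font) {
    $item = $channel->appendChild($rss->createElement('item'));
    $title = $item->appendChild($rss->createElement('title'));
    // Text nodes have &, < and > escaped when the feed is serialized.
    $title->appendChild($rss->createTextNode(trim($font->textContent)));
}

$feed = $rss->saveXML();
echo $feed;
```

Building the XML through DOM rather than string concatenation means a stray ampersand in a headline will not break the feed.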
I created an RSS feed for From the Shadows, which was severely lacking one. I’m also playing around with FeedBurner to manage feed statistics. Since FeedBurner fetches the feed periodically, it also behaves like a cron job, so I don’t have to set one up.
So without further ado, here is the FTS RSS feed. It relies on screen scraping and on the FTS crew not asking me to take it down, so in other words, it might not work forever.
Updated November 5th, 2005: I’ve removed the feed. Please link to the FTS feed directly.