Puffbox.com

Adventures in government, politics and open source. Mostly WordPress-related.

Simon Dickson, principal consultant at Puffbox, has been blogging about e-government, online politics, and WordPress since 2005. Some important people read it.

Work in progress: what's hot on the political blogs

4 November 2009

I thought I'd share an early screenshot of a little side-project I'm working on at the moment. Not sure if it'll lead anywhere in particular, but it's been an interesting adventure into coding at the very least. Maybe some of you lot can see a use for it, or can suggest directions I might take it.

Basically, it's an automated 'word cloud' generator for blogs: think 'Twitter trending' for a defined collection of RSS sources. Every few minutes, it pulls in the latest posts from Iain Dale's Top 100 political blogs (although it could be any folder you care to share in Google Reader), and looks for the most popular words in article headlines and opening sentences. It joins up pairs of words likely to go together, such as people's names, based on a manually maintained list stored in a plain text file. It removes any words it finds in a 300-strong list of 'stopwords', then sorts the remainder in order of popularity. Finally there's some cheeky string manipulation to apply CSS classes to the words in the 'cloud', including pulling in little icons where available. It's all been built for flexibility (maximum number of posts to review, over how many hours, etc.) and easy maintenance. And I'm really quite pleased with it so far.
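For anyone who wants to picture the counting step, here is a minimal sketch of the idea. It is written in Python purely for illustration (the real thing is PHP), and the stopword set, phrase list and function name are made-up stand-ins rather than the actual files or code.

```python
import re
from collections import Counter

# Hypothetical sample data: the real lists live in plain text files
# and are much longer (the stopword list has around 300 entries).
STOPWORDS = {"the", "a", "to", "of", "on", "in", "and", "is", "for"}
PHRASES = ["david cameron", "gordon brown"]


def count_terms(texts, phrases=PHRASES, stopwords=STOPWORDS):
    """Return (term, count) pairs, most popular first."""
    counts = Counter()
    for text in texts:
        text = text.lower()
        # Glue known word pairs together with an underscore so that
        # they survive tokenising as a single term.
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            text = re.sub(pattern, phrase.replace(" ", "_"), text)
        for token in re.findall(r"[a-z_']+", text):
            term = token.replace("_", " ").strip("' ")
            if term and term not in stopwords:
                counts[term] += 1
    # most_common() sorts by count, highest first.
    return counts.most_common()


if __name__ == "__main__":
    sample = [
        "David Cameron attacks Gordon Brown on the economy",
        "Gordon Brown responds to David Cameron",
        "Economy: David Cameron speech",
    ]
    for term, count in count_terms(sample):
        print(count, term)
```

Joining the pairs before splitting into words is one simple way to keep names intact; it does mean the phrase list has to be kept up to date by hand, as described above.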

I took this screenshot a few minutes ago: you can see how the hot news topics jump out at you.

[Screenshot: the 'hotwords' word cloud]

But then what? In my current test build, the words are all clickable - and act as a show/hide toggle for a long aggregated list of posts. So you click 'david cameron' and you see all posts whose headline or opening sentences contain the specific phrase 'david cameron'. It's not bad, but I don't yet feel it's the right end result. Ideas welcome!
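The filtering behind that toggle is simple enough to sketch. Again this is illustrative Python rather than the PHP in the test build, and the Post fields and function name are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Post:
    # Hypothetical shape for an aggregated feed item.
    headline: str
    opening: str
    link: str


def posts_matching(posts, phrase):
    """Return posts whose headline or opening text contains the phrase."""
    needle = phrase.lower()
    return [
        post
        for post in posts
        if needle in post.headline.lower() or needle in post.opening.lower()
    ]


if __name__ == "__main__":
    posts = [
        Post("Cameron speaks", "David Cameron said today...", "http://example.com/1"),
        Post("Budget latest", "The Chancellor announced...", "http://example.com/2"),
    ]
    for post in posts_matching(posts, "david cameron"):
        print(post.headline, post.link)
```

This is a plain case-insensitive substring match, so clicking 'david cameron' only finds posts containing that exact phrase, not posts that mention 'Cameron' alone.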

For the technically minded: I'm doing it all in PHP, pulling feeds in from Google Reader and processing them using SimplePie, before getting crazy with some monster arrays. On my local machine, it takes about 5 seconds to process each cloud, based on around 100 posts each time: maybe it could be faster, but it doesn't need to be. In production, I'd probably have it running every 5-10 minutes as a cron job, generating a static HTML chunk to be called in via an include. I did initially try building it in JavaScript, but processing times didn't look promising.
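As a rough sketch of that last step, here is one way the static chunk could be produced from the sorted word counts. It is illustrative Python, not the PHP build; the class names (size-1 to size-5, term-...), the five size levels and the function names are all assumptions for the example.

```python
import html
import math
import os
import tempfile


def size_class(count, max_count, levels=5):
    """Map a count onto a CSS class from size-1 (rare) to size-N (hottest)."""
    level = max(1, math.ceil(levels * count / max_count))
    return f"size-{level}"


def render_cloud(terms):
    """Turn (term, count) pairs into an HTML list with CSS classes."""
    if not terms:
        return '<ul class="cloud"></ul>\n'
    max_count = max(count for _, count in terms)
    items = []
    for term, count in terms:
        # A per-term class lets the stylesheet attach an icon where one exists.
        slug = html.escape(term.replace(" ", "-"), quote=True)
        classes = f"{size_class(count, max_count)} term-{slug}"
        items.append(f'<li class="{classes}">{html.escape(term)}</li>')
    return '<ul class="cloud">\n' + "\n".join(items) + "\n</ul>\n"


def write_chunk(path, markup):
    """Write to a temp file, then swap it in, so readers never see half a file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(markup)
    os.replace(tmp_path, path)


if __name__ == "__main__":
    cloud = render_cloud([("david cameron", 3), ("economy", 1)])
    write_chunk("cloud.html", cloud)
    print(cloud)
```

Writing to a temporary file and renaming it means the page include never picks up a half-written chunk if a request arrives while the scheduled job is running.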
