Simon Dickson, principal consultant at Puffbox, writes stuff about e-government, online news and politics. Some important people read it.

Work in progress: what’s hot on the political blogs

4 November 2009

I thought I'd share an early screenshot of a little side-project I'm working on at the moment. Not sure if it'll lead anywhere in particular, but it's been an interesting* adventure into coding at the very least. Maybe some of you lot can see a use for it, or can suggest directions I might take it.

Basically, it's an automated 'word cloud' generator for blogs: think 'Twitter trending' for a defined collection of RSS sources. Every few minutes, it pulls in the latest posts from Iain Dale's Top 100 political blogs (although it could be any folder you care to share in Google Reader), and looks for the most popular words in article headlines and opening sentences. It joins up pairs of words likely to go together, such as people's names, based on a manually-maintained list stored in a plain text file. It removes any words it finds in a 300-strong list of 'stopwords', then sorts the remainder in order of popularity. Finally there's some cheeky string manipulation to apply CSS classes to the words in the 'cloud', including the calling-in of little icons where available. It's all been built for flexibility (maximum number of posts to review, over how many hours, etc) and easy maintenance. And I'm really quite pleased with it so far.
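For anyone who'd rather see those steps as code, here is a rough sketch in Python (not the PHP the real thing is built in). It joins known word pairs, drops stopwords, counts what is left and maps each count to a CSS class. The tiny `STOPWORDS` and `PHRASES` sets, the function names and the `size-N` class names are all made up for illustration; the actual lists live in text files and the actual class scheme isn't described here. Icons are left out entirely.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the two text files: the real stopword list is
# about 300 words long, and the real phrase list is maintained by hand.
STOPWORDS = {"the", "a", "to", "of", "and", "in", "on", "for", "is", "as", "at"}
PHRASES = {("david", "cameron"), ("gordon", "brown")}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def join_phrases(tokens, phrases=PHRASES):
    """Merge adjacent tokens that appear in the phrase list into one term."""
    joined = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            joined.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            joined.append(tokens[i])
            i += 1
    return joined


def hot_words(texts, stopwords=STOPWORDS, top_n=30):
    """Count terms across all texts; return (term, count), most popular first."""
    counts = Counter()
    for text in texts:
        terms = join_phrases(tokenize(text))
        counts.update(t for t in terms if t not in stopwords)
    # Sort by count descending, then alphabetically so ties come out stable.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


def css_class(count, max_count, levels=5):
    """Map a count onto a size class from 'size-1' (rare) up to 'size-N'."""
    level = max(1, round(levels * count / max_count))
    return f"size-{level}"
```

Feeding `hot_words` a list of headline-plus-opening-sentence strings gives back the sorted terms; `css_class(count, top_count)` then decides how big each one appears in the cloud.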

I took this screenshot a few minutes ago: you can see how the hot news topics jump out at you.

[Screenshot: hotwords]

But then what? In my current test build, the words are all clickable - and act as a show/hide toggle for a long aggregated list of posts. So you click 'david cameron' and you see all posts whose headline or opening sentences contain the specific phrase 'david cameron'. It's not bad, but I don't yet feel it's the right end result. Ideas welcome!
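The filtering behind that toggle can be sketched in a few lines of Python. This is only an illustration: the dictionary keys `title` and `summary` and the function name `posts_matching` are assumptions, and it does a plain case-insensitive substring match, which may not be exactly how the test build decides what counts as a hit.

```python
def posts_matching(phrase, posts):
    """Return posts whose headline or opening text contains the exact phrase."""
    needle = phrase.lower()
    return [
        post for post in posts
        # Check each field separately so a phrase can't match across the
        # boundary between the headline and the opening sentences.
        if needle in post["title"].lower() or needle in post["summary"].lower()
    ]
```

So `posts_matching("david cameron", posts)` would give the list to show when that term is clicked, and an empty list for a term nobody has mentioned.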

For the technically minded: I'm doing it all in PHP, pulling feeds in from Google Reader and processing them using SimplePie, before getting crazy with some monster arrays. On my local machine, it takes about 5 seconds to process each cloud, based on around 100 posts each time: maybe it could be faster, but it doesn't need to be. In production, I'd probably have it running every 5-10 minutes on a cron, generating a static HTML chunk to be called in via an include. I did initially try building it in JavaScript, but processing times didn't look promising.
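The 'static HTML chunk on a cron' idea might look something like this, again sketched in Python rather than PHP. Everything named here is hypothetical: the `hotwords` list class, the function names, the output path and the script path in the crontab comment. The one design choice worth noting is writing to a temporary file and then renaming it, so a page including the fragment never sees a half-written file.

```python
import html
import os
import tempfile

# A crontab entry along these lines would rebuild the fragment every
# 5 minutes (the script path is a placeholder):
#   */5 * * * * python3 /path/to/build_cloud.py


def render_cloud(words):
    """Build an HTML fragment from (term, css_class) pairs."""
    items = [
        f'<li class="{html.escape(css, quote=True)}">{html.escape(term)}</li>'
        for term, css in words
    ]
    return '<ul class="hotwords">\n' + "\n".join(items) + "\n</ul>\n"


def write_fragment(path, markup):
    """Write the fragment to a temp file, then swap it into place atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(markup)
        # mkstemp creates the file owner-only; open it up for the web server.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap.
        os.unlink(tmp_path)
        raise
```

The page itself then only has to include the finished file, so the 5 seconds of processing never happens while a visitor is waiting.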
