I thought I’d share an early screenshot of a little side-project I’m working on at the moment. Not sure if it’ll lead anywhere in particular, but it’s been an interesting* adventure into coding at the very least. Maybe some of you lot can see a use for it, or can suggest directions I might take it.
Basically, it’s an automated ‘word cloud’ generator for blogs: think ‘Twitter trending’ for a defined collection of RSS sources. Every few minutes, it pulls in the latest posts from Iain Dale’s Top 100 political blogs (although it could be any folder you care to share in Google Reader), and looks for the most popular words in article headlines and opening sentences. It joins up pairs of words likely to go together, such as people’s names, based on a manually-maintained list stored in a plain text file. It removes any words it finds in a 300-strong list of ‘stopwords’, then sorts the remainder in order of popularity. Finally there’s some cheeky string manipulation to apply CSS classes to the words in the ‘cloud’, including the calling-in of little icons where available. It’s all been built for flexibility (maximum number of posts to review, over how many hours, etc.) and easy maintenance. And I’m really quite pleased with it so far.
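For anyone curious how those steps hang together, here’s a minimal sketch. It’s written in Python purely for illustration rather than lifted from the actual build, and the tiny stopword list, the phrase list, the function names and the `cloud-N` class names are all made up for the example.

```python
import html
import math
import re
from collections import Counter

# Placeholder data: the real lists live in plain text files and are much longer.
STOPWORDS = {"the", "and", "for", "with", "from", "that", "this"}
PHRASES = ["david cameron", "gordon brown"]  # word pairs to keep together


def tokenize(text, phrases=PHRASES):
    """Lower-case the text, glue known phrases together, split into words."""
    text = text.lower()
    for phrase in phrases:
        # Join the pair with an underscore so it survives the word split.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, phrase.replace(" ", "_"), text)
    return re.findall(r"[a-z0-9_]+", text)


def count_terms(posts, stopwords=STOPWORDS):
    """Count every term across all posts, skipping stopwords and tiny words."""
    counts = Counter()
    for post in posts:
        for token in tokenize(post):
            if len(token) > 2 and token not in stopwords:
                counts[token.replace("_", " ")] += 1
    return counts


def cloud_html(counts, top_n=30, levels=5):
    """Render the top terms as spans whose CSS class reflects popularity."""
    top = counts.most_common(top_n)
    if not top:
        return ""
    biggest = top[0][1]
    spans = []
    for term, n in sorted(top):  # alphabetical order within the cloud
        level = math.ceil(levels * n / biggest)  # 1 (rare) .. levels (hot)
        spans.append(f'<span class="cloud-{level}">{html.escape(term)}</span>')
    return " ".join(spans)


posts = [
    "David Cameron attacks Gordon Brown on the economy",
    "Gordon Brown responds to David Cameron",
    "Economy fears for David Cameron",
]
counts = count_terms(posts)
markup = cloud_html(counts)
```

The stylesheet would then map `cloud-1` through `cloud-5` to increasing font sizes; icons could hang off an extra class per term in the same way.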
I took this screenshot a few minutes ago: you can see how the hot news topics jump out at you.
But then what? In my current test build, the words are all clickable – and act as a show/hide toggle for a long aggregated list of posts. So you click ‘david cameron’ and you see all posts whose headline or opening sentences contain the specific phrase ‘david cameron’. It’s not bad, but I don’t yet feel it’s the right end result. Ideas welcome!
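The lookup behind each click is nothing clever. As a rough Python sketch (again illustrative only, with the `title` and `summary` field names invented for the example), it amounts to a case-insensitive phrase match:

```python
def posts_matching(phrase, posts):
    """Return posts whose headline or opening text contains the exact phrase."""
    needle = phrase.lower()
    return [
        post
        for post in posts
        if needle in (post["title"] + " " + post["summary"]).lower()
    ]


sample_posts = [
    {"title": "David Cameron on tax", "summary": "The Tory leader said..."},
    {"title": "Budget preview", "summary": "What will David Cameron make of it?"},
    {"title": "Cameron Diaz film review", "summary": "Nothing political here."},
]
hits = posts_matching("david cameron", sample_posts)
```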
For the technically minded: I’m doing it all in PHP, pulling feeds in from Google Reader and processing them using SimplePie, before getting crazy with some monster arrays. On my local machine, it takes about 5 seconds to process each cloud, based on around 100 posts each time: maybe it could be faster, but it doesn’t need to be. In production, I’d probably have it running every 5-10 minutes on a cron, generating a static HTML chunk to be called in via an include. I did initially try building it in JavaScript, but processing times didn’t look promising.