Puffbox

Simon Dickson's gov-tech blog, active 2005-14. Because permalinks.


  • 4 Nov 2009
    politics
    hotwords

    Work in progress: what's hot on the political blogs

    I thought I’d share an early screenshot of a little side-project I’m working on at the moment. Not sure if it’ll lead anywhere in particular, but it’s been an interesting* adventure into coding at the very least. Maybe some of you lot can see a use for it, or can suggest directions I might take it.

    Basically, it’s an automated ‘word cloud’ generator for blogs: think ‘Twitter trending’ for a defined collection of RSS sources. Every few minutes, it pulls in the latest posts from Iain Dale’s Top 100 political blogs (although it could be any folder you care to share in Google Reader), and looks for the most popular words in article headlines and opening sentences. It joins up pairs of words likely to go together, such as people’s names, based on a manually maintained list stored in a plain-text file. It removes any words it finds in a 300-strong list of ‘stopwords’, then sorts the remainder in order of popularity. Finally, there’s some cheeky string manipulation to apply CSS classes to the words in the ‘cloud’, including the calling-in of little icons where available. It’s all been built for flexibility (maximum number of posts to review, over how many hours, etc) and easy maintenance. And I’m really quite pleased with it so far.
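
The counting steps described above can be sketched roughly as follows. This is an illustration in Python rather than the PHP the project actually uses, and everything named here is an assumption: `STOPWORDS` and `PHRASES` are tiny stand-ins for the 300-strong stopword list and the manually maintained pairs file, and `tokenize` and `hot_words` are hypothetical helper names.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the two plain-text lists the post mentions.
STOPWORDS = {"the", "a", "an", "to", "of", "in", "on", "and", "is", "for", "at", "with"}
PHRASES = ["david cameron", "gordon brown"]


def tokenize(text, phrases=PHRASES):
    """Lowercase the text, glue known word pairs together, then split into tokens."""
    text = text.lower()
    for phrase in phrases:
        # Swap the space for an underscore so the pair survives the split below.
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return re.findall(r"[a-z0-9_]+", text)


def hot_words(snippets, stopwords=STOPWORDS, top_n=20):
    """Count tokens across all snippets, skipping stopwords, most popular first."""
    counts = Counter()
    for snippet in snippets:
        for token in tokenize(snippet):
            # Drop stopwords and one-letter leftovers such as the "s" from "cameron's".
            if len(token) < 2 or token in stopwords:
                continue
            counts[token.replace("_", " ")] += 1
    return counts.most_common(top_n)
```

Calling `hot_words` on a list of headline-plus-opening-sentence strings returns `(word, count)` pairs in descending order of popularity. The plain substring replacement used for joining pairs is deliberately naive; a real list of names would probably want word-boundary matching.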

    I took this screenshot a few minutes ago: you can see how the hot news topics jump out at you.

    [Screenshot: the ‘hotwords’ word cloud]

    But then what? In my current test build, the words are all clickable – and act as a show/hide toggle for a long aggregated list of posts. So you click ‘david cameron’ and you see all posts whose headline or opening sentences contain the specific phrase ‘david cameron’. It’s not bad, but I don’t yet feel it’s the right end result. Ideas welcome!
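
The click-through behaviour amounts to a phrase filter over the aggregated posts. A minimal sketch in Python, assuming each post is a dictionary with hypothetical `title` and `summary` keys (the post does not say how the items are actually stored):

```python
def posts_matching(posts, phrase):
    """Return the posts whose headline or opening sentences contain the phrase."""
    needle = phrase.lower()
    return [
        post
        for post in posts
        # Case-insensitive substring match over headline plus opening text.
        if needle in (post["title"] + " " + post["summary"]).lower()
    ]
```

Clicking ‘david cameron’ in the cloud would then correspond to `posts_matching(posts, "david cameron")`.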

    For the technically minded: I’m doing it all in PHP, pulling feeds in from Google Reader and processing them using SimplePie, before getting crazy with some monster arrays. On my local machine, it takes about 5 seconds to process each cloud, based on around 100 posts each time: maybe it could be faster, but it doesn’t need to be. In production, I’d probably have it running every 5-10 minutes on a cron, generating a static HTML chunk to be called in via an include. I did initially try building it in JavaScript, but processing times didn’t look promising.
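
The ‘static HTML chunk’ idea, together with the CSS classes mentioned earlier, might look something like this. Again this is a Python sketch and not the author's PHP: the `size-1` to `size-5` class names, the `hotwords` wrapper class and the `hotwords.html` output path are all invented for illustration.

```python
import html
from pathlib import Path


def size_class(count, lowest, highest, buckets=5):
    """Map a count onto one of `buckets` CSS classes, size-1 (rare) upwards."""
    if highest == lowest:
        return "size-1"
    step = (count - lowest) * (buckets - 1) // (highest - lowest)
    return f"size-{step + 1}"


def render_cloud(word_counts):
    """Turn (word, count) pairs into a single HTML fragment."""
    if not word_counts:
        return '<div class="hotwords"></div>\n'
    values = [count for _, count in word_counts]
    lowest, highest = min(values), max(values)
    spans = [
        f'<span class="{size_class(count, lowest, highest)}">{html.escape(word)}</span>'
        for word, count in word_counts
    ]
    return '<div class="hotwords">' + " ".join(spans) + "</div>\n"


def write_chunk(word_counts, path="hotwords.html"):
    """Write the fragment to disk so a page can pull it in via an include."""
    Path(path).write_text(render_cloud(word_counts), encoding="utf-8")
```

A scheduled job would call `write_chunk` with the latest counts every few minutes, so visitors only ever load the pre-built file and never wait on the 5-second processing step.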
