The idea for this project came about in an editorial meeting: a journalist wondered if it was possible to test the assumption that politicians spend a disproportionate amount of time talking about certain issues at the expense of others that matter more to people’s lives. We decided to test this theory using Hansard, the official transcripts of parliamentary proceedings.
Phase one of the project was to decide if data-mining parliamentary speeches was even possible. Early in our research we discovered OpenAustralia and the work it had done on aggregating digital versions of Hansard transcripts provided by the government.
In retrospect, it’s hard to imagine doing the project without it. While Hansard transcripts are provided by the government, the site is designed for humans to search, not for organisations trying to access the data programmatically. Electronic versions of Hansard are available but not easily retrieved. OpenAustralia had fought most of these battles for us, working out how to automate the downloading and cleaning of the transcripts. OpenAustralia then provides this data via an API and as XML downloads.
We decided early on to return whatever code we could to OpenAustralia, so we made any changes we needed to the downloading process in Ruby (because that’s the language it uses) and submitted the changes to OpenAustralia via GitHub.
We then started building prototypes to explore the data further. The first prototype presented a sortable table of three measures: speeches, interjections and duration. We also experimented with various text-analysis algorithms to mine the records for popular topics and to list those topics by frequency of mentions. We released these tools internally, so journalists could explore the data for themselves and suggest further ideas to investigate. We decided to start with two projects: one more playful, the other a more serious exploration of the original hypothesis.
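The specific text-analysis algorithms we tried aren’t documented here, but the simplest version of “list topics by frequency of mentions” can be sketched in a few lines of Ruby. This is an illustrative, naive term-frequency counter with a stop-word filter, not the code from the prototypes; all names are hypothetical.

```ruby
# Naive topic-frequency sketch: count word occurrences across a set of
# speeches, ignoring common stop words, and rank by frequency.
STOP_WORDS = %w[the a an and of to in that is for on it with as we this].freeze

def topic_frequencies(speeches)
  counts = Hash.new(0)
  speeches.each do |speech|
    # Lowercase and split into word tokens, then tally non-stop-words
    speech.downcase.scan(/[a-z']+/).each do |word|
      counts[word] += 1 unless STOP_WORDS.include?(word)
    end
  end
  counts.sort_by { |_word, n| -n }
end

speeches = [
  "The carbon tax debate continued in the House",
  "Members debated the carbon price and the economy"
]
p topic_frequencies(speeches).first(3)
```

A real pipeline would also need stemming and phrase detection ("carbon tax" vs "carbon price"), which is where the experimentation came in.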
The biggest challenge for the backend was to take the 77 million words uttered in Parliament since 2006 and process them so that the entire record could be queried quickly and easily via the web. We explored a number of solutions, including CouchDB, MongoDB, Elasticsearch and Lucene. All of these handle text search very well, but they are optimised to return ranked results, as a search engine would. We needed a solution that could take a user’s query, find the matching Hansard transcripts and then aggregate the data so that it was broken down by party, week and phrase. Ultimately, we chose PostgreSQL, which allowed us to fine-tune any on-the-fly aggregations that were required. As the default PostgreSQL text index didn’t quite do what we wanted, we ended up building a full inverted index of the data and querying that.
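The core idea of that inverted index can be shown with a small in-memory Ruby sketch: map each word to the speeches that contain it, then aggregate the matching speeches by party and by week. In production the index lived in PostgreSQL rather than in memory, and the real schema is not shown here; every name below is illustrative.

```ruby
require 'date'

# A speech record: id, speaker's party, sitting date, and the spoken text.
Speech = Struct.new(:id, :party, :date, :text)

# Inverted index: word => list of speech ids containing that word.
def build_index(speeches)
  index = Hash.new { |h, k| h[k] = [] }
  speeches.each do |s|
    s.text.downcase.scan(/[a-z']+/).uniq.each { |w| index[w] << s.id }
  end
  index
end

# Look up a query word and break the matches down by party and by week.
def aggregate(query, index, speeches)
  by_id    = speeches.each_with_object({}) { |s, h| h[s.id] = s }
  by_party = Hash.new(0)
  by_week  = Hash.new(0)
  index.fetch(query.downcase, []).each do |id|
    s = by_id[id]
    by_party[s.party] += 1
    by_week[s.date.strftime('%G-W%V')] += 1 # ISO week buckets, e.g. "2012-W19"
  end
  { party: by_party, week: by_week }
end

speeches = [
  Speech.new(1, 'ALP',     Date.new(2012, 5, 7),  'The carbon price starts soon'),
  Speech.new(2, 'Liberal', Date.new(2012, 5, 8),  'We oppose the carbon tax'),
  Speech.new(3, 'ALP',     Date.new(2012, 5, 15), 'Carbon pricing is reform')
]
index = build_index(speeches)
p aggregate('carbon', index, speeches)
# => {:party=>{"ALP"=>2, "Liberal"=>1}, :week=>{"2012-W19"=>2, "2012-W20"=>1}}
```

Storing the postings in a database table keyed by word is what lets the party/week breakdowns be computed with ordinary `GROUP BY` aggregation at query time.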
In the end, we came up with something that worked fairly well.
Another challenge with the Party Lines tool was how to present all the data in a clear and meaningful way; the ability to see trends at a glance was paramount. We initially thought a stacked area chart would serve the purpose, but early tests showed people found it hard to interpret, so we moved to stacked bar charts instead.
The “bubble-heads” style we chose for Talking Heads came about by accident. In the initial prototype we needed a mockup of a visualisation, and a bubble chart seemed to fit the bill. The editorial team liked the concept, so we decided to develop it further. “Kitten mode” exists thanks to Andrew Cobby.
We plan to release a few more visualisations based on the work we’ve done on the Powerhouse blog. Ideas and collaborations are welcome: [email@example.com]
Again, thanks to OpenAustralia for its hard work and help. If you’re into technology and open government, then you probably know what a great job it does already. Our team looks forward to future opportunities to work with it.