Crying wolf: adventures at the cutting edge of data access

Somewhere between our data access dreams as researchers and the chaotic ocean of information on tap at our workstations lies a new hope: Wolfram|Alpha. Long fascinated by the problem of access to data, Stephen Wolfram, a British-born scientist and the progenitor of the Mathematica software, released the remarkable Wolfram|Alpha in 2009. More an answer engine than a search engine it is, according to The Guardian[1], ‘…a website that aims to be able to answer any factual question asked of it’ (or, according to its creator, an ‘insanely ambitious thing, like the science fiction computers of old’). And, indeed, there is something inspiringly grand about a plan to ‘take knowledge from throughout history and automate it.’ This casual-sounding phrase probably needs a few moments to properly digest. This is nothing less than all recorded knowledge, fully integrated and accessible through an intelligent interface which can understand a diverse range of instructions and produce meaningful output – immediately, or nearly so. I’m already excited.

Google, like other search engines in wide use at the moment, simply searches through existing pages on the internet in which your keywords appear. If no-one has already written about the search query you submit then nothing will emerge. If someone has, this family of search engines will list the results but has little interest in their provenance.

By contrast, Wolfram|Alpha aims to understand queries, whether expressed in full sentences, key words, broken English or mathematical symbols. It aims to provide ‘answers and relevant visualizations from a core knowledge base of curated, structured data’[2]. At 20 queries a second it ranks 3,304th among the most used search engines, according to The Guardian. Small steps perhaps, but this is not unusual among internet phenomena. However, it is hard not to feel excited by this tottering new-born with such potential. And with every search it becomes better, as a research team integrates new datasets and refines its ability to understand what we mean when we come to it with queries.

Undoubtedly, the system is at its best when answering technical queries, and gives an instant stomach-fluttering glimpse into something truly extraordinary (a word that we use often in connection with the internet, but increasingly don’t mean). A polynomial equation is solved and plotted, for example. A chi-square distribution is instantly displayed. Cluster analysis is defined. Access to further information and references is included; full detail of sources is provided. Helpfully, the mass of Jupiter divided by Loxodonta africana (touchingly, ‘elephant’ was considered far too vague and this taxonomic alternative was offered) is immediately calculated.

But what of the social sciences? The simple stuff is all there: populations, average banana consumption by country, average income and so on. But if you want more subtle information, such as the population in the UK over 65 years old, Wolfram|Alpha gives a detailed definition of the unit ‘year’. So, not quite ready yet? Certainly the fickle creature that is the average internet punter is unlikely to return to a website after an unsuccessful first visit, and has little time for an enterprise which declares that ‘Development of this topic is under investigation’.

Nevertheless, Wolfram claims that at 10 trillion data points his website constitutes the largest integrated dataset in the world. No small feat. It is only fair to put such a system to the test. Three Brook Lyndhurst researchers were chosen at, ahem, random, given a brief description of the potential of the system and asked to come up with search queries. I ran the three queries to see what would happen.

Researcher No. 1: ‘What proportion of off-grid households in the UK currently supply their energy needs through renewables?’

First go, Wolfram|Alpha gives some information on energy expenditure for a human. Doing nothing, it helpfully informs us, typically requires 38Cal and burns 0.17oz of fat. Chopping wood, on the other hand, would burn 1oz over the same 30-minute period. Interesting, but not quite what we were looking for. After a couple of minutes refining the search, I have found data on renewable energy production and consumption in the UK and worldwide, but to answer this specific question will take some more unravelling. Still, a challenging test; on to the next one.

Researcher No. 2: ‘Amount of textiles in tonnes which end up in landfill in the UK every year?’

Wolfram|Alpha isn’t at all sure what to do about this, with this wording. But we are a demanding lot at Brook Lyndhurst. I modify the query to a more machine-friendly ‘Textiles+landfill+UK’. It isn’t looking good, although it would be churlish to discount the large amount of UK-related data that is instantly generated. Nothing to do with textiles, or landfill, unfortunately. Despite trying a number of alternative search terms, I’m none the wiser on textile-related landfill in the UK, although I have seen a satellite image of London and know the Scrabble score of ‘landfill’. Perhaps we’re not being fair. While waiting for the third researcher to independently come up with a search query, I look up t-test and find a handy calculator, and I am soon looking at ‘Hypothesis Testing’ in MathWorld, a Mathematica-related Wolfram resource for anyone interested in maths. Of course, we all already know that the internet can lead us to all kinds of interesting places; borderline time-wasting diversions are not the data revolution foretold. So, back to testing the promise of prompt, relevant data.

Researcher No. 3: ‘What volume of water would be needed to make sea-level rise 1m?’

Again, the first go isn’t directly helpful, although the water content in the Earth’s atmosphere is duly delivered. After trying in a few alternative forms, I’m not making much progress. Although it is a repeat of the previous searches, I hope that this is an indication more of the current capacity of the system than its future potential.

All the while acknowledging that the internet is as much ‘we’ as it is ‘it’, and that our engagement with such things can be what makes them great, Wolfram|Alpha either isn’t quite ready yet or we need to get better at using it. The move from personally amused to professionally interested will require that we are able to reliably query specific datasets. For example, we recently became interested in the release of the remarkable ‘Understanding Society’ longitudinal study data, with an unusually large dataset of UK households, and we have reason to access various Office for National Statistics datasets for our work, among many others. A reliable and intelligent search function for such data collections is therefore a tantalising prospect. We will follow developments in Wolfram|Alpha with interest.

[1] Alex Bellos, 12 February 2011, The Guardian, ‘Stephen Wolfram: Can he topple Google?’

[2] Wikipedia, ‘Wolfram Alpha’

