By Gillian Law
7 March 2005
SOCIAL scientists trying to combine different UK government or commercial datasets face a frustrating battle. It should be simple to link UK census data with information from local councils and other Office for National Statistics (ONS) datasets - but in reality it's a difficult process because many datasets are collected using different geographical units, which change over time.
ConvertGrid, an ESRC-funded grid-based demonstrator service built using OGSA-DAI, was devised to show how grid technologies could automate complex workflows, facilitate integrated use of multiple geo-referenced datasets and stimulate new forms of research.
"If someone wanted to look at Experian data on house prices in London, say, and compare that to ONS figures on educational attainment in the area, they'd find that the geographical units of collection are different, and it's impossible to compare them properly," says Linda Mason, an IT specialist at Manchester University and co-developer of ConvertGrid. Currently, they would have to collect the data through different interfaces, convert each dataset individually to a common geography and then put them back together.
To make data for different geographies comparable, ESRC funded a project to exploit the ONS All Fields Postcode Directory (AFPD). Postcodes are a very small unit of geography, so they fit well into any larger unit. The directory lists every postcode in the country and maps each one to many other geographies, including census, health and electoral ones. A web application was developed that allowed users to input a single set of data they had already collected and convert it from the source geography to a chosen target geography.
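The idea of using postcodes as a bridge between two geographies can be sketched as follows. This is a minimal illustration, not the project's actual convert algorithm: it assumes that each source unit's count is spread evenly over the postcodes it contains, and the directory extract, unit codes and the function name `convert_counts` are all made up for the example.

```python
from collections import defaultdict

# Hypothetical extract of a postcode directory:
# postcode -> (source unit, target unit). All codes are invented.
DIRECTORY = {
    "M139PL": ("WARD_A", "DISTRICT_X"),
    "M139PT": ("WARD_A", "DISTRICT_X"),
    "M145XX": ("WARD_A", "DISTRICT_Y"),
    "M146YY": ("WARD_B", "DISTRICT_Y"),
}


def convert_counts(source_counts, directory):
    """Re-apportion counts from source units to target units.

    Assumption: a source unit's count is shared equally among its
    postcodes, so each target unit receives a share proportional to
    the number of postcodes it has in common with the source unit.
    """
    per_source = defaultdict(int)   # postcodes in each source unit
    overlap = defaultdict(int)      # postcodes in each (source, target) pair
    for source_unit, target_unit in directory.values():
        per_source[source_unit] += 1
        overlap[(source_unit, target_unit)] += 1

    target_counts = defaultdict(float)
    for (source_unit, target_unit), n in overlap.items():
        if source_unit in source_counts:
            count = source_counts[source_unit]
            target_counts[target_unit] += count * n / per_source[source_unit]
    return dict(target_counts)


converted = convert_counts({"WARD_A": 300, "WARD_B": 50}, DIRECTORY)
```

Here two of WARD_A's three postcodes fall in DISTRICT_X, so two thirds of its count goes there and the rest goes to DISTRICT_Y. A production service would more plausibly weight postcodes by population or address counts rather than treating them as equal.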
"So, building on this, we developed ConvertGrid, a grid-based version that lets you do it all, in 'one seamless operation', as they say! Using OGSA-DAI, ConvertGrid, in a simple six-step process, can pull data from several different grid-enabled datasets in different native geographies and convert them to a common geography that allows you to compare them. You can even visualise your results. In the demonstrator project, mapping is only available if your target geography is 1991 Census wards. For the datasets we used a subset of the 1991 Census data at ward level geography, the Experian 2000 dataset at postcode level geography and a small subset of ONS Neighbourhood Statistics datasets at Local Authority District and Ward98 levels of geography," she says.
For example, one screenshot from the demonstrator shows the relationship between house prices (Experian dataset) and the percentage of young people aged 16-19 entering university (ONS datasets) for the London area, at 1991 Census ward level.
At the start of the project, OGSA-DAI did not connect to MS SQL Server, so a subset of an existing Census dataset had to be transferred to Oracle on the National Grid Service (the other databases were created from scratch). This presented a number of challenges: exact duplication was impossible because of minor inherent differences between the two database systems. Fortunately, OGSA-DAI has since been extended to support MS SQL Server. Datasets have to continue to support existing applications, so it is important that they can be "grid-enabled" in situ.
Pascal Ekin did much of the development work for ConvertGrid. He says that while OGSA-DAI was slow in the initial versions of ConvertGrid, later versions are more efficient.
A single sign-on mechanism is a fundamental feature of grid applications, but the Census and Experian datasets each require their own authorisation. The ConvertGrid team got around this problem by stitching a login for Athens, the UK academic access management system, onto the front.
Combining datasets is inherently problematic, even leaving aside the methodological issues, which were not within the remit of this project. Census data is structured in a complex way, and the structure is different for each census. The convert algorithm expects "counts", i.e. numbers of persons, not indices or percentages, a few of which appear in the Experian dataset. Postcodes vary in format: the Experian dataset used a fixed format with a blank space between the outward and inward codes, whereas the convert AFPD database had surplus spaces removed. UK datasets also vary in their coverage of Scotland, Wales and Northern Ireland.
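The postcode format mismatch is the kind of problem that can be handled by normalising both sides before joining. A minimal sketch, assuming it is enough to remove all whitespace and uppercase the result (the function name `normalise_postcode` and the sample postcodes are illustrative, not taken from the project):

```python
def normalise_postcode(raw):
    """Strip all whitespace and uppercase, so that a spaced,
    fixed-format postcode and a compact one compare equal."""
    return "".join(raw.split()).upper()


# A fixed-format value with a space, and a compact directory-style value.
fixed_format = "M13 9PL"
compact = "M139PL"
same_postcode = normalise_postcode(fixed_format) == normalise_postcode(compact)
```

Normalising at load time means the join key is consistent regardless of which dataset a record came from.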
"Metadata is critical," says Mason. "I think to grid-enable social science datasets, they need to be much more self-describing than they are at the moment. When you write web applications you understand the structure of the database and often build metadata into the web interface. However, when a dataset may need to be combined with others, certain details, such as whether the number is a simple count or a percentage, must be included within the database itself. This is an enormous job for a complex dataset such as the 2001 Census."
Ekin is now working on an ESRC-funded project called GEMS, which takes ConvertGrid a step further and aims to grid-enable the entire 2001 Census aggregate statistics.
The new version of GEMS will address the metadata issue by including the metadata in the XML returned to users in response to queries, Ekin says.
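What a self-describing response might look like can be sketched with Python's standard `xml.etree.ElementTree` module. The element and attribute names here (`response`, `value`, `unit`, `variable`, `measure`) are hypothetical and are not the GEMS schema; the point is only that each value carries a flag saying whether it is a count or a percentage, so that a client can tell which values are safe to pass to a count-based conversion.

```python
import xml.etree.ElementTree as ET


def build_response(rows):
    """Serialise rows of (unit_code, variable, measure, value) to XML,
    attaching the measure type to every value. Schema is hypothetical."""
    root = ET.Element("response")
    for unit_code, variable, measure, value in rows:
        item = ET.SubElement(
            root, "value", unit=unit_code, variable=variable, measure=measure
        )
        item.text = str(value)
    return ET.tostring(root, encoding="unicode")


def convertible_values(xml_text):
    """Return only the values flagged as counts."""
    root = ET.fromstring(xml_text)
    return [
        (v.get("unit"), v.get("variable"), float(v.text))
        for v in root.iter("value")
        if v.get("measure") == "count"
    ]


xml_text = build_response([
    ("WARD_A", "residents", "count", 300),
    ("WARD_A", "degree_rate", "percentage", 12.5),
])
counts = convertible_values(xml_text)
```

With the measure type travelling alongside the data, the client does not need prior knowledge of the database structure to reject percentages.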
"You need to have intimate knowledge of the way data is stored and presented to be able to make any sense of census data," he says. "So we're going to use some software that was developed by MIMAS, a web interface to the aggregate statistics from the Census, that allows you to drill down through the data. What we want to do is wrap that up in an API so that people can access the census data through OGSA-DAI, in a simple manner. That's what we're working on now," he says.
ConvertGrid was really just a demonstrator project to show what can be done, Mason says. "It's a very simple, practical application, and we showed it could work and highlighted the potential of the grid."
The ConvertGrid project, which ran for one year from May 2004, was funded by ESRC (Economic and Social Research Council).