About
OGSA-DAI is an innovative solution for distributed data access and management.
OGSA-DAI allows data resources (e.g. relational or XML databases, files or web services) to be federated and accessed via web services on the web or within grids or clouds. Via these web services, data can be queried, updated, transformed and combined in various ways.
OGSA-DAI is about sharing data, whether it be within a single organisation, between a group of partners or with the public. By sharing data we can identify, understand and exploit complex interactions between disparate variables and so convert data into information. This in turn can help us increase our scientific knowledge or our business advantage.
Here, we provide a high-level overview of OGSA-DAI's novel approach to distributed data management.
OGSA-DAI: An Overview
To introduce the importance of sharing distributed data we'll look at a key motivating example on the early warning of disease outbreaks.
- Detecting possible disease outbreaks
- How distributed data complicates analysis
- How to analyse distributed data
- Avoiding bloated clients - why a server is useful
- Making distributed analysis easier with query processing
- Converting data into a useful form and delivering it where it's needed
- OGSA-DAI
- A demo
- Other examples of OGSA-DAI at work
- Summary
Detecting possible disease outbreaks
Every time we visit a health centre or hospital, or a doctor visits us at home, the visit is recorded in our medical records. These records include our personal details (e.g. name and address) and medical details (e.g. visits, symptoms and treatments).
Now, suppose we wanted to detect whether an outbreak of swine flu was imminent. One way we could detect this would be to look at this patient data to see if the number of patients displaying swine flu symptoms exceeded some critical threshold within a given region.
So, imagine we have a region. Recording the locations of patients exhibiting swine flu symptoms might show us these occurrences across our region. If our critical threshold was 10 then the cluster of 12 points in the centre would give us cause for concern.
This is not too difficult. But it does assume that all the patient data is readily available and can be easily accessed and analysed. In reality things aren't this straightforward.
How distributed data complicates analysis
Our region may be covered by a number of health centres whose catchment areas overlap.
Showing patients with swine flu symptoms for the yellow health centre might give us this. There is no cluster greater than our threshold of 10.
And for the green health centre we have these occurrences. Again there is no cluster greater than our threshold.
This is because our cluster is within the area where the catchment areas overlap. Only by combining the data from both health centres do we see the true picture, that there is a cluster of patients with swine flu symptoms that is above our threshold.
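The effect of the overlapping catchment areas can be shown with a small sketch. The post codes, the per-centre counts and the function name are invented for illustration; only the threshold of 10 and the combined cluster of 12 come from the example above.

```python
from collections import Counter

THRESHOLD = 10  # critical threshold from the example above

# Hypothetical counts of patients with swine flu symptoms per post code,
# as recorded separately by each health centre.
yellow_centre = Counter({"EH1": 7, "EH2": 2})
green_centre = Counter({"EH1": 5, "EH3": 3})


def clusters_above(counts, threshold=THRESHOLD):
    """Return the post codes whose patient count exceeds the threshold."""
    return sorted(pc for pc, n in counts.items() if n > threshold)


# Neither health centre sees a cluster on its own...
print(clusters_above(yellow_centre))  # []
print(clusters_above(green_centre))   # []

# ...but adding the counts together reveals 12 patients in the overlap.
combined = yellow_centre + green_centre
print(clusters_above(combined))       # ['EH1']
```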
How to analyse distributed data
So, if our data is held in multiple sources, or databases, there are a number of activities we need to carry out to identify clusters of patients.
- Firstly, we need to get the data from each health centre on the numbers of patients they have recorded who have swine flu symptoms together with their post codes.
- We then need to collect, or union, this data together.
- Then we can get the final total counts of occurrences for each post code.
These activities are shown here along with example queries expressed in the popular query language SQL. These ask each database for the number of patients with a "FLU" symptom and output the total counts of patients per post code.
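As a rough sketch, the three activities might look like this using Python's built-in sqlite3 module. The table name `patients`, its columns and the sample rows are assumptions made for illustration, and two in-memory SQLite databases stand in for the health centres' real databases.

```python
import sqlite3
from collections import Counter

# Query sent to each health centre: flu patients counted per post code.
PER_CENTRE_QUERY = """
    SELECT postcode, COUNT(*) AS patients
    FROM patients
    WHERE symptom = 'FLU'
    GROUP BY postcode
"""


def make_centre_db(rows):
    """Create an in-memory database standing in for one health centre."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE patients (name TEXT, postcode TEXT, symptom TEXT)")
    db.executemany("INSERT INTO patients VALUES (?, ?, ?)", rows)
    return db


yellow = make_centre_db(
    [("Ann", "EH1", "FLU"), ("Bob", "EH1", "FLU"), ("Cat", "EH2", "COUGH")]
)
green = make_centre_db([("Dan", "EH1", "FLU"), ("Eve", "EH3", "FLU")])

# 1. Get the flu counts per post code from each health centre.
per_centre = [db.execute(PER_CENTRE_QUERY).fetchall() for db in (yellow, green)]

# 2. Union the results and 3. total the counts for each post code.
totals = Counter()
for rows in per_centre:
    for postcode, patients in rows:
        totals[postcode] += patients

print(sorted(totals.items()))  # [('EH1', 3), ('EH3', 1)]
```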
Avoiding bloated clients - why a server is useful
We could write an application to do this, a client that would get the data from the databases, combine it, and then visualise it. The client would need to handle the fact that the databases are located at different sites, may use different data formats or be different products. Each database may also have its own way of authenticating users and logging them in. However, if there were a number of clients and a health centre changed its usernames and passwords or its database product, then all the clients would need to be updated.
So, it can be useful to introduce a server. The server can manage the connections with each database. So if a health centre changes its database, only the server needs to be updated. The client only needs to connect to the server and so would be protected from such changes.
The server would also manage execution of activities on the client's behalf. All the client needs to do is tell the server what activities it wants the server to run.
There is another reason for introducing a server: a client might only be interested in a very small subset of the data. Using a server with large amounts of processing power means that the server can access and filter large amounts of raw data on the client's behalf and return to the client only what it needs. Clients can then be very lightweight.
If we have our server then our client would now just tell the server to carry out these activities on its behalf. The server would carry out these activities and return the data to the client.
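The idea of a server that owns the database connections can be sketched as follows. This is a toy illustration, not OGSA-DAI's API: the `DataServer` class, its methods, the resource names and the `patients` table are all hypothetical.

```python
import sqlite3


class DataServer:
    """Toy server that owns the database connections on behalf of clients."""

    def __init__(self):
        self._resources = {}  # resource name -> open database connection

    def register(self, name, connection):
        # Connection details live only here. Registering a new connection
        # under an existing name swaps the database without touching clients.
        self._resources[name] = connection

    def run_query(self, resource_names, sql):
        """Run the same query on each named resource and return all rows."""
        rows = []
        for name in resource_names:
            rows.extend(self._resources[name].execute(sql).fetchall())
        return rows


def make_db(rows):
    """Create an in-memory database standing in for one health centre."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE patients (postcode TEXT, symptom TEXT)")
    db.executemany("INSERT INTO patients VALUES (?, ?)", rows)
    return db


server = DataServer()
server.register("yellow", make_db([("EH1", "FLU"), ("EH2", "COUGH")]))
server.register("green", make_db([("EH1", "FLU"), ("EH3", "FLU")]))

# The client names the resources and the work; it never handles a login.
FLU_QUERY = "SELECT postcode FROM patients WHERE symptom = 'FLU' ORDER BY postcode"
flu_rows = server.run_query(["yellow", "green"], FLU_QUERY)
print(flu_rows)  # [('EH1',), ('EH1',), ('EH3',)]
```

If the green health centre moved to a different database, only the `register` call on the server would change; the client's `run_query` call would stay the same.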
We have three activities to get the information we need from the databases: two to query the databases and one to combine and summarise the data. It would be easier for the client if the databases could be made to look like a single database instead of two separate databases. Then the client would only need to request the execution of one query activity.
Making distributed analysis easier with query processing
Furthermore, given the expressive power of query languages, the client could express how the data should be combined and summarised within its query rather than request a separate activity for this. In other words, it would be easier if the client could just specify a single query, and the server took care of determining what queries need to be sent to each database and what additional activities need to be run to answer the client's query.
This is called distributed query processing (DQP).
With distributed query processing the activities that the client needs to tell the server to do become much simpler.
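To give a feel for what a single query across two databases looks like, here is a sketch using SQLite's `ATTACH DATABASE`. This is only an analogy: it does not show how OGSA-DAI's distributed query processor works internally, and the schema name `green`, the `patients` table and the sample rows are assumptions made for illustration.

```python
import sqlite3

# One connection with a second in-memory database attached, standing in
# for two separate health centre databases.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS green")
conn.execute("CREATE TABLE main.patients (postcode TEXT, symptom TEXT)")
conn.execute("CREATE TABLE green.patients (postcode TEXT, symptom TEXT)")
conn.executemany(
    "INSERT INTO main.patients VALUES (?, ?)",
    [("EH1", "FLU"), ("EH1", "FLU"), ("EH2", "COUGH")],
)
conn.executemany(
    "INSERT INTO green.patients VALUES (?, ?)",
    [("EH1", "FLU"), ("EH3", "FLU")],
)

# The client states what it wants in one query that cites both tables;
# combining and summarising are expressed in the query itself.
SINGLE_QUERY = """
    SELECT postcode, COUNT(*) AS patients
    FROM (
        SELECT postcode, symptom FROM main.patients
        UNION ALL
        SELECT postcode, symptom FROM green.patients
    )
    WHERE symptom = 'FLU'
    GROUP BY postcode
    ORDER BY postcode
"""

result = conn.execute(SINGLE_QUERY).fetchall()
print(result)  # [('EH1', 3), ('EH3', 1)]
```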
Converting data into a useful form and delivering it where it's needed
If the client is visualising the data it will need to transform it into a suitable format, for example a JPG binary image file or a document written in the geographical markup language KML. As this is just a data transformation why can't our server handle that too?
And, instead of delivering the visualisation data to the client, why doesn't the server just hold it until the client is ready for it? The server could return a URL which tells the client where to get the data from. This would allow the client to do other things while the activities are running, and would also let other clients access the results without having to rerun what might have been a time-consuming query. They can just get the results from the URL.
So, adopting this solution gives us a new set of activities where the server now converts the data into a visual format, stores it on the server and returns a URL to the client from which they can get their data later.
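A minimal sketch of these last steps, assuming the counts have already been mapped to latitudes and longitudes. The base URL, the in-memory result store and the function names are hypothetical; a real server would hold results somewhere durable and serve them over HTTP.

```python
import uuid
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"
BASE_URL = "http://server.example.org/results/"  # hypothetical server address

_results = {}  # result id -> KML text, held on the "server"


def to_kml(points):
    """Convert (label, latitude, longitude) tuples into a KML document string."""
    ET.register_namespace("", KML_NS)  # serialise KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for label, lat, lon in points:
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = label
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML orders coordinates as longitude,latitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


def store_and_get_url(kml_text):
    """Hold the result on the server and return a URL the client can use later."""
    result_id = uuid.uuid4().hex
    _results[result_id] = kml_text
    return BASE_URL + result_id


def fetch(url):
    """Stand-in for a client (any client) fetching the stored result by URL."""
    return _results[url.rsplit("/", 1)[-1]]


url = store_and_get_url(to_kml([("EH1: 12 patients", 55.95, -3.19)]))
print(url.startswith(BASE_URL))  # True
print("<coordinates>-3.19,55.95</coordinates>" in fetch(url))  # True
```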
Now, how might the server execute our activities?
OGSA-DAI
OGSA-DAI is a framework that allows groups of activities like this to be executed, activities that involve accessing, updating, combining, transforming or delivering data that could be distributed across a number of databases and held in various formats.
It consists of a workflow executor which executes groups of activities, or, as they're called in OGSA-DAI, workflows.
It also has a distributed query processor which allows a single query to cite tables in multiple databases. It will automatically parse this query and output a query plan which specifies the workflows to execute to get the required data from each database.
Data is streamed through OGSA-DAI and different activities work on different parts of the data stream at the same time. For example data retrieved from a database by a query may be transformed and delivered while other data from the same query is still being retrieved. This leads to more efficient execution times and reduced memory overheads.
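The streaming idea can be illustrated with Python generators, where each stage handles one item at a time instead of waiting for the whole result set. This is only a sketch of the principle: the stage names are made up, and generators interleave the stages within a single thread rather than running them at the same time.

```python
def query_rows(rows, log):
    """Stand-in for a database query that yields rows one at a time."""
    for row in rows:
        log.append(f"retrieved {row}")
        yield row


def transform(stream, log):
    """Transform each row as soon as it arrives."""
    for row in stream:
        log.append(f"transformed {row}")
        yield row.upper()


def deliver(stream, log):
    """Deliver each transformed item and return everything delivered."""
    delivered = []
    for item in stream:
        log.append(f"delivered {item}")
        delivered.append(item)
    return delivered


log = []
delivered = deliver(transform(query_rows(["eh1", "eh2"], log), log), log)

# The first row is delivered before the second row is even retrieved.
for entry in log:
    print(entry)
# retrieved eh1
# transformed eh1
# delivered EH1
# retrieved eh2
# transformed eh2
# delivered EH2
```

Because only one row is in flight at a time, memory use stays small however many rows the query returns.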
OGSA-DAI is a free, open source, 100% Java product licensed under the flexible Apache 2.0 licence.
It is independent of any specific application area, having been designed to be highly customisable to satisfy data management requirements in a wide range of fields.
A demo
We have produced a demonstrator that implements our early warning scenario. In this demo:
- A query runs across two health databases, and a third database is used to map post codes to latitudes and longitudes.
- The data is converted into KML, the geographical markup language.
- The client uses Google Maps to visualise the KML.
- The demonstration server provides the client's web pages and manages interactions with the OGSA-DAI server.
Please feel free to visit the demo.
In this demonstration the only application-specific components, that is, the only components specifically relating to health data, are the databases and the client code.
The workflow executor, distributed query processor, database query, and KML conversion components are all standard OGSA-DAI components, independent of health or any specific application domain.
Other examples of OGSA-DAI at work
OGSA-DAI has been used in a number of application areas.
Analysing transport data
FirstDIG was a project that involved EPCC and FirstGroup plc, the UK's largest transport operator. They had data spread across their departments. This data included:
- Customer contact, for example questions, compliments and complaints from customers.
- Daily vehicle mileages for bus services.
- Daily tickets sold and the money taken for the bus services.
- Schedule adherence, recorded via a satellite tracking system that records whether a bus is arriving at and departing from a bus stop on time.
This data was held in relational databases and COBOL files.
This was in OGSA-DAI's early days so data integration was done by the client, but nowadays it could be done on the server. Here, OGSA-DAI served as a single access point for the databases. The client didn't need to handle individual database connections, locations, logins or passwords. This allowed the data to be easily mined to see how late buses would affect ticket revenues and complaints, for example.
Visualising social sciences data
SEE-GEO was a project that looked at SEcurE access to GEOspatial services. One aspect of this work was combining census data and borders data. In SEE-GEO the data sources were not traditional databases but web services.
Using OGSA-DAI, the project constructed a portal which allowed a user to submit a query, for example "show me the population distribution of Leeds according to census output areas".
- The query parameters would populate an OGSA-DAI workflow. This workflow would get the relevant census data and then the relevant data on geographical regions (the borders data).
- It would then join these, producing a set of geographical regions annotated with the census data.
- This data would be transformed into an image file by the use of an image creation service - converting the annotated regions into a set of shaded polygons.
- The image would then be delivered to a map server and the URL of the image returned to the portal.
- The portal would then get the image and display it to the user.
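The join step in this workflow can be sketched in a few lines. The output area codes, population figures and polygon coordinates below are invented for illustration, as is the function name; the real workflow obtained both data sets from web services.

```python
# Hypothetical census counts and border polygons, keyed by output area code.
census = {"OA1": 310, "OA2": 275, "OA3": 198}
borders = {
    "OA1": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "OA2": [(1, 0), (2, 0), (2, 1), (1, 1)],
}


def join_on_area(census, borders):
    """Join the two sources on area code, keeping areas present in both."""
    return [
        {"area": area, "population": census[area], "polygon": borders[area]}
        for area in sorted(census.keys() & borders.keys())
    ]


# Each record is a geographical region annotated with its census data,
# ready to be turned into a shaded polygon by an image creation step.
annotated = join_on_area(census, borders)
print([record["area"] for record in annotated])  # ['OA1', 'OA2']
```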
Summary
OGSA-DAI is an innovative distributed data management product that contributes to a future in which researchers and business users move away from technical issues such as data location, data structure, data transfer and the ins and outs of data integration and instead focus on application-specific data analysis and processing.
OGSA-DAI has been under development since 2002 and is currently an open source project managed by EPCC, The University of Edinburgh. It has contributed to, and continues to contribute to, the success of projects and organisations worldwide.