This section gives guidance on how best to use OGSA-DAI with regard to execution time and memory use. The main points analysed are how different forms of delivery, transformation, data aggregation and DBMS can affect performance. This is rounded out with some example performance figures.
Briefly, synchronous execution will not return the results from the workflow execution to the client until the whole workflow is complete. This is ideal for small workflows operating on small datasets or where control must only be returned to the client when a workflow is finished.
Synchronous execution is good for interactive clients which need constant communication with an OGSA-DAI server for small operations.
Synchronous operation is not ideal for executing large requests where the client is intended to be responsive or interim results are required. For long/large operations the better option is asynchronous operation. Depending on the scenario the type of return used can vary accordingly and some delivery scenarios will be suited to a synchronous operation.
If a synchronous request takes a long time then the client connection to the OGSA-DAI server (via the data request execution service) may time out. In such circumstances, the OGSA-DAI server will continue to execute the workflow submitted by the client. There is however no way for the client to access their data since there is no means of getting the ID of the request (as known to the server) back to the client. This is a further argument in favour of the use of asynchronous requests for workflows that may have long execution times.
Asynchronous operations will return immediately after the request is submitted to the OGSA-DAI server. This means that the results of the request are not returned directly to the client. The request must either deliver the results to some target or the request must be polled until the workflow execution is complete.
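The polling option described above can be sketched as follows. This is a minimal illustration only, not the OGSA-DAI client API: `get_status` is a hypothetical callable standing in for whatever call the client uses to fetch the request status, and the status names are assumptions made for the example.

```python
import time


def wait_for_completion(get_status, poll_interval=1.0, timeout=60.0, sleep=time.sleep):
    """Poll a request until it reaches a terminal state.

    get_status is a hypothetical callable returning the current status
    string of an asynchronous request; the names below are assumed.
    """
    terminal_states = {"COMPLETED", "ERROR", "TERMINATED"}  # assumed names
    waited = 0.0
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if waited >= timeout:
            raise TimeoutError("request did not finish within the timeout")
        # Wait before asking the server again so it is not flooded with polls.
        sleep(poll_interval)
        waited += poll_interval
```

The client stays free to do other work between polls, which is the main advantage over a synchronous call for long-running workflows.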
Asynchronous operation is ideal for working with large operations or operations which will take an extended period of time. The asynchronous operation can be combined with different delivery methods for different purposes depending on the scenario.
Asynchronous operation is best suited to large dataset delivery, either direct to the client via a data source or by external delivery via FTP or similar. This means that some of the interactive scenarios can be extended from synchronous operation to asynchronous with careful planning and design.
The request status (see Appendix K, OGSA-DAI request execution specification) represents the results of an execution of a workflow. This includes the status of activities in a workflow and possibly data. This can be used by clients to identify where any errors have occurred.
Delivering results via the request status (using the DeliverToRequestStatus activity) is a quick method to get results back to the client, but as the workflows and datasets increase in size, the transmission and processing of the request status becomes slower and more memory is required. Memory problems can occur with large datasets in the request status, even with improvements. Large datasets should be returned via another delivery method as the request status is not suitable for this purpose.
Support for parsing a request status is easy to integrate into a client application, with straightforward methods for accessing activity status and returned results. It is best used when working with small result sizes, such as those from highly specific queries. Large operations are not recommended.
An option which will allow better performance in the use of the request status to return data is the use of an aggregator. Using an aggregator allows the OGSA-DAI server to be more efficient in how it builds the request status for the execution request. This is discussed in Section 22.1.2, “Aggregation effects” below.
File Transfer Protocol (FTP) is one of the standard delivery options provided by OGSA-DAI. FTP is a commonplace method for moving files between different locations. OGSA-DAI can deliver via FTP to a given host provided the proper credentials for access have been provided.
FTP delivery is accomplished via use of an FTP activity and by providing this activity with the hostname and credentials for access to the host. The operation is limited by the throughput of OGSA-DAI and the file transfer rate. As OGSA-DAI uses a streaming model, throughput may be slightly less than that of a standard FTP client transferring a file. This is because there may be data collection and processing stages to be carried out, which slow the rate at which the data reaches the FTP activity. However, in example performance figures (see Section 22.1.4, “DBMS and performance” below), it was found that for a dataset of roughly 60 megabytes, the FTP transfer time was 30 seconds and on the same system, OGSA-DAI transformed the data to a CSV file of the same size with an increase in time of ten to fifteen percent. The exact overhead is determined by the number and type of resources accessed and the transformations carried out on the data.
FTP delivery is suitable for use in both synchronous and asynchronous workflows. However, it should be remembered that in synchronous workflows the execution will not return until the results have been transferred. If small files are being transferred this should not be a great problem, but in some cases with large datasets being transferred to other hosts, the scenario may require that control is not returned to the client until the file has been transferred. The choice of execution type should be determined by the scenario in use.
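To illustrate why a streaming model suits FTP delivery, the sketch below wraps an iterator of byte chunks in a file-like object so that data can be uploaded as it is produced, without first building the whole file in memory. It is an analogy written in Python, not the OGSA-DAI FTP activity. `IterStream` is a name invented for this example, and the host, credentials and file name in the commented usage are placeholders.

```python
import io


class IterStream(io.RawIOBase):
    """File-like view over an iterator of bytes chunks (illustrative helper)."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buf):
        # Pull the next non-empty chunk only when the previous one is used up.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # end of stream
        n = min(len(buf), len(self._leftover))
        buf[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


# Usage with the standard library FTP client (placeholder host and credentials):
#
#   from ftplib import FTP
#   with FTP("ftp.example.org") as ftp:
#       ftp.login("user", "password")
#       ftp.storbinary("STOR results.csv", IterStream(produce_csv_chunks()))
```

Because chunks are pulled only as the transfer consumes them, the upload rate is bounded by whichever is slower: the producer of the data or the transfer itself.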
FTP delivery is well suited for use with large datasets, exceeding 1,000,000 rows in a database. The streaming model of OGSA-DAI allows the system to cope without undue measures needing to be enacted. Large datasets pose problems to the request status delivery option, as discussed above, and FTP is a suitable alternative if the scenario permits.
In scenarios where external delivery is required, for instance storage outside of the OGSA-DAI server, FTP is most appropriate. Also for delivery to other systems, OGSA-DAI can deliver to a host which the other system can operate on. This is a possibility where OGSA-DAI forms part of a larger workflow for operating on the data.
The use of data source resources is a good delivery option for clients requiring large amounts of data without the need for FTP or other external delivery mechanisms. Results can be written to a data source resource for retrieval by a client or another OGSA-DAI server, and can be retrieved immediately or asynchronously.
Data source resources provide a streamable and hence scalable alternative to the request status when returning large data sets. Data source resources allow clients to obtain data in small manageable chunks rather than obtaining all the data in one chunk, as is the case when data is written to the request status. As shown in the example performance figures (see Section 22.1.4, “DBMS and performance” below), use of a data source yields performance on par with a request status for small query operations and becomes the best option for large queries.
A data source resource can be used in synchronous or asynchronous operation. Synchronous operation is only suitable for small datasets as the data source has a limited storage capability and the workflow will block when the data source is filled. Asynchronous operation allows the data to be retrieved quickly as the streaming OGSA-DAI model means that the data is written to the data source after it is processed, which can then be read by the client. This means that asynchronous operation on a data source can be extended to very large datasets, in excess of 1,000,000 database rows for example.
Aggregation (discussed in Section 22.1.2, “Aggregation effects” below) can help with the storage limitations on the data source by packing the data into fewer, larger entries. When working with large datasets this improves the performance of the overall workflow. The section on aggregation and the performance figures will show how aggregation can help in this case. Another factor in using a data source is the manner of retrieval. The data stream from a data source resource can be read in various ways: block-by-block, in groups of N blocks, or all blocks at once. The option chosen will depend on the scenario, but retrieving the whole collection of blocks at once is not feasible for large datasets, and better performance can be obtained by tuning the aggregator and retrieval sizes.
Data source resources are a flexible delivery option to other OGSA-DAI servers and clients, and with proper design can be used to create an interactive client using both synchronous and asynchronous operation depending on the setup of OGSA-DAI and its resources.
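The "group of N blocks" retrieval mode can be pictured with the following sketch. It models the data source as a plain Python iterable. `read_in_groups` is a name chosen for this example and is not part of any OGSA-DAI client library.

```python
from itertools import islice


def read_in_groups(blocks, n):
    """Yield successive lists of at most n blocks from an iterable of blocks."""
    if n <= 0:
        raise ValueError("n must be positive")
    iterator = iter(blocks)
    while True:
        group = list(islice(iterator, n))
        if not group:
            return  # stream exhausted
        yield group


# Block-by-block retrieval is the special case n = 1; "all at once" would mean
# reading every block into memory, which does not scale to large datasets.
for group in read_in_groups(["b1", "b2", "b3", "b4", "b5"], 2):
    print(group)
```

A larger `n` means fewer round trips but more memory per retrieval, which is the trade-off to tune together with the aggregator size.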
There are two aggregator activities provided with OGSA-DAI 3.0: CharArraysResize and ByteArraysResize. As indicated by the names, these are intended to resize arrays for more efficient transfer.
The purpose of the aggregators is to group large amounts of data items such as the character arrays of a WebRowSet into larger chunks to facilitate more efficient transfer of the data via the different delivery methods. The new array size can be specified in both aggregators allowing the performance to be tuned for a specific requirement. In the case of the character arrays, the size specified is the number of characters in an array and for byte arrays, it is the number of bytes in an array.
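The idea behind resizing character arrays can be sketched as below. This is an illustration of the regrouping technique only, assuming that output arrays are filled to exactly the requested size with a shorter final remainder. It is not the implementation of CharArraysResize, whose exact splitting behaviour is not described here.

```python
def resize_char_arrays(arrays, size):
    """Regroup an iterable of strings into chunks of `size` characters.

    The concatenated content is unchanged; only the chunk boundaries move.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    buffer = []      # pieces waiting to be emitted
    buffered = 0     # total characters currently buffered
    for array in arrays:
        buffer.append(array)
        buffered += len(array)
        while buffered >= size:
            joined = "".join(buffer)
            yield joined[:size]
            rest = joined[size:]
            buffer = [rest] if rest else []
            buffered = len(rest)
    if buffered:
        yield "".join(buffer)  # final, possibly shorter, chunk


chunks = list(resize_char_arrays(["abc", "def", "ghi"], 9))
```

The same approach applies to byte arrays by replacing strings with `bytes` and joining with `b"".join`.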
The aggregators have been provided to improve the performance of delivering data back to a client and to other targets. Before the aggregators were developed, the delivery mechanisms, especially delivery via the request status, were prone to slowing down due to the volume of individual arrays/values being passed to them. To counter this, the aggregator transforms the numerous arrays into a smaller number of larger arrays that are more manageable, without changing the data content. In tests, as will be explained, this aggregation has been shown to improve performance considerably, by up to around fifty percent in some scenarios.
Array chunk sizing is dependent on the scenario in use. The effect on performance is dependent on the size of the dataset in use, the method of delivery and any transformations of the data. The best performance improvement is seen when using request status delivery in conjunction with OGSA-DAI web services, although data sources accessed by web services also benefit from the resizing of the arrays. In both cases aggregation offsets the overheads incurred when converting the data to XML as part of a SOAP/HTTP server-client communication.
In our experiments an array chunk size of 5,000 characters gives a reasonable increase in performance. As previously stated, the transformation and delivery method also play a part: the improvement when delivering a CSV file over SOAP/HTTP from an OGSA-DAI web service to a client is more dramatic than that seen with some other delivery methods or transformations.
To understand why aggregation makes such a difference when the data is passed to the client using SOAP, we need to look at how the SOAP body is rendered. Each object stored in the request status or in a data source resource is rendered as a data element in the SOAP response that carries that data to the client. If three char[] objects containing the characters "abc", "def" and "ghi" are to be returned then these would be rendered as:
<data><charArray><![CDATA[abc]]></charArray></data>
<data><charArray><![CDATA[def]]></charArray></data>
<data><charArray><![CDATA[ghi]]></charArray></data>
This data would be much more efficiently rendered if the three char[] objects are aggregated to one char[]. They would then be rendered as:
<data><charArray><![CDATA[abcdefghi]]></charArray></data>
This aggregation leads to a smaller message size and also less XML processing at both the client and server. These factors together mean that aggregation can significantly reduce the overall roundtrip execution time.
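The size difference can be made concrete with a short sketch that renders character arrays as data elements and compares the two message bodies. The `render` function is written for this example and only mimics the element layout shown above. It is not the code OGSA-DAI uses to build its SOAP responses.

```python
def render(char_arrays):
    """Render each character array as one <data> element, as in the text above."""
    return "".join(
        f"<data><charArray><![CDATA[{chars}]]></charArray></data>"
        for chars in char_arrays
    )


separate = render(["abc", "def", "ghi"])   # three elements
combined = render(["abcdefghi"])           # one aggregated element

# Fixed markup cost of a single element (an element with no payload).
per_element_overhead = len(render([""]))

print(len(separate), len(combined), per_element_overhead)
```

The payload is identical in both cases. The saving is exactly the per-element markup multiplied by the number of elements removed, so it grows with the number of small arrays that are merged.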
When converting tuples to XML WebRowSet format the activity outputs a character array object for each incoming tuple. This is ideal when we wish to efficiently stream the data through OGSA-DAI activities but is not so ideal for sending the data back to the client. In our tests each row produced a character array with approximately 200 characters. With an aggregation size of 5,000 characters we will fit about 25 rows into each new character array and hence the number of elements in the XML rendering will be about 4% of what it was without aggregation.
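The figures quoted above follow from simple arithmetic, which the sketch below reproduces. The function name is chosen for this example, and the row count of 10,000 is an illustrative input matching one of the test sizes.

```python
def elements_after_aggregation(rows, chars_per_row, chunk_size):
    """Number of character arrays left after regrouping into chunk_size pieces."""
    total_chars = rows * chars_per_row
    # Ceiling division: a final partial chunk still needs its own element.
    return -(-total_chars // chunk_size)


rows = 10_000
before = rows  # one character array per row without aggregation
after = elements_after_aggregation(rows, chars_per_row=200, chunk_size=5_000)
ratio = after / before

print(before, after, f"{ratio:.0%}")
```

With these inputs each aggregated array holds 5,000 / 200 = 25 rows, so the element count falls to one twenty-fifth of its original value.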
Transformation activities are designed to take data in one form and change it in some way to another form. A common example in OGSA-DAI is to change the tuples output by relational activities (e.g. SQLQuery) to the XML WebRowSet format. These transforms are one of the factors which determine performance of OGSA-DAI, both in time taken to operate on the data and also the size of the output affecting transfer time.
The two main factors which will affect performance in the transform to the overall OGSA-DAI performance are the output size and the complexity of the transform.
The more verbose the output of a transform activity, the longer it will take to transfer the results of that activity. This is important when designing workflows, as proper planning and choice of an appropriate transformation will result in more efficient use of the OGSA-DAI software. The key thing is to choose a format that is best for the scenario at hand. If a scenario requires a lot of numerical data for analysis in a mathematical package then a CSV format is likely to be of more use than a WebRowSet, and is also far simpler and smaller.
The complexity of the transform will also affect performance. An example of this is the XSLTransform activity: the more complex the transformation, the greater its cost in execution time. It is therefore important to ensure that any XSL transformation to be applied is well designed and as efficient as possible.
Block activities can be used to operate on the blocks of data passed around OGSA-DAI without needing to know about the internal data of those blocks. These activities can be regarded as general organisation actions. Examples include ListConcatenate, Split and Tee.
It is important to note that in most cases the performance of a block activity is determined by the slowest input to the activity. This means that if multiple data resources are connected to a block activity, or indeed to any activity, then the slowest of those resources will generally define the overall performance of the workflow.
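The effect of output verbosity can be illustrated by serialising the same rows in two ways and comparing sizes. The XML below is a simplified, WebRowSet-like rendering of the row data only. It omits the properties and metadata sections of a real WebRowSet document and is an assumption made for this example.

```python
import csv
import io
from xml.sax.saxutils import escape


def to_csv(rows):
    """Serialise rows as CSV text."""
    out = io.StringIO()
    csv.writer(out).writerows(rows)
    return out.getvalue()


def to_rowset_like_xml(rows):
    """Serialise rows in a simplified, WebRowSet-like XML layout (data only)."""
    parts = []
    for row in rows:
        columns = "".join(
            f"<columnValue>{escape(str(value))}</columnValue>" for value in row
        )
        parts.append(f"<currentRow>{columns}</currentRow>")
    return "<data>" + "".join(parts) + "</data>"


rows = [(i, i * 0.5, i * 0.25) for i in range(1000)]
csv_size = len(to_csv(rows))
xml_size = len(to_rowset_like_xml(rows))
print(csv_size, xml_size)
```

For short numerical values the markup around each value dominates the XML output, which is why a compact format such as CSV transfers faster when the consumer does not need the extra structure.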
One of the external factors which can affect performance when executing a workflow is the data resource. OGSA-DAI can access a variety of data resources, and the speed of response from a database will affect the overall performance. OGSA-DAI has been performance tested with all its supported databases (see Chapter 3, Data resource products). eXist was slowest by far but this was expected.
A more realistic comparison is between the five relational DBMSs: DB2, MySQL, Oracle, PostgreSQL and SQLServer. These were all tested using the same query and code. The results are presented below.
The DBMSs can all be tuned, and the JDBC drivers have various options which can be set when the resource is exposed via OGSA-DAI. Tuning the DBMS and its host system for the specific data needs should give a performance boost, which is passed on to OGSA-DAI in the form of faster response times to its requests.
To get the best performance out of relational DBMS access in OGSA-DAI, there are a few things that can be done. The first is to ensure that the DBMS and host are correctly set up and tuned for the application, including the database organisation. Also ensure that the latest compatible, stable JDBC drivers are available to OGSA-DAI and that streaming of results from the database is enabled.
One of the main parameters which can be altered to improve performance is the JDBC fetch size, which controls how the results from the database are streamed and into how many batches they are broken up.
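The effect of a fetch size can be pictured with the sketch below, which uses Python's built-in `sqlite3` module as a stand-in for a JDBC connection. In JDBC the corresponding hint is set with `Statement.setFetchSize`. Here the batching is done explicitly with `fetchmany`. The table and the `stream_rows` helper are invented for this example.

```python
import sqlite3


def stream_rows(cursor, fetch_size):
    """Yield rows one at a time while fetching them in batches of fetch_size."""
    while True:
        batch = cursor.fetchmany(fetch_size)
        if not batch:
            return  # no more rows
        yield from batch


# Small in-memory database used purely for illustration.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE readings (id INTEGER)")
connection.executemany(
    "INSERT INTO readings VALUES (?)", [(i,) for i in range(10)]
)

cursor = connection.execute("SELECT id FROM readings ORDER BY id")
rows = list(stream_rows(cursor, fetch_size=4))
connection.close()
```

A small fetch size keeps memory use low but costs more round trips to the database. A large one does the opposite, so the best value depends on row size and network latency.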
The tests for 5000 rows were aggregated where stated using a size of 5000 characters.
The tests for 10000 rows were aggregated where stated using a size of 5000 characters.
The tests were aggregated where stated using a size of 5000 characters.
These tests were carried out against a localhost OGSA-DAI server with network connections to the data resources.
The table provides the main workflow path that the tests follow. Each test was run one hundred times to reduce the margin of error in determining the mean time for an operation.
| Test Name | Brief Description |
|---|---|
| SQL Query | Standard SQL Query to WebRowSetCharArrays to RequestStatus |
| SQL Query Aggregate | Standard SQL Query to WebRowSetCharArrays to RequestStatus Aggregate |
| SQL Query TupleToCSV | Standard SQL Query to CSV to RequestStatus |
| SQL Query TupleToCSV Aggregate | Standard SQL Query to CSV to RequestStatus Aggregate |
| SQL Query FTP | Standard SQL Query to WebRowSetCharArrays to FTP |
| SQL Query FTP Aggregate | Standard SQL Query to WebRowSetCharArrays to FTP Aggregate |
| SQL Query CSV FTP | Standard SQL Query to CSV to FTP |
| SQL Query CSV FTP Aggregate | Standard SQL Query to CSV to FTP Aggregate |
| SQL Query DataSource Aggregate | Standard SQL Query to WebRowSetCharArrays to Datasource Aggregate |
| SQL Query CSV DataSource Aggregate | Standard SQL Query to CSV to Datasource Aggregate |
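The practice of repeating each test one hundred times and taking the mean can be sketched with a small timing harness. This is a generic illustration and not the harness used to produce the figures in this section. `operation` is any zero-argument callable standing in for the submission of one workflow.

```python
import statistics
import time


def time_operation(operation, runs=100):
    """Run operation `runs` times; return (mean, standard deviation) in seconds."""
    if runs < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


# Stand-in workload; a real test would submit a workflow and wait for the result.
mean_seconds, stdev_seconds = time_operation(lambda: sum(range(10_000)), runs=100)
print(f"mean {mean_seconds:.6f}s, stdev {stdev_seconds:.6f}s")
```

Reporting the standard deviation alongside the mean shows how much individual runs vary, which helps when comparing delivery methods whose means are close.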