by Liew, CS, Atkinson, MP, van Hemert, JI and Han, L
Abstract:
Modern scientific collaborations have opened up the opportunity of solving complex problems that involve multi-disciplinary expertise and large-scale computational experiments. These experiments usually involve large amounts of data that are located in distributed data repositories running various software systems, and managed by different organisations. A common strategy to make the experiments more manageable is executing the processing steps as a workflow. In this paper, we look into the implementation of fine-grained data-flow between computational elements in a scientific workflow as streams. We model the distributed computation as a directed acyclic graph where the nodes represent the processing elements that incrementally implement specific subtasks. The processing elements are connected in a pipelined streaming manner, which allows task executions to overlap. We further optimise the execution by splitting pipelines across processes and by introducing extra parallel streams. We identify performance metrics and design a measurement tool to evaluate each enactment. We conducted experiments to evaluate our optimisation strategies with a real-world problem in the Life Sciences: EURExpress-II. The paper presents our distributed data-handling model, the optimisation and instrumentation strategies and the evaluation experiments. We demonstrate linear speed-up and argue that this use of data-streaming to enable both overlapped pipeline and parallelised enactment is a generally applicable optimisation strategy.
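As a rough illustration of the streaming model the abstract describes, the Python sketch below connects processing elements over bounded queues so that their executions overlap, and replicates one element as k parallel streams. This is not the authors' implementation; the names (pe, run_demo, END) and the toy extract/classify stages are assumptions made for illustration only.

    import threading
    import queue

    END = object()  # end-of-stream marker

    def pe(fn, inq, outq, ends_expected=1, ends_to_emit=1):
        # One processing element: stream items from inq through fn into outq.
        # ends_expected: how many upstream workers feed inq.
        # ends_to_emit: how many END markers to forward, one per parallel
        # downstream worker (fan-out).
        seen = 0
        while seen < ends_expected:
            item = inq.get()
            if item is END:
                seen += 1
                continue
            outq.put(fn(item))
        for _ in range(ends_to_emit):
            outq.put(END)

    def run_demo(items, k=3):
        # source -> extract -> k parallel "classify" streams -> collector
        q1, q2, q3 = (queue.Queue(maxsize=8) for _ in range(3))
        extract = lambda x: x * 2    # stand-in for a cheap PE
        classify = lambda x: x + 1   # stand-in for the costly PE to replicate

        workers = [threading.Thread(target=pe, args=(extract, q1, q2, 1, k))]
        workers += [threading.Thread(target=pe, args=(classify, q2, q3))
                    for _ in range(k)]
        for w in workers:
            w.start()

        for item in items:           # the data source streams items in
            q1.put(item)
        q1.put(END)

        results, seen = [], 0
        while seen < k:              # drain until every parallel stream ends
            item = q3.get()
            if item is END:
                seen += 1
            else:
                results.append(item)
        for w in workers:
            w.join()
        return results

    print(sorted(run_demo(range(10))))  # [1, 3, 5, ..., 19]

Because every stage starts consuming as soon as the first item arrives, the pipeline overlaps stage executions, and raising k adds parallel streams through the costly stage, which is the shape of the optimisation the paper evaluates.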
Reference:
Towards Optimising Distributed Data Streaming Graphs using Parallel Streams (Liew, CS, Atkinson, MP, van Hemert, JI and Han, L), In Data Intensive Distributed Computing, ACM, 2010.
Bibtex Entry:
@inproceedings{LAHH2010,
abstract = {Modern scientific collaborations have opened up the opportunity of solving complex problems that involve multi-disciplinary expertise and large-scale computational experiments. These experiments usually involve large amounts of data that are located in distributed data repositories running various software systems, and managed by different organisations. A common strategy to make the experiments more manageable is executing the processing steps as a workflow. In this paper, we look into the implementation of fine-grained data-flow between computational elements in a scientific workflow as streams. We model the distributed computation as a directed acyclic graph where the nodes represent the processing elements that incrementally implement specific subtasks. The processing elements are connected in a pipelined streaming manner, which allows task executions to overlap. We further optimise the execution by splitting pipelines across processes and by introducing extra parallel streams. We identify performance metrics and design a measurement tool to evaluate each enactment. We conducted experiments to evaluate our optimisation strategies with a real-world problem in the Life Sciences: EURExpress-II. The paper presents our distributed data-handling model, the optimisation and instrumentation strategies and the evaluation experiments. We demonstrate linear speed-up and argue that this use of data-streaming to enable both overlapped pipeline and parallelised enactment is a generally applicable optimisation strategy.},
author = {Liew, CS and Atkinson, MP and van Hemert, JI and Han, L},
booktitle = {Data Intensive Distributed Computing},
keywords = {e-Science; data-intensive},
pages = {725--736},
publisher = {ACM},
title = {Towards Optimising Distributed Data Streaming Graphs using Parallel Streams},
url = {http://www.cct.lsu.edu/~kosar/didc10/index.php},
year = {2010}}