A Generic Parallel Processing Model for Facilitating Data Mining and Integration
by L Han, CS Liew, MP Atkinson and JI van Hemert
Abstract:
To facilitate Data Mining and Integration (DMI) processes in a generic way, we investigate a parallel pipeline streaming model. We model a DMI task as a streaming data-flow graph: a directed acyclic graph (DAG) of Processing Elements (PEs). The composition mechanism links PEs via data streams, which may be in memory, buffered via disks or inter-computer data-flows. This makes it possible to build arbitrary DAGs with pipelining and both data and task parallelisms, which provides room for performance enhancement. We have applied this approach to a real DMI case in the Life Sciences and implemented a prototype. To demonstrate feasibility of the modelled DMI task and assess the efficiency of the prototype, we have also built a performance evaluation model. The experimental evaluation results show that a linear speedup has been achieved with the increase of the number of distributed computing nodes in this case study.
Reference:
A Generic Parallel Processing Model for Facilitating Data Mining and Integration (L Han, CS Liew, MP Atkinson and JI van Hemert), In Parallel Computing, Elsevier, volume 37, 2011.
Bibtex Entry:
@article{HLAH2011,
	_day = {19},
	abstract = {To facilitate Data Mining and Integration (DMI) processes in a generic way, we investigate a parallel pipeline streaming model. We model a DMI task as a streaming data-flow graph: a directed acyclic graph (DAG) of Processing Elements (PEs). The composition mechanism links PEs via data streams, which may be in memory, buffered via disks or inter-computer data-flows. This makes it possible to build arbitrary DAGs with pipelining and both data and task parallelisms, which provides room for performance enhancement. We have applied this approach to a real DMI case in the Life Sciences and implemented a prototype. To demonstrate feasibility of the modelled DMI task and assess the efficiency of the prototype, we have also built a performance evaluation model. The experimental evaluation results show that a linear speedup has been achieved with the increase of the number of distributed computing nodes in this case study.},
	author = {Han, L and Liew, CS and Atkinson, MP and van Hemert, JI},
	date-modified = {2011-04-19 11:35:39 +0100},
	issn = {0167-8191},
	number = {3},
	journal = {Parallel Computing},
	keywords = {e-Science; data-intensive; data mining; data integration},
	pages = {157--171},
	publisher = {Elsevier},
	title = {A Generic Parallel Processing Model for Facilitating Data Mining and Integration},
	url = {http://dx.doi.org/10.1016/j.parco.2011.02.006},
	volume = {37},
	year = {2011},
	bdsk-url-1 = {http://dx.doi.org/10.1016/j.parco.2011.02.006}}