Data-Centric Scientific Workflow Management Systems

David T Liu

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2007-83
June 15, 2007

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-83.pdf

Recent trends in science and technology augur a rapid increase in the number of computations being employed by scientists. Accompanying increased volumes are growing expectations for the tools that scientists use to handle their computations. These increased volumes and expectations present a new set of problems and opportunities in computation management. In this thesis, I propose Data Centric Scientific Workflow Management Systems (DSWMSs) to address these issues. DSWMSs supersede current approaches by leveraging a deeper understanding of the data manipulated by computations to provide new features and improve usability and performance. Examples of such features include data provenance, work sharing, and interactive computational steering. In this thesis, I make several contributions towards realizing the concept of a DSWMS. First, in conjunction with scientists from several scientific domains, I propose a set of services that are not provided by current paradigms, but are made possible in DSWMSs. Second, I define an abstract model, the Functional Data Model with Relational Covers (FDM/RC), for representing scientific workloads and a language for defining and manipulating instances (schemas) of the model. Third, I design and implement GridDB, a prototype DSWMS. GridDB is deployed on a large cluster at Lawrence Livermore National Laboratories where it runs science applications at real-world scales. The deployment uncovers a pair of technical problems involving the provisioning of data provenance and memoization (computational caching) so I also contribute solutions to these problems.

Advisor: Michael Franklin


BibTeX citation:

@phdthesis{Liu:EECS-2007-83,
    Author = {Liu, David T},
    Title = {Data-Centric Scientific Workflow Management Systems},
    School = {EECS Department, University of California, Berkeley},
    Year = {2007},
    Month = {Jun},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-83.html},
    Number = {UCB/EECS-2007-83},
    Abstract = {Recent trends in science and technology augur a rapid increase in
the number of computations being employed by
scientists. Accompanying increased volumes are growing
expectations for the tools that scientists use to handle their
computations.  These increased volumes and expectations present a
new set of problems and opportunities in computation
management. In this thesis, I propose Data Centric Scientific
Workflow Management Systems (DSWMSs) to address these
issues. DSWMSs supersede current approaches by leveraging a
deeper understanding of the data manipulated by computations to
provide new features and improve usability and
performance. Examples of such features include data provenance,
work sharing, and interactive computational steering.  In this
thesis, I make several contributions towards realizing the
concept of a DSWMS. First, in conjunction with scientists from
several scientific domains, I propose a set of services that are
not provided by current paradigms, but are made possible in
DSWMSs. Second, I dene an abstract model, the Functional Data
Model with Relational Covers (FDM/RC), for representing
scientific workloads and a language for dening and manipulating
instances (schemas) of the model. Third, I design and implement
GridDB, a prototype DSWMS. GridDB is deployed on a large cluster
at Lawrence Livermore National Laboratories where it runs science
applications at real-world scales. The deployment uncovers a pair
of technical problems involving the provisioning of data
provenance and memoization (computational caching) so I also
contribute solutions to these problems.}
}

EndNote citation:

%0 Thesis
%A Liu, David T
%T Data-Centric Scientific Workflow Management Systems
%I EECS Department, University of California, Berkeley
%D 2007
%8 June 15
%@ UCB/EECS-2007-83
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-83.html
%F Liu:EECS-2007-83