Foundations of Workflows for Large-Scale Scientific Data Analysis

B3: Debugging Distributed Data Analysis Workflows

Essentially all scientific disciplines are generating ever-increasing amounts of data. To derive scientific discoveries, these data sets are analyzed by complex data analysis workflows (DAWs), which are series of discrete analysis programs arranged in (typically non-linear) pipelines and often executed on distributed computation infrastructures. While existing DAW research is mostly devoted to improving the speed of DAW executions, human effort remains the most expensive resource and the limiting factor in large-scale scientific data analysis. We address this problem within the Collaborative Research Center FONDA (Foundations of Workflows for Large-Scale Scientific Data Analysis). The overall research goal is to establish foundations for developing DAWs in a more rigorous, systematic and sustainable manner, leading to DAWs which are portable, adaptable and dependable.

Within sub-project B3 (Debugging Distributed Data Analysis Workflows), we address a specific research question: how to enable domain scientists to efficiently formulate, test, and refine debugging hypotheses in the context of scientific software engineering and large-scale data analytics.