Exercises for the Harvard University Introduction to Data Processing Workflow Languages training course.
For details, see slides.
This repository contains sample pipelines in CWL and Nextflow.
The pipeline explores correlations between numeric columns of a tab-separated file and builds a bar chart of the results. The user selects one column (the main variable) and a set of secondary columns to correlate with it.
The actual data file contains a mix of demographic, behavioral, climate, and air pollution data. It is hosted in an IBM Cloud S3 bucket.
Note: This pipeline is not intended as an example of best practices but as a playground for exploring different features of the CWL and Nextflow workflow definition languages.
The same tasks might be easier to do as a standalone Python or R program, but the goal here is to show how to use specialized workflow definition domain-specific languages (DSLs).
The pipeline performs the following steps:
- Cleanse the data. The input file contains the string `(null)` in place of some numeric values. Rows with such values cannot be parsed as numbers by the pandas package and hence should be removed.
- For every column that we would like to correlate with the main variable, calculate the Pearson correlation coefficient using the Python pandas package. The calculations can be done in parallel for different columns.
- Gather and combine the results of the calculations in step 2.
- Use Gnuplot to build a bar chart.
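
For orientation, the cleansing and correlation steps can be sketched in plain Python with pandas. This is only an illustration under stated assumptions, not the code used by the CWL or Nextflow pipelines in this repository: the column names (`pm25`, `temp`, `income`), the inline sample data, and the helper functions `cleanse` and `correlate` are all hypothetical.

```python
import io

import pandas as pd


def cleanse(df, columns):
    """Keep only the given columns and drop rows with non-numeric values.

    Values such as "(null)" are coerced to NaN and the affected rows removed.
    """
    numeric = df[columns].apply(pd.to_numeric, errors="coerce")
    return numeric.dropna()


def correlate(df, main, secondary):
    """Return the Pearson correlation of `main` with each secondary column."""
    # Each column is independent of the others, which is why a workflow
    # engine can run these calculations in parallel.
    return {col: df[main].corr(df[col]) for col in secondary}


# Hypothetical tab-separated sample; the real data file is much larger.
sample = (
    "pm25\ttemp\tincome\n"
    "1\t2\t10\n"
    "2\t4\t(null)\n"
    "3\t6\t8\n"
    "4\t8\t7\n"
)

# Read everything as strings so that "(null)" is handled explicitly.
raw = pd.read_csv(io.StringIO(sample), sep="\t", dtype=str)
clean = cleanse(raw, ["pm25", "temp", "income"])
result = correlate(clean, "pm25", ["temp", "income"])

# Gather step: one line per secondary column, ready for plotting.
for column, coefficient in result.items():
    print(f"{column}\t{coefficient:.3f}")
```

In the workflow versions, each of these stages would be a separate task, with the per-column correlation scattered across parallel jobs and the gather step collecting their outputs into a single file for Gnuplot.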