From fcd27ba733fdd9ff93e5eea33f75dd9100c55f3c Mon Sep 17 00:00:00 2001 From: Dorota Jarecka Date: Mon, 16 Oct 2023 22:37:23 -0400 Subject: [PATCH 1/2] updates to Intro and FunctionTask --- notebooks/1_intro_pydra.md | 51 +++---- notebooks/2_intro_functiontask.md | 220 ++++++------------------------ 2 files changed, 63 insertions(+), 208 deletions(-) diff --git a/notebooks/1_intro_pydra.md b/notebooks/1_intro_pydra.md index fe65a2d..0b64f62 100644 --- a/notebooks/1_intro_pydra.md +++ b/notebooks/1_intro_pydra.md @@ -4,9 +4,9 @@ jupytext: extension: .md format_name: myst format_version: 0.13 - jupytext_version: 1.14.0 + jupytext_version: 1.15.2 kernelspec: - display_name: Python 3 + display_name: Python 3 (ipykernel) language: python name: python3 --- @@ -15,46 +15,31 @@ kernelspec: +++ -Pydra is a lightweight, Python 3.7+ dataflow engine for computational graph construction, manipulation, and distributed execution. -Designed as a general-purpose engine to support analytics in any scientific domain; created for [Nipype](https://github.com/nipy/nipype), and helps build reproducible, scalable, reusable, and fully automated, provenance tracked scientific workflows. -The power of Pydra lies in ease of workflow creation -and execution for complex multiparameter map-reduce operations, and the use of global cache. +Pydra is a lightweight, Python 3.7+ dataflow engine. While it originated within the neuroimaging community, its versatile design makes it suitable as a general-purpose engine to facilitate analytics across various scientific fields. -Pydra's key features are: -- Consistent API for Task and Workflow -- Splitting & combining semantics on Task/Workflow level -- Global cache support to reduce recomputation -- Support for execution of Tasks in containerized environments -+++ +You can discover a more in-depth explanation of the concept behind Pydra here. 
TODO-LINK -## Pydra computational objects - Tasks -There are two main types of objects in *pydra*: `Task` and `Workflow`, that is also a type of `Task`, and can be used in a nested workflow. -![nested_workflow.png](../figures/nested_workflow.png) ++++ +In this tutorial you will create and execute your first `Task`s from standard Python functions and shell commands. +You'll also construct basic `Workflow`s that link multiple tasks together. Furthermore, you'll have the opportunity to produce `Task`s and `Workflow`s capable of automatically running for multiple input values. ++++ -**These are the current `Task` implemented in Pydra:** -- `Workflow`: connects multiple `Task`s withing a graph -- `FunctionTask`: wrapper for Python functions -- `ShellCommandTask`: wrapper for shell commands - - `ContainerTask`: wrapper for shell commands run within containers - - `DockerTask`: `ContainerTask` that uses Docker - - `SingularityTask`: `ContainerTask` that uses Singularity +**Before going to the main notebooks, let's check if pydra is properly installed.** If you have any issues running the following cell, please revisit the Installation section. TODO-LINK -+++ +```{code-cell} ipython3 +import pydra +``` -## Pydra Workers -Pydra supports multiple workers to execute `Tasks` and `Workflows`: -- `ConcurrentFutures` -- `SLURM` -- `Dask` -- `PSI/J` +### Additional notes +++ -**Before going to next notebooks, let's check if pydra is properly installed** - -```{code-cell} -import pydra +At the beginning of each tutorial you will see: +``` +import nest_asyncio +nest_asyncio.apply() ``` +This is run because both *Jupyter* and *Pydra* use `asyncio` and in some cases you can see `RuntimeError: This event loop is already running` if `nest_asyncio` is not used. 
**This part is not needed if Pydra is used outside the Jupyter environment.** diff --git a/notebooks/2_intro_functiontask.md b/notebooks/2_intro_functiontask.md index ace8f2d..e1e2c9c 100644 --- a/notebooks/2_intro_functiontask.md +++ b/notebooks/2_intro_functiontask.md @@ -4,15 +4,13 @@ jupytext: extension: .md format_name: myst format_version: 0.13 - jupytext_version: 1.15.0 + jupytext_version: 1.15.2 kernelspec: display_name: Python 3 (ipykernel) language: python name: python3 --- -# FunctionTask - ```{code-cell} ipython3 --- jupyter: @@ -23,11 +21,14 @@ pycharm: ' --- import nest_asyncio - nest_asyncio.apply() ``` -A `FunctionTask` is a `Task` that can be created from every *python* function by using *pydra* decorator: `pydra.mark.task`: +# FunctionTask + ++++ + +In this tutorial, you will generate your initial *pydra* `Task`, which is a fundamental *pydra*'s component capable of processing data. You will start from a `FunctionTask`, a type of `Task` that can be created from every *python* function by using *pydra* decorator: `pydra.mark.task`: ```{code-cell} ipython3 import pydra @@ -37,26 +38,20 @@ def add_var(a, b): return a + b ``` -Once we decorate the function, we can create a pydra `Task` and specify the input: +Now that we decorated the function, we can create a pydra `Task` and specify the input, for this example values for `a` and `b` are needed. 
```{code-cell} ipython3 task0 = add_var(a=4, b=5) ``` -We can check the type of `task0`: - -```{code-cell} ipython3 -type(task0) -``` - -and we can check if the task has correct values of `a` and `b`, they should be saved in the task `inputs`: +You can now check whether the task has the correct values of `a` and `b`; they should be saved in the task `inputs`: ```{code-cell} ipython3 print(f'a = {task0.inputs.a}') print(f'b = {task0.inputs.b}') ``` -We can also check content of entire `inputs`: +You can also check the content of the entire `inputs`: ```{code-cell} ipython3 task0.inputs @@ -64,132 +59,41 @@ task0.inputs As you could see, `task.inputs` contains also information about the function, that is an inseparable part of the `FunctionTask`. -Once we have the task with set input, we can run it. Since `Task` is a "callable object", we can use the syntax: +Once you have a task with its input values set, you can run it. Since `Task` is a "callable object", you can use the following syntax: ```{code-cell} ipython3 task0() ``` -As you can see, the result was returned right away, but we can also access it later: +As you can see, the result was returned right away, but you can also access it later: ```{code-cell} ipython3 task0.result() ``` -`Result` contains more than just an output, so if we want to get the task output, we can type: +The method should return a `Result` object. `Result` contains more than just an output, so if you want to get the task output, you can type: ```{code-cell} ipython3 result = task0.result() result.output.out ``` -And if we want to see the input that was used in the task, we can set an optional argument `return_inputs` to True. +You can also see the input that was used to run the task by setting the optional argument `return_inputs` to `True`. ```{code-cell} ipython3 task0.result(return_inputs=True) ``` -## Type-checking - -+++ - -### What is Type-checking? - -Type-checking is verifying the type of a value at compile or run time. 
It ensures that operations or assignments to variables are semantically meaningful and can be executed without type errors, enhancing code reliability and maintainability. - -+++ - -### Why Use Type-checking? - -1. **Error Prevention**: Type-checking helps catch type mismatches early, preventing potential runtime errors. -2. **Improved Readability**: Type annotations make understanding what types of values a function expects and returns more straightforward. -3. **Better Documentation**: Explicitly stating expected types acts as inline documentation, simplifying code collaboration and review. -4. **Optimized Performance**: Type-related optimizations can be made during compilation when types are explicitly specified. +Notice that the full name of the input variables contains the name of the task! +++ -### How is Type-checking Implemented in Pydra? +If you want to practice, change the values of `a` and `b` and run the task again. +++ -#### Static Type-Checking -Static type-checking is done using Python's type annotations. You annotate the types of your function arguments and the return type and then use a tool like `mypy` to statically check if you're using the function correctly according to those annotations. - -```{code-cell} ipython3 -@pydra.mark.task -def add(a: int, b: int) -> int: - return a + b -``` - -```{code-cell} ipython3 -# This usage is correct according to static type hints: -task1a = add(a=5, b=3) -task1a() -``` - -```{code-cell} ipython3 -:tags: [raises-exception] -# This usage is incorrect according to static type hints: -task1b = add(a="hello", b="world") -task1b() -``` - -#### Dynamic Type-Checking - -Dynamic type-checking is done at runtime. Add dynamic type checks if you want to enforce types when the function is executed. 
- -```{code-cell} ipython3 -@pydra.mark.task -def add(a, b): - if not (isinstance(a, int) and isinstance(b, int)): - raise TypeError("Both inputs should be integers.") - return a + b -``` - -```{code-cell} ipython3 -# This usage is correct and will not raise a runtime error: -task1c = add(a=5, b=3) -task1c() -``` - -```{code-cell} ipython3 -:tags: [raises-exception] -# This usage is incorrect and will raise a runtime TypeError: -task1d = add(a="hello", b="world") -task1d() -``` - -#### Checking Complex Types - -For more complex types like lists, dictionaries, or custom objects, we can use type hints combined with dynamic checks. - -```{code-cell} ipython3 -from typing import List, Tuple - -@pydra.mark.task -def sum_of_pairs(pairs: List[Tuple[int, int]]) -> List[int]: - if not all(isinstance(pair, Tuple) and len(pair) == 2 for pair in pairs): - raise ValueError("Input should be a list of pairs (tuples with 2 integers each).") - return [sum(pair) for pair in pairs] -``` - -```{code-cell} ipython3 -# Correct usage -task1e = sum_of_pairs(pairs=[(1, 2), (3, 4)]) -task1e() -``` - -```{code-cell} ipython3 -:tags: [raises-exception] -# This will raise a ValueError -task1f = sum_of_pairs(pairs=[(1, 2), (3, "4")]) -task1f() -``` - ## Customizing output names -Note, that "out" is the default name for the task output, but we can always customize it. There are two ways of doing it: using *python* function annotation and using another *pydra* decorator: - -Let's start from the function annotation: +Note, that "out" is the default name for the task output, but you can always customize it by using *python* function annotation. 
```{code-cell} ipython3 import typing as ty @@ -217,27 +121,9 @@ task2b = modf_an(a=3.5) task2b() ``` -The second way of customizing the output requires another decorator - `pydra.mark.annotate` - -```{code-cell} ipython3 -@pydra.mark.task -@pydra.mark.annotate({'return': {'fractional': ty.Any, 'integer': ty.Any}}) -def modf(a: float): - import math - - return math.modf(a) - -task2c = modf(a=3.5) -task2c() -``` - -**Note, that the order of the pydra decorators is important!** - -+++ - ## Setting the input -We don't have to provide the input when we create a task, we can always set it later: +Note that you don't have to provide the input when you create a task, you can always set it later: ```{code-cell} ipython3 task3 = add_var() @@ -246,19 +132,16 @@ task3.inputs.b = 5 task3() ``` -If we don't specify the input, `attr.NOTHING` will be used as the default value +If you don't specify the input, `attr.NOTHING` will be used as the default value ```{code-cell} ipython3 task3a = add_var() task3a.inputs.a = 4 -# importing attr library, and checking the type of `b` -import attr - -task3a.inputs.b == attr.NOTHING +task3a.inputs.b ``` -And if we try to run the task, an error will be raised: +And if you try to run the task, an error will be raised: ```{code-cell} ipython3 :tags: [raises-exception] @@ -266,9 +149,13 @@ And if we try to run the task, an error will be raised: task3a() ``` +You can now try to fix the task and run it again. + ++++ + ## Output directory and caching the results -After running the task, we can check where the output directory with the results was created: +After running the task, you can check where the output directory with the results was created: ```{code-cell} ipython3 task3.output_dir @@ -278,13 +165,13 @@ Within the directory you can find the file with the results: `_result.pklz`. 
```{code-cell} ipython3 import os -``` - -```{code-cell} ipython3 os.listdir(task3.output_dir) ``` -But we can also provide the path where we want to store the results. If a path is provided for the cache directory, then pydra will use the cached results of a node instead of recomputing the result. Let's create a temporary directory and a specific subdirectory "task4": +But you can also provide the path where you want to store the results. +**Note that if the same path is provided when you run the task again, pydra will use the cached results instead of recomputing the result.** + +Let's create a temporary directory and a specific subdirectory "task4": ```{code-cell} ipython3 from tempfile import mkdtemp @@ -296,7 +183,7 @@ cache_dir_tmp = Path(mkdtemp()) / 'task4' print(cache_dir_tmp) ``` -Now we can pass this path to the argument of `FunctionTask` - `cache_dir`. To observe the execution time, we specify a function that is sleeping for 5s: +Now you can pass this path to the argument of `FunctionTask` - `cache_dir`. To observe the execution time, you can specify a function that is sleeping for 5s: ```{code-cell} ipython3 @pydra.mark.task @@ -311,25 +198,29 @@ task4 = add_var_wait(a=4, b=6, cache_dir=cache_dir_tmp) If you're running the cell first time, it should take around 5s. +You can measure the exact time by adding the Jupyter cell magic `%%time`. + ```{code-cell} ipython3 +%%time task4() task4.result() ``` -We can check `output_dir` of our task, it should contain the path of `cache_dir_tmp` and the last part contains the name of the task class `FunctionTask` and the task checksum: +You can check `output_dir` of our task, it should contain the path of `cache_dir_tmp` and the last part contains the name of the task class `FunctionTask` and the task checksum that is unique for a specific function and specific set of input values. 
You can read more about checksum here TODO-LINK ```{code-cell} ipython3 task4.output_dir ``` -Let's see what happens when we defined identical task again with the same `cache_dir`: +Let's see what happens when an identical task is run again with the same `cache_dir`: ```{code-cell} ipython3 +%%time task4a = add_var_wait(a=4, b=6, cache_dir=cache_dir_tmp) task4a() ``` -This time the result should be ready right away! *pydra* uses available results and do not recompute the task. +This time the result should be ready right away! *pydra* uses available results and do not recompute the task. The wall time provided by `%%tinme` should be in milliseconds. *pydra* not only checks for the results in `cache_dir`, but you can provide a list of other locations that should be checked. Let's create another directory that will be used as `cache_dir` and previous working directory will be used in `cache_locations`. @@ -342,7 +233,7 @@ task4b = add_var_wait( task4b() ``` -This time the results should be also returned quickly! And we can check that `task4b.output_dir` was not created: +This time the results should be also returned quickly! And you can check that `task4b.output_dir` was not created: ```{code-cell} ipython3 task4b.output_dir.exists() @@ -361,7 +252,7 @@ task4c(rerun=True) task4c.output_dir.exists() ``` -If we update the input of the task, and run again, the new directory will be created and task will be recomputed: +Remember that if you update the input of the task, the new directory will be created and task will be recomputed! ```{code-cell} ipython3 task4b.inputs.a = 1 @@ -369,25 +260,28 @@ print(task4b()) print(task4b.output_dir.exists()) ``` -and when we check the `output_dir`, we can see that it's different than last time: +and when you check the `output_dir`, you can see that it's different than last time: ```{code-cell} ipython3 task4b.output_dir ``` -This is because, the checksum changes when we change either input or function. 
+This is because the checksum changes when you change either the input or the function. +++ {"solution2": "hidden", "solution2_first": true} ### Exercise 1 +Now you can practice creating new tasks! + Create a task that take a list of numbers as an input and returns two fields: `mean` with the mean value and `std` with the standard deviation value. ```{code-cell} ipython3 :tags: [hide-cell] +#TODO-HIDE @pydra.mark.task @pydra.mark.annotate({'return': {'mean': ty.Any, 'std': ty.Any}}) -def mean_dev(my_list: List): +def mean_dev(my_list): import statistics as st return st.mean(my_list), st.stdev(my_list) @@ -400,27 +294,3 @@ my_task.result() ```{code-cell} ipython3 # write your solution here (you can use statistics module) ``` - -## Using Audit - -*pydra* can record various run time information, including the workflow provenance, by setting `audit_flags` and the type of messengers. - -`AuditFlag.RESOURCE` allows you to monitor resource usage for the `Task`, while `AuditFlag.PROV` tracks the provenance of the `Task`. - -```{code-cell} ipython3 -from pydra.utils.messenger import AuditFlag, PrintMessenger - -task5 = add_var(a=4, b=5, audit_flags=AuditFlag.RESOURCE) -task5() -task5.result() -``` - -One can turn on both audit flags using `AuditFlag.ALL`, and print the messages on the terminal using the `PrintMessenger`. 
- -```{code-cell} ipython3 -task5 = add_var( - a=4, b=5, audit_flags=AuditFlag.ALL, messengers=PrintMessenger() -) -task5() -task5.result() -``` From 31078bf4bf49696bfe1b3aafa7983d62ce2fee4f Mon Sep 17 00:00:00 2001 From: Dorota Jarecka Date: Tue, 17 Oct 2023 14:02:53 -0400 Subject: [PATCH 2/2] applying Yibei suggestions --- notebooks/1_intro_pydra.md | 4 ++-- notebooks/2_intro_functiontask.md | 12 ++++++------ 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/notebooks/1_intro_pydra.md b/notebooks/1_intro_pydra.md index 0b64f62..ec85f2e 100644 --- a/notebooks/1_intro_pydra.md +++ b/notebooks/1_intro_pydra.md @@ -27,7 +27,7 @@ You'll also construct basic `Workflow`s that link multiple tasks together. Furth +++ -**Before going to the main notebooks, let's check if pydra is properly installed.** If you have any issues running the following cell, please revisit the Installation section. TODO-LINK +**Let's check if pydra is properly installed.** If you have any issues running the following cell, please revisit the Installation section. TODO-LINK ```{code-cell} ipython3 import pydra @@ -42,4 +42,4 @@ At the beginning of each tutorial you will see: import nest_asyncio nest_asyncio.apply() ``` -This is run because both *Jupyter* and *Pydra* use `asyncio` and in some cases you can see `RuntimeError: This event loop is already running` if `nest_asyncio` is not used. **This part is not needed if Pydra is used outside the Jupyter environment.** +This is because both *Jupyter* and *Pydra* use `asyncio` and you can get `RuntimeError: This event loop is already running` if `nest_asyncio` is not used. 
**This part is not needed if Pydra is used outside of Jupyter Notebook/Lab.** diff --git a/notebooks/2_intro_functiontask.md b/notebooks/2_intro_functiontask.md index e1e2c9c..bb34a72 100644 --- a/notebooks/2_intro_functiontask.md +++ b/notebooks/2_intro_functiontask.md @@ -28,7 +28,7 @@ nest_asyncio.apply() +++ -In this tutorial, you will generate your initial *pydra* `Task`, which is a fundamental *pydra*'s component capable of processing data. You will start from a `FunctionTask`, a type of `Task` that can be created from every *python* function by using *pydra* decorator: `pydra.mark.task`: +In this tutorial, you will create your first *Pydra* `Task`, a fundamental *Pydra* component capable of processing data. You will start from a `FunctionTask`, a type of `Task` that can be created from any *Python* function by using the *Pydra* decorator `pydra.mark.task`: ```{code-cell} ipython3 import pydra @@ -38,7 +38,7 @@ def add_var(a, b): return a + b ``` -Now that we decorated the function, we can create a pydra `Task` and specify the input, for this example values for `a` and `b` are needed. +After decorating the function, you can create a Pydra `Task` and specify the input. In this example, values for `a` and `b` are needed. ```{code-cell} ipython3 task0 = add_var(a=4, b=5) @@ -93,7 +93,7 @@ If you want to practice, change the values of `a` and `b` and run the task again +++ ## Customizing output names -Note, that "out" is the default name for the task output, but you can always customize it by using *python* function annotation. +Note that `out` in `result.output.out` is the default name for the task output, but you can always customize it by using a *Python* function annotation. ```{code-cell} ipython3 import typing as ty @@ -169,7 +169,7 @@ os.listdir(task3.output_dir) ``` But you can also provide the path where you want to store the results. 
-**Note that if the same path is provided when you run the task again, pydra will use the cached results instead of recomputing the result.** +**Note that if the same path is provided when you run the task again, Pydra will use the cached results instead of recomputing the result.** Let's create a temporary directory and a specific subdirectory "task4": @@ -220,9 +220,9 @@ task4a = add_var_wait(a=4, b=6, cache_dir=cache_dir_tmp) task4a() ``` -This time the result should be ready right away! *pydra* uses available results and do not recompute the task. The wall time provided by `%%tinme` should be in milliseconds. +This time the result should be ready right away! *Pydra* uses the available results and does not recompute the task. The wall time reported by `%%time` should be in milliseconds. -*pydra* not only checks for the results in `cache_dir`, but you can provide a list of other locations that should be checked. Let's create another directory that will be used as `cache_dir` and previous working directory will be used in `cache_locations`. +*Pydra* not only checks for results in `cache_dir`; you can also provide a list of other locations to check. Let's create another directory to use as `cache_dir`, and pass the previous working directory in `cache_locations`. ```{code-cell} ipython3 cache_dir_tmp_new = Path(mkdtemp()) / 'task4b'