diff --git a/src/topics/best-practices.md b/src/topics/best-practices.md
index 52bb5097..e7b88faf 100644
--- a/src/topics/best-practices.md
+++ b/src/topics/best-practices.md
@@ -96,6 +96,55 @@ all are required.
 - Software containers should be made to be conformant to the
   ["Recommendations for the packaging and containerizing of bioinformatics software"][containers]
   (also useful to other disciplines).
+The following is a set of recommended practices to keep in mind when running CWL workflows with Docker:
+
+- Make sure you are using the latest version of both CWL and Docker,
+  so that you have access to the latest features and bug fixes.
+
+- Use meaningful tags on your own Docker images
+  so you can tell versions of an image apart as it is updated over time.
+  Tags can reflect the version of the underlying software,
+  or a version you assign to the Dockerfile itself.
+  They can be manually assigned version numbers (e.g. 1.0, 1.1, 1.2, 2.0),
+  timestamps (e.g. YYYYMMDD, like 20220126), or the hash of a Git commit.
+
+- Keep your Dockerfiles in Git, just like your workflow definitions,
+  because they are also scripts and should be managed and tracked with version control.
+
+- When creating a Dockerfile, specify the exact version
+  of the software you want to install and of the base image you want to use.
+  This helps ensure that your Docker image builds are consistent and reproducible.
+  In particular, specify a tag for the base image in the `FROM` instruction,
+  otherwise it will default to `latest`, which can change at any time.
+
+- Do not rely on the `USER` instruction in the Dockerfile
+  to control which user runs the tool.
+  By default, cwltool passes the current user ID to `docker run --user`,
+  which overrides the `USER` instruction, so the user specified in the Dockerfile
+  may not be the user that actually runs the tool.
+  If the tool must run as the user set in the image, use the `--no-match-user` cwltool flag
+  to disable passing the current user ID to `docker run --user`.
+
+- Keep your container images as small as possible:
+  this speeds up downloads and consumes less storage space.
+  Also, when using bioinformatics tools, supply reference data externally
+  (as workflow inputs) rather than including it in the container image.
+  This way, the reference data can be updated without rebuilding the Docker image.
+
+- Avoid using the `ENTRYPOINT` instruction in your Dockerfile
+  because it changes the command line that runs inside the container.
+  This can cause confusion when the command line that is supplied to the container
+  and the command that actually runs are different.
+
+- Docker can save you time during development by
+  reusing the cached layer from a previous command instead of running the command again.
+  However, this can cause problems if a file being downloaded changes
+  but the command remains the same. In that case, the cached version of the file will be used
+  instead of the updated one. To avoid this, use the `--no-cache` option to force Docker to re-run the steps.
+
+  To learn more about creating workflows with Docker,
+  see this [tutorial](https://doc.arvados.org/rnaseq-cwl-training/08-supplement-docker/index.html).
+
 [containers]: https://doi.org/10.12688/f1000research.15140.1
 [apache-license]: https://spdx.org/licenses/Apache-2.0.html
 [license-example]: https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl/blob/master/workflows/emg-assembly.cwl#L200
@@ -112,4 +161,3 @@ all are required.
 %
 % - Writing CWL workflows (include existing docs from https://github.com/common-workflow-library/cwl-patterns/blob/main/README.md)
 % - FAIR best practices with CWL
-% - Docker best practices with CWL - https://github.com/common-workflow-language/common-workflow-language/issues/347