
Commit 99e0a38

TAO 5.2 Release - PyTorch
1 parent 1a94305 commit 99e0a38

315 files changed: +44187 -721 lines changed


README.md (+65 -9)
@@ -7,6 +7,8 @@
 * [Hardware Requirements](#HardwareRequirements)
 * [Software Requirements](#SoftwareRequirements)
 * [Instantiating the development container](#Instantiatingthedevelopmentcontainer)
+* [Command line options](#Commandlineoptions)
+* [Using the mounts file](#Usingthemountsfile)
 * [Updating the base docker](#Updatingthebasedocker)
 * [Build base docker](#Buildbasedocker)
 * [Test the newly built base docker](#Testthenewlybuiltbasedocker)
@@ -25,16 +27,16 @@
 
 TAO Toolkit is a Python package hosted on the NVIDIA Python Package Index. It interacts with lower-level TAO dockers available from the NVIDIA GPU Accelerated Container Registry (NGC). The TAO containers come pre-installed with all dependencies required for training. The output of the TAO workflow is a trained model that can be deployed for inference on NVIDIA devices using DeepStream, TensorRT and Triton.
 
-This repository contains the required implementation for the all the deep learning components and networks using the PyTorch backend. These routines are packaged as part of the TAO Toolkit PyTorch container in the Toolkit package.
+This repository contains the required implementation for all the deep learning components and networks using the PyTorch backend. These routines are packaged as part of the TAO Toolkit PyTorch container in the Toolkit package. The source code here is compatible with PyTorch versions > 2.0.0.
 
 ## <a name='GettingStarted'></a>Getting Started
 
 As soon as the repository is cloned, run the `envsetup.sh` file to check
-if the build enviroment has the necessary dependencies, and the required
+if the build environment has the necessary dependencies, and the required
 environment variables are set.
 
 ```sh
-source scripts/envsetup.sh
+source ${PATH_TO_REPO}/scripts/envsetup.sh
 ```
 
 We recommend adding this command to your local `~/.bashrc` file, so that every new terminal instance receives this.
@@ -64,23 +66,24 @@ We recommend adding this command to your local `~/.bashrc` file, so that every n
 | **Software** | **Version** |
 | :--- | :--- |
 | Ubuntu LTS | >=18.04 |
-| python | >=3.8.x |
+| python | >=3.10.x |
 | docker-ce | >19.03.5 |
 | docker-API | 1.40 |
 | `nvidia-container-toolkit` | >1.3.0-1 |
 | nvidia-container-runtime | 3.4.0-1 |
 | nvidia-docker2 | 2.5.0-1 |
-| nvidia-driver | >525.85 |
+| nvidia-driver | >535.85 |
 | python-pip | >21.06 |
 
 ### <a name='Instantiatingthedevelopmentcontainer'></a>Instantiating the development container
 
-Inorder to maintain a uniform development enviroment across all users, TAO Toolkit provides a base environment docker that has been built and uploaded to NGC for the developers. For instantiating the docker, simply run the `tao_pt` CLI. The usage for the command line launcher is mentioned below.
+In order to maintain a uniform development environment across all users, TAO Toolkit provides a base environment Dockerfile in `docker/Dockerfile` that contains all
+the required third-party dependencies for the developers. To instantiate the docker, simply run the `tao_pt` CLI. The usage for the command line launcher is shown below.
 
 ```sh
 usage: tao_pt [-h] [--gpus GPUS] [--volume VOLUME] [--env ENV]
               [--mounts_file MOUNTS_FILE] [--shm_size SHM_SIZE]
-              [--run_as_user] [--ulimit ULIMIT] [--port PORT]
+              [--run_as_user] [--tag TAG] [--ulimit ULIMIT] [--port PORT]
 
 Tool to run the pytorch container.
 
@@ -92,6 +95,7 @@ optional arguments:
   --mounts_file MOUNTS_FILE Path to the mounts file.
   --shm_size SHM_SIZE Shared memory size for docker
   --run_as_user Flag to run as user
+  --tag TAG The tag value for the local dev docker.
   --ulimit ULIMIT Docker ulimits for the host machine.
   --port PORT Port mapping (e.g. 8889:8889).
 
@@ -106,6 +110,55 @@ tao_pt --gpus all \
        --env PYTHONPATH=/tao-pt
 ```
 
+Running Deep Neural Networks implies working on large datasets. These datasets are usually stored on network share drives with significantly higher storage capacity. Since the `tao_pt` CLI wrapper uses docker containers under the hood, these drives/mount points need to be mapped to the docker.
+
+There are two ways to configure the `tao_pt` CLI wrapper:
+
+1. Via the command line options
+2. Via the mounts file, by default at `~/.tao_mounts.json`
+
+#### <a name='Commandlineoptions'></a>Command line options
+
+| **Option** | **Description** | **Default** |
+| :-- | :-- | :-- |
+| `gpus` | Comma-separated GPU indices to be exposed to the docker | 1 |
+| `volume` | Paths on the host machine to be exposed to the container. This is analogous to the `-v` option in the docker CLI. You may define multiple mount points by using the `--volume` option multiple times. | None |
+| `env` | Environment variables to define inside the interactive container. You may set them as `--env VAR=<value>`. Multiple environment variables can be set by repeating the `--env` option. | None |
+| `mounts_file` | Path to the mounts file, explained further in the next section. | `~/.tao_mounts.json` |
+| `shm_size` | Shared memory size for docker in bytes. | 16G |
+| `run_as_user` | Flag to run as the default user account on the host machine. This helps maintain permissions for all directories and artifacts created by the container. | |
+| `tag` | The tag value for the local dev docker. | None |
+| `ulimit` | Docker ulimits for the host machine. | |
+| `port` | Port mapping (e.g. 8889:8889). | None |
+
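As an illustration, several of the options above can be combined in a single invocation. A hedged sketch (the host paths and values below are placeholders, not part of this commit):

```sh
tao_pt --gpus all \
       --volume /raid/datasets:/workspace/tao-experiments/data \
       --volume $HOME/results:/workspace/tao-experiments/results \
       --env PYTHONPATH=/tao-pt \
       --shm_size 16G \
       --run_as_user
```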
+#### <a name='Usingthemountsfile'></a>Using the mounts file
+
+The `tao_pt` CLI wrapper instance can be configured by using a mounts file. By default, the wrapper expects the mounts file to be at
+`~/.tao_mounts.json`. However, you may point the wrapper at a different file via the `--mounts_file` command line option.
+
+The launcher config file consists of three sections:
+
+* `Mounts`
+
+The `Mounts` parameter defines the paths in the local machine that should be mapped to the docker. This is a list of `json` dictionaries containing the source path in the local machine and the destination path that is mapped for the CLI wrapper.
+
+A sample config file containing 2 mount points and no docker options is shown below.
+
+```json
+{
+    "Mounts": [
+        {
+            "source": "/path/to/your/experiments",
+            "destination": "/workspace/tao-experiments"
+        },
+        {
+            "source": "/path/to/config/files",
+            "destination": "/workspace/tao-experiments/specs"
+        }
+    ]
+}
+```
+
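With a mounts file like the one above in place, the wrapper picks it up automatically from `~/.tao_mounts.json`. A hedged sketch of pointing it at a non-default location and running a script through the container (the path and script name are placeholders):

```sh
tao_pt --gpus all \
       --mounts_file /shared/tao/team_mounts.json \
       -- python train.py --help
```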
 ### <a name='Updatingthebasedocker'></a>Updating the base docker
 
 There will be situations where developers would be required to update the third-party dependencies to newer versions, or upgrade CUDA, etc. In such a case, please follow the steps below:
@@ -120,10 +173,11 @@ cd $NV_TAO_PYTORCH_TOP/docker
 ```
 
 #### <a name='Testthenewlybuiltbasedocker'></a>Test the newly built base docker
-Developers may tests their new docker by using the `tao_pt` command.
+
+The build script tags the newly built base docker with the username of the account on the user's local machine. Therefore, developers may test their new docker by using the `tao_pt` command with the `--tag` option.
 
 ```sh
-tao_pt -- script args
+tao_pt --tag $USER -- script args
 ```
 
 #### <a name='Updatethenewdocker'></a>Update the new docker
@@ -151,6 +205,8 @@ bash $NV_TAO_PYTORCH_TOP/docker/build.sh --build --push --force
 The TAO docker is built on top of the TAO Pytorch base dev docker, by building a python wheel for the `nvidia_tao_pyt` module in this repository and installing the wheel in the Dockerfile defined in `release/docker/Dockerfile`. The whole build process is captured in a single shell script which may be run as follows:
 
 ```sh
+git lfs install
+git lfs pull
 source scripts/envsetup.sh
 cd $NV_TAO_PYTORCH_TOP/release/docker
 ./deploy.sh --build --wheel

docker/Dockerfile (+27 -7)
@@ -1,5 +1,5 @@
-ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:23.02-py3
-FROM ${BASE_IMAGE}
+ARG PYTORCH_BASE_IMAGE=nvcr.io/nvidia/pytorch:23.08-py3
+FROM ${PYTORCH_BASE_IMAGE}
 
 # Ensure apt-get won't prompt for selecting options
 ENV DEBIAN_FRONTEND=noninteractive
@@ -16,8 +16,8 @@ RUN pip install parametrized ninja
 WORKDIR /opt
 
 # Clone and checkout TensorRT OSS
-# Moving TensorRT to 8.5 branch.
-ENV TRT_TAG "release/8.5"
+# Moving TensorRT to 8.6 branch.
+ENV TRT_TAG "release/8.6"
 ENV TRT_INCLUDE_DIR="/usr/include/x86_64-linux-gnu"
 # Install TRT OSS
 RUN mkdir trt_oss_src && \
@@ -27,21 +27,41 @@ RUN mkdir trt_oss_src && \
     cd TensorRT && \
     git submodule update --init --recursive && \
     mkdir -p build && cd build && \
-    cmake .. -DGPU_ARCHS="53 60 61 70 75 80 86 90" -DTRT_LIB_DIR=/usr/lib/x86_64-linux-gnu -DTRT_BIN_DIR=`pwd`/out -DCUDA_VERSION=11.8 -DCUDNN_VERSION=8.7 && \
+    cmake .. \
+        -DGPU_ARCHS="53;60;61;70;75;80;86;90" \
+        -DCMAKE_CUDA_ARCHITECTURES="53;60;61;70;75;80;86;90" \
+        -DTRT_LIB_DIR=/usr/lib/x86_64-linux-gnu \
+        -DTRT_BIN_DIR=`pwd`/out \
+        -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.2/bin/nvcc \
+        -DCUDNN_VERSION=8.9 && \
     make -j16 nvinfer_plugin nvinfer_plugin_static && \
-    cp libnvinfer_plugin.so.8.5.3 /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.8.5.3 && \
+    cp libnvinfer_plugin.so.8.6.1 /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.8.6.1 && \
     cp libnvinfer_plugin_static.a /usr/lib/x86_64-linux-gnu/libnvinfer_plugin_static.a && \
     cd ../../../ && \
     rm -rf trt_oss_src
 
 COPY docker/requirements-pip.txt requirements-pip.txt
-RUN pip install --ignore-installed -r requirements-pip.txt \
+# Forcing cython==0.29.36 for pycocotools-fix with python3.10.
+RUN pip install Cython==0.29.36 \
+    && pip install --ignore-installed -r requirements-pip.txt \
     && rm requirements-pip.txt
 
 COPY docker/requirements-pip-pytorch.txt requirements-pip-pytorch.txt
 RUN pip install --ignore-installed --no-deps -r requirements-pip-pytorch.txt \
     && rm requirements-pip-pytorch.txt
 
+COPY docker/requirements-pip-odise.txt requirements-pip-odise.txt
+RUN pip install --ignore-installed --no-deps -r requirements-pip-odise.txt \
+    && rm requirements-pip-odise.txt
+
+# Install mmcv from source for our cuda versions.
+COPY third_party/mmcv/mmcv.patch mmcv.patch
+RUN git clone https://github.com/open-mmlab/mmcv.git \
+    && cd mmcv && git checkout v1.7.1 \
+    && git apply /opt/mmcv.patch \
+    && pip install -r requirements/optional.txt --ignore-installed \
+    && FORCE_CUDA=1 MMCV_WITH_OPS=1 python setup.py install
+
 # Setup user account
 ARG uid=1000
 ARG gid=1000
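A quick, hedged way to sanity-check the rebuilt image from inside the dev container (the expected plugin version follows the `cp` step above; the mmcv check assumes the from-source build succeeded):

```sh
# The rebuilt TensorRT OSS plugin should be the one on the linker path.
ldconfig -p | grep libnvinfer_plugin   # expect libnvinfer_plugin.so.8.6.1

# The from-source mmcv build should import with its CUDA ops available.
python -c "import mmcv; from mmcv.ops import nms; print(mmcv.__version__)"
```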

docker/requirements-pip-odise.txt (+19, new file)
@@ -0,0 +1,19 @@
+# ODISE
+huggingface-hub
+fvcore
+ftfy
+kornia==0.6
+diffdist==0.1
+nltk>=3.6.2
+taming-transformers-rom1504
+importlib-metadata==4.11.3
+flake8-comprehensions
+git+https://github.com/facebookresearch/detectron2.git
+git+https://github.com/openai/CLIP.git@main#egg=clip
+git+https://github.com/cocodataset/panopticapi.git
+yacs>=0.1.8
+iopath==0.1.9
+jmespath
+s3transfer
+pathspec
+black

docker/requirements-pip-pytorch.txt (+3 -2)
@@ -2,12 +2,13 @@ fairscale==0.4.12
 lpips==0.1.4
 lightning-utilities==0.8.0
 mmcls==0.25.0
-mmcv-full -f https://download.openmmlab.com/mmcv/dist/11.4/torch1.11.0/index.html
 pytorch-lightning==1.8.5
 pytorch_metric_learning==1.7.1
 pytorch-msssim
 thop
 timm>=0.9.6.dev0
 torchmetrics==0.10.3
-open-clip-torch[training]==2.20.0
+open-clip-torch[training]==2.23.0
+sentencepiece==0.1.99
 ftfy
+torch-pruning==1.2.2

docker/requirements-pip.txt (+23 -10)
@@ -1,18 +1,24 @@
 addict==2.4.0
 anyconfig==0.9.10
 astroid==2.5.2
+boto3
+botocore
 ccimport==0.4.2
+click==8.0.4
 colored==1.4.4
 cumm-cu114==0.2.8
+cutex==0.2.1
 easydict==1.10
+einops==0.3.2
 faiss-cpu==1.7.2 # TODO: faiss-gpu works better in some cases
 fire==0.5.0
 flake8==6.0.0
 gdown==4.6.4
+gradio==4.3.0
 hydra-core==1.2.0
 imgaug==0.4.0
 imageio==2.26.0
-isort==4.2.5
+isort==4.3.21
 lark==1.1.5
 lazy-import==0.2.2
 lazy_object_proxy==1.5.1
@@ -26,23 +32,25 @@ mypy-extensions==1.0.0
 natsort==8.3.1
 ninja==1.11.1
 nltk==3.8.1
-https://files.pythonhosted.org/packages/02/99/ca518644076d372509d9dff13e85072e65fba273c42da79a344f55bbad48/nvidia_eff-0.6.4-py38-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
-https://files.pythonhosted.org/packages/d1/c2/c14dd8884a5bc05ca07331b3d78a92812eb19e25a625a0b59af8b609a93f/nvidia_eff_tao_encryption-0.1.7-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
+nvidia-eff==0.6.5
+nvidia-eff-tao-encryption==0.1.8
 numpy==1.22.2 # TODO: Update np.float to np.float64 in the coming numpy version
 omegaconf==2.2.2
 # Install onnx-graphsurgeon with extra-index-url
 --extra-index-url https://pypi.ngc.nvidia.com
 onnx-graphsurgeon
 onnx-simplifier==0.4.5
 onnxoptimizer==0.3.8
-onnxruntime>=1.7.0<=1.11.1
+onnxruntime==1.15.1
 onnxsim==0.4.17
-opencv-python==4.5.5.64
+opencv-python==4.8.0.74
 pccm==0.4.6
+pillow==9.5.0
 Polygon3==3.0.8
+protobuf>4.21.0,<5.0
 pyarmor==7.7.4
-pyclipper==1.1.0.post3
-pycocotools-fix==2.0.0.9
+pyclipper
+pycocotools
 pycodestyle==2.10.0
 pycuda==2022.2.2
 pycodestyle==2.10.0
@@ -52,6 +60,7 @@ pyflakes==3.0.1
 pylint==2.2.2
 pynini==2.1.5
 pyquaternion==0.9.9
+pyrr==0.10.3
 PyWavelets==1.4.1
 PyYAML==6.0
 rich==13.3.2
@@ -60,13 +69,17 @@ shapely==1.8.2
 soundfile==0.12.1
 spconv-cu114==2.1.21
 tabulate>=0.9.0
-tensorboardX==2.6
+tensorboardX==2.6.2.2
 terminaltables==3.1.0
 tifffile==2023.2.28
-transformers>=4.8.2
-tokenizers==0.10.3
+# Upgrading transformers due to an error with importlib version checks.
+transformers==4.33.3
+tokenizers==0.12.1
+# Same issue with tqdm.
+tqdm==4.65.0
 ujson==5.5.0
 unidecode==1.2.0
+wandb>=0.12.11
 wget==3.2
 wrapt>=1.11, <1.13.0
 yapf==0.32.0

nvidia_tao_pytorch/core/callbacks/loggers.py (+4 -1)
@@ -14,7 +14,10 @@
 
 """Status Logger callback."""
 
-from collections import Iterable
+try:
+    from collections.abc import Iterable
+except ImportError:
+    from collections import Iterable
 
 from datetime import timedelta
 
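For context: the `collections.Iterable` alias was removed in Python 3.10 (deprecated since 3.3), so the shim above keeps the callback importable on both old and new interpreters. A minimal sketch of the kind of check that depends on this import (the `flatten` helper is illustrative, not from this repo):

```python
try:
    from collections.abc import Iterable  # Python >= 3.3; the alias in `collections` was removed in 3.10
except ImportError:
    from collections import Iterable


def flatten(values):
    """Yield leaf items from arbitrarily nested iterables, treating strings as atomic."""
    for value in values:
        if isinstance(value, Iterable) and not isinstance(value, (str, bytes)):
            yield from flatten(value)
        else:
            yield value


print(list(flatten([1, [2, [3, "ab"]]])))  # -> [1, 2, 3, 'ab']
```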

nvidia_tao_pytorch/core/mmlab/mmclassification/model_params_mapping.py (+11)
@@ -32,4 +32,15 @@
     "gc_vit_base": 1024,
     "gc_vit_large": 1536,
     "gc_vit_large_384": 1536,
+    "faster_vit_0_224": 512,  # FasterViT
+    "faster_vit_1_224": 640,
+    "faster_vit_2_224": 768,
+    "faster_vit_3_224": 1024,
+    "faster_vit_4_224": 1568,
+    "faster_vit_5_224": 2560,
+    "faster_vit_6_224": 2560,
+    "faster_vit_4_21k_224": 1568,
+    "faster_vit_4_21k_384": 1568,
+    "faster_vit_4_21k_512": 1568,
+    "faster_vit_4_21k_768": 1568,
 }}}
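These values appear to be the backbone output feature widths that downstream classification heads consume. A hedged sketch of the lookup pattern (the dict literal copies a few entries from the mapping above; the helper name is hypothetical, not from this repo):

```python
# A few entries copied from the mapping above; leaves are feature dims.
backbone_feature_dims = {
    "gc_vit_large_384": 1536,
    "faster_vit_0_224": 512,
    "faster_vit_4_21k_768": 1568,
}


def head_in_channels(backbone: str) -> int:
    """Return the classifier-head input width for a supported backbone."""
    try:
        return backbone_feature_dims[backbone]
    except KeyError as err:
        raise ValueError(f"Unsupported backbone: {backbone!r}") from err


assert head_in_channels("faster_vit_0_224") == 512
```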

nvidia_tao_pytorch/core/mmlab/mmclassification/utils.py (+5 -1)
@@ -201,7 +201,11 @@ def load_model(model_path, mmcls_config=None, return_ckpt=False):
     Returns:
         Returns the loaded model instance.
     """
-    temp = tempfile.NamedTemporaryFile(suffix='.pth', delete=False)
+    # Forcing delete to close.
+    temp = tempfile.NamedTemporaryFile(
+        suffix='.pth',
+        delete=True
+    )
     tmp_model_path = temp.name
 
     # Remove EMA related items from the state_dict
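The switch to `delete=True` means the temporary `.pth` file is removed as soon as the handle is closed, so sanitized checkpoints no longer accumulate in the temp directory. A minimal sketch of the behavior, independent of this repo (the close-deletes semantics shown hold on POSIX systems):

```python
import os
import tempfile

temp = tempfile.NamedTemporaryFile(suffix=".pth", delete=True)
path = temp.name
assert os.path.exists(path)   # usable while the handle is open
temp.close()                  # with delete=True, closing removes the file
assert not os.path.exists(path)
```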

nvidia_tao_pytorch/cv/__init__.py (+12)
@@ -26,3 +26,15 @@
 from third_party.onnx.utils import _export
 # Monkey Patch ONNX Export to disable onnxscript
 torch.onnx.utils._export = _export
+# Monkey Patch SDPA location
+torch.nn.functional.scaled_dot_product_attention = torch._C._nn._scaled_dot_product_attention  # noqa: pylint: disable=I1101
+
+
+if major_version >= 2:
+    # From https://github.com/pytorch/pytorch/blob/2efe4d809fdc94501fc38bf429e9a8d4205b51b6/torch/utils/tensorboard/_pytorch_graph.py#L384
+    def _node_get(node: torch._C.Node, key: str):  # noqa: pylint: disable=I1101
+        """Gets attributes of a node which is polymorphic over return type."""
+        sel = node.kindOf(key)
+        return getattr(node, sel)(key)
+
+    torch._C.Node.__getitem__ = _node_get  # noqa: pylint: disable=I1101
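The `_node_get` patch restores subscript access on TorchScript graph nodes (`node["attr"]`), which TensorBoard-style graph tracing relies on. A hedged sketch of what the patch enables (the model and input shape are placeholders):

```python
import torch

model = torch.nn.Linear(4, 2)
traced = torch.jit.trace(model, torch.randn(1, 4))

# With the patch applied, node[name] dispatches through kindOf() to the
# correctly-typed accessor for each attribute of the graph node.
for node in traced.graph.nodes():
    for name in node.attributeNames():
        print(node.kind(), name, node[name])
```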

nvidia_tao_pytorch/cv/action_recognition/config/default_config.py (+1)
@@ -92,6 +92,7 @@ class ARTrainExpConfig:
 
     results_dir: Optional[str] = None
     gpu_ids: List[int] = field(default_factory=lambda: [0])
+    num_gpus: int = 1
     resume_training_checkpoint_path: Optional[str] = None
     optim: OptimConfig = OptimConfig()
     num_epochs: int = 10
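The new `num_gpus` field gives experiment specs an explicit GPU count alongside `gpu_ids`. A hedged sketch of setting both consistently through OmegaConf, which is in the container's requirements and commonly backs such dataclass configs (the import path comes from the diff; the override values are placeholders):

```python
from omegaconf import OmegaConf

from nvidia_tao_pytorch.cv.action_recognition.config.default_config import ARTrainExpConfig

cfg = OmegaConf.structured(ARTrainExpConfig)
cfg.num_gpus = 2          # new field introduced by this commit
cfg.gpu_ids = [0, 1]      # keep the index list in step with num_gpus
print(OmegaConf.to_yaml(cfg))
```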
