# SONIC for Triton Inference Server

## Introduction to Triton

Triton Inference Server ([docs](https://docs.nvidia.com/deeplearning/triton-inference-server/archives/triton_inference_server_1130/user-guide/docs/index.html), [repo](https://github.com/NVIDIA/triton-inference-server))
is an open-source product from Nvidia that facilitates the use of GPUs as a service to process inference requests.

Triton supports multiple named inputs and outputs with different types. The allowed types are:
boolean, unsigned integer (8, 16, 32, or 64 bits), integer (8, 16, 32, or 64 bits), floating point (16, 32, or 64 bits), or string.

Triton additionally supports inputs and outputs with multiple dimensions, some of which might be variable (denoted by -1).
Concrete values for variable dimensions must be specified for each entry (see [Batching](#batching) below).

## Client

Accordingly, the `TritonClient` input and output types are:
* input: `TritonInputMap = std::unordered_map<std::string, TritonInputData>`
* output: `TritonOutputMap = std::unordered_map<std::string, TritonOutputData>`

`TritonInputData` and `TritonOutputData` are classes that store information about their relevant dimensions and types
and facilitate conversion of data sent to or received from the server.
They are stored by name in the input and output maps.
The consistency of dimension and type information (received from server vs. provided by user) is checked at runtime.
The model information from the server can be printed by enabling `verbose` output in the `TritonClient` configuration.

`TritonClient` takes several parameters:
* `modelName`: name of model with which to perform inference
* `modelVersion`: version number of model (default: -1, use latest available version on server)
* `modelConfigPath`: path to `config.pbtxt` file for the model (using `edm::FileInPath`)
* `preferredServer`: name of preferred server, for testing (see [Services](#services) below)
* `timeout`: maximum allowed time for a request (disabled with 0)
* `timeoutUnit`: seconds, milliseconds, or microseconds (default: seconds)
* `outputs`: optional, specify which output(s) the server should send
* `verbose`: enable verbose printouts (default: false)
* `useSharedMemory`: enable use of shared memory (see [below](#shared-memory)) with local servers (default: true)
* `compression`: enable compression of input and output data to reduce bandwidth (using gzip or deflate) (default: none)

### Batching

SonicTriton supports two types of batching, rectangular and ragged, depicted below:
![batching diagrams](./doc/batching_diagrams.png)
In the rectangular case, the inputs for each object in an event have the same shape, so they can be combined into a single entry.
(In this case, the batch size is specified as the "outer dimension" of the shape.)
In the ragged case, the inputs for each object in an event do not have the same shape, so they cannot be combined;
instead, they are represented internally as separate entries, each with its own shape specified explicitly.

The batch size is set and accessed using the client, in order to ensure a consistent value across all inputs.
The batch mode can also be changed manually, in order to allow optimizing the allocation of entries.
(If two entries with different shapes are specified, the batch mode will always automatically switch to ragged.)
* `setBatchSize()`: set a new batch size
  * some models may not support batching
* `batchSize()`: return current batch size
* `setBatchMode()`: set the batch mode (`Rectangular` or `Ragged`)
* `batchMode()`: get the current batch mode

Useful `TritonData` accessors include:
* `variableDims()`: return true if any variable dimensions
* `sizeDims()`: return product of dimensions (-1 if any variable dimensions)
* `shape(unsigned entry=0)`: return actual shape (list of dimensions) for specified entry
* `sizeShape(unsigned entry=0)`: return product of shape dimensions (returns `sizeDims()` if no variable dimensions) for specified entry
* `byteSize()`: return number of bytes for data type
* `dname()`: return name of data type

To update the `TritonData` shape in the variable-dimension case (see the sketch after this list):
* `setShape(const std::vector<int64_t>& newShape, unsigned entry=0)`: update all (variable) dimensions with values provided in `newShape` for specified entry
* `setShape(unsigned loc, int64_t val, unsigned entry=0)`: update variable dimension at `loc` with `val` for specified entry
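
For example, inside the `acquire()` method of a module (see [Modules](#modules) below), the batch size and a variable dimension might be set as in the following minimal sketch; the input name `"features"`, the variable `nObjects`, and the dimension value are assumptions made only for illustration:

```cpp
// minimal sketch (fragment of an acquire() implementation); names and sizes are placeholders
client_->setBatchSize(nObjects);      // one batch entry per object (rectangular case)
auto& input = iInput.at("features");  // hypothetical input name
if (input.variableDims())
  input.setShape(0, 10);              // give variable dimension 0 a concrete value
```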

### I/O types

There are specific local input and output containers that should be used in producers.
Here, `T` is a primitive type, and the two aliases listed below are passed to `TritonInputData::toServer()`
and returned by `TritonOutputData::fromServer()`, respectively:
* `TritonInputContainer<T> = std::shared_ptr<TritonInput<T>> = std::shared_ptr<std::vector<std::vector<T>>>`
* `TritonOutput<T> = std::vector<edm::Span<const T*>>`

The `TritonInputContainer` object should be created using the helper function described below.
It expects one vector per batch entry, i.e. the size of the outer vector is the batch size (rectangular case) or the number of entries (ragged case).
Therefore, it is best to call `TritonClient::setBatchSize()`, if necessary, before calling the helper.
It will also reserve the expected size of the input in each inner vector (by default),
if the concrete shape is available (i.e. `setShape()` was already called, if the input has variable dimensions).
* `allocate<T>()`: return a `TritonInputContainer` properly allocated for the batch and input sizes (see the sketch below)
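
Continuing the fragment above, a hedged sketch of allocating, filling, and sending one float input (the fill values are placeholders):

```cpp
// sketch: allocate the input container, fill one inner vector per batch entry, and send it
auto data = input.allocate<float>();     // outer size = batch size; inner vectors pre-reserved if the shape is known
for (auto& entry : *data)
  entry.assign(input.sizeShape(), 0.f);  // placeholder values, one per element of the entry's shape
input.toServer(data);                    // convert the data for the inference request
```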

### Shared memory

If the local fallback server (see [Services](#services) below) is in use,
input and output data can be transferred via shared memory rather than gRPC.
Both CPU and GPU (CUDA) shared memory are supported.
This is more efficient for some algorithms;
if shared memory is not more efficient for an algorithm, it can be disabled in the Python configuration for the client.

For outputs, shared memory can only be used if the batch size and concrete shape are known in advance,
because the shared memory region for the output must be registered before the inference call is made.
As with the inputs, this is handled automatically, and the use of shared memory can be disabled if desired.

## Modules

SONIC Triton supports producers, filters, and analyzers.
New modules should inherit from `TritonEDProducer`, `TritonEDFilter`, or `TritonOneEDAnalyzer`.
These follow essentially the same patterns described in [SonicCore](../SonicCore#for-analyzers).
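
A minimal producer skeleton, assuming the `TritonEDProducer` interface described here (the class name and the event product are placeholders, and constructor details may differ between CMSSW versions), could look like:

```cpp
// sketch of a Triton producer; names are hypothetical, not part of this package
#include "HeterogeneousCore/SonicTriton/interface/TritonEDProducer.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include <vector>

class MyTritonProducer : public TritonEDProducer<> {
public:
  explicit MyTritonProducer(edm::ParameterSet const& cfg) : TritonEDProducer<>(cfg) {
    produces<std::vector<float>>();  // example event product
  }
  // fill the input data and send it to the server (before the inference call)
  void acquire(edm::Event const& iEvent, edm::EventSetup const& iSetup, Input& iInput) override;
  // retrieve the results and fill the event products (after the inference call)
  void produce(edm::Event& iEvent, edm::EventSetup const& iSetup, Output const& iOutput) override;
};

DEFINE_FWK_MODULE(MyTritonProducer);
```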

If an `edm::GlobalCache` of type `T` is needed, there are two changes:
* The new module should inherit from `TritonEDProducerT<T>` or `TritonEDFilterT<T>`
* The new module should contain these lines:
```cpp
static std::unique_ptr<T> initializeGlobalCache(edm::ParameterSet const& pset) {
  TritonEDProducerT<T>::initializeGlobalCache(pset);
  [module-specific code goes here]
}
```

For `TritonEDProducer` and `TritonEDFilter`, the function `tritonEndStream()` replaces the standard `endStream()`.
For `TritonOneEDAnalyzer`, the function `tritonEndJob()` replaces the standard `endJob()`.

In a SONIC Triton producer, the basic flow should follow this pattern (a sketch of the `produce()` step is given after the list):
1. `acquire()`:
   a. access input object(s) from `TritonInputMap`
   b. allocate input data using `allocate<T>()`
   c. fill input data
   d. set input shape(s) (optional for rectangular case, only if any variable dimensions; required for ragged case)
   e. convert using `toServer()` function of input object(s)
2. `produce()`:
   a. access output object(s) from `TritonOutputMap` (includes shapes)
   b. obtain output data as `TritonOutput<T>` using `fromServer()` function of output object(s)
   c. fill output products
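
The `acquire()` steps correspond to the fragments shown in the [Batching](#batching) and [I/O types](#io-types) sections above. A matching sketch of the `produce()` side, continuing the hypothetical `MyTritonProducer` skeleton (the output name `"score"` and the flat `std::vector<float>` product are assumptions for illustration):

```cpp
void MyTritonProducer::produce(edm::Event& iEvent, edm::EventSetup const& iSetup, Output const& iOutput) {
  // 2a: access the output object from the map (its shape is already filled in)
  const auto& output = iOutput.at("score");  // hypothetical output name
  // 2b: retrieve the results; TritonOutput<float> holds one span of values per entry
  const auto& result = output.fromServer<float>();
  // 2c: fill the event product from the spans
  auto scores = std::make_unique<std::vector<float>>();
  for (const auto& entry : result)
    scores->insert(scores->end(), entry.begin(), entry.end());
  iEvent.put(std::move(scores));
}
```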

## Services

### `cmsTriton`

A script [`cmsTriton`](./scripts/cmsTriton) is provided to launch and manage local servers.
The script has three operations (`start`, `stop`, `check`) and the following options:
* `-c`: don't clean up temporary dir (for debugging)
* `-C [dir]`: directory containing Nvidia compatibility drivers (checks CMSSW_BASE by default if available)
* `-D`: dry run: print container commands rather than executing them
* `-d [exe]`: container choice: apptainer, docker, podman, podman-hpc (default: apptainer)
* `-E [path]`: include extra path(s) for executables (default: /cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin)
* `-f`: force reuse of (possibly) existing container instance
* `-g [device]`: device choice: auto (try to detect GPU), CPU, GPU (default: auto)
* `-i [name]`: server image name (default: fastml/triton-torchgeo:22.07-py3-geometric)
* `-I [num]`: number of model instances (default: 0 -> means no local editing of config files)
* `-M [dir]`: model repository (can be given more than once)
* `-m [dir]`: specific model directory (can be given more than once)
* `-n [name]`: name of container instance, also used for hidden temporary dir (default: triton_server_instance)
* `-P [port]`: base port number for services (-1: automatically find an unused port range) (default: 8000)
* `-p [pid]`: automatically shut down server when process with specified PID ends (-1: use parent process PID)
* `-r [num]`: number of retries when starting container (default: 3)
* `-s [dir]`: Apptainer sandbox directory (default: /cvmfs/unpacked.cern.ch/registry.hub.docker.com/fastml/triton-torchgeo:22.07-py3-geometric)
* `-t [dir]`: non-default hidden temporary dir
* `-v`: (verbose) start: activate server debugging info; stop: keep server logs
* `-w [time]`: maximum time to wait for server to start (default: 300 seconds)
* `-h`: print help message and exit

Additional details and caveats:
* The `start` and `stop` operations for a given container instance should always be executed in the same directory
if a relative path is used for the hidden temporary directory (including the default from the container instance name),
in order to ensure that everything is properly cleaned up.
* The `check` operation just checks if the server can run on the current system, based on driver compatibility.
* A model repository is a folder that contains multiple model directories, while a model directory contains the files for a specific model.
(In the example below, `$CMSSW_BASE/src/HeterogeneousCore/SonicTriton/data/models` is a model repository,
while `$CMSSW_BASE/src/HeterogeneousCore/SonicTriton/data/models/resnet50_netdef` is a model directory.)
If a model repository is provided, all of the models it contains will be provided to the server.
* Older versions of Apptainer (Singularity) have a short timeout that may cause launching the server to fail the first time the command is executed.
The `-r` (retry) flag exists to work around this issue.

### `cmsTritonConfigTool`

The `config.pbtxt` files used for model configuration are written in the protobuf text format.
To ease modification of these files, a dedicated Python tool [`cmsTritonConfigTool`](./scripts/cmsTritonConfigTool) is provided.
The tool has several modes of operation (each with its own options, which can be viewed using `--help`):
* `schema`: displays all field names and types for the Triton ModelConfig message class.
* `view`: displays the field values from a provided `config.pbtxt` file.
* `edit`: allows changing any field value in a `config.pbtxt` file. Non-primitive types are specified using JSON format.
* `checksum`: checks and updates checksums for model files (to enforce versioning).
* `versioncheck`: checks and updates checksums for all `config.pbtxt` files in `$CMSSW_SEARCH_PATH`.
* `threadcontrol`: adds job- and ML framework-specific thread control settings.

The `edit` mode is intended for generic modifications, and only supports overwriting existing values
(not modifying, removing, deleting, etc.).
Additional dedicated modes, like `checksum` and `threadcontrol`, can easily be added for more complicated tasks.

### `TritonService`

A central `TritonService` is provided to keep track of all available servers and which models they can serve.
The servers will automatically be assigned to clients at startup.
If some models are not served by any server, the `TritonService` can launch a fallback server using the `cmsTriton` script described above.
If the process modifiers `enableSonicTriton` or `allSonicTriton` are activated,
the fallback server will launch automatically if needed and will use a local GPU if one is available.
If the fallback server uses CPU, clients that use the fallback server will automatically be set to `Sync` mode.

Servers have several available parameters:
* `name`: unique identifier for each server (clients use this when specifying preferred server; also used internally by `TritonService`)
* `address`: web address for server
* `port`: port number for server (Triton server provides gRPC service on port 8001 by default)
* `useSsl`: connect to server via SSL (default: false)
* `rootCertificates`: for SSL, name of file containing PEM encoding of server root certificates, if any
* `privateKey`: for SSL, name of file containing PEM encoding of user's private key, if any
* `certificateChain`: for SSL, name of file containing PEM encoding of user's certificate chain, if any

The fallback server has a separate set of options, mostly related to the invocation of `cmsTriton`:
* `enable`: enable the fallback server
* `debug`: enable debugging (equivalent to `-c` in `cmsTriton`)
* `verbose`: enable verbose output in logs (equivalent to `-v` in `cmsTriton`)
* `container`: container choice (equivalent to `-d` in `cmsTriton`)
* `device`: device choice (equivalent to `-g` in `cmsTriton`)
* `retries`: number of retries when starting container (passed to `-r [num]` in `cmsTriton` if >= 0; default: -1)
* `wait`: maximum time to wait for server to start (passed to `-w [time]` in `cmsTriton` if >= 0; default: -1)
* `instanceBaseName`: base name for server instance if random names are enabled (default: triton_server_instance)
* `instanceName`: specific name for server instance as alternative to random name (passed to `-n [name]` in `cmsTriton` if non-empty)
* `tempDir`: specific name for server temporary directory (passed to `-t [dir]` in `cmsTriton` if non-empty; if `"."`, uses `cmsTriton` default)
* `imageName`: server image name (passed to `-i [name]` in `cmsTriton` if non-empty)
* `sandboxName`: Apptainer sandbox directory (passed to `-s [dir]` in `cmsTriton` if non-empty)

## Examples

Several example producers can be found in the [test](./test) directory.

## Legend

The SonicTriton documentation uses different terms than Triton itself for certain concepts.
The SonicTriton:Triton correspondence for those terms is given here:
* Entry : request
* Rectangular batching : Triton-supported batching