# Alpaka algorithms and modules in CMSSW

## Introduction

This page documents the Alpaka integration within CMSSW. For more information about Alpaka itself see the [Alpaka documentation](https://alpaka.readthedocs.io/en/latest/).

### Compilation model

The code in `Package/SubPackage/{interface,src,plugins,test}/alpaka` is compiled once for each enabled Alpaka backend. The `ALPAKA_ACCELERATOR_NAMESPACE` macro is substituted with a concrete, backend-specific namespace name in order to guarantee different symbol names for all backends, which allows `cmsRun` to dynamically load any set of the backend libraries.

The source files with `.dev.cc` suffix are compiled with the backend-specific device compiler. The other `.cc` source files are compiled with the host compiler.

The `BuildFile.xml` must contain `<flags ALPAKA_BACKENDS="1"/>` to enable the behavior described above.
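
For illustration, below is a hedged sketch of what such a backend-compiled source file could look like; the file name and the kernel are hypothetical, not existing code. It would be compiled once per enabled backend, each time with `ALPAKA_ACCELERATOR_NAMESPACE` expanding to a different namespace such as `alpaka_serial_sync` or `alpaka_cuda_async`, so each backend gets its own symbols.
```cpp
// Package/SubPackage/plugins/alpaka/ExampleKernel.dev.cc (hypothetical file)
#include <cstdint>

#include <alpaka/alpaka.hpp>

#include "HeterogeneousCore/AlpakaInterface/interface/config.h"

namespace ALPAKA_ACCELERATOR_NAMESPACE {

  // a trivial kernel; the surrounding namespace gives it a distinct symbol per backend
  struct ExampleScaleKernel {
    template <typename TAcc>
    ALPAKA_FN_ACC void operator()(TAcc const& acc, float* data, uint32_t size) const {
      // plain one-dimensional Alpaka indexing with a bounds check
      const uint32_t index = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0u];
      if (index < size) {
        data[index] *= 2.f;
      }
    }
  };

}  // namespace ALPAKA_ACCELERATOR_NAMESPACE
```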

## Overall guidelines

* Minimize explicit blocking synchronization calls
  * Avoid `alpaka::wait()` and non-cached memory buffer allocations
* If you can, use `global::EDProducer` base class
  * If you need per-stream storage
    * For few objects consider using [`edm::StreamCache<T>`](https://twiki.cern.ch/twiki/bin/view/CMSPublic/FWMultithreadedFrameworkGlobalModuleInterface#edm_StreamCacheT) with the global module, or
    * Use `stream::EDProducer`
  * If you need to transfer some data back to host, use `stream::SynchronizingEDProducer`
* All code using `ALPAKA_ACCELERATOR_NAMESPACE` should be placed in `Package/SubPackage/{interface,src,plugins,test}/alpaka` directory
  * Alpaka-dependent code that uses templates instead of the namespace macro can be placed in `Package/SubPackage/interface` directory
* All source files (not headers) using Alpaka device code (such as kernel calls or functions called by kernels) must have the suffix `.dev.cc`, and be placed in the aforementioned `alpaka` subdirectory
* Any code that `#include`s a header from the framework or from `HeterogeneousCore/AlpakaCore` must be separated from the Alpaka device code, and have the usual `.cc` suffix (see the sketch after this list)
  * Some framework headers are allowed to be used in `.dev.cc` files:
    * Any header containing only macros, e.g. `FWCore/Utilities/interface/CMSUnrollLoop.h`, `FWCore/Utilities/interface/stringize.h`
    * `FWCore/Utilities/interface/Exception.h`
    * `FWCore/MessageLogger/interface/MessageLogger.h`, although it is preferred to issue messages only in the `.cc` files
    * `HeterogeneousCore/AlpakaCore/interface/EventCache.h` and `HeterogeneousCore/AlpakaCore/interface/QueueCache.h` can, in principle, be used in `.dev.cc` files, even if there should be little need to use them explicitly
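
To illustrate the separation of device code from framework code, below is a hedged sketch of a thin algorithm interface header; all names are illustrative. Such a header contains no framework includes and no kernel code, so it can be included both from the `.dev.cc` file that defines the kernels and from the plain `.cc` plugin file that includes the framework headers.
```cpp
// Package/SubPackage/plugins/alpaka/ExampleAlgo.h (hypothetical file)
#ifndef Package_SubPackage_plugins_alpaka_ExampleAlgo_h
#define Package_SubPackage_plugins_alpaka_ExampleAlgo_h

#include <cstdint>

#include "HeterogeneousCore/AlpakaInterface/interface/config.h"

namespace ALPAKA_ACCELERATOR_NAMESPACE {

  class ExampleAlgo {
  public:
    // enqueues the kernel(s) asynchronously into the given queue;
    // defined in ExampleAlgo.dev.cc, which is compiled with the device compiler
    void scale(Queue& queue, float* data, uint32_t size) const;
  };

}  // namespace ALPAKA_ACCELERATOR_NAMESPACE

#endif  // Package_SubPackage_plugins_alpaka_ExampleAlgo_h
```
The corresponding `ExampleAlgo.dev.cc` would define `scale()` and the kernel(s) it launches, while the EDProducer in a plain `.cc` file would include the framework headers and call `scale()` with the queue obtained from the framework.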

## Data formats

Data formats, for both Event and EventSetup, should be placed following their usual rules. The Alpaka-specific conventions are
* There must be a host-only flavor of the data format that is either independent of Alpaka, or depends only on Alpaka's Serial backend
  * The host-only data format must be defined in `Package/SubPackage/interface/` directory
  * If the data format is to be serialized (with ROOT), it must be serialized in a way that the on-disk format does not depend on Alpaka, i.e. it can be read without Alpaka
  * For Event data products the ROOT dictionary should be defined in `DataFormats/SubPackage/src/classes{.h,_def.xml}`
    * As usual, the `classes_def.xml` should declare the dictionaries for the data product type `T` and `edm::Wrapper<T>`. These data products can be declared as persistent (default) or transient (`persistent="false"` attribute).
  * For EventSetup data products [the registration macro `TYPELOOKUP_DATA_REG`](https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideHowToRegisterESData) should be placed in `Package/SubPackage/src/ES_<type name>.cc`.
* The device-side data formats are defined in `Package/SubPackage/interface/alpaka/` directory
  * The device-side data format classes should be either templated over the device type, or defined in the `ALPAKA_ACCELERATOR_NAMESPACE` namespace.
  * For host backends (`serial`), the "device-side" data format class must be the same as the aforementioned host-only data format class
    * Use the `ASSERT_DEVICE_MATCHES_HOST_COLLECTION(<device collection type>, <host collection type>);` macro to ensure that; see an example in [TestDeviceCollection.h](../../DataFormats/PortableTestObjects/interface/alpaka/TestDeviceCollection.h) and the sketch below
    * This equality is necessary for the [implicit data transfers](#implicit-data-transfers) to function properly
  * For Event data products the ROOT dictionary should be defined in `DataFormats/SubPackage/src/alpaka/classes_<platform>{.h,_def.xml}`
    * The `classes_<platform>_def.xml` should declare the dictionaries for the data product type `T`, `edm::DeviceProduct<T>`, and `edm::Wrapper<edm::DeviceProduct<T>>`. All these dictionaries must be declared as transient with `persistent="false"` attribute.
    * The list of `<platform>` includes currently: `cuda`, `rocm`
  * For EventSetup data products the registration macro should be placed in `Package/SubPackage/src/alpaka/ES_<type name>.cc`
     * Data products defined in `ALPAKA_ACCELERATOR_NAMESPACE` should use `TYPELOOKUP_ALPAKA_DATA_REG` macro
     * Data products templated over the device type should use `TYPELOOKUP_ALPAKA_TEMPLATED_DATA_REG` macro
* For Event data products the `DataFormats/SubPackage/BuildFile.xml` must contain `<flags ALPAKA_BACKENDS="!serial"/>`
  * unless the package has something that is really specific for `serial` backend that is not generally applicable on host

Note that even though the examples above used the `DataFormats` package for Event data formats, in some circumstances Event data formats may also be defined in other packages. For full details please see [SWGuideCreatingNewProducts](https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCreatingNewProducts).
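
For illustration, a hedged sketch of such a device-side data format header follows; `ExampleSoA`, `ExampleHostCollection` and the `Package/SubPackage` paths are illustrative, and the linked `TestDeviceCollection.h` shows a real example.
```cpp
// Package/SubPackage/interface/alpaka/ExampleDeviceCollection.h (hypothetical file)
#include "DataFormats/Portable/interface/alpaka/PortableCollection.h"
#include "HeterogeneousCore/AlpakaInterface/interface/config.h"

// host-only flavour, assumed to define the ExampleSoA layout and
// ExampleHostCollection = PortableHostCollection<ExampleSoA>
#include "Package/SubPackage/interface/ExampleHostCollection.h"
// plus the header providing the ASSERT_DEVICE_MATCHES_HOST_COLLECTION macro,
// as in the linked TestDeviceCollection.h

namespace ALPAKA_ACCELERATOR_NAMESPACE {
  // device-side flavour; on host backends PortableCollection<T> resolves to the
  // host collection type itself
  using ExampleDeviceCollection = PortableCollection<ExampleSoA>;
}  // namespace ALPAKA_ACCELERATOR_NAMESPACE

// check that on host backends the "device-side" collection is the same type as the host collection
ASSERT_DEVICE_MATCHES_HOST_COLLECTION(ExampleDeviceCollection, ExampleHostCollection);
```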

### Implicit data transfers

Both EDProducers and ESProducers make use of implicit data transfers. In CPU backends these data transfers are omitted, and the host-side and the "device-side" data products are the same.

#### Data copy definitions

The implicit host-to-device and device-to-host copies rely on specializations of the `cms::alpakatools::CopyToDevice` and `cms::alpakatools::CopyToHost` class templates, respectively. These have to be specialized along the lines of
```cpp
#include "HeterogeneousCore/AlpakaInterface/interface/CopyToDevice.h"

namespace cms::alpakatools {
  template<>
  struct CopyToDevice<TSrc> {
    template <typename TQueue>
      requires alpaka::isQueue<TQueue>
    static auto copyAsync(TQueue& queue, TSrc const& hostProduct) -> TDst {
      // code to construct TDst object, and launch the asynchronous memcpy from the host to the device of TQueue
      return ...;
    }
  };
}
```
or
```cpp
#include "HeterogeneousCore/AlpakaInterface/interface/CopyToHost.h"

namespace cms::alpakatools {
  template <>
  struct CopyToHost<TSrc> {
    template <typename TQueue>
      requires alpaka::isQueue<TQueue>
    static auto copyAsync(TQueue& queue, TSrc const& deviceProduct) -> TDst {
      // code to construct TDst object, and launch the asynchronous memcpy from the device of TQueue to the host
      return ...;
    }
  };
}
```
respectively. 

Note that the destination (device-side/host-side) type `TDst` can be different from or the same as the source (host-side/device-side) type `TSrc` as far as the framework is concerned. For example, in the `PortableCollection` model the types are different. The `copyAsync()` member function is easiest to implement as a template over `TQueue`. The framework handles the necessary synchronization between the copy function and the consumer in a non-blocking way.

Both `CopyToDevice` and `CopyToHost` class templates are partially specialized for all `PortableObject` and `PortableCollection` instantiations.
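
For a data product that is not based on `PortableObject` or `PortableCollection` the specialization has to be written by hand. Below is a hedged sketch for a hypothetical product owning a single Alpaka buffer; `ExampleDeviceProduct`, `ExampleHostProduct` and their `data` members are illustrative, while the `cms::alpakatools` buffer helpers and `alpaka::memcpy()` are existing utilities.
```cpp
#include <alpaka/alpaka.hpp>

#include "HeterogeneousCore/AlpakaInterface/interface/CopyToHost.h"
#include "HeterogeneousCore/AlpakaInterface/interface/memory.h"

// hypothetical data formats: plain structs owning one buffer of floats each
template <typename TDev>
struct ExampleDeviceProduct {
  cms::alpakatools::device_buffer<TDev, float[]> data;
};
struct ExampleHostProduct {
  cms::alpakatools::host_buffer<float[]> data;
};

namespace cms::alpakatools {
  template <typename TDev>
  struct CopyToHost<ExampleDeviceProduct<TDev>> {
    template <typename TQueue>
      requires alpaka::isQueue<TQueue>
    static ExampleHostProduct copyAsync(TQueue& queue, ExampleDeviceProduct<TDev> const& src) {
      // allocate the destination in host memory associated to the queue
      // (pinned host memory for the non-host backends)
      ExampleHostProduct dst{make_host_buffer<float[]>(queue, alpaka::getExtentProduct(src.data))};
      // enqueue the asynchronous device-to-host copy; the framework takes care of
      // synchronizing before the host-side product is handed to its consumers
      alpaka::memcpy(queue, dst.data, src.data);
      return dst;
    }
  };
}  // namespace cms::alpakatools
```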


##### Data products with `memcpy()`ed pointers

If the data product in question contains pointers to memory elsewhere within the data product, after the `alpaka::memcpy()` calls in the `copyAsync()` those pointers still point to device memory, and need to be updated. **Such data products are generally discouraged.** Nevertheless, such pointers can be updated without any additional synchronization by implementing a `postCopy()` function in the `CopyToHost` specialization along the lines of the following (extending the `CopyToHost` example [above](#data-copy-definitions))
```cpp
namespace cms::alpakatools {
  template <>
  struct CopyToHost<TSrc> {
    // copyAsync() definition from above

    static void postCopy(TDst& obj) {
      // modify obj
      // any modifications must be such that the postCopy() can be
      // skipped when the obj originates from the host (i.e. on CPU backends)
    }
  };
}
```
The `postCopy()` is called after the operations enqueued in the `copyAsync()` have finished. The code in `postCopy()` must be such that the call to `postCopy()` can be omitted on CPU backends.

Note that for `CopyToDevice` such `postCopy()` functionality is **not** provided. It should be possible to issue a kernel call from the `CopyToDevice::copyAsync()` function to achieve the same effect.


#### EDProducer

In EDProducers for each device-side data product a transfer from the device memory space to the host memory space is registered automatically. The data product is copied only if the job has another EDModule that consumes the host-side data product. For each device-side data product a specialization of `cms::alpakatools::CopyToHost` is required to exist.

In addition, for each host-side data product a transfer from the host memory space to the device memory space is registered automatically **if** a `cms::alpakatools::CopyToDevice` specialization exists. The data product is copied only if the job has another EDModule that consumes the device-side data product.

#### ESProducer

In ESProducers for each host-side data product a transfer from the host memory space to the device memory space (of the backend of the ESProducer) is registered automatically. The data product is copied only if the job has another ESProducer or EDModule that consumes the device-side data product. For each host-side data product a specialization of `cms::alpakatools::CopyToDevice` is required to exist.

### `PortableCollection`

For more information see [`DataFormats/Portable/README.md`](../../DataFormats/Portable/README.md) and [`DataFormats/SoATemplate/README.md`](../../DataFormats/SoATemplate/README.md).


## Modules

### Base classes

The Alpaka-based EDModules should use one of the following base classes (that are defined in the `ALPAKA_ACCELERATOR_NAMESPACE`):

* `global::EDProducer<...>` (`#include "HeterogeneousCore/AlpakaCore/interface/alpaka/global/EDProducer.h"`)
   * A [global EDProducer](https://twiki.cern.ch/twiki/bin/view/CMSPublic/FWMultithreadedFrameworkGlobalModuleInterface) that launches (possibly) asynchronous work
* `stream::EDProducer<...>` (`#include "HeterogeneousCore/AlpakaCore/interface/alpaka/stream/EDProducer.h"`)
   * A [stream EDProducer](https://twiki.cern.ch/twiki/bin/view/CMSPublic/FWMultithreadedFrameworkStreamModuleInterface) that launches (possibly) asynchronous work
* `stream::SynchronizingEDProducer<...>` (`#include "HeterogeneousCore/AlpakaCore/interface/alpaka/stream/SynchronizingEDProducer.h"`)
   * A [stream EDProducer](https://twiki.cern.ch/twiki/bin/view/CMSPublic/FWMultithreadedFrameworkStreamModuleInterface) that may launch (possibly) asynchronous work, and synchronizes the asynchronous work on the device with the host
      * The base class uses the [`edm::ExternalWork`](https://twiki.cern.ch/twiki/bin/view/CMSPublic/FWMultithreadedFrameworkStreamModuleInterface#edm_ExternalWork) for the non-blocking synchronization

The `...` can in principle be any of the module abilities listed in the linked TWiki pages, except `edm::ExternalWork`. The majority of the Alpaka EDProducers should be `global::EDProducer` or `stream::EDProducer`, with `stream::SynchronizingEDProducer` used only in cases where some data needs to be copied from the device to the host, requiring synchronization, for a reason other than copying an Event data product from the device to the host.

New base classes (or other functionality) can be added based on new use cases that come up.

The Alpaka-based ESProducers should use the `ESProducer` base class (`#include "HeterogeneousCore/AlpakaCore/interface/alpaka/ESProducer.h"`).

Note that the constructors of both the Alpaka-based EDProducers and ESProducers must pass the `edm::ParameterSet` argument to the constructor of their base class.

Note that currently Alpaka-based ESSources are not supported. If you need to produce EventSetup data products into a Record for which there is no ESSource yet, use [`EmptyESSource`](https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideEDMParametersForModules#EmptyESSource).


### Event, EventSetup, Records

The Alpaka-based modules have a notion of a _host memory space_ and _device memory space_ for the Event and EventSetup data products. The data products in the host memory space are accessible for non-Alpaka modules, whereas the data products in device memory space are available only for modules of the specific Alpaka backend. The host backend(s) use the host memory space directly.

The EDModules get `device::Event` and `device::EventSetup` from the framework, from which data products in both host memory space and device memory space can be accessed. Data products can also be produced to either memory space. As discussed [above](#edproducer), for each data product produced into the device memory space an implicit data copy from the device memory space to the host memory space is registered, and for each data product produced into the host memory space for which `cms::alpakatools::CopyToDevice` is specialized an implicit data copy from the host memory space to the device memory space is registered. The `device::Event::queue()` returns the Alpaka `Queue` object into which all work in the EDModule must be enqueued.

The ESProducer can have two different `produce()` function signatures
* If the function has the usual `TRecord const&` parameter, the function can read an ESProduct from the host memory space, and produce another product into the host memory space. An implicit copy of the data product from the host memory space to the device memory space (of the backend of the ESProducer) is registered as discussed above.
* If the function has `device::Record<TRecord> const&` parameter, the function can read an ESProduct from the device memory space, and produce another product into the device memory space. No further copies are made by the framework. The `device::Record<TRecord>::queue()` gives the Alpaka `Queue` object into which all work in the ESProducer must be enqueued. 

### Tokens

The memory spaces of the consumed and (in EDProducer case) produced data products are driven by the tokens. The token types to be used in different cases are summarized below. 


|                                                                | Host memory space             | Device memory space              |
|----------------------------------------------------------------|-------------------------------|----------------------------------|
| Access Event data product of type `T`                          | `edm::EDGetTokenT<T>`         | `device::EDGetToken<T>`          |
| Produce Event data product of type `T`                         | `edm::EDPutTokenT<T>`         | `device::EDPutToken<T>`          |
| Access EventSetup data product of type `T` in Record `TRecord` | `edm::ESGetToken<T, TRecord>` | `device::ESGetToken<T, TRecord>` |

With the device memory space tokens the type-deducing `consumes()`, `produces()`, and `esConsumes()` calls must be used (i.e. do not specify the data product type as part of the function call). For more information on these registration functions see
* [`consumes()` in EDModules](https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideEDMGetDataFromEvent#consumes)
* [`produces()` in EDModules](https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCreatingNewProducts#Producing_the_EDProduct)
* [`esConsumes()` in EDModules](https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideHowToGetDataFromES#In_ED_module)
* [`consumes()` in ESProducers](https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideHowToGetDataFromES#In_ESProducer)


### `fillDescriptions()`

In the [`fillDescriptions()`](https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideConfigurationValidationAndHelp) function specifying the module label automatically with the [`edm::ConfigurationDescriptions::addWithDefaultLabel()`](https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideConfigurationValidationAndHelp#Automatic_module_labels_from_plu) is strongly recommended. Currently a `cfi` file is generated for a module for each Alpaka backend such that the backend namespace is explicitly used in the module definition. An additional `cfi` file is generated for the ["module type resolver"](#module-type-resolver-portable) functionality, where the module type has `@alpaka` postfix.

Also note that the `fillDescriptions()` function must have the same content for all backends, i.e. any backend-specific behavior with e.g. `#ifdef` or `if constexpr` is forbidden.

### Copy e.g. configuration data to all devices in EDProducer

While the EventSetup can be used to handle copying data to all devices
of an Alpaka backend, for data used only by one EDProducer a simpler
way would be to use one of
* `cms::alpakatools::MoveToDeviceCache<TDevice, THostObject>` (recommended)
  * `#include "HeterogeneousCore/AlpakaCore/interface/MoveToDeviceCache.h"`
  * Moves the `THostObject` to all devices using `cms::alpakatools::CopyToDevice<THostObject>` synchronously. On host backends the argument `THostObject` is moved around, but not copied.
  * The `THostObject` must not be copyable
    * This is to avoid easy mistakes with objects that follow copy semantics of `std::shared_ptr` (that includes Alpaka buffers), that would allow the source memory buffer to be used via another copy during the asynchronous data copy to the device.
  * The `THostObject` object passed as the constructor argument may not be used afterwards, unless it is initialized again, e.g. by assigning another `THostObject` to it.
  * The corresponding device-side object can be obtained with `get()` member function using either alpaka Device or Queue object. It can be used immediately after the constructor returns.
* `cms::alpakatools::CopyToDeviceCache<TDevice, THostObject>` (use only if **must** use copyable `THostObject`)
  * `#include "HeterogeneousCore/AlpakaCore/interface/CopyToDeviceCache.h"`
  * Copies the `THostObject` to all devices using `cms::alpakatools::CopyToDevice<THostObject>` synchronously. Host backends do a copy as well.
  * The `THostObject` object passed as the constructor argument can be used for other purposes immediately after the constructor returns
  * The corresponding device-side object can be obtained with `get()` member function using either alpaka Device or Queue object. It can be used immediately after the constructor returns.

For examples see [`HeterogeneousCore/AlpakaTest/plugins/alpaka/TestAlpakaGlobalProducerCopyToDeviceCache.cc`](../../HeterogeneousCore/AlpakaTest/plugins/alpaka/TestAlpakaGlobalProducerCopyToDeviceCache.cc) and [`HeterogeneousCore/AlpakaTest/plugins/alpaka/TestAlpakaGlobalProducerMoveToDeviceCache.cc`](../../HeterogeneousCore/AlpakaTest/plugins/alpaka/TestAlpakaGlobalProducerMoveToDeviceCache.cc).
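
As a complement to those examples, below is a hedged sketch of using `MoveToDeviceCache` in a `global::EDProducer`; the `ExampleConfigHostObject` type, the `makeConfigObject()` helper and the producer itself are illustrative, not existing code.
```cpp
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/StreamID.h"
#include "HeterogeneousCore/AlpakaCore/interface/MoveToDeviceCache.h"
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/Event.h"
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/EventSetup.h"
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/global/EDProducer.h"
#include "HeterogeneousCore/AlpakaInterface/interface/config.h"
// ExampleConfigHostObject (a move-only host-side object with a
// cms::alpakatools::CopyToDevice specialization) and the makeConfigObject()
// helper are assumed to be declared elsewhere (hypothetical)

namespace ALPAKA_ACCELERATOR_NAMESPACE {

  class ExampleCachedConfigProducer : public global::EDProducer<> {
  public:
    ExampleCachedConfigProducer(edm::ParameterSet const& iConfig)
        : EDProducer<>(iConfig),
          // the cache moves the host object in and synchronously copies it to
          // every device of this backend
          deviceCache_{makeConfigObject(iConfig)} {}

    void produce(edm::StreamID, device::Event& iEvent, device::EventSetup const&) const override {
      // device-side copy corresponding to the device of this event's queue
      auto const& deviceConfig = deviceCache_.get(iEvent.queue());
      // ... enqueue kernels that use deviceConfig into iEvent.queue() ...
    }

    // fillDescriptions() and DEFINE_FWK_ALPAKA_MODULE(...) omitted for brevity

  private:
    // the Device alias comes from HeterogeneousCore/AlpakaInterface/interface/config.h
    cms::alpakatools::MoveToDeviceCache<Device, ExampleConfigHostObject> deviceCache_;
  };

}  // namespace ALPAKA_ACCELERATOR_NAMESPACE
```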

## Guarantees

* All Event data products in the device memory space are guaranteed to be accessible only for operations enqueued in the `Queue` given by `device::Event::queue()` when accessed through the `device::Event`.
* All Event data products in the host memory space are guaranteed to be accessible for all operations (after the data product has been obtained from the `edm::Event` or `device::Event`).
* All EventSetup data products in the device memory space are guaranteed to be accessible only for operations enqueued in the `Queue` given by `device::Event::queue()` when accessed via the `device::EventSetup` (ED modules), or by `device::Record<TRecord>::queue()` when accessed via the `device::Record<TRecord>` (ESProducers).
* The EDM Stream does not proceed to the next Event until after all asynchronous work of the current Event has finished.
  * **Note**: this implies that if an EDProducer uses `device::Event::queue()` or gets a device-side data product in its `produce()` function, and does not produce any device-side data products, the `produce()` call will be synchronous (i.e. it will block the CPU thread until the asynchronous work finishes)

## Examples

For concrete examples see code in [`HeterogeneousCore/AlpakaTest`](../../HeterogeneousCore/AlpakaTest) and [`DataFormats/PortableTestObjects`](../../DataFormats/PortableTestObjects).

### EDProducer

This example shows a mixture of behavior from test code in [`HeterogeneousCore/AlpakaTest/plugins/alpaka/`](../../HeterogeneousCore/AlpakaTest/plugins/alpaka/)
```cpp
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/EDGetToken.h"
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/EDPutToken.h"
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/ESGetToken.h"
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/Event.h"
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/EventSetup.h"
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/global/EDProducer.h"
#include "HeterogeneousCore/AlpakaInterface/interface/config.h"
// + usual #includes for the used framework components, data format(s), record(s)

// Module must be defined in ALPAKA_ACCELERATOR_NAMESPACE
namespace ALPAKA_ACCELERATOR_NAMESPACE {

  // Base class is defined in ALPAKA_ACCELERATOR_NAMESPACE as well (note, no edm:: prefix!)
  class ExampleAlpakaProducer : public global::EDProducer<> {
  public:
    ExampleAlpakaProducer(edm::ParameterSet const& iConfig)
        : EDProducer<>(iConfig),
          // the consumed product types are deduced from the token types;
          // the "sourceHost"/"sourceDevice" parameter names are illustrative
          getTokenHost_{consumes(iConfig.getParameter<edm::InputTag>("sourceHost"))},
          getTokenDevice_{consumes(iConfig.getParameter<edm::InputTag>("sourceDevice"))},
          esGetTokenDevice_{esConsumes()},
          // produces() must not specify the product type, it is deduced from deviceToken_
          deviceToken_{produces()},
          size_{iConfig.getParameter<int32_t>("size")} {}

    // device::Event and device::EventSetup are defined in ALPAKA_ACCELERATOR_NAMESPACE as well
    void produce(edm::StreamID sid, device::Event& iEvent, device::EventSetup const& iSetup) const override {
      // get input data products
      auto const& hostInput = iEvent.get(getTokenHost_);
      auto const& deviceInput = iEvent.get(getTokenDevice_);
      auto const& deviceESData = iSetup.getData(esGetTokenDevice_);
    
      // run the algorithm, potentially asynchronously
      portabletest::TestDeviceCollection deviceProduct{size_, iEvent.queue()};
      algo_.fill(iEvent.queue(), hostInput, deviceInput, deviceESData, deviceProduct);

      // put the asynchronous product into the event without waiting
      // must use EDPutToken with emplace() or put()
      //
      // for a product produced with device::EDPutToken<T> the base class registers
      // a separately scheduled transformation function for the copy to host;
      // the transformation function calls the
      // cms::alpakatools::CopyToHost<portabletest::TestDeviceCollection>::copyAsync(Queue&, portabletest::TestDeviceCollection const&)
      // function
      iEvent.emplace(deviceToken_, std::move(deviceProduct));
    }

    static void fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
      // All backends must have exactly the same fillDescriptions() content!
      edm::ParameterSetDescription desc;
      desc.add<int32_t>("size");
      descriptions.addWithDefaultLabel(desc);
    }

  private:
    // use edm::EDGetTokenT<T> to read from host memory space
    edm::EDGetTokenT<FooProduct> const getTokenHost_;
    
    // use device::EDGetToken<T> to read from device memory space
    device::EDGetToken<BarProduct> const getTokenDevice_;

    // use device::ESGetToken<T, TRecord> to read from device memory space
    device::ESGetToken<TestProduct, TestRecord> const esGetTokenDevice_;

    // use device::EDPutToken<T> to place the data product in the device memory space
    device::EDPutToken<portabletest::TestDeviceCollection> const deviceToken_;
    int32_t const size_;

    // implementation of the algorithm
    TestAlgo algo_;
  };

}  // namespace ALPAKA_ACCELERATOR_NAMESPACE

#include "HeterogeneousCore/AlpakaCore/interface/MakerMacros.h"
DEFINE_FWK_ALPAKA_MODULE(TestAlpakaProducer);

```

### ESProducer to reformat an existing ESProduct for use in device

```cpp
// Module must be defined in ALPAKA_ACCELERATOR_NAMESPACE
namespace ALPAKA_ACCELERATOR_NAMESPACE {

  // Base class is defined in ALPAKA_ACCELERATOR_NAMESPACE as well (note, no edm:: prefix!)
  class ExampleAlpakaESProducer : public ESProducer {
  public:
    ExampleAlpakaESProducer(edm::ParameterSet const& iConfig) : ESProducer(iConfig) {
      // register the production function
      auto cc = setWhatProduced(this);
      // register consumed ESProduct(s)
      token_ = cc.consumes();
    }

    static void fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
      // All backends must have exactly the same fillDescriptions() content!
      edm::ParameterSetDescription desc;
      descriptions.addWithDefaultLabel(desc);
    }

    // return type can be
    // - std::optional<T> (T is cheap to move),
    // - std::unique_ptr<T> (T is not cheap to move),
    // - std::shared_ptr<T> (allows sharing between IOVs)
    //
    // the base class registers a separately scheduled function to copy the product to device memory
    // the function calls
    // cms::alpakatools::CopyToDevice<SimpleProduct>::copyAsync(Queue&, SimpleProduct const&)
    // function
    std::optional<SimpleProduct> produce(TestRecord const& iRecord) {
      // get input data
      auto const& hostInput = iRecord.get(token_);

      // allocate data product on the host memory
      SimpleProduct hostProduct;

      // fill the hostProduct from hostInput

      return hostProduct;
    }

  private:
    edm::ESGetToken<TestProduct, TestRecord> token_;
  };
}  // namespace ALPAKA_ACCELERATOR_NAMESPACE

#include "HeterogeneousCore/AlpakaCore/interface/alpaka/ModuleFactory.h"
DEFINE_FWK_EVENTSETUP_ALPAKA_MODULE(ExampleAlpakaESProducer);
```

### ESProducer to derive a new ESProduct from an existing device-side ESProduct

```cpp
// Module must be defined in ALPAKA_ACCELERATOR_NAMESPACE
namespace ALPAKA_ACCELERATOR_NAMESPACE {

  // Base class is defined in ALPAKA_ACCELERATOR_NAMESPACE as well (note, no edm:: prefix!)
  class ExampleAlpakaDeriveESProducer : public ESProducer {
  public:
    ExampleAlpakaDeriveESProducer(edm::ParameterSet const& iConfig) : ESProducer(iConfig) {
      // register the production function
      auto cc = setWhatProduced(this);
      // register consumed ESProduct(s)
      token_ = cc.consumes();
    }

    static void fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
      // All backends must have exactly the same fillDescriptions() content!
      edm::ParameterSetDescription desc;
      descriptions.addWithDefaultLabel(desc);
    }

    std::optional<OtherProduct> produce(device::Record<TestRecord> const& iRecord) {
      // get input data in the device memory space
      auto const& deviceInput = iRecord.get(token_);

      // allocate data product on the device memory
      OtherProduct deviceProduct(iRecord.queue());

      // run the algorithm, potentially asynchronously
      algo_.fill(iRecord.queue(), deviceInput, deviceProduct);

      // return the product without waiting
      return deviceProduct;
    }

  private:
    device::ESGetToken<SimpleProduct, TestRecord> token_;
    
    OtherAlgo algo_;
  };
}  // namespace ALPAKA_ACCELERATOR_NAMESPACE

#include "HeterogeneousCore/AlpakaCore/interface/alpaka/ModuleFactory.h"
DEFINE_FWK_EVENTSETUP_ALPAKA_MODULE(ExampleAlpakaDeriveESProducer);
```

## Configuration

There are a few different options for using Alpaka-based modules in the CMSSW configuration.

In all cases the configuration must load the necessary `ProcessAccelerator` objects (see below). For accelerators used in production, these are aggregated in `Configuration.StandardSequences.Accelerators_cff`. The `runTheMatrix.py` handles the loading of this `Accelerators_cff` automatically. The HLT menus also load the necessary `ProcessAccelerator`s.
```python
## Load explicitly
# One ProcessAccelerator for each accelerator technology, plus a generic one for Alpaka
process.load("Configuration.StandardSequences.Accelerators_cff")
```

### Explicit module type (non-portable)

The Alpaka modules can be used in the python configuration with their explicit, full type names
```python
process.producerCPU = cms.EDProducer("alpaka_serial_sync::ExampleAlpakaProducer", ...)
process.producerGPU = cms.EDProducer("alpaka_cuda_async::ExampleAlpakaProducer", ...)
```
Obviously this kind of configuration can be run only on machines that provide the necessary hardware. The configuration is thus explicitly non-portable.


### SwitchProducerCUDA (semi-portable)

A step towards a portable configuration is to use the `SwitchProducer` mechanism, for which currently the only concrete implementation is [`SwitchProducerCUDA`](../../HeterogeneousCore/CUDACore/README.md#automatic-switching-between-cpu-and-gpu-modules). The modules for different Alpaka backends still need to be specified explicitly
```python
from HeterogeneousCore.CUDACore.SwitchProducerCUDA import SwitchProducerCUDA
process.producer = SwitchProducerCUDA(
    cpu = cms.EDProducer("alpaka_serial_sync::ExampleAlpakaProducer", ...),
    cuda = cms.EDProducer("alpaka_cuda_async::ExampleAlpakaProducer", ...)
)

# or

process.producer = SwitchProducerCUDA(
    cpu = cms.EDAlias(producerCPU = cms.EDAlias.allProducts()),
    cuda = cms.EDAlias(producerGPU = cms.EDAlias.allProducts())
)
```
This kind of configuration can be run on any machine that a given CMSSW build supports, but it is limited to CMSSW builds where the modules for all the Alpaka backends declared in the configuration can be built (`alpaka_serial_sync` and `alpaka_cuda_async` in this example). Therefore the `SwitchProducer` approach is here called "semi-portable".

### Module type resolver (portable)

A fully portable way to express a configuration can be achieved with "module type resolver" approach. The module is specified in the configuration without the backend-specific namespace, and with `@alpaka` postfix
```python
process.producer = cms.EDProducer("ExampleAlpakaProducer@alpaka", ...)

# backend can also be set explicitly
process.producerCPU = cms.EDProducer("ExampleAlpakaProducer@alpaka",
    ...
    alpaka = cms.untracked.PSet(
        backend = cms.untracked.string("serial_sync")
    )
)
```
The `@alpaka` postfix in the module type tells the system that the module's exact class type should be resolved at run time. The type (or backend) is set according to the value of `process.options.accelerators` and the set of accelerators available in the machine. If the backend is set explicitly in the module's `alpaka` PSet, the module of that backend will be used.

This approach is portable also across CMSSW builds that support different sets of accelerators, as long as only the host backends (if any) are specified explicitly in the `alpaka` PSet.


#### Examples on explicitly setting the backend

##### For individual modules

The explicitly-set backend must be one of those allowed by the job-wide `process.options.accelerators` setting. This setting overrides the `ProcessAcceleratorAlpaka` setting described in the next paragraph.

```python
process.producerCPU = cms.EDProducer("ExampleAlpakaProducer@alpaka",
    ...
    alpaka = cms.untracked.PSet(
        backend = cms.untracked.string("serial_sync") # or "cuda_async" or "rocm_async"
    )
)
```

##### For all Alpaka modules

The explicitly-set backend must be one of those allowed by the job-wide `process.options.accelerators` setting. This `ProcessAcceleratorAlpaka` setting can be further overridden for individual modules as described in the previous paragraph.

```python
process.ProcessAcceleratorAlpaka.setBackend("serial_sync") # or "cuda_async" or "rocm_async"
```

##### For entire job (i.e. also for non-Alpaka modules)
```python
process.options.accelerators = ["cpu"] # or "gpu-nvidia" or "gpu-amd"
```

### Blocking synchronization (for testing)

While the general approach is to favor asynchronous operations with non-blocking synchronization, for testing purposes it can be useful to synchronize the EDModule's `acquire()` / `produce()` or ESProducer's production functions in a blocking way. Such a blocking synchronization can be specified for individual modules via the `alpaka` `PSet` along the lines of
```python
process.producer = cms.EDProducer("ExampleAlpakaProducer@alpaka",
    ...
    alpaka = cms.untracked.PSet(
        synchronize = cms.untracked.bool(True)
    )
)
```

The blocking synchronization can be specified for all Alpaka modules via the `ProcessAcceleratorAlpaka` along the lines of
```python
process.ProcessAcceleratorAlpaka.setSynchronize(True)
```
Note that a per-module `synchronize` parameter, if set, overrides this global setting.


## Unit tests

Unit tests that depend on Alpaka and define `<flags ALPAKA_BACKENDS="1"/>`, e.g. as a binary along the lines of
```xml
<bin name="<unique test binary name>" file="<comma-separated list of files>">
  <use name="alpaka"/>
  <flags ALPAKA_BACKENDS="1"/>
</bin>
```
or as a command (e.g. `cmsRun` or a shell script) to run

```xml
<test name="<unique name of the test>" command="<command to run>">
  <use name="alpaka"/>
  <flags ALPAKA_BACKENDS="1"/>
</test>
```

will be run as part of `scram build runtests` according to the
availability of the hardware:
- `serial_sync` version is run always
- `cuda_async` version is run if NVIDIA GPU is present (i.e. `cudaIsEnabled` returns 0)
- `rocm_async` version is run if AMD GPU is present (i.e. `rocmIsEnabled` returns 0)

Tests for a specific backend (or hardware) can be explicitly requested to run by setting the `USER_UNIT_TESTS=cuda` or `USER_UNIT_TESTS=rocm` environment variable. In that case tests not depending on that hardware are skipped, and if the corresponding hardware is not available, the tests will fail.