ACCL

ACCL::ACCL

class ACCL

Main ACCL class that talks to the CCLO on hardware or emulation/simulation.

Public Functions

ACCL(const std::vector<rank_t> &ranks, int local_rank, xrt::device &device, xrt::ip &cclo_ip, xrt::kernel &hostctrl_ip, int devicemem, const std::vector<int> &rxbufmem, int networkmem, networkProtocol protocol = networkProtocol::TCP, int nbufs = 16, addr_t bufsize = 1024, const arithConfigMap &arith_config = DEFAULT_ARITH_CONFIG)

Construct a new ACCL object that talks to hardware.

Parameters
  • ranks – All ranks on the network

  • local_rank – Rank of this process

  • device – FPGA device on which the CCLO lives

  • cclo_ip – The CCLO kernel on the FPGA

  • hostctrl_ip – The hostctrl kernel on the FPGA

  • devicemem – Memory bank of device memory

  • rxbufmem – Memory banks of rxbuf memory

  • networkmem – Memory bank of network memory

  • protocol – Network protocol to use

  • nbufs – Amount of buffers to use

  • bufsize – Size of buffers

  • arith_config – Arithmetic configuration to use

ACCL(const std::vector<rank_t> &ranks, int local_rank, unsigned int start_port, networkProtocol protocol = networkProtocol::TCP, int nbufs = 16, addr_t bufsize = 1024, const arithConfigMap &arith_config = DEFAULT_ARITH_CONFIG)

Construct a new ACCL object that talks to the ACCL emulator/simulator.

Parameters
  • ranks – All ranks on the network

  • local_rank – Rank of this process

  • start_port – First port to use to connect to the ACCL emulator/simulator

  • protocol – Network protocol to use

  • nbufs – Amount of buffers to use

  • bufsize – Size of buffers

  • arith_config – Arithmetic configuration to use

ACCL(const std::vector<rank_t> &ranks, int local_rank, unsigned int start_port, xrt::device &device, networkProtocol protocol = networkProtocol::TCP, int nbufs = 16, addr_t bufsize = 1024, const arithConfigMap &arith_config = DEFAULT_ARITH_CONFIG)

Construct a new ACCL object that talks to emulator/simulator and is compatible with the Vitis emulator.

Parameters
  • ranks – All ranks on the network

  • local_rank – Rank of this process

  • start_port – First port to use to connect to the ACCL emulator/simulator

  • device – Simulated FPGA device from the Vitis emulator

  • protocol – Network protocol to use

  • nbufs – Amount of buffers to use

  • bufsize – Size of buffers

  • arith_config – Arithmetic configuration to use

~ACCL()

Destroy the ACCL object.

Automatically deinitializes the CCLO.

void deinit()

Deinitializes the CCLO.

inline val_t get_retcode()

Get the return code of the last ACCL call.

Returns

val_t The return code

inline val_t get_hwid()

Get the hardware id from the FPGA.

Returns

val_t The hardware id

CCLO *set_timeout(unsigned int value, bool run_async = false, std::vector<CCLO*> waitfor = {})

Set the timeout of ACCL calls.

Parameters
  • value – Timeout in milliseconds

  • run_async – Run the ACCL call asynchronously.

  • waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.

Returns

CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.

CCLO *nop(bool run_async = false, std::vector<CCLO*> waitfor = {})

Performs the nop operation on the FPGA.

Parameters
  • run_async – Run the ACCL call asynchronously.

  • waitfor – ACCL call will wait for these operations before it will start.

Returns

CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.

CCLO *send(BaseBuffer &srcbuf, unsigned int count, unsigned int dst, unsigned int tag = TAG_ANY, communicatorId comm_id = GLOBAL_COMM, bool from_fpga = false, streamFlags stream_flags = streamFlags::NO_STREAM, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})

Performs the send operation on the FPGA.

Parameters
  • srcbuf – Buffer that contains the data to be sent. Create a buffer using ACCL::create_buffer.

  • count – Amount of elements in buffer to send.

  • dst – Destination rank to send data to.

  • tag – Tag of send operation.

  • comm_id – Index of communicator to use.

  • from_fpga – Set to true if the data is already on the FPGA.

  • stream_flags – Stream flags to use.

  • compress_dtype – Datatype to compress buffers to over Ethernet.

  • run_async – Run the ACCL call asynchronously.

  • waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.

Returns

CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.

CCLO *recv(BaseBuffer &dstbuf, unsigned int count, unsigned int src, unsigned int tag = TAG_ANY, communicatorId comm_id = GLOBAL_COMM, bool to_fpga = false, streamFlags stream_flags = streamFlags::NO_STREAM, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})

Performs the receive operation on the FPGA.

Parameters
  • dstbuf – Buffer where the data should be stored. Create a buffer using ACCL::create_buffer.

  • count – Amount of elements to receive.

  • src – Source rank to receive data from.

  • tag – Tag of receive operation.

  • comm_id – Index of communicator to use.

  • to_fpga – Set to true if the data will be used on the FPGA only.

  • stream_flags – Stream flags to use.

  • compress_dtype – Datatype to compress buffers to over Ethernet.

  • run_async – Run the ACCL call asynchronously.

  • waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.

Returns

CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.

CCLO *copy(BaseBuffer &srcbuf, BaseBuffer &dstbuf, unsigned int count, bool from_fpga = false, bool to_fpga = false, bool run_async = false, std::vector<CCLO*> waitfor = {})

Copy a buffer on the FPGA.

Parameters
  • srcbuf – Buffer that contains the data to be copied. Create a buffer using ACCL::create_buffer.

  • dstbuf – Buffer where the data should be stored. Create a buffer using ACCL::create_buffer.

  • count – Amount of elements in buffer to copy.

  • from_fpga – Set to true if the data is already on the FPGA.

  • to_fpga – Set to true if the copied data will be used on the FPGA only.

  • run_async – Run the ACCL call asynchronously.

  • waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.

Returns

CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.

CCLO *combine(unsigned int count, reduceFunction function, BaseBuffer &val1, BaseBuffer &val2, BaseBuffer &result, bool val1_from_fpga = false, bool val2_from_fpga = false, bool to_fpga = false, bool run_async = false, std::vector<CCLO*> waitfor = {})

Perform reduce operation on two buffers on the FPGA.

Parameters
  • count – Amount of elements to perform reduce operation on.

  • function – Reduce operation to perform.

  • val1 – First buffer that should be used for reduce operation. Create a buffer using ACCL::create_buffer.

  • val2 – Second buffer that should be used for reduce operation. Create a buffer using ACCL::create_buffer.

  • result – Buffer where the result should be stored. Create a buffer using ACCL::create_buffer.

  • val1_from_fpga – Set to true if the data of the first buffer is already on the FPGA.

  • val2_from_fpga – Set to true if the data of the second buffer is already on the FPGA.

  • to_fpga – Set to true if the result will be used on the FPGA only.

  • run_async – Run the ACCL call asynchronously.

  • waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.

Returns

CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.
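
The element-wise behaviour described for combine can be modelled on the host. The following is a minimal sketch in plain C++, not ACCL code: combine_sum_model is a hypothetical helper, and it assumes a sum reduction applied to the first count elements of both inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of combine() with a sum reduction
// (not part of ACCL): result[i] = val1[i] + val2[i] for i < count.
template <typename T>
std::vector<T> combine_sum_model(std::size_t count,
                                 const std::vector<T> &val1,
                                 const std::vector<T> &val2) {
  // Both inputs must hold at least `count` elements.
  assert(val1.size() >= count && val2.size() >= count);
  std::vector<T> result(count);
  for (std::size_t i = 0; i < count; ++i) {
    result[i] = val1[i] + val2[i];
  }
  return result;
}
```

A model like this can serve as a reference when checking the contents of the result buffer after a combine call.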

CCLO *bcast(BaseBuffer &buf, unsigned int count, unsigned int root, communicatorId comm_id = GLOBAL_COMM, bool from_fpga = false, bool to_fpga = false, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})

Performs the broadcast operation on the FPGA.

Parameters
  • buf – Buffer that should contain the same data as the root after the operation. Create a buffer using ACCL::create_buffer.

  • count – Amount of elements in buffer to broadcast.

  • root – Rank to broadcast the data from.

  • comm_id – Index of communicator to use.

  • from_fpga – Set to true if the data is already on the FPGA.

  • to_fpga – Set to true if the broadcast data will be used on the FPGA only.

  • compress_dtype – Datatype to compress buffers to over Ethernet.

  • run_async – Run the ACCL call asynchronously.

  • waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.

Returns

CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.

CCLO *scatter(BaseBuffer &sendbuf, BaseBuffer &recvbuf, unsigned int count, unsigned int root, communicatorId comm_id = GLOBAL_COMM, bool from_fpga = false, bool to_fpga = false, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})

Performs the scatter operation on the FPGA.

Parameters
  • sendbuf – Buffer of count × world size elements that contains the data to be scattered. Create a buffer using ACCL::create_buffer. You can pass a DummyBuffer on non-root ranks.

  • recvbuf – Buffer of count elements where the scattered data should be stored. Create a buffer using ACCL::create_buffer.

  • count – Amount of elements to scatter per rank.

  • root – Rank to scatter the data from.

  • comm_id – Index of communicator to use.

  • from_fpga – Set to true if the data is already on the FPGA.

  • to_fpga – Set to true if the scattered data will be used on the FPGA only.

  • compress_dtype – Datatype to compress buffers to over Ethernet.

  • run_async – Run the ACCL call asynchronously.

  • waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.

Returns

CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.
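
The buffer sizes above (count × world size elements on the root, count elements on each rank) can be illustrated with a host-side model. This is a sketch in plain C++ and not ACCL code: scatter_model is a hypothetical helper, and it assumes the conventional layout in which chunk r of the root's sendbuf is delivered to rank r, which the text above does not state explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of scatter() (not part of ACCL).
// Assumption: the root's sendbuf is split into world_size consecutive
// chunks of `count` elements, and chunk r ends up in rank r's recvbuf.
template <typename T>
std::vector<std::vector<T>> scatter_model(const std::vector<T> &root_sendbuf,
                                          std::size_t count,
                                          std::size_t world_size) {
  assert(root_sendbuf.size() >= count * world_size);
  std::vector<std::vector<T>> recvbufs(world_size);
  for (std::size_t rank = 0; rank < world_size; ++rank) {
    auto first =
        root_sendbuf.begin() + static_cast<std::ptrdiff_t>(rank * count);
    recvbufs[rank].assign(first, first + static_cast<std::ptrdiff_t>(count));
  }
  return recvbufs;
}
```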

CCLO *gather(BaseBuffer &sendbuf, BaseBuffer &recvbuf, unsigned int count, unsigned int root, communicatorId comm_id = GLOBAL_COMM, bool from_fpga = false, bool to_fpga = false, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})

Performs the gather operation on the FPGA.

Parameters
  • sendbuf – Buffer of count elements that contains the data to be gathered. Create a buffer using ACCL::create_buffer.

  • recvbuf – Buffer of count × world size elements to where the data should be gathered. Create a buffer using ACCL::create_buffer. You can pass a DummyBuffer on non-root ranks.

  • count – Amount of elements to gather per rank.

  • root – Rank to gather the data to.

  • comm_id – Index of communicator to use.

  • from_fpga – Set to true if the data is already on the FPGA.

  • to_fpga – Set to true if the gathered data will be used on the FPGA only.

  • compress_dtype – Datatype to compress buffers to over Ethernet.

  • run_async – Run the ACCL call asynchronously.

  • waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.

Returns

CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.

CCLO *allgather(BaseBuffer &sendbuf, BaseBuffer &recvbuf, unsigned int count, communicatorId comm_id = GLOBAL_COMM, bool from_fpga = false, bool to_fpga = false, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})

Performs the allgather operation on the FPGA.

Parameters
  • sendbuf – Buffer of count elements that contains the data to be gathered. Create a buffer using ACCL::create_buffer.

  • recvbuf – Buffer of count × world size elements to where the data should be gathered. Create a buffer using ACCL::create_buffer.

  • count – Amount of elements to gather per rank.

  • comm_id – Index of communicator to use.

  • from_fpga – Set to true if the data is already on the FPGA.

  • to_fpga – Set to true if the gathered data will be used on the FPGA only.

  • compress_dtype – Datatype to compress buffers to over Ethernet.

  • run_async – Run the ACCL call asynchronously.

  • waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.

Returns

CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.
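
As an illustration of the recvbuf size of count × world size elements, the expected result of an allgather can be modelled on the host. This is a sketch in plain C++, not ACCL code: allgather_model is a hypothetical helper, and it assumes that the contribution of rank r is placed at offset r × count in every rank's recvbuf.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of allgather() (not part of ACCL).
// sendbufs[r] is the sendbuf of rank r. The returned vector is the
// recvbuf that every rank is assumed to hold afterwards: the first
// `count` elements of each sendbuf, concatenated in rank order.
template <typename T>
std::vector<T> allgather_model(const std::vector<std::vector<T>> &sendbufs,
                               std::size_t count) {
  std::vector<T> recvbuf;
  recvbuf.reserve(count * sendbufs.size());
  for (const auto &sendbuf : sendbufs) {
    assert(sendbuf.size() >= count);
    recvbuf.insert(recvbuf.end(), sendbuf.begin(),
                   sendbuf.begin() + static_cast<std::ptrdiff_t>(count));
  }
  return recvbuf;
}
```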

CCLO *reduce(BaseBuffer &sendbuf, BaseBuffer &recvbuf, unsigned int count, unsigned int root, reduceFunction func, communicatorId comm_id = GLOBAL_COMM, bool from_fpga = false, bool to_fpga = false, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})

Performs the reduce operation on the FPGA.

Parameters
  • sendbuf – Buffer that contains the data to be reduced. Create a buffer using ACCL::create_buffer.

  • recvbuf – Buffer to where the data should be reduced. Create a buffer using ACCL::create_buffer. You can pass a DummyBuffer on non-root ranks.

  • count – Amount of elements to reduce.

  • root – Rank to reduce the data to.

  • func – Reduce function to use.

  • comm_id – Index of communicator to use.

  • from_fpga – Set to true if the data is already on the FPGA.

  • to_fpga – Set to true if the reduced data will be used on the FPGA only.

  • compress_dtype – Datatype to compress buffers to over Ethernet.

  • run_async – Run the ACCL call asynchronously.

  • waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.

Returns

CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.

CCLO *allreduce(BaseBuffer &sendbuf, BaseBuffer &recvbuf, unsigned int count, reduceFunction func, communicatorId comm_id = GLOBAL_COMM, bool from_fpga = false, bool to_fpga = false, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})

Performs the allreduce operation on the FPGA.

Parameters
  • sendbuf – Buffer that contains the data to be reduced. Create a buffer using ACCL::create_buffer.

  • recvbuf – Buffer to where the data should be reduced. Create a buffer using ACCL::create_buffer.

  • count – Amount of elements to reduce.

  • func – Reduce function to use.

  • comm_id – Index of communicator to use.

  • from_fpga – Set to true if the data is already on the FPGA.

  • to_fpga – Set to true if the reduced data will be used on the FPGA only.

  • compress_dtype – Datatype to compress buffers to over Ethernet.

  • run_async – Run the ACCL call asynchronously.

  • waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.

Returns

CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.

CCLO *reduce_scatter(BaseBuffer &sendbuf, BaseBuffer &recvbuf, unsigned int count, reduceFunction func, communicatorId comm_id = GLOBAL_COMM, bool from_fpga = false, bool to_fpga = false, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})

Performs the reduce_scatter operation on the FPGA.

Parameters
  • sendbuf – Buffer of count × world size elements that contains the data to be reduced. Create a buffer using ACCL::create_buffer.

  • recvbuf – Buffer of count elements to where the data should be reduced. Create a buffer using ACCL::create_buffer.

  • count – Amount of elements to reduce per rank.

  • func – Reduce function to use.

  • comm_id – Index of communicator to use.

  • from_fpga – Set to true if the data is already on the FPGA.

  • to_fpga – Set to true if the reduced data will be used on the FPGA only.

  • compress_dtype – Datatype to compress buffers to over Ethernet.

  • run_async – Run the ACCL call asynchronously.

  • waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.

Returns

CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.
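
The relationship between the two buffer sizes in reduce_scatter can be shown with a host-side model. This is a sketch in plain C++, not ACCL code: reduce_scatter_sum_model is a hypothetical helper, it assumes a sum reduction, and it assumes that rank r receives the reduced chunk starting at offset r × count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of reduce_scatter() with a sum reduction
// (not part of ACCL). sendbufs[k] is the sendbuf of rank k and holds
// count * world_size elements. Assumption: rank r receives the
// element-wise sum over all ranks of chunk r of the sendbufs.
template <typename T>
std::vector<std::vector<T>> reduce_scatter_sum_model(
    const std::vector<std::vector<T>> &sendbufs, std::size_t count) {
  const std::size_t world_size = sendbufs.size();
  std::vector<std::vector<T>> recvbufs(world_size,
                                       std::vector<T>(count, T{}));
  for (const auto &sendbuf : sendbufs) {
    assert(sendbuf.size() >= count * world_size);
    for (std::size_t rank = 0; rank < world_size; ++rank) {
      for (std::size_t i = 0; i < count; ++i) {
        recvbufs[rank][i] += sendbuf[rank * count + i];
      }
    }
  }
  return recvbufs;
}
```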

void barrier(communicatorId comm_id = GLOBAL_COMM, std::vector<CCLO*> waitfor = {})

Performs a barrier on the FPGA.

Parameters
  • comm_id – Index of communicator to use.

  • waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.

inline bool is_simulated() const

Check if ACCL is being run in simulated mode or not.

Returns

true if ACCL is running on an emulator or simulator, false if ACCL is running on hardware.

template<typename dtype>
inline std::unique_ptr<Buffer<dtype>> create_buffer(size_t length, dataType type)

Construct a new buffer object without an existing host buffer.

Note that when running in simulated mode, this constructor will not create an underlying simulated BO buffer. If you need this functionality, use create_buffer(xrt::bo &, size_t, dataType).

Template Parameters

dtype – Datatype of the buffer.

Parameters
  • length – Amount of elements to allocate for.

  • type – ACCL datatype of the buffer.

Returns

std::unique_ptr<Buffer<dtype>> The allocated buffer.

template<typename dtype>
inline std::unique_ptr<Buffer<dtype>> create_buffer(size_t length, dataType type, unsigned mem_grp)

Construct a new buffer object without an existing host buffer on the specified memory bank.

Only use this function if you want to store the buffer on a different memory bank than the devicemem bank specified during construction.

Note that when running in simulated mode, this constructor will not create an underlying simulated BO buffer. If you need this functionality, use create_buffer(xrt::bo &, size_t, dataType).

Template Parameters

dtype – Datatype of the buffer.

Parameters
  • length – Amount of elements to allocate for.

  • type – ACCL datatype of the buffer.

  • mem_grp – Memory bank to allocate buffer on.

Returns

std::unique_ptr<Buffer<dtype>> The allocated buffer.

template<typename dtype>
inline std::unique_ptr<Buffer<dtype>> create_buffer(dtype *host_buffer, size_t length, dataType type)

Construct a new buffer object from an existing host pointer.

On hardware, the host pointer is required to be aligned to 4096 bytes. If a non-aligned host pointer is provided and ACCL is running on hardware, ACCL will keep its own aligned host buffer and copy between the unaligned and aligned host buffers when required. It is recommended to provide an aligned host pointer to avoid unnecessary memory copies.

Note that when running in simulated mode, this constructor will not create an underlying simulated BO buffer. If you need this functionality, use create_buffer(xrt::bo &, size_t, dataType).

Template Parameters

dtype – Datatype of the buffer.

Parameters
  • host_buffer – The host pointer containing the data.

  • length – Amount of elements in the host buffer.

  • type – ACCL datatype of the buffer.

Returns

std::unique_ptr<Buffer<dtype>> The allocated buffer.
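
One way to obtain a host pointer that meets the 4096-byte alignment recommendation is std::aligned_alloc (C++17). The following is a minimal sketch and not part of ACCL: allocate_aligned, is_aligned, AlignedPtr and kAlignment are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <new>
#include <type_traits>

// Alignment taken from the text above.
constexpr std::size_t kAlignment = 4096;

// Deleter that releases memory obtained from std::aligned_alloc.
struct FreeDeleter {
  void operator()(void *p) const noexcept { std::free(p); }
};

template <typename T>
using AlignedPtr = std::unique_ptr<T[], FreeDeleter>;

// Returns true if p is aligned to kAlignment bytes.
inline bool is_aligned(const void *p) {
  return reinterpret_cast<std::uintptr_t>(p) % kAlignment == 0;
}

// Hypothetical helper: allocates storage for `length` elements of a
// trivially copyable type T, aligned to kAlignment bytes.
template <typename T>
AlignedPtr<T> allocate_aligned(std::size_t length) {
  static_assert(std::is_trivially_copyable<T>::value,
                "raw storage is only suitable for trivially copyable types");
  // std::aligned_alloc requires the size to be a multiple of the alignment,
  // so round the byte count up (and allocate at least one block).
  std::size_t bytes = length * sizeof(T);
  std::size_t padded = ((bytes + kAlignment - 1) / kAlignment) * kAlignment;
  if (padded == 0) {
    padded = kAlignment;
  }
  void *raw = std::aligned_alloc(kAlignment, padded);
  if (raw == nullptr) {
    throw std::bad_alloc();
  }
  assert(is_aligned(raw));
  return AlignedPtr<T>(static_cast<T *>(raw));
}
```

The pointer returned by get() on the result could then be passed as host_buffer, with the same length, to create_buffer(dtype *, size_t, dataType). The allocation must outlive the ACCL buffer that refers to it.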

template<typename dtype>
inline std::unique_ptr<Buffer<dtype>> create_buffer(dtype *host_buffer, size_t length, dataType type, unsigned mem_grp)

Construct a new buffer object from an existing host pointer on the specified memory bank.

Only use this function if you want to store the buffer on a different memory bank than the devicemem bank specified during construction.

On hardware, the host pointer is required to be aligned to 4096 bytes. If a non-aligned host pointer is provided and ACCL is running on hardware, ACCL will keep its own aligned host buffer and copy between the unaligned and aligned host buffers when required. It is recommended to provide an aligned host pointer to avoid unnecessary memory copies.

Note that when running in simulated mode, this constructor will not create an underlying simulated BO buffer. If you need this functionality, use create_buffer(xrt::bo &, size_t, dataType).

Template Parameters

dtype – Datatype of the buffer.

Parameters
  • host_buffer – The host pointer containing the data.

  • length – Amount of elements in the host buffer.

  • type – ACCL datatype of the buffer.

  • mem_grp – Memory bank to allocate buffer on.

Returns

std::unique_ptr<Buffer<dtype>> The allocated buffer.

template<typename dtype>
inline std::unique_ptr<Buffer<dtype>> create_buffer(xrt::bo &bo, size_t length, dataType type)

Construct a new buffer object from an existing BO buffer.

When using an ACCL emulator or simulator, this function can be used to pass a simulated BO buffer from the Vitis emulator and use the Vitis emulator together with the ACCL emulator. In this case, ACCL will also create a new internal simulated BO buffer to copy data between the simulated BO buffer and the simulated ACCL buffer when required.

When running on hardware, ACCL will simply use this BO buffer internally, instead of allocating a new one.

Template Parameters

dtype – Datatype of the buffer.

Parameters
  • bo – The BO buffer to use.

  • length – Amount of elements in the BO buffer.

  • type – ACCL datatype of the buffer.

Returns

std::unique_ptr<Buffer<dtype>> The allocated buffer.

template<typename dtype>
inline std::unique_ptr<Buffer<dtype>> create_buffer_p2p(size_t length, dataType type)

Construct a new P2P buffer object.

Will create a normal buffer when running in simulated mode.

Note that when running in simulated mode, this constructor will not create an underlying simulated BO buffer. If you need this functionality, use create_buffer_p2p(xrt::bo &, size_t, dataType).

Template Parameters

dtype – Datatype of the buffer.

Parameters
  • length – Amount of elements to allocate for.

  • type – ACCL datatype of the buffer.

Returns

std::unique_ptr<Buffer<dtype>> The allocated P2P buffer.

template<typename dtype>
inline std::unique_ptr<Buffer<dtype>> create_buffer_p2p(size_t length, dataType type, unsigned mem_grp)

Construct a new P2P buffer object on the specified memory bank.

Will create a normal buffer when running in simulated mode.

Only use this function if you want to store the buffer on a different memory bank than the devicemem bank specified during construction.

Note that when running in simulated mode, this constructor will not create an underlying simulated BO buffer. If you need this functionality, use create_buffer_p2p(xrt::bo &, size_t, dataType).

Template Parameters

dtype – Datatype of the buffer.

Parameters
  • length – Amount of elements to allocate for.

  • type – ACCL datatype of the buffer.

  • mem_grp – Memory bank to allocate buffer on.

Returns

std::unique_ptr<Buffer<dtype>> The allocated P2P buffer.

template<typename dtype>
inline std::unique_ptr<Buffer<dtype>> create_buffer_p2p(xrt::bo &bo, size_t length, dataType type)

Construct a new P2P buffer object from an existing P2P BO buffer.

If you pass a non-P2P BO buffer, data will not be copied correctly from and to the FPGA.

Will create a normal buffer when running in simulated mode. See the notes of create_buffer(xrt::bo &, size_t, dataType) about using BO buffers in simulated mode.

Template Parameters

dtype – Datatype of the buffer.

Parameters
  • bo – The P2P BO buffer to use.

  • length – Amount of elements in the BO buffer.

  • type – ACCL datatype of the buffer.

Returns

std::unique_ptr<Buffer<dtype>> The allocated P2P buffer.

std::string dump_exchange_memory()

Dump the content of the exchange memory to a string.

Returns

std::string Content of the exchange memory.

std::string dump_rx_buffers(size_t nbufs)

Dump the content of the RX buffers to a string for the first nbufs buffers.

Parameters

nbufs – Amount of buffers to dump the content of.

Returns

std::string Content of the RX buffers.

inline std::string dump_rx_buffers()

Dump the content of all RX buffers to a string.

Returns

std::string Content of all RX buffers.

std::string dump_communicator()

Dump the content of the communicator to a string.

Returns

std::string Content of the communicator.

inline int devicemem()

Retrieve the devicemem memory bank.

Returns

int The devicemem memory bank