ACCL
ACCL::ACCL
-
class ACCL
Main ACCL class that talks to the CCLO on hardware or emulation/simulation.
Public Functions
-
ACCL(const std::vector<rank_t> &ranks, int local_rank, xrt::device &device, xrt::ip &cclo_ip, xrt::kernel &hostctrl_ip, int devicemem, const std::vector<int> &rxbufmem, int networkmem, networkProtocol protocol = networkProtocol::TCP, int nbufs = 16, addr_t bufsize = 1024, const arithConfigMap &arith_config = DEFAULT_ARITH_CONFIG)
Construct a new ACCL object that talks to hardware.
- Parameters
ranks – All ranks on the network
local_rank – Rank of this process
device – FPGA device on which the CCLO lives
cclo_ip – The CCLO kernel on the FPGA
hostctrl_ip – The hostctrl kernel on the FPGA
devicemem – Memory bank of device memory
rxbufmem – Memory banks of rxbuf memory
networkmem – Memory bank of network memory
protocol – Network protocol to use
nbufs – Number of buffers to use
bufsize – Size of buffers
arith_config – Arithmetic configuration to use
-
ACCL(const std::vector<rank_t> &ranks, int local_rank, unsigned int start_port, networkProtocol protocol = networkProtocol::TCP, int nbufs = 16, addr_t bufsize = 1024, const arithConfigMap &arith_config = DEFAULT_ARITH_CONFIG)
Construct a new ACCL object that talks to the ACCL emulator/simulator.
- Parameters
ranks – All ranks on the network
local_rank – Rank of this process
start_port – First port to use to connect to the ACCL emulator/simulator
protocol – Network protocol to use
nbufs – Number of buffers to use
bufsize – Size of buffers
arith_config – Arithmetic configuration to use
-
ACCL(const std::vector<rank_t> &ranks, int local_rank, unsigned int start_port, xrt::device &device, networkProtocol protocol = networkProtocol::TCP, int nbufs = 16, addr_t bufsize = 1024, const arithConfigMap &arith_config = DEFAULT_ARITH_CONFIG)
Construct a new ACCL object that talks to emulator/simulator and is compatible with the Vitis emulator.
- Parameters
ranks – All ranks on the network
local_rank – Rank of this process
start_port – First port to use to connect to the ACCL emulator/simulator
device – Simulated FPGA device from the Vitis emulator
protocol – Network protocol to use
nbufs – Number of buffers to use
bufsize – Size of buffers
arith_config – Arithmetic configuration to use
-
inline val_t get_retcode()
Get the return code of the last ACCL call.
- Returns
val_t The return code
-
CCLO *set_timeout(unsigned int value, bool run_async = false, std::vector<CCLO*> waitfor = {})
Set the timeout of ACCL calls.
-
CCLO *nop(bool run_async = false, std::vector<CCLO*> waitfor = {})
Performs the nop operation on the FPGA.
-
CCLO *send(BaseBuffer &srcbuf, unsigned int count, unsigned int dst, unsigned int tag = TAG_ANY, communicatorId comm_id = GLOBAL_COMM, bool from_fpga = false, streamFlags stream_flags = streamFlags::NO_STREAM, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})
Performs the send operation on the FPGA.
- Parameters
srcbuf – Buffer that contains the data to be sent. Create a buffer using ACCL::create_buffer.
count – Number of elements in buffer to send.
dst – Destination rank to send data to.
tag – Tag of send operation.
comm_id – Index of communicator to use.
from_fpga – Set to true if the data is already on the FPGA.
stream_flags – Stream flags to use.
compress_dtype – Datatype to which buffers are compressed for transmission over Ethernet.
run_async – Run the ACCL call asynchronously.
waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.
- Returns
CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.
-
CCLO *recv(BaseBuffer &dstbuf, unsigned int count, unsigned int src, unsigned int tag = TAG_ANY, communicatorId comm_id = GLOBAL_COMM, bool to_fpga = false, streamFlags stream_flags = streamFlags::NO_STREAM, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})
Performs the receive operation on the FPGA.
- Parameters
dstbuf – Buffer where the data should be stored to. Create a buffer using ACCL::create_buffer.
count – Number of elements to receive.
src – Source rank to receive data from.
tag – Tag of receive operation.
comm_id – Index of communicator to use.
to_fpga – Set to true if the data will be used on the FPGA only.
stream_flags – Stream flags to use.
compress_dtype – Datatype to which buffers are compressed for transmission over Ethernet.
run_async – Run the ACCL call asynchronously.
waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.
- Returns
CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.
-
CCLO *copy(BaseBuffer &srcbuf, BaseBuffer &dstbuf, unsigned int count, bool from_fpga = false, bool to_fpga = false, bool run_async = false, std::vector<CCLO*> waitfor = {})
Copy a buffer on the FPGA.
- Parameters
srcbuf – Buffer that contains the data to be copied. Create a buffer using ACCL::create_buffer.
dstbuf – Buffer where the data should be stored to. Create a buffer using ACCL::create_buffer.
count – Number of elements in buffer to copy.
from_fpga – Set to true if the data is already on the FPGA.
to_fpga – Set to true if the copied data will be used on the FPGA only.
run_async – Run the ACCL call asynchronously.
waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.
- Returns
CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.
-
CCLO *combine(unsigned int count, reduceFunction function, BaseBuffer &val1, BaseBuffer &val2, BaseBuffer &result, bool val1_from_fpga = false, bool val2_from_fpga = false, bool to_fpga = false, bool run_async = false, std::vector<CCLO*> waitfor = {})
Perform reduce operation on two buffers on the FPGA.
- Parameters
count – Number of elements to perform reduce operation on.
function – Reduce operation to perform.
val1 – First buffer that should be used for reduce operation. Create a buffer using ACCL::create_buffer.
val2 – Second buffer that should be used for reduce operation. Create a buffer using ACCL::create_buffer.
result – Buffer where the result should be stored to. Create a buffer using ACCL::create_buffer.
val1_from_fpga – Set to true if the data of the first buffer is already on the FPGA.
val2_from_fpga – Set to true if the data of the second buffer is already on the FPGA.
to_fpga – Set to true if the result will be used on the FPGA only.
run_async – Run the ACCL call asynchronously.
waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.
- Returns
CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.
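As a rough host-side model of what combine computes, the sketch below applies a binary operation elementwise to the first count elements of two vectors. It is not ACCL code: the name `combine_model` is hypothetical, and `std::plus` is used only as a stand-in for a sum-style reduceFunction. The real call runs on the FPGA and takes buffers created with ACCL::create_buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical host-only model of combine, not part of the ACCL API:
// result[i] = op(val1[i], val2[i]) for the first `count` elements.
template <typename T, typename Op>
std::vector<T> combine_model(std::size_t count, Op op,
                             const std::vector<T> &val1,
                             const std::vector<T> &val2) {
  // Both inputs must hold at least `count` elements.
  assert(val1.size() >= count && val2.size() >= count);
  std::vector<T> result(count);
  for (std::size_t i = 0; i < count; ++i) {
    result[i] = op(val1[i], val2[i]);
  }
  return result;
}
```

Calling `combine_model(2, std::plus<int>(), a, b)` on two integer vectors returns a two-element vector holding the pairwise sums of their first two entries.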
-
CCLO *bcast(BaseBuffer &buf, unsigned int count, unsigned int root, communicatorId comm_id = GLOBAL_COMM, bool from_fpga = false, bool to_fpga = false, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})
Performs the broadcast operation on the FPGA.
- Parameters
buf – Buffer that should contain the same data as the root after the operation. Create a buffer using ACCL::create_buffer.
count – Number of elements in buffer to broadcast.
root – Rank to broadcast the data from.
comm_id – Index of communicator to use.
from_fpga – Set to true if the data is already on the FPGA.
to_fpga – Set to true if the broadcast data will be used on the FPGA only.
compress_dtype – Datatype to which buffers are compressed for transmission over Ethernet.
run_async – Run the ACCL call asynchronously.
waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.
- Returns
CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.
-
CCLO *scatter(BaseBuffer &sendbuf, BaseBuffer &recvbuf, unsigned int count, unsigned int root, communicatorId comm_id = GLOBAL_COMM, bool from_fpga = false, bool to_fpga = false, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})
Performs the scatter operation on the FPGA.
- Parameters
sendbuf – Buffer of count × world size elements that contains the data to be scattered. Create a buffer using ACCL::create_buffer. You can pass a DummyBuffer on non-root ranks.
recvbuf – Buffer of count elements where the scattered data should be stored. Create a buffer using ACCL::create_buffer.
count – Number of elements to scatter per rank.
root – Rank to scatter the data from.
comm_id – Index of communicator to use.
from_fpga – Set to true if the data is already on the FPGA.
to_fpga – Set to true if the scattered data will be used on the FPGA only.
compress_dtype – Datatype to which buffers are compressed for transmission over Ethernet.
run_async – Run the ACCL call asynchronously.
waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.
- Returns
CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.
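The buffer sizes for scatter (count × world size on the root, count on every rank) can be illustrated with a small host-only model. The sketch assumes the MPI-style layout in which rank r receives the r-th block of count elements. That layout is an assumption here, since the parameter descriptions only state the sizes. `scatter_model` is a hypothetical name and does not call into ACCL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only model of scatter, not part of the ACCL API.
// Assumption: rank r receives sendbuf[r * count, (r + 1) * count).
template <typename T>
std::vector<std::vector<T>> scatter_model(const std::vector<T> &sendbuf,
                                          std::size_t count,
                                          std::size_t world_size) {
  // The root's send buffer holds one block of `count` elements per rank.
  assert(sendbuf.size() == count * world_size);
  std::vector<std::vector<T>> recvbufs(world_size);
  for (std::size_t r = 0; r < world_size; ++r) {
    const auto first =
        sendbuf.begin() + static_cast<std::ptrdiff_t>(r * count);
    recvbufs[r].assign(first, first + static_cast<std::ptrdiff_t>(count));
  }
  return recvbufs;
}
```

Under the same assumption, gather is the inverse: each rank contributes count elements and the root ends up with count × world size elements, ordered by rank.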
-
CCLO *gather(BaseBuffer &sendbuf, BaseBuffer &recvbuf, unsigned int count, unsigned int root, communicatorId comm_id = GLOBAL_COMM, bool from_fpga = false, bool to_fpga = false, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})
Performs the gather operation on the FPGA.
- Parameters
sendbuf – Buffer of count elements that contains the data to be gathered. Create a buffer using ACCL::create_buffer.
recvbuf – Buffer of count × world size elements to where the data should be gathered. Create a buffer using ACCL::create_buffer. You can pass a DummyBuffer on non-root ranks.
count – Number of elements to gather per rank.
root – Rank to gather the data to.
comm_id – Index of communicator to use.
from_fpga – Set to true if the data is already on the FPGA.
to_fpga – Set to true if the gathered data will be used on the FPGA only.
compress_dtype – Datatype to which buffers are compressed for transmission over Ethernet.
run_async – Run the ACCL call asynchronously.
waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.
- Returns
CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.
-
CCLO *allgather(BaseBuffer &sendbuf, BaseBuffer &recvbuf, unsigned int count, communicatorId comm_id = GLOBAL_COMM, bool from_fpga = false, bool to_fpga = false, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})
Performs the allgather operation on the FPGA.
- Parameters
sendbuf – Buffer of count elements that contains the data to be gathered. Create a buffer using ACCL::create_buffer.
recvbuf – Buffer of count × world size elements to where the data should be gathered. Create a buffer using ACCL::create_buffer.
count – Number of elements to gather per rank.
comm_id – Index of communicator to use.
from_fpga – Set to true if the data is already on the FPGA.
to_fpga – Set to true if the gathered data will be used on the FPGA only.
compress_dtype – Datatype to which buffers are compressed for transmission over Ethernet.
run_async – Run the ACCL call asynchronously.
waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.
- Returns
CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.
-
CCLO *reduce(BaseBuffer &sendbuf, BaseBuffer &recvbuf, unsigned int count, unsigned int root, reduceFunction func, communicatorId comm_id = GLOBAL_COMM, bool from_fpga = false, bool to_fpga = false, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})
Performs the reduce operation on the FPGA.
- Parameters
sendbuf – Buffer that contains the data to be reduced. Create a buffer using ACCL::create_buffer.
recvbuf – Buffer to where the data should be reduced. Create a buffer using ACCL::create_buffer. You can pass a DummyBuffer on non-root ranks.
count – Number of elements to reduce.
root – Rank to reduce the data to.
func – Reduce function to use.
comm_id – Index of communicator to use.
from_fpga – Set to true if the data is already on the FPGA.
to_fpga – Set to true if the reduced data will be used on the FPGA only.
compress_dtype – Datatype to which buffers are compressed for transmission over Ethernet.
run_async – Run the ACCL call asynchronously.
waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.
- Returns
CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.
-
CCLO *allreduce(BaseBuffer &sendbuf, BaseBuffer &recvbuf, unsigned int count, reduceFunction func, communicatorId comm_id = GLOBAL_COMM, bool from_fpga = false, bool to_fpga = false, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})
Performs the allreduce operation on the FPGA.
- Parameters
sendbuf – Buffer that contains the data to be reduced. Create a buffer using ACCL::create_buffer.
recvbuf – Buffer to where the data should be reduced. Create a buffer using ACCL::create_buffer.
count – Number of elements to reduce.
func – Reduce function to use.
comm_id – Index of communicator to use.
from_fpga – Set to true if the data is already on the FPGA.
to_fpga – Set to true if the reduced data will be used on the FPGA only.
compress_dtype – Datatype to which buffers are compressed for transmission over Ethernet.
run_async – Run the ACCL call asynchronously.
waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.
- Returns
CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.
-
CCLO *reduce_scatter(BaseBuffer &sendbuf, BaseBuffer &recvbuf, unsigned int count, reduceFunction func, communicatorId comm_id = GLOBAL_COMM, bool from_fpga = false, bool to_fpga = false, dataType compress_dtype = dataType::none, bool run_async = false, std::vector<CCLO*> waitfor = {})
Performs the reduce_scatter operation on the FPGA.
- Parameters
sendbuf – Buffer of count × world size elements that contains the data to be reduced. Create a buffer using ACCL::create_buffer.
recvbuf – Buffer of count elements to where the data should be reduced. Create a buffer using ACCL::create_buffer.
count – Number of elements to reduce per rank.
func – Reduce function to use.
comm_id – Index of communicator to use.
from_fpga – Set to true if the data is already on the FPGA.
to_fpga – Set to true if the reduced data will be used on the FPGA only.
compress_dtype – Datatype to which buffers are compressed for transmission over Ethernet.
run_async – Run the ACCL call asynchronously.
waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.
- Returns
CCLO* CCLO object that can be waited on and passed to waitfor; nullptr if run_async is false.
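To make the reduce_scatter buffer sizes concrete, here is a host-only model in which every rank supplies count × world size elements and receives count reduced elements. Two things are assumptions for illustration: the reduction is a plain sum, and rank r receives the reduction of the r-th block (the MPI-style convention). `reduce_scatter_sum_model` is a hypothetical name and not an ACCL function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only model of reduce_scatter with a sum reduction.
// sendbufs[s] is the send buffer of rank s (count * world_size elements).
// Assumption: rank d receives the elementwise sum of block d of every
// rank's send buffer.
template <typename T>
std::vector<std::vector<T>> reduce_scatter_sum_model(
    const std::vector<std::vector<T>> &sendbufs, std::size_t count) {
  const std::size_t world_size = sendbufs.size();
  std::vector<std::vector<T>> recvbufs(world_size,
                                       std::vector<T>(count, T{}));
  for (std::size_t src = 0; src < world_size; ++src) {
    assert(sendbufs[src].size() == count * world_size);
    for (std::size_t dst = 0; dst < world_size; ++dst) {
      for (std::size_t i = 0; i < count; ++i) {
        recvbufs[dst][i] += sendbufs[src][dst * count + i];
      }
    }
  }
  return recvbufs;
}
```

With two ranks and count = 2, each rank passes four elements and gets back two: rank 0 receives the sums of elements 0 and 1, rank 1 the sums of elements 2 and 3.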
-
void barrier(communicatorId comm_id = GLOBAL_COMM, std::vector<CCLO*> waitfor = {})
Performs a barrier on the FPGA.
- Parameters
comm_id – Index of communicator to use.
waitfor – ACCL call will wait for these operations before it will start. Currently not implemented.
-
template<typename dtype>
inline std::unique_ptr<Buffer<dtype>> create_buffer(size_t length, dataType type)
Construct a new buffer object without an existing host buffer.
Note that when running in simulated mode, this constructor will not create an underlying simulated BO buffer. If you need this functionality, use create_buffer(xrt::bo &, size_t, dataType).
- Template Parameters
dtype – Datatype of the buffer.
- Parameters
length – Number of elements to allocate.
type – ACCL datatype of the buffer.
- Returns
std::unique_ptr<Buffer<dtype>> The allocated buffer.
-
template<typename dtype>
inline std::unique_ptr<Buffer<dtype>> create_buffer(size_t length, dataType type, unsigned mem_grp)
Construct a new buffer object without an existing host buffer on the specified memory bank.
Only use this function if you want to store the buffer on a different memory bank than the devicemem bank specified during construction.
Note that when running in simulated mode, this constructor will not create an underlying simulated BO buffer. If you need this functionality, use create_buffer(xrt::bo &, size_t, dataType).
- Template Parameters
dtype – Datatype of the buffer.
- Parameters
length – Number of elements to allocate.
type – ACCL datatype of the buffer.
mem_grp – Memory bank to allocate buffer on.
- Returns
std::unique_ptr<Buffer<dtype>> The allocated buffer.
-
template<typename dtype>
inline std::unique_ptr<Buffer<dtype>> create_buffer(dtype *host_buffer, size_t length, dataType type)
Construct a new buffer object from an existing host pointer.
On hardware, the host pointer is required to be aligned to 4096 bytes. If an unaligned host pointer is provided and ACCL is running on hardware, ACCL will keep its own aligned host buffer and copy between the unaligned and aligned host buffers when required. Providing an aligned host pointer is recommended to avoid unnecessary memory copies.
Note that when running in simulated mode, this constructor will not create an underlying simulated BO buffer. If you need this functionality, use create_buffer(xrt::bo &, size_t, dataType).
- Template Parameters
dtype – Datatype of the buffer.
- Parameters
host_buffer – The host pointer containing the data.
length – Number of elements in the host buffer.
type – ACCL datatype of the buffer.
- Returns
std::unique_ptr<Buffer<dtype>> The allocated buffer.
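One way to obtain a host pointer that meets the 4096-byte alignment mentioned above is C++17 `std::aligned_alloc`. The sketch below is a minimal illustration under that assumption. `allocate_aligned`, `AlignedArray`, `FreeDeleter` and `is_aligned` are hypothetical helper names, not part of ACCL, and the helper is only meant for trivially constructible element types such as int or float.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Alignment taken from the description above.
constexpr std::size_t kHostAlignment = 4096;

// Deleter so std::unique_ptr releases memory from std::aligned_alloc.
struct FreeDeleter {
  void operator()(void *p) const { std::free(p); }
};

template <typename T>
using AlignedArray = std::unique_ptr<T[], FreeDeleter>;

// Hypothetical helper: allocate room for `length` elements of a trivially
// constructible type T, starting on a 4096-byte boundary.
template <typename T>
AlignedArray<T> allocate_aligned(std::size_t length) {
  // std::aligned_alloc requires the size to be a multiple of the
  // alignment, so round the byte count up (and never request 0 bytes).
  std::size_t bytes = length * sizeof(T);
  bytes = (bytes + kHostAlignment - 1) / kHostAlignment * kHostAlignment;
  if (bytes == 0) {
    bytes = kHostAlignment;
  }
  void *raw = std::aligned_alloc(kHostAlignment, bytes);
  assert(raw != nullptr);
  return AlignedArray<T>(static_cast<T *>(raw));
}

// Returns true if `p` lies on a 4096-byte boundary.
inline bool is_aligned(const void *p) {
  return reinterpret_cast<std::uintptr_t>(p) % kHostAlignment == 0;
}
```

A pointer obtained this way, for example `allocate_aligned<float>(n).get()` kept alive by its owning `AlignedArray`, would then serve as the host_buffer argument.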
-
template<typename dtype>
inline std::unique_ptr<Buffer<dtype>> create_buffer(dtype *host_buffer, size_t length, dataType type, unsigned mem_grp)
Construct a new buffer object from an existing host pointer on the specified memory bank.
Only use this function if you want to store the buffer on a different memory bank than the devicemem bank specified during construction.
On hardware, the host pointer is required to be aligned to 4096 bytes. If an unaligned host pointer is provided and ACCL is running on hardware, ACCL will keep its own aligned host buffer and copy between the unaligned and aligned host buffers when required. Providing an aligned host pointer is recommended to avoid unnecessary memory copies.
Note that when running in simulated mode, this constructor will not create an underlying simulated BO buffer. If you need this functionality, use create_buffer(xrt::bo &, size_t, dataType).
- Template Parameters
dtype – Datatype of the buffer.
- Parameters
host_buffer – The host pointer containing the data.
length – Number of elements in the host buffer.
type – ACCL datatype of the buffer.
mem_grp – Memory bank to allocate buffer on.
- Returns
std::unique_ptr<Buffer<dtype>> The allocated buffer.
-
template<typename dtype>
inline std::unique_ptr<Buffer<dtype>> create_buffer(xrt::bo &bo, size_t length, dataType type)
Construct a new buffer object from an existing BO buffer.
When using an ACCL emulator or simulator, this function can be used to pass a simulated BO buffer from the Vitis emulator and use the Vitis emulator together with the ACCL emulator. In this case, ACCL will also create a new internal simulated BO buffer to copy data between the simulated BO buffer and the simulated ACCL buffer when required.
When running on hardware, ACCL will simply use this BO buffer internally, instead of allocating a new one.
- Template Parameters
dtype – Datatype of the buffer.
- Parameters
bo – The BO buffer to use.
length – Number of elements in the BO buffer.
type – ACCL datatype of the buffer.
- Returns
std::unique_ptr<Buffer<dtype>> The allocated buffer.
-
template<typename dtype>
inline std::unique_ptr<Buffer<dtype>> create_buffer_p2p(size_t length, dataType type)
Construct a new p2p buffer object.
Will create a normal buffer when running in simulated mode.
Note that when running in simulated mode, this constructor will not create an underlying simulated BO buffer. If you need this functionality, use create_buffer_p2p(xrt::bo &, size_t, dataType).
- Template Parameters
dtype – Datatype of the buffer.
- Parameters
length – Number of elements to allocate.
type – ACCL datatype of the buffer.
- Returns
std::unique_ptr<Buffer<dtype>> The allocated P2P buffer.
-
template<typename dtype>
inline std::unique_ptr<Buffer<dtype>> create_buffer_p2p(size_t length, dataType type, unsigned mem_grp)
Construct a new p2p buffer object on the specified memory bank.
Will create a normal buffer when running in simulated mode.
Only use this function if you want to store the buffer on a different memory bank than the devicemem bank specified during construction.
Note that when running in simulated mode, this constructor will not create an underlying simulated BO buffer. If you need this functionality, use create_buffer_p2p(xrt::bo &, size_t, dataType).
- Template Parameters
dtype – Datatype of the buffer.
- Parameters
length – Number of elements to allocate.
type – ACCL datatype of the buffer.
mem_grp – Memory bank to allocate buffer on.
- Returns
std::unique_ptr<Buffer<dtype>> The allocated P2P buffer.
-
template<typename dtype>
inline std::unique_ptr<Buffer<dtype>> create_buffer_p2p(xrt::bo &bo, size_t length, dataType type)
Construct a new p2p buffer object from an existing P2P BO buffer.
If you pass a non-P2P BO buffer, data will not be copied correctly from and to the FPGA.
Will create a normal buffer when running in simulated mode. See the notes of create_buffer(xrt::bo &, size_t, dataType) about using BO buffers in simulated mode.
- Template Parameters
dtype – Datatype of the buffer.
- Parameters
bo – The P2P BO buffer to use.
length – Number of elements to allocate.
type – ACCL datatype of the buffer.
- Returns
std::unique_ptr<Buffer<dtype>> The allocated P2P buffer.
-
std::string dump_exchange_memory()
Dump the content of the exchange memory to a string.
- Returns
std::string Content of the exchange memory.
-
std::string dump_rx_buffers(size_t nbufs)
Dump the content of the RX buffers to a string for the first nbufs buffers.
- Parameters
nbufs – Number of buffers to dump.
- Returns
std::string Content of the RX buffers.
-
inline std::string dump_rx_buffers()
Dump the content of all RX buffers to a string.
- Returns
std::string Content of all RX buffers.
-
std::string dump_communicator()
Dump the content of the communicator to a string.
- Returns
std::string Content of the communicator.
-
inline int devicemem()
Retrieve the devicemem memory bank.
- Returns
int The devicemem memory bank
-