c++ - CUDA cudaMemcpy Struct of Arrays
I'd like to clean up the parameters of the CUDA kernels in my project. Right now, a kernel needs three uint32_t arrays, which leads to pretty ugly code (id is the global thread id, valX an arbitrary value):

    __global__ void some_kernel(uint32_t *arr1, uint32_t *arr2, uint32_t *arr3)
    {
        arr1[id] = val1;
        arr2[id] = val2;
        arr3[id] = val3;
    }
I'd like to surround these arrays with a struct:

    typedef struct s {
        uint32_t *arr1;
        uint32_t *arr2;
        uint32_t *arr3;
        uint32_t size;
    } s;

where size denotes the length of every arrX inside the struct.
Then the kernel would look like:

    __global__ void some_kernel(s *s)
    {
        s->arr1[id] = val1;
        s->arr2[id] = val2;
        s->arr3[id] = val3;
    }
What would the corresponding cudaMalloc and cudaMemcpy calls for this struct look like? And are there any performance drawbacks to this that I'm not seeing yet?

Thanks in advance!
You have at least two options. One excellent choice was already given by talonmies, but I'll introduce you to the "learn it the hard way" approach.
First, the struct definition:
    typedef struct s {
        uint32_t *arr1;
        uint32_t *arr2;
        uint32_t *arr3;
        uint32_t size;
    } s;
...and the kernel definition (here with a global size variable, but you don't need to follow that pattern):
    const int size = 10000;

    __global__ void some_kernel(s *s)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < size)
        {
            s->arr1[id] = 1; // val1
            s->arr2[id] = 2; // val2
            s->arr3[id] = 3; // val3
        }
    }
Notice that the if protects us from running out-of-bounds.
Next comes a function that prepares the data, executes the kernel, and prints the result. Part one is data allocation:
    uint32_t *host_arr1, *host_arr2, *host_arr3;
    uint32_t *dev_arr1, *dev_arr2, *dev_arr3;

    // allocate and fill host data
    host_arr1 = new uint32_t[size]();
    host_arr2 = new uint32_t[size]();
    host_arr3 = new uint32_t[size]();

    // allocate device data
    cudaMalloc((void **) &dev_arr1, size * sizeof(*dev_arr1));
    cudaMalloc((void **) &dev_arr2, size * sizeof(*dev_arr2));
    cudaMalloc((void **) &dev_arr3, size * sizeof(*dev_arr3));

    // allocate the helper struct on the device
    s *dev_s;
    cudaMalloc((void **) &dev_s, sizeof(*dev_s));
It's nothing special; we allocate three arrays and the struct. What looks more interesting is how to handle copying such data to the device:
    // copy data from host to device
    cudaMemcpy(dev_arr1, host_arr1, size * sizeof(*dev_arr1), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_arr2, host_arr2, size * sizeof(*dev_arr2), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_arr3, host_arr3, size * sizeof(*dev_arr3), cudaMemcpyHostToDevice);

    // NOTE: binding the pointers to dev_s
    cudaMemcpy(&(dev_s->arr1), &dev_arr1, sizeof(dev_s->arr1), cudaMemcpyHostToDevice);
    cudaMemcpy(&(dev_s->arr2), &dev_arr2, sizeof(dev_s->arr2), cudaMemcpyHostToDevice);
    cudaMemcpy(&(dev_s->arr3), &dev_arr3, sizeof(dev_s->arr3), cudaMemcpyHostToDevice);
Besides the ordinary copies of the arrays that you may have expected, it's also necessary to "bind" them to the struct. For that, you need to pass the address of each pointer. As a result, only the pointers themselves are copied.
Next comes the kernel call, copying the data back to the host, and printing the results:
    // call the kernel
    some_kernel<<<10000/256 + 1, 256>>>(dev_s); // enough blocks of 256 threads to cover all elements

    // copy the result back to the host
    cudaMemcpy(host_arr1, dev_arr1, size * sizeof(*host_arr1), cudaMemcpyDeviceToHost);
    cudaMemcpy(host_arr2, dev_arr2, size * sizeof(*host_arr2), cudaMemcpyDeviceToHost);
    cudaMemcpy(host_arr3, dev_arr3, size * sizeof(*host_arr3), cudaMemcpyDeviceToHost);

    // print the result
    std::cout << host_arr1[size-1] << std::endl;
    std::cout << host_arr2[size-1] << std::endl;
    std::cout << host_arr3[size-1] << std::endl;
Keep in mind that in any serious code you should always check for errors from CUDA API calls.