C++ - CUDA cudaMemcpy with a Struct of Arrays


I'd like to clean up the parameters of the CUDA kernels in my project.


Right now, a kernel that needs 3 uint32_t arrays leads to pretty ugly code (id is the global thread id, valX is an arbitrary value):

__global__ void some_kernel(uint32_t *arr1, uint32_t *arr2, uint32_t *arr3)
{
    arr1[id] = val1;
    arr2[id] = val2;
    arr3[id] = val3;
}

I'd like to wrap these arrays in a struct:

typedef struct s {
    uint32_t *arr1;
    uint32_t *arr2;
    uint32_t *arr3;
    uint32_t size;
} s;

where size denotes the length of every arrX inside the struct.

so that what I have would look like:

__global__ void some_kernel(s *s)
{
    s->arr1[id] = val1;
    s->arr2[id] = val2;
    s->arr3[id] = val3;
}

What would the corresponding cudaMalloc and cudaMemcpy calls for this struct look like? Are there any performance drawbacks to this that I'm not seeing yet?

Thanks in advance!

You have at least two options. One excellent choice has already been given by talonmies, so here I'll introduce the "learn it the hard way" approach.

First, the struct definition:

typedef struct s {
    uint32_t *arr1;
    uint32_t *arr2;
    uint32_t *arr3;
    uint32_t size;
} s;

...and the kernel definition (with a global size variable, but you don't need to follow that pattern):

const int size = 10000;

__global__ void some_kernel(s *s)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < size)
    {
        s->arr1[id] = 1; // val1
        s->arr2[id] = 2; // val2
        s->arr3[id] = 3; // val3
    }
}

Notice that the if guards against running out of bounds.

Next comes a function that prepares the data, executes the kernel and prints the result. Part 1 is the data allocation:

uint32_t *host_arr1, *host_arr2, *host_arr3;
uint32_t *dev_arr1, *dev_arr2, *dev_arr3;

// allocate and fill host data
host_arr1 = new uint32_t[size]();
host_arr2 = new uint32_t[size]();
host_arr3 = new uint32_t[size]();

// allocate device data
cudaMalloc((void **) &dev_arr1, size * sizeof(*dev_arr1));
cudaMalloc((void **) &dev_arr2, size * sizeof(*dev_arr2));
cudaMalloc((void **) &dev_arr3, size * sizeof(*dev_arr3));

// allocate the helper struct on the device
s *dev_s;
cudaMalloc((void **) &dev_s, sizeof(*dev_s));

There's nothing special here: we allocate three arrays and a struct. What looks more interesting is how to handle the copying of such data to the device:

// copy data from host to device
cudaMemcpy(dev_arr1, host_arr1, size * sizeof(*dev_arr1), cudaMemcpyHostToDevice);
cudaMemcpy(dev_arr2, host_arr2, size * sizeof(*dev_arr2), cudaMemcpyHostToDevice);
cudaMemcpy(dev_arr3, host_arr3, size * sizeof(*dev_arr3), cudaMemcpyHostToDevice);

// note: binding the pointers to dev_s
cudaMemcpy(&(dev_s->arr1), &dev_arr1, sizeof(dev_s->arr1), cudaMemcpyHostToDevice);
cudaMemcpy(&(dev_s->arr2), &dev_arr2, sizeof(dev_s->arr2), cudaMemcpyHostToDevice);
cudaMemcpy(&(dev_s->arr3), &dev_arr3, sizeof(dev_s->arr3), cudaMemcpyHostToDevice);

Besides the ordinary copies of the arrays that you probably expected, it's also necessary to "bind" them to the struct. For that you need to pass the address of each pointer; as a result, only the pointers themselves are copied.
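As an alternative to the three member-wise copies (this sketch is my addition, not part of the original answer), you can fill a host-side copy of the struct with the device pointers and transfer the whole struct in a single cudaMemcpy:

```cuda
// assumes dev_arr1, dev_arr2, dev_arr3 and dev_s were already
// allocated on the device as shown above
s host_s;
host_s.arr1 = dev_arr1;  // store *device* pointers in a host-side struct
host_s.arr2 = dev_arr2;
host_s.arr3 = dev_arr3;
host_s.size = size;

// one copy transfers all members, pointers included
cudaMemcpy(dev_s, &host_s, sizeof(host_s), cudaMemcpyHostToDevice);
```

This replaces three small transfers with one and is easy to extend when the struct grows; note that host_s itself must never be dereferenced on the host, since its members point to device memory.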

Next comes the kernel call, copying the data back to the host and printing the results:

// call the kernel; launch enough blocks to cover all elements
some_kernel<<<10000/256 + 1, 256>>>(dev_s);

// copy the results back to the host
cudaMemcpy(host_arr1, dev_arr1, size * sizeof(*host_arr1), cudaMemcpyDeviceToHost);
cudaMemcpy(host_arr2, dev_arr2, size * sizeof(*host_arr2), cudaMemcpyDeviceToHost);
cudaMemcpy(host_arr3, dev_arr3, size * sizeof(*host_arr3), cudaMemcpyDeviceToHost);

// print the result
std::cout << host_arr1[size-1] << std::endl;
std::cout << host_arr2[size-1] << std::endl;
std::cout << host_arr3[size-1] << std::endl;
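The example above never frees its allocations; in real code you would release them once the results are no longer needed. A minimal sketch (my addition, using the same variable names):

```cuda
// release device memory (each cudaMalloc gets a matching cudaFree)
cudaFree(dev_arr1);
cudaFree(dev_arr2);
cudaFree(dev_arr3);
cudaFree(dev_s);

// release host memory (each new[] gets a matching delete[])
delete[] host_arr1;
delete[] host_arr2;
delete[] host_arr3;
```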

Keep in mind that in any serious code you should check the return values of all CUDA API calls for errors.
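A common pattern for that is a small checking macro; the name CUDA_CHECK here is my own choice, not part of the CUDA API:

```cuda
#include <cstdio>
#include <cstdlib>

// wrap every CUDA runtime call, e.g. CUDA_CHECK(cudaMalloc(...));
// on failure, print the error string with its location and abort
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",          \
                         cudaGetErrorString(err), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)
```

With this, each call above becomes e.g. CUDA_CHECK(cudaMemcpy(...)); since kernel launches return no status themselves, they can be followed by CUDA_CHECK(cudaGetLastError()).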

