python - RDD to multidimensional array


I am using Spark's Python API and am finding a few matrix operations challenging. My RDD is a one-dimensional list of length n (a row vector). Is it possible to reshape it into a matrix/multidimensional array of size sqrt(n) x sqrt(n)?

for example,

vec=[1,2,3,4,5,6,7,8,9] 

and the desired 3 x 3 output is:

[[1,2,3], [4,5,6], [7,8,9]]

Is there an equivalent of NumPy's reshape?

Conditions: n is huge (>50 million), which rules out using .collect(). Also, can the process be made to run on multiple threads?

Something like this should do the trick:

rdd = sc.parallelize(range(1, 10))
nrow = int(rdd.count() ** 0.5)  # compute the number of rows

rows = (rdd
    .zipWithIndex()  # add an index; we assume the data is already sorted
    .groupBy(lambda xi: xi[1] // nrow)  # group by row number
    # order by column within each row and drop the index
    .mapValues(lambda vals: [x for (x, i) in sorted(vals, key=lambda xi: xi[1])]))
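To check the index arithmetic without a cluster, here is a minimal plain-Python sketch of the same three steps. It does not use Spark at all. The name reshape_square is hypothetical, and itertools.groupby stands in for the RDD groupBy only because the indices here are already in increasing order.

```python
from itertools import groupby

def reshape_square(values):
    """Plain-Python stand-in for the zipWithIndex / groupBy / mapValues steps."""
    values = list(values)
    nrow = int(len(values) ** 0.5)  # same row-count computation as the Spark snippet
    # Mimic zipWithIndex: pair each value with its position as (value, index).
    indexed = [(x, i) for i, x in enumerate(values)]
    # itertools.groupby only merges adjacent items, which is sufficient here
    # because the indices are already sorted.
    grouped = groupby(indexed, key=lambda xi: xi[1] // nrow)
    # Within each row, order by index and drop the index.
    return [(row, [x for (x, _) in sorted(pairs, key=lambda xi: xi[1])])
            for row, pairs in grouped]

result = reshape_square(range(1, 10))
print(result)
```

Each element of the result is a (row number, row values) pair, which is the same shape the Spark version produces before any further processing.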

If you want proper vectors, you can also apply DenseVector to the values of rows:

from pyspark.mllib.linalg import DenseVector
rows.mapValues(DenseVector)
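One caveat: computing the row count as int(n ** 0.5) assumes n is a perfect square and relies on floating-point rounding. The sketch below is one possible guard, not part of the original answer. The helper name square_side is hypothetical, and math.isqrt requires Python 3.8 or later.

```python
import math

def square_side(n):
    """Return the side length if n is a perfect square, else raise ValueError."""
    side = math.isqrt(n)  # exact integer square root, no floating point involved
    if side * side != n:
        raise ValueError("%d elements cannot form a square matrix" % n)
    return side

print(square_side(9))
```

Under that assumption, nrow could be computed as square_side(rdd.count()) so that a non-square input fails early instead of silently producing ragged rows.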

