python - RDD to multidimensional array
I am using Spark's Python API and am finding a few matrix operations challenging. I have an RDD that is a one-dimensional list of length n (a row vector). Is it possible to reshape it into a matrix/multidimensional array of size sqrt(n) x sqrt(n)?
For example,
    vec = [1, 2, 3, 4, 5, 6, 7, 8, 9]
and the desired 3 x 3 output is
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
Is there an equivalent of NumPy's reshape?
Conditions: n is huge (> 50 million), which rules out using .collect(). Also, can this process be made to run on multiple threads?
Something like this should do the trick:
    # sc is an existing SparkContext, as in the pyspark shell
    rdd = sc.parallelize(range(1, 10))
    nrow = int(rdd.count() ** 0.5)  # compute the number of rows

    rows = (rdd
        .zipWithIndex()  # add an index; this assumes the data is already in order
        .groupBy(lambda xi: xi[1] // nrow)  # group by row number
        # order each row by index (i.e. by column) and drop the index
        .mapValues(lambda vals: [x for (x, i) in sorted(vals, key=lambda xi: xi[1])]))
You can add:

    from pyspark.mllib.linalg import DenseVector

    rows.mapValues(DenseVector)

if you want proper vectors.
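To make the indexing in the answer easier to follow, here is a minimal sketch of the same three steps (index, group by `index // nrow`, sort and drop the index) on a plain Python list, with no Spark involved. The helper name `reshape_square` is hypothetical and only for illustration, and the perfect-square check is an added assumption rather than something the answer does.

```python
from math import isqrt


def reshape_square(values):
    """Spark-free illustration: group a flat sequence into sqrt(n) rows."""
    values = list(values)
    nrow = isqrt(len(values))  # integer square root, avoids float rounding
    if nrow * nrow != len(values):
        raise ValueError("length must be a perfect square")

    # Pair each value with its position, mirroring zipWithIndex()
    indexed = [(x, i) for i, x in enumerate(values)]

    # Bucket the pairs by row number, mirroring groupBy(index // nrow)
    buckets = {}
    for x, i in indexed:
        buckets.setdefault(i // nrow, []).append((x, i))

    # Sort each bucket by original index and drop it, mirroring mapValues(...)
    return [
        [x for x, _ in sorted(buckets[row], key=lambda xi: xi[1])]
        for row in sorted(buckets)
    ]


vec = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(reshape_square(vec))  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Note that the sketch sorts the row keys before returning. In the Spark version the result is an RDD of (row number, values) pairs and grouping does not by itself order those pairs by key, so if row order matters, a `sortByKey()` on the result is probably what you want.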