python - RDD to multidimensional array -

- September 15, 2011

I am using Spark's Python API and am finding a few matrix operations challenging. My RDD is a one-dimensional list of length n (a row vector). Is it possible to reshape it into a matrix/multidimensional array of size sqrt(n) x sqrt(n)?

For example,

vec=[1,2,3,4,5,6,7,8,9]

and the desired output is 3 x 3:

[[1,2,3],
 [4,5,6],
 [7,8,9]]

Is there an equivalent of NumPy's reshape?

Conditions: n is huge (>50 million), which rules out using .collect(). Also, can this process be made to run on multiple threads?

Something like this should do the trick:

# sc is an existing SparkContext
rdd = sc.parallelize(range(1, 10))
nrow = int(rdd.count() ** 0.5)  # compute the number of rows
rows = (rdd
    .zipWithIndex()  # add an index; we assume the data is sorted
    .groupBy(lambda xi: xi[1] // nrow)  # group by row number
    # order by column index and drop the index
    .mapValues(lambda vals: [x for (x, i) in sorted(vals, key=lambda xi: xi[1])]))
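To make the index arithmetic easier to follow, here is a minimal plain-Python sketch of the same idea, without Spark. The function name reshape_square is hypothetical and only for illustration; enumerate stands in for zipWithIndex, and a dictionary stands in for the grouping step.

```python
def reshape_square(values):
    """Group a flat list into rows of length int(sqrt(len(values)))."""
    nrow = int(len(values) ** 0.5)  # side length of the square matrix
    rows = {}
    for i, x in enumerate(values):  # enumerate plays the role of zipWithIndex
        # element i belongs to row i // nrow; appending keeps column order
        rows.setdefault(i // nrow, []).append(x)
    # return the rows ordered by row number
    return [rows[r] for r in sorted(rows)]


print(reshape_square([1, 2, 3, 4, 5, 6, 7, 8, 9]))
# [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

In the Spark version the groups come back keyed by row number, so the final ordering by key corresponds to the sorted(rows) step here.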

You can add:

from pyspark.mllib.linalg import DenseVector
rows.mapValues(DenseVector)

if you want proper vectors.
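The row count above is taken as int(rdd.count() ** 0.5), which silently truncates when n is not a perfect square. As a hedged sketch, the side length could instead be checked in exact integer arithmetic before reshaping. The helper name square_side is hypothetical, and math.isqrt assumes Python 3.8 or newer.

```python
import math


def square_side(n):
    """Return k such that k * k == n, or raise ValueError if none exists."""
    k = math.isqrt(n)  # exact integer square root, no floating point involved
    if k * k != n:
        raise ValueError("%d elements cannot form a square matrix" % n)
    return k


print(square_side(9))
# 3
```

With an RDD, the result of rdd.count() would be passed in as n.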
