Multicore usage and memory sharing in Spark and Scala
I'm new to Spark and Scala and have a few questions:

- When I set master to local[2], does that mean that whenever I perform rdd.map( ... ).collect(), the map function executes on 2 cores?
- Following on from 1, is it optimal if the number of items in the RDD equals the number of cores (so that each map is done on one core)?
- How is memory shared if I reuse values? I.e.:
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setMaster("local[" + cores + "]")
    val sc = new SparkContext(conf)

    // in practice this is a sparse matrix that I want shared between threads
    val myHugeValue = 100

    // in practice the following interacts with myHugeValue
    sc.parallelize(Array.range(0, cores)).map(myHugeValue + _).collect()
In the code above, is myHugeValue copied into each mapper function that uses myHugeValue? If so, how can I change the implementation so that it is not copied over?
Edit 1:

To be more specific about question 2: suppose there are 10 nodes or worker cores in the cluster. Wouldn't the processing be done most efficiently if the data were split into 10 parts, so that each node can process one batch? Is there a reason to split the data into smaller chunks and increase the network overhead?
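To make that concrete, here is a minimal, untested sketch of the kind of call I mean (the data and the partition count of 10 are just for illustration; as far as I understand, the second argument to parallelize sets the number of partitions):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setMaster("local[10]").setAppName("partition-sketch")
    val sc = new SparkContext(conf)

    // hypothetical data set, only to illustrate the question
    val data = Array.range(0, 1000)

    // ask for exactly 10 partitions so each worker core would get one batch
    val rdd = sc.parallelize(data, 10)
    println(rdd.getNumPartitions) // expected to print 10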
As for the reason I don't want the value myHugeValue to be copied: it is a huge object, and copying it takes time on every map operation, and the map operation is called over many iterations. I would rather cache the object once per cluster and call the map function repeatedly. Is there a Spark way of doing this?
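I have seen broadcast variables mentioned elsewhere; below is a minimal, untested sketch of what I think that would look like (the object and app names are just placeholders, and I'm assuming sc.broadcast and .value are the right calls for sharing the value once per executor rather than once per task). Is this the intended approach?

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastSketch {
      def main(args: Array[String]): Unit = {
        val cores = 2
        val conf = new SparkConf()
          .setMaster("local[" + cores + "]")
          .setAppName("broadcast-sketch")
        val sc = new SparkContext(conf)

        // wrap the large value in a broadcast variable so it is shipped
        // to each executor once instead of with every task
        val myHugeValue = 100 // stands in for the sparse matrix
        val hugeBroadcast = sc.broadcast(myHugeValue)

        // inside the map, read the shared value through .value
        val result = sc.parallelize(Array.range(0, cores))
          .map(i => hugeBroadcast.value + i)
          .collect()

        println(result.mkString(", "))
        sc.stop()
      }
    }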