Multicore usage and memory sharing in Spark and Scala


I'm new to Spark and Scala, and I have a few questions:

  1. When I set the master to local[2], does that mean that whenever I perform rdd.map( ... ).collect(), the map function is executed on 2 cores?
  2. Following on from 1, is it optimal if the number of items in the RDD equals the number of cores? (So each map is done on one core.)
  3. How is memory shared if I reuse values?

    i.e.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setMaster("local[" + cores + "]").setAppName("example")
    val sc = new SparkContext(conf)

    // in practice this is a sparse matrix that I want shared between threads
    val myHugeValue = 100

    // in practice the following interacts with myHugeValue
    sc.parallelize(Array.range(0, cores)).map(myHugeValue + _).collect()

    In the above, is myHugeValue copied to each mapper function that uses it? If so, how can I change the implementation so that it isn't copied over?

Edit 1:

To be more specific about 2., suppose there are 10 nodes or worker cores in the cluster. Won't the processing be done most efficiently if the data is split into 10 parts, so that each node can process one batch? Is there any reason to split the data into smaller chunks and increase the network overhead?
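To illustrate what I mean by splitting into one part per worker, here is a minimal sketch (assuming the second argument to sc.parallelize controls how many partitions the data is split into; the numbers are made up):

    // hypothetical: split the data into exactly as many partitions as there are workers
    val workers = 10
    val data = Array.range(0, 1000)
    val rdd = sc.parallelize(data, workers)  // one batch per worker
    println(rdd.partitions.length)           // expected to print 10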

As for the reason I don't want to copy the value: myHugeValue is a huge object, and copying it for every map operation takes time. The map operations are called over a lot of iterations. I would rather cache the object once per cluster and call the map function repeatedly. Is there a Spark way of doing this?
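What I have in mind is roughly the sketch below, assuming broadcast variables are the right mechanism for this (I'm not sure they are):

    // hypothetical sketch: ship myHugeValue to each worker once via a broadcast variable
    val myHugeValueBc = sc.broadcast(myHugeValue)

    // the map reads the broadcast value instead of capturing myHugeValue directly,
    // so the object should be sent to each node once rather than with every closure
    sc.parallelize(Array.range(0, cores)).map(myHugeValueBc.value + _).collect()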

