Wednesday, December 28, 2016

Spark - RDD


First - RDD stands for Resilient Distributed Dataset.

Spark RDD is a distributed collection of data. This collection is usually created in one of two ways: from external data (a file, data from HDFS) or by distributing a collection of objects (e.g. a List or Set) in the driver program.

Scala code to create RDD:

  1. External data RDD: val lines = sc.textFile("input.txt")
  2. Distributed collection RDD: val nums = sc.parallelize(List(1, 2, 3, 4))
*sc is the SparkContext object

Now we have RDDs created in our driver program. Once an RDD is created, we can perform computations on it.
Two kinds of operations can be performed on an RDD:

  • Transformations: Transformations result in new RDDs. Commonly used transformations:
    • flatMap(): applies a function to each element in the RDD and returns the contents of the returned iterators as a new RDD.
    • filter(): returns an RDD that contains only the elements that pass the filter condition.
    • map(): returns an RDD produced by applying a function to each element in the RDD.
    • distinct(): removes duplicate elements.
    • union(): produces an RDD containing the elements of both RDDs.
  • Actions: Actions are operations that return a value or write data. Commonly used actions:
    • collect(): returns all elements in the RDD.
    • count(): returns the number of elements in the RDD.
    • foreach(): iterates over the elements in the RDD.
    • top(num): returns the top num elements from the RDD.
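The RDD API mirrors Scala's own collection API, so the transformations and actions above can be sketched on a plain local List without a Spark cluster. This is only an analogy for illustration (the data values are made up); on a real RDD you would call the same methods on the value returned by sc.textFile or sc.parallelize.

```scala
// Local-collection analogy for the RDD operations above.
val lines = List("to be or", "not to be")

val words    = lines.flatMap(_.split(" "))  // flatMap: split each line into words
val longOnes = words.filter(_.length > 2)   // filter: keep words longer than 2 chars
val upper    = words.map(_.toUpperCase)     // map: transform each element
val unique   = words.distinct               // distinct: drop duplicate words
val combined = words ++ List("question")    // union-like: combine two datasets

// Action-style calls on the local collection:
println(words.size)     // like count()
println(unique)         // like collect() on the distinct RDD
words.foreach(println)  // like foreach(): process each element
```

On a real RDD the transformations are lazy: nothing is computed until an action such as count() or collect() is called.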

Scala - Scala Build Tool (SBT)


A couple of weeks back I started with Spark, which requires Scala and SBT (a build tool for Scala). I downloaded Scala 2.12.1 and SBT 0.13. While creating an assembly jar (package) for a SimpleWordCount Scala project using SBT, I hit an issue. The error message was not clear enough to figure out a solution. Later, through the documentation, I learned that my SBT 0.13 setup did not support Scala 2.12.1. I had to go back and download Scala 2.10.6. After that, packaging SimpleWordCount worked fine.

Before getting SBT, check the supported versions of Scala.
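For reference, a minimal build.sbt for a project like SimpleWordCount could look like the sketch below. The project name, version number, and the Spark dependency line are assumptions for illustration; pick the Spark artifact that is built for your Scala version.

```scala
// build.sbt — minimal sketch for an SBT 0.13 project (assumed layout)
name := "SimpleWordCount"  // hypothetical project name

version := "1.0"           // hypothetical version

scalaVersion := "2.10.6"   // must be a Scala version your SBT release supports

// Spark version here is an assumption for illustration; "provided" keeps the
// Spark jars out of the assembly jar, since the cluster supplies Spark at runtime.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.3" % "provided"
```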