First - RDD stands for Resilient Distributed Dataset.
A Spark RDD is a distributed collection of data. This collection is usually created in one of two ways: from external data (a file, data from HDFS) or by distributing a collection of objects (e.g. a List or Set) in the driver program.
Scala code to create an RDD:
- External data RDD: val lines = sc.textFile("input.txt")
- Distributed collection RDD: val nums = sc.parallelize(List(1, 2, 3, 4))
*sc is the SparkContext object
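In the Spark shell, sc is created for you; in a standalone application you create it yourself. A minimal sketch, assuming the app name "RDD-Demo" and master "local[*]" as placeholder values:

    import org.apache.spark.{SparkConf, SparkContext}

    // "RDD-Demo" and "local[*]" are placeholder values for this sketch
    val conf = new SparkConf().setAppName("RDD-Demo").setMaster("local[*]")
    val sc = new SparkContext(conf)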
Now we have RDDs created in our driver program. Once an RDD is created, we can perform computations on it.
Two kinds of operations can be performed on an RDD:
- Transformations: a transformation returns a new RDD. Transformations are lazy, so nothing is actually computed until an action is called. Commonly used transformations (see the sketch after this list):
- flatMap(): applies a function to each element of the RDD and flattens the contents of the returned iterators into a new RDD.
- filter(): returns an RDD containing only the elements that pass the filter condition.
- map(): returns a new RDD by applying a function to each element of the RDD.
- distinct(): removes duplicate elements.
- union(): produces an RDD containing the elements of both RDDs.
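A short sketch chaining these transformations together (input.txt and the sample values are illustrative; nothing runs yet, because transformations are lazy):

    val lines = sc.textFile("input.txt")                   // external data
    val words = lines.flatMap(line => line.split(" "))     // flatMap: one line -> many words
    val longWords = words.filter(word => word.length > 3)  // filter: keep words longer than 3 characters
    val lengths = longWords.map(word => word.length)       // map: each word -> its length
    val uniqueLengths = lengths.distinct()                 // distinct: drop duplicate lengths
    val nums = sc.parallelize(List(1, 2, 3, 4))
    val combined = uniqueLengths.union(nums)               // union: elements from both RDDs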
- Actions: actions return a value to the driver program or write data to external storage; calling an action is what triggers the actual computation. Commonly used actions (see the sketch after this list):
- collect(): returns all elements of the RDD to the driver.
- count(): returns the number of elements in the RDD.
- foreach(): applies a function to each element of the RDD.
- top(num): returns the top num elements of the RDD.
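Continuing the sketch, a few actions on nums (the printed values assume List(1, 2, 3, 4)):

    nums.collect().foreach(println)     // brings Array(1, 2, 3, 4) to the driver, then prints it
    println(nums.count())               // 4
    nums.foreach(x => println(x))       // runs on the executors; in a cluster, output appears in executor logs
    println(nums.top(2).mkString(", ")) // "4, 3" - the two largest elements, in descending order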