First - RDD stands for Resilient Distributed Dataset.
A Spark RDD is a distributed collection of data. This collection is usually created in one of two ways: from external data (a file, data from HDFS) or by distributing a collection of objects (e.g. a List or Set) in the driver program.
Scala code to create an RDD:
- External data RDD: val lines = sc.textFile("input.txt")
- Distributed collection RDD: val nums = sc.parallelize(List(1, 2, 3, 4))
*sc is the SparkContext object
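In the Spark shell, sc is created for you; in a standalone application you create it yourself. A minimal sketch, assuming the app name "RDD-Demo" and master "local[*]" as placeholder values:

    import org.apache.spark.{SparkConf, SparkContext}

    // "RDD-Demo" and "local[*]" are placeholder values for this sketch
    val conf = new SparkConf().setAppName("RDD-Demo").setMaster("local[*]")
    val sc = new SparkContext(conf)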
Now we have RDDs created in our driver program. Once an RDD is created, we can perform computations on it.
Two kinds of operations can be performed on an RDD:
- Transformations: a transformation returns a new RDD. Transformations are lazy, so nothing is actually computed until an action is called. Commonly used transformations (see the sketch after this list):
- flatMap(): applies a function to each element of the RDD and flattens the contents of the returned iterators into a new RDD.
- filter(): returns an RDD containing only the elements that pass the filter condition.
- map(): returns a new RDD by applying a function to each element of the RDD.
- distinct(): removes duplicate elements.
- union(): produces an RDD containing the elements of both RDDs.
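A short sketch chaining these transformations together (input.txt and the sample values are illustrative; nothing runs yet, because transformations are lazy):

    val lines = sc.textFile("input.txt")                   // external data
    val words = lines.flatMap(line => line.split(" "))     // flatMap: one line -> many words
    val longWords = words.filter(word => word.length > 3)  // filter: keep words longer than 3 characters
    val lengths = longWords.map(word => word.length)       // map: each word -> its length
    val uniqueLengths = lengths.distinct()                 // distinct: drop duplicate lengths
    val nums = sc.parallelize(List(1, 2, 3, 4))
    val combined = uniqueLengths.union(nums)               // union: elements from both RDDs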
- Actions: actions return a value to the driver program or write data to external storage; calling an action is what triggers the actual computation. Commonly used actions (see the sketch after this list):
- collect(): returns all elements of the RDD to the driver.
- count(): returns the number of elements in the RDD.
- foreach(): applies a function to each element of the RDD.
- top(num): returns the top num elements of the RDD.
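Continuing the sketch, a few actions on nums (the printed values assume List(1, 2, 3, 4)):

    nums.collect().foreach(println)     // brings Array(1, 2, 3, 4) to the driver, then prints it
    println(nums.count())               // 4
    nums.foreach(x => println(x))       // runs on the executors; in a cluster, output appears in executor logs
    println(nums.top(2).mkString(", ")) // "4, 3" - the two largest elements, in descending order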