
Hadoop Input and Output Formats in Spark


In this post, we will discuss how to implement Hadoop input and output formats in Spark. To understand the concepts explained here, it is best to have some basic knowledge of Apache Spark. We recommend going through the following posts first: Beginners Guide to Spark and Spark RDD's in Scala.

Now, let's discuss the input formats of Hadoop. An input split is a chunk of the data present in HDFS, and each mapper works on one input split. Before the map method is invoked, a RecordReader processes the input split and arranges its records in key-value format.

The InputFormat describes the input specification for a MapReduce job. The MapReduce framework relies on the InputFormat of the job to do the following:

1. Validate the input specification of the job.

2. Split up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.

3. Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.

The default behaviour of file-based InputFormats, typically subclasses of FileInputFormat, is to split the input into logical InputSplits based on the total size, in bytes, of the input files. However, the FileSystem block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapreduce.input.fileinputformat.split.minsize.

By default, Hadoop uses TextInputFormat, where the key of each record is the byte offset of the line and the value is the line itself. A related format, KeyValueTextInputFormat, treats each line as a key-value pair separated by a tab character. The keys and values used in Hadoop are serialized.
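For example, for the first line of a file containing Manjunath<TAB>50,000, the two formats produce different records (the byte offset 0 is illustrative):

TextInputFormat: key = 0 (the byte offset), value = "Manjunath\t50,000"

KeyValueTextInputFormat: key = "Manjunath", value = "50,000"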

HadoopInputFormat

In Spark, we can use Hadoop's InputFormat classes to process data, just as in Hadoop. For this, Spark provides Hadoop-compatible APIs in Java, Scala, and Python.

Now, let's look at a demo using Hadoop input formats in Spark.

Spark supports both the old and the new Hadoop APIs. They are as follows:

Old APIs (which support the mapred libraries of Hadoop)

• hadoopRDD


• hadoopFile

New APIs (which support the mapreduce libraries of Hadoop)

• newAPIHadoopRDD

• newAPIHadoopFile

We can implement these APIs using the Spark context, as shown below.

Old APIs:

SparkContext.hadoopRDD(conf, inputFormatClass, keyClass, valueClass, minPartitions)

Here's the explanation of the above line:

• conf – The conf to be passed is an org.apache.hadoop.mapred.JobConf. With this specific method, we need to pass the input file through the configuration.

• inputFormatClass – Here you need to pass the input format class of Hadoop.

• keyClass – Here you need to pass the input key class.

• valueClass – Here you need to pass the input value class.

• minPartitions – Here you specify the suggested minimum number of partitions.
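Here is a minimal sketch of hadoopRDD, assuming a plain text file at the hypothetical path /home/kiran/Hadoop_input:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

// hadoopRDD reads the input location from the JobConf,
// so register the (hypothetical) path on the configuration first.
val jobConf = new JobConf()
FileInputFormat.setInputPaths(jobConf, "/home/kiran/Hadoop_input")
val oldApiRdd = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text], 2)
// With TextInputFormat, keys are byte offsets and values are lines.
oldApiRdd.map { case (_, line) => line.toString }.take(4).foreach(println)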

SparkContext.hadoopFile(path, inputFormatClass, keyClass, valueClass, minPartitions)

The above line can be explained as follows:

• path – Here the input file is passed directly as an argument.

• inputFormatClass – You need to pass the input format class of Hadoop.

• keyClass – You need to pass the input key class.

• valueClass – You need to pass the input value class.

• minPartitions – Specifies the suggested minimum number of partitions.
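A similar sketch with hadoopFile, where the (hypothetical) input path is passed directly as an argument:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// No JobConf is needed here; the path is an argument of the call.
val oldApiFileRdd = sc.hadoopFile("/home/kiran/Hadoop_input",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 2)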

New APIs

sc.newAPIHadoopRDD(conf, fClass, kClass, vClass)

Here, the conf to be passed is an org.apache.hadoop.conf.Configuration.


• fClass – the input format class

• kClass – the input key class

• vClass – the input value class

sc.newAPIHadoopFile(path, fClass, kClass, vClass, conf)

Here, the input file is passed directly as the path argument, and the conf to be passed is an org.apache.hadoop.conf.Configuration.

• fClass – the input format class

• kClass – the input key class

• vClass – the input value class
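Here is a minimal sketch of newAPIHadoopRDD, again assuming the hypothetical input path; the path is registered on the Configuration through a Job object:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

// newAPIHadoopRDD reads the input location from the Configuration.
val job = Job.getInstance(new Configuration())
FileInputFormat.addInputPath(job, new Path("/home/kiran/Hadoop_input"))
val newApiRdd = sc.newAPIHadoopRDD(job.getConfiguration,
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])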

Now, let’s look at a demo using one input file. The input data we are using here is:

Manjunath 50,000

Kiran 40,0000

Onkar 45,0000

Prateek 45,0000

Now, we will apply KeyValueTextInputFormat to this data (the name and the amount in each line are separated by a tab).

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat

val input = sc.newAPIHadoopFile("/home/kiran/Hadoop_input",
  classOf[KeyValueTextInputFormat], classOf[Text], classOf[Text])

To use a Java class in a Scala program, we obtain its Class object with the syntax classOf[ClassName]. Here, KeyValueTextInputFormat and Text are Hadoop I/O classes written in Java. In their place, we can also use our own custom input format classes, which will be discussed in our next post.
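To inspect what was loaded, we can convert the Writable pairs to plain strings; a small sketch:

// Hadoop reuses Text objects, so convert them to strings
// before collecting the records to the driver.
val pairs = input.map { case (k, v) => (k.toString, v.toString) }
pairs.collect().foreach(println)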


HadoopOutputFormat

Let's look at HadoopOutputFormat and how to implement it in Spark. By default, Hadoop uses TextOutputFormat, where the key and value of the output are saved in the part file separated by a tab character.

The same can be implemented in Spark, which provides APIs for both the old and new APIs of Hadoop, i.e., for the mapred as well as the mapreduce output formats.

The mapred API is saveAsHadoopFile:

saveAsHadoopFile(path, keyClass, valueClass, outputFormatClass, conf, codec)

The explanation of the above line is as follows:

• path – Here, we need to give the path where the output needs to be saved.


• keyClass – We need to give the output key class.

• valueClass – We need to give the output value class.

• outputFormatClass – Here, we need to give the OutputFormat class.

• conf – We need to give the Hadoop configuration, org.apache.hadoop.mapred.JobConf.

• codec – Here, we need to give the compression codec.

conf and codec are optional parameters.
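As a sketch of the optional codec parameter, the overload that takes a compression codec can be used as below, reusing the pairs RDD from the input demo; the output path is hypothetical:

import org.apache.hadoop.io.Text
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapred.TextOutputFormat

// Save gzip-compressed part files using the old (mapred) API.
pairs.map { case (k, v) => (new Text(k), new Text(v)) }
  .saveAsHadoopFile("/home/kiran/Hadoop_output_gz",
    classOf[Text], classOf[Text],
    classOf[TextOutputFormat[Text, Text]], classOf[GzipCodec])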

saveAsNewAPIHadoopFile(path, keyClass, valueClass, outputFormatClass, conf)

The explanation of the above line is as follows:

• path – Here, we need to give the path where the output needs to be saved.

• keyClass – We need to give the output key class.

• valueClass – We need to give the output value class.

• outputFormatClass – Here, we need to give the OutputFormat class.

• conf – Here, we need to give the Hadoop configuration, org.apache.hadoop.conf.Configuration.

Here, in the place of keyClass, valueClass, and outputFormatClass, we can define and give our own custom OutputFormat classes as well.

Now, let's save the output of the above HadoopInputFormat demo using HadoopOutputFormat. A sketch of this is shown below.
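A minimal sketch with the old (mapred) API, reusing the pairs RDD from above; the output path is hypothetical:

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.TextOutputFormat

// Each part file will contain lines of the form key<TAB>value.
pairs.map { case (k, v) => (new Text(k), new Text(v)) }
  .saveAsHadoopFile("/home/kiran/Hadoop_output_old",
    classOf[Text], classOf[Text], classOf[TextOutputFormat[Text, Text]])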


The sketch above saves the output using the old API of Hadoop, i.e., the mapred libraries.

Now, we will save the same using the new APIs of Hadoop, i.e., the mapreduce libraries. A sketch follows below.
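The same save with the new (mapreduce) API; again, the output path is hypothetical:

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// saveAsNewAPIHadoopFile expects a mapreduce OutputFormat class.
pairs.map { case (k, v) => (new Text(k), new Text(v)) }
  .saveAsNewAPIHadoopFile("/home/kiran/Hadoop_output_new",
    classOf[Text], classOf[Text], classOf[TextOutputFormat[Text, Text]])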

This is how the output is saved in Spark using the Hadoop output formats.


We hope this post has been helpful in understanding how to work with Hadoop input and output formats in Spark. In case of any queries, feel free to drop us a comment and we will get back to you at the earliest.

Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.
