
Apache Pig Use Case: Electrical Bulb Testing


ACADGILD

In this blog, we will work through a use case involving electric bulbs and the date and time functions in Pig.

In this instance, Pig runs in local mode and loads data from the local file system. The same work can be done in HDFS mode if that is more convenient.

In the research centers of bulb manufacturing companies, the longevity of bulbs is tested by subjecting them to adverse conditions.

The dataset used in this case is a sample from a light bulb production house where bulbs are tested at random intervals of time. The first column, StartDate, is the date and time when testing of the bulb started; the second column, EndDate, is the date and time when testing ended.

StartDate            EndDate
30-Jun-2018 23:42    04-Jul-2018 15:10
30-Jun-2018 23:37
30-Jun-2018 23:13    30-Jun-2019 23:34m


A few rows may be incomplete, which indicates that the data is not available for one reason or another. As developers, we need not worry about the missing data: with the help of data filtering, we can remove the unusable rows.

Loading Data into the Pig environment

Since PigStorage, Pig's default load function, expects tab-delimited (\t) data, it is not mandatory to state USING PigStorage('\t') in the code while loading; nevertheless, it is good practice to write it. Set the delimiter argument to match your dataset.
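As a minimal sketch of such a load statement, assuming a tab-separated file with no header line: the path bulb_data.txt, the relation name bulbs and the chararray schema are illustrative choices, not taken from the original code.

```pig
-- Hypothetical file path and names; adjust them to your own dataset.
-- When Pig is started with `pig -x local`, the path is resolved on the
-- local file system rather than on HDFS.
-- '\t' is already the PigStorage default; it is stated here for clarity.
bulbs = LOAD 'bulb_data.txt'
        USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);

-- Print the schema to confirm that both columns are chararray.
DESCRIBE bulbs;
```

For a comma-separated file, the same statement would use PigStorage(',') instead.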

Now that the data is loaded into Pig, the first step is to filter the columns we are working on.


Here we remove all the rows with null data.

In this step, it is also mandatory to filter out the rows whose EndTime value contains the - symbol in place of a date.
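The two filtering steps might look like the sketch below. It assumes that missing end times appear either as empty fields, which PigStorage loads as null, or as a lone - placeholder; the file path, relation names and field names are hypothetical.

```pig
-- Hypothetical path and names, assumed for illustration only.
bulbs = LOAD 'bulb_data.txt'
        USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);

-- Step 1: drop rows in which either timestamp is missing.
non_null = FILTER bulbs BY StartDate IS NOT NULL AND EndDate IS NOT NULL;

-- Step 2: drop rows whose end time is only the '-' placeholder
-- (assumption: the placeholder is exactly one hyphen, with no padding).
valid = FILTER non_null BY EndDate != '-';
```

If the placeholder may be surrounded by spaces, comparing TRIM(EndDate) against '-' is one way to make the second filter more tolerant.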

We have to convert the data loaded in Pig into the datetime type in order to work with it.

Here, we use two predefined functions:

ToDate()

MinutesBetween()

The first converts a character array (chararray) into a datetime value that Pig can interpret, and the second returns the difference in minutes between the two datetime values passed to it.

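The ToDate function accepts a format string, so it can handle different arrangements of year, month and day. Some example layouts are as follows:

YYYY-MM-DD

DD/MM/YYYY

DD-YY-MM

Choose the format that matches the structure of the dataset provided. Note that the pattern string itself follows Java date-format conventions, in which day of month is written dd and year yyyy (for example, 'yyyy-MM-dd').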

After simple filtering and conversion of the character array data to datetime, we can determine, in minutes, how long each bulb was in the ON state during testing.
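The conversion and the difference can be sketched together as follows. This assumes that every row surviving the filters matches the sample layout 30-Jun-2018 23:42, which corresponds to the pattern 'dd-MMM-yyyy HH:mm'; the path, relation names and field names are illustrative, not the original code.

```pig
-- Hypothetical path and names, assumed for illustration only.
bulbs = LOAD 'bulb_data.txt'
        USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);

-- Keep only rows that have both timestamps and no '-' placeholder.
valid = FILTER bulbs BY StartDate IS NOT NULL
                    AND EndDate IS NOT NULL
                    AND EndDate != '-';

-- Convert both chararray columns to datetime.
-- Assumption: all remaining values parse with this pattern.
timed = FOREACH valid GENERATE
          ToDate(StartDate, 'dd-MMM-yyyy HH:mm') AS start_ts,
          ToDate(EndDate,   'dd-MMM-yyyy HH:mm') AS end_ts;

-- Minutes each bulb stayed ON: the later datetime is passed first.
durations = FOREACH timed GENERATE
              MinutesBetween(end_ts, start_ts) AS minutes_on;
```

MinutesBetween subtracts its second argument from its first, so passing the end time first yields a positive duration.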


We can see the results with the DUMP command.

The result is displayed in minutes.

Once we have this, we can perform further analysis on the result, for example, finding the maximum or minimum time a bulb can stay ON.

Shown below is the result for the average time the bulbs were ON during the testing phase.


Dump Avg_ALL;
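A relation such as Avg_ALL could be produced along the lines below. This is a sketch under stated assumptions, not the original code: the file path, the intermediate relation names, the field name minutes_on and the extra Max_ALL and Min_ALL relations are all hypothetical, and the date pattern is inferred from the sample rows.

```pig
-- Hypothetical path and names, assumed for illustration only.
bulbs = LOAD 'bulb_data.txt'
        USING PigStorage('\t')
        AS (StartDate:chararray, EndDate:chararray);

valid = FILTER bulbs BY StartDate IS NOT NULL
                    AND EndDate IS NOT NULL
                    AND EndDate != '-';

-- Duration of each test in minutes (end time first for a positive value).
durations = FOREACH valid GENERATE
              MinutesBetween(ToDate(EndDate,   'dd-MMM-yyyy HH:mm'),
                             ToDate(StartDate, 'dd-MMM-yyyy HH:mm'))
              AS minutes_on;

-- Collect every duration into a single group so that the aggregate
-- functions run over the whole dataset.
grouped = GROUP durations ALL;

Avg_ALL = FOREACH grouped GENERATE AVG(durations.minutes_on);
Max_ALL = FOREACH grouped GENERATE MAX(durations.minutes_on);
Min_ALL = FOREACH grouped GENERATE MIN(durations.minutes_on);

DUMP Max_ALL;
DUMP Min_ALL;
```

GROUP ... ALL is used here because AVG, MAX and MIN operate on a bag, and grouping everything together produces one bag holding all the durations.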

This way we can perform analysis on the filtered result and, with the help of Pig, get results from a large set of data in a matter of minutes.

For dataset and code for practice, click HERE.

For more such blogs on various topics, please visit ACADGILD.
