
Web log analysis using Apache Pig


ACADGILD

In this blog, we will see web log analysis using Apache Pig. Before we proceed, let us see a brief introduction to Apache Pig.

Apache Pig is a high-level tool for processing big data. The language used is known as Pig Latin. Pig can run in two different modes.

• MapReduce mode: Data is loaded from HDFS, and the transformations in the script are executed as MapReduce jobs in the backend.

• Local mode: Generally used for testing a script. Data is loaded from the local file system and no MapReduce jobs run on a cluster, which makes testing fast.

Pig Latin is a procedural language that uses lazy evaluation and is well suited to ETL (Extract, Transform, Load) work. Pig also has a rich set of libraries for loading web logs. These are found in the Piggybank jar, which comes bundled with Apache Pig.

We will use CombinedLogLoader() to load logs.

The logs used here are in the Combined Log Format. Let us look at a sample line:

127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

Some of the columns of the combined log format (web log format) are described below:

Ipaddress: (127.0.0.1) IP address of the client (hostname).

Logname: (-) The hyphen in the output indicates that the requested piece of information is not available.

Userid: (-) Not available in this sample.

Timestamp: (10/Oct/2000:13:55:36 -0700) time at which the server finished processing the request.

Request: (GET /apache_pb.gif HTTP/1.0) the request line sent by the client. "GET" is the HTTP method.

Page link: (http://www.example.com/start.html) the web page from which the client made the request.

Sample dataset:

https://acadgild.com/blog/?p=18427&preview=true


Download the dataset from here.

PROBLEM STATEMENT 1:

Find out the most viewed page

Below are the steps:

Step 1) First and foremost, we have to register the Piggybank jar to use its classes.

Step 2) Next, load the data using CombinedLogLoader() and specify the schema.
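A minimal sketch of Steps 1 and 2 is given here. The jar path and the input path `access_log` are placeholders and are not taken from the original post. The field names are a free choice; what matters is their order, which follows the order of the parts of a combined log line.

```pig
-- Step 1: register Piggybank so that its loader classes can be used.
-- The jar path is an assumption; adjust it to your installation.
REGISTER /usr/local/pig/lib/piggybank.jar;

-- Step 2: load the logs with CombinedLogLoader and specify the schema.
-- 'access_log' is a placeholder for the downloaded sample dataset.
logs = LOAD 'access_log'
       USING org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader()
       AS (remoteAddr:chararray, remoteLogname:chararray, user:chararray,
           time:chararray, method:chararray, uri:chararray, proto:chararray,
           status:chararray, bytes:chararray, referer:chararray,
           userAgent:chararray);
```

In this schema, `referer` holds what this post calls the page link, and `time` holds the timestamp without the surrounding square brackets.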


Step 3) Group the data by page link to count the page hits of each unique link.

Step 4) For each group (one per link), we have to generate the link and its total count. Here, we have used flatten() to explode the tuples and then count the hits.
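Steps 3 and 4 could look roughly like the sketch below. The load statement from Steps 1 and 2 is repeated so that the snippet runs on its own; the paths and the relation names (`logs`, `by_link`, `link_hits`) are assumptions. Because the grouping key is a single field, `group` is already a plain value, so this sketch projects it directly. FLATTEN(group) would be needed when grouping on several fields, where `group` is a tuple.

```pig
-- Steps 1-2, repeated so this snippet is self-contained (paths are placeholders).
REGISTER /usr/local/pig/lib/piggybank.jar;
logs = LOAD 'access_log'
       USING org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader()
       AS (remoteAddr:chararray, remoteLogname:chararray, user:chararray,
           time:chararray, method:chararray, uri:chararray, proto:chararray,
           status:chararray, bytes:chararray, referer:chararray,
           userAgent:chararray);

-- Step 3: build one group per distinct page link.
by_link = GROUP logs BY referer;

-- Step 4: for each group, emit the link and the number of log records in it.
link_hits = FOREACH by_link GENERATE group AS link, COUNT(logs) AS hits;
```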


Step 5) Once the counts are available, we need to sort them in descending order and keep only the first result.


Step 6) Use DUMP to get the desired result.
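Putting the six steps together, a complete script for Problem Statement 1 might look like this sketch. All paths and relation names are assumptions made for illustration, and the original post may have used different ones.

```pig
-- Steps 1-2: register Piggybank and load the logs (paths are placeholders).
REGISTER /usr/local/pig/lib/piggybank.jar;
logs = LOAD 'access_log'
       USING org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader()
       AS (remoteAddr:chararray, remoteLogname:chararray, user:chararray,
           time:chararray, method:chararray, uri:chararray, proto:chararray,
           status:chararray, bytes:chararray, referer:chararray,
           userAgent:chararray);

-- Steps 3-4: count the hits for each page link.
by_link   = GROUP logs BY referer;
link_hits = FOREACH by_link GENERATE group AS link, COUNT(logs) AS hits;

-- Step 5: sort by hit count, highest first, and keep only the top row.
ranked   = ORDER link_hits BY hits DESC;
top_link = LIMIT ranked 1;

-- Step 6: print the most viewed page link and its hit count.
DUMP top_link;
```

Since Pig evaluates lazily, nothing is executed until the DUMP statement is reached.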

PROBLEM STATEMENT 2:

Find total hits per unique day:

For each unique day, we need to find the total number of hits. For example, there may be X hits on the 24th of a particular month and Y hits on the 27th.

We assume that all the logs belong to a single month.


To solve this problem, we use DateExtractor(), which is available in the Piggybank jar. It takes the timestamp as input and returns the corresponding “day” for each timestamp.

Step 1) Define DateExtractor() in the Pig Grunt shell.

Step 2) Use the class defined above to extract the day from each timestamp, and group the records by day.

Step 3) Count the records in each group to find the total hits per day.

Step 4) Dump the result and see the output.
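The four steps can be sketched as follows. The paths and relation names are assumptions. The output pattern `'dd'` (day of month) is also an assumption, chosen to match the single-month premise above; the original post may have used a different pattern.

```pig
-- Register Piggybank (the jar path is a placeholder).
REGISTER /usr/local/pig/lib/piggybank.jar;

-- Step 1: define DateExtractor with an output pattern.
-- 'dd' yields the day of the month and is an assumption of this sketch.
DEFINE DateExtractor
       org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('dd');

-- Load the logs as in Problem Statement 1 ('access_log' is a placeholder).
logs = LOAD 'access_log'
       USING org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader()
       AS (remoteAddr:chararray, remoteLogname:chararray, user:chararray,
           time:chararray, method:chararray, uri:chararray, proto:chararray,
           status:chararray, bytes:chararray, referer:chararray,
           userAgent:chararray);

-- Step 2: extract the day from each timestamp and group by it.
days   = FOREACH logs GENERATE DateExtractor(time) AS day;
by_day = GROUP days BY day;

-- Step 3: count the records in each group to get the total hits per day.
day_hits = FOREACH by_day GENERATE group AS day, COUNT(days) AS hits;

-- Step 4: print one (day, hits) row per day.
DUMP day_hits;
```

If the logs span more than one month, a pattern such as `'yyyy-MM-dd'` keeps the same day number in different months from being merged into one group.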


The first column of the output is the date, and the second is the total number of hits on that day.

We hope this post was helpful to you in performing web log analysis using Apache Pig. If you have any queries, feel free to comment below and we will get back to you at the earliest. Keep visiting www.acadgild.com for more updates on the courses.
