Slide 1 © 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com
HDFS and Big Data TDD Using PigUnit
Slide 2
Session Objectives
ᗍ Introduction to BIG Data & Hadoop
ᗍ Understand HDFS concepts
ᗍ Understand Big Data TDD Using PigUnit
ᗍ BIG Data & Hadoop Course Syllabus
ᗍ Webinar by Skillspeed
Get Started with BIG Data & Hadoop
Slide 3
Big Data and its Challenges
Slide 4
Big Data and its Challenges
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
Systems and enterprises generate huge amounts of data, from terabytes to petabytes of information.
Managing data at this scale is very difficult.
Slide 5
Who Generates Big Data?
Have you ever wondered how Google, Facebook, or LinkedIn manage to store and use such huge volumes of data?
Today, managing such big data is becoming a problem for all of us.
Slide 6
Hadoop can be used to process such huge volumes of data.
We will see how.
Before that, let's understand what Hadoop is.
Slide 7
Hadoop and its Characteristics
Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is an open-source data management technology with scale-out storage and distributed processing.
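As a rough illustration of the "simple programming model", here is a minimal word-count sketch in the MapReduce style, assuming the model meant here is MapReduce (which the ecosystem slide lists). It runs in a single Python process, not on Hadoop; the function names are hypothetical and only mimic the map, shuffle, and reduce steps.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce step: collapse the grouped values for one key into a total."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
```

On a real cluster the map and reduce steps would run in parallel on different machines; only the shape of the two functions carries over from this sketch.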
Hadoop Characteristics
Flexible
Reliable
Economical
Scalable
Slide 8
Hadoop Ecosystem
ᗍ Flume: import or export of unstructured or semi-structured data
ᗍ Sqoop: import or export of structured data
ᗍ Apache Oozie: workflow
ᗍ Pig Latin: data analysis
ᗍ Hive: DW system
ᗍ MapReduce framework
ᗍ HBase
ᗍ Other YARN frameworks (MPI, Giraph)
ᗍ YARN: cluster resource management
ᗍ HDFS (Hadoop Distributed File System)
Slide 9
HDFS
Slide 10
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
HDFS and its Components
The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file-system written in Java for the Hadoop framework. It has the following two components:
NameNode
ᗍ Storage-side master of the system
ᗍ It maintains, manages, and administers the data blocks present on the DataNodes
DataNodes
ᗍ Slave machines which provide the actual and redundant storage
ᗍ End points for client read and write operations
Slide 11
HDFS Architecture
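[Figure: HDFS architecture. Clients send metadata ops to the NameNode, which holds the metadata (name, replicas, ...), for example /home/foo/data, 3, ... Clients read from and write to DataNodes, which are spread across Rack 1 and Rack 2. The NameNode sends block ops to the DataNodes, and blocks are replicated between DataNodes.]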
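To make the metadata entry in the diagram concrete (`/home/foo/data` with a replication factor of 3), the following minimal sketch computes how many blocks a file occupies and how much raw storage its replicas consume. The 128 MB block size is an assumption (it is the Hadoop 2.x default; Hadoop 1.x defaulted to 64 MB), and the file sizes and function names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size (Hadoop 2.x default, configurable)


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of HDFS blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=3):
    """Raw cluster storage consumed once every block has all its replicas."""
    # The last block is not padded to the full block size,
    # so raw usage is simply file size times the replication factor.
    return file_size_mb * replication
```

For a hypothetical 1000 MB file, `block_count(1000)` gives 8 blocks and `raw_storage_mb(1000)` gives 3000 MB of raw storage across the DataNodes.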
Slide 12
HDFS NameNode
Keeps metadata in main memory
ᗍ The entire metadata is in main memory
ᗍ FS metadata is served from memory; there is no demand paging from disk
Metadata type
ᗍ Files in HDFS
ᗍ Data blocks for each file
ᗍ DataNodes for each block
ᗍ File attributes, e.g. access time, replication factor, access control
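The metadata types listed above can be pictured as two in-memory maps: one from file path to its attributes and ordered block list, and one from block to the DataNodes that hold it. The sketch below is a simplified illustration under that assumption, not the NameNode's actual data structures; every class, method, block id, and DataNode name in it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    replication: int                            # file attribute: replication factor
    blocks: list = field(default_factory=list)  # ordered block ids of the file


class NameNodeMetadata:
    """Toy in-memory model of the metadata a NameNode keeps."""

    def __init__(self):
        self.files = {}            # path -> FileEntry
        self.block_locations = {}  # block id -> set of DataNode names

    def add_file(self, path, replication, block_ids):
        self.files[path] = FileEntry(replication, list(block_ids))
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set())

    def report_block(self, block_id, datanode):
        """Record that a DataNode reported holding a replica of a block."""
        self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Return (block id, sorted DataNode names) for each block of a file."""
        entry = self.files[path]
        return [(b, sorted(self.block_locations[b])) for b in entry.blocks]

    def under_replicated(self):
        """Block ids with fewer replicas than their file's replication factor."""
        return [b for entry in self.files.values() for b in entry.blocks
                if len(self.block_locations[b]) < entry.replication]
```

Because both maps live in memory, a lookup such as `locate("/home/foo/data")` needs no disk access, which is the point the slide makes.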
Slide 13
Secondary NameNode
ᗍ In Hadoop 1.x, it is not a hot standby for the NameNode
ᗍ By default, it connects to the NameNode every hour
ᗍ Housekeeping and backup of the NameNode metadata
ᗍ The saved metadata can be used to bring up a failed NameNode
[Figure: the Secondary NameNode copies metadata from the NameNode. Secondary NameNode: "I'll take the metadata every hour and keep it safe."]
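The housekeeping step can be sketched as a checkpoint: the Secondary NameNode takes the last saved namespace image and the log of edits made since then, and merges them into a new image. The code below is a minimal sketch of that merge, assuming a namespace simplified to a map from path to replication factor; the operation names and the tuple layout of the edit log are hypothetical.

```python
def checkpoint(fsimage, edits):
    """Return a new namespace image with the logged edits applied in order."""
    image = dict(fsimage)  # copy, so the previous checkpoint stays intact
    for op, path, value in edits:
        if op in ("create", "set_replication"):
            image[path] = value
        elif op == "delete":
            image.pop(path, None)
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return image


# Hypothetical state: one file in the last image, four edits logged since.
old_image = {"/home/foo/data": 3}
edit_log = [
    ("create", "/tmp/a", 2),
    ("set_replication", "/home/foo/data", 2),
    ("delete", "/tmp/a", None),
    ("create", "/logs/x", 3),
]
new_image = checkpoint(old_image, edit_log)
```

After the merge the edit log can be truncated, which keeps NameNode restarts short because fewer edits have to be replayed.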
Slide 14
Big Data TDD Using PigUnit
Slide 15
What is TDD?
ᗍ TDD stands for Test Driven Development
ᗍ Test Driven Development aims to shorten development cycles
ᗍ It follows a "get something working now and perfect it later" approach
ᗍ The typical process involves a "RED-GREEN-REFACTOR" cycle
ᗍ It is part of a larger software development methodology, "Extreme Programming"
ᗍ Test Driven Development requires tests to be written before the code itself!
ᗍ It leads to better code that is just enough to pass the tests
ᗍ Because code is written only to make a failing test pass, TDD-based code is, in principle, fully covered by tests
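A minimal sketch of one RED-GREEN-REFACTOR pass, using Python's standard `unittest` module. The record format (`user,amount`) and the function `parse_record` are hypothetical examples chosen for illustration. In TDD order, the test class is written first and fails (RED) until `parse_record` is implemented (GREEN); the implementation can then be cleaned up while the tests stay green (REFACTOR).

```python
import unittest


def parse_record(line):
    """Parse a 'user,amount' line into a (user, amount) tuple."""
    # Just enough code to pass the tests below.
    user, amount = line.strip().split(",")
    return user.strip(), float(amount)


class ParseRecordTest(unittest.TestCase):
    # Written before parse_record existed: this is the RED step.
    def test_parses_user_and_amount(self):
        self.assertEqual(parse_record("alice,10.5\n"), ("alice", 10.5))

    def test_rejects_malformed_line(self):
        # A line without a comma cannot be unpacked into two fields.
        with self.assertRaises(ValueError):
            parse_record("no-comma-here")
```

The tests can be run with `python -m unittest` from the directory containing the file.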
Slide 16
I can't follow TDD because...
ᗍ "It's working! Let's freeze it for now"
ᗍ The release date is quite aggressive!
ᗍ It slows down our development cycle
ᗍ We are already short-staffed
ᗍ What are testers supposed to do?
All of the reasons above (and possibly more) lead teams into "technical debt".
Slide 17
"The most powerful force in the universe is compound interest."
- attributed to Albert Einstein
Slide 18
Time Taken to Fix Bugs
[Chart: time taken to fix bugs by phase (Design, Implementation, QA, Post-release); axis from 0 to 1000, units not given]
Slide 19
Traditional Development
Design -> Implement -> Test
Slide 20
TDD
Design -> Test -> Implement -> Test
Slide 21
TDD
[Diagram: the TDD steps Design, Test, Implement, and Test arranged as a cycle]
Slide 23
Why Unit Test Pig?
ᗍ Pig is NOT a programming language
ᗍ Pig is a data flow language
ᗍ It just converts the Pig Latin data flow into MapReduce jobs
ᗍ The best use case for Pig in Big Data projects is "Data Factory" operations
ᗍ Since we are not talking about a "programming language", does testing make sense?
ᗍ Pig already comes with diagnostic operators, so extra testing would be overhead!
Using the reasons above to skip testing leads to even bigger problems, because testing in the Big Data world is data-driven by nature.
Slide 24
What is PigUnit?
ᗍ PigUnit is the unit testing framework for Pig scripts
ᗍ It is not really an xUnit-style framework of its own
ᗍ It's a library which can be used within JUnit tests to:
• Run Pig scripts from within JUnit tests
• Override variables in Pig scripts to provide data from tests rather than from external sources such as HDFS
• Inspect the values of your Pig script relations
• Make your STORE statements into no-ops so that your Pig scripts run without side effects
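The testing pattern behind these capabilities can be sketched without Pig at all. The Python code below is an analogy only: it is not PigUnit and does not use PigUnit's API. A hypothetical function `top_queries` stands in for a Pig script; the test supplies in-memory input in place of the external source, replaces the store step with a no-op, and inspects the intermediate relations by name.

```python
def top_queries(load, store, n=2):
    """Hypothetical stand-in for a Pig script: LOAD, ORDER, LIMIT, STORE."""
    relations = {}
    relations["data"] = list(load())
    relations["ordered"] = sorted(
        relations["data"], key=lambda row: row[1], reverse=True
    )
    relations["top"] = relations["ordered"][:n]
    store(relations["top"])  # side effect, replaceable in tests
    return relations         # exposed so tests can inspect each relation


def no_op_store(rows):
    """Test replacement for the store step: writes nothing anywhere."""


def test_top_queries():
    # Input comes from the test itself, not from an external source.
    test_input = [("yahoo", 10), ("twitter", 7), ("facebook", 12), ("bing", 3)]
    relations = top_queries(load=lambda: test_input, store=no_op_store, n=2)
    # Inspect a named relation, not just the final output.
    assert relations["top"] == [("facebook", 12), ("yahoo", 10)]
    assert len(relations["ordered"]) == 4
```

With PigUnit itself the same pattern is written as a JUnit test in Java against a real Pig script; consult the PigUnit documentation for the actual classes and methods.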
Slide 25
Job Trends – Hadoop
Slide 26
Why SkillSpeed?
Course Curriculum from Industry Experts
Instructor Led Live Virtual Sessions
Lifetime access to Course Content via LMS
100% Placement Assistance
24x7 Support
Slide 27
Course Topics
Module 1: Introduction to Big Data and Hadoop
Module 2: HDFS Internals, Hadoop Configurations and Data Loading
Module 3: Introduction to Map Reduce
Module 4: Advanced Map Reduce Concepts
Module 5: Introduction to Pig
Module 6: Advanced Pig and Introduction to Hive
Module 7: Advanced Hive Concepts
Module 8: Extending Hive and HBase Introduction
Module 9: Advanced HBase and Oozie Introduction
Module 10: Project Set-up Discussion
Slide 28
Corporate Partners
Slide 29
Lines open 24/7
To know more about the course, please contact:
IND +91-90660-20904 USA 1866-607-6547 (Toll Free)
Or reach us at
Contact us..