Big Data Project on Crystal Ball Submitted By: Sushil Sedai(984474) Suvash Shah(984461) Submitted to: Prof. Prem Nair

CrystalBall - Compute Relative Frequency in Hadoop


Page 1: CrystalBall - Compute Relative Frequency in Hadoop

Big Data Project on Crystal Ball

Submitted By:
Sushil Sedai (984474)
Suvash Shah (984461)

Submitted to: Prof. Prem Nair

Pair approach (Mapper) – pseudo code

method map(docid id, doc d)
    for each term w in doc d do
        total = 0;
        for each neighbor u in Neighbor(w) do
            Emit(Pair(w, u), 1);
            total++;
        Emit(Pair(w, *), total);

Pair approach (Mapper) – Java Code
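
As an illustration of the pseudo code above, here is a minimal sketch in plain Java with no Hadoop dependency. The class name `PairMapperSketch`, the tab-separated string records, and the neighbor window (every term after w up to the next occurrence of w) are assumptions made for this example; the window definition was chosen because it reproduces the sample outputs shown in this deck.

```java
import java.util.ArrayList;
import java.util.List;

/** Hadoop-free sketch of the pair-approach mapper; all names are illustrative. */
final class PairMapperSketch {

    /** Assumed window: terms after position i, up to the next occurrence of terms[i]. */
    static List<String> neighbors(String[] terms, int i) {
        List<String> result = new ArrayList<>();
        for (int j = i + 1; j < terms.length && !terms[j].equals(terms[i]); j++) {
            result.add(terms[j]);
        }
        return result;
    }

    /** Returns the emitted records as "(w,u)<TAB>count" strings, in emission order. */
    static List<String> map(String line) {
        List<String> out = new ArrayList<>();
        if (line.trim().isEmpty()) {
            return out;
        }
        String[] terms = line.trim().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
            int total = 0;
            for (String u : neighbors(terms, i)) {
                out.add("(" + terms[i] + "," + u + ")\t1");
                total++;
            }
            // Marginal record for w; "*" is meant to sort ahead of every real neighbor.
            out.add("(" + terms[i] + ",*)\t" + total);
        }
        return out;
    }

    public static void main(String[] args) {
        map("34 56 29 12 34 56 92 10 34 12").forEach(System.out::println);
    }
}
```

In a real job the records would be emitted through the framework's context object with a composite key type; strings are used here only to keep the sketch self-contained.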

Pair approach (Reducer) – pseudo code

method reduce(Pair p, Iterable<Int> values)
    if p.secondValue == *
        if p.firstValue is new
            currentvalue = p.firstValue;
            marginal = sum(values)
        else
            marginal += sum(values)
    else
        Emit(p, sum(values)/marginal);

Pair approach (Reducer) – Java Code
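
The reducer logic above can be sketched in plain Java as follows. This is a hedged illustration, not the project's code: `PairReducerSketch` is a hypothetical name, and the sketch assumes that all pairs sharing the same first element reach the same reducer and that the key (w,*) arrives before any (w,u), which a real job would have to guarantee through its partitioner and key ordering.

```java
import java.util.List;

/** Hadoop-free sketch of the pair-approach reducer; all names are illustrative. */
final class PairReducerSketch {

    // Marginal count of the current left term w, set when (w,*) is seen.
    private double marginal = 0;

    /**
     * Called once per key, in sorted key order.
     * Returns null for the marginal key (w,*), otherwise count(w,u) / count(w,*).
     */
    Double reduce(String w, String u, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        if (u.equals("*")) {
            marginal = sum; // every (w,*) count arrives in this single call
            return null;
        }
        return sum / marginal;
    }

    public static void main(String[] args) {
        PairReducerSketch reducer = new PairReducerSketch();
        reducer.reduce("10", "*", List.of(2));
        System.out.println("(10,12) " + reducer.reduce("10", "12", List.of(1)));
        System.out.println("(10,34) " + reducer.reduce("10", "34", List.of(1)));
    }
}
```

Keeping only the running marginal in memory is what lets the pair approach scale to terms with very many neighbors.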

Pair approach - input

Mapper1 input

34 56 29 12 34 56 92 10 34 12

Mapper2 input

18 29 12 34 79 18 56 12 34 92

Pair approach – Output (Reducer1)

(10,12) 0.5
(10,34) 0.5
(12,10) 0.09090909090909091
(12,18) 0.09090909090909091
(12,34) 0.36363636363636365
(12,56) 0.18181818181818182
(12,79) 0.09090909090909091
(12,92) 0.18181818181818182
(18,12) 0.25
(18,29) 0.125
(18,34) 0.25
(18,56) 0.125
(18,79) 0.125
(18,92) 0.125
(29,10) 0.06666666666666667
(29,12) 0.26666666666666666
(29,18) 0.06666666666666667
(29,34) 0.26666666666666666
(29,56) 0.13333333333333333
(29,79) 0.06666666666666667
(29,92) 0.13333333333333333
(34,10) 0.08333333333333333
(34,12) 0.25
(34,18) 0.08333333333333333
(34,29) 0.08333333333333333
(34,56) 0.25
(34,79) 0.08333333333333333
(34,92) 0.16666666666666666
(56,10) 0.1
(56,12) 0.3
(56,29) 0.1
(56,34) 0.3
(56,92) 0.2
(92,10) 0.3333333333333333
(92,12) 0.3333333333333333
(92,34) 0.3333333333333333

Pair approach – Output (Reducer2)

(79,12) 0.2

(79,18) 0.2

(79,34) 0.2

(79,56) 0.2

(79,92) 0.2

Stripe approach (Mapper) – pseudo code

method map(docid id, doc d)
    Stripe H;
    for each term w in doc d do
        clear(H);
        for each neighbor u in Neighbor(w) do
            if H.containsKey(u)
                H{u} += 1;
            else
                H.add(u, 1);
        Emit(w, H);

Stripe approach (Mapper) – Java Code
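
A minimal plain-Java sketch of the stripe mapper described above, offered as an illustration only. The class name `StripeMapperSketch` and the use of `TreeMap` for the stripe are assumptions; the neighbor window (terms after w up to the next occurrence of w) is likewise assumed, chosen because it matches the sample outputs in this deck.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hadoop-free sketch of the stripe-approach mapper; all names are illustrative. */
final class StripeMapperSketch {

    /** Returns one (term, stripe) record per term occurrence, in input order. */
    static List<Map.Entry<String, Map<String, Integer>>> map(String line) {
        List<Map.Entry<String, Map<String, Integer>>> out = new ArrayList<>();
        if (line.trim().isEmpty()) {
            return out;
        }
        String[] terms = line.trim().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
            // A fresh stripe per occurrence plays the role of clear(H).
            Map<String, Integer> stripe = new TreeMap<>();
            // Assumed window: stop at the next occurrence of the same term.
            for (int j = i + 1; j < terms.length && !terms[j].equals(terms[i]); j++) {
                stripe.merge(terms[j], 1, Integer::sum);
            }
            out.add(Map.entry(terms[i], stripe));
        }
        return out;
    }

    public static void main(String[] args) {
        map("18 29 12 34 79 18 56 12 34 92").forEach(System.out::println);
    }
}
```

Compared with the pair mapper, each term occurrence produces a single record, so fewer key-value pairs cross the network at the cost of larger values.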

Stripe approach (Reducer) – pseudo code

method reduce(Text key, Stripes [H1, H2, …])
    H = elementWiseSum(H1, H2, …);
    total = sumValues(H);
    for each Item h in H do
        h.secondValue /= total;
    Emit(key, H);

Stripe approach (Reducer) – Java Code

Stripe approach (Reducer) – Java Code
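
The stripe reducer can be sketched in plain Java along these lines. This is an illustrative version under stated assumptions: `StripeReducerSketch` is a hypothetical name, stripes are modeled as `Map<String, Integer>`, and the element-wise merge of the incoming stripes is done before dividing by the total.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hadoop-free sketch of the stripe-approach reducer; all names are illustrative. */
final class StripeReducerSketch {

    /** Merges all stripes received for one term and converts counts to relative frequencies. */
    static Map<String, Double> reduce(List<Map<String, Integer>> stripes) {
        Map<String, Integer> merged = new TreeMap<>();
        long total = 0;
        for (Map<String, Integer> stripe : stripes) {
            for (Map.Entry<String, Integer> e : stripe.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::sum); // element-wise sum
                total += e.getValue();
            }
        }
        Map<String, Double> frequencies = new TreeMap<>();
        for (Map.Entry<String, Integer> e : merged.entrySet()) {
            frequencies.put(e.getKey(), e.getValue() / (double) total);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        // Stripes for term 10 from the sample input: one occurrence, followed by 34 and 12.
        System.out.println("10 " + reduce(List.of(Map.of("34", 1, "12", 1))));
    }
}
```

Because the whole merged stripe is available at once, the marginal is just the sum of its values and no special (w,*) key is needed.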

Stripe approach – input

Mapper1 input

34 56 29 12 34 56 92 10 34 12

Mapper2 input

18 29 12 34 79 18 56 12 34 92

Stripe approach – Output (Reducer1)

10 [ (34,0.5000) (12,0.5000) ]

12 [ (56,0.1818) (92,0.1818) (34,0.3636) (18,0.0909) (79,0.0909) (10,0.0909) ]

18 [ (56,0.1250) (92,0.1250) (34,0.2500) (79,0.1250) (29,0.1250) (12,0.2500) ]

29 [ (56,0.1333) (92,0.1333) (34,0.2667) (18,0.0667) (79,0.0667) (10,0.0667) (12,0.2667) ]

34 [ (56,0.2500) (92,0.1667) (18,0.0833) (79,0.0833) (29,0.0833) (10,0.0833) (12,0.2500) ]

56 [ (92,0.2000) (34,0.3000) (29,0.1000) (10,0.1000) (12,0.3000) ]

92 [ (34,0.3333) (10,0.3333) (12,0.3333) ]

Stripe approach – Output (Reducer2)

79 [ (56,0.2000) (92,0.2000) (34,0.2000) (18,0.2000) (12,0.2000) ]

Hybrid approach (Mapper) – pseudo code

method map(docid id, doc d)
    HashMap H;
    for each term w in doc d do
        for each neighbor u in Neighbor(w) do
            if H.contains(Pair(w, u))
                H{Pair(w, u)} += 1;
            else
                H.add(Pair(w, u), 1);
    for each Pair p in H do
        Emit(p, H(p));

Hybrid approach (Mapper) – Java Code
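
The hybrid mapper emits pairs like the pair approach but aggregates their counts in memory first. Below is a minimal plain-Java sketch of that idea; `HybridMapperSketch`, the "w,u" string key, and the neighbor window (terms after w up to the next occurrence of w) are assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hadoop-free sketch of the hybrid-approach mapper; all names are illustrative. */
final class HybridMapperSketch {

    /** Returns the locally aggregated count for every pair "w,u" found in the line. */
    static Map<String, Integer> map(String line) {
        Map<String, Integer> counts = new TreeMap<>();
        if (line.trim().isEmpty()) {
            return counts;
        }
        String[] terms = line.trim().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
            // Assumed window: stop at the next occurrence of the same term.
            for (int j = i + 1; j < terms.length && !terms[j].equals(terms[i]); j++) {
                counts.merge(terms[i] + "," + terms[j], 1, Integer::sum);
            }
        }
        // Each map entry corresponds to one Emit(p, H(p)) in the pseudo code.
        return counts;
    }

    public static void main(String[] args) {
        map("34 56 29 12 34 56 92 10 34 12")
            .forEach((pair, count) -> System.out.println("(" + pair + ") " + count));
    }
}
```

Aggregating before emitting means a pair that occurs several times in one input produces a single record with its count, rather than one record per occurrence.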

Hybrid approach (Reducer) – pseudo code

prev = null;
HashMap H;

method reduce(Pair p, Iterable<Int> values)
    if prev != null and p.firstValue != prev
        total = sumValues(H);
        for each item h in H
            h.secondValue /= total;
        Emit(prev, H);
        clear(H);
    end if
    prev = p.firstValue;
    H.add(p.secondValue, sum(values));

method close
    // for the last term
    total = sumValues(H);
    for each item h in H
        h.secondValue /= total;
    Emit(prev, H);

Hybrid approach (Reducer) – Java Code

Hybrid approach (Reducer) – Java Code
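
A hedged plain-Java sketch of the hybrid reducer follows. It receives pairs, rebuilds a stripe per left term, and emits the normalized stripe when the left term changes or when the reducer closes. `HybridReducerSketch` and its method names are hypothetical, and the sketch assumes keys arrive grouped by their first element.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hadoop-free sketch of the hybrid-approach reducer; all names are illustrative. */
final class HybridReducerSketch {

    private String prev = null;                                  // left term being accumulated
    private final Map<String, Integer> stripe = new TreeMap<>(); // counts for prev
    private final Map<String, Map<String, Double>> output = new LinkedHashMap<>();

    /** Called once per pair (w,u); pairs with the same w are assumed to arrive together. */
    void reduce(String w, String u, List<Integer> values) {
        if (prev != null && !prev.equals(w)) {
            flush();
        }
        prev = w;
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        stripe.merge(u, sum, Integer::sum);
    }

    /** Emits the stripe of the last term and returns everything emitted so far. */
    Map<String, Map<String, Double>> close() {
        if (prev != null) {
            flush();
            prev = null;
        }
        return output;
    }

    // Normalizes the current stripe by its total and records it under prev.
    private void flush() {
        int total = 0;
        for (int count : stripe.values()) {
            total += count;
        }
        Map<String, Double> frequencies = new TreeMap<>();
        for (Map.Entry<String, Integer> e : stripe.entrySet()) {
            frequencies.put(e.getKey(), e.getValue() / (double) total);
        }
        output.put(prev, frequencies);
        stripe.clear();
    }
}
```

Note that the stripe is emitted under the previous term, not the term of the pair that triggered the flush.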

Hybrid approach - Input

Mapper1 input

34 56 29 12 34 56 92 10 34 12

Mapper2 input

18 29 12 34 79 18 56 12 34 92

Hybrid approach – Output (Reducer1)

10 (12,0.5) (34,0.5)

12 (10,0.09090909) (18,0.09090909) (34,0.36363637) (56,0.18181819) (79,0.09090909) (92,0.18181819)

18 (12,0.25) (29,0.125) (34,0.25) (56,0.125) (79,0.125) (92,0.125)

29 (10,0.06666667) (12,0.26666668) (18,0.06666667) (34,0.26666668) (56,0.13333334) (79,0.06666667) (92,0.13333334)

34 (10,0.083333336) (12,0.25) (18,0.083333336) (29,0.083333336) (56,0.25) (79,0.083333336) (92,0.16666667)

56 (10,0.1) (12,0.3) (29,0.1) (34,0.3) (92,0.2)

92 (10,0.33333334) (12,0.33333334) (34,0.33333334)

Hybrid approach – Output (Reducer2)

79 (12,0.2) (18,0.2) (34,0.2) (56,0.2) (92,0.2)

Comparison

Apache Spark

Write a Java program on Spark to calculate the total number of students joining MUM across the different entries. The program should display the total number of students by country.

Spark - Java Code
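
The aggregation itself is a map to (country, count) pairs followed by a sum per key, which is the pattern Spark's `mapToPair` and `reduceByKey` express. The sketch below shows the same logic with plain Java collections so that it runs without a Spark installation; it is not Spark code. The class name `StudentsByCountrySketch` is hypothetical, and the field layout "year month country count" is an assumption based on the sample input.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of the per-country total; not Spark code, names are illustrative. */
final class StudentsByCountrySketch {

    /** Sums the student count (4th field) per country (3rd field). */
    static Map<String, Integer> totalByCountry(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 4) {
                continue; // skip blank or malformed lines
            }
            // Equivalent of emitting (country, count) and reducing by key with a sum.
            totals.merge(fields[2], Integer.parseInt(fields[3]), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
            "2014 Feb Nepal 20",
            "2014 Feb India 15",
            "2015 Feb Nepal 10",
            "2015 Feb India 25");
        totalByCountry(input).forEach((country, total) ->
            System.out.println("(" + country + "," + total + ")"));
    }
}
```

The `TreeMap` returns countries in alphabetical order; a distributed job gives no such ordering guarantee unless the result is sorted explicitly.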

Spark - input

2014 Feb Nepal 20

2014 Feb India 15

2014 Oct Italy 2

2014 July France 1

2015 Feb Nepal 10

2015 Feb India 25

2015 Oct Italy 7

Spark - Output

(France,1)

(Italy,9)

(Nepal,30)

(India,40)

Tools Used

• VMware Player Pro 7

• cloudera-quickstart-vm-5.4.0-0-vmware

• Eclipse Version: Luna Service Release 2 (4.4.2)

• Windows 8.1

Thank You