Developing unit-testable software with Hadoop HUG UK - Jan 13th 2015

Developing Unit Testable Software with Hadoop at Expedia

108 Keywords

1010 Web hits

Big data + High stakes = Exciting

Big data + High stakes + Complexity + Change + Unknowns = Fear

Benefits of unit testing

• Build confidence

• Enable change

• Describe behaviour

• Accelerate development

Tests need to be:

• Simple and quick to write

• Simple and quick to run

Hadoop testing challenges

• Framework modularization issues

• Heavyweight execution engine

• Availability of testing utilities

Sequential modularization

Modularization via encapsulation

Hive modularity

Cascading modularity

Pig modularity

Crunch modularity

MapReduce modularity
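
As a rough illustration of modularization via encapsulation in a MapReduce job, the per-record logic can live in a plain Java class that a Mapper's map() method simply delegates to. The sketch below is an assumption about how that might look, not code from the talk; the class name TokenLogic and the method tokenize are hypothetical. It uses only the standard library, so a test can run it without starting Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: holds the per-record logic a Mapper's map() would delegate to.
// It has no Hadoop dependencies, so a plain unit test can exercise it directly.
final class TokenLogic {

    // Splits a line into lower-cased alphanumeric tokens, dropping empty fragments.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        if (line == null) {
            return tokens;
        }
        for (String part : line.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Big data, high stakes")); // [big, data, high, stakes]
    }
}
```

With this split, the Mapper itself shrinks to a thin adapter that converts Hadoop types to and from Java types, and the behaviour worth testing sits in a class that is simple and quick to test.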

Local execution

Framework | Local engine
Hive      | Local mode
Cascading | LocalFlowConnector
Pig       | Local mode
Crunch    | MemPipeline

Helper libraries

Framework | Library
Hive      | HiveRunner
Cascading | cascading-test, Plunger
Pig       | PigUnit
Crunch    | MemPipeline

Example

Hive example with HiveRunner

[https://github.com/klarna/HiveRunner]

Hive + HiveRunner: pros

• Write/test Hive apps in the same environment

• Seamless UDF development

Hive + HiveRunner: cons

• Slow execution

• CSV data – hard to maintain

• Assertions on CSV strings are brittle

• Hadoop compatibility issues
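
To make the brittleness of CSV string assertions concrete, here is a minimal sketch in plain Java. It does not use HiveRunner's API; the BookingRow class, its fields, and the sample values are hypothetical and chosen only for illustration. Two output lines that differ in whitespace and number formatting fail a string comparison, yet compare equal once parsed into typed fields.

```java
import java.util.Objects;

// Hypothetical typed row standing in for one line of CSV query output.
final class BookingRow {
    final String hotel;
    final int nights;
    final double price;

    BookingRow(String hotel, int nights, double price) {
        this.hotel = hotel;
        this.nights = nights;
        this.price = price;
    }

    // Parses "hotel,nights,price". Assumes no quoted fields or embedded commas.
    static BookingRow parse(String csvLine) {
        String[] fields = csvLine.split(",", -1);
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields: " + csvLine);
        }
        return new BookingRow(
                fields[0].trim(),
                Integer.parseInt(fields[1].trim()),
                Double.parseDouble(fields[2].trim()));
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof BookingRow)) {
            return false;
        }
        BookingRow that = (BookingRow) other;
        return hotel.equals(that.hotel)
                && nights == that.nights
                && Double.compare(price, that.price) == 0;
    }

    @Override
    public int hashCode() {
        return Objects.hash(hotel, nights, price);
    }

    public static void main(String[] args) {
        String expected = "Ritz,2,100.0";
        String actual = "Ritz, 2, 100.00";
        System.out.println(expected.equals(actual));               // false
        System.out.println(parse(expected).equals(parse(actual))); // true
    }
}
```

Parsing output into typed values before asserting is one possible mitigation; it keeps tests focused on the data rather than on its textual rendering.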

Cascading example with Plunger

[https://github.com/HotelsDotCom/plunger]

Cascading + Plunger: pros

• Compact tests, well defined scope

• Use standard Java tools

• Fast

Cascading + Plunger: cons

• Some tools only appear to work

• False sense of security

Measuring coverage

• Identify activated branches

• No tools do this

Conclusions

• Testing possible with most frameworks

• Efficacy largely influenced by framework

• Tooling is immature

We’re hiring

Java developers

Hadoop developers

Questions?

Attribution

https://flic.kr/p/7XQdXm - Chris Campbell - CC BY-NC 2.0
https://flic.kr/p/4hpX7j - Andrew_Writer - CC BY-NC-ND 2.0
http://bit.ly/1BVe8xH - Prokofiev - CC BY-SA 3.0

Resources

HiveRunner

https://github.com/klarna/HiveRunner

Plunger

https://github.com/HotelsDotCom/plunger