Our experience with NoSQL and MapReduce technologies

Grid Technology

CERN IT Department

CH-1211 Geneva 23

Switzerland

www.cern.ch/it

Fabio Souto

IT Monitoring Working Group, 19th September 2011

Outline

• Objective

• Big data technologies

• Technologies reviewed

• Deployed infrastructure

• Current status

• Lessons learned

Problem and goal

• The SAM infrastructure for WLCG
  – monitors 400 sites and ~2,000 services daily
  – receives and stores ~600,000 metric results daily
  – computes statuses and hourly availabilities for services and sites

• SWAT is a system to gather information about the configuration of worker nodes (WNs)

• Massive data generation makes storage, search, sharing, analytics and visualization difficult

• Objective: build a proof of concept using big data technologies

Big Data Technologies

• NoSQL databases
  – Not relational; schema-free
  – Distributed
  – High availability

• MapReduce
  – Framework for processing huge datasets on clusters of computers
  – Takes advantage of data locality:
    • Moving computation is more efficient than moving data
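
As a rough illustration of the MapReduce model described above, the following Python sketch runs a map phase, a shuffle and a reduce phase in a single process. The record layout, the host names and the function names are assumptions made for this example; it does not use Hadoop and is not the SAM availability algorithm.

```python
from collections import defaultdict

# Hypothetical metric results: (service, hour, status).
RESULTS = [
    ("ce01.example.org", 0, "OK"),
    ("ce01.example.org", 0, "CRITICAL"),
    ("ce01.example.org", 1, "OK"),
    ("se02.example.org", 0, "OK"),
]


def map_phase(record):
    """Emit one ((service, hour), 0 or 1) pair per metric result."""
    service, hour, status = record
    yield (service, hour), 1 if status == "OK" else 0


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce each group to the fraction of OK results."""
    return key, sum(values) / float(len(values))


def run_job(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


availability = run_job(RESULTS)
```

On a real cluster the map calls would run on the nodes that already hold each block of input, which is the data locality point made above.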

Technologies reviewed

• NoSQL databases: ~140 different solutions; we focused on:
  – MongoDB
    • No durability (at the time of the study)
  – Cassandra
    • No single point of failure
    • Big and responsive community

• Apache Hadoop
  – De facto standard for big data
  – Framework for data-intensive applications
  – Used to write MapReduce jobs for Cassandra

Technologies reviewed II

• Hive and Pig
  – Ease the complexity of writing MapReduce
  – Initially not considered
    • Less efficient than pure Hadoop
  – Independent of the data source
    • We can change to HBase easily
  – Hive: SQL-like syntax
  – Pig: data flow language
    • Is not Turing complete (no loops, if-else…)
    • But can be embedded in Python code
    • It's possible to write custom functions in Python/Java
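
The point about custom functions can be sketched as a small Python function of the kind Pig can call through its Jython support. The function name, the schema string and the rule that only "OK" counts as available are assumptions for illustration. Inside Pig's Jython engine an `outputSchema` decorator is already provided, and the fallback definition is there only so the file also runs as plain Python.

```python
# Inside Pig's Jython engine, `outputSchema` is already defined.
# This fallback only makes the file runnable as ordinary Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            func.outputSchema = schema
            return func
        return decorate


@outputSchema("is_up:int")
def is_up(status):
    """Return 1 if a (hypothetical) status string counts as available, else 0."""
    if status is None:
        return 0
    return 1 if status.upper() == "OK" else 0
```

In a Pig script such a file would typically be loaded with something like `REGISTER 'status_udf.py' USING jython AS udfs;` and called as `udfs.is_up(status)`. The file name and alias here are placeholders.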

Technologies reviewed III

• Hue
  – Set of Django apps to interact with Hadoop

• OpenTSDB
  – Open source time series database
  – Lack of flexibility

• Oozie
  – Job scheduler and workflow engine for Hadoop

Other Tools

• Msg-consume2db inserter
  – WLCG messaging infrastructure -> NoSQL

• sql2nosql-sync
  – SAM Oracle DB -> NoSQL
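
To give a feel for what an inserter of this kind does, here is a minimal Python sketch that parses a text message and writes it into a wide-row layout. It is not the code of the tools listed above. The message layout, the field names (`hostName`, `metricName`, `metricStatus`, `timestamp`), the row-key format and the in-memory `store` dictionary standing in for the NoSQL database are all assumptions.

```python
# Stand-in for a wide-row NoSQL store: row key -> {column name: value}.
store = {}


def parse_message(body):
    """Parse 'key: value' lines into a dict (assumed message layout)."""
    fields = {}
    for line in body.splitlines():
        if ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def insert_result(store, fields):
    """Store one metric result in a (host, metric) row, one column per timestamp."""
    row_key = fields["hostName"] + "|" + fields["metricName"]
    store.setdefault(row_key, {})[fields["timestamp"]] = fields["metricStatus"]
    return row_key


MESSAGE = """hostName: ce01.example.org
metricName: org.example.JobSubmit
metricStatus: OK
timestamp: 2011-09-19T10:00:00Z
"""

row = insert_result(store, parse_message(MESSAGE))
```

Keeping one row per host and metric, with timestamps as column names, keeps the history of a metric together, which is a common layout for time-ordered data in column stores.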

Deployed infrastructure

[Figure: deployed infrastructure]

Current status

• SAM
  – DONE: running infrastructure that reads messaging and SAM data and launches Pig jobs to calculate availability
  – TODO:
    • Results tuning
    • Web interface to visualize the results
    • JSON/XML API to extract results
    • Unit testing

• SWAT
  – Early stage of development (~6 days)
  – Data collection

Lessons learned

• Use an abstraction layer on top of Hadoop
  – Writing pure MapReduce Hadoop apps is difficult and time-consuming

• Choose a solution with a responsive community
  – The technology is at an early stage (unresolved bugs, undocumented functions), so you will need to get in touch with developers/users

• Big data needs a big platform

Lessons learned II

• Must keep up to date: new companies, technologies and tools are emerging
  – Twitter real-time Hadoop about to be released
  – Cascalog, a Hadoop data mining language
  – Big data distributions: Cloudera, DataStax, MapR…

Questions?
