Our experience with NoSQL and MapReduce technologies

Fabio Souto

IT Monitoring Working Group, 19th September 2011

Grid Technology, CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it




Outline

• Objective
• Big data technologies
• Technologies reviewed
• Deployed infrastructure
• Current status
• Lessons learned

Problem and goal

• The SAM infrastructure for WLCG
  – monitors 400 sites and ~2,000 services daily
  – receives and stores ~600,000 metric results daily
  – computes statuses and hourly availabilities for services and sites
• SWAT is a system to gather information about the configuration of worker nodes (WNs)
• Massive data generation, making storage, search, sharing, analytics and visualization difficult
• Objective: proof of concept using big data technologies
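The hourly availability computation mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout (service, timestamp, status), the `hourly_availability` helper, the example host name and the rule "availability = fraction of OK results in the hour" are all hypothetical and are not taken from the actual SAM algorithm.

```python
from collections import defaultdict
from datetime import datetime


def hourly_availability(results):
    """Return {(service, hour): fraction of results with status 'OK'}.

    `results` is an iterable of (service, iso_timestamp, status) tuples.
    Both the layout and the OK-fraction rule are assumptions for illustration.
    """
    ok = defaultdict(int)
    total = defaultdict(int)
    for service, timestamp, status in results:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(timestamp).replace(
            minute=0, second=0, microsecond=0
        )
        key = (service, hour.isoformat())
        total[key] += 1
        if status == "OK":
            ok[key] += 1
    return {key: ok[key] / total[key] for key in total}


# Hypothetical metric results for one service.
sample = [
    ("ce01.example.org", "2011-09-19T10:05:00", "OK"),
    ("ce01.example.org", "2011-09-19T10:35:00", "CRITICAL"),
    ("ce01.example.org", "2011-09-19T11:10:00", "OK"),
]
availability = hourly_availability(sample)
```

At ~600,000 results per day a single loop like this is still feasible; the proof of concept is about what happens when history, search and analytics are added on top.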

Big Data Technologies

• NoSQL databases
  – Not relational. Schema free.
  – Distributed
  – High availability
• MapReduce
  – Framework for processing huge datasets on clusters of computers
  – Takes advantage of data locality:
    • Moving computation is more efficient than moving data
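The MapReduce model can be illustrated with a single-process sketch: a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step combines each group. The function names and the sample records are hypothetical, and the sketch does not show distribution or data locality; on a real cluster the map tasks would run on the nodes that already hold the data.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs for one record: (site, 1) per failed test."""
    site, status = record
    if status != "OK":
        yield site, 1


def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    # Shuffle step: group every emitted value under its key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce step: one reducer call per key.
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical (site, status) test results.
records = [
    ("SITE-A", "OK"),
    ("SITE-A", "CRITICAL"),
    ("SITE-B", "CRITICAL"),
    ("SITE-A", "CRITICAL"),
]
failures = run_mapreduce(records, map_phase, reduce_phase)
```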

Technologies reviewed

• NoSQL databases: ~140 different solutions, we focused on:
  – MongoDB
    • No durability (at the time of the study)
  – Cassandra
    • No single point of failure
    • Big and responsive community
• Apache Hadoop
  – Big data de facto standard
  – Framework for data-intensive applications
  – Used to write MapReduce jobs for Cassandra
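"No single point of failure" can be pictured with a simplified replica ring: every key is stored on several nodes, so losing one node still leaves copies reachable. The sketch below is a generic illustration, not Cassandra code; the `Ring` class, the MD5-based `token` function, the node names and the replication factor of 3 are assumptions, and real partitioners and replication strategies differ in detail.

```python
import hashlib
from bisect import bisect_right


def token(value):
    """Map a string to a position on the ring (an integer hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Sort nodes by token so the list represents the ring.
        self.ring = sorted((token(node), node) for node in nodes)

    def replicas(self, key):
        """Return the nodes holding `key`: the owner and its successors."""
        tokens = [t for t, _ in self.ring]
        start = bisect_right(tokens, token(key)) % len(self.ring)
        count = min(self.replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = Ring(["node1", "node2", "node3", "node4"])
owners = ring.replicas("ce01.example.org")

# Simulate the first replica going down: the other copies remain.
down = owners[0]
still_available = [node for node in owners if node != down]
```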

Technologies reviewed II

• Hive and Pig
  – Ease the complexity of writing MapReduce
  – Initially not considered
    • Less efficient than pure Hadoop
  – Independent of the data source
    • We can change to HBase easily
  – Hive: SQL-like syntax
  – Pig: data flow language
    • Is not Turing complete (no loops, if-else…)
      – But can be embedded into Python code
      – It’s possible to write custom functions in Python/Java
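What "data flow language" means can be shown by mirroring the typical steps of a Pig script (load, filter, group, aggregate) in plain Python. This is not Pig code and does not use any Pig API; the rows, the `ok_ratio_per_site` helper and the metric names are hypothetical. The final loop stands for the control flow that the host language supplies when a data flow is embedded in it.

```python
from itertools import groupby
from operator import itemgetter

# Resembles LOAD: hypothetical (site, metric, status) rows.
rows = [
    ("SITE-A", "job-submit", "OK"),
    ("SITE-A", "job-submit", "CRITICAL"),
    ("SITE-B", "job-submit", "OK"),
    ("SITE-B", "storage", "OK"),
]


def ok_ratio_per_site(rows, metric):
    # Resembles FILTER: keep the rows of one metric only.
    selected = [row for row in rows if row[1] == metric]
    # Resembles GROUP ... BY site (groupby needs sorted input).
    selected.sort(key=itemgetter(0))
    result = {}
    for site, group in groupby(selected, key=itemgetter(0)):
        statuses = [status for _, _, status in group]
        # Resembles FOREACH ... GENERATE with an aggregate.
        result[site] = statuses.count("OK") / len(statuses)
    return result


# The loop is what a plain data flow cannot express on its own.
report = {}
for metric in ("job-submit", "storage"):
    report[metric] = ok_ratio_per_site(rows, metric)
```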

Technologies reviewed III

• Hue
  – Set of Django apps to interact with Hadoop
• OpenTSDB
  – Open source time series database
  – Lack of flexibility
• Oozie
  – Job scheduler and workflow engine for Hadoop

Other Tools

• Msg-consume2db inserter:
  – WLCG Messaging infrastructure -> NoSQL
• sql2nosql-sync
  – SAM Oracle DB -> NoSQL
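The messaging-to-NoSQL path can be sketched as a consumer that parses each message and writes it under a row key. Everything here is an assumption made for illustration: the JSON message fields, the `service:metric` row key, the `InMemoryColumnFamily` stand-in and the `consume` function do not describe the real Msg-consume2db tool or its storage schema.

```python
import json


class InMemoryColumnFamily:
    """Stand-in for a NoSQL column family: row key -> {column: value}."""

    def __init__(self):
        self.rows = {}

    def insert(self, row_key, columns):
        self.rows.setdefault(row_key, {}).update(columns)


def consume(messages, store):
    """Parse each JSON message and store it under a service/metric row key."""
    inserted = 0
    for raw in messages:
        try:
            msg = json.loads(raw)
            row_key = "{}:{}".format(msg["service"], msg["metric"])
            # One column per result, named after its timestamp.
            columns = {msg["timestamp"]: msg["status"]}
        except (ValueError, KeyError, TypeError):
            # Skip malformed messages instead of stopping the consumer.
            continue
        store.insert(row_key, columns)
        inserted += 1
    return inserted


store = InMemoryColumnFamily()
messages = [
    '{"service": "ce01.example.org", "metric": "job-submit", '
    '"timestamp": "2011-09-19T10:05:00", "status": "OK"}',
    "not json",
    '{"service": "ce01.example.org", "metric": "job-submit", '
    '"timestamp": "2011-09-19T10:35:00", "status": "CRITICAL"}',
]
count = consume(messages, store)
```

Keeping one row per service and metric, with timestamps as column names, is one common way to lay out time-ordered results in a column store; other layouts are equally possible.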

Actual infrastructure

[Figure: deployed infrastructure, shown over two slides]

Current status

• SAM
  – DONE: running infrastructure that reads messaging and SAM data and launches Pig jobs to calculate availability
  – TODO:
    • Results tuning
    • Web interface to visualize the results
    • JSON/XML API to extract results
    • Unit testing
• SWAT
  – Early stage of development (~6 days)
  – Data collection

Lessons learned

• Use an abstraction layer on top of Hadoop
  – Writing pure MapReduce Hadoop apps is difficult and time-consuming
• Choose a solution with a responsive community:
  – The technology is at an early stage (unresolved bugs, undocumented functions); you will need to get in touch with developers/users
• Big data needs a big platform

Lessons learned

• Must keep up to date. New companies, technologies and tools are emerging
  – Twitter real-time Hadoop about to be released
  – Cascalog, Hadoop data mining language
  – Big data distributions: Cloudera, DataStax, MapR…


Questions?
