
Hadoop and NoSQL Basics: Big Data Demystified

NYS Innovation Summit, 12/17/2013

Matt LeMay, @mattlemay

“When I want people to think I’m smart, I just say ‘HADOOP’ really loud.”

“Big Data!”

“Data Science!”

“Hadoop! There it is.”

“Algorithms!”

... why are we thinking about this at all?

ALL the data created until the year 2003

=

ALL the data created every two days

Writes > 12 terabytes of data per day.

*the 451 group

... how did we get here?

HIERARCHICAL DATABASE MODEL

RELATIONAL DATABASE MODEL

DOCUMENT DATABASE MODEL

• Used in early mainframe computing
• Stores data in one-to-many “trees”
• Not very flexible

Apple Orange Grape

Granny Smith Honeycrisp Red Delicious
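As a rough illustration of the one-to-many "tree" idea, here is a minimal Python sketch. It assumes, as the diagram suggests, that the three varieties hang under Apple. The `tree` dictionary and the `find_path` helper are made-up names for this example and are not part of any real hierarchical database.

```python
# A hierarchical ("tree") model: every record has exactly one parent.
# Assumption: the varieties shown belong under "Apple", as in the diagram.
tree = {
    "Fruit": {
        "Apple": {"Granny Smith": {}, "Honeycrisp": {}, "Red Delicious": {}},
        "Orange": {},
        "Grape": {},
    }
}


def find_path(node, target, path=()):
    """Walk the tree top-down and return the path to target, or None."""
    for name, children in node.items():
        current = path + (name,)
        if name == target:
            return current
        found = find_path(children, target, current)
        if found:
            return found
    return None


print(find_path(tree, "Honeycrisp"))
```

The only way to reach a record is from the top down, which is part of why this model is not very flexible.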

• Invented in 1970 by Edgar F. Codd at IBM
• Stores data in “tuples” which resemble rows of a table
• Still the most widely used database model

Fruit_Variety Fruit

Granny Smith Apple

Honeycrisp Apple

Red Delicious Apple

Navel Orange

• ... can also store hierarchical data!

Fruit_ID Fruit_Name

1 Orange

2 Apple

3 Grape

Variety_ID Variety_Name Fruit_ID

1 Granny Smith 2

2 Honeycrisp 2

3 Red Delicious 2

4 Navel 1

• Has rigid structure or “schema.”

• Uses unique “keys” for consistency across “tables”
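The two tables above, and the key that ties them together, can be sketched with Python's built-in `sqlite3` module. The slides do not name the tables: `fruits` for the variety table follows the SELECT example later in the deck, and `fruit_kinds` is a made-up name for the other one. Treat this as one possible layout, not the author's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

# Assumption: table names are illustrative; the slides only show column names.
conn.execute("CREATE TABLE fruit_kinds (Fruit_ID INTEGER PRIMARY KEY, Fruit_Name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE fruits ("
    " Variety_ID INTEGER PRIMARY KEY,"
    " Variety_Name TEXT NOT NULL,"
    " Fruit_ID INTEGER NOT NULL REFERENCES fruit_kinds(Fruit_ID))"
)
conn.executemany("INSERT INTO fruit_kinds VALUES (?, ?)", [(1, "Orange"), (2, "Apple"), (3, "Grape")])
conn.executemany(
    "INSERT INTO fruits VALUES (?, ?, ?)",
    [(1, "Granny Smith", 2), (2, "Honeycrisp", 2), (3, "Red Delicious", 2), (4, "Navel", 1)],
)

# The key links the two tables back together.
rows = conn.execute(
    "SELECT v.Variety_Name, k.Fruit_Name FROM fruits v"
    " JOIN fruit_kinds k ON k.Fruit_ID = v.Fruit_ID ORDER BY v.Variety_ID"
).fetchall()
print(rows)

# A variety that points at a fruit that does not exist is rejected.
try:
    conn.execute("INSERT INTO fruits VALUES (5, 'Mystery', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The rejected insert is the "consistency across tables" part: the database refuses a row whose key has nothing to point to.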

Red Delicious Apple

Honeycrisp Apple

Granny Smith Apple

Navel Orange

• Doesn’t have a single structure or “schema” that each entry must follow
• Developed in 1995 for use with Lotus Notes
• SO TRENDY

• CAN have structured elements, but structure doesn’t need to be consistent across entries

{
  "Fruits": [
    {
      "Type": "Apple",
      "Variety": "Red Delicious"
    },
    {
      "Name": "Granny Smith Apple"
    },
    "Navel Orange"
  ]
}
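To make the "structure doesn't need to be consistent" point concrete, here is a small Python sketch that reads the same three entries. It assumes the document is closed with `]` and `}`, which the slide cuts off, and the `describe` helper is a made-up name for this example.

```python
import json

# The three entries from the slide, each with a different shape.
raw = """
{
  "Fruits": [
    {"Type": "Apple", "Variety": "Red Delicious"},
    {"Name": "Granny Smith Apple"},
    "Navel Orange"
  ]
}
"""


def describe(entry):
    """Return a display name for an entry, whatever shape it has."""
    if isinstance(entry, str):
        return entry
    if "Name" in entry:
        return entry["Name"]
    return f'{entry["Variety"]} {entry["Type"]}'


names = [describe(entry) for entry in json.loads(raw)["Fruits"]]
print(names)
```

The database accepts all three shapes. The price is that the reading code has to cope with each of them.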

FLEXIBLE

Relational Database is to Document Database

As Excel Spreadsheet is to Word Document

... as SQL is to NoSQL*

*... mostly / sorta. Stay tuned!

SQL, or “Structured Query Language,” is a language for getting data into and out of a relational database.

“SELECT Variety_Name FROM fruits WHERE fruit_id = 2”

Variety_Name
----------------------
Granny Smith
Honeycrisp
Red Delicious
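The query above can be tried out with Python's built-in `sqlite3` module. This sketch assumes the variety table is the one called `fruits`, since that is the table the query names. The column layout is copied from the earlier slides, and the in-memory database is just for demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: the variety table from the earlier slides is named "fruits".
conn.execute("CREATE TABLE fruits (Variety_ID INTEGER PRIMARY KEY, Variety_Name TEXT, Fruit_ID INTEGER)")
conn.executemany(
    "INSERT INTO fruits VALUES (?, ?, ?)",
    [(1, "Granny Smith", 2), (2, "Honeycrisp", 2), (3, "Red Delicious", 2), (4, "Navel", 1)],
)

# The query from the slide, unchanged. SQLite column names are case-insensitive.
query = "SELECT Variety_Name FROM fruits WHERE fruit_id = 2"
result = [row[0] for row in conn.execute(query)]
print(result)
```

Only the three apple varieties come back, because `Navel` carries a `Fruit_ID` of 1.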

Depending on who you ask, “NoSQL” means “NOT SQL” or “NOT ONLY SQL.”

(In fact, some characterize NoSQL as a “movement,” not a particular technology or set of technologies.)

“SQL Databases” are highly standardized.

“NoSQL Databases” are highly fragmented. Some are document model databases, some use a variation of a key-value store.

Document Databases

So, what are the characteristics of NoSQL databases* that make them so trendy and exciting?

* Generally

Relational databases have strict “schemas” dictating the structure of data.

NoSQL databases are generally “schemaless,” even when they use key-value stores.

Can start entering data before deciding on how that data will be formatted

Less structured, less consistent

More flexible

Relational databases can scale up (on one computer) but not easily out (across many computers).

NoSQL databases are designed to scale out across many computers.

Lots of machines == BIG data

More complicated to set up

Can scale quickly if needed

No single point of failure
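The slides do not say how data gets spread across machines. One common approach is to hash each key and use the hash to pick a machine, sketched below. The node names and the `node_for` helper are hypothetical, and real systems typically use more elaborate schemes such as consistent hashing.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical machine names


def node_for(key, nodes=NODES):
    """Pick a machine for a key using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


# Spread a few records across the machines.
placement = {}
for key in ["Granny Smith", "Honeycrisp", "Red Delicious", "Navel"]:
    placement.setdefault(node_for(key), []).append(key)
print(placement)
```

Because the hash is stable, any machine can work out where a key lives without asking a central coordinator.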

Relational databases read and write information directly to a disk drive.

NoSQL databases store information in memory, and/or include robust built-in caching in memory.

Faster

Memory more expensive than disk

Potential reliability issues

Relational databases follow the “ACID” model: Atomicity, Consistency, Isolation, Durability.

NoSQL databases do not follow the “ACID” model.

More freedom to handle requests in a way that honors the uniqueness of “things.”

Much greater room for (potentially serious) errors.
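ACID stands for Atomicity, Consistency, Isolation, and Durability. As a small illustration of the first letter, the sketch below uses Python's built-in `sqlite3` module to move apples between two baskets in a single transaction. The `baskets` table and its contents are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: a made-up table; the CHECK forbids a negative number of apples.
conn.execute("CREATE TABLE baskets (owner TEXT PRIMARY KEY, apples INTEGER CHECK (apples >= 0))")
conn.executemany("INSERT INTO baskets VALUES (?, ?)", [("alice", 2), ("bob", 0)])
conn.commit()

try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE baskets SET apples = apples + 3 WHERE owner = 'bob'")
        # Alice only has 2 apples, so this violates the CHECK constraint.
        conn.execute("UPDATE baskets SET apples = apples - 3 WHERE owner = 'alice'")
except sqlite3.IntegrityError:
    pass

balances = dict(conn.execute("SELECT owner, apples FROM baskets"))
print(balances)
```

The first update succeeded on its own, but it is undone when the second one fails, so nobody ends up with apples that came from nowhere. A store without this guarantee leaves that bookkeeping to the application.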

Relational databases represent data as “rows” and “columns.”

NoSQL databases often represent data in formats such as JSON, which are native to many programming languages.

Easier, faster for programmers

Harder for non-programmers

SO WAIT, THOUGH, how the f*** do you find anything in a NoSQL database????

HADOOP is an open source framework for doing MapReduce.

MapReduce is one way to make sense of a document database.

(That’s how GOOGLE does it.)

MapReduce has two core steps:

Map.

Reduce.

... both are pretty much what they sound like.

This is what it actually looks like:

function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1)

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
        sum += ParseInt(pc)
    emit (word, sum)

MAP:

function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1)

“For a given document, map each word, phrase, or item to the number of times that word, phrase, or item appears.”

“NOW, take all of those maps from every document, and reduce them to a single list of items and counts.”

REDUCE:

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
        sum += ParseInt(pc)
    emit (word, sum)
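The pseudocode translates fairly directly into runnable Python. This is a single-process sketch of the idea, not Hadoop itself. The function names and the `documents` dictionary (which uses the four fruit documents from this deck) are chosen for this example.

```python
from collections import defaultdict


def map_document(name, document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]


def reduce_word(word, partial_counts):
    """Reduce step: add up all the partial counts for one word."""
    return word, sum(int(pc) for pc in partial_counts)


def map_reduce(documents):
    """Run map on every document, group the pairs by word, then reduce."""
    grouped = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_document(name, text):
            grouped[word].append(count)
    return dict(reduce_word(word, counts) for word, counts in grouped.items())


documents = {
    "doc1": "Red Delicious Apple",
    "doc2": "Honeycrisp Apple",
    "doc3": "Granny Smith Apple",
    "doc4": "Navel Orange",
}
counts = map_reduce(documents)
print(counts["Apple"])
```

The grouping loop in `map_reduce` stands in for the step a framework such as Hadoop performs between map and reduce: collecting every pair that shares a word so a single reduce call sees all of them.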

Red Delicious Apple

Honeycrisp Apple

Granny Smith Apple

Navel Orange

MAP

Red Delicious Apple: (Red, 1) (Delicious, 1) (Apple, 1)

Honeycrisp Apple: (Honeycrisp, 1) (Apple, 1)

Granny Smith Apple: (Granny, 1) (Smith, 1) (Apple, 1)

Navel Orange: (Navel, 1) (Orange, 1)

REDUCE

(Red, 1) (Delicious, 1) (Apple, 3) (Honeycrisp, 1) (Navel, 1) (Orange, 1) (Granny, 1) (Smith, 1)

The hard work is distributed

The easy work is centralized

... but what if we’ve got our documents stored on multiple machines?

COMP 1: Red Delicious Apple, Honeycrisp Apple

COMP 2: Navel Orange, Granny Smith Apple

MAP and REDUCE on each machine:

COMP 1: (Red, 1) (Delicious, 1) (Apple, 2) (Honeycrisp, 1)

COMP 2: (Navel, 1) (Orange, 1) (Granny, 1) (Smith, 1) (Apple, 1)

REDUCE (a final pass that combines both machines’ results)
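The two-machine flow can be imitated in one Python process. The sketch below assumes the split of documents implied by the per-machine counts (two apple documents on COMP 1, one on COMP 2). The `machines` dictionary and `map_and_reduce_locally` are illustrative names, and on a real cluster each local step would run on its own computer.

```python
from collections import Counter

# Assumption: which documents live on which machine, inferred from the counts.
machines = {
    "COMP 1": ["Red Delicious Apple", "Honeycrisp Apple"],
    "COMP 2": ["Navel Orange", "Granny Smith Apple"],
}


def map_and_reduce_locally(documents):
    """Count words for the documents stored on a single machine."""
    local = Counter()
    for document in documents:
        local.update(document.split())
    return local


# The hard work: each machine counts only its own documents.
partials = {name: map_and_reduce_locally(docs) for name, docs in machines.items()}

# The easy work: one final reduce adds the small per-machine results together.
total = Counter()
for partial in partials.values():
    total.update(partial)

print(partials["COMP 1"]["Apple"], partials["COMP 2"]["Apple"], total["Apple"])
```

Only the small per-machine summaries travel to the final step, never the documents themselves.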

Is this the easiest way to count apples?

* relational database

Tweet Text: “I am so happy!”
Tweet Location: “Albuquerque, NM”
User Home: “New York, NY”

Tweet Text: “#FML #FML #FML”
Tweet Location: “Palo Alto, CA”
User Home: “San Francisco, CA”

MAP (WITH MATH + SENTIMENT)

(Distance in Miles, Sentiment Score)

(1808, +.9)

(33, -.6)

REDUCE

(1808, +.9) (33, -.6)
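A map step of this kind might look like the Python sketch below. Everything specific in it is an assumption: the city coordinates are approximate, the two-word sentiment list is a toy stand-in for a real sentiment model, and the distance is a straight-line (great-circle) figure, so it will not exactly match the numbers on the slide.

```python
from math import asin, cos, radians, sin, sqrt

# Assumption: approximate city-centre coordinates (latitude, longitude).
COORDS = {
    "Albuquerque, NM": (35.0844, -106.6504),
    "New York, NY": (40.7128, -74.0060),
    "Palo Alto, CA": (37.4419, -122.1430),
    "San Francisco, CA": (37.7749, -122.4194),
}
# Assumption: a toy word list standing in for a real sentiment model.
SENTIMENT = {"happy": 0.9, "#fml": -0.6}


def miles_between(a, b):
    """Great-circle (haversine) distance in miles between two known cities."""
    lat1, lon1, lat2, lon2 = map(radians, (*COORDS[a], *COORDS[b]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(h))  # 3958.8 is roughly the Earth's radius in miles


def sentiment_of(text):
    """Average the scores of any known words; 0.0 if none are known."""
    words = text.lower().replace("!", "").split()
    scores = [SENTIMENT[w] for w in words if w in SENTIMENT]
    return sum(scores) / len(scores) if scores else 0.0


def map_tweet(tweet):
    """Map step: turn one tweet into a (distance in miles, sentiment score) pair."""
    distance = round(miles_between(tweet["location"], tweet["home"]))
    return distance, sentiment_of(tweet["text"])


tweets = [
    {"text": "I am so happy!", "location": "Albuquerque, NM", "home": "New York, NY"},
    {"text": "#FML #FML #FML", "location": "Palo Alto, CA", "home": "San Francisco, CA"},
]
pairs = [map_tweet(tweet) for tweet in tweets]  # the reduce step would combine these
print(pairs)
```

Once every tweet is reduced to a pair of numbers like this, a reduce step could, for example, average the sentiment for each distance range.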

RINSE AND REPEAT LIKE A MILLION TIMES

... none of this is magic.

... in fact, the “magic” part is just a precursor to doing the actual hard work.

Danah Boyd’s Six Provocations for Big Data:

1. Automating Research Changes the Definition of Knowledge
2. Claims to Objectivity and Accuracy are Misleading
3. Bigger Data are Not Always Better Data
4. Not All Data Are Equivalent
5. Just Because it is Accessible Doesn’t Make it Ethical
6. Limited Access to Big Data Creates New Digital Divides

What about THE FUTURE?

FLEXIBLE
