Hadoop and NoSQL Basics: Big Data Demystified

NYS Innovation Summit, 12/17/2013

Matt LeMay, @mattlemay

“When I want people to think I’m smart, I just say ‘HADOOP’ really loud.”

“Big Data!”

“Data Science!”

“Hadoop! There it is.”

“Algorithms!”

... why are we thinking about this at all?

ALL the data created until the year 2003 = ALL the data created every two days

Writes > 12 terabytes of data per day.

*the 451 group

... how did we get here?

HIERARCHICAL DATABASE MODEL

RELATIONAL DATABASE MODEL

DOCUMENT DATABASE MODEL

HIERARCHICAL DATABASE MODEL

• Used in early mainframe computing
• Stores data in one-to-many “trees”
• Not very flexible

Fruit
    Apple
        Granny Smith
        Honeycrisp
        Red Delicious
    Orange
    Grape
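As a rough illustration of the one-to-many “tree” idea, the fruit hierarchy can be modeled as nested dictionaries. This is only a sketch: `fruit_tree` and `find_path` are names invented here, and placing the three varieties under Apple is an assumption about the slide's diagram. The point is that every record has one parent, so a lookup has to walk down from the root.

```python
# Hypothetical in-memory stand-in for a hierarchical database: each node has
# exactly one parent, so records are reached by walking down from the root.
fruit_tree = {
    "Fruit": {
        "Apple": {"Granny Smith": {}, "Honeycrisp": {}, "Red Delicious": {}},
        "Orange": {},
        "Grape": {},
    }
}


def find_path(tree, target, path=()):
    """Return the root-to-node path for `target`, or None if it is absent."""
    for name, children in tree.items():
        here = path + (name,)
        if name == target:
            return here
        found = find_path(children, target, here)
        if found is not None:
            return found
    return None


print(find_path(fruit_tree, "Honeycrisp"))
```

Asking a question the tree was not shaped for (say, “which fruits share a variety name?”) means visiting every branch, which is one way to read “not very flexible.”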

RELATIONAL DATABASE MODEL

• Invented in 1970 by Edgar F. Codd at IBM
• Stores data in “tuples” which resemble rows of a table
• Still the most widely used database model

Fruit_Variety    Fruit
Granny Smith     Apple
Honeycrisp       Apple
Red Delicious    Apple
Navel            Orange

RELATIONAL DATABASE MODEL

• ... can also store hierarchical data!

Fruit_ID    Fruit_Name
1           Orange
2           Apple
3           Grape

Variety_ID    Variety_Name     Fruit_ID
1             Granny Smith     2
2             Honeycrisp       2
3             Red Delicious    2
4             Navel            1

• Has rigid structure or “schema.”

• Uses unique “keys” for consistency across “tables”

DOCUMENT DATABASE MODEL

Red Delicious Apple
Honeycrisp Apple
Granny Smith Apple
Navel Orange

• Doesn’t have a single structure or “schema” that each entry must follow
• Developed in 1995 for use with Lotus Notes
• SO TRENDY

DOCUMENT DATABASE MODEL

• CAN have structured elements, but structure doesn’t need to be consistent across entries

{
  "Fruits": [
    {
      "Type": "Apple",
      "Variety": "Red Delicious"
    },
    {
      "Name": "Granny Smith Apple"
    },
    "Navel Orange"
  ]
}
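To see what “structure doesn’t need to be consistent across entries” means for the code that reads the data, here is a minimal sketch that loads the document above with Python's standard `json` module. The `describe` helper is hypothetical; it exists only to show that the reader, not the database, has to cope with each entry's shape.

```python
import json

raw = """
{
  "Fruits": [
    {"Type": "Apple", "Variety": "Red Delicious"},
    {"Name": "Granny Smith Apple"},
    "Navel Orange"
  ]
}
"""


def describe(entry):
    """Turn one entry into a display string, whatever shape it has."""
    if isinstance(entry, str):   # a bare string such as "Navel Orange"
        return entry
    if "Name" in entry:          # an object with a single "Name" field
        return entry["Name"]
    return f'{entry["Variety"]} {entry["Type"]}'  # an object with Type + Variety


doc = json.loads(raw)
names = [describe(e) for e in doc["Fruits"]]
print(names)
```

Each new shape that shows up in the data needs another branch in `describe`, which is the price of skipping a schema.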

HIERARCHICAL DATABASE MODEL

RELATIONAL DATABASE MODEL

DOCUMENT DATABASE MODEL

RIGID

FLEXIBLE

Relational Database is to Document Database

As Excel Spreadsheet is to Word Document

... as SQL is to NoSQL*

*... mostly / sorta. Stay tuned!

SQL, or “Structured Query Language,” is a language for getting data into and out of a relational database.

“SELECT Variety_Name FROM fruits WHERE fruit_id = 2”

Variety_Name
----------------------
Granny Smith
Honeycrisp
Red Delicious
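A query like the one above can be tried with Python's built-in `sqlite3` module. This is a sketch under assumptions: the slides never name the two tables, so `fruit` and `variety` are names made up here, and the query is adapted to them. The last few lines also show a key at work: the database refuses a variety that points at a fruit that does not exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
    CREATE TABLE fruit (
        fruit_id   INTEGER PRIMARY KEY,
        fruit_name TEXT NOT NULL
    );
    CREATE TABLE variety (
        variety_id   INTEGER PRIMARY KEY,
        variety_name TEXT NOT NULL,
        fruit_id     INTEGER NOT NULL REFERENCES fruit(fruit_id)
    );
    INSERT INTO fruit VALUES (1, 'Orange'), (2, 'Apple'), (3, 'Grape');
    INSERT INTO variety VALUES
        (1, 'Granny Smith', 2), (2, 'Honeycrisp', 2),
        (3, 'Red Delicious', 2), (4, 'Navel', 1);
""")

# Same question as the slide's query: which varieties belong to fruit 2 (Apple)?
rows = conn.execute(
    "SELECT variety_name FROM variety WHERE fruit_id = ? ORDER BY variety_id", (2,)
).fetchall()
apple_varieties = [name for (name,) in rows]
print(apple_varieties)

# The key constraint rejects a row that references a missing fruit_id.
try:
    conn.execute("INSERT INTO variety VALUES (5, 'Concord', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```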

Depending on who you ask, “NoSQL” means “NOT SQL” or “NOT ONLY SQL.”

(In fact, some characterize NoSQL as a “movement,” not a particular technology or set of technologies.)

“SQL Databases” are highly standardized. !

“NoSQL Databases” are highly fragmented. Some are document model databases, some use a variation of a key-value store.

Document Databases

So, what are the characteristics of NoSQL databases* that make them so trendy and exciting?

* Generally

Relational databases have strict “schemas” dictating the structure of data.

NoSQL databases are generally “schemaless,” even when they use key-value stores.

Can start entering data before deciding on how that data will be formatted

Less structured, less consistent

More flexible

Relational databases can scale up (on one computer) but not easily out (across many computers).

NoSQL databases are designed to scale out across many computers.

Lots of machines == BIG data

More complicated to set up

Can scale quickly if needed

No single point of failure
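One common way to spread records “out” across machines, not described in the talk itself, is to hash each record's key and use the hash to pick a machine. The sketch below simulates that with plain dictionaries standing in for machines; `NODES`, `node_for`, `put`, and `get` are all hypothetical names.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical machine names


def node_for(key, nodes=NODES):
    """Pick a machine for `key` by hashing it; the same key always maps to the same node."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


# Each "machine" is just a dict in this sketch.
cluster = {node: {} for node in NODES}


def put(key, value):
    cluster[node_for(key)][key] = value


def get(key):
    return cluster[node_for(key)].get(key)


for fruit in ["Red Delicious Apple", "Honeycrisp Apple", "Granny Smith Apple", "Navel Orange"]:
    put(fruit, {"name": fruit})

print(get("Navel Orange"))
```

With simple modulo hashing, adding a machine changes where most keys live, which is part of why scale-out systems are “more complicated to set up.”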

Relational databases read and write information directly to a disk drive.

NoSQL databases store information in memory, and/or include robust built-in caching in memory.

Faster

Memory more expensive than disk

Potential reliability issues

Relational databases follow the “ACID” model: Atomicity, Consistency, Isolation, Durability.

NoSQL databases do not follow the “ACID” model.

More freedom to handle requests in a way that honors the uniqueness of “things.”

Much greater room for (potentially serious) errors.
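To make the ACID side of this contrast concrete, here is a small sketch of atomicity (the “A”) using `sqlite3`. The `stock` table and `move_crates` function are invented for illustration. A transfer is two updates; if the second one breaks a rule, the database undoes the first one too, so the data never ends up half-changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock (fruit TEXT PRIMARY KEY, crates INTEGER CHECK (crates >= 0))"
)
conn.executemany("INSERT INTO stock VALUES (?, ?)", [("Apple", 5), ("Orange", 0)])
conn.commit()


def move_crates(src, dst, n):
    """Move n crates between rows; either both updates happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE stock SET crates = crates + ? WHERE fruit = ?", (n, dst))
            conn.execute("UPDATE stock SET crates = crates - ? WHERE fruit = ?", (n, src))
        return True
    except sqlite3.IntegrityError:
        return False


ok = move_crates("Apple", "Orange", 3)       # fits: Apple 5 -> 2, Orange 0 -> 3
failed = move_crates("Apple", "Orange", 10)  # second UPDATE would make Apple negative
stock = dict(conn.execute("SELECT fruit, crates FROM stock"))
print(ok, failed, stock)
```

Without that guarantee, the failed transfer would have left ten extra crates on the Orange row, which is the kind of “potentially serious” error the slide has in mind.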

Relational databases represent data as “rows” and “columns.”

NoSQL databases often represent data in formats such as JSON, which are native to many programming languages.

Easier, faster for programmers

Harder for non-programmers

SO WAIT, THOUGH, how the f*** do you find anything in a NoSQL database????

HADOOP is an open source framework for doing MapReduce.

MapReduce is one way to make sense of a document database.

(That’s how GOOGLE does it.)

MapReduce has two core steps: Map and Reduce.

... both are pretty much what they sound like.

This is what it actually looks like:

function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1)

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
        sum += ParseInt(pc)
    emit (word, sum)

MAP:

“For a given document, map each word, phrase, or item to the number of times that word, phrase, or item appears.”

REDUCE:

“NOW, take all of those maps from every document, and reduce them to a single list of items and counts.”

Red Delicious Apple
Honeycrisp Apple
Granny Smith Apple
Navel Orange

MAP

(Red, 1) (Delicious, 1) (Apple, 1)
(Honeycrisp, 1) (Apple, 1)
(Granny, 1) (Smith, 1) (Apple, 1)
(Navel, 1) (Orange, 1)

REDUCE

(Red, 1) (Delicious, 1) (Apple, 3) (Honeycrisp, 1) (Navel, 1) (Orange, 1) (Granny, 1) (Smith, 1)
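The word-count walk-through above can be run as ordinary Python. This is a single-process sketch, not Hadoop: `map_words`, `reduce_counts`, and `map_reduce` are names chosen here, and the grouping step in the middle (often called the “shuffle”) is something a real framework does for you.

```python
from collections import defaultdict


def map_words(name, document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.split():
        yield (word, 1)


def reduce_counts(word, partial_counts):
    """Reduce step: add up all the partial counts collected for one word."""
    return (word, sum(partial_counts))


def map_reduce(documents):
    grouped = defaultdict(list)  # the "shuffle": gather emitted values by key
    for name, text in documents.items():
        for word, count in map_words(name, text):
            grouped[word].append(count)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


documents = {
    "doc1": "Red Delicious Apple",
    "doc2": "Honeycrisp Apple",
    "doc3": "Granny Smith Apple",
    "doc4": "Navel Orange",
}
counts = map_reduce(documents)
print(counts["Apple"])
```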

The hard work is distributed

The easy work is centralized

... but what if we’ve got our documents stored on multiple machines?

COMP 1
    Red Delicious Apple
    Honeycrisp Apple

COMP 2
    Granny Smith Apple
    Navel Orange

MAP (on each machine)

COMP 1
    (Red, 1) (Delicious, 1) (Apple, 1)
    (Honeycrisp, 1) (Apple, 1)

COMP 2
    (Granny, 1) (Smith, 1) (Apple, 1)
    (Navel, 1) (Orange, 1)

REDUCE (on each machine)

COMP 1
    (Red, 1) (Delicious, 1) (Apple, 2) (Honeycrisp, 1)

COMP 2
    (Navel, 1) (Orange, 1) (Granny, 1) (Smith, 1) (Apple, 1)

REDUCE

(Red, 1) (Delicious, 1) (Apple, 3) (Honeycrisp, 1) (Navel, 1) (Orange, 1) (Granny, 1) (Smith, 1)
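The two-machine version can be simulated the same way. In this sketch the “machines” are just dictionary entries processed one after another in a single program; the names `machines`, `partials`, and `reduce_pairs` are made up here. What it shows is that the same reduce function works twice: once on each machine's own pairs, and once more on the partial totals.

```python
from collections import Counter


def map_words(document):
    """Map step: one (word, 1) pair per word."""
    return [(word, 1) for word in document.split()]


def reduce_pairs(pairs):
    """Sum the counts per word; works on raw (word, 1) pairs and on partial totals alike."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


machines = {
    "COMP 1": ["Red Delicious Apple", "Honeycrisp Apple"],
    "COMP 2": ["Granny Smith Apple", "Navel Orange"],
}

# Map + local reduce: the heavy lifting happens where the documents live.
partials = {}
for name, docs in machines.items():
    pairs = [pair for doc in docs for pair in map_words(doc)]
    partials[name] = reduce_pairs(pairs)

# Final reduce: only the small partial totals need to be brought together.
final = reduce_pairs(
    (word, count) for partial in partials.values() for word, count in partial.items()
)
print(partials["COMP 1"]["Apple"], partials["COMP 2"]["Apple"], final["Apple"])
```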

Is this the easiest way to count apples?

NOT*

* relational database

Tweet Text: “I am so happy!”
Tweet Location: “Albuquerque, NM”
User Home: “New York, NY”

Tweet Text: “#FML #FML #FML”
Tweet Location: “Palo Alto, CA”
User Home: “San Francisco, CA”

MAP (WITH MATH + SENTIMENT)

(Distance in Miles, Sentiment Score)
(1808, +.9)
(33, -.6)

REDUCE

(1808, +.9) (33, -.6)

RINSE AND REPEAT LIKE A MILLION TIMES
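A map step like the tweet example might look roughly like the sketch below. Nearly everything in it is an assumption made for illustration: the coordinates are approximate and hard-coded (a real job would use a geocoding service), the two-word sentiment lexicon is invented, and the distance uses the standard haversine great-circle formula, so the numbers will be close to, but not the same as, the ones on the slide.

```python
import math

# Assumed, approximate (latitude, longitude) pairs; a real job would call a geocoder.
COORDS = {
    "Albuquerque, NM": (35.0844, -106.6504),
    "New York, NY": (40.7128, -74.0060),
    "Palo Alto, CA": (37.4419, -122.1430),
    "San Francisco, CA": (37.7749, -122.4194),
}
# Toy sentiment lexicon, invented for this sketch.
LEXICON = {"happy": 0.9, "#fml": -0.6}


def miles_between(a, b):
    """Great-circle (haversine) distance in miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(h))


def sentiment(text):
    """Average the lexicon scores of the words we recognize; 0.0 if none match."""
    words = text.lower().replace("!", "").split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def map_tweet(tweet):
    """Map step: one tweet in, one (distance in miles, sentiment score) pair out."""
    distance = miles_between(COORDS[tweet["location"]], COORDS[tweet["home"]])
    return (round(distance), sentiment(tweet["text"]))


def reduce_tweets(pairs):
    """Reduce step: here it only gathers every pair into one list."""
    return list(pairs)


tweets = [
    {"text": "I am so happy!", "location": "Albuquerque, NM", "home": "New York, NY"},
    {"text": "#FML #FML #FML", "location": "Palo Alto, CA", "home": "San Francisco, CA"},
]
pairs = reduce_tweets(map_tweet(t) for t in tweets)
print(pairs)
```

The gathered pairs are only raw material; deciding what, if anything, distance from home says about mood is the analysis that still has to be done afterward.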

... none of this is magic.

... in fact, the “magic” part is just a precursor to doing the actual hard work.

danah boyd and Kate Crawford’s Six Provocations for Big Data:

1. Automating Research Changes the Definition of Knowledge
2. Claims to Objectivity and Accuracy are Misleading
3. Bigger Data are Not Always Better Data
4. Not All Data Are Equivalent
5. Just Because it is Accessible Doesn’t Make it Ethical
6. Limited Access to Big Data Creates New Digital Divides

What about THE FUTURE?

HIERARCHICAL DATABASE MODEL

RELATIONAL DATABASE MODEL

DOCUMENT DATABASE MODEL

RIGID

FLEXIBLE

?

Further Reading:

Martin Fowler on NoSQL: http://martinfowler.com/nosql.html
Helpful Stack Overflow thread: http://stackoverflow.com/questions/11844603/technology-decision-sql-vs-nosql-vs-newsql
Finding Friends with MapReduce: http://stevekrenzel.com/finding-friends-with-mapreduce
Choosing a Database That’s Right for Your Business: http://slashdot.org/topic/bi/choosing-a-database-right-for-business-2/
Demystifying the Role of Big Data in Marketing: http://www.guardian.co.uk/media-network/media-network-blog/2013/mar/12/big-data-marketing-demystified
The NoSQL Movement: http://strata.oreilly.com/2012/02/nosql-non-relational-database.html
Big Data Tools Cost Too Much, Do Too Little: http://www.theregister.co.uk/2013/02/28/hadoop_no_sql_dont_believe_the_hype/
Is Big Data an Economic Big Dud?: http://www.nytimes.com/2013/08/18/sunday-review/is-big-data-an-economic-big-dud.html?hp&_r=1&
Six Provocations for Big Data: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1926431
