Big Data in the “Real World” Edward Capriolo


Big data talk done for Stern NY


Page 1: Big data nyu

Big Data in the “Real World” Edward Capriolo

Page 2: Big data nyu

What is “big data”?

● Big data is a collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.

● The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

http://en.wikipedia.org/wiki/Big_data

Page 3: Big data nyu

Big Data Challenges

● The challenges include:

– capture
– curation
– storage
– search
– sharing
– transfer
– analysis
– visualization
– large
– complex

Page 4: Big data nyu

What is “big data” exactly?

● What is considered "big data" varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain.

● As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data.

http://en.wikipedia.org/wiki/Big_data

Page 5: Big data nyu

Big Data Qualifiers

● varies
● capabilities
● traditionally
● feasibly
● reasonably
● [somptha]bytes of data

Page 6: Big data nyu

My first “big data” challenge

● Real time news delivery platform
● Ingest news as text and provide full text search
● Qualifiers
– Reasonable: Real time search was < 1 second
– Capabilities: small company, <100 servers
● Big Data challenges
– Storage: roughly 300GB for 60 days of data
– Search: searches of thousands of terms

Page 7: Big data nyu
Page 8: Big data nyu

Traditionally

● Data was placed in MySQL
● MySQL full text search
● Easy to insert
● Easy to search
● Worked great!

– Until it got real world load

Page 9: Big data nyu

Feasibly in hardware (circa 2008)

● 300GB of data and 16GB of RAM
● “...MySQL stores an in-memory binary tree of the keys. Using this tree, MySQL can calculate the count of matching rows with reasonable speed. But speed declines logarithmically as the number of terms increases.”

● The platters revolve at 15,000 RPM or so, which works out to 250 revolutions per second. Average latency is listed as 2.0ms

● As the speed of an HDD increases, the power it takes to run it increases disproportionately

http://serverfault.com/questions/190451/what-is-the-throughput-of-15k-rpm-sas-drive
http://thessdguy.com/why-dont-hdds-spin-faster-than-15k-rpm/

http://dev.mysql.com/doc/internals/en/full-text-search.html

Page 10: Big data nyu

“Big Data” is about giving up things

● In theoretical computer science, the CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

– Consistency (all nodes see the same data at the same time)

– Availability (a guarantee that every request receives a response about whether it was successful or failed)

– Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

http://en.wikipedia.org/wiki/CAP_theorem

http://www.youtube.com/watch?v=I4yGBfcODmU

Page 11: Big data nyu

Multi-Master solution

● Write the data to N MySQL servers and round-robin reads between them (a minimal sketch follows this list)

– Good: More machines to serve reads

– Bad: Requires Nx hardware

– Hard: Keeping machines loaded with the same data, especially auto-generated IDs

– Hard: What about when the data does not even fit on a single machine?
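
A minimal sketch of what round-robin reads across N replicas can look like on the application side (the class name, JDBC details, and credential handling are illustrative, not from the talk):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hands out read connections across a fixed set of MySQL replicas in round-robin order.
public class ReadBalancer {
    private final List<String> jdbcUrls;              // one JDBC URL per replica
    private final AtomicInteger next = new AtomicInteger();

    public ReadBalancer(List<String> jdbcUrls) {
        this.jdbcUrls = jdbcUrls;
    }

    public Connection nextReadConnection(String user, String password) throws SQLException {
        int i = Math.floorMod(next.getAndIncrement(), jdbcUrls.size());
        return DriverManager.getConnection(jdbcUrls.get(i), user, password);
    }
}

Writes still have to land on every master, which is where the Nx hardware and the consistency headaches on this slide come from.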

Page 12: Big data nyu
Page 13: Big data nyu

Sharding

● Rather than replicate all data to all machines
● Replicate data to selected machines (a key-hashing sketch follows this list)

– Good: localized data

– Good: better caching

– Hard: Joins across shards

– Hard: Management

– Hard: Failure

● Parallel RDBMS = $$$
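
A minimal sketch of routing by key, assuming simple hash-based sharding (the class and the per-shard JDBC URLs are illustrative):

import java.util.List;

// Maps each key to one shard by hashing, so a given key's rows always live on the same machine.
public class ShardRouter {
    private final List<String> shardJdbcUrls;

    public ShardRouter(List<String> shardJdbcUrls) {
        this.shardJdbcUrls = shardJdbcUrls;
    }

    public String shardFor(String key) {
        int index = Math.floorMod(key.hashCode(), shardJdbcUrls.size());
        return shardJdbcUrls.get(index);
    }
}

Queries that touch one key stay local to one shard; joins across keys that hash to different shards are exactly the hard part called out above.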

Page 14: Big data nyu

Life lesson

“applications that are traditionally used to”

● How did we solve our problem?
– We switched to Lucene (a toy sketch follows this list)
● A tool designed for full text search
● Eventually sharded Lucene

● When you hold a hammer:
– Not everything is a nail

● Understand what you really need
● Understand reasonable and feasible
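
For flavor, a toy sketch of indexing and searching with Lucene; this follows the modern Lucene API rather than the 2008-era version the team actually ran, and the path, field names, and sample text are illustrative:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class NewsSearch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/news-index"));
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one news item as a full-text field.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body", "Fed cuts rates amid market turmoil", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search the index with a parsed full-text query.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new QueryParser("body", analyzer).parse("rates"), 10);
            System.out.println("matches: " + hits.totalHits);
        }
    }
}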

Page 15: Big data nyu

Big Data Challenge #2

● Large, high-volume web site
● Process its logs and produce reports
● Big Data challenges
– Storage: Store GBs of data a day for years
– Analysis, visualization: support the reports of the existing system
● Qualifiers
– Reasonable: daily reports produced in less than one day
– Honestly needs to be faster (reruns, etc.)

Page 16: Big data nyu

Enter hadoop

● Hadoop (0.17.X) was fairly new at the time
● Use cases of map reduce were emerging
– Hive had just been open sourced by Facebook

● Many database vendors were calling map/reduce “a step backwards”
– They had solved these problems “in the 80s”

Page 17: Big data nyu

Hadoop file system HDFS

● Distributed redundant storage
– We were no-SPOF (no single point of failure) across the board

● Commodity hardware vs buying a big SAN/NAS device

● We already had processes that scp'ed data to servers; they were easily adapted to place files into HDFS (a sketch follows this list)

● HDFS made it easy to grow huge
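
A small sketch of what that copy step looks like against the HDFS FileSystem API (the paths are illustrative; the real pipeline organized logs by date):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Replaces the old scp step: push a local log file into HDFS.
public class LogUploader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/var/log/httpd/access_log"),
                             new Path("/data/weblogs/2009-01-10/access_log"));
        fs.close();
    }
}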

Page 18: Big data nyu

Map Reduce

● As a proof of concept I wrote a group/count application that would group and count on a column in our logs (the job wiring is sketched after this list)

● Was able to show linear speed up with increased nodes
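
The slides do not show the job wiring. A sketch of it follows, using the newer org.apache.hadoop.mapreduce API rather than the old mapred API that Hadoop 0.17 actually shipped with; the class names are hypothetical but match the mapper and reducer sketched on later slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Wires the group/count mapper and reducer into a runnable job.
public class GroupCountJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "group-count");
        job.setJarByClass(GroupCountJob.class);
        job.setMapperClass(UrlCountMapper.class);
        job.setCombinerClass(UrlCountReducer.class);   // partial sums run map-side first
        job.setReducerClass(UrlCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}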

Page 19: Big data nyu

Winning (why hadoop kicked arse)

● Data capture, curation
– Bulk loading data into an RDBMS (indexes, overhead)
– Bulk loading into Hadoop is a network copy

● Data analysis
– The RDBMS would not parallelize queries (even across partitions)
– Some queries could cause locking and performance degradation

http://hardcourtlessons.blogspot.com/2010/05/definition-of-winning.html

Page 20: Big data nyu

Enter hive

● Capture – NO
● Curation – YES
● Storage – YES
● Search – YES
● Sharing – YES
● Transfer – NO
● Analysis – YES
● Visualization – NO

Page 21: Big data nyu

Logging from apache to hive

Page 22: Big data nyu

Sample program group and count

Source data looks like

jan 10 2009:.........:200:/index.htm
jan 10 2009:.........:200:/index.htm
jan 10 2009:.........:200:/igloo.htm
jan 10 2009:.........:200:/ed.htm

Page 23: Big data nyu

In case you're the math type

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

Map(k1, v1) -> list(k2, v2)
Reduce(k2, list(v2)) -> list(v3)

Page 24: Big data nyu

A mapper
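
A minimal sketch of what a group/count mapper over the colon-delimited log lines shown earlier could look like (the class name and field handling are illustrative, not the code from the slide):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (url, 1) for every log line; the URL is the last colon-delimited field.
public class UrlCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text url = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(":");
        if (fields.length >= 4) {                     // date : ... : status : url
            url.set(fields[fields.length - 1]);
            context.write(url, ONE);
        }
    }
}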

Page 25: Big data nyu

A reducer
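
And a matching reducer sketch that sums the counts per URL (again illustrative, not the slide's code):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the 1s emitted by the mapper to produce a count per URL.
public class UrlCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text url, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable c : counts) {
            sum += c.get();
        }
        context.write(url, new LongWritable(sum));
    }
}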

Page 26: Big data nyu

Hive style

hive> CREATE TABLE web_data (
        sdate STRING, stime STRING, envvar STRING, remotelogname STRING,
        servername STRING, localip STRING, literaldash STRING, method STRING,
        url STRING, querystring STRING, status STRING, litteralzero STRING,
        bytessent INT, header STRING, timetoserver INT, useragent STRING,
        cookie STRING, referer STRING);

SELECT url, count(1) FROM web_data GROUP BY url;

Page 27: Big data nyu

Life lessons volume 2

● Feasible and reasonable were completely different than in case #1
● Query time went from seconds to hours
● Size went from GB to TB
● Feasible went from 4 nodes to 15

Page 28: Big data nyu

Big Data Challenge #3 (work at m6d)

● Large, high-volume ad serving site
● Process its logs and produce reports
● Support data science and biz-dev users
● Big Data challenges
– Storage: Store and process terabytes of data
● Complex data types, encoded data
– Analysis, visualization: support the reports of the existing system

● Qualifiers
– Reasonable: ad hoc, daily, hourly, weekly, monthly reports

Page 29: Big data nyu

Data data everywhere

● We have to use cookies in many places
● Cookies have limited size
● Cookies have complex values encoded

Page 30: Big data nyu

Some encoding tricks we might do

The raw cookie layout:

– LastSeen: long (64 bits)
– Segment: int (32 bits)
– Literal ','
– Segment: int (32 bits)
– Zipcode (32 bits)

● Choose a relevant epoch and use a byte
● Use a byte for # of segments
● Use a 4-byte radix-encoded number
● ... and so on (a packing sketch follows)
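
A rough sketch of that kind of packing, assuming an illustrative layout of a 1-byte day offset from a chosen epoch, a 1-byte segment count, and 4-byte segment ids (the real format differed):

import java.nio.ByteBuffer;

// Packs cookie fields into a compact binary form; the layout here is made up for illustration.
public class CookiePacker {
    // A "relevant epoch" close to the data keeps the offset tiny (here 2009-01-01 UTC).
    private static final long EPOCH_MILLIS = 1230768000000L;
    private static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    public static byte[] pack(long lastSeenMillis, int[] segments) {
        ByteBuffer buf = ByteBuffer.allocate(2 + 4 * segments.length);
        buf.put((byte) ((lastSeenMillis - EPOCH_MILLIS) / MILLIS_PER_DAY)); // 1 byte instead of 8
        buf.put((byte) segments.length);                                    // 1-byte segment count
        for (int segment : segments) {
            buf.putInt(segment);                                            // 4 bytes each
        }
        return buf.array();
    }
}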

Page 31: Big data nyu

Getting at embedded data

● Write N UDFs for each object, like:
– getLastSeenForCookie(String)

– getZipcodeForCookie(String)

– ...

● But this would have made a huge toolkit
● Traditionally you do not want to break first normal form

Page 32: Big data nyu

Struct solution

● Hive has a struct, like a C struct
● A struct is a list of name/value pairs
● Structs can contain other structs
● This gives us the serious ability to do object mapping
● UDFs can return struct types (a sketch of such a UDF follows)
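
The next slide registers a UDF called com.md6.ParseCookieIntoStruct; its implementation is not shown. A minimal sketch of a struct-returning Hive GenericUDF, assuming a toy "lastSeen,zipcode" text encoding instead of the real packed format:

import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Decodes a cookie string into a struct<lastSeen:bigint, zipcode:string>.
public class ParseCookieIntoStruct extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentException("parseCookie takes exactly one string argument");
        }
        List<String> fieldNames = Arrays.asList("lastSeen", "zipcode");
        List<ObjectInspector> fieldOIs = Arrays.asList(
            (ObjectInspector) PrimitiveObjectInspectorFactory.javaLongObjectInspector,
            (ObjectInspector) PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public Object evaluate(DeferredObject[] args) throws HiveException {
        Object raw = args[0].get();
        if (raw == null) {
            return null;
        }
        // Toy encoding: "<lastSeen>,<zipcode>"; the talk's cookies were packed binary.
        String[] parts = raw.toString().split(",");
        return Arrays.asList((Object) Long.parseLong(parts[0]), parts[1]);
    }

    @Override
    public String getDisplayString(String[] children) {
        return "parseCookie(" + children[0] + ")";
    }
}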

Page 33: Big data nyu

Using a UDF

● add jar myjar.jar;
● create temporary function parseCookie as 'com.md6.ParseCookieIntoStruct';
● select parseCookie(encodedColumn).lastSeen from mydata;

Page 34: Big data nyu

LATERAL VIEW + EXPLODE

SELECT

client_id, entry.spendcreativeid

FROM datatable

LATERAL VIEW explode (AdHistoryAsStruct(ad_history).adEntrylist) entryList as entry

where hit_date=20110321 AND mid=001406;

3214498023360851706 215286

3214498023360851706 195785

3214498023360851706 128640

Page 35: Big data nyu

All that data might boil down to...

Page 36: Big data nyu

Life lessons volume #3

● Big data is not only batch or real-time
● Big data is feedback loops

– Machine learning

– Ad hoc performance checks

● Generated SQL tables periodically synced to web server

● Data shared between sections of an organization to make business decisions