Big Data in the “Real World” Edward Capriolo


Big data talk done for Stern NY


Page 1: Big data nyu

Big Data in the “Real World” Edward Capriolo

Page 2: Big data nyu

What is “big data”?

● Big data is a collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.

● The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

http://en.wikipedia.org/wiki/Big_data

Page 3: Big data nyu

Big Data Challenges

● The challenges include:

– capture
– curation
– storage
– search
– sharing
– transfer
– analysis
– visualization
– large
– complex

Page 4: Big data nyu

What is “big data” exactly?

● What is considered "big data" varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain.

● As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data.

http://en.wikipedia.org/wiki/Big_data

Page 5: Big data nyu

Big Data Qualifiers

● varies
● capabilities
● traditionally
● feasibly
● reasonably
● [somptha]bytes of data

Page 6: Big data nyu

My first “big data” challenge

● Real time news delivery platform
● Ingest news as text and provide full text search
● Qualifiers
– Reasonable: Real time search was < 1 second
– Capabilities: small company, <100 servers
● Big Data challenges
– Storage: roughly 300GB for 60 days of data
– Search: searches of thousands of terms

Page 7: Big data nyu
Page 8: Big data nyu

Traditionally

● Data was placed in MySQL
● MySQL full text search
● Easy to insert
● Easy to search
● Worked great!

– Until it got real world load

Page 9: Big data nyu

Feasibly in hardware (circa 2008)

● 300GB of data and 16GB of RAM
● “...MySQL stores an in-memory binary tree of the keys. Using this tree, MySQL can calculate the count of matching rows with reasonable speed. But speed declines logarithmically as the number of terms increases.”

● The platters revolve at 15,000 RPM or so, which works out to 250 revolutions per second. Average latency is listed as 2.0ms

● As the speed of an HDD increases, the power it takes to run it increases disproportionately

http://serverfault.com/questions/190451/what-is-the-throughput-of-15k-rpm-sas-drive
http://thessdguy.com/why-dont-hdds-spin-faster-than-15k-rpm/

http://dev.mysql.com/doc/internals/en/full-text-search.html

Page 10: Big data nyu

“Big Data” is about giving up things

● In theoretical computer science, the CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

– Consistency (all nodes see the same data at the same time)

– Availability (a guarantee that every request receives a response about whether it was successful or failed)

– Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

http://en.wikipedia.org/wiki/CAP_theorem

http://www.youtube.com/watch?v=I4yGBfcODmU

Page 11: Big data nyu

Multi-Master solution

● Write the data to N MySQL servers and round-robin reads between them (a minimal sketch follows this list)

– Good: More machines to serve reads

– Bad: Requires Nx hardware

– Hard: Keeping machines loaded with the same data, especially auto-generated IDs

– Hard: What about when the data does not even fit on a single machine?
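
A minimal sketch of what round-robin reads across N replicas can look like on the application side (the class name, JDBC details, and credential handling are illustrative, not from the talk):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hands out read connections across a fixed set of MySQL replicas in round-robin order.
public class ReadBalancer {
    private final List<String> jdbcUrls;              // one JDBC URL per replica
    private final AtomicInteger next = new AtomicInteger();

    public ReadBalancer(List<String> jdbcUrls) {
        this.jdbcUrls = jdbcUrls;
    }

    public Connection nextReadConnection(String user, String password) throws SQLException {
        int i = Math.floorMod(next.getAndIncrement(), jdbcUrls.size());
        return DriverManager.getConnection(jdbcUrls.get(i), user, password);
    }
}

Writes still have to land on every master, which is where the Nx hardware and the consistency headaches on this slide come from.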

Page 12: Big data nyu
Page 13: Big data nyu

Sharding

● Rather than replicate all data to all machines
● Replicate data to selected machines (a key-hashing sketch follows this list)

– Good: localized data

– Good: better caching

– Hard: Joins across shards

– Hard: Management

– Hard: Failure

● Parallel RDBMS = $$$
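
A minimal sketch of routing by key, assuming simple hash-based sharding (the class and the per-shard JDBC URLs are illustrative):

import java.util.List;

// Maps each key to one shard by hashing, so a given key's rows always live on the same machine.
public class ShardRouter {
    private final List<String> shardJdbcUrls;

    public ShardRouter(List<String> shardJdbcUrls) {
        this.shardJdbcUrls = shardJdbcUrls;
    }

    public String shardFor(String key) {
        int index = Math.floorMod(key.hashCode(), shardJdbcUrls.size());
        return shardJdbcUrls.get(index);
    }
}

Queries that touch one key stay local to one shard; joins across keys that hash to different shards are exactly the hard part called out above.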

Page 14: Big data nyu

Life lesson

“applications that are traditionally used to”

● How did we solve our problem?
– We switched to Lucene (a toy sketch follows this list)
● A tool designed for full text search
● Eventually sharded Lucene

● When you hold a hammer:
– Not everything is a nail

● Understand what you really need
● Understand reasonable and feasible
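
For flavor, a toy sketch of indexing and searching with Lucene; this follows the modern Lucene API rather than the 2008-era version the team actually ran, and the path, field names, and sample text are illustrative:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class NewsSearch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/news-index"));
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one news item as a full-text field.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body", "Fed cuts rates amid market turmoil", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search the index with a parsed full-text query.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new QueryParser("body", analyzer).parse("rates"), 10);
            System.out.println("matches: " + hits.totalHits);
        }
    }
}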

Page 15: Big data nyu

Big Data Challenge #2

● Large, high-volume web site
● Process its logs and produce reports
● Big Data challenges
– Storage: Store GBs of data a day for years
– Analysis, visualization: support the reports of the existing system
● Qualifiers
– Reasonable: daily reports produced in less than one day
– Honestly needs to be faster (reruns, etc.)

Page 16: Big data nyu

Enter hadoop

● Hadoop (0.17.X) was fairly new at the time
● Use cases of map reduce were emerging
– Hive had just been open sourced by Facebook

● Many database vendors were calling map/reduce “a step backwards”
– They had solved these problems “in the 80s”

Page 17: Big data nyu

Hadoop file system HDFS

● Distributed redundant storage
– We were no-SPOF (no single point of failure) across the board

● Commodity hardware vs buying a big SAN/NAS device

● We already had processes that scp'ed data to servers; they were easily adapted to place files into HDFS (a sketch follows this list)

● HDFS made it easy to grow huge
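
A small sketch of what that copy step looks like against the HDFS FileSystem API (the paths are illustrative; the real pipeline organized logs by date):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Replaces the old scp step: push a local log file into HDFS.
public class LogUploader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/var/log/httpd/access_log"),
                             new Path("/data/weblogs/2009-01-10/access_log"));
        fs.close();
    }
}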

Page 18: Big data nyu

Map Reduce

● As a proof of concept I wrote a group/count application that would group and count on a column in our logs (the job wiring is sketched after this list)

● Was able to show linear speed up with increased nodes
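
The slides do not show the job wiring. A sketch of it follows, using the newer org.apache.hadoop.mapreduce API rather than the old mapred API that Hadoop 0.17 actually shipped with; the class names are hypothetical but match the mapper and reducer sketched on later slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Wires the group/count mapper and reducer into a runnable job.
public class GroupCountJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "group-count");
        job.setJarByClass(GroupCountJob.class);
        job.setMapperClass(UrlCountMapper.class);
        job.setCombinerClass(UrlCountReducer.class);   // partial sums run map-side first
        job.setReducerClass(UrlCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}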

Page 19: Big data nyu

Winning (why hadoop kicked arse)

● Data capture, curation
– Bulk loading data into an RDBMS (indexes, overhead)
– Bulk loading into Hadoop is a network copy

● Data analysis
– The RDBMS would not parallelize queries (even across partitions)
– Some queries could cause locking and performance degradation

http://hardcourtlessons.blogspot.com/2010/05/definition-of-winning.html

Page 20: Big data nyu

Enter hive

● Capture – NO
● Curation – YES
● Storage – YES
● Search – YES
● Sharing – YES
● Transfer – NO
● Analysis – YES
● Visualization – NO

Page 21: Big data nyu

Logging from apache to hive

Page 22: Big data nyu

Sample program group and count

Source data looks like

jan 10 2009:.........:200:/index.htm
jan 10 2009:.........:200:/index.htm
jan 10 2009:.........:200:/igloo.htm
jan 10 2009:.........:200:/ed.htm

Page 23: Big data nyu

In case you're the math type

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

Map(k1, v1) -> list(k2, v2)
Reduce(k2, list(v2)) -> list(v3)

Page 24: Big data nyu

A mapper
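
A minimal sketch of what a group/count mapper over the colon-delimited log lines shown earlier could look like (the class name and field handling are illustrative, not the code from the slide):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (url, 1) for every log line; the URL is the last colon-delimited field.
public class UrlCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text url = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(":");
        if (fields.length >= 4) {                     // date : ... : status : url
            url.set(fields[fields.length - 1]);
            context.write(url, ONE);
        }
    }
}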

Page 25: Big data nyu

A reducer
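
And a matching reducer sketch that sums the counts per URL (again illustrative, not the slide's code):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the 1s emitted by the mapper to produce a count per URL.
public class UrlCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text url, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable c : counts) {
            sum += c.get();
        }
        context.write(url, new LongWritable(sum));
    }
}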

Page 26: Big data nyu

Hive style

hive> CREATE TABLE web_data (
        sdate STRING, stime STRING, envvar STRING, remotelogname STRING,
        servername STRING, localip STRING, literaldash STRING, method STRING,
        url STRING, querystring STRING, status STRING, litteralzero STRING,
        bytessent INT, header STRING, timetoserver INT, useragent STRING,
        cookie STRING, referer STRING);

SELECT url, count(1) FROM web_data GROUP BY url;

Page 27: Big data nyu

Life lessons volume 2

● Feasible and reasonable were completely different than in case #1
● Query time went from seconds to hours
● Size went from GB to TB
● Feasible went from 4 nodes to 15

Page 28: Big data nyu

Big Data Challenge #3 (work at m6d)

● Large, high-volume ad serving site
● Process its logs and produce reports
● Support data science and biz-dev users
● Big Data challenges
– Storage: Store and process terabytes of data
● Complex data types, encoded data
– Analysis, visualization: support the reports of the existing system

● Qualifiers
– Reasonable: ad hoc, daily, hourly, weekly, monthly reports

Page 29: Big data nyu

Data data everywhere

● We have to use cookies in many places
● Cookies have limited size
● Cookies have complex values encoded

Page 30: Big data nyu

Some encoding tricks we might do

The raw cookie layout:

– LastSeen: long (64 bits)
– Segment: int (32 bits)
– Literal ','
– Segment: int (32 bits)
– Zipcode (32 bits)

● Choose a relevant epoch and use a byte
● Use a byte for # of segments
● Use a 4-byte radix-encoded number
● ... and so on (a packing sketch follows)
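
A rough sketch of that kind of packing, assuming an illustrative layout of a 1-byte day offset from a chosen epoch, a 1-byte segment count, and 4-byte segment ids (the real format differed):

import java.nio.ByteBuffer;

// Packs cookie fields into a compact binary form; the layout here is made up for illustration.
public class CookiePacker {
    // A "relevant epoch" close to the data keeps the offset tiny (here 2009-01-01 UTC).
    private static final long EPOCH_MILLIS = 1230768000000L;
    private static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    public static byte[] pack(long lastSeenMillis, int[] segments) {
        ByteBuffer buf = ByteBuffer.allocate(2 + 4 * segments.length);
        buf.put((byte) ((lastSeenMillis - EPOCH_MILLIS) / MILLIS_PER_DAY)); // 1 byte instead of 8
        buf.put((byte) segments.length);                                    // 1-byte segment count
        for (int segment : segments) {
            buf.putInt(segment);                                            // 4 bytes each
        }
        return buf.array();
    }
}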

Page 31: Big data nyu

Getting at embedded data

● Write N UDFs for each object, like:
– getLastSeenForCookie(String)

– getZipcodeForCookie(String)

– ...

● But this would have made a huge toolkit
● Traditionally you do not want to break first normal form

Page 32: Big data nyu

Struct solution

● Hive has a struct, like a C struct
● A struct is a list of name/value pairs
● Structs can contain other structs
● This gives us the serious ability to do object mapping
● UDFs can return struct types (a sketch of such a UDF follows)
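
The next slide registers a UDF called com.md6.ParseCookieIntoStruct; its implementation is not shown. A minimal sketch of a struct-returning Hive GenericUDF, assuming a toy "lastSeen,zipcode" text encoding instead of the real packed format:

import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Decodes a cookie string into a struct<lastSeen:bigint, zipcode:string>.
public class ParseCookieIntoStruct extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentException("parseCookie takes exactly one string argument");
        }
        List<String> fieldNames = Arrays.asList("lastSeen", "zipcode");
        List<ObjectInspector> fieldOIs = Arrays.asList(
            (ObjectInspector) PrimitiveObjectInspectorFactory.javaLongObjectInspector,
            (ObjectInspector) PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public Object evaluate(DeferredObject[] args) throws HiveException {
        Object raw = args[0].get();
        if (raw == null) {
            return null;
        }
        // Toy encoding: "<lastSeen>,<zipcode>"; the talk's cookies were packed binary.
        String[] parts = raw.toString().split(",");
        return Arrays.asList((Object) Long.parseLong(parts[0]), parts[1]);
    }

    @Override
    public String getDisplayString(String[] children) {
        return "parseCookie(" + children[0] + ")";
    }
}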

Page 33: Big data nyu

Using a UDF

● add jar myjar.jar;
● create temporary function parseCookie as 'com.md6.ParseCookieIntoStruct';
● select parseCookie(encodedColumn).lastSeen from mydata;

Page 34: Big data nyu

LATERAL VIEW + EXPLODE

SELECT

client_id, entry.spendcreativeid

FROM datatable

LATERAL VIEW explode (AdHistoryAsStruct(ad_history).adEntrylist) entryList as entry

where hit_date=20110321 AND mid=001406;

3214498023360851706 215286

3214498023360851706 195785

3214498023360851706 128640

Page 35: Big data nyu

All that data might boil down to...

Page 36: Big data nyu

Life lessons volume #3

● Big data is not only batch or real-time
● Big data is feedback loops

– Machine learning

– Ad hoc performance checks

● Generated SQL tables periodically synced to web server

● Data shared between sections of an organization to make business decisions