25
Data-Driven Development Era and Its Technologies Developers Summit 2015 Autumn (Oct 14, 2015) Satoshi Tagomori (@tagomoris)

Data-Driven Development Era and Its Technologies

Embed Size (px)

Citation preview

Data-Driven Development Era and Its Technologies

Developers Summit 2015 Autumn (Oct 14, 2015)

Satoshi Tagomori (@tagomoris)

Satoshi "Moris" Tagomori (@tagomoris)

Fluentd, Norikra, Hadoop, ...

Treasure Data, Inc.

HQ

Branch

http://www.treasuredata.com/

Main Topics around "Data"• Data collection

• Storage

• Data processing • Batch distributed processing • Stream processing • Machine Learning • Near real-time query & Data lake

• Visualization

Data Analytics Flow

Collect Store Process Visualize

Data source

Reporting

Monitoring

Where before What

Using Services or Not• Using services fully-managed:

• Google BigQuery & Dataflow • Treasure Data services

• Using services self-managed: • Amazon EMR & Redshift • Google Cloud Dataproc

• Using your own environment & cluster

Using Services or Not• Using services fully-managed:

• Google BigQuery & Dataflow • Treasure Data services

• Using services self-managed: • Amazon EMR & Redshift • Google Cloud Dataproc

• Using your own environment & cluster

a bit more cost extremely less efforts

fully controlled by self extremely more efforts

less cost less efforts

Using Services or Not:

"Use Services!"

To concentrate DATA and Analytics,

NOT tools

Why should we use services?

• About distributed systems: • hard to operate & upgrade • impossible to "small-start" • very hard to hire professional engineer

• Data Driven Development: • collect/store data at first! • consider output data at second! • "before building your own environment"

Really? Are you TD guy?

• ...Really!

• But it requires very long discussions :P

• "スタートアップのデータ処理基盤、作るか、使うか"http://tsuchinoko.dmmlabs.com/?p=1770

How to choose software/services in

Data-Driven Development

"What" decides "How"

• Distributed systems are to solve problems • There're many kind of data • There're many problems

• Systems solve different problems from each other • There are no "Silver bullet"!

What First, How Second

• What do you want to do? • Reporting? Analytics? Recommendation? or ...

• What type of data you wan to process? • Stored large log? Stream sensor data? or ...

• What is you need as result? • CSV? Spreadsheet? Graph? DB Relation? or ...

How? (just for example)

• MapReduce, Tez • Large batch jobs, big JOINs, high stability

• Spark • Small/Middle batch jobs, machine learning

• Impala, Presto, Drill, Redshift, BigQuery • Near-real-time search, small-to-large analytics

• Storm, Spark streaming • Stream data conversion/aggregation

"Processing" is just a part of whole dataflow!

Data Analytics Flow (again)

Collect Store Process Visualize

Data source

Reporting

Monitoring

Data Analytics Flow (again)

Collect Store Process Visualize

Data source

Reporting

Monitoring

Data Collection• Data Driven Development -> collect at first!

• As batch: Data already exists as files • Easily integrated with existing batch systems • Sqoop, Embulk, ...

• As stream: Data just generated now • Easily connected with monitoring systems • Without burst network traffic • Flume, Logstash, Fluentd, ...

Fluentd: Support Service by SRA OSS

with Treasure Data

Released TODAY!

Other Important Topics

• Storage: Performance, Availability, Schema management • Apache Hadoop HDFS, Apache HBase, Amazon S3, Cloudera Kudu, ...

• Visualization: Functionality, Connectivity, Visibility • Tableau, Pentaho, Many other enterprise products, ...

• Distributed Queues: Performance, Stability, Connectivity • Apache Kafka, Amazon Kinesis, ...

Get Familiar with Options NOT to Take Pains about Technology!

Concentrate DATA and Analytics,

NOT tools.

Thanks!