Big Data Use Cases in Europe - BI Consulting, biconsulting.hu/letoltes/2017budapestdata/balassi_marton...

© Cloudera, Inc. All rights reserved.

Marton Balassi | Solutions Architect | Flink PMC

@MartonBalassi | [email protected]

Big Data Use Cases in Europe

Experiences from the field


Introduction

• As a Solutions Architect I have worked with 20+ customers in Europe during the last year

• Focused on architecture, but also involved in implementation

• My favorite topics are stream processing and data science

• Let me share some uplifting and some challenging lessons learned from my colleagues' experience and my own

• Solutions from Telco, Finance, Retail, Gaming, Data Science

• Disclaimer: My view is my own, subjective and inherently partial.


Let us do our first Hadoop PoC

What is the most common first Hadoop use case?


Data warehouse offloading

• Reproduce an RDBMS-based report

• Easily comparable results

• Ingestion (Sqoop, Flume, Gobblin)

• Storage (HDFS, Kudu, HBase)

• Interactive Query (Impala, Spark SQL, Hive LLAP, Presto)

• User interface (Hue, Zeppelin)
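
The "easily comparable results" point can be sketched as a small check that the offloaded report returns the same rows as the RDBMS original. This is a minimal illustration, not a tool from the talk: the function names, the sample rows, and the rounding tolerance are all assumptions, and in practice the two row lists would come from the respective query engines.

```python
from collections import Counter


def normalize(row, ndigits=6):
    """Round floats so tiny engine-specific differences do not count as mismatches."""
    return tuple(round(v, ndigits) if isinstance(v, float) else v for v in row)


def compare_reports(rdbms_rows, offloaded_rows, ndigits=6):
    """Return rows present in only one of the two result sets, ignoring row order."""
    left = Counter(normalize(r, ndigits) for r in rdbms_rows)
    right = Counter(normalize(r, ndigits) for r in offloaded_rows)
    return {
        "only_in_rdbms": sorted((left - right).elements()),
        "only_in_offloaded": sorted((right - left).elements()),
    }


# Hypothetical report rows: (country, year, revenue)
rdbms = [("HU", 2017, 1250.50), ("UK", 2017, 980.00)]
offloaded = [("UK", 2017, 980.0000000001), ("HU", 2017, 1250.50)]

diff = compare_reports(rdbms, offloaded)
print(diff)
```

Comparing as multisets keeps duplicate rows significant while ignoring ordering, which usually differs between engines unless the report sorts explicitly.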


Let us see some more interesting use cases


Syslog ingest @ Vodafone UK

• SIEM/Cybersecurity depends on the input data quality and quantity

• Facilitates fault monitoring, threat intelligence, incident response, and litigation

• Data is collected on national level from TCP, UDP syslog

Tristan Stevens, https://blog.cloudera.com/blog/2016/03/building-benchmarking-and-tuning-syslog-ingest-architecture/
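
Since data quality matters for everything downstream, a first validation step on a syslog line can be sketched as below. It decodes the leading `<PRI>` value of a BSD-style syslog message, where facility is the priority divided by 8 and severity is the remainder. This is an illustrative sketch only, not the Vodafone implementation; the function name and the choice to return `None` for malformed lines are assumptions.

```python
import re

# A syslog message starts with a priority value in angle brackets, e.g. "<34>".
PRI_PATTERN = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)


def parse_priority(line):
    """Split a syslog line into facility, severity and the remaining text.

    Returns None when the line has no valid priority header, so malformed
    input can be counted or routed elsewhere instead of polluting the data.
    """
    match = PRI_PATTERN.match(line)
    if match is None:
        return None
    pri = int(match.group(1))
    if pri > 191:  # 24 facilities * 8 severities gives a maximum of 191
        return None
    return {"facility": pri // 8, "severity": pri % 8, "rest": match.group(2)}


parsed = parse_priority("<34>Oct 11 22:14:15 mymachine su: 'su root' failed for lonvick")
print(parsed["facility"], parsed["severity"])  # 4 2
```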


Syslog ingest @ Vodafone UK

• Ingestion with Flume, Kafka

• Interactive queries with Impala

• Free-text search with Solr

• Machine Learning with Spark MLLib

Tristan Stevens, https://blog.cloudera.com/blog/2016/03/building-benchmarking-and-tuning-syslog-ingest-architecture/


Augmenting the log analytics pipeline

• Error tracking (Solr/Hue)

• Custom monitoring (OpenTSDB/Grafana)

Michael Sun and Jeff Shmain, https://blog.cloudera.com/blog/2017/03/how-to-log-analytics-with-solr-spark-opentsdb-and-grafana/


Search is not solely for text

• Search works on the distance between features

• The canonical example is searching for words in documents

• Searching dresses by color or shape is also possible (given we can describe a shape)

• Implementation relies on Solr

Base implementation by Mathias Lux, https://github.com/dermotte/liresolr. Use case by Nihed Mbarek.
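
The idea of searching by feature distance can be sketched with plain Python. This is not the Solr or LIRE API: the catalog, the RGB triples used as color features, and the function names are made up for illustration, and a brute-force scan stands in for the index a search engine would use.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(query, catalog, k=2):
    """Return the names of the k items whose features are closest to the query."""
    ranked = sorted(catalog.items(), key=lambda item: euclidean(query, item[1]))
    return [name for name, _ in ranked[:k]]


# Hypothetical catalog: each dress is described by its dominant color as (R, G, B).
catalog = {
    "red dress": (200, 30, 40),
    "crimson dress": (220, 20, 60),
    "navy dress": (10, 20, 120),
    "green dress": (30, 160, 60),
}

results = search((205, 28, 45), catalog, k=2)
print(results)  # ['red dress', 'crimson dress']
```

Searching by shape works the same way once a shape is described as a numeric vector; only the feature extraction changes.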


Near real-time transactional analytics system @ Santander

• Bank card transactions data

• “Spendlytics” app

• Stored in HBase to serve the frontend

• Ingested through Flume/Kafka

• Enriched from local RocksDB instances

James Kinley, Ian Buss, and Rob Siwicki, http://blog.cloudera.com/blog/2015/08/inside-santanders-near-real-time-data-ingest-architecture/
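
The enrichment step can be sketched as a lookup against a local key-value store that sits next to the stream processor, so no remote call is needed per event. In this sketch a plain dict stands in for the local RocksDB instance, and every field name (`merchant_id`, `merchant_name`, `category`) is hypothetical rather than taken from the Santander system.

```python
def enrich(transaction, merchant_lookup):
    """Return a copy of the transaction with merchant details attached.

    merchant_lookup is any mapping with a .get() method; a dict is used here
    as a stand-in for a local key-value store.
    """
    merchant = merchant_lookup.get(transaction["merchant_id"])
    enriched = dict(transaction)  # leave the incoming event untouched
    enriched["merchant_name"] = merchant["name"] if merchant else None
    enriched["category"] = merchant["category"] if merchant else "unknown"
    return enriched


# Hypothetical reference data and incoming card transaction.
merchants = {"m-42": {"name": "Corner Grocery", "category": "groceries"}}
event = {"card": "****1234", "merchant_id": "m-42", "amount": 12.90}

enriched = enrich(event, merchants)
print(enriched)
```

Unknown merchants fall back to a default category instead of failing, so a gap in the reference data does not stall the ingest path.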


Scalable Real-Time Analytics Platform @ King.com

• Low latency Gaming analytics

• Analysts write Groovy scripts

• Deployed in Apache Flink

• 30 billion events/day

• RocksDB state in TB scale

• State is queryable from the outside

Gyula Fora, Mattias Andersson, https://data-artisans.com/blog/rbea-scalable-real-time-analytics-at-king


A new breed of Data Science libraries

• Hail is a Genomics library

• Exposes a Python API, runs on Spark

• Genome sequencing is now feasible; today we are facing thousands of sequences

• Easy access to distributed computing is key

Tom White, Jonathan Keebler, https://blog.cloudera.com/blog/2017/05/hail-scalable-genomics-analysis-with-spark/


Data Science environments

• Notebook environments (Jupyter, Zeppelin)

• Great for storytelling

• Pain points:

  • Collaboration

  • Multi-tenancy

  • Security

• New solutions are emerging…

Tristan Zajonc, https://blog.cloudera.com/blog/2017/05/getting-started-with-cloudera-data-science-workbench/


We have some gotchas too…


Be mindful of…

• Educating your team

• Security

  • Authentication

  • Authorization

  • Encryption

  • Auditing, lineage

• Workflow management


Thank you

@[email protected]
