© Cloudera, Inc. All rights reserved.
Marton Balassi | Solutions Architect | Flink PMC
@MartonBalassi | [email protected]
Big Data Use Cases in Europe
Experiences from the field
Introduction
• As a Solutions Architect I have worked with 20+ customers in Europe during the last year
• Focused on architecture, but also involved in implementation
• My favorite topics are stream processing and data science
• Let me share some uplifting and some challenging lessons, drawn from my colleagues' experience and my own
• Solutions from Telco, Finance, Retail, Gaming, Data Science
• Disclaimer: my views are my own, subjective and inherently partial.
Let us do our first Hadoop PoC
What is the most common first Hadoop use case?
Data warehouse offloading
• Reproduce an RDBMS-based report
• Easily comparable results
• Ingestion (Sqoop, Flume, Gobblin)
• Storage (HDFS, Kudu, HBase)
• Interactive Query (Impala, Spark SQL, Hive LLAP, Presto)
• User interface (Hue, Zeppelin)
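Since the point of such a PoC is that results are easy to compare, a small check like the following can help. It is only a sketch: the function names are hypothetical, both result sets are assumed to be already fetched as lists of tuples (for example through a DB-API cursor on each side), and floats are rounded before comparison because aggregates may differ in the last digits between engines.

```python
from collections import Counter


def normalize(row, ndigits=2):
    """Round floats so tiny numeric differences between engines are ignored."""
    return tuple(round(v, ndigits) if isinstance(v, float) else v for v in row)


def diff_reports(rdbms_rows, hadoop_rows, ndigits=2):
    """Compare two report result sets as multisets, ignoring row order."""
    a = Counter(normalize(r, ndigits) for r in rdbms_rows)
    b = Counter(normalize(r, ndigits) for r in hadoop_rows)
    return {
        "only_in_rdbms": sorted((a - b).elements()),
        "only_in_hadoop": sorted((b - a).elements()),
    }


if __name__ == "__main__":
    # Made-up rows standing in for the original report and its offloaded copy.
    rdbms = [("DE", 10.0), ("FR", 5.5)]
    hadoop = [("FR", 5.5), ("DE", 10.001)]
    print(diff_reports(rdbms, hadoop))
```

An empty result on both sides means the offloaded report reproduces the original one within the chosen rounding.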
Let us see some more interesting use cases
Syslog ingest @ Vodafone UK
• SIEM/cybersecurity depends on the quality and quantity of the input data
• Facilitates fault monitoring, threat intelligence, incident response, and litigation
• Data is collected at a national level from TCP and UDP syslog
• Ingestion with Flume, Kafka
• Interactive queries with Impala
• Free-text search with Solr
• Machine Learning with Spark MLlib
Tristan Stevens, https://blog.cloudera.com/blog/2016/03/building-benchmarking-and-tuning-syslog-ingest-architecture/
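To make the input side concrete, here is a minimal sketch of parsing one BSD-style (RFC 3164) syslog line into fields before it is handed to the ingest layer. It is not taken from the referenced architecture; the regular expression and the `parse_syslog` name are assumptions for illustration, and real traffic contains many messages that do not follow this layout.

```python
import re

# Assumed layout: "<PRI>Mmm dd hh:mm:ss host message"
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
    r"(?P<message>.*)$"
)


def parse_syslog(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    m = SYSLOG_RE.match(line)
    if m is None:
        return None
    pri = int(m.group("pri"))
    return {
        # The priority value encodes facility * 8 + severity.
        "facility": pri // 8,
        "severity": pri % 8,
        "timestamp": m.group("timestamp"),
        "host": m.group("host"),
        "message": m.group("message"),
    }


if __name__ == "__main__":
    sample = "<34>Oct 11 22:14:15 mymachine su: 'su root' failed for lonvick on /dev/pts/8"
    print(parse_syslog(sample))
```

Returning None for non-matching lines keeps malformed input visible, so it can be counted or routed elsewhere instead of being dropped silently.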
Augmenting the log analytics pipeline
Michael Sun and Jeff Shmain, https://blog.cloudera.com/blog/2017/03/how-to-log-analytics-with-solr-spark-opentsdb-and-grafana/
• Error tracking (Solr/Hue)
• Custom monitoring (OpenTSDB/Grafana)
Search is not solely for text
• Search works on the distance between features
• The canonical example is searching for words in documents
• Searching dresses by color or shape is also possible (given we can describe a shape)
• Implementation relies on Solr
Base implementation by Mathias Lux (https://github.com/dermotte/liresolr). Use case by Nihed Mbarek.
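The idea of searching by feature distance can be sketched in a few lines. This is a toy illustration, not how the Solr-based implementation works: each item is described by a made-up three-number color feature, and the query returns the items whose features are closest in Euclidean distance. The catalog, the feature values and the function names are all assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, catalog, k=2):
    """Return the names of the k catalog items closest to the query features."""
    ranked = sorted(catalog.items(), key=lambda item: euclidean(query, item[1]))
    return [name for name, _ in ranked[:k]]


# Hypothetical catalog: average (red, green, blue) intensity per dress photo.
CATALOG = {
    "red dress": (0.9, 0.1, 0.1),
    "crimson dress": (0.8, 0.1, 0.2),
    "navy dress": (0.1, 0.1, 0.5),
    "green dress": (0.1, 0.7, 0.2),
}

if __name__ == "__main__":
    # Query with a pure red color feature.
    print(nearest((1.0, 0.0, 0.0), CATALOG, k=2))
```

A brute-force scan like this only works for small catalogs; a search engine is there to avoid comparing the query against every item.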
Near real-time transactional analytics system @ Santander
• Bank card transaction data
• “Spendlytics” app
• Stored in HBase to serve the frontend
• Ingested through Flume/Kafka
• Enriched from local RocksDB instances
James Kinley, Ian Buss, and Rob Siwicki, http://blog.cloudera.com/blog/2015/08/inside-santanders-near-real-time-data-ingest-architecture/
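The enrichment step can be pictured as a lookup against a key-value store that lives next to the process handling the stream. The sketch below is only an illustration of that pattern: a plain Python dict stands in for the local RocksDB instance, and the field names (`merchant_id`, `name`, `category`) are invented for the example rather than taken from the Santander system.

```python
def enrich(transactions, merchant_store):
    """Yield each transaction with merchant details looked up from a local store.

    merchant_store is any mapping with a .get() method; here a dict stands in
    for an embedded key-value store such as RocksDB.
    """
    for tx in transactions:
        merchant = merchant_store.get(tx["merchant_id"], {})
        yield {
            **tx,
            "merchant_name": merchant.get("name"),
            "category": merchant.get("category"),
        }


if __name__ == "__main__":
    # Hypothetical reference data and incoming card transactions.
    store = {"m1": {"name": "Corner Cafe", "category": "restaurants"}}
    stream = [
        {"card": "c42", "merchant_id": "m1", "amount": 3.2},
        {"card": "c42", "merchant_id": "m9", "amount": 12.0},
    ]
    for record in enrich(stream, store):
        print(record)
```

Keeping the reference data local avoids a network round trip per event; unknown merchants are passed through with empty fields rather than dropped.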
Scalable Real-Time Analytics Platform @ King.com
• Low-latency gaming analytics
• Analysts write Groovy scripts
• Deployed on Apache Flink
• 30 billion events/day
• RocksDB state at TB scale
• State is queryable from the outside
Gyula Fora, Mattias Andersson, https://data-artisans.com/blog/rbea-scalable-real-time-analytics-at-king
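The combination of analyst-written scripts, per-key state and outside queries can be sketched as follows. This is a single-process toy in Python, not the Groovy/Flink platform described above: the `KeyedState` class, the `count_levels` script and the event fields are all hypothetical, and an in-memory dict replaces the RocksDB-backed state.

```python
from collections import defaultdict


class KeyedState:
    """Per-key state that scripts update and outside callers can read."""

    def __init__(self):
        self._state = defaultdict(dict)

    def update(self, key, name, fn, default=0):
        """Apply fn to the current value of `name` for `key` and store the result."""
        current = self._state[key].get(name, default)
        self._state[key][name] = fn(current)

    def query(self, key):
        """Return a copy of the state for one key (empty if the key is unknown)."""
        return dict(self._state.get(key, {}))


def count_levels(event, state):
    """Example analyst script: count completed levels per player."""
    if event["type"] == "level_completed":
        state.update(event["player_id"], "levels", lambda n: n + 1)


def run(events, scripts, state):
    """Feed every event to every registered script."""
    for event in events:
        for script in scripts:
            script(event, state)


if __name__ == "__main__":
    state = KeyedState()
    events = [
        {"type": "level_completed", "player_id": "p1"},
        {"type": "purchase", "player_id": "p1"},
        {"type": "level_completed", "player_id": "p1"},
    ]
    run(events, [count_levels], state)
    print(state.query("p1"))
```

In a distributed deployment the state would be partitioned by key across workers; the point here is only that scripts stay small while state handling and querying are provided by the platform.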
A new breed of Data Science libraries
• Hail is a genomics library
• Used from Python, runs on Spark
• Genome sequencing is now feasible; today we are facing thousands of sequences
• Easy access to distributed computing is key
Tom White, Jonathan Keebler, https://blog.cloudera.com/blog/2017/05/hail-scalable-genomics-analysis-with-spark/
Data Science environments
• Notebook environments (Jupyter, Zeppelin)
• Great for storytelling
• Pain points:
  • Collaboration
  • Multi-tenancy
  • Security
• New solutions are emerging…
Tristan Zajonc, https://blog.cloudera.com/blog/2017/05/getting-started-with-cloudera-data-science-workbench/
We have some gotchas too…
Be mindful of…
• Educating your team
• Security
  • Authentication
  • Authorization
  • Encryption
  • Auditing, lineage
• Workflow management
Thank you
@[email protected]