Case Studies
on Big-Data Processing and Data Streaming
By: Amir Sedighi
LinkedIn: http://linkedin.com/in/amirsedighi
Twitter: @amirsedighi
JUG - A.Sedighi - 2015 2 / 48
Background
● BS and MS degrees in Software Engineering
● Senior Software Engineer
– +20 Years of Programming Experience
  ● Cross-platform Software Development
– +4 Years of Big-Data Processing and Machine-Learning Experience
  ● Log Management and Forensic
  ● Big-Data Visualization
  ● Data Warehouse using Big-Data Technologies
  ● Recommender Systems
  ● Analytical Real-Time Search Engines
  ● Integrating Fedora Digital Library with HDFS
  ● Next Generation Event Processing
● Online Resume
– http://linkedin.com/in/amirsedighi
Outline
● An Introduction to Big-Data Processing
● Big-Data Processing and Data Streaming
– Data Processing
1. +TB Scale Data Warehouse
2. Analytical Real-Time Search Solution and BI
3. Scalable Recommender System
4. Integrating Fedora Digital Library with HDFS
– Stream and Event Processing
1. Super Fast Scalable Log Management, Forensic and BI
2. Super Fast Scalable Fraud Detection
What Is Big Data?
● Every 2 Days, Humans Create As Much Information As We Did Up To 2003 - Eric Schmidt
Big-Data Characteristics
● Volume
● Variety
● Velocity
You're a Part of It Every Day
● We have the ability to store anything
● Companies and people are generating data like never before in history
– Social Networks
– Online Web Portals
– Log Writers - Our Digital Footprint!
You're a Part of It Every Day
● Big Data is whatever people do in the digital world: the footprint (logs) of what people, companies, devices and services do, as well as traditional tabular data stores.
As a Manager, You're Still a Part of It
● “Over half of the business leaders today realize they don't have access to the insights they need to do their job.” - IBM
Vertical or Horizontal?
Scale Up or Scale Out
Linear Scalability
Big-Data Processing Solutions
Q: How to scale linearly on commodity machines?
A: MapReduce
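To make the MapReduce answer concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps using word count. It is not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are assumptions made for this illustration. The point is that every map call depends on one input line only, and every reduce call on one key only, so both steps could be spread over many commodity machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each map_phase call is independent, so a cluster could run them in parallel.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data streams"])
```

In a real cluster the shuffle step moves data between machines; here it is only a dictionary.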
Q: How to store big data on commodity machines?
A: Distributed File System
Replication → Fault Tolerant
Replication → Data Locality → Utilization
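The two arrows above can be sketched in a few lines. This is a simplified illustration, not how HDFS actually places blocks (real placement is rack-aware). The round-robin policy and the names `place_blocks`, `readable_blocks` and `pick_local_node` are assumptions made for this example.

```python
def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes), otherwise replicas would repeat.
    """
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed):
    """Fault tolerance: blocks that still have a replica on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


def pick_local_node(placement, block, busy):
    """Data locality: prefer an idle node that already stores the block."""
    for node in placement[block]:
        if node not in busy:
            return node
    return None  # no local replica is free; a real scheduler would read remotely


nodes = ["n1", "n2", "n3", "n4"]
placement = place_blocks(["b0", "b1", "b2"], nodes, replication=3)
survivors = readable_blocks(placement, failed={"n1", "n2"})
worker = pick_local_node(placement, "b0", busy={"n1"})
```

With three replicas, losing two of four nodes still leaves every block readable, and a task for `b0` can run on another node that holds the block instead of waiting for the busy one.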
Big-Data Processing, Most Popular Technologies
● Apache Hadoop Ecosystem
● NoSQL Databases
– HBase
– Cassandra
– MongoDB
– Neo4j
● Elasticsearch
– Lucene
– Solr
● Java
1. +TB Scale Data Warehouse
DW Solution
● SQL
● ETL
– RDBMS
– NoSQL
– File System
● REST API
REST Admin Panel
Features
● Extendable Capacity for Data Warehousing
● Making Very Big Integrated Databases Based on Different Technologies/Schemas
– DB2, Oracle, MS-SQL …
– Different Schemas Such as HRMS, Banking, Sales...
– Making Small Dense Integrated RDBMSs
● SQL Language Interface
● Linear Scalability
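As a rough illustration of integrating sources with different schemas behind one SQL interface, the sketch below loads two small, made-up data sets into a single store and queries them together. Python's built-in `sqlite3` stands in for the warehouse here purely for demonstration; the deck's actual stack is Sqoop with Hive, Drill or Impala. All table names, column names and rows are hypothetical.

```python
import sqlite3

# Two hypothetical sources with different shapes: dict records and plain tuples.
hr_rows = [{"emp_id": 1, "full_name": "Sara", "branch": "north"}]
sales_rows = [("north", 1200.0), ("south", 300.0), ("north", 800.0)]


def load_warehouse(hr_rows, sales_rows):
    """Extract rows from both sources, map them to one schema, and load them."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE employees (id INTEGER, name TEXT, branch TEXT)")
    db.execute("CREATE TABLE sales (branch TEXT, amount REAL)")
    db.executemany(
        "INSERT INTO employees VALUES (?, ?, ?)",
        [(r["emp_id"], r["full_name"], r["branch"]) for r in hr_rows],
    )
    db.executemany("INSERT INTO sales VALUES (?, ?)", sales_rows)
    return db


db = load_warehouse(hr_rows, sales_rows)
# One SQL query now spans data that originally lived in separate systems.
report = db.execute(
    "SELECT e.name, SUM(s.amount) FROM employees e "
    "JOIN sales s ON s.branch = e.branch "
    "GROUP BY e.name ORDER BY e.name"
).fetchall()
```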
Main Technologies and Frameworks
● Apache Hadoop
– Sqoop
– YARN/HDFS
– Hive or Drill or Impala
● Microservices Architecture
– Java 1.7
– Spring Boot
2. Analytical Real-Time Scalable Search Solution and BI
+TB Scale RT Searching
● Indexing Incoming Data on-the-fly
● Highly Scalable and Reliable
● Simple or Complex Queries
● REST API
● Schema Agnostic
● Customizable GUI and BI
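The idea behind on-the-fly, schema-agnostic indexing can be sketched with a tiny in-memory inverted index. This is not the Elasticsearch or Lucene API; the `TinyIndex` class and its methods are assumptions made for this illustration. Documents of any shape are tokenized field by field and become searchable as soon as they are indexed.

```python
from collections import defaultdict


class TinyIndex:
    """A toy inverted index: term -> ids of the documents containing it."""

    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)

    def index(self, doc_id, doc):
        """Index a document of any shape; every field value is tokenized."""
        self.docs[doc_id] = doc
        for value in doc.values():
            for term in str(value).lower().split():
                self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents that contain all query terms (AND)."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


idx = TinyIndex()
# The two documents share no schema, yet both are indexed the same way.
idx.index(1, {"user": "alice", "action": "login failed"})
idx.index(2, {"host": "web01", "message": "login ok", "latency_ms": 12})
```

A query such as `idx.search("login failed")` intersects the posting sets of both terms, which is the same basic operation a real search engine performs at much larger scale.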
Business Intelligence
Rich GUI
Main Technologies and Frameworks
● Elasticsearch
– Apache Lucene
– REST
● Kibana
3. Scalable Recommender System
Recommender System
● Value-added Service (Loyalty Services)
● Machine-Learning
– Clustering Across Thousands of Nodes
  ● Apache Mahout
● Super Fast
How Does It Work?
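The original slide here was a diagram that did not survive extraction. As a stand-in, the sketch below shows one common recommendation approach, item co-occurrence, in plain Python. It is not necessarily the method used in the project and it is not Mahout's API; the function names and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(histories):
    """Count how often each pair of items appears in the same user's history."""
    co = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        for a, b in combinations(sorted(set(items)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def recommend(user, histories, co, top_n=3):
    """Score unseen items by how often they co-occur with the user's items."""
    seen = set(histories[user])
    scores = defaultdict(int)
    for item in seen:
        for other, count in co[item].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


histories = {
    "u1": ["book", "lamp"],
    "u2": ["book", "lamp", "desk"],
    "u3": ["book", "desk"],
    "u4": ["book"],
}
co = build_cooccurrence(histories)
suggestions = recommend("u1", histories, co)
```

The co-occurrence counts can be precomputed in batch and kept in a fast key-value store, so that serving a recommendation is only a lookup and a sort.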
Technologies and Frameworks
● Microservices Architecture
● Java 1.6
● Apache Mahout
● Redis
4. Fedora Digital Library and HDFS Integration
Migrating from Expensive Servers to Commodity Machines
● Using HDFS as the Fedora Digital Library Storage
– Research and Development
– Hadoop 1.2, Later Hadoop YARN 2.2
– Integrating with Solr over HDFS
● Java 1.7
● Fedora
– Islandora
– GSearch
Data Streaming
Big-Data Streaming, Most Popular Technologies
● Piping and Messaging
– Kafka, Flume, Fluentd and ZeroMQ
● Stream Processing
– Storm, Samza and Spark
● Machine Learning
– MLlib and Mahout
● Persisting
– NoSQL DBs
– HDFS
1. Log Management, Forensic and BI
● Every Digital System Writes to Log Files
– Log Files Are Streams of Data
– Log Files Are Messy
– Log Files Arrive Very Fast, in an Unpredictable Manner
– Log Files Cover Everything within Your Business
● Log Files Are Full of Insight for Those
– Who Can Hold Them For a Reasonable Period of Time
– Who Can Search Them Rapidly
– Who Can Visualize Them Easily (BI)
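Treating messy log files as a stream of events can be sketched as follows. This is a plain-Python illustration, not Logstash or Flume configuration. The log line format, the `LINE` pattern and the function names are assumptions chosen for the example. Lines that do not match are kept as raw events rather than dropped, since messy input is the normal case.

```python
import re
from datetime import datetime

# Assumed line format: "YYYY-MM-DD HH:MM:SS LEVEL message"
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>[A-Z]+) (?P<msg>.*)$"
)


def parse_line(raw):
    """Turn one raw line into a structured event, or keep it as unparsed."""
    text = raw.strip()
    match = LINE.match(text)
    if match is None:
        return {"raw": text, "parsed": False}
    return {
        "timestamp": datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S"),
        "level": match["level"],
        "message": match["msg"],
        "parsed": True,
    }


def parse_stream(lines):
    """Lazily convert a possibly unbounded stream of lines into events."""
    for raw in lines:
        if raw.strip():  # skip blank lines
            yield parse_line(raw)


events = list(parse_stream([
    "2015-03-01 10:00:00 ERROR disk full on /var",
    "",
    "garbage ### not a log line",
]))
```

Because `parse_stream` is a generator, it can consume a file, a socket or a message queue without loading everything into memory first.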
Network Topology
[Diagram: load balancer (LB), master nodes and data nodes]
Main Technologies and Frameworks
● Logstash
– Flume
● Elasticsearch
● Kibana
Snapshot
2. Fraud Detection
Inputs & Outputs
● Inputs: One or multiple sources generate data continuously, in real time
– Sensor Networks
– Transaction Logs
– Text Streams such as News
– Network Traffic Analysis
● Outputs: Up-to-date Answers generated continuously or periodically
Data Processing
Transient Query
– Issued once, then forgotten
Persistent Data
– Stored until deleted by user or apps
Stream Processing
Transient Data
– Deleted as Window Slides Forward
Persistent Queries
– Generate up-to-date answers as time goes on
Window Types
– Time Based
– Count Based
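The two window types can be sketched with a double-ended queue. This is a minimal illustration, not Spark Streaming's windowing API; the class names `CountWindow` and `TimeWindow` are assumptions, and the time-based version assumes events arrive in non-decreasing timestamp order.

```python
from collections import deque


class CountWindow:
    """Count based: keeps only the last `size` events."""

    def __init__(self, size):
        self.items = deque(maxlen=size)  # old events fall out automatically

    def add(self, value):
        self.items.append(value)
        return list(self.items)


class TimeWindow:
    """Time based: keeps only events from the last `span` seconds."""

    def __init__(self, span):
        self.span = span
        self.items = deque()

    def add(self, timestamp, value):
        self.items.append((timestamp, value))
        # Transient data: drop events as the window slides forward.
        while self.items and self.items[0][0] <= timestamp - self.span:
            self.items.popleft()
        return [v for _, v in self.items]


cw = CountWindow(2)
count_views = [cw.add(v) for v in ["a", "b", "c"]]

tw = TimeWindow(10)
time_views = [tw.add(t, v) for t, v in [(0, "a"), (5, "b"), (12, "c")]]
```

A persistent query is then just a function that is re-evaluated over the current window contents each time a new event arrives.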
Features
● Scalability
● Real-Time (at Most 1 Second of Delay)
● Super Fast Decision Making
● Implementing Complex Fraud Scenarios As Easily as Defining Queries
● Uniform API for Processing Old or Early Events
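To show what "a fraud scenario defined like a query" could look like, here is a small sketch of a rule evaluated continuously per card over a sliding time window. It is an illustration only, not the project's implementation and not a Spark API. The `FraudRule` class, its thresholds and the sample transactions are hypothetical.

```python
from collections import defaultdict, deque


class FraudRule:
    """Flag a card when, within `window` seconds, it makes at least
    `max_count` transactions or spends more than `max_total`."""

    def __init__(self, window, max_count, max_total):
        self.window = window
        self.max_count = max_count
        self.max_total = max_total
        self.recent = defaultdict(deque)  # card -> (timestamp, amount) pairs

    def on_event(self, card, timestamp, amount):
        events = self.recent[card]
        events.append((timestamp, amount))
        # Forget transactions that slid out of the time window.
        while events and events[0][0] <= timestamp - self.window:
            events.popleft()
        total = sum(a for _, a in events)
        return len(events) >= self.max_count or total > self.max_total


rule = FraudRule(window=60, max_count=3, max_total=1000)
stream = [
    ("c1", 0, 50),
    ("c1", 10, 20),
    ("c2", 15, 1500),
    ("c1", 20, 30),
    ("c1", 200, 10),
]
alerts = [(card, ts) for card, ts, amount in stream if rule.on_event(card, ts, amount)]
```

The same `on_event` method can be fed from a live stream or from a replay of stored transactions, which is one way to read the "uniform API" point above.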
Main Technologies and Frameworks
● Java 1.7, Scala 2.11
● Apache Flume
● Apache Kafka
● Apache Spark
Where To Start?
● You Need a Large Amount of Data
● You Need to Change Your Mindset
– Rack Space and Number of Servers, IO and Process Limitations
● You need To Understand Fundamentals
– Linux (Bash Script)
– Java Is a Must, Python Works and Scala Is an Advantage
– SQL and ETL
– MapReduce, Resource Management and Serialization Frameworks
– Apache Hadoop Ecosystem and Successors
Thank You! Questions?
http://slideshare.net/amirsedighi