Case Studies on Big-Data Processing and Streaming - Iranian Java User Group

Case Studies

on Big-Data Processing and Data Streaming

By: Amir Sedighi

LinkedIn: http://linkedin.com/in/amirsedighi

Twitter: @amirsedighi


JUG - A.Sedighi - 2015 2 / 48

Background

● BS and MS degrees in Software Engineering

● Senior Software Engineer

– +20 Years of Programming Experience

   ● Cross-platform Software Development

– +4 Years of Big-Data Processing and Machine-Learning Experience

   ● Log Management and Forensics

   ● Big-Data Visualization

   ● Data Warehouse using Big-Data Technologies

   ● Recommender Systems

   ● Analytical Real-Time Search Engines

   ● Integrating Fedora Digital Library with HDFS

   ● Next Generation Event Processing

● Online Resume

– http://linkedin.com/in/amirsedighi


Outline

● An Introduction to Big-Data Processing

● Big-Data Processing and Data Streaming

– Data Processing

1. +TB Scale Data Warehouse

2. Analytical Real-Time Search Solution and BI

3. Scalable Recommender System

4. Integrating Fedora Digital Library with HDFS

– Stream and Event Processing

1. Super Fast Scalable Log Management, Forensics and BI

2. Super Fast Scalable Fraud Detection


What Is Big-Data?


● “Every two days now we create as much information as we did from the dawn of civilization up until 2003.” - Eric Schmidt


Big-Data Characteristics

● Volume

● Variety

● Velocity


You're a Part of It Every Day

● We have the ability to store anything

● Companies and people are generating data like never before in history

– Social Networks

– Online Web Portals

– Log Writers - Our Digital Footprint!


You're a Part of It Every Day

● Big-Data is whatever people do in the digital world: the footprint (logs) of what people, companies, devices and services do, as well as traditional tabular data stores.


As a Manager, You're Still a Part of It

● “Over half of the business leaders today realize they don't have access to the insights they need to do their job.” - IBM


Vertical or Horizontal?


Scale Up or Scale Out


Linear Scalability


Big-Data Processing Solutions


Q: How to scale linearly on commodity machines? A: MapReduce
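A minimal sketch of the MapReduce idea named above, written as a single-process word count. The function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample input are illustrative assumptions and not Hadoop's API. On a real cluster the map and reduce calls run in parallel on many machines, which is where the linear scaling comes from.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    # Each line can be mapped independently, and each key reduced independently.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data is big", "data streams"])
```

Because no map call depends on another and no reduce call shares a key with another, adding machines lets more lines and more keys be processed at the same time.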


Q: How to store big data on commodity machines? A: Distributed File System


Replication → Fault Tolerance

Replication → Data Locality → Utilization
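The two chains above can be sketched in a few lines. This is a toy model, assuming a simple round-robin placement and hypothetical node names. HDFS itself uses a rack-aware placement policy, so treat the helper names and the placement rule here as illustrative only.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + i) % len(nodes)] for i in range(replication)]
    return placement

def readable_blocks(placement, failed):
    """Fault tolerance: blocks that still have a replica on a live node."""
    return [b for b, holders in placement.items()
            if any(node not in failed for node in holders)]

def pick_local_node(placement, block, busy):
    """Data locality: prefer an idle node that already stores the block."""
    for node in placement[block]:
        if node not in busy:
            return node
    return None  # no local replica is free; a real scheduler would read remotely

nodes = ["n1", "n2", "n3", "n4"]  # hypothetical commodity machines
placement = place_blocks(num_blocks=4, nodes=nodes, replication=3)
```

With three replicas per block, any two of the four nodes can fail without losing data, and a task has three candidate machines where it can run without moving the block over the network.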


Big-Data Processing, Most Popular Technologies

● Apache Hadoop Ecosystem

● NoSQL Databases

– HBase

– Cassandra

– MongoDB

– Neo4j

● Elasticsearch

– Lucene

– Solr

● Java


1. +TB Scale Data Warehouse


DW Solution

● SQL

● ETL

– RDBMS

– NoSQL

– File System

● REST API


REST Admin Panel


Features

● Extendable Capacity for Data Warehousing

● Making Very Big Integrated Databases Based on Different Technologies/Schemas

– DB2, Oracle, MS-SQL …

– Different Schemas Such as HRMS, Banking, Sales...

– Making Small Dense Integrated RDBMSs

● SQL Language Interface

● Linear Scalability


Main Technologies and Frameworks

● Apache Hadoop

– Sqoop

– YARN/HDFS

– Hive or Drill or Impala

● Microservices Architecture

– Java 1.7

– Spring Boot


2. Analytical Real-Time Scalable Search Solution and BI


+TB Scale RT Searching

● Indexing Incoming Data on-the-fly

● Highly Scalable and Reliable

● Simple or Complex Queries

● REST API

● Schema Agnostic

● Customizable GUI and BI
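Indexing on the fly and schema-agnostic search both rest on an inverted index, the core structure behind Lucene-based engines. The following is a minimal in-memory sketch of that idea. `TinyIndex` and the sample documents are hypothetical and do not reflect the Elasticsearch or Lucene API.

```python
from collections import defaultdict

class TinyIndex:
    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)  # term -> ids of documents containing it

    def add(self, doc_id, doc):
        """Index every field of a dict without requiring a predefined schema."""
        self.docs[doc_id] = doc
        for value in doc.values():
            for term in str(value).lower().split():
                self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing all query terms (AND query)."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

index = TinyIndex()
# The two documents have different fields; neither shape was declared up front.
index.add(1, {"user": "alice", "action": "login failed"})
index.add(2, {"host": "web01", "message": "login ok", "code": 200})
```

A document becomes searchable as soon as `add` returns, which is the "on-the-fly" property in miniature. A production engine adds analysis, scoring, sharding and persistence on top.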


Business Intelligence


Rich GUI



● Elasticsearch

– Apache Lucene

– REST

● Kibana


3. Scalable Recommender System


Recommender System

● Value-added Service (Loyalty Services)

● Machine-Learning

– Clustering Across Thousands of Nodes

   ● Apache Mahout

● Super Fast
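The slides name Apache Mahout but do not show which algorithm was used, so the following is only a sketch of one common approach, item co-occurrence, in plain Python. The function names and the sample baskets are assumptions made for illustration and are not Mahout's API.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(baskets):
    """Count how often two items appear in the same basket (symmetric counts)."""
    counts = defaultdict(lambda: defaultdict(int))
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def recommend(counts, owned, top_n=2):
    """Score unseen items by co-occurrence with the items the user already has."""
    scores = defaultdict(int)
    for item in owned:
        for other, n in counts[item].items():
            if other not in owned:
                scores[other] += n
    # Highest score first; ties are broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

# Hypothetical purchase histories.
baskets = [["book", "pen"], ["book", "pen", "lamp"], ["book", "lamp"], ["pen", "ink"]]
counts = cooccurrence(baskets)
```

Building the co-occurrence counts is the expensive, batch-friendly part that can be distributed. Answering `recommend` is a cheap lookup, which is one plausible reason to keep precomputed results in a fast store such as Redis.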


How Does It Work?


Technologies and Frameworks

● Microservices Architecture

● Java 1.6

● Apache Mahout

● Redis

4. Fedora Digital Library and HDFS Integration

Migrating from Expensive Servers to Commodity Machines

● Using HDFS as Fedora Digital Library Storage

– Research and Development

– Hadoop 1.2, Later Hadoop YARN 2.2

– Integrating with Solr over HDFS

● Java 1.7

● Fedora

– Islandora

– GSearch


Data Streaming


Big-Data Streaming, Most Popular Technologies

● Piping and Messaging

– Kafka, Flume, Fluentd and ZeroMQ

● Stream Processing

– Storm, Samza and Spark

● Machine Learning

– MLlib and Mahout

● Persisting

– NoSQL DBs

– HDFS


1. Log Management, Forensics and BI


Log Management, Forensics and BI

● Everything Digital Writes to Log Files

– Log Files Are Streams of Data

– Log Files Are Messy

– Log Files Come Very Fast, in an Unpredictable Manner

– Log Files Are About Everything within Your Business

● Log Files Are Full of Insight

– Who Can Hold Them For a Reasonable Period of Time?

– Who Can Search Them Rapidly?

– Who Can Visualize Them Easily (BI)?
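Because log files are messy, the first step in any such pipeline is turning raw lines into structured events that can be stored, searched and charted. The sketch below assumes a hypothetical line layout (`<date> <time> <LEVEL> <message>`) and made-up sample lines. In the stack described in this talk, a collector such as Logstash or Flume would do this job.

```python
import re

# Hypothetical line layout: "<date> <time> <LEVEL> <message>"
LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<message>.*)$"
)

def parse_line(line):
    """Turn one raw log line into a dict; keep unparseable lines instead of dropping them."""
    match = LINE.match(line.strip())
    if match is None:
        return {"raw": line.strip(), "parsed": False}
    event = match.groupdict()
    event["parsed"] = True
    return event

def count_by_level(lines):
    """A tiny BI-style aggregation: number of parsed events per log level."""
    totals = {}
    for event in map(parse_line, lines):
        if event["parsed"]:
            totals[event["level"]] = totals.get(event["level"], 0) + 1
    return totals

# Made-up sample lines, including one that does not match the layout.
sample = [
    "2015-01-10 12:00:01 ERROR payment gateway timeout",
    "2015-01-10 12:00:02 INFO user logged in",
    "### garbage line ###",
    "2015-01-10 12:00:03 ERROR payment gateway timeout",
]
```

Keeping the unparseable line with a `parsed` flag, rather than discarding it, matters for forensics: the odd-looking entries are often the interesting ones.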


Network Topology

[Diagram labels: LB, Masters, Data]



● Logstash

– Flume

● Elasticsearch

● Kibana


Snapshot


2. Fraud Detection


Inputs & Outputs

● Inputs: One or multiple sources generate data continuously, in real time

– Sensor Networks

– Transaction Logs

– Text Streams such as News

– Network Traffic Analysis

● Outputs: Up-to-date Answers generated continuously or periodically


Data Processing

Transient Query

– Issued once, then forgotten

Persistent Data

– Stored until deleted by user or apps


Stream Processing

Transient Data

– Deleted as Window Slides Forward

Persistent Queries

– Generate up-to-date answers as time goes on

Windows: Time Based or Count Based
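The two window types on this slide can be sketched as follows. The class names are hypothetical, the "persistent query" is simply a running sum, and timestamps are assumed to arrive in non-decreasing order. Stream engines such as Spark provide their own windowing operators, so this only illustrates the concept.

```python
from collections import deque

class CountWindow:
    """Count based: keep only the most recent `size` events."""
    def __init__(self, size):
        self.events = deque(maxlen=size)  # older events fall off automatically

    def add(self, value):
        self.events.append(value)
        return sum(self.events)  # the persistent query: re-answered on every event

class TimeWindow:
    """Time based: keep only events newer than `seconds` before the latest one."""
    def __init__(self, seconds):
        self.seconds = seconds
        self.events = deque()

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        while self.events[0][0] <= timestamp - self.seconds:
            self.events.popleft()  # transient data: deleted as the window slides
        return sum(v for _, v in self.events)

by_count = CountWindow(size=3)
count_answers = [by_count.add(v) for v in [1, 2, 3, 4]]

by_time = TimeWindow(seconds=10)
time_answers = [by_time.add(t, v) for t, v in [(0, 1), (5, 2), (12, 3)]]
```

In both cases the query stays fixed while the data passes through, which is the reverse of the transient-query, persistent-data model of classic data processing.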


Features

● Scalability

● Real-Time (at most 1 second of delay)

● Super Fast Decision Making

● Implementing Complex Fraud Scenarios As Easy as Defining Queries

● Uniform API for Processing Old or Early Events
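As an illustration of a fraud scenario expressed as a query, here is a sketch of a velocity rule: flag a card that makes more than a given number of transactions inside a time window. `VelocityRule`, the thresholds and the sample events are all hypothetical, and the actual system described here was built on Kafka and Spark, not on this code. Timestamps are assumed to be in seconds and non-decreasing.

```python
from collections import defaultdict, deque

class VelocityRule:
    """Flag a card that makes more than `max_count` transactions within `window` seconds."""
    def __init__(self, max_count, window):
        self.max_count = max_count
        self.window = window
        self.recent = defaultdict(deque)  # card id -> timestamps still inside the window

    def check(self, card, timestamp):
        times = self.recent[card]
        times.append(timestamp)
        while times[0] <= timestamp - self.window:
            times.popleft()  # forget transactions that slid out of the window
        return len(times) > self.max_count

# The scenario is defined by two parameters instead of custom code per case.
rule = VelocityRule(max_count=2, window=60)
events = [("A", 0), ("A", 10), ("B", 15), ("A", 20), ("A", 100)]
alerts = [(card, t) for card, t in events if rule.check(card, t)]
```

The same `check` call works whether events are replayed from storage or arrive live, which is the sense in which one API can serve both historical and fresh events.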



● Java 1.7, Scala 2.11

● Apache Flume

● Apache Kafka

● Apache Spark

Where To Start?

● You Need a Large Amount of Data

● You Need to Change Your Mindset

– Rack Space and Number of Servers, IO and Process Limitations

● You Need to Understand the Fundamentals

– Linux (Bash Script)

– Java is a Must, Python Works and Scala is an Advantage

– SQL and ETL

– MapReduce, Resource Management and Serialization Frameworks

– Apache Hadoop Ecosystem and Successors


Thank You! Questions?

http://slideshare.net/amirsedighi