Open Cloud Engine Introduction and Case Study of Open Source Project Flamingo, the Big Data Platform Open Cloud Engine Flamingo Project Leader Edward Kim ([email protected]) 2014.03.01 v0.8

OpenSource Big Data Platform - Flamingo Project


DESCRIPTION

Flamingo is an open-source Big Data platform that combines an Ajax rich web interface, a workflow engine, a workflow designer, MapReduce, a Hive editor, and a Pig editor.

Movies: http://wiki.opencloudengine.org/pages/viewpage.action?pageId=2064714

Screenshots: http://wiki.opencloudengine.org/pages/viewpage.action?pageId=2065069

Download: http://sourceforge.net/projects/hadoop-manager/files

Wiki: http://wiki.opencloudengine.org/pages/viewpage.action?pageId=819212


Page 1: OpenSource Big Data Platform - Flamingo Project

Open Cloud Engine

Introduction and Case Study of Open Source Project Flamingo, the Big Data Platform Open Cloud Engine Flamingo Project Leader Edward Kim ([email protected])

2014.03.01 v0.8

Page 2: OpenSource Big Data Platform - Flamingo Project

What is a Big Data Platform?

Page 3: OpenSource Big Data Platform - Flamingo Project

The roles of the Big Data Platform

•  What are the main tasks that can be done on a Big Data platform?

•  Data mining, statistical analysis, log handling (collecting, pre-processing)

•  Who does what on the platform?

•  It varies with the users.

•  Most operators have a development background, so they focus on system management and log handling.

•  Analysts focus on establishing a better environment for analyzing data.

•  How many users are using the Big Data platform?

•  Many users → the functionality of the platform and the accessibility of the infrastructure are important.

•  The Big Data platform handles data → it is vulnerable. Hadoop is insecure.

•  What am I? Operator? Architect? Developer? Data Scientist?

•  Depending on the role, the functions of the platform can be defined differently.

Page 4: OpenSource Big Data Platform - Flamingo Project

What Big Data Platform Must Provide SOFTWARE STACK

Page 5: OpenSource Big Data Platform - Flamingo Project

What Big Data Platform Must Provide

INFRA MANAGEMENT MONITORING

Page 6: OpenSource Big Data Platform - Flamingo Project

What Big Data Platform Must Provide

WORKFLOW

Page 7: OpenSource Big Data Platform - Flamingo Project

What Big Data Platform Must Provide

ANALYSIS AND VISUALIZATION

Page 8: OpenSource Big Data Platform - Flamingo Project

What Big Data Platform Must Provide

DASHBOARD

Page 9: OpenSource Big Data Platform - Flamingo Project

What Big Data Platform Must Provide

SECURITY

•  ACCESS •  AUTHENTICATION •  AUTHORIZATION •  ENCRYPTION •  AUDITING •  POLICY

Page 10: OpenSource Big Data Platform - Flamingo Project

What Big Data Platform Must Provide •  Batch job management and monitoring

•  MR based Parallel analysis program

•  User activities monitoring

•  Policy on accessing resources and systems.

•  Various functions to improve accessibility to infrastructures.

Page 11: OpenSource Big Data Platform - Flamingo Project

Flamingo Project In Open Cloud Engine

•  Taking advantage of web technology makes it convenient to use big data infrastructure and data.

•  Users can handle data easily.

•  Provides the functionality to do various jobs in one workspace.

•  Analysis and processing MapReduce jobs can be reused.

•  Open source oriented, and all systems are ready to go.

•  Designed to be operator friendly.

•  Supports the Hadoop ecosystem.

Page 12: OpenSource Big Data Platform - Flamingo Project

[Figure: Big Data Analysis and Service Platform. Recoverable labels: data scientists, service planners, and data analysts work in a browser (Designer, Search) to design a workflow; logs are collected and validated; MapReduce analysis modules (morphology analysis, graph analysis, user evaluation, electing a leader) process the data; analyzed results are exposed through an Open API for reuse by data users, systems, mobile devices, an opinion leader score board, and data visualization charts.]

Information Catalogue (browser search view)

Information            Security   Batch          Type
User Similarity        1          Daily, 4 PM    XML
Item Recommendation    2          Daily, 2 AM    JSON
Purchase Preference    3          Daily, 8 PM    XML/JSON
Opinion Leader         2          Daily, 7 AM    XML/JSON

Future of Big Data Platform

Page 13: OpenSource Big Data Platform - Flamingo Project

Flamingo Project •  Functionalities matter the most in the Hadoop based Big Data environment.

•  Integrated open source projects are difficult to manage and not enough UIs exist to handle them

Page 14: OpenSource Big Data Platform - Flamingo Project

Flamingo Workbench

•  Users can move freely within a workspace while carrying out various jobs.

•  Each window is dedicated to its own functionality.

•  To minimize coding, reusable parts are componentized.

•  The system is kept simple, and well-known frameworks are used so that new features are easy to add.

•  The development method is standardized (tools, procedures, manuals, environments…).

Page 15: OpenSource Big Data Platform - Flamingo Project

Flamingo Architecture

Page 16: OpenSource Big Data Platform - Flamingo Project

File System Browser •  Managing files is an integral part of Hadoop

•  A familiar windows file explorer style UI provides a better UX to users

Page 17: OpenSource Big Data Platform - Flamingo Project

File System Browser

Converts directories into Hive DBs or tables

Hive DBs and tables are marked with different icons in the browser.

FLAMINGO HAS OPTIMIZED FREQUENTLY NEEDED

FUNCTIONS

Page 18: OpenSource Big Data Platform - Flamingo Project

File System Browser Enhancement

•  Previewing files and their locations

•  Restricting unauthorized users from viewing directories and files (not provided by Hadoop itself).

•  E.g. the /tmp directory is not visible to common users.

•  Setting permissions on directories and files

•  A home directory for each user (not provided by Hadoop itself)

•  Setting a quota on directories

•  Regularly dumping file system size info (for monitoring)
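The visibility restriction described above can be sketched as a simple path filter. This is only an illustration under stated assumptions: the class name `PathVisibility`, the notion of "hidden roots", and the admin user list are hypothetical and are not taken from Flamingo's code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: hides configured directory trees from non-admin users.
final class PathVisibility {
    private final Set<String> hiddenRoots; // e.g. "/tmp" (assumed configuration)
    private final Set<String> admins;      // users who may see everything (assumed)

    PathVisibility(Set<String> hiddenRoots, Set<String> admins) {
        this.hiddenRoots = hiddenRoots;
        this.admins = admins;
    }

    // A path is hidden when it is a hidden root or lies beneath one.
    boolean isVisible(String user, String path) {
        if (admins.contains(user)) {
            return true;
        }
        for (String root : hiddenRoots) {
            if (path.equals(root) || path.startsWith(root + "/")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        PathVisibility v = new PathVisibility(
                new HashSet<>(Arrays.asList("/tmp")),
                new HashSet<>(Arrays.asList("hdfs")));
        System.out.println(v.isVisible("alice", "/tmp/job1")); // false
        System.out.println(v.isVisible("alice", "/tmpfiles")); // true
        System.out.println(v.isVisible("hdfs", "/tmp/job1"));  // true
    }
}
```

Matching on `root + "/"` rather than a bare prefix keeps a sibling such as /tmpfiles visible when only /tmp is hidden.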

Page 19: OpenSource Big Data Platform - Flamingo Project

Audit Log •  Search all recorded HDFS logs.

Page 20: OpenSource Big Data Platform - Flamingo Project

Workflow Designer •  Mounts various analytic modules (e.g. Mahout)

•  Drag and drop provided modules to the canvas.

•  Analytic and statistical modules are currently mounted, Mahout and Giraph are being mounted, and ETL MapReduce jobs will be mounted soon.

Page 21: OpenSource Big Data Platform - Flamingo Project

Big Workflow Case Supports a workflow composed of multiple nodes.

Page 22: OpenSource Big Data Platform - Flamingo Project

Apache Access Log To CSV

Page 23: OpenSource Big Data Platform - Flamingo Project

Apache Access Log To CSV

Parameters to MapReduce •  Delimiter •  An option to print non matching pattern logs

Location of Apache Access Log and an output path of a CSV file.

MapReduce JAR file and a driver name
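The two parameters listed above (a delimiter and an option to print non-matching log lines) can be illustrated with a small per-line conversion. This is a minimal sketch, not Flamingo's MapReduce module: it assumes the Apache Common Log Format, and the class name `AccessLogToCsv` and method `convert` are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the per-line logic a mapper could apply.
final class AccessLogToCsv {
    // Common Log Format: host ident user [time] "request" status bytes
    private static final Pattern CLF = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    // Returns the delimited record, the raw line when keepNonMatching is set
    // and the line does not match, or empty when the line should be dropped.
    static Optional<String> convert(String line, String delimiter, boolean keepNonMatching) {
        Matcher m = CLF.matcher(line.trim());
        if (!m.matches()) {
            return keepNonMatching ? Optional.of(line) : Optional.empty();
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= m.groupCount(); i++) {
            if (i > 1) {
                sb.append(delimiter);
            }
            sb.append(m.group(i));
        }
        return Optional.of(sb.toString());
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        System.out.println(convert(line, ",", false).orElse("(dropped)"));
    }
}
```

The sketch does not quote fields that contain the delimiter, so a delimiter that cannot occur in the log (a tab, for instance) is the safer choice.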

Page 24: OpenSource Big Data Platform - Flamingo Project

Workflow Designer

•  A complex workflow is often needed to produce a final output.

•  Processing files with MapReduce jobs usually requires several steps, which makes creating a workflow difficult.

•  Engineers prefer Apache Hive's SQL-like query language over writing MapReduce jobs, so Workflow Designer comes in handy.

•  When handling various types of log files, Workflow Designer and MapReduce are essential.

Page 25: OpenSource Big Data Platform - Flamingo Project

Workflow Monitoring •  Monitors workflows submitted from Workflow Designer. Accurate logs can be checked.

Page 26: OpenSource Big Data Platform - Flamingo Project

Workflow Monitoring

root@n02:~/flamingo_data/tmp/2014/03/31/90/JOB_20140331_172000_90_157566920/26385942 $> ls -lsa
total 40
 4 drwxr-xr-x  2 root root  4096 2014-03-31 17:23 .
 4 drwxr-xr-x 20 root root  4096 2014-03-31 17:23 ..
16 -rw-r--r--  1 root root 12731 2014-03-31 17:23 action.log → execution log
 4 -rwxrwxrwx  1 root root  1259 2014-03-31 17:23 core-site.xml
 0 -rw-r--r--  1 root root     0 2014-03-31 17:23 hadoop.job_201403300831_0471 → MapReduce Job ID
 4 -rwxrwxrwx  1 root root   852 2014-03-31 17:23 script.sh
root@n02:~/flamingo_data/tmp/2014/03/31/90/JOB_20140331_172000_90_157566920/26385942 $>

NODES IN A WORKFLOW CONTAIN SEVERAL MAPREDUCE JOBS, SO THEY MUST BE ABLE TO BE TRACKED.

What users view in the MapReduce execution history

Page 27: OpenSource Big Data Platform - Flamingo Project

Hadoop Job Monitoring

Must be able to be tracked in Hadoop Job Monitoring.

Page 28: OpenSource Big Data Platform - Flamingo Project

Expression Language (EL)

•  Dynamically substitutes values into variables.

•  E.g. today's date: dateFormat('yyyyMMdd'), dateFormat('yyyy-MM-dd')

•  For example, replace variables with certain dates.

•  E.g. a daily batch: record yesterday's date in a workflow executed today.

•  Supported Expression Language

•  dateFormat('DATE FORMAT') → dateFormat('yyyyMMddHHmmss')

•  hostname, escapeString

•  yesterday, tommorow

•  month, day, hour, minute, … → day('yyyyMMdd', -1) :: yesterday's date (20131111)

•  trim, concat, urlEncode, firstNotNull

Page 29: OpenSource Big Data Platform - Flamingo Project

Expression Language (EL)

The ${EL} format is dynamically replaced with real values.
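As a rough illustration of this substitution, the sketch below replaces `${dateFormat('…')}` and `${day('…', n)}` placeholders in a string. It is not Flamingo's EL engine: the class `ElSketch`, the exact placeholder grammar, and the use of `java.time` patterns are assumptions made for the example, and only the two functions shown are handled.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of ${...} substitution for two date functions.
final class ElSketch {
    // Matches ${name('pattern')} or ${name('pattern', offset)}.
    private static final Pattern EL = Pattern.compile(
            "\\$\\{(\\w+)\\('([^']*)'(?:\\s*,\\s*(-?\\d+))?\\)\\}");

    static String evaluate(String template, LocalDateTime now) {
        Matcher m = EL.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String function = m.group(1);
            DateTimeFormatter format = DateTimeFormatter.ofPattern(m.group(2));
            int offset = m.group(3) == null ? 0 : Integer.parseInt(m.group(3));
            String value;
            switch (function) {
                case "dateFormat": // current date and time in the given format
                    value = now.format(format);
                    break;
                case "day":        // date shifted by the given number of days
                    value = now.plusDays(offset).format(format);
                    break;
                default:           // unknown functions are left untouched
                    value = m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.of(2013, 11, 12, 9, 30, 0);
        // Prints /logs/20131111/access.log
        System.out.println(evaluate("/logs/${day('yyyyMMdd', -1)}/access.log", now));
    }
}
```

Passing the clock in as a parameter keeps a daily batch reproducible: rerunning a failed job with the original timestamp yields the same paths.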

Page 30: OpenSource Big Data Platform - Flamingo Project

Hadoop Job Tracker Monitoring •  Displays Hadoop’s job tracker info on graphs

Page 31: OpenSource Big Data Platform - Flamingo Project

Hadoop Job Tracker Monitoring •  Remote monitoring and tracking of Hadoop jobs are available.

Page 32: OpenSource Big Data Platform - Flamingo Project

Hive Editor & Hive Metastore Browser •  Search, browse, and download using SQL.

•  Hive Metastore is integrated. Easy to manage databases and tables.

Page 33: OpenSource Big Data Platform - Flamingo Project

Hive Editor Use Case •  Case 1: Search user access log with Hive

–  If the log is semi-structured or unstructured, it’s problematic.

–  If a column contains an array of map, it’s also problematic.

•  Below is an example of a semi-structured log

TYPE="IPINSIDE" TIME="2014-03-20 17:40:37" ID="guest0899349" MAC="AA-BB-01-18-68-68" NAT_IP="10.24.104.104" NAT_IP_NATION="USA" PROXY_USE="Y" VPN_USE="Y" REMOTE_USE="Y" PROXY_IP="192.24.104.104" PROXY_IP_NATION="USA" VPN_IP="192.24.104.104" VPN_IP_NATION="USA" SVC_CODE="SVC_CODE_0899349" HDD_DISK="HDD_DISK_0899349" CPU_INFO="CPU_INFO_0899349" USE_OS_NATION="USA" MESG="mesg..... time[1395284830] rnd[875899349] unq[5000000]"
TYPE="IPINSIDE" TIME="2014-03-20 17:40:37" ID="guest0899349" MAC="AA-BB-01-18-68-68" NAT_IP="10.24.104.104" NAT_IP_NATION="USA" PROXY_USE="Y" VPN_USE="Y" REMOTE_USE="Y" PROXY_IP="192.24.104.104" PROXY_IP_NATION="USA" VPN_IP="192.24.104.104" VPN_IP_NATION="USA" SVC_CODE="SVC_CODE_0899349" HDD_DISK="HDD_DISK_0899349" CPU_INFO="CPU_INFO_0899349" USE_OS_NATION="USA" MESG="mesg..... time[1395284830] rnd[875899349] unq[5000000]"

Page 34: OpenSource Big Data Platform - Flamingo Project

Hive Editor Use Case

TYPE="IPINSIDE" TIME="2014-03-20 17:40:37" ID="guest0899349" MAC="AA-BB-01-18-68-68" NAT_IP="10.24.104.104" NAT_IP_NATION="USA" PROXY_USE="Y" VPN_USE="Y" REMOTE_USE="Y" PROXY_IP="192.24.104.104" PROXY_IP_NATION="USA" VPN_IP="192.24.104.104" VPN_IP_NATION="USA" SVC_CODE="SVC_CODE_0899349" HDD_DISK="HDD_DISK_0899349" CPU_INFO="CPU_INFO_0899349" USE_OS_NATION="USA" MESG="mesg..... time[1395284830] rnd[875899349] unq[5000000]"

Page 35: OpenSource Big Data Platform - Flamingo Project

Hive Editor Use Case

Page 36: OpenSource Big Data Platform - Flamingo Project

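Hive Editor Use Case

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.io.Writable;

public class MasSerde implements SerDe {
    private StructTypeInfo rowTypeInfo;
    private ObjectInspector rowOI;
    private List<String> colNames;
    private List<Object> row = new ArrayList<Object>();

    // Captures every double-quoted value, e.g. IPINSIDE from TYPE="IPINSIDE".
    Pattern p = Pattern.compile("\"(.*?)\"");

    @Override
    public Object deserialize(Writable blob) throws SerDeException {
        row.clear();
        Matcher m = p.matcher(blob.toString());
        List<String> list = new ArrayList<String>();
        while (m.find()) {
            list.add(m.group(1));
        }
        String[] split = list.toArray(new String[list.size()]);
        // Values are assigned to the table columns by position.
        int i = 0;
        for (String fieldName : rowTypeInfo.getAllStructFieldNames()) {
            TypeInfo fieldTypeInfo = rowTypeInfo.getStructFieldTypeInfo(fieldName);
            row.add(parseField(split[i], fieldTypeInfo));
            i++;
        }
        return row;
    }
    ... (omitted)
}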

WHEN A LOG FILE IS LOADED, IT'S DESERIALIZED.
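To see what the regular expression in MasSerde does to one log record, the extraction step can be run on its own without any Hive dependency. This is a stand-alone sketch; the class name `QuotedValueExtractor` is hypothetical, and only the pattern itself comes from the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-alone illustration of the value extraction used in the SerDe above.
final class QuotedValueExtractor {
    // Same pattern as in MasSerde: a reluctant match between double quotes.
    private static final Pattern QUOTED = Pattern.compile("\"(.*?)\"");

    // Returns the quoted values in the order they appear on the line.
    static List<String> values(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = QUOTED.matcher(line);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "TYPE=\"IPINSIDE\" TIME=\"2014-03-20 17:40:37\" ID=\"guest0899349\"";
        // Prints [IPINSIDE, 2014-03-20 17:40:37, guest0899349]
        System.out.println(values(line));
    }
}
```

Because the keys are discarded and values are matched to columns by position, the approach relies on every record listing its fields in the same order as the table definition.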

Page 37: OpenSource Big Data Platform - Flamingo Project

Hive Editor Use Case

Page 38: OpenSource Big Data Platform - Flamingo Project

Pig Script Editor •  Edits and saves Pig Latin scripts.

•  Executes and manages Pig Latin scripts to expedite data processing.

Page 39: OpenSource Big Data Platform - Flamingo Project

Dashboard •  Displays batch job history

Page 40: OpenSource Big Data Platform - Flamingo Project

Job Management •  Schedules, executes, and monitors batch jobs

Page 41: OpenSource Big Data Platform - Flamingo Project

Job Management •  Cron Expression Fully Supported

Page 42: OpenSource Big Data Platform - Flamingo Project

Project Details

•  Download

–  http://www.sourceforge.net/projects/hadoop-manager

•  Wiki (manuals and tech notes)

–  http://wiki.opencloudengine.org/pages/viewpage.action?pageId=819205

•  Issues (bugs and new features)

–  http://jira.opencloudengine.org

•  Build Server

–  http://build.opencloudengine.org

•  Google Groups: [email protected]

•  Subscription: [email protected]

Page 43: OpenSource Big Data Platform - Flamingo Project

The Future of Flamingo Project

•  Big Data on Cloud

•  Netra (OpenStack based Hadoop Provisioning)

+ Flamingo (Hadoop based Workspace)

•  Open Source based Big Data Platform

•  Apache Hadoop EcoSystem

•  Big Data Management Using Flamingo

•  Apache Hadoop PaaS (Platform as a Service)

•  Big Data All In One Package

Page 44: OpenSource Big Data Platform - Flamingo Project

Workflow Designer •  MapReduce developers use different parameters.

•  How will we standardize these various MapReduces?

Page 45: OpenSource Big Data Platform - Flamingo Project

Workflow Designer •  Most UI parts are reusable and provided as components

•  MapReduce Module and UI controls are standardized and offered as a framework

Reuse components

UI Layout

Page 46: OpenSource Big Data Platform - Flamingo Project

Workflow Designer •  Module icons are defined through metadata, which minimizes coding.

•  The framework takes care of most of the work, and users only handle metadata.

Page 47: OpenSource Big Data Platform - Flamingo Project

Participate and Share with Us!!

www.opencloudengine.org