Turnkey Riak Cluster
October 2015
2015 (C) physIQ All rights reserved 2
Agenda
Part One: (Today)
- Create a 10 node Riak KV cluster from scratch
- Populate it with time series data (big data!)
Part Two: (Next Time)
- Index the data
- Perform analytics on the time series data
Part Three: (Next Next Time)
- Use machine learning / predictive analytics to model the data and measure model effectiveness
Mission
Build a turnkey solution for working with a Riak cluster of any size
Implementation
Turnkey at Any Scale Means
• Public Cloud Infrastructure (GCE)
• Automated Cloud Orchestration
• Automated Configuration Management
Usage of Large Scale Cluster Requires
• Load Balancing
• Centralized Logging
• Centralized Monitoring
Cloud Orchestration
TERRAFORM (http://terraform.io)
“Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently”
• Runs Locally on Your Machine
• Uses Cloud Provider APIs To:
  • Create Network and Firewall Rules
  • Launch Server Instances
  • Provision and Attach Storage Disks
  • Bootstrap servers with init scripts
*This is the only tool the user configures and runs
Configuration Management
SaltStack (http://saltstack.com)
“Salt delivers a dynamic communication bus for infrastructures that can be used for orchestration, remote execution, configuration management and much more.”
• Salt installs software and configures each server
• Each server type has its own Salt profile
  • e.g. Riak, HAProxy, Zabbix, ELK Stack
• Bootstrapped by the init script that Terraform sets
  • Installs Salt and sets the server profile
  • Init script runs automatically once server is initialized
Key – Value Data Store
Riak KV (http://www.basho.com)
Riak is a highly available, scalable, fault tolerant, key – value big data store
• Masterless architecture – every node capable of serving read / write requests
• Automatic sharding of data to ensure even distribution
• Tunable consistency – relax consistency requirements per request for better performance
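
One way to picture tunable consistency is the quorum arithmetic used by Dynamo-style stores: with n replicas per key, a read that waits for r replies and a write that waits for w acknowledgements are guaranteed to share at least one replica when r + w > n. The sketch below is illustrative only. The function name and the choice of n = 3 are assumptions, and it models strict quorums rather than any specific Riak setting.

```python
# Sketch of the quorum arithmetic behind "tunable consistency".
# n = replicas per key, r = replies a read waits for, w = acks a write waits for.
# The function name and sample values are illustrative assumptions.

def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Return True when every read quorum shares a replica with every write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


if __name__ == "__main__":
    n = 3  # assumed replica count, for illustration only
    for r, w in [(1, 1), (2, 2), (1, 3), (3, 1)]:
        label = "overlapping quorums" if quorums_overlap(n, r, w) else "faster, reads may be stale"
        print(f"n={n} r={r} w={w}: {label}")
```

Lower r and w values wait on fewer replicas per request, which is where the performance gain comes from.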
Load Balancing
HAProxy (http://www.haproxy.org)
“HAProxy is a free, very fast and reliable solution offering high availability, load balancing, and proxying for TCP and HTTP-based applications”
• Distributes load evenly across the nodes in the Riak cluster
*Project leverages Cloud Provider LB for single IP external access only
Centralized Monitoring
ZABBIX (http://www.zabbix.com)
“Zabbix is an enterprise open source monitoring solution for networks and applications”
• Collects Basic Server Metrics
  • e.g. CPU, Memory, Disk Usage
• Added Basho Zabbix Template
  • https://github.com/basho/riak-zabbix
  • Adds Riak throughput, latency, health, & Erlang Metrics
• Easily set alerts and configure dashboards
Centralized Logging
ELK Stack (http://elastic.co)
“ELK Stack combines the Elasticsearch, Logstash & Kibana to provide realtime insights of any type of structured, unstructured data.” (i.e. Logs)
• A process on servers sends logs to ELK Stack
  • https://github.com/josegonzalez/python-beaver
• Kibana provides UI for
  • Ad hoc queries
  • Building Dashboards
Stack Diagram
Cost
What Does Google Cloud Platform Charge For?
• Hourly Rate of each server instance you create
• Hourly Rate for each GB of SSD or Magnetic Storage Used
• Hourly Charge for Each GCE LB forwarding rule
• $.008 / GB of Data through GCE Load Balancer
Default Configuration Cost

Item                               Quantity   Unit Cost / Hour   Total
Riak Node (n1-standard-2)          5          $0.10              $0.50
Riak SSD Storage                   100        $0.000236          $0.02
Riak Magnetic Storage              1000       $0.000056          $0.06
HAProxy Server (n1-standard-1)     2          $0.05              $0.10
Zabbix Server (n1-standard-2)      1          $0.10              $0.10
ELK Stack Server (n1-standard-1)   1          $0.05              $0.05
Server Magnetic Boot Disk Space    90         $0.000056          $0.01
Load Balancer Forwarding Rules     2          $0.03              $0.05
Data Through Load Balancer         ?          $0.008             $0.00
Total Cost (Per Hour)                                            $0.88
* See GCE Pricing (https://cloud.google.com/compute/pricing#lb)
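
The hourly total is quantity times unit cost, summed over the line items. The sketch below recomputes it from the figures on this slide and projects it over a run of a given length. The function names are illustrative, and load balancer traffic is left out because its quantity is unknown. Because the unit costs shown here appear to be rounded, the recomputed sum can differ from the $0.88 on the slide by a cent or two.

```python
# Line items copied from the default configuration table:
# (item, quantity, unit cost per hour in USD).
# "Data Through Load Balancer" is omitted because its quantity is unknown.
LINE_ITEMS = [
    ("Riak Node (n1-standard-2)", 5, 0.10),
    ("Riak SSD Storage", 100, 0.000236),
    ("Riak Magnetic Storage", 1000, 0.000056),
    ("HAProxy Server (n1-standard-1)", 2, 0.05),
    ("Zabbix Server (n1-standard-2)", 1, 0.10),
    ("ELK Stack Server (n1-standard-1)", 1, 0.05),
    ("Server Magnetic Boot Disk Space", 90, 0.000056),
    ("Load Balancer Forwarding Rules", 2, 0.03),
]


def hourly_total(items):
    """Sum quantity * unit cost over all line items."""
    return sum(quantity * unit_cost for _, quantity, unit_cost in items)


def projected_cost(hours, items=LINE_ITEMS):
    """Cost of keeping the whole stack running for the given number of hours."""
    return hours * hourly_total(items)


if __name__ == "__main__":
    print(f"Per hour:      ${hourly_total(LINE_ITEMS):.2f}")
    print(f"8-hour session: ${projected_cost(8):.2f}")
```

For a throwaway environment, the figure that matters is `projected_cost` for the few hours the cluster actually exists.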
Loading Data - IoT
Chicago Transit Authority (CTA)
- Public API
- Allows you to query location of buses and trains
- Data refreshed every minute
- 1500+ vehicles
- Just under a million fixes a day
fix = (id, time, lat, long, ……)
Loading Data - IoT
Example data:
{"vid": 1950, "tmstmp": "20150218 23:59", "lat": 41.880667662009216, "lon": -87.741054045848358, "hdg": 269, "pid": 949, "rt": "20", "des": "Austin", "pdist": 4042, "spd": 15, "tablockid": "N20 -893", "tatripid": 1040830, "zone": null}
One per vehicle, per minute
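
As a rough sketch of how a loader might handle one fix, the code below parses the example record and derives a key from the vehicle id and timestamp. The key scheme and the function names are hypothetical, chosen only because the data arrives as one fix per vehicle per minute. The project's actual loader may do this differently.

```python
import json
from datetime import datetime

# The example fix from the slide, as a raw JSON string.
SAMPLE = (
    '{"vid": 1950, "tmstmp": "20150218 23:59", "lat": 41.880667662009216, '
    '"lon": -87.741054045848358, "hdg": 269, "pid": 949, "rt": "20", '
    '"des": "Austin", "pdist": 4042, "spd": 15, "tablockid": "N20 -893", '
    '"tatripid": 1040830, "zone": null}'
)


def parse_fix(raw):
    """Decode one fix and turn its timestamp string into a datetime."""
    fix = json.loads(raw)
    fix["tmstmp"] = datetime.strptime(fix["tmstmp"], "%Y%m%d %H:%M")
    return fix


def fix_key(fix):
    """Hypothetical key scheme: vehicle id plus minute-resolution timestamp."""
    return f'{fix["vid"]}_{fix["tmstmp"]:%Y%m%d%H%M}'


if __name__ == "__main__":
    fix = parse_fix(SAMPLE)
    print(fix_key(fix), fix["lat"], fix["lon"])
```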
Loading Data - IoT
- CTA data archive located at https://s3.amazonaws.com/cta-tracker
250 days = 250,000,000 fixes
How do we configure a cluster to allow this much data to be input for analysis in a reasonable* amount of time?
*reasonable = 30 minutes
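
The 30 minute target translates directly into a required write rate. The arithmetic below is a back-of-the-envelope sketch. It assumes writes spread evenly over the 10 nodes from Part One and ignores replication overhead, so the real per-node load would be higher.

```python
TOTAL_FIXES = 250_000_000   # 250 days of archived fixes
TARGET_SECONDS = 30 * 60    # "reasonable" = 30 minutes


def required_rate(total, seconds, workers=1):
    """Writes per second each worker must sustain to finish on time."""
    return total / seconds / workers


# Whole-cluster rate, then the share per node for a 10 node cluster
# (assumes a perfectly even spread and no replication overhead).
cluster_rate = required_rate(TOTAL_FIXES, TARGET_SECONDS)
per_node_rate = required_rate(TOTAL_FIXES, TARGET_SECONDS, workers=10)

print(f"Cluster:  {cluster_rate:,.0f} writes/sec")
print(f"Per node: {per_node_rate:,.0f} writes/sec")
```

That is roughly 139,000 writes per second across the cluster, or about 13,900 per node.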
Stack Diagram
[Diagram: S3 archive feeding six parallel Loader processes]
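
The diagram shows six loaders working against the S3 archive. One simple way to divide the work is to deal the archive objects out to the loaders in turn, so that each loader gets a near-equal share. The sketch below assumes one archive object per day and uses made-up object names. The real archive layout and the project's own loader logic may differ.

```python
def partition(objects, loader_count):
    """Deal objects out to loaders in turn so the shares differ by at most one."""
    return [objects[i::loader_count] for i in range(loader_count)]


# Hypothetical object names: one archive object per day for 250 days.
archive_objects = [f"fixes-day-{day:03d}.json" for day in range(1, 251)]

# Six loaders, as in the diagram.
assignments = partition(archive_objects, 6)

for loader_id, share in enumerate(assignments):
    print(f"loader {loader_id}: {len(share)} objects, first = {share[0]}")
```

Each loader can then run on its own machine and write through the load balancer, which keeps any single loader from becoming the bottleneck.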
Conclusions
• With the three basic building blocks (public cloud API, Terraform, Salt), you can build & configure complex, low cost, high performance network infrastructures in minutes.
• Riak KV provides highly scalable, reliable data throughput (read / write) in a cloud environment.
• Supporting tools (HAProxy, Zabbix, ELK) allow you to measure the cluster’s effectiveness.
Together, these tools can be used to build throwaway big data analysis environments!
Access To Project Components
• Cluster Configuration Scripts
  • https://github.com/physIQ/turnkey-riak
• Archived CTA fix data
  • https://s3.amazonaws.com/cta-tracker
• CTA Tracker project
  • https://github.com/jolson7168/ctaTracker