Donggeng Yu, 12/07/2019, Pronto, eBay
Pronto Elasticsearch Extension Practice in eBay
Agenda
1. Overview of Elasticsearch in eBay
2. Use Cases & Challenges
3. Tools Extension for Cluster Management
4. Service Extension for Cluster Capability
Elastic Stack
• ELKB
  ‒ Elasticsearch - Search & Aggregation
  ‒ Logstash - ETL
  ‒ Kibana - Visualization
  ‒ Beats - Data Shipper
• X-Pack
  ‒ Security, alerting, monitoring, reporting, machine learning, etc.
• Use Cases & OOTB Solutions
  ‒ Logs / Metrics
  ‒ APM / Uptime
  ‒ SIEM / Endpoint Security
  ‒ Site Search / App Search / Enterprise
  ‒ Maps
Pronto Ecosystem in eBay
• 100+ clusters, 6k+ nodes
• VM (OpenStack) / Container (K8s)
Use Cases in eBay
• Use Cases:
  ‒ Near real time search / aggregation
    ‒ Virtual Shop / Tire Installation / Terapeak / SEO
    ‒ On-Site Traffic
  ‒ Metrics & Logs
    ‒ UFES / Ceilometer / SRE / UMP
    ‒ More than 20 TB/day for a single cluster
Vertical Shop & Tire Installation
Terapeak - eCommerce Data Insights
• Terapeak
  ‒ SaaS-based tool that provides ecommerce data insights to online sellers
  ‒ Acquired by eBay
• Tech Stack
  ‒ From RDBMS + Solr to ELK
  ‒ S3 and Hadoop for data staging
  ‒ Spark for data ETL
  ‒ Kafka for data queue
  ‒ Postgres for data warehouse
  ‒ Elasticsearch for indexing and search
  ‒ ReactJS for front-end application
UFES - Anomaly Detection for SLB
• Goal
  ‒ Unified Front-End Services (UFES): move eBay closer to users so that the world shops first on eBay. The UFES team built out 8 new Internet Points of Presence (PoPs) across the globe.
  ‒ Route traffic via UFES PoPs by replacing the NetScaler hardware SEO load balancers with Envoy Proxy based software load balancers (SLB).
• Elastic Stack
  ‒ Filebeat + Kafka + Elasticsearch clusters
  ‒ Dashboard for monitoring and comparison
  ‒ Anomaly detection for SLB
Ceilometer - IT Operation Analytics
Challenges of Managing Cluster Fleets at Scale
• Integrate with eBay’s platform & follow its standards
  ‒ Configuration management & change management
  ‒ Full lifecycle management
• Easy onboarding and integration
  ‒ Elasticsearch as a Service
  ‒ How to free customers to focus on their domain business
• Performance & High Availability
  ‒ Search: response time for site-facing applications should be less than 100 ms
  ‒ Ingesting: 20 TB per day for a single cluster
  ‒ Different deployments, such as cross-region deployment
• Cost Control
  ‒ Hardware cost
  ‒ License fee (to support features such as security, alerting and ML)
  ‒ Human resources
  ‒ Support (24/7 on-call support, on-site support, etc.)
Challenge areas: Performance / HA / Onboarding & Integration / Cost
Solutions for Challenges: Cluster Provision & Management
• From VM to Container
  ‒ VM (OpenStack)
    ‒ Fixed flavor
    ‒ Puppet Foreman infrastructure
    ‒ Puppet module for Elasticsearch
  ‒ Container (K8s)
    ‒ Flexible flavor (request/limit)
    ‒ Operator pattern
    ‒ Deployment + StatefulSet + Service
• Best practices & different deployments
  ‒ Important system configuration & best practices
  ‒ Anti-affinity (high availability)
  ‒ Cross-region deployment (high availability)
  ‒ Flavor chosen by traffic (cost saving)
  ‒ Hot-warm architecture (cost saving)
  ‒ LB for write / read
Challenges addressed: Performance / HA / Onboarding & Integration / Cost
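The hot-warm bullet above can be sketched with Elasticsearch's index-level shard allocation filtering. This is a minimal illustration, not Pronto's actual setup: it assumes data nodes are started with a custom node attribute called `box_type` (the attribute name, the `hot_days` threshold and both helper functions are hypothetical). It only builds the settings body that would be sent to the index settings API.

```python
"""Sketch: hot-warm placement via index-level shard allocation filtering."""

# Assumption: each data node carries a custom attribute, e.g. node.attr.box_type: hot
ATTRIBUTE = "box_type"


def tier_for_age(age_days: int, hot_days: int = 7) -> str:
    """Indices younger than `hot_days` stay on hot nodes; older ones move to warm."""
    return "hot" if age_days < hot_days else "warm"


def allocation_settings(tier: str) -> dict:
    """Build the index settings body that pins an index to one tier."""
    if tier not in ("hot", "warm"):
        raise ValueError(f"unknown tier: {tier}")
    return {f"index.routing.allocation.require.{ATTRIBUTE}": tier}


if __name__ == "__main__":
    # A 10-day-old index would be moved to warm nodes under these assumptions.
    print(allocation_settings(tier_for_age(10)))
```

Updating this setting on an existing index makes the cluster relocate its shards to nodes whose attribute matches, which is what lets older data live on cheaper hardware.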
Solutions for Challenges: Tooling and Service Extension
Challenges addressed: Performance / HA / Onboarding & Integration / Cost
Use Case Onboarding
• Capacity planning
  ‒ What are the use case and usage scenarios?
    ‒ Data retention / active period
  ‒ Performance
    ‒ Index rate / search rate
    ‒ Document & bulk size
  ‒ Deployment & Cost
    ‒ How many nodes?
    ‒ What’s the hardware configuration?
    ‒ What kind of deployment should be used?
  ‒ Best practices
    ‒ Software configuration
    ‒ Deployment in different regions
    ‒ Keep a margin so that traffic can grow without causing performance issues
Node             | Storage | Memory  | CPU     | Network
Master           | Low     | Low     | Low     | Low
Data             | Extreme | High    | High    | Medium
Ingest           | Low     | Medium  | High    | Medium
Coordinator      | Low     | Medium  | Medium  | Medium
Machine Learning | Low     | Extreme | Extreme | Medium
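The capacity-planning questions above (retention, index rate, node count, margin) lend themselves to a back-of-the-envelope calculation. The sketch below is an assumption-laden illustration, not the Pronto sizing tool: the `SizingInput` fields, the 25% margin and the per-node disk size are all hypothetical, and it sizes data nodes by storage only, ignoring memory, CPU and search load.

```python
"""Sketch: storage-driven estimate of the number of data nodes."""
import math
from dataclasses import dataclass


@dataclass
class SizingInput:
    daily_ingest_gb: float            # indexed data added per day
    retention_days: int               # how long indices are kept
    replicas: int = 1                 # replica copies per primary shard
    margin: float = 0.25              # extra capacity kept for traffic growth
    disk_per_node_gb: float = 2000.0  # usable disk on one data node (assumed)


def required_storage_gb(s: SizingInput) -> float:
    """Primary data for the retention window, times copies, plus margin."""
    primary = s.daily_ingest_gb * s.retention_days
    with_replicas = primary * (1 + s.replicas)
    return with_replicas * (1 + s.margin)


def data_node_count(s: SizingInput) -> int:
    """Round up to whole nodes; always at least one."""
    return max(1, math.ceil(required_storage_gb(s) / s.disk_per_node_gb))


if __name__ == "__main__":
    plan = SizingInput(daily_ingest_gb=1000, retention_days=7)
    print(required_storage_gb(plan), data_node_count(plan))
```

A real sizing exercise would also account for the memory and CPU profile of each node role in the table above.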
Onboarding Self-Service and Sizing Tool
Challenges addressed: Onboarding & Integration
Customer Support
• Support model
  ‒ Different SLAs for different use cases
    ‒ Search response time should be less than 100 ms
    ‒ Cluster should NOT be in RED
  ‒ 24/7 support for site-facing or Tier 2 and above
    ‒ SEC call / PagerDuty
• Support cases
  ‒ Cluster in RED
    ‒ Node missing and replica is 0
    ‒ Dangling index
  ‒ Response time
    ‒ Full GC because of machine check error (MCE)
    ‒ Too many shards and fields
Challenges addressed: Onboarding & Integration / Cost
Data Ingestion Pipeline
• Added value for customers
  ‒ Self-service, no coding/testing
  ‒ No onboarding required
• Shared cluster
  ‒ 30+ use cases / 3 TB per day
• Shared data assets
  ‒ Partition by application name
• Shared dashboard
  ‒ 30+ dashboards
  ‒ 300+ charts/visualizations
Challenges addressed: Onboarding & Integration
Simple Steps - service onboarding a new use case
pom.xml
web.xml
Data Management & Optimization
• Backup & Restore
  ‒ Snapshot lifecycle management (Swift as the repository)
• Time series data
  ‒ Benefits of using time-based indices
    ‒ Deleting an index is faster than delete-by-query
    ‒ Use hot-warm architecture
    ‒ Close indices or force-merge read-only indices
  ‒ Time series data
    ‒ Terapeak vs. UFES (different needs)
• Lifecycle management
  ‒ Central policy management / Web UI / OOTB policies
Challenges addressed: Performance / Onboarding & Integration / Cost
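The benefit of time-based indices noted above is that retention becomes a matter of dropping whole indices rather than deleting documents by query. A minimal sketch of that selection step follows, assuming daily indices named `<prefix>-YYYY.MM.DD`; the naming pattern and both helper functions are illustrative assumptions, not the Pronto tool's conventions.

```python
"""Sketch: choose which daily indices have aged out of the retention window."""
from datetime import date, datetime, timedelta
from typing import Iterable, List


def daily_index_name(prefix: str, day: date) -> str:
    """Name of the index that holds one day's data, e.g. logs-2019.12.07."""
    return f"{prefix}-{day:%Y.%m.%d}"


def expired_indices(prefix: str, index_names: Iterable[str],
                    today: date, retention_days: int) -> List[str]:
    """Return indices of this prefix whose date is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix + "-"):
            continue  # belongs to another use case
        try:
            day = datetime.strptime(name[len(prefix) + 1:], "%Y.%m.%d").date()
        except ValueError:
            continue  # not a daily index; leave it alone
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


if __name__ == "__main__":
    names = ["logs-2019.12.01", "logs-2019.12.05", "logs-2019.12.07"]
    print(expired_indices("logs", names, date(2019, 12, 7), retention_days=3))
```

Each returned name can then be removed with a single index delete call, or closed or force-merged instead, depending on the policy.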
Index Management Tool vs. Curator vs. ILM
Function              | Curator | Pronto Index Mgmt. Tool | Elastic ILM
High Availability     | N/A     | YES                     | YES
Web UI                | N/A     | YES                     | YES
Version Compatibility | N/A     | 2.x/5.x/6.x/7.x         | 6.8+
Multi-Clusters        | N/A     | YES                     | N/A
Diagnostic Tool
• Features
  ‒ Find improper settings or usage
  ‒ Job scheduler & diagnostic report for potential issues
• Rules
  ‒ Too many indices / too many shards / index has too many fields
  ‒ Shard size check (20 GB to 40 GB)
  ‒ Imbalanced shards
  ‒ Replica number should be greater than 0
  ‒ Node missing / rack ID attribute missing / minimum master
  ‒ Machine check error / server disk full
  ‒ Alias & index template checking
Challenges addressed: Performance / Cost
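A few of the rules above can be expressed as simple checks over per-index statistics. The sketch below is only an illustration of the idea: `IndexStats`, `diagnose` and the 1000-field threshold are assumptions made for this example, and the real diagnostic tool's inputs and rule set are not described in the slides.

```python
"""Sketch: rule-based checks over one index's statistics."""
from dataclasses import dataclass
from typing import List

GB = 1024 ** 3


@dataclass
class IndexStats:
    name: str
    primary_shards: int
    replicas: int
    primary_size_bytes: int   # total size of all primary shards
    field_count: int          # number of mapped fields


def diagnose(ix: IndexStats, min_shard_gb: float = 20, max_shard_gb: float = 40,
             max_fields: int = 1000) -> List[str]:
    """Return human-readable findings; an empty list means no rule fired."""
    findings = []
    if ix.replicas < 1:
        findings.append(f"{ix.name}: replica number should be greater than 0")
    avg_shard_gb = ix.primary_size_bytes / ix.primary_shards / GB
    if avg_shard_gb > max_shard_gb:
        findings.append(f"{ix.name}: shards too large ({avg_shard_gb:.1f} GB avg)")
    elif avg_shard_gb < min_shard_gb:
        findings.append(f"{ix.name}: shards too small ({avg_shard_gb:.1f} GB avg)")
    if ix.field_count > max_fields:
        findings.append(f"{ix.name}: too many fields ({ix.field_count})")
    return findings


if __name__ == "__main__":
    for finding in diagnose(IndexStats("orders", 2, 0, 100 * GB, 1500)):
        print(finding)
```

Run on a schedule across all clusters, checks of this kind would feed the diagnostic report mentioned under Features.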
Performance & User Scenarios
• Many factors:
  ‒ Index / Shard
  ‒ Query / Scripting
  ‒ Mapping / Setting

Behavior     | Use Cases
Index heavy  | Logging / Metrics / Security / APM
Search heavy | App Search / Site Search / Analytics
Update heavy | Caching / Systems of Record
Performance Issues & Optimization
• Wildcard search
  ‒ Customers use patterns beginning with * or ?
  ‒ Avoid leading * or ?
• Stopwords & Shard Size
  ‒ Reindex with the stop words
  ‒ Use more shards to improve the throughput
• Too many indices / shards / fields
  ‒ Close or delete the unused indices
  ‒ Improve the document modeling
  ‒ Disable dynamic mapping
• Performance Optimization
  ‒ Disable swapping & give memory to the file system cache
  ‒ Unset or increase the refresh interval
  ‒ Disable refresh and replicas for initial loads
  ‒ Use auto-generated IDs
  ‒ Disable the features you do not need
  ‒ Don’t use the default dynamic string mapping
  ‒ Watch your shard size / shrink index
  ‒ Force merge
  ‒ Pre-index data
  ‒ Avoid scripts
  ‒ Force-merge read-only indices
  ‒ Warm up global ordinals
  ‒ Replicas might help with throughput, but not always
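Two items from the list above are easy to make concrete: spotting leading wildcards in search terms, and turning off refresh and replicas for an initial load. The sketch below only builds settings bodies and checks strings; the helper names and the 30-second refresh interval used when restoring settings are assumptions for illustration.

```python
"""Sketch: leading-wildcard check and index settings for an initial bulk load."""


def has_leading_wildcard(term: str) -> bool:
    """True when a wildcard pattern starts with * or ?, which is costly to evaluate."""
    return term[:1] in ("*", "?")


def bulk_load_settings() -> dict:
    """Settings to apply before an initial load: no refresh, no replicas."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restore_settings(refresh_interval: str = "30s", replicas: int = 1) -> dict:
    """Settings to apply once the load finishes (values here are assumed)."""
    return {"index": {"refresh_interval": refresh_interval,
                      "number_of_replicas": replicas}}


if __name__ == "__main__":
    print(has_leading_wildcard("*bay"))
    print(bulk_load_settings())
    print(restore_settings())
```

The two settings bodies are meant to bracket the load: apply the first, index the data, then apply the second so that searches see fresh data and replicas are rebuilt.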
Performance Testing Tool
• Performance testing
  ‒ Testing data
  ‒ Testing scripts
  ‒ Test report for analysis
• Web-based tool
  ‒ Built on Gatling
  ‒ Web UI to select the testing scripts and testing data
  ‒ Test report for analysis
Challenges addressed: Performance
Solution and security plugin for Elasticsearch
• Pronto Security Plugin
  ‒ TLS for encrypted communications
  ‒ Cluster / index level RBAC control
  ‒ Follow eBay’s standard
    ‒ API key for applications
    ‒ 2FA for user login
    ‒ Audit logs
• Security Considerations
  ‒ Authentication / RBAC
  ‒ Certification retention
  ‒ Firewall / IP whitelist
  ‒ Vulnerability management
Challenges addressed: Cost
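Index-level RBAC, as listed above, comes down to matching an index name against the patterns a role is allowed to touch. The following sketch shows that idea in isolation; the `Role` model, the privilege names and `can_access_index` are hypothetical and do not reflect the Pronto security plugin's actual data model or API.

```python
"""Sketch: index-level RBAC check using index name patterns."""
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class Role:
    name: str
    # Each entry pairs an index pattern with the privileges granted on it.
    index_privileges: Tuple[Tuple[str, FrozenSet[str]], ...] = ()


def can_access_index(roles: Iterable[Role], index: str, privilege: str) -> bool:
    """Allow the request if any role grants `privilege` on a matching pattern."""
    for role in roles:
        for pattern, privileges in role.index_privileges:
            if fnmatchcase(index, pattern) and privilege in privileges:
                return True
    return False  # deny by default


if __name__ == "__main__":
    reader = Role("log-reader", (("logs-*", frozenset({"read"})),))
    print(can_access_index([reader], "logs-2019.12.07", "read"))
    print(can_access_index([reader], "logs-2019.12.07", "write"))
```

Denying by default keeps a missing or misconfigured role from silently granting access.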
X-Pack Subscription
• License cost
  ‒ License fee is based on the node count
• How to extend
  ‒ Develop the Kibana application
  ‒ Integrate with the alerting and anomaly detection service
Challenges addressed: Cost
Alerting Service
• Schedule
  ‒ A schedule for running a query and checking the condition.
• Query
  ‒ The query to run as input to the condition. Watches support the full Elasticsearch query and aggregation
• Condition
  ‒ A condition that determines whether or not to execute the actions. You can use simple conditions (always true), or use scripting for more sophisticated scenarios
• Action
  ‒ One or more actions, such as sending email or pushing data to 3rd party systems through a webhook
  ‒ Throttling
Biz Data Alert
Challenges addressed: Cost
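The four parts of a watch described above (schedule, query, condition, action) plus throttling fit together as a small evaluation loop. The sketch below models that loop in memory so the control flow is visible; the `Watch` class and its field names are hypothetical and are neither the eBay alerting service nor the X-Pack Watcher API.

```python
"""Sketch: schedule -> query -> condition -> action, with throttling."""
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Watch:
    interval_s: float                       # schedule: minimum time between runs
    query: Callable[[], Any]                # produces the payload to evaluate
    condition: Callable[[Any], bool]        # decides whether actions should run
    actions: List[Callable[[Any], None]]    # e.g. send email, call a webhook
    throttle_s: float = 0.0                 # minimum time between action runs
    _last_run: Optional[float] = None
    _last_fired: Optional[float] = None

    def tick(self, now: float) -> bool:
        """Evaluate the watch at time `now`; return True if actions were executed."""
        if self._last_run is not None and now - self._last_run < self.interval_s:
            return False                    # not due yet
        self._last_run = now
        payload = self.query()
        if not self.condition(payload):
            return False                    # condition not met
        if self._last_fired is not None and now - self._last_fired < self.throttle_s:
            return False                    # throttled: fired too recently
        for action in self.actions:
            action(payload)
        self._last_fired = now
        return True


if __name__ == "__main__":
    watch = Watch(60, lambda: {"errors": 12}, lambda p: p["errors"] > 10,
                  [print], throttle_s=300)
    for t in (0, 30, 60, 360):
        watch.tick(t)
```

Throttling is checked after the condition so that a noisy condition keeps being evaluated but does not flood recipients with repeated notifications.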
ML for Anomaly Detection
Challenges addressed: Cost
Review
• Easy onboarding and integration
• High Availability & High Performance
• Low cost for hardware / license fee / support efforts
Challenges addressed: Performance / HA / Onboarding & Integration / Cost