Solr & Fusion for Big Data
• Where does search fit in the big data landscape?
• Solr on HDFS
• Indexing strategies
• End-to-end security
• Lambda architecture
• Spark and how we use it in Fusion
The standard for enterprise search.
90% of the Fortune 500 uses Solr.
Why search for big data?
• Speed at scale
• Basic analytics (facets, pivot facets, facets + stats) + visualizations
• Query structured and unstructured data
• Ad hoc exploration is inherent in big data
• People grok search
• Context for aggregations (drill into the numbers)
Common use case: log analysis
• Time-ordered data
• Raw data stored in HDFS
• How much data? How fast?
• Access patterns?
• Schema design ~ no free lunch at scale
Time-Based Partitioning Scheme

[diagram: a Fusion log analytics dashboard querying daily log collections]
• One collection per day: logs_feb01 … logs_feb25, logs_feb26
• Every daily collection has 24 shards (h00-h23), each covering a 1-hour block of log messages
• Add replicas to support higher query volume and fault tolerance
• recent_logs (collection alias): use a collection alias to make multiple collections look like a single collection; this minimizes exposure to the partitioning strategy in the client layer
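The routing implied by this scheme can be sketched in a few lines. This is an illustrative helper (the naming convention is taken from the diagram; the function name is ours, not a Solr or Fusion API):

```python
from datetime import datetime, timezone

def route_log_event(ts: datetime) -> tuple:
    """Map a log timestamp to its daily collection and hourly shard.

    Follows the slide's scheme: one collection per day (e.g. logs_feb26)
    with 24 shards h00-h23, one per 1-hour block of log messages.
    """
    collection = "logs_" + ts.strftime("%b%d").lower()  # e.g. logs_feb26
    shard = "h%02d" % ts.hour                           # e.g. h14
    return collection, shard

# Clients query through the recent_logs alias and never need to know
# which daily collection actually holds a given event.
event_time = datetime(2016, 2, 26, 14, 35, tzinfo=timezone.utc)
print(route_log_event(event_time))  # → ('logs_feb26', 'h14')
```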
Solr on HDFS
• Maturing solution, still some issues
• My test showed ~23-25% slower than local SSD
• Better ROI, operational efficiency, security
• Needed for YARN
• Enables auto add replicas
• Interesting features coming soon: ZooKeeper lock (SOLR-8169) and replicas share index (SOLR-6237)
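For reference, storing the index in HDFS is enabled by swapping the directory factory in solrconfig.xml; a minimal sketch (the namenode host, port, and paths below are placeholders for your environment):

```xml
<!-- solrconfig.xml: store the index in HDFS instead of on local disk -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
</directoryFactory>

<!-- in <indexConfig>: use the HDFS lock type -->
<lockType>hdfs</lockType>
```

The block cache setting matters because, as the next diagram shows, each replica reads and writes HDFS blocks through an in-memory block cache.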
Solr on HDFS

[diagram] Each replica of shard1 (replica1 on HDFS DataNode A, replica2 on HDFS DataNode B) reads and writes its index through a block cache to HDFS. HDFS block replication copies index blocks across DataNodes A, B, and C, independently of Solr replication between the replicas.
Auto Add Replica

[diagram] The Overseer watches ZooKeeper for lost replicas. Because the index lives in HDFS (block-replicated across DataNodes A, B, and C), a replacement replica (shard1/replica3) can be added automatically on another node, with its own block cache, reading the index blocks already stored in HDFS; Solr replication then keeps the replicas in sync.
Indexing Strategies
• Many tools available!
• MapReduce indexer (Solr contrib)
• LWOutputFormat, Hive SerDe, Pig StoreFunc, HBase
• Storm to Solr or Fusion (github.com/LucidWorks/storm-solr)
• Spark to Solr or Fusion (github.com/LucidWorks/spark-solr)
• Lucidworks Fusion Connectors
Any Data. Any Source.
Fusion Indexing Pipelines in MapReduce

[diagram] N map tasks (one per HDFS block) read raw data and run each document through a Fusion indexing pipeline of 30+ index stages (field mapping, JavaScript, Tika parsing, NLP, regex, JDBC lookup). Each map task (or reducer, if needed) uses CloudSolrClient, which gets collection metadata from ZooKeeper (e.g. shard leader URLs) and sends updates to the shard leaders in parallel. Many common file formats are supported: CSV, SequenceFile, grok, XML, warc.
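The pipeline idea is just a chain of stage functions applied to each document. A toy model (deliberately not Fusion's actual stage API; field names and the mapping are illustrative) with a field-mapping stage and a regex stage:

```python
import re

# Each stage takes a document (a dict) and returns a transformed document.
def field_mapping_stage(doc):
    # Rename raw source fields to schema fields (mapping is illustrative).
    mapping = {"msg": "message_txt", "lvl": "level_s"}
    return {mapping.get(k, k): v for k, v in doc.items()}

def regex_stage(doc):
    # Extract an IP address from the message into its own field.
    m = re.search(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b", doc.get("message_txt", ""))
    if m:
        doc["client_ip_s"] = m.group(1)
    return doc

def run_pipeline(doc, stages):
    for stage in stages:
        doc = stage(doc)
    return doc

raw = {"msg": "GET /index.html from 10.0.0.42", "lvl": "INFO"}
indexed = run_pipeline(raw, [field_mapping_stage, regex_stage])
print(indexed["client_ip_s"])  # → 10.0.0.42
```

In the real setup each map task runs documents through such a chain, then hands the results to CloudSolrClient for routing to shard leaders.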
Security
• End-to-end security is now a reality for Hadoop
• Kerberos authentication (ZK, Solr, HDFS, jobs)
• Pluggable authorization framework
• Collection and document-level access controls (via Fusion)
• SSL
• Apache Ranger (centralized admin, auditing, monitoring for Hadoop)
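As one concrete piece of the list above, SSL for Solr is switched on via keystore settings in bin/solr.in.sh; a minimal sketch (paths and passwords are placeholders):

```shell
# bin/solr.in.sh: enable SSL for Solr client and inter-node traffic
SOLR_SSL_KEY_STORE=/opt/solr/etc/solr-ssl.keystore.jks
SOLR_SSL_KEY_STORE_PASSWORD=secret        # placeholder
SOLR_SSL_TRUST_STORE=/opt/solr/etc/solr-ssl.keystore.jks
SOLR_SSL_TRUST_STORE_PASSWORD=secret      # placeholder
```

Kerberos and authorization plugins are configured separately per component (ZK, Solr, HDFS, jobs).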
Cluster Sizing Worksheet
• There is no formula, only guidelines!
• # of documents / avg. doc size / number of fields
• Updates per second / soft-commit frequency
• Storage type (local SSD vs. HDFS)
• Sharding scheme (time-based vs. hash-based)
• Peak QPS / 95th percentile response time / query complexity
• Must test your data on your servers ;-)
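The slide is emphatic that there is no formula, so treat the following only as a back-of-envelope starting point for the storage row of the worksheet (the function, its default ratios, and the example numbers are all ours):

```python
def rough_index_storage_gb(num_docs, avg_doc_kb, index_ratio=1.0, replication_factor=2):
    """Back-of-envelope index storage estimate, in GB.

    index_ratio is an assumed index-size-to-raw-size ratio; real ratios
    vary widely with field types, stored fields, and analysis, so this
    only frames a capacity test, it does not replace one.
    """
    raw_gb = num_docs * avg_doc_kb / (1024 * 1024)
    return raw_gb * index_ratio * replication_factor

# e.g. 500M docs averaging 2 KB, indexed at ~1x raw size, 2 replicas
print(round(rough_index_storage_gb(500_000_000, 2), 1))  # → 1907.3
```

Whatever the estimate says, the last bullet still applies: test your data on your servers.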
Lambda Architecture
• A search engine fits perfectly with the lambda architecture
• Use the batch layer to build indexes instead of "views"
• The speed layer uses Spark Streaming to build a near real-time index
• Aggregation collections for historical data

source: http://lambda-architecture.net/
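The batch/speed split above can be modeled minimally: at query time the serving layer merges the batch-built view with the streaming view. This is a toy model of the pattern, not Fusion or Spark code; all names and numbers are illustrative:

```python
# Toy lambda serving layer: merge batch and speed views at query time.
# The batch view is rebuilt periodically from all data in HDFS; the
# speed view holds only events that arrived since the last batch run.
batch_view = {"errors_feb25": 120, "errors_feb26": 87}  # from the batch layer
speed_view = {"errors_feb26": 5}                        # from Spark Streaming

def query(metric):
    # Speed-view counts are deltas on top of the batch view.
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

print(query("errors_feb26"))  # → 92
```

With search as the serving layer, the "views" are Solr indexes: the batch layer rebuilds daily/aggregation collections while the speed layer indexes fresh events in near real time.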
Spark

[diagram: the Spark stack]
• Execution engine: Spark Core (execution model, the shuffle, caching)
• Libraries: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (BSP)
• Cluster management: Hadoop YARN, Mesos, Standalone
• Storage / shared memory: HDFS, Tachyon
• Languages: Scala, Java, Python, R
The most relevant results every single time.
Massive scale. Real-time. Secure.
Any data. Any source.
Lucidworks Is Search
Any questions?
• Try Fusion (download): http://lucidworks.com/products/fusion/
• LinkedIn / Twitter / Solr JIRA: @thelabdude