The Internet in a Database: A Cassandra Use Case

An exploration of the potential and challenges of using Cassandra.

Page 1: The Internet in Database: A Cassandra Use Case

The Internet in a Database: A Cassandra Use Case

Page 2: The Internet in Database: A Cassandra Use Case

Data on the Web

DATAFINITI • THE SEARCH ENGINE FOR DATA • WWW.DATAFINITI.NET

● 48 billion pages on the Internet

● 56 million GB of data

● Incredibly powerful connections

● 70% of useful data is unstructured

● User generated data + facts

Page 3: The Internet in Database: A Cassandra Use Case

Too Much Data…

Page 4: The Internet in Database: A Cassandra Use Case

Search

● Modern search engines

○ Unstructured data

○ Unconnected data

○ Unnormalized data

Page 5: The Internet in Database: A Cassandra Use Case

● Goals

○ Collect vast amounts of data through web crawling

○ Normalize and deduplicate data

○ Make it searchable and meaningful
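
A minimal sketch of the normalize-and-deduplicate goal, assuming crawled records arrive as plain dictionaries. The field names, the sample records, and the choice of (name, city) as the identity key are hypothetical and not taken from the slides.

```python
import re

def normalize(record):
    """Lowercase, trim, and collapse whitespace in every string field."""
    clean = {}
    for field, value in record.items():
        if isinstance(value, str):
            value = re.sub(r"\s+", " ", value).strip().lower()
        clean[field] = value
    return clean

def deduplicate(records, key_fields):
    """Keep one merged record per key; later non-empty values fill gaps."""
    merged = {}
    for record in map(normalize, records):
        key = tuple(record.get(f) for f in key_fields)
        target = merged.setdefault(key, {})
        for field, value in record.items():
            if value not in (None, "") and field not in target:
                target[field] = value
    return list(merged.values())

# Hypothetical crawl output: the first two rows describe the same place.
crawled = [
    {"name": "Joe's  Pizza ", "city": "Austin", "phone": ""},
    {"name": "joe's pizza", "city": "AUSTIN", "phone": "555-0100"},
    {"name": "Taco Hut", "city": "Austin", "phone": None},
]
unique = deduplicate(crawled, key_fields=("name", "city"))
```

Normalizing before building the key is what lets the two differently formatted rows collapse into one record.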

Page 6: The Internet in Database: A Cassandra Use Case

Needs

● Speed

● Scale

● Adaptability

Page 7: The Internet in Database: A Cassandra Use Case

The Solution

● Very fast

○ Log-structured storage

● Easily scalable

○ Decentralized rings

● Completely adaptable

○ Schema-less key/value store

Page 8: The Internet in Database: A Cassandra Use Case

…Almost

● Useful searching was missing

○ Secondary indexes not flexible

○ No free text searches

○ No (reasonable) range queries

Page 9: The Internet in Database: A Cassandra Use Case

What We Needed

● Pros: Full control over indexing

● Cons: Not scalable

Page 10: The Internet in Database: A Cassandra Use Case

Putting It All Together

● Reasons to go with DSE

○ Combines Cassandra and Solr

○ Constant refinements and integrations

○ Support

Page 11: The Internet in Database: A Cassandra Use Case

Our Stack

[Diagram: Web Crawling, Normalization, three Cassandra + Solr nodes, Load Balancing]

Page 12: The Internet in Database: A Cassandra Use Case

Cassandra / Solr Setup

● 3 column families / 3 cores

○ Locations

○ Products

○ People

● 73,114,909 records

Page 13: The Internet in Database: A Cassandra Use Case

Data: Locations

● 29,818,644 records

● Interesting data

○ Reviews

○ Revenue

○ Contact information

● Businesses vs. Locations

○ Unique key

○ Location specific user data

Page 14: The Internet in Database: A Cassandra Use Case

Data: Products

● 18,470,005 records

● Interesting data

○ Categories

○ Price

○ Reviews

● Challenges

○ Too many unique keys

Page 15: The Internet in Database: A Cassandra Use Case

Data: People

● 24,826,260 records

● Interesting data

○ Work History

○ Education History

○ Location

● Challenges

○ Normalization

○ Identification

Page 16: The Internet in Database: A Cassandra Use Case

Challenges

● Memory

● Speed

● Space

● Representation

Page 17: The Internet in Database: A Cassandra Use Case

Challenges: Memory

● Multi-minute garbage collection

● Exponential increase in frequency

● Virtual memory confusion

● Solr + Cassandra

● Heap Size vs Buffer Cache

● Bash Scripts

Page 18: The Internet in Database: A Cassandra Use Case

Challenges: Speed

● Upgrade

○ Better memory management

○ Smaller index size

● Reduce index size

● Future: Solaris

Page 19: The Internet in Database: A Cassandra Use Case

Challenges: Speed

● Providing a real-time service

● Issues

○ Solr not inherently real time

○ Search speeds

○ I/O

Page 20: The Internet in Database: A Cassandra Use Case

Challenges: Speed

● Solr Solution: DSE integration leverages

○ Cassandra's speed

○ Cassandra's caches

○ Cassandra's distribution

○ Solr caches less useful

Page 21: The Internet in Database: A Cassandra Use Case

Challenges: Speed

● Search complexity solution

○ Text vs String indexing

○ Uniqueness vs Flexibility

○ Leveraging Cassandra
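
To illustrate the text vs string trade-off, here is a toy sketch rather than Solr itself. In Solr, a string field is indexed as a single verbatim term, while a text field is split into terms by an analyzer. The two builder functions and the sample documents below are hypothetical stand-ins for that behavior.

```python
from collections import defaultdict

def build_string_index(docs, field):
    """Exact-match index: the whole field value is a single term."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        index[doc[field]].add(doc_id)
    return index

def build_text_index(docs, field):
    """Tokenized index: each lowercased word is its own term."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in doc[field].lower().split():
            index[token].add(doc_id)
    return index

# Hypothetical product names.
docs = {
    1: {"name": "Blue Widget"},
    2: {"name": "Red Widget"},
}
string_index = build_string_index(docs, "name")
text_index = build_text_index(docs, "name")

exact_hits = string_index.get("Blue Widget", set())  # matches only the full value
word_hits = text_index.get("widget", set())          # matches any doc with the word
string_terms = len(string_index)                     # one term per distinct value
text_terms = len(text_index)                         # one term per distinct word
```

The string index answers uniqueness-style lookups with one term per value, while the text index buys flexible word search at the cost of more terms per document.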

Page 22: The Internet in Database: A Cassandra Use Case

Challenges: Speed

● I/O Solution

○ Cassandra's built-in mapping

○ Increase disk access speeds (SSDs)

■ Not cost effective

○ Future: Solaris

Page 23: The Internet in Database: A Cassandra Use Case

Challenges: Space

● Field corruption

○ Caused by improper encoding

○ Exponential growth

○ Fills up Solr index

● Locate, inspect & remove corrupt records
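
One possible way to locate suspect records, sketched under assumptions: the slides do not say how corruption was detected, so the two heuristics here (a Unicode replacement character left behind by a failed decode, and an oversized field) and the size threshold are hypothetical.

```python
REPLACEMENT_CHAR = "\ufffd"   # emitted when bytes fail to decode
MAX_FIELD_CHARS = 10_000      # hypothetical size limit, tune per field

def find_suspect_fields(record):
    """Return names of string fields that look corrupt or oversized."""
    suspects = []
    for field, value in record.items():
        if not isinstance(value, str):
            continue
        if REPLACEMENT_CHAR in value or len(value) > MAX_FIELD_CHARS:
            suspects.append(field)
    return suspects

def split_records(records):
    """Separate clean records from those needing inspection."""
    clean, suspect = [], []
    for record in records:
        (suspect if find_suspect_fields(record) else clean).append(record)
    return clean, suspect

raw = b"Caf\xe9 Central"                          # Latin-1 bytes
bad_name = raw.decode("utf-8", errors="replace")  # decoded with the wrong codec
records = [
    {"id": "a1", "name": "Cafe Central"},
    {"id": "a2", "name": bad_name},
    {"id": "a3", "name": "x" * 20_000},
]
clean, suspect = split_records(records)
```

Records in `suspect` would then be inspected by hand or removed before they reach the index.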

Page 24: The Internet in Database: A Cassandra Use Case

Challenges: Space

● Solr index issue

○ No compression (vs Cassandra)

○ Must adjust indexing

● Key things to keep in mind

○ Size of fields

○ Scale vs Flexibility

○ Index as little as possible

Page 25: The Internet in Database: A Cassandra Use Case

Challenges: Representation

● Cassandra is flat

● Actual data is not flat

○ Reviews

○ Price information

● Many different output formats

○ CSV, JSON, XML, etc.

Page 26: The Internet in Database: A Cassandra Use Case

Challenges: Representation

● Solution: Flatten when possible

○ E.g. Address object -> Separate fields

● Internal subgroup representation

○ Composite keys (Occasionally)

■ Known subgroups

■ Non multiple subgroups

○ Dynamic fields

■ Composite field + Dynamic tag

■ E.g. review.text_<tag>
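
The flattening idea can be sketched as follows. This is an illustration under assumptions, not the authors' code: the slides do not say what the dynamic tag is, so the sketch uses the list position as the tag, and the record layout is hypothetical.

```python
def flatten(record):
    """Flatten nested dicts to dotted names; tag list items per entry."""
    flat = {}
    for field, value in record.items():
        if isinstance(value, dict):
            # Object -> separate fields, e.g. address.city
            for sub, sub_value in value.items():
                flat[f"{field}.{sub}"] = sub_value
        elif isinstance(value, list):
            # Repeated subgroup -> composite field + dynamic tag,
            # e.g. review.text_0 (tag = list position, an assumption)
            for tag, item in enumerate(value):
                for sub, sub_value in item.items():
                    flat[f"{field}.{sub}_{tag}"] = sub_value
        else:
            flat[field] = value
    return flat

# Hypothetical location record with a nested address and repeated reviews.
location = {
    "name": "Joe's Pizza",
    "address": {"street": "1 Main St", "city": "Austin"},
    "review": [
        {"text": "Great slices", "rating": 5},
        {"text": "Slow service", "rating": 2},
    ],
}
flat = flatten(location)
```

Each review keeps its fields grouped by the shared tag, so the original subgroup can be rebuilt from the flat row.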

Page 27: The Internet in Database: A Cassandra Use Case

Challenges: Representation

● Robust and adaptable conversion package

● JSON -> Internal

○ Solr returns JSON

● Internal -> CSV, JSON, XML

○ User defined views

○ Specify field groupings

○ Specify partitioning
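
A rough sketch of such a conversion pipeline using only the Python standard library. The function names, the view format (a plain list of field names), and the sample JSON are hypothetical. A real Solr response wraps documents in a larger envelope, which is simplified here to a bare list.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def apply_view(records, fields):
    """Project each record onto the fields named by a user-defined view."""
    return [{f: r.get(f, "") for f in fields} for r in records]

def to_json(records):
    return json.dumps(records)

def to_csv(records, fields):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

def to_xml(records, root_tag="records", item_tag="record"):
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for field, value in record.items():
            ET.SubElement(item, field).text = str(value)
    return ET.tostring(root, encoding="unicode")

# Stand-in for the JSON returned by a search (simplified, hypothetical).
solr_json = '[{"name": "Taco Hut", "city": "Austin", "phone": "555-0101"}]'
internal = json.loads(solr_json)      # JSON -> internal
view = ["name", "city"]               # user-defined view
rows = apply_view(internal, view)
csv_out = to_csv(rows, view)          # internal -> CSV
xml_out = to_xml(rows)                # internal -> XML
json_out = to_json(rows)              # internal -> JSON
```

Routing every format through one internal representation means a new output format needs only one new writer.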

Page 28: The Internet in Database: A Cassandra Use Case

Page 29: The Internet in Database: A Cassandra Use Case

Future Work

● Memory Usage

● Speed

● Space

● Containers

Page 30: The Internet in Database: A Cassandra Use Case

Future Work: Memory

● Java 7 G1 (Garbage First) Collector

○ Ideal for large heaps

○ Big Data Sets

○ Bursty Workloads

Page 31: The Internet in Database: A Cassandra Use Case

Future Work: Speed

● Solaris Kernel Scheduler > Linux Kernel Scheduler

○ (At large number of cores)

● Drastically increase IOPS

○ Cache reads (L2ARC) on PCIe SSD (~800 MB/s)

○ Cache writes (ZIL) on PCIe SSD (~800 MB/s)

○ Reduce needed size of SSD

■ More smaller SSDs in ZFS pool

○ Fewer moving parts

Page 32: The Internet in Database: A Cassandra Use Case

Future Work: Space

● Caching at PCIe, Storing on SATA III

○ Cheaper larger storage via ZFS pools

○ Easier to grow

● ZFS Compression (LZ4)

○ Replaces Cassandra's Snappy compression

○ Very fast lossless compression (400 MB/s per core)

○ Scales to multiple CPUs

○ Hits the RAM speed limit

Page 33: The Internet in Database: A Cassandra Use Case

Future Work: Containers

● OS Level virtualization

○ Resource control

○ Boundary separation

● More control over Cassandra resources

● Better snapshots (whole machine)

● Hardware abstracted out

○ Many disks represented as single space

○ Easily add or remove hardware

Page 34: The Internet in Database: A Cassandra Use Case

Questions?

https://www.datafiniti.net

http://blog.datafiniti.net

@datafiniti

Page 35: The Internet in Database: A Cassandra Use Case

Addendum 1

ZFS Comparison

Name              Ratio   Compression (MB/s)   Decompression (MB/s)
LZ4 (r97)         2.084   410                  1810
LZO 2.06          2.106   409                  600
QuickLZ 1.5.1b6   2.237   373                  420
Snappy 1.1.0      2.091   323                  1070
LZF               2.077   270                  570
zlib 1.2.8 -1     2.730   65                   280
LZ4 HC (r97)      2.720   25                   2040
zlib 1.2.8 -6     3.099   21                   300
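
To make the trade-off in these figures concrete, the following sketch turns three rows of the table into estimated stored size and single-core time. It assumes the throughput figures are measured against uncompressed input, and the 1 GB workload is a hypothetical example. Real results depend on the data being compressed.

```python
# (name, compression ratio, compress MB/s, decompress MB/s) from the table above
BENCHMARKS = [
    ("LZ4 (r97)", 2.084, 410, 1810),
    ("Snappy 1.1.0", 2.091, 323, 1070),
    ("zlib 1.2.8 -6", 3.099, 21, 300),
]

def estimate(raw_mb, ratio, compress_mbps, decompress_mbps):
    """Estimate stored size and single-core times for raw_mb of input."""
    return {
        "stored_mb": raw_mb / ratio,
        "compress_s": raw_mb / compress_mbps,
        "decompress_s": raw_mb / decompress_mbps,
    }

RAW_MB = 1024  # hypothetical 1 GB of input
estimates = {name: estimate(RAW_MB, r, c, d) for name, r, c, d in BENCHMARKS}
for name, e in estimates.items():
    print(f"{name:14s} {e['stored_mb']:7.1f} MB  "
          f"{e['compress_s']:6.2f} s  {e['decompress_s']:5.2f} s")
```

Under these assumptions zlib at level 6 stores the least data but takes roughly twenty times longer to compress than LZ4, which is the trade-off behind choosing LZ4.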