Wenjin Gu [email protected] Modern software design in Big data era

Modern software design in Big data era


Page 1: Modern software design in Big data era

Wenjin [email protected]

Modern software design in Big data era

Page 2: Modern software design in Big data era

A quick Demo

A simple Java program runs about 10 times faster after a few dummy padding variables are added at the end of the class declaration.

Page 3: Modern software design in Big data era

False sharing

Page 4: Modern software design in Big data era

Numbers Everyone Should Know (taken from Jeff Dean's Google keynote)

• L1 cache reference: 0.5 ns
• Branch mispredict: 5 ns
• L2 cache reference: 7 ns
• Mutex lock/unlock: 25 ns
• Main memory reference: 100 ns
• Compress 1 KB with a cheap compression algorithm: 3,000 ns
• Send 2 KB over a 1 Gbps network: 20,000 ns
• Read 1 MB sequentially from memory: 250,000 ns
• Round trip within same datacenter: 500,000 ns
• Disk seek: 10,000,000 ns
• Read 1 MB sequentially from disk: 20,000,000 ns
• Send packet CA -> Netherlands -> CA: 150,000,000 ns

Page 5: Modern software design in Big data era
Page 6: Modern software design in Big data era

Some facts

• L1 << L2 << RAM << Disk
• Sequential access is much faster than random access (10x+)
• Cheap compression is faster than transferring the data over the network
• Gbps < Disk < 100 Mbps

Zippy: encode @ 300 MB/s, decode @ 600 MB/s, 2-4x compression
gzip: encode @ 25 MB/s, decode @ 200 MB/s, 4-6x compression

https://code.google.com/p/snappy/
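The "compression is faster than transfer" fact above can be checked with back-of-envelope arithmetic. The sketch below uses the slide's own numbers (1 Gbps link, Zippy encoding at 300 MB/s, ~3x compression); the class and method names are illustrative, not from any library.

```java
// Back-of-envelope: is compress-then-send faster than sending raw bytes?
public class CompressOrNot {
    // Time to push `mb` megabytes over a `gbps` link, in milliseconds.
    public static double rawSendMs(double mb, double gbps) {
        return mb * 8.0 / gbps;              // 1 MB = 8 Mb; 1 Gbps = 1000 Mb/s
    }

    // Compress first at `encodeMBps`, then send the payload shrunk by `ratio`.
    public static double compressSendMs(double mb, double encodeMBps,
                                        double ratio, double gbps) {
        double compressMs = mb / encodeMBps * 1000.0;
        return compressMs + rawSendMs(mb / ratio, gbps);
    }
}
```

With the slide's numbers, sending 1 MB raw over 1 Gbps takes ~8 ms, while Zippy-compressing it (~3.3 ms) and sending a third of the bytes (~2.7 ms) totals ~6 ms; the gap widens on slower links.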

Page 7: Modern software design in Big data era

Key to Performance - Improve Memory Efficiency

Java is bad at memory efficiency: int (4 bytes) -> Integer (16 bytes). Always prefer primitive types, but map keys must be Objects.

1M records, each record with 5 string fields (82 MB of raw data):
a. Map<String, Map<String, String>>: 706 MB
b. Map<String, String[]>: 495 MB
c. Map<String, byte[][]>: 292 MB
d. ByteBuffer + Trove map: 92 MB

http://java-performance.info/overview-of-memory-saving-techniques-java/
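One of the cheaper techniques above, storing each record's fields as raw byte arrays instead of String objects, can be sketched as follows (class and method names are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Sketch of the Map<String, byte[][]> technique: each record's fields are
// kept as UTF-8 byte arrays, avoiding the per-String object header, the
// char[]/byte[] wrapper, and (pre-Java 9) 2-byte chars.
public class CompactRecords {
    private final Map<String, byte[][]> records = new HashMap<>();

    public void put(String key, String... fields) {
        byte[][] packed = new byte[fields.length][];
        for (int i = 0; i < fields.length; i++) {
            packed[i] = fields[i].getBytes(StandardCharsets.UTF_8);
        }
        records.put(key, packed);
    }

    // Decode a single field back to a String only when it is actually read.
    public String getField(String key, int index) {
        byte[][] packed = records.get(key);
        return packed == null ? null : new String(packed[index], StandardCharsets.UTF_8);
    }
}
```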

Page 8: Modern software design in Big data era

Bloom Filter – Hash without value

Question: How to support remove?
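One standard answer to the question is a counting Bloom filter: each slot holds a small counter instead of a single bit, so remove() can decrement exactly what add() incremented. The sketch below uses a toy double-hashing scheme built on hashCode(), not a production-grade hash.

```java
// Counting Bloom filter: supports remove() at the cost of int counters
// instead of bits. The caller must only remove items it actually added,
// otherwise counters go wrong and false negatives become possible.
public class CountingBloomFilter {
    private final int[] counters;
    private final int numHashes;

    public CountingBloomFilter(int size, int numHashes) {
        this.counters = new int[size];
        this.numHashes = numHashes;
    }

    // Toy double hashing: slot_i = h1 + i * h2 (mod table size).
    private int slot(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1;   // force an odd second hash
        return Math.floorMod(h1 + i * h2, counters.length);
    }

    public void add(String item) {
        for (int i = 0; i < numHashes; i++) counters[slot(item, i)]++;
    }

    public boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (counters[slot(item, i)] == 0) return false;
        }
        return true;
    }

    public void remove(String item) {
        for (int i = 0; i < numHashes; i++) counters[slot(item, i)]--;
    }
}
```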

Page 9: Modern software design in Big data era

Merkle Tree (Tree of Hash)

Cassandra gossip
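A minimal sketch of the idea: leaves are hashes of data blocks and each parent is the hash of its children, so two replicas can compare root hashes first and descend only into subtrees that differ (this is how Cassandra's anti-entropy repair narrows down mismatched ranges). Class and method names are illustrative.

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Merkle tree root: hash the leaves, then repeatedly hash pairs of nodes
// until one root remains. An odd node is paired with itself.
public class MerkleRoot {
    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (Exception e) {
            throw new RuntimeException(e);   // SHA-256 is always available
        }
    }

    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    public static byte[] root(List<byte[]> blocks) {
        List<byte[]> level = new ArrayList<>();
        for (byte[] b : blocks) level.add(sha256(b));
        while (level.size() > 1) {
            List<byte[]> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                byte[] left = level.get(i);
                byte[] right = (i + 1 < level.size()) ? level.get(i + 1) : left;
                next.add(sha256(concat(left, right)));
            }
            level = next;
        }
        return level.get(0);
    }
}
```

Equal data produces equal roots; a single changed block changes the root, and only log(n) hashes need to be exchanged to locate it.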

Page 10: Modern software design in Big data era

Data locality – Key to Performance

• On the cache level, the CPU always fetches data at cache-line granularity (64 bytes at a time)

Place variables used by the same thread near each other. Place variables used by different threads at least 64 bytes apart (Java 8 introduced @Contended).

http://daniel.mitterdorfer.name/articles/2014/false-sharing/
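The padding trick from the opening demo can be sketched like this, assuming a 64-byte cache line. Field layout is ultimately up to the JVM, so @Contended (with -XX:-RestrictContended) is the reliable way to get this effect; the manual version below is a best-effort illustration.

```java
// Two hot counters updated by different threads. Without the dummy longs
// they would likely land on the same 64-byte cache line, and every write
// by one thread would invalidate the other thread's cached copy (false
// sharing). Seven longs = 56 bytes of padding between the hot fields.
public class PaddedCounters {
    public volatile long counterA;
    long p1, p2, p3, p4, p5, p6, p7;   // padding, never read
    public volatile long counterB;
}
```

Each thread then increments only its own counter; with the padding in place the two fields sit on different cache lines and the threads stop ping-ponging the line between cores.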

Page 11: Modern software design in Big data era

Data locality – Key to Performance

• On the memory and disk level, reusing the same data set repeatedly is faster thanks to warm caches

• On the disk level, sequential access is ~10 times faster than random access => write data sequentially in blocks. Examples: commit logs, Bigtable row ranges

Page 12: Modern software design in Big data era

Data locality – Key to Performance

• On the network level, data locality means computing where the data lives: instead of moving data to the computation, move the computation to the data. (CPU cycles are cheap relative to network bandwidth, so shipping code is cheaper than shipping data.)

Page 13: Modern software design in Big data era

Data Decoupling – key to Scalability

Model data from the reader/writer perspective to eliminate hotspots, instead of grouping data conceptually.

Examples:
• Unlike many traditional file systems, GFS does not have a per-directory data structure that lists all the files in that directory. GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. (agent group, access group vs. agent skills)

• Column(-family)-based databases. Anti-pattern: user settings in CfgPerson

Page 14: Modern software design in Big data era

Data Decoupling – key to Scalability

Normalization or denormalization? That is the question.

We have been taught for decades that normalization is good: smaller size + consistency.

But it creates strong data coupling => hard to scale.

Page 15: Modern software design in Big data era

Data Immutability – key to Scalability

• Always available, no contention
• Always consistent, no need to synchronize
• Can be replicated freely whenever needed

Page 16: Modern software design in Big data era

Data Immutability – key to Scalability

• Append instead of update (GFS)
• Merge instead of update (SSTable)
• Add a tombstone instead of deleting (Cassandra)

Page 17: Modern software design in Big data era

SSTable

• SSTable: an immutable sorted string table; its index table is always kept in memory

• Merge (compaction) removes tombstones

Page 18: Modern software design in Big data era

SSTable (LSM-Tree)

• Commit log (per node): sequential writes to maximize write throughput (vs. B+ tree in-place updates)

• SSTable (per column family): an immutable sorted string table; its index table is always kept in memory

• Merge (compaction) removes tombstones
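A toy version of the merge step: combine an older and a newer sorted table, let the newer value win for duplicate keys, and drop entries whose newest value is a tombstone. The sentinel string stands in for what real systems do with typed markers plus timestamps.

```java
import java.util.TreeMap;

// SSTable compaction sketch: TreeMap plays the role of a sorted string
// table. Merging newer over older resolves duplicates; removing tombstone
// values is what finally reclaims deleted rows.
public class SSTableMerge {
    public static final String TOMBSTONE = "__TOMBSTONE__";

    public static TreeMap<String, String> merge(TreeMap<String, String> older,
                                                TreeMap<String, String> newer) {
        TreeMap<String, String> merged = new TreeMap<>(older);
        merged.putAll(newer);                        // newer table wins
        merged.values().removeIf(TOMBSTONE::equals); // compaction drops deletes
        return merged;
    }
}
```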

Page 19: Modern software design in Big data era

Shared nothing architecture

• Nodes are independent and self-sufficient
• No single point of contention across the system
• The invention of the DHT

Page 20: Modern software design in Big data era

Hashing is great, but its inconsistency is a showstopper: with plain modulo hashing, adding or removing a node remaps almost every key.

Page 21: Modern software design in Big data era

Consistent Hashing - two kinds of objects (nodes and data) meet in one keyspace

Karger (MIT, 2001 - Chord)

Cassandra, MapReduce
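A minimal ring sketch: both nodes and keys hash into the same keyspace, and a key is owned by the first node clockwise from it. Virtual replicas smooth the distribution when a node joins or leaves. String.hashCode() stands in for a real hash such as MurmurHash; the class name is illustrative.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Consistent hash ring backed by a TreeMap keyed on hash positions.
// Adding or removing a node only remaps the keys it owned.
public class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int replicas;   // virtual nodes per physical node

    public HashRing(int replicas) {
        this.replicas = replicas;
    }

    public void addNode(String node) {
        for (int i = 0; i < replicas; i++) {
            ring.put((node + "#" + i).hashCode(), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < replicas; i++) {
            ring.remove((node + "#" + i).hashCode());
        }
    }

    // Walk clockwise from the key's position; wrap around at the end.
    public String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(key.hashCode());
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }
}
```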

Page 22: Modern software design in Big data era

HRW hashing

An alternative solution: hash both the data object and every host, then pick the best fit.

w1 = h(S1, O), w2 = h(S2, O), ..., wn = h(Sn, O)

Winner: wO = max {w1, w2, ..., wn}

David Thaler and Chinya Ravishankar (University of Michigan, 1996)
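The formula above translates almost directly into code: score every (node, object) pair and keep the highest weight. The combined hashCode stands in for the real hash h(S, O); names are illustrative.

```java
import java.util.List;

// HRW (highest random weight, a.k.a. rendezvous) hashing: no ring needed,
// every client computes the same winner independently, and when a node
// disappears only the keys it won move elsewhere.
public class RendezvousHash {
    public static String nodeFor(String key, List<String> nodes) {
        String winner = null;
        long best = Long.MIN_VALUE;
        for (String node : nodes) {
            long w = (node + "|" + key).hashCode();   // w_i = h(S_i, O)
            if (w > best) {
                best = w;
                winner = node;
            }
        }
        return winner;
    }
}
```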

Page 23: Modern software design in Big data era

MapReduce: the post office model

Page 24: Modern software design in Big data era

MapReduce

map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)

Word count:
1. Split the file
2. map: (void, line) → list(word, 1)
3. Shuffle
4. reduce: (word, list(1)) → list(word, count)
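The word-count flow can be mimicked in a single process with Java streams; it is not distributed, but the three phases line up with the slide: flatMap is the map step, groupingBy is the shuffle, counting is the reduce.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Single-process word count mirroring map -> shuffle -> reduce.
public class WordCount {
    public static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))  // map: line -> (word, 1)
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(                      // shuffle by word
                        w -> w,
                        Collectors.counting()));                     // reduce: sum the 1s
    }
}
```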

Page 25: Modern software design in Big data era

Apache Spark

Page 26: Modern software design in Big data era

Apache Spark

• Developed by Berkeley AMPLab

• Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk

• Resilient Distributed Datasets (RDDs)

Page 27: Modern software design in Big data era

Resilient Distributed Datasets (RDDs)

• Hadoop MapReduce goes through the disk -> slow. RDDs form a distributed in-memory model -> fast.

• Traditional distributed shared memory supports fine-grained updates -> fault tolerance requires extensive logging or replication. RDDs are immutable and created by coarse-grained transformations (map, join, filter) -> lost partitions can be rebuilt quickly from their lineage.

Page 28: Modern software design in Big data era

Other interesting algorithms

• HyperLogLog (Cassandra)
• Skip list (Lucene, Redis, LevelDB)
• MurmurHash (Google, Cassandra)
• Ball tree (Google Maps)
• Fractal tree (MySQL, MongoDB)
• Dynamic Time Warping

Page 29: Modern software design in Big data era

Check list

• Calculate performance in your design
• Estimate data size before you build it
• Good designs are always tailored
• Know your tools (Guava, GS Collections, protobuf, Snappy…)
• Share with others