Neptune is a distributed, large-scale structured data store and an open source project implementing Google's Bigtable.
Neptune
hjkim
http://www.openneptune.com
http://dev.naver.com/projects/neptune (Korean)
Neptune
- Distributed data storage: a semi-structured data store (not a file system)
- Uses a distributed file system for its data files
- Supports real-time and batch processing
- Google Bigtable clone: data model, architecture, features
- Open source: http://dev.naver.com/projects/neptune (Korean), http://www.openneptune.com
Goal
- 500 nodes
- 200 GB or more per node, scaling to petabytes
Features
- Schema management: create, drop, and modify table schemas
- Real-time transactions: single-row operations (no join, group by, order by); multi-row operations: like, between
- Batch transactions: Scanner, Direct Uploader, MapReduce adapter
- Scalability: automatic tablet split & re-assignment
- Reliability: data files stored in a distributed file system; commit log stored in the ChangeLog cluster
- Failover: tablet takeover time of at most 1 minute
- Utilities: Web Console, Shell (simple queries), Data Verifier
Architecture
[Architecture diagram: user applications and MapReduce jobs talk to the Neptune Master and TabletServers #1..#n; tables are physically stored in a distributed file system (Hadoop or other).]
Components
[Deployment diagram: each node #1..#n co-locates a DFS DataNode, a Map&Reduce computing slot, and a Neptune TabletServer over local disk. A Neptune Master coordinates the TabletServers, with failover/events handled by a lock server (NChubby/Pleidas or ZooKeeper). Neptune Clients and the Shell access tables through NTable and Scanner over data/control channels; LogServers #1..#n form the ChangeLog cluster.]
Data Model
[Data model diagram: a table is a sequence of rows sorted by rowKey and is split into tablets (TabletA-1..TabletA-n), each covering a contiguous row range. A row holds one or more columns; within a column, cells are sorted by columnKey; each cell has a key (Cell.Key) and multiple timestamped values (Cell.Value(t1)..Cell.Value(tn)).]
- Sorted by rowKey
- Sorted by columnKey
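The nested, sorted layout above can be sketched as plain sorted maps. This is a minimal illustration, not Neptune's actual classes; all names here are assumptions.

```java
import java.util.*;

// Sketch of the data model: table -> rows (sorted by rowKey) ->
// columns -> cells (sorted by cellKey) -> timestamped values.
public class DataModelSketch {
    // cell key -> (timestamp -> value), newest timestamp first
    static class Column extends TreeMap<String, NavigableMap<Long, byte[]>> {}
    // column name -> column
    static class Row extends TreeMap<String, Column> {}
    // row key -> row, sorted by rowKey as in the diagram
    static final NavigableMap<String, Row> table = new TreeMap<>();

    public static void main(String[] args) {
        Row row = new Row();
        Column col1 = new Column();
        NavigableMap<Long, byte[]> versions =
            new TreeMap<>(Collections.reverseOrder());
        versions.put(1L, "v1".getBytes());   // older value at t1
        versions.put(2L, "v2".getBytes());   // newer value at t2
        col1.put("CK1", versions);
        row.put("col1", col1);
        table.put("RK1", row);

        // the newest value for (RK1, col1, CK1) is the first timestamp entry
        byte[] latest = table.get("RK1").get("col1").get("CK1")
                             .firstEntry().getValue();
        System.out.println(new String(latest)); // prints "v2"
    }
}
```

Because every level is a sorted map, range scans over row keys or column keys fall out naturally.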
Index
[Index diagram: a three-level lookup. The Root Index points to the Meta Tablet index; Meta Tablet records (e.g. M.T1.1000:M1, M.T1.2000:M2, up to Max:mn) point to user tablets (e.g. T1.100:U1, T1.200:U2, T1.1000:UN, T1.2000:UN, T1.1100:U1, T1.1200:U2). Each tablet's TableMapFile (a physical file, sorted by rowKey and columnKey) carries a per-block index of (max-key, file-offset) entries, so a lookup scans at most one 64 KB block. Index record format: Key = TableName.MaxRowKey; Value = tablet name and assigned host. Each tablet's column data/index files live in HDFS.]
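Given the record format above (Key = TableName.MaxRowKey, Value = tablet name and assigned host), locating a row's tablet is a ceiling lookup in a sorted index. The sketch below is illustrative only; the class and field names are assumptions, not Neptune's real API.

```java
import java.util.*;

// Find the tablet covering a row key: records are keyed by the tablet's
// MaxRowKey, so the owning tablet is the smallest index key >= the row key.
public class TabletLookupSketch {
    static class IndexRecord {
        final String tabletName, assignedHost;
        IndexRecord(String t, String h) { tabletName = t; assignedHost = h; }
    }

    public static void main(String[] args) {
        // meta index for table T1: "TableName.MaxRowKey" -> tablet record
        NavigableMap<String, IndexRecord> metaIndex = new TreeMap<>();
        metaIndex.put("T1.1100", new IndexRecord("U1", "host-a"));
        metaIndex.put("T1.1200", new IndexRecord("U2", "host-b"));
        metaIndex.put("T1.2000", new IndexRecord("UN", "host-c"));

        // row key "T1.1150" exceeds U1's max key, so it lands in U2
        Map.Entry<String, IndexRecord> e = metaIndex.ceilingEntry("T1.1150");
        System.out.println(e.getValue().tabletName); // prints "U2"
    }
}
```

The same ceiling-lookup step repeats at each level: root index to meta tablet, meta tablet to user tablet, then the TableMapFile's block index to a file offset.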
Data Operation
[Write/read path diagram: put(key, value) is first appended to the ChangeLog (on the ChangeLogServer) and then applied to the TabletServer's MemoryTable. A minor compaction flushes the MemoryTable to a new MapFile in HDFS (MapFile #1..#n); a major compaction merges the MapFiles into a single merged MapFile in HDFS. get(key) goes through the Searcher, which consults the MemoryTable and the MapFiles.]
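The write/read path above can be sketched in a few lines. This is a hedged, in-memory illustration of the mechanism, not Neptune's implementation; all names are made up for the example.

```java
import java.util.*;

// put() logs then updates the memory table; minor compaction flushes the
// memory table to an immutable MapFile; major compaction merges MapFiles;
// get() checks the memory table first, then MapFiles newest-first.
public class WritePathSketch {
    final List<String> changeLog = new ArrayList<>();           // commit log entries
    final NavigableMap<String, String> memoryTable = new TreeMap<>();
    final Deque<NavigableMap<String, String>> mapFiles = new ArrayDeque<>(); // newest first

    void put(String key, String value) {
        changeLog.add("PUT " + key);   // durability first (ChangeLog cluster)
        memoryTable.put(key, value);
    }

    void minorCompaction() {           // flush memory table to a new MapFile
        mapFiles.addFirst(new TreeMap<>(memoryTable));
        memoryTable.clear();
    }

    void majorCompaction() {           // merge all MapFiles into one
        NavigableMap<String, String> merged = new TreeMap<>();
        for (Iterator<NavigableMap<String, String>> it =
                 mapFiles.descendingIterator(); it.hasNext(); )
            merged.putAll(it.next());  // oldest first, so newer values win
        mapFiles.clear();
        mapFiles.addFirst(merged);
    }

    String get(String key) {           // the "Searcher" role
        if (memoryTable.containsKey(key)) return memoryTable.get(key);
        for (NavigableMap<String, String> f : mapFiles)
            if (f.containsKey(key)) return f.get(key);
        return null;
    }

    public static void main(String[] args) {
        WritePathSketch ts = new WritePathSketch();
        ts.put("RK1", "v1");
        ts.minorCompaction();
        ts.put("RK1", "v2");               // newer value in the memory table
        System.out.println(ts.get("RK1")); // prints "v2"
        ts.minorCompaction();
        ts.majorCompaction();
        System.out.println(ts.get("RK1")); // still "v2" after the merge
    }
}
```

Logging before applying the write is what makes the memory table recoverable after a TabletServer crash.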
Failover
- Master failure: only table schema management and tablet split are disabled; multiple masters can run, so another master takes over
- TabletServer failure: its tablets are reassigned to other TabletServers by the master within 2 minutes
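The TabletServer-failure rule above amounts to the master rebinding the dead server's tablets to live servers. A minimal sketch, with an assumed round-robin policy and illustrative names only:

```java
import java.util.*;

// When the lock server reports a dead TabletServer, the master walks the
// assignment table and moves that server's tablets to the remaining
// live servers.
public class FailoverSketch {
    public static void main(String[] args) {
        // tablet -> assigned TabletServer
        Map<String, String> assignment = new TreeMap<>();
        assignment.put("TabletA-1", "ts1");
        assignment.put("TabletA-2", "ts2");
        assignment.put("TabletA-3", "ts1");

        List<String> live = List.of("ts2", "ts3");
        String dead = "ts1";

        // reassign the dead server's tablets round-robin across live servers
        int i = 0;
        for (Map.Entry<String, String> e : assignment.entrySet())
            if (e.getValue().equals(dead))
                e.setValue(live.get(i++ % live.size()));

        System.out.println(assignment);
        // prints {TabletA-1=ts2, TabletA-2=ts2, TabletA-3=ts3}
    }
}
```

Since tablet data and commit logs live in shared storage (DFS and the ChangeLog cluster), only the assignment record changes; no data is copied during takeover.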
MapReduce with Neptune
[Dataflow diagram: the tablets of TableA (TabletA-1..A-N, located via the META Table) feed Map tasks on TaskTrackers through TabletInputFormat on the Hadoop client; Reduce tasks write their output, partitioned by key, to TableB's tablets (Tablet B-1, B-2) or to a DBMS/HDFS table.]
Client API
- Single-row operations: put/get
- Multi-row operations: like, between
- Batch operations: scanner/uploader
- MapReduce: TabletInputFormat
- Command-line Shell: NQL (Neptune Query Language), JDBC support
- Web Console
Client API Example

TableSchema tableSchema =
    new TableSchema("T_TEST", new String[]{"col1", "col2"});
NTable.createTable(tableSchema);

NTable ntable = NTable.openTable("T_TEST");

// single-row put
Row row = new Row(new Row.Key("RK1"));
row.addCell("col1", new Cell(new Cell.Key("CK1"), "test_value".getBytes()));
ntable.put(row);

// single-row get
Row selectedRow = ntable.get(new Row.Key("RK1"));
System.out.println(selectedRow.getCellList("col1").get(0));

// batch scan
TableScanner scanner = ntable.openScanner(ntable, new String[]{"col1"});
Row scanRow = null;
while ((scanRow = scanner.next()) != null) {
    System.out.println(scanRow.getCellList("col1").get(0));
}
scanner.close();
Neptune Shell
- Data definition: CREATE TABLE, DROP TABLE, SHOW TABLES, DESC
- Data manipulation: SELECT, INSERT, DELETE, TRUNCATE COLUMN, TRUNCATE TABLE
- Cluster monitoring: PING TABLETSERVER, REPORT TABLE
Web Console
Performance

Experiment          Neptune
Random read             495
Random write          1,223
Sequential read         498
Sequential write      1,327
Scan                 40,329

Number of 1000-byte values read/written per second
HBase/Bigtable Comparison

                  Neptune                  Bigtable      HBase
File system       Hadoop DFS or other DFS  GFS           Hadoop DFS
Computing         Hadoop or others         MapReduce     Hadoop
Master failover   Yes (ZooKeeper)          Yes (Chubby)  0.20 (ZooKeeper)
Script language   No (NQL)                 Sawzall       No
Change log        ChangeLog cluster        GFS           HDFS + Memory
API               Java, Thrift, REST       C++           Java, Thrift, REST
ACL               Yes                      Yes           No
Memory table      No                       Yes           No
Scanner           Yes                      Yes           Yes
Uploader          Yes                      Unknown       No