26
Optimizing the Storage of Massive Electronic Pedigrees in HDFS Authors: Yin Zhang, Weili Han, Wei Wang, Chang Lei Presented by: Yin Zhang Oct 25 th 2012

Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Optimizing the Storage of Massive

Electronic Pedigrees in HDFS

Authors: Yin Zhang, Weili Han, Wei Wang, Chang Lei

Presented by: Yin Zhang

Oct 25th 2012

Page 2: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Abstract

Electronic pedigree system

>>trustworthily tracking of the processes

>>Small-sized but huge volume of electronic

pedigrees

Optimizating the storing and accessing of massive

small XML files in HDFS

2

Page 3: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Abstract

reduce the metadata occupation at NameNode

improve the efficiency of accessing small XML files

Feasibility,

Effectiveness,

Efficiency.

3

Page 4: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Reduce memory consumption

of NameNodes by 50%

4

Page 5: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Improve performance of storing

by 91%

5

Page 6: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Accelerate accessing by 88%

in Hadoop

6

Page 7: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Background: electronic

pedigree

What is a pedigree?

Electronic pedigree

7

Page 8: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Background: electronic

pedigree

Of a nested architecture

Tens of KB to hundreds of KB

Different types

Attribute ‘Lot number’

Attribute ‘Item serial number’

Attribute ‘pedigreeID’

8

Page 9: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

A simple sample

envelope

Pedigree pedigreeID=“001”

shippedPedigree pedigreeID=“101”documentInfotransactionInfoitem serial numberlot number

Digital signature

initialPedigree pedigreeID=“201”productInfoitem serial numberreceivingInfotransactionInfoattachment

Digital signature

Pedigree pedigreeID=“002”

shippedPedigree pedigreeID=“102”documentInfotransactionInfoitem serial numberlot number

Digital signature

initialPedigree pedigreeID=“202”productInfoitem serial numberreceivingInfotransactionInfoattachment

Digital signature

Pedigree pedigreeID=“003”

shippedPedigree pedigreeID=“103”documentInfotransactionInfoitem serial numberlot number

Digital signature

initialPedigree pedigreeID=“203”productInfoitem serial numberreceivingInfotransactionInfoattachment

Digital signature

9

Page 10: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Characteristics of

electronic pedigree

serialized attributes

One single-writer, multiple-reader Model

Correlated electronic pedigrees

freshness date of goods

10

Page 11: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Electronic Pedigree Storage Server

Manage massive electronic pedigrees

Access electronic pedigrees

Receive electronic pedigrees in two ways

Background: EPSS 11

Page 12: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Background: HDFS

Hadoop Distributed File System

• open-source software framework

• One single-writer, multiple-reader model

Performance bottlenecks for massive small files

12

Page 13: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Disadvantages of HDFS in

EPSS

① High cost for metadata management

② High memory cost for files

③ High time cost for contacting

Optimizing the Storage of Massive Electronic

Pedigrees in HDFS

13

Page 14: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

File merging

Merge correlated electronic pedigrees into a bigger file

>>decrease file size in HDFS

>>relieve NameNode’s memory

Four strategies to merge

Internal index file ……

Internal index file

Small file A name Small file A offset Small file A size

Small file B name Small file B offset Small file B size

14

Page 15: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

File mapping

A global mapping table >> records between each

small XML file and its merged file

An indexing table >> records between attributes

and small XML files

e.g. “pedigreeID=001”→“pedigree001”→“bigfile1”

E-pedigree file name(var)

Start(Long) End(Long)Merged file name(var)

Serialized

attributes

Electronic

pedigree Merged file

15

Page 16: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

File reading 16

Page 17: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

File writing 17

Page 18: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Prefetching

Reduce response time for user

Prefetch XML files in the same merged file

Consistent between cache and HDFS

Influenced by merging strategy

18

Page 19: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

File remerging

Frequency of accessing electronic pedigrees will

turn down with time passing by, especially after the

freshness date of goods ends

19

Page 20: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Experiment 20

Page 21: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Evaluation 21

Page 22: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Evaluation 22

Page 23: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Evaluation 23

Page 24: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Difference from related

work

① Helpful for the files of XML type

② Dynamic grouping

③ File correlation and attribute serialization

④ Global indexing table and global mapping table

⑤ Prefetching technology

⑥ File remerging technology

24

Page 25: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Conclusion and future

work

Optimizing the Storage of Massive Electronic

Pedigrees in HDFS

Avoiding privacy problem of storing electronic

pedigrees

25

Page 26: Optimizing the Storage of Massive Electronic Pedigrees in HDFS · reduce the metadata occupation at NameNode improve the efficiency of accessing small XML files Feasibility, Effectiveness,

Q&A