Chapter 4 Homomorphic Distribution Verification Protocol
4.1. Introduction
In this chapter, we propose a new homomorphic distributed verification protocol (HDVP) to address the Availability and Integrity of data stored in the cloud. Recently, Wang et al. [164, 165] described a homomorphic distributed verification protocol to ensure data storage security in cloud computing. Their scheme relies on the Vandermonde Reed-Solomon code for data Availability, and uses token pre-computation with pseudorandom data to verify data storage correctness. It guarantees data Availability and Integrity and also identifies misbehaving servers. However, their scheme is inefficient and gives Clients only a weak Integrity guarantee about the security of their data. The HDVP overcomes the drawbacks of Wang's protocol [164, 165] and achieves a strong guarantee of the Availability and Integrity of data stored in the cloud, with partial dynamic data support through private verifiability.
4.2. Preliminaries
4.2.1. Erasure Codes
a) Cauchy Reed-Solomon (CRS) Code
Erasure coding is a fundamental technique to prevent data loss in distributed storage systems composed of multiple storage devices, such as network applications [123], data domains [193], peer-to-peer storage networks [190] and cloud storage [26, 165].
Erasure codes based on Vandermonde Reed-Solomon codes have existed for decades [131] and are widely used in communication and storage systems. Vandermonde Reed-Solomon codes are very flexible, as they can be defined for any values of m (original blocks) and n (redundancy blocks). However, they have the drawback of requiring n Galois Field multiplications per coding block, and since coding blocks are typically smaller than a machine's word size, they can require 2n to 8n multiplications per machine word. Thus, Vandermonde Reed-Solomon codes are expensive [132].
In 1995, Blomer et al. [22] proposed the Cauchy Reed-Solomon (CRS) code with two improvements over the Vandermonde Reed-Solomon code. The first improvement is to use projections that convert encoding operations over the Galois Field GF(2^w) to XORs, so that encoding takes O(n log2(m+n)) XORs per coding block. The second improvement is to use a Cauchy matrix over the Galois Field GF(2^w) instead of the classical Vandermonde matrix. The resulting distribution matrix has the property that every square submatrix is invertible.
A Cauchy matrix is defined over the Galois Field GF(2^w) in the following way: let X = {x1, …, xm} and Y = {y1, …, yn} be defined such that each xi and yj is a distinct element of GF(2^w), and X ∩ Y = Ø. Then the Cauchy matrix defined by X and Y has 1/(xi + yj) in element i, j, as shown in Fig. 4.1.
    | 1/(x1+y1)   1/(x1+y2)   ...   1/(x1+yn) |
    | 1/(x2+y1)   1/(x2+y2)   ...   1/(x2+yn) |
    |    ...         ...      ...      ...    |
    | 1/(xm+y1)   1/(xm+y2)   ...   1/(xm+yn) |

Fig. 4.1 Cauchy Matrix
This approach is a systematic layout with parity blocks, achieved with a Cauchy distribution matrix A = (I | P)^T derived from the Cauchy matrix [22]:

        | 1          0          ...   0            |
        | 0          1          ...   0            |
        | ...                                      |
A =     | 0          0          ...   1            |
        | 1/p(1,0)   1/p(1,1)   ...   1/p(1,m-1)   |
        | 1/p(2,0)   1/p(2,1)   ...   1/p(2,m-1)   |
        | ...                                      |
        | 1/p(n,0)   1/p(n,1)   ...   1/p(n,m-1)   |

Fig. 4.2 Cauchy Distribution Matrix
where p(i,j) = xi + yj, with xi ∈ X and yj ∈ Y in GF(2^w). (4.1)
The Cauchy distribution matrix A given in Fig. 4.2 is composed of an (m×m) identity matrix in the first m rows and an (n×m) Cauchy matrix in the remaining n rows.
Next, we expand the Cauchy distribution matrix using a projection defined in GF(2^w) by a factor of w in each dimension to yield a w(m+n)×wm distribution matrix known as the Binary Distribution Matrix (BDM). To encode with the bit matrix, we multiply the Binary Distribution Matrix A with a wm-element vector, which is composed of w bits from each data device. This product operation is depicted in Fig. 4.3.
By multiplying F with A, the Client obtains the encoded file, which is a w(m+n)-element vector composed of w bits from each data and coding device. In this multiplication, Cauchy Reed-Solomon coding uses a series of bit-wise XOR operations to encode the file instead of performing n multiplications using Galois arithmetic as in the classical Vandermonde Reed-Solomon code [164].
Although Cauchy Reed-Solomon codes achieve faster encoding and decoding times than the Vandermonde Reed-Solomon code, the encoding and decoding times are still slow for large files.
Fig. 4.3 Encoding System with Binary Distribution Matrix [133]
[Figure: the encoded vector of data and coding blocks (D1, D2, D3, …, Dm-1, C1, C2, …, Cn-1) is obtained by multiplying the Binary Distribution Matrix (BDM) with the data vector (D1, D2, D3, …, Dm-1).]
To improve the encoding and decoding performance for large files, we use Tornado codes instead of the Cauchy Reed-Solomon code. A brief description of Tornado codes is given in the next section.
b) Tornado Codes
Erasure codes based on the Reed-Solomon code can be used to encode and decode data in distributed storage systems [131-133]. The limitation of these codes is that encoding and decoding times are slow for large block sizes, effectively limiting block size to small values in practical applications.
Tornado codes are erasure block codes based on irregular bipartite graphs. They can be constructed for arbitrary rates, allowing recovery of the original file from an appropriate fraction of the encoded file [11]. Encoding and decoding take O(n log n) time, whereas Reed-Solomon (RS) codes take O(n^2), as shown in Table 4.1.
Table 4.1: Difference between Tornado Codes and Reed-Solomon Codes

Erasure code            | Encoding Time | Decoding Time
Tornado code [11]       | O(n log n)    | O(n log n)
Reed-Solomon code [131] | O(n^2)        | O(n^2)
Tornado codes are generated by cascading a sequence of irregular random bipartite graphs; these graphs are equivalent to generator matrices. The operation of such a graph is shown in Fig. 4.4. The nodes on the leftmost level of the graph correspond to the original data. The nodes on the right side contain redundant data, which is computed by XORing the neighboring input nodes on the level to the left. The number of exclusive-or operations required for both encoding and decoding is thus dominated by the number of edges in the entire graph.
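As a small illustration (not the actual construction from [11]), the XOR relation between message bits and check bits, and the recovery of a single erased bit, can be sketched as follows; the adjacency lists here are an arbitrary toy graph:

```python
from functools import reduce
from operator import xor

def encode_checks(data, graph):
    """Each check bit is the XOR of its neighbouring message bits;
    graph[j] lists the message-bit indices feeding check bit j."""
    return [reduce(xor, (data[i] for i in nbrs), 0) for nbrs in graph]

def recover_bit(data, graph, checks, lost):
    """Recover one erased message bit from any check node it feeds."""
    for j, nbrs in enumerate(graph):
        if lost in nbrs:
            # XOR the check bit with the surviving neighbours
            return reduce(xor, (data[i] for i in nbrs if i != lost), checks[j])
    raise ValueError("bit %d feeds no check node" % lost)
```

With data = [1, 0, 1, 1] and graph = [[0, 1, 2], [1, 2, 3], [0, 3]], the first check bit is x1 ⊕ x2 ⊕ x3, matching the example in Fig. 4.4; the decoding cascade described below repeats `recover_bit` level by level.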
C1=x1+x2+x3
Fig. 4.4 Structure of Tornado Codes
The overall code C(B0, B1, …, Bm, Λ) is a cascade of bipartite graphs B0, B1, …, Bm and Λ. The graph B0 has n message bits as input and produces βn check bits. These form the input bits of B1, and β^2 n new check bits are formed. In general, the graph Bi has β^i n input bits and produces β^(i+1) n check bits. This sequence is truncated by a conventional rate 1-β erasure code Λ. The codeword consists of the n message bits and all the check bits produced at each stage of the cascade; it is thus a systematic code.
systematic code. The total number of check bits produced by this sequence is given by
11
1
1
2 nnn
m
i
mi (4.2)
The length of the codeword produced given n input bits is
11
nnn . The resulting
code is rate 1-β code for all values of m. the length of cascade is selected such that βm+1
n≈√n. we
begin by using this decoding algorithm for Λ, to recover losses that occur within its bits. This
will be successful if at most β fractions of bits have been lost in Λ. If all the losses are recovered,
we know the check bits of Bm. These check bits could be used to recover any losses in the input
bits of Bm. since the input bits of Bi are check bits of Bi-1, this recursion can be continued until all
the input bits of B0 are recovered.
4.2.2. Sobol Sequence
In 1967, I.M. Sobol introduced the construction of a quasi-random sequence of points that has low star discrepancy, called the Sobol sequence [27]. The Sobol sequence generates numbers between zero and one as binary fractions of length w bits, from a set of special binary fractions vi, i = 1, 2, …, w, called direction numbers. It uses base two for all dimensions, and numbers are generated sequentially to fill the larger "gaps" left by uncorrelated (pseudorandom) sequences. The Sobol sequence is more uniform than a pseudorandom sequence [147], as shown in Fig. 4.5.
To generate a Sobol sequence of values x1, x2, …, with 0 < xi < 1, we first need to construct direction numbers v1, v2, …. Each vi is a binary fraction that can be written as

vi = mi / 2^i, (4.3)

where mi is an odd integer such that 0 < mi < 2^i.
To obtain the mi, the construction starts by choosing a primitive polynomial over the field Z2, i.e. one may choose

P = x^d + a1 x^(d-1) + … + a_(d-1) x + 1, (4.4)

where each ai is 0 or 1 and P is an arbitrarily chosen primitive polynomial of degree d over Z2.
Then the mi are given by the d-term recurrence relation

mi = 2 a1 m_(i-1) ⊕ 2^2 a2 m_(i-2) ⊕ … ⊕ 2^(d-1) a_(d-1) m_(i-d+1) ⊕ 2^d m_(i-d) ⊕ m_(i-d), (4.5)

Fig. 4.5 Comparison of Pseudorandom Data and Sobol Sequence
where each term is expressed in base 2 and ⊕ denotes bit-by-bit exclusive-or. When using a primitive polynomial of degree d, the initial values m1, …, md can be chosen arbitrarily provided that each mi is odd and mi < 2^i, i = 1, …, d.
In order to generate the sequence x1, x2, …, xn, Sobol proposed using

xn = b1 v1 ⊕ b2 v2 ⊕ …, (4.6)

x_(n+1) = xn ⊕ vc, (4.7)

where b1 b2 … is the binary representation of n and c is the position of the rightmost zero bit in the binary representation of n.
For example [172]: to generate the Sobol QRNs, choose a primitive polynomial, say,

P = x^3 + x + 1 = 1·x^3 + 0·x^2 + 1·x + 1, (4.8)

with coefficients a1 = 0 and a2 = 1.
Using (4.5), the corresponding recurrence relation is:

mi = 4 m_(i-2) ⊕ 8 m_(i-3) ⊕ m_(i-3). (4.9)

We choose the first three mi's arbitrarily as m1 = 1, m2 = 3, and m3 = 7.
Then m4 = 4m2 ⊕ 8m1 ⊕ m1 = 12 ⊕ 8 ⊕ 1 = 1100 ⊕ 1000 ⊕ 0001 = 0101 = 5.
Similarly m5 = 4m3 ⊕ 8m2 ⊕ m2 = 28 ⊕ 24 ⊕ 3 = 11100 ⊕ 11000 ⊕ 00011 = 00111 = 7,
m6 = 4m4 ⊕ 8m3 ⊕ m3 = 20 ⊕ 56 ⊕ 7 = 43, and so on.
Using (4.3), we find v1 = 1/2^1 = 1/2 = 0.1,
v2 = 3/2^2 = 3/4 = 0.11,
v3 = 7/2^3 = 7/8 = 0.111, and so on.
Now x0 = 0, n = 0, c = 1.
Step 1: x1 = x0 ⊕ v1 = 0.0 ⊕ 0.1 = 0.1 = 1/2
n = 1, c = 2
Step 2: x2 = x1 ⊕ v2 = 0.10 ⊕ 0.11 = 0.01 = 1/4
n = 2, c = 1
Step 3: x3 = x2 ⊕ v1 = 0.01 ⊕ 0.10 = 0.11 = 3/4, and so on.
The Sobol sequence is: 0, 1/2, 1/4, 3/4, ….
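The recurrence (4.9) and the Gray-code update (4.7) can be sketched in Python as follows; this is a one-dimensional generator only, and the function names and word length w = 10 are our own choices:

```python
def rightmost_zero_bit(n):
    """1-based position of the rightmost zero bit of n (c in eq. 4.7)."""
    c = 1
    while n & 1:
        n >>= 1
        c += 1
    return c

def sobol_1d(count, w=10):
    """1-D Sobol points for P = x^3 + x + 1 with m1, m2, m3 = 1, 3, 7,
    using m_i = 4*m_{i-2} XOR 8*m_{i-3} XOR m_{i-3} (eq. 4.9)."""
    m = [1, 3, 7]
    for i in range(3, w):
        m.append((4 * m[i - 2]) ^ (8 * m[i - 3]) ^ m[i - 3])
    # direction numbers v_i = m_i / 2^i, scaled to w-bit integers
    v = [m[i] << (w - i - 1) for i in range(w)]
    x, points = 0, []
    for n in range(count):
        points.append(x / 2 ** w)
        x ^= v[rightmost_zero_bit(n) - 1]   # x_{n+1} = x_n XOR v_c
    return points
```

Calling sobol_1d(4) reproduces the worked example above: 0, 1/2, 1/4, 3/4.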
4.2.3. Universal Hash Function (UHF)
A UHF [31] is an algebraic function h: K × I^l → I that compresses a message or file element m ∈ I^l into a compact digest or "hash" based on a key k ∈ K. We denote the output of h as hk(m). A UHF has the property that given two inputs x ≠ y, with overwhelming probability over keys k it is the case that hk(x) ≠ hk(y). In other words, a UHF is collision-resistant when the message pair (x, y) is selected independently of the key k. A related notion is that of almost exclusive-or universal (AXU) hash functions, which have the property that given three input messages, the probability that the XOR of the hashes of the first two inputs matches the third input is small. Formally:
Definition 4.1: h is an ε-universal hash function family if for any x ≠ y ∈ I^l:
Pr_{k←K}[hk(x) = hk(y)] ≤ ε.
h is an ε-AXU family if for any x ≠ y ∈ I^l and for any z ∈ I:
Pr_{k←K}[hk(x) ⊕ hk(y) = z] ≤ ε.
Many common UHFs are also linear, meaning that for any message pair (m1, m2), it is the case that

hk(m1) + hk(m2) = hk(m1 + m2). (4.10)
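As a small illustration of the linearity property (4.10), consider a polynomial-evaluation hash over GF(2^8), where "+" is bit-wise XOR. This is only a sketch with an arbitrarily chosen reduction polynomial, not the specific UHF family of [31]:

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11b
    return r

def h(key, msg):
    """h_k(m) = XOR_i m_i * k^(i+1) over GF(2^8): linear in m, because
    GF(2^w) multiplication distributes over XOR."""
    acc, kp = 0, key
    for mi in msg:
        acc ^= gf_mul(mi, kp)
        kp = gf_mul(kp, key)
    return acc
```

For any equal-length messages m1, m2 and key k, h(k, m1) ^ h(k, m2) == h(k, [a ^ b for a, b in zip(m1, m2)]) — the XOR analogue of (4.10).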
4.3. Homomorphic Distribution Verification Protocol (HDVP)
The Homomorphic Distribution Verification Protocol is used to ensure the Availability and Integrity of data stored in the cloud. It consists of three phases, as shown in Fig. 4.6:
1) Setup phase: in this phase, the Client encodes the data for Availability and pre-computes a certain number of short verification tokens before distributing the file to the cloud servers.
2) Verification Phase: in this phase, the Client verifies the Integrity of data stored in the cloud, recovers any data lost on the cloud servers, and thus guarantees file retrieval.
3) Dynamic Data Operations and Verification Phase: in this phase, the Client performs dynamic data operations on the cloud data storage without retrieving the original data.
4.3.1. Setup Phase: HDVP
The setup phase consists of three methods: a) Encoding, b) KeyGeneration, c) MetadataGeneration, as shown in Fig. 4.6.
The detailed description of these three methods is given in the following sections:
a) Encoding: HDVP
In cloud data storage, we rely on erasure codes based on the Cauchy Reed-Solomon (CRS) code [22, 132] to distribute the entire file redundantly across a set of k = m+n cloud servers to guarantee data Availability against data loss. An (m+n, m) Cauchy Reed-Solomon code generates n redundancy (parity) blocks from m data blocks in such a way that the original m data blocks can be reconstructed from any m out of the m+n data and redundancy blocks. Because each of the m+n blocks is stored on a different server, the original file can survive the failure of any n of the m+n servers without any data loss, with a small space overhead
Fig. 4.6 Architecture of Homomorphic Distribution Verification Protocol
[Figure: the Client, TPA, and CSP interact over three phases: the Setup Phase ((a) Encoding, (b) KeyGeneration, (c) MetadataGeneration), the Verification Phase ((a) Challenge, (b) Response, (c) CheckIntegrity), and the Dynamic Data Operations and Verification Phase ((a) PrepareUpdate, (b) ExecuteUpdate, (c) CheckUpdate).]
(m/n) [100]. To support efficient sequential I/O to the original file, the data file layout is systematic, i.e. the unmodified m data blocks together with the n parity blocks are distributed across m+n different servers. Algorithm 4.1 [176] gives the procedure for encoding the file using CRS.
In Algorithm 4.1, we assume the following: let F = (D1, D2, …, Dm) and Di = (d1i, d2i, …, dli)^T, where i ∈ {1, …, m}, l ≤ 2^w - 1, and w is the data word size, equal to 8 or 16. Here T (shorthand for transpose) denotes that each Di is represented as a column vector, and l denotes the data vector size in blocks. All these elements are elements of the Galois Field GF(2^w). To encode the data file, the Client multiplies F by A as follows:
C = F · A (4.11)

C = (D1, D2, …, Dm, C1, C2, …, Cn) (4.12)

Each server j ∈ {1, …, k} stores one vector C^(j) of this layout: a data vector for j ≤ m and a parity vector for j > m, where

D^(j) = (d1^(j), d2^(j), …, dl^(j)), (4.13)

with j ∈ {1, …, k} and i ∈ {1, …, l},
Algorithm 4.1: Encoding: HDVP
1. Procedure: File-Encoding
2. for i = 0 to n-1 do
3.   for k = 0 to w-1 do
4.     Ci,k = [0, 0, …, 0]
5.     for j = 0 to m-1 do
6.       for l = 0 to w-1 do
7.         Ci,k = Ci,k ⊕ fi,k,j,l · dj,l
8.       end for
9.     end for
10.   end for
11. end for
12. end procedure
and C^(j) = (c1^(j), c2^(j), …, cl^(j)), (4.14)
where j ∈ {1, …, k} and i ∈ {1, …, l}.
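The systematic encode of (4.11)-(4.14) can be sketched with element-wise Galois Field multiplication rather than the XOR bit-matrix (BDM) form; this is a minimal illustration over GF(2^4), with function names of our own choosing:

```python
# Arithmetic in GF(2^4) with primitive polynomial x^4 + x + 1 (0x13);
# a real deployment would use GF(2^8) or GF(2^16) lookup tables.
def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r

def gf_inv(a):
    return next(b for b in range(1, 16) if gf_mul(a, b) == 1)

def cauchy_matrix(m, n):
    # X = {0..m-1} and Y = {m..m+n-1} are disjoint, so x + y (XOR) != 0
    return [[gf_inv(x ^ y) for x in range(m)] for y in range(m, m + n)]

def crs_encode(data_blocks, n):
    """Systematic CRS-style encode: keep the m data blocks and append
    n parity blocks, where parity_j[t] = XOR_i A[j][i] * data_i[t]."""
    m, blen = len(data_blocks), len(data_blocks[0])
    A = cauchy_matrix(m, n)
    parity = [[0] * blen for _ in range(n)]
    for j in range(n):
        for i in range(m):
            for t in range(blen):
                parity[j][t] ^= gf_mul(A[j][i], data_blocks[i][t])
    return data_blocks + parity
```

As a special case of "any m out of m+n suffice", a lost data block D2 can be recovered from D1 and the first parity block by solving the single parity equation in GF(2^4).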
After encoding the data file, the Client generates a key pair for the later processing of the system, which we discuss in the next section.
b) KeyGeneration: HDVP
After encoding the data, the Client generates the random challenge key x and the master permutation key y for the later processing of the file, as shown in Algorithm 4.2. We use a Sobol Random Function (SRF) to generate the keys x and y:

x = f_kSRF(i) (4.15)

and

y = f_kSRP(i) (4.16)

Both keys are indexed on some (usually secret) key: f : {0,1}* × key → GF(2^w).
c) MetadataGeneration: HDVP
In order to guarantee the Integrity of data stored in the cloud, this protocol relies entirely on the metadata. The main idea is as follows: before distributing the encoded data to the cloud, the Client computes a certain number of short verification tokens (metadata) on each individual data block C^(j), using the Sobol sequence [27] instead of a pseudorandom sequence [164]. Each token covers a random set of blocks. The procedure of metadata computation is given in Algorithm 4.3:
Algorithm 4.2: KeyGen: HDVP
1. Procedure: KeyGen
2. Generate random challenge key kSRF and
3. master permutation key kSRP using the Sobol sequence;
4. Derive x = f_kSRF(i) and y = f_kSRP(i)
5. end procedure
The Client performs the following procedure while generating the ith token:
1) The Client generates a set of r randomly chosen indices:

Iq = πy(q), (4.17)

where Iq ∈ [1…l], q ∈ [1…r], and πkey(·) is a Sobol Random Permutation (SRP) indexed under a key:
π : {0,1}^log2(l) × key → {0,1}^log2(l).
2) Then, the Client computes the tokens as

Vi^(j) = Σ_{q=1}^{r} x^q · C^(j)[Iq], (4.18)

where x is the challenge key and y is the master permutation key.
The metadata computation function Vi^(j) belongs to a family of Universal Hash Functions (UHF) [31], chosen to preserve the homomorphic properties, which can be perfectly integrated with the verification of erasure-coded data. Vi^(j) is an element of GF(2^w) of small size; it is the response the Client expects to receive from the CSP when the Client challenges it on the specified random blocks.
Algorithm 4.3: MetadataGen: HDVP
1. Procedure: MetadataGen
2. for each vector G^(j), j ← 1 to k do
3.   for round i ← 1 to t do
4.     Compute the set of random indices Iq = πy(q), Iq ∈ [1…l], q ∈ [1…r]
5.     Compute Vi^(j) = Σ_{q=1}^{r} x^q · C^(j)[Iq]
6.   end for
7. end for
8. store all the Vi's locally
9. end procedure
The overall process of the setup phase is given in Fig. 4.7.
Fig. 4.7 Setup Phase: HDVP
Once the required number of metadata is computed, the Client keeps the metadata locally and distributes all the k = m+n encoded data blocks C^(j) (j ∈ {1, …, k}) to the CSP. Then, the CSP stores all the data and parity blocks across the cloud servers, namely S1, S2, …, Sk.
4.3.2. Verification Phase: HDVP
The verification phase consists of three methods, as shown in Fig. 4.6: a) Challenge, b) Response, c) CheckIntegrity.
a) Challenge: HDVP
In this phase, the Client issues a "random sample" challenge to verify the Integrity of data stored in the cloud. To create a challenge, the Client re-generates the challenge key and master permutation key using the Sobol sequence and sends them to the CSP. Here, the challenge key is needed to prevent potential pre-computations by the CSP. The procedure for creating a challenge is given in Algorithm 4.4.
[Fig. 4.7: 1. The Client encodes the file; 2. the Client generates a key pair x and y; 3. the Client computes the metadata Vi^(j) = Σ_{q=1}^{r} x^q · C^(j)[Iq]; 4. the Client sends the encoded file F to the server; 5. the CSP stores the file F in the cloud; 6. the Client keeps the metadata locally and deletes the local copy of the file F.]
b) Response: HDVP
Upon receiving a request from the Client, the CSP computes the response (a signature serving as the Integrity proof) over the specified random block indices and returns the response to the Client. The procedure of the Response protocol is given in Algorithm 4.5.
In the ith response computation, cross-checked over the k servers in Algorithm 4.5, the CSP computes a response over each data block G^(j) at the specified indices:

Ri^(j) = Σ_{q=1}^{r} x^q · G^(j)[πy(q)] (4.19)
c) CheckIntegrity: HDVP
After receiving a response R^(j) from the CSP, the Client checks the data Integrity by comparing the response with the metadata previously computed by the Client, as shown in Algorithm 4.6:

Ri^(j) =? Vi^(j), j ∈ {1, …, k}. (4.20)
Algorithm 4.5: Response: HDVP
1. Procedure: Response(i)
2. for j ← 1 to k do
3.   CSP computes Ri^(j) = Σ_{q=1}^{r} x^q · G^(j)[πy(q)], 1 ≤ j ≤ k
4.   return Ri^(j) to the Client
5. end procedure

Algorithm 4.4: Challenge: HDVP
1. Procedure: Challenge
2. The Client re-generates the random challenge key kSRF
   and the master permutation key kSRP using the Sobol sequence;
3. Derive x = f_kSRF(i) and y = f_kSRP(i)
4. end procedure
If the Integrity of the data is verified, HDVP is ready for the next challenge. Otherwise, the Client retrieves the file and corrects the data loss in it. The process of the verification phase is illustrated in Fig. 4.8.
Fig. 4.8 Verification Phase: HDVP
In the next phase, we explain the dynamic data operations of this protocol, used when the Client wants to update the data in the cloud without retrieving the file.
Algorithm 4.6: CheckIntegrity: HDVP
1. Procedure: CheckIntegrity
2. for j ← 1 to k do
3.   if Ri^(j) = Vi^(j) then
4.     Accept and ready for next challenge
5.   else
6.     return data has been corrupted
7.   end if
8. end for
9. end Procedure
[Fig. 4.8: 1. The Client generates a challenge chal = (x, y); 2. the Client sends chal to the CSP; 3. the CSP computes Ri^(j) = Σ_{q=1}^{r} x^q · G^(j)[πy(q)], 1 ≤ j ≤ k; 4. the CSP sends the Integrity proof Ri^(j) to the Client; 5. the Client verifies whether Ri^(j) = Vi^(j), returning '1' if so and '0' otherwise.]
4.3.3. Dynamic Data Operations and Verification Phase: HDVP
Cloud storage is dynamic data storage, which means that Clients frequently modify their data in the cloud, such as photos, electronic documents and log files. So, it is of critical importance to support dynamic block-level operations on the data file in the cloud, such as modification, delete, append and insert, while maintaining the same data Integrity and Availability assurance.
One easy way to support all dynamic operations is for the Client to download the entire data from the CSP, update it and re-distribute it to the cloud. Clearly, this would be secure, but highly inefficient. In this section, we show how the HDVP scheme explicitly and efficiently supports dynamic data operations on cloud data storage.
This phase consists of three methods: a) PrepareUpdate, b) ExecuteUpdate, c) CheckUpdate, as given in Fig. 4.6.
a) PrepareUpdate: HDVP
After storing the data in the cloud, the Client may later need to update some data (insertion of a new block after a given block i, deletion of block i, or modification of block i). To perform an update operation, the Client prepares an update request (modification, insertion, append or delete) and sends it to the CSP. Algorithm 4.7 outlines the preparation of the update request performed by the Client, as follows:
1) If the update operation is a modification, the Client prepares the update request as follows. Modification means adding to a current value dij a change Δdij:

d'ij = dij + Δdij (4.21)

Due to the linear property of the Cauchy Reed-Solomon code, the Client can easily perform the update operation and generate updated parity blocks without involving any other unchanged blocks. So the Client can construct an update matrix ΔF as follows:

ΔF =
| Δd11 Δd12 ... Δd1m |
| Δd21 Δd22 ... Δd2m |
| ...                |
| Δdl1 Δdl2 ... Δdlm |
= (ΔD1, ΔD2, …, ΔDm) (4.22)

To keep the corresponding parity blocks consistent with this update, as well as the original file layout, the Client can multiply ΔF by the Binary Distribution Matrix (BDM) and thus generate the update information for both the data blocks and the parity blocks as follows:

ΔG = ΔF · A = (ΔD1, …, ΔDm, ΔC1, …, ΔCn) (4.23)

ΔG^(j) = (ΔD1^(j), ΔD2^(j), …, ΔDm^(j), ΔC1^(j), ΔC2^(j), …, ΔCn^(j)) (4.24)

where j ∈ {1, …, k}.
The data block update operation automatically affects some or all of the remaining verification metadata. So, in order to maintain the same Integrity assurance, the Client has to modify the affected unused metadata for each data block C^(j). In other words, the Client needs to exclude every occurrence of the old data block and replace it with the new one, as follows.
Suppose the data block G^(j)[Is] covered by a specific token Vi^(j) has been changed to G^(j)[Is] + ΔG^(j)[Is], as shown in Algorithm 4.7, where Is = πy(s). To maintain the usability of the token Vi^(j), the Client can simply update it, without retrieving any of the other r-1 blocks required in the pre-computation of Vi^(j):

Vi^(j) ← Vi^(j) + x^s · ΔG^(j)[Is] (4.25)

The crucial part of the algorithm is at line 7, where the Client simultaneously replaces the old version of Vi^(j) with the new one by adding

x^s · ΔG^(j)[Is] (4.26)
2) If the update operation is a deletion: the delete operation considered here is similar to a general delete operation, in which the Client replaces the data block with zero or some special reserved data symbol. From this point of view, the delete operation is a special case of the update operation. Therefore, the Client can use the update operation procedure to support the delete operation, i.e.

Δdij = -dij in ΔF (4.27)

In the delete operation, the affected tokens also have to be modified using the same Algorithm 4.7 as in the modification operation.
3) Sometimes the Client wants to increase the size of his file in the cloud by adding new data blocks at the end of the file, which is called a block append operation. In an append operation the Client needs to upload a large amount of data, as follows. Given the file matrix F illustrated in the file encoding, appending blocks at the end of a data file corresponds to concatenating rows at the bottom of the matrix layout for the file F. We assume that in the beginning there are only l rows in the file matrix F. In order to support the block append operation, the Client has to make a small modification to the metadata, as specified in the modification operation.
The Client has to anticipate the maximum size of his data blocks, denoted lmax. The idea of the block append operation considered here is similar to [16, 164], in which each encoded data vector uses the system parameter rmax = [r*(lmax/l)] for each pre-computed token. The pre-computation of the ith token on server j is modified as follows:

Vi^(j) = Σ_{q=1}^{rmax} x^q · G^(j)[Iq], (4.28)

where
G^(j)[Iq] = G^(j)[πy(q)], if πy(q) ≤ l
          = 0,            if πy(q) > l

This formula guarantees that, on average, there will be r indices falling into the range of the existing l blocks. Since the CSP and the Client agree on the number of existing blocks in each data block G^(j), the CSP will follow exactly the same procedure when re-computing each token upon receiving a challenge request.
After the re-computation of the metadata, the Client is ready to append the new blocks, i.e. both the data and parity blocks are generated; the total length of each block G^(j) will be increased and fall into the range [l, lmax]. Therefore, the Client will update the affected metadata by adding x^s · G^(j)[Is] to the old Vi using Algorithm 4.7, as in the update operation, whenever G^(j)[Is] ≠ 0 for Is > l, where Is = πy(s); the details are thus omitted here.
4) If the update operation is an insert operation: an insert into the data file stored in the cloud is similar to an append operation at the desired index position while maintaining the same block-level structure for the entire data file, i.e. inserting a block F[j] corresponds to shifting all blocks starting with index j+1 by one slot. An insert operation may affect all rows of the logical data file in matrix F, and a substantial number of computations is required to rearrange all subsequent blocks as well as the pre-computed tokens. Hence, it is difficult to support an efficient insertion operation.
After preparing the update request, the Client sends it to the CSP.
Algorithm 4.7: PrepareUpdate: HDVP
/* assume that data block dij is changed to dij + Δdij */
1. Procedure: PrepareUpdate
2. if (update == modification/append)
3.   for round i ← 1 to t do
4.     Derive x = f_kSRF(i) and y = f_kSRP(i)
5.     for each vector G^(j), j ← 1 to r do
6.       Vi^(j) ← Vi^(j) + x^s · ΔG^(j)[Is]
7.     end for
8.   end for
9. else if (update == delete)
10.   for i = 1 to r do
11.     if (gk_i(l) == j) then
12.       Vi^(j) ← Vi^(j) - x^s · G^(j)[Is] (delete block D)
13.   end for
14. else if (update == insert)
15.   physical insert is not supported
16. end if
17. store Vi^(j) locally
18. end Procedure
b) ExecuteUpdate: HDVP
In this method, the CSP performs the update operations in response to the Client's request and stores the updated version of the file as follows:

G^(j)' = G^(j) + ΔG^(j), j ∈ {1, …, k} (4.29)

The procedure of the execute operation is given in Algorithm 4.8.
Algorithm 4.8: ExecuteUpdate: HDVP
1. Procedure: ExecuteUpdate ← {F''}
2. if (update == modification/append)
3.   G^(j)' = G^(j) + ΔG^(j), j ∈ {1, …, k}
4.   update the file
5. else if (update == deletion)
6.   G^(j)' = G^(j) + ΔG^(j), j ∈ {1, …, k}
7.   update the file F
8.   move all blocks backward after the ith block
9. end if
10. end procedure
c) CheckUpdate: HDVP
After the update, the Client wants to verify whether the CSP has updated the data file successfully or not. The verification of dynamic data operations in this protocol starts directly from the default Integrity verification process (the same as Algorithm 4.6). The only difference is in the way we compute Vi: rather than hashing the concatenation of blocks, the verifier hashes each single block and XORs the resulting outputs. This does not change the ability of the simulator to extract the blocks queried during the challenge.
The process of dynamic data operations and verification is depicted in Fig. 4.9.
Fig. 4.9 Dynamic Data Operations and Verification Phase: HDVP
4.4. Analysis of HDVP
In this section, we analyze the security and performance of the Homomorphic Distribution Verification Protocol.
4.4.1. Security Analysis of HDVP
In the security analysis, we analyze the Integrity and Availability of data stored in the cloud and compare the results with the existing schemes [164, 165].
a) Integrity
In this section, we analyze the probability of detecting corrupted blocks and show that the HDVP scheme needs to operate only on selected (random) blocks instead of all blocks, which greatly reduces the computational overhead of the verifier (Client) and the server (CSP) while still detecting data corruption with high probability (99%).
Assume that an attacker modifies the data blocks in z rows out of the l rows of the encoded file matrix G. Let r be the number of selected rows for which the Client challenges the CSP in each Integrity verification process. Let X be a discrete random variable defined as the number of modified rows picked by the Client. First, we analyze the matching probability that at least one of the blocks picked by the Client matches one of the blocks modified by the attacker [164, 165]:
[Fig. 4.9: 1. The Client computes Vi^(j) ← Vi^(j) + x^s · ΔG^(j)[Is] and prepares the update request (s, m'i, update/append/delete); 2. the Client sends the update request to the CSP; 3. the server updates the file: G^(j)' = G^(j) + ΔG^(j), j ∈ {1, …, k}; 4. the Client deletes m'i locally; 5. verification starts directly from Algorithm 4.6.]
P_m = P{X ≥ 1} = 1 - P{X = 0} = 1 - Π_{i=0}^{r-1} (1 - min{z/(l-i), 1}) ≥ 1 - (1 - z/l)^r (4.30)

The attacker avoids detection only if none of the selected rows in the ith verification process are deleted or modified.
Next, we consider the probability of a false negative result, i.e. the specified data blocks in the r selected rows have been damaged or deleted but the verification equation still holds: Ri^(j) = Vi^(j). Consider the responses Ri^(1), …, Ri^(k) returned to the Client from the CSPs for the ith challenge; each response value Ri^(j) is computed within GF(2^w) from r blocks on server j. The number of responses R^(m+1), …, R^(k) from parity servers is n = k - m. Thus, according to Proposition 2 of [170], the false negative probability is:

P_f^r = Pr1 + Pr2, (4.31)

where Pr1 and Pr2 are the two collision terms given in Proposition 2 of [170].
Based on the above discussion, it follows that the probability of detecting data corruption across all cloud storage servers is [164]:

P_d = P_m^r · (1 - P_f^r). (4.32)
Comparison with Existing Schemes
In this section, we compare the detection probability of the HDVP scheme with that of the existing schemes [164, 165] for corrupted data blocks and show that the proposed scheme is better in terms of detection probability.
We compare the HDVP scheme with the existing schemes in two cases:
i) Strong Integrity: HDVP gives a stronger Integrity guarantee for the Client's data than the existing probabilistic schemes. If an adversary corrupts some part of the file, the HDVP method is able to detect the corruption with high probability by issuing "random sample" challenges to the CSP for the data Integrity proof. These random samples are generated by the Sobol sequence, which covers the entire file uniformly, so the corrupted part is detected. The existing schemes may not give satisfactory Integrity results, i.e. they may report that no data is lost or that the data is safe in the cloud even though the data is corrupted, because a pseudorandom sequence may not cover the entire file, or may repeatedly sample the same region while computing the Integrity proof, thus producing an Integrity proof without covering the corrupted part of the file. To see that the existing schemes may not provide satisfactory Integrity results when data blocks in a file are corrupted, assume that a file contains 100000 blocks, of which 1% or 5% are corrupted. If the verifier uses 1000 or 500 Sobol random samples over the entire file to detect the corruptions, respectively, these samples are uniformly distributed over the entire file for each and every challenge, so the corruptions are detected with 99% probability. If pseudorandom samples are used instead, the sampled blocks may fail to detect the corruption because they may not be uniformly distributed, i.e. they may cluster in the upper or lower part of the file, sometimes with more samples in the upper part and fewer in the lower part. Hence, such samples can sometimes produce an Integrity proof without covering the entire file. Figs. 4.10a and 4.10b show the probability of detecting 1% and 5% data corruption, respectively, using the Sobol sequence and a pseudorandom sequence.
[Plot: x-axis r (number of queried blocks, as a percentage of l, 0-10%); y-axis l (total number of blocks, up to 10000); curves: Sobol sequence (99%) and pseudorandom sequence (99%).]
(a) z=1%l
[Plot: x-axis r (number of queried rows, as a percentage of l, 0-5%); y-axis l (total number of rows, up to 10000); curves: Sobol sequence (99%) and pseudorandom sequence (99%).]
ii) Efficiency: Here, we analyze the efficiency of the HDVP scheme against the existing schemes in terms of the detection probability for corrupted blocks. Assume that both schemes must detect corruption with high probability. A verifier using the existing methods would have to increase the number of random samples dramatically to achieve this, which would make the whole concept of lightweight verification impractical, whereas the HDVP scheme always detects the corruption with high probability using a small number of random samples, as shown in Figs. 4.11a and 4.11b.
Fig. 4.11 plots the detection probability Px for different values of l, r and z. To illustrate, assume a file contains l=10000 blocks. In Fig. 4.11a, where z=1%l of the data is corrupted, the Sobol sequence detects the corruption with 99% probability using 4.9%l samples, while the pseudorandom sequence requires nearly 10%l samples. In Fig. 4.11b, where z=5%l of the data is corrupted, the HDVP scheme using the Sobol sequence detects the corruption with 3%l random samples, while the existing schemes [164, 165] require 6%l random indices.
Fig. 4.10: The detection probability Px against (a) 1%l and (b) 5%l data corruption using the existing scheme and HDVP.
[Fig. 4.11a plot: x-axis l (total number of blocks, 0-10000); y-axis r (number of queried blocks as a percentage of l, 0-10%); curves: proposed and existing methods at detection probabilities 0.99, 0.9 and 0.8.]
[Fig. 4.11b plot: x-axis l (total number of blocks, 0-10000); y-axis r (number of queried blocks as a percentage of l, 0-6%); curves: proposed and existing methods at detection probabilities 0.99, 0.9 and 0.8.]
Fig. 4.11: The detection probability Px against data corruption using the existing scheme and the HDVP scheme, for (a) z=1%l and (b) z=5%l. Px is shown as a function of l (the total number of blocks) and r (the number of blocks queried by the Client, as a percentage of l), for a given z (the number of blocks modified by the adversary).

Clearly, we verify the Integrity of data stored in the cloud using a random sampling approach. A pseudorandom sequence is often used for this purpose; however, pseudorandom sequences tend to show a clustering effect, more or less pronounced depending on the pseudorandom number generator (PRNG) used. Because of this effect, we may not get satisfactory Integrity results. To obviate this undesired effect, we instead use the Sobol sequence: its generators are designed and developed to produce a more uniformly distributed random sequence, and hence yield satisfactory Integrity results. Therefore, the HDVP scheme is more secure and efficient than the existing probabilistic verification schemes [164, 165], since the Sobol sequence is more uniform than a pseudorandom sequence.
Uniformity Testing
Now, we turn to Monte Carlo simulation to determine the uniformity of the random sequences. To assess the quality of the random numbers, we compute the Monte Carlo integration of a function f(x) over the s-dimensional unit cube I^s. In effect, we calculate the average of the function at a set of N randomly sampled points, so the estimate of the integral is:

V = (1/N) * SUM_{i=1}^{N} f(x'_i)    (4.33)

where V denotes the approximation to the integral and x'_1, ..., x'_N are the N s-dimensional sample points. We compute the Monte Carlo integration V for two cases: uncorrelated random numbers (a pseudorandom sequence) and the Sobol sequence. If a pseudorandom sequence is used, the points x' are independently and identically distributed, and the expected error of the integral estimate decreases as N^(-1/2); if the Sobol sequence is used, the fractional error decreases as N^(-1). In Fig. 4.12, we present the calculation of the six-dimensional integral:
I = INT_0^1 INT_0^1 INT_0^1 INT_0^1 INT_0^1 INT_0^1 PROD_{i=1}^{6} cos(i * x_i) dx_1 dx_2 dx_3 dx_4 dx_5 dx_6    (4.34)

The exact value of the integral is:

I = PROD_{i=1}^{6} sin(i) / i    (4.35)
Fig. 4.12 shows that the pseudorandom sequence gives the worst performance, whilst the Sobol sequence converges rapidly to the solution. We conclude that the Sobol sequence can evaluate integrals more efficiently than pseudorandom sequences.
[Plot: convergence of the estimated value of the integral versus the number of sample points (in thousands), for pseudorandom data and the Sobol sequence.]
Fig. 4.12 Monte Carlo simulation using random numbers
b) Availability
To ensure that the original file remains recoverable and retrievable even if an attacker corrupts a fraction of the data file, the HDVP scheme guarantees data retrievability using an erasure code, as shown in Table 4.2. From Table 4.2, we can see that as the number of parity blocks n increases, the Availability guarantee of the data also increases. Therefore, by generating n parity blocks from m data blocks, we can retrieve the original data from any m of the (m+n) blocks.
Table 4.2: For increasing (m, n) values to get 99.99% Availability guarantee.
Total no. of Blocks m blocks n blocks Availability guarantee
6 6 0 53%
10 6 4 85%
14 8 6 96.16%
18 12 6 97.19%
22 14 8 98.83%
26 16 10 99.97%
30 18 12 99.99%
The following theorem would prove the Availability guaranty of the file:
Theorem 4.1 [148]: Given a ρ fraction of the n blocks of an encoded file, it is possible to recover the entire original data with all but negligible probability.
Proof: Here, we consider economically motivated adversaries who are willing to modify or delete only a small percentage of the file; in this context, deleting or modifying a few bits does not provide any financial benefit. If any modification or deletion of small parts of the file is detected, the erasure code can be used: a ρ fraction of the encoded file blocks is sufficient to recover the original file in linear time. Therefore, the erasure code guarantees that the ρ fraction of retrieved blocks allows decoding with overwhelming probability.
4.4.2. Performance Analysis and Experimental Results of HDVP
The performance analysis focuses on the implementation of encoding, metadata generation and CSP computation, and compares the experimental results with Wang's scheme [164, 165].
a) Encoding
For file encoding, we implemented data file encoding for the data Availability guarantee. The HDVP experiments are conducted in C++ on a system with a Core 2 Duo processor running at 2.80GHz, 4GB of RAM and a 3GB SATA hard disk. We consider two parameter settings for the (m+n, n) Cauchy Reed-Solomon code over the Galois Field GF(2^w), w=8 or 16. Tables 4.3 and 4.4 below show the average encoding cost of a 1GB file using the Vandermonde Reed-Solomon code and the Cauchy Reed-Solomon code, respectively. In both tables, Set I fixes the number of parity blocks at n=2 and increases the number of data blocks m, while Set II fixes the number of data blocks at m=10 and increases the number of parity blocks n. Note that as m increases, the length l of the data blocks on every server decreases, which enables fewer calls to the Cauchy Reed-Solomon encoder.
Figs. 4.13a and 4.13b show that, compared to the existing scheme [164, 165], the HDVP scheme takes less time to encode a 1GB file over the Galois Field GF(2^w) on different servers. The Cauchy Reed-Solomon encoding of a 1GB file is faster than the Vandermonde-based Reed-Solomon encoding because it uses XOR operations instead of classical Galois field arithmetic to encode the file.
Table 4.3: Encoding cost of a 1 GB file using the Vandermonde Reed-Solomon Code

Set I          m=4        m=6       m=8       m=10
n=2            110.21s    81.87s    65.42s    49.1s
Set II         n=2        n=4       n=6       n=8
m=10           49.1s      83.2s     138.11s   189.87s

Table 4.4: Encoding cost of a 1 GB file using the Cauchy Reed-Solomon Code

Set I          m=4        m=6       m=8       m=10
n=2            80.3s      62.67s    47.32s    32.42s
Set II         n=2        n=4       n=6       n=8
m=10           32.42s     57.25s    103.21s   154.32s
[Fig. 4.13a plot: total cost in time (seconds, 0-200) for (m, n) = (10,2), (10,4), (10,6), (10,8), i.e. m fixed and n increasing; curves: Vandermonde Reed-Solomon code and Cauchy Reed-Solomon code.]
[Fig. 4.13b plot: total cost in time (seconds, 0-120) for (m, n) = (4,2), (6,2), (8,2), (10,2), i.e. m increasing and n fixed; curves: Vandermonde Reed-Solomon code and Cauchy Reed-Solomon code.]
Fig. 4.13: Encoding performance comparison between two parameter settings for encoding a 1GB file using different erasure coding techniques (Cauchy Reed-Solomon code and Vandermonde Reed-Solomon code) under different systems.

Next, we measure the encoding and decoding performance of files using an erasure code based on the Tornado code and compare the results with the Reed-Solomon code, as shown in Tables 4.5 and 4.6. From these tables, we can see that the encoding and decoding of the HDVP scheme are faster than those of the existing schemes [164, 165].
Table 4.5: Encoding cost of files of different sizes using the Reed-Solomon code and the Tornado code

File Size    Reed-Solomon code [131]    Tornado code [11]
2MB          442 seconds                0.60 seconds
4MB          1717 seconds               1.35 seconds
6MB          4213 seconds               2.0 seconds
8MB          6994 seconds               2.6 seconds
10MB         9018 seconds               3.1 seconds
Table 4.6: Decoding cost of files of different sizes using the Reed-Solomon code and the Tornado code

File Size    Reed-Solomon code [131]    Tornado code [11]
2MB          199 seconds                0.44 seconds
4MB          800 seconds                0.74 seconds
6MB          1883 seconds               1.03 seconds
8MB          3166 seconds               1.38 seconds
10MB         4824 seconds               1.73 seconds
b) Metadata Generation
In this section, we measure the processing time required to compute the metadata. The HDVP scheme decides the total amount of metadata dynamically; for example, when t is set to 7300 or 14600, the data file can be verified every day for the next 20 or 40 years, respectively. In the metadata computation, the Client first generates a random key and a permutation key using the Sobol sequence, and then computes the metadata using a UHF [31]. This computation is performed along with the file encoding; our HDVP implementation shows that the average token pre-computation cost is 0.2 seconds. The metadata generation also involves the SRF and SRPs; however, these operations are performed over short inputs, so their costs are negligible compared with the universal hash function (UHF).
Table 4.7 summarizes the storage and computation costs of the metadata for a 1GB file using the HDVP scheme and the existing scheme under different system settings.
Table 4.7: Storage and computation cost of metadata generation for 1GB data file under different
system settings using HDVP and existing scheme.
c) CSP Computation
Here, we measure the computation cost for the CSP to compute the Integrity proof for a challenge created by the Client during the challenge-response protocol. HDVP takes less time to compute a response (Integrity proof) than the existing method, because in HDVP the CSP computes a response over fewer challenged blocks. For example, assume the file contains l=10000 blocks, of which 1%l is corrupted. HDVP computes a response over c=360 random blocks to achieve 99% detection probability, whereas the existing schemes compute a response over c=460 blocks to reach the same detection probability. Therefore, the CSP computation of HDVP is faster than that of the existing scheme, as shown in Table 4.8.
Verify daily for the next 20 years

                           (m,n)=(10,2)       (m,n)=(10,4)       (m,n)=(10,6)       (m,n)=(10,8)
                           HDVP    [165]      HDVP    [165]      HDVP    [165]      HDVP    [165]
Storage overhead (KB)      167.11  199.61     183.61  228.13     217.63  256.64     260.34  313.67
Computation overhead (s)   33.26   41.40      38.40   47.31      45.32   53.22      53.09   63.05

Verify daily for the next 40 years

Storage overhead (KB)      341.22  399.22     369.08  456.25     413.65  513.28     470.85  627.34
Computation overhead (s)   64.51   82.79      73.22   94.62      81.11   106.45     94.32   130.10
Table 4.8: Number of samples required to detect 1%l data corruption out of 10000 blocks

Detection Probability    HDVP    Existing schemes [164, 165]
0.50 40 110
0.6 90 170
0.7 120 230
0.8 160 280
0.9 195 320
0.95 240 380
0.99 290 460
4.5. Summary
In this chapter, we proposed a homomorphic distribution verification protocol (HDVP) to address the Integrity and Availability of data in cloud computing. The scheme relies on erasure codes in the setup phase, instead of replication mechanisms, to guarantee the Availability of the stored data, and utilizes token pre-computation using the Sobol sequence to check its Integrity.
The homomorphic distributed verification protocol achieves guaranteed Availability and Integrity of data stored in the cloud through private verifiability, and it is useful wherever an application needs Availability and Integrity of data in the cloud. Through detailed security and performance analysis, we showed that the HDVP scheme is efficient and protects Clients' data stored in the cloud against internal and external attacks.
However, HDVP does not support efficient Dynamic Data operations, because the construction of the metadata involves the file index information: once a file block is inserted, the computation overhead becomes unacceptable, since the metadata of all the following file blocks must be recomputed with the new indexes. In addition, since it is based on symmetric key cryptography, it does not support public verifiability. It is also difficult for Clients to verify the Integrity of their data when the file size is large and the Clients have limited resources and computing power.
To overcome these drawbacks of HDVP, we propose dynamic public audit protocols, which are explained in the next chapter.