
arXiv:1402.7190v1 [cs.DB] 28 Feb 2014


Two Stage Prediction Process with Gradient Descent Methods

Aligning with the Data Privacy Preservation

S Kumaraswamy^a, Srikanth P L^a, Manjula S H^a, K R Venugopal^a, L M Patnaik^b

^a Department of Computer Science and Engineering, University Visvesvaraya College of Engineering, Bangalore University, Bangalore 560 001, India. Contact: [email protected].

^b Honorary Professor, Indian Institute of Science, Bangalore.

Privacy preservation emphasizes the authorization of data: data should be accessed only by authorized users. Ensuring the privacy of data is one of the challenging tasks in data management, and the generalization of data with varying concept hierarchies is an interesting solution. This paper proposes a two stage prediction process on privacy preserved data. Privacy is preserved using generalization, and the generalized data is further disguised to betray the other communicating parties, which adds another level of privacy. Generalization with betraying is performed in the first stage to define the knowledge or hypothesis, which is further optimized using a gradient descent method in the second stage for accurate prediction of the data. The experiment is carried out with both batch and stochastic gradient methods, and it is shown that the bulk operation performed by the batch method takes longer and needs more iterations than the stochastic method to give an accurate solution.

Keywords: Batch Gradient, Gradient Descent, RDF and Ontology, Stochastic Gradient.

1. INTRODUCTION

Data mining is the process of extracting knowledge from large sets of data. The knowledge extraction defines a model or rules for performing accurate analysis on future data. The initial data set is referred to as training data, since it is used as the reference for arriving at the knowledge. The analysis performed on the data should not reveal the data itself, and hence preserving the privacy of the data becomes significant. Intuitively, differential privacy ensures that the system behaves essentially the same way, independent of whether any individual, or small group of individuals, opts in to or opts out of the database [1]. Generalization and suppression are two predominant techniques to achieve privacy of data [2-3]. Privacy of data is very important in certain domains, such as hospitals and the analysis of the psychological behaviour of patients [4]. The solution presented in this paper is to apply gradient descent methods on privacy preserved data. Gradient descent is a first order optimization algorithm for finding a local minimum of a function [5].

1.1. Motivation

Gradient descent is a widely used paradigm for solving many optimization problems. In machine learning or data mining, this optimization function corresponds to a decision model that is to be discovered [6]. Shuguo Han et al. have proposed a solution for the application of gradient descent methods on privacy preserved data. The data is either vertically or horizontally partitioned across the communicating parties, and the factor for the partition is mutually synchronized between them. In vertical partitioning every communicating party has the same set of objects with varying attributes; in this scenario the prediction is first performed on the unknown attributes before performing the prediction for a specific object. In horizontal partitioning every communicating party has a disjoint set of records with the same set of attributes; the prediction in this scenario is performed on unknown objects. To achieve the required accuracy in prediction, gradient descent is used in the context of machine learning. Gradient descent follows an iterative approach to minimize the prediction function until a local minimum



is reached. The local minimum is the threshold which defines the required accuracy factor. The gradient descent methods are applied on a matrix of numbers, and the values of the unknown columns are determined using a random distribution before prediction.

1.2. Contribution

The proposed solution works in two different stages. The first stage defines the model or rules representing the knowledge about the data. The model is defined separately by every communicating party. While defining the model, the values of the unknown attributes are inferred from bamboozled abstract information. The abstract information is high level generalized information, which is further disguised using a disguising factor so that the other communicating party is deceived by the information shared; this is referred to as the bamboozle process. The bamboozling is performed to preserve privacy to the maximum extent.

The data obtained after bamboozling is misleading metadata. The metadata is processed by every communicating party to find the values of the unknown attributes. The aggregation of both unknown and known attributes produces the regression model. The regression model represents the upper bound values, which are optimized by applying gradient methods until the required accuracy is reached. The prediction function used here is the quadratic function w²F, where w represents the weight vector used as the coefficient of the function; it is diminished in every iteration of gradient descent until the local optimum is reached. The factor that decides the local minimum is also referred to as the learning stoppage, as it terminates the prediction process. There are two approaches to gradient descent, i.e., batch and stochastic.

In the stochastic model, prediction is performed by considering one data sample at a time. In batch gradient descent all the samples are processed before an individual sample is updated. Batch gradient descent takes longer in its inner loop (a large sum), but it can use a larger step size because of this sum. Although these methods assume un-thresholded linear units, they can easily be modified to work on regular perceptrons [7-12]. In this paper the behaviour of both batch and stochastic gradient methods is analyzed w.r.t. the number of iterations and the execution time required to perform the prediction process. The bamboozled information is conveyed using a standard, machine readable format. One such metadata format is the Resource Description Framework (RDF), in which the information is represented as a sequence of triplets. Every RDF document [13] should adhere to the respective domain ontology. An ontology [14] defines concepts and the relations among the concepts; RDF describes the web document in the form of triplets. Every RDF triplet is a composition of Subject, Predicate and Object [15]. The RDF is processed by every communicating party to get the values of the unknown attributes.

1.3. Organization

This section describes the rest of the paper in brief. Section 2 describes other significant related work performed in this area. Section 3 describes the background work and its comparison with the current solution. Section 4 defines the problem. Section 5 defines the mathematical model. Section 6 describes the algorithms used to arrive at the solution. Section 7 gives the overall system architecture along with the ontology model used in this paper. Section 8 explains the experimental results and the data set used to simulate the process. The paper concludes by mentioning the enhancements that can be incorporated in the prediction process along with suppression, followed by the list of references.

2. RELATED WORK

Some predominant research has been performed in the area of data privacy preservation. Kanishka Bhaduri et al. [16] suggested an approach that allows a user to control the amount of privacy by varying the degree of nonlinearity. It is also shown how the general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. There is also an analysis of the proposed


nonlinear transformation in full generality, showing that, for specific cases, it is distance preserving. A main contribution of that paper is the discussion of the relation between the invertibility of a transformation and privacy preservation, and the application of these techniques to outlier detection.

Benjamin C M Fung et al. [17] studied the resolution of a privacy problem in a real-life mashup application for the online advertising industry in social networks, and proposed a service-oriented architecture along with a privacy-preserving data mashup algorithm to address the challenges. Experiments on real-life data suggest that the proposed architecture and algorithm are effective for simultaneously preserving both privacy and information utility of the mashup data. To the best of the authors' knowledge, it is the first work that integrates high-dimensional data for a mashup service.

Jung Yeon Hwang et al. [18] presented a short group signature scheme for dynamic membership with controllable linkability. The controllable linkability enables an entity who possesses a special linking key to check whether two signatures are from the same signer while preserving anonymity. It can be used for various anonymity-based applications that necessarily require linkability, such as vehicular ad hoc networks and privacy-preserving data mining. The scheme is sufficiently efficient and is thus well-suited for real-time applications even with restricted resources, such as vehicular ad hoc networks and the Trusted Platform Module.

Zhu Yu-quan, Tang Yang et al. [19] addressed the problem that the existing protocol for secure two-party vector dot product computation has low efficiency and may disclose private data; they put forward a method that is effective for finding frequent item sets on vertically distributed data. The method uses a semi-honest third party to participate in the calculation: the converted data of the parties is given to the third party to compute. Alberto Trombetta et al. [20] have come up with two protocols solving the k-anonymity problem on suppression-based and generalization-based k-anonymous and confidential databases. The protocols rely on well-known cryptographic assumptions, with theoretical analyses to prove their soundness and experimental results to illustrate their efficiency.

3. BACKGROUND WORK

Shuguo Han et al. [6] have proposed the application of stochastic gradient descent methods on vertically partitioned data. Their experiment is performed on a matrix of numbers, and the data of the unknown attributes is generated using a random distribution. The current solution, in contrast, gives relevant information to the other communicating party to find the values of the unknown attributes. Even though the information is relevant, it preserves the privacy of the data by exposing only upper bound generalized values. To strengthen the privacy, the generalized data is disguised, and hence the complete data given to the other communicating party forms semantic metadata. Even though the gradient descents take more iterations because of the bamboozling process, accurate prediction is ensured by using a suitable learning stoppage.

4. PROBLEM DEFINITION

Given a database distributed across the communicating parties using vertical partitioning, the main problem is to perform the prediction of a single datum with proper synchronization but without disclosing each other's data; in other words, the privacy of the data should not be affected. The prediction is performed using gradient descent methods incorporated as an iterative process. The gradient descents are required to be applied on privacy preserved data.

Assumptions: It is assumed that the data is partitioned and that both communicating parties adhere to a common ontology model.

5. MATHEMATICAL MODEL

5.1. List of notations used

The following table enumerates the notations used and their purpose in defining the model.


Table 1
Basic Notations

Notation      Meaning
ηs            Learning rate for stochastic gradient.
ηb            Learning rate for batch gradient.
λ             Learning stoppage.
f→            Result vector after first stage prediction.
p→            Result vector after second stage prediction.
w→            Weight vector used as a reduction factor.
F(w→, f→)     Second stage prediction function, defined as a function of the weight vector and the first stage result vector.
x             Unpartitioned data records.
n             Number of data records.
m             Number of attributes of every data record.
E→            Expected vector for all data records.

5.2. Definitions

Learning Rate: The learning rate is the factor by which the value is minimized. The learning rate and the number of iterations to reach the local minimum are inversely proportional to each other, i.e., if the learning rate is high then the number of iterations required to reach the local minimum of a function is small, and vice versa.

ηs is the learning rate of the stochastic gradient; for the employee case study it is maintained at 0.00001, as the sample values are in the hundred to thousand range.

ηb is the learning rate of the batch gradient method; since batch gradient runs over all the samples before it updates a particular sample, it is maintained at the smaller value of 0.000001 for the employee case study, i.e., ηb < ηs.

Weight Vector: The weight vector is a reduction factor used to minimize the first stage prediction result. The gradient descent methods run over several iterations, and in every iteration the weight vector is reduced by a certain factor until the local minimum is reached. The weight vector is defined as

W = { w_i | w_i ∈ R+, 0 ≤ i < n }        (1)

Any ith value of the weight vector is a positive number. For the current implementation every element of the weight vector is initialized to 1 and is reduced by a small factor in every iteration.

Unpartitioned Data Records: The data set is a collection of data records, where every record is the information about an individual object, defined as

X = { Ob_i | Ob_i = (value_1, value_2, ..., value_m), 0 ≤ i < n }        (2)

Any ith value of X is an object of m attributes; the entire table structure is of dimension n × m.

First Stage Vector: This is the result of the first stage prediction and is defined as

f = { f_i | f_i ∈ R+, 0 ≤ i < n }        (3)

Any ith value of the first stage vector is a positive number. The prediction is done for every object using the RDF metadata, and hence the magnitude of f, i.e., |f|, is n.

Second Stage Prediction Function: In the second stage the gradient descent methods are applied on f→ (the first stage prediction result) by multiplying with the weight vector over multiple iterations. In every iteration the prediction function is minimized using the updated vector. The prediction function is defined as

F(w→, f→) = p→ = w→² ∗ f→        (4)

The prediction function is the square of the weight vector multiplied elementwise with the first stage prediction vector. The quadratic factor is used in order to minimize the function in fine granules without data loss. Eq. (4) can also be written as

{ p_i = w_i² ∗ f_i | 0 ≤ i < n }        (5)


Every ith value is predicted as the square of the ith element of the weight vector multiplied by the ith element of the first stage prediction vector.
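For concreteness, Eq. (5) can be sketched in Java (the implementation language mentioned in Section 8; the method name and array representation are illustrative, not part of the paper):

    // Eq. (5): p_i = w_i^2 * f_i, computed elementwise over all n records.
    static double[] predict(double[] w, double[] f) {
        double[] p = new double[f.length];
        for (int i = 0; i < f.length; i++) {
            p[i] = w[i] * w[i] * f[i]; // square of the weight scales the first stage value
        }
        return p;
    }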

Updation of the Weight Vector: There are two variations in updating the weight vector, depending on the type of gradient descent method. The stochastic gradient approach updates a sample as soon as it is encountered, unlike the batch gradient approach, which runs over all samples before it updates any individual sample. According to the stochastic gradient method, any ith element of the weight vector is updated as per Eq. (6):

w_i = w_i − ηs ∇F(w_i, f_i)        (6)

The batch gradient method considers all the elements of the weight vector before it updates a particular element, and hence it is defined as per Eq. (7):

w_i = w_i − ηb Σ_{j=0}^{n−1} ∇F(w_j, f_j)        (7)

Here ∇F(w_j, f_j) is the differential of the second stage prediction function w.r.t. w (the weight vector), and hence it is defined as

∇F(w→, f→) = 2 ∗ w→ ∗ f→        (8)

In every iteration the next state of the weight vector is computed from the previous value. When ∇F(w→, f→) becomes zero, F(w→, f→) reaches its minimum value and learning stops, as no value change happens from one step to the next.
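The two update rules can be sketched as follows, continuing the Java sketch above (a simplified single-party view; the helper names are illustrative, and the gradient comes from Eq. (8)):

    // Eq. (8): gradient of the prediction function w.r.t. one weight element.
    static double grad(double w, double f) {
        return 2.0 * w * f;
    }

    // Eq. (6): stochastic update, each element uses only its own gradient.
    static void stochasticStep(double[] w, double[] f, double etaS) {
        for (int i = 0; i < w.length; i++) {
            w[i] -= etaS * grad(w[i], f[i]);
        }
    }

    // Eq. (7): batch update, the gradient summed over all samples is applied
    // to every element, which is why the smaller rate etaB is used.
    static void batchStep(double[] w, double[] f, double etaB) {
        double sum = 0.0;
        for (int j = 0; j < w.length; j++) {
            sum += grad(w[j], f[j]);
        }
        for (int i = 0; i < w.length; i++) {
            w[i] -= etaB * sum;
        }
    }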

Expected Vector: The expected vector is the collection of expected values for every data record and is defined as

E_i = Σ_{j=0}^{m−1} X_{i,j}, 0 ≤ i < n        (9)

Learning Stoppage and Expectation Probability: The learning stoppage is the minimum probability that decides the termination point of the learning process. The expectation probability is the ratio of the least square value of the prediction to that of the expected outcome and is defined as per Eq. (10):

ep = ( Σ_{i=0}^{n−1} p_i² / 2 ) ÷ ( Σ_{i=0}^{n−1} E_i² / 2 )        (10)

The learning process stops when the expectation probability hits the learning stoppage, i.e., when ep ≤ λ.
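A minimal sketch of the stopping test of Eq. (10), with illustrative names (p and e stand for the prediction and expected vectors):

    // Eq. (10): ratio of the half least-square sums of prediction and expectation.
    static double expectationProbability(double[] p, double[] e) {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < p.length; i++) {
            num += p[i] * p[i] / 2.0;
            den += e[i] * e[i] / 2.0;
        }
        return num / den;
    }

    // Learning stops once the expectation probability hits the learning stoppage.
    static boolean shouldStop(double[] p, double[] e, double lambda) {
        return expectationProbability(p, e) <= lambda;
    }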

6. ALGORITHMS

The entire learning process is driven by a semantic web based two stage prediction process. Each communicating party generates the required RDF metadata for its respective partitioned data. The first stage prediction process starts with the communicating parties exchanging the RDF metadata for their respective unknown values. The RDF metadata provides high level disguised information that allows the communicating parties to infer upper bound values for the unknown attributes.

Once the unknown values are inferred, they are processed with the known data in the second stage prediction process to reach an approximation to the expected vector. Since the first stage prediction operates on upper bound values, the gradient descent methods are applied in the second stage prediction process to minimize the prediction vector in the direction of negative steepest descent until the learning stoppage is reached. The data is partitioned vertically, where every communicating party has the same set of records but with varying attributes. The algorithm in Table 2 is used for generating the RDF metadata model for the contents of the data records. Every data record has a certain criterion with which the most generalized information is retrieved. For an employee record, one such criterion is the type/category of the employee. The category in the employee context defines the designation or job position, like Team Lead, Project Manager, etc. Depending on the category, the generic information of the salary components is retrieved. The algorithm finds the maximum value of every salary component (PF, Gratuity, ...) depending on the class of employee, and hence an RDF record is created for every data record with upper bound values for every attribute as generic information. The generic


Table 2
Algorithm for generating RDF Model

Input:
1. Ontology_mod: common ontology model used by all parties.
2. Xa: Alice's partitioned data.
3. Xb: Bob's partitioned data.
4. df: disguising factor, a factor by which the original data is generalized.

Output:
1. RDF_A: Alice's department RDF for Bob.
2. RDF_B: Bob's department RDF for Alice.

Process:
1. data_set = Xa.
2. attributes = list of attributes of data_set; the first element is always the record identifier.
3. RDF_Mod = RDF_A.
4. for every record R_i in data_set do
       for every attribute j do
           max_attr_j = maximum value of the jth attribute depending on the record criteria.
           Rel_j = relation from Ontology_mod corresponding to the jth attribute.
           Val_j = max_attr_j + df (generalize the value using the disguising factor).
           RDF<R_i, Rel_j, Val_j> = RDF triple with the record identifier as subject, Val_j as object and Rel_j as the predicate defining the context between subject and object.
           RDF_Mod = RDF_Mod ∪ RDF<R_i, Rel_j, Val_j>.
       done
   done
5. Repeat step 4 with data_set = Xb and RDF_Mod = RDF_B.

information is further disguised using the disguising factor, and hence the privacy of every data record is retained. All RDF triples are defined using a single ontology model; a sketch of the triple-building step follows.
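As a sketch under stated assumptions, the inner step of Table 2 could be written with Apache Jena (the paper does not name an RDF library; the class and method names below are Jena's, while recordUri, attrs, maxAttr and df mirror the algorithm's variables):

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.Resource;

    public class RdfGenerator {
        static final String NS = "http://www.ppgd.com/"; // namespace used in Table 5

        // Emits one generalized, disguised triple per known attribute of a record.
        static Model toRdf(String recordUri, String[] attrs, double[] maxAttr, double df) {
            Model model = ModelFactory.createDefaultModel();
            Resource subject = model.createResource(recordUri); // record identifier R_i
            for (int j = 0; j < attrs.length; j++) {
                Property rel = model.createProperty(NS, attrs[j]); // Rel_j from the ontology
                model.addLiteral(subject, rel, maxAttr[j] + df);   // Val_j = max_attr_j + df
            }
            return model;
        }
    }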

Table 3
Algorithm for First Stage Prediction Process

Input: RDF_A, RDF_B.

Output: Af→, Bf→.

Process:
1. Alice initiates communication using CON_INIT (connection initialization segment) and Bob responds with CON_INIT_ACK as acknowledgement for CON_INIT.
2. Alice requests the location of Bob's RDF_B using REQUEST RDF_B (a request packet of the format REQUEST followed by the RDF filename); Bob responds with RESPONSE RDF_B_URL (the location of his RDF) and piggybacks the response with REQUEST RDF_A (a request for the location of Alice's RDF).
3. Alice responds with the location of RDF_A in the packet RESPONSE RDF_A_URL and piggybacks the termination request CON_TERM.
4. Bob responds with CON_TERM_ACK as acknowledgement for closing the connection.
5. Alice reads the metadata from RDF_B_URL and predicts the data for her unknown attributes.
6. Bob reads the metadata from RDF_A_URL and predicts the data for his unknown attributes.
7. Alice and Bob produce the first stage vectors Af→ and Bf→ by summing up the values of the known attributes with the unknown values interpreted from the RDF model for every record.

The algorithm in Table 3 covers the first stage prediction, in which Alice and Bob exchange the RDF models containing the disguised abstract information. The abstract information is generic: for every employee, the values are generalized by finding the maximum value of the corresponding attributes based


on the record criteria, i.e., the type of employee; the generalized information is then disguised using the disguising factor. For example, the basic salary of any employee who is a team lead is updated with the maximum of the basic salaries of all employees belonging to the team lead class, plus the disguising factor, forming more generic information.

In the first stage prediction process Alice and Bob mutually exchange the locations of the generated RDF data, and each party individually processes the RDF data to find the values of its unknown attributes. The first stage prediction vector is produced by summing up the values of the known attributes and the interpreted values of the unknown attributes. The communicating parties follow a specific protocol for exchanging the required messages. The protocol uses the following message segments:

CON_INIT: Connection initialization segment, sent by either communicating party to trigger the communication.
CON_INIT_ACK: Acknowledgement for CON_INIT, sent by the receiver.
REQUEST Message: The message is a string which represents the request information. In the first stage prediction it is the filename of the required RDF; in the second stage it is the name of the vector expected from the other communicating party.
RESPONSE Message: The response to a request message. The message in this context is a list of strings separated by the delimiter |. In the first stage it is of the form RDF_FILE_NAME|RDF_URL; in the second stage it is VECTOR_NAME|JSON(VECTOR_NAME), where JSON(VECTOR_NAME) is a JSON representation of the vector.
CON_TERM: Request to terminate the communication.
CON_TERM_ACK: Acknowledgement for CON_TERM.

Both communicating parties hold a shared secret key, and the messages are encrypted using the DES algorithm. All the protocol segments are encrypted and the ciphertext is exchanged between the communicating parties; this secure communication adds further privacy to the data.
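A minimal sketch of the segment encryption using the standard javax.crypto API; in the protocol the DES key is pre-shared between the parties, and generating it in place here only keeps the example self-contained:

    import java.nio.charset.StandardCharsets;
    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;

    public class SegmentCrypto {
        public static void main(String[] args) throws Exception {
            // Stand-in for the key shared beforehand between Alice and Bob.
            SecretKey shared = KeyGenerator.getInstance("DES").generateKey();

            Cipher cipher = Cipher.getInstance("DES");
            cipher.init(Cipher.ENCRYPT_MODE, shared);
            byte[] wire = cipher.doFinal("CON_INIT".getBytes(StandardCharsets.UTF_8));

            cipher.init(Cipher.DECRYPT_MODE, shared);
            String segment = new String(cipher.doFinal(wire), StandardCharsets.UTF_8);
            System.out.println(segment); // prints CON_INIT
        }
    }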

Table 4
Algorithm for Second Stage Prediction Process

Input:
1. Af→, Bf→: Alice's and Bob's first stage prediction vectors.
2. wa→, wb→: Alice's and Bob's unit weight vectors.
3. E→: expected vector.
4. GDTYPE: type of gradient descent method; the value is either Stochastic or Batch.

Output: p→, the second stage vector resulting from the application of the gradient descent methods.

Process:
1. Alice computes AP→ = wa→² ∗ Af→ and Bob computes BP→ = wb→² ∗ Bf→.
2. Alice requests BP→ and Bob requests AP→.
3. After the successful exchange, Alice and Bob jointly calculate p→ = (AP→ + BP→)/2.
4. Calculate the expectation probability (ep) according to Eq. (10).
5. if ep ≤ λ then
       go to step 6.
   else
       if GDTYPE == Stochastic then
           update the weight vectors (wa→, wb→) as per Eq. (6).
       else
           update the weight vectors (wa→, wb→) as per Eq. (7).
       end if
       go to step 1.
   end if
6. Stop the process.


Table 4 shows the algorithm for the second stage prediction. The algorithm takes the first stage result, the weight vectors, the expected vector and the gradient descent type as inputs. It applies the prediction function, as shown in step 1, to find the second stage prediction vector. The predicted vector is normalized/fine tuned using the gradient descent methods in the direction of negative steepest descent until the expectation probability reaches the learning stoppage. The algorithm applies only one gradient descent method at a time. It is an iterative algorithm: in every iteration the weight vector is reduced and the prediction function is applied until the expectation probability becomes less than or equal to the learning stoppage.

As shown in step 5, the weight vectors are updated based on the type of gradient descent method. If the type is stochastic, an individual element is updated at a time as per Eq. (6); if the type is batch gradient descent, the gradient summed over all elements is calculated before any element of the weight vector is updated, as per Eq. (7). The second stage algorithm uses the same protocol segments as the first stage prediction for the exchange of messages, and all the messages are encrypted before they are exchanged.
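Putting the pieces together, the loop of Table 4 can be sketched as below. This is a single-party simplification: the exchange and averaging of AP→ and BP→ between Alice and Bob (steps 2-3) is elided, and predict, stochasticStep, batchStep and expectationProbability are the illustrative sketches given alongside Eqs. (5)-(10):

    // Iterate the second stage prediction until ep <= lambda (steps 1, 4-6 of Table 4).
    static double[] secondStage(double[] f, double[] expected, double lambda,
                                boolean stochastic, double etaS, double etaB) {
        double[] w = new double[f.length];
        java.util.Arrays.fill(w, 1.0);            // weight vector initialized to 1
        double[] p = predict(w, f);               // step 1: apply Eq. (5)
        while (expectationProbability(p, expected) > lambda) { // steps 4-5
            if (stochastic) {
                stochasticStep(w, f, etaS);       // Eq. (6)
            } else {
                batchStep(w, f, etaB);            // Eq. (7)
            }
            p = predict(w, f);                    // back to step 1
        }
        return p;                                 // step 6
    }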


7. SYSTEM ARCHITECTURE

As shown in Figure 2, the data records are vertically partitioned between the two processes (Alice and Bob). Vertical partitioning means that every process has the same set of records with disjoint attribute sets; horizontal partitioning means that every process has a disjoint set of records but with the same attributes. The complete prediction process is divided into two stages.

Figure 1. Ontology Model for Employee Domain

7.1. Ontology Model

The ontology defines concepts and relations between the concepts. An ontology is domain specific and is generated based on the database schema. The ontology defines a concept for every table in the schema, and the relations between the tables are defined as object and datatype properties. An object property describes an object in terms of another object, and the object being described is referred to as the subject. A datatype property describes the subject using textual information; for example, the email of Bob can be described with the relation <hasEmail> between a person and a String, where Bob belongs to the concept Person and his email, say [email protected], is conceptualized as a String. The ontology model gives the specification to which the multiple RDFs adhere.
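As a sketch of how the relations of Figure 1 could be declared, again assuming Apache Jena (illustrative only; the paper gives the model, not its construction code):

    import org.apache.jena.ontology.DatatypeProperty;
    import org.apache.jena.ontology.OntClass;
    import org.apache.jena.ontology.OntModel;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.vocabulary.XSD;

    public class EmployeeOntology {
        public static void main(String[] args) {
            OntModel ont = ModelFactory.createOntologyModel();
            OntClass employee = ont.createClass("http://www.ppgd.com/Employee");

            // <hasMaxBasic>: per-category maximum of the Basic salary component.
            DatatypeProperty hasMaxBasic =
                ont.createDatatypeProperty("http://www.ppgd.com/hasMaxBasic");
            hasMaxBasic.addDomain(employee);
            hasMaxBasic.addRange(XSD.xdouble);

            ont.write(System.out, "RDF/XML"); // serialize the ontology
        }
    }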

7.2. RDF Metadata and First Stage Prediction

The RDF represents the metadata for the corresponding partitioned data. The metadata is high level generalized information, further disguised to increase its generality using the disguising factor. The RDF is represented as a collection of triplets defined according to the corresponding domain ontology. Alice and Bob exchange the


RDF information mutually before the actual prediction process. The data inferred from the RDF is the result of the first stage prediction process, and this result defines the model for the second stage prediction.

In the machine learning context, the model is the knowledge extracted from training data with the aid of past experience, but here a new approach is used to define the rules/model/knowledge from the actual data: the RDF content is used as the semantics to define the model instead of the training data. Since the semantics is relevant to, or near, the actual data but is not the data itself, the privacy is preserved. The result of the first stage prediction process in this context is the regression model. Fig. 1 shows the ontology model for the employee domain. The ontology model is a definition of the concepts and relations used for generalization. It defines <hasMax> and <hasMin> relations between the Employee and Numbers domains. <hasMax> defines the maximum value of an attribute among the records belonging to a specific category like Project Manager, Team Lead or Program Manager. For example, <hasMaxBasic> represents the maximum value of the basic component of salary, and this is calculated for every category to create the generalized information. Similarly, <hasMinBasic> represents the minimum value of the basic component of salary. All the attributes are represented as datatype properties except <hasData>, which represents the partitioned data of the Employee database. The RDF generator uses the relations defined in the ontology model to represent the most generalized information as semantics for the known attributes. A sample of the generated RDF is shown in Table 5.

The RDF metadata is high level generalized information disguised with a certain disguising factor to preserve the privacy of the data. The disguising factor is not disclosed between the communicating parties; each of them can use its own disguising factor. Alice and Bob jointly define the model from the deceived data.

Table 5
Sample RDF of Alice for Bob

<rdf:RDF xmlns:j.0="http://www.ppgd.com/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://www.SkumarSolutions.com/ID10">
    <j.0:hasMinflat>37</j.0:hasMinflat>
    <j.0:hasMinTravel>38</j.0:hasMinTravel>
    <j.0:hasMaxTravel>55</j.0:hasMaxTravel>
    <j.0:hasMinBasic>20</j.0:hasMinBasic>
    <j.0:hasMaxflat>45</j.0:hasMaxflat>
    <j.0:hasMaxBasic>30</j.0:hasMaxBasic>
    <j.0:hasMinHRA>30</j.0:hasMinHRA>
    <j.0:hasMaxHRA>42</j.0:hasMaxHRA>
    <j.0:hasName>reva123</j.0:hasName>
  </rdf:Description>
</rdf:RDF>

7.3. RDF Generator

The RDF generator is a vital component responsible for generating the required RDF semantics for the vertically partitioned data. This component runs over every record and creates the respective generalized data depending on the category of the record. The generalized information, aggregated with the disguising factor, is what deceives the other communicating party. The disguised information is used to infer the values of the unknown attributes of the records, and finally a regression model is defined as a result.
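A sketch of the consuming side, under the same Jena assumption as above: the receiving party reads the RDF from the URL exchanged in the first stage protocol and extracts the disguised upper bound for an unknown attribute:

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.rdf.model.Statement;

    public class RdfReader {
        // rdfUrl is the RDF_URL received in the RESPONSE message of the protocol.
        static double upperBound(String rdfUrl, String recordUri, String attr) {
            Model received = ModelFactory.createDefaultModel();
            received.read(rdfUrl);                       // fetch and parse the metadata
            Resource record = received.getResource(recordUri);
            Statement st = record.getProperty(
                received.createProperty("http://www.ppgd.com/", attr));
            return st.getDouble();                       // disguised upper bound value
        }
    }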

7.4. Second Stage Prediction

In the second stage prediction, the regression model is used as input for further optimization. The coefficients of the regression model are optimized using gradient descent methods. The batch/stochastic gradient descents are applied in the direction of negative steepest descent until the threshold is reached. The threshold indicates the fine granular value below which further optimization is not required, and hence it is the stopping point of the prediction. The stochastic gradient descent reduces the weight vector by considering a single element at a time, whereas the batch gradient runs over all elements before it updates a particular element of the weight vector. In every iteration the weight


Figure 2. Components in TPPGD

vector is updated, and the predicted vector after every iteration is exchanged securely. Any data exchanged between the communicating parties is encrypted using a common secret key with a known cryptographic algorithm. This adds security on top of the privacy preserved data.

8. EXPERIMENTAL RESULTS

The experiment is carried out on an employee database for the calculation of salary. In the scenario, Alice and Bob head payroll departments with a distributed database. The employee table is vertically partitioned in such a way that the


attributes forming the salary components are distributed across Alice's and Bob's departments. The identified salary component attributes are Basic, HRA, flat, Travel, PF, Gratuity, GDP and Performance Award. After partitioning, Alice has { Basic, HRA, flat, Travel } and Bob has { PF, Gratuity, GDP, Performance Award }, as shown in Figure 3. The Employee Id and

Employee: | EmpID | name | Basic | HRA | flat | Travel | PF | Gratuity | GDP | PerformanceAward | Category |

(Alice): | EmpID | Basic | HRA | flat | Travel | Category |

(Bob): | EmpID | PF | Gratuity | GDP | PerformanceAward | Category |

Figure 3. Vertically Partitioned Data

Category is used in both departments for creating the RDF semantic information. The category in this scenario indicates the designation of the employee, which is used as the key attribute for generalizing the data. The generalization is performed by taking the maximum value of the known attributes depending on the category. For example, from the existing database content the maximum of Basic, PF, etc., is determined for every category like teamLead, projectManager and programManager, and it is further wrapped with the disguising factor. Alice and Bob perform this operation on their corresponding known attributes. The disguised information is serialized as RDF data and exchanged. Since the known attributes of Alice are the unknown attributes of Bob and vice versa, Alice and Bob mutually find the

values of the unknown attributes using the RDF semantics. The first stage prediction is performed by summing the values of the unknown attributes inferred from the RDF with the known attribute values for every employee record. This results in a single salary vector defining the model for the second stage prediction process at each department. The experiment is carried out with the disguising factor set to $10 during the first stage prediction process: the identified maximum values are disguised by adding $10 to the amount.
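As a worked illustration using the sample RDF of Table 5 (Bob's own component values below are assumed for the example): Alice publishes the disguised upper bounds hasMaxBasic = 30, hasMaxHRA = 42, hasMaxflat = 45 and hasMaxTravel = 55 for employee ID10. Bob infers 30 + 42 + 45 + 55 = 172 for his unknown attributes and, if his known PF, Gratuity, GDP and Performance Award components sum to, say, 120, obtains the first stage entry f_ID10 = 172 + 120 = 292 for that record.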

Alice and Bob perform the optimization of the regression model obtained as the result of the first stage prediction process. The optimization is achieved by applying the gradient descent methods to the predicted output. In gradient descent the weight vector is used as the coefficient of the prediction function as defined in Eq. (5); the weight vector is optimized as per Eq. (6) for stochastic gradient descent and as per Eq. (7) for batch gradient descent. In stochastic gradient descent the optimization happens with a smaller factor in each iteration, as one element of the weight vector is updated at a time, whereas in batch gradient descent the optimization happens with a larger factor. As shown in Figure 4, for high values of the minimization factor/learning stoppage the batch gradient descent method takes more iterations, up to a switching point of 0.5, after which the behaviour of batch and stochastic remains stagnant; the stochastic gradient takes fewer iterations for prediction below 0.5, i.e., for lower values of the minimization factor. Similarly, the execution time for batch gradient descent is high up to the switching point, as more iterations are needed for prediction, as shown in Figure 5; after the switching point the batch gradient still takes more time than the stochastic for lower values of the minimization factor. The experiment is carried out by varying the learning stoppage from high to low values, and the behaviour of the stochastic and batch gradient descents is analyzed with respect to the number of iterations and the time required to perform the prediction.


Figure 4. Batch vs Stochastic Gradient with respect to the Number of Iterations.

Figure 5. Batch vs Stochastic Gradient with respect to Execution Time.

The DES algorithm is used as the cryptographic algorithm to retain the confidentiality of the data exchanged between Alice and Bob. The experiment is simulated with raw sockets and implemented in Java. The Bob socket acts as the server socket, waiting for Alice to initiate the communication. In the first stage prediction the RDF locations are exchanged securely, and in the second stage the regression model obtained as the result of the first stage prediction is exchanged securely between Alice and Bob.

9. CONCLUSION

In this paper a new approach is proposed to retain the privacy of data with the combination of generalization and bamboozling. Bamboozling is the process in which a second level of privacy is achieved by disguising the generalized information with a certain factor to deceive the other communicating parties. The complete generalization and bamboozling process happens in the first stage prediction, which defines the regression model used as input to the second stage prediction for optimization and accurate prediction.

The experimental results show the behaviour of the batch and stochastic processes with respect to the number of iterations and the execution time required for the prediction process. It is shown that batch gradient descent takes longer and needs more iterations than the stochastic method for higher values of the minimization factor. The solution can be enhanced by adding the suppression technique to the generalization and bamboozling processes to further strengthen the privacy of the data.

REFERENCES

1. Microsoft Research. Database Privacy, http://research.microsoft.com/en-us/projects/DatabasePrivacy.

2. Latanya Sweeney. Achieving k-Anonymity Privacy Protection Using Generalization and Suppression, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5): 571-588, 2002.

3. Gabriel Ghinita, Panos Kalnis and Yufei Tao. Anonymous Publication of Sensitive Transactional Data, IEEE Transactions on Knowledge and Data Engineering, 23(2): 161-174, Feb 2011.

4. Hooper N, Saunders J and McHugh L. The Derived Generalization of Thought Suppression, Learning & Behavior, 38(2): 160-168, 2010.

5. Microsoft Research. Database Privacy, http://research.microsoft.com/en-us/projects/DatabasePrivacy.

6. Shuguo Han, Wee Keong Ng, Li Wan and Vincent C S Lee. Privacy-Preserving Gradient-Descent Methods, IEEE Transactions on Knowledge and Data Engineering, 22(6): 884-899, June 2010.

7. Afshar P. Gradient Descent Optimisation for ILC-based Stochastic Distribution Control, IEEE International Conference on Control and Automation (ICCA), 1134-1139, 2009.

8. Zhi Ding, Junqiang Hu and Dayou Qian. On Steepest Descent Adaptation: A Novel Batch Implementation of Blind Equalization Algorithms, IEEE Global Telecommunications Conference (GLOBECOM 2010), 1-6, 2010.

9. Gannot S. Iterative-batch and Sequential Algorithms for Single Microphone Speech Enhancement, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-97), 2: 1215-1218, 1997.

10. Gonzalez A. A Note on Conjugate Natural Gradient Training of Multilayer Perceptrons, International Joint Conference on Neural Networks (IJCNN '06), 887-891, 2006.

11. Ningning Jia and E Y Lam. Stochastic Gradient Descent for Robust Inverse Photomask Synthesis in Optical Lithography, 17th IEEE International Conference on Image Processing, 2010.

12. S Bonnabel. Stochastic Gradient Descent on Riemannian Manifolds, IEEE Transactions on Automatic Control, 58(9): 2217-2229, 2013.

13. D Beckett. RDF/XML Syntax Specification (Revised), http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/, 2004.

14. John Hebeler, Matthew Fisher, Ryan Blace and Andrew Perez-Lopez. Semantic Web Programming, Third Edition, Wiley India Pvt. Ltd, 2009.

15. Stefan Decker, Sergey Melnik, Frank van Harmelen, Dieter Fensel, Michel Klein, Jeen Broekstra, Michael Erdmann and Ian Horrocks. The Semantic Web: The Roles of XML and RDF, IEEE Internet Computing, 63-73, 2000.

16. Kanishka Bhaduri, Mark D Stefanski and Ashok N Srivastava. Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion, IEEE Transactions on Systems, Man and Cybernetics, 41(1): 260-272, 2011.

17. Benjamin C M Fung, Thomas Trojer, Patrick C K Hung, Li Xiong, Khalil Al-Hussaeni and Rachida Dssouli. Service-Oriented Architecture for High-Dimensional Private Data Mashup, IEEE Transactions on Services Computing, 5(3): 373-386, 2012.

18. Jung Yeon Hwang, Sokjoon Lee, Byung-Ho Chung, Hyun Sook Cho and DaeHun Nyang. Short Group Signatures with Controllable Linkability, Workshop on Lightweight Security and Privacy: Devices, Protocols and Applications, 44-52, March 2011.

19. Zhu Yu-quan, Tang Yang and Chen Geng. A Privacy Preserving Algorithm for Mining Distributed Association Rules, International Conference on Computer and Management (CAMAN), 1-4, May 2011.

20. Alberto Trombetta, Wei Jiang, Elisa Bertino and Lorenzo Bossi. Privacy-Preserving Updates to Anonymous and Confidential Databases, IEEE Transactions on Dependable and Secure Computing, 8(4): 578-587, July-August 2011.

Kumaraswamy S is currently working as an Assistant Professor in the Department of Computer Science and Engineering, KNS Institute of Technology, Bangalore, India. He obtained his Bachelor of Engineering from Siddaganga Institute of Technology, Tumkur, Bangalore University, Bangalore. He is presently pursuing his Ph.D programme in the area of privacy management in databases at Bangalore University. His research interests are in the areas of Data Mining, Web Mining and the Semantic Web.

Srikanth P L received his Master's degree from the Department of Computer Science and Engineering, University Visvesvaraya College of Engineering, Bangalore University, Bangalore. His research interests are in the areas of Web Technology, the Semantic Web and Cloud Computing.


S H Manjula is currently the Chairman, Department of Computer Science and Engineering, University Visvesvaraya College of Engineering, Bangalore University, Bangalore. She obtained her Bachelor of Engineering and Master's degree in Computer Science and Engineering from University Visvesvaraya College of Engineering. She was awarded a Ph.D. in Computer Science from Dr. MGR University, Chennai. Her research interests are in the fields of Wireless Sensor Networks and Data Mining.

K R Venugopal is currently the Principal, University Visvesvaraya College of Engineering, Bangalore University, Bangalore. He obtained his Bachelor of Engineering from University Visvesvaraya College of Engineering. He received his Master's degree in Computer Science and Automation from the Indian Institute of Science, Bangalore. He was awarded a Ph.D. in Economics from Bangalore University and a Ph.D. in Computer Science from the Indian Institute of Technology, Madras. He has a distinguished academic career and has degrees in Electronics, Economics, Law, Business Finance, Public Relations, Communications, Industrial Relations, Computer Science and Journalism. He has authored 39 books on Computer Science and Economics, which include Petrodollar and the World Economy, C Aptitude, Mastering C, Microprocessor Programming, Mastering C++ and Digital Circuits and Systems. During his three decades of service at UVCE he has over 350 research papers to his credit. His research interests include Computer Networks, Wireless Sensor Networks, Parallel and Distributed Systems, Digital Signal Processing and Data Mining.

L M Patnaik is currently Honorary Professor, Indian Institute of Science, Bangalore, India. He was Vice Chancellor of the Defence Institute of Advanced Technology, Pune, India, and was a Professor with the Department of Computer Science and Automation, Indian Institute of Science, Bangalore, from 1986. During the past 35 years of his service at the Institute he has over 700 research publications in refereed international journals and conference proceedings. He is a Fellow of all four leading Science and Engineering Academies in India, a Fellow of the IEEE and of the Academy of Science for the Developing World. He has received twenty national and international awards; notable among them is the IEEE Technical Achievement Award for his significant contributions to High Performance Computing and Soft Computing. His areas of research interest have been Parallel and Distributed Computing, Mobile Computing, CAD for VLSI circuits, Soft Computing and Computational Neuroscience.