DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2018

A Step Toward GDPR Compliance
Processing of Personal Data in Email

LINNEA OLBY
ISABEL THOMANDER

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF INDUSTRIAL ENGINEERING AND MANAGEMENT


SUMMARY

The General Data Protection Regulation took effect on 25 May 2018 and arose as a response to the growing importance of IT in today’s society and the public’s demand for greater individual control over personal data. Unlike the previous directive, the new regulation also covers personal data stored in unstructured form, such as email, rather than only in structured form. Many companies are therefore forced to adapt to this, along with several other new requirements, in order to comply with the regulation. This study aims to propose a code of conduct for the processing of personal data in email as a tool for reaching compliance. In addition, it investigates whether Named Entity Recognition (NER) can be used as an aid in identifying personal data, more specifically names. A literature review of previous research and current recommendations was carried out ahead of drafting the code of conduct. A NER system was constructed using Binary Logistic Regression, hand-crafted rules and gazetteers. The model was applied to a selection of emails, with any attachments, provided by a small consultancy active in the automotive industry. The recommended code of conduct consists of six items, applied to the consultancy. The NER model showed a low ability to identify names and was therefore deemed unsuitable for the task.


ACKNOWLEDGEMENTS

This thesis was written in collaboration with the company On-On AB. We therefore want to thank Rolf Thomander, the founder and CEO of the company, for providing us with support and expertise throughout the study. We also wish to express our gratitude to our supervisors Olov Engwall and Bo Karlson at KTH Royal Institute of Technology for their support and guidance.


A Step Toward GDPR Compliance - Processing of Personal Data in Email

Linnea Olby and Isabel Thomander

Abstract: The General Data Protection Regulation (GDPR), which took effect on 25 May 2018, is a response to the growing importance of IT in today’s society, accompanied by public demand for control over personal data. In contrast to the previous directive, the new regulation also applies to personal data stored in an unstructured format, such as email, rather than solely to structured data. Companies are now forced to adapt to this change, among others, in order to be compliant. This study aims to provide a code of conduct for the processing of personal data in email as a measure for reaching compliance. Furthermore, it investigates whether Named Entity Recognition (NER) can aid this process as a means of finding personal data in the form of names. A literature review of current research and recommendations was conducted for the code of conduct proposal. A NER system was constructed using a hybrid approach with Binary Logistic Regression, hand-crafted rules and gazetteers. The model was applied to a selection of emails, including attachments, obtained from a small consultancy company in the automotive industry. The proposed code of conduct consists of six items, applied to the consultancy firm. The NER model demonstrated low ability to identify names and was therefore deemed insufficient for this task.

Index Terms: Information extraction, named entity recognition, machine learning, binary logistic regression, GDPR, code of conduct


1 INTRODUCTION

The emergence of IT and its many applications is difficult to separate from the notions of privacy and discussions about data protection. One could claim that technology is not privacy neutral: it has the potential to protect privacy, but digital development tends to do the opposite [1].

The General Data Protection Regulation (GDPR) is a response to the increasing importance of data in society and an attempt to modernize the preceding Data Protection Directive 95/46/EC, adopted in 1995. For the past two decades, the Data Protection Directive has been the directive upon which all members of the European Union (EU) rest their internal laws and regulations with regard to the processing of personal data and the free flow of such data. However, as developments in the IT industry have accelerated during the past decade [2], the existing laws and regulations are no longer sufficient, which is the main reason why an update of the 1995 directive was proposed. The GDPR was adopted by the EU Parliament in April 2016 and has applied since 25 May 2018, with the hope of creating a common ground for all EU members [3] while ensuring the right to privacy and the right to protection of personal information.

Although the key principles of data privacy remain in accordance with the preceding directive, a number of changes have been introduced. Among these are fines for violations of up to 4% of annual global turnover or €20 million (whichever is greater) [4] and the extended applicability to unstructured data. With current developments in areas such as Big Data comes an abundance of unstructured data, resulting in heightened demands on organizations to be compliant with legal and regulatory requirements. As the amount of data grows, new layers of complexity are added, one of which is the ability to find data quickly.

Today, more than 80% of the world’s data is unstructured [5], meaning it does not fit into conventional relational databases, the neat structures upon which most of today’s analytics is performed [6]. Examples of unstructured data include text documents, emails, social media posts, videos, audio files, and images [7], and its growth is constantly fuelled by advancements in technology. In order to handle the increasing amount of unstructured data while staying compliant with laws and regulations, businesses need to adapt.

Adaptation calls for new processes and procedures to be instilled, where setting up codes of conduct could prove helpful. In the face of the GDPR, several companies offer managerial solutions for reaching compliance. Equally important are the technological solutions for data processing, with a variety of alternatives on the market [8]. Although concerns have been voiced regarding the privacy implications of using machine learning for business analytics under the GDPR, the question arises whether it could instead help facilitate a company’s GDPR compliance. A possible technological solution would be to use machine learning techniques for finding personal data within the company’s systems.

The process of Information Extraction (IE) is about turning unstructured data embedded in texts into structured data to enable further processing. A sub-task of IE, Named Entity Recognition (NER), involves detecting references to entities such as names of people, companies and locations [9]. The term “Named Entity” was first introduced at the Sixth Message Understanding Conference (MUC-6) in 1996 [10]. At that time, the conference concentrated mostly on IE tasks such as extracting information regarding firm activities or defense-related activities from unstructured text sources. However, when approaching the task, participants noticed that the ability to distinguish information units was critical [10]. Today, these information units are known as Named Entities, and the task of detecting references to them in text, NER, is often considered the first step of any IE task [9].

1.1 Case study

On-On AB is a consultancy firm with three employees, focusing on business development within the automotive business. Services include dealer analyses and support of development projects within sales and after sales processes, designing and operating individual customer care feedback systems for sales and after sales representatives, and lost-sales analysis, among other things [11]. Recent clients include Audi Sweden, Porsche Sweden and Volkswagen Group Sweden as well as independent auto dealers.

The dealer analyses are performed monthly and entail the distribution of surveys to car dealers’ customers, with the help of customer information obtained from the client’s sales management system and delivered to On-On AB. The results from the surveys are compiled and reported back to the dealers. The customer care feedback systems for sales and after sales representatives are executed in the same manner.

Before the GDPR, On-On AB acted in accordance with the previous directive, under which an exception from the rules of data processing was made for unstructured data [12]. With the new regulation this exception is lifted, meaning On-On AB must review its processes.

1.2 Research Problem

In the case of On-On AB, the delivery of customer information and the distribution of reports are carried out via email. Hence, a framework for the processing of emails should be decided upon. This study aims to identify necessary changes by answering the following question:

• How should a micro-enterprise deal with the processing of personal data in email in order to comply with the GDPR?

Chapter 3 (Articles 12-23) of the GDPR [13] addresses the rights of data subjects. These rights include a data subject’s right to access their personal data and information concerning how it is processed, as well as tools for submitting requests for the rectification, erasure and export of such data. Hence, in order to be GDPR compliant, a company needs to be able to detect personal data in the unstructured data residing in its systems. Therefore, the following question is asked:

• How well suited is Named Entity Recognition for finding personal data in email?

Hence, the purpose of this study is to provide a proposal for a code of conduct limited to unstructured data in the form of emails, along with a NER approach for analyzing such data. On-On AB serves as an example of a micro-enterprise where these solutions could be implemented. The proposed method for detecting personal data is the implementation of a machine learning algorithm, binary logistic regression. The scope of the algorithm is restricted to a dataset of emails, in order to detect personal data in the form of proper names of persons.

2 BACKGROUND

For the purposes of the regulation, a number of key concepts are defined in Article 4 of the GDPR [13]. In order to fully understand the provisions of the new regulation, a few of the central definitions are described. Furthermore, this section provides an account of research surrounding the practical implications of the regulation as well as the content and implementation of codes of conduct. Additionally, it provides a theoretical framework regarding NER and a selection of performance measurements.

2.1 GDPR definitions

2.1.1 Personal data

The most important concept to be defined in relation to data protection is personal data. In the GDPR it is defined as follows:

“‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.” [13]

By this definition it is evident that any information that could be associated unequivocally with a living individual is personal data. Different pieces of information which, collected together, can lead to the identification of a particular person also constitute personal data [14]. Also mentioned in this definition, a ‘data subject’ is a living individual to whom the processing of personal data relates. For instance, if an organization stores personal data about its customers, each customer is a data subject.

2.1.2 Processing

The processing of personal data is mentioned throughout the GDPR and it is therefore important to define which operations are included in the concept.

“‘processing’ means any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organization, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction.” [13]

Given the regulation’s broad scope of application, both territorially and materially, and the definition quoted above, one can conclude that most industries will be affected. The law is technology neutral, i.e. it protects personal data regardless of the technology used to process that data. Furthermore, no distinction is made between data stored in an IT system and data stored in an unstructured manner, such as in email or on paper; in all cases, personal data is subject to the protection requirements set out in the GDPR.


2.1.3 Data Protection by Design and Default

There is a growing understanding that innovation, creativity and competitiveness must be approached from a “design-thinking” perspective, i.e. a perspective that is simultaneously integrative, innovative, interdisciplinary and inspiring [15]. Privacy must be approached from the same perspective, making it a critical consideration in organizational priorities and project objectives.

The concept of Privacy by Design (PbD) is integrated in Article 25 of the GDPR [13], and it refers to the philosophy and approach of embedding privacy into the design specifications of various technologies. PbD was first introduced by Cavoukian in 2009 with IT as its intended area of application, providing a framework of seven principles concerning privacy-preserving design [15]. The scope of the concept has since been expanded to entire information ecosystems, including organizational procedures and business models [16].

The GDPR encourages companies to take measures, both technical and organizational, that ensure privacy and data protection principles are respected from the initial stages of the design of processing operations. This concept is defined as “data protection by design”. Article 25 of the regulation [13] also defines another principle, “data protection by default”, which addresses procedures and processes explicitly. It is defined as follows:

“By default, companies/organizations should ensure that personal data is processed with the highest privacy protection (for example only the data necessary should be processed, short storage period, limited accessibility) so that by default personal data isn’t made accessible to an indefinite number of persons (‘data protection by default’).”

By this definition, one can conclude that the GDPR forces organizations to implement appropriate technological and operational safety measures for securing data. Noteworthy is that the regulation does not prescribe which technologies should be used to accomplish this, as this would put the legislation at risk of becoming obsolete. Rather, it highlights the importance of establishing internal privacy controls as well as considering the principles of data protection by design and by default when planning security within organizations.

2.2 GDPR Compliance and Codes of Conduct

2.2.1 Codes of Conduct and Certification

Articles 40-41 of the regulation [13] cover codes of conduct and the responsible entities. It is clearly stated that member states, supervisory authorities (such as the Swedish Data Protection Authority), the European Data Protection Board as well as the EU Commission should encourage the preparation of sector-specific codes of conduct for data protection. Furthermore, Article 40 states that these codes of conduct can be drawn up by trade associations and representative bodies for controllers and processors, as defined by the GDPR. The purpose of these codes is to facilitate compliance with the regulation, which may include the following areas: fair and transparent processing, legitimate interest, collection of personal data, pseudonymization, public disclosures, exercise of data rights, consent for the processing of personal data of children, security measures, breach notifications, international data transfers, and complaints handling and dispute resolution [13].

Apart from promoting codes of conduct, the same entities are responsible for approving and publishing them. The supervisory authority should provide an opinion on whether the proposed code of conduct complies with the GDPR and, if so, the EU Commission is responsible for making such a code publicly available. Companies that implement a code of conduct are subject to mandatory monitoring by a supervisory authority. Although these entities promote codes of conduct, implementing such a code is not obligatory for companies. However, companies are still required to adhere to the provisions of the GDPR.

While the implementation of a code of conduct facilitates adherence to the GDPR, it also leads to eligibility for certification, as stipulated in Articles 42-43 [13]. The certificates are issued by supervisory authorities to controllers or processors. They are a means of demonstrating compliance with the regulation as well as the existence of appropriate safeguards concerning data transfers.

2.2.2 Contemporary Frameworks

Much of the previous work concerning GDPR compliance is directed toward the changes brought about by the regulation and the steps required for a company to be compliant. Tikkinen-Piri, Rohunen and Markkula [17] identified major changes introduced by the GDPR and their practical implications. Their study resulted in a framework containing 12 aspects of the implications and what a company must do to prepare for the new requirements. For instance, they argue that companies should keep a record of data processing activities which can be provided to a supervisory authority on request. Overall, these implications were kept at a general level.

Clarifying the changes brought by the GDPR and the impact they will have on a company is the first step toward compliance. However, companies must also prepare methods to turn the theory into practice. A more detailed account of such measures is presented in a white paper by Microsoft [18]. The white paper discusses data governance related to the GDPR, based on principles, processes and practices. It stipulates that these tools should address data acquisition, data ownership and accountability, data access and usage, data discovery, data management, data protection, and the documentation of the aforementioned. From the requirements regarding how personal data is processed, Microsoft presents four categories that lay a foundation for a data governance plan: data discovery, data management, data protection and reporting.

The Swedish Data Protection Authority has put forward specific recommendations regarding the processing of personal data in email to help with GDPR compliance [19]. Personal data is often present in email correspondence and, due to the new regulation’s extended application to unstructured data, all personal data found in emails is subject to the same protection requirements as personal data processed in other systems. However, because one often can neither control nor know the content of emails received, the Swedish Data Protection Authority puts emphasis on the processing of personal data in incoming emails. Once a message is received and read, their recommendation is to assess, depending on the content, whether or not the information should be stored and, if so, where and for how long, in order to comply with the applicable requirements of the regulation. They also propose that companies only send sensitive personal data using encrypted email, because of the difficulties involved in ensuring that sent messages are shared solely with the designated recipients. The reasoning is that it is hard to authorize someone by a given email address alone and that there are specific security risks in the protocols many platforms use. Furthermore, today’s email clients employ smart features, such as auto-completion of email addresses, that increase the risk of messages being sent to the wrong recipients.

2.3 Code of Conduct

2.3.1 Definition

A formal code of conduct (CoC) is a common managerial tool in Corporate Social Responsibility (CSR), although identifying a universal definition of a code of conduct in the existing literature has proven difficult. The terms “code of conduct”, “ethical code” and “guidelines for behavior” are used interchangeably [20]. In a study on the effectiveness of the concept, Kaptein and Schwartz [21] opted for a generic term, “business code”, as a means of incorporating all types of codes at the corporate level. They put forward the following definition:

“A business code is a distinct and formal document containing a set of prescriptions developed by and for a company to guide present and future behavior on multiple issues of at least its managers and employees toward one another, the company, external stakeholders and/or society in general.”

Erwin [22] argues that a code of conduct guides employee behavior, as well as establishing and communicating socially responsible business behavior and organizational culture. In short, codes of conduct are a means of steering behavior within an organization.

2.3.2 Content

Rather than focusing solely on the definition, a greater understanding of the concept can be achieved by investigating what a code of conduct consists of and what shape it takes in an organization. The code can regulate on a range of levels, both informal and formal, such as beliefs and values or guidelines, procedures and rules [21]. When the content of the code is focused on data management, the code is sometimes referred to as a data governance plan.

Codes of conduct can derive from internal forces or from external forces such as legal pressures or public opinion. By implementing a code of conduct, a company benefits by decreasing its liability for financial penalties, increasing organizational efficiency and improving the work climate [21], as well as preserving a socially responsible reputation. Kabanov [23] studied frameworks for compliance and found that many companies have based their work toward becoming compliant with the GDPR on ISO standards. These efforts make them eligible for a certificate, which would strengthen the position of a company in the public eye.

2.3.3 Implementation

For an organization to achieve real influence on its employees’ behavior through a code of conduct, more than drawing up and adopting one is required [21]. Research into the effectiveness of codes has yielded conflicting results, ranging from counterproductive [24] to invaluable [25]. Grundstein-Amado [24] argues that, given that codes of conduct rarely achieve their purpose, there are two alternatives: either to abandon codes altogether or to design a plan for effective formulation and implementation. The author proposes that the best way to succeed with the implementation of a code of conduct is through internalization of the code’s provisions. According to the author, internalization is accomplished when the organization’s members establish integrated value systems and when the code is formulated by a democratic process, meaning each member is welcome to contribute to the formulation process.

Sethi [25] suggests another approach to the conditions for successful implementation of a business code, focusing on the management of an organization. The author claims that in order to benefit from a code of conduct, a company’s top management must be committed to code compliance in the long run, meaning executive work (including code compliance) should be tied to management evaluation and compensation at all levels of management. Additionally, Sethi argues that to gain the “reputation effect” from a code of conduct, a company’s operations and code compliance need to be exposed to the public for verification. Only then can organizations profit from consumers’ approval and support.

2.4 Named Entity Recognition

The objective of NER is to identify a word (entity) and classify it based on a predetermined set of categories. The most studied entities are “proper names”, which can be further divided into names of persons, locations and organizations [10]. Entities can also belong to the miscellaneous category, which contains entities such as date, time, percentage and monetary expressions [26].

The task of NER can be performed using different approaches, or by combining them. These approaches are rule-based, employ machine learning or incorporate gazetteers (dictionaries/wordlists), or they can be a hybrid of these [27]. Rule-based methods dominated earlier research [10], using hand-crafted rules to extract names. These rules are designed based on grammatical, syntactic and orthographic features of the text [26] and they have the advantage of predicting complex entities where machine learning methods may falter. However, since the desired entities can appear in a variety of forms and contexts, manually developing rules that result in robust NER systems can be very difficult.
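As a minimal sketch of what a hand-crafted orthographic rule can look like, the snippet below flags runs of two or more capitalized words as candidate person names. The regular expression, the function name `candidate_names` and the example sentences are illustrative assumptions and are not the rules used in this study.

```python
import re

# Hypothetical orthographic rule (an assumption for illustration):
# two or more consecutive capitalized words, allowing Swedish letters.
CAPITALIZED_RUN = re.compile(
    r"\b[A-ZÅÄÖ][a-zåäö]+(?:\s+[A-ZÅÄÖ][a-zåäö]+)+\b"
)

def candidate_names(text):
    """Return the substrings of text that match the capitalized-run rule."""
    return [match.group(0) for match in CAPITALIZED_RUN.finditer(text)]

if __name__ == "__main__":
    print(candidate_names("Please forward the report to Anna Svensson before Friday."))
    print(candidate_names("Volkswagen Group Sweden called."))
```

The second call illustrates why such rules are hard to make robust: an organization name satisfies the same orthographic pattern as a person name, so the rule alone cannot separate the two.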

Due to the challenges involved in rule-based methods, the approach that has become most successful is supervised machine learning [28], whereby features of negative and positive examples of annotated data together with rules are used to train a model. The model is trained on non-contextual (word-level), lexical, morphological and global features [27] in order to perform classification of test data. A drawback of this approach is that a large amount of annotated training data is required. Examples of models applied in NER tasks include Conditional Random Fields [29], Support Vector Machines and Maximum Entropy Models (equivalent to Logistic Regression Models) [30], and they can be applied individually or combined [31].

Figure 1: Illustration of different values of the learning rate $\eta$ and the effect of choosing a small versus a large value of $\eta$.

The gazetteer method differs from the aforementioned approaches in that it uses external information, rather than information gathered from the text to be classified. A gazetteer is a list of entities against which the classifier matches the entities in question. While a system relying solely on list lookup may experience issues with ambiguity (such as Paris being both a name and a city), gazetteer matches used as features have a substantial positive impact in machine-learning-based approaches [32].
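A gazetteer match used as a feature can be sketched as follows. The word list `FIRST_NAMES` and the function names are hypothetical and chosen only for illustration; a real gazetteer would be far larger and is not reproduced here.

```python
# Hypothetical gazetteer of first names; the entries are illustrative only.
FIRST_NAMES = {"anna", "erik", "paris"}

def gazetteer_feature(token, gazetteer=FIRST_NAMES):
    """Return 1 if the lower-cased token occurs in the gazetteer, else 0."""
    return 1 if token.lower() in gazetteer else 0

def token_features(tokens):
    """Pair each token with its binary gazetteer-match feature."""
    return [(token, gazetteer_feature(token)) for token in tokens]

if __name__ == "__main__":
    # Both occurrences of "Paris" match, whether they denote a person or a city.
    print(token_features(["Paris", "visited", "Paris"]))
```

Because the lookup returns the same value for the person and the city, the match is better treated as one input feature among several than as a classification in its own right.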

While much of the research has focused on formal texts such as news articles and web pages, Wang and Cohen [29] studied name retrieval from informal texts in the form of email messages. The authors argued that the difference in style between the two text types affects the performance of existing NER methods. Their results indicate that names in informal texts have less informative types of contextual evidence, although structural regularities of emails could be exploited with the correct set of features. The study used repetition as a means of resolving ambiguity and list lookup for improving recall. Similarly, Maynard et al. [33] acknowledged the differences in formality between different forms of media and recognized a need for a name recognition system robust enough to handle these variations. The authors used hand-crafted rules, gazetteers and a system called MUSE for NER in email, religious texts, and scientific texts, both spoken and written. The results of this NER system for texts from widely differing domains were very promising.

This study adopts a hybrid approach to NER, drawing upon previous work by implementing hand-crafted rules, gazetteers and a machine learning model. It adds to existing research on information extraction from email by evaluating binary logistic regression as the machine learning model, using stochastic gradient descent as a means of optimizing the model.

2.4.1 Binary Logistic Regression

Logistic regression is a multivariate statistical method used in classification problems to calculate the probability that an event will occur. The method predicts the categorical value of an event based on its predictor variables, the features. In binary logistic regression the predicted outcome is one of two categories, occurrence ($b = 1$) or non-occurrence ($b = 0$). Given an observation $x$, the probability of occurrence is represented by $P(b = 1 \mid x)$, while the probability of non-occurrence is represented by $P(b = 0 \mid x)$ or $1 - P(b = 1 \mid x)$. The method fits the data to a logistic function by way of the logit of this probability:

$$\theta_{P(b = 1 \mid x)} = \operatorname{logit}\big(P(b = 1 \mid x)\big)$$

The logit $\theta_{P(b = 1 \mid x)}$ can then be expressed as a linear function of the features, represented by $a_1, a_2, \ldots, a_k$:

$$\theta_{P(b = 1 \mid x)} = w_0 + w_1 a_1 + w_2 a_2 + \ldots + w_k a_k$$

where $w_0$ represents the intercept and $w_1, w_2, \ldots, w_k$ represent the regression coefficients, the weights. Positive weights indicate that the given feature is more likely to be true for names, while negative weights indicate that it is more likely to be false for names. Each independent variable is multiplied by its weight and the results are summed. The probability of an event occurring is calculated by the inverse of the logit function, also known as the sigmoid function:

$$P(b = 1 \mid x) = \frac{e^{\theta_{P(b = 1 \mid x)}}}{1 + e^{\theta_{P(b = 1 \mid x)}}} = \frac{e^{w_0 + w_1 a_1 + w_2 a_2 + \ldots + w_k a_k}}{1 + e^{w_0 + w_1 a_1 + w_2 a_2 + \ldots + w_k a_k}}$$

producing results between 0 and 1. The predicted outcome is 1 if $P(b = 1 \mid x) > 0.5$ and 0 if $P(b = 1 \mid x) \leq 0.5$.
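The prediction step described above can be sketched in a few lines. The weights and the two binary features in the usage example are hypothetical values chosen for illustration; they are not the weights learned in this study.

```python
import math

def sigmoid(theta):
    """Inverse of the logit: e^theta / (1 + e^theta), computed stably."""
    if theta >= 0:
        return 1.0 / (1.0 + math.exp(-theta))
    e = math.exp(theta)
    return e / (1.0 + e)

def predict_proba(weights, features):
    """P(b = 1 | x) for weights [w0, w1, ..., wk] and features [a1, ..., ak]."""
    theta = weights[0] + sum(w * a for w, a in zip(weights[1:], features))
    return sigmoid(theta)

def predict(weights, features, threshold=0.5):
    """Predicted class: 1 if P(b = 1 | x) > threshold, otherwise 0."""
    return 1 if predict_proba(weights, features) > threshold else 0

if __name__ == "__main__":
    # Hypothetical weights; features could be e.g. "is capitalized" and
    # "is in gazetteer" (assumed names, for illustration only).
    w = [-2.0, 1.5, 2.5]
    print(predict_proba(w, [1, 1]), predict(w, [1, 1]))
    print(predict_proba(w, [1, 0]), predict(w, [1, 0]))
```

With these assumed weights, a token that has both features gets a linear score of 2.0 and is classified as a name, while a token with only the first feature gets a score of -0.5 and is not.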

Logistic regression is especially useful when there is sufficient training data and is considered both simple and robust [34]. However, while it is effective, long processing time could become an issue. For this reason, stochastic gradient descent is a suggested optimization method in cases where training time is the bottleneck [35].

2.4.2 Stochastic Gradient Descent

Gradient descent is the process of minimizing a multivariate function $f(a_1, \ldots, a_n)$ by iteratively following the gradient toward the minimum. This is often used in optimization problems where one seeks to minimize the loss function. The gradient is the vector of the partial derivatives $\nabla f = \left( \frac{\partial f}{\partial a_1}, \ldots, \frac{\partial f}{\partial a_n} \right)$, and the negative gradient points in the direction of steepest descent of $f$. The objective is to find the coefficients $w_1, \ldots, w_n$ for which the loss function $f$ is minimized, by iteratively computing the gradient and updating the coefficients in proportion to the steepest descent of the function, i.e. its negative gradient.

In stochastic gradient descent, each iteration uses a single randomly picked sample $z_t = (x_t, y_t)$ from the training set, composed of an arbitrary input $x$ and output $y$. Each iteration updates the weights using the gradient $\nabla f(z_t, w_t)$ in the following calculation:

\[ w_{t+1} = w_t - \eta \nabla f(z_t, w_t) \]

where $\eta$ is a variable known as the learning rate or step size. The stochastic process continues until convergence or until a set maximum number of iterations has been reached.

Figure 2: Confusion matrix

                      Actual class P     Actual class N
Predicted class P'    True Positives     False Positives
Predicted class N'    False Negatives    True Negatives

The values of the learning rate, convergence margin and maximum iterations affect the optimization. A low learning rate means small steps toward the minimum, resulting in more reliable training but at the cost of longer computing time. On the other hand, a high learning rate could result in the function not converging or even diverging (see Fig. 1). The learning rate is therefore typically a small value, around 0.01 [36].
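The update rule can be sketched in a few lines of Python. This is a minimal illustration and not the implementation in Appendix B. It assumes the log-loss as the loss function, whose gradient for one sample is (p - y) · x. It also stops after a fixed number of updates rather than testing a convergence margin, because the exact convergence criterion is not described here. The names `sgd_logistic`, `max_updates` and `seed` are hypothetical.

```python
import math
import random


def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))


def sgd_logistic(samples, learning_rate=0.01, max_updates=10000, seed=0):
    """Fit weights [w0, w1, ..., wk] with stochastic gradient descent.

    `samples` is a list of (features, label) pairs, where `features` is a
    list of k numbers and `label` is 0 or 1. Assumes the log-loss.
    """
    rng = random.Random(seed)
    k = len(samples[0][0])
    w = [0.0] * (k + 1)  # w[0] is the intercept
    for _ in range(max_updates):
        x, y = rng.choice(samples)   # one randomly picked sample z_t
        x = [1.0] + list(x)          # constant 1 pairs with the intercept
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        # w_{t+1} = w_t - eta * gradient, with gradient (p - y) * x
        w = [wi - learning_rate * (p - y) * xi for wi, xi in zip(w, x)]
    return w
```

A smaller `learning_rate` makes each step smaller, so more updates are needed to reach the same weights, which is the trade-off described above.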

2.5 Performance Measurements

2.5.1 Confusion Matrix

The labelled instances can be divided into four categories: true positives, true negatives, false positives and false negatives, often depicted in a confusion matrix (see Fig. 2). The off-diagonal elements (FP + FN) represent misclassified data and the aim of any predictive model is therefore to have a dominant diagonal (TP + TN).

2.5.2 Accuracy

Accuracy is often the first step in analyzing the performance of an algorithm applied to a binary classification problem. Accuracy is defined as the number of correctly predicted labels divided by the number of predictions made. Although accuracy might seem like an obvious key metric for evaluation of predictive models, there might be cause for caution as the value can be misleading. Accuracy is best suited as a performance measurement when the cost of a false positive (false alarm) is equivalent to the cost of a false negative (missed prediction) [37].

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]

2.5.3 Precision and Recall

The efficacy of a classifier can be measured based on its precision and recall. Precision measures how many of the labels predicted to be true were in fact true, and it is defined by:

\[ \text{Precision} = \frac{TP}{TP + FP} \]

Recall, on the other hand, measures how many of the instances that are actually true were successfully classified as such. It is defined as per below:

\[ \text{Recall} = \frac{TP}{TP + FN} \]

Take the case of a spam filter for illustration. Precision measures how many of the emails identified as spam were in fact spam. Recall measures how many of the actual spam mails the classifier managed to find.

Perfect scores (100 %) of precision and recall are therefore not equivalent. Perfect precision for a specific class means that every data point the model assigned to that class actually belongs to it, whereas perfect recall means that the model assigned every actual instance of the class to that class.

Unlike accuracy, precision and recall do not give equal weight to mislabelings of both types (FP and FN) [37]. Precision penalizes retrieval of irrelevant instances (FP), but it does not penalize failures by the model to retrieve instances that the user considers to be relevant (FN). Recall, conversely, penalizes false negatives but not false positives. There is a commonly known inverse relationship between precision and recall, that is, a trade-off where an increase in one of the measures leads to a decrease in the other [37]. These aspects are important to consider when tuning classification models, as the importance of the respective measurements varies depending on the given task.

2.5.4 F1-score

F-measure is a combination of precision and recall, providing an overall performance measurement. The metric can be calibrated to favor recall or precision if needed, but the most common form for evaluation is the F1-score, where precision and recall are equally weighted [9]. With P denoting precision and R denoting recall, the F1-score is defined as below:

\[ F_1 = \frac{2PR}{P + R} \]
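The four measurements defined in this section can be computed directly from the counts in a confusion matrix. The Python sketch below is only an illustration; the function names are chosen for this example, and the functions do not guard against empty classes (division by zero).

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Share of predicted positives that were actual positives."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that were found."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)
```

As a hypothetical example, take 1000 tokens with TP = 50, FP = 5, TN = 900 and FN = 45. Accuracy is 0.95 and precision is about 0.91, yet recall is only about 0.53, which shows how a high accuracy can hide a large share of missed positives when the classes are imbalanced.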

2.5.5 Baseline

While accuracy may give an indication of how well a classifier performs, a baseline serves as a reference point indicating the relevance of the measurement. There are several baselines to compare the performance measurements against.

The human baseline is constructed by having experts classify the data and comparing the performance of the algorithm to this measurement, and it is beneficial when there is no previous knowledge of the data. This is also known as the Gold Standard [9].

In logistic regression, a central tendency measure can serve as a baseline, by assuming a specific value for all predictions. An example of such a measure is the mean value of all predictions, which can be estimated using the intercept-only model [38]. That is, to set all independent variables (features) in the logistic function to zero. Without predictive values, the model predicts that the best estimate of the dependent variable for each instance is the overall mean, which in turn can be derived from the value of the intercept. Positive values of the intercept, $w_0$, yield baseline event rates greater than 0.5, while negative values result in mean values less than 0.5. Consequently, according to this model, all observations would be predicted to belong in the most frequent class [38].
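The most frequent class baseline can be illustrated with a short Python sketch. The function name is an assumption made for this example, and the label counts are the totals from Table 1 (4261 'Name' and 25739 'NotName' tokens).

```python
from collections import Counter


def most_frequent_class_baseline(labels):
    """Return the most frequent label and the accuracy obtained by
    predicting that label for every instance."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Totals from Table 1.
labels = ["Name"] * 4261 + ["NotName"] * 25739
majority_label, baseline_accuracy = most_frequent_class_baseline(labels)
```

For these totals the majority label is 'NotName' and the baseline accuracy is 25739 / 30000, or about 85.80 %, which is the figure a classifier has to beat in order to add any value.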

3 METHOD

3.1 The Data

In order to extract information from unstructured data, it should first be organized in a structured manner. The unstructured data was restricted to an extract of email correspondence to and from On-On AB. The emails, along with attachments in the form of Excel files, were exported from Microsoft Office Outlook and saved to text files (.txt) or CSV files (.csv) respectively. Finally, these files were merged into one large data set during preprocessing.

Preprocessing of the data consisted of further structuring through tokenization and cleaning. The entire set was split on white space in order to represent each word as a token. A selection of punctuation marks, such as slashes, brackets and apostrophes, was removed from the data set. Periods were kept and capital letters were not converted to lowercase, since both are used by the features. The final dataset consisted of 30 000 datapoints or tokens, which were manually annotated as “Name” or “NotName” by the authors of this study, consequently creating a human-labeled Gold Standard baseline [9]. When annotating, potential nicknames or misspelled names were excluded, as well as names that were part of a larger entity name, such as an organization. Since the preprocessing was optimized to extract names of persons, the tokenization resulted in email addresses being split into three tokens, such as “local-part”, “@” and “domain”. In the case where names were included in the local-part, these were also excluded.
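The preprocessing steps described above could look roughly as follows in Python. This is a sketch under stated assumptions and not the code used in the study. The exact set of removed punctuation marks is not listed in full, so the character class below is a guess limited to slashes, brackets and apostrophes, and the function name `tokenize` is hypothetical.

```python
import re

# Assumed set of removed characters: slashes, brackets and apostrophes.
REMOVED_CHARACTERS = re.compile(r"[/\\\[\](){}<>']")


def tokenize(text):
    """Split on white space, drop selected punctuation, keep case and periods."""
    tokens = []
    for chunk in text.split():
        chunk = REMOVED_CHARACTERS.sub("", chunk)
        if not chunk:
            continue
        if "@" in chunk:
            # Split an e-mail address into local-part, "@" and domain.
            tokens.extend(part for part in re.split(r"(@)", chunk) if part)
        else:
            tokens.append(chunk)
    return tokens
```

For example, `tokenize("Kontakta (Anna) via anna@example.com")` gives the tokens `Kontakta`, `Anna`, `via`, `anna`, `@` and `example.com`, with capitalization preserved so that the capitalization feature can still be computed.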

3.2 Training Set and Test Set

In order to make best use of supervised machine learning models, a large amount of annotated data is required. However, the task of manually annotating the data is rather time-consuming, hence delimitations were made in regard to the size of the corpus.

When dividing the data into the different sets there is a trade-off, as it is desirable to have as large a test set as possible since a smaller test set may yield unrepresentative results [9]. Furthermore, a larger training set could also be preferable in order to provide more learning opportunities for the model. The aim is to have enough data in the test set to be able to generalize the findings.

3.2.1 K-Fold Cross-Validation

In the case of a smaller dataset, such as in this study, where further division might result in a test set too small to draw significant conclusions from, cross-validation can serve as a solution and provide more accurate predictions and measurements. The aim of this method is to ensure that an instance from the original dataset is just as likely to appear in the training set as it is in the test set [9].

In order to compensate for the size of the dataset, k-fold cross-validation was implemented. This study used 5-fold cross-validation where the dataset $D$ was partitioned into five mutually exclusive, equal-sized subsets $D_1, D_2, \ldots, D_k$, where in this case $k = 5$. The model was then trained on the set difference $D \setminus D_t = \{x : x \in D \text{ and } x \notin D_t\}$, where $t \in \{1, 2, \ldots, k\}$, and tested on $D_t$. This process was repeated 5 times, meaning each fold was used as test set once. By alternating which partition is used for testing and finally calculating the average results of all rounds, not only can the variability be reduced, but the entire dataset can also be used for both training and testing. This process was also repeated for k = 10.
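The partitioning can be sketched in Python as follows. This is an illustration only. It assumes contiguous folds taken in the original order of the data, without shuffling, and the generator name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(data, k=5):
    """Yield (train, test) pairs for k equal-sized, contiguous folds.

    If len(data) is not divisible by k, the leftover items at the end
    are always part of the training set and never tested.
    """
    fold_size = len(data) // k
    for t in range(k):
        start, stop = t * fold_size, (t + 1) * fold_size
        test = data[start:stop]              # D_t
        train = data[:start] + data[stop:]   # D \ D_t
        yield train, test
```

With 30 000 tokens and k = 5, each test fold holds 6000 tokens and each training set 24 000 tokens. The performance measurements are computed per fold and then averaged.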

Table 1: Data distribution over the five folds (F1, F2, ..., F5) of the cross-validation, and total data distribution.

Fold     ‘Name’    ‘NotName’
F1       1332      4668
F2       1311      4689
F3       1217      4783
F4       272       5728
F5       129       5871
Total    4261      25739

Table 2: The selected features and their descriptions.

Feature                Description
capitalized_T          First letter in token is capitalized
first_T_in_sentence    Preceding token is a punctuation character
end_of_T               The token ends with common ending of Swedish last names
T_in_f_names           Token in list of female names
T_in_m_names           Token in list of male names
T_in_lastnames         Token in list of last names
T_in_cities            Token in list of Swedish cities

The distribution of ‘Names’ and ‘NotNames’ for each fold is depicted in Table 1, as well as the total number for the entire data set.

3.3 Feature Selection

As a consequence of the generally low contextual evidence found in emails as well as the Excel files found as attachments, the features mainly consist of word-level rules and list-lookup. A contextual feature is “first token in sentence”, which can be induced from the preceding entity. At word-level, the “capitalized token” feature looks at the first letter in the entity and determines whether it is capitalized. The feature “common ending” is a morphology feature looking for common endings of Swedish last names.

A number of gazetteers were incorporated as individual features. Gazetteers of common Swedish male, female and last names respectively were sampled from Statistics Sweden [39] and added. As the data set contained a large number of Swedish cities, a gazetteer for a range of cities was included. A list of the implemented features is found in Table 2.
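The features in Table 2 could be computed per token along the following lines. This Python sketch is not the code from Appendix B. The gazetteers are tiny made-up placeholders rather than the lists sampled from Statistics Sweden, and the list of name endings and the set of punctuation characters are assumptions, since neither is specified in the text.

```python
# Placeholder gazetteers for illustration only.
FEMALE_NAMES = {"Anna", "Maria"}
MALE_NAMES = {"Erik", "Lars"}
LAST_NAMES = {"Andersson", "Lindberg"}
CITIES = {"Stockholm", "Lund"}

# Assumed examples of common Swedish last-name endings.
LAST_NAME_ENDINGS = ("sson", "berg", "ström")

# Assumed set of sentence-ending punctuation characters.
PUNCTUATION = {".", "!", "?"}


def extract_features(token, previous_token):
    """Return the binary features of Table 2 for one token."""
    return {
        "capitalized_T": int(token[:1].isupper()),
        "first_T_in_sentence": int(previous_token in PUNCTUATION),
        "end_of_T": int(token.endswith(LAST_NAME_ENDINGS)),
        "T_in_f_names": int(token in FEMALE_NAMES),
        "T_in_m_names": int(token in MALE_NAMES),
        "T_in_lastnames": int(token in LAST_NAMES),
        "T_in_cities": int(token in CITIES),
    }
```

Each token is thereby turned into a vector of seven binary values, which serve as the inputs $a_1, \ldots, a_k$ to the logistic regression.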

3.4 Implementation of Model

The NER system was implemented in Python 3, building upon code created in 2017 by Johan Boye and Patrik Jonell as a part of the computer assignments for the course DD1418/DD2418 Language engineering at KTH. The code can be found in Appendix B and the parts added by the authors of this study are marked up in the code.

3.5 Evaluation Method

In order to evaluate the model’s performance and decide upon the optimal selection of features and variable settings, a number of performance measurements were calculated. These measurements were used to compare the results with different tuning of variables and combination of features, in an attempt to maximize recall for ‘Name’. The fine-tuning was applied to the learning rate, convergence margin, and maximum iterations in the binary logistic regression. When fine-tuning the model, it became apparent that altering maximum iterations had no significant impact on the results, therefore this value was set to 10 throughout the testing. The tuning was conducted by testing how alterations in the learning rate affected the performance measurements, while holding all else constant. The same approach was applied to the convergence margin. At first, these tests were run implementing all features found in Table 2. By considering the feature weights in the binary logistic regression, the tests were then repeated removing features shown to have less significance. These features were “token in cities” and “token in last name”.

For each round of testing, five sets of results (one for each of the 5-fold cross-validation partitions) were stored. The raw data (i.e. the number of instances classified to each class) was stored in the form of a confusion matrix, and for each of the five folds accuracy, precision, recall, F1-measure and the most frequent class baseline were calculated. The results were then averaged, resulting in an overall estimate for the model’s performance. By doing this, the model could be optimized in terms of selection of features and value of learning rate, convergence margin and maximum iterations. An attempt was also made using 10-fold cross-validation.

3.6 Code of Conduct

3.6.1 Literature Review

A literature review was conducted in two parts for the preparation of a suggested code of conduct. The first part focused on gathering a theoretical foundation necessary for understanding how the GDPR approaches codes of conduct and the concept of codes of conduct independently. This work was presented in the Background section.

The second part of the review focused on contemporary work concerning GDPR and codes of conduct. This information is also presented in the Background section.

3.6.2 Formulation of Proposal

The proposal for the code of conduct was formed based on the studied material and the findings presented in the theoretical framework. The proposal was delimited to processes regarding personal information in email, not taking into account other necessary measures that should be applied to the processing of personal data at large for GDPR compliance. Such measures include (but are not limited to) the question of consent from the data subject, protocols for responding to requests from data subjects and data protection.

4 RESULTS

4.1 Proposal of Code of Conduct

In the following section, a content proposal of a code of conduct is presented. The content is delimited to the processing of personal information in email and the reasoning behind it is presented in the Discussion.

• Define and establish a common understanding of what constitutes personal data in the organization.

• Identify personal data in email through automated methods such as machine learning.

• Establish routines for continuously marking emails containing personal data.

• Establish rules concerning which personal data can be sent in email.

• Decide on duration of data storage.

• Establish routines for documentation and reporting.

4.2 Performance of NER

For maximum iterations = 10, convergence margin = 0.001 and learning rate η = 0.05 with all features implemented, the accuracy was measured at 92.51 %. Precision was 92.43 % for ‘NotName’ and 93.38 % for ‘Name’, with an average precision of 92.91 %. Recall was measured at 98.96 % for ‘NotName’ and 51.82 % for ‘Name’, with an average recall of 75.39 %. For the same settings, the F1-score was measured at 66.65 % and the most frequent class baseline at 85.80 %. This case was considered the base case and yielded the best performance values. These results are averaged over the five folds of the cross-validation and are depicted in Table 3. For results of the individual folds, see Appendix A.

Modifying the learning rate η in the range 0.1-0.01 with all features implemented yielded the same results as for learning rate η = 0.05, i.e. the base case. Setting learning rate η = 0.001 resulted in overall lower performance values, as shown in Table 3.

Using the same settings as the base case, only altering the convergence margin to 0.01, yielded lower performance values within a range of 1-2 %. Setting convergence margin to 0.005 yielded similar results although slightly higher for some values in comparison to convergence margin 0.001 (see Table 3). Convergence margins lower than the base case setting showed no difference in performance.

The weights in the logistic regression for the base case are depicted in Table 4. Excluding, one at a time, the features with the lowest average weight (“token in cities” and “token in last name”) from the base case resulted in an overall change of up to ± 1 % for all values, although recall for ‘Name’ was lower (see Table 3).

Altering max iterations had no apparent effect on the results and it was therefore kept at a constant value of 10.

An attempt to use 10-fold cross-validation was also conducted, although due to the imbalance in the data distribution (indicated in Table 1), this resulted in a fold consisting of only one instance of actual class ‘Name’. Consequently, results were only obtained for 5-fold cross-validation.

5 DISCUSSION

The aim of this study was to provide a proposal of a code of conduct, concerning the processing of personal data in email, where NER was suggested as a tool for identifying the personal data. The discussion addresses the performance of the suggested model as well as its limitations and possible improvements. Furthermore, it is discussed how well the implemented model fulfills its purpose as specified in the proposed code of conduct. A detailed account is given for the reasoning behind the proposed code of conduct as well as how it could be implemented for best effect, tailored to On-On AB.


Table 3: Average performance measurements for different alterations relative to the base case (learning rate (η) = 0.05, convergence margin (CM) = 0.001, maximum iterations = 10, all features implemented).

Performance Measurement    Base case    η = 0.001    CM = 0.005    CM = 0.01    T_in_cities excluded    T_in_lastnames excluded
Accuracy                   92.51 %      91.77 %      91.89 %       91.75 %      92.33 %                 92.38 %
Precision for ‘NotName’    92.43 %      91.81 %      91.93 %       91.78 %      92.38 %                 92.30 %
Precision for ‘Name’       93.38 %      91.26 %      91.35 %       91.61 %      91.86 %                 93.31 %
Average precision          92.91 %      91.53 %      91.64 %       91.70 %      92.12 %                 92.81 %
Recall for ‘NotName’       98.96 %      98.79 %      98.79 %       98.80 %      98.78 %                 98.96 %
Recall for ‘Name’          51.82 %      48.01 %      49.13 %       46.54 %      51.64 %                 50.36 %
Average recall             75.39 %      73.40 %      73.96 %       72.67 %      75.21 %                 74.66 %
F1-measure                 66.65 %      62.92 %      63.89 %       61.73 %      66.11 %                 65.42 %
Baseline                   85.80 %      85.80 %      85.80 %       85.80 %      85.80 %                 85.80 %

Table 4: Feature weights for the five folds (F1, F2, ..., F5) of the cross-validation, and average weight for each feature. Settings as in base case.

Features               F1       F2       F3       F4       F5       Avg.
Intercept              -6.60    -6.52    -6.92    -6.43    -6.86    -6.67
capitalized_T          3.57     3.52     3.49     3.80     3.70     3.62
first_T_in_sentence    -1.78    -1.81    -1.85    -1.84    -1.59    -1.78
end_of_T               2.20     2.23     2.24     2.29     2.30     2.27
T_in_f_names           2.40     3.09     2.41     2.89     2.63     2.68
T_in_m_names           4.92     4.56     5.10     4.22     4.89     4.74
T_in_lastnames         2.37     0.27     2.22     0.24     2.43     1.51
T_in_cities            1.44     1.38     1.42     1.29     1.55     1.42

5.1 Code of Conduct Content

This study aimed to answer a question regarding how a micro-enterprise should deal with the processing of personal data in emails in order to comply with the GDPR. Below is the proposed code of conduct for this purpose, consisting of six items, applied to On-On AB.

5.1.1 Define and establish a common understanding of what constitutes personal data in the organization

Before procedures can be set up for data processing, it is of great importance that all employees share a common understanding of what personal data is. A definition has been stated in the GDPR which should serve as a foundation upon which On-On AB should base its definition. However, “any information relating to an identifiable person” would benefit from further specification while keeping in mind which type of personal data the company handles in its email system.

5.1.2 Identify personal data in email through automated methods such as machine learning

The ability to meet the data subjects’ rights relies heavily on the extent to which On-On AB has knowledge of where personal data is kept. After having distinguished which types of personal data reside in email, the company should establish a procedure for identifying the data. The description of emails as an unstructured form of data indicates the level of difficulty associated with the task of finding the sought-after data. For each employee to manually search through their email could potentially become a tedious and time-consuming task. The implementation of a technological solution would therefore save time, while also reducing the risk of human error. A discussion of the suitability of the NER method proposed in this study can be found further below.

5.1.3 Establish routines for continuously marking emails containing personal data

Whether or not a technological solution for finding personal data has been applied, a routine for continuously marking data is useful for maintaining an overview of the data and increasing organizational efficiency. Furthermore, it is important that On-On AB consider this precautionary measure at an early stage of the design of data processing methods to ensure compliance. As such, the development of this routine should be in close relation to the method for finding the data.

5.1.4 Establish rules concerning which personal data can be sent in email

There are particular risks of violating the GDPR involved in the processing of personal data in emails due to the characteristics of this form of communication. Considering the case of when emails are being sent, the main risk lies in the fact that it is difficult to guarantee that the message sent is shared exclusively with the intended recipient. Although some of the presented reasons behind this are hard for On-On AB to control, there are precautionary measures that could reduce these risks if implemented.

The suggested method for mitigating the risks involved in the process of sending emails while ensuring GDPR compliance is to establish a set of rules declaring which types of personal data can and cannot be sent through email, and under which circumstances. The previously mentioned element of defining personal data should contribute to the comprehensibility of these rules. Well-defined rules support the principles of data protection by design and default set out in the GDPR, hence the rules should be formulated so that any uncertainties are diminished. For instance, the processing of personal data defined to be of sensitive nature should be addressed explicitly in the rules.

5.1.5 Decide on duration of data storage

On-On AB would benefit greatly in their GDPR compliance work from deciding whether information received in emails should be stored and, if so, how and for how long. Both the principle of data protection by default and the Swedish Data Protection Authority’s recommendations encourage such operational measures. More specifically, On-On AB should assess and decide on specific rules regarding the storage of personal data that are commonly dealt with by the company. These rules can help employees make lawfully supported assessments of incoming emails. All continuous assessments should be done with respect to the particular content of the message.

5.1.6 Establish routines for documentation and reporting

Regardless of whether On-On AB creates a comprehensive code of conduct of their own, or abides by one put forward by a trade association/representative body, the fact remains that the company must adhere to the GDPR and must be able to provide evidence that it does, which is achieved by documenting processes. Furthermore, as On-On AB is a third-party data processor receiving customer information from car dealerships, it is crucial that they can demonstrate compliance for their clients.

5.2 Code of Conduct Implementation

Derived from Kaptein and Schwartz’s definition of a business code, codes of conduct are developed to guide present and future behavior. As such, this code aims to guide future behavior in order to guarantee On-On AB’s compliance with the GDPR. Furthermore, the GDPR encourages companies to take both technical and organizational safety measures, where the above-mentioned routines and procedures fall under the category of organizational measures.

Drawing upon research concerning the implementation of a code of conduct, there are several factors On-On AB should take into consideration in order to ensure that a code is followed. The policies and procedures will only be effective if they are internalized by the employees. This internalization could be facilitated through involving the employees in discussions regarding the items of the code. While some of them are heavily steered by the regulation, such as the definition of personal data and the duration of storage, there exists room for interpretation where the company can set internal rules, e.g. deciding on the routine for marking personal data in email. By involving the people who are to follow these guidelines daily, it is more likely that the resulting routines are easily understood, time-saving and adapted to fit the current processes - all of which contribute to internalization.

The proposed code of conduct relies on a selection of provisions from the GDPR that were deemed relevant for the scope of this study, as well as previous work and existing recommendations. As the code is delimited to unstructured data in the form of email it is not a blanket solution, meaning On-On AB is required to take further action to be fully compliant. Furthermore, one could argue for the universality of this code. In theory, it could be scaled up and applied to larger companies; however, the practicalities could differ between a small and a large company. It seems feasible that involvement of employees in the decision making may be easier in a smaller company. Moreover, the process of finding and marking personal data in a smaller company could look a lot different from the same process in a large company.

Implementing and adhering to the proposed code of conduct brings On-On AB one step closer toward compliance with the GDPR. The result would be a reduced liability for penalty fees as well as potential gains in the aforementioned “reputation effect”. This code of conduct is a response to legal pressure in the form of the EU-provided regulation; however, one could also argue that it stems from public pressure seeing as the GDPR is a result of the growing demands from the public for data privacy and control over their own personal data. However, apart from organizational measures, the right technological measures are required.

5.3 Named Entity Recognition Approach

The results of this thesis show that applying a NER model made up of a combination of hand-crafted rules, gazetteers and binary logistic regression is not the most suitable approach for finding names in emails. In the case of this study, whose objective was to investigate how well such a model could facilitate the detection of personal data for GDPR compliance purposes, the model did not perform well enough in the measurement of recall for ‘Name’ to be considered suitable. However, these results are not in line with previous research, where combined NER approaches have proven to be an effective tool for tackling name retrieval tasks in unstructured text, such as email.

The intercept-only model indicated that when assigning a token to the most frequent class, this token would be assigned to ‘NotName’, resulting in a baseline of 85.80 %. In comparison, the model scored a high value for accuracy at 92.51 %. Furthermore, the average precision at 92.91 % implies that the model performed well in assigning a token to its true label. Although the model produces high performance values for accuracy and precision, recall for ‘Name’ was only 51.82 %. These results indicate that the model’s ability to retrieve names was relatively poor; only every other name was in fact found by the model.

With the objective of enhancing recall for ‘Name’, experimentation was performed through fine-tuning of parameters. However, these adjustments had low impact on the different results. A possible explanation for these results is the combination of the choice of hybrid model and the data characteristics. Previous research performed NER strictly on the content of an email - be it different parts of the email such as header or body. As personal data embedded in email attachments is subject to the requirements set in the regulation, this too was included in this study. Data extracted from a plain email and data extracted from Excel-files have different characteristics. For example, all leading words in a cell in an Excel-file begin with a capital letter and all words put together from the file do not form a grammatical pattern like the words in a body of text. Consequently, the features “first token in sentence” and “capitalized token” could be true for many words. Furthermore, context-level features which have proven successful in previous studies were not implemented due to the current construction of the model, although these features may not have been appropriate in this case. It is instead possible that a combination of machine learning models, with different features attuned to particular characteristics of the data, could yield a better result.

Another possible explanation is the imbalance of the data in the five partitions. As depicted in Table 1, the distribution of ‘Names’ and ‘NotNames’ differs vastly between the folds. While k-fold cross-validation is useful for smaller datasets, using stratified cross-validation would help ensure that each partition contains similar proportions of the two class labels. This method is therefore suggested for future research.
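The stratified alternative suggested above can be sketched in Python. This is a minimal illustration of the idea and was not part of the study. It assumes that the indices of each class are dealt round-robin over the folds, and the function name `stratified_folds` is hypothetical.

```python
def stratified_folds(labels, k=5):
    """Return k lists of indices with near-equal class proportions."""
    folds = [[] for _ in range(k)]
    indices_by_class = {}
    for index, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(index)
    for indices in indices_by_class.values():
        # Deal the indices of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds
```

Applied to the totals in Table 1, every fold would contain 852 or 853 ‘Name’ tokens, instead of the range from 129 to 1332 seen in the unstratified folds.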

The model ultimately used in this study cannot be considered as finished for the purpose of using NER to find personal data in email, on account of it not being particularly user-friendly nor able to find all types of personal data, in its current construction. In the interest of developing a convenient technological solution for scanning emails in the search of personal data, the model should be further developed and integrated into such an application. However, the authors of this thesis are sceptical toward the use of an autonomous solution such as this one, for this specific purpose. In order for the application to be seen as sufficient and therefore useful, the tool must perform exceptionally in both precision and recall for all types of personal data, which can be difficult to achieve even with a refined model.

5.4 Conclusion

As the scope of the GDPR is extended to unstructured data it is important that companies handle personal data in email correctly. Setting up a code of conduct that encompasses the defining, identifying, marking, distribution and storage of personal data, as well as documentation of these procedures, can be considered as a step towards compliance. However, a complete solution for all data is needed. Based on the results of the conducted study, the suggested method of using NER to find personal data is not deemed suitable for the purpose of ensuring compliance with the GDPR.

REFERENCES

[1] S. Garfinkel, Database nation: the death of privacy in the21st century. " O’Reilly Media, Inc.", 2000.

[2] I. T. Union, Measuring the Information Society Report2017. 2017, p. 30.

[3] T. S. D. P. Authority. (). Dataskyddsförordningenssyfte, [Online]. Available: https : / / www .datainspektionen . se / dataskyddsreformen /dataskyddsforordningen / introduktion - till -dataskyddsforordningen / dataskyddsforordningens -syfte/. (accessed 08.05.2018).

[4] Trunomi. (). Gdpr key changes, [Online]. Available:https : / / www . eugdpr . org / key - changes . html.(accessed 09.05.2018).

[5] A. Khan, B. Baharudin, L. H. Lee, and K. Khan,“A review of machine learning algorithms for text-documents classification”, Journal of advances in infor-mation technology, vol. 1, no. 1, pp. 4–20, 2010.

[6] R. Blumberg and S. Atre, “The problem with unstruc-tured data”, Dm Review, vol. 13, no. 42-49, p. 62, 2003.

[7] J. P. Isson, Unstructured Data Analytics : How to ImproveCustomer Acquisition, Customer Retention, and FraudDetection and Prevention, eng. John Wiley & Sons, Inc.,2018, ISBN: 1-119-37884-2.

[8] A. Cross. (). Best gdpr software tools and solutions:50 leading tools & solutions for gdpr security, datagovernance, user consent management & more, [On-line]. Available: https : / / www. ngdata . com / gdpr -software - tools - and - solutions / #DataGovernance.(accessed 24.05.2018).

[9] D. Jurafsky, Speech and language processing : an introduc-tion to natural language processing, computational linguis-tics and speech recognition, eng, 2. ed.., ser. Prentice Hallseries in artificial intelligence. Upper Saddle River,N.J.: Prentice Hall, 2009, ISBN: 0131873210.

[10] D. Nadeau and S. Sekine, “A survey of named entityrecognition and classification”, Lingvisticae Investiga-tiones, vol. 30, no. 1, pp. 3–26, 2007.

[11] R. Thomander, Personal correspondence, email, On-On AB, 2018.

[12] T. S. D. P. Authority. (). Allmänna frågor, [Online].Available: https://www.datainspektionen.se/fragor-och- svar/eus- dataskyddsreform/allmanna- fragor/#A1a. (accessed 09.05.2018).

[13] C. o. t. E. U. European Parliament, “Regulation (eu)2016/679 of the european parliament and of the coun-cil of 27 april 2016 on the protection of natural personswith regard to the processing of personal data and onthe free movement of such data, and repealing direc-tive 95/46/ec (general data protection regulation)”,Official Journal of the European Union, vol. L 119/1,pp. 1–88, 2016.

[14] (2018), [Online]. Available: https : / / ec . europa . eu /info/law/law-topic/data-protection/reform/what-personal-data_en.

[15] A. Cavoukian, “Privacy by design”, Take the challenge.Information and privacy commissioner of Ontario, Canada,2009.

[16] G. Danezis, J. Domingo-Ferrer, M. Hansen, J.-H.Hoepman, D. L. Metayer, R. Tirtea, and S. Schiffner,“Privacy and data protection by design-from policy toengineering”, arXiv preprint arXiv:1501.03726, 2015.

[17] C. Tikkinen-Piri, A. Rohunen, and J. Markkula, “Eugeneral data protection regulation: Changes and im-plications for personal data collecting companies”,Computer Law & Security Review, vol. 34, no. 1, pp. 134–153, 2018, ISSN: 0267-3649. DOI: https://doi.org/10.1016 / j . clsr . 2017 . 05 . 015. [Online]. Available: http :/ / www. sciencedirect . com / science / article / pii /S0267364917301966.

[18] Microsoft, Data governance for gdpr compliance: Princi-ples, processes, and practices, 2017. [Online]. Available:http : / / info . microsoft . com / rs / 157 - GQE - 382 /images/Data_Governance_for_GDPR_Compliance_whitepaper232_EN_US.pdf.

[19] T. S. D. P. Authority. (). Hantera personuppgifteri e-post, [Online]. Available: https : / / www .datainspektionen.se/dsf-epost. (accessed 22.05.2018).

[20] B. O’Dwyer and G. Madden, “Ethical codes of conductin irish companies: A survey of code content andenforcement procedures”, Journal of Business Ethics,

Page 15: A Step Toward GDPR Compliancekth.diva-portal.org/smash/get/diva2:1264630/FULLTEXT01.pdf · Dataskyddsförordningen började gälla den 25e maj 2018, och uppstod som ett svar på den

12

vol. 63, no. 3, pp. 217–236, Feb. 2006, ISSN: 1573-0697.DOI: 10.1007/s10551-005-3967-x. [Online]. Available:https://doi.org/10.1007/s10551-005-3967-x.

[21] M. Kaptein and M. S. Schwartz, “The effectivenessof business codes: A critical examination of existingstudies and the development of an integrated re-search model”, Journal of Business Ethics, vol. 77, no. 2,pp. 111–127, Jan. 2008. DOI: 10.1007/s10551-006-9305-0. [Online]. Available: https : / / doi . org / 10 . 1007 /s10551-006-9305-0.

[22] P. M. Erwin, “Corporate codes of conduct: The effectsof code content and quality on ethical performance”,Journal of Business Ethics, vol. 99, no. 4, pp. 535–548,Apr. 2011, ISSN: 1573-0697. DOI: 10.1007/s10551-010-0667-y. [Online]. Available: https://doi.org/10.1007/s10551-010-0667-y.

[23] I. Kabanov, “Effective frameworks for delivering com-pliance with personal data privacy regulatory require-ments”, in Privacy, Security and Trust (PST), 2016 14thAnnual Conference on, IEEE, 2016, pp. 551–554.

[24] R. Grundstein-Amado, “A strategy for formulationand implementation of codes of ethics in public ser-vice organizations”, International Journal of Public Ad-ministration, vol. 24, no. 5, pp. 461–478, 2001.

[25] S. S. Prakash, “Standards for corporate conduct inthe international arena: Challenges and opportunitiesfor multinational corporations”, Business and SocietyReview, vol. 107, no. 1, pp. 20–40, DOI: 10 . 1111 /0045-3609.00125. eprint: https://onlinelibrary.wiley.com/doi/pdf/10 .1111/0045- 3609 .00125. [Online].Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/0045-3609.00125.

[26] A. Ekbal and S. Saha, “Multiobjective optimization forclassifier ensemble and feature selection: An applica-tion to named entity recognition”, International Journalon Document Analysis and Recognition (IJDAR), vol. 15,no. 2, pp. 143–166, 2012.

[27] D. C. DC. (). A survey of stochastic and gazetteerbased approaches for named entity recognition- part 2, [Online]. Available: http : / / www .datacommunitydc . org / blog / 2013 / 04 / a - survey -of - stochastic - and - gazetteer - based - approaches -for - named - entity - recognition - part - 2. (accessed19.05.2018).

[28] R. J. Mooney and R. Bunescu, “Mining knowl-edge from text using information extraction”, ACMSIGKDD explorations newsletter, vol. 7, no. 1, pp. 3–10,2005.

[29] E. Minkov, R. C. Wang, and W. W. Cohen, “Extractingpersonal names from email: Applying named entityrecognition to informal text”, in Proceedings of theconference on Human Language Technology and EmpiricalMethods in Natural Language Processing, Association forComputational Linguistics, 2005, pp. 443–450.

[30] O. Bender, F. J. Och, and H. Ney, “Maximum entropymodels for named entity recognition”, in Proceedingsof the seventh conference on Natural language learning atHLT-NAACL 2003-Volume 4, Association for Compu-tational Linguistics, 2003, pp. 148–151.

[31] R. Florian, A. Ittycheriah, H. Jing, and T. Zhang,“Named entity recognition through classifier com-

bination”, in Proceedings of the seventh conference onNatural language learning at HLT-NAACL 2003-Volume4, Association for Computational Linguistics, 2003,pp. 168–171.

[32] L. Ratinov and D. Roth, “Design challenges and mis-conceptions in named entity recognition”, in Proceed-ings of the Thirteenth Conference on Computational Natu-ral Language Learning, Association for ComputationalLinguistics, 2009, pp. 147–155.

[33] D. Maynard, V. Tablan, C. Ursu, H. Cunningham,and Y. Wilks, “Named entity recognition from diversetext types”, in Recent Advances in Natural LanguageProcessing 2001 Conference, 2001, pp. 257–274.

[34] D. Liu, T. Li, and D. Liang, “Incorporating logisticregression to decision-theoretic rough sets for classifi-cations”, International Journal of Approximate Reasoning,vol. 55, no. 1, pp. 197–210, 2014.

[35] L. Bottou, “Stochastic gradient descent tricks”, in Neu-ral networks: Tricks of the trade, Springer, 2012, pp. 421–436.

[36] J. Boye, Logistic regression, University Lecture, 2017.[37] S. A. C. Álvarez, “An exact analytical rela-

tion among recall, precision, and classification ac-curacy in information retrieval”, 2002. [Online].Available: https : / / www . semanticscholar .org / paper / An - exact - analytical - relation -among - recall % 2C - and - in - %C3 % 81lvarez /d8ff71a903a73880599fdd2c7be12de1f3730d29.

[38] C.-Y. J. Peng, K. L. Lee, and G. M. Ingersoll, “Anintroduction to logistic regression analysis and report-ing”, The Journal of Educational Research, vol. 96, no. 1,pp. 3–14, 2002. DOI: 10.1080/00220670209598786.

[39] S. Statistics. (). Namnstatistik, [Online]. Available:https://www.scb.se/hitta- statistik/statistik- efter-amne / befolkning / amnesovergripande - statistik /namnstatistik/. (accessed 19.05.2018).

Linnea Olby is 24 years old, from Sundsvall, and is currently pursuing a Master of Science in Industrial Engineering and Management at KTH Royal Institute of Technology, Stockholm, Sweden, minoring in Software Engineering.

Her contribution to this paper was mainly focused on the implementation of the NER model in python3 as well as details regarding the assembly of this paper, such as the graphical presentation of the model’s performance. However, this paper is essentially a result of a collaboration between the authors, including continuous discussions concerning both content and linguistics.

Isabel Thomander received a Diploma of Higher Education from the University of St Andrews, Scotland, in 2016. She is currently enrolled at KTH Royal Institute of Technology, Stockholm, Sweden, pursuing a Master of Science in Industrial Engineering and Management, minoring in Software Engineering.

Her contribution to this study consisted of valuable insights regarding the structure of the paper as well as a substantial contribution to theory and the discussion of the results. As stated above, however, the study is mainly a result of extensive collaboration.


APPENDIX A

Performance measurements for the five folds (F1, F2, ..., F5) of the cross-validation, using base case settings.

Fold   Accuracy   Average precision   Average recall   Recall 'Name'   F1-score
F1     86.40 %    83.82 %             73.90 %          51.43 %         62.67 %
F2     89.67 %    90.77 %             78.08 %          57.51 %         70.86 %
F3     90.05 %    93.95 %             75.69 %          51.52 %         67.75 %
F4     97.48 %    96.51 %             73.29 %          46.69 %         62.72 %
F5     98.97 %    99.48 %             75.97 %          51.94 %         68.37 %
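As an illustration of how measures of this kind relate to each other, the sketch below computes per-class precision, recall and F1-score from a 2x2 confusion matrix. It uses the standard definitions and the `confusion[predicted][real]` layout of Listing 2. The counts are invented for the example and the function name `class_metrics` is hypothetical. The source does not state how the F1-score column was derived, so the sketch makes no claim to reproduce the table.

```python
def class_metrics(confusion, cls):
    """Precision, recall and F1 for class `cls` (0 or 1).

    `confusion[predicted][real]` holds the number of datapoints that were
    predicted as `predicted` and whose true class is `real`.
    """
    other = 1 - cls
    true_pos = confusion[cls][cls]
    false_pos = confusion[cls][other]   # predicted cls, really other
    false_neg = confusion[other][cls]   # predicted other, really cls

    predicted_total = true_pos + false_pos
    real_total = true_pos + false_neg
    precision = true_pos / predicted_total if predicted_total else 0.0
    recall = true_pos / real_total if real_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Invented counts: 1000 tokens, of which 90 are real names.
confusion = [[900, 50],
             [10, 40]]
accuracy = (confusion[0][0] + confusion[1][1]) / 1000
precision, recall, f1 = class_metrics(confusion, cls=1)
print(accuracy, precision, recall, f1)
```

In this invented case the accuracy is 0.94 while the recall for names is only about 0.44. High accuracy on imbalanced data can therefore coexist with many missed names.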

APPENDIX B

Listing 1: NER.py

import argparse
import sys
import codecs
from BinaryLogisticRegression import BinaryLogisticRegression

"""
This file is part of the computer assignments for the course DD1418/DD2418 Language engineering at KTH.
Created 2017 by Johan Boye and Patrik Jonell.

The code was further developed by Linnea Olby and Isabel Thomander in 2017, and later updated in 2018.
The parts that were added are marked up in the code.
"""


class NER(object):
    """
    This class performs Named Entity Recognition (NER).

    It either builds a binary NER model (which distinguishes
    between 'name' or 'noname') from training data, or tries a NER model
    on test data, or both.

    Each line in the data files is supposed to have 2 fields:
    Token, Label

    The 'label' is 'o' if the token is not a name.
    """

    class Dataset(object):
        """
        Internal class for representing a dataset.
        """
        def __init__(self):
            # The list of datapoints. Each datapoint is itself
            # a list of features (each feature coded as a number).
            self.x = []

            # The list of labels for each datapoint. The datapoints should
            # have the same order as in the 'x' list above.
            self.y = []

    # --------------------------------------------------

    """
    Boolean feature computation.
    """

    def capitalized_token(self):
        return self.current_token is not None and self.current_token.istitle()

    def first_token_in_sentence(self):
        return self.last_token in [None, '.', '!', '?']

    # *** Written by Linnea & Isabel ****
    def end_of_token(self):
        # Common Swedish surname endings.
        return self.current_token.endswith(('berg', 'qvist', 'kvist', 'son', 'ström'))

    # *** Written by Linnea & Isabel ****
    def token_in_f_names(self):
        return self.current_token in self.f_names

    # *** Written by Linnea & Isabel ****
    def token_in_m_names(self):
        return self.current_token in self.m_names

    # *** Written by Linnea & Isabel ****
    def token_in_lastnames(self):
        return self.current_token in self.l_names

    # *** Written by Linnea & Isabel ****
    def token_in_cities(self):
        # Note: the feature fires when the token is NOT in the locations gazetteer.
        return self.current_token not in self.citites

    class FeatureFunction(object):
        def __init__(self, func):
            self.func = func

        def evaluate(self):
            return 1 if self.func() else 0

    # --------------------------------------------------

    def label_number(self, s):
        return 0 if 'o' == s else 1

    def read_and_process_data(self, filename):
        """
        Read the input file and return the dataset.
        """
        dataset = NER.Dataset()
        with codecs.open(filename, 'r', 'utf-8') as f:
            for line in f.readlines():
                field = line.strip().split(';')
                if len(field) == 3:
                    # Special case: The token is a semicolon ";"
                    self.process_data(dataset, ';', 'o')
                else:
                    self.process_data(dataset, field[0], field[1])
            return dataset
        return None

    def process_data(self, dataset, token, label):
        """
        Processes one line (= one datapoint) in the input file.
        """
        self.last_token = self.current_token
        self.current_token = token

        datapoint = []
        for f in self.features:
            datapoint.append(f.evaluate())

        dataset.x.append(datapoint)
        dataset.y.append(self.label_number(label))

    def read_model(self, filename):
        """
        Read a model from file
        """
        with codecs.open(filename, 'r', 'utf-8') as f:
            # list(...) so that the weights can be indexed and measured with len();
            # split() without arguments also tolerates a trailing newline.
            d = list(map(float, f.read().split()))
            return d
        return None

    # *** Written by Linnea & Isabel ****
    def create_list(self, filename):
        """
        Creates a list for list lookup features.
        """
        entity_list = []
        with open(filename, encoding='utf-8') as f:
            for line in f:
                entity_list.append(line.strip("\n"))
        return entity_list

    # ----------------------------------------------------------

    def __init__(self, training_file, test_file, model_file):
        """
        Constructor. Trains and tests a NER model using binary logistic regression.
        """

        self.current_token = None  # The token currently under consideration.
        self.last_token = None  # The token on the preceding line.

        # *** Written by Linnea & Isabel ****
        # The gazetteers for female, male and last names as well as for cities
        self.f_names = self.create_list("f100_names.txt")
        self.m_names = self.create_list("m100_names.txt")
        self.l_names = self.create_list("l100_names.txt")
        self.citites = self.create_list("locations.txt")

        self.features = [
            NER.FeatureFunction(self.capitalized_token),
            NER.FeatureFunction(self.first_token_in_sentence),
            NER.FeatureFunction(self.end_of_token),
            NER.FeatureFunction(self.token_in_f_names),
            NER.FeatureFunction(self.token_in_m_names),
            NER.FeatureFunction(self.token_in_lastnames),
            NER.FeatureFunction(self.token_in_cities),
        ]

        if training_file:
            # Train a model
            training_set = self.read_and_process_data(training_file)
            if training_set:
                b = BinaryLogisticRegression(training_set.x, training_set.y)
                b.stochastic_fit()
        else:
            model = self.read_model(model_file)
            if model:
                # Pass the weights as theta; a bare positional argument would be taken as x.
                b = BinaryLogisticRegression(theta=model)

        # Test the model on a test set
        test_set = self.read_and_process_data(test_file)
        if test_set:
            b.classify_datapoints(test_set.x, test_set.y)

    # ----------------------------------------------------------


def main():
    """
    Main method. Decodes command-line arguments, and starts the Named Entity Recognition.
    """

    parser = argparse.ArgumentParser(
        description="Named Entity Recognition",
        usage="\n* If the -d and -t are both given, the program will train a model, "
              "and apply it to the test file. \n* If only -t and -m are given, the program will "
              "read the model from the model file, and apply it to the test file.")

    required_named = parser.add_argument_group("required named arguments")
    required_named.add_argument('-t', type=str, required=True, help="test file (mandatory)")

    group = required_named.add_mutually_exclusive_group(required=True)
    group.add_argument('-d', type=str, help="training file (required if -m is not set)")
    group.add_argument('-m', type=str, help="model file (required if -d is not set)")

    if len(sys.argv[1:]) == 0:
        parser.print_help()
        parser.exit()
    arguments = parser.parse_args()

    NER(arguments.d, arguments.t, arguments.m)


if __name__ == '__main__':
    main()
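To make the input format and the Boolean features of Listing 1 concrete, the following standalone sketch parses a few `token;label` lines and computes a reduced feature vector for each token. It is a simplified illustration under stated assumptions: the label string `name`, the small set `F_NAMES` (standing in for a gazetteer file such as `f100_names.txt`) and the helper names `parse_line` and `features` do not come from the original code.

```python
SUFFIXES = ('berg', 'qvist', 'kvist', 'son', 'ström')
F_NAMES = {'Anna', 'Maria'}  # hypothetical stand-in for a gazetteer file


def parse_line(line):
    """Split one 'token;label' line into a (token, label) pair."""
    fields = line.strip().split(';')
    if len(fields) == 3:
        # The token itself is a semicolon: ';;o' splits into ['', '', 'o'].
        return ';', 'o'
    return fields[0], fields[1]


def features(token, last_token):
    """Four of the Boolean features, coded as 0 or 1."""
    return [
        int(token.istitle()),                       # capitalized token
        int(last_token in (None, '.', '!', '?')),   # first token in sentence
        int(token.endswith(SUFFIXES)),              # typical surname ending
        int(token in F_NAMES),                      # gazetteer lookup
    ]


lines = ["Hej;o", "Anna;name", "Lindberg;name", ";;o"]  # assumed label strings
last_token = None
rows = []
for line in lines:
    token, label = parse_line(line)
    rows.append((token, features(token, last_token), 0 if label == 'o' else 1))
    last_token = token

for row in rows:
    print(row)
```

Each row pairs a token with its feature vector and numeric label, which corresponds to one entry in the `x` and `y` lists of the `Dataset` class.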


APPENDIX C

Listing 2: BinaryLogisticRegression.py

# coding=utf-8
from __future__ import print_function
import math
import random
import numpy as np

"""
This file is part of the computer assignments for the course DD1418/DD2418 Language engineering at KTH.
Created 2017 by Johan Boye and Patrik Jonell.

The code was further developed by Linnea Olby and Isabel Thomander in 2017, and later updated in 2018.
The parts that were added are marked up in the code.
"""


class BinaryLogisticRegression(object):
    """
    This class performs binary logistic regression using stochastic gradient descent.
    """

    # ------------- Hyperparameters ------------------ #

    LEARNING_RATE = 0.05  # The learning rate.
    CONVERGENCE_MARGIN = 0.001  # The convergence criterion.
    MAX_ITERATIONS = 50  # Maximal number of passes through the datapoints in stochastic gradient descent.

    # ----------------------------------------------------------------------

    def __init__(self, x=None, y=None, theta=None):
        """
        Constructor. Imports the data and labels needed to build theta.

        @param x The input as a DATAPOINT*FEATURES array.
        @param y The labels as a DATAPOINT array.
        @param theta A ready-made model. (instead of x and y)
        """
        if not any([x, y, theta]) or all([x, y, theta]):
            raise Exception('You have to either give x and y or theta')

        if theta:
            self.FEATURES = len(theta)
            self.theta = theta

        elif x and y:
            # Number of datapoints.
            self.DATAPOINTS = len(x)

            # Number of features (plus one for the bias term).
            self.FEATURES = len(x[0]) + 1

            # Encoding of the datapoints (as a DATAPOINTS x FEATURES size array).
            self.x = np.concatenate((np.ones((self.DATAPOINTS, 1)), np.array(x)), axis=1)

            # Correct labels for the datapoints.
            self.y = np.array(y)

            # The weights we want to learn in the training phase.
            self.theta = np.random.uniform(-1, 1, self.FEATURES)

            # The current gradient.
            self.gradient = np.zeros(self.FEATURES)

    # ----------------------------------------------------------------------

    def sigmoid(self, z):
        """
        The logistic function.
        """
        return 1.0 / (1 + math.exp(-z))

    def conditional_prob(self, label, d):
        """
        Computes the conditional probability P(label|datapoint)
        """

        # *** Written by Linnea & Isabel ****
        if label == 0:
            return 1 - self.sigmoid(np.dot(self.theta, self.x[d]))
        elif label == 1:
            return self.sigmoid(np.dot(self.theta, self.x[d]))

    def compute_gradient(self, d):
        """
        Computes the gradient based on a single datapoint.
        """

        # *** Written by Linnea & Isabel ****
        for k in range(self.FEATURES):
            self.gradient[k] = self.x[d, k] * (self.sigmoid(np.dot(self.theta, self.x[d])) - self.y[d])

    def stochastic_fit(self):
        """
        Performs Stochastic Gradient Descent.
        """

        # *** Written by Linnea & Isabel ****
        it = 0
        # Continue until every gradient component is within the margin,
        # or the iteration budget is used up (always run at least once).
        while (any(abs(i) > self.CONVERGENCE_MARGIN for i in self.gradient) and
               it <= self.MAX_ITERATIONS * self.DATAPOINTS) or it == 0:
            i = random.randrange(self.DATAPOINTS)
            self.compute_gradient(i)
            for k in range(self.FEATURES):
                self.theta[k] -= self.LEARNING_RATE * self.gradient[k]
            it += 1

    def classify_datapoints(self, test_data, test_labels):
        """
        Classifies datapoints
        """
        print('Model parameters:')

        print(' '.join('{:d}: {:.4f}'.format(k, self.theta[k]) for k in range(self.FEATURES)))

        self.DATAPOINTS = len(test_data)

        self.x = np.concatenate((np.ones((self.DATAPOINTS, 1)), np.array(test_data)), axis=1)
        self.y = np.array(test_labels)
        # Two classes, so a 2 x 2 matrix suffices (indexed as [predicted][real]).
        confusion = np.zeros((2, 2))

        for d in range(self.DATAPOINTS):
            prob = self.conditional_prob(1, d)
            predicted = 1 if prob > .5 else 0
            confusion[predicted][self.y[d]] += 1

        print('                       Real class')
        print('                 ', end='')
        print(' '.join('{:>8d}'.format(i) for i in range(2)))
        for i in range(2):
            if i == 0:
                print('Predicted class: {:2d} '.format(i), end='')
            else:
                print('                 {:2d} '.format(i), end='')
            print(' '.join('{:>8.3f}'.format(confusion[i][j]) for j in range(2)))

        # *** Written by Linnea & Isabel ****
        accuracy = float(confusion[0][0] + confusion[1][1]) / float(self.DATAPOINTS)

        prec_notname = float(confusion[0][0]) / float(confusion[0][0] + confusion[0][1])
        reca_notname = float(confusion[0][0]) / float(confusion[0][0] + confusion[1][0])

        prec_name = float(confusion[1][1]) / float(confusion[1][1] + confusion[1][0])
        reca_name = float(confusion[1][1]) / float(confusion[1][1] + confusion[0][1])

        avg_prec = float((prec_name + prec_notname) / 2)
        avg_reca = float((reca_name + reca_notname) / 2)
        # The divisor 6000 is hard-coded in the original.
        baseline = 1 - (float(confusion[1][1] + confusion[0][1]) / 6000)

        print("Accuracy: " + str(accuracy))
        print("Precision for NotName: " + str(prec_notname))
        print("Precision for Name: " + str(prec_name))
        print("Average precision: " + str(avg_prec))
        print("Recall for NotName: " + str(reca_notname))
        print("Recall for Name: " + str(reca_name))
        print("Average recall: " + str(avg_reca))
        print("Baseline for Name: " + str(baseline))
        """
        print(str(accuracy))
        print(str(prec_notname))
        print(str(prec_name))
        print(str(avg_prec))
        print(str(reca_notname))
        print(str(reca_name))
        print(str(avg_reca))
        print(str(baseline))
        """

    def print_result(self):
        print(' '.join(['{:.2f}'.format(x) for x in self.theta]))
        print(' '.join(['{:.2f}'.format(x) for x in self.gradient]))


def main():
    """
    Tests the code on a toy example.
    """
    x = [
        [1, 1], [0, 0], [1, 0], [0, 0], [0, 0], [0, 0],
        [0, 0], [0, 0], [1, 1], [0, 0], [0, 0], [1, 0],
        [1, 0], [0, 0], [1, 1], [0, 0], [1, 0], [0, 0]
    ]

    # Encoding of the correct classes for the training material
    y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0]
    b = BinaryLogisticRegression(x, y)
    # The class only defines stochastic_fit(); the original call b.fit() would fail.
    b.stochastic_fit()
    b.print_result()


if __name__ == '__main__':
    main()
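The training step in Listing 2 can be summarised by the update rule below, read off from `sigmoid`, `compute_gradient` and `stochastic_fit`. The notation is introduced here for illustration only: $x_d$ is the feature vector of a randomly drawn datapoint $d$ (with $x_{d,0} = 1$ as the bias term), $y_d \in \{0, 1\}$ its label, $K$ the value of `FEATURES` and $\alpha$ the `LEARNING_RATE`.

```latex
\begin{align}
\sigma(z) &= \frac{1}{1 + e^{-z}} \\
P(y = 1 \mid x_d) &= \sigma(\theta^\top x_d) \\
g_k &= x_{d,k}\,\bigl(\sigma(\theta^\top x_d) - y_d\bigr), \qquad k = 0, \dots, K - 1 \\
\theta_k &\leftarrow \theta_k - \alpha\, g_k, \qquad \alpha = 0.05
\end{align}
```

Here $g_k$ is the gradient of the negative log-likelihood of the single datapoint $d$ with respect to $\theta_k$. The loop repeats as long as some $|g_k|$ exceeds the convergence margin of 0.001 and the number of updates has not passed roughly $50 N$, where $N$ is the number of training datapoints.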


TRITA TRITA-EECS-EX-2018:429

www.kth.se