Protecting Information in the NIEM Lifecycle Using Synthetic Data, 15 December 2009. Nykia Jackson, Barbara Shapter, Kim Sterret-Day, Johns Hopkins University Applied Physics Laboratory, [email protected]


Protecting Information in the NIEM Lifecycle Using Synthetic Data

15 December 2009

Nykia Jackson, Barbara Shapter, Kim Sterret-Day
Johns Hopkins University Applied Physics Laboratory
[email protected]


NIEM Blue Team Tools Day

Introduction

JHU/APL for DHS S&T CCI in support of DHS EDMO

Objective: to help EDMO and the NIEM community
• By learning of needs and gaps
• By exploring technologies that fill critical gaps in the NIEM lifecycles of model management, IEPD development, and implementation support


Filling a Gap


Objective: to develop a proof-of-concept capability to create test data for an information exchange using synthetic data

At the practitioner level: need a tool to aid the testing of an implementation by generating “safe” test data


Vision for Solution

[Diagram labels: Synthetic Data, SYNINGEN]


Design

Proof-of-Concept for one IEPD.

Synthetic Instance Generator (SYNINGEN)
• Synthetic data source
• Embedded database
• Dynamic insertion of “controlled” erroneous data

[Diagram labels: Synthetic Data, SYNINGEN, Test Records, Schema Binding, IEPD Schemas, Pre-process]
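The idea of turning a synthetic record into a test record, with optional "controlled" erroneous data, can be sketched in Python. This is a minimal illustration under stated assumptions, not the SYNINGEN implementation: the helper names (inject_error, build_test_record), the error kinds, the simplified XML layout, and the license number are all hypothetical, and the element names are not taken from the actual IEPD schema.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical synthetic person. The name and date of birth echo the sample
# bio in this deck; the license number and state code are made up.
SYNTHETIC_PERSON = {
    "GivenName": "Eladio",
    "FamilyName": "Berstis",
    "DOB": "1974-06-08",
    "DriverLicenseNumber": "B123456789012",
    "DriverLicenseState": "MI",
}


def inject_error(person, field, kind):
    """Return a copy of `person` with one controlled error in `field`."""
    bad = copy.deepcopy(person)
    if kind == "missing":
        del bad[field]
    elif kind == "malformed":
        bad[field] = "??" + str(bad[field])
    else:
        raise ValueError(f"unknown error kind: {kind}")
    return bad


def build_test_record(person):
    """Serialize a person dictionary as a simplified XML test record."""
    root = ET.Element("TestRecord")
    for name, value in person.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


good_xml = build_test_record(SYNTHETIC_PERSON)
bad_xml = build_test_record(inject_error(SYNTHETIC_PERSON, "DOB", "malformed"))
```

Keeping error injection separate from serialization means the same good record can yield many bad variants, each with exactly one known defect.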


IEPD Selection

Requirements

Consists of a cross-section of commonly used data fields

Contains a minimal amount of domain-specific data, so that the capabilities of the previously developed synthetic data generator can be used

Provides a concrete method for assessing IEPD implementation / test data via development of a web service (WS)

Selection: CONNECT Driver License Search IEPD

Designed to facilitate the effective exchange of criminal justice information amongst the CONNECT states (Alabama, Nebraska, Wyoming, Tennessee, and Kansas)

Defines the driver license search parameters, driver license search results (summary), and driver license details


Demonstration

Generate Test Data
• Good values
• “Bad” values

Test Web Service
• Client
• Web Service

[Diagram labels: Synthetic Data, SYNINGEN, Test Records, Schema Binding, IEPD Schemas, Pre-process, Driver License Query Client]
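A test run of this kind pairs each generated query with the outcome the service is expected to produce. The Python sketch below assumes a local stand-in function in place of the real web service; stub_license_query, its validation rules, the parameter names, and the test values are all hypothetical and only illustrate how good and "bad" values might be checked.

```python
import re
from datetime import date


def stub_license_query(params):
    """Stand-in for the driver license web service: validate a query."""
    errors = []
    if not re.fullmatch(r"[A-Z]{2}", params.get("DriverLicenseState", "")):
        errors.append("DriverLicenseState")
    if not params.get("DriverLicenseNumber", "").isalnum():
        errors.append("DriverLicenseNumber")
    try:
        date.fromisoformat(params.get("DOB", ""))
    except ValueError:
        errors.append("DOB")
    return {"accepted": not errors, "errors": errors}


# Each case: (query parameters, should the service accept them?)
TEST_CASES = [
    ({"DriverLicenseState": "MI", "DriverLicenseNumber": "B123456789012",
      "DOB": "1974-06-08"}, True),
    ({"DriverLicenseState": "Michigan", "DriverLicenseNumber": "B123456789012",
      "DOB": "1974-06-08"}, False),
    ({"DriverLicenseState": "MI", "DriverLicenseNumber": "B123456789012",
      "DOB": "1974-13-40"}, False),
]


def run_tests(service, cases):
    """Return the cases whose outcome differs from the expectation."""
    return [(params, expected) for params, expected in cases
            if service(params)["accepted"] != expected]


failures = run_tests(stub_license_query, TEST_CASES)
```

An empty failures list means the service accepted every good query and rejected every bad one; swapping the stub for a real client call would keep the rest of the harness unchanged.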


Synthetic Data Generation

[Diagram labels: Synthetic Data, SYNINGEN]


Why Use Synthetic Data?

Need for large-scale, high-quality synthetic datasets to support DHS Test and Evaluation (T&E) activities
• Designing
• Modeling
• Testing (including usability)
• Training
• Tool studies

This need poses privacy protection challenges because of the lack of access to actual data, i.e., Personally Identifiable Information (PII), and other access limitations

Four possible data methods available to address data needs (with limitations)
• Use of actual data
• Sanitized or anonymized data
• Manually created fictitious data
• Machine-generated large-scale datasets from real-world models, algorithms, or reference statistical patterns


Synthetic Data Generator (SDG)

Synthetic Data: datasets composed entirely of fictitious data that can be used in a given context (or situation) instead of directly measurable or accessible actual data.

Prototype developed for Department of Homeland Security (DHS) Science and Technology (S&T) Command, Control, and Interoperability (CCI) Division

Automatic capability to produce robust datasets composed of entities (e.g., people with behaviors/“footprints” over time)
• Creates synthetic test data that models a community with highly connected social networks of entities and relationships
• Data reflects typical daily activities in which people travel, communicate, and spend money in ways that are normally expected in a reasonable world
• Datasets are in simple delimited text format
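Because the datasets are delimited text, they can be read with standard tooling. The following Python sketch assumes a pipe delimiter and a header row, neither of which is specified in the slides; the column subset and the load_table helper are likewise hypothetical, and the two rows echo the sample bio elsewhere in this deck.

```python
import csv
import io

# Hypothetical extract: the pipe delimiter, header row, and column subset
# are assumptions, not the documented SDG file layout.
SAMPLE = """PersonID|GivenName|FamilyName|AddressCity|AddressState
1|Eladio|Berstis|Lansing|MI
2|Soto|Berstis|Providence|RI
"""


def load_table(text, delimiter="|"):
    """Parse delimited text into a list of dictionaries keyed by header."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


people = load_table(SAMPLE)

# Group given names by family name, e.g. to inspect family clusters.
by_family = {}
for row in people:
    by_family.setdefault(row["FamilyName"], []).append(row["GivenName"])
```

For files on disk, the same reader works with open(path, newline="") in place of io.StringIO.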


SDG Data

JHU/APL developed the concept and rules that characterize the “reasonable world”

Categories of interest
• Demographics (including immigration)
• Social networks
• Communication patterns
• Travel
• Financial transactions (including consumer spending)

Produce datasets that are consistent in time and space

[Diagram: entities Person, City, Credit Card Transaction, Credit Card Number, Phone Number, and Call Transcript, with relationship labels travels, purchases, communicates using, caller, and receiver]


Current Synthetic Fields

PERSON …: PersonID, BinaryBase64Object, BinaryDescriptionText, BinaryFormatID, BinaryFormtStandardText, BinaryCategoryText, GivenName, FamilyName, MiddleName, Suffix, Citizenship, PassportNumber, DriverLicenseNumber, DriverLicenseState, DriverLicenseExpirationDate, DriverLicenseIssueDate, DOB

PERSON: EthnicityCode, EthnicityText, EyeColorCode, EyeColorText, GenderCode, GenderText, HairColorCode, HairColorText, HeightInches, WeightPounds, AddressStreetNumber, AddressStreetName, AddressCity, AddressCounty*, AddressState, AddressPostalCode, AddressPostalExtensionCode

PHONE_NUMBER: PersonID, Type (Landline or Mobile), Number

PHONE_CALL: PersonID, Date, DurationSeconds, Type, FromCityID, FromNumber, ToCityID, ToNumber

CREDIT_CARD_TRANSACTION: PersonID, TransactionNumber, CreditCardNumber, PurchaseCity, Date, Amount, Company, Industry

TRAVEL: PersonID, FromCityID, ToCityID, Date

CITY: CityID, City, State, Region, Country
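The tables are linked by identifier fields such as PersonID and CityID. As a rough illustration, the Python sketch below models two of them and resolves a TRAVEL row against CITY. The class layout, the describe_trips helper, the field types, and the sample values (including the region names and the travel date) are assumptions made for the example, not part of the SDG output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class City:
    city_id: int
    city: str
    state: str
    region: str
    country: str


@dataclass(frozen=True)
class Travel:
    person_id: int
    from_city_id: int
    to_city_id: int
    date: str


def describe_trips(travels, cities):
    """Resolve FromCityID and ToCityID to city names for each travel row."""
    by_id = {c.city_id: c for c in cities}
    return [
        f"{t.date}: person {t.person_id} "
        f"{by_id[t.from_city_id].city} -> {by_id[t.to_city_id].city}"
        for t in travels
    ]


# Illustrative rows only; values are not taken from a real dataset.
cities = [
    City(1, "Lansing", "MI", "Midwest", "USA"),
    City(2, "Providence", "RI", "Northeast", "USA"),
]
travels = [Travel(1, 1, 2, "2009-12-15")]
trips = describe_trips(travels, cities)
```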


Synthetic Data Sample


Bio:

Eladio Berstis, a U.S. citizen, lives in Lansing, Michigan, and was born on June 8, 1974. He subscribes to two landline telephone numbers: 517-513-5528 and 517-567-1171. He shares these numbers with family members who live with him. Eladio has a relative, Soto Berstis, who lives in Providence, Rhode Island. Eladio calls Soto regularly. Eladio owns two MasterCard credit cards.


Utility

Developed prototype web portal interface to SDG
• User specifies characteristic attributes for a dataset through this interface

Has been extended to generate other domain-specific data

Applications
• North American Threat (NAT) Dataset for intelligence analysis
• Privacy Protection Technology
• NIEM Test Data
• Suspicious Activity Reports (SARs)

Datasets have been generated and distributed to research institutions and agencies


Feedback

CY2010 Q1: Delivery of SYNINGEN software to DHS EDMO
• Independent of IEPD

Definition of fields desired in a synthetic dataset for NIEM
• What is useful?
• What level of fidelity is desired for the “reasonable world”?

Feedback welcomed: [email protected] [email protected]


Backup Slides


How reasonable is the data?


Names
• People in the same family tend to share the same family name
• Western naming convention of a single first name and single last name
• Many companies do not have realistic names
• Actual cities around the world
• Affiliations can have either actual (e.g. Al Qaeda) or fictitious (e.g. Augusta Gang) names

Travels
• Does not specify the transportation modes for travels
• Tracks people at the city level (data does not tell us whether a person was seen at a restaurant)
• No more than one travel event in one day

Phone Calls
• Simplified communications among people
• Access to both landline and mobile phone numbers
• Mobile number is “owned” by only one person
• Landline phone number may be used by a number of people
• Two phone calls originating from the same phone number will not overlap in time (no guarantee that a phone number could not be a receiver and a caller at the same time)

Credit Card Transactions
• Types of data that people are likely to find in their own monthly statements: date, amount, company, and industry
• Transactions occur in the same city as a person’s current location
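Two of these rules lend themselves to automated checks on a generated dataset: at most one travel event per person per day, and no overlapping calls from the same originating number. The Python sketch below is one possible way to check them, not the SDG's own validation. The tuple layouts, the Call class, and the function names are hypothetical; the phone numbers come from the sample bio, while the dates and durations are made up.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import NamedTuple


class Call(NamedTuple):
    from_number: str
    start: datetime
    duration_seconds: int

    @property
    def end(self):
        return self.start + timedelta(seconds=self.duration_seconds)


def multi_travel_days(travels):
    """travels: (person_id, from_city_id, to_city_id, date) tuples.
    Return the (person_id, date) pairs with more than one travel event."""
    counts = Counter((person, day) for person, _src, _dst, day in travels)
    return sorted(key for key, n in counts.items() if n > 1)


def overlapping_calls(calls):
    """Flag each call that starts before an earlier call from the same
    originating number has ended. Returns (earlier, later) pairs."""
    by_number = {}
    for call in calls:
        by_number.setdefault(call.from_number, []).append(call)
    overlaps = []
    for number_calls in by_number.values():
        latest = None  # call with the latest end time seen so far
        for cur in sorted(number_calls, key=lambda c: c.start):
            if latest is not None and cur.start < latest.end:
                overlaps.append((latest, cur))
            if latest is None or cur.end > latest.end:
                latest = cur
    return overlaps


travels = [(1, 1, 2, "2009-12-15"), (1, 2, 1, "2009-12-15"), (2, 1, 2, "2009-12-15")]
calls = [
    Call("517-513-5528", datetime(2009, 12, 15, 9, 0), 600),
    Call("517-513-5528", datetime(2009, 12, 15, 9, 5), 60),
    Call("517-567-1171", datetime(2009, 12, 15, 9, 5), 60),
]
travel_violations = multi_travel_days(travels)
call_violations = overlapping_calls(calls)
```

Both functions return an empty list for a dataset that satisfies the rule, so they can be used as simple pass/fail checks after generation.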