Protecting Information in the NIEM Lifecycle Using Synthetic Data
15 December 2009
Nykia Jackson, Barbara Shapter, Kim Sterret-Day
Johns Hopkins University Applied Physics Laboratory
Barbara.shapter@jhuapl.edu
NIEM Blue Team Tools Day
Introduction
JHU/APL for DHS S&T CCI in support of DHS EDMO
Objective: to help EDMO and the NIEM community
• By learning of needs and gaps
• By exploring technologies that fill critical gaps in the NIEM lifecycles of model management, IEPD development, and implementation support
Filling a Gap
Objective:
• To develop a proof-of-concept capability to create test data for an information exchange using synthetic data
At the practitioner level:
• Need a tool to aid the testing of an implementation by generating “safe” test data
Vision for Solution
[Figure: diagram with elements labeled Synthetic Data and SYNINGEN]
Design
• Proof-of-concept for one IEPD
• Synthetic Instance Generator (SYNINGEN)
  • Synthetic data source
  • Embedded database
  • Dynamic insertion of “controlled” erroneous data
[Figure: design diagram with elements labeled SYNINGEN, Synthetic Data, Test Records, Schema Binding, IEPD Schemas, and Pre-process]
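The idea of inserting "controlled" erroneous data can be sketched as follows. This is a minimal illustration, not the SYNINGEN implementation: the dictionary record layout, the `inject_errors` helper, the `CORRUPTIONS` table, and the specific bad values are all assumptions made for the example, and the license number and state shown are made-up placeholders.

```python
import copy
import random

# Hypothetical corruptions: each maps a valid value to a deliberately bad one.
CORRUPTIONS = {
    "DriverLicenseState": lambda value: "ZZ",       # not a real state code
    "DOB": lambda value: "2999-01-01",              # date in the future
    "DriverLicenseNumber": lambda value: "",        # required value missing
}


def inject_errors(record, error_rate, rng):
    """Return (bad_record, corrupted_fields); the input record is not modified."""
    bad = copy.deepcopy(record)
    corrupted = []
    for field, corrupt in CORRUPTIONS.items():
        if field in bad and rng.random() < error_rate:
            bad[field] = corrupt(bad[field])
            corrupted.append(field)
    return bad, corrupted


good = {
    "GivenName": "Eladio",
    "FamilyName": "Berstis",
    "DOB": "1974-06-08",
    "DriverLicenseNumber": "X000000000",   # made-up placeholder value
    "DriverLicenseState": "MI",            # assumed for illustration
}

rng = random.Random(42)  # fixed seed so a test run is repeatable
bad, corrupted = inject_errors(good, error_rate=1.0, rng=rng)
```

Returning the list of corrupted fields alongside the record is what keeps the errors "controlled" in this sketch: the tester knows exactly which values the implementation under test is expected to reject.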
IEPD Selection
Requirements
• Consists of a cross-section of commonly used data fields
• Contains a minimal amount of domain-specific data, so that the capabilities of the previously developed synthetic data generator can be used
• Provides a concrete method for assessing IEPD implementation and test data via development of a web service (WS)
Selection: CONNECT Driver License Search IEPD
• Designed to facilitate the effective exchange of criminal justice information among the CONNECT states (Alabama, Nebraska, Wyoming, Tennessee, and Kansas)
• Defines the driver license search parameters, driver license search results (summary), and driver license details
Demonstration
• Generate test data
  • Good values
  • “Bad” values
• Test web service
  • Client
  • Web service
[Figure: demonstration diagram with elements labeled Driver License Query Client, SYNINGEN, Synthetic Data, Test Records, Schema Binding, IEPD Schemas, and Pre-process]
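A test run of this kind, in which good and "bad" values are sent to a service and the responses are compared with expectations, might look like the sketch below. Everything here is hypothetical: `search_service` is a stand-in for the real web service, its validation rules (including accepting only the five CONNECT states) are assumptions, and the license numbers are made-up placeholders.

```python
import re

# Postal codes for the CONNECT states named in the IEPD Selection slide.
VALID_STATES = {"AL", "NE", "WY", "TN", "KS"}


def search_service(request):
    """Stand-in for the web service under test (hypothetical validation rules)."""
    if request.get("DriverLicenseState") not in VALID_STATES:
        return {"status": "rejected", "reason": "unknown state"}
    if not re.fullmatch(r"[A-Z0-9]{1,20}", request.get("DriverLicenseNumber", "")):
        return {"status": "rejected", "reason": "malformed license number"}
    return {"status": "accepted"}


def run_cases(cases):
    """Send each (request, expected_status) case and collect mismatches."""
    failures = []
    for request, expected in cases:
        actual = search_service(request)["status"]
        if actual != expected:
            failures.append((request, expected, actual))
    return failures


cases = [
    ({"DriverLicenseState": "KS", "DriverLicenseNumber": "K00000000"}, "accepted"),
    ({"DriverLicenseState": "ZZ", "DriverLicenseNumber": "K00000000"}, "rejected"),
    ({"DriverLicenseState": "KS", "DriverLicenseNumber": ""}, "rejected"),
]
failures = run_cases(cases)
```

An empty `failures` list means the service accepted every good request and rejected every bad one.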
Synthetic Data Generation
[Figure: diagram with elements labeled Synthetic Data and SYNINGEN]
Why Use Synthetic Data?
• Need for large-scale, high-quality synthetic datasets to support DHS Test and Evaluation (T&E) activities
  • Designing
  • Modeling
  • Testing (including usability)
  • Training
  • Tool studies
• Need poses privacy protection challenges due to lack of access to actual data, i.e., Personally Identifiable Information (PII), and other access limitations
• Four possible data methods available to address data needs (with limitations)
  • Use of actual data
  • Sanitized or anonymized data
  • Manually created fictitious data
  • Machine-generated large-scale datasets from real-world models, algorithms, or reference statistical patterns
Synthetic Data Generator (SDG)
• Synthetic data: datasets composed entirely of fictitious data that can be used in a given context (or situation) instead of directly measurable or accessible actual data
• Prototype developed for the Department of Homeland Security (DHS) Science and Technology (S&T) Command, Control, and Interoperability (CCI) Division
• Automatic capability to produce robust datasets composed of entities (e.g., people with behaviors or “footprints” over time)
  • Creates synthetic test data that models a community with highly connected social networks of entities and relationships
  • Data reflects typical daily activities in which people travel, communicate, and spend money in ways that are normally expected in a reasonable world
• Datasets are in simple delimited text format
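Because the datasets are simple delimited text, they can be read and written with standard tooling. The sketch below assumes a pipe delimiter and a header row. The slides do not state which delimiter or column layout SDG actually uses, and the sample rows are invented for illustration.

```python
import csv
import io

# Assumed layout: pipe-delimited with a header row. The actual SDG delimiter
# and column order are not stated in the slides.
SAMPLE = (
    "PersonID|FromCityID|ToCityID|Date\n"
    "1|10|20|2009-01-05\n"
    "1|20|10|2009-01-09\n"
)


def read_delimited(text, delimiter="|"):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def write_delimited(rows, fieldnames, delimiter="|"):
    """Serialize a list of dicts back to delimited text."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


travel = read_delimited(SAMPLE)
round_trip = write_delimited(travel, ["PersonID", "FromCityID", "ToCityID", "Date"])
```

All values come back as strings, so any numeric or date interpretation is left to the caller.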
SDG Data
• JHU/APL developed the concept and rules that characterize the “reasonable world”
• Categories of interest
  • Demographics (including immigration)
  • Social networks
  • Communication patterns
  • Travel
  • Financial transactions (including consumer spending)
• Produce datasets that are consistent in time and space
[Figure: entity-relationship diagram with nodes labeled Person, City, Credit Card Transaction, Credit Card Number, Phone Number, and Call Transcript, and edges labeled travels, purchases, communicates using, caller, and receiver]
Current Synthetic Fields
PERSON …: PersonID, BinaryBase64Object, BinaryDescriptionText, BinaryFormatID, BinaryFormtStandardText, BinaryCategoryText, GivenName, FamilyName, MiddleName, Suffix, Citizenship, PassportNumber, DriverLicenseNumber, DriverLicenseState, DriverLicenseExpirationDate, DriverLicenseIssueDate, DOB
PHONE_NUMBER: PersonID, Type (Landline or Mobile), Number
CREDIT_CARD_TRANSACTION: PersonID, TransactionNumber, CreditCardNumber, PurchaseCity, Date, Amount, Company, Industry
PERSON: EthnicityCode, EthnicityText, EyeColorCode, EyeColorText, GenderCode, GenderText, HairColorCode, HairColorText, HeightInches, WeightPounds, AddressStreetNumber, AddressStreetName, AddressCity, AddressCounty*, AddressState, AddressPostalCode, AddressPostalExtensionCode
PHONE_CALL: PersonID, Date, DurationSeconds, Type, FromCityID, FromNumber, ToCityID, ToNumber
TRAVEL: PersonID, FromCityID, ToCityID, Date
CITY: CityID, City, State, Region, Country
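Since most of these tables carry a PersonID, one way to work with them is to load rows into typed records and group them by that key. The sketch below models only a few of the listed fields. The class names, the `numbers_by_person` helper, and the PersonID value are assumptions for illustration; the name and the two landline numbers are taken from the sample bio in this deck.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    person_id: int
    given_name: str
    family_name: str


@dataclass(frozen=True)
class PhoneNumber:
    person_id: int
    line_type: str   # "Landline" or "Mobile"
    number: str


def numbers_by_person(phone_numbers):
    """Group PHONE_NUMBER rows by PersonID, the key shared across tables."""
    grouped = defaultdict(list)
    for row in phone_numbers:
        grouped[row.person_id].append(row)
    return grouped


people = [Person(1, "Eladio", "Berstis")]  # PersonID 1 is a made-up value
phones = [
    PhoneNumber(1, "Landline", "517-513-5528"),
    PhoneNumber(1, "Landline", "517-567-1171"),
]

index = numbers_by_person(phones)
summary = {
    f"{p.given_name} {p.family_name}": [n.number for n in index[p.person_id]]
    for p in people
}
```

The same grouping pattern would apply to the other PersonID-keyed tables, such as TRAVEL or PHONE_CALL.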
Synthetic Data Sample
Bio:
Eladio Berstis, a USA citizen, lives in Lansing, Michigan, and was born on June 8, 1974. He subscribes to two landline telephone numbers: 517-513-5528 and 517-567-1171. He shares these numbers with family members who live with him. Eladio has a relative, Soto Berstis, who lives in Providence, Rhode Island. Eladio calls Soto regularly. Eladio owns two MasterCard credit cards.
Utility
• Developed prototype web portal interface to SDG
  • User specifies characteristic attributes for a dataset through this interface
  • Has been extended to generate other domain-specific data
• Applications
  • North American Threat (NAT) Dataset for intelligence analysis
  • Privacy Protection Technology
  • NIEM Test Data
  • Suspicious Activity Reports (SARs)
• Datasets have been generated and distributed to research institutions and agencies
Feedback
• CY2010 Q1: Delivery of SYNINGEN software to DHS EDMO
  • Independent of IEPD
• Definition of fields desired in a synthetic dataset for NIEM
  • What is useful?
  • What level of fidelity is desired for the “reasonable world”?
• Feedback welcomed
  • Barbara.shapter@jhuapl.edu
  • Nykia.Jackson@jhuapl.edu
Backup Slides
How reasonable is the data?
Names
• People in the same family tend to share the same family name
• Western naming convention of a single first name and single last name
• Many companies do not have realistic names
• Actual cities around the world
• Affiliations can have either actual (e.g. Al Qaeda) or fictitious (e.g. Augusta Gang) names
Travels
• Does not specify the transportation modes for travels
• Tracks people at the city level (data does not tell us whether a person was seen at a restaurant)
• No more than one travel event in one day
Phone Calls
• Simplified communications among people
• Access to both landline and mobile phone numbers
• Mobile number is “owned” by only one person
• Landline phone number may be used by a number of people
• Two phone calls originating from the same phone number will not overlap in time (no guarantee that a phone number could not be a receiver and a caller at the same time)
Credit Card Transactions
• Types of data that people are likely to find in their own monthly statements: date, amount, company, and industry
• Transactions occur in the same city as a person’s current location
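Several of these rules are mechanical enough to check automatically. The sketch below tests two of them: at most one travel event per person per day, and no overlapping calls originating from the same phone number. The tuple layouts, function names, and sample values are assumptions for illustration; the slides do not describe how SDG itself enforces the rules.

```python
from collections import Counter
from datetime import datetime, timedelta


def travel_violations(travels):
    """Return sorted (person_id, date) pairs that have more than one travel event.

    `travels` is an iterable of (person_id, from_city_id, to_city_id, date).
    """
    counts = Counter((person_id, date) for person_id, _, _, date in travels)
    return sorted(key for key, n in counts.items() if n > 1)


def end_time(call):
    """End of a call given as (from_number, start, duration_seconds)."""
    return call[1] + timedelta(seconds=call[2])


def overlapping_calls(calls):
    """Return (earlier, later) pairs where a call starts before an earlier
    call from the same originating number has ended.

    Each offending call is reported once, paired with the earlier call
    that ends latest.
    """
    by_number = {}
    for call in calls:
        by_number.setdefault(call[0], []).append(call)

    overlaps = []
    for number_calls in by_number.values():
        number_calls.sort(key=lambda c: c[1])
        latest = None  # call with the latest end time seen so far
        for call in number_calls:
            if latest is not None and call[1] < end_time(latest):
                overlaps.append((latest, call))
            if latest is None or end_time(call) > end_time(latest):
                latest = call
    return overlaps


t0 = datetime(2009, 12, 15, 9, 0, 0)
calls = [
    ("517-513-5528", t0, 600),
    ("517-513-5528", t0 + timedelta(minutes=5), 60),    # starts mid-call
    ("517-567-1171", t0, 600),
    ("517-567-1171", t0 + timedelta(minutes=10), 60),   # starts as the first ends
]
travels = [
    (1, 10, 20, "2009-12-15"),
    (1, 20, 10, "2009-12-15"),   # second trip on the same day
    (1, 10, 30, "2009-12-16"),
]

bad_calls = overlapping_calls(calls)
bad_travel = travel_violations(travels)
```

A call that starts at the exact moment the previous one ends is treated as non-overlapping here, which is a design choice of this sketch rather than something the slides specify.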