Synthetic Data Generation

Embed Size (px)

Citation preview

  • 8/3/2019 Synthetic Data Generation

    1/6

    Page 1 of 6

    SYNTHETIC DATA GENERATION CAPABILTIES FOR TESTING DATA MINING TOOLS

    Daniel R. Jeske Behrokh Samadi

    Pengyue J. Lin Lucent TechnologiesCarlos Rendn [email protected]

    Rui XiaoUniversity of California, Riverside

    [email protected]

    ABSTRACT

    Recently, due to commercial success of data mining

    tools, there has been much attention to extractinghidden information from large databases to predict

    security problems and terrorist threats. The securityapplications are somewhat more complicated thancommercial applications due to (i) lack of sufficient specific knowledge on what to look for, (ii) R&D

    labs developing these tools are not able to easilyobtain sensitive information due to security, privacyor cost issues. Tools developed for securityapplications require substantially more testing and

    revisions in order to prevent costly errors. This paper describes a platform for the generation ofrealistic synthetic data that can facilitate the

    development and testing of data mining tools. The

    original applications for this platform were peopleinformation and credit card transaction data sets. Inthis paper, we introduce a new shipping container

    application that can support the testing of datamining tools developed for port security.

    KEYWORDS

    Knowledge Discovery and Data Mining, SyntheticData Generation, Semantic Graphs, ShippingContainer.

    INTRODUCTION

    Knowledge discovery and data mining (KDD)includes the technology of extracting unknown andpossibly useful information from data. This processhas been compared to finding a needle in a haystack.In spite of apparent complexity, KDD technologyhas shown to be successful in some commercial

    applications such as fraud prevention and medicaldiagnosis. KDD is a powerful technology with great potential to help focus attention on the mostimportant information in data warehouses. Datamining tools predict future trends and behaviors,

    allowing businesses to make proactive, knowledge-driven decisions. Recently, due to this commercialsuccess, there has been much attention paid toextraction of hidden information from largedatabases to predict security problems and terroristthreats. The security applications are somewhatmore complicated than the commercial applicationsdue to (1) lack of sufficient specific knowledge onwhat to look for, and (2) R&D labs developing thesetools are not able to easily obtain sensitiveinformation due to security, privacy or cost issues.KDD tools developed for security applications

    require substantially more testing and revisions ofrules in order to minimize the false positives andfalse negatives that could be very costly.

    The combination of (1) and (2) motivated us todevelop a platform for the generation of realisticsynthetic data that can facilitate the development andtesting of KDD tools. Realistic synthetic data canserve as background data sets into whichhypothetical future scenarios can be overlaid. KDDtools can then be measured in terms of their false positive and false negative error rates. In addition,

    the availability of synthetic data sets providesnecessary traction for new data mining ideas andapproaches, and thereby facilitates the developmentand feasibility assessment of techniques that mightotherwise die on vine.

    To be adequate substitutes for real data, the qualityof synthetic data sets needs to be reasonable. Pitfalls

  • 8/3/2019 Synthetic Data Generation

    2/6

    Page 2 of 6

    associated with unrepresentative data includeimproper training of data mining tools and maskingof true benefits and virtues associated with specifictechniques. Some synthetic data generation tools areavailable commercially [1]. To varying degrees,each tool comes with a pre-defined set of attributes

    whose values are available from built-in lists. Forexample, lists could include names, addresses,occupations, etc. None of the tools we are aware of preserve complex between-attribute relationships,but instead simply generate attributes as though theyare independent. Some of the commercial toolsallow integration of user-customized data sets.However, the practical issue of obtaining the datasets to be integrated with these tools remains anobstacle. To support various KDD tools, the systemarchitecture should be portable to different platforms, scaleable for simple to complex

    applications, flexible to integrate new applications,industrially compatible in data output formats, andhighly usable foroff-the-shelfandpower users.

    In previous work [2, 3] we developed a prototype ofan IDAS Dataset Generator (IDSG). The initial prototype was supported by a Technical SupportWorking Group (TSGW) that identifies, prioritizesand coordinates R&D requirements for combatingterrorism [4]. The IDSG platform uses semanticgraphs to represent relationships among dataattributes and to define data generation rules for

    specific data set applications. Some attributes aregenerated by statistical algorithms, while othersfollow rules-based algorithms. The platform allowsthe development of customized applications to servespecific user needs. The applications developed forthe prototype includedpeople information and creditcard transactions.

    The rest of this paper is organized as follows. In thenext Section we review the applications, features,and algorithms associated with the prototype IDSG by walking through a typical user interaction with

    the tool. Though not all features will be illustrated,the main functions of IDSG will become evident.Following that, we introduce a new shippingcontainer application. The shipping containerapplication poses a different challenge since unlikepeople information and creditcard transactions, itsdata attributes have not been the subject ofnumerous studies and relatively little a-priori

    information is available to persons not directlyworking in the shipping container domain. Forexample, names, addresses, social security numbersand even relationships such as the associationbetween income and education level can be found in public sources. Similarly rules for valid generation

    of credit card numbers can be found. However, thenature of the contents of shipping containers, therelationships between carriers, originating countries,arriving ports and commodity descriptions is notcommon knowledge and is difficult to ascertainthrough web surfing.

    TOUR OF IDSG

    The prototype IDSG is based on the client-servercomputing model. Whether the client and the serverexecute on the same machine or not is transparent tothe system. The role of the client is to allow the user

    to select an application type and specifyrequirements for the data sets they need.Specifically, the user constructs a schema for howthe outputted data sets should be organized byspecifying how many files of data they need andwhat attributes should be attached to each file.Figure 1 is a screen shot of the opening screen wherethe user selects the application they wish to generatesynthetic data for, and Figure 2 is the screen thatallows them to construct files for the output and toselect the attributes (using pull-down menus) that areto appear in each.

    Figure 1. IDSG Select Application

  • 8/3/2019 Synthetic Data Generation

    3/6

    Page 3 of 6

    Figure 2. IDSG Organizing Output Files

    In this example, the user selected the credit cardtransaction application, and created three outputfiles (names, numbers and transactions) with theattributes as shown. In total, the user has over 30different attributes within this application that theycan optionally select for inclusion in the output files.

    The primary function of IDSG is to generatebackground data with the intent of looking like realdata. However, the user is offered an opportunity tooverlay pre-defined scenarios that can be randomlyinserted and mixed with the background data. Withthis feature, the resulting data sets can be seededwith events and competing KDD tools can be testedfor their ability to find the signal amongst noise. Forexample, a user might insert an unusually largenumber of credit card transactions for chemical purchases. A KDD tool that looks for anomalouscredit card purchases could then be tested by presenting it with the transactions file that isoutputted from IDSG. Figure 3 shows how a usercan mix in pre-defined scenarios with the syntheticdata that is generated by IDSG. The user simplyenters file names that have the same structure as theIDSG output files (names, numbers, andtransactions). The user also selects the insertionmode that determines whether to add the anomalousrecords at the beginning, at the end, or randomlythroughout the IDSG output files.

    At this point the user has completed therequirements specification on the data sets and thedata generation responsibility is passed to the server.

    The user can check on the status of the job byviewing the progress monitor that is shown in Figure4. The status of the job demo in this example isthat it has run 16 seconds and is 46% complete.

    Figure 3. IDSG Insertion of Anomalous Records

    When the job completes, it will move from the lowertable to the upper table in Figure 4. At that point, allof the requested files are available to the user and are presented individually as CSV files. Early in theinteraction between the user and IDSG, the userwould have specified a minimum number of cases to be generated in each file. The server packages allthe output CSV files into a zipped file that the user

    can download from the server to their client by usingthe Destination Directory and Browse featuresshown in Figure 4.

    Figure 4. IDSG Download the Datasets

  • 8/3/2019 Synthetic Data Generation

    4/6

    Page 4 of 6

    Figure 5 illustrates the data that was created for thenames file. The numbers file would similarly showall the user requested numbers (telephone, socialsecurity and drivers license) for each person in thenames file. Likewise, the transactions file wouldshow credit card purchases that were generated for

    each person including the date of purchase, type ofcredit card used, expense type and transactionamount.

    Figure 5. IDSG Output File Names

    The data generation algorithms used for the peopleand credit card transaction applications in IDSGhave been described in [2,3,5]. In summary, thereare three types of algorithms that get employed:

    statistical, rule-based, and resampling. Attributessuch as gender, age, income, occupation, educationlevel, number of credit cards, credit card usefrequency, transaction type and transaction amountare statistically associated. For example, a personwith higher education generally has higher incomeand therefore would generally have a higher numberof credit card purchases and have higher transactionamounts. IDSG generates these attributes byconstructing a 9-dimensional joint distribution forthese attributes from knowledge about lowerdimensional associations. In particular, a survey of

    5000 adults in the U.S. provided sufficient data toidentify 23 pairs of these attributes that hadsignificant statistical associations. Theseassociations are depicted as undirected links betweenthe attributes in the semantic graph shown in Figure6.

    Gender

    Expenditure

    Category

    Age Income

    Expenditure

    Amount

    Use

    FrequencyNumber of

    Credit Cards

    Occupation

    Education level

    1 (2)

    7 (6)

    9 (15)

    8 (13)

    6 (6)

    5 (5)

    4 (8) 3 (5)2 (6)

    Gender

    Expenditure

    Category

    Age Income

    Expenditure

    Amount

    Use

    FrequencyNumber of

    Credit Cards

    Occupation

    Education level

    1 (2)

    7 (6)

    9 (15)

    8 (13)

    6 (6)

    5 (5)

    4 (8) 3 (5)2 (6)

    Figure 6. Semantic Graph Depicting Pair wiseAssociations between Attributes

    The 9-dimensional joint distribution was then built by imposing that its corresponding two-waymarginal distributions match these observed

    distributions. The 9-dimensional distribution was fitto the data using the iterative proportional fittingalgorithm (IPF) [6]. This approach has twodesirable features. First, it reflects exactly theinformation that is available concerning associationsbetween these attributes nothing more and nothingless. Second, the number of inputted lowerdimensional marginals is a lever that is proportionalto the quality of the synthetic data that getsgenerated. As additional lower dimensionalinformation is input, the synthetic data becomesmore realistic. Furthermore, IPF algorithm can be

    easily adapted to include additional informationabout attribute associations as it becomes available.

    Rule-based algorithms are used for attributes thathave specified types of random patterns. A classicexample is how IDSG generates credit cardnumbers. American Express card numbers are 15digits long, while Visa, Diners Club, Discovery andMaster Card numbers are 16 digits long. Moreover,there are rules associated with the leading digits(e.g., Visa always begins with 4 and AMEX alwaysbegins with either 34 or 37) and the Luhn algorithm

    [6] is used to certify the entire string of digits is avalid card number. IDSG creates credit cardnumbers that adhere to all of these rules. Similarly,drivers licenses and plate numbers respectindividual state conventions.

    Finally, the third type of data generation algorithm isresampling which refers to generating values for an

  • 8/3/2019 Synthetic Data Generation

    5/6

    Page 5 of 6

    attribute by resampling form a large population ofreal values. The classic example in IDSG is thesocial security number. The Social Security DeathMaster Index (SSDI) is a publicly available data base [8] that lists the names, last known addressesand the social security number of over 70 million

    deceased persons. Resampling from among thesesocial security numbers ensures the fidelity of thesocial security numbers generated by IDSG. TheSSDI is also resampled to randomly generate asurname, and then a first name is randomly selectedfrom lists of common gender-based first names. Asanother example, the U.S. Postal Service maintains adatabase that connects a zip code to a city, state and NPA-NXX for nearly 500,000 cities in the U.S.IDSG resamples records from this database to ensurethe fidelity between these four attributes.

    SHIPPING CONTAINER APPLICATION

    Approximately 140 different countries and regionsdeliver merchandise to the U.S. throughapproximately 170 different ports of call. Our newapplication is concerned with generating bills oflading for this imported merchandise. Interest in thisapplication was motivated by our awareness of public concern over the security of U.S. ports andthe corresponding anticipation that the developmentof KDD tools in this area are likely underway. Asshown in Figure 6, a user can now generate synthetic

    data sets for this application.

    For the initial development of this application,resampling was used for the data generationalgorithm. The source data set that was resampledwas purchased from the Port Import ExportReporting Service (PIERS) [9]. PIERS maintains adatabase of bills of lading that covers all vesselsmaking calls between U.S. ports and foreigncountries. For this particular application, we areinterested only in the import data, which is the datapertaining to the vessels and cargos coming into the

    U.S. from foreign countries. A single shippingcontainer can contain more than one order andtherefore be mapped to more than one bill of lading.Similarly, a single merchandise order can spanmultiple shipping containers.

    Figure 6. IDSG Shipping Container Application

    The data set we purchased from PIERS covers theshipping containers on all incoming vessels during

    the months of June, July and August of 2004. Thenumber of bills of lading during these months is2,071,371. Each bill of lading has as many as 64different attributes. Table 1 shows a selected set ofthese attributes to provide a feel for the type of dataon a bill of lading.

    Selected PIERS Data Attributes

    Container Number Name of Vessel

    Commodity Description Carrier Name

    Origin Country Carrier Code

    Consignee Name U.S. Port Code

    Consignee Address U.S. Port NameArrival Date Cargo Volume (cubic

    ft)

    Bill of Lading Number Quantity of Cargo

    Notify Partys Name Weight of Cargo (Kg)

    Notify Partys Address Value ($)

    Table 1. Selected PIERS Attributes

    Since little a-priori information is availableconcerning attribute associations, a straight forwardresampling data generation algorithm has beenimplemented for the first version of this application.As a result, the data sets generated will have veryhigh fidelity. Similar to the other applications, theuser can select how they wish to organize theinformation in the bills of lading into different files.Figure 7 shows the same type of file definition andattribute selection features that was provided for theother applications.

  • 8/3/2019 Synthetic Data Generation

    6/6

    Page 6 of 6

    Just as with the other applications, a user can insertanomalous bills of lading using the IDSG scenarioinsertion feature. Once the file structure is definedfor the output files, the user is offered the screenshown in Figure 3 to optionally seed the generateddata sets with unusual records that KDD tools

    developers might hope to discover during testing.

    Figure 7. IDSG Shipping Container Attribute Fileand Attribute Definitions

    In future work on this application, learningalgorithms will be developed to extract the mostmeaningful attribute characteristics and relationshipsfrom the data, so that new bills of lading can beformed by appropriately mixing information fromdifferent PIERS records.

    SUMMARY

    We have described a design and application for ourtest data generation tool IDSG that can facilitatebuilding test cases for data mining tools by enablingthe data mining developer to overcome time, cost,organizational and legal issues associated withgathering real data to build test cases. Our semanticgraph with data dependency and scenario insertionapproach differentiates IDSG from other commercialsoftware. The shipping container data generationapplication demonstrated that the software designand architecture allow for the unlimited developmentof new applications without a change to the systeminfrastructure. Our approach and softwarearchitecture not only provide for off-the-shelfapplication users who want data sets of the typeIDSG already generates, but also give the computing

    infrastructure to develop a wizard for power userswho want to design customized application data setsthemselves by specifying their own semantic graph.

    REFERENCES

    [1] Turbo Data (turbodata.com), GS Data Generator

    (GSApps.com), DTM Data Generator (sqledit.com)and RowGen (iri.com)

    [2] Jeske, D. R., Behrokh Samadi, B., Lin, P., Ye,L., Cox, S., Xiao, R., Younglove, T., Ly, M., Holt,D., Rich, R. (2005), Generation of Synthetic DataSets for Evaluating the Accuracy of KnowledgeDiscovery Systems. Proceedings of the Eleventh ACM SIGKDD International Conference on

    Knowledge Discovery and Data Mining, pp. 756-763, August 21-24, 2005, Chicago, USA

    [3] Lin, P., Behrokh Samadi, Jeske, D. R., Cox, S.,Rendn, C., Holt, D., Cipolone, A., Xiao, R. (2006),Development of a Synthetic Data Set Generator forBuilding and Testing Information DiscoverySystems, Proceedings of The Third InternationalConference on Information Technology : NewGenerations, IEEE Computer Society, pp. 707-712,April 10-12, 2006, Las Vegas, USA

    [4] TSWG TASK IP-IA-2126

    [5] Jeske, D. R., Gokhale, D. V. and Ye, L. (2006)

    Generating Synthetic Data from Marginal Fitting forTesting the Efficacy of Data Mining Tools, International Journal of Production Research(Special Issue on Data Mining) Vol. 44, No. 14, pp.2711-2730.

    [6] Deming, W. E. and Stefan, F. F. (1940), On aLeast Squares Adjustment of a Sampled FrequencyTable When the Expected Marginal Totals areKnown,Annals of Mathematical Statistics, Vol. 11,pp. 27-44.

    [7] http://en.wikipedia.org/wiki/Luhn_formula

    [8] SSDI - http://ssdi.genealogy.rootsweb.com

    [9] PIERS - http://www.piers.com/