
Web Services for Data Mining


2nd International Workshop on Data Mining Standards, Services and Platforms

Workshop Chair: Robert Grossman, Univ. of IL at Chicago & Open Data Partners

Organizing Committee:
Robert Chu, SAS
Mark Hornick, Oracle
Dustin Hux, Elder Research, Inc.
Dave Selinger, Amazon.com
Zhaohui Tang, Microsoft
Kurt Thearling, Capital One

August 22, 2004, Seattle, USA


Proceedings of the Second Annual Workshop on Data Mining Standards, Services and Platforms

KDD 2004

August 22, 2004

Seattle, WA

Edited by Robert Grossman

University of Illinois at Chicago & Open Data Partners


Table of Contents

Affiliations ..... Page 4
Preface ..... Page 5

Standards
Toward Standardization in Privacy-Preserving Data Mining, by Stanley R. M. Oliveira and Osmar Zaïane ..... Page 7
An Overview of PMML Version 3.0, by Stefan Raspl ..... Page 18
Java™ Data Mining (JSR-73): Status and Overview, by Mark Hornick, Hankil Yoon and Sunil Venkayala ..... Page 23

Services
Web Services Standards for Data Mining, by Robert Chu ..... Page 31
Experimental Studies Scaling Web Services for Data Mining Using Open DMIX: Preliminary Results, by Robert Grossman and David Hanley ..... Page 38

Platforms
Distributed Scoring Using PMML, by Bill Hosken and Bernard Scherer ..... Page 47
A Simple Strategy for Composing Data Mining Operations, by Robert Grossman and Gregor Meyer ..... Page 48


Affiliations

Robert Chu, SAS
Robert Grossman, University of Illinois at Chicago & Open Data Partners
David Hanley, National Center for Data Mining, University of Illinois at Chicago
Mark Hornick, Hankil Yoon and Sunil Venkayala, Oracle
Bill Hosken, SPSS
Gregor Meyer, IBM
Stanley R. M. Oliveira, University of Alberta and Embrapa Informática Agropecuária
Stefan Raspl, IBM
Bernard Scherer, SPSS
Osmar R. Zaïane, University of Alberta


Preface

This year marks the fourth year that there has been a KDD workshop on the Predictive Model Markup Language (PMML) and related areas, and the second year of a broader workshop with the theme of Data Mining Standards, Services and Platforms.

It is perhaps useful to think of the role played by the relational model and the standard infrastructure provided by relational databases in the theory and practice of databases. In this sense, the field of data mining is still very far from having either such a theory or such a standard infrastructure. From another perspective, however, one of the goals of PMML was to create a standard interface between producers of models, such as statistical or data mining systems, and consumers of models, such as scoring engines, applications containing embedded models, and other operational systems. There are now quite a few vendors shipping scoring engines, which is an important measure of success in this area.

For the past several years, the developers of PMML have been working to create a similar mechanism so that the transformations and compositions required in data processing, which are so essential to data mining, can be similarly encapsulated. This is one of the themes of this year's workshop. As a standard architecture for scoring and a standard architecture for data preparation emerge, we move one step closer to a standard infrastructure for data mining.

Data mining started as a stand-alone application; more recently it has been embedded in databases, and distributed Java-based architectures have been developed. Another theme in these proceedings is the maturation of service-based architectures for data mining to complement these other approaches.

Finally, the importance of privacy-preserving data mining has grown enormously during the past few years. A third theme in this issue is beginning the process of developing standards in this area.

The Editor


Part 1.

Data Mining Standards


Toward Standardization in Privacy-Preserving Data Mining

Stanley R. M. Oliveira and Osmar R. Zaïane

Abstract. Issues about privacy-preserving data mining (PPDM) have emerged globally, and the recent proliferation of PPDM techniques is evident. Motivated by this growing number of successful techniques, the field is now moving toward standardization, which will certainly play an important role in the future of PPDM. In this paper, we lay out what needs to be done and take some steps toward proposing such standardization. First, we describe the problems we face in defining what information is private in data mining, and discuss how privacy can be violated in data mining. We also define privacy preservation in data mining based on users' personal information and information concerning their collective activity. Second, we analyze the implications of the Organization for Economic Cooperation and Development (OECD) data privacy principles in the context of data mining and suggest some policies for PPDM based on such principles. Finally, we propose some requirements to guide the development and deployment of technical solutions.

1. Introduction

The debate on PPDM has received special attention as data mining has been widely adopted by public and private organizations. We have witnessed three major landmarks that characterize the progress and success of this new research area: the conceptive landmark, the deployment landmark, and the prospective landmark. We describe these landmarks as follows:

• The Conceptive landmark characterizes the period in which central figures in the community, such as O'Leary [14, 15], Fayyad, Piatetsky-Shapiro and Smyth [8, 16], and others [12, 5], investigated the success of knowledge discovery and some of the important areas where it can conflict with privacy concerns. The key finding was that knowledge discovery can open new threats to informational privacy and information security if not done or used properly. Since then, the debate on PPDM has gained momentum.

• The Deployment landmark is the current period in which an increasing number of PPDM techniques have been developed and have been published in refereed conferences. The information available today is spread over countless papers and conference proceedings1. The results achieved in recent years are promising and suggest that PPDM will achieve the goals that have been set for it.

• The Prospective landmark is a new period in which directed efforts toward standardization occur. At this stage, there is no consensus about what privacy preservation means in data mining. In addition, there is no consensus on privacy principles, policies, and requirements as a foundation for the development and deployment of new PPDM techniques. The excessive number of techniques is leading to confusion among developers, practitioners, and others interested in this technology. One of the most important challenges in PPDM now is to establish the groundwork for further research and development in this area.

1 The Privacy-Preserving Data Mining: http://www.cs.ualberta.ca/~oliveira/psdm/psdm index.html


Currently, one of the most important challenges in PPDM is to put forward standardization issues, because they will play a significant role in the future of this new area. In this paper, we lay out what needs to be done and take some steps toward proposing such standardization. Our contributions in this paper can be summarized as follows: a) we describe the problems we face in defining what information is private in data mining, and discuss how privacy can be violated in data mining; b) we define privacy preservation in data mining based on users' personal information and information concerning their collective activity; c) we describe the general parameters for characterizing scenarios in PPDM; d) we analyze the implications of the Organization for Economic Cooperation and Development (OECD) data privacy principles in knowledge discovery; e) we suggest some policies for PPDM based on instruments accepted world-wide; and f) we propose some requirements for the development of technical solutions and to guide the deployment of new technical solutions.

The effort described in this paper is by no means meant to be complete and comprehensive. Rather, our primary goal is to stir up the discussion on consensus about definition, requirements, principles and policies in PPDM. We argue that this line of work will eventually lead to standardization in PPDM.

This paper is organized as follows. In Section 2, we describe the problems we face in defining privacy for data mining. In Section 3, we describe some issues related to PPDM, such as privacy violation and privacy definitions. In Section 4, we analyze the OECD principles in the context of data mining and suggest some policies for PPDM based on instruments accepted worldwide. In Section 5, we propose some privacy requirements for the development and deployment of technical solutions. Related work is reviewed in Section 6. Finally, Section 7 presents our conclusions.

2. Problems in Defining Privacy

Analyzing what the right to privacy means is fraught with problems, such as the exact definition of privacy, whether it constitutes a fundamental right, and whether people are and/or should be concerned with it. Several definitions of privacy have been given, and they vary according to context, culture, and environment. For instance, in an 1890 paper [22], Warren and Brandeis defined privacy as “the right to be let alone.” Later, in a paper published in 1967 [23], Westin defined privacy as “the desire of people to choose freely under what circumstances and to what extent they will expose themselves, their attitude, and their behavior to others.” Schoeman [20] defined privacy as “the right to determine what (personal) information is communicated to others” or “the control an individual has over information about himself or herself.” More recently, Garfinkel [9] stated that “privacy is about self-possession, autonomy, and integrity.” On the other hand, Rosenberg [18] argues that privacy may not be a right after all but a taste: “If privacy is in the end a matter of individual taste, then seeking a moral foundation for it beyond its role in making social institutions possible that we happen to prize will be no more fruitful than seeking a moral foundation for the taste for truffles."

The above definitions suggest that, in general, privacy is viewed as a social and cultural concept. However, with the ubiquity of computers and the emergence of the Web, privacy has also become a digital problem [17]. With the Web revolution and the emergence of data mining, privacy concerns have posed technical challenges fundamentally different from those that occurred before the information era. In the information technology era, privacy refers to the right of users to conceal their personal information and have some degree of control over the use of any personal information disclosed to others [6, 1, 10].

Clearly, the concept of privacy is often more complex than initially realized. In particular, in data mining, the definition of privacy preservation is still unclear, and there is very little literature related to this topic. A notable exception is the work presented in [3], in which PPDM is defined as “getting valid data mining results without learning the underlying data values.” However, at this point, each existing PPDM technique has its own privacy definition. Our primary concern about PPDM is that mining algorithms are analyzed for the side effects they incur in data privacy. Therefore, our definition of PPDM is close to those in [20, 3]: PPDM encompasses the dual goal of meeting privacy requirements and providing valid data mining results. Our definition emphasizes the dilemma of balancing privacy preservation and knowledge disclosure.

3. Privacy-Preserving Data Mining

3.1 Privacy Violation in Data Mining

Understanding privacy in data mining requires understanding how privacy can be violated and the possible means for preventing privacy violation. In general, one major factor contributes to privacy violation in data mining: data misuse.

Users' privacy can be violated in different ways and with different intentions. Although data mining can be extremely valuable in many applications (e.g., business, medical analysis, etc), it can also, in the absence of adequate safeguards, violate informational privacy. Privacy can be violated if personal data are used for other purposes subsequent to the original transaction between an individual and an organization when the information was collected.

One of the sources of privacy violation is called data magnets [17]. Data magnets are techniques and tools used to collect personal data. Examples of data magnets include explicitly collecting information through on-line registration, identifying users through IP addresses, requiring registration for software downloads, and indirectly collecting information for secondary usage. In many cases, users may not be aware that information is being collected or may not know how it is collected [7, 13]. Worse is the privacy invasion occasioned by secondary usage of data, when individuals are unaware of “behind the scenes” uses of data mining techniques [11]. In particular, personal data can be used for secondary purposes largely beyond the users' control and beyond the reach of privacy laws. This uncontrollable privacy violation is not because of data mining itself, but fundamentally because of the misuse of data.

3.2 Defining Privacy Preservation in Data Mining

In general, privacy preservation occurs in two major dimensions: users' personal information and information concerning their collective activity. We refer to the former as individual privacy preservation and the latter as collective privacy preservation, which is related to corporate privacy in [3].

• Individual privacy preservation: The primary goal of data privacy is the protection of personally identifiable information. In general, information is considered personally identifiable if it can be linked, directly or indirectly, to an individual person. Thus, when personal data are subjected to mining, the attribute values associated with individuals are private and must be protected from disclosure. Miners are then able to learn from global models rather than from the characteristics of a particular individual.

• Collective privacy preservation: Protecting personal data may not be enough. Sometimes, we may need to protect against revealing sensitive knowledge representing the activities of a group. We refer to the protection of sensitive knowledge as collective privacy preservation. The goal here is quite similar to the one for statistical databases, in which security control mechanisms provide aggregate information about groups (populations) and, at the same time, should prevent disclosure of confidential information about individuals. However, unlike statistical databases, another objective of collective privacy preservation is to preserve strategic patterns that are paramount for strategic decisions, rather than minimizing the distortion of all statistics (e.g., bias and precision). In other words, the goal here is not only to protect personally identifiable information but also some patterns and trends that are not supposed to be discovered.

In the case of collective privacy preservation, organizations have to cope with some interesting conflicts. For instance, when personal information undergoes analysis processes that produce new facts about users' shopping patterns, hobbies, or preferences, these facts could be used in recommender systems to predict or affect their future shopping patterns. In general, this scenario is beneficial to both users and organizations. However, when organizations share data in a collaborative project, the goal is not only to protect personally identifiable information but also to protect some strategic patterns. In the business world, such patterns are described as the knowledge that can provide competitive advantages and therefore must be protected [21]. More challenging is to protect the knowledge discovered from confidential information (e.g., medical, financial, and crime information). The absence of privacy safeguards can equally compromise individuals' privacy. While violation of individual privacy is clear, violation of collective privacy can also lead to violation of an individual's privacy.

3.3 Characterizing Scenarios in PPDM

Before describing the general parameters for characterizing scenarios in PPDM, let us consider two real-life examples where PPDM poses different constraints:

• Scenario 1: A hospital shares some data for research purposes (e.g., concerning a group of patients who have a similar disease). The hospital's security administrator may suppress some identifiers (e.g., name, address, phone number) from patient records to meet privacy requirements. However, the released data may not be fully protected. A patient record may contain other information that can be linked with other datasets to re-identify individuals or entities [19]; a toy sketch after these scenarios illustrates such linkage. How can we identify groups of patients with a similar disease without revealing the values of the attributes associated with them?

• Scenario 2: Two or more companies have a very large dataset of records on their customers' buying activities. These companies decide to cooperatively conduct association rule mining on their datasets for their mutual benefit, since this collaboration brings them an advantage over other competitors. However, some of these companies may not want to share some strategic patterns hidden within their own data (also called restrictive association rules) with the other parties. They would like to transform their data in such a way that these restrictive association rules cannot be discovered but others can be. Is it possible for these companies to benefit from such collaboration by sharing their data while preserving some restrictive association rules?
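To make the linkage risk in Scenario 1 concrete, here is a minimal Java sketch. The record layouts, field names, and data values are all hypothetical illustrations, not taken from the paper: it simply shows how a released table with direct identifiers removed can still be matched against a public dataset on quasi-identifiers such as ZIP code, birth date, and sex.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of the linkage risk in Scenario 1: the released patient table has
// direct identifiers (name, phone) suppressed, but quasi-identifiers remain and can be
// joined against a public dataset (e.g., a voter list) that also carries names.
public class LinkageSketch {
    public static void main(String[] args) {
        // Released record: zip, birth date, sex, diagnosis (no direct identifiers).
        List<String[]> released = List.of(
                new String[] {"53715", "1965-02-13", "F", "hypertension"});
        // Public record: zip, birth date, sex, name.
        List<String[]> publicData = List.of(
                new String[] {"53715", "1965-02-13", "F", "Alice Example"});

        // Index the public dataset by the quasi-identifier triple (zip, birth date, sex).
        Map<String, String> nameByQuasiId = new HashMap<>();
        for (String[] p : publicData) {
            nameByQuasiId.put(p[0] + "|" + p[1] + "|" + p[2], p[3]);
        }
        // Any released record whose quasi-identifiers match a unique public record
        // is re-identified, even though its direct identifiers were suppressed.
        for (String[] r : released) {
            String name = nameByQuasiId.get(r[0] + "|" + r[1] + "|" + r[2]);
            if (name != null) {
                System.out.println(name + " is linked to diagnosis: " + r[3]);
            }
        }
    }
}
```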

Note that the above scenarios describe different privacy preservation problems, and each poses its own set of challenges. For instance, Scenario 1 is a typical example of individual privacy preservation, while Scenario 2 refers to collective privacy preservation. How can we characterize scenarios in PPDM? One alternative is to describe them in terms of general parameters. In [4], some parameters are suggested:

• Outcome: Refers to the desired data mining results. For instance, someone may look for association rules identifying relationships among attributes, or relationships among customers' buying behaviors as in Scenario 2, or may even want to cluster data as in Scenario 1.

• Data Distribution: How are the data available for mining - are they centralized or distributed across many sites? In the case of data distributed throughout many sites, are the entities described with the same schema in all sites (horizontal partitions), or do different sites contain different attributes for one entity (vertical partitions)?

• Privacy Preservation: What are the privacy preservation requirements? If the concern is solely that values associated with an individual entity not be released (e.g., personal information), techniques must focus on protecting such information. In other cases, the notion of what constitutes “sensitive knowledge” may not be known in advance. This would lead to human evaluation of the intermediate results before making the data available for mining.

4. Principles and Policies for PPDM

4.1 The OECD Privacy Guidelines

Worldwide, privacy legislation, policies, guidelines, and codes of conduct have been derived from the set of principles established in 1980 by the OECD2. They represent the primary components for the protection of privacy and personal data, comprising a commonly understood reference point. A number of countries have adopted these principles as statutory law, in whole or in part. The OECD Privacy Guidelines outline the following basic principles:

1. Collection Limitation Principle: There should be limits to the collection of personal data and any such data should be obtained by lawful and fair means and, where appropriate, with the knowledge or consent of the data subject (consumer).

2. Data Quality Principle: Personal data should be relevant to the purposes for which they are to be used, and, to the extent necessary for those purposes, should be accurate, complete and up-to-date.

3. Purpose Specification Principle: The purposes for which personal data are collected should be specified not later than at the time of data collection and the subsequent use limited to the fulfillment of those purposes, or others that are not incompatible with those purposes, and as are specified on each occasion of change of purpose.

4. Use Limitation Principle: Personal data should not be disclosed, made available or otherwise used for purposes other than those specified in accordance with [the Purpose Specification Principle] except: (a) with the consent of the data subject; or (b) by the authority of law.

5. Security Safeguards Principle: Personal data should be protected by reasonable security safeguards against such risks as loss or unauthorized access, destruction, use, modification, or disclosure of data.

6. Openness Principle: There should be a general policy of openness about developments, practices, and policies with respect to personal data. Means should be readily available for establishing the existence and nature of personal data, and the main purposes of their use, as well as the identity and usual residence of the data controller (e.g., a public or a private organization).

2 Privacy Online - OECD Guidance on Policy and Practice. http://www.oecd.org/dataoecd/33/43/2096272.pdf


7. Individual Participation Principle: An individual should have the right: a) to obtain from a data controller, or otherwise, confirmation of whether or not the data controller has data relating to him; b) to have communicated to him, data relating to him within a reasonable time, at a charge, if any, that is not excessive; in a reasonable manner, and in a form that is readily intelligible to him; c) to be given reasons if a request made under subparagraphs (a) and (b) is denied, and to be able to challenge such denial; and d) to challenge data relating to him and, if the challenge is successful to have the data erased, rectified, completed, or amended.

8. Accountability Principle: A data controller should be accountable for complying with measures that give effect to the principles stated above.

4.2 The Implications of the OECD Privacy Guidelines in PPDM

We now analyze the implications of the OECD principles in PPDM. We then suggest which principles should be considered absolute principles in PPDM.

1. Collection Limitation Principle: This principle states that some very sensitive data should not be held at all. Collection limitation is too general in the data mining context, with two grave consequences: a) the notion of “very sensitive” is sometimes unclear and may differ from country to country, leading to vague definitions; and b) limiting the collection of data may make the data useless for knowledge discovery. Thus, this principle seems to be unenforceable in PPDM.

2. Data Quality Principle: This principle is related to the pre-processing stage in data mining, in which data cleaning routines are applied to resolve inaccuracies and inconsistencies. This principle is relevant in the preprocessing stage of knowledge discovery. However, most PPDM techniques assume that the data are already in an appropriate form to mine.

3. Purpose Specification Principle: This principle is the fundamental basis of privacy. Individuals should be informed of the purposes for which the information collected about them will be used, and the information must be used solely for that purpose. In other words, restraint should be exercised when personal data are collected. This principle is extremely relevant in PPDM.

4. Use Limitation Principle: This principle is closely related to the purpose specification principle. Use limitation is perhaps the most difficult principle to address in PPDM. This principle states that the purpose specified to the data subject (consumer) at the time of the collection restricts the use of the information collected, unless the data subject has provided consent for additional uses. This principle is also fundamental in PPDM.

5. Security Safeguards Principle: This principle is basically irrelevant in the case of data privacy, but relevant for database security. The security safeguards principle is typically concerned with keeping sensitive information (e.g., personal data) out of the hands of unauthorized users and ensuring that the data are not modified by users who do not have permission to do so. This principle is unenforceable in the context of PPDM.

6. Openness Principle: This principle, also called transparency, states that people have the right to know what data about them have been collected, who has access to the data, and how the data are being used. In other words, people must be aware of the conditions under which their information is being kept and used. However, data mining is not an open and transparent activity; requiring analysts to inform individuals about particular derived knowledge may inhibit the use of data. This principle is equally important in PPDM.


7. Individual Participation Principle: This principle suggests that data subjects should be able to challenge the existence of information gained through data mining applications. Since knowledge discovery is not openly apparent to data subjects, the data subjects are not aware of knowledge discoveries related to them. While one can debate whether collected individual information belongs to the individuals concerned, one can argue that collective information mined from databases belongs to the organizations that hold those databases. In this case, the implications of this principle for PPDM should be carefully weighed; otherwise, it could be too rigid in PPDM applications.

8. Accountability Principle: This principle states that data controllers should inform data subjects of the use and findings from knowledge discovery. In addition, data controllers should inform individuals about the policies regarding knowledge discovery activities, including the consequences of inappropriate use. Some countries (e.g., the UK, Japan, Canada) that have adopted the OECD privacy principles do not consider this principle since it is not limited in scope, area, or application. Thus, the accountability principle is too general for PPDM.

Our analysis above suggests that the OECD privacy principles can be categorized into three groups according to their influence on the context of PPDM:

• Group 1 is composed of those principles that should be considered as absolute principles in PPDM, such as Purpose Specification, Use Limitation, and Openness.

• Group 2 consists of some principles that somehow impact PPDM applications, and their full implications should be understood and carefully weighed depending on the context. The principles that fall into this category are Data Quality and Individual Participation.

• Group 3 encompasses some principles that are too general or unenforceable in PPDM. This group includes Collection Limitation, Security Safeguards, and Accountability.

Clearly, the principles categorized in Groups 1 and 2 are relevant in the context of PPDM and are fundamental for further research, development, and deployment of PPDM techniques.

4.3 Adopting PPDM Policies from the OECD Privacy Guidelines

One fundamental point to be considered when designing privacy policies is that too many restrictions could seriously hinder the normal functioning of business and governmental organizations. Even worse, perhaps, is that restrictions, if not carefully weighed, could make PPDM results useless.

Given these facts, we suggest some policies for PPDM based on the OECD privacy principles. We try to find a good compromise between privacy requirements and knowledge discovery. We describe the policies as follows:

1. Awareness Policy: When a data controller collects personally identifiable information, the data controller shall express why the data are collected and whether such data will be used for knowledge discovery.

2. Limit Retention Policy: A data controller shall take all reasonable steps to keep only personal information collected that is accurate, complete, and up to date. In the case of personal information that is no longer useful, it shall be removed and not subjected to analysis to avoid unnecessary risks, such as wrong decision-making, which may incur liability.

3. Forthcoming Policy: Policies regarding collecting, processing, and analyzing data that produce new knowledge about individuals shall be communicated to those to whom the discovered knowledge pertains, in particular when the discovered knowledge is to be disclosed or shared.

4. Disclosure Policy: Data controllers shall only disclose discovered knowledge about an individual for purposes to which the individual consents and the knowledge discovered about individuals shall never be disclosed inadvertently or without consent.

5. Requirements for PPDM

5.1 Requirements for the Development of Technical Solutions

Ideally, a technical solution for a PPDM scenario would enable us to enforce privacy safeguards and to control the sharing and use of personal data. However, such a solution raises some crucial questions:

What levels of effectiveness are in fact technologically possible and what corresponding regulatory measures are needed to achieve these levels?

What degrees of privacy and anonymity must be sacrificed to achieve valid data mining results?

These questions cannot have “yes-no” answers, but involve a range of technological possibilities and social choices. The worst response to such questions is to ignore them completely and not pursue the means by which we can eventually provide informed answers.

Technology alone cannot address all of the concerns surrounding PPDM scenarios [2]. The above questions can be to some extent addressed if we provide some key requirements to guide the development of technical solutions.

The following key words are used to specify the extent to which an item is a requirement for the development of technical solutions to address PPDM:

• Must: this word means that the item is an absolute requirement;

• Should: this word means that there may exist valid reasons not to treat this item as a requirement, but the full implications should be understood and the case carefully weighed before discarding this item.

1. Independence: A promising solution to the problem of PPDM for any specific data mining task (e.g., association rules, clustering, classification) should be independent of the particular mining algorithm used for that task.

2. Accuracy: When it is possible, an effective solution should do better than a trade-off between privacy and accuracy on the disclosure of data mining results. Sometimes a trade-off must be found as in scenario 2 in Section 3.3.

3. Privacy Level: This is also a fundamental requirement in PPDM. A technical solution must ensure that the mining process does not violate privacy up to a certain degree of security.

4. Attribute Heterogeneity: A technical solution for PPDM should handle heterogeneous attributes (e.g., categorical and numerical).

5. Versatility: A versatile solution to address the problem of PPDM should be applicable to different kinds of information repositories, i.e., the data could be centralized, or even distributed horizontally or vertically.


6. Communication Cost: When addressing data distributed across many sites, a technical solution should consider carefully issues of communication cost.

5.2 Requirements to Guide the Deployment of Technical Solutions

In the near future, information technology vendors will offer a variety of products that claim to help protect privacy in data mining. How can we evaluate and decide whether what is being offered is useful? The absence of proper instruments to evaluate the usefulness and feasibility of a solution to a PPDM scenario challenges us to identify the following requirements:

1. Privacy Identification: We should identify what information is private. Is the technical solution aiming at protecting individual privacy or collective privacy?

2. Privacy Standards: Does the technical solution comply with international instruments that state and enforce rules (e.g., principles and/or policies) for use of automated processing of private information?

3. Privacy Safeguards: Is it possible to record what has been done with private information and be transparent with individuals about whom the private information pertains?

4. Disclosure Limitation: Are there metrics to measure how much private information is disclosed? Since privacy has many meanings depending on the context, we may require a set of metrics to do so. What is most important is that we need to measure not only how much private information is disclosed, but also the impact of a technical solution on the data and on valid mining results.

5. Update Match: When a new technical solution is launched, two aspects should be considered: a) the solution should comply with existing privacy principles and policies; b) in case of modifications to privacy principles and/or policies that guide the development of technical solutions, any release should consider these new modifications.

6. Related Work

Data mining from a fair information practices perspective was first discussed in [15]. O'Leary studied the impact of the OECD guidelines on knowledge discovery. The key finding of this study was that the OECD guidelines could not anticipate or address many important issues regarding knowledge discovery, and thus several principles are too general or unenforceable. Our work here is orthogonal to [15]. We investigate the influence of the OECD principles in the context of PPDM, categorizing them into different groups of relevance. In particular, we show that the OECD guidelines are accepted worldwide and therefore represent the primary components for standardization in PPDM, and we discuss how the PPDM community could derive some principles and policies from the OECD guidelines.

More recently, Clifton et al. discussed the meaning of PPDM as a foundation for further research in this field [3]. That work introduces some definitions for PPDM and discusses some metrics for information disclosure in data mining. The work in [3] is complementary to our work. The primary goal of our work is to put forward standardization issues in PPDM. Our effort encompasses the design of privacy principles and policies, and requirements for the development and deployment of technical solutions for PPDM.


7. Conclusions

In this paper, we make some effort to establish the groundwork for further research in the area of Privacy-Preserving Data Mining (PPDM), and we put forward standardization issues in PPDM. Although the work described in this paper is preliminary and conceptual in nature, we argue that it is a vital prerequisite for standardization in PPDM.

Our primary goal in this work is to conceive a common framework for PPDM, notably in terms of definitions, principles, policies, and requirements. The advantages of a framework of that nature are: (a) a common framework will avoid confusing developers, practitioners, and many others interested in PPDM; (b) adoption of a common framework will inhibit inconsistent efforts in different ways, and will enable vendors and developers to make solid advances in the future in the PPDM area.

Our contributions in this paper can be summarized as follows: 1) we describe the problems we face in defining what information is private in data mining, and discuss how privacy can be violated in data mining; 2) we define privacy preservation in data mining based on users' personal information and information concerning their collective activity; 3) we describe the general parameters for characterizing scenarios in PPDM; 4) we analyze the implications of the Organization for Economic Cooperation and Development (OECD) data privacy principles in knowledge discovery; 5) we suggest some policies for PPDM based on instruments accepted world-wide; and 6) we propose some requirements for the development of technical solutions and to guide the deployment of new technical solutions.

Acknowledgments

Stanley Oliveira was partially supported by CNPq, Brazil, under grant No. 200077/00-7. Osmar Zaïane was partially supported by a research grant from NSERC, Canada.

References

1. M. Ackerman, L. Cranor, and J. Reagle. Privacy in E-Commerce: Examining User Scenarios and Privacy Preferences. In Proc. of the ACM Conference on Electronic Commerce, pages 1-8, Denver, Colorado, USA, November 1999.
2. R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. Hippocratic Databases. In Proc. of the 28th Conference on Very Large Data Bases, Hong Kong, China, August 2002.
3. C. Clifton, M. Kantarcioglu, and J. Vaidya. Defining Privacy for Data Mining. In Proc. of the National Science Foundation Workshop on Next Generation Data Mining, pages 126-133, Baltimore, MD, USA, November 2002.
4. C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu. Tools for Privacy Preserving Distributed Data Mining. SIGKDD Explorations, 4(2):28-34, 2002.
5. C. Clifton and D. Marks. Security and Privacy Implications of Data Mining. In Workshop on Data Mining and Knowledge Discovery, pages 15-19, 1996.
6. S. Cockcroft and P. Clutterbuck. Attitudes Towards Information Privacy. In Proc. of the 12th Australasian Conference on Information Systems, Australia, 2001.
7. M. J. Culnan. How Did They Get My Name?: An Exploratory Investigation of Consumer Attitudes Toward Secondary Information. MIS Quarterly, 17(3):341-363, September 1993.
8. U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.), pages 1-34, MIT Press, Cambridge, MA, 1996.
9. S. Garfinkel. Database Nation: The Death of Privacy in the 21st Century. O'Reilly & Associates, Sebastopol, CA, USA, 2001.
10. P. Jefferies. Multimedia, Cyberspace & Ethics. In Proc. of the International Conference on Information Visualisation (IV2000), pages 99-104, London, England, July 2000.
11. G. H. John. Behind-the-Scenes Data Mining. Newsletter of the ACM SIG on KDDM, 1(1):9-11, June 1999.
12. W. Klösgen. KDD: Public and Private Concerns. IEEE EXPERT, 10(2):55-57, April 1995.
13. K. C. Laudon. Markets and Privacy. Communications of the ACM, 39(9):92-104, September 1996.
14. D. E. O'Leary. Knowledge Discovery as a Threat to Database Security. In G. Piatetsky-Shapiro and W. J. Frawley (editors): Knowledge Discovery in Databases, pages 507-516, AAAI/MIT Press, Menlo Park, CA, 1991.
15. D. E. O'Leary. Some Privacy Issues in Knowledge Discovery: The OECD Personal Privacy Guidelines. IEEE EXPERT, 10(2):48-52, April 1995.
16. G. Piatetsky-Shapiro. Knowledge Discovery in Personal Data vs. Privacy: A Mini-Symposium. IEEE Expert, 10(2):46-47, 1995.
17. A. Rezgui, A. Bouguettaya, and M. Y. Eltoweissy. Privacy on the Web: Facts, Challenges, and Solutions. IEEE Security & Privacy, 1(6):40-49, Nov-Dec 2003.
18. A. Rosenberg. Privacy as a Matter of Taste and Right. In E. F. Paul, F. D. Miller, and J. Paul, editors, The Right to Privacy, pages 68-90, Cambridge University Press, 2000.
19. P. Samarati. Protecting Respondents' Identities in Microdata Release. IEEE Transactions on Knowledge and Data Engineering, 13(6):1010-1027, 2001.
20. F. D. Schoeman. Philosophical Dimensions of Privacy, Cambridge University Press, 1984.
21. E. Turban and J. E. Aronson. Decision Support Systems and Intelligent Systems. Prentice-Hall, New Jersey, USA, 2001.
22. S. D. Warren and L. D. Brandeis. The Right to Privacy. Harvard Law Review, 4(5):193-220, 1890.
23. A. F. Westin. Privacy and Freedom, Atheneum, 1967.


An Overview of PMML Version 3.0

Stefan Raspl

Abstract. This paper gives an overview of some of the changes in Version 3.0 of the Predictive Model Markup Language (PMML), which is expected to be released in 2004. PMML Version 3.0 adds several new models, including models for rule sets and text mining. It also adds the ability to compose certain data mining operations. For example, in PMML Version 3.0 the outputs of regression models can be used as the inputs to other models (model sequencing), and a decision tree or regression model can be used to combine the outputs of several embedded models (model selection).

1. Introduction

PMML is an application and system independent interchange format for statistical and data mining models. More precisely, the goal of PMML is to encapsulate a model in an application and system independent fashion so that two different applications (the PMML Producer and Consumer) can use it. PMML is developed by a vendor led working group, which is part of the Data Mining Group [1].

Here is a simple example: Assume that a data mining system can export PMML. Then a model developed by a statistician using the data mining system (the PMML Producer) can be exported so that a scoring system embedded in a CRM application (the PMML Consumer) can read the model and use it to score a list of prospects on the likelihood that they will respond to a mailing. The PMML Producer can be a Windows application, while the PMML Consumer can be a Linux application.

PMML 3.0, which is expected to be released in 2004, includes three new models and important changes to the infrastructure, including support for the composition of data mining operations [3].

Overview of PMML

Here is a quick overview of PMML following [2].

PMML consists of the following components:

1. Data Dictionary. The data dictionary defines the fields that are the inputs to models and specifies the type and value range for each field.

2. Mining Schema. Each model contains one mining schema, which lists the fields used in the model. These fields are a subset of the fields in the Data Dictionary. The mining schema contains information that is specific to a certain model, while the data dictionary contains data definitions that do not vary with the model. For example, the Mining Schema specifies the usage type of an attribute, which may be active (an input of the model), predicted (an output of the model), or supplementary (holding descriptive information and ignored by the model).

3. Transformation Dictionary. The Transformation Dictionary defines derived fields. Derived fields may be defined by normalization, which maps continuous or discrete values to numbers; by discretization, which maps continuous values to discrete values; by value mapping, which maps discrete values to discrete values; or by aggregation, which summarizes or collects groups of values, for example by computing averages.

4. Model Statistics. The Model Statistics component contains basic univariate statistics about the model, such as the minimum, maximum, mean, standard deviation, median, etc., of numerical attributes.

5. Model Parameters. PMML also specifies the actual parameters defining the statistical and data mining models. Models in PMML Version 3.0 include regression models, cluster models, trees, neural networks, Bayesian models, association rules, sequence models, support vector machines, rule sets, and text models.
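To make the structure of these components concrete, here is a minimal Java sketch that lists the DataDictionary fields and the MiningSchema fields of a PMML document using only the standard DOM API. The file name model.pmml is hypothetical, and the sketch assumes the common case of a PMML file whose elements carry no namespace prefix; it is an illustration rather than a complete PMML consumer.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.File;

// Minimal sketch: read a PMML file and print its DataDictionary and MiningSchema fields.
public class PmmlFields {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("model.pmml")); // hypothetical file name

        // Fields available to all models, declared once in the DataDictionary.
        NodeList dataFields = doc.getElementsByTagName("DataField");
        for (int i = 0; i < dataFields.getLength(); i++) {
            Element f = (Element) dataFields.item(i);
            System.out.println("DataDictionary field: " + f.getAttribute("name")
                    + " (" + f.getAttribute("dataType") + ")");
        }
        // Fields actually used by a model, with their usage type (active, predicted, ...).
        NodeList miningFields = doc.getElementsByTagName("MiningField");
        for (int i = 0; i < miningFields.getLength(); i++) {
            Element f = (Element) miningFields.item(i);
            System.out.println("MiningSchema field: " + f.getAttribute("name")
                    + " usageType=" + f.getAttribute("usageType"));
        }
    }
}
```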

Figure 1 illustrates the relationship of the Data Dictionary, Mining Schema and Transformation Dictionary. Note that inputs to models can be defined directly from the Mining Schema or indirectly as derived attributes using the Transformation Dictionary.

[Figure 1: a Model and its ModelParameters draw inputs (field1, d_field3, field4, d_field4, ...) from the MiningSchema (field1, field3, field4, ...) and from derived fields in the TransformationDictionary (d_field3, d_field4, ...), all of which refer back to fields declared in the DataDictionary (field1 through field5, ...).]

Figure 1. This figure illustrates how the inputs to a model are of two types: basic attributes, which are directly specified by the Mining Schema, and derived attributes, which are defined in terms of transformations from the Transformation Dictionary applied to attributes in the mining schema.

2. New Models

PMML Version 3.0 adds three new models: rule sets, support vector machines, and text models.

Rule Set: Ruleset models can be thought of as flattened decision tree models, but they cover areas where decision trees are not handy or are too limited. Rulesets can be applied to new instances to derive predictions and associated confidences (scoring). They are not meant to replace decision trees, but rather are designed to meet the requirements of a common use case.

Support Vector Machine: Over the past several years, there has been a significant amount of research on support vector machines, and today support vector machine applications are becoming more common. In essence, support vector machines define hyperplanes that try to separate the values of a given target field. The hyperplanes are defined using kernel functions. The most popular kernel types are supported: linear, polynomial, radial basis, and sigmoid. Support vector machines can be used for both classification and regression.
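For reference, the following Java sketch gives the standard textbook forms of the four kernel types named above; the parameter names (gamma, coef0, degree) follow common usage and are illustrative rather than quoted from the PMML schema.

```java
// Textbook kernel functions of the four types mentioned above.
// x and y are feature vectors of equal length; gamma, coef0 and degree are kernel parameters.
public class Kernels {
    static double dot(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += x[i] * y[i];
        return s;
    }
    static double linear(double[] x, double[] y) {
        return dot(x, y);
    }
    static double polynomial(double[] x, double[] y, double gamma, double coef0, int degree) {
        return Math.pow(gamma * dot(x, y) + coef0, degree);
    }
    static double radialBasis(double[] x, double[] y, double gamma) {
        double d2 = 0;
        for (int i = 0; i < x.length; i++) d2 += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.exp(-gamma * d2);
    }
    static double sigmoid(double[] x, double[] y, double gamma, double coef0) {
        return Math.tanh(gamma * dot(x, y) + coef0);
    }
}
```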

Text: Version 3.0 also adds a text model consisting of the following components:

• Dictionary of terms or text dictionary that contains the terms in the model.

• Corpus of text documents: This element identifies the actual texts that are covered by this model. Only references are given, not the actual texts.

• Document-term matrix: This element specifies what terms are used in which document.

• Text model normalization: This element defines one of several possible normalizations of the document term matrix.

• Text model similarity: This element defines the similarity used to compare two vectors representing documents.

3. New Infrastructure

Model Composition: Using simple models as transformations is one of the major additions to PMML 3.0. It now offers the possibility to combine multiple conventional models into a single new one, using individual models as building blocks. This can result in models being used in sequence, where the result of each model is the input for the next one. This approach, called model sequencing, is not only useful for building more complex models, but can also be put to good use for data preparation.

Another form of model composition is also supported: the result of a model can be used to select which model should be applied next. For example, a decision tree can now have an embedded regression model in each leaf node.

Both model sequencing and model selection can be combined to develop quite complex models.
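PMML expresses this composition declaratively in XML; purely as a conceptual illustration, the following Java sketch shows the difference between the two styles. The Model interface and the helper methods are invented for this example and are not part of PMML or JDM.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy illustration of the two composition styles described above.
public class Composition {
    // A "model" here is simply a function from a record (field name -> value) to a score.
    interface Model extends Function<Map<String, Double>, Double> {}

    // Model sequencing: the first model's output becomes an extra input of the second model.
    static Model sequence(Model first, String derivedField, Model second) {
        return row -> {
            Map<String, Double> extended = new HashMap<>(row);
            extended.put(derivedField, first.apply(row));
            return second.apply(extended);
        };
    }

    // Model selection: a selector decides which embedded model scores each record,
    // e.g. a decision tree with a regression model in each leaf node.
    static Model select(Function<Map<String, Double>, Model> selector) {
        return row -> selector.apply(row).apply(row);
    }
}
```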

Built-in and user-defined functions. PMML 3.0 now supports functions that can be used to perform preprocessing steps on the input data. A number of built-in functions are predefined, covering simple arithmetic operations such as sum, difference, product, division, square root, and logarithm for numeric input fields, as well as string-handling functions such as trimming blanks or choosing substrings.


In addition, a mechanism to define custom functions was introduced to handle cases where the built-in functions do not suffice. In this way, models can include more sophisticated preprocessing. Users can define functions that, for example, extract the number of days since the year started, out of a given date.

Model verification. The addition of a mechanism for model verification will now greatly increase the compatibility of models between different vendors' applications consuming PMML. A verification model provides a mechanism for attaching a sample data set with sample results so that a PMML consumer can verify that a model has been implemented correctly. This will make model exchange a lot more transparent for users and inform them in advance in case compatibility problems might arise.

Output fields. All models can now have output fields. The output fields describe a set of result values that can be computed by the model. In particular, the output fields specify names, types and rules for selecting specific result features. This information can be used while writing an output table. The Output section in the model specifies default names for columns in an output table that might be different from names used locally in the model. Furthermore, they describe how to compute the corresponding values.

4. Other Changes to Models

PMML Version 3.0 also contains a number of other changes, some of which we quickly describe in this section.

All models: Derived fields can be used for preprocessing inputs prior to usage in the actual model.

Association: A lift attribute has been added. Lift is a popular measure of interestingness of a rule.
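For readers who want the exact definition behind this attribute, lift has the standard form used throughout the association-rule literature (the notation below is conventional and is not quoted from the PMML schema):

$$\mathrm{lift}(A \Rightarrow B) \;=\; \frac{\mathrm{conf}(A \Rightarrow B)}{\mathrm{supp}(B)} \;=\; \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)\,\mathrm{supp}(B)}$$

A lift greater than 1 indicates that the antecedent and consequent of the rule occur together more often than would be expected if they were independent.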

Clustering: Missing value weights were added for extended missing value handling. In this way, the impact of a missing value in each individual input field can be controlled.

Regression: The attributes modelType, targetField, and mean were removed because their functionality is now provided elsewhere. For example, 'mean' was basically used for missing value handling, but that can be done in the MiningSchema as well. New normalization methods (probit, logit, cloglog, and exp) were added to cover popular approaches. A new element PredictorTerm has been added, containing one or more fields that are combined by multiplication; that is, 'interaction terms' are now supported as well. Finally, binary classification and logistic regression with ordinal target fields are now supported.

5. Other Changes to Infrastructure

General structure: Sparse arrays have been added. This is a method to write sparsely filled arrays in a much more compact manner. This is especially useful for models such as support vector machines or text models, which make heavy use of array structures. It makes them more readable and prevents models from becoming unnecessarily bloated.

Data dictionary: Version 3.0 adds new data types: timeSeconds[], dateDaysSince[], and dateTimeSecondsSince[]. These additional types are supported in PMML because mining models often convert input values into numbers. After date and time values have been converted into numbers, they can be used easily in comparisons and other mathematical computations such as differences. For example, the date 2003-04-01 can be converted to the value 15796 of type dateDaysSince[1960]. These type casts are analogous to, e.g., casting an integer to a double or vice versa.
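The conversion quoted above is ordinary date arithmetic; the following Java sketch (using java.time, outside of PMML) reproduces the value 15796.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Check the example above: 2003-04-01 expressed as dateDaysSince[1960],
// i.e. the number of days elapsed since 1960-01-01.
public class DateDaysSince {
    public static void main(String[] args) {
        long days = ChronoUnit.DAYS.between(
                LocalDate.of(1960, 1, 1), LocalDate.parse("2003-04-01"));
        System.out.println(days); // prints 15796
    }
}
```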

Mining schema: Version 3.0 adds attributes 'optype' and 'importance'. 'optype' overrides the corresponding value in the DataField. That is, a DataField can be used with different optypes in different models. For example, a 0/1 indicator could be used as a numeric input field in a regression model while the same field is used as a categorical field in a tree model. 'importance' states the relative importance of the field. This indicator is typically used in prediction models in order to rank fields by their predictive contribution.

Transformations: In version 3.0, one can define a replacement for missing values via the attribute 'mapMissingTo' in the transformations NormDiscrete, Discretize and MapValues. In the same way, default values can now be defined via 'defaultValue' to cover cases where the input is a missing value in Discretize and MapValues.

Target. In previous releases, the possible class labels of classification models were specified differently, varying between the different types of models. For example, the target categories in regression models were specified in the RegressionTable elements, while the TreeModel defines them within Node elements, and Naive Bayes models specify them in TargetValueCounts. The new PMML element Target provides a common syntax for all models. It can also be used to provide additional information such as display names for the class labels or prior probabilities.

6. Summary

Perhaps the most significant change in PMML 3.0 is the support for model composition through model sequencing and model selection. Together with the improved support for built-in functions and user-defined functions, Version 3.0 of PMML now provides a much more powerful platform for data preparation. PMML 3.0 also adds several new model types: support vector machines, text models, and rule sets.

References

[1] The PMML Working Group is part of the Data Mining Group. See www.dmg.org.

[2] Robert Grossman, Mark Hornick, and Gregor Meyer. Data Mining Standards Initiatives. Communications of the ACM, 45(8):59-61, 2002.

[3] PMML documentation can be found on the project web site: sourceforge.net/projects/pmml/


Java™ Data Mining (JSR-73): Status and Overview

Mark F. Hornick, Hankil Yoon, and Sunil Venkayala

Abstract. With the completion of Java Data Mining (JSR-73), customers and vendors now have available a powerful standard to enable applications with data mining, both through Java and through Web services. In this paper, we introduce Java Data Mining with examples highlighting both the Java and Web services interfaces. We discuss conformance requirements using the Technology Compatibility Kit (TCK) for vendors implementing the standard. Lastly, we comment on likely features for the next release of JDM; the expert group for Java Data Mining 2.0 is now forming, as the JCP Executive Committee has approved JSR-247.

1. Introduction and Background

Traditionally, data mining algorithms were either home-grown and plugged into applications using raw code, or packaged in an end-user GUI complete with transformations and, in some cases, scoring code generation. However, the ability to embed data mining end-to-end in applications using commercial data mining products was difficult, if possible at all. Certainly, these APIs were not standards based, making the selection of a particular vendor's solution even more challenging. As such, the ability to leverage data mining functionality easily via a standards-based API greatly reduces the risk of selecting a particular vendor's solution and increases the accessibility of data mining to application developers. Java™ Data Mining (JDM) addresses this need.

Java technology, specifically as leveraged within the scalable J2EE architecture, facilitates integration with existing applications such as business-to-consumer and business-to-business web sites, customer care centers, and campaign management, as well as new applications supporting national security, fraud detection, bioinformatics, and life sciences. Java Data Mining allows users to draw on the strengths of multiple data mining vendors for solving business problems, by applying the most appropriate algorithm implementations to a given problem without having to invest resources in learning each vendor's proprietary API. Moreover, vendors and customers can focus on functionality, automation, performance, and price. With JDM's extensible framework for adding new algorithms and functionality, vendors can still differentiate themselves while providing developers with a familiar paradigm.

During the design of JDM, several data mining standards, including the DMG's Predictive Model Markup Language [DM-PMML], OMG's Common Warehouse Metadata for Data Mining [OMG-CWM], and ISO's SQL/MM Part 6 Data Mining [ISO-SQL/MM], were reviewed to ensure a reasonable degree of interoperability, either in concepts and options, or to facilitate the use of these standards. Similarly, JDM concepts and options have also influenced these standards.


2. Status

JDM is now an official part of the Java™ standard. The Executive Committee (EC) of the Java Community Process voted to accept JSR-73, thereby enabling vendors to provide standard advanced analytics support for Java applications. See [JSR73] for the specification [Hornick:2004] and related information. JSR-73 has now moved into the Maintenance phase, where minor corrections to the specification, RI, and TCK will be made. The EC concurrently approved JSR-247 to address extensions to JDM. The expert group for JSR-247 is now forming; nominations can be submitted at its website [JSR247].

3. Main Features

JDM includes interfaces supporting mining functions such as classification, regression, clustering, attribute importance, and association, along with specific mining algorithms such as naïve Bayes, support vector machines, decision trees, feed-forward neural networks, and k-means. These functions are executed synchronously or asynchronously using mining tasks, which include build, apply for batch and real-time, test, import, and export, as appropriate for each mining function. Import and export can support multiple model representations, including PMML and native formats. Import and export can also be used for JDM metadata, using the JDM XML Schema representation or others such as CWM for Data Mining. Users will also find JDM interfaces supporting confusion matrix, lift, and ROC results, taxonomy and rule representation, and statistics.

JDM further includes the specification of a web services interface based on the JDM UML model, thereby enabling Service Oriented Architecture (SOA) [Barry&Assoc2004] designs. Although JDM-based web services map closely to the Java interface, JDM web services address needs beyond the Java community: being based on WSDL and XML, they present a programming-language-neutral interface. Vendors of JDM can thus leverage their investment in a JDM server for both the Java and web service interfaces, using common metadata, object structure, and capabilities. Non-JDM vendors can also implement this same interface to be interoperable with a broader range of vendor implementations.

4. Java Interface Example

The following code example illustrates the steps for building and retrieving a clustering model using JDM. See [Hornick:2004] and the JDM javadoc documentation for details of the particular objects referenced. The first step in using the JDM API is to create a connection to a data mining engine (DME); in this example, we assume the connection dmeConn has been created. Object creation requires a corresponding factory, which is obtained from a connection.

The following code block illustrates how to reference and describe data. In lines 1 through 3, a PhysicalDataSet object specifying the location of the build data via a uniform resource identifier (URI) is created and saved to the DME. In line 2, attribute metadata associated with the build data is automatically derived and imported into the object buildData. In lines 4 through 6, a LogicalData object based on the specified physical data is created and saved. A LogicalAttribute object is created within the logical data for each physical attribute in the build data.


The attribute type, e.g., numerical or categorical, is automatically assigned by a vendor-specific method, possibly from the attribute data type and the number of unique attribute values. However, the user can override this assignment.

// Create the physical representation of the data
1) PhysicalDataSetFactory pdsFactory = (PhysicalDataSetFactory)
       dmeConn.getFactory( "javax.datamining.data.PhysicalDataSet" );
2) PhysicalDataSet buildData = pdsFactory.create( uri, true );
3) dmeConn.saveObject( "customerData", buildData, false );

// Create the logical representation of the data from the physical data
4) LogicalDataFactory ldFactory = (LogicalDataFactory)
       dmeConn.getFactory( "javax.datamining.data.LogicalData" );
5) LogicalData ld = ldFactory.create( buildData );
6) dmeConn.saveObject( "customerLogicalData", ld, false );

In the next code block (lines 7 through 12), a ClusteringSettings object is created and saved with build settings such as its name, logical data, maximum number of clusters, and minimum cluster case count. In this example, algorithm settings are not specified, leaving the DME to choose a suitable clustering algorithm with default or system-determined settings. This highlights the separation of functions and algorithms, supporting both data mining experts and novices.

// Create the settings to build a clustering model
7) ClusteringSettingsFactory csFactory = (ClusteringSettingsFactory)
       dmeConn.getFactory( "javax.datamining.clustering.ClusteringSettings" );
8) ClusteringSettings clusteringSettings = csFactory.create();
9) clusteringSettings.setLogicalDataName( "customerLogicalData" );
10) clusteringSettings.setMaxNumberOfClusters( 20 );
11) clusteringSettings.setMinClusterCaseCount( 5 );
12) dmeConn.saveObject( "customerSettings", clusteringSettings, false );

In the next code block (lines 13 through 15), a BuildTask object is created, which specifies the build data, the build settings, and the model name. The resulting model is placed in the connection's associated repository. All objects associated with a task, and the task itself, must be saved prior to asynchronous task execution; tasks need not be saved for synchronous execution. In this example, the mapping between physical and logical attributes is based on name equivalence; however, users can explicitly map attributes. Note that the logical data may be omitted from the build settings if it is not supported by the mining function or if all physical attributes are to be used with default assignments. In this example, lines 4 through 6 and line 9 could be omitted, since no changes are made to the logical data after its import from the physical data.


// Create a task to build a clustering model with data and settings
13) BuildTaskFactory btFactory = (BuildTaskFactory)
        dmeConn.getFactory( "javax.datamining.task.BuildTask" );
14) BuildTask task = btFactory.create( "customerData", "customerSettings", "customerSegments" );
15) dmeConn.saveObject( "customerSegBuild", task, false );

In the next code block (lines 16 through 19), we illustrate task execution for model build. In line 16, the named task is executed. The resulting model is placed in the mining object repository. The name of the model can later be used for applying the model to data, and other operations. In lines 17 through 19, the application asynchronously checks the status of the execution by extracting the execution handle and status. Execution handles can also be retrieved via a connection using the task name. // Execute the task and check the status

16) ExecutionHandle handle = dmeConn.execute( "customerSegBuild" );
17) handle.waitForCompletion( Integer.MAX_VALUE ); // wait until done
18) ExecutionStatus status = handle.getLatestStatus();
19) if( ExecutionState.success.equals( status.getState() ) )
        // if true, then the task completed successfully...

In the next block (lines 20 through 29), the model built in the preceding block is retrieved for viewing. Each cluster provides details including support, statistics, predicates, its parent and child clusters, centroid coordinates, and case count. A few of these are illustrated below.

// Retrieve the model to get the leaf clusters and their details

20) ClusteringModel customerSeg = (ClusteringModel)
        dmeConn.retrieveObject( "customerSegments" );
21) Collection segments = customerSeg.getLeafClusters();
22) Iterator segmentsIterator = segments.iterator();
23) while( segmentsIterator.hasNext() ) {
24)     Cluster segment = (Cluster) segmentsIterator.next();
25)     Predicate splitPredicate = segment.getSplitPredicate();
26)     long segmentSize = segment.getCaseCount();
27)     double support = segment.getSupport();
28)     AttributeStatisticsSet attrStats = segment.getStatistics();
29) }

5. Web Services Interface Example

In this section, we illustrate the executeTask web service as defined in the JDM WSDL document, along with an example that executes apply on a single record using a classification model. See the JDM specification and javadoc documentation for details of the particular objects referenced.


1) <complexType name="executeTask">
2)   <sequence>
3)     <choice>
4)       <element name="taskName" type="xsd:string"/>
5)       <element name="task" type="Task"/>
6)     </choice>
7)   </sequence>
8) </complexType>

9) <complexType name="executeTaskResponse">
10)   <sequence>
11)     <choice>
12)       <element name="status" type="ExecutionStatus"/>
13)       <element name="recordValue" type="RecordElement" maxOccurs="unbounded"/>
14)     </choice>
15)   </sequence>
16) </complexType>

The execution of a task can be specified either by naming a task already present in the Data Mining Engine (DME) (line 4) or by specifying the task content inline (line 5). The ExecutionStatus used in the executeTaskResponse (line 12) provides task progress. However, some tasks return values, as in the case of real-time scoring (record apply), as specified in line 13.

In lines 17 through 30, we illustrate executing a task called RecordApplyTask. A standard header is expected in line 17. Line 20 specifies the record apply for the model "ChurnClassification32", a classification model predicting customer churn. Lines 21-23 provide the record to score, consisting of two predictors, age and income, and the customer identifier. Lines 24 through 28 specify the content of the apply output. In line 25, we specify that the customer identifier is to be mapped from the input record to the output. In line 26, the top predicted category is mapped to the destination attribute "churn". Similarly, in line 27, the probability of this prediction is mapped to the destination attribute "churnProb".

17) <SOAP-ENV:Envelope ... > <SOAP-ENV:Header ... />
18)  <SOAP-ENV:Body>
19)   <executeTask xmlns="http://www.jsr-73.org/2004/webservices/"
                   xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema">
20)    <task xsi:type="RecordApplyTask" modelName="ChurnClassification32">
21)     <recordValue name="CustomerAge" value="23"/>
22)     <recordValue name="CustomerIncome" value="50000"/>
23)     <recordValue name="CustomerID" value="1003-2203-120"/>
24)     <applySettingsName xsi:type="ClassificationApplySettings">
25)      <sourceDestinationMap sourceAttrName="CustomerID"
                               destinationAttrName="CustId"/>
26)      <applyMap content="predCat" destPhysAttrName="churn" rank="1"/>
27)      <applyMap content="prob" destPhysAttrName="churnProb" rank="1"/>
28)     </applySettingsName>
29)    </task>
30)   </executeTask> </SOAP-ENV:Body> </SOAP-ENV:Envelope>

In lines 31 through 39, we depict the task response to the record apply, in this case a prediction result. In lines 34-36, the apply output for customer identifier, prediction and probability are provided.

31) <SOAP-ENV:Envelope ... >
32)  <SOAP-ENV:Body>
33)   <executeTaskResponse xmlns="http://www.jsr-73.org/2004/webservices/"
                           xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema">
34)    <recordValue name="CustomerID" value="1003-2203-120"/>
35)    <recordValue name="churn" value="1"/>
36)    <recordValue name="churnProb" value=".87"/>
37)   </executeTaskResponse>
38)  </SOAP-ENV:Body>
39) </SOAP-ENV:Envelope>

6. Conformance

As with any standard, defining conformance for vendor implementations raises myriad issues. Should all implementations be required to support all algorithms and features? Should data mining results, e.g., the rules in a decision tree model, be identical for the same datasets? In JDM, compliance is based on a core feature set with optional packages for each mining function and algorithm. In addition, JDM provides supportsCapability methods that allow applications to determine at runtime whether a particular vendor implementation supports a finer-grained feature, e.g., whether classification model build accepts a cost matrix specification, or whether the clustering algorithm produces hierarchically arranged clusters. For the features of a valid JDM configuration that a vendor supports, the vendor implementation must pass the TCK.

7. Java Community Process

As a Java Specification Request under Sun's Java Community Process [JCP], JDM went through several reviews before the final vote by the JCP Executive Committee. In addition, the JCP-required Reference Implementation (RI) and Technology Compatibility Kit (TCK) further validate the API prior to its becoming part of the Java™ standard. The RI ensures that the interface can be implemented and helps to identify modeling flaws before it becomes a standard.


8. JDM Forum

To facilitate public exchange of ideas on JDM, a new project, "datamining", has been created on java.net at https://datamining.dev.java.net/, providing a discussion forum, announcements, and document sharing among Java Data Mining users.

9. Summary and Future Work

Through the course of designing the API, the expert group made numerous tough choices about which features to include in the first release. For example, the expert group decided to defer addressing transformations, ensemble models, and "wrapper" methods such as cross validation. However, the feature set of the first release provides a well-rounded core of data mining functionality, which can easily be augmented and extended in JDM 2.0 [JSR247]. Some of the features being considered for JDM 2.0 include: mining unstructured data such as text and images; additional mining functions such as feature extraction and forecasting; model comparison, multi-target models, and ensembles; and expanding the web services to include such features. The web services interface will also explore higher-level data mining services. Such higher-level services may include making a single request to mine and score named datasets with minimal user-provided settings, returning model quality metrics and individual scores to the user.

With the design work of JSR-73 complete, the expert group looks forward to the standard's widespread adoption and use. With vendors supporting JDM, users will realize the benefits originally conceived. User and vendor feedback on the standard will help guide the direction of JDM 2.0.

10. References

[Barry&Assoc2004] http://www.service-architecture.com/
[DMG-PMML] http://www.dmg.org
[Hornick:2004] Mark Hornick and the JSR-73 Expert Group, "Java™ Specification Request 73: Java™ Data Mining (JDM)", 2004.
[JCP] http://www.jcp.org
[JSR73] http://jcp.org/en/jsr/detail?id=73
[JSR247] http://jcp.org/en/jsr/detail?id=247
[OMG-CWM] http://www.omg.org/technology/cwm


Part 2.

Data Mining Services


Web Services Standards for Data Mining

Robert Chu

Abstract

Most, if not all, data mining and scoring tool providers require users to use provider-specific ways to invoke their services. This provider-specific approach could be a major factor in why data mining tools and applications are not currently as widespread as one might hope. Web services standards can address these proprietary issues. This article discusses what web services are, in general as well as in the context of data mining and scoring. The intended readers are data mining practitioners who are new to web services.

1. Web Services

One not-so-rigorous description of web services is as follows: a web service client passes a request in text, while the service provider acts on the request and returns text to the client, all via the Web. Plain old web browsing is a form of web service: a user sends "http://cnn.com", for example, from a web browser to the CNN main web server, which sends its home page back to the requesting browser as text. Web services are identical in concept to this process. However, complicated web services often involve richer input content than simple web page browsing. XML [8] is most often used to format the input. As to the output, the contrast between web browsing and web services is not about whether or not the content is complicated, but rather whether the format is HTML or not. Even though it is not entirely technically correct, one can view an HTML document as an instance of an XML document. However, HTML is designed specifically for web browser consumption, while an XML document is designed for a specific business need.

It would not be complete to describe web services without mentioning the SOAP [5] protocol. Keep the following notes in mind if you are new to SOAP: SOAP is not really a simple protocol, and "object" has nothing to do with the protocol. Fortunately, you most likely will have no need to understand SOAP, as it should be transparent to you unless you deal with related low-level programming. The World Wide Web is based on the HTTP protocol, and currently the SOAP protocol fits nicely on top of HTTP. The ubiquity of HTTP and of low-cost HTTP-based web servers is a catalyst for the quick and widespread adoption of web services. The data mining industry can take full advantage of the cost factor.

2. Web Services for Data Mining and Scoring

Let us use examples to illustrate web services for data mining.

Example 1. John has 5 (x, y) data points: (1, 12.1), (2, 14.2), (3, 16.1), (4, 18.2), and (5, 20.1) and would like to fit the regression model y = a + b x. John sends a simple web service request via email like the following:


Hi Fred,

I have 5 (x, y) data points: (1, 12.1), (2, 14.2), (3, 16.1), (4, 18.2), and (5, 20.1). Could you help me fit the regression model y = a + bx? Thanks for your time.

Your best friend,
John

Thirty minutes later, John gets the following response from Fred:

Hi John,

a is 10, b is 2. Let me know if you need more help.

Fred

Example 2. The data is the same as in Example 1. John sends Fred email with the request in XML format:

Fred,

Please help me fit the regression model:

<BuildModel>
  <RegressionModel>
    <Target>y</Target>
    <Intercept/>
    <Predictor>x</Predictor>
  </RegressionModel>
  <InlineTable>
    <row><x>1</x><y>12.1</y></row>
    <row><x>2</x><y>14.2</y></row>
    <row><x>3</x><y>16.1</y></row>
    <row><x>4</x><y>18.2</y></row>
    <row><x>5</x><y>20.1</y></row>
  </InlineTable>
</BuildModel>

If you will, could you describe the result in XML as well? Much obliged.

John

Being tired of reading the XML document John sent, Fred responded three days later with the following:

John,

Here is the result:

<RegressionTable>
  <Intercept>10</Intercept>
  <Parameter name="x">2</Parameter>
</RegressionTable>

Fred.

Example 3. The request text is the same as in Example 2, but this time John does not send email; instead, John copies and pastes the request text into a text box on a window of a data mining tool and clicks the submit button. The request is sent to a remote data mining server (a sketch of what such a submission might look like programmatically appears below). This time, instead of three days, John gets the modeling results back in one second.
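As a rough illustration of what happens under the covers in Example 3, the following sketch posts such an XML request to a data mining service over HTTP and reads back the XML reply. The endpoint URL and class name are hypothetical and are not part of any data mining standard; this is simply one way a tool might submit the request.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class BuildModelClient {
    // Hypothetical endpoint; a real data mining service would publish its own URL.
    private static final String ENDPOINT = "http://dm.example.com/services/buildModel";

    public static String submit(String requestXml) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(ENDPOINT).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        try (OutputStream out = conn.getOutputStream()) {
            // send the <BuildModel> request text
            out.write(requestXml.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream()) {
            // read back the result, e.g., a <RegressionTable> fragment
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}

A tool such as the one in Example 3 could call BuildModelClient.submit() with the <BuildModel> text from Example 2 and simply display the returned XML.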


Example 4. The data for this example is the same as in Example 1, but is stored in a Microsoft Excel worksheet. Someone wrote an Excel add-in for John. John just launches the add-in GUI and specifies the data source and modeling settings by point and click. John then submits the model build request to a remote data mining server, and the result is returned in one second. John doesn't see any XML string flowing back and forth between Excel and the remote data mining server; the web service details are simply transparent to the user.

Please note that all the examples above use an embedded data source and skip the connection parameters for ease of illustration. In the real world, the data source can be in a database, and connection parameters are typically supplied in the request XML string.

3. Web Services Standards for Data Mining

As you can imagine from Example 2 in the previous section, just using XML can lead to multiple flavors of XML formats for describing the input and output of data mining. Without an XML data mining standard, if you switched from one data mining provider to another, you would most likely need to rewrite your code. Currently, there are two publicly available data mining related web services specification standards: the JDM API web services extensions [1] (JDMWS) and the XML for Analysis Specification [2]. JDMWS is based on the object models used in the JDM API specification, while XML for Analysis reuses the OLE DB for Data Mining Schema Rowsets. The next two sections show simple examples for each specification. It is not the intention of this article to rigorously compare these two specifications; our intention is only to promote the idea of web services standards for data mining in general.

4. JDM Web Service Examples

Java Specification Request 73: Java Data Mining (JDM) Version 1.0 is a pure Java API (Application Programming Interface) to facilitate the development of data mining and scoring-enabled applications. It includes web services extensions (JDMWS). The following three example fragments are based on the specification to show readers what JDMWS strings look like. It is not an intention of this article to give an overview or tutorial of JDMWS, so the explanation is brief.

Example 1.

<SOAP-ENV:Body>
  <saveObject xmlns="http://www.jsr73.org/2004/webservices/"
              xmlns:jdm="http://www.jsr73.org/2004/JDMSchema"
              name="CampaignSettings-101" overwrite="true" verify="true">
    <object xsi:type="ClassificationSettings" miningFunction="classification">
      <algorithmSettings algorithm="naiveBayes" pairwiseThreshold="0.1" singletonThreshold="0.1"/>
      <buildAttribute attributeName="Job" usage="active" outlierTreatment="asMissing"/>
      <buildAttribute attributeName="Gender" usage="active" outlierTreatment="asIs"/>
      <buildAttribute attributeName="Education" usage="active" outlierTreatment="asIs"/>
      <buildAttribute attributeName="customerID" usage="inactive"/>
    </object>
  </saveObject>
</SOAP-ENV:Body>

Each object that can be persisted in a JDM-based Data Mining Engine has a type and a unique name. This example shows an object of type ClassificationSettings being saved; later, this object can be retrieved by its type and name. Named objects promote object reuse.


Example 2.

<SOAP-ENV:Body>
  <executeTask xmlns="http://www.jsr73.org/2004/webservices/">
    <task xsi:type="BuildTask" name="CampaignBuildTask-26">
      <objectName>CampaignBuildTask_106</objectName>
      <modelName>Campaign_106</modelName>
      <buildDataName>Campaign20040115</buildDataName>
      <buildSettingsName>CampaignClassificationSettings</buildSettingsName>
    </task>
  </executeTask>
</SOAP-ENV:Body>

A JDM-based task can be defined and persisted in a JDM-based Data Mining Engine. Executing a JDM task amounts to sending a web service request that names a pre-defined task and associates a few related resource objects with it.

Example 3.

<SOAP-ENV:Body>
  <executeTask xmlns="http://www.jsr73.org/2004/webservices/">
    <task xsi:type="RecordApplyTask" modelName="CampaignClassification106">
      <recordValue name="Job" value="Sales Management"/>
      <recordValue name="Gender" value="F"/>
      <recordValue name="Education" value="College"/>
      <recordValue name="CustID" value="20040214-5673"/>
      <applySettingsName xsi:type="ClassificationApplySettings">
        <sourceDestinationMap sourceAttrName="CustID" destinationAttrName="CustomerID"/>
        <applyMap content="predictedCategory" destPhysAttrName="churn" rank="1"/>
        <applyMap content="probability" destPhysAttrName="churnProb" rank="1"/>
      </applySettingsName>
    </task>
  </executeTask>
</SOAP-ENV:Body>

This example shows a single-record-scoring web service.

5. XML for Analysis for DM Examples

XML for Analysis (XMLA) is a Simple Object Access Protocol (SOAP)-based XML API designed specifically for standardizing the data access interaction between a client application and a data provider working over the Web. XMLA addresses both OLAP (OnLine Analytical Processing) and data mining. The following three example fragments are based on the specification and the accompanying OLE DB for Data Mining Specification Version 1.0 [4] to show readers what XMLA strings look like. It is not an intention of this article to give an overview or tutorial of XMLA for Data Mining, so the explanation is brief.


Example 1.

<SOAP-ENV:Body>
  <Execute xmlns="urn:schemas-microsoft-com:xml-analysis"
           SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
    <Command>
      <Statement>
        CREATE MINING MODEL [MemberCards]
        (
          [customer Id] LONG KEY,
          [Yearly Income] TEXT DISCRETE,
          [Member Card Type] TEXT DISCRETE PREDICT,
          [Marital Status] TEXT DISCRETE
        )
        USING VendorA_Decision_Trees
      </Statement>
    </Command>
    <Properties> … </Properties>
  </Execute>
</SOAP-ENV:Body>

This example illustrates an XML string and an OLE DB for Data Mining script for building a decision tree mining model skeleton. [Member Card Type] is the target column, since the keyword PREDICT is specified for that column.

Example 2.

<SOAP-ENV:Body>
  <Execute xmlns="urn:schemas-microsoft-com:xml-analysis"
           SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
    <Command>
      <Statement>
        INSERT INTO [MyModel]
        // Define the list of columns to be populated
        ( [Name], [Age], [Hair Color] )
        OPENROWSET
        ( 'SQLOLEDB',
          'Initial Catalog=FoodMart 2000',
          'SELECT [Name], [Age], [Hair Color] FROM [Customers]' )
      </Statement>
    </Command>
    <Properties> … </Properties>
  </Execute>
</SOAP-ENV:Body>

This example illustrates an XML string plus a SQL-like script for populating (training) a data mining model.


Example 3.

<SOAP-ENV:Body>
  <Execute xmlns="urn:schemas-microsoft-com:xml-analysis"
           SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
    <Command>
      <Statement>
        SELECT t.[Customer ID], [Age Prediction].[Age]
        FROM [Age Prediction]
        PREDICTION JOIN
        ( SHAPE
          { SELECT [Customer ID], [Gender] FROM Customers ORDER BY [Customer ID] }
          APPEND
          ( { SELECT [CustID], [Product Name], [Quantity] FROM Sales ORDER BY [CustID] }
            RELATE [Customer ID] TO [CustID]
          ) AS [Product Purchases]
        ) AS t
        ON [Age Prediction].Gender = t.Gender
        AND [Age Prediction].[Product Purchases].[Product Name] = t.[Product Purchases].[Product Name]
        AND [Age Prediction].[Product Purchases].[Quantity] = t.[Product Purchases].[Quantity]
      </Statement>
    </Command>
    <Properties> … </Properties>
  </Execute>
</SOAP-ENV:Body>

This example illustrates how an XML string is used for scoring based on a trained mining model. Note that both JDMWS and XMLA for Data Mining support generation of PMML-based [3] model output.

6. Issues in Web Services for Data Mining

Data Security – By default, a data mining web services client and server may communicate with each other in clear text, so data integrity and confidentiality could be compromised. WS-Security [6] is a good starting point for exploring web service security issues.

Asynchronous Web Services – A data mining task can be long-running. Asynchronous operations should be considered; otherwise, a client application may be undesirably blocked until the service results are returned.


Data Mining Session State Management – Web services are stateless. One can simulate a stateful session by including a session ID in the web services strings.

WS-Resource Framework – Many computers (often many inexpensive PCs) can be grouped together to function as a virtual supercomputer for data mining tasks. The WS-Resource Framework [7] is a standard that can facilitate such virtual supercomputers.

7. Conclusion

Data mining, web services, and software standards in general are all growing in popularity. Data mining web services standards in particular are a sure bet for data mining practitioners who are looking for technologies to improve their organizations' competitive edge. This article briefly introduced web services in general, and web services for data mining in particular. Examples were given to illustrate the styles of the current popular data mining web services standards, JDMWS [1] and XMLA for DM [2, 4]. Related issues and suggested resolutions were also discussed.

References

[1] Java Specification Request 73: Java Data Mining (JDM), Version 1.0, Final Review.
[2] XML for Analysis Specification, Version 1.0.
[3] Predictive Model Markup Language, Version 2.1.2, http://www.dmg.org.
[4] OLE DB for Data Mining Specification, Version 1.0.
[5] SOAP Version 1.2, http://www.w3.org/TR/soap/.
[6] WS-Security, http://www-106.ibm.com/developerworks/webservices/library/ws-secure/.
[7] WS-Resource Framework, http://www.globus.org/wsrf/.
[8] XML Specification, http://www.w3.org/TR/2000/REC-xml-20001006.

Acknowledgement

Thanks to the XML for Analysis Council and Microsoft for permission to use examples described in the XML for Analysis Specification and the OLE DB for Data Mining Specification, respectively.


Experimental Studies Scaling Web Services For Data Mining Using Open DMIX: Preliminary Results

Robert Grossman and David Hanley

Abstract

We have developed Open DMIX, an open source collection of web services for accessing, exploring, integrating, and mining remote and distributed data. Open DMIX clients interact with Open DMIX servers using a version of web services designed for high performance applications, which we call SOAP+. In this paper we describe experimental studies comparing SOAP- and SOAP+-based data mining applications on high performance, wide area networks. For these applications, Open DMIX using SOAP+ can be significantly faster than traditional SOAP-based applications.

1. Introduction

Today, most data mining takes place in one of two ways. In the first, a client-server or 3-tier data mining application accesses and analyzes local data. In the second, data mining is embedded in another application, either explicitly or implicitly. For example, data mining is today embedded into the databases marketed and sold by IBM, Microsoft, and Oracle. Data mining is also commonly embedded into a variety of applications, for example CRM and financial risk applications.

During the past several years, web services have matured to the point where it is now becoming practical to create distributed data mining infrastructures and platforms based upon them. Indeed, web services have the potential to change the infrastructure used to explore, analyze, and mine data in a fundamental way. Consider the following: today, many people find it quicker to locate a preprint using Google than to search for it on their own local disk. On the other hand, almost all data analysis is done using local data. As bandwidth becomes a commodity resource [5], accessing remote data and remote services will become easier, and one day it may be as easy to work with remote data as it is to work with local data.

However, even basic experiments using web services for data mining quickly identify some fundamental limitations. As we show in more detail below, mining remote and distributed data using SOAP/XML-based web services [22] does not scale to even moderate size data sets. In this paper, we describe some experimental studies using Open DMIX, which is a collection of scalable web services for accessing, exploring, integrating and mining data. Open DMIX differs from prior work of which we are aware in three key ways.

• Open DMIX can use traditional SOAP/XML/TCP-based web services for small datasets and metadata. In addition, Open DMIX supports a new protocol called SOAP+ for larger datasets and metadata collections.


• SOAP+ has two channels. The first is a SOAP/XML/TCP-based control channel and the second is a data channel. The data channel can employ specialized network protocols and packaging formats.

• The data channel in SOAP+ can use UDT (UDP-based Data Transfer Protocol), a specialized network protocol we have developed for working with large remote and distributed data sets.

In this paper we present some experimental studies comparing SOAP vs. SOAP+.

2. Background and Related Work

This section is based in part on [10]. It is convenient to think of the data mining systems developed during the past decade as comprising three generations: 1) client-server systems; 2) component and agent-based systems; and 3) systems based upon web services.

The first generation of data mining systems utilized local data, with either client-server or 3-tier architectures. With these systems, a client front end is used to access a server (possibly on the same machine) hosting the data mining application. With a client-server model, the server also manages the data; with a 3-tier model, the data is accessed from another source using ODBC, JDBC, or another related protocol.

The next generation of data mining systems was component-based. The components could be local, relying on Microsoft's COM or DCOM platforms, for example, or distributed, relying on systems such as Sun's J2EE platform. Angoss is an example of the former, and Kensington is an example of the latter. More or less at the same time, various experimental agent-based data mining systems were developed. The basic assumption in these systems is that the data is distributed and agents are used to move the data, move the models produced by a local data mining system, or move the results of a local data mining computation. Today, very few agent-based systems are used in practice. This is probably because no agent-based infrastructure, over which an agent-based data mining system must be built, was ever widely adopted. Examples of agent-based distributed data mining systems include JAM [19], Papyrus [7], and BODHI [14].

Somewhat later, the next generation of service-based data mining systems began to emerge. These are generally built using the W3C's standardization of web services. Examples include DataSpace [9] and data mining systems developed by IBM, Microsoft and SAS that employ the XML for Analysis standard [3]. More general service-based infrastructures, such as grids or data grids [4], are also used for data mining, especially when large computational resources are required. A data grid uses Globus, or an equivalent infrastructure, to provide security and resource management so that distributed computing resources can be used. In addition, Globus provides a high performance data transport mechanism called GridFTP. Recently, the Grid community has begun an effort called the Open Grid Services Architecture, or OGSA, which provides web service-based access to some grid services [17]. OGSA Database Access and Integration Services (OGSA DAIS) [16] combine grid services with web services for remotely accessing databases.


3. SOAP-Based Web Services

A web service may be implemented as a standalone TCP server, or it may be accessed via a URL through a web server. When running under a web server, the SOAP service can take advantage of firewall tunneling, although performance will be reduced. The SOAP service accepts XML that describes an action for the server to perform, and returns XML to the client describing the result of the operation. It is possible to maintain state between operations, but operations are essentially non-streaming due to the marshaling rules of XML.

There are two fundamental problems when using web services for mining moderate to large size remote and distributed data sets. First, due to the overhead of XML encoding and parsing, there is a limit to the speed of the data transmission and the total size of the return set. This is caused by the need to retain the entire dataset in local storage under the XML encoding and decoding rules. The specific issue is that redundant parts of an XML document can, and must be able to, refer to other similar parts of the document, which requires that the entire document be retained for lookup purposes. Therefore, all data packaging mechanisms that are truly XML compliant are in essence non-streaming. While a server could, in theory, safely ignore this encoding rule when there are no circular data structures, a compliant client cannot safely do so.

Second, web services are also limited by the performance of TCP sockets. As a simple example, when transporting data from our cluster in Amsterdam to our cluster in Chicago for the experiments reported below, TCP flows averaged 3-4 Mb/s over a 1 Gb/s link. This is primarily due to the limitations of TCP when used on networks with high bandwidth delay products. Here the bandwidth delay product is 1 Gb/s x 110 ms, or 13.75 MB. In contrast, UDT, the new network protocol we developed for Open DMIX, provides significantly higher performance: a single UDT flow can average 950 Mb/s over the same link.

4. SOAP+ Based Web Services

Open DMIX employs both standard SOAP/XML-based web services and a high performance version, which we call SOAP+. SOAP+ uses separate data and control channels. The data channel can employ high-speed network protocols and alternatives to XML. In a typical application, metadata and small data sets can be accessed using SOAP, while larger data sets can be accessed using SOAP+. The table below contains some performance measurements comparing SOAP and SOAP+ when accessing a synthetic data set containing 10 attributes and the indicated number of data records.


Record Count   SOAP using TCP/XML (secs)   SOAP+ using UDT/ASCII (secs)
10,000         0.65                        0.21
50,000         2.57                        0.72
150,000        11.13                       2.05
375,000        51.18                       5.01
1,000,000      352.1                       13.43

Table 1. All times are in seconds. The tests were performed on a 1 Gb/s network linking a cluster in Chicago with a cluster in Amsterdam; the roundtrip time was 110 ms. This table compares SOAP with SOAP+, which employed UDT and simple delimited ASCII text records.

As the table makes clear, the SOAP/XML mode does not scale linearly with query size, and in fact breaks with sufficiently large queries. The 1 million row SOAP query consumed 99% of the CPU and much of the RAM on the server, and then on the client, in its marshalling and de-marshalling.

Results may be returned via the standard mechanism, or they may be returned via high performance protocols requested in the call and described in the return. When using the normal SOAP return mechanism, the size of the return set is limited in order not to place an excessive storage/CPU burden on the client or server. The simplest high performance mechanism is to return ASCII text records as a stream over a normal TCP socket. This approach is simple and allows reasonable speed over local distances. The problems with this approach are the overhead of parsing and encoding ASCII text data and the use of TCP, which does not scale to long distances. This is a streaming approach and only needs to consume a fixed amount of storage on the client and server, rather than storage proportional to the size of the dataset. Data may also be returned as binary records; in this case, the return of the SOAP call describes the encoding of the binary records. These records are generally significantly more compact than text, and much faster to parse. In fact, since the client can request a specific endianness, no client-side parsing may be needed. This approach has speed advantages even when the source is non-binary, although the performance gain is far greater with binary data sources.

5. Alternate Network Transport Protocols

In prior work, we introduced an alternative to TCP for data mining called SABUL [8] and demonstrated that SABUL is significantly faster than TCP for mining remote and distributed data over networks with high bandwidth delay products [11]. Recall that the bandwidth delay product is the product of the bandwidth and the round trip time of the path; TCP throughput is directly limited by the bandwidth delay product of the connection it is using. In further experiments, we found that, although SABUL works well when mining a single high volume flow of data, there were difficulties when transporting multiple high volume flows, as would be required, for example, when integrating two high volume flows from two geographically distributed data sets prior to applying a streaming data mining operation, such as streaming clustering.


Since many Open DMIX queries require multiple flows, it is very important that Open DMIX use a network protocol that is fair to each of the data flows, so that each flow obtains approximately equal bandwidth. Since Open DMIX queries use traditional SOAP/XML/TCP web services, it is also important that Open DMIX use a network protocol that is friendly to multiple TCP flows. Recently, we have developed a new application level protocol called UDT and integrated it into Open DMIX. Since UDT is an application layer protocol, it is straightforward to use it for an application layer service, such as data mining. UDT is built on top of UDP and provides reliability (which UDP lacks) and congestion control to support the Open DMIX requirements of fairness (in order to support data mining operations on multiple UDT flows) and friendliness (in order to support TCP-based control information). Other protocols similar to UDT have been proposed, including Tsunami [21], FOBS [2], and RBUDP [15].

For our experiments, we measured the performance of UDT over a 1 Gb/s network connecting a cluster in Chicago with a cluster in Amsterdam. The round trip time was 110 ms. The results are summarized in Table 2. TCP flows averaged 3-4 Mb/s, while a single UDT flow averaged 950 Mb/s. Note that UDT is fair to multiple UDT flows: for example, four UDT flows share the bandwidth and average 169 Mb/s each. This is important to support streaming data mining operations on multiple high volume data flows. Notice also that UDT is friendly to TCP in the sense that it essentially does not affect the performance of TCP flows, which in general average about 4.2 Mb/s until the link becomes congested. This is important since SOAP+ uses TCP for the control channel.

                UDT                               TCP
# Flows   Average   Aggregate       # Flows   Average   Aggregate       Overall Throughput
1         667       667             50        4.23      212             878
2         315.5     631             50        4.19      210             841
3         222       666             50        4.05      203             869
4         169.3     677             50        3.81      191             868

Table 2. All throughput measurements are in Mb/s. This table compares two network protocols: UDT, which is used in SOAP+ for the data channel, and TCP, which is used with SOAP (SOAP+ uses TCP for its control channel). The tests were run between Chicago and Amsterdam on a 1 Gb/s network with a 110 ms roundtrip time. The table illustrates the performance advantage of using UDT instead of TCP for mining high volume data flows: a single UDT flow is about 150x faster than a single TCP flow. Notice also that UDT is fair to multiple UDT flows and friendly to multiple TCP flows.
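To make the control/data channel split concrete, the following sketch outlines how a SOAP+ style client might consume a large result set: a SOAP control exchange (not shown) returns a description of a separate data channel, which the client then opens and reads record by record. The DataChannelInfo fields are illustrative assumptions, and a plain TCP socket stands in for a UDT connection; this is not the actual Open DMIX API.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class SoapPlusClient {
    // Hypothetical description returned by the SOAP control call: where and
    // how the data channel will deliver the result set.
    static class DataChannelInfo {
        String host;      // server holding the result
        int port;         // port of the data channel
        String format;    // "ascii" (delimited text) or "binary"
    }

    // Stream delimited ASCII records from the data channel one at a time,
    // so only a fixed amount of memory is needed regardless of result size.
    static void readAsciiRecords(DataChannelInfo ch) throws Exception {
        try (Socket s = new Socket(ch.host, ch.port);   // stand-in for a UDT connection
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream(), StandardCharsets.US_ASCII))) {
            String record;
            while ((record = in.readLine()) != null) {
                // hand each record to the data mining operation as it arrives
            }
        }
    }
}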

6. Alternative Packaging Formats

In addition to returning data in delimited ASCII fields (instead of XML), SOAP+ may also use a binary format, which is faster for several reasons. Marshalling overhead is far lower than for ASCII records, and may even be nonexistent. The records are more compact and of a predictable size, and decoding is far more efficient. Table 3 compares text and binary records as packaging formats, and Table 4 provides another comparison between delimited ASCII and binary formats. It is worth observing that binary records have roughly a 2:1 advantage in transmission size, but an even larger speed advantage, due to their more efficient encoding, which plays a role even when no work is being done on the records, because there is no need to scan for end-of-string markers.

Record Count   SOAP time (secs)   SOAP+ time using UDT and binary packaging (secs)
10,000         0.07               0.07
50,000         0.57               0.11
150,000        8.24               0.15
375,000        50.73              0.37
1,000,000      351.23             0.99
5,000,000      7400.47            4.66

Table 3. All times are in seconds. The tests were performed on a 1 Gb/s network linking a cluster in Chicago with a cluster in Amsterdam, using data records with five attributes; the roundtrip time was 110 ms. This table compares SOAP and SOAP+, where SOAP+ uses UDT as the network protocol and binary packaging.

Rows         ASCII time (sec)   ASCII size (MB)   Binary time (sec)   Binary size (MB)
1,000,000    1.57               48                0.52                28
2,000,000    2.99               96                0.97                56
5,000,000    7.31               240               2.32                140
10,000,000   14.56              480               4.61                280

Table 4. Note the significant advantages of binary compared to ASCII packaging formats. The tests were performed on a 1 Gb/s network linking a cluster in Chicago with a cluster in Amsterdam, using data records with five attributes.
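As a rough illustration of why fixed-width binary records are cheaper to handle than delimited ASCII, the following sketch encodes and decodes a five-attribute record both ways. The record layout and class names are hypothetical examples, not the actual Open DMIX wire format.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class RecordPackaging {
    static final int ATTRIBUTES = 5;
    static final int RECORD_BYTES = ATTRIBUTES * Double.BYTES;   // 40 bytes per record

    // Encode one record as a delimited ASCII line (variable length, must be scanned and parsed).
    static String toAscii(double[] record) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < record.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(record[i]);
        }
        return sb.append('\n').toString();
    }

    // Encode the same record as fixed-width binary in the byte order the
    // client requested, so the client may be able to use the bytes directly.
    static byte[] toBinary(double[] record, ByteOrder clientOrder) {
        ByteBuffer buf = ByteBuffer.allocate(RECORD_BYTES).order(clientOrder);
        for (double v : record) buf.putDouble(v);
        return buf.array();
    }

    // Decode a binary record: no scanning for delimiters, just fixed offsets.
    static double[] fromBinary(byte[] bytes, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(order);
        double[] record = new double[ATTRIBUTES];
        for (int i = 0; i < ATTRIBUTES; i++) record[i] = buf.getDouble(i * Double.BYTES);
        return record;
    }
}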

7. Clustering as a Web Service

In a final series of experiments, we examined the ability of Open DMIX to integrate two streams of data and apply a streaming clustering algorithm. The streaming clustering algorithm we used is described in [13]. Table 5 shows the time required to transport, integrate, and cluster 1.25 GB of data using Open DMIX, first with TCP as the transport protocol and then with UDT. The data was network intrusion data with 11 attributes. The clustering process using UDT was CPU bound; otherwise the differences between these two protocols would be even greater. The experiments were repeated five times, and the results are reported in seconds. The remote data resided on a server in Amsterdam and was accessed over a 1 Gb/s route with a 110 ms RTT.


               SOAP    SOAP+
Experiment 1   1346    280
Experiment 2   1417    276
Experiment 3   1321    287
Experiment 4   1425    285
Experiment 5   1278    286
Average        1357    283

Table 5. All times are in seconds. Five experiments were run on a 1 Gb/s network between Chicago and Amsterdam with a 110 ms round trip time. Two 1.25 GB datasets were transported, integrated, and clustered with a streaming algorithm using both SOAP and SOAP+. The process was CPU bound; otherwise the difference between SOAP and SOAP+ would be even greater.

8. Conclusion

Web services are emerging as a standard mechanism for developing remote and distributed data mining applications. For commodity long haul networks, working with large data sets in this way can be very slow. On the other hand, the number of high performance networks is increasing, and wide area networks with 1 Gb/s and 10 Gb/s bandwidth are becoming available [23]. With these types of networks, it is practical to work with large (1 GB and larger) remote and distributed data sets if the appropriate network protocols and packaging formats are used.

We have developed an open source data mining, exploration, and integration system called Open DMIX based upon a high performance version of web services called SOAP+. SOAP+ uses a traditional SOAP-based control channel and a separate data channel, which can employ high performance network protocols and alternate packaging formats. In this paper, we report on experimental studies comparing SOAP and SOAP+ over wide area high performance networks. SOAP+ can provide significant performance advantages, sometimes as much as 10x-100x.

9. References

[1] Mario Cannataro, Domenico Talia, and Paolo Trunfio. The Knowledge Grid: Towards an Architecture for Knowledge Discovery on the Grid. To appear.

[2] FOBS. omega.cs.iit.edu/ondrej/research/fobs, retrieved on April 16, 2003.

[3] XML for Analysis Consortium. XML for Analysis. Retrieved from http://www.xmla.org, October 10, 2003.

[4] I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco, California, 1999.

[5] Ian Foster and Robert L. Grossman. Data Integration In A Bandwidth Rich World. Communications of the ACM, 46(11):50–57, 2003.

[6] R. L. Grossman. Standards And Infrastructures For Data Mining. Communications of the ACM, 45(8):45–48, 2002.


[7] R. L. Grossman, S. Bailey, A. Ramu, B. Malhi, H. Sivakumar, and A. Turinsky. Papyrus: A System For Data Mining Over Local And Wide Area Clusters And Super-Clusters. In Proceedings of Supercomputing. IEEE, 1999.

[8] R. L. Grossman, M. Mazzucco, H. Sivakumar, Y. Pan, and Q. Zhang. SABUL - Simple Available Bandwidth Utilization Library For High-Speed Wide Area Networks. Journal of Supercomputing, to appear.

[9] Robert Grossman and Marco Mazzucco. Dataspace – A Web Infrastructure For The Exploratory Analysis And Mining Of Data. IEEE Computing in Science and Engineering, pages 44–51, July/August, 2002.

[10] Robert L. Grossman. Standards, Services And Platforms For Data Mining: A Quick Overview. In Proceedings of the 2003 KDD Workshop on Data Mining Standards, Services and Platforms (DM-SSP 03), to appear.

[11] Robert L. Grossman, Yunhong Gu, Dave Hanley, Xinwei Hong, Dave Lillethun, Jorge Levera, Joe Mambretti, Marco Mazzucco, and Jeremy Weinberger. Experimental Studies Using Photonic Data Services At Igrid 2002. Journal of Future Generation Computer Systems, 19(6):945–955, 2003.

[12] Data Mining Group. Predictive Model Markup Language (PMML). http://www.dmg.org, January 10 2003.

[13] Chetan Gupta and Robert L. Grossman. Genic: A Single Pass Generalized Incremental Algorithm For Clustering. SIAM, 2004.

[14] H. Kargupta, I. Hamzaoglu, and B. Stafford. Scalable, Distributed Data Mining Using an Agent Based Architecture. In David Heckerman, Heikki Mannila, Daryl Pregibon, and Ramasamy Uthurusamy, editors, Proceedings of KDD ‘97, The Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA, USA, pages 211–214. AAAI Press, August 1997.

[15] E. He, J. Leigh, O. Yu, and T. DeFanti. Reliable Blast UDP: Predictable High Performance Bulk Data Transfer. In IEEE Cluster Computing, 2002.

[16] Norman W. Paton, Malcolm P Atkinson, Vijay Dialani, Dave Pearson, Tony Storey, and Paul Watson. Database Access And Integration Services On The Grid. 2002.

[17] The Globus Project. Towards Globus Toolkit 3.0: Open Grid Services Architecture. http://www.globus.org/ogsa/, retrieved on January 10, 2003.

[18] R Project. Retrieved from http://www.r-project.org, January 10, 2003.

[19] Salvatore J. Stolfo, Andreas L. Prodromidis, Shelley Tselepis, Wenke Lee, Dave W. Fan, and Philip K. Chan. JAM: Java Agents For Meta-Learning Over Distributed Databases. In Knowledge Discovery and Data Mining, pages 74–81, 1997.

[20] W3C Semantic Web. Retrieved from www.w3.org/2001/sw/, September 2, 2002.

[21] Tsunami. www.anml.iu.edu/anmlresearch.html, retrieved on April 4, 2003.

[22] W3C. Semantic Web. retrieved from www.w3.org/2001/sw/, September 2, 2002.

[23] A. Chien, T. Faber, A. Falk, J. Bannister, R. Grossman, and J. Leigh. Transport Protocols for High Performance: Whither TCP? Communications of the ACM, volume 46/11, pages 42-49, 2003.


Part 3.

Data Mining Platforms


Distributed Scoring Using PMML

Bill Hosken and Bernard Scherer

Abstract: As modeling and direct marketing continue to converge, many companies are faced with managing multiple models and executing them in batch across multiple titles and companies. SPSS worked with a division of Experian that provides marketing solutions for the business-to-consumer catalog industry. Experian needed a system that would satisfy the following requirements: the ability to deploy billions of scores; the ability to manage multiple models from hundreds of titles; an open system that could utilize multiple machines; and a flexible solution that was scalable and could handle high volume at high speed. In this case, the Clementine data mining application was used to create models, which generated PMML. The PMML was then processed by the High Speed Scoring Engine (HSSE) for distributed processing. This talk discusses the architecture of the system and lessons learned.


A Simple Strategy for Composing Data Mining Operations

Robert L. Grossman and Gregor Meyer

Abstract

An important element in data preparation is composing data mining operations. In this note, we discuss some of the issues involved when composing data mining operations. We also describe the support in PMML Version 3.0 for two of the most common types of composition: using the output of one model as the input to another model (model sequencing) and using one model to select one or more other models (model selection or averaging).

1. Introduction and Background

It is now standard to view the data mining process as consisting of several steps, some of the most important of which are data preparation, data modeling, and scoring or deployment. Today, there are well defined architectures and standards for scoring. In particular, the Predictive Model Markup Language, or PMML [DMG], provides a clean interface between producers of models, such as a statistical or data mining system, and consumers of models, such as a scoring system or an application that employs embedded analytics. On the other hand, the architectures, processes, and standards are much less mature for data preparation.

Although it can be much more complicated, it is sometimes helpful to think of data preparation as the process that takes one or more tables of data and produces a single table of feature vectors, which are the input to a statistical or data mining model. See Figure 1. For example, PMML Version 2.1 includes the following four common types of data preparation operations:

• Normalization: Normalization transforms continuous or discrete values or ranges to numbers.

• Discretization: Discretization transforms continuous values to discrete values.

• Value mapping: Value mapping transforms discrete values to discrete values.

• Aggregation: Aggregation summarizes or collects groups of values, for example by computing averages.

It is an interesting exercise to see how many practical cases of data preparation can be reduced to compositions of the four operations above; it turns out to be quite a large number. More generally, one could also include certain built-in functions or user-defined functions in data preparation:

• Functions: Data preparation functions derive a value by applying a function to one or more parameters.


Notice that each of the itemized operations above can be thought of as a function. Two questions arise:

1. What is an appropriate architecture and what are appropriate standards so that the functions that arise in data preparation can be composed?

2. More generally, what is an appropriate architecture and what are appropriate standards so that data mining models themselves can be composed?

Here are two simple, motivating examples, quite common in practice, related to the second question of how data mining models can be composed:

Example 1. A classification tree may be used to select between two or more logistic regression models. In this paper, we refer to this as model selection.

Example 2. A logistic regression model may be used as an input to a classification tree. In this paper, we refer to this as model sequencing. (A minimal code sketch of sequencing appears after the list of difficulties below.)

Note that composing models is difficult for several reasons:

• The composition of data mining operations is only partially defined. In general, it is not well defined to take the output of one model and use it as the input to another model.

• Model composition covers several different use cases in practice.

• Just as a clean separation between model producers and model consumers led to the development of a wide variety of scoring engines and embedded data mining applications, the hope is that a simple mechanism for defining composition can lead to standard architectures for data preparation. The challenge is to balance the generality of a very general composition mechanism, and the complexity it would impose on model consumers, against constructs sufficiently powerful to implement common use cases.
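As noted above, here is a minimal, purely conceptual sketch of model sequencing in code; the Model interface and the class names are hypothetical and are not part of PMML or any particular data mining API.

// Hypothetical interfaces for illustration only; not PMML, JDM, or any vendor API.
import java.util.Arrays;

interface Model {
    double score(double[] features);
}

// Model sequencing: the output of the first model becomes an extra input
// field (a derived attribute) for the second model.
class SequencedModel implements Model {
    private final Model first;   // e.g., a logistic regression model
    private final Model second;  // e.g., a tree model consuming the first model's output

    SequencedModel(Model first, Model second) {
        this.first = first;
        this.second = second;
    }

    public double score(double[] x) {
        double p = first.score(x);                          // first model's output
        double[] extended = Arrays.copyOf(x, x.length + 1); // append it to the feature vector
        extended[x.length] = p;
        return second.score(extended);                      // feed the extended vector onward
    }
}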

In this article, we introduce the composition mechanism developed by the PMML Working Group, which covers the composition of normalization, discretization, value mapping, and aggregation, as well as the two motivating examples. In Section 2, we describe some of the infrastructure of PMML. Section 3 outlines the basic idea. In Section 4, we describe how model selection is done in PMML. In Section 5, we describe how model sequencing is done in PMML. Section 6 indicates the current status of PMML and Section 7 contains a summary and conclusion.

Acknowledgements. The work described in this paper was done by the Data Mining Group's (DMG) Predictive Model Markup Language (PMML) Working Group. A more complete description is available at www.dmg.org.


[Figure 1 shows two pipelines. On the producer side, multiple data tables pass through data preparation to produce a table of feature vectors or summary vectors, which is the input to data mining; the outputs are an XML file for the data preparation and an XML file for the model. On the consumer side, these XML files are used by model consumers to turn data vectors into scores.]

Figure 1. How XML descriptions of data preparation and models can be used to help create standard architectures for model producers and model consumers.

2. Data Attributes, Mining Attributes, and Derived Attributes

In this section, we briefly review how data attributes, mining attributes and derived attributes are used in PMML. Feature vectors, which are the inputs to models, are defined in PMML in the following way.

1. A data dictionary defines a set of data attributes.

2. A mining schema can identify one or more data attributes as mining attributes. A mining attribute also includes additional information, for example, how the data attribute is used by a model (as an independent attribute, a predicted attribute, or excluded). Some mining attributes are used directly as inputs to a model.

3. Other mining attributes can be used as inputs to the Transformation Dictionary to define derived attributes for the model.

See Figure 2. In PMML Version 2.1, the Transformation Dictionary contains the required specifications for normalization, discretization, value mapping, and aggregation functions. See Example 1 for a simple example of a discretization; a schematic sketch of how these three pieces fit together is given below.
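The fragment below sketches a data dictionary, a mining schema, and a transformation dictionary. It is purely illustrative: the field names, usage types and the derived field are invented for this sketch and are not taken from any of the examples in this paper.

<DataDictionary numberOfFields="3">
  <DataField name="age" optype="continuous"/>
  <DataField name="income" optype="continuous"/>
  <DataField name="response" optype="categorical"/>
</DataDictionary>
...
<MiningSchema>
  <MiningField name="age"/>                             <!-- independent attribute -->
  <MiningField name="income"/>                          <!-- independent attribute -->
  <MiningField name="response" usageType="predicted"/>  <!-- predicted attribute -->
</MiningSchema>
<TransformationDictionary>
  <!-- a derived attribute defined from the mining attribute "income",
       e.g. by a normalization or a discretization as in Example 1 -->
  <DerivedField name="incomeBand" optype="categorical">
    ...
  </DerivedField>
</TransformationDictionary>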

[Figure 2 diagram. The Data Dictionary lists field 1 through field 5; the MiningSchema selects field 1, field 3 and field 4; the Transformation Dictionary defines derived field 3 and derived field 4; the Model receives the attributes and derived attributes as inputs, along with model parameters, model statistics and model extensions.]

Figure 2. This figure illustrates how four inputs to a model can be defined: two of the attributes are taken directly from the MiningSchema and two of the attributes are derived and defined using the Transformation Dictionary.

3. Basic Ideas

There are three basic ideas at the core of how PMML Version 3.0 supports composition.

1. The first idea is that models can be embedded in other models. In this case, much of the model infrastructure of the embedded model (e.g., mining attributes, transformation dictionary, etc.) is not required. For this reason, PMML Version 3.0 supports a simplified version of a model, called an embedded model. See Table 1. For example, a stand-alone tree model is called a TreeModel, while an embedded tree model is called a DecisionTree; a schematic comparison of the two forms follows Table 1.

2. The second idea is that model sequencing can be supported by using essentially the same approach as that used to define DerivedFields with a TransformationDictionary. More specifically, an embedded model can be used to define what is called a ResultField, which can then be used as an input to a model in essentially the same way as a DerivedField.

3. The third idea is to provide a container, called a MiningModel, for a collection of models. Model selection can then be supported by using a DecisionTree within a MiningModel. Voting in ensembles can be supported by simply using a RegressionModel within a MiningModel.

We note that the philosophy in PMML has been to employ the simplest mechanism that will support the desired outcome. For example, models are defined explicitly by specifying parameters, instead of supporting arbitrary code to define models. In the same way, the three ideas above are powerful enough to support a wide variety of the model compositions that occur in practice, so we do not need to define a more general mechanism for composing data mining operations. The reason for this philosophy is that it simplifies the design of PMML consumers.

Stand-alone model     Model used in selection or sequencing     Main content
RegressionModel       Regression                                RegressionTable
TreeModel             DecisionTree                              Nodes
…                     …                                         …

Table 1. This table describes how the same model may be used in a stand-alone fashion or as an embedded model.
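As a purely schematic illustration of the contrast in Table 1, the fragment below compares a stand-alone tree model with its embedded counterpart. The element contents are elided, and the function attribute follows the draft syntax used in Examples 2 and 3; the fragment is a sketch, not text from the specification.

<!-- stand-alone form: a complete model with its own mining schema -->
<TreeModel function="classification">
  <MiningSchema> ... </MiningSchema>
  <Node> ... </Node>
</TreeModel>

<!-- embedded form: the simplified DecisionTree used inside another model,
     without its own mining schema or transformation dictionary -->
<DecisionTree>
  <Node> ... </Node>
</DecisionTree>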

4. Model Selection

The basic idea used for model selection in PMML is to use the embedded version of the model or models and then use a DecisionTree to select the appropriate model. Model selection in PMML Version 3.0 uses the MiningModel with a regression function as a container and a decision tree as the selection logic. With this approach, model selection is sufficiently powerful to support some common types of ensemble operations, such as voting or averaging. See Example 2.

5. Model Sequencing

As mentioned above, a motivating example of model sequencing is using an attribute defined by a regression model as an input attribute of a decision tree.


The basic idea used in PMML to describe a sequence of models is simple. As Example 3 illustrates, PMML Version 3.0 defines a new element called Regression, which contains enough information to specify a new attribute, which PMML calls a ResultField, and to define this result field using a regression model. The parameters for the regression model are specified using an element called RegressionTable. Once a ResultField is defined in this way, it can be used as an input to the surrounding model, which in this example is a TreeModel. In other words, the ResultField called "term" in Example 3 can be used as an input to any node in the TreeModel.

6. Status

PMML Version 3.0 has not yet been released. The approach used for model selection and model sequencing is still subject to change.

7. Conclusion

In this article, we have described why the composition of data transformations and common statistical and data mining models is important for data preparation. We have introduced the approach used in PMML Version 3.0 for selecting models, that is, for choosing one or more models from a container of models. We have also introduced the approach used in PMML Version 3.0 for using the output of one model as the input to another model, what is called model sequencing.

8. References

[DMG] Data Mining Group, www.dmg.org

<Discretize field="Profit">
  <DiscretizeBin binValue="negative">
    <Interval closure="openOpen" rightMargin="0" />
    <!-- left margin is -infinity by default -->
  </DiscretizeBin>
  <DiscretizeBin binValue="positive">
    <Interval closure="closedOpen" leftMargin="0" />
    <!-- right margin is +infinity by default -->
  </DiscretizeBin>
</Discretize>

Example 1. This example shows how PMML Version 2.1 defines a data preparation operation for discretization.


<PMML>
  ...
  <MiningModel function="regression">
    <MiningSchema> <!-- as usual --> </MiningSchema>
    <!-- ... derived fields as usual ... -->
    <DecisionTree> <!-- wrapper for content as in TreeModel -->
      <Node><True/> <!-- the root node in a DecisionTree is always True -->
        <Node> <!-- 1st sub-tree, is a leaf node -->
          <SimplePredicate field="age" operator="lessOrEqual" value="50"/>
          <Regression> <!-- embedded regression equation -->
            <RegressionTable intercept="2.34">
              <!-- predictors: 0.03*income + 1.23*age -->
            </RegressionTable>
          </Regression>
        </Node>
        <Node> <!-- 2nd sub-tree, is a leaf node -->
          <SimplePredicate field="age" operator="greaterThan" value="50"/>
          <Regression> <!-- embedded regression equation -->
            <RegressionTable intercept="2.22">
              <!-- predictors: 0.01*income - 0.11*age*mc -->
            </RegressionTable>
          </Regression>
        </Node>
      </Node> <!-- end of root node -->
    </DecisionTree>
  </MiningModel>
</PMML>

Example 2. This example illustrates how PMML Version 3.0 supports model selection. In this case, two or more embedded regression models are selected using a decision tree.

(Annotations to Example 2: the two Regression elements are embedded regression models, and the DecisionTree contains nodes that can contain embedded models.)


<PMML>
  ...
  <TreeModel function='regression'>
    <MiningSchema>
      <!-- declare fields "age", "income", "married", ... as usual -->
    </MiningSchema>

    <!-- encode a categorical input field using a PMML 2.1 transformation -->
    <DerivedField name="mc" optype="continuous">
      <MapValues ... >
        <!-- map "yes" to 1.0, map "no" to -1.0 -->
      </MapValues>
    </DerivedField>

    <!-- derive a new input term -->
    <!-- use an embedded regression model as a transformation -->
    <Regression>
      <ResultField name="term" feature="predicted"/>
      <!-- The ResultField selects the predicted value from this Regression
           element and binds it to the name "term" -->
      <!-- RegressionTable as defined in RegressionModel -->
      <RegressionTable>
        <!-- with intercept="2.34" and predictors: 0.03*income + 1.23*age*mc -->
      </RegressionTable>
    </Regression>

    <Node> <!-- this is the root node of the TreeModel -->
      <!-- A decision tree as usual; it can refer to the fields "age", "income"
           and "married", as well as to the derived attribute "mc" and the
           regression attribute "term" -->
      ...
    </Node>
  </TreeModel>
</PMML>

Example 3. This example illustrates how a TreeModel can use attributes that are defined using a regression model.

(Annotations to Example 3: the MapValues block defines the derived field "mc"; the Regression block defines the result field "term" using a regression model; the Node at the end is a tree node that can use both the derived field "mc" and the result field "term".)