
[IEEE 2009 13th International Conference on Computer Supported Cooperative Work in Design, Santiago, Chile, 2009.04.22-2009.04.24. 978-1-4244-3535-7/09/$25.00 ©2009 IEEE]




Autonomic Collaborative RSS: an Implementation of Autonomic Data using Data Killing Patterns

Wallace A. Pinheiro (1,3), Marcelino C. O. Silva (1), Ricardo Barros (1,3), Geraldo Xexéo (1,2), Jano de Souza (1,2)

(1) COPPE/UFRJ, Federal University of Rio de Janeiro, Brazil
(2) DCC-IM, Dept. of Computer Science, Institute of Mathematics, Brazil
(3) IME, Military Institute of Engineering, Brazil

{awallace, marcelino, rbarros, xexeo, jano}@cos.ufrj.br

Abstract

Corporate and personal computers are flooded by a huge amount of data, much of it irrelevant, redundant, false, wrong, or obsolete. Moreover, treating this data is relatively complex: systems need to check, transform, adapt, and summarize data in order to use it. These activities cost time and money for companies that should instead be concerned with business rules. RSS feeds exhibit these problems: users receive a large number of news items, sometimes irrelevant or duplicated, and the available tools do not provide efficient mechanisms to manage information overload. Reaching that goal normally requires adding new complexity to feed readers, which is often undesirable. Therefore, we propose Autonomic Collaborative RSS, which transfers the system complexity to the data itself, facilitating system development. At the same time, it incorporates data treatment rules as well as data filtering through data killing patterns.

Keywords: Autonomic Data, Autonomic Collaborative RSS, Data Killing Patterns, Petri Nets.

1. Introduction

Technological evolution has led people to deal with more and more information. Using the Internet, people can easily create and publish new information. Despite the obvious advantages of access to this huge amount of information, data overload makes it harder to select good quality information.

In addition, heterogeneous environments sometimes need to share data. When data moves among systems, it frequently needs to be transformed, reprocessed, or adapted. At the same time, different users have different needs: within the same system, data may be represented in different forms and from different points of view, depending on user profiles.

These features make system design a complex job. System designers should be concerned with business rules and processes instead of data problems.

Nowadays, information overload and the complexity of data processing are common problems in many systems. This work focuses on news feeds, which face both problems.

Normally, feed users subscribe to site domains and receive a large amount of news related to those domains. The problem becomes worse when users subscribe to many sites: they receive duplicated or very similar information. Furthermore, feed tools offer few resources to select, filter, and classify news, forcing users to do these jobs manually. Even sharing preferences and opinions is not available.

This research proposes applying the concepts of data killing and autonomic data to minimize the problems mentioned above. Rules based on data killing patterns are adopted to reduce information overload, while autonomic data is adopted to reduce system complexity. Users build rules based on data killing patterns, and those rules are incorporated into an autonomic data entity, named Autonomic Collaborative RSS. It filters news per user, reducing information overload with a minimum of human interference. Additionally, it allows users to share and suggest rules.

At the end of the paper, an experiment shows the advantages of using these data killing patterns in a news feed scenario through Autonomic Collaborative RSS.

2. Related work

The concept of autonomic data was introduced by Silva [1], who adopted it to ease the self-management of a peer-to-peer system. Our research improves and extends this concept, aggregating some features with autonomic collaboration, proposed by Dorn [2], who applies autonomic computing techniques to mediate collaboration among the participants of an activity. Greif [3] suggests another interesting direction, in which the pro-active features of an autonomic system identify common behaviors among the participants of an activity, allowing inferences to be made. In our work, autonomic collaboration aims to




define, share and suggest rules that are used to discard news from RSS news feeds.

Despite their potential, RSS applications1 are quite limited [4]. Users subscribe to news feeds from a site by adding links to feed files in a feed reader (aggregator). From then on, they periodically receive updated news without visiting that site. Currently, feed readers, such as the Firefox feed client2, just store and organize news from newest to oldest. They do not offer additional features such as filter tools based on keywords, rules, or user profiles. Therefore, users receive a considerable amount of irrelevant news. Our work applies the data killing patterns introduced by Pinheiro [5] to reduce the volume of undesirable news. These patterns are based on high-level Petri nets provided by CPN Tools [6][7].

3. Autonomic data

The complexity of systems that deal with a huge and heterogeneous quantity of data is a problem for system designers. It would be desirable for pro-active features to be incorporated into data, facilitating its processing by different systems, environments, and users.

With this problem in mind, this work applies the concept of autonomic data. Silva [1] defines autonomic data as a set of data, metadata, and rules. Since then, the concept has evolved through the addition and extension of some features, aiming to provide more adaptability and robustness. Thus, we redefine the concept as follows:

Autonomic Data is a pro-active entity that, besides data, can contain, for each context in which it can be processed, the following features:
• Metadata that describes data characteristics and semantics;
• Rules that describe data behavior for each data context;
• An interface that encapsulates data and provides a way to access only relevant data;
• Processing capacity;
• Different granularities, depending on the context and goals.

Different granularities allow applications with different concepts of data size to use the same set of data easily. For example, an application that processes Web pages can consider a page as data, while an application interested in people's names can consider a Web page a set of data. An autonomic data should provide these two sizes of granularity, allowing different behaviors in different situations.

Besides the definition, this work suggests a list of ten commandments that an autonomic data should follow:
1 – You shall know yourself;
2 – You shall know your environment and the context around your related activities;
3 – You shall have the capability of moving to other environments when necessary;
4 – You shall provide suitable information in the environment and context where you are, hiding your complexity;
5 – You shall cooperate with your neighbors, aiming at improving yourself and the community;
6 – You shall encapsulate information and provide a communication interface that allows information access to your neighbors;
7 – You shall configure and reconfigure yourself under different conditions and environments;
8 – You shall heal yourself;
9 – You shall improve yourself continuously, trying to find optimizations;
10 – You shall protect yourself.

1 http://blogspace.com/rss/readers

In summary, an autonomic data should be polymorphic, adapting itself to different situations and environments; should be mobile; should abstract the relevant information for each situation; should hide its complexity; should cooperate through information exchange; should encapsulate data through an interface; and should have the four self-* autonomic features (Configuring, Healing, Optimizing, and Protecting) [8] presented in commandments 7, 8, 9, and 10.
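The definition above can be illustrated with a minimal sketch. The class and field names below are hypothetical, not part of the paper's implementation; the point is only the shape of the entity: data bundled with metadata and per-context rules behind a small access interface.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class AutonomicData:
    """Sketch of an autonomic data entity: data plus metadata plus
    per-context rules, accessed only through a small interface."""
    data: List[Any]
    metadata: Dict[str, str] = field(default_factory=dict)
    # rules map a context name to a predicate deciding what to expose
    rules: Dict[str, Callable[[Any], bool]] = field(default_factory=dict)

    def view(self, context: str) -> List[Any]:
        # interface: expose only the items relevant to the given context
        rule = self.rules.get(context, lambda item: True)
        return [item for item in self.data if rule(item)]

feed = AutonomicData(
    data=["stocks rally on earnings", "cup final tonight", "stocks slip again"],
    metadata={"source": "example-feed"},
    rules={"business": lambda item: "stocks" in item},
)
print(feed.view("business"))  # ['stocks rally on earnings', 'stocks slip again']
```

A context with no rule simply exposes everything, which mirrors the idea that behavior depends on the context in which the data is processed.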

4. Autonomic collaborative RSS

Autonomic Collaborative RSS is an application of autonomic data. This implementation focuses on rules based on data killing patterns. Among the news feed formats, we chose RSS 2.03 for this work because it is the most widely used specification.

To address the information overload problem, this research uses rules to filter irrelevant data. Autonomic Collaborative RSS embodies, analyses, and processes rules. The autonomic data can share these rule filters, depending on the rule restriction level.

We created a plug-in for Mozilla Firefox4, called Feed Organizer, to access Autonomic Collaborative RSS. Fig 1 shows the plug-in interface. To use the tool, users provide their preferred feeds, some data about topics of interest and prohibited topics, and the desired similarity degree with these topics. Then, rules are

2 http://johnbokma.com/firefox/rss-and-live-bookmarks.html
3 http://cyber.law.harvard.edu/rss/rss.html
4 http://www.mozilla.org/



created based on these preferences. Moreover, users can choose the rule visibility, in other words, how the rules will be shared. They can choose among 3 sharing levels: private (only the user can see the rule), restricted by group (only members of the selected group can see the rule), and free (everyone can see the rule). Besides, users can access a forum to discuss rule filters.

Fig 1 – Feed Organizer Plug-in

The main advantage of the group restriction is that autonomic data can share information only with users belonging to the same group. For example, consider a group that wants to discuss filtering specific news about a company that will negotiate stocks. Since this group does not want other people to know about it, these rules can be classified as restricted by group.

Autonomic Collaborative RSS automatically incorporates all rules from all users. It provides a rule suggestion list, based on public and group-restricted rules, to the users that are allowed to see those rules.
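The three sharing levels reduce to a simple visibility check. The function and rule fields below are hypothetical names chosen for illustration; the paper does not specify its internal representation.

```python
def can_view(rule: dict, user: str, user_groups: set) -> bool:
    """Decide whether a user may see a rule under the three sharing levels."""
    visibility = rule["visibility"]
    if visibility == "free":
        return True                          # everyone can see the rule
    if visibility == "group":
        return rule["group"] in user_groups  # only members of the group
    return rule["owner"] == user             # private: only the owner

# a rule restricted to the (hypothetical) "traders" group
stock_rule = {"owner": "alice", "visibility": "group", "group": "traders"}
print(can_view(stock_rule, "bob", {"traders"}))   # True: bob is in the group
print(can_view(stock_rule, "carol", {"sports"}))  # False: wrong group
```

The suggestion list described above would then be the set of rules for which `can_view` holds for a given user.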

Fig 2 – Results

Fig 2 shows an example of the result after applying rules created by a user. Instead of a huge number of news items, the user sees only the news that matches the rules.

5. Architecture and proposed patterns

The system is based on a server and a Firefox plug-in installed on the user's computer. The server, called RSS+ Server, downloads the XML files of the RSS feeds. These XML files include the OWL rules created by the user and an OWL Schema that describes the file. Fig 3 illustrates the whole system; in the figure we can see clients sharing rules.

Fig 3 – System Architecture

Jena and CPN Tools are part of the autonomic data and are responsible for rule processing. Jena, which processes the OWL rules, communicates with CPN Tools, which runs the data killing patterns, according to the algebra in Table 1. In this table, non-terminal symbols are enclosed in "<" and ">"; the symbols <integer> and <text> express their own meanings and therefore correspond to terminal symbols in this representation.

This algebra considers each pattern as an operator for data killing. Note that, in addition to the data and the similarity degree, additional parameters are defined, such as stemming, stop words, and language.

These parameters allow, for example, defining different languages and pre-processing a text before the operators' execution (eliminating stop words and applying stemming techniques based on the Porter stemming algorithm [9]). For this research, we adopted the Snowball stemmer and considered only English texts.
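The pre-processing step can be sketched as follows. The stop-word list is an illustrative subset, and `naive_stem` is a deliberately crude suffix stripper standing in for the Snowball stemmer the paper actually uses; only the pipeline shape (tokenize, drop stop words, stem) reflects the text.

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and"}  # illustrative subset

def naive_stem(word: str) -> str:
    # crude suffix stripping; the paper uses the Snowball stemmer instead
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str, stemming: bool = True, stopwords: bool = True):
    """Tokenize, optionally remove stop words, optionally stem."""
    tokens = text.lower().split()
    if stopwords:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stemming:
        tokens = [naive_stem(t) for t in tokens]
    return tokens

print(preprocess("The markets reacted to falling prices"))
# ['market', 'react', 'fall', 'price']
```

The `stemming` and `stopwords` flags mirror the boolean parameters of the algebra in Table 1.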

For rule definitions in Autonomic Collaborative RSS, we used the granularity of all news from a particular domain, captured by the RSS+ server. An RSS+ data set can vary from a few news items to an unlimited number.



Rules defined by users can be applied to this data set. Thus, instead of viewing the entire RSS data set, users see only the news that matches the defined rules.

TABLE 1 – DATA KILLING ALGEBRA
Syntax and symbols of the operators in BNF:

<S> ::= query <command> \n
<command> ::= <pattern> <data>
<data> ::= (<integer><metadata><text>)<data> | (<integer><metadata><text>)
<metadata> ::= <text>
<pattern> ::= ad<parameters1> | dp<parameters1> | ds<parameters2>
<parameters1> ::= <comparedMetadata><parameters>
<parameters2> ::= (comparedMetadata=<metadata>)<parameters>
<comparedMetadata> ::= (comparedMetadata=<metadata>) | (comparedMetadata=<metadata>)<specificMetadata>
<specificMetadata> ::= (<metadata>=<text>)<specificMetadata> | (<metadata>=<text>)
<parameters> ::= (similarityStrategy=Jaccard|Keyword)(similarityCoeficient=<integer>)(stemming=<bool>)(stopwords=<bool>)(language=english|portuguese)
<bool> ::= true | false
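As an illustration of the algebra in Table 1, the hypothetical helper below assembles a command string of the kind Jena would pass to CPN Tools. The field names follow the grammar (including its own spelling "similarityCoeficient"); the whitespace conventions and the helper itself are assumptions, not the paper's actual serialization.

```python
def build_query(pattern: str, compared_metadata: str, items,
                coefficient: int = 30, stemming: bool = True,
                stopwords: bool = True, language: str = "english") -> str:
    """Assemble a query command following the BNF of Table 1 (sketch).
    'items' is a list of (integer, metadata, text) tuples, one per news item."""
    params = (f"(comparedMetadata={compared_metadata})"
              f"(similarityStrategy=Jaccard)"
              f"(similarityCoeficient={coefficient})"   # grammar's spelling
              f"(stemming={str(stemming).lower()})"
              f"(stopwords={str(stopwords).lower()})"
              f"(language={language})")
    data = "".join(f"({i} {meta} {text})" for i, meta, text in items)
    return f"query {pattern}{params} {data}\n"

q = build_query("dp", "title", [(1, "title", "stocks fall")])
print(q)
```

Here `dp` selects the Discarding Prohibited Data operator; `ad` and `ds` would be built analogously.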

Currently, all rules defined by users are composed of at least one of the following data killing patterns: Allowed Data, Discarding Prohibited Data, and Discarding Similar Data. They use the similarity degree, set by users, to compare news and parameters. These patterns are detailed below:

1) Allowed Data: it compares the content of one or more attributes of a news item with the content of a parameter obtained from the environment and the user's rules. Each news item whose attribute(s) are considered similar to the parameter is returned. Fig 4 describes this pattern. It shows an external repository that contains the news to be analyzed. Moreover, there is an internal repository that contains the allowed news, and a trash that contains the news considered not allowed or irrelevant after rule processing.

Fig 4 - Allowed Data Pattern

Gray places represent the control flow of the pattern, consisting of parameters and events. This flow is composed of two rules: R1 (indicated by transition R1), which checks whether recent news has arrived, and R2 (transition R2), which effectively eliminates news. The pattern is triggered when it receives an event (in the place eventSource). Then disallowed news is moved to the trash, leaving only the allowed news in the internal repository.

2) Discarding Prohibited Data: it compares the content of one or more attributes of a news item with a parameter. Each news item whose attribute(s) are considered similar to the parameter is not returned. Fig 5 describes this pattern. It shows an internal repository that contains the news to be returned; initially, this repository may contain both prohibited and allowed news. Moreover, there is a trash that contains the prohibited news after rule processing. Gray places represent the control flow of the pattern, consisting of parameters and events. This flow is composed of two rules: R1 (transition R1), which checks whether recent news has arrived, and R2 (transition R2), which effectively eliminates news. The pattern is triggered when it receives an event (in the place eventSource). News in the internal repository is analyzed according to the parameters passed by the environment and users (stored in the place parameters). Then the rules are triggered and, at the end of processing, only the news remaining in the internal repository is returned.

Fig 5 - Discarding Prohibited Data Pattern

3) Discarding Similar Data: it compares news with news. Among news items considered similar to each other, only the newest one is returned. Fig 6 describes this pattern. Gray places represent the control flow of the pattern, consisting of parameters and events. This flow is composed of two rules: R1 (transition R1), which checks whether recent news has arrived, and R2 (transition R2), which effectively eliminates news. The pattern is triggered when it receives an event (in the place eventSource). News in the internal repository is analyzed according to the parameters passed by the environment and users (stored in the place parameters). Then the rules are triggered and, at the end of processing, only the news remaining in the internal repository is returned.
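Abstracting away the Petri-net machinery, the filtering effect of the three patterns can be sketched as plain functions. The word-set similarity, threshold values, and record layout (`text`, `date` fields) below are assumptions for illustration; in the actual system these decisions run inside CPN Tools.

```python
def jaccard_words(a: str, b: str) -> float:
    # word-set Jaccard similarity, used here as a stand-in similarity function
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def allowed_data(news, parameter, threshold):
    """AD: return only the news similar to the parameter."""
    return [n for n in news if jaccard_words(n["text"], parameter) >= threshold]

def discarding_prohibited(news, parameter, threshold):
    """DP: drop the news similar to the parameter, return the rest."""
    return [n for n in news if jaccard_words(n["text"], parameter) < threshold]

def discarding_similar(news, threshold):
    """DS: among mutually similar news items, keep only the newest one."""
    kept = []
    for n in sorted(news, key=lambda x: x["date"], reverse=True):  # newest first
        if all(jaccard_words(n["text"], k["text"]) < threshold for k in kept):
            kept.append(n)
    return kept

news = [
    {"text": "stocks fall", "date": "2009-01-01"},
    {"text": "stocks fall today", "date": "2009-01-02"},
    {"text": "cup final", "date": "2009-01-03"},
]
```

With this toy data, `allowed_data(news, "stocks", 0.3)` keeps the two stock items, while `discarding_similar(news, 0.5)` drops the older of the two near-duplicates.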



Since this architecture is a work in progress, only these three patterns are available, and we have implemented only the Jaccard coefficient [10] as a similarity measure. Other patterns, as well as other strategies for similarity detection, will be incorporated into future versions.

Fig 6 - Discarding Similar Data Pattern
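The Jaccard coefficient mentioned above has a direct implementation: the size of the intersection of two sets divided by the size of their union. Applying it to the word sets of two texts (an assumption here; the paper does not state its exact tokenization) gives:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient of two texts over their word sets:
    |intersection| / |union|, a value in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not (wa | wb):
        return 0.0  # two empty texts: define similarity as 0
    return len(wa & wb) / len(wa | wb)

print(jaccard("stocks fall sharply", "Stocks fall slowly"))  # 0.5
```

A user-chosen similarity degree of, say, 40% then corresponds to keeping (AD) or discarding (DP) items whose coefficient is at least 0.4.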

6. Experiment

The experiment aimed at analyzing the proposed tool in terms of autonomic data rule processing and the results obtained, considering the patterns Allowed Data (AD) and Discarding Prohibited Data (DP). It combined data collected from 18 undergraduate students. We randomly selected 90 news items from 3 news outlets (New York Times, Herald Tribune, and Washington Post), covering 3 subjects: Sports, Business, and US news.

Here, each analyzed domain (Business, Sports, and US) corresponds to an autonomic data. These data do not exchange information among themselves, but they filter the news presented to the users. These filters, as seen previously, correspond to rules based on data killing patterns. In this experiment, we did not use the composition of operators (AND and OR); we analyzed the results obtained using each pattern separately.

Fig 7 and Fig 8 show the AD and DP characteristic curves, relating the average number of news items returned to users to the similarity degree between the news and the user-selected keywords. The range varies from 1% to 40% similarity. The curves obey the expected behavior: as the similarity level rises, the number of news items returned by the AD operator decreases, while the DP operator shows the inverse behavior. Moreover, the average number of results returned by the AD operator is relatively low (less than 30% of the news), revealing that users are interested in a small subset of the news provided by the Web feeds. This excessive number of news items may make it difficult for users to find relevant information. In contrast, the DP operator discards few news items (at most 100% - 87% = 13%). However, its goal is to eliminate prohibited subjects while allowing the rest.

Fig 7 - AD Operator Characteristic Curve

Fig 8 - DP Operator Characteristic Curve

Fig 9 shows the average precision of the users' searches as a function of the similarity level, using the DP operator. Precision [10] can be defined as the number of relevant returned answers divided by the number of returned answers. In our case, it is the number of relevant returned news items divided by the number of returned news items. We consider relevant any news item that contains a keyword selected by the user.

It is possible to see that high precision values correspond to low similarity values. The average precision for the DP operator reaches 91% at its minimum, which shows that users used isolated keywords and few keyword combinations. The average precision for the AD operator is always 100% for any similarity level. This happens because the operator aims at recovering relevant news according to the user's search and, for any similarity level, it returns only relevant answers.



Fig 9 - DP Operator Precision

Fig 10 shows the average recall of the users' searches as a function of the similarity level, using the AD operator. Recall [10] can be defined as the number of relevant returned answers divided by the number of relevant answers. In our case, it is the number of relevant returned news items divided by the number of relevant news items. The average recall for the DP operator is always 100% for any similarity level. This happens because the operator aims at removing irrelevant news according to the user's search and, even in the most restricted case, the result still contains all relevant answers. If the similarity level is increased, more irrelevant news items are included.

Fig 10 – AD Operator Recall
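The precision and recall definitions used above reduce to two one-line formulas over sets of news identifiers. The identifiers below are made up for illustration:

```python
def precision(returned, relevant):
    """Relevant returned items over all returned items."""
    return len(set(returned) & set(relevant)) / len(returned) if returned else 0.0

def recall(returned, relevant):
    """Relevant returned items over all relevant items."""
    return len(set(returned) & set(relevant)) / len(relevant) if relevant else 0.0

returned = {"n1", "n2", "n3", "n4"}   # news ids returned by an operator
relevant = {"n1", "n2", "n5"}         # news ids containing a user keyword
print(precision(returned, relevant))  # 2/4 = 0.5
print(recall(returned, relevant))     # 2/3
```

An AD result that only ever returns relevant items has precision 1.0 by construction, matching the 100% curve reported for the AD operator.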

It is important to take into account that the results consider the answers of all 18 students, including cases where a set of keywords did not match any news item. In other words, no results were discarded.

7. Future Trends and Conclusion

This work addresses two main problems, information overload and system complexity, through the concepts of autonomic data and data killing patterns. The proposal adopts collaboration among users to suggest rule filters that reduce the effects caused by huge amounts of data. Patterns and user rules are both encapsulated in autonomic data. This way, these data can collaborate, share, and recommend data-discarding rules, depending on the rule classification.

The presented concepts are analyzed through an experiment that applied data killing patterns to news filtering. The results showed that this approach significantly reduces the number of news items returned to users.

Future work will incorporate new data killing patterns and use more of the autonomic characteristics cited in this research. Besides, we also intend to improve the collaboration among autonomic data, allowing user behavior patterns to be shared and recommended in an autonomic way.

Acknowledgement

This work was supported in part by CNPq (the Brazilian National Research Council) and the Brazilian Army.

References

[1] M. Silva, J. M. Souza, J. Sampaio, "Autonomic Data: Use of Meta Information to Evaluate, Classify and Reallocate Data in a Distributed Environment", in LAACS 2008, the 3rd Latin American Autonomic Computing Symposium, Gramado, RS, 2008, pp. 73-76.

[2] C. Dorn, H. L. Truong, S. Dustdar, "Measuring and Analyzing Emerging Properties for Autonomic Collaboration Service Adaptation", in Autonomic and Trusted Computing, pp. 162-176, 2008.

[3] I. Greif, "CSCW: What does it mean?", in CSCW '88: Proceedings of the Conference on Computer-Supported Cooperative Work, September 26-28, 1988, Portland, Oregon, ACM, New York, NY, 1988.

[4] H. Liu, V. Ramasubramanian, E. Sirer, "Client Behavior and Feed Characteristics of RSS, a Publish-Subscribe System for Web Micronews", in Internet Measurement Conference, Berkeley, CA, US, 2005.

[5] W. A. Pinheiro, G. Xexéo, J. Souza, "Autonomic Patterns: Modelling Data Killing Patterns using High-Level Petri Nets", in Fourth International Conference on Autonomic and Autonomous Systems, IEEE Computer Society, 2008.

[6] A. V. Ratzer et al., "CPN Tools for Editing, Simulating, and Analysing Coloured Petri Nets", in Proceedings of the 24th International Conference on the Application and Theory of Petri Nets, Eindhoven, The Netherlands, Springer, 2003.

[7] K. Jensen, L. M. Kristensen, L. Wells, "Coloured Petri Nets and CPN Tools for Modelling and Validation of Concurrent Systems", International Journal on Software Tools for Technology Transfer (STTT), 2007.

[8] IBM, "An Architectural Blueprint for Autonomic Computing", IBM White Paper, 2005. Available at: http://www-03.ibm.com/autonomic/pdfs/ACBP2_2004-10-04.pdf

[9] M. Porter, "The Porter Stemming Algorithm", Cambridge University, United Kingdom, 2006. Available at: http://tartarus.org/~martin/PorterStemmer/

[10] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 1999.