INTELLIGENT WAYS OF RETRIEVING & REPRESENTING CONTENT ON THE WEB

ADITYA BIR

COMS E6125

WEB-ENHANCED INFORMATION MANAGEMENT

Spring 2011

Prof. Gail Kaiser

Information Overload

• Web is flooded with information

• New information is continually created from existing information

• User becomes overwhelmed

• Too much information makes it hard to reach a decision

Way to Filter Information

• Today, web users do not just want to search for information on the web; they want information to come to them.

• Various mechanisms have been proposed to extract, filter and present information to the user.

• Effectively extracting and organizing this information has been a challenge from the beginning.

The Entire Process

• Searching

• Scanning

• Retrieving

• Extracting or Filtering

• Presenting

THIN LINE BETWEEN INFORMATION RETRIEVAL AND INFORMATION EXTRACTION

Information Retrieval

Definition: Information retrieval (IR) is the part of computer science that studies the retrieval of information from a collection of written documents.

Information Extraction

Definition: Information extraction (IE) denotes any activity whose goal is to automatically identify and acquire pre-specified sorts of information or data from natural language texts, aggregate them, and store them in a unified and structured database.
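The difference between the two definitions can be sketched in a few lines. This is a minimal illustration, not a production system: the three documents, the keyword-overlap ranking, and the "is located in" pattern are all assumptions made up for the example. Retrieval returns whole documents that match a query; extraction returns structured records of one pre-specified kind.

```python
import re

# Hypothetical mini-collection; the texts are invented for illustration.
DOCS = {
    "d1": "Columbia University is located in New York.",
    "d2": "Flickr lets users tag photographs.",
    "d3": "New York hosts many universities and museums.",
}


def retrieve(query, docs):
    """Information retrieval: rank document ids by how many query terms they contain."""
    terms = set(query.lower().split())
    scores = {}
    for doc_id, text in docs.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        overlap = len(terms & words)
        if overlap:
            scores[doc_id] = overlap
    # Highest overlap first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Assumed pattern for one pre-specified relation: "<Entity> is located in <Place>."
LOCATED_IN = re.compile(
    r"(?P<entity>[A-Z][\w ]+?) is located in (?P<place>[A-Z][\w ]+)\."
)


def extract(docs):
    """Information extraction: collect structured (entity, place) records."""
    records = []
    for doc_id, text in docs.items():
        for match in LOCATED_IN.finditer(text):
            records.append(
                {
                    "entity": match.group("entity"),
                    "place": match.group("place"),
                    "source": doc_id,
                }
            )
    return records


print(retrieve("new york", DOCS))
print(extract(DOCS))
```

Retrieval leaves the reading to the user, while extraction produces rows that could be stored directly in a database table.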

WEB BASED AGGREGATION OF METADATA

• Comparison Aggregation

• Relationship Aggregation

Comparison Aggregation

Relationship Aggregation

• People maintain multiple bank accounts

• They don’t want to remember logins

• Relationship Aggregation takes care of this hurdle

• With one login, a user's information can be retrieved automatically

• E.g., Facebook and Google accounts

Technologies involved to make this happen

• TAGGING

• MASHUPS

• RSS

TAGGING

• Tagging is a process by which users assign labels, in the form of keywords, to web objects in order to share, discover, and recover them.

TAGGING

Tags: University, New York

Consider the example of Columbia University in New York. If you had to file information about Columbia University in a file system on your computer, you would have to choose one of the following locations:

articles\United States\University

articles\United States\New York
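A small sketch of why tags avoid this either/or choice: a folder path forces one location per item, while a tag index lets the same item carry both labels at once. The function names `tag` and `find`, and the two extra universities, are assumptions introduced only for illustration.

```python
from collections import defaultdict

# Tag index: label -> set of items carrying that label.
tag_index = defaultdict(set)


def tag(item, *labels):
    """Attach any number of labels to an item."""
    for label in labels:
        tag_index[label.lower()].add(item)


def find(*labels):
    """Return the items that carry every one of the given labels."""
    sets = [tag_index.get(label.lower(), set()) for label in labels]
    return set.intersection(*sets) if sets else set()


# Columbia University is filed under both labels, with no hierarchy to pick.
tag("Columbia University", "University", "New York")
# Extra items (assumed, for contrast only).
tag("New York University", "University", "New York")
tag("Stanford University", "University", "California")

print(sorted(find("University", "New York")))
print(sorted(find("California")))
```

Looking up by "University", by "New York", or by both works equally well, which a single folder path cannot offer.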

FOLKSONOMY

• Aggregation of tags creates Folksonomy

• The word Folksonomy, derived from "Folk Taxonomy", was coined by Thomas Vander Wal

• Folksonomy is a way in which a group of users who share a common vocabulary classify objects with similar tags

• Folksonomy enables easy searching and aggregation of the metadata which can be presented to users with similar searching preferences
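As a rough sketch of how a folksonomy emerges from individual tagging acts, the snippet below aggregates (user, resource, tag) events into per-resource tag counts and a reverse index for searching. The event data and the function names are invented for illustration; real systems such as Delicious or Flickr are not claimed to work this way.

```python
from collections import Counter, defaultdict

# Hypothetical tagging events: (user, resource, tag).
EVENTS = [
    ("alice", "columbia.edu", "university"),
    ("bob", "columbia.edu", "university"),
    ("bob", "columbia.edu", "new-york"),
    ("alice", "flickr.com", "photos"),
    ("carol", "flickr.com", "photos"),
    ("carol", "flickr.com", "sharing"),
]


def build_folksonomy(events):
    """Aggregate individual tags into per-resource tag counts."""
    per_resource = defaultdict(Counter)
    for _user, resource, tag in events:
        per_resource[resource][tag] += 1
    return per_resource


def resources_for(folksonomy, tag):
    """Search: resources carrying a tag, most frequently tagged first."""
    hits = [(counts[tag], res) for res, counts in folksonomy.items() if counts[tag]]
    return [res for _count, res in sorted(hits, key=lambda h: (-h[0], h[1]))]


folksonomy = build_folksonomy(EVENTS)
print(folksonomy["columbia.edu"].most_common(1))
print(resources_for(folksonomy, "photos"))
```

No user had to agree on a hierarchy beforehand; the shared vocabulary shows up in the counts.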

ADVANTAGES OF FOLKSONOMY

• Folksonomy saves users time and effort

• Folksonomy has a huge impact on communication and sharing of information as well as on personal organization.

• Groups of users do not need to agree on hierarchical rules to tag web content; they just need to understand the meaning of a tag to label similar material.

• Since the web today has become very social, folksonomy enhances sharing because it has social networking built into it.

DISADVANTAGES OF FOLKSONOMY

• Ambiguous documents can be retrieved as a result of the irregular and unsynchronized use of vocabulary for tagging

• Lack of synonym control lets different word forms of the same concept, and identical tags with different meanings, coexist unreconciled

DELICIOUS & FLICKR

• Delicious was an online bookmarking website where users could bookmark documents

• Users could further describe each bookmark with tags.

• Flickr, on the other hand, allowed users to tag photographs at the time of publishing.

• It also has a mechanism that allows friends and family to add tags to photographs, but only with the consent of the content's creator

MASHUPS

About three years ago, a hacker named Paul Rademacher found a security hole in the Google Maps web application. Rather than disclosing the issue privately to Google, he built and published an exploit—but instead of landing him in a courtroom, his exploit, the housingmaps.com mashup, landed him a job with Google. Moreover, instead of patching the security hole, Google documented it and called it an API

MASHUPS

• Mashups are one of the components of Web 2.0 technologies

• A mashup is a combination of application components such as web services, content, and openly published APIs, used to dynamically extract information from two or more web applications and create one integrated, dynamic entity.

• Mashup designers aim to give web users ad hoc integration of a wide variety of applications, live data sources, services, and rich navigation.

TYPES OF MASHUPS

• Data mashups: bringing together and cross-referencing data from various web sources

• Consumer mashups: different visualizations and data elements for more appealing consumption of information

• Business mashups: internal combinations of company resources, often enhanced with external web services
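A data mashup in the sense above can be sketched as a join between two independent sources. In this hedged example, two in-memory structures stand in for the responses of a listings API and a geocoding API; the listings, ZIP codes, coordinates, and the `mash` function are all invented for illustration, and no real service is called.

```python
# Stand-in for a response from a hypothetical listings service.
LISTINGS = [
    {"id": 1, "title": "Studio near campus", "zip": "10027", "rent": 1800},
    {"id": 2, "title": "Two-bedroom loft", "zip": "11201", "rent": 3200},
    {"id": 3, "title": "Room in shared flat", "zip": "99999", "rent": 900},
]

# Stand-in for a response from a hypothetical geocoding service
# (approximate, made-up coordinates).
GEOCODES = {
    "10027": (40.81, -73.95),
    "11201": (40.69, -73.99),
}


def mash(listings, geocodes):
    """Cross-reference the two sources into map pins."""
    pins = []
    for item in listings:
        coords = geocodes.get(item["zip"])
        if coords is None:
            # One source has no matching data, so the item cannot be placed.
            continue
        lat, lon = coords
        pins.append(
            {
                "label": f'{item["title"]} (${item["rent"]})',
                "lat": lat,
                "lon": lon,
            }
        )
    return pins


for pin in mash(LISTINGS, GEOCODES):
    print(pin)
```

The third listing is silently dropped because the second source has no entry for it, a small reminder that a mashup is only as complete as its weakest source.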

Advantages of Mashups

• Mashups enable you to effectively leverage Web Parts.

• They enable reuse of existing web applications, reducing time and cost by avoiding the rebuilding of similar applications.

• Mashups are easy to create and are lightweight by nature.

• Mashups enable users to create customized applications of their own interest by simply integrating content from different sources.

Disadvantages of Mashups

• Although Mashups provide a dynamic integration of information, Mashups by nature can be very insecure, as the information presented to a user may raise various privacy and policy concerns.

• Although the number of APIs grows over time, few web services are available to support mashups.

• There are no standards for building mashups.

• The integrity of content extracted from a particular source cannot be guaranteed.

YAHOO PIPES

RSS (REALLY SIMPLE SYNDICATION)

• RSS is a collection of web formats used for publishing updates from dynamic web sites, portals, and services, such as blog entries, headlines, audio and video, and other resources, in a standardized format.

• RSS is a kind of content aggregation where a particular section of a website is shared among various other websites.

• Many operating systems are RSS-centric, giving users access to weather, news, and sports updates right on their desktop
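To make the "standardized format" concrete, here is a minimal sketch that reads an RSS 2.0 document with Python's standard `xml.etree.ElementTree` module. The feed text, its titles, and the example.com links are made up for illustration; a real reader would first fetch the XML over HTTP.

```python
import xml.etree.ElementTree as ET

# A tiny invented RSS 2.0 feed: one channel containing two items.
FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <description>Illustrative feed</description>
    <item>
      <title>First headline</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def read_items(xml_text):
    """Return the title and link of every <item> in an RSS document."""
    root = ET.fromstring(xml_text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]


for entry in read_items(FEED):
    print(entry["title"], "->", entry["link"])
```

Because every publisher uses the same element names, one small reader like this can aggregate updates from many sites.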

Facebook News Feed

Advantages of RSS

• When information is published on a website, it is automatically propagated to all subscribed users in a timely fashion.

• All information is in one place.

• It benefits the user as well as the publisher.

• The publisher can publish easily and does not have to maintain a database, which saves the cost of storing information.

• It saves the search cost and time.

• It can be used for advertisement.

• It is free of noise and spam, since the user explicitly subscribes to the service.

Disadvantages of RSS

• Web content publishers are unaware of the number of users using their information

• RSS Feeds can be responsible for heavy load on the server.

• Users can be overwhelmed with information if it is not filtered appropriately

COMPARISON BETWEEN TAGGING (FOLKSONOMY), MASHUP AND RSS

• Social Web

• Time and Search Costs

• Security, Privacy and Policy Concerns

• Intelligence

• Information Retrieval, Extraction

• Information Presentation or Content Aggregation

• Which Technology is better?


Social Web

• Since the Web has become more social over the period of time, Tagging and RSS Feeds play an important role in making the Web more social.

• Tagging through Folksonomy categorizes tags of people with identical vocabulary together.

• RSS helps in information publishing and sharing.

• Since RSS will eventually end up as part of mashup technology, mashups contribute indirectly to the Social Web


Time and Search Costs

• Tagging, mashups, and RSS are all quite successful in saving time and search cost.


Security, Privacy and Policy Concerns

• Although Mashups are quite popular among web users, they expose users to a variety of security risks

• Mashup sites can further be used for their components as data sources by other mashup websites. This makes it difficult to figure out as to how each mashup component is being used.

• Since RSS feeds eventually become part of mashups and function under a similar paradigm, they too raise security, privacy, and policy concerns

• Tagging has no impact on security, privacy, and policy concerns.


Intelligence

• Tagging and RSS Feeds have begun to exhibit more intelligent behavior.

• Mashups, by contrast, just aggregate web components.

• Various algorithms are being implemented to search and filter tags based on Folksonomy.


Information Retrieval, Extraction

• Mashups can be categorized as an information retrieval technology.

• Tagging facilitates information retrieval, and through folksonomy it also facilitates further filtering and information extraction.

• RSS can be categorized as both an information retrieval and an information extraction technology.


Information Presentation or Content Aggregation

• Only mashups and RSS can be categorized as web content presentation technologies


Which Technology is better?

• Mashups, RSS, and tagging have all contributed to fetching information and presenting it to the web user in a faster and more effective way.

• Ranking one technology above the others will not yield a strong, collaborative solution for enhancing information retrieval and presentation

SOLUTION FOR A BETTER WEB CONTENT RETRIEVAL AND PRESENTATION MECHANISM

• Using mashups, tagging, or RSS independently will not provide an optimal solution

• Focus on the advantages of the technologies

• With intelligent algorithms for clustering tags in a folksonomy, web content retrieval and filtering will become much quicker, and less irrelevant information and noise will be retrieved.
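The slides do not name a specific clustering algorithm, so the following is only one simple stand-in: tags that co-occur on at least `min_cooccurrence` resources are linked, and connected groups of linked tags form clusters. The resource ids, tags, and threshold are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical resources and the tags users gave them.
RESOURCE_TAGS = {
    "r1": {"nyc", "new-york", "university"},
    "r2": {"nyc", "new-york"},
    "r3": {"photo", "photography"},
    "r4": {"photo", "photography", "nyc"},
}


def cluster_tags(resource_tags, min_cooccurrence=2):
    """Group tags that appear together on enough resources."""
    # Count how often each pair of tags shares a resource.
    counts = defaultdict(int)
    for tags in resource_tags.values():
        for a, b in combinations(sorted(tags), 2):
            counts[(a, b)] += 1

    # Link pairs that co-occur often enough.
    neighbours = defaultdict(set)
    for (a, b), n in counts.items():
        if n >= min_cooccurrence:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Connected components of the link graph are the clusters.
    all_tags = set().union(*resource_tags.values())
    seen, clusters = set(), []
    for tag in sorted(all_tags):
        if tag in seen:
            continue
        stack, component = [tag], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        clusters.append(component)
    return clusters


for cluster in cluster_tags(RESOURCE_TAGS):
    print(sorted(cluster))
```

A query for "nyc" could then be expanded to "new-york" as well, which is one way clustering can offset the lack of synonym control in free-form tagging.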

SOLUTION FOR A BETTER WEB CONTENT RETRIEVAL AND PRESENTATION MECHANISM

• Further, applying adaptive algorithms to RSS will propagate specific information to the user based on the user's behavior

• Once the security, privacy, and policy concerns of mashups are eliminated, all of this will eventually make web content retrieval, aggregation, and presentation a stroll in the park
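One way to picture an adaptive algorithm applied to RSS is a filter that learns from what the user clicks. The sketch below is an assumption-laden toy: the `AdaptiveFilter` class, the word-count interest profile, the scoring rule, and the 0.5 threshold are all invented here and are not taken from any RSS reader.

```python
import re
from collections import Counter


def tokens(text):
    """Lower-case words of a feed item title."""
    return re.findall(r"[a-z]+", text.lower())


class AdaptiveFilter:
    """Toy interest profile built from the titles a user clicks."""

    def __init__(self):
        self.interest = Counter()

    def record_click(self, title):
        # Every click raises the weight of the words in that title.
        self.interest.update(tokens(title))

    def score(self, title):
        # Average learned weight per word; unseen words count as zero.
        words = tokens(title)
        if not words:
            return 0.0
        return sum(self.interest[w] for w in words) / len(words)

    def select(self, titles, threshold=0.5):
        # Keep only items that match the learned interests well enough.
        return [t for t in titles if self.score(t) >= threshold]


reader = AdaptiveFilter()
reader.record_click("Columbia wins robotics contest")
reader.record_click("Robotics lab opens")

incoming = ["New robotics course", "Stock markets fall"]
print(reader.select(incoming))
```

The more the user clicks, the sharper the profile becomes, so the feed narrows toward relevant items instead of adding to the overload.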

THANK YOU
