Aardvark : Social Search Engine

Aardvark : Large - Scale Social Search Engine

A seminar report submitted in partial fulfillment of the requirementsfor the award of the degree of

Bachelor of Technology

in

Computer Science & Engineering

Eighth Semester 2011 Admission

ABSTRACT

A program that searches for and identifies items in a database that correspond tokeywords or characters specified by the user, used especially for finding particularweb sites on the World Wide Web is a simple definition for a Search Engine. Cur-rently there are two kind of Search Engines available in the Internet which are HyperTextual Based Search Engine and Social Search Engine.

This report is mainly concentrating on the concepts of Social Search Engine. TheSocial Search Engine is also like an ordinary web search engine but have a vast differ-ence on comparison with the traditional Hyper Text Based Search Engine. In SocialSearch Engine the Information source is not a Document but a Living Human Being.In traditional search engine when a query is provided as an input the search enginealgorithm searches for the best document containing the requested keyword or thekeyword related documents. But in a social search engine when a query is providedas input , it’s meaning . it’s purpose is found out then the search engines finds outthe best person who can help the question initiator to solve the problem. The out-put generated in the social search engine is the solution generated by the answererin more natural language and more understandable format for the question initiator.

So it can also be said that the Social Search Engine makes the Question/queryreach the expert solution provider or the Answerer while the Hyper Textual onlyleads to a document. More than one solutions can be also obtained for a singlequery in the form of suggestions in Social Search Engine.

Contents

List of Figures ii

1 Introduction 11.1 What is Village Paradigm? . . . . . . . . . . . . . . . . . . . . . . . . 11.2 What is Aardvark? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Outline 4

3 Motivation 5

4 Why Aardvark was Created? 6

5 Aardvark Architecture 85.1 Sign Up Process of a User . . . . . . . . . . . . . . . . . . . . . . . . 85.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105.3 Working . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

6 How Aardvark Works? 126.1 Work Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126.2 Social Crawling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136.3 Indexing People . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146.4 Question Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166.5 Ranking Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186.6 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196.7 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

7 Future Works 26

8 Conclusion 28

Bibliography 29

i

List of Figures

1.1 Screen Shot of Google Result. . . . . . . . . . . . . . . . . . . . . . . 21.2 Aardvark Search Engine and Animal . . . . . . . . . . . . . . . . . . 3

4.1 Aardvark Search Engine versus Google’s Search Engine . . . . . . . . 7

5.1 Schematic of the architecture of Aardvark . . . . . . . . . . . . . . . 10

6.1 Aardvark in iPhone . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216.2 Aardvark’s working using a pictorial description . . . . . . . . . . . . 226.3 Frequent Questions asked in Aardvark . . . . . . . . . . . . . . . . . 236.4 A random set of queries from Aardvark . . . . . . . . . . . . . . . . . 236.5 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246.6 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256.7 Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

7.1 Increasing number of users in Aardvark. . . . . . . . . . . . . . . . . 26

ii

Chapter 1

Introduction

A web search engine is a software system that is designed to search for informationon the World Wide Web. The search results are generally presented in a line ofresults often referred to as Search Engine Results Pages (SERPs). The informationmay be a mix of web pages, images, and other types of files. Some search enginesalso mine data available in databases or open directories. Unlike web directories,which are maintained only by human editors, search engines also maintain real-timeinformation by running an algorithm on a web crawler.

The famous search engine Google works on the principal of large-scale hyper textualweb searching process that is based on the inputted keywords the searching processis done over the concept of page ranking. The page ranking means the number ofvisits/votes achieved to the related web pages of the inputted key word. But thereis a drawback what if the input was ”Do you have any good babysitter recommen-dations in Palo Alto for my 6-year-old twins? I’m looking for somebody that won’tlet them watch TV.”. When this query was given to Google it generated the resultwhich is shown in the Figure:1.1. This is not the answer to the input provided by theuser so in order to generate more suitable and more accurate solution a query theapproach of large - scale social search engine is used. The large - scale social searchengine takes input via instant message, email, web input, text message, or voice.The search engine then routes the question to the person in the user’s extendedsocial network most likely to be able to answer that question. As compared to atraditional web search engine, where the challenge lies in finding the right documentto satisfy a user’s information need, the challenge in a social search engine lies infinding the right person to satisfy a user’s information need.

1.1 What is Village Paradigm?

From the name itself it is very clear that the village paradigm is based on the con-cept of village, i.e the village is the source of Information Retrieval. The hypertextual search engine and the social search engine works, works in simple terms. Asmentioned early the Google works on the concept of the hyper textual search engineand in this paradigm (concept) of Information Retrieval (IR) from a Library is used.The Google’s search engine does the process of IR from the Stanford Digital Library.The social search engine works on the paradigm called The Village Paradigm. The

1


Figure 1.1: Screen Shot of Google Result.

village paradigm states that the data which is needed for the user is possessed orknown by a person or individual in the village and we then find that particularperson by passing information from one person to another.

The differences how information can be obtained from a library versus a villagesuggest some useful principles for designing a social search engine. We can see theclear cut difference between a library and a village that is the library consist of setof chunks of books while a village consist of chunks of living people. In a library,the user uses a keyword to search. The knowledge is extracted from a small numberof content publishers, and trust is based on authority of the data publishers. Whilein a village, user uses natural language to ask questions and the answers are gener-ated in real-time by anyone in the community, and trust is based on intimacy. Thereal-time response for a highly contextualized and subjective query, for example, thequery ”Do you have any good babysitter recommendations in Palo Alto for my 6-year-old twins? I’m looking for somebody that won’t let them watch TV.” is betteranswered by a real time friend than the library. These differences in informationretrieval paradigm states that the social based search engine is entirely differentfrom that of the traditional hyper textual search engine. Thus as the concept isentirely different than comparing with the traditional search engine it uses a newarchitecture. Aardvark is the only social search engine based on the villageparadigm up till now.So a study about Aardvark will be like the study of Social Search Engine.

Dept. of Computer Science & Engg. 2 SIMAT, Vavanoor


1.2 What is Aardvark?

Actually the aardvark is a medium - sized, burrowing, nocturnal mammal (Figure:4.1a) native to Africa. It is a nocturnal feeder, it subsists on ants and termites,which it will dig out of their hills using its sharp claws and powerful legs. It alsowill utilize its digging ability to create burrows in which to live and rear its young.It receives a ”least concern” rating from the IUCN, although its numbers seem tobe decreasing.

(a) Aardvark Animal (b) Search Engine Logo

Figure 1.2: Aardvark Search Engine and Animal

Aardvark is a social search engine (Figure:4.1b) based on the village paradigm.Detailed description is also provided based on the architecture, ranking algorithms,and user interfaces in Aardvark, and the design considerations that was being moti-vated . Aardvark is believed to be useful to the research community for two reasons.First, the argument made in the original Anatomy paper[1] still holds true sincemost search engine development is done in industry rather than academia, the re-search literature describing end-to-end search engine architecture is limited to theirown company as a result knowledge is not flourishing out. Second, the shift inparadigm opens up a number of interesting research questions in information re-trieval, for example around expertise classification, implicit network construction,and conversation design.

Following the architecture description, a statistical analysis of usage patterns inAardvark is also presented. It was found that, as compared to traditional search,Aardvark queries tend to be long, highly contextualized and subjective. In simpleterms the Aardvark queries tend to be the types of queries that are not well-servicedby traditional search engines. It was also found out that the vast majority of ques-tions got answered promptly and satisfactorily, and the users were surprisingly ac-tive, both in asking and answering. Aardvark performs very well on queries that dealwith opinion, advice, experience, or recommendations, while traditional text-basedsearch engines remain a good choice for queries that are factual or navigational.


Chapter 2

Outline

This report is presenting the entirely new concept in the field of search engine whichis called the ”Social Search Engine”. The most famous traditional search enginewhich is none other than the ”Google” is a hypertext based search engine. Whilethis Social Search Engine is a search engine which connects the query inputed bythe user into the social network of the user’s people to get the solution of the query.

The first chapter is about describing out an introduction about the core concepts ofthe Social Search Engine. This chapter also describes about the introductory partsof the Social Search Engine along with the Introduction of the Aardvark SearchEngine. The third chapter describes about the relevance and motivation of thisSearch Engine. The fourth chapter gives the solution to various questions like WhyAardvark was created? and What it makes to be different from other Search Engine.The fifth chapter describes about the Architecture and the various important com-ponents of this Search Engine. The Sixth chapter details about the entire workingof this Search Engine along with descriptions of various important key things ofAardvark. The seventh chapter states about the current status of Aardvark and thedetails about the Social Search Engines which are in market. The final chapter isthe conclusion where a summarized format of the entire report is described. Theconclusion chapter is followed by the Bibliography.

4

Chapter 3

Motivation

The main motivation of this method is to create a general awareness about the newform of search engine which is more closer to the humans’ natural language and farmore realistic. This social based search engine will improvise the searching processon the web by making the user interface part more closer to humans. The otherpurpose behind this search engine is for the academic and research purpose so asmore social based engine’s advanced version comes into market. On implementationof this project the network paradigm has to be changed resulting in the entiremanufacturing process of the network and the search engines. The world will bemore like a village as the querying process leads the user to a single and the bestresource provider, here a user, from any part of the world.

5

Chapter 4

Why Aardvark was Created?

It is a general question to ask why the Aardvark search engine was created. As men-tioned in the introduction part , the main goal of Aardvark is to increase the searchresult operation for all type of queries. Its an extended version of the traditionalsearch approaches. Aardvark is a social search service that connected users live withfriends or friends-of-friends who were able to answer their questions, also known asa knowledge market. Users submitted questions via the Aardvark website, email orinstant messenger and Aardvark identifies and facilitates a live chat or email con-versation with one or more topic experts in the asker’s extended social network.

In the field of computer science the query is a string in English language whichrepresent a question as an input The method of asking questions in English to asearch engine are of four types.

1. Objective Questions:- Those questions which have a predefined answer arecalled objective questions. Example ”Who is the current President of India?”.The output for these types of questions are 100% accurate and fully satisfac-tory.

2. Subjective Questions:- Those questions for which the solution/answer variesfrom user to user is called subjective question. Example ”Which is your fa-vorite President of India?” . The search result generated by the search enginesare not full satisfactory. It only leads to another factor to browse further.

3. Natural Language Questions:- The process of translating the thoughts of ahuman into keyword is the main asset here. But this concept is very toughand the current progress in this field is insufficient as there are lots of keywordsthat can not be put into consideration.

4. Q&A not available in Web:- It is true that there are enormous number of infor-mational resources available in the web. But on comparison of this resourceswith the resources that can be obtained form every single human on earth isunlimited. For example ”The information that you possess is not there in web.Your Favorite restaurant, your own information which is local to yourself.”

Now looking into an example (Figure:4.1) where the evaluation of a query donein two search engines, the Social Search Engine and the Hypertextual Search Engine,are shown.

6


(a) Google (b) Aardvark App in iPhone

Figure 4.1: Aardvark Search Engine versus Google’s Search Engine

It is clear in the Figure that when the Query ”Where can i find a great half dayhike within a short drive of San Francisco? I just moved out here looking for some-thing great views and varied terrain(I am a strong hiker). Any Tips appreciated.”was provided as input to the Google’s Search engine its best result was a set ofweb links which pointed out to famous hiking organization providing hiking stuffs.While on the contrary the Aardvark gave the result of the perfect reply as expectedby the user. So this results points that the Hyper textual search engine looks forthe keywords in the query and generate the related documents. So it is nearly as astatic reply and this search engine doesn’t goes for the meaning for the query. Themeaning for a query can only be understood by a human as a result Aardvark gavethe optimal result as it is not static and its not form a library.

The main intention to launch Aardvark was Resource Enlargement. If there wasprovision provided in this system that keep tracks of the ongoing queries being pro-cessed and then store the optimal result for all kind of questions then the SearchEngine will be more and more smarted and will be more like a Non Living Friendwith whom a user can clarify the queries. Aardvark is used for asking subjectivequestions for which human judgment or recommendation was desired. It was alsoused extensively for technical support questions. Users could also review questionand answer history and other settings on the Aardvark website. Google acquiredAardvark for 50 million dollars on February 11, 2010.[3][4]. In September 2011,Google announced it would discontinue a number of its products, including Aard-vark.


Chapter 5

Aardvark Architecture

The main components of Aardvark are:

1. Crawler and Indexer - These two parameters are used to find and label re-sources that contains the correct information about the query being askedhere the resources are in the sense of the human beings not documents.

2. Query Analyzer - As mentioned earlier the Query which is provided as aninput is a highly complex query. From this query the meaning or the purposehas to be extracted not the important keywords.

3. Ranking Function - As per the Hypertextual Search engine where the resourcepages are ranked, here some what a similar process i carried out where theanswerer is ranked based on the result the answerer generated for a query.

4. User Interface - To present the information to the user in an accessible andinteractive form. The main forms of user interface is by mobile phone apps, emails, Web pages, Instant Messaging.

Most document - based search engines have similar key components with similaraims[1], but the means of achieving those aims are quite different. Before discussingin detail about Aardvark , it is useful to describe what happens behind the sceneswhen a new user joins Aardvark and when a user asks a question to Aardvark.

5.1 Sign Up Process of a User

The strength of the Aardvark lies in the number of users in the Aardvark network,because more the number of user ,more is the information resources in the network.Every individual may possess information of same or different or new things, butthe level of information varies from user to user. Some user may have high knowl-edge about a particular topic while some possess less knowledge. Based on thesefactors individuals are separated form each other. So the prime goal here is to knowthe information possessed by the user and this process has to be done in a simpleformat. When a new user first joins Aardvark, the Aardvark system performs anumber of indexing steps in order to be able to direct the appropriate questions toher for answering.

8


The questions in Aardvark are routed to the user’s extended network, the first stepinvolves indexing friendship and affiliation information. For this process of affiliatinginformation, the data structure responsible for is the Social Graph [16]. The SocialGraph in the Internet context is a graph that depicts personal relations of Internetusers. In short, it is a social network, where the word graph has been taken fromgraph theory to emphasize that rigorous mathematical analysis will be applied asopposed to the relational representation in a social network. The social graph hasbeen referred to as ”the global mapping of everybody and how they’re related”.

Aardvark is not to aiming to build a social network, but rather to allow peopleto make use of their existing social networks like facebook, twitter, hangouts, gmailcontacts etc. . So in the sign-up process, a new user has the option of connecting toa/many social network such as Facebook or LinkedIn, importing their contact listsfrom a web mail program, or manually inviting friends to join. Additionally, any-body whom the user invites to join Aardvark is appended to their Social Graph andsuch invitations are a major source of new users. Finally, Aardvark users are con-nected through common groups which reflect real-world affiliations they have, suchas the schools they have attended and the companies they have worked at; thesegroups can be imported automatically from social networks, or manually created byusers. Aardvark indexes this information and stores it in the Social Graph, whichis a fixed width ISAM index sorted by userId. ISAM stands for Indexed SequentialAccess Method, a method for indexing data for fast retrieval. In an ISAM system,data is organized into records which are composed of fixed length fields. Records arestored sequentially, originally to speed access on a tape system. A secondary set ofhash tables known as indexes contain ”pointers” into the tables, allowing individualrecords to be retrieved without having to search the entire data set.

Simultaneously, Aardvark indexes the topics about which the new user has somelevel of knowledge or experience. This topical expertise can be gathered from sev-eral sources by using several methods: a user can indicate topics in which he believeshimself to have expertise; a user’s friends can indicate which topics they trust theuser’s opinions about; a user can specify an existing structured profile page fromwhich the Topic Parser parses additional topics; a user can specify an account onwhich they regularly post status updates (e.g. Twitter or Facebook), from which theTopic Extractor extracts topics (from unstructured text) in an ongoing basis andfinally, the Aardvark also observes the user’s behavior on Aardvark, in answeringa type of questions,or electing not to answer to which kind of questions or aboutparticular topics.

The set of topics associated with a user is recorded in the Forward Index, whichstores each userId, a scored list of topics, and a series of further scores about auser’s behavior (e.g., responsiveness or answer quality). From the Forward Index,Aardvark constructs an Inverted Index. The Inverted Index stores each topicId anda scored list of userIds that have expertise in that topic. In addition to topics, theInverted Index stores scored lists of userIds for features like answer quality and re-sponse time. Once the Inverted Index and Social Graph for a user are created, theuser is now active on the system and ready to ask her first question.



5.2 Architecture

Figure 5.1: Schematic of the architecture of Aardvark

The Figure:5.1 shows the schematic Architecture of the Aardvark Search Engine.The flow of the request and reply is in a systematic order and each component ofthis architecture have a unique role.

• Gateways: - The modes of getting inputs from the user and passing it throughthe network to the user which will generate the result. This gateway can bean E mail, iPhone App, SMS, Website etc.

• Conversation Manger & Question Analyzer- The Social Search engine has twointeresting kind of responsibilities. The first is to do Dialog Management soif a user is interesting with a social search engine over IM, there need to bea little bot-mind that knows how to interpret all the commands that useris passing into the network. If a user just types in a question the bot willrecognize whether the input is a question or an answer to somebody’s elsequestion. If a user type command like ”Show my past history”, the bot hasto recognize that. So part of conversation management is basically being aNatural Language Bot. The other part is supervising the question to makesure that it gets the answer. So it is kind of being a mission control for findingall the people who might be relevant, and then contacting them one by one,seeing if the relevant answerers are responding and moving on to the otherpeople if there is no response. Organizing the whole episode of from usersubmitting a question to getting an answer is the key role of the conversationmanager..



• Routing Engine:- It’s job is to do the actual ranking of candidate answer. Soit is very much usual to ask that how in a web search engine designed sociallydoes the task of ranking as ranking is mostly done over the documents.Hereranking of people on a user’s social network in order of who the algorithmthink is best to answer user’s question.

• Database & Importers : After the entire discussion it is very clear that thedatabase is nothing but the number of users/ participants in the AardvarkNetwork. The Importers are those source which gives the related users of aspecific user like for example the mutual friend concept in Facebook. This willhelp to increase the number of user in the Aardvark Network.

5.3 Working

A user begins by asking a question, most commonly through instant message (onthe desktop) or text message (on the mobile phone). The question gets sent fromthe input device to the Transport Layer, where it is normalized to a Message datastructure, and sent to the Conversation Manager. Once the Conversation Managerdetermines that the message is a question, it sends the question to the QuestionAnalyzer to determine the appropriate topics for the question. The ConversationManager informs the asker which primary topic was determined for the question,and gives the asker the opportunity to edit it. It simultaneously issues a RoutingSuggestion Request to the Routing Engine. The routing engine plays a role analo-gous to the ranking function in a text based search engine. It accesses the InvertedIndex and Social Graph for a list of candidate answerers, and ranks them to reflecta combination of how well it believes they can answer the question and how goodof a match they are for the asker. The Routing Engine returns a ranked list ofRouting Suggestions to the Conversation Manager, which then contacts the poten-tial answerers one by one, or a few at a time, depending upon a Routing Policyand asks them if they would like to answer the question, until a satisfactory answeris found. The Conversation Manager then forwards this answer along to the asker,and allows the asker and answerer to exchange follow up messages if they choose.


Chapter 6

How Aardvark Works?

6.1 Work Model

The core functionality of Aardvark is a statistical model which does the job of routingquestions to potential answerers, because the answer and the question generator bothare a fine end users. Aardvark uses a network variant of what has been called anaspect model[5], that has two primary features. First, it associates an unobservedclass(i.e the question and topic being asked) variable t ∈ T with each observation(i.e., the successful answer of question q by user ui). This probability is to find therelevance of the user ui in answering the topic t. In other words it can be statedthat the probability p(ui|q) that user i will successfully answer question q depends onwhether q is about the topics t in which ui has expertise and it is query dependent:

p(ui, q) =∑t∈T

p(ui|t)p(t|q) (6.1)

The second main feature of the model is that it defines a query-independent prob-ability of success for each potential asker/answerer pair (ui, uj), based upon theirdegree of social connectedness, the intimacy and profile similarity between the askerand the answerer. This is indirectly referencing to the bondage between two friendsfor solving each others problem though the answerer may not be expertise in an-swering the topic. In other words, we define a probability p(ui|uj) that user ui willdeliver a satisfying answer, i.e the quality of the answer to the topic, to user uj ,regardless of the question. We then define the scoring function s(ui, uj, q) as thecomposition of the two probabilities.

s(ui, uj, q) = p(ui, uj).p(ui, q) = p(ui, uj).∑t∈T

p(ui|t)p(t|q) (6.2)

The main goal in the ranking problem is: given a question q from user uj, return aranked list of users ui ∈ U that maximizes s(ui, uj, q).

It is very interesting to note that the scoring function is composed of a query depen-dent relevance score p(ui|q) and a query-independent quality score p(ui|uj). Thisbears similarity to the ranking functions of traditional text-based search enginessuch as Google[1]. The difference is that unlike quality scores like Page Rank[6],Aardvark’s quality score aims to measure intimacy between the askers and the an-swerers rather than authority, of a document. And unlike the relevance scores in

12


hypertext-based search engines, Aardvark’s relevance score aims to measure a user’spotential to answer a query, rather than a document’s existing capability to answera query. Computationally, this scoring function has a number of advantages. It al-lows real-time routing because it pushes much of the computation off line. The onlycomponent probability that needs to be computed at query time is p(t|q). Comput-ing p(t|q) is equivalent to assigning topics to a question - in Aardvark it is done byrunning a probabilistic classifier on the question at query time. The distributionp(ui|t) assigns users to topics, and the distribution p(ui|uj) defines the AardvarkSocial Graph. Both of these are computed by the Indexer at sign up time, and thenupdated continuously in the background as users answer questions and get feedback.The component multiplications and sorting are also done at query time, but theseare easily parallelize able, as the index is shared by user.

6.2 Social Crawling

The primary goal of any search engine is to intake the query, study the query tounderstand the user’s need and then based on the query generate the solution orsolutions to the user. A Knowledge Base (KB)[11] is a technology used to storecomplex structured and unstructured information used by a computer system. Acomprehensive Knowledge Base is important for search engines as query distribu-tions tend to have a long tail [7]. Long tail in the sense the string length of thequery is longer. A Web crawler is an Internet bot that systematically browses theWorld Wide Web, typically for the purpose of Web indexing. Web search enginesand some other sites use Web crawling software to update their web content or in-dexes of others sites’ web content. Web crawlers can copy all the pages they visit forlater processing by a search engine that indexes the downloaded pages so that userscan search them much more quickly. In text-based search engines, the crawling pro-cess is achieved by large scale crawlers and thoughtful crawl policies. In Aardvark,the knowledge base consists of people rather than documents, so the methods foracquiring and expanding a comprehensive knowledge base are quite different.

With Aardvark, the more active users there are in network, the more potentialanswerers there are, and therefore the more comprehensive the coverage. More im-portantly, because Aardvark looks for answerers primarily within a user’s extendedsocial network, the denser the network, the larger the effective knowledge base. Thissuggests that the strategy for increasing the knowledge base of Aardvark cruciallyinvolves creating a good experience for users so that they remain active and areinclined to invite their friends.

Given a set of active users on Aardvark, the effective breadth of the Aardvarkknowledge base depends upon designing interfaces and algorithms that can collectand learn.



6.3 Indexing People

Indexing is the process of allocating the actual or the most suitable resource to therequest. The central technical challenge in Aardvark is selecting the right user toanswer a given question originated from another user. In order to do this, the twomain things Aardvark needs to learn about each user ui are:

1. The topics t that the user ui might be able to answer questions aboutpsmoothed(t|ui)

2. The users uj to whom which user ui is connected p(ui|uj).

Now moving to the first part which is used to describe about the concept of ”topics”known to a user ui. Aardvark computes the distribution p(t|ui) of topics known byuser ui from the following sources of information:

• Users are prompted to provide at least three topics which they believe theyhave expertise about.

• Friends of a user (and the person who invited a user) are encouraged to providea few topics that they trust the user’s opinion about.

• Aardvark parses out topics from users existing on line profiles (e.g., Facebookprofile pages, if provided). For such pages with a known structure, a simpleTopic Parsing algorithm uses regular expressions which were manually devisedfor specific fields in the pages, based upon their performance on test data. Likefor example if a user takes too many photographs and has joined various groupsof photography then the Aardvark can understand that user is an expertise inthe topics related to photography.

• Aardvark automatically extracts topics from unstructured text on users ex-isting on line homepage or blogs if provided. For unstructured text, a linearSVM, Support Vector Machine[17], is a supervised learning models with as-sociated learning algorithms that analyze data and recognize patterns, usedfor classification and regression analysis. SVM identifies the general subjectarea of the text, while an ad-hoc named entity extractor is run to extractmore specific topics, scaled by a variant tf-idf score.tf-idf[18] is a numericalstatistic that is intended to reflect how important a word is to a document ina collection or corpus.

• Aardvark automatically extracts topics from users status message updates(e.g., Twitter messages, Facebook news feed items, IM status messages, etc.)and from the messages they send to other users on Aardvark.

The motivation for using these latter sources of profile topic information is a simpleone: if you want to be able to predict what kind of content a user will generate (i.e.,p(t|ui)), first examine the content they have generated in the past. In this spirit,Aardvark uses web content not as a source of existing answers about a topic, butrather, as an indicator of the topics about which a user is likely able to give newanswers on demand.



In simple terms this process involves modeling a user as a content generator, withprobabilities indicating the likelihood he/she will likely respond to questions aboutgiven topics. Each topic in a user profile has an associated score, depending uponthe confidence appropriate to the source of the topic. In addition, Aardvark learnsover time which topics not to send a user questions about by keeping track of caseswhen the user:

1. Explicitly ”mutes” a topic

2. Declines to answer questions about a topic when given the opportunity

3. Receives negative feedback on his answer about the topic from another user.

Periodically, Aardvark will run a topic strengthening algorithm. The essential ideaor purpose of this algorithm is that ” if a user has expertise in a topic and most ofhis friends also have some expertise in that topic, we have more confidence in thatuser’s level of expertise than if he were alone in his group with knowledge in thatarea”. Mathematically, for some user ui, has a group of friends U , and user knowssome topic t, if p(t|ui) 6= 0, then s(t|ui) = p(t|ui)+γ

∑u∈U p(t|u), where γ is a small

constant. The s values are then renormalized to form probabilities.

Aardvark then runs two smoothing algorithms over the list of the topics knownto the user. The purpose of these two algorithms are to record the possibility thatthe user may be able to answer questions about additional topics which are not ex-plicitly recorded in her profile. The first algorithm uses basic collaborative filteringtechniques on topics i.e., based on users with similar topics of the user. For exampleif a person knows about photography then he also might knows about the camerastoo. The second algorithm uses semantic similarity. Semantic similarity [19] or se-mantic relatedness is a metric defined over a set of documents or terms, where theidea of distance between them is based on the likeness of their meaning or semanticcontent as opposed to similarity which can be estimated regarding their syntacticalrepresentation

Once all of these bootstrap, extraction, and smoothing methods are applied, wehave a list of topics and scores for a given user. Normalizing these topic scores sothat

∑t∈T p(t|ui) = 1, we have a probability distribution for topics known by user

ui. Using Bayes Law, we compute for each topic and user:

p(ui|t) =p(t|ui)p(ui)

p(t)(6.3)

using a uniform distribution for p(ui) and observed topic frequencies for p(t). Aard-vark collects these probabilities p(ui|t) indexed by topic into the Inverted Index,which allows for easy lookup when a question comes in.

Aardvark computes the connectedness between users p(ui|uj) in a number of ways.While social proximity is very important here, we also take into account similaritiesin demographics and behavior. The factors considered here include:



• Social connection (common friends and affiliations)

• Demographic similarity

• Profile similarity (e.g., common favorite movies)

• Vocabulary match (e.g., IM shortcuts)

• Chattiness match (frequency of follow-up messages)

• Verbosity match (the average length of messages)

• Politeness match (e.g., use of Thanks!)

• Speed match (responsiveness to other users)

Connection strengths between people are computed using a weighted cosine similar-ity over this feature set, normalized so that

∑ui∈U p(ui|uj) = 1, and stored in the

Social Graph for quick access at query time. Both the distributions p(ui|uj) in theSocial Graph and p(t|ui) in the Inverted Index are continuously updated as usersinteract with one another on Aardvark.

6.4 Question Analyzer

The purpose of the Question Analyzer is to determine a scored list of topics p(t|q)for each question q representing the semantic subject matter of the question. Insimple terms for a question q , a list of topics are generated which can give thesolution to the question q. This is the only probability distribution in equation 6.2that is computed at query time.

It is important to note that in a social search system, the requirement for a Ques-tion Analyzer is only to be able to understand the query sufficiently for routing itto a likely answerer. This is a considerably simpler task than the challenge facingan ideal web search engine, which must attempt to determine exactly what pieceof information the user is seeking (i.e., given that the searcher must translate herinformation need into search keywords), and to evaluate whether a given web pagecontains that piece of information. By contrast, in a social search system, it is thehuman answer who has the responsibility for determining the relevance of an answerto a question and that is a function which human intelligence is extremely well-suited to perform! Here the only process to be done is the process of match makingtwo users. While the rest of the process of generating the solution is all with theanswerer. Due to this unique feature there is partial complexity in implementing.The asker can express his information need in natural language, and the humananswerer can simply use her natural understanding of the language of the question,of its tone of voice, sense of urgency, sophistication or formality, and so forth, todetermine what information is suitable to include in a response. Thus, the role ofthe Question Analyzer in a social search system is simply to learn enough about thequestion that it may be sent to appropriately interested and knowledgeable humananswerers.



The questions inputed into Aardvark can be classified for better understanding byfollowing ways[2]:

• NonQuestionClassifier

• InappropriateQuestionClassifier

• TrivialQuestionClassifier

• LocationSensitiveClassifier

A NonQuestionClassifier determines if the input is not actually a question (e.g., isit a misdirected message, a sequence of keywords, etc.); if so, the user is asked tosubmit a new question.An InappropriateQuestionClassifier determines if the input is obscene, commercialspam, or otherwise inappropriate content for a public question-answering commu-nity; if so, the user is warned and asked to submit a new question.A TrivialQuestionClassifier determines if the input is a simple factual question whichcan be easily answered by existing common services (e.g., What time is it now?,What is the weather?, etc.); if so, the user is offered an automatically generatedanswer resulting from traditional web search.A LocationSensitiveClassifier determines if the input is a question which requiresknowledge of a particular location, usually in addition to specific topical knowledge(e.g., Whats a great sushi restaurant in Austin, TX?); if so, the relevant location isdetermined and passed along to the Routing Engine with the question.

After analyzing and screening a question the next process is to map this question torelevant topics and from there to the users with knowledge of the topic. The list oftopics relevant to a question is produced by merging the output of several distinctTopicMapper algorithms[2]:

• KeywordMatchTopicMapper

• TaxonomyTopicMapper

• SalientTermTopicMapper

• UserTagTopicMapper

A KeywordMatchTopicMapper passes any terms in the question which are stringmatches with user profile topics through a classifier which is trained to determinewhether a given match is likely to be semantically significant or misleading.A TaxonomyTopicMapper classifies the question text into a taxonomy of roughly3000 popular question topics, using an SVM trained on an annotated text of severalmillions questions.A SalientTermTopicMapper extracts salient phrases from the question using a noun-phrase chunker and a tf-idf based measurer of importance in a document and findssemantically similar user topics.A UserTagTopicMapper takes any user tags provided by the asker (or by any would-be answerers), and maps these to semantically-similar user topics.



At present, the output distributions of these classifiers are combined by weightedlinear combination[8]. The Aardvark TopicMapper algorithms are continuously eval-uated by manual scoring on random samples of 1000 questions. The topics used forselecting candidate answerers, as well as a much larger list of possibly relevant topics,are assigned scores by two human judges, with a third judge adjudicating disagree-ments. For the current algorithms on the current sample of questions, this processyields overall scores of 89% precision and 84% recall of relevant topics. In otherwords, 9 out of 10 times, Aardvark will be able to route a question to someone withrelevant topics in his her profile; and Aardvark will identify 5 out of every 6 possiblyrelevant answerers for each question based upon their topics.

6.5 Ranking Algorithm

Ranking in Aardvark is done by the Routing Engine, which determines an orderedlist of users (or candidate answerers) who should be contacted to answer a question.To this algorithm the inputs are the details of the asker of the question and theinformation about the question derived by the Question Analyzer. The core rankingfunction is described by equation 6.2; essentially, the Routing Engine can be seenas computing equation 6.2 for all candidate answerers, sorting, and doing somepost processing. The main factors that determine this ranking of users are TopicExpertise p(ui|q), Connectedness p(ui|uj), and Availability.

• Topic Expertise: First, the Routing Engine finds the subset of users who aresemantic matches to the question being asked here. Those users whose profiletopics indicate expertise relevant to the topics which the question is about.Those users whose profile topics are closer match to the question’s topics aregiven higher rank. For the questions which are location-sensitive (as definedabove), then only those users with matching locations in their profiles areconsidered for answering.

• Connectedness: Second, the Routing Engine scores each user according tothe degree to which he/she herself - as a person, independently of her topicalexpertise - is a good match for the asker for this information query. This partis more like giving score value to those users who gave the correct and themost relevant answer to the question being asked. The goal of this scoring isto optimize the degree to which the asker and the answerer feel more friendshipand trust, arising from their sense of connection and similarity, and meet eachother’s expectations for conversational behavior in the interaction.

• Availability: Third, the Routing Engine prioritizes candidate answerers insuch a way so as to optimize the chances that the present question will beanswered, while also preserving the available set of answerers (i.e., the quantityof answering resource in the system) as much as possible by spreading out theanswering load across the user base. This involves factors such as prioritizingusers who are currently online (e.g., via IM presence data, iPhone usage, etc.),who are historically active at the present time-of-day, and who have not beencontacted recently with a request to answer a question.



Given this ordered list of candidate answerers, the Routing Engine then filters outusers who should not be contacted, according to Aardvarks guidelines for preservinga high-quality user experience. These filters operate largely as a set of rules:

• Do not contact users who prefer to not be contacted at the present time ofday

• Do not contact users who have recently been contacted as many times as theircontact frequency settings permit etc. Since this is all done at query time,and the set of candidate answerers can potentially be very large, it is usefulto note that this process is parallelize able.

Each Shard[20] in the Index computes its own ranking for the users in that shard,and sends the top users to the Routing Engine. Shard is a horizontal partition ofdata in a database or search engine. Each individual partition is referred to as ashard or database shard. Each shard is held on a separate database server instance,to spread load. This is scalable as the user knowledge base grows, since as moreusers are added, more shards can be added.

The list of candidate answerers who survive this filtering process are returned tothe Conversation Manager. The Conversation Manager then proceeds with openingchannels to each of them, serially, inquiring whether they would like to answer thepresent question; and iterating until an answer is provided and returned to the asker.

6.6 User Interface

Since social search is modeled after the real-world process of asking questions tofriends, the various user interfaces for Aardvark are built on top of the existing com-munication channels that people use to ask questions to their friends: IM, email,SMS, iPhone, Twitter, and Web-based messaging. Experiments were also done us-ing actual voice input from phones, but this is not live in the current Aardvarkproduction system.

In its simplest form, the user interface for asking a question on Aardvark is anykind of text input mechanism, along with a mechanism for displaying textual mes-sages returned from Aardvark. This kind of very lightweight interface is importantfor making the search service available anywhere, especially now that mobile deviceusage has increased across in the globe.

However, Aardvark is most powerful when used through a chat-like interface thatenables ongoing conversational interaction. A private 1-to-1 conversation createsan intimacy which encourages both honesty and freedom within the constraints ofreal-world social norms. An answering forums where there is a public audience canboth inhibit potential answerers [14] or motivate public performance rather thanauthentic answering behavior[15] is a key criteria when the user interface is beingdesigned. Further, in a real time conversation, it is possible for an answerer to re-quest clarifying information from the asker about her question, or for the asker tofollow-up with further reactions or inquiries to the answerer.



There are two main interaction flows, in the sense communication flow betweentwo users, available in Aardvark for answering a question. The primary flow in-volves Aardvark sending a user a message (over IM, email, etc.), asking if she wouldlike to answer a question: for example, ”You there? A friend from the Stanfordgroup has a question about *search engine optimization* that I think you might beable to answer.”. The example of Aardvark response are shown in the Figure 6.2.If the user responds affirmatively, Aardvark relays the question as well as the nameof the questioner. The user may then simply type an answer the question, type ina friend’s name or email address to refer it to someone else who might answer, orsimply ”pass” on this request. Keeping the request and reply messages more userfriendly and trustworthy will make the attention of the user stick to the user itself.

A key benefit of this interaction model is that the available set of potential an-swerers is not just whatever users happen to be visiting a bulletin board at the timea question is posted, but rather, the entire set of users that Aardvark has contactinformation for. Because this kind of reaching out to users has the potential to be-come an unwelcome interruption if it happens too frequently, Aardvark sends suchrequests for answers usually less than once a day to a given user (and users caneasily change their contact settings, specifying preferred frequency and time-of-dayfor such requests). Further, users can ask Aardvark why they were selected for aparticular question, and be given the option to easily change their profile if they donot want such questions in the future. This is very much like the real-world modelof social information sharing: the person asking a question, or the intermediary inAardvark’s role, is careful not to impose too much upon a possible answerer. Theability to reach out to an extended network beyond a user’s immediate friendships,without imposing too frequently on that network, provides a key differentiating ex-perience from simply posting questions to one’s Twitter or Facebook status message.

A secondary flow of answering questions is more similar to traditional bulletin-board style interactions: a user sends a message to Aardvark (e.g., ”try”) or visitsthe Answering tab of Aardvark website or iPhone application, and Aardvark showsthe user a recent question from her network which has not yet been answered andis related to her profile topics. This mode involves the user initiating the exchangewhen she is in the mood to try to answer a question; as such, it has the benefit of aneager potential answerer - but as the only mode of answering it does not effectivelytap into the full diversity of the user base (since most users do not initiate theseepisodes). This is an important point: while almost everyone is happy to answerquestions to help their friends or people they are connected to, not everyone goesout of their way to do so. This willingness to be helpful persists because when usersdo answer questions, they report that it is a very gratifying experience: they havebeen selected by Aardvark because of their expertise, they were able to help someonewho had a need in the moment, and they are frequently thanked for their help bythe asker.



Figure 6.1: Aardvark in iPhone

In order to play the role of intermediary in an ongoing conversation, Aardvarkmust have some basic conversational intelligence in order to understand where todirect messages from a user: is a given message a new question, a continuation ofa previous question, an answer to an earlier question, or a command to Aardvark?The details of how the Conversation Manager manages these complications and dis-ambiguates user messages are not essential so they are not elaborated here; but thebasic approach is to use a state machine to model the discourse context.

In all of the interfaces, wrappers around the messages from another user includeinformation about the user that can facilitate trust: the user’s Real Name nametag, with their name, age, gender, and location; the social connection between youand the user (e.g., Your friend on Facebook, A friend of your friend Marshall Smith,You are both in the Stanford group, etc.); a selection of topics the user has exper-tise in; and summary statistics of the user’s activity on Aardvark (e.g., number ofquestions recently asked or answered).

Finally, it is important throughout all of the above interactions that Aardvark main-tains a tone of voice which is friendly, polite, and appreciative. A social search enginedepends upon the goodwill and interest of its users, so it is important to demon-strate the kind of (linguistic) behavior that can encourage these sentiments, in orderto set a good example for users to adopt. Indeed, in user interviews, users oftenexpress their desire to have examples of how to speak or behave socially when using



Figure 6.2: Aardvark’s working using a pictorial description

Aardvark; since it is a novel paradigm, users do not immediately realize that theycan behave in the same ways they would in a comparable real world situation ofasking for help and offering assistance. All of the language that Aardvark uses isintended both to be a communication mechanism between Aardvark and the userand an example of how to interact with Aardvark. Overall, a large body of research[9][10][12][13] shows that when you provide a 1-1 communication channel, use realidentities rather than pseudonyms, facilitate interactions between existing real-worldrelationships, and consistently provide examples of how to behave, users in an on-line community will behave in a manner that is far more authentic and helpful thanpseudonymous multi casting environments with no moderators. The design of theAardvarks UI has been carefully crafted around these principles.



6.7 Example

The better performance and evaluation of any algorithm can only be understoodusing an example. This section is illustrating a working model of the Aardvarknetwork. Before preceding further moving to a recap of what said early it is veryclear that the Aardvark is not an ordinary search engine. Its more complex andmore natural language based search engine which generates real time results for aquery. The figure 6.4 shows a random sample of questions asked on Aardvark whenthe Aardvark was officially launched in the world’s computer network. From thefigure itself it is very clear that not even a single query from this set of query can beanswered by the traditional text based search engine. The main reason is that theseare subjective questions and can only be answered by a human not a machine or adocument. The below Figure 6.3[2] shows the mostly asked topics on Aardvark.

Figure 6.3: Frequent Questions asked in Aardvark

Figure 6.4: A random set of queries from Aardvark



In Example 1 as in Figure: 6.5, Aardvark opened 3 channels(conversation) withcandidate answerers, which yielded 1 answer. An interesting aspect of this exampleis that the asker and the answerer in fact were already friends , though only listedasfriends-of-friends in their online social graphs; and they had a quick back-and-forth chat through Aardvark.

Figure 6.5: Example 1

In Example 2 as in Figure: 6.6, Aardvark opened 4 channels with candidate answer-ers, which yielded two answers. For this question, the referral feature was useful:one of the candidate answerers was unable to answer, but referred the question alongto a friend to answer; this enables Aardvark to tap into the knowledge of not just itscurrent user base, but also the knowledge of everyone that the current users know.The asker wanted more choices after the first answer, and resubmitted the questionto get another recommendation, which came from a user whose profile topics relatedto business and finance (in addition to dining).

In Example 3 as in Figure: 6.7, Aardvark opened 10 channels with candidate an-swerers, yielding 3 answers. The first answer came from someone with only a distantsocial connection to the asker; the second answer came from a coworker; and thethird answer came from a friend-of-friend-of-friend. The third answer, which isthe most detailed, came from a user who has topics in his profile related to bothrestaurants and dating.






Chapter 7

Future Works

From the previous chapters it will be very clear how this Search Engine work, whatare the main criteria and various examples to about how query is being processedby this search engine. Aardvark was originally developed by The Mechanical Zoo, aSan Francisco-based startup founded in 2007 by Max Ventilla, Nathan Stoll (bothformer Google employees), Damon Horowitz and Rob Spiro[21].A prototype ver-sion of Aardvark was launched in early 2008[with an alpha launch in October 2008.Aardvark was released to the public in March 2009, although initially new users hadto be invited by existing users. The company did not release usage statistics.

Later Google acquired Aardvark for $50 million on February 11, 2010. The Aard-vark sucessfully completed two years of development in Google. As shown in thegraph Figure 7.1 the number of Aardvark users were increasing which shows a clearresult of the success of Aardvark. But in September 2011, Google announcedit would discontinue a number of its products, including Aardvark.

Figure 7.1: Increasing number of users in Aardvark.

26


It was very mysterious because without specifying any kind of reasons the Googleshut down the project. No Flaws or Error was found in the Search Engine but stillit was declined.As mentioned earlier there is no other application launched or yet launched basedon the concept of Social search engine. Yet there is a new kind of website basedsearch engine named as Social Searcher. This is a social based search engine butit is not working as per the concepts of Aardvark. Its retrieving information abouta query not by proper meaning but by taking the keyword into consideration.


Chapter 8

Conclusion

Here a detail explanation and study was made on the concept of Social Search En-gine and Aardvark. From the detailed study it can be inferred that the concept ofSocial based search can revolutionize in the field of Search Engine. Currently thereare several search engines available in market but all of them are derived from theconcept of text based search engine.

Rather than generating a predefined output from a trillions of output, which iscommon for every user a new model is designed here. Here when a user queriessomething keep the thoughts and the need which is being projected in the query asthe key things to generate an output is achieved here. For the first time humans’thoughts is becoming one of the dimensions in the field of Information Retrieval.

After the detail study it was learned that the Social Based Search Engine can enrichthe Information Content residing in the World Wide Web. If a provision was pro-vided to incorporate both the Social Search Engine and Hypertextual Based SearchEngine then a creation of highly informative source can be developed. Along withthat if a process of extracting information from the subjective questions and answerscan make the system more smarter.

28

Bibliography

[1] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web searchengine. In WWW Conference, 1998.

[2] D. Horowitz and Sepandar D. Kamvar, The Anatomy of a Large-Scale SocialSearch EngineDamon Horowitz. In WWW Coference, 2010

[3] ”Google Acquires Aardvark For 50 million (Confirmed)”. TechCrunch.techcrunch.com. Archived from the original on February 13, 2010. RetrievedFebruary 12, 2010.

[4] ”Google Acquires Aardvark”. Official Google Blog. Google. Archived from theoriginal on February 14, 2010. Retrieved February 12, 2010.

[5] T. Hofmann. ”Probabilistic latent semantic indexing”. In SIGIR, 1999.

[6] L. Page, S. Brin, R. Motwani, and T. Winograd. ”The PageRank citation rank-ing: Bringing order to the Web”. Stanford University Technical Report, 1998.

[7] B. J. Jansen, A. Spink, and T. Sarcevic. ”Real life, real users, and real needs:a study and analysis of user queries on the web”. Information Processing andManagement, 2000.

[8] D. Klein, K. Toutanova, H. T. Ilhan, S. D. Kamvar, and C. D. Man-ning.”Combining heterogeneous classifiers for word-sense disambiguation”. InSENSEVAL, 2002.

[9] H. Bechar-Israeli. From <Bonehead> to <cLoNehEAd>: Nicknames, Play, andIdentity on Internet Relay Chat. Journal of Computer-Mediated Communica-tion, 1995.

[10] A. R. Dennis and S. T. Kinney. ”Testing media richness theory in the new me-dia: The effects of cues, feedback, and task equivocality”. Information SystemsResearch, 1998.

[11] Green, Cordell, D. Luckham, R. Balzer, T. Cheatham, C. Rich (1986). ”Reporton a knowledge-based software assistant”. Readings in artificial intelligence andsoftware engineering (M. Kaufmann): Page Number: 377 - 428.

[12] J. Donath. ”Identity and deception in the virtual community”. Communities inCyberspace Conference, 1998.

[13] L. Sproull and S. Kiesler. ”Computers, networks, and work. In Global Net-works”, Computers and International Communication. MIT Press, 1993.

29


[14] M. R. Morris, J. Teevan, and K. Panovich. ”What do people ask their socialnetworks, and why? A Survey study of status message Q&A behavior”. In CHI,2010.

[15] M. Wesch. ”An anthropological introduction to YouTube”. Library of CongressSummit, 2008.

[16] C. McCarthy. ”Facebook: One Social Graph to Rule Them All?”. CBS News.Web July 11, 2010.

[17] R. Cuingnet, C. Rosso, M. Chupin, S. Lehricy, D. Dormont, H. Benali, Y.Samson and O. Colliot, ”Spatial regularization of SVM for the detection ofdiffusion alterations associated with stroke outcome”, Medical Image AnalysisConference, 2011.

[18] Rajaraman, A. Ullman; ”Data Mining Mining of Massive Datasets”, WWWConference, 2011

[19] Harispe S., Ranwez S. Janaqi S., Montmain J. ”Semantic Measures for the Com-parison of Units of Language, Concepts or Entities from Text and KnowledgeBase Analysis”, Arxiv Corporation, August 2013.

[20] R. Roy. ”Shard - A Database Design”, International Conference on DatabaseStudy, Geneva, July 28, 2008.

[21] D. Horowitz.”Mechanical Zoo Gets $6 Million To Build Aardvark Social SearchProduct”. TechCrunch. techcrunch.com.Web. March 14, 2009.


Software

Aardvark : Social Search Engine