7
NExT: NUS-Tsinghua Center for Extreme Search of User-Generated Content T he Web has revolutionized the way we cre- ate, disseminate, and consume information. Users have changed from passive recipients of information to active content consumers and creators, and the nature of information has also changed from static text to dynamic multi- media. With the widespread use of social net- works, live user-generated content (UGC) has begun to dominate the Internet. Such UGC cov- ers a range of media, from text (tweets, forums, and Facebook messages) to images (Instagram and Flickr), videos (YouTube), location check- ins (Foursquare), and community question- and-answer forums (Yahoo!Answers and WikiAnswers). Like the Deep Web, the amount of information available from such live UGC is huge and unknown. Many research efforts have focused on analyz- ing, indexing, and retrieving the Web, Deep Web, and multimedia content. However, re- search efforts to make sense of the huge and unstructured UGC have barely begun. The NUS- Tsinghua Center for Extreme Search (or NExT Center) is a collaboration between the National University of Singapore (NUS) and Tsinghua Uni- versity that focuses on the novel, challenging task of analyzing and organizing UGC to make it available for general access. (The NExT Center received a major grant to support this work in May 2010 from the Media Development Author- ity of Singapore through the Interactive Digital Media Program Office.) With the term ‘‘extreme search,’’ we mean to emphasize the ability to index, organize, and search for live and dynamic UGCs beyond what is currently indexed on the Web. In particular, the center focuses on crawling and mining UGC in Singapore and Beijing in the following areas: ĸ image and location-oriented information, including shared photos and check-in venues; ĸ topic-oriented information, such as tweets, discussion forums, and community question- and-answer forums; ĸ application-oriented information covering local mobile apps and associated informa- tion and discussions; and ĸ structured information, including a loca- tion’s factual, cultural, and historical information. All these sources generally contain publicly available data documenting people’s social interactions and their collective preferences and interests within a city. The prevalence of UGC permits us to better understand the social pulses of a society, and NExT aims to crawl and analyze live UGC toward achieving this goal. This article summarizes our initial work on live monitoring of raw UGC and events as they unfold. Specifically, this article highlights six of the research projects carried out by the NExT Center along these lines. The NExT Center is working to generate higher-level analytics to understand topics of interests to society, their sentiments, evolving live events and social communities, as well as fashion habits and trends. Our long term aim Media Impact Cees Snoek University of Amsterdam Huanbo Luan National University of Singapore and Tsinghua University Maosong Sun and Shiqiang Yang Tsinghua University Tat-Seng Chua National University of Singapore Editor’s Note User-generated content offers a wealth of research, business, and so- cietal opportunities. To study the dynamic multimedia phenomenon, the National University of Singapore and Tsinghua University estab- lished the NExT Center. In this article, the authors describe the center and highlight current projects. 1070-986X/12/$31.00 c 2012 IEEE Published by the IEEE Computer Society 81

NExT: NUS-Tsinghua Center for Extreme Search of …lms.comp.nus.edu.sg/sites/default/files/publication-attachments/...The NUS-Tsinghua Center for Extreme Search ... Figure 1 shows

Embed Size (px)

Citation preview

NExT: NUS-Tsinghua Centerfor Extreme Searchof User-Generated Content

The Web has revolutionized the way we cre-

ate, disseminate, and consume information.

Users have changed from passive recipients of

information to active content consumers and

creators, and the nature of information has

also changed from static text to dynamic multi-

media. With the widespread use of social net-

works, live user-generated content (UGC) has

begun to dominate the Internet. Such UGC cov-

ers a range of media, from text (tweets, forums,

and Facebook messages) to images (Instagram

and Flickr), videos (YouTube), location check-

ins (Foursquare), and community question-

and-answer forums (Yahoo!Answers and

WikiAnswers). Like the Deep Web, the amount

of information available from such live UGC is

huge and unknown.

Many research efforts have focused on analyz-

ing, indexing, and retrieving the Web, Deep

Web, and multimedia content. However, re-

search efforts to make sense of the huge and

unstructured UGC have barely begun. The NUS-

Tsinghua Center for Extreme Search (or NExT

Center) is a collaboration between the National

University of Singapore (NUS) and Tsinghua Uni-

versity that focuses on the novel, challenging

task of analyzing and organizing UGC to make

it available for general access. (The NExT Center

received a major grant to support this work in

May 2010 from the Media Development Author-

ity of Singapore through the Interactive Digital

Media Program Office.) With the term ‘‘extreme

search,’’ we mean to emphasize the ability to

index, organize, and search for live and dynamic

UGCs beyond what is currently indexed on the

Web. In particular, the center focuses on crawling

and mining UGC in Singapore and Beijing in the

following areas:

� image and location-oriented information,

including shared photos and check-in venues;

� topic-oriented information, such as tweets,

discussion forums, and community question-

and-answer forums;

� application-oriented information covering

local mobile apps and associated informa-

tion and discussions; and

� structured information, including a loca-

tion’s factual, cultural, and historical

information.

All these sources generally contain publicly

available data documenting people’s social

interactions and their collective preferences

and interests within a city. The prevalence of

UGC permits us to better understand the social

pulses of a society, and NExT aims to crawl and

analyze live UGC toward achieving this goal.

This article summarizes our initial work on

live monitoring of raw UGC and events as

they unfold. Specifically, this article highlights

six of the research projects carried out by the

NExT Center along these lines.

The NExT Center is working to generate

higher-level analytics to understand topics of

interests to society, their sentiments, evolving

live events and social communities, as well as

fashion habits and trends. Our long term aim

[3B2-9] mmu2012030081.3d 12/7/012 11:1 Page 81

Media Impact Cees SnoekUniversity of Amsterdam

Huanbo LuanNational Universityof Singapore andTsinghua University

Maosong Sun andShiqiang YangTsinghua University

Tat-Seng ChuaNational Universityof Singapore

Editor’s Note

User-generated content offers a wealth of research, business, and so-

cietal opportunities. To study the dynamic multimedia phenomenon,

the National University of Singapore and Tsinghua University estab-

lished the NExT Center. In this article, the authors describe the center

and highlight current projects.

1070-986X/12/$31.00 �c 2012 IEEE Published by the IEEE Computer Society 81

is to generate social graphs of topics and peo-

ple, documenting the activities and concerns

of people within the society.

Live Monitoring

Part of the NExT Center’s efforts is to crawl

and monitor UGC related to a city. Along

this aim, we have continually crawled the

image and location-oriented UGC, topic-

oriented UGC, UGC related to local mobile

apps, and structured information. Table 1

lists the data sources that we crawled and

their size as of 9 June 2012. A live monitor is

designed to view and visualize the live UGCs

as they are being created and propagated

through the Internet. Figure 1 shows an exam-

ple of visualizing the check-ins and tweets re-

lated to the topic ‘‘Mother’s Day.’’

We analyze these UGC to generate first-

order analytics. For example, by analyzing the

tweets related to any topic at different time

stamps, we can generate keyword clouds that

depict the differences in context related to

this entity. Figure 2 shows the keyword clouds

for Instagram before and after it was acquired

by Facebook on 10 April 2012. This example

clearly shows the power of live analysis in

quickly evolving circumstances and the emer-

gence of live events. To achieve this, we de-

velop technologies for large-scale, real-time

crawling, indexing, and analysis.

Organizational Sense

Social media portals such as Twitter contain

the everyday thoughts, opinions, and experi-

ences of their online users. Parts of these

UGC reflect and reveal their sentiments

about organizations or events. Motivated by

real-world industrial needs, this project fo-

cuses on discovering the public perception of

organizations from live social media. We can

ascertain the public perception of an organiza-

tion by answering several questions:

� What are people saying about the organiza-

tion (current and emerging discussions1)?

� Who are these people (organization users)?

� Where are they talking about the organiza-

tion (locations of posts)?

Our work focuses on tackling the key chal-

lenges in identifying this information. First, dif-

ferent organizations might share the same

acronym—for example, NUS could stand for

the National University of Singapore in Singa-

pore (as in this article) or the National Union

of Students in the UK. Hence, we need to differ-

entiate the discussions and user communities

of such organizations. However, because tweets

are no more than 140 characters long, the dis-

ambiguation task is challenging, especially for

organizations that share the same context (for

example, two universities).

To tackle this issue, we utilize the posts’ con-

tent and geolocation as well as the community

of the authors (users) to accurately relate the

posts to target organizations. However, because

[3B2-9] mmu2012030081.3d 12/7/012 11:1 Page 82

Table 1. Monitored user-generated content sources and their sizes

(as of 9 June 2012).

Data source Data type Number

Amazon Images 829,867

Products 1,313,545

Reviews 5,314,731

Dianping (Beijing) Images 943,197

Reviews 2,491,023

Shops 129,367

Fantong (Beijing) Reviews 336,818

Shops 82,892

Flickr Images (Beijing) 2,622,134

Images (Singapore) 2,621,677

Reviews 10,022,514

Users 2,060,357

Foursquare Check-ins 2,243,267

Venues 1,092,096

Panoramio Images (Singapore) 65,832

Reviews 19,801

Tencent Weibo Images 4,483,140

Tweets 58,667,476

Users 22,101,730

User relationships 22,101,730

Twitter Images 696,815

Hot topics 1,408

Tweets 357,483,444

Users 45,609,743

Sina Weibo Famous people 262

Images 5,799,155

Hot topics 72

Tweets 97,311,248

Users 28,074,303

Baidu Zhidao Q&A pairs 33,636

Yahoo!Answers Q&A pairs 3,158,912

WikiAnswers Q&A pairs 49,018,100

IEEE

Mu

ltiM

ed

ia

82

Media Impact

users often turn off the GPS tracking services of

their devices, the posts’ geolocations are usually

unavailable.2 Therefore, the second challenge is

to automatically predict a post’s geolocation.

Obviously, an organization’s user commu-

nity is an important factor in relating tweets

to target organizations. However, discovering

user communities is difficult because users

might have completely dissimilar interests,

friendship groups, discussion topics, and so

forth. Thus, another research issue is to auto-

matically identify key users, hot and emerging

discussions, and events related to an organiza-

tion. Discovering such information requires

the design of sophisticated multimedia analysis

methods.

We have cumulated this research into a sys-

tem for discovering an organization’s sense.

We then applied that system to discover and

monitor emerging discussions and topics

about NUS.

Organizing User-Generated Content

With the availability of large volumes of UGC

on many hot topics, it has become infeasible

for users to access and manage huge, unstruc-

tured sets of information. It is also overwhelm-

ing for users to grasp the overview of the

[3B2-9] mmu2012030081.3d 12/7/012 11:1 Page 83

Figure 1. Mother’s Day event map. This example visualizes user-generated check-ins, Flickr image uploads,

and tweets that reference Mother’s Day. We found 22,822 records in all sources, and this map displays

only 1,000 of them in 1.071 seconds.

Figure 2. Keyword clouds. By sampling tweets that referenced Instagram, we generated (a) the first cloud

on 9 April 2012 and (b) the second on 10 April 2012, before and after the company was acquired by

Facebook, respectively.

(a) (b)

July

�Sep

tem

ber

2012

83

discussions and obtain a sense of key subtopics

of common concern and interest to most

users. To tackle this problem, we developed a

framework to automatically ‘‘read,’’ digest,

and organize UGC in human-oriented struc-

tures to facilitate user access. In other words,

we aim to leverage domain knowledge to orga-

nize UGC related to any topic into hierarchies

or network structures that resemble the se-

mantic models in human brains.

Our system consists of three major steps:

1. Mine the concepts and propositions from

natural language sentences in a large

closed-domain UGC corpus.

2. Extract the knowledge structure based on the

concepts/propositions graph (see Figure 3

for an example on MAC Cosmetics).

3. Use the extracted knowledge structure to

help users organize and visualize the UGC

corpus (see Figure 4).3

We believe that the resulting knowledge struc-

tures will help users browse and analyze the

large amount of UGC on any topic.

Location Profiling

In location-based social networks, users implic-

itly interact with each other by visiting places,

posting comments, and uploading photos.

These heterogeneous interactions convey latent

information we can use to identify meaningful

user groups (namely, social communities4) that

exhibit unique location-oriented characteris-

tics. In particular, social network users can per-

form check-ins (black arrows), upload photos

(red arrows), and post comments (blue arrows)

at various venues (see Figure 5a). By modeling

these interactions as a multimodality nonuni-

form hypergraph,5 we can detect and profile so-

cial communities as a dense subgraph within

the heterogeneous hypergraph (see Figure 5b).

In any given instance, each user might be-

long to one or more communities. Figure 6

shows a visualization of the food lover commu-

nity on the global scale.

Mining Fashion Trends

For many people, social networks have be-

come an indispensable part of social activities.

In particular, the large amount of shared

images and videos form a rich part of social

engagements. Mining such visual media,

along with the associated text and metadata,

enables us to study the various aspects of

users’ social habits and preferences.

To accomplish this, this project first aims to

analyze images and videos to identify faces and

[3B2-9] mmu2012030081.3d 12/7/012 11:1 Page 84

Figure 3. Knowledge structures extracted from a collection of reviews on MAC

Cosmetics. (a) We first automatically generated the results for major MAC

products and included attributes such as mascara and gel in the graph-based

structure. (b) Next we produced a hierarchical structure, a tree extracted from

the first graph with MAC as its root and major products as the first-layer of

children, which is easier to browse than the original.

eyeshadow

brushes

gel

liquid

liquid

pencil water

carbon

eyeliner

eyeshadow brush

brush

foundation lip swatches powder concealer shadow blush pencil studio

studio shade powder lipsitck lipgloss

longwear pro brown creamy studio

optimum black opulash lash maybeline auctioned

tfsi mascara

mac

(a)

(b)

foundation

studio

liquid

brush

brush

brown

carbon

brushes

lipstick

mascara

gel

lash

studio

longwear

studio

cpulashpowder

shadow

powder

pencil

pencil

creamylip

tfsimaybeline

auction

swatches

water

mac

optioumshade

pro

concealer

eyeliner

liquid

blush

84

Media Impact

bodies of people in order to infer what they are

wearing in terms of clothing styles and various

fashion items. We can extract clothing elements

while ignoring the background and use them to

match similar clothing from other street photos

or merchant photos (see Figure 7). Moreover,

we can classify these clothing elements into var-

ious predefined classes of styles.6

We have also developed techniques to iden-

tify fashion accessories such as handbags and

shoes in order to match for other similar

items. By leveraging the large amount of such

data as training, we can offer users occasion-

oriented fashion recommendations. For in-

stance, if a user specifies an occasion, the system

can recommend the most suitable clothing

from a clothing database (see Figure 8). Finally,

by correlating the results of analysis with time,

location, and weather patterns, we can generate

various types of life style analytics that will be

invaluable for studying the changing dressing

habits and predicting how fashions change

over time.

Live City

Today’s smartphones (such as iPhone and HTC

models) are equipped with GPS receivers that

can capture a user’s location. This has led to the

development of location-based services that pro-

vide users with location-aware experiences. By

checking into location-based social services such

as Foursquare (https://foursquare.com), users

can find out if their friends are in certain loca-

tions. This project aims to develop a location-

based system that exploits users’ locations to

improve the quality of their lives in a city.

Our system goes beyond traditional location-

aware applications by facilitating continuous

[3B2-9] mmu2012030081.3d 12/7/012 11:1 Page 85

Perform a check-in

(a) (b)

Upload a photo Post a tip

Figure 5. Overview of

the location-based

social network. (a) User

interactions within the

network and (b) the

community detection

process.

Figure 4. Organizing

user-generated content.

We aim to organize

the UGC corpus based

on the extracted

knowledge structures to

facilitate user access

and browsing. This

example distills

information on the

iPhone Jailbreak utility

application.

July

�Sep

tem

ber

2012

85

and spatial-keyword search for a variety of

query types including range, k-nearest neigh-

bors, reverse k-nearest neighbors, top-k, and

preference queries. For example, the system

would allow a user to continuously search for

the nearest gas stations while driving. It also

facilitates location-specific advertisements (see

Figure 9).

As a start, we have investigated spatial-

keyword searches that are direction-aware in

order to prioritize answers that are ahead of

the users as they move.7 Moreover, our search-

as-you-type feature enables answers to be con-

tinuously refined as the user types each additional

letter. To support these features efficiently, we

developed several novel data structures that

seamlessly integrate textual description and

spatial information to index spatial data, inves-

tigated the use of MapReduce,8 and developed

a scheme that efficiently batch processes a set

of queries to minimize multiple accesses to

the same table.

Our experimental results show that the

system is efficient. For example, searching

20 nearest neighbors from a 5 million points

of interest (POIs) dataset can be completed

within 5 ms. We are working toward a full-

fledged system for a specific application

domain.

[3B2-9] mmu2012030081.3d 12/7/012 11:1 Page 86

Figure 6. Visualization of the global food lover community. Each user might belong to one or more communities.

Americanrestaurant

Figure 7. The street-to-shop clothing retrieval model. This project extracts

clothing elements while ignoring the background and then matches them to

similar clothing from other street or merchant photos.

Street snap

Online store

Cross-scenariosearch

86

Media Impact

Conclusion

Archived UGC permits us to learn about the

evolution of various topics within the society

over time. Such information is valuable to gen-

eral users going about their daily lives, can

help enterprises better serve their customers,

and can help governmental organizations bet-

ter plan for social services and provisions as a

result of social trends and concerns.

This work is just beginning. We aim to share

our data, analytics, and resources in an effort to

increase collaboration and make our world a

better place to live in. MM

References

1. A. Saha et al., ‘‘Learning Evolving and Emerging

Topics in Social Media: A Dynamic NMF

Approach with Temporal Regularization,’’ Proc.

5th ACM Int’l Conf. Web Search and Data Mining

(WSDM), 2012, pp. 693�702.

2. A. Sadilek et al., ‘‘Finding Your Friends and Fol-

lowing Them to Where You Are,’’ Proc. 5th ACM

Int’l Conf. Web Search and Data Mining (WSDM),

ACM Press, 2012, pp. 723�732.

3. Z. Ming et al., ‘‘Prototype Hierarchy Based Clus-

tering for the Categorization and Navigation of

Web Collections,’’ Proc. 33rd Int’l ACM SIGIR

Conf. Research and Development in Information

Retrieval, ACM Press, 2010, pp. 2�9.

4. L. Tang et al., Community Detection and Mining in

Social Media, Morgan & Claypool Publishers, 2010.

5. H. Liu et al., ‘‘Robust Clustering as Ensemble of

Affinity Relations,’’ Proc. 24th Ann. Conf. Neural

Information Processing Systems (NIPS), Curran

Associates Publisher, 2010, pp. 1414�1422.

6. S. Liu et al., ‘‘Street-to-Shop: Cross-Scenario Cloth-

ing Retrieval Via Parts Alignment and Auxiliary Set,’’

Proc. IEEE Computer Soc. Conf. Computer Vision and

Pattern Recognition (CVPR), IEEE CS Press, 2012.

7. G. Li et al., ‘‘Desks: Direction-Aware Spatial Key-

word Search,’’ Proc. Int’l Conf. Data Eng. (ICDE),

IEEE CS Press, 2012, pp. 459�470.

8. L. Shi et al., ‘‘S3: An Efficient Shared Scan Sched-

uler on MapReduce Framework,’’ Proc. 2011 Int’l

Conf. Parallel Processing (ICPP), IEEE CS Press,

2011, pp. 325�334.

Tat-Seng Chua is the codirector of NExT Center and

the KITHCT chair professor in the Department of

Computer Science at the National University of

Singapore. Contact him at [email protected].

Huanbo Luan is the deputy executive director of

NExT Center at both the National University of

Singapore and Tsinghua University. Contact him at

[email protected].

Maosong Sun is the codirector of NExT Center, and a

professor and vice chairman of the Department of

Computer Science and Technology at Tsinghua Uni-

versity. Contact him at [email protected].

Shiqiang Yang is a professor in the Department of

Computer Science and Technology and codirector

of Microsoft-Tsinghua Multimedia Joint Lab at Tsing-

hua University as well as the president of Multimedia

Committee of China Computer Federation, China.

Contact him at [email protected].

[3B2-9] mmu2012030081.3d 12/7/012 11:1 Page 87

Figure 8. Occasion-oriented clothing recommendation. The system can

recommend the most suitable clothing from an occasion-oriented clothing

database.

Figure 9. The search-as-you-type and direction-aware, spatial-keyword search

for ‘‘shopping center.’’ As the user types ‘‘shopping center,’’ the system displays

the points of interest (POIs) ‘‘in front’’ of the user. We enrich the POI by

generating structured data from the user-generated data.

Magic closet, I amgoing on a date, tell

me what to wear.

Sports Travel

s

FuneralOccasionsConference

Dating Wedding

Recommendationresults

July

�Sep

tem

ber

2012

87