A Noun Phrase Analysis Tool for Mining Online Community Conversations
Caroline Haythornthwaite and Anatoliy Gruzd

DESCRIPTION

A Noun Phrase Analysis Tool for Mining Online Community Conversations
Caroline Haythornthwaite and Anatoliy Gruzd (U of Illinois)

See the full paper at http://www.iisi.de/fileadmin/IISI/upload/C_T/2007/Haythornthwaite.pdf

Abstract: Online communities are creating a growing legacy of texts. These texts record conversation, knowledge exchange, and variation in topic and orientation as groups grow, mature, and decline; they represent a rich history of group interaction and an opportunity to explore the purpose and development of online communities. The problem is how to approach and make sense of the vast amount of data stored by these communities and to use that information for some useful outcome. In this paper we use automated processes, including natural language processing, to explore the case of text accumulated from bulletin board postings from eight iterations of an online class. The paper presents work done on creating and refining the natural language processing procedures used to examine these data, and a description of results so far from these examinations.

Page 1: A Noun Phrase Analysis Tool for Mining Online Community Conversations

Page 2: A Noun Phrase Analysis Tool for Mining Online Community Conversations

Problem

• Online communities are creating a growing volume of texts
− By 2010, >70% of all digital information will be user-generated. (Technology Consultancy IDC)
− And the majority of it will still be text-based

• How can we analyze and make sense of such a vast amount of textual data?

• Possible Solution:
− Automated text analysis

Page 3: A Noun Phrase Analysis Tool for Mining Online Community Conversations

Research Questions

Can we automate the process of creating an “effective representation” of the texts produced by online communities?

A “representation” is an excerpt [of the original text] that describes or may stand in for the original text.

An “effective” representation can help us to answer:

− Can we discover what the community interests and priorities are?

− Can we discover patterns of language and interaction that characterize a community?

Page 4: A Noun Phrase Analysis Tool for Mining Online Community Conversations

Why Nouns & Noun Phrases?

1. Nouns and noun phrases tend to be the most informative elements of any sentence

2. Noun phrases make it easy to disambiguate the meaning of words

e.g. ‘travel information’, ‘information center’, ‘information management’

3. Noun phrase extraction is easy and cheap to accomplish with today’s Natural Language Processing (NLP) techniques

• NLP is a set of computational techniques for processing natural (human) languages

Page 5: A Noun Phrase Analysis Tool for Mining Online Community Conversations

Example of Noun-Phrase Extraction

“We will have minivan during the conference to help shuttle attendees from the other hotels to and from the Kellogg Center.”

Step 1. Part-of-Speech Tagging

<We/PRP> <will/MD> <have/VB> <minivan/NN> <during/IN> <the/DT> <conference/NN> <to/TO> <help/VB> <shuttle/VB> <attendees/NNS> <from/IN> <the/DT> <other/JJ> <hotels/NNS> <to/TO> <and/CC> <from/IN> <the/DT> <Kellogg/NNP> <Center/NNP>

Step 2. Chunking

<We/PRP> <will/MD> <have/VB> <minivan/NN> <during/IN> <the/DT> <conference/NN> <to/TO> <help/VB> <shuttle/VB> <attendees/NNS> <from/IN> <the/DT> (NP: <other/JJ> <hotels/NNS>) <to/TO> <and/CC> <from/IN> <the/DT> (NP: <Kellogg/NNP> <Center/NNP>)

Representation

<We/PRP> <will/MD> <have/VB> (NP:<minivan/NN>) <during/IN> (NP:<the/DT> <conference/NN>) <to/TO> <help/VB> <shuttle/VB> (NP:<attendees/NNS>) <from/IN> (NP:<the/DT> <other_hotels/NNS>) <to/TO> <and/CC> <from/IN> (NP:<the/DT> <Kellogg_Center/NNP>)
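The chunking and representation steps can be approximated on the already-tagged example with a few lines of Python. This is a minimal sketch, not the chunker ICTA uses: the rule (an optional determiner, any adjectives, then one or more nouns) and the function names `chunk_noun_phrases` and `to_term` are assumptions made for illustration.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (word, part-of-speech tag)

# Tagged tokens copied from the Step 1 output of the example sentence.
TAGGED: List[Tagged] = [
    ("We", "PRP"), ("will", "MD"), ("have", "VB"), ("minivan", "NN"),
    ("during", "IN"), ("the", "DT"), ("conference", "NN"), ("to", "TO"),
    ("help", "VB"), ("shuttle", "VB"), ("attendees", "NNS"), ("from", "IN"),
    ("the", "DT"), ("other", "JJ"), ("hotels", "NNS"), ("to", "TO"),
    ("and", "CC"), ("from", "IN"), ("the", "DT"), ("Kellogg", "NNP"),
    ("Center", "NNP"),
]


def chunk_noun_phrases(tagged: List[Tagged]) -> List[List[Tagged]]:
    """Group runs matching (DT)? (JJ)* (NN*)+ into noun phrases (assumed rule)."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":          # optional determiner
            j += 1
        while j < n and tagged[j][1].startswith("JJ"):  # any adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):  # one or more nouns
            k += 1
        if k > j:                          # at least one noun was found
            phrases.append(tagged[i:k])
            i = k
        else:
            i += 1
    return phrases


def to_term(phrase: List[Tagged]) -> str:
    """Drop the determiner and join the remaining words with underscores."""
    return "_".join(word for word, tag in phrase if tag != "DT")


terms = [to_term(p) for p in chunk_noun_phrases(TAGGED)]
print(terms)
```

Joining multi-word phrases with underscores mirrors the `other_hotels` and `Kellogg_Center` tokens in the representation, so each phrase can later be counted as a single term.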

Page 6: A Noun Phrase Analysis Tool for Mining Online Community Conversations

Examples of Part-of-Speech Tags commonly used in NLP

We …

… Center

Page 7: A Noun Phrase Analysis Tool for Mining Online Community Conversations

Examples of Open-source NLP Toolkits

• NLTK - http://nltk.sourceforge.net

• LingPipe - http://www.alias-i.com/lingpipe/

• MII NLP Toolkit - http://www.mii.ucla.edu/nlp/

• OpenNLP - http://opennlp.sourceforge.net/

WARNING! Advanced Knowledge of Computational Linguistics & Programming Skills Required!

Page 8: A Noun Phrase Analysis Tool for Mining Online Community Conversations

ICTA (Internet Community Text Analyzer)

Course name: Information Organization and Access
Data source: bulletin board messages
Classes: 8
School years: 2001 – 2004
Duration of each class: 15 weeks
No. of students per class: 31 - 54
No. of messages per class: 1200 - 2100

Page 9: A Noun Phrase Analysis Tool for Mining Online Community Conversations

Preliminary Exploration of the dataset using ICTA

1. Most frequently used words
2. Important Topics Over Time
3. Community Style
4. Community Support

Page 10: A Noun Phrase Analysis Tool for Mining Online Community Conversations

Preliminary Exploration 1. Most frequently used words

• Profession-related words
− book/s, information, library/libraries, librarian/s
− user/s, and patron/s, people
− database/s, search, document/s

• Learning-related words
− question/s, article/s, example/s, way, study, class, course, research, journal, reading, method, problem, hard time
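A frequency list like this one can be produced by counting terms across all messages. The sketch below is illustrative only: the sample messages, the stopword list, and the function names are hypothetical, and it counts plain words rather than the noun phrases ICTA extracts.

```python
import re
from collections import Counter

# Hypothetical sample messages; the class bulletin-board data are not reproduced here.
messages = [
    "The library database was hard to search.",
    "I asked the librarian about the database.",
    "Has anyone finished the reading for class?",
]

# Hypothetical stopword list, just large enough for the sample above.
STOPWORDS = {
    "the", "was", "to", "i", "about", "has", "for",
    "anyone", "hard", "asked", "finished",
}


def tokenize(text):
    """Lowercase a message and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def most_frequent(messages, stopwords, top_n):
    """Return the top_n (word, count) pairs over all messages, ignoring stopwords."""
    counts = Counter()
    for message in messages:
        counts.update(t for t in tokenize(message) if t not in stopwords)
    return counts.most_common(top_n)


top = most_frequent(messages, STOPWORDS, 3)
print(top)
```

Merging singular and plural forms (as in "book/s") would need an extra normalization step that is not shown here.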

Page 11: A Noun Phrase Analysis Tool for Mining Online Community Conversations

Preliminary Exploration 2. Important Topics Over Time

[Figure: % of messages containing "Database(s)" and "Google". X-axis: school year (2001 to 2004); y-axis: % of msgs (0 to 25); series: database(s), google.]
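The quantity plotted here, the share of messages per school year that mention a term, can be computed as in the following sketch. The `(year, text)` input format, the sample messages, and the function name `percent_by_year` are assumptions for illustration; they are not taken from ICTA.

```python
import re
from collections import defaultdict


def percent_by_year(messages, pattern):
    """Return {year: percent of that year's messages matching the regex pattern}.

    `messages` is an iterable of (year, text) pairs (an assumed input format).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    regex = re.compile(pattern, re.IGNORECASE)
    for year, text in messages:
        totals[year] += 1
        if regex.search(text):
            hits[year] += 1
    return {year: 100.0 * hits[year] / totals[year] for year in sorted(totals)}


# Hypothetical sample data, not the class messages analyzed in the talk.
sample = [
    (2001, "Which databases should we search?"),
    (2001, "See you in class."),
    (2002, "I just used Google."),
    (2002, "The database was down."),
    (2002, "Google found it faster."),
    (2002, "Thanks!"),
]

database_share = percent_by_year(sample, r"\bdatabases?\b")
google_share = percent_by_year(sample, r"\bgoogle\b")
print(database_share, google_share)
```

Each message is counted once per term regardless of how many times the term appears in it, which matches a "% of messages containing" measure.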

Page 12: A Noun Phrase Analysis Tool for Mining Online Community Conversations

Preliminary Exploration 3. Community Style

[Figure: % of messages containing Don’t Think, Don’t Know, Don’t Have. X-axis: school year (2001 to 2004); y-axis: % of msgs (0 to 20).]

Page 13: A Noun Phrase Analysis Tool for Mining Online Community Conversations

Preliminary Exploration 4. Community Support

[Figure: % of messages agreeing or disagreeing or containing ‘Thanks’. X-axis: school year (2001 to 2004); y-axis: 0 to 20; series: Agree, Disagree, Thanks.]

Page 14: A Noun Phrase Analysis Tool for Mining Online Community Conversations

Future Work

• Exploring the use of other word classes (e.g. verbs)

• Training NLP algorithms on CMC-type corpora

• Automatic grouping of noun phrases into concepts (with manual override)

− e.g. RDB, database, relational database

• Connecting ICTA to external textual data from other public online communities (e.g. blogs, MySpace)

− RSS feeds / web APIs

• Understanding the social science of language use in online communities

• Collaboration with the NCSA’s DISCUS project

− concept maps

− social networks