A Noun Phrase Analysis Tool for Mining Online Community Conversations

Caroline Haythornthwaite <[email protected]> Anatoliy Gruzd <[email protected]>

Problem

• Online communities are creating a growing volume of texts

− By 2010, >70% of all digital information will be user-generated. (Technology Consultancy IDC)

− And the majority of it will still be text-based

• How can we analyze and make sense of such a vast amount of textual data?

• Possible Solution:

− Automated text analysis

Research Questions

Can we automate the process of creating an “effective representation” of the texts produced by online communities?

A “representation” is an excerpt [of the original text] that describes or may stand in for the original text.

An “effective” representation can help us to answer:

− Can we discover what the community interests and priorities are?

− Can we discover patterns of language and interaction that characterize a community?

Why Nouns & Noun Phrases?

1. Nouns and noun phrases tend to be the most informative elements of any sentence

2. Noun phrases make it easy to disambiguate the meaning of words

e.g. ‘travel information’, ‘information center’, ‘information management’

3. Noun phrase extraction is easy and cheap to accomplish with today’s Natural Language Processing (NLP) techniques

• NLP is a set of computational techniques for processing natural (human) languages

Example of Noun-Phrase Extraction

“We will have minivan during the conference to help shuttle attendees from the other hotels to and from the Kellogg Center.”

Step 1. Part-of-Speech Tagging

<We/PRP> <will/MD> <have/VB> <minivan/NN> <during/IN> <the/DT> <conference/NN> <to/TO> <help/VB> <shuttle/VB> <attendees/NNS> <from/IN> <the/DT> <other/JJ> <hotels/NNS> <to/TO> <and/CC> <from/IN> <the/DT> <Kellogg/NNP> <Center/NNP>

Step 2. Chunking

<We/PRP> <will/MD> <have/VB> <minivan/NN> <during/IN> <the/DT> <conference/NN> <to/TO> <help/VB> <shuttle/VB> <attendees/NNS> <from/IN> <the/DT> (NP: <other/JJ> <hotels/NNS>) <to/TO> <and/CC> <from/IN> <the/DT> (NP: <Kellogg/NNP> <Center/NNP>)

Representation

<We/PRP> <will/MD> <have/VB> (NP:<minivan/NN>) <during/IN> (NP:<the/DT> <conference/NN>) <to/TO> <help/VB> <shuttle/VB> (NP:<attendees/NNS>) <from/IN> (NP:<the/DT> <other_hotels/NNS>) <to/TO> <and/CC> <from/IN> (NP:<the/DT> <Kellogg_Center/NNP>)
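The chunking step can be illustrated with a minimal sketch in plain Python. It starts from the tagged tokens of Step 1 and groups them with one simple rule: an optional determiner, any number of adjectives, then one or more nouns. That rule and the function names `chunk_noun_phrases` and `phrase_term` are assumptions made for illustration; the slides do not say which chunking grammar was actually used.

```python
# Tagged tokens copied from the Step 1 output of the example sentence.
TAGGED = [
    ("We", "PRP"), ("will", "MD"), ("have", "VB"), ("minivan", "NN"),
    ("during", "IN"), ("the", "DT"), ("conference", "NN"), ("to", "TO"),
    ("help", "VB"), ("shuttle", "VB"), ("attendees", "NNS"), ("from", "IN"),
    ("the", "DT"), ("other", "JJ"), ("hotels", "NNS"), ("to", "TO"),
    ("and", "CC"), ("from", "IN"), ("the", "DT"), ("Kellogg", "NNP"),
    ("Center", "NNP"),
]


def chunk_noun_phrases(tagged):
    """Group tokens matching DT? JJ* NN+ into noun phrases (assumed rule)."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        start = i
        if tagged[i][1] == "DT":                       # optional determiner
            i += 1
        while i < n and tagged[i][1].startswith("JJ"):  # any adjectives
            i += 1
        noun_start = i
        while i < n and tagged[i][1].startswith("NN"):  # one or more nouns
            i += 1
        if i > noun_start:
            phrases.append(tagged[start:i])
        else:
            i = start + 1                               # no noun here; move on
    return phrases


def phrase_term(phrase):
    """Join the non-determiner words with '_' as in the Representation line."""
    return "_".join(word for word, tag in phrase if tag != "DT")


terms = [phrase_term(p) for p in chunk_noun_phrases(TAGGED)]
print(terms)
```

On this sentence the sketch finds the same five noun phrases that are marked in the Representation line.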

Examples of Part-of-Speech Tags commonly used in NLP
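The tags in the example sentence appear to follow the Penn Treebank tag set. As a quick reference, the sketch below lists only the tags that occur in that sentence, with their standard Penn Treebank meanings. The helper `describe` is a hypothetical name added for illustration.

```python
# Penn Treebank tags that occur in the example sentence.
PENN_TAGS = {
    "PRP": "personal pronoun",
    "MD": "modal verb",
    "VB": "verb, base form",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "IN": "preposition or subordinating conjunction",
    "DT": "determiner",
    "JJ": "adjective",
    "TO": "the word 'to'",
    "CC": "coordinating conjunction",
}


def describe(token):
    """Turn a 'word/TAG' token such as 'hotels/NNS' into a readable gloss."""
    word, _, tag = token.rpartition("/")
    return f"{word}: {PENN_TAGS.get(tag, 'unknown tag')}"


print(describe("hotels/NNS"))
```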


Examples of Open-source NLP Toolkits

• NLTK - http://nltk.sourceforge.net

• LingPipe - http://www.alias-i.com/lingpipe/

• MII NLP Toolkit - http://www.mii.ucla.edu/nlp/

• OpenNLP - http://opennlp.sourceforge.net/

WARNING! Advanced knowledge of computational linguistics & programming skills required!

ICTA: Internet Community Text Analyzer

Course name: Information Organization and Access

Data source: bulletin board messages

Classes: 8

School years: 2001 – 2004

Duration of each class: 15 weeks

No. of students per class: 31 - 54

No. of messages per class: 1200 - 2100

Preliminary Exploration of the dataset using ICTA

1. Most frequently used words

2. Important Topics Over Time

3. Community Style

4. Community Support

Preliminary Exploration 1. Most frequently used words

• Profession-related words

− book/s, information, library/libraries, librarian/s

− user/s, patron/s, people

− database/s, search, document/s

• Learning-related words

− question/s, article/s, example/s, way, study, class, course, research, journal, reading, method, problem, hard time
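A word-frequency count of this kind can be sketched in a few lines of Python. This is an illustration only, not ICTA's implementation: the stop-word list and the sample messages are made up, and the sketch does not merge singular and plural forms the way the "book/s" notation suggests.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "i", "for", "on", "my"}


def top_words(messages, n=5):
    """Return the n most frequent non-stop-words across all messages."""
    counts = Counter()
    for text in messages:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(n)


# Made-up messages for illustration only; not taken from the study's data.
sample = [
    "The library database is hard to search.",
    "I asked a librarian for a database article.",
    "Search the database for my reading.",
]

print(top_words(sample, 2))
```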

Preliminary Exploration 2. Important Topics Over Time

Chart: % of messages containing "Database(s)" and "Google", by school year, 2001 to 2004 (y-axis: % of msgs, 0 to 25; series: database(s), google)
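The measure plotted in charts like this one, the percentage of messages per school year that mention a term, can be computed as in the sketch below. The function name `percent_containing` and the sample messages are hypothetical. Matching is by case-insensitive substring, so "database" also counts "databases"; a real tool would likely match on word boundaries or extracted noun phrases.

```python
from collections import defaultdict


def percent_containing(messages, terms):
    """messages: iterable of (year, text) pairs.

    Returns {year: percentage of that year's messages mentioning any term}.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    lowered = [t.lower() for t in terms]
    for year, text in messages:
        totals[year] += 1
        body = text.lower()
        if any(term in body for term in lowered):
            hits[year] += 1
    return {year: 100.0 * hits[year] / totals[year] for year in sorted(totals)}


# Made-up messages for illustration only; not taken from the study's data.
sample_msgs = [
    (2001, "Which database should I search?"),
    (2001, "See you in class."),
    (2004, "I just used Google."),
    (2004, "Google found the article."),
]

print(percent_containing(sample_msgs, ["database"]))
print(percent_containing(sample_msgs, ["google"]))
```

The same function works for multi-word phrases such as "don't know" or "thanks", since it only checks whether the text contains the term.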

Preliminary Exploration 3. Community Style

Chart: % of messages containing Don’t Think, Don’t Know, Don’t Have, by school year, 2001 to 2004 (y-axis: % of msgs, 0 to 20)

Preliminary Exploration 4. Community Support

Chart: % of messages agreeing or disagreeing or containing ‘Thanks’, by school year, 2001 to 2004 (y-axis: 0 to 20; series: Agree, Disagree, Thanks)

Future Work

• Exploring the use of other word classes (e.g. verbs)

• Training NLP algorithms on computer-mediated communication (CMC) corpora

• Automatic grouping of noun phrases into concepts (with manual override)

− e.g. RDB, database, relational database

• Connecting ICTA to external textual data from other public online communities (e.g. blogs, MySpace)

− RSS feeds/ web APIs

• Understanding the social science of language use in online communities

• Collaboration with the NCSA’s DISCUS project

− concept maps

− social networks

http://www-discus.ge.uiuc.edu/discussite/
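One of the future-work items above, grouping noun phrases into concepts with a manual override, could look roughly like the following sketch. Everything here is an assumption made for illustration: the automatic step simply uses the last word of a phrase (its head noun) as the concept, and the names `auto_concept` and `group_phrases` are hypothetical. The slides do not describe how ICTA would do this.

```python
from collections import Counter


def auto_concept(phrase):
    """Assumed automatic rule: the lower-cased last word is the concept."""
    return phrase.lower().split()[-1]


def group_phrases(phrases, overrides=None):
    """Count phrases per concept; manual overrides win over the automatic rule."""
    overrides = {k.lower(): v for k, v in (overrides or {}).items()}
    counts = Counter()
    for phrase in phrases:
        key = phrase.lower()
        counts[overrides.get(key, auto_concept(phrase))] += 1
    return counts


phrases = ["RDB", "database", "relational database", "Database"]

# The abbreviation cannot be resolved automatically, so it is mapped by hand.
grouped = group_phrases(phrases, overrides={"RDB": "database"})
print(grouped)
```

Without the override, "RDB" would form its own concept, which is why a manual correction step is useful alongside the automatic grouping.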