A Noun Phrase Analysis Tool for Mining Online Community Conversations
Caroline Haythornthwaite and Anatoliy Gruzd (U of Illinois)
See the full paper at http://www.iisi.de/fileadmin/IISI/upload/C_T/2007/Haythornthwaite.pdf
Abstract: Online communities are creating a growing legacy of texts. These texts record conversation, knowledge exchange, and variation in topic and orientation as groups grow, mature, and decline; they represent a rich history of group interaction and an opportunity to explore the purpose and development of online communities. The problem is how to approach and make sense of the vast amount of data stored by these communities and to use that information for some useful outcome. In this paper we use automated processes, including natural language processing, to explore the case of text accumulated from bulletin board postings from eight iterations of an online class. The paper presents work done on creating and refining the natural language processing procedures used to examine these data, and a description of results so far from these examinations.
A Noun Phrase Analysis Tool for Mining Online Community Conversations
Caroline Haythornthwaite, Anatoliy Gruzd
Problem
• Online communities are creating a growing volume of texts
− By 2010, >70% of all digital information will be user-generated (Technology Consultancy IDC)
− And the majority of it will still be text-based
• How can we analyze and make sense of such a vast amount of textual data?
• Possible Solution:
− Automated text analysis
Research Questions
Can we automate the process of creating an “effective representation” of the texts produced by online communities?
A “representation” is an excerpt [of the original text] that describes or may stand in for the original text.
An “effective” representation can help us to answer:
− Can we discover what the community interests and priorities are?
− Can we discover patterns of language and interaction that characterize a community?
Why Nouns & Noun Phrases?
1. Nouns and noun phrases tend to be the most informative elements of any sentence
2. Noun phrases make it easy to disambiguate the meaning of words
e.g. ‘travel information’, ‘information center’, ‘information management’
3. Noun phrase extraction is easy and cheap to accomplish with today’s Natural Language Processing (NLP) techniques
• NLP is a set of computational techniques for processing natural (human) languages
Example of Noun-Phrase Extraction
“We will have minivan during the conference to help shuttle attendees from the other hotels to and from the Kellogg Center.”
Step 1. Part-of-Speech Tagging
<We/PRP> <will/MD> <have/VB> <minivan/NN> <during/IN> <the/DT> <conference/NN> <to/TO> <help/VB> <shuttle/VB> <attendees/NNS> <from/IN> <the/DT> <other/JJ> <hotels/NNS> <to/TO> <and/CC> <from/IN> <the/DT> <Kellogg/NNP> <Center/NNP>
Step 2. Chunking
<We/PRP> <will/MD> <have/VB> <minivan/NN> <during/IN> <the/DT> <conference/NN> <to/TO> <help/VB> <shuttle/VB> <attendees/NNS> <from/IN> <the/DT> (NP: <other/JJ> <hotels/NNS>) <to/TO> <and/CC> <from/IN> <the/DT> (NP: <Kellogg/NNP> <Center/NNP>)
Representation
<We/PRP> <will/MD> <have/VB> (NP:<minivan/NN>) <during/IN> (NP:<the/DT> <conference/NN>) <to/TO> <help/VB> <shuttle/VB> (NP:<attendees/NNS>) <from/IN> (NP:<the/DT> <other_hotels/NNS>) <to/TO> <and/CC> <from/IN> (NP:<the/DT> <Kellogg_Center/NNP>)
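The chunking step can be illustrated with a minimal sketch in plain Python. It assumes the sentence has already been part-of-speech tagged (the tags are copied from Step 1 above) and uses a simple hand-written pattern: an optional determiner, any number of adjectives, then one or more nouns. The function names `chunk_noun_phrases` and `as_term` are hypothetical; this is not the ICTA implementation, and a real toolkit would use a trained tagger and chunker.

```python
# Tagged tokens copied from the Part-of-Speech Tagging step.
TAGGED = [
    ("We", "PRP"), ("will", "MD"), ("have", "VB"), ("minivan", "NN"),
    ("during", "IN"), ("the", "DT"), ("conference", "NN"), ("to", "TO"),
    ("help", "VB"), ("shuttle", "VB"), ("attendees", "NNS"), ("from", "IN"),
    ("the", "DT"), ("other", "JJ"), ("hotels", "NNS"), ("to", "TO"),
    ("and", "CC"), ("from", "IN"), ("the", "DT"), ("Kellogg", "NNP"),
    ("Center", "NNP"),
]


def chunk_noun_phrases(tagged):
    """Group tokens matching the pattern DT? JJ* NN+ into noun phrases."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        start = i
        if tagged[i][1] == "DT":              # optional determiner
            i += 1
        while i < n and tagged[i][1].startswith("JJ"):   # adjectives
            i += 1
        noun_start = i
        while i < n and tagged[i][1].startswith("NN"):   # NN, NNS, NNP, ...
            i += 1
        if i > noun_start:                    # at least one noun was found
            phrases.append(tagged[start:i])
        else:                                 # no noun: move on one token
            i = start + 1
    return phrases


def as_term(phrase):
    """Drop determiners and join the remaining words with underscores."""
    return "_".join(word for word, tag in phrase if tag != "DT")


terms = [as_term(p) for p in chunk_noun_phrases(TAGGED)]
print(terms)
```

For the example sentence this yields the terms minivan, conference, attendees, other_hotels and Kellogg_Center, matching the noun phrases marked in the Representation line.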
Examples of Part-of-Speech Tags commonly used in NLP
Examples of Open-source NLP Toolkits
• NLTK - http://nltk.sourceforge.net
• LingPipe - http://www.alias-i.com/lingpipe/
• MII NLP Toolkit - http://www.mii.ucla.edu/nlp/
• OpenNLP - http://opennlp.sourceforge.net/
WARNING! Advanced Knowledge of Computational Linguistics & Programming Skills Required!
ICTA: Internet Community Text Analyzer
Course name: Information Organization and Access
Data source: bulletin board messages
Classes: 8
School years: 2001 – 2004
Duration of each class: 15 weeks
No. of students per class: 31 - 54
No. of messages per class: 1200 - 2100
Preliminary Exploration of the dataset using ICTA
1. Most frequently used words
2. Important Topics Over Time
3. Community Style
4. Community Support
Preliminary Exploration 1. Most frequently used words
• Profession-related words
− book/s, information, library/libraries, librarian/s
− user/s, and patron/s, people
− database/s, search, document/s
• Learning-related words
− question/s, article/s, example/s, way, study, class, course, research, journal, reading, method, problem, hard time
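A frequency list like this one, with singular and plural forms folded together (book/s, library/libraries), could be produced along the following lines. This is a rough sketch, not ICTA's method: the `normalize` rule is a deliberately naive, hypothetical plural-folding heuristic, no stop words are filtered, and the sample messages are invented for illustration.

```python
import re
from collections import Counter


def normalize(word):
    """Lower-case a word and fold simple English plurals (naive heuristic)."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:       # libraries -> library
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]                          # books -> book, class stays
    return w


def word_counts(messages):
    """Count normalized word forms across a list of message strings."""
    counts = Counter()
    for text in messages:
        counts.update(normalize(t) for t in re.findall(r"[A-Za-z]+", text))
    return counts


# Invented sample messages, for illustration only.
sample = [
    "The library has books.",
    "Libraries lend a book to each user.",
    "Users search the library database.",
]
print(word_counts(sample).most_common(1))
```

In practice a stop-word list or the noun-phrase filter from the earlier example would be applied first, so that function words do not crowd out the content words.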
Preliminary Exploration 2. Important Topics Over Time
[Chart: % of messages containing "Database(s)" and "Google", by school year (2001 – 2004); y-axis: % of msgs, 0 to 25]
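The per-year percentages behind a chart of this kind can be computed with a short function such as the one below. It is a sketch under stated assumptions: messages are represented as (school year, text) pairs, the function name `percent_containing` is hypothetical, and the sample data is invented rather than taken from the class bulletin boards.

```python
import re
from collections import defaultdict


def percent_containing(messages, pattern):
    """Return {year: % of that year's messages matching the regex pattern}.

    `messages` is an iterable of (year, text) pairs; matching ignores case.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, text in messages:
        totals[year] += 1
        if regex.search(text):
            hits[year] += 1
    return {year: 100.0 * hits[year] / totals[year] for year in sorted(totals)}


# Invented sample data, for illustration only.
toy_messages = [
    (2001, "Search the database first"),
    (2001, "I used Google"),
    (2002, "Databases are slow"),
    (2002, "Try google"),
    (2002, "Google again"),
    (2002, "No match here"),
]
print(percent_containing(toy_messages, r"\bdatabases?\b"))
print(percent_containing(toy_messages, r"\bgoogle\b"))
```

The same function covers phrase-based measures, for example a pattern such as `r"\bdon[’']t (think|know|have)\b"` for the negation phrases or `r"\bthanks\b"` for expressions of thanks.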
Preliminary Exploration 3. Community Style
[Chart: % of messages containing "Don’t Think", "Don’t Know", "Don’t Have", by school year (2001 – 2004); y-axis: % of msgs, 0 to 20]
Preliminary Exploration 4. Community Support
[Chart: % of messages agreeing or disagreeing or containing ‘Thanks’, by school year (2001 – 2004); series: Agree, Disagree, Thanks; y-axis: 0 to 20]
Future Work
• Exploring the use of other word classes (e.g. verbs)
• Training NLP algorithms on CMC-type corpora
• Automatic grouping of noun phrases into concepts (with manual override)
− e.g. RDB, database, relational database
• Connecting ICTA to external textual data from other public online communities (e.g. blogs, MySpace)
− RSS feeds/ web APIs
• Understanding the social science of language use in online communities
• Collaboration with the NCSA’s DISCUS project
− concept maps
− social networks