BDSS IGERT Speed Dating/Matchmaking Event
September 19, 2014
Event Data
Who did what to whom
• Python, R
• Natural language processing
• Forecasting
• Political violence
Wanghuan Chu
• 4th-year Ph.D. student in Statistics
• Key strength: Statistical modeling
  – Nonparametric regressions, mixed-effects/multilevel models, discrete choice models, statistical learning algorithms, causal inference techniques, etc.
• Research experience
  – Thesis research: Feature screening methods for ultrahigh dimensional longitudinal data.
    • e.g. Genetic data with 870,000 SNPs from 540 subjects (p >> n)
  – 1st IGERT rotation: Causal mediation analysis for clustering data using mixed-effects models, propensity score modeling and inverse probability weighting.
• Potential components for the ideal project
  – Parallel computing to Big Data (e.g. MapReduce).
  – Statistical methodologies at data analytics layer.
  – Interesting social science questions to be explored.
• Software: R and SAS (MACRO and SQL)
• Interested in learning
  – New programming language (e.g. Python).
  – New methodology and domain knowledge.
An Introduction: Cindy Cook [email protected]
• B.S. in Mathematics
  ◦ Graph Theory
  ◦ Parallel Computing with MPI: Recommender Systems
• M.S. in Applied Statistics
  ◦ R, SAS, Stata, C++
  ◦ Machine Learning
  ◦ Survival Analysis: Cox Models
• Ph.D. in Statistics
  ◦ No particular advisor or research
Research Interests:
• Big Data
  ◦ With statistical applications in the Social Sciences
  ◦ Python, parallel computing in R, and broadening my overall computing skills
• Data that has spatial/temporal trends
• Networks on a large scale
• Any combination of these
Timmy Huynh ■ Sociology & Demography
Advisor: John Iceland
[email protected]
Education
B.A., Geography / Economics, The University of Texas at Austin, 2010
M.A., Social Sciences, The University of Chicago, 2011
Research experience (selected)
REU Summer Institute in Minority Group Demography – Austin, TX, 2009
Summer Institute in LGBT Population Health – Boston, MA, 2010
Asian Americans Advancing Justice – Chicago, IL, 2011-2012
Oak Ridge National Laboratory – Oak Ridge, TN, 2012-2013
Research interests
Urban sociology
Spatial demography
Economic geography
Networks
(Geo)Visualization
Skills
Statistics (Stata, SPSS, R)
GIS (ArcGIS, GeoDa, ERDAS)
Programming (Python, JavaScript)
Christopher Inkpen Sociology and Demography
Recent Projects
- determinants of student migration - visualizing global migration patterns - assessing impact of recession on internal migration
Tools
- Statistical Models: linear regression, GLM, HLM, fixed and random effects, spatial econometrics
- Computing: Stata, R, Python, SQL - Mapping: ArcGIS, CartoDB
Broad Interests - global migration patterns - assimilation - population processes
Areas to explore
Population estimation and data fusion
Mapping of social networks
Department of Human Development and Family Studies
Rachel Koffer [email protected]
3rd year Ph.D. student in the Department of Human Development and Family Studies
Concentrations: Individual Development, Methodology
Advisors: Nilam Ram, David Almeida
B.A. Psychology, Economics; Minor: Environmental Studies
Skills I Bring to the Rotation: SAS, R, LISREL, SPSS, STATA
Statistical skills: General linear, multilevel, structural equation modeling, PCA and Factor Analysis
Skills I Hope to Develop/Improve During the Rotation: Python; Data visualization, Machine Learning
Methodological Interests: Analysis of intensive longitudinal data (many measurements across a short time span) and of multiple time scales (intensive longitudinal data within longer-term data)
Substantive Interests: Association between daily experiences and well-being. Effects of daily stressors on daily and long-term affective (mood) and physical well-being.
Potential Interests for Research Rotation: Machine learning techniques for developmental time series data; application of interdisciplinary methods to stress concepts
September 19, 2014
Fridolin Linder
Department: Political Science (2nd Year PhD)
Fields: Methodology, Comparative Politics, Statistics (Grad. Minor)
Interests: Predictive Modeling/Machine Learning, Text Analysis (Classification, Scaling), Political Representation, Research Design/Causal Inference/Epistemology
Skills: Statistics, R (substantial), Python
Current Projects: Datamining as Exploratory Data Analysis (w/ Zach Jones), Rationalization of candidate choice through misreporting of ideological self-placement (experiment)
Jonathan K. Nelson
Department of Geography
Abstract: I am a Ph.D. student in the Department of Geography. Prior to coming to Penn State I was a cartographer for National Geographic. I study spatial data representation and explore patterns and relationships in geographic phenomena, using spatial statistics and visual analytics approaches. I am particularly interested in interactive multi-scale visual and data abstraction techniques for making sense of BIG DATA.
My current research rotation is in the GeoVISTA Center and involves leveraging geo-social media data to support crisis management. Other projects I am working on include: a visual analysis of 1200 student maps from a massive open online course titled “Maps and the Geospatial Revolution”; an exploratory analysis of multiscalar effects of the modifiable areal unit problem on cancer diagnosis rates and median income; and a human-pet-computer interaction study that aims to build healthy relationships between pet owners and their dogs using personal visualization and quantification.
Tools I commonly use for carrying out and conveying my research include: Adobe Creative Suite, Avenza MAPublisher, Final Cut Pro; ESRI ArcGIS, GeoDa, R; CSS, HTML, JavaScript (D3).
Keywords: spatial data, visualization, cartography, map, scale, aggregation, information design
Deeper Learning in Large-Scale Text INTELLIGENT SYSTEMS LABORATORY
APPLIED COGNITIVE SCIENCES LABORATORY
Alexander G. Ororbia II, IST PhD Student
What do I do?
Build: Deep models for learning from Scholarly Big Data
Multilayer neural networks, learning kernels
Boltzmann Machines
Convolutional Networks—text recognition in-the-wild (“Text in the Wild”)
Active Learning Algorithms
Bayesian Network Lattice for error-correcting Amazon Mechanical Turker annotations (Ororbia et al., 2014, Under Review)
Investigate: Can deep architectures discover/model inherent hierarchical structure in text?
How can intelligent systems work in tandem with humans to solve complex problems?
Can intelligent tools be built that harvest and organize vast amounts of scholarly data?
What insights can these same algorithms extract from the data?
BDSS-IGERT Joshua Snoke
• About Me:
• 2nd Year PhD Student, Department of Statistics
• 1st Year BDSS-IGERT Trainee
• B.S. in Mathematics and Economics
• Current Research:
• Data Privacy, Disclosure Limitation Methods
• Synthetic Data for Public Use (in Sociology Studies), Parametric and Non-Parametric Methods
• Currently Seeking a Research Project Outside of the Statistics Dept.
• Interests and Applications:
• Policy, Politics (National and Global)
• Social Networks, Relationships
• Methodology, Causal Inference, Bayesian Methods
• Computational Proficiency:
• Significant Experience in R
• Some Experience in Python, Java, and SQL
Sam Stehle Geography
Previous activities
• Data management for mobile GIS
• Matching space-time patterns for political/social comparison
• Visual analytic software design/implementation
• Text classification of Twitter data
• Event data collection/classification
• Time series intervention modeling
Methods experience
• Java • Python • C++ • R • Spatial analysis, GIS • SQL • Time series analysis • Raster/image analysis • Machine learning with Weka
Background
• B.S. University of Utah; geography, minor in Computer Science
• M.S. Penn State; geography
Sam Stehle Interests for Future Work
Topics
• Political geography
• Understanding events
• Spatio-temporal patterns
• Sport
• Media representations
• Multi-scale issues
Dissertation considerations
• Geography/politics of international sport
• British Commonwealth Games
• Multi-scale spatio-temporal modeling
• RSS feed data + geo/social/political context
• Data-driven vs. dictionary event classification
• Spatio-temporal diffusion patterns
Clio Andris (Assistant Professor of GIScience) Dept. of Geography, [email protected].
Courses: Fall GEOG560: Interpersonal Relationships in Geographic Space, Spring GEOG363: GIS
A System of Systems
Network thanks to Paul Hooper, Emory U.
MID/DLE: An End to Data Collection
[Diagram labels: Collect, Analyze, Maintain; Researcher, Computer; Participants, each with repeated Measures.]
Tim Brick, HDFS
[Figure: Estimated Yearly Average of Two Dynamic Latent Physical Integrity Variables, 1950 to 2010. Vertical axis: Latent Physical Integrity, from More Abuse to More Respect. Series: Dynamic Standard of Accountability; Constant Standard of Accountability.]
[Figure: Dynamic Standard Model Estimates plotted against Constant Standard Model Estimates, one panel per year from 1976 to 2010; both axes run from More Abuse to More Respect.]
Disagreement between estimates increases each year
Christopher J. Fariss Respect for Human Rights has Improved Over Time
Human Rights Documentation Project
11,715 human rights documents. Eventually I will have more than 20,000 documents. Most are already coded.
Epidemics – The Dynamics of Infectious Disease
• > 19,000 participants
• 158 countries
• > 5,000 completed
• > 300,000 unique video views
• > 8,000 browsed the forums
• > 4,200 participating in forums
• 3,683 forum threads
• 31,919 forum posts
• 15,486 forum comments
• Novel format for delivering content
• Novel format for interacting with learners
• The online discussion is an educational resource
Chris Fowler
Assistant Professor of Geography and Demography
The big question: When cities spend money on stuff (e.g. affordable housing, transport systems, parks)…
…who suffers, who benefits?
…how do those costs and benefits change communities?
The current question: Can we use demographic data at very fine geographic scales to identify signals that relate to the above?
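The Measure: Multiscale segregation
[Figure: segregation profile across scales. Horizontal axis: Smallest Scale to Largest Scale. Vertical axis: More segregated to More Diverse. Labelled examples: Completely Segregated, Very Diverse, Segregated.]
Multiscale Segregation Profiles: Functional Forms
[Figure: functional forms of the profiles, on an axis from More segregated to More diverse.]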
18,000 cells x 25 scales x 3 census years = 1.35 million data points to study 8 neighborhoods
Multiscale segregation is just the start
• Improving the measure
• Scale and interpolation issues
• Restricted Census data
• Other variables
• Parcel data on housing price
• Poverty status
• Income
• Other cities
• Visualization and classification
• Interpretation and prediction
People and Ideas
GeoVISTA Student Affiliates
• John Beieler • Jennifer Mason • Jonathan Nelson • Sam Stehle • Josh Stevens
Project Ideas
• Interactive NLP
• GeoTxt crowd-sourcing
• Interactive social graph analysis
• Your pet idea
bridging social science and geo-science
The Psychometrics of College Tests
Loken, HDFS
Signal and noise in data on body weight
Stephen A. Matthews Professor of Sociology, Anthropology & Demography (courtesy, Geography) Director, Graduate Program in Demography
Research Interests: My research focuses on population health and health inequality. An important part of my work is an interest in conceptual and methodological issues associated with how neighborhoods are defined and their attributes are measured, and the relevance of these definitions and measures to individual behavior and health outcomes.
Proposed Research Project: A friend and colleague, Basile Chaix (Université Pierre et Marie Curie, Paris), has geocoded data on 90,000 places (activity locations) for 6,000 Parisians. For each respondent we know the self-reported boundaries of their neighborhood (VERITAS-RECORD project). During the project the emphasis would be to develop/refine methods to (a) compare the patterning of location visits to the self-reported neighborhood; (b) compare patterns across individuals residing in the same neighborhood; (c) identify hierarchical use patterns (frequency) among location types; (d) examine the significance of focal locations (e.g., work, home); and (e) determine the optimal/minimal number (and type) of locations reported that offer a useful proxy for the total distribution of locations that an individual visits.
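One simple way to approach item (a) is to ask what share of a respondent's visited locations fall inside their self-reported neighborhood boundary. The sketch below is only an illustration under stated assumptions: coordinates are treated as planar (x, y) pairs, the boundary is a single simple polygon, and the names `point_in_polygon` and `share_inside` are hypothetical, not part of the VERITAS-RECORD tooling.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: True if the point lies inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def share_inside(locations: List[Point], neighborhood: List[Point]) -> float:
    """Fraction of visited locations inside the self-reported boundary."""
    if not locations:
        raise ValueError("at least one location is required")
    hits = sum(point_in_polygon(p, neighborhood) for p in locations)
    return hits / len(locations)


# Toy example: a square "neighborhood" and four activity locations.
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
visits = [(1.0, 1.0), (2.0, 3.0), (5.0, 5.0), (-1.0, 2.0)]
print(share_inside(visits, square))
```

Real geocoded data would normally be projected first and handled with a GIS library; the pure-Python version only shows the logic.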
Stephen A. Matthews – [email protected] – BDSS Speed-dating Meeting (Fall 2014) Slide 01
VERITAS-RECORD = Visualization and Evaluation of Route Itineraries, Travel Destinations, and Activity Spaces – Residential Environment and CORonary Heart Disease. Youtube VIDEO at https://www.youtube.com/watch?v=91x_S2Q-tic
Ideal Skill Sets Required:
a) Good data organizing skills
b) Good communication skills
c) Good documentation skills
d) Solid statistical background
e) Programming skills (for automating repetitive tasks)
f) Patience
g) Mapping and data visualization skills
h) GIS experience, preferably ArcGIS
i) Some familiarity with point pattern analysis, local neighborhood statistics, and density/surface mapping.
j) Some familiarity with activity space and time-geography literature – and willingness to learn more.
Opportunities to be involved in manuscripts to be developed for publication in epidemiology, public health and/or geography-related journals
Quantitative interest
Process models
Multivariate continuous time modeling with all driving parameters person-specific
- unbalanced/unstructured data
- describing change in terms of instantaneous regulation and short- and long-term trends
- synchronicity in changes among the longitudinal variables
Bayesian statistics
- flexible framework for implementing parameter estimation for highly complex models
- focus: sequential updating methods for online inference from streaming data (health monitors, Twitter etc.)
bayesian.zitaoravecz.net
Main goal: developing novel multivariate dynamical models that capture psychologically meaningful properties of change over time in terms of latent variables
→ study individual differences therein
→ e.g., interventions can be tailored based on these variables
Substantive interest
Affective science
- regulatory mechanisms in valence and arousal levels
- their connection to personality traits
Well-being
- subjective (self-reported) wellbeing as multidimensional state
- devising measurement instrument
Cognitive process models
- describing decision making in terms of latent variables
- identifying links between cognitive parameters
and individual characteristics
Research tools
- translating methodological research into practical tools
- free, user-friendly programs for applied researchers
[Figure: plot in the Valence-Arousal plane; both axes run from 0 to 100.]
Donna Peuquet
Department of Geography
Research interests:
• Geographic knowledge representation
• Knowledge discovery
• Space-time dynamics
  – Visualization
  – Geovisual analytics
  – Data models
  – Computational/statistical modes
Current project: STempo
• Provide the capability to:
  – quickly reveal temporal and spatio-temporal patterns from large collections of event data
    • Find previously unknown patterns
    • Confirm suspected / assumed patterns
    • Find examples of specific patterns in other locations / times
• Using computational + visualization techniques
• …and coded events from RSS newsfeeds - GDELT
Potential projects:
• Add context to visualization and/or analysis
• Examine how patterns change over varying contexts
• Develop means to identify anomalies, precursor and postcursor events
• Develop capability to visually identify repeating patterns/cycles
• Develop means to facilitate evaluation of pattern importance
What kinds of hidden significant structure exist in complex space-time behaviors?
NILAM RAM
HUMAN DEVELOPMENT & FAMILY STUDIES [email protected]
IGERT BDSS PSU
SEPTEMBER 19, 2014
FINDING MEANING IN THE DATA FOREST
stressdirections.com
Data Acquisition Data Fusion Data Management Data Visualization Data Mining Data Modeling
In-Vivo Data In-Silico Data
In-Virtual Data Real-Time Data Interactive Data
HMM (STATE SEQUENCE) ESTIMATION REAL TIME ANALYSIS TIME-AWARE RECOMMENDATIONS
Probabilistic state sequence extracted from
4-state HMM
ID# 103, age 2-mo Inoculation Paradigm
$$X_{t+1} = A X_t + V_{t+1}$$
$$Y_{t+1} = C X_t + W_{t+1}$$
CELLULAR AUTOMATA SIMULATIONS OF COMPLEX EMERGENT BEHAVIOR + INTERACTIVE DATA VIZ GAMING
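dt = .01, R = .2, A = .08, B = 1.5, C = .15, Du = .5, Dv = 20
100x100 grid with periodic boundaries, random uniform initial conditions (0, .1)
$$\frac{\partial u}{\partial t} = R\, f(u,v) + D_u \nabla^2 u$$
$$\frac{\partial v}{\partial t} = R\, g(u,v) + D_v \nabla^2 v$$
$$f(u,v) = A - B u + \frac{u^2}{v\,(1 + C u^2)}, \qquad g(u,v) = u^2 - v$$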
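A minimal sketch of how such a simulation could be stepped forward is given below, using the parameter values listed on the slide. It rests on assumptions: the kinetics are read as the usual activator-inhibitor form f(u,v) = A - Bu + u²/(v(1 + Cu²)) and g(u,v) = u² - v (the minus signs are not visible in the slide text and are assumed), the grid spacing is 1, and time stepping is explicit Euler. The function names are illustrative and this is not the presenter's code.

```python
import random


def laplacian(grid, i, j):
    """Five-point Laplacian with periodic (wrap-around) boundaries, dx = 1."""
    n, m = len(grid), len(grid[0])
    return (grid[(i - 1) % n][j] + grid[(i + 1) % n][j]
            + grid[i][(j - 1) % m] + grid[i][(j + 1) % m]
            - 4.0 * grid[i][j])


def step(u, v, dt=0.01, R=0.2, A=0.08, B=1.5, C=0.15, Du=0.5, Dv=20.0):
    """Advance both fields by one explicit Euler step; returns new grids."""
    n, m = len(u), len(u[0])
    new_u = [[0.0] * m for _ in range(n)]
    new_v = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            uu, vv = u[i][j], v[i][j]
            # Assumed kinetics (signs reconstructed, see lead-in).
            f = A - B * uu + uu * uu / (vv * (1.0 + C * uu * uu))
            g = uu * uu - vv
            new_u[i][j] = uu + dt * (R * f + Du * laplacian(u, i, j))
            new_v[i][j] = vv + dt * (R * g + Dv * laplacian(v, i, j))
    return new_u, new_v


def simulate(size=100, steps=10, seed=0):
    """Random uniform (0, 0.1) initial conditions, then `steps` Euler steps."""
    rng = random.Random(seed)
    u = [[rng.uniform(0.0, 0.1) for _ in range(size)] for _ in range(size)]
    v = [[rng.uniform(0.0, 0.1) for _ in range(size)] for _ in range(size)]
    for _ in range(steps):
        u, v = step(u, v)
    return u, v
```

With dt = .01 and Dv = 20 the diffusion number dt·Dv is 0.2, which stays under the usual 0.25 stability limit for this explicit scheme on a unit grid. Pure Python is slow on a 100x100 grid over many steps, so an array library would be the practical choice for real runs.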
ENSEMBLE METHODS FOR (UN)STRUCTURED DATA
Data Privacy, Causal Inference, Categorical Data methodologies …
Aleksandra (Sesa) [email protected]
Departments of Statistics & Public Health Sciences
Pennsylvania State University
Sep 19, 2014 @ BDSS matching day
Privacy in Statistical Databases
[Diagram labels: Agency/Organization/Database; Respondents/Individuals/Organizations; Users; Queries; Answers. Users: Government, Researchers, Businesses, Clinicians, Patients (or) Malicious adversary.]
• Large collections of personal information
• census & survey data
• social networks
• medical/public health/genomic
• web search records, etc.
Collect → Store → Analyze/Share
Cloud computing
Privacy Research questions
• Research Matters: Privacy in Statistical Databases
• Main theme: integrating computer science and statistical approaches to data privacy
  – Social, Behavioral & Economic data
  – Trade-off between data utility and disclosure risk
  – Rigorous privacy definitions (e.g., Differential Privacy)
  – Synthetic data
  – Privatization of social networks data
  – Private Genome-wide association studies
  – Privacy with Distributed databases
Image ref: http://www.orgnet.com/email.html
Other projects
• More general categorical data methodologies (Bayesian analysis, algebraic statistics, …) with observational data
  – Causal Inference
  – Ecological Inference
• Statistical Data integration
  – Combining data from multiple sources
  – Merging big data with probability samples. Can we use information from surveys to help generalize analyses from large-scale administrative/private or organic data?
• Population size estimation
• Data analysis and methodology with small n, large p problems in two settings
  – CSCW and HCII data
  – Study of communication and awareness in online collaborative tools
  – Questioners and log-activity data
  – NEW: Neural data and neuroimaging fMRI data analysis modeling language plasticity in bilinguals
  – Time and frequency domain analyses
Communication-based diffusion
Rachel Smith, Communication Arts & Sciences
1. PEPFAR Namibia
• Existing data available for network analysis and HIV-related indicators
  – Two-mode networks (persons and community groups)
  – Cross-sectional
  – 15 communities, ~n=300 in each site
2. ‘Contagious’ messages
• Collect and analyze new data
  – Track online messages related to Ebola
  – Predict a) what types of messages get passed on to another person, and b) what aspects of the message change and remain the same in the ‘retelling’
  – Compare the end result to CDC or WHO stories and advice
Promoting Intergenerational Communication through Facebook Dr. S. Shyam Sundar
College of Communications
Different Use of Facebook among Senior Citizens (N=352)
Jung, E.H. & Sundar, S. S. (2014). Senior Citizens on Facebook: How do they Interact and Why? Paper presented at the 96th annual conference of the Association for Education in Journalism and Mass Communication, Montreal, Canada.
• Social bonding: One-to-one communication (e.g., commenting, chatting)
• Social bonding & Social bridging: Self-presentation activities (e.g., updating status)
• Social bonding & Curiosity: Social surveillance activities (e.g., checking out people’s walls)
Frequency of senior citizens’ participation in Facebook activities (N = 168)

Facebook Activity                        Mean   SD
Stay in touch with friends and family    3.27   1.43
Reunite with old friends                 2.56   1.14
Keep up with others’ activities          2.39   1.14
Comment on others’ postings              2.38   1.17
View or upload photographs               2.33   1.29
Pass the time                            2.15   1.77
Keep up with current events              1.95   1.20
Update my status                         1.88   1.00
Browse profiles                          1.80   1.08
Post items (e.g. news articles)          1.78   0.91
Sundar, S. S., Oeldorf-Hirsch, A., Nussbaum, J. F., & Behr, R. A. (2011). Retirees on Facebook: Can online social networking enhance their health and wellness? Proceedings of the 2011 Annual Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA’11), 2287-2292.
BIG DATA
LIKES
PHOTOS
WALL POSTS
COMMENTS
CHATTING
PRIVATE MESSAGES
What are senior citizens doing on Facebook for intergenerational communication?
What’s on your mind?
BIG DATA @ FACEBOOK
• What kinds of technology affordances do senior citizens use? With whom are they using it?
• What is the relationship between sender and receiver?
• Through sentiment analysis, how much social support do they receive from family members on Facebook?
Leadership and Sentiment Analysis of an Online Cancer Support Community using Computational Text Mining
Kenneth Portier & Greta Greer, American Cancer Society John Yen, Prasenjit Mitra, Kang Zhao, Baojun Qiu, Dinghao Wu, & Cornelia Caragea, The Pennsylvania State University
Introduction
Online communities are an important source of social support for cancer survivors and caregivers. The ACS Cancer Survivors Network (CSN) is the oldest and largest of these, with 160,000+ members. This study used computational text mining analysis of 48,779 threaded discussions (468,000 posts by 27,173 members) to identify emerging community leaders and classify user sentiment over sequential posts.
Methods Leader Analysis: Posts of 41 recognized CSN leaders and 2366 other users were analyzed. 21 leadership characteristics were scored and used to calibrate single and ensemble classifiers capable of correctly identifying leaders.
Leader Analysis Results
78% and 85% of community leaders are correctly identified with the single best and ensemble classifiers, respectively.
Sentiment Analysis Results
The best fitting sentiment classifier has an 80% correct classification rate, with 68.8% of posts classified as positive. 75% of negative thread originators subsequently express positive sentiment when at least one reply is received from peers. Probability increases with number of replies. Positive thread initiators are more likely to have positive subsequent sentiment than negative thread originators.
Implications
• Early/proactive identification of potential leaders gives community managers the opportunity to encourage the growth of desired leadership qualities and thereby maintain strong peer leadership.
• Sentiment analysis results support the hypothesis that online cancer communities like CSN can effectively facilitate peer interactions in a safe, welcoming environment to help members feel more positive about their situation.
Influence
Micro level
Network structure, diffusion, and evolution Macro level
Sentiment influence & Influential users
Members’ publishing behaviors and influence
Information diffusion & the evolution of collaboration networks
Sentiment Analysis: User sentiment was computed through a multi-stage process. 13 lexical/style features were extracted from a training set of 298 randomly-selected posts manually assigned to positive (204) or negative (94) sentiment, then used to calibrate 5 classifiers. Utilizing the best-fit classifier, sentiment level was established for all 468,000 posts, and change in sentiment between users’ initial and subsequent posts examined.
[Diagram: leader identification pipeline. Labels: Community Member Posting; Features; Classifiers; Decision (Leader / Participant); Full Community; Training Set (Leader / Participant).]
• Contribution features
  – The numbers of posts/threads
  – The length of posts
  – The time span of one’s activities
  – …
• Centrality features
  – A post-reply network among users
  – Nodes and Edges
  – In/out-degree, Betweenness, PageRank
• Semantic features
  – Appearance of words with positive/negative sentiment in a user’s posts
  – The use of slang and emoticons
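As a small illustration of the centrality features, the sketch below computes in-degree and out-degree from a post-reply network. It assumes each edge points from the user who replies to the user being replied to; that direction, the input format, and the function name `degree_features` are assumptions for illustration, not details given on the poster. Betweenness and PageRank are not covered here.

```python
from collections import Counter


def degree_features(reply_edges):
    """Count in/out-degree per user.

    reply_edges: iterable of (replier, original_poster) pairs,
    an assumed encoding of the post-reply network.
    """
    out_deg = Counter()
    in_deg = Counter()
    for replier, poster in reply_edges:
        out_deg[replier] += 1   # replies this user wrote
        in_deg[poster] += 1     # replies this user received
    users = set(out_deg) | set(in_deg)
    return {
        user: {"in_degree": in_deg[user], "out_degree": out_deg[user]}
        for user in users
    }


edges = [("bob", "alice"), ("carol", "alice"), ("alice", "bob")]
features = degree_features(edges)
print(features["alice"])
```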
80% use Internet for health-related purposes
Adult Internet users in the U.S.
1 in 4 joins OHCs
[Diagram: sentiment classification pipeline. Labels: Community Member; Lexical/style Features; Classifier; Decision; Full Community with unknown sentiment; Training Set with assigned sentiment (+ / -); Similarity measure.]
[Diagram: sentiment model workflow. Labels: CSN Forum Posts; Labeled Posts; Non-Labeled Posts; Feature Extraction; Model Selection; Sentiment Model; output (Post, Pr, Label).]
ROC Area = Area under the Receiver Operating Characteristic curve
Best is AdaBoost: false positive rate 0.152, false negative rate 0.33
Best Features: Post_Length, #_Negative_Words, #_Internet_Slang_Words, #_Names_Mentioned, (#_Pos+1)/(#_Neg+1), PosStrength, NegStrength
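To make a few of the listed features concrete, here is a hedged sketch that extracts post length, negative-word count, slang count, and the smoothed positive/negative ratio (#_Pos+1)/(#_Neg+1) from one post. The tiny word lists are hypothetical placeholders, post length is assumed to be counted in tokens, and the names-mentioned and strength features are left out because the poster does not say how they were computed.

```python
import re

# Hypothetical toy lexicons; a real study would use full sentiment and slang lists.
POSITIVE_WORDS = {"hope", "thanks", "great", "better"}
NEGATIVE_WORDS = {"pain", "scared", "worse", "afraid"}
SLANG_WORDS = {"lol", "omg", "btw"}


def lexical_features(post: str) -> dict:
    """Extract a handful of lexical/style features from one post."""
    tokens = re.findall(r"[a-z']+", post.lower())
    n_pos = sum(t in POSITIVE_WORDS for t in tokens)
    n_neg = sum(t in NEGATIVE_WORDS for t in tokens)
    n_slang = sum(t in SLANG_WORDS for t in tokens)
    return {
        "post_length": len(tokens),                  # assumed: length in tokens
        "n_negative_words": n_neg,
        "n_internet_slang_words": n_slang,
        # Add-one smoothing keeps the ratio defined when a count is zero.
        "pos_neg_ratio": (n_pos + 1) / (n_neg + 1),
    }


print(lexical_features("I am scared, the pain is worse. LOL"))
```

The add-one terms in the ratio mean a post with no sentiment words at all scores exactly 1.0, the neutral point between mostly positive (> 1) and mostly negative (< 1).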
Initial Post (P1)
Responding Reply (R1)
1st Self-Reply (P2)
…
M-th Self-Reply (Pn)
Responding Reply (Rm)
…
…
Sentiment Change Indicator: The difference between the average sentiment of the originators’ self replies and the initial sentiment of the thread originator.
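The indicator defined above is a simple difference of means, which can be written out as follows. This is a sketch only: it assumes each post already has a numeric sentiment score (for example the classifier's probability of positive sentiment), and the function name `sentiment_change` is illustrative.

```python
def sentiment_change(initial_sentiment: float, self_reply_sentiments: list) -> float:
    """Average sentiment of the originator's self-replies minus the
    sentiment of their initial post."""
    if not self_reply_sentiments:
        raise ValueError("the originator has no self-replies in this thread")
    mean_self_reply = sum(self_reply_sentiments) / len(self_reply_sentiments)
    return mean_self_reply - initial_sentiment


# A thread that starts negative (0.2) and whose originator later posts
# two more positive self-replies has a positive change score.
print(sentiment_change(0.2, [0.6, 0.8]))
```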
The more positive the sentiment of replies from others, the more positive the originator became.
Classifiers: Naïve Bayesian, Logistic Regression, Random Forest, One-Class SVM, Two-Class SVM
Ensemble Classifier
Topic Discovery Using Discussion Posts in an Online Cancer Community Kenneth Portier & Greta Greer, American Cancer Society
John Yen, Prasenjit Mitra, Siddhartha Banerjee, Mo Yu, & Prakhar Biyani, The Pennsylvania State University Lior Rokach & Nir Ofek, Ben-Gurion University of the Negev
• The ACS Cancer Survivors Network (CSN) is the oldest and largest online community for cancer survivors, with 25K unique visits per day.
• Question: Are different discussion topics associated with different sentiment changes, a measure of social support, of the thread initiators?
• Method: sentiment analysis and topic modeling
• Data: CSN breast and colorectal cancer discussion forum posts from 2005-2010
Sentiment Analysis:
• User sentiment is computed through a multi-stage process.
• 13 lexical/style features are extracted from a training set of 298 randomly-selected posts manually assigned to positive (204) or negative (94) sentiment.
• The training set is used to calibrate 5 classifiers.
• Utilizing the best-fit classifier (80% correct classification rate), sentiment level was established for all 468,000 posts (68.8% Positive).
• Estimate the impact of Responding Reply sentiment on sentiment change of thread initiators.
• Negative thread initiators are likely to have positive sentiment change.
• Sentiment Change Score computed as the difference between the average sentiment of the originators’ self replies and the initial sentiment of the thread originator.
• Only one or a couple of days typically span the time between initial post and first follow-up response by the thread initiator.
• The sentiment change observed is likely a reaction to the positive sentiment posts from the community.
The more positive the sentiment of replies from others, the more positive the originator became.
Topic Model Analysis:
• Topics of thread initiating posts were identified using Modified Latent Dirichlet Allocation (LDA-VEM), which assigns each initiating post the probabilities of belonging to each topic.
• Posts were classified to the highest-probability topic.
• Analyses of 20 to 50 topics indicated choice of 30 topics as being reasonable for both forums.
• Selected word combinations (bi-grams) identified in an initial analysis of the posts were subsequently converted to single words (e.g. “breast cancer” to “breastcancer”) to retain their meaning.
• Remaining words were reduced to root form (i.e., stemming).
• Terms occurring very often (> 80% of posts) or very seldom (<5 posts) were removed prior to analysis.
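Two of the preprocessing steps in this list, bi-gram recoding and frequency-based term removal, can be sketched as below. The thresholds (more than 80% of posts, fewer than 5 posts) come from the poster; the function names, the whitespace-tolerant matching, and the token-list input format are assumptions made for illustration. Stemming and the LDA-VEM fit itself are not shown.

```python
import re
from collections import Counter


def recode_bigrams(text, bigrams):
    """Join each listed two-word phrase into a single token,
    e.g. "breast cancer" -> "breastcancer"."""
    for first, second in bigrams:
        pattern = r"\b" + re.escape(first) + r"\s+" + re.escape(second) + r"\b"
        joined = first + second
        text = re.sub(pattern, lambda m, j=joined: j, text, flags=re.IGNORECASE)
    return text


def filter_vocabulary(docs, max_share=0.80, min_docs=5):
    """Drop terms found in more than max_share of posts or in fewer
    than min_docs posts. docs is a list of token lists, one per post."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))      # count each term once per post
    n_docs = len(docs)
    keep = {
        term for term, df in doc_freq.items()
        if df >= min_docs and df <= max_share * n_docs
    }
    return [[t for t in tokens if t in keep] for tokens in docs]


print(recode_bigrams("Breast cancer and breast  cancer care", [("breast", "cancer")]))
```

Recoding has to happen before tokenization and filtering so that the joined phrase is counted as one term.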
Methods Overview
[Figure panels: Breast Cancer Discussion Forum; Colorectal Cancer Discussion Forum]
• Average sentiment change index (and associated 95% confidence intervals) vs main post topic for the CSN breast and colorectal cancer discussion boards
• High average sentiment change scores indicate that community responses have a positive effect on the emotions of the thread initiators.
• Low average sentiment change scores could indicate either that community response has little impact on the initiator’s emotions or (more likely) that the initial post sentiment was high to begin with.
• Sentiment change score vs initial sentiment by topics for the breast and colorectal cancer discussion board.
• Each box is centered on the mean for each topic with the area representing 95% confidence.
• Topics with high average initial post sentiment tend to have lower average sentiment change scores.
Results
• The increased understanding of topics and related sentiment supports development of ancillary information to be made available to CSN members to supplement forum discussions.
• We envision using these results to create tools that:
  – Notify community leaders when posts with low initiating sentiment do not produce adequate community response.
  – Point the initiator to threads where the topic may have been discussed in the recent past.
• These improvements can further improve community social support and subsequently members’ quality of life.
• Both forums show that pain, medical worries, and treatment side-effect issues initiate with very low (most negative) sentiment and have highest sentiment change.
• Breast cancer posts tend to start with lower sentiment than colon cancer posts, while their sentiment change tends to be higher.
Conclusions
[Diagram: thread structure. Initial Post (P1); Responding Replies (R1-Rk) to Initial Post; 1st Self-Reply (P2); Responding Replies (R1-Rk) to Self-Reply; M-th Self-Reply (Pn).]
[Diagram: methods overview, starting from Cancer Survivors Network (CSN) Discussion Board Posts. Sentiment steps: Training Set (N=298); Feature Extraction; Classifier Calibration & Selection; Classification of Corpus; Sentiment Change Analysis. Topic steps: Breast & Colorectal Cancer Forums; Initial Post Extraction; Initial Post Key Words; Key Phrase Recoding; Common & Unique Word Removal; LDA-VEM Analysis – Topic Identification; Initial Post Topics & Likelihood. Output: Sentiment Change.]