Online News Popularity Dataset
PRESENTED BY Sumit Kumar Saini, Shivali Advilkar, Chengdong Ben, Hebatalla Zaky, Manan Patel
01
Introduction
• Created to analyze the number of shares depending on the attributes and predict if an article will be popular on the internet or not.
• 39,644 observations
• 61 attributes
• Mashable website: collected over a 2-year period from Jan 2013 to Jan 2015
• No missing values, but some topics were unclassified
• Target: number of shares
02
Data Set Introduction
Data accuracy: Data Set vs. Website
• 843,330 shares: 12 videos vs. 128 videos
• 792 shares: 0 videos vs. 12 videos
Attributes
LDA
The Latent Dirichlet Allocation (LDA) algorithm was applied to all Mashable texts (known before publication) to first identify the five most relevant topics and then measure the closeness of each article to those topics.
• They were named LDA_00 … LDA_04 (undefined topics)
• LDAs add up to one per observation
• Maximum LDA impurity → overall low shares
• Mean shares: 1,660 vs 3,395 (maximum-impurity articles vs. all articles)
• Median shares: 1,100 vs 1,400 (maximum-impurity articles vs. all articles)
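The slides do not define "impurity", so the following is only a sketch of one possible reading: an article whose LDA weights are spread evenly over the five topics is treated as more impure than one dominated by a single topic. The function names, the example weights, and the choice of `1 - max(weights)` as the measure are assumptions made for illustration, not the presenters' method.

```python
# Hypothetical LDA closeness values (LDA_00 ... LDA_04) for two articles.
focused_article = [0.02, 0.02, 0.90, 0.03, 0.03]
mixed_article = [0.20, 0.20, 0.20, 0.20, 0.20]


def sums_to_one(weights, tol=1e-6):
    """Check the property stated on the slide: LDAs add up to one per observation."""
    return abs(sum(weights) - 1.0) <= tol


def impurity(weights):
    """ASSUMPTION: impurity is taken to be 1 minus the largest topic weight.

    0.0 means the article sits entirely on one topic; with five topics the
    largest possible value is 0.8 (all weights equal to 0.2).
    """
    return 1.0 - max(weights)


for name, weights in [("focused", focused_article), ("mixed", mixed_article)]:
    print(name, sums_to_one(weights), round(impurity(weights), 3))
```

Under this reading, the evenly mixed article is the "maximum impurity" case that the slide associates with low shares.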
03
Data Modification And Models
Data Modification: Recoding

Data channel
0 Viral
1 Lifestyle
2 Entertainment
3 Business
4 Social Media
5 Technology
6 World

Date of publication
1 Monday
2 Tuesday
3 Wednesday
4 Thursday
5 Friday
6 Saturday
7 Sunday
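The recoding above can be sketched as collapsing one indicator column per channel or weekday into a single integer code. The indicator column names below (`data_channel_is_*`, `weekday_is_*`) are assumed from the public dataset and should be checked against the actual file; mapping articles with no channel flag to 0 (Viral) is also an assumption about how the unclassified articles were handled.

```python
# Code tables from the recoding slide. Column names are ASSUMED, not confirmed by the slides.
CHANNEL_CODES = {
    "data_channel_is_lifestyle": 1,
    "data_channel_is_entertainment": 2,
    "data_channel_is_bus": 3,
    "data_channel_is_socmed": 4,
    "data_channel_is_tech": 5,
    "data_channel_is_world": 6,
}
WEEKDAY_CODES = {
    "weekday_is_monday": 1,
    "weekday_is_tuesday": 2,
    "weekday_is_wednesday": 3,
    "weekday_is_thursday": 4,
    "weekday_is_friday": 5,
    "weekday_is_saturday": 6,
    "weekday_is_sunday": 7,
}


def recode(row, codes, default=None):
    """Return the code of the first indicator column set to 1, else `default`."""
    for column, code in codes.items():
        if row.get(column, 0) == 1:
            return code
    return default


def add_summary_columns(row):
    """Return a copy of `row` with the two summary attributes added."""
    out = dict(row)
    # ASSUMPTION: no channel flag set means the article is coded 0 (Viral).
    out["summary_channel_value"] = recode(row, CHANNEL_CODES, default=0)
    out["summary_weekday"] = recode(row, WEEKDAY_CODES)
    return out


example = {"data_channel_is_tech": 1.0, "weekday_is_friday": 1.0}
print(add_summary_columns(example))
```

A single integer column is compact, but it imposes an arbitrary order on the channels, which matters for models that treat the code as a number.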
Conference Paper
• Shares: max 843,300; mean 3,395.380; standard deviation 11,626.951; median 1,400
• Attribute popularity: shares <= 1400 unpopular; shares > 1400 popular
• Avoided dealing with a class imbalance problem
• Made it into a binary problem
Popular or Unpopular: AUC = 0.73
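As a rough illustration of the binary setup, the sketch below labels articles with the 1,400-share threshold from the slide and computes AUC by its pairwise definition. The share counts and model scores are made-up example values, and the helper names are hypothetical; this is not the code behind the reported AUC of 0.73.

```python
THRESHOLD = 1400  # median number of shares, as stated on the slide


def label_popular(shares, threshold=THRESHOLD):
    """1 = popular (shares > threshold), 0 = unpopular (shares <= threshold)."""
    return 1 if shares > threshold else 0


def auc(labels, scores):
    """AUC as the probability that a random popular article is scored
    higher than a random unpopular one (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up share counts and hypothetical classifier scores.
shares = [792, 1100, 1400, 1401, 3395, 843300]
scores = [0.2, 0.6, 0.3, 0.5, 0.7, 0.9]

labels = [label_popular(s) for s in shares]
print(labels, round(auc(labels, scores), 3))
```

Splitting at the median gives two classes of roughly equal size, which is why the slide notes that the class imbalance problem was avoided.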
Model 1
• 1500 trees
• All attributes
Models - Chosen Attributes
Subjective Opinion | Random Forest Importance | Highly Correlated (w/ shares)
• n_tokens_title
• n_tokens_content
• average_token_length
• summary_channel_value
• summary_weekday
• LDA_00
• LDA_01
• LDA_02
• LDA_03
• LDA_04
• global_subjectivity
• global_sentiment_polarity
• global_rate_positive_words
• global_rate_negative_words
• title_subjectivity
• title_sentiment_polarity

• LDA_03
• LDA_02
• kw_max_avg
• kw_avg_avg
• summary_channel_value
• self_reference_min_shares
• self_reference_avg_shares
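One way to build a "highly correlated with shares" attribute set is to rank attributes by the absolute value of their correlation with the target. The slides do not say which correlation measure was used, so Pearson correlation is an assumption here, and the data and helper names are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def top_correlated(columns, target, k):
    """Names of the k attributes with the largest |correlation| to the target."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    return ranked[:k]


# Invented toy values for five articles.
shares = [500, 1200, 1400, 3000, 9000]
columns = {
    "kw_avg_avg": [1000, 2000, 2500, 3500, 6000],
    "n_tokens_title": [10, 9, 11, 10, 10],
}
print(top_correlated(columns, shares, k=1))
```

Correlation only captures linear association with the raw share count, so a heavily skewed target such as shares can make even useful attributes look weak.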
Models - Chosen Attributes
Random Forest Importance
R2: -1.376
Highly Correlated (w/ shares)
R2: 0.01434
R2: 0.0148
Subjective Opinion
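A negative R2 can look surprising, so here is a minimal sketch of the usual definition, R2 = 1 - SS_res / SS_tot. Predicting the mean for every article gives exactly 0, and predictions that are worse than the mean give a negative value. The share counts and predictions are invented for the example.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when the target is constant")
    return 1.0 - ss_res / ss_tot


# Invented share counts for five articles.
y_true = [800, 1100, 1400, 2000, 9000]
mean_shares = sum(y_true) / len(y_true)

always_mean = [mean_shares] * len(y_true)    # baseline: predict the mean
worse_than_mean = [9000, 9000, 9000, 9000, 800]  # deliberately bad predictions

print(r_squared(y_true, always_mean))
print(r_squared(y_true, worse_than_mean))
```

Read this way, an R2 near 0.015 means the model explains about 1.5% of the variance in shares, and a negative R2 means it does worse than always predicting the mean.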
04
Data Insights
Publication Day:
• Most articles published: Tuesday, Wednesday, and Thursday
• Fewest articles published: weekends
Channel:
• Most popular topic is Viral, followed by Tech and Business
• Least popular topic is Social Media
No. of keywords: generally between 5 and 10
Challenges
• Understanding the variables: what is LDA topic #, sentiment polarity, keywords
• Finding relationships among attributes and determining which attributes are important for modelling
• Numbers in dataset vs. numbers on Mashable: shares, videos, images
• Can’t do boosting because we don’t have a binary outcome
Recommendations
For Mashable
• Publish during the week rather than on the weekend
• Publish about world, technology, and business, and avoid social media articles
• Publish articles closer to the topic (minimize impurity)
For Researchers
• Always identify your attributes
• Collect data ethically and accurately
• To get more accurate results, collect the number of likes and comments, the number of tweets or hashtags, and the number of URL mentions, and understand the source of shares
Conclusion
● R2 is very small regardless of the model
● Using all attributes is the best combination
● Removing attributes, changing the number of trees, and changing the classifier do not improve the R2 value
THANK YOU!