Feng Xia
Scholarly Social Computing
Dalian University of Technology (DUT), China
http://fengxia.net
@ WIMS 2019, Seoul, Korea
The City: Dalian
• a major city and seaport in the south of Liaoning province
• the southernmost city of Northeast China
• the province's second largest city, with sub-provincial administrative status
• a financial, shipping and logistics center for Northeast Asia
• In 2006, Dalian was named China's most livable city by China Daily
• Population (2010): 6,690,432
From Wikipedia, the free encyclopedia: http://en.wikipedia.org/wiki/Dalian
What?
Our goal is to create innovation through conducting interdisciplinary, application-driven academic research. We are interested in a broad spectrum of cutting-edge research topics including data science, knowledge management, network science, computational social science, human behavior, and mobile social networks.
Why?
Alpha means "first" in Greek. We borrow this word to express the idea that we pursue being extraordinary not only in academic research, but also in fully exploiting our own potential. We value hard work and talent. We embrace change and differences.
Where?
Full name of the Lab: The Alpha Lab @ Dalian University of Technology, China.
Address: School of Software, Dalian University of Technology, Development Zone, Dalian 116620 China.
URL (website): http://TheAlphaLab.org
Feng Xia: Professor
Students: 70+ PhD, master's, and senior undergraduate students from 7 different countries
We are family!
Agenda: Social Computing
1. Scholarly Big Data
2. Social Relationship Mining
3. Scientific Collaboration Dynamics
What is Scholarly Big Data?
Feng Xia, Wei Wang, Teshome Megersa Bekele, Huan Liu.
Big Scholarly Data: A Survey
IEEE Transactions on Big Data, 2017
DOI: 10.1109/TBDATA.2016.2641460
Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.
Scholarly big data: the five Vs
• Volume: over 114 million scholarly documents
• Velocity: growing at an average of 6.3% per year
• Variety: various entities including papers, authors, etc.
• Value: research prediction; fund allocation; impact evaluation
• Veracity: author name disambiguation, deduplication
What is “Scholarly Big Data”?
Scholarly Big Data is coined for the rapidly growing scholarly data, which contains information including millions of authors, papers, books, citations, figures, tables, as well as scholarly networks and digital libraries.
Any data that relate to scholarship.
Exploring Big Scholarly Data
[Figure: framework for exploring big scholarly data, with the following labelled stages]
• Collecting: academic websites, scholar homepages, conference and journal official websites, Academic Social Network Sites (ASNs), scholarly search engines
• Data preprocessing: name disambiguation, integration, profiling
• Storage, indexing, access
• Modeling: co-authorship network, citation network, co-citation network, bibliographic coupling network, co-word network
• Analysis: social network metrics, properties, ASNs analysis tools
• Technology: similarity measures, statistics, frequent patterns, machine learning
• Application: actor-oriented tasks, relationship-oriented tasks, network-oriented tasks
Xiangjie Kong, Yajie Shi, Shuo Yu, Jiaying Liu, Feng Xia. Academic Social Networks: Modeling, Analysis, Mining and Applications, Journal of Network and Computer Applications, Volume 132, April 2019, Pages 86-103. DOI: 10.1016/j.jnca.2019.01.029
Scholarly Entities and their Relationships
[Figure: scholarly entities, their attributes, and the relationships among them]
• Publications: Title, Year, DOI, Pages, Contents, Abstracts
• Researchers: Name, Affiliations, Education, Field, Impact, Position
• Organizations: Type, Name, Location, Member, Ranking, Impact
• Venues (Conferences, Journals): Name, Location, Date, Impact, Field, Publisher
• Terms: Keywords
• Relationships: Cite, Author, Tag, Host, Published at, Work at, Co-words, Co-authorship, Co-citation, Collaboration, Dependence, Interest
The Webs: Example I
Xiaomei Bai, Hui Liu, Fuli Zhang, Zhaolong Ning, Xiangjie Kong, Ivan Lee, Feng Xia. An Overview on Evaluating and Predicting Scholarly Article Impact, Information, 2017, 8(3), 73; DOI:10.3390/info8030073
Different networks for various scholarly entities and their relationships
The Webs: Example II
Number of advisees and their advisors in 63 countries on the world map which is generated by Tableau. It maps the advisees from different countries in different colors.
Jiaying Liu, Tao Tang, Wei Wang, Bo Xu, Xiangjie Kong, and Feng Xia. A Survey of Scholarly Data Visualization, IEEE Access, 2018, 6(1): 19205-19221. DOI: 10.1109/ACCESS.2018.2815030
The Webs: Example III
Jiaying Liu, Tao Tang, Wei Wang, Bo Xu, Xiangjie Kong, and Feng Xia. A Survey of Scholarly Data Visualization, IEEE Access, 2018, 6(1): 19205-19221. DOI: 10.1109/ACCESS.2018.2815030
Collaboration network of Harvard University in Acemap. Each node represents the author in the institution and the edges between the nodes represent the collaboration between the authors. The color of the nodes represents the research field of the author.
Key Mining Techniques
• Similarity Measure
  • Linkage-based and Structural Methods: PageRank, SimRank
  • Content-based Methods: Distance-based Algorithms, Cosine-based Algorithms, Correlation-based Algorithms, Jaccard Coefficient
• Statistical Relational Learning
  • Probabilistic Relational Models
  • Relational Markov Networks
  • Structural Logic Regression
  • Relational Dependency Networks
  • Markov Logic Networks
• Graph Mining
  • Frequent Subgraph Mining: Apriori-based Algorithms, FP-growth Algorithms
  • Significance Subgraph Mining
  • Dense Subgraph Mining
• Machine Learning
  • Supervised Machine Learning: Decision Tree, Neural Networks, Support Vector Machines, k-Nearest Neighbors
  • Unsupervised Machine Learning: Hierarchical Methods, Partition-based Methods, Density-based Methods, Grid-based Methods, Model-based Methods
• Deep Learning
  • Supervised Deep Learning: Convolutional Neural Networks
  • Unsupervised Deep Learning: Auto Encoder-based Methods, Boltzmann Machines
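Two of the content-based similarity measures listed above, the Jaccard coefficient and cosine similarity, are simple enough to sketch directly. The example below is only an illustration: the function names, the co-author sets, and the keyword counts are made up for the demonstration and do not come from the talk.

```python
import math


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical co-author sets of two scholars.
coauthors_x = {"Bob", "Tom", "Jack"}
coauthors_y = {"Tom", "Jack", "Eve", "Dan"}
jaccard_xy = jaccard(coauthors_x, coauthors_y)  # 2 shared names out of 5 distinct names

# Hypothetical keyword counts taken from the two scholars' papers.
keywords_x = {"graph": 2, "mining": 1}
keywords_y = {"graph": 1, "mining": 2}
cosine_xy = cosine(keywords_x, keywords_y)
```

Set-based measures such as Jaccard suit co-author or venue overlap, while cosine similarity is the usual choice once entities are described by weighted term vectors.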
Big Scholarly Data Applications
• Actor-oriented: Author Tasks, Paper Tasks, Journal Tasks
• Relationship-oriented: Academic Recommendation, Link Prediction, Community Detection
• Network-oriented: Collaboration Pattern, Interdisciplinary Evolution, Research Trend Prediction
Advisor-advisee Relationship Mining in Scholarly Big Data
Wei Wang, Jiaying Liu, Feng Xia,
Irwin King, Hanghang Tong, et al.
ACM/IEEE JCDL 2016 Poster
WWW 2017
Work-in-Progress
Academic Mentorship is Vital
http://www.changeboard.com/content/5121/mentoring-the-good-the-bad-and-the-ugly/
Especially in scholarly data analytics
Advisor-advisee Relationship Information is Useful
To answer questions, e.g.,
• What makes a great advisor?
• How does an advisor's academic performance influence the future development of advisees?
• Who is the right/best advisor for a particular student?
To address issues, e.g.,
• Scholarly impact assessment and prediction
• Reviewer recommendation
• Academic rising stars identification
• And many more ….
Unfortunately, Such Dataset is NOT Available …
The Academic Family Tree: https://academictree.org/
Building a single, interdisciplinary academic genealogy
Academic Genealogy Wiki: http://phdtree.org/
Documenting the academic family tree of PhDs worldwide, both past and present
Very few efforts ... heavily rely on volunteers' efforts, which results in limited records and information
Ongoing Work
• Dataset: Automatically generate a large-scale relationship/mentorship dataset
• Relationship-based Data Analysis: Understanding the underlying principles of academic society
• Visualization: Academic genealogy visualization platform/system
The Idea
The advisor-advisee relationship is hidden in scientific collaboration/co-author networks
[Figure: an example co-author network (Ada, Bob, Tom, Jack; publications from 2000 to 2008) from which similarity and local properties are extracted to identify advisor-advisee pairs]
Proposed Shifu based on stacked autoencoder
Design of Shifu
1. Match the samples in DBLP and extract the required features for training as unlabeled input to Shifu
2. Obtain real advisor-advisee pairs from the Academic Genealogy Wiki project as training set
3. The back propagation (BP) method is used to train Shifu and optimize the model
4. The result of identifying advisors is obtained through the classifier after training
Feature Selection
Personal properties
• Number of Publications (NP)
• Academic Age (AA)
Collaboration network properties
• Collaboration Duration (CD)
• Times of First-two Authors (FTA)
• Collaboration Times (CT)
• Cohesion of Collaboration [1]:
$cohesion_{ij}^{t} = \frac{T_{ij}}{2}\left(\frac{1}{T_i} + \frac{1}{T_j}\right)$
[1] Wu T, Chen Y, Han J. Re-examination of interestingness measures in pattern mining: a unified framework. Data Mining and Knowledge Discovery, 2010, 21(3): 371-397
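As a rough illustration of how the pairwise features named on this slide could be derived from publication records, the sketch below computes NP, CT, CD, FTA, and cohesion for one pair of scholars. Several details are assumptions rather than the authors' definitions: CD is taken as the inclusive span between the first and last joint paper, FTA counts joint papers whose first two authors are exactly the pair, and cohesion uses the number of joint papers and each scholar's paper count as T_ij, T_i, and T_j. The records themselves are toy data.

```python
from typing import List, Tuple

Paper = Tuple[int, List[str]]  # (publication year, ordered author list)


def pair_features(papers: List[Paper], i: str, j: str) -> dict:
    """Compute illustrative collaboration features for the scholar pair (i, j)."""
    papers_i = [p for p in papers if i in p[1]]
    papers_j = [p for p in papers if j in p[1]]
    joint = [p for p in papers_i if j in p[1]]
    if not joint:
        raise ValueError("the two scholars never coauthored a paper")
    years = [year for year, _ in joint]
    ct = len(joint)
    return {
        "NP_i": len(papers_i),               # number of publications of i
        "NP_j": len(papers_j),               # number of publications of j
        "CT": ct,                            # collaboration times
        "CD": max(years) - min(years) + 1,   # collaboration duration (assumed inclusive)
        # joint papers where the pair occupies the first two author positions
        "FTA": sum(1 for _, authors in joint if set(authors[:2]) == {i, j}),
        # cohesion = T_ij / 2 * (1 / T_i + 1 / T_j)
        "cohesion": ct / 2 * (1 / len(papers_i) + 1 / len(papers_j)),
    }


# Toy publication records (hypothetical).
papers = [
    (2000, ["Tom", "Ada"]),
    (2001, ["Tom", "Bob", "Ada"]),
    (2001, ["Ada", "Jack"]),
    (2003, ["Tom", "Ada"]),
    (2008, ["Ada", "Bob"]),
]
features = pair_features(papers, "Tom", "Ada")
```

Cohesion is symmetric in the two scholars, so it says how tightly a pair works together but not who advises whom; asymmetric features such as NP and AA carry that direction.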
[Figure: accuracy, precision, recall, and F-measure of Shifu with AA and without AA]
Training Set Acquisition
Source: Academic Genealogy Wiki (http://phdtree.org/), Computer Science area
Example pair: Qixiang Sun, Hector Garcia-Molina
Features: NP of Sun, NP of Hector, AA of Sun, AA of Hector, CD, CT, FTA, Cohesion, AD
Model Training
[Figures: accuracy, precision, recall, and F-measure versus time duration (1 to 8), size of training set (50% to 90%), number of hidden layers, and number of units]
Results:
• All input features are based on publication information during the first eight years
• Three hidden layers with 7 units in each layer: [7, 7, 7]
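The slides describe Shifu as a stacked autoencoder with three hidden layers of 7 units each. The NumPy sketch below shows one common way to pretrain such a stack greedily, layer by layer, with backpropagation on the reconstruction error. It is not the authors' implementation: the synthetic data, sigmoid activations, learning rate, epoch count, and every function name are assumptions, and the supervised fine-tuning step and the final classifier are left out.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_autoencoder(X, n_hidden, rng, epochs=300, lr=0.5):
    """Fit one sigmoid autoencoder layer with batch gradient descent.

    Returns the encoder parameters and the reconstruction loss per epoch.
    """
    n, d = X.shape
    W, b = rng.normal(0.0, 0.1, (d, n_hidden)), np.zeros(n_hidden)  # encoder
    V, c = rng.normal(0.0, 0.1, (n_hidden, d)), np.zeros(d)         # decoder
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W + b)   # hidden code
        R = sigmoid(H @ V + c)   # reconstruction of the input
        err = R - X
        losses.append(float(np.mean(err ** 2)))
        d_out = err * R * (1.0 - R) / n          # gradient at decoder pre-activation
        d_hid = (d_out @ V.T) * H * (1.0 - H)    # backpropagated to the encoder
        V -= lr * (H.T @ d_out)
        c -= lr * d_out.sum(axis=0)
        W -= lr * (X.T @ d_hid)
        b -= lr * d_hid.sum(axis=0)
    return (W, b), losses


def pretrain_stack(X, layer_sizes, seed=0):
    """Greedy layer-wise pretraining: each layer learns to encode the previous codes."""
    rng = np.random.default_rng(seed)
    encoders, history, inputs = [], [], X
    for size in layer_sizes:
        (W, b), losses = train_autoencoder(inputs, size, rng)
        encoders.append((W, b))
        history.append(losses)
        inputs = sigmoid(inputs @ W + b)  # codes become the next layer's input
    return encoders, history, inputs


# Synthetic stand-in for pair features: 200 samples, 9 features scaled into (0, 1).
data_rng = np.random.default_rng(42)
latent = data_rng.normal(size=(200, 2))
X = sigmoid(latent @ data_rng.normal(size=(2, 9)))
encoders, history, codes = pretrain_stack(X, [7, 7, 7])
```

After pretraining, `codes` would feed a classifier, and the whole stack could then be fine-tuned end to end on labeled advisor-advisee pairs.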
Results
Baseline Methods
1. Support Vector Machine (SVM)
2. k-Nearest Neighbor (KNN)
3. Logistic Regression (LR)
4. Time-constrained Probabilistic Factor Graph (TPFG)
Evaluation Metrics
• Precision
• Recall
• F1
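For reference, the evaluation metrics named on this slide can be computed from binary predictions as follows. This is a minimal sketch with made-up labels, where 1 marks a true advisor-advisee pair; it is not tied to the Shifu results.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical ground truth and predictions for eight candidate pairs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```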
Shifu for DBLP
Large-scale advisor-advisee pair data set
• Obtained 1,111,513 advisor-advisee pairs in DBLP
• Integrated a wealth of scholars' personal information
Method for calculating the probability of an advisor-advisee relationship
• Provided a method for calculating the probability that any two scholars are in an advisor-advisee relationship
The Web of Scholars
Remarks
Contributions
• A deep learning-based advisor-advisee relationship identification approach
• A large-scale advisor-advisee relationship dataset
Discussion
• Not everyone has an advisor
• Collaboration patterns between advisors and advisees may vary from one institution to another
• Further proofread the generated advisor-advisee pairs
• Ground truth!
Future Work
• Improve the solution design
• Further improve the accuracy of the data set via crowdsourcing?
• Provide a platform based on the improved advisor-advisee data set
Overall Reconstruction
01 Node Attributes Representation
Employ the deep autoencoder to convert the node attribute matrix to a low-dimensional vector representation.
02 Edge Attributes Representation
Employ the deep autoencoder to convert the edge attribute matrix to a low-dimensional vector representation.
03 Pooling Layer
Compress the input features. We reduce each adjacency vector to 1000 dimensions and calculate the mean of the reduced vectors accordingly as the input.
04 Advisor-advisee Relationship Prediction
We add a supervised classifier to our model, which takes the output of the last hidden layer as the input and outputs the edges' representations.
Node Representation Construction
01 Node Autoencoder
Academic Age:
$AA = Y_c - Y_f$
$Y_c$: the year of the first collaboration
$Y_f$: the year of the first publication
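Because the academic age is measured at the first collaboration, the two scholars in a pair generally get different values. The sketch below derives both from publication records under that reading; the record format, the scholar names, and the function name are assumptions made for illustration.

```python
def academic_ages(papers, i, j):
    """Return AA = Y_c - Y_f for both scholars of the pair (i, j).

    papers: iterable of (year, author list).
    Raises ValueError if the two scholars never coauthored a paper.
    """
    def first_publication_year(scholar):
        return min(year for year, authors in papers if scholar in authors)

    # Y_c: year of the pair's first joint paper
    y_c = min(year for year, authors in papers if i in authors and j in authors)
    return {i: y_c - first_publication_year(i), j: y_c - first_publication_year(j)}


# Hypothetical records: Hana has published since 1985, Sam since 1999.
papers = [
    (1985, ["Hana"]),
    (1992, ["Hana", "Lee"]),
    (1999, ["Sam"]),
    (2000, ["Sam", "Hana"]),
    (2002, ["Sam", "Hana"]),
]
ages = academic_ages(papers, "Hana", "Sam")
```

A large gap between the two academic ages at the first collaboration is the kind of asymmetry that can separate a likely advisor from a likely advisee.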
Edge Representation Construction
02 Edge Autoencoder
Collaboration Similarity:
$Kulc_{ij}^{t} = \frac{NP_{ij}}{2}\left(\frac{1}{NP_i} + \frac{1}{NP_j}\right)$
Training Set Acquisition
Datasets
Microsoft Academic Graph: https://www.openacademic.ai/oag/
The Academic Family Tree: https://academictree.org
Node attributes
Edge attributes
Label for each edge
Model Training: Input Features
The performance of Shifu2 and of Shifu2 without the node autoencoder (Shifu2-E)

Field             Model     Accuracy  Precision  Recall  F-measure
Chemistry         Shifu2    0.939     0.925      0.958   0.941
Chemistry         Shifu2-E  0.789     0.753      0.914   0.823
Computer Science  Shifu2    0.931     0.912      0.959   0.933
Computer Science  Shifu2-E  0.813     0.782      0.883   0.830
Economics         Shifu2    0.913     0.877      0.961   0.917
Economics         Shifu2-E  0.507     0.506      0.602   0.550
Engineering       Shifu2    0.915     0.889      0.952   0.919
Engineering       Shifu2-E  0.736     0.718      0.873   0.784
Mathematics       Shifu2    0.919     0.898      0.947   0.922
Mathematics       Shifu2-E  0.702     0.670      0.889   0.760
Physics           Shifu2    0.932     0.935      0.959   0.933
Physics           Shifu2-E  0.846     0.827      0.890   0.857
Critical Issues in Academic Genealogy Generation
Author name disambiguation: merge scholars who
• cited each other at least once;
• share at least one co-author;
• have at least one identical affiliation
To exclude authors who never had an advisor in their career, we apply Shifu only to authors who meet the following criteria:
• have published at least 1 paper every 5 years;
• have published at least 10 papers in the entire dataset;
• their publication career spans at least 8 years.
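The three author-selection criteria above can be expressed as a simple filter over each author's publication years. The sketch below is one possible reading, not the authors' code: "at least 1 paper every 5 years" is interpreted as no gap longer than 5 years between consecutive publication years, and the career span is the difference between the last and first publication year. The sample authors are invented.

```python
def is_candidate(pub_years, max_gap=5, min_papers=10, min_span=8):
    """Check the three selection criteria on one author's publication years."""
    if len(pub_years) < min_papers:          # at least 10 papers in the dataset
        return False
    years = sorted(pub_years)
    if years[-1] - years[0] < min_span:      # career spans at least 8 years
        return False
    # at least one paper every 5 years, read as: no gap longer than max_gap
    return all(later - earlier <= max_gap for earlier, later in zip(years, years[1:]))


# Hypothetical authors mapped to the years of their papers.
authors = {
    "steady": [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009],
    "few_papers": [2000, 2004, 2008],
    "short_career": [2010] * 5 + [2012] * 5,
    "long_gap": [1990] * 5 + [2000] * 5,
}
kept = [name for name, years in authors.items() if is_candidate(years)]
```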
Disciplinary differences and temporal effect elimination:
• Re-scaled number of publications.
• Research field normalization.
Remarks
Novel Mining Model
We devise Shifu2, a task-dependent model based on network representation learning. Unlike existing studies in network representation, we consider the semantic information of both nodes and edges for embedding.
Benchmark Dataset
We generate a large-scale dataset containing not only advisor-advisee pairs, but also the academic attributes and publication records for each scholar.
Future Work
(1) Since we can verify the effectiveness of Shifu2 on advising relationship identification, how can we extend our model to the identification of other types of relationships, such as friendship in social networks?
(2) How can we correlate the discovered implicit relationships with other tasks such as evaluating the impact of scholars?
From triadic closure to conference closure:
The role of academic conferences in promoting
scientific collaborations
Wei Wang, Xiaomei Bai, Feng Xia, Teshome
Megersa Bekele, Xiaoyan Su, Amr Tolba
Scientometrics, 2017
DOI: 10.1007/s11192-017-2468-x
Observations
Scholars are becoming more and more collaborative
• Continuous increase in the number of co-authored papers in every scientific discipline
• Coauthored publications are cited more frequently than single-authored papers
• Increasingly, public and private research funding agencies require interdisciplinary, international, and inter-institutional collaboration
[Figure panels: Physics, Computer Science, Mathematics, Social Science]
Factors Affecting Collaboration
Environmental Context
Separation across distance and time often places more reliance on asynchronous communication and can result in increased demands on coordination
• Co-located
• Distributed in time/space
• Mobile
• Intra- & inter-organization
Tech Ecosystem
Collaborative work often involves dealing with domain-specific tooling in addition to generic productivity tooling. Typically, multiple sets of tools are involved
• Generic productivity tools
• Communication tools
• Domain-specific tools
Communication & Coordination
Communication underpins how collaborators understand each other and how collaborative work gets managed and accomplished
• Conceptual, social, logistical
• Synchronous, asynchronous
• Multi-channel (nonverbal, verbal, formal, informal, in-person, remote)
Awareness
Awareness is essential in enabling collaborators to be efficient and effective
• Task & activity status and conditions
• Whereabouts & actions of collaborators
• Availability of resources (human & knowledge)
Roles & Responsibilities
Levels of participation are dictated by roles and responsibilities of collaborators with disparate specialized expertise
• Owner
• Co-owner
• Contributor
• Reviewer
• Approver (if needed)
Norms & Culture
Group norms and individual preferences have a significant impact on the various aspects of collaboration, particularly in terms of transparency, access, and communication
• Hierarchical vs. lateral decision making
• Individual preferences
• Open (public) vs. private
• Collegial, reciprocating (of favors) vs. self-interested
Task Characteristics
Task type influences the degree of collaboration experienced or necessary. Highly interdependent or "tightly-coupled" tasks require more coordination and communication. "Loosely-coupled" tasks require less interaction, can be accomplished independently and subsequently integrated into the collective output
• Task types
  • Routine/predictable vs. unpredictable
  • Cognitive/conceptual vs. behavioral
  • Complex vs. easy
• Task structures
  • Loosely-coupled
  • Tightly-coupled
Access to Resources
Effective collaboration requires being able to quickly and easily tap into networks of human expertise and knowledge
• Networks of experts (personal, professional, social)
• Relevant data/info (shared knowledge repository)
• Shared workspace (physical or virtual)
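How to Find a New Collaborator?
[Figure: three ways scholars A and B can become connected: (a) triadic closure, through a common collaborator; (b) focal closure, through a shared focus such as an institution (CMU, MIT); (c) conference closure, through a shared conference (KDD, WWW)]
(a) Triadic closure (b) Focal closure (c) Conference closure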
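Triadic closure, the first of the three mechanisms, is the easiest to make concrete: two scholars who are not yet connected but share a collaborator are candidates for a future link. The sketch below lists such candidate pairs in a small co-author network. The network and the function name are hypothetical, chosen only to illustrate the idea.

```python
from itertools import combinations


def triadic_closure_candidates(coauthors):
    """Map each unconnected pair to the set of collaborators the pair has in common.

    coauthors: dict mapping each scholar to the set of their co-authors.
    """
    candidates = {}
    for a, b in combinations(sorted(coauthors), 2):
        if b in coauthors[a]:
            continue  # already collaborators, nothing to close
        common = coauthors[a] & coauthors[b]
        if common:
            candidates[(a, b)] = common
    return candidates


# Hypothetical co-author network: C has worked with A, B, and D.
coauthors = {
    "A": {"C"},
    "B": {"C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
candidates = triadic_closure_candidates(coauthors)
```

Focal and conference closure follow the same pattern, with the shared collaborator replaced by a shared institution or a shared conference.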
Research Questions
1. Will academic conferences bring/promote new scientific collaboration?
2. How to quantify the impact of academic conferences in promoting scientific collaborations?
3. What kind of academic conferences will bring more new collaborations?
Quantifying Conference Closure
[Figure: worked example on a small network (nodes A to G and A1 to A3; conference: KDD 2010)]
Individual Level: $CC_i = 2/5 = 0.4$
Community Level: $CC_{com} = 5/7 \approx 0.71$
Data Set
Source: DBLP (Data Mining): 22 conferences, 8,990 scholars
*The field rating of each conference is crawled from Microsoft Academic Search; a high field rating means a high reputation.
Assumptions:
1. If a scholar has published a paper in the conference proceedings, he/she is regarded as an attendee of the conference with probability α. The parameter α ranges from 0.1 to 1.
2. If two unconnected scholars coauthor a paper several (1-5) years after attending the same conference, their collaboration is regarded as promoted by conference closure.
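The second assumption translates fairly directly into code: a pair counts as a conference-closure pair when both scholars attended the same conference edition and their first joint paper appeared 1 to 5 years later. The sketch below takes α = 1, so every author in the proceedings counts as an attendee. The individual-level ratio at the end is only one plausible reading, namely the share of a scholar's collaborators gained through conference closure, since the slides give worked values but not the formula. The toy data are constructed so that scholar A reaches 2/5 under that reading.

```python
from itertools import combinations


def conference_closure_pairs(attendance, collaborations, min_lag=1, max_lag=5):
    """Pairs whose first joint paper came 1-5 years after co-attending a conference.

    attendance: {(conference, year): set of attendees}
    collaborations: {frozenset({a, b}): list of years with a joint paper}
    """
    closed = set()
    for (_, year), attendees in attendance.items():
        for a, b in combinations(sorted(attendees), 2):
            years = collaborations.get(frozenset((a, b)))
            if not years:
                continue  # the pair never collaborated
            # A first joint paper after the conference implies they were unconnected then.
            if min_lag <= min(years) - year <= max_lag:
                closed.add(frozenset((a, b)))
    return closed


def individual_ratio(scholar, closed, collaborations):
    """Assumed reading: closure-based collaborators divided by all collaborators."""
    partners = {p for pair in collaborations if scholar in pair for p in pair if p != scholar}
    via_closure = {p for pair in closed if scholar in pair for p in pair if p != scholar}
    return len(via_closure) / len(partners) if partners else 0.0


# Hypothetical data: A, B, C, D attended KDD 2010 (alpha = 1).
attendance = {("KDD", 2010): {"A", "B", "C", "D"}}
collaborations = {
    frozenset(("A", "B")): [2012],        # first joint paper 2 years later
    frozenset(("A", "C")): [2011, 2013],  # first joint paper 1 year later
    frozenset(("A", "D")): [2008],        # already connected before the conference
    frozenset(("A", "E")): [2012],        # E did not attend
    frozenset(("A", "F")): [2009],        # F did not attend
    frozenset(("B", "C")): [2017],        # 7 years later, outside the window
}
closed = conference_closure_pairs(attendance, collaborations)
ratio_a = individual_ratio("A", closed, collaborations)
```

To model α below 1, each author in the proceedings could be kept as an attendee with probability α before the pairs are formed.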
Influence of Attendance Ratio
There is no obvious regular variation, which indicates that α has little impact on conference closure. We take α = 1 in the following experiments.
Impact of Conference Properties
Conferences with higher field ratings and larger scales promote more
research collaborations.
Impact of Involving in Multiple Conferences
Scholars involved in multiple conferences are more likely to meet
new collaborators, i.e., active scholars will gain more benefits from
the conferences.
Impact of Scientific Productivity
Productive scholars will meet more
collaborators during
conferences.
Multivariate Analysis of Conference Closure
All the three factors have positive relationships with conference closure.
Thank you!
Feng Xia
http://fengxia.net
Mobile/WhatsApp: +86-18504228752
WeChat/Skype: TheSmartFeng
Email: [email protected]; [email protected]