Feng Xia Scholarly Social Computing Dalian University of Technology (DUT), China http://fengxia.net @ WIMS 2019, Seoul, Korea

Feng Xia

Scholarly Social Computing

Dalian University of Technology (DUT), China

http://fengxia.net

@ WIMS 2019, Seoul, Korea


The City: Dalian

• A major city and seaport in the south of Liaoning province
• The southernmost city of Northeast China
• The province's second largest city, with sub-provincial administrative status
• A financial, shipping and logistics center for Northeast Asia
• In 2006, Dalian was named China's most livable city by China Daily
• Population (2010): 6,690,432

From Wikipedia, the free encyclopedia: http://en.wikipedia.org/wiki/Dalian

What?

Why?

Where?

Our goal is to create innovation by conducting interdisciplinary, application-driven academic research. We are interested in a broad spectrum of cutting-edge research topics including data science, knowledge management, network science, computational social science, human behavior, and mobile social networks.

Alpha means "first" in Greek. We borrow this word to express the idea that we pursue being extraordinary not only in academic research, but also in fully exploiting our own potential. We value hard work and talent. We embrace change and differences.

Full name of the Lab: The Alpha Lab @ Dalian University of Technology, China.

Address: School of Software, Dalian University of Technology, Development Zone, Dalian 116620, China.

URL (website): http://TheAlphaLab.org

Supervisors

Feng Xia: Professor

Dongyu Zhang (张冬瑜)

Students

PhD, master's, and senior undergraduate students: 70+

From 7 different countries

We are family!

Current Research Interests

Research Groups

Science /Eng. Innovation

Health /Medicine

Education /Career

HUMAN

Social Computing

Scholarly Big Data

Social Relationship

Mining

Scientific Collaboration

Dynamics

Agenda

What is (Scholarly) Social Computing?

What is Social Computing?

Why Social Computing?

Source: Forrester Research

Towards Scholarly Social Computing

What is Scholarly Big Data?

Feng Xia, Wei Wang, Teshome Megersa Bekele, Huan Liu.

Big Scholarly Data: A Survey

IEEE Transactions on Big Data, 2017

DOI: 10.1109/TBDATA.2016.2641460

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.

Scholarly big data

• Volume: over 114 million scholarly documents
• Velocity: growing at an average of 6.3% per year
• Variety: various entities, including papers, authors, etc.
• Value: research prediction; fund allocation; impact evaluation
• Veracity: author name disambiguation, deduplication

What is “Scholarly Big Data”?

Scholarly Big Data is a term coined for the rapidly growing body of scholarly data, which contains information including millions of authors, papers, books, citations, figures, and tables, as well as scholarly networks and digital libraries.

Any data that relate to scholarship.

The Tip of the Iceberg

Scholarly Big Data

State of the art

How Large is Scholarly Big Data?

Jason Rollins, iConference 2017

Available Academic Datasets (Examples)

Exploring Big Scholarly Data

Framework diagram; labels grouped by apparent stage:

• Collecting: academic websites, scholar homepages, conference and journal official websites, academic social network sites (ASNs), scholarly search engines
• Data preprocessing: name disambiguation, integration, profiling
• Storage, indexing, access
• Modeling: co-authorship network, citation network, co-citation network, bibliographic coupling network, co-word network
• Analysis: social network metrics, properties, ASNs analysis tools
• Technology: similarity measures, statistics, frequent patterns, machine learning
• Application: actor-oriented tasks, relationship-oriented tasks, network-oriented tasks

Xiangjie Kong, Yajie Shi, Shuo Yu, Jiaying Liu, Feng Xia. Academic Social Networks: Modeling, Analysis, Mining and Applications, Journal of Network and Computer Applications, Volume 132, April 2019, Pages 86-103. DOI: 10.1016/j.jnca.2019.01.029

Scholarly Entities and their Relationships

Entity-relationship diagram; recoverable labels:

• Entities: Publications, Researchers, Organizations, Venues (Conferences, Journals), Terms
• Attribute labels: Title, Year, DOI, Pages, Contents, Abstracts, Name, Affiliations, Education, Field, Impact, Position, Type, Location, Member, Ranking, Date, Publisher, Keywords, Interest
• Relationship labels: Cite, Author, Tag, Host, Published at, Work at, Co-words, Co-authorship, Co-citation, Collaboration, Dependence

The Webs: Big/Complex Networks Behind!

Scholarly Homogeneous Networks

Scholarly Heterogeneous Networks

Institutions Authors Publications Venues

The Webs: Example I

Xiaomei Bai, Hui Liu, Fuli Zhang, Zhaolong Ning, Xiangjie Kong, Ivan Lee, Feng Xia. An Overview on Evaluating and Predicting Scholarly Article Impact, Information, 2017, 8(3), 73; DOI:10.3390/info8030073

Different networks for various scholarly entities and their relationships

The Webs: Example II

World map, generated with Tableau, showing the number of advisees and their advisors in 63 countries. Advisees from different countries are shown in different colors.

Jiaying Liu, Tao Tang, Wei Wang, Bo Xu, Xiangjie Kong, and Feng Xia. A Survey of Scholarly Data Visualization, IEEE Access, 2018, 6(1): 19205-19221. DOI: 10.1109/ACCESS.2018.2815030

The Webs: Example III

Jiaying Liu, Tao Tang, Wei Wang, Bo Xu, Xiangjie Kong, and Feng Xia. A Survey of Scholarly Data Visualization, IEEE Access, 2018, 6(1): 19205-19221. DOI: 10.1109/ACCESS.2018.2815030

Collaboration network of Harvard University in Acemap. Each node represents an author at the institution, and each edge represents a collaboration between two authors. The color of a node represents the author's research field.

Key Mining Techniques

• Similarity Measure
  • Linkage-based and Structural Methods: PageRank, SimRank
  • Content-based Methods: Distance-based Algorithms, Cosine-based Algorithms, Correlation-based Algorithms, Jaccard Coefficient
• Statistical Relational Learning
  • Probabilistic Relational Models
  • Relational Markov Networks
  • Structural Logistic Regression
  • Relational Dependency Networks
  • Markov Logic Networks
• Graph Mining
  • Frequent Subgraph Mining: Apriori-based Algorithms, FP-growth Algorithms
  • Significant Subgraph Mining
  • Dense Subgraph Mining
• Machine Learning
  • Supervised Machine Learning: Decision Tree, Neural Networks, Support Vector Machines, k-Nearest Neighbors
  • Unsupervised Machine Learning: Hierarchical Methods, Partition-based Methods, Density-based Methods, Grid-based Methods, Model-based Methods
• Deep Learning
  • Supervised Deep Learning: Convolutional Neural Networks
  • Unsupervised Deep Learning: Autoencoder-based Methods, Boltzmann Machines
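To make two of the content-based similarity measures listed above concrete, here is a minimal Python sketch of cosine similarity and the Jaccard coefficient. The function names and the toy inputs (keyword count vectors and co-author sets) are illustrative assumptions, not taken from the slides.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def jaccard_coefficient(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical keyword-count vectors for two papers.
paper_x = [1, 2, 0, 1]
paper_y = [1, 0, 0, 1]
cos_xy = cosine_similarity(paper_x, paper_y)

# Hypothetical co-author sets for two scholars.
coauthors_ada = {"Bob", "Tom", "Jack"}
coauthors_bob = {"Tom", "Jack", "Eve"}
jac = jaccard_coefficient(coauthors_ada, coauthors_bob)

print(round(cos_xy, 4), jac)
```

Cosine similarity suits weighted feature vectors, while the Jaccard coefficient suits unweighted sets such as shared co-authors or shared keywords.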

Big Scholarly Data Applications

• Actor-oriented: Author Tasks, Paper Tasks, Journal Tasks
• Relationship-oriented: Academic Recommendation, Link Prediction
• Network-oriented: Community Detection
• Potentials: Collaboration Pattern, Interdisciplinary Evolution, Research Trend Prediction

Social Relationship Mining: Who Is Connected with Whom?

Advisor-advisee Relationship Mining in Scholarly Big Data

Wei Wang, Jiaying Liu, Feng Xia, Irwin King, Hanghang Tong, et al.

ACM/IEEE JCDL 2016 Poster

WWW 2017 Work-in-Progress

Academic Mentorship is Vital

http://www.changeboard.com/content/5121/mentoring-the-good-the-bad-and-the-ugly/

Especially in scholarly data analytics

Advisor-advisee Relationship Information is Useful

To answer questions, e.g.,

• What makes a great advisor?
• How does an advisor's academic performance influence the future development of advisees?
• Who is the right/best advisor for a particular student?

To address issues, e.g.,

• Scholarly impact assessment and prediction
• Reviewer recommendation
• Academic rising star identification
• And many more ...

The Academic Family Tree: https://academictree.org/

Building a single, interdisciplinary academic genealogy

Academic Genealogy Wiki: http://phdtree.org/

Documenting the academic family tree of PhDs worldwide, both past and present

Very few efforts ... heavily rely on volunteers' efforts, which results in limited records and information

Unfortunately, Such a Dataset is NOT Available …

Ongoing Work

• Dataset: automatically generate a large-scale relationship/mentorship dataset
• Relationship-based Data Analysis: understanding the underlying principles of academic society
• Visualization: academic genealogy visualization platform/system

The Problem

How to get a large-scale mentorship dataset automatically?

The Idea

The advisor-advisee relationship is hidden in scientific collaboration/co-author networks.

Figure: example co-author network (nodes A, B, C, D; scholars Ada, Bob, Tom, Jack; years 2000, 2001, 2008) with an advisor, an advisee, and the advisee's publication collaborators. Similarity and local properties are used to identify advisor-advisee pairs.

Proposed Shifu based on stacked autoencoder

Design of Shifu

1. Obtain real advisor-advisee pairs from the Academic Genealogy Wiki project as the training set
2. Match the samples in DBLP and extract the required features for training as unlabeled input to Shifu
3. The back propagation (BP) method is used to train Shifu and optimize the model
4. The result of identifying advisors is obtained through a classifier after training

Feature Selection

Personal properties

• Number of Publications (NP)
• Academic Age (AA)

Collaboration network properties

• Collaboration Duration (CD)
• Times of First-two Authors (FTA)
• Collaboration Times (CT)
• Cohesion of Collaboration [1]:

$$\mathit{cohesion}_{ij}^{t} = \frac{T_{ij}}{2}\left(\frac{1}{T_{i}} + \frac{1}{T_{j}}\right)$$

[1] Wu T, Chen Y, Han J. Re-examination of interestingness measures in pattern mining: a unified framework. Data Mining and Knowledge Discovery, 2010, 21(3): 371-397
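The cohesion feature can be sketched in a few lines of Python. The slide does not define its symbols, so this sketch assumes that T_ij is the number of papers scholars i and j have co-authored up to year t, and that T_i and T_j are their individual publication counts over the same period. The function name and example numbers are hypothetical.

```python
def cohesion(t_ij, t_i, t_j):
    """Kulczynski-style cohesion: (T_ij / 2) * (1 / T_i + 1 / T_j).

    Assumed meaning of the arguments (not stated on the slide):
    t_ij -- number of papers co-authored by scholars i and j
    t_i  -- total number of papers by scholar i
    t_j  -- total number of papers by scholar j
    """
    if t_i <= 0 or t_j <= 0:
        raise ValueError("both scholars need at least one publication")
    return (t_ij / 2) * (1 / t_i + 1 / t_j)


# Hypothetical student with 5 papers, 4 of them written with an
# advisor who has 20 papers in total.
value = cohesion(t_ij=4, t_i=5, t_j=20)
print(value)  # approximately 0.5
```

Under these assumptions the value is the average of the two conditional co-authorship ratios T_ij/T_i and T_ij/T_j, so it lies between 0 and 1 whenever T_ij does not exceed either individual count.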

Figure: accuracy, precision, recall, and F-measure with AA and without AA (vertical axis from 0.83 to 0.92).

Training Set Acquisition

Academic Genealogy Wiki: http://phdtree.org/

Example (Computer Science area): Qixiang Sun, Hector Garcia-Molina

Feature labels shown for the example: NP of Sun, NP of Hector, AA of Sun, AA of Hector, CD, CT, FTA, Cohesion, AD

Model Training

Figures: accuracy, precision, recall, and F-measure plotted against time duration, size of training set, number of hidden layers, and number of units (recoverable axis ticks: 1 to 8; 50% to 90%).

Results:

• All input features are based on publication information during the first eight years
• Three hidden layers with 7 units in each layer: [7, 7, 7]

Results

Baseline Methods

1. Support Vector Machine (SVM)
2. k-Nearest Neighbor (KNN)
3. Logistic Regression (LR)
4. Time-constrained Probabilistic Factor Graph (TPFG)

Evaluation Metrics

• Precision
• Recall
• F1
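For reference, the three evaluation metrics can be computed from binary labels as in the following minimal sketch. The label encoding (1 for an advisor-advisee pair, 0 otherwise) and the example predictions are assumptions for illustration, not results from the talk.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels: 1 = true advisor-advisee pair, 0 = not a pair.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

In this toy example there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 3/4.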

Shifu for DBLP

Large-scale advisor-advisee pair data set

• Obtained 1,111,513 advisor-advisee pairs in DBLP
• Integrated a wealth of scholars' personal information

Method for calculating the probability of an advisor-advisee relationship

• Provided a method for calculating the probability that any two scholars have an advisor-advisee relationship

The Web of Scholars

Remarks

Contributions

• A deep learning-based advisor-advisee relationship identification approach
• A large-scale advisor-advisee relationship dataset

Discussion

• Not everyone has an advisor
• Collaboration patterns between advisors and advisees may vary from one institution to another
• Further proofread the generated advisor-advisee pairs
• Ground truth!

Future Work

• Improve the solution design
• Further improve the accuracy of the data set via crowdsourcing?
• Provide a platform based on the improved advisor-advisee data set

Shifu2


Framework of Shifu2


Overall Reconstruction

• Pooling Layer: compress the input features. We reduce each adjacency vector to 1000 dimensions and calculate the mean of the reduced vectors accordingly as the input.
• Node Attributes Representation: employ the deep autoencoder to convert the node attribute matrix to the low-dimensional vector representation.
• Edge Attributes Representation: employ the deep autoencoder to convert the edge attribute matrix to the low-dimensional vector representation.
• Advisor-advisee Relationship Prediction: we add a supervised classifier in our model, which takes the output of the last hidden layer as the input and outputs the edges' representations.

Node Representation Construction

Node Autoencoder

Academic Age: $AA = Y_c - Y_f$, where $Y_c$ is the year of the first collaboration and $Y_f$ is the year of the first publication.

Edge Representation Construction

Edge Autoencoder

Collaboration Similarity:

$$\mathit{Kulc}_{ij}^{t} = \frac{NP_{ij}}{2}\left(\frac{1}{NP_{i}} + \frac{1}{NP_{j}}\right)$$

Process of the Advisor-Advisee Relationship Mining Task


Training Set Acquisition

Datasets

Microsoft Academic Graph: https://www.openacademic.ai/oag/

The Academic Family Tree: https://academictree.org

Node attributes

Edge attributes

Label for each edge


Performance Comparison

01 DeepWalk

02 LINE

03 Node2vec

04 TransNet


Model Training: Number of Layers

Node Encoder

Edge Encoder


Model Training: Input Features

The performance of Shifu2 without the node autoencoder

Field             Model     Accuracy  Precision  Recall  F-measure
Chemistry         Shifu2    0.939     0.925      0.958   0.941
Chemistry         Shifu2-E  0.789     0.753      0.914   0.823
Computer Science  Shifu2    0.931     0.912      0.959   0.933
Computer Science  Shifu2-E  0.813     0.782      0.883   0.830
Economics         Shifu2    0.913     0.877      0.961   0.917
Economics         Shifu2-E  0.507     0.506      0.602   0.550
Engineering       Shifu2    0.915     0.889      0.952   0.919
Engineering       Shifu2-E  0.736     0.718      0.873   0.784
Mathematics       Shifu2    0.919     0.898      0.947   0.922
Mathematics       Shifu2-E  0.702     0.670      0.889   0.760
Physics           Shifu2    0.932     0.935      0.959   0.933
Physics           Shifu2-E  0.846     0.827      0.890   0.857

Model Training: Different Sizes of Training Data


Critical Issues in Academic Genealogy Generation

Author name disambiguation: merge scholars who

• have cited each other at least once;
• share at least one co-author;
• have at least one identical affiliation.

To exclude authors who did not have an advisor during their career, we apply Shifu only to authors who meet the following criteria:

• have published at least 1 paper every 5 years;
• have published at least 10 papers in the entire dataset;
• their publication career spans at least 8 years.
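The three eligibility criteria above can be expressed as a small filter. This is a sketch under stated assumptions: "at least 1 paper every 5 years" is read as no gap longer than 5 years between consecutive publication years, and the career span is counted inclusively (last year minus first year plus one). The function and parameter names are hypothetical.

```python
def is_eligible(pub_years, min_papers=10, max_gap=5, min_span=8):
    """Check the three filtering criteria on a list of publication years.

    pub_years holds one entry per paper (years may repeat).
    Assumed readings: max_gap bounds the distance between consecutive
    publication years; the span counts both end years.
    """
    if len(pub_years) < min_papers:
        return False
    years = sorted(pub_years)
    if years[-1] - years[0] + 1 < min_span:
        return False
    return all(later - earlier <= max_gap
               for earlier, later in zip(years, years[1:]))


# Hypothetical publication histories.
steady = list(range(2000, 2010))      # 10 papers, one per year
with_gap = [2000] * 5 + [2007] * 5    # 10 papers but a 7-year gap
too_few = [2000, 2004, 2008]          # long career, only 3 papers

print(is_eligible(steady), is_eligible(with_gap), is_eligible(too_few))
```

Other readings of the first criterion are possible, for example requiring a paper in every fixed 5-year window; the thresholds are parameters so that alternative choices are easy to test.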

Disciplinary differences and temporal effect elimination:

• Re-scaled number of publications.
• Research field normalization.

Remarks

Novel Mining Model

We devise Shifu2, a task-dependent model based on network representation learning. Different from existing studies in network representation, we consider the semantic information of both nodes and edges for embedding.

Benchmark Dataset

We generate a large-scale dataset containing not only advisor-advisee pairs, but also the academic attributes and publication records for each scholar.

Future Work

(1) Since we can verify the effectiveness of Shifu2 on advising relationship identification, how can we extend our model to the identification of other types of relationships, such as friendship in social networks?

(2) How can we correlate the discovered implicit relationship with other tasks such as evaluating the impact of scholars?

Scientific Collaboration Dynamics: How the Web Changes

From triadic closure to conference closure: The role of academic conferences in promoting scientific collaborations

Wei Wang, Xiaomei Bai, Feng Xia, Teshome Megersa Bekele, Xiaoyan Su, Amr Tolba

Scientometrics, 2017

DOI: 10.1007/s11192-017-2468-x

Scholars are becoming more and more collaborative

Continuous increase in the number of co-authored papers in every scientific discipline

Coauthored publications are cited more frequently than single-authored papers

Increasingly, public and private research funding agencies require interdisciplinary, international, and inter-institutional collaboration

Physics

Computer Science

Mathematics

Social Science

Observations

Factors Affecting Collaboration

Environmental Context
• Co-located
• Distributed in time/space
• Mobile
• Intra- & inter-organization
Separation across distance and time often places more reliance on asynchronous communication and can result in increased demands on coordination.

Tech Ecosystem
• Generic productivity tools
• Communication tools
• Domain-specific tools
Collaborative work often involves dealing with domain-specific tooling in addition to generic productivity tooling. Typically, multiple sets of tools are involved.

Communication & Coordination
• Conceptual, social, logistical
• Synchronous, asynchronous
• Multi-channel (nonverbal, verbal, formal, informal, in-person, remote)
Communication underpins how collaborators understand each other and how collaborative work gets managed and accomplished.

Awareness
• Task & activity status and conditions
• Whereabouts & actions of collaborators
• Availability of resources (human & knowledge)
Awareness is essential in enabling collaborators to be efficient and effective.

Roles & Responsibilities
• Owner
• Co-owner
• Contributor
• Reviewer
• Approver (if needed)
Levels of participation are dictated by roles and responsibilities of collaborators with disparate specialized expertise.

Norms & Culture
• Hierarchical vs. lateral decision making
• Individual preferences
• Open (public) vs. private
• Collegial, reciprocating (of favors) vs. self-interested
Group norms and individual preferences have a significant impact on the various aspects of collaboration, particularly in terms of transparency, access, and communication.

Task Characteristics
• Task types
  • Routine/predictable vs. unpredictable
  • Cognitive/conceptual vs. behavioral
  • Complex vs. easy
• Task structures
  • Loosely-coupled
  • Tightly-coupled
Task type influences the degree of collaboration experienced or necessary. Highly interdependent or "tightly-coupled" tasks require more coordination and communication. "Loosely-coupled" tasks require less interaction, can be accomplished independently and subsequently integrated into the collective output.

Access to Resources
• Networks of experts (personal, professional, social)
• Relevant data/info (shared knowledge repository)
• Shared workspace (physical or virtual)
Effective collaboration requires being able to quickly and easily tap into networks of human expertise and knowledge.

How to Find a New Collaborator?

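Figure: three ways scholars A and B can become collaborators: (a) Triadic closure, (b) Focal closure (example labels: CMU, MIT), (c) Conference closure (example labels: KDD, WWW). Other labels in the figure: Ada, p1, p2, p3.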

Research Questions

1. Will academic conferences bring/promote new scientific collaboration?

2. How to quantify the impact of academic conferences in promoting scientific collaborations?

3. What kind of academic conferences will bring more new collaborations?

Quantifying Conference Closure

Figure: example for KDD 2010 with scholars A to G and attendees A1 to A3.

Individual Level: CC_i = 2/5 = 0.4

Community Level: CC_com = 5/7 = 0.71

Data Set

• Source: DBLP (Data Mining)
• 22 conferences, 8,990 scholars

* The field rating of each conference is crawled from Microsoft Academic Search; a high field rating means a high reputation.

Assumptions:

1. If a scholar has published a paper in the conference proceedings, he/she is regarded as an attendee of the conference with probability α. The parameter α ranges from 0.1 to 1.
2. If two unconnected scholars coauthor a paper several (1-5) years after attending the same conference, their collaboration is regarded as promoted by conference closure.
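The second assumption can be turned into a simple detection rule. The sketch below takes α = 1 (every author of a proceedings paper counts as an attendee) and treats two scholars as unconnected at a conference if their first joint paper appears after the conference year. The data layout, function name, and example records are hypothetical.

```python
from itertools import combinations


def conference_closure_pairs(attendance, first_coauthor_year,
                             min_lag=1, max_lag=5):
    """Return scholar pairs whose first collaboration follows a shared conference.

    attendance: dict mapping (conference, year) -> set of attendee names.
    first_coauthor_year: dict mapping frozenset({a, b}) -> year of the
        pair's first co-authored paper.
    A pair qualifies if that first paper appears min_lag to max_lag years
    after a conference edition both scholars attended.
    """
    promoted = set()
    for (_conference, year), attendees in attendance.items():
        for a, b in combinations(sorted(attendees), 2):
            pair = frozenset((a, b))
            first = first_coauthor_year.get(pair)
            if first is None:
                continue  # the two scholars never collaborated
            if min_lag <= first - year <= max_lag:
                promoted.add(pair)
    return promoted


# Hypothetical records for one conference edition.
attendance = {("KDD", 2010): {"Ada", "Bob", "Tom"}}
first_coauthor_year = {
    frozenset(("Ada", "Bob")): 2012,  # 2 years later: counted
    frozenset(("Ada", "Tom")): 2008,  # already connected: not counted
    frozenset(("Bob", "Tom")): 2017,  # 7 years later: outside the window
}
promoted = conference_closure_pairs(attendance, first_coauthor_year)
print(promoted)
```

A value of α below 1 could be simulated by sampling each author into the attendee set with probability α before calling the function.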

Influence of Attendance Ratio

There is no obvious regular variation, which indicates that α has little impact on conference closure. We take α as 1 in the following experiments.

Impact of Conference Properties

Conferences with higher field ratings and larger scales promote more research collaborations.

Impact of Involving in Multiple Conferences

Scholars involved in multiple conferences are more likely to meet new collaborators, i.e., active scholars will gain more benefits from the conferences.

Impact of Scientific Productivity

Productive scholars will meet more collaborators during conferences.

Multivariate Analysis of Conference Closure

All the three factors have positive relationships with conference closure.

The Workshop

The System

http://WebOfScholars.org/

Thank you!

Feng Xia
http://fengxia.net
Mobile/WhatsApp: +86-18504228752
WeChat/Skype: TheSmartFeng
Email: [email protected]; [email protected]