Informetric methods seminar


Tutorial 2: Using Matlab for network construction, ranking, clustering, topic modeling, and path finding

Erjia Yan

Contents

Network construction
Ranking
Clustering
Topic modeling
Path finding

Network construction

Bibliographical data

From data to networks

The paper-to-paper citation network is the base.

Web of Science cited reference format: First Author, Year of Publication, Abbreviated Journal Name, Volume Number, Beginning Page Number

Example: AANESTAD M, 2011, J STRATEGIC INF SYST, V20, P161

All of these fields are included in the “full record + cited references” download option.

Web of Science format

Some of the newer records also include a DOI. For better matching, remove the DOI from the cited references.

For citing papers, extract the same fields and format them in the Web of Science cited reference format.

The citing papers and the cited references now share the same format.

Using these two fields, construct an internal citation network that contains only those cited references that are themselves citing papers in the data set.

Citation matching
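The formatting step can be sketched in a few lines. This is a minimal Python sketch, not the seminar's own tooling: the function names `strip_doi` and `citing_key` are hypothetical, and the assumption that a DOI appears as a trailing ", DOI ..." segment of the cited reference string should be checked against your own export.

```python
import re

def strip_doi(cited_ref):
    """Remove a trailing ', DOI ...' segment (assumed layout) and normalize case/spacing."""
    ref = re.sub(r",\s*DOI\s+.*$", "", cited_ref.strip(), flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", ref).upper()

def citing_key(first_author, year, journal_abbrev, volume, first_page):
    """Format a citing paper's fields like a Web of Science cited reference."""
    key = f"{first_author}, {year}, {journal_abbrev}, V{volume}, P{first_page}"
    return re.sub(r"\s+", " ", key.strip()).upper()

# Hypothetical example: the DOI below is a placeholder, not a real identifier.
cited = strip_doi("AANESTAD M, 2011, J STRATEGIC INF SYST, V20, P161, DOI 10.0000/example")
citing = citing_key("Aanestad M", 2011, "J Strategic Inf Syst", 20, 161)
print(cited == citing)  # the two strings can now be compared directly
```

Normalizing both sides to upper case with single spaces is a design choice that makes the string comparison less sensitive to export quirks.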

If you can write an app for this, it would be great!

Otherwise, you can follow these instructions

Procedures

Convert each record of the form

CP1 CR1; CR2; CR3

into one citation pair per line:

CP1 CR1
CP1 CR2
CP1 CR3

Use Access to construct the network:
Have a table for citing papers
Import the converted citation pairs into Access
Use a query to extract those pairs whose papers are in the table
Now you have the node info and link info
Import both into Matlab
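For readers who prefer a script to Access, the conversion and the filtering query can be approximated as follows. This is a hedged Python sketch: `expand_pairs` and `internal_pairs` are hypothetical helper names, and the input is assumed to be (citing paper, semicolon-separated cited references) records already in the shared format.

```python
def expand_pairs(records):
    """Turn (citing, 'CR1; CR2; CR3') records into one (citing, cited) pair per reference."""
    pairs = []
    for citing, refs in records:
        for ref in refs.split(";"):
            ref = ref.strip()
            if ref:  # skip empty fragments, e.g. after a trailing ';'
                pairs.append((citing, ref))
    return pairs

def internal_pairs(pairs, citing_papers):
    """Keep only pairs whose cited reference is itself a citing paper in the data set."""
    keep = set(citing_papers)
    return [(citing, cited) for citing, cited in pairs if cited in keep]

pairs = expand_pairs([("CP1", "CR1; CR2; CR3")])
print(pairs)  # [('CP1', 'CR1'), ('CP1', 'CR2'), ('CP1', 'CR3')]
```

The citing-paper list plays the role of the node table, and the filtered pairs play the role of the link table.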

Now we have a paper-to-paper citation network. To construct, for instance, author-to-author citation or author co-citation networks, we need adjacency matrices.

Adjacency matrices

Rows are papers and columns are authors. A cell value of 1 at (i,j) indicates that paper i is written by author j.

Procedures

Convert each record of the form

ID1 AU1; AU2; AU3

into one paper-author pair per line:

ID1 AU1
ID1 AU2
ID1 AU3

Add the following lines to the beginning of the file:

ID1 ID1
ID2 ID2
… …
IDn IDn

Use Txt2Pajek on the linkage file
Import the edge section of the .net file into Matlab
Select M(1:n,n+1:m), where m is the column size. The selection is our author-paper adjacency matrix
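The same matrix can also be built directly from the paper-author pairs, without going through Txt2Pajek. The following Python sketch is an alternative route under that assumption, not the procedure from the slides; `author_paper_matrix` is a hypothetical name, and rows and columns follow the order of first appearance in the input.

```python
def author_paper_matrix(pairs):
    """Build M with M[i][j] = 1 if paper i is written by author j.

    pairs: iterable of (paper_id, author) tuples.
    Returns (papers, authors, M); row and column order is order of first appearance.
    """
    pairs = list(pairs)
    paper_idx, author_idx = {}, {}
    for paper, author in pairs:
        paper_idx.setdefault(paper, len(paper_idx))
        author_idx.setdefault(author, len(author_idx))
    M = [[0] * len(author_idx) for _ in paper_idx]
    for paper, author in pairs:
        M[paper_idx[paper]][author_idx[author]] = 1
    return list(paper_idx), list(author_idx), M

papers, authors, M = author_paper_matrix(
    [("ID1", "AU1"), ("ID1", "AU2"), ("ID1", "AU3"), ("ID2", "AU2")]
)
print(M)  # [[1, 1, 1], [0, 1, 0]]
```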

Citation and coauthorship

Co-citation and bibliographic coupling

Co-word
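One common way to derive these networks is through matrix products of the paper-author matrix A (papers by authors) and the paper citation matrix C (C[i][j] = 1 if paper i cites paper j). The Python sketch below shows that approach on a toy example; it is an assumption about how such networks are usually computed, not a reproduction of the seminar's Matlab code, and the helper names are hypothetical. A co-word network follows the same pattern as coauthorship, with a paper-by-word matrix in place of A.

```python
def transpose(m):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

# Toy data: P1 is by X and Y, P2 by Y, P3 by X; P1 cites P2 and P3, P2 cites P3.
A = [[1, 1], [0, 1], [1, 0]]           # papers x authors
C = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]  # citing papers x cited papers

coauthorship = matmul(transpose(A), A)                  # authors x authors
author_citation = matmul(matmul(transpose(A), C), A)    # citing author x cited author
cocitation = matmul(transpose(C), C)                    # papers cited together
biblio_coupling = matmul(C, transpose(C))               # papers sharing references

print(coauthorship)     # [[2, 1], [1, 2]]
print(author_citation)  # [[1, 1], [2, 1]]
```

The diagonal of the coauthorship matrix holds each author's paper count, so it is usually ignored or zeroed before analysis.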

Ranking

By David Gleich of Purdue University
http://www.mathworks.com/matlabcentral/fileexchange/11613-pagerank

pagerank(M,options)
options.c: the teleportation coefficient [double | {0.85}]
options.v: the personalization vector [vector | {uniform: 1/n}]

PageRank
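To make the roles of the two options concrete, here is a minimal power-iteration sketch in Python. It is not David Gleich's implementation: the function below is a hypothetical stand-in that only reuses the names c and v, and its treatment of nodes without out-links (their weight is redistributed according to v) is one common convention, stated here as an assumption.

```python
def pagerank(adj, c=0.85, v=None, tol=1e-10, max_iter=1000):
    """Power iteration on adjacency matrix adj, where adj[i][j] > 0 means i links to j.

    c: teleportation coefficient; v: personalization vector (uniform 1/n if None).
    """
    n = len(adj)
    if v is None:
        v = [1.0 / n] * n
    out = [sum(row) for row in adj]
    x = list(v)
    for _ in range(max_iter):
        # Weight sitting on dangling nodes is spread according to v (assumed convention).
        dangling = sum(x[i] for i in range(n) if out[i] == 0)
        new = [(1 - c) * v[j] + c * dangling * v[j] for j in range(n)]
        for i in range(n):
            if out[i]:
                for j in range(n):
                    if adj[i][j]:
                        new[j] += c * x[i] * adj[i][j] / out[i]
        if sum(abs(a - b) for a, b in zip(new, x)) < tol:
            return new
        x = new
    return x

# Nodes 1 and 2 both link to node 0; node 0 links back to node 1.
ranks = pagerank([[0, 1, 0], [1, 0, 0], [1, 0, 0]])
print(ranks)  # node 0 ranks highest, node 2 (no in-links) lowest
```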

Clustering

By MIT Strategic Engineering
http://strategic.mit.edu/downloads.php?page=matlab_networks

[modules,module_hist,Q] = newmangirvan(adj,k)
[groups_hist,Q] = newman_comm_fast(adj)

Modularity-based clustering
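Both functions return a value Q. Assuming Q is the standard Newman-Girvan modularity of an undirected network, it can be computed for any given partition as sketched below in Python. The `modularity` function is a hypothetical illustration of the quantity, not code from the MIT toolbox.

```python
def modularity(adj, community):
    """Modularity Q of a partition of an undirected network.

    adj: symmetric adjacency matrix; community[i]: cluster label of node i.
    Q = (1/2m) * sum over same-cluster pairs (i, j) of (A_ij - k_i * k_j / 2m).
    """
    two_m = sum(sum(row) for row in adj)  # each undirected edge is counted twice
    if two_m == 0:
        return 0.0
    degree = [sum(row) for row in adj]
    n = len(adj)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m

# Two triangles (nodes 0-2 and 3-5) joined by a single edge between nodes 2 and 3.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(modularity(adj, [0, 0, 0, 1, 1, 1]))  # about 0.357 (5/14)
```

Putting all nodes in one cluster gives Q = 0, which is why a clearly positive Q signals more within-cluster links than expected by chance.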

By Nees van Eck and Ludo Waltman of Leiden University
http://www.vosviewer.com/relatedsoftware/

A variant of the modularity-based clustering technique
[X, cluster_size, V] = VOS_clustering(A, P)

VOSviewer clustering

Topic modeling

By Mark Steyvers of University of California, Irvine

http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm

Input: a bag-of-words representation containing the number of times each word occurs in each document.

Matlab Topic Modeling Toolbox
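A bag-of-words representation can be produced from raw text with a short script. The Python sketch below only illustrates the idea of per-document word counts; `bag_of_words` is a hypothetical helper, whitespace tokenization is a simplifying assumption, and the output layout is not claimed to match the toolbox's own input format.

```python
from collections import Counter

def bag_of_words(documents):
    """Count word occurrences per document.

    Returns (vocabulary, counts), where counts[d] maps a vocabulary index
    to the number of times that word occurs in document d.
    """
    vocab = {}
    counts = []
    for text in documents:
        doc_counts = Counter()
        for token in text.lower().split():  # naive whitespace tokenization
            doc_counts[vocab.setdefault(token, len(vocab))] += 1
        counts.append(dict(doc_counts))
    return list(vocab), counts

vocabulary, counts = bag_of_words(["network ranking network", "ranking topic"])
print(vocabulary)  # ['network', 'ranking', 'topic']
print(counts)      # [{0: 2, 1: 1}, {1: 1, 2: 1}]
```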

Path finding

http://www.mathworks.com/help/bioinfo/ref/graphshortestpath.html

[dist, path, pred] = graphshortestpath(G,S,T) finds the shortest path from S to T in graph G

[dist] = graphallshortestpaths(G) finds all shortest paths in graph G; dist is a distance matrix holding the shortest-path length for each pair of nodes

Bioinformatics toolbox
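To show what a single-pair shortest-path query computes, here is a small Dijkstra sketch in Python. It is an illustration only and does not reproduce the Bioinformatics toolbox functions: `shortest_path` is a hypothetical name, the graph is assumed to be a dictionary of non-negative edge weights, and only the distance and the path are returned.

```python
import heapq

def shortest_path(graph, source, target):
    """Dijkstra's algorithm on graph[u] = {v: weight}; weights must be non-negative.

    Returns (distance, path); (inf, []) if target cannot be reached.
    """
    dist = {source: 0.0}
    pred = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        if u == target:
            break
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                pred[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in done:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(pred[path[-1]])
    return dist[target], path[::-1]

graph = {"a": {"b": 1, "c": 4}, "b": {"c": 2}, "c": {}}
print(shortest_path(graph, "a", "c"))  # (3.0, ['a', 'b', 'c'])
```

Calling the function for every pair of nodes yields a distance matrix of the kind described above, although dedicated all-pairs algorithms are more efficient on large networks.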
