6
Academic Community Explorer (ACE) for Syntactic, Semantic and Pragmatic Document Analysis Akansha Bhardwaj ‡§ , Dominique Mercier ‡† , Hisham Hashmi ‡† , Andreas Dengel ‡† , and Sheraz Ahmed Smart Data and Services, DFKI Kaiserslautern, Germany TU Kaiserslautern, Kaiserslautern, Germany § eXascale Infolab, University of Fribourg, Switzerland f irstname.lastname@dfki.de, § akbwaj @exascale.inf o Abstract—This paper presents a novel Academic Community Explorer (ACE) which performs syntactic, semantic and prag- matic document analysis of scientific publications. Firstly, ACE uses syntactic structure to extract relevant information from a scientific document. Secondly, semantic analysis is performed to derive an article based co-authorship and citation network. Finally, ACE uses these document based networks to build a com- plete community network for pragmatic analysis. Furthermore, scientometric analysis is performed to extract the pragmatics by analyzing authors and publication community networks through micro and macro indicators. Two novel micro indicators Senti- Index, reflecting the sentiment present in citations and, Overlap index, reflecting community behavior have been introduced. This is a step in the direction of automatic qualitative assess- ment of scientific documents. In addition, ACE provides a rich visualization interface which helps in exploratory analysis of the community to identify hidden patterns, e.g, isolated small groups in the community which collaborate and cite each other frequently. A feasibility study is performed on the corpus of ICDAR publications from 1993-2015 to show the insights and benefits of the ACE framework. The results reveals that ICDAR is a highly collaborative community which has most likely arrived at its ‘phase transition’ stage with 70% of the community closely connected to each other. I. I NTRODUCTION In recent years, the field of document analysis has advanced tremendously from the use of handcrafted features for layout and textual analysis [1]–[3] to deep learning [4]. Though document understanding has several applications for business and academic purposes, the application of pragmatic analysis of documents for community analysis has been overlooked. The aim of community analysis is to explicitly focus on the quantitative and qualitative assessment of a community. This is quite important for scientific communities, as numerous scientific documents are published every year. In this context, citation analysis and quantitative metrics have been drawing interest from academia lately, attempting to replace traditional productivity indicators like h-index [5]. Although there are already some approaches available for analyzing communities by visualizing co-citation and bibliographic coupling networks [6]–[8], all of these approaches rely on meta-data of the publications, i.e., structured information like BibTex instead of the publication itself. This paper presents a novel framework (ACE) to per- form syntactic, semantic and pragmatic analysis of scientific publications/documents with an explicit aim of community analysis. Data which results this analysis can convey a lot of information about scholarly collaborations, communication, networks of scholars, and thematic trends. With the help of semantic and pragmatic analysis of ACE, different important metrics (macro and micro indicators) are computed, which help in understanding and analyzing the community from different perspectives. In addition, two novel metrics related to community behavior (Overlap Index) and sentiment of the publication (Senti-Index) are also introduced in the paper. Along with these novel features, ACE is generic and applicable to scientific publications from any community and in any format e.g., IEEE, ACM, Springer. A. Related Works This sections provides an overview of different approaches available for syntactic, semantic and pragmatic analysis of documents. Syntactic document analysis includes converting raw docu- ments to structured representation using information extrac- tion approaches. [9], PDF box 1 , pdftotxt 2 are some of the approaches to convert PDF documents into structured text. This structured representation includes title, header, keyword, abstract, and references tags as done in [10]. The next step is reference segmentation which is a challenging task owing to varying citation formats. The methods of reference seg- mentation can be broadly put into four major categories of template matching methods [11], [12], supervised machine learning based approaches [13]–[16], unsupervised classifica- tion approaches [17], [18], and web based look up approach [19]. Targeted data extraction has also been achieved using Named Entity Recognition (NER) approaches [20]. Semantic document analysis approaches make the data more meaningful. This includes data cleaning and transformation approaches like name resolution. 1 https://pdfbox.apache.org/ 2 https://linux.die.net/man/1/pdftotext

Academic Community Explorer (ACE) for Syntactic, Semantic ... · Academic Community Explorer (ACE) for Syntactic, Semantic and Pragmatic Document Analysis Akansha Bhardwajzx, Dominique

  • Upload
    others

  • View
    13

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Academic Community Explorer (ACE) for Syntactic, Semantic ... · Academic Community Explorer (ACE) for Syntactic, Semantic and Pragmatic Document Analysis Akansha Bhardwajzx, Dominique

Academic Community Explorer (ACE)for Syntactic, Semantic and Pragmatic

Document AnalysisAkansha Bhardwaj‡§, Dominique Mercier‡†, Hisham Hashmi‡†, Andreas Dengel‡†, and Sheraz Ahmed‡

‡Smart Data and Services, DFKI Kaiserslautern, Germany†TU Kaiserslautern, Kaiserslautern, Germany

§eXascale Infolab, University of Fribourg, Switzerland‡[email protected], §[email protected]

Abstract—This paper presents a novel Academic CommunityExplorer (ACE) which performs syntactic, semantic and prag-matic document analysis of scientific publications. Firstly, ACEuses syntactic structure to extract relevant information from ascientific document. Secondly, semantic analysis is performedto derive an article based co-authorship and citation network.Finally, ACE uses these document based networks to build a com-plete community network for pragmatic analysis. Furthermore,scientometric analysis is performed to extract the pragmatics byanalyzing authors and publication community networks throughmicro and macro indicators. Two novel micro indicators Senti-Index, reflecting the sentiment present in citations and, Overlapindex, reflecting community behavior have been introduced.This is a step in the direction of automatic qualitative assess-ment of scientific documents. In addition, ACE provides a richvisualization interface which helps in exploratory analysis ofthe community to identify hidden patterns, e.g, isolated smallgroups in the community which collaborate and cite each otherfrequently. A feasibility study is performed on the corpus ofICDAR publications from 1993-2015 to show the insights andbenefits of the ACE framework. The results reveals that ICDARis a highly collaborative community which has most likely arrivedat its ‘phase transition’ stage with 70% of the community closelyconnected to each other.

I. INTRODUCTION

In recent years, the field of document analysis has advancedtremendously from the use of handcrafted features for layoutand textual analysis [1]–[3] to deep learning [4]. Thoughdocument understanding has several applications for businessand academic purposes, the application of pragmatic analysisof documents for community analysis has been overlooked.

The aim of community analysis is to explicitly focus onthe quantitative and qualitative assessment of a community.This is quite important for scientific communities, as numerousscientific documents are published every year. In this context,citation analysis and quantitative metrics have been drawinginterest from academia lately, attempting to replace traditionalproductivity indicators like h-index [5]. Although there arealready some approaches available for analyzing communitiesby visualizing co-citation and bibliographic coupling networks[6]–[8], all of these approaches rely on meta-data of thepublications, i.e., structured information like BibTex insteadof the publication itself.

This paper presents a novel framework (ACE) to per-form syntactic, semantic and pragmatic analysis of scientificpublications/documents with an explicit aim of communityanalysis. Data which results this analysis can convey a lotof information about scholarly collaborations, communication,networks of scholars, and thematic trends. With the help ofsemantic and pragmatic analysis of ACE, different importantmetrics (macro and micro indicators) are computed, whichhelp in understanding and analyzing the community fromdifferent perspectives. In addition, two novel metrics relatedto community behavior (Overlap Index) and sentiment of thepublication (Senti-Index) are also introduced in the paper.Along with these novel features, ACE is generic and applicableto scientific publications from any community and in anyformat e.g., IEEE, ACM, Springer.

A. Related Works

This sections provides an overview of different approachesavailable for syntactic, semantic and pragmatic analysis ofdocuments.

Syntactic document analysis includes converting raw docu-ments to structured representation using information extrac-tion approaches. [9], PDF box1, pdftotxt2 are some of theapproaches to convert PDF documents into structured text.This structured representation includes title, header, keyword,abstract, and references tags as done in [10]. The next stepis reference segmentation which is a challenging task owingto varying citation formats. The methods of reference seg-mentation can be broadly put into four major categories oftemplate matching methods [11], [12], supervised machinelearning based approaches [13]–[16], unsupervised classifica-tion approaches [17], [18], and web based look up approach[19]. Targeted data extraction has also been achieved usingNamed Entity Recognition (NER) approaches [20].

Semantic document analysis approaches make the data moremeaningful. This includes data cleaning and transformationapproaches like name resolution.

1https://pdfbox.apache.org/2https://linux.die.net/man/1/pdftotext

Page 2: Academic Community Explorer (ACE) for Syntactic, Semantic ... · Academic Community Explorer (ACE) for Syntactic, Semantic and Pragmatic Document Analysis Akansha Bhardwajzx, Dominique

Fig. 1. Overview of the presented ACE framework.

Several approaches have been adopted in the past for thepurpose of pragmatic analysis of a scientific community. [6]–[8] presented some of these existing approaches for analyzingand visualizing co-citation and bibliographic coupling net-works. Some works focused on the network topology of co-authorship network of biology, physics, and smaller commu-nities [21]–[23]. Centrality indicators [24] have been studiedas a crucial determinant of the position of influence of authorsin co-authorship networks. It is relevant to mention that all ofthese approaches rely on the meta-data of documents. None ofthe existing approaches use document analysis for meta-dataextraction. Also, to the best of our knowledge, there has beenno study on author citation networks which has been presentedin this work.

II. ACE: THE PRESENTED APPROACH

Figure 1 provides an overview of the presented ACE frame-work, which is divided into three stages, syntactic analysis,semantic analysis and pragmatic analysis. ACE starts withsyntactic analysis of document to extract important informa-tion from scientific publications. This extracted informationis then used by semantic analysis to further add meaning toit. Both syntactic and semantic analysis build a baseline forpragmatic analysis, where all of the extracted information isevaluated in a global and local context of the community. Themain emphasis of ACE framework is on pragmatics analysisbased on author collaboration and citation networks derivedfrom scientific publications.

A. Syntactic Document Analysis

ACE uses syntactic structure of publications to extractimportant information from the documents, which serves asa backbone for higher-level semantic and pragmatic analysis.An important purpose of syntactic analysis is to convert

unstructured data of scientific articles to structured data. Thisstep starts with the conversion of unstructured PDF data totext [9] followed by meta-data extraction using syntacticalinformation. The extracted meta-data includes title, header,keywords, abstract and references. Once the meta-data isextracted, the next step is the identification of author namesoccurring in header and references tags from structured dataextracted above using NER. Furthermore, all the sentences intext containing citation/reference to any publication are alsoextracted. This is very important to compute the sentiment ofcitation/reference.

B. Semantic Document Analysis

Semantic analysis is performed on the structured data (ex-tracted through syntactic analysis) to add meaning to it. Toconvert this structured data to meaningful information, the firstthing is to build a co-authorship and author citation networkfor each publication/document. These publication/documentbased networks are combined further to form a communitynetwork. Furthermore, data cleaning is performed by removingauthors who never published in this conference. Another issuethat plagues such a data often is of name referencing. This isdue to variation in style for citation of an author. All thesestyles should accurately reference a common entity using theinformation of first name, middle name and last name providedin ’B-PER’, ’I-PER’ and ’E-PER’ tags in Senna library [25].

In contrast to existing approaches, semantic analysis in ACEnot only focuses on various networks but also emphasizes onanother unique aspect i.e., sentiment analysis of citations. Heresentiment of each citation in the publication is computed basedon the content of the corresponding sentence containing thecitation [26].

C. Pragmatic Analysis

The purpose of pragmatic analysis is to analyze the publica-tions in the context of whole community. To do so, first, clus-tering is performed on the community co-authorship network,followed by, calculation of various performance indicators forco-authorship and author-citation networks, which gives aninsight of authors, publications, and whole community throughdifferent perspectives.

1) Clustering co-authorship and author-citation network:It is important to perform clustering on the community net-work to find important structures and patterns in it. In thiscontext, it specifically refers to finding groups of authorswho collaborate and cite each other frequently. Authors inco-authorship network are clustered using Girvan-Newmanclustering approach [27], which is a hierarchical clustering ap-proach. Here, edges are removed through an iterative processbased on high betweenness centrality. The level of clusteringwas identified through an empirical process. Authors in author-citation network have been clustered using a graph partitioningbased on edge strength.

2) Performance indicators: There are two kinds of perfor-mance indicators that help in analyzing a community from dif-ferent dimensions, macro indicators and micro indicators. In

Page 3: Academic Community Explorer (ACE) for Syntactic, Semantic ... · Academic Community Explorer (ACE) for Syntactic, Semantic and Pragmatic Document Analysis Akansha Bhardwajzx, Dominique

this work, we have explored macro indicators on co-authorshipnetworks and author citation networks, while micro indicatorshave been calculated for both these networks separately.

Macro indicators in co-authorship network: Macro in-dicators are a series of characteristics that focus on networktopology.

• Statistical summary• Study of evolving co-authorship network• Region based co-authorship analysis• Publication vs. citation analysis

Micro indicators in co-authorship networks: The fieldof network analysis draws heavily on graphical imagery toreveal the display and patterns of links occurring within thenetwork and uses mathematical and computational modelsto describe and explain those patterns. These patterns andindicators are micro-indicators which represent qualitative datasuch as the power, stratification, ranking, and inequality insocial structures.

The co-authorship network is a special kind of socialnetwork where authors are represented as nodes and a co-authorship is denoted by the presence of an edge between twonodes. Vertex specific measures that have been explored inthis work for co-authorship networks include degree centrality,closeness centrality and betweenness centrality. A high degreecentrality denotes the existence of authors who collaboratevery often with many other authors. These are prolific writersof a community. A high betweenness centrality is used todenote the authors who act as bridges between small sub-groups in a community and thence, help to bring the completecommunity together. These positions are generally occupiedwith head of research groups. A high closeness centralitydenotes diversity of an author’s domain. These authors canspread research ideas quickly.

Micro indicators in author citation networks: Authorcitation networks are networks, where, authors are nodes anda directed edge exists between A and B if ‘A cites B’ in oneof its works. A basic property of an author citation network isthat, when nodes are arranged according to degree centralities,a big node denotes the dominating person in a community. Tothe best of our knowledge, author citation networks are beingexplored in this work for the first time.

Few vertex specific measures explored in this work includedegree centrality, indegree centrality, outdegree centrality, be-tweenness centrality, and eigenvector centrality. While a highindegree centrality denotes the authors who have been citedmost, a high outdegree centrality denotes the authors who citeothers most. Degree centrality is a combined metric for theabove two centrality measures. Betweenness centrality is usedto denote authors who diversely publish and communicate withothers in community. These authors cite others and are cited byin a balanced way from the community. Eigenvector centralityis used to identify the authors who are most likely to receivefirst new research ideas.

In addition to the above-mentioned matrices, this paper alsopresents two novel metrics, i.e., Overlap-Index and SentiIndex.

a) Overlap-Index: An overlap index is a new metricwhich is being introduced in this work to study an author’sdiversity of influence in a community. It is a cumulativegraph to glance over an author’s relation with others in thecommunity with a quantitative count of collaboration, citationand references. For an author, an overlap index shows aquantitative overlap with other authors who share a ’collabo-ration’,’cited by’,’cites’ relation with the former. In the ACEvisualization interface, we have focused on top-10 authorsfrom each of these three categories. If the count of authorsa1,..,an on x-axis is more in an overlap index for author a0 inconsideration, it denotes that the author a0 has a very diversegroup with whom he/she collaborates. The count on y-axisdenotes the number of times the ’collaboration’, ’cited by’,’cites’ relationship exists.

b) Senti-Index: Another novel metric introduced in thispaper is Senti-Index [26], which expresses the total number ofpositive, negative, and neutral citations which an author hasreceived in each of their articles. This is a step in the directionof automated qualitative assessment of scientific documents.Senti-index can be computed both for individual publicationsas well as for authors.

III. ANALYSIS OF ICDAR COMMUNITY WITH ACE

To show the effectiveness of the presented framework, wepresent an evaluation of the ICDAR community using theACE framework. This analysis is based on all publicationsin ICDAR from 1993-2015. ICDAR started in 1993 as acommunity on document analysis and recognition and ispresently one of the main conferences in this field. At present,it has around 3500 authors participating from approximately55 different countries.

A. Macro indicators of ICDAR community

1) Statistical summary of ICDAR: Table I provides a sta-tistical summary of ICDAR co-authorship network. There are3636 authors in this network, in which an average authorwrites 6.01 papers and collaborates with 4.75 authors. Thiscommunity has a very high clustering coefficient of 0.7 whichmeans that, there is a 70% chance of two authors being co-author if they have a mutual co-author. These numbers arerelatively higher as compared to the LIS co-authorship network[22] and similar to the co-authorship networks of biologyand physics constructed by Newman [23]. This number showsthat ICDAR is a community where authors collaborate morefrequently and widely as in the field of Biology and Physics.It is important to mention here that these results have beencalculated on the PDF documents from ICDAR conferencewhich were encoded with proper glyph to character mapping.

2) Evolution of ICDAR co-authorship network: Table IIshows the evolution of ICDAR co-authorship network from1993 to 2015. On an average, each author has more col-laborators from 1993-1997 period to 1993-2015 period. Thisindicates that authors have collaborated more widely in recentyears.

Page 4: Academic Community Explorer (ACE) for Syntactic, Semantic ... · Academic Community Explorer (ACE) for Syntactic, Semantic and Pragmatic Document Analysis Akansha Bhardwajzx, Dominique

1995 2000 2005 2010 2015year

0

10

20

30

40

50

60

num

ber o

f pap

ers

JapanUSAGermanyCanada

(a)

1995 2000 2005 2010 2015year

10

0

10

20

30

40

50

num

ber o

f pap

ers

ChinaSingaporeIndiaGreece

(b)

1995 2000 2005 2010 2015year

102

103

104

105

publ

icat

ion

vs to

tal c

itatio

n

publicationscitations

(c)Fig. 2. (a) Region based authorship analysis: decreasing trend, (b)Region based authorship analysis: increasing trend (c)Publication vs. citation (shown inblue)

TABLE ISUMMARY OF STATISTICS OF ICDAR CO-AUTHORSHIP NETWORK

Values ICDAR LIS[22] Physics[23]

Number of papers 2751 10,344 98,502Number of authors 3636 10,579 52,909Paper per author 6.01 2.40 5.1Author per paper - 1.80 -Largest component 70.84% 20.77% 85%Clustering coefficient 0.71 0.58 0.43Pearson Clustering coefficient 0.12 NA 0.36Average collaborators 4.75 2.24 9.7Average distance 5.58 9.68 5.9

The value of the largest component reveals that after 1999,mean distance has decreased and the ratio of the giant com-ponent with respect to the whole community has increasedfrom 17% in 1997 to 70% by 2015. This is similar to thework done by Barabasi [28] and suggests that ICDAR is ahighly collaborative community and has probably arrived atits ‘phase transition’ stage where authors collaborate morefrequently and widely with each other.

3) Region based authorship analysis: Since 1993, ICDARcommunity has received participation from around 55 coun-tries. It is interesting to observe that countries like Japan,USA, Germany and Canada which have been very prolificparticipants in ICDAR conference in 1993-2003 make lessercontributions at present. From 2005 onwards, a peculiar trendhas been observed where countries including China, India,Singapore have started participating more actively. This trend

TABLE IIICDAR EVOLVING CO-AUTHORSHIP NETWORK

Year Authors Papers Averagecoauthors

Largest Component

Size Ratio (%) Meandistance

1997 1088 634 3.23 189 17.37 5.322003 1859 1220 3.80 857 46.10 6.702009 2786 1962 4.12 1682 60.37 6.652015 3636 2751 4.75 2576 70.84 5.58

can be easily observed in Figure 2a, 2b. France has maintainedits consistency and has contributed maximum publications tothe community.

4) Publication vs. citation: A comparison of the numberof papers and the increasing citation count has been presentedin Figure 2c. It is observed that the citation tendency hasincreased exponentially after 2005. Though the total numberof publications remain almost same, there is an 150% increasein total number of citations from 2005 to 2015.

B. Micro indicators of ICDAR community

This section provides analysis of different micro indicators(Section II-C2) in context of ICDAR community. This in-cludes, co-author network, author citation network, word cloudand a novel measure called Overlap index and Senti-index ofauthors and publications.

1) Co-authorship network: ICDAR has a highly collabora-tive community structure where 70% of the authors are a partof the largest component.

(a) (b)Fig. 3. (a) Academic Community Explorer(ACE) Interface, (b) Zoomed in view of ACE interface

Page 5: Academic Community Explorer (ACE) for Syntactic, Semantic ... · Academic Community Explorer (ACE) for Syntactic, Semantic and Pragmatic Document Analysis Akansha Bhardwajzx, Dominique

TABLE IIINORMALIZED CENTRALITY VALUES FOR PROMINENT CONTRIBUTORS IN ICDAR

Author Coauthorship networks Author citation networks

Closeness Betweenness Degree Betweenness Eigenvector Outdegree Indegree

U. Pal 1.0 1.0 0.85 0.39 0.18 0.70 0.42C. Liu 0.93 0.64 1.0 0.65 1.0 1.0 0.80M. Liwicki 0.94 0.51 0.72 0.32 0.12 0.59 0.34C. Tan 0.93 0.30 0.56 0.47 0.11 0.57 0.47S. Uchida 0.92 0.51 0.74 0.24 0.11 0.44 0.17C. Suen 0.87 0.52 0.74 0.85 0.08 0.38 0.70D. Karatzas 0.92 0.29 0.37 0.14 0.07 0.31 0.19K. Kise 0.85 0.17 0.45 0.15 0.04 0.22 0.20A. Dengel 0.79 0.14 0.49 0.73 0.07 0.39 0.33J. Llados 0.92 0.29 0.55 0.17 0.08 0.41 0.20S. Srihari 0.83 0.27 0.32 0.38 0.01 0.17 0.69M. Nakagawa 0.86 0.23 0.66 0.20 0.42 0.42 0.30D. Doermann 0.88 0.21 0.28 0.65 0.07 0.35 0.57C. Jawahar 0.65 0.21 0.18 0.12 0.06 0.27 0.15D. Lopresti 0.78 0.06 0.18 0.24 0.03 0.20 0.22M. Blumestein 0.88 0.14 0.26 0.06 0.05 0.27 0.11V. Govindaraju 0.79 0.19 0.46 0.49 0.09 0.37 0.42M. Cheriet 0.84 0.26 0.41 0.36 0.09 0.37 0.26M. Iwamura 0.87 0.10 0.44 0.04 0.06 0.22 0.14R. Ingold 0.84 0.22 0.46 0.22 0.06 0.36 0.16

Table III is a table of few prominent authors in ICDAR co-authorship community with respect to closeness, betweenness,and degree centrality. The values have been normalized in therange of 0-1.

2) Author citation networks: Author co-citation networkshave been studied in the past and have led to conclude thepresence of few major clusters or group of researchers [29].The results from ICDAR also reflected the same presence ofclusters and can be seen in Figure 3a. Nodes are coloredbased on the information from co-authorship network wheresame color represents strong co-authorship. There is a strongtendency of authors to cite their co-authors. This can beexplained by the fact that generally authors choose to workclosely with their group and their previous works.

The author who is cited most often in ICDAR author-citation network is C. Liu, followed by H. Bunke and B.Gatos. Head of groups generally occupy positions where theycan be seen forming a bridge between different unconnectedcommunities and holding the complete community together.H. Bunke has the highest betweenness centrality in author-citation network.

It is also possible to explore an author’s relation to anotherrepresented as edge as shown in Figure 4b. Along with this,in Table III we present a list of prominent authors in authorcitation network with respect to betweenness centrality, eigen-vector centrality, outdegree centrality and indegree centrality.In addition, ACE visualization interface offers the functionalityof displaying an author’s centrality, along with the sentiment ofcitations received by an author and the list of papers publishedas shown in Figure 4a.

3) Topic Cloud: Topic cloud can offer an insight into anauthor’s domain at a glance as shown in Figure 4c. Theycan be used to study thematic trends in a community whenrepresented as a time series.

4) Overlap index: An example of overlap index graph forU. Pal is shown in Figure 5 where his quantitative relation withrespect to other authors in terms of ‘coauthorship’ , ‘cited by’and ‘cites’ is shown.

5) Senti-Index: Senti-index in ICDAR network offers thefunctionality to hover over a node and view an author’spositive, negative or neutral citations as shown in Figure 4a.

(a) (b) (c)Fig. 4. (a) Node information with Senti-Index (b) Edge information with Senti-Index (c) Topic cloud for author A. Dengel

Page 6: Academic Community Explorer (ACE) for Syntactic, Semantic ... · Academic Community Explorer (ACE) for Syntactic, Semantic and Pragmatic Document Analysis Akansha Bhardwajzx, Dominique

Fig. 5. Overlap index for author U. Pal

IV. CONCLUSION AND FUTURE WORK

In this work, we have presented a framework for extractingand analyzing scholarly document meta-data for the purposeof studying a scientific community. This framework has beentested on ICDAR community from 1993-2015 and the vi-sualization interface is live at http://www.dfki.uni-kl.de/ace/.ICDAR has the identifying characteristics of a highly collabo-rative scientific venue with 70% of authors being a part of thelargest component. The mean distance between co-authors hasconsiderably reduced since 1993 and the increasing ratio of thelargest component indicates that ICDAR has probably arrivedat the stage of phase transition. It has also been observedthat countries like China and India have recently joinedthe community and become prolific participants. In author-citation networks, there is a clear presence of communitiesthat collaborate and cite each other very often.

The results presented in this work are preliminary and needto be interpreted with great caution which is not possible inthe narrow focus of this study. In addition to the approacheswhich have been presented in this work, it is important toinvolve additional qualitative component using interviews withsome of the key members to help better understand the rela-tionships within and between communities. In its final stage,this information can be structured and used as a recommendersystem for the community members to get an overall pictureof their interaction to improve and reflect a better communitybehavior.

REFERENCES

[1] S. Klink, A. Dengel, and T. Kieninger, “Document structure analysisbased on layout and textual features,” in Proc. of International Workshopon Document Analysis Systems, DAS2000, 2000, pp. 99–111.

[2] G. Nagy, “Twenty years of document image analysis in pami,” IEEETransactions on Pattern Analysis and Machine Intelligence, vol. 22,no. 1, pp. 38–62, 2000.

[3] L. O’Gorman and R. Kasturi, Document image analysis. IEEEComputer Society Press Los Alamitos, 1995, vol. 39.

[4] A. W. Harley, A. Ufkes, and K. G. Derpanis, “Evaluation of deepconvolutional nets for document image classification and retrieval,” inDocument Analysis and Recognition (ICDAR), 2015 13th InternationalConference on. IEEE, 2015, pp. 991–995.

[5] J. E. Hirsch, “An index to quantify an individuals scientific researchoutput that takes into account the effect of multiple coauthorship,”Scientometrics, vol. 85, no. 3, pp. 741–754, 2010.

[6] N. J. van Eck and L. Waltman, “Visualizing bibliometric networks,” inMeasuring scholarly impact. Springer, 2014, pp. 285–320.

[7] M. J. Cobo, A. G. Lopez-Herrera, E. Herrera-Viedma, and F. Herrera,“Science mapping software tools: Review, analysis, and cooperativestudy among tools,” Journal of the Association for Information Scienceand Technology, vol. 62, no. 7, pp. 1382–1402, 2011.

[8] H. Small, “Visualizing science by citation mapping,” Journal of theAssociation for Information Science and Technology, vol. 50, no. 9, p.799, 1999.

[9] J. Beel, S. Langer, M. Genzmehr, and C. Muller, “Docear’s pdf inspector:title extraction from pdf files,” in Proceedings of the 13th ACM/IEEE-CSjoint conference on Digital libraries. ACM, 2013, pp. 443–444.

[10] D. Bergmark and C. Lagoze, “An architecture for automatic referencelinking,” Research and Advanced Technology for Digital Libraries, pp.115–126, 2001.

[11] I.-A. Huang, J.-M. Ho, H.-Y. Kao, and W.-C. Lin, “Extracting citationmetadata from online publication lists using blast,” Advances in Knowl-edge Discovery and Data Mining, pp. 539–548, 2004.

[12] G. Sautter and K. Bohm, “Improved bibliographic reference parsingbased on repeated patterns,” International Journal on Digital Libraries,vol. 14, no. 1-2, pp. 59–80, 2014.

[13] H. Han, C. L. Giles, E. Manavoglu, H. Zha, Z. Zhang, and E. A.Fox, “Automatic document metadata extraction using support vector ma-chines,” in Digital Libraries, 2003. Proceedings. 2003 Joint Conferenceon. IEEE, 2003, pp. 37–48.

[14] F. Peng and A. McCallum, “Information extraction from research papersusing conditional random fields,” Information processing & manage-ment, vol. 42, no. 4, pp. 963–979, 2006.

[15] I. G. Councill, C. L. Giles, and M.-Y. Kan, “Parscit: an open-source crfreference string parsing package.” in LREC, vol. 2008, 2008.

[16] K. Seymore, A. McCallum, and R. Rosenfeld, “Learning hidden markovmodel structure for information extraction,” in AAAI-99 workshop onmachine learning for information extraction, 1999, pp. 37–42.

[17] E. Cortez, A. S. da Silva, M. A. Goncalves, F. Mesquita, and E. S.de Moura, “Flux-cim: flexible unsupervised extraction of citation meta-data,” in Proceedings of the 7th ACM/IEEE-CS joint conference onDigital libraries. ACM, 2007, pp. 215–224.

[18] D. Besagni, A. Belaıd, and N. Benet, “A segmentation method forbibliographic references by contextual tagging of fields,” in DocumentAnalysis and Recognition, 2003. Proceedings. Seventh InternationalConference on. IEEE, 2003, pp. 384–388.

[19] D. Huynh and W. Hua, “Self-supervised learning approach for extractingcitation information on the web,” Web Technologies and Applications,pp. 719–726, 2012.

[20] B. Powley and R. Dale, “High accuracy citation extraction and namedentity recognition for a heterogeneous corpus of academic papers,” inNatural Language Processing and Knowledge Engineering, 2007. NLP-KE 2007. International Conference on. IEEE, 2007, pp. 119–124.

[21] P. Zervas, A. Tsitmidelli, D. G. Sampson, N.-S. Chen et al., “Studyingresearch collaboration patterns via co-authorship analysis in the field oftel: the case of educational technology & society journal,” Journal ofEducational Technology & Society, vol. 17, no. 4, p. 1, 2014.

[22] E. Yan and Y. Ding, “Applying centrality measures to impact analy-sis: A coauthorship network analysis,” Journal of the Association forInformation Science and Technology, vol. 60, no. 10, pp. 2107–2118,2009.

[23] M. E. Newman, “Coauthorship networks and patterns of scientificcollaboration,” Proceedings of the national academy of sciences, vol.101, no. suppl 1, pp. 5200–5205, 2004.

[24] A. Bavelas, “Communication patterns in task-oriented groups,” TheJournal of the Acoustical Society of America, vol. 22, no. 6, pp. 725–730, 1950.

[25] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, andP. Kuksa, “Natural language processing (almost) from scratch,” Journalof Machine Learning Research, vol. 12, no. Aug, pp. 2493–2537, 2011.

[26] “Senticite - an approach for publication sentiment analysis, undersubmission.”

[27] M. Girvan and M. E. Newman, “Community structure in social andbiological networks,” Proceedings of the national academy of sciences,vol. 99, no. 12, pp. 7821–7826, 2002.

[28] A.-L. Barabasi, “Linked: How everything is connected to everything elseand what it means,” Plume Editors, 2002.

[29] M. Tight, “Higher education research as tribe, territory and/or com-munity: A co-citation analysis,” Higher Education, vol. 55, no. 5, pp.593–605, 2008.