
Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology


Page 1: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

Eleni Afiontzi,1 Giannis Kazadeis,1 Leonidas Papachristopoulos,2 Michalis Sfakakis,2 Giannis Tsakonas,2 Christos Papatheodorou2

13th ACM/IEEE Joint Conference on Digital Libraries, July 22-26, Indianapolis, IN, USA

1. Department of Informatics, Athens University of Economics & Business
2. Database & Information Systems Group, Department of Archives & Library Science, Ionian University

Page 2: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Pages 3-10: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

aim & scope of research

• To propose a methodology for discovering patterns in the scientific literature.

• Our case study is performed in the digital library evaluation domain and its conference literature.

• We question:
  - how we select relevant studies,
  - how we annotate them,
  - how we discover these patterns,
  in an effective, machine-operated way, in order to have reusable and interpretable data?

Page 11: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Pages 12-16: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

why

• Abundance of scientific information
• Limitations of existing tools, such as reusability
• Lack of contextualized analytic tools
• Supervised automated processes

Page 17: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Pages 18-26: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

panorama

1. Document classification to identify relevant papers
   - We use a corpus of 1,824 papers from the JCDL and ECDL (now TPDL) conferences, covering the period 2001-2011.
2. Semantic annotation processes to mark up important concepts
   - We use a schema for semantic annotation, the Digital Library Evaluation Ontology (DiLEO), and a semantic annotation tool, GoNTogle.
3. Clustering to form coherent groups (K=11)
4. Interpretation with the assistance of the ontology schema

• During this process we perform benchmarking tests to qualify specific components to effectively automate the exploration of the literature and the discovery of research patterns.

Page 27: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Page 28: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

part 1

how we identify relevant studies

Page 29: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Pages 30-40: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

training phase

• The aim was to train a classifier to identify relevant papers.
• Categorization
  - two researchers categorized, a third one supervised
  - descriptors: title, abstract & author keywords
  - raters' agreement: 82.96% for JCDL, 78% for ECDL
  - inter-rater agreement: moderate levels of Cohen's Kappa
  - 12% positive vs. 88% negative
• Skewness of the data addressed via resampling (see sketch below):
  - under-sampling (Tomek Links)
  - over-sampling (random over-sampling)
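As an illustration of this resampling step, here is a minimal sketch using scikit-learn and imbalanced-learn; the toy documents and labels are placeholders standing in for the categorized JCDL/ECDL descriptors, not the authors' actual pipeline.

```python
# Sketch: addressing the 12%/88% class skew with Tomek-Links under-sampling
# followed by random over-sampling, then training a Naive Bayes relevance
# classifier. Assumes scikit-learn + imbalanced-learn; toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from imblearn.under_sampling import TomekLinks
from imblearn.over_sampling import RandomOverSampler

docs = [
    "usability evaluation of a digital library interface",      # relevant
    "log analysis to assess digital library effectiveness",     # relevant
    "ocr pipeline for historical newspaper digitisation",
    "metadata harvesting with oai-pmh",
    "scalable storage architecture for repositories",
    "automatic subject indexing of web archives",
]
y = [1, 1, 0, 0, 0, 0]  # 1 = evaluation paper (minority class), 0 = other

vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # title/abstract/keyword descriptors -> tf-idf

X_u, y_u = TomekLinks().fit_resample(X, y)                            # drop majority points in Tomek links
X_b, y_b = RandomOverSampler(random_state=0).fit_resample(X_u, y_u)   # replicate minority points

clf = MultinomialNB().fit(X_b, y_b)
print(clf.predict(vec.transform(["user satisfaction study of an institutional repository"])))
```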

Page 41: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Pages 42-50: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

corpus definition

• Classification algorithm: Naïve Bayes
• Two sub-sets: a development set (75%) and a test set (25%)
• Ten-fold validation: the development set was randomly divided into 10 equal parts; 9/10 served as the training set and 1/10 as the test set (sketch below).

[Figure: ROC curves (tp rate vs. fp rate) for the Development and Test sets]
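A minimal sketch of this split-and-validate setup with scikit-learn; the feature matrix and labels below are random placeholders standing in for the tf-idf descriptors and the relevance judgements.

```python
# Sketch: 75%/25% development/test split, ten-fold validation of a Naive Bayes
# classifier on the development set, and ROC points (tp rate vs. fp rate) on the
# held-out test set. X and y are placeholders, not the real corpus.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
X = rng.random((200, 50))                   # placeholder feature matrix
y = np.zeros(200, dtype=int)
y[:24] = 1                                  # ~12% positive, mirroring the skew
rng.shuffle(y)

X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print("10-fold ROC AUC (development):",
      cross_val_score(MultinomialNB(), X_dev, y_dev, cv=cv, scoring="roc_auc").mean())

clf = MultinomialNB().fit(X_dev, y_dev)
fpr, tpr, _ = roc_curve(y_test, clf.predict_proba(X_test)[:, 1])   # tp rate vs. fp rate
print(list(zip(fpr.round(2), tpr.round(2)))[:5])
```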

Page 51: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Page 52: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

part 2

how we annotate

Pages 53-57: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

the schema - DiLEO

• DiLEO aims to conceptualize the DL evaluation domain by exploring its key entities, their attributes and their relationships.
• A two-layered ontology:
  - Strategic level: consists of a set of classes related to the scope and aim of an evaluation.
  - Procedural level: consists of classes dealing with practical issues.

Page 58: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Pages 59-64: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

the instrument - GoNTogle

• We used GoNTogle to generate an RDFS knowledge base (sketch below).
• GoNTogle uses the weighted k-NN algorithm to support either manual or automated ontology-based annotation.
• http://bit.ly/12nlryh
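For concreteness, a minimal sketch (using rdflib) of what one annotation statement in such an RDFS knowledge base might look like; the namespaces, property names and document URI are hypothetical placeholders, not GoNTogle's or DiLEO's actual vocabulary.

```python
# Sketch: recording one ontology-based annotation as RDF triples.
# Assumes rdflib; the 'dileo' and 'ann' namespaces, class and property names
# below are hypothetical, for illustration only.
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, RDFS

DILEO = Namespace("http://example.org/dileo#")      # placeholder namespace
ANN = Namespace("http://example.org/annotation#")   # placeholder namespace

g = Graph()
doc = URIRef("http://example.org/papers/jcdl2005_042")   # hypothetical paper URI
ann = URIRef("http://example.org/annotations/1")

g.add((ann, RDF.type, ANN.Annotation))
g.add((ann, ANN.annotates, doc))                     # which paper the annotation marks up
g.add((ann, ANN.hasClass, DILEO.LoggingStudies))     # hypothetical DiLEO subclass
g.add((ann, RDFS.comment, Literal("transaction log analysis of search sessions")))

print(g.serialize(format="turtle"))
```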

Page 65: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Pages 66-70: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

the process - 1/3

• GoNTogle estimates a score for each class/subclass, calculating its presence in the k nearest neighbors.
• We set a score threshold above which a class is assigned to a new instance (optimal score: 0.18; sketch below).
• The user is presented with a ranked list of the suggested classes/subclasses and their scores, ranging from 0 to 1.
• 2,672 annotations were manually generated.
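A minimal sketch of the score-and-threshold idea: each candidate class is scored by its similarity-weighted presence among the k nearest annotated documents, and classes scoring above 0.18 are suggested. The tf-idf/cosine choices and the normalization are assumptions for illustration, not GoNTogle's exact internals.

```python
# Sketch: suggest DiLEO classes for a new text from its k nearest annotated
# neighbours; classes whose normalised score exceeds the threshold are returned.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_texts = [
    "questionnaire survey of user satisfaction",
    "transaction log analysis of search sessions",
    "expert inspection of interface usability",
]
train_classes = [{"SurveyStudies"}, {"LoggingStudies"}, {"LaboratoryStudies", "Usability"}]

vec = TfidfVectorizer()
X = vec.fit_transform(train_texts)

def suggest_classes(new_text, k=2, threshold=0.18):
    sims = cosine_similarity(vec.transform([new_text]), X).ravel()
    top = np.argsort(sims)[::-1][:k]                     # k nearest annotated neighbours
    scores = {}
    for i in top:
        for c in train_classes[i]:
            scores[c] = scores.get(c, 0.0) + sims[i]     # similarity-weighted presence
    total = sum(sims[top]) or 1.0
    ranked = {c: s / total for c, s in scores.items()}   # normalise scores to [0, 1]
    return sorted(((c, round(s, 2)) for c, s in ranked.items() if s >= threshold),
                  key=lambda cs: -cs[1])

print(suggest_classes("we analysed query logs and satisfaction ratings"))
```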

Page 71: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Pages 72-76: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

the process - 2/3

• RDFS statements were processed to construct a new data set (removal of stopwords and symbols, lowercasing, etc.)
• Experiments both with un-stemmed (4,880 features) and stemmed (3,257 features) words.
• Multi-label classification via the ML framework Meka.
• Four methods:
  - binary representation
  - label powersets
  - RAkEL
  - ML-kNN
• Four algorithms:
  - Naïve Bayes
  - Multinomial Naïve Bayes
  - k-Nearest Neighbors
  - Support Vector Machines
• Four metrics (sketched below):
  - Hamming Loss
  - Accuracy
  - One-error
  - F1 macro
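A minimal sketch of the four evaluation measures on toy label matrices. "Accuracy" is implemented here as example-based (Jaccard) accuracy, which is an assumption about the exact definition used in the experiments; Hamming loss and F1 macro come from scikit-learn, one-error is computed from the ranking scores.

```python
# Sketch: multi-label evaluation measures on toy prediction matrices.
import numpy as np
from sklearn.metrics import hamming_loss, f1_score

Y_true  = np.array([[1, 0, 1, 0],      # gold DiLEO classes per document
                    [0, 1, 0, 0],
                    [1, 1, 0, 1]])
Y_pred  = np.array([[1, 0, 0, 0],      # hard predictions
                    [0, 1, 0, 1],
                    [1, 1, 0, 1]])
Y_score = np.array([[.9, .1, .4, .2],  # ranking scores (needed for one-error)
                    [.2, .8, .1, .6],
                    [.7, .6, .3, .8]])

print("Hamming loss:", hamming_loss(Y_true, Y_pred))
print("F1 macro    :", f1_score(Y_true, Y_pred, average="macro", zero_division=0))

inter = np.logical_and(Y_true, Y_pred).sum(axis=1)
union = np.logical_or(Y_true, Y_pred).sum(axis=1)
print("Accuracy    :", np.mean(inter / np.maximum(union, 1)))   # example-based (Jaccard) accuracy

top = Y_score.argmax(axis=1)                                     # top-ranked label per document
print("One-error   :", np.mean(Y_true[np.arange(len(top)), top] == 0))
```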

Page 77: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Pages 78-82: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

the process - 3/3

• Performance tests were repeated using GoNTogle.
• GoNTogle's algorithm achieves good results relative to the tested multi-label classification algorithms.

[Figure: bar chart comparing GoNTogle and Meka on Hamming Loss, Accuracy, One-Error and F1 macro]

Page 83: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Page 84: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

part 3

how we discover

Page 85: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Pages 86-91: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

clustering - 1/3

• The final data set consists of 224 vectors of 53 features
  - it represents the annotations assigned from the DiLEO vocabulary to the document corpus.
• We represent the annotated documents by 2 vector models (sketch below):
  - binary: f_i has the value 1 if the subclass corresponding to f_i is assigned to the document m, otherwise 0.
  - tf-idf: the feature frequency ff_i of f_i equals 1 when the respective subclass is annotated to the respective document m; idf_i is the inverse document frequency of feature i over the document set M.
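A minimal sketch of deriving the two vector models from a document-by-subclass annotation matrix; the toy matrix is 4 x 5, standing in for the real 224 x 53 one, and scikit-learn's (smoothed) idf variant is used for illustration.

```python
# Sketch: binary and tf-idf representations of the annotation matrix.
# tf is 0/1 here because a subclass is either assigned to a document or not.
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

binary = np.array([[1, 0, 1, 0, 0],    # binary model: f_i = 1 iff subclass i is assigned
                   [0, 1, 1, 0, 1],
                   [1, 1, 0, 0, 0],
                   [0, 0, 1, 1, 0]])

tfidf = TfidfTransformer().fit_transform(binary)   # tf-idf model: 0/1 tf weighted by (smoothed) idf
print(tfidf.toarray().round(2))
```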

Page 92: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Pages 93-96: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

clustering - 2/3

• We cluster the vector representations of the annotations by applying 2 clustering algorithms:
  - K-Means: partitions the M data points into K clusters. When the objective function (cost, or error) was plotted for various values of K, its rate of decrease peaked for K near 11 (sketch below).
  - Agglomerative Hierarchical Clustering: builds a 'bottom-up' hierarchy of clusters.
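A minimal sketch of this step with scikit-learn: inspect the K-Means objective (inertia) for a range of K, then fit agglomerative clustering with the chosen K. X is a random placeholder for the 224 x 53 annotation vectors.

```python
# Sketch: elbow inspection for K-Means, then bottom-up agglomerative clustering.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

X = np.random.default_rng(0).random((224, 53))    # placeholder annotation vectors

for k in range(2, 21):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))               # look for the "knee" in this curve (K ~ 11 in the paper)

labels = AgglomerativeClustering(n_clusters=11).fit_predict(X)
print(np.bincount(labels))                        # cluster sizes
```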

Page 97: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Pages 98-103: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

clustering - 3/3

• We assess each feature of each cluster using the frequency increase metric.
  - it calculates the increase of the frequency of a feature f_i in cluster k (cf_i,k) compared to its document frequency df_i in the entire data set.
• We select the threshold a that maximizes the F1-measure, the harmonic mean of Coverage and Dissimilarity mean (sketch below).
  - Coverage: the proportion of features participating in the clusters to the total number of features.
  - Dissimilarity mean: the average of the distinctiveness of the clusters, defined in terms of the dissimilarity d_i,j between all the possible pairs of clusters.
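The exact formulas for the frequency increase and for the pairwise cluster dissimilarity are not spelled out on the slides; the sketch below assumes a difference of relative frequencies and a Jaccard-based dissimilarity over each cluster's selected feature set, purely for illustration of the threshold sweep.

```python
# Sketch of the threshold sweep: features whose in-cluster frequency rises enough
# over their corpus-wide frequency "participate" in that cluster; Coverage,
# Dissimilarity mean and their harmonic mean (F1) are computed for each threshold a.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = (rng.random((224, 53)) < 0.2).astype(int)      # toy binary annotation matrix
labels = rng.integers(0, 11, size=224)             # toy cluster assignment (K = 11)

df = X.mean(axis=0)                                # document frequency of each feature

def cluster_features(a):
    selected = []
    for k in range(labels.max() + 1):
        cf = X[labels == k].mean(axis=0)           # in-cluster frequency of each feature
        selected.append(set(np.where(cf - df > a)[0]))   # assumed frequency-increase test
    return selected

def f1_for(a):
    sel = cluster_features(a)
    coverage = len(set.union(*sel)) / X.shape[1]
    dissims = [(1 - len(s & t) / len(s | t)) if (s | t) else 0.0
               for s, t in combinations(sel, 2)]   # assumed Jaccard-based dissimilarity
    dmean = float(np.mean(dissims))
    return 2 * coverage * dmean / (coverage + dmean) if coverage + dmean else 0.0

best = max(np.arange(0.0, 1.0, 0.05), key=f1_for)
print("best threshold a:", round(float(best), 2), "F1:", round(f1_for(best), 3))
```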

Page 104: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Pages 105-107: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

metrics - F1-measure

[Figure: F1-measure plotted against threshold values from 0 to 1 for the K-Means tf-idf, K-Means binary and Hierarchical tf-idf configurations]

Page 108: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Page 109: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

part 4

how (and what) we interpret

Page 110: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

Levels

patterns

hasDimensionsType

isAimingAt

Research Questions

isSupporting/isSupportedBy

hasPerformed/isPerformedIn

isUsedIn/isUsingFindings

CriteriaMetrics Factors

Means Types

Criteria Categories

hasConstituent/ isConstituting

Dimensionstechnical excellence

Instrumentssoftware

Activityreport

Goalsdesign

Subjectshuman agents

Dimension Type

summative

Meanssurvey studies

isParticipatingIn

Means laboratory studies

Characteristicscount

Characteristicsdiscipline

Dimensionseffectiveness

Objects

PROCEDURAL LAYER

STRATEGIC LAYER

K-Means tf-idf

Page 111: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

patterns

[Diagram: evaluation pattern derived from the Hierarchical clustering, spanning the STRATEGIC and PROCEDURAL layers of DiLEO - classes such as Research Questions, Goal (describe), Dimensions (effectiveness), Dimensions Types, Level (interface), Activities (record, compare), Means (survey studies, laboratory studies), Means Type (quantitative), Instruments, Subjects (human agents), Objects, Characteristics, Criteria, Criteria Categories, Metrics, Factors and Findings, connected by properties such as isAimingAt, hasMeansType, isAffecting/isAffectedBy, hasPerformed/isPerformedIn, hasConstituent/isConstituting and isParticipatingIn]

Page 112: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Page 113: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

part 5

conclusions

Page 114: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology
Pages 115-120: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

conclusions

• The patterns reflect and - up to a point - confirm the anecdotally evident research practices of DL researchers.
• Patterns have similar properties to a map.
  - They can provide the main and the alternative routes one can follow to reach a destination, taking into account several practical parameters that one might not know.
• By exploring previous profiles, one can weigh all the available options.
• This approach can extend other coding methodologies in terms of transparency, standardization and reusability.

Page 121: Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology

Thank you for your attention.

questions?