Text mining in CORE (OR2012)

DESCRIPTION

OR2012 presentation on Text Mining in CORE

Page 1: Text mining in CORE (OR2012)

Text mining in CORE

Petr Knoth
The Open University

Page 2: Text mining in CORE (OR2012)

Outline

• Introduction of the CORE system
• Three phases:
  • Metadata and content harvesting
  • Semantic enrichment
  • Providing services

• Supporting research in mining databases of scientific publications (DiggiCORE)

Page 3: Text mining in CORE (OR2012)

CORE objectives

• To provide a platform for the delivery of Open Access content aggregated from multiple sources and to deliver a wide range of services on top of this aggregation.

• A nation-wide aggregation system that will improve the discovery of publications stored in British Open Access Repositories (OARs).

Page 4: Text mining in CORE (OR2012)

CORE functionality

Page 5: Text mining in CORE (OR2012)

CORE functionality

Content harvesting, processing

Page 6: Text mining in CORE (OR2012)

CORE functionality

Semantic enrichment

Page 7: Text mining in CORE (OR2012)

CORE functionality

Providing services

Page 8: Text mining in CORE (OR2012)

CORE functionality

Content harvesting, processing

Page 9: Text mining in CORE (OR2012)

Growth of items in Open Access repositories

Page 10: Text mining in CORE (OR2012)

Growth of Open Access repositories

Page 11: Text mining in CORE (OR2012)

Green Open Access - statistics

Page 12: Text mining in CORE (OR2012)

Why do we need aggregations?

“Each individual repository is of limited value for research: the real power of Open Access lies in the possibility of connecting and tying together repositories, which is why we need interoperability. In order to create a seamless layer of content through connected repositories from around the world, Open Access relies on interoperability, the ability for systems to communicate with each other and pass information back and forth in a usable format. Interoperability allows us to exploit today's computational power so that we can aggregate, data mine, create new tools and services, and generate new knowledge from repository content.”

[COAR manifesto]

Page 13: Text mining in CORE (OR2012)

Aggregation in CORE

• OAI-PMH metadata harvesting
• Locating full-text
• Focused crawling (to locate full-texts)
• Focused crawling (driven by citation analysis)
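A minimal sketch of the first step, OAI-PMH metadata harvesting. The slides do not describe CORE's harvester, so the repository endpoint (`https://repository.example.org/oai`), the function names and the sample response below are hypothetical. The `ListRecords` verb, the `metadataPrefix` and `resumptionToken` arguments, and the XML namespaces are those defined by OAI-PMH 2.0 and Dublin Core.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for one page of results."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, next_token) extracted from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI_NS + "record"):
        identifier = record.findtext(OAI_NS + "header/" + OAI_NS + "identifier")
        title = record.findtext(".//" + DC_NS + "title")
        records.append({"identifier": identifier, "title": title})
    token = root.findtext(".//" + OAI_NS + "resumptionToken")
    # A missing or empty token means there are no further pages.
    return records, ((token or "").strip() or None)


# Hand-written sample response (hypothetical repository, one record, one more page).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example paper</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, next_token = parse_list_records(SAMPLE)
next_url = list_records_url("https://repository.example.org/oai", next_token)
```

A harvester would fetch `next_url` and repeat until no resumption token is returned. Locating the full text behind each record is a separate step that this sketch does not cover.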

Page 14: Text mining in CORE (OR2012)

CORE functionality

Semantic enrichment

Page 15: Text mining in CORE (OR2012)

Aggregations need access to content, not just metadata!

• Certain metadata types can be created only at the level of the aggregation

• Certain metadata can change over time
• Ensuring content:
  • accessibility
  • availability
  • validity
  • quality
  • …

Page 16: Text mining in CORE (OR2012)

Semantic similarity and duplicate detection

• Cosine similarity calculated on tf-idf vectors extracted from full texts

[Knoth et al, COLING 2010; Knoth et al, IMMM 2011]
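A minimal sketch of cosine similarity on tf-idf vectors. The slides do not give the weighting variant or the tokenisation, so raw term frequency, a plain logarithmic inverse document frequency and whitespace tokenisation are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenised document to a sparse {term: tf-idf weight} dict."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "open access repositories aggregate research papers".split(),
    "open access repositories aggregate research papers".split(),
    "support vector machines classify text".split(),
]
vecs = tfidf_vectors(docs)
same = cosine(vecs[0], vecs[1])       # identical texts
different = cosine(vecs[0], vecs[2])  # no shared terms
```

Under this sketch, a similarity close to 1 suggests a duplicate candidate and intermediate values suggest related papers. Any concrete threshold would have to be tuned on real data.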

Page 17: Text mining in CORE (OR2012)

Semantic similarity and duplicate detection

• Heuristics to reduce the number of combinations (problem with the query length)
• Cross-language linking tests [Knoth et al, NTCIR-9 CrossLink 2011; Knoth et al, IJC-NLP CLIA 2011]

Page 18: Text mining in CORE (OR2012)

Information extraction, citation parsing and target recognition

• ParsCit tool (based on conditional random fields, CRF) for extraction of reference sections
• Levenshtein distance used for target detection
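A minimal sketch of how Levenshtein distance could be used to match a parsed reference to a known publication. The slides only state that the distance is used for target detection. Matching on lower-cased titles, the `best_target` helper and the `max_ratio` threshold are hypothetical choices for illustration.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertion, deletion, substitution cost 1)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_target(reference_title, known_titles, max_ratio=0.2):
    """Return the known title closest to reference_title, or None if too far."""
    query = reference_title.lower().strip()
    best, best_dist = None, None
    for title in known_titles:
        dist = levenshtein(query, title.lower().strip())
        if best_dist is None or dist < best_dist:
            best, best_dist = title, dist
    if best is None:
        return None
    # Accept only if the distance is small relative to the longer string
    # (max_ratio is an assumed threshold, not a value from the slides).
    limit = max_ratio * max(len(query), len(best))
    return best if best_dist <= limit else None


known = ["Text mining in CORE", "Growth of Open Access repositories"]
match = best_target("Text minning in CORE", known)  # tolerates the typo
```

Comparing every reference with every known title is quadratic, so a real system would first narrow the candidates, for example with a search index.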

Page 19: Text mining in CORE (OR2012)

Text categorisation

• 17 top-level DOAJ classes (http://www.doaj.org/doaj?func=browse&uiLanguage=en)
• 1080 examples
• SVM multiclass
• 10-fold cross-validation
• 91.4% accuracy
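A minimal sketch of the 10-fold cross-validation protocol behind an accuracy figure like the one above. It is not the CORE experiment. To stay self-contained it plugs in a majority-class stand-in instead of an SVM, and the function names and toy data are hypothetical.

```python
from collections import Counter


def k_fold_accuracy(examples, labels, train, predict, k=10):
    """Mean accuracy over k folds; assumes at least k examples.

    Fold i holds out every k-th example starting at index i.
    """
    accuracies = []
    for fold in range(k):
        test_idx = [i for i in range(len(examples)) if i % k == fold]
        train_idx = [i for i in range(len(examples)) if i % k != fold]
        model = train([examples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct = sum(predict(model, examples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Stand-in classifier (NOT an SVM): always predicts the most frequent training label.
def train_majority(train_examples, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, example):
    return model


# Toy data: 100 examples, 80 of class "A" and 20 of class "B".
examples = list(range(100))
labels = ["A"] * 80 + ["B"] * 20
accuracy = k_fold_accuracy(examples, labels, train_majority, predict_majority)
```

Replacing `train_majority` and `predict_majority` with a real multiclass SVM keeps the evaluation loop unchanged. With 17 classes, stratified folds that preserve class proportions would usually be preferable to the simple interleaving used here.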

Page 20: Text mining in CORE (OR2012)

CORE functionality

Providing services

Page 21: Text mining in CORE (OR2012)

Who should be supported by aggregations?

The following user groups (divided according to the level of abstraction of the information they need):

• Raw data access
• Transaction information access
• Analytical information access

Page 22: Text mining in CORE (OR2012)

Who should be supported by aggregations?

• The following user groups (divided according to the level of abstraction of the information they need):
  • Raw data access: Developers, DLs, DL researchers, companies …
  • Transaction information access: Researchers, students, life-long learners …
  • Analytical information access: Funders, government, business intelligence

Page 23: Text mining in CORE (OR2012)

Should a single aggregation system support all three user types?

This can be realised by more than one system, provided that the dataset is the same!

Page 24: Text mining in CORE (OR2012)

CORE applications

• CORE Portal
• CORE Mobile
• CORE Plugin
• CORE API
• Repository Analytics

Page 25: Text mining in CORE (OR2012)

Who should be supported by aggregations?

• The following user groups (divided according to the level of abstraction of the information they need):
  • Raw data access (CORE API): Developers, DLs, DL researchers, companies …
  • Transaction information access (CORE Portal, CORE Mobile, CORE Plugin): Researchers, students, life-long learners …
  • Analytical information access (Repository Analytics): Funders, government, business intelligence

Page 26: Text mining in CORE (OR2012)

CORE Applications

CORE API – Enables external systems and services to interact with the CORE repository.

• Search service
• PDF and plain text service
• Similarity service
• Classification service
• Citation service

Page 27: Text mining in CORE (OR2012)

CORE Applications

CORE Portal – Allows searching and navigating scientific publications aggregated from Open Access repositories

Page 28: Text mining in CORE (OR2012)

Snippets

Page 29: Text mining in CORE (OR2012)

CORE Applications

CORE Mobile – Allows searching and navigating scientific publications aggregated from Open Access repositories

Page 30: Text mining in CORE (OR2012)

CORE Applications

CORE Plugin – A plugin to a system that provides recommendations for related items.

Page 31: Text mining in CORE (OR2012)

CORE Applications

Repository Analytics – An analytical tool supporting providers of open access content (in particular repository managers).

Page 32: Text mining in CORE (OR2012)

Page 33: Text mining in CORE (OR2012)

Page 34: Text mining in CORE (OR2012)

CORE statistics

• Content
  • 7M records
  • 230 repositories
  • 402k full-texts
  • 1 TB of data
  • 40 GB index
  • 35 million RDF triples in the CORE LOD repository
• Started: February 2011
• Budget: £140k

Page 35: Text mining in CORE (OR2012)

Outline

• Introduction of the CORE system
• Three phases:
  • Metadata and content harvesting
  • Semantic enrichment
  • Providing services

• Supporting research in mining databases of scientific publications (DiggiCORE)

Page 36: Text mining in CORE (OR2012)

DiggiCORE objective

Software for exploration and analysis of very large and fast-growing amounts of research publications stored across Open Access Repositories (OAR).

Page 37: Text mining in CORE (OR2012)

DiggiCORE networks

Three networks: (a) semantically related papers, (b) citation network, (c) author citation network
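A minimal sketch of how network (c) could be derived from network (b), assuming citations are stored as a mapping from each paper to the papers it cites and authorship as a mapping from each paper to its authors. The data structures, identifiers and author names are hypothetical, and the slides do not say how DiggiCORE builds its networks.

```python
from collections import defaultdict


def author_citation_network(paper_citations, paper_authors):
    """Count paper-level citations for each (citing author, cited author) pair."""
    edges = defaultdict(int)
    for citing, cited_papers in paper_citations.items():
        for cited in cited_papers:
            for citing_author in paper_authors.get(citing, []):
                for cited_author in paper_authors.get(cited, []):
                    # Self-citations are kept as (author, author) edges.
                    edges[(citing_author, cited_author)] += 1
    return dict(edges)


# Toy citation network: p1 cites p2 and p3, p2 cites p3.
paper_citations = {"p1": ["p2", "p3"], "p2": ["p3"]}
paper_authors = {
    "p1": ["Author A"],
    "p2": ["Author A", "Author B"],
    "p3": ["Author C"],
}
network = author_citation_network(paper_citations, paper_authors)
```

Each edge weight counts how many paper-level citations connect the two authors. Whether to drop self-citations or normalise by the number of co-authors depends on the analysis.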

Page 38: Text mining in CORE (OR2012)

DiggiCORE objectives

Allow researchers to use this platform to analyse publications. Why?

• To identify patterns in the behaviour of research communities
• To detect trends in research disciplines
• To gain new insights into the citation behaviour of researchers
• To discover features that distinguish papers with high impact

Page 39: Text mining in CORE (OR2012)

Summary

• The rapid growth of OA content provides a great opportunity for text mining.
• Aggregations need to aggregate content, not just metadata.
• Aggregations should serve the needs of different user groups, including researchers who need access to data. CORE aims to support them.
• We can have many services that are part of the infrastructure, but they should work with the same data.

Page 40: Text mining in CORE (OR2012)

Thank you!

William Wallace

Page 41: Text mining in CORE (OR2012)
