43
Moritz Schubotz 31/07/2016 www.formulasearchengine.com 1 CICM 2016, Doctoral program Augmenting Mathematical Formulae for More Effective Querying & Presentation

CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Moritz Schubotz

31/07/2016 www.formulasearchengine.com 1

CICM 2016, Doctoral program

Augmenting Mathematical Formulae for More Effective

Querying & Presentation

Page 2: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Motivation

31/07/2016 www.formulasearchengine.com 2

26th of February 2011

28th of March18.03.-26.03.2011

Page 3: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Example 1: 1

𝑥≤

1

𝑥

<apply>

<leq/>

<apply>

<divide/>

<cn type="integer">1</cn>

<apply>

<mean/>

<qvar>x</qvar></apply></apply>

<apply>

<mean/>

<apply>

<divide>

<cn type="integer">1</cn>

<qvar>x</qvar></apply></apply>…

1. Different forms e.g. 𝑥1

𝑥≥ 1

2. Different notations e.g.

𝑋𝑓 𝑥 𝑥𝑑𝑥 = ⟨𝑥⟩

3. Exact match seldom

4. Ambiguity in syntax e.g. 𝐸Ψ =𝐻Ψ

5. no TeX-function mean

$\frac 1 \mean ?x \le

\mean \frac 1 ?x$

NTCIR-11 Math-2 WMC-D1

Page 4: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Example 1: 1

𝑥≤

1

𝑥

<apply>

<leq/>

<apply>

<divide/>

<cn type="integer">1</cn>

<apply>

<mean/>

<qvar>x</qvar></apply></apply>

<apply>

<mean/>

<apply>

<divide>

<cn type="integer">1</cn>

<qvar>x</qvar></apply></apply>…

1. Different forms e.g. 𝑥1

𝑥≥ 1

2. Different notations e.g.

𝑋𝑓 𝑥 𝑥𝑑𝑥 = ⟨𝑥⟩

3. Exact match seldom

4. Ambiguity in syntax e.g. 𝐸Ψ =𝐻Ψ

5. no TeX-function mean

$\frac 1 \mean ?x \le

\mean \frac 1 ?x$

NTCIR-11 Math-2 WMC-D1

Page 5: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Result 1: 𝜑(𝔼[𝑋]) ≤ 𝔼[𝜑(𝑋)]

$?f[type=function] \mean ?x \le

\mean ?f[type=function] ?x $

$\frac 1 \mean ?x \le

\mean \frac 1 ?x$

$?f[type=function] \mean ?x \le

\mean ?f[type=function] ?x $

$\frac 1 \mean ?x \le

\mean \frac 1 ?x$

\le

\mean

?x

\mean

?f[type=function]

?f[type=function]

?x

\le

\mean

?x

\mean

\frac

\frac

?x

1

1

Page 6: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Result 1: 𝜑(𝔼[𝑋]) ≤ 𝔼[𝜑(𝑋)]

<apply>

<leq/>

<apply>

<divide/>

<cn type="integer">1</cn>

<apply>

<mean/>

<qvar>x</qvar></apply></apply>

<apply>

<mean/>

<apply>

<divide>

<cn type="integer">1</cn>

<qvar>x</qvar></apply></apply>…

<apply>

<leq/>

<apply>

<qvar type="function">f</qvar>

<apply>

<mean/>

<qvar>x</qvar></apply></apply>

<apply>

<mean/>

<apply>

<qvar type="function">f</qvar>

<qvar>x</qvar></apply></apply>…

$?f[type=function] \mean ?x \le

\mean ?f[type=function] ?x $

$\frac 1 \mean ?x \le

\mean \frac 1 ?x$

$\frac 1$ -> $?f[type=function]$

Not trivial

Page 7: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Solution 1 inexact matches

• Refined query:$\superconceptOf[

orderby = editdistance ]{

\frac 1 \mean ?x \le

\mean \frac 1 ?x

}$

• Computational complexity

• Restriction of the search space

• Check most likely solutions at first

31/07/2016 www.formulasearchengine.com 8

Page 8: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

But there are diverse information needs

1. Definition look-up

2. Explanation look-up

3. Proof look-up

4. Application look-up

5. Computation assistance

6. General formula search

31/07/2016 www.formulasearchengine.com 9

Page 9: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

, and the data looks like that

31.07.2016 www.formulasearchengine.com 10

Page 10: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Levels of Abstraction

31/07/2016 www.formulasearchengine.com 11

Presentation

Content

Semantic

Page 11: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Overview

31/07/2016 www.formulasearchengine.com 12

Image pro-

cessingImage Structure

detectionPresen-tation

Entity LinkageContent Semantic

Integrated Queries

Math

Presen-tation

Content Semantic

Text

Keywords Relations

Meta-data

Page 12: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Completed Research

Querying Processing Scalibility

- Making Math Searchable in Wikipedia (CICM 2012)

- Evaluation of Similarity-Measure Factors for Formulae (NTCIR 2015)

- Wikipedia Subtask at NTCIR 11 (SIGIR 2015)

- Exploring the single-brainbarrier (NTCIR 2016)

- Mathematical Language Processing (CICM 2014)

- Digital Repository of Mathematical Formulae (CICM2014 coauthor)

- Growing the DRMF with generic LaTeX sources

- Mathoid: Accessible MathRendering for Wikipedia ( CICM 2014)

- Semantification of Identifiers in Mathematics for Better Math Information Retrieval (SIGIR 2016)

- Applying Stratosphere for Big Data Analytics (coauthor, BTW 2013)

- Querying large Collections of Mathematical Publications (NTCIR 2013 with Marcus Leich)

31/07/2016 www.formulasearchengine.com 13

Integrated Queries

Math

Presen-tation

Content Semantic

Text

Keywords Relations

Meta-data

Page 13: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Mathoid: Robust, Scalable, Fast and Accessible Math Rendering for

Wikipedia

𝜑(𝔼[𝑋]) ≤ 𝔼[𝜑(𝑋)].

• convex function (Q319913, NDL ID 00573442)

• subclass of function

• ja:凸関数

31.07.2016 www.formulasearchengine.com 14

Let be a probability space, X an integrable real-

valued random variable and φ a convex function.

Then:

Page 14: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Exploring the single-brain barrier

• “one-brain barrier” [1]– Metaphor: relevant knowledge to conduct math research

needs to be co-located in one brain

• Goals of our contribution to NTCIR12: – Create a point of reference w.r.t. to this barrier for a

trained mathematician– Compare the performance of a human to MIR systems and

analyse characteristic strengths and weaknesses– Derive insights to improve MIR systems– Combine the relevant results of the human and the MIR

systems to create a gold standard

31/07/2016 www.formulasearchengine.com 15

Page 15: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Exploring the single-brain barrier

31/07/2016 www.formulasearchengine.com 16

Page 16: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Exploring the single-brain barrier

31/07/2016 www.formulasearchengine.com 17

Page 17: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Exploring the single-brain barrier

• Strengths of MIR systems:

– Definition lookup queries

– Application lookup

• Weaknesses of MIR systems

– Low precision

– No unified query language to specify query type

• Gold standard dataset can help to develop a math-aware search engine for Wikipedia

31/07/2016 www.formulasearchengine.com 18

Page 18: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Semantification of Identifiers in Mathematics for Better Math

Information Retrieval

31/07/2016 www.formulasearchengine.com 19

• First step to enable computer to understand

mathematicians notations

• Focus on identifiers

• Extract identifier semantics by combining math and

• Use computers to find relevant mathematics

• Computers must understand semantics in math to

provide needed information

Page 19: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Math Augmentation Approach

31/07/2016 www.formulasearchengine.com 20

Page 20: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

31/07/2016 www.formulasearchengine.com 21

(1) Extract formulae

Page 21: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

31/07/2016 www.formulasearchengine.com 22

(2) Extract identifiers

Page 22: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

31/07/2016 www.formulasearchengine.com 23

(3) Find identifiers

Page 23: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

31/07/2016 www.formulasearchengine.com 24

(4) Find definiens candidates

Page 24: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

31/07/2016 www.formulasearchengine.com 25

(5) Score all identifer-definienspairs

Page 25: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

31/07/2016 www.formulasearchengine.com 26

(6) Generate feature vectors

Page 26: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

31/07/2016 www.formulasearchengine.com 27

(7) Cluster feature vectors

Page 27: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

31/07/2016 www.formulasearchengine.com 28

(8) Map clusters to subject hierarchy

Page 28: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Wikipedia Subtask at NTCIR 11

• NTCIR 11 Wikipedia dataset*

• 30k Wikipedia Articles

• 280k Formulae

• 100 queries

31/07/2016 www.formulasearchengine.com 29

*) Moritz Schubotz, Abdou Youssef, Volker Markl, and Howard S. Cohl. 2015. Challenges of Mathematical Information Retrieval in the NTCIR-11 Math Wikipedia Task. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15). ACM, New York, NY, USA, 951-954.

Page 29: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Wikipedia Subtask at NTCIR 11

• CICM 2012(2 Participants)

• NTCIR 2013 (pilot)(6 Participants)

• NTCIR 2014

arXiv (8 Participants) Wikipedia (7 Participants)

31/07/2016 30Part III / IV Completed Research

Page 30: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Wikipedia Task results

31/07/2016 31

Higher recall

Hig

her

„p

reci

sio

n“

Part III / IV Completed Research

Page 31: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Gold standardavailable from http://mlp.formulasearchengine.com

31/07/2016 www.formulasearchengine.com 32

Page 32: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Gold standard details

31/07/2016 www.formulasearchengine.com 33

174

4

27

97

8310 Identifiers

Wikidata item Two Wikidata items Wikidata item + NP

Individual NP Multiple NP

Page 33: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

31/07/2016 www.formulasearchengine.com 34

Results: Identifiers

• 294/310 correctly extracted (94.8%)

• 57 false positive (fp)

• Problems

– Incorrect markup (8fn, 33fp) 𝜂 =𝑄1−𝑄2

𝑄1

– Symbols (9fp) d

d𝑥

– Sub-super script (3fp, 2fn) 𝜎𝑦2

– Special notation (10fp, 2fn) 𝒖 × 𝒗 = 𝜖𝑖𝑗𝑘𝑢𝑗𝑣𝑘𝒆𝒊

Page 34: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

31/07/2016 www.formulasearchengine.com 35

Results: (4) Find definienscandidates

Page 35: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

31/07/2016 www.formulasearchengine.com 36

Results: Definitions

88

120

19

83

Definitions

Exact match Partial match not found not in document

Page 36: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Distribution of identifier counts

31/07/2016 www.formulasearchengine.com 37

Page 37: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

31/07/2016 www.formulasearchengine.com 38

Results: Definitions with Namespace support

103

147

60

Definitions

Exact match Partial match not found

Page 38: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Discovered namespaces

31/07/2016 www.formulasearchengine.com 39

• 250 clusters -> 167 mapped to classification schemata

• 5618 definitions with 𝑠 > 1 (2124 Wikidata concepts)

• Evaluate 6 randomly sampled namespaces

Page 39: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Impression from namespace samples

31/07/2016 www.formulasearchengine.com 40

129

78

144

Definitions in Namespaces

correct wrong unspecific cannot say

Page 40: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Discovered namespaces

31/07/2016 www.formulasearchengine.com 41

• Physics

• Identifiers are significant for formulae

• Mathematics

• Identifiers might be less significant for formulae

Page 41: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Conclusions

• For 10% of the identifiers used in Wikipedia (en) we could assign the associated Wikidata item

• 90% ahead– More specific Wikidata items needed

– Combine data from different language versions

– Improve recognition rate within a document

• Namespaces for mathematical identifiers could be identified

31/07/2016 www.formulasearchengine.com 42

Page 42: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Next steps

• The identifier information is available from the Wikipedia API today http://en.wikipedia.org/api

• Develop tools to augment the user experience for math in Wikipedia and beyond– Tooltips (ongoing)– Physical Dimensions (ongoing)– Translation to Computer Algebra Systems (started)– Math Question and Answering (ongoing)– Author assistance (ongoing with WMF)– Related formulae search (started)

• Justification for semantification effort

31/07/2016 www.formulasearchengine.com 43

Page 43: CICM 2016, Doctoral program · •Extract identifier semantics by combining math and ... –Special notation (10fp, 2fn) × = ... •Namespaces for mathematical identifiers could

Contact

Moritz Schubotz (now at Universität Konstanz)

[email protected]

+49 7531 88 4438

Mobile: +49 1578 047 1397

www.isg.uni-konstanz.de

www.formulasearchengine.com

31/07/2016 www.formulasearchengine.com 44