
Similarity Search for Mathematics

https://mir.fi.muni.cz/

Martin Líška, Petr Sojka and Michal Růžička

[email protected], [email protected] and [email protected]

Botanická 68a, 602 00 Brno, Czech Republic

Acknowledgements: This work was partially supported by the European Union through its Competitiveness and Innovation Programme (Information and Communication Technologies Policy Support Programme, ‘Open access to scientific information’, Grant Agreement No. 250503, a project of the European Digital

Introduction

Full paper texts have to be ‘homogenized’, converted to some uniform representation, in order for math-aware full-text searches and paper similarity computations to work properly.

There is an average of 380 mathematical formulae per arXiv paper in our MREC database. It has been reported that even a single histogram of mathematical symbols is sufficient for domain classification of a paper in the mathematical domain. To reliably represent a paper for DML processing, including handling the mathematics, it is necessary to

Math information retrieval (MIR) is becoming recognized as an important and highly domain-specific field of information retrieval research. Masaryk University (MU) entered the area of MIR during the development of the Czech Digital Mathematics Library, DML-CZ, in the mid-nineties. It became obvious that Digital Mathematical Libraries (DMLs) are specific in many aspects.

Some papers in DMLs consist more of formulae than of text, so we started to think about the representation and indexing of mathematical formulae in addition to text. There was no widely accepted user interface or representation for math formulae in information retrieval (IR). We designed and developed the first math formulae indexing and retrieval prototypes in a series of Bachelor's theses. Math formulae are structures appearing within accompanying text that convey meaning and relations between objects mentioned in the text. They can be represented as trees, and formula similarity can be defined as tree structure similarity.

MU has partnered in the development of the European Digital Mathematics Library, EuDML, where it has been decided to support math formulae search,

Canonicalization

Our approach to searching mathematical content in documents is based on similarities of math structures, found through conventional full-text searching. Because mathematical notation, e.g. expressions and formulae, is highly structured, we preprocess mathematical content so that it can be processed by full-text searching methods. The preprocessing procedures include canonicalization, which is very important for matching two equal formulae with slight notational differences. Therefore, the level of canonicalization needs to be as high as possible. Then, to allow searching for subformulae, expressions are tokenized and subtrees of formulae are extracted. Subformulae are stored in the locations of their original forms so they can be easily located at query time. To be able to search for similar expressions, we propose several generalization preprocessing techniques. These include unification of variables, unification of number constants and font typeface preservation. These aim to increase the recall of mathematical search. To increase the precision, we rank each indexed expression according to its distance from the original non-tokenized formula. The less unified a subformula is, and the higher the level of the original formula tree it was extracted from, the higher the weight factor it gets. Assigned weights affect the ordering of retrieved results. The factors that influence the resulting weights of indexed subformulae are adjustable. The current setup reflects our generic view of the distance of extracted subformulae from their original trees. A different setup might influence the order of retrieved results significantly. There is no ideal set of factors; however, we want to reach an optimal setup by repeated evaluation.
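The extraction, unification and weighting steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the MIaS implementation: the `Node` class, the string form produced by `linearize`, and the three coefficients (`level_coef`, `var_coef`, `const_coef`) are all hypothetical and chosen just to show how less unified, higher-level subformulae could end up with higher weights.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a formula tree: an operator or a leaf symbol."""
    label: str
    children: list = field(default_factory=list)


def linearize(node):
    """Serialize a tree to a string usable as a full-text index term."""
    if not node.children:
        return node.label
    return node.label + "(" + ",".join(linearize(c) for c in node.children) + ")"


def unify_variables(node, names=None):
    """Rename single-letter identifiers to id1, id2, ... in order of first use."""
    names = {} if names is None else names
    if not node.children:
        if len(node.label) == 1 and node.label.isalpha():
            names.setdefault(node.label, f"id{len(names) + 1}")
            return Node(names[node.label])
        return Node(node.label)
    return Node(node.label, [unify_variables(c, names) for c in node.children])


def unify_constants(node):
    """Replace every numeric leaf by the placeholder 'const'."""
    if not node.children:
        return Node("const" if node.label.isdigit() else node.label)
    return Node(node.label, [unify_constants(c) for c in node.children])


def subtrees(node, depth=0):
    """Yield every subtree together with its depth in the original formula."""
    yield node, depth
    for child in node.children:
        yield from subtrees(child, depth + 1)


def index_terms(formula, level_coef=0.7, var_coef=0.8, const_coef=0.5):
    """Map each index term to its weight (coefficients are hypothetical)."""
    terms = {}
    for sub, depth in subtrees(formula):
        base = level_coef ** depth  # deeper subformulae weigh less
        variants = [
            (sub, base),
            (unify_variables(sub), base * var_coef),
            (unify_constants(sub), base * const_coef),
            (unify_constants(unify_variables(sub)), base * var_coef * const_coef),
        ]
        for tree, weight in variants:
            key = linearize(tree)
            terms[key] = max(weight, terms.get(key, 0.0))  # keep the best weight
    return terms


# a + b^2
formula = Node("+", [Node("a"), Node("^", [Node("b"), Node("2")])])
for term, weight in sorted(index_terms(formula).items(), key=lambda kv: -kv[1]):
    print(f"{weight:.3f}  {term}")
```

In this sketch a query such as x^2 would match the indexed term `^(id1,2)` after variable unification, but with a lower weight than an exact match on `^(b,2)`.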

Indexing and Searching

Evaluation

Workflow example

We developed a search system according to these principles. MIaS (Math Indexer and Searcher) is a math-aware, full-text based search engine. It is based on the state-of-the-art searching library Lucene. It supports combined text and math searching. Refinement of many text query results by adding a math query is believed to be a very powerful tool. Mathematical preprocessing is a plug-in that can be used with any Lucene or Solr based system. MIaS processes documents with mathematics encoded in Presentation or Content MathML. At the end of the preprocessing, expression trees are linearized to a compacted string form to reduce index space requirements.
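To make the final linearization step more concrete, here is a minimal sketch of turning a Presentation MathML tree into a compact string. The one-letter abbreviation table `ABBREV` and the bracket syntax are assumptions made for illustration; the actual compacted form used by MIaS is not given in this text.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-letter codes for common elements (not MIaS's real table).
ABBREV = {"math": "M", "mrow": "R", "mi": "I", "mn": "N",
          "mo": "O", "msup": "P", "mfrac": "F"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def compact(elem):
    """Linearize an element: CODE(children...) or CODE[text] for leaves."""
    name = local_name(elem.tag)
    code = ABBREV.get(name, name)  # unknown elements keep their full name
    children = list(elem)
    if children:
        return code + "(" + "".join(compact(c) for c in children) + ")"
    return f"{code}[{(elem.text or '').strip()}]"


mathml = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML">'
    "<mrow><mi>a</mi><mo>+</mo><msup><mi>b</mi><mn>2</mn></msup></mrow>"
    "</math>"
)
print(compact(ET.fromstring(mathml)))  # M(R(I[a]O[+]P(I[b]N[2])))
```

The compact form is much shorter than the serialized XML, which is the point of the step: every subformula becomes an index term, so shorter terms reduce index size.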

The very straightforward query interface of MIaS consists of only one input field. Users can type in textual queries together with math queries encoded in LaTeX notation or in MathML notation. The query is visualized on the fly as a ‘typeset’ formula in the user's web browser, so that users can verify the correctness of the mathematical part of the query. Along with basic information about the retrieved documents, the result list shows a snippet with highlighted text and math tokens that are the most significant for the document's rank. This allows for a quick initial evaluation of the document's relevance to the user's query.

Alongside the interactive web querying interface, MIaS offers searching through web services. This is an indispensable feature for automated querying, which was used to retrieve evaluation results for the NTCIR Math Task.

as one of its math-specific features. We have also paid attention to user interface aspects: the formula is rendered as the user types, being re-rendered after every keystroke.

To the best of our knowledge, EuDML with MIaS is the first digital library collecting non-born-digital PDFs that supports math search in full texts.

1. select a canonical representation of the non-textual structural entities appearing in full texts (mathematical symbols, formulae, and equations); and

2. decide on equivalence classes for these entities (e.g., for which formulae should be considered equal for given DML tasks such as search, similarity computation, formulae editing, and conversion of math into Braille).

Using our public working demo of the WebMIaS system, we discovered several discrepancies in the form of MathML generated by Tralics, the real-time TeX to MathML converter we currently use. We tried to normalize both the users' MathML input and the MathML produced by the LaTeXML converter contained in the arXMLiv collection. Then we went through the Presentation MathML specification and gathered a list of possible reformatting rules we could perform.
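As an illustration of what one such reformatting rule might look like, the sketch below removes redundant `<mrow>` wrappers, relying on the MathML convention that an `<mrow>` with a single child is equivalent to that child. It is a minimal example only: the function name `minimize_mrow` is hypothetical, namespaces are omitted for brevity, and `<mrow>` elements carrying attributes are deliberately left alone so that no information is lost.

```python
import xml.etree.ElementTree as ET


def minimize_mrow(elem):
    """Return a copy of elem in which every attribute-less <mrow> that wraps
    exactly one child is replaced by that child."""
    children = [minimize_mrow(c) for c in elem]
    if elem.tag == "mrow" and not elem.attrib and len(children) == 1:
        only = children[0]
        only.tail = elem.tail  # keep any text that followed the wrapper
        return only
    copy = ET.Element(elem.tag, elem.attrib)
    copy.text, copy.tail = elem.text, elem.tail
    copy.extend(children)
    return copy


# Two notational variants of x = 1, as different converters might emit them.
verbose = ET.fromstring(
    "<math><mrow><mrow><mi>x</mi></mrow><mo>=</mo>"
    "<mrow><mn>1</mn></mrow></mrow></math>"
)
plain = ET.fromstring("<math><mrow><mi>x</mi><mo>=</mo><mn>1</mn></mrow></math>")

canonical_a = ET.tostring(minimize_mrow(verbose), encoding="unicode")
canonical_b = ET.tostring(minimize_mrow(plain), encoding="unicode")
print(canonical_a)
print(canonical_a == canonical_b)
```

After the rule is applied, both variants serialize to the same string, so they would produce identical index terms.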

The MIRMU team participated in the Math Retrieval Subtask of NTCIR-10 with contributions to all three types of search: Formula Search, Full Text Search and Open Information Retrieval. Full Text Search simulated the standard use of a search system, with queries comprising math expressions as well as text. For each query, the system returned a list of documents as they were provided in the test collection. No special modifications were therefore needed. For Formula Search, however, several adjustments were necessary. Formula Search aimed at retrieving independent formulae located in the provided documents. If, for example, a document contained 100 formulae, each of them could be retrieved as a hit on its own. This differs from the normal workflow. However, the flexible design of MIaS allowed us to index every formula as an independent index document containing only that formula, by adding a special document handler.
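The idea of treating each formula as its own index document can be sketched as follows. This is not the MIaS document handler; the function `formula_documents`, the dictionary layout and the `doc_id#n` identifier scheme are assumptions made for illustration only.

```python
import copy
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def formula_documents(doc_id, xhtml):
    """Yield one small index document per <math> element of the input file."""
    root = ET.fromstring(xhtml)
    for n, math in enumerate(root.iter(f"{{{MATHML_NS}}}math"), start=1):
        formula = copy.copy(math)
        formula.tail = None  # drop the surrounding text that follows the formula
        yield {
            "id": f"{doc_id}#{n}",  # hypothetical identifier scheme
            "math": ET.tostring(formula, encoding="unicode"),
        }


page = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Let '
    f'<math xmlns="{MATHML_NS}"><mi>x</mi></math> and '
    f'<math xmlns="{MATHML_NS}"><mi>y</mi></math>.</p></body></html>'
)
for doc in formula_documents("paper1", page):
    print(doc["id"])
```

A document with 100 formulae would yield 100 such records, each retrievable as a separate hit, which is also why a per-formula index contains far more index documents than a per-file one.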

For the needs of the Math Retrieval Subtask, we created two indexes from the provided document collection, which contained 36,697,971 math expressions and was 7.3 GB in size.

After preprocessing, both indices stored more than 1.5 billion subexpressions. The first index, NTCIR-fragments, was created from single formulae to serve the Formula Search type. Every index document represented only one formula from the input files; therefore, the resulting index contained more than 73.5 million documents. It took 8.5 hours to complete the index, which was around 39.5 GB in size. The second index, called NTCIR-files, was created the regular way, consisting of both text and formulae, with one index document representing exactly one physical document from the collection. It took 5 hours to complete the index, which was around 30 GB in size. This comparison shows an interesting overhead of the Formula Search index. It contains less data but is split into more logical units, which resulted in a longer indexing time and a larger index.

Alongside text, MIaS accepts LaTeX and both Content and Presentation MathML as query notations for mathematics. LaTeX queries are converted to combined Presentation-Content MathML by the LaTeXML converter. We decided to use the option of submitting four runs to analyse how the performance of the system differs with the query language. This was supported by the test query collection, which provided all of the mentioned formats for each query.

Overall scores of MIaS were above the average of the Math Task results. Precision at rank five (P-5) of MIaS in Run 4 was the highest of all competing submissions. Table 3 shows all four reported metrics for relevance level ‘relevant’. Table 4 shows the same metrics for relevance level ‘partially relevant’.
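For readers unfamiliar with the metric, precision at rank five is the fraction of the first five returned results that were judged relevant. The sketch below shows the standard computation; the document identifiers and judgements are made up for illustration and are not NTCIR data.

```python
def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the top-k results that appear in the set of relevant ids."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k  # missing results (fewer than k returned) count as misses


# Made-up ranking and judgements, for illustration only.
ranking = ["d3", "d1", "d7", "d2", "d9", "d4"]
relevant = {"d1", "d2", "d4"}
partially_relevant = relevant | {"d3"}  # the looser level includes the stricter one

print(precision_at_k(ranking, relevant))            # 0.4
print(precision_at_k(ranking, partially_relevant))  # 0.6
```

Computing the metric once per relevance level, as above, mirrors the way the results are reported separately for ‘relevant’ and ‘partially relevant’.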

We discovered that the ability to evaluate is very valuable in information retrieval. It is a driving force in the evolution of IR systems, more so if it is impartial, as for example in the NTCIR conference task. But to justify development on a day-to-day basis, we need our own collection with gold standards against which we can evaluate our development steps. Our future goal is to create our own gold standard evaluation collection. We find it a prerequisite for the further development of retrieval techniques.

Results

LÍŠKA, Martin and Petr SOJKA and Michal RŮŽIČKA. Similarity Search for Mathematics: Masaryk University team at the NTCIR-10 MathTask. In Proceedings of the 10th NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Math Pilot Task

Math Processing

Publications:

SOJKA, Petr and Martin LÍŠKA. The Art of Mathematics Retrieval. In Matthew R. B. Hardy, Frank Wm. Tompa. Proceedings of the 2011 ACM Symposium on Document Engineering. Mountain View, CA, USA: ACM, 2011. p. 57–60, 4 pp. ISBN 978-1-4503-0863-2. doi:10.1145/2034691.2034703.

SOJKA, Petr and Martin LÍŠKA. Indexing and Searching Mathematics in Digital Libraries – Architecture, Design and Scalability Issues. In James H. Davenport, William M. Farmer, Josef Urban, Florian Rabe. Intelligent Computer Mathematics. Lecture Notes in Computer Science, 2011, Volume 6824/2011. Berlin / Heidelberg: Springer, 2011. p. 228–243, 15 pp. ISBN 978-3-642-22672-4. doi:10.1007/978-3-642-22673-1_16.

SOJKA, Petr and Martin LÍŠKA and Michal RŮŽIČKA. Building Corpora of Technical Texts: Approaches and Tools. In Aleš Horák, Pavel Rychlý. Fifth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2011. Brno: Tribun EU, 2011. p. 71–82, 11 pp. ISBN 978-80-263-0077-9.

LÍŠKA, Martin and Petr SOJKA and Michal RŮŽIČKA and Peter MRAVEC. Web Interface and Collection for Mathematical Retrieval: WebMIaS and MREC. In Petr Sojka, Thierry Bouche. DML 2011: Towards a Digital Mathematics Library. Brno: Masaryk University, 2011. p. 77–84, 8 pp. ISBN 978-80-210-5542-1.

[Figure panels: ‘Unifying Fences’, ‘<mrow> Minimizing’, ‘Sub-/Superscripts Handling’, ‘Applying Functions’]
