Empowering EPrints Search with Xapian
Sébastien François, EPrints Lead Developer
EPrints Developer Powwow, ULCC
Review of EPrints Internal Search
Indexing
Searching
Extras
TO-DO’s
Using & contributing
Demo(s)
Summary
EPrints “Internal” Search - Overview
[Diagram: Search, Field, DataSet, MetaField, Condition, List; associations labelled 1 and 1..n]
match = “EX” queries the main & auxiliary dataset tables
match = “IN” queries the __rindex dataset table
ordering is done via the __ordervalues_$langid dataset table
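The table routing described above can be sketched as follows. This is only an illustration: the function name is hypothetical, and it assumes the main table is named after the dataset id (auxiliary tables are left out).

```python
def tables_for_search(dataset: str, match: str, langid: str = "en") -> dict:
    """Return the tables an internal search would touch (illustrative only)."""
    if match == "EX":
        # Exact match: main dataset table (auxiliary tables omitted in this sketch).
        query_table = dataset
    elif match == "IN":
        # Index match: the dataset's __rindex table.
        query_table = f"{dataset}__rindex"
    else:
        raise ValueError(f"unsupported match type: {match!r}")
    # Ordering always goes through the per-language ordervalues table.
    return {"query": query_table, "order": f"{dataset}__ordervalues_{langid}"}
```

For example, `tables_for_search("eprint", "IN")` points the query at `eprint__rindex` and the ordering at `eprint__ordervalues_en`.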
EPrints “Internal” Search – Overview (2)
Simple search is not scalable
Lots of derived data in the DB (backup?)
No relevance matching -> good matches do not surface
No advanced features: suggestions, facets, boolean operators, etc.
Home-brewed: hard to maintain the code, hard to extend
Difficult to debug…
EPrints “Internal” Search – Downsides
Introduced in 3.3
Only integrated with the simple search
Little flexibility in controlling what is indexed
Advanced features “not really” enabled
Searches every field (“text_index” not respected)
But the idea is good & worth building upon
EPrints Xapian Search
Attempts to re-use EPrints’ default configuration:
◦ datasets’ field definition (+ “text_index”)
◦ fields defined in the simple search (un-prefixed terms)
But needs its own bits to define:
◦ default indexing methods (by MetaField type)
◦ facet-able indexes
◦ order-able indexes
May be used to declare derived indexes – examples:
◦ “open_access”: to filter references with open full-text documents
◦ “year”: to filter by year of publication (rather than by date)
◦ “image_orientation”: if you had an archive of images, you could extract the orientation via EXIF
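A minimal sketch of what two of the derived indexes above might compute. The record layout (a `date` string and a list of `documents` with a `security` value) and all function names are assumptions made for illustration, not the plugin's actual configuration format.

```python
def derive_year(record: dict):
    """Derive a 4-digit year from a date such as '2013', '2013-06' or '2013-06-01'."""
    year = (record.get("date") or "")[:4]
    return year if len(year) == 4 and year.isdigit() else None


def derive_open_access(record: dict) -> bool:
    """Assumed rule: a record is open access if at least one document is public."""
    return any(doc.get("security") == "public" for doc in record.get("documents", []))


# Hypothetical registry: derived index name -> function producing its value.
DERIVED_INDEXES = {"year": derive_year, "open_access": derive_open_access}


def derived_values(record: dict) -> dict:
    """Compute every derived index value for one record."""
    return {name: fn(record) for name, fn in DERIVED_INDEXES.items()}
```

Keeping each derived index as a small function makes it easy to add others (such as an EXIF-based orientation) without touching the indexing loop.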
Indexing
Indexing - Classes
[Diagram: Xapian::Index, IndexMethod, Config, OrderMethod, XapianDB; method labels: “Fulltext, Name, etc.” and “Alpha., Name, etc.”]
Indexes are prefixed by “_”, e.g. “_title”, so we can sanitise the user query; otherwise users could run prefixed searches (and search fields that are not necessarily allowed)
Z notation: indicates a stemmed value or index, e.g. Z_title, Zhappi (internal Xapian convention)
Script available to re-process the Xapian indexes (similar to “epadmin reindex” but does not re-index EPrints’ internal indexes)
Reserved indexes:
◦ _id: keep the internal id of the data-obj (/id/eprint/123)
◦ _dataset: the dataset the record belongs to (‘eprint’, ‘user’…)
◦ _configuration_md5: keeps an MD5 of the configuration the item was indexed against (useful?)
◦ _index_timestamp: when the item was last indexed
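The “_” prefix convention above is what makes query sanitisation possible. The sketch below is one way this could work, under stated assumptions: users may type `field:term`, only fields on a hypothetical allow-list are mapped to their “_”-prefixed index, and any other prefix (including reserved ones such as `_id`) is dropped so the term is searched as plain text.

```python
import re

# Hypothetical allow-list of fields users may target explicitly.
ALLOWED_FIELDS = {"title", "creators_name", "abstract"}

# Matches tokens of the form field:term.
_PREFIXED = re.compile(r"^(?P<field>\w+):(?P<term>.+)$")


def sanitise_query(user_query: str) -> str:
    """Rewrite allowed prefixes to internal '_' indexes; strip all other prefixes."""
    out = []
    for token in user_query.split():
        m = _PREFIXED.match(token)
        if m is None:
            out.append(token)  # un-prefixed term: keep as-is
        elif m.group("field") in ALLOWED_FIELDS:
            out.append(f"_{m.group('field')}:{m.group('term')}")
        else:
            out.append(m.group("term"))  # disallowed or reserved prefix: drop it
    return " ".join(out)
```

With this allow-list, `title:xapian _id:123 search` becomes `_title:xapian 123 search`: the user can never reach an internal index directly.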
Indexing – Extra information
Again, attempts to re-use EPrints’ configuration:
◦ simple search (mostly for ordering methods)
◦ advanced/staff search: which fields to use (prefixed terms)
Extra bits can be configured such as which facets can be used on each search (simple, advanced, …)
Only indexed stuff can be searched
◦ you cannot use a facet which has not been generated
◦ you need to re-index your data if you change the simple search def.
◦ same if you add new order-able fields
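Since a configuration change requires re-indexing, a stored configuration hash (like the reserved _configuration_md5 index) could be used to spot stale items. The slides do not say how the MD5 is computed or whether it is used this way; the sketch below assumes the configuration can be serialised to JSON, and the function names are hypothetical.

```python
import hashlib
import json


def configuration_md5(config: dict) -> str:
    """Hash a configuration in a key-order-independent way."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def needs_reindex(stored_md5: str, current_config: dict) -> bool:
    """True if the item was indexed against a different configuration."""
    return stored_md5 != configuration_md5(current_config)
```

Adding a new facet or order-able field changes the hash, so every item indexed before the change would be flagged for re-indexing.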
Searching
Abstracted by Plugin::Search (original implementation)
Tricky to make it work with EPrints’ UI because it expects an EPrints::Search object
Plugin::Search::Internal is a wrapped EPrints::Search object (hack) so Plugin::Search::Xapian must emulate this behaviour
Searching (2)
Searching – Classes & Op. Stack
[Diagram: /cgi/xapian, Search::XapianSearch, Paginate::Facets, Plugin::Search::Xapian, Xapian::Facets, Xapian DB]
May be used in a script
Exports & feeds work
Can be serialised/de-serialised (including facets) so should work for Saved Searches (to test)
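To illustrate the serialise/de-serialise round trip a Saved Search relies on, here is a small sketch. The `SavedSearch` class and its JSON layout are hypothetical and are not the plugin's actual serialisation format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SavedSearch:
    query: str
    order: str = ""
    # Facet name -> list of selected values.
    facets: dict = field(default_factory=dict)

    def serialise(self) -> str:
        """Turn the search (including facets) into a storable string."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def deserialise(cls, text: str) -> "SavedSearch":
        """Rebuild the search from its stored string."""
        return cls(**json.loads(text))
```

A search restored with `SavedSearch.deserialise(s.serialise())` compares equal to the original `s`, selected facets included.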
Searching – Extra information
“Related Items”
Jiadi has developed a Bootstrap-based Pagination module:
◦ more sexy
◦ supports alternative “views” of the search results
Extras
Range searching: possible in Xapian but not yet implemented (e.g. 1..10)
Some refactoring:
◦ Xapian::Index -> Xapian::Indexer
◦ Plugin::Search::Xapianv2 -> Plugin::Search::Xapian (and replace EPrints’ default Xapian implementation)
Test with real life data (done to a certain extent...)
Load & scalability testing (+ number of slots etc.)
Multi-lang considerations (and related IndexMethod)
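For the range syntax mentioned above (e.g. 1..10), a first step would be recognising range tokens in the user query. The sketch below only parses integer ranges, open-ended ones included; it is a hypothetical helper and does not show how the range would then be passed to Xapian.

```python
def parse_range(token: str):
    """Parse 'low..high' into (low, high); a missing bound is None.

    Returns None if the token is not a valid integer range.
    """
    if ".." not in token:
        return None
    low, _, high = token.partition("..")
    try:
        lo = int(low) if low else None
        hi = int(high) if high else None
    except ValueError:
        return None
    if lo is None and hi is None:
        return None  # a bare '..' carries no bounds
    if lo is not None and hi is not None and lo > hi:
        return None  # reject inverted ranges
    return (lo, hi)
```

Tokens that fail to parse can simply fall through to normal term handling.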
TO-DO’s
Page displaying how a data-obj has been indexed
◦ prefixes
◦ terms
◦ facets & order-able fields
Status page (cf. “Admin > Status”):
◦ DB size
◦ number of Documents
◦ indexed datasets (and how)
Weighting: supported (via conf.) but un-tested in real life
TO-DO’s – Would be nice
Xapian is more of a user search
The internal search is still required to:
◦ get records from the Database ($dataset->search())
◦ this affects screens such as “Manage Deposits”, “Review”, etc., which cannot wait for items to be indexed (direct DB calls)
◦ may be needed to apply ACLs (if some items cannot be searched): safer to use the (MySQL) DB as the authority
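The “DB as authority” idea above amounts to post-filtering Xapian hits with a check against the database. A minimal sketch, where `can_view` is a hypothetical callback standing in for whatever database-backed permission check applies:

```python
from typing import Callable, Iterable, List


def filter_hits(hit_ids: Iterable[int], can_view: Callable[[int], bool]) -> List[int]:
    """Keep only the hits the database says the user may see, preserving rank order."""
    return [item_id for item_id in hit_ids if can_view(item_id)]
```

Because the filter preserves the order of `hit_ids`, the relevance ranking from the search index survives while the access decision stays with the database.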
Internal Search vs Xapian Search
Plugin::Search::Xapian may be set to debug mode: shows processing and query building
Xapian comes with an analysis tool, “delve” to:
◦ view the content of the Xapian DB or some selected Documents
◦ see if a term exists in the DB (and in which Documents)
◦ other info (term frequency etc.)
Knowing what Xapian is searching and how a data-obj is indexed is key to debugging most search-related issues
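The delve checks listed above could be scripted along these lines. The helper only builds the argument list; it assumes the tool's `-t` (posting list for a term), `-r` (term list for a record) and `-a` (all terms) options, and that the binary is called `delve` (newer Xapian releases ship it as `xapian-delve`). The function name and the database path in the usage note are placeholders.

```python
def delve_command(db_path: str, term=None, recno=None, binary: str = "delve") -> list:
    """Build an argv list for inspecting a Xapian database with delve."""
    argv = [binary]
    if term is not None:
        argv += ["-t", term]  # which Documents contain this term?
    if recno is not None:
        argv += ["-r", str(recno)]  # which terms does this Document hold?
    if term is None and recno is None:
        argv.append("-a")  # no filter given: list all terms in the DB
    argv.append(db_path)
    return argv
```

The result can be passed to `subprocess.run`, for example `subprocess.run(delve_command("/path/to/xapian/db", term="Zhappi"))` to see which Documents contain the stemmed term Zhappi.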
Debugging Xapian
Not quite at release stage, but it is currently isolated so it shouldn’t break your IR
All the code is on GitHub:
https://github.com/eprints/xapianv2
Using & Contributing
http://puffin.ecs.soton.ac.uk/cgi/xapian
Simple search / facets / export / order
Simple search with boolean op’s, suggestion
Advanced search / facets / export / order
Related items
http://vmdev1.eprints.org/cgi/xapian (more data + cached citations)
http://vmdev1.eprints.org/cgi/xapian_status
Demos
Let’s have a play?
Code overview?
Doc?
Q&A & what’s next