Tesserae: addressing scalability & flexibility concerns CHRIS EBERLE

Background

Tesserae: A linguistics project to compare intertextual similarities

Collaboration between the University at Buffalo and UCCS

Live version at http://tesserae.caset.buffalo.edu/

Source code at https://github.com/tesserae/tesserae

Background

The good: Well-designed, proven, robust algorithm

See “Intertextuality in the Digital Age” by Neil Coffee, J.-P. Koenig, Shakthi Poornima, Roelant Ossewaarde, Christopher Forstall, and Sarah Jacobson

See “The Tesserae Project: intertextual analysis of Latin poetry” by Neil Coffee, Jean-Pierre Koenig, Shakthi Poornima, Christopher W. Forstall, Roelant Ossewaarde and Sarah L. Jacobson

Simple website, intuitive operations, meaningful scores (user friendly)

Multi-language support

Large corpus (especially Latin)

Background

The bad: Perl generates PHP, which generates HTML

Error-prone deployments (hand-edit Perl scripts)

The ugly: Mixing data and display layers

Custom file formats: Perl nested dictionaries serialized to external text files (slow)

Results must be partially pre-computed: statistics are pre-computed at ingest time

Text vs. text comparisons are done all at once, in memory; results are written to disk and paginated by another script. Searches represent a “snapshot in time”, not a live search.

No online ingest: everything is offline, involving multiple scripts to massage incoming data

Can only compare one text to another; no per-section, per-paragraph, per-line, or per-author comparisons

Goals

Tesserae-NG: the next generation of Tesserae

Performance: Use live caches and lazy computation where appropriate; no more bulk computation

Make certain operations threaded / parallel

Scalability: Use a proven storage backend (Solr) rather than custom binary formats

Use industry-standard practices to separate data and display, allowing the possibility for clustering, load-balancing, caching, and horizontal scaling as necessary.

Make all operations as parallel as possible

Flexibility: Use Solr’s extensible configuration to support more advanced, flexible searches (more than simple “Text A” vs. “Text B” searches)

Ease of deployment: Create a virtual environment that anyone can easily use to stand up their own instance

User interface: Create a modern, user-friendly interface that both improves on the original design and gives administrators web-based tools to manage their data.

Goals

In short: rewrite Tesserae to address scalability and flexibility concerns

(with a secondary focus on ease of development and a nicer UI)

Architecture

Frontend: Django-powered website with online uploader

Middleware: Asynchronous ingest engine to keep the frontend responsive

Backend: Solr-powered database for data storage and search

Architecture: Frontend

Powered by Django, jQuery, Twitter Bootstrap, and Haystack

Simple MVC paradigm, separation of concerns (no more data logic in the frontend)

Nice template engine, free admin interface, free input filtering / forgery protection.

Responsive modern HTML5 UI thanks to jQuery and Twitter Bootstrap

Python-based, modular, well-documented

Solr searches very easy thanks to Haystack

Scalability provided by uWSGI and Nginx: the interpreter is started only once; bytecode is cached and kept alive

Automatic scaling (multiple cores / multiple machines)

Static content is served without being handled by Python at all, so it is very cheap

Architecture: Middleware

Celery: Accepts texts to ingest

Each text is split into 100-line chunks and distributed amongst workers

Each worker translates the text into something Solr can ingest, and makes the required ingest call to Solr

Highly parallel, fairly robust. Interrupted jobs are automatically re-run.

Ensures that any large texts ingested from the frontend can’t degrade the frontend experience

Uses RabbitMQ to queue up any unprocessed texts
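
The chunk-and-distribute step above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's Celery task: `chunk_lines`, `process_chunk`, and `ingest_text` are hypothetical names, and a local thread pool stands in for Celery workers fed through RabbitMQ.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Tuple

CHUNK_SIZE = 100  # lines per chunk, as stated on the slide


def chunk_lines(lines: List[str], size: int = CHUNK_SIZE) -> Iterator[Tuple[int, List[str]]]:
    """Yield (first_line_number, chunk) pairs; line numbers are 1-based."""
    for start in range(0, len(lines), size):
        yield start + 1, lines[start:start + size]


def process_chunk(job: Tuple[int, List[str]]) -> int:
    """Stand-in for a worker (assumption): a real worker would convert the
    chunk into Solr documents and post them; here we only count the lines."""
    _first_line, chunk = job
    return len(chunk)


def ingest_text(lines: List[str], workers: int = 4) -> int:
    """Distribute chunks among workers and return the number of lines handled."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, chunk_lines(lines)))
```

Carrying the first line number with each chunk lets a worker assign correct line metadata without seeing the rest of the text.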

Architecture: Backend

Apache Solr for storage and search: proven search engine, fast, efficient

Perfectly suited for large quantities of text

Efficient, well-tested storage, easily cacheable, scales well

Flexible schema configuration

Supports any kind of query we wish to perform on the data

Does not have text-vs-text comparison tool built-in

A custom Solr plugin was written to accommodate this, based on the original Tesserae algorithm

Tomcat as the application container: can quickly create a load-balanced cluster if the need arises

Architecture: Other concerns

Web-based ingest is tedious for batch jobs: provide command-line tools to ingest large quantities of texts, just for the initial setup (use of these tools is optional)

Solr’s storage engine can’t / won’t handle some of the metadata that the current Tesserae format expects (e.g. per-text frequency data): use a secondary key-value database on the side to store this extra information (LevelDB, very fast lookups)

Tesserae’s CSV-based lexicon database is too slow and won’t fit into memory: create an offline, one-time transformer to ingest the CSV file into a LevelDB database that will be quicker to read

Metrics (where are the slow points?): use Carbon / Graphite to collect metrics (both stack-wide and in-code)

May want to access texts directly (view-only mode, no search): PostgreSQL for simple storage
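
The one-time lexicon transformer mentioned above might look roughly like the following. This is a sketch under loud assumptions: the standard-library `dbm.dumb` module is swapped in for LevelDB so the example runs without extra packages, the two-column `form,stem` CSV layout is hypothetical (the real lexicon file may differ), and `build_lexicon_db` and `lookup` are illustrative names.

```python
import csv
import dbm.dumb
from collections import defaultdict
from typing import Iterable, List


def build_lexicon_db(csv_rows: Iterable[str], db_path: str) -> None:
    """One-time transform: read 'form,stem' rows and write two on-disk lookups.

    Keys prefixed 'f:' map an inflected form to its stems;
    keys prefixed 's:' map a stem to all of its known forms.
    """
    stems_of = defaultdict(set)
    forms_of = defaultdict(set)
    for row in csv.reader(csv_rows):
        if len(row) != 2:  # skip blank or malformed rows
            continue
        form, stem = row
        stems_of[form].add(stem)
        forms_of[stem].add(form)
    # dbm.dumb is a stand-in for LevelDB (assumption); flag "n" creates a fresh DB.
    with dbm.dumb.open(db_path, "n") as db:
        for form, stems in stems_of.items():
            db[("f:" + form).encode("utf-8")] = ",".join(sorted(stems)).encode("utf-8")
        for stem, forms in forms_of.items():
            db[("s:" + stem).encode("utf-8")] = ",".join(sorted(forms)).encode("utf-8")


def lookup(db_path: str, prefix: str, key: str) -> List[str]:
    """Return the stored list for a form ('f') or stem ('s'); empty if unknown."""
    with dbm.dumb.open(db_path, "r") as db:
        raw = db.get((prefix + ":" + key).encode("utf-8"))
    return raw.decode("utf-8").split(",") if raw else []
```

Once built, each lookup touches only the requested key, so the full lexicon never has to be loaded into memory.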

Architecture

Solr Plugin

No built-in capability for Solr to compare one document to another

Solr is a simple web wrapper with configuration files

Uses Lucene under the covers for all heavy lifting

No built-in support for comparisons in Lucene either, but writing a Solr wrapper to do this is possible

Solr Plugin: Design decisions

What will be searched? Simple one document vs another?

Portions of a document vs another?

Actual text within document?

What is a “document”? A text? A volume of texts?

General approach: Treat each line in the original text as its own document

This “minimal unit” is configurable at install time

Dynamically assemble two “texts” at runtime based on whatever parameters the user wishes.

Can compare two texts, two volumes, two authors, a single line vs. a whole text, a portion of a text vs. an entire author, etc.

Only limited by the expressive power of Solr’s search syntax, and the schema

Solr Plugin: Schema Example

Author  Title          Volume  Line #  Text
Lucan   Bellum Civile  1       1       Bella per Emathios plus quam civilia campos
Lucan   Bellum Civile  1       2       Iusque datum sceleri canimus, populumque potentem
Lucan   Bellum Civile  1       3       In sua victrici conversum viscera dextra,
…
Vergil  Aeneid         12      950     hoc dicens ferrum adverso sub pectore condit
Vergil  Aeneid         12      951     fervidus. Ast illi solvuntur frigore membra
Vergil  Aeneid         12      952     vitaque cum gemitu fugit indignata sub umbras.

Each row, in Solr parlance, is called a “document”. From the user’s perspective these are really document fragments. Each “document” has a unique ID and can be addressed individually. We can combine them at runtime into two “pools” of documents, which are then compared to one another for similarity.
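
To make the "one line per document, pools assembled at runtime" idea concrete, here is a small in-memory sketch. It does not use Solr: `LineDoc` and `pool` are hypothetical names, and plain Python predicates stand in for Solr queries. The rows are the ones from the schema example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class LineDoc:
    """One 'document' in the Solr sense: a single line plus its metadata."""
    author: str
    title: str
    volume: int
    line: int
    text: str


CORPUS = [
    LineDoc("Lucan", "Bellum Civile", 1, 1, "Bella per Emathios plus quam civilia campos"),
    LineDoc("Lucan", "Bellum Civile", 1, 2, "Iusque datum sceleri canimus, populumque potentem"),
    LineDoc("Lucan", "Bellum Civile", 1, 3, "In sua victrici conversum viscera dextra,"),
    LineDoc("Vergil", "Aeneid", 12, 950, "hoc dicens ferrum adverso sub pectore condit"),
    LineDoc("Vergil", "Aeneid", 12, 951, "fervidus. Ast illi solvuntur frigore membra"),
    LineDoc("Vergil", "Aeneid", 12, 952, "vitaque cum gemitu fugit indignata sub umbras."),
]


def pool(docs: List[LineDoc], matches: Callable[[LineDoc], bool]) -> List[LineDoc]:
    """Assemble a pool of line-documents; the predicate plays the role of a query."""
    return [d for d in docs if matches(d)]


# A single line of one text compared against everything by one author.
source = pool(CORPUS, lambda d: d.title == "Bellum Civile" and d.line == 2)
target = pool(CORPUS, lambda d: d.author == "Vergil")
```

Because the unit of storage is the line, any combination of metadata fields can define either side of the comparison.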

Solr Plugin: Ingest Logic

Receive a batch of lines + metadata

For each line, do the following:

Split the line into words (done automatically with Solr’s tokenizer)

Take each word, normalize it, and look up the stem word from a Latin lexicon DB

Look up all forms of the stem word in the DB

Place the original word and all other forms of the word in the Solr index

Encode the form into the word so we can determine at search time which form it is

Allows this line to match no matter which form of a word is used

Update a global (language-wide) frequency database with the original word, and all other forms of the word

Metadata is automatically associated, no intervention required

Final “document” is stored and indexed by Solr. Term vectors are calculated automatically.
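
The per-line ingest steps above can be sketched as follows. This is not the plugin's code (which runs inside Solr): the toy lexicon, the `form|kind` token encoding (`o` for the original word, `a` for an alternate form), and the name `index_tokens` are all assumptions made for illustration, and normalization is reduced to lowercasing.

```python
import re
from collections import Counter
from typing import Dict, List, Set

# Hypothetical toy lexicon; the real one is a large Latin lexicon database.
STEMS_OF: Dict[str, Set[str]] = {"bella": {"bellum"}, "campos": {"campus"}}
FORMS_OF: Dict[str, Set[str]] = {
    "bellum": {"bellum", "bella", "bello"},
    "campus": {"campus", "campos", "campi"},
}

# Language-wide frequency table, updated on every ingest.
global_freq: Counter = Counter()


def index_tokens(line: str) -> List[str]:
    """Return the tokens to index for one line: each word plus all its other forms."""
    tokens = []
    for raw in re.findall(r"[^\W\d_]+", line):  # runs of letters only
        word = raw.lower()                      # simplified normalization (assumption)
        forms = {word}
        for stem in STEMS_OF.get(word, ()):
            forms |= FORMS_OF.get(stem, set())
        for form in sorted(forms):
            kind = "o" if form == word else "a"  # encode which form this is
            tokens.append(f"{form}|{kind}")
            global_freq[form] += 1
    return tokens
```

Indexing every form is what lets a line match regardless of which inflection the other text uses, at the cost of a larger index.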

Solr Plugin: Search Logic

Take in two queries from the user: a source query and a target query

Gather together Solr documents that match each query: collect each result set in parallel as the “source set” and the “target set”

Treat the two result sets as two large meta-documents

Dynamically build frequency statistics on each meta-document

Dynamically construct a stop-list based on global statistics

Global statistics must persist from one run to the next, so use an external DB

Global statistics don’t change from one search to the next, so they are cached

Run the core Tesserae algorithm on the two meta-documents

Compare all-vs-all, only keeping line-pairs that share 2 or more terms

Words that are found in the stoplist above are ignored

Calculate distances for each pair, throw away distances above some threshold

Calculate a score based on distance and frequency statistics

Order results by this final score (high to low)

Format results, try to determine which words need highlighting

Stream result to caller (pagination is automatic thanks to Solr)
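
The search steps above can be illustrated with a simplified, in-memory sketch. It is not the plugin's implementation: the distance measure (span between the first and last shared word on each side) and the score (log of summed inverse frequencies divided by distance) are simplifications chosen for this example, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product
from typing import Dict, List, Set, Tuple

Line = Tuple[str, List[str]]  # (line identifier, normalized words in order)


def stoplist(global_freq: Counter, size: int) -> Set[str]:
    """The `size` most frequent words according to global statistics."""
    return {word for word, _ in global_freq.most_common(size)}


def frequencies(lines: List[Line]) -> Dict[str, float]:
    """Relative frequency of each word within one meta-document."""
    counts = Counter(w for _, words in lines for w in words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def span(words: List[str], shared: Set[str]) -> int:
    """Number of words from the first to the last shared word, inclusive."""
    positions = [i for i, w in enumerate(words) if w in shared]
    return positions[-1] - positions[0] + 1


def compare(source: List[Line], target: List[Line], stop: Set[str],
            max_distance: int = 10) -> List[Tuple[float, str, str]]:
    """All-vs-all comparison; returns (score, source_id, target_id), best first."""
    f_src, f_tgt = frequencies(source), frequencies(target)
    results = []
    for (sid, s_words), (tid, t_words) in product(source, target):  # O(m*n)
        shared = (set(s_words) & set(t_words)) - stop
        if len(shared) < 2:          # keep only pairs sharing 2 or more terms
            continue
        distance = span(s_words, shared) + span(t_words, shared)
        if distance > max_distance:  # throw away pairs that are too spread out
            continue
        rarity = sum(1 / f_src[w] + 1 / f_tgt[w] for w in shared)
        results.append((math.log(rarity / distance), sid, tid))
    return sorted(results, reverse=True)
```

Rare shared words that sit close together score highest, which matches the intent of weighting by both frequency and distance.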

Solr Plugin: Flexible Query Language

Compare “Bellum Civile” with “Aeneid” (all volumes): http://solrhost:8080/solr/latin?tess.sq=title:Bellum%20Civile&tess.tq=title:Aeneid

Compare line 6 of “Bellum Civile” with all of Vergil’s works: http://solrhost:8080/solr/latin?tess.sq=title:Bellum%20Civile%20AND%20line:6&tess.tq=author:Vergil

Compare line 3 of “Aeneid” volume 1 with line 10 of “Aeneid” volume 1: http://solrhost:8080/solr/latin?tess.sq=title:Aeneid%20AND%20volume:1%20AND%20line:3&tess.tq=title:Aeneid%20AND%20volume:1%20AND%20line:10

Rich query language provided by Solr; most queries are easily supported: https://wiki.apache.org/solr/SolrQuerySyntax
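
Building such URLs by hand is error-prone, so a small helper may be convenient. The sketch below uses only the standard library; the host, path, and the `tess.sq` / `tess.tq` parameter names come from the examples above, while the function name `compare_url` is an assumption.

```python
from urllib.parse import quote, urlencode


def compare_url(base: str, source_query: str, target_query: str) -> str:
    """Build a comparison URL from a source query and a target query.

    Spaces are percent-encoded as %20 and ':' is left readable,
    matching the style of the example URLs.
    """
    params = {"tess.sq": source_query, "tess.tq": target_query}
    return base + "?" + urlencode(params, quote_via=quote, safe=":")


url = compare_url(
    "http://solrhost:8080/solr/latin",
    "title:Bellum Civile AND line:6",
    "author:Vergil",
)
```

Here `url` reproduces the second example above (line 6 of “Bellum Civile” against all of Vergil’s works).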

Solr Plugin: Difficulties

Solr is optimized for text search, not text comparison

Bulk reads of too many documents can be very slow because the index isn’t used

Rather than loading the actual documents, use an experimental feature called “Term Vectors”, which stores frequency information for the row directly in the index.

Use the Term Vectors exclusively until the actual document is needed

The meta-document approach makes it impossible to pre-compute statistics. Calculating them at runtime is somewhat costly. Using a cache partially mitigates this problem for related searches.

The original Tesserae has a multi-layered index: actual word + location -> stemmed word + all other forms

Allows the engine to make decisions about which word form to use at each stage of the search

Solr is flat: word + location

Had to “fake” the above hierarchy by packing extra information into each word

This means each word must still be split apart and parsed, which can be slow for large document collections.

Would need a custom Solr storage engine to fix this (yes, this is possible – Solr is very pluggable)

Would also need my own Term Vector implementation (also possible)
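
The cost described above, building meta-document statistics from per-line term vectors whose tokens carry packed information, can be sketched like this. The `form|kind` packing (`o` for an original word, `a` for an alternate form) is a hypothetical encoding invented for this example, plain dictionaries stand in for Lucene term vectors, and the function names are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

TermVector = Dict[str, int]  # packed token -> occurrences within one line


def unpack(token: str) -> Tuple[str, str]:
    """Split a packed token 'form|kind' back into (form, kind)."""
    form, _, kind = token.rpartition("|")
    return form, kind


def meta_document_counts(vectors: Iterable[TermVector],
                         originals_only: bool = True) -> Counter:
    """Sum per-line term vectors into word counts for one meta-document.

    Every token has to be parsed before it can be used, which is the
    per-word overhead a flat index imposes.
    """
    counts: Counter = Counter()
    for vector in vectors:
        for token, occurrences in vector.items():
            form, kind = unpack(token)
            if originals_only and kind != "o":
                continue
            counts[form] += occurrences
    return counts
```

Only the term vectors are read here; the stored line text is never loaded, which mirrors the "use the Term Vectors exclusively until the actual document is needed" approach.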

Easy deployment: Vagrant

Many components, complicated build process, multiple languages, dozens of configuration files

Need to make this easy to deploy, or no one will use this

Solution: Vagrant

Create a Linux image by hand with some pre-installed software: Java, Tomcat, Postgres, Maven, Ant, Sbt, Nginx, Python, Django, RabbitMQ, etc.

Store all code, setup scripts, and configuration in git

Automatically download the Linux image, provision it, and lay down the custom software and configuration.

Automatically start all services, and ingest base corpora

Entire deployment boiled down to one command: vagrant up

Average deployment time: 10 minutes

Encourages more participation (lower barrier to entry)

The final product

Step 1: Clone the project

Step 2: vagrant up (automatic provisioning, install, config, and ingest)

Step 3: Search

Live Demo

Results

Results are generated within a similar time frame to the original (a couple of seconds on average for one core)

Scores are nearly identical (many thanks to Walter Scheirer and his team for the help on translating and explaining the original algorithm, as well as testing the implementation).

Results are truly dynamic, no need to pre-compute / pre-sort: no temporary or session files used

Related accesses are very fast (tens of milliseconds): faster than the original site

Possible thanks to Solr’s ability to cache search results

Scales very well: numbers are relatively constant regardless of how many other documents occupy the database (storage volume doesn’t impede speed)

Can be made noticeably faster by deploying on a multi-core machine

Biggest determining speed factor is how big the two “meta-documents” are: the search can’t be made truly parallel, since each phase relies on the previous one being done

Only data that will be displayed is actually transmitted; no wasted bandwidth per search.

Analysis

Success!

Both primary and secondary goals were met

While single searches on single-core setups won’t see any improvements, using multiple cores definitely improves speed

All original simple-search functionality is intact

New functionality added:

Sub/super-document comparisons via custom plugin

Single-document text search is a given with Solr

Solr multi-core support: multiple instances of Solr can be configured to run at the same time, which allows not only multiple languages but also multiple arbitrary configurations.

Online asynchronous ingest

Search and storage caching

Web-based administration

Because Solr uses the JVM, no need to run a costly interpreter for each and every search – JVM will compile the most-used pieces of code to near-native speeds.

Original scoring algorithm is O(m*n) (as a result of the all-vs-all comparison) – parallelism only helps so much

Conclusion

The results speak for themselves

Unfortunate that Solr doesn’t have a built-in comparison endpoint

Writing my own turned out to be necessary anyway; it is doubtful a built-in one would have a scoring scheme based on the original Tesserae algorithm

Lucene API provided everything needed to do this comparison, very few “hacks” necessary

Should provide the Tesserae team with a nice framework moving forward

Easy to deploy

Separation of concerns

Nice UI: simple, scriptable MVC frontend

Written against a well-documented set of APIs

Robust backend: scales better than the Perl version

A formal, type-checked, thread-safe, compiled language for the core algorithm

Written against a well-documented set of APIs

Rich batch tools

Future work

UI frontend:

Add more advanced search types to frontend

Full UI management of ingested texts (view, update, delete)

Free-text search of available texts

Solr backend:

Word highlighting (expensive right now)

Core algorithm: address O(n*m) implementation

Refactor code, a tad jumbled right now

Address slow ingest speed

Add support for index rebuild

Vagrant / installer:

Flesh out “automatic” corpora selection

Multi-VM installer (automatic load balancing)

Further information

Source code at https://github.com/eberle1080/tesserae-ng

Documentation at https://github.com/eberle1080/tesserae-ng/wiki

Live version at http://tesserae-ng.chriseberle.net/

SLOC statistics:

3205 lines of Python

3119 lines of Scala

2034 lines of XML

719 lines of Bash

548 lines of HTML

237 lines of Java

Questions?
