Creating an Open Source Genealogical Search Engine with Apache Solr

DESCRIPTION

Set Your Records Free! LeafSeek is a new tool that helps you turn your genealogical or historical record collections into searchable online databases. Combine multiple datasets of different types — such as birth, marriage, and military records — into one unified searchable website. Find inter-connections in your data that you never noticed before. With great features like built-in geo-spatial searches, pop-up Google Maps, Beider-Morse Phonetic Matching, name synonyms, and language localization, LeafSeek can help you turn your spreadsheets of names and dates into a full-featured genealogy search engine. It’s designed for researchers and genealogy societies alike. Oh, and one more thing: LeafSeek is free and open source. No strings attached.

Page 1: Creating an Open Source Genealogical Search Engine with Apache Solr

Creating an Open Source Genealogical Search Engine With Apache Solr

Brooke Schreier Ganz

Twitter: @LeafSeek

www.LeafSeek.com

Page 2: Creating an Open Source Genealogical Search Engine with Apache Solr

Hi, I'm Brooke

• I make web stuff for fun, and (sometimes) for profit

• Web Developer at IBM.com and Disney Consumer Products

• Lead Programmer at TMZ.com (yikes, sorry about that)

• Senior Web Producer at Bravo cable TV network and its spin-off websites

• Big dork

• Big genealogy dork

• #BigData dork

Page 3: Creating an Open Source Genealogical Search Engine with Apache Solr

Meet Gesher Galicia

• Non-profit 501(c)(3) genealogy society

• Founded in 1993

• Hundreds of members, worldwide

• E-mail discussion group

• New website development in progress

(existing website is fugly)

• Needs a search engine…for data

Page 7: Creating an Open Source Genealogical Search Engine with Apache Solr

The Old Problem

Page 8: Creating an Open Source Genealogical Search Engine with Apache Solr

The Old Problem

Page 9: Creating an Open Source Genealogical Search Engine with Apache Solr

The New Problem

Page 10: Creating an Open Source Genealogical Search Engine with Apache Solr

The New Problem

• Diverse Data Languages

(German, Polish, Ukrainian, Russian, Yiddish, Hebrew, English…)

• Diverse Data Types

(births, marriages, deaths, divorces, tax lists, landsmanschaften lists, industrial permit lists, school yearbooks, governmental yearbooks…)

Page 11: Creating an Open Source Genealogical Search Engine with Apache Solr

Diverse Data Shapes

Page 12: Creating an Open Source Genealogical Search Engine with Apache Solr

Diverse Data Shapes

Page 13: Creating an Open Source Genealogical Search Engine with Apache Solr

Diverse Data Shapes

Page 14: Creating an Open Source Genealogical Search Engine with Apache Solr

Existing solutions

• They're okay...for small numbers of databases, with small amounts of data

– Steve Morse's One-Step Tool Creator

– Roll-your-own solution with PHP and MySQL

• Both get more difficult to manage as data sets increase in number and complexity

Page 15: Creating an Open Source Genealogical Search Engine with Apache Solr

In space, no one can hear your data scream

Page 16: Creating an Open Source Genealogical Search Engine with Apache Solr

To Sum Up

• There are lots of ways to publish your tree

• …but not so many ways to publish your data

• Surely there must be a way to deal with this?

Page 19: Creating an Open Source Genealogical Search Engine with Apache Solr

So I Made A Thing

But “That Thing I Made With The Database And Stuff”

was kind of an awkward name, so I called it

LeafSeek

Page 20: Creating an Open Source Genealogical Search Engine with Apache Solr

This is the part where I show you all

the shiny new All Galicia Database

http://search.geshergalicia.org/

Page 21: Creating an Open Source Genealogical Search Engine with Apache Solr

Meet Apache Solr

• Highly functional open source search platform

• Based on Apache Lucene (Java)…

• …plus a web wrapper/API

• Not the prettiest or simplest tool

• FREE and open source

Page 24: Creating an Open Source Genealogical Search Engine with Apache Solr

Saves Time, and Heartache

Page 26: Creating an Open Source Genealogical Search Engine with Apache Solr

Saves Time, and Stomachache

Page 29: Creating an Open Source Genealogical Search Engine with Apache Solr

File Structure: Back-End

Page 30: Creating an Open Source Genealogical Search Engine with Apache Solr

Welcome to /conf

Page 31: Creating an Open Source Genealogical Search Engine with Apache Solr

The Important Stuff

Page 32: Creating an Open Source Genealogical Search Engine with Apache Solr

solrconfig.xml

Page 33: Creating an Open Source Genealogical Search Engine with Apache Solr

solrconfig.xml

Make sure this part is configured, so you can import data:
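A minimal sketch of what that part usually looks like, assuming the stock DataImportHandler that ships with Solr and the db-data-config.xml file named later in this deck. The Python below only prints the XML element that would sit inside solrconfig.xml; the `/dataimport` handler name is the conventional one and is an assumption here, not something the slides confirm.

```python
import xml.etree.ElementTree as ET


def dataimport_handler(config_file: str = "db-data-config.xml") -> str:
    """Return the <requestHandler> element that registers DataImportHandler."""
    handler = ET.Element("requestHandler", {
        "name": "/dataimport",  # assumed (conventional) handler path
        "class": "org.apache.solr.handler.dataimport.DataImportHandler",
    })
    # The "defaults" list tells the handler which data config file to read.
    defaults = ET.SubElement(handler, "lst", {"name": "defaults"})
    config = ET.SubElement(defaults, "str", {"name": "config"})
    config.text = config_file
    return ET.tostring(handler, encoding="unicode")


print(dataimport_handler())
```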

Page 34: Creating an Open Source Genealogical Search Engine with Apache Solr

How to get your data into Solr

• Step 1: Make a properly-formatted spreadsheet

• Step 2: Save spreadsheet as a .CSV file

• Step 3: Create a MySQL database + table

• Step 4: Import CSV into that new table

• Step 5: Add a Unique Auto-Incrementing Primary Key called “id” (INT)

• Step 6: Add this table's information to db-data-config.xml
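Steps 2 through 5 can be sketched in a few lines. This is an illustration only: it uses Python's built-in sqlite3 as a stand-in for MySQL so it runs anywhere, and the table name, column names, and sample rows are invented for the example.

```python
import csv
import io
import sqlite3

# Invented sample data standing in for a spreadsheet saved as .CSV (Step 2).
SAMPLE_CSV = """Surname,GivenName,Year,Town
Ganz,Chaim,1852,Lemberg
Schreier,Rivka,1861,Tarnopol
"""


def load_csv(conn, table, csv_text):
    """Create a table from the CSV header and import every row into it."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    # Steps 3 and 5: new table with an auto-incrementing integer key named "id".
    conn.execute(
        f'CREATE TABLE "{table}" (id INTEGER PRIMARY KEY AUTOINCREMENT, {columns})'
    )
    # Step 4: import the CSV rows.
    names = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    rows = list(reader)
    conn.executemany(
        f'INSERT INTO "{table}" ({names}) VALUES ({placeholders})', rows
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
count = load_csv(conn, "births", SAMPLE_CSV)
print(count, "rows imported")
```

In MySQL the same idea would use an INT column with AUTO_INCREMENT as the primary key.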

Page 37: Creating an Open Source Genealogical Search Engine with Apache Solr

db-data-config.xml

• Basic XML file that tells Solr how to grab data from your MySQL database(s)

• Add new <dataSource> for new databases

• Add new <entity> for new tables within the databases

• You need to make sure your MySQL connector .jar is installed for this to work
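To make the shape of that file concrete, here is a hedged sketch that prints a skeleton db-data-config.xml with one <dataSource> and one <entity> per table. The element and attribute names follow Solr's DataImportHandler conventions; the database URL, credentials, and table names are placeholders made up for the example.

```python
import xml.etree.ElementTree as ET


def build_data_config(db_url, user, password, tables):
    """Return a skeleton DataImportHandler config as an XML string."""
    root = ET.Element("dataConfig")
    # One <dataSource> per database; this one uses the MySQL JDBC driver,
    # which is why the MySQL connector .jar has to be installed.
    ET.SubElement(root, "dataSource", {
        "type": "JdbcDataSource",
        "driver": "com.mysql.jdbc.Driver",
        "url": db_url,
        "user": user,
        "password": password,
    })
    document = ET.SubElement(root, "document")
    # One <entity> per table within that database.
    for table in tables:
        ET.SubElement(document, "entity", {
            "name": table,
            "query": f"SELECT * FROM {table}",
        })
    return ET.tostring(root, encoding="unicode")


# Hypothetical database and table names, for illustration only.
xml_text = build_data_config(
    "jdbc:mysql://localhost/genealogy", "solr_user", "change-me",
    ["births", "marriages"],
)
print(xml_text)
```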

Page 39: Creating an Open Source Genealogical Search Engine with Apache Solr

Import!
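With the handler and data config in place, an import is normally triggered by requesting the handler's URL with a command parameter. A small sketch that only builds that URL, assuming a local Solr on its default port and a handler registered at `/dataimport` (both assumptions):

```python
from urllib.parse import urlencode


def full_import_url(solr_base="http://localhost:8983/solr", clean=True):
    """Build the URL that asks DataImportHandler to run a full import."""
    params = {
        "command": "full-import",
        "clean": str(clean).lower(),  # "true" wipes the index before importing
        "commit": "true",             # make the new documents searchable
    }
    return f"{solr_base}/dataimport?{urlencode(params)}"


print(full_import_url())
```

Opening that URL in a browser, or fetching it with any HTTP client, starts the import.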

Page 41: Creating an Open Source Genealogical Search Engine with Apache Solr

schema.xml

• FieldTypes, Fields, and CopyFields

• FieldTypes give indexing and querying instructions to “buckets”

• Fields say what's what and whether to make something facetable or not

• CopyFields collect Fields together into extra FieldTypes

Page 42: Creating an Open Source Genealogical Search Engine with Apache Solr

schema.xml - FieldTypes

• 5 Custom FieldTypes (so far):

– givenname

– surname

– surname_bmpm (phonetic)

– place (note: not merely town)

– year (which we're treating as text right now)
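As an illustration of what one of these FieldTypes might contain, the sketch below prints a possible definition for surname_bmpm built on Solr's BeiderMorseFilterFactory. The tokenizer choice and the filter attribute values are assumptions for the example, not LeafSeek's actual settings.

```python
import xml.etree.ElementTree as ET


def bmpm_field_type(name="surname_bmpm"):
    """Return a <fieldType> that applies Beider-Morse phonetic encoding."""
    field_type = ET.Element("fieldType", {"name": name, "class": "solr.TextField"})
    analyzer = ET.SubElement(field_type, "analyzer")
    # Split the text into tokens, then encode each token phonetically.
    ET.SubElement(analyzer, "tokenizer", {"class": "solr.StandardTokenizerFactory"})
    ET.SubElement(analyzer, "filter", {
        "class": "solr.BeiderMorseFilterFactory",
        "nameType": "GENERIC",   # assumed; ASHKENAZI and SEPHARDIC also exist
        "ruleType": "APPROX",    # approximate rather than exact matching
        "concat": "true",
        "languageSet": "auto",   # let the filter guess the name's language
    })
    return ET.tostring(field_type, encoding="unicode")


print(bmpm_field_type())
```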

Page 43: Creating an Open Source Genealogical Search Engine with Apache Solr

schema.xml - FieldTypes

Page 44: Creating an Open Source Genealogical Search Engine with Apache Solr

schema.xml - FieldTypes

Page 45: Creating an Open Source Genealogical Search Engine with Apache Solr

schema.xml - Fields

Page 46: Creating an Open Source Genealogical Search Engine with Apache Solr

schema.xml - Fields

• Fields that start with an uppercase letter take their names from the MySQL column names

• Examples:

– Year

– SchoolYear

– Surname

– FathersTown

– MothersFathersGivenName

– MaternalGrandfathersGivenName

Page 47: Creating an Open Source Genealogical Search Engine with Apache Solr

schema.xml - Fields

• Lowercase fields are added while the data is being imported into Solr, and start with the prefix record_

• Examples:

– record_type (birth, death, tax, whatever)

– record_source (name of repository)

– record_latlong (latitude,longitude)

– record_id (required!)
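The record_latlong value is what makes geo-spatial searches possible. As a rough sketch of the underlying idea (Solr does this work itself; this is not its implementation), the code below parses two "latitude,longitude" strings and computes the great-circle distance between them. The coordinates in the usage lines are approximate values for Lviv and Kraków.

```python
import math

EARTH_RADIUS_KM = 6371.0


def parse_latlong(value):
    """Split a 'latitude,longitude' string into two floats."""
    lat, lon = (float(part) for part in value.split(","))
    return lat, lon


def haversine_km(a, b):
    """Great-circle distance in kilometres between two latlong strings."""
    lat1, lon1 = map(math.radians, parse_latlong(a))
    lat2, lon2 = map(math.radians, parse_latlong(b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


lviv = "49.8397,24.0297"      # approximate
krakow = "50.0647,19.9450"    # approximate
print(round(haversine_km(lviv, krakow)), "km")
```

A "records within N km of this town" search boils down to keeping the records whose distance from the chosen point is at most N.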

Page 48: Creating an Open Source Genealogical Search Engine with Apache Solr

schema.xml - Fields

• You do not have to explicitly define every Field.

• If something is imported that is not named and defined in schema.xml it will just be indexed as a straight-up text string, with nothing done to it.

• Which is fine.

• But IMHO it's better to define everything anyway so you can remember what's what and what you are doing to it.

Page 49: Creating an Open Source Genealogical Search Engine with Apache Solr

schema.xml - CopyFields

Page 51: Creating an Open Source Genealogical Search Engine with Apache Solr

Add-ons and nice-to-haves

(for the back-end)

• Wildcards, and lots of 'em

• Non-name words handled through stopwords.txt

• Nicknames and name synonyms handled through synonyms.txt

• Two files included:

– synonyms_-_american-anglo-saxon.txt

– synonyms_-_polish-ukrainian-jewish.txt

• Should be based on your data and your historical/ethnic community standards
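To show how a synonyms file turns one typed name into several searched names, here is a simplified sketch. It reads the comma-separated group format that Solr's synonyms.txt uses and skips the explicit "=>" mapping lines for brevity; the sample entries are invented and are not taken from the two bundled files.

```python
# Invented sample entries in the comma-separated synonyms.txt style.
SAMPLE_SYNONYMS = """
# nicknames and variant spellings
Bill, William, Wm
Rivka, Rebecca, Rifka
"""


def load_synonyms(text):
    """Map each lowercased name to the full set of names in its group."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "=>" in line:          # explicit mappings not handled
            continue
        group = {name.strip().lower() for name in line.split(",") if name.strip()}
        for name in group:
            table.setdefault(name, set()).update(group)
    return table


def expand(name, table):
    """Return every name that should match a search for `name`."""
    return sorted(table.get(name.lower(), {name.lower()}))


synonyms = load_synonyms(SAMPLE_SYNONYMS)
print(expand("Bill", synonyms))
```

A name with no entry simply expands to itself, so unknown names still search normally.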

Page 54: Creating an Open Source Genealogical Search Engine with Apache Solr

More add-ons and nice-to-haves

(for the back-end)

• Translate your site into different languages – multi-lingual content deserves a real multi-lingual website

– Pass user preferences through a GET value, the Accept-Language header, a cookie, or whatever you want

• Built-in performance monitoring hooks for New Relic

• Soundalike searches for surname variants

– Levenshtein distance

– “Regular” Soundex, Metaphone, Caverphone, etc.
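For readers who have not met Levenshtein distance before: it counts the fewest single-character insertions, deletions, and substitutions needed to turn one string into another, which makes it a handy measure of how far apart two surname spellings are. A plain-Python sketch of the standard dynamic-programming approach (the `close_surnames` helper and its threshold are made up for the example):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning `a` into `b`."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def close_surnames(query, surnames, max_distance=1):
    """Hypothetical helper: surnames within `max_distance` edits of the query."""
    return [s for s in surnames
            if levenshtein(query.lower(), s.lower()) <= max_distance]


print(close_surnames("Ganz", ["Gans", "Gantz", "Schreier"]))
```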

Page 55: Creating an Open Source Genealogical Search Engine with Apache Solr

This is the part where I tell

the story about

THE SAGA

of Beider-Morse Phonetic Matching

(BMPM)

Page 56: Creating an Open Source Genealogical Search Engine with Apache Solr

Relevancy

• Right now, we're using exact matches

• (Of course, “exact” includes wildcards, alternate names / synonyms, etc.)

• Like “Old Search” on Ancestry.com

• DisMax! Boosting fields! Scoring!

• (…but not yet)

• Problems with records with multiple people's names in the record
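For a sense of what the DisMax step could look like once it is adopted, here is a sketch that builds the request parameters for a boosted query. `defType` and `qf` are standard Solr parameters; the field names and boost weights are assumptions picked for illustration, not settings from LeafSeek.

```python
from urllib.parse import urlencode


def dismax_params(text, boosts):
    """Build Solr parameters for a DisMax query with per-field boosts."""
    # qf lists the fields to search, each with an optional ^boost weight.
    qf = " ".join(f"{field}^{weight}" for field, weight in boosts.items())
    return {"defType": "dismax", "q": text, "qf": qf}


# Assumed fields: a surname match counts three times as much as a given name.
params = dismax_params("ganz", {"surname": 3, "givenname": 1})
print(urlencode(params))
```

Scoring like this would rank a record where the surname matches above one where the same word only appears as someone's given name.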

Page 59: Creating an Open Source Genealogical Search Engine with Apache Solr

Lots of Front-End Options

• Ruby:

Sunspot, RSolr, Tanning Bed, acts-as-solr

• Django/Python:

Haystack, Sunburnt, solrpy, pysolr

• Older PHP options:

PECL, solr-php-client

• Plugins for blog/CMS systems:

Drupal, WordPress

Page 60: Creating an Open Source Genealogical Search Engine with Apache Solr

Meet Solarium

• http://www.solarium-project.org/

• New, open source PHP wrapper for Solr

• Very active development

• Version 2.4 coming soon

Page 61: Creating an Open Source Genealogical Search Engine with Apache Solr

File Structure: Front-End

Page 62: Creating an Open Source Genealogical Search Engine with Apache Solr

Meet Solarium: The Config

Page 63: Creating an Open Source Genealogical Search Engine with Apache Solr

Meet Solarium: The Guts

Page 64: Creating an Open Source Genealogical Search Engine with Apache Solr

Meet Solarium: The Guts

• You choose the parts of your data to facet

• Data is submitted to the front-end by POST, not by GET, so the URL never changes

• You can (and should) paginate results listings

• You can't actually see the Solr server's URL from the front-end, not even in view-source
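The faceting and pagination points map onto a handful of Solr request parameters. The deck's front-end does this in PHP through Solarium; the Python below is only a language-neutral sketch of the parameters involved, with the page size and facet fields chosen as assumptions.

```python
from urllib.parse import urlencode


def search_params(query, page, per_page=25,
                  facet_fields=("record_type", "record_source")):
    """Build the Solr parameters for one page of faceted results."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    params = [
        ("q", query),
        ("start", (page - 1) * per_page),  # offset of the first result
        ("rows", per_page),                # results per page
        ("wt", "json"),
        ("facet", "true"),
    ]
    # facet.field may repeat, once per field you choose to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return params


print(urlencode(search_params("Surname:Ganz", page=3)))
```

Because the front-end builds this request on the server, the visitor's browser only ever talks to the front-end, which is how the Solr server's URL stays hidden.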

Page 65: Creating an Open Source Genealogical Search Engine with Apache Solr

Add-ons and nice-to-haves

(for the front-end)

• A welcome screen with information about the database's contents

• Instructions (maybe twice)

• How many records in the database?

• How many datasets?

• What features are coming next?

• What datasets are coming next?

Page 66: Creating an Open Source Genealogical Search Engine with Apache Solr

Add-ons and nice-to-haves

(for the front-end)

• Make good UI choices

• Pop-Up Google Maps

• Tooltips to reduce UI clutter

• Cross-browser compatibility

• Still stuck with IE 7 and 8

• CSS and code that degrades gracefully

• No small text

Page 67: Creating an Open Source Genealogical Search Engine with Apache Solr

Bird's Eye View of Your Data

• What (surnames, towns, etc.) do I have in my data?

• What are the TOP (surnames, towns, etc.) in my data?

• Finding incorrect data

• Finding incorrect data

– Outlying years and dates

– Figure out that hard-to-read surname

• Make charts and graphs from your data
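In Solr these questions are answered with facet counts. The same idea can be sketched in plain Python to show what is being computed; the records below are invented, including one deliberately mistyped year.

```python
from collections import Counter

# Invented records; the last Year is a deliberate typo.
RECORDS = [
    {"Surname": "Ganz", "Town": "Lemberg", "Year": "1852"},
    {"Surname": "Ganz", "Town": "Tarnopol", "Year": "1861"},
    {"Surname": "Schreier", "Town": "Lemberg", "Year": "8161"},
]


def top_values(records, field, n=3):
    """Most common values of a field, like a facet sorted by count."""
    counts = Counter(r[field] for r in records if r.get(field))
    return counts.most_common(n)


def outlying_years(records, earliest, latest):
    """Records whose Year falls outside the plausible range."""
    return [r for r in records if not earliest <= int(r["Year"]) <= latest]


print(top_values(RECORDS, "Surname", n=1))
print(outlying_years(RECORDS, 1700, 1950))
```

The counts returned by `top_values` are also exactly the numbers a chart or graph of the data would plot.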

Page 69: Creating an Open Source Genealogical Search Engine with Apache Solr

The (Back-End) Future! (Maybe.)

• Date ranges, instead of just years

• Auto-complete as you type

• “Did you mean...?”

(based on data frequency)

• “More Like This”

(would have to do scoring)

• Record bookmarking system (hashes?)

Page 70: Creating an Open Source Genealogical Search Engine with Apache Solr

The (Front-End) Future! (Maybe.)

• Hierarchical facets for locations

• Disambiguating locations

• Social sharing of individual records

• New genealogy data schema

http://historical-data.org/

• Membership login system

Page 72: Creating an Open Source Genealogical Search Engine with Apache Solr

Please Do Not Build That Wall

• Password protect some of the databases

• Password protect some of the data

• Open data, but pay for record or surname bookmarking system

• Open data, but pay for API access

• Open data, but sell online ads

• Open data, but give people guilt trips

Page 73: Creating an Open Source Genealogical Search Engine with Apache Solr

Presenting LeafSeek!

• Free and Open Source

• Code is all on GitHub

• Please add, edit, fix, change, tinker

• …and use it!

Page 76: Creating an Open Source Genealogical Search Engine with Apache Solr

Why is this FREE?

And why is this important?

Page 77: Creating an Open Source Genealogical Search Engine with Apache Solr

Thank you! :-)