A Robust Open-source GEDCOM Parser

A Robust Open-source GEDCOM Parser, presented by Dallan Quass and Ryan Knight at RootsTech 2012. Parses GEDCOM files into a "de facto" object model; includes round-tripping for the vast majority of GEDCOM files.

Dallan Quass, dallan@werelate.org
Ryan Knight, ryan@grandcloud.com

What's a GEDCOM?

0 HEAD
1 SOUR PAF
2 NAME Personal Ancestral File
2 VERS 5.2.18.0
2 CORP The Church of Jesus Christ of Latter-day Saints
3 ADDR 50 East North Temple Street
4 CONT Salt Lake City, UT 84150
4 CONT USA
1 DEST Other
1 DATE 9 Aug 2006
2 TIME 19:57:47
1 FILE temp-paf.ged
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
1 CHAR UTF-8
1 LANG English
1 SUBM @SUB1@
0 @SUB1@ SUBM
1 NAME Dallan Quass
0 @I1@ INDI
1 NAME Dallan /Quass/
2 SURN Quass
2 GIVN Dallan

If this looks unfamiliar to you, you may not get a lot out of this talk

On the other hand, the purpose of this project is to handle this for you, so you can develop cool projects in genealogy and let this be unfamiliar to you!
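To make the sample file above concrete: each GEDCOM line is a level number, an optional @ID@ cross-reference, a tag, and an optional value. The following is a minimal sketch of splitting one line into those parts. The class name GedcomLineSketch and its fields are hypothetical and are not taken from the project; the sketch also assumes line terminators have already been stripped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not a class from the project. */
final class GedcomLineSketch {
    // level, optional @ID@, tag, optional value (everything after one space)
    private static final Pattern LINE =
        Pattern.compile("^\\s*(\\d+)\\s+(?:(@[^@]+@)\\s+)?(\\S+)(?: (.*))?$");

    final int level;
    final String id;    // null when the line has no @ID@
    final String tag;
    final String value; // null when the line has no value

    GedcomLineSketch(int level, String id, String tag, String value) {
        this.level = level;
        this.id = id;
        this.tag = tag;
        this.value = value;
    }

    /** Splits a single GEDCOM line; throws if it does not have the expected shape. */
    static GedcomLineSketch parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a GEDCOM line: " + line);
        }
        return new GedcomLineSketch(
            Integer.parseInt(m.group(1)), m.group(2), m.group(3), m.group(4));
    }
}
```

For example, parsing "0 @I1@ INDI" yields level 0, id "@I1@" and tag "INDI", while "1 NAME Dallan /Quass/" yields level 1, tag "NAME" and the value "Dallan /Quass/". Real files need far more tolerance than this, which is the point of the challenges that follow.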

Why is parsing GEDCOMs so hard?

Challenge #1 – Character set detection

0 HEAD
1 SOUR PAF
2 NAME Personal Ancestral File
2 VERS 5.2.18.0
2 CORP The Church of Jesus Christ of Latter-day Saints
3 ADDR 50 East North Temple Street
4 CONT Salt Lake City, UT 84150
4 CONT USA
1 DEST Other
1 DATE 9 Aug 2006
2 TIME 19:57:47
1 FILE temp-paf.ged
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
1 CHAR UTF-8
1 LANG English
1 SUBM @SUB1@
0 @SUB1@ SUBM
1 NAME Dallan Quass
0 @I1@ INDI
1 NAME Dallan /Quass/
2 SURN Quass
2 GIVN Dallan

Should be easy, except...

Generator     Declared CHAR        Actual encoding
GeneWeb       ASCII                ANSI
Geni.com      ANSEL                UTF8
Geni.com      UNICODE              UTF8
GENJ          UNICODE              UTF8
All others    UNICODE              UTF16
(any)         ASCII/MacOS Roman    x-MacRoman
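The overrides listed above amount to a lookup keyed on the generating program and the declared CHAR value. A minimal sketch of such a lookup follows. CharsetOverrideSketch and resolve are hypothetical names, the returned strings are just the labels from the slide rather than Java charset names, and the assumption that the generator names match the header's SOUR value exactly (ignoring case) is mine, not the project's.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not a class from the project. */
final class CharsetOverrideSketch {
    private CharsetOverrideSketch() {}

    /**
     * Returns the encoding label to use for a file, given the generating
     * program and the declared CHAR value. Falls back to the declared value
     * when none of the known overrides applies.
     */
    static String resolve(String generator, String declaredChar) {
        String gen = generator == null ? "" : generator.trim().toUpperCase(Locale.ROOT);
        String chr = declaredChar == null ? "" : declaredChar.trim().toUpperCase(Locale.ROOT);

        // Generator-specific overrides are checked first.
        if (gen.equals("GENEWEB") && chr.equals("ASCII")) return "ANSI";
        if (gen.equals("GENI.COM") && (chr.equals("ANSEL") || chr.equals("UNICODE"))) return "UTF8";
        if (gen.equals("GENJ") && chr.equals("UNICODE")) return "UTF8";

        // Overrides that apply regardless of generator.
        if (chr.equals("UNICODE")) return "UTF16";
        if (chr.equals("ASCII/MACOS ROMAN")) return "x-MacRoman";

        return declaredChar; // no override: trust the header
    }
}
```

Checking the generator-specific rows before the generic ones matters: a Geni.com file declaring UNICODE must resolve to UTF8, not to the UTF16 default used for every other program.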

ANSEL

Challenge #2 – Custom tags

The GEDCOM specification hasn't been updated in a LONG time

Challenge #3 – Misused tags

Shout out

Tim Forsythe

VGed - GEDCOM validator

http://ancestorsnow.blogspot.com/2011/07/vged.html

ALIA

1 SEX M
1 ALIA /Ted/
1 BIRT

SOUR

0 @N6@ NOTE
1 CONT adopted surname Termaat
2 SOUR @S9@

DATA

2 SOUR @S2149874917@
3 DATA
4 DATE 11 Sep 1924
3 NOTE ...
3 DATA
4 TEXT ...

2 SOUR @S99@
3 DATA
4 TEXT William Donald ...
4 DATE 1 Sep 1997

2 SOUR @S28@
3 PAGE Indian Prarie...
3 QUAY 3
3 DATE 28 Feb 2005

Challenge #4 – Unused tags

Event Phone

Event Agency

Source Citation Event Type

Challenge #5 – Names

GEDCOM Standard?

The code is more what you'd call

"guidelines" than actual rules.

Two goals

Goal #1 – Parse GEDCOMs into a de facto object model

De Facto:

In fact or in practice; in actual use or existence, regardless of official or legal status. – Wiktionary.org

Model should be straightforward, easy to use and understand

Goal #2 – Round-trip

From GEDCOM

To Object Model

Back to GEDCOM without information loss

Nirvana

There is no Nirvana

But we can get pretty close

94%

How is it done?

???

Object model

People

Extensions

GedML

Originally by Michael Kay
http://users.breathe.com/mhkay/gedml/

Enhanced by Lynn Monson
http://lmonson.com/blog/?page_id=64

Further enhanced by Nathan Powell & Dallan Quass
part of this project

GEDCOM → SAX events
ANSEL reader & writer

Parser

Written in Java

~1500 LoC for parser + ~4000 LoC for POJOs

Handles SAX events emitted by GedML

Separate functions called to handle each tag

Maintains a stack of model objects

Attach unexpected tags to model objects as extensions

Fast

Easily extendible

Tree parser also available
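The parser design listed above (a handler function per tag, a stack of model objects, and unexpected tags kept as extensions) can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: StackParserSketch, Node, Person, startTag and endTag are hypothetical names, the start/end callbacks stand in for the SAX events that GedML emits, and the sketch only knows the INDI, NAME and SEX tags, so everything else (including standard tags such as SURN) lands in an extensions list.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical illustration only; not the project's parser. */
final class StackParserSketch {
    /** Generic model object; unrecognised child tags are kept as extensions. */
    static class Node {
        final String tag;
        final String value;
        final List<Node> extensions = new ArrayList<Node>();
        Node(String tag, String value) { this.tag = tag; this.value = value; }
    }

    static final class Person extends Node {
        final List<Node> names = new ArrayList<Node>();
        String sex;
        Person() { super("INDI", null); }
    }

    final Node root = new Node("ROOT", null);
    final List<Person> people = new ArrayList<Person>();
    private final Deque<Node> stack = new ArrayDeque<Node>();

    StackParserSketch() { stack.push(root); }

    /** Called when a tag opens; dispatches to one handler per known tag. */
    void startTag(String tag, String value) {
        Node parent = stack.peek();
        Node current;
        if (parent == root && tag.equals("INDI")) {
            current = handleIndi();
        } else if (parent instanceof Person && tag.equals("NAME")) {
            current = handleName((Person) parent, value);
        } else if (parent instanceof Person && tag.equals("SEX")) {
            current = handleSex((Person) parent, value);
        } else {
            current = handleUnexpected(parent, tag, value);
        }
        stack.push(current); // children of this tag attach to it
    }

    /** Called when a tag closes. */
    void endTag() { stack.pop(); }

    private Node handleIndi() {
        Person p = new Person();
        people.add(p);
        return p;
    }

    private Node handleName(Person p, String value) {
        Node name = new Node("NAME", value);
        p.names.add(name);
        return name;
    }

    private Node handleSex(Person p, String value) {
        p.sex = value;
        return new Node("SEX", value);
    }

    /** Nothing is dropped: the tag is attached to whatever object is open. */
    private Node handleUnexpected(Node parent, String tag, String value) {
        Node ext = new Node(tag, value);
        parent.extensions.add(ext);
        return ext;
    }
}
```

Because unknown tags are attached to the object that is open when they arrive, an exporter can later write them back out in place, which is what makes round-tripping possible for tags the model does not understand.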

GEDCOM Export

Visitor pattern

600 LoC
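As a rough illustration of exporting with the visitor pattern, the sketch below walks a tiny model and writes GEDCOM lines, tracking the current level as it descends and returns. All names here (VisitorExportSketch, Visitor, Person, Name, GedcomWriter) are hypothetical and the model is far smaller than the project's; it only shows the shape of the technique.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the project's exporter. */
final class VisitorExportSketch {
    interface Visitor {
        void visit(Person person);
        void visit(Name name);
        void endVisit(); // called after an object and its children are done
    }

    static final class Name {
        final String value;
        final String surname; // may be null
        Name(String value, String surname) { this.value = value; this.surname = surname; }
        void accept(Visitor v) {
            v.visit(this);
            v.endVisit();
        }
    }

    static final class Person {
        final String id;
        final List<Name> names = new ArrayList<Name>();
        Person(String id) { this.id = id; }
        void accept(Visitor v) {
            v.visit(this);
            for (Name n : names) {
                n.accept(v);
            }
            v.endVisit();
        }
    }

    /** Writes one GEDCOM line per visited object, one level deeper per nesting. */
    static final class GedcomWriter implements Visitor {
        private final StringBuilder out = new StringBuilder();
        private int level = 0;

        public void visit(Person person) {
            line("@" + person.id + "@ INDI");
            level++;
        }

        public void visit(Name name) {
            line("NAME " + name.value);
            level++;
            if (name.surname != null) {
                line("SURN " + name.surname);
            }
        }

        public void endVisit() { level--; }

        private void line(String rest) {
            out.append(level).append(' ').append(rest).append('\n');
        }

        String result() { return out.toString(); }
    }
}
```

Keeping the traversal in the model's accept methods and the formatting in the visitor means another output format can be added as a second Visitor implementation without touching the model classes.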

JSON

GEDCOM → POJO → JSON → POJO → GEDCOM

Simple model persistence using Google GSON

Further thoughts

Do we need a radically-different data-exchange model for genealogy?

I don't know

A new proposed object model could use this project to migrate existing GEDCOMs to the de facto model, then translate the de facto model objects to the new model

Do we need GEDCOM validation tools?

Definitely!

A list of “standard” custom tags would also be pretty helpful

We live in the real world

Purpose of this project

Demonstration of Gedcom Server

Demonstrates GEDCOM -> model -> JSON -> model -> GEDCOM

Built with Play 1.2.4 - A Java Web framework

Allows for rapid development of web applications with a fully integrated stack

Deployed to Heroku – Cloud Application Platform

Heroku allows one step deployment with git

Conclusion

Images appearing on these slides are copyrighted by the contributors to http://commons.wikimedia.org and are used under license

Parsing GEDCOMs is hard

• it's like parsing HTML in the 1990s

But getting it right is pretty important

especially if you want to retain existing information

Open source algorithm is now freely available

http://github.com/DallanQ/Gedcom

simple object model with extensions, 94% round-trip

Hopefully others will benefit from this effort