© 2008 Palantir Technologies Inc. All rights reserved.
Palantir XML Formats
PalantirXML (pXML) & PalantirDocXML (DocXML)
Ari Gordon-Schlosberg, Senior Software Engineer
Palantir XML Formats
Written in XML Schema Definition (XSD) language
– W3C standard
– Widely accepted
Allows developers to leverage existing XML tools
– Editing
– Verification
– Transformation (XSLT friendly)
Designed to be simple & human-readable
– Follows Palantir design principles
– Meant to make it easier for developers to code, debug, and learn
PalantirXML: An Introduction
A rendering of a Palantir object graph into XML
– Encodes nearly all features in our lowest-level data model
– “Close to the metal”
Used as open import format
– Makes Palantir integration-friendly and a truly open platform
– Federated Search on-the-fly import uses it internally
– Super efficient storage format
Used for export/interchange
– Allows an organization to pull knowledge out of Palantir
– Can be transformed using XSLT to other XML formats
PalantirDocXML: An Introduction
Container for textual docs and entity extraction output
– raw text
– source document
– entity extraction results
– textual references to those entities
– document metadata
Authored by Palantir, but it’s an open format
– Not inherently tied to Palantir
– Contains some optional features to ease integration with Palantir
– Not tied to a single extractor; multiple vendors already support it
Designed to be a simple-to-author import format
– XSLT friendly
– Used existing entity extractor formats as design guides
Object-Model Refresher
Example Text Document
Contributors: Ari Gordon-Schlosberg, Kevin Simler, John Carrino
We're currently stuck in Atlanta, waiting for our flight to IL. We learned that our display case is 83 lineal inches, 3 inches longer than we're supposed to be able to fly with, but let us go this time. (I wonder if this is just a Delta thing?)
Eric Poirier called us and told us that the presentation at Cornell went very well, which gives us high hopes for tomorrow's presentation at UIUC. John and I are excited to get back home for a visit and I've been contacting professors to look for students that we should target for recruiting.
Things are going well.
Sincerely,
Your field team: Kevin, John, and Ari.
Imported Into Palantir
A Simple Example
Keep In Mind…
We’ll be covering:
– Details of these two formats
– Explanations of where to use them
– Some simple examples
Examples have been edited for brevity and clarity
– Covering important features
– Reference manuals and XSDs are the full references
– Some elements are abbreviated as <element/> where details are not relevant; more detail may be required there
PALANTIR XML (pXML)
pXML: Where To Use It
To import structured data that doesn’t import easily
– Data from a database where objects span tables
– Objects assembled from multiple DataSources
– Other “exotic” data sources
To export data from Palantir
– Other analytic tools
– Other data platforms
– Other Palantir instances
pXML And The Object Model
pXML is strongly coupled to the object model
– Data sources
– Objects
– Properties
– Notes
– Media
– Links
– Data source records
pXML And The Object Model
pXML elements come directly from the object model
– Data sources: <dataSource/>
– Objects: <object/>
– Properties: <property/>
– Notes: <note/>
– Media: <media/>
– Links: <link/>
– Data source records: <dataSourceRecord/>
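As a rough illustration of how the element names above might fit together, the following Python sketch builds a tiny pXML-like tree with the standard library. Only the element names come from the list above. The root element name `palantirXml`, the nesting, and the `id` and `type` attributes are assumptions made for illustration and are not taken from the XSD.

```python
import xml.etree.ElementTree as ET


def build_sketch() -> ET.Element:
    # Root element name is hypothetical; the real one is defined in the XSD.
    root = ET.Element("palantirXml")
    # A data source describing where the data came from (attributes assumed).
    ET.SubElement(root, "dataSource", {"id": "ds1"})
    # An object carrying one property and one note (nesting assumed).
    obj = ET.SubElement(root, "object", {"id": "o1", "type": "Person"})
    prop = ET.SubElement(obj, "property", {"type": "Name"})
    prop.text = "Eric Poirier"
    note = ET.SubElement(obj, "note")
    note.text = "Mentioned in the field team's email."
    return root


if __name__ == "__main__":
    print(ET.tostring(build_sketch(), encoding="unicode"))
```

The XSD files remain the reference for the actual element layout; this sketch only shows that standard XML tooling is enough to generate such a document.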
pXML Document Structure
Document/Data Source Duality
Data sources represent real-world sources of data
– do not contain data
– a collection of references
Palantir document objects contain real-world data
Primary object connects a data source to the object holding its data
Used by data sources representing unstructured data
– Documents
– Emails
– Other sources of unstructured text
Data Sources
Object
Property
Property Values
Three types of property values are supported in pXML:
– Simple
• Used for single, unparsed values
• e.g. Nationality, Organization Name
– Composite
• Used for values composed of discrete, semantic units
• e.g. Name (first & last), Address (city, state, zip, etc.)
– Raw
• Convenience format
• Keeps pXML simple and allows the parsers to do the work
• Allows ontology to change around existing pXML generators
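The difference between the three value kinds can be sketched as plain Python data types. This is a conceptual model only: the class names, field names, and sample values are hypothetical and do not reflect the actual pXML element syntax, which is defined in the XSD.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class SimpleValue:
    value: str  # a single, unparsed value


@dataclass
class CompositeValue:
    components: Dict[str, str]  # discrete semantic units, keyed by name


@dataclass
class RawValue:
    text: str  # unparsed text, left for the importer's parsers to interpret


PropertyValue = Union[SimpleValue, CompositeValue, RawValue]


def describe(value: PropertyValue) -> str:
    """Return a one-line description showing which kind of value this is."""
    if isinstance(value, SimpleValue):
        return f"simple: {value.value}"
    if isinstance(value, CompositeValue):
        parts = ", ".join(f"{k}={v}" for k, v in value.components.items())
        return f"composite: {parts}"
    return f"raw: {value.text}"
```

A generator that emits raw values does not need to know how a name is split into components, which is why raw values let the ontology change without touching the generator.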
Simple Property Value
Composite Property Value
Raw Property Value
Media
Notes
DataSourceRecords
Data source records (DSRs) tie data to their source
Apply to all pieces of data
– Properties
– Notes
– Media
– Links
Have two modes
– Import keys tie data to a record in a structured data source, e.g. a line number or primary key
– String position locators mark references in unstructured text using character offsets and lengths
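The string position locator mode can be sketched in a few lines of Python: a reference is an offset and a length into the document text. The `Locator` type, the helper names, and the zero-based offset convention are assumptions for illustration; the XSD defines the real representation.

```python
from typing import NamedTuple


class Locator(NamedTuple):
    offset: int  # zero-based character offset into the text (assumed convention)
    length: int  # number of characters covered by the reference


def locate(text: str, mention: str) -> Locator:
    """Build a locator for the first occurrence of `mention` in `text`."""
    offset = text.find(mention)
    if offset < 0:
        raise ValueError(f"{mention!r} not found in text")
    return Locator(offset, len(mention))


def resolve(text: str, locator: Locator) -> str:
    """Recover the referenced substring from a locator."""
    return text[locator.offset:locator.offset + locator.length]


# Sentence fragment taken from the example text document.
TEXT = "Eric Poirier called us and told us that the presentation at Cornell went very well"
```

For example, `locate(TEXT, "Eric Poirier")` yields offset 0 and length 12, and `resolve` maps that locator back to the same substring.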
DataSourceRecords
Links
Links represent a link between two objects
All links are directed in Palantir
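Directedness means each link has a distinct start and end object. A minimal Python sketch, assuming hypothetical field names, object ids, and link type that are not taken from the pXML schema:

```python
from typing import List, NamedTuple


class Link(NamedTuple):
    source: str     # id of the object the link starts from
    target: str     # id of the object the link points to
    link_type: str  # hypothetical link type name


def outgoing(links: List[Link], object_id: str) -> List[Link]:
    """Return only the links that start at the given object."""
    return [link for link in links if link.source == object_id]


LINKS = [Link("kevin", "palantir", "EmployeeOf")]
```

Because the link is directed, it appears among the outgoing links of `kevin` but not of `palantir`.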
PALANTIR DOCUMENT XML (DocXML)
PalantirDocXML: An Introduction
Authored by Palantir, but it’s an open format
– Not inherently tied to Palantir
– Contains some optional features to ease integration with Palantir
Support for multiple entity extractors per document
– Object data is designed to be an easy transform target from popular extractors
– Can hold the original output from the entity extractors
Allows ontologies to change over time
– Architected to use pluggable type mappings
– Compatible with multiple Palantir instances
– Never need to rebuild a DocXML document
Advanced Features
Advanced character set handling
– Stores document originals in original character set
– Careful UTF-8 encapsulation supports all human languages
Support for flexible document metadata
– Captures arbitrary organizational or handling metadata
Easy to understand and transform into other formats
– XSLT friendly by design
– Can hold extractor configuration as well as output
– Cross-data-platform format for extracted documents
– Intermediate format for multi-step extraction
– Single interface for ingestion of extracted documents
– Completely Palantir agnostic
DocXML Document Structure
Document Metadata
Document Metadata Example
Object Data
Extraction Metadata
Extraction Metadata Example
Object
Example Object
Relationship
Type Mapping
DocXML documents are not tied to an ontology
– A single document can be ingested into different ontologies
– Changes in an ontology do not require re-extraction or changes to the extractor, just an edit of the type mapping
– Each document can use multiple mappings
Mappings map extractor types and document properties
– Separate mapping for each supported extractor
– Document properties map into properties on the Palantir Document object
Centrally managed resource for each enterprise
– Analysts don’t write type mappings, architects do
– Imports seamlessly “just work”
– Everyone uses a consistent mapping
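The idea of a pluggable type mapping can be sketched as a lookup applied at ingestion time, leaving the extracted document untouched. Everything named here is hypothetical: the extractor type names (`PERSON`, `ORG`, `DATE`), the ontology type names, and the choice to skip unmapped types are assumptions for illustration, not the DocXML mapping format.

```python
from typing import Dict, List, Tuple

# An extracted entity as (extractor type, text).
Entity = Tuple[str, str]


def apply_mapping(entities: List[Entity], mapping: Dict[str, str]) -> List[Entity]:
    """Translate extractor types to ontology types; unmapped types are skipped."""
    return [(mapping[etype], text) for etype, text in entities if etype in mapping]


# The same extracted entities...
ENTITIES = [("PERSON", "Eric Poirier"), ("ORG", "Cornell"), ("DATE", "tomorrow")]

# ...ingested under two different (hypothetical) ontologies.
ONTOLOGY_A = {"PERSON": "Person", "ORG": "Organization"}
ONTOLOGY_B = {"PERSON": "Individual", "ORG": "Institution", "DATE": "Event Date"}
```

Changing an ontology then amounts to editing the mapping dictionary; the entity list, standing in for the DocXML document, is never rebuilt.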
Type Mapping Overview
Document Properties
Extractor Type Mappings
Extractor Type Mappings Example
Final Thoughts
This presentation is an overview
– Both pXML and DocXML have features not covered here
The XSD files are the canonical reference
– Full syntax and rules are covered there
– Consult the reference manual for usage and in-depth explanations
Living standards
– Backwards compatible
– May add new features to support customer needs
See our blog for tips and techniques on XML processing
– http://blog.palantirtech.com/