
Hazy Research Group Led by Christopher Ré (Zifei Shan, Mikhail Sushkov, Feiran Wang, Ce Zhang)

The Architecture of DeepDive

Rigorous Probabilistic Framework

Executive Summary

Simpler Feature Engineering

PaleoDeepDive

We gratefully acknowledge the support from Toshiba, Google, DARPA DEFT and XDATA, ONR, and NSF. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of Toshiba, Google, DARPA, ONR, NSF, or the US government.

Takeaways

DeepDive enables macroscopic science by building a "dark data" extraction system.

Developers should think about features, not algorithms. It is possible to abstract probabilistic inference for use by domain scientists using standard SQL and Python.

We can achieve comparable (or better) quality to human volunteers.

We demonstrate these three takeaways!

DeepDive is the underlying framework

DeepDive has three features!

Unstructured text Structured Resources

Relational Input Tables

Variables, factors, and connections between variables and factors

Factor Graph
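To make "variables, factors, and connections" concrete, here is a minimal sketch of a factor graph over Boolean variables. The variable names, weights, and helper functions are hypothetical illustrations, not DeepDive's API. The marginal is computed by brute-force enumeration, which is only feasible for toy graphs.

```python
import itertools
import math

# Hypothetical toy graph: two Boolean variables (names are illustrative only).
variables = ["coref_P1_P2", "coref_P2_P3"]

# Each factor is (weight, variables it connects to, Boolean test).
# The weights are made-up numbers, not learned values.
factors = [
    (1.5, ("coref_P1_P2",), lambda a: a),                        # unary factor
    (0.8, ("coref_P1_P2", "coref_P2_P3"), lambda a, b: a == b),  # pairwise factor
]

def unnormalized_log_prob(world):
    """Sum the weights of all factors whose test holds in this assignment."""
    return sum(w for w, vs, test in factors if test(*(world[v] for v in vs)))

def marginal(var):
    """P(var = True), computed by enumerating every possible world."""
    total = 0.0
    true_mass = 0.0
    for values in itertools.product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        mass = math.exp(unnormalized_log_prob(world))
        total += mass
        if world[var]:
            true_mass += mass
    return true_mass / total

print(marginal("coref_P1_P2"))
```

Enumeration grows exponentially with the number of variables, which is why large graphs need approximate inference such as sampling.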

e.g., Appear(Taxon, Formation) relation

Relational Output Tables

Text spans Freebase, Bing results…

Statistical Inference & Learning

Feature Extraction

Feature extraction with Python & SQL

DeepDive supports different high-level languages to specify a factor graph, e.g., Markov logic network, WinBUGS, etc.

PaleoDeepDive is built with a combination of Python and SQL

Python SQL

Input relations (e.g. Coref Task)

Phrase
PID  DOC  TEXT
P1   D1   Columbia Fm.
P2   D1   Columbia

Phrase A is coreferent to phrase B if the edit distance between A and B is smaller than 5 and they appear in the same document.

We write an SQL query to generate all phrase pairs that appear in the same document, and pair it with a Python function.

SELECT t0.PID, t0.TEXT, t1.PID, t1.TEXT
FROM Phrase t0, Phrase t1
WHERE t0.DOC = t1.DOC
USEPYTHON pyfunc

We write a Python function to process all phrase pairs and make predictions.

def pyfunc(p1, t1, p2, t2):
    if edit_dist(t1, t2) < 5:
        emit("Coref", p1, p2)
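The query and function above can be tried end to end with a small stand-alone sketch. It makes several assumptions: `edit_dist` is implemented here as Levenshtein distance, `emit` is a stand-in that just records tuples, SQLite replaces the real database, a plain loop replaces the `USEPYTHON` hook, and the extra `t0.PID < t1.PID` condition (not in the original query) skips self-pairs and mirrored duplicates.

```python
import sqlite3

def edit_dist(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

emitted = []  # tuples recorded by the stand-in emit()

def emit(relation, *args):
    """Hypothetical stand-in for the framework's emit: record the tuple."""
    emitted.append((relation, *args))

def pyfunc(p1, t1, p2, t2):
    if edit_dist(t1, t2) < 5:
        emit("Coref", p1, p2)

# Load the two example rows of the Phrase relation into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE Phrase (PID TEXT, DOC TEXT, "TEXT" TEXT)')
conn.executemany("INSERT INTO Phrase VALUES (?, ?, ?)",
                 [("P1", "D1", "Columbia Fm."), ("P2", "D1", "Columbia")])

# Same-document phrase pairs; the PID comparison is an added assumption.
rows = conn.execute(
    "SELECT t0.PID, t0.TEXT, t1.PID, t1.TEXT "
    "FROM Phrase t0, Phrase t1 "
    "WHERE t0.DOC = t1.DOC AND t0.PID < t1.PID"
)
for p1, t1, p2, t2 in rows:
    pyfunc(p1, t1, p2, t2)

print(emitted)
```

"Columbia Fm." and "Columbia" differ by four deletions, so the pair falls under the threshold of 5 and a single Coref tuple for (P1, P2) is recorded.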

DeepDive will learn the weight automatically
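As a rough intuition for what learning a weight means, the sketch below fits one weight for a single Boolean feature by gradient ascent on the log-likelihood of labeled examples (plain logistic regression). This is a simplified stand-in, not DeepDive's learning procedure, and the training data is invented for illustration.

```python
import math

# Hypothetical labeled examples: (feature_fired, is_coreferent).
examples = [(1, 1), (1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (0, 0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def learn(examples, steps=2000, lr=0.5):
    """Gradient ascent on the average log-likelihood; returns (weight, bias)."""
    w, b = 0.0, 0.0
    n = len(examples)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in examples:
            err = y - sigmoid(w * x + b)  # label minus predicted probability
            grad_w += err * x
            grad_b += err
        w += lr * grad_w / n
        b += lr * grad_b / n
    return w, b

w, b = learn(examples)
print(w, b, sigmoid(w + b))
```

With this data the feature fires on four examples, three of which are positive, so the fitted probability when the feature fires approaches 0.75 and the learned weight is positive.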

Extract Error analysis Extractor

Application

Write/Improve extractors

DeepDive is able to integrate a diverse set of signals & feedback

DeepDive supports an “E3 loop” for feature engineering

Domain Experts

Unstructured Data

Structured Knowledge Base

- Training labels
- HTML documents
- Scanned Articles
- Maps, photos, images

- Freebase
- Macrostrat
- Dictionaries

- Training labels
- Heuristics & Rules
- Hard Constraints

More-Structured Signal

More-Supervised Signal

Crowd

The more signals we use, the better quality we can expect!

Our DimmWitted system can run Gibbs sampling at a speed of 100 million variables/sec! (http://arxiv.org/abs/1403.7550)
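For readers unfamiliar with Gibbs sampling, here is a minimal single-threaded sketch on a toy two-variable graph. The graph, weights, and function names are hypothetical, and the code says nothing about how DimmWitted reaches its throughput. It only shows the core step: resample each variable in turn from its conditional distribution given the others.

```python
import math
import random

# Hypothetical toy graph over Booleans a and b, with made-up weights.
W_UNARY = 1.5  # factor weight when a is True
W_AGREE = 0.8  # factor weight when a == b

def score(a, b):
    """Unnormalized log-probability of the assignment (a, b)."""
    return W_UNARY * a + W_AGREE * (a == b)

def gibbs_marginal_a(sweeps=20000, burn_in=1000, seed=0):
    """Estimate P(a = True) by Gibbs sampling."""
    rng = random.Random(seed)
    a, b = False, False
    hits = 0
    for t in range(sweeps + burn_in):
        # Resample a from P(a | b).
        p_true = math.exp(score(True, b))
        p_false = math.exp(score(False, b))
        a = rng.random() < p_true / (p_true + p_false)
        # Resample b from P(b | a).
        p_true = math.exp(score(a, True))
        p_false = math.exp(score(a, False))
        b = rng.random() < p_true / (p_true + p_false)
        if t >= burn_in:
            hits += a  # count samples where a is True
    return hits / sweeps

print(gibbs_marginal_a())
```

For this graph the exact marginal of `a` is sigmoid(1.5), about 0.82, so the estimate can be checked against a known value.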

Geoscience Research Group Led by Shanan Peters (Jackson Borchardt, Tim Foltz)

PaleoDeepDive: A DeepDive Application

Statistical Inference using Familiar Data-Processing Languages

Try it out! http://deepdive.stanford.edu

How does climate change impact biodiversity?

T. Rex are found dating to the upper Cretaceous.

(“T. Rex”, “Cretaceous”)
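A minimal sketch of how such a candidate pair might be generated from a sentence, assuming simple dictionary lookup. The two dictionaries and the function name are hypothetical, and deciding whether a candidate is actually a true relation is left to the statistical inference step.

```python
# Hypothetical dictionaries; a real system would load much larger ones.
TAXA = ["T. Rex"]
INTERVALS = ["Cretaceous"]

def candidate_pairs(sentence):
    """Pair every dictionary taxon with every dictionary interval in the sentence."""
    taxa = [t for t in TAXA if t in sentence]
    intervals = [i for i in INTERVALS if i in sentence]
    return [(t, i) for t in taxa for i in intervals]

pairs = candidate_pairs("T. Rex are found dating to the upper Cretaceous.")
print(pairs)
```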

380 volunteer scientists have manually read 11K journal articles since 1994!

Extract biodiversity-related relations from journal articles.

The DeepDive Approach

The central task of PaleoDeepDive is to extract relations from unstructured text automatically:

PaleoDeepDive achieves comparable (or better) quality to human volunteers, at lower cost.

[Figure: Total Diversity versus Geological Time (M.A.), comparing Human and PaleoDeepDive.]

Different Sources of Signals

DeepDive uses a joint probability model that enables rigorous probabilistic interpretation

[Figures: Accuracy (Actual versus Ideal) and # Extractions, plotted against Output Probability. Annotations: Goal, Candidates for improvement, Output to users.]

Example Extractor

We expect that 8 of 10 extractions with probability 0.8 will be correct.
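This expectation can be checked empirically by bucketing extractions by their output probability and measuring accuracy within each bucket. The sketch below assumes that a correctness label is available for each extraction; the function name and the sample data are hypothetical.

```python
from collections import defaultdict

def calibration_table(predictions, n_buckets=10):
    """Map each bucket's lower bound to (count, accuracy).

    predictions: iterable of (probability, is_correct) pairs.
    """
    buckets = defaultdict(list)
    for prob, correct in predictions:
        idx = min(int(prob * n_buckets), n_buckets - 1)  # probability 1.0 joins the top bucket
        buckets[idx].append(correct)
    return {idx / n_buckets: (len(v), sum(v) / len(v))
            for idx, v in sorted(buckets.items())}

# Invented data: ten extractions at probability 0.8, eight of them correct.
sample = [(0.8, True)] * 8 + [(0.8, False)] * 2
table = calibration_table(sample)
print(table)
```

A well-calibrated system has per-bucket accuracy close to the bucket's probability, which corresponds to the Ideal line in the accuracy plot.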
