LuceneRDD for (Geospatial) Search and Entity Linkage

LuceneRDD Entity Linkage & (Spatial) Search

Anastasios Zouzias ScalaIO @ Lyon 2016

* Data Scientist @ Swisscom, CH: Mobility Insights Team* Researcher @ IBM Research, Zurich: Machine

Learning & Search Engineer * PhD in Univ. of Toronto: ML & Randomized Algorithms * Scala Developer (day job): 2-3 years * Apache Solr / Elasticsearch developer: 3+ years

About myself

How many of you have used Spark?

Apache Spark

RDD (Resilient Distributed Dataset)Distributes data over multiple nodes

Executor #1

Spark Master Executor #2

Executor #n

[Paris] [Bern]

[Athens]

[Ottawa] [Warsaw] [Rome]

[Zagreb] [Lisbon] [Dublin]

RDD“Spark Driver”

How many of you have used

Lucene/Solr/Elastic?

Apache Lucene

* Lucene is a powerful Java search library that lets you easily do search or Information Retrieval (IR)

* Used by LinkedIn, Twitter, and many more… * Scalable and High-performance indexing * Powerful, Accurate and Efficient Search algorithms

LuceneRDDWhat is it?

Open-source projecthttps://github.com/zouzias/spark-lucenerdd

Spark:Partitioning / distribution

of queries / data

Lucene:Indexing / querying

data (single partition)

LuceneRDD:Query / dispatch /

aggregation of data

MotivationWhy LuceneRDD exists?

* Natively support of full-text / spatial search in Spark(without external Elastic/Solr cluster)

* Scalable Entity Linkage (approx. join) with Spark & Lucene * Personal: Better understanding of Spark’s Internals (RDDs) * Personal: Better understanding of Scala (implicits)

LuceneRDD: RDD with searchLuceneRDD

Full-text search & Entity Linkage

FacetedLuceneRDDLuceneRDD + faceted search

ShapeLuceneRDDLuceneRDD + Spatial search

* Main development: Spark 2.x (supports Spark >= 1.4)* Lucene 5.5.3 (Lucene 6.2.2 JVM 8)* Released on maven central & spark-packages (Scala 2.10 / 2.11)

LuceneRDD

Main ideaIndex RDD partition data with

distributed Lucene index (inverted index per partition)

LuceneRDD

Executor #1

Spark Master

Executor #2

Executor #n

[Paris] [Bern]

[Athens]

Index storage disk or memory

LuceneRDDDataFrame to LuceneRDD*

Executor #1

Spark Master

Executor #2

Executor #n

[Paris] [Bern]

[Athens]

RDD/DataFrame

Language analyzers, i.e., French, English etc.. are

configurable

Language analyzers, i.e., French, English etc.. are

configurable

Simplistic ExampleExecutor #1

Spark Master

Executor #2

Executor #nsearch:par*

LuceneRDD

[Paris] [Bern]

[Athens]

Executor #1

Spark Master

Executor #2

Executor #n

Broadcast search query to executors

LuceneRDD

search:par*

Simplistic Example

Executor #1

Spark Master

Executor #2

Executor #n

Lucene prefix Search on each partition

LuceneRDD

search:par*

List(Paris)

List()

LuceneRDDResponse

topK results / partition

Simplistic Example

Aggregate ResultsExecutor #1

Spark Master

Executor #2

Executor #n

List(Paris)

List()

LuceneRDDResponse

response.take(k)

List(Paris)

List()

How to aggregate? TopK monoids in action

LuceneRDDResponse

TopKMonoid((0.8, Paris), (0.7, Result2), (0.6, Result3))

TopKMonoid((0.71, XXX), (0.35, YYY), (0.25, ZZZ))

TopKMonoid((0.12, XXX), (0.09, YYY), (0.05, ZZZ))

Executor #1

…Spark Master

Executor #2

Executor #n

List(Paris)

List()

List()+

TopK Sorted Scored Results

Twitter’s Algebird TopKMonoid

Aggregate Results

Executor #1

Spark Master

Executor #2

Executor #n

List(Paris)

List()

LuceneRDDResponse

LuceneRDDResponse.take(k)

List(Paris)

List()

LuceneRDD Operations

ShapeLuceneRDDLuceneRDD + Spatial search

FacetedLuceneRDDLuceneRDD + faceted search

Flexible multi-fields query: term/Fuzzy/Prefix/etc sub-queries

+ boolean queries

Entity Linkage a.k.a. approximate join

Join left & right dataset according to a “similarity”

(NO ids available)

Left Dataset Right DatasetEntity Linkage (approx. join)

Simplicity: one-to-one linkage

Left Dataset Right DatasetEntity Linkage (approx. join)

Naive approachCompute pairwise scores does not scale: O(n2)

Entity Linkage API

LuceneRDD Approximate Join1) Index right dataset using LuceneRDD 2) Define “search similarity function” between each left entity &

right dataset 3) For every left entity, search & zip using topK results

LuceneRDD

LuceneRDD Entity Linkage1) Index right dataset using LuceneRDD 2) Define “search similarity function” between left & right dataset 3) For every left entity, search & zip by search similarly

LuceneRDD

List(“name:james")

List(“name:joel")

List(“name:ram")

RDD[String]

Spark Master

LuceneRDD: linkage codeQuery each partition and zip query with topK results

Reduce topK monoids per query

Map each left entity to search query & broadcast to

each partition

Does LuceneRDD approx. join scale?

https://github.com/zouzias/spark-lucenerdd-aws

LuceneRDD Entity Linkage Scalability (EMR 4.8.0 / Spark 1.6.2)Li

# of executors (= # of AWS core nodes)

2 3 5 8 10 15 20

1135.23

755.69

482.98324.70 315.50

187.22 185.51

AWS: master: m3.xlarge, Core: r3.xlarge, Exec. mem: 9G

Joined Datasets* H1B US Visa Applications (2.6 million records) * Geonames US Cities (4.4 million records)

Demo LuceneRDD Linkage API

(ACM vs DBLP research articles)

https://github.com/zouzias/spark-lucenerdd-examples

Köpcke, H.; Thor, A.; Rahm, E. Evaluation of entity resolution approaches on real-world match problems Proc. 36th Intl. Conference on Very Large Databases (VLDB), 2010

Spatial/Entity Linkage API

Demo ShapeLuceneRDD Linkage

(Country polygons vs Capital points)

https://github.com/zouzias/spark-lucenerdd-examples

Future Work

* Contributors / use cases are welcome! * More performance tests with approx. join * Spatial-linkage performance tests using

open Street Map data * Bypass the collect-broadcast Spark pattern?

Merci pour votre attention! Questions?

Elastic/SolrCloud ConnectorExecutor

Executor #2

Executor #n

Driver

Cluster

Node 1

Node 2

Node m

Limitations - m is fixed & (usually) small compared to n- additional network communication: execs & elastic/solr

LuceneRDD for (Geospatial) Search and Entity Linkage

Technology

Comparison of Online Record Linkage Techniques · Record linkage is the process of quickly and accurately identifying records that belongs to the same entity across a number of heterogeneous

III. Linkage A. ‘Complete’ Linkage B. ‘Incomplete’ Linkage

DBPEDIA LINKAGE ANALYSIS LEVERAGING ON ENTITY SEMANTICS

Geospatial Semantics - UCSBgeog.ucsb.edu/~hu/geog176B/GeospatialSemantics.pdf · Geospatial semantics means the meaning of geospatial data and functions. Geospatial semantics can

Entity 78 11 108 Entity 75 Entity 77 Entity 76 Entity 74

CONCEPTS AND TECHNIQUES FOR RECORD LINKAGE, ENTITY RESOLUTION, AND DUPLICATE DETECTION BY PETER CHRISTEN PRESENTED BY JOSEPH PARK Data Matching

OVERVIEW – GEOSPATIAL TECHNOLOGIESficci.in/sector/85/Project_docs/Sector-Profile-Geospatial... · OVERVIEW – GEOSPATIAL TECHNOLOGIES ... The geospatial market comprises four identifiable

ML for entity linkage ML for data fusion · ML for entity linkage ML for data extraction ML for data fusion ML for schema alignment Part III. ... Significant improvement on texts

LinkNBed: Multi-Graph Representation Learning with Entity Linkage · mation available on the web in relational format. All knowledge graphs share common drawback of incompleteness

Linkage Disequilibrium with Linkage Analysis of Multiline

Five sections: Data Linkage in WA Introduction Linkage

Combined linkage and linkage disequilibrium analysis of a

Four-bar Linkage Mechanism Optimization for Linkage Driven

PAIRWISE ESTIMATING EQUATIONS FOR THE ANALYSIS OF … · 1.1 Nature of record linkage Record linkage consists in identifying records that come from the same entity, e.g. a person

VALUE OF GEOSPATIAL TECHNOLOGY - Oman Geospatial Forum

Joint Linkage and Linkage Disequilibrium Mapping

Towards a Global Statistical Geospatial Framework – Geospatially … · 2017. 11. 6. · ITRF ETRF GREF. Geo-location Use of In-Situ- and Realtime-Data ... They incorporate a linkage

Optimal Stopping: A Record-Linkage Approach...Record-linkage is the process of identifying whether two separate records refer to the same real-world entity when some elements of the

Incremental Record Linkage - VLDBan existing cluster (i.e., referring to an already known entity), or create a new cluster for it (i.e., referring to a new entity). However, every

Entity Linkage in the Linked Data - univ-artois.fr · 2019-11-04 · ¡ Linked Data referstoacollection of interrelated datasets n Used forlarge-scale integration of, reasoning on,