Building a Faceted Browser in CouchDB Using Views on Views and Erlang Metaprogramming

WFLP-2011, Odense, July 19 2011

Claus Zinn

The NaLiDa Project: Nachhaltigkeit Linguistischer Daten ("Sustainability of Linguistic Data")

http://www.sfs.uni-tuebingen.de/nalida/

Overview

• Research infrastructure (in Linguistics)

• Faceted Search

• Implementation

  • CouchDB
  • Map-Reduce
  • Processing Stages
  • Views
  • Views on Views

• Evaluation

• Future Work

• Related Work and Conclusion

Research Infrastructure

State of affairs in the Humanities (and elsewhere)

• no systematic management of the underlying research data

• increasing pressure from funding agencies to document and make public research data

⇒ eScience infrastructure needed to

  ⇒ support reproduction of results over identical data sets
  ⇒ increase scientific quality and fight fraud in science
  ⇒ help avoid unintended duplication of research work

NaLiDa Project

• contributes to infrastructure for language resources (corpora, lexica, ...) and software tools (part-of-speech taggers, parsers, ...)

• supports the scientific community with infrastructure building, metadata management and storage

• assists institutions in systematically describing and exposing their research with metadata in terms of XML-based documents

• increases access to and visibility of resources

Data Aggregation and Exposure

[Figure: data providers A, B and C expose XML metadata; OAI-PMH harvesting collects it into document storage, which feeds faceted search. Annotations: "at regular intervals", "new providers may join".]

Faceted Search

Metadata Descriptions in Linguistics

• can be very detailed with large variety in the usage of metadata field descriptors and their structural organisation

• most of the information is of little use for most users

• some information pieces matter for most users

Increasing Popularity of Faceted Browsing

• well-suited for naive users to explore large data sets with a small but informative set of facets

• customers can identify “products” along many dimensions

• facets, their value ranges and the number of corresponding items show the structure and content of the search space

• many users learn the main criteria for navigation

Faceted Search

Facet Selection

• governed by search for common denominator across collections

• will yield rather small set of (semantically similar) metadata fields

• main facets: organisation, language, resource type, modality

• conditional facets such as lifecycle status, tool type if resource type is tool

Facetification

• Facets: $F_1, \ldots, F_n$

• with value ranges $\{f_{11}, \ldots, f_{1n}\}, \ldots, \{f_{n1}, \ldots, f_{nm}\}$

• document must be indexed by at least one facet-value pair

• document can be described by more than one value $f_{ij}$ for $F_i$

• e.g., metadata for a multimodal corpus with $F_i$ = “modality” and values $f_{ij}$ “gesture”, “sign language” and “spoken language”
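The facetification rules above can be sketched in JavaScript as follows. This is a minimal illustration, not the project's code: the document shape (an `id` plus one array of values per facet), the facet key names and the function name `buildFacetIndex` are all assumptions.

```javascript
// Hypothetical facet names, following the main facets listed above.
const FACETS = ["organisation", "language", "resourcetype", "modality"];

// Build facet -> value -> list of document ids.
// Documents without any facet-value pair are collected separately, since
// every document must be indexed by at least one pair.
function buildFacetIndex(docs, facets) {
  const index = new Map(facets.map((facet) => [facet, new Map()]));
  const unindexed = [];
  for (const doc of docs) {
    let pairs = 0;
    for (const facet of facets) {
      // A document may carry several values for the same facet.
      for (const value of doc[facet] || []) {
        const byValue = index.get(facet);
        if (!byValue.has(value)) byValue.set(value, []);
        byValue.get(value).push(doc.id);
        pairs += 1;
      }
    }
    if (pairs === 0) unindexed.push(doc.id);
  }
  return { index, unindexed };
}

// Made-up sample documents; doc-1 is the multimodal corpus case.
const sample = [
  { id: "doc-1", modality: ["gesture", "sign language", "spoken language"], language: ["German"] },
  { id: "doc-2", language: ["German", "Dutch"] },
  { id: "doc-3" },
];
const result = buildFacetIndex(sample, FACETS);
console.log(result.index.get("modality"));
console.log(result.unindexed);
```

In this sketch `doc-1` appears under three values of “modality”, and `doc-3` is reported as unindexed because it carries no facet-value pair.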

Faceted Search Computations

[Figure: ring diagram for the facet “Languages”, one ring segment per value: German, English, French, Dutch Sign Language, British Sign Language, Swedish Sign Language, German Sign Language, Georgian, Hungarian, Dutch, Italian, Latin, Russian, Tibeto-Burmanisch, Croatian, Konkani, Prinmi, Serbian, Teribe, Bosnian, Shamskat Ladakhi, Turkish, Catalan, Ewe, Finnish, Hausa, Norwegian, Romanian, Spanish, Albanian, Alttibetisch, Amerindian, Bahasa Indonesia, Brazilian Portuguese, Bulgarian, Bulgariian, Chinese, dk, Early Modern High German, Estonian, European Spanish, Galician, Greek, Greenlandic, Guarani, Guruntum, Hindi, Hopi, Japanese, jp, Kanuri, Kenhat Ladakhi, Kenuzi-Dongola, Lithuanian, Mandarin Chinese, Maung, Medieval Spanish, Moore, Motu, Nahuatl, Navajo, Nepali, North Saami, Old High German, Old Portuguese, Old Spanish, Orokolo, Portuguese, Portugueze, Provenzal, Quechua, Samoan, Scottish-Gaelic, South American Spanish, Surselvan, Swahili, Swedish, Tamil, Tangale, Thai, Tibetan, Tibetisch, Tzeltal, Warao, Warlpiri, Yir Yoront, Yoruba, Yukatekisch, Zulu.]

• Once facet-value pair $f_{ik}$ is selected, the corresponding document set of $f_{ik}$ must be intersected with each of the subsets of the other facets $F_j$, with $1 \le j \le n$, $j \ne i$:

  • the document set of ring segment $f_{ik}$ must be intersected with the document sets of all segments of all rings other than $F_i$

• When users select facet $F_i$ with value $f_{ik}$ and facet $F_j$ with value $f_{jl}$:

  • first build the intersection of the two corresponding document collections

  • then intersect the (non-empty) result with all ring segments of all rings other than $F_i$ and $F_j$
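The intersection steps can be sketched in JavaScript as below. The index layout (facet, then value, then array of document ids), the sample data and the function names are assumptions made for illustration; the deck's own implementation is the Erlang view code shown later.

```javascript
// Hypothetical index: facet -> value -> array of document ids.
const index = {
  country: { Germany: ["d1", "d2", "d3"], Netherlands: ["d4"] },
  language: { German: ["d1", "d2"], Dutch: ["d3", "d4"] },
  modality: { speech: ["d1", "d4"], gesture: ["d2", "d3"] },
};

// Intersection of two arrays of ids.
function intersect(a, b) {
  const inB = new Set(b);
  return a.filter((id) => inB.has(id));
}

// selections: non-empty array of [facet, value] pairs chosen by the user.
// Returns the matching documents and, for every value of every unselected
// facet, the size of its intersection with them (empty results dropped).
function refine(facetIndex, selections) {
  if (selections.length === 0) throw new Error("select at least one facet value");
  let docs = null;
  for (const [facet, value] of selections) {
    const set = facetIndex[facet][value] || [];
    docs = docs === null ? set.slice() : intersect(docs, set);
  }
  const selected = new Set(selections.map(([facet]) => facet));
  const counts = {};
  for (const facet of Object.keys(facetIndex)) {
    if (selected.has(facet)) continue;
    counts[facet] = {};
    for (const [value, set] of Object.entries(facetIndex[facet])) {
      const n = intersect(docs, set).length;
      if (n > 0) counts[facet][value] = n;
    }
  }
  return { docs, counts };
}

const one = refine(index, [["country", "Germany"]]);
const two = refine(index, [["country", "Germany"], ["language", "German"]]);
console.log(one.counts);
console.log(two.counts);
```

Selecting a second facet value only narrows `docs`, so the counts for the remaining facets can never grow from `one` to `two`.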

Implementation

Requirements

• cope with metadata heterogeneity, given that documents will adhere to different schemas, each defining its own structured set of descriptors and values

• preserve the original format of all metadata descriptions, and consider storing primary data in addition to the metadata describing it

• handle regular additions to document storage with only incremental update for document access

• provide effective and user-friendly access to all documents

• use a REST-based approach to make data storage read & write web-accessible

CouchDB

• schema-less database design permits the inclusion of arbitrarily structured documents into the database

• original metadata format can be preserved, and primary data can also be associated with the metadata describing it

• map-reduce framework promises incrementality and scalability

• features a REST-based interface for document uploading, downloading and querying

• also “hosts” GUI, and provides Lucene port

CouchDB Views

• correspond to hardwired DB queries; also stored in CouchDB

• once a query is executed, its result is also stored

• defined in terms of map & reduce

• written in Erlang, JavaScript, and other languages
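As a rough illustration of how a view is itself stored in CouchDB, the sketch below builds a design document as a plain JavaScript object. The document id, the view name and the field `doc.organisation` are placeholders chosen for this example; `_sum` is one of CouchDB's built-in reduce functions.

```javascript
// Sketch of a design document holding a single view.
const designDoc = {
  _id: "_design/organisation",
  language: "javascript",
  views: {
    organisation: {
      // Map and reduce are stored as source strings inside the document.
      map: "function (doc) { if (doc.organisation) { emit(doc.organisation, 1); } }",
      reduce: "_sum",
    },
  },
};

// Path under which the view can be queried once the document is stored.
function viewPath(db, doc, view) {
  return `/${db}/${doc._id}/_view/${view}`;
}

console.log(viewPath("mpi_mgt", designDoc, "organisation"));
```

The database name `mpi_mgt` is taken from the view request shown later in the deck; everything else here is illustrative.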

Map-Reduce Motivation

• process lots of data to produce other data

• using many CPUs

• supporting automatic parallelization & distribution, fault-tolerance, I/O scheduling, status and monitoring

Programming Model: Map

• processes input documents (key-value pairs)

• produces set/table of intermediate pairs

    map(in_key, in_value) → list(out_key, intermed_value)

• must be referentially transparent

• given a document, the function will always emit the same key-value pairs

• document indexing process is incremental, can run in parallel

• can be written in JavaScript and Erlang (& other ports)

Programming Model: Reduce

• combines all values for a particular key

• produces a set of merged output values (usually just one)

reduce(out_key, list(intermed_value)) → list(out_value)

• map function can be complemented by a reduce function

• takes as input the table of emitted values with identical keys as generated by the map function, and aggregates them, e.g.,

• summing up the values associated with the same key:

    function(key, values) {
      return sum(values);
    }

• must be referentially transparent, commutative and associative

• must be callable with the output of the map process, but also with intermediate values computed by a prior reduce (rereduce)
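The rereduce requirement can be made concrete with a small simulation that runs outside CouchDB. The driver functions (`runMap`, `reduceAll`, `reduceChunked`), the field `doc.organisation` and the sample data are assumptions for this sketch; only the `(keys, values, rereduce)` signature follows CouchDB's JavaScript reduce convention.

```javascript
// Map: emits (organisation, 1) for each document that has the field.
function mapFn(doc, emit) {
  if (doc.organisation) emit(doc.organisation, 1);
}

// Reduce: must accept map output and, when rereduce is true, the results
// of earlier reduce calls. For a sum, both cases look the same.
function reduceFn(keys, values, rereduce) {
  return values.reduce((a, b) => a + b, 0);
}

// Run the map over all documents and group emitted values by key.
function runMap(docs) {
  const groups = new Map();
  for (const doc of docs) {
    mapFn(doc, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  return groups;
}

// Reduce every group in a single step.
function reduceAll(groups) {
  const out = {};
  for (const [key, values] of groups) out[key] = reduceFn(null, values, false);
  return out;
}

// Reduce every group in chunks, then rereduce the partial results.
function reduceChunked(groups, chunkSize) {
  const out = {};
  for (const [key, values] of groups) {
    const partials = [];
    for (let i = 0; i < values.length; i += chunkSize) {
      partials.push(reduceFn(null, values.slice(i, i + chunkSize), false));
    }
    out[key] = reduceFn(null, partials, true);
  }
  return out;
}

const docs = [
  { organisation: "MPI" }, { organisation: "MPI" }, { organisation: "MPI" },
  { organisation: "SfS" }, {},
];
const groups = runMap(docs);
const direct = reduceAll(groups);
const chunked = reduceChunked(groups, 2);
console.log(direct, chunked);
```

Both paths yield the same counts because summation is commutative and associative, which is the property the reduce function is required to have.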

Map-Reduce Framework

[Figure: map-reduce framework. Documents are processed by map tasks, which emit intermediate values for key-1, key-2 and key-3; an aggregation step collects the values per key; one reduce per key produces the final values for key-1, key-2 and key-3.]

Implementation

Stages

1. ingestion: OAI-PMH-harvested documents are validated against their schema, and are then
   • converted from XML to JSON
   • supplied with unique id, timestamp, source, and schema information, and
   • added to DB with original XML as attachment

2. indexing: to attack data heterogeneity at schema level

3. curation: to address variability in facet values

4. faceted search indexing: to precompute all possible queries

5. presentation: to give users navigation access to datasets
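The ingestion stage might look roughly like the following sketch, which wraps an already converted record for storage. The function name `toCouchDoc`, the field names `timestamp` and `source`, the attachment name and the sample values are assumptions; `schema` is the field the map functions later switch on, and `_attachments` with base64 `data` is CouchDB's inline attachment form. Validation and the XML-to-JSON conversion itself are left out.

```javascript
// Wrap a converted metadata record together with its original XML.
function toCouchDoc(json, xml, info) {
  return {
    ...json,
    _id: info.id,
    timestamp: info.timestamp,
    source: info.source,
    schema: info.schema,
    _attachments: {
      "original.xml": {
        content_type: "application/xml",
        // Inline attachments carry their content base64-encoded.
        data: Buffer.from(xml, "utf8").toString("base64"),
      },
    },
  };
}

// Made-up record; "$t" holds element text, as in the map examples.
const xml = "<LexicalResource><organization>Example Org</organization></LexicalResource>";
const json = { LexicalResource: { organization: { $t: "Example Org" } } };
const stored = toCouchDoc(json, xml, {
  id: "oai:example.org:1",
  timestamp: "2011-07-19T00:00:00Z",
  source: "example-provider",
  schema: "http://theharvestingday.eu/schemas/clarin_bamdes-1.1.xsd",
});
console.log(JSON.stringify(stored, null, 2));
```

Keeping the XML as an attachment means the original metadata format survives unchanged next to the JSON used for indexing.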

Document Indexing with map-reduce

• document indexing tackles data heterogeneity, given that documents may adhere to different schemas

Map Example (template)

    function(doc) {
      switch (doc.schema) {
        case "<reference_to_schema_a>":
          if (<tree_has_node>) {
            emit(<path_to_node_val>, 1);
          }
          // break placed after the if, so a missing node does not
          // fall through to the next case
          break;
        case "<reference_to_schema_b>":
          [...]
        [...]
      }
    }

Map to index organisations (fragment)

    function(doc) {
      switch (doc.schema) {
        case "http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/[...]1694580/xsd":
          // guard every step of the path before reading the value
          if (doc.CMD
              && doc.CMD.Components
              && doc.CMD.Components.TextCorpusProfile
              && doc.CMD.Components.TextCorpusProfile.GeneralInfo
              && doc.CMD.Components.TextCorpusProfile.GeneralInfo.LegalOwner
              && doc.CMD.Components.TextCorpusProfile.GeneralInfo.LegalOwner.$t) {
            emit(doc.CMD.Components.TextCorpusProfile.GeneralInfo.LegalOwner.$t, 1);
          }
          // break placed after the if, so a missing node does not
          // fall through to the next case
          break;
        case "http://theharvestingday.eu/schemas/clarin_bamdes-1.1.xsd":
          if (doc.LexicalResource
              && doc.LexicalResource.organization
              && doc.LexicalResource.organization.$t) {
            emit(doc.LexicalResource.organization.$t, 1);
          }
          break;
        ...
      }
    }

Map Result (organisations)

Reduce Result (organisations)

Note:

• need for data curation

Document Indexing with map-reduce

• initially, manually coded, and adapted after schema change

• but this is tedious and prone to error

⇒ now automatic generation of views from declarative facet specification using JavaScript (string concatenation)

Facet specification

    { "facet" : "modality",
      "pathInfos" : [
        { "schema": "http://catalog.clarin.eu/...:cr1:p_129094580/...",
          "path" : "doc.CMD.Components.TextCorpusProfile..."
        },
        { "schema": "http://catalog.clarin.eu/...:cr1:p_129094579/...",
          "path" : "doc.CMD.Components.LexicalResourceProfile..."
        },
        ...
      ]
    }
    { "facet" : "language",
      "pathInfos" : [ ... ]
    }
    [...]
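Generating a map function from such a specification by string concatenation could look like the sketch below. It is an illustration under assumptions, not the project's generator: `guardFor` and `generateMap` are hypothetical names, the first schema reference is the placeholder from the template, and every path is assumed to start with `doc.` and contain at least one property.

```javascript
// Turn "doc.A.B.C" into "doc.A && doc.A.B && doc.A.B.C".
function guardFor(path) {
  const parts = path.split(".");
  const guards = [];
  for (let i = 2; i <= parts.length; i += 1) {
    guards.push(parts.slice(0, i).join("."));
  }
  return guards.join(" && ");
}

// Generate the source text of a map function for one facet specification.
function generateMap(spec) {
  let src = "function (doc) {\n  switch (doc.schema) {\n";
  for (const info of spec.pathInfos) {
    src += "    case " + JSON.stringify(info.schema) + ":\n";
    src += "      if (" + guardFor(info.path) + ") {\n";
    src += "        emit(" + info.path + ", 1);\n";
    src += "      }\n      break;\n";
  }
  src += "  }\n}";
  return src;
}

const spec = {
  facet: "organisation",
  pathInfos: [
    { schema: "<reference_to_schema_a>",
      path: "doc.CMD.Components.TextCorpusProfile.GeneralInfo.LegalOwner.$t" },
    { schema: "http://theharvestingday.eu/schemas/clarin_bamdes-1.1.xsd",
      path: "doc.LexicalResource.organization.$t" },
  ],
};
const source = generateMap(spec);

// Compile the generated source with a local emit to try it out.
const emitted = [];
const compiled = new Function("emit", "return (" + source + ");")(
  (key, value) => emitted.push([key, value])
);
compiled({ schema: "http://theharvestingday.eu/schemas/clarin_bamdes-1.1.xsd",
           LexicalResource: { organization: { $t: "Example Org" } } });
compiled({ schema: "<reference_to_schema_a>", CMD: {} });
console.log(source);
console.log(emitted);
```

The second test document lacks the node, so its guard fails and nothing is emitted for it.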

Data Curation

• each map function gives a view of the document space in terms of the facet it represents

• analysis shows large variability for many facet values, e.g., organisations with different names

• devised curation tables that map given names to preferred names

• data curation performed on the indices (for faceted search) rather than the original documents

Conversion of Views to Documents

• faceted search to be defined in terms of document indexing established in first map-reduce cycle

• but CouchDB’s map-reduce framework is defined in terms of documents

• thus, not possible to define views on views, at least not directly

Views on Views

• re-using the result of document indexing by converting resulting views into documents

• conversion takes care of data curation

• conversion written in JavaScript, implementing a hash table of hash tables

  • outer hash table gives access to the facets, e.g., “language”

  • inner hash table gives access to all the values a chosen facet can take, e.g., associating key “German” with all documents with this piece of information

• new index (of type “docIndex”) is stored in an extra CouchDB database

  • which also holds all views to implement faceted search

• one index file for each document collection
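A possible shape of this conversion, including curation, is sketched below. The curation entries, the `_id` scheme and the function names are invented for illustration; the input rows follow the `{ id, key, value }` form that CouchDB returns for an unreduced view, and the output uses the `docType` value “docIndex” that the faceted search views test for.

```javascript
// Hypothetical curation table: given name -> preferred name, per facet.
const CURATION = {
  organisation: {
    "MPI Nijmegen": "Max Planck Institute for Psycholinguistics",
  },
};

// Replace a value by its preferred name when the table has an entry.
function curate(facet, value) {
  const table = CURATION[facet] || {};
  return Object.prototype.hasOwnProperty.call(table, value) ? table[value] : value;
}

// rowsByFacet: facet -> rows of the corresponding unreduced view.
// Returns one index document: a hash table of hash tables.
function toDocIndex(collection, rowsByFacet) {
  const indexDoc = { _id: "docIndex:" + collection, docType: "docIndex" };
  for (const [facet, rows] of Object.entries(rowsByFacet)) {
    const inner = new Map();
    for (const row of rows) {
      const key = curate(facet, row.key);
      if (!inner.has(key)) inner.set(key, new Set());
      inner.get(key).add(row.id);
    }
    indexDoc[facet] = {};
    // Store each document set as a sorted list without duplicates.
    for (const [key, ids] of inner) indexDoc[facet][key] = [...ids].sort();
  }
  return indexDoc;
}

const indexDoc = toDocIndex("example-collection", {
  organisation: [
    { id: "d2", key: "MPI Nijmegen", value: 1 },
    { id: "d1", key: "Max Planck Institute for Psycholinguistics", value: 1 },
  ],
  language: [
    { id: "d1", key: "German", value: 1 },
    { id: "d1", key: "German", value: 1 },
  ],
});
console.log(JSON.stringify(indexDoc, null, 2));
```

Curation happens while the index is built, so the original documents stay untouched and both spellings of the organisation end up under one key.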

document index for one collection

Map View for Country

    fun ({Doc}) ->
        case proplists:get_value(<<"docType">>, Doc) of
            <<"docIndex">> ->
                {CountryHash} = proplists:get_value(<<"country">>, Doc, {[]}),
                {LanguageHash} = proplists:get_value(<<"language">>, Doc, {[]}),
                <other hashes>
                lists:foreach(
                  fun (CountryItem) ->
                      DocSet = proplists:get_value(CountryItem, CountryHash),
                      DocSetSize = ordsets:size(DocSet),
                      if DocSetSize > 0 ->
                              Emit(CountryItem,
                                   {[{<<"facet">>, <<"_total_">>},
                                     {<<"value">>, <<"_total_">>},
                                     {<<"docs">>, DocSet}]}),
                              lists:foreach(
                                fun (LanguageItem) ->
                                    Intersection = ordsets:intersection(
                                        proplists:get_value(LanguageItem, LanguageHash),
                                        proplists:get_value(CountryItem, CountryHash)),
                                    case Intersection == [] of
                                        false ->
                                            Emit(CountryItem,
                                                 {[{<<"facet">>, <<"language">>},
                                                   {<<"value">>, LanguageItem},
                                                   {<<"docs">>, ordsets:size(Intersection)}]});
                                        _ -> ok
                                    end
                                end,
                                proplists:get_keys(LanguageHash)),
                              <other intersections for other facets[...]>
                         true -> ok
                      end
                  end,
                  proplists:get_keys(CountryHash));
            _ -> ok
        end
    end.

Result for Country View (fragment)

Reduce Function (common to FB views)

    fun (Key, Values) ->
        AddToDict = fun (CurrentEntry, Dict) ->
            {[{<<"facet">>, Facet}, {<<"value">>, Value},
              {<<"docs">>, Documents}]} = CurrentEntry,
            DictKey = {Facet, Value},
            case Facet of
                <<"_total_">> ->
                    dict:append_list(DictKey, Documents, Dict);
                _ ->
                    dict:update(DictKey,
                                fun (Old) -> Old + Documents end,
                                Documents, Dict)
            end
        end,

        DictToList = fun (Dict) ->
            lists:map(fun (Entry) ->
                          {{Facet, Value}, Docs} = Entry,
                          {struct, [{<<"facet">>, Facet},
                                    {<<"value">>, Value},
                                    {<<"docs">>, Docs}]}
                      end,
                      dict:to_list(Dict))
        end,

        DictToList(lists:foldl(fun (Value, Dict) ->
                                   AddToDict(Value, Dict)
                               end,
                               dict:new(), Values))
    end.

Coding of Views

• initially, views were coded manually in JavaScript

• but poor performance in view computation on large index files

• led to the use of Erlang instead, which resulted in a significant performance boost

• writing views by hand is tedious and prone to error

• have written Erlang code that generates the code definitions for Erlang views automatically

• Erlang meta-code based on the concatenation of Erlang code strings

facet specification

    -define( FACETS, ["country", "language", "modality", "organisation", "resourceclass"] ).

    -define( COND_FACETS, [ { "resourceclass", "corpus", ["genre"] },
                            { "resourceclass", "Tool", ["tooltype", "applicationtype",
                                                        "inputtype", "outputtype",
                                                        "lifecyclestatus"] } ] ).

Coding of Views (cont’d)

• specification leads to the generation of 121 views, with each view having between 5000 and 12000 bytes of Erlang code

• not all possible combinations of set intersections are necessary

• document sets resulting from first selecting facet F1 and then selecting facet F2 are identical to those when F2 is selected first and then F1

• realized computation of all necessary intersections using Erlang combinators

Use of Erlang Combinators

    comb_4(L) ->
        case length(L) < 4 of
            true -> "supply lists with length >= 4";
            _ -> [ {A,B,C,D,Z} || A <- L,
                                  B <- L--[A], A < B,
                                  C <- L--[A,B], B < C,
                                  D <- L--[A,B,C], C < D,
                                  Z <- [L--[A,B,C,D]] ]
        end.
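The same idea, enumerating unordered facet selections together with the facets left over, can be sketched in JavaScript. The functions `combinations` and `withRest` are illustrative names and not part of the generator described in the deck; the facet list is the set of five unconditional facets named in the evaluation.

```javascript
// All k-element subsets of items, preserving input order inside each subset.
function combinations(items, k) {
  if (k === 0) return [[]];
  if (items.length < k) return [];
  const [head, ...tail] = items;
  const withHead = combinations(tail, k - 1).map((rest) => [head, ...rest]);
  return withHead.concat(combinations(tail, k));
}

// Pair each selection with the remaining facets, like Z in comb_4.
function withRest(items, k) {
  return combinations(items, k).map((selected) => ({
    selected,
    rest: items.filter((item) => !selected.includes(item)),
  }));
}

const facets = ["language", "country", "organisation", "modality", "genre"];
const perLevel = [1, 2, 3, 4, 5].map((k) => combinations(facets, k).length);
const total = perLevel.reduce((a, b) => a + b, 0);
console.log(perLevel, total);
console.log(withRest(facets, 4)[0]);
```

For five facets this gives 5, 10, 10, 5 and 1 selections per level, 31 in total, which agrees with the view counts reported in the evaluation.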

Faceted Search GUI

Faceted Search Queries = Map-Reduce

View request

    /mpi_mgt/_design/country/_view/country?key="Germany"&reduce=true

View result

    {"rows":[
      {"key":"Germany",
       "value":[
         {"facet":"modality","value":"Unspecified","docs":140},
         {"facet":"modality","value":"Speech/gestures","docs":230},
         {"facet":"language","value":"German Sign Language","docs":433},
         {"facet":"genre","value":"Secondary document","docs":3},
         {"facet":"genre","value":"Movie","docs":458},
         {"facet":"_total_","value":"_total_",
          "docs":["oai:www.mpi.nl:MPI100","oai:www.mpi.nl:MPI1002978"...]}
         [...]
       ]}
    ]}
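A client might build such a request and regroup the result as in the following sketch. The function names are assumptions, and the sample row is a shortened copy of the result above with the elided parts left out; the key is sent as URL-encoded JSON, which is how CouchDB expects view keys.

```javascript
// Build the request path for a facet view, with the key JSON-encoded.
function viewRequest(db, facet, key) {
  return `/${db}/_design/${facet}/_view/${facet}` +
    `?key=${encodeURIComponent(JSON.stringify(key))}&reduce=true`;
}

// Regroup one result row: facet -> value -> count, plus the id list
// carried by the special "_total_" entry.
function facetCounts(row) {
  const counts = {};
  let total = [];
  for (const entry of row.value) {
    if (entry.facet === "_total_") {
      total = entry.docs;
      continue;
    }
    counts[entry.facet] = counts[entry.facet] || {};
    counts[entry.facet][entry.value] = entry.docs;
  }
  return { counts, total };
}

// Shortened sample, copied from the view result shown above.
const row = {
  key: "Germany",
  value: [
    { facet: "modality", value: "Unspecified", docs: 140 },
    { facet: "modality", value: "Speech/gestures", docs: 230 },
    { facet: "language", value: "German Sign Language", docs: 433 },
    { facet: "_total_", value: "_total_",
      docs: ["oai:www.mpi.nl:MPI100", "oai:www.mpi.nl:MPI1002978"] },
  ],
};
const request = viewRequest("mpi_mgt", "country", "Germany");
const grouped = facetCounts(row);
console.log(request);
console.log(grouped.counts);
```

Because all counts for one selection arrive in a single row, the GUI can render every remaining facet from one request.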

Evaluation

Views for Document Indexing

• views for document indexing are automatically generated from facet specification using JavaScript

• resulting map and reduce functions are in JavaScript too, CouchDB’s default view language

• computation of the view “organisation” takes approximately 25 minutes on 86k documents

• one-time payoff

• no effort has been made yet to increase the speed of view computation

• small changes in document database will have only small impact on view recomputation at the document indexing level

Views for Faceted Search

• computation of faceted search views computationally expensive

• JavaScript too slow

• Erlang much faster (better in memory and processor usage)

Evaluation setting

• each Erlang view stored in separate CouchDB design document

• executed map-reduce computation on a 24-core, 96 GB machine

• harvested and ingested approximately 86,000 metadata documents on language resources

• five unconditional facets: “language” (371), “country” (67), “organisation” (39), “modality” (32), and “genre” (50)

• many different facet values: “modality” = “speech” (59463); “language” = “Dutch” (18345); “country” = “Germany” (16178); “organisation” = “Max Planck Institute for Psycholinguistics” (16568), and “genre” = “Discourse” (33676)

• 31 different map-reduce pairs

Evaluation

Views Computation for Faceted Search

• generation of the views “language”, “country”, “organisation”, “modality”, and “genre” takes altogether less than one minute (using 5 CPUs)

• generation of the ten 2-level views (users selected two facets, e.g., “country”:“genre”, “country”:“language”, ...) takes less than 1 minute (using 10 CPUs)

• computation of the ten 3-level views, where users selected three facets: < 7.5 minutes

• computation of the five 4-level views: more than 2 hours

Future work for optimisation

• currently, one indexing document for each of the metadata providers

⇒ update from one data provider only requires a limited view recomputation

• but some data providers provide tens of thousands of documents

⇒ optimise index documents for faceted search

  • reflect additions by new index document, so that incremental updates are indeed limited to document additions

  • modifications and deletions by introducing MODIFY and DELETE lists that a revised map-reduce combination would need to consider
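
One way to picture the proposed additions and MODIFY/DELETE lists is as deltas applied to existing facet counts instead of a full recomputation. The Python sketch below is only an illustration of that idea under assumed data shapes: `apply_updates`, its parameters, and the `index` mapping are hypothetical and do not describe the planned CouchDB implementation.

```python
from collections import Counter


def apply_updates(counts, index, additions=(), modifications=(), deletions=()):
    """Update facet-value counts in place from ADD / MODIFY / DELETE lists.

    counts: Counter mapping facet value -> number of documents
    index:  dict mapping document id -> facet value currently indexed
    """
    for doc_id, value in additions:
        index[doc_id] = value
        counts[value] += 1
    for doc_id, new_value in modifications:
        old_value = index[doc_id]
        counts[old_value] -= 1
        counts[new_value] += 1
        index[doc_id] = new_value
    for doc_id in deletions:
        counts[index.pop(doc_id)] -= 1
    # Drop facet values that no document carries any more.
    for value in [v for v, c in counts.items() if c <= 0]:
        del counts[value]
    return counts


index = {"d1": "Dutch", "d2": "Dutch", "d3": "German"}
counts = Counter(index.values())
apply_updates(
    counts,
    index,
    additions=[("d4", "French")],
    modifications=[("d2", "German")],
    deletions=["d3"],
)
```

After the call, each of the three values is carried by exactly one document, and only the four touched entries were visited.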

Related Work: Flamenco

• toolkit with web-based interface to give faceted access to large data collections, given import format:

  • the file facets.tsv listing all facets

  • the file attrs.tsv listing all attributes of a given item

  • the file items.tsv listing each collection item (following definition in attrs.tsv) with unique id

  • for each entry facet in facets.tsv

    • file facet_term: lists all terms for given facet with unique facet term ids

    • facet_map associates unique facet term id with item ids

• data files ingested into Flamenco relational database (MySQL)

• Flamenco generates faceted browser’s default/customizable GUI

• user’s selection of facet terms translated into corresponding MySQL queries to compute all necessary set interactions

• results of executing MySQL queries are cached to avoid re-computation
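
The combination of a facet_map, per-selection queries, and a result cache can be modelled in a few lines. The Python sketch below is not Flamenco's code (Flamenco issues MySQL queries): the `FACET_MAP` contents and `items_for` are hypothetical, and treating a multi-term selection as an intersection of item-id sets is an assumption about what the queries compute.

```python
from functools import lru_cache

# Hypothetical facet_map: facet term id -> ids of items carrying that term.
FACET_MAP = {
    "language:dutch": frozenset({1, 2, 3}),
    "country:germany": frozenset({2, 3, 4}),
    "genre:discourse": frozenset({3, 4, 5}),
}


@lru_cache(maxsize=None)
def items_for(selection):
    """Return the items that match every selected facet term.

    `selection` is a frozenset of term ids, so it is hashable and
    order-independent, which lets lru_cache reuse earlier results.
    """
    item_sets = [FACET_MAP[term] for term in selection]
    if not item_sets:
        return frozenset()
    return frozenset.intersection(*item_sets)


selection = frozenset({"language:dutch", "country:germany"})
first = items_for(selection)   # computed
second = items_for(selection)  # served from the cache
```

A cache-warming script of the kind mentioned for the VLO would simply call `items_for` once for every selection users are expected to make.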

Flamenco used in VLO

• faceted search access to language resources using Flamenco with same dataset

• See http://www.clarin.eu/vlo

• used Perl to translate 80,000+ XML-based metadata files into Flamenco’s indexing data format (incl. curation)

• ingested data into the Flamenco database and adapted GUI

• script to generate all queries to warm-up the cache

Comparison

• data preparation required for Flamenco roughly corresponds to our CouchDB-based document indexing phase (simple views)

• data curation only happens when the views of the indexing phase are converted into the indexing documents

• MySQL queries fired by Flamenco correspond to the views computed in terms of the indexing documents

Advantages of CouchDB

• CouchDB also stores original metadata documents (with varying schemata), thus also serves as permanent storage

• conditional facets contribute to usability, guiding users’ navigation

  • need only be computed in subsets whose documents are indexed against terms the conditional facet depends on

• index generation accommodates incremental updates on the metadata sets, supporting regular harvesting without recomputing all indices/views

  • in Flamenco, any change in data set requires overwriting of all contexts/caches

• facet specification offers more declarative view

  • index generation taken to higher level

  • easy to experiment with different facet configurations

  • but, once facet specification is changed, index generation starts from scratch
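
The point about conditional facets, that they are computed only over the documents indexed against the term they depend on, can be sketched as a filter followed by a count. The Python below is an illustration only: the facet name `speech_type`, its dependency on “modality” = “speech”, and the function `conditional_view` are hypothetical and not taken from the talk's facet specification.

```python
from collections import Counter

# Hypothetical index records; "speech_type" is an invented conditional facet.
DOCS = [
    {"modality": "speech", "speech_type": "spontaneous"},
    {"modality": "speech", "speech_type": "read"},
    {"modality": "speech", "speech_type": "spontaneous"},
    {"modality": "writing"},
]

# Hypothetical specification: conditional facet -> (facet, term) it depends on.
CONDITIONAL_FACETS = {"speech_type": ("modality", "speech")}


def conditional_view(docs, facet, spec):
    """Count the values of a conditional facet within its dependent subset only."""
    parent_facet, parent_term = spec[facet]
    subset = [doc for doc in docs if doc.get(parent_facet) == parent_term]
    return Counter(doc[facet] for doc in subset if facet in doc)


speech_type_view = conditional_view(DOCS, "speech_type", CONDITIONAL_FACETS)
```

Documents outside the subset are never inspected for the conditional facet, which is where the saving comes from.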

Conclusion

• CouchDB with its native language Erlang is well suited for the development of industrial-strength applications

• CouchDB’s REST-based interface offers lean alternative to established software (Java-based Apache Tomcat webserver)

• Erlang’s main limitation is the lack of a full macro package allowing users to write programs that write other programs

  • Common-Lisp-like defmacro would have made life easier

• currently, no strong support for Lisp (or Haskell) port to index and query documents in CouchDB

• CouchDB’s main limitation, when used with Erlang, is the lack of documentation and example code

Conclusion

• general approach to aggregate heterogeneously structured documents and to make them accessible via faceted (and full-text) search

• works as long as documents’ relevant content can be given in JSON (CouchDB’s native format)

• for given context, facet specification was straightforward

• desirable to detect good facet candidates automatically

  • Castanet algorithm

    • requires definition of target terms to best reflect the topics present in given collection

    • combines target terms with hypernymy (IS-A) information of WordNet to both

      • build facet hierarchies and

      • assign documents to the facets
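
The two uses of hypernymy named above, building a hierarchy and assigning documents to it, can be illustrated with a toy IS-A table. The Python sketch below is a strong simplification and not the Castanet implementation: the `HYPERNYM` table stands in for WordNet, and the function names, target terms, and sample documents are all hypothetical.

```python
# Toy IS-A table standing in for WordNet hypernymy (hypothetical data).
HYPERNYM = {
    "interview": "conversation",
    "dialogue": "conversation",
    "conversation": "speech",
    "lecture": "speech",
}


def hypernym_path(term):
    """Follow IS-A links from a term up to its root; return the path root first."""
    path = [term]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return list(reversed(path))


def build_hierarchy(target_terms):
    """Merge the hypernym paths of all target terms into one nested dict."""
    tree = {}
    for term in target_terms:
        node = tree
        for step in hypernym_path(term):
            node = node.setdefault(step, {})
    return tree


def assign_documents(docs, target_terms):
    """Attach each document to every node on the path of a term it mentions."""
    assignment = {}
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        for term in target_terms:
            if term in words:
                for step in hypernym_path(term):
                    assignment.setdefault(step, set()).add(doc_id)
    return assignment


targets = ["interview", "lecture"]
docs = {"d1": "An interview recorded in Nijmegen", "d2": "Lecture notes"}
tree = build_hierarchy(targets)
assignment = assign_documents(docs, targets)
```

Here both documents end up under the shared ancestor `speech`, while `interview` and `lecture` keep them apart further down the hierarchy.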

Questions