PANGAEA - Providing access to geoscientific data using Apache Lucene Java


  • 8/8/2019 PANGAEA - Providing access to geoscientific data using Apache Lucene Java

    1/25

    PANGAEA - Providing access to geoscientific data using Apache Lucene Java

    Uwe Schindler

    PANGAEA / SD DataSolutions GmbH, [email protected]


    My Background

    - I am a committer and PMC member of Apache Lucene and Solr.
    - My main focus is on the development of Lucene Java.
    - I implemented fast numerical search and maintain the new attribute-based text analysis API.
    - I studied physics at the University of Erlangen-Nuremberg and work as a consultant and software architect for PANGAEA (Publishing Network for Geoscientific & Environmental Data) in Bremen, Germany, where I implemented the portal's geo-spatial retrieval functions with Lucene Java.
    - I give talks about Lucene at various international conferences such as ApacheCon EU/US, Lucene Eurocon, and Berlin Buzzwords, and at various local meetups.

    http://www.lucidimagination.com/blog/2010/03/10/state-of-spatial-support-in-apache-solr/

    About PANGAEA

    - Since 1993: Information system for earth system science data, hosted by AWI & MARUM
    - 2001: Mandate of the International Council for Science (ICSU): World Data Center for Marine Environmental Sciences (WDC-MARE)
    - 2007: Mandate of the World Meteorological Organisation (WMO): World Radiation Monitoring Center (WRMC)
    - 2010 (certification in progress): Mandate of the World Meteorological Organisation (WMO): Data Collection and Processing Center (DCPC)


    Network of World Data Centers (Geophysical Year 1957)

    - Nuclear Radiation: Tokyo, Japan
    - WDC Co-ordination Offices: Washington DC, USA; Beijing, China
    - Meteorology: Asheville NC, USA; Beijing, China; Obninsk, Russia
    - Oceanography: Obninsk, Russia; Silver Spring MD, USA; Tianjin, China
    - Paleoclimatology: Boulder CO, USA
    - Marine Geology and Geophysics: Boulder CO, USA; Moscow, Russia
    - Remotely Sensed Land Data: Sioux Falls SD, USA
    - Renewable Resources and Environment: Beijing, China
    - Recent Crustal Movements: Ondrejov, Czech Republic
    - Airglow: Mitaka, Japan
    - Astronomy: Beijing, China
    - Atmospheric Trace Gases: Oak Ridge TN, USA
    - Aurora: Tokyo, Japan
    - Cosmic Rays: Toyokawa, Japan
    - Geology: Beijing, China
    - Human Interactions in the Environment: Palisades NY, USA
    - Ionosphere: Tokyo, Japan
    - Earth Tides: Brussels, Belgium
    - Geomagnetism: Copenhagen, Denmark; Edinburgh, UK; Kyoto, Japan; Colaba, India
    - Glaciology: Boulder CO, USA; Cambridge, UK; Lanzhou, China
    - Marine Environmental Sciences: Bremen, Germany (2001)
    - Rotation of the Earth: Obninsk, Russia; Washington DC, USA
    - Satellite Information: Greenbelt MD, USA
    - Rockets and Satellites: Obninsk, Russia
    - Seismology: Denver CO, USA; Beijing, China
    - Solar Radio Emission: Nagano, Japan
    - Space Science: Beijing, China
    - Space Science Satellites: Kanagawa, Japan
    - Solar Activity: Meudon, France
    - Soils: Wageningen, The Netherlands
    - Sunspot Index: Brussels, Belgium
    - Solar Terrestrial Physics: Boulder CO, USA; Didcot Oxon, UK; Moscow, Russia; Haymarket, Australia
    - Solid Earth Geophysics: Beijing, China; Boulder CO, USA; Moscow, Russia


    Why do we need Data Libraries?

    - Good scientific practice
    - Needed for verification of scientific work
    - Good availability of data for large-scale and complex scientific approaches

    -than reproduction


    Geosciences before 1900

    - Turin papyrus, ~1160 BC
    - William Smith, 1815
    - HMS Challenger, 1875

    http://upload.wikimedia.org/wikipedia/commons/0/0b/Turine_Papyrus,_ca._1320_v.C..jpg

    Technical Improvements

    - ENIAC, 1944
    - Magnetometer


    Development of the global climate

    [Figure panels: "The last 1300 years"; x-axis: thousands of years before present]


    Information increase in empirical sciences

    [Figure: "Publications" and "Data" plotted over 1970-2010 (y-axis 0-30), with a "?" annotation]


    Archiving and publication of scientific data

    - Data acquisition
    - Quality assurance
    - Long-term availability and access


    Long term archive

    - Open access & non-restricted data
      o Creative Commons license
    - Data accepted from individual scientists, institutes, and science projects
    - Long-term funding for basic operation
      o hardware, software, system management & organisation
    - Long-term preservation of data
      o Technical: security, migration of media
      o Usability: preserving the integrity & semantics of data sets


    Contents


    Data Types in PANGAEA

    [Figure: profiles for PS1389-3, PS1390-3, PS1431-1, PS1640-1, PS1648-1 plotted against age (kyr, max. 233.55 kyr); panels per profile: IRD (grav/10 cm3), Sand (%), CaCO3 (%), TOC (%), Radio (%/sand), Smect (%/clay)]

    [Figure: map, 54°0' to 55°30' latitude and 11° to 15° longitude; legend: world vector shore line, grain size classes (KOLP A, KOLP B, KOEHN, KOEHN2), geochemistry; scale 1:2695194 at latitude 0; source: Baltic Sea Research Institute, Warnemünde]

    - Profiles => doi:10.1594/PANGAEA.701299
    - Time series => doi:10.1594/PANGAEA.323487
    - Sea bed photos => doi:10.1594/PANGAEA.319877
    - Distributed samples => doi:10.1594/PANGAEA.51749
    - Complex data => doi:10.1594/PANGAEA.108079
    - Air photos => doi:10.1594/PANGAEA.323540
    - Audio record => doi:10.1594/PANGAEA.339110


    Statistics (9/2010)

    [Figure: data sets by category: unclassified, Sediment, Water, Corals, Atmosphere, Ice]

    - Total number of data sets: ~1 million
    - Data items: ~8 billion


    Now the technical details :-)


    PANGAEA - Architecture

    [Diagram labels: Sybase ASE, RDB, Hard disk + tape (silo), Middleware, Webserver, Editorial system, PANGAEA search engine, Apache Lucene, Google Maps / Earth]


    Indexing contents from relational database with dynamic updates

    [Diagram labels: Data Set, Staffs, Projects, Data Series, Events, Update Log, XML Data Set Description (Metadata)]
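The diagram names an update log next to the database tables but does not show how it is consumed. The following Python sketch illustrates one common way such a log can drive incremental index updates. It is not PANGAEA's implementation: the `LogEntry` layout, the function `apply_update_log`, and the in-memory dictionaries standing in for the database and the index are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """One row of a hypothetical update log table."""
    seq: int          # monotonically increasing sequence number
    dataset_id: int   # key of the data set that changed
    deleted: bool     # True if the data set was removed


def apply_update_log(index, log, last_seq, load_metadata):
    """Apply all log entries newer than last_seq and return the new mark.

    index         -- dict mapping dataset_id -> metadata document
    log           -- iterable of LogEntry, in any order
    last_seq      -- highest sequence number already applied
    load_metadata -- callable(dataset_id) -> current metadata document
    """
    # Only the newest entry per data set matters, so coalesce first.
    latest = {}
    for entry in sorted(log, key=lambda e: e.seq):
        if entry.seq > last_seq:
            latest[entry.dataset_id] = entry
    for entry in latest.values():
        if entry.deleted:
            index.pop(entry.dataset_id, None)
        else:
            # Re-read the full description and overwrite the old document.
            index[entry.dataset_id] = load_metadata(entry.dataset_id)
    return max((e.seq for e in latest.values()), default=last_seq)


# Usage with an in-memory stand-in for the relational database.
database = {2: "<dataset id='2'/>"}   # data set 1 has already been deleted
index = {}
log = [LogEntry(1, 1, False), LogEntry(2, 2, False), LogEntry(3, 1, True)]
mark = apply_update_log(index, log, 0, database.__getitem__)
```

Coalescing the log before applying it means a data set that was changed and then deleted is never re-read from the database, and storing the returned mark lets the next run skip entries that were already applied.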


    Indexed Information

    - Textual metadata: citation (authors, title), abstract, measurement parameters, methods, associated projects, comments, documentation (including field info for all XML schema element types)
    - Fulltext data set contents
    - Geographical information: latitude/longitude/BBOX/track, dates, geological age, depth/elevation [NumericField/NumericRangeQuery]
    - Soon: Fulltext of attached external documentation
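To illustrate why numeric range queries are enough for bounding-box retrieval, here is a minimal Python sketch, not Lucene code. It assumes each data set stores its box as four numeric fields (`west`, `east`, `south`, `north`, names chosen for this example) and that no box crosses the date line. Two boxes then overlap exactly when four one-sided range conditions hold, one per field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    west: float
    east: float
    south: float
    north: float


def overlaps(doc, query):
    """True if the boxes intersect (boxes crossing the date line are not handled)."""
    return (doc.west <= query.east and doc.east >= query.west
            and doc.south <= query.north and doc.north >= query.south)


def search_bbox(docs, query):
    """Return the ids of all data sets whose box intersects the query box."""
    return sorted(doc_id for doc_id, box in docs.items() if overlaps(box, query))


# Hypothetical data sets; the coordinates are made up for the example.
docs = {
    "baltic_map": BBox(11.0, 15.0, 54.0, 55.5),
    "southern_box": BBox(-60.0, -20.0, -78.0, -64.0),
}
hits = search_bbox(docs, BBox(10.0, 12.0, 53.0, 54.5))
```

In a search index each of the four comparisons becomes a range condition on a single numeric field, and the conditions are combined with a boolean AND.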


    Geo-Retrieval with Lucene


    Using scored queries with KML regions as filters
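The idea of filtering a scored result list by a region can be sketched as follows. This is an illustration in plain Python, not the portal's code: it assumes the region's outer ring has already been extracted from the KML file as a list of (longitude, latitude) vertices, that each hit is a hypothetical `(doc_id, lon, lat, score)` tuple, and it uses the standard ray-casting point-in-polygon test.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test; ring is a list of (lon, lat) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def filter_hits(scored_hits, ring):
    """Drop hits outside the region; scores are left unchanged."""
    kept = [h for h in scored_hits if point_in_polygon(h[1], h[2], ring)]
    return sorted(kept, key=lambda h: h[3], reverse=True)


# Hypothetical rectangular region and hits: (doc_id, lon, lat, score).
region = [(10.0, 53.0), (16.0, 53.0), (16.0, 56.0), (10.0, 56.0)]
hits = [("a", 12.0, 54.5, 0.4), ("b", 30.0, 54.5, 0.9), ("c", 14.0, 55.0, 0.7)]
result = filter_hits(hits, region)
```

The filter only decides which hits survive; the ranking still comes from the text query, so the best-scoring hit outside the region (`"b"` here) is removed without affecting the order of the rest.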


    Apache Lucene as fast Key-Value Store

    - Lucene is used for almost every query on the web-client
    - [...] of keyword terms indexed for quick retrieval of data sets
    - Example: Lookup of data sets related to publications using DOI. PANGAEA is hit by hundreds of DOI lookup queries per second from scientific publishers:

    http://doi.pangaea.de/10.1016/0377-8398(92)90001-Z

    http://dx.doi.org/10.1016/S0377-8398(01)00044-5
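Using an inverted index as a key-value store comes down to indexing each key as a single, untokenized term and looking it up exactly. The Python sketch below shows that idea with a dictionary of posting sets; it is not Lucene code. The class `KeywordIndex`, the field name `supplement_to`, and the data set ids are hypothetical, and only the two publication DOIs are taken from the example URLs above.

```python
from collections import defaultdict


class KeywordIndex:
    """Maps untokenized (field, value) terms to the data sets that carry them."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, dataset_id, field, value):
        self._postings[(field, self._normalize(value))].add(dataset_id)

    def lookup(self, field, value):
        # .get() avoids creating empty posting sets for unknown keys.
        return sorted(self._postings.get((field, self._normalize(value)), ()))

    @staticmethod
    def _normalize(value):
        # Fold case so lookups are case-insensitive, as DOI names are.
        return value.strip().lower()


# Hypothetical data sets linked to the publications they supplement.
index = KeywordIndex()
index.add(101, "supplement_to", "10.1016/0377-8398(92)90001-Z")
index.add(102, "supplement_to", "10.1016/0377-8398(92)90001-Z")
index.add(103, "supplement_to", "10.1016/S0377-8398(01)00044-5")
found = index.lookup("supplement_to", "10.1016/0377-8398(92)90001-z")
```

One key can map to several data sets, which is why the lookup returns a list rather than a single value.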

    Live Presentation

    http://www.pangaea.de/

    Contact

    Uwe Schindler

    PANGAEA - Publishing Network for Geoscientific & Environmental Data
    MARUM, Leobener Str., 28359 Bremen, [email protected]

    SD DataSolutions GmbH
    Wätjenstr. 49, 28213 Bremen, Germany
    [email protected]

    Thank you!

    Learn more about Apache Lucene at www.lucidimagination.com

    http://www.lucidimagination.com/events/revolution2010