Data Quality In Real Estate Dimitris Kontokostas, Andy van der Hoeven, Samur Araujo Amsterdam, Sep 14th 2017, LDQ Workshop, SEMANTiCS Conference

Page 1: Data quality in Real Estate

Data Quality In Real Estate

Dimitris Kontokostas, Andy van der Hoeven, Samur Araujo

Amsterdam, Sep 14th 2017, LDQ Workshop, SEMANTiCS Conference

Page 2: Data quality in Real Estate

About Geophy

● Goal to map all buildings in the world

● Provide a quality score for each building

○ Based on location, building status, history, environmental metrics, etc.

● Semantic platform

○ RDF eases the data integration process

● Team of 45, aiming to double by next year

Page 3: Data quality in Real Estate

Real Estate is a very complex domain

Really!

Page 4: Data quality in Real Estate

Possible constraints on addresses?

● An address will start with, or at least include, a building number.

● When there is a building number, it will be all-numeric.

● No buildings are numbered zero

● Well, at the very least no buildings have negative numbers

● A building number will only be used once per street

● A building will only have one number

● A building name won't also be a number

● [...] https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses
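
A minimal sketch of how quickly the first two assumptions break once they are encoded as a validation rule. The regular expression and the sample addresses are made up for illustration; they are not Geophy's rules or data.

```python
import re

# Naive rule (an assumption for this sketch): an address starts with an
# all-numeric building number followed by whitespace and more text.
NAIVE_ADDRESS = re.compile(r"^\d+\s+\S")


def passes_naive_rule(address: str) -> bool:
    """Return True if the address starts with an all-numeric building number."""
    return NAIVE_ADDRESS.match(address) is not None


# Invented sample addresses, chosen to exercise the rule.
samples = [
    "12 Example Street",            # number first, all-numeric
    "12A Example Street",           # building number with a letter suffix
    "Example House, Example Lane",  # named building without a number
    "Examplestraat 12",             # number after the street name (Dutch order)
]

results = {address: passes_naive_rule(address) for address in samples}
```

Only the first sample passes; the other three are plausible addresses that the naive rule would report as violations.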

Page 5: Data quality in Real Estate

Geophy [set of] ontologies

● 13 ontologies (+ 9 external)

● 125 Classes

○ Buildings

○ Addresses

○ Companies

○ [...]

● 720 properties

○ 500 datatype

○ 160 relation properties

● Growing...

Page 6: Data quality in Real Estate

Quality is expensive

● Quality of source data

○ Free, open, closed data sources, etc.

● Data clean up process

○ Violations, deduplication, precision, etc.

○ How much time and effort can one afford?

How much quality is good enough?

● Fitness for use

Page 7: Data quality in Real Estate

Quality of ...

● Source data

○ Accuracy of the source

● Translation of source data

○ RDF mappings, RML, D2RQ, scripts, etc.

● Model design

○ Modelling quality

○ Data fitting on schema

● Model definition

○ Mapping of model on RDFS, OWL, ShEx|SHACL Shapes, etc.

○ Semantics, i.e. RDFS, OWL DL/RL/Full, etc.

Page 8: Data quality in Real Estate

Evolution & quality

● Data evolves

● so do ontologies

● so do RDF mappings

● so does code

● so do SPARQL queries

● so do constraints

http://aligned-project.eu

Page 9: Data quality in Real Estate

Scaling quality ...

● Thousands of triples

● Millions of triples

● Billions of triples

● ?

Try to move validation into the K range, i.e. thousands of triples (when possible)

Page 10: Data quality in Real Estate

Validate closer to the source

● Validate the model

● Validate the RDF mappings

● Validate RDF mapping excerpts

● Validate instance data
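
Validating a mapping excerpt can be sketched as below: only the first few triples produced by a mapping are checked, so errors surface before the full dataset is generated. The names `validate_excerpt`, `has_non_empty_object` and `mapped_triples` are hypothetical, and the triples are invented sample data.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]


def validate_excerpt(
    triples: Iterable[Triple],
    check: Callable[[Triple], bool],
    limit: int = 1000,
) -> List[Triple]:
    """Run `check` on at most `limit` triples and return the ones that fail."""
    return [triple for triple in islice(triples, limit) if not check(triple)]


def has_non_empty_object(triple: Triple) -> bool:
    """Example check: the object of the triple must not be blank."""
    return triple[2].strip() != ""


def mapped_triples() -> Iterator[Triple]:
    """Stand-in for the output of an RDF mapping (invented data)."""
    yield (":b1", "rdfs:label", "Building 1")
    yield (":b2", "rdfs:label", "")
    yield (":b3", "rdfs:label", "Building 3")


# Only the first two triples are read from the generator.
violations = validate_excerpt(mapped_triples(), has_non_empty_object, limit=2)
```

Because the input is consumed lazily, the cost of the check depends on `limit`, not on the size of the dataset.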

Page 11: Data quality in Real Estate

Automate, automate & automate

Can you spot the error?

rdfs:label ⇒ rdf:langString

● :foo rdfs:label "foo @en" .

Page 12: Data quality in Real Estate

Automate, automate & automate

Can you spot the error?

rdfs:label ⇒ rdf:langString

● Wrong: :foo rdfs:label "foo @en" .

● Right: :foo rdfs:label "foo"@en .
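
An error like this is easy to miss by eye but simple to detect automatically. The following is a minimal sketch of a heuristic check, assuming the literal's lexical form is already available as a plain string; the pattern and the function name are assumptions for illustration, not part of any RDF library.

```python
import re

# Heuristic: a lexical form that ends in "@xx" or "@xx-YY" probably contains
# a language tag that was meant to be placed outside the quotes.
EMBEDDED_LANG_TAG = re.compile(r"\s*@[A-Za-z]{2,3}(-[A-Za-z0-9]+)*\s*$")


def has_embedded_lang_tag(lexical_form: str) -> bool:
    """Return True if the string looks like it ends with a language tag."""
    return EMBEDDED_LANG_TAG.search(lexical_form) is not None


suspicious = has_embedded_lang_tag("foo @en")  # lexical form of the wrong triple
clean = has_embedded_lang_tag("foo")           # lexical form of the right triple
```

Being a heuristic, it can flag legitimate strings that happen to end in an `@` followed by two or three letters, so its findings are best treated as warnings to review.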

Page 13: Data quality in Real Estate

CI/CD is your buddy

● Integrate validation with your CI/CD

○ Choose tools & technologies wisely

○ Jenkins, Travis, GitLab, TeamCity

● Fail the build until data issues are fixed

● Data integration validation checks

○ Standalone datasets can pass CI
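
Failing the build comes down to a validation script that returns a non-zero exit code when violations are found, which CI servers generally treat as a failed step. A minimal sketch follows; the check function and its sample data are hypothetical placeholders for real validation.

```python
import sys
from typing import Callable, Iterable, List


def run_checks(checks: Iterable[Callable[[], List[str]]]) -> List[str]:
    """Run every check and collect the violation messages they return."""
    messages: List[str] = []
    for check in checks:
        messages.extend(check())
    return messages


def exit_code(messages: List[str]) -> int:
    """Map violations to a process exit code: 0 passes, 1 fails the build."""
    return 1 if messages else 0


def check_labels_not_empty() -> List[str]:
    """Hypothetical check over invented sample data."""
    labels = {":b1": "Building 1", ":b2": ""}
    return [f"{subject}: empty rdfs:label" for subject, label in labels.items() if not label]


def main() -> int:
    """Print violations to stderr and return the exit code for the CI step."""
    found = run_checks([check_labels_not_empty])
    for message in found:
        print(message, file=sys.stderr)
    return exit_code(found)

# In a CI job the script would end with: sys.exit(main())
```

Keeping the checks as plain functions makes it possible to run the same set locally before pushing and on the CI server.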

Page 14: Data quality in Real Estate

Thank you for your attention

Questions?