18
PHDD – An RDF Vocabulary for the Physical Data Descrip:on Work in Progress Joachim Wackerow and Thomas Bosch (both GESIS – Leibniz Ins:tute for the Social Sciences)

A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

PHDD  –  An  RDF  Vocabulary  for  the  Physical  Data  Descrip:on  

Work  in  Progress  

Joachim  Wackerow  and  Thomas  Bosch  (both  GESIS  –  Leibniz  Ins:tute  for  the  Social  Sciences)  

Page 2: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

What  is  it?  

•  Descrip:on  of  the  physical  proper:es  of  a  data  file.  

•  Focus  on  most  common  format  types  – Rectangular  format  – Character-­‐separated  values  (CSV)  or  fixed-­‐record  length  

Page 3: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

Character-­‐separated  Values  Header  row  

Page 4: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

Fixed  Record-­‐length  Format  

Page 5: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

Mo:va:on  

•  Data.gov  and  similar  ini:a:ves  provide  data  in  CSV  format  or  similar  

•  W3C  Government  Linked  Data  Working  Group  Charter  –  “The  mission  …  is  to  provide  standards  and  other  informa:on  which  help  governments  around  the  world  publish  their  data  as  effec:ve  and  usable  Linked  Data  using  Seman:c  Web  technologies.”  

•  Machine-­‐ac:onability  -­‐  intended  for  program  use  

Page 6: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

Example  at  data.gov  

Page 7: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

Exis:ng  Approaches  •  CSV  on  the  Web  Working  Group  Charter  (W3C)  •  Linkage  

–  Linked  CSV  (Jeni  Tennison)  –  CSV  linked  data  (Quoderat)  

•  Descrip:on  –  Common  Format  and  MIME  Type  for  Comma-­‐Separated  Values  (CSV)  Files,  

RFC  4180  –  csv:  a  vocabulary  for  describing  CSV  files  (Rurik  Thomas  Greenall,  Norwegian  

University  at  Trondheim)  •  Representa:ons  

–  URI  design  for  RDF  conversion  of  CSV-­‐based  data  (Tim  Lebo,  Gregory  Todd  Williams)  

–  CSV2RDF  Applica:on  (Ivan  Ermilov,  Sören  Auer,  Claus  Stadler)    Research  results:  

–  Intended  for  different  purposes  –  Descrip:on  approaches  not  sufficient  

Page 8: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

PHDD  –  First  Ideas  

Page 9: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

PHDD  –  UML  Model  class  phdd

TableDescription

-­‐   caseQuantity    :xsd:nonNegativeInteger  [0..1]-­‐   fi leName    :xsd:string,  xsd:uri-­‐   recordsPerCase    :xsd:positiveInteger  =  1-­‐   overallRecordCount    :xsd:nonNegativeInteger  [0..1]

TableStructure

-­‐   characterSet    :xsd:string-­‐   defaultDecimalSeparator    :xsd:string  [0..1]-­‐   defaultDigitGroupSeparator    :xsd:string  [0..1]-­‐   defaultLanguage    :xsd:string  [0..1]-­‐   defaultLocale    :xsd:string  [0..1]-­‐   defaultDecimalPositions    :xsd:positiveInteger  [0..1]-­‐   newLine    :xsd:string  =  CRLF

ColumnDescription

-­‐   recommendedDataType    :xsd:string  [0..1]-­‐   storageFormat    :xsd:string  [0..1]-­‐   recommendedDisplayDataFormat    :xsd:string  [0..1]-­‐   decimalPositions    :xsd:positiveInteger  [0..1]-­‐   recordNumber    :xsd:positiveInteger  [0..1]  =  1

skos:Concept

FixedRecordLength

-­‐   recordLength    :xsd:positiveInteger  [0..1]

Delimited

-­‐   delimiter    :xsd:string-­‐   textQualifier    :xsd:string  [0..1]-­‐   consecutiveDelimitersAsOne      :xsd:boolean  [0..1]  =  false-­‐   namesOnFirstRow    :xsd:boolean  [0..1]  =  true-­‐   firstDataLine    :xsd:positiveInteger  [0..1]  =  2

DistributionTable

Column

InputProgram

-­‐   programFileName    :xsd:string-­‐   softwareType    :xsd:string-­‐   programVersion    :xsd:string

FixedColumnDescription

-­‐   startPosition    :xsd:positiveInteger-­‐   endPosition    :xsd:positiveInteger  [0..1]-­‐   width    :xsd:positiveInteger  [0..1]

DelimitedColumnDescription

-­‐   columnPosition    :xsd:positiveInteger

0..*

isDescribedBy

1

0..*

isStructuredBy

1

0..*

isDescribedBy

1

0..*

column

1..*

0..*

storageFormat

0..1

0..1

inputProgram

0..*

0..*

defaultLocale

0..1

0..*

defaultLanguage

0..1

0..*

characterSet

1

0..*

recommendedDisplayDataFormat

0..1

Page 10: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

PHDD  -­‐  Overview  

General  approach  is  not  really  new,  just  a  complete  set  of  proper:es  for  the  most  common  cases.  

Structure  •  Table  –  the  rectangular  data  file  

[disco::DataFile,  dcat::Distribu8on]  –  TableStructure  -­‐  common  proper:es  plus  specific  ones  for  delimited  and  fixed  columns  •  Column  -­‐  common  proper:es  plus  specific  ones  for  delimited  and  fixed  columns  

       [disco::Variable]  

Page 11: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

class  phdd

TableDescription

-­‐   caseQuantity    :xsd:nonNegativeInteger  [0..1]-­‐   fi leName    :xsd:string,  xsd:uri-­‐   recordsPerCase    :xsd:positiveInteger  =  1-­‐   overallRecordCount    :xsd:nonNegativeInteger  [0..1]

TableStructure

-­‐   characterSet    :xsd:string-­‐   defaultDecimalSeparator    :xsd:string  [0..1]-­‐   defaultDigitGroupSeparator    :xsd:string  [0..1]-­‐   defaultLanguage    :xsd:string  [0..1]-­‐   defaultLocale    :xsd:string  [0..1]-­‐   defaultDecimalPositions    :xsd:positiveInteger  [0..1]-­‐   newLine    :xsd:string  =  CRLF

ColumnDescription

-­‐   recommendedDataType    :xsd:string  [0..1]-­‐   storageFormat    :xsd:string  [0..1]-­‐   recommendedDisplayDataFormat    :xsd:string  [0..1]-­‐   decimalPositions    :xsd:positiveInteger  [0..1]-­‐   recordNumber    :xsd:positiveInteger  [0..1]  =  1

skos:Concept

FixedRecordLength

-­‐   recordLength    :xsd:positiveInteger  [0..1]

Delimited

-­‐   delimiter    :xsd:string-­‐   textQualifier    :xsd:string  [0..1]-­‐   consecutiveDelimitersAsOne      :xsd:boolean  [0..1]  =  false-­‐   namesOnFirstRow    :xsd:boolean  [0..1]  =  true-­‐   firstDataLine    :xsd:positiveInteger  [0..1]  =  2

DistributionTable

Column

InputProgram

-­‐   programFileName    :xsd:string-­‐   softwareType    :xsd:string-­‐   programVersion    :xsd:string

FixedColumnDescription

-­‐   startPosition    :xsd:positiveInteger-­‐   endPosition    :xsd:positiveInteger  [0..1]-­‐   width    :xsd:positiveInteger  [0..1]

DelimitedColumnDescription

-­‐   columnPosition    :xsd:positiveInteger

0..*

isDescribedBy

1

0..*

isStructuredBy

1

0..*

isDescribedBy

1

0..*

column

1..*

0..*

storageFormat

0..1

0..1

inputProgram

0..*

0..*

defaultLocale

0..1

0..*

defaultLanguage

0..1

0..*

characterSet

1

0..*

recommendedDisplayDataFormat

0..1

Table  Structure  

Page 12: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

class  phdd

TableDescription

-­‐   caseQuantity    :xsd:nonNegativeInteger  [0..1]-­‐   fi leName    :xsd:string,  xsd:uri-­‐   recordsPerCase    :xsd:positiveInteger  =  1-­‐   overallRecordCount    :xsd:nonNegativeInteger  [0..1]

TableStructure

-­‐   characterSet    :xsd:string-­‐   defaultDecimalSeparator    :xsd:string  [0..1]-­‐   defaultDigitGroupSeparator    :xsd:string  [0..1]-­‐   defaultLanguage    :xsd:string  [0..1]-­‐   defaultLocale    :xsd:string  [0..1]-­‐   defaultDecimalPositions    :xsd:positiveInteger  [0..1]-­‐   newLine    :xsd:string  =  CRLF

ColumnDescription

-­‐   recommendedDataType    :xsd:string  [0..1]-­‐   storageFormat    :xsd:string  [0..1]-­‐   recommendedDisplayDataFormat    :xsd:string  [0..1]-­‐   decimalPositions    :xsd:positiveInteger  [0..1]-­‐   recordNumber    :xsd:positiveInteger  [0..1]  =  1

skos:Concept

FixedRecordLength

-­‐   recordLength    :xsd:positiveInteger  [0..1]

Delimited

-­‐   delimiter    :xsd:string-­‐   textQualifier    :xsd:string  [0..1]-­‐   consecutiveDelimitersAsOne      :xsd:boolean  [0..1]  =  false-­‐   namesOnFirstRow    :xsd:boolean  [0..1]  =  true-­‐   firstDataLine    :xsd:positiveInteger  [0..1]  =  2

DistributionTable

Column

InputProgram

-­‐   programFileName    :xsd:string-­‐   softwareType    :xsd:string-­‐   programVersion    :xsd:string

FixedColumnDescription

-­‐   startPosition    :xsd:positiveInteger-­‐   endPosition    :xsd:positiveInteger  [0..1]-­‐   width    :xsd:positiveInteger  [0..1]

DelimitedColumnDescription

-­‐   columnPosition    :xsd:positiveInteger

0..*

isDescribedBy

1

0..*

isStructuredBy

1

0..*

isDescribedBy

1

0..*

column

1..*

0..*

storageFormat

0..1

0..1

inputProgram

0..*

0..*

defaultLocale

0..1

0..*

defaultLanguage

0..1

0..*

characterSet

1

0..*

recommendedDisplayDataFormat

0..1

Column  Descrip:on  

Page 13: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

Rela:onship  to  other  RDF  Vocabularies  

class  externalVocabularies

Table

dcat::Distribution

dcat::Dataset dcat::Catalog

TableStructure Column

disco::DataFile disco::LogicalDataSet

disco::Variable

dcat:distribution dcat:dataset

0..*

isStructuredBy

1 0..*

column

1..*

´ owl:equivalentClassª

disco:dataFile

´ owl:equivalentClassª

disco:containsVariable

Page 14: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

Usage  Scenarios  

Data  

PHDD   Discovery   DCAT  

Program  

Descrip:on  

Transforma:on  Analysis  

Data  Data  

User  

Provider  

Search  

DDI  XML  

Page 15: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

Rela:onship  to  DDI  XML  

•  Mapping  to  DDI  XML  Specifica:ons  – DDI  Codebook  2.*  •  approx.  half  of  the  proper:es  of  PHDD  

– DDI  Lifecycle  3.*  •  almost  all  proper:es  of  PHDD  

Page 16: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

Rela:onship  of  DDI  Specifica:ons  

DDI  Codebook  2.*  

DDI  Lifecycle  3.*  

DDI  4  Model  

OWL/RDF  Representa:on  XML  Schema    Representa:on  

PHDD  

Discovery  

XKOS  

XML  Schema   OWL/RDF  

Future  

Page 17: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

Acknowledgements  

•  Contribu:ons  by  – Larry  Hoyle  (Ins:tute  for  Policy  &  Social  Research,  University  of  Kansas)  

– Richard  Cyganiak  (DERI  -­‐  Digital  Enterprise  Research  Ins:tute  )  

Page 18: A4 NADDI2014 Wackerow PHDD - COnnecting REpositories · 2017-09-25 · PHDD$–An$RDF$Vocabulary$for$the$ Physical$DataDescrip:on$ WorkinProgress Joachim$Wackerow$and$Thomas$Bosch$

Further  Informa:on  

•  Development  repository  of  PHDD  – hlps://github.com/linked-­‐sta:s:cs/physical-­‐data-­‐descrip:on  

•  DDI  Alliance  RDF  Vocabularies  – hlp://www.ddialliance.org/Specifica:on/RDF