Aug. 14, 2012
2012 IASLOD
Linking Korean Resources to LOD: Issues in Localization
Mun Y. Yi
Slide 2
Agenda
- Project Scope
- System Architecture
- Silk in Action
- Korean Traditional Knowledge Data
- Localization Issues
Slide 3
LOD2 Work Packages
The project is structured into twelve consecutively numbered work
packages (WPs). WP1 to WP6 are concerned with the development of the
LOD2 Stack. WP7 to WP9 are designed to extensively validate and
demonstrate the developed technology on a carefully selected,
representative set of demonstrator applications with potentially
great impact. WP10 (SWC) is devoted to training, awareness, and
dissemination. WP11 is concerned with exploitation and
standardization activities, as well as technical coordination with
other projects. WP12 is designed for high-level project
coordination, reporting to the EC, and activities related to the
resolution of IPR issues and maintenance of the Consortium Agreement.
Slide 4
Simplified LOD2 Stack High-Level Architecture
The main result of LOD2 will be the LOD2 Stack, an integrated
distribution of aligned tools that support the whole life cycle of
Linked Data, from creation through enrichment, interlinking, and
fusing to maintenance.
Slide 5
Project Scope: Tasks & Deliverables
Task 4.1 Semi-Automatic Data Interlinking
- University Leipzig
- Digital Enterprise Research Institute
- Free University Berlin
- KAIST
In Task 4.1, a semi-automatic machine learning technique will be
developed and implemented to simplify the creation of mappings
between knowledge bases and the assessment of their quality. KAIST
will contribute to this task by providing a platform for automatic
linking with Korean, Chinese, and Japanese RDF resources.
Deliverables:
- Deliverable 4.1.1 First Linking Assist Release. Due date: M18 (2012-02)
- Deliverable 4.1.3 Korean Resource Linking Assist Release. Due date: M24 (2012-08)
- Deliverable 4.1.4 Asian Resource Linking Assist Release. Due date: M30 (2013-02)
Slide 6
Project Scope: Tasks & Deliverables (Contd)
Task 4.5 Link Data Fusion
- University Leipzig
- Digital Enterprise Research Institute
- Free University Berlin
- KAIST
In Task 4.5, methods for fusing data about a single concept from
multiple different sources will be devised and implemented. KAIST
will work on the fusion of multilingual DBpedia datasets, thus
eliminating issues for other multilingual resources.
Deliverables:
- Deliverable 4.5.1 Initial Release of Data Fusion Component. Due date: M24 (2012-08)
- Deliverable 4.5.3 Korean Data Fusion Assistant. Due date: M30 (2013-02)
- Deliverable 4.5.4 Asian Data Fusion Assistant. Due date: M36 (2013-08)
Slide 7
Phased Approaches
1st Cycle (~Feb. 2012)
- Understanding of the Task Domain
- Semantic Web
- LOD2 Concept
- Software Architecture
- Data Model (Relational2RDF)
- Pilot Project: Korean Traditional Recipe data
2nd Cycle (~July 2012)
- Implementation of Korean Resource Linking Assistant
- Silk Localization
- Linking with Silk Framework
- Internal publication
3rd Cycle (~Aug. 2012)
- Quality Enhancement
- Linking Quality
- Publish to the LOD2 cloud
The project has been carried out in three iterative cycles. Each
cycle focuses on specific tasks, and lessons learned are carried
into the next cycle. In the first cycle, preliminary RDF data was
generated. During the second cycle, we localized Silk to support
Korean resource linking. The last cycle focuses on enhancing data
quality.
Slide 8
Silk in Action
URL: http://lod.kaist.ac.kr/silk-workbench/
A file or a SPARQL endpoint can serve as a source or a target.
- Define a project
- Define a source & a target
- Define a task
- Define an output
- Then click Open
Slide 9
Silk in Action (Contd)
- Define a source & a target from Property Paths
- Define operator(s)
- Click GenerateLinks
- Click Start
Multiple operators can be used for complex tasks. Outputs can be
displayed or written to a file. An interim result can be exported
as a final result or used as a training data set for machine
learning. The learned algorithm can then be used to generate the
final links.
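As a rough illustration of what a link-generation task does
conceptually, the sketch below compares source and target labels with
a similarity measure and emits owl:sameAs links above a threshold. It
is not Silk's implementation: the generate_links function, the
example.org URIs, the sample labels, and the 0.9 threshold are all
assumptions, and the ratio from Python's difflib merely stands in for
a Silk comparator.

```python
from difflib import SequenceMatcher

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] (stand-in for a comparator)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def generate_links(source, target, threshold=0.9):
    """Return N-Triples lines for every source/target pair above the threshold."""
    links = []
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            if similarity(s_label, t_label) >= threshold:
                links.append(f"<{s_uri}> <{OWL_SAME_AS}> <{t_uri}> .")
    return links


# Hypothetical sample data (not taken from the slides).
source = {"http://example.org/food/1": "Bibimbap"}
target = {
    "http://example.org/dbpedia/Bibimbap": "bibimbap",
    "http://example.org/dbpedia/Kimchi": "Kimchi",
}
links = generate_links(source, target)
```

Comparing every source entity with every target entity is quadratic,
so a sketch like this only suits small datasets.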
Slide 10
Korean Traditional Knowledge Portal
Slide 11
Korean Traditional Knowledge Data includes:
- Food (3,236 records): food name, food type, recipe, ingredients,
  cooking process (images)
- Medicine, sickness, and treatment (38,121 records)
- Agriculture (2,775 units)
- Life (4,438 units)
Slide 12
System Architecture
- Proprietary RDFgen for transforming the relational model to the RDF model
- Silk for link generation
- Virtuoso triple store for serving RDF
[Diagram labels: Source Data in Relational DB; Transformation:
RDFgen; Link Creation: Silk, New Korean Similarity Measures;
Publication: Virtuoso triple store; RDF Links; Instances; Ontology;
DBpedia]
Slide 13
Key Linking Issues
- Data Preprocessing
- Address Encoding: URI vs. IRI
- Korean String Similarity Measure
- Handling Transliterated Data
Slide 14
Data Preprocessing: Mapping Relation to RDF
Our goal is to make the recipes of Korean traditional food open.
Original data from the relational database were transformed into
tables by object-relational mapping. Related ontologies for recipes:
LinkedRecipe.com, www.mindswap.org. Tool and IngredientPortion are
not implemented at this phase.

Relational           | RDF
Table name           | Class name
PK column value      | Subject
Non-PK column name   | Predicate
FK column value      | Object (used as URI; RDF link)
Non-FK column value  | Object (used as string; literal triple)
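The mapping rules above can be sketched in a few lines of Python. This
is a minimal illustration, not the proprietary RDFgen tool: the
row_to_triples function, the example.org base URI, the URI patterns,
and the sample Food row are all hypothetical, and emitting an rdf:type
triple is one assumed way to realize "table name to class name".

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def row_to_triples(table, pk, row, fks, base="http://example.org/"):
    """Map one relational row to (subject, predicate, object) triples.

    fks maps each foreign-key column name to the table it references.
    Objects are tagged ("uri", value) or ("literal", value).
    """
    subject = f"{base}{table}/{row[pk]}"                 # PK value -> subject
    triples = [(subject, RDF_TYPE, ("uri", f"{base}{table}"))]  # table -> class
    for column, value in row.items():
        if column == pk or value is None:
            continue
        predicate = f"{base}{table}#{column}"            # column name -> predicate
        if column in fks:
            obj = ("uri", f"{base}{fks[column]}/{value}")  # FK -> RDF link
        else:
            obj = ("literal", str(value))                # other value -> literal
        triples.append((subject, predicate, obj))
    return triples


# Hypothetical row from a Food table with a foreign key to FoodType.
row = {"id": 7, "name": "Bibimbap", "type_id": 2}
triples = row_to_triples("Food", "id", row, {"type_id": "FoodType"})
```

Tagging each object as a URI or a literal keeps the last two rules of
the table explicit, so a serializer can later write angle brackets for
links and quotes for strings.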
Slide 15
Handling Non-Latin Data
Resources may be described in non-Latin characters. It is not known
in advance whether tools support non-Latin characters.
[Figure: Writing systems of the world today (Wikipedia)]
Slide 16
Address Encoding
The URI is a core component of linked data: URIs are used as names
for things. A URI allows only US-ASCII characters in the name of a
resource.
W3C recommendation for URIs: UTF-8 character set & URI encoding.
Use the UTF-8 character set for URIs, and encode special/non-Latin
characters using %.
ex) http://ko.wikipedia.org/wiki/%EB%B2%A0%EB%A5%BC%EB%A6%B0
But it's hard to understand what it is.
Another W3C recommendation: IRI (Internationalized Resource
Identifier).
ex) http://ko.wikipedia.org/wiki/베를린
Now we can understand what it means. But some characters look so
similar that the chance of spoofing increases. (ex)
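The encoding step can be reproduced with Python's standard library.
The sketch below percent-encodes the UTF-8 bytes of the Korean name
베를린 (Berlin), which is what the encoded Wikipedia example decodes
to. The iri_to_uri helper is a name made up for this illustration.

```python
from urllib.parse import quote

BASE = "http://ko.wikipedia.org/wiki/"


def iri_to_uri(base, name):
    """Percent-encode the UTF-8 bytes of a non-ASCII resource name."""
    return base + quote(name)  # quote() encodes as UTF-8 by default


uri = iri_to_uri(BASE, "베를린")
print(uri)
```

Each Hangul syllable occupies three bytes in UTF-8, so the three
syllables become nine %XX escapes.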
Slide 17
Localization: Silk Workbench Address Encoding
Silk Workbench is a GUI for the generation of links. It displays
encoded URIs as is, so a non-Latin dataset is hard to understand.
Decoding the URIs lets a non-Latin dataset be displayed in its
native language, which makes it a lot easier to work with.
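Decoding for display can be sketched as follows. This is only an
illustration in Python of the idea, not the change made to Silk
Workbench; the display_form helper is a hypothetical name.

```python
from urllib.parse import unquote


def display_form(uri):
    """Decode %XX escapes as UTF-8 so the URI reads in its native script."""
    return unquote(uri)


encoded = "http://ko.wikipedia.org/wiki/%EB%B2%A0%EB%A5%BC%EB%A6%B0"
readable = display_form(encoded)
print(readable)
```

A reasonable design is to decode only for display and keep the
encoded form as the identifier that is stored and compared.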
Slide 18
Localization: Korean String Similarity Measures
Two kinds of Korean resources exist: resources in Korean and
resources in transliterated Korean. We need to calculate similarity
distances for both of them.
- Resources in Korean, e.g., Korean DBpedia: most of the resources
  in Korea
- Resources in transliterated Korean (bibimbap), e.g., English
  DBpedia: most of the resources abroad
The Korean alphabet has 14 consonants and 10 vowels (together with
consonant clusters and diphthongs).
Most of the comparators in Silk are based on string comparison,
e.g., Levenshtein distance. However, writing systems differ from
language to language. So are comparators designed for the Latin
alphabet appropriate for the Korean alphabet?
String similarity distance measures for Korean:
- KorED
- GrpSim
- OneDSim2
- KorPhoD (our approach) = (sD - 1) * 3 + min(pD), where sD is the
  syllable distance and pD is the phoneme distance
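To see why a plain string comparator behaves differently on Korean,
the sketch below computes Levenshtein distance once over syllable
blocks and once over their phonemes (jamo). It is not KorPhoD or any
of the measures listed above; it only assumes that Unicode NFD
normalization is an acceptable way to split syllables into jamo, and
the word pair is chosen for illustration.

```python
import unicodedata


def to_jamo(text):
    """Split precomposed Hangul syllables into conjoining jamo via NFD."""
    return unicodedata.normalize("NFD", text)


def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute
        prev = cur
    return prev[-1]


source, target = "한국", "한글"
syllable_level = levenshtein(source, target)                 # whole blocks
jamo_level = levenshtein(to_jamo(source), to_jamo(target))   # phonemes
print(syllable_level, jamo_level)
```

At the syllable level the two words differ by one substitution, while
at the jamo level both the vowel and the final consonant of the second
syllable differ. Note that NFD treats an initial and a final consonant
as different code points and keeps consonant clusters and diphthongs
as single jamo.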
Slide 19
Localization: Korean String Similarity Measures (Contd)
Several Korean similarity distances exist to reflect the
characteristics of the Korean alphabet. We devised a new measure
based on the distribution of phonemes (KorPhoD). We implemented a
KoreanPhonemeDistance operator in Silk and used it to build links
among Korean resources.

Application of Edit Distance to Korean Resources
Source | Target | Levenshtein Distance | Actual Edit Operation | Differences in phonemes | Differences in syllables
       |        | 2                    | 3 (->, -> add, -> delete) | 4                   | 2

Comparison of Similarity Measures for Korean
[Table columns: Source, Target, Levenshtein Distance, KorED, GrpSim,
OneDSim2, KorPhoD. The cells combine a syllable distance and a
phoneme distance with weights (ws when two phonemes are similar, wd
when they are different); the Korean examples and cell values are
not recoverable.]

Performance Comparison
- Precision: 1.28% vs. 17.78% (a more than thirteenfold improvement)
- F-score: 0.0223 vs. 0.0896 (four times more effective at finding
  correct links)
Slide 20
Localization: Transliterated Korean Similarity Measures
Two kinds of transliteration relate to Korean: from English to
Korean and from Korean to English. For now, we focus on
transliteration from Korean to English to build links for resources
in Korean. The biggest problem is that various algorithms have been
used so far to transliterate Korean into English.
- From English to Korean: Digital -> (multiple Korean spellings)
- From Korean to English: 칼국수 -> Kalguksu, Kalguksoo, Kalgugsoo, ...
Transliteration algorithms for Korean:
- McCune-Reischauer (1937): official standard in the past (from
  1984 to 2000). Uses breves (˘: indicates a short vowel),
  apostrophes, and diereses (¨: a vowel is sounded in a separate
  syllable).
- Yale (1942)
- Revised Romanization (2000): current official standard. Generally
  similar to MR, but uses no diacritics or apostrophes, and uses
  distinct letters for ㅌ/ㄷ (t/d), ㅋ/ㄱ (k/g), ㅊ/ㅈ (ch/j), and
  ㅍ/ㅂ (p/b).
- etc., and probably many more
We found that many academic and government websites still use MR
more often. Silk doesn't have phonetic similarity measures, though,
e.g., Soundex.
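A phonetic measure groups spellings that sound alike. The sketch below
is a plain Python version of the standard American Soundex code,
applied to the three romanizations from the slide. It is not part of
Silk and not the KoTlit measure; the soundex function is written here
only for illustration.

```python
# Consonant groups of standard American Soundex.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of an ASCII word."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (encoded + "000")[:4]


codes = [soundex(w) for w in ("Kalguksu", "Kalguksoo", "Kalgugsoo")]
print(codes)
```

All three spellings collapse to the same code, which is why a
phonetic measure tends to retrieve many candidates.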
Slide 21
Localization: Transliterated Korean Similarity Measures (Contd)
We compare performance from both a string similarity perspective
and a phonetic similarity perspective. Levenshtein shows good
precision, and Soundex shows good recall. KoTlit shows good
performance for both precision and recall, and we are still
optimizing the algorithms.

Performance Comparison: M.R. (McCune-Reischauer)
Measure      | Relevant | Retrieved | Ret. & Rel. | Precision (%) | Recall (%)
Levenshtein* | 6669     | 2875      | 2770        | 96.35         | 41.54
Soundex      | 6669     | 386432    | 5969        | 1.54          | 89.50
KoTlit       | 6669     | 4469      | 4241        | 94.90         | 63.59
* threshold: 0

Performance Comparison: R.R. (Revised Romanization)
Measure      | Relevant | Retrieved | Ret. & Rel. | Precision (%) | Recall (%)
Levenshtein* | 6669     | 5552      | 5237        | 94.33         | 78.53
Soundex      | 6669     | 348187    | 6188        | 1.78          | 92.79
KoTlit       | 6669     | 5977      | 5641        | 94.38         | 84.59
* threshold: 0
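The percentages follow from the counts by the usual definitions:
precision is the share of retrieved links that are relevant, and
recall is the share of relevant links that were retrieved. The short
Python check below uses the Levenshtein and KoTlit counts for M.R.
(6669 relevant links); the precision_recall helper is a name chosen
for this sketch.

```python
def precision_recall(relevant, retrieved, retrieved_and_relevant):
    """Return (precision, recall) as percentages."""
    precision = 100.0 * retrieved_and_relevant / retrieved
    recall = 100.0 * retrieved_and_relevant / relevant
    return precision, recall


# Counts taken from the M.R. comparison.
lev_p, lev_r = precision_recall(6669, 2875, 2770)
kot_p, kot_r = precision_recall(6669, 4469, 4241)
print(f"Levenshtein: {lev_p:.2f} {lev_r:.2f}")
print(f"KoTlit:      {kot_p:.2f} {kot_r:.2f}")
```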
Slide 22
Concluding Remarks
- Localization issues are important for Asian and other non-Latin
  countries.
- Each language needs its own similarity measures: string
  similarity and phonetic similarity.
- Silk is likely to become a key linking assistant program for LOD.
- LOD is a major movement to define the next version of the Internet.
Slide 23
Thank you!
Mun Yong Yi, KAIST
http://kslab.kaist.ac.kr
mail: [email protected]