Information Extraction: Techniques for Structuring Unstructured Data. 2013/05/16 PFI Seminar. Preferred Infrastructure, Inc. Yuya Unno (@unnonouno)

Introduction to Information Extraction: Techniques for Structuring Unstructured Data


  • 1. 2013/05/16 PFI Seminar, Yuya Unno (@unnonouno)
  • 2. (@unnonouno); Jubatus; joined PFI in April 2011
  • 3. 5/18 (Twitter); 6/2 Jubatus Casual Talks #1; LT 3
  • 8. Microsoft Academic Search
  • 10. [Figure: timeline with timestamps 18:03, 18:04, 18:05, 18:06, 18:09]
  • 14. eBay, Google; 650
  • 15. protein-protein interaction
  • 17. JOIN; "King of Pop" and "Michael Jackson"
  • 20. Example: EXPO, Sedue for BigData, 2013/5/8 to 2013/5/10
  • 21. 1. Named Entity Recognition (NER)
  • 24. BIO tags for NER: B (Begin), I (Inside), O (Outside)
  • 27. Hidden Markov Model (HMM): P(|) P(|)
  • 28. Conditional Random Field (CRF) [Lafferty2001]: $P(y \mid x) \propto \exp\left(\sum_i f(i) \cdot w\right)$; linear chain CRF
  • 31. color vs colour
  • 32. SimString [ 10]: $ simstring -u -d web1tja/unigrams.db -t 0.7 -s cosine
  • 33. Transliteration; vs Iwata
  • 34. Transliteration Alignment [Pervouchine09]; [Li09]
  • 35. http://shoname.jp/
  • 36. Abbreviation; Acronym: ASEAN, APEC, LINUX
  • 39. Distributional Hypothesis
  • 40. Mac and Apple; Apple and iPhone; Mac and iPhone
  • 41. 3. Relation Extraction; Template Filling
  • 42. Pattern: "X is located in the Y"
  • 43. [Sarawagi08]
  • 45. NER
  • 47. UI; ANNIE http://www.aktors.org/technologies/annie/; Zoguma
  • 49. http://areadas.jp/
  • 50. UI
  • 51. References:
    S. Sarawagi. Information Extraction. Foundations and Trends in Databases, Vol. 1, No. 3 (2007), pp. 261-377, 2008.
    J. Lafferty, A. McCallum, F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001.
    , ..50, 1C-1, 2010.
    V. Pervouchine, H. Li, B. Lin. Transliteration Alignment. ACL&IJCNLP 2009, pp. 136-144, 2009.
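Slide 24 introduces BIO tags (B = Begin, I = Inside, O = Outside) for NER. The following is a minimal sketch of how entity spans could be encoded as BIO tags and decoded back. The function names, the typed tag format (`B-PER`, `I-LOC`), and the example sentence are assumptions made for illustration and are not taken from the slides.

```python
def spans_to_bio(num_tokens, spans):
    """Encode (start, end, label) spans as BIO tags; end is exclusive."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


def bio_to_spans(tags):
    """Decode BIO tags back into (start, end, label) spans."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel closes any entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        # Close the open span on O, on a new B-, or on an I- with a different label.
        if start is not None and (
            tag == "O" or tag.startswith("B-") or tag[2:] != label
        ):
            spans.append((start, i, label))
            start, label = None, None
        # Open a span on B-, or on a stray I- with no open span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


# Hypothetical example sentence (not from the slides).
tokens = ["Michael", "Jackson", "visited", "Tokyo"]
tags = spans_to_bio(len(tokens), [(0, 2, "PER"), (3, 4, "LOC")])
```

With this encoding a sequence labeler such as the HMM or linear chain CRF on slides 27 and 28 only has to predict one tag per token, and the entity boundaries follow from the tags.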
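Slide 32 shows SimString being queried with a cosine measure and a threshold of 0.7, and slide 31 gives "color vs colour" as a spelling variant. The sketch below illustrates the underlying idea of approximate string matching on character n-gram sets. It is a brute-force scan, not SimString's actual retrieval algorithm, and the function names, the trigram size, and the `$` padding character are assumptions.

```python
import math


def char_ngrams(text, n=3):
    """Return the set of character n-grams of text, padded with '$' (assumed marker)."""
    padded = "$" * (n - 1) + text + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def cosine(a, b):
    """Set-based cosine similarity: |a & b| / sqrt(|a| * |b|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def search(query, dictionary, threshold=0.7, n=3):
    """Return (entry, score) pairs whose similarity to query meets the threshold."""
    q = char_ngrams(query, n)
    hits = []
    for entry in dictionary:
        score = cosine(q, char_ngrams(entry, n))
        if score >= threshold:
            hits.append((entry, score))
    # Highest-scoring entries first.
    return sorted(hits, key=lambda pair: -pair[1])


# Hypothetical dictionary for illustration only.
words = ["color", "colour", "apple"]
matches = search("colour", words, threshold=0.6)
```

The threshold trades recall for precision: with trigrams, "color" and "colour" share 5 of their 7 and 8 n-grams, so the pair scores just under 0.7 and is only returned when the threshold is lowered.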
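Slide 42 gives the surface pattern "X is located in the Y" in the relation extraction section. A pattern like this could be applied with a regular expression, as in the minimal sketch below. The regex, the `located_in` relation name, the restriction to capitalized word sequences, and the example sentences are all assumptions for illustration; the slides do not specify an implementation.

```python
import re

# Capitalized word sequence, e.g. "Tokyo Tower" (a crude stand-in for NER output).
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"

# "X is located in (the) Y", with X and Y captured as named groups.
LOCATED_IN = re.compile(
    rf"(?P<x>{NAME}) is located in (?:the )?(?P<y>{NAME})"
)


def extract_located_in(text):
    """Return (X, 'located_in', Y) triples for every pattern match in text."""
    return [
        (m.group("x"), "located_in", m.group("y"))
        for m in LOCATED_IN.finditer(text)
    ]


# Hypothetical input text (not from the slides).
text = "Tokyo Tower is located in Tokyo. Kyoto Station is located in the Kansai region."
triples = extract_located_in(text)
```

Hand-written patterns of this kind are precise but brittle, which is one reason to run NER first and match patterns over recognized entities rather than raw capitalized strings.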
Copyright 2006-2012 Preferred Infrastructure All Rights Reserved.