VIETNAM NATIONAL UNIVERSITY, HANOI - repository.vnu.edu.vn/bitstream/VNU_123/11826/1/MasterThesis_full(1...


  • VIETNAM NATIONAL UNIVERSITY, HANOI

    UNIVERSITY OF ENGINEERING AND TECHNOLOGY

    NGUYỄN THỊ TH[?]A

    VIETNAMESE SENTENCE CLASSIFICATION

    AND ITS APPLICATION TO QUESTION ANSWERING

    MASTER'S THESIS IN INFORMATION TECHNOLOGY

    Hanoi - 2015

  • VIETNAM NATIONAL UNIVERSITY, HANOI

    UNIVERSITY OF ENGINEERING AND TECHNOLOGY

    NGUYỄN THỊ TH[?]A

    VIETNAMESE SENTENCE CLASSIFICATION

    AND ITS APPLICATION TO QUESTION ANSWERING

    Field: Information Technology

    Specialization: Information Systems

    Code: 60 48 01 04

    MASTER'S THESIS IN INFORMATION TECHNOLOGY

    SCIENTIFIC SUPERVISOR: DR. PHAN XUÂN HIẾU

    Student    Supervisor    Thesis examination committee

    Hanoi - 2015

  • DECLARATION

    I, Nguyễn Thị Th[?]a, declare that the content of this thesis is my own
    research and original work, carried out under the supervision of
    Dr. Phan Xuân Hiếu. The data and results presented in this thesis are
    entirely truthful and have not been published in any previous scientific
    work. Where images are taken from external sources, I have cited those
    sources clearly and in full.

    Hanoi, day ... month ... 2015

    Student

    Nguyễn Thị Th[?]a

  • ACKNOWLEDGEMENTS

    First of all, I would like to express my sincere thanks to my teacher,
    Phan Xuân Hiếu. He inspired my studies and my enthusiasm for scientific
    research, and led me to this field of research. He also gave me devoted
    help in overcoming the challenges I faced while working on this thesis.

    I would like to sincerely thank my teacher Hà Quang Thụy. The more I worked
    with him, the more I came to love and value my time as a student.

    I would like to express my sincere gratitude to the lecturers who taught me
    during my two years at the University of Engineering and Technology,
    Vietnam National University, Hanoi. Every one of them gave me excellent and
    useful lectures.

    I would like to thank the staff of the Academic Affairs Office, the Student
    Affairs Office, the Finance Office, and the other staff of the university.
    Thanks to their dedicated work, we have a university that ranks among the
    best in the country in which to study and train.

    I would like to express my deep gratitude to the members of MDN-Team, with
    whom I shared the difficulties of building VAV, the virtual assistant
    application for Vietnamese users. In particular, I will never forget the
    enthusiastic help that Nguyễn Văn Hợp and Vũ Thị Hải Yến gave me during the
    experiments.

    I would like to sincerely thank my colleagues at the National Agency for
    Science and Technology Information, Ministry of Science and Technology, who
    helped complete the work at the office so that I could concentrate on my
    studies.

    I would also like to thank the members of the Knowledge Technology
    Laboratory, whose detailed comments at each weekly seminar helped me
    improve my thesis.

    Finally, I would like to sincerely thank my parents and my brothers and
    sisters. They are an indispensable source of encouragement in my life.

    Hanoi, day ... month ... 2015

    Student

    Nguyễn Thị Th[?]a

  • TABLE OF CONTENTS

    PROBLEM STATEMENT
    Chapter I. Introduction to sentence classification and its applications
    1.1. Research on sentence classification
    1.2. Vietnamese sentence classification
    1.2.1. Introduction to the Vietnamese sentence classification problem
    1.2.2. Methods for solving the problem
    Chapter II. Vietnamese sentence classification with machine learning methods
    2.2. The Naïve Bayes method
    2.3. The SVMs method
    2.4. The Maximum Entropy algorithm
    Chapter III. Experiments
    3.1. Experimental method
    3.2. Experimental data
    3.3. Feature selection
    3.4. Experimental results and analysis
    3.4.1. The MaxEnt model
    3.4.2. The Naïve Bayes model
    3.4.4. Comparison of MaxEnt, Naïve Bayes and SVMs
    CONCLUSION
    REFERENCES
    APPENDIX

  • LIST OF FIGURES

    Figure 0.1 Interface of the VAV application, a virtual assistant for Vietnamese users
    Figure 0.2 Data sources for Big Data
    Figure 0.3 Interface of the VOS software
    Figure 1.1 Simple model of the Vietnamese sentence classification problem
    Figure 1.2 Illustrative example of the Vietnamese sentence classification problem
    Figure 1.3 Overall model of the Vietnamese sentence classification problem
    Figure 2.1 The SVMs model
    Figure 3.1 The Cross Validation Test method
    Figure 3.2 Number of sentences of each type collected through the ASR service (Google Voice)
    Figure 3.3 Comparison of the F1 measure of the MaxEnt model on the two feature sets, 4th iteration
    Figure 3.4 Comparison of the F1 measure of the Naïve Bayes model between the two feature sets n-grams and n-grams + Dictionary
    Figure 3.5 Comparison of the F1 measure of the SVMs model between the two feature sets n-grams and n-grams + Dictionary after 4 folds
    Figure 3.6 Comparison of the F1 measure of the three models MaxEnt, Naïve Bayes and SVMs, 4th iteration, on the n-grams feature set
    Figure 3.7 Comparison of the F1 measure of the three models MaxEnt, Naïve Bayes and SVMs, 4th iteration, on the n-grams + Dictionary feature set
    Figure PL.1 Data distribution when classifying with the Naïve Bayes method
    Figure PL.2 Classification results with the Naïve Bayes method
    Figure PL.3 Data distribution when classifying with the SVMs method
    Figure PL.4 Classification results with the SVMs method
    Figure PL.5 Input data for the 4th fold with the MaxEnt method
    Figure PL.6 Training data, fold 4
    Figure PL.7 Test data, fold 4
    Figure PL.8 Evaluation results of the MaxEnt model
    Figure PL.9 Data distribution when classifying with the Naïve Bayes method
    Figure PL.10 Classification results with the Naïve Bayes method
    Figure PL.11 Data distribution when classifying with the SVMs method
    Figure PL.12 Classification results with the SVMs method
    Figure PL.13 Training data, fold 4
    Figure PL.14 Test data, fold 4
    Figure PL.15 Evaluation results of the MaxEnt model

  • LIST OF TABLES

    Table 1.1 Description of the common sentence types
    Table 3.1 Sample features for training the sentence classification model
    Table 3.2 Results of the 4th iteration of the MaxEnt model with the n-grams feature set
    Table 3.3 Results of the 4th iteration of the MaxEnt model with the n-grams + Dictionary feature set
    Table 3.4 Results of each iteration of the MaxEnt model with the n-grams feature set
    Table 3.5 Results of each iteration of the MaxEnt model with the n-grams + Dictionary feature set
    Table 3.6 Results after 4 iterations of the Naïve Bayes model with the n-grams feature set
    Table 3.7 Results after 4 iterations of the Naïve Bayes model with the n-grams + Dictionary feature set
    Table 3.8 Results after 4 iterations of the SVMs model with the n-grams feature set, with C = 0.1, gamma = 0.5, Kernel = exp(-gamma*|u-v|^2)
    Table 3.9 Results after 4 iterations of the SVMs model with the n-grams + Dictionary feature set, with C = 0.1, gamma = 0.5, Kernel = exp(-gamma*|u-v|^2)

  • PROBLEM STATEMENT

    According to Assoc. Prof. Dr. Bùi Mạnh Hùng [1], to realize the purpose of
    an utterance, speakers typically use a characteristic syntactic structure
    combined with specific linguistic devices such as particles, adverbs,
    affixes, word order, intonation, ellipsis, and so on. In other words, there
    is a fairly regular correlation between the form of a sentence and the
    purpose for which it is used. This gives rise to the notion of sentence
    type, and the most commonly cited sentence types are declarative,
    interrogative, imperative, and exclamatory sentences (cf. J. Sadock &
    A. Zwicky 1990: 155-156).

    Classifying Vietnamese sentences by computer is a fundamental problem that
    serves as a premise for more advanced research on natural language
    processing and understanding. Sentence classification is one of the core
    processing components of several kinds of system: question answering
    systems such as VAV (Virtual Assistant for Vietnamese), an application
    created by MDN Team at the University of Engineering and Technology,
    Vietnam National University, Hanoi; social media analysis systems for
    market research, such as Big Data processing systems; and speech synthesis
    systems such as VOS (Tiếng nói Phương Nam), created by Vietnam National
    University, Ho Chi Minh City.

    Figure 0.1 Interface of the VAV application, a virtual assistant for Vietnamese users

    VAV is a smart mobile application that lets users interact by voice to set
    an alarm, schedule a meeting, turn on location services, call someone, open
    any web page, find a route on the map, locate a nearby ATM of a given bank,
    or play a favorite song. Designed and developed on the basis of artificial
    intelligence techniques (machine learning, natural language analysis and
    understanding), VAV can understand the user's intent even when commands are
    phrased in many different ways, without following any predefined pattern.

    VAV is one of the applications that has attracted considerable attention on
    social networks and technology forums. Sentence classification helps VAV
    filter out the sentences that are questions or imperatives, which are
    passed on to the subsequent processing phases. If the sentence is
    exclamatory or declarative, VAV instead replies to the user immediately,
    without further processing, through a support module built into VAV.
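    The routing behaviour described above can be sketched in a few lines of
    Python. This is only an illustrative sketch, not VAV's actual
    implementation: the names SentenceType, NEEDS_FURTHER_PROCESSING and
    route_sentence, and the two return labels, are hypothetical, and the
    sentence type is assumed to have already been predicted by a classifier.

```python
from enum import Enum


class SentenceType(Enum):
    """The four common sentence types named in the text."""
    DECLARATIVE = "declarative"
    INTERROGATIVE = "interrogative"
    IMPERATIVE = "imperative"
    EXCLAMATORY = "exclamatory"


# Assumption: only questions and commands go on to the later processing phases.
NEEDS_FURTHER_PROCESSING = {SentenceType.INTERROGATIVE, SentenceType.IMPERATIVE}


def route_sentence(sentence_type: SentenceType) -> str:
    """Return a hypothetical label for the component that handles the sentence."""
    if sentence_type in NEEDS_FURTHER_PROCESSING:
        return "pipeline"      # continue with the subsequent phases
    return "direct_reply"      # answer immediately via the support module
```

    For example, route_sentence(SentenceType.IMPERATIVE) selects the full
    pipeline, while an exclamatory sentence is answered directly.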

    Big data refers to data sets so large and varied that they cannot be
    processed manually or with conventional software. Collecting, managing, and
    analyzing such data has become a discipline of its own within information
    technology, and in recent years its potential has attracted the attention
    of the business community.

    Figure 0.2 Data sources for Big Data

    In a short time, social media has created a volume of data equal to that of
    the whole world in the previous generation. Every day, Facebook processes
    500 terabytes of data and Twitter processes 12 terabytes, whereas the New
    York Stock Exchange processes only 1 terabyte. Data from social media will
    be a gold mine for businesses that want to understand their customers'
    behavior, how they make purchasing decisions, and what they will need in
    the near future.

    In this setting, sentence classification helps the system filter out the
    sentences that express the user's state of mind and those that reflect
    praise or criticism. On that basis, a business can devise ways to improve
    its products or adopt timely strategies to attract customers.

    Similarly, in speech synthesis, VOS (Tiếng nói Phương Nam) is a Vietnamese
    text-to-speech system, built for Vietnamese users, that generates an
    artificial human voice on a computer from input text. Here, sentence
    classification helps the system add the appropriate nuance to each sentence
    in that text.

    In telecommunications, VOS can be used in applications that retrieve
    information through a telephone switchboard: the application receives the
    user's request and processes it into text, and VOS then converts the
    resulting information into audio and returns it to the user. Such systems
    are highly practical because the processing is fully automatic and can run
    continuously, meeting users' information needs, especially for breaking
    and frequently updated news.

    In automation, VOS can be integrated with GPS positioning in route-finding
    applications installed in cars to provide spoken directions. This reduces
    the need for drivers to keep looking at the GPS screen while driving and
    so improves their safety.

    In education, VOS can be used to teach Vietnamese to the children of
    overseas Vietnamese living abroad, particularly how to read and pronounce
    Vietnamese words. It is an effective tool for practicing Vietnamese,
    especially in environments where Vietnamese is not the language in use.

  • REFERENCES

    Vietnamese-language references

    [1] Bùi Mạnh Hùng (2011), Bàn về vấn đề Phân loại câu theo mục đích phát
    ngôn, Khoa Ngôn ngữ, Đại học Quốc gia Tp. Hồ Chí Minh.

    [2] Bùi Đức Tịnh (1995), Văn phạm Việt Nam. Tp. Hồ Chí Minh: Văn hóa.

    [3] Hoàng Trọng Phiến (1980), Ngữ pháp tiếng Việt - Câu. Hà Nội: Đại học &
    Trung học chuyên nghiệp.

    [4] Nguyễn Hà Nam (2013), Giáo trình Khai phá dữ liệu, Nhà Xuất bản
    Đại học Quốc gia Hà Nội.

    English-language references

    [5] Adam L. Berger & Stephen A. Della Pietra & Vincent J. Della Pietra
    (1996), A Maximum Entropy Approach to Natural Language Processing.

    [6] Adwait Ratnaparkhi (1997), A Simple Introduction to Maximum Entropy
    Models for Natural Language Processing.

    [7] Ashequl Qadir (2011), Classifying Sentences as Speech Acts in Message
    Board Posts, University of Utah, In Proceedings of the 2011 Conference on
    Empirical Methods in Natural Language Processing.

    [8] Arpit Trived (2013), Implementation of Bayesian Theory in Sentence
    Classification for Online Subjective Test, International Journal of
    Advanced Research in Computer Science and Software Engineering, Volume 3,
    Issue 12.

    [9] Anthony Khoo (2006), Experiments with Sentence Classification, Monash
    University, Australia.

    [10] Ben Hachey & Claire Grover (2004), Sentence Classification Experiments
    for Legal Text Summarisation, University of Edinburgh, In Proceedings of
    the 17th Annual Conference on Legal Knowledge and Information Systems.

    [11] Diego Molla (2012), Experiments with Clustering-based Features for
    Sentence Classification in Medical Publications: Macquarie Test's
    participation in the ALTA 2012 shared task, In Proceedings of the
    Australasian Language Technology Association Workshop, pages 139-142.

    [12] Helen Kwong (2012), Detection of Imperative and Declarative
    Question-Answer Pairs in Email Conversations, Stanford University, AI
    Communications, Volume 25, Issue 4, pages 271-283.

    [13] Martina Naughton (2008), Sentence-Level Event Classification in
    Unstructured Texts, University College Dublin, Ireland.

    [14] Menno van Zaanen (2005), Classifying Sentences using Induced
    Structure, Macquarie University, Volume 3772 of the series Lecture Notes in
    Computer Science, pages 139-150, 12th International Conference, SPIRE 2005,
    Buenos Aires, Argentina.

    [15] Nal Kalchbrenner (2014), A Convolutional Neural Network for Modelling
    Sentences, University of Oxford, In Proceedings of the 52nd Annual Meeting
    of the Association for Computational Linguistics.

    [16] William Gardner Hale (1913), The Classification of Sentences and
    Clauses, The School Review, The University of Chicago Press, Vol. 21,
    No. 6, pages 388-397.

    [17] Ulf Hermjakob (2001), Parsing and Question Classification for Question
    Answering, University of Southern California, USA, ODQA '01: Proceedings of
    the Workshop on Open-domain Question Answering, Volume 12, pages 1-6.

    [18] Yoon Kim (2014), Convolutional Neural Networks for Sentence
    Classification, New York University.

    [19] Emile de Maat (2008), Automatic Classification of Sentences in Dutch
    Laws, University of Amsterdam, Proceedings of the 2008 Conference on Legal
    Knowledge and Information Systems, The Twenty-First Annual Conference,
    pages 207-216.

    [20] Janyce Wiebe (2005), Creating Subjective and Objective Sentence
    Classifiers from Unannotated Texts, University of Pittsburgh, CICLing'05:
    Proceedings of the 6th International Conference on Computational
    Linguistics and Intelligent Text Processing, pages 486-497.

    [21] Nitin Jindal (2006), Identifying Comparative Sentences in Text
    Documents, University of Illinois at Chicago, SIGIR'06.

    [22] Thomasson, Amie, "Categories", The Stanford Encyclopedia of Philosophy
    (Fall 2013 Edition), First published Thu Jun 3, 2004, URL = .