88
A Deep Text Analysis System Based on OpenNLP Boris Galitsky

ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

A  Deep  Text  Analysis  System  Based  on  OpenNLP  

Boris Galitsky

Page 2: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Outline

1.  Introduc+on  to  OpenNLP.Similarity  component  

2.  What  is  a  Deep  Text  Analysis  and  its  Applica+ons  

3.  Func+onality  of  OpenNLP.Similarity:  search,  web  mining,  text  classifica+on,  text  analysis  

4.  Selected  problem:  content/document  genera+on  

  2  

Page 3: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

What is OpenNLP.Similarity

3  

Page 4: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

OpenNLP.Similarity entry point

A search engineer without knowledge of linguistic / parsing can easily integrate syntactic generalization into an application when relevance is important: SentencePairMatchResult matchRes = sm.assessRelevance(snippet, searchQuery);

List<List<ParseTreeChunk>> match =

matchRes.getMatchResult();

score = parseTreeChunkListScorer.

getParseTreeChunkListScore(match);

if (score > 1.5) {

// relevant

}

Page 5: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Deep Text Analysis is required: When  coun+ng  keywords  is  not  enough  When  part-­‐of-­‐speech  informa+on  is  not  enough  When  syntac+c  links  between  words  informa+on  is  not  enough  

5  

Page 6: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Motivation: from keywords to discourse structure

•  Just  keywords  -­‐  is  not  enough  in  many  cases  •  Presence  of  verbs  for  communica@ve  ac@ons  and  

mental  states  -­‐  may  help  to  iden+fy  a  few  paPerns,  but  is  s+ll  insufficient    

•  Parse  trees  -­‐  specific  phrases  by  texts  in  style  or  genre  but  s+ll  does  not  help  for  systema+c  explora+on  of  deep  linguis+c  features  

•  The  discourse  structure  including  rhetoric  rela+ons  and  communica+ve  ac+ons  

6  

We  should  be  able  to  learn  

discourse  trees  

Page 7: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

7  

Secondary, Tertiary, Quaternary structure of language

Computa+onal  biologists  figured  out  that  higher-­‐level  structures  of  protein  language  help!  However,  linguists  like  to  talk  about  higher-­‐level  (discourse)  structures  but  

neither  computa+onally  learn  them  nor  use  in  applica+ons  

Page 8: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Approach •  Build  syntac+c  trees  •  Extract  discourse  informa+on  

–  Coreferences    –  Rhetorical  rela+ons  [Mann  et  al.,  1987]  –  Communica+ve  ac+ons  in  VerbNet  format  

•  Build  addi+onal  arcs  between  syntac+c  tree  nodes  of  different  sentences  

•  Construct  parse  thicket  graph  [Galitsky  et  al  2013]  •  Pick  extended  trees  from  the  graph  •  Learn  on  extended  trees  •  Learn  a  special  case  of  parse  thicket:  communica+ve  discourse  

tree  (CDT)  

8  

Page 9: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Problem domain in security

•  Engineering  Design  document  is  a  document  which  contains  a  thorough  and  well-­‐structured  descrip+on  of  how  to  build  a  par+cular  engineering  system.    

•  Not  design  document  :  a  document  which  contains  meta-­‐level  informa+on  rela+vely  to  the  design  of  engineering  system,  such  as:  ü  how  to  write  design  docs  manuals,    ü  standards  design  docs  should  adhere  to,    ü  tutorials  on  how  to  improve  design  documents,    ü  project  requirement  document,    ü  requirement  analysis,    ü  opera+onal  requirements,    ü  construc+on  documenta+on,    ü  project  planning  and  binning,    ü  technical  services  review,  guidelines,  manuals,  etc.    

9  

Page 10: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Recognizing document style, not document topic

Classifica+on  dimensions  •  Topicality  •  Polarity  (sen+ment)  •  Authorship  •  Document  style  (what  kind  of  descrip+ons  is  provided  for  

things)   10  

Topic  

Authorship  

Style  

This  is  necessary  for  document  processing  

system  

Page 11: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Application domains where style recognition is necessary

•  Document  type  (not  topic)  recogni+on  •  Document  flow  analysis  systems  •  Customer  rela+onship  management  •  Security  analysis  •  Forensic  analysis    •  Document  quality  assessment  

11  

Page 12: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Method: Tree Kernels

12  

Page 13: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Tree Kernels

What  are  they  and  why  use  them  for  classifica+on?  

We  need  to  measure  distances  between  texts:  •   Tradi+onally  –  using  sta+s+c  of  keywords  •   More  accurate  –  using  sta+s+cs  of  all  sub-­‐parse  trees  

13  

Page 14: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Measuring  similarity  between  phrases  

Page 15: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

SVM Tree Kernel learning

 –  Trees  instead  of  numeric  vectors  

–  Tree  kernel  func+on  =  number  of  all  common  subtrees  [Collins  et  al  2002]  

 

15  

( ) ( )1 21 2

1 2 1 2, ,T Tn N n N

TK T T n n∈ ∈

= Δ∑ ∑

Page 16: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

16  

Convolu+on  tree  kernel  (Collins  and  Duffy,  2002)  defines  a  feature  space  consis+ng  of  all  subtree  types  of  parse  trees  and  counts  the  number  of  common  subtrees  as  the  syntac+c  similarity  between  two  parse  trees.    They  have  found  a  number  of  applica+ons  in  a  series  of  NLP  tasks,  including    •  search  (Moschil    2006);  •  syntac+c  parsing  re-­‐ranking;  •  rela+on  extrac+on  (Zelenko  et  al  2003);  •  named  en+ty  recogni+on  (Cumby  &  Roth  2003)  ;  •  Seman+c  Role  Labeling  /  rela+on  extrac+on  (Zhang  et  al  2006);  •  pronoun  resolu+on  (Yang  et  al.,  2006);  •  ques+on  classifica+on  and  machine  transla+on  (Zhang  and  Li  2009).  

Applications of Tree Kernels

Page 17: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Limita+on  of  sentence-­‐level  tree  kernels  

Phrases  important  for  classifica+on  can  be  distributed  through  different  sentences  

So  we  want  to  combine/merge  parse  trees  to  make  sure  we  cover  the  phrase  of  interest.  

This  document  describes  how  the  back  end  processor  can  be  designed.  Its  requirements  are  enumerated  below.  

From  the  first  sentence,  it  looks  like  we  got  the  design  doc.    To  process  the  second  sentence,  we  need  to  disambiguate  ‘its’.  As  a  result,  we  conclude  from  the  second  sentence  that  

it  is  a    requirement  doc  

Page 18: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Rhetoric structure

Discourse  tree  for  a  paragraph  of  text  

18  

Page 19: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Merging  parse  trees,  once  inter-­‐sentence  link  is  found  

Paragraph  Level  Kernel  =  Sum  of  kernels  for  all  extended  tree  pairs  

Page 20: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

20  

Parse Thicket

Page 21: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Measuring  similarity  

between  two  texts  using  

Parse  Thickets  

Page 22: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Evaluation 1 & 2

22  

Page 23: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Domain:Identifying Logical Argument in Text

23  

Page 24: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

What is required to represent an argument pattern in text?

We  build  an  RST  representa+on  of  the  arguments  and  observe  if  a  discourse  tree  is  capable  of  indica+ng  whether  a  paragraph  communicates  both:  

•   a  claim,  and    •   an  argumenta+on  that  backs  it  up.      We  then  explore  what  needs  to  be  added  to  a  discourse  tree  (DT)    

so  that  it  is  possible  to  judge  if  it  expresses  an  argumenta+on  paPern  or  not.  

 

24  

Page 25: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Our example

 We  present  a  controversial  ar+cle  published  in  Wall  Street  Journal  about  Theranos,  a  company  providing  healthcare  services,  and  the  company  rebuPal  (Theranos  2015).    

                   “Since  October  [2015],  the  Wall  Street  Journal  has  published  a  series  

of  anonymously  sourced  accusaGons  that  inaccurately  portray  Theranos.  Now,  in  its  latest  story  (“U.S.  Probes  Theranos  Complaints,”  Dec.  20),  the  Journal  once  again  is  relying  on  anonymous  sources,  this  Gme  reporGng  two  undisclosed  and  unconfirmed  complaints  that  allegedly  were  filed  with  the  Centers  for  Medicare  and  Medicaid  Services  (CMS)  and  U.S.  Food  and  Drug  AdministraGon  (FDA).”  

 25  

Page 26: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

“…But  Theranos  has  struggled  behind  the  scenes  to  turn  the  excitement  over  its  technology  into  reality.  At  the  end  of  2014,  the  lab  instrument  developed  as  the  linchpin  of  its  strategy  handled  just  a  small  fracGon  of  the  tests  then  sold  to  consumers,  according  to  four  former  employees.”    

A claim and its CDT

Page 27: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

“…But  Theranos  has  struggled  behind  the  scenes  to  turn  the  excitement  over  its  technology  into  reality.  At  the  end  of  2014,  the  lab  instrument  developed  as  the  linchpin  of  its  strategy  handled  just  a  small  fracGon  of  the  tests  then  sold  to  consumers,  according  to  four  former  employees.”    

A claim and its CDT

When  arbitrary  communica+ve  ac+ons  are  aPached  to  DT  as  labels  of  its  terminal  arcs,  it  becomes  clear  that  the  author  is  trying  to  bring  her  point  across  and  not  merely  sharing  a  fact.    

Page 28: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

“Theranos  remains  acGvely  engaged  with  its  regulators,  including  CMS  and  the  FDA,  and  no  one,  including  the  Wall  Street  Journal,  has  provided  Theranos  a  copy  of  the  alleged  complaints  to  those  agencies.  Because  Theranos  has  not  seen  these  alleged  complaints,  it  has  no  basis  on  which  to  evaluate  the  purported  complaints.”    

A claim and its DT

Theranos  aPempts  to  rebuke  the  claim  of  WSJ,  but  without  

communica+ve  ac+ons  it  is  unclear  from  

the  DT    

Page 29: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

“It  is  not  unusual  for  disgruntled  and  terminated  employees  in  the  heavily  regulated  health  care  industry  to  file  complaints  in  an  effort  to  retaliate  against  employers  for  terminaGon  of  employment.  Regulatory  agencies  have  a  process  for  evaluaGng  complaints,  many  of  which  are  not  substanGated.  Theranos  trusts  its  regulators  to  properly  invesGgate  any  complaints.”    

CDT  for  an  aPempt  by  Theranos  to  acquit  itself   Communica+ve  ac+ons  as  labels  for  rhetoric  

rela+ons  helps  to  iden+fy  a  text  which  contains  

a  heated  discussion    

Page 30: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

RST  +  CA  =>  iden+fica+on  of  logical  argumenta+on  

•  To  show  the  structure  of  arguments,  discourse  rela+ons  are  necessary  but  insufficient,  and  communica+ve  ac+ons,  or  speech  acts  are  necessary  but  insufficient  as  well.    

•  We  need  to  know  the  discourse  structure  of  interac+ons  between  agents,  and  what  kind  of  interac+ons  they  are.    

•  We  don’t  need  to  know  domain  of  interac+on  (here,  health),  the  subjects  of  these  interac+on  (the  company,  the  journal,  the  agencies),  what  are  the  en++es,  but  we  need  to  take  into  account  mental,  domain-­‐independent  rela+ons  between  them.    

Page 31: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

 “Theranos  remains  acGvely  engaged  with  its  regulators,  including  CMS  and  the  FDA,  and  no  one,  including  the  Wall  Street  Journal,  has  provided  Theranos  a  copy  of  the  alleged  complaints  to  those  agencies.  Because  Theranos  has  not  seen  these  alleged  complaints,  it  has  no  basis  on  which  to  evaluate  the  purported  complaints.”  

Another  CDT  for  an  aPempt  to  make  even  stronger  counter-­‐claim

Page 32: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Extrac+ng  communica+ve  ac+ons  from  text:  VerbNet  entry  

 Let  us  consider  the  verb  amuse.    There  is  a  cluster  of  similar  verbs  that  have  a  similar  structure  of  arguments  (seman+c  roles)  such  as  amaze,  anger,  arouse,  disturb,  irritate,  and  other.    The  roles  of  the  arguments  of  these  communica+ve  ac+ons  are  as  follows:  • Experiencer  (usually,  an  animate  en+ty)  • S+mulus  • Result    

Page 33: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Extrac+ng  communica+ve  ac+ons  from  text:  VerbNet  Frames  

NP  V  NP        Example:  "The  teacher  amused  the  children."        Syntax:  S+mulus  V  Experiencer        Clause:    amuse(SGmulus,  E,  EmoGon,  Experiencer):-­‐  cause(SGmulus,  E),    emoGonal_state(result(E),  EmoGon,  Experiencer).      NP  V  ADV-­‐Middle        Example:  "Small  children  amuse  quickly."        Syntax:  Experiencer  V  ADV        Clause:  amuse(Experiencer,  Prop):-­‐  

 property(Experiencer,  Prop),  adv(Prop).      NP  V  NP-­‐PRO-­‐ARB        example  "The  teacher  amused."        syntax  S+mulus  V    amuse(SGmulus,  E,  EmoGon,  Experiencer):-­‐                    cause(SGmulus,  E),                    emoGonal_state(result(E),  EmoGon,  Experiencer).  

For  this  example,  the  informa+on  for  the  class  of  verbs  amuse  is  at  hPp://verbs.colorado.edu/verb-­‐index/vn/amuse-­‐31.1.php#amuse-­‐31.1  

Page 34: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Argumenta+on  scenario  and  its  

DT  

I  explained  that  my  check  bounced  (I  wrote  it  aaer  I  made  a  deposit).  A  customer  service  representaGve  accepted  that  it  usually  takes  some  Gme  to  process  the  deposit.    I  reminded  that  I  was  unfairly  charged  an  overdraa  fee  a  month  ago  in  a  similar  situaGon.  They  denied  that  it  was  unfair  because  the  overdraa  fee  was  disclosed  in  my  account  informaGon.    I  disagreed  with  their  fee  and  wanted  this  fee  deposited  back  to  my  account.    They  explained  that  nothing  can  be  done  at  this  point  and  that  I  need  to  look  into  the  account  rules  closer.    

Page 35: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Argumenta+on  scenario  and  its  

graph  representa+on  

I  explained  that  my  check  bounced  (I  wrote  it  aaer  I  made  a  deposit).  A  customer  service  representaGve  accepted  that  it  usually  takes  some  Gme  to  process  the  deposit.    I  reminded  that  I  was  unfairly  charged  an  overdraa  fee  a  month  ago  in  a  similar  situaGon.  They  denied  that  it  was  unfair  because  the  overdraa  fee  was  disclosed  in  my  account  informaGon.    I  disagreed  with  their  fee  and  wanted  this  fee  deposited  back  to  my  account.    They  explained  that  nothing  can  be  done  at  this  point  and  that  I  need  to  look  into  the  account  rules  closer.    

Page 36: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Extrac+ng  communica+ve  ac+ons  from  text:  clusters  

Verbs  with  Predica+ve  Complements  Appoint,  characterize,  dub,  declare,  conjecture,  masquerade,  orphan,  captain,  consider,  classify.  Verbs  of  Percep+on  See,  sight,  peer.  Psych-­‐Verbs  (Verbs  of  Psychological  State)  Amuse,  admire,  marvel,  appeal.  Verbs  of  Desire  Want,  long.  Judgment  Verbs  Judgment.  Verbs  of  Assessment  Assess,  es+mate.  Verbs  of  Searching  Hunt,  search,  stalk,  invesGgate,  rummage,  ferret.  Verbs  of  Social  Interac+on  Correspond,  marry,  meet,  bacle.  Verbs  of  Communica+on  Transfer(message),  inquire,  interrogate,  tell,  manner(speaking),  talk,  chat,  say,  complain,  advise,  confess,  lecture,  overstate,  promise.  Avoid  Verbs  Avoid.  Measure  Verbs  Register,  cost,  fit,  price,  bill.  Aspectual  Verbs  Begin,  complete,  conGnue,  stop,  establish,  sustain.  

Page 37: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Sta+s+cal  vs  induc+ve  learning  To  compute  similarity  between  abstract  structures,  two  approaches  are  employed:  •  represent  these  structures  in  a  numerical  space,  and  

express  similarity  as  a  number.  This  is  a  sta+s+cal  learning  approach  

•  use  a  structural  representa+on,  without  numerical  space,  such  as  trees  and  graphs,  and  express  similarity  as  a  maximal  common  sub-­‐structure.    We  refer  to  such  opera+on  as  generalizaGon.  This  is  an  induc+ve  learning  approach.  

To  conduct  feature  engineering,  we  will  compare  both  these  approaches  for  text  classifica+on.  The  representa+on  machinery  and  learning  selngs  are  different,  but  the  classifica+on  accuracies  can  be  compared.  

Page 38: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Similarity  between  CDTs:  maximal  common  sub-­‐CDTs  with  

generalized  labels  

Page 39: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Numerical  similarity  between  CDTs:  representa+on  for  kernel  

model  

Page 40: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Mo+va+ons  Most  web  visionaries  think  that  good  quality  content  comes  from:    •  really  passionate  fans,  or    •  from  professional  journalists,  knowing  the  topic  of  their  wri+ng  well.    However,  nowadays,  the  demand  for  content  requires  people  write  large  amount  of  text,  for  a  variety  of  commercial  purposes:  •  from  search  engine  op+miza+on    •  marke+ng    •  self-­‐promo+on.    -­‐  A  wide  variety  of  business  areas  require  content  crea+on  in  some  form,  including  web  content,    -­‐  Expecta+on  of  text  quality  are  rather  low  and  deteriorate  further,  as  content  is  becoming  more  and  more  commercial  -­‐  

So  lets  have  a  machine  generate  low-­‐quality  

content  for  us  

Page 41: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Now  we  focus  on  content  genera+on  in  detail  

41  

Page 42: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Objec+ve  To  build    •  an  efficient    •  domain-­‐independent    •  crea+ve    •  interac+ve  wri+ng  assistance  tool  which  produces:    a  large  volume  of  content  where  •  quality  and    •  effec+veness    are  not  essen+al  

Page 43: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Plan  •  Goals  of  the  project  •  Technology  

–  Content  genera+on  algorithm  –  NLP:  Learning  parse  trees  –  Parse  thickets  

•  Applica+ons  –  Facebook  agent  –  Various  cases  of  content  genera+on  

•  How  to  use  it  –  How  to  integrate  into  your  applica+on  

Page 44: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

OpenNLP.Similarity.content generation entry point

String topic = “Albert Einstein”; HitBase hits =

new RelatedSentenceFinder(). generateContentAbout(topic);

String filenameDOCX = new

WordDocBuilder.buildWordDoc(hits, topic);

Page 45: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Innova+on  •  Use  of  web  mining  to  collect  pieces  of  

content  •  Machine  learning  of  parse  trees  to  filter  

irrelevant  pieces  of  content  from  the  web  •  Paragraph-­‐level  syntac+c  structures  to  form  

content  from  pieces  •  Document-­‐level  structure  of  sec+ons  is  

formed  by  web  mining  for  aPributes  of  an  en+ty  which  is  a  subject  of  an  assay  being  wriPen  

   

Page 46: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Mo+va+ons:  why  open  source  •  Today search engines are information

access gatekeepers •  Search engines want their automated

agent to analyze / index content created by humans

•  WE want to automatically build content good enough so that search engines do not recognize it as written by humans…

•  So that search engines would need humans to analyze content created by our machines

Page 47: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Got  content  from  Walmart?  •  Today,  original  content  is  created  by  human  writers  

and  therefore  costly,  slowly  produced    •  Finding  a  way  to  automate  content  crea+on  so  that  the  

result  is  sa+sfactory  for  human  content  consumers  and  perceived  as  original  by  search  engine  would  advance  user's  web  presence  and  facilitate  his  professional  reputa+on  

 Reformulate  a  text  for  the  purpose  of  

making  it  original:      •  well  known  to  search  engines    

•  produce  uninteres+ng  and  seman+cally  non-­‐cohesive  

Page 48: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Reproduce  the  human  wri+ng  ac+vity  

A  typical  crea+ve  wri+ng  ac+vity  of  an  average  author  is:  •  searching  the  web  for  a  subject  of  wri+ng  •  browsing  the  web  for  relevant  informa+on  •  finding  candidate  pieces    •  merging  them  to  form  a  text  •  Inser+ng  images/videos/sound  recordings  •  final  text  polishing    

We  intend  our  wri+ng  algorithm  to  do  these  exact  

steps…  And  possibly  leave  the  last  

step  for  an  author  

Page 49: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Technology  

Page 50: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Difficulty  =  relevance      Eric  P.:  We  will  e-­‐mail  you  a  bill  for  the  amount  of  your  money  transfer,  and  you  pay  using  your  

bank's  online  bill  pay  service.    Boris  Galitsky:  can  YOU  do  this  email?  Eric  P.:  Please  clarify  and  restate  your  ques+on.  Boris  Galitsky:  can  you  please  send  me  email  for  bill  pay  informa+on?    Eric  P.:  Once  you  have  selected  WU  Pay  as  the  "Pay  With"  op+on  and  clicked  the  SUBMIT  buPon,  

the  RECEIPT  screen  will  provide  instruc+ons  on  how  to  fund  the  Money  Transfer.      Boris  Galitsky:  ok,  I  will  try  that    Boris  Galitsky:  are  you  an  automated  agent?    Eric  P.:  A|er  choosing  WU®  Pay,  you'll  receive  a  bill  by  email.      Eric  P.:  Is  there  anything  else  I  can  help  you  with  today?    Boris  Galitsky:  can  you  please  answer  my  previous  ques+on?    Eric  P.:  What  ques+on  Boris?      Boris  Galitsky:  are  you  an  automated  agent?    Eric  P.:  No  sir      Boris  Galitsky:  It  is  my  right  as  a  customer  to  know  if  I  am  served  by  an  automated  or  human  

agent,  right?    Eric  P.:  Yes  it  is    

My  conversa+on  with  WU  agent  

Page 51: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Content  genera+on  algorithm    

Page 52: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

           

Algorithm:  select  candidate  sentences    

Content  genera+on  algorithm  1    

Page 53: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

           

Algorithm:  verify  relevance  

Content  genera+on  algorithm  2    

Page 54: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

           

Algorithm:  final  acceptance  of  pieces  of  content  

Content  genera+on  algorithm  3    

Page 55: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Content  genera+on  algorithm:  building  sec+on  structure  

We  search  for  a  essay  topic  and  obtain  main  en++es  associated  with  it,  such  as  {space,  atomic  bomb,  +me  machine}  

Page 56: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Content  genera+on  algorithm:  building  sec+on  structure  and  filling  

them  We  now  search  for  a  essay  topic  +  sec+on  topic:  

Page 57: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Building  sec+on  structure  and  filling  them  

Now  we  construct  the  sec+on  structure:  Albert  Einstein    1.  Albert  Einstein  invented  Space  

2.  Albert  Einstein  invented  Bomb  

Page 58: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Content  genera+on  flow  1  For  sentence  “Give  me  a  break,  there  is  no  reason  why  you  can't  reGre  in  ten  years  if  you  had  been  a  raGonal  investor  and  not  a  crazy  trader”  • We  form  the  query  for  search  engine  API:  +raGonal  +investor  +crazy  +trader  • From  search  results  we  remove  duplicates,  including  “DerivaGves:  ImplicaGons  for  Investors  |  The  <b>RaGonal</b>  Walk”.  • From  the  search  results  we  show  syntac+c  generaliza+on  results  for  the  seed  and  search  result:  •   Syntac+c  similarity:  np  [  [IN-­‐in  DT-­‐a  JJ-­‐*  ],    [DT-­‐a  JJ-­‐*  JJ-­‐crazy  ],    [JJ-­‐ra+onal  NN-­‐*  ],    [DT-­‐a  JJ-­‐crazy  ]],  score=  0.9  • Rejected  candidate  sentence:    RaGonal    opportuniGes  in  a    crazy    silly  world.  

Page 59: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Content  genera+on  flow  2  •  Syntac+c  generaliza+on  2:  np  [  [VBN-­‐*  DT-­‐a  JJ-­‐*  JJ-­‐ra+onal  NN-­‐investor  ],    [DT-­‐a  JJ-­‐*  JJ-­‐ra+onal  NN-­‐investor  ]]  vp  [  [DT-­‐a  ],    [VBN-­‐*  DT-­‐a  JJ-­‐*  JJ-­‐ra+onal  NN-­‐investor  ]],  score=  2.0  

•  Accepted  sentence:  I  have  licle  pretensions  about  being  a  so-­‐called  "raGonal  investor”.  

.   The  laPer  sentence  has  significantly  stronger  seman+c  commonality  with  the  seed  one,  

compared  to  the  former  one,  so  it  is  expected  to  serve  as  a  relevant  part  of  generated  

content  about  “raGonal  investor”  from  the  seed  sentence  

Page 60: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

NLP:  machine  learning  of  parse  trees    

Page 61: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Similarity  =  common  part  among  parse  trees  

Page 62: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Why  Parse  Thickets  

•  To  represent  a  linguis+c  structure  of  a  paragraph  of  text  based  on  parse  trees  for  each  sentence  of  this  paragraph.  

•  We  will  refer  to  the  sequence  of  parse  trees  extended  by  a  number  of  arcs  for  inter-­‐sentence  rela+ons  between  nodes  for  words  as  Parse  Thicket  (PT).    

•  A  PT  is  a  graph  which  includes  parse  trees  for  each  sentence,  as  well  as  addi+onal  arcs  for  inter-­‐sentence  rela+onship  between  parse  tree  nodes  for  words.  

Page 63: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

What  are  we  going  to  do  with  Parse  Thickets  

 •  Extend  the  opera+on  of  least  general  generaliza+on  (unifica+on  of  logic  formula)  towards  structural  representa+ons  of  paragraph  of  texts  

•  Define  the  opera+on  of  generalizaGon  of  text  paragraphs  to  assess  similarity  between  por+ons  of  text.    

•  Use  of  generaliza+on  for  similarity  assessment  is  inspired  by  structured  approaches  to  machine  learning  versus  unstructured,  sta+s+cal  where  similarity  is  measured  by  a  distance  in  feature  space  (Moschil  et  al)  

Page 64: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Now we compare linguistic phrase search and regular

SOLR search SOLR  is  one  of  the  most  popular  search  frameworks  used  in  industry  lucene.apache.org/solr/‎    Linguis+c  phrase  is  not  a  string  phrase  query  “…”.  There  are  noun  phrases,  verb  phrases,  preposi+onal  phrases,  and  others    SOLR  search  takes  into  account  term  frequency,  inverse  document  frequency,  and  the  number  of  words  between  keywords  in  the  document  

Page 65: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Keyword  search  

Page 66: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Phrase  (natural  language)  search  

Page 67: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Finding similarity between two paragraphs

 "Iran  refuses  to  accept  the  UN  proposal  to  end  the  dispute  over  work  on  nuclear  weapons",  "UN  nuclear  watchdog  passes  a  resolu+on  condemning  Iran  for  developing  a  second  uranium  enrichment  site  in  secret",  "A  recent  IAEA  report  presented  diagrams  that  suggested  Iran  was  secretly  working  on  nuclear  weapons",  "Iran  envoy  says  its  nuclear  development  is  for  peaceful  purpose,  and  the  material  evidence  against  it  has  been  fabricated  by  the  US"          "UN  passes  a  resolu+on  condemning  the  work  of  Iran  on  nuclear  weapons,  in  spite  of  Iran  claims  that  its  nuclear  research  is  for  peaceful  purpose",  "Envoy  of  Iran  to  IAEA  proceeds  with  the  dispute  over  its  nuclear  program  and  develops  an  enrichment  site  in  secret",  "Iran  confirms  that  the  evidence  of  its  nuclear  weapons  program  is  fabricated  by  the  US  and  proceeds  with  the  second  uranium  enrichment  site"  

Original  text  

Text,  found  on  the    web:  is  it  RELEVANT?  

Page 68: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Keywords: topic with no details

Iran,  UN,  proposal,  dispute,  nuclear,  weapons,  passes,  resolu+on,  developing,  enrichment,  site,  secret,  condemning,  second,  uranium      

Page 69: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Improvement: pair-wise generalization

[NN-­‐work  IN-­‐*  IN-­‐on  JJ-­‐nuclear  NNS-­‐weapons  ],      [DT-­‐the  NN-­‐dispute  IN-­‐over  JJ-­‐nuclear  NNS-­‐*  ],    [VBZ-­‐passes  DT-­‐a  NN-­‐resolu+on  ],      [VBG-­‐condemning  NNP-­‐iran  IN-­‐*  ],        [VBG-­‐developing  DT-­‐*  NN-­‐enrichment  NN-­‐site  IN-­‐in  NN-­‐secret  ]],      [DT-­‐*  JJ-­‐second  NN-­‐uranium  NN-­‐enrichment  NN-­‐site  ]],      [VBZ-­‐is  IN-­‐for  JJ-­‐peaceful  NN-­‐purpose  ],        [DT-­‐the  NN-­‐evidence  IN-­‐*  PRP-­‐it  ],      [VBN-­‐*  VBN-­‐fabricated  IN-­‐by  DT-­‐the  NNP-­‐us  ]    

Page 70: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Pair-wise vs. Parse thicket

 [NN-­‐Iran  VBG-­‐developing  DT-­‐*  NN-­‐enrichment  NN-­‐site  IN-­‐in  NN-­‐secret  ]  [NN-­‐generaliza+on-­‐<UN/nuclear  watchdog>  *  VB-­‐pass  NN-­‐resolu+on  VBG  condemning  NN-­‐  Iran]  [NN-­‐generaliza+on-­‐<Iran/envoy  of  Iran>  Communica@ve_ac@on    DT-­‐the  NN-­‐dispute  IN-­‐over  JJ-­‐nuclear  NNS-­‐*  [Communica@ve_ac@on  -­‐  NN-­‐work    IN-­‐of  NN-­‐Iran  IN-­‐on  JJ-­‐nuclear  NNS-­‐weapons]  [NN-­‐generaliza+on  <Iran/envoy  to  UN>    Communica+ve_ac+on    NN-­‐Iran  NN-­‐nuclear  NN-­‐*  VBZ-­‐is  IN-­‐for  JJ-­‐peaceful  NN-­‐purpose  ],        [Communica+ve_ac+on  -­‐  NN-­‐generalize  <work/develop>    IN-­‐of  NN-­‐Iran  IN-­‐on  JJ-­‐nuclear  NNS-­‐weapons]  [NN-­‐generaliza+on  <Iran/envoy  to  UN>    Communica+ve_ac+on    NN-­‐evidence  IN-­‐against  NN  Iran  NN-­‐nuclear      VBN-­‐fabricated  IN-­‐by  DT-­‐the  NNP-­‐us  ]  condemn^proceed  [enrichment  site]  <leads  to>    suggest^condemn  [  work  Iran  nuclear  weapon  ]    

Page 71: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Parse  Thicket  

•  Syntac+c  parse  trees  

•  Links  between  words  from  different  sentences  –  Anaphora  –  Rhetoric  Structure  Theory  (RST)  [Mann]    –  Speech  Act  Theory  (communica+ve  ac+ons,  CA)  [Searle]  

Page 72: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:
Page 73: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:
Page 74: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Applica+ons  

Page 75: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Case  Study:  Agent  ac+ng  on  behalf  of  a  user  pos+ng  messages  to  

maintain  friendship  

Page 76: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Domain  of  social  promo+on  

•  On   average,   people   have   200-­‐300   friends   or   contacts   on   social   network  systems  such  Facebook  and  LinkedIn.    

•  To   maintain   ac+ve   rela+onships   with   this   high   number   of   friends,   a   few  hours  per  week  is  required    

•   In  reality,  people  only  maintain  rela+onship  with  10-­‐20  most  close  friends,  family  and  colleagues,    

•  The  rest  of   friends  are  being  communicated  with  very   rarely.  These  not  so  close  friends  feel  that  the  social  network  rela+onship  has  been  abandoned  

Page 77: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

CASP  is  about  to  post  a  message  

Page 78: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Case  study  

Page 79: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

CASP  writes  an  essay  on  a  topic  

Page 80: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Content  Genera+on  part  of  CASP  •  The  content  genera+on  part  of  CASP  can  be  used  at  

www.facebook.com/RoughDra|Essay  •  Given  a  topic,  it  first  mines  the  web  to  auto  build  a  taxonomy  of  en++es  which  will  

be  used  in  the  future  comment  or  essay  •  Then  the  system  searches  the  web  for  these  en++es  to  find  pieces  of  content  to  

create  respec+ve  sec+ons  •  The  resultant  document  is  delivered  as  DOCX  email  aPachment  •  CASP  can  automa+cally  compile  texts  from  hundreds  of  sources  to  write  an  essay  

on  the  topic      •  If  a  user  wants  to  share  a  comprehensive  review,  opinion  on  something,  provide  a  

thorough  background,  then  this  interac+ve  mode  should  be  used.    

Page 81: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Evalua+on  of  CASP  We  measure  the  failure  score  of  an  individual  CASP  pos+ng  as  the  number  of  failures  in  relevance  and  appropriateness.  The  pos+ngs  started  with  [posted  by  an  agent  on  behalf  of  Boris]  

A|er  a  certain  number  of  failures,  friends    •  complain  •  unfriend  •  share  nega+ve  informa+on  about  the  loss  of  trust  with  

others    

•  and  even  encourage  other  friends  to  unfriend  an  friend  who  is  enabled  with  CASP.    

Page 82: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Trust  vs  Domain  &  Complexity  

Values  are  the  average  numbers  of  failed    CASP  pos+ngs,  for  given  domain  and  given  complexity,  when  issues  with  trust  arises  On  average,  friends  complain  a|er  6  failures  and  want  to  unfriend  CASP’s  host  a|er  10  failures  

Page 83: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Trust  vs  Relevance  

averaging  between  2  and  3  for  each  human  user  pos+ng  

•  For  less  informa+on-­‐cri+cal  domains  like  travel  and  shopping,  tolerance  to  failed  relevance  is  rela+vely  high.  

•  Conversely,  in  the  domains  taken  more  seriously,  like  job  related,  and  with  personal  flavor,  like  personal  life,  users  are  more  sensi+ve  to  CASP  failures  and  the  loss  of  trust  in  its  various  forms  occur  faster  

•  For  all  domains,  tolerance  slowly  decreases  when  the  complexity  of  pos+ng  increases.    

Page 84: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Conclusions  for  Facebook  applica+on  •  We  proposed  a  problem  domain  of  social  promo+on  and  build  a  conversa+onal  agent  CASP  to  act  in  this  domain  

•  CASP  maintains  friendship  and  professional  rela+onships  by  automa+cally  pos+ng  messages  on  behalf  of  its  human  host  

•  It  was  demonstrated  that  a  substan+al  intelligence  in  informa+on  retrieval,  reasoning,  and  natural  language-­‐based  relevance  assessment  is  required  to  retain  trust  in  CASP  agents  by  human  users  

Page 85: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Content  genera+on  agent  can  be  trusted  

•  We  evaluated  the  scenarios  of  how  trust  was  lost  in  Facebook  environment.    

•  We  confirmed  that  it  happens  rarely  enough  for  CASP  agent  to  improve  the  social  visibility  and  maintain  more  friends  for  a  human  host  than  being  without  CASP.      

Although  some  friends  lost  trust  in  CASP,  the  friendship  with  most  friends  was  retained  by  CASP    

   Overall  CASP’s  impact  on  social  ac+vity  is  posi+ve.  

Page 86: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Conclusions  

Page 87: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

Published  work    

• Galitsky,  B.,  Josep  Lluis  de  la  Rosa  &  Gábor  Dobrocsi.  Building  Integrated  Opinion  Delivery  Environment  FLAIRS-­‐24  2011.  • Galitsky,  B.  and  Josep  Lluis  de  la  Rosa.  Concept-­‐based  learning  of  human  behavior  for  customer  rela+onship  management.  Special  Issue  on  Informa+on  Engineering  Applica+ons  Based  on  Lalces.    Informa+on  Sciences.  Volume  181,  Issue  10,  15  May  2011,  pp  2016-­‐2035,  2011.  • Galitsky,  B.,  Josep-­‐Lluis  de  la  Rosa,  and  Boris  Kovalerchuk.  Assessing  plausibility  of  explana+on  and  meta-­‐explana+on  in  inter-­‐human  conflict.  Engineering  Applica+on  of  AI,  V  24  Issue  8,  pp  1472-­‐1486,    2011.  • Galitsky,  B.  Josep  Lluis  de  la  Rosa,  Gábor  Dobrocsi.  Inferring  the  seman+c  proper+es  of  sentences  by  mining  syntac+c  parse  trees.  Data  &  Knowledge  Engineering.  Available  online  28  July  2012.  • Galitsky,  B.  Machine  Learning  of  Syntac+c  Parse  Trees  for  Search  and  Classifica+on  of  Text.  Engineering  Applica+on  of  Ar+ficial  Intelligence,  2012    • Galitsky,  B.  Exhaus+ve  simula+on  of  consecu+ve  mental  states  of  human  agents.  Knowledge-­‐Based  Systems,  2013  

Page 88: ADeepTextAnalysisSystemBasedon OpenNLP · Outline 1. Introduc+on,to,OpenNLP.Similarity, component 2. Whatis,aDeep,TextAnalysis,and,its, Applicaons, 3. Func+onality,of,OpenNLP.Similarity:

JGraphT  

This project among open source projects

OpenNLP  

StanfordNLP    

Graph-­‐learning  Parse  

Thickets  Linguists  can  now    enjoy  graph-­‐  learning  of  linguis+c  structures,  and  graph  community  –  new  rich  source  of  graphs