Sketch Engine: TOOL FOR TEXT-BASED TERMINOLOGY (INTEGRA TERMINOLOGY MANAGEMENT, ZAGREB)

Page 1: TOOL FOR TEXT-BASED TERMINOLOGY

Sketch Engine: TOOL FOR TEXT-BASED TERMINOLOGY

INTEGRA TERMINOLOGY MANAGEMENT, ZAGREB

Page 2: TOOL FOR TEXT-BASED TERMINOLOGY

Using texts for term mining

  • Building a corpus
    • Choosing texts
    • Converting into a common format (txt)
    • Annotation
      • Croatian: http://nlp.ffzg.hr/api-for-our-language-technologies/
    • Alignment
      • CAT tools (SDL, memoQ) or LF Aligner, https://sourceforge.net/p/aligner/wiki/Home/
  • Searching the corpus
    • Concordance tools: AntConc (free), WordSmith (€), ParaConc (free)
    • Web-based corpus workbench: Sketch Engine, http://www.sketchengine.co.uk

Page 3: TOOL FOR TEXT-BASED TERMINOLOGY

What is Sketch Engine?

  • Very powerful corpus workbench: https://www.sketchengine.co.uk/
  • Provides access to multiple pre-compiled corpora (British National Corpus, hrWaC, DGT corpora and many more)
  • NOT free, but not expensive :) (5.99 € per month)
  • Allows the creation of ad hoc corpora from web texts
  • Supports TMX import (for bilingual texts!)
  • Provides ways to extract terminology semi-automatically
  • Online tutorials: https://www.sketchengine.co.uk/sketch-engine-video-tutorials/

Page 5: TOOL FOR TEXT-BASED TERMINOLOGY

Simple concordances

Page 6: TOOL FOR TEXT-BASED TERMINOLOGY

Other query types

  • simple: searches for a word and its inflected forms
  • lemma: searches for all word forms that share the given lemma
  • phrase: searches for a sequence of several words
  • word: searches for one specific word form
  • character: searches for a string of characters
  • CQL: Corpus Query Language
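
The difference between the word, lemma and character query types can be illustrated outside Sketch Engine with a few hand-annotated tokens. The following is a minimal Python sketch, not Sketch Engine's implementation: the `Token` type, the sample words with their lemmas, and the three helper functions are all assumptions made for illustration.

```python
from typing import NamedTuple

class Token(NamedTuple):
    word: str   # surface form as it appears in the text
    lemma: str  # dictionary form (annotated by hand for this toy sample)

# Hypothetical mini-corpus; a real corpus is lemmatised by an annotation tool.
TOKENS = [
    Token("ribolov", "ribolov"),
    Token("ribolova", "ribolov"),
    Token("ribolovu", "ribolov"),
    Token("ribe", "riba"),
    Token("ribolovni", "ribolovni"),
]

def word_query(form):
    """word: return only tokens whose surface form equals `form`."""
    return [t.word for t in TOKENS if t.word == form]

def lemma_query(lemma):
    """lemma: return every word form annotated with `lemma`."""
    return [t.word for t in TOKENS if t.lemma == lemma]

def character_query(chars):
    """character: return every token that contains the string `chars`."""
    return [t.word for t in TOKENS if chars in t.word]

print(word_query("ribolova"))
print(lemma_query("ribolov"))
print(character_query("lov"))
```

In this sample the lemma query for "ribolov" returns all three inflected forms, the word query for "ribolova" returns only that exact form, and the character query for "lov" also picks up the adjective "ribolovni".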

Page 7: TOOL FOR TEXT-BASED TERMINOLOGY

Word Sketches

Page 8: TOOL FOR TEXT-BASED TERMINOLOGY

Thesaurus – similar words

Page 9: TOOL FOR TEXT-BASED TERMINOLOGY

Keyword extraction

Page 10: TOOL FOR TEXT-BASED TERMINOLOGY

Term queries in the DGT parallel corpus

  • simple queries: ribolov, brancin, grdobina
  • lemma queries: ribolov -> ribolova, ribolovu, ribolov
  • parallel query:
  • querying using CQL syntax:

Page 11: TOOL FOR TEXT-BASED TERMINOLOGY

Basic CQL

  • Typical format: [attribute="value"], e.g. [lemma="riba"]
  • Specifying word class or case: [tag="N.*"] (any noun), [tag="A.*"] (any adjective)
  • Regular expressions:
    • . (dot) matches any single character
    • * (asterisk) matches 0-100 repetitions
    • + (plus) matches 1-100 repetitions
    • {n,k} specifies an exact range of repetitions, from n to k
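
The regular-expression operators listed above behave much the same way in Python's `re` module, so they can be tried out locally. The sketch below assumes that a CQL value has to cover the whole attribute value, which is why it uses `re.fullmatch`; the helper name `value_matches` and the MULTEXT-East-style tag `Ncfsn` are illustrative assumptions, not taken from a real corpus. Note that Python's own `*` and `+` are unbounded, unlike the repetition limits quoted on the slide.

```python
import re

def value_matches(pattern, value):
    """Return True if the regular expression covers the whole attribute value."""
    return re.fullmatch(pattern, value) is not None

# (pattern, value) pairs to try; the tag "Ncfsn" is an invented example.
EXAMPLES = [
    ("N.*", "Ncfsn"),         # any tag starting with N
    ("A.*", "Ncfsn"),         # a noun tag is not an adjective tag
    ("rib.", "riba"),         # dot stands for exactly one character
    ("rib.", "ribama"),       # three extra characters are too many for one dot
    ("rib.+", "ribama"),      # plus allows one or more characters
    ("rib.{1,2}", "ribama"),  # {1,2} allows at most two extra characters
    ("ulov.*", "ulov"),       # asterisk also allows zero repetitions
]

for pattern, value in EXAMPLES:
    print(f"{pattern!r} vs {value!r}: {value_matches(pattern, value)}")
```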

Page 12: TOOL FOR TEXT-BASED TERMINOLOGY

[lemma="rad"]

Page 13: TOOL FOR TEXT-BASED TERMINOLOGY

[tag="A.*"][lemma="riba"]

Page 14: TOOL FOR TEXT-BASED TERMINOLOGY

"ulov.*" []{0,3} [tag="N.*"]

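How a query of this shape walks over the corpus (a word starting with "ulov", then up to three arbitrary tokens, then a noun) can be sketched with a small matcher in Python. This is a toy illustration under stated assumptions, not how Sketch Engine evaluates CQL: the sample sentence, its MULTEXT-East-style tags, the `(constraint, min, max)` pattern encoding, and the functions `token_ok`, `match_at` and `search` are all hypothetical.

```python
import re

# Hand-tagged toy sentence; the tags are invented for illustration.
SENTENCE = [
    {"word": "ulov", "lemma": "ulov", "tag": "Ncmsn"},
    {"word": "male", "lemma": "mali", "tag": "Agpfsg"},
    {"word": "plave", "lemma": "plav", "tag": "Agpfsg"},
    {"word": "ribe", "lemma": "riba", "tag": "Ncfsg"},
]

# Each step is (constraint, min repetitions, max repetitions).
# An empty constraint plays the role of [] and accepts any token.
QUERY = [({"word": "ulov.*"}, 1, 1), ({}, 0, 3), ({"tag": "N.*"}, 1, 1)]

def token_ok(token, constraint):
    """Check every attribute pattern against the whole attribute value."""
    return all(re.fullmatch(rx, token.get(attr, "")) for attr, rx in constraint.items())

def match_at(tokens, pattern, start):
    """Return the end index of the shortest match beginning at `start`, or None."""
    if not pattern:
        return start
    constraint, lo, hi = pattern[0]
    pos = start
    for _ in range(lo):  # consume the mandatory repetitions
        if pos >= len(tokens) or not token_ok(tokens[pos], constraint):
            return None
        pos += 1
    for count in range(lo, hi + 1):  # then try the rest, fewest repetitions first
        end = match_at(tokens, pattern[1:], pos)
        if end is not None:
            return end
        if count == hi or pos >= len(tokens) or not token_ok(tokens[pos], constraint):
            return None
        pos += 1
    return None

def search(tokens, pattern):
    """Collect the matched word sequences for every possible start position."""
    hits = []
    for start in range(len(tokens)):
        end = match_at(tokens, pattern, start)
        if end is not None:
            hits.append([t["word"] for t in tokens[start:end]])
    return hits

print(search(SENTENCE, QUERY))
```

On the toy sentence the two adjectives fill the gap, so the single hit spans all four words. Narrowing the gap to `({}, 0, 1)` yields no hit, because the noun is then out of reach.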

Page 15: TOOL FOR TEXT-BASED TERMINOLOGY

Challenges

  • Search for verbs occurring before the word "ugovor" with up to 2 words in between.
  • Search for words ending with "anje".
  • Search for defining contexts containing a noun in the nominative case followed by "je" followed by an adjective and noun in the nominative case.

Page 16: TOOL FOR TEXT-BASED TERMINOLOGY

Looking for definitions

  • Exploit typical definition patterns:
    • [X] is a [Y]
    • [X] is defined as [Y]
    • [X] is a kind of [Y]
    • …
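
The definition patterns above can also be tried on plain text with ordinary regular expressions. The following Python sketch is only an approximation for English sentences: the pattern list, the group names `term` and `definition`, and the function `find_definition` are assumptions for illustration, and a real search would normally run as a CQL query over an annotated corpus.

```python
import re

# More specific patterns come first so that "is a kind of" is not
# swallowed by the generic "is a" pattern.
DEFINITION_PATTERNS = [
    re.compile(r"^(?P<term>.+?) is defined as (?P<definition>.+?)\.?$"),
    re.compile(r"^(?P<term>.+?) is a kind of (?P<definition>.+?)\.?$"),
    re.compile(r"^(?P<term>.+?) is an? (?P<definition>.+?)\.?$"),
]

def find_definition(sentence):
    """Return (term, definition) if the sentence fits a pattern, else None."""
    text = sentence.strip()
    for pattern in DEFINITION_PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group("term"), match.group("definition")
    return None

for sentence in [
    "Bluetongue is a viral disease of ruminants.",
    "A trawl is a kind of fishing net.",
    "The fish swam away.",
]:
    print(sentence, "->", find_definition(sentence))
```

Surface patterns like these produce false positives (for example "This is a problem"), so the hits are candidates for manual review rather than finished definitions.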

Page 17: TOOL FOR TEXT-BASED TERMINOLOGY

WebBootCat

  • Tool to create text collections from web pages
  • User provides keywords and optionally selects sites to crawl
  • Once the corpus is compiled, it can be queried or downloaded

Page 23: TOOL FOR TEXT-BASED TERMINOLOGY

TMX Upload

  • Allows you to create corpora from your translation memories
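
Before uploading, it can help to check what a translation memory actually contains. The sketch below reads the aligned segments of a TMX document with Python's standard `xml.etree.ElementTree` module. The sample content, the constant `SAMPLE_TMX` and the function `read_tmx` are made up for illustration; the code relies only on the `tu`, `tuv` and `seg` elements and the `xml:lang` attribute of TMX 1.4, and says nothing about how Sketch Engine itself processes the file.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny hand-written TMX sample (hypothetical content).
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>fishing</seg></tuv>
      <tuv xml:lang="hr"><seg>ribolov</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>sea bass</seg></tuv>
      <tuv xml:lang="hr"><seg>brancin</seg></tuv>
    </tu>
  </body>
</tmx>"""

def read_tmx(text):
    """Return one {language: segment text} dictionary per translation unit."""
    root = ET.fromstring(text)
    units = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext())
        units.append(segments)
    return units

for unit in read_tmx(SAMPLE_TMX):
    print(unit)
```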

Page 24: TOOL FOR TEXT-BASED TERMINOLOGY

Terminology extraction

  • Works for languages with a predefined "term grammar"
  • Manage corpus -> Keywords and terms
  • Terms can be exported to TBX or CSV

Page 26: TOOL FOR TEXT-BASED TERMINOLOGY

Exercise

  • Use the corpus-derived information on the following slides to create a term entry for "bluetongue".

Pages 27-29: [image-only slides with the corpus-derived information for the exercise; no text survives]