Next Generation Information Extraction (NGIE) in Multilingual Environment
(collaborative project with TCS)
Pushpak Bhattacharyya, CSE Dept., IIT Bombay, 19 May, 2014
NGIE Project Overview
The goal of the project is to develop Next Generation Information Extraction technology
The IE environment will be multilingual
Involves Machine Translation and Cross-Lingual Search
The IE focus is on relation extraction, named entity extraction, multiword extraction, semantic role labeling and corpus management
Relation extraction and named entity extraction are to be done jointly, since they are synergistic (e.g., CEO_of is a relation between a Person and an Organization)
The fruits of this research are to be carried over to the TCS IE environment, called INX
High-quality publications in IE, covering all the above tasks and combinations thereof
Principal Investigators and other members
1. Mr. Girish Palshikar, Principal Scientist, Systems Research Lab, Tata Consultancy Services Limited
2. Dr. Pushpak Bhattacharyya, Professor, Department of Computer Science & Engineering, IIT Bombay
3. Other members: Rohit Bangera, Sachin Pawar, Rudra Murthy, Girish Ponkia, Ravi Soni, Manish Shrivastava, Diptesh Kanojia, Gajanan Rane
NGIE Project accomplishments (1/3)
1. Relation Extraction: A relation extraction system has been built that extracts entity mentions from a natural language sentence and identifies the relationships among them. The following papers have been published:
Sachin Pawar, Pushpak Bhattacharyya and Girish Keshav Palshikar, Semi-supervised Relation Extraction using EM Algorithm, International Conference on NLP (ICON 2013), Noida, India, 18-20 December, 2013
Sachin Pawar, Pushpak Bhattacharyya and Girish Keshav Palshikar, Improving Relation Extraction Using A Joint Model of Entities and Relations, under revision.
Relation Extraction: Joint Model of Entities and Relations
E1, E2: Types of the first and second entity mentions
R: Type of the relation between the two mentions
F: Feature vector capturing characteristics of the entity mentions and how they occur in the sentence
Can be used in:
Semi-supervised mode: F, E1 and E2 are known, R is unknown; the EM algorithm is used to learn the model parameters.
Supervised mode: F, E1, E2 and R are all known while learning.
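The semi-supervised mode can be illustrated with a small sketch. It is not the authors' model: it assumes a naive-Bayes style factorization in which E1, E2 and each feature in F are conditionally independent given R, and it assumes a few seed instances carry a known R. The function name `em_relation_types`, the toy instances and the single-word features are all hypothetical.

```python
import math
from collections import defaultdict

def em_relation_types(instances, labels, relations, n_iter=10, alpha=0.1):
    """Estimate P(R | E1, E2, F) per instance with a naive-Bayes mixture.

    instances: list of (e1_type, e2_type, feature_list)
    labels:    relation name for seed instances, None where R is unknown
    """
    K = len(relations)
    # Assumption: E1, E2 and every feature are categorical observations,
    # conditionally independent given R.
    obs = [[("E1", e1), ("E2", e2)] + [("F", f) for f in feats]
           for e1, e2, feats in instances]
    vocab = sorted({o for inst in obs for o in inst})

    # Seeds get a one-hot posterior; unlabeled instances start uniform.
    post = [[1.0 if r == lab else 0.0 for r in relations] if lab is not None
            else [1.0 / K] * K
            for lab in labels]

    for _ in range(n_iter):
        # M-step: re-estimate P(R) and P(observation | R) from soft counts,
        # with add-alpha smoothing.
        prior = [sum(p[k] for p in post) / len(obs) for k in range(K)]
        cond = []
        for k in range(K):
            counts = defaultdict(float)
            for inst, p in zip(obs, post):
                for o in inst:
                    counts[o] += p[k]
            total = sum(counts.values()) + alpha * len(vocab)
            cond.append({o: (counts[o] + alpha) / total for o in vocab})
        # E-step: recompute P(R | instance) for the unlabeled instances only.
        for i, inst in enumerate(obs):
            if labels[i] is not None:
                continue
            logp = [math.log(prior[k] + 1e-12)
                    + sum(math.log(cond[k][o]) for o in inst)
                    for k in range(K)]
            top = max(logp)
            w = [math.exp(v - top) for v in logp]
            post[i] = [v / sum(w) for v in w]
    return post

# Toy data (hypothetical): two seeds, three instances with R unknown.
relations = ["PER-SOC", "GPE-ORG"]
instances = [
    ("PERSON", "PERSON", ["his"]),
    ("ORGANIZATION", "GPE", ["in"]),
    ("PERSON", "PERSON", ["for"]),
    ("ORGANIZATION", "GPE", ["in"]),
    ("PERSON", "PERSON", ["his"]),
]
labels = ["PER-SOC", "GPE-ORG", None, None, None]
post = em_relation_types(instances, labels, relations)
predicted = [relations[max(range(len(relations)), key=lambda k: p[k])]
             for p in post]
```

In supervised mode every entry of `labels` would be known, so the loop reduces to a single M-step with hard counts. With no seeds at all, the uniform start used here is a fixed point of EM, so some other initialization would be needed.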
Relation Extraction: Example
Input sentence: Patricia Newell, an organizer for Nader at the University of Florida in Gainesville, said that Nader had won far fewer votes in Florida than his supporters had expected.
Entity Mentions Extracted:
PERSON: Patricia Newell, organizer, Nader, Nader, his, supporters
ORGANIZATION: University of Florida
GPE (Geo-Political Entity): Gainesville, Florida
Relations Extracted:
Relation    Entity Mention 1         Entity Mention 2
PER-SOC     organizer                Nader
GPE-ORG     University of Florida    Gainesville
PER-SOC     his                      supporters
NGIE Project accomplishments (2/3)
2. Multiword Extraction: Identifying and extracting multiword expressions using deep learning (multilayered neural networks)
Paper submitted to COLING 2014 (Ireland):
Rahul Sharnagat, Rudra Murthy V, Dhirendra Singh, Pushpak Bhattacharyya, Identification of Multiword Named Entities using Co-occurrence Statistics and Distributed Word Representation.
MWE situations
• Machine Translation
  o यूक्रेन की सेना ने क्रीमिया के सीमावर्ती इलाकों में अपना डेरा डाल दिया है। (Ukraine's army has set up camp in the border areas of Crimea.)
• Natural Language Generation
  o "Good said" or "Well said"?
  o Baby changing room (what is changed?)
Challenges of MWE
• ಈ ಕೆಲಸವು ಕಬ್ಬಿಣದ ಕಡಲೆಯೇ ಸರಿ
  o Transliteration: I kelasavu kabbiNada kaDaleyE sari
  o Gloss: this job iron nut correct
  o Translation: This job is a hard nut to crack
  o Google: This work is strong meat
• ಯಾರ ಹತ್ತಿರವೂ ಕೈ ಚಾಚಬೇಡ
  o Transliteration: yAra hattiravU kai chAchabEDa
  o Gloss: which near hand no extend
  o Translation: Do not ask help from anybody
  o Google: Whose ever hand cacabeda
MWE Extraction Taxonomy
[Figure: taxonomy of MWE Extraction Techniques. Node labels: Rule Based; Empirical; Statistical Measures Based; Similarity Based; Thesaurus Based; Distributional Word Representation]
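As an illustration of the statistical-measures branch, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI). PMI is one common association measure and is used here as an assumption; the slides do not say which measures the project relies on. The function name `bigram_pmi` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def bigram_pmi(sentences, min_count=2):
    """Rank adjacent word pairs by PMI = log2(P(x, y) / (P(x) * P(y)))."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    # Highest-scoring pairs are the strongest MWE candidates.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy corpus (hypothetical), already tokenized.
corpus = [
    "this job is a hard nut to crack".split(),
    "he is a hard nut".split(),
    "a hard nut is a problem".split(),
]
ranked = bigram_pmi(corpus)
```

On this corpus the pair ("hard", "nut") ranks first, because both words occur almost only together, while "a" and "is" also appear in other contexts. A similarity-based method would instead compare the phrase's representation with those of its component words.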
MultiWord Extraction Process
Artificial Neural Networks (ANNs) have been successfully applied to various Natural Language Processing tasks
ANNs are able to capture the semantics of a word
Use ANNs to extract MWEs from text: Deep Learning
Multi Lingual POS Projection: HMM Results with Hindi as Helper
• HMM trained on Hindi
• Tested on Hindi words aligned with source language words

Language    Hindi as Helper
Marathi     55.18
Bengali     41.11
Gujarati    42.23
Punjabi     45.54
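The projection step can be sketched as follows, assuming the helper-language (Hindi) sentence has already been tagged and a word alignment to the other language is available. The HMM tagger and the aligner are not shown. The function name `project_pos_tags`, the tag names and the romanized toy sentence pair are hypothetical.

```python
from collections import Counter

def project_pos_tags(helper_tags, alignment, n_target, default="UNK"):
    """Copy POS tags from helper-language tokens to aligned target tokens.

    helper_tags: tags of the helper (Hindi) tokens, in order
    alignment:   list of (target_index, helper_index) pairs
    n_target:    number of tokens in the target sentence
    """
    votes = [Counter() for _ in range(n_target)]
    for t_idx, h_idx in alignment:
        votes[t_idx][helper_tags[h_idx]] += 1
    # Each target token takes the most frequent tag among its aligned
    # helper tokens; unaligned tokens fall back to the default tag.
    return [v.most_common(1)[0][0] if v else default for v in votes]

# Toy pair (hypothetical, romanized):
# Hindi:   raam ghar jaataa hai
# Marathi: raam ghari jaato
hindi_tags = ["NOUN", "NOUN", "VERB", "AUX"]
alignment = [(0, 0), (1, 1), (2, 2)]
marathi_tags = project_pos_tags(hindi_tags, alignment, n_target=3)
```

Here `marathi_tags` comes out as NOUN, NOUN, VERB. Tokens with no alignment link receive the default tag, which is one reason projected tagging can lose accuracy on less closely aligned language pairs.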
Summary
Project goal: Advanced IE in a multilingual setting
Involves Machine Translation and Search too
Sophisticated machine learning techniques such as Markov Logic Networks and Deep Learning are to be used for NLP
The incumbent will get into the depths of ML and NLP, with active support for existing project work
Expectation: day-to-day project work, attending research evaluation meetings around the country, publishing, and creating downloadable resources and tools
Thank you
Lab URL: http://www.cfilt.iitb.ac.in
My URL: http://www.cse.iitb.ac.in/~pb