

Page 1: S11P2

Accessing an Information System by Chatting

Bayan Abu Shawar and Eric Atwell

[email protected], [email protected]

School of Computing

University of Leeds

Page 2: S11P2

Presentation Outline

Introduction.

Chatbot and corpus definitions.

ALICE chatbot system.

What has been done so far.

System architecture of the Qur’an chatbot.

Results and Evaluation.

Page 3: S11P2

Introduction

Methods of accessing an information system:

Information Retrieval (IR): retrieves a relevant subset of documents from a large set.

Information Extraction (IE): the process of extracting specific pieces of data from documents to fill a list of slots in predefined templates.

We present another way to access an information system: using a chatbot tool.

Page 4: S11P2

Definitions

A chatbot is a computer program designed to simulate human conversation.

The user chats with the bot using textual or spoken natural language.

The chatbot must have access to knowledge (e.g., a set of input/output rules) so that it can accept input and match it against the rules to generate replies in the conversation.

We developed a machine learning approach that automatically generates chatting rules from machine-readable text (corpora) and converts them to the ALICE chatbot format.
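The matching step described above can be sketched in a few lines of Python. This is only an illustration: the rules, the `*` wildcard handling and the first-match policy are assumptions made for the example, and a real ALICE engine selects the best-matching rule rather than the first one.

```python
import re

# Hypothetical input/output rules; "*" stands for any run of words.
RULES = [
    ("HELLO *", "Hello, nice to meet you."),
    ("WHAT IS YOUR NAME", "My name is Bot."),
    ("*", "Please tell me more."),  # fallback rule
]


def normalise(text):
    """Upper-case the input and drop punctuation so it can be compared with patterns."""
    return " ".join(re.findall(r"[A-Z0-9]+", text.upper()))


def pattern_to_regex(pattern):
    """Turn a rule pattern into a regular expression, mapping "*" to one or more words."""
    parts = [r".+" if token == "*" else re.escape(token) for token in pattern.split()]
    return re.compile(r"^" + r"\s".join(parts) + r"$")


def reply(user_input, rules=RULES):
    """Return the reply of the first rule whose pattern matches the normalised input."""
    text = normalise(user_input)
    for pattern, template in rules:
        if pattern_to_regex(pattern).match(text):
            return template
    return ""  # nothing matched, e.g. empty input
```

Calling `reply("Hello there!")` picks the `HELLO *` rule, while input that matches no specific pattern falls through to the `*` rule.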

Page 5: S11P2

ALICE System

ALICE: the Artificial Linguistic Internet Computer Entity; a software robot that you can chat with using natural language.

ALICE is composed of two parts:

1. Chatbot Engine

2. The language model

The ALICE language model is stored in AIML files.

AIML: the Artificial Intelligence Markup Language.

Page 6: S11P2

The AIML Format

<aiml version="1.0">

<topic name="the topic">

<category>

<pattern>PATTERN</pattern>

<template>Template</template>

</category>

..

</topic>

</aiml>
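As a rough illustration of how a program might read this layout, the Python sketch below parses a small AIML string with the standard-library XML parser and lists its categories. The topic name, patterns and templates are made-up values, and `load_categories` is a hypothetical helper, not part of ALICE.

```python
import xml.etree.ElementTree as ET

# A small AIML document in the layout shown above; the topic, pattern and
# template text are made-up illustration values.
AIML_TEXT = """<aiml version="1.0">
<topic name="greetings">
<category>
<pattern>HELLO</pattern>
<template>Hi there!</template>
</category>
<category>
<pattern>GOOD MORNING</pattern>
<template>Good morning to you.</template>
</category>
</topic>
</aiml>"""


def load_categories(aiml_text):
    """Return a list of (topic, pattern, template) tuples found in an AIML string."""
    root = ET.fromstring(aiml_text)
    categories = []
    for topic in root.iter("topic"):
        for category in topic.iter("category"):
            pattern = category.findtext("pattern", default="").strip()
            template = category.findtext("template", default="").strip()
            categories.append((topic.get("name"), pattern, template))
    return categories


CATEGORIES = load_categories(AIML_TEXT)
```

Each category is one input/output rule: the pattern is what the user says and the template is what the chatbot answers.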

Page 7: S11P2

Implementing a Java Program

The primary goal of chatbots is to mimic real human conversations. We developed a Java program to read ‘real’ human dialogues and generate conversational rules for the ALICE chatbot.

The program reads a dialogue corpus.

It converts the dialogue transcript to AIML format.

The output AIML is used to retrain ALICE.
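The conversion step can be pictured with the following Python sketch. It is not the Java program itself: it assumes the transcript has already been reduced to a list of turns, treats each turn as the pattern and the next turn as its template, and uses hypothetical helper names (`to_pattern`, `dialogue_to_aiml`).

```python
import re
from xml.sax.saxutils import escape


def to_pattern(utterance):
    """Upper-case an utterance and strip punctuation, as AIML patterns expect."""
    return " ".join(re.findall(r"[A-Z0-9]+", utterance.upper()))


def dialogue_to_aiml(turns):
    """Pair each turn with the turn that follows it and emit one category per pair."""
    lines = ['<aiml version="1.0">']
    for prompt, response in zip(turns, turns[1:]):
        pattern = to_pattern(prompt)
        if not pattern:
            continue  # skip turns that contain no words
        lines.append("<category>")
        lines.append("<pattern>" + escape(pattern) + "</pattern>")
        lines.append("<template>" + escape(response) + "</template>")
        lines.append("</category>")
    lines.append("</aiml>")
    return "\n".join(lines)


# Hypothetical three-turn transcript, already stripped of speaker labels.
TURNS = ["Hello, how are you?", "Fine, thanks. And you?", "Not bad at all."]
AIML_OUTPUT = dialogue_to_aiml(TURNS)
```

Three turns yield two categories, because every turn except the last acts as a prompt for the one after it.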

Page 8: S11P2

The Aim of the Automatic Process

Saving time and effort in encoding the knowledge manually.

Generating different versions of the chatbot that are not restricted to a specific language and/or domain.

Creating versions that simulate ‘real’ human conversation.

Machine Learning Approach

Using the most significant word approach, based on the observation that people usually respond according to the most significant word.

A frequency list is obtained from each corpus and then used to find the least frequent word, which is treated as the most significant.
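A minimal Python sketch of the least-frequent-word idea follows. The toy corpus, the tokenisation and the tie-breaking rule (first word wins) are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-case word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_frequency_list(utterances):
    """Count how often each word occurs across the whole corpus."""
    counts = Counter()
    for utterance in utterances:
        counts.update(tokenize(utterance))
    return counts


def most_significant_word(utterance, frequencies):
    """Return the word of the utterance with the lowest corpus frequency.

    Ties are broken in favour of the word that appears first.
    """
    words = tokenize(utterance)
    if not words:
        return None
    return min(words, key=lambda word: frequencies[word])


# Hypothetical toy corpus.
CORPUS = [
    "what is the weather today",
    "what is the time",
    "the weather is cold today",
]
FREQUENCIES = build_frequency_list(CORPUS)
```

In this toy corpus "time" and "cold" each occur once, so they are picked as the significant words of their utterances, while common words such as "the" and "is" are never chosen when a rarer word is present.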

Page 9: S11P2

The Dialogue Corpora Used so Far

Minnesota: French dialogue corpus.

Spoken Afrikaans: Afrikaans dialogue corpus.

British National Corpus (BNC): Spoken transcripts.

Page 10: S11P2

The Holy Book of Islam (Qur’an)

The Qur’an is written in the classical Arabic form.

The Qur’an consists of 114 sooras (chapters), which are grouped into 30 parts.

Each soora consists of sequential verses (sections).

Page 11: S11P2

The Original English Text Format of Qur’an

Sample:

THE DAYBREAK, DAWN, CHAPTER NO. 113

With the Name of Allah, the Merciful Benefactor, The Merciful Redeemer

113.1 Say: I seek refuge with the Lord of the Dawn

113.2 From the mischief of created things;

113.3 From the mischief of Darkness as it overspreads;

113.4 From the mischief of those who practise secret arts;

113.5 And from the mischief of the envious one as he practises envy.
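Text in this layout is straightforward to split into verses. The Python sketch below assumes that every verse sits on one line beginning with `chapter.verse`, as in the sample, and that all other lines are headers; `parse_verses` is a hypothetical helper name.

```python
import re

# A verse line starts with "<chapter>.<verse>" followed by the verse text.
VERSE_LINE = re.compile(r"^(\d+)\.(\d+)\s+(.*\S)\s*$")


def parse_verses(text):
    """Return (chapter, verse, text) tuples for every line that starts with "chapter.verse"."""
    verses = []
    for line in text.splitlines():
        match = VERSE_LINE.match(line.strip())
        if match:
            chapter, verse, body = match.groups()
            verses.append((int(chapter), int(verse), body))
    return verses


SAMPLE = """THE DAYBREAK, DAWN, CHAPTER NO. 113
With the Name of Allah, the Merciful Benefactor, The Merciful Redeemer
113.1 Say: I seek refuge with the Lord of the Dawn
113.2 From the mischief of created things;
"""
VERSES = parse_verses(SAMPLE)
```

The title and opening lines are skipped because they do not start with a chapter and verse number.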

Page 12: S11P2

Using the Qur’an as a Trainable Corpus

We selected the Qur’an to illustrate:

1. Whether an information source can be accessed via chatting.

2. How to convert a written text to the AIML format.

3. How to adapt ALICE to learn from a text that is not a dialogue transcript.

4. How to adapt the ALICE interpreter to recognise Arabic characters.

Page 13: S11P2

The Qur’an chatbot

In this chatbot we used a parallel English/Arabic corpus.

Input: a statement, question or verse in English.

Output: verse(s) extracted from the Qur’an in both English and Arabic.

Problems raised:

1. How to divide a non-conversational text into utterance-like chunks?

2. How to enable the ALICE interpreter to recognise Arabic characters?
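One possible answer to the first problem is sketched below in Python. It assumes that each verse is treated as one utterance and that the reply to a verse is the verse that follows it in the same soora, which is consistent with the sample dialogue in these slides but is not confirmed as the actual implementation. The reply layout and the helper names are likewise assumptions.

```python
import re


def normalise(text):
    """Upper-case the text and drop punctuation to get a pattern-like key."""
    return " ".join(re.findall(r"[A-Z0-9]+", text.upper()))


def verses_to_rules(soora_name, verses):
    """Map each verse to a reply holding the next verse of the same soora.

    `verses` is a list of (verse_number, text) pairs in reading order.
    """
    rules = {}
    for (_, text), (next_number, next_text) in zip(verses, verses[1:]):
        reply = "{" + soora_name + "} " + str(next_number) + " " + next_text
        rules[normalise(text)] = reply
    return rules


# Verses 3 and 4 as quoted in the sample dialogue; this is not the full soora.
VERSES = [
    (3, "He begetteth not, nor is He begotten;"),
    (4, "And there is none like unto Him."),
]
RULES = verses_to_rules("Sincerity (Al-Iklas)", VERSES)
```

Looking up `normalise("He begetteth not, nor is He begotten")` in `RULES` returns the following verse together with its soora name and verse number.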

Page 14: S11P2

System Architecture

[Figure: flow diagram from the dialogue corpora to the AIML file. Box labels, in extraction order: Reading files; Filtration; Text pre-processing Phase; Reiteration Preparation; Restructuring Phase; Investigation Patterns Equalities; Investigation Templates Equalities; AIML FILE; String Vector.]

Page 15: S11P2

Results

Two versions of ALICE have been generated:

Qur’an0-13: http://www.pandorabots.com/pandora/talk?botid=832e8a088e348e83

Qur’an14-114: http://www.pandorabots.com/pandora/talk?botid=b60ec43cbe3481fe

Page 16: S11P2

Sample of Chatting Dialogue

H: Is it right that our protector is Allah?

Qur'an14-114: {Muhammad (Muhammad)} 11 That is because Allah is the Protector of those who believe, but those who reject Allah have no protector.

ذلك بأن الله مولى الذين آمنوا وأن الكافرين لا مولى لهم (سورة محمد 11)

{The Morning Star (At-Tariq)} 4 There is no soul but has a protector over it.

إن كل نفس لما عليها حافظ (سورة الطارق 4)

H: Allah, the eternal, Absolute

Qur'an14-114: {Sincerity (Al-Iklas)} 3 He begetteth not, nor is He begotten;

لم يلد ولم يولد (سورة الإخلاص 3)

H: He begetteth not, nor is He begotten

Qur'an14-114: {Sincerity (Al-Iklas)} 4 And there is none like unto Him.

ولم يكن له كفوا أحد (سورة الإخلاص 4)

Page 17: S11P2

Evaluation

Problems of evaluation:

1. It is not easy to evaluate this kind of general information access.

2. The source text is not organised around specific questions.

3. We cannot count the number of hits, so precision and recall scores cannot be computed.
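For reference, the Python sketch below shows how precision and recall are normally computed. It makes the difficulty concrete: both scores need a known set of relevant items for each query, which is not available here. The verse identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = hits / number of retrieved items
    recall    = hits / number of relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four verses returned, three verses judged relevant.
PRECISION, RECALL = precision_recall(
    ["v1", "v2", "v3", "v4"],
    ["v2", "v4", "v7"],
)
```

Without an agreed list of relevant verses for each input, the `relevant` argument cannot be filled in, so neither score can be calculated.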

Page 18: S11P2

Evaluation (cont.)

Evaluate the System in Terms of:

1. Achieving the proposed objectives.

2. Finding possible usefulness for the system.

3. User satisfaction.

Page 19: S11P2

Evaluation (cont.)

1. We achieved our goals, which focused on using a text that is not conversational in nature and on handling the Arabic language.

2. The feedback from users was as follows:

Some users found the tool unsatisfactory since it does not provide answers to the questions.

Others found it interesting to:

a. Know more about Qur’an.

b. Find out which soora a certain verse comes from.

Page 20: S11P2

Conclusions

1. We presented a novel way of accessing information from an online source by having an informal chat.

2. The system may be used as a search tool for verses that contain the same words but have different connotations.

3. It may be useful for finding the soora name of a certain verse.

4. Students could use it as a new method to recite the Qur’an.

Page 21: S11P2

Thank YOU

?