
Trinity College London

English qualifications for real-world communication

TECHNOLOGY FOR TEACHERS IN ASSESSMENT – THE IMMEDIATE FUTURE

1&2 November, 2018

Alex Thorp, Lead Academic - Europe

How far down the digital road will EL assessment go?

Overview

1. Back to the start – Introduction

2. Introducing AI – history and definitions

3. AI and language – NLP

4. Chatbots

5. AI and Language assessment (Speaking focus)

6. Case study – Communicative competence

7. Summary

8. Test evaluation – the 3 c’s

9. Future considerations

Introduction

Introduction – true or false?

Current AI still hugely limited, processing equivalent to a 2 year old

AI, and more particularly NLP, can now offer a fully automated 4-skill assessment solution

AI dates back as far as the 1950s

The human brain provided the model for modern machine learning

That which humans find easy, computers find difficult – and vice versa

Elon Musk labelled AI ‘a fundamental risk to the existence of civilization’

Machine scoring is more reliable than human scoring

I’ve utilized AI this morning!

Spot the odd one out?

Name of tool | Developer | Language learning/testing
Write & Improve | English Language iTutoring | Learning
Write & Improve +Class View | English Language iTutoring | Learning
Write & Improve +Test Zone | English Language iTutoring | Testing
Read & Improve (coming soon) | English Language iTutoring | Learning
Duolingo | Duolingo | Learning / testing
e-Rater | ETS | Testing
Writing Mentor | ETS | Learning
Language Muse Activity Palette | ETS | Learning / testing
AuraLang | AuraLang | Learning
BetterAccentTutor | Better Accent | Learning
TriplePlayPlus | Syracuse Language Systems | Learning
Test of English Language Learning | Pearson | Testing
Intelligent Essay Assessor | Pearson | Testing
IntelliMetric | Vantage Learning | Testing
MyAccess! | Vantage Learning | Learning
Project Essay Grade | MI | Learning / testing

Summary table of identified commercially-available language learning and language testing tools. Gillings et al. 2018

Computers as ‘tutors or tools’?

Introducing AI

Coding is the application of linguistic resource through a range of cognitive processes to generate meaning – often described as competences

Back to basics - Communication cycle



Back to the beginning

29,086 measures barley 37 months. Kushim

A clay tablet with an administrative text from the city of Uruk, c.3400–3000 BC. Probably our first ever recorded code. If Kushim was indeed a person, he may be the first individual in history whose name is known to us!

Y N Harari 2015

Let’s go back

Partial scripts

Numerical partial script became the language of advancement

As societies developed, external codes were required to cope with the sociological demands of supporting larger collectives

Full scripts

On to the era of computers…

Can computers think like humans?

H Simon and A Newell – Pittsburgh 1955. A thinking machine?

Can computers think like humans?

Alan Turing, 1948: 1st chess programme

How to overcome combinatorial explosion?

How to give intelligence to make good decisions?

Turing developed rules to guide.

The birth of Classical AI

A problem defined, a set of programmed rules applied (Heuristics)

Could plan complex operations in highly controlled environments

Could deliver maximum efficiency and economy

But classical AI couldn’t engage with its environment

Our world is a little more… chaotic

Enter Machine learning

Systems’ ability to learn for themselves from raw data (training datasets)

Systems learn from first principles, from structure in data, and seek potential solutions to problems

• Image recognition
• Voice recognition
• Optical character recognition
• Advanced customisation
• Intelligent data analysis
• Sensory data analysis

- Model (predicts) based on parameters
- Input to inform (training data)
- Learner (adjusts parameters through differences between prediction and actual)
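The three parts above (model, training data, learner) can be sketched in a few lines. This is a minimal illustration only: the straight-line model, the learning rate and the four data points are assumptions chosen to keep the example small, not something taken from a real assessment system.

```python
# Hypothetical toy example: learn y = w * x + b from training data.

def predict(x, w, b):
    """Model: predicts an output from the input using parameters w and b."""
    return w * x + b


def train(data, steps=2000, rate=0.01):
    """Learner: nudges the parameters using the gap between prediction and actual."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        for x, actual in data:
            error = predict(x, w, b) - actual
            w -= rate * error * x
            b -= rate * error
    return w, b


# Input: invented training data that follows y = 2x + 1.
training_data = [(0, 1), (1, 3), (2, 5), (3, 7)]
w, b = train(training_data)
print(round(w, 2), round(b, 2))
```

The learner is never told the rule; it only sees examples and keeps shrinking its own prediction error.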

1960s – Bayesian methods introduced for probabilistic inference

1980s – back propagation

1990s: Shift from a knowledge-driven to a data-driven approach – analysis of large amounts of data

>1990s: Support Vector machines and Recurrent Neural Networks

2010>: ANN and Deep learning

Enter Machine learning

Machine learning: Algorithms that parse data, learn from that data, and then apply what they’ve learned to make informed decisions.

The algorithm needs to be told how to make an accurate prediction

The Moravec Paradox

The things that our brains find difficult to cope with, that require a lot of conscious mental effort, like chess, were simple for AI.

The things that our brains find easy to cope with, that require little conscious mental effort, like making sense of what we see and hear, or movement, were very difficult for AI.

“We are prodigious Olympians in perceptual and motor areas… abstract thought, though, is a new trick… We’ve not yet mastered it”

(Moravec 1988)

How does ML work? Enter Artificial Neural Networks

You recognised a dog instantaneously, by the firing of choral assemblies of neural networks

Neural Networks consist of the following components:
• An input layer, x
• An arbitrary number of hidden layers
• An output layer, ŷ
• A set of weights and biases between each layer, W and b
• A choice of activation function for each hidden layer, σ
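To make the components listed above concrete, here is a small sketch of a single forward pass through one hidden layer. The layer sizes, the random starting weights and the use of a sigmoid on the output layer as well are all assumptions for illustration.

```python
import math
import random


def sigmoid(z):
    """Activation function: squashes any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-z))


def make_layer(n_in, n_out, rng):
    """Weights W (one row per unit) and biases b between two layers."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def layer_forward(inputs, weights, biases, activation):
    """Each unit: weighted sum of its inputs plus bias, then the activation."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


rng = random.Random(0)
W1, b1 = make_layer(3, 4, rng)  # input layer x (3 values) -> hidden layer (4 units)
W2, b2 = make_layer(4, 1, rng)  # hidden layer -> output layer

x = [1.0, 0.0, 1.0]
hidden = layer_forward(x, W1, b1, sigmoid)
y_hat = layer_forward(hidden, W2, b2, sigmoid)
print(y_hat)
```

With untrained weights the output is meaningless; training is what adjusts W and b until ŷ becomes useful.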

Enter Artificial Neural Networks

Artificial Neural Networks

[Diagram: an artificial neural network classifying a sentence sample as ‘Sentence’ or ‘Non sentence’. Node questions: Is there a full stop? Is there a capital? Is it at start of para? Is there a subject? Is there an object? Is there a noun? SVOCA? OC? VA?]

Training data

Training data – each time we tell it what it’s looking at, it tweaks the connections to better recognise what it’s looking for.
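A rough sketch of that tweaking, using a single artificial neuron (a perceptron) rather than a full network. The four yes/no features and the labelled samples are invented for illustration, loosely following the sentence / non-sentence example.

```python
# Each sample: (has_full_stop, starts_with_capital, has_subject, has_verb) -> label
# Label 1 = sentence, 0 = non sentence. All samples are hypothetical.
training_data = [
    ((1, 1, 1, 1), 1),
    ((1, 1, 1, 0), 0),
    ((0, 1, 1, 1), 0),
    ((1, 0, 1, 1), 0),
    ((1, 1, 0, 1), 0),
    ((0, 0, 0, 0), 0),
]


def classify(features, weights, bias):
    total = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if total > 0 else 0


def train(samples, max_epochs=1000):
    weights, bias = [0] * len(samples[0][0]), 0
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in samples:
            # We tell it what it is looking at; on a wrong guess it tweaks its connections.
            error = label - classify(features, weights, bias)
            if error:
                mistakes += 1
                weights = [w + error * f for w, f in zip(weights, features)]
                bias += error
        if mistakes == 0:  # every training sample is now recognised correctly
            break
    return weights, bias


weights, bias = train(training_data)
print(weights, bias)
```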

AI is now booming:
• Optimise harvesting
• Interpret medical images
• Grade students
• Identify financial opportunities
• Driverless cars

AI ANN: taught, then develops

Runs tens of thousands of simulations every second and chooses the best one

Enter Deep Learning

Solve intelligence. Use it to make the world a better place. (Mission statement – DeepMind)

Demis Hassabis - CEO

Entering a process (e.g. playing a game) through a ‘learning algorithm’ that changes millions of connections in a neural network to reinforce or stop an action to improve the desired outcome (not a task-based algorithm)

Uses Representation Learning – automatically discovers the characteristics needed for feature detection or classification of raw data, which are then used to perform a task

Deep learning: ML requires input – DL can learn by itself through a learning algorithm. E.g. automatic light – ML accepts only ‘dark’, DL would learn ‘I can’t see’

Could a DL neural network system go beyond human understanding?

AlphaGo played a completely unpredictable move – can come up with a new idea beyond the remit of human thought….

Let’s ‘Go’

In DL systems, the algorithm learns how to make accurate predictions through its own data processing (ML needs to be told).

AI Limitations

Can find patterns in, and learn from, data, but has no real understanding of what those patterns actually mean; there is no meaningful conceptual thinking.

• Patterns in complex data
• Convert data into meaningful concepts

• Process ‘predictable’ (images / outcomes)

• ‘Understand’ content or images – easily tricked

With no real conceptual understanding of patterns, the hardest challenge of all is the ability that relies on exactly this: language

Prof Al Khalili

• Data engagement beyond human capacity

• Operate autonomously – based on training datasets

AI and language

Recognise these?

Chatbot

NLP

NLU

ASR

NLG

AI

SDS

DMS


Communication cycle

AI in language - NLP

Automated Speech Recognition (ASR) Speech generation

Text recognition Text generation (NLG)

[Response driven]

When was Elvis born?

AI in language – Speech Recognition

Limited until advent of AI and Machine Learning techniques

Collect waveforms (phonetic input)

Fast Fourier Transform = spectrogram

Identifies resonances of production

Labels ‘Formants’ recognising phonemes, words and phrases

Converts to text – ‘best fit hypothesis’
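The first two steps (waveform to spectrogram) can be sketched as below. Assumptions: a synthetic signal of two pure tones stands in for speech, the sample rate and frame size are arbitrary, and a naive discrete Fourier transform is used in place of a true FFT (it gives the same values, only more slowly). Real formant detection is considerably more involved.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed)
FRAME = 200         # samples per frame, so each frequency bin is 40 Hz wide


def tone(freq_hz, n_samples):
    """Synthetic waveform: a pure tone standing in for recorded speech."""
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(n_samples)]


def dft_magnitudes(frame):
    """Naive discrete Fourier transform; an FFT computes the same values faster."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2)]


def spectrogram(signal):
    """Split the waveform into frames and take the spectrum of each frame."""
    return [dft_magnitudes(signal[i:i + FRAME])
            for i in range(0, len(signal) - FRAME + 1, FRAME)]


def peak_hz(spectrum):
    """Frequency of the strongest resonance in one frame."""
    k = max(range(len(spectrum)), key=spectrum.__getitem__)
    return k * SAMPLE_RATE / FRAME


waveform = tone(400, FRAME) + tone(1200, FRAME)
peaks = [peak_hz(s) for s in spectrogram(waveform)]
print(peaks)
```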

ASR Challenges – Who ate all the cake?

I think David ate all the delicious chocolate cake.

Tonic / Keywords / Onset – Volume / Pitch / Length / Pausing

Remarkable number of variables - immense amount of comparative data to be processed to arrive at correct hypothesis as to meaning beyond denotation.

Yet any communication act is a combination of oral production and non-verbal cues, paralinguistics and contextual parameters.

AI in language – Speech Recognition

Formants – limited with 44 phonemes and syntactic training. If only it were that easy :-)

Requires a ‘Language model’

Automatic Speech Recognition (ASR)

[Diagram: INPUT: speech signal (audio). Decoding, drawing on language models, acoustic model, lexical data and training data. OUTPUT: orthographic representation. Learns with more training data.]
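A hedged sketch of why decoding needs a language model as well as an acoustic model: two candidate transcriptions sound almost alike, and the language model tips the ‘best fit hypothesis’. All scores and counts are invented, and the add-one smoothed bigram model is a simple stand-in for whatever a production ASR system uses.

```python
import math

# Hypothetical acoustic scores: how well each candidate fits the audio (log-probabilities).
acoustic_log_prob = {
    "recognise speech": -4.1,
    "wreck a nice beach": -4.0,
}

# Hypothetical word-pair counts from a tiny training corpus.
bigram_counts = {
    ("<s>", "recognise"): 8, ("recognise", "speech"): 6,
    ("<s>", "wreck"): 1, ("wreck", "a"): 1, ("a", "nice"): 5, ("nice", "beach"): 2,
}
unigram_counts = {"<s>": 20, "recognise": 8, "wreck": 1, "a": 30, "nice": 6}
VOCAB_SIZE = 50  # assumed vocabulary size, used for add-one smoothing


def language_log_prob(sentence):
    """Bigram language model: how plausible is this word sequence?"""
    words = ["<s>"] + sentence.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        count = bigram_counts.get((prev, word), 0)
        total += math.log((count + 1) / (unigram_counts.get(prev, 0) + VOCAB_SIZE))
    return total


def best_hypothesis(candidates):
    """Decoding: combine acoustic fit with language-model plausibility."""
    return max(candidates, key=lambda s: acoustic_log_prob[s] + language_log_prob(s))


best = best_hypothesis(list(acoustic_log_prob))
print(best)
```

On acoustic score alone the second candidate would win; the language model reverses the choice.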

Text recognition

Fails if can’t parse sentences – higher risk when rules based

Can process sentence meaning (denotative)

Tag sentence structure (syntactic)

Parse tree - tag words with likely part of speech

Phrase structure rules (e.g. parts of speech)
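A minimal rules-based sketch of the steps on this slide: tag words with a part of speech, then check the tags against phrase structure rules. The tiny lexicon and the three rules (S -> NP VP, NP -> Det N | N, VP -> V NP | V) are assumptions for illustration. Note how an unknown word makes the parse fail, which is the higher risk of a rules-based approach.

```python
# Hypothetical lexicon mapping words to parts of speech.
LEXICON = {"the": "Det", "a": "Det", "dog": "N", "cake": "N", "david": "N",
           "ate": "V", "sleeps": "V"}


def tag(sentence):
    """Tag each word with its part of speech; unknown words get None."""
    return [LEXICON.get(word.lower()) for word in sentence.split()]


def match_np(tags, i):
    """NP -> Det N | N. Returns the index after the noun phrase, or None."""
    if tags[i:i + 2] == ["Det", "N"]:
        return i + 2
    if tags[i:i + 1] == ["N"]:
        return i + 1
    return None


def is_sentence(sentence):
    """S -> NP VP, where VP -> V NP | V."""
    tags = tag(sentence)
    i = match_np(tags, 0)
    if i is None or tags[i:i + 1] != ["V"]:
        return False
    i += 1
    if i == len(tags):
        return True
    return match_np(tags, i) == len(tags)


print(is_sentence("David ate the cake"))
print(is_sentence("David ate the delicious cake"))
```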

Natural Language Generation (text)

Fails if can’t access relevant semantic meaning – higher risk when rules based

Produces sentence ‘parsed text’ related to meaning (denotative)

Knowledge Graph generated (Google 70b+ facts end 2016)

Exploits web of semantic information (entities linked through meaningful relationships)

Codifying of language applied
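A toy sketch of response-driven generation from a web of semantic information, using the ‘When was Elvis born?’ example. The triple store, alias table, question patterns and templates are all hypothetical and say nothing about how Google's Knowledge Graph is actually built or queried.

```python
# Entities linked through meaningful relationships: (subject, relation, object).
TRIPLES = [
    ("Elvis Presley", "born_on", "8 January 1935"),
    ("Elvis Presley", "born_in", "Tupelo, Mississippi"),
    ("Elvis Presley", "occupation", "singer"),
]
ALIASES = {"elvis": "Elvis Presley"}
QUESTION_PATTERNS = {"when was": "born_on", "where was": "born_in"}
TEMPLATES = {
    "born_on": "{subject} was born on {obj}.",
    "born_in": "{subject} was born in {obj}.",
}


def answer(question):
    """Map the question to an entity and a relation, then generate text from the fact."""
    text = question.lower().rstrip("?")
    relation = next((rel for start, rel in QUESTION_PATTERNS.items()
                     if text.startswith(start) and text.endswith("born")), None)
    entity = next((name for alias, name in ALIASES.items() if alias in text.split()), None)
    for subject, rel, obj in TRIPLES:
        if subject == entity and rel == relation:
            return TEMPLATES[rel].format(subject=subject, obj=obj)
    return None  # fails if the relevant semantic information cannot be accessed


print(answer("When was Elvis born?"))
```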

Speech synthesis

• Speech recognition in reverse
• Text broken into phonetic elements
• Speech sound generated
• Rules of phonemic representation manipulable
• ML can extrapolate models from input (training) data

Putting the pieces together

So a computer can…

• Convert our speech to text
• Establish meaning
• Generate a text response
• Convert this text to speech

But can it have a meaningful conversation?

Spoken Dialogue Systems (SDS)

SDS use both speech and NLP technologies to enable extended human-machine conversation.

Determine appropriate system response

Commercially driven to achieve success in constrained conversation to achieve a specific scenario’s goal (Litman et al, 2016). Limited application in assessing interactive language.

• DMS uses ASR and NLU, in conjunction with an internal representation of ‘system state’

SDS – Dialogue Management system

• Limited number of ‘states’ – interaction at any point represented by one ‘state’

• Each utterance moves the interaction from one ‘state’ to another

• Applicable to mapped dialogues (scripted)

System ask: ‘Do you live in a town or the countryside?’

NLU – live – Town → System ask: ‘Which town do you live in?’

NLU – live – Countryside → System ask: ‘How far is the nearest town?’

NLU – live – ? → System say: (not_understood)

‘Finite state machine’ – predictable path of interaction – not spontaneous
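The town / countryside dialogue can be written as a finite state machine in a few lines. The state names and the keyword-spotting `understand` function are assumptions standing in for a real NLU component.

```python
# Hypothetical state names; prompts are taken from the example dialogue.
PROMPTS = {
    "ask_where": "Do you live in a town or the countryside?",
    "ask_which_town": "Which town do you live in?",
    "ask_distance": "How far is the nearest town?",
    "not_understood": "(not_understood)",
}

# Each utterance moves the interaction from one state to another.
TRANSITIONS = {
    "ask_where": {"town": "ask_which_town", "countryside": "ask_distance"},
}


def understand(utterance):
    """Stand-in for NLU: simple keyword spotting."""
    words = [w.strip(".,!?") for w in utterance.lower().split()]
    for meaning in ("town", "countryside"):
        if meaning in words:
            return meaning
    return None


def next_state(state, utterance):
    return TRANSITIONS.get(state, {}).get(understand(utterance), "not_understood")


state = next_state("ask_where", "I live in the countryside.")
print(PROMPTS[state])
```

Anything off the mapped path, such as ‘I live on a boat’, lands in `not_understood`, which is why this design suits scripted rather than spontaneous dialogue.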

Summary - AI in language - NLP

Automated Speech Recognition (ASR)

Speech generation


[Response driven]


Chatbots

Can AI simulate human interaction?

Chatbots – several programmes simultaneously analyse the user’s output, generate a wide range of hypothetical responses, and choose the one most likely to prolong the dialogic exchange:

• Person bot – personality with character and baseline facts
• Rapport bot – find out about you and your interests
• Wikibot – seek facts based on conversation content
• A ranking function – choosing the best response

Heriot-Watt University Alana the bot

Prof Oliver Lemon
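A toy sketch of the ensemble idea: several small bots each propose a response and a ranking function picks one. The bots, their replies and the scoring rule (reward questions and shared words) are invented for illustration; they are not a description of how Alana is actually implemented.

```python
# Hypothetical fact store for the wiki-style bot.
FACTS = {"elvis": "Elvis Presley was born in 1935. Do you like his music?"}


def person_bot(text):
    return "I'm a bot who loves music."


def rapport_bot(text):
    return "What do you enjoy doing at the weekend?"


def wiki_bot(text):
    for keyword, fact in FACTS.items():
        if keyword in text.lower():
            return fact
    return None  # nothing relevant to say


def rank(candidate, text):
    """Toy ranking function: reward overlap with the user's words and questions."""
    user_words = set(text.lower().strip("?.!").split())
    reply_words = set(candidate.lower().replace(".", " ").replace("?", " ").split())
    score = len(user_words & reply_words)
    if candidate.endswith("?"):
        score += 2  # a question is more likely to prolong the exchange
    return score


def respond(text):
    candidates = [bot(text) for bot in (person_bot, rapport_bot, wiki_bot)]
    candidates = [c for c in candidates if c is not None]
    return max(candidates, key=lambda c: rank(c, text))


print(respond("Tell me about Elvis"))
```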

Can AI simulate human interaction?

In communication there is a lot more going on than just words. Whilst AI can recognise complex patterns it cannot understand concepts.

AI still very limited in terms of:
• Pragmatics
• Socio-linguistic competence
• Strategic competence
• Co-constructed dialogue / authentic exchange

Let’s have a chat with a bot

https://www.masswerk.at/eliza/

Eliza the psychotherapist – heuristic engine

Mitsuku – ML engine, 60b+ messages processed

https://www.pandorabots.com/mitsuku/

Chatbot - task

In pairs, sharing a device, have a chat with Mitsuku (or alternative conversational chatbot)

1: Try a simple interaction
2: Try a more demanding dialogue

What went right? Why do you think it worked?
What did not work so well?
What do you think was the cause of the communication breakdown?

AI and language assessment

(speaking)

AI and language assessment (Productive skills)

Machine scores automatically generated

Utilise set criteria and dependent variables (e.g. repeat accuracy, length of production, fluency, vocabulary, grammar and pronunciation)

Compared to reference scores (manually set)

ASR - automated and human correlation

• Correlations improve with longer utterances (Bernstein, 2012; Neumeyer et al, 2000)

• Repeat accuracy = high correlation (0.92) (Graham et al, 2008)

• Repeat accuracy used as predictor of oral proficiency

• Further high correlation studies as predictors (Cook et al, 2011; De Wet et al, 2009)

• Predictive measures for fluency stronger for read speech rather than spontaneous (Cucchiarini et al, 2010)

• Correlations higher for rate of speech and accuracy compared to ‘goodness of pronunciation’ (Müller et al, 2009)
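For readers unfamiliar with the figures quoted above, this is how a correlation between human and machine scores is computed (Pearson's r). The two score lists are invented purely to show the calculation and do not come from any of the cited studies.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: 1.0 means the two sets of scores rise and fall together."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for six test-takers.
human_scores = [2, 3, 3, 4, 5, 5]
machine_scores = [2.4, 2.9, 3.6, 3.8, 4.6, 5.1]

r = pearson(human_scores, machine_scores)
print(round(r, 2))
```

A high r shows that the machine ranks test-takers much as the human raters do; it does not by itself show that both are measuring the intended construct.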

AI scores – case studies

 | Pearson PTE | ETS – TOEFL iBT
System | Versant | SpeechRater
Task example | Read aloud; Repeat sentence; Short answer | Opinion on familiar topic; Speak based on reading (total 6 tasks)
Scoring includes | Pronunciation; Fluency; Vocabulary; Sentence mastery | Pronunciation; Fluency; Grammatical facility; Topical coherence; Idea progression (multiple regression scoring)
Correlation | 0.84 - 0.92 | 0.73
Construct | Psycholinguistic (Van Moere, 2012). Predictive: ability to use core language in real time / use lexis to build phrases and clauses and articulate | Direct and immediate interaction (Butler et al, 2000). Contextualised and limited restriction: accounts for content, coherence and interaction (but task monologic)

Adapted from Litman et al, 2018

Only practice tests subject to further research

Automatic Speech Recognition (ASR) in assessment

• Repeat accuracy
• Length of production
• Fluency (rate of speech)
• Vocabulary – complexity and accuracy
• Grammar – complexity and accuracy
• Pronunciation (compared to reference acoustic model)

• Test task limited: e.g. elicited imitation, reading aloud or short free responses

• Limited opportunity for spontaneous or dialogic speech

• Copes with transactional rather than interactional dialogue
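Two of the measures listed above are simple enough to sketch once ASR has produced a transcript. The definitions used here are assumptions: repeat accuracy is taken as the share of prompt words reproduced in order, and fluency as words per minute. Operational systems will define and weight these differently.

```python
def repeat_accuracy(prompt, transcript):
    """Share of prompt words reproduced in order (longest common subsequence)."""
    a, b = prompt.lower().split(), transcript.lower().split()
    lcs = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, word_a in enumerate(a, 1):
        for j, word_b in enumerate(b, 1):
            if word_a == word_b:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return lcs[-1][-1] / len(a)


def rate_of_speech(transcript, seconds):
    """Fluency proxy: words per minute."""
    return len(transcript.split()) / seconds * 60


prompt = "I think David ate all the cake"
transcript = "I think David ate the cake"  # hypothetical ASR output, one word dropped

accuracy = repeat_accuracy(prompt, transcript)
words_per_minute = rate_of_speech(transcript, seconds=3)
print(round(accuracy, 2), words_per_minute)
```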

Dialogue Management system – Finite state

At utterance level, ‘states’ are created (for example):

Syntactic analysis
• Grammar errors
• (NLU: Grammar = No)

Semantic analysis
• Meaning for expected answer
• Detail or gist

Pragmatics
• Politeness
• Contextual coherence

Acoustic input
• Prosodic features
• Fluency

Alternative ‘state tracking’ – DMS gives probability of path

Fewer ASR errors, and can resolve ambiguities further into the dialogue
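A minimal sketch of the state tracking idea, assuming a simple Bayesian update: instead of committing to one state after each utterance, the system keeps a probability for every path and revises it as more evidence arrives. The states and likelihood values are invented.

```python
def update_belief(belief, likelihood):
    """Bayes' rule: weight each state by how well it explains the new utterance."""
    unnormalised = {state: belief[state] * likelihood[state] for state in belief}
    total = sum(unnormalised.values())
    return {state: p / total for state, p in unnormalised.items()}


# Before the user speaks, both paths are equally likely.
belief = {"town": 0.5, "countryside": 0.5}

# Turn 1: noisy ASR output, only weak evidence either way (hypothetical values).
belief = update_belief(belief, {"town": 0.55, "countryside": 0.45})

# Turn 2: the user mentions a farm, which fits 'countryside' far better.
belief = update_belief(belief, {"town": 0.2, "countryside": 0.8})

print(belief)
```

An early recognition error does not derail the dialogue here, because the ambiguity from turn 1 is resolved by turn 2.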

Potential application of holistic scales including CEFR. (Shashidhar et al, 2015)

Spoken Dialogue Systems (SDS) in assessment (Finite state)

• User tolerance of recognition errors

• Pedagogical value of misrecognised utterances

• Narrow domain scenario-guided conversation

• Useful for constrained and transactional dialogues

• Applicable where semi-scripted dialogue used

• Conversations simple and constrained
• Based on L1 competent model
• Test-takers have limited speaking skills, whereas SDSs are designed to process speech from proficient users

• Most conversational responses are not right or wrong (as required from tutorial dialogue system technology)

• SDS needs to be easily configurable by language experts

• Limited training data (despite machine learning)

The assessor will need to consider the extent to which their construct can accommodate [SDS’s] deviation from authentic dialogue (Litman et al, 2018)

Spoken Dialogue Systems (SDS)

Opportunities for spontaneous yet non-conversational speech, within constrained domain

State tracking SDSs – overcome ASR difficulties during dialogue

Applying AI – case study

Bachman & Palmer (2010): communicative competence model

Communicative Competence

Linguistic competence

Socio-linguistic competence

Discourse competence

Strategic competence

Case study - Communicative competence

Conversational features in co-constructed dialogue

Communicative competence - features

• Higher level contextual user ability – often related to concept
• Semantic and topical relationship – tied to utterance history
• Appropriate conversational functions (e.g. ending dialogue)
• Linguistic devices (referring expressions, prosody etc.)
• Turn taking conventions (linguistic signalling etc.)
• Conversation coordination (confirming understanding, recovering etc.)

Communicative competence

1: In groups of 3 or 4 you will be allocated one of the four competences.

- Identify elements of one competence (e.g. Socio-cultural = register)
- What parameters would need to be measured in spoken performance to assess these elements?
- Discuss whether you think the AI systems covered today could be applied to assess each element / the overall competence

2: Cross-group into groups of 4, with one person covering each competence.

- Share your ideas around the application of AI to the communicative competences

Summary

AI and automated language assessment

[Chart: axis labels ‘Development of AI over time’ (2010, 2018) and ‘Speaking constructs assessed’. Points marked: Linguistic competence; ? ? ?; Fully automated language assessment.]

Initial criticism as only narrow constructs could be assessed

Construct or assessment engine?

Choose the construct to assess, audit available technologies, and compensate for shortfalls with human intervention

or

Select available technologies, align them to the construct they can cover, and decide whether compensation is necessary

To what extent can contained measures be used as indicators or predictors of overarching language proficiency?

Construct – considerations

There is a rethinking of what speaking constructs could be….

• Expand theoretical definition of interactional competence
• Encompass co-constructive and dynamic dialogue
• Engage personal cognitive and contextual factors
• Incorporate digital literacies, human–machine interaction
• Consider narrower / partial constructs as sufficient predictors of proficiency
• Scope for plurilingual and translanguaging competencies
• Inclusion of transferable skills and mediation

There is a long road ahead…

Role of individual agency – impact on identity in test taking experience

‘I deserve to engage with a human’

AI in language assessment - identity

To conclude - some predictions

• Increasing number of collaborations between exam developers and high-tech IT companies

• Increasing use of blended modes of assessment delivery – digital / human

• Inclusion of digital literacies written into assessed constructs (coping with latency, paucity of NVs or paralinguistics, digital interface engagement, mediating NLP shortcomings etc.)

• Blended modes may include tasks of recorded human interaction that machines score – but not actual interaction with the machine (to overcome restrictions with SDSs etc.)

• Commercial opportunities for establishing L2 spoken corpora at differentiated CEFR levels for training datasets

• Development of AI formative assessment engine integrated into course delivery

To conclude - some predictions

And in the long term…

• Localisation to class level through locally populated datasets, driving adaptive assessment on an ongoing and formative basis, mediated through individual devices…


Alex Thorp, Lead Academic, Language (Europe)

[email protected]
