62
Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Embed Size (px)

DESCRIPTION

Two categories of SPM (That we’ll discuss today) Approaches Coming from Association Rule Mining MOTIF Extraction

Citation preview

Page 1: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Special Topics in Educational Data Mining

HUDK5199Spring, 2013

April 24, 2012

Page 2: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Today’s Class

• Sequential Pattern Mining

Page 3: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Two categories of SPM(That we’ll discuss today)

• Approaches Coming from Association Rule Mining

• MOTIF Extraction

Page 4: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Association Rule Mining

• Try to automatically find if-then rules within the data set

Page 5: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Sequential Pattern Mining

• Try to automatically find temporal patterns within the data set

Page 6: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

ARM Example

• If person X buys diapers,• Person X buys beer

• Purchases occur at the same time

Page 7: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

SPM Example

• If person X buys novel Foundation now,• Person X buys novel Second Foundation in a

later transaction

• Conclusion: recommend Second Foundation to people who have previously purchased Foundation

Page 8: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

SPM Example

• Many customers rent Star Wars, then the Empire Strikes Back, then Return of the Jedi

• Doesn’t matter if they rent other stuff in-between

Page 9: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

SPM Example

• Many customers buy flowers, and then buy diapers AND diaper cream several months later

Page 10: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

SPM Example

• Many learners become confused, then game the system, then become frustrated, then complete gaming the system, then become re-engaged

Page 11: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Different Constraints than ARM

• If-then elements do not need to occur in the same data point

• Instead– If-then elements should have same user (or other

organizing variable)– If elements can be within a certain time window

of each other– Then element time should be within a certain

window after if times

Page 12: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Sequential Pattern Mining

• Find all subsequences in data with high support

• Support calculated as number of sequences that contain subsequence, divided by total number of sequences

Page 13: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Sequential Pattern Mining

• What are some subsequences with high support? (What is their support?)

• Chuck: a, abc, ac, de, cef • Darlene: af, ab, acd, dabc, ef• Egoberto: aef, ab, aceh, d, ae• Francine: a, bc, acf, d, abeg

Page 14: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Questions? Comments?

Page 15: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Algorithms for SPM

Page 16: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

GSP (Generalized Sequential Pattern)

• Classic Algorithm• (Srikant & Agrawal, 1996)

Page 17: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Data pre-processing

• Data transformed from individual actions to sequences by user

• E.g.• Bob: {GAMING and BORED, OFF-TASK and

BORED, ON-TASK and BORED, GAMING and BORED, GAMING and FRUSTRATED, ON-TASK and BORED}

Page 18: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Data pre-processing

• In some cases, time also included

• E.g.• Bob: {GAMING and BORED 5:05:20, OFF-TASK

and BORED 5:05:40, ON-TASK and BORED 5:06:00, GAMING and BORED 5:06:20, GAMING and FRUSTRATED 5:06:40, ON-TASK and BORED 5:07:00}

Page 19: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Algorithm• Take the whole set of sequences of length 1

– May include “ANDed” combinations at same time• Find which sequences of length 1 have support over pre-

chosen threshold• Compose potential sequences out of pairs of sequences of

length 1 with acceptable support• Find which sequences of length 2 have support over pre-

chosen threshold• Compose potential sequences out of triplets of sequences of

length 1 and 2 with acceptable support• Continue until no new sequences found

Page 20: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Let’s execute GPS algorithm

• With min support = 50%

Page 21: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Let’s execute GPS algorithm

• With min support = 50%

• Chuck: a, abc, ac, de, cef • Darlene: af, ab, acd, dabc, ef• Egoberto: aef, ab, aceh, d, ae• Francine: a, bc, acf, d, abeg

Page 22: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Other algorithms

• Free-Span• Prefix-Span

• Select sub-sets of data to search within

• Faster, but same basic idea as in GPS

Page 23: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Uses in educational domains

Page 24: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Perera et al. (2009)

• What were the three ways that Perera et al. (2009) used sequential pattern mining?

• What did they learn, and how did they use the information?

Page 25: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Perera et al. (2009)

1. Overall uses of collaborative tools by groups2. Sequences of collaborative tool use by different group

members3. Sequences of access of specific resources by different

group members

• In all cases, they found common patterns and then looked at how support differed for successful and unsuccessful groups

Page 26: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Perera et al. (2009):Important Findings

1. Overall uses of collaborative tools by groups– Successful groups used ticketing system more

than the wiki; weaker groups used wiki more– Patterns were particularly strong for group

leaders

Page 27: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Perera et al. (2009):Important Findings

2. Sequences of collaborative tool use by different group members– Successful groups characterized by leader

opening ticket and other student working on ticket

– Successful groups characterized by students other than leader opening ticket, and other students working on ticket

Page 28: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Perera et al. (2009):Important Findings

3. Sequences of access of specific resources by different group members– The best groups had interactions around the

same resource by multiple students – The poor groups did no work on tickets before

closing them

Page 29: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Questions? Comments?

Page 30: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

MOTIF Extraction

Page 31: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Motif

• Short, recurring pattern in a sequence of categories occurring over time

Page 32: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Motif in Music

• Short, recurring pattern of notes in a musical composition

Page 33: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Motif in Music

• What’s the motif?

• http://www.youtube.com/watch?v=rRgXUFnfKIY

• How many times does the motif occur?

Page 34: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Motif in Music

• What’s the motif?

• http://www.youtube.com/watch?v=rRgXUFnfKIY

• How many times does the motif occur?– Depends on how you define it, right?– And that’s part of the challenge…

Page 35: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Motif in Language

• Short, recurring pattern of characters in a sequence of characters occurring over time

Page 36: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Motif in Genetics

• Short, recurring pattern of genes in a sequence of genes occurring over time

• Typically written as letters

Page 37: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Goal of Motif Extraction

• Discern a common pattern of characters in a large corpus of characters

• The characters may vary slightly from case to case

Page 38: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Can you find the motif?

Page 39: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Can you find the motif?

UBSWWDFKLWPRHUC

JBDPXBDVEJVMBKK

VBDWNLROFVUBFFWOWIFTIENDOXJXIOBAUAAOOXZAABZSBT

MUAWSNTVZXSFHMILFQRKUTFRIENDOV

LOMTPOQHJVYYMFJ

LWGJMVPKYOZNMSA

RUPMFOHPVSPPVPTBAZXVFTPQFQJVBM

HLPMOKUOXGRIENDO

INUSUNSGDAAICAV

XRZZWCDXOVZZJKQ

VOVCROMCJTOLXYU

HUVRYFREENDOBBGC

AQJBVXJCAJLEMAU

ONJORIFCGAUGIRN

PJGCHBDQIWJJTMQ

IQYQHKKBNBVDFPV

JJLHWPZAYZIGGEH

IGJZRMAAWJBESSS

JXZFRIEMDOVZRBJY

IRPWYIRJISLFVFF

Page 40: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

How would you describe the motif?

UBSWWDFKLWPRHUC

JBDPXBDVEJVMBKK

VBDWNLROFVUBFFWOWIFTIENDOXJXIOBAUAAOOXZAABZSBT

MUAWSNTVZXSFHMILFQRKUTFRIENDOV

LOMTPOQHJVYYMFJ

LWGJMVPKYOZNMSA

RUPMFOHPVSPPVPTBAZXVFTPQFQJVBM

HLPMOKUOXGRIENDO

INUSUNSGDAAICAV

XRZZWCDXOVZZJKQ

VOVCROMCJTOLXYU

HUVRYFREENDOBBGC

AQJBVXJCAJLEMAU

ONJORIFCGAUGIRN

PJGCHBDQIWJJTMQ

IQYQHKKBNBVDFPV

JJLHWPZAYZIGGEH

IGJZRMAAWJBESSS

JXZFRIEMDOVZRBJY

IRPWYIRJISLFVFF

Page 41: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Finding motifs

• Several algorithms

Page 42: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Finding motifs

• Variant on PROJECTION algorithm (Tompa & Buhler, 2001) used in (Shanabrook et al., 2010)– Only example of motif extraction in educational

data mining so far

Page 43: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Big idea

• For each character string C that could be a motif example (e.g. all character strings of desired length)– Create a set of projections, random variations of C

that vary in one or more ways

Page 44: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Big idea

• For each pair of strings C1 and C2, see how many overlaps there are between their projection matrices

• Take the pair with the most matches and combine into a motif– Creating multi-example motif if 3+ get added together

• Repeat until goal number of motifs is found, or until new motif is below criterion goodness

Page 45: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Motif in Education

• Short, recurring pattern of behaviors in a sequence of behaviors occurring over time

• Written as letters in Shanabrook et al. (2010)

Page 46: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Detail for education• How do you segment student behavior?

• Could use student’s interaction on an entire problem, and compute letters across whole problem– Might make more sense in tutors with shorter problems

• Could use student’s interaction on an entire problem, and define letters differently for context within whole problem– Approach used by Shanabrook et al. (2010)

• Could use “sliding window” of N actions

Page 47: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Behaviors in Shanabrook et al.• “hints (a, b, c) – Hints is a measure of the number of hints viewed for

this problem. Although each problem has a maximum number of hints, the hint count does not have an upper bound because students can repeat hints and the count will increase at each repeated view. The three categories for hints are: (a) no hints, meaning that thestudent did not use the hint facility for that problem, (b) meaning the student used the hint facility, but was not given the solution, and (c) last hint solved, meaning that the student was given the solution to the problem by the last hint. As described above, this metric combines two values logged by the tutor: the count of hints seen, and an indicator that the final hint giving the answer was seen. The data could have been simply binned low, medium, high hints; however, this would have missed the significance of zero hints and using hints to reveal the problem solution.”

Page 48: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Behaviors in Shanabrook et al.• “secFirst (d, e, f) – The seconds to first attempt is an important

measure as it is during this time that the student is reading the problem and formulating their response. In previous research [6], five seconds was determined to be a threshold for this metric representing gaming: students who make a first attempt in less than five seconds are considered not working on-task. We divide secFirst into three bins: (d) less than 5 sec, (e) 5 to 30 sec, (f) greater than 30 sec. (d) represents students who are gaming the system, (e) represents a moderate time to the first attempt, (f) represents a long time to the first attempt. The cut at 30 seconds was chosen because it equalizes the distribution of bins (e and f), representing a division between a moderate and a long time to the first attempt.”

Page 49: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Behaviors in Shanabrook et al.• “secOther (g, h, i, j, k) – This variable represents actions related to answering the

problem after the first attempt was made. While the first attempt includes the problem reading and solution time, subsequent solution attempts could be much quicker and the student could still be making good effort. secOther is categorized in five bins: (g) skip, (h) solved on first, (i) 0 to 1.2 sec, (j) 1.2 to 2.9 sec, (k) greater than 2.9 sec. First, there are two categorical bins, skip and solve on first attempt. These are each determined from an indicator in the log data for that problem. Skipping a problem implies only that students never clicked on a correct answer; they could have worked on the problem and then given up, or immediately skipped to the next problem with only a quick look. Solved on first attempt indicates correctly solving the problem. If neither of the first two bins are indicated in the logs, then the secOther metric measures the mean time for all attempts after the first. The divisions of 1.2 sec and 2.9 sec for the latter three bins were obtained using the mean and one standard deviation above the mean for all tutor usage; (i) less than 1.2 seconds would indicate guessing, (j) would indicate normal attempts, and (k) would indicate a long time between attempts.”

Page 50: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Behaviors in Shanabrook et al.

• “numIncorrect – (o, p, q) - Each problem has four or five possible answer choices, that we divide into three groups: (o) zero incorrect attempts, indicates either solved on first attempt, skipped problem, or last hint solves problem (defined by the other metrics); (p) indicates choosing the correct answer in the second or third attempt, and (q) obtaining the answer by default in a four answer problem or possibly guessing when there is five answer problem.”

Page 51: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

What other constructscould be used?

• What other kinds of constructs could be used for the atoms of motif analyses in educational analyses?– At this grain-size (e.g. specific actions)

Page 52: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

What other constructscould be used?

• What other kinds of constructs could be used for the atoms of motif analyses in educational analyses?– At other grain-sizes?

Page 53: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Common Motifs• {adgo, adip, adiq}

• {aeho, afho}

• {ceho}

• {adgo, aeho}

• {aeiq aeho aeho aekp aeho aeiq aeho aeip aeho aeip}

Page 54: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Interpretations (Shanabrook et al., 2010)

• {adgo, adip, adiq} – gaming the system

• {aeho, afho} – “This student is using the tutor appropriately, but not being challenged.”

• {ceho} – problem is too difficult

• {adgo, aeho} – student is skipping problems

• {aeiq aeho aeho aekp aeho aeiq aeho aeip aeho aeip} – working on-task

Page 55: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Do you agree with interpretations?• {adgo, adip, adiq} – gaming the system

• {aeho, afho} – “This student is using the tutor appropriately, but not being challenged.”

• {ceho} – problem is too difficult

• {adgo, aeho} – student is skipping problems

• {aeiq aeho aeho aekp aeho aeiq aeho aeip aeho aeip} – working on-task

Page 56: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

How can researchers form good interpretations?

Page 57: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Questions? Comments?

Page 58: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

What else?

• What else could sequential pattern mining and motif extraction be used for in education?– Beyond Perera et al. and Shanabrook et al.

Page 59: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Asgn. 8

• Solutions

Page 60: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Asgn. 9

• Questions?• Comments?

Page 61: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

Next Class

• Monday, April 29

• Learnograms

• Guest lecturer: Arnon Hershkovitz– World-famous inventor of learnograms

• Readings: None

• Assignments Due: None

Page 62: Special Topics in Educational Data Mining HUDK5199 Spring, 2013 April 24, 2012

The End