Matching Lecture 11

Matching Lecture 11. Topics ID parade Frames Matching Examples Fuzzy Matching Metric Spaces Scales of measurement


Matching

Lecture 11


Topics

• ID parade Frames

• Matching Examples

• Fuzzy Matching

• Metric Spaces

• Scales of measurement


ID Parade Frames

• Classifying volunteers as clean

• Matching suspect to volunteers

• Reservation of parade facility, officers, volunteers

• Managing long-running process from decision to hold parade to payment of volunteers

• Accounting – payment to volunteers and billing of police authorities

• Historical record and analysis


Merging multiple frames

• Each frame produces its own model of the actors.

• E.g. Models of volunteer
  – For matching with suspect
  – For classification
  – For payment
  – For reservation

• For databases, this problem is called ‘view integration’


Miscellaneous Matching applications

• Many systems have a matching task at their core:
  – Shazam – sound sample matching
  – De-duping mailing lists
  – CD DB - CD recognition
  – COTS selection
  – IS development selection
  – fingerprint matching
  – patient/donor matching for transplant surgery
  – blood typing and matching
  – patients to clinical trials
  – interns to placements in hospitals
  – DNA samples
  – search request to locate relevant documents
  – incoming news items to information subscribers
  – number plate recognition in London’s Congestion Charging System
  – speech and writing recognition
  – patterns to material to minimise wastage


Shazam - 2580

• Shazam is a mobile phone application

• It can recognise 1.7 million tracks from a 30 sec sample
  – new tracks added at 5,000 a week

• The track details are texted back within about 30 secs

• It costs 50p + 9p call charge (surcharge only if successful)

• Your personal page shows the tracks you have tagged

• www.shazam.com


De-duping

C Wallace
West England University
Coldharbour Lane
Frenchay
Bristol
BS16 1QY

Ms C Wallace
Univ. of the West of England
Frenchay Campus
Coldharbour Lane
Bristol
BS16 1QY

One person or two?

A catalogue from O’Reilly

Mailing lists are reported to contain 25–40% duplicates.
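One way to approach the "one person or two?" question is to normalise each address and measure how many tokens the two share. The following is a minimal sketch, not a method from the lecture: the abbreviation table, the stop-word list, the Jaccard overlap measure and the 0.5 threshold are all assumptions made for illustration.

```python
import re

# Hypothetical normalisation tables, chosen only for this example.
ABBREVIATIONS = {"univ": "university"}
STOP_WORDS = {"of", "the", "ms", "mr", "mrs", "dr"}

def tokens(address):
    """Lower-case the address, expand abbreviations, drop titles and filler words."""
    words = re.findall(r"[a-z0-9]+", address.lower())
    expanded = (ABBREVIATIONS.get(w, w) for w in words)
    return {w for w in expanded if w not in STOP_WORDS}

def jaccard(a, b):
    """Share of tokens the two addresses have in common (0 = disjoint, 1 = identical)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

first = ("C Wallace, West England University, Coldharbour Lane, "
         "Frenchay, Bristol, BS16 1QY")
second = ("Ms C Wallace, Univ. of the West of England, Frenchay Campus, "
          "Coldharbour Lane, Bristol, BS16 1QY")

similarity = jaccard(first, second)
likely_duplicate = similarity >= 0.5   # arbitrary threshold, an assumption
```

Here the two addresses differ only in the token "campus", so the overlap is 11 of 12 tokens and the pair is flagged as a probable duplicate. A real de-duping tool would need a much larger abbreviation table and would usually leave the final decision to a person.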


CD DB

• Database of 2.5 million CDs, track details and supporting matter run by Gracenote (www.gracenote.com)

• Used by media players to obtain track info

• Player sends signature of CD [sequence of track lengths in 1/4 sec] to match against the database (via HTTP)

• Application searches DB for best match and returns track info to media player.

• Matching algorithm described in US Patent 6,061,680


Commercial Off-the-Shelf Software (COTS)

• Software exists for most business needs:
  – payroll
  – order processing
  – general ledger
  – human resources
  – e-commerce
  – e.g. SAP, SAGE ..

• but analysts need to match business needs to COTS capability, and customise generic software for local business rules.


Chatbots

• Chatbots like ALICE simulate a human response to typed input

• Most are for fun or annoyance

• Increasingly being used for customer service, helpdesks, marketing

• Based on matching patterns in text

• The patterns are in an XML application called AIML


Police ID parade

• Currently:
  – Suspect matched to Volunteers visually by officer

• Information System
  – Suspect and Volunteers modelled in database
  – System provides list of matching volunteers


Matching in general

• Matching tasks typically involve:
  – two sets of individuals, e.g.
    • the suspect / sampled track / DNA sample – The Requirement
    • the volunteers / 1.7 million stored tracks / DNA on file – The Resource
  – ‘adequate’ representations of both
  – a ‘fitness’ function which calculates how well matched a Resource is to the Requirement
  – a process to achieve the matching goal

• Matching processes:
  – Single or Batch?
    • Single: One Req to many Resources
    • Batch: Many Reqs to many Resources (e.g. cutting)
  – Automatic, Interactive, Assistive
    • Automatic: Matching fully automated
    • Interactive: User makes final selection, adjusts weights
    • Assistive: Computer produces analyses which aid human selection


Single Allocation

• Allocation to a single Requirement:
  – ‘long list’ the Resources - eliminate the obviously unsuitable
  – compute fitness between Requirement and each remaining Resource
  – rank the Resources in fitness order for a ‘short list’
  – ? user selection from short list on basis of additional information unknown to system

• Interactive
  – User adjusts:
    • description of Requirement (e.g. the search term in Google)
    • fitness function (e.g. the weights in the ID parade)
  – and retries
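The long-list, fitness, rank and short-list steps can be sketched in a few lines. This is an illustration under stated assumptions, not the lecture's implementation: the `Person` record, the gender filter used for long-listing, and the age-difference fitness are all hypothetical choices.

```python
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int
    gender: str

def long_list(requirement, resources):
    """Eliminate the obviously unsuitable (assumed here: a different gender)."""
    return [r for r in resources if r.gender == requirement.gender]

def fitness(requirement, resource):
    """Smaller is better: absolute age difference in years (an assumed measure)."""
    return abs(requirement.age - resource.age)

def short_list(requirement, resources, size=3):
    """Rank the long list by fitness and keep the best few for the user to choose from."""
    candidates = long_list(requirement, resources)
    ranked = sorted(candidates, key=lambda r: fitness(requirement, r))
    return ranked[:size]

suspect = Person("suspect", 30, "M")
volunteers = [
    Person("A", 52, "M"),
    Person("B", 31, "M"),
    Person("C", 29, "F"),
    Person("D", 27, "M"),
    Person("E", 35, "M"),
]

names = [v.name for v in short_list(suspect, volunteers)]
```

The final, user-driven step is deliberately left out: the system only returns the short list, and the officer picks from it using information the system does not hold.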


Simple Matching

• Resource and Requirement are of the same kind

• Fitness = least distance between objects

• String Matching
  – Levenshtein distance
  – Soundex and Metaphone

• Age difference


String Matching

• How close are two strings – words, DNA sequences?

• Levenshtein distance
  – is the number of single character edits required to change one to the other using the operations of:
    • inserting a letter
    • deleting a letter
    • replacing a letter

• E.g.
  – Distance(receipt,tecept) = 2
  – Distance(receipt,reciept) = 2

• Need a theory of why the strings are different
  – Better theory for typing would be to count transposition as 1 edit instead of 2
  – Better theory for texting would be to count a replace by a letter on the same key less than a letter on a different key.
  – mutations in DNA matching
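Both distances on this slide can be computed with standard dynamic programming. The sketch below gives plain Levenshtein distance and, as one possible "better theory for typing", the optimal string alignment variant, which counts swapping two adjacent letters as a single edit. The function names are this example's own.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and replaces turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca by cb (or keep it)
        prev = curr
    return prev[-1]

def osa_distance(a, b):
    """Levenshtein distance extended so that an adjacent transposition costs 1 edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # The last two letters of each prefix are the same pair, swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

With these definitions, `levenshtein("receipt", "tecept")` and `levenshtein("receipt", "reciept")` both give 2, matching the slide, while `osa_distance("receipt", "reciept")` gives 1 because "ei" and "ie" are a single transposition.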


Soundex and Metaphone

• Surnames in English have multiple spellings for similar sounds
  – Wallace and Wallis, Smith and Smythe
  – Errors caused by similar phonetics having different spelling
  – Useful where sound-text transliteration occurs in data capture
    • e.g. Smith and Smythe

• Soundex (Odell and Russell 1922) reduces every word to a letter and 3 digits – S530 for both

• Metaphone (Philips 1990) smarter about English phonetics – SM0 for both

• Double Metaphone – improved and two codes – one English, one ‘foreign’

• Comparison of algorithms
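The Soundex reduction to a letter and three digits can be sketched as follows. This follows the commonly described American Soundex rules (vowels separate repeated codes, h and w do not); it is an illustration rather than a reference implementation, and the names `CODES` and `soundex` are this example's own.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the first letter followed by three digits, e.g. 'Smith' -> 'S530'."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last = CODES.get(letters[0], "")    # code of the previous coded letter
    for ch in letters[1:]:
        if ch in "hw":
            continue                    # h and w do not break a run of equal codes
        code = CODES.get(ch, "")        # vowels and y have no code
        if code and code != last:
            result += code
        last = code                     # a vowel resets the run, so a repeat after it counts
    return (result + "000")[:4]         # pad with zeros, truncate to four characters
```

Under these rules Smith and Smythe both reduce to S530, as the slide states, and Wallace and Wallis both reduce to W420.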


Matching is subjective

• How close are two ages?

• Is the answer different for the identity parade and a dating agency?

[Figure: distance against age, starting from 0.0; labels: Suspect, Volunteer, Ideal Person, Date]


Multi-valued Matching

• How to combine multiple values to create a single distance?

• Age and Height are different to Build, Eye-colour, Gender and Ethnic origin.

• Distance in 2-D space:

[Figure: right-angled triangle with sides dx (along x) and dy (along y); distance = sqrt(dx^2 + dy^2)]


Metric space

• Formally, a metric space M is a set of points with an associated distance function (also called a metric) d : M × M -> R (where R is the set of real numbers).

• For all x, y, z in M, this function is required to satisfy the following conditions:
  – d(x, y) ≥ 0
  – d(x, x) = 0
  – if d(x, y) = 0 then x = y (identity of indiscernibles)
  – d(x, y) = d(y, x) (symmetry)
  – d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).
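The conditions can be tested by brute force on a finite sample of points. Passing such a test does not prove a function is a metric, but a single failure shows it is not. The sketch below is illustrative only: the sample points, the tolerance and the helper names are assumptions.

```python
import itertools
import math

def euclidean(p, q):
    """Straight-line distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def squared_euclidean(p, q):
    """Euclidean distance without the square root."""
    return euclidean(p, q) ** 2

def satisfies_metric_conditions(points, d, tol=1e-9):
    """Check the metric conditions for d over every pair and triple of sample points."""
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < 0:                      # non-negativity
            return False
        if (d(x, y) <= tol) != (x == y):     # zero distance exactly when x and y coincide
            return False
        if abs(d(x, y) - d(y, x)) > tol:     # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:   # triangle inequality
            return False
    return True

points = [(0, 0), (3, 4), (6, 8), (1, 1)]
euclidean_ok = satisfies_metric_conditions(points, euclidean)
squared_ok = satisfies_metric_conditions(points, squared_euclidean)
```

On this sample the Euclidean distance passes, while the squared version fails the triangle inequality: from (0, 0) to (6, 8) it gives 100, but going via (3, 4) gives 25 + 25 = 50. Squared distance still ranks candidates in the same order, so it can serve as a fitness function even though it is not a metric.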


Multi-attribute matching

• Extract shows a simple Excel spreadsheet containing a suspect's age, height and gender, and the same attributes for 10 volunteers

• Representation
  – Age is measured in years
  – Height in cm
  – Gender is M or F

• Fitness function
  – Calculate difference between suspect and volunteer attributes
  – Normalise differences to 0…1
  – Multiply by weights to express importance of each attribute
  – Sum of squared differences as Fitness function
  – Best fit volunteer has minimum value for Fitness
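The fitness function steps can be sketched outside Excel as well. Everything concrete below is an assumption made for illustration: the three volunteers, the ranges used to normalise age and height differences, the 0/1 treatment of gender and the weights are not taken from the spreadsheet.

```python
SUSPECT = {"age": 30, "height": 180, "gender": "M"}

# Hypothetical volunteers (the slide's spreadsheet has 10).
VOLUNTEERS = {
    "V1": {"age": 28, "height": 175, "gender": "M"},
    "V2": {"age": 45, "height": 181, "gender": "M"},
    "V3": {"age": 30, "height": 180, "gender": "F"},
}

RANGES = {"age": 50, "height": 50}     # assumed largest plausible differences
WEIGHTS = {"age": 1.0, "height": 1.0, "gender": 2.0}   # assumed importance

def normalised_difference(attr, a, b):
    """Difference scaled to 0..1; gender is treated as same (0) or different (1)."""
    if attr == "gender":
        return 0.0 if a == b else 1.0
    return min(abs(a - b) / RANGES[attr], 1.0)

def fitness(suspect, volunteer, weights=WEIGHTS):
    """Sum of squared weighted differences; the minimum value is the best fit."""
    return sum(
        (weights[attr] * normalised_difference(attr, suspect[attr], volunteer[attr])) ** 2
        for attr in weights
    )

scores = {name: fitness(SUSPECT, v) for name, v in VOLUNTEERS.items()}
best = min(scores, key=scores.get)
```

With these numbers V1 scores 0.0116, V2 scores 0.0904 and V3 scores 4.0, so V1 is the best fit. V3 matches the suspect exactly on age and height but is pushed to the bottom by the gender weight, which shows how strongly the choice of weights shapes the result.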


Scales of Measurement

• Nominal – names or categories
  – E.g. Eye-colour, Ethnic origin, Telephone number, ISBN
  – Valid operations: =, not =

• Partly Ordered Scales e.g. grandparent, parent, uncle, child, cousin
  – Pairs are ordered but no overall ordering

• Ordinal – ranks
  – E.g. 1, 2, 3 in Derby; 1st, 2.1, 2.2, 3rd class; slight, medium, heavy build
  – Valid operations: <, =, >
  – Invalid operations: +, - (gap between 1 and 2 is not the same as between 2 and 3)
  – Non-parametric statistics may apply

• Interval - arbitrary zero value
  – E.g. Temperature in degrees F, date in Julian Calendar
  – Valid Op: - (minus)
  – Invalid: +, * (but differences are Ratio)

• Ratio
  – E.g. Length, age
  – Valid Ops: +, *, /, standard statistical operations

• Multi-dimensional scales (index numbers)
  – E.g. Miles/gallon, IQ
  – Compound of several scales of measurement


Suspect/Volunteer attributes

• Nominal – names or codes

• Ordinal – ranks

• Interval - arbitrary zero value

• Ratio


Transforming and Scaling

• To combine different attributes, we need to transform Nominal, Ordinal and Interval values to Ratio scales

• This cannot be done objectively, so judgement is involved

• Scaling and weights need to be adjustable to fine-tune matching

• => Learning Frame (later)
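As an illustration of the judgement involved, an ordinal attribute such as build can be placed on a 0..1 scale, and a nominal attribute such as eye colour can be given a hand-written distance table. All the numbers below are hypothetical and would need tuning. In particular, the equal spacing of the build categories is exactly the kind of assumption an ordinal scale does not justify by itself.

```python
# Ordinal -> numeric: assumed equal spacing between ranks (a judgement call).
BUILD_SCALE = {"slight": 0.0, "medium": 0.5, "heavy": 1.0}

# Nominal -> numeric: hand-chosen distances for pairs judged to look similar.
# Any pair not listed is treated as completely different (distance 1.0).
EYE_DISTANCE = {
    frozenset(("blue", "grey")): 0.3,
    frozenset(("brown", "hazel")): 0.4,
}

def build_distance(a, b):
    """Distance between two builds on the assumed 0..1 scale."""
    return abs(BUILD_SCALE[a] - BUILD_SCALE[b])

def eye_distance(a, b):
    """Distance between two eye colours; symmetric because pairs are unordered sets."""
    if a == b:
        return 0.0
    return EYE_DISTANCE.get(frozenset((a, b)), 1.0)
```

Once every attribute yields a number between 0 and 1, the values can be weighted and combined with the age and height differences in a single fitness function.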


Sensitivity Analysis

• Arbitrary weights can be adjusted to see what effect their variation has on the final selection

• ? How much would each weight have to change before the first choice is demoted?

• Excel analysis
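The question of how far a weight must move before the first choice is demoted can be answered with a simple sweep. The sketch below is not the Excel analysis itself. It assumes two volunteers with made-up normalised differences, made-up base weights, and a coarse search that only tries increasing one weight at a time.

```python
# Hypothetical normalised differences (0..1) between the suspect and each volunteer.
DIFFS = {
    "V1": {"age": 0.1, "height": 0.6},
    "V2": {"age": 0.5, "height": 0.1},
}
BASE_WEIGHTS = {"age": 1.0, "height": 0.5}   # assumed starting weights

def fitness(diffs, weights):
    """Sum of squared weighted differences; smaller is better."""
    return sum((weights[a] * diffs[a]) ** 2 for a in weights)

def best_match(weights):
    """Name of the volunteer with the minimum fitness value."""
    return min(DIFFS, key=lambda name: fitness(DIFFS[name], weights))

def first_flip(attr, steps=200, max_factor=4.0):
    """Smallest multiplier on weights[attr], on a coarse grid above 1, that changes
    the first choice. Returns None if no multiplier up to max_factor does."""
    baseline = best_match(BASE_WEIGHTS)
    for k in range(1, steps + 1):
        factor = 1.0 + (max_factor - 1.0) * k / steps
        trial = {**BASE_WEIGHTS, attr: BASE_WEIGHTS[attr] * factor}
        if best_match(trial) != baseline:
            return factor
    return None

height_flip = first_flip("height")
age_flip = first_flip("age")
```

With these numbers V1 is the first choice at the base weights. Raising the height weight by a factor of about 1.66 demotes V1 in favour of V2, whereas raising the age weight never does within the range tried, because V1 is also the closer match on age. A fuller analysis would sweep downwards as well and use a finer grid.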