

Using Fuzzy String Matching To Automate Importing Data To SQL Server

Michael Justice – Master of Science in Data Science

University of Minnesota, Twin Cities

Background

Gilliland, Michael. Levenshtein Distance, in Three Flavors (http://people.cs.pitt.edu/~kirk/cs1501/Pruhs/Spring2006/assignments/editdistance/Levenshtein%20Distance.htm)

McKinney, Wes. Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, 51-56 (2010)

van der Walt, Stéfan, S. Chris Colbert, and Gaël Varoquaux. The NumPy Array: A Structure for Efficient Numerical Computation, Computing in Science & Engineering, 13, 22-30 (2011)

Shipman, John W. Tkinter 8.5 reference: a GUI for Python (http://infohost.nmt.edu/tcc/help/pubs/tkinter/tkinter.pdf)

Roskam, Albert-Jan. SavReaderWriter 3.4.2 (https://pypi.python.org/pypi/savReaderWriter/3.4.2)

Data

• Survey Data At Walden University
• Existing Database on Microsoft SQL Server (Student Records, Demographic Information, etc.)
• Two Major Annual Surveys:
    • Student Satisfaction
    • Alumni Satisfaction
• Survey Data Stored Individually (SPSS .SAV files):
    • High Dimensional
    • Sparse

Goals

1. Expand the existing database by adding survey data.
2. Automate the process of adding survey data to SQL Server from an SPSS .SAV file:
    A. Adding new questions
    B. Adding new labels
    C. Matching incoming questions with existing questions
3. Export data from SQL Server to Tableau or a .SAV file.

Challenges

• Python package SavReaderWriter truncates variable labels at 255 characters.
• R function read.spss takes up too much memory for large surveys.
• A survey can have duplicate questions.
• A university student can have multiple ID numbers.
• Variable names change from year to year.
• The text of some survey questions changes slightly from year to year, but the questions still need to be compared over time.
• The response value labels change from year to year.
• Variables need to be selected in an efficient way.
• The IT department withheld administrative privileges on my PC.

ER Diagram of Survey Tables

Exporting Data: Building An SPSS .SAV File

Key Strategies: Regular Expression & User Intervention

References

Key Strategies: Fuzzy String Matching & User Intervention

For variables with different labels to be treated as the same question, user intervention is required. The prompt shows the user:

• The new question text
• The best matching questions already in the database
• The score from the fuzzy string matching algorithm, which uses Levenshtein distance

The user then has the choice to:

A) Select the best match from the database, OR
B) Use the new question text to create a new question in the SurveyQuestion table.
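The scoring and ranking step can be sketched in plain Python. This is a minimal illustration, not the poster's code: the function names, the 0-100 normalization, and the example question texts are all assumptions made for the sketch, and the score is not claimed to equal what FuzzyWuzzy's fuzz.ratio returns.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Score from 0 to 100. Hypothetical normalization chosen for this
    sketch (edit distance relative to the longer string)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


def rank_matches(new_text, existing_texts, top_n=3):
    """Return the top_n (score, text) pairs, best score first."""
    scored = [(similarity(new_text, text), text) for text in existing_texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Made-up question texts, for illustration only.
existing = [
    "How satisfied are you with your academic advisor?",
    "How likely are you to recommend the university?",
    "What is your current employment status?",
]
incoming = "How satisfied were you with your academic advisor?"

for score, text in rank_matches(incoming, existing):
    print(score, text)
```

The ranked list is what would be put in front of the user, who then either accepts one of the candidates or creates a new question.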

Labels are matched using the same approach, but the algorithm struggled with short labels, which made user intervention even more important. Notice that if the score alone had been used, "Very likely" would have been incorrectly matched with "Very Unlikely".

*Python package FuzzyWuzzy's fuzz.ratio function was used to generate scores.
**Python package Tkinter was used to generate graphical prompts to the user.
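The short-label problem is easy to reproduce. The sketch below uses difflib.SequenceMatcher from the standard library as a stand-in scorer, so its numbers need not equal those from fuzz.ratio. The confirm callback is a hypothetical placeholder for the graphical prompt, the list of existing labels is invented for the example, and accepting an exact case-insensitive match without asking is an assumption of this sketch.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    """Similarity from 0 to 100 (stand-in scorer, case-insensitive)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_label(new_label, existing_labels, confirm):
    """Return the existing label to reuse, or None to create a new one.

    confirm(new_label, candidate, candidate_score) is a hypothetical
    callback standing in for the user prompt; it returns True to accept.
    """
    # Assumption: an exact (case-insensitive) match needs no confirmation.
    for label in existing_labels:
        if label.lower() == new_label.lower():
            return label
    if not existing_labels:
        return None
    best = max(existing_labels, key=lambda label: score(new_label, label))
    # Never trust the score alone: the user has the final say.
    return best if confirm(new_label, best, score(new_label, best)) else None


# Invented example: "Very likely" is not yet stored in the database.
existing_labels = ["Very unlikely", "Somewhat likely", "Not at all likely"]

# The highest-scoring candidate has the opposite meaning.
best = max(existing_labels, key=lambda label: score("Very likely", label))
print(best, round(score("Very likely", best)))

# A user who rejects the suggestion gets a new label instead.
print(match_label("Very likely", existing_labels, lambda *args: False))
```

With so few characters, a two-letter difference barely lowers the score, so a threshold on the score cannot separate "Very likely" from "Very unlikely".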

• Convert data from long to wide format for the data file
• Build label dictionaries for variable labels and value labels
• Resolve conflicts of changing labels for over-time data sets:
    • Insignificant Change: [figure not preserved]
    • Significant Change: [figure not preserved]

This is a result of the following changes happening to the survey from one year to the next for a question regarding employment status:

Value  2015                      2016
1      Full-time                 Full-time
2      Part-time                 Part-time
3      Self-employed             Self-employed
4      Retired                   Pursuing continuing education
5      Not currently employed    Not currently employed and not seeking employment
6      (none)                    Seeking employment but not currently employed
7      (none)                    Retired
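The export steps can be sketched with plain dictionaries. The value labels below are the 2015 and 2016 employment-status labels listed above. The function names, the (respondent, variable, value) row layout, and the sample rows are assumptions made for illustration, not the poster's schema.

```python
def long_to_wide(rows):
    """Pivot (respondent_id, variable, value) rows into one dict per respondent."""
    wide = {}
    for respondent_id, variable, value in rows:
        wide.setdefault(respondent_id, {})[variable] = value
    return wide


def label_conflicts(old, new):
    """Codes present in both years whose label text differs."""
    return {code: (old[code], new[code])
            for code in sorted(old.keys() & new.keys())
            if old[code] != new[code]}


def moved_labels(old, new):
    """Labels present in both years under a different code."""
    new_codes = {label: code for code, label in new.items()}
    return {label: (code, new_codes[label])
            for code, label in old.items()
            if label in new_codes and new_codes[label] != code}


labels_2015 = {1: "Full-time", 2: "Part-time", 3: "Self-employed",
               4: "Retired", 5: "Not currently employed"}
labels_2016 = {1: "Full-time", 2: "Part-time", 3: "Self-employed",
               4: "Pursuing continuing education",
               5: "Not currently employed and not seeking employment",
               6: "Seeking employment but not currently employed",
               7: "Retired"}

# Hypothetical long-format rows: (respondent_id, variable, value).
rows = [(101, "Q1", 4), (101, "Q2", 2), (102, "Q1", 7)]

print(long_to_wide(rows))
print(label_conflicts(labels_2015, labels_2016))
print(moved_labels(labels_2015, labels_2016))
```

Code 4 is the troublesome case: it means "Retired" in 2015 and "Pursuing continuing education" in 2016, so a combined file cannot carry one value label for it without recoding or a decision from the user.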

Regular expression selecting variables with a name starting with Q or q followed immediately by a digit:

Prompt to user to select any of the remaining variables:
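A pattern matching that description might look like the following. The exact expression used on the poster is not preserved, so this is one plausible form, and the variable names are invented for the example.

```python
import re

# A name that starts with Q or q, followed immediately by a digit.
SURVEY_VARIABLE = re.compile(r"^[Qq]\d")


def split_variables(names):
    """Return (selected, remaining): names matching the pattern, and the
    rest, which would be offered to the user for manual selection."""
    selected = [name for name in names if SURVEY_VARIABLE.match(name)]
    remaining = [name for name in names if not SURVEY_VARIABLE.match(name)]
    return selected, remaining


# Invented variable names, for illustration only.
names = ["Q1", "q2_1", "Q10a", "Quarter", "StudentID", "qtr", "ResponseDate"]
selected, remaining = split_variables(names)
print(selected)
print(remaining)
```

Names such as "Quarter" and "qtr" begin with the right letter but are not followed by a digit, so they fall through to the manual selection step.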

*Student table contains many more variables (not shown) with other information about the student.