An Introduction to Text Analytics in IBM SPSS Modeler
Skylar Ritchie
Shawn Bergman


Objectives
To give a broad overview of text analytics
  Defining key terms
  Describing important steps in the process
To provide a step-by-step tutorial for how to use IBM SPSS Modeler to...
  Read in source text
  Extract concepts, sentiment, and text link patterns from records
  Categorize records
  Visualize the results

Having read the 225-page User's Guide cover to cover and watched countless videos on Modeler, I can personally attest that the two most difficult aspects of learning the software are
  Distinguishing between terms that look similar, but signify very different ideas
  Coming up with an organizational framework for understanding the many things you can do in Modeler

The first half of the presentation is dedicated to the first difficulty, and the second half of the presentation, to the second

The overriding goal of this presentation is for you to feel as though you can explore the software for yourselves

In putting it together, I tried to focus only on the essentials, and even though I only scratched the surface of what the software can do, we will have to hustle to make it through everything

However, we will post this presentation with all of its examples and videos on the Office of Research Consultation website so that you can use it as a resource and refer back to it when you need it

In the interest of time, I am going to cover the first half of the presentation relatively quickly, but if I am moving too quickly, please do not hesitate to ask questions and slow me down; just understand that we may not get to everything and that you may have to watch some of the videos at the end for yourself

Overview of Text Analytics
Objective #1

Text Analytics

The process of deriving high quality information from text --Marisa Peacock, Social Media Strategist

A technology and process both, a mechanism for knowledge discovery applied to documents, a means of finding value in text. Solutions analyze linguistic structure...discern entities...as well as relationships, concepts, and even sentiments. They...automate classification...of source documents. They exploit visualization for exploratory analysis. --Seth Grimes, Analytics Strategy Consultant

Extraction: to discern entities, relationships, concepts, and sentiments
Categorization: to automate classification
Visualization

Let's start with a definition of text analytics

One thing both of these definitions have in common is that they both describe text analytics as a process

Furthermore, both definitions describe the outcome of this process in similar terms: the outcome is high quality information, knowledge, and value

The second definition, however, is somewhat more descriptive than the first, since it enumerates the principal steps in this process

Those steps are to
  Discern entities, concepts, sentiments, and the relationships between them, something IBM calls extraction
  Automate classification, something IBM calls categorization
  Visualize the results

In my presentation today, I will first describe these steps in greater detail and then show you how to perform them for yourself

What does text analytics look like?
Handout provided

So what does this process look like?

On the macro-level, the process involves four primary steps:
  Reading in source text
  Extracting linguistic entities, relationships, and sentiment
  Categorizing records
  Visualizing the results

On more of a micro-level, the primary steps of extracting and categorizing can be broken down further:
  Extraction involves passing the source text through a variety of dictionaries (to be described in greater detail) in order to identify concepts, types, and Text Link Analysis patterns
  Categorization involves taking these extraction results and applying a number of grouping techniques in order to create categories and descriptors that classify records

These diagrams depict text analytics as a linear process; however, as the User's Guide repeatedly emphasizes, text analytics is an iterative process, so a more accurate depiction might include a feedback loop

Key Terms

Source text file

Field
Document/record

Let's take a look at the first step in the text analytics process: sourcing

Source text can take the form of either a computer file (such as an Excel file) or a Web feed (such as an RSS feed with various web links)

Since the focus of today's presentation is to demonstrate how to perform extraction, categorization, and visualization, I will use an Excel file as source text

Using a Web feed as a source is a little less straightforward, but if you are interested in that as well, I can make that the topic of a future presentation

Within an Excel file, you have worksheets, whose columns are known as fields and whose cells are referred to as either documents or records, two terms that IBM uses interchangeably

For the sake of simplicity, I will refer to them in the future as records

Key Terms
Types: higher-level concepts

Concepts: lead terms under which similar terms are grouped together

Terms: single words (uni-terms) or word phrases (multi-terms) that are interesting or relevant
Handout provided

Let's turn now to the second main step in text analytics: extraction

Here for the first time we encounter a number of terms that look similar, but signify very different ideas

In fact, these ideas are arranged hierarchically, with terms at the bottom and types at the highest level of abstraction

Terms and concepts are always written as lowercase words or word phrases, and types are always enclosed in brackets

The general types that come with the Core Library (more on that later) include , , , and

But types in other, more specific libraries can themselves be more specific:
  Types in the Opinions Library include , , , and among others
  Types in the Employee Satisfaction Library include , , , and among others

Substitution Dictionary: Terms to Concepts
An editable collection of synonymous terms grouped under a target term, or concept

Target term: university
  Synonyms: university, college, school, academy, institute, polytechnic, alma mater, graduate school
Target term: student
  Synonyms: student, scholar, undergraduate, graduate, grad student, postdoctoral fellow, freshman, sophomore, junior, senior
Target term: professor
  Synonyms: professor, prof, tenured faculty member, dean, assistant professor, associate professor, lecturer, academic

As I mentioned earlier, there are several linguistic dictionaries that are instrumental in the extraction process

The first of these is known as a substitution dictionary, and it is responsible for grouping terms under what are called target terms or concepts

The computer scans all of the records, and whenever it finds synonymous terms, it essentially rewrites them as the target term

It is important to note that this dictionary, like all the others, is editable
  So if, for example, you want to distinguish between universities and institutes, you can separate the two terms in your substitution dictionary
  And if, on the other hand, you want to use two terms synonymously, you can combine them in this dictionary
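The substitution step can be sketched in a few lines of Python. This is only an illustration of the idea, not how Modeler implements it: the dictionary contents, `SUBSTITUTIONS`, `build_lookup`, and `to_concepts` are all hypothetical names chosen for the example.

```python
# Hypothetical substitution dictionary: target term (concept) -> synonymous terms.
SUBSTITUTIONS = {
    "university": ["college", "school", "academy", "institute"],
    "professor": ["prof", "lecturer", "assistant professor"],
}


def build_lookup(substitutions):
    """Invert the dictionary so that each synonym points at its target term."""
    lookup = {}
    for target, synonyms in substitutions.items():
        lookup[target] = target
        for synonym in synonyms:
            lookup[synonym] = target
    return lookup


def to_concepts(terms, substitutions):
    """Rewrite each term as its target term; terms with no entry pass through."""
    lookup = build_lookup(substitutions)
    return [lookup.get(term.lower(), term.lower()) for term in terms]


print(to_concepts(["College", "prof", "student"], SUBSTITUTIONS))

# Editing the dictionary: drop "institute" so it is no longer folded into "university".
edited = dict(SUBSTITUTIONS, university=["college", "school", "academy"])
print(to_concepts(["institute"], edited))
```

Removing a synonym from the dictionary is what keeps two terms distinct, and adding one is what merges them, which mirrors the two editing cases described above.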

Type Dictionary: Concepts to Types
An editable collection of concepts grouped under a label known as the type name

Concept / Type
  5 star
  a lot better
  beyond my expectations
  abhor
  bizarre
  can't stand
  all about the same
  been with it for too little time
  can't think of any

The second linguistic dictionary is known as a type dictionary, and as its name implies, it is responsible for grouping concepts under their respective types

Here the computer assigns a higher-level descriptive label to the concepts themselves, and although it is generally pretty good at assigning types when given some kind of context, if it is not given context, it will often assign the type


Exclude Dictionary
An editable collection of terms and types that will be removed from the final extraction results

Exclude List
  any kind of problem
  can't say enough
  can't wait
  i was out of
  if it ain't broke, don't fix it
  prefer not to
  to work with
  went down to

The third and final linguistic dictionary is known as the exclude dictionary, and as its name suggests, whatever it contains is excluded from the final extraction

As you peruse this dictionary, you might find a term or phrase that you do want to extract, and by deselecting it in this dictionary, you can ensure that it shows up in the extraction results

There is also a way to assign unwanted terms and phrases to the exclude dictionary

Text Link Analysis (TLA)
A pattern-matching technology that is used to extract relationships found between
  Either concepts
  Or types

Handout provided

Text Link Analysis (or TLA) is where text analytics really demonstrates its value

TLA patterns are the fourth and final kind of extraction results

Whereas the other extraction results (terms, concepts, and types) represent a single linguistic unit, TLA patterns represent the relationships between these units and can express the meaning of an entire sentence with a subject, verb, and predicate

As the examples at right indicate
  Patterns can contain 2 or more concepts or types
  Order is important (indicated by the + operator), but sentiments always come last

Key Terms
Categorization: the process of assigning records to a category when the text within them matches a descriptor

Category: higher-level ideas that capture the central message of the text

Descriptor: concepts, types, patterns, and category rules that have been used to define a category

Finally, let's turn to the third main step in text analytics: categorization

Whereas extraction involves bundling the terms, concepts, and types within records, categorization bundles the records themselves on the basis of what they contain

Descriptors determine whether or not a record is assigned to a given category, and descriptors can take the form of either concepts, types, TLA patterns, or category rules

Category Rules
Statements that classify records into a category based on a logical expression using extracted concepts, types, and patterns as well as Boolean operators

Operator: +      Meaning: And (order important)        Example: university + excellent
Operator: &      Meaning: And (order not important)    Example: excellent & university
Operator: |      Meaning: Or                           Example: student | university
Operator: !()    Meaning: Not                          Example: !(student)

Matching Sentence
This is a 5 star university

Handout provided

Since we have already covered concepts, types, and TLA patterns, let's move on and cover category rules

In one way, category rules are like TLA patterns: they often join concepts or categories to describe a record and determine whether or not it belongs in a category

In another way, however, category rules are unlike TLA patterns
  In the first place, they can use operators such as the ampersand or the vertical bar, in which case order is not important
    (excellent & university) would capture the exact same records as (university & excellent)
  In the second place, category rules can indicate the absence of something, whereas TLA patterns only focus on the presence of things
    !(student) would capture all of the records that do not contain student, and this might be a considerable number
    Usually, you would want to use the not operator in conjunction with another operator such as student & !(professor)
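To make the operators concrete, here is a minimal Python sketch that treats each record as an ordered list of extracted concepts. The helper names (`has`, `and_`, `or_`, `not_`, `in_order`) are hypothetical and only mimic the behavior described above; they are not Modeler's rule syntax or engine.

```python
def has(concept):
    """Rule that matches records containing the concept."""
    return lambda record: concept in record


def and_(*rules):
    """The & operator: every rule must match, in any order."""
    return lambda record: all(rule(record) for rule in rules)


def or_(*rules):
    """The | operator: at least one rule must match."""
    return lambda record: any(rule(record) for rule in rules)


def not_(rule):
    """The !() operator: matches when the inner rule does not."""
    return lambda record: not rule(record)


def in_order(first, second):
    """The + operator: both concepts present, with first appearing before second."""
    def rule(record):
        return first in record and second in record[record.index(first) + 1:]
    return rule


records = [
    ["university", "excellent"],
    ["excellent", "university"],
    ["student", "professor"],
    ["student"],
]

unordered = and_(has("excellent"), has("university"))
ordered = in_order("university", "excellent")
student_only = and_(has("student"), not_(has("professor")))

print([unordered(r) for r in records])
print([ordered(r) for r in records])
print([student_only(r) for r in records])
```

The unordered rule matches the first two records, the ordered rule matches only the first, and pairing the not operator with another condition keeps it from sweeping in every record that merely lacks a concept.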

Wildcard Operator
The Boolean operator * that acts as a variable and stands in for a missing word or word fragment

Usage: Space after word        Example: graduate *    Matching phrases: graduate school, graduate student
Usage: Space before word       Example: * graduate    Matching phrases: university graduate
Usage: No space after word     Example: graduate*     Matching phrases: graduates, graduated
Usage: No space before word    Example: *graduate     Matching phrases: undergraduate

The fifth and final Boolean operator is known as the wildcard, and you can think of it as a variable that represents a missing prefix, suffix, or word that precedes or comes after a given word

If there is a space either before or after the wildcard, the wildcard represents a missing word

If, on the other hand, there is no space, then the wildcard only represents a part of a word

Wildcards can be useful for generalizing category descriptors, but in some instances, they can overgeneralize

For example, graduated can be either an adjective or a verb, and if it is an adjective, it can refer to an alumnus or to a cylinder, and depending on the context, you may want to capture one concept but not the other with your descriptor
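The spacing rule can be illustrated by translating a wildcard descriptor into a regular expression. This is a sketch under assumptions: `wildcard_to_regex` and `matches` are hypothetical helpers, and letting an attached wildcard match zero characters is a choice made for this example rather than documented Modeler behavior.

```python
import re


def wildcard_to_regex(descriptor):
    """Translate a wildcard descriptor into a compiled regular expression.

    A standalone * stands in for one whole word; a * attached to a word
    stands in for a word fragment (assumed here to allow zero characters).
    """
    parts = []
    for token in descriptor.split():
        if token == "*":
            parts.append(r"\w+")
        else:
            parts.append(re.escape(token).replace(r"\*", r"\w*"))
    return re.compile(r"\s+".join(parts), re.IGNORECASE)


def matches(descriptor, phrase):
    """True when the whole phrase fits the descriptor."""
    return wildcard_to_regex(descriptor).fullmatch(phrase) is not None


for descriptor, phrase in [
    ("graduate *", "graduate school"),
    ("* graduate", "university graduate"),
    ("graduate*", "graduated"),
    ("*graduate", "undergraduate"),
    ("graduate *", "graduates"),
]:
    print(descriptor, "|", phrase, "|", matches(descriptor, phrase))
```

Note that graduate* accepts graduated in both the alumnus sense and the cylinder sense, which is the overgeneralization risk mentioned above.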

Grouping Techniques
The mechanisms underlying the categorization process
Handout provided

Having covered category rules, the fourth kind of category descriptor, let's turn to the grouping techniques that generate both the categories and their descriptors

There are four of these: concept inclusion, concept root derivation, semantic networks, and co-occurrence

Concept Inclusion
What?
  Grouping based on subsets and supersets

How?
  Breaking concepts into components
  De-inflecting components

When?
  Text that is somewhat technical

Concept inclusion is a grouping technique that involves breaking concepts into their component sets, de-inflecting these components, and then identifying areas of overlap

For example, let's say you had the multi-term concepts graduate faculty, faculty committees, and tenured faculty members

These concepts would first be broken down into their component sets and then these sets would be de-inflected (e.g., converting nouns from plural to singular)

In the process at right, I have illustrated the de-inflection process by underlining the parts of the word that are removed in a subsequent step

In these component sets, the order of the words is not important; the only thing that is important for the concept inclusion technique is whether or not these component sets have areas of overlap

Concept inclusion is a technique that is relatively robust and works well on text that contains technical jargon
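The component-set idea can be sketched in Python as follows. The de-inflection here is deliberately naive (it only strips a plural s), and `deinflect`, `components`, `overlap`, and `includes` are hypothetical helpers for illustration; the rules Modeler actually applies are not shown in this presentation.

```python
def deinflect(word):
    """Very naive de-inflection: strip a trailing plural 's' (an assumption)."""
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def components(concept):
    """Break a concept into an unordered set of de-inflected components."""
    return frozenset(deinflect(word) for word in concept.split())


def overlap(a, b):
    """Components that two concepts share."""
    return components(a) & components(b)


def includes(superset_concept, subset_concept):
    """True when every component of the second concept appears in the first."""
    return components(subset_concept) <= components(superset_concept)


concepts = ["graduate faculty", "faculty committees", "tenured faculty members"]
for concept in concepts:
    print(concept, "|", sorted(components(concept)))

print(sorted(overlap("graduate faculty", "faculty committees")))
# "faculty member" is a hypothetical extra concept used to show a subset relation.
print(includes("tenured faculty members", "faculty member"))
```

Because the components are stored as sets, word order plays no part in the comparison, which matches the point made above.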


Concept Root Derivation
What?
  Grouping based on morphological relationships

How?
  Breaking concepts into components
  De-inflecting components
  Removing suffixes to find root

When?
  Any text, but few categories

Concept root derivation employs a very similar process, but goes one step further, stripping words down to their morphological or structural roots so that areas of overlap can be identified

As you can see at right, psychology, psychological, and psychologist all have the same root (psycholog-), and the concepts can be grouped into categories on the basis of this similarity
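A rough Python sketch of root derivation, assuming a tiny hand-picked suffix list: `SUFFIXES`, `root`, and `group_by_root` are hypothetical, and a real morphological analyzer would use far richer rules than this.

```python
# Hypothetical suffix list, longest first so "ical" is tried before "y".
SUFFIXES = sorted(["ical", "ist", "y"], key=len, reverse=True)


def root(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_by_root(concepts):
    """Group single-word concepts that share a root."""
    groups = {}
    for concept in concepts:
        groups.setdefault(root(concept), []).append(concept)
    return groups


print(group_by_root(["psychology", "psychological", "psychologist", "sociology"]))
```

All three psychology terms collapse to the same root and land in one group, while sociology is kept apart.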

Semantic Network
What?
  Grouping based on semantic relationships

How?
  Synonyms: "are" relationship
  Hyponyms: "is a" relationship

When?
  Text that is not highly technical

Unlike concept root derivation, which categorizes concepts on the basis of morphological relationships, the semantic network technique looks for and categorizes concepts on the basis of semantic relationships, relationships having to do with word meanings

These semantic relationships generally take the form of either synonyms or hyponyms, where the former denotes an "are" relationship, and the latter, an "is a" relationship
  Professors and teachers, for example, might be considered synonyms, since they both are educators
  Psychology and social science, on the other hand, are hyponyms, since psychology is a social science

Co-occurrence

Example: Students flock to ASU
  Concepts: students = W, ASU = X
Example: ASU focuses on sustainability
  Concepts: ASU = X, sustainability = Y
Example: Sustainability is the way of the future
  Concepts: sustainability = Y, way of the future = Z

The fourth and final grouping technique is that of co-occurrence

Cxy represents the number of records in which two concepts co-occur; Cx, the number in which the first concept occurs; Cy, the number in which the second occurs

Generally, concepts must co-occur two or more times in order for them to be categorized together; however, this setting can be adjusted either higher or lower
  If your setting is high, you will generate fewer categories, but these categories will contain concepts that are more similar to each other
  If your setting is low, you will generate more categories, but they will be more heterogeneous

Co-occurrence is a relatively straightforward technique, but if you are interested in how it computes a similarity coefficient for two concepts, several sample calculations are illustrated at right
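Since the sample calculations from the slide are not reproduced here, the following Python sketch shows one way such a coefficient could be computed from Cxy, Cx, and Cy. The formula used, Cxy squared divided by Cx times Cy, is an assumption for this sketch and should be checked against the User's Guide; the function names and the three records (taken from the example sentences on the slide) are for illustration only.

```python
from collections import Counter
from itertools import combinations

# One set of concepts per record, taken from the three example sentences.
RECORDS = [
    {"students", "ASU"},
    {"ASU", "sustainability"},
    {"sustainability", "way of the future"},
]


def cooccurrence_counts(records):
    """Count records per concept (Cx) and per unordered concept pair (Cxy)."""
    single = Counter()
    pair = Counter()
    for concepts in records:
        single.update(concepts)
        pair.update(frozenset(p) for p in combinations(sorted(concepts), 2))
    return single, pair


def similarity(x, y, single, pair):
    """Assumed coefficient: Cxy squared over (Cx times Cy); 0.0 if either is absent."""
    if single[x] == 0 or single[y] == 0:
        return 0.0
    cxy = pair[frozenset((x, y))]
    return cxy ** 2 / (single[x] * single[y])


single, pair = cooccurrence_counts(RECORDS)
print(similarity("students", "ASU", single, pair))
print(similarity("ASU", "sustainability", single, pair))
print(similarity("students", "sustainability", single, pair))
```

Under this assumed formula, students and ASU score higher than ASU and sustainability because students never appears without ASU, and concepts that never share a record score zero.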

Extraction v. Categorization

Ends
  Extraction: To discover what records contain
  Categorization: To classify records based on what they contain
Means
  Extraction: Substitution dictionary, type dictionary, exclude dictionary
  Categorization: Concept root derivation, concept inclusion, semantic network, co-occurrence
Output
  Extraction: Concepts, types, TLA patterns
  Categorization: Categories and descriptors (concepts, types, TLA patterns, category rules)

To sum up what we have said so far, extraction differs from categorization both in terms of its purpose or end and in terms of its means to that end

The purpose of extraction is to discover what records contain, whereas the purpose of categorization is to classify records on the basis of what they contain

The means used are also different
  Extraction takes place by comparing records against a number of dictionaries
  Categorization, on the other hand, involves applying a variety of algorithms to the extraction results to create categories

In this way, concepts, types, and TLA patterns are both output and input: output for the extraction process and input for the categorization process

They are what gets pulled out of records and what the software then turns around and uses to classify those records

Modeler Tutorial
Objective #2

Now that we have parsed out what the terminology means, let's take a look at the software itself and see how to perform the various tasks associated with sourcing, extracting, categorizing, and visualizing

As I mentioned earlier, one difficulty in learning Modeler is distinguishing between terms that look similar; however, a second difficulty concerns organizing the many different tasks you can perform in Modeler

To surmount this second difficulty, I have provided a number of charts so that you can keep track of what we have done and what we are doing

If you have the data set, you may find it helpful to follow along on your computer


Creating a New Stream
1. Open IBM SPSS Modeler 17.1
2. Select
3. Click Ok
4. To create another stream, click

A stream is just your workspace, and it lays out in a visual fashion
  What data you are using
  What processes you are running it through


The data set that you gave us to analyze is a focus group conversation about the strategic direction of the College of Business

Because you are probably less interested in moderator comments than you are in those of participants, you may want to filter out the moderator's remarks in Excel before you start the analysis process

Sourcing an Excel File

1. Click the tab
2. Double click the node or click and drag it into the stream
3. Double click the node within the stream or right click and click Edit
4. Click on the tab
5. Select the
6. Select theA
7. Select
8. Click Ok

Handout provided

Less information in substitution, type, and exclude dictionariesNo categories

More information in substitution, type, and exclude dictionariesNo categories

More information in substitution, type, and exclude dictionariesPre-built categories

Templates initiate the extraction phase and pull out concepts and types

There are many different kinds of templates, some of which contain more in their substitution, type, and exclude dictionaries than others

There are also what are called text analysis packages (or TAPs) that come not only with a wealth of information in their dictionaries but also with a number of pre-built categories that you may be interested in when you are conducting your analysis

For example, there is a TAP for employee satisfaction surveys, and the categories that it comes with include positive and negative sentiment toward coworkers, managers, communication, job security, benefits, etc.

If you are not interested in all of the pre-built categories, you can delete or modify them to suit your preferences

Starting an Interactive Workbench Session with the Basic Resources Template

1. Click the tab
2. Double click the node or click and drag it into the stream
3. Double click the node within the stream or right click and click Edit
4. Click on the tab
5. Select the
6. Click on the tab
7. Select
8. Click

Interactive Workbench Categories & Concepts View

Categories Pane
Extraction Results Pane
Data Pane

Interactive Workbench Resource Editor View

Type Dictionary
Substitution Dictionary
Exclude Dictionary

Starting an Interactive Workbench Session with the Opinions Template

1. Double click the node within the stream or right click and click Edit
2. Click on the tab
3. Click
4. Select
5. Click Ok
6. Click

Interactive Workbench Categories & Concepts View
Concept View

Interactive Workbench Categories & Concepts View
Type View

Interactive Workbench Resource Editor View
Type Dictionary
Substitution Dictionary
Exclude Dictionary

Starting an Interactive Workbench Session with the Opinions Text Analysis Package

1. Double click the node within the stream or right click and click Edit
2. Click on the tab
3. Select
4. Click
5. Select
6. Click
7. Click

Interactive Workbench Categories & Concepts View
Categories Pane
Extraction Results Pane
Data Pane

Interactive Workbench Resource Editor View
Type Dictionary
Substitution Dictionary
Exclude Dictionary

Templates v. Text Analysis Packages

Basic Resources Template
  Libraries: Local, Core, Variations, Nonlinguistic Entities
  Pre-built categories: No
Opinions Template
  Libraries: Local, Core, Variations, Nonlinguistic Entities, Opinions, Budget, Slang, Emoticon
  Pre-built categories: No
Opinions Text Analysis Package
  Libraries: Local, Core, Variations, Nonlinguistic Entities, Opinions, Budget, Slang, Emoticon
  Pre-built categories: Yes

Handout provided

Interactive Workbench Categories & Concepts View

Editing the Substitution Dictionary

1. Right click on the concept
2. Select Add to Synonym
3. Click New
4. Create the target term to which you want to assign the synonym
5. Click Ok
6. Click

Interactive Workbench
Categories & Concepts View
Resource Editor View

Interactive Workbench Categories & Concepts View

Editing the Type Dictionary

1. Right click on the concept
2. Select Add to Type
3. Click More
4. Select the type to which you want to assign the concept
5. Click Ok
6. Click Ok again
7. Click

Interactive Workbench
Categories & Concepts View
Resource Editor View

Interactive Workbench Categories & Concepts View

Editing the Exclude Dictionary

1. Right click on the concept
2. Click Exclude from Extraction
3. Click

Interactive Workbench
Categories & Concepts View
Resource Editor View

Extracting TLA Patterns

1. In the Text Link Analysis View, click
2. Select a type pattern to see the concept patterns that correspond to it
3. Click to see the concepts and type webs corresponding to these patterns

Interactive Workbench Text Link Analysis View


Automatically Building Categories

1. In the Categories & Concepts View, click
2. Click Edit:
3. Select
4. Click
5. Click
6. Select
7. Select
8. Select
9. Select
10. Select
11. Click Ok
12. Click

Interactive Workbench Categories & Concepts View

Category
Subcategory
Descriptor
Visualization Pane: Category Bar

Interactive Workbench Categories & Concepts View
Category Web

Interactive Workbench Categories & Concepts View
Category Web Table

Interactive Workbench Categories & Concepts View

Manually Categorizing Concepts

1. Select the concept you want to categorize
2. Click
3. Select the category to which you want to assign the concept:
4. Click Ok

Interactive Workbench Categories & Concepts View


Interactive Workbench Categories & Concepts View

Manually Categorizing Types

1. Select the type you want to categorize
2. Click
3. Select the category to which you want to assign the type or create a new category:
4. Click Ok

Interactive Workbench Categories & Concepts View


Interactive Workbench Text Link Analysis View
Type Patterns
Concept Patterns

Manually Categorizing TLA Patterns

1. Select the TLA pattern you want to categorize
2. Click
3. Select the category to which you want to assign the pattern or create a new category:
4. Click Ok

Interactive Workbench Categories & Concepts View


Manually Creating Category Rules

1. Right click on the category for which you want to create a rule
2. Click Create Category Rule
3. Create your rule by
   Dragging concepts or types into the Rule Editor
   Combining them with Boolean operators
4. Click to see how many records match
5. Click

Interactive Workbench Categories & Concepts View

Handout provided

Now that we have explored the extraction and categorization results with the Opinions Template, let's move to the Opinions Text Analysis Package

As you'll remember from the first part of the tutorial, the difference between a template and a text analysis package is that the former does not come with pre-built categories, whereas the latter does

Because the focus group conversation is not in the proper format, with a question as the field header and each record as one person's response to that question, we will switch to a slightly different data set that is in the proper format so that we can demonstrate the remaining capabilities

This data set is a questionnaire about a company's safety program, and the field that we will be looking at has to do with what employees want the company to stop doing with regard to safety

Because this is an employee opinion questionnaire, we can use the employee opinion text analysis package

Interactive Workbench Categories & Concepts View

Manually Adjusting Categories

1. Right click on the category or categories that you want to adjust
2. Select either Move to Category or Merge Categories or Edit > Delete

Interactive Workbench Categories & Concepts View


Interactive Workbench Categories & Concepts View

Generating Model

1. Once you are satisfied with the categories you have created, click
2. Drag the newly created modeling node into your stream
3. Right click on your source node
4. Click Connect
5. Click on your modeling node to connect the two nodes

Stream

Converting Model Categories to Fields

1. Right click on your modeling node
2. Click Edit
3. Click on the tab
4. Select
5. Change the
6. Click Ok

Deriving a Total Negativity Score

1. Click on the tab
2. Double click the node or click and drag it into the stream
3. Double click the node within the stream or right click and click Edit
4. Give a descriptive name to your
5. Click to create a formula
6. In Expression Builder, click on a category that you want to be in your formula
7. Click to add it
8. Click on an operator such as
9. Add another category
10. When you are finished, click Ok
11. Repeat the process to create additional formulas
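Outside of Modeler, the same derivation can be sketched in Python. This assumes that the formula simply adds up the negative category fields, since the operator shown on the slide did not survive; the field names, the sample records, and `total_negativity` are all hypothetical.

```python
# Hypothetical category fields: True means the record was assigned to that category.
NEGATIVE_FIELDS = ["Negative: Training", "Negative: Equipment"]

records = [
    {"id": 1, "Negative: Training": True, "Negative: Equipment": True, "Positive: Training": False},
    {"id": 2, "Negative: Training": False, "Negative: Equipment": True, "Positive: Training": False},
    {"id": 3, "Negative: Training": False, "Negative: Equipment": False, "Positive: Training": True},
]


def total_negativity(record, negative_fields=NEGATIVE_FIELDS):
    """Count how many negative categories the record was assigned to."""
    return sum(1 for field in negative_fields if record.get(field, False))


scores = {record["id"]: total_negativity(record) for record in records}
print(scores)
```

Each additional formula (for example, a total positivity score) would follow the same pattern with a different list of fields.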


Deriving an Overall Sentiment Score

1. Click on the tab
2. Double click the node or click and drag it into the stream
3. Double click the node within the stream or right click and click Edit
4. Give a descriptive name to your
5. Select
6. Define field settings:
7. Click Ok

Stream


Visualizing Model Results

1. Click on the tab
2. Double click the node or click and drag it into the stream
3. Double click the node within the stream or right click and click Edit
4. Click on the tab
5. Select
6. Select overlay:
7. Select
8. Click

Summary
To give a broad overview of text analytics
  Defining key terms
  Describing important steps in the process
To provide a step-by-step tutorial for how to use IBM SPSS Modeler to...
  Read in source text
  Extract concepts, sentiment, and text link patterns from records
  Categorize records
  Visualize the results

Additional Resources
User's Guide: http://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/17.0/en/ModelerTextAnalytics.pdf
Introduction to SPSS Text Analytics Webinar: https://www.youtube.com/watch?v=tK-o4MnRScQ&list=WL&index=2