SAGE Innovations in Research Methods
Series Editor: Kosuke Imai, Princeton University
Background and series rationale:
SAGE has a long and unparalleled history of publishing innovative books on research methods to support
students and faculty in the social and behavioral sciences. Starting in the mid-1970s, the Quantitative
Applications in the Social Sciences (QASS) series, or “little green books”, offered short handy guides to key and
emerging methods: today there are 175 volumes available in print and online. Landmark publications such as
Denzin & Lincoln’s SAGE Handbook of Qualitative Research helped define the field and mainstream qualitative
methods within the academy. However, the landscape in which social science research is now conducted has
changed dramatically in recent years. Students and researchers now have access to massive amounts of new
data, as well as new tools and techniques for accessing and making sense of these data. As SAGE celebrates its
first 50 years, it is a fitting moment to launch a series of books that both capture and anticipate the changing
research environment.
SAGE Innovations in Research Methods will be a series of fairly short guides (up to around 200 pages) on
cutting-edge methods, aiming to make these methods accessible to a broader audience of researchers and
graduate students. Innovation will be considered in terms of new ways of doing data collection and analysis,
technological innovations, and new forms of representation of research. The aim is for books in the series to
be used as supplementary texts on graduate-level courses, and be purchased by individual researchers and
graduate students who need insight into the topics covered. Authors, who are experts in their fields, will be
recruited from around the world.
For information on current books in development, or to submit a proposal for the Series, please contact:
Kosuke Imai, Series Editor
Helen Salmon, Senior Acquisitions Editor, SAGE
Hybrid Text Analysis: Humans and Machines Together by Nicholas B. Adams
The Prospectus
I. Market & Course Background
A. Primary Course and Market
Hybrid Text Analysis will be a capstone book for the growing number of computational text analysis courses aimed at social science graduate students and faculty. Right now, most of these courses emphasize the specific utility of separate text analysis tools like cosine similarity, TF-IDF, clustering, topic modeling, grammar parsing, and supervised machine learning using keyword dictionaries. (Sometimes they also include supervised machine learning from hand-labeled text.) Too often, though, students of such courses walk away with the impression that each tool is to be used once per research project. This notion is reinforced by exemplar publications that demonstrate one tool at a time, mostly because the methods are so novel. Hybrid Text Analysis will show that these now-proven tools are even more powerful when used in combination with one another, the more so when combined with human text labeling procedures. With a playful approach to these methods, and technical yet accessible demonstrations of how they can be combined, Hybrid Text Analysis will seed the creativity of the new pioneers of text-based social science research, launching a ‘big text’ revolution in which researchers ask and answer heretofore unanswerable questions about complex social phenomena.
B. Secondary Markets and Courses
Hybrid Text Analysis will also be written with the curious computer novice in mind. Many full professors, encountering text analysis only through their younger, computer-savvy graduate students, need a handy, intuitive introduction to the approaches. Their students need them to have that introduction as well. With accessible prose, copious examples, and just enough mathematical reasoning (unburdened by formal mathematical notation), Hybrid Text Analysis will leave untrained readers with a solid, intuitive understanding of text analysis techniques, a trust that they are reasonable, an interest in their further development, and (perhaps) a curiosity to learn more through existing and forthcoming textbooks.
The book is likely to be very popular in industry, as well. As new consulting firms pop up, almost daily, to serve the text analysis needs of corporate, government, and nonprofit clients, Hybrid Text Analysis will serve as a high-level how-to guide for constructing textual data processing and analysis pipelines that deploy the full repertoire of automated, human-annotation, crowdsourcing, and machine learning approaches to textual data. In addition to serving institutional clients, many “artificially intelligent” apps serving human needs will be launched by software engineers ingesting the lessons of this book.
II. The Book
A. Topic, Approach, and Style of Presentation
This book takes as its starting point the observation that over the last decade, a set of text analysis tools has gradually accumulated in the computer science and computational linguistics corners of university campuses. Social scientists are beginning to understand the utility of these tools, using them cautiously here or there, and one at a time, to speed up some of their data collection or analysis procedures. But no one yet has emphasized or demonstrated that these tools (/toys!) are much more productive (and fun!) to use in combination. This book encourages researchers to joyously embrace text analysis techniques, learn about their strengths and limitations, and, uniquely, to use them promiscuously in various fruitful combinations. Demonstrating new text data pipelines that leverage the author’s new crowd content analysis technology, and innovative applications of automated text analysis algorithms, Hybrid Text Analysis calls on researchers to start tackling large textual datasets now. While effective tools replacing human judgment do not exist for every text analysis problem, the author’s technology for aggregating human judgment at scale, and new insights about the effective combination of human and machine efforts, now make it possible for researchers to effectively and efficiently analyze vast quantities of textual data with a richness equal to in-depth qualitative data. With such data, careful, comparative/statistical studies of complex symbolic interactionist phenomena are now feasible, ushering in what will be a golden age of sociology.
After an introductory chapter, the book will offer an intuitive introduction to text analysis, divided into two chapters: “Reading Like a Human,” and “Reading Like a Machine.” These chapters are not intended to reproduce the detail of available textbooks explaining the math behind popular text analysis techniques (listed in I.A., above), but to ensure a common understanding of their basic function, and to set up the next chapter’s discussion of the relative strengths and weaknesses of humans and machines for asking/answering particular types of research questions. That chapter will move from an emphasis on comparison to an emphasis on synthesis, developing the argument that the techniques can and ought to be combined to (a) make text processing more efficient and (b) enable the asking of yet more, and more interesting, questions. A few examples demonstrating the rewards of combining two methods will set up the next chapters, which demonstrate larger text analysis pipelines combining 7 and 10 of these techniques, respectively, to convert thousands of newspaper articles into multilevel database entries describing actions, interactive sequences, the events in which they are embedded, and the event sequences and larger social contexts in which those are embedded. Substantively groundbreaking results emerging from these hybrid text analysis pipelines will be highlighted. The book’s next chapter offers readers counsel in the development and management of research teams performing large-scale hybrid text analysis projects, and discusses the sort of capacity-expanding democratization of social science work that will emerge with this new approach. Finally, the book concludes with a statement on the promise of hybrid text analysis approaches, particularly for the systematic analysis of the sort of multilevel time-series data that would allow humans to discover and understand patterns in their complex symbolic interactions. A few directions for further study are proposed in a final call for researchers to go forth, play, and (re)discover what humans are all about.
B. Features and Benefits
Hybrid Text Analysis begins where current and forthcoming text analysis textbooks and courses typically end. If, by analogy, they teach the skills of proper sentence construction, it teaches and encourages the higher-order skills of constructing paragraphs and arguments. The book itself is not designed as a classic textbook with discussion questions, exercises, and activities (though see D. Ancillaries below for a description of online tutorials and code cookbooks). Neither does it aim to be an encyclopedia of all the possible combinations of text analysis techniques researchers might deploy. Instead, it aims to demonstrate several such combinations, in a sense, “giving permission” to researchers to take risks and pursue their own creative solutions to their research problems. With diagrams of hybrid text analysis pipelines, screenshots of the textual inputs/outputs of various text processing algorithms, and plotted research findings showing the rise and fall of text-based social phenomena over time, readers will see, at a conceptual level, how text is transformed from raw news accounts to a better understanding of our world. To be read as a companion to other more detailed text analysis textbooks, or as a high-level, intuitive invitation to those methods and their underappreciated promise, Hybrid Text Analysis will feature a ‘glossary’ describing a range of text analysis methods and the questions they answer, to serve as a reminder or reference to these two types of readers.
C. Length ~170 pages
III. Competition
The author is well connected to the social science text analysis community. After spending many hours interacting with the leading scholars in this area – among them Paul DiMaggio, Pablo Barbera, Joshua Tucker, students of Justin Grimmer, and scholars convened by the Computational Text Analysis Working Group the author founded at Berkeley (including Michael I. Jordan (co-inventor of topic modeling), David Bamman, Jake Ryland Williams, and the students of Dan Klein working on Berkeley NLP) – the author is aware of no other scholar working on a book such as Hybrid Text Analysis. Moreover, those who have been briefed on this book proposal appreciate its value and are anticipating its release.
Table of Contents
I. Introduction
Few know it, but we are entering the greatest age of sociology. With advancing text analysis technology and seas of digital textual data, we can finally accomplish intricate, multilevel, time-series analysis adequate to our most complex theories of the social world. The impacts of this big text revolution are difficult to overstate…
...
This book extends an invitation to a range of readers. Maybe you are a qualitative researcher familiar with interviewing and content analysis methods. Perhaps your curiosity is piqued by all the buzz around text analysis, but you have been waiting for an accessible introductory text surveying the available tools and their application to a range of research questions. Or maybe…
...
Over the last few decades, in the computer science and computational linguistics corners of campuses around the world, scientists have been creating and refining tools that allow computers to read, often in exotic ways, the words we humans write in letters, news articles, legal tomes, and much more. These computational text analysis tools have been proven, one by one, in a smattering of articles across the social sciences, to be rather useful for asking limited questions about social data appearing in the form of text. But social science has yet to fully appreciate the power of these tools. This book invites readers into the text analysis sandbox. First, it orients readers to the tools of computational text analysis, and their strengths and weaknesses, especially compared to humans’ reading abilities. Then, it moves into, and celebrates, the realization that human and machine approaches to text can be combined in many fruitful ways. Through many examples and three chapters demonstrating novel text analysis pipelines, readers will learn to ... ...
II. Reading Like a Human
A. What we all do – Psycholinguistics, Metaphors We Live By
B. The epistemologies of literary criticism, historical, and sociological reading
C. Reading for Symbolic Interaction
    1. Events/situations
    2. Broader context
        - Allusions, irony, sarcasm, idioms
    3. Dogwhistles, Code-switching, Conflicted Intimacy (foreshadows caution about machine reading)
D. The challenges of human reading at scale
    1. Intersubjective agreement
    2. Time
III. Reading Like a Machine
A. Bagging Words
    1. Why we bag words
    2. The document-term matrix (Screenshot, example)
    3. The document-phrases matrix (Screenshot, example)
    4. Difference of proportions (Screenshot, example)
    5. TF-IDF (Screenshot, example)
    6. Clustering (Screenshot, example)
    7. Topic Modeling (Screenshot, example)
    8. Structural Topic Modeling (Screenshot, example)
B. Grammar Parsing
    1. The power of Grammar
        - Syntax constrains semantics (Screenshot, example)
    2. How humans taught computers to understand grammar
    3. Coreference resolvers (Screenshot, example)
C. More Human-trained Algorithms
    1. Dictionary classification: LIWC (Screenshot, example)
    2. Named-entity recognizers (Screenshot, example)
    3. WordNet (Screenshot, example)
    4. VerbNet (Screenshot, example)
    5. FrameNet (Screenshot, example)
D. Strengths and Weaknesses
    1. So fast
    2. So replicable
    3. Valid?
    4. But, a canny reader will notice that the strengths and weaknesses of each approach vary. So we cannot judge all Machine reading methods in a few paragraphs… and more importantly, it seems plausible that the failures of each method might be redeemed by some other method. The rest of this book explores that intuition.
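As a rough illustration of the "bagging words" items in this chapter outline (the document-term matrix and TF-IDF), the following minimal Python sketch may help readers picture the transformation. It is not drawn from the book: the three sample sentences, the function names, and the plain log(N / document frequency) weighting are all assumptions chosen for brevity, and real projects would more likely use an established library.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase a text and split it into words, dropping edge punctuation."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def document_term_matrix(docs):
    """Return (vocab, rows), where rows[i][j] counts term vocab[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[t] for t in vocab] for c in counts]
    return vocab, rows

def tf_idf(rows):
    """Weight raw counts by log(N / document frequency), one simple TF-IDF variant."""
    n_docs = len(rows)
    n_terms = len(rows[0])
    df = [sum(1 for r in rows if r[j] > 0) for j in range(n_terms)]
    return [[r[j] * math.log(n_docs / df[j]) for j in range(n_terms)] for r in rows]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical mini-corpus, invented for this sketch.
docs = [
    "Police arrested protesters at the park.",
    "Protesters marched to the park.",
    "The council approved the budget.",
]
vocab, dtm = document_term_matrix(docs)
weights = tf_idf(dtm)
```

Under this weighting, a word that appears in every document (here, "the") receives a weight of zero, so the two protest sentences end up more similar to each other than either is to the budget sentence.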
IV. Hybrid Text Analysis
A. So many new toys in the sandbox!
B. Humans still needed
    1. Review and expand on the strengths and weaknesses of humans
    2. AND, New tools for effectively enlisting and aggregating human judgments
    3. Brief Introduction to Crowd Content Analysis
C. Example: humans and machines together
    1. Grammar-parsing, then bagging approaches
    2. Human chunking, then topic modeling
    3. Grammar-parsing, then human sequencing
D. A time for creative and playful scholars to flourish
V. Computational Grounded Theory (a chapter by Laura Nelson)
If it suits SAGE, I’d like to invite Laura Nelson to write a chapter on her Computational Grounded Theory Framework. This research design framework will feel exciting and familiar to qualitative researchers, and it scales to very big text data. Her example, showing that first- and second-wave feminism are more accurately characterized as ‘Chicago’ and ‘New York’ feminism (since both existed throughout the 20th century), is also likely to support a fundamental, implicit purpose of this book: to encourage critical researchers to embrace text analysis tools so they can tell the more complete and complex stories about our social world that need telling.
A. Qualitative Research aided by Computers
B. The Computational Grounded Theory Framework
    1. Pattern detection
    2. Pattern Refinement
    3. Pattern confirmation
C. A Geographic Corrective to the Wave Theory of Feminism
    1. Pattern detection
        - structural topic modeling
        - difference of proportions
    2. Pattern Refinement
        - guided qualitative deep reading
    3. Pattern confirmation
        - WordNet specificity scores
        - Named-entity recognition
    4. Findings: New York City comprised a decentralized feminism focused on the individual in both the first and second waves; Chicago comprised a centralized feminism focused on meeting immediate needs in both waves. In sum, geographical differences that persisted over time better explain the content of feminism than the wave framework.
VI. The New Event Analysis
This chapter revitalizes and extends Charles Tilly’s theory of contentious performances by presenting a new and fruitful method for discovering, measuring, and modeling them. The chapter begins with a restatement of Tilly’s theory that performances are flexible, semi-scripted sets of action cohering within and across protest events (i.e., contentious gatherings), then describes how Tilly attempted to measure performances. Next, it describes a Hybrid Text Analysis method for discovering performances that merges hand-coding and automated text analysis techniques. The merits of the approach are demonstrated, showing that protester and police performances can be measured far faster than with traditional content analysis, and with far more detail than is permitted by previous attempts to model Events using automated methods alone.
A. Tilly’s Performances
B. Performance Modeling: The New Event Analysis
    1. Hand-Coding Event Text Units (Screenshot)
    2. Auto-Coding Actions and Actors
    3. Auto-Coding Interactions
C. Demonstration
D. Findings
E. Broader Application for Symbolic Interactionist Research
VII. A Crowd Content Analysis Assembly Line
This chapter describes a Crowd Content Analysis Assembly Line and enabling software allowing researchers to complete large-scale content analysis projects with the help of citizen scientists and/or crowd workers. The key innovation enabling this new text analysis workflow is the TUA, the Text Unit of/for Analysis. TUAs are the subset of words in a document that refer to a single case of a single unit of analysis under study. Once a research team has identified the TUAs in a corpus, detailed hand-coding work can be performed by internet users who experience the work as a series of reading comprehension tasks requiring them to highlight the text they use to justify their answers. By breaking down content analysis into these cognitively simpler tasks that scale to a larger workforce, the approach can reduce the duration of large-scale content analysis projects by a factor of six. The chapter explains how databases produced by this text analysis approach are created faster, and are richer, more reliable, larger, and more transparent than those generated using previous methods and technology. The integration of the approach with automated approaches is also discussed.
A. Text Units of Analysis
    1. What they are
    2. How they enable Content Analysis at Scale
    3. Identifying Them
B. Crowd Coding Text Units
    1. Reading Comprehension Homework Finally Pays Off
    2. The Text Thresher Interface
C. Validating Crowd Work
    1. Gold Standards
    2. Voting
    3. Data Stewardship
D. Extensions
    1. Integrating NLP hints
    2. Active Machine Learning
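The "Gold Standards" and "Voting" items under Validating Crowd Work can be pictured with a small Python sketch. This is only one plausible reading of those outline entries, not a description of the Text Thresher software: the worker IDs, TUA identifiers, labels, function names, and the strict-majority threshold are all hypothetical.

```python
from collections import Counter, defaultdict

def majority_vote(labels, min_agreement=0.5):
    """Return the most common label if its share exceeds min_agreement, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > min_agreement else None

def gold_accuracy(answers, gold):
    """Score each worker on the items whose correct label is already known.

    answers: list of (worker, item, label) tuples
    gold:    dict mapping item -> expert-assigned label
    """
    hits = defaultdict(int)
    seen = defaultdict(int)
    for worker, item, label in answers:
        if item in gold:
            seen[worker] += 1
            hits[worker] += int(label == gold[item])
    return {worker: hits[worker] / seen[worker] for worker in seen}

# Hypothetical crowd answers: three workers label two text units (TUAs).
answers = [
    ("w1", "tua1", "arrest"), ("w2", "tua1", "arrest"), ("w3", "tua1", "march"),
    ("w1", "tua2", "march"), ("w2", "tua2", "march"), ("w3", "tua2", "march"),
]
gold = {"tua1": "arrest"}  # one unit pre-labeled by the research team

accuracy = gold_accuracy(answers, gold)

labels_by_item = defaultdict(list)
for _, item, label in answers:
    labels_by_item[item].append(label)
consensus = {item: majority_vote(labels) for item, labels in labels_by_item.items()}
```

Returning None when no label wins a strict majority is a design choice in this sketch: it flags a unit for additional coders or expert review rather than forcing a decision.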
VIII. Serious Fun: Managing Your Own Hybrid Text Analysis Project
A. Who can do this?
    1. Qualitative researchers
    2. Managers
    3. Quantitative/computational researchers
    4. You
B. Building teams, tools, and pipelines simultaneously – the social and organizational work/play of Hybrid text analysis
C. What a Hybrid Text Analysis Team Looks Like
    1. Roles
    2. Size
D. Supporting Technologies
    1. Email, Slack, Asana, GitHub, Google Docs
    2. Let the Spreadsheet do the work
    3. Pizza
E. Expanding Social Science Capacity, Democratizing Scientific Work
IX. More Human Machines
A. Human Teaching, Machine Learning
    1. Intuitive introduction to Machine Learning
        i. Linear Regression
        ii. Support Vector Machines
        iii. Decision Trees
        iv. K-nearest Neighbors
    2. Measuring machines’ performance
    3. Tradeoffs between interpretability and performance
B. More Human Machines, More Human Humans
    1. Artificial Intelligences
    2. Improved Social Intelligence
X. Conclusion
A. Review
B. The Promise
    1. Big, complex text data for a big, complex world
    2. Encouraging directions for study
Biographical & Other Information
I. Biographical Information
The author has successfully written in long form, including two ~80 page research reports for a policy audience, and a >400 page dissertation (217 pages of body text). Writing a book of this length will not be particularly challenging for him.
NICHOLAS BRIGHAM ADAMS
Berkeley Institute for Data Science University of California, Berkeley
190 Doe Library Berkeley, CA 94720
CURRENT APPOINTMENTS
Post-Doctoral Data Science Fellow, Sociologist, Berkeley Institute for Data Science

EDUCATION
Ph.D. University of California, Berkeley, 2015
Dissertation Committee: Kim Voss, Trond Petersen, Chris Ansell
Specializing in fields of political sociology and public behavior, social movements, social and political psychology, and classical and contemporary sociological theory
M.A. University of California, Berkeley, 2009
Sociology
B.A. Washington University in St. Louis, 2002
Philosophy, Political Science
Magna cum Laude, Phi Beta Kappa

HONORS, AWARDS and GRANTS
Digital Humanities @ Berkeley Curriculum Development and Teaching Grant for Text Analysis, 2016 ($60,000)
Alfred P. Sloan Foundation Grant ($9,000)
Hypothes.is “Open Annotation Fund” Grant ($9,000)
Digital Humanities Fellow Award, 2015 ($3,000)
Mentorship Award, Sociology Department, 2014
National Science Foundation Doctoral Dissertation Improvement Grant, 2013-2014 ($12,000)
Lotus Foundation and Open Society Institute “Science of Security” Project Grant, 2010-2012 ($400,000)
Outstanding Graduate Student Instructor, 2009
Robert Wood Johnson Mentor Research Grant, 2007 ($2,000)
Graduate Diversity Fellowship, UCB, 2005-2006 ($4,000)
Leo Lowenthal Fellowship, University of California, Berkeley, 2005-2006 ($8,000)
Phi Beta Kappa, 2002
Magna cum Laude honors in Philosophy, 2002
Schwarzchild Prize (biennial award honoring top Philosophy student at Washington University), 2002
National Merit Finalist, 1998
PUBLICATIONS
Nick Adams. “Scaling Up Content Analysis: Crowd-Coding Text Units.” (under second review w/ Sociological Methodology)
Nick Adams, Ted Nordhaus, and Michael Shellenberger. January 2012. “Planes, Trains, and Car Bombs: The Method Behind the Madness of Terrorism.” The Breakthrough Institute. Oakland, CA.
Nick Adams, Ted Nordhaus, and Michael Shellenberger. 2011. “Who Killed the War on Terror?” The Atlantic. August 29. Washington, D.C. http://www.theatlantic.com/national/archive/2011/08/whokilledthewaronterror/244273/?single_page=true
Nick Adams, Ted Nordhaus, and Michael Shellenberger. 2011. “Congress Needs to Evaluate Counterterrorism Techniques.” ROLL CALL. Congressional Quarterly. June 28, 2011. Washington, D.C. http://www.rollcall.com/issues/56_146/interrogation_tactics_need_evaluated2068201.html
Nick Adams, Ted Nordhaus, and Michael Shellenberger. 2010. “Counterterrorism Since 9/11: Evaluating the Efficacy of Controversial Tactics.” The Breakthrough Institute. Oakland, CA.
Robb Willer and Nick Adams. 2008. “The Threat of Terrorism and Support for the 2008 Presidential Candidates: Results of a National Field Experiment.” Current Research in Social Psychology. 14(1):1-22. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.168.6157&rep=rep1&type=pdf

WORKS IN PROGRESS
Nick Adams. “Discovering Performances, and Dances, Too: New Method and Theory for Protest Event Analysis.”
Nick Adams. “An Occupation Campaign Life Course?: Sequential Activity Across 182 U.S. Occupy Campaigns.”
Nick Adams. “Strategic Control Performances: American Police Departments’ Responses to the Occupy Campaigns of 2011.”
Nick Adams. “Path Dependence Through the Public: State Health Policies’ Effects on Public Opinion of National Health Policy.”

CONFERENCE PAPERS
Nick Adams. “Scaling Up Content Analysis: Crowd-Coding Text Units.” Human Centered Data Science Workshop @ CSCW. 2016.
Nick Adams. “Scaling Up Content Analysis: Crowd-Coding Text Units.” International Conference on Computational Social Science. 2016.