Conceptual text mining
Pim Huijnen
Utrecht University & University of Sheffield Digital Humanities Workshop
May 12, 2016
1 How to define a concept?
Efficiency ≠ “efficiency”
Eugenetica ≠ “eugenetica” + “eugenetiek” + “eugeniek” + “rassenleer”
2 How to study its changing uses, contexts, and meaning over time?
How to know what to look at?
1) Eugenics
* topic modeling newspaper articles containing “eugenics”
* using meaningful words from those topics to look for articles about eugenics that do not contain the word “eugenics”
* in the given example: querying Texcavator with ‘regulation AND health AND race’ (575 results)
Texcavator
* plotting the results on a time scale (relative to the total number of articles per year)
* extracting distinctive words from the query results per year (tf-idf)
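The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not Texcavator's own implementation: the function names, the whitespace tokenisation, the choice to treat all hits from one year as a single document, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter


def relative_hits_per_year(hit_years, totals_per_year):
    """Share of all articles in each year that match the query."""
    hits = Counter(hit_years)
    return {year: hits[year] / total
            for year, total in sorted(totals_per_year.items()) if total}


def distinctive_words_per_year(docs_by_year, top_n=3):
    """Rank words per year by tf-idf, treating each year's hits as one document."""
    # Term counts per year (assumption: lowercased whitespace tokens).
    tf = {year: Counter(w for text in texts for w in text.lower().split())
          for year, texts in docs_by_year.items()}
    n_years = len(tf)
    # Document frequency: in how many years does each word occur?
    df = Counter(word for counts in tf.values() for word in counts)
    ranked = {}
    for year, counts in tf.items():
        total = sum(counts.values())
        scores = {w: (c / total) * math.log(n_years / df[w])
                  for w, c in counts.items()}
        # Highest score first; ties broken alphabetically.
        ranked[year] = sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
    return ranked


# Invented toy data, for illustration only.
toy_hit_years = [1925, 1925, 1935, 1935]
toy_totals = {1925: 1000, 1935: 4000}
toy_docs = {
    1925: ["health regulation race hygiene", "race hygiene debate"],
    1935: ["health regulation sterilisation law", "sterilisation debate"],
}

relative = relative_hits_per_year(toy_hit_years, toy_totals)
distinctive = distinctive_words_per_year(toy_docs, top_n=2)
```

Words that occur in every year (here "health", "regulation", "debate") get an idf of zero, so only the vocabulary specific to a year surfaces in `distinctive`.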
2) Scientific management
* using close reading to find all significant Dutch equivalents for “scientific management”
* extract results, divide them per year and upload them to Voyant Tools
* study changing vocabulary in the subset over time
Scientific management query
* “wetenschappelijke bedrijfsleiding” (233)
* “wetenschappelijke bedrijfsorganisatie” (216)
* “wetenschappelijke bedrijfsvoering” (32)
* “scientific management” (28)
* ‘taylorstelsel OR taylor-stelsel’ (330)
* ‘taylorsysteem OR taylor-systeem’ (369)
* ‘taylorisme’ (42)
Combined in a single query, these terms yield 1175 hits
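The individual counts on this slide add up to 1250, while the combined query gives 1175, presumably because some articles match more than one term and are counted only once. A minimal sketch of that logic follows. The helper names and the article ids are hypothetical, and whether Texcavator accepts exactly this query string is an assumption based on the quoting and OR syntax shown on the slide.

```python
def build_or_query(terms):
    """Join terms into one OR query, quoting multi-word phrases."""
    return " OR ".join(f'"{t}"' if " " in t else t for t in terms)


def combined_hits(hits_per_term):
    """Article ids matching at least one term, each article counted once."""
    combined = set()
    for ids in hits_per_term.values():
        combined |= set(ids)
    return combined


# The search terms from the slide.
terms = [
    "wetenschappelijke bedrijfsleiding",
    "wetenschappelijke bedrijfsorganisatie",
    "wetenschappelijke bedrijfsvoering",
    "scientific management",
    "taylorstelsel", "taylor-stelsel",
    "taylorsysteem", "taylor-systeem",
    "taylorisme",
]
query = build_or_query(terms)

# Invented article ids, only to show why the union is smaller than the sum.
toy_hits = {
    "taylorstelsel": {1, 2, 3},
    "taylorisme": {3, 4},
    "scientific management": {2, 5},
}
summed = sum(len(ids) for ids in toy_hits.values())
unique = len(combined_hits(toy_hits))
```

In the toy data, `summed` counts articles 2 and 3 twice, whereas `unique` counts every article once, which mirrors the gap between 1250 and 1175.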
The third way: distributional semantics
* Our implementation combines a) creating dictionaries and b) tracing meaning over time in a single workflow
* by finding the ‘most similar words’ (i.e. words with similar vectors, which tend to occur in similar contexts and carry similar meanings)
* Use the cluster of most similar words from one ten-year period to find the most similar words in the next (partly overlapping) period
* Trace the word use of concepts over time without being dependent on single terms or predefined dictionaries
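The workflow in the bullets above can be sketched as follows. The slides do not say which vector model is used, so this example swaps in plain sentence-level co-occurrence counts with cosine similarity as a stand-in. The function names, the five-year step between windows, and the toy corpus are likewise assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences):
    """Map each word to a Counter of the words it shares a sentence with."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j, other in enumerate(tokens):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar(vectors, seeds, top_n):
    """Words whose vectors lie closest to the summed vector of the seed cluster."""
    centroid = Counter()
    for seed in seeds:
        centroid.update(vectors.get(seed, {}))
    if not centroid:  # none of the seeds occurs in this window
        return []
    scores = {w: cosine(centroid, vec) for w, vec in vectors.items()}
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


def trace_concept(corpus, seeds, first_year, last_year, width=10, step=5, top_n=3):
    """Carry a word cluster through overlapping windows; each cluster seeds the next."""
    clusters = {}
    for start in range(first_year, last_year - width + 2, step):
        window = [tokens for year, tokens in corpus if start <= year < start + width]
        found = most_similar(cooccurrence_vectors(window), seeds, top_n)
        if found:
            seeds = found  # the new cluster becomes the query for the next window
        clusters[(start, start + width - 1)] = found
    return clusters


# Invented toy corpus of (year, tokens), for illustration only.
toy_corpus = [
    (1900, ["eugenics", "improves", "heredity"]),
    (1903, ["racehygiene", "improves", "heredity"]),
    (1907, ["racehygiene", "studies", "population"]),
    (1912, ["demography", "studies", "population"]),
]
clusters = trace_concept(toy_corpus, ["eugenics"], 1900, 1914, top_n=2)
```

On this toy corpus the second window (1905 to 1914) still returns a cluster, "demography" and "racehygiene", even though the original seed "eugenics" no longer occurs in it. That is the point of carrying clusters forward rather than querying a single fixed term.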