Conceptual text mining Pim Huijnen Utrecht University & University of Sheffield Digital Humanities Workshop May 12, 2016

What to do with 11 million newspaper pages?

a) distant reading

In: Het Centrum, 10 October 1919, p. 4.

b) finding the needle in the hay stack

In: Het Volk: Dagblad voor de arbeiderspartij, 29 January 1921, p. 3.

1. How to define a concept?

Efficiency ≠ “efficiency”

Eugenetica ≠ “eugenetica” + “eugenetiek” + “eugeniek” + “rassenleer”

2. How to study its changing uses, contexts, and meaning over time?

How to know what to look at?

1) Eugenics

* topic modeling newspaper articles containing “eugenics”

* using meaningful words to look for eugenics without “eugenics”

* in the given example: querying Texcavator with ‘regulation AND health AND race’ (575 results)

Texcavator

plotting the results on a time scale (relative to total number of articles per year)

extracting distinctive words from query results per year (tf-idf)
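The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Texcavator's own code: the function names, the toy counts, and the choice to treat all hits from one year as a single document for tf-idf are hypothetical.

```python
import math
from collections import Counter


def relative_frequency(hits_per_year, totals_per_year):
    """Share of all articles in each year that match the query."""
    return {
        year: hits_per_year.get(year, 0) / total
        for year, total in totals_per_year.items()
        if total > 0
    }


def distinctive_words(tokens_per_year, top_n=3):
    """Rank words per year by tf-idf, treating each year as one document."""
    n_years = len(tokens_per_year)
    # Document frequency: in how many years does each word occur?
    df = Counter()
    for tokens in tokens_per_year.values():
        df.update(set(tokens))
    result = {}
    for year, tokens in tokens_per_year.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores = {
            word: (count / total) * math.log(n_years / df[word])
            for word, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[year] = [word for word, _ in ranked[:top_n]]
    return result


# Hypothetical toy data, not taken from the newspaper corpus.
hits = {1919: 2, 1921: 1}
totals = {1919: 100, 1920: 50, 1921: 200}
tokens = {
    1919: ["race", "health", "law", "law"],
    1921: ["race", "health", "clinic"],
}

print(relative_frequency(hits, totals))
print(distinctive_words(tokens, top_n=1))
```

Words that occur in every year (here "race" and "health") get a tf-idf score of zero, so only the vocabulary specific to a year rises to the top.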

2) Scientific management

* using close reading to find all significant Dutch equivalents for “scientific management”

* extract results, divide them per year and upload them to Voyant Tools

* study changing vocabulary in the subset over time

Scientific management query

“wetenschappelijke bedrijfsleiding” (233)

“wetenschappelijke bedrijfsorganisatie” (216)

“wetenschappelijke bedrijfsvoering” (32)

“scientific management” (28)

‘taylorstelsel OR taylor-stelsel’ (330)

‘taylorsysteem OR taylor-systeem’ (369)

‘taylorisme’ (42)

Combined in a single query, these terms result in 1175 hits
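The separate counts add up to 1250, so the combined figure of 1175 is consistent with some articles matching more than one term and being counted once. A minimal sketch of that effect follows, assuming a query syntax of quoted phrases joined with OR. The helper names and the three toy articles are hypothetical, and plain substring matching stands in for the real search engine.

```python
PHRASES = [
    "wetenschappelijke bedrijfsleiding",
    "wetenschappelijke bedrijfsorganisatie",
    "wetenschappelijke bedrijfsvoering",
    "scientific management",
]
WORDS = [
    "taylorstelsel", "taylor-stelsel",
    "taylorsysteem", "taylor-systeem",
    "taylorisme",
]


def build_query(phrases, words):
    """Join quoted phrases and single words into one OR query string."""
    quoted = ['"%s"' % phrase for phrase in phrases]
    return " OR ".join(quoted + list(words))


def count_hits(articles, terms):
    """Count articles containing at least one term (each article once)."""
    return sum(
        1 for text in articles
        if any(term in text.lower() for term in terms)
    )


# Hypothetical toy articles, not taken from the corpus.
articles = [
    "Het Taylorstelsel en de wetenschappelijke bedrijfsleiding",
    "Over het taylorisme in Amerika",
    "Weerbericht voor morgen",
]

all_terms = PHRASES + WORDS
separate = sum(count_hits(articles, [term]) for term in all_terms)
combined = count_hits(articles, all_terms)

print(build_query(PHRASES, WORDS))
print(separate, combined)
```

In the toy data the first article matches two terms, so the separate counts sum to more than the combined count.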

The third way: distributional semantics

* Our implementation combines a) creating dictionaries and b) tracing meaning over time in a single workflow

* by finding ‘most similar words’ (i.e. words with similar vectors, which tend to be used in similar contexts and so have similar meanings)

* Use the cluster of most similar words from a ten-year time period to find the most similar words in the next (and partly overlapping) time frame

* Trace word use of concepts over time without being dependent on single terms or predefined dictionaries
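The workflow in the bullets above can be illustrated with a small sketch. It is not the tool's actual implementation: the function names, the two-dimensional toy vectors, and the choice of average cosine similarity to the seed words are all assumptions made for the example. In practice the vectors would come from a word embedding model trained separately on each time window.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, seeds, top_n=2):
    """Words in one window closest, on average, to the seeds present in it."""
    present = [seed for seed in seeds if seed in vectors]
    if not present:
        return []
    scores = {
        word: sum(cosine(vec, vectors[s]) for s in present) / len(present)
        for word, vec in vectors.items()
    }
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


def track_concept(windows, seeds, top_n=2):
    """Use each window's cluster as the seed words for the next window."""
    history = {}
    current = list(seeds)
    for label, vectors in windows:
        cluster = most_similar(vectors, current, top_n)
        history[label] = cluster
        if cluster:
            current = cluster
    return history


# Hypothetical toy vectors for two partly overlapping windows.
windows = [
    ("1900-1910", {
        "efficiency": [1.0, 0.0],
        "taylorstelsel": [0.9, 0.1],
        "weather": [0.0, 1.0],
    }),
    ("1905-1915", {
        "taylorstelsel": [1.0, 0.0],
        "rationalisatie": [0.95, 0.05],
        "weather": [0.0, 1.0],
    }),
]

history = track_concept(windows, ["efficiency"])
print(history)
```

In the toy data the seed word "efficiency" is absent from the second window, yet the concept can still be followed there because "taylorstelsel" was picked up in the first window and carried over as a seed.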

ShiCo
