Forensic Linguistics with Apache Spark Kostas Perifanos @k_perifanos


Page 1: Forensic linguistics with Apache Spark

Forensic Linguistics with Apache Spark

Kostas Perifanos @k_perifanos

Page 2: Forensic linguistics with Apache Spark

Idiolect, sociolect, intertextuality

What?

- Idiolect: an individual’s distinctive and unique use of language

- Sociolect: a variety of language associated with a social group (socioeconomic, ethnic, age)

- Intertextuality: the shaping of a text’s meaning by another text

Page 3: Forensic linguistics with Apache Spark

Forensic Linguistics

“Forensic linguistics, legal linguistics, or language and the law, is the application of linguistic knowledge, methods and insights to the forensic context of law, language, crime investigation, trial, and judicial procedure. It is a branch of applied linguistics.” [Wikipedia]

- Authorship Attribution

- Authorship Identification

- Gender/Age classification etc

Page 4: Forensic linguistics with Apache Spark

Dataset

- 8M tweets between 18/06/2015 and 06/08/2015

- 92M words (whitespace-tokenized)

- 190K users

- Key events during this period

- Referendum Announcement

- Capital Controls

- Referendum voting

Page 5: Forensic linguistics with Apache Spark

Toolset

- Apache Spark 1.6.1

- RDD

- DataFrames / Spark SQL

- Word2vec, KMeans

- Apache Zeppelin

- Gephi

Page 6: Forensic linguistics with Apache Spark

Basic Data Exploration - Counting

Check for trends:

- Lowercase vs Uppercase ratios

- Relative frequencies of important (propaganda) words

- Average text length (per day)

- Average word length (per day)
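A minimal plain-Python sketch of the per-day counts listed above, assuming tweets arrive as (day, text) pairs. The function names and the sample tweets are made up for illustration; in Spark the same per-tweet function could be mapped over an RDD and the tuples summed per day (for example with `reduceByKey`), but this sketch does not use Spark.

```python
from collections import defaultdict

def tweet_stats(text):
    """Return (lowercase letters, uppercase letters, characters, words, word characters)."""
    words = text.split()  # whitespace tokenization, as in the dataset slide
    lower = sum(1 for ch in text if ch.islower())
    upper = sum(1 for ch in text if ch.isupper())
    return (lower, upper, len(text), len(words), sum(len(w) for w in words))

def daily_trends(tweets):
    """tweets: iterable of (day, text). Returns {day: dict of metrics}."""
    totals = defaultdict(lambda: [0, 0, 0, 0, 0, 0])
    for day, text in tweets:
        acc = totals[day]
        for i, value in enumerate(tweet_stats(text)):
            acc[i] += value
        acc[5] += 1  # number of tweets seen for this day
    trends = {}
    for day, (lower, upper, chars, words, word_chars, n) in totals.items():
        trends[day] = {
            "lower_upper_ratio": lower / upper if upper else float("inf"),
            "avg_text_length": chars / n,
            "avg_word_length": word_chars / words if words else 0.0,
        }
    return trends

# Hypothetical sample data, not taken from the dataset.
sample = [
    ("2015-06-27", "Referendum NOW"),
    ("2015-06-27", "we vote on Sunday"),
    ("2015-06-28", "BANKS CLOSED"),
]
trends = daily_trends(sample)
```

Keeping raw sums per day (rather than per-tweet averages) means partial results from different partitions can be added together before the final division.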

Page 7: Forensic linguistics with Apache Spark

Counting - lowercase / uppercase ratio

Page 8: Forensic linguistics with Apache Spark

Counting - Propaganda
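A rough sketch of the relative-frequency count behind this kind of chart: for each day, occurrences of a watched word divided by all tokens seen that day. The slides do not list the actual words, so `watch_words` and the sample tweets here are placeholders.

```python
from collections import Counter, defaultdict

def relative_frequencies(tweets, watch_words):
    """tweets: iterable of (day, text). Returns {day: {word: count / tokens that day}}."""
    watch = {w.lower() for w in watch_words}
    hits = defaultdict(Counter)   # per-day counts of watched words
    totals = Counter()            # per-day count of all tokens
    for day, text in tweets:
        tokens = text.lower().split()
        totals[day] += len(tokens)
        hits[day].update(t for t in tokens if t in watch)
    return {
        day: {w: hits[day][w] / totals[day] for w in sorted(watch)}
        for day in totals if totals[day]
    }

# Hypothetical sample data and watch list.
sample = [
    ("2015-07-04", "fear fear and more fear"),
    ("2015-07-04", "hope not fear"),
    ("2015-07-05", "hope wins today"),
]
freqs = relative_frequencies(sample, ["fear", "hope"])
```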

Page 9: Forensic linguistics with Apache Spark

Similarities & user interactions

- Build a word2vec model, treat @mentions as vocabulary words

- Find top-N “synonyms” using seed accounts, keep all starting with “@”

- @handle1: @handle2, @handle3, ...

- @handle32: @handle5, @handle3, ...

- Visualize the graph
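The filtering and graph-export steps can be sketched as follows. This assumes a trained model that answers nearest-neighbour queries; `most_similar` is a hypothetical callable standing in for that query, and the toy neighbour table and scores are invented. The output is a plain Source,Target,Weight edge list, a CSV layout that Gephi's spreadsheet import can read.

```python
import csv
import io

def mention_edges(seed_accounts, most_similar, top_n=10):
    """Build (seed, neighbour, similarity) edges, keeping only @handles.

    most_similar is a hypothetical callable: (token, top_n) -> [(token, score), ...].
    """
    edges = []
    for seed in seed_accounts:
        for token, score in most_similar(seed, top_n):
            if token.startswith("@") and token != seed:
                edges.append((seed, token, score))
    return edges

def to_edge_list_csv(edges):
    """Serialize edges as a Source,Target,Weight edge list."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buf.getvalue()

# Toy neighbour table with made-up scores, for illustration only.
toy = {
    "@handle1": [("@handle2", 0.91), ("referendum", 0.88), ("@handle3", 0.80)],
    "@handle32": [("@handle5", 0.77), ("@handle3", 0.70)],
}
edges = mention_edges(["@handle1", "@handle32"], lambda tok, n: toy.get(tok, [])[:n])
csv_text = to_edge_list_csv(edges)
```

Ordinary words such as "referendum" are dropped, so the resulting graph contains accounts only.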

Page 10: Forensic linguistics with Apache Spark

Similarities & interactions graph [Gephi]

Page 11: Forensic linguistics with Apache Spark

Similarities & interactions graph [Gephi]

Gephi: modularity analysis, 9 communities detected

Communities:

- “Yes”, black

- “No”, magenta

- media, red

- celebrities, dark green

- “Romantic twitter”, orange

- ....

Page 12: Forensic linguistics with Apache Spark

2. Idiolect : Style signatures

- Choose top N most frequent words [1]

- Build frequency vectors for all users

- Compare user signatures [e.g. cosine similarity]

- Identified double-account user among 180K candidates (so much for anonymity)

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3694980/
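A small plain-Python sketch of the signature idea: pick the N most frequent words overall, turn each user's texts into a vector of relative frequencies over those words, and compare users by cosine similarity. The users, texts, and N = 5 are invented for illustration; this is one possible reading of the steps, not the exact pipeline used in the talk.

```python
import math
from collections import Counter

def top_words(user_texts, n):
    """Return the n most frequent whitespace tokens across all users."""
    counts = Counter()
    for texts in user_texts.values():
        for text in texts:
            counts.update(text.lower().split())
    return [w for w, _ in counts.most_common(n)]

def signature(texts, vocabulary):
    """Relative frequency of each vocabulary word in one user's texts."""
    counts = Counter()
    total = 0
    for text in texts:
        tokens = text.lower().split()
        counts.update(tokens)
        total += len(tokens)
    return [counts[w] / total if total else 0.0 for w in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar_user(target, signatures):
    """Return the other user whose signature has the highest cosine similarity."""
    scored = ((cosine(signatures[target], sig), user)
              for user, sig in signatures.items() if user != target)
    return max(scored)[1]

# Hypothetical users and texts.
user_texts = {
    "user_a": ["well i think the vote is on", "well the banks are closed i think"],
    "user_b": ["BREAKING banks closed", "vote results tonight"],
    "user_c": ["well i think the banks are on holiday"],
}
vocabulary = top_words(user_texts, 5)
signatures = {u: signature(t, vocabulary) for u, t in user_texts.items()}
match = most_similar_user("user_a", signatures)
```

Frequent function words carry little topic information, which is why they are useful here: they reflect habit rather than subject matter.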

Page 13: Forensic linguistics with Apache Spark

2. Idiolect : Style signatures

Page 14: Forensic linguistics with Apache Spark

Sociolect: Clustering

- Apply clustering on signature vectors

- KMeans on signatures

- KMeans on word2vec vectors:

- Transform words to vectors, sum and average

- Also works very well for metaphor detection
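The "transform words to vectors, sum and average" step followed by KMeans can be sketched in plain Python. The 2-D word vectors, users, and initial centroids below are invented, and the clustering is a bare-bones Lloyd's algorithm standing in for a library KMeans; it is meant to show the shape of the computation only.

```python
def average_vector(tokens, word_vectors):
    """Sum the vectors of known tokens and divide by how many were found."""
    found = [word_vectors[t] for t in tokens if t in word_vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Made-up 2-D "word vectors" and users, for illustration only.
word_vectors = {"yes": [1.0, 0.0], "europe": [0.8, 0.2],
                "no": [0.0, 1.0], "dignity": [0.2, 0.8]}
users = {"u1": "yes europe yes", "u2": "no dignity",
         "u3": "europe yes", "u4": "dignity no no"}
points = [average_vector(text.split(), word_vectors) for text in users.values()]
labels, centers = kmeans(points, centroids=[points[0], points[1]])
```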

Page 15: Forensic linguistics with Apache Spark

Intertextuality: LDA + signatures

- User generates texts by sampling a number of topics

- “Similar” users will tend to have similar topic distributions

- Given a subset of similar users, identify the most influential, e.g. the user who enforces a writing style. [But that’s another presentation :)]

Challenges

- Noise

- “Random events”

- Opinion shifting: people change their opinions and their writing styles accordingly. Social media tends to amplify this behaviour [one more presentation :)]
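One way to make "similar topic distributions" concrete is to compare per-user topic mixtures with a distance between probability distributions. The sketch below uses the Hellinger distance, which is an assumption of this example rather than something the slides specify, and the topic mixtures are invented instead of coming from a fitted LDA model.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0)

def nearest_users(target, topic_mix, k=2):
    """Return the k users whose topic mixtures are closest to the target's."""
    ranked = sorted(
        (hellinger(topic_mix[target], mix), user)
        for user, mix in topic_mix.items() if user != target
    )
    return [user for _, user in ranked[:k]]

# Made-up per-user topic proportions over three topics.
topic_mix = {
    "user_a": [0.7, 0.2, 0.1],
    "user_b": [0.6, 0.3, 0.1],
    "user_c": [0.1, 0.1, 0.8],
}
neighbours = nearest_users("user_a", topic_mix, k=1)
```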

Page 16: Forensic linguistics with Apache Spark

Next steps

- User - Topic Classification

- Gender classification

- Age

- Personality, stress, anxiety etc

- Try Deep Learning approaches

Page 17: Forensic linguistics with Apache Spark

Thank you!

Questions?

@k_perifanos - http://github.com/kperi