
Building Corpora from Social Media


Given at the StuTS conference in Stuttgart. All code is available on GitHub; all data is currently unavailable.


Introduction

What are ‘Low resource’ languages?

Half of the world’s 7,000 languages have been predicted to go extinct within this century (Krauss 1992).

Corpora are available for statistically none of them.


• Only around thirty languages currently enjoy full technological resources

• Only a hundred or so have basic resources such as dictionaries, spellcheckers, or parsers (Scannell 2007; Krauwer 2003).


Introduction

Why make corpora?
• Linguistic data can be analysed by linguists interested in theoretical questions
• Utilised by data scientists and computational linguists to provide better tools and applications

• Archived for posterity.


Outline

• The Tʉlʉʉsɨke Kɨlaangi Facebook Group
• Previous work (in brief)
• Legality of using Facebook
• Corpus creation process
• An XML Schema for data archival


Tʉlʉʉsɨke Kɨlaangi

Rangi:
– Bantu language
– 350,000 speakers
– Spoken mainly in Tanzania
– A few linguists working on it, mainly Oliver Stegen (Edinburgh, SIL)


Tʉlʉʉsɨke Kɨlaangi

Facebook Group:
– Founded by Oliver Stegen
– 339 Members
– Since February 11, 2011
– Created for corpus generation.
– For talking in Rangi, but there is often English and Swahili code switching.


Previous Work

• Twitter corpora: Large datasets, lots of opinion mining.
  – Examples: US elections, Arab Spring

• An Crúbadán by Kevin Scannell


Previous Work

• Work on Facebook corpora:
  –
  –
  –
  Ok, there is some work, but it is very sparse. (If you know of any, let me know.)


Legal Issues

• Disclaimer: This is not sound legal advice, and I am not opening a lawyer-client relationship with you by telling you any of this. This is merely what I think I’ve figured out by staring at the literature and Facebook for a very, very long time.


Legal Issues

• Facebook’s Statement of Rights and Responsibilities, section 3.2 states:
  – “You will not collect users’ content or information, or otherwise access Facebook, using automated means (such as harvesting bots, robots, spiders, or scrapers) without our permission.”
• Automated Data Collection Terms:
  – All automated processes on the site are forbidden, unless there is express written consent.


Legal Issues

• “You agree that any violation of these terms may result in your immediate ban from all Facebook websites, products and services. You acknowledge and agree that a breach or threatened breach of these terms would cause irreparable injury…” – Facebook


Legal Issues

• Workaround:
  – Use only ‘public’ information
  – EU Directive 96/9/EC
  – ‘Fair Use’
  – Implied licenses
  – Not using a crawler or scraper.


Privacy

• Facebook wants written consent from each user.

• Standard procedure in language documentation.

• Required by most universities (and often by journals).


Privacy

• Unnecessary here:
  – All data is in the public domain.
  – The data will not be shared or monetized.
  – All names and personal data are anonymised.
  – The data is being used purely for research.
  – The group I’m looking at was set up for this purpose, and Stegen has confirmed this in personal communication.
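One simple way to anonymise names is to swap each author for a stable speaker ID. The sketch below is only an illustration of that idea, not the code in the repository; the function name and the "speaker_NNN" format are hypothetical.

```python
def pseudonymise(authors):
    """Map each distinct author name to a stable ID such as 'speaker_001'.

    IDs are handed out in order of first appearance, so the same name
    always receives the same ID within one run.
    """
    ids = {}
    for name in authors:
        if name not in ids:
            ids[name] = f"speaker_{len(ids) + 1:03d}"
    return ids


# Hypothetical author names, in the order their posts appear.
mapping = pseudonymise(["Author A", "Author B", "Author A"])
print(mapping)
```

The mapping itself is personal data, so it would have to be kept apart from the corpus or discarded.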


The Tool

• Load page into a browser normally
  – the source code has already been collected into the system, and automation is not necessary for retrieving more URLs.
• Manually click on “Display more posts...” and “View all comments”
  – An Ajax query is sent to the database, and the posts are loaded in the browser.
• Copy and save the HTML source code.
• Clean and sort with Python (Beautiful Soup).
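The cleaning step could look roughly like the following. This is a minimal sketch, not the code in the repository: it uses the standard library's html.parser instead of Beautiful Soup so that it runs without extra packages, and the class name "post" is a hypothetical placeholder, since Facebook's real markup is not shown here.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the visible text of every <div> carrying a given class."""

    def __init__(self, post_class="post"):  # "post" is an assumed class name
        super().__init__()
        self.post_class = post_class
        self.depth = 0      # nesting depth of <div>s inside the current post
        self.buffer = []    # text fragments of the current post
        self.posts = []     # finished posts, whitespace-normalised

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1             # nested div inside a post
        elif self.post_class in classes:
            self.depth = 1              # a new post starts here
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:         # the post's own div just closed
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.posts.append(text)

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_posts(html, post_class="post"):
    parser = PostExtractor(post_class)
    parser.feed(html)
    parser.close()
    return parser.posts


sample = """
<div class="post"><span>first</span>   post</div>
<div class="sidebar">not a post</div>
<div class="post wide"><div>nested</div> text</div>
"""
print(extract_posts(sample))
```

With Beautiful Soup the same selection would be a class-based search over the parsed tree; either way, the class names have to be read off the saved source by hand.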


XML Storage

• The data is massive.
• From February 11, 2011 to February 17, 2011 is almost 300k lines of HTML.
• Mining this is not trivial.


XML Storage

• XML = Extensible Markup Language
• Not reliant on any single, particular program.
• Widely used for data storage already.
• XML works by conforming to a schema.
• Easily converted into RDF and other useful storage formats.
• Easy to understand for both humans and machines.
• Can also be stored independently of the data.
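To show what storing a thread as XML might involve, here is a small sketch using Python's standard xml.etree.ElementTree. The element and attribute names (thread, post, author, date) are assumptions made for illustration; the actual schema is not reproduced in this text.

```python
import xml.etree.ElementTree as ET


def thread_to_xml(thread_id, posts):
    """Build a <thread> element from (author, date, text) tuples.

    The layout is hypothetical: one <post> child per message, with the
    anonymised author and the date stored as attributes.
    """
    thread = ET.Element("thread", id=str(thread_id))
    for author, date, text in posts:
        post = ET.SubElement(thread, "post", author=author, date=date)
        post.text = text
    return thread


# Made-up example data; the author is already anonymised.
thread = thread_to_xml(1, [("speaker_001", "2011-02-11", "Tʉlʉʉsɨke Kɨlaangi")])
xml_string = ET.tostring(thread, encoding="unicode")
print(xml_string)
```

Because the output is plain text, it can be read back by any XML parser and checked against a schema kept in a separate file.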


Results

• The largest corpus currently available for Rangi:
  – An Crúbadán crawler: this corpus is 108 documents large, and comprises 17,908 words and 123,354 characters.
• This Facebook corpus:
  – 990 threads, 64,891 words and 571,182 characters.
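By these figures the Facebook corpus has roughly 3.6 times the words and 4.6 times the characters of the crawled one. Counts of this kind can be produced as in the sketch below; it assumes words are whitespace-separated tokens and characters are Unicode code points, which may not match how the numbers above were actually computed.

```python
def corpus_stats(threads):
    """Count threads, whitespace-separated words, and characters.

    `threads` is a list of strings, one per thread (an assumed layout).
    """
    return {
        "threads": len(threads),
        "words": sum(len(t.split()) for t in threads),
        "characters": sum(len(t) for t in threads),
    }


# Size ratios from the reported figures.
word_ratio = 64891 / 17908
char_ratio = 571182 / 123354
print(round(word_ratio, 2), round(char_ratio, 2))

print(corpus_stats(["a bc", "def"]))
```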


Future Work

• Eventually, I hope to make this corpus public.

• Multilingual identification.


THANKS

Questions?

https://github.com/RichardLitt/lrl