Digitizing the Canadian Parliamentary Debates
Kaspar Beelen, Christopher Cochrane, Graeme Hirst, Nona Naderi, Ludovic Rheault, Tanya Whyte
University of Toronto, Department of Computer Science, Department of Political Science
Dilipad Project Background
Digging into Linked Parliamentary Data (Dilipad)
- Tri-National Project
University of Toronto, University of Amsterdam, Institute of Historical Research
- Funded by the Digging into Data Challenge
Dilipad Project Objectives
- Data Creation: Collect, digitize and enrich parliamentary proceedings in a uniform format
- The Netherlands: Tweede Kamer and Senaat (1815-present)
- United Kingdom: House of Commons (1803-present)
- Canada: House of Commons (1900-present)
- Outreach: Lipad.ca
- Indexed and searchable version of the corpus (see the following presentation by Tanya Whyte)
Digitization: Overview
Digitization: Source Material
Enrichment: Why Semantic Annotation?
Conservative
Liberal
Corpus Structure
Proceedings
Dilipad Scheme
Members
Parties
Project Workflow
1. OCR Conversion
2. Structuring Text
3. Linking Data
Step 1: OCR Conversion
From PDF to Plain Text
DEFENCE EXPENDITURE
APPOINTMENT OF SPECIAL COMMITTEE
The House Resumed [...]
Mr. G. C. Nowlan (Annapolis-Kings): Mr. Speaker, I intend to intervene but briefly in his debate. [...]
Mr. Jean Francois Pouliot (Temiscouata): Mr. Speaker, I do not intend to speak today only as a member of parliament or as a member of the Liberal party. [...]
Step 2: Structuring Text
Identifying Patterns
Matching Patterns with Regular Expressions
E.g. wildcards: Canad* matches Canada, Canadian, etc.
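The wildcard idea can be tried out in a few lines. This is only an illustrative sketch: the word list is made up for the example, and the regular expression `Canad\w*` is one possible equivalent of the wildcard, not necessarily the form used in the project.

```python
import fnmatch
import re

# Made-up word list for illustration only.
words = ["Canada", "Canadian", "Candid", "canal"]

# Shell-style wildcard: "*" stands for any run of characters.
wildcard_hits = [w for w in words if fnmatch.fnmatchcase(w, "Canad*")]

# One possible regular-expression equivalent: "Canad" plus any word characters.
regex_hits = [w for w in words if re.fullmatch(r"Canad\w*", w)]

print(wildcard_hits)
print(regex_hits)
```

Both lists contain the same two words here; regular expressions become necessary once the pattern needs more structure than a simple wildcard allows.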
(\n[A-Z]{3,}\n)
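A small sketch of how a topic-title pattern behaves on the sample text. The first pattern is the one shown above; as written it accepts only capital letters, so it matches single-word headings. The second pattern is an assumed variant (not from the slides) that also allows spaces and uses a lookahead so that two consecutive title lines can both be found.

```python
import re

# Sample text, one element per line, taken from the excerpt above.
text = (
    "\nDEFENCE EXPENDITURE\n"
    "APPOINTMENT OF SPECIAL COMMITTEE\n"
    "The House Resumed\n"
    "Mr. G. C. Nowlan (Annapolis-Kings): Mr. Speaker, I intend to intervene "
    "but briefly in his debate.\n"
)

# Pattern from the slide: a line of three or more capital letters, no spaces.
strict = re.compile(r"(\n[A-Z]{3,}\n)")

# Assumed variant: capitals and spaces; the lookahead leaves the trailing
# newline unconsumed so the next title line can still match.
relaxed = re.compile(r"\n([A-Z][A-Z ]{2,})(?=\n)")

strict_titles = strict.findall(text)
titles = relaxed.findall(text)

print(strict_titles)
print(titles)
```

On this excerpt the strict pattern finds nothing because both headings contain spaces, while the relaxed variant returns both heading lines.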
(\nMr\.\s[A-Za-z]+\s(.+?):)
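To see what this speaker pattern captures, it can be run on the sample lines. The sketch below uses the pattern exactly as shown; the three input lines come from the excerpts in this presentation, joined with newlines as an assumption about the OCR layout.

```python
import re

# Speaker pattern as shown on the slide.
pattern = re.compile(r"(\nMr\.\s[A-Za-z]+\s(.+?):)")

text = (
    "\nMr. G. C. Nowlan (Annapolis-Kings): Mr. Speaker, I intend to intervene "
    "but briefly in his debate."
    "\nMr. Jean Francois Pouliot (Temiscouata): Mr. Speaker, I do not intend "
    "to speak today only as a member of parliament or as a member of the "
    "Liberal party."
    "\nMr. Rompkey:"
)

# Collect every speaker header the pattern recognizes.
matched = [m.group(1).strip() for m in pattern.finditer(text)]
print(matched)
```

Only the Pouliot header is recognized: initials with periods ("G. C.") and a bare surname ("Rompkey") fall outside what `[A-Za-z]+\s` expects, which is the kind of variation a more general pattern has to absorb.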
Issues: Variation
Hon. Jean J. Charest (Minister of State (Youth) and Minister of State (Fitness and Amateur Sport and Deputy Leader of the government in the House of Commons)):
Mr. Rompkey:
^((?:Sir|M\.|Mr\.|Mr\,|Hon\.|The\sHon\.|Right\sHon\.|The\sRight\sHon\.|Miss|Mrs\.|Ms\.)\s(?:[A-Zdv][\-\w\.\']{1,25}\s{0,1}){1,4}\s{0,1}(?:\(.+?\)){0,1}\s{0,1}(?:(?:moved:)|(?:moved)|:|;))
Etcetera...
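The longer pattern can be checked against the header variants quoted in this presentation. The sketch below copies the pattern as shown and applies it line by line; the helper name `speaker_prefix` and the line-by-line application are assumptions made for illustration.

```python
import re

# Speaker pattern copied from the slide, split over several raw strings.
SPEAKER = re.compile(
    r"^((?:Sir|M\.|Mr\.|Mr\,|Hon\.|The\sHon\.|Right\sHon\.|The\sRight\sHon\.|Miss|Mrs\.|Ms\.)"
    r"\s(?:[A-Zdv][\-\w\.\']{1,25}\s{0,1}){1,4}\s{0,1}(?:\(.+?\)){0,1}\s{0,1}"
    r"(?:(?:moved:)|(?:moved)|:|;))"
)

def speaker_prefix(line):
    """Return the speaker header at the start of a line, or None."""
    m = SPEAKER.match(line)
    return m.group(1) if m else None

charest = (
    "Hon. Jean J. Charest (Minister of State (Youth) and Minister of State "
    "(Fitness and Amateur Sport and Deputy Leader of the government in the "
    "House of Commons)):"
)

print(speaker_prefix("Mr. Rompkey:"))
print(speaker_prefix("Mr. G. C. Nowlan (Annapolis-Kings): Mr. Speaker, I intend"))
print(speaker_prefix(charest))
print(speaker_prefix("The House Resumed"))
```

The optional parenthesized group is lazy, so it expands only as far as needed to reach a closing bracket followed by a colon; that is what lets the nested ministerial title pass.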
Issues: Changes over time
1888
1955
Issues: OCR Errors
BAILLANiTYNE
BALLAINTYNE
BALLANT1NE
BALLAiNTYNE
iBALiLANTYNE
= BALLANTYNE
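One way to map such OCR variants back to a canonical name is approximate string matching against a list of known names. The sketch below uses Python's standard `difflib` for this; the technique, the `known_names` list, the helper name `normalize`, and the 0.8 cutoff are all assumptions for illustration, not a description of the project's actual method.

```python
import difflib

# Assumed lookup list of canonical surnames (names taken from this presentation).
known_names = ["BALLANTYNE", "NOWLAN", "POULIOT", "ROMPKEY"]

# OCR variants of the same surname, as listed above.
variants = ["BAILLANiTYNE", "BALLAINTYNE", "BALLANT1NE"]

def normalize(name, candidates, cutoff=0.8):
    """Return the closest canonical name above the cutoff, or None."""
    matches = difflib.get_close_matches(name.upper(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

resolved = [normalize(v, known_names) for v in variants]
print(resolved)
print(normalize("SPEAKER", known_names))
```

A cutoff that is too low risks merging different members with similar surnames, so in practice the threshold would need checking against known cases.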
Identifying Patterns
for line in document:
    if preceding line == topic title:
        if next line == speech:
            code line as procedural text
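The rule above can be turned into a small runnable sketch. The two helper patterns (`TITLE` for all-caps lines, `SPEECH` for lines opening with a speaker header) are simplified assumptions, and the label names are chosen for the example only.

```python
import re

# Assumed, simplified detectors for the two neighbouring line types.
TITLE = re.compile(r"^[A-Z][A-Z ]{2,}$")
SPEECH = re.compile(r"^(?:Mr\.|Mrs\.|Ms\.|Miss|Hon\.|Sir)\s.+?:")

def label_lines(lines):
    """Label each line as topic, speech, procedural, or unknown."""
    labels = []
    for i, line in enumerate(lines):
        prev_line = lines[i - 1] if i > 0 else ""
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        if TITLE.match(line):
            labels.append("topic")
        elif SPEECH.match(line):
            labels.append("speech")
        elif TITLE.match(prev_line) and SPEECH.match(next_line):
            # Sandwiched between a topic title and a speech.
            labels.append("procedural")
        else:
            labels.append("unknown")
    return labels

document = [
    "DEFENCE EXPENDITURE",
    "APPOINTMENT OF SPECIAL COMMITTEE",
    "The House Resumed",
    "Mr. G. C. Nowlan (Annapolis-Kings): Mr. Speaker, I intend to intervene but briefly in his debate.",
    "Mr. Jean Francois Pouliot (Temiscouata): Mr. Speaker, I do not intend to speak today only as a member of parliament or as a member of the Liberal party.",
]

labels = label_lines(document)
print(labels)
```

Here "The House Resumed" is labelled procedural because it sits directly between a topic title and the first speech.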
Annotating the Proceedings
Step 3: Linking Data
Disambiguating Entities
Mr. Jean Francois Pouliot (Temiscouata)
Title: Mr.
First Name: Jean Francois
Last Name: Pouliot
Constituency: Temiscouata
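Splitting a speaker header into title, first name, last name, and constituency can be sketched with named groups. The pattern and the helper name `parse_speaker` below are assumptions for illustration; the sketch treats the last word before the bracket as the surname, so multi-word surnames would need extra handling.

```python
import re

# Assumed pattern: title, given names, one-word surname, bracketed constituency.
SPEAKER_FIELDS = re.compile(
    r"^(?P<title>Mr\.|Mrs\.|Ms\.|Miss|Hon\.|Sir)\s+"
    r"(?P<first>.+)\s+(?P<last>\S+)\s+"
    r"\((?P<constituency>[^)]+)\)$"
)

def parse_speaker(header):
    """Return a dict of name fields, or None if the header does not fit."""
    m = SPEAKER_FIELDS.match(header)
    return m.groupdict() if m else None

pouliot = parse_speaker("Mr. Jean Francois Pouliot (Temiscouata)")
nowlan = parse_speaker("Mr. G. C. Nowlan (Annapolis-Kings)")
print(pouliot)
print(nowlan)
```

The separated fields are what make it possible to compare a speaker against a list of members rather than matching the raw header string.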
Adding Information
Mr. Jean Francois Pouliot (Temiscouata)
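Once a speaker is resolved to a single member record, information from that record can be attached to the speech. The sketch below is hypothetical throughout: the `MEMBERS` table, its ids, and the second "Pouliot" record are placeholders invented for the example, and the party value for the first record is taken from the speech excerpt quoted earlier.

```python
# Hypothetical member table; ids and the second record are placeholders.
MEMBERS = [
    {"id": "m1", "first": "Jean Francois", "last": "Pouliot",
     "constituency": "Temiscouata", "party": "Liberal"},
    {"id": "m2", "first": "Example", "last": "Pouliot",
     "constituency": "Example Riding", "party": "Example Party"},
]

def link_speaker(last, constituency, members=MEMBERS):
    """Return the single member matching surname and constituency, else None."""
    hits = [
        m for m in members
        if m["last"].lower() == last.lower()
        and m["constituency"].lower() == constituency.lower()
    ]
    return hits[0] if len(hits) == 1 else None

def enrich(speech, members=MEMBERS):
    """Copy the speech and add member id and party when a link is found."""
    member = link_speaker(speech["last"], speech["constituency"], members)
    enriched = dict(speech)
    if member:
        enriched["member_id"] = member["id"]
        enriched["party"] = member["party"]
    return enriched

speech = {"last": "Pouliot", "constituency": "Temiscouata",
          "text": "Mr. Speaker, I do not intend to speak today only as a member of parliament"}
result = enrich(speech)
print(result["member_id"], result["party"])
```

The surname alone is ambiguous in this toy table; adding the constituency is what narrows the candidates to one record.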
Project Summary
Project Output:
- Structured and enriched parliamentary corpus. Includes all House of Commons debates from 1900 to present.
- Linked to ParlInfo and other knowledge sources such as Wikipedia.
- Easy-to-use and flexible search engine (Lipad.ca).
Dilipad Team
Canada
- Team: Kaspar Beelen, Chris Cochrane, Graeme Hirst, Nona Naderi, Ludovic Rheault, Tanya Whyte
- Interns: Tim Alberdingk-Thijm, Mike Kimmins, Roman Polyanovsky
Netherlands
- Team: Jaap Kamps, Maarten Marx
- Other Contributors: Hosein Azarbonyad, Mostafa Dehghani, Alex Olieman
- Interns: Kees Halvemaan, Sander Lijbrink
United Kingdom
- Team: Jonathan Blaney, Luke Blaxill, Richard Gartner, Paul Seaward, Martin Steer, Jane Winters
Funding Agencies
- Social Sciences and Humanities Research Council (CAN)
- Natural Sciences and Engineering Research Council (CAN)
- Canada Foundation for Innovation (CAN)
- Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NED)
- Arts and Humanities Research Council (UK)
- Economic and Social Research Council (UK)
- National Endowment for the Humanities (USA)
- National Science Foundation (USA)
- Institute of Museum and Library Services (USA)
- Joint Information Systems Committee (UK)
Questions?