Introduction and Overview
Outline
● Goals
● Existing Systems
● The Original Data
● Our Work So Far
● Future Work, including what we will do this summer
Goals
● Open Source Library of Legal NLP Materials
  – Legal Search, Annotated Corpora, Automated Translation, etc.
  – Court Decisions, Legislation, Legal Forms, Codes and Regulations, etc.
● Short Range Goals:
  – Analysis of Legal Decisions from the Supreme Court
  – Expand to Other Appellate Decisions
● Collaboration:
  – Periodic contact with Court Listener; we hope to distribute/share with them
    ● https://www.courtlistener.com/
    ● We get our data from them
  – Possible future work with Cornell Legal Information Institute
    ● https://www.law.cornell.edu/lii/about/who_we_are
    ● They work more with legislation. For example, we may want to link citations in legal decisions to their text
    ● They translate some legal documents (not case decisions) to Spanish
Goals for Today
● Project So Far and Future Directions
● Some Projects that have Already Started
● Intended Future Projects
● Method for Identifying Future Projects
● The beginnings of 1 or more annotation tasks
Existing Systems
● Proprietary Systems: Lexis/Nexis, Westlaw, Bloomberg Law, Casemaker
  – Legal Search
  – Links between citations and legal documents
    ● Including citation graph
  – Sorting by: parties of case, judges, etc.
  – Information about Lexis/Nexis
    ● https://www.youtube.com/watch?v=7jPBrndIMMs
    ● http://www.lexisnexis.com/documents/pdf/20100702053851_large.pdf
    ● https://www.lexisnexis.com/documents/pdf/20150701111022_large.pdf
● Court Listener currently has limited versions of these capabilities, but is freely available
  – We are adapting Court Listener's data
The Data
● U.S. Court Decisions
  – Supreme Court (SCOTUS)
  – Appeals Courts (District Courts, Veterans Appeals)
● Source of Data = Court Listener (Free Law Project, a non-profit organization)
  – Michael Lissner and Brian Carver
● Court Listener provides:
  – Harvested court cases, giving users free access
  – Legal search engine
  – Limited html links between court cases
  – Downloadable json/html markup of text
Our Modified Version of Data
● https://nlp.cs.nyu.edu/meyers/web_of_law.html
● Plain Text Version of the Data
  – xml offset based on original json markup
  – corrections to make cohesive: fixing paragraph splits, separating footnotes, etc.
  – correct (most) encoding errors
● Simple version of citation graph (usable without looking at the cases)
● Simple Manual rule output
  – citations to cases
  – people names, organizations, legal_roles, professions, dates, etc.
  – Equivalence and Role relations
    ● based on apposition and substring
● Supreme Court Only (64K)
● After further work, will expand to other appellate court decisions
Manual Rule-based Entities & Relations
● Regular expressions and local patterns only
● Recognizes Citations to Legal Decisions
  – Standard Citation:
    ● 410 U.S. 113 (1973)
  – X vs. Y:
    ● ROE ET AL. v. WADE, DISTRICT ATTORNEY OF DALLAS COUNTY
  – Other:
    ● "In re Vince"
● Recognizes Other Entities:
  – LEGAL_ROLE: appellants
  – PROFESSION: DISTRICT ATTORNEY OF DALLAS COUNTY
  – NAME: Robert C. Flowers
  – Others: ORGANIZATION, DOCKET, DATE
● Recognizes relations based on apposition and substring
  – Equivalence, party_of case, at_date, etc.
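The regular-expression approach for standard citations can be illustrated with a minimal Python sketch. The pattern and the function name `find_standard_citations` are assumptions made for illustration, not the project's actual rules, and the reporter list is only a small sample.

```python
import re

# Hypothetical pattern for "volume reporter page (year)" citations such as
# "410 U.S. 113 (1973)". The reporter alternatives are an illustrative sample.
STANDARD_CITE = re.compile(
    r"\b(?P<volume>\d+)\s+"
    r"(?P<reporter>U\.\s?S\.|S\.\s?Ct\.|L\.\s?Ed\.\s?2d|F\.\s?Supp\.)\s+"
    r"(?P<page>\d+)"
    r"(?:\s+\((?P<year>\d{4})\))?"
)

def find_standard_citations(text):
    """Return (start, end, fields) for each standard citation in text.

    start/end are character offsets into text; fields is a dict with
    volume, reporter, page and (possibly None) year.
    """
    return [(m.start(), m.end(), m.groupdict())
            for m in STANDARD_CITE.finditer(text)]

sample = "See Roe v. Wade, 410 U.S. 113 (1973), and 93 S.Ct. 705."
for start, end, fields in find_standard_citations(sample):
    print(start, end, sample[start:end], fields)
```

Returning character offsets alongside the parsed fields keeps the output compatible with offset-style annotation.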
Weiyi Lu's System
● Weiyi has modified ICE
  – http://nlp.cs.nyu.edu/ice/
● ICE is designed to produce an IE system (relations and entities) by bootstrapping from a small number of annotated examples.
● Weiyi's system identifies Named Entities and Legal Roles and some relations
● It can import entities (citations) from the manual rule system
Future Manual Rules and Annotation
● Other Entities
  – Citations to legislation are currently not covered by our system
    ● Rules similar to those for case citations could be used
    ● We have a copy of the “Blue Book” and copies exist online
    ● Examples: constitution, statutes, amendments, sections, regulations, …
  – Quotations: wrote a simple quote recognizer yesterday, which could probably be updated
    ● This could be extended to cover “that clauses”
    ● Quotes that are part of the same “that clause” could be grouped together, etc.
● Other Relations
  – Extended Coreference
    ● All members of a party in a case and their representatives form a unit
    ● Citations, authors of decisions, courts the authors represent, etc. form a unit.
  – Attribution
    ● Quotes are attributed to sources (citations, people, organizations)
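As an illustration of what a simple quote recognizer might look like, here is a minimal Python sketch. It is not the recognizer mentioned on this slide; the pattern and the function name `find_quotes` are assumptions, and nested or unbalanced quotation marks are not handled.

```python
import re

# Matches material between straight double quotes or between curly
# double quotes (U+201C ... U+201D). Illustrative only.
QUOTE = re.compile(r'"([^"]+)"|\u201c([^\u201d]+)\u201d')

def find_quotes(text):
    """Return (start, end, quoted_text) triples.

    The offsets cover the quoted material only, excluding the
    quotation marks themselves.
    """
    results = []
    for m in QUOTE.finditer(text):
        group = 1 if m.group(1) is not None else 2
        results.append((m.start(group), m.end(group), m.group(group)))
    return results
```

Each recognized quote could then be paired with a nearby citation, person or organization as a candidate source for the attribution relation.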
Terminology Extraction
● We have adapted Termolator to run on Law Cases
  – http://nlp.cs.nyu.edu/termolator/
  – The .terms files are offset annotations of individual (candidate) terms; these are potential arguments for IE relations
● We have run it using one file (Roe v Wade) as the foreground
● We are adapting it to run efficiently on all 64K Supreme Court cases, using each case as foreground and the whole set as background
  – The hope is that the top N=50 or N=100 terms can be used as topic words for each case.
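The foreground/background idea can be sketched in Python as follows. This is not Termolator's scoring method; it is a simplified stand-in that ranks candidate terms by a smoothed frequency ratio, and the function name `rank_terms` is hypothetical.

```python
from collections import Counter

def rank_terms(foreground_terms, background_terms, top_n=50):
    """Rank candidate terms by how much more frequent they are in the
    foreground document than in the background collection.

    Both arguments are lists of candidate term strings (with repeats).
    Assumption: a simple smoothed ratio, not Termolator's actual metric.
    """
    fg = Counter(foreground_terms)
    bg = Counter(background_terms)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    vocab_size = len(set(fg) | set(bg))

    def score(term):
        # Add-one smoothing avoids division by zero for terms that
        # never occur in the background.
        fg_rate = (fg[term] + 1) / (fg_total + vocab_size)
        bg_rate = (bg[term] + 1) / (bg_total + vocab_size)
        return fg_rate / bg_rate

    return sorted(fg, key=score, reverse=True)[:top_n]
```

Calling `rank_terms` once per case, with all cases pooled as the background, would give a per-case list of top-N candidate topic words.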
Top Terms from Roe v Wade
● medical-legal history
● medical abortion practices
● common-law prosecutions
● roman catholic dogma
● good-faith belief
● definitional deficiencies
● canon-law treatment
● historical statutory development
● common-law scholar
● uniform abortion act
● clinical judgment
● anti-abortion statutes
● anti-abortion mood
● anti-abortion law
● emotional self
● final article
● trend will
● birth control law unconstitutional
● good-faith opinion
● jewish law
● abortion controversy
● canon-law crime
Machine Translation to Spanish
● John Ortega is supervising this
● Esteban Galvis has begun working on this
● Goal:
  – Translate Court Decisions automatically to Spanish
● Obstacle:
  – No parallel text in the court decision domain
● Proposed Solution: Domain Adaptation
  – Train MT system on related domains
    ● Europarl contains Spanish/English text for EU parliament proceedings
    ● Some Spanish/English documents at Cornell (legislation)
    ● Some US government forms exist in Spanish and English
  – Use previously described recognition of citations, names, terminology, etc. to identify “special phrases”
    ● These phrases may be left untranslated
    ● These phrases may be translated as units (e.g., a word inside a special phrase will not form a constituent with a word outside a special phrase)
    ● Some special phrases will be specifically found in existing bilingual dictionaries.
    ● These factors will constrain and improve the resulting MT
  – Legal Dictionary from U of Connecticut: http://jud.ct.gov/external/news/jobs/interpreter/glossary_of_Legal_Terminology_English-to-Spanish.pdf
● Possible Extension: Implement Spanish Version of Termolator (favor term/term alignments)
  – Dictionary of in vocabulary terms
  – Rules for inline terms
  – Adjustment of some encoding filters
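One way to leave special phrases untranslated is to mask them with placeholder tokens before translation and restore them afterwards. The Python sketch below illustrates that option only; the placeholder format and the function names `mask_special_phrases` and `unmask` are hypothetical, and it assumes the MT system passes placeholder tokens through unchanged.

```python
def mask_special_phrases(text, spans):
    """Replace each (start, end) character span with a placeholder token.

    Returns the masked text and a dict mapping each placeholder back to
    the original phrase. Spans are assumed not to overlap.
    """
    mapping = {}
    pieces = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        token = f"__SP{i}__"          # hypothetical placeholder format
        mapping[token] = text[start:end]
        pieces.append(text[cursor:start])
        pieces.append(token)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping

def unmask(translated, mapping):
    """Put the original special phrases back into translated text."""
    for token, phrase in mapping.items():
        translated = translated.replace(token, phrase)
    return translated
```

The spans would come from the citation, name and terminology recognizers described earlier; translating phrases as units or through a bilingual dictionary would need a different mechanism.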
Big Citation Graph
[Figure: citation graph centered on Roe v Wade. Node labels for the cited cases:]
● Baker v. Carr
● Board of Regents v. Roth
● Association of Data Processing Service Organizations, Inc. v. Camp
● Ashwander v. TVA
● Aptheker v. Secretary of State
● Abrams v. Foshee
● Abele v. Markle
● Boyd v. United States
● Buck v. Bell
● Byrne v. Karalexis
● Byrn v. New York City Health & Hospitals Corp.
● Bolling v. Sharpe
● Boyle v. Landry
● Babbitz v. McCann
● Cantwell v. Connecticut
● Carrington v. Rash
● Carroll v. President and Commissioners of Princess Anne
● Carter v. Jury Comm'n
● Cheaney v. State, Ind.
● Commonwealth v. Bangs
● Commonwealth v. Parker
● Corkey v. Edwards
● Court. Yick Wo v. Hopkins
● Crossen v. Attorney General
● Crossen v. Breckenridge
● Doe v. Bolton
● Doe v. Rampton
● Doe v. Scott
Full Citation Graph is Very Large
● Previous slide includes fewer than half of the cases cited in Roe v Wade (with simulated interconnections)
● Non-case citations omitted
  – Legislative documents and Regulations
  – Briefs
  – Other materials
● Other Possible Graphs
  – Rulings and citations by specific judges
  – Rulings and citations by particular courts
  – Rulings and citations sorted by particular issues
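A citation graph of this kind can be represented as a mapping from each citing case to the set of cases it cites. The Python sketch below is illustrative only. The function names are hypothetical, the Roe v Wade edges come from the case list on the previous slide, and the edge from "Hypothetical Case A" is invented to show a second citing document.

```python
from collections import defaultdict

def build_citation_graph(citations):
    """citations: iterable of (citing_case, cited_case) pairs.

    Returns a dict mapping each citing case to the set of cases it cites.
    """
    graph = defaultdict(set)
    for citing, cited in citations:
        graph[citing].add(cited)
    return dict(graph)

def cited_by_counts(graph):
    """Count how many distinct cases cite each cited case."""
    counts = defaultdict(int)
    for cited_cases in graph.values():
        for cited in cited_cases:
            counts[cited] += 1
    return dict(counts)

pairs = [
    ("Roe v Wade", "Baker v. Carr"),
    ("Roe v Wade", "Buck v. Bell"),
    ("Roe v Wade", "Doe v. Bolton"),
    ("Hypothetical Case A", "Buck v. Bell"),  # invented edge for illustration
]
graph = build_citation_graph(pairs)
counts = cited_by_counts(graph)
```

Graphs restricted to a judge, a court or an issue could be built the same way by filtering the input pairs first.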
Doc-Internal Coreference for Roe v Wade
[Figure: coreference diagram linking case records, parties, and their mentions. Recoverable content:]
● Case record: Roe et al v Wade
  – Party1: Roe et al
  – Party2: Henry Wade
  – Dockets: 70-18
  – Years: 1973
  – Reporter: U.S.
  – Volume: 410
  – Page: 113
  – Party1 mentions: ROE ET AL, appellants, Jane Roe, Roe, appellant, ...
  – Party2 mentions: Wade, appellee, Henry Wade
  – Citation mentions: 410 U.S. 113 (1973); 93 S.Ct. 705; 35 L.Ed.2d 147; Jane ROE, et al., Appellants, v. Henry WADE
● Case record: Doe v. Bolton
  – Party1: John and Mary Doe
  – Party2: Bolton
  – Dockets: 70-18
  – Years: 1973
  – Reporter: F. Supp
  – Volume: 319
  – Page: 1048
  – Party1 mentions: John and Mary Doe, Doe, Mrs. Doe, The Does
  – Party2 mentions: Bolton, ...
  – Citation mentions: Doe v. Bolton; 410 U.S. 179; 319 F.Supp. 1048 (N.D.Ga.1970)
● Other citations shown: 402 U. S. 941 (1971); 314 F. Supp. 1217, 1225 (ND Tex. 1970); Lochner v. New York; ...
In-Document Coreference and Citations
● Citations are mentions of documents and have context
  – Version information:
    ● Stages of cases (original case, 1st appeal, nth appeal, Supreme Court)
    ● Revisions of laws
  – Sentiment-like properties: candidates for annotation + ML research
    ● “Reason” for citation: looking for a simple, detectable codification of what a reason for a citation is
      – A topic, a set of topic words or terms, support for a particular party in the case, etc.
● Straightforward common noun coreference
  – Metonymy via legal roles, professions and (maybe) family relations
    ● appellant, defendant, appellee, …
    ● assistant district attorney, officer, justice, …
    ● wife, husband, father, …
  – Uses of nouns from small classes of words
    ● this section, the statute, ...
● Parts of documents
  – sections of laws
  – opinions: majority/binding, concurring, dissenting
    ● overruling, confirming & negating other opinions
● Simplification: who is a “member” or “representative” of a particular party in a case
  – These could be treated as one entity (for purposes of attribution)
● A citation for a case, the court ruling on the case and the author of the binding opinion
  – These could be treated as a single entity (for purposes of attribution)
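Substring-based equivalence between a party's full name and its shorter mentions can be sketched in Python as below. This is an illustration under assumptions, not the project's code: the function name `link_mentions` is hypothetical, matching is by token containment only, and role mentions such as "appellant" are left unlinked because they would need the apposition rules.

```python
def link_mentions(parties, mentions):
    """Link each mention to the first party whose name contains all of the
    mention's tokens (case-insensitive, periods ignored).

    parties: dict mapping a label such as "Party1" to a full name.
    Returns a dict mapping each mention to a party label or None.
    """
    def tokens(s):
        return set(s.lower().replace(".", "").split())

    links = {}
    for mention in mentions:
        mention_tokens = tokens(mention)
        links[mention] = None
        for label, name in parties.items():
            if mention_tokens and mention_tokens <= tokens(name):
                links[mention] = label
                break
    return links

parties = {"Party1": "Jane Roe", "Party2": "Henry Wade"}
links = link_mentions(parties, ["ROE", "Jane Roe", "Wade", "Bolton", "appellant"])
```

Treating all mentions linked to one party as a single entity is what allows a quote attributed to any of them to be attributed to the party as a whole.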
Clustering of Documents
● Clustering of documents according to which other documents they cite
  – Anna Fenske: initial implementation, not yet evaluated
● Clustering of documents based on the words/terms they contain
  – Carly Abraham: planned
● Comparison of these Clusters
● Using co-training to combine methods
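As a rough illustration of citation-based clustering, the Python sketch below groups documents whose sets of cited documents overlap strongly. It is not the implementation mentioned on this slide: the use of Jaccard similarity, single-link merging, the 0.5 threshold, and the function names are all assumptions.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_by_citations(cites, threshold=0.5):
    """Single-link clustering over citation sets.

    cites: dict mapping a document id to the set of documents it cites.
    Two documents are merged into one cluster when their Jaccard
    similarity is at least threshold. Returns a list of sets of ids.
    """
    parent = {doc: doc for doc in cites}

    def find(doc):
        # Follow parent links up to the cluster representative.
        while parent[doc] != doc:
            doc = parent[doc]
        return doc

    for a, b in combinations(cites, 2):
        if jaccard(cites[a], cites[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for doc in cites:
        clusters.setdefault(find(doc), set()).add(doc)
    return list(clusters.values())
```

The same function could be reused for the word/term-based variant by passing each document's set of terms instead of its citations, which would make the two clusterings directly comparable.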
How to Get Further on IE
● Design IE tasks through manual annotation with some pre-processing (preliminary systems)
● Pre-processing systems
  – Python Code (to be released)
    ● Identifies entities and simple relations as above
    ● Identifies quoted phrases
  – Weiyi's JET modification
    ● Identifies entities and relations
    ● Uses different methods, with the possibility of merging output
● I will demo a preliminary attribution task now
Information Extraction Research
● Rounds of annotation, specification correction, and pre-processing, until the task(s) are well-defined.
  – Good opportunity because we have several students
  – Results in a design of an annotation task
● Machine Learning System
  – Using training data from annotation
● Active Learning System
  – More efficient use of training data
● More Manual Rules
  – Short Term Project: Modify existing code to recognize legislation citations
Tasks
● Information Extraction
  – Annotation, Preprocessing, Etc.
  – Short Term: manual rules for legislation
  – Short Term: code for converting automatic annotation to Mae format
  – Extracting Entities needed by MT
  – After consistent annotation, various automatic methods
    ● Including unsupervised/semi-supervised techniques based on IR and deep learning
● Terminology Extraction
  – Work on Spanish Version of Termolator
● Machine Translation
  – Various (including interaction with above)
● Document Clustering
  – Progress further on citation based clustering: evaluation, other methods
  – Terminology/Word based clustering
    ● Use words
    ● Use N-grams
    ● Use terms from .terms files
    ● Use top terms as per Termolator
● Other
Document Distribution
● Initial output at Web of Law Website
● Termolator (from GitHub)
● More Code (in next 1 or 2 days):
  – Code for material on Web of Law Website
  – Initial Mae dtd for quotation task
  – Initial specs
● Mae Program: https://github.com/amber-stubbs/mae-annotation
● Various Dictionaries
● Does everyone have a CS linux account?
● Does everyone have “proteus” access?
Some Standards
● Offset Annotation:
  – Output of systems should “point” to start and end character offsets in the original text
● Encoding
  – The text is UTF-8 encoded, including special characters
  – The annotation should escape problematic SGML/HTML characters using &xxx; style character entities.
  – I will distribute some code that does this in both directions.
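A minimal Python sketch of these two standards follows, using the standard library's `html.escape` and `html.unescape` for the two directions. It is not the code to be distributed; the tag layout and the function name `annotate` are hypothetical.

```python
import html

def annotate(text, start, end, label):
    """Build an offset annotation for text[start:end].

    The offsets refer to the original, unescaped text; only the copy of
    the string stored in the annotation is escaped.
    The tag layout here is a made-up example, not a project format.
    """
    escaped = html.escape(text[start:end], quote=True)
    return '<{0} start="{1}" end="{2}" text="{3}"/>'.format(
        label, start, end, escaped)

original = 'Smith & Jones v. "Acme"'
escaped = html.escape(original, quote=True)   # text -> entities
restored = html.unescape(escaped)             # entities -> text
tag = annotate(original, 0, 13, "NAME")
```

Keeping the offsets tied to the unescaped original means every system's output can be compared directly, regardless of how each one escapes its annotation strings.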