Database Systems Research Group, Heidelberg University
April 22, 2020
Software Practicals, Summer Semester 2020
Slides Online
The slides are available on our webpage: https://dbs.ifi.uni-heidelberg.de/teaching/current/
Organization
Outline
● Overview of topics (today)
  ○ send application for a topic by Monday, April 27, 1 pm
  ○ assignment of topics by April 29
● First milestone (mid/end May)
  ○ prototype/part of software
  ○ summary of research (literature and related systems/tools)
  ○ further milestones in agreement with supervisor
● End of practical (mid/end July)
  ○ code in local Gitlab
  ○ report / documentation as local Wiki document
  ○ presentation / demo of practical and software (10-12 minutes)
Organizational issues
● Application
  ○ by email directly to supervisor
  ○ brief list of relevant courses / prior knowledge / “Anwendungsgebiet” (application area)
  ○ schedule and milestones for the practical
  ○ group work is not possible
  ○ application is binding (don’t apply if you don’t want to do the practical)
● Deadlines
  ○ presentation: planned for the last week of July 2020
  ○ report & Gitlab upload: end of August 2020
  ○ no extension possible
  ○ not finished = failed (grade 5.0)
Assessment
● Credit points (Leistungspunkte)
  ○ Beginners Practical (IAP, 2+4 ECTS) [Bachelor students]
    ■ workload: 180 h (~1 ½ days/week)
  ○ Advanced Practical (IFP, 8 ECTS)
    ■ workload: 240 h (~2 days/week)
● Grading based on
  ○ code (readability, structure, functionality)
  ○ documentation (README, comments)
  ○ commitment and self-reliance
  ○ cool ideas!!
● IMPORTANT
  ○ talk to / communicate with your advisor
Supervisors
● Michael Gertz (MG)
● Satya Almasian (SA)
● Dennis Aumiller (DA)
● Philip Hausner (PH)
Project Topics
Overview of Topics
1. Implement Citation Extraction in spaCy, BP/AP, (Aumiller)
2. Outline Generation for Wikipedia Articles, AP, (Aumiller/Almasian)
3. Analysis of RNV Delays, BP/AP, (Aumiller/Hausner)
4. Time-dependent analysis of COVID-19 case development, BP/AP, (Hausner)
5. Time-dependent Political Twitter Analysis, AP, (Hausner)
6. Annotating Numerical Relations in News Articles, AP, (Almasian)
7. Numerical Word Co-occurrence Networks (extension), BP/AP, (Almasian)
8. YouTube Video Comment Extractor and Exploration, AP, (Gertz)
9. Extraction and Management of Bundestag Documents, BP/AP, (Gertz)
BP/AP: Implement Citation Extraction in spaCy (DA)
Given:
1. Rule-based extraction algorithm by Openlegaldata.io
2. Dataset of ~1,000 manually annotated references
Tasks:
• Transfer functionality to spaCy’s rule-based entity extractor
• Publish package that makes this easily usable in spaCy
Subtasks:
• Create detailed flow-chart of existing RegEx coverage
Languages / Tools:
• Python; spaCy; RegEx
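To make the idea of rule-based RegEx extraction concrete, here is a minimal sketch using only Python's standard `re` module. It assumes references shaped like "§ 433 Abs. 1 BGB"; the pattern, the group names, and the function `extract_citations` are invented for illustration and are not the rules used by Openlegaldata.io or by spaCy.

```python
import re

# Hypothetical pattern (an assumption, not the Openlegaldata.io rule set):
# "§ <section> [Abs. <paragraph>] <law abbreviation>"
CITATION = re.compile(
    r"§\s*(?P<section>\d+[a-z]?)"            # section number, e.g. 433 or 823a
    r"(?:\s+Abs\.\s*(?P<paragraph>\d+))?"    # optional paragraph ("Abs. 1")
    r"\s+(?!Abs\.)(?P<law>[A-ZÄÖÜ][A-Za-zÄÖÜäöü]*)"  # law abbreviation, e.g. BGB
)


def extract_citations(text):
    """Return (start, end, section, paragraph, law) for every match in text."""
    return [
        (m.start(), m.end(), m.group("section"), m.group("paragraph"), m.group("law"))
        for m in CITATION.finditer(text)
    ]
```

The character offsets in each tuple are what a span-based entity extractor would need in order to turn a match into an annotated entity.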
AP: Outline Generation for Wikipedia Articles (DA/SA)
Given:
1. Cleaned dataset of articles from Wikipedia
2. Paper by Zhang et al. [1]
Tasks:
• Implement efficient data loader
• Try to reproduce training results from the paper
• Implement alternative scoring (RAND score, etc.)
Subtasks:
• Learn details about implementation and investigate improvements
• Investigate evaluation metrics
Languages / Tools:
• Python; PyTorch; Neural Networks (!!)
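As an illustration of the alternative scoring mentioned above, the following sketch computes the plain (unadjusted) Rand index between two label assignments. It assumes that "RAND score" refers to the Rand index and that an outline can be represented as one section label per paragraph; both are assumptions, and the function name `rand_index` is chosen here for illustration.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two labelings agree.

    A pair counts as an agreement if both labelings put the two items in the
    same group, or both put them in different groups.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:  # fewer than two items: nothing to disagree on
        return 1.0
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)
```

For example, `rand_index([0, 0, 1, 1], [0, 0, 1, 1])` is 1.0, since the two assignments group every pair identically.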
BP/AP: Analysis of RNV Delays (DA/PH)
Given:
1. Start.Info API (RNV API) [1]
2. Previous outside project: RNV Monitor [2]
Tasks:
• Crawl all data (not just delays)
• Broader analysis of delays (daytime, line, etc.)
• Create time-dependent geographical heat map
Subtasks:
• Compare results to RNV Monitor dump
• Create suitable database schema
Languages / Tools:
• Python; REST API; SQL
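A possible starting point for the database schema and the delay analysis by line and time of day, sketched with Python's built-in `sqlite3`. The table `departure`, its columns, and the demo rows are invented for illustration; the actual fields delivered by the RNV API are not shown on the slides.

```python
import sqlite3

# Hypothetical schema: one row per observed departure.
SCHEMA = """
CREATE TABLE departure (
    id        INTEGER PRIMARY KEY,
    line      TEXT    NOT NULL,
    station   TEXT    NOT NULL,
    planned   TEXT    NOT NULL,  -- ISO 8601 timestamp
    delay_min INTEGER NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database filled with made-up demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    rows = [  # invented values, not real RNV data
        ("5", "Stop A", "2020-05-04 08:05:00", 2),
        ("5", "Stop A", "2020-05-04 08:35:00", 4),
        ("21", "Stop B", "2020-05-04 17:10:00", 0),
    ]
    conn.executemany(
        "INSERT INTO departure (line, station, planned, delay_min) VALUES (?, ?, ?, ?)",
        rows,
    )
    return conn


def mean_delay_by_line_and_hour(conn):
    """Average delay in minutes, grouped by line and hour of day."""
    query = """
        SELECT line,
               CAST(strftime('%H', planned) AS INTEGER) AS hour,
               AVG(delay_min)
        FROM departure
        GROUP BY line, hour
        ORDER BY line, hour
    """
    return conn.execute(query).fetchall()
```

Storing one row per departure keeps the raw observations intact, so further groupings (weekday, station, etc.) only need a different `GROUP BY`.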
BP/AP: Time-dependent Analysis of COVID-19 (PH)
Given:
1. Public data set for Germany [1]
2. Reference work from RKI [2]
Tasks:
• Crawl data set
• Identify locations with high increase of case numbers
• Create time-dependent geographical heat map
Subtasks:
• Create suitable database schema
• Structure in time-dependent fashion
Languages / Tools:
• Python; JavaScript (vis.js); REST API; SQL
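One simple way to identify locations with a high increase of case numbers is to compare the latest cumulative count with the count one window earlier. The sketch below assumes cumulative daily counts per location; the function name, the 7-day window, and the 50 % threshold are arbitrary choices for illustration, not values taken from the data set or from RKI.

```python
def high_increase_locations(cumulative, window=7, min_growth=0.5):
    """Return (location, relative growth) pairs, highest growth first.

    cumulative maps a location name to its list of cumulative case counts,
    one entry per day. Growth is measured over the last `window` days.
    """
    flagged = []
    for location, series in cumulative.items():
        if len(series) <= window:
            continue  # not enough history for this location
        before, now = series[-window - 1], series[-1]
        if before <= 0:
            continue  # relative growth is undefined without earlier cases
        growth = (now - before) / before
        if growth >= min_growth:
            flagged.append((location, growth))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

Relative growth favors locations with small absolute numbers, so a real analysis would probably combine it with a minimum case count or a per-capita measure.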
BP/AP: Time-dependent Political Twitter Analysis (PH)
Given:
1. Twitter data
Tasks:
• Structure information around creation dates of Twitter posts
• Identify important topics for certain dates
• Take into account all terms or only hashtags
Subtasks:
• Investigate different weighting schemes
Languages / Tools:
• Python; SQL
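As one example of a weighting scheme, the sketch below treats all posts of a date as a single document and ranks terms with a basic TF-IDF weight, so terms that occur on every date score zero. TF-IDF is only one candidate scheme chosen here for illustration; the function name and the input format of `(date, terms)` pairs are assumptions.

```python
import math
from collections import Counter, defaultdict


def top_terms_per_date(posts, k=3):
    """Rank terms per date by tf * log(number of dates / dates containing term).

    posts is an iterable of (date_string, list_of_terms) pairs; the terms may
    be all tokens of a post or only its hashtags.
    """
    term_counts = defaultdict(Counter)
    for date, terms in posts:
        term_counts[date].update(terms)

    n_dates = len(term_counts)
    date_frequency = Counter()
    for counts in term_counts.values():
        date_frequency.update(counts.keys())

    result = {}
    for date, counts in term_counts.items():
        scored = {
            term: tf * math.log(n_dates / date_frequency[term])
            for term, tf in counts.items()
        }
        # Sort by descending score, break ties alphabetically.
        ranked = sorted(scored.items(), key=lambda item: (-item[1], item[0]))
        result[date] = [term for term, _ in ranked[:k]]
    return result
```

Passing only hashtags instead of all terms changes nothing in the code, which makes the two variants from the task list easy to compare.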
AP: Annotating Numerical Relations in News Articles (SA)
Given:
1. Corpus of economic news articles
Tasks:
• Extract high-confidence relations that contain numerical information from news articles
• Apply Named Entity Disambiguation to the entities and numbers
• Save the annotated dataset in MongoDB
Subtasks:
• Get familiar with OpenIE for information extraction
• Use AIDA for Named Entity Disambiguation
• Detect quantities with Illinois Quantifier
Languages / Tools:
• Python; MongoDB; brief knowledge of Java is also recommended
BP/AP: Numerical Word Co-occurrence Networks (SA)
Given:
1. English Wikipedia corpus
Tasks:
• Improve an existing pipeline that builds a word co-occurrence graph from the sentences containing numerical information
• Enhance the NER (using MetaMap from UMLS)
• Enhance the numerical extractor (using Illinois Quantifier)
Subtasks:
• Explore the distribution of the numerical values with respect to the surrounding words to extract valid ranges
Languages / Tools:
• Python; scikit-learn; brief knowledge of Java is also recommended
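The core idea of a co-occurrence graph restricted to sentences with numerical information can be sketched in a few lines. This is not the existing pipeline: the regular expressions, the lowercasing, and the function name `cooccurrence_counts` are simplifying assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations

NUMBER = re.compile(r"\d+(?:[.,]\d+)*")              # e.g. 300, 1,000, 3.5
TOKEN = re.compile(r"[A-Za-z]+|\d+(?:[.,]\d+)*")     # words and numbers


def cooccurrence_counts(sentences):
    """Count how often two tokens share a sentence that contains a number.

    Returns a Counter keyed by alphabetically ordered token pairs, which can
    be read as the weighted edge list of an undirected graph.
    """
    edges = Counter()
    for sentence in sentences:
        if not NUMBER.search(sentence):
            continue  # skip sentences without numerical information
        tokens = sorted({token.lower() for token in TOKEN.findall(sentence)})
        edges.update(combinations(tokens, 2))
    return edges
```

Grouping the numeric tokens by their neighboring words in this edge list is one possible first step toward exploring value distributions.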
AP: YouTube Comment Extractor/Exploration (MG)
Given:
1. Existing pipeline to extract comments from YouTube
2. Comprehensive documentation of the data
Tasks:
• Implement Web-based dashboard to view comment statistics
• Provide Web-based search interface on comments
Subtasks:
• Port pipeline to Elasticsearch
• Decide which features to realize in dashboard
• Develop search methods for comments
Languages / Tools:
• Python; Elasticsearch
AP: Bundestag Documents (MG)
Given:
1. Printed papers (Drucksachen) and plenary protocols (Plenarprotokolle) [DIPBT]
Tasks:
• (Adaptive) crawler for printed papers
• Store the documents in Solr (structured)
• Faceted search on documents via a Web front end
Subtasks:
• Data model for documents
• Model for faceted search
Languages / Tools:
• Python; Solr
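To illustrate what a faceted-search model has to deliver, the sketch below filters documents by selected facet values and counts the remaining values per facet field. It is plain Python and does not use Solr; the function name and the example fields `type` and `period` are hypothetical, not the actual data model of the Bundestag documents.

```python
from collections import Counter


def facet_counts(documents, fields, filters=None):
    """Apply facet filters and count values of each facet field in the hits.

    documents is a list of dicts, fields the facet fields to count, and
    filters an optional dict of field -> required value.
    """
    filters = filters or {}
    hits = [
        doc for doc in documents
        if all(doc.get(field) == value for field, value in filters.items())
    ]
    counts = {
        field: Counter(doc[field] for doc in hits if field in doc)
        for field in fields
    }
    return hits, counts
```

A search engine computes these counts server-side; the sketch only shows which information the front end needs for each facet.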