Research Internships Advanced Research and Modeling Research Group

  • Slide 1

    Research Internships, Advanced Research and Modeling Research Group

  • Slide 2: ADReM What?
      • Research group that deals with computational aspects of data:
          • Databases
          • Data mining
          • Information retrieval

  • Slide 3: ADReM Who?
      • DB/DM/IR: Floris Geerts, Bart Goethals, Martin Theobald
      • Bioinformatics: Kris Laukens, Tim Van den Bulcke
      • Plus PhD students and postdoctoral researchers
      • http://adrem.ua.ac.be/adrem

  • Slide 4: Internships What?
      • 2 research internships (15 credits each) and an MSc thesis (30 credits).
      • Goal: an internship is an initiation to research, carried out in collaboration with researchers in ADReM.
      • 15 credits is a lot, so an internship is time-consuming! 1 credit = 15 hours of work.
      • Balance your course load and your internship well.
      • Internships are not necessarily related to your MSc thesis (but they can be).
      • In an MSc thesis, your ability to do research independently plays an important role.

  • Slide 5: Internships Who?
      • Everyone who follows the research option in the database MSc program.

  • Slide 6: Research
      • In an internship you need to:
          1. Understand a specific problem
          2. Implement an (existing) method for solving the problem
          3. Test and evaluate
          4. Write a report
      • MSc thesis: you also have to solve the problem yourself by designing new methods.

  • Slide 7: Internships in a company
      • You are allowed to do an internship in a company, but you have to ask permission.
      • You have to find the company yourself and convince us that there is research involved.
      • You cannot receive any money from the company during your internship.

  • Slide 8: Databases, data mining, information retrieval
      • These are not separate research domains.
      • The internship topics that each of us will present next are usually on the intersection of these areas.
      • Let's see some example topics.
  • Slide 9: Bart Goethals

  • Slide 10: Recommender Systems
      • Implement state-of-the-art recommenders
      • Pattern mining for better recommendations
      • Interactive recommendation
      • Explaining recommendations
      • Test recommenders on real data

  • Slide 11: Visual Instant Interactive Pattern Mining
      • Study visualizations enabling interactive pattern mining
      • Implement and experiment with novel instant mining methods

  • Slide 12: Pattern-based Clustering
      • Implement and evaluate different techniques for clustering-based pattern mining and pattern-based clustering

  • Slide 13: Data Mining for Cleaning
      • Study and experiment with data mining methods for data cleaning

  • Slide 14: Martin Theobald

  • Slide 15: Information Extraction (I): Wikipedia Infoboxes

  • Slide 16: Information Extraction (I): Infoboxes
      • Extracted facts:
          bornOn(Jeff, 09/22/42)
          gradFrom(Jeff, Columbia)
          hasAdvisor(Jeff, Arthur)
          hasAdvisor(Surajit, Jeff)
          knownFor(Jeff, Theory)
      • YAGO/DBpedia et al.: >120 M facts for YAGO2 (mostly from Wikipedia infoboxes)

  • Slide 17: Information Extraction (II): Wikipedia Categories

  • Slide 18: ?

  • Slide 19: RDF Knowledge Bases
      • http://www.mpi-inf.mpg.de/yago-naga/
      • [Figure: excerpt of the YAGO knowledge graph around the entities Max_Planck, Erwin_Planck, Angela Merkel, Kiel, Schleswig-Holstein, Germany, and Max_Planck Society, with classes such as Entity, Person, Scientist, Physicist, Biologist, Politician, Location, City, State, Country, and Organization, connected by subclass, instanceOf, means, bornOn, diedOn, bornIn, hasWon, fatherOf, citizenOf, and locatedIn edges]
      • Accuracy 95%
      • 3 Mio. entities, 120 Mio. facts
      • 100 relations, 200k classes

  • Slide 20: Linked Open Data
      • As of Sept. 2011:
          • > 200 sources
          • > 30 billion RDF triples
          • > 400 million links
      • http://linkeddata.org/

  • Slide 21
      • As of Sept. 2011: > 5 million owl:sameAs links between DBpedia/YAGO/Freebase

  • Slide 22: IBM Watson: Deep Question Answering
      • Example clues:
          • 99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain
          • This town is known as "Sin City" & its downtown is "Glitter Gulch"
          • William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel
          • As of 2010, this is the only former Yugoslav republic in the EU
      • Components: YAGO knowledge back-ends; question classification & decomposition
      • www.ibm.com/innovation/us/watson/index.htm
      • D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010.

  • Slide 23: Jeopardy!
      • A big US city with two airports, one named after a World War II hero, and one named after a World War II battle field?

  • Slide 24: Structured Knowledge Queries
      • Question: A big US city with two airports, one named after a World War II hero, and one named after a World War II battle field?
      • Structured query:
          Select Distinct ?c Where {
            ?c type City. ?c locatedIn USA.
            ?a1 type Airport. ?a2 type Airport.
            ?a1 locatedIn ?c. ?a2 locatedIn ?c.
            ?a1 namedAfter ?p. ?p type WarHero.
            ?a2 namedAfter ?b. ?b type BattleField.
          }
      • Use manually created templates for mapping sentence patterns to structured queries.
      • Works for factoid and list questions.
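To make the structured query from Slide 24 concrete, here is a minimal Python sketch that evaluates the same ten triple patterns over a tiny in-memory triple set by backtracking. The triples, the entity names (`OHare`, `Edward_OHare`, and so on), and the helpers `match`, `solve`, and `select_distinct` are assumptions made up for illustration; they are not taken from YAGO or from any query engine mentioned in the slides.

```python
# Toy triples (subject, predicate, object). Illustrative data only, not YAGO.
TRIPLES = {
    ("Chicago", "type", "City"),
    ("Chicago", "locatedIn", "USA"),
    ("OHare", "type", "Airport"),
    ("Midway", "type", "Airport"),
    ("OHare", "locatedIn", "Chicago"),
    ("Midway", "locatedIn", "Chicago"),
    ("OHare", "namedAfter", "Edward_OHare"),
    ("Edward_OHare", "type", "WarHero"),
    ("Midway", "namedAfter", "Battle_of_Midway"),
    ("Battle_of_Midway", "type", "BattleField"),
    ("Boston", "type", "City"),
    ("Boston", "locatedIn", "USA"),
    ("Logan", "type", "Airport"),
    ("Logan", "locatedIn", "Boston"),
}

# The triple patterns of the query on the slide; "?x" marks a variable.
QUERY = [
    ("?c", "type", "City"), ("?c", "locatedIn", "USA"),
    ("?a1", "type", "Airport"), ("?a2", "type", "Airport"),
    ("?a1", "locatedIn", "?c"), ("?a2", "locatedIn", "?c"),
    ("?a1", "namedAfter", "?p"), ("?p", "type", "WarHero"),
    ("?a2", "namedAfter", "?b"), ("?b", "type", "BattleField"),
]


def match(pattern, triple, binding):
    """Return binding extended so that pattern equals triple, or None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against its existing value.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def solve(patterns, binding=None):
    """Yield every variable binding that satisfies all patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in TRIPLES:
        extended = match(first, triple, binding)
        if extended is not None:
            yield from solve(rest, extended)


def select_distinct(var, patterns):
    """Distinct values of one variable over all solutions, sorted."""
    return sorted({b[var] for b in solve(patterns)})


print(select_distinct("?c", QUERY))  # prints ['Chicago']
```

Boston is rejected because its only airport in the toy data has no `namedAfter` fact. A real engine would use indexes and join ordering instead of scanning every triple for every pattern.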
  • Slide 25: Mining Rules from RDF Knowledge Bases
      • Goal: inductively learn (soft) rules, e.g. livesIn(x,y) :- bornIn(x,y)
      • A-priori-style pre-filtering of low-support join patterns
      • Dynamic programming ILP algorithm
      • Learning with constants and type constraints
      • Closed World Assumption: strongly penalizes the rule
      • Specificity: avoid producing overly general rules
      • Use a combination of statistical measures
      • Confidence instead of accuracy: do not penalize the rule for unseen entities
      • [Figure labels: "Our solution", "Overly general", "Refine by types"; sets G (ground truth for livesIn, only partially known), KB (knowledge base for livesIn, known positive examples), and R (facts produced by the rule, only partially correct); a second panel with ground truth for bornIn (partially known) and facts produced by the rule (only partially true)]

  • Slide 26: Rule-based Reasoning: (Soft) Deduction Rules vs. (Hard) Consistency Constraints
      • People may live in more than one place:
          livesIn(x,y) ∧ marriedTo(x,z) → livesIn(z,y)
          livesIn(x,y) ∧ hasChild(x,z) → livesIn(z,y)
      • People are not born in different places/on different dates:
          bornIn(x,y) ∧ bornIn(x,z) → y=z
      • People are not married to more than one person (at the same time, in most countries?):
          marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z → disjoint(t1,t2)
      • Weights shown on the slide: [0.8], [0.5]

  • Slide 27: Probabilistic RDF Database
      • Rules:
          hasAdvisor(x,y) ∧ worksAt(y,z) → graduatedFrom(x,z) [0.4]
          graduatedFrom(x,y) ∧ graduatedFrom(x,z) → y=z
      • Base facts:
          graduatedFrom(Surajit, Princeton) [0.7]
          graduatedFrom(Surajit, Stanford) [0.6]
          graduatedFrom(David, Princeton) [0.9]
          hasAdvisor(Surajit, Jeff) [0.8]
          hasAdvisor(David, Jeff) [0.7]
          worksAt(Jeff, Stanford) [0.9]
          type(Princeton, University) [1.0]
          type(Stanford, University) [1.0]
          type(Jeff, Computer_Scientist) [1.0]
          type(Surajit, Computer_Scientist) [1.0]
          type(David, Computer_Scientist) [1.0]
      • Query: graduatedFrom(Surajit, y)
      • Answers: Q1 = graduatedFrom(Surajit, Princeton), Q2 = graduatedFrom(Surajit, Stanford)
      • [Figure: lineage of the answers as an ∧/∨ tree over graduatedFrom(Surajit, Princeton) [0.7], hasAdvisor(Surajit, Jeff) [0.8], worksAt(Jeff, Stanford) [0.9], and graduatedFrom(Surajit, Stanford) [0.6]; lineage formula A ∧ ¬(B ∨ (C ∧ D))]
      • Probability computation:
          0.8 × 0.9 = 0.72
          1 - (1 - 0.72) × (1 - 0.6) = 0.888
          0.7 × (1 - 0.888) = 0.078
          (1 - 0.7) × 0.888 = 0.266

  • Slide 28: Temporal Knowledge

  • Slide 29: Probabilistic-Temporal Consistency Reasoning
      • [Figure: the base facts playsFor(Beckham, Real, T1) and playsFor(Ronaldo, Real, T2) are drawn as probability histograms over time intervals; the derived fact teamMates(Beckham, Ronaldo, T3) is obtained from playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1, T2) and is shown as a state relation over the time points t3]

  • Slide 30: Topics for Internships & Master Theses
      • Research internships:
          • Preparation & integration of Linked Data sources for scientific experiments (SQL/Java/Python)
          • Mining association rules from Linked Data (Java/C++)
          • Visualization frontend for Linked Data (ActionScript & Adobe Flash)
      • Master theses:
          • Implementation of a distributed rule-based query engine for RDF data (C++ & Message Passing Interface)
          • Implementation of a distributed factor graph model for correlated RDF facts (C++ & Message Passing Interface)
          • Faceted search and interactive browsing for Linked Data

  • Slide 31: Floris Geerts

  • Slide 32: RDBMS-based recommendation systems
      • Example: find the top-3 flights from EDI to NYC with at most one stop
          • Items: flights
          • Selection criteria: relational queries
          • Utility function: in terms of price and duration (for ranking)
      • Other items: books, music, news, Web sites, research papers, ...
      • [Figure: top-k item selection; items are filtered by the selection criteria and ranked by the utility function to produce the top-k items]

  • Slide 33: Query relaxation
      • Query for a 5-day holiday:
          Q(f#, name, type, ticket, time) = ∃ DT, AT, AD, xTo ( flight(f#, EDI, xTo, DT, 5/19/2012, AT, AD, Pr) ∧ POI(name, xTo, type, ticket, time) ∧ xTo = NYC )
      • There is no direct flight from EDI to NYC.
      • E = { EDI, NYC, 4/1/2012 }, X = { xTo }
      • Relaxation: cities within 15 miles of EDI or NYC are acceptable (a valid query relaxation):
          Q1(f#, name, type, ticket, time) = ∃ DT, AT, AD, uTo, wEdi, wNYC, wDD ( flight(f#, wEdi, xTo, DT, wDD, AT, AD, Pr) ∧ xTo = wNYC ∧ POI(name, uTo, type, ticket, time) ∧ wDD = 5/19/2012 ∧ dist(wNYC, NYC) ≤ 15 ∧ dist(wEdi, EDI) ≤ 15 ∧ xTo = uTo )
      • Further relaxation: departure dates within 3 days of 5/19/2012 are acceptable:
          dist(wDD, 5/19/2012) ≤ 3

  • Slide 34: Topics
      • Top-k query answering algorithm on top of an RDBMS
      • Query relaxation approaches and query completion

  • Slide 35: Data quality
      • Detecting and correcting inconsistencies
      • Finding duplicates
      • Finding most up-to-date information

  • Slide 36: Semantic errors
      • Yahoo! Finance: Day's Range: 93.80-95.71; 52wk Range: 25.38-95.71
      • Nasdaq: 52 Wk: 25.38-93.72; Day's Range: 93.80-95.71

  • Slide 37: Instance ambiguity

  • Slide 38: Out-of-Date Data
      • 4:05 pm vs. 3:57 pm

  • Slide 39: Unit errors
      • 76,821,000 vs. 76.82B

  • Slide 40: Topics
      • Fast inconsistency detection
      • Duplication elimination algorithms
      • Automated repairing algorithms
      • Mining of data quality rules
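As a small illustration of the unit-error example above (76,821,000 vs. 76.82B), the following Python sketch normalizes two reported values and flags them when they differ by roughly a power of 1000. The suffix table, the 1% tolerance, and the function names `parse_amount` and `looks_like_unit_error` are assumptions chosen for this example, not a method described in the slides, and the sketch only handles positive values.

```python
import math

# Assumed suffix conventions for this example: thousands, millions, billions.
SUFFIX = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_amount(text):
    """Parse strings such as '76,821,000' or '76.82B' into a float."""
    text = text.strip().replace(",", "")
    scale = SUFFIX.get(text[-1].upper())
    if scale is not None:
        return float(text[:-1]) * scale
    return float(text)


def looks_like_unit_error(a, b, tolerance=0.01):
    """True if two positive values differ by about a non-zero power of 1000."""
    ratio = max(a, b) / min(a, b)
    exponent = round(math.log(ratio, 1000))
    if exponent == 0:
        # Same order of magnitude: an ordinary disagreement, not a unit mix-up.
        return False
    return abs(ratio / 1000 ** exponent - 1) <= tolerance


source_a = parse_amount("76,821,000")
source_b = parse_amount("76.82B")
print(looks_like_unit_error(source_a, source_b))  # prints True
```

Here the two values agree to within rounding once one of them is scaled by 1000, which suggests that one source reports the figure in the wrong unit rather than that the underlying numbers differ.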