23
CS698F M. Atre Introduction Course Structure Academic Integrity Relational Data Query Optimization Challenges Graph Data Challenges - Graphs Solutions Course Flow Advanced Data Management Medha Atre Office: KD-219 [email protected] July 28, 2016

Advanced Data Management - cs.ox.ac.uk · Lab University Ghent Open Data Ecuador Geo Ecuador Serendipity UTPL LOD GovAgriBus Denmark DBpedia live URI ... RKB Explorer cordis Govtrack

Embed Size (px)

Citation preview

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Advanced Data Management

Medha Atre

Office: [email protected]

July 28, 2016

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Goals

Learn tools and techniques of the “big data” management.

Big data encompasses relational data, graph data, textdata, and any other data (media etc).

“Graph shaped” data has become predominant as itcaptures the ideology of internet and social networks – anetwork of information, persons, objects, and things ingeneral.

E.g., DBPedia1 is a “graphical” representation of theWikipedia network. UniProt2 is a graphical representationproteins and genes network, MusicBrainz3 is a graphicalrepresentation of music database.

1http://wiki.dbpedia.org/

2http://www.uniprot.org/

3https://musicbrainz.org/

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Goals

Is this graph data of any significance?

DBPedia (Wikipedia network in RDF format, pivotal in thesuccess of IBM Watson supercomputer in the Jeopardychallenge, along with the Yago RDF network)4

Adaption of RDFa by Google, and RDF by Facebook.

Understand the challenges of management of this datafrom the “scalability” point of view – how would youmanage such a huge data? Would you use

one commodity computer, ora cluster of commodity computers, ora supercomputer?

4http://en.wikipedia.org/wiki/Watson (computer)#Data

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

A glimpse of the Linked Data network

Linked Datasets as of August 2014

Uniprot

AlexandriaDigital Library

Gazetteer

lobidOrganizations

chem2bio2rdf

MultimediaLab University

Ghent

Open DataEcuador

GeoEcuador

Serendipity

UTPLLOD

GovAgriBusDenmark

DBpedialive

URIBurner

Identifiers

EionetRDF

lobidResources

WiktionaryDBpedia

Viaf

Umthes

RKBExplorer

Courseware

Opencyc

Olia

Gem.Thesaurus

AudiovisueleArchieven

DiseasomeFU-Berlin

Eurovocin

SKOS

DNBGND

Cornetto

Bio2RDFPubmed

Bio2RDFNDC

Bio2RDFMesh

IDS

OntosNewsPortal

AEMET

ineverycrea

LinkedUser

Feedback

MuseosEspaniaGNOSS

Europeana

NomenclatorAsturias

Red UnoInternacional

GNOSS

GeoWordnet

Bio2RDFHGNC

CticPublic

Dataset

Bio2RDFHomologene

Bio2RDFAffymetrix

MuninnWorld War I

CKAN

GovernmentWeb Integration

forLinkedData

Universidadde CuencaLinkeddata

Freebase

Linklion

Ariadne

OrganicEdunet

GeneExpressionAtlas RDF

ChemblRDF

BiosamplesRDF

IdentifiersOrg

BiomodelsRDF

ReactomeRDF

Disgenet

SemanticQuran

IATI asLinked Data

DutchShips and

Sailors

Verrijktkoninkrijk

IServe

Arago-dbpedia

LinkedTCGA

ABS270a.info

RDFLicense

EnvironmentalApplications

ReferenceThesaurus

Thist

JudaicaLink

BPR

OCD

ShoahVictimsNames

Reload

Data forTourists in

Castilla y Leon

2001SpanishCensusto RDF

RKBExplorer

Webscience

RKBExplorerEprintsHarvest

NVS

EU AgenciesBodies

EPO

LinkedNUTS

RKBExplorer

Epsrc

OpenMobile

Network

RKBExplorerLisbon

RKBExplorer

Italy

CE4R

EnvironmentAgency

Bathing WaterQuality

RKBExplorerKaunas

OpenData

Thesaurus

RKBExplorerWordnet

RKBExplorer

ECS

AustrianSki

Racers

Social-semweb

Thesaurus

DataOpenAc Uk

RKBExplorer

IEEE

RKBExplorer

LAAS

RKBExplorer

Wiki

RKBExplorer

JISC

RKBExplorerEprints

RKBExplorer

Pisa

RKBExplorer

Darmstadt

RKBExplorerunlocode

RKBExplorer

Newcastle

RKBExplorer

OS

RKBExplorer

Curriculum

RKBExplorer

Resex

RKBExplorer

Roma

RKBExplorerEurecom

RKBExplorer

IBM

RKBExplorer

NSF

RKBExplorer

kisti

RKBExplorer

DBLP

RKBExplorer

ACM

RKBExplorerCiteseer

RKBExplorer

Southampton

RKBExplorerDeepblue

RKBExplorerDeploy

RKBExplorer

Risks

RKBExplorer

ERA

RKBExplorer

OAI

RKBExplorer

FT

RKBExplorer

Ulm

RKBExplorer

Irit

RKBExplorerRAE2001

RKBExplorer

Dotac

RKBExplorerBudapest

SwedishOpen Cultural

Heritage

Radatana

CourtsThesaurus

GermanLabor LawThesaurus

GovUKTransport

Data

GovUKEducation

Data

EnaktingMortality

EnaktingEnergy

EnaktingCrime

EnaktingPopulation

EnaktingCO2Emission

EnaktingNHS

RKBExplorer

Crime

RKBExplorercordis

Govtrack

GeologicalSurvey of

AustriaThesaurus

GeoLinkedData

GesisThesoz

Bio2RDFPharmgkb

Bio2RDFSabiorkBio2RDF

Ncbigene

Bio2RDFIrefindex

Bio2RDFIproclass

Bio2RDFGOA

Bio2RDFDrugbank

Bio2RDFCTD

Bio2RDFBiomodels

Bio2RDFDBSNP

Bio2RDFClinicaltrials

Bio2RDFLSR

Bio2RDFOrphanet

Bio2RDFWormbase

BIS270a.info

DM2E

DBpediaPT

DBpediaES

DBpediaCS

DBnary

AlpinoRDF

YAGO

PdevLemon

Lemonuby

Isocat

Ietflang

Core

KUPKB

GettyAAT

SemanticWeb

Journal

OpenlinkSWDataspaces

MyOpenlinkDataspaces

Jugem

Typepad

AspireHarperAdams

NBNResolving

Worldcat

Bio2RDF

Bio2RDFECO

Taxon-conceptAssets

Indymedia

GovUKSocietal

WellbeingDeprivation imd

EmploymentRank La 2010

GNULicenses

GreekWordnet

DBpedia

CIPFA

Yso.fiAllars

Glottolog

StatusNetBonifaz

StatusNetshnoulle

Revyu

StatusNetKathryl

ChargingStations

AspireUCL

Tekord

Didactalia

ArtenueVosmedios

GNOSS

LinkedCrunchbase

ESDStandards

VIVOUniversityof Florida

Bio2RDFSGD

Resources

ProductOntology

DatosBne.es

StatusNetMrblog

Bio2RDFDataset

EUNIS

GovUKHousingMarket

LCSH

GovUKTransparencyImpact ind.Households

In temp.Accom.

UniprotKB

StatusNetTimttmy

SemanticWeb

Grundlagen

GovUKInput ind.

Local AuthorityFunding FromGovernment

Grant

StatusNetFcestrada

JITA

StatusNetSomsants

StatusNetIlikefreedom

DrugbankFU-Berlin

Semanlink

StatusNetDtdns

StatusNetStatus.net

DCSSheffield

AtheliaRFID

StatusNetTekk

ListaEncabezaMientosMateria

StatusNetFragdev

Morelab

DBTuneJohn PeelSessions

RDFizelast.fm

OpenData

Euskadi

GovUKTransparency

Input ind.Local auth.Funding f.

Gvmnt. Grant

MSC

Lexinfo

StatusNetEquestriarp

Asn.us

GovUKSocietal

WellbeingDeprivation ImdHealth Rank la

2010

StatusNetMacno

OceandrillingBorehole

AspireQmul

GovUKImpact

IndicatorsPlanning

ApplicationsGranted

Loius

Datahub.io

StatusNetMaymay

Prospectsand

TrendsGNOSS

GovUKTransparency

Impact IndicatorsEnergy Efficiency

new Builds

DBpediaEU

Bio2RDFTaxon

StatusNetTschlotfeldt

JamendoDBTune

AspireNTU

GovUKSocietal

WellbeingDeprivation Imd

Health Score2010

LoticoGNOSS

UniprotMetadata

LinkedEurostat

AspireSussex

Lexvo

LinkedGeoData

StatusNetSpip

SORS

GovUKHomeless-

nessAccept. per

1000

TWCIEEEvis

AspireBrunel

PlanetDataProject

Wiki

StatusNetFreelish

Statisticsdata.gov.uk

StatusNetMulestable

Enipedia

UKLegislation

API

LinkedMDB

StatusNetQth

SiderFU-Berlin

DBpediaDE

GovUKHouseholds

Social lettingsGeneral Needs

Lettings PrpNumber

Bedrooms

AgrovocSkos

MyExperiment

ProyectoApadrina

GovUKImd CrimeRank 2010

SISVU

GovUKSocietal

WellbeingDeprivation ImdHousing Rank la

2010

StatusNetUni

Siegen

OpendataScotland Simd

EducationRank

StatusNetKaimi

GovUKHouseholds

Accommodatedper 1000

StatusNetPlanetlibre

DBpediaEL

SztakiLOD

DBpediaLite

DrugInteractionKnowledge

BaseStatusNet

Qdnx

AmsterdamMuseum

AS EDN LOD

RDFOhloh

DBTuneartistslast.fm

AspireUclan

HellenicFire Brigade

Bibsonomy

NottinghamTrent

ResourceLists

OpendataScotland SimdIncome Rank

RandomnessGuide

London

OpendataScotland

Simd HealthRank

SouthamptonECS Eprints

FRB270a.info

StatusNetSebseb01

StatusNetBka

ESDToolkit

HellenicPolice

StatusNetCed117

OpenEnergy

Info Wiki

StatusNetLydiastench

OpenDataRISP

Taxon-concept

Occurences

Bio2RDFSGD

UIS270a.info

NYTimesLinked Open

Data

AspireKeele

GovUKHouseholdsProjectionsPopulation

W3C

OpendataScotland

Simd HousingRank

ZDB

StatusNet1w6

StatusNetAlexandre

Franke

DeweyDecimal

Classification

StatusNetStatus

StatusNetdoomicile

CurrencyDesignators

StatusNetHiico

LinkedEdgar

GovUKHouseholds

2008

DOI

StatusNetPandaid

BrazilianPoliticians

NHSJargon

Theses.fr

LinkedLifeData

Semantic WebDogFood

UMBEL

OpenlyLocal

StatusNetSsweeny

LinkedFood

InteractiveMaps

GNOSS

OECD270a.info

Sudoc.fr

GreenCompetitive-

nessGNOSS

StatusNetIntegralblue

WOLD

LinkedStockIndex

Apache

KDATA

LinkedOpenPiracy

GovUKSocietal

WellbeingDeprv. ImdEmpl. Rank

La 2010

BBCMusic

StatusNetQuitter

StatusNetScoffoni

OpenElection

DataProject

Referencedata.gov.uk

StatusNetJonkman

ProjectGutenbergFU-BerlinDBTropes

StatusNetSpraci

Libris

ECB270a.info

StatusNetThelovebug

Icane

GreekAdministrative

Geography

Bio2RDFOMIM

StatusNetOrangeseeds

NationalDiet Library

WEB NDLAuthorities

UniprotTaxonomy

DBpediaNL

L3SDBLP

FAOGeopolitical

Ontology

GovUKImpact

IndicatorsHousing Starts

DeutscheBiographie

StatusNetldnfai

StatusNetKeuser

StatusNetRusswurm

GovUK SocietalWellbeing

Deprivation ImdCrime Rank 2010

GovUKImd Income

Rank La2010

StatusNetDatenfahrt

StatusNetImirhil

Southamptonac.uk

LOD2Project

Wiki

DBpediaKO

DailymedFU-Berlin

WALS

DBpediaIT

StatusNetRecit

Livejournal

StatusNetExdc

Elviajero

Aves3D

OpenCalais

ZaragozaTurruta

AspireManchester

Wordnet(VU)

GovUKTransparency

Impact IndicatorsNeighbourhood

Plans

StatusNetDavid

Haberthuer

B3Kat

PubBielefeld

Prefix.cc

NALT

Vulnera-pedia

GovUKImpact

IndicatorsAffordable

Housing Starts

GovUKWellbeing lsoa

HappyYesterday

Mean

FlickrWrappr

Yso.fiYSA

OpenLibrary

AspirePlymouth

StatusNetJohndrink

Water

StatusNetGomertronic

Tags2conDelicious

StatusNettl1n

StatusNetProgval

Testee

WorldFactbookFU-Berlin

DBpediaJA

StatusNetCooleysekula

ProductDB

IMF270a.info

StatusNetPostblue

StatusNetSkilledtests

NextwebGNOSS

EurostatFU-Berlin

GovUKHouseholds

Social LettingsGeneral Needs

Lettings PrpHousehold

Composition

StatusNetFcac

DWSGroup

OpendataScotland

GraphSimd Rank

DNB

CleanEnergyData

Reegle

OpendataScotland SimdEmployment

Rank

ChroniclingAmerica

GovUKSocietal

WellbeingDeprivation

Imd Rank 2010

StatusNetBelfalas

AspireMMU

StatusNetLegadolibre

BlukBNB

StatusNetLebsanft

GADMGeovocab

GovUKImd Score

2010

SemanticXBRL

UKPostcodes

GeoNames

EEARodAspire

Roehampton

BFS270a.info

CameraDeputatiLinkedData

Bio2RDFGeneID

GovUKTransparency

Impact IndicatorsPlanning

ApplicationsGranted

StatusNetSweetie

Belle

O'Reilly

GNI

CityLichfield

GovUKImd

Rank 2010

BibleOntology

Idref.fr

StatusNetAtari

Frosch

Dev8d

NobelPrizes

StatusNetSoucy

ArchiveshubLinkedData

LinkedRailway

DataProject

FAO270a.info

GovUKWellbeing

WorthwhileMean

Bibbase

Semantic-web.org

BritishMuseum

Collection

GovUKDev LocalAuthorityServices

CodeHaus

Lingvoj

OrdnanceSurveyLinkedData

Wordpress

EurostatRDF

StatusNetKenzoid

GEMET

GovUKSocietal

WellbeingDeprv. imdScore '10

MisMuseosGNOSS

GovUKHouseholdsProjections

totalHouseolds

StatusNet20100

EEA

CiardRing

OpendataScotland Graph

EducationPupils by

School andDatazone

VIVOIndiana

University

Pokepedia

Transparency270a.info

StatusNetGlou

GovUKHomelessness

HouseholdsAccommodated

TemporaryHousing Types

STWThesaurus

forEconomics

DebianPackageTrackingSystem

DBTuneMagnatune

NUTSGeo-vocab

GovUKSocietal

WellbeingDeprivation ImdIncome Rank La

2010

BBCWildlifeFinder

StatusNetMystatus

MiguiadEviajesGNOSS

AcornSat

DataBnf.fr

GovUKimd env.

rank 2010

StatusNetOpensimchat

OpenFoodFacts

GovUKSocietal

WellbeingDeprivation Imd

Education Rank La2010

LODACBDLS

FOAF-Profiles

StatusNetSamnoble

GovUKTransparency

Impact IndicatorsAffordable

Housing Starts

StatusNetCoreyavisEnel

Shops

DBpediaFR

StatusNetRainbowdash

StatusNetMamalibre

PrincetonLibrary

Findingaids

WWWFoundation

Bio2RDFOMIM

Resources

OpendataScotland Simd

GeographicAccess Rank

Gutenberg

StatusNetOtbm

ODCLSOA

StatusNetOurcoffs

Colinda

WebNmasunoTraveler

StatusNetHackerposse

LOV

GarnicaPlywood

GovUKwellb. happy

yesterdaystd. dev.

StatusNetLudost

BBCProgram-

mes

GovUKSocietal

WellbeingDeprivation Imd

EnvironmentRank 2010

Bio2RDFTaxonomy

Worldbank270a.info

OSM

DBTuneMusic-brainz

LinkedMarkMail

StatusNetDeuxpi

GovUKTransparency

ImpactIndicators

Housing Starts

BizkaiSense

GovUKimpact

indicators energyefficiency new

builds

StatusNetMorphtown

GovUKTransparency

Input indicatorsLocal authorities

Working w. tr.Families

ISO 639Oasis

AspirePortsmouth

ZaragozaDatos

AbiertosOpendataScotland

SimdCrime Rank

Berlios

StatusNetpiana

GovUKNet Add.Dwellings

Bootsnall

StatusNetchromic

Geospecies

linkedct

Wordnet(W3C)

StatusNetthornton2

StatusNetmkuttner

StatusNetlinuxwrangling

EurostatLinkedData

GovUKsocietal

wellbeingdeprv. imd

rank '07

GovUKsocietal

wellbeingdeprv. imdrank la '10

LinkedOpen Data

ofEcology

StatusNetchickenkiller

StatusNetgegeweb

DeustoTech

StatusNetschiessle

GovUKtransparency

impactindicatorstr. families

Taxonconcept

GovUKservice

expenditure

GovUKsocietal

wellbeingdeprivation imd

employmentscore 2010

Linking Open Data cloud diagram 2014, by M. Schmachtenberg, C. Bizer, A. Jentzsch, R. Cyganiak http://lod-cloud.net/

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Course Structure

50% – Course project, supposed to be innovative, chosenfrom the cutting-edge research topic. A list of researchpapers of interesting topics will be referred to by theinstructor. Students may choose an outside topic of theirinterest as long as it is relevant to the broader theme ofthe course. Breakup of the grades:

5% – A written proposal with a literature review.30% – Implementation and execution on the Linuxenvironment.15% – Demonstrable evaluation results.

All the course projects need to be well defined and needinstructor’s approval by the middle of the semester.

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Course Structure

Course projects can be done in the groups of 1, 2, or 3students together.

Groups of size more than 3 need special approval. Such agroup needs to demonstrate a significant component of theproject by the middle of the semester to get the approval.

Recommended size of a group is 2 or 3.

15% – Bi-weekly (once in 2 weeks) assignments.

15% – Mid-semester State-of-the-art paper presentation.

20% – Final detailed course project report (expected to bewritten as professionally as possible like a research paper).

Office hours: Mon-Thur 12-1pm (else by prior appointment).Course webpage: www.cse.iitk.ac.in/users/atrem/courses/cs698f2016fall/

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Academic Integrity

We value academic integrity and honesty above good grades.

Every student is required to go through the CSE departmentacademic integrity (anti-cheating) policy given onhttp://www.cse.iitk.ac.in/pages/AntiCheatingPolicy.html

After the add/drop deadline, each registered student will berequired to sign an acknowledgement of having read this policy.

There will be 0 tolerance shown towards plagiarism, copying5,and any other dishonest means.

5Using open source codes or components of them with properacknowledgement is allowed.

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Relational Data

Figure taken from https://docs.oracle.com/.

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

SQL Queries

Find description and brand of all the products in all the storesin the New York city.

SELECT PRODUCT.Description, PRODUCT.BrandWHERE STORE.City=“New York” ANDSTORE.Store key=SALES FACT.Store key ANDSALES FACT.Product key=PRODUCT.Product key

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Query Planning and Optimization

Cost based estimation for selecting the best query plan.

Costs are in terms of number of rows (disk blocks) youhave to access to perform a join operation, whether thetable has any indexes on it etc.

In a nested loop join, suppose STORE has S pages and tStuples per page, and SALES FACT has F pages and tF tuplesper page, then with STORE as the outer relation andSALES FACT as the inner relation, the cost of the joinbetween the two will be: S + tS ∗ S ∗ F .

In a block nested loop join with B buffer pages, the cost of thisjoin will be: S + F ∗ d S

B−2e where 2 buffer pages are reservedfor the inner relation SALES FACT.

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Query Optimization – well known techniques

Create indexes on frequently joined columns.

Hash-joins.

Pipelining of data through the operators to avoid creationof temporary intermediate relation – popular method inthe contemporary database systems.

Creation of join-indices for frequently used joins.

Pre-join pruning of data based on the schema of the dataor other techniques such as semi-joins.

Column-wise storage instead of traditional row-wisestorage.

Data compression – column compression, compressedbit-vectors, delta encoding techniques.

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Challenges

When the difference between the size of the disk residentdata and available main memory becomes larger.

When the data cannot be stored on the single disk on asingle computer.

When cost-estimation methods of query planning arerendered ineffective due to heavy skew in the data –common in the “graph shaped data”.

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Challenges

E.g., A graph with 1 billion edges (109) cannot be fullyloaded in memory even on a machine with 32GB mainmemory. Its disk resident size in the uncompressed form isabout 400GB, e.g., UniProt network.

Cost of a PC with 8GB RAM, 1TB disk Rs. 30,000.

Cost of a PC with 32GB RAM 1TB disk is Rs. 1,75,280.

If your data processing requires cumulative X amount ofmemory and Y amount of disk space:

(cost(PC (X ,Y ))× 1)� (cost(PC (X

n,Y

n))× n)

Where n is the number of commodity compute nodes. That iswhat Google is doing through their server farms!

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Challenges

When the data gets distributed on a farm of servers,instead of a single server, the cost computation changes.

Now network communication cost has to be included whileplanning a query.

Let us go back to our example of the join between STORE andSALES FACT tables. Assuming none of the join data isavailable locally, and STORAGE table has to be communicatedacross the entire cluster of 5 compute nodes, an approximatecost computation of the join will be:

5 ∗ (S

5+ 4 ∗ 2 ∗ Comm(

S

5) + tS ∗ S ∗

F

5)

We can do better than this... and in this course we will seehow!

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Graph data of TV sitcoms

:Jerry

:Larry :Julia

:Veep:CurbYourEnthu :Seinfeld :NewAdvOldChristine

:LosAngeles :D.C. :Jersey:NewYorkCity

:hasFriend :hasFriend

:actedIn

:actedIn:actedIn

:actedIn

:location :location :location :location

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Graph data and queries

Data

:Jerry :hasFriend :Larry

:Jerry :hasFriend :Julia

:Larry :actedIn :CurbYourEnthu

:Julia :actedIn :Seinfeld

:Julia :actedIn :Veep

:Julia :actedIn :CurbYourEnthu

:Julia :actedIn :NewAdvOldChristine

:Seinfeld :location :NewYorkCity

:Veep :location :D.C.

:CurbYourEnthu :location :LosAngeles

:NewAdvOldChristine :location :Jersey

Graphical Representation

:Jerry

:Larry :Julia

:Veep:CurbYourEnthu :Seinfeld :NewAdvOldChristine

:LosAngeles :D.C. :Jersey:NewYorkCity

:hasFriend :hasFriend

:actedIn

:actedIn:actedIn

:actedIn

:location :location :location :location

SPARQLSELECT ?friend ?sitcom WHERE {

:Jerry :hasFriend ?friend .

?friend :actedIn ?sitcom .

?sitcom :location :NewYorkCity .

}

Eqv. SQL query

SELECT t1.o, t2.o from rdf as t1, rdf as

t2, rdf as t3 WHERE t1.s=“:Jerry” and

t1.p=“:hasFriend” and t2.p=“:actedIn”

and t3.p=“:location” and

t3.o=“:NewYorkCity” and t1.o=t2.s and

t2.o=t3.s

:Jerry

?sitcom

:NewYorkCity

:hasFriend

:actedIn

:location

?friend

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Query Evaluation Results

Data

:Jerry :hasFriend :Larry

:Jerry :hasFriend :Julia

:Larry :actedIn :CurbYourEnthu

:Julia :actedIn :Seinfeld

:Julia :actedIn :Veep

:Julia :actedIn :CurbYourEnthu

:Julia :actedIn :NewAdvOldChristine

:Seinfeld :location :NewYorkCity

:Veep :location :D.C.

:CurbYourEnthu :location :LosAngeles

:NewAdvOldChristine :location :Jersey

:Jerry

:Larry :Julia

:Veep:CurbYourEnthu :Seinfeld :NewAdvOldChristine

:LosAngeles :D.C. :Jersey:NewYorkCity

:hasFriend :hasFriend

:actedIn

:actedIn:actedIn

:actedIn

:location :location :location :location

Query

SELECT ?friend ?sitcom WHERE {:Jerry :hasFriend ?friend .

?friend :actedIn ?sitcom .

?sitcom :location :NewYorkCity .

}

:Jerry

?sitcom

:NewYorkCity

:hasFriend

:actedIn

:location

?friend

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Graphs – Types of problems

Pattern matching, a.k.a., joins.

Optional pattern matching, where part of the pattern ismatch is optional, a.k.a., left-outer-joins.

Path pattern queries, hard problems –

Given a regular expression, find all pairs of nodes that haveat least one path matching that expression – recursive joinqueries.Given a regular expression, and a pair of nodes, find allpaths that satisfy that expression.

Reachability queries – static and dynamic graphs.

Keyword search over graphs.

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Graphs – Challenges

Lacks schema, hence standard techniques of normalizationas in relational DBs not applicable.

Has a heavy skew in the data – a few nodes with very highdegree, and a large number of nodes with very less degree.

Lots of strongly connected components and hence cycles.

Distribution strategies need to take care of graphstructure, to avoid large cuts – methods like METIS6 helpin efficient graph partitioning.

6http://glaros.dtc.umn.edu/gkhome/metis/metis/overview

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Solution space – general purpose

Apache Hadoop – Map Reduce framework.

Apache SPARK – in-memory Hadoop like framework.

H20 – Big data analysis.

Apache Mahout – for substrate independent big scalemachine learning tasks, typically run on Hadoop, SPARK,H20, Flink.

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Solution space – relational data

Single compute node:

Row-wise stores – MySQL, PostgreSQL, Many othercommercial stores like IBM DB2.Column-wise stores – MonetDB, Virtuoso.

Cluster computing:

Cloudera, Cassandra, BigTable, HBase, HyperTable(inspired by BigTable).

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Solution space – graph data

Single compute node:

BitMat7 [Atre2008, Atre2010, Atre2015, Atre2016] – forpattern matching, a.k.a. joins.RDF-3X [Neumann2010]Neo4j, HypergraphDB.

Cluster computing:

Pregel by Google (Apache Giraph as opensource version)Trinity by Microsoft, GraphX library (in Apache SPARK)H-RDF-3X [Huang2011], SHARD [Rohloff2011], TriAD[Gurajada2014].

7http://bitmatrdf.sourceforge.net

CS698F

M. Atre

Introduction

CourseStructure

AcademicIntegrity

RelationalData

QueryOptimization

Challenges

Graph Data

Challenges -Graphs

Solutions

Course Flow

Flow of the course

Go over graph data processing systems on a single nodecompute environment, to understand the problems andsolutions in detail.

Through that understand the scalability problems.

Gradually move towards distributed systems, andunderstand the solutions for scalability.

Through the above exercises, identify problems in thespace of both – scalability of the existing solutions andproblems that lack good solutions even on the single nodecompute environment.

Introduction to open and/or advanced problems overgraph and relational data processing for further research.