Mobile Computing, Internet of Things, and Big Data for Urban Informatics

Anirban Mondal (Shiv Nadar University, India)
Praveen Rao (University of Missouri-Kansas, USA)
Sanjay Kumar Madria (Missouri University of Science & Technology, Rolla, USA)

Outline of the Talk

1. Urban Informatics: Challenges & Opportunities for Mobile Computing
• Applications of mobile data management & IoT in urban informatics
• Mobile crowdsourcing/crowdsensing-based data management issues, including incentive mechanisms
• Mobile and IoT data integration

2. Big Data Analytics for Mobile Computing & IoT
• Scalable data analytics frameworks for batch processing
• Scalable data analytics frameworks for real-time processing

3. Indexing & Query Processing over Mobile Big Data for Urban Informatics

Part 1 - Urban Informatics: Challenges & Opportunities for Mobile Data Management

Interesting facts about urbanization (Source: NYU CUSP)

~80% of the U.S. population and ~50% of the world’s population live in urban areas

Growth rate: Over 1 million people per week

By 2050, 64% of people in the developing countries and 85% of people in the developed world will live in urban areas

http://data-informed.com/urban-informatics-putting-big-data-to-work-in-our-cities/

Ever-increasing demands on urban infrastructure

What is Urban Informatics?

Urban informatics uses data to better understand how cities work
• This understanding can remedy a wide range of issues affecting the everyday lives of citizens and the long-term health and efficiency of cities, from morning commutes to emergency preparedness to air quality.
• Source: http://cusp.nyu.edu/urban-informatics/

Urban informatics is the study, design, and practice of urban experiences across different urban contexts
• that are created by new opportunities of real-time, ubiquitous technology and the augmentation that mediates the physical and digital layers of people networks and urban infrastructures.
• Source: Foth, Choi & Satchell 2011 http://www.urbaninformatics.net/

Urban computing is an interdisciplinary field which pertains to the study and application of computing technology in urban areas
• This involves the application of wireless networks, sensors, computational power, and data to improve the quality of densely populated areas.
• Source: Wikipedia https://en.wikipedia.org/wiki/Urban_computing

Key themes in Urban Informatics (Source: McKinsey)

Using existing city data to improve efficiency

Building new data for better operational and planning decisions

Increasing public engagement in problem solving (mobile crowdsourcing focus)

http://mckinseyonsociety.com/emerging-trends-in-urban-informatics/

Waste Management

Transportation

Disaster Management

Healthcare

Smart Cities Application Domain

Waste Management

Two key themes
• How to manage waste collection, distribution & routing?
• Reducing the amount of waste created in the first place

Traditional approach
• Collectors/trucks physically go to dumpsters to check the trash levels at fixed times

Problems with the traditional approach
• Potential trash bin overflows
• If trash bins are not yet full, the trip is an inefficient waste of time & fuel

How often to pick up the trash?
• Use sensors/IoT in conjunction with a dashboard
• Can significantly reduce labor costs of janitorial services

Source: http://www.link-labs.com/smart-waste-management/


Waste Management by Enevo

Enevo is an innovative waste management company
• Offers a proprietary dumpster sensor and software system; the sensor is placed on the lids of garbage receptacles

The system reports the trash level of a given garbage container

Also performs predictive analytics about when a dumpster will be full
• Facilitates route planning in advance



Waste management by SmartBin

Source: https://www.smartbin.com/

IoT-based smart monitoring using the Ultrasonic Level Sensor (UBi)
• most widely deployed fill-level sensor

SmartBin Live is the IoT game changer for collectors and distributors
• helps monitor containers
• plan for route optimization

Reducing waste creation

Millions of pounds of garbage generated annually
• Municipalities pay a lot of money for waste removal, landfills etc.

Waste creation also relates to purchasing habits and the inability to predict usage
• In the U.S., consumers waste ~133 billion pounds of food annually
• Grocery stores/supermarkets can prevent significant amounts of waste through asset tracking & management, JIT inventory management etc. (perishables)

Smart refrigerators can alert consumers when food is about to spoil
• Relates to purchasing habits of consumers



Smart Parking: S Oil’s HERE

Clever use of balloons to reduce search time for open parking spots

Demonstrates that you don’t always need hi-tech to solve real problems

Source: http://freakonomics.com/2013/08/16/how-to-save-time-hunting-for-a-parking-spot-south-korea-edition/

Android App for weather, travel & news

StreetJournal.org: Crowdsourcing platform for effectively solving city problems

Mobile users report problems with streets, e.g., potholes, street lights etc.

Ushahidi Platform for Beijing Transport Planning

Mapunity: technology for social problem-solving

Mobile Applications for Smart Cities

Crowdsourcing-based traffic & navigation app

IBM Smart Cities

Wearable Technology Market Forecast
• Wearable electronics market worldwide: $20 billion in 2015 to almost $70 billion in 2025
• Dominant sector for wearables: healthcare, with medical, fitness and wellness management
• Promising new developments:

Health and Well-being in Elderly
• Keeping the elderly population healthy, safe and in the freedom of their own homes for a few more years is a win for everyone

Win-Win-Win!

Ever-increasing popularity and proliferation of mobile technology

P2P

Users can collect a wide variety of data using mobile devices

Mobile-P2P Computing a.k.a. Mobile Crowdsourcing


Real-world Crowd-driven Mobile Data Collection

Mobile-P2P Applications
• Which shops here sell running shoes? Prices? Discounts?
• Italian restaurants nearby here with lunch specials? How crowded? Ambience?
• Find an available nearby parking slot
• Where can I find the nearest car-repairing station with a short queue?
• Mobile group buying: Want to buy discounted Levis jeans?
• Anyone wants to share a cab ride?

Source: www.istanbul-airport.info

Mobile-P2P Cab Sharing Application

Shuo Ma, Yu Zheng, Ouri Wolfson. Real-Time City-Scale Taxi Ridesharing. IEEE Trans. Knowl. Data Eng. 27(7): 1782-1795 (2015)

Specialized contextual knowledge/information in Mobile-P2P environments
• Lots of context embedded in humans and/or devices
• Context is spatio-temporal + many other dynamic attributes that represent specialized semantic knowledge

The user essentially has a better CONTEXT & human judgment with which to answer queries
• due to being located at that space at that time + his own knowledge base

It goes beyond merely using a mobile device for getting a query answered

Observations for Mobile-P2P

Research Challenges in Data Management

1. Sharing information & free-riding
2. Heterogeneous, multi-modal sensor (including human sensors) data integration
3. Scalability & handling Big Data

Incentive schemes for Mobile Environments (MANETs to Mobile-P2P)

Nodes get nuglets to forward packets
• A nuglet is a unit of virtual currency

Packet Purse Model
• Packet sender loads nuglets into the packet
• Relay nodes take nuglets from the packet when forwarding
• Packet is dropped when it has no more nuglets

Packet Trade Model
• Destination pays the cost of packet forwarding
• Senders pay nothing: spamming by senders is possible
• Each relay node earns some nuglets

Assumes a tamper-resistant hardware module at each node
• To maintain the integrity of the nuglet counter at each node
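The Packet Purse Model can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name, the per-relay fee values, and the rule "drop when the purse cannot cover the next fee" are illustrative choices, not details taken from the Nuglets papers.

```python
def forward_packet_purse(nuglets_loaded, relay_fees):
    """Simulate one packet crossing a path of relays (Packet Purse Model sketch).

    nuglets_loaded: nuglets the sender loads into the packet.
    relay_fees: nuglets each relay on the path takes when forwarding
                (hypothetical values chosen by the caller).
    Returns (delivered, hops_completed, nuglets_left).
    """
    purse = nuglets_loaded
    hops = 0
    for fee in relay_fees:
        if purse < fee:
            # The purse cannot pay this relay, so the packet is dropped here.
            return False, hops, purse
        purse -= fee  # the relay takes its nuglets from the packet
        hops += 1
    return True, hops, purse
```

For example, a packet loaded with 3 nuglets crosses three relays charging 1 nuglet each, while a packet loaded with 2 nuglets is dropped at the third relay.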

Stimulating nodes to forward messages in a MANET

L. Buttyan and J. Hubaux. Nuglets: a virtual currency to stimulate cooperation in self-organized mobile ad hoc networks. Technical Report DSC/2001/001, Swiss Federal Institute of Technology,Lausanne, 2001

L. Buttyan and J.P. Hubaux. Stimulating cooperation in self-organizing mobile ad hoc networks. ACM/Kluwer Mobile Networks and Applications, 8(5), 2003.

Sprite is a cheat-proof, credit-based system for mobile ad hoc networks
• A node pays others for forwarding its messages
• Does not require tamper-proof hardware

Each selfish node tries to maximize its welfare
• Welfare = benefit - cost

When a node receives a message, the node keeps a receipt of the message
• Later, when the node has a fast connection, it reports to a Credit Clearance Service (CCS)
• The CCS determines the charge and credit to each node based on reported receipts

Sprite: Forwarding Incentives in a MANET

S. Zhong, J. Chen, and Y.R. Yang. Sprite: A simple, cheat-proof, credit-based system for mobile ad-hoc networks. Proc. IEEE INFOCOM, 2003

Assumes that the users are selfish, but rational
• “Pay for service” model of cooperation

Each router constitutes a smart market
• where an auction process runs continuously to determine who should obtain how much bandwidth and at what price

Bidders are traffic flows passing through that router
• iPass assumes secure payment/accounting

iPass uses the generalized Vickrey auction with reserve pricing
• Truthful bidding of utility
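To see why a Vickrey-style auction encourages truthful bidding, consider the simplest case: a single-item, sealed-bid, second-price auction with a reserve price. This is a simplification for illustration, not the generalized multi-unit auction that iPass runs over bandwidth; the function name and the flow identifiers are hypothetical.

```python
def second_price_auction(bids, reserve):
    """Single-item second-price auction with a reserve price (sketch).

    bids: dict mapping a flow id to its bid.
    reserve: minimum acceptable bid.
    Returns (winner, price), or (None, None) if no bid meets the reserve.
    """
    # Only bids at or above the reserve price take part.
    eligible = {flow: bid for flow, bid in bids.items() if bid >= reserve}
    if not eligible:
        return None, None
    ranked = sorted(eligible.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    # The winner pays the second-highest eligible bid, or the reserve
    # if no other bid met it. The price never depends on the winner's own bid.
    price = ranked[1][1] if len(ranked) > 1 else reserve
    return winner, price
```

Because the winner's payment is set by the other bids, overstating or understating one's own utility cannot lower the price paid, which is the intuition behind truthful bidding.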

iPass: Cooperative packet forwarding in a MANET via incentive-based auctions

K. Chen and K. Nahrstedt. iPass: an incentive compatible auction scheme to enable packet forwarding service in MANET. Proc. ICDCS, 2004

A mathematical framework for user cooperation
• Defines strategies for optimal user behavior
• Trade-off: node lifetime vs network throughput

Generous TIT-FOR-TAT (GTFT) algorithm
• used by the nodes to decide whether to accept or reject a relay request

GTFT algorithm (Non-Cooperative Game Theory)
• Based on the prisoner’s dilemma idea
• Each player mimics the action of the other player in the previous game
• Occasionally, each player helps out even if the other player had not helped in the previous game
• Nodes maintain a record of past experiences

Does not address
• Malicious nodes
• Algorithms for collecting system-wide info
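The mimic-plus-occasional-generosity rule described above can be sketched as follows. This is the textbook form of generous tit-for-tat, not the exact acceptance rule of Srinivasan et al.; the `generosity` probability and its default value are assumptions made for illustration.

```python
import random


def gtft_decision(peer_helped_last_time, generosity=0.1, rng=random):
    """Decide whether to accept (True) or reject (False) a relay request.

    peer_helped_last_time: what the requesting peer did in the previous game.
    generosity: probability of helping a peer that did not help (assumed value).
    rng: any object with a random() method returning a float in [0, 1).
    """
    if peer_helped_last_time:
        # Tit-for-tat: mimic the cooperative action of the other player.
        return True
    # Generous part: occasionally help even though the peer did not.
    return rng.random() < generosity
```

Setting `generosity=0.0` gives plain tit-for-tat, and larger values make a node more forgiving at the cost of spending more of its own energy.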

Cooperation among energy-constrained nodes in wireless ad hoc networks

V. Srinivasan, P. Nuggehalli, C.F. Chiasserini, and R. R. Rao. Cooperation in wireless ad hoc networks. Proc. INFOCOM, 2003

Motivates users to move to “central locations”

Rewards nodes according to their sending ability

Each user has an initial credit balance

Incentives for acting as transit nodes in MANETs

J. Crowcroft, R. Gibbens, F. Kelly, and S. Ostring. Modelling incentives for collaboration in mobile ad hoc networks. Proc. WiOpt, 2003

Remarks on Incentive schemes for MANETs

1. Focus: message forwarding
2. Data hosting not addressed

Every data item has a price
• Virtual currency model
• Revenue of an MP is how much currency it has
• Discourages free-riding

Fairness in replica allocation by considering the origin of queries
• How many different mobile users downloaded the data item?

Hybrid superpeer architecture for facilitating replication

A. Mondal, S.K. Madria and M. Kitsuregawa “E-ARL: An economic incentive scheme for adaptive revenue-load-based dynamic replication of data in Mobile-P2P networks.” Journal on Distributed and Parallel Databases 28(1) – DAPD 2010

A. Mondal, S.K. Madria and M. Kitsuregawa “EcoRep: An economic model for efficient dynamic replication in mobile-P2P networks.” Proc. COMAD 2006

The E-ARL Incentive System

Price of data item d depends on
• access frequency
• number of MHs served by d (fairness issue)
• number of existing replicas of d
• (replica) consistency of d
• average response time for queries on d

Computation of data item price

Key idea: Assign higher-priced data items to MHs with either low revenue or low load
• spectrum of algorithms with different weights for revenue and load
• uses a parameter to adjust the relative weights of load and revenue

Facilitates both revenue-balance and load-balance

• Revenue-balance avoids starvation of MPs and encourages MP participation in the network

• Load-balance reduces query response times
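One way to picture the revenue/load weighting is a single score per host that blends the two quantities. The linear formula below, the name `pick_host`, and the assumption that revenue and load are already normalized to [0, 1] are all hypothetical; the slides do not give E-ARL's actual formula.

```python
def pick_host(hosts, weight):
    """Pick the host that should receive the next higher-priced data item.

    hosts: dict mapping a host name to (revenue, load), both assumed to be
           normalized to the range [0, 1].
    weight: 1.0 considers revenue only, 0.0 considers load only
            (a hypothetical stand-in for the tuning parameter).
    Returns the name of the host with the lowest blended score.
    """
    def score(item):
        revenue, load = item[1]
        return weight * revenue + (1 - weight) * load

    return min(hosts.items(), key=score)[0]
```

With `weight` close to 1 the choice favors revenue balance (low-revenue hosts get valuable items), and with `weight` close to 0 it favors load balance.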

Replica allocation in E-ARL

Brokers facilitate data collection & improve data availability through value-added routing service
• Pro-active search by maintaining an index of data items and their replicas
• Faster query response through directed routing rather than just forwarding
• Load-sharing [Reference: Mondal_SIGMOD 2000]

Different brokerages for various levels of brokers to provide better services

N. Padhariya, A. Mondal, S.K. Madria and M. Kitsuregawa “Economic incentive-based brokerage schemes for improving data availability in mobile-P2P networks.” Journal of Computer Communications 36(8): 861-874, 2013

The E-Broker Incentive System

E-Broker incentive schemes

IR

• Individual ranking strategy

• Drawback of collaborative score assignments to few brokers

NGS

• Neighbor-based Gossiping

• Limited to ONE-hop
• Exchange brokers’ score with others

K-NGS

• NGS extended to K-hops
• Wide coverage
• Highly accurate score assignments as compared to NGS/IR

EIB+

Score assignment strategies in EIB+

Designed for answering top-k queries in Mobile-P2P networks
• Uses economic incentive schemes to facilitate effective top-k query processing

Services
• contributing to top-k query results
• brokerage
• message relay

Every service has a price
• Service-requestor pays the price of the service to the service-provider

Revenue of a Mobile Peer (MP) is how much currency it has
• MP earns currency by providing services
• MP spends currency for obtaining services

Nilesh Padhariya, Anirban Mondal and Sanjay Kumar Madria. “Top-k Query Processing in Mobile-P2P Networks using Economic Incentive Schemes.” Peer-to-Peer Networking and Applications Journal, PPNA, 2015:8(5):1-21

N. Padhariya, A. Mondal, V. Goyal, R. Shankar and S.K. Madria “EcoTop: An Economic Model for Dynamic Processing of Top-k Queries in Mobile-P2P Networks.” DASFAA 2011

The E-Top Incentive System

Architecture of E-Top

Incentive schemes

• reward rankers for sending relevant top-k results, penalize them otherwise

• reduce communication traffic

ETK
• Equal payoff distribution
• Weighted re-ranking

ETK+
• Weighted payoff distribution
• Payoff-based re-ranking

E-Top: Economic Incentive Schemes

Dissemination of reports in mobile-P2P

Each disseminated report represents information about a spatial-temporal resource

Provides incentives for resource dissemination

Discusses the pricing of resource information

Benefit of report dissemination: time-saving

O. Wolfson, B. Xu, and A.P. Sistla. An economic model for resource exchange in mobile Peer-to-Peer networks. Proc. SSDBM, 2004

B. Xu, O. Wolfson, and N. Rishe. Benefit and pricing of spatio-temporal information in Mobile Peer-to-Peer networks. Proc. HICSS-39, 2006

Advertising spatial resources

Today’s specials

Peking duck: 120 yuan

Hot pot: 200 yuan

Sounds good!

Let me first get a parking slot!

Incentives for spatio-temporal resource dissemination

Price of a resource depends on
• Time elapsed since the creation of the resource
• Distance of the resource from the consumer

Economic incentive models
• Consumer-paid resources
• Producer-paid resources

Consumer-paid resources
• Example: parking slot advertisements
• Depends on relevance and time saved
• Consumer cannot sell reports
• The price of a report is a function of relevance
• A broker pays a percentage of the price of the report,
  is paid the same percentage when selling the report to another broker, and
  is paid full price when selling the report to a consumer

Producer-paid resources
• Example: a gas station
• Producer pays an “advertisement” fee by attaching some coins to the announcement
• Each mobile node that transmits the resource withdraws a “commission” from the coins
• How many coins should the producer put in?
• Trade-off between advertising cost and effect
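The producer-paid trade-off between advertising cost and reach can be made concrete with a toy model. The fixed per-hop commission and the rule that relaying stops once the coins cannot cover another commission are assumptions for illustration; the cited papers may define the commission differently.

```python
def propagate_advertisement(coins, commission, max_hops):
    """Estimate how far a producer-paid announcement travels (toy model).

    coins: coins the producer attaches to the announcement.
    commission: coins each transmitting node withdraws (assumed fixed and > 0).
    max_hops: upper bound on hops, e.g. imposed by network size.
    Returns (hops_travelled, coins_left).
    """
    hops = 0
    while hops < max_hops and coins >= commission:
        coins -= commission  # the relaying node withdraws its commission
        hops += 1
    return hops, coins
```

Attaching more coins buys more hops, and therefore more potential customers, but raises the advertising cost; the producer has to pick a point on that curve.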

P2P report exchange using Information Guided Search

Quick overview of the approach

• Discover peers within range
• Specify interests about reports
• Exchange reports with encountered peers

A consumer starts by moving around the area where a resource of interest could possibly be located

The search continues until
• either an available resource is encountered OR
• some resource-report is received

When a resource-report is received, the consumer moves along the shortest path to the resource

City-scale Taxi Ride-sharing

Shuo Ma, Yu Zheng, Ouri Wolfson. Real-Time City-Scale Taxi Ridesharing. IEEE Trans. Knowl. Data Eng. 27(7): 1782-1795 (2015)

Taxi-sharing reduces energy consumption (go green) and helps user commute

• Constraints: time, capacity, and monetary

A mobile-cloud architecture based taxi-sharing system that accepts users’ real-time requests sent via smartphone App and schedules taxis accordingly

• passengers will not pay more than they would without ridesharing, and are compensated if their travel time increases because of ridesharing
• taxi drivers will make money for the detour distance due to ridesharing

Monetary constraints incentivize both passengers and taxi drivers

• A scheduling process is used to choose a taxi that satisfies the request with minimum increase in travel distance

The Cloud finds candidate taxis using an algorithm supported by a spatio-temporal index
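The scheduling step (pick the feasible taxi with the smallest increase in travel distance) can be sketched as below. The `Candidate` fields and the two feasibility checks are simplifying assumptions; the real system derives candidates from a spatio-temporal index and also checks the monetary constraints, which are omitted here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate taxi for one ride request (hypothetical fields)."""
    taxi_id: str
    free_seats: int
    pickup_eta_min: float      # minutes until the taxi can reach the pickup
    extra_distance_km: float   # increase in the taxi's route if it takes the ride


def choose_taxi(candidates, passengers, max_wait_min):
    """Return the id of the feasible taxi with minimum extra distance, or None."""
    feasible = [
        c for c in candidates
        if c.free_seats >= passengers          # capacity constraint
        and c.pickup_eta_min <= max_wait_min   # time constraint
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.extra_distance_km).taxi_id
```

Filtering first and minimizing second mirrors the two-stage structure on the slide: candidate search, then scheduling.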


Free-riding

Mobile resource constraints

Data availability

Research Challenges in Mobile P2P

Incentives for enticing free-riders to provide data

Traditional environments (e.g., clusters) generally assume cooperative behavior by all nodes

• Free-riders provide no data

Rampant Free-riding in P2P environments

• Transmitting messages taxes the limited energy of mobile peers
• Bandwidth is limited and there are data transmission costs

Mobile resource constraints further exacerbate the free-riding problem

Free-riding

Mobile resource constraints include energy and bandwidth

• Such constraints impact data availability

Most mobile technologies physically support broadcast data management mechanisms

• Push or pull paradigm for obtaining real-time information

Only the most relevant information can be sent to minimize energy consumption

• Sending/receiving messages consumes the limited energy of mobile devices

Mobile resource constraints

Low data availability

Replication

Free-riding

Incentives

Mobile-P2P incentive model

Challenges in mobile crowdsourcing


Mobile crowdsourcing can be a great way to collect large-scale data and facilitate various applications, but ...

Mobile sensor-aware crowdsourcing

Quality control based on rich signals

Anyone can be a requester as well as a worker

Personalised request

Explicit user input as well as implicit sensor input

A proposed architecture

• Improving task performance and efficiency
• Enabling new crowdsourcing process
• Enabling new types of application

Opportunities

• Improving personalisation of request allocation and response aggregation
• Crowdsourcing spontaneous feedback
• New quality control mechanism
• Information about situational impacts on cognitive performance

Improving task performance and efficiency

• Crowdsourcing for the masses
• Weakening the strong correlation between labour and human resources
• Participatory sensing

New crowdsourcing processes

• Crowdsourcing as an anonymised customer information system
• Enabling technology for smart cities
• Mobile crowdsourcing for energy management and smart buildings
• Mobile crowdsourcing as a chance for ubiquitous computing research

New applications

• High performance data processing and analysis mechanism
• Limitations of mobile devices
• Data security and privacy issues

Challenges

Part 2 - Big Data Analytics for Mobile Computing & IoT

Part II: Agenda

NoSQL systems, New SQL systems

Systems for batch-oriented processing/interactive analytics

Stream processing systems

• Systems for processing massive datasets

Some Applications

Urban Transportation1,3

Open Data2 Citizen Services

1 Ilarri, Sergio, Ouri Wolfson, and Thierry Delot. "Collaborative sensing for urban transportation." IEEE Data Engineering Bulletin 37 (2014): 3-14.

2 Catlett, Charles E., Tanu Malik, Brett Goldstein, Jonathan Giuffrida, Yetong Shao, Alessandro Panella, Derek Eder, Eric van Zanten, Robert Mitchum, Severin Thaler and Ian T. Foster. “Plenario: An Open Data Discovery and Exploration Platform for Urban Science.” IEEE Data Engineering Bulletin 37 (2014): 27-42.

3 Juliana Freire, Cláudio T. Silva, Huy T. Vo, Harish Doraiswamy, Nivan Ferreira, Jorge Poco. “Riding from Urban Data to Insight Using New York City Taxis.” IEEE Data Engineering Bulletin 37 (2014): 43-55.

SFPark

Source: http://sfpark.org/how-it-works/applications/

Waze

Source: https://www.waze.com

Plenar.io

Source: http://plenar.io

NYC 311

Source: http://www1.nyc.gov/311/

Smart City Services

Source: https://smartcity.thermi.gov.gr/parking

Placemeter

Source: http://www.placemeter.com

TaxiVis

Nivan Ferreira, Jorge Poco, Huy T. Vo, Juliana Freire, and Cláudio T. Silva. “Visual Exploration of Big Spatio-Temporal Urban Data: A Study of New York City Taxi Trips,” IEEE Transactions on Visualization and Computer Graphics, v. 19 (12), p. 2149-2158, 2013.

Source: http://vgc.poly.edu/projects/taxivis/

Smart City Initiative in Kansas City
Photo Credit: Kansas City Area Development Council
http://kcmo.gov/smartcity/

Smart city vision1
o $15 million public-private partnership
o Smart streetlights and sensors2
o Free public Wi-Fi
o Interactive digital kiosks for citizens

1 “Beyond Traffic: The Vision for the Kansas City Smart City Challenge,” http://www.eenews.net/assets/2016/03/31/document_pm_06.pdf

2 http://www.sensity.com

NoSQL Systems
• Simple read/write operations
• On small number of related items/records
• Basically available, soft state, and eventually consistent (BASE)1
• Typically do not support ACID transactions to achieve higher performance & scalability1,2
• CAP theorem3; weak consistency model
• Relax ACID semantics
• Flexible schema; schema-less
• Horizontal scaling
• Shared-nothing architecture

1 Rick Cattell. “Scalable SQL and NoSQL datastores.” ACM SIGMOD Record 40, 2 (June 2011).

2 Michael Stonebraker and Rick Cattell. “10 Rules for Scalable Performance in ’Simple Operation’ Datastores.” Commun. ACM 54, 6 (June 2011), 72-80.

3 Eric Brewer. “Towards robust distributed systems.” In Proc. of the 19th Annual ACM Symposium on Principles of Distributed Computing, July 2000, Portland.

Broad Classification

Key-value stores

Document stores

Graph databases

Wide-column/extensible record stores

Key-value stores

Examples: Dynamo1, Voldemort2, Riak3, Redis4, MemcacheDB5

Data model: collection of objects, each with a key and a data object

Operations: insert, lookup, delete

[Figure: two-column table with headers Key and Value, with rows for keys k1, k2, k3, ...]
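The key-value data model and its three operations are small enough to sketch with an in-memory dictionary. This is a single-process illustration only; the class name is hypothetical, and real systems such as the ones listed above add partitioning, replication, and persistence.

```python
class KeyValueStore:
    """Minimal in-memory key-value store: a collection of (key, data object) pairs."""

    def __init__(self):
        self._data = {}

    def insert(self, key, value):
        """Store value under key, overwriting any previous value."""
        self._data[key] = value

    def lookup(self, key):
        """Return the value stored under key, or None if the key is absent."""
        return self._data.get(key)

    def delete(self, key):
        """Remove key; return True if it existed, False otherwise."""
        if key in self._data:
            del self._data[key]
            return True
        return False
```

Note that every operation addresses exactly one key, which is what makes this model easy to partition across nodes.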

1 S. Sivasubramanian. “Amazon DynamoDB: A seamlessly scalable non-relational database service.” In Proceedings of the 2012 ACM SIGMOD Conference, pages 729–730, 2012.
2 http://www.project-voldemort.com/voldemort/
3 http://basho.com/products/riak-kv/
4 http://redis.io
5 http://memcachedb.org


Document stores

Examples: SimpleDB1, MongoDB2, CouchDB3, Terrastore4

Data model: objects with variable number of attributes (e.g., JSON), nesting of objects

Schema-less, secondary indexes

Can query collection of objects via multiple attribute constraints
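Querying a collection by several attribute constraints can be sketched over plain dictionaries. The `find` helper is hypothetical and does a linear scan; an actual document store would use secondary indexes and its own query language.

```python
def find(collection, **constraints):
    """Return documents whose attributes equal all of the given constraints.

    collection: list of dict-like documents (schema-less, so attributes may
                be missing from some documents).
    constraints: attribute=value pairs that must all match.
    """
    return [
        doc for doc in collection
        if all(doc.get(attr) == value for attr, value in constraints.items())
    ]
```

Usage, with made-up tweet-like documents using attribute names from the example document on this slide:

```python
tweets = [
    {"id": 1, "lang": "es", "isRetweeted": False},
    {"id": 2, "lang": "en", "isRetweeted": False},
    {"id": 3, "lang": "es", "isRetweeted": True, "favoriteCount": 4},
]
spanish_originals = find(tweets, lang="es", isRetweeted=False)
```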



{ "createdAt": "Jan 1, 2017 ...", "id": 1234567890…, "text": " ...", "isTruncated": false, "inReplyToStatusId": -1, "inReplyToUserId": -1, "isFavorited": false, "isRetweeted": false, "favoriteCount": 0, "retweetCount": 0, "isPossiblySensitive": false, "lang": "en", …}



Collection

1 https://aws.amazon.com/simpledb/
2 https://www.mongodb.com
3 http://couchdb.apache.org
4 https://code.google.com/archive/p/terrastore/


Wide-column/extensible record stores

Examples: Bigtable1, Cassandra2, HBase3, HyperTable4

Data model: variable-width record sets, can add new attributes

Rows are partitioned horizontally across nodes

Column groups are partitioned vertically across nodes



[Figure: wide-column table with a Key column and columns C1 to C4 organized into Column group 1 and Column group 2]

1 Chang, Fay, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. "Bigtable: A distributed storage system for structured data." In ACM Transactions on Computer Systems (TOCS) 26, no. 2 (2008): 4.

2 A. Lakshman and P. Malik. “Cassandra: A Structured Storage System on a P2P network.” In Proc. of the 2008 ACM SIGMOD Conference, Vancouver, Canada, 2008.

3 https://hbase.apache.org

4 http://www.hypertable.com


Graph databases

Examples: Neo4j1, AllegroGraph2, Virtuoso3, InfiniteGraph4

Data model: property graphs, RDF graphs, labeled directed multigraphs

Query languages: SPARQL, Cypher, and others



[Figure: example graph with vertices a through f and numeric edge labels]

1 https://neo4j.com
2 http://franz.com/agraph/allegrograph
3 http://virtuoso.openlinksw.com
4 http://www.objectivity.com/products/infinitegraph

More Recently…

Multi-model

Graph

Key-value

Document

Wide-column

Examples: ArangoDB1, MarkLogic2, OrientDB3, FoundationDB4

Data model: support more than one type of data model

Support for ACID properties in transactions5,6

1 https://www.arangodb.com
2 http://www.marklogic.com
3 http://orientdb.com
4 https://foundationdb.com
5 Eric Brewer, "CAP Twelve Years Later: How the 'Rules' Have Changed," Computer, pp. 23-29, February 2012
6 http://www.methodsandtools.com/archive/acidnosqldatabase.php

New SQL1,2 (Scalable SQL)

Examples: MySQL Cluster3, VoltDB4, Clustrix5, NimbusDB6

SQL support, ACID transactions, high performance & scalability

Shared-nothing, avoid multi-node/shard operations (queries, updates), leverage main-memory databases, automatic recovery & availability

New OLTP workloads (write-focused, simple operations)
• Web-based applications, multi-player games, social networking sites
• Need for higher OLTP throughput
• Need for real-time analytics

1 Michael Stonebraker and Rick Cattell. “10 Rules for Scalable Performance in ’Simple Operation’ Datastores.” Communications of ACM 54, 6 (June 2011), 72-80.

2 Michael Stonebraker. “New SQL: An Alternative to NoSQL and Old SQL for New OLTP Apps.” http://cacm.acm.org/blogs/blog-cacm/109710-new-sql-an-alternative-to-nosql-and-old-sql-for-new-oltp-apps/fulltext

3 https://www.mysql.com/products/cluster
4 https://voltdb.com
5 http://www.clustrix.com
6 http://www.nuodb.com

Batch-Oriented Data Processing

• Processing large datasets on large clusters
  – Google’s MapReduce1 (MR)
  – Apache Hadoop2,3
    • Hadoop Distributed File System (HDFS)
    • Hadoop MapReduce
    • Hadoop YARN (Yet Another Resource Negotiator)
      – Cluster resource management

1 Jeffrey Dean and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” In Proc. of the 6th Symposium on Operating Systems Design and Implementation, San Francisco, 2004.

2 http://hadoop.apache.org

3 Tom White. “Hadoop: The Definitive Guide.” O’Reilly Media, Inc., 1st edition, 2009.

MapReduce

[Figure: MapReduce dataflow. Input data splits feed Map (M) tasks, which produce intermediate data consumed by Reduce (R) tasks, which write the output]
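The map, shuffle, and reduce stages can be imitated in a few lines of single-process Python using the classic word-count job. This is a teaching sketch, not Hadoop code: the function names are hypothetical, and a real framework runs the map and reduce tasks in parallel on different nodes and moves the intermediate data over the network.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit an intermediate (word, 1) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the framework does between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single output pair."""
    return key, sum(values)


def run_job(records):
    """Run map over every record, shuffle, then reduce each key group."""
    intermediate = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(intermediate).items())
```

Each call to `map_phase` depends on one record only, and each call to `reduce_phase` on one key only, which is what lets a framework spread them over many machines.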

Big Data Ecosystem

Distributed storage (e.g., HDFS)

Cluster management (e.g., YARN)

Hive (SQL-like) Spark SQL

Distributed processing engines (e.g., Tez, MR, Spark)

Others

See also:
http://hortonworks.com/apache/
http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
http://www.slideshare.net/hortonworks/apache-tez-accelerating-hadoop-query-processing

Tez and Spark process a job as a complex directed acyclic graph (DAG) of tasks

[Figure: a Tez job as a DAG of map (M) and reduce (R) tasks reading from and writing to HDFS]

Apache Hive

• Enables efficient query processing; indexes
  – Queries are compiled into map & reduce jobs
• Supports schema changes
  – Modifies metadata

Table foo (columns: year, mon)
• Partitions (on year)
• 4 buckets (hash on mon)

HDFS directories/files:
/…/foo/year=2016/1
…
/…/foo/year=2016/4
/…/foo/year=2015/1
…
/…/foo/year=2015/4

A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. “Hive - A Warehousing Solution over a Map-Reduce Framework.” VLDB Endowment, 2(2):1626–1629, Aug. 2009.

Example

Table A columns: id, lang, time_zone
Table B columns: id, text, follower_count, retweet_count

hive> create table A (id BIGINT, lang STRING, time_zone STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

hive> create table B (id BIGINT, text STRING, follower_count INT, retweet_count INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

hive> load data local inpath '/home/biadmin/Desktop/table1.csv' overwrite into table A;

hive> load data local inpath '/home/biadmin/Desktop/table3.csv' overwrite into table B;

hive> select A.id, A.lang, B.text, B.follower_count FROM A JOIN B ON (A.id = B.id);
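To try the join without a Hive installation, the same two tables and the same SELECT can be run against SQLite from Python's standard library. This is only an approximation: SQLite types replace Hive's BIGINT/STRING/INT, the CSV loading step is replaced by inline inserts, and the rows are invented for illustration.

```python
import sqlite3

# In-memory database standing in for the Hive warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (id INTEGER, lang TEXT, time_zone TEXT)")
conn.execute(
    "CREATE TABLE B (id INTEGER, text TEXT, follower_count INTEGER, retweet_count INTEGER)"
)

# Made-up rows; the slide loads these from table1.csv and table3.csv instead.
conn.executemany("INSERT INTO A VALUES (?, ?, ?)", [(1, "en", "UTC"), (2, "es", "Madrid")])
conn.executemany("INSERT INTO B VALUES (?, ?, ?, ?)", [(1, "hello", 10, 0), (3, "hola", 5, 2)])

# Same join as in the Hive session above.
rows = conn.execute(
    "SELECT A.id, A.lang, B.text, B.follower_count FROM A JOIN B ON (A.id = B.id)"
).fetchall()
print(rows)
```

Only id 1 appears in both tables, so the inner join returns a single row; ids 2 and 3 are dropped because they have no match on the other side.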

Apache Spark
• Resilient Distributed Dataset (RDD)
  – Distributed collection (immutable), in-memory processing, fault-tolerance
  – Transformations: map, reduceByKey, join, and others

See also: http://spark.apache.org/docs/latest/cluster-overview.html

[Figure: Spark cluster architecture. A driver program (main()) talks to the cluster manager and to worker nodes; each worker node runs an executor holding a cache and several tasks; an RDD is shown split into parts 1, 2, 3]

Spark SQL

• Complex/interactive analytics1

SELECT count(user.followersCount) as C, avg(user.followersCount) as A, lang from tableOfTweets where lang != "en" group by lang

Spark SQL

JSON documents

{ "createdAt": "May 8, 2016 8:53:39 PM", "id": 729489478797008900, "text": "No existe el olvido mi amor, no existe ...", "isTruncated": false, "inReplyToStatusId": -1, "inReplyToUserId": -1, "isFavorited": false, "isRetweeted": false, "favoriteCount": 0, "retweetCount": 0, "isPossiblySensitive": false, "lang": "es", …}



Load

Query

Output

1 Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. “Spark SQL: Relational Data Processing in Spark.” In Proc. of the 2015 SIGMOD, 1383-1394, 2015.

+---+------------------+----+
|C  |A                 |lang|
+---+------------------+----+
|1  |1910.0            |sl  |
|38 |1700.2894736842106|fr  |
|3  |156.0             |sv  |
|4  |1451.25           |zh  |
|124|496.46774193548384|th  |
|81 |1790.3456790123457|tl  |
|29 |5760.689655172414 |tr  |
|1  |0.0               |ne  |
|7  |49530.57142857143 |nl  |
....

Stream Processing Systems

• Apache Storm1/Twitter Heron2

1 Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy Ryaboy. “Storm@Twitter.” In Proc. of the 2014 ACM SIGMOD Conference, 147-156, 2014.

2 Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. “Twitter Heron: Stream Processing at Scale.” In Proc. of the 2015 ACM SIGMOD Conference, 239-250, 2015.

Spout

Bolt

Bolt

Topology (DAG)

Tuple-processing semantics: at-least-once, at-most-once

Each bolt/spout executes in parallel as tasks on cluster nodes

E.g., count the frequency of hashtags in a stream of tweets
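The hashtag-count example can be pictured as a chain of one spout and two bolts. The sketch below imitates that topology with Python generators in a single process; it does not use the Storm API, the function names are hypothetical, and it ignores parallelism and the tuple-processing guarantees mentioned above.

```python
from collections import Counter


def tweet_spout(tweets):
    """Spout: the source of the stream; emits one tweet text at a time."""
    for tweet in tweets:
        yield tweet


def hashtag_bolt(stream):
    """Bolt 1: split each tweet and emit its hashtags, lowercased."""
    for tweet in stream:
        for token in tweet.split():
            if token.startswith("#") and len(token) > 1:
                yield token.lower()


def count_bolt(stream):
    """Bolt 2: keep a running frequency count of the hashtags it receives."""
    counts = Counter()
    for tag in stream:
        counts[tag] += 1
    return counts
```

Usage: `count_bolt(hashtag_bolt(tweet_spout(["#IoT rocks", "big #iot #data"])))` wires the three stages into a linear DAG.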

Bolt

Stream Processing Systems

• Apache Flink1
  – Support for batch-oriented workloads
  – Exactly-once semantics
  – Streams are partitioned; each operator executes in parallel on a cluster
  – Window operators
  – E.g., compute the frequency of hashtags every 60 seconds (or over the last 1000 tweets)
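The "every 60 seconds" case corresponds to a tumbling (non-overlapping, fixed-size) time window. The following plain-Python sketch shows the grouping logic only; it is not Flink code, the function name is hypothetical, and it assumes events arrive as (timestamp in seconds, hashtag) pairs.

```python
from collections import Counter, defaultdict


def tumbling_window_counts(events, window_seconds=60):
    """Count hashtags per tumbling time window.

    events: iterable of (timestamp_seconds, hashtag) pairs (assumed format).
    window_seconds: window length; 60 matches the example above.
    Returns a dict mapping each window start time to a Counter of hashtags.
    """
    windows = defaultdict(Counter)
    for timestamp, hashtag in events:
        # Every event belongs to exactly one window, identified by its start time.
        window_start = int(timestamp // window_seconds) * window_seconds
        windows[window_start][hashtag] += 1
    return dict(windows)
```

The count-based variant (over the last 1000 tweets) would group by arrival position rather than by timestamp.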

[Figure: streaming dataflow (DAG). A stream flows from a Source through Transformations to a Sink]

1 A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, F. Naumann, M. Peters, A. Rheinlander, M. J. Sax, S. Schelter, M. Hoger, K. Tzoumas, and D. Warneke. “The Stratosphere Platform for Big Data Analytics.” The VLDB Journal, 23(6):939–964, Dec. 2014.

Graph Analytics & Machine Learning

• Apache Spark
– GraphX1
– MLlib2

• Apache Flink
– Gelly3
– FlinkML4

1 J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. “GraphX: Graph Processing in a Distributed Dataflow Framework.” In the 2014 OSDI Conference, pages 599–613, 2014.

2 Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. “MLlib: Machine Learning in Apache Spark.” Journal of Machine Learning Research 17, 1 (January 2016), 1235-1241.

3 https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/gelly.html

4 https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/ml/index.html

Part III: Agenda

• Examples of a Big Data Indexing and Querying System for processing massive datasets

Examples of Big Data

• Traffic Monitoring
• Battlefield
• Behavior Tracking
• Healthcare
• Forest Fire Detection
• Agriculture

Sample Queries over Big Data

How many people are moving within the region bounded by (xL,yL) and (xH,yH)?
Range Query (xL,yL)(xH,yH)

What is the current water quality at location (x,y)?
Point Query (x,y)

Which other cars are closest to location (x,y)?
KNN Query (x,y)
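To make the three query types concrete, here is a brute-force sketch over a small in-memory set of located objects. The object ids, coordinates, and function names are hypothetical; a real system would answer these through an index rather than by scanning every object.

```python
import math

# Hypothetical sample data: object id -> (x, y) location.
objects = {
    "car1": (1.0, 1.0),
    "car2": (2.0, 3.0),
    "car3": (5.0, 5.0),
    "car4": (2.5, 2.5),
}

def point_query(x, y):
    """Ids of the objects located exactly at (x, y)."""
    return [oid for oid, p in objects.items() if p == (x, y)]

def range_query(xl, yl, xh, yh):
    """Ids of the objects inside the rectangle with corners (xl, yl) and (xh, yh)."""
    return [oid for oid, (x, y) in objects.items()
            if xl <= x <= xh and yl <= y <= yh]

def knn_query(x, y, k):
    """Ids of the k objects closest to (x, y), nearest first."""
    return sorted(objects, key=lambda oid: math.dist(objects[oid], (x, y)))[:k]

print(point_query(5.0, 5.0))
print(range_query(1, 1, 3, 3))
print(knn_query(2, 2, 2))
```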

Possible Query Types


• Preference queries
– Subscription-based aggregation technique
– Skyline monitoring over frequent update streams
– Top-k query processing on multi-dimensional data with keywords

• Result diversification
– Diversified set monitoring over distributed data streams
– Diversified top-k query processing for mobile sensor networks

• Probabilistic Query Processing
– Nearest neighbor queries on distributed uncertain data

Big Data Management Architecture

• Rich Query Functionality: the ability to efficiently process point, range and k-nearest neighbor queries

• Distributed Index and Query Processing: designing an efficient key structure and a system which can efficiently distribute and retrieve the keys across the network

• Load Balancing: the capability of automatic load balancing for skewed data

• Consistency Management: the management of replica consistency to provide fault tolerance with respect to node failures

• Elastic Scalability: the capability to extend the system in the presence of dynamic workload

M-Grid: Objectives

• Among all the space-filling curves
– The Hilbert curve achieves the best clustering property1
– After mapping, points that are close in the n-d hypercube tend to be close in 1-d space as well
– Clustering of data points allows efficient query processing
– Our key design uses a Hilbert curve of appropriate order to give a linear ordering to data points in the multidimensional space

1. B. Moon, H. Jagadish, C. Faloutsos, and J. H. Saltz. Analysis of the clustering properties of the Hilbert space-filling curve. Knowledge and Data Engineering, 13(1):124–141, January 2001.

0101 0110 1001 1010

0100 0111 1000 1011

0011 0010 1101 1100

0000 0001 1110 1111

Second Order Hilbert Curve
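The grid above can be reproduced with a short sketch of the usual iterative cell-to-curve-position conversion. The function name `hilbert_value` and the convention that y = 0 is the bottom row are assumptions made for this illustration; the deck does not show the authors' own code.

```python
def hilbert_value(order, x, y):
    """Map cell (x, y) of a 2^order x 2^order grid to its position on the Hilbert curve."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)      # offset contributed by this quadrant
        # Rotate/flip so the sub-curve inside the quadrant has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d

# Build the second-order grid with y = 3 on the top row, as drawn on the slide.
rows = [
    " ".join(format(hilbert_value(2, x, y), "04b") for x in range(4))
    for y in range(3, -1, -1)
]
print("\n".join(rows))
```

Reading the printed values in increasing order traces the curve through neighboring cells, which is the clustering property the key design relies on.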

M-Grid Indexing Proposal

• Data Model and Point Query:
– HBase is designed to provide fast key access
– Our data model stores the coordinate values in the following way:

| Key | Family:Qualifier | Value |
| Hilbert-Value<x,y..> | data:userID | x.y |

– Multiple users can be stored within the same key
– Given a point (x,y), find its Hilbert-Value
– value = Table.get(Hilbert-Value(x,y))
– Can be extended to higher dimensions also
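A rough sketch of this data model, using a Python dictionary as a stand-in for an HBase table. The helper names (`hilbert_key`, `put`, `get`) are hypothetical, the key lookup simply reuses the second-order grid shown earlier in the deck instead of computing the curve, and the stored value follows the slide's "x.y" notation.

```python
from collections import defaultdict

# Second-order Hilbert values copied from the deck's grid; row 0 is the top row (y = 3).
HILBERT_GRID = [
    ["0101", "0110", "1001", "1010"],
    ["0100", "0111", "1000", "1011"],
    ["0011", "0010", "1101", "1100"],
    ["0000", "0001", "1110", "1111"],
]

def hilbert_key(x, y):
    """Row key for cell (x, y), with y = 0 on the bottom row."""
    return HILBERT_GRID[3 - y][x]

# Stand-in for the table: row key -> {"family:qualifier": value}.
table = defaultdict(dict)

def put(user_id, x, y):
    """Store a user's coordinates under the Hilbert key of the cell."""
    table[hilbert_key(x, y)][f"data:{user_id}"] = f"{x}.{y}"

def get(x, y):
    """Point query: a single key lookup returns every user stored in the cell."""
    return dict(table.get(hilbert_key(x, y), {}))

put("u1", 2, 1)
put("u2", 2, 1)   # a second user in the same cell shares the row key
put("u3", 0, 0)
print(hilbert_key(2, 1), get(2, 1))
```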

Our Proposal

• How to avoid the master-server bottleneck?
• Arrange the nodes in a P-Grid overlay structure
– Builds a virtual trie structure for efficient mapping of keys to nodes
– Preserves the key order relationship
– Completely decentralized
– Search can be initiated on any node
– Robust to failure and supports index replication
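The idea of routing over a virtual trie can be sketched as follows, assuming each peer is responsible for one binary key prefix and keeps, for every level of its prefix, references to peers on the other side of the trie at that level. The `Peer` class, `build_overlay`, and `route` are hypothetical names for illustration; this is not P-Grid's actual implementation and it ignores replication and failures.

```python
import random

class Peer:
    def __init__(self, path):
        self.path = path   # binary key prefix this peer is responsible for
        self.refs = {}     # level i -> peers that share path[:i] but differ at bit i

def build_overlay(paths):
    """Create one peer per prefix and fill in its per-level references."""
    peers = [Peer(p) for p in paths]
    for peer in peers:
        for i in range(len(peer.path)):
            flipped = peer.path[:i] + ("1" if peer.path[i] == "0" else "0")
            peer.refs[i] = [q for q in peers if q.path.startswith(flipped)]
    return peers

def route(peer, key, hops=0):
    """Forward the key from any starting peer until the responsible peer is reached."""
    for i, bit in enumerate(peer.path):
        if key[i] != bit:
            # Hand over to a peer that agrees with the key on at least one more bit.
            return route(random.choice(peer.refs[i]), key, hops + 1)
    return peer, hops

peers = build_overlay(["00", "01", "10", "11"])
owner, hops = route(peers[0], "1101")   # the search starts at the peer for prefix 00
print(owner.path, hops)
```

Each hop extends the matched prefix by at least one bit, so a lookup needs at most as many hops as the prefix is long and no central master is consulted.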

Our Proposal

• M-Grid System Architecture

M-Grid: Our Indexing Proposal

• Point query <q>
– The peer that is responsible for the data item is identified
– The query is routed to that peer and processed locally
– The result is sent back to the initiator

Our Proposal

• Range query <(Lx,Ly)(Hx,Hy)>
– The range query is processed concurrently
– Only those data items that intersect the query are returned
– Search latency depends on the skewness of the data items
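One way to picture concurrent range processing over Hilbert keys is to translate the query rectangle into runs of consecutive key values and scan the runs in parallel. The sketch below assumes the second-order grid from the deck, a dictionary as the key-value store, and a thread pool as a stand-in for peers working concurrently; all names are hypothetical and this is not the M-Grid implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Second-order Hilbert values from the deck's grid (decimal); row 0 is the top row (y = 3).
HILBERT_GRID = [
    [5, 6, 9, 10],
    [4, 7, 8, 11],
    [3, 2, 13, 12],
    [0, 1, 14, 15],
]

def key_intervals(lx, ly, hx, hy):
    """Runs of consecutive Hilbert keys covering the cells of the query rectangle."""
    keys = sorted(HILBERT_GRID[3 - y][x]
                  for x in range(lx, hx + 1) for y in range(ly, hy + 1))
    intervals = [[keys[0], keys[0]]]
    for k in keys[1:]:
        if k == intervals[-1][1] + 1:
            intervals[-1][1] = k          # extend the current run
        else:
            intervals.append([k, k])      # start a new run
    return [tuple(iv) for iv in intervals]

def scan(store, lo, hi):
    """Items stored under keys lo..hi (one contiguous key scan)."""
    return [v for k in range(lo, hi + 1) for v in store.get(k, [])]

def range_query(store, lx, ly, hx, hy):
    """Scan every key run concurrently and concatenate the partial results."""
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda iv: scan(store, *iv),
                              key_intervals(lx, ly, hx, hy)))
    return [v for part in parts for v in part]

store = {2: ["u1"], 8: ["u2"], 15: ["u3"]}   # hypothetical key -> items
print(key_intervals(1, 1, 2, 2))
print(range_query(store, 1, 1, 2, 2))
```

The better the curve clusters nearby cells, the fewer separate runs a rectangle breaks into, which keeps the number of scans small.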

Our Proposal

• KNN Query <q,k>
– Starting from the query point, enlarge the search region until k objects are found
– Consists of a series of range searches: first construct a range search R centered at the query point q = key with radius r = ϕ = Dk/k
– Dk is the estimated distance between the query object and its k-th nearest neighbor; Dk can be estimated using the equation of Tao et al.*, where N is the estimated number of objects in the whole space and d is the dimensionality of M-Grid

* Y. Tao et al. An efficient cost model for optimization of nearest neighbor search in low and medium dimensional spaces.
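The expanding search can be sketched as below. The estimate used for Dk is the form commonly quoted from the cited cost model for N uniformly distributed objects in a d-dimensional unit space, Dk ≈ (2 · Γ(d/2 + 1)^(1/d) / √π) · (1 − √(1 − (k/N)^(1/d))); treating it as the equation meant on the slide is an assumption. Using Dk/k both as the initial radius and as the growth step, and a linear scan in place of a distributed range search, are further assumptions of this sketch.

```python
import math

def estimate_dk(k, n, d):
    """Estimated distance to the k-th nearest neighbor (assumed form, unit space)."""
    c = 2.0 * math.gamma(d / 2 + 1) ** (1.0 / d) / math.sqrt(math.pi)
    return c * (1.0 - math.sqrt(1.0 - (k / n) ** (1.0 / d)))

def knn(points, q, k):
    """Enlarge a search region around q until it contains k points."""
    k = min(k, len(points))
    if k == 0:
        return []
    step = estimate_dk(k, len(points), len(q)) / k   # r = Dk / k
    r = step
    while True:
        # Stand-in for a range search: every point within distance r of q.
        hits = [p for p in points if math.dist(p, q) <= r]
        if len(hits) >= k:
            return sorted(hits, key=lambda p: math.dist(p, q))[:k]
        r += step

# Hypothetical points in the unit square.
points = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9), (0.5, 0.5)]
print(knn(points, (0.15, 0.15), 2))
```

Because every point within distance r is collected before the loop stops, the k nearest neighbors are guaranteed to be among the hits.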

Our Proposal

0101 0110 1001 1010

0100 0111 1000 1011

0011 0010 1101 1100

0000 0001 1110 1111

H-Order Space

Our Proposal

| Categories | CG-Index | RT-CAN | EMINC | MD-HBase | Proposed Solution |
| Multidimensional queries | No | Yes | Yes | Yes | Yes |
| Master server bottleneck | No | No | Yes | Yes | No |
| Scalability | Linear | Linear | Linear | Low | High |
| KNN queries | No | Yes | No | Yes | Yes |
| Independent approach | No | No | No | Yes | Yes |

Current Solutions: Comparison

| Categories | CG-Index | RT-CAN | EMINC | MD-HBase | M-Grid |
| Insert throughput | 256 nodes: 30k/sec | Not present | Not present | 16 nodes: 220k/sec | 16 nodes: 860k/sec |
| Set Up / Point Query (Traffic Dataset) | 265 nodes, 10m records: 0.033 ms | 128 nodes, 500k records per node: not given | 1000 nodes, 10m records: 40-50 ms | 4 nodes, 400m records: not present | 4 nodes, 400m records: less than 5 ms |
| Range Query (selectivity s) | s = 0.1%: 0.06 ms | s = 0.1%: 0.14 ms | s = 0.001%: 50+ ms | s = 10%: 200000 ms | s = 10%: 50000 ms |
| KNN Query | Not present | k = 16: 23k/sec | Not present | k = 1000: 4 sec | k = 1000: 2 sec |

Current Solutions vs M-Grid: Comparison

Key Research Challenges

Data Collection
• Incentive schemes for mobile crowdsourcing
• Ensuring long-term user engagement
• Data Reliability
• Data Privacy
• Cost-efficiencies

Big Data management, processing & analytics on the Cloud
• Data filtering, validation & integration
• Indexing from the perspective of diverse stakeholders with varying needs
• Spatial data management & indexing to facilitate location-dependent services
• Big Data replication and partitioning approaches on the Cloud
• Privacy & security of the Big Data
• Uncertainty

Data Reasoning & Semantics
• Ontology generation & knowledge representation
• Context-awareness, reasoning & interpretation of the data for drawing valuable inferences
• Approaches for improving query expressiveness with semantic considerations
• Activity Recognition
• Group Activity Recognition
• Integration of domain-related constraints

References
• http://data-informed.com/urban-informatics-putting-big-data-to-work-in-our-cities/
• http://cusp.nyu.edu/urban-informatics/
• Foth, Choi & Satchell 2011 http://www.urbaninformatics.net/
• https://en.wikipedia.org/wiki/Urban_computing
• http://mckinseyonsociety.com/emerging-trends-in-urban-informatics/
• http://www.link-labs.com/smart-waste-management/
• https://www.smartbin.com/
• http://freakonomics.com/2013/08/16/how-to-save-time-hunting-for-a-parking-spot-south-korea-edition/
• http://tv.seas.harvard.edu/research.php
• http://www.microsoft.com/presspass/presskits/zune/default.mspx
• E. Adar and B. A. Huberman. Free riding on Gnutella. Proc. First Monday, 5(10), 2000.
• M. Fischmann and O. Gunther. Free riders: Fact or fiction? Sep 2003.
• L. Ramaswamy and L. Liu. Free riding: A new challenge to P2P file sharing systems. Proc. HICSS, 2003.
• B. Yang and H. Garcia-Molina. Designing a super-peer network. Proc. ICDE, 2003.
• S. Kamvar, M. Schlosser, and H. Garcia-Molina. Incentives for combatting free-riding on P2P networks. Proc. Euro-Par, 2003.
• N. Liebau, V. Darlagiannis, O. Heckmann, and R. Steinmetz. Asymmetric incentives in peer-to-peer systems. Proc. AMCIS, 2005.
• L. Buttyan and J. Hubaux. Nuglets: a virtual currency to stimulate cooperation in self-organized mobile ad hoc networks. Technical Report DSC/2001/001, Swiss Federal Institute of Technology, Lausanne, 2001.
• L. Buttyan and J.P. Hubaux. Stimulating cooperation in self-organizing mobile ad hoc networks. ACM/Kluwer Mobile Networks and Applications, 8(5), 2003.
• S. Zhong, J. Chen, and Y.R. Yang. Sprite: A simple, cheat-proof, credit-based system for mobile ad-hoc networks. Proc. IEEE INFOCOM, 2003.
• K. Chen and K. Nahrstedt. iPass: an incentive compatible auction scheme to enable packet forwarding service in MANET. Proc. ICDCS, 2004.
• V. Srinivasan, P. Nuggehalli, C.F. Chiasserini, and R. R. Rao. Cooperation in wireless ad hoc networks. Proc. INFOCOM, 2003.
• J. Crowcroft, R. Gibbens, F. Kelly, and S. Ostring. Modelling incentives for collaboration in mobile ad hoc networks. Proc. WiOpt, 2003.


References
• R. Chakravorty, S. Agarwal, S. Banerjee, and I. Pratt. MoB: a mobile bazaar for wide-area wireless services. Proc. MobiCom, 2005.
• E. Pitoura and B. Bhargava. Maintaining consistency of data in mobile distributed environments. Proc. ICDCS, 1995.
• Yuan Xue, Baochun Li, and Klara Nahrstedt. Channel-relay price pair: Towards arbitrating incentives in wireless ad hoc networks. Journal of Wireless Communications and Mobile Computing, Special Issue on Ad Hoc Networks, Wiley InterScience, 2005.
• Yuan Xue, Baochun Li, and Klara Nahrstedt. Optimal resource allocation in wireless ad hoc networks: A price-based approach. IEEE Transactions on Mobile Computing, 2005.
• T. Hara and S.K. Madria. Consistency management among replicas in peer-to-peer mobile ad hoc networks. Proc. IEEE SRDS, 2005.
• T. Hara and S.K. Madria. Data replication for improving data accessibility in ad hoc networks. IEEE Transactions on Mobile Computing, 2006.
• A. Mondal, S.K. Madria and M. Kitsuregawa. “E-ARL: An economic incentive scheme for adaptive revenue-load-based dynamic replication of data in Mobile-P2P networks.” Journal on Distributed and Parallel Databases (DAPD) 28(1), 2010.
• A. Mondal, S.K. Madria and M. Kitsuregawa. “EcoRep: An economic model for efficient dynamic replication in mobile-P2P networks.” Proc. COMAD 2006.
• N. Padhariya, A. Mondal, S.K. Madria and M. Kitsuregawa. “Economic incentive-based brokerage schemes for improving data availability in mobile-P2P networks.” Journal of Computer Communications 36(8): 861-874, 2013.
• Nilesh Padhariya, Anirban Mondal and Sanjay Kumar Madria. “Top-k Query Processing in Mobile-P2P Networks using Economic Incentive Schemes.” Peer-to-Peer Networking and Applications Journal, PPNA, 2015:8(5):1-21.
• N. Padhariya, A. Mondal, V. Goyal, R. Shankar and S.K. Madria. “EcoTop: An Economic Model for Dynamic Processing of Top-k Queries in Mobile-P2P Networks.” DASFAA 2011.
• O. Wolfson, B. Xu, and A.P. Sistla. An economic model for resource exchange in mobile Peer-to-Peer networks. Proc. SSDBM, 2004.
• B. Xu, O. Wolfson, and N. Rishe. Benefit and pricing of spatio-temporal information in Mobile Peer-to-Peer networks. Proc. HICSS-39, 2006.
• Shuo Ma, Yu Zheng, Ouri Wolfson. Real-Time City-Scale Taxi Ridesharing. IEEE Trans. Knowl. Data Eng. 27(7): 1782-1795 (2015).
• A. Lakshman and P. Malik. “Cassandra: A Structured Storage System on a P2P network.” In Proc. of the 2008 ACM SIGMOD Conference, Vancouver, Canada, 2008.

References
• N. Garg. “HBase Essentials.” Packt Publishing, 2014.
• S. Sivasubramanian. “Amazon DynamoDB: A seamlessly scalable non-relational database service.” In Proceedings of the 2012 ACM SIGMOD Conference, pages 729–730, 2012.
• T. White. “Hadoop: The Definitive Guide.” O’Reilly Media, Inc., 1st edition, 2009.
• A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. “Hive - A Warehousing Solution over a Map-Reduce Framework.” VLDB Endowment, 2(2):1626–1629, Aug. 2009.
• M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, and I. Stoica. “Spark: Cluster Computing with Working Sets.” In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud’10, pages 10–10, 2010.
• G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. “Pregel: A System for Large-Scale Graph Processing.” In Proc. of the 2010 ACM SIGMOD Conference, pages 135–146, 2010.
• J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. “GraphX: Graph Processing in a Distributed Dataflow Framework.” In the 2014 OSDI Conference, pages 599–613, 2014.
• A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy. “Storm@Twitter.” In Proceedings of the 2014 ACM SIGMOD Conference, pages 147–156, 2014.
• A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, F. Naumann, M. Peters, A. Rheinlander, M. J. Sax, S. Schelter, M. Hoger, K. Tzoumas, and D. Warneke. “The Stratosphere Platform for Big Data Analytics.” The VLDB Journal, 23(6):939–964, Dec. 2014.
• S. Owen, R. Anil, T. Dunning, and E. Friedman. “Mahout in Action.” Manning Publications Co., 2011.
• X. Meng, J.K. Bradley, B. Yavuz, E.R. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. “MLlib: Machine Learning in Apache Spark.” CoRR, abs/1505.06807, 2015.
• Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. “Distributed GraphLab: A Framework for Machine Learning in the Cloud.” In Proc. of PVLDB Conference, pages 716–727, 2012.
• A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. “SystemML: Declarative Machine Learning on MapReduce.” In Proc. of ICDE 2011, pages 231–242, 2011.

Questions?

• Contact information
– Anirban Mondal: [email protected]
– Praveen Rao: [email protected]
– Sanjay Kumar Madria: [email protected]