Mobile Computing, Internet of Things, and Big Data for Urban Informatics
Anirban Mondal (Shiv Nadar University, India)
Praveen Rao (University of Missouri-Kansas, USA)
Sanjay Kumar Madria (Missouri University of Science & Technology, Rolla, USA)
Outline of the Talk
1 • Urban Informatics: Challenges & Opportunities for Mobile Computing
• Applications of mobile data management & IoT in urban informatics
• Mobile crowdsourcing/crowdsensing-based data management issues, including incentive mechanisms
• Mobile and IoT data integration
2 • Big Data Analytics for Mobile Computing & IoT
• Scalable data analytics frameworks for batch processing
• Scalable data analytics frameworks for real-time processing
3 • Indexing & Query Processing over Mobile Big Data for Urban Informatics
Part 1 - Urban Informatics: Challenges & Opportunities for Mobile Data Management
Interesting facts about urbanization (Source: NYU CUSP)
~80% of the U.S. population and ~50% of the world’s population live in urban areas
Growth rate: Over 1 million people per week
By 2050, 64% of people in the developing countries and 85% of people in the developed world will live in urban areas
http://data-informed.com/urban-informatics-putting-big-data-to-work-in-our-cities/
Ever-increasing demands on urban infrastructure
What is Urban Informatics?
Urban informatics uses data to better understand how cities work
• This understanding can remedy a wide range of issues affecting the everyday lives of citizens and the long-term health and efficiency of cities, from morning commutes to emergency preparedness to air quality.
• Source: http://cusp.nyu.edu/urban-informatics/
Urban informatics is the study, design, and practice of urban experiences across different urban contexts
• that are created by new opportunities of real-time, ubiquitous technology and the augmentation that mediates the physical and digital layers of people networks and urban infrastructures.
• Source: Foth, Choi & Satchell 2011 http://www.urbaninformatics.net/
Urban computing is an interdisciplinary field which pertains to the study and application of computing technology in urban areas
• This involves the application of wireless networks, sensors, computational power, and data to improve the quality of densely populated areas.
• Source: Wikipedia https://en.wikipedia.org/wiki/Urban_computing
Key themes in Urban Informatics (Source: McKinsey)
Using existing city data to improve efficiency
Building new data for better operational and planning decisions
Increasing public engagement in problem solving (mobile crowdsourcing focus)
http://mckinseyonsociety.com/emerging-trends-in-urban-informatics/
Waste Management
Transportation
Disaster Management
Healthcare
Smart Cities
Application Domain
Waste Management
Two key themes
• How to manage waste collection, distribution & routing?
• Reducing the amount of waste created in the first place
Traditional approach
• Collectors/trucks physically go to dumpsters to check the trash levels at fixed times
Problems with the traditional approach
• Potential trash bin overflows
• If trash bins are not yet full, the trip is an inefficient waste of time & fuel
How often to pick up the trash?
• Use sensors/IoT in conjunction with a dashboard
• Can significantly reduce labor costs of janitorial services
Source: http://www.link-labs.com/smart-waste-management/
Waste Management by Enevo
Enevo is an innovative waste management company
• A proprietary dumpster sensor and software system that is placed on the lids of garbage receptacles
The system communicates regarding the trash level of a given garbage container
Also performs predictive analytics about when a dumpster will be full
• Facilitates route planning in advance
Source: http://www.link-labs.com/smart-waste-management/
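The idea of predicting when a dumpster will be full can be illustrated with a simple linear extrapolation over recent fill-level readings. This is only a sketch under stated assumptions: the function name `hours_until_full`, the reading format (hour, fill fraction), and the linear model are hypothetical and do not describe Enevo's actual method.

```python
from typing import List, Optional, Tuple


def hours_until_full(readings: List[Tuple[float, float]]) -> Optional[float]:
    """Estimate hours until a bin is full from (hour, fill_fraction) readings.

    Hypothetical helper: fits a least-squares line to the readings and
    extrapolates to fill_fraction == 1.0. Returns None when there are too
    few readings or the fill level is not increasing.
    """
    n = len(readings)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in readings) / n
    mean_f = sum(f for _, f in readings) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in readings)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (f - mean_f) for t, f in readings) / var_t
    if slope <= 0:
        return None
    intercept = mean_f - slope * mean_t
    t_full = (1.0 - intercept) / slope   # time at which the line reaches 100%
    last_t = readings[-1][0]             # time of the most recent reading
    return max(0.0, t_full - last_t)


# Example: sensor readings every 6 hours (assumed data).
sample = [(0.0, 0.10), (6.0, 0.25), (12.0, 0.40), (18.0, 0.55)]
print(hours_until_full(sample))
```

A route planner could then visit only the bins whose estimate falls within the next collection shift, instead of checking every bin at fixed times.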
Waste management by SmartBin
Source: https://www.smartbin.com/
IoT-based Smart monitoring using Ultrasonic Level Sensor (UBi)
• most widely deployed fill-level sensor
SmartBin Live is the IoT game changer for collectors and distributors
• helps monitor containers
• plan for route optimization
Reducing waste creation
Millions of pounds of garbage generated annually
• Municipalities pay a lot of money for waste removal, landfills etc.
Waste creation also relates to purchasing habits and lack of ability to predict usage
• In the U.S., consumers waste ~133 billion pounds of food annually
• Grocery stores/supermarkets can prevent significant amounts of waste by asset tracking & management, JIT inventory management etc. (perishables)
Smart refrigerators can alert consumers about when food is getting spoiled
• Relates to purchasing habits of consumers
Source: http://www.link-labs.com/smart-waste-management/
Smart Parking: S Oil’s HERE
Clever use of balloons to reduce search time for open parking spots
Demonstrates that you don’t always need hi-tech to solve real problems
Source: http://freakonomics.com/2013/08/16/how-to-save-time-hunting-for-a-parking-spot-south-korea-edition/
Android App for weather, travel & news
StreetJournal.org: Crowdsourcing platform for effectively solving city problems
Mobile users report problems with streets, e.g., potholes, street lights etc.
Ushahidi Platform for Beijing Transport Planning
Mapunity: technology for social problem-solving
Mobile Applications for Smart Cities
Crowdsourcing-based traffic & navigation app
IBM Smart Cities
Wearable Technology Market Forecast
• Wearable electronics market worldwide: $20 billion in 2015 to almost $70 billion in 2025
• Dominant sector for wearable: Healthcare with medical, fitness and wellness management
• Promising new developments:
Health and Well-being in Elderly
• Keeping the elderly population healthy, safe, and independent in their own homes for a few more years is a win for everyone
Win-Win-Win!
Ever-increasing popularity and proliferation of mobile technology
P2P
Users can collect a wide variety of data using mobile devices
Mobile-P2P Computing a.k.a. Mobile Crowdsourcing
Real-world Crowd-driven Mobile Data Collection
Mobile-P2P Applications
Which shops here sell running shoes? Prices? Discounts?
Italian restaurants nearby here with lunch specials? How crowded? Ambience?
Find an available nearby parking slot
Where can I find the nearest car-repairing station with a short queue?
Mobile group buying: Want to buy discounted Levis jeans?
Anyone wants to share a cab ride?
Source: www.istanbul-airport.info
Mobile-P2P Cab Sharing Application
Shuo Ma, Yu Zheng, Ouri Wolfson. Real-Time City-Scale Taxi Ridesharing. IEEE Trans. Knowl. Data Eng. 27(7): 1782-1795 (2015)
Specialized contextual knowledge/information in Mobile-P2P environments
• Lots of context embedded in humans and/or devices
• Context is spatio-temporal + many other dynamic attributes that represent specialized semantic knowledge
The user essentially has a better CONTEXT & human judgment with which to answer queries
• due to being located at that space at that time + his own knowledge base
It goes beyond merely using a mobile device for getting a query answered
Observations for Mobile-P2P
Research Challenges in Data Management
1 • Sharing information & free-riding
2 • Heterogeneous, multi-modal sensor (including human sensors) data integration
3 • Scalability & handling Big Data
Incentive schemes for Mobile Environments (MANETs to Mobile-P2P)
Nodes get nuglets to forward packets
• Nuglets is a virtual currency
Packet Purse Model
• Packet sender loads nuglets into the packet
• Relay nodes take nuglets from packet when forwarding
• Packet is dropped when it has no more nuglets
Packet Trade Model
• Destination pays the cost of packet forwarding
• Senders pay nothing: Spamming by senders possible
• Each relay node earns some nuglets
Assumes a tamper-resistant hardware module at each node
• To maintain the integrity of the nuglet counter at each node
Stimulating nodes to forward messages in a MANET
L. Buttyan and J. Hubaux. Nuglets: a virtual currency to stimulate cooperation in self-organized mobile ad hoc networks. Technical Report DSC/2001/001, Swiss Federal Institute of Technology,Lausanne, 2001
L. Buttyan and J.P. Hubaux. Stimulating cooperation in self-organizing mobile ad hoc networks. ACM/Kluwer Mobile Networks and Applications, 8(5), 2003.
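The Packet Purse Model described above can be sketched as a small simulation. This is a minimal illustration, not the protocol from the cited papers: the function name `packet_purse` and the per-relay fee list are assumptions introduced here.

```python
from typing import List, Tuple


def packet_purse(nuglets: int, relay_fees: List[int]) -> Tuple[bool, int, List[int]]:
    """Simulate one packet travelling through a chain of relay nodes.

    nuglets:    amount the sender loads into the packet (assumed integer units)
    relay_fees: fee each relay on the path takes for forwarding (hypothetical)
    Returns (delivered, nuglets_left, amounts_earned_by_each_relay_reached).
    """
    earned: List[int] = []
    for fee in relay_fees:
        if nuglets < fee:
            # The purse cannot pay this relay, so the packet is dropped here.
            return False, nuglets, earned
        nuglets -= fee
        earned.append(fee)
    return True, nuglets, earned


print(packet_purse(3, [1, 1, 1]))  # sender loaded enough for three relays
print(packet_purse(2, [1, 1, 1]))  # purse runs out before the third relay
```

The sketch makes the sender's dilemma visible: loading too few nuglets loses the packet (and the nuglets already spent), while loading too many wastes currency.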
Sprite is a Cheat-Proof, Credit-based System for Mobile Ad Hoc Networks
• Node pays others for forwarding its messages
• Does not require tamper-proof hardware
Each selfish node tries to maximize its welfare
• Welfare = benefit – cost
When a node receives a message, the node keeps a receipt of the message
• Later, when the node has a fast connection, it reports to a Credit Clearance Service (CCS)
• The CCS determines the charge and credit to each node based on reported receipts
Sprite: Forwarding Incentives in a MANET
S. Zhong, J. Chen, and Y.R. Yang. Sprite: A simple, cheat-proof, credit-based system for mobile ad-hoc networks. Proc. IEEE INFOCOM, 2003
Assumes that the users are selfish, but rational
• “Pay for service” model of cooperation
Each router constitutes a smart market
• where an auction process runs continuously to determine
• who should obtain how much bandwidth & at what price
Bidders are traffic flows passing that router
• iPass assumes secure payment/accounting
iPass uses the generalized Vickrey auction with reserve pricing
• Truthful bidding of utility
iPass: Cooperative packet forwarding in a MANET via incentive-based auctions
K. Chen and K. Nahrstedt. iPass: an incentive compatible auction scheme to enable packet forwarding service in MANET. Proc. ICDCS, 2004
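Why a Vickrey-style auction encourages truthful bidding can be seen in its simplest form, a single-item second-price auction with a reserve price. Note that this is a deliberate simplification: iPass uses a generalized Vickrey auction over bandwidth shares, which is not reproduced here. The function name and the flow identifiers are hypothetical.

```python
from typing import Dict, Optional, Tuple


def second_price_auction(bids: Dict[str, float],
                         reserve: float) -> Optional[Tuple[str, float]]:
    """Single-item second-price (Vickrey) auction with a reserve price.

    bids maps a bidder (here: a traffic flow id) to its bid. The highest
    bidder at or above the reserve wins and pays the larger of the
    second-highest eligible bid and the reserve. Returns None if no bid
    meets the reserve.
    """
    eligible = {bidder: bid for bidder, bid in bids.items() if bid >= reserve}
    if not eligible:
        return None
    ranked = sorted(eligible.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    # With only one eligible bidder, the reserve sets the price.
    price = ranked[1][1] if len(ranked) > 1 else reserve
    return winner, price


print(second_price_auction({"flow1": 5.0, "flow2": 3.0, "flow3": 1.0}, reserve=2.0))
```

Because the winner's payment does not depend on its own bid, overstating or understating its utility cannot lower the price it pays, which is the intuition behind "truthful bidding of utility".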
A mathematical framework for user cooperation
• Defines strategies for optimal user behavior
• Trade-off: node lifetime vs network throughput
Generous TIT-FOR-TAT (GTFT) algorithm
• used by the nodes to decide whether to accept or reject a relay request
GTFT algorithm (Non-Cooperative Game Theory)
• Based on the prisoner’s dilemma idea
• Each player mimics the action of the other player in the previous game
• Occasionally, each player helps out even if the other player had not helped in the previous game
• Nodes maintain a record of past experiences
Does not address
• Malicious nodes
• Algorithms for collecting system-wide info
Cooperation among energy-constrained nodes in wireless ad hoc networks
V. Srinivasan, P. Nuggehalli, C.F. Chiasserini, and R. R. Rao. Cooperation in wireless ad hoc networks. Proc. INFOCOM, 2003
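The behavior described in the GTFT bullets can be sketched with the textbook form of generous tit-for-tat. This is an illustrative assumption, not the exact acceptance rule of the cited paper: the `generosity` probability and the function name are hypothetical.

```python
import random


def gtft_decision(other_helped_last_time: bool,
                  generosity: float,
                  rng: random.Random) -> bool:
    """Decide whether to accept a relay request (True) or reject it (False).

    Tit-for-tat: mimic what the other node did in the previous game.
    Generous: if the other node did not help, still help with probability
    `generosity` (a hypothetical tuning parameter in [0, 1]).
    """
    if other_helped_last_time:
        return True
    return rng.random() < generosity


# Example: replay a remembered history of the other node's past actions.
rng = random.Random(42)
history = [True, False, False, True]
decisions = [gtft_decision(helped, generosity=0.2, rng=rng) for helped in history]
print(decisions)
```

The occasional unconditional help is what lets two nodes recover from a single refusal instead of locking into mutual rejection.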
Motivates users to move to “central locations”
Rewards nodes according to their sending ability
Each user has an initial credit balance
Incentives for acting as transit nodes in MANETs
J. Crowcroft, R. Gibbens, F. Kelly, and S. Ostring. Modelling incentives for collaboration in mobile ad hoc networks. Proc. WiOpt, 2003
Remarks on Incentive schemes for MANETs
1• Focus: message forwarding
2• Data hosting not addressed
Every data item has a price
• Virtual currency model
• Revenue of an MP is how much currency it has
• Discourages free-riding
Fairness in replica allocation by considering the origin of queries
• How many different mobile users downloaded the data item?
Hybrid superpeer architecture for facilitating replication
A. Mondal, S.K. Madria and M. Kitsuregawa “E-ARL: An economic incentive scheme for adaptive revenue-load-based dynamic replication of data in Mobile-P2P networks.” Journal on Distributed and Parallel Databases 28(1) – DAPD 2010
A. Mondal, S.K. Madria and M. Kitsuregawa “EcoRep: An economic model for efficient dynamic replication in mobile-P2P networks.” Proc. COMAD 2006
The E-ARL Incentive System
Price of data item d depends on
• access frequency
• number of MHs served by d (fairness issue)
• number of existing replicas of d
• (replica) consistency of d
• average response time for queries on d
Computation of data item price
Key idea: Assign higher-priced data items to MHs with either low revenue or low load
• spectrum of algorithms with different weights for revenue and load
• uses a parameter to adjust relative weightage of load and revenue
Facilitates both revenue-balance and load-balance
• Revenue-balance avoids starvation of MPs and encourages MP participation in the network
• Load-balance reduces query response times
Replica allocation in E-ARL
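The key idea of weighting revenue against load can be sketched as a greedy allocator. This is a hypothetical illustration only: the scoring formula, the normalization, and the update after each assignment are assumptions made here and are not taken from the E-ARL paper.

```python
from typing import Dict, Tuple


def allocate_replicas(item_prices: Dict[str, float],
                      hosts: Dict[str, Tuple[float, float]],
                      weight: float) -> Dict[str, str]:
    """Greedily assign higher-priced items to hosts with low revenue/load.

    hosts maps host id -> (revenue, load).
    weight in [0, 1] is the assumed tuning parameter:
      1.0 considers revenue only, 0.0 considers load only.
    Returns a mapping item id -> chosen host id.
    """
    state = {h: [rev, load] for h, (rev, load) in hosts.items()}
    allocation: Dict[str, str] = {}
    # Most expensive items are placed first.
    for item, price in sorted(item_prices.items(), key=lambda kv: kv[1], reverse=True):
        max_rev = max(s[0] for s in state.values()) or 1.0
        max_load = max(s[1] for s in state.values()) or 1.0

        def score(host: str) -> float:
            rev, load = state[host]
            return weight * rev / max_rev + (1 - weight) * load / max_load

        target = min(sorted(state), key=score)  # sorted() makes ties deterministic
        allocation[item] = target
        state[target][0] += price  # assumed: hosting the item raises expected revenue
        state[target][1] += 1      # assumed: each replica adds one unit of load
    return allocation


hosts = {"A": (100.0, 5.0), "B": (10.0, 5.0)}
print(allocate_replicas({"d1": 10.0, "d2": 5.0}, hosts, weight=1.0))
print(allocate_replicas({"d1": 10.0, "d2": 5.0}, hosts, weight=0.0))
```

Sweeping `weight` from 0 to 1 yields the spectrum of algorithms mentioned above, from pure load-balancing to pure revenue-balancing.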
Brokers facilitate data collection & improve data availability through value-added routing service
• Pro-active search by maintaining index of data items and their replicas
• Faster query response through directed routing rather than just forwarding
• Load-sharing [Reference: Mondal_SIGMOD 2000]
Different brokerages for various levels of brokers to provide better services
N. Padhariya, A. Mondal, S.K. Madria and M. Kitsuregawa “Economic incentive-based brokerage schemes for improving data availability in mobile-P2P networks.” Journal of Computer Communications 36(8): 861-874, 2013
The E-Broker Incentive System
E-Broker incentive schemes
IR
• Individual ranking strategy
• Drawback of collaborative score assignments to few brokers
NGS
• Neighbor-based Gossiping
• Limited to ONE-hop
• Exchange brokers’ score with others
K-NGS
• NGS extended to K-hops
• Wide coverage
• Highly accurate score assignments as compared to NGS/IR
EIB+
Score assignment strategies in EIB+
Designed for answering top-k queries in Mobile-P2P networks
• Uses economic incentive schemes to facilitate effective top-k query processing
Services
• contributing to top-k query results
• brokerage
• message relay
Every service has a price
• Service-requestor pays the price of the service to the service-provider
Revenue of a Mobile Peer (MP) is how much currency it has
• MP earns currency by providing services
• MP spends currency for obtaining services
Nilesh Padhariya, Anirban Mondal and Sanjay Kumar Madria. “Top-k Query Processing in Mobile-P2P Networks using Economic Incentive Schemes.” Peer-to-Peer Networking and Applications Journal, PPNA, 2015:8(5):1-21
N. Padhariya, A. Mondal, V. Goyal, R. Shankar and S.K. Madria “EcoTop: An Economic Model for Dynamic Processing of Top-k Queries in Mobile-P2P Networks.” DASFAA 2011
The E-Top Incentive System
Architecture of E-Top
Incentive schemes
• reward rankers for sending relevant top-k results, penalize them otherwise
• reduce communication traffic
ETK
• Equal payoff distribution
• Weighted Re-ranking
ETK+
• Weighted payoff distribution
• Payoff-based re-ranking
E-Top: Economic Incentive Schemes
Dissemination of reports in mobile-P2P
Each disseminated report represents information about a spatial-temporal resource
Provides incentives for resource dissemination
Discusses pricing of resource information
Benefit of report dissemination: time-saving
O. Wolfson, B. Xu, and A.P. Sistla. An economic model for resource exchange in mobile Peer-to-Peer networks. Proc. SSDBM, 2004
B. Xu, O. Wolfson, and N. Rishe. Benefit and pricing of spatio-temporal information in Mobile Peer-to-Peer networks. Proc. HICSS-39, 2006
Advertising spatial resources
Today’s specials
Peking duck: 120 yuan
Hot pot: 200 yuan
Sounds good! Let me first get a parking slot!
Incentives for spatio-temporal resource dissemination
Price of a resource depends on
• Time length since the creation of the resource
• Distance of the resource from the consumer
Economic incentive models
• Consumer-paid resources
• Producer-paid resources
Consumer-paid resources
• Example: parking slot advertisements
• Depends on relevance and time-saved
• Consumer cannot sell reports
• The price of a report is a function of relevance
• A broker pays a percentage of the price of the report
• A broker is paid the same percentage when selling the report to another broker
• A broker is paid full price when selling the report to a consumer
Producer-paid resources
• Example: a gas station
• Producer pays an “advertisement” fee by attaching some coins to the announcement
• Each mobile node that transmits the resource withdraws a “commission” from the coins
• How many coins should the producer put in?
• Tradeoff between advertising cost and effect
P2P report exchange using Information Guided Search
Quick overview of the approach
• Discover peers within range
• Specify interests about reports
• Exchange reports with encountered peers
A consumer starts by moving around the area where a resource of interest could possibly be located
The search continues until
• either an available resource is encountered OR
• some resource-report is received
When resource-report is received, consumer moves along shortest path to resource
City-scale Taxi Ride-sharing
Shuo Ma, Yu Zheng, Ouri Wolfson. Real-Time City-Scale Taxi Ridesharing. IEEE Trans. Knowl. Data Eng. 27(7): 1782-1795 (2015)
Taxi-sharing reduces energy consumption (go green) and helps user commute
• Constraints: time, capacity, and monetary
A mobile-cloud architecture based taxi-sharing system that accepts users’ real-time requests sent via smartphone App and schedules taxis accordingly
• passengers will not pay more than with “no ridesharing” and get compensated if their travel time is increased because of ridesharing;
• taxi drivers will make money for the detour distance due to ridesharing.
Monetary constraints incentivize both passengers and taxi drivers
• A scheduling process is used to choose a taxi that satisfies the request with minimum increase in travel distance
The Cloud finds candidate taxis using an algorithm supported by a spatio-temporal index
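The scheduling step, choosing the taxi that satisfies a request with the minimum increase in travel distance, can be sketched as follows. This is a heavily simplified assumption-laden illustration: it uses straight-line distances, a single existing destination per taxi, and only the capacity constraint, whereas the cited system also checks time windows and monetary constraints and finds candidates through a spatio-temporal index. All names are hypothetical.

```python
import math
from typing import List, NamedTuple, Optional, Tuple

Point = Tuple[float, float]


class Taxi(NamedTuple):
    taxi_id: str
    position: Point
    destination: Optional[Point]  # None means the taxi is currently empty
    free_seats: int


def dist(a: Point, b: Point) -> float:
    """Straight-line distance (an assumption; real systems use road networks)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def added_distance(taxi: Taxi, pickup: Point, dropoff: Point) -> float:
    """Extra distance if the taxi serves pickup -> dropoff before its own destination."""
    detour = dist(taxi.position, pickup) + dist(pickup, dropoff)
    if taxi.destination is None:
        return detour
    return detour + dist(dropoff, taxi.destination) - dist(taxi.position, taxi.destination)


def choose_taxi(taxis: List[Taxi], pickup: Point, dropoff: Point,
                passengers: int = 1) -> Optional[Taxi]:
    """Pick the taxi with enough free seats and the smallest added distance."""
    candidates = [t for t in taxis if t.free_seats >= passengers]
    if not candidates:
        return None
    return min(candidates, key=lambda t: added_distance(t, pickup, dropoff))


fleet = [
    Taxi("t1", (0.0, 0.0), (10.0, 0.0), 2),  # already heading along the request's route
    Taxi("t2", (0.0, 5.0), None, 3),         # empty taxi further away
    Taxi("t3", (1.0, 0.0), None, 0),         # full, cannot take the request
]
print(choose_taxi(fleet, pickup=(2.0, 0.0), dropoff=(8.0, 0.0)))
```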
Free-riding
Mobile resource constraints
Data availability
Research Challenges in Mobile P2P
Incentives for enticing free-riders to provide data
Traditional environments (e.g., clusters) generally assume cooperative behavior by all nodes
• Free-riders provide no data
Rampant Free-riding in P2P environments
• Transmitting messages taxes the limited energy of mobile peers
• Bandwidth is limited and there are data transmission costs
Mobile resource constraints further exacerbate the free-riding problem
Free-riding
Mobile resource constraints include energy and bandwidth
• Such constraints impact data availability
Most mobile technologies physically support broadcast data management mechanisms
• Push or pull paradigm for obtaining real-time information
Only the most relevant information can be sent to minimize energy consumption
• Sending/receiving messages consumes the limited energy of mobile devices
Mobile resource constraints
Low data availability
Replication
Free-riding
Incentives
Mobile-P2P incentive model
Challenges in mobile crowdsourcing
Mobile Crowdsourcing can be a great way to collect large-scale data and facilitate various applications, but ...
Mobile sensor-aware crowdsourcing
Quality control based on rich signals
Anyone can be a requester as well as a worker
Personalised request
Explicit user input as well as implicit sensor input
A proposed architecture
• Improving task performance and efficiency
• Enabling new crowdsourcing processes
• Enabling new types of applications
Opportunities
• Improving personalisation of request allocation and response aggregation
• Crowdsourcing spontaneous feedback
• New quality control mechanism
• Information about situational impacts on cognitive performance
Improving task performance and efficiency
• Crowdsourcing for the masses
• Weakening the strong correlation between labour and human resources
• Participatory sensing
New crowdsourcing processes
• Crowdsourcing as an anonymised customer information system
• Enabling technology for smart cities
• Mobile crowdsourcing for energy management and smart buildings
• Mobile crowdsourcing as a chance for ubiquitous computing research
New applications
• High performance data processing and analysis mechanism
• Limitations of mobile devices
• Data security and privacy issues
Challenges
Part 2 - Big Data Analytics for Mobile Computing & IoT
Part II: Agenda
NoSQL systems, New SQL systems
Systems for batch-oriented processing/interactive analytics
Stream processing systems
• Systems for processing massive datasets
Some Applications
Urban Transportation1,3
Open Data2 Citizen Services
1 Ilarri, Sergio, Ouri Wolfson, and Thierry Delot. "Collaborative sensing for urban transportation." IEEE Data Engineering Bulletin 37 (2014): 3-14.
2 Catlett, Charles E., Tanu Malik, Brett Goldstein, Jonathan Giuffrida, Yetong Shao, Alessandro Panella, Derek Eder, Eric van Zanten, Robert Mitchum, Severin Thaler and Ian T. Foster. “Plenario: An Open Data Discovery and Exploration Platform for Urban Science.” IEEE Data Engineering Bulletin 37 (2014): 27-42.
3 Juliana Freire, Cláudio T. Silva, Huy T. Vo, Harish Doraiswamy, Nivan Ferreira, Jorge Poco. “Riding from Urban Data to Insight Using New York City Taxis.” IEEE Data Engineering Bulletin 37 (2014): 43-55.
SFPark
Source: http://sfpark.org/how-it-works/applications/
Waze
Source: https://www.waze.com
Plenar.io
Source: http://plenar.io
NYC 311
Source: http://www1.nyc.gov/311/
Smart City Services
Source: https://smartcity.thermi.gov.gr/parking
Placemeter
Source: http://www.placemeter.com
TaxiVis
Nivan Ferreira, Jorge Poco, Huy T. Vo, Juliana Freire, and Cláudio T. Silva. “Visual Exploration of Big Spatio-Temporal Urban Data: A Study of New York City Taxi Trips,” IEEE Transactions on Visualization and Computer Graphics, v. 19 (12), p. 2149-2158, 2013.
Source: http://vgc.poly.edu/projects/taxivis/
Smart City Initiative in Kansas City
Photo Credit: Kansas City Area Development Council
http://kcmo.gov/smartcity/
Smart city vision1
o $15 million public-private partnership
o Smart streetlights and sensors2
o Free public Wi-Fi
o Interactive digital kiosks for citizens
1 “Beyond Traffic: The Vision for the Kansas City Smart City Challenge,” http://www.eenews.net/assets/2016/03/31/document_pm_06.pdf
2 http://www.sensity.com
NoSQL Systems
• Simple read/write operations
• On small number of related items/records
• Basically available, soft state, and eventually consistent (BASE)1
• Typically do not support ACID transactions to achieve higher performance & scalability1,2
• CAP theorem3; weak consistency model
• Relax ACID semantics
• Flexible schema; schema-less
• Horizontal scaling
• Shared-nothing architecture
1 Rick Cattell. “Scalable SQL and NoSQL datastores.” ACM SIGMOD Record 40, 2 (June 2011).
2 Michael Stonebraker and Rick Cattell. “10 Rules for Scalable Performance in ’Simple Operation’ Datastores.” Commun. ACM 54, 6 (June 2011), 72-80.
3 Eric Brewer. “Towards robust distributed systems.” In Proc. of the 19th Annual ACM Symposium on Principles of Distributed Computing, July 2000, Portland.
Broad Classification
Key-value stores
Document stores
Graph databases
Wide-column/extensible record stores
Examples: Dynamo1, Voldemort2, Riak3, Redis4, MemcacheDB5
Data model: collection of objects, each with a key and a data object
Operations: insert, lookup, delete
[Figure: a table with Key and Value columns; example keys k1, k2, k3]
1 S. Sivasubramanian. “Amazon DynamoDB: A seamlessly scalable non-relational database service.” In Proceedings of the 2012 ACM SIGMOD Conference, pages 729–730, 2012.
2 http://www.project-voldemort.com/voldemort/
3 http://basho.com/products/riak-kv/
4 http://redis.io
5 http://memcachedb.org
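The key-value data model and its three operations (insert, lookup, delete) can be sketched with an in-memory store whose keys are hash-partitioned, hinting at how such systems scale horizontally. This is a toy illustration under assumptions: the class name and partitioning scheme are hypothetical and do not mirror the API of any system listed above.

```python
import hashlib
from typing import Any, Dict, List, Optional


class TinyKeyValueStore:
    """Toy key-value store: each partition stands in for one cluster node."""

    def __init__(self, num_partitions: int = 4) -> None:
        self._partitions: List[Dict[str, Any]] = [dict() for _ in range(num_partitions)]

    def _partition_for(self, key: str) -> Dict[str, Any]:
        # A stable hash decides which partition (node) owns the key.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self._partitions[int(digest, 16) % len(self._partitions)]

    def insert(self, key: str, value: Any) -> None:
        self._partition_for(key)[key] = value

    def lookup(self, key: str) -> Optional[Any]:
        return self._partition_for(key).get(key)

    def delete(self, key: str) -> bool:
        partition = self._partition_for(key)
        if key in partition:
            del partition[key]
            return True
        return False


store = TinyKeyValueStore()
store.insert("k1", {"sensor": "bin-17", "fill": 0.4})
print(store.lookup("k1"))
print(store.delete("k1"), store.lookup("k1"))
```

Because every operation touches exactly one key, each request is served by a single partition, which is what makes this model easy to scale out.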
Broad Classification
Examples: SimpleDB1, MongoDB2, CouchDB3, Terrastore4
Data model: objects with variable number of attributes (e.g., JSON), nesting of objects
Schema-less, secondary indexes
Can query collection of objects via multiple attribute constraints
(Category: Document stores)
{ "createdAt": "Jan 1, 2017 ...", "id": 1234567890…, "text": " ...", "isTruncated": false, "inReplyToStatusId": -1, "inReplyToUserId": -1, "isFavorited": false, "isRetweeted": false, "favoriteCount": 0, "retweetCount": 0, "isPossiblySensitive": false, "lang": "en", …}
[Figure: a collection of several JSON documents of this form]
1 https://aws.amazon.com/simpledb/
2 https://www.mongodb.com
3 http://couchdb.apache.org
4 https://code.google.com/archive/p/terrastore/
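Querying a collection of schema-less objects via multiple attribute constraints can be sketched in plain Python. This is a hypothetical illustration, not the query API of any document store named above: the `find` function, the dotted-path convention for nested objects, and the sample documents are all assumptions.

```python
from typing import Any, Dict, List

_MISSING = object()  # sentinel: distinguishes "attribute absent" from a stored None


def get_path(doc: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'user.lang' through nested objects."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def find(collection: List[Dict[str, Any]],
         constraints: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the documents that satisfy every attribute constraint."""
    return [doc for doc in collection
            if all(get_path(doc, path) == value for path, value in constraints.items())]


# Documents may carry different attributes (no fixed schema).
tweets = [
    {"id": 1, "lang": "en", "isRetweeted": False, "user": {"followersCount": 10}},
    {"id": 2, "lang": "es", "isRetweeted": False},
    {"id": 3, "lang": "en", "isRetweeted": True, "user": {"followersCount": 10}},
]
print(find(tweets, {"lang": "en", "isRetweeted": False}))
print(find(tweets, {"user.followersCount": 10}))
```

A real document store would answer such queries through secondary indexes rather than the full scan used here.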
Broad Classification
Examples: Bigtable1, Cassandra2, HBase3, HyperTable4
Data model: variable-width record sets, can add new attributes
Rows are partitioned horizontally across nodes
Column groups are partitioned vertically across nodes
(Category: Wide-column/extensible record stores)
[Figure: a wide-column table with a Key column and two column groups (Column group 1, Column group 2) covering columns C1, C2, C3, C4]
1 Chang, Fay, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. "Bigtable: A distributed storage system for structured data." In ACM Transactions on Computer Systems (TOCS) 26, no. 2 (2008): 4.
2 A. Lakshman and P. Malik. “Cassandra: A Structured Storage System on a P2P network.” In Proc. of the 2008 ACM SIGMOD Conference, Vancouver, Canada, 2008.
3 https://hbase.apache.org
4 http://www.hypertable.com
Broad Classification
Examples: Neo4j1, AllegroGraph2, Virtuoso3, InfiniteGraph4
Data model: property graphs, RDF graphs, labeled directed multigraphs
Query languages: SPARQL, Cypher, and others
(Category: Graph databases)
[Figure: example graph with vertices a, b, c, d, e, f and numeric edge labels]
1 https://neo4j.com
2 http://franz.com/agraph/allegrograph
3 http://virtuoso.openlinksw.com
4 http://www.objectivity.com/products/infinitegraph
More Recently…
Multi-model
Graph
Key-value
Document
Wide-column
Examples: ArangoDB1, MarkLogic2, OrientDB3, FoundationDB4
Data model: support more than one type of data model
Support for ACID properties in transactions5,6
1 https://www.arangodb.com
2 http://www.marklogic.com
3 http://orientdb.com
4 https://foundationdb.com
5 Eric Brewer, "CAP Twelve Years Later: How the 'Rules' Have Changed," Computer, pp. 23-29, February 2012
6 http://www.methodsandtools.com/archive/acidnosqldatabase.php
New SQL
Examples: MySQL Cluster3, VoltDB4, Clustrix5, NimbusDB6
SQL support, ACID transactions, high performance & scalability
Shared-nothing, avoid multi-node/shard operations (queries, updates), leverage main-memory databases, automatic recovery & availability
New SQL1,2 (Scalable SQL)
New OLTP workloads (write-focused, simple operations)
Web-based applications, multi-player games, social networking sites
Need for higher OLTP throughput
Need for real-time analytics
1 Michael Stonebraker and Rick Cattell. “10 Rules for Scalable Performance in ’Simple Operation’ Datastores.” Communications of ACM 54, 6 (June 2011), 72-80.
2 Michael Stonebraker. “New SQL: An Alternative to NoSQL and Old SQL for New OLTP Apps.” http://cacm.acm.org/blogs/blog-cacm/109710-new-sql-an-alternative-to-nosql-and-old-sql-for-new-oltp-apps/fulltext
3 https://www.mysql.com/products/cluster
4 https://voltdb.com
5 http://www.clustrix.com
6 http://www.nuodb.com
Batch-Oriented Data Processing
• Processing large datasets on large clusters
– Google’s MapReduce1 (MR)
– Apache Hadoop2,3
• Hadoop Distributed File System (HDFS)
• Hadoop MapReduce
• Hadoop YARN (Yet Another Resource Negotiator)
– Cluster resource management
1 Jeffrey Dean and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” In Proc. of the 6th Symposium on Operating Systems Design and Implementation, San Francisco, 2004.
2 http://hadoop.apache.org
3 Tom White. “Hadoop: The Definitive Guide.” O’Reilly Media, Inc., 1st edition, 2009.
MapReduce
[Figure: four input data splits feed four Map (M) tasks; their intermediate data is passed to three Reduce (R) tasks, which produce the output]
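The map, intermediate-data, and reduce stages of MapReduce can be mimicked in a single Python process. This is a sketch of the programming model only, with no distribution or fault tolerance, and the function names are assumptions rather than the Hadoop API.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_fn(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in an input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: sum all counts emitted for one key."""
    return key, sum(values)


def map_reduce(splits: Iterable[Iterable[str]],
               mapper: Callable[[str], Iterator[Tuple[str, int]]],
               reducer: Callable[[str, List[int]], Tuple[str, int]]) -> Dict[str, int]:
    # Map phase: each split would be handled by a separate map task.
    intermediate: Dict[str, List[int]] = defaultdict(list)
    for split in splits:
        for record in split:
            for key, value in mapper(record):
                # Shuffle: group intermediate values by key.
                intermediate[key].append(value)
    # Reduce phase: each key group would go to one reduce task.
    return dict(reducer(key, values) for key, values in sorted(intermediate.items()))


splits = [["smart city", "smart bin"], ["city data"]]
print(map_reduce(splits, map_fn, reduce_fn))
```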
Big Data Ecosystem
Distributed storage (e.g., HDFS)
Cluster management (e.g., YARN)
Hive (SQL-like) Spark SQL
Distributed processing engines (e.g., Tez, MR, Spark)
Others
See also: http://hortonworks.com/apache/, http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final, http://www.slideshare.net/hortonworks/apache-tez-accelerating-hadoop-query-processing
Tez and Spark process a job as a complex directed acyclic graph (DAG) of tasks
[Figure: a Tez DAG of map (M) and reduce (R) tasks over HDFS]
Apache Hive
• Enables efficient query processing; indexes
– Queries are compiled into map & reduce jobs
• Supports schema changes
– Modifies metadata
Table foo (columns: year, mon)
Partitions (on year), 4 buckets (hash on mon)
HDFS directories/files:
/…/foo/year=2016/1…/…/foo/year=2016/4
/…/foo/year=2015/1…/…/foo/year=2015/4
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. “Hive - A Warehousing Solution over a Map-Reduce Framework.” Proc. VLDB Endowment, 2(2):1626–1629, Aug. 2009.
Example
Table A: id, lang, time_zone
Table B: id, text, follower_count, retweet_count
hive> create table A (id BIGINT, lang STRING, time_zone STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> create table B (id BIGINT, text STRING, follower_count INT, retweet_count INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> load data local inpath '/home/biadmin/Desktop/table1.csv' overwrite into table A;
hive> load data local inpath '/home/biadmin/Desktop/table3.csv' overwrite into table B;
hive> select A.id, A.lang, B.text, B.follower_count FROM A JOIN B ON (A.id = B.id);
Apache Spark
• Resilient Distributed Dataset (RDD)
– Distributed collection (immutable), in-memory processing, fault-tolerance
– Transformations: map, reduceByKey, join, and others
See also: http://spark.apache.org/docs/latest/cluster-overview.html
[Figure: Spark cluster overview. A driver program (main()) talks to a cluster manager and to worker nodes; each worker node runs an executor with a cache and tasks. An RDD is shown split into parts 1, 2, 3.]
Spark SQL
• Complex/interactive analytics1
SELECT count(user.followersCount) as C, avg(user.followersCount) as A, lang from tableOfTweets where lang != "en" group by lang
Spark SQL
JSON documents
{ "createdAt": "May 8, 2016 8:53:39 PM", "id": 729489478797008900, "text": "No existe el olvido mi amor, no existe ...", "isTruncated": false, "inReplyToStatusId": -1, "inReplyToUserId": -1, "isFavorited": false, "isRetweeted": false, "favoriteCount": 0, "retweetCount": 0, "isPossiblySensitive": false, "lang": "es", …}
{ "createdAt": "Jan 1, 2017 ...", "id": 1234567890…, "text": " ...", "isTruncated": false, "inReplyToStatusId": -1, "inReplyToUserId": -1, "isFavorited": false, "isRetweeted": false, "favoriteCount": 0, "retweetCount": 0, "isPossiblySensitive": false, "lang": "en", …}
Load
Query
Output
1 Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. “Spark SQL: Relational Data Processing in Spark.” In Proc. of the 2015 SIGMOD, 1383-1394, 2015.
+---+------------------+----+
|C  |A                 |lang|
+---+------------------+----+
|1  |1910.0            |sl  |
|38 |1700.2894736842106|fr  |
|3  |156.0             |sv  |
|4  |1451.25           |zh  |
|124|496.46774193548384|th  |
|81 |1790.3456790123457|tl  |
|29 |5760.689655172414 |tr  |
|1  |0.0               |ne  |
|7  |49530.57142857143 |nl  |
....
Stream Processing Systems
• Apache Storm1/Twitter Heron2
2 Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. “Twitter Heron: Stream Processing at Scale.” In Proc. of the 2015 ACM SIGMOD Conference, 239-250, 2015.
1 Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy Ryaboy. “Storm@Twitter.” In Proc. of the 2014 SIGMOD Conference, 147-156, 2016.
Spout
Bolt
Bolt
Topology (DAG)
Tuple-processing semantics: at least-once, at most-once
Each blot/spout executes in parallel as tasks on cluster nodes
E.g., count the frequency of hashtags in a stream of tweets
Bolt
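The hashtag-count example above can be sketched in plain Python as a spout feeding two bolts. This is only an illustration of the dataflow shape, not the Apache Storm API: the function names (`tweet_spout`, `hashtag_bolt`, `count_bolt`, `run_topology`) and the sample tweets are assumptions, and the sketch runs on a single thread rather than as parallel tasks on a cluster.

```python
import re
from collections import Counter

def tweet_spout(tweets):
    """Spout: emits one tuple (here, the tweet text) at a time."""
    for text in tweets:
        yield text

def hashtag_bolt(stream):
    """Bolt 1: turns each tweet into zero or more lowercase hashtag tuples."""
    for text in stream:
        for tag in re.findall(r"#(\w+)", text):
            yield tag.lower()

def count_bolt(stream):
    """Bolt 2: keeps a running count per hashtag."""
    counts = Counter()
    for tag in stream:
        counts[tag] += 1
    return counts

def run_topology(tweets):
    """Wire spout -> bolt -> bolt, i.e. a linear DAG."""
    return count_bolt(hashtag_bolt(tweet_spout(tweets)))

# Hypothetical sample input
sample_tweets = [
    "Stuck in traffic again #commute #KC",
    "New bike lanes downtown #commute",
    "No hashtags here",
]
hashtag_counts = run_topology(sample_tweets)
print(hashtag_counts.most_common())
```

In a real topology each bolt would run as several parallel tasks, with tuples for the same hashtag routed to the same counting task.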
Stream Processing Systems
• Apache Flink1
– Support for batch-oriented workloads
– Exactly-once semantics
– Streams are partitioned, each operator executes in parallel on a cluster
– Window operators
E.g., compute the frequency of hashtags every 60 seconds (or over the last 1000 tweets)
Source Sink
Streaming dataflow (DAG)
Transformations
Stream
1 A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, F. Naumann, M. Peters, A. Rheinlander, M. J. Sax, S. Schelter, M. Hoger, K. Tzoumas, and D. Warneke. “The Stratosphere Platform for Big Data Analytics.” The VLDB Journal, 23(6):939–964, Dec. 2014.
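The two window examples on this slide (every 60 seconds, or over the last 1000 tweets) can be sketched in plain Python. This does not use the Flink API; `tumbling_window_counts` and `last_n_counts` are hypothetical helper names, and each event is assumed to be a `(timestamp_in_seconds, hashtag)` pair, i.e. one hashtag per tweet.

```python
from collections import Counter, deque

def tumbling_window_counts(events, window_seconds=60):
    """Time window: group (timestamp, hashtag) events into fixed,
    non-overlapping windows and count hashtags per window.
    Returns {window_start_time: Counter}."""
    windows = {}
    for ts, tag in events:
        start = (ts // window_seconds) * window_seconds
        windows.setdefault(start, Counter())[tag] += 1
    return windows

def last_n_counts(tags, n=1000):
    """Count window: hashtag frequency over the most recent n events."""
    recent = deque(maxlen=n)  # older events fall off automatically
    for tag in tags:
        recent.append(tag)
    return Counter(recent)

# Hypothetical sample stream
events = [(0, "a"), (10, "b"), (59, "a"), (60, "a"), (125, "c")]
windows = tumbling_window_counts(events)
recent_counts = last_n_counts([tag for _, tag in events], n=3)
print(windows)
print(recent_counts)
```

A streaming engine would emit each window's result as soon as the window closes instead of collecting all windows in memory.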
Graph Analytics & Machine Learning
• Apache Spark
– GraphX1
– MLlib2
• Apache Flink
– Gelly3
– FlinkML4
1 J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. “GraphX: Graph Processing in a Distributed Dataflow Framework.” In the 2014 OSDI Conference, pages 599–613, 2014.
2 Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. “MLlib: Machine Learning in Apache Spark.” Journal of Machine Learning Research 17, 1 (January 2016), 1235-1241.
3 https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/gelly.html
4 https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/ml/index.html
Part III: Agenda
• Examples of a Big Data Indexing and Querying System for processing massive datasets
Traffic Monitoring
Battlefield
Behavior Tracking
Healthcare
Forest Fire Detection
Agriculture
Examples of Big Data
Sample Queries over Big Data
How many people are moving in the region between (xL,yL) and (xH,yH)?
Range Query (xL,yL)(xH,yH)
What is the current water quality at location (x,y)?
Point Query (x,y)
Which other cars are closest to location (x,y)?
KNN Query (x,y)
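As a reference point for the three query types above, here is a minimal brute-force sketch over an in-memory dictionary of object positions. The function names and the sample data are assumptions for illustration; a real system would answer these through an index rather than a full scan.

```python
import math

def point_query(points, x, y):
    """Objects located exactly at (x, y)."""
    return [pid for pid, pos in points.items() if pos == (x, y)]

def range_query(points, x_lo, y_lo, x_hi, y_hi):
    """Objects inside the rectangle (x_lo, y_lo)-(x_hi, y_hi), inclusive."""
    return sorted(
        pid for pid, (px, py) in points.items()
        if x_lo <= px <= x_hi and y_lo <= py <= y_hi
    )

def knn_query(points, x, y, k):
    """The k objects closest to (x, y) by Euclidean distance."""
    by_distance = sorted(points, key=lambda pid: math.dist(points[pid], (x, y)))
    return by_distance[:k]

# Hypothetical sample data: object id -> (x, y)
cars = {"car1": (1, 1), "car2": (2, 3), "car3": (5, 5), "car4": (2, 2)}
print(point_query(cars, 2, 3))
print(range_query(cars, 1, 1, 2, 2))
print(knn_query(cars, 2, 2, k=2))
```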
Possible Query Types
• Preference queries
– Subscription-based aggregation technique
– Skyline monitoring over frequent update streams
– Top-k query processing on multi-dimensional data with keywords
• Result diversification
– Diversified set monitoring over distributed data streams
– Diversified top-k query processing for mobile sensor networks
• Probabilistic Query Processing
– Nearest neighbor queries on distributed uncertain data
Big Data Management Architecture
• Rich Query Functionality: the ability to efficiently process point, range and k-nearest neighbor queries
• Distributed Index and Query Processing: designing an efficient key structure and a system which can efficiently distribute and retrieve the keys across the network
• Load Balancing: the capability of automatic load balancing for skewed data
• Consistency Management: the management of replica consistency to provide fault tolerance with respect to node failures
• Elastic Scalability: the capability to extend the system in the presence of dynamic workload
M-Grid: Objectives
• Among all the space filling curves
– Hilbert Curve achieves the best clustering property1
– After mapping, points which are close in the n-d hypercube tend to be close in 1-d space as well
– Clustering of data points allows efficient query processing
– Our key design uses a Hilbert Curve of appropriate order to give a linear ordering to data points in the multidimensional space
1. B. Moon, H. Jagadish, C. Faloutsos, and J. H. Saltz. Analysis of the clustering properties of the Hilbert space-filling curve. IEEE Transactions on Knowledge and Data Engineering, 13(1):124–141, January 2001.
0101 0110 1001 1010
0100 0111 1000 1011
0011 0010 1101 1100
0000 0001 1110 1111
Second Order Hilbert Curve
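The grid above can be reproduced with the standard iterative conversion from (x, y) cell coordinates to a Hilbert curve position. The sketch below is a generic textbook routine, not code from the M-Grid system; `hilbert_index` is a name chosen here, and the origin is assumed to be the bottom-left cell, which is the orientation that matches the slide.

```python
def hilbert_index(order, x, y):
    """Position of cell (x, y) along a Hilbert curve of the given order.
    The grid has 2**order cells per side, with (0, 0) at the bottom left."""
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has a canonical orientation
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Print the second-order grid, top row (y = 3) first, as 4-bit keys
for y in (3, 2, 1, 0):
    print(" ".join(format(hilbert_index(2, x, y), "04b") for x in range(4)))
```

Consecutive key values always belong to neighbouring cells, which is the clustering property the key design relies on.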
M-Grid Indexing Proposal
• Data Model and Point Query:
– HBase is designed to provide fast key access
– Our data model stores the coordinate values in the following way
Key | Family:Qualifier | Value
Hilbert-Value<x,y..> | data:userID | x.y
– Multiple users can be stored within the same key
– Given a Point (x,y), find its Hilbert-Value
– value = Table.get(Hilbert-Value(x,y))
– Can be extended for higher dimensions also
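A minimal sketch of this data model, assuming a plain Python dictionary as a stand-in for the HBase table (no HBase client is used). The names `row_key`, `put_location` and `get_locations`, the curve order of 2, and the zero-padded binary key format are assumptions made for illustration; padding keeps the lexicographic order of the keys equal to their numeric order.

```python
def hilbert_value(order, x, y):
    """Standard (x, y) -> Hilbert position conversion, origin at bottom left."""
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

ORDER = 2    # assumed curve order (4 x 4 grid)
table = {}   # stand-in for the table: row key -> {"data:<userID>": (x, y)}

def row_key(x, y):
    """Zero-padded binary Hilbert value, e.g. '0010' for cell (1, 1)."""
    return format(hilbert_value(ORDER, x, y), "0{}b".format(2 * ORDER))

def put_location(user_id, x, y):
    """Store one user's coordinates under the key of the cell they fall in."""
    table.setdefault(row_key(x, y), {})["data:" + user_id] = (x, y)

def get_locations(x, y):
    """Point query: a single key lookup returns every user in that cell."""
    return table.get(row_key(x, y), {})

put_location("u1", 1, 1)
put_location("u2", 1, 1)
put_location("u3", 3, 0)
print(get_locations(1, 1))
```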
Our Proposal
• How to avoid master-server bottleneck?
• Arrange the nodes in P-Grid overlay structure
– Builds a virtual trie structure for efficient mapping of keys to nodes
– Preserves key order relationship
– Completely decentralized
– Search can be initiated on any node
– Robust to failure and supports index replication
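The trie-based routing idea can be illustrated with a much-simplified sketch: each peer owns a binary path, keeps one reference per level to some peer on the other side of the trie at that level, and forwards a lookup until the key's prefix matches its own path. The `Peer` class, `build_refs` and `route` are hypothetical names, and replication, failures and the construction protocol of the real P-Grid overlay are not modelled.

```python
class Peer:
    def __init__(self, name, path):
        self.name = name
        self.path = path   # binary prefix this peer is responsible for
        self.refs = {}     # level -> a peer whose path differs at that bit

def build_refs(peers):
    """For each level of a peer's path, remember one peer in the sibling subtree."""
    for p in peers:
        for level, bit in enumerate(p.path):
            wanted = p.path[:level] + ("1" if bit == "0" else "0")
            p.refs[level] = next(q for q in peers if q.path.startswith(wanted))

def route(start, key_bits):
    """Forward the lookup from any peer; returns (responsible peer, hop names)."""
    peer, hops = start, [start.name]
    while True:
        for level, bit in enumerate(peer.path):
            if key_bits[level] != bit:
                peer = peer.refs[level]   # matched prefix grows by one bit
                hops.append(peer.name)
                break
        else:
            return peer, hops             # peer.path is a prefix of the key

peers = [Peer("A", "00"), Peer("B", "01"), Peer("C", "10"), Peer("D", "11")]
build_refs(peers)
owner, hops = route(peers[0], "1101")
print(owner.name, hops)
```

Because every hop extends the matched prefix by at least one bit, a lookup needs at most as many hops as the path length, and no central master is consulted.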
Our Proposal
• M-Grid System Architecture
M-Grid: Our Indexing Proposal
• Point query <q>
– The peer which is responsible for the data item is identified
– Query is routed to that peer and processed locally
– Result is sent back to the initiator
Our Proposal
• Range query <(Lx,Ly)(Hx,Hy)>
– Processes range query concurrently
– Only those data items which intersect the query are returned
– Search latency depends upon the skewness of data items
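One common way to run a range query over Hilbert-ordered keys is to translate the query rectangle into a set of contiguous key intervals, each of which can then be scanned independently (and hence concurrently). The slides do not spell out how M-Grid decomposes a range, so the sketch below is an assumption; `hilbert_index` and `hilbert_ranges` are names chosen here, and enumerating every cell is only practical for small rectangles.

```python
def hilbert_index(order, x, y):
    """Standard (x, y) -> Hilbert position conversion, origin at bottom left."""
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_ranges(order, x_lo, y_lo, x_hi, y_hi):
    """Merge the Hilbert values of all cells in the rectangle into
    contiguous (start, end) key intervals, both ends inclusive."""
    values = sorted(
        hilbert_index(order, x, y)
        for x in range(x_lo, x_hi + 1)
        for y in range(y_lo, y_hi + 1)
    )
    ranges = []
    for v in values:
        if ranges and v == ranges[-1][1] + 1:
            ranges[-1][1] = v         # extend the current interval
        else:
            ranges.append([v, v])     # start a new interval
    return [tuple(r) for r in ranges]

print(hilbert_ranges(2, 0, 0, 1, 1))   # one quadrant: a single interval
print(hilbert_ranges(2, 1, 1, 2, 2))   # centre square: several intervals
```

The second call shows why latency depends on the data and query shape: a rectangle that straddles quadrants breaks into more intervals, each needing its own scan.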
Our Proposal
• KNN Query <q,k>
– Starting from the query point, enlarge the search region until k objects are found
– Comprises a series of range searches; first construct a range search R centered at query point q = key and with radius r = ϕ = Dk/k
– Dk is the estimated distance between the query object and its k’th nearest neighbor; Dk can be estimated using the equation from Tao et al.*, where N is the estimated number of objects in the whole space and d is the dimensionality of M-Grid
* Y. Tao et al. An efficient cost model for optimization of nearest neighbor search in low and medium dimensional spaces.
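The expanding-range strategy can be sketched as follows. The estimate Dk is taken as an input parameter (`d_k`) because its formula is not reproduced here; growing the radius by a fixed step of Dk/k and the final refinement search are assumptions of this sketch, as are the function names. The range search uses a square region, so one last search with the k'th candidate distance as half-width is needed to catch closer points that lie outside the earlier square.

```python
import math

def range_search(points, q, r):
    """All points inside the square of half-width r centered at q."""
    return [p for p in points if abs(p[0] - q[0]) <= r and abs(p[1] - q[1]) <= r]

def knn_expanding(points, q, k, d_k):
    """k nearest neighbours of q via a series of growing range searches.
    d_k is an externally supplied estimate of the distance to the k'th neighbour."""
    if not 0 < k <= len(points) or d_k <= 0:
        raise ValueError("need 0 < k <= len(points) and d_k > 0")
    step = d_k / k                      # initial radius r = Dk / k
    r = step
    while len(range_search(points, q, r)) < k:
        r += step                       # enlarge the search region
    candidates = range_search(points, q, r)
    kth = sorted(math.dist(p, q) for p in candidates)[k - 1]
    # Refinement: the square of half-width kth contains the circle of radius kth
    final = range_search(points, q, kth)
    return sorted(final, key=lambda p: math.dist(p, q))[:k]

pts = [(0, 0), (2, 2), (0, 2.5), (5, 5)]
print(knn_expanding(pts, (0, 0), k=2, d_k=4.0))
```

In this example the first successful square (half-width 2) contains (2, 2) but misses the closer point (0, 2.5); the refinement step recovers it.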
Our Proposal
•
0101 0110 1001 1010
0100 0111 1000 1011
0011 0010 1101 1100
0000 0001 1110 1111
H-Order Space
Our Proposal
Categories | CG-Index | RT-CAN | EMINC | MD-HBase | Proposed Solution
Multidimensional queries | No | Yes | Yes | Yes | Yes
Master server bottleneck | No | No | Yes | Yes | No
Scalability | linear | linear | linear | low | High
KNN queries | No | Yes | No | Yes | Yes
Independent approach | No | No | No | Yes | Yes
Current Solutions: Comparison
Categories | CG-Index | RT-CAN | EMINC | MD-HBase | MGrid
Insert throughput | 256 nodes – 30k/sec | Not present | Not present | 16 nodes, 220k/sec | 16 nodes, 860k/sec
Set Up / Point Query (Traffic Dataset) | 265 nodes, 10m records – .033 ms | 128 nodes, 500k records per node – not given | 1000 nodes, 10m records – 40-50 ms | 4 nodes, 400m records – not present | 4 nodes, 400m records – less than 5 ms
Range Query | s = .1% – .06 ms | s = .1% – .14 ms | s = .001% – 50+ ms | s = 10% – 200000 ms | s = 10% – 50000 ms
KNN Query | Not present | k = 16 – 23k/sec | Not present | k = 1000 – 4 sec | k = 1000 – 2 sec
Current Solutions vs M-Grid: Comparison
Key Research Challenges
Data Collection
• Incentive schemes for mobile crowdsourcing
• Ensuring long-term user engagement
• Data Reliability
• Data Privacy
• Cost-efficiencies
Big Data management, processing & analytics on the Cloud
• Data filtering, validation & integration
• Indexing from the perspective of diverse stakeholders with varying needs
• Spatial data management & indexing to facilitate location-dependent services
• Big Data replication and partitioning approaches on the Cloud
• Privacy & security of the Big Data
• Uncertainty
Data Reasoning & Semantics
• Ontology generation & knowledge representation
• Context-awareness, reasoning & interpretation of the data for drawing valuable inferences
• Approaches for improving query expressiveness with semantic considerations
• Activity Recognition
• Group Activity Recognition
• Integration of domain-related constraints
References
• http://data-informed.com/urban-informatics-putting-big-data-to-work-in-our-cities/
• http://cusp.nyu.edu/urban-informatics/
• Foth, Choi & Satchell 2011 http://www.urbaninformatics.net/
• https://en.wikipedia.org/wiki/Urban_computing
• http://mckinseyonsociety.com/emerging-trends-in-urban-informatics/
• http://www.link-labs.com/smart-waste-management/
• https://www.smartbin.com/
• http://freakonomics.com/2013/08/16/how-to-save-time-hunting-for-a-parking-spot-south-korea-edition/
• http://tv.seas.harvard.edu/research.php.
• http://www.microsoft.com/presspass/presskits/zune/default.mspx.
• E. Adar and B. A. Huberman. Free riding on Gnutella. Proc. First Monday, 5(10), 2000
• M. Fischmann and O. Gunther. Free riders: Fact or fiction? Sep 2003
• L. Ramaswamy and L. Liu. Free riding: A new challenge to P2P file sharing systems. Proc. HICSS, 2003
• B. Yang and H. Garcia-Molina. Designing a super-peer network. Proc. ICDE, 2003
• S. Kamvar, M. Schlosser, and H. Garcia-Molina. Incentives for combatting free-riding on P2P networks. Proc. Euro-Par, 2003.
• N. Liebau, V. Darlagiannis, O. Heckmann, and R. Steinmetz. Asymmetric incentives in peer-to-peer systems. Proc. AMCIS, 2005
• L. Buttyan and J. Hubaux. Nuglets: a virtual currency to stimulate cooperation in self-organized mobile ad hoc networks. Technical Report DSC/2001/001, Swiss Federal Institute of Technology, Lausanne, 2001
• L. Buttyan and J.P. Hubaux. Stimulating cooperation in self-organizing mobile ad hoc networks. ACM/Kluwer Mobile Networks and Applications, 8(5), 2003.
• S. Zhong, J. Chen, and Y.R. Yang. Sprite: A simple, cheat-proof, credit-based system for mobile ad-hoc networks. Proc. IEEE INFOCOM, 2003
• K. Chen and K. Nahrstedt. iPass: an incentive compatible auction scheme to enable packet forwarding service in MANET. Proc. ICDCS, 2004
• V. Srinivasan, P. Nuggehalli, C.F. Chiasserini, and R. R. Rao. Cooperation in wireless ad hoc networks. Proc. INFOCOM, 2003
• J. Crowcroft, R. Gibbens, F. Kelly, and S. Ostring. Modelling incentives for collaboration in mobile ad hoc networks. Proc. WiOpt, 2003
• R. Chakravorty, S. Agarwal, S. Banerjee, and I. Pratt. MoB: a mobile bazaar for wide-area wireless services. Proc. MobiCom, 2005.
• E. Pitoura and B. Bhargava. Maintaining consistency of data in mobile distributed environments. Proc. ICDCS, 1995
• Yuan Xue, Baochun Li, and Klara Nahrstedt. Channel-relay price pair: Towards arbitrating incentives in wireless ad hoc networks. Journal of Wireless Communications and Mobile Computing, Special Issue on Ad Hoc Networks, Wiley InterScience, 2005
• Yuan Xue, Baochun Li, and Klara Nahrstedt. Optimal resource allocation in wireless ad hoc networks: A price-based approach. IEEE Transactions on Mobile Computing, 2005
• T. Hara and S.K. Madria. Consistency management among replicas in peer-to-peer mobile ad hoc networks. Proc. IEEE SRDS, 2005.
• T. Hara and S.K. Madria. Data replication for improving data accessibility in ad hoc networks. IEEE Transactions on Mobile Computing, 2006.
• A. Mondal, S.K. Madria and M. Kitsuregawa “E-ARL: An economic incentive scheme for adaptive revenue-load-based dynamic replication of data in Mobile-P2P networks.” Journal on Distributed and Parallel Databases 28(1) – DAPD 2010
• A. Mondal, S.K. Madria and M. Kitsuregawa “EcoRep: An economic model for efficient dynamic replication in mobile-P2P networks.” Proc. COMAD 2006
• N. Padhariya, A. Mondal, S.K. Madria and M. Kitsuregawa “Economic incentive-based brokerage schemes for improving data availability in mobile-P2P networks.” Journal of Computer Communications 36(8): 861-874, 2013
• Nilesh Padhariya, Anirban Mondal and Sanjay Kumar Madria. “Top-k Query Processing in Mobile-P2P Networks using Economic Incentive Schemes.” Peer-to-Peer Networking and Applications Journal, PPNA, 2015:8(5):1-21
• N. Padhariya, A. Mondal, V. Goyal, R. Shankar and S.K. Madria “EcoTop: An Economic Model for Dynamic Processing of Top-k Queries in Mobile-P2P Networks.” DASFAA 2011
• O. Wolfson, B. Xu, and A.P. Sistla. An economic model for resource exchange in mobile Peer-to-Peer networks. Proc. SSDBM, 2004
• B. Xu, O. Wolfson, and N. Rishe. Benefit and pricing of spatio-temporal information in Mobile Peer-to-Peer networks. Proc. HICSS-39, 2006
• Shuo Ma, Yu Zheng, Ouri Wolfson. Real-Time City-Scale Taxi Ridesharing. IEEE Trans. Knowl. Data Eng. 27(7): 1782-1795 (2015)
• A. Lakshman and P. Malik. “Cassandra: A Structured Storage System on a P2P network.” In Proc. of the 2008 ACM SIGMOD Conference, Vancouver, Canada, 2008.
• N. Garg. “HBase Essentials.” Packt Publishing, 2014.
• S. Sivasubramanian. “Amazon DynamoDB: A seamlessly scalable non-relational database service.” In Proceedings of the 2012 ACM SIGMOD Conference, pages 729–730, 2012.
• T. White. “Hadoop: The Definitive Guide.” O’Reilly Media, Inc., 1st edition, 2009.
• A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. “Hive - A Warehousing Solution over a Map-Reduce Framework.” VLDB Endowment, 2(2):1626–1629, Aug. 2009.
• M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, and I. Stoica. “Spark: Cluster Computing with Working Sets.” In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud’10, pages 10–10, 2010.
• G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. “Pregel: A System for Large-Scale Graph Processing.” In Proc. of the 2010 ACM SIGMOD Conference, pages 135–146, 2010.
• J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. “GraphX: Graph Processing in a Distributed Dataflow Framework.” In the 2014 OSDI Conference, pages 599–613, 2014.
• A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy. “Storm@Twitter.” In Proceedings of the 2014 ACM SIGMOD Conference, pages 147–156, 2014.
• A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, F. Naumann, M. Peters, A. Rheinlander, M. J. Sax, S. Schelter, M. Hoger, K. Tzoumas, and D. Warneke. “The Stratosphere Platform for Big Data Analytics.” The VLDB Journal, 23(6):939–964, Dec. 2014.
• S. Owen, R. Anil, T. Dunning, and E. Friedman. “Mahout in Action.” Manning Publications Co., 2011.
• X. Meng, J.K. Bradley, B. Yavuz, E.R. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. “MLlib: Machine Learning in Apache Spark.” CoRR, abs/1505.06807, 2015.
• Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. “Distributed GraphLab: A Framework for Machine Learning in the Cloud.” In Proc. of PVLDB Conference, pages 716–727, 2012.
• A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. “SystemML: Declarative Machine Learning on MapReduce.” In Proc. of ICDE 2011, pages 231–242, 2011.
Questions?
• Contact information
– Anirban Mondal: [email protected]
– Praveen Rao: [email protected]
– Sanjay Kumar Madria: [email protected]