MOTIVATION
Uncertainty is inevitable in any attempt to predict earthquakes, as the ultimate cause (mantle convection) is inherently non-deterministic - in situ measurements, however, may reduce this uncertainty
The availability of actionable observations is time critical to effective and efficient communications of advisories and warnings
The establishment of a cause (earthquake) - effect (tsunami) relationship remains outstanding, and is complicated by multiple factors (e.g., tectonic setting)
Far-field estimates of tsunami propagation (pre-computed) and coastal inundation (computed in real time), however, have proven to be extremely accurate - results successfully combine data from a distributed array of deep-ocean tsunami detection buoys with a forecasting model
THE 6Vs OF SCIENTIFIC VS. SOCIAL NETWORKING DATA
CONCLUSIONS
Credible tweets could be transformative - a Big Data source that can complement traditional sources (e.g., scientific instruments)
Working with 6V Twitter data can be challenging, though it also presents interesting opportunities
Curation of training data is extremely important, but also extremely time consuming (as this is a manual process)
Current research emphasizes Deep Learning, BUT RDF/OWL semantics will need to play a role ultimately
Approach can be generalized for application to natural and anthropogenic disasters of all kinds
ACCOUNTING FOR OIL SPILLS AND OTHER DISASTERS …
Energy exploration via reflection seismology provides the fundamental source of data that is subsequently processed and interpreted for the identification of potential petroleum reservoirs
Reservoir simulation is used to engineer the extraction of petroleum reserves from reservoirs
Drilling is used to ‘truth’ the results provided by interpretations and simulations prior to production extraction
SOPs ensure extraction of oil from a production reservoir is routinely monitored and reported upon - e.g., to quantify rig safety and output (barrels/day)
From exploration to extraction, this is a data-rich workflow
Additional data sources become relevant when disasters occur (e.g., oil spills) - from re-purposed scientific instruments (e.g., weather satellites) to social media (e.g., Twitter, Instagram, Snapchat, ...)
Data-rich workflows can generate problems in Big Data Analytics
Deep Learning Pipeline
THE OPPORTUNITY FOR SEMANTICS
A feature vector is a feature vector - it is devoid of semantics
Ignores inherent, overall credibility of a Tweet - e.g., as quantified by TweetCred
Twitter metadata (handles, hashtags and URLs) contributes equally to Twitter data (unstructured text that comprises the body of a Tweet) in constructing feature vectors - i.e., the semantic value of Twitter metadata is also ignored by Deep Learning
The W3C’s Resource Description Framework (RDF) facilitates the representation of metadata and thus exposes semantics
The W3C’s Web Ontology Language (OWL) accounts for domain specifics - disambiguates use of overloaded terms (e.g., “earthquake”) in different contexts (e.g., geophysics vs. movies vs. …)
Deep Learning in combination with RDF/OWL semantics has the potential to produce learned models with knowledge represented
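The separation of Twitter metadata from body text described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used for the poster: the `http://example.org/tweet#` namespace, the predicate names (`mentions`, `hasHashtag`, `linksTo`, `hasBody`) and the tweet identifier `t1` are hypothetical, and triples are held as plain tuples rather than serialized RDF. The sample text is the first example tweet shown elsewhere on this poster.

```python
import re

# Hypothetical namespace and predicate names, for illustration only;
# they are not taken from any published vocabulary.
EX = "http://example.org/tweet#"

HANDLE = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")
URL = re.compile(r"https?://\S+")


def split_tweet(text):
    """Separate Twitter metadata (handles, hashtags, URLs) from the body text."""
    urls = URL.findall(text)
    no_urls = URL.sub(" ", text)
    handles = HANDLE.findall(no_urls)
    hashtags = [tag.lower() for tag in HASHTAG.findall(no_urls)]
    body = HASHTAG.sub(" ", HANDLE.sub(" ", no_urls))
    body = " ".join(body.split())  # collapse leftover whitespace
    return {"handles": handles, "hashtags": hashtags, "urls": urls, "body": body}


def to_triples(tweet_id, parts):
    """Express the separated parts as (subject, predicate, object) tuples."""
    subject = EX + tweet_id
    triples = [(subject, EX + "mentions", h) for h in parts["handles"]]
    triples += [(subject, EX + "hasHashtag", t) for t in parts["hashtags"]]
    triples += [(subject, EX + "linksTo", u) for u in parts["urls"]]
    triples.append((subject, EX + "hasBody", parts["body"]))
    return triples


sample = ("5.0 earthquake! Thu Jun 05 02:04:27 GMT+09:00 2014 near 84km SW of "
          "Iquique, Chile http://t.co/mmFokGQWT7 #earthquake")
parts = split_tweet(sample)
triples = to_triples("t1", parts)
```

Keeping metadata as explicit triples means it can be weighted or reasoned over separately, instead of being hashed into the same feature vector as the body text.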
MITIGATING DISASTERS WITH DEEP LEARNING FROM TWITTER?
WWW.UNIVA.COM
Deep Learning pipeline implemented using the Machine Learning Library (MLlib) from Apache Spark scaled onto a converged Big Data/HPC cluster via Univa Universal Resource Broker for featurization, training, evaluation and operational use.
Copyright © 2017 Univa® and Grid Engine® are registered trademarks of Univa Corporation
Data extracted from Twitter via a Perl script that targeted the hashtag #earthquake
Spark MLlib HashingTF establishes frequency-based usage
Spark MLlib Logistic Regression with SGD classifies spam vs. ham
Recent ‘earthquake’ data from Twitter used to evaluate model
Featurization Training Model Evaluation
Feature Vectors, Training Data
Twitter data manually curated into ‘ham’ and ‘spam’, then represented in-memory via Spark RDDs
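The featurization and training steps named above (hashed term frequencies, then logistic regression fitted by SGD) can be illustrated without a Spark installation. The sketch below is plain Python, not Spark MLlib code. It assumes a CRC32 hash as a stand-in for the hash function Spark uses, a 1024-slot feature space, label 1 for ham and 0 for spam, and four invented toy tweets that are not the poster's curated training data.

```python
import math
import random
import zlib

NUM_FEATURES = 1024  # assumed size of the hashed feature space

# Invented toy examples (1 = ham, 0 = spam); not the curated training data.
TOY_TWEETS = [
    ("magnitude 5.0 earthquake near coast", 1),
    ("earthquake magnitude 2.7 region central", 1),
    ("political earthquake in latest poll", 0),
    ("movie scene earthquake wrapped extras", 0),
]


def hashing_tf(text, num_features=NUM_FEATURES):
    """Map text to a sparse term-frequency vector via the hashing trick."""
    vector = {}
    for token in text.lower().split():
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector


def predict_proba(weights, vector):
    """Return the modelled probability that a feature vector is ham."""
    z = sum(weights[i] * v for i, v in vector.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(examples, num_features=NUM_FEATURES, epochs=50, step=0.1, seed=0):
    """Fit logistic-regression weights with plain stochastic gradient descent."""
    rng = random.Random(seed)
    weights = [0.0] * num_features
    examples = list(examples)
    for _ in range(epochs):
        rng.shuffle(examples)
        for vector, label in examples:
            error = predict_proba(weights, vector) - label
            for i, v in vector.items():
                weights[i] -= step * error * v  # gradient of the log loss
    return weights


weights = train_sgd((hashing_tf(text), label) for text, label in TOY_TWEETS)
```

Note that the word "earthquake" appears in every toy example, so it carries little weight on its own; the classifier has to rely on the surrounding terms, which is the same difficulty the #earthquake hashtag poses for real tweets.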
Pipeline diagram labels: SPAM, HAM, Model, Best Model
1000+ Apps, Data Sources
Univa Universal Resource Broker
Univa Grid Engine Scheduler
API, Command Line
Spark UIs
Data Frames, ML Pipelines
MLlib, GraphX, Spark Streaming
Spark Core
Spark SQL
Dimension | Traditional Scientific Data | Twitter Data
Volume | small'ish, finite | BIG, ‘infinite’
Variety | semi-structured, restricted | unstructured, unrestricted - except for handles, hashtags & URLs (pages, images)
Velocity | slow, sampled | fast, streamed
Volatility | low (stationary, irreplaceable) | high? (mobile? disposable?)
Veracity
Validity
Created at: Wed Jun 04 20:29:33 +0000 2014
5.0 earthquake! Thu Jun 05 02:04:27 GMT+09:00 2014 near 84km SW of Iquique, Chile http://t.co/mmFokGQWT7 #earthquake
Created at: Wed Jun 04 20:30:13 +0000 2014
The #earthquake continues: Latest via @Spectator_CH /@YouGov -#Labour 36 #Tories 32%, LD 8%, #Ukip 14%. Implied Labour majority- 42 .
Created at: Wed Jun 04 20:31:35 +0000 2014
#terremoto ML 2.7 CENTRAL ITALY: Magnitude ML 2.7 Region CENTRAL ITALY Date time 2014-06-04 20:01:33.9 UTC... http://t.co/Y141Ovu6kP
Created at: Tue Jun 10 12:22:34 +0000 2014
RT @TheRock: Just wrapped a massive post earthquake scene for SAN ANDREAS. To the hundreds of background actors/extras.. THANK U for all yo...
Veracity: biases, noise & abnormalities
Validity: accuracy & correctness
April 16, 2016
01:25 JST - Magnitude 7.1 earthquake, Kyushu area, 10 km depth
01:27 JST - Tsunami advisories issued: imminent arrival, ~1 m maximum height
01:29 JST - High-tide amplification advisory
01:40 JST - Estimated tsunami first arrives
02:14 JST - Tsunami advisories lifted
04:43 - 04:54 JST - High-tide times
Japan Meteorological Agency
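To make the time-critical nature of this timeline concrete, the intervals between events can be computed directly from the listed times. The sketch below is a worked piece of arithmetic only. The event keys are hypothetical names chosen for illustration, and the times are those of the April 16, 2016 timeline above, taken as JST (UTC+9).

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))


def jst(hhmm):
    """Build a timezone-aware datetime on April 16, 2016 from an HH:MM string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2016, 4, 16, hour, minute, tzinfo=JST)


# Event names are hypothetical labels for the timeline entries.
events = {
    "earthquake": jst("01:25"),
    "advisories_issued": jst("01:27"),
    "estimated_first_arrival": jst("01:40"),
    "advisories_lifted": jst("02:14"),
}


def minutes_between(start, end):
    """Whole minutes elapsed from one named event to another."""
    return int((events[end] - events[start]).total_seconds() // 60)


time_to_advisory = minutes_between("earthquake", "advisories_issued")
warning_lead_time = minutes_between("advisories_issued", "estimated_first_arrival")
```

On these figures the advisory followed the earthquake within a couple of minutes, leaving little more than ten minutes before the estimated first arrival. That is the window in which any complementary data source would have to deliver actionable observations.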