Ashish Mahabal aam at astro.caltech.edu
Center for Data Driven Discovery, Caltech
IAU 325: AstroInformatics, Sorrento, Italy
2016-10-23
From Sky to Earth: Data Science Methodology Transfer
JPL Data Science Initiative
NASA Advanced Information Systems Technology Program (AIST)
Western States Water Architecture Study
EarthCube
VIFI
Broad Outline
• Similarities in Big data of Astro and Earth Sciences
• The Hydrology case
• Example projects from BigSkyEarth
• EarthCube - the Earth VO
• Domain Adaptation
• Summary
Generic Big Data
• Complex rather than just voluminous
• Real-time needs
• Complexity in terms of
• spatial distribution
• spatial and temporal resolution,
• time epochs (number of and irregularity),
• coverage (overlap)
Volume, Velocity, Volatility, Veracity, Value, …
[Figure: transient phase space. Peak luminosity (M_V and erg s^-1) against characteristic timescale (days), with regions labelled Classical Novae, Luminous Red Novae, Ca-rich Transients, .Ia Explosions, Thermonuclear Supernovae, Core-Collapse Supernovae and Luminous Supernovae.]
Big Data - Astronomy
• Complex rather than just voluminous (catalogs, spectra, polarimetry)
• Real-time needs (e.g. transient classification)
• Understanding in terms of existing models (e.g. Tabby’s star, HB stars)
• Complexity in terms of
• spatial distribution (data archives at different locations)
• spatial and temporal resolution (HST~0”.1 -> TESS~10”)
• time epochs (number of and irregularity) (SDSS - Kepler)
• coverage (overlap) (DLS -> Gaia)
J Cooke
Big Data - Earth Science
• In-situ measurements
• Satellite-based observations
• Models (predictive, computational)
• Real-time needs (e.g. predicting flashfloods, ephemeral water flow)
• Complexity in terms of
• spatial and temporal resolution (wells, snow, moisture, underground water, overground water)
• time epochs (number of and irregularity) (snow, wells)
• coverage (overlap)
Big Data - Earth Science
Tools: Python, R, GrADS, IDL, Matlab, ArcGIS, HydroDesktop, and Google’s Earth Engine.
Multi-dimensional indexing: GeoMesa, GeoWave
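Multi-dimensional indexes of this kind commonly rest on space-filling curves, which fold two or more coordinates into one sortable key. The sketch below is a minimal Z-order (Morton) encoding for longitude/latitude. It is an illustration only, not GeoMesa's or GeoWave's code; the function names and the 16-bit resolution are assumptions made for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return code


def morton_index(lon: float, lat: float, bits: int = 16) -> int:
    """Map a lon/lat pair (degrees) to a single sortable integer key."""
    cells = (1 << bits) - 1
    # Scale each coordinate to an integer cell number in [0, 2**bits - 1].
    x = int((lon + 180.0) / 360.0 * cells)
    y = int((lat + 90.0) / 180.0 * cells)
    return interleave_bits(x, y, bits)
```

Sorting records by such a key tends to keep spatially nearby points close together in storage, which is what makes range scans over a key-value store practical for spatial queries.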
Ontologies
Astronomical Objects
PDS
Steve Hughes
Dan Crichton
PDS -> Earth Science (NASA)
Multiplicity of ontologies
Meta-data (and ontologies) are good
Too many, or non-conforming, systems may be hurtful
ONTOLOG, EarthCube, OGC, SWEET
The Hydrology Case
Parallels in Earth Science and astronomy methodology
• Water vapour
• Precipitation
• Surface Water
• Ground Water
• Snow
• Evaporation
• Rivers/Lakes/…
GRACE AQUA
Next few slides from ARSET
From less than one up to two measurements per day
Data latency ranges from under 3 hours to 3 months
TESS will have downlinks every 15 days
Very distributed and not talking enough to each other
Water
JPL Data Science Initiative
NASA Advanced Information Systems Technology Program (AIST)
Western States Water Architecture Study
Input-Forcing (e.g., GPM)
For Data Assimilation (e.g., MODSCAG)
Standard Reports
Ad Hoc Queries and Custom Reports
Snow-Water Equivalent
Surface Water
Ground Water
Single-Month Estimates
Short and Long-Term Trends
Research
Applications
Decision Support
Data Science Infrastructure (Tools, Services, Methods for Massive Data Analysis)
A Scalable Data Processing System for Hydrological Science
(Web-Based Interface)
Western States Water Mission (WSWM)
hydrological state estimation on water availability, at 3km2 resolution for the Western US
timely actionable information
a close collaboration of hydrological modeling and data science expertise in a mission-style project architecture
WaterTrek: an interactive, web-based analytics environment
Regularization of spatial resolution
Time series regularization
Integration of datasets
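Time series regularization can be sketched as resampling irregular samples onto a fixed grid. The example below uses plain linear interpolation, which is an assumption made here for illustration; the slides do not say which method WSWM uses. The function name `regularize` and the sample readings are hypothetical.

```python
from bisect import bisect_left


def regularize(times, values, step):
    """Linearly interpolate irregular samples onto a regular time grid.

    `times` must be strictly increasing. The grid starts at times[0] and
    stops at or before times[-1], so nothing is extrapolated.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two (time, value) samples")
    n_steps = int((times[-1] - times[0]) // step)
    grid = [times[0] + k * step for k in range(n_steps + 1)]
    out = []
    for t in grid:
        # Index of the first sample at or after t, clamped to the last sample.
        j = min(bisect_left(times, t), len(times) - 1)
        if times[j] == t:
            out.append(values[j])
        else:
            t0, t1 = times[j - 1], times[j]
            w = (t - t0) / (t1 - t0)
            out.append(values[j - 1] + w * (values[j] - values[j - 1]))
    return grid, out


# Made-up readings taken on days 0, 1, 4 and 6, resampled every 2 days.
grid, series = regularize([0, 1, 4, 6], [0.0, 2.0, 8.0, 4.0], step=2)
```

Once every layer shares the same grid, datasets with different cadences can be compared or combined sample by sample.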
WSWM
Pacific Northwest
California
Great Basin
Lower Colorado
Upper Colorado
WSWM domain: Continental US west of divide
Franklin D Roosevelt Lake
Lake Koocanusa
Shasta Lake
Lake Mead
Lake Powell
Contains 5 of the 15 largest US reservoirs
Getting ready for SWOT
Actual model resolution
Largest rivers
658,702 river reaches (1,410,328 total length) 7,532 gauges (many now inactive)
Hyper-resolution with assimilation
Facilitates informed decisions at the local level
High-resolution modeling over large spatial domain
High Level Concept of Data Management and Data Analytics
COST’s
First Training School at Oberpfaffenhofen 2016
https://github.com/marcoq/BSE_TS2016_Oberpfaffenhofen/
EarthCube
NSF 2011
Cyberinfrastructure sharing
visualization analysis
Interoperability standards better integration
democratizing data
JPL, Caltech
Scalable Arch
Test Environment
EarthCube Funded Projects
• BCube: Broker for Next generation Geoscience (mediating interactions)
• Integrating Long-Tail Data and Models
• Scalable Community Driven Architecture
• (SG Djorgovski, E Law, D Crichton, A Mahabal)
• ECITE (Graves, Yang, Law, Djorgovski, Mahabal)
• … (other Building Blocks)
Scalable Community Driven Architecture
• Identify Stakeholders, key use cases
• Incorporate cross-agency informatics efforts to capture architectural drivers, principles, models
• Roadmap for extensible and sustainable participation coherent with cyberinfrastructure
• Design architecture, data intensive system leading to discovery in the big data era
Team: S.Caltagirone, D.Crichton, S.G.Djorgovski, T.Huang, S.Hughes, E.Law, A.Mahabal, D.Pilone, T.Pilone
EarthCube Integration and Test Environment (ECITE)
• Seamless federated system of scalable, location independent resources
• Compute and storage with minimal administration
• Integration, test, and evaluation
• Share ideas, concepts, experiments
SarvabhaumMandlik
Caltech + GMU
Domain Adaptation
With Jingling Li, Samarth Vaijanapurkar, Brian Bui, …
Feature Correlations
Sample from Drake et al.
RF, GFK, CODA, …
• Examine the baseline performance for three combinations of data using random forest blindly:
• Source to Target
• Source + Target to Target
• Target to Target
• Compare performance with Domain Adaptation
• Misclassified objects and outliers
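The three baseline combinations can be sketched on synthetic data. To keep the example dependent on the standard library only, a nearest-centroid classifier stands in for the random forest named on the slide; the two Gaussian classes, the domain shift of 2.0 and all function names are assumptions made for illustration, not the survey data or code used in the study.

```python
import random


def make_domain(rng, n, shift):
    """Two Gaussian classes in 2-D; `shift` offsets the whole domain."""
    data = []
    for label, centre in ((0, (0.0, 0.0)), (1, (3.0, 3.0))):
        for _ in range(n):
            x = rng.gauss(centre[0] + shift, 1.0)
            y = rng.gauss(centre[1] + shift, 1.0)
            data.append(((x, y), label))
    rng.shuffle(data)
    return data


def fit_centroids(train):
    """Return {label: mean feature vector} for the training set."""
    sums, counts = {}, {}
    for (x, y), label in train:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {k: (sx / counts[k], sy / counts[k]) for k, (sx, sy) in sums.items()}


def accuracy(centroids, test):
    """Fraction of test points whose nearest centroid carries the true label."""
    def sq_dist(k, x, y):
        return (x - centroids[k][0]) ** 2 + (y - centroids[k][1]) ** 2

    hits = 0
    for (x, y), label in test:
        guess = min(centroids, key=lambda k: sq_dist(k, x, y))
        hits += guess == label
    return hits / len(test)


def baselines(source, target_train, target_test):
    """Accuracy on held-out target data for the three training combinations."""
    combos = {
        "source -> target": source,
        "source + target -> target": source + target_train,
        "target -> target": target_train,
    }
    return {name: accuracy(fit_centroids(train), target_test)
            for name, train in combos.items()}


rng = random.Random(0)
source = make_domain(rng, 200, shift=0.0)    # stands in for the labelled survey
target = make_domain(rng, 200, shift=2.0)    # same classes, shifted features
target_train, target_test = target[:200], target[200:]
results = baselines(source, target_train, target_test)
```

The gap between the "source -> target" and "target -> target" scores is the loss that a domain adaptation method is meant to recover when few target labels are available.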
To be used with the various hydrology layers having irregular time series
Aspects to be explored through VIFI
Talukdar, Mahabal, Djorgovski, Crichton
Earth to Sky: Pokemon Go to Transient Go
Binary Transient Brokers combined with AR
CRTS +LSST; Gaia?
SUNY Oswego CS undergrads
Summary
• Many parallels in Astro- and Earth-sciences
• In Earth science, many datasets are still analyzed separately
• One big difference: intervention possible
• water distribution
• Citizen Science not explored enough
• monitoring presence of lead at different locations
• Many other use cases being explored