Environmental Research with RapidMiner
About Me
Rodrigo Fuentealba Cartes
Lead Data Scientist and Senior Software Developer at Pegasus
Mr. Fuentealba has been using and developing open source technologies since 1995. His data science career started in 2008, when he began building models for healthcare and for predictive maintenance of vessels.
Use Case
This project has been in development since 2016 as an effort to address environmental issues in the salmon farming process.
The Pegasus Group provides data science services and technology support to this project.
Background
• Chile
• World's 2nd largest farmed salmon exporter.
• Salmon farming is the 3rd largest economic activity.
• In 2017, salmon farming produced USD 4.5 billion in revenue.
Problem
• Sea Lice
• Deadly parasite that infests and damages salmonids.
• Threatens the environment, the communities and the local economy, both directly and indirectly.
• USD 350 million is spent to address it.
Sea Lice
Challenge
• Understanding how Caligus spreads.
• Predicting which salmon farms are in immediate danger.
• Evaluating the best antibiotic treatments against Caligus.
Warming up
How to solve these challenges?
• Apply a Hydrodynamic Model to review tide directions.
• Apply Predictive Analytics to detect farms in danger.
• Apply Machine Learning to evaluate the best treatments.
Methodology
• RMDS: Rod's Methodology for Data Science
• Understanding the Context.
• Asking the right Questions.
• Identifying the Nouns.
• Taking action with Verbs.
• Interpreting Answers.
RMDS vs CRISP-DM
• Context
• Questions
• Nouns (Data)
• Verbs (Processes)
• Answers
Infrastructure
GIS DWH CMM
Hydra 12 DBs
API
Connie
Dashboard
Applying Nouns and Verbs
GIS (Noun)
DWH (Verb)
CMM (Verb)
Hydra (Noun)
9 DBs (Noun)
API (Noun)
Connie (Noun)
Dashboards (Verb)
No Big Data
But there are massive amounts of it.
How much data do we have?
100 Gb (stable)
47 Gb (hourly)
10 Gb (yearly)
300 Gb (stable)
1 Gb (hourly)
Challenge 1: How is the parasite spread?
How is Caligus spread?
• Hydrodynamic Model
• Streaming a 4D representation of the ocean (latitude, longitude, depth and time) in time-series format.
• Processing this representation with the Navier-Stokes equations and map/reducing the result into the Connie Matrix.
(Think of automatic BMP to SVG transformation, a few million times heavier)
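The map/reduce step into a connectivity matrix can be illustrated with a toy example. This is only a minimal sketch, not the project's implementation: it assumes the hydrodynamic model has already produced particle tracks as (source farm, destination farm) pairs, and the farm names and the `connectivity_matrix` helper are hypothetical.

```python
from collections import Counter

def connectivity_matrix(tracks, farms):
    """Reduce (source, destination) tracks into row-normalized fractions."""
    # Map step: every track becomes one (source, destination) key to count.
    counts = Counter((src, dst) for src, dst in tracks)
    # Particles released per source farm, used to normalize each row.
    released = Counter(src for src, _ in tracks)
    matrix = {}
    for src in farms:
        total = released[src]
        # Reduce step: fraction of particles from src that ended up at dst.
        matrix[src] = {
            dst: (counts[(src, dst)] / total if total else 0.0)
            for dst in farms
        }
    return matrix

# Hypothetical tracks: (farm where a particle was released, farm it reached).
tracks = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "C"), ("C", "C")]
farms = ["A", "B", "C"]
matrix = connectivity_matrix(tracks, farms)

for src in farms:
    print(src, matrix[src])
```

Each row then reads as "of everything released at this farm, what share reaches each other farm", which is the kind of question a connectivity matrix is meant to answer.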
4D Representation of the Ocean
(Figure: blocks of data laid out along the X, Y and Z dimensions, repeated along a time axis.)
Performance of Hydrodynamic Model
(Chart: "Hit" series; one axis runs from 0 to 3000, the other from 1500 down to 300.)
Connectivity Matrix
Challenge 2: Which farms are in danger?
Which farms are in danger?
• Answer: the ones in the path of Caligus!
• Mix operational databases, the GIS database and the Connie Matrix in the data warehouse.
• Run time-series analysis and k-Means clustering on different pairs, 360 times per block.
• A manually trained Decision Tree helps categorize the threat level on a scale from 0 to 10.
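The clustering-then-rules idea above can be sketched in plain Python. The actual work was done in RapidMiner, so everything here is an assumption made for illustration: the two features (a connectivity share and a lice count per fish), the thresholds, and the function names are hypothetical, and the "decision tree" is written as hand-coded rules.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

def threat_level(connectivity, lice_per_fish):
    """Hand-written rules mapping two hypothetical features to a 0-10 level."""
    if connectivity < 0.1:
        return 0 if lice_per_fish < 1.0 else 3
    if connectivity < 0.5:
        return 5 if lice_per_fish < 3.0 else 7
    return 8 if lice_per_fish < 3.0 else 10

# Hypothetical farms as (connectivity share, lice per fish).
points = [(0.05, 0.5), (0.08, 0.7), (0.6, 4.0), (0.7, 3.5)]
labels, centroids = kmeans(points, k=2)
levels = [threat_level(c, l) for c, l in points]
print(labels, levels)
```

Clustering groups farms with similar exposure, and the rule function turns the same features into a score that is easy to show on a dashboard.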
RapidMiner: Getting Operational DBs
RapidMiner: Joining Operational and GIS
RapidMiner: Joining Connie Matrix
• Same ol', same ol',
• Except that it's done with PostgreSQL and PostGIS.
• So, no pictures of this process.
RapidMiner: k-Means + Decision Tree
Reports
Results
• Find farms that might be attacked within 2 weeks.
• Trained on data from 2016, tested on data from 2017.
• This has been pretty consistent with data from 2018.
                   True Hit   True Miss   Class Precision (%)
Pred. Hit              4982        1845                 72.97
Pred. Miss              890      192817                 99.54
Class Recall (%)      84.84       99.05
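The percentages in the table follow directly from the four counts. As a quick check, the sketch below recomputes class precision and class recall from them; the function and variable names are illustrative, only the counts come from the table.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)

# Counts taken from the confusion matrix above.
pred_hit_true_hit = 4982
pred_hit_true_miss = 1845
pred_miss_true_hit = 890
pred_miss_true_miss = 192817

# Class precision: share of each prediction row that was correct.
precision_hit = percent(pred_hit_true_hit, pred_hit_true_hit + pred_hit_true_miss)
precision_miss = percent(pred_miss_true_miss, pred_miss_true_hit + pred_miss_true_miss)

# Class recall: share of each true column that was predicted correctly.
recall_hit = percent(pred_hit_true_hit, pred_hit_true_hit + pred_miss_true_hit)
recall_miss = percent(pred_miss_true_miss, pred_hit_true_miss + pred_miss_true_miss)

print(precision_hit, precision_miss)  # 72.97 99.54
print(recall_hit, recall_miss)        # 84.84 99.05
```

Precision is read along the rows (how often a prediction was right) and recall along the columns (how many real cases were caught), which is why the "Hit" class shows 72.97% precision but 84.84% recall.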
Challenge 3: What is the best treatment?
Data Model for Production/Mortality
Challenge
• Explore operational databases for the following things:
• Maximized production and minimized mortality rate.
• Analyze diseases, caligus reports, treatments and vaccinations.
• Retrieve patterns that are applied in the best farms and apply these to the worst ones.
Notice
While I designed the entire database structure myself, the data it contains is proprietary and I cannot share it with you. That doesn't mean I can't obfuscate the data to show you how we performed the analysis.
Also, the workflow has been simplified from nearly a thousand processes to just two, as proper data extraction and classification was quite difficult.
Preparation Process
Analytics Process
Results
Real Life Testing
• Sample: 20 of nearly 5,800 farms.
• The combination of treatments was designed through SVM, Neural Networks and Time-Series. (Too complex to be shown here).
• Mortality reduced by 46.1% (73.7% for Caligus).
• USD 97,565 saved in treatments.
• Expected to save USD 24 million by 2019.
Conclusions
• #DataSci is about solving challenges with technology: we apply it in many other use cases.
• Proper data prep overcomes the limits imposed by technical debt. Developments in public organizations suffer a lot from this.
• A quick process model (20%) helps us fail fast and achieve results earlier.
• RapidMiner excels at both. We couldn't have done this without it.
RapidMiner: Data Science, Fast and Simple
Contact Information
Rodrigo Fuentealba Cartes
E-mail: [email protected]
Twitter: @datasciencegems
LinkedIn: https://www.linkedin.com/in/rodrigofuentealbacartes/