So You Want To Be A Data Scientist? What It Means To Be A Data Scientist


Page 1: So you want to be a Data Scientist?

So You Want To Be A Data Scientist?

What It Means To Be A Data Scientist

Page 2: So you want to be a Data Scientist?

About:Me

Mohd Izhar Firdaus Ismail

- Current: Solution Architect @ ABYRES Enterprise Technologies Sdn Bhd

- Open Source Activist & (self-proclaimed) Hacker, Open Data Advocate, Fedora Ambassador, Data Architect, Data Engineer, Consultant, Python Programmer, Analyst, Trainer, and a bunch of other hats ;-)

- Contributing to Open Source projects for over 8 years

- Over 6 years building systems related to data, content, information and knowledge management

- http://linkedin.com/in/kagesenshi

- [email protected] / [email protected]

Page 3: So you want to be a Data Scientist?

The People I Work For

● Open Source Technology Company

– Specializes in Cloud, Big Data & Enterprise Application Development

– Red Hat & Hortonworks Partner

● IT Consulting & Professional Services around Open Source Software

– Design, development, implementation and training services

– Consulting practice around leveraging Open Source technologies and implementing Big Data projects

● The largest organized mafia of pure play open source geeks in Malaysia ;-)

Page 4: So you want to be a Data Scientist?

Before I Start

Some people call me a data scientist, but I don't consider myself one (yet)

(( it's a personal integrity thing – Machine Learning & Stats are not (yet) my strong points ))

But I do work quite a bit with data: designing applications, infrastructure, algorithms, processes and pipelines for big data workloads – from data acquisition to visualization

Page 5: So you want to be a Data Scientist?

Who is A Data Scientist?

Page 6: So you want to be a Data Scientist?

"Data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others." - Mike Loukides, VP, O’Reilly Media.

"A data scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning. Data scientists not only are adept at working with data, but appreciate data itself as a first-class product." - Hilary Mason, Data Scientist, Accel; Scientist Emeritus, bitly; co-founder, HackNY.

Page 11: So you want to be a Data Scientist?

What's With The Superhuman Requirements?

Page 12: So you want to be a Data Scientist?

Domain Knowledge & Soft Skills

● Knowledge to find what matters

– Knowing the statistics does not mean knowing what the results signify for a business

– Business rules, terminology, problem-solving techniques, scientific theories & formulas

– Identifying actionable information

● Problem solving & Hacker mindset

– New & creative ways to find, acquire, transform, manipulate, mash up, and use data

– Possibly unconventional uses of the same result

– Knowing what data is needed, and how to get it, to solve a particular business problem

Page 13: So you want to be a Data Scientist?

Math & Statistics

● People use your output for decision making – wrong numbers might end up with bad decisions

– Lies, damned lies, and statistics

● Machine Learning

– Predict future values

– Analyze patterns in structured and unstructured data

– Automated decision support systems

Page 14: So you want to be a Data Scientist?

Programming & Database

● Programming

– Calculating a few thousand rows in Excel might be okay, but dealing with distributed processing needs some skill

● Querying distributed data – you don't want a query that gets stuck on a single core in a cluster of hundreds of nodes

– Simple visualizations can be done with drag-and-drop builders; complex visualizations will require you to get your hands dirty

– Advanced decision system capabilities can only be implemented through some sort of rule programming

– Develop data pipelines, both batch and stream

– Develop data collection, scraping, machine learning & artificial intelligence software

● Database

– Ingesting data from various types of sources, managing data formats, data storage, governance

Page 15: So you want to be a Data Scientist?

Communication & Visualization

● Spreading information and discoveries

– Presenting data in a form that non-scientists can understand

– Knowing how to explain to business users why a result matters and how it can be used to benefit the business, organization, society

● Identifying patterns through visual analysis

– Some insights might not be obvious when presented in columns and rows

– Knowing how to visualize information so as to make hidden patterns more obvious

Page 17: So you want to be a Data Scientist?

Data Science VS Data Engineering

Page 19: So you want to be a Data Scientist?

The Key Differences

● Data Science

– Problem solving through strategies around data

– Hindsight, Insight, Foresight

– Understanding of patterns, behaviors, etc

– Automated Data Driven Decision Making

● Data Engineering

– Ingestion pipelines

– Data integration

– Data enrichment

– Data cleansing

– Data preparation

– Data pipeline
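
The data engineering stages listed above can be sketched in a few lines of Python. This is only an illustrative toy, not how any particular platform does it: the sample CSV text, the `COUNTRY_NAMES` lookup table and the function names (`ingest`, `cleanse`, `enrich`, `prepare`) are all made up for the example.

```python
import csv
import io

# Hypothetical raw input; in practice this would come from a file, API or queue.
RAW_CSV = """customer,amount,country
alice, 120.50 ,my
bob,,sg
carol,80,MY
"""

# Hypothetical lookup table used for the enrichment step.
COUNTRY_NAMES = {"MY": "Malaysia", "SG": "Singapore"}


def ingest(text):
    """Ingestion: parse raw CSV text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))


def cleanse(records):
    """Cleansing: trim whitespace, normalise case, drop rows without an amount."""
    cleaned = []
    for row in records:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete rows
        cleaned.append({
            "customer": row["customer"].strip(),
            "amount": float(amount),
            "country": row["country"].strip().upper(),
        })
    return cleaned


def enrich(records):
    """Enrichment: add a readable country name from the lookup table."""
    return [
        dict(r, country_name=COUNTRY_NAMES.get(r["country"], "Unknown"))
        for r in records
    ]


def prepare(records):
    """Preparation: aggregate the total amount per country for analysis."""
    totals = {}
    for r in records:
        name = r["country_name"]
        totals[name] = totals.get(name, 0.0) + r["amount"]
    return totals


# Chain the stages into one small batch pipeline.
totals = prepare(enrich(cleanse(ingest(RAW_CSV))))
print(totals)
```

The output of `prepare` is the kind of tidy, aggregated dataset that the data science side would then pick up for analysis or modelling.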

Page 20: So you want to be a Data Scientist?

Hadoop?

Page 21: So you want to be a Data Scientist?

Hadoop is for Big Data

● Core of "Big Data"

– Techniques, technologies & strategies to handle ingestion, storage, and processing of high velocity, high volume, high variety datasets

– Historical data, and not just current state

– Transaction + interaction + observation = Big Data

Page 23: So you want to be a Data Scientist?

Data Science Needs Big Data

"The reaction of one man could be forecast by no known mathematics; the reaction of a billion is something else again"

– Asimov

● Without rich historical data, analysis and development become more challenging

– Patterns start to show themselves in rich historical data

– Models that are accurate with small data might start to fall apart when more parameters/data are introduced

● Start collecting data today! You never know when you will need it, and when you do, the historical data is there for you to mine

Page 24: So you want to be a Data Scientist?

Getting Started With Data Science

Some tips for beginners

Page 25: So you want to be a Data Scientist?

Attn.

● Courses, training, documents, tools, etc. will definitely help you establish your foundations and basics in data science

– but, like any technical field, what is important is your ability to mash everything up and apply it to solve problems

● Anybody can learn how to draw, anybody can draw, but not everybody can be an artist.

Page 26: So you want to be a Data Scientist?

Domain & Business

● Learn more about your industry (or your target industry)

● Learn what makes it tick, which numbers matter, and what scientific knowledge surrounds the domain

● Businesses exist for the key purpose of making a profit, which usually translates to increasing sales & reducing cost

– Find ways to help your organization's business by collecting and analyzing data to produce visualizations that help the organization make more profit

Page 27: So you want to be a Data Scientist?

Math & Statistics

● Find those old textbooks you had from university, and study them again ;-)

● Learn, understand and start to apply how statistics can be used for estimation and prediction.
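
As a taste of what "estimation" and "prediction" mean in practice, here is a minimal sketch using only the Python standard library. The monthly sales figures are invented for illustration, and the 95% interval uses a rough normal approximation (with so few points, a t-distribution would be the more careful choice).

```python
import math
import statistics

# Made-up monthly sales figures, purely for illustration.
months = [1, 2, 3, 4, 5, 6]
sales = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

# Estimation: sample mean with a rough 95% interval (normal approximation).
mean = statistics.mean(sales)
stderr = statistics.stdev(sales) / math.sqrt(len(sales))
interval = (mean - 1.96 * stderr, mean + 1.96 * stderr)

# Prediction: ordinary least squares fit of sales = slope * month + intercept.
mean_x = statistics.mean(months)
sxy = sum((x - mean_x) * (y - mean) for x, y in zip(months, sales))
sxx = sum((x - mean_x) ** 2 for x in months)
slope = sxy / sxx
intercept = mean - slope * mean_x


def predict(month):
    """Predict sales for a given month from the fitted line."""
    return slope * month + intercept


print("mean:", mean, "interval:", interval)
print("predicted sales for month 7:", predict(7))
```

Real data is never this perfectly linear, which is exactly why the textbook material on error, variance and confidence matters.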

Page 28: So you want to be a Data Scientist?

Programming & Information System

● If you don't know programming yet, start by picking up one language

– I suggest Python, as it has a strong background in scientific computing communities and was designed by a mathematician, Guido van Rossum

– Though I'm a biased parseltongue :P

– Books:

● Packt's Practical Data Analysis

● How to Think Like A Computer Scientist

● SQL is important

– Pretty much the most mature method for declaring data queries

● Pick up Big Data technologies to help you handle massive datasets
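
To show how Python and SQL fit together, here is a small sketch using the standard library's `sqlite3` module with an in-memory database. The `sales` table and its rows are hypothetical. Note that the query only declares what result is wanted (totals per region, largest first) and leaves the "how" to the database engine.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, for illustration only.
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 100.0), ("south", 250.0), ("north", 50.0)],
)

# Declarative query: total amount per region, largest first.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales GROUP BY region ORDER BY total DESC"
).fetchall()

print(rows)
conn.close()
```

The same style of `GROUP BY` query carries over largely unchanged to SQL engines that run on top of Big Data platforms, which is part of why SQL is worth learning early.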

Page 30: So you want to be a Data Scientist?

One more thing

Page 31: So you want to be a Data Scientist?

http://pysiphae.rtfd.org

Page 32: So you want to be a Data Scientist?

Thanks

Contact: Izhar Firdaus (KageSenshi)

[email protected] / [email protected]

+60172792765