Upload
kristin-ellis
View
217
Download
0
Tags:
Embed Size (px)
Citation preview
EXPLORING PROCESS OF DOING DATA SCIENCE VIA AN ETHNOGRAPHIC STUDY
OF A MEDIA ADVERTISING COMPANY
J.SALTZ, I.SHAMSHURIN
Click icon to add picture
2015 IEEE INTERNATIONAL CONFERENCE ON B IG DATA 2 9 O C T O B E R , S A N TA C L A RA , C A , U S A
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 2
OUTLINE
Introduction
Related Work
Data Collection
Findings
Observed Issues
Possible Improvements
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 3
INTRODUCTION
Data science teams do not have an explicit data science team-based process methodology:
• What steps should be done first?
• How long each phase of a project should take?
• Which people with what skills should be involved in the project?
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 4
RELATED WORK
• Data is increasingly being viewed as a strategic resource for the organization [Wade, 2004].
• Bid Data can enable new and improved business models that have not been feasible in the past [Tiefenbacher, 2015]
• Lack of focus on the process teams should use to actually do a data science project [Saltz, 2015]
• Teams doing data analysis and data science work in an ad hoc fashion, using trial and error to identify the right tools [Bhardwaj, 2015]
• Data science as a step-by-step process:o Acquisition, information extraction and cleaning, data integration, modeling, analysis, interpretation and
deployment [Jagadish, 2014]o Preparation, Analysis, Reflection and Dissemination [Guo, 2013]
• Understanding of what might be an appropriate data science process methodology is to document case studies of how teams are actually doing data science, especially within a corporate context [Saltz, 2015]
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 5
BACKGROUND AND STAKEHOLDER ANALYSIS• One of the researchers was embedded within the data science team
•A global media advertising software company headquartered in New York City
• The company had a total of 100 people distributed globally
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 6
RESEARCH QUESTIONS
RQ1. What is the current methodology that they follow?
RQ2. What are some possible ways to improve the current methodology, i.e. to make the projects more efficient in time and cost?
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 7
DATA COLLECTION
• Phase I: information was collected prior to one of the researchers being embedded within the data science team
• Phase II: during a 9 week period, one of the researchers participated as part of the data science team, and in addition to collecting data and observing how the team functioned, actually helped the team with various tasks
• Phase III: interview with the VP of Data Science
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 8
DATA SCIENCE TEAM
• 2 Data Scientists, including VP of Data Science
• 3 Data Operations people
• 3 Software Developers
• 1 Data Engineer
The team was divided across multiple locations
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 9
FINDINGS: TYPES OF PROJECTS
• Routine Projectso on a regular basiso more externalo data transformation and pre-processingo performed by data groupo deadlines
• Exploratory Projectso research orientedo no standard methodology is usedo performed by VP of data science and embedded researchero duration of these projects can vary from a week to a yearo no official deadlines
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 10
FINDINGS: ROLES
• Data Scienceo Explores the data and generates insight from the data, including tasks such as data mining and data visualization. This team included the data scientist who was the embedded observer.
• Data Operationso Getting data from data providers, transformation and preparation of the data for analysis (i.e., for use by the data science team)
• Software Developmento Develop software tools to help the data science team perform data analysis
• Data Engineeringo Supports and improves the existing system and participates in some data science projects
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 11
FINDINGS: HIGH LEVEL PROCESS DESCRIPTION
I. Preparation
· Getting Data from Data Provider· Quick Data Validation· Transform Data
· Understand needs of business· Perform background research
Data
Business Context
· What to do and how to do· Modeling· Generating insights from data
II. Analysis
· Communicate resultsIII. Dissemination
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 12
PROCESS FLOW DESCRIPTION: PREPARATION
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 13
PROCESS FLOW DESCRIPTION: ANALYSIS
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 14
PROCESS FLOW DESCRIPTION: DISSEMINATION
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 15
OBSERVED ISSUES / CHALLENGES• No specific deadlines for the whole data science project or for any individual phase of the project
• Project organization and planning
• Whenever the data science team needs to have a task completed, they send a request to the developers, but the developers typically respond the following day
• Developers are involved in several projects at the same time
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 16
POSSIBLE PROCESS IMPROVEMENTS• Documenting the current process
• Better structuring developer interactions
• Imposing deadlines
• Process automation
• Better preparation
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 17
FEEDBACKS ON SUGGESTION
SuggestionMake
sense?Can
Implement?Short or Long
Term?Documenting the current process
4 4 short
Better structuring developers interactions
3 2 long
Imposing deadlines
4 3 long
Process Automation
4 3 long
Better Preparation
3 3 short
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 18
EFFECTIVE PRACTICES OBSERVED• Pre-processing
• Frequent dialog with senior management:
• Engaging Senior Management
• Using a defined SDLC with the software team
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 19
CONCLUSION
• The data science team was not thinking about the process of doing the projects
• Suggestions received positive feedbacks
• Studying additional organizations might be helpful to examine if the suggestions and feedback from this study are related to the current size, organizational structure or domain of the company, and if there are any patterns observed across the organizations doing data science projects
SCHOOL OF INFORMATION STUDIES | SYRACUSE UNIVERSITY 20
THANK YOU