Version control and open-source methodology in data science
Mislav Marohnić, software developer at GitHub
Topics
• What is version control?
• How has open source influenced software?
• How can this be relevant to researchers in data science?
Version control
version control outside of the software world
Version control terminology
• repository: project directory
• “checking in”: adding files
• commit: saving changes
• push/pull: syncing
• remote: this project elsewhere
• branch: isolated changes
• merge: combining changes
• fork: copying a project
• pull request: contributing changes
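To make a few of these terms concrete, here is a toy in-memory model of a repository. It is only a sketch: the `Commit` and `Repository` classes, the file name, and the branch name are hypothetical, and the merge rule (the incoming branch simply wins) is a deliberate simplification, not how Git resolves changes. Remotes, forks, push/pull, and pull requests are left out.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """A saved snapshot of the project plus pointers to what came before."""
    message: str
    files: tuple          # sorted (path, content) pairs
    parents: tuple = ()   # ids of parent commits; a merge commit has two

    @property
    def id(self):
        raw = repr((self.message, self.files, self.parents)).encode()
        return hashlib.sha1(raw).hexdigest()[:8]


class Repository:
    """Toy repository: the project's files together with their history."""

    def __init__(self):
        self.commits = {}                # commit id -> Commit
        self.branches = {"main": None}   # branch name -> id of its latest commit
        self.current = "main"
        self.staged = {}                 # files "checked in" but not yet committed

    def snapshot(self, branch):
        tip = self.branches[branch]
        return dict(self.commits[tip].files) if tip else {}

    def add(self, path, content):
        """'Checking in': mark a file to be included in the next commit."""
        self.staged[path] = content

    def commit(self, message, extra_parents=()):
        """Save the staged changes on top of the current branch."""
        files = self.snapshot(self.current)
        files.update(self.staged)
        tip = self.branches[self.current]
        parents = tuple(p for p in (tip, *extra_parents) if p)
        new = Commit(message, tuple(sorted(files.items())), parents)
        self.commits[new.id] = new
        self.branches[self.current] = new.id
        self.staged = {}
        return new.id

    def branch(self, name):
        """Start an isolated line of changes from the current commit."""
        self.branches[name] = self.branches[self.current]
        self.current = name

    def checkout(self, name):
        self.current = name

    def merge(self, other):
        """Combine another branch into the current one (simplified: other wins)."""
        self.staged.update(self.snapshot(other))
        return self.commit(f"Merge {other} into {self.current}",
                           extra_parents=(self.branches[other],))


repo = Repository()
repo.add("analysis.R", "model <- lm(y ~ x)")
repo.commit("Add first model")
repo.branch("try-new-model")
repo.add("analysis.R", "model <- glm(y ~ x)")
repo.commit("Try a GLM")
repo.checkout("main")
merge_id = repo.merge("try-new-model")
```

After the merge, `main` contains the experimental change, and the merge commit records both lines of history as its two parents.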
What version control facilitates
• code storage & backups
• isolated environment (branches) to experiment with changes
• syncing work within the team
• project history
• tracking down software bugs
• release management
• continuous integration (CI)
What version control looks like
Open-source
FOSS (free and open-source software): Anyone is freely licensed to use, copy, study, and change the software in any way, and the source code is openly shared so that people are encouraged to voluntarily improve the design of the software.
Examples of open source
• Python / R
• the Web & most browsers
• Linux
• parts of Apple's macOS
• Android OS
• Microsoft .NET
Benefits of open source
• transparency → trust
• fosters learning
• fosters collaboration
• more resilient software
• longer-lasting software
What GitHub provides
• web interface for git
• storage & backups
• issue tracking
• pull requests
• code search
• collaboration features
• project management
• downloadable releases
• web site publishing
• API for integrations
Pull Requests
a small “unit” of collaboration
The GitHub Flow
The GitHub Flow: new branch
The GitHub Flow: changes (commits)
The GitHub Flow: create pull request
The GitHub Flow: collaboration
The GitHub Flow: peer approval
The GitHub Flow: merge
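The steps of the flow can be sketched as a small state model. This is an illustration only: the `PullRequest` class, the branch name, and the reviewer name are hypothetical and do not mirror GitHub's data model or API. The sketch also assumes that at least one approval is required before merging; on GitHub that depends on how the repository is configured.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    """Toy stand-in for a pull request."""
    branch: str
    commits: list = field(default_factory=list)
    comments: list = field(default_factory=list)
    approved_by: set = field(default_factory=set)
    state: str = "not opened"   # becomes "open", then "merged"

    def commit(self, message):
        # Changes can keep arriving on the branch until it is merged.
        if self.state == "merged":
            raise RuntimeError("branch already merged")
        self.commits.append(message)

    def open(self):
        if not self.commits:
            raise RuntimeError("nothing to propose yet")
        self.state = "open"

    def comment(self, author, text):
        self._require_open()
        self.comments.append((author, text))

    def approve(self, reviewer):
        self._require_open()
        self.approved_by.add(reviewer)

    def merge(self):
        self._require_open()
        # Assumption of this sketch: one peer approval is mandatory.
        if not self.approved_by:
            raise RuntimeError("needs peer approval first")
        self.state = "merged"

    def _require_open(self):
        if self.state != "open":
            raise RuntimeError(f"pull request is {self.state}")


pr = PullRequest(branch="fix-outlier-filter")          # new branch
pr.commit("Exclude sensor readings below zero")        # changes (commits)
pr.open()                                              # create pull request
pr.comment("ana", "Should zero itself be excluded?")   # collaboration
pr.commit("Keep zero, drop only negative readings")
pr.approve("ana")                                      # peer approval
pr.merge()                                             # merge
```

Note that the discussion can lead to further commits on the same branch before anyone approves, which is what makes the pull request a unit of collaboration rather than a one-shot submission.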
Continuous integration (CI)
The killer feature of pull requests
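The core idea of CI on a pull request is that automated checks run against every proposed change and the result is reported before anyone merges. A minimal sketch of that gate, assuming checks are plain functions over a snapshot of the project's files (the check names and file names are hypothetical; a real CI service would run the project's actual test suite on a server):

```python
def run_checks(files, checks):
    """Run every check against a snapshot of the project; return name -> passed."""
    return {check.__name__: bool(check(files)) for check in checks}


def has_readme(files):
    return "README.md" in files


def no_empty_files(files):
    return all(content.strip() for content in files.values())


# Snapshot of the files as they would look with the proposed change applied.
proposed = {"README.md": "# Study", "analysis.R": ""}

results = run_checks(proposed, [has_readme, no_empty_files])
mergeable = all(results.values())
```

Here the empty `analysis.R` fails one check, so the change would be flagged before a reviewer spends time on it.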
Version control in data science
Similarities to software development
• syncing materials & data
• writing actual code (e.g. R)
• collaboration within a team
• peer review process
• publishing
Writing formats
• LaTeX
• Markdown
• R Markdown
• Jupyter (IPython) Notebook
• AsciiDoc
Potential problems
• git can be tricky to learn for non-developers
• large datasets can be inconvenient to add to version control
• transition paths from other tools aren't always clear
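For the large-dataset problem, one possible workaround (an assumption here, not something the slides prescribe) is to keep the data itself outside version control and commit only a small manifest of file sizes and checksums, so collaborators can verify they hold the same data. Dedicated tools such as Git LFS exist for this as well. The function names, file names, and manifest layout below are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir, manifest_path):
    """Record the size and checksum of every file under data_dir as JSON."""
    data_dir = Path(data_dir)
    entries = {
        path.relative_to(data_dir).as_posix(): {
            "bytes": path.stat().st_size,
            "sha256": sha256_of(path),
        }
        for path in sorted(data_dir.rglob("*"))
        if path.is_file()
    }
    text = json.dumps(entries, indent=2, sort_keys=True) + "\n"
    Path(manifest_path).write_text(text)
    return entries


# Demo with a throwaway directory standing in for a real dataset.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    data.mkdir()
    (data / "measurements.csv").write_bytes(b"id,value\n1,0.5\n")
    manifest = write_manifest(data, Path(tmp) / "data-manifest.json")
    saved = json.loads((Path(tmp) / "data-manifest.json").read_text())
```

The manifest is a small text file, so it diffs cleanly in version control: a changed checksum in a commit shows that the underlying data changed, even though the data is stored elsewhere.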