Transcript
Page 1: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Large-scale license transparency using open data, open standards and F/OSS

GitHub + F/OSS => 1 million SPDX

http://triplecheck.net http://searchcode.com

Page 2: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Speaker

Slide #2

Nuno Brito

Free/open source contributor since 2005
Wrote 100k lines of F/OSS code in the last 12 months
SPDX contributor, co-founder of TripleCheck

Around the web http://nunobrito.eu

Page 3: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Transparency

Slide #3

Take some source code as an example

Who developed the code?
Which licenses are applicable?
Was the code copied from somewhere else?

Page 4: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Size

Slide #4

A problem of scale

Open licenses? > 300 types to choose from
> 5 million F/OSS projects
> 100 million source code files

Page 5: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Practice

Slide #5

Applying licenses

Burden on developer (do correctly, do enough)
Expressed differently (difficult to understand)
Scaling obstacles (scarce automation)

Transparency?

Page 6: 2014 10-14: GitHub plus FOSS == 1 million SPDX

What do?

Slide #6

Ideally, we'd have tooling that is...

a) Reachable
b) Cooperative
c) Free

Choose two. (sad reality)

Page 7: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Choose three

Slide #7

Choose building blocks based on:

a) Open standards
b) Open data
c) Reachable tools

Learn, write, improve.

Share.

Page 8: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Standards

Slide #8

SPDX: Open standard for software licensing

Standardizes license description
Defines an Id for license terms
http://spdx.org

Pros: Good docs, straightforward, getting better
Cons: Slow adoption, scarce tooling
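
To make "standardized license description" concrete, here is a minimal sketch of a per-file entry written in the SPDX tag/value style. The function name `spdx_file_entry` and the sample values are illustrative assumptions, not part of the talk; a complete SPDX document carries more fields than the four shown here.

```python
import hashlib


def spdx_file_entry(name: str, content: bytes, license_id: str, copyright_text: str) -> str:
    """Build a small per-file block using SPDX tag/value field names.

    Hypothetical helper for illustration; it does not produce a full,
    valid SPDX document on its own.
    """
    sha1 = hashlib.sha1(content).hexdigest()
    lines = [
        f"FileName: {name}",
        f"FileChecksum: SHA1: {sha1}",
        # license_id is expected to be a short SPDX identifier, e.g. "MIT"
        f"LicenseConcluded: {license_id}",
        f"FileCopyrightText: <text>{copyright_text}</text>",
    ]
    return "\n".join(lines)


print(spdx_file_entry("./src/Main.java", b"class Main {}", "Apache-2.0",
                      "Copyright 2014 Example Author"))
```

The short identifier ("Apache-2.0", "MIT", ...) is what lets tools compare license findings across projects without parsing free-form license text.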

Page 9: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Open data

Slide #9

GitHub: Targeting open data repositories

API suited for intensive access
Social coding
Largest open source code collection

Pros: Reachable, diverse
Cons: Repositories processed one-by-one

Page 10: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Tooling

Slide #10

Custom-built tools for software licenses

Large-scale repository data-mining
Find applicable licenses inside content
Share millions of SPDX documents

Pros: Learn by doing, modularized, single language
Cons: Built from scratch, needs consolidation

Page 11: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Step 1

Slide #11

Desktop tool/engine to discover licenses

SPDX format as storage medium
Identify copyright and 18 license types
Written in Java, released in Feb. 2014, EUPL license

http://spdx.org/tools/community/triplecheck-reporter
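
As a rough illustration of what "identify copyright and license types" can involve, the sketch below matches well-known license phrases and copyright lines in a file header. It is an assumption-laden toy, not the TripleCheck engine: the phrase table, the regular expression, and the function names are all hypothetical, and a real detector needs far more signatures.

```python
import re

# Hypothetical signature table: SPDX identifier -> phrase typical of that license.
LICENSE_PHRASES = {
    "Apache-2.0": "licensed under the apache license, version 2.0",
    "MIT": "permission is hereby granted, free of charge",
}

# Matches lines such as "Copyright (c) 2014 Example Author".
COPYRIGHT_RE = re.compile(
    r"copyright\s+(?:\(c\)\s*)?\d{4}(?:\s*-\s*\d{4})?\s+.+", re.IGNORECASE
)


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation, comment markers and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def detect(text: str):
    """Return (sorted license ids, copyright lines) found in the text."""
    flat = normalize(text)
    licenses = sorted(
        lid for lid, phrase in LICENSE_PHRASES.items() if normalize(phrase) in flat
    )
    copyrights = [m.group(0).strip() for m in COPYRIGHT_RE.finditer(text)]
    return licenses, copyrights


sample = """/*
 * Copyright (c) 2014 Example Author
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 */"""
print(detect(sample))
```

Normalizing before matching is the design choice that matters here: license headers are wrapped differently in every comment style, so the phrase has to survive line breaks and `*` or `#` prefixes.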

Page 12: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Desktop

Slide #12

Page 13: 2014 10-14: GitHub plus FOSS == 1 million SPDX

File detail

Slide #13

Page 14: 2014 10-14: GitHub plus FOSS == 1 million SPDX

SPDX file

Slide #14

Page 15: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Customize

Slide #15

Page 16: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Details

Slide #16

Underneath the hood

147 file extensions, 18 license types
LOC, hashes (SHA1, MD5, SHA256, SSDEEP)
Command line supported (Jenkins, cron)
Fast, 40k files/minute (Pentium IV)
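
The per-file facts listed above (LOC and hashes) can be sketched with the standard library alone. This is only an illustration under stated assumptions: LOC is taken to mean non-blank lines, the function name `file_facts` is made up, and SSDEEP is left out because fuzzy hashing needs a third-party library.

```python
import hashlib


def file_facts(content: bytes) -> dict:
    """Collect line count and cryptographic hashes for one file's content.

    Assumption: LOC counts non-blank lines; the actual tool may count differently.
    """
    text = content.decode("utf-8", errors="replace")
    loc = sum(1 for line in text.splitlines() if line.strip())
    return {
        "loc": loc,
        "md5": hashlib.md5(content).hexdigest(),
        "sha1": hashlib.sha1(content).hexdigest(),
        "sha256": hashlib.sha256(content).hexdigest(),
    }


print(file_facts(b"int a;\n\nint b;\n"))
```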

Page 17: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Step 2

Slide #17

Discovering repositories with gitFinder

Create a list of projects online to use as components.
Get basic licensing information from each project.

Write a text file listing each GitHub user (~7 million)
For each user, find the repositories that are not forks (~10M)
Split the repositories by language (197)
For each language/repository list, download the code
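
The middle two steps (drop forks, split by language) can be sketched offline as below. This is not gitFinder's code. The dictionary keys `fork`, `language` and `full_name` follow the repository objects returned by the GitHub REST API; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def group_original_repos(repos):
    """Skip forked repositories and group the remaining ones by main language."""
    by_language = defaultdict(list)
    for repo in repos:
        if repo.get("fork"):
            continue  # forks duplicate code that is already counted upstream
        language = repo.get("language") or "unknown"  # language may be null
        by_language[language].append(repo["full_name"])
    return dict(by_language)


sample_repos = [
    {"full_name": "alice/parser", "fork": False, "language": "Java"},
    {"full_name": "bob/parser", "fork": True, "language": "Java"},
    {"full_name": "carol/notes", "fork": False, "language": None},
]
print(group_original_repos(sample_repos))
```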

Page 18: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Performance

Slide #18

~70k repositories/day

Single machine (i7, 8 GB RAM, CentOS)
9 parallel threads
Resume/recover supported
Released in Jun. 2014

https://github.com/triplecheck/gitfinder
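
One way to combine a fixed thread pool with resume/recover is a plain-text checkpoint of finished repositories, sketched below. The names `process_all`, `worker` and the checkpoint file are assumptions for illustration; gitFinder itself is a Java tool and may handle recovery differently.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_all(repos, worker, checkpoint_path, threads=9):
    """Run worker(repo) on every repo not yet listed in the checkpoint file.

    Returns the number of repositories attempted in this run.
    """
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as fh:
            done = {line.strip() for line in fh if line.strip()}
    pending = [repo for repo in repos if repo not in done]

    with ThreadPoolExecutor(max_workers=threads) as pool, \
            open(checkpoint_path, "a", encoding="utf-8") as log:
        futures = {pool.submit(worker, repo): repo for repo in pending}
        for future in as_completed(futures):
            repo = futures[future]
            try:
                future.result()
            except Exception:
                continue  # not checkpointed, so the next run retries it
            # Only the main thread writes the checkpoint, so no lock is needed.
            log.write(repo + "\n")
            log.flush()
    return len(pending)
```

Because a repository is recorded only after its worker returns, an interrupted run can simply be started again with the same checkpoint file.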

Page 19: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Output

Slide #19

Page 20: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Storage?

Slide #20

https://what-if.xkcd.com/29/ (CC BY-NC 2.5)

Page 21: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Storage

Slide #21

BigZip: 100+ million files in a single download

Flat-file, zip compression (per entry)
Fast, simple, portable
Indexed search

https://github.com/triplecheck/big
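
The idea of a flat file with per-entry compression and an index can be sketched as follows. This is explicitly not the BigZip on-disk format: the in-memory index, the function names, and the use of `zlib` (DEFLATE, the algorithm commonly used inside zip files) are assumptions chosen to keep the example short.

```python
import io
import zlib


def append_entry(archive, index, name, data):
    """Compress one entry on its own and append it to the flat archive."""
    payload = zlib.compress(data)
    offset = archive.seek(0, io.SEEK_END)  # always write at the end
    archive.write(payload)
    index[name] = (offset, len(payload))   # where to find it later


def read_entry(archive, index, name):
    """Seek straight to one entry and decompress only that entry."""
    offset, size = index[name]
    archive.seek(offset)
    return zlib.decompress(archive.read(size))


archive = io.BytesIO()  # stands in for a large binary file opened in "r+b" mode
index = {}
append_entry(archive, index, "alice/parser/Main.java", b"class Main {}")
append_entry(archive, index, "alice/parser/README", b"A parser.")
print(read_entry(archive, index, "alice/parser/README"))
```

Compressing each entry separately costs some compression ratio, but any single file can be read back without unpacking the rest, which is what makes one very large download workable.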

Page 22: 2014 10-14: GitHub plus FOSS == 1 million SPDX

How it looks

Slide #22

Page 23: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Step 3

Slide #23

SPDX search engine

One-click SPDX creation from open data
Visualize license and copyright data
Visit at http://searchcode.com/spdx

Page 24: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Example

Slide #24

Using the original URL...

https://github.com/iuly/europa_kernel/

=>

https://spdxhub.com/iuly/europa_kernel/
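
The URL mapping in this example only swaps the host name and keeps the user/repository path. A minimal sketch, with a hypothetical function name:

```python
from urllib.parse import urlsplit, urlunsplit


def to_spdxhub(url: str) -> str:
    """Rewrite a github.com repository URL to the matching spdxhub.com URL."""
    parts = urlsplit(url)
    if parts.netloc.lower() != "github.com":
        raise ValueError("expected a github.com URL: " + url)
    return urlunsplit(parts._replace(netloc="spdxhub.com"))


print(to_spdxhub("https://github.com/iuly/europa_kernel/"))
```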

Page 25: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Example

Slide #25

Page 26: 2014 10-14: GitHub plus FOSS == 1 million SPDX

SPDX-1M

Slide #26

“Do It Yourself” kit. Generate 1 million SPDX

https://github.com/triplecheck/diy
1.2 million open source projects
“Arduino” for s/w license detection

9 GB worth of SPDX? Grab: http://triplecheck.net/public/storage/spdx.big

Page 27: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Screenshots

Slide #27

Page 28: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Next step?

Slide #28

F2F – pinpointing non-original code

Decompose code into blocks
Tokenize/anonymize data
Find code matches across knowledge base

ETA in Dec. 2014
https://github.com/triplecheck/f2f
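
The three steps above can be sketched as token-window fingerprinting. This is a generic technique chosen for illustration, not the F2F implementation (which was unreleased at the time of the talk); the keyword list, window size, and function names are all assumptions.

```python
import hashlib
import re

# Small hypothetical keyword set; everything else that looks like a name is anonymized.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "void", "class"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def anonymize(source: str):
    """Tokenize and replace identifiers with ID and numbers with NUM."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")
        elif tok[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)  # punctuation and operators, one character each
    return tokens


def fingerprints(source: str, window: int = 8):
    """Decompose the token stream into overlapping blocks and hash each block."""
    tokens = anonymize(source)
    prints = set()
    for i in range(max(len(tokens) - window + 1, 0)):
        block = " ".join(tokens[i:i + window])
        prints.add(hashlib.sha1(block.encode("utf-8")).hexdigest())
    return prints


def similarity(a: str, b: str, window: int = 8) -> float:
    """Jaccard overlap of block fingerprints; 0.0 if either side is too short."""
    fa, fb = fingerprints(a, window), fingerprints(b, window)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


original = "int total = 0; for (int i = 0; i < n; i++) { total += i; }"
renamed = "int sum = 0; for (int k = 0; k < count; k++) { sum += k; }"
print(similarity(original, renamed))
```

Anonymizing identifiers before hashing is what lets a copied block still match after variables have been renamed; against a large knowledge base, the fingerprint sets would be looked up in an index rather than compared pairwise.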

Page 29: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Preview

Slide #29

Page 30: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Conclusion

Slide #30

What is now available for everyone

Desktop tooling / detection engine
Extraction of open data at scale
Search engine for SPDX

Page 31: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Questions?

Slide #31

http://spdx.org
http://searchcode.com/spdx
http://github.com/triplecheck

Interesting stuff? Let us know: @nn81 @boyte #linuxcon

http://xkcd.com/1118/

Page 32: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Backup slides

Slide #32

Page 33: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Engine

Slide #33

Page 34: 2014 10-14: GitHub plus FOSS == 1 million SPDX

License DB

Slide #34

Page 35: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Components

Slide #35

Page 36: 2014 10-14: GitHub plus FOSS == 1 million SPDX

Exporting

Slide #36

