Licensing is Software Too:
Achievements and Challenges
(and how this relates to code provenance)
Massimiliano Di Penta
University of Sannio, Italy
[email protected]
http://www.rcost.unisannio.it/mdipenta
Acknowledgements
Daniel M. Germán, Univ. Victoria, Canada
Julius Davies, Univ. Victoria, Canada
Giuliano Antoniol, Ecole Polyt. Montréal, Canada
Yann-Gaël Guéhéneuc, Ecole Polyt. Montréal,
Canada
Reusing Open Source Software
When developing a software system,
we try (if possible) not to reinvent the wheel
Components, libraries, source
code snippets out of there, ready to be reused
Code search engines are becoming popular
Open source code modification and
redistribution governed by
Software licenses
Copyright statements
Everything contained in a licensing
block…
What does a licensing block contain?
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
*
* The contents of this file are subject to the Mozilla Public License Version
* 1.1 (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
* http://www.mozilla.org/MPL/
….
* Portions created by the Initial Developer are Copyright (C) 2002
* the Initial Developer. All Rights Reserved.
*
* Contributor(s):
* Brian Ryner
….
* decision by deleting the provisions above and replace them with the notice
* and other provisions required by the GPL or the LGPL. If you do not delete
* the provisions above, a recipient may use your version of this file under
* the terms of any one of the MPL, the GPL or the LGPL.
*
* ***** END LICENSE BLOCK ***** */
Callouts on the slide: License (MPL+GPL+LGPL), Copyright statement, Copyright year, Contributor
Restrictive vs. permissive
licenses
Restrictive (aka copyleft or reciprocal)
Changed software must be made available under
similar terms wrt. the original
Example: GPL
Permissive
Modifications/enhancements may remain
proprietary
Distribution of source code or binary permitted
– Provided copyright notice and/or liability disclaimers
– Contributor names do not imply endorsement
Examples: Berkeley Software Distribution (BSD),
Apache Software License, MIT
FOSS development teams care!
(source: Debian)
I am in the process of trying to prepare 0.8.0 for Debian
GNU/Linux I have started going over the copyright/license
headers. In src/celeste many files are missing copyright
information. Most of these are files imported with minimal
changes from Gabor API http://www.kung-foo.tv/gaborapi.php
or libsvm http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
The attached patch adds copyright and license statements
to these files.[1]
Please apply and update the headers (adding copyright
holders) if you make substantial changes.
thanks, cu andreas
[1] I have doublechecked with Gabor API's upstream author
Adriaan Tijsseling that files like ContrastFilter.cpp are
Copyright (c) Adriaan Tijsseling and licensed under
GPLv2+, although the original headers just say:
Original Author:
Yasunobu Honma
Modifications by:
Adriaan Tijsseling (AGT)
Conjectures
Since licenses determine the way software
can be composed and re-distributed
They may change/evolve as any other part of
the software
They might be subject to bugs too
– See our ICPC 2010 paper about how to identify
licensing incompatibilities
They might determine the success/failure of a
software project
Code provenance and licenses:
Licenses constrain source code migration
between projects
Code provenance might be useful to determine
the licensing of closed components
Licenses influence the software
lifetime
OpenBSD founder and project leader Theo de Raadt
removed a security software package called IP-Filter
[written by Darren Reed] after its author changed its
license.
Stephen Shankland, CNET News, 2001/05/30.
Licenses evolve as software does
Failing to account for that would cause copyright
infringements
Decisions on license changes have an impact, like other
decisions on software evolution
Little attention so far from the scientific community
Need for methods and tools to audit licensing
and their changes
Example: Java
Until November 2006, the license of Java JDK v1.2 said:
“Except as specifically authorized in any Supplemental
License Terms, you may not make copies of Software,
other than a single copy of Software for archival
purposes”
This disallowed the inclusion of Java in Linux distributions
Java 5.0 released under the GPL v2 with the
CLASSPATH exception:
Java could be modified/updated under the GPL v2
Java programs could be released under any license as long as
they satisfy the conditions stated in the CLASSPATH exception
Changing the license of a system can promote
and ease the distribution and reuse of a
software system
Example: QT
First released under a non-open source but free
license, called the FreeQT License, and a commercial
license
QT became the basis for KDE
QT v2.0 was released under a new license, the Q Public
License
incompatible with the GPL
GNOME project started as a QT-free alternative to KDE
Harmony project started as a GPL replacement of QT
Trolltech changed the license of QT v3 to the GPL v2
The Harmony project was abandoned
Changing the license of a FOSS system
towards a more permissive one might cause
the abandonment of a competing system
Empirical Study
Goal: analyze licensing evolution
Purpose: investigating how
developers change licensing
statements
Context: CVS/SVN repositories of
ArgoUML, Eclipse-JDT, the FreeBSD and
the OpenBSD kernels, Mozilla, Samba
Research Questions
RQ1: To what extent are files
changing their licenses?
RQ2: How are copyright years
changed in licensing statements?
RQ3: Who are the contributors of a
software project and how do they
change?
Licensing Analysis Method –
Extracting Licensing statements
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
*
* The contents of this file are subject to the Mozilla Public License Version
* 1.1 (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
* http://www.mozilla.org/MPL/
….
* Portions created by the Initial Developer are Copyright (C) 2002
* the Initial Developer. All Rights Reserved.
*
* Contributor(s):
* Brian Ryner
….
* decision by deleting the provisions above and replace them with the notice
* and other provisions required by the GPL or the LGPL. If you do not delete
* the provisions above, a recipient may use your version of this file under
* the terms of any one of the MPL, the GPL or the LGPL.
*
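The extraction step can be sketched as follows. This is a minimal illustration, assuming the licensing statement is whatever sits in the leading block comments of a C-like source file; the function name `extract_licensing_statement` and the regular expression are assumptions made for this example, not the tooling used in the study.

```python
import re

# One C-style block comment, optionally preceded by whitespace.
_LEADING_COMMENT = re.compile(r"\s*/\*(.*?)\*/", re.DOTALL)


def extract_licensing_statement(source):
    """Return the text of the block comments that open a C-like file."""
    parts = []
    pos = 0
    while True:
        match = _LEADING_COMMENT.match(source, pos)
        if match is None:
            break  # first non-comment content reached
        parts.append(match.group(1))
        pos = match.end()

    lines = []
    for raw in "\n".join(parts).splitlines():
        # Drop the decorative leading "*" that comment lines usually carry.
        cleaned = re.sub(r"^\s*\*+\s?", "", raw).strip()
        if cleaned:
            lines.append(cleaned)
    return "\n".join(lines)
```

Files in languages with other comment syntaxes (for example `#` or `//` line comments) would need their own patterns.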
Licensing Analysis Method –
Classifying licenses
FoSSology [Gobeille, MSR 2008]: detects licenses
using the Binary Symbolic Alignment Matrix (bSAM)
Ninka [German et al., ASE 2010]: uses a pattern-matching approach
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
*
* The contents of this file are subject to the Mozilla Public License Version
* 1.1 (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
* http://www.mozilla.org/MPL/
….
* Portions created by the Initial Developer are Copyright (C) 2002
* the Initial Developer. All Rights Reserved.
*
* Contributor(s):
* Brian Ryner
….
MPL 1.1/GPL 2.0/LGPL 2.1
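To make the idea of pattern-based classification concrete, here is a toy sketch. It is neither FoSSology's bSAM algorithm nor Ninka's actual rule set: the three patterns and the function name `classify_license` are assumptions chosen only to cover the header shown above.

```python
import re

# Illustrative patterns only; real classifiers rely on much larger rule sets.
LICENSE_PATTERNS = [
    ("MPL 1.1", re.compile(r"Mozilla Public License\s+Version\s+1\.1", re.I)),
    ("GPL", re.compile(r"\bGPL\b|GNU General Public License", re.I)),
    ("LGPL", re.compile(r"\bLGPL\b|GNU Lesser General Public License", re.I)),
]


def classify_license(statement):
    """Return the names of all patterns found in a licensing statement."""
    # Collapse whitespace and comment asterisks so that phrases wrapped
    # across comment lines still match.
    text = re.sub(r"[\s*]+", " ", statement)
    return [name for name, pattern in LICENSE_PATTERNS if pattern.search(text)]
```

Returning a list rather than a single label reflects that a file can carry several licenses at once, as in the MPL/GPL/LGPL header.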
Licensing Analysis Method –
Identifying changes in copyright
years
Mining references to years in licensing…
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
*
* The contents of this file are subject to the Mozilla Public License Version
* 1.1 (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
* http://www.mozilla.org/MPL/
….
* Portions created by the Initial Developer are Copyright (C) 2002
* the Initial Developer. All Rights Reserved.
*
* Contributor(s):
* Brian Ryner
….
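Mining year references can be sketched with a couple of regular expressions. The sketch below assumes that only lines mentioning "copyright" are relevant and that a range such as 2002-2004 covers every year in between; both choices, and the function names, are assumptions for illustration.

```python
import re

_COPYRIGHT_LINE = re.compile(r"copyright", re.I)
_YEAR = re.compile(r"\b(19[5-9]\d|20\d\d)\b")
_YEAR_RANGE = re.compile(r"\b(19[5-9]\d|20\d\d)\s*-\s*(19[5-9]\d|20\d\d)\b")


def copyright_years(statement):
    """Collect the years mentioned on copyright lines of a statement."""
    years = set()
    for line in statement.splitlines():
        if not _COPYRIGHT_LINE.search(line):
            continue
        # Expand ranges first, then add every individually listed year.
        for start, end in _YEAR_RANGE.findall(line):
            years.update(range(int(start), int(end) + 1))
        years.update(int(year) for year in _YEAR.findall(line))
    return years


def year_changes(old_statement, new_statement):
    """Return (years added, years removed) between two file revisions."""
    old = copyright_years(old_statement)
    new = copyright_years(new_statement)
    return sorted(new - old), sorted(old - new)
```

Applying `year_changes` to consecutive revisions of a file gives the points in its history where copyright years were touched.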
Licensing Analysis Method –
Identifying contributor names
Mining emails, plus various patterns
Copyright … year name
Contributor(s) …
And mapped to committers, whenever possible
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
*
* The contents of this file are subject to the Mozilla Public License Version
* 1.1 (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
* http://www.mozilla.org/MPL/
….
* Portions created by the Initial Developer are Copyright (C) 2002
* the Initial Developer. All Rights Reserved.
*
* Contributor(s):
* Brian Ryner
….
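The patterns listed above (emails, "Copyright … year name", "Contributor(s) …") can be sketched as follows. The exact regular expressions, the rule that a contributor list ends at the first blank comment line after an entry, and the function names are assumptions for this example; mapping names to committers is not covered.

```python
import re

_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
_COPYRIGHT_NAME = re.compile(
    r"Copyright\s+(?:\(C\)\s*)?\d{4}(?:\s*[-,]\s*\d{4})*[\s,]+(.+)", re.I)
_CONTRIBUTOR_HEADER = re.compile(r"Contributor\(s\)\s*:", re.I)


def _clean_name(text):
    # Remove "<email>" parts and trailing punctuation.
    return re.sub(r"\s*<[^>]*>", "", text).strip(" .,")


def extract_contributors(statement):
    """Return (names, emails) found in a licensing statement."""
    names, emails = [], []
    in_list = False
    seen_entry = False
    for raw in statement.splitlines():
        line = re.sub(r"^[\s*/]+", "", raw).strip()  # drop comment markers
        emails.extend(_EMAIL.findall(line))
        match = _COPYRIGHT_NAME.search(line)
        if match:
            names.append(_clean_name(match.group(1)))
            in_list = False
        elif _CONTRIBUTOR_HEADER.search(line):
            in_list, seen_entry = True, False
        elif in_list:
            if line:
                name = _clean_name(line)
                if name:
                    names.append(name)
                seen_entry = True
            elif seen_entry:
                in_list = False  # blank line closes the contributor list
    return names, emails
```

A copyright line whose holder continues on the next line, as in the Mozilla header above, yields no name with this simple pattern.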
RQ1: Most relevant license changes
System       From                        To                                  Type    Count
Eclipse-JDT  Common Public License v1.0  Eclipse Public License v1.0         CHANGE  2394
Eclipse-JDT  Common Public License v0.5  Common Public License v1.0          UPDATE  808
Mozilla      NPL                         'NPL v1.1'-style+GPL v2+LGPL v2.1   DUAL    2914
Mozilla      NPL                         'Dual MPL GPL'-style+MPL            DUAL    1274
Mozilla      'Dual MPL GPL'-style+MPL    NPL                                 BUG     1194
Licensing updated as new licenses were
developed
Eclipse JDT: CPL 0.5 → CPL 1.0 → EPL 1.0
IBM has relinquished control of licenses to the Eclipse
Foundation
Mozilla: NPL → MPL + GPL (+ LGPL)
NPL allowed Netscape 6 to be released as a proprietary system
MPL only allows to re-distribute the source code under the
RQ1: Most relevant license changes
System   From                              To                                Type    Count
FreeBSD  BSD UCRegents (4-cl BSD)          'BSD UCRegents'-style (4-cl BSD)  UPDATE  491
FreeBSD  'BSD UCRegents'-style (4-cl BSD)  'INRIA-OSL'-style (3-cl BSD)      UPDATE  300
OpenBSD  'BSD UCRegents'-style (4-cl BSD)  'INRIA-OSL'-style (3-cl BSD)      UPDATE  964
OpenBSD  BSD UCRegents (4-cl BSD)          'BSD UCRegents'-style (4-cl BSD)  UPDATE  414
FreeBSD and OpenBSD are more eclectic
than other projects
Moving from BSD-4 clauses to the more
permissive BSD-3 and BSD-2
RQ1: Most relevant license changes
System   From  To                                                                                 Type  Count
ArgoUML  None  'Free with copyright clause'-style + 'UC Regents free with copyright clause'-style  ADD   127
Samba    None  GPL v2                                                                             ADD   15
ArgoUML and Samba kept the same
licenses over the analyzed time span
Change is from None to a simple license
Authors realized the importance of including a
license
RQ2: How and why were
copyright years changed?
Files for which the copyright years were
updated underwent a significantly higher
number of changes than others
When developers perform substantial changes to a
file, they also update copyright years
Required by copyright regulations
Lack of updates with substantial changes would
allow an infringer to claim “innocent infringement”
Commits explicitly targeted to copyright years
“Updated copyrights”
“Updated copyrights to 2004”
RQ3: When do contributors change?
Changes where contributor
names are added are significantly
bigger than other changes
Contributors often added
when they make substantial
changes
Contributor names are important
assets in source code
Like the signature on a picture
However…
contributors can change over time
no standard way of reporting them
no clear rule on when one should become a
contributor
Their presence can have legal implications
Licenses Influence
Code Migration
Free (software) as a bird…
As birds migrate differently
during different seasons….
Code might have a
preferential migration
direction
Given two systems
e.g. FreeBSD and Linux
We find the same code in
both systems
Three scenarios:
Migration FreeBSD → Linux
Migration Linux → FreeBSD
Migration third-party → FreeBSD, Linux
Sibling(s) Origin
Identify siblings between systems using clone detection
CCFinderX, with >100 tokens as threshold, plus other heuristics
Trace each sibling back through the history of the
file that contains it
Again clone detection: the sibling fragment is compared with previous file
revisions
The revision before which a fragment disappears marks its origin
Take the older of the two origins as the true origin
[Figure: cloned fragments (siblings) in Sys 1 – File i and Sys 2 – File j, with the migration direction between them]
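The trace-back step can be sketched as below. This is a simplified illustration: it replaces clone detection with whitespace-insensitive substring matching, assumes each file history is available as a list of (date, text) revisions, and uses hypothetical function names. The third-party scenario, where both systems copy from an external source, is not detected by this sketch.

```python
def _normalize(text):
    # Ignore layout differences; a stand-in for real clone detection.
    return " ".join(text.split())


def origin_date(history, fragment):
    """Walk back from the newest revision until the fragment disappears.

    history: iterable of (date, file_text) pairs for one file.
    Returns the date of the oldest revision in the unbroken run of
    revisions that contain the fragment, or None if the newest revision
    does not contain it.
    """
    needle = _normalize(fragment)
    origin = None
    for when, text in sorted(history, key=lambda rev: rev[0], reverse=True):
        if needle not in _normalize(text):
            break
        origin = when
    return origin


def migration_direction(name_a, history_a, name_b, history_b, fragment):
    """Return (source, destination), or None when it cannot be decided."""
    a = origin_date(history_a, fragment)
    b = origin_date(history_b, fragment)
    if a is None or b is None or a == b:
        return None
    return (name_a, name_b) if a < b else (name_b, name_a)
```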
Code Migration and Licenses
FreeBSD
Linux
BSD
GPL
BSD
MIT
BSD
None
Corporate
BSD+GPL
GPL
None
Phrase
BSD+GPL
X.Net+BSD MIT
Files
8
2
2
89
1
1
1
OpenBSD
Linux
BSD
BSD+GPL
BSD
MIT
Almost nothing after
BSD
Unknown
BSD+GPL
GPL
BSD+Phrase
Phrase+GPL
MIT
GPL
Before
Jan 1, 2002
Linux
BSD+GPL
GPL
GPL
GPL
MIT
MIT+GPL
None
Files
1
2
1
1
1
23
FreeBSD
Files
Corporate
8
BSD
17
BSD+GPL
1
CPL+BSD+GPL
1
BSD
1
None
2
BSD
1
After Jan 1, 2002
Nothing before
Discussion
Siblings have a preferential flow
Initially from BSD(s) to Linux – frequent
Today from Linux to FreeBSD – less frequent
Thus, the flow is due to licenses but also to the systems'
level of development
Companies directly contribute to code in
different kernels – see Intel drivers with
dual licenses
In this case, code migrates from a third party
towards Linux and FreeBSD
Identifying licenses of jar
archives
Motivations
Very often, Java open source software
is distributed in jar archives
See http://mvnrepository.com/
Problem: the jar might not contain
licensing info
Under what conditions can we integrate
the component?
It might not be legal to use the jar
Even if it comes from open source code, we
might not find exactly the same jar
Search-driven approach
Extracting info from the class bytecode
Class and package names.. or a fingerprint..
We use the ASM library (http://asm.ow2.org/)
Querying Google Code Search
Using the fully qualified class name
Using the package only
Query performed using the Google Code API
(http://code.google.com/apis/gdata/)
If the same class is not found, its license is
inferred from those of classes belonging to the
same package
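The fallback rule above can be sketched as follows. Several simplifications are assumed: class names are read from the entry paths of the archive rather than from bytecode via ASM, the code-search results are passed in as a plain dictionary (`found`, a hypothetical stand-in for the search output), and the package-level license is chosen by majority vote, which the slides do not specify.

```python
import zipfile
from collections import Counter


def class_names_in_jar(jar):
    """List fully qualified class names from a jar (path or file object)."""
    with zipfile.ZipFile(jar) as archive:
        return sorted(
            entry[:-len(".class")].replace("/", ".")
            for entry in archive.namelist()
            if entry.endswith(".class")
        )


def package_of(class_name):
    return class_name.rpartition(".")[0]


def infer_licenses(class_names, found):
    """Map every class to a license.

    found: class name -> license for classes located by the code search.
    Classes missing from `found` receive the most frequent license among
    found classes of the same package, or None when the package is unknown.
    """
    by_package = {}
    for name, license_name in found.items():
        by_package.setdefault(package_of(name), Counter())[license_name] += 1

    result = {}
    for name in class_names:
        if name in found:
            result[name] = found[name]
        else:
            votes = by_package.get(package_of(name))
            result[name] = votes.most_common(1)[0][0] if votes else None
    return result
```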
Google Code Search Output
% of correct classifications
Found license:
Min. 29%
(commons.codec), Avg.
82%, median: 89.5%
Inferred licenses:
Min. 62% (JLayer 1.0),
Avg. 95%, median 100%
The inference heuristic
performs significantly better,
both in terms of
completeness and of
precision
Incorrect classifications
Most of them are between LGPL
and GPL and between BSD and
Apache.
commons-codec: mismatching
between Apache and BSD
files licensed under the Apache v 1.1
derived from the BSD
JLayer: mismatching between GPL
and LGPL
same inferred licenses in both
releases (0.4 and 1.0)
however, JLayer moved from GPL to
LGPL from release 0.4 to release 1.0
Conclusions
We proposed a code analysis method as
support for lawyers as well as for software
engineers
We studied how licensing statements are used and
evolve
License type, copyright year, contributors
Main findings:
Licenses influence project outcomes
Licenses influence code migration
Moving towards more permissive licenses
Copyright years and contributor names updated
to preserve rights on new code
Licensing and code provenance
Licensing influences the direction in which
code flows from a system towards another
one
Often code flows in the direction of more
permissive licenses…
..but there are many other factors influencing how
code flows
Search-driven approaches can be adopted to
determine what code a closed
component comes from
And thus its licensing…
Issues related to the capabilities of the code
search tools
Thank you!
References
Daniel M. Germán, Jens H. Weber-Jahnke, Massimiliano Di Penta: Lawful
Software Engineering. Proceedings of FoSER: Working Conference on the
Future of Software Engineering Research, November 2010, Santa Fe, USA,
ACM
Daniel M. Germán, Massimiliano Di Penta, Julius Davies: Understanding and
Auditing the Licensing of Open Source Software Distributions. ICPC 2010: 84-93
Massimiliano Di Penta, Daniel M. Germán, Yann-Gaël Guéhéneuc, Giuliano
Antoniol: An exploratory study of the evolution of software licensing. ICSE
2010: 145-154
Massimiliano Di Penta, Daniel M. Germán, Giuliano Antoniol: Identifying
licensing of jar archives using a code-search approach. MSR 2010: 151-160
Massimiliano Di Penta, Daniel M. Germán: Who are Source Code Contributors
and How do they Change? WCRE 2009: 11-20
Daniel M. Germán, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, Giuliano
Antoniol: Code siblings: Technical and legal implications of copying code
between applications. MSR 2009: 81-90
Daniel M. Germán, Yuki Manabe, Katsuro Inoue: A sentence-matching method
for automatic license identification of source code files. ASE 2010: 437-446
Daniel M. Germán, Ahmed E. Hassan: License integration patterns: Addressing
license mismatches in component-based development. ICSE 2009: 188-198
Robert Gobeille: The FOSSology project. MSR 2008: 47-50