Licensing talk short

Licensing is Software Too:
Achievements and Challenges
(and how this relates to code provenance)

Massimiliano Di Penta
University of Sannio, Italy
[email protected]
http://www.rcost.unisannio.it/mdipenta

Acknowledgements
 Daniel M. Germán, Univ. Victoria, Canada
 Julius Davies, Univ. Victoria, Canada
 Giuliano Antoniol, Ecole Polyt. Montréal, Canada
 Yann-Gaël Guéhéneuc, Ecole Polyt. Montréal,
Canada

Reusing Open Source Software
 When developing a software system,
we try (if possible) not to reinvent the wheel
 Components, libraries, source
code snippets are out there, ready to be reused

 Code search engines are becoming popular

 Open source code modification and
redistribution are governed by
 Software licenses
 Copyright statements

 All of this is contained in a licensing
block…

What does a licensing block contain?
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
*
* The contents of this file are subject to the Mozilla Public License Version
* 1.1 (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
* http://www.mozilla.org/MPL/
….
* Portions created by the Initial Developer are Copyright (C) 2002
* the Initial Developer. All Rights Reserved.
*
* Contributor(s):
* Brian Ryner
….
* decision by deleting the provisions above and replace them with the notice
* and other provisions required by the GPL or the LGPL. If you do not delete
* the provisions above, a recipient may use your version of this file under
* the terms of any one of the MPL, the GPL or the LGPL.
*
* ***** END LICENSE BLOCK ***** */

Elements highlighted in the block:
 License (MPL+GPL+LGPL)
 Copyright statement
 Copyright year
 Contributor

Restrictive vs. permissive
licenses
 Restrictive (aka copyleft or reciprocal)
  Modified software must be made available under
terms similar to the original's
 Example: GPL

 Permissive
 Modifications/enhancements may remain
proprietary
 Distribution of source code or binary permitted
– Provided copyright notices and/or liability disclaimers are retained

– Contributor names do not imply endorsement

 Examples: Berkeley Software Distribution (BSD),
Apache Software License, MIT

FOSS development teams care!
(source: Debian)
I am in the process of trying to prepare 0.8.0 for Debian
GNU/Linux I have started going over the copyright/license
headers. In src/celeste many files are missing copyright
information. Most of these are files imported with minimal
changes from Gabor API http://www.kung-foo.tv/gaborapi.php
or libsvm http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
The attached patch adds copyright and license statements
to these files.[1]
Please apply and update the headers (adding copyright
holders) if you make substantial changes.
thanks, cu andreas
[1] I have doublechecked with Gabor API's upstream author
Adriaan Tijsseling that files like ContrastFilter.cpp are
Copyright (c) Adriaan Tijsseling and licensed under
GPLv2+, although the original headers just say:
Original Author: Yasunobu Honma
Modifications by: Adriaan Tijsseling (AGT)

Conjectures
 Since licenses determine the way software
can be composed and re-distributed
 They may change/evolve as any other part of
the software
 They might be subject to bugs too
– See our ICPC 2010 paper about how to identify
licensing incompatibilities

 They might determine the success/failure of a
software project

 Code provenance and licenses:
 Licenses constrain source code migration
between projects
 Code provenance might be useful to determine
the licensing of closed components

Licenses influence the software
lifetime

 OpenBSD founder and project leader Theo de Raadt
removed a security software package called IP-Filter
[written by Darren Reed] after its author changed its
license.
Stephen Shankland, CNET News, 2001/05/30.
 Licenses evolve as software does
 Failing to account for that may lead to copyright
infringement

 Decisions on license changes have as much impact as other
decisions on software evolution
 Little attention so far from the scientific community

Need for methods and tools to audit licenses
and their changes

Example: Java
 Until November 2006, the license of Java JDK v1.2 said:
“Except as specifically authorized in any Supplemental
License Terms, you may not make copies of Software,
other than a single copy of Software for archival
purposes”
 This disallowed the inclusion of Java in Linux distributions

 From November 2006, Sun began releasing Java under the GPL v2 with the
Classpath exception:
 Java could be modified/updated under the GPL v2
 Java programs could be released under any license as long as
they satisfy the conditions stated in the Classpath exception


Changing the license of a software system can
promote and ease its distribution and reuse

Example: QT
 First released under a non-open source but free
license, called the FreeQT License, and a commercial
license
 QT became the basis for KDE
 QT v2.0 was released under a new license, the Q Public
License
 incompatible with the GPL

 GNOME project started as a QT-free alternative to KDE
 Harmony project started as a GPL replacement of QT
 Trolltech changed the license of QT v3 to the GPL v2
 The Harmony project was abandoned

Changing the license of a FOSS system
towards a more permissive one might cause
the abandonment of a competing system

Empirical Study
 Goal: analyze licensing evolution
 Purpose: investigating how
developers change licensing
statements
 Context: CVS/SVN repositories of
 ArgoUML, Eclipse-JDT, the FreeBSD and
the OpenBSD kernels, Mozilla, Samba

Research Questions
 RQ1: To what extent are files
changing their licenses?
 RQ2: How are copyright years
changed in licensing statements?
 RQ3: Who are the contributors of a
software project and how do they
change?


Licensing Analysis Method –
Extracting Licensing statements
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
*
* The contents of this file are subject to the Mozilla Public License Version
* 1.1 (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
* http://www.mozilla.org/MPL/
….
* Portions created by the Initial Developer are Copyright (C) 2002
* the Initial Developer. All Rights Reserved.
*
* Contributor(s):
* Brian Ryner
….
* decision by deleting the provisions above and replace them with the notice
* and other provisions required by the GPL or the LGPL. If you do not delete
* the provisions above, a recipient may use your version of this file under
* the terms of any one of the MPL, the GPL or the LGPL.
*
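
The extraction step could be sketched as follows. This is a minimal illustration, not the tooling used in the study: it assumes C-style `/* ... */` comments, and the function name and keyword list are hypothetical.

```python
import re

# Assumption: licensing statements live in C-style block comments.
_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
# Assumption: a comment is a licensing statement if it mentions one of these words.
_KEYWORDS = re.compile(r"licen[sc]e|copyright", re.IGNORECASE)

def extract_licensing_statement(source):
    """Return the first block comment that mentions licensing, or None."""
    for match in _COMMENT.finditer(source):
        text = match.group(0)
        if _KEYWORDS.search(text):
            return text
    return None

# Shortened sample in the style of the header shown above.
sample = """/* -*- Mode: C++ -*- */
/* ***** BEGIN LICENSE BLOCK *****
 * Version: MPL 1.1/GPL 2.0/LGPL 2.1
 * ***** END LICENSE BLOCK ***** */
int main() { return 0; }
"""

statement = extract_licensing_statement(sample)
print(statement)
```

The editor mode line is skipped because it contains no licensing keyword; only the license block is returned.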

Licensing Analysis Method –
Classifying licenses
 FoSSology [Gobeille, MSR 2008]: detects licenses
using the Binary Symbolic Alignment Matrix (bSAM)
 Ninka [German et al., ASE 2010]: uses a pattern-matching approach

/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
*
* The contents of this file are subject to the Mozilla Public License Version
* 1.1 (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
* http://www.mozilla.org/MPL/
….
* Portions created by the Initial Developer are Copyright (C) 2002
* the Initial Developer. All Rights Reserved.
*
* Contributor(s):
* Brian Ryner
….

MPL 1.1/GPL 2.0/LGPL 2.1
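
A toy version of pattern-based classification is sketched below. It does not reproduce the rules of Ninka or FoSSology; the three patterns and the function names are assumptions made only to show the idea of normalizing a statement and matching known license phrases.

```python
import re

# Illustrative patterns only (assumptions, not the rule set of any real tool).
LICENSE_PATTERNS = [
    ("MPL 1.1", re.compile(r"Mozilla Public License Version 1\.1|\bMPL 1\.1\b")),
    ("GPL 2.0", re.compile(r"\bGPL 2\.0\b")),
    ("LGPL 2.1", re.compile(r"\bLGPL 2\.1\b")),
]

def normalize(statement):
    """Drop comment decoration and collapse whitespace so patterns can span lines."""
    lines = [line.strip().lstrip("/*").strip() for line in statement.splitlines()]
    return " ".join(" ".join(lines).split())

def classify(statement):
    """Return the names of all licenses whose pattern occurs in the statement."""
    text = normalize(statement)
    return [name for name, pattern in LICENSE_PATTERNS if pattern.search(text)]

header = """/* ***** BEGIN LICENSE BLOCK *****
 * Version: MPL 1.1/GPL 2.0/LGPL 2.1
 *
 * The contents of this file are subject to the Mozilla Public License Version
 * 1.1 (the "License"); you may not use this file except in compliance with
 * the License.
 */"""

print("/".join(classify(header)))
```

Normalizing first matters because a license phrase is often split across comment lines, as "Version" and "1.1" are here.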

Licensing Analysis Method –
Identifying changes in copyright
years

 Mining references to years in licensing…

/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
*
* The contents of this file are subject to the Mozilla Public License Version
* 1.1 (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
* http://www.mozilla.org/MPL/
….
* Portions created by the Initial Developer are Copyright (C) 2002
* the Initial Developer. All Rights Reserved.
*
* Contributor(s):
* Brian Ryner
….
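
Mining year references could look like the sketch below. It assumes that years only count when they appear on a line mentioning "copyright" and that ranges such as 2002-2004 cover every year in between; both choices, and the function name, are assumptions for illustration.

```python
import re

YEAR = re.compile(r"\b(19[7-9]\d|20\d\d)\b")
YEAR_RANGE = re.compile(r"\b(19[7-9]\d|20\d\d)\s*-\s*(19[7-9]\d|20\d\d)\b")

def copyright_years(statement):
    """Collect the years named on lines that mention 'copyright'."""
    years = set()
    for line in statement.splitlines():
        if "copyright" not in line.lower():
            continue  # skip version numbers and other unrelated digits
        for start, end in YEAR_RANGE.findall(line):
            years.update(range(int(start), int(end) + 1))
        years.update(int(y) for y in YEAR.findall(line))
    return sorted(years)

old = "* Portions created by the Initial Developer are Copyright (C) 2002"
new = "* Portions created by the Initial Developer are Copyright (C) 2002-2004"

added = sorted(set(copyright_years(new)) - set(copyright_years(old)))
print(copyright_years(old), copyright_years(new), added)
```

Comparing the year sets of two revisions of the same file gives the copyright-year change between them.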

Licensing Analysis Method –
Identifying contributor names
 Mining emails, plus various patterns
 Copyright … year name
 Contributor(s) …

 Names are mapped to committers whenever possible

/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
*
* The contents of this file are subject to the Mozilla Public License Version
* 1.1 (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
* http://www.mozilla.org/MPL/
….
* Portions created by the Initial Developer are Copyright (C) 2002
* the Initial Developer. All Rights Reserved.
*
* Contributor(s):
* Brian Ryner
….
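
The two name patterns and the email mining could be approximated as below. The regular expressions are simplified assumptions, the sample header is invented (the name "Jane Doe" and the address are placeholders), and the mapping to committers is left out.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Pattern "Copyright ... year name": the name is whatever follows the year(s).
COPYRIGHT_NAME = re.compile(
    r"copyright\s*(?:\(c\)\s*)?\d{4}(?:\s*[-,]\s*\d{4})*\s+(?:by\s+)?(.+)",
    re.IGNORECASE,
)

def clean(line):
    """Strip comment decoration such as a leading '/*' or '*'."""
    return line.strip().lstrip("/*").strip()

def contributors(statement):
    """Return (names, emails) found in a licensing statement."""
    names, emails = [], []
    in_list = False  # True while reading the lines under 'Contributor(s):'
    for raw in statement.splitlines():
        line = clean(raw)
        emails.extend(EMAIL.findall(line))
        match = COPYRIGHT_NAME.search(line)
        if match:
            names.append(match.group(1).strip(" ."))
            in_list = False
        elif line.lower().startswith("contributor"):
            in_list = True
        elif in_list and line:
            names.append(EMAIL.sub("", line).strip(" <>()"))
        else:
            in_list = False  # a blank or unrelated line ends the list
    return names, emails

header = """/* Copyright (C) 2002 Jane Doe
 *
 * Contributor(s):
 *   Brian Ryner <brian@example.org>
 *
 * Alternatively, the contents of this file may be used under other terms.
 */"""

names, emails = contributors(header)
print(names, emails)
```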

RQ1: Most relevant license changes
Eclipse-JDT
From | To | Type | Count
Common Public License v1.0 | Eclipse Public License v1.0 | CHANGE | 2394
Common Public License v0.5 | Common Public License v1.0 | UPDATE | 808

Mozilla
From | To | Type | Count
NPL | 'NPL v1.1'-style+GPL v2+LGPL v2.1 | DUAL | 2914
NPL | 'Dual MPL GPL'-style+MPL | DUAL | 1274
'Dual MPL GPL'-style+MPL | NPL | BUG | 1194

Licensing updated as new licenses were
developed
 Eclipse JDT: CPL 0.5 → CPL 1.0 → EPL 1.0

 IBM has relinquished control of licenses to the Eclipse
Foundation

 Mozilla: NPL → MPL + GPL (+ LGPL)
 NPL allowed Netscape 6 to be released as a proprietary system
 MPL only allows the source code to be re-distributed under the …

RQ1: Most relevant license changes
FreeBSD
From | To | Type | Count
BSD UCRegents (4-cl BSD) | 'BSD UCRegents'-style (4-cl BSD) | UPDATE | 491
'BSD UCRegents'-style (4-cl BSD) | 'INRIA-OSL'-style (3-cl BSD) | UPDATE | 300

OpenBSD
From | To | Type | Count
'BSD UCRegents'-style (4-cl BSD) | 'INRIA-OSL'-style (3-cl BSD) | UPDATE | 964
BSD UCRegents (4-cl BSD) | 'BSD UCRegents'-style (4-cl BSD) | UPDATE | 414

 FreeBSD and OpenBSD are more eclectic
than other projects
 Moving from the 4-clause BSD license to the more
permissive 3-clause and 2-clause BSD licenses

RQ1: Most relevant license changes
ArgoUML
From | To | Type | Count
None | 'Free with copyright clause'-style + 'UC Regents free with copyright clause'-style | ADD | 127

Samba
From | To | Type | Count
None | GPL v2 | ADD | 15

 ArgoUML and Samba kept the same
licenses over the analyzed time span
 Change is from None to a simple license
 Authors realized the importance of including a
license

RQ2: How and why were
copyright years changed?
 Files for which the copyright years were
updated underwent a significantly higher
number of changes than others
 When developers perform substantial changes to a
file, they also update copyright years
 Required by copyright regulations
 Lack of updates with substantial changes would
allow an infringer to claim “innocent infringement”
 Commits explicitly targeted to copyright years
 “Updated copyrights”
 “Updated copyrights to 2004”

RQ3: When do contributors change?
 Changes where contributor
names are added are significantly
bigger than other changes
 Contributors are often added
when they make substantial
changes
 Contributor names are important
assets in source code
 Like the signature on a picture
 However…
 contributors can change over time
 there is no standard way of reporting them
 there is no clear rule on when one should become a
contributor

 Their presence can have legal implications

Licenses Influence
Code Migration

Free (software) as a bird…
 As birds migrate differently
during different seasons…
 Code might have a
preferential migration
direction
 Given two systems
 e.g. FreeBSD and Linux

 We find the same code in
both systems
 Three scenarios:
 Migration FreeBSD → Linux
 Migration Linux → FreeBSD
 Migration third-party →
FreeBSD, Linux

Sibling(s) Origin
 Identify siblings between systems using clone detection
 CCFinderX, with >100 tokens as threshold, plus other heuristics

 Trace siblings back in time through the earlier revisions of
their files
 Again with clone detection: the sibling fragment against previous file
revisions

 The earliest revision that still contains a sibling is its origin
 Take the older of the two as the true origin

[Figure: siblings, i.e. cloned fragments shared by Sys 1 – File i and Sys 2 – File j, with the migration direction between them]
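
The trace-back step can be illustrated with the sketch below. Real clone detection (CCFinderX in the study) is replaced here by a plain substring test, and the revision histories, dates and function names are invented for the example.

```python
from datetime import date

def origin_date(history, fragment):
    """Walk back from the newest revision until the fragment disappears.

    history: list of (date, file_text) pairs sorted oldest first.
    A substring test stands in for clone detection (an assumption).
    """
    origin = None
    for when, text in reversed(history):
        if fragment not in text:
            break
        origin = when
    return origin

def migration_direction(name_a, history_a, name_b, history_b, fragment):
    """Report the direction from the system with the older origin."""
    a = origin_date(history_a, fragment)
    b = origin_date(history_b, fragment)
    if a is None or b is None:
        return None  # the fragment is not a sibling in the latest revisions
    if a == b:
        return "undetermined"
    return f"{name_a} -> {name_b}" if a < b else f"{name_b} -> {name_a}"

# Invented revision histories for two files that share a fragment.
FRAGMENT = "static int probe(void) { return 0; }"
sys1 = [
    (date(2000, 1, 1), "/* driver */"),
    (date(2000, 6, 1), "/* driver */ " + FRAGMENT),
    (date(2001, 1, 1), "/* driver v2 */ " + FRAGMENT),
]
sys2 = [
    (date(2001, 3, 1), "/* port */"),
    (date(2001, 9, 1), "/* port */ " + FRAGMENT),
]

direction = migration_direction("Sys1", sys1, "Sys2", sys2, FRAGMENT)
print(direction)
```

A sibling that first appears on the same date in both systems is left undetermined, which is one plausible sign of a third-party origin.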

Code Migration and Licenses
Before Jan 1, 2002 (almost nothing after)

FreeBSD | Linux | Files
BSD | GPL | 8
BSD | MIT | 2
BSD | None | 2
Corporate | BSD+GPL | 89
GPL | None | 1
Phrase | BSD+GPL | 1
X.Net+BSD | MIT | 1

OpenBSD | Linux | Files
BSD | BSD+GPL | 1
BSD | MIT | 2
BSD | Unknown | 1
BSD+GPL | GPL | 1
BSD+Phrase | Phrase+GPL | 1
MIT | GPL | 23

After Jan 1, 2002 (nothing before)

Linux | FreeBSD | Files
BSD+GPL | Corporate | 8
GPL | BSD | 17
GPL | BSD+GPL | 1
GPL | CPL+BSD+GPL | 1
MIT | BSD | 1
MIT+GPL | None | 2
None | BSD | 1

Discussion
 Siblings have a preferential flow
 Initially from BSD(s) to Linux – frequent
 Today from Linux to FreeBSD – less frequent
 This is due to licenses, but also to each system's
level of development

 Companies directly contribute to code in
different kernels – see Intel drivers with
dual licenses
 In this case, code migrates from a third party
towards Linux and FreeBSD

Identifying licenses of jar
archives

Motivations
 Very often, Java open source software
is distributed in jar archives
 See http://mvnrepository.com/

 Problem: the jar might not contain
licensing info
 Under what conditions can we integrate
the component?
 It might not be legal to use the jar
 Even if it’s from open source code, we
might not find exactly the same jar

Search-driven approach
 Extracting info from the class bytecode
 Class and package names.. or a fingerprint..
 We use the ASM library (http://asm.ow2.org/)

 Querying Google Code Search
 Using the fully qualified class name
 Using the package only
 Query performed using the Google Code API
(http://code.google.com/apis/gdata/)
 If the same class is not found, its license is
inferred from those of classes belonging to the
same package
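
The fallback from class to package can be sketched as below. The search service is replaced by an in-memory dictionary, the class names and licenses in it are invented, and using a majority vote within the package is an assumption about how the licenses of sibling classes are combined.

```python
from collections import Counter

def package_of(class_name):
    """'org.example.codec.Hex' -> 'org.example.codec'."""
    return class_name.rpartition(".")[0]

def infer_license(class_name, index):
    """Look the class up directly, else vote among classes of its package.

    index: dict mapping fully qualified class names to license names;
    it stands in for the results of a code search (an assumption).
    """
    if class_name in index:
        return index[class_name], "found"
    package = package_of(class_name)
    votes = Counter(
        lic for name, lic in index.items() if package_of(name) == package
    )
    if votes:
        return votes.most_common(1)[0][0], "inferred"
    return None, "unknown"

# Invented search results.
index = {
    "org.example.codec.Base64": "Apache-2.0",
    "org.example.codec.Hex": "Apache-2.0",
    "org.example.audio.Player": "LGPL-2.1",
}

print(infer_license("org.example.codec.Base64", index))
print(infer_license("org.example.codec.Digest", index))
print(infer_license("com.other.Thing", index))
```

Returning how the answer was obtained ("found" or "inferred") keeps the two kinds of classification separable when accuracy is measured.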

Google Code Search Output

% of correct classifications
 Found licenses:
 Min. 29% (commons-codec), avg. 82%, median 89.5%

 Inferred licenses:
 Min. 62% (JLayer 1.0), avg. 95%, median 100%

 The inference heuristic is
significantly better
in terms of both
completeness and
precision

Incorrect classifications
 Most of them are between LGPL
and GPL and between BSD and
Apache.
 commons-codec: mismatch
between Apache and BSD
 files are licensed under the Apache License v1.1,
 which is derived from the BSD license

 JLayer: mismatch between GPL
and LGPL
 same inferred licenses in both
releases (0.4 and 1.0)
 however, JLayer moved from GPL to
LGPL from release 0.4 to release 1.0

Conclusions
 We proposed code analysis methods to
support lawyers as well as software
engineers
 We studied how licenses are used and
evolve
 License type, copyright year, contributors

 Main findings:
 Licenses influence project outcomes
 Licenses influence code migration
 Projects move towards more permissive licenses
 Copyright years and contributor names are updated
to preserve rights on new code

Licensing and code provenance
 Licensing influences the direction in which
code flows from one system to
another
 Often code flows in the direction of more
permissive licenses…
 …but many other factors influence how
code flows

 Search-driven approaches can be adopted to
determine what code a closed
component comes from
 And thus its licensing…
 Issues related to the capabilities of the code
search tools

Thank you!

References
 Daniel M. Germán, Jens H. Weber-Jahnke, Massimiliano Di Penta: Lawful
Software Engineering, Proceedings of FoSER: Working Conference on the
Future of Software Engineering Research, November 2010, Santa Fe, USA,
ACM
 Daniel M. Germán, Massimiliano Di Penta, Julius Davies: Understanding and
Auditing the Licensing of Open Source Software Distributions. ICPC 2010: 84-93
 Massimiliano Di Penta, Daniel M. Germán, Yann-Gaël Guéhéneuc, Giuliano
Antoniol: An exploratory study of the evolution of software licensing. ICSE
2010: 145-154
 Massimiliano Di Penta, Daniel M. Germán, Giuliano Antoniol: Identifying
licensing of jar archives using a code-search approach. MSR 2010: 151-160
 Massimiliano Di Penta, Daniel M. Germán: Who are Source Code Contributors
and How do they Change? WCRE 2009: 11-20
 Daniel M. Germán, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, Giuliano
Antoniol: Code siblings: Technical and legal implications of copying code
between applications. MSR 2009: 81-90
 Daniel M. Germán, Yuki Manabe, Katsuro Inoue: A sentence-matching method
for automatic license identification of source code files. ASE 2010: 437-446
 Daniel M. Germán, Ahmed E. Hassan: License integration patterns: Addressing
license mismatches in component-based development. ICSE 2009: 188-198
 Robert Gobeille: The FOSSology project. MSR 2008: 47-50