Current practice in quantifying the online commons lacks rigorous methodology. One example is Creative Commons' own data about the growth and state of Creative Commons licensing. Closed, proprietary search engine web services are used to gather approximate counts of web pages that link to licence URLs. While this data is a useful starting point for gauging the current state of Creative Commons-licensed works online, the methodology makes many implicit assumptions. These include: that every licensed work links to a Creative Commons licence; that non-licensed works do not link to the licences; and that proprietary search engines can provide reliable data on links to arbitrary URLs.
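The link-counting methodology can be made concrete with a minimal sketch. The page URLs and HTML snippets below are hypothetical; in practice the input would be a crawl archive rather than an in-memory dictionary, but the classification rule is the same: a page "is licensed" if and only if it links to a Creative Commons licence deed URL.

```python
import re

# Hypothetical crawled pages (URL -> HTML body); a real analysis would
# read these from a crawl archive, not a literal dictionary.
PAGES = {
    "http://example.org/photo": '<a href="http://creativecommons.org/licenses/by/3.0/">CC BY</a>',
    "http://example.org/essay": '<a href="http://creativecommons.org/licenses/by-sa/2.5/">CC BY-SA</a>',
    "http://example.org/blog":  '<a href="http://example.org/about">About</a>',
}

# Matches a link to a Creative Commons licence deed, capturing the
# licence type (e.g. "by-sa") and version (e.g. "2.5").
LICENCE_LINK = re.compile(
    r'href="https?://creativecommons\.org/licenses/([a-z-]+)/(\d+\.\d+)/?"'
)

def count_licence_links(pages):
    """Tally pages per (licence, version), mimicking the link-counting
    methodology: only an explicit link to the licence URL counts."""
    counts = {}
    for url, html in pages.items():
        for licence, version in set(LICENCE_LINK.findall(html)):
            counts[(licence, version)] = counts.get((licence, version), 0) + 1
    return counts

print(count_licence_links(PAGES))
# {('by', '3.0'): 1, ('by-sa', '2.5'): 1} -- the blog page is not counted
```

The sketch makes the implicit assumptions visible in code: any licensed work that does not emit a matching `href` is invisible to this count, and any unlicensed page that happens to link to a deed is counted.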
This methodology fails to capture licensed works that carry valid plain-English licence statements, proper embedded RDF metadata and the appropriate Creative Commons licence mark, but simply do not link to the licence URL. Nor does it generalise well to other categories of documents in the broader commons, such as free software licences or unannotated public domain works, where the mechanism that creates the public rights is not a link to a URL.
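A sketch of the failure mode, assuming a hypothetical page that follows the early Creative Commons convention of embedding RDF/XML inside an HTML comment: the work is validly licensed and machine-detectable, yet carries no visible link to the licence deed, so link counting misses it.

```python
import re

# Hypothetical page: a plain-English licence statement plus RDF/XML in an
# HTML comment (the early Creative Commons convention), but no <a href>
# link to the licence deed.
PAGE = """
<p>This work is licensed under a Creative Commons Attribution licence.</p>
<!--
<rdf:RDF xmlns="http://web.resource.org/cc/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Work rdf:about="">
    <license rdf:resource="http://creativecommons.org/licenses/by/2.0/" />
  </Work>
</rdf:RDF>
-->
"""

# An anchor link to a licence deed -- the only signal link counting sees.
LINK = re.compile(r'<a\s[^>]*href="https?://creativecommons\.org/licenses/')
# A licence asserted in embedded RDF metadata.
RDF_LICENCE = re.compile(
    r'<license\s+rdf:resource="(https?://creativecommons\.org/licenses/[^"]+)"'
)

def licence_signals(html):
    """Report which licensing signals a crawler could extract from a page."""
    return {
        "has_licence_link": bool(LINK.search(html)),
        "rdf_licences": RDF_LICENCE.findall(html),
    }

print(licence_signals(PAGE))
# {'has_licence_link': False,
#  'rdf_licences': ['http://creativecommons.org/licenses/by/2.0/']}
```

The page would be excluded by the link-counting methodology even though its embedded metadata unambiguously asserts a licence; a crawler-based methodology can inspect both signals.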
Analysis of the commons as a body of reusable documents, or analysis of the success of the commons movement, requires reasonable data: data about which licences are being used, which ones are most popular in the current environment, and how different media (including image, text, sound, software and others) compare in the make-up of the commons.
This paper proposes using raw web crawler data as the basis for analysis with a reliable methodology. Preliminary experiments and analysis are performed for contrast with existing quantification methodologies. Methodological issues in online commons quantification are raised and discussed, including the fundamental question of what constitutes a single creative work on the web: while current practice counts individual web pages (that link to licences), this metric cannot easily be applied to media such as motion pictures, software, sound and images. Without such a discussion, any data would have only indicative value.
The paper concludes with a discussion of the many areas of potential future work in the quantification of the online commons: from deep web and OAI-PMH-compliant databases, to embedded RDF metadata and compressed files, to copies of the full text of licences included as part of the licensed work.
Ben Bildstein, "New Methodologies for Quantifying Licence-Based Commons on the Web" (August 2008). University of New South Wales Faculty of Law Research Series 2008. Working Paper 52.