Comparing Published Scientific Journal Articles to Their Pre-print Versions
Martin Klein Peter Broadwell
@mart1nkle1n @peterbroadwell
with Sharon E. Farb and Todd Grappone
@farbthink, @liber8er
{martinklein,broadwell,farb,grappone}@library.ucla.edu
University of California Los Angeles
2. Comparing Published Scientific Journal Articles
to Their Pre-print Versions
@mart1nkle1n #jcdl2016, Newark, NJ, 06/21/2016
2
Scientific Output in Numbers
Global STM publishing market > $25 billion
• 55% of this from USA
• 28% from Europe, Middle East
• Journals core part of scholarly communication process
• English language journal revenue: ~ $10 billion
• ~ 70% of that out of libraries’ budget
• > 28k scholarly peer-reviewed journals (+3.5% p.a.)
• ~ 2.5 million articles per year (+3% p.a.)
• 21% of research papers from USA
“STM Report: An Overview of Scientific and Scholarly Publishing”, Mark Ware and Michael Mabe, March 2015
University of California Publication Impact
“Research Performance of the UC System,” Elsevier, March 2015
Open Access by Disciplines
“Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al. 2010
http://dx.doi.org/10.1371/journal.pone.0011273
Open Access Rate Overall
2010
“Open Access to the Scientific Journal Literature: Situation 2009”, Björk B-C et al.
(http://dx.doi.org/10.1371/journal.pone.0011273)
20.4% OA rate
2015
“Open Access and Sources of Full-Text Articles in Google Scholar in Different
Subject Fields”, Jamali and Nabavi
(http://dx.doi.org/10.1007/s11192-015-1642-2)
61.1% OA rate
Pre-print v. Final Published
arXiv.org
• Average annual operating cost for 2013-2017: $826,000
Final Published
• English language STM journals: $10 billion in 2013
http://arxiv.org/help/support/faq#3D
“STM Report: An Overview of Scientific and Scholarly Publishing”, Mark Ware and Michael Mabe, March 2015
Role of Publisher
• Entrepreneur
• Copyediting
• Tagging
• Marketer
• Distributor
• E-Host
Value of Publisher
“Once you’ve gone through the peer review process, if you look
at the article that is actually published in a journal, it looks
radically different [to the one submitted due to] that process of
transformation, the copy-editing, the database linking, the data
visualisation tools, making sure that the metadata for the article
is all right, so when people come to [Elsevier database]
ScienceDirect or type a search into Google, they can actually
find what they are looking for on their platforms.”
Gemma Hersh
http://www.thebookseller.com/news/elsevier-defends-its-value-after-open-access-disputes-328037
Working Assumptions
1. If the publishers’ argument is valid, the text of a pre-print paper should vary significantly from its corresponding post-print version.
2. By applying standard similarity measures, we should be able to detect and quantify such differences.
Assembling a pre-print corpus
Source: arXiv.org
• 1.1 million publication records
• Metadata (standard Dublin Core, including DOI) obtained via the OAI-PMH interface
• PDF versions of articles available via Amazon’s S3 service (using the “requester pays” option)
Finding a matching post-print corpus
1. Extract DOIs from arXiv metadata
• 44.5% of articles have a DOI
2. CrossRef’s Metadata Search API
• Match by DOI
• Download article & metadata in XML/PDF
Results in:
• 11,017 full text articles
• Majority published by Elsevier between 2003 and 2015
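The DOI extraction step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each harvested record is an OAI-PMH `oai_dc` XML fragment and that a DOI, when present, appears inside a `dc:identifier` element. The sample records, the regular expression, and the function names are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of Dublin Core elements in oai_dc records.
DC_NS = "{http://purl.org/dc/elements/1.1/}"

# Loose DOI pattern (an assumption, not a full validator):
# "10.", a registrant code of 4-9 digits, "/", then a non-space suffix.
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")


def extract_doi(record_xml):
    """Return the first DOI found in a record's dc:identifier elements, or None."""
    root = ET.fromstring(record_xml)
    for identifier in root.iter(DC_NS + "identifier"):
        match = DOI_RE.search(identifier.text or "")
        if match:
            # DOIs are case-insensitive, so lowercase them before matching.
            return match.group(0).rstrip(".,;").lower()
    return None


def doi_share(records):
    """Return (list of DOIs found, fraction of records that carry a DOI)."""
    dois = [extract_doi(record) for record in records]
    found = [doi for doi in dois if doi is not None]
    share = len(found) / len(dois) if dois else 0.0
    return found, share


# Made-up sample records for illustration only.
WRAPPER = (
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">{body}</oai_dc:dc>'
)
SAMPLE_RECORDS = [
    WRAPPER.format(
        body="<dc:identifier>http://arxiv.org/abs/0000.00001</dc:identifier>"
        "<dc:identifier>doi:10.1234/Example.001</dc:identifier>"
    ),
    WRAPPER.format(
        body="<dc:identifier>http://arxiv.org/abs/0000.00002</dc:identifier>"
    ),
]

if __name__ == "__main__":
    found, share = doi_share(SAMPLE_RECORDS)
    print(found, share)
```

The extracted DOIs would then serve as lookup keys when querying a DOI-based service for the final published version.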
Text Comparison Methods
1. Length ratio
2. Levenshtein ratio
3. Cosine similarity
4. Jaccard coefficient
5. Sorensen similarity
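The five measures can be sketched in a few lines of Python. The slides do not give exact definitions, so the ones used here are assumptions: length ratio as shorter length over longer length, Levenshtein ratio as one minus the edit distance divided by the longer length, cosine similarity over word-frequency vectors, and Jaccard and Sorensen over sets of lowercase word tokens. All function names are illustrative.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens (a simple tokenizer chosen for illustration)."""
    return re.findall(r"\w+", text.lower())


def length_ratio(a, b):
    """Length of the shorter text divided by the length of the longer one."""
    longest = max(len(a), len(b))
    return min(len(a), len(b)) / longest if longest else 1.0


def levenshtein_distance(a, b):
    """Character-level edit distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_ratio(a, b):
    """1 minus the edit distance normalized by the longer length (assumed form)."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein_distance(a, b) / longest if longest else 1.0


def cosine_similarity(a, b):
    """Cosine of the angle between the two word-frequency vectors."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Shared distinct words divided by all distinct words."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def sorensen(a, b):
    """Twice the shared distinct words divided by the sum of both set sizes."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 1.0


if __name__ == "__main__":
    pre = "We measure the decay rate of the particle."
    post = "We measured the decay rate of this particle."
    for measure in (length_ratio, levenshtein_ratio, cosine_similarity,
                    jaccard, sorensen):
        print(measure.__name__, round(measure(pre, post), 3))
```

Every measure returns a value between 0 and 1, with 1 meaning the two texts are identical under that measure. The quadratic-time edit distance is fine for titles and abstracts but may need a faster implementation for full article bodies.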
Comparison of Sections
“Analyzing News Events in Non-Traditional Digital Library Collections”, M. Klein and P. Broadwell, 2015
http://dx.doi.org/10.1145/2756406.2756948
10.1016/j.physletb.2006.10.068
Physics Letters B
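Publication Dates
[Figure: histogram of the number of papers (y-axis, 0 to 5,000) by number of days between the pre-print and the final published version (x-axis, 90-day bins from 1−90 to >720), with separate bars for “Pre-print first” and “Final published first”.]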
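The binning behind a publication date histogram of this kind can be sketched as below. This is an illustrative reconstruction, not the authors' code: it assumes each matched paper comes with a pre-print date and a final publication date, uses 90-day bins capped at 720 days to mirror the axis labels, and skips same-day pairs because the axis shows no bin for them. The names and the dates in the usage example are made up.

```python
from collections import Counter
from datetime import date

BIN_WIDTH = 90   # days per bin, matching the axis labels
MAX_DAYS = 720   # larger differences fall into the ">720" bin


def bin_label(days):
    """Map a positive day difference to a label such as '1-90' or '>720'."""
    if days > MAX_DAYS:
        return ">720"
    upper = ((days - 1) // BIN_WIDTH + 1) * BIN_WIDTH
    return f"{upper - BIN_WIDTH + 1}-{upper}"


def date_histogram(pairs):
    """Count papers per (which version came first, day-difference bin)."""
    counts = Counter()
    for preprint_date, published_date in pairs:
        delta = (published_date - preprint_date).days
        if delta == 0:
            continue  # assumption: same-day pairs are left out
        first = "pre-print first" if delta > 0 else "final published first"
        counts[(first, bin_label(abs(delta)))] += 1
    return counts


if __name__ == "__main__":
    # Made-up date pairs: (pre-print date, final publication date).
    example_pairs = [
        (date(2015, 1, 1), date(2015, 3, 1)),
        (date(2015, 6, 1), date(2015, 1, 1)),
    ]
    for key, count in sorted(date_histogram(example_pairs).items()):
        print(key, count)
```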
Assembling a pre-print corpus
Source: arXiv.org
• 1.1 million publication records
• Metadata (standard Dublin Core, including DOI) obtained via the OAI-PMH interface
• PDF versions of articles available via Amazon’s S3 service (using the “requester pays” option)
• *Latest version used if multiple available*
• 35% of all arXiv papers have > 1 version
• 58% of our matched papers have > 1 version
• Repeat experiment with *earliest version*
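Choosing between the latest and the earliest version of each pre-print can be sketched as follows. The sketch assumes versioned identifiers of the form `<id>v<N>`, the usual arXiv convention; the function names and the identifiers in the usage example are made up for illustration.

```python
import re
from collections import defaultdict

# A versioned identifier is a base id followed by "v" and a version number.
VERSIONED_ID = re.compile(r"^(?P<base>.+?)v(?P<version>\d+)$")


def group_versions(versioned_ids):
    """Map each base identifier to the list of its version numbers."""
    by_paper = defaultdict(list)
    for versioned_id in versioned_ids:
        match = VERSIONED_ID.match(versioned_id)
        if match is None:
            raise ValueError(f"no version suffix in {versioned_id!r}")
        by_paper[match["base"]].append(int(match["version"]))
    return dict(by_paper)


def select_versions(versioned_ids, pick="latest"):
    """Return one versioned identifier per paper: the latest or the earliest."""
    if pick not in ("latest", "earliest"):
        raise ValueError("pick must be 'latest' or 'earliest'")
    chooser = max if pick == "latest" else min
    return {base: f"{base}v{chooser(versions)}"
            for base, versions in group_versions(versioned_ids).items()}


def multi_version_share(versioned_ids):
    """Fraction of papers that have more than one version."""
    groups = group_versions(versioned_ids)
    multi = sum(1 for versions in groups.values() if len(versions) > 1)
    return multi / len(groups) if groups else 0.0


if __name__ == "__main__":
    ids = ["0000.00001v1", "0000.00001v2", "0000.00001v10", "0000.00002v1"]
    print(select_versions(ids, pick="latest"))
    print(select_versions(ids, pick="earliest"))
    print(multi_version_share(ids))
```

Version numbers are compared as integers, so `v10` correctly ranks after `v2`.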
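Publication Dates of Earliest Versions
[Figure: histogram of the number of papers (y-axis, 0 to 4,000) by number of days between the earliest pre-print version and the final published version (x-axis, 90-day bins from 1−90 to >720), with separate bars for “Pre-print first” and “Final published first”.]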
Discussion & Future Work
• Single corpus experiment
• Pre-print/final published matches based on:
• DOIs
• CrossRef API results
• UCLA serial subscriptions (majority Elsevier publications)
• Expand to other disciplines/publishers
• Overlay with ISI impact factor and usage statistics
• Refine extraction/comparison of authors and references
• Operate at scale