[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)



Answering the question of what fraction of a journal’s papers were previously available as a preprint is quite difficult to do. The tricky part is matching preprints (from a number of different servers) with the published output from a journal. The easy matches are those that are directly linked together. The remainder can be hard to identify, since the manuscript may change (authors, title, abstract) between the preprint and the published version.

A strategy by Crossref called Marple, which aims to match preprints to published outputs, seems like the best effort so far. Their code and data up to Aug 2023 are available. Let’s use this to answer the question!

My code is below, let’s look at the results first.

The papers that have a preprint version are red, and those without are in grey. The bars are stacked in these plots and the scale is free so that the journals with different volumes of papers can be compared. The plots show only research papers. Reviews and all other outputs have been excluded as far as possible.

We can replot this to show the fraction of papers that have an associated preprint:

We can see that eLife is on course for 100% of its papers having a preprint version. This is due to a policy decision taken a few years ago.

Then there is a tranche of journals that seem to be stabilising at between 25% and 50% of outputs having a preprinted version. These journals include: Cell Rep, Dev Cell, Development, EMBO J, J Cell Biol, J Cell Sci, MBoC, Nat Cell Biol, Nat Commun, and Plos Biol.

Finally, journals with a very small fraction of preprinted papers include Cells, FASEB J, Front Cell Dev Biol, and JBC.

My focus here was on journals in the cell and developmental biology area. I suspect that the differences in rates between journals reflect the content they carry. Cell and developmental biology, like genetics and biophysics, have an established pattern of preprinting. A journal like JCB, carrying 100% cell biology papers, tops out at 50% in 2022, whereas EMBO J, which has a lower fraction of cell biology papers, plateaus at ~30%. However, discipline doesn’t really explain why Cells and Front Cell Dev Biol have such low preprint rates. I know that there are geographical differences in preprinting, so differences in the regional base of authors at a journal may affect its overall preprint rate. There are likely other contributing factors.

Caveats and things to note:

  • the data only go up to Aug 2023, so the final (2023) bar covers a partial year and is unreliable.
  • the assignment is not perfect – there will be some papers here that have a preprint version but are not linked up, and some erroneous linkages. I sense-checked the data for one journal and could see a couple of duplicates in the Crossref data out of ~600 for that journal, so the error rate seems very low.
  • the PubMed data is good but again, it is hard to exclude some outputs that are not research papers if they are not tagged appropriately.
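
The duplicate check mentioned in the caveats could look something like the sketch below. It is a toy illustration in base R, not the check that was actually run: the data frame is made up, `article_doi` is the column name used in the code further down, and `preprint_doi` is an assumed name for the preprint column.

```r
# Toy stand-in for the Crossref relationships table (made-up DOIs).
# `article_doi` is the column used in the post; `preprint_doi` is an assumed name.
rel <- data.frame(
  preprint_doi = c("10.1101/001", "10.1101/002", "10.1101/002",
                   "10.1101/003", "10.1101/004"),
  article_doi  = c("10.1000/A1", "10.1000/a2", "10.1000/A2",
                   "10.1000/a3", "10.1000/a3"),
  stringsAsFactors = FALSE
)

# Compare DOIs case-insensitively, as the matching code does
key <- paste(tolower(rel$preprint_doi), tolower(rel$article_doi))

# Exact repeats of the same preprint-article pair
dup_pairs <- rel[duplicated(key), ]

# Articles linked to more than one distinct preprint
uniq <- rel[!duplicated(key), ]
n_preprints <- table(tolower(uniq$article_doi))
multi_linked <- names(n_preprints)[as.vector(n_preprints) > 1]

nrow(dup_pairs)
multi_linked
```

Repeated pairs are harmless for the yes/no flag used here, since `%in%` only asks whether a DOI appears at all. Articles linked to several preprints are more worth a manual look.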

The code

devtools::install_github("ropensci/rentrez")
library(rentrez)
library(XML)

# pre-existing script that parses PubMed XML files
source("Script/pubmedXML.R")

# Fetch papers ----
# the combined search term below returns more than 9999 results, so it would need the web history
srchTrm <- paste('("j cell sci"[ta] OR',
                 '"mol biol cell"[ta] OR',
                 '"j cell biol"[ta] OR',
                 '"nat cell biol"[ta] OR',
                 '"embo j"[ta] OR',
                 '"biochem j"[ta] OR',
                 '"dev cell"[ta] OR',
                 '"faseb j"[ta] OR',
                 '"j biol chem"[ta] OR',
                 '"cells"[ta] OR',
                 '"front cell dev biol"[ta] OR',
                 '"nature communications"[ta] OR',
                 '"cell reports"[ta] OR',
                 '"development"[ta] OR',
                 '"elife"[ta] OR',
                 '"plos biol"[ta]) AND',
                 '(2016 : 2023[pdat]) AND',
                 '(journal article[pt] NOT review[pt])')

# so instead we search one journal and one year at a time, using these terms
journalSrchTrms <- c('"j cell sci"[ta]','"mol biol cell"[ta]','"j cell biol"[ta]','"nat cell biol"[ta]','"embo j"[ta]',
                     '"biochem j"[ta]','"dev cell"[ta]','"faseb j"[ta]','"j biol chem"[ta]','"cells"[ta]',
                     '"front cell dev biol"[ta]','"nature communications"[ta]','"cell reports"[ta]',
                     '"development"[ta]','"elife"[ta]','"plos biol"[ta]')


# loop through journals and loop through the years
# 2016:2023
pprs <- data.frame()

for (i in 2016:2023) {
  for(j in journalSrchTrms) {
    srchTrm <- paste(j, ' AND ', i, '[pdat]', sep = "")
    pp <- entrez_search(db = "pubmed",
                        term = srchTrm, use_history = TRUE)
    if(pp$count == 0) {
      next
    }
    pp_rec <- entrez_fetch(db = "pubmed", web_history = pp$web_history, rettype = "xml", parsed = TRUE)
    xml_name <- paste("Data/all_", i,"_",extract_jname(j), ".xml", sep = "")
    saveXML(pp_rec, file = xml_name)
    tempdf <- extract_xml_brief(xml_name)
    if(!is.null(tempdf)) {
      pprs <- rbind(pprs, tempdf)
    }
  }
}
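
The loop relies on `extract_jname()` and `extract_xml_brief()` from `Script/pubmedXML.R`, which is not shown in this post. As a rough idea of what the first helper needs to do, here is a hypothetical stand-in. The name `extract_jname_sketch` and its exact behaviour are assumptions, not the real function. It turns a search term into a tag that is safe to use in a file name.

```r
# Hypothetical stand-in for extract_jname() from Script/pubmedXML.R (not shown in the post).
# Assumed behaviour: turn '"j cell sci"[ta]' into a file-name-safe tag.
extract_jname_sketch <- function(term) {
  # drop the trailing field tag, e.g. [ta]
  name <- sub("\\[[^]]*\\]$", "", term)
  # drop the double quotes around the journal abbreviation
  name <- gsub('"', "", name, fixed = TRUE)
  # replace runs of non-alphanumeric characters with underscores
  name <- gsub("[^A-Za-z0-9]+", "_", trimws(name))
  tolower(name)
}

# Build a file name the same way the loop does
xml_name <- paste("Data/all_", 2016, "_",
                  extract_jname_sketch('"front cell dev biol"[ta]'),
                  ".xml", sep = "")
xml_name
```

Whatever the real helper does, the point is that each journal and year ends up in its own XML file, so a failed download can be repeated without refetching everything.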

Now let’s load in the Crossref data and match it up

library(dplyr)
library(ggplot2)

df_all <- read.csv("Data/crossref-preprint-article-relationships-Aug-2023.csv")

# remove duplicates from pubmed data
pprs <- pprs[!duplicated(pprs$pmid), ]

# remove unwanted publication types by using a vector of strings
unwanted <- c("Review", "Comment", "Retracted Publication", "Retraction of Publication", "Editorial", "Autobiography", "Biography", "Historical", "Published Erratum", "Expression of Concern")
# subset pprs to remove unwanted publication types using grepl
pure <- pprs[!grepl(paste(unwanted, collapse = "|"), pprs$ptype), ]
# ensure that ptype contains "Journal Article"
pure <- pure[grepl("Journal Article", pure$ptype), ]
# remove papers with "NA NA" as the sole author
pure <- pure[!grepl("NA NA", pure$authors), ]

# add a column to pure that flags whether each paper's doi is also found in article_doi (case-insensitive)
pure$in_crossref <- ifelse(tolower(pure$doi) %in% tolower(df_all$article_doi), "yes", "no")

# count the rows in pure whose doi is also found in the Crossref data
nrow(pure[pure$in_crossref == "yes",])

# summarize by year the number of papers in pure and how many are in the yes and no category of in_crossref
summary_df <- pure %>%
  # convert from chr to numeric
  mutate(year = as.numeric(year)) %>%
  group_by(year, journal, in_crossref) %>%
  summarise(n = n())

# make a plot to show stacked bars of yes and no for each year
ggplot(summary_df, aes(x = year, y = n, fill = in_crossref)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  scale_fill_manual(values = c("yes" = "#ae363b", "no" = "#d3d3d3")) +
  lims(x = c(2015.5, 2023.5)) +
  labs(x = "Year", y = "Papers") +
  facet_wrap(~journal, scales = "free_y") +
  theme(legend.position = "none")
ggsave("Output/Plots/preprints_all.png", width = 2400, height = 1800, dpi = 300, units = "px", bg = "white")

# now do plot where the bars stack to 100%
ggplot(summary_df, aes(x = year, y = n, fill = in_crossref)) +
  geom_bar(stat = "identity", position = "fill") +
  theme_minimal() +
  scale_fill_manual(values = c("yes" = "#ae363b", "no" = "#d3d3d3")) +
  lims(x = c(2015.5, 2023.5)) +
  labs(x = "Year", y = "Proportion of papers") +
  facet_wrap(~journal) +
  theme(legend.position = "none")
ggsave("Output/Plots/preprints_scaled.png", width = 2400, height = 1800, dpi = 300, units = "px", bg = "white")
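
The second plot lets `position = "fill"` do the division, so the fractions themselves never appear as numbers. If you want them in a table, something like the sketch below should work. It uses base R and a made-up data frame in the same shape as `summary_df`; the journal names and counts are invented for illustration only.

```r
# Toy counts in the same shape as summary_df (year, journal, in_crossref, n).
# Journal names and numbers are invented for illustration.
toy <- data.frame(
  year        = c(2021, 2021, 2022, 2022, 2022),
  journal     = c("Journal A", "Journal A", "Journal A", "Journal A", "Journal B"),
  in_crossref = c("yes", "no", "yes", "no", "no"),
  n           = c(40, 60, 50, 50, 95),
  stringsAsFactors = FALSE
)

# Total papers and preprinted papers per journal and year
totals <- aggregate(n ~ journal + year, data = toy, FUN = sum)
yes    <- aggregate(n ~ journal + year,
                    data = toy[toy$in_crossref == "yes", ], FUN = sum)
names(totals)[names(totals) == "n"] <- "total"
names(yes)[names(yes) == "n"] <- "preprinted"

# all.x = TRUE keeps journal-years that have no matched preprints at all
frac <- merge(totals, yes, by = c("journal", "year"), all.x = TRUE)
frac$preprinted[is.na(frac$preprinted)] <- 0
frac$fraction <- frac$preprinted / frac$total
frac
```

The left join matters: a journal-year with no "yes" rows would otherwise drop out of the table rather than show up with a fraction of zero.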

Edit: minor update to first plot and code.

The post title comes from “Pre Self” by Godflesh from the “Post Self” album.


Continue reading: Pre Self: what fraction of a journal’s papers are preprinted?

Understanding the Growth and Prevalence of Preprinting in Academic Publishing

Preprinting – the sharing of academic papers before peer review – is steadily becoming common practice in many fields. Yet, measuring how many of a journal’s papers are available as preprints can be tricky, due to discrepancies in details like author names, titles, and abstracts that may occur during the transition from preprint to published version. Nonetheless, this task is essential as it allows us to observe the state and trend of preprinting practices, which will significantly influence the future of academia and publishing.

Findings and Analysis

The code discussed in the article matches preprints to their corresponding published outputs using the results of Crossref’s Marple matching strategy. The analysis shows that some journals, like eLife, are approaching 100% of papers with a preprint version, owing to a policy decision taken a few years ago. A second group of journals, including Cell Reports, Development, and Nat Commun, appear to be stabilising at between 25% and 50% of papers having a preprint version. However, several journals, like Cells and FASEB J, show very low preprint rates.

The rate differences between journals could reflect the content they carry. Fields like cell and developmental biology, where preprinting is well established, tend to show higher preprint rates. For instance, the Journal of Cell Biology (JCB) reached a preprint rate of 50% in 2022, while EMBO Journal, which carries a lower fraction of cell biology papers, plateaus at around 30%. Geographic differences among the author base, alongside other unidentified factors, could also affect preprint rates.

Long-term Implications and Future Developments

The rise of preprinting presents several potential implications for the academic and publishing communities. If current trends persist or accelerate, we could see a more open and transparent academic landscape where the sharing of new research does not have to wait for the lengthy publishing process. However, it also raises issues, such as the credibility of non-peer-reviewed papers and potential ‘scooping’ of research ideas.

At this rate, the publishing world may need to revise its policies and practices to accommodate and properly manage these changes. Incorporating more robust measures for preprint and published article matching could help improve data analytics and reporting in academic publishing. Furthermore, efforts to standardize preprinting practices may help alleviate some concerns or issues born out of its rapid adoption.

Actionable Advice

Scholars and researchers are advised to stay updated on preprinting practices in their respective fields. Preprinting can provide more immediate visibility to your research, but careful consideration should be given to the potential drawbacks. Further, journals and publishers should reassess their approaches to preprints, taking steps to more accurately account for the shift to this new publishing model.

Lastly, developers, data analysts, and librarians could cross-reference this code with their data to extract meaningful insights about preprint practices in their respective fields or institutions. This data will help keep these stakeholders informed and facilitate more strategic decision-making processes in line with the changing nature of academic publishing.
