“Introducing biorecap: Summarizing bioRxiv preprints with a local LLM

[This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

This is re-posted from my newsletter, where I’ll be posting from now on:

https://blog.stephenturner.us/p/biorecap-r-package-for-summarizing-biorxiv-preprints-local-llm

—

TL;DR

I wrote an R package that summarizes recent bioRxiv preprints using a locally running LLM via Ollama+ollamar, and produces a summary HTML report from a parameterized RMarkdown template. The package is on GitHub and you can install it with devtools: https://github.com/stephenturner/biorecap.
I published a paper about the package on arXiv: Turner, S. D. (2024). biorecap: an R package for summarizing bioRxiv preprints with a local LLM. arXiv, 2408.11707. https://doi.org/10.48550/arXiv.2408.11707.
I wrote both the package and the paper with assistance from LLMs: GitHub copilot for documentation, llama3.1:70b for tests, llama3.1:405b via HuggingFace assistants for drafting, GPT-4o for editing.

The biorecap package

I recently started to explore prompting a local LLM (e.g. Llama3.1) from R via Ollama. Last week I wrote about how to do this, with a few contrived examples: (1) trying to figure out what’s interesting about a set of genes, and (2) summarizing a set of preprints retrieved from bioRxiv’s RSS feed. The gene set analysis really was contrived — as I mentioned in the footnote to that post, you’d never actually want to do a gene set analysis this way when there are plenty of well-understood first-principles approaches to gene set analysis.

The second example wasn’t so contrived. I subscribe to bioRxiv’s RSS feeds, along with many other journals and blogs in genetics, bioinformatics, data science, synthetic biology, and others. The fusillade of new preprints and peer-reviewed papers relevant to my work is relentless. Late last year bioRxiv started adding AI summaries to newly published preprints, but this required multiple clicks to get out of my RSS feed onto the paper’s landing page, another to click into the AI summary. I wanted a way to give me a quick TL;DR on all the recent papers published in particular subject areas (e.g., bioinformatics, genomics, or synthetic biology).

Shortly after putting together that one-off demo, on a sultry Sunday afternoon in Virginia too hot to do anything outside, I took the code from that post, generalized it a bit, and created the biorecap package: https://github.com/stephenturner/biorecap.

You can install it with devtools/remotes:

remotes::install_github("stephenturner/biorecap",
                        dependencies=TRUE)

Create a report with 2-sentence summaries of recent papers published in the bioinformatics, genomics, and synthetic biology sections, using llama3.1:8b running locally on your laptop:

my_subjects <- c("bioinformatics", "genomics", "synthetic_biology")

biorecap_report(output_dir=".",
                subject=my_subjects,
                model="llama3.1")

This report will look different each time you run it, depending on what’s been published that day in the sections you’re interested in.

Figure 1 from Turner 2024 arXiv: Example biorecap report for bioinformatics, genomics, and synthetic biology from August 6, 2024.

The package documentation provides further instructions on usage.

I mentioned I used LLMs to help write this package. I started out using Positron for package development, but quickly fell back to RStudio because I wanted GitHub copilot integration. Among other things, Copilot is great for quickly writing Roxygen documentation for relatively simple functions. I also used llama3.1:70b running locally via Open WebUI to help me write some of the unit tests using testthat. Starting from the code I worked out in the previous post, it took about 2 hours to get the package working, and another hour or so to write documentation and tests and set up GitHub actions for pkgdown deployment and R CMD checks on PRs to main.

The biorecap paper

I published a short paper about the package on arXiv: Turner, S. D. (2024). biorecap: an R package for summarizing bioRxiv preprints with a local LLM. arXiv, 2408.11707. https://doi.org/10.48550/arXiv.2408.11707.

biorecap-preprint

266KB ∙ PDF file

Download

The biorecap preprint published on arXiv

I used two LLMs to assist me in writing the package, which itself uses an LLM to summarize research published on bioRxiv. It only makes sense to close the circle and use an LLM to help me write a preprint to publish on arXiv1 describing the software.

I’ll write a post soon about how to set up a llama3.1:405b-powered chatbot on HuggingFace connected to a GitHub repository to be able to ask questions about the codebase in that repo. It’s free and takes about 60 seconds or less. I started out doing this, asking for help crafting a narrative outline and introduction section for a paper based on the code in the repo. I ended up scrapping this and writing most of the first draft text myself, then using GPT-4o to help with editing, tone, and length reduction. It’s hard to exactly put my finger on it, but what I ended up with still had that feeling of “sounds like it was written by ChatGPT.” I did some of the final editing on my own, and used Mike Mahoney’s arXiv template Quarto plugin to write and typeset the final document.

To leave a comment for the author, please follow the link and comment on their blog: Getting Genetics Done.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Continue reading: biorecap: an R package for summarizing bioRxiv preprints with a local LLM

Long-Term Implications & Future Developments of BioRecap R Package

Based on the original article’s insights, the bioRecap R package created by Stephen Turner has the potential to significantly impact how researchers keep track of new scientific research in bioinformatics, genomics, and synthetic biology. This solution is a particularly useful tool for summarizing bioRxiv preprints using a locally run Large-Language Model (LLM) and then creating a report using parameters defined by the user.

Adoption and Expansion of the bioRecap R Package

An important implication of the bioRecap R package is the potential for widespread adoption. By providing a tool capable of summarizing research papers, the package reduces the often overwhelming task of sifting through a multitude of preprints across various fields of interest. Such an approach allows researchers to get a quick summary of research in their specified areas of interest, providing a way to filter relevant information rather than having to browse through numerous RSS feed updates.

Furthermore, while the current configuration focuses on bioinformatics, genomics, and synthetic biology, it’s not hard to see future usage expanding to include other fields of study. By simply tweaking the parameters defined by the user, this R package could be used across a wide range of disciplines.

Collaboration with AI Technologies

Turner’s successful use of AI and LLM technology to create the package and documentation shines a spotlight on the expansive potential of utilizing AI in the field of programming languages. The combination of RStudio with GitHub copilot demonstrates how quickly and efficiently code documentation can be written for relatively simple functions. Similarly, using llama3.1:70b for writing unit tests shows a potential trend in automating some parts of code testing, which could save developers significant amounts of time in the long run.

Actionable Advice

Adopt this method: Researchers inundated with constant updates from numerous preprints in their field are advised to consider using this R package to customize their summary reports. This method significantly reduces research time on irrelevant or duplicative content.
Incorporate AI tools in Development: Developers should consider using AI technology such as GitHub copilot or LLMs like llama3.1:70b in their development processes. Swift documentation and automated testing not only save time but also contribute to maintaining high-quality code.
Foster Expansion: Developers from a variety of disciplines should consider adopting or modifying this package to suit their specific field of research. The ability to quickly and efficiently summarize research papers is universally applicable and valuable across disciplines.

Possibility of AI-augmented academic writing

The use of LLM to help in writing the paper describing the R package is a promising indication of the potential of AI in both academic and practical aspects of programming. With several AI models available for language tasks, there is the potential for AI to become deeply involved in the creation of academic papers, providing assistance in forming narrative outlines, writing drafts, or helping with the final editing.

Conclusion

The ramifications of the bioRecap R package extend beyond its immediate utility in summarizing bioRxiv preprints. The way it was created, using a collaboration of AI tools and Large Language Models, signifies a future where AI and human developers can co-create, elevating the efficiency and quality of the end product. It also opens up possibilities for AI’s involvement in academic paper crafting, which could significantly impact academic writing and peer-review systems.

Read the original article