Measuring research software impact is hard. Software often gets mentioned without formal citation, or used programmatically without any mention at all. These challenges have been explored by colleagues (Istrate et al., 2022; Afiaz et al., 2024), work funded by CZI and ITCR respectively. The former extracted software mentions from millions of papers; the latter surveyed tool developers on how they measure impact. Inspired by these efforts, we wanted to explore in more depth, for a single software tool, how it is used based on the published literature. In this experiment we look at cBioPortal, the tool our team works on, classifying which cancer types, analysis methods, and data sources researchers work with.
As a user-facing website for accessing cancer genomics data, cBioPortal has been cited many times, providing a unique opportunity to analyze usage patterns. This goes beyond conventional tracking via Google Analytics or Heap, and aims to paint a clearer picture of how researchers actually use cBioPortal.
The idea was simple: take all papers that cite one of the cBioPortal publications, download the PDF for each open access paper, extract the text, and have an LLM classify each paper by cancer type, usage patterns, and data sources. I vibe-coded an automated pipeline with Claude Code; it uses AWS Bedrock for the classification step.
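To make the classification step concrete, here is a minimal sketch of what one paper's round trip through Bedrock might look like. This is not the actual pipeline: the model ID, prompt wording, and category schema are all my own illustrative assumptions.

```python
import json

# Illustrative prompt and category schema (an assumption, not the real pipeline's).
PROMPT_TEMPLATE = """Classify the following paper that cites cBioPortal.
Reply with a JSON object with keys: cancer_types, analysis_types, data_sources.

Paper text:
{text}
"""

def build_prompt(paper_text: str, max_chars: int = 50_000) -> str:
    """Truncate very long papers so the prompt stays within the context window."""
    return PROMPT_TEMPLATE.format(text=paper_text[:max_chars])

def parse_classification(model_output: str) -> dict:
    """Pull the first JSON object out of the model's reply, which may be
    wrapped in prose or a code fence."""
    start = model_output.find("{")
    end = model_output.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(model_output[start : end + 1])

def classify_paper(paper_text: str,
                   model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0") -> dict:
    """Send one paper to Bedrock and parse the structured reply.
    model_id is a placeholder; use whichever model your account has access to."""
    import boto3  # deferred so the pure helpers above work without AWS configured
    client = boto3.client("bedrock-runtime")
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": build_prompt(paper_text)}]}],
    )
    reply = response["output"]["message"]["content"][0]["text"]
    return parse_classification(reply)
```

Running this over thousands of papers mostly comes down to batching, retries, and caching the per-paper JSON so reruns are cheap.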
As usual, it initially seemed to work with relatively little effort, but of course parsing PDFs turned out to be much harder than I anticipated. In retrospect, maybe I should have just used the structured HTML from PMC.
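For comparison, pulling structured full text from PMC instead of wrangling PDFs might look like the sketch below, which fetches JATS XML via NCBI's E-utilities `efetch` endpoint and keeps only the body paragraphs. The endpoint is real, but the exact extraction logic here is a simplified assumption.

```python
import urllib.request
import xml.etree.ElementTree as ET

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id={pmcid}"

def fetch_jats(pmcid: str) -> str:
    """Download the JATS XML for an open access article.
    pmcid is the numeric PMC identifier (without the 'PMC' prefix)."""
    with urllib.request.urlopen(EFETCH.format(pmcid=pmcid)) as resp:
        return resp.read().decode("utf-8")

def body_text(jats_xml: str) -> str:
    """Concatenate the text of every <p> inside <body>, which skips the
    reference list and other back matter by construction."""
    root = ET.fromstring(jats_xml)
    paragraphs = []
    for body in root.iter("body"):
        for p in body.iter("p"):
            paragraphs.append("".join(p.itertext()).strip())
    return "\n\n".join(paragraphs)
```

Compared to PDF extraction, this sidesteps column-order and hyphenation problems entirely, at the cost of only covering articles deposited in PMC.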
The analysis examined 13,890 papers citing cBioPortal from 2012 to 2026. Top analysis types were gene expression (10,549 papers), mutation analysis (7,282), and survival analysis (7,050). TP53 was the most frequently queried gene (1,062 papers). TCGA dominated as the data source (8,993 papers). USA and China led geographically.
For the full report with the latest data and visualizations, see the usage report.
This experiment made me appreciate how complicated it is to analyze software usage from publications. Many articles detail what data they used rather than how they used cBioPortal. In its current form, the approach isn't as useful for identifying UI usage patterns, though detecting cBioPortal visualizations in paper figures (e.g. OncoPrints, MutationMapper lollipop plots) could help. That said, I think we need more meta-analyses like this to understand how research software is actually used. Citations alone don't tell the full story, and there's a lot of room for the field to develop better approaches.
I haven’t verified many of the findings yet, but it’s promising that some check out with orthogonal lines of evidence. Google Analytics also shows USA and China as top users. A survey from a few years ago indicated strong interest in gene expression analysis, and our second most popular YouTube video is about gene expression.
At the end of the day, just talking to your users is probably still the best way to understand how they use your tool. But this is a nice complementary approach for getting a broader picture.
Work in progress—stay tuned!