arXiv:2405.19538v1 Announce Type: cross Abstract: Since the release of the original CheXpert paper five years ago, CheXpert has become one of the most widely used and cited clinical AI datasets. The emergence of vision language models has sparked an increase in demands for sharing reports linked to CheXpert images, along with a growing interest among AI fairness researchers in obtaining demographic data. To address this, CheXpert Plus serves as a new collection of radiology data sources, made publicly available to enhance the scaling, performance, robustness, and fairness of models for all subsequent machine learning tasks in the field of radiology. CheXpert Plus is the largest text dataset publicly released in radiology, with a total of 36 million text tokens, including 13 million impression tokens. To the best of our knowledge, it represents the largest text de-identification effort in radiology, with almost 1 million PHI spans anonymized. It is only the second time that a large-scale English paired dataset has been released in radiology, thereby enabling, for the first time, cross-institution training at scale. All reports are paired with high-quality images in DICOM format, along with numerous image and patient metadata covering various clinical and socio-economic groups, as well as many pathology labels and RadGraph annotations. We hope this dataset will boost research for AI models that can further assist radiologists and help improve medical care. Data is available at the following URL: https://stanfordaimi.azurewebsites.net/datasets/5158c524-d3ab-4e02-96e9-6ee9efc110a1 Models are available at the following URL: https://github.com/Stanford-AIMI/chexpert-plus
The article “CheXpert Plus: A New Collection of Radiology Data Sources for Enhanced AI Models” introduces CheXpert Plus, a new dataset that aims to improve the performance, scalability, robustness, and fairness of machine learning models in radiology. Since the release of the original CheXpert paper, CheXpert has become one of the most widely used and cited clinical AI datasets. With the emergence of vision language models, however, there is now demand for sharing the reports linked to CheXpert images, as well as growing interest among AI fairness researchers in obtaining demographic data.
CheXpert Plus addresses these needs by providing a large collection of radiology data sources, including 36 million text tokens, making it the largest text dataset publicly released in radiology. It also represents a major de-identification effort, with almost 1 million PHI spans anonymized. As only the second large-scale English paired dataset released in radiology, it enables cross-institution training at scale for the first time.
CheXpert Plus includes high-quality images in DICOM format paired with reports, along with image and patient metadata covering various clinical and socio-economic groups, pathology labels, and RadGraph annotations. The goal of the dataset is to support research on AI models that can assist radiologists and improve medical care. Both the data and the models are publicly available, giving researchers valuable resources for their work.
Overall, CheXpert Plus is a comprehensive and significant contribution to the field of radiology, offering a rich dataset and accompanying models that can advance AI development in healthcare.
Introducing CheXpert Plus: Enhancing Radiology AI with Text Data
Since the release of the original CheXpert paper five years ago, CheXpert has become one of the most widely used and cited clinical AI datasets. However, with the emergence of vision language models, there has been a growing demand for sharing reports linked to CheXpert images and an increasing interest among AI fairness researchers in obtaining demographic data. This has led to the creation of CheXpert Plus, a new collection of radiology data sources aimed at enhancing the scaling, performance, robustness, and fairness of models in the field of radiology.
CheXpert Plus is a groundbreaking dataset that offers a wealth of text data, making it the largest text dataset publicly released in radiology to date. With a total of 36 million text tokens, including 13 million impression tokens, it provides a comprehensive resource for training and testing AI models in the field. What sets CheXpert Plus apart is its focus on de-identification and privacy: with nearly 1 million protected health information (PHI) spans anonymized, it represents what is likely the largest text de-identification effort in radiology. This commitment to privacy ensures that researchers and practitioners can work with the data while protecting patient confidentiality.
Additionally, CheXpert Plus pairs all reports with high-quality images in DICOM format. This combination of text and image data creates a rich dataset that can be used in a wide range of studies and applications. Furthermore, CheXpert Plus includes image and patient metadata covering different clinical and socio-economic groups, as well as pathology labels and RadGraph annotations. This diversity allows researchers to investigate the impact of demographic factors on AI model performance and fairness, a topic of significant interest in AI ethics.
One notable aspect of CheXpert Plus is its contribution to cross-institution training at scale. As only the second large-scale English paired dataset released in radiology, it allows researchers to combine data from different healthcare institutions, enabling more robust and generalizable AI models. This marks a significant step forward in the field.
The availability of CheXpert Plus holds great promise for advancing AI models that assist radiologists and improve medical care. By combining text data, image metadata, and demographic information, researchers can develop models that consider a broader range of factors, leading to more accurate diagnoses and personalized patient care. The dataset is publicly available for access, allowing researchers to explore its potential and drive innovation in radiology AI.
Data Access
The CheXpert Plus dataset can be accessed at the following URL: https://stanfordaimi.azurewebsites.net/datasets/5158c524-d3ab-4e02-96e9-6ee9efc110a1
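To illustrate how report text might be paired with image paths after download, here is a minimal sketch. Note that the column names (`path_to_image`, `section_findings`, `section_impression`) and the inline sample are assumptions for illustration; consult the dataset's own documentation for the actual file layout.

```python
import csv
import io

# Hypothetical sample mimicking a reports CSV; the real CheXpert Plus
# file layout and column names may differ.
sample_csv = """path_to_image,section_findings,section_impression
patient1/study1/view1.dcm,Lungs are clear.,No acute disease.
patient2/study1/view1.dcm,Mild cardiomegaly.,Cardiomegaly without edema.
"""

def load_report_pairs(csv_text):
    """Parse (image path, impression) pairs from a reports CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["path_to_image"], row["section_impression"]) for row in reader]

pairs = load_report_pairs(sample_csv)
print(pairs[0])  # → ('patient1/study1/view1.dcm', 'No acute disease.')
```

In practice the image paths would point to the released DICOM files, which can be read with a DICOM library such as pydicom.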
Model Repository
The models built using CheXpert Plus data are available at the following URL: https://github.com/Stanford-AIMI/chexpert-plus
Embracing the potential of CheXpert Plus, researchers and practitioners can push the boundaries of radiology AI, driving innovation and ultimately improving patient outcomes. This dataset provides a foundation for training robust and fair AI models that can augment radiologist expertise and enhance medical care for all individuals.
The release of CheXpert Plus is a significant development in the field of clinical AI and radiology. CheXpert has already established itself as a widely used and cited clinical AI dataset, and the introduction of CheXpert Plus further expands its capabilities and potential applications.
One of the key motivations behind the creation of CheXpert Plus is the emergence of vision language models and the increasing demand for sharing reports linked to CheXpert images. This highlights the importance of integrating textual information with visual data in order to enhance the performance and robustness of AI models in radiology. By providing a large text dataset with 36 million text tokens, including 13 million impression tokens, CheXpert Plus enables researchers to explore and develop models that can effectively process and interpret radiology reports.
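As a rough illustration of what "text tokens" means here, the sketch below counts tokens in an impression section with a simple word/punctuation split. This is illustrative only: the paper's 36-million-token figure was computed with its own tokenizer, which likely differs from this one.

```python
import re

def count_tokens(text):
    """Count tokens using a naive word-or-punctuation split
    (an assumption, not the paper's actual tokenizer)."""
    return len(re.findall(r"\w+|[^\w\s]", text))

impression = "No acute cardiopulmonary abnormality."
print(count_tokens(impression))  # → 5 (four words plus the period)
```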
Another crucial aspect addressed by CheXpert Plus is the growing interest among AI fairness researchers in obtaining demographic data. By including patient metadata covering various clinical and socio-economic groups, CheXpert Plus promotes fairness in AI models by enabling researchers to analyze and mitigate biases that may arise from demographic factors. This emphasis on fairness is a significant step towards ensuring that AI technologies in radiology are equitable and provide accurate and reliable results for all patient populations.
Moreover, the de-identification effort in CheXpert Plus is noteworthy. With almost 1 million PHI (Protected Health Information) spans anonymized, it represents a significant achievement in preserving patient privacy while making the dataset publicly available. This commitment to privacy protection is essential in maintaining ethical standards and complying with data privacy regulations.
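The core operation behind span-based de-identification can be sketched as replacing each annotated PHI span with a category placeholder. The span offsets and labels below are invented for the example; the dataset's actual anonymization pipeline is far more sophisticated.

```python
def mask_phi(text, spans):
    """Replace PHI spans with placeholders.
    spans: list of (start, end, label), non-overlapping and sorted.
    Offsets/labels here are hypothetical, for illustration only."""
    out, prev = [], 0
    for start, end, label in spans:
        out.append(text[prev:start])
        out.append(f"[{label}]")
        prev = end
    out.append(text[prev:])
    return "".join(out)

report = "Patient John Doe seen on 01/02/2020 at Stanford."
spans = [(8, 16, "NAME"), (25, 35, "DATE")]
print(mask_phi(report, spans))  # → Patient [NAME] seen on [DATE] at Stanford.
```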
The availability of high-quality images in DICOM format, along with pathology labels and RadGraph annotations, further enhances the utility of CheXpert Plus. This comprehensive dataset allows researchers to develop AI models that can assist radiologists in accurately diagnosing and interpreting medical images. The possibility of cross-institution training at scale is particularly significant, as it enables the development of models that generalize well across different healthcare settings, potentially leading to improved medical care and outcomes for patients.
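To give a feel for what RadGraph annotations provide, here is a hedged sketch of traversing a tiny RadGraph-style structure: entities carrying clinical labels and relations to other entities. The field names (`entities`, `tokens`, `label`, `relations`) follow the public RadGraph schema as commonly described, but they are assumptions here; verify against the released annotation files before relying on them.

```python
import json

# Tiny hypothetical RadGraph-style annotation for one report sentence.
annotation = json.loads("""
{
  "entities": {
    "1": {"tokens": "lungs", "label": "ANAT-DP", "relations": []},
    "2": {"tokens": "clear", "label": "OBS-DP", "relations": [["located_at", "1"]]}
  }
}
""")

def list_relations(entities):
    """Return (source tokens, relation, target tokens) triples."""
    return [
        (ent["tokens"], rel, entities[target]["tokens"])
        for ent in entities.values()
        for rel, target in ent["relations"]
    ]

print(list_relations(annotation["entities"]))  # → [('clear', 'located_at', 'lungs')]
```

Structured triples like these let models learn which observations attach to which anatomical locations, rather than treating the report as a flat string.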
In conclusion, CheXpert Plus is a valuable resource that has the potential to significantly advance research and development in the field of radiology AI. Its large text dataset, paired with high-quality images and patient metadata, offers new opportunities for developing robust, fair, and accurate AI models. By providing this dataset and supporting models, the creators of CheXpert Plus are contributing to the continuous improvement of medical care and the collaboration between AI technology and radiologists. Researchers and practitioners in the field should take advantage of this resource to further advance the field and improve patient outcomes.