arXiv:2602.00053v1 Announce Type: new
Abstract: Efficient and scalable deployment of machine learning (ML) models is a prerequisite for modern production environments, particularly within regulated domains such as healthcare and pharmaceuticals. In these settings, systems must balance competing requirements, including minimizing inference latency for real-time clinical decision support, maximizing throughput for batch processing of medical records, and ensuring strict adherence to data privacy standards such as HIPAA. This paper presents a rigorous benchmarking analysis comparing two prominent deployment paradigms: a lightweight, Python-based REST service using FastAPI, and a specialized, high-performance serving engine, NVIDIA Triton Inference Server. Leveraging a reference architecture for healthcare AI, we deployed a DistilBERT sentiment analysis model on Kubernetes to measure median (p50) and tail (p95) latency, as well as throughput, under controlled experimental conditions. Our results indicate a distinct trade-off. While FastAPI provides lower overhead for single-request workloads with a p50 latency of 22 ms, Triton achieves superior scalability through dynamic batching, delivering a throughput of 780 requests per second on a single NVIDIA T4 GPU, nearly double that of the baseline. Furthermore, we evaluate a hybrid architectural approach that utilizes FastAPI as a secure gateway for protected health information de-identification and Triton for backend inference. This study validates the hybrid model as a best practice for enterprise clinical AI and offers a blueprint for secure, high-availability deployments.

Expert Commentary: Efficient and Scalable Deployment of Machine Learning Models in Healthcare

As machine learning technologies continue to revolutionize the healthcare and pharmaceutical industries, the efficient and scalable deployment of ML models becomes increasingly crucial. In regulated domains like healthcare, where data privacy standards such as HIPAA must be strictly observed, choosing the right deployment paradigm is paramount.

This paper presents a benchmarking analysis comparing two popular deployment paradigms: a lightweight, Python-based REST service built on FastAPI, and NVIDIA Triton Inference Server, a specialized high-performance serving engine. The study leverages a reference architecture for healthcare AI, deploying a DistilBERT sentiment analysis model on Kubernetes and measuring median (p50) and tail (p95) latency as well as throughput under controlled experimental conditions.
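To make the first paradigm concrete, here is a minimal sketch of what such a FastAPI service might look like, assuming the model is served through a Hugging Face transformers pipeline. The route name and the checkpoint (distilbert-base-uncased-finetuned-sst-2-english) are illustrative assumptions, not details taken from the paper:

```python
# Minimal FastAPI sketch wrapping a DistilBERT sentiment model.
# Checkpoint and route names are illustrative assumptions.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

class Note(BaseModel):
    text: str

@app.post("/predict")
def predict(note: Note):
    # One forward pass per request; no server-side batching.
    result = classifier(note.text)[0]
    return {"label": result["label"], "score": result["score"]}
```

Because each request triggers exactly one forward pass, per-request overhead stays low, but the GPU is never given the larger batches it needs to reach peak throughput, which is consistent with the trade-off the paper reports.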

The results highlight a clear trade-off between the two paradigms. FastAPI incurs lower overhead for single-request workloads, with a median (p50) latency of 22 ms, while Triton scales better under load: its dynamic batching delivers a throughput of 780 requests per second on a single NVIDIA T4 GPU, nearly double that of the FastAPI baseline.
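Dynamic batching is a server-side feature: Triton transparently coalesces concurrent in-flight requests into larger GPU batches, enabled via the dynamic_batching setting in a model's config.pbtxt. The following is a hedged client-side sketch using the official tritonclient package; the model name (distilbert_sentiment) and tensor names (input_ids, attention_mask, logits) are assumptions for a typical ONNX export, not values from the paper:

```python
# Driving Triton's dynamic batcher from the official Python client.
# Model and tensor names are assumptions for a typical ONNX export.
import numpy as np
import tritonclient.http as httpclient
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
client = httpclient.InferenceServerClient(url="localhost:8000")

def build_inputs(text: str, max_len: int = 128):
    # Fixed-length padding keeps shapes identical across requests,
    # so the server can stack them along the batch dimension.
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="np")
    inputs = []
    for name in ("input_ids", "attention_mask"):
        t = httpclient.InferInput(name, list(enc[name].shape), "INT64")
        t.set_data_from_numpy(enc[name].astype(np.int64))
        inputs.append(t)
    return inputs

# Issue many requests without blocking; Triton coalesces the in-flight
# requests into larger GPU batches (dynamic_batching in config.pbtxt).
pending = [
    client.async_infer("distilbert_sentiment", build_inputs(f"note {i}"))
    for i in range(64)
]
for req in pending:
    logits = req.get_result().as_numpy("logits")
    print(int(logits.argmax(axis=-1)[0]))
```

The key design point is that batching happens in the server, not the client: individual callers still send one request each, and the 780 req/s figure reflects Triton's ability to aggregate those requests before they hit the GPU.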

One notable aspect of the study is its evaluation of a hybrid architecture that uses FastAPI as a secure gateway for de-identifying protected health information (PHI) and Triton for backend inference. This hybrid model combines the strengths of both paradigms, yielding a secure, high-availability solution for enterprise clinical AI.
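A minimal sketch of that gateway pattern follows, under loud assumptions: the regex-based scrubber is a toy stand-in for whatever validated de-identification tool the paper's pipeline uses, and the Triton model and tensor names are the same hypotheticals as in the client sketch above:

```python
# Hybrid pattern: FastAPI gateway de-identifies text, Triton runs inference.
# The regex scrubber is a toy placeholder, not a HIPAA-grade tool.
import re
import numpy as np
import tritonclient.http as httpclient
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
triton = httpclient.InferenceServerClient(url="triton:8000")

# Illustrative PHI patterns only (ID numbers and dates).
PHI_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[ID]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),
]

def deidentify(text: str) -> str:
    for pattern, token in PHI_PATTERNS:
        text = pattern.sub(token, text)
    return text

class Note(BaseModel):
    text: str

@app.post("/predict")
def predict(note: Note):
    clean = deidentify(note.text)  # only scrubbed text leaves the gateway
    enc = tokenizer(clean, padding="max_length", truncation=True,
                    max_length=128, return_tensors="np")
    inputs = []
    for name in ("input_ids", "attention_mask"):
        t = httpclient.InferInput(name, list(enc[name].shape), "INT64")
        t.set_data_from_numpy(enc[name].astype(np.int64))
        inputs.append(t)
    logits = triton.infer("distilbert_sentiment", inputs).as_numpy("logits")
    return {"label": int(logits.argmax())}
```

This split plays to each component's strengths: the Python gateway handles compliance logic where flexibility matters, while the GPU-bound inference stays behind Triton where dynamic batching can do its work.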

Overall, this study underscores the multi-disciplinary nature of deploying ML models in healthcare, requiring expertise in machine learning, software development, infrastructure management, and data privacy regulations. By offering a blueprint for secure and efficient deployments, this research contributes valuable insights for organizations looking to implement ML models in regulated environments.
