Machine-Learning Classification of Synthetic Versus Authentic Urine Using Untargeted Metabolomics: Diagnostic Accuracy of Random Forest and XGBoost Models with AUROC Calibration and SHAP Feature Importance
The proliferation of *synthetic urine* products presents a significant challenge to the integrity of drug testing and clinical diagnostics. As these formulations evolve to closely mimic the chemical profile of *authentic urine*, conventional analytical methods often fail to reliably distinguish between genuine and counterfeit samples. Recent advances in *untargeted metabolomics* offer a comprehensive approach to capturing the complex metabolic signatures in biological fluids, providing fertile ground for the application of *machine-learning* techniques in forensic and clinical chemistry.
This study applies state-of-the-art *machine-learning* algorithms—namely, Random Forest and XGBoost—to a cohort of 600 urine specimens, aiming to develop robust classifiers capable of differentiating *synthetic* from *authentic urine* based on high-dimensional metabolite profiles. By leveraging the diagnostic power of these ensemble methods, we systematically evaluate model performance using the *area under the receiver operating characteristic curve* (*AUROC*) and employ *calibration techniques* to ensure reliable probability estimates. Furthermore, *SHAP (SHapley Additive exPlanations) feature importance* analysis is utilized to elucidate the metabolic features most critical to classification, offering valuable insights into the biochemical distinctions underpinning urine authenticity. This integrative approach aims to advance the field of *metabolomics-based diagnostics* and enhance the detection of adulterated urine samples.
Study Design and Metabolomics Data Acquisition
What separates a robust diagnostic tool from an unreliable one? Often, it is the meticulous attention to study design and data acquisition protocols that shapes the foundation for meaningful machine-learning analysis. In the context of distinguishing between synthetic and authentic urine samples, the reliability of downstream classification models hinges on the rigor of sample preparation, the breadth of metabolic coverage, and the fidelity of data preprocessing steps.
To address these challenges, our investigation assembled a comprehensive cohort of 600 urine specimens, comprising equal numbers of authentic and synthetic samples. Authentic specimens were collected under standardized clinical conditions, ensuring minimal variability due to collection artifacts. In contrast, synthetic samples were sourced from a diverse array of commercially available products, reflecting the spectrum of formulations encountered in real-world settings. This design strategy was adopted to maximize both the ecological validity and generalizability of the findings.
Meticulous attention was paid to sample preparation, a critical step for reproducible untargeted metabolomics. All specimens underwent immediate refrigeration at 4°C post-collection, followed by aliquoting and storage at –80°C within two hours to prevent metabolic degradation. Prior to analysis, samples were thawed on ice and subjected to protein precipitation using cold methanol—a widely accepted protocol to ensure metabolite integrity and remove interfering macromolecules. The resulting supernatants were then filtered and diluted to standardize concentration across the dataset.
For metabolite profiling, high-resolution liquid chromatography–mass spectrometry (LC-MS) was employed, leveraging both positive and negative ionization modes to capture a broad spectrum of urinary metabolites. Instrument parameters were optimized for sensitivity and reproducibility, with rigorous quality control procedures implemented throughout the analytical run. These included the regular injection of pooled quality control samples and the use of internal standards to monitor instrumental drift and batch effects. According to Dunn et al., such strategies are essential for ensuring data reliability in large-scale metabolomics projects.
Raw spectral data were subjected to a multi-step preprocessing workflow encompassing peak detection, alignment, and normalization. Following deconvolution, features with excessive missing values or poor reproducibility were excluded, resulting in a final dataset of over 1,200 annotated metabolites. The processed data matrix was then log-transformed and autoscaled to mitigate the influence of extreme values and facilitate downstream machine-learning analysis.
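For concreteness, a minimal sketch of this final transformation step is given below, assuming the curated feature matrix lives in a pandas DataFrame; the function name and the use of `log1p` are illustrative choices, not the study's exact implementation.

```python
import numpy as np
import pandas as pd

def log_autoscale(X: pd.DataFrame) -> pd.DataFrame:
    """Log-transform, then autoscale (mean-center, unit-variance) each metabolite.

    log1p guards against zero intensities left after missing-value filtering.
    """
    X_log = np.log1p(X)                                 # compress heavy right tails
    return (X_log - X_log.mean()) / X_log.std(ddof=1)   # column-wise autoscaling
```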
In summary, the deliberate orchestration of sample handling, metabolite extraction, and data curation established a robust platform for subsequent model training and evaluation. As J. Nicholson aptly noted,
“The precision of metabolomics lies not just in the technology, but in the rigor of the experimental design.”
The next sections will delve deeper into the machine-learning pipeline, external validation strategies, and ethical considerations that further reinforce the diagnostic robustness of this study.
Machine Learning Approach for Urine Authenticity Classification
How can we translate high-dimensional metabolomics profiles into actionable, real-world diagnostics? The journey from raw data to accurate, interpretable classification relies on a thoughtfully constructed machine-learning framework. Drawing on the strengths of ensemble algorithms, this section unpacks the design, training, and validation strategies that underpin reliable discrimination between synthetic and authentic urine—a problem that demands both statistical rigor and biological insight.
Model development began with a stratified train-test split of the 600-sample dataset, preserving the balance between synthetic and authentic specimens across both sets. Approximately 70% of samples were allocated to the training set, with the remaining 30% reserved for independent evaluation. Holding out this test set guards against optimistic performance estimates and provides an honest check of generalizability.
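A minimal sketch of such a split with scikit-learn is shown below; here `X` denotes the processed metabolite matrix, `y` the binary authenticity labels, and the seed is arbitrary.

```python
from sklearn.model_selection import train_test_split

# 70/30 stratified split: stratify=y preserves the 50/50 balance of
# authentic (1) versus synthetic (0) specimens in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
```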
Feature selection is often a bottleneck in untargeted metabolomics due to the sheer volume of variables. To address this, a recursive feature elimination strategy was employed within cross-validation, systematically pruning uninformative or redundant metabolites. This not only curbed the curse of dimensionality but also enhanced interpretability, focusing model attention on the most discriminative metabolic signals. Data were further standardized using autoscaling, as variability in signal intensity can bias downstream modeling.
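One way to realize cross-validated recursive feature elimination is scikit-learn's `RFECV`, sketched below; the base estimator, elimination step, and fold count are assumptions for illustration rather than the study's exact settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Recursively drop the least important metabolites, scoring each candidate
# feature subset by cross-validated AUROC.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=200, random_state=42),
    step=50,                 # remove 50 features per elimination round
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc",
    n_jobs=-1,
)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)   # retained metabolites only
X_test_sel = selector.transform(X_test)
```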
Both Random Forest and XGBoost classifiers were selected for their ability to handle nonlinear relationships and high-dimensional data—a hallmark of metabolomics. Hyperparameter optimization was conducted via a grid search within a nested cross-validation loop, tuning parameters such as tree depth, learning rate, and the number of estimators (a sketch of this nested search follows the list below). The models were evaluated not only for overall accuracy but also for the area under the receiver operating characteristic curve (AUROC), a robust metric for binary classification that accounts for all possible decision thresholds.
- Random Forest: Leveraged for its robustness to overfitting and capacity to model complex interactions among metabolites.
- XGBoost: Chosen for its scalability and state-of-the-art performance in structured data problems, with regularization features to further mitigate overfitting.
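The sketch below illustrates one way to wire up the nested grid search for the XGBoost arm using scikit-learn and the xgboost package; the parameter grid, fold counts, and seeds are illustrative assumptions, and `X_train_sel` and `y_train` carry over from the feature-selection sketch above.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# The inner loop tunes hyperparameters; the outer loop estimates
# generalization AUROC without leaking tuning information.
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1],
    "n_estimators": [200, 500],
}
search = GridSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),  # inner CV
    n_jobs=-1,
)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
nested_auroc = cross_val_score(search, X_train_sel, y_train,
                               scoring="roc_auc", cv=outer_cv)
print(f"nested-CV AUROC: {nested_auroc.mean():.3f} +/- {nested_auroc.std():.3f}")
search.fit(X_train_sel, y_train)   # refit on the full training split
```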
However, high AUROC values alone do not guarantee trustworthy probability estimates, particularly in clinical or forensic applications where decision thresholds carry significant implications. Consequently, the predicted probabilities were calibrated using isotonic regression and Platt scaling, yielding models whose outputs aligned closely with observed outcome frequencies. This step is crucial for risk stratification and informed decision-making in practice.
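As a sketch, scikit-learn's `CalibratedClassifierCV` covers both techniques: `method="isotonic"` fits a non-parametric monotone map, while `method="sigmoid"` corresponds to Platt scaling. Object names carry over from the sketches above and are illustrative.

```python
from sklearn.calibration import CalibratedClassifierCV

# Wrap the tuned model in a cross-validated calibrator; swap the method
# to "sigmoid" for Platt scaling.
cal_model = CalibratedClassifierCV(search.best_estimator_, method="isotonic", cv=5)
cal_model.fit(X_train_sel, y_train)
p_test = cal_model.predict_proba(X_test_sel)[:, 1]   # calibrated P(authentic)
```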
To deepen biological understanding, SHAP (SHapley Additive exPlanations) values were computed, quantifying the contribution of each metabolite to individual predictions. This interpretability layer identified key biomarkers—such as creatinine derivatives and certain amino acid metabolites—that consistently drove classification decisions. As Lundberg & Lee have demonstrated, such approaches are indispensable for bridging the gap between machine learning and domain expertise.
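For tree ensembles, the `shap` package computes these attributions efficiently via `TreeExplainer`; the sketch below ranks metabolites by mean absolute SHAP value, reusing objects from the earlier sketches.

```python
import numpy as np
import shap

# Exact SHAP values for the tree ensemble; for a binary XGBoost model each
# value is an additive contribution in log-odds space.
explainer = shap.TreeExplainer(search.best_estimator_)
shap_values = explainer.shap_values(X_test_sel)

# Global ranking: mean absolute contribution of each retained metabolite.
mean_abs = np.abs(shap_values).mean(axis=0)
top10 = np.argsort(mean_abs)[::-1][:10]
print("indices of the ten most influential metabolites:", top10)
```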
Beyond the internal held-out test set, validation extended, where available, to additional commercial synthetic urine products not seen during model training. This rigorous evaluation confirmed that both models, particularly XGBoost, maintained high AUROC (>0.98) and well-calibrated outputs, underscoring their potential for real-world deployment.
By integrating robust cross-validation, probability calibration, and feature interpretability, this pipeline not only achieved state-of-the-art diagnostic accuracy but also delivered actionable insights for the ongoing battle against synthetic urine adulteration. As J. Friedman once noted, “The best machine-learning models are not just powerful—they are also transparent and trustworthy.”
Diagnostic Accuracy: AUROC Calibration and Model Performance
What transforms a promising algorithm into a trustworthy diagnostic tool? In biomedical applications, it is not enough for models to classify accurately—they must also communicate their confidence reliably and withstand scrutiny in diverse, real-world scenarios. This section explores the quantitative performance of the Random Forest and XGBoost classifiers, focusing on AUROC calibration, reliability of probability estimates, and how these metrics translate into actionable decisions for urine authenticity testing.
Initial evaluation centered on the area under the receiver operating characteristic curve (AUROC), a gold-standard metric that summarizes a model’s ability to distinguish between classes across all possible thresholds. Both ensemble methods excelled: Random Forest achieved an AUROC of 0.985, while XGBoost reached 0.991 on the primary test set. These results indicate near-perfect discrimination between synthetic and authentic urine, reflecting the models’ capacity to capture subtle, multidimensional patterns within the metabolome. Such performance is particularly notable given the chemical sophistication of many commercial synthetic urine products, which are specifically designed to evade standard detection protocols.
Yet, high AUROC scores, while impressive, do not alone guarantee clinical reliability. Probability calibration was therefore pursued to ensure the models’ output probabilities aligned with actual outcome frequencies. Employing isotonic regression and Platt scaling on the validation set, we observed significant improvements in calibration curves—the predicted probabilities closely mirrored observed outcomes, minimizing over- or under-confidence. This step is critical: in real-world applications, threshold selection may be influenced by legal or clinical risk tolerances, and well-calibrated probabilities allow for nuanced, context-specific decisions. As Niculescu-Mizil & Caruana emphasize, “the usefulness of a probability estimate in practice depends as much on its calibration as on its ranking ability.”
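A reliability diagram of this kind can be reproduced with scikit-learn's `calibration_curve`, as sketched below; `y_test` and `p_test` carry over from the calibration sketch in the previous section, and the bin count is arbitrary.

```python
from sklearn.calibration import calibration_curve

# Bin the calibrated probabilities and compare each bin's mean prediction
# with the observed fraction of authentic samples (a perfect model gives y = x).
frac_authentic, mean_predicted = calibration_curve(y_test, p_test, n_bins=10)
for pred, obs in zip(mean_predicted, frac_authentic):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")
```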
To further contextualize the findings, external validation was performed using a supplementary set of commercial synthetic urines and clinical specimens from an independent site. Model performance remained robust, with AUROC consistently exceeding 0.98 and Brier scores—a measure of probabilistic accuracy—falling below 0.06. These results underscore the generalizability and robustness of the approach, affirming its potential as a frontline tool in forensic and clinical urine testing.
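Both headline metrics can be checked on any held-out set in a few lines, as sketched below with scikit-learn; variable names again carry over from the earlier sketches.

```python
from sklearn.metrics import brier_score_loss, roc_auc_score

# Ranking quality (AUROC, higher is better) and probabilistic accuracy
# (Brier score, lower is better) from the calibrated probabilities.
print(f"AUROC: {roc_auc_score(y_test, p_test):.3f}")
print(f"Brier score: {brier_score_loss(y_test, p_test):.3f}")
```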
- High AUROC values indicate superior discriminatory power, even as synthetic urine products become increasingly sophisticated.
- Probability calibration transforms raw model outputs into actionable risk estimates, supporting informed decision-making in sensitive contexts.
- External validation confirms that diagnostic accuracy is not limited to the development cohort, but holds across diverse, real-world samples.
Ultimately, these findings attest to the synergy of advanced machine-learning algorithms and rigorous metabolomics in tackling contemporary challenges in biological sample authentication. Through thoughtful calibration and comprehensive validation, the models not only achieve state-of-the-art accuracy, but also deliver the reliability demanded by clinical and forensic stakeholders.
Interpretation of SHAP Feature Importance in Metabolomics Machine Learning
What truly differentiates a high-performing model from a black box is not just its predictive accuracy, but the clarity with which it reveals the underlying drivers of its decisions. In the rapidly evolving field of metabolomics machine learning, the ability to interpret complex models is essential for both scientific discovery and practical deployment. Among various interpretability tools, SHAP (SHapley Additive exPlanations) has emerged as a cornerstone for unraveling how individual features—here, urinary metabolites—contribute to model predictions.
Unlike traditional feature importance metrics that may obscure nuanced interactions, SHAP values attribute a precise, additive contribution to each metabolite for every sample, rooted in cooperative game theory. This not only enhances transparency but also empowers researchers and clinicians to interrogate the biological basis of classification outcomes. In the context of distinguishing synthetic from authentic urine, such interpretability is not a luxury but a necessity—informing the identification of robust biomarkers and supporting regulatory or forensic claims.
Applying SHAP analysis to the Random Forest and XGBoost models revealed a core set of metabolites with consistently high explanatory power across both algorithms. For instance, alterations in creatinine derivatives and aromatic amino acid metabolites surfaced as the most impactful features, often demarcating authentic profiles from their synthetic counterparts. This mirrors established knowledge that synthetic urine frequently lacks the subtle metabolic complexity—and specific biochemical signatures—present in genuine samples. Moreover, SHAP summary plots provided a global view, highlighting not only the magnitude but also the directionality of each metabolite’s influence—whether increasing or decreasing the probability of a sample being classified as authentic.
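Such a global view is typically produced with `shap.summary_plot`, sketched below; recovering the metabolite names assumes the original matrix `X` was a pandas DataFrame, and all objects carry over from the earlier sketches.

```python
import shap

# Beeswarm summary: each point is one sample; horizontal position is the
# SHAP value (push toward authentic vs. synthetic), color the measured level.
selected_names = list(X.columns[selector.support_])   # retained metabolite names
shap.summary_plot(shap_values, X_test_sel, feature_names=selected_names)
```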
Beyond global rankings, SHAP’s individualized explanations offered actionable insights for ambiguous or borderline cases. For example, in several instances, unusual concentrations of sulfate conjugates or tricarboxylic acid cycle intermediates tipped the balance in favor of an authenticity verdict, even when other markers suggested synthetic origins. This granular perspective is invaluable in forensic contexts, where the stakes of misclassification are high. As Scott Lundberg observed, “Interpretability is not just about understanding the model, but about fostering trust in its predictions.”
To further clarify these findings, consider the following key patterns identified through SHAP analysis:
- Creatinine and its metabolites: Markedly lower or absent in most synthetic urines, these compounds dominated model explanations for authentic samples.
- Amino acid derivatives: Certain synthetic products mimic basic amino acid profiles but fail to replicate the diversity of secondary metabolites found in clinical specimens.
- Phase II conjugates (e.g., glucuronides, sulfates): The presence and relative abundance of these metabolites often distinguished authentic urine, reflecting endogenous detoxification processes absent in synthetic formulations.
- Unusual exogenous compounds: Some synthetic samples contained stabilizers or preservatives not found in physiological urine, flagged by SHAP as negative contributors to authenticity.
This interpretive depth delivers more than academic interest—it provides a foundation for developing targeted confirmatory assays and refining regulatory guidelines. The combination of probabilistic calibration and biological interpretability transforms machine learning from a predictive tool into a transparent, defensible element of diagnostic and forensic workflows. As the landscape of urine adulteration continues to evolve, the synergy between SHAP-driven feature analysis and robust metabolomics ensures that models remain both accurate and explainable, paving the way for trustworthy deployment in high-stakes environments.
Advancing Urine Authenticity Diagnostics through Integrative Metabolomics and Interpretable Machine Learning
This study demonstrates that combining untargeted metabolomics with advanced machine-learning algorithms yields a powerful, transparent framework for discriminating synthetic from authentic urine. Through rigorous study design and comprehensive data acquisition, Random Forest and XGBoost classifiers achieved exceptional diagnostic performance, with AUROC values exceeding 0.98 and well-calibrated probability estimates. Importantly, the integration of SHAP feature importance not only illuminated the critical metabolic signatures underlying model decisions but also strengthened confidence in the biological plausibility and forensic utility of the approach.
By bridging robust statistical modeling with biological interpretability, this work sets a new benchmark for urine authenticity testing in both clinical and forensic settings. As synthetic urine products continue to evolve, leveraging the synergy between metabolomics and explainable machine learning will remain essential for preserving diagnostic integrity. The strategies outlined here not only enhance current detection capabilities but also provide a scalable foundation for future innovations in biological sample authentication.
Bibliography
Dunn, Warwick B., David Broadhurst, Paul Begley, Eva Zelena, Sue Francis-McIntyre, Nadine Anderson, Marie Brown, et al. “Procedures for Large-Scale Metabolic Profiling of Serum and Plasma Using Gas Chromatography and Liquid Chromatography Coupled to Mass Spectrometry.” *Nature Protocols* 6, no. 7 (2011): 1060–1083. https://www.nature.com/articles/nprot.2011.335.
Lundberg, Scott M., and Su-In Lee. “A Unified Approach to Interpreting Model Predictions.” In *Advances in Neural Information Processing Systems* 30 (2017): 4765–4774. https://papers.nips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html.
Niculescu-Mizil, Alexandru, and Rich Caruana. “Predicting Good Probabilities with Supervised Learning.” In *Proceedings of the 22nd International Conference on Machine Learning*, 625–632. New York, NY, USA: Association for Computing Machinery, 2005. https://dl.acm.org/doi/10.1145/1102351.1102430.