MULTI-SOURCE TRAINING FOR CROSS-DATASET GENERALISATION IN DERMOSCOPIC SKIN CANCER CLASSIFICATION: A 5-FOLD CROSS-VALIDATED STUDY ON HAM10000 AND ISIC 2019 WITH A SOURCE-BALANCED SAMPLER

Muhammad Haroon Ur Rashid; Muhammad Subhan; Dr. Shahid Khan Yusufzai

Keywords:

Dermoscopy, skin cancer, deep learning, ConvNeXt, HAM10000, ISIC 2019, cross-dataset generalisation, cross-validation, bootstrap confidence intervals.

Abstract

Deep learning models for dermoscopic skin cancer classification routinely report accuracies above 90 % on the HAM10000 benchmark, yet their behaviour under realistic cross-domain conditions is rarely measured. This study reports two findings. First, a dual-backbone ConvNeXt–EfficientNet model (66.1 M parameters) trained only on HAM10000 attains a macro-F1 of 0.6905 on the HAM10000 test split but collapses to 0.4301 on the unseen ISIC 2019 archive, a generalisation gap of 0.260 (26.0 percentage points). Second, a single ConvNeXt-Tiny backbone (27.8 M parameters) trained jointly on HAM10000 and ISIC 2019 with a source-balanced weighted sampler, and evaluated under 5-fold lesion-grouped cross-validation with 4-view test-time augmentation and 2 000-resample bootstrap confidence intervals, achieves a pooled macro-F1 of 0.7401 [95 % CI 0.7252, 0.7541] on HAM-test and 0.5976 [95 % CI 0.5817, 0.6128] on ISIC-test. The cross-dataset gap thus narrows from 0.260 to 0.142, a 45.3 % relative reduction, while the parameter count falls by 58 %. Every per-class F1 improves on both datasets, most dramatically for dermatofibroma (df) on ISIC, which rises from 0.12 to 0.40, and for vascular lesions (vasc), which rise from 0.35 to 0.54. The work also surfaces a measurement problem in the recent literature: of ten 2025–2026 studies surveyed, only two report macro-F1 on the 7-class HAM10000 task, only one of those [8] uses the standard supervised protocol, and only one of the ten evaluates true cross-dataset performance. Against the single directly comparable cross-dataset benchmark in the recent literature [8], our model achieves a pooled cross-dataset top-1 accuracy of 69.24 % on the ISIC 2019 test set versus their 56.0 %, a 13.2-point improvement, with approximately 2.7× fewer trainable parameters and under a stricter 5-fold cross-validated protocol with bootstrap confidence intervals.
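
The headline figures in the abstract follow from simple arithmetic on the reported scores. The short Python sketch below recomputes them as a consistency check. It uses only values quoted in the abstract; the variable and function names are illustrative assumptions and are not taken from the study's code.

```python
# Macro-F1 values quoted in the abstract.
ham_only = {"ham_test": 0.6905, "isic_test": 0.4301}  # model trained on HAM10000 only
joint = {"ham_test": 0.7401, "isic_test": 0.5976}     # model trained on HAM10000 + ISIC 2019


def cross_dataset_gap(scores):
    """Macro-F1 on HAM-test minus macro-F1 on ISIC-test."""
    return scores["ham_test"] - scores["isic_test"]


gap_before = cross_dataset_gap(ham_only)
gap_after = cross_dataset_gap(joint)

# Relative reduction of the gap, as a fraction of the original gap.
relative_reduction = (gap_before - gap_after) / gap_before

# Parameter counts in millions, as quoted in the abstract.
params_before_m = 66.1
params_after_m = 27.8
param_reduction = 1.0 - params_after_m / params_before_m

# Top-1 accuracy on the ISIC 2019 test set: this work versus reference [8].
accuracy_gain_points = 69.24 - 56.0

print(f"gap before:          {gap_before:.4f}")
print(f"gap after:           {gap_after:.4f}")
print(f"relative reduction:  {relative_reduction:.1%}")
print(f"parameter reduction: {param_reduction:.1%}")
print(f"accuracy gain:       {accuracy_gain_points:.2f} points")
```

The recomputed values agree with the abstract after rounding: gaps of about 0.260 and 0.142, a relative reduction of about 45.3 %, a parameter reduction of about 58 %, and an accuracy gain of about 13.2 points.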