Clinical Research
Accuracy of machine learning models using ultrasound images in prostate cancer diagnosis: a systematic review
pISSN: 0853-1773 • eISSN: 2252-8083
https://doi.org/10.13181/mji.oa.236765 Med J Indones. 2023;32:112–21
Received: January 31, 2023
Accepted: September 11, 2023
Authors' affiliation:
1Department of Urology, Faculty of Medicine, Universitas Indonesia, Cipto Mangunkusumo Hospital, Jakarta, Indonesia,
2Urology Medical Staff Group, Universitas Indonesia, Universitas Indonesia Hospital, Depok, Indonesia
Corresponding author:
Agus Rizal Ardy Hariandy Hamid
Department of Urology, Faculty of Medicine, Universitas Indonesia, Cipto Mangunkusumo Hospital,
Jalan Salemba Raya No. 6, Central Jakarta 10430, DKI Jakarta, Indonesia
Telp/Fax: +62-21-3912477
E-mail: rizalhamid.urology@gmail.com
Background
Many machine learning (ML) models developed for prostate cancer (PCa) diagnosis using ultrasound images show good accuracy. This study aimed to analyze the accuracy of neural network ML models in PCa diagnosis using ultrasound images.
Methods
The protocol was registered with PROSPERO (registration number CRD42021277309). Three reviewers independently conducted a literature search in 5 online databases (PubMed, EBSCO, ProQuest, ScienceDirect, and Scopus). We included all cohort, case-control, and cross-sectional studies in English that used neural network ML models for PCa diagnosis in humans. Conference/review articles and studies that combined the examination with magnetic resonance imaging or had no diagnostic parameters were excluded.
Results
Of 391 titles and abstracts screened, 9 articles relevant to the study were included. Risk of bias analysis was conducted using the QUADAS-2 tool. Of the 9 articles, 5 used artificial neural networks, 1 used deep learning, 1 used recurrent neural networks, and 2 used convolutional neural networks. The included articles showed a varied area under the curve (AUC) of 0.76–0.98. Factors affecting the accuracy of artificial intelligence (AI) were the AI model, mode and type of transrectal sonography, Gleason grading, and prostate-specific antigen level.
Conclusions
The accuracy of neural network ML models in PCa diagnosis using ultrasound images was relatively high, with an AUC value above 0.7. Thus, this modality is promising for PCa diagnosis: it can provide instant information for further workup and help doctors decide whether to perform a prostate biopsy.
Keywords
artificial intelligence, machine learning, neural network model, prostate cancer, ultrasonography
Prostate cancer (PCa) is the third most common cancer globally and the second most common in men.1 It significantly affects male health, and early detection facilitates curative treatment and reduces disease morbidity and mortality.2,3
Ultrasonography has potential for PCa imaging because it is cost-effective, practical, and widely available.4 However, standard transrectal ultrasound (TRUS) alone is not reliable due to its low sensitivity and specificity in detecting PCa.5 The current gold standard for PCa detection is a prostate biopsy performed under TRUS guidance.2,3,6,7 While ultrasonography is widely available, TRUS can be less comfortable for patients than the transabdominal approach. Even the best instruments currently available have limited accuracy, so more accurate diagnostic tools are required for effective detection. Technological advancements, such as artificial intelligence (AI), may help overcome these challenges.8,9
AI is a revolutionary technology in the healthcare field that is gaining interest. Neural networks, such as artificial neural networks (ANNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs), are machine learning (ML) models that mimic human biological neurons. For PCa, AI has been shown to aid in standardized pathological grading to guide cancer stratification and treatment. Nitta et al10 and Djavan et al11 applied ML models to predict PCa based on prostate-specific antigen (PSA) concentrations. ML tended to be superior to conventional methods, with a region-wise area under the receiver operating characteristic curve (ROC-AUC) value ranging from 0.63 to 0.91.
The accuracy of ML based on data from ultrasonography as the primary modality has been debated. Thus, this review aimed to analyze the accuracy of neural networks trained on ultrasound images for PCa diagnosis.
METHODS
Protocol registration
The protocol for this systematic review was registered with PROSPERO registration number CRD42021277309.
Search strategy
Three reviewers (RCS, CA, and FH) independently conducted a literature search of five online databases on January 13, 2023. The databases were PubMed, EBSCO, ProQuest, ScienceDirect, and Scopus. The following keywords with various combinations were used: “Prostate Cancer,” “Machine Learning OR Neural Network,” “Diagnosis,” and “Ultrasonography” (Figure 1). The reference lists of the articles retrieved from the literature search were also reviewed to identify other relevant studies.
Study selection and data extraction
All articles that used ultrasound images to demonstrate the application of ML to the diagnosis of PCa were included. The literature search was limited to publications in English without regard to the publication date. A study was considered eligible if it met the inclusion criteria: human participants, neural network ML models, and prostate biopsy as the criterion for diagnosis. Cohort, case-control, and cross-sectional studies were included. Conference or review articles and studies that involved a combined examination with magnetic resonance imaging (MRI) or had no diagnostic parameters were excluded. Three reviewers (RCS, CA, and FH) independently reviewed the titles and abstracts of the selected studies. Disagreements were resolved through discussions with senior reviewers until a consensus was reached. All authors agreed with the final list of papers selected for extraction. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses flow diagram was used to assist in selecting the articles.
The data extracted from the included articles were tabulated to summarize the outcomes. The data collection points included the number of samples and participants, ultrasound modes, ML methods, system specifications, software tools, programming languages, ML input data, ML outcomes, and diagnostic performance. The primary outcome was the accuracy of neural network ML models for PCa diagnosis. Additionally, the neural network models were compared with other ML models; we compared their available diagnostic performance data, including sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and ROC-AUC. The receiver operating characteristic (ROC) curve is a graph showing the performance of a classification model at all classification thresholds. The area under the curve (AUC) is the probability that a classifier ranks a randomly selected positive example more highly than a randomly selected negative example. An AUC of 0.5 indicates an inability to distinguish between patients with and without the disease or condition, 0.7–0.8 is considered acceptable, 0.8–0.9 is considered excellent, and >0.9 is considered outstanding.
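The diagnostic performance measures named above can be illustrated with a minimal Python sketch. It is not code from any included study: the labels and scores are invented toy values, the function names are hypothetical, and the handling of the exact band boundaries (for example, whether 0.8 counts as acceptable or excellent) is an assumption, because the text does not specify it.

```python
from itertools import product


def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, and NPV from binary labels (1 = PCa on biopsy)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


def auc_band(auc):
    """Map an AUC to the qualitative bands quoted in the text (boundaries assumed)."""
    if auc > 0.9:
        return "outstanding"
    if auc >= 0.8:
        return "excellent"
    if auc >= 0.7:
        return "acceptable"
    return "below 0.7 (no band given in the text)"


# Toy data only: 1 = biopsy-confirmed PCa, 0 = benign; scores are model outputs.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold of 0.5 is an assumption

auc = roc_auc(y_true, scores)
print(auc, auc_band(auc))                 # 0.9375 outstanding
print(confusion_metrics(y_true, y_pred))  # all four metrics equal 0.75 here
```

Sensitivity, specificity, PPV, and NPV depend on the chosen threshold, whereas the AUC summarizes ranking performance across all thresholds, which may be one reason most included studies reported it.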
Risk of bias assessment
The methodological quality of the research was independently evaluated by three reviewers (RCS, CA, and FH) using the QUADAS-2 tool in the Review Manager software version 5.4 (Cochrane, United Kingdom) for Mac. The reviewers were not blinded to the identities of the authors of the articles, journals, and publishers. Based on the questions in the QUADAS-2 tool, the risks of bias were categorized as high, unclear, and low.
RESULTS
Of the 391 retrieved articles, only 9 met the inclusion criteria (Figure 1). The quality assessment of the included articles using the QUADAS-2 tool is shown in Table 1. Several included articles had an unclear or high risk of bias. An unclear risk of bias was common for the index test domain because the index test threshold was not clearly reported. Meanwhile, a high risk of bias was also common because the interpretation was limited to standard results in several articles.12–14
Figure 1. PRISMA flow diagram for the current study (a total of 391 articles obtained). MRI=magnetic resonance imaging; PRISMA=Preferred Reporting Items for Systematic Reviews and Meta-Analyses
Table 1. Risk of bias assessment using the QUADAS-2 tool
The characteristics of each study are presented in Table 2.12–20 Five studies used an ANN, one used deep learning (DL), one used an RNN, and two used a CNN. All nine included studies had a cross-sectional design. All studies examined adult males with an unknown age range owing to unclear data. The sample sizes ranged from 48 to 1,151 patients; however, the studies by Ronco and Fernandez12 and Akatsuka et al13 only provided the number of cases. Five studies used TRUS data only for the input parameters, whereas the others used a combination of input data from clinical findings. The studies reported various accuracy parameters, including AUC, PPV, NPV, sensitivity, and specificity (Table 2). However, Loch et al14 only used percentages. The performance results are presented in Table 2. Due to the varied parameters, a quantitative analysis could not be performed. Most of the studies used the AUC as an accuracy parameter. The AUC values of all the studies were greater than 0.7, ranging from 0.75 to 0.98.
Table 2. Characteristics and performance result of included studies
DISCUSSION
Based on the included studies, the overall accuracy of ML showed promising results. The AUC values of nine studies were greater than 0.7, ranging from 0.75 to 0.98. Wildeboer et al18 assessed a potential DL model based on TRUS B-mode US, shear-wave elastography (SWE), and dynamic contrast-enhanced ultrasound (DCE-US). The multiparametric classifier showed an AUC of 0.90 compared with 0.75 for the best-performing individual parameters for PCa and Gleason scores >3+4 significant PCa. This study revealed that combinations of the available modes were favored over a single mode. Lee et al15 evaluated the accuracies of multiple logistic regression, ANN, and support vector machine (SVM) models in predicting the prostate biopsy outcomes of 684 patients (214 were confirmed to have PCa). The models were developed using the following input data: age, digital rectal examination (DRE) findings, PSA parameters, and TRUS findings. This study showed that image-based clinical decision support systems (ANN and SVM) were more accurate than multiple logistic regression models. They evaluated the diagnostic performance of the ANN model with and without TRUS data. The ANN model used the primary input data of age, PSA levels, and DRE findings. However, with additional TRUS data, the ANN model showed better accuracy and a higher AUC value than without TRUS data. Azizi et al17 proposed the temporal modeling of temporal enhanced ultrasound (TeUS) using an RNN to improve cancer detection accuracy. The TeUS data were acquired from 157 patients during fusion prostate biopsy. The model achieved an AUC value of 0.96. Hassan et al19 demonstrated a higher accuracy (0.99) with a CNN (VGG-16) than with other algorithms (Gradient Boosting, SVM, and Random Forest). Akatsuka et al13 reported an AUC of 0.835 for CNN combined with an SVM built on clinical data and TRUS images. This was higher than the AUC for the SVM based on only clinical data. 
A recent study by Lorusso et al20 demonstrated increasing sensitivity and NPV of the ANN method using TRUS images for higher grades of PCa.
Several factors influence the accuracies of models, including the AI model, TRUS modes, amount of input data, Gleason grading, and PSA concentrations. Based on the analysis of each AI model (Table 4), two included studies highlighted the superior diagnostic performance of the neural network model over that of other models.13,20 ANN and CNN outperformed the other neural network models in terms of diagnostic performance.14,15,19 TRUS modes are substantially related to the accuracy, with DCE-US/SWE/TeUS improving the visualization and distinction of prostate tissues over the B-mode. The amount of input data is also important for reliable predictions by ANN models. More complicated data will result in a more accurate diagnosis.21,22 According to Lee et al,16 Wildeboer et al,18 and Akatsuka et al,13 adding more complicated data increases the AUC, corresponding to better accuracy. Wildeboer et al18 discovered a significant association between Gleason scores of >3+4 and accuracy of DL, but not in Gleason scores of 3+3 or 3+4. This could be due to a bias in patient selection; tumors with scores of 3+3 were disproportionately large for the doctors and were excluded from the study. According to Lee et al,16 the AUC of ANN models was consistently higher for PSA concentrations greater than 10 ng/mL. This could be related to the serum PSA concentrations, corresponding to cancer extent and histological grade.23 As a result, TRUS alone is insufficient for detecting PCa. However, TRUS data and its combinations with other pertinent input data can be used for ML. Despite their benefits, neural networks using ultrasound images have drawbacks that can be improved, such as the need for a large dataset for training.24 Furthermore, the quality of scans, sample collection procedures, and human interpretation errors differ with datasets, making it impossible to create a gold standard.24,25
Reading ultrasound images requires several years of experience and training. ML has been introduced to medical imaging to address these constraints, speed up ultrasound image analysis, and generate objective disease classification.21 ML applications have advanced rapidly, thus reducing the time required to interpret a large amount of data and draw conclusions.26 ML is an AI subfield in which computer algorithms learn connections between data instances for predictions.22 As previously noted, ultrasound images are analyzed using various techniques such as classification, regression, registration, and segmentation. However, neural network techniques have been found to outperform other classifiers.23 Neural networks function similarly to the human brain and can overcome the limitations of conventional ML. They can combine additional variables and produce outcomes for more complex scenarios.23 A neural network can combine input data from many variables to classify patients with PCa.
As shown in Table 3, the algorithms used to build ML have several advantages and disadvantages. Regardless of their differences, CNNs and ANNs are important in the ML field.26,27 ANNs comprise multiple layers of interconnected artificial neurons activated by activation functions. Like traditional ML algorithms, a neural network learns specific values (weights) during training.28 Other prominent ML models, such as SVM, work by mapping the input into a higher-dimensional space to differentiate the classes.29 To assess whether the data meet the criteria, the decision tree (DT) employs several decision logics that act similarly to flowcharts. A Random Forest joins numerous DTs to reduce the overfitting tendency of a single DT.30
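The description of an ANN as layers of interconnected neurons passed through activation functions can be sketched in a few lines of Python. This is an illustrative toy, not the architecture of any included study: the weights are invented and untrained, the three input features are hypothetical stand-ins for scaled clinical or TRUS-derived values, and the choice of ReLU and sigmoid activations is an assumption.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive values, zeroes out negatives."""
    return max(0.0, x)


def sigmoid(x):
    """Squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each neuron."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def forward(features, layers):
    """Pass the features through each layer in turn and return the final outputs."""
    outputs = features
    for weights, biases, activation in layers:
        outputs = dense(outputs, weights, biases, activation)
    return outputs


# Hypothetical, untrained weights: 3 inputs -> 2 hidden neurons -> 1 output.
# In a real model these values are what training would learn.
layers = [
    ([[0.8, 1.2, 0.5], [-0.4, 0.9, 1.1]], [0.0, -0.2], relu),
    ([[1.0, 0.7]], [-1.5], sigmoid),
]

# Hypothetical scaled inputs, e.g. age, PSA, and a binary TRUS finding.
features = [0.6, 0.4, 1.0]

probability = forward(features, layers)[0]
print(f"Illustrative model output: {probability:.3f}")
```

The sigmoid on the final layer keeps the output between 0 and 1, so it can be read as a score and thresholded or ranked to obtain the diagnostic performance measures described in the Methods.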
Table 3. Comparison of advantages and disadvantages of several ML models
The ML field is advancing rapidly, with corresponding hardware and software advancements. DL has advanced significantly in recent years, owing to the abundance of data and support from graphics processing unit hardware acceleration. Various DL libraries, including PyTorch, Keras, TensorFlow, Theano, and Caffe, are currently available. Neural network fusion was recently developed to increase accuracy.31 The utilization of ML with TRUS data could play a role as a diagnostic modality, especially when MRI is unavailable. Based on current guidelines, T2-weighted imaging remains the most useful method for local MRI.32 However, a meta-analysis by de Rooij et al33 showed that MRI had high specificity but poor sensitivity for local PCa staging. Its sensitivities and specificities for extracapsular extension, seminal vesicle invasion, and overall stage T3 detection were 0.57 (95% confidence interval [CI] = 0.49–0.64) and 0.91 (95% CI = 0.88–0.93), 0.58 (95% CI = 0.47–0.68) and 0.96 (95% CI = 0.95–0.97), and 0.61 (95% CI = 0.54–0.67) and 0.88 (95% CI = 0.85–0.91), respectively. Our findings showed that ML based on TRUS and other relevant data can improve diagnostic performance. Thus, it will become more affordable and easier to diagnose PCa without MRI. Furthermore, ML based on TRUS data can be implemented in combination with MRI for prostate biopsy and intraoperative mapping before robotic surgery. This will allow the surgeon to visualize suspected lesions on the instrument display during the procedure.
To date, no study has analyzed the cost-effectiveness of ML for PCa diagnosis. For severe cases of PCa, AI is used to reduce the processing time and facilitate early detection, resulting in a superior prognosis. Additionally, reducing the quantity of human labor enables the service to be provided at a reduced price compared with multiparametric MRI.34 A systematic review by Khanna et al35 reported that AI models demonstrated significant cost savings for medical diagnosis and treatment, and this is applicable to PCa diagnosis.
The present study had some limitations. The major limitations were the low to moderate quality of the included studies and the small sample of articles. The literature search was restricted to studies written in English, and some articles in other languages might have been missed. None of the studies used the same output parameters to generate a quantitative analysis. Additionally, most studies did not blind the diagnosis when testing the ML models, which might have resulted in bias. The approximate AUC and sensitivity values of the ML models in this study were not high and might have led to missed PCa cases among the patients. Further advancements in ML will continue to improve diagnostic accuracy.
In conclusion, the accuracy of the neural network models for PCa diagnosis using ultrasound images was relatively high, with AUCs greater than 0.7. Neural network models are promising for PCa diagnosis and can provide instant information for further workup with relatively high accuracy. Image-based ML models can help doctors decide on proceeding with or deferring a prostate biopsy. Further development of AI will be beneficial for diagnosis, treatment evaluation, and predicting patient prognosis. Future studies should investigate and compare the diagnostic performance of neural networks based on ultrasound images and MRI for PCa.
A preprint of this manuscript has previously been published (https://www.medrxiv.org/content/10.1101/2022.02.03.22270377v1).
Conflict of Interest
Agus Rizal Ardy Hariandy Hamid is the editor-in-chief of this journal but was not involved in the review or decision-making process of the article.
Acknowledgment
Technical assistance and critical advice were provided by the staff of the Department of Urology, Cipto Mangunkusumo Hospital.
Funding Sources
None.
REFERENCES
Copyright © 2023 Authors. This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original author and source are properly cited. For commercial use of this work, please see our terms at http://mji.ui.ac.id/journal/index.php/mji/copyright.