AI tasks
The ACOUSLIC-AI challenge is designed to evaluate and benchmark AI models that automate the measurement of fetal abdominal circumference using blind-sweep ultrasound data. Its ultimate aim is to broaden the accessibility of prenatal care in areas with limited resources.
This challenge involves analyzing a series of 2D ultrasound frames extracted from blind-sweep sequences acquired by novice operators. Participants are tasked with identifying the most suitable frame for measuring the fetal abdominal circumference. Along with selecting this optimal frame, participants must also provide the binary segmentation mask of the abdomen on the ultrasound image corresponding to the selected frame. During evaluation, an ellipse fitting tool is applied to this mask to measure the circumference, so the mask must be suitable for ellipse fitting.
In particular, the algorithm should provide the following two outputs:
- Fetal abdomen segmentation mask: A 2D numpy array of type np.uint8, matching the dimensions of the input images (744x562 pixels) with a pixel spacing of 0.28 mm. An ellipse fitting tool will be used to fit an ellipse to the segmentation mask, so participants should use proper post-processing to ensure their mask is suitable for this (e.g. remove disconnected components that are not part of the intended segmentation). The circumference of this ellipse will then be calculated and compared to a reference measurement.
- Fetal frame number: An integer representing the frame number where the segmentation was identified, or -1 if no relevant frame was identified. Please note that indexing in the evaluation software begins at 0, so valid frame numbers lie in the half-open interval [0, 840), that is, 0 to 839 inclusive, with -1 indicating that no relevant frame was found.
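The post-processing suggested above (removing disconnected components) and the frame-number convention can be sketched as follows. This is a minimal illustration, not the challenge's reference code: the function names are hypothetical, 4-connectivity is an assumption, and keeping only the single largest component is one possible choice of post-processing. The foreground value of 1 follows the binary convention described in the evaluation section.

```python
from collections import deque

import numpy as np

NUM_FRAMES = 840  # valid frame numbers are 0..839, per the range [0, 840)
NO_FRAME = -1     # sentinel meaning "no relevant frame found"


def keep_largest_component(mask):
    """Return a uint8 mask (values 0/1) with only the largest 4-connected blob."""
    binary = np.asarray(mask) > 0
    rows, cols = binary.shape
    visited = np.zeros(binary.shape, dtype=bool)
    best = []
    for r, c in zip(*np.nonzero(binary)):
        if visited[r, c]:
            continue
        # Breadth-first flood fill starting from this unvisited foreground pixel.
        queue = deque([(int(r), int(c))])
        visited[r, c] = True
        component = []
        while queue:
            y, x = queue.popleft()
            component.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < rows and 0 <= nx < cols
                        and binary[ny, nx] and not visited[ny, nx]):
                    visited[ny, nx] = True
                    queue.append((ny, nx))
        if len(component) > len(best):
            best = component
    out = np.zeros(binary.shape, dtype=np.uint8)
    if best:
        ys, xs = zip(*best)
        out[list(ys), list(xs)] = 1
    return out


def is_valid_frame_number(frame):
    """True for -1 (no frame) or any index in [0, 840)."""
    return frame == NO_FRAME or 0 <= frame < NUM_FRAMES
```

The flood fill is written in plain Python for clarity; on full-size 744x562 masks a library routine for connected-component labelling would be considerably faster.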
Evaluation
Performance metrics
The ACOUSLIC-AI challenge employs a comprehensive suite of metrics to evaluate the performance of the participating algorithms:
Soft Dice Similarity Coefficient (DSC_soft): This metric quantifies the spatial overlap accuracy of the algorithm's segmentation against the ground truth mask. A higher DSC indicates a closer match to the ground truth and thus a better segmentation performance.
In this implementation, the DSC is computed on binary masks. The ground truth mask (if available) corresponds to the fetal abdomen in the specified frame of the fetal abdomen stack. Both the ground truth and predicted masks are first converted to binary form (1 for the fetal abdomen, 0 for the background) before calculation.
If no annotation is found in the same frame as the predicted mask, the DSC is computed against the nearest annotated frame within the same sweep, up to a maximum distance (max_frame_tolerance) of 15 frames. The DSC is then adjusted by a coefficient based on the distance between the current frame and the nearest annotated frame. This coefficient scales the DSC down as the distance between the predicted and nearest annotated frames grows, acknowledging that predictions made further from the annotated frames might be less accurate.
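The binarization, overlap computation, and nearest-frame lookup described above might look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, the tolerance is treated as inclusive (a distance of exactly 15 frames is accepted), and the distance-based coefficient is deliberately left out because its exact form is not given in the text. The behaviour for two empty masks is also not specified, so the sketch returns None in that case.

```python
import numpy as np

MAX_FRAME_TOLERANCE = 15  # from the text: maximum distance in frames


def binary_dice(pred_mask, gt_mask):
    """Dice overlap of two masks after converting both to binary (0/1)."""
    pred = np.asarray(pred_mask) > 0
    gt = np.asarray(gt_mask) > 0
    total = int(pred.sum()) + int(gt.sum())
    if total == 0:
        return None  # both masks empty: handling not specified in the text
    intersection = int(np.logical_and(pred, gt).sum())
    return 2.0 * intersection / total


def nearest_annotated_frame(pred_frame, annotated_frames,
                            tolerance=MAX_FRAME_TOLERANCE):
    """Closest annotated frame within the tolerance, or None if there is none.

    annotated_frames is assumed to hold frame indices from the same sweep.
    """
    candidates = [f for f in annotated_frames
                  if abs(f - pred_frame) <= tolerance]
    if not candidates:
        return None
    return min(candidates, key=lambda f: abs(f - pred_frame))
```

For example, with annotations at frames 80, 110 and 130, a prediction at frame 100 would be compared against frame 110; the resulting Dice value would then be scaled down by the distance-dependent coefficient used in the evaluation software.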
Weighted Frame Selection Score (WFSS): WFSS evaluates the algorithm's frame selection accuracy, assigning higher scores when the selected frame is clinically relevant. A score of 1 is given for selecting an optimal plane, 0.6 for selecting a suboptimal plane when an optimal one is available, and 0 for selecting an irrelevant frame when optimal or suboptimal ones are present.
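The three scoring rules stated for WFSS can be written down directly. The sketch below is illustrative only: the label strings and the function name are hypothetical, and any situation the text does not describe (for example, a suboptimal selection when no optimal plane exists, or a sweep with no relevant frames at all) raises an error rather than guessing a score.

```python
def wfss(selected, available):
    """Score one frame selection using only the rules stated in the text.

    selected:  label of the chosen frame: "optimal", "suboptimal" or "irrelevant"
    available: collection of labels that occur anywhere in the case
    """
    available = set(available)
    if selected == "optimal":
        return 1.0
    if selected == "suboptimal" and "optimal" in available:
        return 0.6
    if selected == "irrelevant" and available & {"optimal", "suboptimal"}:
        return 0.0
    # Remaining cases are not described in the text, so no score is invented.
    raise ValueError("scoring rule for this case is not specified")
```

The evaluation software defines the remaining cases; consult the public repository for the complete rule set.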
Hausdorff Distance (HD): This metric measures the maximum distance between the algorithm's predicted boundary and the actual ground truth boundary, providing a sense of the largest potential error in the segmentation boundary prediction. The HD is calculated between the predicted and ground truth masks in their binary forms, considering only the pixels within the field of view of the ultrasound beam. If the ground truth annotation is not available in the same frame as the predicted mask, the nearest annotated frame within the same sweep and a maximum distance (max_frame_tolerance) of 15 frames is used.
Special cases:
- If the ground truth mask is not available in the same frame as the predicted mask, the nearest annotated frame is used. The HD is then scaled by a coefficient that accounts for the increased potential error due to the distance between the current and nearest annotated frames.
- If no nearest annotated frame is available, or if the predicted mask is empty, the HD is set to the maximum possible value, defined as the maximum sweep width (744) scaled by the frame tolerance.
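For intuition, a symmetric Hausdorff distance between the boundaries of two binary masks can be computed as below. This is a brute-force sketch, not the evaluation code: the function names are hypothetical, the result is in pixel units, boundary pixels are defined by 4-connectivity, and neither the field-of-view restriction nor the distance-based scaling and maximum-value fallback from the special cases is implemented. An empty mask returns None so the caller can apply the fallback.

```python
import numpy as np


def boundary_points(mask):
    """(row, col) coordinates of foreground pixels touching the background."""
    m = np.asarray(mask) > 0
    padded = np.pad(m, 1, constant_values=False)
    # A pixel is interior when all four direct neighbours are foreground.
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    return np.argwhere(m & ~interior)


def hausdorff_distance(pred_mask, gt_mask):
    """Symmetric Hausdorff distance between mask boundaries, in pixels."""
    a = boundary_points(pred_mask).astype(float)
    b = boundary_points(gt_mask).astype(float)
    if len(a) == 0 or len(b) == 0:
        return None  # empty mask: caller applies the maximum-value rule
    # Pairwise Euclidean distances between every boundary point of a and b.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Largest nearest-neighbour distance, taken in both directions.
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```

The pairwise distance matrix grows with the product of the two boundary lengths, which is acceptable for single abdominal contours but would be wasteful for large or fragmented masks.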
Normalized Absolute Error (NAE): The normalized absolute error for abdominal circumference measurements provides a scale-independent measure of the precision of the abdominal circumference estimate. It is calculated by taking the absolute difference between the ground truth and the predicted circumference, normalized by the maximum of the two values to account for scale:
$$\mathrm{NAE}_{AC} = \frac{\lvert AC_{gt} - AC_{pred} \rvert}{\max(AC_{gt}, AC_{pred}) + \epsilon}$$
Where:
- $\mathrm{NAE}_{AC}$ is the Normalized Absolute Error for Abdominal Circumference.
- $AC_{gt}$ is the ground truth Abdominal Circumference measurement (if present) in the sweep corresponding to the algorithm's selected frame.
- $AC_{pred}$ is the algorithm's predicted Abdominal Circumference measurement.
- $\epsilon$ is a small constant to prevent division by zero, set to 1e-6.
A lower NAE indicates a higher accuracy in predicting the AC measurements from the segmented masks, which is crucial for clinical applicability.
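The NAE computation described above is small enough to write out. In this sketch the function name is hypothetical, and the small constant is assumed to be added to the denominator, which is the natural reading of "to prevent division by zero".

```python
EPSILON = 1e-6  # small constant from the text


def normalized_absolute_error(ac_gt, ac_pred, eps=EPSILON):
    """|AC_gt - AC_pred| divided by the larger of the two values (plus eps)."""
    return abs(ac_gt - ac_pred) / (max(ac_gt, ac_pred) + eps)
```

With a reference circumference of 300 mm and a prediction of 270 mm, the error is 30 divided by roughly 300, so about 0.1. Because the denominator is the larger of the two values, the result stays between 0 and 1 for non-negative circumferences.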
Note: The predicted abdominal circumference used to compute this metric is measured using the fit_ellipses function in the ellipse fitting tool provided in this repository. Ellipses extending beyond the field of view of the ultrasound beam are extrapolated using the contour points contained within the FOV.
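To illustrate how a fitted ellipse translates into a circumference in millimetres, the sketch below uses Ramanujan's well-known approximation for the perimeter of an ellipse together with the 0.28 mm pixel spacing given in the task description. Whether the fit_ellipses function uses this particular formula is an assumption; the function name and parameters here are hypothetical and are not part of the ellipse fitting tool.

```python
import math

PIXEL_SPACING_MM = 0.28  # isotropic pixel spacing from the task description


def ellipse_circumference_mm(semi_major_px, semi_minor_px,
                             spacing_mm=PIXEL_SPACING_MM):
    """Approximate ellipse perimeter in mm from semi-axes given in pixels."""
    a = semi_major_px * spacing_mm
    b = semi_minor_px * spacing_mm
    # Ramanujan's first approximation; exact when a == b (a circle).
    return math.pi * (3 * (a + b) - math.sqrt((3 * a + b) * (a + 3 * b)))
```

For a circle the expression reduces to 2πr, which gives a quick sanity check: semi-axes of 100 pixels each correspond to a radius of 28 mm.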
The combined use of these metrics allows for a balanced evaluation of the algorithms, not only in terms of their segmentation accuracy but also their practical utility in a clinical setting. The evaluation software for this challenge is publicly available in the ACOUSLIC-AI-evaluation-method GitHub repository.
Ranking method
The performance rank for all submitted algorithms is determined based on the following composite score:
The weight assignment prioritizes the accuracy of fetal abdominal circumference measurements (NAE) as the most critical factor, underscoring the importance of precise clinical measurement. Following this, equal importance is assigned to the clinical relevance of the frame selection (WFSS), which ensures the selection of the most appropriate planes for assessment, and the accuracy of the delineated fetal abdomen masks (DSC_soft). These metrics mirror the steps that an expert would take to provide an abdominal circumference measurement for a specific case. While the geometric accuracy of boundary delineation (HD) is not included in the ranking, it is provided as an additional metric for comparing algorithm performance.
The results of the ACOUSLIC-AI challenge will be published in a journal article. Teams whose algorithms rank in the top three on the Final Leaderboard (Final Test Phase) will be invited to co-author the paper, with a limit of three members per team.