The team at Cedars-Sinai’s Smidt Heart Institute has become one of the most prolific health system-based cardiovascular AI developers, and it just added to that resume by unveiling EchoCLIP, a massive new foundation model that combines echo images and report text to perform a wide range of interpretation tasks without task-specific training.
- Foundation models are a type of generative AI trained on vast amounts of unlabeled data, allowing them to perform a far wider range of clinical tasks than most current healthcare AI tools.
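For readers curious about the mechanics, here’s a minimal sketch of the CLIP-style contrastive objective that image+text foundation models like EchoCLIP are built on: embeddings of an exam’s video and its report are pulled together when they come from the same study and pushed apart otherwise. The function below is an illustrative assumption about the training setup, not the team’s actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(video_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (echo video, report) embeddings.

    video_emb, text_emb: (batch, dim) outputs of the two encoders,
    where row i of each tensor comes from the same exam.
    """
    # Normalize so dot products become cosine similarities
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs
    logits = video_emb @ text_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Pull matched pairs together, push mismatched pairs apart, in both directions
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return (loss_v2t + loss_t2v) / 2
```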
To develop this beast of an echo AI model, the team assembled a dataset of 1,032,975 cardiac ultrasound videos and corresponding text-based expert interpretations.
Even without task-specific training or fine-tuning, EchoCLIP performed well across a wide range of measurement and detection tasks when tested against external data (see the zero-shot sketch after this list), including:
- Assessing cardiac function by predicting LVEF (7.1% mean absolute error)
- Estimating pulmonary artery pressure (10.8 mm Hg mean absolute error)
- Identifying implanted intracardiac devices, such as mitral valve repair devices, TAVR valves, and pacemaker/defibrillator leads (AUCs = 0.97, 0.92, 0.84)
- Detecting abnormal cardiac chamber enlargement, such as severe dilation of the right ventricle, right atrium, left ventricle, and left atrium (AUCs = 0.92, 0.97, 0.92, 0.91)
- Assessing tamponade and severe left ventricular hypertrophy (AUCs = 0.96 and 0.82)
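The “no task-specific training” part works the way zero-shot CLIP classification generally does: each candidate finding is written out as a text prompt, and the model picks whichever prompt’s embedding sits closest to the exam’s embedding. A hedged sketch, assuming precomputed embeddings (the prompt wording and helper below are hypothetical, not EchoCLIP’s actual interface):

```python
import torch
import torch.nn.functional as F

# Hypothetical candidate prompts for one detection task (pacemaker/defibrillator leads)
PROMPTS = [
    "echocardiogram showing a pacemaker or defibrillator lead",
    "echocardiogram with no intracardiac device",
]

def zero_shot_detect(video_emb: torch.Tensor, prompt_embs: torch.Tensor) -> int:
    """Return the index of the prompt whose embedding best matches the exam.

    video_emb: (dim,) embedding of one echo study.
    prompt_embs: (num_prompts, dim) embeddings of the candidate text prompts.
    """
    sims = F.cosine_similarity(video_emb.unsqueeze(0), prompt_embs, dim=-1)
    return int(sims.argmax())
```

Continuous measurements like LVEF can, in principle, be estimated the same way by sweeping prompts across candidate values and taking the best match.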
Perhaps more impressively, the team’s related EchoCLIP-R system was able to accurately identify specific patients from their exams alone and retrieve their prior exams (AUC = 0.86), while highlighting clinically important changes that occurred between their echos.
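Retrieval of this kind is conceptually straightforward once every exam lives in the same embedding space: identifying a patient’s prior studies reduces to nearest-neighbor search over stored exam embeddings. A minimal sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def retrieve_prior_exams(query_emb: torch.Tensor,
                         archive_embs: torch.Tensor,
                         top_k: int = 5) -> torch.Tensor:
    """Return indices of the top_k archived exams most similar to the query.

    query_emb: (dim,) embedding of the new echo study.
    archive_embs: (num_exams, dim) embeddings of archived studies.
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), archive_embs, dim=-1)
    return sims.topk(top_k).indices
```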
Altogether, these results suggest that, with a large enough dataset of echo images and expert text interpretations, we can train foundation models that support an extremely wide range of echo assessment tasks.
The Takeaway
The last few years have brought an impressive flow of echo AI models, and Cedars-Sinai’s new EchoCLIP could well prove to be among the most significant, given its size, its breadth of capabilities, and its role as the first of potentially many advanced echo image+text foundation models.