MLCommons launches a new platform to benchmark medical AI models

The pandemic has spurred the healthcare sector's enthusiastic adoption of AI. According to a 2020 Optum survey, 80% of healthcare firms now have an AI strategy in place, and another 15% are preparing one.

Vendors, including well-known companies like Google, are stepping up their efforts to meet the rising demand. Google recently introduced Med-PaLM 2, an AI model built specifically to analyze medical text and provide medical insights and answers. Companies such as Hippocratic AI and OpenEvidence are likewise building models to give practical guidance to physicians in the field.

However, as more medical models hit the market, it becomes harder to tell which ones actually live up to their claims. Medical models are often trained on data from small, specialized clinical settings, such as hospitals on the Eastern seaboard. Because the data is so limited, biases against particular patient populations, notably minorities, can emerge, with damaging real-world consequences.

To provide a trustworthy, reliable way to benchmark and evaluate medical models, MLCommons, an engineering consortium focused on AI industry metrics, has created a new testing platform called MedPerf. According to MLCommons, MedPerf can assess AI models on a wide range of real medical data while placing a high priority on patient privacy.

Alex Karargyris, co-chair of the MLCommons Medical Working Group, framed benchmarking as a way to improve medical AI. According to Karargyris, neutral, scientific testing of models on large, diverse datasets can improve efficacy, reduce bias, build public trust, and support regulatory compliance.

MedPerf is the product of a two-year collaboration among more than 20 companies and 20 academic institutions, led by the Medical Working Group. Notable members include major companies such as Google, Amazon, IBM, and Intel, alongside institutions like Brigham and Women’s Hospital, Stanford University, and MIT.

Unlike MLCommons’ general-purpose AI benchmarking suites such as MLPerf, MedPerf is aimed at healthcare organizations, the operators and customers of medical models. The platform gives hospitals and clinics the ability to evaluate AI models on demand, using a method known as “federated evaluation” to deploy models remotely and assess their performance locally, so patient data stays on site.
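To illustrate the idea, here is a minimal Python sketch of federated evaluation; it is not MedPerf's actual API. The model is sent to each participating site, metrics are computed locally, and only aggregate scores are reported back. The site names, record format, and accuracy metric are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model interface: maps a patient's feature dict to a predicted label.
Model = Callable[[dict], int]

@dataclass
class Site:
    """A participating hospital; its patient records never leave this site."""
    name: str
    records: List[dict]  # each record: {"features": {...}, "label": 0 or 1}

def evaluate_locally(model: Model, site: Site) -> Dict[str, object]:
    """Run the model on the site's own data and keep only aggregate metrics."""
    correct = sum(model(r["features"]) == r["label"] for r in site.records)
    return {"site": site.name,
            "n": len(site.records),
            "accuracy": correct / len(site.records)}

def federated_evaluation(model: Model, sites: List[Site]) -> List[Dict[str, object]]:
    # The central benchmark collects only per-site summary statistics;
    # raw patient data stays behind each hospital's firewall.
    return [evaluate_locally(model, s) for s in sites]

if __name__ == "__main__":
    # Toy run with made-up sites and a trivial stand-in for a medical model.
    sites = [
        Site("hospital_a", [{"features": {"x": 1}, "label": 1},
                            {"features": {"x": 0}, "label": 0}]),
        Site("hospital_b", [{"features": {"x": 1}, "label": 0}]),
    ]
    model: Model = lambda features: features["x"]
    print(federated_evaluation(model, sites))
```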

MedPerf supports both private models and models accessed through APIs, such as those offered by Epic and Microsoft’s Azure OpenAI Service, and it is interoperable with common machine learning libraries. Given the variety of models used by healthcare institutions, this broad compatibility ensures flexibility.
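As a rough illustration of how locally hosted and API-accessed models could sit behind a single evaluation interface, here is a hedged sketch. The endpoint, authentication header, and response field are placeholders, not Epic's or Azure OpenAI's real APIs.

```python
from typing import Callable

import requests  # assumed available; any HTTP client would do

# Same hypothetical interface as in the sketch above: any callable that
# maps a feature dict to a predicted label can be benchmarked.
Model = Callable[[dict], int]

def local_model(features: dict) -> int:
    """A privately held model running entirely inside the hospital."""
    return int(features.get("x", 0) > 0)

def make_api_model(endpoint: str, api_key: str) -> Model:
    """Wrap a remote, API-accessed model behind the same interface.

    The request and response formats here are made up for illustration;
    the point is only that both kinds of models can be evaluated through
    one common call signature.
    """
    def predict(features: dict) -> int:
        resp = requests.post(
            endpoint,
            json={"features": features},
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        resp.raise_for_status()
        return int(resp.json()["label"])

    return predict
```

Either kind of model could then be handed to the federated evaluation sketch above without the evaluating sites needing to know where the model actually runs.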

Earlier this year, MedPerf hosted the Federated Tumor Segmentation (FeTS) Challenge, an NIH-funded effort to examine models for assessing post-operative care for glioblastoma, an aggressive brain tumor. Using both on-premises and cloud-based software, MedPerf enabled 41 different models to be tested at 32 healthcare facilities across six continents.

According to MLCommons, all of the models performed worse when applied at sites whose patient demographics differed from the data they were trained on, which made clear that the models themselves carried biases.

Renato Umeton, director of AI operations at the Dana-Farber Cancer Institute and co-chair of the MLCommons Medical Working Group, expressed enthusiasm about the outcomes of MedPerf’s medical AI pilot studies. Umeton emphasized that the models adhered to pre-established data standards and ran on hospital systems without transferring any data, findings that support the idea that benchmarks obtained through federated evaluation are a positive step toward more inclusive AI-enabled medicine.

Although MedPerf currently focuses mainly on models that analyze radiology scans, MLCommons views it as a key step toward advancing medical AI through open, neutral, and scientific methods. The consortium encourages AI researchers to use the platform to test their own models in hospitals, and it invites data owners to register patient data to strengthen the validity of MedPerf’s testing.

However, even if MedPerf performs as promised (which is not a given), it is worth asking whether the platform actually addresses the hard problems in AI for healthcare.

A recent analysis from Duke University researchers highlights the wide gap between marketing claims for AI and the substantial work required to verify that the technology actually works as intended. Integrating AI into healthcare workers’ daily routines and into complex technical and service delivery systems involves many hurdles.

This issue is not new. In 2020, Google published a candid whitepaper explaining why its AI system for detecting diabetic retinopathy fell short in real-world testing. The challenges included not only problems with the models themselves but also hospital equipment and deployment issues, poor internet connectivity, and even patients’ reactions to AI-assisted screening.

Unsurprisingly, healthcare professionals, as distinct from the organizations that employ them, hold mixed views on AI’s use in the industry. According to a Yahoo Finance poll, only 26% of medical professionals, compared with 55% of other professionals, consider the technology trustworthy.

None of this is meant to downplay medical model bias, which has real consequences. Epic’s sepsis-detection system, for example, has been found to routinely miss cases of the illness while frequently generating false alerts. It is also true that companies without the scale of Google or Microsoft have struggled to obtain diverse, up-to-date medical data for model testing.

Still, when people’s health is at stake, it would be unwise to lean too heavily on a platform like MedPerf. Benchmarks offer only a partial view. Continuous, rigorous audits by vendors, customers, and researchers are needed to ensure that medical models are deployed safely. The absence of such testing can be summed up in one word: irresponsible.
