A New Benchmark for the Risks of AI [Video]

“Overall, it’s good to see scientific rigor in the AI evaluation processes,” says Rumman Chowdhury, CEO of Humane Intelligence, a nonprofit that specializes in testing or red-teaming AI models for misbehaviors. “We need best practices and inclusive methods of measurement to determine whether AI models are performing the way we expect them to.”

MLCommons says the new benchmark is meant to be similar to automotive safety ratings, with model makers pushing their products to score well and the standard then improving over time.

The benchmark is not designed to measure the potential for AI models to become deceptive or difficult to control, an issue that garnered attention after ChatGPT blew up in late 2022. Governments worldwide have launched efforts to study this issue, and AI companies have teams dedicated to researching and probing models for problematic behaviors.

Mattson says MLCommons’ approach is meant to be complementary but also more expansive. “Safety institutes are trying to do evaluations, but they’re not necessarily able to …
