Alongside GPT-4, OpenAI has open-sourced Evals, a software framework for evaluating the performance of its AI models. The company says the tooling is designed to let anyone report shortcomings in its models and help guide further improvements.
It’s a sort of crowdsourcing approach to model testing, OpenAI says.
“We use Evals to guide development of our models (both identifying shortcomings and preventing regressions), and our users can apply it for tracking performance across model versions and evolving product integrations,” OpenAI wrote in a blog post announcing the release. “We are hoping Evals becomes a vehicle to share and crowdsource benchmarks, representing a maximally wide set of failure modes and difficult tasks.”
OpenAI created Evals to develop and run benchmarks for evaluating models like GPT-4 and to inspect their performance. With Evals, developers can use data sets to generate prompts, measure the quality of the completions an OpenAI model produces, and compare performance across different data sets and models.
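As a rough illustration of the workflow the framework automates — not OpenAI's actual code; the file name, sample format, and exact-match scoring here are assumptions — an evaluation boils down to feeding dataset prompts to a model and scoring the completions:

```python
# Conceptual sketch of an eval run, not the Evals framework's own code.
# Assumes a JSONL dataset where each line has a "prompt" and an "ideal" answer,
# and that the OPENAI_API_KEY environment variable is set.
import json
import openai

def run_simple_eval(samples_path: str, model: str = "gpt-3.5-turbo") -> float:
    """Send each dataset prompt to the model and return exact-match accuracy."""
    correct, total = 0, 0
    with open(samples_path) as f:
        for line in f:
            sample = json.loads(line)
            response = openai.ChatCompletion.create(
                model=model,
                messages=[{"role": "user", "content": sample["prompt"]}],
            )
            completion = response["choices"][0]["message"]["content"].strip()
            correct += int(completion == sample["ideal"])
            total += 1
    return correct / total if total else 0.0

# e.g. accuracy = run_simple_eval("samples.jsonl")
```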
Evals, which is compatible with several popular AI benchmarks, also supports writing new classes to implement custom evaluation logic. As an example to follow, OpenAI created a logic puzzles evaluation that contains ten prompts where GPT-4 fails.
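A custom eval of this kind is, in essence, a class that scores one sample at a time and aggregates a metric, while the framework handles the dataset plumbing and model calls. The sketch below is illustrative only; the class name, method names, and sample format are assumptions, not the framework's documented interface:

```python
# Illustrative shape of a custom evaluation class; names and structure here
# are assumptions for illustration, not the Evals framework's actual API.
import json
import random
from typing import Callable, Dict, List

class ExactMatchEval:
    """Scores a model by checking its completions against ideal answers."""

    def __init__(self, complete_fn: Callable[[str], str], samples_path: str):
        self.complete_fn = complete_fn    # maps a prompt string to a completion string
        self.samples_path = samples_path  # JSONL file of {"prompt": ..., "ideal": ...}

    def load_samples(self) -> List[Dict]:
        with open(self.samples_path) as f:
            return [json.loads(line) for line in f]

    def eval_sample(self, sample: Dict) -> bool:
        # Custom evaluation logic lives here; exact match is the simplest case.
        return self.complete_fn(sample["prompt"]).strip() == sample["ideal"]

    def run(self, seed: int = 0) -> Dict[str, float]:
        samples = self.load_samples()
        random.Random(seed).shuffle(samples)
        results = [self.eval_sample(s) for s in samples]
        accuracy = sum(results) / len(results) if results else 0.0
        return {"accuracy": accuracy}
```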
It’s all unpaid work. But to incentivize Evals usage, OpenAI plans to grant GPT-4 access to those who contribute “high-quality” benchmarks.
“We believe that Evals will be an integral part of the process for using and building on top of our models, and we welcome direct contributions, questions, and feedback,” the company wrote.
With Evals, OpenAI — which recently said that it would stop using customer data to train its models by default — is following in the footsteps of others who’ve turned to crowdsourcing to robustify AI models.
In 2017, the Computational Linguistics and Information Processing Laboratory at the University of Maryland launched a platform dubbed Break It, Build It, which let researchers submit models to users tasked with coming up with examples to defeat them. And Meta maintains a platform called Dynabench that has users "fool" models designed to analyze sentiment, answer questions, detect hate speech, and more.