A new AI coding challenge has just announced its first winner, and it sets a new bar for AI-powered software engineers.
On Wednesday at 5pm PST, the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding competition launched by Andy Konwinski, co-founder of Databricks and Perplexity. The winner, Brazilian prompt engineer Eduardo Rocha de Andrade, will receive $50,000 for the result. But more surprising than the win itself was his final score: he took first place by answering just 7.5% of the test questions correctly.
“We’re happy we created a benchmark that’s truly challenging,” Konwinski stated. “Benchmarks need to be difficult if they’re going to have significance.” Konwinski has committed $1 million to the first open-source model that achieves a score above 90% on the test.
Like the well-known SWE-Bench benchmark, the K Prize tests models against flagged issues from GitHub, as a measure of how well models can handle real-world programming problems. But where SWE-Bench relies on a fixed set of problems that models can train against, the K Prize is built as a “contamination-free” version of SWE-Bench, using a timed entry process to rule out any benchmark-specific training. For the first round, models had to be submitted by March 12th. The K Prize organizers then built the test using only GitHub issues flagged after that date.
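The core idea is that a test set drawn only from issues opened after the entry deadline cannot have appeared in any submitted model’s training data. As a rough illustration (not the K Prize’s actual tooling), here is a minimal Python sketch of date-filtered issue collection via GitHub’s public search API; the repository list, label handling, and cutoff year are assumptions for the example.

```python
# Minimal sketch (hypothetical, not the K Prize pipeline): collect GitHub issues
# created only after a submission cutoff, so entrants could not have trained on them.
import requests

CUTOFF = "2025-03-12"  # model-submission deadline cited in the article (year assumed)
REPOS = ["psf/requests", "pandas-dev/pandas"]  # illustrative candidate repositories

def fetch_post_cutoff_issues(repo: str, cutoff: str) -> list[dict]:
    """Return issues (excluding pull requests) in `repo` opened strictly after `cutoff`."""
    query = f"repo:{repo} is:issue created:>{cutoff}"
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": query, "per_page": 100},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]

if __name__ == "__main__":
    for repo in REPOS:
        issues = fetch_post_cutoff_issues(repo, CUTOFF)
        print(f"{repo}: {len(issues)} issues opened after {CUTOFF}")
```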
The 7.5% top score stands in stark contrast to SWE-Bench itself, which currently shows a 75% top score on its easier “Verified” test and 34% on its harder “Full” test. Konwinski is still not sure whether the gap comes from contamination in SWE-Bench or simply from the difficulty of collecting new issues from GitHub, but he expects the K Prize project to answer that question soon.
“As we accumulate more iterations of this, we’ll gain a clearer understanding,” he told TechCrunch, “since we anticipate that people will adapt to the dynamics of competing every few months.”
It may seem strange that models struggle here, given the many AI coding tools already available to the public. But with existing benchmarks increasingly looking too easy, many critics see projects like the K Prize as a necessary step toward solving AI’s growing evaluation problem.
“I’m very optimistic about creating new tests for established benchmarks,” says Princeton researcher Sayash Kapoor, who proposed a similar idea in a recent paper. “Without these experiments, we can’t really tell whether the problem is contamination, or simply targeting the SWE-Bench leaderboard with a human in the loop.”
For Konwinski, this is more than just a better benchmark; it’s an open challenge to the rest of the industry. “If you listen to the hype, it’s like we should be seeing AI doctors, AI lawyers, and AI software engineers, and that’s just not true,” he says. “If we can’t even get more than 10% on a contamination-free SWE-Bench, that’s the reality check for me.”