OpenAI today released a preview of its next-generation large language models, which the company says perform better than its previous models but come with a few caveats.

In its announcement for the new model, o1-preview, OpenAI touted its performance on a variety of tasks designed for humans. The model scored in the 89th percentile in programming competitions held by Codeforces and correctly answered 83 percent of the questions on a qualifying exam for the International Mathematical Olympiad, compared with GPT-4o's 14 percent.

Sam Altman, OpenAI’s CEO, said the o1-preview and o1-mini models were the “beginning of a new paradigm: AI that can do general-purpose complex reasoning.” But he added that “o1 is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it.”

When asked a question, the new models use chain-of-thought techniques that mimic how humans think, and how many generative AI users have learned to use the technology: by repeatedly prompting and correcting the model with new directions until it arrives at the desired answer. In the o1 models, versions of those processes happen behind the scenes, without additional prompting. “It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working,” the company said.
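For developers calling these models through OpenAI's API, the practical difference looks roughly like the sketch below. It is a minimal illustration assuming the standard openai Python SDK; the question and prompt wording are invented for the example, and with o1-preview the step-by-step reasoning happens internally, so only the final answer is returned.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = "What is the smallest prime number greater than 200?"

# With earlier chat models, users often elicit reasoning explicitly and then
# iterate across turns, correcting the model until the answer looks right.
explicit_cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": f"Think step by step, then answer: {question}"}],
)

# With o1-preview, the same question is sent as a plain prompt; the model runs
# its chain of thought behind the scenes and returns only the final response.
internal_cot = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": question}],
)

print(explicit_cot.choices[0].message.content)
print(internal_cot.choices[0].message.content)
```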

While these techniques improve the models’ performance on various benchmarks, OpenAI found that in a small subset of cases they also lead o1 models to intentionally deceive users. In a test of 100,000 ChatGPT conversations powered by o1-preview, the company found that about 800 of the answers the model supplied were incorrect. And for roughly a third of those incorrect responses, the model’s chain of thought showed that it knew the answer was incorrect but provided it anyway.

“Intentional hallucinations primarily happen when o1-preview is asked to provide references to articles, websites, books, or similar sources that it cannot easily verify without access to internet search, causing o1-preview to make up plausible examples instead,” the company wrote in its model system card.

Overall, the new models performed better than GPT-4o, OpenAI’s previous state-of-the-art model, on several of the company’s internal safety benchmarks, which measure how easily the models can be jailbroken, how often they provide incorrect responses, and how often they display bias regarding age, gender, and race. However, the company found that o1-preview was significantly more likely than GPT-4o to give an answer to an ambiguous question when it should instead have said that it didn’t know the answer.

OpenAI did not release much information about the data used to train its new models, saying only that they were trained on a combination of publicly available data and proprietary data obtained through partnerships.
