Newscard - o1 'Strawberry'

September 13, 2024

At a conference last week, Anthropic co-founder Jack Clark said that LLMs are approaching a technical depth that most non-experts cannot evaluate. As if to immediately prove him correct, his competitor OpenAI released a model specialized for complex math and programming problems, prompting a dozen people to ask me for review requests.

https://openai.com/index/learning-to-reason-with-llms/

Newscard

GPT o1 is the long-awaited strawberry
OpenAI once again takes a substantial research lead
o1 makes clear improvements in math, coding, and knowledge questions

OpenAI describes it as a form of both Chain of Thought and Reinforcement Learning: ““Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process.”
It likely blurs the line between Chain of Thought, an approach to instructing language models to form step by step processes, and tree search, a different kind of algorithm best known for its use in complex games such as Chess or Go, and more recently for Math Olympiads.

As a gold medallist in the International Olympiad in Informatics and a national competitor in the US and Canadian Math Olympiads, many people are asking me to explain what impact this new release will have. I asked o1 some choice questions in my areas of expertise, namely competitive math, competitive programming, and research combinatorics.

My takeaways from those domains:

o1 is an exemplary lookup system. It accurately cites obscure math theorems, describes them correctly, and identifies their use cases. It likely uses some form of retrieval augmented generation.
It struggles at more ad hoc programming and math questions, even ones which would be much easier for a human to do.
- For someone who has experience in math olympiads, it’s easy to identify problems that are “stereotypical” math olympiad problems, that use typical results and techniques. These “stereotypical” problems can be easy or difficult, but often depend on a fixed “vocabulary” of results. For o1, a stereotypical problem that would be very difficult for a human is still easier than a problem that would be easier for an inexperienced human.
- Without similar experience, it’s pretty difficult for a smart generalist to tell what kinds of problems would be well-represented or poorly-represented in training data
This is epitomized by the combinatorics research examples, in which o1 correctly cites and applies very obscure results.

Appendix (The Brian Chau Scorecard)

mantel’s theorem ✅

notes: cites mantel’s theorem directly

strong perfect graph theorem ✅(?)

also cites SPGT, gives an incomplete proof

labelled functional graphs where node 1 is reachable (n^{n-1}): ❌

did not fix after 3 corrections, gave an imo incomplete proof after being prompted with the formula for spanning trees

Off-stereotype but difficult Canadian OI problem https://dmoj.ca/problem/cco13p6 ❌

(compile error and also obviously wrong)

Another off-type OI problem: http://ceoi.inf.elte.hu/probarch/14/question.pdf ❌

complete garbage solution

https://oj.uz/problem/view/POI11_imp: ❌

passes one batch(!), but is both wrong and inefficient

Goodman’s theorem: ✅

when prompted, just cites Goodman’s theorem without proving it. Gives the normal proof of Goodman’s theorem when prompted.

https://ocw.mit.edu/courses/18-225-graph-theory-and-additive-combinatorics-fall-2023/resources/mit18_225_f23_psets_pdf/ B8 (straightforward application of turan’s theorem): ✅

Hamiltonian path in erdos renyi graph: ✅ (kinda)

Cites the correct threshold with no proof initially. When asked to prove the threshold correctly cites another result by Posa. When asked to prove the posa result, gets there eventually (though its a bit of a loose proof)

eigenvalues of bipartite graphs: ✅

gives the standard proof.

https://dmoj.ca/problem/valentines18s3

medium-difficulty dynamic programming problem set by me: ❌ (doesnt get any cases correct, even the samples)

https://ocw.mit.edu/courses/18-225-graph-theory-and-additive-combinatorics-fall-2023/resources/mit18_225_f23_psets_pdf/ D16 (Alon–Boppana theorem) ✅

I picked this problem thinking it seemed not like a direct application of a result, but turns out it is also an existing theorem, the Alon–Boppana theorem.

HMMT 2018 combinatorics p7 ✅

HMMT 2019 combinatorics p8 ❌

HMMT 2019 combinatorics p9 ❌

HMMT 2019 team 8 (which is a copy of the earlier CEOI programming question) ❌ (gives the correct answer for this yes-or-no question, but a horrendously wrong proof)

https://chatgpt.com/c/66e345c3-ed3c-800e-b62c-a20e922aeaf1

https://chatgpt.com/c/66e346c7-9f9c-800e-a602-ebbac1dc89df

https://chatgpt.com/c/66e34a1b-aff8-800e-a335-5dabc448621c

https://chatgpt.com/c/66e34ba7-bc60-800e-a5fa-231af9a7a4e6

https://chatgpt.com/c/66e34cb5-4f80-800e-89d3-bbd4db4c1dce

https://chatgpt.com/c/66e34d61-74e4-800e-b0aa-c31b6d2a693c

https://chatgpt.com/c/66e34ecf-14bc-800e-ae68-216ea4a23b26

https://chatgpt.com/c/66e34fd0-71c4-800e-8951-58334e970cb8

https://chatgpt.com/c/66e350db-3d80-800e-afd6-7a67e71b9960

https://chatgpt.com/c/66e353fb-8cf4-800e-9b56-6f3faf562085

https://chatgpt.com/c/66e35596-4b20-800e-8a36-30846b2a1837

https://chatgpt.com/c/66e35816-8c98-800e-9e59-82a3d0c43dcc