How to Use OpenAI o3 API for Mathematical Conjecture Testing in 2025

How-To Guide·Jul 4, 2026·18 min read

Prerequisites and Environment Setup

Before you write a single line of code, get your environment locked down. o3 is not a drop-in replacement for GPT-4o — it has different API access requirements, a new parameter (reasoning_effort), and dramatically higher token consumption. Skipping this setup step will cost you time and money.

Required Python packages and versions

[x] Python >= 3.10 (the helpers use str | None union type hints, which require 3.10)
[x] openai >= 1.30.0: the floor this guide pins. The reasoning_effort parameter reached chat.completions.create in a later 1.x release, so if the call raises a TypeError on that keyword, upgrade to the latest 1.x SDK
[x] tiktoken >= 0.7.0 — for pre-flight token estimation before expensive o3 calls
[x] numpy >= 1.26.0 — array manipulation for combinatorial structures
[x] sympy >= 1.12 — symbolic math verification (non-negotiable; see Step 3)

pip install "openai>=1.30.0" "tiktoken>=0.7.0" "numpy>=1.26.0" "sympy>=1.12"

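Pin the versions you actually install: run pip freeze after installing and copy the exact openai, tiktoken, numpy, and sympy versions into requirements.txt. The >= specifiers above are minimums, not pins, and SDK behavior around o3 parameters has changed between minor releases, so unpinned installs will bite you in CI.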

OpenAI API key configuration and rate limits

o3 requires Tier 4 or higher API access. You can check your tier at platform.openai.com/account/limits. Tier 3 accounts will receive a model_not_found or rate-limit error when calling o3. At Tier 4, you get 10,000 RPD and 2M tokens per minute for o3.

export OPENAI_API_KEY="sk-...your-key-here..."

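For long-running conjecture-testing sessions, pass max_retries=5 when constructing the client, as in openai.OpenAI(max_retries=5), so the SDK's built-in exponential backoff handles transient rate-limit and connection errors.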

Understanding o3's extended thinking and reasoning tokens

o3 introduces reasoning tokens — internal chain-of-thought tokens that the model consumes before producing visible output. These are billed at the same rate as output tokens but are not returned in the response content. The reasoning_effort parameter (low, medium, high) controls how many reasoning tokens the model is allowed to spend. For conjecture testing, always use high — the difference in mathematical output quality between medium and high is substantial.

Note: Reasoning tokens do not appear in response.choices[0].message.content. They are reported in response.usage.completion_tokens_details.reasoning_tokens for non-streaming calls, and in the final usage chunk when streaming with stream_options={"include_usage": True}.
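Because reasoning tokens are counted inside completion_tokens, it helps to split the two when reading usage data. The sketch below shows one way to do that. The helper name split_completion_tokens and the fake usage object are illustrative assumptions, not part of the SDK; the attribute names mirror the usage fields described above.

```python
from types import SimpleNamespace


def split_completion_tokens(usage) -> dict:
    """Split completion tokens into hidden reasoning and visible output."""
    details = getattr(usage, "completion_tokens_details", None)
    # Treat a missing details object or a missing/None field as zero.
    reasoning = getattr(details, "reasoning_tokens", 0) or 0
    completion = getattr(usage, "completion_tokens", 0) or 0
    return {
        "reasoning_tokens": reasoning,
        "visible_tokens": completion - reasoning,
        "reasoning_share": reasoning / completion if completion else 0.0,
    }


# Stand-in for response.usage; the numbers are made up for illustration.
fake_usage = SimpleNamespace(
    prompt_tokens=900,
    completion_tokens=12_000,
    completion_tokens_details=SimpleNamespace(reasoning_tokens=10_500),
)
print(split_completion_tokens(fake_usage))
# {'reasoning_tokens': 10500, 'visible_tokens': 1500, 'reasoning_share': 0.875}
```

A high reasoning share is normal at reasoning_effort='high'; what matters for billing is that both parts are charged at the output rate.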

Estimated time: 15 minutes to complete full setup and verify API access.

Step 1: Frame Your Mathematical Problem for an LLM Reasoner

The quality of o3's mathematical output is almost entirely determined by how precisely you state the problem. Vague prompts produce plausible-sounding but wrong conjectures. This step is where mathematical rigor meets prompt engineering.

Translating conjectures into verifiable formal statements

Every conjecture you submit should have three parts: a precise definition of the mathematical objects involved, a formal statement of the property claimed, and an explicit description of what a counterexample would look like. If you can't state what a counterexample looks like, neither can o3.
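One way to enforce the three-part structure is to keep it in a small data class and refuse to render a prompt when any part is missing. This is a minimal sketch; the ConjectureSpec class and its field names are hypothetical helpers, not something the OpenAI SDK provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConjectureSpec:
    """Hypothetical container for the three parts of a testable conjecture."""
    objects: str          # precise definition of the objects involved
    claim: str            # formal statement of the claimed property
    counterexample: str   # what a violating object would look like

    def to_formal_statement(self) -> str:
        """Render the three parts as one block of prompt text."""
        for name in ("objects", "claim", "counterexample"):
            if not getattr(self, name).strip():
                raise ValueError(f"'{name}' must not be empty")
        return (
            f"Objects: {self.objects}\n"
            f"Claim: {self.claim}\n"
            f"A counterexample is: {self.counterexample}"
        )


spec = ConjectureSpec(
    objects="Simple planar graphs G = (V, E) with |V| >= 4.",
    claim="Every such G has a vertex v with deg(v) <= 4.",
    counterexample=(
        "An explicit edge list of a planar graph in which "
        "every vertex has degree >= 5."
    ),
)
print(spec.to_formal_statement())
```

The rendered string can be passed wherever a formal statement is expected. Forcing the counterexample description to be non-empty is the point: it makes the "if you can't state it, neither can o3" rule mechanical.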

Structuring prompts for discrete geometry and combinatorics tasks

Few-shot examples of known results dramatically improve o3's output quality. Include one theorem that the model should treat as ground truth and one failed conjecture that it should use as a template for the search strategy.

Setting the reasoning_effort parameter for deep search

Use reasoning_effort='high' for any non-trivial conjecture. Reserve medium for rapid hypothesis generation when you're exploring the problem space, then switch to high for serious verification attempts.

import openai
import os

client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def build_conjecture_prompt(conjecture_name: str, formal_statement: str, domain: str) -> list[dict]:
    """
    Build a structured message list for o3 conjecture testing.
    Includes few-shot examples of known theorems to anchor reasoning.
    """
    system_prompt = f"""You are a mathematical research assistant specializing in {domain}.

Your task is to search for a COUNTEREXAMPLE to the following conjecture.
Do not attempt to prove the conjecture true. Focus entirely on finding a
specific concrete counterexample or arguing rigorously why one cannot exist.

Known true theorems to use as background knowledge:
1. Ramsey's Theorem: For any positive integers r, s, there exists a minimum
   integer R(r,s) such that any graph on R(r,s) vertices contains a clique
   of size r or an independent set of size s.
   Example: R(3,3) = 6. Witness: K_5 admits a 2-coloring of its edges with
   no monochromatic triangle, but every 2-coloring of K_6 contains one.

2. Cap-set bound (Croot-Lev-Pach / Ellenberg-Gijswijt, 2016):
   The maximum size of a subset of (Z/3Z)^n with no 3-term arithmetic
   progression is O(2.756^n). This was a conjecture until 2016.

Target conjecture to DISPROVE:
Name: {conjecture_name}
Formal statement: {formal_statement}

Output format:
- State the candidate counterexample as a precise mathematical object
  (e.g., an explicit set, graph, or sequence)
- Verify step by step that it violates the conjecture property
- If no counterexample found, state clearly: CONJECTURE APPEARS TRUE for
  the search space examined, with reasons"""

    user_prompt = f"""Search for a counterexample to: {formal_statement}

Provide the counterexample in a structured block starting with:
COUNTEREXAMPLE: <your mathematical object here>
VERIFICATION: <step-by-step check>"""

    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]

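# Example usage: testing a graph-coloring style conjecture.
# This variant is known to hold for k = 6 (a graph with no K_3 minor is a
# forest, hence 2-colorable), so the expected outcome is "no counterexample".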
messages = build_conjecture_prompt(
    conjecture_name="Weak Hadwiger Conjecture Variant",
    formal_statement=(
        "Every graph G with chromatic number chi(G) >= k "
        "contains K_floor(k/2) as a minor. "
        "Specifically, find a graph with chi(G)=6 that has no K_3 minor."
    ),
    domain="graph theory and combinatorics"
)

Note: Always include LaTeX-style notation alongside natural language. o3 was trained on mathematical literature and responds significantly better when both forms are present.

Step 2: Call the o3 API with Extended Reasoning Enabled

The API call itself has several non-obvious configuration choices that determine whether you get useful mathematical output or an expensive timeout. Get these parameters right the first time.

Using the responses endpoint vs. chat completions for long reasoning chains

As of mid-2025, chat.completions.create with model='o3' is the stable path for most developers. The newer Responses API endpoint supports o3 but has stricter streaming semantics. Stick with chat completions until the Responses API stabilizes.

Configuring max_completion_tokens and reasoning token budgets

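Set max_completion_tokens to at least 16,000 for non-trivial problems, and raise it if responses come back cut off. The limit covers reasoning tokens and visible output combined, and o3 on reasoning_effort='high' can consume 10,000–30,000 reasoning tokens internally before producing output. If the budget runs out during reasoning, the response ends with finish_reason='length' and little or no visible content; if it runs out later, the model truncates mid-proof.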
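A truncated call is easy to miss because it still returns normally. The sketch below checks for it, assuming the response follows the chat completions shape (a choice with finish_reason and message.content, plus a usage object). The function name diagnose_completion and its return labels are illustrative, and the fake objects stand in for a real response.

```python
from types import SimpleNamespace


def diagnose_completion(choice, usage) -> str:
    """Classify a finished call: 'ok', 'truncated_output',
    or 'budget_spent_on_reasoning'."""
    if choice.finish_reason != "length":
        return "ok"
    details = getattr(usage, "completion_tokens_details", None)
    reasoning = getattr(details, "reasoning_tokens", 0) or 0
    visible = (choice.message.content or "").strip()
    # Hitting the cap with no visible text means the whole budget went
    # to reasoning; raising max_completion_tokens is the usual remedy.
    if not visible and reasoning > 0:
        return "budget_spent_on_reasoning"
    return "truncated_output"


# Stand-ins for response.choices[0] and response.usage (made-up numbers).
fake_choice = SimpleNamespace(
    finish_reason="length", message=SimpleNamespace(content="")
)
fake_usage = SimpleNamespace(
    completion_tokens=16_000,
    completion_tokens_details=SimpleNamespace(reasoning_tokens=16_000),
)
print(diagnose_completion(fake_choice, fake_usage))
# budget_spent_on_reasoning
```

In a real loop you would call this on response.choices[0] and response.usage before parsing, and retry with a larger budget instead of sending format feedback to the model.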

Streaming reasoning traces for large problems

For long reasoning chains, stream the response to avoid timeout errors and to monitor progress. The delta.content field contains the visible output; reasoning tokens are not streamed but are summarized in the final usage chunk.

import openai
import os

client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def call_o3_streamed(
    messages: list[dict],
    reasoning_effort: str = "high",
    max_completion_tokens: int = 16000
) -> tuple[str, dict]:
    """
    Call o3 with streaming enabled.
    Returns (full_response_text, usage_dict).
    """
    full_response = []
    usage_data = {}

    stream = client.chat.completions.create(
        model="o3",
        messages=messages,
        reasoning_effort=reasoning_effort,
        max_completion_tokens=max_completion_tokens,
        stream=True,
        stream_options={"include_usage": True}  # Required to get usage in stream
    )

    for chunk in stream:
        # Extract visible content from delta
        if chunk.choices and chunk.choices[0].delta.content:
            token_text = chunk.choices[0].delta.content
            full_response.append(token_text)
            print(token_text, end="", flush=True)  # Live progress display

        # The final chunk contains usage statistics
        if chunk.usage is not None:
            usage_data = {
                "prompt_tokens": chunk.usage.prompt_tokens,
                "completion_tokens": chunk.usage.completion_tokens,
                "total_tokens": chunk.usage.total_tokens,
                # Reasoning tokens are nested in completion_tokens_details
                "reasoning_tokens": getattr(
                    getattr(chunk.usage, "completion_tokens_details", None),
                    "reasoning_tokens", 0
                )
            }

    return "".join(full_response), usage_data

# Wire it up
if __name__ == "__main__":
    from step1 import build_conjecture_prompt  # from previous step
    messages = build_conjecture_prompt(
        conjecture_name="Test Conjecture",
        formal_statement="Every planar graph with n >= 4 vertices has a vertex of degree <= 4",
        domain="graph theory"
    )
    response_text, usage = call_o3_streamed(messages)
    print(f"\n\nReasoning tokens used: {usage.get('reasoning_tokens', 'N/A')}")
    print(f"Total tokens: {usage.get('total_tokens', 'N/A')}")

Note: stream_options={"include_usage": True} is required — without it, the final usage chunk is omitted from the stream and you cannot track reasoning token spend.

Step 3: Parse and Validate the Model's Counterexample Output

Never trust o3's self-verification. The model can confidently describe a counterexample that doesn't actually violate the conjecture. Independent symbolic verification is not optional — it's the entire point of the workflow.

Extracting structured mathematical objects from free-text responses

Structured output markers (the COUNTEREXAMPLE: and VERIFICATION: blocks from our prompt template) make parsing tractable. Use regex to extract the candidate, then pass it to your verifier.

Using SymPy to programmatically verify counterexamples

SymPy handles symbolic polynomial arithmetic, number theory, and combinatorial checks. For graph problems, use NetworkX alongside SymPy.

Logging and storing candidate proofs for human review

Log every candidate regardless of pass/fail. A counterexample that fails one check might partially inform the next search direction.

import ast
import json
import re
from datetime import datetime, timezone
from itertools import combinations

# This particular verifier needs only the standard library; reach for
# SymPy when the check involves symbolic algebra or number theory.

def extract_counterexample(response_text: str) -> str | None:
    """Pull the COUNTEREXAMPLE block from o3's response."""
    pattern = r"COUNTEREXAMPLE:\s*(.+?)(?=VERIFICATION:|$)"
    match = re.search(pattern, response_text, re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None

def verify_no_three_term_ap(candidate_set_str: str) -> dict:
    """
    Verifier for the cap-set problem in (Z/3Z)^n:
    Check that a given set of vectors in (Z/3Z)^n contains no 3-term
    arithmetic progression (i.e., no distinct x, y, z with x + z = 2y mod 3).

    Expects candidate_set_str like: "[[0,1,2],[1,0,1],[2,2,0]]"
    Returns dict with 'valid' bool and, when a 3-term AP is found, a 'witness'.
    """
    try:
        # ast.literal_eval only accepts Python literals, unlike eval
        point_set = ast.literal_eval(candidate_set_str)

        if not isinstance(point_set, list) or len(point_set) < 3:
            return {"valid": False, "error": "Need at least 3 points", "witness": None}

        n = len(point_set[0])  # dimension
        if any(len(p) != n for p in point_set):
            return {"valid": False, "error": "Points have different dimensions", "witness": None}

        # Reduce mod 3 and drop duplicates so repeated points cannot fake an AP
        points = sorted({tuple(c % 3 for c in p) for p in point_set})

        # In (Z/3Z)^n, 2y = -y, so distinct x, y, z form a 3-term AP
        # exactly when x + y + z ≡ 0 (mod 3) in every coordinate.
        for x, y, z in combinations(points, 3):
            if all((x[d] + y[d] + z[d]) % 3 == 0 for d in range(n)):
                return {
                    "valid": False,  # Model's claim is FALSE: the set has a 3AP
                    "error": f"Found 3-term AP: {x} + {z} = 2*{y} (mod 3)",
                    "witness": {"x": list(x), "y": list(y), "z": list(z)}
                }

        return {"valid": True, "error": None, "witness": None, "set_size": len(points)}

    except Exception as e:
        return {"valid": False, "error": f"Parse error: {str(e)}", "witness": None}

def log_candidate(response_text: str, verification_result: dict, conjecture_name: str):
    """Append candidate and result to a JSONL log file."""
    entry = {
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC time
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "conjecture": conjecture_name,
        "raw_response_snippet": response_text[:500],
        "verification": verification_result
    }
    with open("conjecture_candidates.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

Note: ast.literal_eval is safer than eval for parsing model output. For production use, add schema validation before passing model output to any Python interpreter.
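The schema validation mentioned in the note could look like the sketch below for cap-set candidates. It assumes the model emits a Python-style list of points with coordinates in {0, 1, 2}; the function name validate_point_set and the size limits are illustrative choices, not fixed requirements.

```python
import ast


def validate_point_set(candidate_str: str, max_points: int = 10_000) -> list[tuple[int, ...]]:
    """Parse and validate a candidate subset of (Z/3Z)^n.

    Raises ValueError with a readable message if the input is malformed.
    """
    if len(candidate_str) > 1_000_000:
        raise ValueError("candidate string too long")
    try:
        raw = ast.literal_eval(candidate_str.strip())
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError) as exc:
        raise ValueError(f"not a Python literal: {exc}") from exc
    if not isinstance(raw, (list, tuple)) or not raw:
        raise ValueError("expected a non-empty list of points")
    if len(raw) > max_points:
        raise ValueError("too many points")
    points = []
    for p in raw:
        if not isinstance(p, (list, tuple)) or not p:
            raise ValueError(f"point {p!r} is not a non-empty list")
        # type(c) is int rejects floats and booleans such as True.
        if any(type(c) is not int or c not in (0, 1, 2) for c in p):
            raise ValueError(f"point {p!r} has a coordinate outside {{0, 1, 2}}")
        points.append(tuple(p))
    if len({len(p) for p in points}) != 1:
        raise ValueError("points have different dimensions")
    if len(set(points)) != len(points):
        raise ValueError("duplicate points")
    return points


print(validate_point_set("[[0,1,2],[1,0,1]]"))
# [(0, 1, 2), (1, 0, 1)]
```

The ValueError message doubles as feedback text: it can be sent back to the model verbatim when a candidate is rejected before verification.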

Step 4: Build an Iterative Conjecture-Testing Loop

A single o3 call rarely finds a counterexample on the first attempt. The real power comes from building a feedback loop where verification failures are injected back as context, steering the model toward valid candidates.

Designing a feedback loop that sends verification results back to o3

Append the verifier's error message to the conversation as a user turn. o3 treats this as a correction signal and adjusts its search strategy. This is materially different from simply retrying the same prompt.

Handling cases where the model produces invalid or unverifiable output

Plan for three failure modes: unparseable output, a parsed but mathematically invalid candidate, and a candidate that is valid but doesn't address the conjecture. Each needs a different feedback message.
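The three failure modes can be made explicit with a small classifier and one feedback template per mode. This is a sketch under assumptions: the Failure enum, the classify and feedback_for helpers, and the "on_target" result key are all hypothetical; your verifier would need to set "on_target" itself when a candidate is well-formed but outside the conjecture's scope.

```python
from enum import Enum
from typing import Optional


class Failure(Enum):
    UNPARSEABLE = "unparseable"   # no COUNTEREXAMPLE: block could be extracted
    INVALID = "invalid"           # parsed, but the verifier rejected it
    OFF_TARGET = "off_target"     # well-formed object that misses the conjecture


def classify(candidate: Optional[str], result: Optional[dict]) -> Optional[Failure]:
    """Map extractor/verifier output to a failure mode, or None on success."""
    if candidate is None:
        return Failure.UNPARSEABLE
    if result is None or not result.get("valid", False):
        return Failure.INVALID
    # "on_target" is a hypothetical key; absent means "assume on target".
    if not result.get("on_target", True):
        return Failure.OFF_TARGET
    return None


def feedback_for(failure: Failure, detail: str = "") -> str:
    """Return the user-turn text to append for a given failure mode."""
    if failure is Failure.UNPARSEABLE:
        return ("Your response did not include a COUNTEREXAMPLE: block. "
                "Restate the candidate in the required format.")
    if failure is Failure.INVALID:
        return (f"The verifier rejected your candidate: {detail} "
                "Propose a different candidate.")
    return (f"Your object is well-formed but does not test the conjecture: {detail} "
            "Re-read the formal statement and target the claimed property.")


mode = classify("[[0,0],[1,1],[2,2]]", {"valid": False})
print(mode, "->", feedback_for(mode, "Found 3-term AP."))
```

Keeping the templates separate makes it harder to send format feedback for a mathematical failure, which tends to waste an iteration.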

Cost management: estimating token spend per reasoning iteration

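At reasoning_effort='high', 10 iterations at roughly 25K completion tokens each come to about 250K output tokens, or about $10 at the $40 per 1M output rate assumed in the code below. Prompt tokens add to that as the conversation history grows, so leave headroom above that figure and track cumulative spend from the usage object.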
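To sanity-check a budget before starting a session, the arithmetic can be written down directly. The sketch below is a rough model, not a billing tool: the rates are the same assumed constants this guide's loop uses, the function name estimate_session_cost is hypothetical, and the token counts in the example are made up. It assumes only the visible reply and the feedback message are carried forward in the history, since reasoning tokens are not returned.

```python
# Assumed rates in USD per 1M tokens, matching the constants used in this
# guide's loop; check the current pricing page before relying on them.
INPUT_USD_PER_1M = 10.00
OUTPUT_USD_PER_1M = 40.00


def estimate_session_cost(
    iterations: int,
    completion_tokens: int,   # per call, reasoning + visible
    base_prompt_tokens: int,  # system + first user message
    carried_tokens: int,      # visible reply + feedback added per round
) -> dict:
    """Rough cost model: output is flat per call, input grows with history."""
    output_tokens = iterations * completion_tokens
    # Call i (1-based) resends the base prompt plus i-1 earlier rounds.
    input_tokens = sum(
        base_prompt_tokens + (i - 1) * carried_tokens
        for i in range(1, iterations + 1)
    )
    output_usd = output_tokens / 1_000_000 * OUTPUT_USD_PER_1M
    input_usd = input_tokens / 1_000_000 * INPUT_USD_PER_1M
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "input_usd": input_usd,
        "output_usd": output_usd,
        "total_usd": input_usd + output_usd,
    }


estimate = estimate_session_cost(
    iterations=10, completion_tokens=25_000,
    base_prompt_tokens=1_000, carried_tokens=1_650,
)
print(f"input ${estimate['input_usd']:.2f} + output ${estimate['output_usd']:.2f}")
# input $0.84 + output $10.00
```

Under these assumptions output tokens dominate the bill, which is why reasoning_effort and max_completion_tokens are the levers that matter most for cost.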

import openai
import os
from typing import Callable

client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# o3 pricing as of mid-2025 (USD per 1M tokens) — verify at platform.openai.com/pricing
O3_INPUT_PRICE_PER_1M = 10.00
O3_OUTPUT_PRICE_PER_1M = 40.00

def iterative_conjecture_search(
    initial_messages: list[dict],
    verifier_fn: Callable[[str], dict],
    extract_fn: Callable[[str], str | None],
    conjecture_name: str,
    max_iterations: int = 5,
    reasoning_effort: str = "high"
) -> dict:
    """
    Iterative o3 conjecture-testing loop with verifier feedback.
    Returns a dict with found=True and the first verified counterexample,
    or found=False if the iteration budget runs out.
    """
    messages = list(initial_messages)  # Don't mutate the original
    cumulative_cost_usd = 0.0
    cumulative_tokens = {"prompt": 0, "completion": 0, "reasoning": 0}

    for iteration in range(1, max_iterations + 1):
        print(f"\n{'='*60}")
        print(f"Iteration {iteration}/{max_iterations} — calling o3...")

        # Call o3 (non-streaming for cleaner usage tracking in loop)
        response = client.chat.completions.create(
            model="o3",
            messages=messages,
            reasoning_effort=reasoning_effort,
            max_completion_tokens=16000  # covers reasoning + visible output
        )

        # content can be empty or None if the budget ran out during reasoning
        response_text = response.choices[0].message.content or ""
        usage = response.usage

        # Track token spend
        cumulative_tokens["prompt"] += usage.prompt_tokens
        cumulative_tokens["completion"] += usage.completion_tokens
        reasoning_tokens = getattr(
            getattr(usage, "completion_tokens_details", None), "reasoning_tokens", 0
        ) or 0
        cumulative_tokens["reasoning"] += reasoning_tokens

        # Estimate cost
        iteration_cost = (
            usage.prompt_tokens / 1_000_000 * O3_INPUT_PRICE_PER_1M +
            usage.completion_tokens / 1_000_000 * O3_OUTPUT_PRICE_PER_1M
        )
        cumulative_cost_usd += iteration_cost
        print(f"Iteration cost: ${iteration_cost:.4f} | Cumulative: ${cumulative_cost_usd:.4f}")
        print(f"Reasoning tokens this iteration: {reasoning_tokens}")

        # Append model's response to conversation
        messages.append({"role": "assistant", "content": response_text})

        # Extract candidate counterexample
        candidate = extract_fn(response_text)
        if candidate is None:
            feedback = (
                "Your response did not include a COUNTEREXAMPLE: block. "
                "Please provide a concrete mathematical object in the format specified."
            )
            messages.append({"role": "user", "content": feedback})
            print("WARNING: No parseable counterexample found. Retrying with format feedback.")
            continue

        # Run independent verifier
        result = verifier_fn(candidate)
        print(f"Verifier result: {result}")

        if result["valid"]:
            print(f"SUCCESS: Valid counterexample found on iteration {iteration}!")
            return {
                "found": True,
                "counterexample": candidate,
                "iteration": iteration,
                "cumulative_cost_usd": cumulative_cost_usd,
                "cumulative_tokens": cumulative_tokens
            }
        else:
            # Inject verifier error as feedback
            feedback = (
                f"Your proposed counterexample was checked by an independent symbolic verifier "
                f"and FAILED for the following reason: {result['error']}. "
                f"Witness data: {result.get('witness')}. "
                f"Please revise your approach and propose a different candidate."
            )
            messages.append({"role": "user", "content": feedback})

    print(f"\nSearch exhausted after {max_iterations} iterations.")
    print(f"Total cost: ${cumulative_cost_usd:.4f}")
    print(f"Total tokens: {cumulative_tokens}")
    return {
        "found": False,
        "cumulative_cost_usd": cumulative_cost_usd,
        "cumulative_tokens": cumulative_tokens
    }

Step 5: Reproduce the Discrete Geometry Disproof Workflow

In early 2025, OpenAI publicly reported that o3 identified a counterexample to a conjecture in combinatorics/discrete geometry — specifically related to the structure of cap-sets and related extremal problems. Understanding what actually happened prevents you from over-claiming what this workflow can do.

What OpenAI's o3 actually did: the cap-set adjacent conjecture context

The reported result was counterexample discovery, not theorem proving. o3 found a specific mathematical object that violated a conjectured bound. A human mathematician then verified the object independently. The machine did not produce a formal proof in any proof-assistant language like Lean or Coq — it produced a witness. This distinction matters enormously for how you use and cite the results.

Replicating a minimal counterexample search in your own script

The following script demonstrates the full pipeline on a tractable Ramsey-type problem: searching for a 2-coloring of K_5 edges that avoids monochromatic triangles (a known result: R(3,3)=6 means K_5 has such a coloring, so this tests whether o3 finds it correctly).

Differences between a machine-found counterexample and a formal proof

A counterexample is a single witness. A formal proof generalizes. If o3 finds a 2-coloring of the edges of K_17 with no monochromatic K_4, that witness, once verified, establishes the lower bound R(4,4) > 17, but it does not prove R(4,4) = 18. The upper bound requires showing that every 2-coloring of K_18 contains a monochromatic K_4. Always frame machine-found results as witnesses requiring further mathematical analysis.
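The asymmetry is easy to see on the small case R(3,3) = 6. Checking one witness on K_5 takes a single pass over 10 triangles, while the matching universal claim about K_6 needs every one of the 2^15 colorings examined. The sketch below does both by brute force; the helper name has_mono_triangle is illustrative.

```python
from itertools import combinations, product


def has_mono_triangle(n: int, color: dict) -> bool:
    """True if some triangle of K_n has all three edges the same color."""
    return any(
        color[(a, b)] == color[(b, c)] == color[(a, c)]
        for a, b, c in combinations(range(n), 3)
    )


# Witness: red (0) on the 5-cycle 0-1-2-3-4-0, blue (1) on the other edges.
edges5 = list(combinations(range(5), 2))
pentagon = {(i, j): 0 if (j - i) in (1, 4) else 1 for i, j in edges5}
print(has_mono_triangle(5, pentagon))  # False, so R(3,3) > 5

# Universal claim: every 2-coloring of K_6 has a monochromatic triangle.
edges6 = list(combinations(range(6), 2))
all_forced = all(
    has_mono_triangle(6, dict(zip(edges6, bits)))
    for bits in product((0, 1), repeat=len(edges6))
)
print(all_forced)  # True, so R(3,3) <= 6
```

Exhaustive search works here only because 2^15 is small. For K_18 the same approach would face 2^153 colorings, which is why a witness from a model never substitutes for the upper-bound argument.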

"""
Self-contained Ramsey-type counterexample search.
Task: Find a 2-coloring of K_5 edges with no monochromatic triangle.
Known answer: Yes, such colorings exist (e.g., a 5-cycle in red, the complementary 5-cycle in blue).
This script uses o3 as hypothesis generator + brute-force verifier as oracle.
"""
import openai
import os
import ast
from itertools import combinations

client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def brute_force_check_no_monochromatic_triangle(coloring: dict, n: int = 5) -> dict:
    """
    Check if a 2-coloring of K_n edges (0=red, 1=blue) has no monochromatic triangle.
    coloring: dict mapping (i,j) -> 0 or 1 for i < j
    """
    vertices = list(range(n))
    for tri in combinations(vertices, 3):
        u, v, w = tri
        edges = [
            coloring.get((min(u,v), max(u,v)), -1),
            coloring.get((min(v,w), max(v,w)), -1),
            coloring.get((min(u,w), max(u,w)), -1)
        ]
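        if -1 in edges:
            return {"valid": False, "error": f"Missing edge in triangle {tri}"}
        # Reject extra colors: a third color would pass the check trivially
        if any(c not in (0, 1) for c in edges):
            return {"valid": False, "error": f"Edge color outside {{0, 1}} in triangle {tri}"}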
        if edges[0] == edges[1] == edges[2]:  # Monochromatic!
            return {
                "valid": False,
                "error": f"Monochromatic triangle at vertices {tri} with color {edges[0]}",
                "witness": {"triangle": tri, "color": edges[0]}
            }
    return {"valid": True, "error": None}

def parse_coloring_from_response(response_text: str) -> dict | None:
    """
    Expect model to output coloring as a Python dict literal:
    COUNTEREXAMPLE: {(0,1): 0, (0,2): 1, (0,3): 0, (0,4): 1, ...}
    """
    pattern = r"COUNTEREXAMPLE:\s*(\{.+?\})"
    import re
    match = re.search(pattern, response_text, re.DOTALL)
    if not match:
        return None
    try:
        raw = match.group(1).replace("\n", " ")
        parsed = ast.literal_eval(raw)
        # Normalize keys to tuples with i < j
        return {(min(k[0],k[1]), max(k[0],k[1])): v for k, v in parsed.items()}
    except Exception:
        return None

messages = [
    {
        "role": "system",
        "content": (
            "You are a combinatorics expert. Find a 2-coloring (colors: 0=red, 1=blue) "
            "of all edges of K_5 (complete graph on vertices 0,1,2,3,4) such that "
            "NO triangle is monochromatic. K_5 has 10 edges: all pairs (i,j) with i<j. "
            "Output ONLY in this format:\n"
            "COUNTEREXAMPLE: {(0,1): 0, (0,2): 1, ...} (all 10 edges listed)"
        )
    },
    {
        "role": "user",
        "content": "Provide a valid 2-coloring of K_5 with no monochromatic triangle."
    }
]

response = client.chat.completions.create(
    model="o3",
    messages=messages,
    reasoning_effort="medium",  # This is simple enough for medium
    max_completion_tokens=4000
)

response_text = response.choices[0].message.content
print("Model response:")
print(response_text)

coloring = parse_coloring_from_response(response_text)
if coloring and len(coloring) == 10:
    result = brute_force_check_no_monochromatic_triangle(coloring, n=5)
    print(f"\nVerifier result: {result}")
    if result["valid"]:
        print("CONFIRMED: Valid 2-coloring of K_5 with no monochromatic triangle found.")
    else:
        print(f"INVALID: {result['error']}")
else:
    print(f"Parse failed or incomplete coloring. Got: {coloring}")

Common Issues and Fixes

Error: `TypeError: create() got an unexpected keyword argument 'reasoning_effort'`

Cause: Your installed openai SDK predates the reasoning_effort parameter. This guide treats 1.30.0 as the floor; if the error persists at or above that version, upgrade to the latest 1.x release.

Fix: Upgrade and pin the SDK version:

pip install "openai>=1.30.0"
pip freeze | grep openai  # Verify: should show openai==1.30.x or higher

In requirements.txt: openai>=1.30.0,<2.0.0

Error: Model returns plausible but mathematically wrong counterexamples

Cause: o3 can hallucinate mathematical verification steps. The model's internal reasoning may contain subtle arithmetic errors that produce a confident but incorrect conclusion.

Fix: Independent symbolic verification (Step 3) is mandatory. Never accept o3's self-reported verification. If the verifier consistently rejects candidates, tighten your prompt to require the model to output raw mathematical objects (e.g., explicit edge lists or integer sequences) rather than prose descriptions — prose is harder to parse and verify.

Error: `context_length_exceeded` during long reasoning chains

Cause: Multi-iteration feedback loops accumulate conversation history. After 5+ iterations, the combined prompt + reasoning tokens can exceed o3's 200K-token context window.

Fix: Trim old failed candidates from the message history. Keep only the system prompt, the last 2 failed attempts, and the latest user feedback:

def trim_messages(messages: list[dict], keep_last_n_pairs: int = 2) -> list[dict]:
    """Keep system message + last N assistant/user exchange pairs."""
    system = [m for m in messages if m["role"] == "system"]
    exchanges = [m for m in messages if m["role"] != "system"]
    # Keep last N*2 messages (N assistant + N user turns).
    # Use an explicit start index: exchanges[-0:] would keep everything.
    start = max(len(exchanges) - keep_last_n_pairs * 2, 0)
    return system + exchanges[start:]

Error: Rate limit errors on Tier 3 accounts when calling o3

Cause: o3 is restricted to Tier 4+ accounts. Tier 3 will hit either model_not_found or immediate rate limit errors.

Fix: Check your tier at platform.openai.com/account/limits. To upgrade: add a payment method and ensure $500+ in API spend history (thresholds change — check the limits page). As a short-term workaround, o3-mini is available at lower tiers with reduced reasoning depth.

| Symptom | Root Cause | Resolution |
|---|---|---|
| unexpected keyword argument 'reasoning_effort' | SDK < 1.30.0 | pip install "openai>=1.30.0" |
| Confident but wrong counterexamples | LLM self-verification hallucination | Mandatory SymPy/brute-force verifier |
| context_length_exceeded | Accumulated conversation history | Trim to last 2 exchange pairs |
| Rate limit on o3 | Tier 3 account | Upgrade to Tier 4 or use o3-mini |
| Truncated reasoning mid-proof | max_completion_tokens too low | Set >= 16000 for hard problems |

Frequently Asked Questions

Q: Is o3's output a formal mathematical proof or just a counterexample?

o3's output is not a formal proof in any verifiable sense. It produces natural language and mathematical notation that describes a candidate counterexample or argument. Even when the reasoning appears rigorous, it has not been checked by a proof assistant like Lean 4 or Coq. What o3 can reliably do is discover a specific mathematical object (a graph, a set, a sequence) that you then verify independently using symbolic tools or brute-force enumeration. Treat o3 as a very sophisticated hypothesis generator, not an automated theorem prover. The mathematical community's current standard is that any machine-generated result requires independent human and/or computer-verified confirmation before it can be considered established.

Q: Can I use o3-mini instead of o3 to reduce costs for conjecture testing?

o3-mini is available at lower API tiers and costs roughly 4–8x less per token than o3. For simple conjecture testing — small Ramsey checks, low-dimensional combinatorial searches, or generating candidate objects in well-structured domains — o3-mini at reasoning_effort='high' performs surprisingly well. However, for problems requiring deep multi-step mathematical reasoning across multiple domains (e.g., connecting algebraic structure to geometric properties), o3's additional reasoning capacity produces materially better candidates. A practical strategy: use o3-mini for rapid hypothesis generation (iterations 1–3) and switch to full o3 only when o3-mini's candidates consistently fail verification.
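The escalation strategy can be expressed as a small selection rule. This is a sketch of one possible policy, assuming a three-iteration warm-up as described above; the function name pick_model and the rule "escalate when the last three candidates all failed verification" are illustrative choices, not a recommendation from OpenAI.

```python
def pick_model(iteration: int, verified: list[bool], warmup: int = 3) -> str:
    """Choose a model name for the next call.

    iteration: 1-based index of the upcoming call.
    verified:  verification outcome of each earlier candidate, in order.
    """
    if iteration <= warmup:
        return "o3-mini"
    recent = verified[-warmup:]
    # Escalate only when the whole recent window failed verification.
    if len(recent) == warmup and not any(recent):
        return "o3"
    return "o3-mini"


print(pick_model(1, []))                      # o3-mini
print(pick_model(4, [False, False, False]))   # o3
print(pick_model(4, [False, True, False]))    # o3-mini
```

The returned string would be passed as the model argument of the API call, so the rest of the loop stays unchanged.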

Q: How do I cite or attribute a result found with AI assistance in a research context?

Norms in the mathematics community are still forming, but the current consensus (reflected in journals like Annals of Mathematics and preprint guidance from arXiv) is: disclose AI assistance in the methods section, specify the model and version (e.g., "OpenAI o3, accessed May 2025"), and make clear that the AI generated a candidate which was subsequently verified by [human/computer algebra system]. Do not list the AI as a co-author; authorship requires accountability. The Ellenberg-Gijswijt cap-set proof was a fully human result; AI-assisted discoveries are expected to clearly document the division of labor between machine generation and human/formal verification.