LLM SQL Generation Benchmark Results

We assessed how well popular LLMs generate accurate and efficient SQL from natural-language prompts. Using a 200-million-record dataset from GH Archive loaded into Tinybird, we asked each LLM to generate SQL for 50 prompts. The results are shown below and can be compared against a human baseline.
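As a rough illustration of this kind of evaluation, the sketch below asks a generator for SQL, runs it, and retries with the error message when the query fails. Everything in it is an assumption made for illustration: `run_prompt` and `fake_model` are hypothetical names, the retry limit of three is only suggested by the attempt counts in the results, and SQLite stands in for the real Tinybird backend.

```python
import sqlite3
import time

MAX_ATTEMPTS = 3  # assumption: no result below shows more than 3 attempts


def run_prompt(conn, prompt, generate_sql):
    """Ask generate_sql for a query and retry with the error text on failure."""
    error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        sql = generate_sql(prompt, error)
        start = time.perf_counter()
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)  # fed back to the generator on the next attempt
            continue
        return {
            "status": "Success",
            "first_attempt": attempt == 1,
            "attempts": attempt,
            "query_ms": (time.perf_counter() - start) * 1000,
            "query_length": len(sql),
            "rows": rows,
        }
    return {"status": "Failed", "first_attempt": False,
            "attempts": MAX_ATTEMPTS, "rows": []}


def fake_model(prompt, previous_error):
    # Hypothetical stand-in for an LLM call: the first answer uses a column
    # that does not exist, the retry uses the correct one.
    if previous_error is None:
        return "SELECT repo FROM events"
    return ("SELECT repo_name, count(*) AS stars FROM events "
            "GROUP BY repo_name ORDER BY stars DESC LIMIT 10")


# Tiny in-memory table standing in for the GH Archive data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (repo_name TEXT, event_type TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a/x", "WatchEvent"), ("a/x", "WatchEvent"), ("b/y", "WatchEvent")],
)

result = run_prompt(conn, "top repositories by stars", fake_model)
print(result["status"], result["attempts"], result["rows"])
```

In this toy run the first query fails on the unknown column and the second succeeds, so the result is a success that did not pass on the first attempt.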

Model Results for "top 10 Repositories with the most steady star growth rate over time"

Model                           Status   First attempt  Score  Query time  LLM gen time  Attempts  Rows read   Query length  Tokens  Data read
human                           Success  Yes            --     260 ms      0 s           1         7,319,235   446           0       48.82 MB
claude-3.5-sonnet               Success  Yes            0.00   174 ms      3.444 s       1         7,319,235   368           5,410   48.82 MB
claude-3.7-sonnet               Failed   No             0.00   21 ms       4.407 s       3         0           692           6,115   0.00 MB
deepseek-chat-v3-0324           Success  Yes            0.00   68 ms       1.703 s       1         7,319,235   146           4,176   20.90 MB
deepseek-chat-v3-0324:free      Success  Yes            0.00   99 ms       6.679 s       1         7,319,235   244           4,578   48.82 MB
gemini-2.0-flash-001            Success  Yes            0.00   94 ms       0.929 s       1         7,319,235   153           4,702   20.90 MB
gemini-2.5-flash-preview        Success  Yes            0.00   87 ms       1.717 s       1         7,319,235   205           4,716   48.82 MB
gemini-2.5-pro-preview-05-06    Failed   No             0.00   13 ms       142.744 s     3         0           491           14,050  0.00 MB
llama-4-maverick                Success  Yes            0.00   69 ms       2.212 s       1         7,319,235   193           4,217   27.88 MB
llama-4-scout                   Success  Yes            0.00   1,625 ms    1.614 s       1         56,050,135  419           4,280   373.56 MB
llama-3.3-70b-instruct          Failed   No             0.00   19 ms       2.118 s       3         0           161           4,427   0.00 MB
ministral-8b                    Failed   No             0.00   22 ms       0.938 s       3         0           197           4,841   0.00 MB
mistral-small-3.1-24b-instruct  Failed   No             0.00   17 ms       3.227 s       3         0           408           5,004   0.00 MB
mistral-nemo                    Success  Yes            0.00   80 ms       3.66 s        1         7,319,235   202           4,613   20.90 MB
gpt-4.1                         Success  Yes            0.00   74 ms       1.676 s       1         7,319,235   296           4,241   48.82 MB
gpt-4.1-nano                    Success  Yes            0.00   75 ms       1.347 s       1         7,319,235   327           4,247   48.82 MB
gpt-4o-mini                     Success  No             0.00   99 ms       2.187 s       2         7,319,235   351           4,391   48.82 MB
o3-mini                         Success  No             0.00   259 ms      19.099 s      3         7,319,235   335           6,600   48.82 MB
o4-mini                         Success  Yes            0.00   150 ms      23.248 s      1         7,319,235   445           5,975   48.82 MB
o4-mini-high                    Success  Yes            0.00   292 ms      47.537 s      1         7,319,235   369           8,960   48.82 MB
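
To make the comparison against the human baseline concrete, the short script below tallies a few figures from the results above. The values are copied from the results; the field names (`status`, `attempts`, `query_ms`, `data_read_mb`) are labels chosen for this sketch and are not taken from the benchmark itself.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    status: str
    attempts: int
    query_ms: int
    data_read_mb: float


# Values copied from the results above; field names are labels for this sketch.
RESULTS = [
    Result("human", "Success", 1, 260, 48.82),
    Result("claude-3.5-sonnet", "Success", 1, 174, 48.82),
    Result("claude-3.7-sonnet", "Failed", 3, 21, 0.00),
    Result("deepseek-chat-v3-0324", "Success", 1, 68, 20.90),
    Result("deepseek-chat-v3-0324:free", "Success", 1, 99, 48.82),
    Result("gemini-2.0-flash-001", "Success", 1, 94, 20.90),
    Result("gemini-2.5-flash-preview", "Success", 1, 87, 48.82),
    Result("gemini-2.5-pro-preview-05-06", "Failed", 3, 13, 0.00),
    Result("llama-4-maverick", "Success", 1, 69, 27.88),
    Result("llama-4-scout", "Success", 1, 1625, 373.56),
    Result("llama-3.3-70b-instruct", "Failed", 3, 19, 0.00),
    Result("ministral-8b", "Failed", 3, 22, 0.00),
    Result("mistral-small-3.1-24b-instruct", "Failed", 3, 17, 0.00),
    Result("mistral-nemo", "Success", 1, 80, 20.90),
    Result("gpt-4.1", "Success", 1, 74, 48.82),
    Result("gpt-4.1-nano", "Success", 1, 75, 48.82),
    Result("gpt-4o-mini", "Success", 2, 99, 48.82),
    Result("o3-mini", "Success", 3, 259, 48.82),
    Result("o4-mini", "Success", 1, 150, 48.82),
    Result("o4-mini-high", "Success", 1, 292, 48.82),
]

baseline = RESULTS[0]  # the human-written query
llms = RESULTS[1:]

successes = [r for r in llms if r.status == "Success"]
first_try = [r for r in successes if r.attempts == 1]
fastest = min(successes, key=lambda r: r.query_ms)

# Successful models whose query read more or less data than the human query.
heavier = [r.model for r in successes if r.data_read_mb > baseline.data_read_mb]
lighter = [r.model for r in successes if r.data_read_mb < baseline.data_read_mb]

print(f"{len(successes)} of {len(llms)} models succeeded")       # 14 of 19
print(f"{len(first_try)} succeeded on the first attempt")        # 12
print(f"fastest query: {fastest.model} at {fastest.query_ms} ms")
print("read more data than the human query:", heavier)
print("read less data than the human query:", lighter)
```

For this prompt, 14 of the 19 models produced a working query and 12 of those did so on the first attempt. Only llama-4-scout read more data than the human query (373.56 MB against 48.82 MB), while four models read less.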