LLM SQL Generation Benchmark Results

We assessed the ability of popular LLMs to generate accurate and efficient SQL from natural language prompts. Using a 200-million-record dataset from the GH Archive uploaded to Tinybird, we asked each LLM to generate SQL for 50 natural language prompts. The results are shown below and can be compared against a human baseline.
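One way to judge whether a generated query is "exact" is to compare its result set against the result set of a baseline query written by a human. The sketch below illustrates that idea using SQLite from the Python standard library; the real benchmark ran against Tinybird, and the function name `results_match`, the toy schema, and the order-insensitive comparison are assumptions for illustration, not the benchmark's actual harness.

```python
import sqlite3

def results_match(conn, candidate_sql, baseline_sql):
    """Return True if two queries produce the same result set.

    A simplified stand-in for an exactness check: invalid SQL
    counts as a miss, and row order is ignored (an assumption;
    a prompt that demands ordering would need an ordered compare).
    """
    try:
        candidate = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return False  # query failed to parse or execute
    baseline = conn.execute(baseline_sql).fetchall()
    return sorted(candidate) == sorted(baseline)

# Tiny stand-in for the GH Archive events table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (repo TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a/b", "PushEvent"), ("a/b", "ForkEvent"), ("c/d", "PushEvent")],
)

baseline = "SELECT repo, COUNT(*) FROM events WHERE type = 'PushEvent' GROUP BY repo"
llm_sql  = "SELECT repo, COUNT(*) AS n FROM events WHERE type = 'PushEvent' GROUP BY repo"
print(results_match(conn, llm_sql, baseline))  # True: same rows, alias differs
```

A column alias does not change the result rows, so the two queries above count as a match; a query that fails to run, or returns different rows, would not.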

| Rank | Provider | Score | Valid (%) | Exact (%) | LLM latency (s) | Attempts | Query latency | Rows read | Data read |
|------|----------|-------|-----------|-----------|-----------------|----------|---------------|-----------|-----------|
| -- | human (baseline) | -- | -- | -- | -- | -- | 332.6 ms | 31,006,852 | 759.83 MB |
| #1 | anthropic | 76.82 | 97.55 | 56.08 | 3.149 | 1.10 | 374.224 ms | 40,099,998 | 824.57 MB |
| #2 | openai | 76.08 | 97.33 | 54.83 | 9.886 | 1.14 | 448.84 ms | 49,432,133 | 844.29 MB |
| #3 | anthropic | 75.55 | 98.68 | 52.41 | 3.234 | 1.02 | 388.96 ms | 37,145,042 | 684.44 MB |
| #4 | openai | 74.92 | 99.77 | 50.08 | 2.074 | 1.00 | 421.6 ms | 52,027,773 | 246.69 MB |
| #5 | deepseek | 73.86 | 98.62 | 49.10 | 5.366 | 1.24 | 362.62 ms | 39,914,537 | 612.03 MB |
| #6 | openai | 73.86 | 96.03 | 51.69 | 10.228 | 1.08 | 613.66 ms | 52,581,751 | 940.75 MB |
| #7 | meta-llama | 73.31 | 98.32 | 48.30 | 3.095 | 1.04 | 410.78 ms | 40,161,866 | 793.26 MB |
| #8 | openai | 72.59 | 94.92 | 50.26 | 21.133 | 1.04 | 702.64 ms | 68,364,075 | 1,005.01 MB |
| #9 | meta-llama | 72.40 | 99.85 | 44.96 | 2.048 | 1.04 | 289.875 ms | 39,101,618 | 134.66 MB |
| #10 | openai | 72.28 | 99.73 | 44.83 | 2.145 | 1.04 | 690.28 ms | 54,131,214 | 193.58 MB |
| #11 | google | 70.90 | 99.76 | 42.04 | 1.426 | 1.02 | 350.146 ms | 44,547,543 | 181.54 MB |
| #12 | google | 69.80 | 91.76 | 47.83 | 39.798 | 1.10 | 686.857 ms | 53,855,819 | 893.51 MB |
| #13 | google | 69.16 | 98.42 | 39.90 | 1.622 | 1.00 | 384.551 ms | 42,309,547 | 735.32 MB |
| #14 | deepseek | 68.17 | 83.13 | 53.21 | 5.875 | 1.11 | 383.682 ms | 38,010,973 | 813.72 MB |
| #15 | mistralai | 66.84 | 98.82 | 34.86 | 0.925 | 1.00 | 385.911 ms | 40,043,041 | 257.63 MB |
| #16 | mistralai | 63.87 | 97.73 | 30.00 | 3.307 | 1.09 | 680.644 ms | 48,641,279 | 222.69 MB |
| #17 | openai | 63.81 | 99.68 | 27.93 | 1.538 | 1.06 | 445.694 ms | 52,428,071 | 239.26 MB |
| #18 | mistralai | 45.67 | 47.31 | 44.02 | 1.809 | 1.00 | 376.5 ms | 37,893,118 | 912.60 MB |
| #19 | meta-llama | 17.78 | 0.00 | 35.56 | 3.501 | 1.21 | 445.242 ms | 38,658,489 | 992.39 MB |
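The per-model results above can also be summarized per provider. A minimal sketch (scores transcribed from the table; the aggregation is ours, not part of the benchmark) computes the mean score for each provider with standard Python:

```python
from collections import defaultdict

# (provider, score) pairs transcribed from the results table above
scores = [
    ("anthropic", 76.82), ("openai", 76.08), ("anthropic", 75.55),
    ("openai", 74.92), ("deepseek", 73.86), ("openai", 73.86),
    ("meta-llama", 73.31), ("openai", 72.59), ("meta-llama", 72.40),
    ("openai", 72.28), ("google", 70.90), ("google", 69.80),
    ("google", 69.16), ("deepseek", 68.17), ("mistralai", 66.84),
    ("mistralai", 63.87), ("openai", 63.81), ("mistralai", 45.67),
    ("meta-llama", 17.78),
]

by_provider = defaultdict(list)
for provider, score in scores:
    by_provider[provider].append(score)

# Print providers sorted by mean score, best first
for provider, vals in sorted(by_provider.items(),
                             key=lambda kv: -sum(kv[1]) / len(kv[1])):
    print(f"{provider:12s} {sum(vals) / len(vals):6.2f}")
```

Note that a mean hides variance: meta-llama's average, for example, is dragged down by its #19 entry, which scored 17.78 with 0% valid queries, even though its other two entries rank in the top ten.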