@kgilpin commented Sep 19, 2024

This branch is used for official LLM comparison data.

No changes should be applied to this branch other than fixes needed to get LLMs working.

Workflow updates and improvements should not be applied to this branch, because they would invalidate the comparisons.

Results spreadsheet: https://docs.google.com/spreadsheets/d/1GjOKDzVyrFN6rh_xIP96JaaJxXDOU2ok6a6Y1I3osbI/edit?gid=1776346381#gid=1776346381

Results

  • Limits: 3 test files, 3 test status retries, 3 code files, 3 code status retries
  • Context tokens: 16,000
  • Instances: 167
  • Characters per token: 4.2

| Metric | sonnet-20240620 | gpt-4o-2024-08-06 |
| --- | --- | --- |
| Date | 2024-09-19 | 2024-09-19 |
| Resolved % | 26.3% | 32.3% |
| Code file match % | 53% | 61% |
| Test file match % | 23% | 23% |
| Average cost | $1.35 | $0.94 |
| Avg elapsed time (min) | 8.7 | 5.5 |
| Resolved (=2) | 42% | 38% |
| Resolved (=3) | 77% | 67% |
| Input cost per 1MM | $3 | $2.50 |
| Output cost per 1MM | $15 | $10.00 |
| Sent chars | 254,092,577.00 | 234,585,594.00 |
| Received chars | 12,274,574.00 | 7,067,731.00 |
| Total cost | $225.33 | $156.46 |
| Stddev elapsed time | 5.35 | 4.20 |
| Lint repair average | 4.08 | 2.41 |
| Test gen average | 5.06 | 5.11 |
| Test gen success average | 3.92 | 3.36 |
| Code gen average | 4.85 | 4.61 |
| Edit test file % | 54% | 58% |
| Test patch gen % | 54% | 58% |
| Inverted patch gen % | 46% | 52% |
| Pass to Pass % | 81% | 92% |
| Pass to Fail % | 28% | 32% |
| Fail to Pass % | 15% | 20% |
| Average score | 1.27 | 1.39 |
| Resolved count | 44 | 54 |
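
The cost and resolution figures above appear to follow from the character counts, the 4.2 characters-per-token setting, and the per-1MM token prices. The sketch below is a rough cross-check, not the script that produced the spreadsheet: the formula (characters divided by characters per token, multiplied by the per-million price) is an assumption inferred from the numbers, and `estimate_cost`, `summarize`, and `RUNS` are hypothetical names introduced for illustration.

```python
# Hypothetical cross-check of the reported cost and resolution figures.
# Assumption: tokens are estimated as chars / CHARS_PER_TOKEN, and cost is
# tokens * (price per 1MM tokens) / 1,000,000. This is inferred, not confirmed.

CHARS_PER_TOKEN = 4.2
INSTANCES = 167

# Values copied from the results table.
RUNS = {
    "sonnet-20240620": {
        "sent_chars": 254_092_577,
        "received_chars": 12_274_574,
        "input_per_mm": 3.00,
        "output_per_mm": 15.00,
        "resolved_count": 44,
    },
    "gpt-4o-2024-08-06": {
        "sent_chars": 234_585_594,
        "received_chars": 7_067_731,
        "input_per_mm": 2.50,
        "output_per_mm": 10.00,
        "resolved_count": 54,
    },
}


def estimate_cost(sent_chars, received_chars, input_per_mm, output_per_mm):
    """Estimate total cost in dollars from character counts."""
    input_tokens = sent_chars / CHARS_PER_TOKEN
    output_tokens = received_chars / CHARS_PER_TOKEN
    return (input_tokens * input_per_mm + output_tokens * output_per_mm) / 1_000_000


def summarize(run):
    """Return (total cost, average cost per instance, resolved %) for one run."""
    total = estimate_cost(
        run["sent_chars"],
        run["received_chars"],
        run["input_per_mm"],
        run["output_per_mm"],
    )
    average = total / INSTANCES
    resolved_pct = 100 * run["resolved_count"] / INSTANCES
    return total, average, resolved_pct


if __name__ == "__main__":
    for name, run in RUNS.items():
        total, average, resolved_pct = summarize(run)
        print(f"{name}: total ${total:.2f}, avg ${average:.2f}, resolved {resolved_pct:.1f}%")
```

Under these assumptions the computed totals, per-instance averages, and resolved percentages agree with the table to the displayed precision, which suggests the spreadsheet estimates tokens from character counts rather than from provider-reported usage.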
