nomerge: LLM compare #52

kgilpin · 2024-09-19T13:39:10Z

This branch is used for official LLM comparison data.

No changes should be applied to this branch, other than fixes to get LLMs working.

Workflow updates and improvements should not be applied to this branch, because they would invalidate the comparisons.

Results spreadsheet: https://docs.google.com/spreadsheets/d/1GjOKDzVyrFN6rh_xIP96JaaJxXDOU2ok6a6Y1I3osbI/edit?gid=1776346381#gid=1776346381

Results

Limits: 3 test files, 3 test status retries, 3 code files, 3 code status retries
Context tokens: 16,000
Instances: 167
Characters per token: 4.2

Metric	sonnet-20240620	gpt-4o-2024-08-06
Date	2024-09-19	2024-09-19
Resolved %	26.3%	32.3%
Code file match %	53%	61%
Test file match %	23%	23%
Average cost	$1.35	$0.94
Avg elapsed time (min)	8.7	5.5
Resolved (=2)	42%	38%
Resolved (=3)	77%	67%
Input cost per 1MM tokens	$3.00	$2.50
Output cost per 1MM tokens	$15.00	$10.00
Sent chars	254,092,577	234,585,594
Received chars	12,274,574	7,067,731
Total cost	$225.33	$156.46
Stddev elapsed time	5.35	4.20
Lint repair average	4.08	2.41
Test gen average	5.06	5.11
Test gen success average	3.92	3.36
Code gen average	4.85	4.61
Edit test file %	54%	58%
Test patch gen %	54%	58%
Inverted patch gen %	46%	52%
Pass to Pass %	81%	92%
Pass to Fail %	28%	32%
Fail to Pass %	15%	20%
Average score	1.27	1.39
Resolved count	44	54
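
The cost and resolution rows above appear to follow from the raw character counts. A minimal sketch, assuming tokens are estimated as characters divided by the "Characters per token" figure and priced at the listed per-1MM rates (this formula is an assumption, not taken from the spreadsheet; `estimate_cost` and `summarize` are hypothetical helper names). Under that assumption the arithmetic reproduces the Total cost, Average cost, and Resolved % rows.

```python
# Figures are copied from the Results table above.
# The cost formula is an assumption: tokens ~= chars / CHARS_PER_TOKEN.
CHARS_PER_TOKEN = 4.2
INSTANCES = 167

RUNS = {
    "sonnet-20240620": {
        "sent_chars": 254_092_577,
        "received_chars": 12_274_574,
        "input_usd_per_mm": 3.00,
        "output_usd_per_mm": 15.00,
        "resolved": 44,
    },
    "gpt-4o-2024-08-06": {
        "sent_chars": 234_585_594,
        "received_chars": 7_067_731,
        "input_usd_per_mm": 2.50,
        "output_usd_per_mm": 10.00,
        "resolved": 54,
    },
}


def estimate_cost(run, chars_per_token=CHARS_PER_TOKEN):
    """Estimate total USD cost from character counts (hypothetical helper)."""
    input_tokens = run["sent_chars"] / chars_per_token
    output_tokens = run["received_chars"] / chars_per_token
    usd = (
        input_tokens * run["input_usd_per_mm"]
        + output_tokens * run["output_usd_per_mm"]
    ) / 1_000_000
    return usd


def summarize(run, instances=INSTANCES):
    """Derive the total cost, per-instance cost, and resolved percentage."""
    total = estimate_cost(run)
    return {
        "total_cost": round(total, 2),
        "average_cost": round(total / instances, 2),
        "resolved_pct": round(100 * run["resolved"] / instances, 1),
    }


if __name__ == "__main__":
    for name, run in RUNS.items():
        print(name, summarize(run))
```

Because the token counts are estimated from characters rather than read from the provider's usage data, the dollar figures are best treated as approximations that are comparable across the two models, not as billed amounts.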

kgilpin added 12 commits September 19, 2024 08:37

ci: Update solve.yml

5ca09aa

Use standard configuration for LLM compare.

Merge pull request #51 from getappmap/ci/run-defaults

3bfeb93

ci: Update solve.yml

feat: Add --no_link option to solver/import_solve_code_run.py

ac9e3c3

data: gpt-4o llm-compare 10941127186

1dfb7df

Merge pull request #54 from getappmap/compare/gpt-4o

1c787a7

compare: gpt-4o

fix: appmap-js fix/retry-claude-overload

3ff4d6b

data: Add sonnet_retry_error_2024-09-19

6e45755

feat: Add lower limits option

f0bf6ec

data: Sonnet runs 10949246453 and 10950797630

382d4e8

Run 10950797630 re-runs the examples that errored out in the first run, including those with "Content policy" errors. It yielded only one additional resolved issue.

Merge pull request #55 from getappmap/compare/sonnet

f1e3352

fix: appmap-js fix/retry-claude-overload

fix: Fix access to None content

a4d4105

feat: More limits options

9459806
