Repository evaluations - ox/Ox-Character

Evaluations/LLM As A Judge - Claude 3.7 Sonnet - PixArt-Sigma 20 Epochs

main

results-PixArt-Sigma-XL-2-1024-MS-20-epochs-2025-05-29_2222.parquet

Type: image → text

Model: Anthropic AI/Claude 3.7 Sonnet

Provider: Anthropic

Target field: prediction

Prompt

Judge the following image on a few different criteria. Be very critical. We are aiming for perfection.

Each of the judgements should be one of four values:

* "bad" if the image does not match the criteria
* "okay" if the image has elements of the criteria, but is not good yet
* "good" if the image matches the criteria, but could be better
* "perfect" if there is nothing that could be improved about the image

Return the judgements in XML format. The XML should contain the following field names and should be graded on the following criteria:

<character></character> Is the character a 3D Pixar-style white furry ox?
<task></task> Is the character performing the task described?
<objects></objects> Are all the necessary objects in the scene? Is there anything wrong with them? Do they look realistic?
<expression></expression> Is the character's expression wide open and happy, with a visible spark of joy or engagement, conveying satisfaction in the activity?
<texture></texture> The fur must show clear texture and depth, with soft lighting that avoids harsh shadows or bright highlights.
<coloring></coloring> The fur must not contain any yellow or sepia tones (e.g., check that RGB values of white areas are near (255, 255, 255)).
<background></background> The entire background must be pure white (#FFFFFF) with no visible gradient, vignette, or objects other than the ox, gears, and workbench.

An example response looks like this:

<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>

Prompt:
{prompt}

Image:
{image}

Reason through your thoughts step by step before responding. Put your thoughts in the <reasoning></reasoning> tags.
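The response format requested by the prompt is a flat sequence of tags rather than a single-rooted XML document, so a strict XML parser may reject it. A minimal sketch of one way to extract the judgements with regular expressions follows. The function name `parse_judgement` and the choice to map missing or unrecognized values to `None` are assumptions made for illustration, not part of the evaluation itself.

```python
import re

# Field names and allowed values are taken from the prompt above.
FIELDS = ("character", "task", "objects", "expression",
          "texture", "coloring", "background")
ALLOWED = {"bad", "okay", "good", "perfect"}


def parse_judgement(response: str) -> dict:
    """Extract the reasoning and the per-criterion grades from a judge response.

    Hypothetical helper: a missing tag or an unexpected grade yields None
    so that malformed rows can be filtered out later.
    """
    result = {}
    match = re.search(r"<reasoning>(.*?)</reasoning>", response, re.DOTALL)
    result["reasoning"] = match.group(1).strip() if match else None
    for field in FIELDS:
        match = re.search(rf"<{field}>(.*?)</{field}>", response, re.DOTALL)
        value = match.group(1).strip().lower() if match else None
        result[field] = value if value in ALLOWED else None
    return result


# The example response from the prompt, used as sample input.
example = """<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>"""

grades = parse_judgement(example)
```

Lower-casing the value before validation tolerates a judge that answers "Good" instead of "good"; anything outside the four allowed grades is still dropped.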

Queued: May 30, 2025, 1:12 AM UTC

Completed: May 30, 2025, 1:13 AM UTC

5 row sample

5 rows processed, 6393 tokens used ($0.0462)

Estimated cost for all 50 rows: $0.4617
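The full-run estimate appears to be a linear extrapolation from the 5-row sample. A minimal sketch of that arithmetic is below; the function name `estimate_total_cost` is hypothetical, and linear scaling is an assumption about how the figure was derived. Scaling the displayed sample cost gives about $0.462, slightly above the reported $0.4617, which is consistent with the displayed sample cost having been rounded to four decimals.

```python
def estimate_total_cost(sample_cost: float, sample_rows: int, total_rows: int) -> float:
    """Scale the cost of a sample linearly to the full dataset (assumed method)."""
    if sample_rows <= 0:
        raise ValueError("sample_rows must be positive")
    return sample_cost / sample_rows * total_rows


# Figures reported for the sample run above.
sample_cost = 0.0462    # dollars for the 5-row sample
sample_tokens = 6393    # tokens used by the 5-row sample

estimated_cost = estimate_total_cost(sample_cost, sample_rows=5, total_rows=50)
tokens_per_row = sample_tokens / 5

print(f"Estimated cost for 50 rows: ${estimated_cost:.4f}")
print(f"Average tokens per row: {tokens_per_row:.1f}")
```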

Sample Results completed

3 columns, 1-5 of 50 rows