Evaluations/LLM As A Judge - Claude 3.7 Sonnet - GPT Image Gen
main
results-openai.parquet
imagetext
Anthropic AI/Claude 3.7 Sonnet
prediction
# Image Judging Rubric
You are an animator looking at an artist's work. Judge the following image on a few different criteria. Be very critical. We are aiming for a movie-quality character.

## Valid Values
Each judgement should be one of four values:

* "bad" if the image does not match the criteria
* "okay" if the image has elements of the criteria, but is not good yet
* "good" if the image matches the criteria, but could be better
* "perfect" if there is nothing that could be improved about the image

## Criteria Descriptions
The criteria on which the image should be graded are as follows:

character:
Is the character a 3D Pixar-style white furry ox?

task:
Is the character performing the task described?

objects:
Are all the necessary objects in the scene? Is there anything wrong with them? Do they look realistic?

expression:
Is the character's expression wide open and happy, with a visible spark of joy or engagement, conveying satisfaction in the activity?

texture:
The fur must show clear texture and depth, with soft lighting that avoids harsh shadows or bright highlights.

coloring:
The fur must NOT contain any yellow or sepia tones. It should be a shade of white with a tone as if it lives in the Arctic or the Himalayas.

background:
The entire background must be pure white (#FFFFFF) with no visible gradient, vignette, or objects other than the ones specified in the prompt.

## Return Format
Return the judgements in XML format. Each tag should be named after the criterion it grades. An example response looks like this:

<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>

Reason through your thoughts step by step before responding. Put your thoughts in the <reasoning></reasoning> tags.

## Inputs
Prompt:
{prompt}

Image:
{image}
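
A reply in the Return Format described in the prompt could be turned into structured data roughly as follows. This is a minimal sketch and not the evaluation platform's own parser: the function name `parse_judgement` and the choice to map missing or invalid labels to `None` are assumptions made for illustration. The criteria names and the four allowed values come from the rubric.

```python
import re

# Criteria names and allowed labels are taken from the rubric.
CRITERIA = ["character", "task", "objects", "expression",
            "texture", "coloring", "background"]
VALID_VALUES = {"bad", "okay", "good", "perfect"}


def parse_judgement(response: str) -> dict:
    """Extract the reasoning and one label per criterion from a judge reply.

    Hypothetical helper: a missing tag or a label outside VALID_VALUES
    is stored as None so the caller can decide how to handle it.
    """
    result = {}

    # The reasoning block may span several lines, hence re.DOTALL.
    match = re.search(r"<reasoning>(.*?)</reasoning>", response, re.DOTALL)
    result["reasoning"] = match.group(1).strip() if match else None

    for name in CRITERIA:
        match = re.search(rf"<{name}>\s*(.*?)\s*</{name}>", response, re.DOTALL)
        value = match.group(1).strip().lower() if match else None
        result[name] = value if value in VALID_VALUES else None

    return result


example = """<reasoning>The fur is white but the task is not shown.</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>"""

judgement = parse_judgement(example)
print(judgement["task"])
```

Regular expressions are used instead of an XML parser because the reply has no single root element and the free-text reasoning may contain characters such as `<` or `&` that would make strict XML parsing fail.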
May 30, 2025, 5:52 AM UTC
May 30, 2025, 6:04 AM UTC
48 rows processed, 125334 tokens used ($0.6360)
completed
3 columns, 48 rows
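
The run totals reported above (48 rows, 125334 tokens, $0.6360) can be broken down per row with simple arithmetic. The sketch below is only illustrative: the function name `summarize_run` is hypothetical, and the per-million-token figure is a blended rate over all tokens, since the page does not split input and output tokens.

```python
def summarize_run(rows: int, tokens: int, cost_usd: float) -> dict:
    """Derive per-row averages from the run totals (hypothetical helper)."""
    return {
        "tokens_per_row": tokens / rows,
        "cost_per_row_usd": cost_usd / rows,
        # Blended rate: input and output tokens are not reported separately.
        "cost_per_million_tokens_usd": cost_usd / tokens * 1_000_000,
    }


summary = summarize_run(rows=48, tokens=125334, cost_usd=0.6360)
for key, value in summary.items():
    print(f"{key}: {value:.4f}")
```

With these totals, each judged image averages about 2611 tokens and costs roughly 1.3 cents.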