Repository evaluations - ox/Oxen-DiT-Training

Evaluations

Run models against your data

Evaluations let you test and compare a selection of AI models against your datasets.

Whether you're fine-tuning models or measuring performance, Oxen Evaluations runs a prompt over every row of a dataset for you.

Once you're happy with the results, output the resulting dataset to a new file, another branch, or directly as a new commit.
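The prompts listed on this page contain placeholders such as {prompt} and {image}, which suggests each dataset row is substituted into the template before it is sent to the model. The following is a minimal sketch of that idea, not Oxen's implementation: the function name `render_prompts`, the example rows, and the file paths are all hypothetical.

```python
# Hypothetical sketch: fill a prompt template once per dataset row.
# Nothing here is Oxen's API; rows and column names are illustrative only.

def render_prompts(template: str, rows: list[dict]) -> list[str]:
    """Return one rendered prompt per row by replacing {column} placeholders."""
    rendered = []
    for row in rows:
        text = template
        for column, value in row.items():
            # Replace the literal "{column}" marker with this row's value.
            text = text.replace("{" + column + "}", str(value))
        rendered.append(text)
    return rendered


# A shortened template with the same two placeholders used on this page.
template = "Prompt:\n{prompt}\n\nImage:\n{image}"

# Made-up rows standing in for a dataset file.
rows = [
    {"prompt": "an ox reading a book", "image": "images/0001.png"},
    {"prompt": "an ox riding a bike", "image": "images/0002.png"},
]

prompts = render_prompts(template, rows)
print(prompts[0])
```

Plain string replacement is used instead of `str.format` so that any other braces in a longer template are left untouched.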

LLM As A Judge - Claude 3.7 Sonnet - 50 Epochs PixArt-XL-2

e8c91504-0c56-4147-9386-51565dc31fbd

Anthropic AI/Claude 3.7 Sonnet (image → text)

3 weeks ago

Prompt

# Image Judging Rubric
You are an animator looking at an artist's work. Judge the following image on a few different criteria. Be very critical. We are aiming for a movie-quality character.

## Valid Values
Each of the judgements should be one of four values:

* "bad" if the image does not match the criteria
* "okay" if the image has elements of the criteria, but is not good yet
* "good" if the image matches the criteria, but could be better
* "perfect" if there is nothing that could be improved about the image

## Criteria Descriptions
The criteria on which the image should be graded are as follows:

character:
Is the character a 3D Pixar-style white furry ox?

task:
Is the character performing the task described?

objects:
Are all the necessary objects in the scene? Is there anything wrong with them? Do they look realistic?

expression:
Is the character's expression wide open and happy, with a visible spark of joy or engagement, conveying satisfaction in the activity?

texture:
The fur must show clear texture and depth, with soft lighting that avoids harsh shadows or bright highlights.

coloring:
The fur must NOT contain any yellow or sepia tones. It should be a shade of white with a tone as if it lives in the Arctic or the Himalayas.

background:
The entire background must be pure white (#FFFFFF) with no visible gradient, vignette, or objects other than the ones specified in the prompt.

## Return Format
Return the judgements in XML format. The XML should contain the criteria name in the tag. An example response looks like this:

<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>

Reason through your thoughts step by step before responding. Put your thoughts in the <reasoning></reasoning> tags.

## Inputs
Prompt:
{prompt}

Image:
{image}

main

results-models-PixArt-XL-2-512x512-50-epochs-2025-05-30_0709.parquet

main

results/PixArt-XL-2-512x512-50-epochs-2025-05-30_0709.parquet

completed, 50 rows, 19205 tokens, $0.1338, 1 iteration
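The rubric asks the judge to reply with one XML-style tag per criterion. A response in that shape has several top-level tags, so it is not a single well-formed XML document; the sketch below therefore uses regular expressions rather than an XML parser. It is an illustrative assumption about how the output column could be post-processed, and `parse_judgement` is a hypothetical helper, not part of Oxen.

```python
import re

# Criteria names and allowed values are taken from the rubric text.
CRITERIA = ["character", "task", "objects", "expression",
            "texture", "coloring", "background"]
VALID_VALUES = {"bad", "okay", "good", "perfect"}


def parse_judgement(response: str) -> dict:
    """Extract the reasoning and one label per criterion from a judge response.

    Missing tags or labels outside VALID_VALUES are returned as None.
    """
    result = {}
    for tag in ["reasoning"] + CRITERIA:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        value = match.group(1).strip() if match else None
        if tag != "reasoning" and value not in VALID_VALUES:
            value = None  # reject unexpected labels instead of guessing
        result[tag] = value
    return result


# The example response from the rubric's "Return Format" section.
example = """<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>"""

parsed = parse_judgement(example)
print(parsed["character"], parsed["task"])
```

Returning None for missing or invalid labels keeps malformed rows visible, so they can be counted or re-run rather than silently scored.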

LLM As A Judge - Claude 3.7 Sonnet - 1 Epochs PixArt-XL-2

37098f8b-da45-489e-9245-578610c2f2d1

Anthropic AI/Claude 3.7 Sonnet (image → text)

3 weeks ago

Prompt

# Image Judging Rubric
You are an animator looking at an artist's work. Judge the following image on a few different criteria. Be very critical. We are aiming for a movie-quality character.

## Valid Values
Each of the judgements should be one of four values:

* "bad" if the image does not match the criteria
* "okay" if the image has elements of the criteria, but is not good yet
* "good" if the image matches the criteria, but could be better
* "perfect" if there is nothing that could be improved about the image

## Criteria Descriptions
The criteria on which the image should be graded are as follows:

character:
Is the character a 3D Pixar-style white furry ox?

task:
Is the character performing the task described?

objects:
Are all the necessary objects in the scene? Is there anything wrong with them? Do they look realistic?

expression:
Is the character's expression wide open and happy, with a visible spark of joy or engagement, conveying satisfaction in the activity?

texture:
The fur must show clear texture and depth, with soft lighting that avoids harsh shadows or bright highlights.

coloring:
The fur must NOT contain any yellow or sepia tones. It should be a shade of white with a tone as if it lives in the Arctic or the Himalayas.

background:
The entire background must be pure white (#FFFFFF) with no visible gradient, vignette, or objects other than the ones specified in the prompt.

## Return Format
Return the judgements in XML format. The XML should contain the criteria name in the tag. An example response looks like this:

<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>

Reason through your thoughts step by step before responding. Put your thoughts in the <reasoning></reasoning> tags.

## Inputs
Prompt:
{prompt}

Image:
{image}

main

results-models-PixArt-XL-2-512x512-1-epochs-2025-05-30_0404.parquet

main

judgements/PixArt-XL-2-512x512-1-epochs-2025-05-30_0404.parquet

completed, 50 rows, 64274 tokens, $0.4503, 1 iteration

LLM As A Judge - Claude 3.7 Sonnet - 20 Epochs PixArt-XL-2

637ab794-04d0-4a83-91a8-e9a14e2eadb6

Anthropic AI/Claude 3.7 Sonnet (image → text)

3 weeks ago

Prompt

# Image Judging Rubric
You are an animator looking at an artist's work. Judge the following image on a few different criteria. Be very critical. We are aiming for a movie-quality character.

## Valid Values
Each of the judgements should be one of four values:

* "bad" if the image does not match the criteria
* "okay" if the image has elements of the criteria, but is not good yet
* "good" if the image matches the criteria, but could be better
* "perfect" if there is nothing that could be improved about the image

## Criteria Descriptions
The criteria on which the image should be graded are as follows:

character:
Is the character a 3D Pixar-style white furry ox?

task:
Is the character performing the task described?

objects:
Are all the necessary objects in the scene? Is there anything wrong with them? Do they look realistic?

expression:
Is the character's expression wide open and happy, with a visible spark of joy or engagement, conveying satisfaction in the activity?

texture:
The fur must show clear texture and depth, with soft lighting that avoids harsh shadows or bright highlights.

coloring:
The fur must NOT contain any yellow or sepia tones. It should be a shade of white with a tone as if it lives in the Arctic or the Himalayas.

background:
The entire background must be pure white (#FFFFFF) with no visible gradient, vignette, or objects other than the ones specified in the prompt.

## Return Format
Return the judgements in XML format. The XML should contain the criteria name in the tag. An example response looks like this:

<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>

Reason through your thoughts step by step before responding. Put your thoughts in the <reasoning></reasoning> tags.

## Inputs
Prompt:
{prompt}

Image:
{image}

main

results-models-PixArt-XL-2-512x512-20-epochs-2025-05-30_0512.parquet

main

judgements/PixArt-XL-2-512x512-20-epochs-2025-05-30_0512.parquet

completed, 50 rows, 38625 tokens, $0.2709, 1 iteration

LLM As A Judge - Claude 3.7 Sonnet - GPT Image Gen

01ed796a-9f65-4a64-b9ad-c78150f733c4

Anthropic AI/Claude 3.7 Sonnet (image → text)

3 weeks ago

Prompt

# Image Judging Rubric
You are an animator looking at an artist's work. Judge the following image on a few different criteria. Be very critical. We are aiming for a movie-quality character.

## Valid Values
Each of the judgements should be one of four values:

* "bad" if the image does not match the criteria
* "okay" if the image has elements of the criteria, but is not good yet
* "good" if the image matches the criteria, but could be better
* "perfect" if there is nothing that could be improved about the image

## Criteria Descriptions
The criteria on which the image should be graded are as follows:

character:
Is the character a 3D Pixar-style white furry ox?

task:
Is the character performing the task described?

objects:
Are all the necessary objects in the scene? Is there anything wrong with them? Do they look realistic?

expression:
Is the character's expression wide open and happy, with a visible spark of joy or engagement, conveying satisfaction in the activity?

texture:
The fur must show clear texture and depth, with soft lighting that avoids harsh shadows or bright highlights.

coloring:
The fur must NOT contain any yellow or sepia tones. It should be a shade of white with a tone as if it lives in the Arctic or the Himalayas.

background:
The entire background must be pure white (#FFFFFF) with no visible gradient, vignette, or objects other than the ones specified in the prompt.

## Return Format
Return the judgements in XML format. The XML should contain the criteria name in the tag. An example response looks like this:

<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>

Reason through your thoughts step by step before responding. Put your thoughts in the <reasoning></reasoning> tags.

## Inputs
Prompt:
{prompt}

Image:
{image}

main

results-openai.parquet

main

judgements/openai.parquet

completed, 48 rows, 125334 tokens, $0.6360, 1 iteration
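The four status lines on this page report rows, tokens, and cost for each run, which makes a per-row comparison straightforward. The sketch below only does arithmetic on those reported figures; the tuple layout and the `summarize` helper are assumptions made for illustration.

```python
# Figures copied from the status lines on this page:
# (evaluation name, rows, tokens, cost in USD)
RUNS = [
    ("50 Epochs PixArt-XL-2", 50, 19205, 0.1338),
    ("1 Epochs PixArt-XL-2", 50, 64274, 0.4503),
    ("20 Epochs PixArt-XL-2", 50, 38625, 0.2709),
    ("GPT Image Gen", 48, 125334, 0.6360),
]


def summarize(runs):
    """Compute tokens per row and cost per row for each run."""
    summary = []
    for name, rows, tokens, cost in runs:
        summary.append({
            "name": name,
            "tokens_per_row": tokens / rows,
            "cost_per_row": cost / rows,
        })
    return summary


summary = summarize(RUNS)
total_cost = sum(cost for _, _, _, cost in RUNS)

for item in summary:
    print(f'{item["name"]}: {item["tokens_per_row"]:.1f} tokens/row, '
          f'${item["cost_per_row"]:.4f}/row')
print(f"Total cost across runs: ${total_cost:.4f}")
```

Normalizing by row count matters here because the GPT Image Gen run covers 48 rows while the others cover 50.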