Evaluations
Run models against your data
Evaluations lets you test and compare a selection of AI models against your datasets.
Whether you're fine-tuning models or evaluating performance metrics, Oxen Evaluations simplifies the process, letting you quickly run prompts through an entire dataset.
Once you're happy with the results, output the resulting dataset to a new file, another branch, or directly as a new commit.
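To make "running a prompt through an entire dataset" concrete, here is a minimal sketch of the idea: a prompt template with `{column}` placeholders is filled in once per row. This is not Oxen's implementation. The rows, the column names, and the `render_prompts` helper are hypothetical, and for an image model the image would normally be attached to the request rather than pasted in as a path.

```python
# Hypothetical rows; in practice these would come from your dataset file.
rows = [
    {"prompt": "An ox reading a book", "image": "images/ox_reading.png"},
    {"prompt": "An ox riding a bike", "image": "images/ox_biking.png"},
]

# A prompt template whose {placeholders} name dataset columns.
template = "Prompt:\n{prompt}\n\nImage:\n{image}"


def render_prompts(template: str, rows: list[dict]) -> list[str]:
    """Fill the template once per row, looking up column values by name."""
    return [template.format(**row) for row in rows]


rendered = render_prompts(template, rows)
print(rendered[0])
```

Each rendered string is what would be sent to the model for that row. The model's response would then be stored in a new column next to the inputs.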
Anthropic AI/Claude 3.7 Sonnet (image → text)
Bessie
ox
3 weeks ago
# Image Judging Rubric
You are an animator looking at an artist's work. Judge the following image on a few different criteria. Be very critical. We are aiming for a movie-quality character.

## Valid Values
Each of the judgements should be one of four values:

* "bad" if the image does not match the criteria
* "okay" if the image has elements of the criteria, but is not good yet
* "good" if the image matches the criteria, but could be better
* "perfect" if there is nothing that could be improved about the image

## Criteria Descriptions
The criteria on which the image should be graded are as follows:

character:
Is the character a 3D Pixar-style white furry ox?

task:
Is the character performing the task described?

objects:
Are all the necessary objects in the scene? Is there anything wrong with them? Do they look realistic?

expression:
Is the character's expression wide open and happy, with a visible spark of joy or engagement, conveying satisfaction in the activity?

texture:
The fur must show clear texture and depth, with soft lighting that avoids harsh shadows or bright highlights.

coloring:
The fur must NOT contain any yellow or sepia tones. It should be a shade of white, with a tone as if the character lives in the Arctic or the Himalayas.

background:
The entire background must be pure white (#FFFFFF) with no visible gradient, vignette, or objects other than the ones specified in the prompt.

## Return Format
Return the judgements in XML format. Each tag should be named after its criterion. An example response looks like this:

<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>

Reason through your thoughts step by step before responding. Put your thoughts in the <reasoning></reasoning> tags.

## Inputs
Prompt:
{prompt}

Image:
{image}
completed 50 rows, 19205 tokens, $0.1338, 1 iteration
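Once responses in the rubric's return format come back, they need to be turned into structured columns. The sketch below shows one way to do that; the `parse_judgements` helper is hypothetical and not part of Oxen. It uses regular expressions rather than an XML parser, on the assumption that the response is a flat sequence of tags with no single root element.

```python
import re

# Criterion tags and allowed values, taken from the rubric.
CRITERIA = ["character", "task", "objects", "expression",
            "texture", "coloring", "background"]
VALID_VALUES = {"bad", "okay", "good", "perfect"}


def parse_judgements(response: str) -> dict:
    """Extract the reasoning and one judgement per criterion from a response."""
    result = {}
    for tag in ["reasoning", *CRITERIA]:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        result[tag] = match.group(1).strip() if match else None
    # Treat any judgement outside the allowed values as missing.
    for name in CRITERIA:
        if result[name] not in VALID_VALUES:
            result[name] = None
    return result


example = """<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>"""

parsed = parse_judgements(example)
print(parsed["character"], parsed["task"])
```

Mapping unexpected values to `None` makes malformed responses easy to filter out before the judgements are written back to the dataset.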
completed 50 rows, 64274 tokens, $0.4503, 1 iteration
completed 50 rows, 38625 tokens, $0.2709, 1 iteration
completed 48 rows, 125334 tokens, $0.6360, 1 iteration
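The row, token, and cost figures reported for the four runs can be compared with a little arithmetic. The sketch below copies those figures from the run summaries and derives the cost per row, the tokens per row, and the cost per 1,000 tokens. The `summarize` helper is hypothetical and exists only for this illustration.

```python
# Rows, tokens, and cost as reported for the four runs.
runs = [
    {"rows": 50, "tokens": 19205, "cost_usd": 0.1338},
    {"rows": 50, "tokens": 64274, "cost_usd": 0.4503},
    {"rows": 50, "tokens": 38625, "cost_usd": 0.2709},
    {"rows": 48, "tokens": 125334, "cost_usd": 0.6360},
]


def summarize(run: dict) -> dict:
    """Derive per-row and per-token figures for one run."""
    return {
        "cost_per_row": run["cost_usd"] / run["rows"],
        "tokens_per_row": run["tokens"] / run["rows"],
        "cost_per_1k_tokens": run["cost_usd"] / run["tokens"] * 1000,
    }


for run in runs:
    s = summarize(run)
    print(
        f"{run['rows']} rows: ${s['cost_per_row']:.4f}/row, "
        f"{s['tokens_per_row']:.0f} tokens/row, "
        f"${s['cost_per_1k_tokens']:.4f} per 1K tokens"
    )

total_cost = sum(run["cost_usd"] for run in runs)
print(f"Total across runs: ${total_cost:.4f}")
```

Tokens per row is a useful first check when two runs of the same prompt differ in cost, since larger inputs per row drive the total up even when the row count stays the same.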