Evaluations
Run models against your data
Evaluations lets you test and compare a selection of AI models against your own datasets.
Whether you're fine-tuning models or measuring performance metrics, Oxen Evaluations simplifies the process by letting you run prompts through an entire dataset.
Once you're happy with the results, output the resulting dataset to a new file, another branch, or directly as a new commit.
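The prompts listed on this page use placeholders such as {prompt} and {image}, which appear to stand for column values in each dataset row. A minimal sketch of that substitution, assuming rows are plain dictionaries keyed by column name (`render_prompt` and the sample rows are hypothetical, not part of the Oxen API):

```python
import re

# Matches placeholders such as {prompt} or {image}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def render_prompt(template: str, row: dict) -> str:
    """Replace each {column} placeholder with the matching value from the row."""

    def substitute(match: re.Match) -> str:
        column = match.group(1)
        if column not in row:
            raise KeyError(f"row has no column named {column!r}")
        return str(row[column])

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical template and dataset rows, for illustration only.
template = "Prompt:\n{prompt}\n\nImage:\n{image}"
rows = [
    {"prompt": "fixing a clock", "image": "images/ox_001.png"},
    {"prompt": "reading a book", "image": "images/ox_002.png"},
]

rendered = [render_prompt(template, row) for row in rows]
print(rendered[0])
```

A regular expression is used instead of `str.format` so that any other braces in a prompt are left untouched.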
Anthropic AI/Claude 3.7 Sonnet · image → text
Bessie
ox
2 months ago
# Image Judging Rubric
You are an animator looking at an artist's work. Judge the following image on a few different criteria. Be very critical. We are aiming for a movie-quality character.

## Valid Values
Each judgement should be one of four values:

* "bad" if the image does not match the criteria
* "okay" if the image has elements of the criteria, but is not good yet
* "good" if the image matches the criteria, but could be better
* "perfect" if there is nothing that could be improved about the image

## Criteria Descriptions
The criteria on which the image should be graded are as follows:

character:
Is the character a 3D Pixar-style white furry ox?

task:
Is the character performing the task described?

objects:
Are all the necessary objects in the scene? Is there anything wrong with them? Do they look realistic?

expression:
Is the character's expression wide open and happy, with a visible spark of joy or engagement, conveying satisfaction in the activity?

texture:
The fur must show clear texture and depth, with soft lighting that avoids harsh shadows or bright highlights.

coloring:
The fur must NOT contain any yellow or sepia tones. It should be a shade of white, as if the ox lives in the Arctic or the Himalayas.

background:
The entire background must be pure white (#FFFFFF) with no visible gradient, vignette, or objects other than the ones specified in the prompt.

## Return Format
Return the judgements in XML format. Each tag should be named after its criterion. An example response looks like this:

<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>

Reason through your thoughts step by step before responding. Put your thoughts in the <reasoning></reasoning> tags.

## Inputs
Prompt:
{prompt}

Image:
{image}
error: no case clause matching: {:error, {%{type: "bad_request", title: "Bad Request", detail: "new_path already exists"}, 400}, 0, 0} · 49 rows · 173266 tokens · $0.8797 · 1 iteration
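The return format requested in the rubric above is a flat sequence of tags rather than a single-rooted XML document, so a regular expression is one simple way to read it back. A minimal sketch, assuming the model follows the example response exactly (`parse_judgements` is a hypothetical helper, not an Oxen function):

```python
import re

# Criteria names and allowed values taken from the rubric above.
CRITERIA = ["character", "task", "objects", "expression",
            "texture", "coloring", "background"]
VALID_VALUES = {"bad", "okay", "good", "perfect"}


def parse_judgements(response: str) -> dict:
    """Extract the reasoning and one judgement per criterion from a response."""
    result = {}
    for name in ["reasoning"] + CRITERIA:
        match = re.search(rf"<{name}>(.*?)</{name}>", response, re.DOTALL)
        if match is None:
            raise ValueError(f"missing <{name}> tag")
        result[name] = match.group(1).strip()
    # Reject any judgement outside the rubric's allowed values.
    for name in CRITERIA:
        if result[name] not in VALID_VALUES:
            raise ValueError(f"invalid value for {name}: {result[name]!r}")
    return result


# The example response from the rubric.
example = """<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>"""

parsed = parse_judgements(example)
print(parsed["character"], parsed["background"])
```

Validating against the allowed values catches responses where the model drifts from the rubric, for example by answering "excellent".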
Anthropic AI/Claude 3.7 Sonnet · image → text
Bessie
ox
2 months ago
# Image Judging Rubric
You are an animator looking at an artist's work. Judge the following image on a few different criteria. Be very critical. We are aiming for a movie-quality character.

## Valid Values
Each judgement should be one of four values:

* "bad" if the image does not match the criteria
* "okay" if the image has elements of the criteria, but is not good yet
* "good" if the image matches the criteria, but could be better
* "perfect" if there is nothing that could be improved about the image

## Criteria Descriptions
The criteria on which the image should be graded are as follows:

character:
Is the character a 3D Pixar-style white furry ox?

task:
Is the character performing the task described?

objects:
Are all the necessary objects in the scene? Is there anything wrong with them? Do they look realistic?

expression:
Is the character's expression wide open and happy, with a visible spark of joy or engagement, conveying satisfaction in the activity?

texture:
The fur must show clear texture and depth, with soft lighting that avoids harsh shadows or bright highlights.

coloring:
The fur must NOT contain any yellow or sepia tones. It should be a shade of white, as if the ox lives in the Arctic or the Himalayas.

background:
The entire background must be pure white (#FFFFFF) with no visible gradient, vignette, or objects other than the ones specified in the prompt.

## Return Format
Return the judgements in XML format. Each tag should be named after its criterion. An example response looks like this:

<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>

Reason through your thoughts step by step before responding. Put your thoughts in the <reasoning></reasoning> tags.

## Inputs
Prompt:
{prompt}

Image:
{image}
completed · 5 row sample · 12751 tokens · $0.0656 · 1 iteration
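The sample run above reports 5 rows, 12751 tokens, and $0.0656. A rough sketch of how those figures could be turned into per-row averages and a cost estimate for a larger run, assuming cost scales linearly with row count (an assumption: image sizes and response lengths vary per row, and the 49-row run listed earlier on this page reported $0.8797):

```python
# Figures reported for the 5 row sample run.
sample_rows = 5
sample_tokens = 12751
sample_cost = 0.0656  # US dollars

# Per-row averages from the sample.
tokens_per_row = sample_tokens / sample_rows
cost_per_row = sample_cost / sample_rows

# Linear extrapolation to a 49-row dataset (assumption, see above).
full_rows = 49
estimated_cost = cost_per_row * full_rows

print(f"{tokens_per_row:.1f} tokens per row")
print(f"${cost_per_row:.5f} per row")
print(f"estimated ${estimated_cost:.4f} for {full_rows} rows")
```

The estimate is only a guide for deciding whether a full run is affordable before starting it.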
Anthropic AI/Claude 3.7 Sonnet · image → text
Bessie
ox
2 months ago
Judge the following image on a few different criteria. Be very critical. We are aiming for perfection.

Each judgement should be one of four values:

* "bad" if the image does not match the criteria
* "okay" if the image has elements of the criteria, but is not good yet
* "good" if the image matches the criteria, but could be better
* "perfect" if there is nothing that could be improved about the image

Return the judgements in XML format. The XML should contain the following field names, graded on the following criteria:

<character></character> Is the character a 3D Pixar-style white furry ox?
<task></task> Is the character performing the task described?
<objects></objects> Are all the necessary objects in the scene? Is there anything wrong with them? Do they look realistic?
<expression></expression> Is the character's expression wide open and happy, with a visible spark of joy or engagement, conveying satisfaction in the activity?
<texture></texture> The fur must show clear texture and depth, with soft lighting that avoids harsh shadows or bright highlights.
<coloring></coloring> The fur must not contain any yellow or sepia tones (e.g., check that RGB values of white areas are near (255, 255, 255)).
<background></background> The entire background must be pure white (#FFFFFF) with no visible gradient, vignette, or objects other than the ox, gears, and workbench.

An example response looks like this:

<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>

Prompt:
{prompt}

Image:
{image}

Reason through your thoughts step by step before responding. Put your thoughts in the <reasoning></reasoning> tags.
completed · 50 rows · 61799 tokens · $0.4394 · 2 iterations
3688acba-de8d-4344-85cc-7db6411fd2a4
Anthropic AI/Claude 3.7 Sonnet · image → text
Bessie
ox
2 months ago
Judge the following image on a few different criteria. Be very critical. We are aiming for perfection.

Each judgement should be one of four values:

* "bad" if the image does not match the criteria
* "okay" if the image has elements of the criteria, but is not good yet
* "good" if the image matches the criteria, but could be better
* "perfect" if there is nothing that could be improved about the image

Return the judgements in XML format. The XML should contain the following field names, graded on the following criteria:

<character></character> Is the character a 3D Pixar-style white furry ox?
<task></task> Is the character performing the task described?
<objects></objects> Are all the necessary objects in the scene? Is there anything wrong with them? Do they look realistic?
<expression></expression> Is the character's expression wide open and happy, with a visible spark of joy or engagement, conveying satisfaction in the activity?
<texture></texture> The fur must show clear texture and depth, with soft lighting that avoids harsh shadows or bright highlights.
<coloring></coloring> The fur must not contain any yellow or sepia tones (e.g., check that RGB values of white areas are near (255, 255, 255)).
<background></background> The entire background must be pure white (#FFFFFF) with no visible gradient, vignette, or objects other than the ox, gears, and workbench.

An example response looks like this:

<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>

Prompt:
{prompt}

Image:
{image}

Reason through your thoughts step by step before responding. Put your thoughts in the <reasoning></reasoning> tags.
completed · 49 rows · 125152 tokens · $0.6350 · 2 iterations
a02538e1-fe15-4a10-97d1-f25b44013de6
OpenAI/DALL-E 3 · text → image
Bessie
ox
2 months ago
A cute while handsome, 3D Pixar-style white ox character {prompt}. The ox should have soft, fluffy fur and a slightly rounded, friendly body. The ox has large, curved beige horns, a big pink nose, expressive brown eyes, and a gentle, content smile. The ox should have wide open happy eyes, with a little spark like he is content with the activity he is doing. The character should have an endearing pose with a bit of gravitas. The fur is detailed with soft, realistic texturing and light shading, giving a plush, huggable appearance. The lighting is soft and even, highlighting the texture of the fur. The background is pure white, creating a clean studio look that emphasizes the character. The overall tone is whimsical, heartwarming, yet regal, suitable for an animated feature. The image should NOT contain any yellowish or vintage tone. The ox should have the same texture, lighting and shading as the reference picture.
completed · 50 rows · 0 tokens · $2.00 · 2 iterations
b03ff09e-491c-48ee-adca-fc26b538a6ca
Anthropic AI/Claude 3.7 Sonnet · image → text
Bessie
ox
2 months ago
Judge the following image on a few different criteria. Be very critical. We are aiming for perfection.

Each judgement should be one of four values:

* "bad" if the image does not match the criteria
* "okay" if the image has elements of the criteria, but is not good yet
* "good" if the image matches the criteria, but could be better
* "perfect" if there is nothing that could be improved about the image

Return the judgements in XML format. The XML should contain the following field names, graded on the following criteria:

<character></character> Is the character a 3D Pixar-style white furry ox?
<task></task> Is the character performing the task described?
<objects></objects> Are all the necessary objects in the scene? Is there anything wrong with them? Do they look realistic?
<expression></expression> Is the character's expression wide open and happy, with a visible spark of joy or engagement, conveying satisfaction in the activity?
<texture></texture> The fur must show clear texture and depth, with soft lighting that avoids harsh shadows or bright highlights.
<coloring></coloring> The fur must not contain any yellow or sepia tones (e.g., check that RGB values of white areas are near (255, 255, 255)).
<background></background> The entire background must be pure white (#FFFFFF) with no visible gradient, vignette, or objects other than the ox, gears, and workbench.

An example response looks like this:

<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>

Prompt:
{prompt}

Image:
{image}

Reason through your thoughts step by step before responding. Put your thoughts in the <reasoning></reasoning> tags.
completed · 50 rows · 63003 tokens · $0.4475 · 11 iterations
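Once every row has a judgement per criterion, the results can be summarized to see which criteria the generated images fail most often. A minimal sketch, assuming the judgements have already been parsed into dictionaries; the sample data is made up for illustration, and the 0 to 3 score mapping is an assumption rather than something the prompts define:

```python
from collections import Counter

# Hypothetical ordinal mapping of the four rubric values.
SCORES = {"bad": 0, "okay": 1, "good": 2, "perfect": 3}


def tally(judgements: list, criterion: str) -> Counter:
    """Count how often each value was given for one criterion."""
    return Counter(row[criterion] for row in judgements)


def mean_score(judgements: list, criterion: str) -> float:
    """Average the ordinal scores for one criterion across all rows."""
    values = [SCORES[row[criterion]] for row in judgements]
    return sum(values) / len(values)


# Made-up judgements for three rows, for illustration only.
judgements = [
    {"character": "good", "background": "perfect"},
    {"character": "bad", "background": "perfect"},
    {"character": "good", "background": "okay"},
]

character_counts = tally(judgements, "character")
background_mean = mean_score(judgements, "background")
print(character_counts)
print(round(background_mean, 2))
```

Comparing the per-criterion means between prompt iterations is one way to check whether a revised image prompt actually improved the outputs.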