Repository evaluations - ox/Ox-Character

Evaluations/LLM As A Judge - Gemini 2.0 Flash - Model Epoch 50

model-epoch-50

results.parquet

Type: image → text

Model: OpenAI/GPT 4o

Provider: OpenAI

Target field: prediction

Prompt

Judge the following image on a few different criteria. Be very critical. We are aiming for perfection.

Each of the judgements should be one of three values:

* "bad" if the image does not match the description
* "good" if the image matches the criteria, but could be better
* "perfect" if there is nothing that could be improved about the image

Return the judgements in XML format. The XML should contain the following fields, each graded on the criterion listed next to it:

<character></character> Is the character a 3D Pixar-style white furry ox?
<task></task> Is the character performing the task described?
<objects></objects> Are all the necessary objects in the scene? Is there anything wrong with them? Do they look realistic?
<expression></expression> Is the character's expression wide open and happy, with a visible spark of joy or engagement, conveying satisfaction in the activity?
<texture></texture> The fur must show clear texture and depth, with soft lighting that avoids harsh shadows or bright highlights.
<coloring></coloring> The fur must not contain any yellow or sepia tones (e.g., check that RGB values of white areas are near (255, 255, 255)).
<background></background> The entire background must be pure white (#FFFFFF) with no visible gradient, vignette, or objects other than the ox, gears, and workbench.

An example response looks like this:

<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>

Prompt:
{prompt}

Image:
{image}

Reason through your thoughts step by step before responding. Put your thoughts in the <reasoning></reasoning> tags.
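
The response format requested above can be turned into one verdict per criterion with a small parser. The following is a minimal sketch, not the platform's own parsing logic: the function name `parse_judgements` and the constants `CRITERIA` and `ALLOWED` are hypothetical, introduced here for illustration. It uses regular expressions rather than an XML parser, on the assumption that the response has several top-level tags and that the free-text reasoning may contain characters that are not valid XML.

```python
import re

# Tag names and allowed values are taken from the prompt above.
CRITERIA = ("character", "task", "objects", "expression",
            "texture", "coloring", "background")
ALLOWED = {"bad", "good", "perfect"}


def parse_judgements(response: str) -> dict:
    """Return {criterion: verdict}; verdict is None if missing or invalid."""
    verdicts = {}
    for name in CRITERIA:
        # Non-greedy match of whatever sits between <name> and </name>.
        match = re.search(rf"<{name}>\s*(.*?)\s*</{name}>", response, re.DOTALL)
        value = match.group(1).strip().strip('"').lower() if match else None
        verdicts[name] = value if value in ALLOWED else None
    return verdicts


# The example response from the prompt, used as a smoke test.
example = """<reasoning>Your reasoning</reasoning>
<character>good</character>
<task>bad</task>
<objects>bad</objects>
<expression>perfect</expression>
<texture>perfect</texture>
<coloring>good</coloring>
<background>perfect</background>"""

print(parse_judgements(example))
```

Mapping anything outside the three allowed values to `None` keeps malformed responses visible instead of silently counting them as a grade.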

Queued: May 29, 2025, 6:20 PM UTC

Completed: May 29, 2025, 6:20 PM UTC

5 row sample

5 rows processed, 4340 tokens used ($0.0186)

Estimated cost for all 50 rows: $0.1864
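
The full-run estimate appears to be a linear extrapolation from the 5-row sample. A minimal sketch of that arithmetic follows, assuming cost scales with row count; the function name `estimate_total_cost` is hypothetical. Using the rounded sample figure of $0.0186 gives roughly $0.186, so the displayed $0.1864 presumably comes from the unrounded sample cost.

```python
def estimate_total_cost(sample_cost: float, sample_rows: int, total_rows: int) -> float:
    """Scale the cost of a sample linearly to the full row count."""
    if sample_rows <= 0:
        raise ValueError("sample_rows must be positive")
    return sample_cost / sample_rows * total_rows


# Figures from the sample run above: 5 rows cost $0.0186, 50 rows in total.
estimate = estimate_total_cost(0.0186, 5, 50)
print(f"Estimated cost for all 50 rows: ${estimate:.4f}")
```

The estimate is only as good as the sample: rows with longer prompts or longer reasoning would push the real cost above it.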

Sample Results completed

3 columns, 1-5 of 50 rows