Evaluations
Run models against your data
Evaluations let you test and compare a selection of AI models against your own datasets.
Whether you're fine-tuning models or measuring their performance, Oxen Evaluations lets you run a prompt through an entire dataset in a single step.
Once you're happy with the results, write the resulting dataset to a new file, to another branch, or directly as a new commit.
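To make the idea of running a prompt through a dataset concrete, here is a minimal sketch in plain Python. It is not the Oxen API: `render_prompt`, `run_evaluation`, and the `model` callable are hypothetical names used only for illustration, and the sketch assumes that placeholders such as {question} in the prompts below are filled from the column of the same name in each row.

```python
from typing import Callable, Dict, List


def render_prompt(template: str, row: Dict[str, str]) -> str:
    """Fill each {column} placeholder with the value from one dataset row."""
    # Assumption: the template contains no literal braces other than placeholders.
    return template.format(**row)


def run_evaluation(
    template: str,
    rows: List[Dict[str, str]],
    model: Callable[[str], str],
    output_column: str = "prediction",
) -> List[Dict[str, str]]:
    """Run the prompt once per row and store the model output in a new column."""
    results = []
    for row in rows:
        prompt = render_prompt(template, row)
        # Copy the row so the input dataset is left unchanged.
        results.append({**row, output_column: model(prompt)})
    return results


def placeholder_model(prompt: str) -> str:
    """Stand-in for a real model call; replace with a call to your provider."""
    return f"[model output for a {len(prompt)}-character prompt]"


if __name__ == "__main__":
    dataset = [{"question": "What is a fever?"}]
    template = "Answer the following question in plain English.\n\n{question}"
    for result in run_evaluation(template, dataset, placeholder_model):
        print(result)
```

The output is a new list of rows with one extra column, which mirrors the idea of saving the results as a new dataset.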
OpenAI: OpenAI/GPT 4o mini (text → text)
Bessie
ox
2 months ago
Consider the given question and two answers. The first answer is the gold standard, correct answer. The second answer may or may not be correct. Compare the text in the two answers and determine whether the second answer is correct. Provide a brief explanation for why the answer is correct or not before arriving at the final verdict (Yes/No). Provide a final verdict for whether the second answer is correct the end in the given format:

Is Correct:
Yes

or

Is Correct:
No

Do not deviate from the specified format for the final verdict.

Question:
{question}

First Answer:
{answer}

Second Answer:
{prediction}
completed, 1000 rows, 798970 tokens, $0.1852, 3 iterations
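The judge prompt above asks the model to end its response with "Is Correct:" followed by Yes or No. A minimal sketch of how that verdict could be extracted and turned into an accuracy figure follows; `parse_verdict` and `accuracy` are hypothetical helper names, and treating the last verdict in a response as the final one is an assumption, not something the page specifies.

```python
import re
from typing import Iterable, Optional

# Matches "Is Correct:" followed by Yes or No, on the same line or the next one.
VERDICT_RE = re.compile(r"Is Correct:\s*(Yes|No)\b", re.IGNORECASE)


def parse_verdict(response: str) -> Optional[bool]:
    """Return True for Yes, False for No, or None if no verdict is found."""
    matches = VERDICT_RE.findall(response)
    if not matches:
        return None
    # Assumption: if the model repeats the pattern, the last one is the verdict.
    return matches[-1].lower() == "yes"


def accuracy(responses: Iterable[str]) -> Optional[float]:
    """Fraction of Yes verdicts among responses that contain a verdict."""
    verdicts = [parse_verdict(r) for r in responses]
    scored = [v for v in verdicts if v is not None]
    if not scored:
        return None
    return sum(scored) / len(scored)
```

Responses that do not follow the format are left out of the score here rather than counted as incorrect; counting them as failures would be an equally reasonable choice.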
OpenAI: OpenAI/GPT 4o mini (text → text)
Bessie
ox
2 months ago
Consider the given question and two answers. The first answer is the gold standard, correct answer. The second answer may or may not be correct. Compare the text in the two answers and determine whether the second answer is correct. Provide a brief explanation for why the answer is correct or not before arriving at the final verdict (Yes/No). Provide a final verdict for whether the second answer is correct the end in the given format:

Is Correct:
Yes

or

Is Correct:
No

Do not deviate from the specified format for the final verdict.

Question:
{question}

First Answer:
{answer}

Second Answer:
{prediction}
completed, 1000 rows, 860866 tokens, $0.1850, 2 iterations
fab2b90a-a789-4e7e-8c4c-0a6981af5dfb
Meta: Meta/Llama 3.1 8B Instruct Turbo (text → text)
Bessie
ox
2 months ago
pretend you are a medical professional answer the following question

{question}
validation
completed, 5 row sample, 3105 tokens, $0.0006, 1 iteration
62018aa7-a5bc-4b88-bea2-b68c2f630856
OpenAI: OpenAI/GPT 4o mini (text → text)
Bessie
ox
2 months ago
Consider the given question and two answers. The first answer is the gold standard, correct answer. The second answer may or may not be correct. Compare the text in the two answers and determine whether the second answer is correct. Provide a brief explanation for why the answer is correct or not before arriving at the final verdict (Yes/No). Provide a final verdict for whether the second answer is correct the end in the given format:

Is Correct:
Yes

or

Is Correct:
No

Do not deviate from the specified format for the final verdict.

Question:
{question}

First Answer:
{answer}

Second Answer:
{prediction}
completed, 1000 rows, 885038 tokens, $0.2045, 2 iterations
55a801bc-e95c-4c85-afee-ca6e901c6b44
OpenAI: OpenAI/GPT 4o mini (text → text)
Bessie
ox
2 months ago
Consider the given question and two answers. The first answer is the gold standard, correct answer. The second answer may or may not be correct. Compare the text in the two answers and determine whether the second answer is correct. Provide a brief explanation for why the answer is correct or not before arriving at the final verdict (Yes/No). Provide a final verdict for whether the second answer is correct the end in the given format:

Is Correct:
Yes

or

Is Correct:
No

Do not deviate from the specified format for the final verdict.

Question:
{question}

First Answer:
{answer}

Second Answer:
{prediction}
completed, 1000 rows, 1024253 tokens, $0.2276, 4 iterations
31c3f6ea-9c02-424f-b04d-7b499d8f2021
Meta: Meta/Llama 3.2 3B Instruct Turbo (text → text)
Bessie
ox
2 months ago
You are a medical professional who is helping a patient. Patients will ask you questions and you will answer them in plain English so that anyone can understand.

{question}
completed, 1000 rows, 484155 tokens, $0.0290, 2 iterations
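The row counts, token counts, and costs listed for the runs above can be put on a common scale for comparison. The sketch below derives tokens per row and cost per million tokens from those figures; `summarize` is a hypothetical helper, and because the displayed costs are rounded, the derived values are approximate.

```python
# (model, rows, tokens, cost in USD), copied from the runs listed above.
RUNS = [
    ("OpenAI/GPT 4o mini", 1000, 798970, 0.1852),
    ("OpenAI/GPT 4o mini", 1000, 860866, 0.1850),
    ("Meta/Llama 3.1 8B Instruct Turbo", 5, 3105, 0.0006),
    ("OpenAI/GPT 4o mini", 1000, 885038, 0.2045),
    ("OpenAI/GPT 4o mini", 1000, 1024253, 0.2276),
    ("Meta/Llama 3.2 3B Instruct Turbo", 1000, 484155, 0.0290),
]


def summarize(model: str, rows: int, tokens: int, cost_usd: float) -> dict:
    """Normalize one run so that runs of different sizes can be compared."""
    return {
        "model": model,
        "tokens_per_row": tokens / rows,
        "usd_per_row": cost_usd / rows,
        "usd_per_million_tokens": cost_usd / tokens * 1_000_000,
    }


if __name__ == "__main__":
    for run in RUNS:
        s = summarize(*run)
        print(
            f"{s['model']:<34} "
            f"{s['tokens_per_row']:>8.1f} tokens/row  "
            f"${s['usd_per_million_tokens']:.4f} per 1M tokens"
        )
```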