Repository evaluations - ox/MedQuAD

Evaluations

Label datasets and evaluate model performance

Oxen.ai lets you run a model row by row over a dataset, either to label the data or to evaluate how well a model performs. Once the run finishes, you can save the output to a new file or branch and compare it with the original dataset.
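As a rough illustration of what "row by row" means here, the sketch below fills a prompt template from each row's columns and stores the model's reply in a new column. It is not Oxen.ai's API: `run_rows`, the `model` callable, and the `prediction` output column name are hypothetical names chosen for the example, and the model is a stand-in function rather than a real LLM call.

```python
from typing import Callable, Dict, List


def run_rows(
    rows: List[Dict[str, str]],
    template: str,
    model: Callable[[str], str],
    output_column: str = "prediction",  # hypothetical column name
) -> List[Dict[str, str]]:
    """Render the template for each row, call the model, and return new rows.

    Placeholders such as {question} are filled from the row's columns.
    The input rows are left unmodified so they can be compared with the output.
    """
    results = []
    for row in rows:
        prompt = template.format(**row)
        results.append({**row, output_column: model(prompt)})
    return results


def stand_in_model(prompt: str) -> str:
    """Placeholder for a real model call; it only reports the prompt length."""
    return f"[reply to a {len(prompt)}-character prompt]"


if __name__ == "__main__":
    data = [{"question": "What is glaucoma?"}]
    for out in run_rows(data, "Answer briefly:\n\n{question}", stand_in_model):
        print(out)
```

Returning new rows instead of editing them in place mirrors the idea of saving results to a separate file or branch while keeping the original dataset intact.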

LLM As A Judge - PEFT Qwen 2.5 1.5B Instruct MedQuaD - GPT 4o mini

e1ec9397-ba31-4949-9b24-81855b405830

OpenAI/GPT 4o mini, text → text

1 year ago

Prompt

Consider the given question and two answers. The first answer is the gold standard, correct answer. The second answer may or may not be correct. Compare the text in the two answers and determine whether the second answer is correct. Provide a brief explanation for why the answer is correct or not before arriving at the final verdict (Yes/No). Provide a final verdict for whether the second answer is correct the end in the given format:

Is Correct:
Yes

or

Is Correct:
No

Do not deviate from the specified format for the final verdict.

Question:
{question}

First Answer:
{answer}

Second Answer:
{prediction}

validation

results/PEFT_0_2025-05-03_08-40-22_Qwen2.5-1.5B-Instruct.parquet

validation

results/PEFT_0_2025-05-03_08-40-22_Qwen2.5-1.5B-Instruct.parquet

completed, 1000 rows, 798970 tokens, $0.1852, 3 iterations
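The judge prompt in this evaluation asks for a free-text explanation followed by a fixed `Is Correct:` verdict. One way to turn such replies into a score is sketched below. This is an assumption about post-processing, not something the page describes: `parse_verdict` and `accuracy` are hypothetical helpers, and treating the last `Is Correct:` occurrence as the final verdict is a design choice made for the example.

```python
import re
from typing import List, Optional, Tuple

# Matches "Is Correct:" followed by Yes or No, on the same line or the next one.
VERDICT_RE = re.compile(r"Is Correct:\s*(Yes|No)\b", re.IGNORECASE)


def parse_verdict(judge_output: str) -> Optional[bool]:
    """Return True for Yes, False for No, or None if no verdict is found.

    The last match wins, because the prompt asks for the verdict at the end
    and the explanation may quote the format earlier in the reply.
    """
    matches = VERDICT_RE.findall(judge_output)
    if not matches:
        return None
    return matches[-1].lower() == "yes"


def accuracy(judge_outputs: List[str]) -> Tuple[Optional[float], int]:
    """Return (fraction judged correct among parsed replies, unparsed count)."""
    verdicts = [parse_verdict(text) for text in judge_outputs]
    parsed = [v for v in verdicts if v is not None]
    unparsed = len(verdicts) - len(parsed)
    if not parsed:
        return None, unparsed
    return sum(parsed) / len(parsed), unparsed


if __name__ == "__main__":
    replies = [
        "Both answers describe the same condition.\n\nIs Correct:\nYes",
        "The second answer gives a different cause.\n\nIs Correct:\nNo",
    ]
    print(accuracy(replies))
```

Reporting the number of unparsed replies separately keeps format failures from being silently counted as wrong answers.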

LLM As A Judge - Qwen 2.5 1.5B Instruct MedQuaD - GPT 4o mini

911fcd33-23fa-4e71-ad45-f73bcf008556

OpenAI/GPT 4o mini, text → text

1 year ago

Prompt

Consider the given question and two answers. The first answer is the gold standard, correct answer. The second answer may or may not be correct. Compare the text in the two answers and determine whether the second answer is correct. Provide a brief explanation for why the answer is correct or not before arriving at the final verdict (Yes/No). Provide a final verdict for whether the second answer is correct the end in the given format:

Is Correct:
Yes

or

Is Correct:
No

Do not deviate from the specified format for the final verdict.

Question:
{question}

First Answer:
{answer}

Second Answer:
{prediction}

validation

results/SFT_20_2025-05-02_12-36-12_Qwen2.5-1.5B-Instruct.parquet

validation

judgements/SFT_20_2025-05-02_12-36-12_Qwen2.5-1.5B-Instruct.parquet

completed, 1000 rows, 860866 tokens, $0.1850, 2 iterations

fab2b90a-a789-4e7e-8c4c-0a6981af5dfb

Unknown/meta-llama-meta-llama-3-1-8b-instruct-turbo, text → text

1 year ago

Prompt

pretend you are a medical professional answer the following question

{question}

validation

valid.parquet

completed, 5 row sample, 3105 tokens, $0.0006, 1 iteration

LLM As A Judge - Qwen 2.5 1.5B Instruct - GPT 4o mini

62018aa7-a5bc-4b88-bea2-b68c2f630856

OpenAI/GPT 4o mini, text → text

1 year ago

Prompt

Consider the given question and two answers. The first answer is the gold standard, correct answer. The second answer may or may not be correct. Compare the text in the two answers and determine whether the second answer is correct. Provide a brief explanation for why the answer is correct or not before arriving at the final verdict (Yes/No). Provide a final verdict for whether the second answer is correct the end in the given format:

Is Correct:
Yes

or

Is Correct:
No

Do not deviate from the specified format for the final verdict.

Question:
{question}

First Answer:
{answer}

Second Answer:
{prediction}

validation

results/Qwen2.5-1.5B-Instruct.parquet

validation

judgements/Qwen2.5-1.5B-Instruct-gpt-4o-mini.parquet

completed, 1000 rows, 885038 tokens, $0.2045, 2 iterations

LLM As A Judge - Llama 3.2 3B Instruct - GPT 4o mini

55a801bc-e95c-4c85-afee-ca6e901c6b44

OpenAI/GPT 4o mini, text → text

1 year ago

Prompt

Consider the given question and two answers. The first answer is the gold standard, correct answer. The second answer may or may not be correct. Compare the text in the two answers and determine whether the second answer is correct. Provide a brief explanation for why the answer is correct or not before arriving at the final verdict (Yes/No). Provide a final verdict for whether the second answer is correct the end in the given format:

Is Correct:
Yes

or

Is Correct:
No

Do not deviate from the specified format for the final verdict.

Question:
{question}

First Answer:
{answer}

Second Answer:
{prediction}

validation

results/Llama-3.2-3B-Instruct-Turbo.parquet

validation

judgements/Llama-3.2-3B-Instruct-Turbo-gpt-4o-mini.parquet

completed, 1000 rows, 1024253 tokens, $0.2276, 4 iterations

Eval Llama 3.2 3B Instruct Turbo

31c3f6ea-9c02-424f-b04d-7b499d8f2021

Unknown/meta-llama-llama-3-2-3b-instruct-turbo, text → text

1 year ago

Prompt

You are a medical professional who is helping a patient. Patients will ask you questions and you will answer them in plain English so that anyone can understand.

{question}

validation

valid.parquet

validation

results/Llama-3.2-3B-Instruct-Turbo.parquet

completed, 1000 rows, 484155 tokens, $0.0290, 2 iterations
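The row, token, and cost figures in the run summaries above can be compared more easily as per-row and per-token rates. The sketch below does that arithmetic using only the numbers shown on this page, keyed by the first eight characters of each evaluation ID. The helper names are hypothetical, and because the displayed costs are rounded to four decimals, the derived rates are approximate.

```python
# (rows, tokens, cost in USD) copied from the run summaries listed above,
# keyed by the first eight characters of each evaluation ID.
RUNS = {
    "e1ec9397": (1000, 798970, 0.1852),
    "911fcd33": (1000, 860866, 0.1850),
    "fab2b90a": (5, 3105, 0.0006),
    "62018aa7": (1000, 885038, 0.2045),
    "55a801bc": (1000, 1024253, 0.2276),
    "31c3f6ea": (1000, 484155, 0.0290),
}


def tokens_per_row(rows: int, tokens: int) -> float:
    """Average number of tokens consumed per dataset row."""
    return tokens / rows


def cost_per_million_tokens(tokens: int, cost: float) -> float:
    """Effective price in USD per one million tokens."""
    return cost / tokens * 1_000_000


if __name__ == "__main__":
    for run_id, (rows, tokens, cost) in RUNS.items():
        print(
            f"{run_id}: {tokens_per_row(rows, tokens):.1f} tokens/row, "
            f"${cost_per_million_tokens(tokens, cost):.3f} per 1M tokens"
        )
    total = sum(cost for _, _, cost in RUNS.values())
    print(f"total: ${total:.4f}")
```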
