Featured Posts, Tutorials, and Case studies
Data Version Control 101 with Oxen
Data Version Control 101 with Oxen

This intro tutorial from Oxen.ai shows how Oxen can make versioning your data as easy as versioning your code. Oxen is built to track and store changes for everything from a single CSV to data repositories with millions of unstructured images, videos, audio or text files. The tutorial will go through what data version control is, why it is important, and how Oxen helps data scientists and engineers gain visibility and confidence when sharing data with the rest of their team. Here's a video ve...

Greg Schoeninger
Greg Schoeninger
Nov 9, 2023
Arxiv Dive Manifesto
Arxiv Dive Manifesto

Every Friday the team at Oxen.ai gets together and goes over research papers, blog posts, or books that help us stay up to date with the latest in Machine Learning and AI. We call it Arxiv Dives because https://arxiv.org/ is a great resource for the latest research in the field. In September of 2023, we decided to make it public so that anyone can join. We’ve had amazing minds from hundreds of companies like Amazon, DoorDash, Meta, Google, and Tesla join the conversation, but I thought it would...

Greg Schoeninger
Greg Schoeninger
Nov 5, 2023
How to run Llama-2 on CPU after fine-tuning with LoRA
How to run Llama-2 on CPU after fine-tuning with LoRA

Running Large Language Models (LLMs) on the edge is a fascinating area of research, and opens up many use cases that require data privacy or lower cost profiles. With libraries like ggml coming on to the scene, it is now possible to get models anywhere from 1 billion to 13 billion parameters to run locally on a laptop with relatively low latency. In this tutorial, we are going to walk step by step how to fine tune Llama-2 with LoRA, export it to ggml, and run it on the edge on a CPU. We assume...

Greg Schoeninger
Greg Schoeninger
Oct 23, 2023
Featured Datasets

A dataset from the Allen Institute of AI consisting of genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering. The dataset the Challenging Set of questions.

859 kB
1
Updated: 7 months ago

Wikipedia dataset containing cleaned articles. There are 6.4 million articles that can be streamed via apache arrow files.

20.4 gb
651
Updated: 7 months ago
Public
3

A benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events. … The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations.

1.1 gb
938.1K
Updated: 10 months ago
Public
26

An image classification dataset containing 3670 images of flowers across 5 classes: daisy, dandelion, roses, sunflowers, tulips. The images are of nonstandard sizes and aspect ratios, ranging from 500 x 442 px to 143 x 240 px.

233.7 mb
143.7K
Updated: 8 months ago

Subset of speech commands to test audio recognition systems on.

252.5 mb
38K1
Updated: 1 year ago

CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversities, large quantities, and rich annotations.

1.5 gb
25203K
Updated: 2 months ago
View all featured repositories
Featured Collections

Some of the Oxen team's favorite collections.

LLM-SFT

Interesting datasets to supervise fine-tune (SFT) language models with.

a collection by ox

LLM-Feedback

Datasets with human or AI feedback. Useful for training reward models or applying techniques like DPO.

a collection by ox

LLM-Eval

A list of standard benchmarks for LLM evaluation

a collection by ox

Multimodal

List of datasets that cross modalities, combinations of text, image, audio, video etc.

a collection by ox

Browse all collections