GPT Image 2
OpenAI
Text-to-image generation with photorealistic output, accurate text rendering, and strong prompt adherence.
Claude Opus 4.7
Anthropic
Anthropic's most capable model, with a step-change jump in agentic coding over Opus 4.6 and a native 1M-token context window.
Seedance 2.0 - Reference to Video
ByteDance
Reference-guided video from prompt plus optional images, videos, and audio references.
Happy Horse - Image to Video
Alibaba Wan
Image-to-video generation up to 1080P from a single reference image.
Happy Horse - Reference to Video
Alibaba Wan
Reference-to-video generation up to 1080P with up to 9 reference images.
Happy Horse - Text to Video
Alibaba Wan
Text-to-video generation up to 1080P with configurable aspect ratio and duration.
WAN 2.7 - Image to Video
Alibaba Wan
Animates images into video up to 15s at 1080P with first/last-frame guidance, video continuation, and optional driving audio.
WAN 2.7 - Reference to Video
Alibaba Wan
Reference-guided video generation with character consistency, multi-character support, optional reference voices, and up to 1080P output.
WAN 2.7 - Text to Video
Alibaba Wan
Text-to-video with multi-shot generation, up to 1080P, 2-15s duration, and optional driving audio.
GPT Image 2 Edit
OpenAI
Image-to-image editing with prompt-guided transformations and multi-reference composition.
Seedance 2.0 Fast - Image to Video
ByteDance
Fast-tier image-to-video with optional start-to-end frame transitions, flexible duration and aspect ratio, resolution up to 720p, and optional synchronized audio.
Seedance 2.0 Fast - Reference to Video
ByteDance
Fast-tier reference-guided video from prompt plus optional images, videos, and audio references.
WAN 2.7 - Edit Video
Alibaba Wan
Edit videos via text instructions or reference images with style transfer, up to 1080p, and flexible audio handling.
Seedance 2.0 - Image to Video
ByteDance
Image-to-video with optional start-to-end frame transitions, flexible duration and aspect ratio, resolution up to 1080p, and optional synchronized audio.
Seedance 2.0 - Text to Video
ByteDance
Text-to-video with flexible duration and aspect ratio, resolution up to 720p, and optional synchronized audio.
Gemma 4 31B
Google
Flagship 31B dense multimodal model supporting text, image, and video input with a 256K context window, achieving performance competitive with much larger models.
Gemma 4 E2B
Google
Lightweight 2.3B multimodal model supporting text, image, video, and audio input with 128K context window and 140+ language support.
Gemma 4 E4B
Google
Efficient 4.5B multimodal model supporting text, image, video, and audio input with 128K context window and 140+ language support.
Qwen3.6 Plus
Qwen
Alibaba's latest flagship closed model for advanced reasoning, coding, and complex text generation.
Topaz Starlight Precise 2.5
Topaz Labs
Video restoration and upscaling model from Topaz for detail-preserving 1080p/4k outputs.
GPT 5.4 Mini
OpenAI
Strongest OpenAI mini model for coding and agentic workloads, with 400K context, 128K max output, multimodal input, and broad tool support.
Depth Anything Video
ByteDance
Video-to-depth estimation with temporal consistency, selectable model size, colormaps, and optional raw depth export.
LTX-2.3 Pro 22B IC-LoRA Union Control
Lightricks
Baseten-configured LTX 2.3 Pro 22B model with IC/Union-Control support for text-to-video and image-conditioned video generation.
GPT 5.4
OpenAI
Frontier model for complex professional work with 1.05M context, configurable reasoning, and extensive tool support including computer use and MCP.
LTX-2.3 Pro
Lightricks
Generates high-res 4K@25FPS videos from image+text input, with camera control and synced audio.
LTX 2.3 Pro: Audio to Video
Lightricks
Audio-to-video generation from image + audio input, 1080p output with synchronized visuals.
LTX 2.3 Pro: Retake
Lightricks
Targeted video segment editing: replace video, audio, or both via prompts.
LTX 2.3 Pro: Text to Video
Lightricks
Text-to-video generation up to 4K@50FPS with optional audio and camera motion.
Gemini 3.1 Flash-Lite
Google
Fast, low-cost Gemini 3.1 model for high-throughput multimodal workloads, with configurable reasoning and a 1M-token context window.
GPT 5.3 Chat
OpenAI
GPT-5.3 Instant model for ChatGPT with 128K context, text and image inputs, and optimized conversational performance.
Nemotron 3 Super
NVIDIA
Hybrid Mamba-Transformer MoE with 1M context, optimized for agentic reasoning; 120B total, 12B active parameters.
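The "120B total, 12B active" split above is what makes sparse MoE economical: per-token compute scales with the *active* parameter count, not the total. A rough sketch using the standard ~2 FLOPs-per-active-parameter-per-token rule of thumb (the parameter figures come from the entry above; everything else is illustrative):

```python
def flops_per_token(active_params: float) -> float:
    """Rough forward-pass cost: ~2 FLOPs per active parameter per token."""
    return 2.0 * active_params

TOTAL = 120e9   # total parameters (from the model card above)
ACTIVE = 12e9   # parameters actually routed per token

dense_cost = flops_per_token(TOTAL)   # hypothetical dense model of the same size
moe_cost = flops_per_token(ACTIVE)    # sparse MoE: only the routed experts run

print(f"MoE per-token compute is {dense_cost / moe_cost:.0f}x cheaper "
      f"than an equally sized dense model")
```

Memory, by contrast, still scales with the total count: all 120B weights must be resident even though only 12B fire per token.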
Nano Banana 2
Google
Nano Banana 2 is a text-to-image model that generates images from text descriptions.
Nano Banana 2 - Image Edit
Google
Nano Banana 2 Edit is an image editing model that enables blending multiple images, maintaining character consistency, targeted transformations using natural language, and leveraging world knowledge for precise edits.
Qwen3.5 0.8B
Qwen
Compact multimodal model with dual reasoning modes, native vision capabilities, support for over 200 languages, and long-context processing up to 262,144 tokens.
Qwen3.5 2B
Qwen
Multimodal LLM with native vision, image and video understanding, tool calling, optional thinking mode, support for 201 languages, and long-context processing up to 262,144 tokens.
Qwen3.5 4B
Qwen
Multimodal LLM with thinking mode by default, native vision, image and video understanding, tool calling, support for 201 languages, and long-context up to 262K tokens (extensible to 1M with YaRN).
Qwen3.5 9B
Qwen
Multimodal LLM with thinking mode by default, native vision, image and video understanding, tool calling, support for 201 languages, and long-context up to 262K tokens (extensible to 1M with YaRN).
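Several Qwen entries above note a 262K context "extensible to 1M with YaRN". In Hugging Face Transformers this kind of extension is typically enabled via a `rope_scaling` override, where the factor is the target length divided by the native length. A minimal sketch (the field names follow the common Transformers/Qwen pattern; the model id in the comment is hypothetical and exact support depends on your library version):

```python
# YaRN extends RoPE-based context by rescaling rotary frequencies.
native_ctx = 262_144      # 262K native context, per the entries above
target_ctx = 1_000_000    # desired 1M-token window

rope_scaling = {
    "rope_type": "yarn",
    "factor": target_ctx / native_ctx,               # ~3.81
    "original_max_position_embeddings": native_ctx,
}

# Passed at load time, e.g. (hypothetical model id):
# model = AutoModelForCausalLM.from_pretrained(
#     "Qwen/Qwen3.5-9B", rope_scaling=rope_scaling)
print(f"YaRN factor: {rope_scaling['factor']:.2f}")
```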
Seedream 5.0 Lite
ByteDance
Image generation with built-in reasoning, example-based editing, multi-reference control (up to 14 images), and 3K resolution support.
Gemini 3.1 Pro Preview
Google
Flagship Gemini 3 reasoning model for complex multimodal and agentic workflows with a 1M-token context window.
Claude Sonnet 4.6
Anthropic
Anthropic's latest Sonnet model with strong coding and agent performance, fast latency, and improved long-context reasoning.
Qwen Image 2.0 Pro
Qwen
Pro version of Qwen Image 2 with enhanced text rendering, realism, and semantic adherence for high-quality image generation and editing.
Claude Opus 4.6
Anthropic
Anthropic's most advanced model, excelling in coding, agentic workflows, computer use, reasoning, math, and domain expertise in finance, law, and STEM.
Kling O3 Pro: Image-to-Video
Kling
Kling Video O3 Pro is an advanced image-to-video generation model that animates static images into high-quality videos based on text prompts.
Kling O3 Pro - Reference to Video
Kling
Kling o3 Pro reference-to-video model generates videos from a reference image and text prompt describing motion and cinematic intent.
Kling O3 Edit - Video to Video
Kling
Edit videos using text prompts and reference images for character consistency or object replacement.
Kling 3.0 Pro: Motion Control
Kling
Motion transfer from reference video to character image. Cost-effective for portraits and simple animations.
Grok Imagine - Text to Image
xAI
Grok Imagine text-to-image is a high-quality image generation model from xAI that produces cinematic, stylistically consistent images from text prompts.
Grok Imagine - Image Edit
xAI
Grok Imagine - Image Edit is a high-quality image editing model from xAI that applies cinematic, stylistically consistent edits to existing images from text prompts.
Grok Imagine - Video Edit
xAI
Video editing model for prompt-driven modifications like object swapping, scene restyling, and character animation with synced native audio.
Grok Imagine - Image to Video
xAI
Generate videos from images with audio using xAI's Grok Imagine Video model.
FLUX.2 Klein 4B
Black Forest Labs
FLUX.2 Klein 4B is a compact 4 billion parameter text-to-image diffusion model optimized for fast inference and high-quality image generation.
FLUX.2 Klein 9B
Black Forest Labs
FLUX.2 Klein 9B is a compact 9 billion parameter text-to-image diffusion model optimized for fast inference and high-quality image generation.
LTX-2 Retake
Lightricks
Multimodal video editing model: regenerate 2-16s segments (video/audio/both) via prompts, preserving motion, lighting, and continuity.
LTX-2 Pro
Lightricks
Generates high-res 4K@25FPS videos from image+text input, with camera control and synced audio.
Kimi K2.5
Moonshot AI
Native multimodal agentic model with vision, Agent Swarm (up to 100 sub-agents, 1,500 tool calls), coding from visual specs, and 256K context.
GLM 5
Z AI
Frontier open LLM with advanced coding, agentic, and reasoning capabilities; 744B MoE with DSA for efficient 200K context.
Qwen Image - 2512
Qwen
Generative image model that improves photorealistic human portraits, renders finer detail in natural scenes (landscapes, animal fur, and other natural elements), and delivers better text rendering overall.
Qwen Image Edit - 2511
Qwen
Delivers high-fidelity, controllable image editing with dual semantic and appearance modes, precise on-image text, multi-image composition, and robust identity preservation.
Gemini 3 Flash
Google
Fast multimodal model with configurable reasoning, strong agentic workflows, long context, and tool use for interactive chat, coding, and complex tasks.
Kling 2.6 Pro - Image to Video
Kling
Transforms static images into cinematic videos with synchronized audio, dialogue, and sound effects in 1080p.
Kling 2.6 Pro - Text to Video
Kling
Generates 1080p videos from text with native synchronized audio, including dialogue, sound effects, and lip-sync.
GPT Image 1.5
OpenAI
Diffusion model for high-fidelity image generation and editing, with strong prompt adherence, preserved composition and lighting, and adjustable quality controls.
WAN 2.6 - Image to Video
Alibaba Wan
Animates images into 15s, 1080p videos with preserved identity, native audio, lip-sync, and multi-shot sequences guided by reference videos.
WAN 2.6 - Video to Video
Alibaba Wan
Generates videos from reference videos, maintaining character consistency, with multi-shot narratives, up to 15s duration, and native audio sync.
GPT 5.2
OpenAI
Frontier model for professional work with configurable reasoning effort, 400K context, structured outputs, and distillation support.
GPT 5.2 Chat
OpenAI
GPT-5.2 model optimized for ChatGPT with 128K context, text and image input support, streaming, and structured outputs.
Seedream 4.5
ByteDance
High-fidelity text-to-image and image-to-image generation with multi-reference control (up to 10 images), 4K support, and batch output.
Kling O1 - Image to Video
Kling
Transforms images (with text and up to 7 references) into cinematic video clips with stable characters, controlled motion, and consistent environments.
Kling O1 - Reference to Video
Kling
Multimodal video model for reference-guided generation, preserving characters and styles from reference images.
Kling O1 Edit - Video to Video
Kling
Text-guided video-to-video editing that preserves motion and continuity while enabling character swaps, style changes, motion transfer, and scene transformations.
Z-Image-Turbo
Tongyi-MAI
Fast photorealistic text-to-image model with accurate English and Chinese on-image text, ideal for interactive design, marketing visuals, and UI/UX workflows.
FLUX.2 [dev]
Black Forest Labs
Generates photorealistic images with precise multi-reference editing, excels at legible text and infographics, and supports rapid LoRA fine-tuning workflows.
FLUX.2 [flex]
Black Forest Labs
Delivers high-quality image generation and editing with advanced text rendering, multi-image reference for style consistency, and precise, JSON-based prompt control.
FLUX.2 [pro]
Black Forest Labs
Delivers photorealistic, high-resolution images with advanced multi-reference editing, precise pose and color control, and reliable prompt and text adherence for professionals.
Claude Opus 4.5
Anthropic
Excels at long-horizon reasoning, advanced coding, dynamic effort control, robust multimodal tasks, and detailed computer interface inspection for complex workflows.
Nano Banana Pro
Google
Delivers high-fidelity images with advanced text rendering, consistent character identities, and precise prompt following for professional visual design and branding.
Segment Anything 3 - Image
Meta
Zero-shot image segmentation with text/visual prompts; exhaustive instance detection and presence head reduce false positives.
Segment Anything 3 - Video
Meta
Detects, segments, and tracks objects across video frames using text, exemplars, points, or masks, with memory for occlusions and real-time streaming.
GPT 5.1
OpenAI
Automatically routes prompts to fast or deep reasoning modes, with adaptive effort, enhanced tone and style controls, and improved coding and math.
Veo 3.1
Google
Generates high-fidelity videos with native synced audio, offering strong narrative control, scene consistency, image-to-video animation, and multi-shot support.
Veo 3.1 Fast - Image to Video
Google
Animates an input image into short videos with controllable motion, duration, aspect ratio, resolution, and optional audio.
Veo 3.1 Lite - Image to Video
Google
Animates a single image into short videos with controllable motion, duration, aspect ratio, and cost-efficient quality settings.
Sora 2 Pro
OpenAI
Generates high-quality 1080p videos up to 12s with synced native audio, multi-scene reasoning, timeline prompting, and realistic physics.
Gemini 2.5 Flash Lite Preview
Google
Optimized for rapid, high-volume multimodal tasks with a 1M-token context window, delivering strong reasoning and cost efficiency for enterprise workflows.
Kling v2.5 - Image to Video
Kling
Transforms single images into smooth, cinematic videos with natural motion, realistic camera work like dolly zooms, and preserved style.
Qwen Image Edit - 2509
Qwen
Delivers high-fidelity, controllable image editing with dual semantic and appearance modes, precise on-image text, multi-image composition, and robust identity preservation.
nano-banana
Google
Generates photorealistic images with precise prompt and text rendering, mask-free editing, and layout-aware outpainting, ideal for creative and multilingual content.
Seedream 4.0
ByteDance
Delivers ultra-fast, high-resolution image generation, precise natural-language editing, and consistent multi-image output—ideal for creative, batch, or professional workflows.
Claude Sonnet 4.5
Anthropic
Anthropic's most advanced AI model, excelling in coding, agent-based tasks, and computer usage. It delivers high performance in reasoning, math, and domain-specific knowledge across fields like finance, law, and STEM.
Qwen Image Edit
Qwen
Enables precise bilingual text and semantic edits with strong consistency, advanced multi-image editing, and native pose/control support for creative compositions.
Qwen3 VL 2B - Instruct
Qwen
Lightweight multimodal model for visual Q&A, multilingual OCR, document and UI understanding, and agentic screen interpretation in constrained environments.
Qwen3 VL 4B - Instruct
Qwen
Multimodal LLM for text and images, excelling in visual QA, document/UI understanding, spatial reasoning, image captioning, and multimodal coding.
Qwen3 VL 8B - Instruct
Qwen
Versatile multimodal large language model capable of understanding and generating both text and images. Built on the Qwen3 architecture, it provides strong general reasoning, detailed image interpretation, and instruction-following performance in a compact 8B-parameter size.
FLUX.1 [dev]
Black Forest Labs
Open-weight text-to-image model with advanced prompt adherence, anatomically accurate details, and powerful tools for inpainting, outpainting, and structural edits.
GPT 5
OpenAI
Handles complex reasoning, code generation, and multimodal inputs with improved accuracy, long context retention, and robust multilingual and personalization features.
GPT 5 Mini
OpenAI
Optimized for cost and speed, handles long contexts, supports text and image input, and excels at structured outputs and tool integration for precise tasks.
GPT 5 Nano
OpenAI
Multimodal model optimized for ultra-fast, cost-efficient summarization and classification, supporting both text and image inputs with real-time streaming output.
Claude Opus 4.1
Anthropic
Excels at complex coding, autonomous research, and agent workflows, with advanced reasoning and a 200,000-token context for deep analysis and synthesis.
OpenAI/GPT-OSS-120B
OpenAI
Built with a Mixture-of-Experts design, it delivers efficient, transparent reasoning, tool use, and agentic capabilities across a 128K-token context window.
OpenAI/GPT-OSS-20B
OpenAI
Delivers strong reasoning and chain-of-thought, agentic features, and multilingual support, optimized for local deployment and efficient use on modest hardware.
Qwen Image
Qwen
An image generation foundation model in the Qwen series with significant advances in complex text rendering and support for a wide range of artistic styles, from photorealistic scenes to impressionist paintings and from anime aesthetics to minimalist design.
Wan2.2 A14B - Text to Video
Alibaba Wan
Delivers high-fidelity text-to-video synthesis at 480p/720p using dual expert models for scene layout and fine motion detail, ideal for creative production.
Wan2.2 5B - Text to Video
Alibaba Wan
Unified text-to-video and image-to-video model generates high-definition 720p, 24fps video clips efficiently on consumer GPUs, with advanced compression for speed.
Gemini 2.5 Flash
Google
Fast, cost-efficient multimodal reasoning model with million-token context for high-volume applications requiring speed and versatility.
FLUX.1-Kontext [dev]
Black Forest Labs
Delivers precise, iterative image editing and generation with consistent character, style, and text changes—using multimodal input for seamless scene transformations.
Claude Opus 4
Anthropic
Excels at deep reasoning, complex coding, and autonomous agent workflows with sustained performance, extended thinking, tool use, and memory across tasks.
Claude Sonnet 4
Anthropic
Balances intelligence with efficiency for coding, research, and automation tasks; excels in reasoning, content generation, and nuanced instruction following.
Veo 3.0
Google
Generates realistic text- and image-conditioned videos with native synchronized audio, including dialogue, ambient sound, and effects.
Gemini 2.5 Pro
Google
Excels at building interactive web apps, advanced code editing and agentic workflows, with native multimodality and strong video-to-code capabilities.
Qwen/Qwen3-4B
Qwen
Dual reasoning modes enable rapid or step-by-step responses, with robust support for over 100 languages and long-context processing up to 262,144 tokens.
Qwen/Qwen3-0.6B
Qwen
Efficient conversational AI for resource-limited devices with multilingual support, document summarization, translation, code generation, and simple information retrieval.
o3
OpenAI
Excels at advanced reasoning, coding, math, and visual tasks with simulated reasoning, tool use, web browsing, and image understanding integration.
o4 mini
OpenAI
Optimized for fast, affordable reasoning with strong coding and visual skills, large 200k-token context, and efficient handling of complex tasks.
GPT 4.1
OpenAI
Excels in coding and instruction following with million-token context window, enabling superior performance on complex, multi-step tasks.
GPT 4.1 mini
OpenAI
Powerful mid-sized model with GPT-4o-level performance at lower cost and latency, featuring a 1 million token context window for complex tasks.
GPT 4.1 nano
OpenAI
OpenAI's fastest, cost-effective model with full 1 million token context, optimized for classification, autocompletion, and real-time AI agent tasks.
Mistral Small 3.1
Mistral AI
A lightweight, versatile 24B multimodal model handling text and images with extensive multilingual support and 128k token context window.
Gemma 3 27B
Google
Gemma 3 offers a large 128K context window, multilingual support in over 140 languages, and more sizes than previous versions. Gemma 3 models are well-suited to a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning.
Perplexity Sonar Deep Research
Perplexity
Performs exhaustive, multi-step research by autonomously searching and synthesizing hundreds of sources into detailed, expert-level reports across domains.
Perplexity Sonar Reasoning Pro
Perplexity
Premium reasoning model for complex, multi-step analysis. Delivers detailed explanations, real-time web search, and twice as many citations for thorough answers.
Llama 3.2 1B Instruct
Meta
Efficient, multilingual instruction-tuned model designed for privacy-focused, on-device dialogue, summarization, and agentic retrieval across mobile and edge platforms.
Wan2.1 14B - Text to Video
Alibaba Wan
Generates high-fidelity, temporally consistent videos from text or images, with readable English and Chinese text, sound effects, and customizable aspect ratios.
Perplexity Sonar
Perplexity
Optimized for search-augmented tasks, delivering fast, accurate answers with real-time web data and detailed citations. Excels in research and fact-checking.
Perplexity Sonar Pro
Perplexity
Excels at complex, multi-step queries with real-time web search, detailed answers, extensive citations, and customizable information retrieval.
Ministral 8B
Mistral AI
Efficient edge model with native function calling and interleaved sliding-window attention for fast, memory-efficient processing in resource-constrained environments.
Pixtral 12B
Mistral AI
Multimodal model handling text and images at native resolution with 128K context window, excelling in visual reasoning tasks like document analysis and image captioning.
GPT 4o mini
OpenAI
Cost-efficient, fast model with 128K context window, supporting text/vision inputs and improved multilingual performance.
Topaz Rhea - Fine Detail Video Upscaler
Topaz Labs
Next-generation general-purpose Topaz video upscaler with tunable detail, noise, blur, and grain controls.
Text Embedding 004
Google
Generates vector representations capturing semantic meaning/context for tasks like semantic search, text classification, and clustering. Multilingual support with versatile applications.
GPT 4o
OpenAI
Multimodal LLM for real-time text, audio, and visual processing with multilingual support, emotional audio responses, and image generation.
Mixtral 8x22B
Mistral AI
Efficient Sparse MoE architecture with 39B active parameters, excels in multilingual tasks, math, coding, and handles 64K token contexts.
Mistral Large 2
Mistral AI
Powerful LLM with 123B parameters, excelling in multilingual tasks, coding, and reasoning, optimized for single-node inference and long-context applications.
Text Embedding 3 - Large
OpenAI
Generates high-quality embeddings for complex text analysis and multilingual applications with 8,191 token context.
Text Embedding 3 - Small
OpenAI
Generates compact, efficient embeddings for NLP tasks with multilingual support, balancing performance and low latency.
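The embedding models above are typically consumed via vector similarity, e.g. for semantic search. A minimal sketch using the OpenAI Python client with the published `text-embedding-3-small` model id (the API call is gated on an API key, so the local similarity math runs standalone; input strings are illustrative):

```python
import math
import os

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # pip install openai
    client = OpenAI()
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=["semantic search", "vector retrieval"],
    )
    vecs = [d.embedding for d in resp.data]
    print(cosine_similarity(vecs[0], vecs[1]))
```

Scores near 1.0 indicate semantically similar texts; near 0.0, unrelated ones.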
Topaz Proteus - Versatile Video Upscaler
Topaz Labs
General-purpose Topaz video upscaling and enhancement with tunable detail, noise, blur, and grain controls.
Mixtral 8x7B
Mistral AI
Efficient Mixture of Experts (8 experts) with 13B active parameters, optimized for multilingual tasks and cost-performance balance.
Topaz Iris - Face Detail Video Upscaler
Topaz Labs
Topaz video upscaler focused on face restoration for medium-quality sources.
Mistral 7B
Mistral AI
Balanced performance in natural language and code tasks, efficiently handling longer sequences with innovative attention mechanisms.
Gemini 2.0 Flash
Google
Multimodal LLM for agentic applications, handling real-time data integration and multi-step tasks with enhanced reasoning via Thinking Mode, integrating Google tools and third-party functions.
Ministral 3B
Mistral AI
Optimized for edge computing with function-calling capabilities, excelling in knowledge retrieval and commonsense reasoning with 128k token context.
o1
OpenAI
Specializes in complex reasoning through chain-of-thought processing, excelling in STEM tasks like coding, math, and scientific analysis.
o3 mini
OpenAI
Optimized for STEM reasoning and problem-solving, excelling in complex tasks like advanced math and coding with improved cost efficiency.
Qwen/Qwen3-1.7B
Qwen
Efficiently generates multilingual text and code, with dual modes for rapid chat or detailed reasoning; ideal for lightweight AI, agents, and education.