gpubox.ai

Models

Real names. Real GPUs. No surprises.

GPUBox tells you exactly which model serves your request. No opaque endpoints, no silent swaps, no "mystery model" pricing. You name the model in your code; we serve that model.

Chat / LLM

live · Apache 2.0 · AWQ-int4

qwen2.5-32b-instruct

Qwen2.5-32B-Instruct from Alibaba — strong general-purpose LLM at the 32B parameter class. Reliable function calling, decent reasoning, fast on consumer-grade hardware via 4-bit quantisation.

Context

8,192 tokens

Hardware

RTX 5090

Endpoint

/v1/chat/completions

Capabilities

  • Chat completions (OpenAI-compatible)
  • Streaming SSE
  • Tool / function calling
  • JSON mode (response_format)
  • Multilingual: English, Chinese, Spanish, French, German, etc.
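Since the endpoint is OpenAI-compatible, a standard chat-completions request works as-is. A minimal stdlib sketch follows; the base URL and API key are placeholders (assumptions, not documented values), and any OpenAI SDK pointed at the same base URL should behave identically.

```python
import json
import urllib.request

BASE_URL = "https://api.gpubox.ai"  # assumed placeholder; use your real base URL
API_KEY = "YOUR_API_KEY"            # placeholder

def build_chat_request(messages, stream=False):
    """Return (url, headers, body) for POST /v1/chat/completions."""
    body = {
        "model": "qwen2.5-32b-instruct",  # the exact model name you ask for is served
        "messages": messages,
        "stream": stream,                 # True -> server-sent events (SSE)
        "max_tokens": 256,
    }
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    return f"{BASE_URL}/v1/chat/completions", headers, body

def chat(messages):
    """Send one non-streaming chat request and return the reply text."""
    url, headers, body = build_chat_request(messages)
    req = urllib.request.Request(url, data=json.dumps(body).encode(), headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

For JSON mode, add `"response_format": {"type": "json_object"}` to the body; for tool calling, pass a `tools` list in the usual OpenAI schema.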

Speech-to-text

live · MIT · fp16

whisper-large-v3-turbo

OpenAI's Whisper large-v3-turbo served via faster-whisper. Real-time factor ~0.3 on the 5090: a 60-second clip transcribes in roughly 18 seconds.

Context

30-second windows

Hardware

RTX 5090

Endpoint

/v1/audio/transcriptions

Capabilities

  • OpenAI-compatible /v1/audio/transcriptions
  • Multipart upload (file + model + optional language/prompt/temperature)
  • 100+ languages with auto-detection
  • verbose_json response with segment-level timestamps and confidence
  • Voice-activity detection (VAD) filter
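The transcription endpoint takes a multipart form upload. A stdlib-only sketch of building and sending that request is below; the base URL and API key are placeholders (assumptions), and the multipart encoder is hand-rolled only to avoid third-party dependencies.

```python
import json
import urllib.request
import uuid

BASE_URL = "https://api.gpubox.ai"  # assumed placeholder
API_KEY = "YOUR_API_KEY"            # placeholder

def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as a multipart/form-data body.
    Returns (body_bytes, content_type_header_value)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'.encode()
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

def transcribe(path, language=None):
    """POST an audio file to /v1/audio/transcriptions; return the parsed JSON."""
    fields = {
        "model": "whisper-large-v3-turbo",
        "response_format": "verbose_json",  # segment-level timestamps + confidence
    }
    if language:
        fields["language"] = language  # optional; auto-detected when omitted
    with open(path, "rb") as f:
        body, ctype = build_multipart(fields, "file", path, f.read())
    req = urllib.request.Request(
        f"{BASE_URL}/v1/audio/transcriptions",
        data=body,
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": ctype},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```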

Embeddings

coming soon · MIT · fp16

bge-m3

BAAI BGE-M3 — strong multilingual embeddings with an 8,192-token context window, longer than most retrievers. Coming soon.

Context

8,192 tokens

Hardware

RTX 5090

Endpoint

/v1/embeddings

Capabilities

  • OpenAI-compatible /v1/embeddings
  • Multilingual (100+ languages)
  • Dense + sparse + multi-vector retrieval
  • 1,024-dimensional output
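Once live, the embeddings endpoint should accept the standard OpenAI request shape. A stdlib sketch, with the base URL and API key as placeholder assumptions:

```python
import json
import urllib.request

BASE_URL = "https://api.gpubox.ai"  # assumed placeholder
API_KEY = "YOUR_API_KEY"            # placeholder

def build_embeddings_request(texts):
    """JSON body for POST /v1/embeddings: one vector per input string."""
    return {"model": "bge-m3", "input": texts}

def embed(texts):
    """Return a list of 1,024-dimensional dense vectors, one per input."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/embeddings",
        data=json.dumps(build_embeddings_request(texts)).encode(),
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return [item["embedding"] for item in json.load(resp)["data"]]
```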

Want a model we don't serve yet?

We add open-weight models on customer demand. Tell us what you need.

hello@gpubox.ai