gpubox.ai

Models

Real names. Real GPUs. No surprises.

GPUBox tells you exactly which model serves your request. No opaque endpoints, no silent swaps, no "mystery model" pricing. You name the model in your code; we serve that model.

Chat / LLM

live · Apache 2.0 · AWQ-int4

qwen2.5-32b-instruct

Qwen2.5-32B-Instruct from Alibaba — strong general-purpose LLM at the 32B parameter class. Reliable function calling, decent reasoning, fast on consumer-grade hardware via 4-bit quantisation.

Context

8,192 tokens

Hardware

RTX 5090

Endpoint

/v1/chat/completions

Capabilities

  • Chat completions (OpenAI-compatible)
  • Streaming SSE
  • Tool / function calling
  • JSON mode (response_format)
  • Multilingual: English, Chinese, Spanish, French, German, etc.
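Since the endpoint is OpenAI-compatible, a standard chat-completions request works as-is. A minimal stdlib sketch follows; the base URL and API key are placeholders (assumptions, not documented values), and any OpenAI SDK pointed at the same base URL should behave identically.

```python
import json
import urllib.request

BASE_URL = "https://api.gpubox.ai"  # assumed placeholder; use your real base URL
API_KEY = "YOUR_API_KEY"            # placeholder

def build_chat_request(messages, stream=False):
    """Return (url, headers, body) for POST /v1/chat/completions."""
    body = {
        "model": "qwen2.5-32b-instruct",  # the exact model name you ask for is served
        "messages": messages,
        "stream": stream,                 # True -> server-sent events (SSE)
        "max_tokens": 256,
    }
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    return f"{BASE_URL}/v1/chat/completions", headers, body

def chat(messages):
    """Send one non-streaming chat request and return the reply text."""
    url, headers, body = build_chat_request(messages)
    req = urllib.request.Request(url, data=json.dumps(body).encode(), headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

For JSON mode, add `"response_format": {"type": "json_object"}` to the body; for tool calling, pass a `tools` list in the usual OpenAI schema.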

Speech-to-text

live · MIT · fp16

whisper-large-v3-turbo

OpenAI's Whisper large-v3-turbo served via faster-whisper. Real-time factor ~0.3 on the 5090: a 60-second clip transcribes in roughly 18 seconds.

Context

30-second windows

Hardware

RTX 5090

Endpoint

/v1/audio/transcriptions

Capabilities

  • OpenAI-compatible /v1/audio/transcriptions
  • Multipart upload (file + model + optional language/prompt/temperature)
  • 100+ languages with auto-detection
  • verbose_json response with segment-level timestamps and confidence
  • Voice-activity detection (VAD) filter
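The transcription endpoint takes a multipart form upload. A stdlib-only sketch of building and sending that request is below; the base URL and API key are placeholders (assumptions), and the multipart encoder is hand-rolled only to avoid third-party dependencies.

```python
import json
import urllib.request
import uuid

BASE_URL = "https://api.gpubox.ai"  # assumed placeholder
API_KEY = "YOUR_API_KEY"            # placeholder

def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as a multipart/form-data body.
    Returns (body_bytes, content_type_header_value)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'.encode()
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

def transcribe(path, language=None):
    """POST an audio file to /v1/audio/transcriptions; return the parsed JSON."""
    fields = {
        "model": "whisper-large-v3-turbo",
        "response_format": "verbose_json",  # segment-level timestamps + confidence
    }
    if language:
        fields["language"] = language  # optional; auto-detected when omitted
    with open(path, "rb") as f:
        body, ctype = build_multipart(fields, "file", path, f.read())
    req = urllib.request.Request(
        f"{BASE_URL}/v1/audio/transcriptions",
        data=body,
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": ctype},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```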

Embeddings

coming soon · MIT · fp16

bge-m3

BAAI BGE-M3 — strong multilingual embeddings with an 8,192-token context window, longer than most retrievers. Coming soon.

Context

8,192 tokens

Hardware

RTX 5090

Endpoint

/v1/embeddings

Capabilities

  • OpenAI-compatible /v1/embeddings
  • Multilingual (100+ languages)
  • Dense + sparse + multi-vector retrieval
  • 1,024-dimensional output
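Once live, the embeddings endpoint should accept the standard OpenAI request shape. A stdlib sketch, with the base URL and API key as placeholder assumptions:

```python
import json
import urllib.request

BASE_URL = "https://api.gpubox.ai"  # assumed placeholder
API_KEY = "YOUR_API_KEY"            # placeholder

def build_embeddings_request(texts):
    """JSON body for POST /v1/embeddings: one vector per input string."""
    return {"model": "bge-m3", "input": texts}

def embed(texts):
    """Return a list of 1,024-dimensional dense vectors, one per input."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/embeddings",
        data=json.dumps(build_embeddings_request(texts)).encode(),
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return [item["embedding"] for item in json.load(resp)["data"]]
```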

Want a model we don't serve yet?

We add open-weight models on customer demand. Tell us what you need.

hello@gpubox.ai