STACKQUADRANT

omlx

jundot/omlx
7.8

LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar

Model Serving
17.2k · 1.5k · Python · Apache-2.0 · today

TensorRT-LLM

NVIDIA/TensorRT-LLM
7.1

NVIDIA's open-source library for optimizing large language model inference on NVIDIA GPUs.

Model Serving
14.0k · 2.5k · Python · NOASSERTION · today

vllm-omni

vllm-project/vllm-omni
7.5

A framework for efficient model inference with omni-modality models

Model Serving
5.3k · 1.2k · Python · Apache-2.0 · today

Olares

beclab/Olares
7.0

Olares: An Open-Source Personal Cloud to Reclaim Your Data

Model Serving
5.0k · 302 · Go · AGPL-3.0 · 1d ago

Deep-Learning-in-Production

ahkarami/Deep-Learning-in-Production
4.5

In this repository, I will share some useful notes and references about deploying deep learning-based models in production.

Model Serving
4.4k · 685 · 1y ago

AI-Infra-from-Zero-to-Hero

HuaizhengZhang/AI-Infra-from-Zero-to-Hero
6.2

🚀 Awesome System for Machine Learning ⚡️ AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI). 🍻 OSDI, NSDI, SIGCOMM, SoCC, MLSys, etc. 🗃️ Llama3, Mistral, etc. 🧑‍💻 Video Tutorials.

Model Serving
4.1k · 401 · MIT · 11mo ago

LightLLM

ModelTC/LightLLM
6.5

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

Model Serving
4.1k · 335 · Python · Apache-2.0 · 2d ago

chitu

thu-pacman/chitu
6.8

High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.

Model Serving
3.1k · 266 · Python · Apache-2.0 · 1d ago

ramalama

containers/ramalama
7.5

RamaLama is an open-source developer tool that simplifies the local serving of AI models from any source and facilitates their use for inference in production, all through the familiar language of containers.

Model Serving
2.9k · 344 · Python · MIT · 1d ago

inference

roboflow/inference
7.0

Turn any computer or edge device into a command center for your computer vision projects.

Model Serving
2.3k · 277 · Python · NOASSERTION · 2d ago

vllm-ascend

vllm-project/vllm-ascend
7.2

Community maintained hardware plugin for vLLM on Ascend

Model Serving
2.3k · 1.5k · C++ · Apache-2.0 · today

envd

tensorchord/envd
6.8

🏕️ Reproducible development environment for humans and agents

Model Serving
2.2k · 169 · Go · Apache-2.0 · 1mo ago

sie

superlinked/sie
6.6

Superlinked Inference Engine is an open-source inference server and production cluster for embeddings, reranking, and extraction.

Model Serving
2.1k · 183 · Python · Apache-2.0 · 1d ago

aici

microsoft/aici
4.9

AICI: Prompts as (Wasm) Programs

Model Serving
2.1k · 84 · Rust · MIT · 1y ago

mlrun

mlrun/mlrun
7.2

MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.

Model Serving
1.7k · 308 · Python · Apache-2.0 · today

kitops

kitops-ml/kitops
7.0

An open source DevOps tool from the CNCF for packaging and versioning AI/ML models, datasets, code, and configuration into an OCI Artifact.

Model Serving
1.4k · 176 · Go · Apache-2.0 · 2d ago

hopsworks

logicalclocks/hopsworks
5.8

Hopsworks - Data-Intensive AI platform with a Feature Store

Model Serving
1.3k · 158 · Java · AGPL-3.0 · 1y ago

rtp-llm

alibaba/rtp-llm
6.0

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

Model Serving
1.2k · 219 · Cuda · Apache-2.0 · today

truss

basetenlabs/truss
6.8

The simplest way to serve AI/ML models in production

Model Serving
1.2k · 109 · Python · MIT · 2d ago

Nanoflow

efeslab/Nanoflow
4.7

A throughput-oriented high-performance serving framework for LLMs

Model Serving
965 · 50 · Jupyter Notebook · 3mo ago

mosec

mosecorg/mosec
6.5

A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute machine

Model Serving
902 · 73 · Python · Apache-2.0 · 3d ago

model_server

openvinotoolkit/model_server
6.5

A scalable inference server for models optimized with OpenVINO™

Model Serving
892 · 260 · C++ · Apache-2.0 · 2d ago

pipeless

pipeless-ai/pipeless
4.9

An open-source computer vision framework to build and deploy apps in minutes

Model Serving
849 · 52 · Rust · Apache-2.0 · 2y ago

Yatai

bentoml/Yatai
6.1

Model Deployment at Scale on Kubernetes 🦄️

Model Serving
844 · 76 · TypeScript · NOASSERTION · 29d ago

ServerlessLLM

ServerlessLLM/ServerlessLLM
5.8

Serverless LLM Serving for Everyone.

Model Serving
687 · 74 · Python · Apache-2.0 · 1mo ago

timber

kossisoroyce/timber
5.4

Ollama for classical ML models. AOT compiler that turns XGBoost, LightGBM, scikit-learn, CatBoost & ONNX models into native C99 inference code. One command to load, one command to serve. 336x faster than Python inference.

Model Serving
685 · 23 · Python · NOASSERTION · 2mo ago

fastapi-ml-skeleton

eightBEC/fastapi-ml-skeleton
4.5

A production-ready FastAPI skeleton app for serving machine learning models.

Model Serving
604 · 91 · Python · Apache-2.0 · 5mo ago

pinferencia

underneathall/pinferencia
4.7

Python + Inference - Model Deployment library in Python. Simplest model inference server ever.

Model Serving
543 · 83 · Python · Apache-2.0 · 3y ago

ome

ome-projects/ome
6.1

Open Model Engine (OME) — Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, TensorRT-LLM, and Triton

Model Serving
472 · 83 · Go · Apache-2.0 · 1d ago

JetStream

AI-Hypercomputer/JetStream
4.8

JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs welcome).

Model Serving
447 · 66 · Python · Apache-2.0 · 5mo ago

xFasterTransformer

intel/xFasterTransformer
4.3

Intel's open-source LLM inference framework optimized for x86 (Intel Xeon) CPUs.

Model Serving
436 · 75 · C++ · Apache-2.0 · 9mo ago

gpu-rest-engine

NVIDIA/gpu-rest-engine
3.7

A REST API for Caffe using Docker and Go

Model Serving
422 · 93 · C++ · BSD-3-Clause · 7y ago

stable-diffusion-deploy

Lightning-Universe/stable-diffusion-deploy
4.6

Learn to serve Stable Diffusion models on cloud infrastructure at scale. This Lightning App shows load-balancing, orchestrating, pre-provisioning, dynamic batching, GPU-inference, micro-services working together via the Lightning Apps framework.

Model Serving
391 · 39 · Python · Apache-2.0 · 2y ago

TurboOCR

aiptimizer/TurboOCR
5.1

Fast GPU OCR server. 270 img/s on FUNSD. TensorRT FP16, PP-OCRv5, HTTP + gRPC.

Model Serving
305 · 37 · C++ · MIT · today

pmetal

Epistates/pmetal
5.0

PMetal: high-performance Apple Silicon framework for local LLM inference, LoRA/QLoRA fine-tuning, serving, quantization, and MLX/Metal acceleration.

Model Serving
300 · 21 · Rust · NOASSERTION · 23d ago

podman-desktop-extension-ai-lab

containers/podman-desktop-extension-ai-lab
5.9

Work with LLMs on a local environment using containers

Model Serving
291 · 82 · TypeScript · Apache-2.0 · 6d ago

BMW-YOLOv4-Inference-API-GPU

BMW-InnovationLab/BMW-YOLOv4-Inference-API-GPU
4.1

This is a repository for a no-code object detection inference API using the YOLOv3 and YOLOv4 Darknet framework.

Model Serving
277 · 67 · Python · BSD-3-Clause · 4y ago

llm-server

raketenkater/llm-server
4.8

Auto-tuned launcher for GGUF models on llama.cpp / ik_llama.cpp — OpenAI-compatible server with multi-GPU tensor-split, MoE expert placement, measured flag tuning (AI Tune), hardware-matched HuggingFace downloads, and crash recovery. An Ollama alternative for multi-GPU rigs.

Model Serving
237 · 12 · Go · MIT · 3d ago

ggrun

raketenkater/ggrun
4.8

Auto-tuned launcher for GGUF models on llama.cpp / ik_llama.cpp — OpenAI-compatible server with multi-GPU tensor-split, MoE expert placement, measured flag tuning (AI Tune), hardware-matched HuggingFace downloads, and crash recovery. An Ollama alternative for multi-GPU rigs.

Model Serving
237 · 12 · Go · MIT · 3d ago

BMW-YOLOv4-Inference-API-CPU

BMW-InnovationLab/BMW-YOLOv4-Inference-API-CPU
3.9

This is a repository for a no-code object detection inference API using YOLOv4 and YOLOv3 with OpenCV.

Model Serving
218 · 58 · Python · NOASSERTION · 4y ago