STACKQUADRANT

deepeval

confident-ai/deepeval
8.0

The LLM Evaluation Framework

Evaluation & Testing
16.5k · 1.6k · Python · Apache-2.0 · 2d ago

Ragas

explodinggradients/ragas
7.4

Ragas — a leading open-source project in the AI/LLM ecosystem.

Evaluation & Testing
14.6k · 1.5k · Python · Apache-2.0 · 4mo ago

garak

NVIDIA/garak
7.5

the LLM vulnerability scanner

Evaluation & Testing
8.2k · 1.1k · Python · Apache-2.0 · 2d ago

chinese-llm-benchmark

jeinlee1991/chinese-llm-benchmark
6.3

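ReLE Evaluation: capability evaluation of Chinese AI large models (continuously updated). It currently covers 335 large models, including commercial models such as chatgpt, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.5, Wenxin ERNIE-X1.1, ERNIE-5.0-Thinking, qwen3-max, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as open-source models such as kimi-k2, ernie4.5, minimax-M2, deepseek-v3.2, qwen3-2507, llama4, Zhipu GLM-4.6, gemma3, and mistral. Beyond leaderboards, it also provides a library of more than 2 million large-model defects, so the community can study, analyze, and improve large models.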

Evaluation & Testing
6.2k · 254 · 1d ago

LLM-Engineers-Handbook

PacktPublishing/LLM-Engineers-Handbook
6.6

The LLM's practical guide: From the fundamentals to deploying advanced LLM and RAG apps to AWS using LLMOps best practices

Evaluation & Testing
5.1k · 1.2k · Python · MIT · 2mo ago

lmms-eval

EvolvingLMMs-Lab/lmms-eval
7.5

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

Evaluation & Testing
4.3k · 608 · Python · NOASSERTION · 4d ago

agenta

Agenta-AI/agenta
7.4

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

Evaluation & Testing
4.2k · 555 · TypeScript · NOASSERTION · today

AI-Infra-Guard

Tencent/AI-Infra-Guard
7.4

A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP scan, AI Infra scan and LLM jailbreak evaluation.

Evaluation & Testing
4.0k · 385 · Python · Apache-2.0 · 2d ago

trulens

truera/trulens
7.3

Evaluation and Tracking for LLM Experiments and AI Agents

Evaluation & Testing
3.4k · 306 · Python · MIT · 6d ago

lmnr

lmnr-ai/lmnr
7.0

Laminar - open-source observability platform purpose-built for AI agents. YC S24.

Evaluation & Testing
3.0k · 212 · TypeScript · Apache-2.0 · today

Observal

BlazeUp-AI/Observal
6.0

Observal is an AI agent registry with a first-in-class observability and eval framework

Evaluation & Testing
2.1k · 459 · Python · NOASSERTION · today

aisheets

huggingface/aisheets
6.1

Build, enrich, and transform datasets using AI models with no code

Evaluation & Testing
1.6k · 141 · TypeScript · Apache-2.0 · 1mo ago

FuzzyAI

cyberark/FuzzyAI
5.4

A powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify and mitigate potential jailbreaks in their LLM APIs.

Evaluation & Testing
1.5k · 207 · Jupyter Notebook · Apache-2.0 · 4mo ago

prompty

microsoft/prompty
6.8

Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.

Evaluation & Testing
1.2k · 118 · TypeScript · MIT · today

uqlm

cvs-health/uqlm
6.7

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection

Evaluation & Testing
1.2k · 127 · Python · Apache-2.0 · 20d ago

FinSight-AI

juanjuandog/FinSight-AI
5.1

AI equity research agent with resilient workflows, Redis Lua single-flight, pgvector RAG, versioned reports, evidence tracing, and RAG evaluation.

Evaluation & Testing
1.1k · 59 · Java · MIT · 1mo ago

passmark

bug0inc/passmark
5.9

The open-source Playwright library for AI browser regression testing with intelligent caching, auto-healing, and multi-model verification.

Evaluation & Testing
1.1k · 170 · TypeScript · NOASSERTION · 12d ago

judgeval

JudgmentLabs/judgeval
6.7

The open source post-building layer for agents. Our environment data and evals power agent post-training (RL, SFT) and monitoring.

Evaluation & Testing
1.0k · 93 · Python · Apache-2.0 · 3d ago

WHartTest

MGdaasLab/WHartTest
6.2

WHartTest is an AI-driven test automation platform that automates the generation and management of executable test cases from requirements, helping testing teams improve efficiency and coverage.

Evaluation & Testing
938 · 131 · Python · MIT · 2d ago

scenario

langwatch/scenario
5.9

Agentic testing for agentic codebases

Evaluation & Testing
906 · 67 · Python · MIT · 2d ago

Awesome-LLM-Eval

onejune2018/Awesome-LLM-Eval
4.7

Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluating foundation LLMs and exploring the technical boundaries of generative AI.

Evaluation & Testing
647 · 76 · MIT · 7mo ago

aimock

CopilotKit/aimock
6.3

Mock everything your AI app talks to — LLM APIs, MCP, A2A, vector DBs, search. One package, one port, zero dependencies.

Evaluation & Testing
637 · 44 · TypeScript · MIT · today

Awesome-LLM-in-Social-Science

ValueByte-AI/Awesome-LLM-in-Social-Science
5.1

Awesome papers involving LLMs in Social Science.

Evaluation & Testing
633 · 49 · MIT · 20d ago

agent-skills-eval

darkrishabh/agent-skills-eval
5.3

A test runner for agentskills.io-style AI agent skills

Evaluation & Testing
603 · 30 · TypeScript · MIT · 4d ago

iFixAi

ifixai-ai/iFixAi
6.2

The open-source diagnostic for AI misalignment. 32 tests across fabrication, manipulation, deception, unpredictability, and opacity. Provider-agnostic. Runs against OpenAI, Anthropic, Bedrock, Azure, Gemini, and more. Letter grade in under 5 minutes, content-addressed manifest for bit-identical replay. Built by iMe.

Evaluation & Testing
570 · 112 · Python · Apache-2.0 · 1d ago

langtest

Pacific-AI-Corp/langtest
5.8

Deliver safe & effective language models

Evaluation & Testing
562 · 50 · Python · Apache-2.0 · 2mo ago

langtest

PacificAI/langtest
5.8

Deliver safe & effective language models

Evaluation & Testing
562 · 50 · Python · Apache-2.0 · 2mo ago

awesome-evals

benchflow-ai/awesome-evals
4.7

A curated, non-BS library of the best resources for building and evaluating AI agents — papers, blogs, talks, tools, benchmarks. Maintained by BenchFlow.

Evaluation & Testing
552 · 40 · NOASSERTION · today

continuous-eval

relari-ai/continuous-eval
4.7

Data-Driven Evaluation for LLM-Powered Applications

Evaluation & Testing
517 · 38 · Python · Apache-2.0 · 1y ago

fakecloud

faiscadev/fakecloud
5.7

Free, open-source AWS emulator. LocalStack alternative: 26 services, 1,924 operations, 100% conformance. No account, no auth token, no paid tier.

Evaluation & Testing
453 · 29 · Rust · AGPL-3.0 · today

rhesis

rhesis-ai/rhesis
5.5

The testing platform for AI teams. Bring engineers, PMs, and domain experts together to generate tests, simulate (adversarial) conversations, and trace every failure to its root cause.

Evaluation & Testing
373 · 26 · Python · NOASSERTION · 1d ago

llm-leaderboard

JonathanChavezTamales/llm-leaderboard
4.7

A comprehensive set of LLM benchmark scores and provider prices. (deprecated, read more in README)

Evaluation & Testing
361 · 40 · JavaScript · NOASSERTION · 8mo ago

palico-ai

palico-ai/palico-ai
4.5

Build, Improve Performance, and Productionize your LLM Application with an Integrated Framework

Evaluation & Testing
343 · 28 · TypeScript · MIT · 1y ago

llms-tools

PetroIvaniuk/llms-tools
4.8

A list of LLMs Tools & Projects

Evaluation & Testing
319 · 45 · Apache-2.0 · 27d ago

flutter-skill

ai-dashboad/flutter-skill
5.4

AI-powered E2E testing for 10 platforms. 253 MCP tools. Zero config. Works with Claude, Cursor, Windsurf, Copilot. Test Flutter, React Native, iOS, Android, Web, Electron, Tauri, KMP, .NET MAUI — all from natural language.

Evaluation & Testing
315 · 44 · Dart · MIT · 16d ago

athina-evals

athina-ai/athina-evals
4.1

Python SDK for running evaluations on LLM generated responses

Evaluation & Testing
301 · 22 · Python · 1y ago

testdriverai

testdriverai/testdriverai
4.6

Computer-Use SDK for E2E QA Testing

Evaluation & Testing
222 · 33 · JavaScript · 1mo ago

qaskills

PramodDutta/qaskills
4.2

QA Skills Directory: a curated directory of testing-specific skills for AI coding agents (Claude Code, Cursor, Copilot, etc.).

Evaluation & Testing
160 · 16 · TypeScript · today

agent-qa

vostride/agent-qa
4.8

The self-improving Agentic QA harness with Memory. Write tests in natural language. Catch regressions before releases ship.

Evaluation & Testing
150 · 7 · TypeScript · NOASSERTION · 14d ago