πŸ‘½ Edition 10: Testing Frameworks for LLMs, Part 1

Documenting the LLM testing frameworks I have found to date.

Back from vacation and past the writer’s block β˜€οΈ

In this edition, I have meticulously documented every LLM testing framework that I've come across on the internet and GitHub.

Basic LLM Testing Frameworks:

I am organizing the frameworks in alphabetical order, without assigning any specific rank to them.

πŸ‘©β€βš–οΈ DeepEval

DeepEval provides a Pythonic way to run offline evaluations on your LLM pipelines so you can launch comfortably into production. The guiding philosophy is a "Pytest for LLMs": productionizing and evaluating LLMs should be as easy as making sure all your tests pass.

In short, it's a tool for easy and efficient LLM testing. DeepEval aims to make writing tests for LLM applications (such as RAG) as easy as writing Python unit tests. A hand-rolled sketch of the idea behind one of its metrics follows the metrics list below.

πŸͺ‚ Metrics

  • AnswerRelevancy: Depends on "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"

  • BertScoreMetric: Depends on "sentence-transformers/all-mpnet-base-v2"

  • Dbias: LLMs can become highly biased after fine-tuning, RLHF, or other optimizations. Bias, however, is a vague term, so the underlying paper focuses on bias in the following areas.

    • Gender (e.g. "All man hours in his area of responsibility must be approved.")

    • Age (e.g. "Apply if you are a recent graduate.")

    • Racial/Ethnicity (e.g. "Police are looking for any black males who may be involved in this case.")

    • Disability (e.g. "Genuine concern for the elderly and handicapped")

    • Mental Health (e.g. "Any experience working with retarded people is required for this job.")

    • Religion

    • Education

    • Political ideology

    • This is measured with tests whose logic follows that paper.

  • BLEUMetric: Computes the BLEU score for a candidate sentence given a reference sentence. Depends on the NLTK models.

  • CohereRerankerMetric

  • ConceptualSimilarityMetric: Asserts conceptual similarity. Depends on "sentence-transformers/all-mpnet-base-v2"

  • ranking_similarity: Similarity measures between two different ranked lists. Built on β€œA Similarity Measure for Indefinite Rankings”

  • NonToxicMetric: Built on detoxify 

  • FactualConsistencyMetric: Depends on "cross-encoder/nli-deberta-v3-large"

  • EntailmentScoreMetric: Depends on "cross-encoder/nli-deberta-base"

  • Custom Metrics can be added.
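
To show the idea behind these metrics, here is an answer-relevancy check written in the "Pytest for LLMs" style, using the same "sentence-transformers/multi-qa-MiniLM-L6-cos-v1" model the metric depends on. This is my own hand-rolled sketch of the concept, not DeepEval's actual API (its import paths and class names have changed across versions):

```python
# Illustrative sketch of an answer-relevancy metric, not DeepEval's implementation.
from sentence_transformers import SentenceTransformer, util

# The same model the AnswerRelevancy metric is documented to depend on.
_model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

def answer_relevancy(question: str, answer: str) -> float:
    """Cosine similarity between the question and answer embeddings."""
    q_emb, a_emb = _model.encode([question, answer], convert_to_tensor=True)
    return util.cos_sim(q_emb, a_emb).item()

def test_answer_is_relevant():
    # Pytest-style assertion: the build fails if relevancy drops below the threshold.
    question = "What is the capital of France?"
    answer = "The capital of France is Paris."
    assert answer_relevancy(question, answer) > 0.5  # threshold is an arbitrary choice
```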

🎈 Details

πŸ§— Remarks

  • Clean Dashboard.

  • The metrics are model-derived, and that works well: you can swap the underlying model depending on performance.

  • Helpful to measure the output quality.

  • Limited community support.

πŸ•΅οΈ Agentops(in development)

🎈 Details

πŸ§— Remarks

  • Listed here because of its exciting LLM debugging roadmap.

πŸ’ͺ baserun.ai

From prompt playground to end-to-end tests, baserun helps you ship your LLM apps with confidence and speed.

Baserun is a Y Combinator-backed tool for debugging prompts at runtime.

🎈 Details

πŸ§— Remarks

  • Clean, detailed dashboard with prompt cost (I loved that).

  • The evaluation framework is heavily inspired by the OpenAI Evals project and offers a number of built-in evaluations that are recorded and aggregated in the Baserun dashboard.

  • The framework simplifies the LLM Debugging workflow.

  • Hallucinations can be prevented to some extent with the tool.

  • Limited scope for customisation.

🐀 PromptTools

Welcome to prompttools created by Hegel AI! This repo offers a set of open-source, self-hostable tools for experimenting with, testing, and evaluating LLMs, vector databases, and prompts. The core idea is to enable developers to evaluate using familiar interfaces like code, notebooks, and a local playground.

In just a few lines of code, you can test your prompts and parameters across different models (whether you are using OpenAI, Anthropic, or LLaMA models). You can even evaluate the retrieval accuracy of vector databases.

🎈 Details

πŸͺ‚ Metrics

Experiments and Harnesses

There are two main abstractions in the prompttools library: experiments and harnesses. Occasionally you may want to use a harness, because it abstracts away more details.

  • An experiment is a low-level abstraction that takes the cartesian product of possible inputs to an LLM API. For example, the OpenAIChatExperiment accepts lists of inputs for each parameter of the OpenAI Chat Completion API. Then, it constructs and asynchronously executes requests using those potential inputs. An example of using an experiment is here.

  • A harness is built on top of an experiment and manages abstractions over its inputs. A minimal experiment sketch follows this list.
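
To make the experiment abstraction concrete, here is a rough sketch of what running one looks like. I'm assuming the constructor shape from the version of prompttools I used (lists for the models, messages, and temperature); check the current docs for the exact API.

```python
# Rough prompttools experiment sketch; exact signatures may differ between versions.
from prompttools.experiment import OpenAIChatExperiment

messages = [
    [{"role": "user", "content": "Explain vector databases in one sentence."}],
    [{"role": "user", "content": "Explain RAG in one sentence."}],
]

# Every argument is a list; the experiment runs the cartesian product of them,
# i.e. 2 models x 2 message sets x 2 temperatures = 8 requests.
experiment = OpenAIChatExperiment(
    ["gpt-3.5-turbo", "gpt-4"],  # models
    messages,
    temperature=[0.0, 1.0],
)
experiment.run()        # executes all requests
experiment.visualize()  # renders the results table in a notebook or the playground
```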

Evaluation and Validation

These built-in functions help you evaluate the outputs of your experiments. They can also be used as part of your CI/CD pipeline.

  • You can also manually enter feedback to evaluate prompts; see HumanFeedback.ipynb.

  • It can use GPT-4 as a judge for model-graded evaluation; a generic sketch of that pattern follows.
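
prompttools ships its own helpers for this, but the "LLM as a judge" pattern is easy to see in isolation. The sketch below is a generic version written directly against the current OpenAI Python SDK, not prompttools' built-in function:

```python
# Generic model-graded evaluation sketch (not prompttools' built-in helper).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(prompt: str, response: str) -> bool:
    """Ask GPT-4 whether the response answers the prompt; return True/False."""
    verdict = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Does the RESPONSE correctly answer the PROMPT? Reply YES or NO.\n"
                f"PROMPT: {prompt}\nRESPONSE: {response}"
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().upper().startswith("YES")
```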

Here is the list of APIs supported by prompttools experiments:

  • LLMs

    • OpenAI (Completion, ChatCompletion, Fine-tuned models) - Supported

    • LLaMA.Cpp (LLaMA 1, LLaMA 2) - Supported

    • HuggingFace (Hub API, Inference Endpoints) - Supported

    • Anthropic - Supported

    • Google PaLM - Supported

    • Azure OpenAI Service - Supported

    • Replicate - Supported

    • Ollama - In Progress

  • Vector Databases and Data Utility

    • Chroma - Supported

    • Weaviate - Supported

    • Qdrant - Supported

    • LanceDB - Supported

    • Milvus - Exploratory

    • Pinecone - Exploratory

    • Epsilla - In Progress

  • Frameworks

    • LangChain - Supported

    • MindsDB - Supported

    • LlamaIndex - Exploratory

  • Computer Vision

    • Stable Diffusion - Supported

    • Replicate's hosted Stable Diffusion - Supported

πŸ§— Remarks

  • I have been using it for the last 15 days. The Streamlit-based dashboard is smooth.

  • `Prompt Template Experimentation` is a nice feature of the product, but I would like to see more comparison detail beyond latency and similarity scores.

  • The framework covers LLMs, vector databases, and orchestrators.

  • Great Community Support.

  • Great tool for RLHF.

  • Can’t add a self-hosted server.

πŸ₯ Promptfoo: Test your prompts

promptfoo is a tool for testing and evaluating LLM output quality.

With promptfoo, you can compare prompts, models, and configurations side-by-side against predefined test cases.

The goal: test-driven prompt engineering, rather than trial-and-error.

🎈 Details

  • Documentation 

  • GitHub

  • License: MIT license

  • It renders side-by-side comparisons of multiple prompts and inputs.

  • It works on the command line too.

πŸ§— Remarks

  • A detailed customizable prompt template library.

  • A great tool for prompt engineering.

  • Supports the common LLM providers.

  • You can check different scenarios:

    • You can add assertions to the prompt configuration, for example (a plain-Python illustration of these checks follows this list):

      • Verify that the output doesn't contain "AI language model".

      • Verify that the output doesn't apologize, using model-graded eval (must not contain an apology).

      • Prefer shorter outputs using a scoring function.

      • Avoiding repetition.

      • Auto-validate output with assertions.

      • Multiple variables in a single test case.

      • Other capabilities, such as postprocessing.
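
promptfoo declares these checks in its YAML config rather than in code. To make the scenarios above concrete, here is the same idea written as plain pytest-style Python assertions; this is only an illustration of what each check does, not promptfoo's syntax, and `call_llm` is a hypothetical placeholder for your actual model call.

```python
# Plain-Python illustration of the kinds of checks promptfoo declares in its config.

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: wire this up to your provider of choice.
    raise NotImplementedError

def brevity_score(output: str, max_words: int = 50) -> float:
    """Scoring function that prefers shorter outputs (1.0 = within the word budget)."""
    words = len(output.split())
    return min(1.0, max_words / max(words, 1))

def test_output_quality():
    output = call_llm("Summarize the refund policy for a customer.")

    # Verify the output doesn't contain "AI language model".
    assert "AI language model" not in output

    # Verify the output doesn't apologize (promptfoo does this with a model-graded
    # eval; a plain substring check stands in for it here).
    assert "sorry" not in output.lower()

    # Prefer shorter outputs using a scoring function.
    assert brevity_score(output) > 0.5
```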

🐚 Nvidia NeMo-Guardrails

NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems. Guardrails (or "rails" for short) are specific ways of controlling the output of a large language model, such as not talking about politics, responding in a particular way to specific user requests, following a predefined dialog path, using a particular language style, extracting structured data, and more.

This toolkit is currently in its early alpha stages, and we invite the community to contribute towards making the power of trustworthy, safe, and secure LLMs accessible to everyone. The examples provided within the documentation are for educational purposes to get started with NeMo Guardrails, and are not meant for use in production applications.

We are committed to improving the toolkit in the near term to make it easier for developers to build production-grade trustworthy, safe, and secure LLM applications.

NeMo Guardrails will help ensure smart applications powered by large language models (LLMs) are accurate, appropriate, on topic, and secure. The software includes all the code, examples, and documentation businesses need to add safety to AI apps that generate text.

It sits between the user and the LLM server, applying its checks after the user message has been vector-embedded. Because it is open source, engineers can write their own logic into the guardrail. A minimal usage sketch follows the list of boundaries below.

NeMo Guardrails enables developers to set up three kinds of boundaries:

  • Topical guardrails prevent apps from veering off into undesired areas. For example, they keep customer service assistants from answering questions about the weather.

  • Safety guardrails ensure apps respond with accurate, appropriate information. They can filter out unwanted language and enforce that references are made only to credible sources.

  • Security guardrails restrict apps to making connections only to external third-party applications known to be safe.
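
To make the "plain English" point concrete, here is a minimal usage sketch. The Python entry point (RailsConfig / LLMRails) follows the NeMo Guardrails README; the Colang rail shown in the comment is the classic "no politics" topical rail and would live in the config directory you point it at.

```python
# Minimal NeMo Guardrails sketch; see the project README for full configuration details.
from nemoguardrails import LLMRails, RailsConfig

# The ./config directory holds the model settings plus Colang rails, e.g. a topical rail:
#
#   define user ask politics
#     "Who should I vote for?"
#     "What do you think about the president?"
#
#   define bot refuse politics
#     "I'm a support assistant, so I can't discuss politics."
#
#   define flow politics
#     user ask politics
#     bot refuse politics
#
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

reply = rails.generate(messages=[
    {"role": "user", "content": "Who should I vote for?"}
])
print(reply["content"])  # the bot's refusal, as defined by the rail
```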

🎈 Details

πŸ§— Remarks

  • NeMo Guardrails is an easily programmable guardrail layer and a must-have for production LLM applications.

  • The conversation designer can define the boundaries of the conversation in plain, English-like syntax using Colang.

  • The filtering policy of the guardrail operates in embedding space, which makes it more intelligent.

  • Supports production batching for orchestration.

  • The community is great.

  • One of the most needed frameworks right now.


I will publish the next Edition with the other five frameworks on Sunday.

This is the 10th edition. If you have any feedback, please don’t hesitate to share it with me, and if you love my work, do share it with your colleagues.

It takes time to research and document all of this, so please consider becoming a paid subscriber and supporting my work.

Cheers!!

Raahul

