πŸ‘½ Edition 11:Testing Framework for LLM Part 2

Documenting the LLM Testing Framework till date.

Continuation of the last post:

Please visit the first post of the series if you missed it.

🦜 Agenta

Building production-ready LLM-powered applications is currently very difficult. It involves countless iterations of prompt engineering, parameter tuning, and architectures.

Agenta provides you with the tools to quickly do prompt engineering and πŸ§ͺ experiment, βš–οΈ evaluate, and πŸš€ deploy your LLM apps. All without imposing any restrictions on your choice of framework, library, or model.

🎈 Details

πŸ§— Remarks

  • The website and app code have excellent UX. The end-to-end user journey, from creation to testing, is beautifully designed.

  • Can be hosted OnPrem - Aws or GCP

  • They have different parts:

    • Playground: to create the prompts from lots of predefined templates like

      • sales_call_summarizer

      • baby_name_generator

      • chat_models

      • completion_models

      • compose_email

      • experimental

      • extract_data_to_json

      • job_info_extractor

      • noteGPT

      • recipes_and_ingredients

      • sales_call_summarizer,

      • sales_transcript_summarizer,

      • sentiment_analysis

    • Test Sets

    • Evaluate

    • API Endpoint

🦚 AgentBench

Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn open-ended generation setting. Our extensive test over 25 LLMs (including APIs and open-sourced models) shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and open-sourced competitors.

🎈 Details

πŸ§— Remarks

  • This paper evaluates the performance of several LLMs (LLama 2, Vicuna, GPT-X, Dolly, etc.) as intelligent agents in a long chain environment that involves databases (SQL), web booking, and product comparison on the internet. The main question to be answered is whether LLama 2 is superior to ChatGPT in comparing products on the internet. For the purpose of this study, an agent refers to an LLM that operates in this environment.

  • An "AGENT" is an LLM that operates within a simulated environment to achieve a specific goal. In this study, the term, "agent" is used to refer to such an LLM. The agent's performance is assessed based on its capability to complete assigned tasks.

  • To date, It’s one of the best approaches to evaluating a LLM model for various tasks.

Thank you for reading Musings on AI. This post is public so feel free to share it.

πŸ¦ƒ AI Hero Studio ✨ Prompt Craft ✨ (Beta Phase)

🎈 Details

πŸ§— Remarks

  • Detailed β€œopenAI” API-based prompt experimentation dashboard.

  • The tool have a Promot Auto Completion feature that will enhance the input prompt quality using the predefined prompt templates.

  • Version Wise prompt management.

πŸ¦† Arthur Bench

Today, we’re excited to introduce our newest product: Arthur Bench, the most robust way to evaluate LLMs. Bench is an open-source evaluation tool for comparing LLMs, prompts, and hyperparameters for generative text models. This open source tool will enable businesses to evaluate how different LLMs will perform in real-world scenarios so they can make informed, data-driven decisions when integrating the latest AI technologies into their operations. 

Here are some ways in which Arthur Bench helps businesses:

🎈 Details

πŸ§— Remarks

  • This tool creates a test suite automatically using datasets.

  • Periodically validates models for resiliency to model changes outside their control.

  • The system offers deployment gates that identify anomalous inputs, potential PII leakage, toxicity, and other quality metrics. It learns from production performance to optimize thresholds for these quality gates.

  • Provides core token-level observability, performance dashboarding, inference debugging, and alerting.

  • Accelerates ability to identify and debug underperforming regions.

🐿️ Guidance

Guidance enables you to control modern language models more effectively and efficiently than traditional prompting or chaining. Guidance programs allow you to interleave generation, prompting, and logical control into a single continuous flow matching how the language model actually processes the text. Simple output structures like Chain of Thought and its many variants (e.g., ART, Auto-CoT, etc.) have been shown to improve LLM performance. The advent of more powerful LLMs like GPT-4 allows for even richer structure, and guidance makes that structure easier and cheaper.

🎈 Details

πŸ•΅οΈβ€β™€οΈ Features

πŸ”Ή Live streaming

  • Simple, intuitive syntax. Guidance feels like a templating language, and just like standard Handlebars templates, you can do variable interpolation (e.g., ) and logical control.

  • Details

πŸ”Ή Chat dialog

  • Guidance supports API-based chat models like GPT-4, as well as open chat models like Vicuna through a unified API based on role tags (e.g., ...). This allows interactive dialog development that combines rich templating and logical control with modern chat models.

  • Details 

πŸ”Ή  Guidance acceleration

  • When multiple generation or LLM-directed control flow statements are used in a single Guidance program then we can significantly improve inference performance by optimally reusing the Key/Value caches as we progress through the prompt. This means Guidance only asks the LLM to generate the green text below, not the entire program. This cuts this prompt's runtime in half vs. a standard generation approach.

  • Details.

πŸ”Ή Token healing

  • The standard greedy tokenizations used by most language models introduce a subtle and powerful bias that can have all kinds of unintended consequences for your prompts. Using a process we call "token healing" guidance automatically removes these surprising biases, freeing you to focus on designing the prompts you want without worrying about tokenization artifacts.

  • Details

πŸ”Ή  Rich output structure example

  • To demonstrate the value of output structure, we take a simple task from BigBench, where the goal is to identify whether a given sentence contains an anachronism (a statement that is impossible because of non-overlapping time periods). Below is a simple two-shot prompt for it, with a human-crafted chain-of-thought sequence.

  • Details

πŸ”Ή Guaranteeing valid syntax JSON example

  • Large language models are great at generating useful outputs, but they are not great at guaranteeing that those outputs follow a specific format. This can cause problems when we want to use the outputs of a language model as input to another system. For example, if we want to use a language model to generate a JSON object, we need to make sure that the output is valid JSON. With guidance we can both accelerate inference speed and ensure that generated JSON is always valid. Below we generate a random character profile for a game with perfect syntax every time.

  • Details

πŸ”Ή Role-based chat model example

  • Modern chat-style models like ChatGPT and Alpaca are trained with special tokens that mark out "roles" for different areas of the prompt. Guidance supports these models through role tags that automatically map to the correct tokens or API calls for the current LLM. Below we show how a role-based guidance program enables simple multi-step reasoning and planning.

  • Details

πŸ”Ή Agents

  • We can easily build agents that talk to each other or to a user, via the await command. The await command allows us to pause execution and return a partially executed guidance program. By putting await in a loop, that partially executed program can then be called again and again to form a dialog (or any other structure you design). For example, here is how we might get GPT-4 to simulate two agents talking to one another.

  • Details

πŸ§— Remarks

  • If I need to select a tool for prompt engineering, I select this one.

  • Community Support is Superb.

🌳 Galileo LLM Studio

Algorithm-powered LLMOps Platform

Find the best prompt, inspect data errors while fine-tuning, monitor LLM outputs in real-time. All in one powerful, collaborative platform.

🎈 Details

πŸ•΅οΈβ€β™€οΈ Features

πŸ”Ή Prompt Engineering

  • Promot Inspector.

  • A detailed easy Dashboard with multiple parameters and evaluation scores.

  • Hallucination Score.

πŸ”Ή LLM Fine-Tune and Debugging

  • The watcher function analyze the input data.

  • A detailed dashboard with data quality - Auto identification of the data pulling from LLM that reduces the performance.

  • Fix and track data changes over time.

πŸ”Ή Production Monitoring

  •  Real-time LLM Monitoring.

  • Risk Control with customized plugins

  • Customized alert with your Slack.

πŸ§— Remarks

  • To date, I found this one is the tool for LLMOps. The developer can push the LLM model into production with confidence using the tool.

πŸŽ„ lakera.ai

An Overview of Lakera Guard – Bringing Enterprise-Grade Security to LLMs with One Line of Code

At Lakera, we supercharge AI developers by enabling them to swiftly identify and eliminate their AI applications’ security threats so that they can focus on building the most exciting applications securely.

Businesses around the world are integrating LLMs into their applications at lightning speeds. At the same time, LLM applications bring completely new types of security risks that organizations need to address.

This is why we’re super excited to introduce Lakera Guard – a developer-first API to bring enterprise-grade security to your LLM applications. It is lightning-fast and can be integrated within minutes. We’ve designed it so that developers love working with it!

πŸ•΅οΈβ€β™€οΈ Features

πŸ”Ή Content moderation

  • These are the categories that Lakera Guard currently evaluate against for inappropriate content in the input prompt.

    • HateContent targeting race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste, including violence. Content directed at non-protected groups (e.g., chess players) is exempt.

    • SexContent meant to arouse sexual excitement, such as the description of sexual activity, or that promotes sexual services (excluding sex education and wellness).

πŸ”Ή Prompt injections

  • Jailbreaks

    • LLMs can be forced into malicious behavior by jailbreak attack prompts. Lakera Guard updates to protect against these.

  • Prompt injections

    • Prompt injection attacks must be stopped at all costs. Attackers will do whatever it takes to manipulate the system's behavior or gain unauthorized access. But fear not, Lakera Guard is constantly updated to prevent prompt injections and protect your system from harm.

πŸ”Ή Sensitive information

  • PII stands for Personally Identifiable Information - data that can identify an individual. It requires strict protection due to identity theft and privacy risks. Organizations handling PII must safeguard it to prevent unauthorized access. Laws like GDPR and HIPAA ensure proper PII handling and privacy protection.

πŸ”Ή Relevant Language

  • There are many ways to challenge LLMs using language. Users may:

    • Either Use Japanese jailbreaks.

    • Employ Portuguese prompt injections

    • Intentionally include spelling errors in prompts to bypass defenses.

    • Insert extensive code or special characters into prompts.

  • They assign a score between 0 and 1 to indicate the authenticity of a prompt. A higher score suggests a genuine attempt at regular communication.

πŸ”Ή Unknown links

  • One way in which prompt injection can be dangerous is phishing.

πŸ§— Remarks

  • The Roadmap is amazing.

  • LLM security is a real topic - and they are working on it.

🐣 NightFall AI

Securing generative AI

ChatGPT and other generative AI tools are powerful ways to increase your team's output. But sensitive data such as PII, confidential information, API keys, PHI, and much more can be contained in prompts. Rather than block these tools, use Nightfall's Chrome extension or Developer Platform to:

🎈 Details

πŸ§— Remarks

  • A great tool for handling LLM security.

  • Manage all security tasks in your SIEM or Nightfall dashboard.

  • Proactively protect your company and customer data.

  • Identify and manage secrets and keys from a single dashboard.

  • Train employees on best practice security policies, Build a culture of trust and strong data security hygiene.

  • Complete visibility of your sensitive data.

Thank you for reading Musings on AI. This post is public so feel free to share it.

🦒 BenchLLM

BenchLLM is a Python-based open-source library that streamlines the testing of Large Language Models (LLMs) and AI-powered applications. It measures the accuracy of your model, agents, or chains by validating responses on any number of tests via LLMs.

🎈 Details

πŸ§— Remarks

  • A detailed customizable library to evaluate prompt performance.

  • A great tool for prompt engineering.

  • Support Vector Retrieval, Similary, Orchestrators and Function Calling.

  • Test the responses of your LLM across any number of prompts.

  • Continuous integration for chains like Langchain, agents like AutoGPT, or LLM models like Llama or GPT-4.

  • Eliminate flaky chains and create confidence in your code.

  • Spot inaccurate responses and hallucinations in your application at every version.

πŸ¦‰ Martian

Dynamically route every prompt to the best LLM. Highest performance, lowest costs, incredibly easy to use.

There are over 250,000 LLMs today. Some are good at coding. Some are good at holding conversations. Some are up to 300x cheaper than others. You could hire an ML engineering team to test every single one β€” or you can switch to the best one for each request with Martian.

Before:

After:

🎈 Details

πŸ§— Remarks

  • In the development phase, but I love the idea. It is trying to solve one of the most burning problems in the LLM ecosystem.

  • There are various models available in the market that specialize in different tasks such as coding and storytelling. The Martian SDK is designed to identify the prompt's intention and utilize various models internally to produce the output.

  • GPT 4 models is 316x Costlier than a 7 billion model - β€œDon't waste money by paying senior models to do junior work. The model router sends your tasks to the right model.”

Thank you for reading Musings on AI. This post is public so feel free to share it.

🐹 Special Mention

πŸ₯¬ Rellm

ReLLM was created to fill a need when developing a separate tool. We needed a way to provide long term memory and context to our users, but we also needed to account for permissions and who can see what data.

πŸ₯¦ LangDock

The GDPR-compliant ChatGPT for your team

πŸ₯’ TryTaylor

Taylor AI allows enterprises to train and own their own proprietary fine-tuned LLMs in minutes, not weeks.

πŸ‰ scorecard.ai

Testing for Production-ready LLMs.Ship faster with more confidence.

Integrate in minutes.

🍈 signway.io

Signway is a proxy server that addresses the problem of re-streaming API responses from backend to frontend by allowing the frontend to directly request the API using a pre-signed URL created by Signway. This URL is short-lived, and once it passes verification for authenticity and expiry, Signway will proxy the request to the API and add the necessary authentication headers.

Deploy AI SaaS to security-demanding organizations

Mithril Security helps software vendors sell SaaS to enterprises, thanks to our secure enclave deployment tooling, which provides SaaS on-prem levels of security and control for customers.

πŸ₯ kobaltlabs

LLMs, made private and secure

Unlock the power of GPT for your most sensitive data with a fast, simple security API

πŸ₯­ cadea.ai

Secure AI for Business

Deploy enterprise-level AI tools equipped with e2e data security and role based access control. Our platform helps you create, manage, and monitor chatbots that can answer questions about your internal documents.

🐢 Summary

It is hard to compare apple-to-apple. That why I have grouped the frameworks (No rank).

πŸ”Ή  Prompt Engineering (Make Prompts better)

  • Baserun

  • PromptTools

  • DeepEval

  • Promptfoo

  • Nvidia NeMo-Guardrails

  • Agenta

  • AI Hero Studio

  • Guidance

  • Galileo LLM Studio

  • BenchLLM

πŸ”Ή  Everything about LLM (Fine-tune, Debugging, Monitoring)

  • Baserun

  • Agenta

  • Nvidia NeMo-Guardrails

  • AgentBench

  • Galileo LLM Studio

  • Martian

πŸ”Ή LLM Security (Guard The LLM Fortress)

  • Nvidia NeMo-Guardrails

  • Arthur Bench

  • Galileo LLM Studio

  • lakera.ai

  • NightFall AI

**

I will publish the next Edition on Sunday.

This is the 11th Edition, If you have any feedback please don’t hesitate to share it with me, And if you love my work, do share it with your colleagues.

It takes time to research and document it - Please be a paid subscriber and support my work.

Cheers!!

Raahul

**

Reply

or to participate.