Moderation

Moderation methods are designed to catch offensive, hurtful, toxic, or otherwise inappropriate content. They also support enforcing other policies, such as blocking content that contains banned words or phrases. Moderation methods can be used at both the input and output level.

The table below lists the moderation methods we currently support. The ID column gives the identifier used to reference a detection method in a config.

| Name | ID | Description |
|------|----|-------------|
| Toxicity Classifier | moderation-deberta | Fine-tuned DeBERTa model to detect toxicity |
| FlashText Banlist | moderation-flashtext | Keyword/phrase banlist |
| Moderation Prompt Engineering | moderation-prompt-engineering | Detect toxic content via LLM prompt engineering |
| OpenAI Moderations API | moderations-oai-api | Detect toxic content using OpenAI’s Moderation API |
| Perspective API | moderation-perspective-api | Detect toxic content using the Perspective API |

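For illustration, a guard config might reference these IDs roughly as follows. This is a minimal sketch assuming a Python dict-based configuration; the surrounding key names and structure are placeholders, so consult the configuration documentation for the exact schema.

```python
# Hypothetical guard configuration (illustrative structure, not Dome's exact schema).
# The method IDs are taken from the table above.
input_moderation_guard = {
    "input-moderation": {
        "type": "moderation",
        # Run the toxicity classifier and the keyword banlist on every input.
        "methods": ["moderation-deberta", "moderation-flashtext"],
    }
}
```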

Toxicity Classifier (moderation-deberta)

Uses a fine-tuned DeBERTa-v3 model to detect the presence of toxicity in text. This detector is enabled in Dome’s default configuration.

Parameters

  • truncation (optional boolean): Whether inputs longer than the maximum sequence length are truncated. Default value is true.

  • max_length (optional int): Maximum sequence length that can be processed. Default value is 208.

  • device (optional str): The device to run on (cuda or cpu). Uses CUDA by default if available, otherwise CPU.

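As a rough illustration, the parameters above could be supplied alongside the method ID as shown below. The surrounding dict structure is an assumption, not Dome’s exact config schema; only the parameter names and defaults come from the list above.

```python
# Illustrative settings for the toxicity classifier; the structure is hypothetical.
moderation_deberta_config = {
    "moderation-deberta": {
        "truncation": True,   # truncate inputs longer than max_length
        "max_length": 208,    # maximum sequence length in tokens
        "device": "cpu",      # or "cuda" to force GPU inference
    }
}
```
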
FlashText Banlist (moderation-flashtext)

Uses the FlashText algorithm for fast keyword matching to block any string that contains a word or phrase present in a banlist.

Parameters

  • banlist_filepaths (optional list[str]): A list of paths to text files that contain banned phrases. If not provided, a default banlist is used.

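As a sketch, a custom banlist could be wired in as shown below. The file paths are hypothetical, and the one-phrase-per-line file format is an assumption about how the banlist files are written, not something taken from this documentation.

```python
# Illustrative banlist configuration; file paths are hypothetical, and the
# one-phrase-per-line file format is an assumption.
moderation_flashtext_config = {
    "moderation-flashtext": {
        "banlist_filepaths": [
            "banlists/restricted_terms.txt",
            "banlists/profanity.txt",
        ],
    }
}
```
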
Moderation Prompt Engineering (moderation-prompt-engineering)

A detector that uses custom prompt engineering to determine whether the query string contains toxicity. Please ensure the API key for your chosen hub (OPENAI_API_KEY for the default OpenAI hub) is set before using this method.

Parameters

  • hub_name (optional str): The hub that hosts the model you want to use. Currently supports OpenAI (openai) and Together (together). Default value is openai.

  • model_name (optional str): The model that you want to use. Default value is gpt-4o. Please ensure that the model you wish to use is compatible with the hub you selected. When using models from Together, ensure the model name starts with the together_ai/ prefix, as per LiteLLM’s documentation.

  • api_key (optional str): Specify the API key you want to use. By default this is None and the API key is pulled directly from the environment. The environment variables used are OPENAI_API_KEY and TOGETHERAI_API_KEY.

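A rough sketch of how the hub, model, and key might be specified follows. The dict structure is assumed rather than Dome’s exact schema; leaving api_key unset falls back to the environment variables named above.

```python
# Illustrative settings for the prompt-engineering detector; the structure is hypothetical.
moderation_prompt_engineering_config = {
    "moderation-prompt-engineering": {
        "hub_name": "openai",    # or "together"
        "model_name": "gpt-4o",  # must be compatible with the chosen hub
        "api_key": None,         # None -> read OPENAI_API_KEY / TOGETHERAI_API_KEY from the environment
    }
}
```
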
OpenAI Moderations API (moderations-oai-api)

Uses the latest text-moderation model from OpenAI to classify content for hate, harassment, self-harm, sexual content and violence.

Parameters

  • score_threshold_dict (optional dict[str, float]): Sets the per-category score threshold for each toxicity dimension. Scores above the threshold are flagged as toxic. If unspecified, a default threshold of 0.5 is used for each category. If provided, only the categories in the dictionary are considered. For example, setting this to {"violence": 0.8, "self-harm": 0.3} causes the method to ignore every category except violence and self-harm, with thresholds of 0.8 and 0.3 respectively. For a full list of the categories available, see here.

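Continuing the worked example from the parameter description, the sketch below restricts the check to two categories. The surrounding dict structure is assumed; the category names and thresholds are the ones used in the example above.

```python
# Illustrative settings for the OpenAI Moderations detector; the structure is hypothetical.
# Only the listed categories are checked; all other categories are ignored.
moderations_oai_api_config = {
    "moderations-oai-api": {
        "score_threshold_dict": {
            "violence": 0.8,   # flag only high-confidence violence
            "self-harm": 0.3,  # flag self-harm more aggressively
        },
    }
}
```
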
Perspective API (moderation-perspective-api)

Uses Google Jigsaw’s Perspective API to detect toxicity.

Parameters

  • api_key (optional str): Specify the API key you want to use. By default this is not specified and the API key is pulled directly from the environment. The environment variable used is PERSPECTIVE_API_KEY.

  • attributes (optional dict): The attributes requested from the Perspective API. This field is required by the API, so we do not recommend changing it. Default value is {'TOXICITY': {}}.

  • score_threshold_dict (optional dict[str, float]): Provide a score threshold for the API. Scores that exceed the threshold are flagged by the detector. Default value is {"TOXICITY": 0.5}.
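
For illustration, the sketch below raises the toxicity threshold while leaving the attributes at their default. As with the other examples, the surrounding dict structure is an assumption rather than Dome’s exact schema.

```python
# Illustrative settings for the Perspective API detector; the structure is hypothetical.
moderation_perspective_api_config = {
    "moderation-perspective-api": {
        "api_key": None,                             # None/unset -> read PERSPECTIVE_API_KEY from the environment
        "attributes": {"TOXICITY": {}},              # default; changing this is not recommended
        "score_threshold_dict": {"TOXICITY": 0.7},   # flag only scores above 0.7 (default is 0.5)
    }
}
```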