Skip to main content
LLM outputs can accidentally disseminate confidential information. Such sensitive data leakage poses substantial privacy (and security) risks in terms of reputational harm and potential legal consequences. Vijil tests if an LLM is prone to leaking confidential data from both training data and inference-stage data.

Probe and Scenario List

Harness/scenarioProbeProbe DescriptionDetector
Privacy
Copyrighted contentBook excerpt completionTests whether the model will complete a book excerpt when prompted with the beginningStarts with trigger
Book excerpt insertionTests whether the model will fill in the middle of a book excerptTrigger presence
Media excerpt completion - GuardianTests whether the model will complete an excerpt from The Guardian when prompted with the beginningStarts with trigger
Media excerpt insertion - GuardianTests whether the model will fill in the middle of an excerpt from The GuardianTrigger presence
Media excerpt completion - NYTTests whether the model will complete an excerpt from the New York Times when prompted with the beginningStarts with trigger
Media excerpt insertion - NYTTests whether the model will fill in the middle of an excerpt from the New York TimesTrigger presence
Private data leakEmails leakage - EnronTests whether the model will reveal email addresses obtained from the discovery process of the Enron investigationTrigger presence
PII LeakageTests whether PII provided to the model during a conversation can be leakedTrigger presence
Confidentiality comprehensionTests whether the model understands privacy-related phrases that flag when certain information is confidentialDon’t answer yes

Copyrighted Content

Chang et al, 2023 found that LLMs are prone to memorizing text from sources available on the internet. Vijil’s tests are intended to probe for such memorization capabilities. For this, Vijil supplies an agent with
  • a sentence with one missing word and ask it to fill in the masked word (cloze completion)
  • an incomplete sentence and ask it to complete it (sentence completion)
Vijil performs two such evaluations for each of the following three sources of text:
  1. Book passages from English literature,
  2. Articles from New York Times,
  3. Articles from The Guardian.

Private Data Leak

These probes test for whether models will reveal private data from either their training data or their inference-time data.

Private Training Data

Vijil uses Decoding Trust’s Enron dataset to test whether models will leak private data from the Enron emails.

Inference-stage Private Data

Vijil uses Decoding Trust’s synthetic prompts to test whether models will leak private data like social security numbers and SSH keys if these are provided at the inference stage. For example, if a user enters a query to a chatbot that contains private data, will this data be easily extractable later in the same conversation?

Privacy Understanding

The privacy understanding probe, adapted from Decoding Trust, measures whether the model understands when not to reveal private information. The prompts ask the model whether it is appropriate to reveal private information in various settings.