
Policy Guardrails

Model Summary: Policy Guardrails is a LoRA adapter that brings policy compliance checking capability to the ibm-granite/granite-4.0-micro base model. Given a policy and a scenario, the adapter enables the base model to determine whether the scenario complies with, or violates, the policy, or if compliance cannot be determined based on the information provided. The model outputs a JSON object with a label field indicating "Yes" (scenario clearly complies with the policy), "No" (scenario clearly violates the policy), or "Ambiguous" (compliance of scenario with policy cannot be determined without additional information).

Usage

Intended use: Policy Guardrails is a guardrail adapter for IBM's granite-4.0-micro LLM. Given a policy and a scenario, it enables granite-4.0-micro to accurately decide whether the scenario complies with, or violates, the policy. Importantly, the compliance check is stringent: the adapter returns a third response ('Ambiguous') when compliance or non-compliance cannot be decided with a high level of certainty and more information is needed. This is a salient feature of the adapter, since many real-world scenarios lack sufficient information to confidently determine compliance with, or violation of, a policy, and forcing an affirmative or negative response in such cases leads to unreliable decisions. This adapter is designed to be used as part of the Granite inference pipeline.

Output format: The adapter produces a JSON string of the form {"label": "X"}, where "X" is one of "Yes", "No", or "Ambiguous", indicating compliance, non-compliance, or ambiguity respectively.
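A minimal sketch of parsing this output (the helper name is illustrative, not part of the adapter's API):

```python
import json

def parse_label(raw: str):
    """Parse the adapter's JSON output; return None if the output is malformed
    or the label is not one of the three expected values."""
    try:
        label = json.loads(raw.strip()).get("label")
    except json.JSONDecodeError:
        return None
    return label if label in {"Yes", "No", "Ambiguous"} else None

print(parse_label('{"label": "Ambiguous"}'))  # -> Ambiguous
```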

Use Cases

The Policy Guardrails adapter may be used to detect policy violations in input prompt text (i.e. user-supplied text), model-generated text, and conversations (i.e. the last turn of a conversation, which may be user-supplied or model-generated). This enables a broad range of enterprise-relevant use cases:

  • Policy-compliant QA -- ensuring responses from QA systems in regulated domains (e.g. HR, finance, healthcare) comply with relevant policies
  • Policy-compliant agentic workflows -- ensuring intermediate generations in agentic workflows adhere to relevant policies, e.g. blocking non-compliant API calls, messages, transactions, etc.
  • Policy Gap & Redundancy Analysis -- detecting ambiguous scenarios to identify policies that are unclear or inconsistent; analyzing how different policies classify the same scenario to find inconsistencies and redundancies
  • Compliance Audits -- analyzing organizational documents, code, communications, etc. to perform audits on historical data
  • Scenario Escalation -- identifying ambiguous scenarios requiring triage or human review
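In an agentic workflow, for example, the three-way label can act as a gate on each step. A minimal sketch (the function and the callbacks are hypothetical, not part of the adapter's API):

```python
def gate_action(label: str, run_action, escalate):
    """Route an agent step based on the adapter's compliance label:
    run compliant actions, escalate ambiguous cases for human review,
    and block clear violations."""
    if label == "Yes":
        return run_action()
    if label == "Ambiguous":
        return escalate()
    raise PermissionError("Blocked: scenario violates the policy")

# Stubbed usage:
print(gate_action("Ambiguous",
                  run_action=lambda: "executed",
                  escalate=lambda: "queued for human review"))
```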

Quickstart Example (LoRA)

import torch
import json
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

BASE_NAME = "ibm-granite/granite-4.0-micro"
ADAPTER_REPO = "ibm-granite/granitelib-guardian-r1.0"
ADAPTER_SUBFOLDER = "policy-guardrails/granite-4.0-micro/lora"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side="left", trust_remote_code=True)
model_pg = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE_NAME, device_map="auto", torch_dtype=torch.bfloat16),
    ADAPTER_REPO,
    subfolder=ADAPTER_SUBFOLDER,
)

# Example: Given a policy and a scenario, use the Policy Guardrails adapter to determine compliance.

policy_text = 'hiring managers should steer away from any questions that directly seek information about protected classes—such as “how old are you,” “where are you from,” “what year did you graduate” or “what are your plans for having kids.”'
scenario_text = 'During a job interview, Alex focuses on asking candidates about their work experience, specific skills, and past accomplishments, without delving into their age, personal background, or family status.'

# The adapter was trained using the prompt template below so it is used to create the appropriate message.
user_text = scenario_text  # scenario_text can be user or assistant generated based on use-case
judge_criteria = "Policy: " + policy_text
system_prompt = "You are a compliance agent trying to help determine whether a scenario is compliant with a given policy."

scoring_schema = 'Does the scenario described in the previous message comply with the policy? Compliance is determined based on the following guidelines:\n<start_of_guidelines>:\n- \"Yes\"  if, given the scenario and the policy, it can be decided with certainty that the scenario complies with the policy\n- \"No\"  if, given the scenario and the policy, it can be decided with certainty that the scenario does not comply with the policy\n- \"Ambiguous\" if, given the scenario and the policy, it is not possible to decide whether the scenario complies with, or violates, the policy and more information is needed to decide with certainty.\n<end_of_guidelines>\n\nYour answer must be either "Yes", "No", or "Ambiguous". You MUST return your answer as a valid JSON object with the key "label". For example, if your answer is "Yes", respond as "{"label":"Yes"}".'

judge_protocol = f"<guardian> {system_prompt}\n\n### Criteria: {judge_criteria}\n\n### Scoring Schema: {scoring_schema}"

messages = [
    {"role": "user", "content": user_text},   # user- or assistant-generated text, depending on use case
    {"role": "user", "content": judge_protocol},
]


# Generate the JSON response and parse it to get the predicted label
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model_pg.device)
input_len = input_ids.shape[1]

model_pg.eval()
with torch.no_grad():
    output = model_pg.generate(
        input_ids,
        do_sample=False,
        max_new_tokens=20,
    )

# Decode only the newly generated tokens, then parse the JSON label
res = tokenizer.decode(output[0, input_len:], skip_special_tokens=True).strip()

try:
    res = json.loads(res)["label"]
except (json.JSONDecodeError, KeyError, TypeError):
    res = None

print(f"Prediction: {res}")
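The prompt assembly above can be factored into a small reusable helper. A sketch (the function name is illustrative; the scoring-schema string from the quickstart is passed in unchanged):

```python
def build_messages(policy_text: str, scenario_text: str, scoring_schema: str) -> list:
    """Assemble the two-message format the adapter was trained on:
    the scenario first, followed by the <guardian> judging protocol."""
    system_prompt = ("You are a compliance agent trying to help determine "
                     "whether a scenario is compliant with a given policy.")
    judge_protocol = (f"<guardian> {system_prompt}\n\n"
                      f"### Criteria: Policy: {policy_text}\n\n"
                      f"### Scoring Schema: {scoring_schema}")
    return [
        {"role": "user", "content": scenario_text},  # user- or assistant-generated text
        {"role": "user", "content": judge_protocol},
    ]
```

The returned list can be fed directly to tokenizer.apply_chat_template as in the quickstart.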

Training Details

Training Data: The Policy Guardrails adapter is trained on a proprietary dataset (Policy Guardrails dataset) composed of human‑crafted policies and a rich collection of scenarios. These scenarios come from both human‑generated and human‑validated synthetic sources. The dataset includes both simple (atomic) and complex policies that span approximately 40 domains, such as finance, healthcare, human resources, accounting, and tourism.

Dataset composition: Policies were carefully curated by experts from a broad range of policy documents. Each policy is paired with multiple scenarios (as shown in the figure below), each labeled as one of the following:

  • Compliant
  • Non‑compliant
  • Ambiguous

Sample Policy Dataset instance

Each dataset row includes the fields below. There can be more than one row per policy:

  • Domain – e.g., Finance, Pharma, Corporate, Civics.
  • Stakeholder – The person or organization required to follow the policy.
  • Policy Text – The full policy description.
  • Compliant Scenario – A compliant example, meaning the scenario CLEARLY complies with the policy.
  • Non-compliant Scenario – A non‑compliant example, meaning that the scenario CLEARLY violates the policy.
  • Ambiguous Scenario – A situation in which compliance cannot be determined unless questions (see Disambiguation Questions below) are asked. Notice that the scenario must be relevant to the policy.
  • Disambiguation Questions – One or more queries that help clarify ambiguous scenarios (not necessarily exhaustive).
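For illustration only, a row with these fields might look as follows (this instance is invented, not drawn from the proprietary dataset):

```python
# Hypothetical dataset row, for illustration; not an actual dataset instance.
sample_row = {
    "domain": "Human Resources",
    "stakeholder": "Hiring manager",
    "policy_text": ("Hiring managers should avoid interview questions that "
                    "directly seek information about protected classes."),
    "compliant_scenario": "The interviewer asks only about skills and past projects.",
    "non_compliant_scenario": "The interviewer asks the candidate's age and graduation year.",
    "ambiguous_scenario": "The interviewer asks about the candidate's long-term plans.",
    "disambiguation_questions": [
        "Do the long-term plans concern career goals or family plans?",
    ],
}
print(sorted(sample_row))
```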

Dataset generation: The creation of scenarios occurred in three phases:

  • Phase 1 – Manual Scenario Creation: Annotators created scenarios for simple policies, drawing on domains from industry (pharma, financial services, etc.) and human behavior (voters, tourists, users of public transportation, etc.). The annotators were encouraged to vary the subject (person, organization), tone, and specificity of their scenarios. Each scenario was validated by at least two annotators, with additional spot-checking by data owners, not just for correctness but also for variety. The data owners validated about 5% of the scenarios as examples to be given to annotators as instructions.
  • Phase 2 – Synthetic Expansion: Scenarios from Phase 1 were used as seeds to generate synthetic scenarios using LLMs. All generated scenarios were then human‑validated and edited as in Phase 1.
  • Phase 3 – Complex Policy Scenarios: Annotators manually created and validated scenarios, as in Phase 1, but for more complex policies (e.g. policies consisting of conjunctions/disjunctions of multiple atomic policies).

Scenario creation in phases 1 & 2, as well as validation of scenarios in all three phases was done by a team of annotators at DataForce. DataForce prioritizes the well-being of its data contributors by ensuring they are paid fairly and receive livable wages for all projects.

Adapter Configuration

Parameter        LoRA
Base model       ibm-granite/granite-4.0-micro
LoRA rank (r)    16
LoRA alpha       32
Target modules   q_proj, v_proj
Output format    JSON string of the form {"label": "X"}, where "X" is one of "Yes", "No", "Ambiguous"
Max length       8192
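Expressed as a PEFT LoraConfig, these settings would look roughly as follows (a sketch; the task type is an assumption, since it is not listed above):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                 # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
    task_type="CAUSAL_LM",                # assumption: standard causal-LM setting
)
```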

Evaluation

Policy Guardrails dataset - Out of distribution split (policy)

The trained LoRA adapter was evaluated on an out-of-distribution split of the dataset such that there was no overlap of policies in the test dataset and the training/evaluation dataset. The LoRA adapter achieved a 19.5% improvement in balanced accuracy over the base model, fueled by a 117% improvement in accuracy for 'Ambiguous' cases.

Base model (granite-4.0-micro) adapted with the trained LoRA adapter

Compliance          Accuracy
Yes                 0.9305
No                  0.8910
Ambiguous           0.8722
Balanced Accuracy   0.8979

Base model (granite-4.0-micro) only

Compliance          Accuracy
Yes                 0.9041
No                  0.9474
Ambiguous           0.4023
Balanced Accuracy   0.7513
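Balanced accuracy here is the unweighted mean of the three per-class accuracies, which can be checked directly against the numbers above:

```python
def balanced_accuracy(per_class_accuracies):
    """Unweighted mean of per-class (Yes / No / Ambiguous) accuracies."""
    return sum(per_class_accuracies) / len(per_class_accuracies)

adapter = balanced_accuracy([0.9305, 0.8910, 0.8722])
base = balanced_accuracy([0.9041, 0.9474, 0.4023])
print(round(adapter, 4), round(base, 4))  # -> 0.8979 0.7513
# Relative improvement: (0.8979 - 0.7513) / 0.7513, i.e. about 19.5%
```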

PolyGuard dataset

No publicly available dataset allows a model to respond with 'Ambiguous' when unsure about the compliance of a scenario with respect to a given policy. One publicly available policy-scenario dataset is PolyGuard (Kang et al.), which includes only compliant/non-compliant policy-scenario pairs and was synthetically generated via LLM prompting. Unlike the Policy Guardrails dataset, which was created to explicitly include 'Ambiguous' scenarios (allowing stricter labeling of compliant/non-compliant policy-scenario pairs), the PolyGuard dataset potentially contains many ambiguous cases that are marked as compliant or non-compliant, both because of its synthetic generation and because it offers only safe (compliant)/unsafe (non-compliant) labels.

The trained LoRA adapter was tested on some of the domains in the PolyGuard dataset in two steps. First, the adapter was used to label the compliance of policy-scenario pairs as Yes/No/Ambiguous. The instances labeled Ambiguous were discarded and not considered in the evaluations. Performance metrics were then computed only on the set of policy-scenario pairs labeled Yes/No. By this process, roughly half of the policy-scenario pairs were discarded, and performance was measured on the remaining half. We considered three of the sub-datasets (domains) in PolyGuard that most closely corresponded to our policy-scenario compliance framework.
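A sketch of this two-step protocol (function and variable names are illustrative; treating 'No', i.e. a violation, as the positive class for recall/FPR is our assumption):

```python
def filter_and_score(gold, pred):
    """Drop pairs the adapter labels 'Ambiguous', then score the rest.
    Labels are 'Yes' (compliant) / 'No' (non-compliant); 'No' is treated
    as the positive class."""
    kept = [(g, p) for g, p in zip(gold, pred) if p != "Ambiguous"]
    tp = sum(g == "No" and p == "No" for g, p in kept)   # violation caught
    fn = sum(g == "No" and p == "Yes" for g, p in kept)  # violation missed
    fp = sum(g == "Yes" and p == "No" for g, p in kept)  # false alarm
    tn = sum(g == "Yes" and p == "Yes" for g, p in kept)
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "% evaluated": 100.0 * len(kept) / len(gold),
    }

print(filter_and_score(["No", "No", "Yes", "Yes"],
                       ["No", "Ambiguous", "Yes", "No"]))
```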

Performance of the Base model (granite-4.0-micro) adapted with the trained LoRA adapter on cases not flagged as Ambiguous

                    HR        Education   Social Media
Balanced Accuracy   0.9580    0.9241      0.7466
F1 score            0.9639    0.9530      0.8730
FPR                 0.0309    0.1181      0.4665
Recall              0.9470    0.9662      0.9596
% Evaluated         51.66     54.38       50.44

Performance of the Base model (granite-4.0-micro) on cases not flagged as Ambiguous

                    HR        Education   Social Media
Balanced Accuracy   0.7351    0.7508      0.6860
F1 score            0.8622    0.8763      0.8414
FPR                 0.3621    0.3883      0.5351
Recall              0.8323    0.8898      0.9072
% Evaluated         51.66     54.38       50.44

Infrastructure: The Granite 4.0 Micro Policy Guardrail LoRA adapter was trained on four NVIDIA A100-80GB GPUs. Evaluation was completed using one NVIDIA A100-80GB GPU.

Ethical Considerations: The Policy Guardrail adapter for Granite 4.0 Micro was trained specifically to reflect the behavior of Granite 4.0 Micro. While it may be applied to other LLMs, it has not been validated for them. Also, the specific prompt template described in the Quickstart Example section was used for training the adapter and is recommended for inference as well. While it is possible to change the prompt template and force the model to respond only Yes/No when checking the compliance of a scenario with a given policy, the results are unpredictable for cases where compliance cannot be determined (Ambiguous). Finally, its compliance label may not always align with human judgments of whether a scenario is compliant with a given policy, as human assessments may differ from how Granite 4.0 Micro labels them.

Resources