Llama 3.1 NeMoGuard Content Safety
This repository demonstrates how to use NVIDIA's Llama 3.1 NeMoGuard 8B Content Safety model to detect unsafe content in conversations.
Overview
The NeMoGuard Content Safety model is a fine-tuned version of Meta's Llama 3.1 8B Instruct model, specifically trained to identify and classify potentially harmful or unsafe content in user prompts and AI responses.
Features
- Content Safety Assessment: Evaluates conversations for potential safety concerns
- Category Classification: Identifies specific safety categories when unsafe content is detected
- User & Response Safety: Separately assesses both user inputs and agent responses
- Easy Integration: Simple API using Hugging Face Transformers and PEFT
Model Information
- Base Model: meta-llama/Meta-Llama-3.1-8B-Instruct
- Safety Model: nvidia/llama-3.1-nemoguard-8b-content-safety
- Model Type: PEFT (Parameter-Efficient Fine-Tuning) adapter
- Model Page: https://huggingface.co/nvidia/llama-3.1-nemoguard-8b-content-safety
Installation
```bash
pip install transformers peft torch
```
Usage
Basic Content Safety Check
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and base model, then attach the NeMoGuard safety adapter
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "nvidia/llama-3.1-nemoguard-8b-content-safety")

# Prepare the conversation to be assessed
conversation = """
<BEGIN CONVERSATION>
user: your message here
<END CONVERSATION>
"""

# Run the safety assessment
inputs = tokenizer(conversation, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)

# Decode only the newly generated tokens (the assessment itself)
safety_assessment = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(safety_assessment)
```
With Response Evaluation
```python
conversation = """
<BEGIN CONVERSATION>
user: user message here
response: agent response here
<END CONVERSATION>
"""
```
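The conversation blocks above can also be built programmatically. A minimal sketch, assuming the plain-text format shown in the examples (the `format_conversation` helper is illustrative and not part of the model's API):

```python
def format_conversation(user_message, agent_response=None):
    """Wrap a user message (and optional agent response) in the
    <BEGIN CONVERSATION> / <END CONVERSATION> block used above."""
    lines = ["<BEGIN CONVERSATION>", f"user: {user_message}"]
    if agent_response is not None:
        lines.append(f"response: {agent_response}")
    lines.append("<END CONVERSATION>")
    return "\n".join(lines) + "\n"

print(format_conversation("user message here", "agent response here"))
```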
Output Format
The model returns a JSON-formatted assessment:
```json
{
  "User Safety": "safe|unsafe",
  "Response Safety": "safe|unsafe",
  "Safety Categories": "category1, category2, ..."
}
```
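Because the assessment is JSON, it can be parsed with the standard library. A minimal sketch, assuming the output format shown above (the values here are illustrative):

```python
import json

# Example assessment string in the format shown above (illustrative values)
raw = '{"User Safety": "unsafe", "Response Safety": "safe", "Safety Categories": "S9"}'

assessment = json.loads(raw)
user_safe = assessment["User Safety"] == "safe"

# "Safety Categories" is a comma-separated string; split it into a list
categories = [c.strip() for c in assessment.get("Safety Categories", "").split(",") if c.strip()]

print(user_safe)   # False
print(categories)  # ['S9']
```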
Safety Categories
When unsafe content is detected, the model reports one or more safety categories, including but not limited to:
- S9: Privacy violations
- Other categories as defined by the model's training
Requirements
- Python 3.8+
- PyTorch
- Transformers
- PEFT
- CUDA-capable GPU (recommended for optimal performance)
Hardware Requirements
- GPU Memory: At least 16GB VRAM recommended for inference
- RAM: 16GB+ system RAM
- Storage: ~20GB for model weights
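The 16GB VRAM figure follows from the parameter count: an 8B-parameter model in 16-bit precision needs roughly 8e9 × 2 bytes for the weights alone, before activations and the KV cache. A quick back-of-the-envelope check:

```python
# Rough weight-memory estimate for an 8B-parameter model in fp16/bf16
params = 8_000_000_000   # ~8B parameters
bytes_per_param = 2      # 16-bit precision

weights_gb = params * bytes_per_param / 1024**3
print(f"{weights_gb:.1f} GB")  # ~14.9 GB for weights alone
```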
Examples
See the included Jupyter notebook llama-3.1-nemoguard-8b-content-safety.ipynb for detailed examples and use cases.
License
MIT License - See LICENSE file for details
Acknowledgments
- NVIDIA for developing the NeMoGuard Content Safety model
- Meta for the Llama 3.1 base model
- Hugging Face for the Transformers and PEFT libraries
Citation
If you use this model in your research, please cite:
```bibtex
@misc{nvidia-nemoguard-2024,
  title={Llama 3.1 NeMoGuard 8B Content Safety},
  author={NVIDIA},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/nvidia/llama-3.1-nemoguard-8b-content-safety}
}
```
Issues and Contributions
If you encounter issues with the code snippets or implementation:
- Open an issue on the model repository
- Report Hugging Face integration problems on the huggingface.js repository
Disclaimer
This model is designed for content safety assessment and should be used responsibly. It is a tool to help identify potentially harmful content but should not be the sole mechanism for content moderation in production systems.