Llama 3.1 NeMoGuard Content Safety
This repository demonstrates how to use NVIDIA's Llama 3.1 NeMoGuard 8B Content Safety model to detect unsafe content in conversations.
Overview
The NeMoGuard Content Safety model is a fine-tuned version of Meta's Llama 3.1 8B Instruct model, specifically trained to identify and classify potentially harmful or unsafe content in user prompts and AI responses.
Features
- Content Safety Assessment: Evaluates conversations for potential safety concerns
- Category Classification: Identifies specific safety categories when unsafe content is detected
- User & Response Safety: Separately assesses both user inputs and agent responses
- Easy Integration: Simple API using Hugging Face Transformers and PEFT
Model Information
- Base Model: meta-llama/Meta-Llama-3.1-8B-Instruct
- Safety Model: nvidia/llama-3.1-nemoguard-8b-content-safety
- Model Type: PEFT (Parameter-Efficient Fine-Tuning) adapter
- Model Page: https://huggingface.co/nvidia/llama-3.1-nemoguard-8b-content-safety
Installation
```bash
pip install transformers peft torch
```
Usage
Basic Content Safety Check
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and base model, then attach the NeMoGuard safety adapter
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "nvidia/llama-3.1-nemoguard-8b-content-safety")

# Prepare the conversation to be assessed
conversation = """
<BEGIN CONVERSATION>
user: your message here
<END CONVERSATION>
"""

# Run the safety assessment
inputs = tokenizer(conversation, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)

# Decode only the newly generated tokens (the assessment itself)
safety_assessment = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(safety_assessment)
```
With Response Evaluation
```python
conversation = """
<BEGIN CONVERSATION>
user: user message here
response: agent response here
<END CONVERSATION>
"""
```
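The conversation blocks above can also be built programmatically. A minimal sketch, assuming the plain-text format shown in the examples (the `format_conversation` helper is illustrative and not part of the model's API):

```python
def format_conversation(user_message, agent_response=None):
    """Wrap a user message (and optional agent response) in the
    <BEGIN CONVERSATION> / <END CONVERSATION> block used above."""
    lines = ["<BEGIN CONVERSATION>", f"user: {user_message}"]
    if agent_response is not None:
        lines.append(f"response: {agent_response}")
    lines.append("<END CONVERSATION>")
    return "\n".join(lines) + "\n"

print(format_conversation("user message here", "agent response here"))
```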
Output Format
The model returns a JSON-formatted assessment:
```json
{
  "User Safety": "safe|unsafe",
  "Response Safety": "safe|unsafe",
  "Safety Categories": "category1, category2, ..."
}
```
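Because the assessment is JSON, it can be parsed with the standard library. A minimal sketch, assuming the output format shown above (the values here are illustrative):

```python
import json

# Example assessment string in the format shown above (illustrative values)
raw = '{"User Safety": "unsafe", "Response Safety": "safe", "Safety Categories": "S9"}'

assessment = json.loads(raw)
user_safe = assessment["User Safety"] == "safe"

# "Safety Categories" is a comma-separated string; split it into a list
categories = [c.strip() for c in assessment.get("Safety Categories", "").split(",") if c.strip()]

print(user_safe)   # False
print(categories)  # ['S9']
```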
Safety Categories
When unsafe content is detected, the model reports one or more safety categories, including but not limited to:
- S9: Privacy violations
- Other categories as defined by the model's training
Requirements
- Python 3.8+
- PyTorch
- Transformers
- PEFT
- CUDA-capable GPU (recommended for optimal performance)
Hardware Requirements
- GPU Memory: At least 16GB VRAM recommended for inference
- RAM: 16GB+ system RAM
- Storage: ~20GB for model weights
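The 16GB VRAM figure follows from the parameter count: an 8B-parameter model in 16-bit precision needs roughly 8e9 × 2 bytes for the weights alone, before activations and the KV cache. A quick back-of-the-envelope check:

```python
# Rough weight-memory estimate for an 8B-parameter model in fp16/bf16
params = 8_000_000_000   # ~8B parameters
bytes_per_param = 2      # 16-bit precision

weights_gb = params * bytes_per_param / 1024**3
print(f"{weights_gb:.1f} GB")  # ~14.9 GB for weights alone
```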
Examples
See the included Jupyter notebook llama-3.1-nemoguard-8b-content-safety.ipynb for detailed examples and use cases.
License
MIT License - See LICENSE file for details
Acknowledgments
- NVIDIA for developing the NeMoGuard Content Safety model
- Meta for the Llama 3.1 base model
- Hugging Face for the Transformers and PEFT libraries
Citation
If you use this model in your research, please cite:
```bibtex
@misc{nvidia-nemoguard-2024,
  title={Llama 3.1 NeMoGuard 8B Content Safety},
  author={NVIDIA},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/nvidia/llama-3.1-nemoguard-8b-content-safety}
}
```
Issues and Contributions
If you encounter issues with the code snippets or implementation:
- Open an issue on the model repository
- Report Hugging Face integration problems on the huggingface.js repository
Disclaimer
This model is designed for content safety assessment and should be used responsibly. It is a tool to help identify potentially harmful content but should not be the sole mechanism for content moderation in production systems.