Llama Email Fraud Detector (AWQ 4-bit)

AWQ 4-bit quantized version of cunxin/llama-email-fraud-detector: the same fine-tuned email fraud detection capability at about one third the size, with ~1-1.5% accuracy loss.

This model is optimized for low VRAM GPUs (6-8 GB) such as RTX 3050, RTX 3060 laptop, and similar consumer GPUs.


Model Details

| Property | Value |
|---|---|
| Architecture | LlamaForCausalLM (decoder-only Transformer) |
| Base model | meta-llama/Llama-3.2-3B-Instruct |
| Fine-tuning | LoRA (r=16, alpha=32), merged, then AWQ quantized |
| Quantization | AWQ 4-bit, group_size=128, GEMM kernel, zero_point=True |
| Parameters | 3.2B (quantized) |
| Precision | 4-bit weights (int4) |
| Model size | 2.2 GB (vs 6.4 GB for bf16) |
| Compression ratio | ~2.9x |
| Accuracy loss vs bf16 | ~1-1.5% |

Lineage

meta-llama/Llama-3.2-3B-Instruct
    │
    ├── LoRA fine-tuning (email fraud detection)
    │
    ├── Merge LoRA into base weights (6.4 GB, bf16)
    │     └─► cunxin/llama-email-fraud-detector
    │
    └── AWQ 4-bit quantization (2.2 GB)
          └─► cunxin/llama-email-fraud-detector-awq (this model)

Output Format

Same structured JSON output as the bf16 version:


{
  "is_fraud": true,
  "risk_score": 95,
  "confidence_level": 0.97,
  "detected_threats": ["DOMAIN_MISMATCH", "CREDENTIAL_REQUEST", "URGENCY_FEAR"],
  "reason": "The sender domain 'amaz0n-verify.com' typosquats amazon.com...",
  "suggestion": "Do not click any links. Report this email as phishing."
}

11 threat types with point-based scoring: CREDENTIAL_REQUEST (35), DOMAIN_MISMATCH (30), URL_DISCREPANCY (30), TOO_GOOD_TO_BE_TRUE (30), PROMPT_INJECTION (30), URGENCY_FEAR (15), REPLY_TO_MISMATCH (15), GENERIC_SALUTATION (8), ANOMALOUS_TIMING (8), MISSING_SIGNATURE (8), GRAMMAR_ANOMALY (5).
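The per-threat points suggest an additive scheme, though the exact aggregation the model learned is not documented here (note the JSON example above reports a risk score of 95 for three threats whose points sum to 80). As a rough external sanity check, a capped sum can be sketched as follows; `risk_score` is a hypothetical helper, not part of the model:

```python
# Hypothetical point-based aggregation of detected threat types.
# The model generates risk_score itself; this capped sum is only one
# plausible reading of the published per-threat points.
THREAT_POINTS = {
    "CREDENTIAL_REQUEST": 35, "DOMAIN_MISMATCH": 30, "URL_DISCREPANCY": 30,
    "TOO_GOOD_TO_BE_TRUE": 30, "PROMPT_INJECTION": 30, "URGENCY_FEAR": 15,
    "REPLY_TO_MISMATCH": 15, "GENERIC_SALUTATION": 8, "ANOMALOUS_TIMING": 8,
    "MISSING_SIGNATURE": 8, "GRAMMAR_ANOMALY": 5,
}

def risk_score(detected: list[str]) -> int:
    """Sum the points of the detected threats, capped at 100."""
    return min(100, sum(THREAT_POINTS[t] for t in detected))

print(risk_score(["DOMAIN_MISMATCH", "CREDENTIAL_REQUEST", "URGENCY_FEAR"]))  # 80
```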

See cunxin/llama-email-fraud-detector for full documentation of threat types, dual-model pipeline, and training details.


Usage

With vLLM (Recommended)

# Set in .env
MODEL_PATH=cunxin/llama-email-fraud-detector-awq
QUANTIZATION=awq

# Start service
docker compose --profile gpu up -d
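Once the container is up, the service can be queried over HTTP. The sketch below assumes vLLM's standard OpenAI-compatible API on `localhost:8000`; the actual port and route depend on the project's `docker-compose.yml`, so treat the URL as a placeholder:

```python
import json
import urllib.request

# Request payload for vLLM's OpenAI-compatible /v1/chat/completions route
# (assumed endpoint; check the project's compose file for the real port).
payload = {
    "model": "cunxin/llama-email-fraud-detector-awq",
    "messages": [{"role": "user", "content": "Analyze this email: ..."}],
    "temperature": 0.0,   # deterministic decoding helps keep the JSON stable
    "max_tokens": 512,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the service is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```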

With Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM

# AWQ checkpoints require the autoawq package: pip install autoawq
model_name = "cunxin/llama-email-fraud-detector-awq"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# See the base model card for the exact prompt format
messages = [{"role": "user", "content": "Analyze this email for fraud: ..."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))

VRAM Requirements

| Configuration | VRAM | Example GPU |
|---|---|---|
| AWQ model only | ~2.5 GB | Any 4 GB+ GPU |
| AWQ + RoBERTa | ~3.0 GB | RTX 3050 (8 GB) |
| AWQ + RoBERTa + KV cache (4096 ctx) | ~3.5 GB | RTX 3060 Laptop (6 GB) |
| AWQ + RoBERTa + KV cache + TurboQuant | ~2.7 GB | RTX 3050 (4 GB laptop) |

Comparison with bf16

| | bf16 | AWQ 4-bit (this) |
|---|---|---|
| Model size | 6.4 GB | 2.2 GB |
| Min VRAM | 12 GB | 6 GB |
| Max concurrent requests (24 GB GPU) | ~21 | ~50+ |
| Accuracy loss | baseline | ~1-1.5% |
| Inference speed | baseline | comparable |

Quantization Details

AWQ (Activation-Aware Weight Quantization) identifies the most important 1% of weights by observing activation magnitudes, then scales those channels up before uniform 4-bit quantization. This preserves model quality far better than naive quantization.

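The effect is easy to see in miniature. In the toy below (pure Python, not the real AWQ algorithm, which derives scales from activation statistics and folds them into adjacent layers), weights in a group share one quantization scale, so a small salient weight next to an outlier loses most of its precision; scaling it up before quantization and dividing afterwards recovers it:

```python
def quant_dequant(ws, bits=4):
    """Round-to-nearest symmetric quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for int4
    scale = max(abs(w) for w in ws) / qmax
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in ws]

group = [0.4, 7.0, -3.0, 2.0]                      # 0.4 is the "salient" weight
err_plain = abs(quant_dequant(group)[0] - group[0])

s = 2.0                                            # AWQ-style per-channel scale-up
scaled = [group[0] * s] + group[1:]
err_awq = abs(quant_dequant(scaled)[0] / s - group[0])

print(err_plain, err_awq)  # the scaled salient weight quantizes more accurately
```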

| Property | Value |
|---|---|
| Method | AWQ (Activation-Aware Weight Quantization) |
| Bit width | 4-bit (int4) |
| Group size | 128 |
| Kernel | GEMM |
| Zero point | Yes |
| Calibration data | Pile validation set (214K samples) |
| Quantization time | ~4 minutes on RTX 4090 |
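A back-of-envelope check on the 2.2 GB figure: with group_size=128, each int4 weight also carries an amortized share of one fp16 scale and one int4 zero point per group, about 4.16 bits per weight. The parameter split and the bf16 embedding below are assumptions for illustration, not published details:

```python
# Rough storage estimate for AWQ int4, group_size=128, with zero points.
# Parameter split and bf16 embedding size are illustrative assumptions.
params_quantized = 3.0e9                    # weights in quantized linear layers
bits_per_weight = 4 + (16 + 4) / 128        # int4 + amortized fp16 scale + int4 zero
quantized_gb = params_quantized * bits_per_weight / 8 / 1e9
embed_gb = 128_256 * 3072 * 2 / 1e9         # Llama 3.2 vocab x hidden, bf16
print(round(quantized_gb + embed_gb, 2))    # ~2.3-2.4 GB, near the reported 2.2 GB
```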

Related Models

| Model | Type | Size | Speed | Use Case |
|---|---|---|---|---|
| cunxin/roberta-email-fraud-detector | Discriminative | 475 MB | <50 ms | Fast binary pre-screen |
| cunxin/llama-email-fraud-detector | Generative (bf16) | 6.4 GB | ~1-3 s | Detailed threat analysis |
| cunxin/llama-email-fraud-detector-awq (this) | Generative (4-bit) | 2.2 GB | ~1-3 s | Same as above, for low-VRAM GPUs |
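These models form the dual-model pipeline documented in the base model card: the cheap RoBERTa classifier screens every email, and only suspicious ones reach this generative model. A control-flow sketch, with `roberta_score` and `llama_analyze` as hypothetical stand-ins for the real inference calls:

```python
def roberta_score(email: str) -> float:
    """Placeholder for the fast RoBERTa fraud probability (<50 ms per email)."""
    return 0.9 if "verify your account" in email.lower() else 0.1

def llama_analyze(email: str) -> dict:
    """Placeholder for this model's detailed structured JSON analysis."""
    return {"is_fraud": True, "risk_score": 95, "detected_threats": []}

def screen_email(email: str, threshold: float = 0.5) -> dict:
    """Escalate to the slow generative model only when the pre-screen flags it."""
    score = roberta_score(email)
    if score < threshold:
        return {"is_fraud": False, "risk_score": int(score * 100)}
    return llama_analyze(email)
```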

Citation

@misc{cunxin2025llama-email-fraud-awq,
  title={Llama Email Fraud Detector (AWQ 4-bit)},
  author={cunxin},
  year={2025},
  url={https://huggingface.co/cunxin/llama-email-fraud-detector-awq}
}