FAR Compliance LoRA – Llama-3.2-1B

LoRA-tuned Llama-3.2-1B for generating FAR-grounded compliance outline nodes from federal RFP requirements.

Introduction

This project investigates how far a small, instruction-tuned language model can go toward supporting automated compliance review for federal Requests for Proposals (RFPs), where proposal authors must show that each requirement is addressed and properly grounded in governing regulations. The model in this repository takes as input a requirement and a candidate proposal excerpt and is trained to output a compliance judgment along with a structured outline node and textual rationale that cites back to authority language.

General-purpose LLMs are not well suited to this setting: they must integrate information across long, jargon-heavy documents, reason about subtle legal obligations, and still avoid hallucinating supporting text or overconfident “compliant” labels. To address this, the project applies LoRA to meta-llama/Llama-3.2-1B on a small labeled dataset of requirement–proposal pairs, then evaluates the resulting adapter both on a compliance loss metric (comparing pre- and post-LoRA models) and on standard benchmarks such as GSM8K, MMLU, and RACE to track changes in broader reasoning skills. In our experiments, LoRA reduces training loss on the compliance dataset and yields more targeted rationales. However, the post-LoRA compliance loss on the tiny 4-example evaluation set is higher than the pre-LoRA baseline, so that signal is highly noisy, and performance on the external benchmarks changes only modestly. Overall, the adaptation appears to nudge behavior toward the compliance task without clearly enhancing or degrading general capabilities.

Data

This project uses a small, hand-curated dataset of 20 instruction–response examples designed to simulate federal RFP compliance reasoning. Each example pairs a short natural-language instruction describing the compliance task with an RFP-style requirement text, and the target output is a structured compliance outline node that includes a heading, a brief rationale, and inline citations to one or more relevant FAR clause IDs.

The dataset is stored as a single CSV file, instruction_response.csv, where the main columns are:

  • instruction: the prompt shown to the model
  • response: the desired output

along with metadata fields such as requirement identifiers and lists of gold clause IDs used for analysis. For finetuning, 16 examples are used for training and 4 are held out as a tiny evaluation set, so the data functions more as a controlled teaching signal than as a comprehensive benchmark. All entries are synthetic but written to resemble realistic federal acquisition language and citation patterns, and no real contractor data or personally identifiable information is included.
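The loading and splitting described above can be sketched in plain Python. The sample rows below are illustrative stand-ins, not actual dataset entries; the real instruction_response.csv has 20 rows plus metadata columns (requirement identifiers, gold clause IDs).

```python
import csv
import io

# Illustrative stand-in for instruction_response.csv; the real file has
# 20 rows plus metadata columns (requirement IDs, gold FAR clause IDs).
sample_csv = (
    "instruction,response\n"
    "Outline node for requirement R1,Heading: ... Citations: FAR 52.204-21\n"
    "Outline node for requirement R2,Heading: ... Citations: FAR 15.305\n"
)

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Fixed 16/4 split used in the project: the first 16 rows train,
# the remainder are held out as the tiny evaluation set.
train_rows, eval_rows = rows[:16], rows[16:]
```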

Methodology

This project fine-tunes the meta-llama/Llama-3.2-1B causal language model using Hugging Face Transformers, PEFT’s LoRA adapters, and the Trainer API on a small instruction–response compliance dataset.

Each row of checkin3_instruction_response_augmented.csv is converted into a single string by concatenating the instruction and response with a blank line, then tokenized with the Llama-3.2-1B tokenizer using:

  • Right-side padding
  • Truncation
  • Maximum sequence length of 512 tokens

Labels are set equal to the input IDs for causal LM training.
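A minimal sketch of this preprocessing step is shown below. The toy whitespace tokenizer is a stand-in so the example runs without downloading model files; the real code uses the Llama-3.2-1B tokenizer from `AutoTokenizer.from_pretrained`, and `build_example` is a hypothetical helper name.

```python
# Toy whitespace "tokenizer" standing in for the Llama-3.2-1B tokenizer,
# so this sketch runs without downloading model files.
def toy_tokenizer(text, padding, truncation, max_length):
    ids = [hash(tok) % 32000 for tok in text.split()][:max_length]
    if padding == "max_length":
        ids = ids + [0] * (max_length - len(ids))  # right-side padding
    return {"input_ids": ids}

def build_example(instruction, response, tokenize, max_length=512):
    # Instruction and response are joined with a single blank line.
    text = instruction + "\n\n" + response
    enc = tokenize(text, padding="max_length", truncation=True,
                   max_length=max_length)
    # For causal LM training, labels are simply a copy of the input IDs.
    enc["labels"] = list(enc["input_ids"])
    return enc

example = build_example("Draft a compliance outline node.",
                        "Heading: ...", toy_tokenizer)
```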

The LoRA configuration is kept fixed across runs with:

  • r = 8
  • lora_alpha = 16
  • lora_dropout = 0.1
  • bias = "none"

so that only the adapter parameters are updated while the base model weights remain frozen.
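With PEFT, the configuration above can be expressed as follows. The `task_type` setting and the reliance on PEFT's default target modules for Llama models are assumptions, since the original write-up does not list target modules explicitly.

```python
from peft import LoraConfig, get_peft_model

# LoRA hyperparameters kept fixed across all three runs.
# target_modules is left to PEFT's defaults for Llama-family models
# (an assumption; the write-up does not specify them).
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

# model = get_peft_model(base_model, lora_config)
# Only the adapter parameters train; the base weights stay frozen.
```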

Training uses the standard Trainer setup with:

  • per_device_train_batch_size = 1
  • per_device_eval_batch_size = 1
  • logging_steps = 1
  • weight_decay = 0.0
  • Evaluation on the 4-example eval split at the end of each run

I ran three LoRA configurations:

  • Run 1: 3 epochs, learning rate = 2e-4 → eval loss ≈ 2.03
  • Run 2: 5 epochs, learning rate = 2e-4 → eval loss ≈ 1.81
  • Run 3: 3 epochs, learning rate = 5e-5 → eval loss ≈ 4.29

Based on these experiments, Run 2 (5 epochs, lr = 2e-4) is treated as the best configuration, and the saved adapter from this run is used for all post-LoRA evaluation.
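Under these assumptions, the Trainer setup for the best run (Run 2) can be sketched as below; `output_dir` and the dataset variable names are placeholders.

```python
from transformers import Trainer, TrainingArguments

# Best-performing configuration (Run 2): 5 epochs at lr = 2e-4.
args = TrainingArguments(
    output_dir="far-compliance-lora",   # placeholder path
    num_train_epochs=5,
    learning_rate=2e-4,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    logging_steps=1,
    weight_decay=0.0,
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
# trainer.evaluate()  # evaluated on the 4-example held-out split
```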

Evaluation

Evaluation focuses on both the in-domain compliance task and the model’s broader reasoning ability.

Compliance task

For the compliance task, training and evaluation loss on the 20-example dataset were tracked across three LoRA runs, and a separate comparison was made between a pre-LoRA baseline model and the chosen LoRA adapter on the same 4-example eval split. LoRA substantially reduced training loss and produced more consistently structured compliance outline nodes. However, the post-LoRA eval loss on the tiny 4-example set (≈2.49) was slightly higher than the pre-LoRA baseline (≈2.13); with only 4 examples, this difference is highly noisy and should be viewed as qualitative and illustrative rather than as evidence of harm or benefit.

General reasoning benchmarks

To monitor general capabilities, the project also used the course evaluation harness to run:

  • GSM8K (math word problems)
  • MMLU (mixed-domain multiple choice)
  • RACE (reading comprehension)

on both the base Llama-3.2-1B and the LoRA-adapted model, with:

  • Pre-LoRA runs evaluated on up to 200 items per benchmark
  • Post-LoRA runs evaluated on 50 items per benchmark

Across these benchmarks, exact-match and accuracy scores were low overall, as expected for a 1B-parameter model evaluated with small sample sizes. Post-LoRA scores were broadly similar to the baseline, with small ups and downs that fall within noise rather than indicating clear gains or catastrophic degradation.

Taken together, the results suggest that LoRA nudges the model toward the desired compliance format without providing strong, statistically robust evidence of improved real-world compliance performance, highlighting the need for larger, more carefully validated evaluation sets in future work.

Usage and Intended Uses

This model is intended for research and educational experiments on FAR-grounded compliance assistance for federal RFPs, where a user wants to generate structured outline nodes that restate a requirement, explain the compliance expectation, and cite back to candidate authority text.

It is designed to be used as a drafting and triage aid for humans who already understand acquisition and contracting, not as a fully automated decision-maker. In particular, it:

  • Should not be treated as legal advice
  • Should not be used to make binding award or eligibility decisions
  • Should always be paired with human review by qualified contracting or legal professionals

Because the model is trained on a very small synthetic dataset and evaluated on limited benchmarks, its outputs may be incomplete, overconfident, or subtly incorrect even when they look plausible. The recommended pattern is to supply a clear natural-language instruction along with the relevant requirement and context, then treat the model’s outline node and citations as a starting point for human refinement rather than a final answer.

Example: Loading the Adapter

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model_id = "meta-llama/Llama-3.2-1B"
adapter_id = "NMTrue/far-compliance-lora-llama-3.2-1b"

# Load the tokenizer and the frozen base model, then attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(base_model, adapter_id)
Make sure to set the device map, dtype, and generation parameters according to your hardware and use case.

Prompt Format

This model expects a natural-language instruction followed by a specific RFP requirement, separated by a blank line. The instruction should tell the model to draft a single compliance outline node that restates the requirement, explains what a compliant proposal must show, and cites relevant FAR/DFARS clauses.

Example instruction:

You are a contracts compliance assistant. Given an RFP requirement, write a single compliance outline node that (1) restates the requirement in your own words, (2) explains what a compliant proposal needs to show, and (3) cites back to relevant FAR or DFARS clauses. Keep the response concise (3–6 sentences).

Then provide the requirement:

RFP Requirement:
"YOUR REQUIREMENT TEXT HERE"
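Putting the two pieces together, the full prompt can be assembled with a small helper. The requirement string below is a hypothetical example, and `build_prompt` is an illustrative function name, not part of the repository.

```python
# Instruction text from the prompt-format section above.
INSTRUCTION = (
    "You are a contracts compliance assistant. Given an RFP requirement, "
    "write a single compliance outline node that (1) restates the requirement "
    "in your own words, (2) explains what a compliant proposal needs to show, "
    "and (3) cites back to relevant FAR or DFARS clauses. "
    "Keep the response concise (3-6 sentences)."
)

def build_prompt(requirement: str) -> str:
    # Instruction, blank line, then the labeled, quoted requirement text.
    return f'{INSTRUCTION}\n\nRFP Requirement:\n"{requirement}"'

# Hypothetical requirement for illustration only.
prompt = build_prompt(
    "The Offeror shall implement basic safeguarding of covered "
    "contractor information systems."
)
```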

Expected Output Format

The model’s response is intended to be a short, structured compliance outline node with:

  • A heading
  • A few sentences of rationale describing what compliance looks like
  • A citations line listing one or more FAR/DFARS clause IDs

Example shape (content will vary by requirement):

Heading: SHORT HEADING SUMMARIZING THE REQUIREMENT

Rationale: A few sentences that explain what the requirement expects from the Offeror and what evidence or documentation a compliant proposal should provide. The rationale should make it easy for a reviewer to see how the proposal text will be checked against the requirement, without restating the entire RFP.

Citations: FAR XX.XXX; DFARS XXX.XXX-XXXX
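Downstream tooling can split a well-formed response back into its three parts. The parser below is a hypothetical sketch assuming the exact `Heading:` / `Rationale:` / `Citations:` labels shown above; real model output may deviate from this shape and should be validated before use.

```python
# Hypothetical parser for the expected output shape; real model output
# may deviate from this format and should be checked before use.
def parse_outline_node(text: str) -> dict:
    node = {"heading": "", "rationale": "", "citations": []}
    for line in text.splitlines():
        if line.startswith("Heading:"):
            node["heading"] = line[len("Heading:"):].strip()
        elif line.startswith("Rationale:"):
            node["rationale"] = line[len("Rationale:"):].strip()
        elif line.startswith("Citations:"):
            # Clause IDs are separated by semicolons, per the shape above.
            node["citations"] = [c.strip()
                                 for c in line[len("Citations:"):].split(";")]
    return node

sample = (
    "Heading: Basic Safeguarding of Contractor Systems\n"
    "Rationale: The proposal must describe the safeguards applied.\n"
    "Citations: FAR 52.204-21; DFARS 252.204-7012"
)
node = parse_outline_node(sample)
```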

Limitations

This model has significant limitations that make it suitable only for research and educational exploration, not for operational compliance decisions.

Key limitations include:

  • Tiny synthetic dataset
    The finetuning data consists of just 20 synthetic instruction–response pairs drawn from a single RFP-style scenario, with only 4 examples held out for evaluation. The model has seen very little diversity in agencies, requirement types, and writing styles.

  • Overfitting risk & narrow generalization
    The model may overfit the specific phrasing and outline style in this dataset and generalize poorly to real solicitations, especially those with different structures or agency-specific policies.

  • Small base model
    The base model is a small 1B-parameter Llama-3.2 variant, which constrains its capacity to track long, complex documents and subtle legal nuances.

  • Noisy evaluation signal
    Quantitatively, the in-domain compliance loss comparison between pre-LoRA and post-LoRA models is extremely noisy because it is based on only 4 eval examples, and the GSM8K, MMLU, and RACE benchmarks were run with small sample sizes. None of these scores should be treated as robust performance guarantees.

  • Hallucinations and misinterpretations
    Qualitatively, the model can still hallucinate citations, misinterpret requirements, or express overconfident “compliant” rationales that are incomplete or incorrect.

In practice, safer use would require:

  • Much larger and more diverse labeled datasets
  • More systematic human evaluation
  • Potentially larger or more specialized base models

This adapter should be viewed as a teaching and experimentation artifact, not a production compliance tool.

License

This project fine-tunes the meta-llama/Llama-3.2-1B base model, which is made available by Meta under the Llama 3.2 Community License; anyone using this repository should review and comply with that license before redistribution or deployment.

The LoRA adapter weights and synthetic dataset provided here are released for research and educational purposes only and are not derived from real contractor data or personally identifiable information. They should not be treated as legal advice or used as the sole basis for binding contracting or compliance decisions.
