{ "cells": [ { "cell_type": "raw", "id": "0", "metadata": {}, "source": [ "SPDX-License-Identifier: Apache-2.0 \n", "Copyright (c) 2023, Rahul Unnikrishnan Nair \n", "\n", "NOTICE: Original was modified to support NVIDIA GPUs" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "## Fine-tuning `google/gemma-2-2b-it` on Alpaca" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "### Overview\n", "\n", "This notebook is pared down to just the steps needed for LoRA fine-tuning:\n", "\n", "1. Install dependencies\n", "2. Configure LoRA\n", "3. Authenticate and load `google/gemma-2-2b-it` in 4-bit\n", "4. Load and format the Alpaca dataset (`tatsu-lab/alpaca`), then fine-tune with `SFTTrainer`\n", "5. Save adapter weights and run a quick inference check\n", "\n", "Unused and unrelated code has been removed." ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "#### Step 1: Install dependencies" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "%pip install -q transformers datasets trl peft accelerate bitsandbytes" ] }, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": [ "import os\n", "import torch\n", "from datasets import load_dataset\n", "from huggingface_hub import notebook_login\n", "from peft import LoraConfig\n", "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n", "from trl import SFTConfig, SFTTrainer\n", "\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n", "\n", "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "print(f\"Using device: {device}\")" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "#### Step 2: Configure LoRA" ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "lora_config = LoraConfig(\n", " r=16,\n", " lora_alpha=32,\n", " lora_dropout=0.05,\n", " bias=\"none\",\n", " task_type=\"CAUSAL_LM\",\n", " 
target_modules=[\n", " \"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n", " \"gate_proj\", \"up_proj\", \"down_proj\",\n", " ],\n", ")\n", "\n", "lora_config" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "#### Step 3: Login and load `google/gemma-2-2b-it`" ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "# You must accept Gemma terms and authenticate before loading the model.\n", "notebook_login()\n", "\n", "model_id = \"google/gemma-2-2b-it\"\n", "\n", "bnb_config = BitsAndBytesConfig(\n", " load_in_4bit=True,\n", " bnb_4bit_compute_dtype=torch.bfloat16,\n", " bnb_4bit_use_double_quant=True,\n", " bnb_4bit_quant_type=\"nf4\",\n", ")\n", "\n", "tokenizer = AutoTokenizer.from_pretrained(model_id)\n", "tokenizer.pad_token = tokenizer.eos_token\n", "tokenizer.padding_side = \"right\"\n", "\n", "model = AutoModelForCausalLM.from_pretrained(\n", " model_id,\n", " quantization_config=bnb_config,\n", " device_map=\"auto\",\n", " attn_implementation=\"eager\",  # eager attention is recommended for Gemma-2 training\n", ")\n", "model.config.use_cache = False" ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "#### Step 4: Load and format Alpaca dataset" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "We use `tatsu-lab/alpaca`, convert each row to an instruction-style prompt, and keep only a single `text` column for supervised fine-tuning."
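] }, { "cell_type": "markdown", "id": "13a", "metadata": {}, "source": [ "For illustration only, here is the target prompt layout on a hand-written example (the `demo_prompt` below is made up; the actual conversion is done by `format_alpaca` in the following cells):" ] }, { "cell_type": "code", "execution_count": null, "id": "13b", "metadata": {}, "outputs": [], "source": [ "# Illustrative only: a hand-written prompt in the layout used for fine-tuning.\n", "demo_prompt = (\n", "    \"### Instruction:\\nTranslate to French: hello\\n\\n\"\n", "    \"### Response:\\nbonjour\"\n", ")\n", "print(demo_prompt)"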
] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "dataset = load_dataset(\"tatsu-lab/alpaca\", split=\"train\")\n", "print(dataset)\n", "print(dataset[0])" ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "def format_alpaca(example):\n", "    if example[\"input\"].strip():\n", "        text = (\n", "            \"### Instruction:\\n\"\n", "            f\"{example['instruction']}\\n\\n\"\n", "            \"### Input:\\n\"\n", "            f\"{example['input']}\\n\\n\"\n", "            \"### Response:\\n\"\n", "            f\"{example['output']}\"\n", "        )\n", "    else:\n", "        text = (\n", "            \"### Instruction:\\n\"\n", "            f\"{example['instruction']}\\n\\n\"\n", "            \"### Response:\\n\"\n", "            f\"{example['output']}\"\n", "        )\n", "    return {\"text\": text}\n", "\n", "formatted_dataset = dataset.map(format_alpaca)\n", "formatted_dataset = formatted_dataset.remove_columns([\"instruction\", \"input\", \"output\"])\n", "split = formatted_dataset.train_test_split(test_size=0.02, seed=42)\n", "train_dataset = split[\"train\"]\n", "eval_dataset = split[\"test\"]\n", "\n", "print(train_dataset[0][\"text\"][:500])" ] }, { "cell_type": "markdown", "id": "15a", "metadata": {}, "source": [ "Fine-tune with `SFTTrainer`, applying the LoRA config from Step 2." ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "training_args = SFTConfig(\n", " output_dir=\"./gemma-2-2b-it-alpaca-lora\",\n", " num_train_epochs=1,\n", " per_device_train_batch_size=1,\n", " gradient_accumulation_steps=8,\n", " learning_rate=2e-4,\n", " lr_scheduler_type=\"cosine\",\n", " warmup_ratio=0.03,\n", " logging_steps=10,\n", " save_strategy=\"epoch\",\n", " eval_strategy=\"no\",  # named 'evaluation_strategy' in older transformers releases\n", " optim=\"paged_adamw_8bit\",\n", " bf16=torch.cuda.is_available(),\n", " gradient_checkpointing=True,\n", " max_length=1024,  # named 'max_seq_length' in older trl releases\n", " packing=True,\n", " report_to=\"none\",\n", ")\n", "\n", "trainer = SFTTrainer(\n", " model=model,\n", " args=training_args,\n", " train_dataset=train_dataset,\n", " eval_dataset=eval_dataset,\n", " peft_config=lora_config,\n", ")\n", "\n", "train_result = trainer.train()\n", "train_result" ] }, { "cell_type": 
"markdown", "id": "17", "metadata": {}, "source": [ "#### Step 5: Save LoRA adapter and run a quick test" ] }, { "cell_type": "code", "execution_count": null, "id": "18", "metadata": {}, "outputs": [], "source": [ "adapter_out = \"./gemma-2-2b-it-alpaca-lora/final_adapter\"\n", "trainer.model.save_pretrained(adapter_out)\n", "tokenizer.save_pretrained(adapter_out)\n", "print(f\"Saved LoRA adapter to: {adapter_out}\")\n", "\n", "prompt = \"### Instruction:\\nExplain photosynthesis in simple words.\\n\\n### Response:\\n\"\n", "inputs = tokenizer(prompt, return_tensors=\"pt\").to(trainer.model.device)\n", "trainer.model.config.use_cache = True  # re-enable the KV cache (disabled for training)\n", "trainer.model.eval()\n", "with torch.no_grad():\n", "    outputs = trainer.model.generate(\n", "        **inputs,\n", "        max_new_tokens=120,\n", "        do_sample=True,\n", "        temperature=0.7,\n", "        top_p=0.9,\n", "        eos_token_id=tokenizer.eos_token_id,\n", "    )\n", "\n", "print(tokenizer.decode(outputs[0], skip_special_tokens=True))" ] }, { "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ "---\n", "Notebook cleaned for Alpaca + Gemma instruction tuning."
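] }, { "cell_type": "markdown", "id": "19a", "metadata": {}, "source": [ "A minimal sketch of reloading the saved adapter for later inference, reusing the earlier `model_id`, `bnb_config`, and `adapter_out` values (PEFT's `PeftModel.from_pretrained` attaches a saved adapter to a freshly loaded base model):" ] }, { "cell_type": "code", "execution_count": null, "id": "19b", "metadata": {}, "outputs": [], "source": [ "# Sketch: attach the saved LoRA adapter to a freshly loaded base model.\n", "from peft import PeftModel\n", "\n", "base_model = AutoModelForCausalLM.from_pretrained(\n", "    model_id,\n", "    quantization_config=bnb_config,\n", "    device_map=\"auto\",\n", ")\n", "tuned_model = PeftModel.from_pretrained(base_model, adapter_out)\n", "tuned_model.eval()"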
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.11" } }, "nbformat": 4, "nbformat_minor": 5 }