Based on the paper *LLaMA: Open and Efficient Foundation Language Models* (arXiv:2302.13971).
This is Carpincho-30B, a 4-bit QLoRA checkpoint of an instruction-tuned LLM based on LLaMA-30B. It is trained to answer in colloquial Argentine Spanish.

It was trained on 2x RTX 3090 GPUs (48 GB total) for 120 hours using the Hugging Face QLoRA code (4-bit quantization).
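The exact training configuration is not included in this card. As a rough illustration only, a QLoRA run with the Hugging Face stack combines 4-bit NF4 quantization of the frozen base model with a LoRA adapter; the rank, alpha, dropout and target modules below are assumptions, not the values actually used for Carpincho:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen base model (QLoRA-style).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter hyperparameters; r, alpha and target_modules are illustrative guesses.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```

These two configs would be passed to `from_pretrained` and `get_peft_model` respectively; only the small LoRA matrices are updated during training.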
The model is provided in LoRA format.
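Because only the LoRA adapter is distributed, it can optionally be merged into the base weights for standalone use. A minimal sketch, assuming you have the base LLaMA-30B weights locally (the output path is a placeholder; merging requires loading the base model unquantized, e.g. in fp16):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "models/huggyllama_llama-30b/",  # base LLaMA-30B weights
    torch_dtype=torch.float16,       # load unquantized so the merge is exact
)
model = PeftModel.from_pretrained(base, "carpincho-30b-qlora")
merged = model.merge_and_unload()    # fold the LoRA deltas into the base weights
merged.save_pretrained("carpincho-30b-merged")  # hypothetical output path
```

The merged checkpoint can then be loaded with plain `AutoModelForCausalLM.from_pretrained` without peft installed.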
Here is example inference code. You will need to install the following requirements:
```
bitsandbytes==0.39.0
transformers @ git+https://github.com/huggingface/transformers.git
peft @ git+https://github.com/huggingface/peft.git
accelerate @ git+https://github.com/huggingface/accelerate.git
einops==0.6.1
evaluate==0.4.0
scikit-learn==1.2.2
sentencepiece==0.1.99
wandb==0.15.3
```
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, LlamaTokenizer

model_name = "models/huggyllama_llama-30b/"
adapters_name = "carpincho-30b-qlora"

print(f"Starting to load the model {model_name} into memory")

# Load the base model with 4-bit quantization.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map="sequential",
)

print(f"Loading {adapters_name} into memory")

# Attach the Carpincho LoRA adapter on top of the base model.
model = PeftModel.from_pretrained(model, adapters_name)

tokenizer = LlamaTokenizer.from_pretrained(model_name)
tokenizer.bos_token_id = 1
stop_token_ids = [0]  # available for custom stopping criteria (unused below)

print(f"Successfully loaded the model {model_name} into memory")

def main(tokenizer):
    # Alpaca-style instruction prompt.
    prompt = '''Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
%s
### Response:
''' % "Hola, como estas?"

    batch = tokenizer(prompt, return_tensors="pt")
    batch = {k: v.cuda() for k, v in batch.items()}

    with torch.no_grad():
        generated = model.generate(
            inputs=batch["input_ids"],
            do_sample=True,
            use_cache=True,
            repetition_penalty=1.1,
            max_new_tokens=100,
            temperature=0.9,
            top_p=0.95,
            top_k=40,
            return_dict_in_generate=True,
            output_attentions=False,
            output_hidden_states=False,
            output_scores=False,
        )
    result_text = tokenizer.decode(generated["sequences"].cpu().tolist()[0])
    print(result_text)

main(tokenizer)
```
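The Alpaca-style template above hard-codes a single instruction. It can be factored into a small helper so different instructions reuse the same format (`build_prompt` is a convenience name introduced here, not part of the original code):

```python
def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the Alpaca-style template used above."""
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n"
        "### Instruction:\n"
        f"{instruction}\n"
        "### Response:\n"
    )

prompt = build_prompt("Hola, como estas?")
```

Keeping the template identical between training and inference matters: the adapter was tuned on this exact layout, so deviations in headers or newlines can degrade response quality.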
This is a generic LLM chatbot that can be used to interact directly with humans.

This bot is uncensored and may produce shocking answers. It also reflects biases present in the training material. Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Contact the creator at @ortegaalfredo on Twitter/GitHub.