{ "cells": [ { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# ! pip install tensorflow" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# ! pip install torch" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# ! pip install transformers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pipeline:\n", "The pipeline function is the highest-level API in the transformers library.\n", "It returns an end-to-end object that performs an NLP task on one or several texts.\n", "A pipeline bundles all the necessary pre-processing (the model expects numbers, not raw text), feeds the encoded inputs to the model, and applies post-processing to make the output human readable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Sentiment Analysis Pipeline" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']\n", "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[{'label': 'LABEL_0', 'score': 0.06559786200523376}]\n", "[{'label': 'LABEL_0', 'score': 0.12948733568191528}, {'label': 'LABEL_0', 'score': 0.12888683378696442}]\n" ] } ], "source": [ "from transformers import pipeline\n", "\n", "# distilgpt2 has no trained classification head (see the warning), so these scores are not meaningful;\n", "# a fine-tuned sentiment checkpoint would give real POSITIVE/NEGATIVE labels\n", "classifier = pipeline('sentiment-analysis', model='distilgpt2')\n", "\n", "# pass a single text:\n", "res = classifier(\"I've been waiting for a Huggingface course\")\n", "print(res)\n", "\n", "# pass multiple texts:\n", "res = classifier(['I love you', 'I hate you'])\n", "print(res)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Zero Shot Classification 
Pipeline\n", "Classifies a sentence against a set of candidate labels, without any fine-tuning on those labels." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']\n", "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n", "Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.\n", "Tokenizer was not supporting padding necessary for zero-shot, attempting to use `pad_token=eos_token`\n" ] }, { "data": { "text/plain": [ "{'sequence': 'This is a course about the Transformers library',\n", " 'labels': ['education', 'politics', 'business'],\n", " 'scores': [0.36338528990745544, 0.3443466126918793, 0.29226812720298767]}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from transformers import pipeline\n", "\n", "# distilgpt2 is not an NLI model, hence the warnings above; an NLI checkpoint\n", "# such as facebook/bart-large-mnli is the usual choice for zero-shot classification\n", "classifier = pipeline('zero-shot-classification', model='distilgpt2')\n", "classifier('This is a course about the Transformers library',\n", " candidate_labels=['education', 'politics', 'business'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Generation Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Auto-completes a given prompt.\n", "Output is sampled with a bit of randomness, so it changes each time you run it."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n" ] }, { "data": { "text/plain": [ "[{'generated_text': 'In this course we will teach you how to play and take your skill set as a starter, how to play, and as a player. I will'},\n", " {'generated_text': 'In this course we will teach you how to convert to Java as your main operating system and write to your friends through our website at Google+.\\n\\n'}]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from transformers import pipeline\n", "\n", "generator = pipeline('text-generation', model='distilgpt2')\n", "generator('In this course we will teach you how to',\n", " max_length=30,\n", " num_return_sequences=2)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n" ] }, { "data": { "text/plain": [ "[{'generated_text': 'in this course we will teach you how to play with a real life experience. It will be a lot more about the importance of understanding and building a'},\n", " {'generated_text': 'in this course we will teach you how to achieve the objectives of the program. We will teach you how to achieve the objectives of the program. 
We'}]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from transformers import pipeline\n", "\n", "generator = pipeline('text-generation', model='distilgpt2')\n", "generator('in this course we will teach you how to',\n", " max_length=30,\n", " num_return_sequences=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both text-generation cells above use the distilgpt2 model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fill Mask Pipeline\n", "Mask filling is a pretraining objective of BERT: the model guesses masked words, like fill-in-the-blanks.\n", "Here we ask the pipeline for the two most likely tokens for the mask using top_k." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'bert.pooler.dense.bias']\n", "- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", "- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n" ] }, { "data": { "text/plain": [ "Downloading (…)okenizer_config.json: 0%|          | 0.00/29.0 [00:00<?, ?B/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from transformers import pipeline\n", "\n", "unmasker = pipeline('fill-mask', model='bert-base-cased')\n", "# the original prompt was lost in this copy of the notebook; any sentence containing [MASK] works, e.g.:\n", "unmasker('This course will teach you all about [MASK] models.', top_k=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Translation Pipeline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "This tokenizer cannot be instantiated. Please make sure you have `sentencepiece` installed in order to use this tokenizer.", "output_type": "error", "traceback": [ "\u001b[0;32m      2\u001b[0m \u001b[39m# ! 
pip install sentencepiece\u001b[39;00m\n\u001b[0;32m 3\u001b[0m \u001b[39mimport\u001b[39;00m \u001b[39msentencepiece\u001b[39;00m\n\u001b[1;32m----> 4\u001b[0m translator \u001b[39m=\u001b[39m pipeline(\u001b[39m'\u001b[39;49m\u001b[39mtranslation\u001b[39;49m\u001b[39m'\u001b[39;49m, model\u001b[39m=\u001b[39;49m\u001b[39m'\u001b[39;49m\u001b[39mHelsinki-NLP/opus-mt-fr-en\u001b[39;49m\u001b[39m'\u001b[39;49m)\n\u001b[0;32m 5\u001b[0m translator(\u001b[39m'\u001b[39m\u001b[39mCe cours est produit par Hugging Face.\u001b[39m\u001b[39m'\u001b[39m)\n", "File \u001b[1;32m~\\AppData\\Roaming\\Python\\Python311\\site-packages\\transformers\\pipelines\\__init__.py:931\u001b[0m, in \u001b[0;36mpipeline\u001b[1;34m(task, model, config, tokenizer, feature_extractor, image_processor, framework, revision, use_fast, token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)\u001b[0m\n\u001b[0;32m 928\u001b[0m tokenizer_kwargs \u001b[39m=\u001b[39m model_kwargs\u001b[39m.\u001b[39mcopy()\n\u001b[0;32m 929\u001b[0m tokenizer_kwargs\u001b[39m.\u001b[39mpop(\u001b[39m\"\u001b[39m\u001b[39mtorch_dtype\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39mNone\u001b[39;00m)\n\u001b[1;32m--> 931\u001b[0m tokenizer \u001b[39m=\u001b[39m AutoTokenizer\u001b[39m.\u001b[39;49mfrom_pretrained(\n\u001b[0;32m 932\u001b[0m tokenizer_identifier, use_fast\u001b[39m=\u001b[39;49muse_fast, _from_pipeline\u001b[39m=\u001b[39;49mtask, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mhub_kwargs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mtokenizer_kwargs\n\u001b[0;32m 933\u001b[0m )\n\u001b[0;32m 935\u001b[0m \u001b[39mif\u001b[39;00m load_image_processor:\n\u001b[0;32m 936\u001b[0m \u001b[39m# Try to infer image processor from model or config name (if provided as str)\u001b[39;00m\n\u001b[0;32m 937\u001b[0m \u001b[39mif\u001b[39;00m image_processor \u001b[39mis\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n", "File 
\u001b[1;32m~\\AppData\\Roaming\\Python\\Python311\\site-packages\\transformers\\models\\auto\\tokenization_auto.py:774\u001b[0m, in \u001b[0;36mAutoTokenizer.from_pretrained\u001b[1;34m(cls, pretrained_model_name_or_path, *inputs, **kwargs)\u001b[0m\n\u001b[0;32m 772\u001b[0m \u001b[39mreturn\u001b[39;00m tokenizer_class_py\u001b[39m.\u001b[39mfrom_pretrained(pretrained_model_name_or_path, \u001b[39m*\u001b[39minputs, \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkwargs)\n\u001b[0;32m 773\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m--> 774\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[0;32m 775\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mThis tokenizer cannot be instantiated. Please make sure you have `sentencepiece` installed \u001b[39m\u001b[39m\"\u001b[39m\n\u001b[0;32m 776\u001b[0m \u001b[39m\"\u001b[39m\u001b[39min order to use this tokenizer.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[0;32m 777\u001b[0m )\n\u001b[0;32m 779\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[0;32m 780\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mUnrecognized configuration class \u001b[39m\u001b[39m{\u001b[39;00mconfig\u001b[39m.\u001b[39m\u001b[39m__class__\u001b[39m\u001b[39m}\u001b[39;00m\u001b[39m to build an AutoTokenizer.\u001b[39m\u001b[39m\\n\u001b[39;00m\u001b[39m\"\u001b[39m\n\u001b[0;32m 781\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mModel type should be one of \u001b[39m\u001b[39m{\u001b[39;00m\u001b[39m'\u001b[39m\u001b[39m, \u001b[39m\u001b[39m'\u001b[39m\u001b[39m.\u001b[39mjoin(c\u001b[39m.\u001b[39m\u001b[39m__name__\u001b[39m\u001b[39m \u001b[39m\u001b[39mfor\u001b[39;00m\u001b[39m \u001b[39mc\u001b[39m \u001b[39m\u001b[39min\u001b[39;00m\u001b[39m \u001b[39mTOKENIZER_MAPPING\u001b[39m.\u001b[39mkeys())\u001b[39m}\u001b[39;00m\u001b[39m.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[0;32m 782\u001b[0m )\n", "\u001b[1;31mValueError\u001b[0m: This tokenizer cannot be 
instantiated. Please make sure you have `sentencepiece` installed in order to use this tokenizer." ] } ], "source": [ "from transformers import pipeline\n", "# ! pip install sentencepiece\n", "import sentencepiece\n", "# this cell fails with the ValueError below until sentencepiece is installed and the kernel restarted\n", "translator = pipeline('translation', model='Helsinki-NLP/opus-mt-fr-en')\n", "translator('Ce cours est produit par Hugging Face.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In summary, the following tasks are available through the pipeline API:\n", "\n", "- Text Classification (also called sequence classification)\n", "- Zero-Shot Classification\n", "- Text Generation\n", "- Text Completion (mask filling) / Masked Language Modeling\n", "- Token Classification\n", "- Question Answering\n", "- Summarization\n", "- Translation" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 2 }