{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ML Practice Series: Module 13 - Naive Bayes\n", "\n", "Welcome to Module 13! We're exploring **Naive Bayes**, a probabilistic classifier based on Bayes' Theorem with the \"naive\" assumption of conditional independence between features given the class.\n", "\n", "### Resources:\n", "Refer to the **[Naive Bayes Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)** on your hub for the mathematical derivation of $P(A|B)$ and how it's used in spam filtering.\n", "\n", "### Objectives:\n", "1. **Bayes' Theorem**: Calculating the posterior probability.\n", "2. **Model Variants**: Gaussian vs Multinomial vs Bernoulli.\n", "3. **Text Classification**: Using Naive Bayes for NLP tasks.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Setup\n", "We will use a small text dataset for **Spam detection**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.metrics import accuracy_score, confusion_matrix\n", "\n", "# Sample Text Data\n", "data = {\n", " 'text': [\n", " 'Free money now!', \n", " 'Hi, how are you?', \n", " 'Limited offer, buy now!', \n", " 'Meeting at 5pm', \n", " 'Win a prize today!', \n", " 'Review the documents'\n", " ],\n", " 'label': [1, 0, 1, 0, 1, 0] # 1 = Spam, 0 = Ham\n", "}\n", "df = pd.DataFrame(data)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Text Preprocessing\n", "\n", "### Task 1: Vectorization\n", "Machine learning models can't read text directly. Use `CountVectorizer` to convert text into a matrix of token counts."
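, "\n", "Before attempting the task, here is a minimal sketch (separate from the solution; `toy`, `cv_demo`, and `matrix` are illustrative names, not part of the exercise) of what `CountVectorizer` produces:\n", "\n", "```python\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "toy = ['free money', 'free offer offer']  # tiny illustrative corpus\n", "cv_demo = CountVectorizer()\n", "matrix = cv_demo.fit_transform(toy)       # sparse document-term count matrix\n", "print(cv_demo.get_feature_names_out())    # vocabulary, sorted: ['free' 'money' 'offer']\n", "print(matrix.toarray())                   # one row per document: [[1 1 0], [1 0 2]]\n", "```\n", "\n", "Each column counts one vocabulary token per document, which is exactly the representation `MultinomialNB` expects."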
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<details>\n", "<summary>Click to see Solution</summary>\n", "\n", "```python\n", "# Build the vocabulary and the document-term count matrix, dropping English stop words\n", "cv = CountVectorizer(stop_words='english')\n", "X = cv.fit_transform(df['text'])\n", "y = df['label']\n", "```\n", "</details>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Training & Prediction\n", "\n", "### Task 2: Multinomial NB\n", "Fit a `MultinomialNB` model and predict the class for a new message: \"Win money buy now\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<details>\n", "<summary>Click to see Solution</summary>\n", "\n", "```python\n", "# Fit the classifier on the training counts\n", "nb = MultinomialNB()\n", "nb.fit(X, y)\n", "\n", "# Vectorize the new message with the SAME fitted vocabulary, then predict\n", "new_msg = [\"Win money buy now\"]\n", "new_vec = cv.transform(new_msg)\n", "prediction = nb.predict(new_vec)\n", "print(\"Spam\" if prediction[0] == 1 else \"Ham\")\n", "```\n", "</details>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "### Excellent Probabilistic Thinking!\n", "Naive Bayes is often the first baseline for NLP projects because it is fast and effective.\n", "Next: **Gradient Boosting & XGBoost**." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 4 }