{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ML Practice Series: Module 13 - Naive Bayes\n",
"\n",
"Welcome to Module 13! We're exploring **Naive Bayes**, a probabilistic classifier based on Bayes' Theorem with the \"naive\" assumption of independence between features.\n",
"\n",
"### Resources:\n",
"Refer to the **[Naive Bayes Section](https://aashishgarg13.github.io/DataScience/ml_complete-all-topics/)** on your hub for the mathematical derivation of $P(A|B)$ and how it's used in spam filtering.\n",
"\n",
"### Objectives:\n",
"1. **Bayes Theorem**: Calculating posterior probability.\n",
"2. **Different Variants**: Gaussian vs Multinomial vs Bernoulli.\n",
"3. **Text Classification**: Using Naive Bayes for NLP tasks.\n",
"\n",
"---"
]
},
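{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a warm-up for objective 1, here is a minimal sketch of Bayes' Theorem applied to spam filtering. The probabilities below are made-up numbers chosen purely for illustration:\n",
"\n",
"```python\n",
"# Hypothetical values, for illustration only (not from any dataset)\n",
"p_spam = 0.4                 # prior: P(spam)\n",
"p_ham = 1 - p_spam           # prior: P(ham)\n",
"p_word_given_spam = 0.5      # likelihood: P('free' | spam)\n",
"p_word_given_ham = 0.05      # likelihood: P('free' | ham)\n",
"\n",
"# Evidence P('free') via the law of total probability\n",
"p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham\n",
"\n",
"# Bayes' Theorem: posterior P(spam | 'free')\n",
"posterior = p_word_given_spam * p_spam / p_word\n",
"print(round(posterior, 3))\n",
"```\n",
"\n",
"Even with a modest prior, the posterior is high because the word is ten times more likely under spam than under ham."
]
},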
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Setup\n",
"We will use a small text dataset for **Spam detection**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd \n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn.metrics import accuracy_score, confusion_matrix\n",
"\n",
"# Sample Text Data\n",
"data = {\n",
" 'text': [\n",
" 'Free money now!', \n",
" 'Hi, how are you?', \n",
" 'Limited offer, buy now!', \n",
" 'Meeting at 5pm', \n",
" 'Win a prize today!', \n",
" 'Review the documents'\n",
" ],\n",
" 'label': [1, 0, 1, 0, 1, 0] # 1 = Spam, 0 = Ham\n",
"}\n",
"df = pd.DataFrame(data)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Text Preprocessing\n",
"\n",
"### Task 1: Vectorization\n",
"Machine learning models can't read text directly. Use `CountVectorizer` to convert text into a matrix of token counts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Click to see Solution
\n",
"\n",
"```python\n",
"cv = CountVectorizer(stop_words='english')\n",
"X = cv.fit_transform(df['text'])\n",
"y = df['label']\n",
"```\n",
" "
]
},
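{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what `CountVectorizer` actually produced, you can inspect the learned vocabulary and the count matrix. A self-contained sketch on two of the messages (the token names shown depend on sklearn's built-in English stop-word list):\n",
"\n",
"```python\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"texts = ['Free money now!', 'Hi, how are you?']\n",
"cv = CountVectorizer(stop_words='english')\n",
"X = cv.fit_transform(texts)\n",
"\n",
"# Each column is one surviving token; stop words like 'how' are dropped\n",
"print(cv.get_feature_names_out())\n",
"print(X.toarray())\n",
"```\n",
"\n",
"Each row of the dense matrix is one message, and each entry counts how often that token appears in it."
]
},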
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Training & Prediction\n",
"\n",
"### Task 2: Multinomial NB\n",
"Fit a `MultinomialNB` model and predict the class for a new message: \"Win money buy now\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Click to see Solution
\n",
"\n",
"```python\n",
"nb = MultinomialNB()\n",
"nb.fit(X, y)\n",
"\n",
"new_msg = [\"Win money buy now\"]\n",
"new_vec = cv.transform(new_msg)\n",
"prediction = nb.predict(new_vec)\n",
"print(\"Spam\" if prediction[0] == 1 else \"Ham\")\n",
"```\n",
" "
]
},
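{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hard label hides how confident the model is. `MultinomialNB.predict_proba` exposes the posterior for each class; here is a self-contained sketch on the same toy data:\n",
"\n",
"```python\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"\n",
"texts = ['Free money now!', 'Hi, how are you?', 'Limited offer, buy now!',\n",
"         'Meeting at 5pm', 'Win a prize today!', 'Review the documents']\n",
"labels = [1, 0, 1, 0, 1, 0]  # 1 = Spam, 0 = Ham\n",
"\n",
"cv = CountVectorizer(stop_words='english')\n",
"X = cv.fit_transform(texts)\n",
"\n",
"nb = MultinomialNB()\n",
"nb.fit(X, labels)\n",
"\n",
"# Columns follow nb.classes_, i.e. [P(ham), P(spam)] here\n",
"probs = nb.predict_proba(cv.transform(['Win money buy now']))\n",
"print(probs)\n",
"```\n",
"\n",
"Because 'win', 'money', and 'buy' appear only in spam messages, the spam column should dominate."
]
},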
{
"cell_type": "markdown",
"metadata": {},
"source": [
"--- \n",
"### Excellent Probabilistic Thinking! \n",
"Naive Bayes is often the baseline for NLP projects because it's fast and effective.\n",
"Next: **Gradient Boosting & XGBoost**."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}