Spaces: Running
shourya committed on
Commit · 10b8b4e
Parent(s): 1ac4772
Add HuggingFace Enabling Sessions interactive demo
- Created Gradio app with 3 tabs (Session 1 intro, Session 2 hands-on, Resources)
- Added configuration for 5 NLP tasks (Sentiment, NER, QA, Summarization, Similarity)
- Created utility functions for model inference and tokenization
- Added comprehensive slide content (SESSION1, SESSION2)
- Added sample data and demo examples
- Created speaker notes with detailed presentation guide
- Updated README with full documentation
Features:
- Live inference playground with 5+ NLP tasks
- Interactive tokenization explorer
- Sample data for all demonstrations
- Production-ready configuration
- Attendee-friendly resources tab
- README.md +210 -1
- SPEAKER_NOTES.md +842 -0
- app.py +587 -0
- config.py +52 -0
- data/demo_samples/embeddings.txt +6 -0
- data/demo_samples/ner.txt +5 -0
- data/demo_samples/qa.txt +2 -0
- data/demo_samples/sentiment.txt +5 -0
- data/demo_samples/summarization.txt +1 -0
- data/sample_texts.csv +12 -0
- requirements.txt +9 -0
- slides/SESSION1_SLIDES.md +387 -0
- slides/SESSION2_SLIDES.md +731 -0
- test +0 -1
- utils.py +129 -0
README.md
CHANGED
@@ -9,4 +9,213 @@ app_file: app.py
pinned: false
---
# 🤗 HuggingFace Enabling Sessions

**Interactive Demo Platform for Transformers, Hub APIs, and NLP Pipelines**

## 📋 Overview

This is an interactive Gradio application designed for the **HuggingFace Enabling Sessions** workshop. It provides hands-on demonstrations of:

- **Session 1 (45 min):** Introduction to the HuggingFace ecosystem, Transformers architecture, and best practices
- **Session 2 (90 min):** Hands-on developer workshop with tokenization deep dives and an inference playground across 5+ NLP tasks

## 🚀 Quick Start

The app is hosted on HuggingFace Spaces and requires **no local installation**. Simply:

1. Open the Spaces URL
2. Explore the 3 main tabs:
   - **Session 1: Introduction** — Embedded slides + live NLP demos
   - **Session 2: Hands-On Developer** — Tokenizer explorer + inference playground
   - **Resources & Next Steps** — Documentation links and learning resources

## 📚 Session Contents

### Session 1: Introduction to HuggingFace (45 minutes)

**Topics Covered:**
- HuggingFace Platform overview (Hub, Transformers, Datasets, Spaces)
- Core abstractions: Pipelines, Models, Tokenizers
- Architecture patterns: Encoders (BERT), Decoders (GPT), Encoder-Decoders (T5/BART)
- Enterprise NLP landscape (licensing, open-source vs. commercial)

**Live Demos:**
- Sentiment Analysis using DistilBERT
- Named Entity Recognition (NER) with BERT

**Materials:** [SESSION1_SLIDES.md](slides/SESSION1_SLIDES.md)

---

### Session 2: Hands-On Developer Workshop (90 minutes)

**Topics Covered:**
- Tokenization mechanics and strategies
- Inference across 5+ NLP tasks
- Understanding model outputs and confidence scores
- Production considerations and optimization

**Interactive Tasks:**
- 🔤 **Tokenization Explorer** — Visualize how text becomes token IDs
- 📊 **Sentiment Analysis** — Classify text emotions
- 🏷️ **Named Entity Recognition** — Extract persons, organizations, locations
- ❓ **Question Answering** — Answer questions from context
- 📝 **Text Summarization** — Generate concise summaries
- 🔗 **Semantic Similarity** — Compare text meaning
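The Semantic Similarity task ultimately reduces to cosine similarity between sentence embeddings. A minimal sketch of the metric itself, with short toy vectors standing in for real all-MiniLM-L6-v2 embeddings (which are 384-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 = similar, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d stand-ins for sentence embeddings (illustrative values only)
emb_cat = [0.9, 0.1, 0.0]
emb_kitten = [0.8, 0.2, 0.1]
emb_invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(emb_cat, emb_kitten))   # close to 1 → similar meaning
print(cosine_similarity(emb_cat, emb_invoice))  # close to 0 → unrelated
```

In the actual demo the vectors would come from the embedding model; the comparison step is exactly this function.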
**Materials:** [SESSION2_SLIDES.md](slides/SESSION2_SLIDES.md)

---

## 🛠️ Project Structure

```
infy/
├── app.py                 # Main Gradio application
├── config.py              # Configuration (model IDs, task definitions)
├── utils.py               # Utility functions for inference
├── requirements.txt       # Python dependencies
├── README.md              # This file
├── SPEAKER_NOTES.md       # Presenter guide with timing
├── slides/
│   ├── SESSION1_SLIDES.md # Session 1 presentation content
│   └── SESSION2_SLIDES.md # Session 2 presentation content
└── data/
    ├── sample_texts.csv   # Sample texts for demos
    └── demo_samples/
        ├── sentiment.txt
        ├── ner.txt
        ├── qa.txt
        ├── summarization.txt
        └── embeddings.txt
```

## 🤖 Models Used

| Task | Model | Type | License |
|------|-------|------|---------|
| Sentiment Analysis | distilbert-base-uncased-finetuned-sst-2-english | Encoder | Apache 2.0 |
| Named Entity Recognition | dslim/bert-base-uncased-finetuned-ner | Encoder | Apache 2.0 |
| Question Answering | deepset/roberta-base-squad2 | Encoder | Apache 2.0 |
| Summarization | facebook/bart-large-cnn | Encoder-Decoder | MIT |
| Semantic Similarity | sentence-transformers/all-MiniLM-L6-v2 | Encoder | Apache 2.0 |

## 📖 How to Use

### During Sessions

1. **Access the Spaces URL** — Attendees join via a shared link
2. **Session 1 (45 min)**
   - Presenter shares screen and narrates through the slides
   - Live demos showcase "click-to-run" NLP tasks
   - Q&A after each major section
3. **Session 2 (90 min)**
   - Presenter guides attendees through tokenization and inference
   - Attendees observe interactive widgets
   - Exercise checkpoints for hands-on exploration
   - Discussion on production considerations

### After Sessions

1. **Clone the repository:**
```bash
git clone https://huggingface.co/spaces/[your-username]/infy
```
2. **Install dependencies:**
```bash
pip install -r requirements.txt
```

3. **Run locally:**
```bash
python app.py
```

4. **Explore further:**
   - Modify sample data in `data/sample_texts.csv`
   - Add more models to `config.py`
   - Create custom tasks in `app.py`

## 🎓 Learning Resources

### Official Documentation
- [Transformers Library Docs](https://huggingface.co/docs/transformers/)
- [Datasets Library Docs](https://huggingface.co/docs/datasets/)
- [HuggingFace Course (Free)](https://huggingface.co/course/)
- [Hub Documentation](https://huggingface.co/docs/hub/)

### Model Hub
- Browse 100K+ models: https://huggingface.co/models
- Search by task, language, or architecture

### Community
- [HuggingFace Forums](https://discuss.huggingface.co/)
- [GitHub Issues](https://github.com/huggingface/transformers/issues)
- Twitter: [@huggingface](https://twitter.com/huggingface)

### Next Steps
- **Fine-tune on your data** — Adapt pre-trained models for domain-specific tasks
- **Deploy to Spaces** — Create interactive demos like this one
- **Publish to the Hub** — Share models and datasets with the community
- **Explore advanced techniques** — Quantization, distillation, multi-model pipelines

## 🔧 Customization

### Add a New Task

1. **Add the model to `config.py`:**
```python
"new_task": {
    "name": "Task Name",
    "model": "model-id-from-hub",
    "example": "example text",
}
```

2. **Add a function to `utils.py`:**
```python
def run_new_task(text):
    pipe = load_pipeline("new_task")
    return pipe(text)
```

3. **Add a widget to `app.py`:**
```python
with gr.Tab("New Task"):
    input_box = gr.Textbox()
    output_box = gr.Markdown()
    btn = gr.Button("Run")  # button must be defined before wiring the click handler
    btn.click(run_new_task, inputs=[input_box], outputs=[output_box])
```
### Modify Sample Data

Edit `data/sample_texts.csv` or add `.txt` files to `data/demo_samples/`.

## 📝 Environment

- **Python:** 3.8+
- **Framework:** Gradio 6.9.0
- **ML:** Transformers, Torch
- **Hosting:** HuggingFace Spaces

## 📄 License

This project is open-source and available for educational and commercial use. Model licenses vary — see individual model cards for details.

## 👨‍🏫 Presenter Notes

See [SPEAKER_NOTES.md](SPEAKER_NOTES.md) for:
- Session timing breakdowns
- Demo sequences and talking points
- Troubleshooting common issues
- Tips for live presentations

## 📧 Questions & Feedback

- Ask during the sessions
- Post on the HuggingFace Forums
- Follow up on company Slack/Teams

---

**Ready to dive into NLP? Start with Session 1: Introduction! 🚀**
SPEAKER_NOTES.md
ADDED
@@ -0,0 +1,842 @@
# Speaker Notes & Presentation Guide
## HuggingFace Enabling Sessions

---

## Table of Contents
1. [Pre-Session Setup](#pre-session-setup)
2. [Session 1: Introduction (45 min)](#session-1-introduction-45-min)
3. [Session 2: Hands-On Developer (90 min)](#session-2-hands-on-developer-90-min)
4. [Troubleshooting Guide](#troubleshooting-guide)
5. [Tips for Effective Presentation](#tips-for-effective-presentation)

---

## Pre-Session Setup

### Before You Start
- ✅ **Test the Spaces app** — Load the URL and verify all widgets work
- ✅ **Test internet connection** — Ensure model downloads work smoothly
- ✅ **Download slides** — Have slides available locally (PowerPoint/PDF)
- ✅ **Prepare talking points** — Familiarize yourself with the content
- ✅ **Set up screen sharing** — Test camera + screen share on your platform
- ✅ **Time check** — Confirm session durations with organizers

### Spaces URL
- Share this link with attendees: `https://huggingface.co/spaces/[username]/infy`
- It's the single platform for the entire workshop

### Materials Distribution
1. **During:** Share the Spaces URL in chat/email
2. **After:** Email attendees:
   - Spaces link (for reference)
   - Slide markdown files (`SESSION1_SLIDES.md`, `SESSION2_SLIDES.md`)
   - GitHub/GitLab link (if you want them to explore the code)
   - Links to HF documentation

### Tech Setup Checklist
- [ ] Spaces URL loads successfully
- [ ] All 3 tabs visible (Session 1, Session 2, Resources)
- [ ] Demo widgets respond to clicks
- [ ] Screen resolution adequate for showing text (1080p+ recommended)
- [ ] Audio working (presenter test)
- [ ] Network stable (no VPN dropouts)

---

## Session 1: Introduction (45 min)

### Overview
**Goal:** Build a foundational understanding of the HuggingFace ecosystem
**Audience:** Mixed technical levels (devs, product, non-technical)
**Format:** Slides + 2 live demos

### Timing Breakdown
```
0:00-0:05   Title + Agenda (5 min)
0:05-0:15   What is HuggingFace (10 min)
0:15-0:25   Core Abstractions - Pipelines (10 min)
            → DEMO 1: Sentiment Analysis
0:25-0:35   Architecture Patterns (10 min)
            → Quick review of Encoders/Decoders
0:35-0:40   Enterprise Landscape (5 min)
0:40-0:45   Q&A (5 min)
```

### Talking Points by Section

#### 0:00-0:05 | Title + Agenda
**What to Say:**
> "Good morning! Welcome to the HuggingFace Enabling Sessions. Today we're going to explore one of the most exciting toolkits in modern NLP. By the end of this session, you'll understand how transformers work, why they're revolutionary, and how to use them practically."

**Q&A Preparation:**
- Have an answer ready for: "Do I need to know machine learning?"
- Response: "No! We start from the basics. If you code at all, you'll follow along."

#### 0:05-0:15 | What is HuggingFace (Slides 3-4)
**Key Message:**
HuggingFace democratized NLP — it took research papers and made them accessible

**Talking Points:**
1. **The Problem (Pre-2017):**
   - Great models existed... scattered in papers
   - Training from scratch was standard
   - Weeks/months of work per project
2. **The Solution (HuggingFace 2018+):**
   - Centralized Hub: 100K+ ready-to-use models
   - One library: `transformers`
   - One import: `from transformers import pipeline`
3. **Why It Matters:**
   - NLP went from months → weeks → days
   - No longer need GPU clusters for basic work
   - Community benefits from shared knowledge

**Live Example (Don't run yet, just show):**
```python
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
print(classifier("I love this!"))
# Output: [{'label': 'POSITIVE', 'score': 0.9987}]
# That's it! 2 lines of code.
```

**Talking Point:**
> "Imagine if you had to train a sentiment classifier from scratch. That would take weeks — data collection, labeling, infrastructure setup, all of it. But with HuggingFace, you get this in 2 lines."

#### 0:15-0:25 | Core Abstractions - Pipelines (Slides 5-7)

**Key Concept:**
Think of Pipelines like API calls — you don't need to know how the engine works, just what inputs/outputs it expects

**Slide Walkthrough:**
1. **Pipelines** (Slide 5)
   - High-level interface
   - 12+ tasks available
   - Perfect for demos and quick POCs
2. **Models & Tokenizers** (Slide 6)
   - Tokenizers: Text → Numbers (tokens)
   - Models: Process tokens → Predictions
3. **Show the flow:**
```
"I love NLP"
    ↓
Tokenizer: ["I", "love", "NLP"] → [1045, 2572, 28600]
    ↓
Model processes: neural network magic ✨
    ↓
Output: Sentiment prediction
```
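The flow above can be mimicked with a toy lookup table — the vocabulary and IDs here are made up for illustration (a real BERT tokenizer loads a ~30K-entry vocabulary from the Hub and handles subwords):

```python
# Hypothetical miniature vocabulary; real tokenizers load theirs from the Hub
VOCAB = {"i": 1045, "love": 2572, "nlp": 28600, "[UNK]": 100}

def to_token_ids(text):
    """Split text on whitespace and map each token to an ID, [UNK] if unseen."""
    tokens = text.lower().split()
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]

print(to_token_ids("I love NLP"))    # [1045, 2572, 28600]
print(to_token_ids("I love pizza"))  # unknown word falls back to [UNK] → ID 100
```

The fallback to `[UNK]` is exactly the weakness subword tokenization was designed to avoid, which Session 2 covers in depth.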
**Talking Point for a Non-Technical Audience:**
> "Think of a tokenizer like the Rosetta Stone — it translates human language into a format the model understands. Then the model, which has learned patterns from billions of text examples, makes an intelligent prediction."

---

#### **>>> DEMO 1: Sentiment Analysis <<<**
**Timing:** 5 minutes (embedded in the 0:15-0:25 section)

**Setup:**
- Open the Spaces URL
- Navigate to the **Session 1: Introduction** tab
- Scroll to **Demo 1: Sentiment Analysis**

**Demo Sequence:**
1. **Show the input box** with pre-filled text: "I absolutely love this product! It's amazing and works perfectly."
2. **Click the "Analyze Sentiment" button**
3. **Show the result:**
```
Label: POSITIVE
Confidence: 0.9987
```
4. **Narrate:**
   > "See? Instant sentiment analysis! The model is 99.87% confident this is positive. That confidence score is critical in production — if it's >90%, we can automate. If it's 60-70%, we might need human review."
5. **Try a second example** (copy-paste into the input):
```
"This is the worst experience I've ever had with a company."
```
   - Click the button, show the result: NEGATIVE (0.9995)
6. **Narrate the difference:**
   > "Notice both examples have high confidence. But imagine if the sentiment was ambiguous — 'The product is okay.' You might get 55% confidence. That's when you flag it for human review."
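The automate-vs-review rule in the narration is easy to make concrete. A sketch of such a router — the 0.9 threshold is a policy choice for illustration, not something the model dictates:

```python
def route(prediction, auto_threshold=0.9):
    """Decide whether a pipeline result can be acted on automatically."""
    label, score = prediction["label"], prediction["score"]
    if score >= auto_threshold:
        return f"auto:{label}"   # confident enough to automate
    return "human_review"        # ambiguous → flag for a person

print(route({"label": "POSITIVE", "score": 0.9987}))  # auto:POSITIVE
print(route({"label": "POSITIVE", "score": 0.55}))    # human_review
```

The `prediction` dict matches the shape returned by a `sentiment-analysis` pipeline, so the router plugs in directly after inference.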
**Talking Points:**
- This model was trained on 67K movie reviews
- It generalizes well to product reviews too
- Fine-tuned, not trained from scratch
- Key concept: Transfer learning (pre-training → fine-tuning)

**If the Demo Fails:**
- Say: "Unfortunately, the model server is having a moment. But I can tell you what would happen..."
- Show a cached output screenshot
- Continue to the next section

---

#### 0:25-0:35 | Architecture Patterns (Slides 8-11)

**Objective:**
Attendees should leave understanding:
1. **ENCODERS** (BERT family) — for understanding/classification
2. **DECODERS** (GPT family) — for generation
3. **ENCODER-DECODERS** (T5/BART) — for structured tasks

**Presentation Strategy:**
- Don't go too deep into the math
- Use analogies and comparisons
- Show the decision tree (Slide 11)

**Slide 8: Encoders (BERT)**
> "Imagine reading a sentence twice — once left-to-right, once right-to-left, simultaneously. That's bidirectional processing. BERT reads the whole sentence before making decisions. This makes it great for understanding context."

**Example:**
```
"The bank approved my loan"
Read both directions:
- Left→Right: understand "bank" in a financial context
- Right→Left: understand "approved" is positive
Decision: POSITIVE sentiment
```

**Slide 9: Decoders (GPT)**
> "GPT is like a predictive text machine. Given 'I love...', it predicts the next word. It does this repeatedly to generate entire paragraphs. It only looks backward (at previous words) — it doesn't know what comes next."

**Example:**
```
Input: "Once upon a time..."
GPT generates: "there was a kingdom far away"
(word by word, based on learned patterns)
```

**Slide 10: Encoder-Decoders (T5/BART)**
> "These combine the best of both worlds. The encoder reads the full input, then passes this understanding to the decoder, which generates output. Perfect for translation, summarization, Q&A."

**Example (Translation):**
```
Encoder reads: "Hello, how are you?"
Decoder understands intent, generates: "Bonjour, comment allez-vous?"
```

**Slide 11: Decision Tree**
> "Here's the rule of thumb: Need to understand/classify? Use BERT. Need to generate text? Use GPT. Need to transform input to output? Use T5."

**Talking Point:**
> "The beautiful thing about HuggingFace is you don't need to make these distinctions manually. You can just use the Pipeline API. But as engineers, understanding these differences helps us choose the right tool for the job and troubleshoot when things go wrong."
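The decision tree maps directly onto pipeline task names. A sketch of that mapping — the `need` keys are our own labels for the three branches, while the task strings are standard `transformers` pipeline tasks:

```python
# Architecture rule of thumb → transformers pipeline task name
TASK_FOR_NEED = {
    "classify":  "text-classification",  # encoder (BERT-style)
    "generate":  "text-generation",      # decoder (GPT-style)
    "transform": "summarization",        # encoder-decoder (T5/BART-style)
}

def pick_task(need):
    """Return the pipeline task for a given need from the decision tree."""
    return TASK_FOR_NEED[need]

print(pick_task("classify"))  # text-classification
```

"Transform" covers several tasks (translation, summarization, Q&A); `summarization` is just one representative here.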
#### 0:35-0:40 | Enterprise Landscape (Slides 12-14)

**Goal:**
Attendees understand production considerations

**Key Points:**

**Slide 12: Licensing**
- Most HF models are open-source (MIT, Apache)
- Some require special agreements (OpenRAIL)
- Commercial models are available but need API access
- Always check the model card before shipping

**Talking Point:**
> "In enterprise, we can't just pick a model randomly. We need to ensure licensing is compatible. An MIT-licensed model? Green light. Some proprietary model? Need to check the terms first."

**Slide 13: Trade-offs**
```
Large Models = Better, Slower
Small Models = Fast, Okay quality

Example:
- BERT-large:  95% accuracy, 500ms latency
- DistilBERT:  94% accuracy, 100ms latency  ← Often the sweet spot
```
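The trade-off on this slide can be framed as picking the most accurate model that fits a latency budget. A sketch using the slide's own numbers (the model names and figures come from the slide; the selection logic is ours):

```python
# (accuracy %, latency ms) taken from the slide above
MODELS = {
    "bert-large": (95, 500),
    "distilbert": (94, 100),
}

def best_within_budget(models, max_latency_ms):
    """Most accurate model whose latency fits the budget, or None if none fits."""
    candidates = [(acc, name) for name, (acc, lat) in models.items()
                  if lat <= max_latency_ms]
    return max(candidates)[1] if candidates else None

print(best_within_budget(MODELS, 200))   # distilbert
print(best_within_budget(MODELS, 1000))  # bert-large
```

With a tight 200ms budget only DistilBERT qualifies; relax the budget and the larger model wins on accuracy — exactly the sweet-spot argument on the slide.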
|
| 255 |
+
|
| 256 |
+
**Talking Point:**
|
| 257 |
+
> "In production, DistilBERT is often the best choice. 40% faster, only 3% less accurate. That's what we call 'Pareto optimal' — you can't get better without significant cost."
|
| 258 |
+
|
| 259 |
+
**Slide 14: Real-World Example**
|
| 260 |
+
Walk through customer support chatbot pipeline:
|
| 261 |
+
|
| 262 |
+
1. **Intent Classification** — "What does customer want?"
|
| 263 |
+
2. **Entity Extraction** — "Who, what, when?"
|
| 264 |
+
3. **Knowledge Search** — "Find relevant solutions"
|
| 265 |
+
4. **Response Generation** — "Compose personalized answer"
|
| 266 |
+
|
| 267 |
+
**Talking Point:**
|
| 268 |
+
> "This is production NLP. Not one model, but orchestrated pipeline of models. Each trained for specific task. Each could be swapped out if performance isn't good enough."
|
| 269 |
+
|
| 270 |
+
#### 0:40-0:45 | Q&A
|
| 271 |
+
|
| 272 |
+
**Questions to Anticipate:**
|
| 273 |
+
1. **"Do I need to train models myself?"**
|
| 274 |
+
- Answer: No! Start with pre-trained. Only fine-tune if needed.
|
| 275 |
+
|
| 276 |
+
2. **"How do I use this for my company data?"**
|
| 277 |
+
- Answer: Two paths:
|
| 278 |
+
- Quick: Use `transformers.pipeline()` on your data
|
| 279 |
+
- Better: Fine-tune the model on your data (Session 2 mentions this)
|
| 280 |
+
|
| 281 |
+
3. **"How do I deploy this to production?"**
|
| 282 |
+
- Answer: We'll touch on this in Session 2 and Resources tab.
|
| 283 |
+
|
| 284 |
+
4. **"Can I use these models in my mobile app?"**
|
| 285 |
+
- Answer: Yes, quantized versions fit on phones (DistilBERT ~25MB)
|
| 286 |
+
|
| 287 |
+
**Transition to Session 2:**
|
| 288 |
+
> "That's Session 1! You now understand WHAT HuggingFace is and WHY it's powerful. Session 2 is all about HOW to use it. We'll dive into tokenization, run live inference on multiple tasks, and explore production considerations. See you in Session 2! ☕ (Break time)"
|
| 289 |
+
|
| 290 |
+
---
|
| 291 |
+
|
| 292 |
+
## Session 2: Hands-On Developer (90 min)
|
| 293 |
+
|
| 294 |
+
### Overview
|
| 295 |
+
**Goal:** Hands-on exploration of 5+ NLP tasks, deep understanding of inference
|
| 296 |
+
**Audience:** Assumed basic programming knowledge (can read Python)
|
| 297 |
+
**Format:** Interactive slides + live widgets + discussion
|
| 298 |
+
|
| 299 |
+
### Timing Breakdown
|
| 300 |
+
```
|
| 301 |
+
0:00-0:05 Welcome + Recap (5 min)
|
| 302 |
+
0:05-0:20 Tokenization Deep Dive (15 min)
|
| 303 |
+
→ DEMO: Tokenization Explorer
|
| 304 |
+
0:20-1:05 Inference Playground (45 min)
|
| 305 |
+
→ 5 interactive task demos
|
| 306 |
+
(Sentiment, NER, QA, Summarization, Similarity)
|
| 307 |
+
1:05-1:25 Exercise Checkpoint (20 min)
|
| 308 |
+
(Attendees explore, ask questions)
|
| 309 |
+
1:25-1:35 Production Considerations (10 min)
|
| 310 |
+
1:35-1:50 Wrap-up + Resources (15 min)
|
| 311 |
+
1:50-2:00 Q&A + Office Hours (10 min)
|
| 312 |
+
```
|
| 313 |
+
|
| 314 |
+
### Talking Points by Section
|
| 315 |
+
|
| 316 |
+
#### 0:00-0:05 | Welcome + Recap
|
| 317 |
+
|
| 318 |
+
**Talking Point:**
|
| 319 |
+
> "Welcome back! In Session 1, we learned about the HuggingFace ecosystem and different model architectures. Today is hands-on. We'll write almost no code — instead, we'll click buttons and see models in action. By the end, you'll understand how inference actually works under the hood."
|
| 320 |
+
|
| 321 |
+
**Quick Session 1 Recap (1 min):**
|
| 322 |
+
- Pipelines = High-level interface
|
| 323 |
+
- Models = Neural networks trained on specific tasks
|
| 324 |
+
- Transformers = Attention-based architecture
|
| 325 |
+
|
| 326 |
+
#### 0:05-0:20 | Tokenization Deep Dive (Slides 4-10)

**Core Message:**
Tokenization bridges human language ↔ machine numbers. It's where the "magic" starts.

**Slide 4: What is Tokenization?**
> "Why does this matter? Because models are mathematics engines. They understand vectors and matrices, not words. Tokenization converts text into vectors. Get tokenization wrong, and the model fails. Get it right, and inference works beautifully."

**Slide 5: Step-by-Step Breakdown**

Walk through the example word by word:
```
Input: "Hello, how are you?"

Step 1: Break into tokens
  ["Hello", ",", "how", "are", "you", "?"]

Step 2: Look up IDs (each tokenizer has a vocabulary)
  [7592, 1010, 2129, 2024, 2017, 1029]

Step 3: Add special tokens
  [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
   ↑ [CLS] token                            ↑ [SEP] token

Step 4: Create attention mask (1 = real token, 0 = padding)
  [1, 1, 1, 1, 1, 1, 1, 1]

Step 5: Model reads numerical vectors
  [embeddings] → model processes → [predictions]
```
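The five steps above can be sketched with a toy vocabulary. This is a hand-made stand-in (the IDs match the BERT examples on the slide), not a real WordPiece tokenizer, but the flow is the same:

```python
# Toy sketch of the tokenize → IDs → special tokens → attention mask flow.
# VOCAB is a hand-made stand-in; a real tokenizer loads a ~30K-entry vocab file.
VOCAB = {"[CLS]": 101, "[SEP]": 102, "hello": 7592, ",": 1010,
         "how": 2129, "are": 2024, "you": 2017, "?": 1029, "[UNK]": 100}

def toy_encode(text: str, max_len: int = 10):
    # Step 1: crude tokenization (lowercase, split punctuation off)
    for p in ",?!.":
        text = text.replace(p, f" {p} ")
    tokens = text.lower().split()
    # Step 2: look up IDs, falling back to [UNK] for unseen tokens
    ids = [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]
    # Step 3: add special tokens
    ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
    # Step 4: attention mask is 1 for real tokens, then pad with 0s
    mask = [1] * len(ids)
    while len(ids) < max_len:
        ids.append(0)
        mask.append(0)
    return ids, mask

ids, mask = toy_encode("Hello, how are you?")
print(ids)   # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

In the real library, `tokenizer(text)` does all five steps in one call and returns the same `input_ids` / `attention_mask` pair.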
**Talking Point:**
> "Notice the [CLS] and [SEP] tokens at the beginning and end? BERT adds those automatically. They're like punctuation for the model. This is why you don't need to add them manually — the tokenizer does it."

**Slide 6: Tokenization Strategies**

Explain the three approaches with examples:

1. **Word-level:** "uncharacteristic" → ["uncharacteristic"]
   - Problem: words never seen before become [UNK] (unknown)

2. **Character-level:** "uncharacteristic" → ["u", "n", "c", "h", "a", "r", ...]
   - Problem: sequences get too long, and patterns are hard to learn

3. **Subword (✅ recommended):** "uncharacteristic" → ["un", "character", "istic"]
   - The sweet spot: balances vocabulary size against sequence length

**Talking Point:**
> "Subword tokenization is the Goldilocks solution. Not too many tokens, not too few. And it handles new words gracefully by breaking them into known pieces."
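A minimal greedy longest-match sketch, in the spirit of WordPiece, shows how a subword tokenizer splits an unseen word into known pieces. The `SUBWORDS` vocabulary here is illustrative, not taken from any real model:

```python
# Greedy longest-match subword splitting: repeatedly take the longest
# vocabulary piece that matches at the current position.
SUBWORDS = {"un", "character", "istic", "char", "act", "er", "ist", "ic"}

def subword_split(word: str) -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        # Try the longest possible piece first, shrinking until a match.
        for end in range(len(word), start, -1):
            if word[start:end] in SUBWORDS:
                pieces.append(word[start:end])
                start = end
                break
        else:
            return ["[UNK]"]  # no known piece fits anywhere → unknown
    return pieces

print(subword_split("uncharacteristic"))  # ['un', 'character', 'istic']
```

Real BPE/WordPiece vocabularies are learned from corpus statistics rather than hand-picked, but the lookup at inference time works much like this.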
**Slide 7: Common Tokenizers**

Different models use different tokenizers:
- BERT → WordPiece
- GPT → BPE (Byte-Pair Encoding)
- T5 → SentencePiece

**Talking Point:**
> "Important: if you train a model with one tokenizer, you MUST use the same tokenizer at inference. Mixing tokenizers = garbage outputs. Always load from the same model directory."
---

#### **>>> DEMO: Tokenization Explorer <<<**
**Timing:** 5 minutes (within the 0:05-0:20 section)

**Setup:**
- Open the Spaces URL → **Session 2** tab
- Scroll to **Part 1: Tokenization Explorer**
- Input box is pre-filled with "Hello, how are you?"
**Demo Sequence:**

1. **Narrate:**
   > "Let's see what tokenization actually produces. I'll click the tokenize button and we'll see how the text is broken down."

2. **Click the "Tokenize" button**

3. **Show the output table:**
   ```
   Token    Token ID    Attention Mask
   ────────────────────────────────────────
   [CLS]    101         1
   hello    7592        1
   ,        1010        1
   how      2129        1
   are      2024        1
   you      2017        1
   ?        1029        1
   [SEP]    102         1
   Total: 8 tokens
   ```
4. **Narrate:**
   > "Notice each token got a unique ID. [CLS] is 101, 'hello' is 7592. The model learned these associations during pre-training. The attention mask is all 1s because all tokens are real — no padding."

5. **Try a longer example** (paste into input):
   ```
   "The Hugging Face transformers library revolutionized NLP by providing easy access to pre-trained models for various tasks."
   ```
   - Click "Tokenize"
   - Show output: ~25 tokens

6. **Narrate:**
   > "All tokens have attention_mask=1, meaning the model will pay attention to all of them. But if this sentence were part of a batch with shorter sequences, we'd pad the shorter ones up to this length and set attention_mask=0 for the padding positions. This signals to the model: 'Ignore these fake tokens.'"

**Key Insight to Drive Home:**
> "This is the foundation. Everything that happens next — classification, generation, Q&A — relies on this tokenization step being correct. That's why understanding it is so important."

---
#### 0:20-1:05 | Inference Playground (Slides 11-16)

**Setup:**
- Switch to the **Session 2: Hands-On Developer** tab
- Navigate to **Part 2: Inference Playground**
- Show the 5 tabs: Sentiment, NER, QA, Summarization, Similarity

**Preamble:**
> "Now we'll explore 5 different NLP tasks. Each uses a different model, but the flow is similar: Tokenize → Model → Output. Let's see how each one works."

---
#### **Task 1: Sentiment Analysis (Slide 12)**

**Talking Point:**
> "Sentiment analysis classifies the emotion of a text. This model was fine-tuned on 67K movie reviews. Let's test it."

**Demo:**
1. Click the **Sentiment Analysis** tab
2. Input box shows: "I absolutely love this product! It's amazing and works perfectly."
3. Click the **Analyze** button
4. Result: `POSITIVE (0.9987 confidence)`

**Narrate:**
> "99.87% confident this is positive. That's very high — we can automate decisions at this confidence. Now let's try something ambiguous."

5. **Clear the input, paste:**
   ```
   "The product is okay but could be better."
   ```
   Click **Analyze**

6. **Narrate the result:**
   > "Maybe 65% confident? When confidence drops into this range, we should flag the input for human review. That's how you build production systems — not fully automated, but smart about what to automate."

**Use Case Mention:**
> "Use cases: social media monitoring, customer feedback analysis, support ticket triage, review classification."
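The automate-vs-review pattern from the narration can be sketched as a simple threshold router. The 0.9 cutoff and the `classify` stub are illustrative choices, not values from the demo:

```python
# Route predictions: auto-handle confident ones, queue the rest for review.
def classify(text: str) -> tuple[str, float]:
    # Hypothetical stub: a real system would call the sentiment model here.
    return ("POSITIVE", 0.9987) if "love" in text else ("POSITIVE", 0.65)

def route(text: str, threshold: float = 0.9) -> str:
    label, score = classify(text)
    if score >= threshold:
        return f"auto:{label}"    # confident → automate the decision
    return "human_review"         # ambiguous → flag for a person

print(route("I absolutely love this product!"))           # auto:POSITIVE
print(route("The product is okay but could be better."))  # human_review
```

Tuning the threshold against a labeled validation set is what turns this from a guess into a measurable precision/coverage tradeoff.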
---

#### **Task 2: Named Entity Recognition (Slide 13)**

**Talking Point:**
> "NER identifies the who, what, where, and when in text. This model recognizes persons, organizations, locations, and miscellaneous entities."

**Demo:**
1. Click the **Named Entity Recognition** tab
2. Input: "Apple Inc. was founded by Steve Jobs in Cupertino, California."
3. Click the **Extract Entities** button
4. Output table shows:
   ```
   Entity        Type            Score
   ────────────────────────────────────────
   Apple Inc.    Organization    0.99
   Steve Jobs    Person          0.98
   Cupertino     Location        0.92
   California    Location        0.94
   ```

**Narrate:**
> "This is incredibly useful. Imagine you have thousands of documents. You can automatically extract company names, people, and locations, then populate a knowledge graph or database. No manual tagging needed."

5. **Try a second example (paste):**
   ```
   "Barack Obama served as President of the United States from 2009 to 2017."
   ```
   Click **Extract Entities**

6. **Show results, narrate:**
   > "Barack Obama → Person, United States → Location. The model correctly identifies entities even though it hasn't seen this exact sentence before. That's the power of transfer learning."

**Use Case Mention:**
> "Resume parsing, contract analysis, information extraction, knowledge base construction."
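Under the hood, NER models tag each token with a BIO label (B- begins an entity, I- continues it, O is outside), and the pipeline's aggregation step groups those tags into spans. A sketch of that grouping, with hand-written example tags:

```python
# Group per-token BIO tags into entity spans, as an NER pipeline's
# aggregation step does. Tags below are hand-written examples.
def group_entities(tokens, tags):
    entities, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):             # beginning of a new entity
            if current: entities.append(current)
            current = [tag[2:], [tok]]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)           # continuation of the same entity
        else:                                # "O" or a mismatched I- tag
            if current: entities.append(current)
            current = None
    if current: entities.append(current)
    return [(" ".join(words), label) for label, words in entities]

tokens = ["Barack", "Obama", "served", "in", "the", "United", "States"]
tags   = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "I-LOC"]
print(group_entities(tokens, tags))
# [('Barack Obama', 'PER'), ('United States', 'LOC')]
```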
---

#### **Task 3: Question Answering (Slide 14)**

**Talking Point:**
> "QA answers questions based on a provided context. This model finds the span of text that best answers the question."

**Demo:**
1. Click the **Question Answering** tab
2. Context: "The Hugging Face Hub is a platform for sharing machine learning models, datasets, and demos. It allows researchers and practitioners to collaborate..."
3. Question: "What is the Hugging Face Hub?"
4. Click the **Get Answer** button
5. Output: answer with a confidence score

**Narrate:**
> "The model scans the context, identifies the most relevant span, and returns it as the answer. Notice it didn't generate new text — it extracted from the provided context. That's what this model does."

6. **Try a different question (paste into the question box):**
   ```
   Who uses the Hugging Face Hub?
   ```
   Click **Get Answer**

7. **Show the result, narrate:**
   > "If the answer isn't explicitly in the context, the model struggles. That's a limitation. For better results, the context needs to contain the answer directly. This is why context construction is critical in production QA systems."

**Use Case Mention:**
> "FAQ automation, documentation search, support chatbots, knowledge base queries."
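Extractive QA reduces to picking the best start/end token pair: the model scores every position for "answer starts here" and "answer ends here", and the answer is the highest-scoring valid span. A sketch with hand-made scores standing in for model logits:

```python
# Pick the best (start, end) span from per-token scores, the core of
# extractive QA. The score lists are hand-made stand-ins for real logits.
tokens = ["a", "platform", "for", "sharing", "machine", "learning", "models"]
start_scores = [0.1, 4.0, 0.2, 0.1, 0.3, 0.2, 0.1]
end_scores   = [0.1, 0.2, 0.1, 0.2, 0.1, 0.3, 5.0]

best, best_span = float("-inf"), None
for i, s in enumerate(start_scores):
    for j in range(i, len(tokens)):       # the end must not precede the start
        if s + end_scores[j] > best:
            best, best_span = s + end_scores[j], (i, j)

i, j = best_span
print(" ".join(tokens[i:j + 1]))  # platform for sharing machine learning models
```

This also explains the failure mode from step 7: the model always returns *some* span, so when the context doesn't contain the answer, it returns a low-confidence wrong span rather than "I don't know".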
---

#### **Task 4: Summarization (Slide 15)**

**Talking Point:**
> "Summarization generates concise summaries. This model takes an abstractive approach — it generates NEW sentences rather than just extracting existing ones."

**Demo:**
1. Click the **Text Summarization** tab
2. Input: "Artificial intelligence has become increasingly important in modern society. Machine learning enables computers to learn from data. Deep learning uses neural networks..."
   (full text in the input box)
3. Click the **Summarize** button
4. Output: concise summary (1-2 sentences)

**Narrate:**
> "The model identified the key concepts and generated a new, shorter version. This is harder than extraction — it requires generation. Notice the summary uses different words but preserves the meaning. That's abstractive summarization."

5. **Mention parameters:**
   > "This model takes parameters: max_length controls the summary length, min_length ensures it's not too short. These are tunable based on your use case."

**Use Case Mention:**
> "Article summarization, report generation, meeting notes, email digests, news aggregation."
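For contrast, a classic *extractive* baseline just scores existing sentences by content-word frequency and keeps the top one. Seeing this baseline makes it clearer what abstractive models add (the example document is invented for illustration):

```python
# Tiny extractive baseline: score each sentence by the average corpus
# frequency of its words, and return the highest-scoring sentence verbatim.
import re
from collections import Counter

def extractive_summary(text: str) -> str:
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    def score(sent: str) -> float:
        toks = re.findall(r"[a-z]+", sent.lower())
        return sum(freq[t] for t in toks) / len(toks)
    return max(sentences, key=score) + "."

doc = ("Machine learning enables computers to learn from data. "
       "Deep learning uses neural networks. "
       "Many modern systems rely on machine learning and data.")
print(extractive_summary(doc))
# Machine learning enables computers to learn from data.
```

An abstractive model would instead *write* a new sentence covering all three inputs, which is why it needs generation parameters like `max_length` and `min_length` while this baseline does not.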
---

#### **Task 5: Semantic Similarity (Slide 16)**

**Talking Point:**
> "Semantic similarity measures how similar two texts are IN MEANING, not just in character overlap."

**Demo:**
1. Click the **Semantic Similarity** tab
2. Text 1: "The cat is sleeping on the mat"
3. Text 2: "A feline is resting on the rug"
4. Click the **Compare** button
5. Output: `Similarity Score: 0.92 (Very Similar!)`

**Narrate:**
> "These sentences use completely different words. But semantically, they mean almost the same thing. A character-level comparison would say they're different, but this embedding-based model recognizes they're similar. That's semantic understanding."

6. **Try different pairs (paste examples):**
   - "Machine learning is used for AI tasks" vs "Deep learning uses neural networks"
   - "I love this product" vs "This product is terrible"

7. **Narrate the pattern:**
   > "High scores (>0.8) mean very similar meaning. Low scores (<0.3) mean different meanings. Scores around 0.5 mean some relation. You can tune the threshold based on your application."

**Use Case Mention:**
> "Duplicate detection, recommendation systems, plagiarism detection, semantic search, document clustering."
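Under the hood, the model embeds each text as a vector and compares the vectors with cosine similarity. The formula is simple enough to sketch directly; the 4-dimensional vectors below are hand-made toys (real sentence embeddings typically have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(θ) = (a · b) / (‖a‖ ‖b‖), ranging from -1 (opposite) to 1 (identical)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings"; a real model produces these from the text.
cat_mat    = [0.9, 0.1, 0.3, 0.0]
feline_rug = [0.8, 0.2, 0.4, 0.1]
love       = [0.1, 0.9, 0.0, 0.2]

print(round(cosine_similarity(cat_mat, feline_rug), 2))  # 0.98 (similar meaning)
print(round(cosine_similarity(cat_mat, love), 2))        # 0.2 (unrelated)
```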
---

#### 1:05-1:25 | Exercise Checkpoint (20 min)

**Talking Point:**
> "Now it's your turn. I'll guide you through some exercises where you explore these tasks with your own inputs."

**Exercise 1: Sentiment on Different Domains (5 min)**
> "Go to the Sentiment Analysis tab. Try these inputs:
> - Product review: 'This laptop is fantastic and runs smoothly.'
> - Restaurant review: 'The food was cold and service was slow.'
> - Technical feedback: 'This API is poorly documented.'
>
> Notice how the model interprets different domains. What confidence levels do you see?"

**Exercise 2: NER on Your Data (5 min)**
> "Go to the NER tab. Paste text about a company or person YOU know. See how well the model extracts entities. What works well? What fails?"

**Exercise 3: QA Troubleshooting (5 min)**
> "Go to the QA tab. Try asking a question that ISN'T answered in the context. What happens? This teaches us about the model's limitations."

**Exercise 4: Creative Experimentation (5 min)**
> "Mix and match: run sentiment on a summarized document. Extract entities from QA answers. Try semantic similarity on weird pairs. Explore!"

**During Exercise Time:**
- Monitor chat/questions
- Help attendees with inputs
- Highlight interesting results
- Encourage experimentation
---

#### 1:25-1:35 | Production Considerations (10 min)

**Slide 20: Optimization**

**Talking Point:**
> "These models are great for prototypes. But in production, we need to optimize."

**Key Tradeoffs:**
```
Large models (BERT-large, 340M parameters):
✅ Highest accuracy
❌ Slow (~500ms per inference)
❌ High memory (~1.3GB)

Medium models (BERT-base, 110M parameters):
✅ Good accuracy, decent speed
⚠️ Reasonable resource needs

Small models (DistilBERT, 66M parameters):
✅ Fast (~100ms per inference)
✅ Small (~250MB), viable on edge devices
⚠️ Slightly lower accuracy (3-5% drop)
```

**Talking Point:**
> "In production, DistilBERT is usually the winner. It's roughly 40% smaller and 60% faster than BERT-base, at only ~3% lower accuracy. That 3% might not even matter for your use case. Always measure before deciding!"

**Optimization Techniques:**
1. **Quantization** — reduce precision (float32 → int8) for up to a 4x speedup
2. **Distillation** — train a smaller model to mimic a larger one
3. **Pruning** — remove unimportant connections
4. **Caching** — store frequent predictions
---

#### 1:35-1:50 | Wrap-up + Next Steps (Slides 23-25)

**Recap (2 min):**
> "Let's recap. We learned tokenization, explored 5 NLP tasks, and discussed production considerations. You now understand the full inference pipeline."

**Key Takeaways (3 min):**
1. Tokenization is foundational (garbage in = garbage out)
2. Different models suit different tasks (choose wisely)
3. Confidence scores guide automation decisions
4. Production requires optimization thinking

**Next Steps (5 min):**

Using slides 23-25 as reference, walk through:

1. **Experiment Post-Session**
   - Clone the repo locally
   - Modify `data/sample_texts.csv`
   - Add new models to `config.py`

2. **Fine-Tune on Your Data**
   - Take a pre-trained model
   - Add your labeled data
   - Fine-tune (takes hours on a GPU, not weeks)

3. **Deploy to Spaces**
   - Fork this repo
   - Push your changes
   - Get a shareable URL instantly

4. **Publish to the Hub**
   - Share your fine-tuned model
   - Document it with a model card
   - The community can use it

**Resources (5 min):**
Point to the Resources tab in the Spaces app:
- Official documentation links
- Model hub for exploration
- Learning courses
- Community forums

**Transition to Q&A:**
> "That concludes Session 2. Now, questions? This is your time to clarify anything that was unclear."
---

#### 1:50-2:00 | Q&A + Office Hours

**Common Questions & Answers:**

| Q | A |
|---|---|
| "How do I fine-tune?" | "Upload your labeled data and use the `Trainer` API; it takes 1-2 hours. There are tutorials on the Hub." |
| "Can I use multiple models together?" | "Yes! Build pipelines. E.g., classify intent first, then run a task-specific model." |
| "What about non-English?" | "HF has multilingual models (mBERT, XLM-R). Same API, different models." |
| "How do I handle errors?" | "Log everything, set confidence thresholds, and flag low-confidence predictions for review." |
| "Can I deploy this on my server?" | "Yes! Models are just PyTorch/TF files. Download, load, and serve with Flask/FastAPI." |
| "What about privacy?" | "Models can run locally (no data sent to HF). For Spaces like this one, note that inputs may be logged." |

**Offer Follow-up:**
> "If your question needs deeper discussion, let's connect afterwards. I'm available for office hours / one-on-ones. We can dive into your specific use case."
---

## Troubleshooting Guide

### Issue: Model Won't Load
**Symptom:** "Model is loading..." for >60 seconds, then a timeout

**Causes:**
1. Network issue
2. Model too large for the Spaces RAM
3. First-time model download

**Solutions:**
1. Refresh the page
2. Switch to a smaller model (DistilBERT instead of BERT-large)
3. Pre-download models before the session (in `utils.py`)

**Fallback:**
- Say: "Network hiccup. Let me show you what would happen..."
- Move to the next task
---

### Issue: Inference Takes >10 Seconds
**Symptom:** Button clicked, still waiting...

**Likely Cause:**
- First inference (models are being loaded into memory)

**Solution:**
- Pre-warm models at app startup (in `app.py`)
---

### Issue: Widget Doesn't Respond
**Symptom:** Click a button, nothing happens

**Solutions:**
1. Refresh the page
2. Clear the browser cache
3. Try a different browser
4. Check the internet connection

---

### Issue: Attendee Internet Drops
**During the Session:**
- Local attendees can still see Spaces (cached)
- Remote attendees may need to rejoin

**Solution:**
- Have a recording and backup slides ready
- Provide links to all materials before the session
---

## Tips for Effective Presentation

### Before the Session
1. **Dry Run**
   - Go through the slides locally
   - Test all demos in Spaces
   - Time each section
   - Prepare backup slides (PDF)

2. **Prepare Examples**
   - Have 3-4 good examples per task
   - Have 1-2 "edge case" examples
   - Test them on the provided models beforehand

3. **Communication**
   - Send the Spaces URL 24 hours before
   - Send the agenda
   - Set expectations (interactive, will show code, no advanced math)

### During the Session

1. **Pacing**
   - Aim for 80% of the allocated time (leave a buffer for questions)
   - Session 1: stay under 40 min to leave 5 min for Q&A
   - Session 2: finish the wrap-up by 1:35 to leave time for Q&A

2. **Engagement**
   - Ask rhetorical questions ("Anyone know what tokenizers do?")
   - Pause for questions every 5-10 minutes
   - Read chat actively and respond to questions asynchronously

3. **Demo Tips**
   - Always narrate what you're doing ("Clicking the button now...")
   - Point to specific outputs ("Notice the confidence is 0.99...")
   - Explain failures ("That's interesting — let me try...")
   - Be patient with load times

4. **Audience Management**
   - Non-technical folks may feel lost at tokenization. Say: "This is technical. Don't worry if it's unclear. The key insight is: text becomes numbers."
   - Advanced folks might ask hard questions later. Note them down and offer office hours.

5. **Interactive Moments**
   - Polling: "Who's used HF before?" (chat reaction)
   - Breakout: "Spend 5 min exploring sentiment. Report back a surprising result."
   - Q&A: "Anyone have a use case they want to explore?"

### Content Delivery
- **Use analogies** — "The tokenizer is like a Rosetta Stone"
- **Show the code, but don't deep-dive** — "Here's what this looks like in Python... notice it's just 3 lines"
- **Celebrate small wins** — "We went from training models for months to 3 lines of code. That's progress!"
- **Be honest about limitations** — "QA works best when the answer is explicitly in the context"

### If Technical Issues Occur
- **Stay calm** — "Let me reconnect..."
- **Have backups** — screenshots, recordings, cached outputs
- **Continue anyway** — "While that loads, let me tell you..."
- **Learn and iterate** — document what failed for next time
---

## Final Reminders

- ✅ **Slides are in markdown** — convert to PDF/PPTX for local presentation if desired
- ✅ **Models are PUBLIC** — anyone from the company can access the app
- ✅ **Spaces auto-updates** — push changes to the Space repo and it auto-deploys
- ✅ **Attendees can clone the repo** — they can run it locally post-session
- ✅ **Record if possible** — share the recording with no-shows

**Good luck! 🚀 You've got this!**
app.py
ADDED
@@ -0,0 +1,587 @@
"""
HuggingFace Enabling Sessions - Gradio Interactive Demo App
Hosted on HuggingFace Spaces
"""

import gradio as gr
import pandas as pd

import config
import utils

# ===================== UTILITIES =====================

def load_sample_texts():
    """Load sample texts from CSV."""
    try:
        return pd.read_csv(config.SAMPLE_DATA_CSV)
    except Exception:
        return None


def get_sentiment_examples():
    """Get example texts for sentiment analysis."""
    try:
        with open(f"{config.DEMO_SAMPLES_DIR}/sentiment.txt") as f:
            return f.read().strip().split("\n")
    except Exception:
        return [config.TASKS["sentiment"]["example"]]


def get_ner_examples():
    """Get example texts for NER."""
    try:
        with open(f"{config.DEMO_SAMPLES_DIR}/ner.txt") as f:
            return f.read().strip().split("\n")
    except Exception:
        return [config.TASKS["ner"]["example"]]


def get_qa_examples():
    """Get example contexts and questions for QA."""
    try:
        with open(f"{config.DEMO_SAMPLES_DIR}/qa.txt") as f:
            return f.read().strip().split("\n\n")
    except Exception:
        return [config.TASKS["qa"]["example_context"]]


def get_summarization_examples():
    """Get example texts for summarization."""
    try:
        with open(f"{config.DEMO_SAMPLES_DIR}/summarization.txt") as f:
            return f.read().strip().split("\n")
    except Exception:
        return [config.TASKS["summarization"]["example"]]


def get_embeddings_examples():
    """Get example texts for semantic similarity."""
    try:
        with open(f"{config.DEMO_SAMPLES_DIR}/embeddings.txt") as f:
            return f.read().strip().split("\n")
    except Exception:
        return [config.TASKS["similarity"]["example1"], config.TASKS["similarity"]["example2"]]


# ===================== SENTIMENT ANALYSIS =====================

def demo_sentiment(text):
    """Demo sentiment analysis."""
    if not text.strip():
        return "Please enter some text", "No input"
    result = utils.run_sentiment_analysis(text)
    output = f"**Label:** {result['label']}\n\n**Confidence:** {result['score']:.4f}"
    return output, result


# ===================== NER =====================

def demo_ner(text):
    """Demo named entity recognition."""
    if not text.strip():
        return "Please enter some text", "No entities"
    results = utils.run_ner(text)
    formatted = utils.format_ner_output(results)
    return formatted, results


# ===================== QUESTION ANSWERING =====================

def demo_qa(context, question):
    """Demo question answering."""
    if not context.strip() or not question.strip():
        return "Please enter both context and question", {}
    result = utils.run_qa(context, question)
    if "error" in result:
        return f"Error: {result['error']}", {}
    output = f"**Answer:** {result['answer']}\n\n**Confidence:** {result['score']:.4f}"
    return output, result


# ===================== SUMMARIZATION =====================

def demo_summarization(text):
    """Demo text summarization."""
    if not text.strip():
        return "Please enter some text", ""
    if len(text.split()) < 20:
        return "Text too short for summarization. Please provide at least 20 words.", ""
    summary = utils.run_summarization(text)
    return summary, summary


# ===================== SEMANTIC SIMILARITY =====================

def demo_similarity(text1, text2):
    """Demo semantic similarity."""
    if not text1.strip() or not text2.strip():
        return "Please enter both texts", 0
    similarity = utils.compute_similarity(text1, text2)
    if isinstance(similarity, str):  # utils returns error messages as strings
        return similarity, 0
    output = f"**Similarity Score:** {similarity:.4f}\n\n(Score ranges from -1 to 1, where 1 means semantically identical)"
    return output, similarity


# ===================== TOKENIZATION =====================

def demo_tokenization(text):
    """Demo tokenization."""
    if not text.strip():
        return "Please enter some text", ""
    result = utils.tokenize_text(text)
    if "error" in result:
        return f"Error: {result['error']}", ""
    formatted = utils.format_tokenizer_output(result)
    return formatted, result


# ===================== GRADIO INTERFACE =====================

def create_interface():
    """Create the Gradio interface with 3 tabs."""

    with gr.Blocks(
        title="HuggingFace Enabling Sessions",
        theme=gr.themes.Soft(),
    ) as app:
        gr.Markdown(
            """
            # 🤗 HuggingFace Enabling Sessions
            **Interactive Demo for Transformers, Hub APIs, and Pipeline Abstractions**

            **Facilitator:** Shouryaan Grish | **Duration:** Session 1: 45 min | Session 2: 90 min
            """
        )

        with gr.Tabs():
            # ===================== TAB 1: SESSION 1 - INTRODUCTION =====================
            with gr.Tab("Session 1: Introduction (45 min)", id="session1"):
                gr.Markdown(
+
"""
|
| 168 |
+
## 🎯 Introduction to Hugging Face Ecosystem
|
| 169 |
+
|
| 170 |
+
### What We'll Cover:
|
| 171 |
+
1. **HuggingFace Platform Overview**
|
| 172 |
+
- The Hub: Central repository for models, datasets, and spaces
|
| 173 |
+
- Transformers Library: Core Python library for NLP
|
| 174 |
+
- Model Cards: Documentation and metadata for transparency
|
| 175 |
+
|
| 176 |
+
2. **Core Abstractions**
|
| 177 |
+
- **Pipelines:** High-level API for common tasks (sentiment, NER, QA, etc.)
|
| 178 |
+
- **Models & Tokenizers:** Lower-level building blocks
|
| 179 |
+
- **Datasets:** Standardized data loading and processing
|
| 180 |
+
|
| 181 |
+
3. **Architecture Patterns**
|
| 182 |
+
- **Encoders:** BERT, RoBERTa, DistilBERT → Classification, feature extraction
|
| 183 |
+
- **Decoders:** GPT-2, GPT-3 → Text generation
|
| 184 |
+
- **Encoder-Decoders:** T5, BART → Seq2seq (translation, summarization, QA)
|
| 185 |
+
|
| 186 |
+
4. **Enterprise NLP Landscape**
|
| 187 |
+
- Open-source vs. Commercial models
|
| 188 |
+
- Licensing considerations (MIT, Apache, OpenRAIL, etc.)
|
| 189 |
+
- Fine-tuning for domain-specific tasks
|
| 190 |
+
|
| 191 |
+
---
|
| 192 |
+
|
| 193 |
+
### Live Demo: Explore the Power of Pipelines
|
| 194 |
+
|
| 195 |
+
Try the demos below to see how easy it is to use pre-trained models! 👇
|
| 196 |
+
"""
|
| 197 |
+
)
|
| 198 |
+
|
| 199 |
+
with gr.Group():
|
| 200 |
+
gr.Markdown("### 📊 Demo 1: Sentiment Analysis")
|
| 201 |
+
demo1_input = gr.Textbox(
|
| 202 |
+
label="Enter text to analyze sentiment",
|
| 203 |
+
value="I absolutely love this product!",
|
| 204 |
+
lines=2,
|
| 205 |
+
)
|
| 206 |
+
demo1_btn = gr.Button("Analyze Sentiment", variant="primary")
|
| 207 |
+
demo1_output = gr.Markdown(label="Result")
|
| 208 |
+
demo1_json = gr.JSON(label="Raw Output", visible=False)
|
| 209 |
+
|
| 210 |
+
demo1_btn.click(
|
| 211 |
+
demo_sentiment,
|
| 212 |
+
inputs=[demo1_input],
|
| 213 |
+
outputs=[demo1_output, demo1_json],
|
| 214 |
+
)
|
| 215 |
+
|
| 216 |
+
with gr.Group():
|
| 217 |
+
gr.Markdown("### 🏷️ Demo 2: Named Entity Recognition (NER)")
|
| 218 |
+
demo2_input = gr.Textbox(
|
| 219 |
+
label="Enter text for entity recognition",
|
| 220 |
+
value="Apple Inc. was founded by Steve Jobs in Cupertino, California.",
|
| 221 |
+
lines=2,
|
| 222 |
+
)
|
| 223 |
+
demo2_btn = gr.Button("Extract Entities", variant="primary")
|
| 224 |
+
demo2_output = gr.Markdown(label="Entities Found")
|
| 225 |
+
demo2_json = gr.JSON(label="Raw Output", visible=False)
|
| 226 |
+
|
| 227 |
+
demo2_btn.click(
|
| 228 |
+
demo_ner,
|
| 229 |
+
inputs=[demo2_input],
|
| 230 |
+
outputs=[demo2_output, demo2_json],
|
| 231 |
+
)
|
| 232 |
+
|
| 233 |
+
gr.Markdown(
|
| 234 |
+
"""
|
| 235 |
+
---
|
| 236 |
+
### 💡 Key Takeaways
|
| 237 |
+
- Pre-trained models save time and resources
|
| 238 |
+
- HuggingFace Pipelines abstract away complexity
|
| 239 |
+
- Models are available for dozens of NLP tasks
|
| 240 |
+
- Easy to fine-tune for specialized use cases
|
| 241 |
+
|
| 242 |
+
**Next:** Head to Session 2 for hands-on development with Tokenizers and Advanced Inference! 🚀
|
| 243 |
+
"""
|
| 244 |
+
)
|
| 245 |
+
|
| 246 |
+
# ===================== TAB 2: SESSION 2 - HANDS-ON DEVELOPER =====================
|
| 247 |
+
with gr.Tab("Session 2: Hands-On Developer (90 min)", id="session2"):
|
| 248 |
+
gr.Markdown(
|
| 249 |
+
"""
|
| 250 |
+
## 👨💻 Building End-to-End NLP Workflows with Hugging Face
|
| 251 |
+
|
| 252 |
+
### Agenda:
|
| 253 |
+
1. **Tokenization Deep Dive** (15 min)
|
| 254 |
+
- Understanding tokenization, token IDs, and attention masks
|
| 255 |
+
- How models process text internally
|
| 256 |
+
|
| 257 |
+
2. **Inference Playground** (45 min)
|
| 258 |
+
- Interactive demos across multiple NLP tasks
|
| 259 |
+
- Learn how to use different model architectures
|
| 260 |
+
- See real outputs and understand model confidence
|
| 261 |
+
|
| 262 |
+
3. **Exercise Checkpoints** (20 min)
|
| 263 |
+
- Try your own text inputs
|
| 264 |
+
- Experiment with different examples
|
| 265 |
+
- Q&A and troubleshooting
|
| 266 |
+
|
| 267 |
+
4. **Next Steps & Resources** (10 min)
|
| 268 |
+
- Publishing models to the Hub
|
| 269 |
+
- Fine-tuning workflow overview
|
| 270 |
+
- Post-session project ideas
|
| 271 |
+
|
| 272 |
+
---
|
| 273 |
+
|
| 274 |
+
### 🔤 Part 1: Tokenization Explorer
|
| 275 |
+
"""
|
| 276 |
+
)
|
| 277 |
+
|
| 278 |
+
with gr.Group():
|
| 279 |
+
gr.Markdown(
|
| 280 |
+
"""
|
| 281 |
+
#### How Tokenization Works
|
| 282 |
+
- Text is split into tokens (words/subwords)
|
| 283 |
+
- Each token gets a unique ID
|
| 284 |
+
- Attention masks indicate which tokens are real vs. padding
|
| 285 |
+
- This is how transformers \"understand\" text!
|
| 286 |
+
"""
|
| 287 |
+
)
|
| 288 |
+
tok_input = gr.Textbox(
|
| 289 |
+
label="Enter text to tokenize",
|
| 290 |
+
value="Hello, how are you?",
|
| 291 |
+
lines=2,
|
| 292 |
+
)
|
| 293 |
+
tok_btn = gr.Button("Tokenize", variant="primary")
|
| 294 |
+
tok_output = gr.Markdown(label="Tokens")
|
| 295 |
+
tok_json = gr.JSON(label="Tokenization Details", visible=False)
|
| 296 |
+
|
| 297 |
+
tok_btn.click(
|
| 298 |
+
demo_tokenization,
|
| 299 |
+
inputs=[tok_input],
|
| 300 |
+
outputs=[tok_output, tok_json],
|
| 301 |
+
)
|
| 302 |
+
|
| 303 |
+
gr.Markdown(
|
| 304 |
+
"""
|
| 305 |
+
---
|
| 306 |
+
### 🎯 Part 2: Inference Playground (Choose a Task)
|
| 307 |
+
"""
|
| 308 |
+
)
|
| 309 |
+
|
| 310 |
+
with gr.Tabs():
|
| 311 |
+
# Task 1: Sentiment
|
| 312 |
+
with gr.Tab("Sentiment Analysis"):
|
| 313 |
+
gr.Markdown(
|
| 314 |
+
"""
|
| 315 |
+
**Classify text as positive, negative, or neutral**
|
| 316 |
+
|
| 317 |
+
Model: DistilBERT fine-tuned on SST-2 dataset
|
| 318 |
+
"""
|
| 319 |
+
)
|
| 320 |
+
sent_examples = gr.Examples(
|
| 321 |
+
examples=get_sentiment_examples(),
|
| 322 |
+
inputs=[],
|
| 323 |
+
)
|
| 324 |
+
sent_input = gr.Textbox(
|
| 325 |
+
label="Enter text",
|
| 326 |
+
value=get_sentiment_examples()[0],
|
| 327 |
+
lines=3,
|
| 328 |
+
)
|
| 329 |
+
sent_btn = gr.Button("Analyze", variant="primary")
|
| 330 |
+
sent_output = gr.Markdown(label="Result")
|
| 331 |
+
sent_json = gr.JSON(label="Details", visible=False)
|
| 332 |
+
|
| 333 |
+
sent_btn.click(
|
| 334 |
+
demo_sentiment,
|
| 335 |
+
inputs=[sent_input],
|
| 336 |
+
outputs=[sent_output, sent_json],
|
| 337 |
+
)
|
| 338 |
+
|
| 339 |
+
# Task 2: NER
|
| 340 |
+
with gr.Tab("Named Entity Recognition"):
|
| 341 |
+
gr.Markdown(
|
| 342 |
+
"""
|
| 343 |
+
**Identify people, organizations, locations, and more**
|
| 344 |
+
|
| 345 |
+
Model: BERT fine-tuned on CoNLL-2003 NER dataset
|
| 346 |
+
"""
|
| 347 |
+
)
|
| 348 |
+
ner_examples = gr.Examples(
|
| 349 |
+
examples=get_ner_examples(),
|
| 350 |
+
inputs=[],
|
| 351 |
+
)
|
| 352 |
+
ner_input = gr.Textbox(
|
| 353 |
+
label="Enter text",
|
| 354 |
+
value=get_ner_examples()[0],
|
| 355 |
+
lines=3,
|
| 356 |
+
)
|
| 357 |
+
ner_btn = gr.Button("Extract Entities", variant="primary")
|
| 358 |
+
ner_output = gr.Markdown(label="Entities")
|
| 359 |
+
ner_json = gr.JSON(label="Details", visible=False)
|
| 360 |
+
|
| 361 |
+
ner_btn.click(
|
| 362 |
+
demo_ner,
|
| 363 |
+
inputs=[ner_input],
|
| 364 |
+
outputs=[ner_output, ner_json],
|
| 365 |
+
)
|
| 366 |
+
|
| 367 |
+
# Task 3: QA
|
| 368 |
+
with gr.Tab("Question Answering"):
|
| 369 |
+
gr.Markdown(
|
| 370 |
+
"""
|
| 371 |
+
**Answer questions based on provided context**
|
| 372 |
+
|
| 373 |
+
Model: RoBERTa fine-tuned on SQuAD 2.0
|
| 374 |
+
"""
|
| 375 |
+
)
|
| 376 |
+
qa_examples = get_qa_examples()
|
| 377 |
+
qa_context = gr.Textbox(
|
| 378 |
+
label="Context/Passage",
|
| 379 |
+
value=qa_examples[0] if qa_examples else config.TASKS["qa"]["example_context"],
|
| 380 |
+
lines=4,
|
| 381 |
+
)
|
| 382 |
+
qa_question = gr.Textbox(
|
| 383 |
+
label="Question",
|
| 384 |
+
value="What is the Hugging Face Hub?",
|
| 385 |
+
lines=2,
|
| 386 |
+
)
|
| 387 |
+
qa_btn = gr.Button("Get Answer", variant="primary")
|
| 388 |
+
qa_output = gr.Markdown(label="Answer")
|
| 389 |
+
qa_json = gr.JSON(label="Details", visible=False)
|
| 390 |
+
|
| 391 |
+
qa_btn.click(
|
| 392 |
+
demo_qa,
|
| 393 |
+
inputs=[qa_context, qa_question],
|
| 394 |
+
outputs=[qa_output, qa_json],
|
| 395 |
+
)
|
| 396 |
+
|
| 397 |
+
# Task 4: Summarization
|
| 398 |
+
with gr.Tab("Text Summarization"):
|
| 399 |
+
gr.Markdown(
|
| 400 |
+
"""
|
| 401 |
+
**Generate concise summaries of longer texts**
|
| 402 |
+
|
| 403 |
+
Model: BART large fine-tuned on CNN/DailyMail
|
| 404 |
+
"""
|
| 405 |
+
)
|
| 406 |
+
sum_examples = get_summarization_examples()
|
| 407 |
+
sum_input = gr.Textbox(
|
| 408 |
+
label="Text to summarize (min 20 words)",
|
| 409 |
+
value=sum_examples[0] if sum_examples else config.TASKS["summarization"]["example"],
|
| 410 |
+
lines=5,
|
| 411 |
+
)
|
| 412 |
+
sum_btn = gr.Button("Summarize", variant="primary")
|
| 413 |
+
sum_output = gr.Markdown(label="Summary")
|
| 414 |
+
sum_json = gr.JSON(label="Details", visible=False)
|
| 415 |
+
|
| 416 |
+
sum_btn.click(
|
| 417 |
+
demo_summarization,
|
| 418 |
+
inputs=[sum_input],
|
| 419 |
+
outputs=[sum_output, sum_json],
|
| 420 |
+
)
|
| 421 |
+
|
| 422 |
+
# Task 5: Semantic Similarity
|
| 423 |
+
with gr.Tab("Semantic Similarity"):
|
| 424 |
+
gr.Markdown(
|
| 425 |
+
"""
|
| 426 |
+
**Compare semantic similarity between texts**
|
| 427 |
+
|
| 428 |
+
Model: Sentence-BERT (all-MiniLM-L6-v2)
|
| 429 |
+
"""
|
| 430 |
+
)
|
| 431 |
+
emb_examples = get_embeddings_examples()
|
| 432 |
+
emb_text1 = gr.Textbox(
|
| 433 |
+
label="First text",
|
| 434 |
+
value=emb_examples[0] if len(emb_examples) > 0 else "The cat is sleeping",
|
| 435 |
+
lines=2,
|
| 436 |
+
)
|
| 437 |
+
emb_text2 = gr.Textbox(
|
| 438 |
+
label="Second text",
|
| 439 |
+
value=emb_examples[1] if len(emb_examples) > 1 else "A feline is resting",
|
| 440 |
+
lines=2,
|
| 441 |
+
)
|
| 442 |
+
emb_btn = gr.Button("Compare", variant="primary")
|
| 443 |
+
emb_output = gr.Markdown(label="Similarity")
|
| 444 |
+
emb_json = gr.JSON(label="Details", visible=False)
|
| 445 |
+
|
| 446 |
+
emb_btn.click(
|
| 447 |
+
demo_similarity,
|
| 448 |
+
inputs=[emb_text1, emb_text2],
|
| 449 |
+
outputs=[emb_output, emb_json],
|
| 450 |
+
)
|
| 451 |
+
|
| 452 |
+
gr.Markdown(
|
| 453 |
+
"""
|
| 454 |
+
---
|
| 455 |
+
### 🚀 Part 3: Key Concepts Recap
|
| 456 |
+
|
| 457 |
+
✅ **Transformers Architecture:**
|
| 458 |
+
- Self-attention mechanisms allow models to focus on relevant parts of text
|
| 459 |
+
- Pre-training on large corpora + fine-tuning = transfer learning
|
| 460 |
+
|
| 461 |
+
✅ **Using HuggingFace:**
|
| 462 |
+
- Pipelines for quick demos
|
| 463 |
+
- Fine-tuning for custom tasks
|
| 464 |
+
- Model Hub for sharing and collaboration
|
| 465 |
+
|
| 466 |
+
✅ **Production Considerations:**
|
| 467 |
+
- Model size vs. accuracy tradeoff
|
| 468 |
+
- Quantization and distillation for faster inference
|
| 469 |
+
- Licensing and compliance for models
|
| 470 |
+
"""
|
| 471 |
+
)
|
| 472 |
+
|
| 473 |
+
# ===================== TAB 3: RESOURCES =====================
|
| 474 |
+
with gr.Tab("Resources & Next Steps", id="resources"):
|
| 475 |
+
gr.Markdown(
|
| 476 |
+
"""
|
| 477 |
+
## 📚 Learning Resources
|
| 478 |
+
|
| 479 |
+
### Official Documentation
|
| 480 |
+
- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/)
|
| 481 |
+
- [Hugging Face Datasets Documentation](https://huggingface.co/docs/datasets/)
|
| 482 |
+
- [Hugging Face Hub Documentation](https://huggingface.co/docs/hub/)
|
| 483 |
+
|
| 484 |
+
### Tutorials & Courses
|
| 485 |
+
- [Hugging Face Course (Free)](https://huggingface.co/course/)
|
| 486 |
+
- [Transformers from Scratch](https://huggingface.co/docs/transformers/training)
|
| 487 |
+
- [Fine-tuning Guide](https://huggingface.co/docs/transformers/training)
|
| 488 |
+
|
| 489 |
+
---
|
| 490 |
+
|
| 491 |
+
## 🛠️ Popular Models to Explore
|
| 492 |
+
|
| 493 |
+
### Text Classification
|
| 494 |
+
- `distilbert-base-uncased-finetuned-sst-2-english` - Sentiment Analysis
|
| 495 |
+
- `roberta-base` - General purpose classifier
|
| 496 |
+
- `bert-base-multilingual-cased` - Multilingual support
|
| 497 |
+
|
| 498 |
+
### Named Entity Recognition
|
| 499 |
+
- `dslim/bert-base-uncased-finetuned-ner` - English NER
|
| 500 |
+
- `xlm-roberta-base` - Multilingual NER
|
| 501 |
+
|
| 502 |
+
### Question Answering
|
| 503 |
+
- `deepset/roberta-base-squad2` - SQuAD 2.0 fine-tuned
|
| 504 |
+
- `bert-large-uncased-whole-word-masking-finetuned-squad` - BERT Large
|
| 505 |
+
|
| 506 |
+
### Text Generation
|
| 507 |
+
- `gpt2` - Lightweight generation
|
| 508 |
+
- `facebook/bart-large` - Sequence-to-sequence
|
| 509 |
+
- `google/t5-base` - T5 for various tasks
|
| 510 |
+
|
| 511 |
+
### Embeddings & Similarity
|
| 512 |
+
- `sentence-transformers/all-MiniLM-L6-v2` - Fast & efficient
|
| 513 |
+
- `sentence-transformers/all-mpnet-base-v2` - High quality
|
| 514 |
+
|
| 515 |
+
---
|
| 516 |
+
|
| 517 |
+
## 💾 Popular Datasets
|
| 518 |
+
|
| 519 |
+
- `glue` - General Language Understanding Evaluation
|
| 520 |
+
- `wikitext` - Large language model benchmark
|
| 521 |
+
- `squad` - Question answering dataset
|
| 522 |
+
- `conll2003` - Named entity recognition
|
| 523 |
+
- `imdb` - Sentiment analysis
|
| 524 |
+
|
| 525 |
+
---
|
| 526 |
+
|
| 527 |
+
## 🎯 Next Steps After the Sessions
|
| 528 |
+
|
| 529 |
+
### Beginner Path
|
| 530 |
+
1. Explore models on the Hub
|
| 531 |
+
2. Try different models on your own data
|
| 532 |
+
3. Learn about fine-tuning concepts
|
| 533 |
+
|
| 534 |
+
### Intermediate Path
|
| 535 |
+
1. Fine-tune a pre-trained model on your dataset
|
| 536 |
+
2. Deploy a model to Spaces (like this demo!)
|
| 537 |
+
3. Publish your model to the Hub
|
| 538 |
+
|
| 539 |
+
### Advanced Path
|
| 540 |
+
1. Build multi-stage pipelines
|
| 541 |
+
2. Implement custom training loops
|
| 542 |
+
3. Contribute to open-source projects
|
| 543 |
+
|
| 544 |
+
---
|
| 545 |
+
|
| 546 |
+
## 🔗 Community & Support
|
| 547 |
+
|
| 548 |
+
- [Hugging Face Forums](https://discuss.huggingface.co/)
|
| 549 |
+
- [GitHub Issues](https://github.com/huggingface/transformers/issues)
|
| 550 |
+
- [Twitter/X @huggingface](https://twitter.com/huggingface)
|
| 551 |
+
- Company Slack/Teams Channels
|
| 552 |
+
|
| 553 |
+
---
|
| 554 |
+
|
| 555 |
+
## 📝 Session Information
|
| 556 |
+
|
| 557 |
+
**Session 1: Introduction to Hugging Face** (45 minutes)
|
| 558 |
+
- Overview of the ecosystem
|
| 559 |
+
- Core abstractions (Pipelines, Models, Tokenizers)
|
| 560 |
+
- Architecture patterns
|
| 561 |
+
- Enterprise considerations
|
| 562 |
+
|
| 563 |
+
**Session 2: Hands-On Developer Workshop** (90 minutes)
|
| 564 |
+
- Tokenization deep dive
|
| 565 |
+
- Interactive inference playground (5+ NLP tasks)
|
| 566 |
+
- Live coding and experimentation
|
| 567 |
+
- Best practices and next steps
|
| 568 |
+
|
| 569 |
+
---
|
| 570 |
+
|
| 571 |
+
### Questions?
|
| 572 |
+
Feel free to reach out via Slack or email during the sessions! 💬
|
| 573 |
+
"""
|
| 574 |
+
)
|
| 575 |
+
|
| 576 |
+
return app
|
| 577 |
+
|
| 578 |
+
|
| 579 |
+
# ===================== MAIN =====================
|
| 580 |
+
|
| 581 |
+
if __name__ == "__main__":
|
| 582 |
+
app = create_interface()
|
| 583 |
+
app.launch(
|
| 584 |
+
share=True,
|
| 585 |
+
server_name="0.0.0.0",
|
| 586 |
+
server_port=7860,
|
| 587 |
+
)
|
config.py
ADDED
@@ -0,0 +1,52 @@
# Configuration for HuggingFace Enabling Sessions

# Models
SENTIMENT_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"
NER_MODEL = "dslim/bert-base-uncased-finetuned-ner"
QA_MODEL = "deepset/roberta-base-squad2"
SUMMARIZATION_MODEL = "facebook/bart-large-cnn"
EMBEDDINGS_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

# Task definitions
TASKS = {
    "sentiment": {
        "name": "Sentiment Analysis",
        "description": "Classify text sentiment (positive/negative)",
        "model": SENTIMENT_MODEL,
        "example": "I absolutely love this product! It's amazing and works perfectly.",
    },
    "ner": {
        "name": "Named Entity Recognition",
        "description": "Identify and classify named entities (Person, Location, Organization)",
        "model": NER_MODEL,
        "example": "Apple Inc. was founded by Steve Jobs in Cupertino, California.",
    },
    "qa": {
        "name": "Question Answering",
        "description": "Answer questions based on provided context",
        "model": QA_MODEL,
        "example_context": "The Hugging Face Hub is a platform for sharing machine learning models, datasets, and demos.",
        "example_question": "What is the Hugging Face Hub?",
    },
    "summarization": {
        "name": "Text Summarization",
        "description": "Generate concise summaries of longer texts",
        "model": SUMMARIZATION_MODEL,
        "example": "The Hugging Face transformers library provides state-of-the-art pre-trained models for natural language processing tasks. It supports PyTorch and TensorFlow, making it easy to use with either framework.",
    },
    "similarity": {
        "name": "Semantic Similarity",
        "description": "Compare semantic similarity between two sentences",
        "model": EMBEDDINGS_MODEL,
        "example1": "The cat is sleeping on the mat",
        "example2": "A feline is resting on the rug",
    },
}

# Sample data paths
SAMPLE_DATA_CSV = "data/sample_texts.csv"
DEMO_SAMPLES_DIR = "data/demo_samples"

# Session timings
SESSION1_DURATION = 45  # minutes
SESSION2_DURATION = 90  # minutes
data/demo_samples/embeddings.txt
ADDED
@@ -0,0 +1,6 @@
The cat is sleeping on the mat
A feline is resting on the rug
Machine learning is used for AI tasks
Deep learning uses neural networks
Natural language processing handles text
Computer vision processes images
data/demo_samples/ner.txt
ADDED
@@ -0,0 +1,5 @@
Apple Inc. was founded by Steve Jobs in Cupertino, California.
Barack Obama was the 44th President of the United States.
Microsoft is headquartered in Redmond, Washington.
The Eiffel Tower is located in Paris, France.
Google was founded by Larry Page and Sergey Brin at Stanford University.
data/demo_samples/qa.txt
ADDED
@@ -0,0 +1,2 @@
The Hugging Face Hub is a platform for sharing machine learning models, datasets, and demos. It allows researchers and practitioners to collaborate and share their work with the community. The Hub is built on top of Git and provides version control for models and datasets.
Transformers are neural networks that use self-attention mechanisms to process sequential data. They were introduced in the "Attention is All You Need" paper by Vaswani et al. in 2017. Transformers have become the foundation for most state-of-the-art NLP models today.
data/demo_samples/sentiment.txt
ADDED
@@ -0,0 +1,5 @@
I absolutely love this product! It's amazing and works perfectly.
This is the worst experience I've ever had with a company.
The movie was decent but could have been better.
I'm so happy with my purchase!
This service is terrible and a waste of money.
data/demo_samples/summarization.txt
ADDED
@@ -0,0 +1 @@
Artificial intelligence has become increasingly important in modern society. Machine learning, a subset of AI, enables computers to learn from data without being explicitly programmed. Deep learning, which uses neural networks with multiple layers, has revolutionized fields like computer vision and natural language processing. The Hugging Face library has made it easier for developers to access and use pre-trained models for various NLP tasks. These tools have democratized access to powerful AI capabilities, allowing organizations of all sizes to leverage machine learning in their applications.
data/sample_texts.csv
ADDED
@@ -0,0 +1,12 @@
text,task,label
"I absolutely love this product! It's amazing and works perfectly.",sentiment,positive
"This is the worst experience I've ever had with a company.",sentiment,negative
"The weather today is nice.",sentiment,neutral
"The movie was okay, nothing special.",sentiment,neutral
"I'm so happy with my purchase!",sentiment,positive
"Apple Inc. was founded by Steve Jobs in Cupertino California.",ner,entities
"Barack Obama was the 44th President of the United States.",ner,entities
"Microsoft is headquartered in Redmond Washington.",ner,entities
"The Hugging Face Hub is a platform for sharing machine learning models.",qa,context
"The Eiffel Tower is located in Paris France.",ner,entities
"Google was founded by Larry Page and Sergey Brin at Stanford University.",ner,entities
requirements.txt
ADDED
@@ -0,0 +1,9 @@
gradio==6.9.0
transformers>=4.40.0
torch>=2.0.0
datasets>=2.16.0
numpy>=1.24.0
pandas>=2.0.0
scikit-learn>=1.3.0
tqdm>=4.66.0
Pillow>=10.0.0
slides/SESSION1_SLIDES.md
ADDED
@@ -0,0 +1,387 @@
| 1 |
+
# Session 1: Introduction to Hugging Face
|
| 2 |
+
## (45 minutes)
|
| 3 |
+
|
| 4 |
+
---
|
| 5 |
+
|
| 6 |
+
## Slide 1: Title Slide
|
| 7 |
+
**Introduction to Hugging Face**
|
| 8 |
+
### Building the Future of NLP Together
|
| 9 |
+
|
| 10 |
+
---
|
| 11 |
+
**Presenter:** Shouryaan Grish
|
| 12 |
+
**Duration:** 45 minutes
|
| 13 |
+
**Focus:** HuggingFace Ecosystem, Transformers, and Best Practices
|
| 14 |
+
|
| 15 |
+
---
|
| 16 |
+
|
| 17 |
+
## Slide 2: Agenda
|
| 18 |
+
1. **What is Hugging Face?** (5 min)
|
| 19 |
+
2. **Core Abstractions** (10 min)
|
| 20 |
+
- Pipelines
|
| 21 |
+
- Models & Tokenizers
|
| 22 |
+
- Datasets
|
| 23 |
+
3. **Architecture Patterns** (12 min)
|
| 24 |
+
- Encoders
|
| 25 |
+
- Decoders
|
| 26 |
+
- Encoder-Decoders
|
| 27 |
+
4. **Enterprise Landscape** (10 min)
|
| 28 |
+
- Licensing
|
| 29 |
+
- Open-source vs Commercial
|
| 30 |
+
5. **Q&A** (8 min)
|
| 31 |
+
|
| 32 |
+
---
|
| 33 |
+
|
| 34 |
+
## Slide 3: What is Hugging Face?
|
| 35 |
+
|
| 36 |
+
### The Mission
|
| 37 |
+
> "Democratizing state-of-the-art Natural Language Processing"
|
| 38 |
+
|
| 39 |
+
### The Platform
|
| 40 |
+
- **The Hub:** Central repository for 100K+ models, datasets, and applications
|
| 41 |
+
- **Transformers Library:** Python library with pre-trained models
|
| 42 |
+
- **Datasets Library:** Standardized data loading and processing
|
| 43 |
+
- **Spaces:** Deploy interactive demos and applications
|
| 44 |
+
|
| 45 |
+
### Why It Matters
|
| 46 |
+
- ✅ Pre-trained models save time and resources
|
| 47 |
+
- ✅ Community-driven model sharing
|
| 48 |
+
- ✅ Production-ready inference
|
| 49 |
+
- ✅ Cuts NLP development from months to days
|
| 50 |
+
|
| 51 |
+
---
|
| 52 |
+
|
| 53 |
+
## Slide 4: The NLP Transformation
|
| 54 |
+
|
| 55 |
+
### Pre-Hugging Face Era (2017)
|
| 56 |
+
- Models were scattered across papers and personal repos
|
| 57 |
+
- Training from scratch was standard practice
|
| 58 |
+
- Minimal standardization
|
| 59 |
+
- High barrier to entry
|
| 60 |
+
|
| 61 |
+
### Post-Hugging Face Era (2018+)
|
| 62 |
+
- Unified Hub for model sharing
|
| 63 |
+
- Pre-trained models ready to use
|
| 64 |
+
- Standardized APIs
|
| 65 |
+
- Democratized access to SOTA models
|
| 66 |
+
|
| 67 |
+
---
|
| 68 |
+
|
| 69 |
+
## Slide 5: Core Abstraction #1: Pipelines
|
| 70 |
+
|
| 71 |
+
### What is a Pipeline?
|
| 72 |
+
A high-level API for common NLP tasks
|
| 73 |
+
|
| 74 |
+
### Simple Code Example
|
| 75 |
+
```python
|
| 76 |
+
from transformers import pipeline
|
| 77 |
+
|
| 78 |
+
# 2 lines of code → Production-ready inference
|
| 79 |
+
classifier = pipeline("sentiment-analysis")
|
| 80 |
+
result = classifier("I love this product!")
|
| 81 |
+
# Output: [{'label': 'POSITIVE', 'score': 0.9987}]
|
| 82 |
+
```
|
| 83 |
+
|
| 84 |
+
### Supported Tasks
|
| 85 |
+
- sentiment-analysis
|
| 86 |
+
- text-classification
|
| 87 |
+
- named-entity-recognition
|
| 88 |
+
- question-answering
|
| 89 |
+
- summarization
|
| 90 |
+
- translation
|
| 91 |
+
- text-generation
|
| 92 |
+
- zero-shot-classification
|
| 93 |
+
|
| 94 |
+
---
|
| 95 |
+
|
| 96 |
+
## Slide 6: Core Abstraction #2: Models & Tokenizers
|
| 97 |
+
|
| 98 |
+
### Tokenizers
|
| 99 |
+
Convert raw text → Token IDs that models understand
|
| 100 |
+
|
| 101 |
+
### Models
|
| 102 |
+
Neural networks that take token IDs → Predictions
|
| 103 |
+
|
| 104 |
+
### The Flow
|
| 105 |
+
```
|
| 106 |
+
Text Input
|
| 107 |
+
↓
|
| 108 |
+
Tokenizer (breaks into subwords)
|
| 109 |
+
↓
|
| 110 |
+
Model (processes tokens)
|
| 111 |
+
↓
|
| 112 |
+
Predictions/Embeddings
|
| 113 |
+
```
|
| 114 |
+
|
| 115 |
+
### Example
|
| 116 |
+
```python
|
| 117 |
+
from transformers import AutoTokenizer, AutoModelForSequenceClassification
|
| 118 |
+
|
| 119 |
+
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
|
| 120 |
+
model = AutoModelForSequenceClassification.from_pretrained(...)
|
| 121 |
+
|
| 122 |
+
tokens = tokenizer("Hello world")
|
| 123 |
+
outputs = model(**tokens)
|
| 124 |
+
```
|
| 125 |
+
|
| 126 |
+
---
|
| 127 |
+
|
| 128 |
+
## Slide 7: Core Abstraction #3: Datasets
|
| 129 |
+
|
| 130 |
+
### Why a Standardized Dataset API?
|
| 131 |
+
- Different format hassles (JSON, CSV, Parquet, etc.)
|
| 132 |
+
- Download and caching
|
| 133 |
+
- Consistent preprocessing
|
| 134 |
+
|
| 135 |
+
### Key Features
|
| 136 |
+
```python
|
| 137 |
+
from datasets import load_dataset
|
| 138 |
+
|
| 139 |
+
# One line to load any dataset
|
| 140 |
+
dataset = load_dataset("imdb")
|
| 141 |
+
|
| 142 |
+
# Auto-handles splits, caching, processing
|
| 143 |
+
train_set = dataset["train"]
|
| 144 |
+
val_set = dataset["validation"]
|
| 145 |
+
|
| 146 |
+
# Easy to preprocess
|
| 147 |
+
def preprocess(examples):
|
| 148 |
+
return tokenizer(examples["text"])
|
| 149 |
+
|
| 150 |
+
processed = dataset.map(preprocess, batched=True)
|
| 151 |
+
```
|
| 152 |
+
|
| 153 |
+
---
|
| 154 |
+
|
| 155 |
+
## Slide 8: Architecture Pattern #1: Encoders
|
| 156 |
+
|
| 157 |
+
### BERT (Bidirectional Encoder Representations from Transformers)
|
| 158 |
+
|
| 159 |
+
```
|
| 160 |
+
Input Text
|
| 161 |
+
↓
|
| 162 |
+
Bidirectional Processing (attends to left AND right context)
|
| 163 |
+
↓
|
| 164 |
+
Output: Contextualized Embeddings
|
| 165 |
+
```
|
| 166 |
+
|
| 167 |
+
### Characteristics
|
| 168 |
+
- ✅ Bidirectional context
|
| 169 |
+
- ✅ Great for understanding tasks
|
| 170 |
+
- ✅ Fast inference
|
| 171 |
+
|
| 172 |
+
### Best For
|
| 173 |
+
- Classification (sentiment, intent, toxicity)
|
| 174 |
+
- Named Entity Recognition
|
| 175 |
+
- Semantic similarity
|
| 176 |
+
- Feature extraction
|
| 177 |
+
|
| 178 |
+
### Popular Variants
|
| 179 |
+
- BERT-base: 110M parameters
|
| 180 |
+
- **DistilBERT**: 66M parameters, 40% smaller and 60% faster than BERT-base, ~97% of its performance
|
| 181 |
+
- **RoBERTa**: Better pre-training, SOTA on many benchmarks
|
| 182 |
+
- **ALBERT**: Lightweight, parameter-shared
|
| 183 |
+
|
| 184 |
+
---
|
| 185 |
+
|
| 186 |
+
## Slide 9: Architecture Pattern #2: Decoders
|
| 187 |
+
|
| 188 |
+
### GPT-2/GPT-3 (Generative Pre-trained Transformer)
|
| 189 |
+
|
| 190 |
+
```
|
| 191 |
+
Input Tokens
|
| 192 |
+
↓
|
| 193 |
+
Unidirectional Processing (attends ONLY to previous tokens)
|
| 194 |
+
↓
|
| 195 |
+
Predict Next Token
|
| 196 |
+
↓
|
| 197 |
+
Repeat
|
| 198 |
+
```
|
| 199 |
+
|
| 200 |
+
### Characteristics
|
| 201 |
+
- ✅ Generate fluent text
|
| 202 |
+
- ✅ Few-shot learning capability
|
| 203 |
+
- ✅ Large model = more knowledge
|
| 204 |
+
|
| 205 |
+
### Best For
|
| 206 |
+
- Text generation
|
| 207 |
+
- Conversational AI
|
| 208 |
+
- Creative writing
|
| 209 |
+
- Code generation
|
| 210 |
+
|
| 211 |
+
### Trade-offs
|
| 212 |
+
- Slower at inference (generates token-by-token)
|
| 213 |
+
- More memory required
|
| 214 |
+
- Can hallucinate or produce irrelevant content
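The token-by-token loop above can be illustrated without a real model. This is a toy sketch: the next-token table is entirely hypothetical, standing in for a decoder's learned probability distribution.

```python
# Toy illustration of decoder-style generation: at each step the "model"
# predicts the next token given only the previous tokens, then repeats.
# NEXT_TOKEN is a hypothetical stand-in for learned next-token probabilities.
NEXT_TOKEN = {
    "<start>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "<end>",
}

def generate(max_steps: int = 10) -> list[str]:
    tokens = ["<start>"]
    for _ in range(max_steps):
        nxt = NEXT_TOKEN.get(tokens[-1], "<end>")
        if nxt == "<end>":          # stop token ends generation
            break
        tokens.append(nxt)
    return tokens[1:]               # drop the start marker

print(generate())  # ['the', 'cat', 'sat']
```

The loop structure is why decoders are slower at inference: each output token requires a full forward pass over everything generated so far.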
|
| 215 |
+
|
| 216 |
+
---
|
| 217 |
+
|
| 218 |
+
## Slide 10: Architecture Pattern #3: Encoder-Decoders
|
| 219 |
+
|
| 220 |
+
### T5 / BART (Seq2Seq Models)
|
| 221 |
+
|
| 222 |
+
```
|
| 223 |
+
Input Text (Encoder)
|
| 224 |
+
↓
|
| 225 |
+
Attention between Encoder & Decoder
|
| 226 |
+
↓
|
| 227 |
+
Output Text (Decoder)
|
| 228 |
+
```
|
| 229 |
+
|
| 230 |
+
### Characteristics
|
| 231 |
+
- ✅ Combines encoder AND decoder strengths
|
| 232 |
+
- ✅ Flexible input/output handling
|
| 233 |
+
- ✅ Great for structured tasks
|
| 234 |
+
|
| 235 |
+
### Best For
|
| 236 |
+
- Machine translation
|
| 237 |
+
- Summarization
|
| 238 |
+
- Question answering
|
| 239 |
+
- Paraphrase generation
|
| 240 |
+
- Style transfer
|
| 241 |
+
|
| 242 |
+
### Example Tasks
|
| 243 |
+
```
|
| 244 |
+
Input: "Summarize: The AI era has transformed industries..."
|
| 245 |
+
Output: "AI is transforming industries worldwide."
|
| 246 |
+
|
| 247 |
+
Input: "Translate to French: Hello world"
|
| 248 |
+
Output: "Bonjour le monde"
|
| 249 |
+
```
|
| 250 |
+
|
| 251 |
+
---
|
| 252 |
+
|
| 253 |
+
## Slide 11: Choosing the Right Architecture
|
| 254 |
+
|
| 255 |
+
| Task | Best Architecture | Example Model |
|
| 256 |
+
|------|------------------|---|
|
| 257 |
+
| Classification | Encoder | BERT, RoBERTa |
|
| 258 |
+
| Named Entities | Encoder | BERT-NER |
|
| 259 |
+
| Question Answering (extractive) | Encoder | RoBERTa-SQuAD |
|
| 260 |
+
| Text Generation | Decoder | GPT-2 |
|
| 261 |
+
| Summarization | Encoder-Decoder | BART, T5 |
|
| 262 |
+
| Semantic Similarity | Encoder | Sentence-BERT |
|
| 263 |
+
| Machine Translation | Encoder-Decoder | MarianMT |
|
| 264 |
+
|
| 265 |
+
---
|
| 266 |
+
|
| 267 |
+
## Slide 12: Enterprise Landscape
|
| 268 |
+
|
| 269 |
+
### Open-Source Models
|
| 270 |
+
- Free, transparent, modifiable
|
| 271 |
+
- No licensing restrictions (usually)
|
| 272 |
+
- Community support
|
| 273 |
+
- Examples: BERT, RoBERTa, GPT-2
|
| 274 |
+
|
| 275 |
+
### Commercial Models
|
| 276 |
+
- Licensing required (sometimes)
|
| 277 |
+
- Vendor support
|
| 278 |
+
- Proprietary optimizations
|
| 279 |
+
- Examples: GPT-4, Claude, PaLM
|
| 280 |
+
|
| 281 |
+
### Licensing Types
|
| 282 |
+
- **MIT/Apache:** Fully permissive
|
| 283 |
+
- **OpenRAIL:** Open Responsible AI License (middle ground)
|
| 284 |
+
- **Proprietary:** Restricted use (API access only)
|
| 285 |
+
|
| 286 |
+
---
|
| 287 |
+
|
| 288 |
+
## Slide 13: Considerations for Production
|
| 289 |
+
|
| 290 |
+
### Performance vs. Speed Trade-off
|
| 291 |
+
```
|
| 292 |
+
Performance (Accuracy)
|
| 293 |
+
↑
|
| 294 |
+
| Large Models
|
| 295 |
+
| (BERT-large)
|
| 296 |
+
|
|
| 297 |
+
| Medium Models
|
| 298 |
+
| (RoBERTa-base)
|
| 299 |
+
|
|
| 300 |
+
| Small Models
|
| 301 |
+
| (DistilBERT)
|
| 302 |
+
─────┼──────────────→ Speed
|
| 303 |
+
```
|
| 304 |
+
|
| 305 |
+
### Optimization Techniques
|
| 306 |
+
- **Quantization:** Reduce model size by 4x
|
| 307 |
+
- **Distillation:** Train smaller model to mimic larger one
|
| 308 |
+
- **Pruning:** Remove less important connections
|
| 309 |
+
- **Caching:** Store frequent predictions
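Caching is the cheapest of these techniques to try. A minimal sketch, assuming identical inputs recur (e.g. repeated support queries); `run_model` is a hypothetical placeholder for real inference:

```python
from functools import lru_cache

CALLS = {"n": 0}  # counts actual (expensive) model invocations

def run_model(text: str) -> str:
    # placeholder for a real model call; always the expensive path
    return "POSITIVE" if "love" in text else "NEGATIVE"

@lru_cache(maxsize=1024)
def cached_predict(text: str) -> str:
    CALLS["n"] += 1
    return run_model(text)

cached_predict("I love it")
cached_predict("I love it")  # identical input: served from cache
print(CALLS["n"])  # 1
```

`lru_cache` only helps when inputs repeat exactly; for fuzzy matches, an embedding-based cache is the usual next step.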
|
| 310 |
+
|
| 311 |
+
---
|
| 312 |
+
|
| 313 |
+
## Slide 14: Real-World Example Flow
|
| 314 |
+
|
| 315 |
+
### Workflow: Building a Customer Support Chatbot
|
| 316 |
+
|
| 317 |
+
1. **Classify Intent** (What does the customer want?)
|
| 318 |
+
- Model: BERT for intent classification
|
| 319 |
+
- Output: bug_report, billing_question, feature_request
|
| 320 |
+
|
| 321 |
+
2. **Extract Entities** (Who, what, when?)
|
| 322 |
+
- Model: NER model
|
| 323 |
+
- Output: customer_id, product_name, date
|
| 324 |
+
|
| 325 |
+
3. **Search Knowledge Base** (Find relevant docs)
|
| 326 |
+
- Model: Semantic similarity with SBERT
|
| 327 |
+
- Output: Top matching solutions
|
| 328 |
+
|
| 329 |
+
4. **Generate Response** (Compose answer)
|
| 330 |
+
- Model: T5 or GPT-2
|
| 331 |
+
- Output: Customer-tailored response
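The four steps above compose naturally into one function. This is a sketch with stub functions standing in for the real models; every function name and the toy knowledge base are hypothetical:

```python
def classify_intent(msg: str) -> str:
    # stub for a BERT intent classifier
    return "billing_question" if "invoice" in msg else "bug_report"

def extract_entities(msg: str) -> list[str]:
    # stub for an NER model: treat capitalized words as entities
    return [w for w in msg.split() if w.istitle()]

def search_kb(intent: str) -> str:
    # stub for semantic search over a knowledge base
    kb = {"billing_question": "See the billing FAQ.",
          "bug_report": "Please file a ticket."}
    return kb[intent]

def respond(msg: str) -> dict:
    intent = classify_intent(msg)
    return {"intent": intent,
            "entities": extract_entities(msg),
            "answer": search_kb(intent)}

print(respond("My Acme invoice is wrong"))
```

Swapping any stub for a real `pipeline(...)` call keeps the same shape, which is what makes these workflows easy to upgrade model-by-model.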
|
| 332 |
+
|
| 333 |
+
---
|
| 334 |
+
|
| 335 |
+
## Slide 15: Key Takeaways
|
| 336 |
+
|
| 337 |
+
### ✅ Main Lessons
|
| 338 |
+
1. **HuggingFace democratizes NLP**
|
| 339 |
+
- Pre-trained models, no need to train from scratch
|
| 340 |
+
|
| 341 |
+
2. **Pipelines make NLP accessible**
|
| 342 |
+
- Complex tasks in 2-3 lines of Python
|
| 343 |
+
|
| 344 |
+
3. **Choose the right architecture**
|
| 345 |
+
- Encoders for understanding, Decoders for generation, Encoder-Decoders for structured tasks
|
| 346 |
+
|
| 347 |
+
4. **Production requires thoughtfulness**
|
| 348 |
+
- Consider licensing, performance, and trade-offs
|
| 349 |
+
|
| 350 |
+
5. **The community is your resource**
|
| 351 |
+
- 100K+ models ready to use on the Hub
|
| 352 |
+
|
| 353 |
+
### 🚀 Next Steps
|
| 354 |
+
- **Session 2:** Hands-on development and advanced techniques
|
| 355 |
+
- **Post-Session:** Experiment with your own data and models
|
| 356 |
+
|
| 357 |
+
---
|
| 358 |
+
|
| 359 |
+
## Slide 16: Q&A
|
| 360 |
+
|
| 361 |
+
**Questions?**
|
| 362 |
+
|
| 363 |
+
We'll cover more technical details in Session 2, including:
|
| 364 |
+
- Tokenization mechanics
|
| 365 |
+
- Fine-tuning your own models
|
| 366 |
+
- Deploying to production with Spaces
|
| 367 |
+
- Publishing to the Hub
|
| 368 |
+
|
| 369 |
+
---
|
| 370 |
+
|
| 371 |
+
## Slide 17: Bonus: Resource Links
|
| 372 |
+
|
| 373 |
+
### Documentation
|
| 374 |
+
- https://huggingface.co/docs/transformers/
|
| 375 |
+
- https://huggingface.co/docs/datasets/
|
| 376 |
+
- https://huggingface.co/course/
|
| 377 |
+
|
| 378 |
+
### Model Hub
|
| 379 |
+
- https://huggingface.co/models
|
| 380 |
+
|
| 381 |
+
### Try It Out
|
| 382 |
+
- This very demo: Spaces app with live inference
|
| 383 |
+
|
| 384 |
+
---
|
| 385 |
+
|
| 386 |
+
## Session 1 Complete! 🎉
|
| 387 |
+
**Next: Session 2 - Hands-On Developer Workshop (90 min)**
|
slides/SESSION2_SLIDES.md
ADDED
|
@@ -0,0 +1,731 @@
| 1 |
+
# Session 2: Hands-On Developer Workshop
|
| 2 |
+
## Building End-to-End NLP Workflows with Hugging Face
|
| 3 |
+
### (90 minutes)
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## Slide 1: Title Slide
|
| 8 |
+
**Hands-On Developer Workshop**
|
| 9 |
+
### Building End-to-End NLP Workflows with HuggingFace
|
| 10 |
+
|
| 11 |
+
**Duration:** 90 minutes
|
| 12 |
+
**Format:** Interactive coding + Demos + Exercises
|
| 13 |
+
**Difficulty:** Intermediate
|
| 14 |
+
|
| 15 |
+
---
|
| 16 |
+
|
| 17 |
+
## Slide 2: Session Agenda (90 min)
|
| 18 |
+
|
| 19 |
+
1. **Tokenization Deep Dive** (15 min)
|
| 20 |
+
- How tokenizers work
|
| 21 |
+
- Tokenization process breakdown
|
| 22 |
+
- Interactive explorer
|
| 23 |
+
|
| 24 |
+
2. **Inference Playground** (45 min)
|
| 25 |
+
- Live demos across 5+ NLP tasks
|
| 26 |
+
- Understanding model outputs
|
| 27 |
+
- Confidence scores and interpretability
|
| 28 |
+
|
| 29 |
+
3. **Exercise Checkpoint** (20 min)
|
| 30 |
+
- Try your own examples
|
| 31 |
+
- Experiment with different inputs
|
| 32 |
+
- Troubleshooting guide
|
| 33 |
+
|
| 34 |
+
4. **Wrap-Up & Next Steps** (10 min)
|
| 35 |
+
- Publishing to the Hub
|
| 36 |
+
- Fine-tuning overview
|
| 37 |
+
- Post-session resources
|
| 38 |
+
|
| 39 |
+
---
|
| 40 |
+
|
| 41 |
+
## Slide 3: Review: Session 1 Recap
|
| 42 |
+
|
| 43 |
+
### We Learned:
|
| 44 |
+
✅ HuggingFace Platform Overview
|
| 45 |
+
✅ Pipelines - high-level API
|
| 46 |
+
✅ Encoders, Decoders, Encoder-Decoders
|
| 47 |
+
✅ Architecture selection guide
|
| 48 |
+
|
| 49 |
+
### Now We'll Learn:
|
| 50 |
+
🔬 **HOW things work under the hood**
|
| 51 |
+
👨💻 **Hands-on implementation**
|
| 52 |
+
⚙️ **Production considerations**
|
| 53 |
+
|
| 54 |
+
---
|
| 55 |
+
|
| 56 |
+
## Slide 4: Part 1 - Tokenization Deep Dive
|
| 57 |
+
|
| 58 |
+
### What is Tokenization?
|
| 59 |
+
Converting raw text → Numerical representations that models understand
|
| 60 |
+
|
| 61 |
+
### Why Does This Matter?
|
| 62 |
+
- Models don't understand words, only numbers
|
| 63 |
+
- Tokenization decisions affect model performance
|
| 64 |
+
- Different tokenizers produce different results
|
| 65 |
+
|
| 66 |
+
---
|
| 67 |
+
|
| 68 |
+
## Slide 5: Tokenization Step-by-Step
|
| 69 |
+
|
| 70 |
+
### Example Input
|
| 71 |
+
```
|
| 72 |
+
"Hello, how are you?"
|
| 73 |
+
```
|
| 74 |
+
|
| 75 |
+
### Step 1: Split into tokens (subwords)
|
| 76 |
+
```
|
| 77 |
+
["Hello", ",", "how", "are", "you", "?"]
|
| 78 |
+
```
|
| 79 |
+
|
| 80 |
+
### Step 2: Convert to IDs
|
| 81 |
+
```
|
| 82 |
+
[7592, 1010, 2129, 2024, 2017, 1029]
|
| 83 |
+
```
|
| 84 |
+
|
| 85 |
+
### Step 3: Add attention masks (1 = real token, 0 = padding)
|
| 86 |
+
```
|
| 87 |
+
[1, 1, 1, 1, 1, 1]
|
| 88 |
+
```
|
| 89 |
+
|
| 90 |
+
### Step 4: Model processes
|
| 91 |
+
```
|
| 92 |
+
Input: [7592, 1010, 2129, 2024, 2017, 1029]
|
| 93 |
+
Output: [logits, embeddings, predictions]
|
| 94 |
+
```
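Steps 1-3 can be mimicked with a toy six-entry vocabulary. The vocabulary below is hypothetical (the IDs happen to match the BERT example above); real tokenizers learn subword vocabularies of ~30k entries:

```python
# Toy version of the tokenization steps: split, map to IDs, build mask.
VOCAB = {"Hello": 7592, ",": 1010, "how": 2129,
         "are": 2024, "you": 2017, "?": 1029}

def toy_tokenize(text: str) -> dict:
    # step 1: split into tokens (naive punctuation handling)
    tokens = text.replace(",", " ,").replace("?", " ?").split()
    # step 2: convert tokens to IDs
    input_ids = [VOCAB[t] for t in tokens]
    # step 3: attention mask (all real tokens, no padding here)
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

enc = toy_tokenize("Hello, how are you?")
print(enc["input_ids"])  # [7592, 1010, 2129, 2024, 2017, 1029]
```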
|
| 95 |
+
|
| 96 |
+
---
|
| 97 |
+
|
| 98 |
+
## Slide 6: Tokenization Strategies
|
| 99 |
+
|
| 100 |
+
### 1. Word-level Tokenization
|
| 101 |
+
**Split on spaces and punctuation**
|
| 102 |
+
```
|
| 103 |
+
"I love NLP" → ["I", "love", "NLP"]
|
| 104 |
+
```
|
| 105 |
+
❌ Problem: Unbounded vocabulary (unseen words become unknown tokens)
|
| 106 |
+
|
| 107 |
+
### 2. Character-level Tokenization
|
| 108 |
+
**Split character by character**
|
| 109 |
+
```
|
| 110 |
+
"NLP" → ["N", "L", "P"]
|
| 111 |
+
```
|
| 112 |
+
❌ Problem: Too many tokens, harder to learn patterns
|
| 113 |
+
|
| 114 |
+
### 3. Subword Tokenization (RECOMMENDED) ✅
|
| 115 |
+
**Smart splitting into frequently-seen sequences**
|
| 116 |
+
```
|
| 117 |
+
"uncharacteristic" → ["un", "character", "istic"]
|
| 118 |
+
"new_word" → ["new", "_word"]   (boundary markers preserve word breaks)
|
| 119 |
+
```
|
| 120 |
+
✅ Balance between vocabulary size and sequence length
|
| 121 |
+
|
| 122 |
+
---
|
| 123 |
+
|
| 124 |
+
## Slide 7: Common Tokenizers
|
| 125 |
+
|
| 126 |
+
| Tokenizer | Used By | Strategy |
|
| 127 |
+
|-----------|---------|----------|
|
| 128 |
+
| WordPiece | BERT | Greedy merge frequent pairs |
|
| 129 |
+
| BPE | GPT-2, RoBERTa | Byte-Pair Encoding |
|
| 130 |
+
| SentencePiece | T5, XLNet | Language-agnostic |
|
| 131 |
+
| Unigram | ALBERT | Probabilistic |
|
| 132 |
+
|
| 133 |
+
### Key Insight
|
| 134 |
+
Different tokenizers may produce different tokens for the same input!
|
| 135 |
+
|
| 136 |
+
---
|
| 137 |
+
|
| 138 |
+
## Slide 8: Tokenization in Practice
|
| 139 |
+
|
| 140 |
+
### Code Example
|
| 141 |
+
```python
|
| 142 |
+
from transformers import AutoTokenizer
|
| 143 |
+
|
| 144 |
+
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
|
| 145 |
+
|
| 146 |
+
text = "I love Hugging Face!"
|
| 147 |
+
|
| 148 |
+
# Tokenize
|
| 149 |
+
tokens = tokenizer(text)
|
| 150 |
+
# Output: {
|
| 151 |
+
# 'input_ids': [101, 1045, 2572, 23604, 2469, 999, 102],
|
| 152 |
+
# 'token_type_ids': [0, 0, 0, 0, 0, 0, 0],
|
| 153 |
+
# 'attention_mask': [1, 1, 1, 1, 1, 1, 1]
|
| 154 |
+
# }
|
| 155 |
+
|
| 156 |
+
# Convert back to tokens
|
| 157 |
+
token_strings = tokenizer.convert_ids_to_tokens(tokens['input_ids'])
|
| 158 |
+
# Output: ['[CLS]', 'i', 'love', 'hugging', 'face', '!', '[SEP]']
|
| 159 |
+
```
|
| 160 |
+
|
| 161 |
+
### Special Tokens
|
| 162 |
+
- `[CLS]` = Classification token (beginning)
|
| 163 |
+
- `[SEP]` = Separator (between sentences)
|
| 164 |
+
- `[PAD]` = Padding (for shorter sequences)
|
| 165 |
+
- `[UNK]` = Unknown token (unseen words)
|
| 166 |
+
- `[MASK]` = Masked token (for BERT pre-training)
|
| 167 |
+
|
| 168 |
+
---
|
| 169 |
+
|
| 170 |
+
## Slide 9: Attention Masks
|
| 171 |
+
|
| 172 |
+
### Why Attention Masks?
|
| 173 |
+
Models process batches of text. Different sequences have different lengths.
|
| 174 |
+
Need to mark which tokens are "real" vs "padding"
|
| 175 |
+
|
| 176 |
+
### Example
|
| 177 |
+
```
|
| 178 |
+
Sentence 1: "I love NLP" (3 tokens)
|
| 179 |
+
Sentence 2: "Hello" (1 token)
|
| 180 |
+
|
| 181 |
+
After padding to length 5:
|
| 182 |
+
Sequence 1: [I, love, NLP, [PAD], [PAD]]
|
| 183 |
+
Sequence 2: [Hello, [PAD], [PAD], [PAD], [PAD]]
|
| 184 |
+
|
| 185 |
+
Attention Masks:
|
| 186 |
+
Sequence 1: [1, 1, 1, 0, 0] ← Ignore these
|
| 187 |
+
Sequence 2: [1, 0, 0, 0, 0] ← Ignore these
|
| 188 |
+
|
| 189 |
+
Model learns: "When attention_mask=0, ignore this token"
|
| 190 |
+
```
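The padding example above can be sketched as a small helper: pad every sequence to the batch maximum and build the matching attention masks. A minimal sketch, not the tokenizer's actual implementation:

```python
# Pad each sequence to the longest in the batch; mask 1 = real, 0 = padding.
def pad_batch(batch: list[list[str]], pad_token: str = "[PAD]"):
    max_len = max(len(seq) for seq in batch)
    padded, masks = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_token] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks

padded, masks = pad_batch([["I", "love", "NLP"], ["Hello"]])
print(masks)  # [[1, 1, 1], [1, 0, 0]]
```

In practice `tokenizer(texts, padding=True)` does exactly this for you, on token IDs rather than strings.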
|
| 191 |
+
|
| 192 |
+
---
|
| 193 |
+
|
| 194 |
+
## Slide 10: Token Limitations
|
| 195 |
+
|
| 196 |
+
### Max Sequence Length
|
| 197 |
+
Different models have different token limits!
|
| 198 |
+
|
| 199 |
+
| Model | Max Tokens | Max Characters (approx) |
|
| 200 |
+
|-------|-----------|--------------------------|
|
| 201 |
+
| BERT | 512 | 2048 |
|
| 202 |
+
| GPT-2 | 1024 | 4096 |
|
| 203 |
+
| RoBERTa | 512 | 2048 |
|
| 204 |
+
| Longformer | 4096 | 16384 |
|
| 205 |
+
| T5 | 512 | 2048 |
|
| 206 |
+
|
| 207 |
+
### What happens when text is too long?
|
| 208 |
+
```python
|
| 209 |
+
# Option 1: Truncation
|
| 210 |
+
tokens = tokenizer(long_text, max_length=512, truncation=True)
|
| 211 |
+
|
| 212 |
+
# Option 2: Split document into chunks
|
| 213 |
+
chunks = [text[i:i+2000] for i in range(0, len(text), 2000)]
|
| 214 |
+
```
|
| 215 |
+
|
| 216 |
+
---
|
| 217 |
+
|
| 218 |
+
## Slide 11: Part 2 - Inference Playground
|
| 219 |
+
|
| 220 |
+
### Now let's RUN models on real text!
|
| 221 |
+
|
| 222 |
+
### We'll Explore 5 NLP Tasks:
|
| 223 |
+
1. **Sentiment Analysis** - Classify emotions
|
| 224 |
+
2. **Named Entity Recognition** - Extract entities
|
| 225 |
+
3. **Question Answering** - Answer from context
|
| 226 |
+
4. **Summarization** - Generate summaries
|
| 227 |
+
5. **Semantic Similarity** - Compare meaning
|
| 228 |
+
|
| 229 |
+
---
|
| 230 |
+
|
| 231 |
+
## Slide 12: Task 1 - Sentiment Analysis
|
| 232 |
+
|
| 233 |
+
### How It Works
|
| 234 |
+
```
|
| 235 |
+
Input: "I love this product!"
|
| 236 |
+
Model: DistilBERT (fine-tuned on SST-2)
|
| 237 |
+
Process:
|
| 238 |
+
1. Tokenize → [101, 1045, 2572, ...]
|
| 239 |
+
2. Feed to BERT → [embeddings]
|
| 240 |
+
3. Classification head → [logits]
|
| 241 |
+
4. Softmax → [0.9987, 0.0013]
|
| 242 |
+
|
| 243 |
+
Output: POSITIVE (99.87% confidence)
|
| 244 |
+
```
|
| 245 |
+
|
| 246 |
+
### Use Cases
|
| 247 |
+
- Customer feedback analysis
|
| 248 |
+
- Social media monitoring
|
| 249 |
+
- Review classification
|
| 250 |
+
- Support ticket triage
|
| 251 |
+
|
| 252 |
+
### Demo Time! 🎬
|
| 253 |
+
Check the Inference Playground tab
|
| 254 |
+
|
| 255 |
+
---
|
| 256 |
+
|
| 257 |
+
## Slide 13: Task 2 - Named Entity Recognition
|
| 258 |
+
|
| 259 |
+
### How It Works
|
| 260 |
+
```
|
| 261 |
+
Input: "Apple Inc. was founded by Steve Jobs"
|
| 262 |
+
|
| 263 |
+
Process:
|
| 264 |
+
1. Tokenize → ["Apple", "Inc", ".", "was", ...]
|
| 265 |
+
2. Classify each token:
|
| 266 |
+
- "Apple" → Organization
|
| 267 |
+
- "Inc" → Organization
|
| 268 |
+
- "Steve" → Person
|
| 269 |
+
- "Jobs" → Person
|
| 270 |
+
|
| 271 |
+
Output:
|
| 272 |
+
Apple Inc. [ORG]
|
| 273 |
+
Steve Jobs [PER]
|
| 274 |
+
```
|
| 275 |
+
|
| 276 |
+
### Use Cases
|
| 277 |
+
- Resume parsing
|
| 278 |
+
- Information extraction
|
| 279 |
+
- Knowledge base construction
|
| 280 |
+
- Data mining from documents
|
| 281 |
+
|
| 282 |
+
### Aggregation Strategies
|
| 283 |
+
```
|
| 284 |
+
Token-level: ["Apple", "Inc"] separately
|
| 285 |
+
Simple: Keep adjacent same entities
|
| 286 |
+
Aggregated: Combine subwords into words
|
| 287 |
+
```
|
| 288 |
+
|
| 289 |
+
---
|
| 290 |
+
|
| 291 |
+
## Slide 14: Task 3 - Question Answering
|
| 292 |
+
|
| 293 |
+
### How It Works
|
| 294 |
+
```
|
| 295 |
+
Context: "The Eiffel Tower is in Paris, France."
|
| 296 |
+
Question: "Where is the Eiffel Tower?"
|
| 297 |
+
|
| 298 |
+
Process:
|
| 299 |
+
1. Tokenize both
|
| 300 |
+
2. Encode context + question pairs
|
| 301 |
+
3. Model predicts START and END positions
|
| 302 |
+
4. Extract text span
|
| 303 |
+
|
| 304 |
+
Output: "Paris, France"
|
| 305 |
+
```
|
| 306 |
+
|
| 307 |
+
### Model Architecture
|
| 308 |
+
Uses a model fine-tuned on SQuAD (the Stanford Question Answering Dataset)
|
| 309 |
+
|
| 310 |
+
### Use Cases
|
| 311 |
+
- FAQ systems
|
| 312 |
+
- Document search
|
| 313 |
+
- Customer support automation
|
| 314 |
+
- Knowledge base queries
|
| 315 |
+
|
| 316 |
+
### Limitations
|
| 317 |
+
- Must have answer in provided context
|
| 318 |
+
- Works best with structured context
|
| 319 |
+
- Answer must be continuous span
|
| 320 |
+
|
| 321 |
+
---
|
| 322 |
+
|
| 323 |
+
## Slide 15: Task 4 - Summarization
|
| 324 |
+
|
| 325 |
+
### How It Works
|
| 326 |
+
```
|
| 327 |
+
Input: "The Hugging Face transformers library provides..."
|
| 328 |
+
|
| 329 |
+
Process:
|
| 330 |
+
1. Encode full input
|
| 331 |
+
2. Use attention to identify key sentences
|
| 332 |
+
3. Generate summary with decoder
|
| 333 |
+
4. Stop at max_length
|
| 334 |
+
|
| 335 |
+
Output: "Hugging Face offers access to pre-trained..."
|
| 336 |
+
```
|
| 337 |
+
|
| 338 |
+
### Abstractive vs Extractive
|
| 339 |
+
```
|
| 340 |
+
Abstractive (our model uses this):
|
| 341 |
+
- Generate NEW sentences
|
| 342 |
+
- More fluent, may hallucinate
|
| 343 |
+
|
| 344 |
+
Extractive:
|
| 345 |
+
- Select existing sentences
|
| 346 |
+
- More faithful, but choppy
|
| 347 |
+
```
|
| 348 |
+
|
| 349 |
+
### Parameters
|
| 350 |
+
- `max_length`: Maximum summary length
|
| 351 |
+
- `min_length`: Minimum summary length
|
| 352 |
+
- `num_beams`: Quality vs speed trade-off
|
| 353 |
+
|
| 354 |
+
### Use Cases
|
| 355 |
+
- Article summarization
|
| 356 |
+
- Report generation
|
| 357 |
+
- Meeting notes
|
| 358 |
+
- Long document processing
|
| 359 |
+
|
| 360 |
+
---
|
| 361 |
+
|
| 362 |
+
## Slide 16: Task 5 - Semantic Similarity
|
| 363 |
+
|
| 364 |
+
### How It Works
|
| 365 |
+
```
|
| 366 |
+
Sentence 1: "The cat is sleeping on the mat"
|
| 367 |
+
Sentence 2: "A feline is resting on the rug"
|
| 368 |
+
|
| 369 |
+
Process:
|
| 370 |
+
1. Encode sentence 1 → [embedding vector: 384-dimensional]
|
| 371 |
+
2. Encode sentence 2 → [embedding vector: 384-dimensional]
|
| 372 |
+
3. Calculate cosine similarity
|
| 373 |
+
4. Score: -1 to 1 (1 = identical meaning)
|
| 374 |
+
|
| 375 |
+
Output: Similarity = 0.92 (Very Similar!)
|
| 376 |
+
```
|
| 377 |
+
|
| 378 |
+
### Model: Sentence-BERT
|
| 379 |
+
- Optimized for semantic meaning
|
| 380 |
+
- Much faster than full transformers
|
| 381 |
+
- Maintains meaning representation
|
| 382 |
+
|
| 383 |
+
### Use Cases
|
| 384 |
+
- Duplicate detection
|
| 385 |
+
- Recommendation systems
|
| 386 |
+
- Plagiarism detection
|
| 387 |
+
- Document clustering
|
| 388 |
+
- Semantic search
|
| 389 |
+
|
| 390 |
+
### Similarity Scores
|
| 391 |
+
```
|
| 392 |
+
1.0 = Identical meaning
|
| 393 |
+
0.8+ = Very similar
|
| 394 |
+
0.6+ = Similar
|
| 395 |
+
0.4+ = Somewhat related
|
| 396 |
+
0.0 = No relation
|
| 397 |
+
```
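Step 3 above, cosine similarity, is simple enough to write out directly. The tiny vectors here are hypothetical stand-ins for the 384-dimensional sentence embeddings:

```python
import math

# Cosine similarity: dot product of the vectors divided by the product
# of their lengths; 1.0 means same direction (same "meaning").
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# parallel vectors -> 1.0; orthogonal vectors -> 0.0
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0
```

With real embeddings you would call `model.encode(sentence)` from `sentence-transformers` to get the vectors first.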
|
| 398 |
+
|
| 399 |
+
---
|
| 400 |
+
|
| 401 |
+
## Slide 17: Understanding Model Outputs
|
| 402 |
+
|
| 403 |
+
### Confidence Scores Matter!
|
| 404 |
+
```
|
| 405 |
+
Sentiment: "I like this movie"
|
| 406 |
+
Output: POSITIVE (0.78 confidence)
|
| 407 |
+
|
| 408 |
+
Should I trust this?
|
| 409 |
+
- 78% is decent, but not super high
|
| 410 |
+
- Might want manual review for this one
|
| 411 |
+
|
| 412 |
+
vs.
|
| 413 |
+
|
| 414 |
+
Output: POSITIVE (0.98 confidence)
|
| 415 |
+
- Very high confidence, safer to automate
|
| 416 |
+
```
|
| 417 |
+
|
| 418 |
+
### Decision Rules
|
| 419 |
+
```
|
| 420 |
+
High confidence (>0.90): Automate with confidence
|
| 421 |
+
Medium confidence (0.70-0.90): Automate with logging
|
| 422 |
+
Low confidence (<0.70): Flag for review
|
| 423 |
+
```
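The decision rules above translate into a few lines of routing logic. The cutoffs are the illustrative ones from the slide, not universal constants; tune them per task:

```python
# Route a prediction by its confidence score (thresholds from the slide).
def route(confidence: float) -> str:
    if confidence > 0.90:
        return "automate"
    if confidence >= 0.70:
        return "automate_with_logging"
    return "flag_for_review"

print(route(0.98))  # automate
print(route(0.78))  # automate_with_logging
print(route(0.55))  # flag_for_review
```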
|
| 424 |
+
|
| 425 |
+
---
|
| 426 |
+
|
| 427 |
+
## Slide 18: Part 3 - Exercise Checkpoint
|
| 428 |
+
|
| 429 |
+
### Your Turn! 🎯
|
| 430 |
+
|
| 431 |
+
### Challenge 1: Sentiment Analysis
|
| 432 |
+
Try different types of text:
|
| 433 |
+
- Product reviews
|
| 434 |
+
- Social media posts
|
| 435 |
+
- Technical feedback
|
| 436 |
+
- Sarcasm ("Yeah, GREAT idea!")
|
| 437 |
+
|
| 438 |
+
### Challenge 2: NER
|
| 439 |
+
Test with:
|
| 440 |
+
- Different domains (sports, business, medicine)
|
| 441 |
+
- Multiple languages
|
| 442 |
+
- Misspellings or slang
|
| 443 |
+
|
| 444 |
+
### Challenge 3: QA
|
| 445 |
+
Build a simple FAQ system:
|
| 446 |
+
- Create a knowledge base document
|
| 447 |
+
- Ask different questions
|
| 448 |
+
- See how model performs
|
| 449 |
+
|
| 450 |
+
### Challenge 4: Creative Exploration
|
| 451 |
+
Mix and match:
|
| 452 |
+
- Run sentiment on generated summaries
|
| 453 |
+
- Extract entities from question answers
|
| 454 |
+
- Create a pipeline!
|
| 455 |
+
|
| 456 |
+
---
|
| 457 |
+
|
| 458 |
+
## Slide 19: Troubleshooting Guide
|
| 459 |
+
|
| 460 |
+
### Common Issues & Solutions
|
| 461 |
+
|
| 462 |
+
#### Issue 1: Text Too Long
|
| 463 |
+
```
|
| 464 |
+
Error: Input sequence too long
|
| 465 |
+
|
| 466 |
+
Solution:
|
| 467 |
+
option1 = tokenizer(text, max_length=512, truncation=True)
|
| 468 |
+
# Text is trimmed to first 512 tokens
|
| 469 |
+
|
| 470 |
+
option2 = [text[i:i+2000] for i in range(0, len(text), 2000)]
|
| 471 |
+
# Process in chunks, combine results
|
| 472 |
+
```
|
| 473 |
+
|
| 474 |
+
#### Issue 2: Unexpected Results
|
| 475 |
+
```
|
| 476 |
+
Input: "This is NOT good!"
|
| 477 |
+
Output: POSITIVE (but we wanted NEGATIVE)
|
| 478 |
+
|
| 479 |
+
Cause: Models may miss negation
|
| 480 |
+
|
| 481 |
+
Solution:
|
| 482 |
+
- Model is sensitive to specific words/context
|
| 483 |
+
- Try different phrasing
|
| 484 |
+
- Consider fine-tuning on domain data
|
| 485 |
+
- Use ensemble of models
|
| 486 |
+
```
|
| 487 |
+
|
| 488 |
+
#### Issue 3: Slow Inference
|
| 489 |
+
```
|
| 490 |
+
Cause: Large model + slow hardware
|
| 491 |
+
|
| 492 |
+
Solutions:
|
| 493 |
+
- Use DistilBERT instead of BERT (60% faster, ~97% of the accuracy)
|
| 494 |
+
- Enable quantization
|
| 495 |
+
- Use GPU/TPU
|
| 496 |
+
- Batch process multiple inputs
|
| 497 |
+
```
|
| 498 |
+
|
| 499 |
+
---
|
| 500 |
+
|
| 501 |
+
## Slide 20: Production Considerations
|
| 502 |
+
|
| 503 |
+
### Performance vs Speed
|
| 504 |
+
```
|
| 505 |
+
DistilBERT: Fast, 97% of BERT performance ✅
|
| 506 |
+
BERT-base: Good balance
|
| 507 |
+
BERT-large: Slower, best quality
|
| 508 |
+
```
|
| 509 |
+
|
| 510 |
+
### Optimization Techniques
|
| 511 |
+
```
|
| 512 |
+
1. Quantization: Model size ↓ 4x, speed ↑ 2-3x
|
| 513 |
+
2. Distillation: Train smaller model, keep quality
|
| 514 |
+
3. Pruning: Remove redundant parameters
|
| 515 |
+
4. Caching: Store frequent predictions
|
| 516 |
+
5. Batching: Process multiple inputs together
|
| 517 |
+
```
|
| 518 |
+
|
| 519 |
+
### Monitoring & Logging
|
| 520 |
+
```
|
| 521 |
+
- Track model confidence
|
| 522 |
+
- Log failures/edge cases
|
| 523 |
+
- Monitor latency
|
| 524 |
+
- User feedback loop
|
| 525 |
+
```
|
| 526 |
+
|
| 527 |
+
---
|
| 528 |
+
|
| 529 |
+
## Slide 21: Publishing to the Hub
|
| 530 |
+
|
| 531 |
+
### 3-Step Process
|
| 532 |
+
|
| 533 |
+
#### Step 1: Create Model Card
|
| 534 |
+
```markdown
|
| 535 |
+
# Model Name
|
| 536 |
+
Description of what model does
|
| 537 |
+
|
| 538 |
+
## Usage
|
| 539 |
+
```python
|
| 540 |
+
from transformers import pipeline
|
| 541 |
+
classifier = pipeline("sentiment-analysis",
|
| 542 |
+
model="your-username/model-name")
|
| 543 |
+
```
|
| 544 |
+
|
| 545 |
+
## Training Data
|
| 546 |
+
Trained on [dataset]
|
| 547 |
+
|
| 548 |
+
## Performance
|
| 549 |
+
Accuracy: X%, F1: Y%
|
| 550 |
+
```
|
| 551 |
+
|
| 552 |
+
#### Step 2: Upload
|
| 553 |
+
```python
|
| 554 |
+
from transformers import AutoModelForSequenceClassification
|
| 555 |
+
# push_to_hub is a method on transformers models; no extra import needed
|
| 556 |
+
|
| 557 |
+
model = AutoModelForSequenceClassification.from_pretrained(...)
|
| 558 |
+
model.push_to_hub("your-username/model-name")
|
| 559 |
+
```
|
| 560 |
+
|
| 561 |
+
#### Step 3: Share
|
| 562 |
+
```
|
| 563 |
+
https://huggingface.co/your-username/model-name
|
| 564 |
+
```
|
| 565 |
+
|
| 566 |
+
---
|
| 567 |
+
|
| 568 |
+
## Slide 22: Fine-Tuning Overview
|
| 569 |
+
|
| 570 |
+
### When to Fine-Tune
|
| 571 |
+
```
|
| 572 |
+
New Task ← Need Fine-Tuning → Different Domain/Language
|
| 573 |
+
↓ ↓
|
| 574 |
+
Legal documents Medical texts
|
| 575 |
+
vs BERT trained on Wikipedia vs General BERT
|
| 576 |
+
```
|
| 577 |
+
|
| 578 |
+
### Fine-Tuning Process
|
| 579 |
+
```
|
| 580 |
+
1. Load pre-trained model
|
| 581 |
+
2. Add task-specific head
|
| 582 |
+
3. Train on your data (small learning rate!)
|
| 583 |
+
4. Evaluate and iterate
|
| 584 |
+
5. Deploy!
|
| 585 |
+
```
|
| 586 |
+
|
| 587 |
+
### Example: Custom Sentiment
|
| 588 |
+
```python
|
| 589 |
+
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
|
| 590 |
+
|
| 591 |
+
model = AutoModelForSequenceClassification.from_pretrained(
|
| 592 |
+
"bert-base-uncased",
|
| 593 |
+
num_labels=2
|
| 594 |
+
)
|
| 595 |
+
# Load YOUR labeled data
|
| 596 |
+
train_dataset = load_my_data("path/to/data.csv")
|
| 597 |
+
|
| 598 |
+
trainer = Trainer(
|
| 599 |
+
model=model,
|
| 600 |
+
train_dataset=train_dataset,
|
| 601 |
+
    args=TrainingArguments(output_dir="out", learning_rate=2e-5),  # small learning rate!
|
| 602 |
+
)
|
| 603 |
+
trainer.train()
|
| 604 |
+
```
|
| 605 |
+
|
| 606 |
+
---
|
| 607 |
+
|
| 608 |
+
## Slide 23: Real-World Pipeline Example

### Build a Document Classifier + Summarizer

```
Raw Document
    ↓
[1] Tokenize
    ↓
[2] Classify Intent
    ├─ Legal: Route to legal team
    ├─ Finance: Route to finance team
    └─ Tech: Route to tech team
    ↓
[3] Extract Entities (contract dates, amounts, parties)
    ↓
[4] Summarize (generate executive summary)
    ↓
[5] Store (vector DB for search)
    ↓
Output: Classified, Summarized, Searchable
```

### Code Pattern
```python
from transformers import pipeline

# Each step reusable, composable
classifier_pipe = pipeline("zero-shot-classification")
ner_pipe = pipeline("ner")
summarizer_pipe = pipeline("summarization")

def process_document(doc):
    intent = classifier_pipe(doc, candidate_labels=["legal", "finance", "tech"])
    entities = ner_pipe(doc)
    summary = summarizer_pipe(doc, max_length=100)

    return {
        "intent": intent,
        "entities": entities,
        "summary": summary,
    }
```

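The zero-shot classifier returns a dict whose `labels` are sorted by descending `scores`, so the routing step reduces to reading the top label. A self-contained sketch over a mocked result (the routing table and the sample scores are illustrative):

```python
# Shape of a zero-shot-classification result: labels sorted by descending score
mock_intent = {
    "sequence": "This agreement is governed by the laws of Delaware.",
    "labels": ["legal", "finance", "tech"],
    "scores": [0.91, 0.06, 0.03],
}

ROUTES = {"legal": "legal team", "finance": "finance team", "tech": "tech team"}

def route(intent_result):
    # Top-ranked label decides the destination
    return ROUTES[intent_result["labels"][0]]

print(route(mock_intent))  # legal team
```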
---

## Slide 24: Key Takeaways (Session 2)

### 🔬 Technical Deep Dive
1. **Tokenization is foundational**
   - Text → numerical representations
   - Different tokenizers, different results
   - Handles special tokens and attention masks

2. **Inference is flexible**
   - 5 different task categories
   - Models = transformers + task-specific heads
   - Confidence scores guide automation decisions

3. **Production requires engineering**
   - Monitor, log, optimize
   - Consider performance/speed tradeoffs
   - Fine-tune for domain-specific needs

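The takeaway that confidence scores guide automation decisions can be made concrete with a simple gate: auto-handle high-confidence predictions, escalate the rest. A minimal sketch (the 0.9 threshold and the label values are illustrative choices, not from the session code):

```python
CONFIDENCE_THRESHOLD = 0.9  # illustrative cutoff; tune per task and error cost

def triage(prediction):
    # prediction: {"label": ..., "score": ...} as returned by a text-classification pipeline
    if prediction["score"] >= CONFIDENCE_THRESHOLD:
        return f"auto:{prediction['label']}"
    return "human_review"

print(triage({"label": "POSITIVE", "score": 0.98}))  # auto:POSITIVE
print(triage({"label": "NEGATIVE", "score": 0.61}))  # human_review
```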
### 🚀 Next Steps
1. Experiment with your own data
2. Publish a model to the Hub
3. Fine-tune for your use case
4. Deploy to Spaces (like this app!)

---

## Slide 25: Resources & Links

### Official Documentation
- https://huggingface.co/docs/transformers/
- https://huggingface.co/docs/datasets/
- https://huggingface.co/course/

### Explore Models
- https://huggingface.co/models

### Build & Deploy
- https://huggingface.co/spaces (deploy Gradio apps)
- Hugging Face Hub API for uploading

### Community
- https://discuss.huggingface.co/
- GitHub Issues
- Community Discord

---

## Slide 26: Q&A & Open Discussion

### Questions We Can Answer
- How do I fine-tune for my specific domain?
- What model should I use for task X?
- How do I deploy to production?
- How do I handle multiple languages, and which multilingual models should I use?

### Further Learning
- Session materials on the Hub
- Code examples in the Spaces app
- Documentation links in the Resources tab

---

## Session 2 Complete! 🎉

### You Now Know:
✅ How tokenization works
✅ How to run inference on 5+ NLP tasks
✅ How to interpret model outputs
✅ How to publish and deploy models
✅ How to think about fine-tuning and optimization

### Keep Learning!
The NLP field is rapidly evolving. Stay updated with:
- Hugging Face blog
- Research papers (arXiv)
- Community discussions
- Your own experiments

**Thank you for attending! 🙏**
test
DELETED
@@ -1 +0,0 @@
- xyz
utils.py
ADDED
@@ -0,0 +1,129 @@
"""
Utility functions for HuggingFace Enabling Sessions Spaces app
"""
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from sentence_transformers import SentenceTransformer, util
import numpy as np
from functools import lru_cache
import config


@lru_cache(maxsize=10)
def load_pipeline(task_type: str):
    """Load and cache a pipeline for the given task."""
    if task_type == "sentiment":
        return pipeline("sentiment-analysis", model=config.SENTIMENT_MODEL)
    elif task_type == "ner":
        return pipeline("ner", model=config.NER_MODEL, aggregation_strategy="simple")
    elif task_type == "qa":
        return pipeline("question-answering", model=config.QA_MODEL)
    elif task_type == "summarization":
        return pipeline("summarization", model=config.SUMMARIZATION_MODEL, max_length=150, min_length=30)
    else:
        raise ValueError(f"Unknown task type: {task_type}")


@lru_cache(maxsize=2)
def load_embedding_model(model_name: str):
    """Load and cache a SentenceTransformer so it is not rebuilt on every call."""
    return SentenceTransformer(model_name)


def run_sentiment_analysis(text: str):
    """Run sentiment analysis on text."""
    pipe = load_pipeline("sentiment")
    result = pipe(text[:512])  # Truncate to avoid token limit
    return result[0] if result else {"label": "Unknown", "score": 0}


def run_ner(text: str):
    """Run Named Entity Recognition on text."""
    pipe = load_pipeline("ner")
    result = pipe(text[:512])
    return result


def run_qa(context: str, question: str):
    """Run question answering on context."""
    pipe = load_pipeline("qa")
    try:
        result = pipe(question=question, context=context[:512])
        return result
    except Exception as e:
        return {"error": str(e), "answer": "Unable to answer", "score": 0}


def run_summarization(text: str):
    """Generate summary of text."""
    pipe = load_pipeline("summarization")
    try:
        result = pipe(text[:1024], max_length=150, min_length=30, do_sample=False)
        return result[0]["summary_text"] if result else "Unable to summarize"
    except Exception as e:
        return f"Error: {str(e)}"


def compute_similarity(text1: str, text2: str):
    """Compute semantic similarity between two texts."""
    try:
        model = load_embedding_model(config.EMBEDDINGS_MODEL)
        embeddings = model.encode([text1, text2], convert_to_tensor=True)
        similarity = util.pytorch_cos_sim(embeddings[0], embeddings[1])
        return float(similarity.item())
    except Exception as e:
        return f"Error: {str(e)}"


def tokenize_text(text: str, model_name: str = config.SENTIMENT_MODEL):
    """Tokenize text and show tokens with IDs."""
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        encoding = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

        tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0])
        token_ids = encoding["input_ids"][0].tolist()
        attention_mask = encoding["attention_mask"][0].tolist()

        result = {
            "tokens": tokens,
            "token_ids": token_ids,
            "attention_mask": attention_mask,
            "num_tokens": len(tokens),
        }
        return result
    except Exception as e:
        return {"error": str(e)}


def format_tokenizer_output(tokenization_result):
    """Format tokenization result for display."""
    if "error" in tokenization_result:
        return f"Error: {tokenization_result['error']}"

    tokens = tokenization_result["tokens"]
    token_ids = tokenization_result["token_ids"]

    output = f"**Total Tokens:** {tokenization_result['num_tokens']}\n\n"
    output += "| Token | Token ID | Attention Mask |\n"
    output += "|-------|----------|----------------|\n"

    for token, tid, attn in zip(
        tokens,
        token_ids,
        tokenization_result["attention_mask"]
    ):
        output += f"| {token} | {tid} | {attn} |\n"

    return output


def format_ner_output(ner_results):
    """Format NER results for display."""
    if not ner_results:
        return "No entities found"

    output = "| Entity | Type | Score |\n"
    output += "|--------|------|-------|\n"

    for result in ner_results:
        word = result.get("word", "")
        entity_type = result.get("entity_group", result.get("entity", ""))
        score = result.get("score", 0)
        output += f"| {word} | {entity_type} | {score:.4f} |\n"

    return output
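`load_pipeline` leans on `functools.lru_cache` so each heavy pipeline is constructed once per process and then served from cache. The caching behavior in isolation, with a cheap stand-in loader instead of a real pipeline (the `load_dummy` function and call counter are illustrative only):

```python
from functools import lru_cache

CALLS = {"n": 0}

@lru_cache(maxsize=10)
def load_dummy(task_type: str):
    # Stand-in for load_pipeline: expensive construction runs once per task_type
    CALLS["n"] += 1
    return f"pipeline<{task_type}>"

load_dummy("sentiment")
load_dummy("sentiment")  # served from cache, no second construction
load_dummy("ner")
print(CALLS["n"])  # 2
```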