shourya committed on
Commit
10b8b4e
1 Parent(s): 1ac4772

Add HuggingFace Enabling Sessions interactive demo


- Created Gradio app with 3 tabs (Session 1 intro, Session 2 hands-on, Resources)
- Added configuration for 5 NLP tasks (Sentiment, NER, QA, Summarization, Similarity)
- Created utility functions for model inference and tokenization
- Added comprehensive slide content (SESSION1, SESSION2)
- Added sample data and demo examples
- Created speaker notes with detailed presentation guide
- Updated README with full documentation

Features:
- Live inference playground with 5+ NLP tasks
- Interactive tokenization explorer
- Sample data for all demonstrations
- Production-ready configuration
- Attendee-friendly resources tab

README.md CHANGED
@@ -9,4 +9,213 @@ app_file: app.py
9
  pinned: false
10
  ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
9
  pinned: false
10
  ---
11
 
12
+ # 🤗 HuggingFace Enabling Sessions
13
+ **Interactive Demo Platform for Transformers, Hub APIs, and NLP Pipelines**
14
+
15
+ ## 📋 Overview
16
+
17
+ This is an interactive Gradio application designed for the **HuggingFace Enabling Sessions** workshop. It provides hands-on demonstrations of:
18
+
19
+ - **Session 1 (45 min):** Introduction to the HuggingFace ecosystem, Transformers architecture, and best practices
20
+ - **Session 2 (90 min):** Hands-on developer workshop with tokenization deep dives and inference playground across 5+ NLP tasks
21
+
22
+ ## 🚀 Quick Start
23
+
24
+ The app is hosted on HuggingFace Spaces and requires **no local installation**. Simply:
25
+
26
+ 1. Open the Spaces URL
27
+ 2. Explore the 3 main tabs:
28
+ - **Session 1: Introduction** — Embedded slides + live NLP demos
29
+ - **Session 2: Hands-On Developer** — Tokenizer explorer + inference playground
30
+ - **Resources & Next Steps** — Documentation links and learning resources
31
+
32
+ ## 📚 Session Contents
33
+
34
+ ### Session 1: Introduction to HuggingFace (45 minutes)
35
+ **Topics Covered:**
36
+ - HuggingFace Platform overview (Hub, Transformers, Datasets, Spaces)
37
+ - Core abstractions: Pipelines, Models, Tokenizers
38
+ - Architecture patterns: Encoders (BERT), Decoders (GPT), Encoder-Decoders (T5/BART)
39
+ - Enterprise NLP landscape (licensing, open-source vs. commercial)
40
+
41
+ **Live Demos:**
42
+ - Sentiment Analysis using DistilBERT
43
+ - Named Entity Recognition (NER) with BERT
44
+
45
+ **Materials:** [SESSION1_SLIDES.md](slides/SESSION1_SLIDES.md)
46
+
47
+ ---
48
+
49
+ ### Session 2: Hands-On Developer Workshop (90 minutes)
50
+ **Topics Covered:**
51
+ - Tokenization mechanics and strategies
52
+ - Inference across 5+ NLP tasks
53
+ - Understanding model outputs and confidence scores
54
+ - Production considerations and optimization
55
+
56
+ **Interactive Tasks:**
57
+ - 🔤 **Tokenization Explorer** — Visualize how text becomes token IDs
58
+ - 📊 **Sentiment Analysis** — Classify text emotions
59
+ - 🏷️ **Named Entity Recognition** — Extract persons, organizations, locations
60
+ - ❓ **Question Answering** — Answer questions from context
61
+ - 📝 **Text Summarization** — Generate concise summaries
62
+ - 🔗 **Semantic Similarity** — Compare text meaning
63
+
64
+ **Materials:** [SESSION2_SLIDES.md](slides/SESSION2_SLIDES.md)
65
+
66
+ ---
67
+
68
+ ## 🛠️ Project Structure
69
+
70
+ ```
71
+ infy/
72
+ ├── app.py # Main Gradio application
73
+ ├── config.py # Configuration (model IDs, task definitions)
74
+ ├── utils.py # Utility functions for inference
75
+ ├── requirements.txt # Python dependencies
76
+ ├── README.md # This file
77
+ ├── SPEAKER_NOTES.md # Presenter guide with timing
78
+ ├── slides/
79
+ │ ├── SESSION1_SLIDES.md # Session 1 presentation content
80
+ │ └── SESSION2_SLIDES.md # Session 2 presentation content
81
+ └── data/
82
+ ├── sample_texts.csv # Sample texts for demos
83
+ └── demo_samples/
84
+ ├── sentiment.txt
85
+ ├── ner.txt
86
+ ├── qa.txt
87
+ ├── summarization.txt
88
+ └── embeddings.txt
89
+ ```
90
+
91
+ ## 🤖 Models Used
92
+
93
+ | Task | Model | Type | License |
94
+ |------|-------|------|---------|
95
+ | Sentiment Analysis | distilbert-base-uncased-finetuned-sst-2-english | Encoder | Apache 2.0 |
96
+ | Named Entity Recognition | dslim/bert-base-NER | Encoder | Apache 2.0 |
97
+ | Question Answering | deepset/roberta-base-squad2 | Encoder | CC BY 4.0 |
98
+ | Summarization | facebook/bart-large-cnn | Encoder-Decoder | MIT |
99
+ | Semantic Similarity | sentence-transformers/all-MiniLM-L6-v2 | Encoder | Apache 2.0 |
100
+
101
+ ## 📖 How to Use
102
+
103
+ ### During Sessions
104
+
105
+ 1. **Access the Spaces URL** — Attendees join via shared link
106
+ 2. **Session 1 (45 min)**
107
+ - Presenter shares their screen and narrates the slides
108
+ - Live demos showcase "click-to-run" NLP tasks
109
+ - Q&A after each major section
110
+
111
+ 3. **Session 2 (90 min)**
112
+ - Presenter guides attendees through tokenization and inference
113
+ - Attendees observe interactive widgets
114
+ - Exercise checkpoints for hands-on exploration
115
+ - Discussion on production considerations
116
+
117
+ ### After Sessions
118
+
119
+ 1. **Clone the repository:**
120
+ ```bash
121
+ git clone https://huggingface.co/spaces/[your-username]/infy
122
+ ```
123
+
124
+ 2. **Install dependencies:**
125
+ ```bash
126
+ pip install -r requirements.txt
127
+ ```
128
+
129
+ 3. **Run locally:**
130
+ ```bash
131
+ python app.py
132
+ ```
133
+
134
+ 4. **Explore further:**
135
+ - Modify sample data in `data/sample_texts.csv`
136
+ - Add more models to `config.py`
137
+ - Create custom tasks in `app.py`
138
+
139
+ ## 🎓 Learning Resources
140
+
141
+ ### Official Documentation
142
+ - [Transformers Library Docs](https://huggingface.co/docs/transformers/)
143
+ - [Datasets Library Docs](https://huggingface.co/docs/datasets/)
144
+ - [HuggingFace Course (Free)](https://huggingface.co/course/)
145
+ - [Hub Documentation](https://huggingface.co/docs/hub/)
146
+
147
+ ### Model Hub
148
+ - Browse 100K+ models: https://huggingface.co/models
149
+ - Search by task, language, or architecture
150
+
151
+ ### Community
152
+ - [HuggingFace Forums](https://discuss.huggingface.co/)
153
+ - [GitHub Issues](https://github.com/huggingface/transformers/issues)
154
+ - Twitter: [@huggingface](https://twitter.com/huggingface)
155
+
156
+ ### Next Steps
157
+ - **Fine-tune on your data** — Adapt pre-trained models for domain-specific tasks
158
+ - **Deploy to Spaces** — Create interactive demos like this
159
+ - **Publish to the Hub** — Share models and datasets with the community
160
+ - **Explore advanced techniques** — Quantization, distillation, multi-model pipelines
161
+
162
+ ## 🔧 Customization
163
+
164
+ ### Add a New Task
165
+
166
+ 1. **Add model to `config.py`:**
167
+ ```python
168
+ "new_task": {
169
+ "name": "Task Name",
170
+ "model": "model-id-from-hub",
171
+ "example": "example text",
172
+ }
173
+ ```
174
+
175
+ 2. **Add function to `utils.py`:**
176
+ ```python
177
+ def run_new_task(text):
178
+ pipe = load_pipeline("new_task")
179
+ return pipe(text)
180
+ ```
181
+
182
+ 3. **Add widget to `app.py`:**
183
+ ```python
184
+ with gr.Tab("New Task"):
185
+ input_box = gr.Textbox()
186
+ output_box = gr.Markdown()
187
+ btn.click(run_new_task, inputs=[input_box], outputs=[output_box])
188
+ ```
189
+
190
+ ### Modify Sample Data
191
+
192
+ Edit `data/sample_texts.csv` or add `.txt` files to `data/demo_samples/`
193
+
194
+ ## 📝 Environment
195
+
196
+ - **Python:** 3.8+
197
+ - **Framework:** Gradio 6.9.0
198
+ - **ML:** Transformers, Torch
199
+ - **Hosting:** HuggingFace Spaces
200
+
201
+ ## 📄 License
202
+
203
+ This project is open-source and available for educational and commercial use. Model licenses vary—see individual model cards for details.
204
+
205
+ ## 👨‍🏫 Presenter Notes
206
+
207
+ See [SPEAKER_NOTES.md](SPEAKER_NOTES.md) for:
208
+ - Session timing breakdowns
209
+ - Demo sequences and talking points
210
+ - Troubleshooting common issues
211
+ - Tips for live presentations
212
+
213
+ ## 📧 Questions & Feedback
214
+
215
+ - Ask during the sessions
216
+ - Post on HuggingFace Forums
217
+ - Follow up on company Slack/Teams
218
+
219
+ ---
220
+
221
+ **Ready to dive into NLP? Start with Session 1: Introduction! 🚀**
SPEAKER_NOTES.md ADDED
@@ -0,0 +1,842 @@
1
+ # Speaker Notes & Presentation Guide
2
+ ## HuggingFace Enabling Sessions
3
+
4
+ ---
5
+
6
+ ## Table of Contents
7
+ 1. [Pre-Session Setup](#pre-session-setup)
8
+ 2. [Session 1: Introduction (45 min)](#session-1-introduction-45-min)
9
+ 3. [Session 2: Hands-On Developer (90 min)](#session-2-hands-on-developer-90-min)
10
+ 4. [Troubleshooting Guide](#troubleshooting-guide)
11
+ 5. [Tips for Effective Presentation](#tips-for-effective-presentation)
12
+
13
+ ---
14
+
15
+ ## Pre-Session Setup
16
+
17
+ ### Before You Start
18
+ - ✅ **Test the Spaces app** — Load the URL and verify all widgets work
19
+ - ✅ **Test internet connection** — Ensure model downloads work smoothly
20
+ - ✅ **Download slides** — Have slides available locally (PowerPoint/PDF)
21
+ - ✅ **Prepare talking points** — Familiarize yourself with content
22
+ - ✅ **Set up screen sharing** — Test camera + screen share on your platform
23
+ - ✅ **Time check** — Confirm session durations with organizers
24
+
25
+ ### Spaces URL
26
+ - Share this link with attendees: `https://huggingface.co/spaces/[username]/infy`
27
+ - It is the single platform for the entire workshop
28
+
29
+ ### Materials Distribution
30
+ 1. **During:** Share Spaces URL in chat/email
31
+ 2. **After:** Email attendees:
32
+ - Spaces link (for reference)
33
+ - Slide markdown files (`SESSION1_SLIDES.md`, `SESSION2_SLIDES.md`)
34
+ - GitHub/GitLab link (if you want them to explore code)
35
+ - Links to HF documentation
36
+
37
+ ### Tech Setup Checklist
38
+ - [ ] Spaces URL loads successfully
39
+ - [ ] All 3 tabs visible (Session 1, Session 2, Resources)
40
+ - [ ] Demo widgets respond to clicks
41
+ - [ ] Screen resolution adequate for showing text (1080p+ recommended)
42
+ - [ ] Audio working (presenter test)
43
+ - [ ] Network stable (no VPN dropouts)
44
+
45
+ ---
46
+
47
+ ## Session 1: Introduction (45 min)
48
+
49
+ ### Overview
50
+ **Goal:** Build foundation understanding of HuggingFace ecosystem
51
+ **Audience:** Mixed technical levels (devs, product, non-technical)
52
+ **Format:** Slides + 2 live demos
53
+
54
+ ### Timing Breakdown
55
+ ```
56
+ 0:00-0:05 Title + Agenda (5 min)
57
+ 0:05-0:15 What is HuggingFace (10 min)
58
+ 0:15-0:25 Core Abstractions - Pipelines (10 min)
59
+ → DEMO 1: Sentiment Analysis
60
+ 0:25-0:35 Architecture Patterns (10 min)
61
+ → Quick review of Encoders/Decoders
62
+ 0:35-0:40 Enterprise Landscape (5 min)
63
+ 0:40-0:45 Q&A (5 min)
64
+ ```
65
+
66
+ ### Talking Points by Section
67
+
68
+ #### 0:00-0:05 | Title + Agenda
69
+ **What to Say:**
70
+ > "Good morning! Welcome to the HuggingFace Enabling Sessions. Today we're going to explore one of the most exciting toolkits in modern NLP. By the end of this session, you'll understand how transformers work, why they're revolutionary, and how to use them practically."
71
+
72
+ **Q&A Preparation:**
73
+ - Have answers for: "Do I need to know machine learning?"
74
+ - Response: "No! We start from basics. If you code at all, you'll follow along."
75
+
76
+ #### 0:05-0:15 | What is HuggingFace (Slide 3-4)
77
+ **Key Message:**
78
+ HuggingFace democratized NLP — it took models out of research papers and made them accessible to everyone
79
+
80
+ **Talking Points:**
81
+ 1. **The Problem (Pre-2017):**
82
+ - Great models existed... scattered in papers
83
+ - Training from scratch was standard
84
+ - Weeks/months of work per project
85
+
86
+ 2. **The Solution (HuggingFace 2018+):**
87
+ - Centralized Hub: 100K+ ready-to-use models
88
+ - One library: `transformers`
89
+ - One command: `from transformers import pipeline`
90
+
91
+ 3. **Why It Matters:**
92
+ - NLP went from months → weeks → days
93
+ - No longer need GPU clusters for basic work
94
+ - Community benefits from shared knowledge
95
+
96
+ **Live Example (Don't run yet, just show)**
97
+ ```python
98
+ from transformers import pipeline
99
+ classifier = pipeline("sentiment-analysis")
100
+ print(classifier("I love this!"))
101
+ # Output: [{'label': 'POSITIVE', 'score': 0.9987}]
102
+ # That's it! 2 lines of code.
103
+ ```
104
+
105
+ **Talking Point:**
106
+ > "Imagine if you had to train a sentiment classifier from scratch. That would take weeks — data collection, labeling, infrastructure setup, all of it. But with HuggingFace, you get this in 2 lines."
107
+
108
+ #### 0:15-0:25 | Core Abstractions - Pipelines (Slide 5-7)
109
+
110
+ **Key Concept:**
111
+ Think of Pipelines like API calls — you don't need to know how the engine works, just what inputs/outputs it expects
112
+
113
+ **Slide Walkthrough:**
114
+ 1. **Pipelines** (Slide 5)
115
+ - High-level interface
116
+ - 12+ tasks available
117
+ - Perfect for demos and quick POCs
118
+
119
+ 2. **Models & Tokenizers** (Slide 6)
120
+ - Tokenizers: Text → Numbers (tokens)
121
+ - Models: Process tokens → Predictions
122
+
123
+ 3. **Show the flow:**
124
+ ```
125
+ "I love NLP"
126
+
127
+ Tokenizer: ["I", "love", "NLP"] → [1045, 2572, 28600]
128
+
129
+ Model processes: neural network magic ✨
130
+
131
+ Output: Sentiment prediction
132
+ ```
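The flow above can be sketched in plain Python. The vocabulary and "model" here are stand-ins for illustration only — in the real library, the tokenizer's vocabulary ships with the model and the prediction comes from a neural network:

```python
# Toy illustration of the tokenize -> model -> output flow.
# VOCAB and fake_model are invented for illustration; a real pipeline
# uses the tokenizer and weights shipped with the model.
VOCAB = {"i": 1045, "love": 2572, "nlp": 28600}

def tokenize(text):
    """Text -> token IDs via a vocabulary lookup."""
    return [VOCAB[tok] for tok in text.lower().split()]

def fake_model(token_ids):
    """Stand-in for the neural network: maps IDs to a sentiment label."""
    return {"label": "POSITIVE", "score": 0.99}

ids = tokenize("I love NLP")
print(ids)              # [1045, 2572, 28600]
print(fake_model(ids))  # {'label': 'POSITIVE', 'score': 0.99}
```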
133
+
134
+ **Talking Point for Non-Technical Audience:**
135
+ > "Think of a tokenizer like Rosetta Stone — it translates human language into a format the model understands. Then the model, which has learned patterns from billions of text examples, makes an intelligent prediction."
136
+
137
+ ---
138
+
139
+ #### **>>> DEMO 1: Sentiment Analysis <<<**
140
+ **Timing:** 5 minutes (embedded in 0:15-0:25 section)
141
+
142
+ **Setup:**
143
+ - Open Spaces URL
144
+ - Navigate to **Session 1: Introduction** tab
145
+ - Scroll to **Demo 1: Sentiment Analysis**
146
+
147
+ **Demo Sequence:**
148
+ 1. **Show the input box** with pre-filled text: "I absolutely love this product! It's amazing and works perfectly."
149
+ 2. **Click "Analyze Sentiment" button**
150
+ 3. **Show result:**
151
+ ```
152
+ Label: POSITIVE
153
+ Confidence: 0.9987
154
+ ```
155
+ 4. **Narrate:**
156
+ > "See? Instant sentiment analysis! The model is 99.87% confident this is positive. That confidence score is critical in production — if it's >90%, we can automate. If it's 60-70%, we might need human review."
157
+
158
+ 5. **Try a second example** (copy-paste into input):
159
+ ```
160
+ "This is the worst experience I've ever had with a company."
161
+ ```
162
+ - Click button, show result: NEGATIVE (0.9995)
163
+
164
+ 6. **Narrate the difference:**
165
+ > "Notice both examples have high confidence. But imagine if the sentiment was ambiguous — 'The product is okay.' You might get 55% confidence. That's when you flag it for human review."
166
+
167
+ **Talking Points:**
168
+ - This model was fine-tuned on ~67K movie-review sentences (SST-2)
169
+ - Generalizes well to product reviews too
170
+ - Fine-tuned, not trained from scratch
171
+ - Key concept: Transfer learning (pre-training → fine-tuning)
172
+
173
+ **If Demo Fails:**
174
+ - Say: "Unfortunately, the model server is having a moment. But I can tell you what would happen..."
175
+ - Show cached output screenshot
176
+ - Continue to next section
177
+
178
+ ---
179
+
180
+ #### 0:25-0:35 | Architecture Patterns (Slide 8-11)
181
+
182
+ **Objective:**
183
+ Attendees should leave understanding:
184
+ 1. **ENCODERS** (BERT family) — for understanding/classification
185
+ 2. **DECODERS** (GPT family) — for generation
186
+ 3. **ENCODER-DECODERS** (T5/BART) — for structured tasks
187
+
188
+ **Presentation Strategy:**
189
+ - Don't go too deep into math
190
+ - Use analogies and comparisons
191
+ - Show the decision tree (Slide 11)
192
+
193
+ **Slide 8: Encoders (BERT)**
194
+ > "Imagine reading a sentence twice — once left-to-right, once right-to-left, simultaneously. That's bidirectional processing. BERT reads the whole sentence before making decisions. This makes it great for understanding context."
195
+
196
+ **Example:**
197
+ ```
198
+ "The bank approved my loan"
199
+ Read both directions:
200
+ - Left→Right: understand "bank" in financial context
201
+ - Right→Left: understand "approved" is positive
202
+ Decision: POSITIVE sentiment
203
+ ```
204
+
205
+ **Slide 9: Decoders (GPT)**
206
+ > "GPT is like a predictive text machine. Given 'I love...', it predicts the next word. It does this repeatedly to generate entire paragraphs. It only looks backward (at previous words) — it doesn't know what comes next."
207
+
208
+ **Example:**
209
+ ```
210
+ Input: "Once upon a time..."
211
+ GPT generates: "there was a kingdom far away"
212
+ (word by word, based on learned patterns)
213
+ ```
214
+
215
+ **Slide 10: Encoder-Decoders (T5/BART)**
216
+ > "These combine the best of both worlds. The encoder reads the full input, then passes this understanding to the decoder, which generates output. Perfect for translation, summarization, Q&A."
217
+
218
+ **Example (Translation):**
219
+ ```
220
+ Encoder reads: "Hello, how are you?"
221
+ Decoder uses that understanding to generate: "Bonjour, comment allez-vous?"
222
+ ```
223
+
224
+ **Slide 11: Decision Tree**
225
+ > "Here's the rule of thumb: Need to understand/classify? Use BERT. Need to generate text? Use GPT. Need to transform input to output? Use T5."
226
+
227
+ **Talking Point:**
228
+ > "The beautiful thing about HuggingFace is you don't need to make these distinctions manually. You can just use the Pipeline API. But as engineers, understanding these differences helps us choose the right tool for the job and troubleshoot when things go wrong."
229
+
230
+ #### 0:35-0:40 | Enterprise Landscape (Slide 12-14)
231
+
232
+ **Goal:**
233
+ Attendees understand production considerations
234
+
235
+ **Key Points:**
236
+
237
+ **Slide 12: Licensing**
238
+ - Most HF models are open-source (MIT, Apache)
239
+ - Some require special agreements (OpenRAIL)
240
+ - Commercial models available but need API access
241
+ - Always check model card before shipping
242
+
243
+ **Talking Point:**
244
+ > "In enterprise, we can't just pick a model randomly. We need to ensure licensing is compatible. An MIT-licensed model? Green light. Some proprietary model? Need to check terms first."
245
+
246
+ **Slide 13: Trade-offs**
247
+ ```
248
+ Large Models = Better, Slower
249
+ Small Models = Fast, Okay quality
250
+
251
+ Example:
252
+ - BERT-large: 95% accuracy, 500ms latency
253
+ - DistilBERT: 94% accuracy, 100ms latency ← Often the sweet spot
254
+ ```
255
+
256
+ **Talking Point:**
257
+ > "In production, DistilBERT is often the best choice. 40% faster, only 3% less accurate. That's what we call 'Pareto optimal' — you can't get better without significant cost."
258
+
259
+ **Slide 14: Real-World Example**
260
+ Walk through customer support chatbot pipeline:
261
+
262
+ 1. **Intent Classification** — "What does customer want?"
263
+ 2. **Entity Extraction** — "Who, what, when?"
264
+ 3. **Knowledge Search** — "Find relevant solutions"
265
+ 4. **Response Generation** — "Compose personalized answer"
266
+
267
+ **Talking Point:**
268
+ > "This is production NLP. Not one model, but orchestrated pipeline of models. Each trained for specific task. Each could be swapped out if performance isn't good enough."
269
+
270
+ #### 0:40-0:45 | Q&A
271
+
272
+ **Questions to Anticipate:**
273
+ 1. **"Do I need to train models myself?"**
274
+ - Answer: No! Start with pre-trained. Only fine-tune if needed.
275
+
276
+ 2. **"How do I use this for my company data?"**
277
+ - Answer: Two paths:
278
+ - Quick: Use `transformers.pipeline()` on your data
279
+ - Better: Fine-tune the model on your data (Session 2 mentions this)
280
+
281
+ 3. **"How do I deploy this to production?"**
282
+ - Answer: We'll touch on this in Session 2 and Resources tab.
283
+
284
+ 4. **"Can I use these models in my mobile app?"**
285
+ - Answer: Yes, quantized versions fit on phones (DistilBERT ~25MB)
286
+
287
+ **Transition to Session 2:**
288
+ > "That's Session 1! You now understand WHAT HuggingFace is and WHY it's powerful. Session 2 is all about HOW to use it. We'll dive into tokenization, run live inference on multiple tasks, and explore production considerations. See you in Session 2! ☕ (Break time)"
289
+
290
+ ---
291
+
292
+ ## Session 2: Hands-On Developer (90 min)
293
+
294
+ ### Overview
295
+ **Goal:** Hands-on exploration of 5+ NLP tasks, deep understanding of inference
296
+ **Audience:** Assumes basic programming knowledge (can read Python)
297
+ **Format:** Interactive slides + live widgets + discussion
298
+
299
+ ### Timing Breakdown
300
+ ```
301
+ 0:00-0:05 Welcome + Recap (5 min)
302
+ 0:05-0:20 Tokenization Deep Dive (15 min)
303
+ → DEMO: Tokenization Explorer
304
+ 0:20-1:05 Inference Playground (45 min)
305
+ → 5 interactive task demos
306
+ (Sentiment, NER, QA, Summarization, Similarity)
307
+ 1:05-1:15 Exercise Checkpoint (10 min)
308
+ (Attendees explore, ask questions)
309
+ 1:15-1:20 Production Considerations (5 min)
310
+ 1:20-1:25 Wrap-up + Resources (5 min)
311
+ 1:25-1:30 Q&A + Office Hours (5 min)
312
+ ```
313
+
314
+ ### Talking Points by Section
315
+
316
+ #### 0:00-0:05 | Welcome + Recap
317
+
318
+ **Talking Point:**
319
+ > "Welcome back! In Session 1, we learned about the HuggingFace ecosystem and different model architectures. Today is hands-on. We'll write almost no code — instead, we'll click buttons and see models in action. By the end, you'll understand how inference actually works under the hood."
320
+
321
+ **Quick Session 1 Recap (1 min):**
322
+ - Pipelines = High-level interface
323
+ - Models = Neural networks trained on specific tasks
324
+ - Transformers = Attention-based architecture
325
+
326
+ #### 0:05-0:20 | Tokenization Deep Dive (Slide 4-10)
327
+
328
+ **Core Message:**
329
+ Tokenization bridges human language ↔ machine numbers. It's where "magic" starts.
330
+
331
+ **Slide 4: What is Tokenization?**
332
+ > "Why does this matter? Because models are mathematics engines. They understand vectors and matrices, not words. Tokenization converts text into vectors. Get tokenization wrong, and the model fails. Get it right, and inference works beautifully."
333
+
334
+ **Slide 5: Step-by-Step Breakdown**
335
+
336
+ Walk through example word-by-word:
337
+ ```
338
+ Input: "Hello, how are you?"
339
+
340
+ Step 1: Break into tokens
341
+ ["Hello", ",", "how", "are", "you", "?"]
342
+
343
+ Step 2: Look up IDs (each tokenizer has a vocabulary)
344
+ [7592, 1010, 2129, 2024, 2017, 1029]
345
+
346
+ Step 3: Add special tokens
347
+ [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
348
+ ↑ (CLS token) ↑ (SEP token)
349
+
350
+ Step 4: Create attention mask (1=real, 0=padding)
351
+ [1, 1, 1, 1, 1, 1, 1, 1]
352
+
353
+ Step 5: Model reads numerical vectors
354
+ [embeddings] → Model processes → [predictions]
355
+ ```
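The five steps can be reproduced in a few lines. This sketch hard-codes the vocabulary entries shown above; in practice `AutoTokenizer.from_pretrained("bert-base-uncased")` does all of this in one call against the full vocabulary file:

```python
# Reproduce the five tokenization steps with the IDs shown above.
# The mini-vocabulary is hard-coded for illustration.
VOCAB = {"hello": 7592, ",": 1010, "how": 2129, "are": 2024,
         "you": 2017, "?": 1029}
CLS_ID, SEP_ID = 101, 102

def encode(text):
    # Step 1: break into tokens (naive split on spaces and punctuation)
    tokens = text.replace(",", " ,").replace("?", " ?").lower().split()
    # Step 2: look up IDs in the vocabulary
    ids = [VOCAB[t] for t in tokens]
    # Step 3: add special tokens
    ids = [CLS_ID] + ids + [SEP_ID]
    # Step 4: attention mask (all real tokens here, no padding)
    mask = [1] * len(ids)
    return ids, mask

ids, mask = encode("Hello, how are you?")
print(ids)   # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 1]
```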
356
+
357
+ **Talking Point:**
358
+ > "Notice the [CLS] and [SEP] tokens at the beginning and end? BERT adds those automatically. They're like punctuation for the model. This is why you don't need to manually add them — the tokenizer does it."
359
+
360
+ **Slide 6: Tokenization Strategies**
361
+
362
+ Explain three approaches with examples:
363
+
364
+ 1. **Word-level:** "uncharacteristic" → ["uncharacteristic"]
365
+ - Problem: New words never seen before = [UNK] (unknown)
366
+
367
+ 2. **Character-level:** "uncharacteristic" → ["u", "n", "c", "h", "a", "r", ...]
368
+ - Problem: Sequence too long, hard to learn patterns
369
+
370
+ 3. **Subword (✅ Recommended):** "uncharacteristic" → ["un", "character", "istic"]
371
+ - Sweet spot! Balances vocabulary size vs. sequence length
372
+
373
+ **Talking Point:**
374
+ > "Subword tokenization is the Goldilocks solution. Not too many tokens, not too few. And it handles new words gracefully by breaking them into known pieces."
375
+
376
+ **Slide 7: Common Tokenizers**
377
+
378
+ Different models use different tokenizers:
379
+ - BERT → WordPiece
380
+ - GPT → BPE (Byte-Pair Encoding)
381
+ - T5 → SentencePiece
382
+
383
+ **Talking Point:**
384
+ > "Important: If you train a model with one tokenizer, you MUST use the same tokenizer at inference. Mixing tokenizers = garbage outputs. Always load from the same model directory."
385
+
386
+ ---
387
+
388
+ #### **>>> DEMO: Tokenization Explorer <<<**
389
+ **Timing:** 5 minutes (within 0:05-0:20 section)
390
+
391
+ **Setup:**
392
+ - Open Spaces URL → **Session 2** tab
393
+ - Scroll to **Part 1: Tokenization Explorer**
394
+ - Input box pre-filled with "Hello, how are you?"
395
+
396
+ **Demo Sequence:**
397
+
398
+ 1. **Narrate:**
399
+ > "Let's see what tokenization actually produces. I'll click the tokenize button and we'll see how the text is broken down."
400
+
401
+ 2. **Click "Tokenize" button**
402
+
403
+ 3. **Show output table:**
404
+ ```
405
+ Token Token ID Attention Mask
406
+ ────────────────────────────────────────
407
+ [CLS] 101 1
408
+ hello 7592 1
409
+ , 1010 1
410
+ how 2129 1
411
+ are 2024 1
412
+ you 2017 1
413
+ ? 1029 1
414
+ [SEP] 102 1
415
+ Total: 8 tokens
416
+ ```
417
+
418
+ 4. **Narrate:**
419
+ > "Notice each token got a unique ID. [CLS] is 101, 'hello' is 7592. The model learned these associations during pre-training. Attention mask is all 1s because all tokens are real — no padding."
420
+
421
+ 5. **Try a longer example** (paste into input):
422
+ ```
423
+ "The Hugging Face transformers library revolutionized NLP by providing easy access to pre-trained models for various tasks."
424
+ ```
425
+ - Click "Tokenize"
426
+ - Show output: ~25 tokens
427
+
428
+ 6. **Narrate:**
429
+ > "All tokens have attention_mask=1, meaning the model will pay attention to all of them. But if this sentence were part of a batch, and other sequences were shorter, we'd pad this one and set attention_mask=0 for the padding positions. This signals to the model: 'Ignore these fake tokens.'"
430
+
431
+ **Key Insight to Drive Home:**
432
+ > "This is the foundation. Everything that happens next — classification, generation, Q&A — all relies on this tokenization step being correct. That's why understanding it is so important."
433
+
434
+ ---
435
+
436
+ #### 0:20-1:05 | Inference Playground (Slide 11-16)
437
+
438
+ **Setup:**
439
+ - Switch to **Session 2: Hands-On Developer** tab
440
+ - Navigate to **Part 2: Inference Playground**
441
+ - Show the 5 tabs: Sentiment, NER, QA, Summarization, Similarity
442
+
443
+ **Preamble:**
444
+ > "Now we'll explore 5 different NLP tasks. Each uses a different model, but the flow is similar: Tokenize → Model → Output. Let's see how each one works."
445
+
446
+ ---
447
+
448
+ #### **Task 1: Sentiment Analysis (Slide 12)**
449
+
450
+ **Talking Point:**
451
+ > "Sentiment analysis classifies text emotions. This model was fine-tuned on 67K movie reviews. Let's test it."
452
+
453
+ **Demo:**
454
+ 1. Click **Sentiment Analysis** tab
455
+ 2. Input box shows: "I absolutely love this product! It's amazing and works perfectly."
456
+ 3. Click **Analyze** button
457
+ 4. Result: `POSITIVE (0.9987 confidence)`
458
+
459
+ **Narrate:**
460
+ > "99.87% confident this is positive. That's very high — we can automate decisions at this confidence. Now let's try something ambiguous."
461
+
462
+ 5. **Clear input, paste:**
463
+ ```
464
+ "The product is okay but could be better."
465
+ ```
466
+ Click **Analyze**
467
+
468
+ 6. Narrate the result:
469
+ > "Maybe 65% confident? When confidence drops to this range, we should flag for human review. That's how you build production systems — not fully automated, but smart about what to automate."
470
+
471
+ **Use Case Mention:**
472
+ > "Use cases: Social media monitoring, customer feedback analysis, support ticket triage, review classification."
473
+
474
+ ---
475
+
476
+ #### **Task 2: Named Entity Recognition (Slide 13)**
477
+
478
+ **Talking Point:**
479
+ > "NER identifies who, what, where, when in text. This model recognizes persons, organizations, locations, miscellaneous entities."
480
+
481
+ **Demo:**
482
+ 1. Click **Named Entity Recognition** tab
483
+ 2. Input: "Apple Inc. was founded by Steve Jobs in Cupertino, California."
484
+ 3. Click **Extract Entities** button
485
+ 4. Output table shows:
486
+ ```
487
+ Entity Type Score
488
+ ────────────────────────────────────────
489
+ Apple Inc. Organization 0.99
490
+ Steve Jobs Person 0.98
491
+ Cupertino Location 0.92
492
+ California Location 0.94
493
+ ```
494
+
495
+ **Narrate:**
496
+ > "This is incredibly useful. Imagine you have thousands of documents. You can automatically extract company names, people, locations. Then populate a knowledge graph or database. No manual tagging needed."
497
+
498
+ 5. **Try a second example (paste):**
499
+ ```
500
+ "Barack Obama served as President of the United States from 2009 to 2017."
501
+ ```
502
+ Click **Extract Entities**
503
+
504
+ 6. **Show results, narrate:**
505
+ > "Barack Obama → Person, United States → Location. The model correctly identifies entities even when it hasn't seen this exact sentence before. That's the power of transfer learning."
506
+
507
+ **Use Case Mention:**
508
+ > "Resume parsing, contract analysis, information extraction, knowledge base construction."
509
+
510
+ ---
511
+
512
+ #### **Task 3: Question Answering (Slide 14)**
513
+
514
+ **Talking Point:**
515
+ > "QA answers questions based on provided context. This model finds the span of text that best answers the question."
516
+
517
+ **Demo:**
518
+ 1. Click **Question Answering** tab
519
+ 2. Context: "The Hugging Face Hub is a platform for sharing machine learning models, datasets, and demos. It allows researchers and practitioners to collaborate..."
520
+ 3. Question: "What is the Hugging Face Hub?"
521
+ 4. Click **Get Answer** button
522
+ 5. Output: Answer with confidence score
523
+
524
+ **Narrate:**
525
+ > "The model scans the context, identifies the most relevant span, and returns it as the answer. Notice it didn't generate new text — it extracted from the provided context. That's what this model does."
526
+
527
+ 6. **Try a different question (paste into question box):**
528
+ ```
529
+ Who uses the Hugging Face Hub?
530
+ ```
531
+ Click **Get Answer**
532
+
533
+ 7. **Show result, narrate:**
534
+ > "If the answer isn't explicitly in the context, the model struggles. That's a limitation. For better results, the context needs to contain the answer directly. This is why context construction is critical in production QA systems."
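This limitation is why production QA systems gate on the confidence score. A minimal routing sketch, assuming the answer/score dict shape shown in the demo (the 0.5 threshold is an illustrative default, not a recommended value):

```python
def route_answer(result, threshold=0.5):
    """Return the extracted answer only when the model is confident enough;
    otherwise flag the item for human review."""
    if result["score"] >= threshold:
        return result["answer"]
    return f"[NEEDS REVIEW] low confidence ({result['score']:.2f})"

print(route_answer({"answer": "a platform for sharing ML models", "score": 0.91}))
print(route_answer({"answer": "researchers", "score": 0.12}))
```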
535
+
536
+ **Use Case Mention:**
537
+ > "FAQ automation, documentation search, support chatbots, knowledge base queries."
538
+
539
+ ---
540
+
541
+ #### **Task 4: Summarization (Slide 15)**
542
+
543
+ **Talking Point:**
544
+ > "Summarization generates concise summaries. This uses an abstractive approach — it generates NEW sentences, not just extracting existing ones."
545
+
546
+ **Demo:**
547
+ 1. Click **Text Summarization** tab
548
+ 2. Input: "Artificial intelligence has become increasingly important in modern society. Machine learning enables computers to learn from data. Deep learning uses neural networks..."
549
+ (full text in input box)
550
+ 3. Click **Summarize** button
551
+ 4. Output: Concise summary (1-2 sentences)
552
+
553
+ **Narrate:**
554
+ > "The model identified key concepts and generated a new, shorter version. This is harder than extraction — it requires generation. Notice the summary uses different words but preserves meaning. That's abstractive summarization."
555
+
556
+ 5. **Mention parameters:**
557
+ > "This model has parameters: max_length controls summary length, min_length ensures it's not too short. These are tunable based on your use case."
558
+
559
+ **Use Case Mention:**
560
+ > "Article summarization, report generation, meeting notes, email digest, news aggregation."
561
+
562
+ ---
563
+
564
+ #### **Task 5: Semantic Similarity (Slide 16)**
565
+
566
+ **Talking Point:**
567
+ > "Semantic similarity measures how similar two texts are IN MEANING, not just character overlap."
568
+
569
+ **Demo:**
570
+ 1. Click **Semantic Similarity** tab
571
+ 2. Text 1: "The cat is sleeping on the mat"
572
+ 3. Text 2: "A feline is resting on the rug"
573
+ 4. Click **Compare** button
574
+ 5. Output: `Similarity Score: 0.92 (Very Similar!)`
575
+
576
+ **Narrate:**
577
+ > "These sentences use completely different words. But semantically, they mean almost the same thing. Character-level comparison would say they're different. But this embedding-based model recognizes they're similar. That's semantic understanding."
578
+
579
+ 6. **Try different pairs (paste examples):**
580
+ - "Machine learning is used for AI tasks" vs "Deep learning uses neural networks"
581
+ - "I love this product" vs "This product is terrible"
582
+
583
+ 7. **Narrate the pattern:**
584
+ > "High scores (>0.8) mean very similar meaning. Low scores (<0.3) mean different meanings. Scores around 0.5 mean some relation. You can tune based on your application."
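The score itself is just cosine similarity between the two sentence embeddings. A pure-Python sketch of the computation and the bucketing pattern above (the vectors are toy 3-dimensional stand-ins for real 384-dimensional embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def label(score):
    # Illustrative buckets matching the thresholds mentioned above
    if score > 0.8:
        return "very similar"
    if score < 0.3:
        return "different"
    return "somewhat related"

v1, v2 = [0.9, 0.1, 0.4], [0.8, 0.2, 0.5]
s = cosine(v1, v2)
print(round(s, 2), label(s))
```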
585
+
586
+ **Use Case Mention:**
587
+ > "Duplicate detection, recommendation systems, plagiarism detection, semantic search, document clustering."
588
+
589
+ ---
590
+
591
+ #### 1:05-1:25 | Exercise Checkpoint (20 min)
592
+
593
+ **Talking Point:**
594
+ > "Now it's your turn. I'll guide you through some exercises where you explore these tasks with your own inputs."
595
+
596
+ **Exercise 1: Sentiment on Different Domains (5 min)**
597
+ > "Go to Sentiment Analysis tab. Try these inputs:
598
+ > - Product review: 'This laptop is fantastic and runs smoothly.'
599
+ > - Restaurant review: 'The food was cold and service was slow.'
600
+ > - Technical feedback: 'This API is poorly documented.'
601
+ > Notice how the model interprets different domains. What confidence levels do you see?"
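A natural follow-on question is how those confidence levels get used. One common pattern is threshold-based triage; a minimal sketch, with illustrative (untuned) thresholds and the `POSITIVE`/`NEGATIVE` label strings this SST-2 model returns:

```python
def triage(label, score, auto_threshold=0.9):
    """Route a classified ticket: escalate confident negatives,
    archive confident positives, send everything else to a human.
    The threshold is an illustrative default, not a tuned value."""
    if score < auto_threshold:
        return "human_review"
    return "escalate" if label == "NEGATIVE" else "auto_archive"

print(triage("NEGATIVE", 0.98))  # escalate
print(triage("POSITIVE", 0.99))  # auto_archive
print(triage("NEGATIVE", 0.55))  # human_review
```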
602
+
603
+ **Exercise 2: NER on Your Data (5 min)**
604
+ > "Go to NER tab. Paste text about a company or person YOU know. See how well the model extracts entities. What works well? What fails?"
605
+
606
+ **Exercise 3: QA Troubleshooting (5 min)**
607
+ > "Go to QA tab. Try asking a question that ISN'T answered in the context. What happens? This teaches us about limitations."
608
+
609
+ **Exercise 4: Creative Experimentation (5 min)**
610
+ > "Mix and match: Run sentiment on a summarized document. Extract entities from QA answers. Try Semantic Similarity on weird pairs. Explore!"
611
+
612
+ **During Exercise Time:**
613
+ - Monitor chat/questions
614
+ - Help attendees with inputs
615
+ - Highlight interesting results
616
+ - Encourage experimentation
617
+
618
+ ---
619
+
620
+ #### 1:25-1:35 | Production Considerations (10 min)
621
+
622
+ **Slide 20: Optimization**
623
+
624
+ **Talking Point:**
625
+ > "These models are great for prototypes. But in production, we need to optimize."
626
+
627
+ **Key Tradeoffs:**
628
+ ```
629
+ Large Models (BERT-large):
630
+ ✅ Highest accuracy
631
+ ❌ Slow (500ms per inference)
632
+ ❌ High memory (340M parameters, ~1.3 GB in fp32)
633
+
634
+ Medium Models (BERT-base):
635
+ ✅ Good accuracy, decent speed
636
+ ⚠️ Reasonable resources
637
+
638
+ Small Models (DistilBERT):
639
+ ✅ Fast (100ms per inference)
640
+ ✅ Small (66M parameters, ~250 MB), runs on modest hardware
641
+ ⚠️ Slightly lower accuracy (3-5% drop)
642
+ ```
643
+
644
+ **Talking Point:**
645
+ > "In production, DistilBERT is usually the winner. Roughly 60% faster, retaining about 97% of BERT's accuracy. That last 3% might not even matter for your use case. Always measure before deciding!"
646
+
647
+ **Optimization Techniques:**
648
+ 1. **Quantization** — Reduce precision (float32 → int8) for ~4x smaller weights and faster inference
649
+ 2. **Distillation** — Train smaller model to mimic larger one
650
+ 3. **Pruning** — Remove unimportant connections
651
+ 4. **Caching** — Store frequent predictions
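The 4x figure for quantization is simple byte arithmetic. A quick sanity check in Python (the parameter count is an illustrative, roughly DistilBERT-sized number):

```python
def model_size_mb(n_params, bytes_per_param):
    """Approximate model weight size in megabytes."""
    return n_params * bytes_per_param / 1e6

params = 66_000_000            # illustrative, roughly DistilBERT-sized
fp32 = model_size_mb(params, 4)  # float32: 4 bytes per parameter
int8 = model_size_mb(params, 1)  # int8 after quantization: 1 byte
print(f"fp32: {fp32:.0f} MB, int8: {int8:.0f} MB, ratio: {fp32 / int8:.0f}x")
```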
652
+
653
+ ---
654
+
655
+ #### 1:35-1:50 | Wrap-up + Next Steps (Slide 23-25)
656
+
657
+ **Recap (2 min):**
658
+ > "Let's recap. We learned tokenization, explored 5 NLP tasks, and discussed production considerations. You now understand the full inference pipeline."
659
+
660
+ **Key Takeaways (3 min):**
661
+ 1. Tokenization is foundational (garbage in = garbage out)
662
+ 2. Different models for different tasks (choose wisely)
663
+ 3. Confidence scores guide automation decisions
664
+ 4. Production requires optimization thinking
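Takeaway 1 can be made concrete with a toy greedy longest-match subword splitter in the WordPiece style; the vocabulary below is a tiny illustrative stand-in for a real ~30k-token vocab:

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first subword split (toy WordPiece-style)."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand       # continuation pieces get the ## prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:                # no piece matches: unknown token
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"token", "##ization", "##s", "play", "##ing"}
print(wordpiece("tokenization", vocab))  # ['token', '##ization']
```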
665
+
666
+ **Next Steps (5 min):**
667
+
668
+ Using slides 23-25 as reference, walk through:
669
+
670
+ 1. **Experiment Post-Session**
671
+ - Clone the repo locally
672
+ - Modify `data/sample_texts.csv`
673
+ - Add new models to `config.py`
674
+
675
+ 2. **Fine-Tune on Your Data**
676
+ - Take a pre-trained model
677
+ - Add your labeled data
678
+ - Fine-tune (takes hours on GPU, not weeks)
679
+
680
+ 3. **Deploy to Spaces**
681
+ - Fork this repo
682
+ - Push your changes
683
+ - Get a shareable URL instantly
684
+
685
+ 4. **Publish to the Hub**
686
+ - Share your fine-tuned model
687
+ - Document with model card
688
+ - Community can use it
689
+
690
+ **Resources (5 min):**
691
+ Point to Resources tab in Spaces app:
692
+ - Official documentation links
693
+ - Model hub for exploration
694
+ - Learning courses
695
+ - Community forums
696
+
697
+ **Transition to Q&A:**
698
+ > "That concludes Session 2. Now, questions? This is your time to clarify anything that was unclear."
699
+
700
+ ---
701
+
702
+ #### 1:50-2:00 | Q&A + Office Hours
703
+
704
+ **Common Questions & Answers:**
705
+
706
+ | Q | A |
707
+ |---|---|
708
+ | "How do I fine-tune?" | "Upload your labeled data, use `Trainer` API, takes 1-2 hours. We have tutorials on the Hub." |
709
+ | "Can I use multiple models together?" | "Yes! Build pipelines. E.g., classify intent, then run task-specific model." |
710
+ | "What about non-English?" | "HF has multilingual models (mBERT, XLM-R). Same API, different models." |
711
+ | "How do I handle errors?" | "Log everything, set confidence thresholds, flag low-confidence predictions for review." |
712
+ | "Can I deploy this on my server?" | "Yes! Models are just PyTorch/TF files. Download, load, serve with Flask/FastAPI." |
713
+ | "What about privacy?" | "Models run locally (no data sent to HF). For Spaces like this, note that inputs may be logged." |
714
+
715
+ **Offer Follow-up:**
716
+ > "If your question needs deeper discussion, let's connect after. I'm available for office hours / one-on-ones. We can dive into your specific use case."
717
+
718
+ ---
719
+
720
+ ## Troubleshooting Guide
721
+
722
+ ### Issue: Model Won't Load
723
+ **Symptom:** "Model is loading..." for >60 seconds, then timeout
724
+
725
+ **Causes:**
726
+ 1. Network issue
727
+ 2. Model too large for Spaces RAM
728
+ 3. First-time model download
729
+
730
+ **Solutions:**
731
+ 1. Refresh the page
732
+ 2. Switch to smaller model (DistilBERT instead of BERT-large)
733
+ 3. Pre-download models before session (in `utils.py`)
734
+
735
+ **Fallback:**
736
+ - Say: "Network hiccup. Let me show you what would happen..."
737
+ - Move to next task
738
+
739
+ ---
740
+
741
+ ### Issue: Inference Takes >10 Seconds
742
+ **Symptom:** Button clicked, waiting...
743
+
744
+ **Likely Cause:**
745
+ - First inference (loading models to memory)
746
+
747
+ **Solution:**
748
+ - Pre-warm models at app startup (in `app.py`)
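The pre-warm idea can be sketched with a cached loader; here a stub stands in for the real `transformers.pipeline` call, so the names and return values are illustrative:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_pipeline(task):
    """Load a model once and reuse it on every later call.
    In the real app the body would call transformers.pipeline(task)."""
    print(f"loading {task} ...")  # happens only on the first call per task
    return lambda text: f"<{task} result for {text!r}>"

# Pre-warm at startup so the first button click is fast
for task in ("sentiment-analysis", "ner", "question-answering"):
    get_pipeline(task)

print(get_pipeline("ner")("Apple Inc."))  # served from cache, no reload
```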
749
+
750
+ ---
751
+
752
+ ### Issue: Widget Doesn't Respond
753
+ **Symptom:** Click button, nothing happens
754
+
755
+ **Solutions:**
756
+ 1. Refresh page
757
+ 2. Clear browser cache
758
+ 3. Try different browser
759
+ 4. Check internet connection
760
+
761
+ ---
762
+
763
+ ### Issue: Attendee Internet Drops
764
+ **During Session:**
765
+ - In-room attendees can usually still view the Spaces app (browser-cached)
766
+ - Remote may need to rejoin
767
+
768
+ **Solution:**
769
+ - Have recording/backup slides ready
770
+ - Provide links to all materials before session
771
+
772
+ ---
773
+
774
+ ## Tips for Effective Presentation
775
+
776
+ ### Before Session
777
+ 1. **Dry Run**
778
+ - Go through slides locally
779
+ - Test all demos in Spaces
780
+ - Time each section
781
+ - Prepare backup slides (PDF)
782
+
783
+ 2. **Prepare Examples**
784
+ - Have 3-4 good examples per task
785
+ - Have 1-2 "edge case" examples
786
+ - Test on provided models beforehand
787
+
788
+ 3. **Communication**
789
+ - Send Spaces URL 24 hours before
790
+ - Send agenda
791
+ - Set expectations (interactive, will show code, not advanced math)
792
+
793
+ ### During Session
794
+
795
+ 1. **Pacing**
796
+ - Aim for 80% of allocated time (leave buffer for questions)
797
+ - Session 1: Stay under 40 min to leave 5 min Q&A
798
+ - Session 2: Finish the wrap-up by 1:35 to leave time for Q&A
799
+
800
+ 2. **Engagement**
801
+ - Ask rhetorical questions ("Anyone know what tokenizers do?")
802
+ - Pause for questions every 5-10 minutes
803
+ - Read chat actively, respond to questions asynchronously
804
+
805
+ 3. **Demo Tips**
806
+ - Always narrate what you're doing ("Clicking the button now...")
807
+ - Point to specific outputs ("Notice the confidence is 0.99...")
808
+ - Explain failures ("That's interesting — let me try...")
809
+ - Have patience with load times
810
+
811
+ 4. **Audience Management**
812
+ - Non-technical folks may feel lost at Tokenization. Say: "This is technical. Don't worry if it's unclear. The key insight is: text becomes numbers."
813
+ - Advanced folks might ask hard questions later. Note them down, offer office hours.
814
+
815
+ 5. **Interactive Moments**
816
+ - Polling: "Who's used HF before?" (chat reaction)
817
+ - Breakout: "Spend 5 min exploring sentiment. Report back a surprising result."
818
+ - Q&A: "Anyone have a use case they want to explore?"
819
+
820
+ ### Content Delivery
821
+ - **Use analogies** — "The tokenizer is like a Rosetta Stone"
822
+ - **Show the code, but don't deep dive** — "Here's what this looks like in Python... notice it's just 3 lines"
823
+ - **Celebrate small wins** — "We went from training models for months to 3 lines of code. That's progress!"
824
+ - **Be honest about limitations** — "QA works best when the answer is explicitly in the context"
825
+
826
+ ### If Technical Issues Occur
827
+ - **Stay calm** — "Let me reconnect..."
828
+ - **Have backups** — Screenshots, recordings, cached outputs
829
+ - **Continue anyway** — "While that loads, let me tell you..."
830
+ - **Learn and iterate** — Document what failed for next time
831
+
832
+ ---
833
+
834
+ ## Final Reminders
835
+
836
+ - ✅ **Slides are in markdown** — Convert to PDF/PPTX for local presentation if desired
837
+ - ✅ **Models are PUBLIC** — Anyone from the company can access the app
838
+ - ✅ **Spaces auto-updates** — Push to GitHub, it auto-deploys
839
+ - ✅ **Attendees can clone repo** — They can run it locally post-session
840
+ - ✅ **Record if possible** — Share recording with no-shows
841
+
842
+ **Good luck! 🚀 You've got this!**
app.py ADDED
@@ -0,0 +1,587 @@
1
+ """
2
+ HuggingFace Enabling Sessions - Gradio Interactive Demo App
3
+ Hosted on HuggingFace Spaces
4
+ """
5
+
6
+ import gradio as gr
7
+ import config
8
+ import utils
9
+ import pandas as pd
10
+
11
+ # ===================== UTILITIES =====================
12
+
13
+ def load_sample_texts():
14
+ """Load sample texts from CSV."""
15
+ try:
16
+ df = pd.read_csv(config.SAMPLE_DATA_CSV)
17
+ return df
18
+ except Exception:
19
+ return None
20
+
21
+
22
+ def get_sentiment_examples():
23
+ """Get example texts for sentiment analysis."""
24
+ try:
25
+ with open(f"{config.DEMO_SAMPLES_DIR}/sentiment.txt") as f:
26
+ lines = f.read().strip().split("\n")
27
+ return lines
28
+ except Exception:
29
+ return [config.TASKS["sentiment"]["example"]]
30
+
31
+
32
+ def get_ner_examples():
33
+ """Get example texts for NER."""
34
+ try:
35
+ with open(f"{config.DEMO_SAMPLES_DIR}/ner.txt") as f:
36
+ lines = f.read().strip().split("\n")
37
+ return lines
38
+ except Exception:
39
+ return [config.TASKS["ner"]["example"]]
40
+
41
+
42
+ def get_qa_examples():
43
+ """Get example context and questions for QA."""
44
+ try:
45
+ with open(f"{config.DEMO_SAMPLES_DIR}/qa.txt") as f:
46
+ contexts = f.read().strip().split("\n\n")
47
+ return contexts
48
+ except Exception:
49
+ return [config.TASKS["qa"]["example_context"]]
50
+
51
+
52
+ def get_summarization_examples():
53
+ """Get example texts for summarization."""
54
+ try:
55
+ with open(f"{config.DEMO_SAMPLES_DIR}/summarization.txt") as f:
56
+ lines = f.read().strip().split("\n")
57
+ return lines
58
+ except Exception:
59
+ return [config.TASKS["summarization"]["example"]]
60
+
61
+
62
+ def get_embeddings_examples():
63
+ """Get example texts for semantic similarity."""
64
+ try:
65
+ with open(f"{config.DEMO_SAMPLES_DIR}/embeddings.txt") as f:
66
+ lines = f.read().strip().split("\n")
67
+ return lines
68
+ except Exception:
69
+ return [config.TASKS["similarity"]["example1"], config.TASKS["similarity"]["example2"]]
70
+
71
+
72
+ # ===================== SENTIMENT ANALYSIS =====================
73
+
74
+ def demo_sentiment(text):
75
+ """Demo sentiment analysis."""
76
+ if not text.strip():
77
+ return "Please enter some text", "No input"
78
+ result = utils.run_sentiment_analysis(text)
79
+ output = f"**Label:** {result['label']}\n\n**Confidence:** {result['score']:.4f}"
80
+ return output, result
81
+
82
+
83
+ # ===================== NER =====================
84
+
85
+ def demo_ner(text):
86
+ """Demo named entity recognition."""
87
+ if not text.strip():
88
+ return "Please enter some text", "No entities"
89
+ results = utils.run_ner(text)
90
+ formatted = utils.format_ner_output(results)
91
+ return formatted, results
92
+
93
+
94
+ # ===================== QUESTION ANSWERING =====================
95
+
96
+ def demo_qa(context, question):
97
+ """Demo question answering."""
98
+ if not context.strip() or not question.strip():
99
+ return "Please enter both context and question", {}
100
+ result = utils.run_qa(context, question)
101
+ if "error" in result:
102
+ return f"Error: {result['error']}", {}
103
+ output = f"**Answer:** {result['answer']}\n\n**Confidence:** {result['score']:.4f}"
104
+ return output, result
105
+
106
+
107
+ # ===================== SUMMARIZATION =====================
108
+
109
+ def demo_summarization(text):
110
+ """Demo text summarization."""
111
+ if not text.strip():
112
+ return "Please enter some text", ""
113
+ if len(text.split()) < 20:
114
+ return "Text too short for summarization. Please provide at least 20 words.", ""
115
+ summary = utils.run_summarization(text)
116
+ return summary, summary
117
+
118
+
119
+ # ===================== SEMANTIC SIMILARITY =====================
120
+
121
+ def demo_similarity(text1, text2):
122
+ """Demo semantic similarity."""
123
+ if not text1.strip() or not text2.strip():
124
+ return "Please enter both texts", 0
125
+ similarity = utils.compute_similarity(text1, text2)
126
+ if isinstance(similarity, str):
127
+ return similarity, 0
128
+ output = f"**Similarity Score:** {similarity:.4f}\n\n(Score ranges from -1 to 1, where 1 means identical semantically)"
129
+ return output, similarity
130
+
131
+
132
+ # ===================== TOKENIZATION =====================
133
+
134
+ def demo_tokenization(text):
135
+ """Demo tokenization."""
136
+ if not text.strip():
137
+ return "Please enter some text", ""
138
+ result = utils.tokenize_text(text)
139
+ if "error" in result:
140
+ return f"Error: {result['error']}", ""
141
+ formatted = utils.format_tokenizer_output(result)
142
+ return formatted, result
143
+
144
+
145
+ # ===================== GRADIO INTERFACE =====================
146
+
147
+ def create_interface():
148
+ """Create the Gradio interface with 3 tabs."""
149
+
150
+ with gr.Blocks(
151
+ title="HuggingFace Enabling Sessions",
152
+ theme=gr.themes.Soft(),
153
+ ) as app:
154
+ gr.Markdown(
155
+ """
156
+ # 🤗 HuggingFace Enabling Sessions
157
+ **Interactive Demo for Transformers, Hub APIs, and Pipeline Abstractions**
158
+
159
+ **Facilitator:** Shouryaan Grish | **Duration:** Session 1: 45 min | Session 2: 90 min
160
+ """
161
+ )
162
+
163
+ with gr.Tabs():
164
+ # ===================== TAB 1: SESSION 1 - INTRODUCTION =====================
165
+ with gr.Tab("Session 1: Introduction (45 min)", id="session1"):
166
+ gr.Markdown(
167
+ """
168
+ ## 🎯 Introduction to Hugging Face Ecosystem
169
+
170
+ ### What We'll Cover:
171
+ 1. **HuggingFace Platform Overview**
172
+ - The Hub: Central repository for models, datasets, and spaces
173
+ - Transformers Library: Core Python library for NLP
174
+ - Model Cards: Documentation and metadata for transparency
175
+
176
+ 2. **Core Abstractions**
177
+ - **Pipelines:** High-level API for common tasks (sentiment, NER, QA, etc.)
178
+ - **Models & Tokenizers:** Lower-level building blocks
179
+ - **Datasets:** Standardized data loading and processing
180
+
181
+ 3. **Architecture Patterns**
182
+ - **Encoders:** BERT, RoBERTa, DistilBERT → Classification, feature extraction
183
+ - **Decoders:** GPT-2, GPT-3 → Text generation
184
+ - **Encoder-Decoders:** T5, BART → Seq2seq (translation, summarization, QA)
185
+
186
+ 4. **Enterprise NLP Landscape**
187
+ - Open-source vs. Commercial models
188
+ - Licensing considerations (MIT, Apache, OpenRAIL, etc.)
189
+ - Fine-tuning for domain-specific tasks
190
+
191
+ ---
192
+
193
+ ### Live Demo: Explore the Power of Pipelines
194
+
195
+ Try the demos below to see how easy it is to use pre-trained models! 👇
196
+ """
197
+ )
198
+
199
+ with gr.Group():
200
+ gr.Markdown("### 📊 Demo 1: Sentiment Analysis")
201
+ demo1_input = gr.Textbox(
202
+ label="Enter text to analyze sentiment",
203
+ value="I absolutely love this product!",
204
+ lines=2,
205
+ )
206
+ demo1_btn = gr.Button("Analyze Sentiment", variant="primary")
207
+ demo1_output = gr.Markdown(label="Result")
208
+ demo1_json = gr.JSON(label="Raw Output", visible=False)
209
+
210
+ demo1_btn.click(
211
+ demo_sentiment,
212
+ inputs=[demo1_input],
213
+ outputs=[demo1_output, demo1_json],
214
+ )
215
+
216
+ with gr.Group():
217
+ gr.Markdown("### 🏷️ Demo 2: Named Entity Recognition (NER)")
218
+ demo2_input = gr.Textbox(
219
+ label="Enter text for entity recognition",
220
+ value="Apple Inc. was founded by Steve Jobs in Cupertino, California.",
221
+ lines=2,
222
+ )
223
+ demo2_btn = gr.Button("Extract Entities", variant="primary")
224
+ demo2_output = gr.Markdown(label="Entities Found")
225
+ demo2_json = gr.JSON(label="Raw Output", visible=False)
226
+
227
+ demo2_btn.click(
228
+ demo_ner,
229
+ inputs=[demo2_input],
230
+ outputs=[demo2_output, demo2_json],
231
+ )
232
+
233
+ gr.Markdown(
234
+ """
235
+ ---
236
+ ### 💡 Key Takeaways
237
+ - Pre-trained models save time and resources
238
+ - HuggingFace Pipelines abstract away complexity
239
+ - Models are available for dozens of NLP tasks
240
+ - Easy to fine-tune for specialized use cases
241
+
242
+ **Next:** Head to Session 2 for hands-on development with Tokenizers and Advanced Inference! 🚀
243
+ """
244
+ )
245
+
246
+ # ===================== TAB 2: SESSION 2 - HANDS-ON DEVELOPER =====================
247
+ with gr.Tab("Session 2: Hands-On Developer (90 min)", id="session2"):
248
+ gr.Markdown(
249
+ """
250
+ ## 👨‍💻 Building End-to-End NLP Workflows with Hugging Face
251
+
252
+ ### Agenda:
253
+ 1. **Tokenization Deep Dive** (15 min)
254
+ - Understanding tokenization, token IDs, and attention masks
255
+ - How models process text internally
256
+
257
+ 2. **Inference Playground** (45 min)
258
+ - Interactive demos across multiple NLP tasks
259
+ - Learn how to use different model architectures
260
+ - See real outputs and understand model confidence
261
+
262
+ 3. **Exercise Checkpoints** (20 min)
263
+ - Try your own text inputs
264
+ - Experiment with different examples
265
+ - Q&A and troubleshooting
266
+
267
+ 4. **Next Steps & Resources** (10 min)
268
+ - Publishing models to the Hub
269
+ - Fine-tuning workflow overview
270
+ - Post-session project ideas
271
+
272
+ ---
273
+
274
+ ### 🔤 Part 1: Tokenization Explorer
275
+ """
276
+ )
277
+
278
+ with gr.Group():
279
+ gr.Markdown(
280
+ """
281
+ #### How Tokenization Works
282
+ - Text is split into tokens (words/subwords)
283
+ - Each token gets a unique ID
284
+ - Attention masks indicate which tokens are real vs. padding
285
+ - This is how transformers \"understand\" text!
286
+ """
287
+ )
288
+ tok_input = gr.Textbox(
289
+ label="Enter text to tokenize",
290
+ value="Hello, how are you?",
291
+ lines=2,
292
+ )
293
+ tok_btn = gr.Button("Tokenize", variant="primary")
294
+ tok_output = gr.Markdown(label="Tokens")
295
+ tok_json = gr.JSON(label="Tokenization Details", visible=False)
296
+
297
+ tok_btn.click(
298
+ demo_tokenization,
299
+ inputs=[tok_input],
300
+ outputs=[tok_output, tok_json],
301
+ )
302
+
303
+ gr.Markdown(
304
+ """
305
+ ---
306
+ ### 🎯 Part 2: Inference Playground (Choose a Task)
307
+ """
308
+ )
309
+
310
+ with gr.Tabs():
311
+ # Task 1: Sentiment
312
+ with gr.Tab("Sentiment Analysis"):
313
+ gr.Markdown(
314
+ """
315
+ **Classify text as positive, negative, or neutral**
316
+
317
+ Model: DistilBERT fine-tuned on SST-2 dataset
318
+ """
319
+ )
320
+ sent_input = gr.Textbox(
321
+ label="Enter text",
322
+ value=get_sentiment_examples()[0],
323
+ lines=3,
324
+ )
325
+ sent_examples = gr.Examples(
326
+ examples=get_sentiment_examples(),
327
+ inputs=[sent_input],
328
+ )
329
+ sent_btn = gr.Button("Analyze", variant="primary")
330
+ sent_output = gr.Markdown(label="Result")
331
+ sent_json = gr.JSON(label="Details", visible=False)
332
+
333
+ sent_btn.click(
334
+ demo_sentiment,
335
+ inputs=[sent_input],
336
+ outputs=[sent_output, sent_json],
337
+ )
338
+
339
+ # Task 2: NER
340
+ with gr.Tab("Named Entity Recognition"):
341
+ gr.Markdown(
342
+ """
343
+ **Identify people, organizations, locations, and more**
344
+
345
+ Model: BERT fine-tuned on CoNLL-2003 NER dataset
346
+ """
347
+ )
348
+ ner_input = gr.Textbox(
349
+ label="Enter text",
350
+ value=get_ner_examples()[0],
351
+ lines=3,
352
+ )
353
+ ner_examples = gr.Examples(
354
+ examples=get_ner_examples(),
355
+ inputs=[ner_input],
356
+ )
357
+ ner_btn = gr.Button("Extract Entities", variant="primary")
358
+ ner_output = gr.Markdown(label="Entities")
359
+ ner_json = gr.JSON(label="Details", visible=False)
360
+
361
+ ner_btn.click(
362
+ demo_ner,
363
+ inputs=[ner_input],
364
+ outputs=[ner_output, ner_json],
365
+ )
366
+
367
+ # Task 3: QA
368
+ with gr.Tab("Question Answering"):
369
+ gr.Markdown(
370
+ """
371
+ **Answer questions based on provided context**
372
+
373
+ Model: RoBERTa fine-tuned on SQuAD 2.0
374
+ """
375
+ )
376
+ qa_examples = get_qa_examples()
377
+ qa_context = gr.Textbox(
378
+ label="Context/Passage",
379
+ value=qa_examples[0] if qa_examples else config.TASKS["qa"]["example_context"],
380
+ lines=4,
381
+ )
382
+ qa_question = gr.Textbox(
383
+ label="Question",
384
+ value="What is the Hugging Face Hub?",
385
+ lines=2,
386
+ )
387
+ qa_btn = gr.Button("Get Answer", variant="primary")
388
+ qa_output = gr.Markdown(label="Answer")
389
+ qa_json = gr.JSON(label="Details", visible=False)
390
+
391
+ qa_btn.click(
392
+ demo_qa,
393
+ inputs=[qa_context, qa_question],
394
+ outputs=[qa_output, qa_json],
395
+ )
396
+
397
+ # Task 4: Summarization
398
+ with gr.Tab("Text Summarization"):
399
+ gr.Markdown(
400
+ """
401
+ **Generate concise summaries of longer texts**
402
+
403
+ Model: BART large fine-tuned on CNN/DailyMail
404
+ """
405
+ )
406
+ sum_examples = get_summarization_examples()
407
+ sum_input = gr.Textbox(
408
+ label="Text to summarize (min 20 words)",
409
+ value=sum_examples[0] if sum_examples else config.TASKS["summarization"]["example"],
410
+ lines=5,
411
+ )
412
+ sum_btn = gr.Button("Summarize", variant="primary")
413
+ sum_output = gr.Markdown(label="Summary")
414
+ sum_json = gr.JSON(label="Details", visible=False)
415
+
416
+ sum_btn.click(
417
+ demo_summarization,
418
+ inputs=[sum_input],
419
+ outputs=[sum_output, sum_json],
420
+ )
421
+
422
+ # Task 5: Semantic Similarity
423
+ with gr.Tab("Semantic Similarity"):
424
+ gr.Markdown(
425
+ """
426
+ **Compare semantic similarity between texts**
427
+
428
+ Model: Sentence-BERT (all-MiniLM-L6-v2)
429
+ """
430
+ )
431
+ emb_examples = get_embeddings_examples()
432
+ emb_text1 = gr.Textbox(
433
+ label="First text",
434
+ value=emb_examples[0] if len(emb_examples) > 0 else "The cat is sleeping",
435
+ lines=2,
436
+ )
437
+ emb_text2 = gr.Textbox(
438
+ label="Second text",
439
+ value=emb_examples[1] if len(emb_examples) > 1 else "A feline is resting",
440
+ lines=2,
441
+ )
442
+ emb_btn = gr.Button("Compare", variant="primary")
443
+ emb_output = gr.Markdown(label="Similarity")
444
+ emb_json = gr.JSON(label="Details", visible=False)
445
+
446
+ emb_btn.click(
447
+ demo_similarity,
448
+ inputs=[emb_text1, emb_text2],
449
+ outputs=[emb_output, emb_json],
450
+ )
451
+
452
+ gr.Markdown(
453
+ """
454
+ ---
455
+ ### 🚀 Part 3: Key Concepts Recap
456
+
457
+ ✅ **Transformers Architecture:**
458
+ - Self-attention mechanisms allow models to focus on relevant parts of text
459
+ - Pre-training on large corpora + fine-tuning = transfer learning
460
+
461
+ ✅ **Using HuggingFace:**
462
+ - Pipelines for quick demos
463
+ - Fine-tuning for custom tasks
464
+ - Model Hub for sharing and collaboration
465
+
466
+ ✅ **Production Considerations:**
467
+ - Model size vs. accuracy tradeoff
468
+ - Quantization and distillation for faster inference
469
+ - Licensing and compliance for models
470
+ """
471
+ )
472
+
473
+ # ===================== TAB 3: RESOURCES =====================
474
+ with gr.Tab("Resources & Next Steps", id="resources"):
475
+ gr.Markdown(
476
+ """
477
+ ## 📚 Learning Resources
478
+
479
+ ### Official Documentation
480
+ - [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/)
481
+ - [Hugging Face Datasets Documentation](https://huggingface.co/docs/datasets/)
482
+ - [Hugging Face Hub Documentation](https://huggingface.co/docs/hub/)
483
+
484
+ ### Tutorials & Courses
485
+ - [Hugging Face Course (Free)](https://huggingface.co/course/)
486
+ - [Transformers from Scratch](https://huggingface.co/docs/transformers/training)
487
+ - [Fine-tuning Guide](https://huggingface.co/docs/transformers/training)
488
+
489
+ ---
490
+
491
+ ## 🛠️ Popular Models to Explore
492
+
493
+ ### Text Classification
494
+ - `distilbert-base-uncased-finetuned-sst-2-english` - Sentiment Analysis
495
+ - `roberta-base` - General purpose classifier
496
+ - `bert-base-multilingual-cased` - Multilingual support
497
+
498
+ ### Named Entity Recognition
499
+ - `dslim/bert-base-NER` - English NER
500
+ - `xlm-roberta-base` - Multilingual NER
501
+
502
+ ### Question Answering
503
+ - `deepset/roberta-base-squad2` - SQuAD 2.0 fine-tuned
504
+ - `bert-large-uncased-whole-word-masking-finetuned-squad` - BERT Large
505
+
506
+ ### Text Generation
507
+ - `gpt2` - Lightweight generation
508
+ - `facebook/bart-large` - Sequence-to-sequence
509
+ - `t5-base` - T5 for various tasks
510
+
511
+ ### Embeddings & Similarity
512
+ - `sentence-transformers/all-MiniLM-L6-v2` - Fast & efficient
513
+ - `sentence-transformers/all-mpnet-base-v2` - High quality
514
+
515
+ ---
516
+
517
+ ## 💾 Popular Datasets
518
+
519
+ - `glue` - General Language Understanding Evaluation
520
+ - `wikitext` - Large language model benchmark
521
+ - `squad` - Question answering dataset
522
+ - `conll2003` - Named entity recognition
523
+ - `imdb` - Sentiment analysis
524
+
525
+ ---
526
+
527
+ ## 🎯 Next Steps After the Sessions
528
+
529
+ ### Beginner Path
530
+ 1. Explore models on the Hub
531
+ 2. Try different models on your own data
532
+ 3. Learn about fine-tuning concepts
533
+
534
+ ### Intermediate Path
535
+ 1. Fine-tune a pre-trained model on your dataset
536
+ 2. Deploy a model to Spaces (like this demo!)
537
+ 3. Publish your model to the Hub
538
+
539
+ ### Advanced Path
540
+ 1. Build multi-stage pipelines
541
+ 2. Implement custom training loops
542
+ 3. Contribute to open-source projects
543
+
544
+ ---
545
+
546
+ ## 🔗 Community & Support
547
+
548
+ - [Hugging Face Forums](https://discuss.huggingface.co/)
549
+ - [GitHub Issues](https://github.com/huggingface/transformers/issues)
550
+ - [Twitter/X @huggingface](https://twitter.com/huggingface)
551
+ - Company Slack/Teams Channels
552
+
553
+ ---
554
+
555
+ ## 📝 Session Information
556
+
557
+ **Session 1: Introduction to Hugging Face** (45 minutes)
558
+ - Overview of the ecosystem
559
+ - Core abstractions (Pipelines, Models, Tokenizers)
560
+ - Architecture patterns
561
+ - Enterprise considerations
562
+
563
+ **Session 2: Hands-On Developer Workshop** (90 minutes)
564
+ - Tokenization deep dive
565
+ - Interactive inference playground (5+ NLP tasks)
566
+ - Live coding and experimentation
567
+ - Best practices and next steps
568
+
569
+ ---
570
+
571
+ ### Questions?
572
+ Feel free to reach out via Slack or email during the sessions! 💬
573
+ """
574
+ )
575
+
576
+ return app
577
+
578
+
579
+ # ===================== MAIN =====================
580
+
581
+ if __name__ == "__main__":
582
+ app = create_interface()
583
+ app.launch(
584
+ share=True,
585
+ server_name="0.0.0.0",
586
+ server_port=7860,
587
+ )
config.py ADDED
@@ -0,0 +1,52 @@
# Configuration for HuggingFace Enabling Sessions

# Models
SENTIMENT_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"
NER_MODEL = "dslim/bert-base-NER"
QA_MODEL = "deepset/roberta-base-squad2"
SUMMARIZATION_MODEL = "facebook/bart-large-cnn"
EMBEDDINGS_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

# Task definitions
TASKS = {
    "sentiment": {
        "name": "Sentiment Analysis",
        "description": "Classify text sentiment (positive/negative/neutral)",
        "model": SENTIMENT_MODEL,
        "example": "I absolutely love this product! It's amazing and works perfectly.",
    },
    "ner": {
        "name": "Named Entity Recognition",
        "description": "Identify and classify named entities (Person, Location, Organization)",
        "model": NER_MODEL,
        "example": "Apple Inc. was founded by Steve Jobs in Cupertino, California.",
    },
    "qa": {
        "name": "Question Answering",
        "description": "Answer questions based on provided context",
        "model": QA_MODEL,
        "example_context": "The Hugging Face Hub is a platform for sharing machine learning models, datasets, and demos.",
        "example_question": "What is the Hugging Face Hub?",
    },
    "summarization": {
        "name": "Text Summarization",
        "description": "Generate concise summaries of longer texts",
        "model": SUMMARIZATION_MODEL,
        "example": "The Hugging Face transformers library provides state-of-the-art pre-trained models for natural language processing tasks. It supports PyTorch and TensorFlow, making it easy to use with either framework.",
    },
    "similarity": {
        "name": "Semantic Similarity",
        "description": "Compare semantic similarity between two sentences",
        "model": EMBEDDINGS_MODEL,
        "example1": "The cat is sleeping on the mat",
        "example2": "A feline is resting on the rug",
    },
}

# Sample data paths
SAMPLE_DATA_CSV = "data/sample_texts.csv"
DEMO_SAMPLES_DIR = "data/demo_samples"

# Session timings
SESSION1_DURATION = 45  # minutes
SESSION2_DURATION = 90  # minutes
data/demo_samples/embeddings.txt ADDED
@@ -0,0 +1,6 @@
The cat is sleeping on the mat
A feline is resting on the rug
Machine learning is used for AI tasks
Deep learning uses neural networks
Natural language processing handles text
Computer vision processes images
data/demo_samples/ner.txt ADDED
@@ -0,0 +1,5 @@
Apple Inc. was founded by Steve Jobs in Cupertino, California.
Barack Obama was the 44th President of the United States.
Microsoft is headquartered in Redmond, Washington.
The Eiffel Tower is located in Paris, France.
Google was founded by Larry Page and Sergey Brin at Stanford University.
data/demo_samples/qa.txt ADDED
@@ -0,0 +1,2 @@
The Hugging Face Hub is a platform for sharing machine learning models, datasets, and demos. It allows researchers and practitioners to collaborate and share their work with the community. The Hub is built on top of Git and provides version control for models and datasets.
Transformers are neural networks that use self-attention mechanisms to process sequential data. They were introduced in the "Attention is All You Need" paper by Vaswani et al. in 2017. Transformers have become the foundation for most state-of-the-art NLP models today.
data/demo_samples/sentiment.txt ADDED
@@ -0,0 +1,5 @@
I absolutely love this product! It's amazing and works perfectly.
This is the worst experience I've ever had with a company.
The movie was decent but could have been better.
I'm so happy with my purchase!
This service is terrible and a waste of money.
data/demo_samples/summarization.txt ADDED
@@ -0,0 +1 @@
Artificial intelligence has become increasingly important in modern society. Machine learning, a subset of AI, enables computers to learn from data without being explicitly programmed. Deep learning, which uses neural networks with multiple layers, has revolutionized fields like computer vision and natural language processing. The Hugging Face library has made it easier for developers to access and use pre-trained models for various NLP tasks. These tools have democratized access to powerful AI capabilities, allowing organizations of all sizes to leverage machine learning in their applications.
data/sample_texts.csv ADDED
@@ -0,0 +1,12 @@
text,task,label
"I absolutely love this product! It's amazing and works perfectly.",sentiment,positive
"This is the worst experience I've ever had with a company.",sentiment,negative
"The weather today is nice.",sentiment,neutral
"The movie was okay, nothing special.",sentiment,neutral
"I'm so happy with my purchase!",sentiment,positive
"Apple Inc. was founded by Steve Jobs in Cupertino, California.",ner,entities
"Barack Obama was the 44th President of the United States.",ner,entities
"Microsoft is headquartered in Redmond, Washington.",ner,entities
"The Hugging Face Hub is a platform for sharing machine learning models.",qa,context
"The Eiffel Tower is located in Paris, France.",ner,entities
"Google was founded by Larry Page and Sergey Brin at Stanford University.",ner,entities
requirements.txt ADDED
@@ -0,0 +1,9 @@
gradio>=4.0.0
transformers>=4.40.0
torch>=2.0.0
datasets>=2.16.0
numpy>=1.24.0
pandas>=2.0.0
scikit-learn>=1.3.0
tqdm>=4.66.0
Pillow>=10.0.0
slides/SESSION1_SLIDES.md ADDED
@@ -0,0 +1,387 @@
# Session 1: Introduction to Hugging Face
## (45 minutes)

---

## Slide 1: Title Slide
**Introduction to Hugging Face**
### Building the Future of NLP Together

---
**Presenter:** Shouryaan Grish
**Duration:** 45 minutes
**Focus:** HuggingFace Ecosystem, Transformers, and Best Practices

---

## Slide 2: Agenda
1. **What is Hugging Face?** (5 min)
2. **Core Abstractions** (10 min)
   - Pipelines
   - Models & Tokenizers
   - Datasets
3. **Architecture Patterns** (12 min)
   - Encoders
   - Decoders
   - Encoder-Decoders
4. **Enterprise Landscape** (10 min)
   - Licensing
   - Open-source vs Commercial
5. **Q&A** (8 min)

---

## Slide 3: What is Hugging Face?

### The Mission
> "Democratizing state-of-the-art Natural Language Processing"

### The Platform
- **The Hub:** Central repository for 100K+ models, datasets, and applications
- **Transformers Library:** Python library with pre-trained models
- **Datasets Library:** Standardized data loading and processing
- **Spaces:** Deploy interactive demos and applications

### Why It Matters
- ✅ Pre-trained models save time and resources
- ✅ Community-driven model sharing
- ✅ Production-ready inference
- ✅ Cuts NLP development from months to days

---

## Slide 4: The NLP Transformation

### Pre-Hugging Face Era (2017)
- Models were scattered across papers and personal repos
- Training from scratch was standard practice
- Minimal standardization
- High barrier to entry

### Post-Hugging Face Era (2018+)
- Unified Hub for model sharing
- Pre-trained models ready to use
- Standardized APIs
- Democratized access to SOTA models

---

## Slide 5: Core Abstraction #1: Pipelines

### What is a Pipeline?
A high-level API for common NLP tasks

### Simple Code Example
```python
from transformers import pipeline

# 2 lines of code → Production-ready inference
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")
# Output: [{'label': 'POSITIVE', 'score': 0.9987}]
```

### Supported Tasks
- sentiment-analysis
- text-classification
- ner (token classification)
- question-answering
- summarization
- translation
- text-generation
- zero-shot-classification

---

## Slide 6: Core Abstraction #2: Models & Tokenizers

### Tokenizers
Convert raw text → Token IDs that models understand

### Models
Neural networks that take token IDs → Predictions

### The Flow
```
Text Input
    ↓
Tokenizer (breaks into subwords)
    ↓
Model (processes tokens)
    ↓
Predictions/Embeddings
```

### Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(...)

tokens = tokenizer("Hello world", return_tensors="pt")
outputs = model(**tokens)
```

---

## Slide 7: Core Abstraction #3: Datasets

### Why a Standardized Dataset API?
- No more format hassles (JSON, CSV, Parquet, etc.)
- Automatic download and caching
- Consistent preprocessing

### Key Features
```python
from datasets import load_dataset

# One line to load any dataset
dataset = load_dataset("imdb")

# Auto-handles splits, caching, processing
train_set = dataset["train"]
test_set = dataset["test"]  # imdb ships train/test/unsupervised splits

# Easy to preprocess
def preprocess(examples):
    return tokenizer(examples["text"])

processed = dataset.map(preprocess, batched=True)
```

---

## Slide 8: Architecture Pattern #1: Encoders

### BERT (Bidirectional Encoder Representations from Transformers)

```
Input Text
    ↓
Bidirectional Processing (attends to left AND right context)
    ↓
Output: Contextualized Embeddings
```

### Characteristics
- ✅ Bidirectional context
- ✅ Great for understanding tasks
- ✅ Fast inference

### Best For
- Classification (sentiment, intent, toxicity)
- Named Entity Recognition
- Semantic similarity
- Feature extraction

### Popular Variants
- **BERT-base**: 110M parameters
- **DistilBERT**: 66M parameters, 40% faster, ~97% of BERT's performance
- **RoBERTa**: Better pre-training, SOTA on many benchmarks
- **ALBERT**: Lightweight, parameter-shared

---

## Slide 9: Architecture Pattern #2: Decoders

### GPT-2/GPT-3 (Generative Pre-trained Transformer)

```
Input Tokens
    ↓
Unidirectional Processing (attends ONLY to previous tokens)
    ↓
Predict Next Token
    ↓
Repeat
```

### Characteristics
- ✅ Generate fluent text
- ✅ Few-shot learning capability
- ✅ Larger models encode more knowledge

### Best For
- Text generation
- Conversational AI
- Creative writing
- Code generation

### Trade-offs
- Slower at inference (generates token-by-token)
- More memory required
- Can hallucinate or produce irrelevant content

---

## Slide 10: Architecture Pattern #3: Encoder-Decoders

### T5 / BART (Seq2Seq Models)

```
Input Text (Encoder)
    ↓
Attention between Encoder & Decoder
    ↓
Output Text (Decoder)
```

### Characteristics
- ✅ Combines encoder AND decoder strengths
- ✅ Flexible input/output handling
- ✅ Great for structured tasks

### Best For
- Machine translation
- Summarization
- Question answering
- Paraphrase generation
- Style transfer

### Example Tasks
```
Input: "Summarize: The AI era has transformed industries..."
Output: "AI is transforming industries worldwide."

Input: "Translate to French: Hello world"
Output: "Bonjour le monde"
```

---

## Slide 11: Choosing the Right Architecture

| Task | Best Architecture | Example Model |
|------|-------------------|---------------|
| Classification | Encoder | BERT, RoBERTa |
| Named Entities | Encoder | BERT-NER |
| Question Answering (extractive) | Encoder | RoBERTa-SQuAD2 |
| Text Generation | Decoder | GPT-2 |
| Summarization | Encoder-Decoder | BART, T5 |
| Semantic Similarity | Encoder | Sentence-BERT |
| Machine Translation | Encoder-Decoder | MarianMT |

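The table's task-to-architecture mapping can also be captured in a small lookup helper. This is an illustrative sketch only; the checkpoint names are example choices, not an official registry:

```python
# Map task names to (architecture family, example checkpoint).
# Checkpoints here are illustrative examples, not the only options.
TASK_TO_ARCH = {
    "classification": ("encoder", "roberta-base"),
    "ner": ("encoder", "dslim/bert-base-NER"),
    "question-answering": ("encoder", "deepset/roberta-base-squad2"),
    "text-generation": ("decoder", "gpt2"),
    "summarization": ("encoder-decoder", "facebook/bart-large-cnn"),
    "translation": ("encoder-decoder", "Helsinki-NLP/opus-mt-en-fr"),
}

def suggest_model(task: str) -> str:
    """Return an example checkpoint for a task, or raise for unknown tasks."""
    try:
        _, model = TASK_TO_ARCH[task]
    except KeyError:
        raise ValueError(f"No suggestion for task: {task!r}")
    return model

print(suggest_model("summarization"))  # facebook/bart-large-cnn
```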
---

## Slide 12: Enterprise Landscape

### Open-Source Models
- Free, transparent, modifiable
- No licensing restrictions (usually)
- Community support
- Examples: BERT, RoBERTa, GPT-2

### Commercial Models
- Licensing required (sometimes)
- Vendor support
- Proprietary optimizations
- Examples: GPT-4, Claude, PaLM

### Licensing Types
- **MIT/Apache:** Fully permissive
- **OpenRAIL:** Open Responsible AI License (middle ground)
- **Proprietary:** Restricted use (API access only)

---

## Slide 13: Considerations for Production

### Performance vs. Speed Trade-off
```
Performance (Accuracy)
     ↑
     | Large Models
     | (BERT-large)
     |
     | Medium Models
     | (RoBERTa-base)
     |
     | Small Models
     | (DistilBERT)
─────┼──────────────→ Speed
```

### Optimization Techniques
- **Quantization:** Reduce model size by ~4x
- **Distillation:** Train a smaller model to mimic a larger one
- **Pruning:** Remove less important connections
- **Caching:** Store frequent predictions

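The caching idea, for example, maps directly onto Python's `functools.lru_cache`; `predict` below is a stand-in stub for a real (expensive) model call:

```python
from functools import lru_cache

calls = {"count": 0}  # tracks how often the "model" actually runs

@lru_cache(maxsize=1024)
def predict(text: str) -> str:
    """Stand-in for an expensive model call; results are memoized per input."""
    calls["count"] += 1
    return "POSITIVE" if "love" in text.lower() else "NEGATIVE"

print(predict("I love this product!"))  # POSITIVE (computed)
print(predict("I love this product!"))  # POSITIVE (served from the cache)
print(calls["count"])                   # 1
```

In a real service the same pattern applies, though you would normalize inputs first and use a bounded external cache (e.g. Redis) when running multiple replicas.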
---

## Slide 14: Real-World Example Flow

### Workflow: Building a Customer Support Chatbot

1. **Classify Intent** (What does the customer want?)
   - Model: BERT for intent classification
   - Output: bug_report, billing_question, feature_request

2. **Extract Entities** (Who, what, when?)
   - Model: NER model
   - Output: customer_id, product_name, date

3. **Search Knowledge Base** (Find relevant docs)
   - Model: Semantic similarity with SBERT
   - Output: Top matching solutions

4. **Generate Response** (Compose answer)
   - Model: T5 or GPT-2
   - Output: Customer-tailored response

---

## Slide 15: Key Takeaways

### ✅ Main Lessons
1. **HuggingFace democratizes NLP**
   - Pre-trained models, no need to train from scratch

2. **Pipelines make NLP accessible**
   - Complex tasks in 2-3 lines of Python

3. **Choose the right architecture**
   - Encoders for understanding, Decoders for generation, Encoder-Decoders for structured tasks

4. **Production requires thoughtfulness**
   - Consider licensing, performance, and trade-offs

5. **The community is your resource**
   - 100K+ models ready to use on the Hub

### 🚀 Next Steps
- **Session 2:** Hands-on development and advanced techniques
- **Post-Session:** Experiment with your own data and models

---

## Slide 16: Q&A

**Questions?**

We'll cover more technical details in Session 2, including:
- Tokenization mechanics
- Fine-tuning your own models
- Deploying to production with Spaces
- Publishing to the Hub

---

## Slide 17: Bonus: Resource Links

### Documentation
- https://huggingface.co/docs/transformers/
- https://huggingface.co/docs/datasets/
- https://huggingface.co/course/

### Model Hub
- https://huggingface.co/models

### Try It Out
- This very demo: Spaces app with live inference

---

## Session 1 Complete! 🎉
**Next: Session 2 - Hands-On Developer Workshop (90 min)**
slides/SESSION2_SLIDES.md ADDED
@@ -0,0 +1,731 @@
# Session 2: Hands-On Developer Workshop
## Building End-to-End NLP Workflows with Hugging Face
### (90 minutes)

---

## Slide 1: Title Slide
**Hands-On Developer Workshop**
### Building End-to-End NLP Workflows with HuggingFace

**Duration:** 90 minutes
**Format:** Interactive coding + Demos + Exercises
**Difficulty:** Intermediate

---

## Slide 2: Session Agenda (90 min)

1. **Tokenization Deep Dive** (15 min)
   - How tokenizers work
   - Tokenization process breakdown
   - Interactive explorer

2. **Inference Playground** (45 min)
   - Live demos across 5+ NLP tasks
   - Understanding model outputs
   - Confidence scores and interpretability

3. **Exercise Checkpoint** (20 min)
   - Try your own examples
   - Experiment with different inputs
   - Troubleshooting guide

4. **Wrap-Up & Next Steps** (10 min)
   - Publishing to the Hub
   - Fine-tuning overview
   - Post-session resources

---

## Slide 3: Review: Session 1 Recap

### We Learned:
✅ HuggingFace Platform Overview
✅ Pipelines - high-level API
✅ Encoders, Decoders, Encoder-Decoders
✅ Architecture selection guide

### Now We'll Learn:
🔬 **HOW things work under the hood**
👨‍💻 **Hands-on implementation**
⚙️ **Production considerations**

---

## Slide 4: Part 1 - Tokenization Deep Dive

### What is Tokenization?
Converting raw text → Numerical representations that models understand

### Why Does This Matter?
- Models don't understand words, only numbers
- Tokenization decisions affect model performance
- Different tokenizers produce different results

---

## Slide 5: Tokenization Step-by-Step

### Example Input
```
"Hello, how are you?"
```

### Step 1: Split into tokens (subwords)
```
["Hello", ",", "how", "are", "you", "?"]
```

### Step 2: Convert to IDs
```
[7592, 1010, 2129, 2024, 2017, 1029]
```

### Step 3: Add attention masks (1 = real token, 0 = padding)
```
[1, 1, 1, 1, 1, 1]
```

### Step 4: Model processes
```
Input: [7592, 1010, 2129, 2024, 2017, 1029]
Output: [logits, embeddings, predictions]
```

---

## Slide 6: Tokenization Strategies

### 1. Word-level Tokenization
**Split on spaces and punctuation**
```
"I love NLP" → ["I", "love", "NLP"]
```
❌ Problem: Unbounded vocabulary (unseen words become unknown tokens)

### 2. Character-level Tokenization
**Split character by character**
```
"NLP" → ["N", "L", "P"]
```
❌ Problem: Too many tokens, harder to learn patterns

### 3. Subword Tokenization (RECOMMENDED) ✅
**Smart splitting into frequently-seen sequences**
```
"uncharacteristic" → ["un", "character", "istic"]
"tokenization" → ["token", "##ization"]  (## marks a word continuation in WordPiece)
```
✅ Balance between vocabulary size and sequence length

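The subword idea can be demonstrated with a toy greedy longest-match tokenizer in the spirit of WordPiece. The vocabulary below is made up for illustration; real tokenizers learn theirs from a corpus and mark continuations explicitly:

```python
# Toy longest-match subword tokenizer (illustrative, made-up vocabulary).
VOCAB = {"un", "character", "istic", "token", "ization", "play", "ing"}

def subword_tokenize(word: str) -> list[str]:
    tokens, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry matching at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            return ["[UNK]"]  # no subword matches: fall back to unknown
    return tokens

print(subword_tokenize("uncharacteristic"))  # ['un', 'character', 'istic']
print(subword_tokenize("playing"))           # ['play', 'ing']
```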
---

## Slide 7: Common Tokenizers

| Tokenizer | Used By | Strategy |
|-----------|---------|----------|
| WordPiece | BERT | Greedy merge of frequent pairs |
| BPE | GPT-2, RoBERTa | Byte-Pair Encoding |
| SentencePiece | T5, XLNet | Language-agnostic |
| Unigram | ALBERT | Probabilistic |

### Key Insight
Different tokenizers may produce different tokens for the same input!

---

## Slide 8: Tokenization in Practice

### Code Example
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "I love Hugging Face!"

# Tokenize (IDs below are illustrative; exact values depend on the vocab)
tokens = tokenizer(text)
# Output: {
#     'input_ids': [101, 1045, 2572, 23604, 2469, 999, 102],
#     'token_type_ids': [0, 0, 0, 0, 0, 0, 0],
#     'attention_mask': [1, 1, 1, 1, 1, 1, 1]
# }

# Convert back to tokens
token_strings = tokenizer.convert_ids_to_tokens(tokens['input_ids'])
# Output: ['[CLS]', 'i', 'love', 'hugging', 'face', '!', '[SEP]']
```

### Special Tokens
- `[CLS]` = Classification token (beginning)
- `[SEP]` = Separator (between sentences)
- `[PAD]` = Padding (for shorter sequences)
- `[UNK]` = Unknown token (unseen words)
- `[MASK]` = Masked token (for BERT pre-training)

---

## Slide 9: Attention Masks

### Why Attention Masks?
Models process batches of text, and different sequences have different lengths,
so we need to mark which tokens are "real" vs padding.

### Example
```
Sentence 1: "I love NLP" (3 tokens)
Sentence 2: "Hello" (1 token)

After padding to length 5:
Sequence 1: [I, love, NLP, [PAD], [PAD]]
Sequence 2: [Hello, [PAD], [PAD], [PAD], [PAD]]

Attention Masks:
Sequence 1: [1, 1, 1, 0, 0]
Sequence 2: [1, 0, 0, 0, 0]   (0 = padding, ignored)

Model learns: "When attention_mask=0, ignore this token"
```

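The padding logic above can be sketched in plain Python. This is a simplification of what `tokenizer(..., padding=True)` does for you; real tokenizers use a dedicated `[PAD]` token id:

```python
def pad_batch(sequences: list[list[int]], pad_id: int = 0):
    """Pad token-id sequences to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_masks.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_masks

ids, masks = pad_batch([[7592, 2129, 2024], [7592]])
print(ids)    # [[7592, 2129, 2024], [7592, 0, 0]]
print(masks)  # [[1, 1, 1], [1, 0, 0]]
```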
---

## Slide 10: Token Limitations

### Max Sequence Length
Different models have different token limits!

| Model | Max Tokens | Max Characters (approx) |
|-------|------------|-------------------------|
| BERT | 512 | 2048 |
| GPT-2 | 1024 | 4096 |
| RoBERTa | 512 | 2048 |
| Longformer | 4096 | 16384 |
| T5 | 512 | 2048 |

### What happens when text is too long?
```python
# Option 1: Truncation
tokens = tokenizer(long_text, max_length=512, truncation=True)

# Option 2: Split the document into chunks
chunks = [text[i:i+2000] for i in range(0, len(text), 2000)]
```

217
+
218
+ ## Slide 11: Part 2 - Inference Playground
219
+
220
+ ### Now let's RUN models on real text!
221
+
222
+ ### We'll Explore 5 NLP Tasks:
223
+ 1. **Sentiment Analysis** - Classify emotions
224
+ 2. **Named Entity Recognition** - Extract entities
225
+ 3. **Question Answering** - Answer from context
226
+ 4. **Summarization** - Generate summaries
227
+ 5. **Semantic Similarity** - Compare meaning
228
+
229
+ ---
230
+
231
+ ## Slide 12: Task 1 - Sentiment Analysis
232
+
233
+ ## How It Works
234
+ ```
235
+ Input: "I love this product!"
236
+ Model: DistilBERT (fine-tuned on SST-2)
237
+ Process:
238
+ 1. Tokenize → [101, 1045, 2572, ...]
239
+ 2. Feed to BERT → [embeddings]
240
+ 3. Classification head → [logits]
241
+ 4. Softmax → [0.9987, 0.0013]
242
+
243
+ Output: POSITIVE (99.87% confidence)
244
+ ```
245
+
246
+ ### Use Cases
247
+ - Customer feedback analysis
248
+ - Social media monitoring
249
+ - Review classification
250
+ - Support ticket triage
251
+
252
+ ### Demo Time! 🎬
253
+ Check the Inference Playground tab
254
+
255
+ ---
256
+
257
+ ## Slide 13: Task 2 - Named Entity Recognition
258
+
259
+ ### How It Works
260
+ ```
261
+ Input: "Apple Inc. was founded by Steve Jobs"
262
+
263
+ Process:
264
+ 1. Tokenize → ["Apple", "Inc", ".", "was", ...]
265
+ 2. Classify each token:
266
+ - "Apple" → Organization
267
+ - "Inc" → Organization
268
+ - "Steve" → Person
269
+ - "Jobs" → Person
270
+
271
+ Output:
272
+ Apple Inc. [ORG]
273
+ Steve Jobs [PER]
274
+ ```
275
+
276
+ ### Use Cases
277
+ - Resume parsing
278
+ - Information extraction
279
+ - Knowledge base construction
280
+ - Data mining from documents
281
+
282
+ ### Aggregation Strategies
283
+ ```
284
+ Token-level: ["Apple", "Inc"] separately
285
+ Simple: Keep adjacent same entities
286
+ Aggregated: Combine subwords into words
287
+ ```
288
+
289
+ ---
290
+
291
+ ## Slide 14: Task 3 - Question Answering
292
+
293
+ ### How It Works
294
+ ```
295
+ Context: "The Eiffel Tower is in Paris, France."
296
+ Question: "Where is the Eiffel Tower?"
297
+
298
+ Process:
299
+ 1. Tokenize both
300
+ 2. Encode context + question pairs
301
+ 3. Model predicts START and END positions
302
+ 4. Extract text span
303
+
304
+ Output: "Paris, France"
305
+ ```
306
+
307
+ ### Model Architecture
308
+ Uses SQuAD fine-tuned model (Question Answering dataset)
309
+
310
+ ### Use Cases
311
+ - FAQ systems
312
+ - Document search
313
+ - Customer support automation
314
+ - Knowledge base queries
315
+
316
+ ### Limitations
317
+ - Must have answer in provided context
318
+ - Works best with structured context
319
+ - Answer must be continuous span
320
+
321
+ ---
322
+
323
+ ## Slide 15: Task 4 - Summarization
324
+
325
+ ## How It Works
326
+ ```
327
+ Input: "The Hugging Face transformers library provides..."
328
+
329
+ Process:
330
+ 1. Encode full input
331
+ 2. Use attention to identify key sentences
332
+ 3. Generate summary with decoder
333
+ 4. Stop at max_length
334
+
335
+ Output: "Hugging Face offers access to pre-trained..."
336
+ ```
337
+
338
+ ### Abstractive vs Extractive
339
+ ```
340
+ Abstractive (our model uses this):
341
+ - Generate NEW sentences
342
+ - More fluent, may hallucinate
343
+
344
+ Extractive:
345
+ - Select existing sentences
346
+ - More faithful, but choppy
347
+ ```
348
+
349
+ ### Parameters
350
+ - `max_length`: Maximum summary length
351
+ - `min_length`: Minimum summary length
352
+ - `num_beams`: Quality vs speed trade-off
353
+
354
+ ### Use Cases
355
+ - Article summarization
356
+ - Report generation
357
+ - Meeting notes
358
+ - Long document processing
359
+
360
+ ---
361
+
362
+ ## Slide 16: Task 5 - Semantic Similarity
363
+
364
+ ## How It Works
365
+ ```
366
+ Sentence 1: "The cat is sleeping on the mat"
367
+ Sentence 2: "A feline is resting on the rug"
368
+
369
+ Process:
370
+ 1. Encode sentence 1 → [embedding vector: 384-dimensional]
371
+ 2. Encode sentence 2 → [embedding vector: 384-dimensional]
372
+ 3. Calculate cosine similarity
373
+ 4. Score: -1 to 1 (1 = identical meaning)
374
+
375
+ Output: Similarity = 0.92 (Very Similar!)
376
+ ```
377
+
378
+ ### Model: Sentence-BERT
379
+ - Optimized for semantic meaning
380
+ - Much faster than full transformers
381
+ - Maintains meaning representation
382
+
383
+ ### Use Cases
384
+ - Duplicate detection
385
+ - Recommendation systems
386
+ - Plagiarism detection
387
+ - Document clustering
388
+ - Semantic search
389
+
390
+ ### Similarity Scores
391
+ ```
392
+ 1.0 = Identical meaning
393
+ 0.8+ = Very similar
394
+ 0.6+ = Similar
395
+ 0.4+ = Somewhat related
396
+ 0.0 = No relation
397
+ ```
398
+
399
+ ---
400
+
401
+ ## Slide 17: Understanding Model Outputs
402
+
403
+ ### Confidence Scores Matter!
404
+ ```
405
+ Sentiment: "I like this movie"
406
+ Output: POSITIVE (0.78 confidence)
407
+
408
+ Should I trust this?
409
+ - 78% is decent, but not super high
410
+ - Might want manual review for this one
411
+
412
+ vs.
413
+
414
+ Output: POSITIVE (0.98 confidence)
415
+ - Very high confidence, safer to automate
416
+ ```
417
+
418
+ ### Decision Rules
419
+ ```
420
+ High confidence (>0.90): Automate with confidence
421
+ Medium confidence (0.70-0.90): Automate with logging
422
+ Low confidence (<0.70): Flag for review
423
+ ```
424
+
425
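Those decision rules translate directly into a small routing function; the thresholds are the illustrative ones from the slide, and in practice you would tune them per task:

```python
def route_prediction(confidence: float) -> str:
    """Map a model confidence score to an action, per the slide's thresholds."""
    if confidence > 0.90:
        return "automate"
    if confidence >= 0.70:
        return "automate_with_logging"
    return "flag_for_review"

print(route_prediction(0.98))  # automate
print(route_prediction(0.78))  # automate_with_logging
print(route_prediction(0.55))  # flag_for_review
```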
+ ---
426
+
427
+ ## Slide 18: Part 3 - Exercise Checkpoint
428
+
429
+ ### Your Turn! 🎯
430
+
431
+ ### Challenge 1: Sentiment Analysis
432
+ Try different types of text:
433
+ - Product reviews
434
+ - Social media posts
435
+ - Technical feedback
436
+ - Sarcasm ("Yeah, GREAT idea!")
437
+
438
+ ### Challenge 2: NER
439
+ Test with:
440
+ - Different domains (sports, business, medicine)
441
+ - Multiple languages
442
+ - Misspellings or slang
443
+
444
+ ### Challenge 3: QA
445
+ Build a simple FAQ system:
446
+ - Create a knowledge base document
447
+ - Ask different questions
448
+ - See how model performs
449
+
450
+ ### Challenge 4: Creative Exploration
451
+ Mix and match:
452
+ - Run sentiment on generated summaries
453
+ - Extract entities from question answers
454
+ - Create a pipeline!
455
+
456
+ ---
457
+
458
+ ## Slide 19: Troubleshooting Guide
459
+
460
+ ### Common Issues & Solutions
461
+
462
+ #### Issue 1: Text Too Long
463
+ ```
464
+ Error: Input sequence too long
465
+
466
+ Solution:
467
+ option1 = tokenizer(text, max_length=512, truncation=True)
468
+ # Text is trimmed to first 512 tokens
469
+
470
+ option2 = [text[i:i+2000] for i in range(0, len(text), 2000)]
471
+ # Process in chunks, combine results
472
+ ```
+
+ #### Issue 2: Unexpected Results
+ ```
+ Input: "This is NOT good!"
+ Output: POSITIVE (but we wanted NEGATIVE)
+
+ Cause: Models may miss negation
+
+ Solution:
+ - Model is sensitive to specific words/context
+ - Try different phrasing
+ - Consider fine-tuning on domain data
+ - Use an ensemble of models
+ ```
+
+ #### Issue 3: Slow Inference
+ ```
+ Cause: Large model + slow hardware
+
+ Solutions:
+ - Use DistilBERT instead of BERT (40% smaller, ~60% faster)
+ - Enable quantization
+ - Use GPU/TPU
+ - Batch process multiple inputs
+ ```
498
+
499
+ ---
500
+
501
+ ## Slide 20: Production Considerations
+
+ ### Performance vs Speed
+ ```
+ DistilBERT: Fast, 97% of BERT performance ✅
+ BERT-base: Good balance
+ BERT-large: Slower, best quality
+ ```
+
+ ### Optimization Techniques
+ ```
+ 1. Quantization: Model size ↓ 4x, speed ↑ 2-3x
+ 2. Distillation: Train smaller model, keep quality
+ 3. Pruning: Remove redundant parameters
+ 4. Caching: Store frequent predictions
+ 5. Batching: Process multiple inputs together
+ ```
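+
+ Quantization, for instance, is nearly a one-liner in PyTorch. A sketch on a toy model rather than a full transformer:
+ ```python
+ import torch
+ import torch.nn as nn
+
+ # Toy stand-in for a transformer's dense layers.
+ model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
+
+ # Dynamic quantization: weights stored as int8, activations quantized on the fly.
+ quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
+
+ out = quantized(torch.randn(1, 16))
+ print(out.shape)  # torch.Size([1, 2])
+ ```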
+
+ ### Monitoring & Logging
+ ```
+ - Track model confidence
+ - Log failures/edge cases
+ - Monitor latency
+ - User feedback loop
+ ```
+
+ ---
+
+ ## Slide 21: Publishing to the Hub
+
+ ### 3-Step Process
+
+ #### Step 1: Create Model Card
+ ````markdown
+ # Model Name
+ Description of what the model does
+
+ ## Usage
+ ```python
+ from transformers import pipeline
+ classifier = pipeline("sentiment-analysis",
+                       model="your-username/model-name")
+ ```
+
+ ## Training Data
+ Trained on [dataset]
+
+ ## Performance
+ Accuracy: X%, F1: Y%
+ ````
+
+ #### Step 2: Upload
+ ```python
+ from transformers import AutoModelForSequenceClassification
+
+ # Authenticate first: `huggingface-cli login` (or huggingface_hub.login())
+ model = AutoModelForSequenceClassification.from_pretrained(...)
+ model.push_to_hub("your-username/model-name")
+ ```
+
+ #### Step 3: Share
+ ```
+ https://huggingface.co/your-username/model-name
+ ```
+
+ ---
+
+ ## Slide 22: Fine-Tuning Overview
+
+ ### When to Fine-Tune
+ ```
+ New Task  ←  Need Fine-Tuning?  →  Different Domain/Language
+     ↓                                      ↓
+ Legal documents                      Medical texts
+ vs BERT trained on Wikipedia         vs general BERT
+ ```
+
+ ### Fine-Tuning Process
+ ```
+ 1. Load pre-trained model
+ 2. Add task-specific head
+ 3. Train on your data (small learning rate!)
+ 4. Evaluate and iterate
+ 5. Deploy!
+ ```
+
+ ### Example: Custom Sentiment
+ ```python
+ from transformers import (
+     AutoModelForSequenceClassification,
+     Trainer,
+     TrainingArguments,
+ )
+
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "bert-base-uncased",
+     num_labels=2
+ )
+ # Load YOUR labeled data
+ train_dataset = load_my_data("path/to/data.csv")
+
+ # Trainer takes hyperparameters via TrainingArguments, not directly
+ training_args = TrainingArguments(
+     output_dir="./results",
+     learning_rate=2e-5,  # Small learning rate!
+ )
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=train_dataset,
+ )
+ trainer.train()
+ ```
+
+ ---
+
+ ## Slide 23: Real-World Pipeline Example
+
+ ### Build a Document Classifier + Summarizer
+
+ ```
+ Raw Document
+
+ [1] Tokenize
+
+ [2] Classify Intent
+     ├─ Legal: Route to legal team
+     ├─ Finance: Route to finance team
+     └─ Tech: Route to tech team
+
+ [3] Extract Entities (contract dates, amounts, parties)
+
+ [4] Summarize (generate executive summary)
+
+ [5] Store (vector DB for search)
+
+ Output: Classified, Summarized, Searchable
+ ```
+
+ ### Code Pattern
+ ```python
+ from transformers import pipeline
+
+ # Each step reusable, composable
+ classifier_pipe = pipeline("zero-shot-classification")
+ ner_pipe = pipeline("ner")
+ summarizer_pipe = pipeline("summarization")
+
+ def process_document(doc):
+     intent = classifier_pipe(doc, candidate_labels=["legal", "finance", "tech"])
+     entities = ner_pipe(doc)
+     summary = summarizer_pipe(doc, max_length=100)
+
+     return {
+         "intent": intent,
+         "entities": entities,
+         "summary": summary,
+     }
+ ```
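+
+ The routing branch in step [2] can key off the zero-shot output, which is a dict with `labels` and `scores` sorted best-first; a small helper turns it into a destination (the route names here are placeholders):
+ ```python
+ ROUTES = {"legal": "legal-team", "finance": "finance-team", "tech": "tech-team"}
+
+ def route(intent_result: dict, threshold: float = 0.5) -> str:
+     """Pick a destination from the top zero-shot label; fall back when unsure."""
+     top_label = intent_result["labels"][0]
+     top_score = intent_result["scores"][0]
+     if top_score < threshold:
+         return "manual-triage"
+     return ROUTES[top_label]
+
+ # Shape of a real zero-shot-classification result (values invented here):
+ example = {"labels": ["finance", "legal", "tech"], "scores": [0.91, 0.06, 0.03]}
+ print(route(example))  # finance-team
+ ```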
+
+ ---
+
+ ## Slide 24: Key Takeaways (Session 2)
+
+ ### 🔬 Technical Deep Dive
+ 1. **Tokenization is foundational**
+    - Text → Numerical representations
+    - Different tokenizers, different results
+    - Handles special tokens and attention masks
+
+ 2. **Inference is flexible**
+    - 5 different task categories
+    - Models = transformers + task-specific heads
+    - Confidence scores guide automation decisions
+
+ 3. **Production requires engineering**
+    - Monitor, log, optimize
+    - Consider performance/speed tradeoffs
+    - Fine-tune for domain-specific needs
+
+ ### 🚀 Next Steps
+ 1. Experiment with your own data
+ 2. Publish a model to the Hub
+ 3. Fine-tune for your use case
+ 4. Deploy to Spaces (like this app!)
+
+ ---
+
+ ## Slide 25: Resources & Links
+
+ ### Official Documentation
+ - https://huggingface.co/docs/transformers/
+ - https://huggingface.co/docs/datasets/
+ - https://huggingface.co/course/
+
+ ### Explore Models
+ - https://huggingface.co/models
+
+ ### Build & Deploy
+ - https://huggingface.co/spaces (Deploy Gradio apps)
+ - Hugging Face Hub API for uploading
+
+ ### Community
+ - https://discuss.huggingface.co/
+ - GitHub Issues
+ - Community Discord
+
+ ---
+
+ ## Slide 26: Q&A & Open Discussion
+
+ ### Questions We Can Answer
+ - How do I fine-tune for my specific domain?
+ - What model should I use for task X?
+ - How do I deploy to production?
+ - How do I handle multiple languages and multilingual models?
+
+ ### Further Learning
+ - Session materials are on the Hub
+ - Code examples are in the Spaces app
+ - Documentation links are in the Resources tab
+
+ ---
+
+ ## Session 2 Complete! 🎉
+
+ ### You Now Know:
+ ✅ How tokenization works
+ ✅ How to run inference on 5+ NLP tasks
+ ✅ How to interpret model outputs
+ ✅ How to publish and deploy models
+ ✅ How to think about fine-tuning and optimization
+
+ ### Keep Learning!
+ The NLP field is rapidly evolving. Stay updated with:
+ - Hugging Face blog
+ - Research papers (arXiv)
+ - Community discussions
+ - Your own experiments
+
+ **Thank you for attending! 🙏**
test DELETED
@@ -1 +0,0 @@
- xyz
 
 
utils.py ADDED
@@ -0,0 +1,129 @@
+ """
+ Utility functions for HuggingFace Enabling Sessions Spaces app
+ """
+ from functools import lru_cache
+
+ from transformers import pipeline, AutoTokenizer
+ from sentence_transformers import SentenceTransformer, util
+
+ import config
+
+
+ @lru_cache(maxsize=10)
+ def load_pipeline(task_type: str):
+     """Load and cache a pipeline for the given task."""
+     if task_type == "sentiment":
+         return pipeline("sentiment-analysis", model=config.SENTIMENT_MODEL)
+     elif task_type == "ner":
+         return pipeline("ner", model=config.NER_MODEL, aggregation_strategy="simple")
+     elif task_type == "qa":
+         return pipeline("question-answering", model=config.QA_MODEL)
+     elif task_type == "summarization":
+         return pipeline("summarization", model=config.SUMMARIZATION_MODEL)
+     else:
+         raise ValueError(f"Unknown task type: {task_type}")
+
+
+ @lru_cache(maxsize=1)
+ def load_embedding_model():
+     """Load and cache the sentence-transformers embedding model."""
+     return SentenceTransformer(config.EMBEDDINGS_MODEL)
+
+
+ def run_sentiment_analysis(text: str):
+     """Run sentiment analysis on text."""
+     pipe = load_pipeline("sentiment")
+     result = pipe(text[:512])  # Rough character-level cut to stay under the token limit
+     return result[0] if result else {"label": "Unknown", "score": 0}
+
+
+ def run_ner(text: str):
+     """Run Named Entity Recognition on text."""
+     pipe = load_pipeline("ner")
+     return pipe(text[:512])
+
+
+ def run_qa(context: str, question: str):
+     """Run question answering on context."""
+     pipe = load_pipeline("qa")
+     try:
+         return pipe(question=question, context=context[:512])
+     except Exception as e:
+         return {"error": str(e), "answer": "Unable to answer", "score": 0}
+
+
+ def run_summarization(text: str):
+     """Generate a summary of text."""
+     pipe = load_pipeline("summarization")
+     try:
+         result = pipe(text[:1024], max_length=150, min_length=30, do_sample=False)
+         return result[0]["summary_text"] if result else "Unable to summarize"
+     except Exception as e:
+         return f"Error: {str(e)}"
+
+
+ def compute_similarity(text1: str, text2: str):
+     """Compute semantic similarity between two texts."""
+     try:
+         model = load_embedding_model()  # cached, not reloaded on every call
+         embeddings = model.encode([text1, text2], convert_to_tensor=True)
+         similarity = util.pytorch_cos_sim(embeddings[0], embeddings[1])
+         return float(similarity.item())
+     except Exception as e:
+         return f"Error: {str(e)}"
+
+
+ def tokenize_text(text: str, model_name: str = config.SENTIMENT_MODEL):
+     """Tokenize text and show tokens with IDs."""
+     try:
+         tokenizer = AutoTokenizer.from_pretrained(model_name)
+         encoding = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
+
+         tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0])
+         token_ids = encoding["input_ids"][0].tolist()
+         attention_mask = encoding["attention_mask"][0].tolist()
+
+         return {
+             "tokens": tokens,
+             "token_ids": token_ids,
+             "attention_mask": attention_mask,
+             "num_tokens": len(tokens),
+         }
+     except Exception as e:
+         return {"error": str(e)}
+
+
+ def format_tokenizer_output(tokenization_result):
+     """Format tokenization result for display."""
+     if "error" in tokenization_result:
+         return f"Error: {tokenization_result['error']}"
+
+     output = f"**Total Tokens:** {tokenization_result['num_tokens']}\n\n"
+     output += "| Token | Token ID | Attention Mask |\n"
+     output += "|-------|----------|----------------|\n"
+
+     for token, tid, attn in zip(
+         tokenization_result["tokens"],
+         tokenization_result["token_ids"],
+         tokenization_result["attention_mask"],
+     ):
+         output += f"| {token} | {tid} | {attn} |\n"
+
+     return output
+
+
+ def format_ner_output(ner_results):
+     """Format NER results for display."""
+     if not ner_results:
+         return "No entities found"
+
+     output = "| Entity | Type | Score |\n"
+     output += "|--------|------|-------|\n"
+
+     for result in ner_results:
+         word = result.get("word", "")
+         entity_type = result.get("entity_group", result.get("entity", ""))
+         score = result.get("score", 0)
+         output += f"| {word} | {entity_type} | {score:.4f} |\n"
+
+     return output