---
title: MCQ Generator
emoji: 📝
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.33.0
app_file: app/main.py
pinned: false
---

πŸ“ MCQ Generator β€” Automatic Multiple Choice Question Generator

An end-to-end NLP pipeline that reads any text passage and automatically generates a complete multiple-choice quiz with scoring and explanations.

Built as a course project for an NLP curriculum covering Modules I–IV: tokenization, word embeddings, transformers, and natural language generation.


📌 Table of Contents

  1. What This Project Does
  2. Live Demo
  3. How It Works β€” The Full Pipeline
  4. NLP Techniques Used
  5. Project Structure
  6. Each File Explained
  7. Tech Stack
  8. Setup & Installation
  9. Running the App
  10. Testing Each Module
  11. Sample Output
  12. What Makes a Good Passage
  13. Known Limitations
  14. Future Work
  15. Related Research
  16. Course Outcomes Covered

What This Project Does

Given any factual text passage, this system:

  1. Extracts the most important sentences using TF-IDF ranking
  2. Identifies answer candidates using Named Entity Recognition (NER)
  3. Generates natural language questions using a T5 transformer model
  4. Creates plausible wrong options (distractors) using WordNet and NER
  5. Presents an interactive quiz with scoring and per-question explanations

Example:

Input passage:

Albert Einstein was born on March 14, 1879, in Ulm, Germany.
He was awarded the Nobel Prize in Physics in 1921 for his
discovery of the photoelectric effect.

Generated MCQ:

Q: Where was Albert Einstein born?

A. France
B. Germany  ✓
C. United States
D. Switzerland

Live Demo

streamlit run app/main.py

Opens at http://localhost:8501 in your browser.


How It Works — The Full Pipeline

Raw Text Passage
       │
       ▼
┌─────────────────────────────────────────────┐
│  STEP 1: PREPROCESSING  (preprocessor.py)   │
│                                             │
│  • Split into sentences (spaCy)             │
│  • Rank by TF-IDF score (scikit-learn)      │
│  • Extract Named Entities (spaCy NER)       │
│  • Filter answer candidates (blacklist)     │
└─────────────────┬───────────────────────────┘
                  │  top sentences + answer candidates
                  ▼
┌─────────────────────────────────────────────┐
│  STEP 2: QUESTION GENERATION                │
│          (question_generator.py)            │
│                                             │
│  • Highlight answer in sentence with <hl>   │
│  • Feed to T5 transformer model             │
│  • Generate 3 candidate questions           │
│  • Validate: reject circular/vague Qs       │
└─────────────────┬───────────────────────────┘
                  │  (question, answer) pairs
                  ▼
┌─────────────────────────────────────────────┐
│  STEP 3: DISTRACTOR GENERATION              │
│          (distractor_generator.py)          │
│                                             │
│  Strategy 1: Same-type NER entities         │
│              from the passage               │
│  Strategy 2: WordNet hyponym siblings       │
│  Strategy 3: Cross-label fallback           │
└─────────────────┬───────────────────────────┘
                  │  3 wrong options per question
                  ▼
┌─────────────────────────────────────────────┐
│  STEP 4: MCQ ASSEMBLY + VALIDATION          │
│          (mcq_builder.py)                   │
│                                             │
│  • Combine answer + distractors             │
│  • Shuffle options randomly                 │
│  • Quality gate: dedup, similarity check    │
│  • Return list of MCQ objects               │
└─────────────────┬───────────────────────────┘
                  │  validated MCQ list
                  ▼
┌─────────────────────────────────────────────┐
│  STEP 5: QUIZ UI + SCORING                  │
│          (app/main.py + evaluator.py)       │
│                                             │
│  • Streamlit 3-screen app                   │
│  • Input → Quiz → Results                   │
│  • Score, feedback, explanations            │
└─────────────────────────────────────────────┘

NLP Techniques Used

Module I — Foundational NLP

| Technique | Where Used | Purpose |
|---|---|---|
| Tokenization | preprocessor.py | Split text into sentences and tokens using spaCy |
| Lemmatization | preprocessor.py | Normalize word forms for TF-IDF |
| Stop word removal | preprocessor.py | Filter noise before TF-IDF scoring |
| Named Entity Recognition (NER) | preprocessor.py | Find PERSON, ORG, DATE, GPE as answer candidates |
| POS Tagging | preprocessor.py | Identify nouns and proper nouns |
| WordNet | distractor_generator.py | Find semantically related words as distractors |
| Synsets / Hyponyms | distractor_generator.py | Navigate WordNet hierarchy for same-category words |

Module II — Word Representation

| Technique | Where Used | Purpose |
|---|---|---|
| TF-IDF | preprocessor.py | Rank sentences by information density |
| Word Embeddings (GloVe) | distractor_generator.py | Optional cosine-similarity based distractor finding |

TF-IDF explained:

  • TF (Term Frequency) = how often a word appears in this sentence
  • IDF (Inverse Document Frequency) = how rare the word is across all sentences
  • High TF-IDF score = sentence contains rare, informative words → good question source
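The ranking idea above can be sketched with scikit-learn in a few lines. This is an illustrative sketch, not the project's actual `rank_sentences` implementation; the scoring rule (sum of a sentence's TF-IDF weights) is one common choice.

```python
# Sketch of TF-IDF sentence ranking, assuming scikit-learn is installed.
from sklearn.feature_extraction.text import TfidfVectorizer

def rank_sentences(sentences, top_n=3):
    """Score each sentence by the sum of its TF-IDF weights, return the top N."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(sentences)  # rows = sentences, cols = terms
    scores = matrix.sum(axis=1).A1                # flatten row sums to a 1-D array
    ranked = sorted(zip(sentences, scores), key=lambda p: p[1], reverse=True)
    return [s for s, _ in ranked[:top_n]]

sentences = [
    "ISRO was founded in 1969 by Vikram Sarabhai.",
    "It is an organisation.",
    "Chandrayaan-1 discovered water molecules on the Moon in 2008.",
]
print(rank_sentences(sentences, top_n=2))
```

Sentences packed with rare, specific terms score higher than filler like "It is an organisation.", which is exactly why they make better question sources.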

Module III — Deep Learning for NLP

| Technique | Where Used | Purpose |
|---|---|---|
| Transformers | question_generator.py | T5 model for question generation |
| Transfer Learning | question_generator.py | Using pre-trained T5 fine-tuned on SQuAD |
| Seq2Seq | question_generator.py | Encoder-decoder architecture of T5 |
| Beam Search | question_generator.py | Generate multiple question candidates, pick best |

Module IV — Advanced NLP

| Technique | Where Used | Purpose |
|---|---|---|
| T5 (Text-to-Text Transfer Transformer) | question_generator.py | State-of-the-art QG model |
| Natural Language Generation (NLG) | question_generator.py | Generating grammatical questions |
| Subword Tokenization (SentencePiece) | question_generator.py | T5's tokenizer handles rare/unknown words |
| Pre-trained Models | question_generator.py | valhalla/t5-small-qg-hl from HuggingFace |

Project Structure

mcq_generator/
│
├── src/                          # Core NLP pipeline modules
│   ├── __init__.py
│   ├── preprocessor.py           # Text cleaning, TF-IDF, NER, answer extraction
│   ├── question_generator.py     # T5-based question generation
│   ├── distractor_generator.py   # WordNet + NER distractor generation
│   ├── mcq_builder.py            # Pipeline orchestrator + MCQ dataclass
│   └── evaluator.py              # Answer checking and scoring
│
├── app/                          # Streamlit web application
│   ├── __init__.py
│   ├── main.py                   # 3-screen app: input → quiz → results
│   └── components.py             # Reusable UI components
│
├── data/
│   └── sample_passages.json      # 5 test passages (ISRO, Gandhi, AI, etc.)
│
├── models/                       # (gitignored) Downloaded model files
│   └── README.md
│
├── notebooks/                    # Jupyter notebooks for exploration
│
├── config.py                     # All settings in one place
├── requirements.txt              # Python dependencies
└── README.md                     # This file

Each File Explained

config.py

Central settings file. Every other module imports from here.

  • Model name, number of questions, sentence count, file paths
  • Change values here to tune the entire system without touching logic files

src/preprocessor.py

The NLP foundation of the project.

Key functions:

  • extract_sentences(text) — spaCy sentence boundary detection
  • rank_sentences(sentences) — TF-IDF scoring, returns top N most informative sentences
  • extract_answer_candidates(sentence) — NER-based extraction with strict quality filters
  • preprocess(text) — full pipeline, returns structured dict

Design decisions:

  • Only PERSON, ORG, GPE, DATE, EVENT, WORK_OF_ART NER labels are accepted as answers
  • A BLACKLIST of 30+ generic words ("annual", "various", "Moon") prevents trivial answers
  • Answers are sorted by priority: PERSON > ORG/GPE > DATE > others
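The filtering and priority rules above can be sketched in plain Python. The entity tuples, the BLACKLIST contents, and the function name here are illustrative assumptions, not copied from src/preprocessor.py:

```python
# Illustrative sketch of NER-label filtering, blacklisting, and priority sorting.
ACCEPTED_LABELS = {"PERSON", "ORG", "GPE", "DATE", "EVENT", "WORK_OF_ART"}
BLACKLIST = {"annual", "various", "moon"}                # real list has 30+ entries
PRIORITY = {"PERSON": 0, "ORG": 1, "GPE": 1, "DATE": 2}  # lower = preferred answer

def filter_answer_candidates(entities):
    """Keep accepted, non-blacklisted (text, label) pairs, sorted by priority."""
    kept = [
        (text, label) for text, label in entities
        if label in ACCEPTED_LABELS and text.lower() not in BLACKLIST
    ]
    return sorted(kept, key=lambda e: PRIORITY.get(e[1], 3))

entities = [("Moon", "WORK_OF_ART"), ("Vikram Sarabhai", "PERSON"),
            ("1969", "DATE"), ("ISRO", "ORG")]
print(filter_answer_candidates(entities))
# → [('Vikram Sarabhai', 'PERSON'), ('ISRO', 'ORG'), ('1969', 'DATE')]
```

Note how "Moon" is dropped by the blacklist even though spaCy tagged it with an accepted label.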

src/question_generator.py

Uses the valhalla/t5-small-qg-hl model — a T5-small fine-tuned on SQuAD for question generation.

How T5 QG works:

Input:  "generate question: ISRO was founded in <hl> 1969 <hl> by Vikram Sarabhai."
Output: "In what year was ISRO founded?"

Key functions:

  • highlight_answer(sentence, answer) — wraps answer in <hl> tags
  • generate_question(sentence, answer) — beam search with 5 beams, 3 candidates
  • answer_is_addressable(question, answer) — rejects circular, vague, or short questions

Quality filters applied:

  • Must start with a question word (what/who/when/where/which/how)
  • Answer must NOT appear in the question
  • Abbreviation trap detection (e.g. rejects Q: "What does ISRO stand for?" when A is the full name)
  • Minimum 5 words
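The highlight format and the filters above can be sketched in plain Python. The helper names and exact checks are illustrative, not the project's actual src/question_generator.py code:

```python
# Minimal sketch of answer highlighting and the listed quality filters.
QUESTION_WORDS = ("what", "who", "when", "where", "which", "how")

def highlight_answer(sentence, answer):
    """Wrap the answer span in <hl> tags, as the valhalla QG models expect."""
    return "generate question: " + sentence.replace(answer, f"<hl> {answer} <hl>")

def passes_quality_filters(question, answer):
    q = question.lower()
    if not q.startswith(QUESTION_WORDS):
        return False                  # must open with a question word
    if answer.lower() in q:
        return False                  # circular: answer leaked into the question
    if len(question.split()) < 5:
        return False                  # too short to be meaningful
    return True

print(highlight_answer("ISRO was founded in 1969.", "1969"))
# → generate question: ISRO was founded in <hl> 1969 <hl>.
print(passes_quality_filters("What year was ISRO founded?", "1969"))   # True
print(passes_quality_filters("Who founded ISRO in 1969?", "1969"))     # False (circular)
```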

src/distractor_generator.py

Generates 3 plausible wrong answer options. Uses a priority-based strategy chain.

Strategy 1 — Same-label NER (best): Finds other entities of the same NER type from the passage.

Answer: "1969" (DATE) → Distractors: ["1975", "2008", "2023"]  (other DATEs in passage)
Answer: "Vikram Sarabhai" (PERSON) → Distractors: ["Kalam", "Dhawan", "Nehru"]

Strategy 2 — WordNet hyponyms: Navigates the WordNet hierarchy to find sibling words in the same semantic category.

Answer: "India" → hypernym: "country" → hyponyms: ["China", "Brazil", "Pakistan"]

Strategy 3 — Cross-label fallback: Uses any other named entity from the passage if strategies 1 and 2 fail.
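Strategy 1 can be sketched in plain Python, operating on entities that have already been extracted from the passage. The function name and entity tuples are illustrative assumptions:

```python
# Sketch of Strategy 1: pick other entities sharing the answer's NER label.
import random

def same_label_distractors(answer, label, passage_entities, k=3, seed=None):
    """Return up to k other entity texts from the passage with the same label."""
    pool = [text for text, lbl in passage_entities
            if lbl == label and text != answer]
    rng = random.Random(seed)   # seed only for reproducible examples
    rng.shuffle(pool)
    return pool[:k]

entities = [("1969", "DATE"), ("1975", "DATE"), ("2008", "DATE"),
            ("2023", "DATE"), ("ISRO", "ORG"), ("Vikram Sarabhai", "PERSON")]
print(same_label_distractors("1969", "DATE", entities, seed=0))
```

Because the pool comes from the same passage, these distractors are plausible by construction; the WordNet and cross-label strategies only kick in when this pool is too small.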

src/mcq_builder.py

The single entry point that the UI calls. Orchestrates the entire pipeline.

MCQ dataclass:

@dataclass
class MCQ:
    question       : str
    options        : list      # 4 shuffled options
    correct_index  : int       # index of correct answer (0-3)
    correct_answer : str
    explanation    : str       # original sentence

Quality gate is_valid_mcq():

  • No two options can be too similar (catches "WWE" vs "World Wrestling Entertainment")
  • Answer must appear exactly once in options
  • Maximum 1 generic placeholder option allowed
  • Answer must not appear in question text
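Part of that quality gate can be sketched with difflib's string similarity. The threshold and helper names are assumptions for illustration; the real is_valid_mcq also handles cases (like abbreviation vs. full name) that plain string similarity alone would miss:

```python
# Hedged sketch of the MCQ quality gate using string-similarity deduplication.
from difflib import SequenceMatcher

def too_similar(a, b, threshold=0.8):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def is_valid_mcq(question, options, answer):
    if options.count(answer) != 1:
        return False                      # answer must appear exactly once
    if answer.lower() in question.lower():
        return False                      # answer must not leak into the question
    for i, a in enumerate(options):
        for b in options[i + 1:]:
            if too_similar(a, b):
                return False              # reject near-duplicate options
    return True

print(is_valid_mcq("Who founded ISRO?",
                   ["Vikram Sarabhai", "APJ Abdul Kalam",
                    "Homi Bhabha", "Jawaharlal Nehru"],
                   "Vikram Sarabhai"))
```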

src/evaluator.py

Checks answers and computes scores.

Returns:

{
  "score"     : 7,
  "total"     : 10,
  "percentage": 70.0,
  "feedback"  : "Good effort! Review the explanations...",
  "results"   : [ {per-question breakdown} ]
}
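The scoring logic behind that dict can be sketched as follows; the feedback thresholds and field names inside results are invented for illustration:

```python
# Sketch of the evaluator: compare chosen option indices to correct ones.
def evaluate(mcqs, user_answers):
    """mcqs: dicts with 'correct_index'; user_answers: chosen indices per question."""
    results = [
        {"question_index": i, "correct": chosen == mcq["correct_index"]}
        for i, (mcq, chosen) in enumerate(zip(mcqs, user_answers))
    ]
    score = sum(r["correct"] for r in results)
    total = len(results)
    pct = round(100.0 * score / total, 1) if total else 0.0
    feedback = ("Excellent!" if pct >= 80
                else "Good effort! Review the explanations..." if pct >= 50
                else "Keep practising.")
    return {"score": score, "total": total, "percentage": pct,
            "feedback": feedback, "results": results}

print(evaluate([{"correct_index": 1}, {"correct_index": 3}], [1, 0]))
```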

app/main.py

Streamlit app with 3 screens managed via st.session_state:

  • Screen 1 (input): Text area + question count slider + Generate button
  • Screen 2 (quiz): One question at a time, radio buttons, Previous/Next/Submit
  • Screen 3 (results): Score banner + per-question feedback with explanations

app/components.py

Reusable display functions:

  • render_question_card() — A/B/C/D labelled radio buttons
  • render_result_card() — green (correct) / red (wrong) with explanation
  • render_score_summary() — score banner + metric cards

Tech Stack

| Library | Version | Purpose |
|---|---|---|
| spaCy | 3.7.4 | Tokenization, NER, POS tagging, sentence splitting |
| transformers | 4.38.2 | T5 model for question generation |
| torch | 2.2.1 | PyTorch backend for transformers |
| nltk | 3.8.1 | WordNet access for distractor generation |
| scikit-learn | 1.4.1.post1 | TF-IDF vectorizer |
| sentencepiece | latest | T5's subword tokenizer |
| streamlit | 1.33.0 | Web UI framework |
| gensim | 4.3.2 | Word2Vec / GloVe loading (optional) |
| numpy | 1.26.4 | TF-IDF matrix operations |

Pre-trained model used:

  • valhalla/t5-small-qg-hl — T5-small fine-tuned on SQuAD 1.0 for answer-aware question generation using highlight format. Hosted on HuggingFace Hub, downloaded automatically on first run (~240MB).

Setup & Installation

Prerequisites

  • Python 3.11+
  • pip
  • Internet connection (first run downloads the T5 model)

Step 1 — Clone the repository

git clone https://github.com/tanmmayyy/mcq-generator.git
cd mcq-generator

Step 2 — Create a virtual environment

python -m venv myenv

# Windows
myenv\Scripts\activate

# Mac/Linux
source myenv/bin/activate

Step 3 — Install dependencies

pip install -r requirements.txt
pip install sentencepiece   # required for T5 tokenizer

Step 4 — Download spaCy language model

python -m spacy download en_core_web_sm

# If the default command fails, install the wheel directly:
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl

Step 5 — Verify installation

python -c "import spacy; nlp = spacy.load('en_core_web_sm'); print('spaCy OK')"
python -c "from transformers import pipeline; print('Transformers OK')"

Running the App

streamlit run app/main.py

The app opens at http://localhost:8501. On first launch, the T5 model downloads (~240MB) and loads into memory — this takes 1–2 minutes. Subsequent launches are fast.


Testing Each Module

Run these in order to verify each step of the pipeline works independently:

# Step 1 — Test preprocessing (NER, TF-IDF, sentence ranking)
python src/preprocessor.py

# Step 2 — Test question generation (T5 model)
python src/question_generator.py

# Step 3 — Test distractor generation (WordNet + NER)
python src/distractor_generator.py

# Step 4 — Test full pipeline end-to-end
python src/mcq_builder.py

# Step 5 — Test scoring
python src/evaluator.py

Sample Output

Input passage (ISRO):

The Indian Space Research Organisation (ISRO) was founded in 1969 by Vikram Sarabhai.
ISRO developed India's first satellite, Aryabhata, which was launched in 1975.
The Chandrayaan-1 mission in 2008 discovered water molecules on the Moon.
In 2023, Chandrayaan-3 successfully landed near the lunar south pole.
The Mars Orbiter Mission, also called Mangalyaan, was launched in 2013.

Generated questions:

Q1: Who founded ISRO?
    A. Jawaharlal Nehru
    B. APJ Abdul Kalam
    C. Vikram Sarabhai  ✓
    D. Homi Bhabha

Q2: What was India's first satellite called?
    A. Chandrayaan
    B. Mangalyaan
    C. Rohini
    D. Aryabhata  ✓

Q3: When did the Chandrayaan-1 mission take place?
    A. 1975
    B. 2013
    C. 2023
    D. 2008  ✓

Q4: What mission made India the first Asian country to reach Mars orbit?
    A. Chandrayaan-3
    B. Aryabhata
    C. Mangalyaan  ✓
    D. Chandrayaan-1

What Makes a Good Passage

The system performs best on factual passages that contain:

| Works well | Works poorly |
|---|---|
| People names (PERSON entities) | Opinion / descriptive text |
| Specific dates (DATE entities) | Passages with repeated entities |
| Organisation names (ORG entities) | Very short passages (< 5 sentences) |
| Place names (GPE entities) | Abstract/philosophical text |
| One clear fact per sentence | Sentences with multiple facts |

Best passage types: History, science, geography, biographies, Wikipedia-style articles

Avoid: Opinion pieces, marketing content, descriptive narratives without specific facts


Known Limitations

  1. Passage type dependency — Works best on factual text. Descriptive or opinion text produces poor questions because there are no named entities to use as answers.

  2. T5-small quality ceiling — The model used (t5-small) has 60M parameters. Larger models like t5-base or t5-large would produce better questions but require more memory and time.

  3. Distractor diversity — When a passage has few named entities, distractors may fall back to generic options. Fine-tuning a separate T5 model on the RACE dataset for distractor generation would fix this.

  4. English only — The current pipeline only supports English text. Extending to Hindi or other Indic languages would require multilingual spaCy models and a multilingual QG model.

  5. No semantic deduplication — Two questions from the same passage can sometimes be semantically similar even if worded differently.


Future Work

  • Fine-tune a T5 distractor generation model on the RACE dataset (100k exam questions)
  • Add support for Hindi using IndicNLP + multilingual BERT
  • Add PDF upload support so users can quiz themselves on any document
  • BLEU/METEOR/ROUGE automated evaluation of generated questions
  • Difficulty scoring per question based on distractor plausibility
  • Export quiz as PDF for offline use

Related Research

Papers that use similar approaches — cited for comparison:

  1. Automatic Generation of Multiple-Choice Questions (2023), Zhang et al. — T5 with pre/postprocessing pipelines for MCQ generation. https://arxiv.org/abs/2303.14576

  2. Deep Learning and Linguistic Feature Based Automatic MCQ Generation (Springer, ICDCIT 2022), Agarwal et al. — DL + linguistic features for MCQ generation (same 3-step pipeline). https://link.springer.com/chapter/10.1007/978-3-030-94876-4_18

  3. End-to-End MCQ Generation Using T5 (ScienceDirect 2022), Rodriguez-Torrealba et al. — Full T5-based pipeline with Wikipedia passages. https://www.sciencedirect.com/science/article/pii/S0957417422014014

  4. Leaf — MCQ Generation System (ECIR 2022), Vachev et al. — Two fine-tuned T5 models: one for QG, one for DG on RACE. https://github.com/KristiyanVachev/Leaf-Question-Generation

  5. Automatic Distractor Generation — Systematic Review (PMC 2024) — Comprehensive review of distractor generation methods including WordNet and T5. https://pmc.ncbi.nlm.nih.gov/articles/PMC11623049/

  6. Automatic Question Generation: A Review (Springer/PMC 2023), Mulla & Gharpure — Survey of methodologies, datasets, and evaluation metrics. https://pmc.ncbi.nlm.nih.gov/articles/PMC9886210/

What differentiates this project from the above:

  • End-to-end pipeline with interactive quiz UI (most papers only generate questions)
  • NER-type-matching distractor strategy (distractors always same entity type as answer)
  • Multi-layer quality filtering at both question and MCQ level
  • Answer circularity detection (rejects questions where answer appears in the question)

Course Outcomes Covered

| CO | Description | How this project covers it |
|---|---|---|
| CO1 | Articulate NLP and word representation | TF-IDF, NER, WordNet, word embeddings all implemented and explained |
| CO2 | Build deep learning models for NLP problems | T5 transformer for QG (seq2seq), beam search decoding, transfer learning |
| CO3 | Implement ML/DL solutions in a real context | End-to-end deployable system with Streamlit UI and interactive demo |

Author

Tanmay Jain, Bennett University


Built with spaCy, HuggingFace Transformers, NLTK, scikit-learn, and Streamlit.