title: Semiconductor Defect Detection
emoji: 🔬
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false

Semiconductor Wafer Defect Detection: End-to-End AI Pipeline

Project Overview

This project is a complete, end-to-end Applied AI pipeline designed for the semiconductor manufacturing industry. It takes raw numeric array data representing defective semiconductor wafers, converts it into an AI-ready computer vision dataset, trains a custom YOLOv8 object detection model, and feeds the results into a predictive material waste model and a real-time dashboard.

Final YOLOv8 Model Performance: 0.962 mAP@50 on unseen validation data (mean average precision at an IoU threshold of 0.5). Predictive Waste Model Performance: R² = 0.9637 on material waste prediction.

Business Value

In semiconductor fabrication, identifying microscopic defects early in the manufacturing process saves millions in scrapped materials. This project automates quality control by transitioning from manual coordinate analysis to real-time, AI-driven visual defect detection, while simultaneously forecasting future material waste to optimize supply chain planning.

The Technical Pipeline

Phase 1: Data Engineering (src/data_prep.py)

  • The Challenge: The original dataset consisted of raw .txt files containing numeric 2D arrays (0 = background, 1 = good chip, 2 = defect). YOLOv8 cannot train on text arrays; it requires image files and normalized bounding box coordinates.
  • The Solution: Built a custom Python pipeline using NumPy and OpenCV to parse over 25,000 text files.
  • The Math: Programmatically identified the spatial extremes (xmin, ymin, xmax, ymax) of the cells with value 2 (defects), normalized them to the 0.0 to 1.0 range that YOLO labels require, and rendered high-contrast .jpg images alongside corresponding .txt label files.
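
The bounding-box step above can be sketched roughly as follows. This is a minimal illustration, not the contents of src/data_prep.py: the function names, the single-box-per-wafer assumption, and the gray levels are hypothetical. It relies on the standard YOLO label layout (class, x_center, y_center, width, height, all normalized to 0.0 to 1.0).

```python
from typing import Optional

import numpy as np

DEFECT = 2  # cell value marking a defect in the source arrays


def defect_bbox_to_yolo(wafer: np.ndarray, class_id: int = 0) -> Optional[str]:
    """Return one YOLO label line for the defect region, or None if there is no defect."""
    rows, cols = np.nonzero(wafer == DEFECT)
    if rows.size == 0:
        return None
    height, width = wafer.shape
    # Spatial extremes; +1 so the box covers the last defective cell.
    xmin, xmax = cols.min(), cols.max() + 1
    ymin, ymax = rows.min(), rows.max() + 1
    # YOLO label format: class x_center y_center box_width box_height.
    x_center = (xmin + xmax) / 2 / width
    y_center = (ymin + ymax) / 2 / height
    box_w = (xmax - xmin) / width
    box_h = (ymax - ymin) / height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {box_w:.6f} {box_h:.6f}"


def render_wafer(wafer: np.ndarray) -> np.ndarray:
    """Map cell values 0/1/2 to a high-contrast 8-bit grayscale image."""
    palette = np.array([0, 128, 255], dtype=np.uint8)  # hypothetical gray levels
    return palette[wafer]


# Toy 4x8 wafer with a defect block in rows 1-2, columns 2-5.
wafer = np.zeros((4, 8), dtype=np.int64)
wafer[1:3, 2:6] = DEFECT
label_line = defect_bbox_to_yolo(wafer)
image = render_wafer(wafer)
```

In a real run, the array would be read from a .txt file and the rendered image written to disk with `cv2.imwrite`; both steps are left out here so the sketch depends only on NumPy.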

Phase 2: Dataset Architecture (src/split_data.py)

  • Used scikit-learn to perform an 80/20 train/validation split.
  • Programmatically generated the directory structure that YOLO requires, moving over 50,000 individual files into structured train and val directories.
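
A rough sketch of the split-and-migrate step is shown below. Several things here are assumptions rather than the contents of src/split_data.py: a seeded shuffle from the standard library stands in for scikit-learn's `train_test_split`, the `images/` and `labels/` layout follows the common Ultralytics convention, files are copied rather than moved, and the `wafer_N` file names are placeholders.

```python
import random
import shutil
import tempfile
from pathlib import Path


def split_dataset(src: Path, dst: Path, val_fraction: float = 0.2, seed: int = 42) -> dict:
    """Copy image/label pairs from src into dst/{images,labels}/{train,val}."""
    stems = sorted(p.stem for p in src.glob("*.jpg"))
    random.Random(seed).shuffle(stems)  # stand-in for sklearn's train_test_split
    n_val = round(len(stems) * val_fraction)
    splits = {"val": stems[:n_val], "train": stems[n_val:]}
    for split, names in splits.items():
        for kind, ext in (("images", ".jpg"), ("labels", ".txt")):
            out_dir = dst / kind / split
            out_dir.mkdir(parents=True, exist_ok=True)
            for stem in names:
                # Each image travels with its label so pairs never end up in different splits.
                shutil.copy2(src / f"{stem}{ext}", out_dir / f"{stem}{ext}")
    return {split: len(names) for split, names in splits.items()}


# Demo on ten empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "raw", Path(tmp) / "dataset"
    src.mkdir()
    for i in range(10):
        (src / f"wafer_{i}.jpg").touch()
        (src / f"wafer_{i}.txt").touch()
    counts = split_dataset(src, dst)
    n_train_images = len(list((dst / "images" / "train").glob("*.jpg")))
    n_val_labels = len(list((dst / "labels" / "val").glob("*.txt")))
```

Splitting on file stems rather than on individual files is the design point: it guarantees that an image and its label always land in the same split.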

Phase 3: Model Training (src/model_train.py)

  • Initialized a pre-trained YOLOv8 Nano (yolov8n.pt) model for lightweight, high-speed inference.
  • Trained on 20,415 wafer images for 10 epochs.
  • Mapped 8 specific manufacturing defect classes (Center, Donut, Edge-Loc, Edge-Ring, Loc, Random, Scratch, Near-full).
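
The training setup can be approximated with the sketch below. The class-ID order, the `dataset` root path, and `imgsz=640` are assumptions (the README does not state them); `yolov8n.pt` and the 10 epochs come from the description above. The Ultralytics import is deferred so the dataset config can be generated without the library installed.

```python
# Class names from the README; the ID order is an assumption.
CLASS_NAMES = [
    "Center", "Donut", "Edge-Loc", "Edge-Ring",
    "Loc", "Random", "Scratch", "Near-full",
]


def build_data_yaml(dataset_root: str) -> str:
    """Build the text of an Ultralytics dataset config for the 8 defect classes."""
    lines = [
        f"path: {dataset_root}",
        "train: images/train",
        "val: images/val",
        "names:",
    ]
    lines += [f"  {i}: {name}" for i, name in enumerate(CLASS_NAMES)]
    return "\n".join(lines) + "\n"


def train(data_yaml_path: str, epochs: int = 10):
    """Fine-tune the pre-trained YOLOv8 Nano checkpoint on the wafer dataset."""
    from ultralytics import YOLO  # imported lazily; requires the ultralytics package

    model = YOLO("yolov8n.pt")
    return model.train(data=data_yaml_path, epochs=epochs, imgsz=640)


yaml_text = build_data_yaml("dataset")  # hypothetical dataset root
```

Writing `yaml_text` to a file such as `data.yaml` and calling `train("data.yaml")` would start a run; by default Ultralytics saves the best checkpoint as `best.pt` under its run directory.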

Phase 4: Batch Inference & Evaluation (src/batch_inference.py & src/model_eval.py)

  • Deployed the custom-trained best.pt weights to run batch inference on unseen validation images.
  • The model drew bounding boxes and assigned confidence scores on these images without manual intervention.
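
As a rough illustration of batch inference, the sketch below runs trained weights over a folder and tallies detections per class. It is not the project's src/batch_inference.py: the function names and the 0.25 confidence cutoff are assumptions, and the sample rows at the bottom are hand-written placeholders, not real model output.

```python
from collections import Counter


def run_batch(weights: str, source: str, conf: float = 0.25):
    """Run inference over a folder; return (image_path, class_name, confidence) rows."""
    from ultralytics import YOLO  # imported lazily; requires the ultralytics package

    model = YOLO(weights)
    rows = []
    # save=True also writes annotated images with the predicted boxes drawn on them.
    for result in model.predict(source=source, conf=conf, save=True):
        classes = result.boxes.cls.tolist()
        scores = result.boxes.conf.tolist()
        for cls_id, score in zip(classes, scores):
            rows.append((result.path, result.names[int(cls_id)], float(score)))
    return rows


def count_by_class(rows):
    """Tally how many detections were made for each defect class."""
    return Counter(name for _, name, _ in rows)


# Placeholder rows in the shape run_batch returns (illustrative values only).
sample_rows = [
    ("a.jpg", "Scratch", 0.91),
    ("b.jpg", "Edge-Ring", 0.97),
    ("c.jpg", "Scratch", 0.55),
]
sample_counts = count_by_class(sample_rows)
```

A call such as `run_batch("best.pt", "dataset/images/val")` would produce the real rows, with both paths depending on where the training run and dataset live.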

Phase 5: Production Middleware, Predictive Modeling & Dashboard

  • Robotic Scanner Simulation (middleware/robot_controller.py): Operates on a hybrid dataset of 823,953 wafers (Mixed-type + WM-811K datasets) with a realistic 95.5% pass rate. It automatically routes passed wafers and runs YOLOv8 inference on defective ones, logging every scan to a centralized SQLite database (wafer_control.db).
  • Material Waste Predictor (middleware/material_predictor.py): A Random Forest Regressor trained on the historical scan database. It accurately predicts the average percentage of material wasted within defective wafers, allowing fabs to estimate future material needs.
  • Real-time Dashboard (middleware/dashboard.py): A Plotly Dash web application that visualizes historical defect rates, defect distributions, routing actions, and integrates interactive material forecasting inputs.
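
The routing-and-logging loop of the scanner simulation might look something like the sketch below. The table schema, the random pass/fail draw, and the `detect_defect` stub are all hypothetical; the real controller reads wafers from the datasets and calls the trained YOLOv8 model. Only the 95.5% pass rate and the use of SQLite come from the description above.

```python
import random
import sqlite3

PASS_RATE = 0.955  # pass rate quoted above


def detect_defect(wafer_id: int):
    """Stub standing in for YOLOv8 inference; returns a fixed placeholder result."""
    return "Edge-Ring", 0.9


def scan_wafers(conn: sqlite3.Connection, n_wafers: int, seed: int = 0) -> None:
    """Route each simulated wafer and log the outcome to the scans table."""
    rng = random.Random(seed)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scans ("
        "wafer_id INTEGER PRIMARY KEY, status TEXT, "
        "defect_class TEXT, confidence REAL)"
    )
    for wafer_id in range(n_wafers):
        if rng.random() < PASS_RATE:
            # Passed wafers are routed onward without running the model.
            row = (wafer_id, "PASS", None, None)
        else:
            defect_class, confidence = detect_defect(wafer_id)
            row = (wafer_id, "FAIL", defect_class, confidence)
        conn.execute("INSERT INTO scans VALUES (?, ?, ?, ?)", row)
    conn.commit()


# In-memory database for the demo; the project logs to wafer_control.db.
conn = sqlite3.connect(":memory:")
scan_wafers(conn, 1000)
total = conn.execute("SELECT COUNT(*) FROM scans").fetchone()[0]
failed = conn.execute("SELECT COUNT(*) FROM scans WHERE status = 'FAIL'").fetchone()[0]
unlabeled_failures = conn.execute(
    "SELECT COUNT(*) FROM scans WHERE status = 'FAIL' AND defect_class IS NULL"
).fetchone()[0]
```

Keeping every scan in one table, passes included, is what lets a dashboard or a downstream regressor compute defect rates directly from SQL queries.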

Upcoming Feature: LLM Troubleshooting Assistant (Planned)

Goal: Integrate an intelligent Large Language Model (LLM) bot to assist fab engineers directly on the factory floor.

  • Functionality: When the dashboard flags a sudden spike in a specific defect type (e.g., "Edge-Ring" defects), the engineer can consult the LLM bot.
  • Use Case: The bot will analyze the defect trends, cross-reference historical manufacturing guidelines, and suggest potential root causes (such as misaligned etching tools or incorrect gas pressure), reducing troubleshooting time and downtime. (Note: This feature is currently in the design phase and is not yet implemented.)

Performance Metrics

The YOLOv8 model achieved the following results on the held-out validation set:

| Metric | Score | Note |
|---|---|---|
| mAP50 (All Classes) | 96.2% | Mean average precision across all classes at an IoU threshold of 0.5. |
| Recall | 93.1% | The model located 93.1% of all labeled defects. |
| Edge-Ring (mAP50) | 99.4% | Near-flawless detection of Edge-Ring anomalies. |
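
For readers unfamiliar with mAP50: the "50" refers to an intersection-over-union (IoU) threshold of 0.5, meaning a predicted box only counts as a correct detection if it overlaps the ground-truth box by at least that much. A minimal sketch of the check, with hypothetical function names and toy boxes:

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width and height, clamped at zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, truth_box, threshold: float = 0.5) -> bool:
    """A prediction matches a ground-truth box when IoU meets the threshold."""
    return iou(pred_box, truth_box) >= threshold


# Two 10x10 boxes offset by half their width: IoU = 50 / 150.
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

Average precision is then computed per class from the precision-recall curve built on these matches, and mAP50 is the mean over the 8 classes.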

The Random Forest Material Waste Predictor achieved:

| Metric | Score | Note |
|---|---|---|
| R² Score | 0.9637 | The model explains about 96% of the variance in the waste target. |
| MAE | 0.09% | Average prediction error is less than one-tenth of a percent. |
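
The two regression metrics can be reproduced by hand as sketched below. The functions mirror what scikit-learn's `r2_score` and `mean_absolute_error` compute, written in plain Python for clarity; the numbers are toy values, not the project's data.

```python
def r2_score(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred) -> float:
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy waste percentages (illustrative only).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
```

R² is unitless and compares the model against always predicting the mean, while MAE is expressed in the same units as the target, which is why the two are usually reported together.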

Tech Stack

  • Languages: Python
  • Computer Vision: Ultralytics (YOLOv8), OpenCV (cv2)
  • Machine Learning & Data: Pandas, NumPy, Scikit-learn, SQLite
  • Web UI & Visualization: Plotly, Dash

Deployment (Docker)

This application is fully containerized for easy deployment.

  1. Clone the repository:
    git clone https://github.com/Udayan2001/Semiconductor_defect_detection.git
    cd Semiconductor_defect_detection
    
  2. Add API Key: Create a .env file in the backend/ directory and add your Google Gemini API key:
    GEMINI_API_KEY=your_api_key_here
    
  3. Start the Application: Run the following command from the root directory to build and start both the backend and frontend servers:
    docker compose up --build
    
  4. Access the Dashboard: Open your browser and navigate to http://localhost:5173.

Designed and engineered by Udayan Shashank Shukla.