---
title: Classroom AI Assistant
emoji: πŸŽ“
colorFrom: indigo
colorTo: red
sdk: docker
sdk_version: "1.0"
app_file: Dockerfile
app_port: 7860
pinned: false
---

# AI Teaching Assistant System

This project integrates emotion detection, voice-to-text, AI processing, and text-to-voice capabilities into a web-based teaching assistant system. Powered by FastAPI and LLaMA 3.1 3B, it bridges the gap between human emotions and AI responses, creating a real-time, emotion-driven interaction experience.

## Problem Statement

Modern classrooms lack real-time, interactive tools that address diverse student needs and keep students engaged. The objective is to create a multimodal AI assistant that:

- Accepts and processes text and voice queries from students in real time.
- Provides contextual responses, including textual explanations, charts, and visual aids.
- Detects disengagement or confusion using facial expression analysis.

## Features

- **Emotion Detection**: Detects the user's facial emotions in real time using DeepFace and OpenCV
- **Voice-to-Text**: Converts the user's speech to text for natural language input
- **AI Processing**: Processes user queries with emotion-aware AI responses
- **Image Search**: Finds relevant images based on contextual prompts
- **Text-to-Voice**: Converts AI responses to speech with emotion-appropriate voice synthesis
- **Web Interface**: Modern UI built with HTML, TailwindCSS, and JavaScript

## System Architecture

### Backend

- FastAPI server with asynchronous WebSocket implementation
- Connection management system handling multiple concurrent users
- Integration of multiple AI components
- Efficient state tracking of user emotions, interactions, and responses
- Asynchronous processing for better performance

### Frontend

- Responsive design with TailwindCSS
- Real-time WebSocket communication for seamless interactions
- Camera integration for emotion detection
- Dynamic UI updates based on emotion detection
- Speech recognition using the Web Speech API

### System Flow Diagram

```mermaid
graph TD
    User[User] -->|Speaks & Shows Emotion| UI[Frontend UI]

    subgraph "Frontend"
        UI -->|Captures Video| EmotionCapture[Emotion Capture]
        UI -->|Records Audio| SpeechCapture[Speech Capture]
        EmotionCapture -->|Base64 Image| WebSocket[WebSocket Connection]
        SpeechCapture -->|Text| WebSocket
        WebSocket -->|Responses| UIUpdate[UI Updates]
        UIUpdate -->|Display| UI
    end

    WebSocket <-->|Bidirectional Communication| Server[FastAPI Server]

    subgraph "Backend"
        Server -->|Manages| ConnectionMgr[Connection Manager]
        Server -->|Processes Image| EmotionDetection[Emotion Detection]
        Server -->|Processes Text| AIProcessing[LLaMA 3.1 Processing]
        EmotionDetection -->|Emotion State| AIProcessing
        AIProcessing -->|Response Text| TextToSpeech[Text-to-Speech]
        AIProcessing -->|Image Prompt| ImageSearch[Image Search]
        TextToSpeech -->|Audio File| Server
        ImageSearch -->|Image URL| Server
    end

    Server -->|Audio & Images & Text| WebSocket
```
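
As a purely illustrative sketch of the emotion-detection leg of this flow, a FastAPI WebSocket endpoint that receives base64 frames and runs DeepFace might look like the following. The message fields (`type`, `image`) and the `/ws` path are assumptions for the example, not the project's actual protocol:

```python
# Illustrative sketch only: the /ws path and the message fields ("type",
# "image") are assumptions, not the project's actual protocol.
import base64

import cv2
import numpy as np
from deepface import DeepFace
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws")
async def ws_endpoint(websocket: WebSocket):
    await websocket.accept()
    while True:
        message = await websocket.receive_json()
        if message.get("type") == "frame":
            # Decode the base64-encoded frame into an OpenCV image
            raw = base64.b64decode(message["image"])
            frame = cv2.imdecode(np.frombuffer(raw, np.uint8), cv2.IMREAD_COLOR)
            # enforce_detection=False avoids an exception when no face is found
            results = DeepFace.analyze(frame, actions=["emotion"],
                                       enforce_detection=False)
            # Recent DeepFace versions return a list of per-face results
            first = results[0] if isinstance(results, list) else results
            await websocket.send_json({"type": "emotion",
                                       "emotion": first["dominant_emotion"]})
```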
## Demo

### Screenshots

![AI Teaching Assistant Interface](asset/screenshot.jpeg)

*Screenshot: Main interface showing the emotion-aware Classroom AI assistant with real-time camera feed and chat interface*

### Screen Recording

See the AI Classroom Assistant in action: [Watch Demo](asset/screenrecord.mp4)

*Video: Complete demonstration of the AI Teaching Assistant system showing emotion detection, voice interaction, and AI responses*

## Usage

1. Click the "Start Assistant" button to begin
2. Allow camera and microphone permissions when prompted
3. Speak clearly to interact with the assistant
4. The system will:
   - Detect your emotion
   - Convert your speech to text
   - Process your query with AI
   - Display relevant images
   - Speak the AI response

## Components

- **main.py**: FastAPI backend server
- **emotion_processor.py**: Handles facial emotion detection
- **voice_processor.py**: Manages speech-to-text conversion
- **img_and_ai.py**: Handles image search and AI processing
- **TextToVoice.py**: Manages text-to-speech conversion
- **index.html**: Main frontend interface
- **styles.css**: Custom styling
- **app.js**: Frontend JavaScript logic

## Project Structure

```
version1/
β”œβ”€β”€ README.md                # Project documentation
β”œβ”€β”€ requirements.txt         # Python dependencies
β”œβ”€β”€ server.py                # Main server entry point
β”œβ”€β”€ .env                     # Environment variables
β”œβ”€β”€ asset/
β”‚   β”œβ”€β”€ screenrecord.mp4     # Demo video showing system functionality
β”‚   └── screenshot.jpeg      # Interface screenshot
β”œβ”€β”€ backend/
β”‚   β”œβ”€β”€ emotion_processor.py # Emotion detection processing
β”‚   β”œβ”€β”€ haarcascade_frontalface_default.xml # Face detection model
β”‚   β”œβ”€β”€ img_and_ai.py        # Image processing utilities
β”‚   β”œβ”€β”€ main.py              # FastAPI application logic
β”‚   β”œβ”€β”€ TextToVoice.py       # Text-to-speech functionality
β”‚   └── voice_processor.py   # Speech recognition functionality
└── frontend/
    β”œβ”€β”€ static/
    β”‚   β”œβ”€β”€ app.js           # Frontend JavaScript
    β”‚   β”œβ”€β”€ styles.css       # CSS styling
    β”‚   └── final_audio_*.mp3 # Generated audio responses
    └── templates/
        └── index.html       # Main web interface
```

## AI Core & Technical Details

### AI Model

- LLaMA 3.1 3B (8-bit) fine-tuned on 9,000+ emotion-labeled Q&A pairs
- Custom emotion-aware prompt engineering with context preservation
- Dedicated processing pipeline for each detected emotional state
- Local model deployment with optimized inference for responsive interactions

### Emotion Detection

- DeepFace & OpenCV with real-time webcam processing
- Base64 image encoding for efficient WebSocket transmission
- Continuous emotional state tracking with state management

### Voice Interaction

- Speech-to-text for natural language input
- Edge text-to-speech with emotion-appropriate voice synthesis
- Unique audio file generation with UUID-based identification

### Image Search

- Contextual, prompt-driven image sourcing (see the sketches below)
- Dynamic content generation that adapts to both the query and the detected emotion
- Integrated image processing with AI-generated responses
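
To make the emotion-aware prompting concrete, here is a minimal sketch of how a detected emotion could be folded into a request to the locally served model, via the OpenAI-compatible endpoint (`AI_MODEL_URL`) from the `.env` example in the Setup Instructions below. The system-prompt wording and function name are assumptions for illustration:

```python
# Sketch of emotion-aware prompting against the local OpenAI-compatible
# endpoint (AI_MODEL_URL in .env). Prompt wording and names are illustrative.
import os

import requests

AI_MODEL_URL = os.getenv("AI_MODEL_URL",
                         "http://localhost:12345/v1/chat/completions")

def ask_assistant(query: str, emotion: str) -> str:
    # Fold the detected emotion into the system prompt so the model can
    # adapt its tone and level of detail
    system_prompt = (
        "You are a patient classroom teaching assistant. "
        f"The student currently appears to be {emotion}; "
        "adapt your tone and level of detail accordingly."
    )
    payload = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
        "temperature": 0.7,
    }
    response = requests.post(AI_MODEL_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```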
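
The text-to-speech step with UUID-named output files could look roughly like this; the voice name is an assumption, while the `final_audio_*.mp3` pattern under `frontend/static/` matches the project structure above:

```python
# Sketch of the text-to-speech step: edge-tts writes an MP3 named with a
# UUID, matching the final_audio_*.mp3 pattern under frontend/static/.
# The voice name is an assumption.
import uuid

import edge_tts

async def speak(text: str, voice: str = "en-US-AriaNeural") -> str:
    out_path = f"frontend/static/final_audio_{uuid.uuid4().hex}.mp3"
    await edge_tts.Communicate(text, voice).save(out_path)
    return out_path
```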
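
Finally, the prompt-driven image search might look like the sketch below. Only the host and key variable names come from the `.env` example; the `/search` path, query parameters, and response shape are assumptions about the RapidAPI service's typical interface:

```python
# Sketch of prompt-driven image search via RapidAPI. The /search path,
# parameters, and response shape are assumptions; only the RAPIDAPI_HOST
# and RAPIDAPI_KEY names come from the project's .env example.
import os
from typing import Optional

import requests

def find_image(prompt: str) -> Optional[str]:
    host = os.getenv("RAPIDAPI_HOST", "real-time-image-search.p.rapidapi.com")
    response = requests.get(
        f"https://{host}/search",
        params={"query": prompt, "limit": 1},
        headers={
            "X-RapidAPI-Key": os.environ["RAPIDAPI_KEY"],
            "X-RapidAPI-Host": host,
        },
        timeout=30,
    )
    response.raise_for_status()
    results = response.json().get("data", [])
    return results[0]["url"] if results else None
```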
## Implementation Details

This isn't just an API wrapper; it's a complete system with:

- Custom WebSocket architecture handling real-time bidirectional communication
- An end-to-end emotion processing pipeline, from detection to response generation
- Local model deployment with optimized inference for responsive interactions
- Comprehensive error handling and logging
- No external LLM APIs, due to project restrictions; everything runs locally

### Emotional Intelligence Architecture

- Real-time emotion detection feeding continuously into the AI decision matrix
- Response generation calibrated to each detected emotional state
- Stateful conversation tracking that maintains emotional context
- Adaptive voice characteristics matching the detected emotional state
- Connection Manager tracking user state across multiple sessions

## Resources

### Dataset and Model

- Dataset: [Ques-Ans-with-Emotion](https://huggingface.co/datasets/0xarchit/Ques-Ans-with-Emotion), 9,000+ emotion-labeled Q&A pairs
- Model: [AI Teaching Assistant](https://huggingface.co/0xarchit/ai_teaching_assistant), fine-tuned LLaMA 3.1 3B

## Setup Instructions

### Prerequisites

- Python 3.8 or higher
- Webcam for emotion detection
- Microphone for voice input
- Internet connection for image search

### Installation

1. Clone the repository or navigate to the project directory
2. Install dependencies:
   ```
   pip install -r requirements.txt
   ```
3. Set up environment variables:
   - Rename `.env.example` to `.env` (or create a new `.env` file)
   - Add your RapidAPI key for image search
   - Configure any other environment variables as needed

   Example `.env` file:
   ```
   RAPIDAPI_KEY="your_rapidapi_key_here"
   RAPIDAPI_HOST="real-time-image-search.p.rapidapi.com"
   AI_MODEL_URL="http://localhost:12345/v1/chat/completions"
   ```
4. Run the server:
   ```
   python server.py
   ```
5. Open your browser and navigate to:
   ```
   http://localhost:8000
   ```

## Troubleshooting

- If the camera doesn't work, check browser permissions
- If speech recognition fails, ensure your microphone is working
- If the server fails to start, check whether port 8000 is available

## Future Goals

As we continue to develop this AI Classroom Teaching Assistant System, we plan to implement several enhancements to make the experience even more immersive and effective:

### Near-Term Enhancements

- **Emotion-Adaptive Voice Generation**: Integration of a specialized ML model for human-like voice generation that dynamically adapts tone, pitch, and speaking style to detected student emotions
- **AI Image Generation**: Implementation of diffusion models to create custom educational illustrations and diagrams in real time, based on the content being discussed
- **Multi-Student Emotion Tracking**: Ability to simultaneously track and respond to multiple students' emotional states in classroom settings
- **Personalized Learning Paths**: Development of student profiles that track engagement patterns and learning preferences to customize future interactions
- **Extended Language Support**: Integration of multilingual capabilities for global classroom deployment

### Long-Term Vision

- **AR/VR Integration**: Creation of immersive educational experiences with 3D visualizations of complex concepts
- **Collaborative Learning Features**: Facilitation of group activities and peer-to-peer learning with AI moderation
- **Advanced Analytics Dashboard**: Comprehensive insights for educators about student engagement, emotional patterns, and learning progress

## License

This project is part of the Intel Unnati program, completed by team Bitbots.

🌟 Give This Repo A Star If You Like 🌟