Unified Data Source Management
🎯 The Challenge You Described
You have multiple data sources that may or may not have data:
- ✅ Website scraped data (may be empty)
- ✅ FAQ data (may be empty)
- ✅ Unanswered questions history (may be empty)
- ✅ Industry knowledge base (always available)
Users can ask anything in 3 languages (English, Urdu, Roman Urdu).
Question: How should the system decide which data source to use for a given query, and how should it handle context?
✅ The Solution: Unified Data Source Manager
I've created unified_data_manager.py - a smart system that:
- ✅ Checks all data sources automatically
- ✅ Prioritizes based on query intent
- ✅ Handles empty/missing data gracefully
- ✅ Works across all 3 languages
- ✅ Provides intelligent fallbacks
📊 How It Works (Step-by-Step)
Example Query: "mujhe bukhar hai" (I have fever)
┌─────────────────────────────────────────────────────────────┐
│ Step 1: Language Detection │
└─────────────────────────────────────────────────────────────┘
Input: "mujhe bukhar hai"
→ Detected: Roman Urdu (confidence: 0.95)
┌─────────────────────────────────────────────────────────────┐
│ Step 2: Translation to English (for processing) │
└─────────────────────────────────────────────────────────────┘
Roman Urdu: "mujhe bukhar hai"
→ English: "I have fever"
┌─────────────────────────────────────────────────────────────┐
│ Step 3: Intent Classification │
└─────────────────────────────────────────────────────────────┘
Query: "I have fever"
Industry: Healthcare
→ Intent: INDUSTRY_KNOWLEDGE (confidence: 0.85)
┌─────────────────────────────────────────────────────────────┐
│ Step 4: Check Data Source Availability │
└─────────────────────────────────────────────────────────────┘
Checking website_id: 123
✅ FAQ: Has 15 FAQs
❌ Scraped Content: Empty (no data)
✅ Industry KB: Healthcare module available
❌ Unanswered History: No history yet
✅ LLM Fallback: Always available
Available: [FAQ, Industry KB, LLM]
┌─────────────────────────────────────────────────────────────┐
│ Step 5: Query All Available Sources │
└─────────────────────────────────────────────────────────────┘
Source 1: FAQ Database
Query: "I have fever"
Result: No exact match
Confidence: 0.0
❌ Not found
Source 2: Industry Knowledge Base (Healthcare)
Query: "I have fever"
Result: Found symptom checker
Confidence: 0.90
✅ Found! (Fever → Flu, COVID-19, Infection)
Source 3: LLM Fallback
Query: "I have fever"
Result: Generic response
Confidence: 0.60
✅ Available
┌─────────────────────────────────────────────────────────────┐
│ Step 6: Select Best Result │
└─────────────────────────────────────────────────────────────┘
Results ranked by: Confidence × Priority Weight
For INDUSTRY_KNOWLEDGE intent:
- Industry KB: 0.90 × 1.0 = 0.90 ✅ WINNER
- LLM Fallback: 0.60 × 0.5 = 0.30
Selected: Industry Knowledge Base
┌─────────────────────────────────────────────────────────────┐
│ Step 7: Translate Answer Back to User's Language │
└─────────────────────────────────────────────────────────────┘
Answer (English): "You have fever. Possible conditions: Flu, COVID-19..."
Target Language: Roman Urdu
→ Translated: "Aapko bukhar hai. Mumkina bimariyan: Flu, COVID-19..."
┌─────────────────────────────────────────────────────────────┐
│ Final Response │
└─────────────────────────────────────────────────────────────┘
{
"answer": "Aapko bukhar hai. Mumkina bimariyan: Flu, COVID-19...",
"source": "industry_knowledge",
"confidence": 0.90,
"language_detected": "ur-roman",
"intent": "INDUSTRY_KNOWLEDGE"
}
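The ranking rule from Step 6 (confidence × priority weight) can be sketched in a few lines. This is a minimal illustration, not the code in unified_data_manager.py: the `SourceResult` class and the `select_best` function are hypothetical names, and the `website_scraped` key is an assumed identifier. The weights and confidences are the ones shown in the walkthrough.

```python
from dataclasses import dataclass


@dataclass
class SourceResult:
    source: str        # e.g. "faq", "industry_knowledge", "llm_fallback"
    confidence: float  # 0.0 to 1.0, as reported by the source
    answer: str


def select_best(results, weights):
    """Return the result with the highest confidence x priority weight.

    Sources missing from `weights` get weight 0.0. Returns None when no
    result scores above zero.
    """
    scored = [(r.confidence * weights.get(r.source, 0.0), r) for r in results]
    scored = [(score, r) for score, r in scored if score > 0]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


# Weights for the INDUSTRY_KNOWLEDGE intent, as listed in the walkthrough.
INDUSTRY_WEIGHTS = {
    "industry_knowledge": 1.0,
    "faq": 0.7,
    "website_scraped": 0.5,  # assumed key name for scraped content
    "llm_fallback": 0.5,
}

results = [
    SourceResult("faq", 0.0, ""),
    SourceResult("industry_knowledge", 0.90, "You have fever. Possible conditions: Flu, COVID-19..."),
    SourceResult("llm_fallback", 0.60, "Generic response"),
]

best = select_best(results, INDUSTRY_WEIGHTS)
print(best.source)  # industry_knowledge
```

Filtering out zero scores means a source that found nothing can never be selected, even when it carries the highest weight.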
🎯 Priority System by Intent
1. FAQ Intent
Priority Order:
1. FAQ Database (100%) ← Try first
2. Website Scraped (80%)
3. Industry KB (60%)
4. LLM Fallback (50%)
Example: "What are your hours?"
- Checks the FAQ first (highest priority)
- If the FAQ has no match, checks website content, then the Industry KB
- Falls back to the LLM
2. Industry Knowledge Intent
Priority Order:
1. Industry KB (100%) ← Try first
2. FAQ Database (70%)
3. Website Scraped (50%)
4. LLM Fallback (50%)
Example: "What is diabetes?"
- Healthcare module first
- Falls back to FAQ if medical answer not found
- LLM as last resort
3. Business-Specific Intent
Priority Order:
1. Website Scraped (100%) ← Try first
2. FAQ Database (80%)
3. Industry KB (40%)
4. LLM Fallback (50%)
Example: "Who is Dr. Khan?"
- Checks scraped website content first
- FAQ second
- LLM if not found
🌐 Multilingual Support
Language Flow
User Query (Any Language)
↓
Detect Language (en/ur/ur-roman)
↓
Translate to English (if needed)
↓
Process with English (all data sources)
↓
Translate Answer Back (to user's language)
↓
Return in Original Language
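The flow above can be sketched as a thin wrapper around any English-only answering function. Everything here is an assumption for illustration: `detect_language`, `answer_in_user_language`, the keyword list, and the `translate(text, src, dst)` callable are hypothetical, and the detection heuristic (Arabic-script check plus a tiny Roman Urdu lexicon) is a stand-in for whatever detector the real system uses.

```python
import re

# Arabic-script code points (U+0600 to U+06FF) cover standard Urdu text.
URDU_SCRIPT = re.compile(r"[\u0600-\u06FF]")

# Tiny illustrative lexicon; a real detector would need far more coverage.
ROMAN_URDU_HINTS = {"hai", "mujhe", "kya", "kitni", "aap", "nahi", "kaise"}


def detect_language(text):
    """Return 'ur', 'ur-roman', or 'en' using script and keyword heuristics."""
    if URDU_SCRIPT.search(text):
        return "ur"
    words = set(re.findall(r"[a-z]+", text.lower()))
    if words & ROMAN_URDU_HINTS:
        return "ur-roman"
    return "en"


def answer_in_user_language(query, translate, answer_english):
    """Detect the language, translate to English, answer, translate back."""
    lang = detect_language(query)
    english_query = query if lang == "en" else translate(query, lang, "en")
    english_answer = answer_english(english_query)
    answer = english_answer if lang == "en" else translate(english_answer, "en", lang)
    return {"answer": answer, "language_detected": lang}


# Stub translator that only tags the text, so the flow is visible.
def fake_translate(text, src, dst):
    return f"[{src}->{dst}] {text}"


result = answer_in_user_language(
    "fee kitni hai?", fake_translate, lambda q: "Annual fee is $5000"
)
print(result["language_detected"])  # ur-roman
print(result["answer"])             # [en->ur-roman] Annual fee is $5000
```

Passing the translator and the answering function in as arguments keeps the language handling separate from the data sources, which all work in English.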
Examples
Scenario 1: Urdu Query
Input: "داخلے کی شرائط کیا ہیں؟"
Detect: Urdu
Translate: "What are admission requirements?"
Process: Query FAQ + Education KB
Answer: "Minimum GPA 3.0..."
Translate: "کم از کم GPA 3.0..."
Output: "کم از کم GPA 3.0..."
Scenario 2: Roman Urdu Query
Input: "fee kitni hai?"
Detect: Roman Urdu
Translate: "How much is the fee?"
Process: Query FAQ first
Answer: "Annual fee is $5000"
Translate: "Saalana fee $5000 hai"
Output: "Saalana fee $5000 hai"
🔄 Handling Empty Data Sources
Case 1: No FAQ Data
# System automatically detects
available_sources[FAQ] = False # No FAQs in database
# Skip FAQ, move to next priority
→ Checks: Website Scraped → Industry KB → LLM
Case 2: No Scraped Content
# Website has no scraped content
available_sources[WEBSITE_SCRAPED] = False
# Skip scraped content
→ Checks: FAQ → Industry KB → LLM
Case 3: All Business Data Empty (FAQ + Scraped both empty)
# Fallback chain:
1. FAQ → Empty ❌
2. Scraped → Empty ❌
3. Industry KB → Available ✅ (if industry query)
4. LLM → Always available ✅
Result: User still gets an answer!
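The three cases above reduce to one operation: keep the intent's priority order and drop every source that has no data. A minimal sketch, assuming hypothetical helper names (`fallback_chain`, `first_answer`) and lookup functions that return None when they find nothing:

```python
def fallback_chain(priority_order, available):
    """Keep the priority order but drop sources that have no data."""
    return [source for source in priority_order if available.get(source, False)]


def first_answer(chain, lookups, query):
    """Walk the chain and return (source, answer) for the first hit."""
    for source in chain:
        answer = lookups[source](query)
        if answer is not None:
            return source, answer
    return None, None


FAQ_INTENT_ORDER = ["faq", "website_scraped", "industry_knowledge", "llm_fallback"]

# Case 3: FAQ and scraped content are both empty.
available = {
    "faq": False,
    "website_scraped": False,
    "industry_knowledge": True,
    "llm_fallback": True,
}

lookups = {
    "industry_knowledge": lambda q: None,          # no match in this toy example
    "llm_fallback": lambda q: "Generic response",  # stand-in for an LLM call
}

chain = fallback_chain(FAQ_INTENT_ORDER, available)
print(chain)  # ['industry_knowledge', 'llm_fallback']
print(first_answer(chain, lookups, "What are your hours?"))
# ('llm_fallback', 'Generic response')
```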
📝 Smart Features
1. Automatic Availability Detection
# Before querying, system checks:
- Does website have scraped content? (query DB)
- Are there active FAQs? (count)
- Is there unanswered history? (check logs)
# Only queries available sources
# Skips empty ones automatically
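One way to run the availability check is a row count per source, scoped to the website. The sketch below uses an in-memory SQLite database so it runs on its own; the table names (`faqs`, `scraped_pages`, `unanswered_questions`) and the `detect_available_sources` function are hypothetical, and it ignores details such as an "active" flag on FAQs or per-industry module availability.

```python
import sqlite3


def detect_available_sources(conn, website_id):
    """Return a source -> bool map based on row counts for one website."""
    # Table names come from this fixed mapping, never from user input.
    tables = {
        "faq": "faqs",
        "website_scraped": "scraped_pages",
        "unanswered_history": "unanswered_questions",
    }

    def has_rows(table):
        row = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE website_id = ?", (website_id,)
        ).fetchone()
        return row[0] > 0

    available = {source: has_rows(table) for source, table in tables.items()}
    available["llm_fallback"] = True  # described above as always available
    return available


# Demo schema: website 123 has two FAQs and nothing else.
conn = sqlite3.connect(":memory:")
for table in ("faqs", "scraped_pages", "unanswered_questions"):
    conn.execute(f"CREATE TABLE {table} (website_id INTEGER, body TEXT)")
conn.executemany("INSERT INTO faqs VALUES (?, ?)", [(123, "Q1"), (123, "Q2")])

availability = detect_available_sources(conn, 123)
print(availability)
```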
2. Confidence-Based Selection
# Even when the FAQ returns a match, another source can win
# if its score (confidence × priority weight) is higher.
# Example for an INDUSTRY_KNOWLEDGE intent:
FAQ: 0.65 × 0.7 = 0.455
Industry KB: 0.90 × 1.0 = 0.90
→ Selects Industry KB (higher score)
3. Unanswered Question Logging
# If confidence < 0.5 or not found:
log_unanswered_question(query, website_id)
# Later, admin can:
- Review unanswered questions
- Add to FAQ manually
- Improve knowledge base
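The logging rule above (confidence below 0.5, or nothing found) might look like the following. Only the `log_unanswered_question(query, website_id)` call appears in the text; the in-memory list, the `maybe_log` wrapper, and the record fields are assumptions standing in for a real database table.

```python
from datetime import datetime, timezone

UNANSWERED_LOG = []          # stand-in for a database table
CONFIDENCE_THRESHOLD = 0.5   # threshold quoted in the rule above


def log_unanswered_question(query, website_id):
    """Append one record so an admin can review it later."""
    UNANSWERED_LOG.append({
        "query": query,
        "website_id": website_id,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    })


def maybe_log(result, query, website_id):
    """Log when nothing was found or confidence is too low. Returns True if logged."""
    if result is None or result["confidence"] < CONFIDENCE_THRESHOLD:
        log_unanswered_question(query, website_id)
        return True
    return False


maybe_log({"confidence": 0.30}, "Who is Dr. Khan?", 123)   # logged
maybe_log({"confidence": 0.90}, "What is diabetes?", 123)  # not logged
print(len(UNANSWERED_LOG))  # 1
```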
4. Learning from History
# Future feature:
# If same question asked multiple times
# → Auto-suggest adding to FAQ
# → Learn patterns
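Since this feature is not built yet, the following is only one possible shape for it: count repeated questions after light normalization and surface those asked at least `min_count` times. The function names and the default threshold of 3 are assumptions.

```python
from collections import Counter


def normalize(question):
    """Lowercase, collapse whitespace, and strip trailing punctuation."""
    return " ".join(question.lower().split()).rstrip("?!. ")


def suggest_faq_candidates(questions, min_count=3):
    """Return (question, count) pairs asked at least `min_count` times."""
    counts = Counter(normalize(q) for q in questions)
    return [(q, n) for q, n in counts.most_common() if n >= min_count]


history = [
    "Fee kitni hai?",
    "fee kitni hai",
    "fee  kitni hai?",
    "Who is Dr. Khan?",
]
print(suggest_faq_candidates(history))  # [('fee kitni hai', 3)]
```

Exact-match counting misses paraphrases; grouping similar questions would need the semantic matching listed under Future Enhancements.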
💻 Usage Example
import asyncio

from app.services.unified_data_manager import get_unified_manager


async def main(db_session):
    # Initialize with a database session
    manager = get_unified_manager(db_session)

    # Query (any language)
    result = await manager.query(
        user_query="mujhe bukhar hai",  # Roman Urdu
        website_id=123,
        industry="healthcare",
        session_id="session_456",
    )
    return result

# `await` needs an async context; from a script, run:
# asyncio.run(main(db_session))

# Result:
{
    "answer": "Aapko bukhar hai. Mumkina bimariyan: Flu...",
    "source": "industry_knowledge",  # Which source was used
    "confidence": 0.90,
    "language_detected": "ur-roman",
    "intent": "INDUSTRY_KNOWLEDGE",
    "data_sources_checked": ["faq", "industry_knowledge", "llm_fallback"]
}
🎯 Data Source Priority Matrix
| Intent | 1st Priority | 2nd Priority | 3rd Priority | Fallback |
|---|---|---|---|---|
| FAQ | FAQ (1.0) | Scraped (0.8) | Industry (0.6) | LLM (0.5) |
| Industry Knowledge | Industry (1.0) | FAQ (0.7) | Scraped (0.5) | LLM (0.5) |
| Business Specific | Scraped (1.0) | FAQ (0.8) | Industry (0.4) | LLM (0.5) |
| Creative | LLM (1.0) | - | - | - |
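
The matrix above maps naturally onto a nested dictionary. This is a sketch of one possible configuration shape, not the project's actual config: the intent keys other than `INDUSTRY_KNOWLEDGE`, the `website_scraped` source key, and both helper functions are assumed names. The order within each entry follows the table columns rather than the weights, because under Business Specific the LLM weight (0.5) is higher than the Industry weight (0.4) even though the LLM is listed as the fallback.

```python
# Insertion order mirrors the table: 1st, 2nd, 3rd priority, then fallback.
PRIORITY_WEIGHTS = {
    "FAQ": {
        "faq": 1.0, "website_scraped": 0.8,
        "industry_knowledge": 0.6, "llm_fallback": 0.5,
    },
    "INDUSTRY_KNOWLEDGE": {
        "industry_knowledge": 1.0, "faq": 0.7,
        "website_scraped": 0.5, "llm_fallback": 0.5,
    },
    "BUSINESS_SPECIFIC": {
        "website_scraped": 1.0, "faq": 0.8,
        "industry_knowledge": 0.4, "llm_fallback": 0.5,
    },
    "CREATIVE": {"llm_fallback": 1.0},
}


def weights_for(intent):
    """Return the weight map for an intent; unknown intents use the LLM only (assumption)."""
    return PRIORITY_WEIGHTS.get(intent, {"llm_fallback": 1.0})


def priority_order(intent):
    """Return source names in table order for the given intent."""
    return list(weights_for(intent))


print(priority_order("BUSINESS_SPECIFIC"))
# ['website_scraped', 'faq', 'industry_knowledge', 'llm_fallback']
print(weights_for("CREATIVE"))
# {'llm_fallback': 1.0}
```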
✅ Benefits
1. Graceful Degradation
- If primary source empty → automatic fallback
- User always gets an answer
- No errors shown to user
2. Multilingual Support
- Same code handles all 3 languages
- Automatic translation in/out
- Language-aware responses
3. Context-Aware Routing
- Intent determines priority
- Industry influences search
- Confidence-based selection
4. Learn and Improve
- Logs unanswered questions
- Tracks what users ask
- Identifies knowledge gaps
5. Flexible Architecture
- Easy to add new data sources
- Configurable priorities
- Modular components
🚀 Future Enhancements
- Vector Search: Use embeddings for better matching
- Hybrid Retrieval: Combine keyword + semantic search
- Answer Fusion: Merge answers from multiple sources
- Learning Loop: Auto-improve from unanswered questions
- Caching: Cache frequently asked questions
- Analytics: Track which sources perform best
📊 Summary
Your Problem: Multiple data sources, any of which may be empty, queried in 3 languages, with a need for smart routing between them
Solution: Unified Data Source Manager
- ✅ Automatically checks all sources
- ✅ Prioritizes by intent
- ✅ Handles empty data gracefully
- ✅ Works in any language
- ✅ Always provides an answer (LLM fallback)
- ✅ Logs unanswered questions for later review
Result: One interface that routes each query to the best available source and replies in the user's language.