
Unified Data Source Management

🎯 The Challenge You Described

You have multiple data sources that may or may not have data:

  1. ✅ Website scraped data (may be empty)
  2. ✅ FAQ data (may be empty)
  3. ✅ Unanswered questions history (may be empty)
  4. ✅ Industry knowledge base (always available)

Users can ask anything in three languages (English, Urdu, or Roman Urdu).

Question: How can the system intelligently determine which data source to use for each query and handle context?


✅ The Solution: Unified Data Source Manager

I've created unified_data_manager.py, a smart system that:

  1. ✅ Checks all data sources automatically
  2. ✅ Prioritizes based on query intent
  3. ✅ Handles empty/missing data gracefully
  4. ✅ Works across all 3 languages
  5. ✅ Provides intelligent fallbacks

📊 How It Works (Step-by-Step)

Example Query: "mujhe bukhar hai" (I have fever)

┌─────────────────────────────────────────────────────────────┐
│ Step 1: Language Detection                                  │
└─────────────────────────────────────────────────────────────┘
Input: "mujhe bukhar hai"
→ Detected: Roman Urdu (confidence: 0.95)

┌─────────────────────────────────────────────────────────────┐
│ Step 2: Translation to English (for processing)             │
└─────────────────────────────────────────────────────────────┘
Roman Urdu: "mujhe bukhar hai"
→ English: "I have fever"

┌─────────────────────────────────────────────────────────────┐
│ Step 3: Intent Classification                               │
└─────────────────────────────────────────────────────────────┘
Query: "I have fever"
Industry: Healthcare
→ Intent: INDUSTRY_KNOWLEDGE (confidence: 0.85)

┌─────────────────────────────────────────────────────────────┐
│ Step 4: Check Data Source Availability                      │
└─────────────────────────────────────────────────────────────┘
Checking website_id: 123

✅ FAQ: Has 15 FAQs
❌ Scraped Content: Empty (no data)
✅ Industry KB: Healthcare module available
❌ Unanswered History: No history yet
✅ LLM Fallback: Always available

Available: [FAQ, Industry KB, LLM]

┌─────────────────────────────────────────────────────────────┐
│ Step 5: Query All Available Sources                         │
└─────────────────────────────────────────────────────────────┘

Source 1: FAQ Database
  Query: "I have fever"
  Result: No exact match
  Confidence: 0.0
  ❌ Not found

Source 2: Industry Knowledge Base (Healthcare)
  Query: "I have fever"
  Result: Found symptom checker
  Confidence: 0.90
  ✅ Found! (Fever → Flu, COVID-19, Infection)

Source 3: LLM Fallback
  Query: "I have fever"
  Result: Generic response
  Confidence: 0.60
  ✅ Available

┌─────────────────────────────────────────────────────────────┐
│ Step 6: Select Best Result                                  │
└─────────────────────────────────────────────────────────────┘

Results ranked by: Confidence × Priority Weight

For INDUSTRY_KNOWLEDGE intent:
- Industry KB: 0.90 × 1.0 = 0.90 ✅ WINNER
- LLM Fallback: 0.60 × 0.5 = 0.30

Selected: Industry Knowledge Base

┌─────────────────────────────────────────────────────────────┐
│ Step 7: Translate Answer Back to User's Language            │
└─────────────────────────────────────────────────────────────┘

Answer (English): "You have fever. Possible conditions: Flu, COVID-19..."
Target Language: Roman Urdu
→ Translated: "Aapko bukhar hai. Mumkina bimariyan: Flu, COVID-19..."

┌─────────────────────────────────────────────────────────────┐
│ Final Response                                               │
└─────────────────────────────────────────────────────────────┘
{
  "answer": "Aapko bukhar hai. Mumkina bimariyan: Flu, COVID-19...",
  "source": "industry_knowledge",
  "confidence": 0.90,
  "language_detected": "ur-roman",
  "intent": "INDUSTRY_KNOWLEDGE"
}
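
The selection in Step 6 can be sketched as a small scoring function. This is a minimal illustration and not the code in unified_data_manager.py: `SourceResult`, `select_best`, and the source-name strings are hypothetical names chosen for this example, while the weights are the ones this guide lists for the INDUSTRY_KNOWLEDGE intent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceResult:
    source: str        # hypothetical source key, e.g. "faq" or "llm_fallback"
    confidence: float  # 0.0 means the source found nothing


# Priority weights for the INDUSTRY_KNOWLEDGE intent, as listed in this guide.
INDUSTRY_WEIGHTS = {
    "industry_knowledge": 1.0,
    "faq": 0.7,
    "website_scraped": 0.5,
    "llm_fallback": 0.5,
}


def select_best(results: list[SourceResult], weights: dict[str, float]) -> Optional[SourceResult]:
    """Return the result with the highest confidence x weight, or None if nothing matched."""
    candidates = [r for r in results if r.confidence > 0.0]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.confidence * weights.get(r.source, 0.0))


# Confidences taken from Step 5 of the walkthrough.
results = [
    SourceResult("faq", 0.0),
    SourceResult("industry_knowledge", 0.90),
    SourceResult("llm_fallback", 0.60),
]
best = select_best(results, INDUSTRY_WEIGHTS)
print(best.source)  # industry_knowledge
```

Sources with zero confidence are dropped before ranking, so a source that found nothing never wins by weight alone.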

🎯 Priority System by Intent

1. FAQ Intent

Priority Order:
1. FAQ Database (100%) ← Try first
2. Website Scraped (80%)
3. Industry KB (60%)
4. LLM Fallback (50%)

Example: "What are your hours?"

  • Checks the FAQ first (highest priority)
  • If not in the FAQ, checks website content
  • Falls back to the LLM

2. Industry Knowledge Intent

Priority Order:
1. Industry KB (100%) ← Try first
2. FAQ Database (70%)
3. Website Scraped (50%)
4. LLM Fallback (50%)

Example: "What is diabetes?"

  • Healthcare module first
  • Falls back to the FAQ if no medical answer is found
  • LLM as last resort

3. Business-Specific Intent

Priority Order:
1. Website Scraped (100%) ← Try first
2. FAQ Database (80%)
3. Industry KB (40%)
4. LLM Fallback (50%)

Example: "Who is Dr. Khan?"

  • Checks scraped website content first
  • FAQ second
  • LLM if not found
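
One way to hold these priorities is a plain nested dictionary keyed by intent. The sketch below is an assumption about how such a configuration could look, not the actual structure used in unified_data_manager.py; the intent keys `FAQ` and `BUSINESS_SPECIFIC` and the source-name strings are hypothetical. Note that for the business-specific intent the listed order and the weights disagree (Industry KB at 40% is listed before the LLM fallback at 50%), so the sketch exposes both views.

```python
# Weights copied from the three priority lists above.
PRIORITY_WEIGHTS = {
    "FAQ": {
        "faq": 1.0,
        "website_scraped": 0.8,
        "industry_knowledge": 0.6,
        "llm_fallback": 0.5,
    },
    "INDUSTRY_KNOWLEDGE": {
        "industry_knowledge": 1.0,
        "faq": 0.7,
        "website_scraped": 0.5,
        "llm_fallback": 0.5,
    },
    "BUSINESS_SPECIFIC": {
        "website_scraped": 1.0,
        "faq": 0.8,
        "industry_knowledge": 0.4,
        "llm_fallback": 0.5,
    },
}


def priority_order(intent: str) -> list[str]:
    """Sources in the order they are listed for the intent (dict insertion order)."""
    return list(PRIORITY_WEIGHTS[intent])


def by_weight(intent: str) -> list[str]:
    """Sources sorted by descending weight; ties keep their listed order."""
    weights = PRIORITY_WEIGHTS[intent]
    return sorted(weights, key=lambda source: weights[source], reverse=True)


print(priority_order("BUSINESS_SPECIFIC"))
print(by_weight("BUSINESS_SPECIFIC"))
```

Keeping the weights in data rather than in branching code means a new intent or source only needs a new dictionary entry.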

🌐 Multilingual Support

Language Flow

User Query (Any Language)
    ↓
Detect Language (en/ur/ur-roman)
    ↓
Translate to English (if needed)
    ↓
Process with English (all data sources)
    ↓
Translate Answer Back (to user's language)
    ↓
Return in Original Language
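
The flow above can be sketched as a thin wrapper that takes the detector, translator, and query processor as arguments. All names here (`answer_in_user_language` and the stub functions) are hypothetical, and the stubs only know the Roman Urdu fee example from this guide; a real system would plug in actual detection and translation components.

```python
from typing import Callable


def answer_in_user_language(
    query: str,
    detect: Callable[[str], str],
    translate: Callable[[str, str, str], str],
    process: Callable[[str], str],
) -> str:
    """Detect the language, process the query in English, and translate the answer back."""
    lang = detect(query)
    english_query = query if lang == "en" else translate(query, lang, "en")
    english_answer = process(english_query)
    return english_answer if lang == "en" else translate(english_answer, "en", lang)


# Stub components for demonstration only.
PHRASES = {
    ("fee kitni hai?", "ur-roman", "en"): "How much is the fee?",
    ("Annual fee is $5000", "en", "ur-roman"): "Saalana fee $5000 hai",
}


def stub_detect(text: str) -> str:
    return "ur-roman" if text == "fee kitni hai?" else "en"


def stub_translate(text: str, source_lang: str, target_lang: str) -> str:
    return PHRASES[(text, source_lang, target_lang)]


def stub_process(english_query: str) -> str:
    return "Annual fee is $5000" if english_query == "How much is the fee?" else "Not found"


print(answer_in_user_language("fee kitni hai?", stub_detect, stub_translate, stub_process))
# Saalana fee $5000 hai
```

English queries skip both translation steps, so the data sources only ever see English text.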

Examples

Scenario 1: Urdu Query

Input:  "داخلے کی شرائط کیا ہیں؟"
Detect: Urdu
Translate: "What are admission requirements?"
Process: Query FAQ + Education KB
Answer: "Minimum GPA 3.0..."
Translate: "کم از کم GPA 3.0..."
Output: "کم از کم GPA 3.0..."

Scenario 2: Roman Urdu Query

Input:  "fee kitni hai?"
Detect: Roman Urdu
Translate: "How much is the fee?"
Process: Query FAQ first
Answer: "Annual fee is $5000"
Translate: "Saalana fee $5000 hai"
Output: "Saalana fee $5000 hai"
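
This guide does not say how the language is detected, so the following is only one possible heuristic, offered as an assumption. Urdu script can be recognized by the Unicode Arabic block (U+0600 to U+06FF), and Roman Urdu is guessed from a small set of marker words. `detect_language` and `ROMAN_URDU_MARKERS` are hypothetical names, and the word list is far too small for real traffic.

```python
import re

# Hypothetical marker words; a real detector would need a much larger list or a trained model.
ROMAN_URDU_MARKERS = {"hai", "mujhe", "kitni", "kya", "aap", "nahi", "kaise"}


def detect_language(text: str) -> str:
    """Return "ur", "ur-roman", or "en" using a simple script and keyword heuristic."""
    # Urdu is written in Arabic script, which lives in the U+0600-U+06FF block.
    if any("\u0600" <= ch <= "\u06ff" for ch in text):
        return "ur"
    # Lowercase Latin words only, with punctuation stripped.
    words = set(re.findall(r"[a-z]+", text.lower()))
    if words & ROMAN_URDU_MARKERS:
        return "ur-roman"
    return "en"


for sample in ["داخلے کی شرائط کیا ہیں؟", "fee kitni hai?", "What are your hours?"]:
    print(detect_language(sample))
# ur
# ur-roman
# en
```

A keyword heuristic like this returns a label without a confidence value, so a score such as the 0.95 shown in the walkthrough would have to come from a different kind of detector.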

🔄 Handling Empty Data Sources

Case 1: No FAQ Data

# System automatically detects
available_sources[FAQ] = False  # No FAQs in database

# Skip FAQ, move to next priority
→ Checks: Website Scraped → Industry KB → LLM

Case 2: No Scraped Content

# Website has no scraped content
available_sources[WEBSITE_SCRAPED] = False

# Skip scraped content
→ Checks: FAQ → Industry KB → LLM

Case 3: All Business Data Empty (FAQ + Scraped both empty)

# Fallback chain:
1. FAQ → Empty ❌
2. Scraped → Empty ❌
3. Industry KB → Available ✅ (if industry query)
4. LLM → Always available ✅

Result: User still gets an answer!
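
The three cases above amount to filtering the priority order by availability. Here is a minimal sketch, assuming availability is tracked as a dictionary of booleans; `fallback_chain` and the source-name strings are hypothetical names for this example.

```python
def fallback_chain(priority_order: list[str], available: dict[str, bool]) -> list[str]:
    """Keep the priority order but drop sources that have no data."""
    return [source for source in priority_order if available.get(source, False)]


FAQ_INTENT_ORDER = ["faq", "website_scraped", "industry_knowledge", "llm_fallback"]

# Case 1: no FAQ data.
case_1 = {"faq": False, "website_scraped": True, "industry_knowledge": True, "llm_fallback": True}
print(fallback_chain(FAQ_INTENT_ORDER, case_1))
# ['website_scraped', 'industry_knowledge', 'llm_fallback']

# Case 3: FAQ and scraped content are both empty.
case_3 = {"faq": False, "website_scraped": False, "industry_knowledge": True, "llm_fallback": True}
print(fallback_chain(FAQ_INTENT_ORDER, case_3))
# ['industry_knowledge', 'llm_fallback']
```

Because the LLM fallback is always marked available, the chain is never empty.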


📝 Smart Features

1. Automatic Availability Detection

# Before querying, system checks:
- Does website have scraped content? (query DB)
- Are there active FAQs? (count)
- Is there unanswered history? (check logs)

# Only queries available sources
# Skips empty ones automatically
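
The availability check can be pictured as turning a few counts into flags. This is a sketch under assumptions: `SourceCounts` and `detect_available` are hypothetical names, and in a real system the counts would come from database queries that are not shown here.

```python
from dataclasses import dataclass


@dataclass
class SourceCounts:
    active_faqs: int
    scraped_pages: int
    unanswered_entries: int
    has_industry_module: bool


def detect_available(counts: SourceCounts) -> dict[str, bool]:
    """Map raw counts to per-source availability flags."""
    return {
        "faq": counts.active_faqs > 0,
        "website_scraped": counts.scraped_pages > 0,
        "industry_knowledge": counts.has_industry_module,
        "unanswered_history": counts.unanswered_entries > 0,
        "llm_fallback": True,  # this guide describes the LLM fallback as always available
    }


# Values from Step 4 of the walkthrough: 15 FAQs, no scraped content, no history.
available = detect_available(SourceCounts(15, 0, 0, True))
print([name for name, ok in available.items() if ok])
# ['faq', 'industry_knowledge', 'llm_fallback']
```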

2. Confidence-Based Selection

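# Results are ranked by confidence × priority weight, so a source can win
# even when the FAQ also returns a match.
# Example with the Industry Knowledge intent weights:
FAQ: 0.65 × 0.7 = 0.455
Industry KB: 0.90 × 1.0 = 0.90

→ Selects Industry KB (higher weighted score)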

3. Unanswered Question Logging

# If confidence < 0.5 or not found:
log_unanswered_question(query, website_id)

# Later, admin can:
- Review unanswered questions
- Add to FAQ manually
- Improve knowledge base
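
The logging rule can be sketched with an in-memory store. This is an illustration only: `UnansweredLog` and `maybe_log` are hypothetical, the real `log_unanswered_question` presumably writes to the database, and only the 0.5 threshold comes from this guide.

```python
from dataclasses import dataclass, field
from typing import Optional

LOW_CONFIDENCE_THRESHOLD = 0.5  # threshold stated in this guide


@dataclass
class UnansweredLog:
    """In-memory stand-in for wherever unanswered questions are stored."""
    entries: list[tuple[int, str]] = field(default_factory=list)

    def log_unanswered_question(self, query: str, website_id: int) -> None:
        self.entries.append((website_id, query))


def maybe_log(log: UnansweredLog, query: str, website_id: int, confidence: Optional[float]) -> bool:
    """Log the query when nothing was found (None) or confidence is below the threshold."""
    if confidence is None or confidence < LOW_CONFIDENCE_THRESHOLD:
        log.log_unanswered_question(query, website_id)
        return True
    return False


log = UnansweredLog()
maybe_log(log, "Who is Dr. Khan?", 123, None)      # not found: logged
maybe_log(log, "What are your hours?", 123, 0.30)  # low confidence: logged
maybe_log(log, "mujhe bukhar hai", 123, 0.90)      # confident answer: skipped
print(len(log.entries))  # 2
```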

4. Learning from History

# Future feature:
# If same question asked multiple times
# → Auto-suggest adding to FAQ
# → Learn patterns
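
Since this is a planned feature, the following is only one possible approach: count normalized questions and suggest those asked repeatedly. `suggest_faq_candidates` and the `min_count` default of 3 are assumptions made for this example.

```python
from collections import Counter


def suggest_faq_candidates(questions: list[str], min_count: int = 3) -> list[str]:
    """Return normalized questions asked at least min_count times, most frequent first."""
    # Normalize case, collapse whitespace, and strip trailing punctuation.
    normalized = (" ".join(q.lower().split()).rstrip("?!. ") for q in questions)
    counts = Counter(normalized)
    return [question for question, n in counts.most_common() if n >= min_count]


history = [
    "Fee kitni hai?",
    "fee kitni hai",
    "fee  kitni hai?",
    "What are your hours?",
]
print(suggest_faq_candidates(history))
# ['fee kitni hai']
```

Exact-match counting misses paraphrases, so grouping similar questions would need fuzzy or embedding-based matching.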

💻 Usage Example

import asyncio

from app.services.unified_data_manager import get_unified_manager


async def ask(db_session):
    # Initialize with a database session
    manager = get_unified_manager(db_session)

    # Query (any language); `await` is only valid inside a coroutine
    return await manager.query(
        user_query="mujhe bukhar hai",  # Roman Urdu
        website_id=123,
        industry="healthcare",
        session_id="session_456",
    )

# From synchronous code, with your application's session object:
# result = asyncio.run(ask(db_session))

# Result:
{
    "answer": "Aapko bukhar hai. Mumkina bimariyan: Flu...",
    "source": "industry_knowledge",  # Which source was used
    "confidence": 0.90,
    "language_detected": "ur-roman",
    "intent": "INDUSTRY_KNOWLEDGE",
    "data_sources_checked": ["faq", "industry_knowledge", "llm_fallback"]
}

🎯 Data Source Priority Matrix

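| Intent             | 1st Priority   | 2nd Priority  | 3rd Priority   | Fallback  |
|--------------------|----------------|---------------|----------------|-----------|
| FAQ                | FAQ (1.0)      | Scraped (0.8) | Industry (0.6) | LLM (0.5) |
| Industry Knowledge | Industry (1.0) | FAQ (0.7)     | Scraped (0.5)  | LLM (0.5) |
| Business Specific  | Scraped (1.0)  | FAQ (0.8)     | Industry (0.4) | LLM (0.5) |
| Creative           | LLM (1.0)      | -             | -              | -         |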

✅ Benefits

1. Graceful Degradation

  • If primary source empty → automatic fallback
  • User always gets an answer
  • No errors shown to user

2. Multilingual Support

  • Same code handles all 3 languages
  • Automatic translation in/out
  • Language-aware responses

3. Context-Aware Routing

  • Intent determines priority
  • Industry influences search
  • Confidence-based selection

4. Learn and Improve

  • Logs unanswered questions
  • Tracks what users ask
  • Identifies knowledge gaps

5. Flexible Architecture

  • Easy to add new data sources
  • Configurable priorities
  • Modular components

🚀 Future Enhancements

  1. Vector Search: Use embeddings for better matching
  2. Hybrid Retrieval: Combine keyword + semantic search
  3. Answer Fusion: Merge answers from multiple sources
  4. Learning Loop: Auto-improve from unanswered questions
  5. Caching: Cache frequently asked questions
  6. Analytics: Track which sources perform best

📊 Summary

Your Problem: Multiple data sources, may be empty, 3 languages, need smart routing

Solution: Unified Data Source Manager

  • ✅ Automatically checks all sources
  • ✅ Prioritizes by intent
  • ✅ Handles empty data gracefully
  • ✅ Works in any language
  • ✅ Always provides an answer (LLM fallback)
  • ✅ Logs unanswered questions for learning

Result: One unified interface that intelligently handles everything!