Spaces:

anasraza526
/

customeragent-api

Runtime error

Dataset	Status	Entries	Production Ready?
MedQuAD	Sample	5 Q&A	❌ Need real data
SymCAT	Sample	8 symptoms	❌ Need real data
CourseQ	Sample	5 Q&A	✅ Custom created
STACKED	Sample	3 Q&A	❌ Need real data
CLINC150	Sample	5 examples	❌ Need real data

Why Sample Data Works:

Demonstrates all features
75%+ accuracy achieved
Perfect for development/testing
No external dependencies

🔧 Setting Up Real Datasets (Production)

Option 1: Quick Setup (CLINC150 Only)

cd server
chmod +x setup_datasets.sh
./setup_datasets.sh

This downloads CLINC150 automatically. Others require manual setup.

Option 2: Manual Setup

1. MedQuAD (Medical Q&A)

cd server/datasets
mkdir -p medquad
cd medquad

# Download from GitHub
git clone https://github.com/abachaa/MedQuAD.git

# The system will automatically load XML files from subdirectories

What You Get:

47,000+ medical Q&A pairs
From trusted sources (NIH, CDC, Mayo Clinic)
Covers diseases, treatments, symptoms

2. SymCAT (Symptom-Disease Mapping)

cd server/datasets
mkdir -p symcat

# Download symptoms.json
curl -o symcat/symptoms.json \
  https://raw.githubusercontent.com/symcat/symcat-corpus/master/symptoms.json

What You Get:

Comprehensive symptom-to-disease mappings
Clinical-grade data
Improved symptom checker accuracy

3. Roman Urdu Corpus

cd server/datasets
mkdir -p roman_urdu_corpus

# Download Roman Urdu dataset (various sources available)
# Example: https://github.com/harisbinzia/roman-urdu-dataset
# Place CSV file in: roman_urdu_corpus/data.csv

What You Get:

Roman Urdu ↔ Urdu parallel corpus
Improved translation accuracy
Better normalization

🔴 Setting Up Redis (Translation Cache)

Why Redis?

Persistent cache (survives server restarts)
Faster translations (cache hits)
Reduced API costs (fewer Google Translate calls)

Installation

macOS:

# Install Redis
brew install redis

# Start Redis service
brew services start redis

# Verify it's running
redis-cli ping  # Should return "PONG"

Linux:

sudo apt-get install redis-server
sudo systemctl start redis
sudo systemctl enable redis

Docker:

docker run -d -p 6379:6379 redis:alpine

Configuration

Add to your .env file:

# Redis Configuration
REDIS_URL=redis://localhost:6379
REDIS_PASSWORD=  # Leave empty for local development

# Optional: Configure Redis cache TTL
REDIS_CACHE_TTL=86400  # 24 hours

The system will automatically detect Redis and use it if available.

🎯 Industry Selection Explained

How It Works

Per-Website Industry Setting:

# When creating a website
{
  "domain": "healthclinic.com",
  "industry": "healthcare",  # ← Industry set here
  "supported_languages": ["en", "ur"]
}

When User Chats:

# Query router receives:
query = "I have fever"
website_id = 123  # healthclinic.com
industry = website.industry  # "healthcare" from DB

# System uses healthcare-specific features:
- Healthcare symptom checker
- Medical terminology knowledge
- HIPAA compliance
- Emergency detection

Example Flow

Website 1: Health Clinic

Industry: Healthcare
User asks: "mujhe bukhar hai" (I have fever)
→ Detects symptoms
→ Uses healthcare module
→ Returns: Urgency, medical advice, red flags

Website 2: University

Industry: Education  
User asks: "admission ke liye kya chahiye"
→ Detects academic query
→ Uses education module
→ Returns: Admission requirements, process

Key Points

Each website has its own industry (stored in database)
Industry is selected during website creation (by website owner)
All users on that website get industry-specific responses
Different websites can have different industries

Example:

hospital-a.com → Healthcare responses
university-b.edu → Education responses
store-c.com → E-commerce responses

🚀 Recommended Setup by Environment

Development (Current)

✅ Sample datasets (what you have now)
✅ In-memory cache (what you have now)
✅ No Redis needed
✅ No dataset downloads needed

Perfect for:

Feature development
Testing
Demos

Staging

⚠️ Download CLINC150, CourseQ
⚠️ Install Redis locally
✅ Keep using sample MedQuAD/SymCAT

Good for:

User acceptance testing
Performance testing
Integration testing

Production

🔴 Download ALL real datasets
🔴 Setup Redis cluster
🔴 Configure environment variables
🔴 Enable monitoring

Required for:

Real users
Scale
Accuracy
Reliability

📊 Impact of Using Real Datasets

Current (Sample Data)

Knowledge Base: ~25 Q&A pairs
Symptom Mapping: 8 symptoms
Intent Training: 5 examples
Accuracy: 75% ✅

With Real Datasets

Knowledge Base: 47,000+ Q&A pairs
Symptom Mapping: Comprehensive
Intent Training: 22,000+ examples
Expected Accuracy: 85-90% 🚀

🎯 Quick Decision Guide

Should I download real datasets?

Scenario	Recommendation
Just testing features	❌ No - use sample data
Development/debugging	❌ No - use sample data
Showing to stakeholders	🟡 Maybe - CLINC150 only
Deploying to staging	✅ Yes - partial (CLINC150, Redis)
Production deployment	✅✅ Yes - ALL datasets + Redis

💡 Summary

Current Warnings:

✅ Safe and expected for development
✅ System works perfectly with fallbacks
✅ No action needed right now

Industry Selection:

✅ Per-website setting (not global)
✅ All users on website get that industry's responses
✅ Different websites can have different industries

Next Steps:

✅ Keep developing with sample data
⏭️ Download real datasets when ready for staging
⏭️ Setup Redis when deploying to production