customeragent-api / server /SETUP_GUIDE.md
anasraza526's picture
Clean deploy to Hugging Face
ac90985

Dataset and Redis Setup Guide

🎯 Quick Answer

These warnings are SAFE and EXPECTED for development.

Your system is working perfectly with:

  • βœ… Sample medical Q&A (5 entries)
  • βœ… Sample symptom mappings (8 symptoms)
  • βœ… Sample education Q&A (8 entries)
  • βœ… In-memory translation cache

No action required unless deploying to production.


πŸ“¦ Dataset Status

Currently Using Sample Data

Dataset Status Entries Production Ready?
MedQuAD Sample 5 Q&A ❌ Need real data
SymCAT Sample 8 symptoms ❌ Need real data
CourseQ Sample 5 Q&A βœ… Custom created
STACKED Sample 3 Q&A ❌ Need real data
CLINC150 Sample 5 examples ❌ Need real data

Why Sample Data Works:

  • Demonstrates all features
  • 75%+ accuracy achieved
  • Perfect for development/testing
  • No external dependencies

πŸ”§ Setting Up Real Datasets (Production)

Option 1: Quick Setup (CLINC150 Only)

cd server
chmod +x setup_datasets.sh
./setup_datasets.sh

This downloads CLINC150 automatically. Others require manual setup.


Option 2: Manual Setup

1. MedQuAD (Medical Q&A)

cd server/datasets
mkdir -p medquad
cd medquad

# Download from GitHub
git clone https://github.com/abachaa/MedQuAD.git

# The system will automatically load XML files from subdirectories

What You Get:

  • 47,000+ medical Q&A pairs
  • From trusted sources (NIH, CDC, Mayo Clinic)
  • Covers diseases, treatments, symptoms

2. SymCAT (Symptom-Disease Mapping)

cd server/datasets
mkdir -p symcat

# Download symptoms.json
curl -o symcat/symptoms.json \
  https://raw.githubusercontent.com/symcat/symcat-corpus/master/symptoms.json

What You Get:

  • Comprehensive symptom-to-disease mappings
  • Clinical-grade data
  • Improved symptom checker accuracy

3. Roman Urdu Corpus

cd server/datasets
mkdir -p roman_urdu_corpus

# Download Roman Urdu dataset (various sources available)
# Example: https://github.com/harisbinzia/roman-urdu-dataset
# Place CSV file in: roman_urdu_corpus/data.csv

What You Get:

  • Roman Urdu ↔ Urdu parallel corpus
  • Improved translation accuracy
  • Better normalization

πŸ”΄ Setting Up Redis (Translation Cache)

Why Redis?

  • Persistent cache (survives server restarts)
  • Faster translations (cache hits)
  • Reduced API costs (fewer Google Translate calls)

Installation

macOS:

# Install Redis
brew install redis

# Start Redis service
brew services start redis

# Verify it's running
redis-cli ping  # Should return "PONG"

Linux:

sudo apt-get install redis-server
sudo systemctl start redis
sudo systemctl enable redis

Docker:

docker run -d -p 6379:6379 redis:alpine

Configuration

Add to your .env file:

# Redis Configuration
REDIS_URL=redis://localhost:6379
REDIS_PASSWORD=  # Leave empty for local development

# Optional: Configure Redis cache TTL
REDIS_CACHE_TTL=86400  # 24 hours

The system will automatically detect Redis and use it if available.


🎯 Industry Selection Explained

How It Works

Per-Website Industry Setting:

# When creating a website
{
  "domain": "healthclinic.com",
  "industry": "healthcare",  # ← Industry set here
  "supported_languages": ["en", "ur"]
}

When User Chats:

# Query router receives:
query = "I have fever"
website_id = 123  # healthclinic.com
industry = website.industry  # "healthcare" from DB

# System uses healthcare-specific features:
- Healthcare symptom checker
- Medical terminology knowledge
- HIPAA compliance
- Emergency detection

Example Flow

Website 1: Health Clinic

Industry: Healthcare
User asks: "mujhe bukhar hai" (I have fever)
β†’ Detects symptoms
β†’ Uses healthcare module
β†’ Returns: Urgency, medical advice, red flags

Website 2: University

Industry: Education  
User asks: "admission ke liye kya chahiye"
β†’ Detects academic query
β†’ Uses education module
β†’ Returns: Admission requirements, process

Key Points

  1. Each website has its own industry (stored in database)
  2. Industry is selected during website creation (by website owner)
  3. All users on that website get industry-specific responses
  4. Different websites can have different industries

Example:

  • hospital-a.com β†’ Healthcare responses
  • university-b.edu β†’ Education responses
  • store-c.com β†’ E-commerce responses

πŸš€ Recommended Setup by Environment

Development (Current)

βœ… Sample datasets (what you have now)
βœ… In-memory cache (what you have now)
βœ… No Redis needed
βœ… No dataset downloads needed

Perfect for:

  • Feature development
  • Testing
  • Demos

Staging

⚠️ Download CLINC150, CourseQ
⚠️ Install Redis locally
βœ… Keep using sample MedQuAD/SymCAT

Good for:

  • User acceptance testing
  • Performance testing
  • Integration testing

Production

πŸ”΄ Download ALL real datasets
πŸ”΄ Setup Redis cluster
πŸ”΄ Configure environment variables
πŸ”΄ Enable monitoring

Required for:

  • Real users
  • Scale
  • Accuracy
  • Reliability

πŸ“Š Impact of Using Real Datasets

Current (Sample Data)

  • Knowledge Base: ~25 Q&A pairs
  • Symptom Mapping: 8 symptoms
  • Intent Training: 5 examples
  • Accuracy: 75% βœ…

With Real Datasets

  • Knowledge Base: 47,000+ Q&A pairs
  • Symptom Mapping: Comprehensive
  • Intent Training: 22,000+ examples
  • Expected Accuracy: 85-90% πŸš€

🎯 Quick Decision Guide

Should I download real datasets?

Scenario Recommendation
Just testing features ❌ No - use sample data
Development/debugging ❌ No - use sample data
Showing to stakeholders 🟑 Maybe - CLINC150 only
Deploying to staging βœ… Yes - partial (CLINC150, Redis)
Production deployment βœ…βœ… Yes - ALL datasets + Redis

πŸ’‘ Summary

Current Warnings:

  • βœ… Safe and expected for development
  • βœ… System works perfectly with fallbacks
  • βœ… No action needed right now

Industry Selection:

  • βœ… Per-website setting (not global)
  • βœ… All users on website get that industry's responses
  • βœ… Different websites can have different industries

Next Steps:

  1. βœ… Keep developing with sample data
  2. ⏭️ Download real datasets when ready for staging
  3. ⏭️ Setup Redis when deploying to production