Spaces:
Runtime error
Runtime error
Dataset and Redis Setup Guide
π― Quick Answer
These warnings are SAFE and EXPECTED for development.
Your system is working perfectly with:
- β Sample medical Q&A (5 entries)
- β Sample symptom mappings (8 symptoms)
- β Sample education Q&A (8 entries)
- β In-memory translation cache
No action required unless deploying to production.
π¦ Dataset Status
Currently Using Sample Data
| Dataset | Status | Entries | Production Ready? |
|---|---|---|---|
| MedQuAD | Sample | 5 Q&A | β Need real data |
| SymCAT | Sample | 8 symptoms | β Need real data |
| CourseQ | Sample | 5 Q&A | β Custom created |
| STACKED | Sample | 3 Q&A | β Need real data |
| CLINC150 | Sample | 5 examples | β Need real data |
Why Sample Data Works:
- Demonstrates all features
- 75%+ accuracy achieved
- Perfect for development/testing
- No external dependencies
π§ Setting Up Real Datasets (Production)
Option 1: Quick Setup (CLINC150 Only)
cd server
chmod +x setup_datasets.sh
./setup_datasets.sh
This downloads CLINC150 automatically. Others require manual setup.
Option 2: Manual Setup
1. MedQuAD (Medical Q&A)
cd server/datasets
mkdir -p medquad
cd medquad
# Download from GitHub
git clone https://github.com/abachaa/MedQuAD.git
# The system will automatically load XML files from subdirectories
What You Get:
- 47,000+ medical Q&A pairs
- From trusted sources (NIH, CDC, Mayo Clinic)
- Covers diseases, treatments, symptoms
2. SymCAT (Symptom-Disease Mapping)
cd server/datasets
mkdir -p symcat
# Download symptoms.json
curl -o symcat/symptoms.json \
https://raw.githubusercontent.com/symcat/symcat-corpus/master/symptoms.json
What You Get:
- Comprehensive symptom-to-disease mappings
- Clinical-grade data
- Improved symptom checker accuracy
3. Roman Urdu Corpus
cd server/datasets
mkdir -p roman_urdu_corpus
# Download Roman Urdu dataset (various sources available)
# Example: https://github.com/harisbinzia/roman-urdu-dataset
# Place CSV file in: roman_urdu_corpus/data.csv
What You Get:
- Roman Urdu β Urdu parallel corpus
- Improved translation accuracy
- Better normalization
π΄ Setting Up Redis (Translation Cache)
Why Redis?
- Persistent cache (survives server restarts)
- Faster translations (cache hits)
- Reduced API costs (fewer Google Translate calls)
Installation
macOS:
# Install Redis
brew install redis
# Start Redis service
brew services start redis
# Verify it's running
redis-cli ping # Should return "PONG"
Linux:
sudo apt-get install redis-server
sudo systemctl start redis
sudo systemctl enable redis
Docker:
docker run -d -p 6379:6379 redis:alpine
Configuration
Add to your .env file:
# Redis Configuration
REDIS_URL=redis://localhost:6379
REDIS_PASSWORD= # Leave empty for local development
# Optional: Configure Redis cache TTL
REDIS_CACHE_TTL=86400 # 24 hours
The system will automatically detect Redis and use it if available.
π― Industry Selection Explained
How It Works
Per-Website Industry Setting:
# When creating a website
{
"domain": "healthclinic.com",
"industry": "healthcare", # β Industry set here
"supported_languages": ["en", "ur"]
}
When User Chats:
# Query router receives:
query = "I have fever"
website_id = 123 # healthclinic.com
industry = website.industry # "healthcare" from DB
# System uses healthcare-specific features:
- Healthcare symptom checker
- Medical terminology knowledge
- HIPAA compliance
- Emergency detection
Example Flow
Website 1: Health Clinic
Industry: Healthcare
User asks: "mujhe bukhar hai" (I have fever)
β Detects symptoms
β Uses healthcare module
β Returns: Urgency, medical advice, red flags
Website 2: University
Industry: Education
User asks: "admission ke liye kya chahiye"
β Detects academic query
β Uses education module
β Returns: Admission requirements, process
Key Points
- Each website has its own industry (stored in database)
- Industry is selected during website creation (by website owner)
- All users on that website get industry-specific responses
- Different websites can have different industries
Example:
hospital-a.comβ Healthcare responsesuniversity-b.eduβ Education responsesstore-c.comβ E-commerce responses
π Recommended Setup by Environment
Development (Current)
β
Sample datasets (what you have now)
β
In-memory cache (what you have now)
β
No Redis needed
β
No dataset downloads needed
Perfect for:
- Feature development
- Testing
- Demos
Staging
β οΈ Download CLINC150, CourseQ
β οΈ Install Redis locally
β
Keep using sample MedQuAD/SymCAT
Good for:
- User acceptance testing
- Performance testing
- Integration testing
Production
π΄ Download ALL real datasets
π΄ Setup Redis cluster
π΄ Configure environment variables
π΄ Enable monitoring
Required for:
- Real users
- Scale
- Accuracy
- Reliability
π Impact of Using Real Datasets
Current (Sample Data)
- Knowledge Base: ~25 Q&A pairs
- Symptom Mapping: 8 symptoms
- Intent Training: 5 examples
- Accuracy: 75% β
With Real Datasets
- Knowledge Base: 47,000+ Q&A pairs
- Symptom Mapping: Comprehensive
- Intent Training: 22,000+ examples
- Expected Accuracy: 85-90% π
π― Quick Decision Guide
Should I download real datasets?
| Scenario | Recommendation |
|---|---|
| Just testing features | β No - use sample data |
| Development/debugging | β No - use sample data |
| Showing to stakeholders | π‘ Maybe - CLINC150 only |
| Deploying to staging | β Yes - partial (CLINC150, Redis) |
| Production deployment | β β Yes - ALL datasets + Redis |
π‘ Summary
Current Warnings:
- β Safe and expected for development
- β System works perfectly with fallbacks
- β No action needed right now
Industry Selection:
- β Per-website setting (not global)
- β All users on website get that industry's responses
- β Different websites can have different industries
Next Steps:
- β Keep developing with sample data
- βοΈ Download real datasets when ready for staging
- βοΈ Setup Redis when deploying to production