
ZaubaCorp Parallel Scraper

A high-performance, asynchronous web scraper designed to extract company data from ZaubaCorp.com at scale. This scraper can handle all 90,769+ pages efficiently using parallel processing and intelligent rate limiting.

🚀 Features

  • Massive Scale: Scrape all 90,769+ pages of company data
  • High Performance: Parallel processing with configurable worker threads
  • Intelligent Rate Limiting: Adaptive delays to respect server limits
  • Robust Error Handling: Retry logic, timeout handling, and graceful failures
  • Multiple Output Formats: JSON and CSV with batch and consolidated outputs
  • Resumable Operations: Continue from where you left off if interrupted
  • Real-time Statistics: Monitor progress and performance metrics
  • Configurable Strategies: Multiple scraping profiles for different use cases
  • User Agent Rotation: Avoid detection with rotating headers

📊 Performance Metrics

Based on testing, the scraper can achieve:

  • Conservative: 5-10 pages/second, ~2.5-5 hours for a full list-page scrape
  • Balanced: 15-25 pages/second, ~1-1.7 hours
  • Aggressive: 30-50 pages/second, ~30-50 minutes
  • Maximum: 50+ pages/second, under 30 minutes

These figures cover list pages only; detail scraping (scrape_details=True) fetches an extra page per company and takes far longer.
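
These estimates are straightforward arithmetic over the ~90,769-page total; a quick sketch for computing the expected duration at any sustained rate:

```python
TOTAL_PAGES = 90_769

def eta_hours(pages_per_second, total_pages=TOTAL_PAGES):
    """Rough wall-clock hours to fetch total_pages at a sustained rate."""
    return total_pages / pages_per_second / 3600

print(f"~{eta_hours(20):.1f} hours")  # ~1.3 hours at a balanced 20 pages/second
```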

🛠 Installation

Prerequisites

  • Python 3.8+
  • 8GB+ RAM recommended for large-scale scraping
  • Stable internet connection (10Mbps+ recommended)

Install Dependencies

pip install -r requirements_parallel.txt

Dependencies Include:

  • aiohttp - Async HTTP client
  • aiofiles - Async file operations
  • beautifulsoup4 - HTML parsing
  • pandas - Data manipulation
  • asyncio - Async programming (ships with the Python standard library; not installed via pip)

🎯 Quick Start

1. Basic Usage

# Quick test (100 pages)
python run_parallel_scraper.py quick --pages 100

# Full scrape (all pages)
python run_parallel_scraper.py full

# Detailed scrape with company pages
python run_parallel_scraper.py detailed --pages 1000

2. Interactive Mode

python run_parallel_scraper.py

3. Programmatic Usage

import asyncio
from zaubacorp_parallel_scraper import ZaubaCorpParallelScraper

# Create scraper
scraper = ZaubaCorpParallelScraper(
    max_workers=20,
    output_dir="my_output"
)

# Run scraping
asyncio.run(scraper.scrape_all_companies(
    start_page=1,
    end_page=1000,
    batch_size=100,
    scrape_details=False
))
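
Under the hood, max_workers bounds how many requests are in flight at once. The standard pattern for this (sketched here with a stubbed fetch, not the scraper's actual code) is an asyncio.Semaphore held around each request:

```python
import asyncio

async def bounded_fetch(sem, page):
    """Fetch one list page while holding a semaphore slot (request stubbed out)."""
    async with sem:
        await asyncio.sleep(0)  # stand-in for the real aiohttp GET
        return f"page-{page}"

async def fetch_pages(start, end, max_workers=20):
    sem = asyncio.Semaphore(max_workers)  # at most max_workers requests in flight
    return await asyncio.gather(
        *(bounded_fetch(sem, p) for p in range(start, end + 1))
    )

pages = asyncio.run(fetch_pages(1, 5, max_workers=2))
print(pages)  # ['page-1', 'page-2', 'page-3', 'page-4', 'page-5']
```

asyncio.gather preserves input order, so results line up with page numbers even though fetches complete out of order.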

⚙️ Configuration

Performance Profiles

The scraper includes 4 performance profiles:

Conservative (Safe for servers)

{
    'max_workers': 5,
    'batch_size': 50,
    'request_delay': (0.5, 1.0),
    'connection_limit': 10
}

Balanced (Recommended)

{
    'max_workers': 15,
    'batch_size': 100,
    'request_delay': (0.2, 0.5),
    'connection_limit': 30
}

Aggressive (High speed)

{
    'max_workers': 25,
    'batch_size': 200,
    'request_delay': (0.1, 0.3),
    'connection_limit': 50
}

Maximum (Use with caution)

{
    'max_workers': 40,
    'batch_size': 300,
    'request_delay': (0.05, 0.2),
    'connection_limit': 80
}

Custom Configuration

from parallel_config import ParallelConfig

# Get optimized config for your system
config = ParallelConfig.get_optimized_config()

# Or create custom config
custom_config = ParallelConfig.get_config(
    'balanced',
    max_workers=20,
    batch_size=150,
    output_dir='custom_output'
)

📁 Data Structure

Company List Data

Each company record contains:

{
    "cin": "U32107KA2000PTC026370",
    "company_name": "ESPY SOLUTIONS PRIVATE LIMITED",
    "status": "Strike Off",
    "paid_up_capital": "0",
    "address": "NO.32/A, 11TH 'A' CROSS, 6THMAIN, 3RD PHASE J P NAGAR BANGALORE -78",
    "company_url": "https://www.zaubacorp.com/ESPY-SOLUTIONS-PRIVATE-LIMITED-U32107KA2000PTC026370",
    "page_number": 90769,
    "scraped_at": "2024-01-15T10:30:00"
}
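
Records like this are parsed out of the listing table with beautifulsoup4. The real ZaubaCorp markup may differ from this illustrative fragment, but the extraction pattern looks roughly like:

```python
from bs4 import BeautifulSoup

# Illustrative markup only -- the real ZaubaCorp table layout may differ.
html = """
<table>
  <tr>
    <td><a href="/ESPY-SOLUTIONS-PRIVATE-LIMITED-U32107KA2000PTC026370">U32107KA2000PTC026370</a></td>
    <td>ESPY SOLUTIONS PRIVATE LIMITED</td>
    <td>Strike Off</td>
  </tr>
</table>
"""

def parse_rows(html):
    """Pull (cin, name, status) out of every table row with enough cells."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 3:
            continue  # header or malformed row
        records.append({
            "cin": cells[0].get_text(strip=True),
            "company_name": cells[1].get_text(strip=True),
            "status": cells[2].get_text(strip=True),
        })
    return records

records = parse_rows(html)
```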

Enhanced Data (with details scraping)

When scrape_details=True, additional fields are extracted:

{
    "registration_number": "026370",
    "authorized_capital": "100000",
    "company_category": "Private Limited Company",
    "class_of_company": "Private",
    "roc": "Bangalore",
    "registration_date": "2000-03-15",
    "email": "contact@company.com",
    "phone": "+91-80-12345678"
}

🔄 Scraping Strategies

1. Quick Sample

Test the scraper with a small number of pages:

python run_parallel_scraper.py quick --pages 100

2. Full Basic Scrape

Scrape all companies list pages (basic info only):

python run_parallel_scraper.py full

3. Detailed Scrape

Include company detail pages (much slower):

python run_parallel_scraper.py detailed --pages 1000

4. Resume Failed Pages

Continue from failed pages:

python run_parallel_scraper.py resume --failed-file failed_pages.json

5. Segmented Scraping

Divide work into segments:

python run_parallel_scraper.py segmented --segments 10

6. Adaptive Scraping

Smart scraping that adjusts based on success rate:

python run_parallel_scraper.py adaptive
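
One plausible shape for the adaptive policy (thresholds here are illustrative assumptions, not the scraper's actual tuning): ramp workers up while the success rate stays high, and back off hard when it drops.

```python
def adjust_workers(current, success_rate, lo=5, hi=40):
    """Scale workers up while the server copes, back off hard when failures
    climb. Thresholds are illustrative, not the scraper's actual tuning."""
    if success_rate >= 0.98:
        return min(hi, current + 5)   # healthy: push a little harder
    if success_rate < 0.90:
        return max(lo, current // 2)  # struggling: halve the pressure
    return current                    # acceptable: hold steady

print(adjust_workers(20, 0.99))  # 25
print(adjust_workers(20, 0.85))  # 10
```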

📤 Output Files

The scraper generates several output files:

Batch Files

  • companies_batch_1.json - Companies from batch 1
  • companies_batch_1.csv - Same data in CSV format
  • companies_batch_2.json - Companies from batch 2
  • etc.

Consolidated Files

  • all_companies.json - All companies in JSON format
  • all_companies.csv - All companies in CSV format

Metadata Files

  • scraping_statistics.json - Performance statistics
  • failed_pages.json - List of pages that failed to scrape
  • parallel_scraper.log - Detailed log file

Statistics Example

{
    "total_pages": 90769,
    "pages_processed": 89432,
    "companies_found": 2847392,
    "companies_detailed": 0,
    "failed_pages": 1337,
    "start_time": "2024-01-15T08:00:00",
    "end_time": "2024-01-15T14:30:00"
}
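
Since pages_processed plus failed_pages adds up to total_pages, the headline success rate can be recomputed straight from this file:

```python
import json

stats = json.loads("""{
    "total_pages": 90769,
    "pages_processed": 89432,
    "failed_pages": 1337
}""")

assert stats["pages_processed"] + stats["failed_pages"] == stats["total_pages"]
success_rate = 100 * stats["pages_processed"] / stats["total_pages"]
print(f"Success rate: {success_rate:.1f}%")  # Success rate: 98.5%
```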

🚨 Rate Limiting & Best Practices

Built-in Protections

  • Random delays between requests (0.1-2.0 seconds)
  • User agent rotation (8 different browsers)
  • Connection pooling and limits
  • Automatic retry with exponential backoff
  • Graceful handling of HTTP 429 (rate limited)
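
The retry-with-exponential-backoff protection can be sketched as follows; `fetch` is a stand-in for the scraper's aiohttp request coroutine, not its actual API:

```python
import asyncio
import random

async def fetch_with_retry(fetch, url, retries=4, base_delay=0.5):
    """Retry fetch(url) with exponential backoff plus jitter.

    Anything fetch raises counts as a transient failure until retries
    run out, at which point the last error is re-raised.
    """
    for attempt in range(retries):
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            # 0.5s, 1s, 2s, ... plus up to 50% random jitter
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay * (1 + 0.5 * random.random()))
```

The wait roughly doubles on each failure, which is what keeps a burst of HTTP 429 responses from snowballing into a block.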

Recommendations

  1. Start Conservative: Begin with 'conservative' profile
  2. Monitor Performance: Watch success rates and adjust accordingly
  3. Respect the Server: Don't overwhelm zaubacorp.com
  4. Use Appropriate Delays: Longer delays for detail page scraping
  5. Monitor Logs: Check logs for rate limiting warnings

Error Handling

The scraper handles various error conditions:

  • Network timeouts
  • HTTP errors (404, 500, etc.)
  • Rate limiting (429)
  • Connection refused
  • Invalid HTML/parsing errors

📊 Monitoring Progress

Real-time Statistics

The scraper provides real-time updates:

Processing page 1000/90769
Found 25 companies on this page
Batch 10: Found 2,500 companies
Success rate: 98.5%
Speed: 23.4 pages/second

Log File Analysis

Check the log file for detailed information:

tail -f zaubacorp_parallel_data/parallel_scraper.log

Statistics Dashboard

View final statistics:

ZAUBACORP PARALLEL SCRAPING COMPLETED
================================================================================
Total pages processed: 89,432
Total companies found: 2,847,392
Companies with details: 0
Failed pages: 1,337
Success rate: 98.5%
Duration: 6:30:00
Average speed: 3.8 pages/second
Companies per minute: 7,301
Output directory: zaubacorp_parallel_data
================================================================================

🔧 Troubleshooting

Common Issues

1. High Failure Rate

Symptoms: Many failed pages, low success rate

Solutions:

  • Reduce max_workers and batch_size
  • Increase request_delay
  • Use 'conservative' profile
  • Check internet connection

2. Memory Issues

Symptoms: Out of memory errors, slow performance

Solutions:

  • Reduce batch_size
  • Enable periodic saving more frequently
  • Close other applications
  • Use 64-bit Python

3. Rate Limiting

Symptoms: HTTP 429 errors, temporary blocks

Solutions:

  • Increase delays between requests
  • Reduce number of workers
  • Use different IP address/proxy
  • Wait before retrying

4. Slow Performance

Symptoms: Very low pages/second rate

Solutions:

  • Increase max_workers (if success rate is high)
  • Reduce request_delay
  • Check network speed
  • Use SSD storage for output

Debug Mode

Enable verbose logging for debugging:

import logging
logging.basicConfig(level=logging.DEBUG)

🔄 Resuming Interrupted Scrapes

If scraping is interrupted, you can resume in several ways:

1. Resume from Failed Pages

python run_parallel_scraper.py resume --failed-file zaubacorp_parallel_data/failed_pages.json

2. Continue from Last Page

Check the last successfully processed page in logs and restart:

# If last successful page was 5000
python zaubacorp_parallel_scraper.py  # Modify start_page in script

3. Merge Results

Combine multiple scraping sessions:

import pandas as pd
import glob

# Read all CSV files
csv_files = glob.glob("*/all_companies.csv")
combined_df = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)

# Remove duplicates
combined_df = combined_df.drop_duplicates(subset=['cin'])

# Save combined results
combined_df.to_csv('final_combined_companies.csv', index=False)

🎛 Advanced Usage

Custom Scraper Class

from zaubacorp_parallel_scraper import ZaubaCorpParallelScraper

class CustomScraper(ZaubaCorpParallelScraper):
    def parse_companies_list_page(self, html, page_num):
        # Custom parsing logic
        companies = super().parse_companies_list_page(html, page_num)
        
        # Add custom filtering
        filtered_companies = []
        for company in companies:
            if company.get('status') == 'Active':
                filtered_companies.append(company)
        
        return filtered_companies

# Use custom scraper
scraper = CustomScraper(max_workers=10)

Distributed Scraping

Run multiple instances on different machines:

Machine 1: Pages 1-30,000

await scraper.scrape_all_companies(start_page=1, end_page=30000)

Machine 2: Pages 30,001-60,000

# Modify start_page and end_page in script
await scraper.scrape_all_companies(start_page=30001, end_page=60000)

Machine 3: Pages 60,001-90,769

await scraper.scrape_all_companies(start_page=60001, end_page=90769)
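
Splitting page ranges by hand is error-prone; a small convenience helper (not part of the scraper) that computes contiguous, non-overlapping segments:

```python
def make_segments(total_pages, machines):
    """Split pages 1..total_pages into contiguous (start_page, end_page)
    ranges, one per machine, spreading any remainder over the first few."""
    base, extra = divmod(total_pages, machines)
    segments, start = [], 1
    for i in range(machines):
        end = start + base - 1 + (1 if i < extra else 0)
        segments.append((start, end))
        start = end + 1
    return segments

print(make_segments(90_769, 3))
# [(1, 30257), (30258, 60513), (60514, 90769)]
```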

📈 Performance Optimization

System Optimization

  1. CPU: More cores = more workers
  2. RAM: 16GB+ recommended for large batches
  3. Storage: SSD for faster I/O
  4. Network: Stable high-speed connection

Python Optimization

# Use PyPy for better performance
pypy3 -m pip install -r requirements_parallel.txt
pypy3 run_parallel_scraper.py full

# Or run CPython with assertions stripped (-O; only a minor gain)
python -O run_parallel_scraper.py full

Configuration Tuning

# For high-end systems
config = {
    'max_workers': 50,
    'batch_size': 500,
    'connection_limit': 100,
    'request_delay': (0.01, 0.1)
}

# For low-end systems
config = {
    'max_workers': 3,
    'batch_size': 25,
    'connection_limit': 5,
    'request_delay': (1.0, 2.0)
}

Important Considerations

  1. Respect robots.txt: Check ZaubaCorp's robots.txt file
  2. Rate Limiting: Built-in delays respect server capacity
  3. Terms of Service: Ensure compliance with ZaubaCorp's ToS
  4. Data Usage: Use scraped data responsibly
  5. Attribution: Consider providing attribution when using data

Best Practices

  • Start with small tests before full scraping
  • Use conservative settings initially
  • Monitor server response and adjust accordingly
  • Don't run multiple instances simultaneously
  • Respect any temporary blocks or rate limits

🆘 Support & Contributing

Getting Help

  1. Check the troubleshooting section
  2. Review log files for error details
  3. Test with conservative settings first
  4. Ensure all dependencies are installed

Contributing

Contributions are welcome! Areas for improvement:

  • Better error handling
  • More efficient parsing
  • Additional output formats
  • Performance optimizations
  • Better documentation

Feature Requests

  • Database integration
  • GUI interface
  • Cloud deployment scripts
  • Real-time monitoring dashboard
  • Integration with data analysis tools

📊 Expected Results

Full Scrape Results

A complete scrape of ZaubaCorp should yield:

  • ~90,769 pages processed
  • ~2.8-3.2 million companies found
  • File sizes: 500MB-1GB (CSV), 800MB-1.5GB (JSON)
  • Duration: 3-24 hours (depending on configuration)

Data Quality

  • Completeness: 95-99% of available data
  • Accuracy: High (direct from source)
  • Freshness: As current as ZaubaCorp's database
  • Duplicates: Minimal (handled by CIN uniqueness)

🎉 Conclusion

This parallel scraper provides a robust, scalable solution for extracting company data from ZaubaCorp. With proper configuration and responsible usage, it can efficiently process the entire database while respecting server limits and providing high-quality data output.

Remember to always scrape responsibly and in accordance with applicable laws and terms of service!