# ZaubaCorp Parallel Scraper
A high-performance, asynchronous web scraper designed to extract company data from ZaubaCorp.com at scale. This scraper can handle all 90,769+ pages efficiently using parallel processing and intelligent rate limiting.
## 🚀 Features
- **Massive Scale**: Scrape all 90,769+ pages of company data
- **High Performance**: Parallel processing with configurable worker threads
- **Intelligent Rate Limiting**: Adaptive delays to respect server limits
- **Robust Error Handling**: Retry logic, timeout handling, and graceful failures
- **Multiple Output Formats**: JSON and CSV with batch and consolidated outputs
- **Resumable Operations**: Continue from where you left off if interrupted
- **Real-time Statistics**: Monitor progress and performance metrics
- **Configurable Strategies**: Multiple scraping profiles for different use cases
- **User Agent Rotation**: Avoid detection with rotating headers
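The user-agent rotation mentioned above can be sketched as follows. The pool below is illustrative only; the scraper ships its own list of 8 browser headers:

```python
import random

# Illustrative pool; the real scraper maintains its own list of 8 browser headers
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/124.0",
]

def next_headers(pool=USER_AGENTS):
    """Pick a random user agent for the next request."""
    return {"User-Agent": random.choice(pool)}

headers = next_headers()
```

Rotating the `User-Agent` on every request makes the traffic look less like a single automated client, though it is no substitute for respectful request rates.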
## 📊 Performance Metrics
Based on testing, the scraper can achieve:
- **Conservative**: ~0.5-1 pages/second, ~25-50 hours for full scrape
- **Balanced**: ~1.6-2.5 pages/second, ~10-16 hours for full scrape
- **Aggressive**: ~3-5 pages/second, ~5-8 hours for full scrape
- **Maximum**: ~5-8 pages/second, ~3-5 hours for full scrape
## 🛠 Installation
### Prerequisites
- Python 3.8+
- 8GB+ RAM recommended for large-scale scraping
- Stable internet connection (10Mbps+ recommended)
### Install Dependencies
```bash
pip install -r requirements_parallel.txt
```
### Dependencies Include:
- `aiohttp` - Async HTTP client
- `aiofiles` - Async file operations
- `beautifulsoup4` - HTML parsing
- `pandas` - Data manipulation
- `asyncio` - Async programming (part of the Python standard library; not installed via pip)
## 🎯 Quick Start
### 1. Basic Usage
```bash
# Quick test (100 pages)
python run_parallel_scraper.py quick --pages 100

# Full scrape (all pages)
python run_parallel_scraper.py full

# Detailed scrape with company pages
python run_parallel_scraper.py detailed --pages 1000
```
### 2. Interactive Mode
```bash
python run_parallel_scraper.py
```
### 3. Programmatic Usage
```python
import asyncio
from zaubacorp_parallel_scraper import ZaubaCorpParallelScraper

# Create scraper
scraper = ZaubaCorpParallelScraper(
    max_workers=20,
    output_dir="my_output"
)

# Run scraping
asyncio.run(scraper.scrape_all_companies(
    start_page=1,
    end_page=1000,
    batch_size=100,
    scrape_details=False
))
```
## ⚙️ Configuration
### Performance Profiles
The scraper includes 4 performance profiles:
#### Conservative (Safe for servers)
```python
{
    'max_workers': 5,
    'batch_size': 50,
    'request_delay': (0.5, 1.0),
    'connection_limit': 10
}
```
#### Balanced (Recommended)
```python
{
    'max_workers': 15,
    'batch_size': 100,
    'request_delay': (0.2, 0.5),
    'connection_limit': 30
}
```
#### Aggressive (High speed)
```python
{
    'max_workers': 25,
    'batch_size': 200,
    'request_delay': (0.1, 0.3),
    'connection_limit': 50
}
```
#### Maximum (Use with caution)
```python
{
    'max_workers': 40,
    'batch_size': 300,
    'request_delay': (0.05, 0.2),
    'connection_limit': 80
}
```
### Custom Configuration
```python
from parallel_config import ParallelConfig

# Get optimized config for your system
config = ParallelConfig.get_optimized_config()

# Or create custom config
custom_config = ParallelConfig.get_config(
    'balanced',
    max_workers=20,
    batch_size=150,
    output_dir='custom_output'
)
```
## 📁 Data Structure
### Company List Data
Each company record contains:
```json
{
    "cin": "U32107KA2000PTC026370",
    "company_name": "ESPY SOLUTIONS PRIVATE LIMITED",
    "status": "Strike Off",
    "paid_up_capital": "0",
    "address": "NO.32/A, 11TH 'A' CROSS, 6THMAIN, 3RD PHASE J P NAGAR BANGALORE -78",
    "company_url": "https://www.zaubacorp.com/ESPY-SOLUTIONS-PRIVATE-LIMITED-U32107KA2000PTC026370",
    "page_number": 90769,
    "scraped_at": "2024-01-15T10:30:00"
}
```
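For downstream processing, the record above maps naturally onto a small dataclass. This is a hypothetical convenience type, not part of the scraper's API:

```python
from dataclasses import dataclass

@dataclass
class CompanyRecord:
    # Field names mirror the JSON keys shown above
    cin: str
    company_name: str
    status: str
    paid_up_capital: str
    address: str
    company_url: str
    page_number: int
    scraped_at: str

record = CompanyRecord(
    cin="U32107KA2000PTC026370",
    company_name="ESPY SOLUTIONS PRIVATE LIMITED",
    status="Strike Off",
    paid_up_capital="0",
    address="NO.32/A, 11TH 'A' CROSS, 6THMAIN, 3RD PHASE J P NAGAR BANGALORE -78",
    company_url="https://www.zaubacorp.com/ESPY-SOLUTIONS-PRIVATE-LIMITED-U32107KA2000PTC026370",
    page_number=90769,
    scraped_at="2024-01-15T10:30:00",
)
```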
### Enhanced Data (with details scraping)
When `scrape_details=True`, additional fields are extracted:
```json
{
    "registration_number": "026370",
    "authorized_capital": "100000",
    "company_category": "Private Limited Company",
    "class_of_company": "Private",
    "roc": "Bangalore",
    "registration_date": "2000-03-15",
    "email": "contact@company.com",
    "phone": "+91-80-12345678"
}
```
## 🔄 Scraping Strategies
### 1. Quick Sample
Test the scraper with a small number of pages:
```bash
python run_parallel_scraper.py quick --pages 100
```
### 2. Full Basic Scrape
Scrape all company-list pages (basic info only):
```bash
python run_parallel_scraper.py full
```
### 3. Detailed Scrape
Include company detail pages (much slower):
```bash
python run_parallel_scraper.py detailed --pages 1000
```
### 4. Resume Failed Pages
Continue from failed pages:
```bash
python run_parallel_scraper.py resume --failed-file failed_pages.json
```
### 5. Segmented Scraping
Divide work into segments:
```bash
python run_parallel_scraper.py segmented --segments 10
```
### 6. Adaptive Scraping
Smart scraping that adjusts based on success rate:
```bash
python run_parallel_scraper.py adaptive
```
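One way the adaptive idea can work is to scale the request delay by the observed success rate. The helper below is a hypothetical sketch; the actual strategy lives in the script:

```python
def adjust_delay(current_delay, success_rate,
                 min_delay=0.1, max_delay=2.0):
    """Back off when failures rise, carefully speed up when healthy."""
    if success_rate < 0.90:          # too many failures: double the delay
        current_delay = min(max_delay, current_delay * 2)
    elif success_rate > 0.98:        # healthy: shave 10% off the delay
        current_delay = max(min_delay, current_delay * 0.9)
    return current_delay

# Example: a dip in success rate doubles the delay
print(adjust_delay(0.5, success_rate=0.85))  # → 1.0
```

Re-evaluating after each batch lets the scraper converge on the fastest rate the server will tolerate.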
## 📤 Output Files
The scraper generates several output files:
### Batch Files
- `companies_batch_1.json` - Companies from batch 1
- `companies_batch_1.csv` - Same data in CSV format
- `companies_batch_2.json` - Companies from batch 2
- etc.
### Consolidated Files
- `all_companies.json` - All companies in JSON format
- `all_companies.csv` - All companies in CSV format
### Metadata Files
- `scraping_statistics.json` - Performance statistics
- `failed_pages.json` - List of pages that failed to scrape
- `parallel_scraper.log` - Detailed log file
### Statistics Example
```json
{
    "total_pages": 90769,
    "pages_processed": 89432,
    "companies_found": 2847392,
    "companies_detailed": 0,
    "failed_pages": 1337,
    "start_time": "2024-01-15T08:00:00",
    "end_time": "2024-01-15T14:30:00"
}
```
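The headline numbers (success rate, average speed) follow directly from this file. A small post-processing sketch, not part of the scraper itself:

```python
from datetime import datetime

def summarize(stats):
    """Derive success rate and average speed from the statistics dict."""
    duration = (datetime.fromisoformat(stats["end_time"])
                - datetime.fromisoformat(stats["start_time"]))
    processed = stats["pages_processed"]
    return {
        "success_rate": 100 * processed / stats["total_pages"],
        "pages_per_second": processed / duration.total_seconds(),
    }

# Values from the example above; in practice load them with
# json.load(open("scraping_statistics.json"))
stats = {
    "total_pages": 90769,
    "pages_processed": 89432,
    "start_time": "2024-01-15T08:00:00",
    "end_time": "2024-01-15T14:30:00",
}
summary = summarize(stats)  # ~98.5% success, ~3.8 pages/second
```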
## 🚨 Rate Limiting & Best Practices
### Built-in Protections
- Random delays between requests (0.1-2.0 seconds)
- User agent rotation (8 different browsers)
- Connection pooling and limits
- Automatic retry with exponential backoff
- Graceful handling of HTTP 429 (rate limited)
### Recommendations
1. **Start Conservative**: Begin with 'conservative' profile
2. **Monitor Performance**: Watch success rates and adjust accordingly
3. **Respect the Server**: Don't overwhelm zaubacorp.com
4. **Use Appropriate Delays**: Longer delays for detail page scraping
5. **Monitor Logs**: Check logs for rate limiting warnings
### Error Handling
The scraper handles various error conditions:
- Network timeouts
- HTTP errors (404, 500, etc.)
- Rate limiting (429)
- Connection refused
- Invalid HTML/parsing errors
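The retry-with-exponential-backoff behaviour described above can be sketched like this. The delay schedule and status handling are illustrative; the real logic is in the scraper:

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff with jitter: ~0.5s, 1s, 2s, ... capped at 30s."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.8, 1.2)

def should_retry(status, attempt, max_retries=3):
    """Retry rate limiting (429) and server errors (5xx); give up on 4xx like 404."""
    if attempt >= max_retries:
        return False
    return status == 429 or 500 <= status < 600
```

Jitter spreads retries out so that parallel workers do not all hammer the server again at the same instant.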
## 📊 Monitoring Progress
### Real-time Statistics
The scraper provides real-time updates:
```
Processing page 1000/90769
Found 25 companies on this page
Batch 10: Found 2,500 companies
Success rate: 98.5%
Speed: 3.4 pages/second
```
### Log File Analysis
Check the log file for detailed information:
```bash
tail -f zaubacorp_parallel_data/parallel_scraper.log
```
### Statistics Dashboard
View final statistics:
```
ZAUBACORP PARALLEL SCRAPING COMPLETED
================================================================================
Total pages processed: 89,432
Total companies found: 2,847,392
Companies with details: 0
Failed pages: 1,337
Success rate: 98.5%
Duration: 6:30:00
Average speed: 3.8 pages/second
Companies per minute: 7,301
Output directory: zaubacorp_parallel_data
================================================================================
```
## 🔧 Troubleshooting
### Common Issues
#### 1. High Failure Rate
**Symptoms**: Many failed pages, low success rate
**Solutions**:
- Reduce `max_workers` and `batch_size`
- Increase `request_delay`
- Use 'conservative' profile
- Check internet connection
#### 2. Memory Issues
**Symptoms**: Out of memory errors, slow performance
**Solutions**:
- Reduce `batch_size`
- Save partial results to disk more frequently
- Close other applications
- Use 64-bit Python
#### 3. Rate Limiting
**Symptoms**: HTTP 429 errors, temporary blocks
**Solutions**:
- Increase delays between requests
- Reduce number of workers
- Use different IP address/proxy
- Wait before retrying
#### 4. Slow Performance
**Symptoms**: Very low pages/second rate
**Solutions**:
- Increase `max_workers` (if success rate is high)
- Reduce `request_delay`
- Check network speed
- Use SSD storage for output
### Debug Mode
Enable verbose logging for debugging:
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
## 🔄 Resuming Interrupted Scrapes
If scraping is interrupted, you can resume in several ways:
### 1. Resume from Failed Pages
```bash
python run_parallel_scraper.py resume --failed-file zaubacorp_parallel_data/failed_pages.json
```
### 2. Continue from Last Page
Check the last successfully processed page in logs and restart:
```bash
# If the last successful page was 5000, set start_page=5001 in the script, then:
python zaubacorp_parallel_scraper.py
```
### 3. Merge Results
Combine multiple scraping sessions:
```python
import pandas as pd
import glob

# Read all CSV files
csv_files = glob.glob("*/all_companies.csv")
combined_df = pd.concat([pd.read_csv(f) for f in csv_files])

# Remove duplicates (CIN is unique per company)
combined_df = combined_df.drop_duplicates(subset=['cin'])

# Save combined results
combined_df.to_csv('final_combined_companies.csv', index=False)
```
## 🎛 Advanced Usage
### Custom Scraper Class
```python
from zaubacorp_parallel_scraper import ZaubaCorpParallelScraper

class CustomScraper(ZaubaCorpParallelScraper):
    def parse_companies_list_page(self, html, page_num):
        # Custom parsing logic
        companies = super().parse_companies_list_page(html, page_num)
        # Add custom filtering
        filtered_companies = []
        for company in companies:
            if company.get('status') == 'Active':
                filtered_companies.append(company)
        return filtered_companies

# Use custom scraper
scraper = CustomScraper(max_workers=10)
```
### Distributed Scraping
Run multiple instances on different machines:
**Machine 1**: Pages 1-30,000
```bash
python run_parallel_scraper.py segmented --segments 3
```
**Machine 2**: Pages 30,001-60,000
```python
# Modify start_page and end_page in script
await scraper.scrape_all_companies(start_page=30001, end_page=60000)
```
**Machine 3**: Pages 60,001-90,769
```python
await scraper.scrape_all_companies(start_page=60001, end_page=90769)
```
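The page ranges above can be computed with a small helper so each machine gets a contiguous, non-overlapping slice. A hypothetical utility, assuming 90,769 total pages:

```python
def split_pages(total_pages, segments):
    """Split 1..total_pages into `segments` contiguous (start, end) ranges."""
    per_segment, remainder = divmod(total_pages, segments)
    ranges, start = [], 1
    for i in range(segments):
        # Spread any remainder across the first `remainder` segments
        end = start + per_segment - 1 + (1 if i < remainder else 0)
        ranges.append((start, end))
        start = end + 1
    return ranges

print(split_pages(90769, 3))
# → [(1, 30257), (30258, 60513), (60514, 90769)]
```

Each machine then passes its `(start, end)` pair as `start_page`/`end_page` to `scrape_all_companies`.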
## 📈 Performance Optimization
### System Optimization
1. **CPU**: More cores = more workers
2. **RAM**: 16GB+ recommended for large batches
3. **Storage**: SSD for faster I/O
4. **Network**: Stable high-speed connection
### Python Optimization
```bash
# Use PyPy for a potential speed-up (verify that all dependencies support it)
pypy3 -m pip install -r requirements_parallel.txt
pypy3 run_parallel_scraper.py full

# Or run CPython with assertions disabled (minor effect at best)
python -O run_parallel_scraper.py full
```
### Configuration Tuning
```python
# For high-end systems
config = {
    'max_workers': 50,
    'batch_size': 500,
    'connection_limit': 100,
    'request_delay': (0.01, 0.1)
}

# For low-end systems
config = {
    'max_workers': 3,
    'batch_size': 25,
    'connection_limit': 5,
    'request_delay': (1.0, 2.0)
}
```
## 🔒 Legal & Ethical Guidelines
### Important Considerations
1. **Respect robots.txt**: Check ZaubaCorp's robots.txt file
2. **Rate Limiting**: Built-in delays respect server capacity
3. **Terms of Service**: Ensure compliance with ZaubaCorp's ToS
4. **Data Usage**: Use scraped data responsibly
5. **Attribution**: Consider providing attribution when using data
### Best Practices
- Start with small tests before full scraping
- Use conservative settings initially
- Monitor server response and adjust accordingly
- Don't run multiple instances simultaneously
- Respect any temporary blocks or rate limits
## 🆘 Support & Contributing
### Getting Help
1. Check the troubleshooting section
2. Review log files for error details
3. Test with conservative settings first
4. Ensure all dependencies are installed
### Contributing
Contributions are welcome! Areas for improvement:
- Better error handling
- More efficient parsing
- Additional output formats
- Performance optimizations
- Better documentation
### Feature Requests
- Database integration
- GUI interface
- Cloud deployment scripts
- Real-time monitoring dashboard
- Integration with data analysis tools
## 📊 Expected Results
### Full Scrape Results
A complete scrape of ZaubaCorp should yield:
- **~90,769 pages** processed
- **~2.8-3.2 million companies** found
- **File sizes**: 500MB-1GB (CSV), 800MB-1.5GB (JSON)
- **Duration**: 3-24 hours (depending on configuration)
### Data Quality
- **Completeness**: 95-99% of available data
- **Accuracy**: High (direct from source)
- **Freshness**: As current as ZaubaCorp's database
- **Duplicates**: Minimal (handled by CIN uniqueness)
## 🎉 Conclusion
This parallel scraper provides a robust, scalable solution for extracting company data from ZaubaCorp. With proper configuration and responsible usage, it can efficiently process the entire database while respecting server limits and providing high-quality data output.
Remember to always scrape responsibly and in accordance with applicable laws and terms of service!