# ZaubaCorp Parallel Scraper

A high-performance, asynchronous web scraper designed to extract company data from ZaubaCorp.com at scale. It can work through all 90,769+ pages efficiently using parallel processing and intelligent rate limiting.

## 🚀 Features

- **Massive Scale**: Scrape all 90,769+ pages of company data
- **High Performance**: Parallel processing with a configurable number of workers
- **Intelligent Rate Limiting**: Adaptive delays to respect server limits
- **Robust Error Handling**: Retry logic, timeout handling, and graceful failures
- **Multiple Output Formats**: JSON and CSV, with both batch and consolidated outputs
- **Resumable Operations**: Continue from where you left off if interrupted
- **Real-time Statistics**: Monitor progress and performance metrics
- **Configurable Strategies**: Multiple scraping profiles for different use cases
- **User Agent Rotation**: Rotating request headers to reduce the chance of blocking
## 📊 Performance Metrics

Based on testing, the scraper can achieve roughly:

- **Conservative**: ~0.5-1 pages/second, ~25-50 hours for a full scrape
- **Balanced**: ~1.5-2.5 pages/second, ~10-16 hours for a full scrape
- **Aggressive**: ~3-5 pages/second, ~5-8 hours for a full scrape
- **Maximum**: 5+ pages/second, ~3-5 hours for a full scrape

## 🛠 Installation

### Prerequisites

- Python 3.8+
- 8GB+ RAM recommended for large-scale scraping
- Stable internet connection (10 Mbps+ recommended)

### Install Dependencies

```bash
pip install -r requirements_parallel.txt
```

### Dependencies

- `aiohttp` - async HTTP client
- `aiofiles` - async file operations
- `beautifulsoup4` - HTML parsing
- `pandas` - data manipulation
- `asyncio` - async programming (part of the Python standard library; no installation required)
## 🎯 Quick Start

### 1. Basic Usage

```bash
# Quick test (100 pages)
python run_parallel_scraper.py quick --pages 100

# Full scrape (all pages)
python run_parallel_scraper.py full

# Detailed scrape with company pages
python run_parallel_scraper.py detailed --pages 1000
```

### 2. Interactive Mode

```bash
python run_parallel_scraper.py
```

### 3. Programmatic Usage

```python
import asyncio

from zaubacorp_parallel_scraper import ZaubaCorpParallelScraper

# Create the scraper
scraper = ZaubaCorpParallelScraper(
    max_workers=20,
    output_dir="my_output"
)

# Run the scrape
asyncio.run(scraper.scrape_all_companies(
    start_page=1,
    end_page=1000,
    batch_size=100,
    scrape_details=False
))
```
## ⚙️ Configuration

### Performance Profiles

The scraper ships with four performance profiles:

#### Conservative (Safe for servers)

```python
{
    'max_workers': 5,
    'batch_size': 50,
    'request_delay': (0.5, 1.0),
    'connection_limit': 10
}
```

#### Balanced (Recommended)

```python
{
    'max_workers': 15,
    'batch_size': 100,
    'request_delay': (0.2, 0.5),
    'connection_limit': 30
}
```

#### Aggressive (High speed)

```python
{
    'max_workers': 25,
    'batch_size': 200,
    'request_delay': (0.1, 0.3),
    'connection_limit': 50
}
```

#### Maximum (Use with caution)

```python
{
    'max_workers': 40,
    'batch_size': 300,
    'request_delay': (0.05, 0.2),
    'connection_limit': 80
}
```

### Custom Configuration

```python
from parallel_config import ParallelConfig

# Get a config optimized for your system
config = ParallelConfig.get_optimized_config()

# Or create a custom config
custom_config = ParallelConfig.get_config(
    'balanced',
    max_workers=20,
    batch_size=150,
    output_dir='custom_output'
)
```
## 📁 Data Structure

### Company List Data

Each company record contains:

```json
{
  "cin": "U32107KA2000PTC026370",
  "company_name": "ESPY SOLUTIONS PRIVATE LIMITED",
  "status": "Strike Off",
  "paid_up_capital": "0",
  "address": "NO.32/A, 11TH 'A' CROSS, 6THMAIN, 3RD PHASE J P NAGAR BANGALORE -78",
  "company_url": "https://www.zaubacorp.com/ESPY-SOLUTIONS-PRIVATE-LIMITED-U32107KA2000PTC026370",
  "page_number": 90769,
  "scraped_at": "2024-01-15T10:30:00"
}
```

### Enhanced Data (with details scraping)

When `scrape_details=True`, additional fields are extracted:

```json
{
  "registration_number": "026370",
  "authorized_capital": "100000",
  "company_category": "Private Limited Company",
  "class_of_company": "Private",
  "roc": "Bangalore",
  "registration_date": "2000-03-15",
  "email": "contact@company.com",
  "phone": "+91-80-12345678"
}
```
## 🔄 Scraping Strategies

### 1. Quick Sample

Test the scraper with a small number of pages:

```bash
python run_parallel_scraper.py quick --pages 100
```

### 2. Full Basic Scrape

Scrape all company list pages (basic info only):

```bash
python run_parallel_scraper.py full
```

### 3. Detailed Scrape

Include company detail pages (much slower):

```bash
python run_parallel_scraper.py detailed --pages 1000
```

### 4. Resume Failed Pages

Continue from previously failed pages:

```bash
python run_parallel_scraper.py resume --failed-file failed_pages.json
```

### 5. Segmented Scraping

Divide the work into segments:

```bash
python run_parallel_scraper.py segmented --segments 10
```
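The segment boundaries themselves are simple arithmetic: divide the page range into contiguous, non-overlapping chunks. A minimal sketch (the helper name is ours, not part of the scraper):

```python
def segment_ranges(total_pages, segments):
    """Split pages 1..total_pages into `segments` contiguous (start, end) ranges."""
    base, extra = divmod(total_pages, segments)
    ranges, start = [], 1
    for i in range(segments):
        # The first `extra` segments absorb the remainder, one page each
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# 90,769 pages across 10 segments
for start, end in segment_ranges(90769, 10):
    print(f"pages {start}-{end}")
```

Every page number appears in exactly one segment, so segments can safely be run by independent processes.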

### 6. Adaptive Scraping

Smart scraping that adjusts its pace based on the observed success rate:

```bash
python run_parallel_scraper.py adaptive
```
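The core of any adaptive strategy is a feedback loop: shrink concurrency and grow delays when the success rate drops, and cautiously speed back up when it recovers. A minimal sketch of that logic (function name and thresholds are illustrative, not the scraper's actual internals):

```python
def adapt(workers, delay, success_rate, min_workers=3, max_workers=40):
    """Adjust worker count and base per-request delay from the last batch's success rate."""
    if success_rate < 0.90:
        # Server is struggling: halve concurrency, double the delay
        workers = max(min_workers, workers // 2)
        delay = min(2.0, delay * 2)
    elif success_rate > 0.98:
        # Healthy: speed up gently
        workers = min(max_workers, workers + 2)
        delay = max(0.05, delay * 0.9)
    return workers, delay

# After a bad batch: adapt(20, 0.2, 0.75) -> (10, 0.4)
```

Backing off aggressively but speeding up gently keeps the loop stable instead of oscillating.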

## 📤 Output Files

The scraper generates several output files:

### Batch Files

- `companies_batch_1.json` - Companies from batch 1
- `companies_batch_1.csv` - Same data in CSV format
- `companies_batch_2.json` - Companies from batch 2
- etc.

### Consolidated Files

- `all_companies.json` - All companies in JSON format
- `all_companies.csv` - All companies in CSV format

### Metadata Files

- `scraping_statistics.json` - Performance statistics
- `failed_pages.json` - List of pages that failed to scrape
- `parallel_scraper.log` - Detailed log file

### Statistics Example

```json
{
  "total_pages": 90769,
  "pages_processed": 89432,
  "companies_found": 2847392,
  "companies_detailed": 0,
  "failed_pages": 1337,
  "start_time": "2024-01-15T08:00:00",
  "end_time": "2024-01-15T14:30:00"
}
```
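Because the statistics file is plain JSON, post-run metrics such as the success rate can be derived from it directly. A small sketch (field names follow the example above; the function name is ours):

```python
import json

def summarize(stats):
    """Derive headline metrics from a scraping_statistics.json payload."""
    done = stats["pages_processed"]
    return {
        "success_rate": round(100.0 * done / stats["total_pages"], 1),
        "avg_companies_per_page": round(stats["companies_found"] / done, 1),
    }

# Usage: summarize(json.load(open("zaubacorp_parallel_data/scraping_statistics.json")))
example = {"total_pages": 90769, "pages_processed": 89432, "companies_found": 2847392}
print(summarize(example))  # success_rate 98.5, about 31.8 companies per page
```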

## 🚨 Rate Limiting & Best Practices

### Built-in Protections

- Random delays between requests (0.1-2.0 seconds)
- User agent rotation (8 different browser signatures)
- Connection pooling and limits
- Automatic retry with exponential backoff
- Graceful handling of HTTP 429 (rate limited)
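Exponential backoff means the wait doubles after each failed attempt, with a little random jitter so parallel workers don't all retry in lockstep. A minimal sketch of the schedule (parameter values are illustrative):

```python
import random

def backoff_delays(retries=4, base=0.5, cap=30.0, jitter=0.1):
    """Yield one delay per retry: base * 2**attempt, capped, plus random jitter."""
    for attempt in range(retries):
        delay = min(cap, base * (2 ** attempt))
        yield delay + random.uniform(0, jitter)

# Roughly 0.5s, 1s, 2s, 4s between successive retries
print([round(d, 1) for d in backoff_delays()])
```

The cap prevents pathological waits after many failures, and the jitter de-synchronises concurrent workers.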

### Recommendations

1. **Start Conservative**: Begin with the 'conservative' profile
2. **Monitor Performance**: Watch success rates and adjust accordingly
3. **Respect the Server**: Don't overwhelm zaubacorp.com
4. **Use Appropriate Delays**: Use longer delays when scraping detail pages
5. **Monitor Logs**: Check logs for rate-limiting warnings

### Error Handling

The scraper handles various error conditions:

- Network timeouts
- HTTP errors (404, 500, etc.)
- Rate limiting (429)
- Connection refused
- Invalid HTML / parsing errors
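Each of these conditions maps to a different recovery action: transient failures are retried, rate limits trigger a pause, and hard failures are recorded for a later resume run. A sketch of that dispatch (the action names are ours, not the scraper's):

```python
import asyncio

def classify_failure(status=None, exc=None):
    """Map an HTTP status or raised exception to a recovery action."""
    if exc is not None:
        if isinstance(exc, asyncio.TimeoutError):
            return "retry"             # network timeout: retry with backoff
        if isinstance(exc, ConnectionRefusedError):
            return "pause-then-retry"  # server refusing connections: wait longer
        return "record-failure"        # parsing/other errors: log and move on
    if status == 429:
        return "pause-then-retry"      # rate limited: honour the server
    if status in (500, 502, 503):
        return "retry"                 # transient server-side error
    if status is not None and status >= 400:
        return "record-failure"        # 404 etc.: the page is simply unavailable
    return "ok"
```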

## 📊 Monitoring Progress

### Real-time Statistics

The scraper prints progress updates as it runs:

```
Processing page 1000/90769
Found 25 companies on this page
Batch 10: Found 2,500 companies
Success rate: 98.5%
Speed: 3.9 pages/second
```

### Log File Analysis

Check the log file for detailed information:

```bash
tail -f zaubacorp_parallel_data/parallel_scraper.log
```

### Statistics Dashboard

Final statistics are printed when the run completes:

```
ZAUBACORP PARALLEL SCRAPING COMPLETED
================================================================================
Total pages processed: 89,432
Total companies found: 2,847,392
Companies with details: 0
Failed pages: 1,337
Success rate: 98.5%
Duration: 6:30:00
Average speed: 3.8 pages/second
Companies per minute: 7,301
Output directory: zaubacorp_parallel_data
================================================================================
```

## 🔧 Troubleshooting

### Common Issues

#### 1. High Failure Rate

**Symptoms**: Many failed pages, low success rate

**Solutions**:

- Reduce `max_workers` and `batch_size`
- Increase `request_delay`
- Use the 'conservative' profile
- Check your internet connection

#### 2. Memory Issues

**Symptoms**: Out-of-memory errors, slow performance

**Solutions**:

- Reduce `batch_size`
- Save batches to disk more frequently
- Close other applications
- Use 64-bit Python

#### 3. Rate Limiting

**Symptoms**: HTTP 429 errors, temporary blocks

**Solutions**:

- Increase delays between requests
- Reduce the number of workers
- Use a different IP address/proxy
- Wait before retrying

#### 4. Slow Performance

**Symptoms**: Very low pages/second rate

**Solutions**:

- Increase `max_workers` (if the success rate is high)
- Reduce `request_delay`
- Check network speed
- Use SSD storage for output

### Debug Mode

Enable verbose logging for debugging:

```python
import logging

logging.basicConfig(level=logging.DEBUG)
```

## 🔄 Resuming Interrupted Scrapes

If scraping is interrupted, you can resume in several ways:

### 1. Resume from Failed Pages

```bash
python run_parallel_scraper.py resume --failed-file zaubacorp_parallel_data/failed_pages.json
```
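Under the hood, resuming amounts to re-reading the failed page numbers and feeding them back to the scraper. A sketch, assuming `failed_pages.json` holds a flat list of page numbers (check your actual file's shape first; the helper name is ours):

```python
import json

def load_failed_pages(path="zaubacorp_parallel_data/failed_pages.json"):
    """Return a sorted, de-duplicated list of page numbers to retry."""
    with open(path) as f:
        pages = json.load(f)
    # Normalise to ints and drop duplicates from repeated failures
    return sorted(set(int(p) for p in pages))
```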

### 2. Continue from the Last Page

Find the last successfully processed page in the logs, then restart from the next one:

```bash
# If the last successful page was 5000, set start_page=5001 in the script
python zaubacorp_parallel_scraper.py
```

### 3. Merge Results

Combine the output of multiple scraping sessions:

```python
import glob

import pandas as pd

# Read all CSV files
csv_files = glob.glob("*/all_companies.csv")
combined_df = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)

# Remove duplicates (the CIN uniquely identifies a company)
combined_df = combined_df.drop_duplicates(subset=['cin'])

# Save the combined results
combined_df.to_csv('final_combined_companies.csv', index=False)
```

## 🎛 Advanced Usage

### Custom Scraper Class

```python
from zaubacorp_parallel_scraper import ZaubaCorpParallelScraper


class CustomScraper(ZaubaCorpParallelScraper):
    def parse_companies_list_page(self, html, page_num):
        # Reuse the default parsing logic...
        companies = super().parse_companies_list_page(html, page_num)

        # ...then apply custom filtering
        return [c for c in companies if c.get('status') == 'Active']


# Use the custom scraper
scraper = CustomScraper(max_workers=10)
```

### Distributed Scraping

Run multiple instances on different machines, giving each a non-overlapping page range:

**Machine 1**: Pages 1-30,000

```python
await scraper.scrape_all_companies(start_page=1, end_page=30000)
```

**Machine 2**: Pages 30,001-60,000

```python
await scraper.scrape_all_companies(start_page=30001, end_page=60000)
```

**Machine 3**: Pages 60,001-90,769

```python
await scraper.scrape_all_companies(start_page=60001, end_page=90769)
```

## 📈 Performance Optimization

### System Optimization

1. **CPU**: More cores support more workers
2. **RAM**: 16GB+ recommended for large batches
3. **Storage**: SSD for faster I/O
4. **Network**: Stable high-speed connection

### Python Optimization

```bash
# PyPy may speed up parsing-heavy workloads (verify that all
# dependencies, notably pandas, install cleanly under PyPy first)
pypy3 -m pip install -r requirements_parallel.txt
pypy3 run_parallel_scraper.py full

# -O disables assertions; for an I/O-bound scraper the gain is usually small
python -O run_parallel_scraper.py full
```

### Configuration Tuning

```python
# For high-end systems
config = {
    'max_workers': 50,
    'batch_size': 500,
    'connection_limit': 100,
    'request_delay': (0.01, 0.1)
}

# For low-end systems
config = {
    'max_workers': 3,
    'batch_size': 25,
    'connection_limit': 5,
    'request_delay': (1.0, 2.0)
}
```

## 🔒 Legal & Ethical Guidelines

### Important Considerations

1. **Respect robots.txt**: Check ZaubaCorp's robots.txt file
2. **Rate Limiting**: The built-in delays help respect server capacity
3. **Terms of Service**: Ensure compliance with ZaubaCorp's ToS
4. **Data Usage**: Use scraped data responsibly
5. **Attribution**: Consider providing attribution when using the data
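Python's standard library can perform the robots.txt check directly. A short sketch using `urllib.robotparser` (the rules string and user agent here are illustrative; fetch the real file from https://www.zaubacorp.com/robots.txt before scraping):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules (pass the file's text, not its URL)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

# Illustrative rules only -- check the site's real robots.txt
rules = "User-agent: *\nDisallow: /admin/\n"
print(allowed(rules, "MyScraper",
              "https://www.zaubacorp.com/ESPY-SOLUTIONS-PRIVATE-LIMITED-U32107KA2000PTC026370"))
```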

### Best Practices

- Start with small tests before a full scrape
- Use conservative settings initially
- Monitor server responses and adjust accordingly
- Don't run multiple instances against the site simultaneously
- Respect any temporary blocks or rate limits
## 🆘 Support & Contributing

### Getting Help

1. Check the troubleshooting section
2. Review the log files for error details
3. Test with conservative settings first
4. Ensure all dependencies are installed

### Contributing

Contributions are welcome! Areas for improvement:

- Better error handling
- More efficient parsing
- Additional output formats
- Performance optimizations
- Better documentation

### Feature Requests

- Database integration
- GUI interface
- Cloud deployment scripts
- Real-time monitoring dashboard
- Integration with data analysis tools
## 📊 Expected Results

### Full Scrape Results

A complete scrape of ZaubaCorp should yield:

- **~90,769 pages** processed
- **~2.8-3.2 million companies** found
- **File sizes**: 500MB-1GB (CSV), 800MB-1.5GB (JSON)
- **Duration**: 3-24 hours (depending on configuration)

### Data Quality

- **Completeness**: 95-99% of available data
- **Accuracy**: High (taken directly from the source)
- **Freshness**: As current as ZaubaCorp's database
- **Duplicates**: Minimal (CIN uniqueness is used for de-duplication)

## 🎉 Conclusion

This parallel scraper provides a robust, scalable solution for extracting company data from ZaubaCorp. With proper configuration and responsible usage, it can efficiently process the entire database while respecting server limits and producing high-quality output.

Remember to always scrape responsibly and in accordance with applicable laws and terms of service!