# ZaubaCorp Companies Scraper
A Python web scraper for extracting company information from ZaubaCorp.com, a comprehensive database of Indian companies. It collects detailed company profiles, including CINs, registration details, capital information, addresses, and director listings.
## Features
- **Comprehensive Data Extraction**: Scrapes company names, CIN numbers, registration details, capital information, addresses, and director information
- **Multiple Output Formats**: Saves data in both CSV and JSON formats
- **Configurable Scraping**: Customizable limits for pages and companies to scrape
- **Error Handling**: Robust error handling with retry logic and logging
- **Rate Limiting**: Built-in delays to be respectful to the target website
- **Resume Capability**: Periodic saving allows resuming interrupted scraping sessions
- **Statistics Tracking**: Detailed statistics about the scraping process
## Prerequisites
- Python 3.7 or higher
- Chrome or Chromium browser installed
- Stable internet connection
## Installation
1. Clone or download the project files
2. Install required dependencies:
   ```bash
   pip install -r requirements.txt
   ```
## Quick Start
### Basic Usage
Run the basic scraper with default settings:
```bash
python zaubacorp_scraper.py
```
### Enhanced Version
Run the enhanced scraper with configuration support:
```bash
python zaubacorp_scraper_enhanced.py
```
## Configuration
The enhanced scraper uses `config.py` for configuration. You can modify the following settings:
### Browser Configuration
```python
BROWSER_CONFIG = {
    "headless": True,  # Set to False to see browser window
    "window_size": "1920,1080",
    "page_load_timeout": 30,
    "implicit_wait": 10
}
```
### Scraping Limits
```python
SCRAPING_LIMITS = {
    "max_companies": 100,         # Set to None for unlimited
    "max_pages": 5,               # Set to None for all pages
    "delay_between_requests": 1,  # Seconds between requests
    "save_interval": 50           # Save data every N companies
}
```
### Output Settings
```python
OUTPUT_CONFIG = {
    "output_dir": "zaubacorp_data",
    "save_formats": ["csv", "json"],
    "csv_filename": "zaubacorp_companies.csv",
    "json_filename": "zaubacorp_companies.json"
}
```
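Taken together, these output settings drive a save step roughly like the following sketch. The `save_records` helper is illustrative, not the scraper's actual function; it assumes records arrive as a list of dicts:

```python
import csv
import json
from pathlib import Path

def save_records(records, config):
    """Write records to every format listed in config["save_formats"]."""
    out_dir = Path(config["output_dir"])
    out_dir.mkdir(parents=True, exist_ok=True)  # create output dir if missing
    if "json" in config["save_formats"]:
        (out_dir / config["json_filename"]).write_text(
            json.dumps(records, indent=2, ensure_ascii=False)
        )
    if "csv" in config["save_formats"] and records:
        with open(out_dir / config["csv_filename"], "w", newline="") as f:
            # Column order follows the keys of the first record
            writer = csv.DictWriter(f, fieldnames=records[0].keys())
            writer.writeheader()
            writer.writerows(records)
```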
## Data Fields
The scraper extracts the following information for each company:
| Field | Description |
|-------|-------------|
| `url` | Company profile URL |
| `company_name` | Official company name |
| `cin` | Corporate Identification Number |
| `registration_number` | Company registration number |
| `company_category` | Category of the company |
| `company_sub_category` | Sub-category classification |
| `class_of_company` | Class classification |
| `roc` | Registrar of Companies |
| `registration_date` | Date of incorporation |
| `company_status` | Current status (Active/Inactive) |
| `authorized_capital` | Authorized capital amount |
| `paid_up_capital` | Paid-up capital amount |
| `activity_code` | Business activity code |
| `email` | Company email address |
| `address` | Registered address |
| `state` | State of registration |
| `pincode` | PIN code |
| `country` | Country (usually India) |
| `directors` | List of company directors |
| `last_updated` | Timestamp of data extraction |
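As a concrete illustration, a single scraped record could look like the Python dict below. Every value is invented for illustration; none of this is real company data:

```python
# Illustrative only: all values below are fabricated, not real company data.
sample_record = {
    "url": "https://www.zaubacorp.com/company/EXAMPLE-PRIVATE-LIMITED/U12345MH2010PTC123456",
    "company_name": "EXAMPLE PRIVATE LIMITED",
    "cin": "U12345MH2010PTC123456",
    "registration_number": "123456",
    "company_category": "Company limited by Shares",
    "company_sub_category": "Non-govt company",
    "class_of_company": "Private",
    "roc": "RoC-Mumbai",
    "registration_date": "2010-06-15",
    "company_status": "Active",
    "authorized_capital": "1,000,000",
    "paid_up_capital": "500,000",
    "activity_code": "12345",
    "email": "info@example.com",
    "address": "123 Example Road, Mumbai",
    "state": "Maharashtra",
    "pincode": "400001",
    "country": "India",
    "directors": ["A. Example", "B. Example"],
    "last_updated": "2025-08-18T23:16:46+05:30",
}
```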
## Output Files
The scraper generates the following files in the output directory:
- `zaubacorp_companies.csv` - Company data in CSV format
- `zaubacorp_companies.json` - Company data in JSON format
- `scraping_stats.json` - Statistics about the scraping session
- `zaubacorp_scraper.log` - Detailed logs of the scraping process
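To work with a finished run, the JSON output can be reloaded and indexed by CIN for quick lookups. A minimal sketch; the `load_companies` helper is illustrative, not part of the project:

```python
import json

def load_companies(json_path):
    """Load previously scraped records and index them by CIN."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    # Records without a CIN are skipped rather than indexed under None
    return {r["cin"]: r for r in records if r.get("cin")}
```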
## Usage Examples
### Example 1: Basic Scraping
```python
from zaubacorp_scraper import ZaubaCorpScraper
scraper = ZaubaCorpScraper(headless=True, output_dir="my_data")
scraper.scrape_companies(max_companies=50, max_pages=2)
```
### Example 2: Enhanced Scraping with Custom Config
```python
from zaubacorp_scraper_enhanced import ZaubaCorpScraperEnhanced
custom_config = {
    'scraping': {
        'max_companies': 200,
        'max_pages': 10,
        'delay_between_requests': 2
    },
    'browser': {
        'headless': False  # Show browser window
    }
}
scraper = ZaubaCorpScraperEnhanced(config=custom_config)
scraper.scrape_companies()
```
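A custom config like the one above presumably gets overlaid onto the defaults from `config.py`, so unspecified keys keep their default values. A minimal sketch of such a recursive merge; the `merge_config` name is illustrative and the enhanced scraper's actual implementation may differ:

```python
def merge_config(defaults, overrides):
    """Recursively overlay user overrides onto default settings."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged
```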
## Browser Support
### Using Chrome (Default)
The scraper uses Chrome by default. Make sure Chrome is installed on your system.
### Using Brave Browser
To use Brave instead of Chrome, uncomment this line in the scraper (the path shown is for macOS; adjust it for your operating system):
```python
self.chrome_options.binary_location = "/Applications/Brave Browser.app/Contents/MacOS/Brave Browser"
```
## Troubleshooting
### Common Issues
1. **WebDriver Not Found**
- Solution: The scraper automatically downloads the appropriate ChromeDriver
2. **Page Load Timeouts**
- Solution: Increase the `page_load_timeout` in configuration
- Check your internet connection
3. **No Data Extracted**
- The website structure might have changed
- Check the selectors in `config.py`
- Run with `headless=False` to debug visually
4. **Rate Limiting**
- Increase the `delay_between_requests` value
- The scraper includes built-in delays to be respectful
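The retry-with-delay behaviour described above can be sketched roughly as follows. The function name and parameters are illustrative, not the scraper's actual API:

```python
import time

def fetch_with_retries(fetch, url, retries=3, delay=1.0, backoff=2.0):
    """Call fetch(url), retrying with an increasing delay on failure."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay)
            delay *= backoff  # wait longer after each failure
```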
### Debugging
1. **Enable Visual Mode**: Set `headless=False` to see what the browser is doing
2. **Check Logs**: Review the log file for detailed error information
3. **Reduce Scope**: Start with smaller limits (`max_companies=10`) to test
## Legal and Ethical Considerations
- **Respect robots.txt**: Always check the website's robots.txt file
- **Rate Limiting**: The scraper includes delays to avoid overwhelming the server
- **Terms of Service**: Ensure your usage complies with ZaubaCorp's terms of service
- **Data Usage**: Use scraped data responsibly and in accordance with applicable laws
- **Attribution**: Consider providing attribution when using the data
## Performance Tips
1. **Batch Processing**: Use the periodic save feature for large scraping jobs
2. **Headless Mode**: Run in headless mode for better performance
3. **Network**: Ensure stable internet connection for consistent results
4. **Resources**: Monitor system resources during large scraping sessions
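The periodic save feature from tip 1 boils down to checkpointing every `save_interval` companies, so an interrupted run loses at most one batch. A minimal sketch with illustrative names; the project's actual loop may differ:

```python
def scrape_with_checkpoints(urls, scrape_one, save, save_interval=50):
    """Scrape urls one by one, saving results every save_interval companies."""
    results = []
    for i, url in enumerate(urls, start=1):
        results.append(scrape_one(url))
        if i % save_interval == 0:
            save(results)  # checkpoint: an interrupted run can resume from here
    save(results)  # final save covers the remainder
    return results
```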
## Data Quality
The scraper includes several data quality features:
- **Duplicate Detection**: Prevents scraping the same URL multiple times
- **Data Validation**: Basic validation of extracted data
- **Error Tracking**: Keeps track of failed URLs for later review
- **Timestamp**: Each record includes extraction timestamp
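Duplicate detection can be as simple as tracking already-seen URLs in a set before queueing them; a minimal sketch, not the project's actual code:

```python
def filter_new_urls(urls, seen):
    """Yield only URLs not scraped before, updating the seen set in place."""
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url
```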
## Contributing
To contribute to this project:
1. Fork the repository
2. Create a feature branch
3. Make your improvements
4. Add tests if applicable
5. Submit a pull request
## Changelog
### Version 1.0
- Initial release with basic scraping functionality
- Support for CSV and JSON output
- Basic error handling
### Version 2.0 (Enhanced)
- Configuration file support
- Advanced error handling and retry logic
- Comprehensive logging
- Statistics tracking
- Improved data extraction
## License
This project is provided as-is for educational and research purposes. Please ensure compliance with all applicable laws and terms of service when using this scraper.
## Support
For issues and questions:
1. Check the troubleshooting section
2. Review the log files for detailed error information
3. Ensure all dependencies are properly installed
4. Test with smaller limits first
## Disclaimer
This scraper is for educational and research purposes only. Users are responsible for ensuring their usage complies with ZaubaCorp's terms of service and all applicable laws. The authors are not responsible for any misuse of this tool.