ZaubaCorp Companies Scraper
A Python web scraper designed to extract company information from ZaubaCorp.com, a comprehensive database of Indian companies. This scraper can collect detailed company profiles including CIN, registration details, financial information, and director information.
Features
- Comprehensive Data Extraction: Scrapes company names, CIN numbers, registration details, capital information, addresses, and director information
- Multiple Output Formats: Saves data in both CSV and JSON formats
- Configurable Scraping: Customizable limits for pages and companies to scrape
- Error Handling: Robust error handling with retry logic and logging
- Rate Limiting: Built-in delays to be respectful to the target website
- Resume Capability: Periodic saving allows resuming interrupted scraping sessions
- Statistics Tracking: Detailed statistics about the scraping process
Prerequisites
- Python 3.7 or higher
- Chrome or Chromium browser installed
- Stable internet connection
Installation
- Clone or download the project files
- Install required dependencies:
```shell
pip install -r requirements.txt
```
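The contents of `requirements.txt` are not shown here; based on the features described above (a Selenium-driven Chrome session with automatic ChromeDriver download), it likely includes at least:

```text
selenium
webdriver-manager
```

Check the actual `requirements.txt` in the project for the authoritative list and pinned versions.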
Quick Start
Basic Usage
Run the basic scraper with default settings:
```shell
python zaubacorp_scraper.py
```
Enhanced Version
Run the enhanced scraper with configuration support:
```shell
python zaubacorp_scraper_enhanced.py
```
Configuration
The enhanced scraper reads its configuration from `config.py`. You can modify the following settings:
Browser Configuration
```python
BROWSER_CONFIG = {
    "headless": True,            # Set to False to see the browser window
    "window_size": "1920,1080",
    "page_load_timeout": 30,
    "implicit_wait": 10
}
```
Scraping Limits
```python
SCRAPING_LIMITS = {
    "max_companies": 100,           # Set to None for unlimited
    "max_pages": 5,                 # Set to None for all pages
    "delay_between_requests": 1,    # Seconds between requests
    "save_interval": 50             # Save data every N companies
}
```
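The limits above can be combined into a simple polite-scraping loop. The sketch below is illustrative, not the scraper's actual code; `fetch` and `save` stand in for the real page-extraction and file-writing functions:

```python
import time

# Hypothetical sketch of how delay_between_requests and save_interval
# from SCRAPING_LIMITS could be honored in the main loop.
def scrape_with_limits(urls, fetch, save, delay=1, save_interval=50):
    results = []
    for i, url in enumerate(urls, start=1):
        results.append(fetch(url))
        if i % save_interval == 0:   # periodic save enables resuming
            save(list(results))
        if i < len(urls):
            time.sleep(delay)        # stay polite to the server
    save(list(results))              # final save
    return results
```

The periodic `save` call is what makes interrupted sessions resumable, as noted under Features.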
Output Settings
```python
OUTPUT_CONFIG = {
    "output_dir": "zaubacorp_data",
    "save_formats": ["csv", "json"],
    "csv_filename": "zaubacorp_companies.csv",
    "json_filename": "zaubacorp_companies.json"
}
```
Data Fields
The scraper extracts the following information for each company:
| Field | Description |
|---|---|
| `url` | Company profile URL |
| `company_name` | Official company name |
| `cin` | Corporate Identification Number |
| `registration_number` | Company registration number |
| `company_category` | Category of the company |
| `company_sub_category` | Sub-category classification |
| `class_of_company` | Class classification |
| `roc` | Registrar of Companies |
| `registration_date` | Date of incorporation |
| `company_status` | Current status (Active/Inactive) |
| `authorized_capital` | Authorized capital amount |
| `paid_up_capital` | Paid-up capital amount |
| `activity_code` | Business activity code |
| `email` | Company email address |
| `address` | Registered address |
| `state` | State of registration |
| `pincode` | PIN code |
| `country` | Country (usually India) |
| `directors` | List of company directors |
| `last_updated` | Timestamp of data extraction |
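An illustrative JSON record with placeholder values (not real company data; the CIN and URL formats are assumptions based on ZaubaCorp's public URL scheme):

```json
{
  "url": "https://www.zaubacorp.com/company/EXAMPLE-PRIVATE-LIMITED/U12345DL2015PTC123456",
  "company_name": "EXAMPLE PRIVATE LIMITED",
  "cin": "U12345DL2015PTC123456",
  "company_status": "Active",
  "registration_date": "2015-01-01",
  "roc": "ROC Delhi",
  "state": "Delhi",
  "directors": ["JANE DOE", "JOHN DOE"],
  "last_updated": "2024-01-01T00:00:00"
}
```

The remaining fields from the table above appear in each record as well.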
Output Files
The scraper generates the following files in the output directory:
- `zaubacorp_companies.csv` - Company data in CSV format
- `zaubacorp_companies.json` - Company data in JSON format
- `scraping_stats.json` - Statistics about the scraping session
- `zaubacorp_scraper.log` - Detailed logs of the scraping process
Usage Examples
Example 1: Basic Scraping
```python
from zaubacorp_scraper import ZaubaCorpScraper

scraper = ZaubaCorpScraper(headless=True, output_dir="my_data")
scraper.scrape_companies(max_companies=50, max_pages=2)
```
Example 2: Enhanced Scraping with Custom Config
```python
from zaubacorp_scraper_enhanced import ZaubaCorpScraperEnhanced

custom_config = {
    'scraping': {
        'max_companies': 200,
        'max_pages': 10,
        'delay_between_requests': 2
    },
    'browser': {
        'headless': False,  # Show browser window
    }
}

scraper = ZaubaCorpScraperEnhanced(config=custom_config)
scraper.scrape_companies()
```
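Note that the custom config above is partial: unspecified keys (such as `window_size`) fall back to the defaults. One way such a merge might work internally is a recursive dictionary merge; this sketch is an assumption, not the enhanced scraper's actual implementation:

```python
# Hypothetical recursive merge: values in overrides win, but nested
# dicts are merged key-by-key rather than replaced wholesale.
def merge_config(defaults, overrides):
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged
```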
Browser Support
Using Chrome (Default)
The scraper uses Chrome by default. Make sure Chrome is installed on your system.
Using Brave Browser
To use Brave browser instead of Chrome, uncomment this line in the scraper:
```python
self.chrome_options.binary_location = "/Applications/Brave Browser.app/Contents/MacOS/Brave Browser"
```
Troubleshooting
Common Issues
1. WebDriver Not Found
   - Solution: The scraper automatically downloads the appropriate ChromeDriver
2. Page Load Timeouts
   - Solution: Increase the `page_load_timeout` in the configuration
   - Check your internet connection
3. No Data Extracted
   - The website structure might have changed
   - Check the selectors in `config.py`
   - Run with `headless=False` to debug visually
4. Rate Limiting
   - Increase the `delay_between_requests` value
   - The scraper includes built-in delays to be respectful
Debugging
- Enable Visual Mode: Set `headless=False` to see what the browser is doing
- Check Logs: Review the log file for detailed error information
- Reduce Scope: Start with smaller limits (`max_companies=10`) to test
Legal and Ethical Considerations
- Respect robots.txt: Always check the website's robots.txt file
- Rate Limiting: The scraper includes delays to avoid overwhelming the server
- Terms of Service: Ensure your usage complies with ZaubaCorp's terms of service
- Data Usage: Use scraped data responsibly and in accordance with applicable laws
- Attribution: Consider providing attribution when using the data
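Checking robots.txt can be automated with the standard library. The sketch below parses rules from a string to keep the example self-contained and offline; against the live site you would instead call `rp.set_url("https://www.zaubacorp.com/robots.txt")` followed by `rp.read()`:

```python
from urllib.robotparser import RobotFileParser

# Check whether a path is allowed for a given user agent under
# the supplied robots.txt rules.
def allowed(robots_txt, url_path, agent="*"):
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url_path)
```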
Performance Tips
- Batch Processing: Use the periodic save feature for large scraping jobs
- Headless Mode: Run in headless mode for better performance
- Network: Ensure stable internet connection for consistent results
- Resources: Monitor system resources during large scraping sessions
Data Quality
The scraper includes several data quality features:
- Duplicate Detection: Prevents scraping the same URL multiple times
- Data Validation: Basic validation of extracted data
- Error Tracking: Keeps track of failed URLs for later review
- Timestamp: Each record includes extraction timestamp
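Duplicate detection keyed on the profile URL can be as simple as a seen-set; this is an illustrative sketch of the idea, not the scraper's actual code:

```python
# Keep only the first record seen for each profile URL,
# preserving the original order.
def deduplicate(records):
    seen, unique = set(), []
    for rec in records:
        if rec["url"] not in seen:
            seen.add(rec["url"])
            unique.append(rec)
    return unique
```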
Contributing
To contribute to this project:
- Fork the repository
- Create a feature branch
- Make your improvements
- Add tests if applicable
- Submit a pull request
Changelog
Version 1.0
- Initial release with basic scraping functionality
- Support for CSV and JSON output
- Basic error handling
Version 2.0 (Enhanced)
- Configuration file support
- Advanced error handling and retry logic
- Comprehensive logging
- Statistics tracking
- Improved data extraction
License
This project is provided as-is for educational and research purposes. Please ensure compliance with all applicable laws and terms of service when using this scraper.
Support
For issues and questions:
- Check the troubleshooting section
- Review the log files for detailed error information
- Ensure all dependencies are properly installed
- Test with smaller limits first
Disclaimer
This scraper is for educational and research purposes only. Users are responsible for ensuring their usage complies with ZaubaCorp's terms of service and all applicable laws. The authors are not responsible for any misuse of this tool.