ZaubaCorp Companies Scraper
A Python web scraper designed to extract company information from ZaubaCorp.com, a comprehensive database of Indian companies. This scraper can collect detailed company profiles including CIN, registration details, financial information, and director information.
Features
- Comprehensive Data Extraction: Scrapes company names, CIN numbers, registration details, capital information, addresses, and director information
- Multiple Output Formats: Saves data in both CSV and JSON formats
- Configurable Scraping: Customizable limits for pages and companies to scrape
- Error Handling: Robust error handling with retry logic and logging
- Rate Limiting: Built-in delays to be respectful to the target website
- Resume Capability: Periodic saving allows resuming interrupted scraping sessions
- Statistics Tracking: Detailed statistics about the scraping process
Prerequisites
- Python 3.7 or higher
- Chrome or Chromium browser installed
- Stable internet connection
Installation
- Clone or download the project files
- Install required dependencies:
```shell
pip install -r requirements.txt
```
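The contents of `requirements.txt` are not shown here; based on the features described above (a Selenium-driven Chrome session with automatic ChromeDriver download), it likely includes at least:

```text
selenium
webdriver-manager
```

Check the actual `requirements.txt` in the project for the authoritative list and pinned versions.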
Quick Start
Basic Usage
Run the basic scraper with default settings:
```shell
python zaubacorp_scraper.py
```
Enhanced Version
Run the enhanced scraper with configuration support:
```shell
python zaubacorp_scraper_enhanced.py
```
Configuration
The enhanced scraper reads its configuration from `config.py`. You can modify the following settings:
Browser Configuration
```python
BROWSER_CONFIG = {
    "headless": True,            # Set to False to see the browser window
    "window_size": "1920,1080",
    "page_load_timeout": 30,
    "implicit_wait": 10
}
```
Scraping Limits
```python
SCRAPING_LIMITS = {
    "max_companies": 100,           # Set to None for unlimited
    "max_pages": 5,                 # Set to None for all pages
    "delay_between_requests": 1,    # Seconds between requests
    "save_interval": 50             # Save data every N companies
}
```
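The limits above can be combined into a simple polite-scraping loop. The sketch below is illustrative, not the scraper's actual code; `fetch` and `save` stand in for the real page-extraction and file-writing functions:

```python
import time

# Hypothetical sketch of how delay_between_requests and save_interval
# from SCRAPING_LIMITS could be honored in the main loop.
def scrape_with_limits(urls, fetch, save, delay=1, save_interval=50):
    results = []
    for i, url in enumerate(urls, start=1):
        results.append(fetch(url))
        if i % save_interval == 0:   # periodic save enables resuming
            save(list(results))
        if i < len(urls):
            time.sleep(delay)        # stay polite to the server
    save(list(results))              # final save
    return results
```

The periodic `save` call is what makes interrupted sessions resumable, as noted under Features.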
Output Settings
```python
OUTPUT_CONFIG = {
    "output_dir": "zaubacorp_data",
    "save_formats": ["csv", "json"],
    "csv_filename": "zaubacorp_companies.csv",
    "json_filename": "zaubacorp_companies.json"
}
```
Data Fields
The scraper extracts the following information for each company:
| Field | Description |
|---|---|
| `url` | Company profile URL |
| `company_name` | Official company name |
| `cin` | Corporate Identification Number |
| `registration_number` | Company registration number |
| `company_category` | Category of the company |
| `company_sub_category` | Sub-category classification |
| `class_of_company` | Class classification |
| `roc` | Registrar of Companies |
| `registration_date` | Date of incorporation |
| `company_status` | Current status (Active/Inactive) |
| `authorized_capital` | Authorized capital amount |
| `paid_up_capital` | Paid-up capital amount |
| `activity_code` | Business activity code |
| `email` | Company email address |
| `address` | Registered address |
| `state` | State of registration |
| `pincode` | PIN code |
| `country` | Country (usually India) |
| `directors` | List of company directors |
| `last_updated` | Timestamp of data extraction |
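An illustrative JSON record with placeholder values (not real company data; the CIN and URL formats are assumptions based on ZaubaCorp's public URL scheme):

```json
{
  "url": "https://www.zaubacorp.com/company/EXAMPLE-PRIVATE-LIMITED/U12345DL2015PTC123456",
  "company_name": "EXAMPLE PRIVATE LIMITED",
  "cin": "U12345DL2015PTC123456",
  "company_status": "Active",
  "registration_date": "2015-01-01",
  "roc": "ROC Delhi",
  "state": "Delhi",
  "directors": ["JANE DOE", "JOHN DOE"],
  "last_updated": "2024-01-01T00:00:00"
}
```

The remaining fields from the table above appear in each record as well.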
Output Files
The scraper generates the following files in the output directory:
- `zaubacorp_companies.csv` - Company data in CSV format
- `zaubacorp_companies.json` - Company data in JSON format
- `scraping_stats.json` - Statistics about the scraping session
- `zaubacorp_scraper.log` - Detailed logs of the scraping process
Usage Examples
Example 1: Basic Scraping
```python
from zaubacorp_scraper import ZaubaCorpScraper

scraper = ZaubaCorpScraper(headless=True, output_dir="my_data")
scraper.scrape_companies(max_companies=50, max_pages=2)
```
Example 2: Enhanced Scraping with Custom Config
```python
from zaubacorp_scraper_enhanced import ZaubaCorpScraperEnhanced

custom_config = {
    'scraping': {
        'max_companies': 200,
        'max_pages': 10,
        'delay_between_requests': 2
    },
    'browser': {
        'headless': False,  # Show browser window
    }
}

scraper = ZaubaCorpScraperEnhanced(config=custom_config)
scraper.scrape_companies()
```
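Note that the custom config above is partial: unspecified keys (such as `window_size`) fall back to the defaults. One way such a merge might work internally is a recursive dictionary merge; this sketch is an assumption, not the enhanced scraper's actual implementation:

```python
# Hypothetical recursive merge: values in overrides win, but nested
# dicts are merged key-by-key rather than replaced wholesale.
def merge_config(defaults, overrides):
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged
```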
Browser Support
Using Chrome (Default)
The scraper uses Chrome by default. Make sure Chrome is installed on your system.
Using Brave Browser
To use Brave browser instead of Chrome, uncomment this line in the scraper:
```python
self.chrome_options.binary_location = "/Applications/Brave Browser.app/Contents/MacOS/Brave Browser"
```
Troubleshooting
Common Issues
1. WebDriver Not Found
   - Solution: The scraper automatically downloads the appropriate ChromeDriver
2. Page Load Timeouts
   - Solution: Increase the `page_load_timeout` in the configuration
   - Check your internet connection
3. No Data Extracted
   - The website structure might have changed
   - Check the selectors in `config.py`
   - Run with `headless=False` to debug visually
4. Rate Limiting
   - Increase the `delay_between_requests` value
   - The scraper includes built-in delays to be respectful
Debugging
- Enable Visual Mode: Set `headless=False` to see what the browser is doing
- Check Logs: Review the log file for detailed error information
- Reduce Scope: Start with smaller limits (`max_companies=10`) to test
Legal and Ethical Considerations
- Respect robots.txt: Always check the website's robots.txt file
- Rate Limiting: The scraper includes delays to avoid overwhelming the server
- Terms of Service: Ensure your usage complies with ZaubaCorp's terms of service
- Data Usage: Use scraped data responsibly and in accordance with applicable laws
- Attribution: Consider providing attribution when using the data
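Checking robots.txt can be automated with the standard library. The sketch below parses rules from a string to keep the example self-contained and offline; against the live site you would instead call `rp.set_url("https://www.zaubacorp.com/robots.txt")` followed by `rp.read()`:

```python
from urllib.robotparser import RobotFileParser

# Check whether a path is allowed for a given user agent under
# the supplied robots.txt rules.
def allowed(robots_txt, url_path, agent="*"):
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url_path)
```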
Performance Tips
- Batch Processing: Use the periodic save feature for large scraping jobs
- Headless Mode: Run in headless mode for better performance
- Network: Ensure stable internet connection for consistent results
- Resources: Monitor system resources during large scraping sessions
Data Quality
The scraper includes several data quality features:
- Duplicate Detection: Prevents scraping the same URL multiple times
- Data Validation: Basic validation of extracted data
- Error Tracking: Keeps track of failed URLs for later review
- Timestamp: Each record includes extraction timestamp
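Duplicate detection keyed on the profile URL can be as simple as a seen-set; this is an illustrative sketch of the idea, not the scraper's actual code:

```python
# Keep only the first record seen for each profile URL,
# preserving the original order.
def deduplicate(records):
    seen, unique = set(), []
    for rec in records:
        if rec["url"] not in seen:
            seen.add(rec["url"])
            unique.append(rec)
    return unique
```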
Contributing
To contribute to this project:
- Fork the repository
- Create a feature branch
- Make your improvements
- Add tests if applicable
- Submit a pull request
Changelog
Version 1.0
- Initial release with basic scraping functionality
- Support for CSV and JSON output
- Basic error handling
Version 2.0 (Enhanced)
- Configuration file support
- Advanced error handling and retry logic
- Comprehensive logging
- Statistics tracking
- Improved data extraction
License
This project is provided as-is for educational and research purposes. Please ensure compliance with all applicable laws and terms of service when using this scraper.
Support
For issues and questions:
- Check the troubleshooting section
- Review the log files for detailed error information
- Ensure all dependencies are properly installed
- Test with smaller limits first
Disclaimer
This scraper is for educational and research purposes only. Users are responsible for ensuring their usage complies with ZaubaCorp's terms of service and all applicable laws. The authors are not responsible for any misuse of this tool.