
ZaubaCorp Companies Scraper

A Python web scraper that extracts company information from ZaubaCorp.com, a comprehensive database of Indian companies. It collects detailed company profiles, including CIN, registration details, capital and financial information, and director details.

Features

  • Comprehensive Data Extraction: Scrapes company names, CIN numbers, registration details, capital information, addresses, and director information
  • Multiple Output Formats: Saves data in both CSV and JSON formats
  • Configurable Scraping: Customizable limits for pages and companies to scrape
  • Error Handling: Robust error handling with retry logic and logging
  • Rate Limiting: Built-in delays to be respectful to the target website
  • Resume Capability: Periodic saving allows resuming interrupted scraping sessions
  • Statistics Tracking: Detailed statistics about the scraping process

Prerequisites

  • Python 3.7 or higher
  • Chrome or Chromium browser installed
  • Stable internet connection

Installation

  1. Clone or download the project files
  2. Install required dependencies:
pip install -r requirements.txt
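The exact contents of requirements.txt are not shown here, but based on the features described (Selenium browser automation with automatic ChromeDriver download), it likely resembles the following — the package names and versions below are assumptions, not the project's actual pins:

```text
selenium>=4.0
webdriver-manager>=3.8
```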

Quick Start

Basic Usage

Run the basic scraper with default settings:

python zaubacorp_scraper.py

Enhanced Version

Run the enhanced scraper with configuration support:

python zaubacorp_scraper_enhanced.py

Configuration

The enhanced scraper uses config.py for configuration. You can modify the following settings:

Browser Configuration

BROWSER_CONFIG = {
    "headless": True,  # Set to False to see browser window
    "window_size": "1920,1080",
    "page_load_timeout": 30,
    "implicit_wait": 10
}
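Internally, settings like these are typically translated into Chrome command-line switches before the driver starts. A minimal sketch of that translation (the helper name is hypothetical, not taken from the scraper's source):

```python
def build_chrome_args(cfg):
    """Translate a BROWSER_CONFIG-style dict into Chrome command-line switches."""
    args = []
    if cfg.get("headless"):
        args.append("--headless=new")  # modern Chrome headless mode
    if cfg.get("window_size"):
        args.append(f"--window-size={cfg['window_size']}")
    return args

BROWSER_CONFIG = {"headless": True, "window_size": "1920,1080",
                  "page_load_timeout": 30, "implicit_wait": 10}
print(build_chrome_args(BROWSER_CONFIG))
```

The timeout values are not command-line switches; they would be applied on the driver itself (e.g. `driver.set_page_load_timeout(cfg["page_load_timeout"])`).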

Scraping Limits

SCRAPING_LIMITS = {
    "max_companies": 100,  # Set to None for unlimited
    "max_pages": 5,        # Set to None for all pages
    "delay_between_requests": 1,  # Seconds between requests
    "save_interval": 50    # Save data every N companies
}

Output Settings

OUTPUT_CONFIG = {
    "output_dir": "zaubacorp_data",
    "save_formats": ["csv", "json"],
    "csv_filename": "zaubacorp_companies.csv",
    "json_filename": "zaubacorp_companies.json"
}
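The enhanced scraper accepts a partial config dict and presumably overlays it onto the defaults from config.py, so you only specify the keys you want to change. A deep-merge sketch of how that overlay might work (illustrative; the function name is not from the scraper's source):

```python
def deep_merge(defaults, overrides):
    """Recursively overlay `overrides` onto `defaults` without mutating either dict."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"browser": {"headless": True, "page_load_timeout": 30},
            "scraping": {"max_companies": 100}}
custom = {"browser": {"headless": False}}  # only override one nested key
merged = deep_merge(defaults, custom)
print(merged["browser"])  # headless overridden, timeout preserved
```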

Data Fields

The scraper extracts the following information for each company:

| Field | Description |
|-------|-------------|
| url | Company profile URL |
| company_name | Official company name |
| cin | Corporate Identification Number |
| registration_number | Company registration number |
| company_category | Category of the company |
| company_sub_category | Sub-category classification |
| class_of_company | Class classification |
| roc | Registrar of Companies |
| registration_date | Date of incorporation |
| company_status | Current status (Active/Inactive) |
| authorized_capital | Authorized capital amount |
| paid_up_capital | Paid-up capital amount |
| activity_code | Business activity code |
| email | Company email address |
| address | Registered address |
| state | State of registration |
| pincode | PIN code |
| country | Country (usually India) |
| directors | List of company directors |
| last_updated | Timestamp of data extraction |
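Put together, a single record in the JSON output might look like this — every value below is an invented placeholder, not real company data:

```python
import json

# Field names follow the table above; values are placeholders.
sample_record = {
    "url": "https://www.zaubacorp.com/company/EXAMPLE-PRIVATE-LIMITED/U00000DL0000PTC000000",
    "company_name": "Example Private Limited",
    "cin": "U00000DL0000PTC000000",
    "registration_number": "000000",
    "company_category": "Company limited by Shares",
    "company_sub_category": "Non-govt company",
    "class_of_company": "Private",
    "roc": "RoC-Delhi",
    "registration_date": "2020-01-01",
    "company_status": "Active",
    "authorized_capital": "1000000",
    "paid_up_capital": "100000",
    "activity_code": "00000",
    "email": "info@example.com",
    "address": "1 Example Street, New Delhi",
    "state": "Delhi",
    "pincode": "110001",
    "country": "India",
    "directors": [{"name": "Jane Doe", "din": "00000000"}],
    "last_updated": "2025-01-01T00:00:00",
}
print(json.dumps(sample_record, indent=2)[:60])
```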

Output Files

The scraper generates the following files in the output directory:

  • zaubacorp_companies.csv - Company data in CSV format
  • zaubacorp_companies.json - Company data in JSON format
  • scraping_stats.json - Statistics about the scraping session
  • zaubacorp_scraper.log - Detailed logs of the scraping process

Usage Examples

Example 1: Basic Scraping

from zaubacorp_scraper import ZaubaCorpScraper

scraper = ZaubaCorpScraper(headless=True, output_dir="my_data")
scraper.scrape_companies(max_companies=50, max_pages=2)

Example 2: Enhanced Scraping with Custom Config

from zaubacorp_scraper_enhanced import ZaubaCorpScraperEnhanced

custom_config = {
    'scraping': {
        'max_companies': 200,
        'max_pages': 10,
        'delay_between_requests': 2
    },
    'browser': {
        'headless': False,  # Show browser window
    }
}

scraper = ZaubaCorpScraperEnhanced(config=custom_config)
scraper.scrape_companies()

Browser Support

Using Chrome (Default)

The scraper uses Chrome by default. Make sure Chrome is installed on your system.

Using Brave Browser

To use Brave browser instead of Chrome, uncomment this line in the scraper:

self.chrome_options.binary_location = "/Applications/Brave Browser.app/Contents/MacOS/Brave Browser"
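That path is macOS-specific. If you want the binary location to resolve on other operating systems too, a small platform lookup works — the Linux and Windows paths below are common install locations, not guaranteed ones (adjust for your machine):

```python
import sys

# Common Brave install locations (assumptions; verify on your system).
BRAVE_PATHS = {
    "darwin": "/Applications/Brave Browser.app/Contents/MacOS/Brave Browser",
    "linux": "/usr/bin/brave-browser",
    "win32": r"C:\Program Files\BraveSoftware\Brave-Browser\Application\brave.exe",
}

def brave_binary(platform=sys.platform):
    """Return the likely Brave binary path for the given platform key."""
    for prefix, path in BRAVE_PATHS.items():
        if platform.startswith(prefix):
            return path
    raise RuntimeError(f"No known Brave path for platform {platform!r}")
```

You would then assign the result to `self.chrome_options.binary_location` instead of the hard-coded string.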

Troubleshooting

Common Issues

  1. WebDriver Not Found

    • Solution: The scraper automatically downloads the appropriate ChromeDriver
  2. Page Load Timeouts

    • Solution: Increase the page_load_timeout in configuration
    • Check your internet connection
  3. No Data Extracted

    • The website structure might have changed
    • Check the selectors in config.py
    • Run with headless=False to debug visually
  4. Rate Limiting

    • Increase the delay_between_requests value
    • The scraper includes built-in delays to be respectful
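The retry logic mentioned in the Features section can be sketched as a small wrapper with exponential backoff; this is illustrative, and the enhanced scraper's actual implementation may differ:

```python
import time

def fetch_with_retry(fetch, url, max_retries=3, base_delay=1.0):
    """Call `fetch(url)`, retrying with exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Simulate a page that times out twice, then loads.
calls = []
def flaky(url):
    calls.append(url)
    if len(calls) < 3:
        raise TimeoutError("simulated page load timeout")
    return "<html>ok</html>"

print(fetch_with_retry(flaky, "https://www.zaubacorp.com/", base_delay=0.01))
```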

Debugging

  1. Enable Visual Mode: Set headless=False to see what the browser is doing
  2. Check Logs: Review the log file for detailed error information
  3. Reduce Scope: Start with smaller limits (max_companies=10) to test
Ethical and Legal Considerations

  • Respect robots.txt: Always check the website's robots.txt file
  • Rate Limiting: The scraper includes delays to avoid overwhelming the server
  • Terms of Service: Ensure your usage complies with ZaubaCorp's terms of service
  • Data Usage: Use scraped data responsibly and in accordance with applicable laws
  • Attribution: Consider providing attribution when using the data

Performance Tips

  1. Batch Processing: Use the periodic save feature for large scraping jobs
  2. Headless Mode: Run in headless mode for better performance
  3. Network: Ensure stable internet connection for consistent results
  4. Resources: Monitor system resources during large scraping sessions

Data Quality

The scraper includes several data quality features:

  • Duplicate Detection: Prevents scraping the same URL multiple times
  • Data Validation: Basic validation of extracted data
  • Error Tracking: Keeps track of failed URLs for later review
  • Timestamp: Each record includes extraction timestamp
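Duplicate detection and basic validation can be as simple as a seen-URL set plus a required-field check — a sketch of the idea, not the scraper's exact code:

```python
def dedupe_and_validate(records, required=("company_name", "cin")):
    """Drop repeat URLs and records missing required fields."""
    seen, clean, failed = set(), [], []
    for rec in records:
        url = rec.get("url")
        if url in seen:
            continue  # duplicate: this profile was already scraped
        seen.add(url)
        if all(rec.get(field) for field in required):
            clean.append(rec)
        else:
            failed.append(url)  # tracked for later review
    return clean, failed

records = [
    {"url": "/company/a", "company_name": "A Pvt Ltd", "cin": "U1"},
    {"url": "/company/a", "company_name": "A Pvt Ltd", "cin": "U1"},  # duplicate
    {"url": "/company/b", "company_name": "", "cin": "U2"},           # invalid
]
clean, failed = dedupe_and_validate(records)
print(len(clean), failed)  # 1 ['/company/b']
```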

Contributing

To contribute to this project:

  1. Fork the repository
  2. Create a feature branch
  3. Make your improvements
  4. Add tests if applicable
  5. Submit a pull request

Changelog

Version 1.0

  • Initial release with basic scraping functionality
  • Support for CSV and JSON output
  • Basic error handling

Version 2.0 (Enhanced)

  • Configuration file support
  • Advanced error handling and retry logic
  • Comprehensive logging
  • Statistics tracking
  • Improved data extraction

License

This project is provided as-is for educational and research purposes. Please ensure compliance with all applicable laws and terms of service when using this scraper.

Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review the log files for detailed error information
  3. Ensure all dependencies are properly installed
  4. Test with smaller limits first

Disclaimer

This scraper is for educational and research purposes only. Users are responsible for ensuring their usage complies with ZaubaCorp's terms of service and all applicable laws. The authors are not responsible for any misuse of this tool.