An automated web scraping tool that extracts, processes, and analyzes lead generation job listings from Naukri.com. This project implements a robust scraping pipeline with built-in data cleaning and processing capabilities.
- 🤖 Automated web scraping of Naukri.com job listings
- 📊 Systematic data collection of lead generation positions
- 🧹 Built-in data cleaning and deduplication
- ⏱️ Smart waiting mechanisms to prevent rate limiting
- 📁 CSV export functionality
- 🔄 Resumable data collection
- Resume capability: Checks for existing data before starting
- Batch processing: Handles job listings in batches of 20
- Smart waits: Implements variable delays based on context (the resume check and waits are sketched after this list)
- Efficient selectors: Uses optimized CSS selectors for faster element location
- Error recovery: Automatically handles and recovers from common issues
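A minimal sketch of the resume check and context-based waits described above. The file name, column name, and delay ranges are illustrative assumptions, not the repo's exact values:

```python
import os
import random
import time

import pandas as pd

RAW_CSV = "naukri_jobs_raw.csv"  # assumed name of the raw results file


def load_existing_job_ids() -> set:
    """Return the job IDs already collected, so a restart can skip them."""
    if not os.path.exists(RAW_CSV):
        return set()
    return set(pd.read_csv(RAW_CSV)["Job ID"].astype(str))


def smart_wait(context: str) -> None:
    """Sleep for a variable interval depending on what just happened."""
    delays = {
        "page_load": (3.0, 6.0),   # after navigating to a new results page
        "extraction": (1.0, 2.0),  # between individual job cards
    }
    low, high = delays.get(context, (2.0, 4.0))
    time.sleep(random.uniform(low, high))
```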
├── backend/
│   ├── firstStep.py   - Initial setup and search execution
│   ├── secondStep.py  - Data extraction and collection
│   └── thirdStep.py   - Data cleaning and processing
└── __main__.py        - Main execution script
- Python 3.7+
- Chrome browser installed
- ChromeDriver matching your Chrome version
- Pandas
- NumPy
pip install selenium pandas numpy
- Clone the repository:
git clone https://github.com/XenosWarlocks/super-duper-engine.git
cd super-duper-engine
- Run the script:
python __main__.py
The script will:
- Initialize Chrome WebDriver
- Navigate to Naukri.com
- Search for lead generation positions
- Extract job details from multiple pages
- Clean and process the collected data
- Save results to CSV files (an orchestration sketch follows this list)
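One way `__main__.py` might chain the three backend modules; the `run()` entry points below are hypothetical, since the modules' actual function names are not shown here:

```python
from backend import firstStep, secondStep, thirdStep


def main() -> None:
    driver = firstStep.run()   # hypothetical: set up Chrome and run the search
    secondStep.run(driver)     # hypothetical: walk the result pages, save raw CSV
    thirdStep.run()            # hypothetical: clean, deduplicate, export final CSV
    driver.quit()


if __name__ == "__main__":
    main()
```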
- Initializes Chrome WebDriver with custom options
- Handles page navigation and search execution
- Implements smart waiting mechanisms for page loads (see the sketch below)
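A sketch of the kind of setup this step performs, assuming Selenium 4; the Chrome options, search URL, and CSS selector are assumptions rather than the repo's exact values:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def create_driver() -> webdriver.Chrome:
    options = Options()
    options.add_argument("--start-maximized")
    options.add_argument("--disable-blink-features=AutomationControlled")
    return webdriver.Chrome(options=options)


def open_search(driver: webdriver.Chrome) -> None:
    driver.get("https://www.naukri.com/lead-generation-jobs")
    # Smart wait: block until the first result card appears, up to 15 s.
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".srp-jobtuple-wrapper"))
    )
```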
- Extracts job details from search results
- Collects data from multiple pages (up to 20)
- Implements random waiting periods between requests
- Saves raw data to CSV (see the sketch below)
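A sketch of the multi-page loop described above; the URL pattern, CSS selectors, and attribute names are assumptions about Naukri's markup:

```python
import random
import time

import pandas as pd
from selenium.webdriver.common.by import By


def scrape_pages(driver, max_pages: int = 20) -> pd.DataFrame:
    rows = []
    for page in range(1, max_pages + 1):
        driver.get(f"https://www.naukri.com/lead-generation-jobs-{page}")
        time.sleep(random.uniform(3, 6))  # random delay between requests
        for card in driver.find_elements(By.CSS_SELECTOR, ".srp-jobtuple-wrapper"):
            title = card.find_element(By.CSS_SELECTOR, "a.title")
            rows.append({
                "Job ID": card.get_attribute("data-job-id"),
                "Job Title": title.text,
                "Company URL": title.get_attribute("href"),
                "Page Number": page,
            })
    df = pd.DataFrame(rows)
    df.to_csv("naukri_jobs_raw.csv", index=False)  # raw data, cleaned in step 3
    return df
```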
- Reads and processes the collected data
- Removes outdated job postings (30+ days old)
- Eliminates duplicate entries
- Exports cleaned data to a new CSV file (see the sketch below)
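A sketch of this cleaning pass with pandas. Column names follow the output schema listed below; the parsing of "Posted Time" strings such as "3 Days Ago" is an assumption:

```python
import pandas as pd


def clean_jobs(in_path: str = "naukri_jobs_raw.csv",
               out_path: str = "naukri_jobs_clean.csv") -> pd.DataFrame:
    df = pd.read_csv(in_path)
    # Pull the leading number out of strings like "3 Days Ago" (assumption:
    # entries like "Just Now" have no digits, parse as NaN, and count as age 0).
    age_days = df["Posted Time"].str.extract(r"(\d+)")[0].astype(float)
    df = df[age_days.fillna(0) < 30]            # drop postings 30+ days old
    df = df.drop_duplicates(subset=["Job ID"])  # one row per job
    df.to_csv(out_path, index=False)
    return df
```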
The tool generates a CSV file containing:
- Job ID
- Job Title
- Company Name
- Company URL
- Posted Time
- Page Number
The script includes comprehensive error handling (sketched after this list) for:
- Network issues
- Page load timeouts
- Missing elements
- Data processing errors
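A sketch of the recovery pattern this implies; the retry count and logging are assumptions:

```python
from selenium.common.exceptions import (
    NoSuchElementException,
    TimeoutException,
    WebDriverException,
)
from selenium.webdriver.common.by import By


def safe_get(driver, url: str, retries: int = 3) -> bool:
    """Load a URL, retrying on timeouts and driver/network failures."""
    for attempt in range(1, retries + 1):
        try:
            driver.get(url)
            return True
        except TimeoutException:
            print(f"Page load timed out (attempt {attempt}/{retries})")
        except WebDriverException as exc:
            print(f"Network/driver issue: {exc} (attempt {attempt}/{retries})")
    return False


def safe_text(element, selector: str) -> str:
    """Return an element's text, or '' when the selector is missing."""
    try:
        return element.find_element(By.CSS_SELECTOR, selector).text
    except NoSuchElementException:
        return ""
```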
- `__init__()`: Initializes the validator and GUI components
- `validate_csv_file()`: Handles CSV file selection and validation
- `verify_profiles()`: Performs the profile verification process
- `save_updates()`: Generates the output report
- `cleanup()`: Handles resource cleanup (see the class skeleton below)
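A skeleton of the class these methods imply, assuming a Tkinter file picker; the class name, signatures, and internals are all assumptions:

```python
import tkinter as tk
from tkinter import filedialog


class ProfileValidator:
    def __init__(self):
        """Initialize the validator and its GUI components."""
        self.root = tk.Tk()
        self.profiles = []  # rows loaded from the input CSV

    def validate_csv_file(self) -> str:
        """Let the user pick an input CSV and confirm it is readable."""
        return filedialog.askopenfilename(filetypes=[("CSV files", "*.csv")])

    def verify_profiles(self) -> None:
        """Check each LinkedIn profile loaded from the CSV."""
        ...

    def save_updates(self) -> None:
        """Write the verification results to the output report."""
        ...

    def cleanup(self) -> None:
        """Release GUI and browser resources."""
        self.root.destroy()
```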
- Rate Limiting: Add delays between profile checks to avoid LinkedIn's rate limits (see the snippet after this list)
- Error Handling: Enable debug mode when troubleshooting issues
- Data Privacy: Never commit LinkedIn credentials to version control
- CSV Backup: Keep backups of your input CSV files
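One way to implement the rate-limiting advice above, as a randomized pause between profile checks; the 5-10 second range is illustrative:

```python
import random
import time


def verify_with_backoff(profile_urls, check_profile) -> None:
    """Run a caller-supplied check on each profile with a randomized pause."""
    for url in profile_urls:
        check_profile(url)                 # hypothetical per-profile check
        time.sleep(random.uniform(5, 10))  # stay under LinkedIn's rate limits
```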