The Hall of Pole project aggregates and processes data about Pole Dance Studios from various sources. It focuses primarily on extracting data from the web and managing that data for further analysis and display on the website. The project uses Python scripts for web scraping, data processing, and database management.
Before you begin, ensure you have Python installed on your system. This project is compatible with Python 3.6 and above.
- Clone the Repository
First, clone the repository to your local machine. Open your terminal and run:
git clone https://github.com/hamudal/Hall_of_Pole_Version_5.git
cd Hall_of_Pole_Version_5/1_Latest_version_Hop_Scrapper_V5/Scraper_V3
- Install Dependencies
Open your terminal and run:
pip install -r requirements.txt
- Purpose: Extracts and reconstructs specific URLs from a given webpage.
- Details: This script uses requests and BeautifulSoup to fetch the content of a webpage and parse HTML. It extracts elements based on specific CSS classes and reconstructs URLs for different sections of the site (e.g., Overview, Classes, Workshops, etc.).
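As a minimal sketch of this approach (the CSS selector and the navigation-container assumption are illustrative, not the script's actual selectors):

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def extract_section_urls(page_url: str) -> dict:
    """Fetch a studio page and rebuild absolute URLs for its sections."""
    response = requests.get(page_url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    section_urls = {}
    # "a.nav-link" is a placeholder selector; the real script targets
    # the site's specific CSS classes.
    for link in soup.select("a.nav-link"):
        label = link.get_text(strip=True)  # e.g. "Overview", "Classes"
        section_urls[label] = urljoin(page_url, link.get("href", ""))
    return section_urls
```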
- Purpose: Validates a list of URLs to check their accessibility.
- Details: Utilizes requests to send HTTP requests to URLs and verify their status. Incorporates logging to track the validation process, marking URLs as valid or invalid.
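A minimal sketch of such a validator, assuming a HEAD request is enough to test accessibility (the real script may use GET or different status handling):

```python
import logging
import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def validate_urls(urls):
    """Return the subset of URLs that respond with HTTP 200."""
    valid = []
    for url in urls:
        try:
            response = requests.head(url, timeout=10, allow_redirects=True)
            if response.status_code == 200:
                logger.info("Valid: %s", url)
                valid.append(url)
            else:
                logger.warning("Invalid (%s): %s", response.status_code, url)
        except requests.RequestException as exc:
            logger.error("Unreachable: %s (%s)", url, exc)
    return valid
```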
- Purpose: Scrapes detailed information about Pole Studios from their webpages.
- Details: Extracts various data points like studio name, contact information, description, ratings, etc., using requests and BeautifulSoup for HTML parsing.
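A sketch of the scraping pattern; the selectors and field set are placeholders, since the real script targets the site's specific markup:

```python
import requests
from bs4 import BeautifulSoup

def scrape_studio_overview(url: str) -> dict:
    """Collect basic studio fields from an overview page."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    def text(selector):
        # Return stripped text for a selector, or None if it is absent.
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else None

    # All selectors below are placeholders for the site's real CSS classes.
    return {
        "url": url,
        "name": text("h1"),
        "rating": text(".rating"),
        "contact": text(".contact"),
        "description": text(".description"),
    }
```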
- Purpose: Gathers information about workshops offered by Pole Studios.
- Details: Scrapes workshop-related data such as name, date, price, and studio information, compiling it into a structured format.
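For illustration, compiling scraped workshop records into a structured format could look like the following; the column set mirrors the fields listed above, and the sample row is invented:

```python
import pandas as pd

def workshops_to_dataframe(records):
    """Compile scraped workshop dicts into a structured DataFrame."""
    return pd.DataFrame(records, columns=["name", "date", "price", "studio"])

# Invented sample record, for demonstration only.
sample = [{"name": "Spin Basics", "date": "2024-03-01",
           "price": "35 EUR", "studio": "Example Studio"}]
print(workshops_to_dataframe(sample))
```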
- Purpose: Extracts detailed information about individual workshops.
- Details: Similar to c_PoleStudio_Overview_S.py, but focuses on specific workshops, retrieving details like descriptions, levels, dates, and times.
- Purpose: Collects data about classes available in different Pole Studios.
- Details: Processes multiple URLs to scrape class information, including time, duration, name, and availability.
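A sketch of processing multiple URLs into class records; all selectors here are placeholders for the site's real CSS classes:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def _text(node):
    """Safely pull stripped text from an optional tag."""
    return node.get_text(strip=True) if node else None

def scrape_classes(urls):
    """Collect class rows (time, duration, name, availability) from several pages."""
    records = []
    for url in urls:
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        # ".class-row" and the field selectors below are placeholders.
        for row in soup.select(".class-row"):
            records.append({
                "url": url,
                "time": _text(row.select_one(".time")),
                "duration": _text(row.select_one(".duration")),
                "name": _text(row.select_one(".name")),
                "availability": _text(row.select_one(".availability")),
            })
    return pd.DataFrame(records)
```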
- Purpose: Provides detailed insights into specific classes offered.
- Details: Extracts comprehensive details of individual classes, including descriptions, instructors, locations, and schedules.
- Purpose: Central script to process URLs using the above modules.
- Details: Orchestrates the execution of other scripts, managing the flow of data from one module to another, ensuring cohesive data processing.
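PyCaller.py's internals are not shown here; as a sketch of the orchestration idea, the pipeline can be expressed as a chain of stage functions, where extract, validate, and scrape are stand-ins for the modules described above:

```python
def run_pipeline(url, extract, validate, scrape):
    """Chain extraction -> validation -> scraping for a single URL."""
    section_urls = extract(url)                       # e.g. URL reconstruction
    valid_urls = validate(list(section_urls.values()))  # e.g. URL validation
    return [scrape(u) for u in valid_urls]            # e.g. overview scraping

# Demonstration with trivial stubs in place of the real modules.
results = run_pipeline(
    "https://example.com/studio",
    extract=lambda u: {"Overview": u + "/overview"},
    validate=lambda urls: urls,      # pass-through stub
    scrape=lambda u: {"url": u},
)
print(results)
```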
- Purpose: Manages database connections and operations.
- Details: A Jupyter Notebook that outlines procedures for connecting to, querying, and managing a database storing the scraped data. It provides interactive elements for database operations.
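The notebook's backend is not fixed here; as a sketch, SQLite via sqlite3 and pandas could store and query the scraped data (the table name, schema, and sample row are assumptions):

```python
import sqlite3
import pandas as pd

# SQLite is assumed for simplicity; the notebook's actual backend may differ.
connection = sqlite3.connect("hall_of_pole.db")

studios = pd.DataFrame([{"name": "Example Studio", "city": "Berlin"}])
studios.to_sql("pole_studios", connection, if_exists="append", index=False)

print(pd.read_sql("SELECT * FROM pole_studios", connection))
connection.close()
```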
- Purpose: Aggregates URLs from multiple CSV files.
- Details: Searches through CSV files in a specified directory, extracting and combining URLs that match a given pattern. Ensures data consolidation for analysis.
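A minimal sketch of that aggregation, assuming each CSV exposes a url column and that matching is a simple prefix test (both are assumptions):

```python
import glob
import os
import pandas as pd

def collect_urls(directory: str, prefix: str = "https://") -> pd.Series:
    """Combine URL columns from all CSVs in a directory, keeping matches only."""
    frames = []
    for path in glob.glob(os.path.join(directory, "*.csv")):
        df = pd.read_csv(path)
        if "url" in df.columns:  # column name is an assumption
            frames.append(df["url"])
    if not frames:
        return pd.Series(dtype="object")
    urls = pd.concat(frames, ignore_index=True).drop_duplicates()
    return urls[urls.str.startswith(prefix, na=False)]
```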
- Modularity: Each script focuses on a single task, making the codebase more maintainable and scalable.
- Error Handling: Robust error handling to manage exceptions and maintain process flow.
- Code Readability: Clear naming conventions, consistent formatting, and detailed comments.
- Logging: Comprehensive logging in b_URLS_Validation.py for tracking URL validation processes.
Each script can be run independently or orchestrated through PyCaller.py, depending on the required operation. Ensure that the Python environment has the necessary dependencies installed (requests, pandas, beautifulsoup4, etc.).
The Main Frame serves as the orchestrator for processing URLs and aggregating data from various sources. It's designed to systematically extract, organize, and display data related to Pole Dance Studios.
- process_and_print_results: Processes a single URL and prints the results. It delegates data extraction to the scripts above, handling each URL individually.
- Data Aggregation: After processing the URLs, the script initializes several DataFrames to store the different categories of data (Pole Studio, Workshop, Class details, etc.).
- Looping Through URLs: It iterates over a list of URLs (e.g., from a CSV file) and calls process_and_print_results for each URL, appending the gathered data to the respective DataFrames.
- DataFrames Initialization: Separate DataFrames for Pole Studio Data, Workshop Data, Workshop Details, Class Data, and Class Details are initialized and populated with the scraped data.
- Data Display: Finally, it prints the collected data for each category, providing a comprehensive overview; an illustrative sketch of this loop follows below.
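The following is a minimal sketch of that loop. It assumes process_and_print_results returns a dict with per-category lists; both the stub below and the sample URL are illustrative, not the script's actual interface:

```python
import pandas as pd

# Hypothetical stand-in for the real process_and_print_results,
# which is defined in the Main Frame scripts.
def process_and_print_results(url):
    print(f"Processing {url}")
    return {"studios": [{"url": url}], "workshops": []}

# In the real script the URL list comes from a CSV file; a literal
# list is used here so the sketch runs on its own.
url_list = ["https://example.com/studio"]

studio_rows, workshop_rows = [], []
for url in url_list:
    result = process_and_print_results(url)
    studio_rows.extend(result.get("studios", []))
    workshop_rows.extend(result.get("workshops", []))

pole_studio_df = pd.DataFrame(studio_rows)
workshop_df = pd.DataFrame(workshop_rows)
print(pole_studio_df)
print(workshop_df)
```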
The Main Frame displays the aggregated data in an organized manner. Here's a glimpse of the output: