by Meredith Wang
Job hunting is a tedious and stressful process. Stacked paragraphs of description and long lists of requirements in job listings only add fuel to the fire. This project aims to help me and other aspiring data science professionals gain clear insight into the role they're pursuing and a better understanding of the education level of their competitors.
Our goal is to analyze data science job postings on LinkedIn using Natural Language Processing techniques and predict the candidates' education level.
Education level is classified into two categories:
- Undergraduate (candidates whose highest education level is a bachelor's degree, and those who have 'other' degrees)
- Graduate (candidates whose highest education level is a master's degree or PhD)
▪ What does the overall education distribution of candidates look like?
▪ Is role dependent on the education level of candidates?
▪ Is job level dependent on the education level of candidates?
▪ Is the job description different for the graduate vs. undergraduate group?
Variable | Value | Meaning |
---|---|---|
Link | String | The URL of the job posting |
Company | String | The company name of the job posting |
Mode | On-Site; Remote; Hybrid | The working environment of the job posting |
Type | Full-time; Contract | The contract type of the job posting |
Level | Entry; Associate; Mid-Senior | The job level of the job posting |
Requirements | String | The requirements in the description section of the job posting |
Edu Level | Int | Percentage of the job posting's candidates at each education level |
Skills | String | The top 10 skills among candidates for the job posting |
Gather data from LinkedIn using Selenium
- Install the Selenium web driver
- Create a function to guide the driver through an automated job search
- Store the data locally in a .csv file
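The acquisition steps above could be sketched roughly as follows. This is a minimal illustration, not the code in the acquisition notebook: the function names, the query parameter names in the search URL, and the `link_selector` argument are all assumptions, and the CSS selector that matches a job link has to be found by inspecting the page yourself.

```python
import csv
from urllib.parse import urlencode


def build_search_url(keywords, location):
    """Build a LinkedIn job-search URL (the query parameter names are assumptions)."""
    query = urlencode({"keywords": keywords, "location": location})
    return f"https://www.linkedin.com/jobs/search/?{query}"


def collect_links(url, link_selector):
    """Open the page in Chrome and return the href of every element matching link_selector."""
    # Imported here so the other helpers can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        elements = driver.find_elements(By.CSS_SELECTOR, link_selector)
        return [element.get_attribute("href") for element in elements]
    finally:
        # Always close the browser, even if the page fails to load.
        driver.quit()


def save_rows(rows, path, fieldnames):
    """Write a list of dicts to a .csv file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

A run would chain the three helpers: build the URL, pass it to `collect_links` with a selector of your choosing, then hand the resulting rows to `save_rows`.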
Missing Values
- When a job posting does not have enough candidates to generate insights, the education level and skills will be missing
- Missing values are filled manually by going to the URL of the job posting and finding another posting with the same job level, role, and company
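Before filling values by hand, it can help to list the postings that need attention. The snippet below is a small sketch on made-up rows; the helper name and the example values are hypothetical, and the column names follow the data dictionary above.

```python
import pandas as pd


def rows_needing_manual_fill(df, columns):
    """Return the rows where at least one of the given columns is missing."""
    return df[df[columns].isna().any(axis=1)]


# Made-up rows for illustration only.
postings = pd.DataFrame({
    "Link": ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    "Edu Level": [62, None, 48],
    "Skills": ["SQL; Python", None, "SQL; Tableau"],
})

to_review = rows_needing_manual_fill(postings, ["Edu Level", "Skills"])
print(to_review["Link"].tolist())
```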
Dummy Variables
Categorical features (e.g. `role`, `level`) are turned into dummy variables to quantify them so we can use them in the models.
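As an illustration, pandas can produce dummy variables with `get_dummies`. The rows below are made up, and the exact call used in prepare.py may differ.

```python
import pandas as pd

# Made-up rows for illustration only.
postings = pd.DataFrame({
    "role": ["Data Scientist", "Data Analyst", "Data Engineer", "Data Analyst"],
    "level": ["Entry", "Associate", "Mid-Senior", "Entry"],
})

# One 0/1 column per category value, e.g. "role_Data Analyst" or "level_Entry".
dummies = pd.get_dummies(postings, columns=["role", "level"], dtype=int)
print(dummies.columns.tolist())
```

Each row ends up with exactly one `role_*` column and one `level_*` column set to 1.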
Initial Text Cleaning
Job role names vary across companies. For example, a data scientist position may be listed as "Data Scientist II", "Data Scientist, Charging Data and Modeling", "Data Scientist - Credit Card", etc. To analyze the relationship between the general role category and the target variable, all roles are generalized into 4 categories: Data Scientist, Data Analyst, Data Engineer, Managerial Roles.
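One way to generalize titles is simple keyword matching, sketched below. The keywords and their priority order are assumptions made for this example, not the rules used in the project; titles that match nothing return `None` so they can be reviewed by hand.

```python
def generalize_role(title):
    """Map a raw job title to one of the 4 general categories, or None if unmatched."""
    lowered = title.lower()
    # Managerial keywords are checked first so "Data Science Manager" is not
    # labeled by its technical keyword.
    if any(keyword in lowered for keyword in ("manager", "director", "head of")):
        return "Managerial Roles"
    if "engineer" in lowered:
        return "Data Engineer"
    if "analyst" in lowered:
        return "Data Analyst"
    if "scientist" in lowered:
        return "Data Scientist"
    return None


for raw_title in ("Data Scientist II", "Data Scientist - Credit Card", "Senior Manager, Data Science"):
    print(raw_title, "->", generalize_role(raw_title))
```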
Parsing Text
- Convert text to all lower case for consistency
- Remove accented and other non-ASCII characters
- Remove special characters
- Lemmatize the words
- Remove stopwords
- Store the clean text and the original text for use in future notebooks
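The parsing steps above can be sketched with the standard library alone. In this sketch the stopword set is a tiny stand-in and lemmatization defaults to a no-op; since the project lists nltk among its libraries, `nltk.stem.WordNetLemmatizer().lemmatize` and `nltk.corpus.stopwords.words("english")` could be passed in instead (both require the corresponding nltk corpora to be downloaded). The function name and signature are assumptions.

```python
import re
import unicodedata

# Tiny stand-in stopword set for illustration only.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "in", "to", "with", "for", "is", "are"})


def clean_text(text, lemmatize=lambda word: word, stopwords=STOPWORDS):
    """Lowercase, strip accents and non-ASCII, drop special characters, lemmatize, remove stopwords."""
    text = text.lower()
    # Decompose accented characters, then drop anything that is not ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Replace everything except letters, digits, and whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    words = [lemmatize(word) for word in text.split()]
    return " ".join(word for word in words if word not in stopwords)


original = "Résumé: 5+ years of experience with SQL & Python!"
# Keep both versions, as the last step in the list describes.
record = {"original": original, "clean": clean_text(original)}
print(record["clean"])
```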
- Address the initial questions to find the key features associated with the undergraduate and graduate groups
- Explore each feature's correlation with the education distribution
- Use visualizations to better understand the relationship between the features and the target variable
- Conduct a t-test for categorical vs. numerical variables
- Conduct a Chi^2 test for categorical vs. categorical variables
- Draw conclusions from the hypothesis tests and address the initial questions
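The two hypothesis tests can be run with scipy, as in the sketch below. The data is a made-up toy sample and the column names (`role`, `edu_group`, `pct_graduate`) are hypothetical, so the resulting p-values say nothing about the real dataset.

```python
import pandas as pd
from scipy import stats

# Made-up toy sample for illustration only.
toy = pd.DataFrame({
    "role": ["Data Scientist"] * 4 + ["Data Analyst"] * 4,
    "edu_group": ["Graduate", "Graduate", "Graduate", "Undergraduate",
                  "Undergraduate", "Undergraduate", "Graduate", "Undergraduate"],
    "pct_graduate": [80, 75, 70, 45, 40, 35, 60, 30],
})

# Chi^2 test of independence: categorical (role) vs. categorical (edu_group).
contingency = pd.crosstab(toy["role"], toy["edu_group"])
chi2, p_chi2, dof, expected = stats.chi2_contingency(contingency)

# Two-sample t-test: categorical (role) vs. numerical (pct_graduate).
scientist = toy.loc[toy["role"] == "Data Scientist", "pct_graduate"]
analyst = toy.loc[toy["role"] == "Data Analyst", "pct_graduate"]
t_stat, p_t = stats.ttest_ind(scientist, analyst, equal_var=False)

print(f"chi2 p-value: {p_chi2:.3f}, t-test p-value: {p_t:.3f}")
```

In both cases a p-value below the chosen significance level would lead to rejecting the null hypothesis (independence for the Chi^2 test, equal means for the t-test).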
- Create a decision tree classifier and fit it on the train dataset
- Find the max depth for the best performing decision tree classifier (evaluated using classification report, accuracy score)
- Create a random forest classifier and fit it on the train dataset
- Find the max depth for the best performing random forest classifier (evaluated using classification report, accuracy score)
- Create a logistic regression model and fit it on the train dataset
- Find the parameter C for the best performing logistic regression model (evaluated using classification report, accuracy score)
- Create an XGBoost classifier and fit it on the train dataset
- Pick the top 3 models among all the models and evaluate their performance on the validate dataset
- Pick the model with the highest accuracy and evaluate it on the test dataset
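The general shape of this workflow is sketched below with scikit-learn on synthetic data. It is a simplification under stated assumptions: it covers only the decision tree and logistic regression, compares every candidate directly on the validate split instead of shortlisting a top 3, and the split sizes, depth range, and C values are chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and the education-level target.
X, y = make_classification(n_samples=243, n_features=10, random_state=42)

# Split into train / validate / test.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)


def best_by_validation(candidates):
    """Fit each (name, model) pair on train; return (accuracy, name, model) of the best on validate."""
    scored = []
    for name, model in candidates:
        model.fit(X_train, y_train)
        scored.append((accuracy_score(y_val, model.predict(X_val)), name, model))
    return max(scored, key=lambda item: item[0])


# Candidate hyperparameters: tree depth and logistic regression C.
candidates = [(f"tree(max_depth={depth})", DecisionTreeClassifier(max_depth=depth, random_state=42))
              for depth in range(1, 8)]
candidates += [(f"logreg(C={c})", LogisticRegression(C=c, max_iter=1000))
               for c in (0.01, 0.1, 1, 10)]

val_accuracy, best_name, best_model = best_by_validation(candidates)

# The test split is touched only once, by the selected model.
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))
print(best_name, round(val_accuracy, 3), round(test_accuracy, 3))
```

Keeping the test split out of model selection is what makes the final accuracy a fair estimate for unseen data.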
NOTE: The job postings data is not static, so each run of the automated search returns different results. The insights from exploration and the accuracy of the models will therefore differ slightly as well.
- You will need a LinkedIn Premium account so you can access the part of the data that is used as modeling features. Store your password locally in a secret text file.
- You will need to install the Selenium WebDriver. Please follow the documentation and steps in the acquisition notebook.
- Run the driver to acquire the latest job postings on your own, then store them in a .csv file.
OR
- You can choose to use the data that my analysis is based on. Please contact me for the .csv file.
The following steps apply for both:
- Clone my repo (including imports.py, prepare.py)
- Libraries used are pandas, matplotlib, seaborn, plotly, sklearn, scipy, selenium, nltk
- Follow the instructions in each notebook throughout the pipeline (preparation, exploration, modeling) and in the README file
- You are then good to run the workbook and read through the white paper 😸
- For fewer than 1/4 of the candidates for data science job postings, the highest education level is a bachelor's degree.
- Candidates' education distribution is dependent on role (scientist, analyst, engineer, managerial roles)
- Candidates' education distribution is independent of job level (entry, associate, mid-senior)
- For entry level positions, the number of candidates with graduate degrees is significantly higher than the number with undergraduate degrees.
- Top phrases mentioned in data science job descriptions are: Data Analytics, no. of years experience, SQL, Python, Master Degree, Business
- Top skills among data science candidates: SQL, Python, Machine Learning, Data Analysis, R, C/C++, Tableau, Data Visualization
- The final model, a decision tree, is expected to predict with 87% accuracy on future unseen data.
- To complete an MVP, I was only able to gather 243 observations. That is one of the reasons there is a class imbalance in our dataset, and why the model is failing to converge while reporting a higher accuracy. Therefore, gathering more data would be important.
- This project is solely focused on data science job positions in the United States. We can expand to other areas in tech (e.g. web development, cloud administration, etc.) and compare the education distribution across fields. We can also expand to other countries to see whether such a master's-degree-dominant pool is found only in the United States.
- There is an extensive number of master's programs, and there is no indicator of the quality of each program. For further study, I would like to include parameters that distinguish between different levels of degree accomplished.
- For candidates who don't have a graduate degree, or a bachelor's degree in STEM, I suggest you focus on mastering the "top skills" that we identified in the explore section.
- What exactly is the difference between candidates who acquire the skills on their own and those who go through a graduate program that costs $50k on average? How small is the chance for someone without a desired degree to "survive" the sea of resumes?