Linked inSight

by Meredith Wang

🌐 Project Description

Job hunting is a tedious and stressful process. Stacked paragraphs of description and long lists of requirements in job listings only add fuel to the fire. This project aims to help me and other aspiring data science professionals get clear insight into the role they're pursuing, and to provide a better understanding of the education level of their competitors.

🌟 Project Goals

Our goal is to analyze data science job postings on LinkedIn using Natural Language Processing techniques and predict the candidates' education level.

Education level is classified into two categories:

Undergraduate (candidates whose highest education level is a Bachelor's degree, and those who have 'other' degrees)
Graduate (candidates whose highest education level is a Master's degree or PhD)
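
One way to turn a posting's education percentages into this two-class target is sketched below. It assumes the breakdown is available as a mapping from degree name to percentage and that the label is whichever group holds the larger share. The `edu_label` helper, the degree names, and the majority rule are illustrative assumptions, not necessarily how the project derives its target.

```python
# Hypothetical helper: the degree names and the majority rule are assumptions.
GRADUATE_DEGREES = {"master", "phd"}

def edu_label(percentages: dict[str, float]) -> str:
    """Return 'Graduate' or 'Undergraduate' for one posting's education breakdown."""
    graduate = sum(share for degree, share in percentages.items()
                   if degree.lower() in GRADUATE_DEGREES)
    # Bachelor's and 'other' degrees both count toward the undergraduate group.
    undergraduate = sum(percentages.values()) - graduate
    return "Graduate" if graduate > undergraduate else "Undergraduate"

print(edu_label({"Bachelor": 30, "Master": 55, "PhD": 10, "Other": 5}))
```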

📝 Initial Questions

▪ What does the overall education distribution of candidates look like?

▪ Is the role dependent on the education level of candidates?

▪ Is the job level dependent on the education level of candidates?

▪ Is the job description different for the graduate vs. undergraduate group?

📂 Data Dictionary

Variable	Value	Meaning
Link	String	The URL of the job posting
Company	String	The company name of the job posting
Mode	On-Site; Remote; Hybrid	The working environment of the job posting
Type	Full-time; Contract	The contract type of the job posting
Level	Entry; Associate; Mid-Senior	The job level of the job posting
Requirements	String	The requirements in the description section of the job posting
Edu Level	Int	Percentage of candidates for the job posting at each education level
Skills	String	The top 10 skills among candidates for the job posting

🧭 Outline/Planning

1️⃣ Data Acquisition

Gather data from LinkedIn using Selenium

Install the Selenium WebDriver
Create a function that guides the driver to automate the job search
Store the data locally in a .csv file
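
The acquisition steps above could be sketched roughly as follows. This is a minimal outline, not the project's `acquire.py`. The search URL parameters, the `/jobs/view/` link filter, and all function names are assumptions made for illustration, and the sketch skips login entirely.

```python
import csv
from urllib.parse import urlencode

def build_search_url(keywords: str, location: str) -> str:
    """Build a LinkedIn job-search URL (the query parameter names are an assumption)."""
    query = urlencode({"keywords": keywords, "location": location})
    return "https://www.linkedin.com/jobs/search/?" + query

def collect_job_links(url: str) -> list[str]:
    """Open the search page and return the links that look like job postings."""
    # Imported here so the other helpers work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
    try:
        driver.get(url)
        anchors = driver.find_elements(By.TAG_NAME, "a")
        hrefs = [anchor.get_attribute("href") for anchor in anchors]
        # Treating "/jobs/view/" as the marker of a posting link is an assumption.
        return sorted({href for href in hrefs if href and "/jobs/view/" in href})
    finally:
        driver.quit()

def save_links(links: list[str], path: str) -> None:
    """Write one posting URL per row under the 'Link' column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Link"])
        writer.writerows([link] for link in links)

# Example (launches a real browser, so it is left commented out):
# url = build_search_url("data scientist", "United States")
# save_links(collect_job_links(url), "jobs.csv")
```

Keeping the URL builder and the CSV writer separate from the browser code means they can be checked without launching a driver.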

2️⃣ Data Preparation

Missing Values

When a job posting does not have enough candidates to generate insights, the education level and skills are missing
Missing values are filled manually by going to the URL of the job posting and finding another posting with the same job level, role, and company

Dummy Variables

Categorical features (e.g. role, level) are turned into dummy variables to quantify the features, so we can use them in the models.
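
As a small illustration of the dummy-variable step, the sketch below uses `pandas.get_dummies` on a few toy rows. The rows are invented for the example, and the `Role` column name is an assumption for the generalized role category.

```python
import pandas as pd

# Toy rows for illustration only; they are not taken from the project's data.
df = pd.DataFrame({
    "Level": ["Entry", "Associate", "Mid-Senior", "Entry"],
    "Role": ["Data Scientist", "Data Analyst", "Data Engineer", "Data Analyst"],
})

# One 0/1 column per category; dtype=int keeps the output numeric across pandas versions.
dummies = pd.get_dummies(df, columns=["Level", "Role"], dtype=int)
print(dummies.columns.tolist())
```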

Initial Text Cleaning

Job role names vary across companies. For example, for a data scientist position, there are names like "Data Scientist II", "Data Scientist, Charging Data and Modeling", "Data Scientist - Credit Card", etc. To analyze the relationship between the general role category and the target variable, all roles are generalized into 4 categories: Data Scientist, Data Analyst, Data Engineer, Managerial Roles.
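
A keyword-based version of this generalization might look like the sketch below. The keyword rules, their order, and the `Other` fallback are assumptions for illustration, and the project's own mapping may differ.

```python
import re

# Keyword rules are illustrative assumptions; the first matching rule wins.
ROLE_RULES = [
    ("Managerial Roles", r"\b(manager|director|head|lead)\b"),
    ("Data Scientist", r"\bscientist\b"),
    ("Data Engineer", r"\bengineer\b"),
    ("Data Analyst", r"\banalyst\b"),
]

def generalize_role(title: str) -> str:
    """Map a raw job title to one of the four general role categories."""
    lowered = title.lower()
    for category, pattern in ROLE_RULES:
        if re.search(pattern, lowered):
            return category
    return "Other"  # fallback for titles that no rule matches (an assumption)

for title in ["Data Scientist II",
              "Data Scientist, Charging Data and Modeling",
              "Data Scientist - Credit Card"]:
    print(title, "->", generalize_role(title))
```

Checking the managerial keywords first means a title such as "Data Science Manager" lands in Managerial Roles rather than in one of the individual-contributor categories.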

Parsing Text

Convert text to all lower case for consistency
Remove accented and other non-ASCII characters
Remove special characters
Lemmatization
Remove stopwords
Store the clean text and the original text for use in future notebooks
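
The parsing steps above can be sketched with the standard library alone. In this sketch the stopword list is a tiny stand-in and the lemmatizer is a pluggable function that defaults to doing nothing. Both are assumptions, since the project lists nltk for this work.

```python
import re
import unicodedata
from typing import Callable

# Tiny illustrative stopword list; nltk ships a much fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "with", "for", "is", "are"}

def clean_text(text: str, lemmatize: Callable[[str], str] = lambda word: word) -> str:
    """Lowercase, strip accents and special characters, lemmatize, drop stopwords."""
    text = text.lower()
    # Decompose accented characters, then drop anything outside ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Replace everything except letters, digits, and whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    words = [lemmatize(word) for word in text.split()]
    return " ".join(word for word in words if word not in STOPWORDS)

print(clean_text("Résumé-driven: 3+ years of experience with SQL & Python!"))
```

To add real lemmatization, a function such as `nltk.stem.WordNetLemmatizer().lemmatize` can be passed as the `lemmatize` argument once the WordNet corpus has been downloaded.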

3️⃣ Data Exploration

Address the initial questions to find the key features associated with the undergraduate and graduate groups
Explore each feature's correlation with education distribution
Use visualizations to better understand the relationship between features and target variable

4️⃣ Statistical Testing

Conduct a t-test for categorical vs. numerical variables
Conduct a chi-square test for categorical vs. categorical variables
Draw conclusions from the hypothesis tests and address the initial questions
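
For the categorical vs. categorical case, a chi-square test of independence can be run with `scipy.stats.chi2_contingency`. The counts below are made up purely to show the mechanics and are not the project's data. The 0.05 significance level is likewise an assumed convention.

```python
from scipy.stats import chi2_contingency

# Made-up counts for illustration: rows are roles, columns are
# (Undergraduate, Graduate). These are not the project's data.
observed = [
    [12, 48],  # Data Scientist
    [30, 30],  # Data Analyst
    [20, 40],  # Data Engineer
]

chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05  # significance level, an assumed convention
if p_value < alpha:
    print(f"chi2={chi2:.2f}, p={p_value:.4f}: reject independence")
else:
    print(f"chi2={chi2:.2f}, p={p_value:.4f}: fail to reject independence")
```

Rejecting independence here would correspond to concluding that education distribution depends on role.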

5️⃣ Modeling

Create a decision tree classifier and fit it on the train dataset
Find the max depth for the best-performing decision tree classifier (evaluated using classification report and accuracy score)
Create a random forest classifier and fit it on the train dataset
Find the max depth for the best-performing random forest classifier (evaluated using classification report and accuracy score)
Create a logistic regression model and fit it on the train dataset
Find the parameter C for the best-performing logistic regression model (evaluated using classification report and accuracy score)
Create an XGBoost classifier and fit it on the train dataset
Pick the top 3 models among all the models and evaluate their performance on the validate dataset
Pick the model with the highest accuracy and evaluate it on the test dataset
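
A condensed version of this workflow is sketched below with scikit-learn. It runs on synthetic data standing in for the prepared features, and the split sizes, the class mix, and the parameter grids are all assumptions. It also simplifies the selection by ranking every candidate on validation accuracy, and it leaves out XGBoost to stay within scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features (243 rows, imbalanced classes).
X, y = make_classification(n_samples=243, n_features=10, weights=[0.25, 0.75],
                           random_state=42)

# Split into train / validate / test, stratified so each part keeps the class mix.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

# Candidate models: a few depths per tree-based model and a few C values.
candidates = {}
for depth in range(1, 6):
    candidates[f"tree(max_depth={depth})"] = DecisionTreeClassifier(
        max_depth=depth, random_state=42)
    candidates[f"forest(max_depth={depth})"] = RandomForestClassifier(
        max_depth=depth, n_estimators=100, random_state=42)
for c in (0.01, 0.1, 1, 10):
    candidates[f"logit(C={c})"] = LogisticRegression(C=c, max_iter=1000)

# Fit every candidate on train and score it on validate.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))

# Only the best validation model is evaluated on the held-out test set.
best_name = max(val_scores, key=val_scores.get)
test_accuracy = accuracy_score(y_test, candidates[best_name].predict(X_test))
print(best_name, round(val_scores[best_name], 3), round(test_accuracy, 3))
```

Touching the test set only once, after the model has been chosen, keeps the final accuracy an honest estimate of performance on unseen data.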

🔁 Steps to Reproduce

NOTE: The job postings data is not static, so each run of the automated search returns different results. The insights from exploration and the accuracy of the models will therefore differ slightly as well.

You will need a LinkedIn account, preferably a Premium account, so you can access the part of the data that's used as modeling features. Store your password locally in a secret text file.
You will need to install the Selenium WebDriver. Please follow the documentation and steps in the acquisition notebook.
Run the driver to acquire the latest job postings on your own, then store them in a .csv file.

OR

You can choose to use the data that I based my analysis on. Please contact me for the .csv file.

The following steps apply to both:

Clone my repo (including imports.py, prepare.py)
Libraries used are pandas, matplotlib, seaborn, plotly, sklearn, scipy, selenium, nltk
Follow the instructions in each notebook throughout the pipeline (preparation, exploration, modeling) and in the README file
You are then good to run the workbook and read through the white paper 😸

🔑 Key Findings

Fewer than 1/4 of candidates for data science job postings hold a Bachelor's degree as their highest education level.
Candidates' education distribution is dependent on role (scientist, analyst, engineer, managerial roles)
Candidates' education distribution is independent of job level (entry, associate, mid-senior)
For entry-level positions, the number of candidates with graduate degrees is significantly higher than the number with undergraduate degrees.
Top phrases mentioned in data science job descriptions are: Data Analytics, number of years of experience, SQL, Python, Master Degree, Business
Top skills among data science candidates: SQL, Python, Machine Learning, Data Analysis, R, C/C++, Tableau, Data Visualization
The final model, a decision tree, is expected to predict with 87% accuracy on future unseen data.

🔜 Next Steps

To complete an MVP, I was only able to gather 243 observations. That is one of the reasons there is a class imbalance in our dataset, and why the model fails to converge and does not reach a higher accuracy. Gathering more data is therefore important.
This project focuses solely on data science job positions in the United States. We can expand to other areas in tech (e.g. web development, cloud administration, etc.) and compare the education distribution across fields. We can also expand to other countries to see whether such a master's-degree-dominant pool is unique to the United States.
There is a large number of master's programs, and there is no indicator of the quality of each program. For further study, I would like to include parameters that distinguish between different levels of degree attainment.

🔆 Recommendations/Further Questions

For candidates who don't have a graduate degree or a bachelor's degree in STEM, I suggest you focus on mastering the "top skills" identified in the exploration section.
What exactly is the difference between candidates who acquire the skills on their own and those who went through a graduate program that costs $50k on average? How small is the chance for someone without a desired degree to "survive" the sea of resumes?
