- Today I created test data and split it in two ways: a plain random train-test split, and a stratified split that keeps the same proportions of a key attribute in both sets so the sample is not skewed. A sketch of both is below.
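A minimal sketch of both splits, assuming a pandas DataFrame named `housing` with a `median_income` column (as in the California housing dataset; the variable name and bin edges are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Plain random sampling: fine for large datasets, but can skew proportions.
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

# Stratified sampling: bin the income first so both splits share its distribution.
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])
strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)
```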
- I plotted a geographical visualization of population density and housing prices to gain a better understanding of the data.
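A rough sketch of the kind of plot I mean, again assuming the `housing` DataFrame with longitude, latitude, population, and median_house_value columns:

```python
import matplotlib.pyplot as plt

# Point size tracks population, color tracks median house value.
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"] / 100, label="population",
             c="median_house_value", cmap="jet", colorbar=True)
plt.legend()
plt.show()
```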
- I also plotted scatterplots to check the correlation coefficients of median_house_value against the other features and found that it correlates most strongly with median_income; however, some horizontal straight lines form in the middle of the data (capped values), which need to be filtered out before training for better performance.
Here is some code; I hope you gain some insight from it:
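A sketch of the correlation check and the income scatterplot, under the same `housing` assumption as above; the horizontal bands it reveals are the capped values mentioned earlier:

```python
import matplotlib.pyplot as plt

# Rank every numeric feature by its correlation with the target.
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))

# Low alpha makes the density (and the capped horizontal lines) visible.
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)
plt.show()
```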
- I created some new combined features like rooms per house, bedrooms ratio, and people per house. Rooms per house did better than the raw counts it was built from, and bedrooms ratio showed a fairly high negative correlation, indicating that the lower the bedroom ratio, the higher the price.
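A sketch of those ratio features, assuming `housing` has total_rooms, total_bedrooms, population, and households columns:

```python
# Ratios often carry more signal than the raw counts they are built from.
housing["rooms_per_house"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["people_per_house"] = housing["population"] / housing["households"]

# Re-rank the correlations to see where the new features land.
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))
```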
- I also prepared the data for the machine learning algorithms by separating the features from the target and cleaning the data, filling missing values with the median since that is less destructive than dropping rows or columns.
- The benefit of using SimpleImputer is that it stores the median value of each feature: this makes it possible to impute missing values not only on the training set, but also on the validation set, the test set, and any new data fed to the model.
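A minimal sketch of median imputation, assuming `housing` still contains the target column `median_house_value` and at least one non-numeric column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Separate features from the target.
housing_features = housing.drop(columns=["median_house_value"])
housing_labels = housing["median_house_value"].copy()

# Median only makes sense for numeric columns.
housing_num = housing_features.select_dtypes(include=[np.number])

imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)              # learns and stores each column's median
X = imputer.transform(housing_num)    # the same medians can later fill gaps in
                                      # validation, test, or new data
```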
- Text and categorical data can be handled with OrdinalEncoder or one-hot encoding, but an ordinal encoder assumes that nearby values are more similar than distant ones, which is not the case for ocean_proximity, so we use one-hot encoding instead.
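A sketch of both encoders on the assumed ocean_proximity column:

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

housing_cat = housing[["ocean_proximity"]]

ordinal_encoder = OrdinalEncoder()   # implies an order: category 0 < 1 < 2 ...
housing_cat_ordinal = ordinal_encoder.fit_transform(housing_cat)

onehot_encoder = OneHotEncoder()     # one independent binary column per category
housing_cat_onehot = onehot_encoder.fit_transform(housing_cat)  # sparse matrix
```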
- Feature scaling is one of the most important transformations you need to apply to your data. Without feature scaling, most models will weight one feature over another purely because of its range. Two ways of feature scaling are: 1. Min-max scaling 2. Standardization.
Never use fit() or fit_transform() on anything other than the training set.
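A sketch of both scalers that also respects the fit-on-training-only rule; `X_train` and `X_test` are assumed numeric arrays or DataFrames from the split above:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_minmax = min_max_scaler.fit_transform(X_train)  # fit on training set only
X_test_minmax = min_max_scaler.transform(X_test)        # reuse training statistics

std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
X_test_std = std_scaler.transform(X_test)
```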
The transformation of numeric features into categorical features, using a set of thresholds, is called bucketing (or binning).
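A sketch of bucketing, once with hand-picked thresholds via pandas and once with thresholds learned from the data via scikit-learn's KBinsDiscretizer (the column and bin choices are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Fixed, hand-picked thresholds:
age_bucket = pd.cut(housing["housing_median_age"],
                    bins=[0, 10, 20, 30, 40, np.inf], labels=False)

# Equal-frequency thresholds learned from the data:
discretizer = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
age_binned = discretizer.fit_transform(housing[["housing_median_age"]])
```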
- ColumnTransformer is a versatile scikit-learn tool that applies different preprocessing steps to different columns or subsets of columns in a dataset. It simplifies the preprocessing workflow, enhances reproducibility, and makes feature engineering more efficient.
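A sketch of a ColumnTransformer for this dataset (the column lists assume the California housing schema; imputation is folded in via pipelines in the next sketch):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["longitude", "latitude", "housing_median_age", "total_rooms",
            "total_bedrooms", "population", "households", "median_income"]
cat_cols = ["ocean_proximity"]

# Scale the numeric columns, one-hot encode the categorical one.
preprocessing = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
housing_prepared = preprocessing.fit_transform(housing_features)
```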
- A Pipeline is a sequence of data processing steps applied in a specific order. It combines multiple operations, such as data preprocessing, feature engineering, and model training, into a single cohesive workflow, and makes it easy to apply the same preprocessing to the training and test sets.
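A sketch combining pipelines with the ColumnTransformer, reusing the `num_cols`, `cat_cols`, and `housing_features` names assumed above:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Each pipeline runs its steps in order: impute first, then encode/scale.
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols),
])
housing_prepared = preprocessing.fit_transform(housing_features)
```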
- I trained a few models, namely LinearRegression, DecisionTreeRegressor, and RandomForestRegressor. The RMSE was very high for LinearRegression, indicating underfitting, and exactly 0 for DecisionTreeRegressor, which was heavily overfitting; it was comparatively low for RandomForestRegressor. So I find that RandomForestRegressor can be a good choice.
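A sketch of that comparison, reusing the assumed `housing_prepared` and `housing_labels` from the earlier sketches; a cross-validation check is included because a training-set RMSE of 0 is exactly what overfitting looks like:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

for model in (LinearRegression(),
              DecisionTreeRegressor(random_state=42),
              RandomForestRegressor(random_state=42)):
    model.fit(housing_prepared, housing_labels)
    preds = model.predict(housing_prepared)
    rmse = np.sqrt(mean_squared_error(housing_labels, preds))
    print(type(model).__name__, "training RMSE:", rmse)

# Training RMSE alone is misleading (the tree can memorize its way to ~0),
# so cross-validation gives a fairer estimate:
scores = cross_val_score(RandomForestRegressor(random_state=42),
                         housing_prepared, housing_labels,
                         scoring="neg_root_mean_squared_error", cv=5)
print("RandomForest CV RMSE:", -scores.mean())
```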