California-Housing-price-prediction

Loading of data and EDA

[Figure: EDA results]

Creating test data

  • Today I created the test data and split it with a plain train-test split as well as a stratified split, so that both sets keep the same class proportions and sampling bias is avoided.
    [Figures: creating test data; random sampling vs. stratified sampling in the train-test split]
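The stratified split described above can be sketched as follows. This is a minimal sketch, not the repo's exact code: the `median_income` column name follows the California housing dataset, but the data here is a tiny synthetic stand-in, and the income bins are the commonly used ones for this dataset.

```python
# Sketch of a stratified train-test split: bin median_income into
# categories first, then stratify on those categories.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the housing data
housing = pd.DataFrame({
    "median_income": np.random.default_rng(42).uniform(0.5, 10.0, 200),
})

# Bin income so we can stratify on it (bin edges are an assumption)
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)

# Each split now has roughly the same income-category proportions
print(test_set["income_cat"].value_counts(normalize=True))
```

With `stratify=`, each income category contributes the same fraction to the test set as it has in the full data, which a purely random split does not guarantee on small samples.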

Visualization with geographical data

  • I plotted a geographical visualization of population density and housing prices to gain a better understanding of the data.
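A sketch of such a geographical plot, assuming the dataset's `longitude`, `latitude`, `population`, and `median_house_value` columns; the handful of rows here are made-up stand-ins:

```python
# Scatter plot of geography: marker size ~ population,
# color ~ median_house_value.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

housing = pd.DataFrame({  # tiny synthetic sample
    "longitude": [-122.2, -118.4, -121.9, -117.2],
    "latitude": [37.8, 34.0, 37.3, 32.7],
    "population": [3000, 8000, 1500, 5000],
    "median_house_value": [450_000, 320_000, 500_000, 280_000],
})

ax = housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
                  s=housing["population"] / 100,    # marker size = density
                  c="median_house_value", cmap="jet", colorbar=True)
plt.savefig("geo_scatter.png")
```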

Checking correlation

  • I also plotted scatterplots and computed the correlation coefficients of the median house price against the different features, and found that it has the highest correlation with median_income. However, some straight horizontal lines form in the middle of the data, which need to be filtered out before training for better performance.
    [Figures: correlation plot; visualization and correlation code]
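A minimal sketch of the correlation check; the column names follow the dataset, but the frame here is synthetic, built so that `median_income` dominates the correlation as the text describes:

```python
# Checking feature correlations against the target with DataFrame.corr()
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.uniform(0.5, 10.0, 500)
housing = pd.DataFrame({
    "median_income": income,
    "total_rooms": rng.uniform(500, 5000, 500),
    "median_house_value": income * 40_000 + rng.normal(0, 20_000, 500),
})

corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))

# A scatter plot of median_income vs. median_house_value is what reveals
# the horizontal straight-line artifacts mentioned above.
```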

Experimenting with attribute combinations

  • I created some new feature combinations, like rooms per house, bedrooms ratio, and population per house, and found that rooms per house did better than the other features; the bedrooms ratio got a fairly high negative correlation, indicating that the lower the bedroom ratio, the higher the price.
  • I also prepared the data for the machine-learning algorithm by separating the features from the target and cleaning the data, replacing missing values with the median, since that is less destructive than dropping rows. [Figure: code for data cleaning]
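The two steps above (combined attributes, then median fill) can be sketched like this; column names match the dataset, the values are synthetic, and the missing values are injected artificially for illustration:

```python
# Engineered attribute combinations plus median-fill cleaning
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
housing = pd.DataFrame({
    "total_rooms": rng.uniform(1000, 6000, n),
    "total_bedrooms": rng.uniform(200, 1200, n),
    "households": rng.uniform(300, 1500, n),
    "population": rng.uniform(800, 4000, n),
})

# Simulate some missing entries, then fill with the median
housing.loc[housing.sample(frac=0.05, random_state=1).index,
            "total_bedrooms"] = np.nan
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(
    housing["total_bedrooms"].median())

# New combined attributes
housing["rooms_per_house"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["people_per_house"] = housing["population"] / housing["households"]
```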

Use of SimpleImputer

  • The benefit of using SimpleImputer is that it stores the median value of each feature: this makes it possible to impute missing values not only on the training set, but also on the validation set, the test set, and any new data fed to the model. [Figure: SimpleImputer usage]
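A small sketch of that behavior, with made-up numbers: the medians are learned once from the training frame and then reused to transform unseen data.

```python
# SimpleImputer stores per-feature medians learned on the training set,
# so the same values can be reused on validation/test/new data.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"total_rooms": [880.0, 7099.0, np.nan, 1274.0],
                      "total_bedrooms": [129.0, np.nan, 190.0, 235.0]})
new_data = pd.DataFrame({"total_rooms": [np.nan],
                         "total_bedrooms": [np.nan]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)                     # learns the per-feature medians
print(imputer.statistics_)             # the stored medians
X_new = imputer.transform(new_data)    # same medians applied to new data
```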

Handling of text and categorical data

  • Text and categorical data can be handled with OrdinalEncoder or one-hot encoding, but OrdinalEncoder assumes that nearby category values are more similar than distant ones, which is not the case for ocean_proximity, so we use one-hot encoding instead. [Figure: handling text and categorical data]

Feature Scaling

  • Feature scaling is one of the most important transformations you need to apply to your data. Without feature scaling, most models will bias one feature over another. Two common methods of feature scaling are: 1. min-max scaling, 2. standardization.
  • Never use fit() or fit_transform() on anything other than the training set. [Figure: feature scaling]
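Both methods, and the "fit on the training set only" rule, in a tiny sketch with made-up numbers:

```python
# Min-max scaling vs. standardization; fit only on the training set,
# then transform test data with the training-set statistics.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[5.0]])    # note: falls outside the train range

minmax = MinMaxScaler().fit(X_train)    # learns train min/max
std = StandardScaler().fit(X_train)     # learns train mean/std

print(minmax.transform(X_test))   # (x - train_min) / (train_max - train_min)
print(std.transform(X_test))      # (x - train_mean) / train_std
```

Notice the min-max value exceeds 1 because the test point lies outside the training range; that is expected, and exactly why fitting on the test set would leak information.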

Bucketing/Binning

The transformation of numeric features into categorical features, using a set of thresholds, is called bucketing (or binning).
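For example, bucketing a numeric income column with a set of thresholds (the bin edges here are illustrative):

```python
# Bucketing: numeric values -> categorical bins via thresholds
import pandas as pd

income = pd.Series([0.8, 2.0, 3.5, 5.0, 9.9])
income_cat = pd.cut(income,
                    bins=[0.0, 1.5, 3.0, 4.5, 6.0, float("inf")],
                    labels=[1, 2, 3, 4, 5])
print(income_cat.tolist())   # each value replaced by its bucket label
```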

Column Transformer

  • Column transformer is a versatile tool in machine learning that allows for the application of different preprocessing steps to different columns or subsets of columns in a dataset. It simplifies the preprocessing workflow, enhances reproducibility, and improves the efficiency of feature engineering in machine learning tasks.

Pipeline

  • A pipeline refers to a sequence of data processing steps that are applied in a specific order. It combines multiple operations, such as data preprocessing, feature engineering, and model training, into a single cohesive workflow, and makes it easier to apply the same preprocessing to the training and test sets.
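A sketch of a two-step pipeline chaining imputation and scaling; the data is a toy stand-in:

```python
# A Pipeline chaining imputation and scaling, so identical
# preprocessing is applied to train and test data.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

X_train = np.array([[1.0], [np.nan], [3.0]])
X_test = np.array([[np.nan]])

pipe.fit(X_train)               # fit both steps on training data only
print(pipe.transform(X_test))   # NaN -> train median (2.0), then scaled
```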

Select and train a model

  • I trained several models, including LinearRegression, DecisionTreeRegressor, and RandomForestRegressor. The RMSE was very high for LinearRegression, which indicated underfitting; it was 0 for DecisionTreeRegressor, which was heavily overfitting; and it was comparatively low for RandomForestRegressor. So I found that RandomForestRegressor could be a good choice of model.
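A sketch of that comparison on synthetic data standing in for the prepared housing features; the pattern (tree at RMSE 0 on training data) reproduces the overfitting observation above:

```python
# Comparing three regressors by training-set RMSE
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, (200, 3))
y = X[:, 0] ** 2 + 5 * X[:, 1] + rng.normal(0, 1, 200)  # nonlinear target

for model in (LinearRegression(),
              DecisionTreeRegressor(random_state=42),
              RandomForestRegressor(random_state=42)):
    model.fit(X, y)
    rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
    print(type(model).__name__, round(rmse, 2))
# The tree fits the training set perfectly (RMSE 0) -- overfitting --
# which is why training error alone cannot select the model.
```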

Evaluation with cross-validation and fine-tuning the model

  • Performing cross-validation also showed that the random forest was a good choice, despite some overfitting. After tuning the RandomForestRegressor with GridSearchCV, I got good hyperparameters, the model performed better than before, and the RMSE was reduced. [Figure: last day's code]
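A sketch of both steps on synthetic data; the hyperparameter grid values here are illustrative assumptions, not the ones used in the repo:

```python
# Cross-validation, then a small GridSearchCV over RandomForest
# hyperparameters.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (100, 3))
y = X[:, 0] * 3 + X[:, 1] + rng.normal(0, 0.5, 100)

forest = RandomForestRegressor(random_state=42)
scores = cross_val_score(forest, X, y, cv=3,
                         scoring="neg_root_mean_squared_error")
print("CV RMSE:", -scores.mean())   # honest estimate, unlike training RMSE

grid = GridSearchCV(forest,
                    {"n_estimators": [10, 30], "max_features": [1, 2]},
                    cv=3, scoring="neg_root_mean_squared_error")
grid.fit(X, y)
print(grid.best_params_)            # best hyperparameter combination
```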
