- Today I created test data and split it in two ways: a plain random train-test split, and a stratified split that keeps the same proportions of a key attribute in both sets so the sample is not skewed. A sketch of both is below.
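A minimal sketch of both splits, assuming a pandas DataFrame named `housing` with a `median_income` column (as in the California housing dataset; the variable name and bin edges are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Plain random sampling: fine for large datasets, but can skew proportions.
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

# Stratified sampling: bin the income first so both splits share its distribution.
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])
strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)
```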
- I plotted a geographical visualization of population density and housing prices to gain a better understanding of the data.
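A rough sketch of the kind of plot I mean, again assuming the `housing` DataFrame with longitude, latitude, population, and median_house_value columns:

```python
import matplotlib.pyplot as plt

# Point size tracks population, color tracks median house value.
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"] / 100, label="population",
             c="median_house_value", cmap="jet", colorbar=True)
plt.legend()
plt.show()
```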
- I also plotted scatterplots to check the correlation coefficients of median_house_value against the other features and found that it correlates most strongly with median_income; however, some horizontal straight lines form in the middle of the data (capped values), which need to be filtered out before training for better performance.
Here is some code; I hope you gain some insight from it:
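A sketch of the correlation check and the income scatterplot, under the same `housing` assumption as above; the horizontal bands it reveals are the capped values mentioned earlier:

```python
import matplotlib.pyplot as plt

# Rank every numeric feature by its correlation with the target.
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))

# Low alpha makes the density (and the capped horizontal lines) visible.
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)
plt.show()
```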
- I created some new combined features like rooms per house, bedrooms ratio, and people per house. Rooms per house did better than the raw counts it was built from, and bedrooms ratio showed a fairly high negative correlation, indicating that the lower the bedroom ratio, the higher the price.
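A sketch of those ratio features, assuming `housing` has total_rooms, total_bedrooms, population, and households columns:

```python
# Ratios often carry more signal than the raw counts they are built from.
housing["rooms_per_house"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["people_per_house"] = housing["population"] / housing["households"]

# Re-rank the correlations to see where the new features land.
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))
```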
- I also prepared the data for the machine learning algorithms by separating the features from the target and cleaning the data, filling missing values with the median since that is less destructive than dropping rows or columns.
- The benefit of using SimpleImputer is that it stores the median value of each feature: this makes it possible to impute missing values not only on the training set, but also on the validation set, the test set, and any new data fed to the model.
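A minimal sketch of median imputation, assuming `housing` still contains the target column `median_house_value` and at least one non-numeric column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Separate features from the target.
housing_features = housing.drop(columns=["median_house_value"])
housing_labels = housing["median_house_value"].copy()

# Median only makes sense for numeric columns.
housing_num = housing_features.select_dtypes(include=[np.number])

imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)              # learns and stores each column's median
X = imputer.transform(housing_num)    # the same medians can later fill gaps in
                                      # validation, test, or new data
```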
- Text and categorical data can be handled with OrdinalEncoder or one-hot encoding, but an ordinal encoder assumes that nearby values are more similar than distant ones, which is not the case for ocean_proximity, so we use one-hot encoding instead.
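A sketch of both encoders on the assumed ocean_proximity column:

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

housing_cat = housing[["ocean_proximity"]]

ordinal_encoder = OrdinalEncoder()   # implies an order: category 0 < 1 < 2 ...
housing_cat_ordinal = ordinal_encoder.fit_transform(housing_cat)

onehot_encoder = OneHotEncoder()     # one independent binary column per category
housing_cat_onehot = onehot_encoder.fit_transform(housing_cat)  # sparse matrix
```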
- Feature scaling is one of the most important transformations you need to apply to your data. Without feature scaling, most models will weight one feature over another purely because of its range. Two ways of feature scaling are: 1. Min-max scaling 2. Standardization.
Never use fit() or fit_transform() on anything other than the training set.
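A sketch of both scalers that also respects the fit-on-training-only rule; `X_train` and `X_test` are assumed numeric arrays or DataFrames from the split above:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_minmax = min_max_scaler.fit_transform(X_train)  # fit on training set only
X_test_minmax = min_max_scaler.transform(X_test)        # reuse training statistics

std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
X_test_std = std_scaler.transform(X_test)
```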
The transformation of numeric features into categorical features, using a set of thresholds, is called bucketing (or binning).
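A sketch of bucketing, once with hand-picked thresholds via pandas and once with thresholds learned from the data via scikit-learn's KBinsDiscretizer (the column and bin choices are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Fixed, hand-picked thresholds:
age_bucket = pd.cut(housing["housing_median_age"],
                    bins=[0, 10, 20, 30, 40, np.inf], labels=False)

# Equal-frequency thresholds learned from the data:
discretizer = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
age_binned = discretizer.fit_transform(housing[["housing_median_age"]])
```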
- ColumnTransformer is a versatile scikit-learn tool that applies different preprocessing steps to different columns or subsets of columns in a dataset. It simplifies the preprocessing workflow, enhances reproducibility, and makes feature engineering more efficient.
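A sketch of a ColumnTransformer for this dataset (the column lists assume the California housing schema; imputation is folded in via pipelines in the next sketch):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["longitude", "latitude", "housing_median_age", "total_rooms",
            "total_bedrooms", "population", "households", "median_income"]
cat_cols = ["ocean_proximity"]

# Scale the numeric columns, one-hot encode the categorical one.
preprocessing = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
housing_prepared = preprocessing.fit_transform(housing_features)
```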
- A Pipeline is a sequence of data processing steps applied in a specific order. It combines multiple operations, such as data preprocessing, feature engineering, and model training, into a single cohesive workflow, and makes it easy to apply the same preprocessing to the training and test sets.
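A sketch combining pipelines with the ColumnTransformer, reusing the `num_cols`, `cat_cols`, and `housing_features` names assumed above:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Each pipeline runs its steps in order: impute first, then encode/scale.
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols),
])
housing_prepared = preprocessing.fit_transform(housing_features)
```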
- I trained a few models, namely LinearRegression, DecisionTreeRegressor, and RandomForestRegressor. The RMSE was very high for LinearRegression, indicating underfitting, and exactly 0 for DecisionTreeRegressor, which was heavily overfitting; it was comparatively low for RandomForestRegressor. So I find that RandomForestRegressor can be a good choice.
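A sketch of that comparison, reusing the assumed `housing_prepared` and `housing_labels` from the earlier sketches; a cross-validation check is included because a training-set RMSE of 0 is exactly what overfitting looks like:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

for model in (LinearRegression(),
              DecisionTreeRegressor(random_state=42),
              RandomForestRegressor(random_state=42)):
    model.fit(housing_prepared, housing_labels)
    preds = model.predict(housing_prepared)
    rmse = np.sqrt(mean_squared_error(housing_labels, preds))
    print(type(model).__name__, "training RMSE:", rmse)

# Training RMSE alone is misleading (the tree can memorize its way to ~0),
# so cross-validation gives a fairer estimate:
scores = cross_val_score(RandomForestRegressor(random_state=42),
                         housing_prepared, housing_labels,
                         scoring="neg_root_mean_squared_error", cv=5)
print("RandomForest CV RMSE:", -scores.mean())
```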