
[QUESTION] Unexpected Prediction with SGDClassifier #652

Open
guiabbehusen opened this issue Feb 24, 2025 · 3 comments

Comments

@guiabbehusen

I just ran the SGDClassifier from Chapter 3, but it predicted incorrectly on the first attempt (expected a 5, got a 3). Could this really happen, or is it due to a mistake in my code?

[image attachment]

@ageron
Owner

ageron commented Feb 24, 2025

Thanks for your question. Yes, this can happen: ML models don't always get things right. Moreover, the SGDClassifier model uses randomness during training (by randomly initializing the weights and by picking random instances at each training iteration), so you will get a slightly different model every time you train it (hopefully not too different).
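For example, here is a quick sketch of what I mean. It uses Scikit-Learn's small built-in digits dataset rather than the full MNIST download from Chapter 3, just to keep it short and self-contained:

```python
# Sketch: two SGDClassifier fits with different seeds can disagree on an
# individual digit while still behaving very similarly overall.
# (Uses the small built-in digits dataset, not the book's MNIST data.)
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

X, y = load_digits(return_X_y=True)

clf_a = SGDClassifier(random_state=42).fit(X, y)
clf_b = SGDClassifier(random_state=43).fit(X, y)

# The two models may disagree on a specific instance...
print(clf_a.predict([X[0]]), clf_b.predict([X[0]]))

# ...but they only differ on a small fraction of all predictions.
print((clf_a.predict(X) != clf_b.predict(X)).mean())
```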

Getting reproducible results is not easy: you have to eliminate every source of randomness. The main ones are:

- The random number generator (RNG), which is the biggest source. Scikit-Learn uses NumPy's RNG, which you can seed with np.random.seed(42). Python also has its own RNG, which you can seed with import random and random.seed(42).
- The iteration order of sets is not guaranteed, so if you really want perfect reproducibility, you need to sort set items before iterating over them.
- The order in which files are listed in a directory is not guaranteed either, so you need to sort the filenames after listing them.
- If your code is multithreaded (e.g., when using the GPU), it's even harder: the order in which the threads finish may vary, which changes the order of operations, and due to floating-point rounding errors, a+b is not always exactly equal to b+a. You would have to aggregate each thread's results in a deterministic way (e.g., sort them before aggregating).
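For example, here is a minimal sketch of pinning down the first few of these sources (the directory path and the label set below are just placeholders for your own data):

```python
import os
import random
import numpy as np
from sklearn.linear_model import SGDClassifier

random.seed(42)     # Python's own RNG
np.random.seed(42)  # NumPy's global RNG, which Scikit-Learn uses by default

# Scikit-Learn estimators also accept an explicit seed, which is usually
# the most robust option:
sgd_clf = SGDClassifier(random_state=42)

# Directory listings have no guaranteed order, so sort them explicitly
# ("." is just a placeholder for your data directory):
filenames = sorted(os.listdir("."))

# Sets have no guaranteed iteration order either, so sort before iterating:
labels = {"cat", "dog", "bird"}
for label in sorted(labels):
    pass  # process each label in a deterministic order
```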

Lastly, libraries evolve and algorithms may be slightly tweaked over time, so results may not be exactly identical to what they were in the past. To get the same results as I did, you may have to run the exact same library versions on the same platform. If you use Colab, that's tricky since the preinstalled libraries keep being updated; you would have to uninstall Scikit-Learn and reinstall an old version, but I don't recommend it: it's not worth the effort.

In short, ensuring perfect reproducibility can be very difficult, and in general I would argue that it's not worth the effort. It's more important to ensure that your model has approximately the same performance on the validation set. If my model has, say, 90.13% accuracy and yours has 90.12%, that's not a problem. The results may vary for specific instances, but overall the performance is roughly the same.
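For example, you could compare the cross-validation accuracy of models trained with different seeds (again using the small digits dataset as a stand-in for the book's MNIST data):

```python
# Sketch: overall validation accuracy stays roughly the same across seeds,
# even if individual predictions differ.
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

for seed in (42, 43):
    scores = cross_val_score(SGDClassifier(random_state=seed), X, y,
                             cv=3, scoring="accuracy")
    print(seed, scores.mean())  # the means should be close to each other
```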

I hope this helps!

@guiabbehusen
Author

Thank you for the explanation! Now it makes a lot more sense. I was initially confused because, earlier in the code, the same SGDClassifier algorithm was used and it correctly predicted True for X[0] (which is a 5).

I hadn't considered how much randomness could influence the results. Now I see that even with the same algorithm, factors like initialization and execution environment can lead to different outcomes.

Your explanation really clarified things for me!

@ageron
Owner

ageron commented Feb 25, 2025

I'm glad I could help! 👍
