
[QUESTION] Unexpected Prediction with SGDClassifier #652

Open
guiabbehusen opened this issue Feb 24, 2025 · 3 comments

Comments

@guiabbehusen

I just ran the SGDClassifier from Chapter 3, but it predicted incorrectly on the first attempt (expected a 5, got a 3). Could this really happen, or is it due to a mistake in my code?

[image attachment]

@ageron
Owner

ageron commented Feb 24, 2025

Thanks for your question. Yes, this can happen: ML models don't always get things right. Moreover, the SGDClassifier model uses randomness during training (by randomly initializing the weights and by picking random instances at each training iteration), so you will get a slightly different model every time you train it (hopefully not too different).
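For example, here is a quick sketch of what I mean. It uses Scikit-Learn's small built-in digits dataset rather than the full MNIST download from Chapter 3, just to keep it short and self-contained:

```python
# Sketch: two SGDClassifier fits with different seeds can disagree on an
# individual digit while still behaving very similarly overall.
# (Uses the small built-in digits dataset, not the book's MNIST data.)
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

X, y = load_digits(return_X_y=True)

clf_a = SGDClassifier(random_state=42).fit(X, y)
clf_b = SGDClassifier(random_state=43).fit(X, y)

# The two models may disagree on a specific instance...
print(clf_a.predict([X[0]]), clf_b.predict([X[0]]))

# ...but they only differ on a small fraction of all predictions.
print((clf_a.predict(X) != clf_b.predict(X)).mean())
```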

Getting reproducible results is not easy: you have to eliminate every source of randomness. The main ones are:

- The random number generator (RNG), which is the biggest source. Scikit-Learn uses NumPy's RNG, which you can seed with np.random.seed(42). Python also has its own RNG, which you can seed with import random and random.seed(42).
- The iteration order of sets is not guaranteed, so if you really want perfect reproducibility, you need to sort set items before iterating over them.
- The order in which files are listed in a directory is not guaranteed either, so you need to sort the filenames after listing them.
- If your code is multithreaded (e.g., when using the GPU), it's even harder: the order in which the threads finish may vary, which changes the order of operations, and due to floating-point rounding errors, a+b is not always exactly equal to b+a. You would have to aggregate each thread's results in a deterministic way (e.g., sort them before aggregating).
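For example, here is a minimal sketch of pinning down the first few of these sources (the directory path and the label set below are just placeholders for your own data):

```python
import os
import random
import numpy as np
from sklearn.linear_model import SGDClassifier

random.seed(42)     # Python's own RNG
np.random.seed(42)  # NumPy's global RNG, which Scikit-Learn uses by default

# Scikit-Learn estimators also accept an explicit seed, which is usually
# the most robust option:
sgd_clf = SGDClassifier(random_state=42)

# Directory listings have no guaranteed order, so sort them explicitly
# ("." is just a placeholder for your data directory):
filenames = sorted(os.listdir("."))

# Sets have no guaranteed iteration order either, so sort before iterating:
labels = {"cat", "dog", "bird"}
for label in sorted(labels):
    pass  # process each label in a deterministic order
```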

Lastly, libraries evolve and algorithms may be slightly tweaked over time, so results may not be exactly identical to what they were in the past. To get the same results as I did, you may have to run the exact same library versions on the same platform. If you use Colab, that's tricky since the preinstalled libraries keep being updated; you would have to uninstall Scikit-Learn and reinstall an old version, but I don't recommend it: it's not worth the effort.

In short, ensuring perfect reproducibility can be very difficult, and in general I would argue that it's not worth the effort. It's more important to ensure that your model has approximately the same performance on the validation set. If my model has, say, 90.13% accuracy and yours has 90.12%, that's not a problem. The results may vary for specific instances, but overall the performance is roughly the same.
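For example, you could compare the cross-validation accuracy of models trained with different seeds (again using the small digits dataset as a stand-in for the book's MNIST data):

```python
# Sketch: overall validation accuracy stays roughly the same across seeds,
# even if individual predictions differ.
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

for seed in (42, 43):
    scores = cross_val_score(SGDClassifier(random_state=seed), X, y,
                             cv=3, scoring="accuracy")
    print(seed, scores.mean())  # the means should be close to each other
```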

I hope this helps!

@guiabbehusen
Author

Thank you for the explanation! Now it makes a lot more sense. I was initially confused because, earlier in the code, the same SGDClassifier algorithm was used and it correctly predicted True for X[0] (which is a 5).

I hadn't considered how much randomness could influence the results. Now I see that even with the same algorithm, factors like initialization and execution environment can lead to different outcomes.

Your explanation really clarified things for me!

@ageron
Owner

ageron commented Feb 25, 2025

I'm glad I could help! 👍
