-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-15699] [ML] Implement a Chi-Squared test statistic option for measuring split quality #13440
[SPARK-15699] [ML] Implement a Chi-Squared test statistic option for measuring split quality #13440
Conversation
This is a re-submission of #13438 to fix target branch |
Test build #59740 has finished for PR 13440 at commit
|
Test build #59745 has finished for PR 13440 at commit
|
1136518
to
6d38cfd
Compare
Test build #59751 has finished for PR 13440 at commit
|
Test build #64309 has finished for PR 13440 at commit
|
Is this something your still working on? If so it would be good to merge in the latest master. We can also check with @jkbradley to see if he has some review bandwidth. |
@holdenk yes, I'll rebase it this week. |
6d38cfd
to
b199ae3
Compare
Test build #66679 has finished for PR 13440 at commit
|
b199ae3
to
83f5e83
Compare
Test build #66756 has finished for PR 13440 at commit
|
test this please |
Test build #66766 has finished for PR 13440 at commit
|
@holdenk @jkbradley looks like it's clean again |
@erikerlandson Are you still working on this PR? Thanks! Miao |
Hi @wangmiao1981, I am still interested in this, but I don't have any sense about whether upstream has any interest. Does upstream have any intention to accept it? |
Test build #73006 has started for PR 13440 at commit |
i stopped the build as i need to restart jenkins... i'll retrigger this when we're back up and running. |
@erikerlandson I am just helping clearing the stale PRs. :) I have no idea whether they have intention to accept it. |
test this please |
Test build #73008 has finished for PR 13440 at commit
|
@thunterdb Can you take a look? Thanks! |
*/ | ||
@Since("2.0.0") | ||
@DeveloperApi | ||
def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scala: do not add a default implementation, it causes issues with java compatibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you cannot guarantee compatibility with existing code here, since you would break the bytecode either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, there were other impurity measures using this 'unsupported' idiom. However, I'm fine with making it abstract, if that's the direction you are thinking.
*/ | ||
@Since("2.0.0") | ||
@DeveloperApi | ||
def isTestStatistic: Boolean = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scala: do not add a default implementation, it causes issues with java compatibility
* Utility functions for Impurity measures | ||
*/ | ||
@Since("2.0.0") | ||
@DeveloperApi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no need for this object to be publicly exposed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. I don't recall any specific motivation to keep it private, but historically Spark seems to default things to "minimum visibility." The only method currently defined here is an implementation detail for hacking p-values into the existing 'gain' system, where larger is assumed to be better.
.setMinInfoGain(0.01) | ||
val treeModel = dt.fit(train) | ||
|
||
// The tree should use exactly one of the 3 features: featue(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: feature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@Since("2.0.0") | ||
@Experimental | ||
object ChiSquared extends Impurity { | ||
private object CSTest extends org.apache.commons.math3.stat.inference.ChiSquareTest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not a private val
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC it was to "allocate on the stack"
61cbf7c
to
6762a18
Compare
Ready for review |
@thunterdb can you take a look at this now that 2.2 is out? |
d2a2381
to
bb2f660
Compare
rebased to latest head of master |
Test build #80093 has finished for PR 13440 at commit
|
@HyukjinKwon @thunterdb can you all take a look at this? It's been under review for quite a long time! |
I don't have ML knowledge enough to review this. I can cc ML committer guys who I can guess have some expertise from git blame but I hope there are some sign-offs left here from some guys here ahead. |
I've not seen chi squared used as a split statistic; when is it theoretically better than entropy? Or something a bit more fundamental like KL divergence? It makes some sense but does require some assumption about the data |
@srowen I discuss some of these questions in the blog post, but the tl/dr is that split quality measures based on statistical tests having p-values are in some senses "less arbitrary." Specifying a p-value as a split quality halting condition has essentially the same semantic regardless of the test. Most such tests also intrinsically take into account decreasing population sizes. As the the splitting progresses and population sizes decrease, it inherently takes a larger and larger population difference to meet the p-value threshold. On the more pragmatic side, in that post I also demonstrate chi-squared split quality generating a more parsimonious tree than other metrics, which does a better job at ignoring poor quality features. |
@srowen @thunterdb any more thoughts on this? |
I agree with @felixcheung -- @srowen or @thunterdb, can you take a look at this? |
Is this still being considered? |
Test build #94042 has finished for PR 13440 at commit
|
Test build #95018 has finished for PR 13440 at commit
|
@srowen @thunterdb this PR passes all tests and merges cleanly -- can you take another look? It's been open for quite a while now. |
// a larger-is-better gain value for the minimum-gain threshold | ||
val minGain = | ||
if (metadata.impurity.isTestStatistic) Impurity.pValToGain(metadata.minInfoGain) | ||
else metadata.minInfoGain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of a design question here... right now the caller has to switch logic based on what's inside metadata. Can methods like metadata.minInfoGain just implement different logic when the impurity is a test statistic, and so on? push this down towards the impurity implementation? I wonder if isTestStatistic
can go away with the right API, but I am not familiar with the details of what that requires.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main issue I recall was that all of the existing metrics assume some kind of "larger is better" gain, and p-values are "smaller is better." I'll take another pass over it and see if I can push that distinction down so it doesn't require exposing new methods.
* :: Experimental :: | ||
* Class for calculating Chi Squared as a split quality metric during binary classification. | ||
*/ | ||
@Since("2.2.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will have to be 2.5.0 for the moment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update those. 3.0 might be a good target, especially if I can't do this without new isTestStatistic
* Get this impurity instance. | ||
* This is useful for passing impurity parameters to a Strategy in Java. | ||
*/ | ||
@Since("1.1.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd label all these as Since 2.5.0 even if they override a method that existed earlier.
*/ | ||
@Since("2.0.0") | ||
@DeveloperApi | ||
def pValToGain(pval: Double): Double = -math.log(math.max(1e-20, pval)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private to spark?
*/ | ||
@Since("2.2.0") | ||
@DeveloperApi | ||
def isTestStatistic: Boolean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding methods to a public trait is technically an API breaking change. This might be considered a Developer API even though it's not labeled that way. Still if we can avoid adding to the API here, it'd be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be customized or extended externally to spark? I'm wondering why it is public.
*/ | ||
@Since("2.2.0") | ||
@DeveloperApi | ||
def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this new method doesn't make sense to implement for existing implementations, only the new one. That kind of suggests to me it isn't part of the generic API for an impurity. Is this really something that belongs inside the logic of the implementations? maybe there's a more general method that needs to be exposed, that can then be specialized for all implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll consider if there's a unifying idea here. pval-based metrics require integrating information across the new split children, which I believe was not the case for existing methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect that the generalization is closer to my newer signature
val pval = imp.calculate(leftImpurityCalculator, rightImpurityCalculator)
where you have all the context from the left and right nodes. The existing gain-based calculation should fit into this framework, just doing its current weighted average of purity gain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srowen @willb
I cached the design of the metrics back in. In general, Impurity
already uses methods that are only defined on certain impurity sub-classes, and so this new method does not change that situation.
My take on the "problem" is that the existing measures are all based on a localized concept of "purity" (or impurity) that can be calculated using only the data at a single node. Splitting based on statistical tests (p-values) breaks that model, since it is making use of a more generalized concept of split quality that requires the sample populations of both children from a candidate split. A maximally general signature would probably involve the parent and both children.
Another kink in the current design is that ImpurityCalculator
is essentially parallel to Impurity
, and in fact ImpurityCalculator#calculate()
is how impurity measures are currently requested. Impurity
seems somewhat redundant, and might be factored out in favor of ImpurityCalculator
. The current signature calculate()
might be generalized into a more inclusive concept of split quality that expects to make use of {parent,left,right}.
Calls to calculate()
are not very wide-spread but threading that change through is outside the scope of this particular PR. If people are interested in that kind of refactoring I could look into it in the near future but probably not in the next couple weeks.
That kind of change would also be API breaking and so a good target for 3.0
Yeah I take your point that the trait Impurity already defines two methods, only one of which is implemented for each of the subclasses. It's already a funky design that probably should have been generalized differently. I think a rewrite for Spark 3 would be worthwhile, personally. I'm also not quite sure of the difference between the Impurity and ImpurityCalculator class; it seems like Impurity should fold into ImpurityCalculator. Is the single method we really want to define something like Well, I think either this gets a bigger redesign in 3.0, or we try to get it into 2.5 and accept some API changes. I think I lean towards a bolder breaking change to fix it up in 3.0, unless there's a pressing need for this metric. |
I think targeting 3.0 with a refactor makes the most sense. There's no way to do this without making small breaking changes, but slightly larger changes could clean up the design. |
update - I'm consulting with some teammates about what it might mean to also support Bayesian variations on split quality, since there has been a lot of interest in the last few years regarding Bayesian alternatives to more traditional p-value based statistics. |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Using test statistics as a measure of decision tree split quality is a useful split halting measure that can yield improved model quality. I am proposing to add the chi-squared test statistic as a new impurity option (in addition to "gini" and "entropy") for classification decision trees and ensembles.
https://issues.apache.org/jira/browse/SPARK-15699
http://erikerlandson.github.io/blog/2016/05/26/measuring-decision-tree-split-quality-with-test-statistic-p-values/
How was this patch tested?
I added unit testing to verify that the chi-squared "impurity" measure functions as expected when used for decision tree training.