[SPARK-15699] [ML] Implement a Chi-Squared test statistic option for measuring split quality #13440

erikerlandson · 2016-06-01T15:38:29Z

What changes were proposed in this pull request?

Using test statistics as a measure of decision tree split quality is a useful split halting measure that can yield improved model quality. I am proposing to add the chi-squared test statistic as a new impurity option (in addition to "gini" and "entropy") for classification decision trees and ensembles.

https://issues.apache.org/jira/browse/SPARK-15699

http://erikerlandson.github.io/blog/2016/05/26/measuring-decision-tree-split-quality-with-test-statistic-p-values/

How was this patch tested?

I added unit testing to verify that the chi-squared "impurity" measure functions as expected when used for decision tree training.

erikerlandson · 2016-06-01T15:39:42Z

This is a re-submission of #13438 to fix the target branch.

SparkQA · 2016-06-01T15:58:54Z

Test build #59740 has finished for PR 13440 at commit 04c1316.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-01T16:34:11Z

Test build #59745 has finished for PR 13440 at commit 1136518.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-01T19:31:17Z

Test build #59751 has finished for PR 13440 at commit 6d38cfd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-23T22:56:57Z

Test build #64309 has finished for PR 13440 at commit 6d38cfd.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

holdenk · 2016-10-07T19:59:22Z

Is this something you're still working on? If so, it would be good to merge in the latest master. We can also check with @jkbradley to see if he has some review bandwidth.

erikerlandson · 2016-10-09T20:59:11Z

@holdenk yes, I'll rebase it this week.

SparkQA · 2016-10-10T23:03:39Z

Test build #66679 has finished for PR 13440 at commit b199ae3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-11T21:00:17Z

Test build #66756 has finished for PR 13440 at commit 83f5e83.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

erikerlandson · 2016-10-11T22:14:17Z

test this please

SparkQA · 2016-10-12T00:38:14Z

Test build #66766 has finished for PR 13440 at commit 83f5e83.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

erikerlandson · 2016-10-12T12:41:21Z

@holdenk @jkbradley looks like it's clean again

wangmiao1981 · 2017-02-16T02:04:50Z

@erikerlandson Are you still working on this PR? Thanks! Miao

erikerlandson · 2017-02-16T17:22:03Z

Hi @wangmiao1981,

I am still interested in this, but I don't have any sense about whether upstream has any interest. Does upstream have any intention to accept it?

SparkQA · 2017-02-16T17:48:44Z

Test build #73006 has started for PR 13440 at commit 61cbf7c.

shaneknapp · 2017-02-16T19:20:55Z

i stopped the build as i need to restart jenkins... i'll retrigger this when we're back up and running.

wangmiao1981 · 2017-02-16T19:35:14Z

@erikerlandson I am just helping clear out the stale PRs. :) I have no idea whether they intend to accept it.

shaneknapp · 2017-02-16T19:40:16Z

test this please

SparkQA · 2017-02-16T22:08:47Z

Test build #73008 has finished for PR 13440 at commit 61cbf7c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangmiao1981 · 2017-02-21T23:21:02Z

@thunterdb Can you take a look? Thanks!

thunterdb · 2017-03-07T23:58:08Z

mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala

+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double =


scala: do not add a default implementation, it causes issues with java compatibility

you cannot guarantee compatibility with existing code here, since you would break the bytecode either way.

IIRC, there were other impurity measures using this 'unsupported' idiom. However, I'm fine with making it abstract, if that's the direction you are thinking.

thunterdb · 2017-03-07T23:58:14Z

mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala

+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def isTestStatistic: Boolean = false


scala: do not add a default implementation, it causes issues with java compatibility

thunterdb · 2017-03-07T23:59:00Z

mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala

+ * Utility functions for Impurity measures
+ */
+@Since("2.0.0")
+@DeveloperApi


there is no need for this object to be publicly exposed?

I don't think so. I don't recall any specific motivation to keep it private, but historically Spark seems to default things to "minimum visibility." The only method currently defined here is an implementation detail for hacking p-values into the existing 'gain' system, where larger is assumed to be better.

thunterdb · 2017-03-08T00:03:42Z

mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala

+      .setMinInfoGain(0.01)
+    val treeModel = dt.fit(train)
+
+    // The tree should use exactly one of the 3 features: featue(0)


nit: feature

thunterdb · 2017-03-08T00:07:29Z

mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala

+@Since("2.0.0")
+@Experimental
+object ChiSquared extends Impurity {
+  private object CSTest extends org.apache.commons.math3.stat.inference.ChiSquareTest()


why not a private val?

IIRC it was to "allocate on the stack"

erikerlandson · 2017-05-26T23:57:24Z

Ready for review

willb · 2017-07-21T15:09:32Z

@thunterdb can you take a look at this now that 2.2 is out?


erikerlandson · 2017-07-31T21:14:25Z

rebased to latest head of master

SparkQA · 2017-08-01T00:17:09Z

Test build #80093 has finished for PR 13440 at commit bb2f660.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

willb · 2017-08-23T21:56:53Z

@HyukjinKwon @thunterdb can you all take a look at this? It's been under review for quite a long time!

HyukjinKwon · 2017-08-23T23:48:53Z

I don't have enough ML knowledge to review this. I can cc ML committers who, judging from git blame, likely have some expertise, but I hope some sign-offs are left here first.

srowen · 2017-08-24T08:46:11Z

I've not seen chi squared used as a split statistic; when is it theoretically better than entropy? Or something a bit more fundamental like KL divergence? It makes some sense but does require some assumption about the data

erikerlandson · 2017-08-24T15:24:31Z

@srowen I discuss some of these questions in the blog post, but the tl;dr is that split quality measures based on statistical tests having p-values are in some senses "less arbitrary." Specifying a p-value as a split quality halting condition has essentially the same semantics regardless of the test. Most such tests also intrinsically take into account decreasing population sizes. As the splitting progresses and population sizes decrease, it inherently takes a larger and larger population difference to meet the p-value threshold.

On the more pragmatic side, in that post I also demonstrate chi-squared split quality generating a more parsimonious tree than other metrics, which does a better job at ignoring poor quality features.
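
To make the population-size point concrete, here is a minimal sketch that computes the Pearson chi-squared statistic for a candidate split by hand. It is not the code in this PR, which delegates to commons-math3's `ChiSquareTest`. The object name `ChiSquaredSplitSketch`, the method `statistic`, and the example counts are all illustrative assumptions. The 3.841 cutoff is the standard chi-squared critical value for one degree of freedom at p = 0.05.

```scala
// Illustrative sketch only; names and counts are hypothetical, not from the PR.
object ChiSquaredSplitSketch {
  // Critical value of the chi-squared distribution, 1 degree of freedom, p = 0.05.
  val Critical05Df1 = 3.841

  /** Pearson chi-squared statistic for the 2 x k table of per-class label
    * counts in the left and right children of a candidate split. */
  def statistic(left: Array[Double], right: Array[Double]): Double = {
    require(left.length == right.length, "children must cover the same classes")
    val nL = left.sum
    val nR = right.sum
    require(nL > 0 && nR > 0, "both children must be non-empty")
    val n = nL + nR
    left.indices.map { k =>
      val classTotal = left(k) + right(k)
      if (classTotal == 0.0) 0.0
      else {
        // Expected counts if class membership were independent of the split.
        val expL = nL * classTotal / n
        val expR = nR * classTotal / n
        math.pow(left(k) - expL, 2) / expL + math.pow(right(k) - expR, 2) / expR
      }
    }.sum
  }

  // Same 60/40 vs 40/60 class proportions, at two population sizes.
  val large = statistic(Array(60.0, 40.0), Array(40.0, 60.0))
  val small = statistic(Array(6.0, 4.0), Array(4.0, 6.0))
}
```

With 100 samples per child the statistic is 8.0, which clears the 3.841 cutoff. With 10 samples per child and identical proportions it drops to about 0.8, which does not. So the same threshold becomes harder to meet as nodes shrink.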

felixcheung · 2017-10-05T05:37:57Z

@srowen @thunterdb any more thoughts on this?
how about @sethah @yanboliang @jkbradley?

willb · 2017-10-18T15:55:50Z

I agree with @felixcheung -- @srowen or @thunterdb, can you take a look at this?

jsigee87 · 2018-04-08T22:33:10Z

Is this still being considered?

SparkQA · 2018-08-02T22:33:33Z

Test build #94042 has finished for PR 13440 at commit bb2f660.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-21T18:23:27Z

Test build #95018 has finished for PR 13440 at commit bb2f660.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

willb · 2018-09-17T15:18:39Z

@srowen @thunterdb this PR passes all tests and merges cleanly -- can you take another look? It's been open for quite a while now.

srowen · 2018-09-17T20:05:43Z

mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala

+    // a larger-is-better gain value for the minimum-gain threshold
+    val minGain =
+      if (metadata.impurity.isTestStatistic) Impurity.pValToGain(metadata.minInfoGain)
+      else metadata.minInfoGain


Kind of a design question here... right now the caller has to switch logic based on what's inside metadata. Can methods like metadata.minInfoGain just implement different logic when the impurity is a test statistic, and so on? push this down towards the impurity implementation? I wonder if isTestStatistic can go away with the right API, but I am not familiar with the details of what that requires.

The main issue I recall was that all of the existing metrics assume some kind of "larger is better" gain, and p-values are "smaller is better." I'll take another pass over it and see if I can push that distinction down so it doesn't require exposing new methods.
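
As a rough illustration of that inversion, the sketch below reuses the `pValToGain` formula from the diff in this PR and wraps the threshold selection in two hypothetical helpers. `PValueGainSketch`, `effectiveMinGain`, and `passes` are assumptions made for this example and are not part of Spark.

```scala
// Illustrative sketch; only the pValToGain formula is taken from the PR diff.
object PValueGainSketch {
  // Map a "smaller is better" p-value onto a "larger is better" gain.
  // The floor of 1e-20 keeps the log finite when the p-value underflows to 0.
  def pValToGain(pval: Double): Double = -math.log(math.max(1e-20, pval))

  // Hypothetical helper mirroring the minGain selection shown above:
  // for test statistics, minInfoGain is read as a p-value threshold.
  def effectiveMinGain(isTestStatistic: Boolean, minInfoGain: Double): Double =
    if (isTestStatistic) pValToGain(minInfoGain) else minInfoGain

  // Hypothetical check: a split is kept when its converted gain clears
  // the converted threshold.
  def passes(splitPValue: Double, pValueThreshold: Double): Boolean =
    pValToGain(splitPValue) >= effectiveMinGain(isTestStatistic = true, pValueThreshold)
}
```

Because the negated log is monotonically decreasing, a split whose p-value is below the threshold ends up with a gain above the converted threshold, so the existing "gain must be at least minGain" comparison can stay unchanged.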

srowen · 2018-09-17T20:06:14Z

mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala

+ * :: Experimental ::
+ * Class for calculating Chi Squared as a split quality metric during binary classification.
+ */
+@Since("2.2.0")


This will have to be 2.5.0 for the moment

I'll update those. 3.0 might be a good target, especially if I can't do this without new isTestStatistic

srowen · 2018-09-17T20:06:42Z

mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala

+   * Get this impurity instance.
+   * This is useful for passing impurity parameters to a Strategy in Java.
+   */
+  @Since("1.1.0")


I think I'd label all these as Since 2.5.0 even if they override a method that existed earlier.

srowen · 2018-09-17T20:09:53Z

mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala

+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def pValToGain(pval: Double): Double = -math.log(math.max(1e-20, pval))


private to spark?

srowen · 2018-09-17T20:10:12Z

mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala

+   */
+  @Since("2.2.0")
+  @DeveloperApi
+  def isTestStatistic: Boolean


Adding methods to a public trait is technically an API breaking change. This might be considered a Developer API even though it's not labeled that way. Still if we can avoid adding to the API here, it'd be better.

Can this be customized or extended externally to spark? I'm wondering why it is public.

srowen · 2018-09-17T20:11:29Z

mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala

+   */
+  @Since("2.2.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double =


It looks like this new method doesn't make sense to implement for existing implementations, only the new one. That kind of suggests to me it isn't part of the generic API for an impurity. Is this really something that belongs inside the logic of the implementations? maybe there's a more general method that needs to be exposed, that can then be specialized for all implementations.

I'll consider if there's a unifying idea here. pval-based metrics require integrating information across the new split children, which I believe was not the case for existing methods.

I suspect that the generalization is closer to my newer signature
val pval = imp.calculate(leftImpurityCalculator, rightImpurityCalculator)
where you have all the context from the left and right nodes. The existing gain-based calculation should fit into this framework, just doing its current weighted average of purity gain.

@srowen @willb
I cached the design of the metrics back in. In general, Impurity already uses methods that are only defined on certain impurity sub-classes, and so this new method does not change that situation.

My take on the "problem" is that the existing measures are all based on a localized concept of "purity" (or impurity) that can be calculated using only the data at a single node. Splitting based on statistical tests (p-values) breaks that model, since it is making use of a more generalized concept of split quality that requires the sample populations of both children from a candidate split. A maximally general signature would probably involve the parent and both children.

Another kink in the current design is that ImpurityCalculator is essentially parallel to Impurity, and in fact ImpurityCalculator#calculate() is how impurity measures are currently requested. Impurity seems somewhat redundant, and might be factored out in favor of ImpurityCalculator. The current signature calculate() might be generalized into a more inclusive concept of split quality that expects to make use of {parent,left,right}.

Calls to calculate() are not very widespread, but threading that change through is outside the scope of this particular PR. If people are interested in that kind of refactoring, I could look into it in the near future, but probably not in the next couple of weeks.

That kind of change would also be API breaking and so a good target for 3.0
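
One possible shape for such a generalized signature is sketched below, purely as an assumption for discussion. None of the names (`SplitQualitySketch`, `SplitQuality`, `quality`, `GiniGain`) exist in Spark, and plain count arrays stand in for `ImpurityCalculator`. The sketch shows how an existing gain-style measure would fit a {parent, left, right} interface.

```scala
// Hypothetical API sketch; none of these names exist in Spark.
object SplitQualitySketch {
  /** Split quality computed from the per-class counts of the parent and
    * both children; larger is better. */
  trait SplitQuality {
    def quality(parent: Array[Double], left: Array[Double], right: Array[Double]): Double
  }

  // Gini impurity of a single node: 1 minus the sum of squared class frequencies.
  def gini(counts: Array[Double]): Double = {
    val n = counts.sum
    if (n == 0.0) 0.0 else 1.0 - counts.map(c => (c / n) * (c / n)).sum
  }

  /** Gain-style measure: parent impurity minus the weighted average
    * of the child impurities. */
  object GiniGain extends SplitQuality {
    def quality(parent: Array[Double], left: Array[Double], right: Array[Double]): Double = {
      val n = parent.sum
      require(n > 0, "parent must be non-empty")
      gini(parent) - (left.sum / n) * gini(left) - (right.sum / n) * gini(right)
    }
  }
}
```

A test-statistic measure could implement the same trait by ignoring `parent` and returning a converted p-value computed from `left` and `right`, which is the case the single-node impurity model cannot express.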

srowen · 2018-09-22T21:05:16Z

Yeah I take your point that the trait Impurity already defines two methods, only one of which is implemented for each of the subclasses. It's already a funky design that probably should have been generalized differently. I think a rewrite for Spark 3 would be worthwhile, personally. I'm also not quite sure of the difference between the Impurity and ImpurityCalculator class; it seems like Impurity should fold into ImpurityCalculator.

Is the single method we really want to define something like computeInformationGain(ImpurityCalculator, ImpurityCalculator)? Even the new method you've added is not directly computing info gain, nor were the existing ones in Impurity. But that's the thing we need an abstraction for over several implementations, it seems.

Well, I think either this gets a bigger redesign in 3.0, or we try to get it into 2.5 and accept some API changes. I think I lean towards a bolder breaking change to fix it up in 3.0, unless there's a pressing need for this metric.

erikerlandson · 2018-09-22T22:35:13Z

I think targeting 3.0 with a refactor makes the most sense. There's no way to do this without making small breaking changes, but slightly larger changes could clean up the design. ImpurityCalculator can subsume Impurity, and a more general rethinking of gain and impurity can be accommodated too.

erikerlandson · 2018-10-19T19:35:51Z

update - I'm consulting with some teammates about what it might mean to also support Bayesian variations on split quality, since there has been a lot of interest in the last few years regarding Bayesian alternatives to more traditional p-value based statistics.

github-actions · 2020-01-18T00:08:20Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

erikerlandson mentioned this pull request Jun 1, 2016

Implement a Chi-Squared test statistic option for measuring split quality #13438

Closed

erikerlandson force-pushed the chisquared_split_quality branch from 1136518 to 6d38cfd Compare June 1, 2016 17:17

erikerlandson force-pushed the chisquared_split_quality branch from 6d38cfd to b199ae3 Compare October 10, 2016 20:39

erikerlandson force-pushed the chisquared_split_quality branch from b199ae3 to 83f5e83 Compare October 11, 2016 19:03

thunterdb reviewed Mar 7, 2017


thunterdb reviewed Mar 8, 2017


erikerlandson force-pushed the chisquared_split_quality branch from 61cbf7c to 6762a18 Compare April 13, 2017 23:23

erikerlandson added 6 commits July 31, 2017 14:11

Implement a Chi-Squared test statistic option for measuring split qua…

acbc515

…lity when training decision trees

remove default defs

ae8f7ea

fix typo

cb76359

add scaladoc for Entropy, Gini, Variance

345ba6a

try documenting toString

11e6ed5

remove link that was breaking javadoc

bb2f660

erikerlandson force-pushed the chisquared_split_quality branch from d2a2381 to bb2f660 Compare July 31, 2017 21:12

srowen reviewed Sep 17, 2018


dongjoon-hyun added ML MLLIB labels Jun 14, 2019

github-actions bot added the Stale label Jan 18, 2020

github-actions bot closed this Jan 20, 2020
