from __future__ import annotations # pylint: disable=too-many-lines
from datetime import timedelta
from itertools import combinations
import numpy as np
import pandas as pd
from elexsolver.OLSRegressionSolver import OLSRegressionSolver
from elexsolver.QuantileRegressionSolver import QuantileRegressionSolver
from scipy.linalg import block_diag
from scipy.special import expit
from elexmodel.handlers.data.Featurizer import Featurizer
from elexmodel.logger import getModelLogger
from elexmodel.models.BaseElectionModel import BaseElectionModel, PredictionIntervals
LOG = getModelLogger()
class BootstrapElectionModelException(Exception):
pass
class BootstrapElectionModel(BaseElectionModel):
"""
The bootstrap election model.
This model uses ordinary least squares regression for point predictions and the bootstrap to generate
prediction intervals.
In this setup, we are modeling normalized two party margin.
But because we need to sum our estimand from the unit level to the aggregate level, we need
to be able to convert normalized margin to unnormalized margin (since normalized margins don't sum).
This means on the unit level we will be modeling unnormalized margin and on the aggregate level normalized margin.
We have found that instead of modeling the unit margin directly,
decomposing the estimand into two quantities works better because the linear
model assumption is more plausible for each individually.
The two quantities are normalized margin (y) and a turnout factor (z)
y = (D^{Y} - R^{Y}) / (D^{Y} + R^{Y})
z = (D^{Y} + R^{Y}) / (D^{Y'} + R^{Y'})
where Y is the year we are modeling and Y' is the previous election year.
If we define weights to be the total two party turnout in a previous election:
w = (D^{Y'} + R^{Y'}) -> w * y * z = (D^{Y} - R^{Y})
so (w * y * z) is the unnormalized margin.
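For example (hypothetical numbers): if D^{Y} = 60, R^{Y} = 40, D^{Y'} = 55 and R^{Y'} = 45, then
y = (60 - 40) / (60 + 40) = 0.2, z = (60 + 40) / (55 + 45) = 1.0 and w = 55 + 45 = 100,
so w * y * z = 100 * 0.2 * 1.0 = 20 = D^{Y} - R^{Y}, the unnormalized margin.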
We define our model as:
y_i = f_y(x) + epsilon_y(x)
z_i = f_z(x) + epsilon_z(x)
where f_y(x) and f_z(x) are ordinary least squares regressions
and the epsilons are contest (state/district) level random effects.
"""
def __init__(self, model_settings={}, versioned_data_handler=None, pres_predictions=None):
super().__init__(model_settings)
self.B = model_settings.get("B", 500) # number of bootstrap samples
self.strata = model_settings.get("strata", ["county_classification"]) # columns to stratify the data by
self.T = model_settings.get("T", 5000) # temperature for aggregate model
self.hard_threshold = model_settings.get(
"agg_model_hard_threshold", True
) # use sigmoid or hard threshold when calculating agg model
self.district_election = model_settings.get("district_election", False)
self.lambda_ = model_settings.get("lambda_", None) # regularization parameter for OLS
# save versioned data for later use
self.versioned_data_handler = versioned_data_handler
self.extrapolate_threshold = model_settings.get("extrapolate_threshold", 75)
self.min_extrapolating_units = model_settings.get("min_extrapolating_units", 5)
self.extrapolate_std_method = model_settings.get("extrapolate_std_method", "std")
self.max_dist_to_observed = model_settings.get("max_dist_to_observed", 5)
# save presidential predictions for later use
self.pres_predictions = pres_predictions
self.correct_from_presidential = model_settings.get("correct_from_presidential", False)
# upper and lower bounds for the quantile regression which define the strata distributions
# these make sure that we can control the worst cases for the distributions in case we
# haven't seen enough data yet
self.y_LB = model_settings.get("y_LB", -0.3) # normalized margin lower bound
self.y_UB = model_settings.get("y_UB", 0.3) # normalized margin upper bound
self.z_LB = model_settings.get("z_LB", -0.5) # turnout factor lower bound
self.z_UB = model_settings.get("z_UB", 0.5) # turnout factor upper bound
# percentiles to compute the strata distributions for
self.taus_lower = np.arange(0.01, 0.5, 0.01)
self.taus_upper = np.arange(0.50, 1, 0.01)
self.taus = np.concatenate([self.taus_lower, self.taus_upper])
# upper and lower bounds for normalized margin and turnout factor as to how the "outstanding vote" in
# non-reporting units can go. Used to clip our predictions
self.y_unobserved_lower_bound = model_settings.get("y_unobserved_lower_bound", -1.0)
self.y_unobserved_upper_bound = model_settings.get("y_unobserved_upper_bound", 1.0)
self.percent_expected_vote_error_bound = model_settings.get("percent_expected_vote_error_bound", 0.5)
self.z_unobserved_upper_bound = model_settings.get("z_unobserved_upper_bound", 1.5)
self.z_unobserved_lower_bound = model_settings.get("z_unobserved_lower_bound", 0.5)
self.states_for_separate_model = model_settings.get("states_for_separate_model", [])
self.featurizer = Featurizer(
self.features, self.fixed_effects, states_for_separate_model=self.states_for_separate_model
)
self.seed = model_settings.get("seed", 0)
self.rng = np.random.default_rng(seed=self.seed) # used for sampling
self.ran_bootstrap = False
# these are the max/min values for called races. Ie. if a contest is called for LHS party then the prediction/intervals should be at least lhs_called_threshold
# if a contest is called for RHS party then the prediction/interval should be at most rhs_called_threshold (at most because the values are negative)
self.lhs_called_threshold = 0.005
self.rhs_called_threshold = -0.005
# this is the correlation structure we impose when we sample from the contest level random effects
self.contest_correlations = model_settings.get("contest_correlations", [])
# impose perfect correlation in the national summary aggregation
self.national_summary_correlation = model_settings.get("national_summary_correlation", True)
self.stop_model_call = None
# Assume that we have a baseline normalized margin
# (D^{Y'} - R^{Y'}) / (D^{Y'} + R^{Y'}) is one of the covariates
if "baseline_normalized_margin" not in self.features:
raise BootstrapElectionModelException(
"baseline_normalized_margin not included as feature. This is necessary for the model to work."
)
def cv_lambda(
self, x: np.ndarray, y: np.ndarray, lambdas_: np.ndarray, weights: np.ndarray | None = None, k: int = 5
) -> float:
"""
This function does k-fold cross validation for an OLS regression model given x, y and a set of lambdas to try out.
It returns the lambda that minimizes the k-fold cross validation loss.
"""
# if weights are none assume that all samples have equal weights
if weights is None:
weights = np.ones((y.shape[0], 1))
# concatenate since we need to keep x, y, weight samples together
x_y_w = np.concatenate([x, y, weights], axis=1)
self.rng.shuffle(x_y_w)
# generate k chunks
chunks = np.array_split(x_y_w, k, axis=0)
errors = np.zeros((len(lambdas_),))
# for each possible lambda value perform k-fold cross validation
# ie. train model on k-1 chunks and evaluate on one chunk (for all possible k combinations of heldout chunk)
for i, lambda_ in enumerate(lambdas_):
for test_chunk in range(k):
x_y_w_test = chunks[test_chunk]
# extract all chunks except for the current test chunk
x_y_w_train = np.concatenate(chunks[:test_chunk] + chunks[(test_chunk + 1) :], axis=0) # noqa: 203
# undo the concatenation above
x_test = x_y_w_test[:, :-2]
y_test = x_y_w_test[:, -2]
w_test = x_y_w_test[:, -1]
x_train = x_y_w_train[:, :-2]
y_train = x_y_w_train[:, -2]
w_train = x_y_w_train[:, -1]
ols_lambda = OLSRegressionSolver()
ols_lambda.fit(
x_train,
y_train,
weights=w_train,
lambda_=lambda_,
fit_intercept=True,
regularize_intercept=False,
n_feat_ignore_reg=1 + len(self.states_for_separate_model),
)
y_hat_lambda = ols_lambda.predict(x_test)
# error is the weighted sum of squares of the residual between
# the actual heldout y and the predicted y on the heldout set
errors[i] += np.sum(
w_test * ols_lambda.residuals(y_test, y_hat_lambda, loo=False, center=False) ** 2
) / np.sum(w_test)
# return lambda that minimizes the k-fold error
# np.argmin returns the first occurrence if there are multiple minimum values
return lambdas_[np.argmin(errors)]
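# Illustrative usage sketch (hypothetical arrays, not a call that appears in this file):
#     candidate_lambdas = np.logspace(-3, 2, num=10)
#     best_lambda = self.cv_lambda(x_train, y_train, candidate_lambdas, weights=w_train)
# where x_train is (n, p) and y_train / w_train are (n, 1); the returned scalar would then be
# passed along as the lambda_ argument when fitting OLSRegressionSolver.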
def get_minimum_reporting_units(self, alpha: float) -> int:
# arbitrary, just enough to fit coefficients
return 10
def _estimate_epsilon(self, residuals: np.ndarray, aggregate_indicator: np.ndarray) -> np.ndarray:
"""
This function estimates the epsilon (contest level random effects)
"""
# the estimate for epsilon is the average of the residuals in each contest
epsilon_hat, _, _, _ = np.linalg.lstsq(aggregate_indicator, residuals)
# we can't estimate a contest level effect if we only have 1 unit in that contest (since the residual can
# just be made equal to zero by setting the random effect to that value)
epsilon_hat[aggregate_indicator.sum(axis=0) < 2] = 0
return epsilon_hat
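# Worked example (hypothetical numbers): a contest with two training units whose residuals are
# 0.02 and 0.04 gets epsilon_hat = 0.03, since the least squares solution against a one-hot contest
# indicator is the per-contest mean; a contest with a single unit is zeroed out by the rule above.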
def _estimate_delta(
self, residuals: np.ndarray, epsilon_hat: np.ndarray, aggregate_indicator: np.ndarray
) -> np.ndarray:
"""
This function estimates delta (the final unit level residual of the model)
"""
# our estimate for delta is the difference between the residual and
# what can be explained by the contest level random effect
return (residuals - (aggregate_indicator @ epsilon_hat)).flatten()
def _estimate_model_errors(
self, model: OLSRegressionSolver, x: np.ndarray, y: np.ndarray, aggregate_indicator: np.ndarray
) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
"""
This function estimates all components of the error in our bootstrap model:
residual: the centered leave one out residual, ie the difference between
our OLS prediction and our actual training values.
this includes the component of the error that we want to explain using a contest level random effect
epsilon_hat: our estimate for the contest level effect in our model
(e.g. how much did each state contribute)
delta_hat: the unit level error
(ie. the difference between the OLS residual and the contest level state effect)
"""
# get unit level predictions from OLS
y_pred = model.predict(x)
# compute residuals
residuals_y = model.residuals(y, y_pred, loo=True, center=True)
# estimate contest level effect (the average residual in units of a contest)
epsilon_y_hat = self._estimate_epsilon(residuals_y, aggregate_indicator)
# compute delta, which is the left over residual after removing the contest level effect
delta_y_hat = self._estimate_delta(residuals_y, epsilon_y_hat, aggregate_indicator)
return residuals_y, epsilon_y_hat, delta_y_hat
def _estimate_strata_dist(
self,
x_train: np.ndarray,
x_train_strata: np.ndarray,
x_test: np.ndarray,
x_test_strata: np.ndarray,
delta_hat: np.ndarray,
lb: float,
ub: float,
) -> tuple[dict, dict]:
"""
This function generates the distribution (ie. CDF/PPF) for the strata in which we want to exchange
bootstrap errors in this model.
"""
stratum_ppfs_delta = {}
stratum_cdfs_delta = {}
def ppf_creator(betas: np.ndarray, taus: np.ndarray, lb: float, ub: float) -> float:
"""
Creates a percent point function (PPF), the inverse of a cumulative distribution function (CDF).
Given a percentile, returns the value at which the CDF reaches that percentile.
"""
# we interpolate, because we want to return smooth betas
return lambda p: np.interp(p, taus, betas, lb, ub)
def cdf_creator(betas: np.ndarray, taus: np.ndarray) -> float:
"""
Creates a cumulative distribution function (CDF)
Provides the probability that a value is at most x
"""
# interpolates because we want to provide smooth probabilities
return lambda x: np.interp(x, betas, taus, right=1)
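# Worked example for the two helpers above (hypothetical numbers): with taus = [0.25, 0.5, 0.75]
# and betas = [-0.1, 0.0, 0.1], the ppf evaluated at 0.5 returns 0.0 (the stratum median) and the
# cdf evaluated at 0.05 returns 0.625 (linear interpolation between the 0.5 and 0.75 quantiles).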
# we need the unique strata that exist in both the training and in the holdout data
# since we want a distribution for all strata
x_strata = np.unique(np.concatenate([x_train_strata, x_test_strata], axis=0), axis=0).astype(int)
# We compute the probability distribution for each stratum by fitting quantile regressions
# this works because the regression covariates are only dummy variables that define
# the strata. Therefore the i-th coefficient for a quantile regression at level tau is the
# tau-th quantile of the deltas where dummy i == 1
# ie. if tau is 0.5 and there are two unique dummies (x = [[0, 1], [1, 0], ...]), then
# the first coefficient is the median of all deltas where x = [1, 0] and the second coefficient
# is the median of all deltas where x = [0, 1]
# since the covariates define the strata, we get the tau-th (e.g. median or 30th percentile)
# delta per strata as the beta for that regression, which defines a probability distribution.
# for each stratum we want to add a worst case lower bound (for the taus between 0-0.49)
# and upper bound (for the taus between 0.5-1) so that if we sample a uniform from a different
# stratum that is larger/smaller than any one we have seen in this strata, we have a worst case
# value that is larger/smaller than what we saw in that strata. We do this by simply adding the
# lower/upper bound to the regression, one pair for each stratum.
# This lower/upper bound is set manually, note that if we observe a value that is more extreme than
# the lower/upper bound we are fine, since the quantile regression will use that instead
for x_stratum in x_strata:
x_train_aug = np.concatenate([x_train_strata, x_stratum.reshape(1, -1)], axis=0)
delta_aug_lb = np.concatenate([delta_hat, [lb]])
delta_aug_ub = np.concatenate([delta_hat, [ub]])
# fit the regressions to create the probability distributions
# for a single regression beta[i] is the tau-th (e.g. median or 30th percentile)
# for where dummy variable position i is equal to 1
# since we are fitting many quantile regressions at the same time, our beta is
# beta[tau, i] where tau stretches from 0.01 to 0.99
qr_lower = QuantileRegressionSolver()
qr_lower.fit(x_train_aug, delta_aug_lb, self.taus_lower, fit_intercept=False)
betas_lower = qr_lower.coefficients
qr_upper = QuantileRegressionSolver()
qr_upper.fit(x_train_aug, delta_aug_ub, self.taus_upper, fit_intercept=False)
betas_upper = qr_upper.coefficients
betas = np.concatenate([betas_lower, betas_upper])
# for each strata, we take the betas that belong to that stratum
# ie. for stratum [0, 1, 0] we take the betas (there is one for each tau between 0.01, 0.99])
# at position 1 (0-indexed)
# get all the betas for where x_stratum has a 1 (ie [1, 0, 0] position 0, [0, 1, 0] position 1 etc.)
betas_stratum = betas[:, np.where(x_stratum == 1)[0]].sum(axis=1)
# for this stratum value create ppf
# we want the lower bounds and upper bounds to be the actual lower and upper values taken from beta
stratum_ppfs_delta[tuple(x_stratum)] = ppf_creator(
betas_stratum, self.taus, np.min(betas_stratum), np.max(betas_stratum)
)
# for this stratum value create cdf
stratum_cdfs_delta[tuple(x_stratum)] = cdf_creator(betas_stratum, self.taus)
return stratum_ppfs_delta, stratum_cdfs_delta
# TODO: figure out how to better estimate margin_upper/lower_bound
def _generate_nonreporting_bounds(
self, nonreporting_units: pd.DataFrame, bootstrap_estimand: str, n_bins: int = 10
) -> tuple[np.ndarray, np.ndarray]:
"""
This function creates upper and lower bounds for y and z based on the expected vote
that we have for each unit. This is used to clip our predictions
"""
# turn percent expected vote for nonreporting units into a fraction (also clip at 100)
nonreporting_expected_vote_frac = nonreporting_units.percent_expected_vote.values.clip(max=100) / 100
if bootstrap_estimand == "results_normalized_margin":
unobserved_upper_bound = self.y_unobserved_upper_bound
unobserved_lower_bound = self.y_unobserved_lower_bound
# the upper bound for the LHS party if all the outstanding vote goes in their favour
upper_bound = (
nonreporting_expected_vote_frac * nonreporting_units[bootstrap_estimand]
+ (1 - nonreporting_expected_vote_frac) * unobserved_upper_bound
)
# the lower bound for the LHS party if all the outstanding vote goes against them
lower_bound = (
nonreporting_expected_vote_frac * nonreporting_units[bootstrap_estimand]
+ (1 - nonreporting_expected_vote_frac) * unobserved_lower_bound
)
elif bootstrap_estimand == "turnout_factor":
# our error bound for how much error we think our results provider has with expected vote
# e.g. 0.7 of the vote is in for a county, if percent_expected_vote_error_bound is 0.1
# then we are saying that we believe between 0.6 and 0.8 of the vote is in for that county
percent_expected_vote_error_bound = self.percent_expected_vote_error_bound
unobserved_upper_bound = self.z_unobserved_upper_bound
unobserved_lower_bound = self.z_unobserved_lower_bound
# inflate or deflate turnout factor appropriately
lower_bound = nonreporting_units[bootstrap_estimand] / (
nonreporting_expected_vote_frac + percent_expected_vote_error_bound
)
upper_bound = nonreporting_units[bootstrap_estimand] / (
nonreporting_expected_vote_frac - percent_expected_vote_error_bound
).clip(min=0.01)
# if 0 percent of the vote is in, the upper bound would be zero if we used the above
# code. So instead we set it to the naive bound
upper_bound[np.isclose(upper_bound, 0)] = unobserved_upper_bound
# if percent reporting is below 50% or essentially complete, don't try to compute anything and revert to naive bounds
lower_bound[
(nonreporting_expected_vote_frac < 0.5) | np.isclose(nonreporting_expected_vote_frac, 1)
] = unobserved_lower_bound
upper_bound[
(nonreporting_expected_vote_frac < 0.5) | np.isclose(nonreporting_expected_vote_frac, 1)
] = unobserved_upper_bound
return lower_bound.values.reshape(-1, 1), upper_bound.values.reshape(-1, 1)
def _strata_pit(
self,
x_train_strata: pd.DataFrame,
x_train_strata_unique: np.ndarray,
delta_hat: np.ndarray,
stratum_cdfs_delta: dict,
) -> np.ndarray:
"""
Apply the probability integral transform for each strata
"""
# We convert the deltas in the training data to their percentiles in uniform space
unifs = []
# for each stratum, apply the CDF to the residuals to turn them into percentiles
for strata_dummies in x_train_strata_unique:
# grab all deltas that belong to strata defined by strata_dummies
delta_strata = delta_hat[np.where((strata_dummies == x_train_strata).all(axis=1))[0]]
# plug the deltas into their CDF to get uniform random variables (probability integral transform)
# 1e-6 is added to avoid numerical issues
unifs_strata = stratum_cdfs_delta[tuple(strata_dummies)](delta_strata + 1e-6).reshape(-1, 1)
# if the uniform is close to 1/0 set percentile to 0.01/0.99 also for numerical issues
unifs_strata[np.isclose(unifs_strata, 1)] = np.max(self.taus)
unifs_strata[np.isclose(unifs_strata, 0)] = np.min(self.taus)
unifs.append(unifs_strata)
return np.concatenate(unifs).reshape(-1, 1)
def _bootstrap_deltas(
self,
unifs: np.ndarray,
x_train_strata: pd.DataFrame,
x_train_strata_unique: np.ndarray,
stratum_ppfs_delta_y: dict,
stratum_ppfs_delta_z: dict,
) -> tuple[np.ndarray, np.ndarray]:
"""
Bootstrap deltas (unit level errors of our entire model)
this is done by re-sampling from uniform random variables
"""
n_train = unifs.shape[0]
# re-sample uniform random variables
# we are re-sampling over all uniforms (not just per strata), which gives us more randomness
# but we can guarantee that the deltas per strata will be correct because when we convert
# the uniforms back, we use the distribution of the stratum that the training point came from.
# also note unifs has shape (n_train, 2), so unifs_B has shape (n_train, self.B, 2) and we are re-sampling the deltas
# for y and z jointly
unifs_B = self.rng.choice(unifs, (n_train, self.B), replace=True)
delta_y_B = np.zeros((n_train, self.B))
delta_z_B = np.zeros((n_train, self.B))
# convert percentile (uniforms) back into residuals
# we need to do this separately for each strata because there is a different
# inverse function per stratum
for strata_dummies in x_train_strata_unique:
# grab all training indices where units belong to strata strata_dummies
strata_indices = np.where((strata_dummies == x_train_strata).all(axis=1))[0]
# get their corresponding uniform random variables
unifs_strata = unifs_B[strata_indices]
# convert back to deltas. unifs_stratas last dimension defines either y or z
delta_y_B[strata_indices] = stratum_ppfs_delta_y[tuple(strata_dummies)](unifs_strata[:, :, 0])
delta_z_B[strata_indices] = stratum_ppfs_delta_z[tuple(strata_dummies)](unifs_strata[:, :, 1])
return delta_y_B, delta_z_B
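# Minimal sketch of the round trip implemented by _strata_pit and _bootstrap_deltas (hypothetical
# stratum with deltas [-0.2, 0.0, 0.2]): the stratum CDF maps each observed delta to a percentile
# (roughly 0.25, 0.5 and 0.75), those percentiles are re-sampled with replacement across all strata,
# and the stratum PPF maps the re-sampled percentiles back into delta space, so every bootstrapped
# delta is a plausible draw from the stratum it is assigned to.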
# TODO: what if hat_epsilon_y, hat_epsilon_z are correlated?
def _bootstrap_epsilons(
self,
epsilon_y_hat: np.ndarray,
epsilon_z_hat: np.ndarray,
x_train_strata: pd.DataFrame,
x_train_strata_unique: np.ndarray,
stratum_ppfs_delta_y: dict,
stratum_ppfs_delta_z: dict,
aggregate_indicator_train: np.ndarray,
) -> tuple[np.ndarray, np.ndarray]:
"""
Bootstrap epsilons (contest level random effects) using the parametric bootstrap
In a random effects model we assume that the contest level random effects are drawn
from a normal distribution with mean zero and variance Sigma
epsilon_y, epsilon_z ~ N(0, Sigma)
and for now we assume Sigma is diagonal
(e.g. contest level random effects between normalized margin and turnout are uncorrelated)
"""
# We first fit the parameters of the multivariate normal distribution that we are assuming epsilon comes
# from and then we sample from it B times to generate bootstrapped contest level random effects.
# The mean of our normal distribution is epsilon_hat (our best guess for contest level effects)
# we do this to maintain the sign of the contest level effect
# We now need to estimate Sigma, the covariance matrix of our contest level effects.
# We assume that the contest level random effects between contests are uncorrelated, so the off-diagonal
# elements of \Sigma are zero.
# To estimate the variance (diagonals) we use the sum of the variances of the units within a contest.
# This is because residuals = epsilons + delta, if we have observed units in a contest then
# the true epsilon is no longer a random quantity and so residual_{i, j} ~ N(epsilon_{i}, \sigma_{j})
# (for i in contest and j in strata)
# which means that var(residuals) = var(delta). So that epsilon_hat = mean(residuals)
# so that var(epsilon_hat) = var(mean(residuals)) = mean(var(residuals)) = mean(var(delta))
# The variance of deltas is defined by their probability distribution (it's the same per stratum
# since we assume that deltas are iid within a stratum). So we use the distribution to compute
# the variance of the delta. We use interquartile-range (IQR) since this is more robust than
# sample standard deviation
# we first compute the unit variance estimate for each stratum (since that defines the variance per unit)
iqr_y_strata = {}
iqr_z_strata = {}
for x_stratum in x_train_strata_unique:
x_stratum_delta_y_ppf = stratum_ppfs_delta_y[tuple(x_stratum)]
iqr_y = x_stratum_delta_y_ppf(0.75) - x_stratum_delta_y_ppf(0.25)
iqr_y_strata[tuple(x_stratum)] = iqr_y
x_stratum_delta_z_ppf = stratum_ppfs_delta_z[tuple(x_stratum)]
iqr_z = x_stratum_delta_z_ppf(0.75) - x_stratum_delta_z_ppf(0.25)
iqr_z_strata[tuple(x_stratum)] = iqr_z
# set variance per contest to be zero
var_epsilon_y = np.zeros((aggregate_indicator_train.shape[1],))
var_epsilon_z = np.zeros((aggregate_indicator_train.shape[1],))
# we are now adding the unit variances to create the contest level variances
for strata_dummies in x_train_strata_unique:
# grab the indices of the units (per contest) that are within the strata defined by strata_dummies
strata_indices = np.where((strata_dummies == x_train_strata).all(axis=1))[0]
# add variances
# we square because IQR approximates standard deviation
var_epsilon_y += (
aggregate_indicator_train[strata_indices] * (iqr_y_strata[tuple(strata_dummies)] ** 2)
).sum(axis=0)
var_epsilon_z += (
aggregate_indicator_train[strata_indices] * (iqr_z_strata[tuple(strata_dummies)] ** 2)
).sum(axis=0)
# IQR constant for a normal random variable
iqr_scale = 1.349
var_epsilon_y /= iqr_scale**2
var_epsilon_z /= iqr_scale**2
var_epsilon_y /= aggregate_indicator_train.sum(axis=0)
var_epsilon_z /= aggregate_indicator_train.sum(axis=0)
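# Worked example (hypothetical numbers): a contest with 4 units, all in a stratum whose delta IQR
# is 0.027, gets var_epsilon = 4 * 0.027**2 / 1.349**2 / 4 ≈ 4.0e-4, i.e. a standard deviation of
# roughly 0.02 for the sampled contest level random effect.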
# if we only have 1 unit in a contest we define the variance to be zero
var_epsilon_y[aggregate_indicator_train.sum(axis=0) < 2] = 0
var_epsilon_z[aggregate_indicator_train.sum(axis=0) < 2] = 0
# sample B new epsilons
epsilon_y_B = self.rng.multivariate_normal(
mean=epsilon_y_hat.flatten(), cov=np.diag(var_epsilon_y), size=self.B
).T
epsilon_z_B = self.rng.multivariate_normal(
mean=epsilon_z_hat.flatten(), cov=np.diag(var_epsilon_z), size=self.B
).T
return epsilon_y_B, epsilon_z_B
def _bootstrap_errors(
self,
epsilon_y_hat: np.ndarray,
epsilon_z_hat: np.ndarray,
delta_y_hat: np.ndarray,
delta_z_hat: np.ndarray,
x_train_strata: pd.DataFrame,
stratum_cdfs_y: dict,
stratum_cdfs_z: dict,
stratum_ppfs_delta_y: dict,
stratum_ppfs_delta_z: dict,
aggregate_indicator_train: np.ndarray,
) -> tuple[tuple[np.ndarray, np.ndarray], tuple[np.ndarray, np.ndarray]]:
"""
Bootstrap the errors of our model (epsilon and delta)
"""
# get unique strata that appear in the training data
x_train_strata_unique = np.unique(x_train_strata, axis=0).astype(int)
# bootstrap the epsilons. Uses the parametric bootstrap
epsilon_y_B, epsilon_z_B = self._bootstrap_epsilons(
epsilon_y_hat,
epsilon_z_hat,
x_train_strata,
x_train_strata_unique,
stratum_ppfs_delta_y,
stratum_ppfs_delta_z,
aggregate_indicator_train,
)
# turn deltas into percentiles (uniforms) so that we can re-sample those
unifs_y = self._strata_pit(x_train_strata, x_train_strata_unique, delta_y_hat, stratum_cdfs_y)
unifs_z = self._strata_pit(x_train_strata, x_train_strata_unique, delta_z_hat, stratum_cdfs_z)
unifs = np.concatenate([unifs_y, unifs_z], axis=1)
# re-sample uniforms and apply the inverse CDF (PPF) to convert back to re-sampled residuals
delta_y_B, delta_z_B = self._bootstrap_deltas(
unifs, x_train_strata, x_train_strata_unique, stratum_ppfs_delta_y, stratum_ppfs_delta_z
)
return (epsilon_y_B, epsilon_z_B), (delta_y_B, delta_z_B)
def _sample_test_delta(
self, x_test_strata: pd.DataFrame, stratum_ppfs_delta_y: dict, stratum_ppfs_delta_z: dict
) -> tuple[np.ndarray, np.ndarray]:
"""
This function generates new test deltas (unit level errors)
"""
# we previously estimated the distribution of the deltas per stratum
# we can now generate a new set of uniform random variables (percentiles)
# and then convert them to deltas using the deltas PPFs. We just need
# to make sure that we use the correct distribution defined by the strata
n_test = x_test_strata.shape[0]
# sample a new set of uniforms
test_unifs = self.rng.uniform(low=0, high=1, size=(n_test, self.B, 2))
test_delta_y = np.zeros((n_test, self.B))
test_delta_z = np.zeros((n_test, self.B))
x_test_strata_unique = np.unique(x_test_strata, axis=0).astype(int)
for strata_dummies in x_test_strata_unique:
# get indices of the nonreporting data that are in strata strata_dummies
strata_indices = np.where((strata_dummies == x_test_strata).all(axis=1))[0]
# get their corresponding newly sampled uniforms
unifs_strata = test_unifs[strata_indices]
# use PPF to generate new deltas from the new uniforms
test_delta_y[strata_indices] = stratum_ppfs_delta_y[tuple(strata_dummies)](unifs_strata[:, :, 0])
test_delta_z[strata_indices] = stratum_ppfs_delta_z[tuple(strata_dummies)](unifs_strata[:, :, 1])
return test_delta_y, test_delta_z
def _sample_test_epsilon(
self,
residuals_y: np.ndarray,
residuals_z: np.ndarray,
epsilon_y_hat: np.ndarray,
epsilon_z_hat: np.ndarray,
aggregate_indicator_train: np.ndarray,
aggregate_indicator_test: np.ndarray,
) -> tuple[np.ndarray, np.ndarray]:
"""
This function generates new test epsilons (contest level random effects)
For partially observed contests, we apply a normal approximation, and sample
from the normal distribution implied by the remaining uncertainty of a
sampling without replacement process.
For unobserved contests, we sample from the normal distribution implied by the
sample covariance of the partially observed epsilons
"""
non_zero_epsilon_indices = np.nonzero(epsilon_y_hat)[0]
# if there is only one non-zero contest OR if the contest is a state level election only
# then just return zeros since there is no variance to estimate
if non_zero_epsilon_indices.shape[0] == 1:
return np.zeros((1, self.B)), np.zeros((1, self.B))
aggregate_indicator = np.concatenate([aggregate_indicator_train, aggregate_indicator_test], axis=0)
# computes standard error of epsilon_hat estimate for each contest
# formula is given by square root of (1 - n/N) * (pop variance) / n
# where n is the number of observed units and N is the number of units in the contest
# pop variance is the variance of the residuals in the contest
def _get_epsilon_hat_std(residuals, epsilon):
var = np.var(aggregate_indicator_train * residuals, axis=0) # incorrect denominator for variance
var *= aggregate_indicator_train.shape[0] / (
aggregate_indicator_train.sum(axis=0) - 1
) # Bessel's correction
var *= (
1 - aggregate_indicator_train.sum(axis=0) / aggregate_indicator.sum(axis=0)
) / aggregate_indicator_train.sum(axis=0)
# where we have < 2 units in a contest, we set the variance to the variance of the observed epsilon_hat values
var[np.isnan(var) | np.isinf(var)] = np.var(epsilon[np.nonzero(epsilon)[0]].T, ddof=1)
return np.sqrt(var)
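# Worked example for the helper above (hypothetical numbers): a contest with N = 10 units of which
# n = 4 are observed and an estimated residual population variance of 0.01 gives
# var = (1 - 4/10) * 0.01 / 4 = 1.5e-3, i.e. a standard error of roughly 0.039 for epsilon_hat.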
std_y = _get_epsilon_hat_std(residuals_y, epsilon_y_hat)
std_z = _get_epsilon_hat_std(residuals_z, epsilon_z_hat)
std = np.zeros((std_y.shape[0] + std_z.shape[0],))
std[0::2] = std_y
std[1::2] = std_z
# high observed correlation between epsilon_y_hat and epsilon_z_hat in 2020, so this is important
corr = np.corrcoef(np.concatenate([epsilon_y_hat, epsilon_z_hat], axis=1)[np.nonzero(epsilon_y_hat)[0]].T)
# tile corr into a block diagonal matrix
corr_list = [corr] * aggregate_indicator.shape[1]
corr_hat = block_diag(*corr_list)
# if we have additional inter-contest correlations to impose
# self.contest_correlations is list of tuples
# e.g., [(("AK", "AL", "AR"), 0.5), (("CA", "CO", "CT"), 0.5)]
# this will impose a correlation of 0.5 among the sampled state-level swings
# for AK, AL, AR and CA, CO, CT
if len(self.contest_correlations) > 0:
for contests, correlation in self.contest_correlations:
for c_1, c_2 in combinations(contests, 2):
i, j = 2 * self.aggregate_names[c_1], 2 * self.aggregate_names[c_2]
corr_hat[i, j] = correlation # epsilon_y_correlation
corr_hat[i + 1, j + 1] = correlation # epsilon_z_correlation
corr_hat[j, i] = correlation
corr_hat[j + 1, i + 1] = correlation
corr_hat[i + 1, j] = min(corr_hat[i, i + 1], correlation)
corr_hat[j, i + 1] = min(corr_hat[i, i + 1], correlation)
corr_hat[i, j + 1] = min(corr_hat[i, i + 1], correlation)
corr_hat[j + 1, i] = min(corr_hat[i, i + 1], correlation)
# project correlation matrix back to PSD cone
evals, evecs = np.linalg.eigh(corr_hat)
if (evals < -1e-5).any():
LOG.info("Epsilon correl. matrix is not PSD and requires projection.")
evals = np.clip(evals, 0, None)
corr_hat = evecs @ np.diag(evals) @ evecs.T
corr_hat /= np.sqrt(np.outer(np.diag(corr_hat), np.diag(corr_hat)))
# \epsilon ~ N(0, \Sigma)
# Sigma is a block diagonal matrix with 2x2 blocks running down the diagonal and 0s elsewhere
# each block is the covariance matrix of epsilon_y and epsilon_z for a particular contest,
# e.g., the first block is for the AK contest, the second block is for the AL contest, etc.
# we can sample from this distribution to get new epsilons
mu_hat = np.zeros(corr_hat.shape[0])
sigma_hat = np.diag(std) @ corr_hat @ np.diag(std)
test_epsilon = self.rng.multivariate_normal(mu_hat, sigma_hat, size=self.B)
test_epsilon_y = test_epsilon[:, 0::2].T
test_epsilon_z = test_epsilon[:, 1::2].T
test_epsilon_y = aggregate_indicator_test @ test_epsilon_y
test_epsilon_z = aggregate_indicator_test @ test_epsilon_z
return test_epsilon_y, test_epsilon_z
def _sample_test_errors(
self,
residuals_y: np.ndarray,
residuals_z: np.ndarray,
epsilon_y_hat: np.ndarray,
epsilon_z_hat: np.ndarray,
x_test_strata: pd.DataFrame,
stratum_ppfs_delta_y: dict,
stratum_ppfs_delta_z: dict,
aggregate_indicator_train: np.ndarray,
aggregate_indicator_test: np.ndarray,
) -> tuple[np.ndarray, np.ndarray]:
"""
This function samples new test errors for our model (ie. new test residuals)
"""
# we generate new test residuals by generating new test epsilons
# and new test deltas and then adding them together
test_epsilon_y, test_epsilon_z = self._sample_test_epsilon(
residuals_y, residuals_z, epsilon_y_hat, epsilon_z_hat, aggregate_indicator_train, aggregate_indicator_test
)
test_delta_y, test_delta_z = self._sample_test_delta(x_test_strata, stratum_ppfs_delta_y, stratum_ppfs_delta_z)
test_error_y = test_epsilon_y + test_delta_y
test_error_z = test_epsilon_z + test_delta_z
return test_error_y, test_error_z
# TODO: potentially generalize binning features for strata
def _get_strata(
self, reporting_units: pd.DataFrame, nonreporting_units: pd.DataFrame
) -> tuple[pd.DataFrame, pd.DataFrame]:
"""
Gets strata for stratified bootstrap sampling
"""
n_train = reporting_units.shape[0]
# we can use the featurizer, since strata are defined by dummy variables in the same way that
# fixed effects are (ie. rural could be (1, 0, 0) while urban could be (0, 1, 0) while suburban
# could be (0, 0, 1))
# but like with fixed effects we drop one strata category and use the intercept instead so the
# example would be
# rural: 0, 0 urban: 1, 0 and rural: 0, 1
strata_featurizer = Featurizer([], self.strata, states_for_separate_model=self.states_for_separate_model)
all_units = pd.concat([reporting_units, nonreporting_units], axis=0)
strata_all = strata_featurizer.prepare_data(
all_units, center_features=False, scale_features=False, add_intercept=self.add_intercept
)
x_train_strata = strata_all[:n_train]
x_test_strata = strata_all[n_train:]
return x_train_strata, x_test_strata
def _extrapolate_unit_margin(self, reporting_units: pd.DataFrame, nonreporting_units: pd.DataFrame):
"""
This function will produce the extrapolated predictions for the non-reporting units.
At a high level, the idea is as follows:
We have a set of reporting units that we have observed the results for over time. Even though we
have only seen the reporting unit at a particular set of percent_expected_vote values, we can use
them to come up with an estimate for the normalized margin at any percent_expected_vote value.
For example, if we saw a reporting county at both 50% and 70% expected vote, we can estimate the normalized
margin at 60% expected vote via
margin_60 = [margin_50 * 50 + batch_margin_70 * (60 - 50)] / 60
where batch_margin_70 is the normalized margin of the batch of votes we saw when we recorded the 70%
vote for the county.
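For example (hypothetical numbers): if margin_50 = 0.10 and batch_margin_70 = 0.20, then
margin_60 = [0.10 * 50 + 0.20 * (60 - 50)] / 60 = 7 / 60 ≈ 0.117.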
Now that we have these estimates for the normalized margin at any percent_expected_vote value, we can
use them to estimate how we ought to correct the current normalized margin to estimate the final value.
For example, let's imagine our non-reporting county is at 60% reporting.
For the reporting county in the example above, we can look at the difference margin_100 - margin_60, and
that would be our best guess (from that example) for how to correct our extrapolation from the observed
normalized margin in the non-reporting county.
We can repeat this for all the reporting counties and then take the mean of the corrections to get our
best guess for how to correct the normalized margin for the non-reporting county.
The code also includes some additional logic to ensure that this extrapolation step is only used when
we can be confident in its validity.
1) We only estimate the correction using counties belonging to the same state.
2) We also only apply this method to a non-reporting county once it has passed a certain threshold of reporting.
3) We also do not use the correction estimate from a reporting county if the closest observed vote to the
percent_expected_vote is too far away.
4) The correction estimates (obtained using VersionedResultsHandler) are also np.nan when there are
irregularities in the reporting (e.g., there's a correction to the dem/gop vote totals that revises them downwards).
5) We only run this method in states with at least self.min_extrapolating_units counties available.
"""
# first we need to concatenate the current reporting/non-reporting units
# to the end of the versioned_results data sitting in the handler
# this is because the versioned_results_handler only sees *previous* runs
# of the model and doesn't have the latest set of reporting results
all_units = pd.concat([reporting_units, nonreporting_units], axis=0).copy()
missing_columns = list(set(self.versioned_data_handler.data.columns) - set(all_units.columns))
all_units[missing_columns] = self.versioned_data_handler.data[missing_columns].max()
all_units["last_modified"] = self.versioned_data_handler.data["last_modified"].max() + timedelta(seconds=1)
self.versioned_data_handler.data = pd.concat(
[self.versioned_data_handler.data, all_units[self.versioned_data_handler.data.columns]], axis=0
)
"""
The columns of the versioned_estimates dataframe are
(geographic_unit_fips, percent_expected_vote, est_margin, est_correction, nearest_observed_vote)
TODO: remove the other junk columns
geographic_unit_fips: the fips code of the geographic unit
percent_expected_vote: the % expected vote for the geographic unit
est_margin: our estimate of the normalized margin at the value of percent_expected_vote
est_correction: the difference between the latest value of normalized margin and our estimate
we would use this value to "correct" our estimate of the normalized margin
nearest_observed_vote: the nearest observed vote to the percent_expected_vote that we used to
derive our estimate
"""
versioned_estimates = self.versioned_data_handler.compute_versioned_margin_estimate()
versioned_estimates["dist_to_observed"] = (
versioned_estimates.percent_expected_vote - versioned_estimates.nearest_observed_vote
).abs()
# we are only going to use this method for county units whose current reporting is above
# the threshold
modeling_filter = nonreporting_units.percent_expected_vote >= self.extrapolate_threshold
modeling_filter &= nonreporting_units.geographic_unit_type == "county"
# we should only use counties that have reported to understand how we should extrapolate
reporting_units_county = reporting_units[reporting_units.geographic_unit_type == "county"]
# get the versioned results for the reporting units
reporting_versioned_estimates = versioned_estimates[
versioned_estimates.geographic_unit_fips.isin(reporting_units_county.geographic_unit_fips)
].copy()
# merge is to get postal_code information for the versioned units
reporting_versioned_estimates = reporting_versioned_estimates.merge(
reporting_units[["geographic_unit_fips", "postal_code"]], on="geographic_unit_fips", how="left"
)
all_corrections = []
for postal_code in nonreporting_units[modeling_filter].postal_code.unique():
def _get_filter(df):
return df.postal_code == postal_code
# if there are fewer than min_extrapolating_units reporting units in this state, we can't extrapolate
if _get_filter(reporting_units_county).sum() < self.min_extrapolating_units:
continue
# filters for the units we are interested in
reporting_filter = _get_filter(reporting_versioned_estimates)
nonreporting_filter = _get_filter(nonreporting_units) & modeling_filter
state_reporting_estimates = reporting_versioned_estimates[reporting_filter]
# get the percent_expected_vote for the nonreporting units we are looking to
# extrapolate for
nonreporting_merge = nonreporting_units.loc[
nonreporting_filter, ["geographic_unit_fips", "percent_expected_vote"]
].copy()
# round this value so that we can match to the values in state_reporting_estimates
nonreporting_merge["percent_expected_vote"] = nonreporting_merge.percent_expected_vote.round()
# for each non-reporting unit, we have now identified the reporting units (and est. margin / correction)
# we will use to extrapolate the normalized margin
est_corrections_df = pd.merge(
nonreporting_merge,
state_reporting_estimates,
on="percent_expected_vote",
how="left",
suffixes=("", "_reporting"),
)
grouped_est_corrections = est_corrections_df.groupby("geographic_unit_fips")
# get correction mean / std / max / min / count of units used for each correction
def compute_correction_statistics(df):
df_filtered = df[(df.dist_to_observed < self.max_dist_to_observed) & (df.est_correction.notnull())]
if df_filtered.empty:
return pd.DataFrame(
{
"est_correction": np.nan,
"est_correction_max": np.nan,
"est_correction_min": np.nan,
"est_correction_std": np.nan,
"est_correction_count": 0,
},
index=[df.geographic_unit_fips.iloc[0]],
)
return pd.DataFrame(
{
"est_correction": np.nanmean(df_filtered.est_correction.values),
"est_correction_max": np.nanmax(df_filtered.est_correction.values),
"est_correction_min": np.nanmin(df_filtered.est_correction.values),
"est_correction_std": np.nanstd(df_filtered.est_correction.values, ddof=1),
"est_correction_count": df_filtered.est_correction.count(),
},
index=[df.geographic_unit_fips.iloc[0]],
)
corrections = grouped_est_corrections.apply(compute_correction_statistics).reset_index()
all_corrections.append(corrections)
# if no states have enough reporting units to extrapolate, return nans
if len(all_corrections) == 0:
return np.nan * np.ones((nonreporting_units.shape[0], 1)), np.nan * np.ones(
(nonreporting_units.shape[0], 1)
)
# concatenate all the corrections together (since we did it state-by-state)
all_corrections = pd.concat(all_corrections, axis=0)
# merge correction data back into the nonreporting units so we can add it
# to the current normalized margin and obtain our extrapolated prediction
nonreporting_units = nonreporting_units.merge(
all_corrections[
[
"geographic_unit_fips",
"est_correction",
"est_correction_max",
"est_correction_min",
"est_correction_std",
"est_correction_count",
]
],
how="left",
on="geographic_unit_fips",
)
nonreporting_units["pred_extrapolate"] = (
nonreporting_units.results_normalized_margin + nonreporting_units.est_correction
)
prediction = nonreporting_units.pred_extrapolate.values.reshape(-1, 1)
# need to report an error estimate so we can ensemble this prediction with the OLS prediction
if self.extrapolate_std_method == "std":
prediction_std = nonreporting_units.est_correction_std.values.reshape(-1, 1)
elif self.extrapolate_std_method == "max_min":
prediction_std = (
nonreporting_units.est_correction_max.values - nonreporting_units.est_correction_min.values
).reshape(-1, 1)
return prediction, prediction_std
def compute_bootstrap_errors(
self, reporting_units: pd.DataFrame, nonreporting_units: pd.DataFrame, unexpected_units: pd.DataFrame
):
"""
Computes unit level point predictions and runs the bootstrap to generate quantities needed for
prediction intervals.
The bootstrap generally re-samples the observed data with replacement in order to generate synthetic
"bootstrap" datasets, which can be used to estimate the sampling distribution of a quantity that we
are interested in.
Our implementation is the stratified residual bootstrap. The residual bootstrap samples residuals of a
model (instead of the original dataset).
This is preferable in a regression setting because it removes the component of the observation that is
not random.
We use the stratified bootstrap because our units are not independent and identically distributed, which means
that we cannot assign the error of any unit to any other unit (e.g. the residual for an urban unit would
likely not fit for a rural unit). For now, this model stratifies on county classification
(rural/urban/suburban).
Generally we are interested in predicting functions of:
w * y * z = weights * normalized_margin * turnout_factor = unnormalized_margin
There are three cases:
1) In the unit case we are interested in the unnormalized margin:
w_i * y_i * z_i
2) In the aggregate (e.g. state aggregation) case we are interested in the normalized sum of the
unnormalized margin of units
(sum_{i = 1}^N w_i * y_i * z_i) / (sum_{i = 1}^N w_i * z_i)
3) In the national case we are interested in an interval over the sum of electoral votes generated by
the predictions
sum_{s = 1}^{51} sigmoid{sum_{i = 1}^{N_s} w_i * y_i * z_i} * ev_s
Our point prediction for each is:
1) w_i * hat{y_i} * hat{z_i}
2) (sum_{i = 1}^N w_i * hat{y_i} * hat{z_i}) / (sum_{i = 1}^N w_i * hat{z_i})
3) sum_{s = 1}^{51} sigmoid{sum_{i=1}^{N_s} w_i * hat{y_i} * hat{z_i}} * ev_s
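For example (hypothetical numbers), in case 2 with two units where w = (100, 200),
hat{y} = (0.2, -0.1) and hat{z} = (1.0, 1.1), the aggregate normalized margin prediction is
(100 * 0.2 * 1.0 + 200 * -0.1 * 1.1) / (100 * 1.0 + 200 * 1.1) = -2 / 320 = -0.00625.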
remember that our model for y_i, z_i is OLS plus a contest random effect:
y_i = f_y(x) + epsilon_y(x)
z_i = f_z(x) + epsilon_z(x)
this means that:
hat{y_i} = hat{f_y(x)} + hat{epsilon_y(x)}