Predicting Rare Events More Accurately (PREMA)
Likelihood penalization
Generally, likelihood penalization means that a penalty term A(β) is added to the log likelihood:
log L(β)* = log L(β) + A(β).
Maximization of log L(β)* yields the maximum penalized likelihood estimate of the vector of regression coefficients β. If A(β) = -λ∑ βj², i.e., a negative multiple of the sum of squared (suitably standardized) regression coefficients, then a so-called ridge regression model is obtained (Hoerl et al., 1970). Setting A(β) = -λ∑ |βj| yields the least absolute shrinkage and selection operator (LASSO; Tibshirani, 1996). Both types of penalty introduce bias into the regression coefficients in order to reduce their mean squared error. The necessary amount of penalization, controlled by λ, can be found by cross-validation, e.g., by maximizing the cross-validated likelihood (Verweij et al., 1993). While there is some evidence for an advantage of the ridge-type penalty over the LASSO for prognostic models (Ambler et al., 2012), the LASSO is highly valued by researchers for its implicit variable selection: some regression coefficients are estimated as exactly zero.
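As an illustration of how the penalty weight λ can be chosen by cross-validation, the following minimal sketch fits ridge- and LASSO-penalized logistic regressions with scikit-learn's LogisticRegressionCV, whose regularization parameter C corresponds to 1/λ and is selected here by the cross-validated log likelihood (neg_log_loss). The simulated data and all settings are purely illustrative, not part of the proposed project.

```python
# Sketch: ridge- and LASSO-penalized logistic regression with the penalty
# weight chosen by cross-validated likelihood (illustrative data and settings).
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X[:, 0] - 0.5 * X[:, 1]))))

# Standardize the covariates so that the penalty acts on comparable scales.
Xs = StandardScaler().fit_transform(X)

# Ridge (L2) penalty: shrinkage only, no coefficients set exactly to zero.
ridge = LogisticRegressionCV(penalty="l2", Cs=20, cv=5,
                             scoring="neg_log_loss").fit(Xs, y)

# LASSO (L1) penalty: shrinkage plus implicit variable selection.
lasso = LogisticRegressionCV(penalty="l1", solver="saga", Cs=20, cv=5,
                             scoring="neg_log_loss", max_iter=5000).fit(Xs, y)

print("ridge coefficients:", ridge.coef_.ravel())
print("lasso coefficients:", lasso.coef_.ravel())  # some may be exactly zero
```

With the L1 penalty, coefficients of covariates that carry little predictive information are typically shrunk all the way to zero, which is the implicit variable selection referred to above.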
Another type of penalization that has been studied over the last decades is Firth-type penalization (FTP). It was introduced to reduce the small-sample bias of maximum likelihood estimates (Firth, 1993). In exponential family models with canonical parametrization, this penalty equals A(β) = 1/2 log det I(β), with I(β) denoting the Fisher information matrix, and corresponds exactly to the use of Jeffreys’ invariant prior (Jeffreys, 1946). Its bias-corrective effect was empirically confirmed in several studies, including those of our group, for logistic, conditional logistic and Cox regression (Heinze and Schemper, 2001, 2002; Bull et al., 2002, 2007; Heinze, 2006; Heinze and Dunkler, 2008; Heinze and Puhr, 2010). This bias-corrective effect has implications for the practical use of prognostic models obtained by FTP: the regression coefficients, assumed to be unbiased, can be readily interpreted as differences in expected outcome on the log odds scale. (With penalized likelihood regression methods that yield biased regression coefficients, this interpretation is problematic.) Unlike ridge or LASSO regression, FTP does not require the estimation of a tuning parameter. Furthermore, it is rotation-invariant and yields identical models no matter which parameterization is used. A particularly popular property of Firth-type penalization is its robustness to monotone likelihood, i.e., the situation in which the likelihood is monotone in one or several regression coefficients. This phenomenon is likely to occur in small or sparse data sets, but may also arise in subgroup analyses of larger ones. Monotone likelihood causes maximum likelihood estimates to be undefined and numerical algorithms for likelihood maximization to diverge, whereas the Firth-type penalized estimates remain finite.
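Because the Jeffreys-prior penalty depends on β only through the Fisher information, FTP can be illustrated by direct numerical maximization of the penalized log likelihood. The sketch below is a minimal illustration with scipy, not the modified-score algorithm used in dedicated implementations such as the R package logistf; the function names and the small, completely separated data set are assumptions made for the example.

```python
# Sketch: Firth-type penalized logistic regression, maximizing
# log L(beta) + 1/2 log det I(beta) numerically (illustrative only).
import numpy as np
from scipy.optimize import minimize

def negative_firth_loglik(beta, X, y):
    """Negative Firth-penalized log likelihood for logistic regression."""
    eta = X @ beta
    pi = 1.0 / (1.0 + np.exp(-eta))                  # fitted probabilities
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))  # ordinary log likelihood
    W = pi * (1.0 - pi)                              # variance weights
    info = X.T @ (W[:, None] * X)                    # Fisher information I(beta) = X'WX
    sign, logdet = np.linalg.slogdet(info)
    return -(loglik + 0.5 * logdet)                  # A(beta) = 1/2 log det I(beta)

def fit_firth_logistic(X, y):
    beta0 = np.zeros(X.shape[1])
    res = minimize(negative_firth_loglik, beta0, args=(X, y), method="BFGS")
    return res.x

# Tiny, completely separated data set (intercept column plus one covariate):
# ordinary maximum likelihood diverges here, the Firth-type fit stays finite.
X = np.column_stack([np.ones(6), [0, 0, 0, 1, 1, 1]])
y = np.array([0, 0, 0, 1, 1, 1])
print(fit_firth_logistic(X, y))
```

Running the sketch returns finite coefficient estimates despite the complete separation, illustrating the robustness to monotone likelihood described above.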