#Maximum Likelihood Estimation in Logistic Regression (logit link)
##Bernoulli distribution
The pmf of the Bernoulli distribution is \[\begin{equation} P(Y=y) = p^{y} (1-p)^{1-y} \tag{1} \end{equation}\] where \(y\) is \(1\) or \(0\) and \(p = P(Y=1)\).
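As a quick numerical check (a minimal sketch using base R; the value of \(p\) is arbitrary), the Bernoulli pmf is the size-one case of `dbinom`:

```r
p <- 0.3
dbinom(1, size = 1, prob = p) # p^1 * (1 - p)^0 = 0.3
dbinom(0, size = 1, prob = p) # p^0 * (1 - p)^1 = 0.7
```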
##Likelihood of Logistic Regression
\[\begin{equation} \begin{split} L(\boldsymbol{\beta} | \mathbf{y} ; \mathbf{x}) &= L(\beta_0,\beta_1|(y_1,...,y_n);(x_1,...,x_n)) \\ &= \prod^{n}_{i=1} p_{i}^{y_i}(1-p_i)^{1-y_i} \end{split} \tag{2} \end{equation}\]
##Log-likelihood of Logistic Regression
\[\begin{equation} \begin{split} \ell (\boldsymbol{\beta} | \mathbf{y} ; \mathbf{x}) & = log(\prod^{n}_{i=1} p_{i}^{y_i}(1-p_{i})^{1-y_i}) \\ & = \sum_{i=1}^{n} log(p_{i}^{y_i}) + log((1-p_{i})^{1-y_i}) \\ & = \sum_{i=1}^{n} y_{i} log(p_{i}) + (1-y_{i})log(1-p_{i}) \end{split} \tag{3} \end{equation}\]
\[\begin{equation} \begin{split} logit(p_i) &= log(\frac{p_i}{1-p_i}) \\ &= \beta_0 + \beta_1 x_i \end{split} \tag{4} \end{equation}\]
Solving (4) for \(p_i\) gives
\[\begin{equation} \begin{split} p_i &= \frac{exp(\beta_0+x_i \beta_1)}{1 + exp(\beta_0+x_i \beta_1)} \end{split} \tag{5} \end{equation}\]
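In R, this inverse-logit transformation is available as `plogis`; a minimal sketch (the value of the linear predictor is arbitrary):

```r
eta <- 0.5                # beta_0 + beta_1 * x_i for some observation
exp(eta) / (1 + exp(eta)) # equation (5) written out
plogis(eta)               # base R's inverse logit gives the same value
```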
The first partial derivative of \(p_i\) with respect to \(\beta_0\) is
\[\begin{equation} \begin{split} \frac{\partial p_i}{\partial \beta_0} &= \frac{exp(\beta_0+x_i \beta_1)}{(1+exp(\beta_0+x_i \beta_1))^2}\\ &= p_i (1-p_i) \end{split} \tag{6} \end{equation}\]
Similarly, we have
\[\begin{equation} \begin{split} \frac{\partial p_i}{\partial \beta_1} &= \frac{x_i exp(\beta_0+x_i \beta_1)}{(1+exp(\beta_0+x_i \beta_1))^2}\\ &= x_i p_i (1-p_i) \end{split} \tag{7} \end{equation}\]
Applying the chain rule together with (6) and setting the score to zero gives the first estimating equation:
\[\begin{equation} \begin{split} \frac{\partial \ell (\boldsymbol{\beta} | \mathbf{y} ; \mathbf{x})}{\partial \beta_0} & = \sum_{i=1}^{n} y_i \frac{\partial log(p_i)}{\partial p_i} \frac{\partial p_i}{\partial \beta_0} + (1-y_i) \frac{\partial log(1-p_i)}{\partial p_i} \frac{\partial p_i}{\partial \beta_0}\\ & = \sum_{i=1}^{n}\frac{y_i}{p_i} \frac{\partial p_i}{\partial \beta_0} - \frac{1 - y_i}{1- p_i} \frac{\partial p_i}{\partial \beta_0} \\ & = \sum_{i=1}^{n} (y_i - p_i) = 0 \end{split} \tag{8} \end{equation}\]
\[\begin{equation} \begin{split} \frac{\partial \ell (\boldsymbol{\beta} | \mathbf{y} ; \mathbf{x})}{\partial \beta_1} & = \sum_{i=1}^{n} y_i \frac{\partial log(p_i)}{\partial p_i} \frac{\partial p_i}{\partial \beta_1} + (1-y_i) \frac{\partial log(1-p_i)}{\partial p_i} \frac{\partial p_i}{\partial \beta_1}\\ & = \sum_{i=1}^{n}\frac{y_i}{p_i} \frac{\partial p_i}{\partial \beta_1} - \frac{1 - y_i}{1- p_i} \frac{\partial p_i}{\partial \beta_1} \\ & = \sum_{i=1}^{n} x_i (y_i - p_i) = 0 \end{split} \tag{9} \end{equation}\]
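A small numerical check of (8) and (9): compare the analytic score with `numDeriv::grad` on a made-up data set (the `x` and `y` values below are hypothetical and used only for illustration):

```r
library(numDeriv)
x <- c(0.5, 1.2, 2.3, 3.1) # hypothetical predictor values
y <- c(1, 0, 0, 1)         # hypothetical 0/1 responses
loglik <- function(b) {
  p <- plogis(b[1] + b[2] * x)
  sum(y * log(p) + (1 - y) * log(1 - p))
}
score <- function(b) {
  p <- plogis(b[1] + b[2] * x)
  c(sum(y - p), sum(x * (y - p))) # equations (8) and (9)
}
b <- c(0.5, -0.2)
grad(loglik, b) # numerical gradient of the log-likelihood
score(b)        # analytic score; the two should agree
```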
#Maximum Likelihood Estimation in Logistic Regression (probit link)
##Normal distribution
We define \(\varphi(x)\) as the pdf of the standard normal distribution and \(\Phi(x)\) as its cdf. We do not need their explicit forms here; for this section, only the relationship between \(\varphi(x)\) and \(\Phi(x)\) matters:
\[\begin{equation} \begin{split} \varphi(x) = \frac{d\Phi(x)}{dx} \end{split} \tag{10} \end{equation}\]
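A one-line check of (10) with `numDeriv` (the evaluation point is arbitrary):

```r
library(numDeriv)
z <- 0.7
grad(pnorm, z) # numerical derivative of the standard normal cdf at z
dnorm(z)       # standard normal pdf at z; the two should agree
```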
##Likelihood and log-likelihood of Logistic Regression with probit link
The likelihood and log-likelihood of logistic regression with the probit link are the same as (2) and (3).
##Probit link
\[\begin{equation} \begin{split} probit(p_i) = \Phi^{-1}(p_i) = \beta_0 + \beta_1 x_i \end{split} \tag{11} \end{equation}\]
Equivalently
\[\begin{equation} \begin{split} p_i = \Phi(\Phi^{-1}(p_i)) = \Phi(\beta_0+\beta_1 x_i) \end{split} \tag{12} \end{equation}\]
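In R, \(\Phi\) and \(\Phi^{-1}\) are `pnorm` and `qnorm`; a minimal sketch of (11) and (12) with an arbitrary probability:

```r
p <- 0.3
qnorm(p)        # probit(p) = Phi^{-1}(p)
pnorm(qnorm(p)) # Phi(Phi^{-1}(p)) recovers p
```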
Based on equation (10), \(\frac{\partial \Phi(\beta_0 + \beta_1 x_i)}{\partial (\beta_0 + \beta_1 x_i)} = \varphi(\beta_0 + \beta_1 x_i)\). The first partial derivative with respect to \(\beta_0\) is:
\[\begin{equation} \begin{split} \frac{\partial p_i}{\partial \beta_0} = \frac{\partial \Phi(\beta_0+\beta_1 x_i)}{\partial (\beta_0+\beta_1 x_i)} \frac{\partial (\beta_0+\beta_1 x_i)}{\partial \beta_0} = \varphi(\beta_0+\beta_1 x_i) \end{split} \tag{13} \end{equation}\]
Similarly, we have
\[\begin{equation} \begin{split} \frac{\partial p_i}{\partial \beta_1} = \frac{\partial \Phi(\beta_0+\beta_1 x_i)}{\partial (\beta_0+\beta_1 x_i)} \frac{\partial (\beta_0+\beta_1 x_i)}{\partial \beta_1} = \varphi(\beta_0+\beta_1 x_i) x_i \end{split} \tag{14} \end{equation}\]
##Partial Derivatives of log-likelihood function
\[\begin{equation} \begin{split} \frac{\partial \ell (\boldsymbol{\beta} | \mathbf{y} ; \mathbf{x})}{\partial \beta_0} & = \sum_{i=1}^{n} y_i \frac{\partial log(p_i)}{\partial p_i} \frac{\partial p_i}{\partial \beta_0} + (1-y_i) \frac{\partial log(1-p_i)}{\partial p_i} \frac{\partial p_i}{\partial \beta_0}\\ & = \sum_{i=1}^{n}\frac{y_i}{p_i} \frac{\partial p_i}{\partial \beta_0} - \frac{1 - y_i}{1- p_i} \frac{\partial p_i}{\partial \beta_0} \\ & = \sum_{i=1}^{n} \frac{y_i \varphi(\beta_0 + \beta_1 x_i)}{\Phi(\beta_0 + \beta_1 x_i)} - \frac{(1-y_i)\varphi(\beta_0 + \beta_1 x_i)}{1 - \Phi(\beta_0 + \beta_1 x_i)} \\ & = \sum_{i=1}^{n} \frac{\varphi(\beta_0 + \beta_1 x_i)(y_i - \Phi(\beta_0 + \beta_1 x_i))}{\Phi(\beta_0 + \beta_1 x_i)(1-\Phi(\beta_0+\beta_1 x_i))} \end{split} \tag{15} \end{equation}\]
\[\begin{equation} \begin{split} \frac{\partial \ell (\boldsymbol{\beta} | \mathbf{y} ; \mathbf{x})}{\partial \beta_1} & = \sum_{i=1}^{n} y_i \frac{\partial log(p_i)}{\partial p_i} \frac{\partial p_i}{\partial \beta_1} + (1-y_i) \frac{\partial log(1-p_i)}{\partial p_i} \frac{\partial p_i}{\partial \beta_1}\\ & = \sum_{i=1}^{n}\frac{y_i}{p_i} \frac{\partial p_i}{\partial \beta_1} - \frac{1 - y_i}{1- p_i} \frac{\partial p_i}{\partial \beta_1} \\ & = \sum_{i=1}^{n} \frac{y_i \varphi(\beta_0 + \beta_1 x_i)x_i}{\Phi(\beta_0 + \beta_1 x_i)} - \frac{(1-y_i)\varphi(\beta_0 + \beta_1 x_i)x_i}{1 - \Phi(\beta_0 + \beta_1 x_i)} \\ & = \sum_{i=1}^{n} \frac{\varphi(\beta_0 + \beta_1 x_i)(y_i - \Phi(\beta_0 + \beta_1 x_i))}{\Phi(\beta_0 + \beta_1 x_i)(1-\Phi(\beta_0+\beta_1 x_i))}x_i \end{split} \tag{16} \end{equation}\]
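As with the logit check above, (15) and (16) can be verified numerically against `numDeriv::grad` on hypothetical data:

```r
library(numDeriv)
x <- c(0.5, 1.2, 2.3, 3.1) # hypothetical predictor values
y <- c(1, 0, 0, 1)         # hypothetical 0/1 responses
loglik <- function(b) {
  p <- pnorm(b[1] + b[2] * x)
  sum(y * log(p) + (1 - y) * log(1 - p))
}
score <- function(b) {
  eta <- b[1] + b[2] * x
  s <- dnorm(eta) * (y - pnorm(eta)) / (pnorm(eta) * (1 - pnorm(eta)))
  c(sum(s), sum(s * x)) # equations (15) and (16)
}
b <- c(0.5, -0.2)
grad(loglik, b) # numerical gradient
score(b)        # analytic score; the two should agree
```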
#Packages and version information
```r
library(tidyverse)
library(cowplot)
library(numDeriv)
sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] numDeriv_2016.8-1.1 lattice_0.20-45 optimization_1.0-9
## [4] Rcpp_1.0.9 lme4_1.1-31 Matrix_1.5-3
## [7] nycflights13_1.0.2 forcats_0.5.1 stringr_1.4.0
## [10] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
## [13] tibble_3.1.8 tidyverse_1.3.2 cowplot_1.1.1
## [16] dplyr_1.0.9 ggplot2_3.3.6 reticulate_1.28
## [19] knitr_1.39
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-157 fs_1.5.2 lubridate_1.8.0
## [4] RColorBrewer_1.1-3 httr_1.4.4 rprojroot_2.0.3
## [7] tools_4.2.1 backports_1.4.1 bslib_0.4.0
## [10] utf8_1.2.2 R6_2.5.1 DBI_1.1.3
## [13] colorspace_2.0-3 withr_2.5.0 tidyselect_1.1.2
## [16] compiler_4.2.1 cli_3.3.0 rvest_1.0.2
## [19] xml2_1.3.3 labeling_0.4.2 bookdown_0.28
## [22] sass_0.4.2 scales_1.2.0 digest_0.6.29
## [25] minqa_1.2.5 rmarkdown_2.15 pkgconfig_2.0.3
## [28] htmltools_0.5.3 dbplyr_2.2.1 fastmap_1.1.0
## [31] highr_0.9 rlang_1.0.4 readxl_1.4.1
## [34] rstudioapi_0.13 jquerylib_0.1.4 farver_2.1.1
## [37] generics_0.1.3 jsonlite_1.8.0 googlesheets4_1.0.1
## [40] magrittr_2.0.3 munsell_0.5.0 fansi_1.0.3
## [43] lifecycle_1.0.1 stringi_1.7.8 yaml_2.3.5
## [46] MASS_7.3-57 grid_4.2.1 crayon_1.5.1
## [49] haven_2.5.0 splines_4.2.1 hms_1.1.2
## [52] pillar_1.8.1 boot_1.3-28 reprex_2.0.2
## [55] glue_1.6.2 evaluate_0.16 modelr_0.1.8
## [58] nloptr_2.0.3 png_0.1-7 vctrs_0.4.1
## [61] tzdb_0.3.0 cellranger_1.1.0 gtable_0.3.0
## [64] assertthat_0.2.1 cachem_1.0.6 xfun_0.32
## [67] broom_1.0.0 googledrive_2.0.0 gargle_1.2.0
## [70] ellipsis_0.3.2 here_1.0.1
```
#MLE of logistic regression - Three Methods
In this section, I will use three methods, Newton-Raphson, Fisher scoring, and IRLS (iteratively reweighted least squares), to estimate \(\beta_0, \beta_1\). The data set used in this section comes from exercise 17.1 of Richard M. Heiberger (2015).
The Newton-Raphson update is
\[\begin{equation} \begin{split} \boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - (\boldsymbol{H}^{(t)})^{-1}\boldsymbol{u}^{(t)} \end{split} \tag{17} \end{equation}\]
Where
\(\boldsymbol{\beta}^{(t)} = \begin{bmatrix} \beta_{0}^{(t)}\\ \beta_{1}^{(t)} \end{bmatrix}\)
From equations (8) and (9), we have
\(\boldsymbol{u}^{(t)} = \begin{bmatrix} u_{0}^{(t)}\\ u_{1}^{(t)} \end{bmatrix} = \begin{bmatrix} \frac{\partial \ell (\boldsymbol{\beta}^{(t)}|\boldsymbol{y};\boldsymbol{x})}{\partial \beta_0}\\ \frac{\partial \ell (\boldsymbol{\beta}^{(t)}|\boldsymbol{y};\boldsymbol{x})}{\partial \beta_1} \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} (y_i - p_i^{(t)})\\ \sum_{i=1}^{n} x_i(y_i - p_i^{(t)}) \end{bmatrix}\)
From equation (3), we have
\(\ell (\boldsymbol{\beta}^{(t)} | \mathbf{y} ; \mathbf{x}) = \sum_{i=1}^{n} y_{i} log(p_{i}^{(t)}) + (1-y_{i})log(1-p_{i}^{(t)})\)
From equation (5), we have
\(p_{i}^{(t)} = \frac{exp(\beta_{0}^{(t)}+x_i \beta_{1}^{(t)})}{1 + exp(\beta_{0}^{(t)}+x_{i} \beta_{1}^{(t)})}\)
\(\boldsymbol{H}^{(t)}\) can be viewed as the Jacobian matrix of \(\boldsymbol{u}(\cdot)\),
\(\boldsymbol{H}^{(t)} = \begin{bmatrix} \frac{\partial u_{0}^{(t)}}{\partial \beta_0} & \frac{\partial u_{0}^{(t)}}{\partial \beta_1}\\ \frac{\partial u_{1}^{(t)}}{\partial \beta_0} & \frac{\partial u_{1}^{(t)}}{\partial \beta_1} \end{bmatrix}\)
Equivalently, \(\boldsymbol{H}^{(t)}\) can also be viewed as the Hessian matrix of \(\ell(\cdot)\),
\(\boldsymbol{H}^{(t)} = \begin{bmatrix} \frac{\partial^2 \ell(\boldsymbol{\beta}^{(t)} | \boldsymbol{y};\boldsymbol{x})}{\partial \beta_0^2} & \frac{\partial^2 \ell(\boldsymbol{\beta}^{(t)} | \boldsymbol{y};\boldsymbol{x})}{\partial \beta_0 \partial \beta_1} \\ \frac{\partial^2 \ell(\boldsymbol{\beta}^{(t)} | \boldsymbol{y};\boldsymbol{x})}{\partial \beta_0 \partial \beta_1} & \frac{\partial^2 \ell(\boldsymbol{\beta}^{(t)} | \boldsymbol{y};\boldsymbol{x})}{\partial \beta_1^2} \end{bmatrix}\)
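This equivalence can be checked numerically: `numDeriv::jacobian` applied to the score and `numDeriv::hessian` applied to the log-likelihood should return the same matrix. A minimal sketch on hypothetical data and an arbitrary \(\boldsymbol{\beta}^{(t)}\):

```r
library(numDeriv)
x <- c(0.5, 1.2, 2.3, 3.1) # hypothetical predictor values
y <- c(1, 0, 0, 1)         # hypothetical 0/1 responses
loglik <- function(b) {
  p <- plogis(b[1] + b[2] * x)
  sum(y * log(p) + (1 - y) * log(1 - p))
}
u <- function(b) {
  p <- plogis(b[1] + b[2] * x)
  c(sum(y - p), sum(x * (y - p))) # score, equations (8) and (9)
}
b <- c(2, -1)
jacobian(u, b)     # Jacobian of the score at b
hessian(loglik, b) # Hessian of the log-likelihood at b; the two should match
```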
##Create a simple data frame
```r
df1 <- tibble(xx = 1:4,
              y = c(1, 0, 1, 0))
df1
## # A tibble: 4 × 2
##      xx     y
##   <int> <dbl>
## 1     1     1
## 2     2     0
## 3     3     1
## 4     4     0
```
Using equations (8) and (9), we can obtain the same coefficient estimates via the Newton-Raphson method.
```r
## score function u(beta): equations (8) and (9)
func.u <- function(x) c(sum(df1$y - exp(x[1] + x[2] * df1$xx) / (1 + exp(x[1] + x[2] * df1$xx))),
                        sum(df1$xx * (df1$y - exp(x[1] + x[2] * df1$xx) / (1 + exp(x[1] + x[2] * df1$xx)))))
delta <- matrix(1:2, nrow = 2)
x <- array(c(2, -1))
while (abs(sum(delta)) > 0.0001) {
  xx <- x ## current x value
  x <- as.matrix(x) - solve(jacobian(func.u, x = x)) %*% func.u(x) ## new x value
  delta <- x - as.matrix(xx)
}
x
##            [,1]
## [1,]  2.2704607
## [2,] -0.9081843
```
The Fisher scoring update is
\[\begin{equation} \begin{split} \boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + (\boldsymbol{J}^{(t)})^{-1}\boldsymbol{u}^{(t)} \end{split} \tag{18} \end{equation}\]
Where
\(\boldsymbol{J}^{(t)} = \boldsymbol{X}^{T} \boldsymbol{W}^{(t)} \boldsymbol{X}\) and \(\boldsymbol{X}\) is the design matrix.
\(\boldsymbol{W}^{(t)} = \text{diag} ((\frac{\partial \mu_i^{(t)}}{\partial (\beta_0 + x_i\beta_1)})^2 \frac{1}{\text{var}(y_i)})\)
\(\frac{\partial \mu_i^{(t)}}{\partial (\beta_0 + x_i\beta_1)} = \frac{exp(\beta_0 + x_i\beta_1)^{(t)}}{(1+exp(\beta_0 + x_i\beta_1)^{(t)})^2}\)
Since \(y_i\) follows a Bernoulli distribution, and using eq (5), we have
\(\text{var}(y_i) = p_i(1-p_i) = \frac{exp(\beta_0 + x_i\beta_1)^{(t)}}{(1+exp(\beta_0 + x_i\beta_1)^{(t)})^2}\)
So
\((\frac{\partial \mu_i^{(t)}}{\partial (\beta_0 + x_i\beta_1)})^2 \frac{1}{\text{var}(y_i)} = \frac{exp(\beta_0 + x_i\beta_1)^{(t)}}{(1+exp(\beta_0 + x_i\beta_1)^{(t)})^2}\)
Eventually, we have
\(\boldsymbol{W}^{(t)} = \text{diag}(\frac{exp(\beta_0 + x_i\beta_1)}{(1+exp(\beta_0 + x_i\beta_1))^2})^{(t)}\)
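A quick numerical check of this simplification, i.e. that \((\partial\mu_i/\partial\eta_i)^2/\text{var}(y_i)\) reduces to \(\partial\mu_i/\partial\eta_i\) for the logit link (the value of the linear predictor is arbitrary):

```r
eta <- 0.4
dmu <- exp(eta) / (1 + exp(eta))^2     # d mu / d eta for the logit link
v   <- plogis(eta) * (1 - plogis(eta)) # var(y) = p (1 - p)
dmu^2 / v                              # the Fisher scoring weight
dmu                                    # identical, as claimed above
```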
```r
X <- cbind(rep(1, 4), df1$xx)
X <- as.matrix(X)
## Fisher scoring weights for the logit link: exp(eta) / (1 + exp(eta))^2
func1 <- function(x) {
  x <- X %*% matrix(c(x[1], x[2]), nrow = 2)
  exp(x) / (1 + exp(x))^2
}
## score function: equations (8) and (9)
model1 <- function(x) c(sum(df1$y - exp(x[1] + x[2] * df1$xx) / (1 + exp(x[1] + x[2] * df1$xx))),
                        sum(df1$xx * (df1$y - exp(x[1] + x[2] * df1$xx) / (1 + exp(x[1] + x[2] * df1$xx)))))
delta <- matrix(1:2, nrow = 2)
x <- array(c(2, -1))
while (abs(sum(delta)) > 0.0001) {
  xx <- x ## current x value
  W <- diag(array(func1(x = x)))
  J <- t(X) %*% W %*% X
  x <- as.matrix(x) + solve(J) %*% model1(x)
  delta <- x - as.matrix(xx)
}
x
##            [,1]
## [1,]  2.2704607
## [2,] -0.9081843
```
The iteratively reweighted least squares (IRLS) update is
\[\begin{equation} \begin{split} \boldsymbol{\beta}^{(t+1)} = (\boldsymbol{J}^{(t)})^{-1}\boldsymbol{X}^{T}\boldsymbol{W}^{(t)}\boldsymbol{Z}^{(t)} \end{split} \tag{19} \end{equation}\]
Where
\(\boldsymbol{Z}^{(t)} = \boldsymbol{X} \boldsymbol{\beta}^{(t)} + (\boldsymbol{D}^{(t)})^{-1}(\boldsymbol{y} - \boldsymbol{p}^{(t)})\)
\(\boldsymbol{D}^{(t)} = \text{ diag }(\frac{\partial \mu_i ^{(t)}}{\partial (\beta_0 + \beta_1x_i)}) = \text{diag}(\frac{exp(\beta_0 + x_i\beta_1)}{(1+exp(\beta_0 + x_i\beta_1))^2})^{(t)}\)
```r
X <- cbind(rep(1, 4), df1$xx)
X <- as.matrix(X) # design matrix X
func.mu <- function(x) {
  upeta <- X %*% matrix(c(x[1], x[2]), nrow = 2)
  return(exp(upeta) / (1 + exp(upeta)))
}
func.W <- function(x) {
  upeta <- X %*% matrix(c(x[1], x[2]), nrow = 2)
  return(exp(upeta) / (1 + exp(upeta))^2)
}
func.D <- function(x) {
  upeta <- X %*% matrix(c(x[1], x[2]), nrow = 2)
  return(exp(upeta) / (1 + exp(upeta))^2)
}
delta <- matrix(1:2, nrow = 2)
x <- array(c(2, -1)) # initial values for beta0 and beta1
while (abs(sum(delta)) > 0.0001) {
  xx <- x ## current x value
  W <- diag(array(func.W(x = x)))
  J <- t(X) %*% W %*% X
  D <- diag(array(func.D(x = x)))
  z <- X %*% x + solve(D) %*% (df1$y - func.mu(x))
  x <- solve(J) %*% t(X) %*% W %*% z
  delta <- x - as.matrix(xx)
}
x
##            [,1]
## [1,]  2.2704607
## [2,] -0.9081843
```
For the probit link, the Newton-Raphson update looks the same as (17):
\[\begin{equation} \begin{split} \boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - (\boldsymbol{H}^{(t)})^{-1}\boldsymbol{u}^{(t)} \end{split} \tag{20} \end{equation}\]
However, \(\boldsymbol{u}^{(t)}\) is actually different, and so is \(\boldsymbol{H}^{(t)}\).
\(\boldsymbol{u}^{(t)} = \begin{bmatrix} \frac{\partial \ell (\boldsymbol{\beta}^{(t)}|\boldsymbol{y};\boldsymbol{x})}{\partial \beta_0}\\ \frac{\partial \ell (\boldsymbol{\beta}^{(t)}|\boldsymbol{y};\boldsymbol{x})}{\partial \beta_1} \end{bmatrix}\), where \(\frac{\partial \ell (\cdot) }{\partial \beta_0}\) and \(\frac{\partial \ell (\cdot) }{\partial \beta_1}\) are the same as in equations (15) and (16).
\(\boldsymbol{H}^{(t)} = \begin{bmatrix} \frac{\partial^2 \ell(\boldsymbol{\beta}^{(t)} | \boldsymbol{y};\boldsymbol{x})}{\partial \beta_0^2} & \frac{\partial^2 \ell(\boldsymbol{\beta}^{(t)} | \boldsymbol{y};\boldsymbol{x})}{\partial \beta_0 \partial \beta_1} \\ \frac{\partial^2 \ell(\boldsymbol{\beta}^{(t)} | \boldsymbol{y};\boldsymbol{x})}{\partial \beta_0 \partial \beta_1} & \frac{\partial^2 \ell(\boldsymbol{\beta}^{(t)} | \boldsymbol{y};\boldsymbol{x})}{\partial \beta_1^2} \end{bmatrix}\). Since \(\boldsymbol{H}^{(t)}\) is determined by \(\boldsymbol{u}^{(t)}\), it differs from the Hessian used for the logit link via Newton-Raphson.
```r
## score function for the probit link: equations (15) and (16)
model2 <- function(x) c(sum((dnorm(x[1] + x[2] * df1$xx)) * (df1$y - pnorm(x[1] + x[2] * df1$xx)) /
                              ((pnorm(x[1] + x[2] * df1$xx)) * (1 - pnorm(x[1] + x[2] * df1$xx)))),
                        sum((dnorm(x[1] + x[2] * df1$xx)) * (df1$y - pnorm(x[1] + x[2] * df1$xx)) /
                              ((pnorm(x[1] + x[2] * df1$xx)) * (1 - pnorm(x[1] + x[2] * df1$xx))) * df1$xx))
delta <- matrix(1:2, nrow = 2)
x <- array(c(2, -2))
while (abs(sum(delta)) > 0.0001) {
  xx <- x ## current x value
  x <- as.matrix(x) - solve(jacobian(model2, x = x)) %*% model2(x) ## new x value
  delta <- x - as.matrix(xx)
}
x
##           [,1]
## [1,]  1.477002
## [2,] -0.590801
```
The Fisher scoring update is
\[\begin{equation} \begin{split} \boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + (\boldsymbol{J}^{(t)})^{-1}\boldsymbol{u}^{(t)} \end{split} \tag{21} \end{equation}\]
Where \(\boldsymbol{J}^{(t)} = \boldsymbol{X}^{T} \boldsymbol{W}^{(t)} \boldsymbol{X}\)
\(\boldsymbol{W}^{(t)} = \text{ diag }((\frac{\partial \mu_i^{(t)}}{\partial (\beta_0 + x_i\beta_1)})^2 \frac{1}{\text{var}(y_i)})\)
\(\frac{\partial \mu_i}{\partial (\beta_0 + x_i\beta_1)} = \frac{\partial p_i}{\partial (\beta_0 + x_i\beta_1)} = \varphi(\beta_0 + x_i\beta_1)\)
\(\text{Var}(y_i) = p_i (1-p_i) = \Phi(\beta_0 + x_i\beta_1)(1-\Phi(\beta_0 + x_i\beta_1))\)
\(\boldsymbol{W}^{(t)} = \text{ diag }(\frac{\varphi^2(\beta_0 + x_i\beta_1)}{\Phi(\beta_0 + x_i\beta_1)(1-\Phi(\beta_0 + x_i\beta_1))})^{(t)}\)
```r
X <- cbind(rep(1, 4), df1$xx)
X <- as.matrix(X)
## Fisher scoring weights for the probit link: phi(eta)^2 / (Phi(eta) * (1 - Phi(eta)))
func2 <- function(x) {
  x <- X %*% matrix(c(x[1], x[2]), nrow = 2)
  dnorm(x)^2 / (pnorm(x) * (1 - pnorm(x)))
}
## score function for the probit link: equations (15) and (16)
model2 <- function(x) c(sum((dnorm(x[1] + x[2] * df1$xx)) * (df1$y - pnorm(x[1] + x[2] * df1$xx)) /
                              ((pnorm(x[1] + x[2] * df1$xx)) * (1 - pnorm(x[1] + x[2] * df1$xx)))),
                        sum((dnorm(x[1] + x[2] * df1$xx)) * (df1$y - pnorm(x[1] + x[2] * df1$xx)) /
                              ((pnorm(x[1] + x[2] * df1$xx)) * (1 - pnorm(x[1] + x[2] * df1$xx))) * df1$xx))
delta <- matrix(1:2, nrow = 2)
x <- array(c(2, -1))
while (abs(sum(delta)) > 0.0001) {
  xx <- x ## current x value
  W <- diag(array(func2(x = x)))
  J <- t(X) %*% W %*% X
  x <- as.matrix(x) + solve(J) %*% model2(x)
  delta <- x - as.matrix(xx)
}
x
##            [,1]
## [1,]  1.4769988
## [2,] -0.5907995
```
The IRLS update is
\[\begin{equation} \begin{split} \boldsymbol{\beta}^{(t+1)} = (\boldsymbol{J}^{(t)})^{-1}\boldsymbol{X}^{T}\boldsymbol{W}^{(t)}\boldsymbol{Z}^{(t)} \end{split} \tag{22} \end{equation}\]
Where
\(\boldsymbol{Z}^{(t)} = \boldsymbol{X} \boldsymbol{\beta}^{(t)} + (\boldsymbol{D}^{(t)})^{-1}(\boldsymbol{y} - \boldsymbol{p}^{(t)})\)
\(\boldsymbol{D}^{(t)} = \text{ diag }(\frac{\partial \mu_i ^{(t)}}{\partial (\beta_0 + \beta_1x_i)}) = \text{diag }(\varphi(\beta_0 + \beta_1x_i))^{(t)}\)
```r
df1 <- data.frame(xx = 1:4,
                  y = c(1, 0, 1, 0))
X <- cbind(rep(1, 4), df1$xx)
X <- as.matrix(X) # design matrix X
func.mu <- function(x) {
  upeta <- X %*% matrix(c(x[1], x[2]), nrow = 2)
  return(pnorm(upeta))
}
func.W <- function(x) {
  upeta <- X %*% matrix(c(x[1], x[2]), nrow = 2)
  return((dnorm(upeta))^2 / (pnorm(upeta) * (1 - pnorm(upeta))))
}
func.D <- function(x) {
  upeta <- X %*% matrix(c(x[1], x[2]), nrow = 2)
  return(dnorm(upeta))
}
delta <- matrix(1:2, nrow = 2)
x <- array(c(2, -1)) # initial values for beta0 and beta1
while (abs(sum(delta)) > 0.0001) {
  xx <- x ## current x value
  W <- diag(array(func.W(x = x)))
  J <- t(X) %*% W %*% X
  D <- diag(array(func.D(x = x)))
  z <- X %*% x + solve(D) %*% (df1$y - func.mu(x))
  x <- solve(J) %*% t(X) %*% W %*% z
  delta <- x - as.matrix(xx)
}
x
##            [,1]
## [1,]  1.4769988
## [2,] -0.5907995
```
##R glm output
Using the `stats::glm` function, we obtain estimates identical to those from the MLE-logit link via Newton-Raphson, Fisher scoring, and IRLS. However, for the probit link, the estimates obtained via Newton-Raphson, Fisher scoring, and IRLS are slightly different from those produced by `stats::glm`. There must be a hidden detail that I don't know about; it will be interesting to find out how exactly `stats::glm` estimates logistic regression.
##Logit logistic regression
```r
glm1 <- glm(y ~ xx, data = df1, family = binomial)
glm1$coefficients
## (Intercept)          xx
##   2.2704607  -0.9081843
```
##Probit logistic regression
```r
glm2 <- glm(y ~ xx, data = df1, family = binomial(link = "probit"))
glm2$coefficients
## (Intercept)          xx
##   1.4769772  -0.5907909
```
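One plausible, unverified explanation for the small probit discrepancy above is simply convergence tolerance: `stats::glm` stops its IRLS iterations when the relative change in deviance falls below `glm.control()`'s `epsilon` (1e-8 by default), whereas the loops above stop on a fairly loose coefficient-change criterion. A sketch for probing this (reusing `df1` from above; the tolerance values are arbitrary):

```r
## refit the probit model with a much tighter convergence tolerance
glm2_tight <- glm(y ~ xx, data = df1, family = binomial(link = "probit"),
                  control = glm.control(epsilon = 1e-12, maxit = 100))
glm2_tight$coefficients
```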