Chapter 5 Imputing Port Connections
5.1 Motivation
Following the principal component analysis of the port combinations present in the dataset, the analysis shifts focus to the combinations that are not present. Imputing a value for each of the four continuous features for every possible source and destination port combination yields a reasonable expected value in each cell of the ports matrix, which can then be compared against actual connection values as they are observed. Actual values that differ greatly from the imputed values are flagged as anomalies requiring further investigation.
This results in a 3-dimensional matrix completion problem, with dimensions given by the number of source ports, the number of destination ports, and the number of continuous features observed in each transaction. Similar techniques were employed in the Netflix Prize, where top competitors used matrix completion to predict a user's rating of an unwatched movie from the other entries in the user-by-movie matrix.
5.2 Matrix Completion Algorithm
There are \(m\) source ports and \(n\) destination ports. \(Y \in {\rm I\!R}^{m \times n}\) is the matrix that stores the means of the combinations of source ports and destination ports. \(Y\) has missingness because not every source port interacts with every destination port. \(F \in {\rm I\!R}^{m \times n}\) is a sparse matrix of combination frequencies, e.g., \(F[32242, 12312]\) is the number of observations for the source port 32242, destination port 12312 interaction. \(M \in \{0,1\}^{m \times n}\) is a boolean matrix indicating whether the corresponding \(Y\) values are missing, so \(Y[M]\) denotes all of the missing values of \(Y\).
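As a concrete sketch of this notation in R (assuming, as in the implementation of Section 5.2.4, that \(F\) is stored as ports_freq_matrix and \(Y\) as ports_mean_matrix):
# M is TRUE exactly where a source/destination port combination was never observed
M = (ports_freq_matrix == 0)
Y_missing = ports_mean_matrix[M]   # Y[M]: all of the missing entries of Y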
The objective is \[\min \sum_{i,j:\,F_{i,j} > 0} (Y_{i,j} - u_i D v_j^T)^2\] where \(UDV^T\) is the singular value decomposition of \(Y\), \(u_i\) is the \(i\)-th row of \(U\), and \(v_j\) is the \(j\)-th row of \(V\). There are multiple steps to the matrix completion process:
5.2.1 ANOVA Initial Imputation
Impute initial values for the missing \(y_{i,j}\) observations, \(1 \leq i \leq m\), \(1 \leq j \leq n\). In general an additive model is applicable: \[y_{i,j} = \mu + a_i + b_j + \epsilon_{i,j}\] where \(\epsilon_{i,j} \sim N(0,\sigma^2)\), \(\mu\) is the overall mean, \(a_i\) is the row effect, and \(b_j\) is the column effect. An analysis of variance (ANOVA) imputation is used to fill in the initial values \(y_{i,j}\). Ignoring the missing values for now, let \(y_{..}\) denote the empirical overall mean, \(y_{i.}\) the empirical row mean, and \(y_{.j}\) the empirical column mean. Then \[y_{i,j} = y_{..} + (y_{i.}-y_{..}) + (y_{.j}-y_{..}) = y_{i.} + y_{.j} - y_{..}\]
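A minimal sketch of this fill-in on a toy matrix (the values are purely illustrative):
# toy 3x3 matrix with one missing cell
Y_toy = matrix(c(1, 2, 3,
                 2, 3, 4,
                 3, 4, NA), nrow = 3, byrow = TRUE)
overall = mean(Y_toy, na.rm = TRUE)      # empirical overall mean y_..
row_3 = mean(Y_toy[3, ], na.rm = TRUE)   # empirical row mean y_3.
col_3 = mean(Y_toy[, 3], na.rm = TRUE)   # empirical column mean y_.3
Y_toy[3, 3] = row_3 + col_3 - overall    # ANOVA initial imputation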
5.2.2 Repeated Imputation
The repeated imputation procedure solves \(Y^{(s)}[M] = R_k(Y^{(s-1)})[M]\), where \(R_k\) denotes the best rank-\(k\) approximation and \(s\) indexes the step. At each step \(s\), singular value decomposition factors \[Y^{(s)} = U^{(s)} D^{(s)} V^{(s)T}\] where \(D^{(s)}\) is the diagonal matrix of singular values and \(U^{(s)}\) and \(V^{(s)}\) hold the left and right singular vectors of \(Y^{(s)}\), respectively.
The Eckart-Young-Mirsky (EYM) theorem provides the best rank-\(k\) approximation used to update the missing values in \(Y^{(s+1)}\). Recall \(Y[M]\) represents all of the missing values of \(Y\). Applying the EYM theorem: \[Y^{(s+1)}[M] = (U^{(s)}[,1:k]\; D^{(s)}[1:k,1:k]\; V^{(s)}[,1:k]^T)[M]\] where \(U[,1:k]\) denotes the first \(k\) columns of \(U\), \(D[1:k,1:k]\) the leading \(k \times k\) block of \(D\), and \(V[,1:k]\) the first \(k\) columns of \(V\).
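A minimal sketch of a single update step in R, assuming Y_s holds the current completed matrix, M is the boolean missingness matrix from Section 5.2, and k is the chosen rank:
# one EYM update: the best rank-k approximation overwrites the missing cells
svd_Y = svd(Y_s)
U_k = svd_Y$u[, 1:k, drop = FALSE]   # U[,1:k]
D_k = diag(svd_Y$d[1:k], nrow = k)   # leading k x k block of D
V_k = svd_Y$v[, 1:k, drop = FALSE]   # V[,1:k]
Y_next = Y_s
Y_next[M] = (U_k %*% D_k %*% t(V_k))[M]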
5.2.3 Convergence Criterion
The EYM rank-\(k\) imputation steps are repeated until the relative difference between \(Y^{(s+1)}\) and \(Y^{(s)}\) falls below a set threshold, \(T\): \[\frac{\|Y^{(s+1)}-Y^{(s)}\|_F}{\|Y^{(s)}\|_F} < T\] where \(\|\cdot\|_F\) is the Frobenius norm. The denominator ensures the convergence criterion is invariant to a change of scale in the matrix itself.
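The check itself is a one-line computation; a sketch assuming Y_s and Y_next are successive iterates from the step above (the threshold is named T_threshold because T is reserved shorthand for TRUE in R):
# relative Frobenius change between successive iterates
T_threshold = 1e-6   # illustrative threshold
converged = sqrt(sum((Y_next - Y_s)^2)) / sqrt(sum(Y_s^2)) < T_threshold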
5.2.4 Implementation
# matrix parameters: restrict to the most frequent ports
n_Sport = 20
n_Dport = 20
# most frequent source ports
Sport_table = as.data.frame(table(argus$Sport))
Sport_table = Sport_table[order(-Sport_table$Freq), ]
top_Sport = head(Sport_table$Var1, n_Sport)
# most frequent destination ports
Dport_table = as.data.frame(table(argus$Dport))
Dport_table = Dport_table[order(-Dport_table$Freq), ]
top_Dport = head(Dport_table$Var1, n_Dport)
# starting matrices: a vector of observations and a frequency count per cell
ports_combo_matrix = matrix(list(), nrow = n_Sport, ncol = n_Dport)
dimnames(ports_combo_matrix) = list(top_Sport, top_Dport)
ports_freq_matrix = matrix(0, nrow = n_Sport, ncol = n_Dport)
dimnames(ports_freq_matrix) = list(top_Sport, top_Dport)
# fill ports_combo_matrix and ports_freq_matrix
for (s in 1:n_Sport){
  for (d in 1:n_Dport){
    combination = argus[is.element(argus$Sport, top_Sport[s])
                        & is.element(argus$Dport, top_Dport[d]), ]
    obs = combination$SrcBytes
    obs = obs[!is.na(obs)]       # drop missing byte counts before counting
    n_obs = length(obs)
    if (n_obs > 0){
      obs = nscore(obs)$nscore   # normal score transformation
      # store the whole vector at once rather than appending element by element
      ports_combo_matrix[[s, d]] = obs
      ports_freq_matrix[s, d] = n_obs
    }
  }
}
# mean and variance matrices (NA marks unobserved combinations)
ports_mean_matrix = matrix(NA, nrow = n_Sport, ncol = n_Dport)
dimnames(ports_mean_matrix) = list(top_Sport, top_Dport)
ports_variance_matrix = matrix(NA, nrow = n_Sport, ncol = n_Dport)
dimnames(ports_variance_matrix) = list(top_Sport, top_Dport)
# fill mean and variance matrices
for (s in 1:n_Sport){
  for (d in 1:n_Dport){
    if (ports_freq_matrix[s, d] == 1){
      # a single observation is its own mean; var() would return NA here
      ports_mean_matrix[s, d] = ports_combo_matrix[[s, d]]
      ports_variance_matrix[s, d] = 0
    } else if (ports_freq_matrix[s, d] > 1){
      ports_mean_matrix[s, d] = mean(ports_combo_matrix[[s, d]])
      ports_variance_matrix[s, d] = var(ports_combo_matrix[[s, d]])
    }
  }
}
#### Eckart-Young-Mirsky Theorem Implementation, Best Rank-k Approximation ####
matrix_complete = function(S = 1000, k = 2, n_Sport, n_Dport, Y, M, tol = 1e-6){
  Y_imputed = Y
  observed = (M != 0)   # boolean mask of observed cells
  # overall mean of the observed cell means
  overall_mean = mean(Y[observed])
  # row and column means, ignoring missing cells
  row_means = rowMeans(Y, na.rm = TRUE)
  col_means = colMeans(Y, na.rm = TRUE)
  # rows or columns with no observations yield NaN means; fall back to 0
  row_means[!is.finite(row_means)] = 0
  col_means[!is.finite(col_means)] = 0
  # ANOVA initial imputation: y_ij = row mean + column mean - overall mean
  for (s in 1:n_Sport){
    for (d in 1:n_Dport){
      if (!observed[s, d]){
        Y_imputed[s, d] = row_means[s] + col_means[d] - overall_mean
      }
    }
  }
  # repeated rank-k imputation (Section 5.2.2) with the convergence
  # criterion of Section 5.2.3
  for (i in 1:S){
    Y_previous = Y_imputed
    # extract SVD
    svd_Y = svd(Y_imputed)
    D = diag(svd_Y$d[1:k], nrow = k, ncol = k)
    U = svd_Y$u[, 1:k, drop = FALSE]
    V = svd_Y$v[, 1:k, drop = FALSE]
    # EYM theorem: best rank-k approximation
    EYM = U %*% D %*% t(V)
    # overwrite only the missing cells
    Y_imputed[!observed] = EYM[!observed]
    # stop once the relative Frobenius change falls below tol
    if (sqrt(sum((Y_imputed - Y_previous)^2)) / sqrt(sum(Y_previous^2)) < tol){
      break
    }
  }
  return(Y_imputed)
}
ports_mean_matrix_imputed = matrix_complete(1000, 2, n_Sport, n_Dport, ports_mean_matrix, ports_freq_matrix)
# relative distance using the Frobenius norm (Section 5.2.3)
relative_distance = function(Y, Y_imputed){
  return(sqrt(sum((Y - Y_imputed)^2)) / sqrt(sum(Y^2)))
}
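For example, the function can confirm after the fact that the final iterate changed little from the one before it (Y_prev is a hypothetical stored copy of the second-to-last iterate):
# hypothetical post-hoc convergence check against a stored earlier iterate
relative_distance(Y_prev, ports_mean_matrix_imputed) < 1e-6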
5.3 Assessing Imputation Strategy
5.3.1 Leave One Out Cross Validation
To assess the quality of the imputation, Leave-One-Out Cross Validation (LOOCV) is used to generate a prediction error. LOOCV cycles through the observed values, marking each in turn as missing and then performing the imputation process described above. The prediction error is then calculated as a function of the difference between the imputed value and the true value. In this case, the algorithm records the total absolute error \(\sum_{i,j} \lvert \hat y_{i,j} - y_{i,j} \rvert\) and the root mean square error \(\sqrt{\frac{1}{n}\sum_{i,j} (\hat y_{i,j} - y_{i,j})^2}\), where \(n\) is the number of non-missing observations.
5.3.2 Implementation
# Leave-One-Out Cross Validation
loocv = function(S = 1000, k = 2, nrows = n_Sport, ncols = n_Dport, Y, M){
  error = 0
  rmse = 0
  n = 0
  for (s in 1:nrows){
    for (d in 1:ncols){
      if (M[s, d] != 0){
        n = n + 1
        # hold out one observed cell: mark it missing in both M and Y so the
        # held-out value cannot leak into the row, column, or overall means
        M_heldout = M
        M_heldout[s, d] = 0
        Y_heldout = Y
        Y_heldout[s, d] = NA
        Y_imputed = matrix_complete(S, k, nrows, ncols, Y_heldout, M_heldout)
        error = error + abs(Y_imputed[s, d] - Y[s, d])
        rmse = rmse + (Y_imputed[s, d] - Y[s, d])^2
      }
    }
  }
  rmse = sqrt(rmse / n)
  return(list(Error = error, RMSE = rmse, Observations = n))
}
The rank \(k\) is then chosen by sweeping over candidate ranks and re-running LOOCV at each one. A minimal sketch of this sweep, assuming the loocv() function above (the maximum candidate rank of 5 is an illustrative choice):
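# find the optimal rank: record the LOOCV RMSE for each candidate rank
max_rank = 5   # illustrative upper bound on the candidate ranks
rank_rmse = numeric(max_rank)
for (k in 1:max_rank){
  result = loocv(1000, k, n_Sport, n_Dport, ports_mean_matrix, ports_freq_matrix)
  rank_rmse[k] = result$RMSE
}
optimal_rank = which.min(rank_rmse)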
5.3.3 Results
While matrix completion via singular value decomposition produces valid imputations for the missing values, and the algorithm converges relatively quickly, the error generated from leave-one-out cross validation shows that the imputation performs rather poorly for low-rank solutions to this data. The errors are minimized at a rank-3 approximation, but even at this rank they are relatively high, considering the data were first normal-score transformed.
This poor performance may largely be due to the fact that the algorithm does not account for the variability in the number of observations underlying each imputed cell. Unlike the Netflix Prize, in which each cell of the matrix being completed contained a single user's rating of a movie, each cell of the matrix in this problem contains the average of a variable number of observations.