Prediction of internal egg quality characteristics and variable selection using regularization methods: ridge, LASSO and elastic net

This study was conducted to determine the inner quality characteristics of eggs using external egg quality characteristics. The variables were selected in order to obtain the simplest model using ridge, LASSO and elastic net regularization methods. For this purpose, measurements of the internal and external characteristics of 117 Japanese quail eggs were made. Internal quality characteristics were egg yolk weight and albumen weight; external quality characteristics were egg width, egg length, egg weight, shape index and shell weight. The ordinary least squares method was applied to the data. Ridge, LASSO and elastic net regularization methods were then applied to address the multicollinearity in the data. The regression estimating equations of the internal egg quality were significant for all methods (P < 0.01). The goodness of fit of the regression estimating equations for egg yolk weight was 58.34 %, 59.17 % and 59.11 % for the ridge, LASSO and elastic net methods, respectively. For egg albumen weight, the goodness of fit of the regression estimating equations was 75.60 %, 75.94 % and 75.81 % for the ridge, LASSO and elastic net methods, respectively. It was revealed that LASSO, including two predictors for both egg yolk weight and egg albumen weight, was the best model with regard to high predictive accuracy.


Introduction
The egg production industry has significant economic value and is a considerable source of employment. Consequently, it has an important place in the development of countries' economies and in meeting the nutritional needs of people worldwide. Determination of egg quality is a requirement both for edible eggs and for the production of hatching eggs. Egg quality is examined in two parts in this study, with a focus on both internal and external quality characteristics. Previous research has pointed out that egg weight, shell weight, shell thickness, egg yolk weight, albumen weight, the albumen index, the egg yolk index and the Haugh units are all significant factors affecting egg quality (Uluocak et al., 1995; Khurshid, 2003; Alkan et al., 2010). These egg characteristics are highly correlated and are used to determine the relationship between the internal and external quality of eggs (Khurshid et al., 2003; Kul and Şeker, 2004; Abanikannda et al., 2007; Üçkardeş et al., 2012).
In multiple linear regression analysis based on the ordinary least squares (OLS) method, this high correlation between independent or predictor variables can lead to the issue of multicollinearity (MC) (Montgomery et al., 2001; Şahinler, 2000). It has been reported that the MC problem reduces the reliability of estimates because it inflates the standard errors of the regression coefficients (Montgomery et al., 2001; Albayrak, 2005; Yakubu, 2010). As a result, although the OLS estimates remain unbiased in a model with the MC issue, it is not clear how the various egg weight measurements are affected by the egg components.
Various methods to overcome the MC problem are discussed in the literature. One of the methods used in such cases is ridge regression (Hoerl and Kennard, 1970), which is a regularization method that has been used by a number of researchers (Topal et al., 2010; Üçkardeş et al., 2012; Shafey et al., 2014; Orhan et al., 2016). Another regularization method is the least absolute shrinkage and selection operator, "LASSO" (Tibshirani, 1996). LASSO is a successful continuous procedure for estimating and selecting variables (Tibshirani, 1996; Efron et al., 2004; Hastie et al., 2007). This method has been successfully used by Kominakis et al. (2009), Ogutu et al. (2012), Acharjee et al. (2013) and Amin et al. (2014). However, LASSO has two important limitations, which emerge when the number of variables exceeds the number of observations (k > n) and when the pairwise correlations within a group of variables are high (Efron et al., 2004). The elastic net (EN) method, proposed by Zou and Hastie (2005), addresses these shortcomings of the LASSO method. While this method works like LASSO when choosing variables, it functions like ridge by bringing the coefficients of correlated predictors closer to each other (Hastie et al., 2008). There is currently no known study demonstrating the use of the LASSO and EN methods to determine the internal quality characteristics of eggs.
Therefore, the aims of this study were to determine egg yolk weight and albumen weight from external egg quality characteristics using the ridge, LASSO and EN regression models and to select the variables in order to reduce model complexity.

Materials
The materials utilized in this study were 117 eggs taken from Japanese quails; the eggs were obtained from the Van Yuzuncu Yil University Research and Application Farm. The eggs were collected daily, and the variables measured were egg weight (EWT), egg yolk weight (EYWT), egg albumen weight (EAWT) and shell weight (ESWT) (in grams) and egg width (EWI) and egg length (ELE) (in mm). Shape index (SI) is a value that depends on EWI and ELE; SI was calculated using the following equation: SI = [EWI/ELE] × 100. EWI, ELE, EWT, SI and ESWT were used as predictor variables in the models that were created separately for EYWT and EAWT.
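The shape index formula above can be sketched in a few lines of Python. The function name and the measurements are hypothetical and are not taken from the study's data set. Because SI is computed directly from EWI and ELE, including all three as predictors is one plausible source of the strong dependence among predictors examined later.

```python
def shape_index(egg_width_mm: float, egg_length_mm: float) -> float:
    """Return the shape index SI = (EWI / ELE) * 100."""
    if egg_length_mm <= 0:
        raise ValueError("egg length must be positive")
    return egg_width_mm / egg_length_mm * 100.0


# Hypothetical measurements in mm (not from the study's data set).
si = shape_index(egg_width_mm=26.0, egg_length_mm=32.5)
print(f"SI = {si:.1f}")
```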

Ordinary least squares
For the multiple linear regression model with k independent variables and n individuals, the following equation was used for OLS prediction:

$$\hat{\beta}^{\mathrm{OLS}} = \underset{\beta}{\arg\min} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \Big)^2 \tag{1}$$

where $\hat{\beta}^{\mathrm{OLS}}$ is the OLS estimate of the unknown parameters in the regression equation, $y_i$ is the dependent variable (i = 1, 2, ..., n), $\beta_0$ is the intercept, $\beta_j$ (j = 1, 2, ..., k) are the unknown parameters of the regression equation and $x_{ij}$ indicates the explanatory or predictor variables.

Ridge
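Ridge, a biased prediction method, is based on the principle of minimizing the residual sum of squares (RSS) subject to a penalty on the size of the β coefficients. The following equation is used to obtain the ridge coefficients:

$$\hat{\beta}^{\mathrm{ridge}} = \underset{\beta}{\arg\min} \Big\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{k} \beta_j^2 \Big\} \tag{2}$$

where λ ≥ 0 is the complexity constant controlling the amount of shrinkage (Marquardt, 1970), and $\lVert \beta \rVert_2^2 = \sum_{j=1}^{k} \beta_j^2$ is the ridge penalty function (Hastie et al., 2008).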
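As an illustration of the ridge criterion described above, the following minimal sketch solves the penalized normal equations (X'X + λI)β = X'y on centered data, so that the intercept is not penalized. It uses only the Python standard library. The function names and the small data set with two nearly collinear predictors are hypothetical, and this is not the SAS procedure used in the study.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ridge_fit(rows, y, lam):
    """Return (intercept, coefficients) minimizing RSS + lam * sum(beta_j ** 2)."""
    n, k = len(rows), len(rows[0])
    x_mean = [sum(r[j] for r in rows) / n for j in range(k)]
    y_mean = sum(y) / n
    xc = [[r[j] - x_mean[j] for j in range(k)] for r in rows]  # centered predictors
    yc = [v - y_mean for v in y]                               # centered response
    # Penalized normal equations: (X'X + lam * I) beta = X'y
    xtx = [[sum(xc[i][p] * xc[i][q] for i in range(n)) + (lam if p == q else 0.0)
            for q in range(k)] for p in range(k)]
    xty = [sum(xc[i][p] * yc[i] for i in range(n)) for p in range(k)]
    beta = solve_linear_system(xtx, xty)
    intercept = y_mean - sum(bj * mj for bj, mj in zip(beta, x_mean))
    return intercept, beta


# Hypothetical data: the second predictor is almost a copy of the first.
rows = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
for lam in (0.0, 1.0, 10.0):
    b0, beta = ridge_fit(rows, y, lam)
    print(lam, round(b0, 3), [round(bj, 3) for bj in beta])
```

With λ = 0 the function reduces to OLS; as λ grows, the squared length of the coefficient vector shrinks. In practice the predictors are usually standardized first, because the penalty depends on their scale.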

LASSO
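In this method, it is possible to obtain β coefficients by solving the following optimization problem:

$$\hat{\beta}^{\mathrm{LASSO}} = \underset{\beta}{\arg\min} \Big\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{k} \lvert \beta_j \rvert \Big\} \tag{3}$$

where $\lVert \beta \rVert_1 = \sum_{j=1}^{k} \lvert \beta_j \rvert$ is the LASSO penalty function. The $\ell_1$ penalty is added to the least squares fit and shrinks some components of $\hat{\beta}^{\mathrm{LASSO}}$ to exactly zero. The solution of the LASSO method requires quadratic programming (Hastie et al., 2007).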
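To show how the LASSO penalty can set a coefficient to exactly zero, the sketch below minimizes RSS + λ Σ|β_j| by cyclic coordinate descent with soft-thresholding. This is a swapped-in technique chosen for brevity, not the quadratic programming approach mentioned above and not the SAS procedure used in the study. The function names, the fixed number of sweeps and the data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_fit(rows, y, lam, n_sweeps=500):
    """Coordinate descent for: minimize RSS + lam * sum(|beta_j|).

    Assumes no predictor column is constant (otherwise z below is zero).
    """
    n, k = len(rows), len(rows[0])
    x_mean = [sum(r[j] for r in rows) / n for j in range(k)]
    y_mean = sum(y) / n
    xc = [[r[j] - x_mean[j] for j in range(k)] for r in rows]
    yc = [v - y_mean for v in y]
    beta = [0.0] * k
    for _ in range(n_sweeps):
        for j in range(k):
            # rho: inner product of predictor j with the partial residual,
            # i.e. the residual computed without predictor j's contribution.
            rho = 0.0
            for i in range(n):
                partial = yc[i] - sum(xc[i][m] * beta[m] for m in range(k) if m != j)
                rho += xc[i][j] * partial
            z = sum(xc[i][j] ** 2 for i in range(n))
            # One-dimensional minimizer of the penalized criterion.
            beta[j] = soft_threshold(rho, lam / 2.0) / z
    intercept = y_mean - sum(bj * mj for bj, mj in zip(beta, x_mean))
    return intercept, beta


# Hypothetical data: y depends only on the first predictor.
rows = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0], [5.0, 1.0], [6.0, -1.0]]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
intercept, beta = lasso_fit(rows, y, lam=2.0)
print(round(intercept, 4), [round(bj, 4) for bj in beta])
```

In this example the coefficient of the second predictor is removed from the model, while the first is shrunk slightly below its least squares value of 2.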

Elastic net (EN)
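Elastic net is an extension of the LASSO method that is robust to extreme correlations among the predictors (Friedman et al., 2010). The method uses a mixture of the ridge ($\ell_2$) and LASSO ($\ell_1$) penalties and can be formulated as follows:

$$\hat{\beta}^{\mathrm{EN}} = \underset{\beta}{\arg\min} \Big\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \Big)^2 + \lambda_2 \sum_{j=1}^{k} \beta_j^2 + \lambda_1 \sum_{j=1}^{k} \lvert \beta_j \rvert \Big\} \tag{4}$$

where $\lambda_1 \geq 0$ and $\lambda_2 \geq 0$ control the LASSO and ridge penalties, respectively.

Goodness of fit
The adjusted coefficient of determination ($R^2_{\mathrm{adj}}$) was used as the goodness-of-fit criterion to compare the ridge, LASSO and EN methods:

$$R^2_{\mathrm{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1} \tag{5}$$

In Eq. (5), $R^2$ represents the determination coefficient, n represents the sample size and p represents the total number of explanatory variables in the model, not including the constant. The statistical analyses were performed using the GLMSELECT procedure in SAS/STAT (SAS, 2014).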
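The adjusted coefficient of determination described above penalizes a model for each additional predictor, which is why it suits comparisons between five-predictor and two-predictor equations. A minimal sketch follows, using the standard formula with n and p as defined in the text. The function name and the R² value of 0.60 are hypothetical; only n = 117 comes from the study.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# n = 117 eggs as in the study; R2 = 0.60 is a hypothetical value.
for p in (2, 5):
    print(p, round(adjusted_r2(0.60, n=117, p=p), 4))
```

For the same R², the model with fewer predictors receives the higher adjusted value.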

Results
The Pearson correlation coefficients between the internal and external quality characteristics of quail eggs and the MC diagnostics, variance inflation factors (VIFs) and tolerance values (TVs), are given in Table 2. Eigenvalues and conditional index (CI) values, the other criteria used to determine MC, are presented in Table 3. The respective correlations between EWI and EWT and EWI and SI were 0.371 and 0.806 (P < 0.01), the respective correlations between ELE and EWT and ELE and SI were 0.654 and −0.529 (P < 0.01) and the correlation between EWT and ESWT was 0.183 (P < 0.05). The VIF values for EWI, ELE and SI were very high, 872.7, 416.4 and 1197.2, respectively, and TV values for these variables were close to zero, 0.00115, 0.00240 and 0.00084, respectively. In Table 3 it can be seen that the eigenvalues are close to zero (ranging from 0.018 to 6.18 × 10⁻⁷) and the CI values are very high (ranging from 17.98 to 3109.37).
The prediction equations of the internal quality characteristics obtained using the OLS, ridge, LASSO and EN methods in the multiple linear regression analyses are given in Table 4. For all of the methods, the prediction equations were found to be significant (P < 0.01). When Table 4 is examined, it can be seen that the standard errors in ridge for EYWT show a marked decrease, with the exception of EWT and ESWT. A similar result is also found for EAWT. When the results of LASSO and EN are evaluated, it is seen that the coefficients of EWI, SI and ESWT are reduced to zero for EYWT and the coefficients of EWI, ELE and SI are reduced to zero for EAWT.
The goodness of fit measurements of the prediction equations for the OLS, ridge, LASSO and EN methods and the number of predictors in the prediction equations are presented in Table 5. There are five predictor variables in OLS and ridge and two in LASSO and EN, both for EYWT and EAWT.
Table 5 shows that the R²adj values for EYWT are 58.34 %, 59.17 % and 59.11 % for ridge, LASSO and EN, respectively, whilst the EAWT R²adj values for the ridge, LASSO and EN methods are 75.60 %, 75.94 % and 75.81 %, respectively.

Discussion
When the data used in the study were evaluated in terms of basic statistics, EYWT, EAWT, EWI, ELE, EWT and SI were found to be similar to the findings of Kul and Şeker (2004), whereas ESWT was 1.46 ± 0.02, which was higher than that reported by Kul and Şeker (2004) (0.84 ± 0.01).
The results of the correlation analyses showed that high and significant correlations were obtained between the predictor variables: the correlation between EWI and SI was 0.806 (P < 0.001), the correlation between ELE and EWT was 0.654 (P < 0.001) and the correlation between ELE and SI was negative, at −0.529 (P < 0.001). Table 2 shows that it was necessary to investigate the MC problem. Similar findings have also been reported in a variety of studies on the internal and external quality characteristics of eggs, such as those by Özçelik (2002), Kul and Şeker (2004), Alkan et al. (2010) and Rathert et al. (2011).
In order to investigate the MC problem, the VIFs and TVs in Table 2 and the eigenvalues and CI values in Table 3 were calculated using the OLS method. This was undertaken because it is known that the correlation between the predictor variables is not sufficient to define the MC issue (Albayrak, 2005; Shafey et al., 2014). The OLS results showed that VIF values were greater than 10 for three variables: 872.7, 416.4 and 1197.2 for EWI, ELE and SI, respectively. The TVs were correspondingly small because each TV is the reciprocal of the corresponding VIF; equivalently, the high VIF values reflect the small tolerance values, as reported by Albayrak (2005). The eigenvalues were very close to zero (down to 6.18 × 10⁻⁷) and the CI values were greater than 30 (up to 3109.37). All of these results revealed that there was in fact an MC problem in the dataset, in line with the criteria reported by Marquardt and Snee (1975), Belsley (1991) and Albayrak (2005).
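The relationship between the diagnostics discussed above can be checked with a short sketch. It assumes the usual definitions TV = 1/VIF and CI_i = sqrt(largest eigenvalue / eigenvalue_i); the function names and the eigenvalues in the second example are hypothetical, while the VIF values are those reported in the text.

```python
import math


def tolerance_from_vif(vif: float) -> float:
    """Tolerance value as the reciprocal of the variance inflation factor."""
    return 1.0 / vif


def condition_indices(eigenvalues):
    """CI_i = sqrt(largest eigenvalue / eigenvalue_i), assuming positive eigenvalues."""
    largest = max(eigenvalues)
    return [math.sqrt(largest / ev) for ev in eigenvalues]


# VIF values reported in the text for EWI, ELE and SI.
reported_vif = {"EWI": 872.7, "ELE": 416.4, "SI": 1197.2}
tolerances = {name: round(tolerance_from_vif(v), 5) for name, v in reported_vif.items()}
print(tolerances)

# Hypothetical eigenvalues, only to show how the index grows as they approach zero.
print(condition_indices([4.0, 1.0, 0.25]))
```

Rounded to five decimals, the reciprocals of the reported VIFs agree with the tolerance values given in the text (0.00115, 0.00240 and 0.00084).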
The aims of this study were to determine the internal quality characteristics of eggs and to choose variables using the external quality characteristics of eggs. As previous studies have shown that OLS estimates are less reliable if the data have an MC problem (Hoerl and Kennard, 1970; Montgomery et al., 2001; Albayrak, 2005; Yakubu, 2010), ridge regression was applied to the data to eliminate the MC issue (Table 4). The results of the regression analyses for both EYWT and EAWT were found to be significant (P < 0.001). The coefficients and standard errors of EWI, ELE, EWT, SI and ESWT in the prediction equations for EYWT and EAWT were smaller than those in the OLS prediction (Table 4); in particular, the signs of the coefficients of EWI and SI changed. All of these results were similar to those found in the literature (e.g., Topal et al., 2010; Üçkardeş, 2012; Öztürk, 2014). Because ridge regression does not perform variable selection, LASSO and EN were applied to the data. Only two predictor variables were included in the prediction equations of LASSO and EN (ELE and EWT for EYWT; EWT and ESWT for EAWT) and the regression equations were both found to be significant (P < 0.001, Table 4). Both methods provided similar results in terms of coefficients and standard errors. The coefficients and the standard errors of ELE and EWT in both EN and LASSO were smaller than those in ridge for EYWT. Apart from the standard error of EWT in ridge, similar results were obtained for EAWT (Table 4). These results revealed that LASSO and EN performed better than ridge regression in this study, which was consistent with the study by Ogutu et al. (2012).
The goodness of fit statistics used in order to find the best models are only given for OLS and the regularization methods (Table 5). Since the number of parameters in the prediction equations obtained by the regularization methods differed between methods, R²adj was used to compare the methods. For EYWT, the predictive ability as depicted by R²adj was highest using the LASSO method (59.17 %) and lowest using the ridge method (58.34 %). This was similar for EAWT, where R²adj was highest in LASSO (75.94 %) and lowest in ridge (75.60 %). Therefore, for both EYWT and EAWT, the LASSO technique succeeded in selecting the variables with the highest predictive ability. Zou and Hastie (2005) found that EN performed better than ridge and LASSO in terms of model choice consistency and predictive accuracy in their study. However, this result is only valid under two conditions: (1) that the data being studied contain more predictor variables than the number of observations (k > n) and (2) that there is a group of variables among which the pairwise correlations are very high. The data used in this study do not satisfy these conditions. In this research, a simpler prediction equation, which is both highly predictive and easy to interpret, was obtained using the LASSO technique. These results were also found to be consistent with the literature (Efron et al., 2004; Zou and Hastie, 2005; Friedman et al., 2010).
The determination of internal egg quality characteristics is important in terms of edible eggs and the production of hatching eggs. In this study the ridge, LASSO and EN regularization methods were used in order to obtain prediction equations and perform variable selection for both EYWT and EAWT. It was revealed that LASSO, including two predictors in the prediction equation, was the best model with regard to high predictive accuracy. ELE and EWT were included in the prediction equation for EYWT, while EWT and ESWT were included for EAWT.

Conclusions
Regularization methods are superior to OLS in data with an MC problem because, when these methods are used, more accurate and reliable prediction equations are obtained. In this study we introduced the LASSO and EN methods for prediction and variable selection in agricultural research. It is concluded that LASSO and EN techniques may be utilized to develop the best and most stable models for internal egg quality characteristic prediction using external egg quality characteristics because they overcome the MC problem. These techniques also enable the selection of sufficient variables in order to obtain models that are easily interpreted by researchers.

Table 1. Descriptive statistics of egg quality characteristics.

Table 2. Correlation coefficients between internal and external quail egg quality characteristics, with variance inflation factors and tolerance values.

Table 3. Eigenvalues and conditional index values of external egg quality characteristics predicting EYWT and EAWT. E: eigenvalue; CI: conditional index.

Table 4. The estimation of coefficients obtained using the OLS, ridge, LASSO and EN methods in the multiple linear regression analyses (standard errors in parentheses) for EYWT and EAWT.

Table 5. Goodness of fit measurements of OLS, ridge, LASSO and EN methods in multiple linear regression analyses.