EXPLORING DATA MINING ALGORITHMS FOR PREDICTING DUCK EGG WEIGHT BASED ON EGG QUALITY CHARACTERISTICS



INTRODUCTION
Eggs have acquired greater importance as an inexpensive and high-quality protein source (Almeida et al., 2020). Eggs are common ingredients used by the food industry, predominantly for their taste and functional properties. Additionally, eggs contain numerous biologically active compounds that remain largely unexplored, but they hold significant potential for applications in the medical, pharmaceutical, and biotechnological industries (Anton et al., 2006; Zhang et al., 2021). Moreover, egg quality characteristics such as egg weight, proportions of shell, yolk and albumen, and nutrient composition can considerably affect the growing embryo during incubation and chick performance (İpek and Sözcü, 2013). Hence, continuous evaluation of different egg quality traits has become one of the major points of concern in modern poultry production (Wang et al., 2017). Besides the chicken, ducks are the most significant poultry species (Bello et al., 2022). Duck production is one of the branches of poultry production that supplies protein, eggs, and fatty liver (El-Deghadi et al., 2022). Duck eggs are more nutritious than chicken eggs because they contain less water (Ismoyowati and Sumarmono, 2019). Egg production in the most productive duck breeds reaches about 250 to 300 eggs per year (Abd EL-Hack et al., 2019). However, the economic importance and contribution of ducks to food security vary considerably between continents and countries (Pingel, 2011). In Algeria, Mallard ducks are abundant, but their breeding is relatively undeveloped and restricted to traditional farms due to the lack of information on the nutritional value of ducks. To the best of our knowledge, no work has been undertaken to date to characterize duck eggs in Algeria.
Journal of Animal & Plant Sciences, 34(2): 2024, Page: 336-350. ISSN (print): 1018-7081; ISSN (online): 2309-8694. https://doi.org/10.36899/JAPS.2024.2.0721
In animal research, several studies have made use of traditional statistical methods such as correlations, simple regression, and multivariate linear regression to estimate the relationships between traits of economic importance. Nevertheless, these conventional methods have not been found sufficient to model complex relationships. Specifically, the presence of strong relationships among predictors, also known as multicollinearity, compromises the results of multivariable regression analyses due to the inflation of the standard errors of the parameters, resulting in a reduction in the reliability of the final regression model (Kim, 2019). Moreover, traditional approaches follow strict statistical assumptions and data requirements. Difficulties caused by multicollinearity in regression analysis have been reported by different researchers (Eyduran et al., 2010; Khorshidi-Jalali et al., 2019; Yakubu, 2010; Dahloum et al., 2016).
An alternative to traditional statistics is statistical learning, also known as data mining (DM). DM is the use of computer-based methods to accurately model the nonlinear and complex relationship between the dependent variable and predictors in huge datasets (Pinto da Costa and Cabral, 2022).
Among various methods belonging to DM, the most commonly used algorithms include Multivariate Adaptive Regression Splines (MARS), Artificial Neural Networks (ANNs), and decision trees (DT) such as Chi-square Automatic Interaction Detection (CHAID), Classification and Regression Trees (CART), and Quick Unbiased Efficient Statistical Trees (QUEST). These methods, along with others such as Support Vector Machines (SVM), Random Forest Regression (RFR), and k-Nearest Neighbors (k-NN), have been preferred due to the advantages they possess, including the ability to handle nonlinear and noisy data, the absence of assumptions regarding the underlying distribution of values of the input variables, robustness against multicollinearity (Mendeş and Akkartal, 2009), suitability for high-dimensional data, simplicity, computational speed, high accuracy, and ease of interpretation.
Data mining applications have gained considerable momentum in animal science recently (Grzesiak and Zaborski, 2012). Salawu et al. (2014) used ANN to predict the body weights of rabbits. ANN was also successfully applied to predict and model milk yield in cows (Gocheva-Ilieva et al., 2022) and sheep (Karadas et al., 2017). Almeida et al. (2020) applied ANN to predict zootechnical and management data in commercial laying hen farms. On the other hand, Nasser and Abu-Naser (2019) employed ANN for predicting the animal category. Eyduran et al. (2017) compared the predictive ability of MLR, CART, CHAID, and ANN in body weight prediction from some body measurements of the indigenous Beetal goat. Lee et al. (2020) estimated the carcass weight of Hanwoo cattle as a function of body measurements by using MLR, PLS (Partial least squares) regression, and ANN. For the prediction of body weight in sheep breeds, Tirink (2022) evaluated the ability of BRNN (Bayesian Regularized Neural Network), SVM, RFR, and MARS algorithms. Eyduran et al. (2013) applied RTM (Regression tree method) to predict the 305-d milk yield of Brown Swiss cattle. Grzesiak et al. (2010) used CF (classification functions), LR (logistic regression), ANN, and MARS for the detection of cows with artificial insemination difficulties.
In regard to establishing egg quality standards, Orhan et al. (2016) applied MLR, RR (Ridge Regression), and the CHAID algorithm to predict egg weight based on albumen weight, yolk weight, and shell weight in commercial layer hybrids. In quail, Çelik et al. (2017) compared the predictive performance of CHAID, exhaustive CHAID, and CART in the estimation of egg weight from some egg quality trait measurements. In another study, Sengul et al. (2020) compared Grossman-Koops, cubic, and segmented polynomial models with the MARS algorithm for predicting egg production in the Chukar partridge and found that the MARS predictive model can serve as a better alternative to classical nonlinear models in predicting cumulative egg production. González Ariza et al. (2022) developed a stepwise discriminant canonical analysis to cluster eggs across hen genotypes considering egg quality attributes. Çelik et al. (2016) investigated the effect of some egg quality traits (egg weight, egg width, egg height, and shape index) on the fertility of eggs of Japanese quail with different colored feathers with the aid of the CART data mining algorithm.
As yet, no other studies are available on the egg quality characteristics of ducks using robust computational methods, and our results are the first to be reported. Therefore, the purpose of this study was to estimate and compare the ability of ALM, ANN, CART, and MLR models to predict duck egg weight from some egg characteristic measurements based on several goodness-of-fit criteria.

MATERIALS AND METHODS
Material: Data were obtained from a total of 173 freshly laid eggs of Mallard ducks (35-50 wk old), directly collected from 20 smallholders in the province of Tiaret (35°55′52″ N, 0°08′24″ E), located in northwest Algeria. The region is characterized by a semi-arid climate, and it is well known for its agricultural potential and livestock production.
To evaluate the internal and external egg quality traits, eggs were transported at 4°C to the lab within 24 h. The external egg traits recorded included egg weight (EWT, g), egg length (EL, mm), egg width (EWd, mm), egg shape index (ESI), and eggshell weight (ESW, g). Regarding the internal egg quality, the parameters measured were albumen weight (AW, g), albumen height (AH, mm), yolk weight (YW, g), yolk height (YH, mm), yolk diameter (YD, mm), and Haugh unit (HU). EWT was determined to the nearest 0.01 g using an electronic scale. EWd, EL, and YD were determined with a digital caliper accurate to 0.1 mm, while AH and YH were determined using a tripod micrometer with a precision of 0.01 mm. The eggshell weight (ESW, g) was determined according to Sun et al. (2019) and Inca et al. (2020). ESI and HU were evaluated according to the following equations:
$$\mathrm{ESI} = \frac{\mathrm{EWd}}{\mathrm{EL}} \times 100 \tag{1}$$
$$\mathrm{HU} = 100 \log_{10}\left(\mathrm{AH} - 1.7\,\mathrm{EWT}^{0.37} + 7.6\right) \tag{2}$$
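The two derived traits can be illustrated with a short sketch. It assumes the usual definitions, which are not spelled out in the surviving text: ESI as the width-to-length ratio in percent, and HU from the standard Haugh formula with albumen height in mm and egg weight in g. The function names and the sample measurements are hypothetical.

```python
import math


def egg_shape_index(width_mm: float, length_mm: float) -> float:
    """Egg shape index: egg width over egg length, expressed in percent."""
    return width_mm / length_mm * 100.0


def haugh_unit(albumen_height_mm: float, egg_weight_g: float) -> float:
    """Haugh unit from albumen height (mm) and egg weight (g).

    Assumes the standard form 100 * log10(AH - 1.7 * EWT**0.37 + 7.6).
    """
    return 100.0 * math.log10(
        albumen_height_mm - 1.7 * egg_weight_g ** 0.37 + 7.6
    )


# Hypothetical measurements for one egg (not taken from the study data).
esi = egg_shape_index(width_mm=44.0, length_mm=60.0)
hu_light = haugh_unit(albumen_height_mm=6.0, egg_weight_g=60.0)
hu_heavy = haugh_unit(albumen_height_mm=6.0, egg_weight_g=70.0)
print(round(esi, 2), round(hu_light, 2), round(hu_heavy, 2))
```

Under this formula, a heavier egg with the same albumen height receives a lower HU, which is one way a negative association between egg weight and HU can arise.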

Methods
Statistical analysis: The internal and external duck egg quality trait estimates were analyzed using descriptive statistics (minimum, maximum, mean, standard error, and coefficient of variation). All phenotypic variables are given in Table 1. Linear associations between the egg traits were estimated by Pearson's correlation coefficient. In this study, EWT was the dependent variable, while the remaining 10 egg characteristics were the input explanatory variables (covariates).
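As a rough illustration of this step, the sketch below computes the listed descriptive statistics and a Pearson correlation in plain Python. The helper names and the five sample values are hypothetical and are not drawn from Table 1; the standard error is assumed to be the sample standard deviation divided by the square root of n.

```python
import math


def describe(values):
    """Minimum, maximum, mean, standard error, and coefficient of variation (%)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "se": sd / math.sqrt(n),
        "cv_percent": 100.0 * sd / mean,
    }


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical egg weights (g) and albumen weights (g).
egg_weight = [55.0, 58.0, 60.0, 63.0, 66.0]
albumen_weight = [27.0, 29.5, 30.0, 33.0, 35.5]

stats = describe(egg_weight)
r = pearson_r(egg_weight, albumen_weight)
print(stats, round(r, 3))
```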

Multiple linear regression analysis:
Multiple linear regression analysis is a form of regression analysis commonly used for modeling the relationship between a dependent variable (regressand) and a set of independent variables (regressors) by a linear regression equation (Tabrizi and Sancar, 2017). To assess multicollinearity, the Variance Inflation Factor (VIF) is commonly used (Daoud, 2017). Multicollinearity is considered present when the VIF exceeds a threshold of 5 to 10 (Tranmer et al., 2020; Kim, 2019).
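A minimal sketch of the VIF computation follows. It relies on the identity that the VIF of predictor j equals the jth diagonal element of the inverse of the predictors' correlation matrix, which is equivalent to 1/(1 − R²) from regressing that predictor on the others. The helper names and the eight sample eggs are hypothetical; the shape index is derived from width and length on purpose so that the collinearity is visible.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    p = len(columns)
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j]))
            / (norms[i] * norms[j])
            for j in range(p)
        ]
        for i in range(p)
    ]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    p = len(matrix)
    aug = [
        list(row) + [1.0 if i == j else 0.0 for j in range(p)]
        for i, row in enumerate(matrix)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]


def vif(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical predictors: length, width, shape index (derived), yolk height.
egg_length = [58.0, 60.5, 62.0, 59.0, 63.5, 61.0, 57.5, 64.0]
egg_width = [42.0, 43.5, 44.0, 43.0, 44.5, 42.5, 41.5, 45.5]
shape_index = [w / l * 100.0 for w, l in zip(egg_width, egg_length)]
yolk_height = [17.2, 16.8, 17.9, 16.5, 17.0, 17.6, 16.9, 17.3]

vif_values = vif([egg_length, egg_width, shape_index, yolk_height])
print([round(v, 1) for v in vif_values])
```

Because the shape index is almost a linear function of length and width over this narrow range, its VIF comes out far above the 5 to 10 threshold.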
The variables YD, YH, YW, AW, and ESW were selected as input variables using the stepwise technique to predict egg weight according to the following formula:
$$\mathrm{EWT} = a + \sum_{i} b_i X_i \tag{3}$$
where EWT is the egg weight, a is the regression intercept, b_i is the partial regression coefficient of the ith egg trait, and X_i is the ith egg trait.
Automatic Linear Modelling: Automatic Linear Modelling (ALM) is not as commonly used as the other computational methods but has gained popularity in recent years (Genç and Mendeş, 2021). ALM serves as a valuable screening tool, automating the process of selecting the most suitable subset of predictors, which is particularly crucial when dealing with a large number of predictors (Oshima and Dell-Ross, 2016; Genç and Mendeş, 2021). In the study, variables with VIF values > 10 were identified and systematically removed to mitigate multicollinearity effects. The same predictor variables fitted into the MLR were used to generate the ALM. The selected ALM model was configured as a standard model with forward stepwise as the model selection method and Akaike's Information Criterion (AIC) for evaluating marginal contribution.
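The idea of forward stepwise selection driven by AIC can be sketched as follows. This is not the SPSS ALM procedure itself: it is a simplified illustration that fits ordinary least squares via the normal equations and uses the Gaussian form AIC = n ln(RSS/n) + 2p, which is an assumption here. All function names and the twelve synthetic eggs are hypothetical.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(columns, y):
    """Fit y = a + sum(b_i * X_i); return (coefficients, residual sum of squares)."""
    design = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(r[j] * r[k] for r in design) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(design, y)) for j in range(p)]
    beta = solve(xtx, xty)
    rss = sum(
        (yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(design, y)
    )
    return beta, rss


def aic(rss, n, n_params):
    """Gaussian AIC up to an additive constant (assumed form)."""
    return n * math.log(rss / n) + 2 * n_params


def forward_stepwise(candidates, y):
    """Add the predictor that lowers AIC most until no addition helps."""
    n = len(y)
    selected = []
    best = aic(ols([], y)[1], n, 1)
    while True:
        trials = []
        for name in candidates:
            if name not in selected:
                cols = [candidates[s] for s in selected + [name]]
                trials.append((aic(ols(cols, y)[1], n, len(cols) + 1), name))
        if not trials or min(trials)[0] >= best:
            return selected
        best, chosen = min(trials)
        selected.append(chosen)


# Synthetic data: egg weight built from its components plus small noise.
aw = [27.0, 29.5, 30.0, 33.0, 35.5, 31.0, 28.0, 34.0, 32.5, 30.5, 29.0, 36.0]
yw = [18.0, 19.0, 21.5, 20.0, 22.0, 19.5, 20.5, 21.0, 18.5, 22.5, 20.0, 21.5]
esw = [6.5, 7.0, 6.8, 7.4, 7.9, 7.1, 6.6, 7.6, 7.2, 7.0, 6.9, 8.0]
yh = [17.2, 16.8, 17.9, 16.5, 17.0, 17.6, 16.9, 17.3, 17.1, 16.7, 17.5, 17.4]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1, 0.2, -0.2]
ewt = [1.5 + a + b + c + e for a, b, c, e in zip(aw, yw, esw, noise)]

selected = forward_stepwise({"AW": aw, "YW": yw, "ESW": esw, "YH": yh}, ewt)
print(selected)
```

In this synthetic example the components that carry most of the variation in egg weight enter first; which remaining predictors are kept depends on how much they reduce the residual sum of squares relative to the AIC penalty.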

Machine learning models
Artificial Neural Network: An Artificial Neural Network (ANN) is a computing system based on the way biological nervous systems, such as the human brain, process information (Dastres and Soori, 2021). The ANN is a methodology that takes into account nonlinearities in the relationship between the input and output information (Savegnago et al., 2011). It consists of a set of interconnected neurons linked with weighted connections (Li et al., 2018).
In the current study, a Multilayer Perceptron (MLP) with one hidden layer, trained by backpropagation, was used. The network was trained with 70% of the whole dataset and tested (model validation) with the remaining 30%. The input layer consists of nodes corresponding to the 10 egg characteristic traits used for predicting egg weight. The hyperbolic tangent function and the linear activation function were employed for the hidden and output layers, respectively, according to Yakubu and Nimyak (2020). The output layer was configured with a single output node dedicated to estimating the egg weight. The weights and biases of this layer were optimized during the model training process. Every other option in the ANN was set to default.
Classification and Regression Tree: The Classification and Regression Tree (CART) algorithm was introduced by Breiman et al. (1984). It is a powerful predictive algorithm widely used in machine learning. CART models can be categorized based on the dependent variable. Categorical outcome variables require the use of a classification tree, while continuous outcomes utilize regression trees (Wray and Byers, 2020; Razi and Athappilly, 2005). CART constructs a binary decision tree structure where each fork represents a predictor variable, and each node provides a prediction for the target variable (Lee et al., 2010; Ali et al., 2015; Wray and Byers, 2020). In general, CART analysis begins with a single node, also known as the root or 'parent node', while subsequent nodes that undergo further partitioning are termed 'child nodes'. The nodes where partitioning concludes, indicating homogeneity or purity, are commonly known as 'terminal nodes' or 'leaves'. CART looks for splits that minimize the squared prediction error (the least-squared deviation). The prediction in each leaf is based on the weighted mean for the node (Maimon and Rokach, 2005). In the study, the dataset was initially divided into two distinct subsets, namely the training set, comprising 70% of the total data, and the test set, which accounted for the remaining 30%. The minimum numbers of observations in parent and child nodes were set to 10 and 5, respectively, in order to improve the predictive ability of the model.
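The least-squared-deviation split search described above can be sketched for a single predictor. This is a simplified illustration, not the SPSS implementation: the function names are hypothetical, the data are synthetic, and the `min_parent` and `min_child` defaults mirror the 10 and 5 observation limits mentioned in the text.

```python
def sse(values):
    """Sum of squared deviations from the mean (within-node error)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y, min_parent=10, min_child=5):
    """Best binary split x <= t for a regression tree node.

    Returns (threshold, improvement) or None when the node is too small
    to split or no admissible split exists.
    """
    if len(y) < min_parent:
        return None
    pairs = sorted(zip(x, y))
    total = sse([target for _, target in pairs])
    best = None
    for i in range(min_child, len(pairs) - min_child + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical predictor values
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        gain = total - sse(left) - sse(right)
        if best is None or gain > best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best = (threshold, gain)
    return best


# Synthetic egg widths (mm) and egg weights (g) with a clear jump.
width = [40.0, 40.5, 41.0, 41.5, 42.0, 42.5, 43.5, 44.0, 44.5, 45.0, 45.5, 46.0]
weight = [52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0]

threshold, improvement = best_split(width, weight)
print(threshold, improvement)
```

A full tree would apply this search over every predictor at each node and recurse on the two children; each leaf then predicts the mean of its observations.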

Comparison of model quality:
The quality of the assigned models was assessed and compared using the following specific statistical parameters according to Grzesiak and Zaborski (2012).
Pearson correlation coefficient (r) between the observed and the predicted values.
The adjusted coefficient of determination:
$$R^2_{adj} = 1 - \frac{\frac{1}{n-k-1}\sum_{i=1}^{n}\left(Y_i - Y_{ip}\right)^2}{\frac{1}{n-1}\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2} \tag{8}$$
where Y_i is the actual egg weight of the ith egg, Y_ip is the predicted egg weight of the ith egg, \bar{Y} is the mean of the actual egg weight values, n is the total sample size, and k is the number of independent variables in the model, not including the constant.
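The criteria used to compare the models (r, R², adjusted R², RMSE, and RAE) can be computed as in the sketch below. The RMSE and RAE forms are assumptions, since their equations do not survive in the text: RMSE is taken as the square root of the mean squared residual, and RAE as the square root of the residual sum of squares divided by the sum of squared observed values. The function name and the sample values are hypothetical.

```python
import math


def goodness_of_fit(observed, predicted, k):
    """Return r, R2, adjusted R2, RMSE, and RAE for a fitted model.

    k is the number of independent variables, excluding the constant.
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    return {
        "r": cov / math.sqrt(ss_tot * ss_pred),
        "r2": 1.0 - ss_res / ss_tot,
        "r2_adj": 1.0 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1)),
        "rmse": math.sqrt(ss_res / n),                        # assumed form
        "rae": math.sqrt(ss_res / sum(o * o for o in observed)),  # assumed form
    }


# Hypothetical observed and predicted egg weights (g), three predictors.
observed = [55.0, 58.0, 60.0, 63.0, 66.0, 59.0, 61.0, 64.0]
predicted = [55.5, 57.5, 60.5, 62.5, 66.5, 58.5, 61.5, 63.5]

fit = goodness_of_fit(observed, predicted, k=3)
print({name: round(value, 4) for name, value in fit.items()})
```

With these assumed definitions, RAE is roughly RMSE divided by the typical magnitude of the observed values, so it is unit-free and much smaller than RMSE when egg weights are around 60 g.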
All the computations were performed using the SPSS statistical software version 25.0. The significance level in all the analyses was set at p<0.05.

RESULTS
The Pearson correlation coefficients (r) between measured egg weight and external and internal quality traits of the egg are presented in Table 2. Correlations ranged from 0.000 to 0.945. EWT was highly and positively correlated with egg dimensions (EL and EWd, r = 0.752 and 0.790, respectively, p<0.01), AW (r = 0.815, p<0.01), and YW (r = 0.784, p<0.01). Low and negligible correlations were observed between EWT and ESI (r = −0.115, p>0.05) and between EWT and AH (r = −0.055, p>0.05). A negative significant correlation was also found between EWT and HU (r = −0.302, p<0.01). AW and YW showed a significant association with egg dimensions (EL and EWd), ranging from 0.602 to 0.672 (p<0.01). A highly significant, weak, and negative correlation (r = −0.195, p<0.01) was found between AH and YH.

Comparison of predictive performances of the algorithms:
In the present study, first, all 10 explanatory variables were included in the MLR model to predict EWT. The ANOVA results showed that the fitted MLR model was statistically significant (F = 486.74, p<0.001). When considering all 10 predictors, the percentage of the EWT variance explained by the model was 96.8%. The high VIF values (>10) obtained for some of the independent variables are a sign of multicollinearity in the model. In the current study, multicollinearity issues were found in EL, EWd, ESI, AH, and HU (Table 2). The estimated parameters of the MLR model are summarized in Table 3.
The explanatory variables with significant influence in determining the EWT were AW, YW, and ESW. As a result, the EWT prediction equation was EWT = 1.47 + 0.965 AW + 0.984 YW + 0.999 ESW, with R² = 0.966, indicating that 96.6% of the total variation in the EWT is explained by these three variables. With the positive coefficients, an increment in EWT would be expected as AW, YW, and ESW increased.
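As a worked example, the reported equation can be applied to a single egg. The coefficients are those given in the text; the function name and the component weights below are hypothetical.

```python
def predict_egg_weight(aw_g: float, yw_g: float, esw_g: float) -> float:
    """Egg weight (g) from the reported MLR equation.

    EWT = 1.47 + 0.965 * AW + 0.984 * YW + 0.999 * ESW
    """
    return 1.47 + 0.965 * aw_g + 0.984 * yw_g + 0.999 * esw_g


# Hypothetical egg: 31.0 g albumen, 20.0 g yolk, 7.0 g shell.
predicted_ewt = predict_egg_weight(aw_g=31.0, yw_g=20.0, esw_g=7.0)
print(round(predicted_ewt, 3))  # 1.47 + 29.915 + 19.68 + 6.993 = 58.058
```

The three coefficients are all close to 1, which is what one would expect if whole egg weight is approximately the sum of its albumen, yolk, and shell components.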
The performance quality criteria of MLR, ALM, ANN, and CART models for the prediction of egg weight are summarized in Table 4. In the current study, the model exhibiting the highest values of r, R², and R²adj, along with the lowest values of RAE and RMSE, was selected as the most suitable model. The associations between the observed and the predicted egg weight using ALM, ANN, and CART are shown in Figures 1, 2, and 3. The Pearson correlation coefficient (r) between the observed and the predicted egg weight was highest in ANN (0.990) compared to ALM (0.984), MLR (0.982), and CART (0.950). Similarly, the R² values for these models were 0.982, 0.970, 0.966, and 0.903, respectively. The R²adj values followed a similar pattern, with respective values of 0.981, 0.970, 0.964, and 0.897. The ANN model exhibited the lowest RMSE and RAE values of 0.753 and 0.012, in contrast to ALM (0.985, 0.016), MLR (1.046, 0.017), and CART (1.778, 0.029).
The ANN model included input nodes consisting of 10 explanatory variables, hidden nodes comprising a bias term and seven H terms (H1:1 to H1:7), and the dependent variable EWT (Figure 4). Black lines indicate positive weights, while blue lines indicate negative weights. Line thickness is in proportion to the relative magnitude of each weight. In the ANN model, the most influential parameters for predicting EWT were AW, YW, HU, ESW, and AH (Table 5). These were followed by YD, EL, ESI, and EWd, while YH contributed the least to EWT determination. Both ALM and ANN algorithms ranked AW and YW as the most influential variables, while the order of importance for other variables varied between the models. In the ALM model, AW, YW, and ESW were identified as the significant explanatory variables automatically selected for predicting whole egg weight (Table 6). Conversely, the CART algorithm revealed a different set of significant input variables for predicting Mallard duck egg weight.
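The network structure described above (10 inputs, seven hidden units with a hyperbolic tangent activation, and one linear output) corresponds to the forward pass sketched below. The weights are random placeholders, not the fitted values from Figure 4, the inputs are assumed to be already rescaled, and the output bias of 59.312 is used only as a convenient illustrative value.

```python
import math
import random


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One-hidden-layer MLP: tanh hidden units, linear output unit."""
    hidden = [
        math.tanh(sum(w * v for w, v in zip(weights, x)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


random.seed(0)
n_inputs, n_hidden = 10, 7

# Placeholder parameters (hypothetical, not the trained network).
w_hidden = [
    [random.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_hidden)
]
b_hidden = [0.0] * n_hidden
w_out = [random.uniform(-1.0, 1.0) for _ in range(n_hidden)]
b_out = 59.312

# An egg whose rescaled traits are all zero maps to the output bias.
baseline = mlp_forward([0.0] * n_inputs, w_hidden, b_hidden, w_out, b_out)

# Another hypothetical egg with non-zero rescaled traits.
x_other = [0.5, -0.2, 0.1, 0.8, -0.6, 0.3, 0.0, -0.4, 0.7, -0.1]
prediction = mlp_forward(x_other, w_hidden, b_hidden, w_out, b_out)
print(round(baseline, 3), round(prediction, 3))
```

Because tanh is bounded between −1 and 1, the prediction can differ from the output bias by at most the sum of the absolute output weights; training adjusts all weights and biases to minimize the prediction error.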
The regression tree using the CART algorithm is shown in Figure 5. The tree was built with five variables (EWd, AW, EL, YW, and ESI). The tree was mostly influenced by EWd, while the least influence was exhibited by ESI. A total of ten terminal (homogeneous) nodes (nodes 7, 9, 10, 11, 12, 13, 15, 16, 17, and 18), on which decisions are made, were formed. Node 0, which is the root node, provided information on the descriptive statistics, where the total number of observations was 173 and mean egg weight was 59.312 g with a standard deviation of 5.728. Based on the influence of egg width, Node 0 was partitioned into non-homogeneous nodes 1 and 2 with predicted mean egg weights of 54.676 g and 63.394 g, respectively. Node 1, on the basis of egg width, was split further into node 3 (EWd ≤ 41.446 mm) and node 4 (EWd > 41.446 mm), while node 2, based on albumen weight, was divided into node 5 (AW ≤ 33.901 g) and node 6 (AW > 33.401 g). The respective predicted mean egg weights in both cases were 61.429 g and 66.451 g. On the basis of albumen weight, node 3 was further partitioned into homogeneous node 7 (AW ≤ 25.902

DISCUSSION
The data used in the current study were first summarized in terms of basic statistics. The mean egg weight and size obtained in this study are higher than those reported by Labbaci et al. (2014) in Mallard ducks at Lake Tonga (Northeastern Algeria). The mean values for egg length, egg width, eggshell weight, and albumen weight are similar to those reported in Alabio duck (Hartati et al., 2021). In contrast, the mean egg weight and most of the other external and internal quality trait averages were lower than the results obtained in Nigerian Muscovy duck reared under different management systems, except albumen height, yolk diameter, and Haugh unit (Etuk et al., 2012). Similarly, in Pekin duck eggs, Indarsih et al. (2021) found a mean weight of 67.5±5.9 g, with a mean length of 60.7±3.1 mm, mean width of 44.7±0.9 mm, and an average shape index of 76.2±1.7 and 70.9±2.8 for rounded and elongated eggs, respectively. These authors demonstrated that the shape index is a suitable parameter for sex identification in Pekin duck. Lin et al. (2016) reported higher mean egg weights of 65.0±3.9 and 67.0±4.2 g in Shan Ma laying ducks at 210 and 300 days of age, respectively. There were also reports of higher mean values in comparison to the present findings for egg weight in Domyati duck (Egyptian local breed) and Khaki-Campbell duck, with 61.4±6.5 and 64.3±3.4 g, respectively (El-Deghadi et al., 2022). The reasons for the divergent findings among researchers regarding some egg characteristics are multifactorial, such as genetic factors, layer's age, body weight, health, nutrition, and egg storage conditions (Roberts, 2004; Dahloum et al., 2018; Alkan and Türker, 2021; Çelik et al., 2021). In this sense, Reyna and Burggren (2017) reported a decrease in the fertility of duck eggs stored for more than six days from laying to incubation. In addition, Ipek and Sozcu (2017) found that heavier eggs from Pekin ducks had better hatchability than the light and medium eggs.
The results of the correlation analyses showed highly significant associations between the independent variables. The strong correlation between EWT and both AW and YW suggests that these parameters can change at a significant level depending on the change that can occur in the egg weight. This finding is consistent with the results of several previous studies on other poultry species such as quails (Ouaffai et al., 2018), Guinea fowl (Onunkwo and Okoro, 2015), partridge (Alkan et al., 2014), and commercial layers (Orhan et al., 2016). The statistically non-significant and negative phenotypic correlation found in the present study between the egg weight and the egg shape index is consistent with the findings of several previous studies (Olawumi and Ogunlade, 2008; Alkan and Türker, 2021; Jang, 2022).
The negative significant relationship between EWT and HU stands in contrast to the insignificant correlation reported by other researchers (Debnath and Ghosh, 2015; Vekić et al., 2022). Haugh unit is an important index to evaluate egg protein quality and reflect egg freshness (Gao et al., 2022). The negative associations of HU with AW, YW, and YD obtained in the current study agree with the previous report on indigenous chickens (Bekele et al., 2022). In Alabio duck, Hartati et al. (2021) found that EL was positively correlated with ESW, AW, and YW (r = 0.28, 0.53, and 0.52, respectively, p<0.05).
Predictive performance of ANN, ALM, CART, and MLR: Data mining techniques can be a good option to describe complex associations between variables (Canga et al., 2021). In the study, the ALM, ANN, and CART algorithms yielded different sets of significant predictors due to the distinct methodologies and criteria they employ for variable selection and model construction. These variations reflect the inherent differences in their strategies for modelling and predicting egg weight.
In the poultry field, Bolzan et al. (2008) explored the use of ANN to predict the hatchability of artificially incubated eggs derived from a 39-week-old Cobb 500 broiler breeder flock. ALM and ANN were fitted with the intent to predict hatchability and mortality in Muscovy ducks (Yakubu et al., 2019), and to forecast heat stress index in Sasso hens (Yakubu et al., 2018). Çelik et al. (2021) evaluated the performance of CART and MARS in the prediction of egg weight of quails. Canga et al. (2021) aimed to predict egg weight from egg quality traits in Lohman LSL Classic white hybrid laying hens with the help of the MARS data mining algorithm.
The application of data mining techniques was also successfully investigated to estimate egg weight in many poultry species. This is the first modelling study to determine Mallard duck egg weight using a combination of external and internal egg characteristics, employing ANN and CART algorithms, along with the ALM technique. However, in this study, it was not possible to make an adequate comparison with other studies owing to the use of different poultry species, traits, variables, sample sizes, and computational methods. Portillo-Salgado et al. (2021) demonstrated that both the decision tree technique based on the CHAID algorithm and MLR can be used reliably for predictive estimates of egg weight from external traits of Guinea fowl, as they showed similar accuracy (R² = 74.0 and 75.0%, respectively). Çelik et al. (2017) investigated the ability of the CART, CHAID, and Exhaustive CHAID algorithms in the prediction of quail egg weight. In their study, the Pearson correlation coefficients (r) between actual and predicted egg weight values for CHAID, Exhaustive CHAID, and CART algorithms were 90.6%, 92.7%, and 92.0%, respectively. In the same order, the R² values were 82.06%, 85.86%, and 84.66%, while the R²adj values were 82.06%, 85.85%, and 84.66%. The RAE estimates were 0.087 for all algorithms, and the estimates of RMSE were 0.453, 0.402, and 0.419, respectively. The results indicate that the Exhaustive CHAID algorithm is very effective for determining internal and external quality features in quail eggs.
In another study on quail, Çelik et al. (2021) demonstrated that MARS showed much better predictive performance than CART for the prediction of egg weight, with R² values of 85.0% and 72.8%, respectively. The reported values of R² are smaller than the R² values obtained in the current study. Çiftsüren and Akkol (2018) used RR (Ridge regression), LASSO (Least Absolute Shrinkage and Selection Operator), and EN (Elastic net) regression models to determine egg yolk and egg albumen weights from some external egg characteristics in Japanese quail. In their study, it was revealed that LASSO was the best model due to its high predictive accuracy. For egg yolk weight, the goodness of fit of the regression estimating equations was 58.34%, 59.17%, and 59.11% for RR, LASSO, and EN methods, respectively. For egg albumen weight, the goodness of fit of the regression equations was 75.60%, 75.94%, and 75.81% for the respective RR, LASSO, and EN methods. In the study conducted by Alapatt et al. (2022) to determine the egg weight in White Leghorn chicken from some internal and external egg traits using different methods, the EN regression was identified as the best predictive model (R²adj = 86.5%), followed by RR (R²adj = 81.13%), RFR (R²adj = 65.02%), LASSO (R²adj = 29.67%), and the CART algorithm (R²adj = 29.45%). In the prediction of the egg weight from albumen weight, yolk weight, and shell weight in commercial layer hybrids, Orhan et al. (2016) demonstrated the superiority of the CHAID algorithm, with higher accuracy (R² = 99.98%) compared to MLR (R² = 93.4%) and RR (R² = 93.15%). Canga et al. (2021) applied the MARS data mining algorithm to predict egg weight from egg quality traits in Lohman LSL Classic white hybrid laying hens and achieved sufficient fit, with the mean predictive performance measures estimated as 61.0%, 0.779, and 0.430 for R², r, and SD ratio, respectively.
In indigenous free-range chickens, Liswaniso et al. (2021) preferred CHAID and CART algorithms to predict the egg weight from egg length, egg width, shell weight, shell thickness, albumen weight, yolk height, yolk width, and yolk weight. For the CHAID algorithm, the goodness of fit was R² = 82.3%, R²adj = 82.3%, RMSE = 2.23, RAE = 0.04, and SD ratio = 0.04. In the case of the CART algorithm, the results were estimated to be 59.3%, 59.3%, 2.32, 0.07, and 0.24, respectively.

Conclusion:
Based on the results of the present study, it was concluded that the ANN algorithm was slightly more efficient for egg weight determination in Mallard ducks based on some internal and external egg traits, as illustrated by its lower error measurements compared to the ALM, MLR, and CART algorithms. These findings may help poultry researchers and producers choose the best predictors to increase egg quality in ducks through the selection of high-performance genotypes.

Figure 1. The scatter plot of observed and predicted egg weight using ALM

Table 1. Basic statistics for different traits (n=173).
Node 8 was further divided into two terminal nodes, 15 and 16, using albumen weight as the splitting variable. Node 15, with AW ≤ 29.519, had a predicted mean egg weight of 52.382 g and SD of 1.292, while node 16, with AW > 29.519, was characterized by a predicted mean egg weight of 55.000 g and SD of 1.655. Node 14 was split into two terminal nodes, 17 and 18, using egg shape index as the criterion, with a small improvement of 0.446. With ESI ≤ 70.723, the predicted egg weight and standard deviation of node 17 were 66.847 g and 1.983. At node 18, with ESI > 70.723, the predicted egg weight of 71.051 g (SD = 3.742) was the highest among all ten terminal nodes.