Multi-Class classification with Sci-kit learn & XGBoost: A case study using Brainwave data

by Avishek Nag (Machine Learning expert)

A comparison of different classifiers’ accuracy & performance for high-dimensional data


In machine learning, classification problems with high-dimensional data are really challenging. Sometimes, very simple problems become extremely complex due to this ‘curse of dimensionality’.

In this article, we will see how accuracy and performance vary across different classifiers. We will also see how, when we don’t have the freedom to choose a classifier independently, we can do feature engineering to make a poor classifier perform well.

Understanding the ‘data source’ & problem formulation

For this article, we will use the “EEG Brainwave Dataset” from Kaggle. This dataset contains brainwave signals recorded from an EEG headset and is in temporal format. At the time of writing this article, nobody had created a ‘Kernel’ on this dataset; that is, as of now, no solution has been posted on Kaggle.

So, to start with, let’s first read the data to see what’s there.
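The original notebook cell isn’t reproduced here. A minimal, runnable sketch of this step might look like the following; a tiny in-memory sample stands in for the downloaded Kaggle CSV, and the feature values are made up for illustration:

```python
import io
import pandas as pd

# Stand-in for reading the downloaded Kaggle CSV with pd.read_csv(<path>).
# The sample mimics the structure: numeric feature columns plus 'label'.
csv_data = io.StringIO(
    "mean_d_1_a,mean_d2_a,fft_0_a,label\n"
    "4.62,30.3,-970.0,NEGATIVE\n"
    "28.8,33.1,-919.0,NEUTRAL\n"
    "8.90,29.4,-611.0,POSITIVE\n"
)
df = pd.read_csv(csv_data)
print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # feature columns and the 'label' target
```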


There are 2549 columns in the dataset, and ‘label’ is the target column for our classification problem. All other columns, like ‘mean_d_1_a’, ‘mean_d2_a’ etc., describe features of the brainwave signal readings. Columns starting with the ‘fft’ prefix are most probably Fast Fourier transforms of the original signals. Our target column ‘label’ describes the degree of emotional sentiment.

As per Kaggle, here is the challenge: “Can we predict emotional sentiment from brainwave readings?”

Let’s first understand class distributions from column ‘label’:
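The distribution check itself is a one-liner on the ‘label’ column. A sketch with stand-in labels (with the real dataset this would be `df["label"]`):

```python
import pandas as pd

# Stand-in labels; on the real data: df["label"].value_counts()
labels = pd.Series(["POSITIVE", "NEGATIVE", "NEUTRAL",
                    "POSITIVE", "NEGATIVE", "NEUTRAL"])
counts = labels.value_counts()
print(counts)
# counts.plot(kind="bar") draws the bar chart of class frequencies.
```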


So, there are three classes, ‘POSITIVE’, ‘NEGATIVE’ & ‘NEUTRAL’, for emotional sentiment. From the bar chart, it is clear that the class distribution is not skewed, and that this is a ‘multi-class classification’ problem with target variable ‘label’. We will try different classifiers and compare their accuracy levels.

Before applying any classifier, the column ‘label’ should be separated out from other feature columns (‘mean_d_1_a’, ‘mean_d2_a’ etc are features).

As it is a ‘classification’ problem, we will follow the below conventions for each ‘classifier’ to be tried:

  • We will use a ‘cross validation’ approach (in our case, 10-fold cross-validation) over the dataset and take the average accuracy. This will give us a holistic view of the classifier’s accuracy.
  • We will use a ‘Pipeline’ based approach to combine all pre-processing and the main classifier computation. An ML ‘Pipeline’ wraps all processing stages in a single unit and acts as a ‘classifier’ itself. By this, all stages become re-usable and can be combined into other ‘pipelines’ too.
  • We will track the total time taken to build & test each approach. We will call this ‘time taken’.
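These three conventions can be captured in a small helper. This is a sketch, not the author’s code: the `evaluate` helper name is hypothetical, and synthetic data stands in for the EEG features:

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the EEG features (3 classes, as in the dataset).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)

def evaluate(pipeline, X, y, folds=10):
    """Apply the conventions: k-fold CV, average accuracy, 'time taken'."""
    start = time.time()
    scores = cross_val_score(pipeline, X, y, cv=folds)
    return scores.mean(), time.time() - start

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", DecisionTreeClassifier(random_state=42))])
acc, taken = evaluate(pipe, X, y)
print(f"average accuracy={acc:.3f}, time taken={taken:.2f}s")
```

Any classifier to be tried below can be dropped into the same pipeline slot.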

For the above, we will primarily use the scikit-learn package from Python. As the number of features here is quite high, we will start with a classifier which works well on high-dimensional data.

RandomForest Classifier

‘RandomForest’ is a tree & bagging approach-based ensemble classifier. It automatically limits the influence of uninformative features, as its entropy-based splitting criterion favours the most informative ones. Let’s see that:
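A sketch of this step on synthetic stand-in data (the real code would pass the 2549 EEG feature columns and the ‘label’ target; the parameter values here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the EEG data.
X, y = make_classification(n_samples=300, n_features=50, n_informative=15,
                           n_classes=3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
print(f"average accuracy: {scores.mean():.3f}")
```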


Accuracy is very good at 97.7% and ‘total time taken’ is quite short (3.29 seconds only).

For this classifier, no pre-processing stages like scaling or noise removal are required: tree-based splits depend only on the ordering of feature values, so they are not affected by scale factors.

Logistic Regression Classifier

‘Logistic Regression’ is a linear classifier; it works much like linear regression, but passes the linear output through a logistic (sigmoid) function to produce class probabilities.
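A sketch of this step, again on synthetic stand-in data. The scaler is placed before the classifier inside one pipeline; the solver and penalty settings match the ones mentioned later in the article:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=50, n_informative=15,
                           n_classes=3, random_state=0)
# StandardScaler is a preprocessing stage: logistic regression is
# sensitive to differing feature scales.
pipe = Pipeline([("scaler", StandardScaler()),
                 ("logreg", LogisticRegression(solver="saga", penalty="l1",
                                               max_iter=200))])
scores = cross_val_score(pipe, X, y, cv=10)
print(f"average accuracy: {scores.mean():.3f}")
```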


We can see accuracy (93.19%) is lower than ‘RandomForest’ and ‘time taken’ is higher (2 min 7s).

‘Logistic Regression’ is heavily affected by differing value ranges across the independent variables, which makes ‘feature scaling’ necessary. That’s why ‘StandardScaler’ from scikit-learn has been added as a preprocessing stage. It standardises each feature to zero mean & unit variance, so that all variables are on a comparable scale.

The reasons for the high ‘time taken’ are the high dimensionality and the scaling step. There are 2549 variables in the dataset, and the coefficient of each one must be optimised by the Logistic Regression process. There is also the question of multicollinearity: linearly correlated variables should be grouped together instead of being considered separately.

The presence of multicollinearity affects accuracy. So now the question becomes: “Can we reduce the number of variables, reduce multicollinearity, & improve ‘time taken’?”

Principal Component Analysis (PCA)

PCA transforms the original variables into a new set of uncorrelated ‘principal components’ and thus reduces the number of required variables: co-linear variables get clubbed together, and we keep only the components that explain most of the variance. Let’s do a PCA of the data and see what the main PCs are:
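A sketch of the PCA step on synthetic stand-in data (with the real dataset, the input would be the 2549 scaled feature columns):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=300, n_features=50, n_informative=10,
                           random_state=0)
X_scaled = StandardScaler().fit_transform(X)   # PCA expects centred data
pca = PCA(n_components=20).fit(X_scaled)
print(pca.explained_variance_ratio_)             # variance share of each PC
print(pca.explained_variance_ratio_[:10].sum())  # share of the first 10 PCs
```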


We mapped the 2549 variables to 20 Principal Components. From the above result, it is clear that the first 10 PCs matter most. The total explained variance ratio of the first 10 PCs is around 0.737 (0.36 + 0.095 + … + 0.012); in other words, the first 10 PCs explain 73.7% of the variance of the entire dataset.

So, with this, we are able to reduce 2549 variables to 10. That’s a dramatic change, isn’t it? Principal Components are virtual variables generated by a mathematical mapping; from a business angle, it is not possible to say which physical aspect of the data each one covers, since they don’t exist physically. But we can easily use these PCs as quantitative input variables to any ML algorithm and get very good results.

For visualisation, let’s take the first two PCs and see how can we distinguish different classes of the data using a ‘scatterplot’.
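A sketch of the scatterplot on synthetic stand-in data; the colour choices are arbitrary, and with the real data `y` would be the encoded ‘label’ column:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, renders to a file
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
for cls, colour in zip(np.unique(y), ["red", "green", "blue"]):
    mask = y == cls
    plt.scatter(X2[mask, 0], X2[mask, 1], c=colour, label=str(cls), s=10)
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend()
plt.savefig("pca_scatter.png")
```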


In the above plot, the three classes are shown in different colours. So, if we use the same ‘Logistic Regression’ classifier with these two PCs, we can probably say from the plot that the first internal classifier will separate ‘NEUTRAL’ cases from the other two, and the second will separate ‘POSITIVE’ & ‘NEGATIVE’ cases (as there will be two internal logistic classifiers for a 3-class problem). Let’s try and see the accuracy.
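A sketch of the PCA + Logistic Regression pipeline on synthetic stand-in data; switching `n_components` from 2 to 10 gives the ten-PC variant that the article tries next:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
# n_components=2 for the two-PC experiment; 10 for the ten-PC run.
pipe = Pipeline([("scaler", StandardScaler()),
                 ("pca", PCA(n_components=2)),
                 ("logreg", LogisticRegression(solver="saga", penalty="l1",
                                               max_iter=200))])
scores = cross_val_score(pipe, X, y, cv=10)
print(f"average accuracy: {scores.mean():.3f}")
```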


Time taken (3.34 s) was reduced but accuracy (77%) decreased.

Now, let’s take all 10 PCs and run:


We see an improvement in accuracy (86%) compared to the 2-PC case, with a marginal increase in ‘time taken’.

So, in both cases we saw lower accuracy compared to normal logistic regression, but a significant improvement in ‘time taken’.

Accuracy can be further explored with different ‘solver’ & ‘max_iter’ parameters. We used ‘saga’ as the ‘solver’ with an L1 penalty and 200 as ‘max_iter’. These values can be changed to observe the effect on accuracy.

Though ‘Logistic Regression’ gives lower accuracy here, there are situations where it may be needed, especially with PCA. In datasets with a very large dimensional space, PCA becomes the obvious choice for ‘linear classifiers’.

In some cases, where a benchmark for ML applications is already defined and only limited choices of some ‘linear classifiers’ are available, this analysis would be helpful. It is very common to see such situations in large organisations where standards are already defined and it is not possible to go beyond them.

Artificial Neural Network Classifier (ANN)

An ANN classifier is non-linear and performs, in effect, automatic feature engineering and dimensionality reduction. ‘MLPClassifier’ in scikit-learn works as an ANN. But here also, basic scaling of the data is required. Let’s see how it works:
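A sketch of this step on synthetic stand-in data. For the real 2549-feature dataset the article uses hidden layers of (1275, 637); much smaller layers are used here so the sketch trains quickly:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=50, n_informative=15,
                           n_classes=3, random_state=0)
# Two hidden layers, each roughly half the size of the previous layer.
pipe = Pipeline([("scaler", StandardScaler()),
                 ("mlp", MLPClassifier(hidden_layer_sizes=(25, 12),
                                       max_iter=500, random_state=0))])
scores = cross_val_score(pipe, X, y, cv=10)
print(f"average accuracy: {scores.mean():.3f}")
```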


Accuracy (97.5%) is very good, though running time is high (5 min).

The reason for high ‘time taken’ is the rigorous training time required for neural networks, and that too with a high number of dimensions.

It is a general convention to start with a hidden layer size of 50% of the number of input features, with each subsequent layer 50% of the previous one. In our case these are (1275 = 2549 / 2, 637 = 1275 / 2). The number of hidden layers can be treated as a hyper-parameter and tuned for better accuracy; in our case it is 2.

Linear Support Vector Machines Classifier (SVM)

We will now apply ‘Linear SVM’ on the data and see how accuracy is coming along. Here also scaling is required as a preprocessing stage.
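A sketch of this step on synthetic stand-in data, with the scaler and the linear SVM combined in one pipeline (default `LinearSVC` parameters, for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=50, n_informative=15,
                           n_classes=3, random_state=0)
pipe = Pipeline([("scaler", StandardScaler()),
                 ("svm", LinearSVC())])
scores = cross_val_score(pipe, X, y, cv=10)
print(f"average accuracy: {scores.mean():.3f}")
```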


Accuracy comes in at 96.4%, which is a little less than ‘RandomForest’ or ‘ANN’. ‘time taken’ is 55 s, which is far better than ‘ANN’.

Extreme Gradient Boosting Classifier (XGBoost)

XGBoost is a boosted-tree-based ensemble classifier. Like ‘RandomForest’, it automatically handles a large feature set. For this we have to use the separate ‘xgboost’ library, which does not come with scikit-learn. Let’s see how it works:


Accuracy (99.4%) is exceptionally good, but ‘time taken’ (15 min) is quite high. Nowadays, for complicated problems, XGBoost is becoming a default choice for Data Scientists because of its accurate results. Its running time is high due to its sequential ensemble structure; however, XGBoost performs well on GPU machines.

From all of the classifiers, it is clear that for accuracy ‘XGBoost’ is the winner. But if we take ‘time taken’ along with ‘accuracy’, then ‘RandomForest’ is a perfect choice. We also saw how a simple linear classifier like ‘Logistic Regression’ can give better accuracy with proper feature engineering; the other classifiers don’t need that much feature engineering effort.

Choosing the perfect ‘classifier’ depends on the requirements, the use case, and the data engineering environment available.

The entire project as a Jupyter Notebook can be found here.



If this article was helpful, share it.



  • Open access
  • Published: 09 September 2022

Machine learning in project analytics: a data-driven framework and case study

  • Shahadat Uddin 1 ,
  • Stephen Ong 1 &
  • Haohui Lu 1  

Scientific Reports volume 12, Article number: 15252 (2022)


  • Applied mathematics
  • Computational science

The analytic procedures incorporated to facilitate the delivery of projects are often referred to as project analytics. Existing techniques focus on retrospective reporting and understanding the underlying relationships to make informed decisions. Although machine learning algorithms have been widely used in addressing problems within various contexts (e.g., streamlining the design of construction projects), limited studies have evaluated pre-existing machine learning methods within the delivery of construction projects. Due to this, the current research aims to contribute further to this convergence between artificial intelligence and the execution of construction projects through the evaluation of a specific set of machine learning algorithms. This study proposes a machine learning-based data-driven research framework for addressing problems related to project analytics. It then illustrates an example of the application of this framework. In this illustration, existing data from an open-source data repository on construction projects and cost overrun frequencies was studied, in which several machine learning models (Python’s Scikit-learn package) were tested and evaluated. The data consisted of 44 independent variables (from materials to labour and contracting) and one dependent variable (project cost overrun frequency), which was categorised for processing under several machine learning models. These models include support vector machine, logistic regression, k-nearest neighbour, random forest, stacking (ensemble) model and artificial neural network. Feature selection and evaluation methods, including Univariate feature selection, Recursive feature elimination, SelectFromModel and the confusion matrix, were applied to determine the most accurate prediction model. This study also discusses the generalisability of using the proposed research framework in other research contexts within the field of project management. 
The proposed framework, its illustration in the context of construction projects and its potential to be adopted in different contexts will significantly contribute to project practitioners, stakeholders and academics in addressing many project-related issues.



Successful projects require the presence of appropriate information and technology 1 . Project analytics provides an avenue for informed decisions to be made through the lifecycle of a project. Project analytics applies various statistics (e.g., earned value analysis or Monte Carlo simulation) among other models to make evidence-based decisions. They are used to manage risks as well as project execution 2 . There is a tendency for project analytics to be employed due to other additional benefits, including an ability to forecast and make predictions, benchmark with other projects, and determine trends such as those that are time-dependent 3 , 4 , 5 . There has been increasing interest in project analytics and how current technology applications can be incorporated and utilised 6 . Broadly, project analytics can be understood on five levels 4 . The first is descriptive analytics which incorporates retrospective reporting. The second is known as diagnostic analytics , which aims to understand the interrelationships and underlying causes and effects. The third is predictive analytics which seeks to make predictions. Subsequent to this is prescriptive analytics , which prescribes steps following predictions. Finally, cognitive analytics aims to predict future problems. The first three levels can be applied with ease with the help of technology. The fourth and fifth steps require data that is generally more difficult to obtain as they may be less accessible or unstructured. Further, although project key performance indicators can be challenging to define 2 , identifying common measurable features facilitates this 7 . It is anticipated that project analytics will continue to experience development due to its direct benefits to the major baseline measures focused on productivity, profitability, cost, and time 8 . The nature of project management itself is fluid and flexible, and project analytics allows an avenue for which machine learning algorithms can be applied 9 .

Machine learning within the field of project analytics falls into the category of cognitive analytics, which deals with problem prediction. Generally, machine learning explores the possibilities of computers to improve processes through training or experience 10 . It can also build on the pre-existing capabilities and techniques prevalent within management to accomplish complex tasks 11 . Due to its practical use and broad applicability, recent developments have led to the invention and introduction of newer and more innovative machine learning algorithms and techniques. Artificial intelligence, for instance, allows for software to develop computer vision, speech recognition, natural language processing, robot control, and other applications 10 . Specific to the construction industry, it is now used to monitor construction environments through a virtual reality and building information modelling replication 12 or risk prediction 13 . Within other industries, such as consumer services and transport, machine learning is being applied to improve consumer experiences and satisfaction 10 , 14 and reduce the human errors of traffic controllers 15 . Recent applications and development of machine learning broadly fall into the categories of classification, regression, ranking, clustering, dimensionality reduction and manifold learning 16 . Current learning models include linear predictors, boosting, stochastic gradient descent, kernel methods, and nearest neighbour, among others 11 . Newer and more applications and learning models are continuously being introduced to improve accessibility and effectiveness.

Specific to the management of construction projects, other studies have also been made to understand how copious amounts of project data can be used 17 , the importance of ontology and semantics throughout the nexus between artificial intelligence and construction projects 18 , 19 as well as novel approaches to the challenges within this integration of fields 20 , 21 , 22 . There have been limited applications of pre-existing machine learning models on construction cost overruns. They have predominantly focussed on applications to streamline the design processes within construction 23 , 24 , 25 , 26 , and those which have investigated project profitability have not incorporated the types and combinations of algorithms used within this study 6 , 27 . Furthermore, existing applications have largely been skewed towards one type or another 28 , 29 .

In addition to the frequently used earned value method (EVM), researchers have been applying many other powerful quantitative methods to address a diverse range of project analytics research problems over time. Examples of those methods include time series analysis, fuzzy logic, simulation, network analytics, and network correlation and regression. Time series analysis uses longitudinal data to forecast an underlying project's future needs, such as the time and cost 30 , 31 , 32 . Few other methods are combined with EVM to find a better solution for the underlying research problems. For example, Narbaev and De Marco 33 integrated growth models and EVM for forecasting project cost at completion using data from construction projects. For analysing the ongoing progress of projects having ambiguous or linguistic outcomes, fuzzy logic is often combined with EVM 34 , 35 , 36 . Yu et al. 36 applied fuzzy theory and EVM for schedule management. Ponz-Tienda et al. 35 found that using fuzzy arithmetic on EVM provided more objective results in uncertain environments than the traditional methodology. Bonato et al. 37 integrated EVM with Monte Carlo simulation to predict the final cost of three engineering projects. Batselier and Vanhoucke 38 compared the accuracy of the project time and cost forecasting using EVM and simulation. They found that the simulation results supported findings from the EVM. Network methods are primarily used to analyse project stakeholder networks. Yang and Zou 39 developed a social network theory-based model to explore stakeholder-associated risks and their interactions in complex green building projects. Uddin 40 proposed a social network analytics-based framework for analysing stakeholder networks. Ong and Uddin 41 further applied network correlation and regression to examine the co-evolution of stakeholder networks in collaborative healthcare projects. 
Although many other methods have already been used, as evident in the current literature, machine learning methods or models are yet to be adopted for addressing research problems related to project analytics. The current investigation is derived from the cognitive analytics component of project analytics. It proposes an approach for determining hidden information and patterns to assist with project delivery. Figure  1 illustrates a tree diagram showing different levels of project analytics and their associated methods from the literature. It also illustrates existing methods within the cognitive component of project analytics to where the application of machine learning is situated contextually.

figure 1

A tree diagram of different project analytics methods, also showing where the current study belongs. Although earned value analysis is commonly used in project analytics, we do not include it in this figure since it is used in the first three levels of project analytics.

Machine learning models have several notable advantages over traditional statistical methods that play a significant role in project analytics 42 . First, machine learning algorithms can quickly identify trends and patterns by simultaneously analysing a large volume of data. Second, they are more capable of continuous improvement: machine learning algorithms can improve their accuracy and efficiency for decision-making through subsequent training on new data. Third, machine learning algorithms efficiently handle multi-dimensional and multi-variety data in dynamic or uncertain environments. Fourth, they are well suited to automating various decision-making tasks. For example, machine learning-based sentiment analysis can easily flag a negative tweet and automatically take further necessary steps. Last but not least, machine learning has been helpful across various industries, from defence to education 43 . Current research has seen the development of several different branches of artificial intelligence (including robotics, automated planning and scheduling, and optimisation) within safety monitoring, risk prediction, cost estimation and so on 44 . This has progressed from the applications of regression on project cost overruns 45 to the current deep-learning implementations within the construction industry 46 . Despite this, the uses remain largely limited and are still in a developmental state. The benefits of applications are noted, such as optimising and streamlining existing processes; however, high initial costs form a barrier to accessibility 44 .

The primary goal of this study is to demonstrate the applicability of different machine learning algorithms in addressing problems related to project analytics. Limitations in applying machine learning algorithms within the context of construction projects have been explored previously. However, preceding research has mainly been conducted to improve the design processes specific to construction 23 , 24 , and those investigating project profitabilities have not incorporated the types and combinations of algorithms used within this study 6 , 27 . For instance, preceding research has incorporated a different combination of machine-learning algorithms in research of predicting construction delays 47 . This study first proposed a machine learning-based data-driven research framework for project analytics to contribute to the proposed study direction. It then applied this framework to a case study of construction projects. Although there are three different machine learning algorithms (supervised, unsupervised and semi-supervised), the supervised machine learning models are most commonly used due to their efficiency and effectiveness in addressing many real-world problems 48 . Therefore, we will use machine learning to represent supervised machine learning throughout the rest of this article. The contribution of this study is significant in that it considers the applications of machine learning within project management. Project management is often thought of as being very fluid in nature, and because of this, applications of machine learning are often more difficult 9 , 49 . Further to this, existing implementations have largely been limited to safety monitoring, risk prediction, cost estimation and so on 44 . 
Through the evaluation of machine-learning applications, this study further demonstrates a case study for which algorithms can be used to consider and model the relationship between project attributes and a project performance measure (i.e., cost overrun frequency).

Machine learning-based framework for project analytics

When and why machine learning for project analytics.

Machine learning models are typically used for research problems that involve predicting the classification outcome of a categorical dependent variable. Therefore, they can be applied in the context of project analytics if the underlying objective variable is a categorical one. If that objective variable is non-categorical, it must first be converted into a categorical variable. For example, if the objective or target variable is the project cost, we can convert this variable into a categorical variable by taking only two possible values. The first value would be 0 to indicate a low-cost project, and the second could be 1 for showing a high-cost project. The average or median cost value for all projects under consideration can be considered for splitting project costs into low-cost and high-cost categories.
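The median-based split described above can be sketched in a few lines; the cost values here are hypothetical:

```python
import numpy as np

# Hypothetical project costs; the median splits them into
# low-cost (0) and high-cost (1) categories.
costs = np.array([1.2, 3.4, 2.2, 5.1, 0.9, 4.0])
median = np.median(costs)
categories = (costs > median).astype(int)
print(median, categories)
```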

For data-driven decision-making, machine learning models are advantageous. This is because traditional statistical methods (e.g., ordinary least square (OLS) regression) make assumptions about the underlying research data to produce explicit formulae for the objective target measures. Unlike these statistical methods, machine learning algorithms figure out patterns on their own directly from the data. For instance, for a non-linear but separable dataset, an OLS regression model will not be the right choice due to its assumption that the underlying data must be linear. However, a machine learning model can easily separate the dataset into the underlying classes. Figure  2 (a) presents a situation where machine learning models perform better than traditional statistical methods.

figure 2

( a ) An illustration showing the superior performance of machine learning models compared with the traditional statistical models using an abstract dataset with two attributes (X 1 and X 2 ). The data points within this abstract dataset consist of two classes: one represented with a transparent circle and the second class illustrated with a black-filled circle. These data points are non-linear but separable. Traditional statistical models (e.g., ordinary least square regression) will not accurately separate these data points. However, any machine learning model can easily separate them without making errors; and ( b ) Traditional programming versus machine learning.

Similarly, machine learning models are compelling if the underlying research dataset has many attributes or independent measures. Such models can identify features that significantly contribute to the corresponding classification performance regardless of their distributions or collinearity. Traditional statistical methods are prone to biased results when there is correlation between independent variables. Current machine learning-based studies specific to project analytics have been largely limited. Despite this, there have been tangential studies on the use of artificial intelligence to improve cost estimations as well as risk prediction 44 . Additionally, models have been implemented in the optimisation of existing processes 50 .

Machine learning versus traditional programming

Machine learning can be thought of as a process of teaching a machine (i.e., computers) to learn from data and adjust or apply its present knowledge when exposed to new data 42 . It is a type of artificial intelligence that enables computers to learn from examples or experiences. Traditional programming requires some input data and some logic in the form of code (program) to generate the output. Unlike traditional programming, the input data and their corresponding output are fed to an algorithm to create a program in machine learning. This resultant program can capture powerful insights into the data pattern and can be used to predict future outcomes. Figure  2 (b) shows the difference between machine learning and traditional programming.

Proposed machine learning-based framework

Figure  3 illustrates the proposed machine learning-based research framework of this study. The framework starts with breaking the project research dataset into the training and test components. As mentioned in the previous section, the research dataset may have many categorical and/or nominal independent variables, but its single dependent variable must be categorical. Although there is no strict rule for this split, the training data size is generally more than or equal to 50% of the original dataset 48 .

figure 3

The proposed machine learning-based data-driven framework.

Machine learning algorithms can handle variables that have only numerical outcomes. So, when one or more of the underlying categorical variables have a textual or string outcome, we must first convert them into the corresponding numerical values. Suppose a variable can take only three textual outcomes (low, medium and high). In that case, we could consider, for example, 1 to represent low , 2 to represent medium , and 3 to represent high . Other statistical techniques, such as the RIDIT (relative to an identified distribution) scoring 51 , can also be used to convert ordered categorical measurements into quantitative ones. RIDIT is a parametric approach that uses probabilistic comparison to determine the statistical differences between ordered categorical groups. The remaining components of the proposed framework have been briefly described in the following subsections.
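The simple ordinal mapping described above (low/medium/high to 1/2/3) can be sketched as follows; the variable values are hypothetical, and RIDIT scoring would be an alternative for ordered categories:

```python
import pandas as pd

# Hypothetical ordered categorical variable.
ratings = pd.Series(["low", "high", "medium", "low"])
mapping = {"low": 1, "medium": 2, "high": 3}
encoded = ratings.map(mapping)
print(encoded.tolist())  # [1, 3, 2, 1]
```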

Model-building procedure

The next step of the framework is to follow the model-building procedure to develop the desired machine learning models using the training data. The first step of this procedure is to select suitable machine learning algorithms or models. Among the available machine learning algorithms, the commonly used ones are support vector machine, logistic regression, k-nearest neighbours, artificial neural network, decision tree and random forest 52 . One can also select an ensemble machine learning model as the desired algorithm. An ensemble machine learning method uses multiple algorithms or the same algorithm multiple times to achieve better predictive performance than could be obtained from any of the constituent learning models alone 52 . Three widely used ensemble approaches are bagging, boosting and stacking. In bagging, the research dataset is divided into different equal-sized subsets, and the underlying machine learning algorithm is applied to these subsets for classification. In boosting, a random sample of the dataset is selected and then fitted and trained sequentially with different models, each compensating for the weaknesses of the previous model. Stacking combines different weak machine learning models in a heterogeneous way to improve the predictive performance. The random forest algorithm, for instance, is a bagging ensemble of different decision tree models 42 .

Second, each selected machine learning model will be processed through the k-fold cross-validation approach to improve predictive efficiency. In k-fold cross-validation, the training data is divided into k folds. In each iteration, (k-1) folds are used to train the selected machine learning models, and the remaining fold is used for validation. This process continues until each of the k folds has been used for validation. The final predictive efficiency of the trained models is based on the average of the outcomes of these iterations. In addition to this average value, researchers use the standard deviation of the results from the different iterations as the predictive training efficiency. Supplementary Fig 1 shows an illustration of the k-fold cross-validation.
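The k-fold procedure above can be sketched with scikit-learn; the data here is synthetic and k = 5 is an arbitrary illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=1)
# cross_val_score trains on (k-1) folds and validates on the held-out fold,
# repeating until every fold has served as the validation set once.
scores = cross_val_score(RandomForestClassifier(random_state=1), X, y, cv=5)
print(scores.mean(), scores.std())  # average and spread across the k folds
```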

Third, most machine learning algorithms require pre-defined values for several of their parameters, known as hyperparameters; the process of finding good values for them is called hyperparameter tuning. The settings of these parameters play a vital role in the achieved performance of the underlying algorithm, and for a given algorithm the optimal values can differ from one dataset to another. The same algorithm therefore needs to run multiple times with different parameter values to find its optimal setting for a given dataset. Many algorithms are available in the literature for this purpose, such as Grid search 53. In Grid search, the hyperparameter space is divided into a discrete grid, where each grid point represents a specific combination of the underlying model parameters. The parameter values of the point that yields the best performance are taken as the optimal values 53.
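A grid search over a small hypothetical hyperparameter grid might look as follows with scikit-learn's `GridSearchCV`; the grid values and data are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=8, random_state=0)

# Each grid point is one combination of hyperparameter values
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Cross-validated search over all six combinations
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_)
```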

Testing of the developed models and reporting results

Once the desired machine learning models have been developed using the training data, they need to be tested using the test data. Each trained model is applied to predict the dependent variable for every data instance in the test data. Therefore, for each data instance, two categorical outcomes are available for its dependent variable: one predicted by the underlying trained model and the other its actual category. These predicted and actual outcome values are used to report the results of the underlying machine learning model.

The fundamental tool for reporting results from machine learning models is the confusion matrix, which consists of four integer values 48. The first value represents the number of positive cases correctly identified as positive by the underlying trained model (true-positive). The second value indicates the number of positive instances incorrectly identified as negative (false-negative). The third value represents the number of negative cases incorrectly identified as positive (false-positive). Finally, the fourth value indicates the number of negative instances correctly identified as negative (true-negative). Researchers also use performance measures based on these four values to report machine learning results. The most commonly used measure is accuracy, which is the ratio of the number of correct predictions (true-positive + true-negative) to the total number of data instances (the sum of all four values of the confusion matrix). Other measures commonly used to report machine learning results are precision, recall and F1-score. Precision is the ratio of true-positives to the total number of positive predictions (i.e., true-positive + false-positive) and is often used to indicate the quality of a positive prediction made by a model 48. Recall, also known as the true-positive rate, is calculated by dividing the true-positives by the number of data instances that should have been predicted as positive (i.e., true-positive + false-negative). F1-score is the harmonic mean of these last two measures, i.e., [(2 × Precision × Recall)/(Precision + Recall)], and the error rate equals (1 - Accuracy).
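The four confusion-matrix counts and the measures derived from them can be captured in a short helper; the example counts below are made up purely for illustration.

```python
def classification_metrics(tp, fn, fp, tn):
    """Derive the standard measures from the four confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                             # true-positive rate
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "error_rate": 1 - accuracy}

# Illustrative counts: 40 TP, 10 FN, 5 FP, 45 TN
m = classification_metrics(tp=40, fn=10, fp=5, tn=45)
print(m)  # accuracy 0.85, precision ~0.889, recall 0.8
```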

Another essential tool for reporting machine learning results is variable or feature importance, which identifies the independent variables (features) contributing most to the classification performance. The importance of a variable refers to how much a given machine learning algorithm relies on that variable in making accurate predictions 54. A widely used related technique is principal component analysis, which reduces the dimensionality of the data while minimising information loss, thereby increasing the interpretability of the underlying machine learning outcome. It also helps in finding the important features in a dataset and in plotting them in 2D and 3D 54.

Ethical approval

Ethical approval was not required for this study since it used publicly available data for research investigation purposes. All research was performed in accordance with relevant guidelines and regulations.

Informed consent

Due to the nature of the data sources, informed consent was not required for this study.

Case study: an application of the proposed framework

This section illustrates an application of this study's proposed framework (Fig.  2 ) in a construction project context. We apply the framework to classify projects into two classes based on their cost overrun experience: projects that rarely experience a cost overrun belong to the first class (Rare class), and projects that often experience one belong to the second class (Often class). In doing so, we consider a list of independent variables or features.

Data source

The research dataset is taken from an open-source data repository, Kaggle 55. This survey-based dataset was collected to explore the causes of project cost overrun in Indian construction projects 45 and consists of 44 independent variables or features and one dependent variable. The independent variables cover a wide range of cost overrun factors, from materials and labour to contractual issues and the scope of the work. The dependent variable is the frequency of experiencing project cost overrun (rare or often). The dataset contains 139 instances; 65 belong to the rare class, and the remaining 74 are from the often class. We converted each categorical variable with a textual or string outcome into an appropriate numerical value range to prepare the dataset for machine learning analysis. For example, we used 1 and 2 to represent the rare and often classes, respectively. The correlation matrix among the 44 features is presented in Supplementary Fig 2 .
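The textual-to-numerical conversion described above can be sketched with a pandas mapping; the column name and values here are illustrative, not the exact names in the Kaggle dataset.

```python
import pandas as pd

# Hypothetical fragment of the survey data; the column name is illustrative
df = pd.DataFrame({"overrun_frequency": ["rare", "often", "often", "rare"]})

# Map the textual class labels to the numeric codes used in the case study
df["overrun_frequency"] = df["overrun_frequency"].map({"rare": 1, "often": 2})
print(df["overrun_frequency"].tolist())  # [1, 2, 2, 1]
```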

Machine learning algorithms

This study considered six machine learning techniques to explore the causes of project cost overrun using the research dataset mentioned above: support vector machine, logistic regression, k-nearest neighbours, random forest, artificial neural network, and a stacking ensemble of the first four.

Support vector machine (SVM) is a supervised learning method for classifying data. For instance, if one wants to determine which projects should be classified as programmatically successful based on precedent data, SVM provides a practical approach for prediction. SVM functions by assigning labels to objects 56. The comparison attributes are used to separate these objects into different groups or classes by maximising their marginal distances and minimising the classification errors. The attributes are plotted multi-dimensionally, allowing a separation boundary, known as a hyperplane (see Supplementary Fig 3 (a)), to distinguish between the underlying classes or groups 52. Support vectors are the data points that lie closest to the decision boundary on both sides; in Supplementary Fig 3 (a), they are the circles (both transparent and shaded) close to the hyperplane. Support vectors play an essential role in deciding the position and orientation of the hyperplane. Various computational methods, including kernel functions that create additional derived attributes, are applied to accommodate this process 56. Support vector machines are not limited to binary classes but can be generalised to a larger variety of classification problems, accomplished through the training of separate SVMs 56.
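A minimal SVM sketch with scikit-learn, assuming two well-separated synthetic groups; after fitting, the support vectors closest to the hyperplane are exposed directly.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated synthetic groups standing in for project classes
X, y = make_blobs(n_samples=100, centers=[[0, 0], [6, 6]], random_state=7)

# A linear kernel: the hyperplane is a straight separating boundary
clf = SVC(kernel="linear").fit(X, y)

# The support vectors are the points lying closest to the hyperplane
print(len(clf.support_vectors_), f"{clf.score(X, y):.2f}")
```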

Logistic regression (LR) builds on the linear regression model and predicts the outcome of a dichotomous variable 57, for example, the presence or absence of an event. It models the connection between one dependent variable and one or more independent variables (see Supplementary Fig 3 (b)). The LR model fits the data to a sigmoidal curve instead of a straight line; the natural logarithm is used when developing the model. It provides a value between 0 and 1 that is interpreted as the probability of class membership. Best estimates are determined by refining approximate estimates until a level of stability is reached 58. Generally, LR offers a straightforward approach for determining and observing interrelationships and is more efficient than ordinary regression 59.
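The sigmoidal mapping at the heart of LR can be written directly; the coefficients below are hypothetical, chosen only to show how a linear score becomes a probability of class membership between 0 and 1.

```python
import math

def sigmoid(z):
    """Map a linear score onto the (0, 1) probability scale."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted coefficients: intercept b0 and a single slope b1
b0, b1 = -2.0, 0.8

def p_membership(x):
    """Probability of class membership for a single predictor value x."""
    return sigmoid(b0 + b1 * x)

print(p_membership(2.5))  # 0.5: b0 + b1 * 2.5 = 0, the decision boundary
```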

The k-nearest neighbours (KNN) algorithm plots prior information and applies a specific sample size (k) to determine the most likely class 52. The method finds the nearest training examples using a distance measure, and the final classification is made by counting the most common class, or votes, within that sample. As illustrated in Supplementary Fig 3 (c), the four nearest neighbours in the small circle are three grey squares and one white square; the majority class is grey, so KNN predicts the instance (i.e., Χ) as grey. If we instead look at the larger circle of the same figure, the nearest neighbours consist of ten white squares and four grey squares; the majority class is white, so KNN classifies the instance as white. KNN's advantages lie in its ability to produce a simple, interpretable result and to handle missing data 60. In summary, KNN uses similarities (as well as differences) and distances when developing models.
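A minimal KNN sketch with scikit-learn on toy 2-D points; the coordinates and the choice of k are illustrative only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points: class 0 clustered near the origin, class 1 near (5, 5)
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# k = 3: the three nearest training examples vote on the class
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[0.5, 0.5]])[0], knn.predict([[5.5, 5.5]])[0])  # 0 1
```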

Random forest (RF) is a machine learning method that consists of many decision trees. A decision tree is a tree-like structure in which each internal node represents a test on an input attribute; it may have multiple internal nodes at different levels, and the leaf or terminal nodes represent the decision outcomes. Each tree produces a classification outcome for a distinct part of the input space. For numerical outcomes, the forest takes the average of the individual tree predictions, and for discrete outcomes, it counts the votes 52. Supplementary Fig 3 (d) shows three decision trees to illustrate how a random forest works: the outcomes from trees 1, 2 and 3 are class B, class A and class A, respectively, so by majority vote the final prediction is class A. Because trees split on specific attributes, the forest can tend to emphasise some attributes over others, which may result in attributes being unevenly weighted 52. Advantages of random forest include its ability to handle multidimensionality and multicollinearity in data, despite its sensitivity to sampling design.
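The majority-vote step of Supplementary Fig 3(d) can be expressed in a few lines; this is a sketch of the voting rule only, not a full forest implementation.

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Final random forest class: the majority vote across the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Supplementary Fig 3(d): trees 1-3 predict class B, class A, class A
print(forest_vote(["B", "A", "A"]))  # A
```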

An artificial neural network (ANN) simulates the way human brains work. This is accomplished by modelling logical propositions and combining weighted inputs, a transfer function and an output 61 (Supplementary Fig 3 (e)). It is advantageous because it can model non-linear relationships and handle multivariate data 62. ANNs can be supervised or unsupervised and learn through three major avenues: error back-propagation (supervised), the Kohonen network (unsupervised) and counter-propagation (supervised) 62. ANN has been used in a myriad of applications ranging from pharmaceuticals 61 to electronic devices 63. It also possesses a high level of fault tolerance 64 and learns by example and through self-organisation 65.

Ensemble techniques are a machine learning methodology in which numerous basic classifiers are combined to generate an optimal model 66. An ensemble technique considers many models and combines them into a single model in which the weaknesses of the individual learners are offset, resulting in a more powerful model with improved performance. The stacking model is a general architecture comprising two classifier levels: base classifiers and a meta-learner 67. The base classifiers are trained with the training dataset, and their outputs form a new dataset for the meta-learner; this new dataset is then used to train the meta-classifier. This study uses four models (SVM, LR, KNN and RF) as base classifiers and LR as the meta-learner, as illustrated in Supplementary Fig 3 (f).
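This stacking design can be sketched with scikit-learn's `StackingClassifier`, using the four base classifiers and an LR meta-learner on synthetic data; the hyperparameters below are defaults, not the tuned case-study settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=3)

# Four base classifiers; logistic regression acts as the meta-learner
base = [
    ("svm", SVC()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier(random_state=3)),
]
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression()).fit(X, y)
print(f"training accuracy: {stack.score(X, y):.3f}")
```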

Feature selection

Feature selection is the process of choosing the optimal subset of features that significantly influence the predicted outcomes, which can improve model performance and reduce running time. This study considers three feature selection approaches: Univariate feature selection (UFS), Recursive feature elimination (RFE) and the SelectFromModel (SFM) approach. UFS examines each feature separately to determine the strength of its relationship with the response variable 68. This method is straightforward to use and understand and helps to acquire a deeper understanding of the data; in this study, we calculate the chi-square statistic between each feature and the response variable. RFE is a type of backwards feature elimination in which the model is first fit using all features in the given dataset and the least important features are then removed one by one 69. The model is refit until the desired number of features, determined by a parameter, is left. SFM chooses effective features based on the feature importances of the best-performing model 70. This approach selects features by establishing a threshold on the feature importances computed by the model on the training set: features whose importance exceeds the threshold are kept, while those below it are discarded. In this study, we apply SFM after comparing the performance of the machine learning methods, and then retrain the best-performing model using the features selected by SFM.
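The three approaches can be sketched with scikit-learn's feature-selection utilities on synthetic data; the number of features to keep and all other settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=150, n_features=12, random_state=5)
X_pos = MinMaxScaler().fit_transform(X)  # chi2 needs non-negative inputs

# UFS: score each feature against the response with the chi-square statistic
ufs = SelectKBest(chi2, k=5).fit(X_pos, y)

# RFE: repeatedly drop the least important feature until 5 remain
rfe = RFE(RandomForestClassifier(random_state=5),
          n_features_to_select=5).fit(X, y)

# SFM: keep features whose importance exceeds a threshold (the mean here)
sfm = SelectFromModel(RandomForestClassifier(random_state=5)).fit(X, y)

print(ufs.get_support().sum(), rfe.get_support().sum(), sfm.get_support().sum())
```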

Findings from the case study

We split the dataset 70:30 into training and test sets for the selected machine learning techniques. We used Python's Scikit-learn package to implement these algorithms 70. Using the training data, we first developed six models based on the six techniques described above, using fivefold cross-validation with accuracy as the target measure. Then, we applied these models to the test data. We also executed all required hyperparameter tuning for each algorithm to obtain the best possible classification outcome. Table 1 shows the performance outcomes for each algorithm during the training and test phases. The hyperparameter settings for each algorithm are listed in Supplementary Table 1 .

As revealed in Table 1 , random forest outperformed the other algorithms in terms of accuracy for both the training and test phases, with accuracies of 78.14% and 77.50%, respectively. The second-best performer in the training phase was k-nearest neighbours (76.98%); in the test phase, the support vector machine, k-nearest neighbours and artificial neural network tied for second place (72.50%).

Since random forest showed the best performance, we explored it further. We applied the three feature optimisation approaches (UFS, RFE and SFM) to the random forest; the results are presented in Table 2 . SFM shows the best outcome among the three approaches, with an accuracy of 85.00%, whereas the accuracies of UFS and RFE are 77.50% and 72.50%, respectively. As can be seen in Table 2 , the accuracy for the testing phase increases from 77.50% in Table 1 to 85.00% with SFM feature optimisation. Table 3 shows the 19 features selected by SFM: out of 44 features, SFM found that 19 play a significant role in predicting the outcomes.

Further, Fig.  4 illustrates the confusion matrix when the random forest model with the SFM feature optimiser was applied to the test data. There are 18 true-positive, five false-negative, one false-positive and 16 true-negative cases. Therefore, the accuracy for the test phase is (18 + 16)/(18 + 5 + 1 + 16) = 85.00%.

figure 4

Confusion matrix results based on the random forest model with the SFM feature optimiser (1 for the rare class and 2 for the often class).

Figure  5 illustrates the ten most important features or variables based on the random forest algorithm with the SFM optimiser. We used feature importance based on the mean decrease in impurity to identify this list. Mean decrease in impurity computes each feature's importance as the sum, over the splits that include the feature, of the impurity reduction, in proportion to the number of samples each split involves 71. According to this figure, the delays in decision making attribute contributed most to the classification performance of the random forest algorithm, followed by the cash flow problem and construction cost underestimation attributes. The current construction project literature also highlights these top-10 factors as significant contributors to project cost overrun. For example, using construction project data from Jordan, Al-Hazim et al. 72 ranked 20 causes of cost overrun, including causes similar to these.
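Mean-decrease-in-impurity importances are exposed by scikit-learn's random forest as `feature_importances_`; the sketch below uses synthetic data, not the case-study features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=2)
rf = RandomForestClassifier(random_state=2).fit(X, y)

# One mean-decrease-in-impurity value per feature; the values sum to 1
ranked = sorted(enumerate(rf.feature_importances_), key=lambda t: -t[1])
for idx, imp in ranked[:3]:
    print(f"feature {idx}: {imp:.3f}")
```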

figure 5

Feature importance (top-10 out of 19) based on the random forest model with the SFM feature optimiser.

Further, we conducted a sensitivity analysis of the model's ten most important features (from Fig.  5 ) to explore how a change in each feature affects the cost overrun. We used the partial dependence plot (PDP), a typical visualisation tool for non-parametric models 73, to display the outcomes of this analysis. A PDP can demonstrate whether the relation between the target and a feature is linear, monotonic or more complex. The result of the sensitivity analysis is presented in Fig.  6 . For the 'delays in decision making' attribute, the PDP shows that the probability stays below 0.4 until the rating value reaches three and increases afterwards; a higher value for this attribute indicates a higher risk of cost overrun. On the other hand, no significant differences can be seen in the remaining nine features as their values change.

figure 6

The result of the sensitivity analysis from the partial dependency plot tool for the ten most important features.

Summary of the case study

We illustrated an application of the proposed machine learning-based research framework in classifying construction projects. RF showed the highest accuracy in predicting the test dataset. For a new data instance with values for the 19 features but no information on its classification, RF can identify its class ( rare or often ) correctly with a probability of 85.00%. If more data is provided to the machine learning algorithms, in addition to the 139 instances of the case study, their accuracy and efficiency in making project classifications will improve with subsequent training; for example, 100 more data instances would give the algorithms an additional 70 instances for training under a 70:30 split. This capacity for continuous improvement puts machine learning algorithms in a superior position over traditional methods. In the current literature, some studies explore the factors contributing to project delay or cost overrun; in most cases, they applied factor analysis or other related statistical methods 72 , 74 , 75 . In addition to identifying important attributes, the proposed machine learning-based framework, when applied to this case study, ranked the factors and showed how eliminating less important factors affects prediction accuracy.

We shared the Python software developed to implement the machine learning algorithms considered in this case study on GitHub 76 , a software hosting site. A user-friendly version of this software can also be accessed online. The accuracy findings from this software could differ slightly from one run to another due to the hyperparameter settings of the corresponding machine learning algorithms.

Due to their robust prediction ability, machine learning methods have gained wide acceptance across a broad range of research domains. EVM, on the other hand, remains the most commonly used method in project analytics due to its simplicity and ease of interpretation 77 . Substantial research efforts have been made to improve its generalisability over time. For example, Naeni et al. 34 developed a fuzzy approach to earned value analysis to make it suitable for project scenarios with ambiguous or linguistic outcomes, and Acebes 78 integrated Monte Carlo simulation with EVM for project monitoring and control. Another prominent method frequently used in project analytics is time series analysis, which is compelling for the longitudinal prediction of project time and cost 30 . As evident in the current literature, however, little effort has been made to bring machine learning into project analytics to address project management research problems. This research makes a significant attempt to contribute to filling this gap.

Our proposed data-driven framework includes only the fundamental model development and application components for machine learning algorithms. It omits a few advanced-level machine learning methods, which this study intentionally did not consider since they are required only in particular designs of machine learning analysis. For example, the framework does not contain any methods or tools to handle the data imbalance issue. Data imbalance refers to a situation in which the research dataset has an uneven distribution of the target class 79 ; for example, a binary target variable causes a data imbalance issue if one of its class labels has a very high number of observations compared with the other. Commonly used techniques to address this issue are undersampling and oversampling: undersampling decreases the size of the majority class, while oversampling randomly duplicates the minority class until the class distribution becomes balanced 79 . The class distribution of the case study did not produce any data imbalance issues.

This study considered only six fundamental machine learning algorithms for the case study, although many others are available in the literature. For example, it did not consider the extreme gradient boosting (XGBoost) algorithm. XGBoost is based on the decision tree algorithm, similar to random forest 80 , and has become dominant in applied machine learning due to its performance and speed. Naïve Bayes and convolutional neural networks are other popular machine learning algorithms that were not considered when applying the proposed framework to the case study. In addition to the three feature selection methods, multi-view learning could be adopted; it is another direction in machine learning that learns from multiple views of the existing data with the aim of improving predictive performance 81 , 82 . Similarly, although we considered five performance measures, there are other potential candidates, such as the area under the receiver operating characteristic curve, which measures the ability of the underlying classifier to distinguish between classes 48 . We leave these as potential extensions when applying our proposed framework in other project contexts in future studies.

Although this study only used one case study for illustration, our proposed research framework can be used in other project analytics contexts. In such an application context, the underlying research goal should be to predict the outcome classes and find attributes playing a significant role in making correct predictions. For example, by considering two types of projects based on the time required to accomplish (e.g., on-time and delayed ), the proposed framework can develop machine learning models that can predict the class of a new data instance and find out attributes contributing mainly to this prediction performance. This framework can also be used at any stage of the project. For example, the framework’s results allow project stakeholders to screen projects for excessive cost overruns and forecast budget loss at bidding and before contracts are signed. In addition, various factors that contribute to project cost overruns can be figured out at an earlier stage. These elements emerge at each stage of a project’s life cycle. The framework’s feature importance helps project managers locate the critical contributor to cost overrun.

This study has made an important contribution to the current project analytics literature by considering the applications of machine learning within project management. Project management is often thought of as very fluid in nature, which makes applications of machine learning more difficult; further, existing implementations have largely been limited to safety monitoring, risk prediction and cost estimation. Through its evaluation of machine learning applications, this study further demonstrates how such algorithms can be used to model the relationship between project attributes and cost overrun frequency.

The applications of machine learning in project analytics are still undergoing constant development. Within construction projects, its applications have been largely limited and focused on profitability or the design of structures themselves. In this regard, our study made a substantial effort by proposing a machine learning-based framework to address research problems related to project analytics. We also illustrated an example of this framework’s application in the context of construction project management.

Like any other research, this study has a few limitations that could provide scope for future research. First, the framework does not include some advanced machine learning techniques, such as methods for handling data imbalance and kernel density estimation. Second, we considered only one case study to illustrate the application of the proposed framework; illustrations using case studies from different project contexts would confirm its robust applicability. Finally, this study did not consider all machine learning models and performance measures available in the literature for the case study. For example, we did not consider the Naïve Bayes model or the precision measure in applying the proposed research framework to the case study.

Data availability

This study obtained research data from publicly available online repositories. We mentioned their sources using proper citations. Here is the link to the data.

Venkrbec, V. & Klanšek, U. In: Advances and Trends in Engineering Sciences and Technologies II 685–690 (CRC Press, 2016).


Damnjanovic, I. & Reinschmidt, K. Data Analytics for Engineering and Construction Project Risk Management (Springer, 2020).


Singh, H. Project Management Analytics: A Data-driven Approach to Making Rational and Effective Project Decisions (FT Press, 2015).

Frame, J. D. & Chen, Y. Why Data Analytics in Project Management? (Auerbach Publications, 2018).

Ong, S. & Uddin, S. Data Science and Artificial Intelligence in Project Management: The Past, Present and Future. J. Mod. Proj. Manag. 7 , 26–33 (2020).

Bilal, M. et al. Investigating profitability performance of construction projects using big data: A project analytics approach. J. Build. Eng. 26 , 100850 (2019).


Radziszewska-Zielina, E. & Sroka, B. Planning repetitive construction projects considering technological constraints. Open Eng. 8 , 500–505 (2018).

Neely, A. D., Adams, C. & Kennerley, M. The Performance Prism: The Scorecard for Measuring and Managing Business Success (Prentice Hall Financial Times, 2002).

Kanakaris, N., Karacapilidis, N., Kournetas, G. & Lazanas, A. In: International Conference on Operations Research and Enterprise Systems. 135–155 Springer.

Jordan, M. I. & Mitchell, T. M. Machine learning: Trends, perspectives, and prospects. Science 349 , 255–260 (2015).


Shalev-Shwartz, S. & Ben-David, S. Understanding Machine Learning: From Theory to Algorithms (Cambridge University Press, 2014).


Rahimian, F. P., Seyedzadeh, S., Oliver, S., Rodriguez, S. & Dawood, N. On-demand monitoring of construction projects through a game-like hybrid application of BIM and machine learning. Autom. Constr. 110 , 103012 (2020).

Sanni-Anibire, M. O., Zin, R. M. & Olatunji, S. O. Machine learning model for delay risk assessment in tall building projects. Int. J. Constr. Manag. 22 , 1–10 (2020).

Cong, J. et al. A machine learning-based iterative design approach to automate user satisfaction degree prediction in smart product-service system. Comput. Ind. Eng. 165 , 107939 (2022).

Li, F., Chen, C.-H., Lee, C.-H. & Feng, S. Artificial intelligence-enabled non-intrusive vigilance assessment approach to reducing traffic controller’s human errors. Knowl. Based Syst. 239 , 108047 (2021).

Mohri, M., Rostamizadeh, A. & Talwalkar, A. Foundations of Machine Learning (MIT press, 2018).


Whyte, J., Stasis, A. & Lindkvist, C. Managing change in the delivery of complex projects: Configuration management, asset information and ‘big data’. Int. J. Proj. Manag. 34 , 339–351 (2016).

Zangeneh, P. & McCabe, B. Ontology-based knowledge representation for industrial megaprojects analytics using linked data and the semantic web. Adv. Eng. Inform. 46 , 101164 (2020).

Akinosho, T. D. et al. Deep learning in the construction industry: A review of present status and future innovations. J. Build. Eng. 32 , 101827 (2020).

Soman, R. K., Molina-Solana, M. & Whyte, J. K. Linked-Data based constraint-checking (LDCC) to support look-ahead planning in construction. Autom. Constr. 120 , 103369 (2020).

Soman, R. K. & Whyte, J. K. Codification challenges for data science in construction. J. Constr. Eng. Manag. 146 , 04020072 (2020).

Soman, R. K. & Molina-Solana, M. Automating look-ahead schedule generation for construction using linked-data based constraint checking and reinforcement learning. Autom. Constr. 134 , 104069 (2022).




The authors acknowledge the insightful comments from Prof Jennifer Whyte on an earlier version of this article.

Author information

Authors and Affiliations

School of Project Management, The University of Sydney, Level 2, 21 Ross St, Forest Lodge, NSW, 2037, Australia

Shahadat Uddin, Stephen Ong & Haohui Lu



Contributions

S.U.: Conceptualisation; Data curation; Formal analysis; Methodology; Supervision; and Writing (original draft, review and editing). S.O.: Data curation; and Writing (original draft, review and editing). H.L.: Methodology; and Writing (original draft, review and editing). All authors reviewed the manuscript.

Corresponding author

Correspondence to Shahadat Uddin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit .


About this article

Cite this article

Uddin, S., Ong, S. & Lu, H. Machine learning in project analytics: a data-driven framework and case study. Sci. Rep. 12, 15252 (2022).


Received: 13 April 2022

Accepted: 02 September 2022

Published: 09 September 2022





Extending Classification Algorithms to Case-Control Studies

Bryan Stanfill

1 Computing and Analytics Division, National Security Directorate, Pacific Northwest National Laboratory, Richland, WA, USA

Sarah Reehl

Lisa Bramer

Ernesto S Nakayasu

2 Biological Sciences Division, Earth and Biological Sciences Directorate, Pacific Northwest National Laboratory, Richland, WA, USA

Stephen S Rich

3 Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA

Thomas O Metz

Marian Rewers

4 Barbara Davis Center for Childhood Diabetes, University of Colorado Denver, Aurora, CO, USA

Bobbie-Jo Webb-Robertson

Associated Data

Supplemental material, Online_Appendix_Final for Extending Classification Algorithms to Case-Control Studies by Bryan Stanfill, Sarah Reehl, Lisa Bramer, Ernesto S Nakayasu, Stephen S Rich, Thomas O Metz, Marian Rewers and Bobbie-Jo Webb-Robertson in Biomedical Engineering and Computational Biology

Supplemental material, Supplemental_Material for Extending Classification Algorithms to Case-Control Studies by Bryan Stanfill, Sarah Reehl, Lisa Bramer, Ernesto S Nakayasu, Stephen S Rich, Thomas O Metz, Marian Rewers and Bobbie-Jo Webb-Robertson in Biomedical Engineering and Computational Biology

Classification is a common technique applied to ’omics data to build predictive models and identify potential markers of biomedical outcomes. Despite the prevalence of case-control studies, the number of classification methods available to analyze data generated by such studies is extremely limited. Conditional logistic regression is the most commonly used technique, but the associated modeling assumptions limit its ability to identify a large class of sufficiently complicated ’omic signatures. We propose a data preprocessing step which generalizes and makes any linear or nonlinear classification algorithm, even those typically not appropriate for matched design data, available to be used to model case-control data and identify relevant biomarkers in these study designs. We demonstrate on simulated case-control data that both the classification and variable selection accuracy of each method is improved after applying this processing step and that the proposed methods are comparable to or outperform existing variable selection methods. Finally, we demonstrate the impact of conditional classification algorithms on a large cohort study of children with islet autoimmunity.


Matched case-control (MCC) studies are a common design for epidemiological studies due to their potential gains in efficiency and avoidance of confounders. The general design of MCC studies is to group individuals with the outcome of interest (“cases”) and those without (“controls”) based on features such as age and sex. In ’omics studies, the goal of MCC studies is often to identify biomarkers that are highly correlated with the case/control labels, which can lead to a better understanding of the cause of the disease. Despite the large number of studies using this design, the number of methods that account for the pairing has grown very slowly, and those methods almost always assume a linear relationship between sample features and the outcome of interest. Furthermore, most of the popular methods used to analyze MCC studies perform poorly when the covariate space is high dimensional or when the effects are highly nonlinear.

Although other methods have been proposed, 1 , 2 a majority of methods used to analyze MCC studies while controlling for covariate information are based on conditional logistic regression (CLR). 3 Conditional logistic regression is similar to standard logistic regression but controls for the matching design by estimating the effects of each covariate conditional on the paired design. 4 As such, CLR is able to identify covariates whose linear effects are associated with each patient’s case-control status without flagging spurious relationships due solely to the paired nature of the study.

In its original form, CLR was designed to handle modestly sized covariate datasets and is not well suited to handle the volume or veracity of data present in an ’omics-type analysis. To remedy this, several variations of standard CLR have been proposed to deal with high-dimensional datasets. 5 – 9 However, based on these published results, CLR and its many variants still struggle to accurately differentiate the cases from the controls, particularly when there are nonlinear effects, the noise is sufficiently large or the number of inputs is too large. 7

Modern classification techniques such as random forests (RF), 10 support vector machine 11 (SVM), and naive Bayes (NB) can successfully model such complicated input/output relationships but do not account for the matched design of MCC studies and require modification to be used in these situations. That is, applying these methods without accounting for the paired nature of the study likely accounts for their poor performance relative to CLR, which does account for the pairing. 8 , 12 – 15 More recently, Dimou et al 16 proposed a paired SVM approach to identify damaged regions of the brain, but the specialized kernel they proposed is not applicable to general classification problems with binary outcomes.

In this article, we show that by preprocessing the data, any number of linear and nonlinear classification algorithms can be used to appropriately analyze data generated by MCC studies. This method is a general framework which, for the first time, makes a much larger set of classification algorithms available to researchers analyzing MCC study data. The new group of classification algorithms resulting from this method, called conditional classification algorithms, are designed specifically to analyze MCC studies. Using artificially generated data, we show when and by how much the proposed conditional classification algorithms outperform their standard counterparts. We also identify situations in which they will outperform CLR. In the next section, we describe the proposed preprocessing technique. We then demonstrate how classification and variable selection accuracies improve in a simulation study. Finally, we employ our methods along with CLR to a large cohort study on The Environmental Determinants of Diabetes in the Young (TEDDY) to understand the ramifications on messy and high-dimensional real-world ’omics data.

In this section, we describe the data processing step that defines the proposed set of conditional classification algorithms. The theoretical derivations that prove the validity of the proposed approach are given in the supplemental material. We then describe the methods used to generate and analyze the artificial data. Finally, the TEDDY study is described including the data cleaning steps and how it was analyzed.

Conditional classification algorithms

To make a standard classification algorithm conditional, it must account for the paired structure of the MCC study. We propose centering the within pair data by its mean to address the paired data structure. For example, consider a single protein measured on 4 individuals that are split into 2 case-control pairs. The protein abundance for the case and control in pair 1 is 750 and 500, respectively, while that same protein has abundance 500 and 250, respectively, for pair 2. Because the abundance for the control in pair 1 is the same as the case in pair 2, standard classification algorithms would not identify this protein as significant. After pair correction, however, any classification algorithm would identify this protein as significant because the pair-corrected abundance values are 125 and −125 for the case and control, respectively, in both pairs.
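The numeric toy example above can be reproduced in a few lines (a minimal illustration of within-pair mean centering, not code from the paper):

```python
import numpy as np

# Two 1:1 case-control pairs from the toy example: (case, control) per pair
abundance = np.array([750.0, 500.0, 500.0, 250.0])
pair = np.array([0, 0, 1, 1])  # pair membership of each subject

# Subtract each pair's mean abundance from both of its members
pair_means = np.array([abundance[pair == g].mean() for g in np.unique(pair)])
corrected = abundance - pair_means[pair]
print(corrected)  # [ 125. -125.  125. -125.]
```

After correction the case and control values are identical across pairs (125 and −125), so any classifier can separate them even though the raw abundances overlapped.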

Put mathematically, this is equivalent to the common statistical practice of projecting a feature matrix into the null space generated by the matrix of pair indicators. Let n denote the cohort size, which is composed of p disjoint and equally sized strata, ie, pairs, of MCC subjects. Let m denote the size of each stratum and K denote the number of features. Define the matrix of strata indicators Z = I_{p×p} ⊗ 1_m, where I_{p×p} is a p × p identity matrix, 1_m is an m-dimensional column vector of ones, and ⊗ is the Kronecker product. Then Z is an n × p matrix whose (i, j)th element is 1 if subject i is in stratum j and 0 otherwise. The projection matrix associated with Z is P_Z = Z(Z^T Z)^{-1} Z^T = I_{p×p} ⊗ (1_m 1_m^T)/m, which is an n × n block-diagonal matrix with m × m blocks in which every element is 1/m. To project the n × K feature matrix X into the null space of Z, pre-multiply X by I_{n×n} − P_Z. Define the case-control corrected feature matrix X* as X* = (I_{n×n} − P_Z)X.
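The projection can be sketched directly in NumPy, and checked against the equivalent within-pair centering (an illustrative sketch with arbitrary dimensions, not the paper's code):

```python
import numpy as np

p, m = 3, 2                               # 3 strata (pairs), m = 2 per stratum
n = p * m
rng = np.random.default_rng(0)
X = rng.normal(size=(n, 4))               # n x K feature matrix, K = 4 here

Z = np.kron(np.eye(p), np.ones((m, 1)))   # n x p matrix of strata indicators
P_Z = Z @ np.linalg.inv(Z.T @ Z) @ Z.T    # projection onto the columns of Z
X_star = (np.eye(n) - P_Z) @ X            # pair-corrected features

# The projection is exactly within-pair mean centering
X_centered = X - np.repeat(X.reshape(p, m, -1).mean(axis=1), m, axis=0)
assert np.allclose(X_star, X_centered)
```

In practice the centering form is preferable, since it avoids building the n × n projection matrix for large cohorts.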

We define any classification algorithm trained using the pair adjusted features X * as a conditional classification algorithm. We refer to training the same classification algorithm on the raw features X as the standard classification algorithm. In the “Results” section, we empirically compare the conditional and standard classification algorithms. In the context of 1:1 case-control studies, both linear discriminant analysis (LDA) and the Gaussian NB classifier are guaranteed to return one case and one control per stratum; no such guarantee exists for the standard versions of those classifiers. In addition, we show that CLR is a special case of the proposed set of paired classification algorithms. We do this by showing the CLR likelihood is unaffected by the pair correction and that the maximizer of the CLR likelihood also maximizes the pair-corrected logistic regression (up to a scaling factor). See the supporting information for details on these two theoretical results.

Figure 1 shows a simulated example that clearly demonstrates the value of the conditional classification approach over CLR and standard machine learning for data with complicated structure. This example includes only two features to allow for easy visualization, but could be extended to large feature spaces. Consider a 1:1 case-control dataset with two features, x_{ijk}, where i = 1, …, n indicates the pair, j ∈ {1, 2} indicates the person within each pair, and k ∈ {1, 2} indicates the feature number. Let X represent the full feature matrix and X_i be the 2 × 2 submatrix of X that holds all the information for case-control stratum i. Each submatrix X_i is created by

Figure 1. A single dataset from the 2-variable simulation study is plotted in its raw form (A) and after controlling for the case-control design (B). An SVM with a radial-basis kernel function was trained on the pair-corrected data, and the decision boundaries closely align with the true boundaries between classes (C). SVM indicates support vector machine.

where r_{i1} ~ Unif[0.4, 0.7], r_{i2} = −r_{i1}, φ_{i1} ~ Unif[0, 2π), and μ_k ~ N(0, 5) for k = 1, 2. The response label y_{ij} for each individual in pair i is set to 0 (control) if the point lies in the second, fourth, fifth, or seventh octant of the feature space shown in Figure 1B and to 1 (case) otherwise. This design ensures that each pair is composed of one case and one control.

A set of 100 case-control pairs generated in this fashion is plotted in Figure 1A, where the dot color indicates case-control status and the black lines connect individuals in the same pair. In its raw form, the data are noisy and difficult to classify. For this dataset, a CLR model with coefficients for x_{ij1}, x_{ij2}, and the interaction x_{ij1}x_{ij2} returns a misclassification rate of 43%. A standard SVM with a radial basis function performs slightly worse, with a misclassification rate of 47%. A large proportion of the errors committed by the SVM are due to the fact that both individuals in each stratum are given the same label. However, after the pair correction is applied ( Figure 1B ), the differences between cases and controls are clearly visible through the dividing boundary between the two classes, which is also clearly nonlinear. In fact, the conditional NB classifier and the conditional SVM with a linear kernel both perform worse than the CLR when applied to the corrected data (misclassification rates greater than 43% for this dataset). A conditional SVM with a radial basis function and a conditional random forest (CRF), however, achieve misclassification rates of 3% and 4%, respectively. In addition, the fitted class labels generated by these methods consist of one case and one control per pair. To illustrate, Figure 1C plots the decision boundary learned by the conditional SVM with a radial basis function kernel when applied to this dataset.
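The "center within pairs, then fit any off-the-shelf classifier" recipe can be sketched with scikit-learn. The generated data, effect direction, and noise levels below are illustrative assumptions, not the paper's simulation:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n_pairs = 100
d = np.array([0.5, -0.5])                 # assumed case/control offset

mu = rng.normal(0, 5, size=(n_pairs, 2))  # shared pair-level effect
cases = mu + d + rng.normal(0, 0.3, size=(n_pairs, 2))
controls = mu - d + rng.normal(0, 0.3, size=(n_pairs, 2))

X = np.vstack([cases, controls])
y = np.array([1] * n_pairs + [0] * n_pairs)

# Pair correction: subtract each pair's mean, then train any classifier
pair_mean = (cases + controls) / 2.0
X_star = X - np.vstack([pair_mean, pair_mean])

clf = SVC(kernel="rbf").fit(X_star, y)    # a conditional SVM-RBF
```

Trained on the raw X, the large shared pair effect swamps the case/control signal; after the correction the same SVC separates the classes almost perfectly.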

In terms of variable importance, the P-values associated with the regression coefficients of the CLR indicate that x_2 is an important feature (P-value .002), while x_1 and the interaction of the two variables are not statistically significant (P-values of .59 and .07, respectively). Conversely, removing either variable from the conditional support vector machine (CSVM) or CRF dramatically decreases model performance, indicating both variables are important in differentiating the cases from the controls. This indicates that CLR is not able to identify important features if their relationship with the outcome of interest is highly nonlinear, whereas the CSVM and CRF models are.

Although this is a small toy example, it illustrates the potential gains in classifier accuracy and biomarker discovery by incorporating nonlinear classifiers into the domain. In the next section, we describe a much larger simulation study to empirically investigate the potential gains of conditional classifiers.

Simulation study

The simulation scenario implemented here is motivated by Balasubramanian et al. 8 Each simulated dataset consists of 200 1:1 case-control pairs and 225 features; thus, using the notation from the previous section, m = 2, n = 400, p = 200, and K = 225. The first 25 features are significant biomarkers while the remaining 200 features are noise. The features for each pair were drawn from the bivariate normal distribution

where δ_k and ρ_k, k = 1, …, K, are the mean shift and within-pair correlation, respectively, associated with feature k. For this study we allowed the magnitude of the mean shift to take 3 possible values, |δ_k| ∈ {0.125, 0.25, 0.5}, for the significant biomarkers, with δ_k = 0 for the noise features. Within each dataset, the sign of each feature's shift was allowed to be positive or negative with equal probability; therefore, biomarkers that are both over-expressed and under-expressed in the case samples relative to their controls are considered. We considered 4 possible values of the biomarker within-pair correlations: ρ_k ∈ {0, 0.1, 0.4, 0.8}. For each combination of |δ_k| and ρ_k, 2000 datasets were created and 7 different classification algorithms were fit to each dataset: logistic regression (LR), NB, SVM with a radial basis function kernel (SVM-RBF), SVM with a linear kernel (SVM-Lin), RF, LDA, and random penalized conditional logistic regression (RPCLR). 8 Random penalized conditional logistic regression is different from the other 6 methods in that only a conditional version exists and it cannot be used to predict an individual’s case/control label. Therefore, we will only compare it to the other 6 methods in terms of variable importance accuracy and not predictive accuracy. Furthermore, because we cannot assess its predictive accuracy in the context of the TEDDY data, it will not be applied to those data.
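One pair draw from such a bivariate normal can be sketched as follows; the ±δ parameterisation of the mean shift and unit variances are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
delta, rho = 0.25, 0.4                    # one (mean shift, correlation) setting

# Draw (case, control) values of a single feature jointly for 200 pairs
cov = np.array([[1.0, rho], [rho, 1.0]])
pairs = rng.multivariate_normal([delta, -delta], cov, size=200)
cases, controls = pairs[:, 0], pairs[:, 1]
```

Repeating this draw per feature (with δ = 0 for noise features) yields one simulated dataset of the kind described above.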

To assess the impact of the proposed preprocessing step, conditional and standard versions of each method, except RPCLR, were applied to every dataset. See Table 1 for specifics on the implementation of each method. In terms of the tuning parameters associated with each of the algorithms, cross-validation (CV) was used to choose the tuning parameters for the regularized CLR model. The width of the Gaussian kernel used by the SVM-RBF model was set to the median of the squared Euclidean distances between the input features. 21 The number of trees included in the RF model was set to 500 and the number of variables to consider at each node was the largest integer less than the square root of the total number of features. For RPCLR, the number of variables included in each model was set to 7 and the number of bootstrap replicates was set to 2000, as recommended by the authors. 8
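The median-distance heuristic for the SVM-RBF kernel width can be sketched in NumPy. The mapping from width to scikit-learn's gamma parameter shown here is one common convention and an assumption, since the paper does not spell it out:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))              # stand-in feature matrix

# Pairwise squared Euclidean distances between all rows
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
width = np.median(sq[np.triu_indices_from(sq, k=1)])

# One convention: k(x, x') = exp(-||x - x'||^2 / width), ie gamma = 1 / width
gamma = 1.0 / width
```

The resulting gamma could then be passed to SVC(kernel="rbf", gamma=gamma) rather than using the library default.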

The 6 classification methods used were implemented in the program R 17 ; the specific functions (and packages) used are given below as well as the different variable importance methods used for each method.

Abbreviations: AIC, Akaike information criterion; CLR, conditional logistic regression; KL, Kullback Leibler; LDA, linear discriminant analysis; RF, random forests; RPCLR, random penalized conditional logistic regression; SVM, support vector machines.

We used predictive accuracy to determine which method should be used to identify important biomarkers. In this context, prediction is the labeling of individuals within a group that was not used to estimate the parameter values. The predictive accuracy of each method was quantified by computing the proportion of individuals whose classes were predicted correctly in a 5-fold CV framework. The exact procedure is summarized as follows:

  • Case-control pairs are randomly split into 5 disjoint groups numbered 1 through 5;
  • For each group g = 1, …, 5: (a) train each classification algorithm using all the data except group g; (b) predict the class labels for individuals in the testing set, group g;
  • Compare the true class labels to the predicted class labels in step (b) to compute the proportion of individuals that were classified correctly.
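The key detail of this scheme is that folds are built from pairs, not individuals, so both members of a pair always land in the same fold. One way to sketch this is with scikit-learn's GroupKFold (an illustrative choice; the paper does not name a specific implementation):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(3)
n_pairs = 20
pair_id = np.repeat(np.arange(n_pairs), 2)   # both members share one id
X = rng.normal(size=(2 * n_pairs, 4))
y = np.tile([1, 0], n_pairs)

# Splitting on pair_id keeps each case-control pair intact within a fold
folds = list(GroupKFold(n_splits=5).split(X, y, groups=pair_id))
for train_idx, test_idx in folds:
    assert set(pair_id[train_idx]).isdisjoint(pair_id[test_idx])
```

Splitting individuals instead of pairs would leak pair-level information between training and test sets and inflate the estimated accuracy.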

The significance of each feature was quantified by a variable importance metric computed on each of the training sets and then averaged across the 5 folds. The metrics used depend on the method applied (last column of Table 1 ). Receiver operating characteristic (ROC) curves based on the true importance labels and the variable importance metrics were created to assess the accuracy of each method, which was summarized using the area under the curve (AUC). Therefore, each method’s variable selection accuracy is measured by this AUC value, which lies in the range [0, 1] with larger values being better; a value of 0.5 corresponds to an uninformative classifier.
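Scoring variable selection this way reduces to an ROC-AUC computation over the importance values, which can be sketched as follows (the scores below are synthetic stand-ins, not outputs of the paper's methods):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
truth = np.array([1] * 25 + [0] * 200)        # 25 true biomarkers, 200 noise
scores = truth * 2.0 + rng.normal(size=225)   # hypothetical importance scores

auc = roc_auc_score(truth, scores)            # 0.5 ~ uninformative, 1.0 ~ perfect
```

Because AUC only depends on the ranking of the scores, it lets importance metrics on very different scales (P-values, Gini decreases, weight magnitudes) be compared on equal footing.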

TEDDY study

The TEDDY study 25 is a large prospective study with the goal of discovering factors that initiate the autoimmune response and destruction of the pancreatic beta cells, leading to the development of type 1 diabetes (T1D). TEDDY was formulated into a nested case-control study to enable biomarker studies, pairing on: clinical center, sex, and family history of T1D, 26 which resulted in 418 case-control pairs for analysis. TEDDY is particularly interested in understanding the environmental factors that trigger islet autoimmunity (IA), thus the metabolomic, lipidomic, and genetic single-nucleotide polymorphism (SNP) data at the time point of autoimmunity are evaluated.

After a log2 transformation of the ’omics data, 5 preprocessing steps were applied to each data source to determine a starting number of features: weighted coefficient of variation (CoV), 27 percent missingness, near zero variance (NZV), univariate pairwise significance tests, and pairwise correlation. We remove features within a source if the weighted CoV is greater than 200%. We define weighted CoV as

where n_case and n_control are the numbers of nonmissing values for cases and controls, respectively, within a time point.

We also remove any features that were more than 10% missing and use RF imputation 28 to impute those that were less than 10% missing. Next, we remove features that had very few unique values relative to the number of samples or a much greater frequency of the most common value relative to the second most common value. 29 Before significance tests, the lipid data are handled specially to remove redundant information by eliminating different adducts for the same lipid. For the negatively ionized lipids, we simply removed all Cl– adducts because they tend to ionize poorly. The positively ionized lipids depend on the lipid class in terms of which adduct to keep. For the non-LPC and non-PC classes, we retained the NH 4 adduct because it was consistently greater in peak intensity. For the ceramides class, we used the H adduct as the other (H 2 O) adduct is a degradation of the lipid due to in source fragmentation. Finally, for the LPC and PC classes, we used the most common adduct (H) as the other (Na) was rarely noted. Next, for all data types but the SNPs, univariate paired t -tests were applied to each feature and all features with P -value less than .20 were retained.

Four criteria were used to filter the SNPs: missingness, minor allele frequency (MAF), the Hardy-Weinberg test for equilibrium (HWE), and CLR. Missingness, MAF, and HWE testing were performed using PLINK version 1.90. 30 SNPs with missingness less than 1%, MAF less than 0.2, P -values from the HWE test less than .001, and conditional logistic regression P -values less than .006 were retained for the analysis. The .006 P -value threshold was chosen to parallel the 0.2 threshold used for the other data sources. That is, because there are roughly 33 times more SNPs than other data types, the SNP threshold was set to 0.2/33 ≈ 0.006. As a final step of the data cleaning procedure, all biomolecules that have pairwise correlations greater than 0.9 were removed one-by-one to minimize redundancy in the final dataset. These steps ensure that each feature does not have an excessive relative variability, does not have an excessive amount of missing data, and contains enough variability and significance to potentially distinguish cases from controls.
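The final correlation-pruning step can be sketched as follows (an illustrative toy example, not the authors' code; the 0.9 threshold matches the text, while the feature names and data are made up):

```python
# Greedily drop one feature from each pair whose absolute correlation
# exceeds 0.9, keeping the earlier feature of the pair.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
df = pd.DataFrame({
    "f1": base[:, 0],
    "f2": base[:, 0] + rng.normal(scale=0.01, size=100),  # near-duplicate of f1
    "f3": rng.normal(size=100),                           # independent feature
})

corr = df.corr().abs()
# Look only at the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
pruned = df.drop(columns=to_drop)
print(pruned.columns.tolist())  # ['f1', 'f3']
```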

The same 5-fold CV approach used in the simulation study was used for the TEDDY data to assess each method’s predictive accuracy. However, instead of implementing 5-fold CV once per dataset, the CV method was repeated 200 times per data source to account for the uncertainty of the CV procedure itself. Feature importance was assessed using a single analysis of the full dataset with each method according to the feature importance metrics reported in Table 1 . As mentioned previously, RPCLR will not be applied to these data because we cannot assess its predictive accuracy.
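The repeated cross-validation scheme can be sketched roughly as follows (a toy stand-in classifier and dataset, and 10 repeats instead of the 200 used in the study):

```python
# Repeated 5-fold CV: re-shuffle the folds on each repeat so the spread of
# accuracies reflects the uncertainty of the CV procedure itself.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

accuracies = []
for repeat in range(10):  # the study used 200 repeats per data source
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=repeat)
    accuracies.extend(cross_val_score(model, X, y, cv=cv))

print(np.mean(accuracies), np.std(accuracies))
```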

Scatter plots of the 2000 classification accuracy values for each classification method are shown in Figure 2 for | δ | = 0.125 and all values of the within pair correlation ρ. Random penalized conditional logistic regression cannot be used as a classification algorithm, so it is not included. The accuracy values for the conditional and standard versions of each method are plotted on the x - and y -axes, respectively. The red points represent datasets for which the standard version of that method was more accurate than its conditional counterpart, while the converse is true for the black points. Within each method, the cloud of points is closest to the identity line when the within pair correlation is zero or small (top two rows). This indicates the conditional methods behave most similarly to their standard counterparts for small values of ρ. As the within pair correlation grows (moving down each column), the cloud of points moves to the right, indicating the conditional methods become more accurate as ρ increases. The cloud of points does not move up, however, indicating the performance of the standard methods does not change as a function of ρ.

Figure 2. Scatter plots of the classification accuracies for all conditional ( x -axis) and standard ( y -axis) methods and values of ρ when | δ | = 0.125. The color of each point indicates which version, standard (red) or conditional (black), of each method is more accurate for each simulated dataset. LDA indicates linear discriminant analysis; LR, logistic regression; NB, Naive Bayes; RBF, radial basis function; RF, random forests; SVM, support vector machine.

These conclusions are supported by Table 2 , which gives the percent of datasets in which the standard method was preferred to the conditional method for each algorithm and combination of ρ and | δ |. The number of times the standard method is preferred decreases differently for each of the algorithms. This implies the sensitivity of each algorithm to the paired structure varies depending upon how the algorithm performs classification. In particular, the linear discriminative algorithms (LDA, LR, and SVM-Lin) require a stronger within pair correlation for the conditional method to clearly outperform its standard counterpart. The nonlinear discriminative algorithms (RF and SVM-RBF) separate themselves more quickly and by a larger margin as the correlation increases. Finally, the linear generative algorithm (NB) separates itself from the onset and creates the widest gap for large values of ρ.

Percentage of simulated datasets in which the standard version of the classification algorithm outperformed the conditional version in terms of classification accuracy by ρ and | δ | combination.

Abbreviations: LDA, linear discriminant analysis; LR, logistic regression; NB, Naive Bayes; RBF, radial basis function; RF, random forests; SVM, support vector machine.

Also apparent from Table 2 is the relationship between | δ | and the conditional method accuracies. In particular, the fact that the standard method is preferred so infrequently with large | δ | implies that accounting for the paired structure matters more when the within pair difference is larger, as expected.

An analogous representation of the variable selection accuracy is given in Figure 3 . Similar to the classification accuracy results, the conditional and standard methods cluster most closely around the identity line for small values of ρ and then drift to the right, which indicates the conditional method improves in accuracy more quickly than the standard method as a function of ρ . Unlike the accuracy results, however, the cloud of points also moves upward as a function of ρ , implying the standard method also becomes more accurate at identifying the important biomarkers as ρ increases. Therefore, the standard methods are better able to identify which features should be used to classify the data as ρ increases, but they are no better at performing the actual classification ( Figure 2 ). Finally, the algorithms appear to improve in accuracy at approximately the same rate as a function of ρ . That is, there is no clear distinction between the linear and nonlinear discriminative or generative algorithms with respect to accurate variable selection.

Figure 3. Scatter plots of the variable selection accuracies for all conditional ( x -axis) and standard ( y -axis) methods and values of ρ when | δ | = 0.125. The color of each point indicates which version, standard (red) or conditional (black), of each algorithm is more accurate for each simulated dataset. LDA indicates linear discriminant analysis; LR, logistic regression; NB, Naive Bayes; RBF, radial basis function; RF, random forests; SVM, support vector machine.

Finally, we compare all of the conditional methods by their average variable selection accuracies in Table 3 . As expected, the accuracy of all methods improves when ρ and/or | δ | increases. One existing method, RPCLR, was the most accurate method in one instance ( δ = 0.125 and ρ = 0.1 ) and was in the top 3 in 9 out of 12 scenarios. The proposed conditional Naive Bayes (CNB) and CRF approaches are most frequently in the top 3, followed by the CLDA and CSVM-RBF methods. We can therefore conclude that the proposed methods are comparable to or outperform the existing variable selection methods CLR and RPCLR.

Average variable selection accuracy for each method.

Abbreviations: CLDA, conditional linear discriminant analysis; CLR, conditional logistic regression; CNB, conditional Naive Bayes; CRF, conditional random forests; CSVM, conditional support vector machine; RBF, radial basis function; RPCLR, random penalized conditional logistic regression.

The top 3 methods for each | δ | and ρ combination are denoted with superscripts 1, 2, and 3.

Case study—TEDDY study data

The sample sizes and descriptive feature statistics for each data type are reported in Table 4 . The within pair differences and correlations were computed by taking the average absolute difference between features and the average correlation between feature vectors within each pair, respectively. Because the standard classification algorithms do not account for the pairing, we estimated the same quantities for random pairs in the dataset to determine how influential the pairing may be on algorithm performance. To do this, 10 000 random pairs were chosen, the same quantities were computed, and the results were averaged across the chosen pairs. Assuming that the mechanism that differentiates cases and controls in the TEDDY data is similar to the simulation scenario explored in the previous section, the increased correlation and difference between matched pairs relative to random pairs leads us to believe the conditional methods will outperform their standard counterparts. However, the large correlation in the data as a whole could make the difference between the methods small.

Feature set sizes, summary statistics, within pair mean absolute differences, within correlations (cor.), and random pair differences and correlations for each data type based on the 504 samples.

Abbreviation: SNP, single-nucleotide polymorphism.

The random pair distance and correlation was computed by randomly sampling 10 000 pairs of random individuals from the dataset and computing their pairwise correlation and mean absolute distance. Within pair correlations for SNPs are not reported because of their discrete nature.

Box plots of the 200 repeated accuracy values are shown in Figure 4 for each method and biomolecule. Overall, the conditional method of each algorithm (black boxes) is more accurate than the standard method (gray boxes). That is, the predictive accuracy of each algorithm is improved when the paired nature of the study is taken into account compared to when it is ignored. In 5 of the 24 comparisons performed, the standard method was more accurate than the conditional method: LDA for positive lipids (accuracy difference of 0.0015) and SNPs (0.035), and SVM with a linear kernel for positive lipids (0.029) and negative lipids (0.01). The large correlation between random pairs in both types of lipids could explain the comparable accuracy.

Figure 4. Box plots of the 200 repeated 5-fold cross-validation accuracies for the 4 different data types and 6 different classification algorithms. LDA indicates linear discriminant analysis; LR, logistic regression; NB, Naive Bayes; RBF, radial basis function; RF, random forests; SVM, support vector machine.

To make recommendations about when each algorithm should be employed in practice, the conditional methods were ranked from most (ranked 1) to least (ranked 6) accurate within each CV replicate. Those ranks were averaged across the 200 replicates and plotted in Figure 5 for the different biomolecules. Paired t -tests comparing the accuracy measures were used within each biomolecule to determine which algorithms performed significantly better than the others in terms of predictive accuracy.
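The rank-averaging procedure can be illustrated as follows (method names are omitted and the accuracy values are made up; this is not the authors' code):

```python
# Rank methods within each CV replicate (rank 1 = most accurate), then
# average the ranks across replicates.
import numpy as np
from scipy.stats import rankdata

# rows: CV replicates; columns: methods
accs = np.array([[0.81, 0.78, 0.85],
                 [0.79, 0.80, 0.84],
                 [0.82, 0.77, 0.83]])
# Negate so the highest accuracy receives rank 1
ranks = rankdata(-accs, axis=1)
print(ranks.mean(axis=0))  # average rank per method
```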

Figure 5. The average rank of each conditional method for each data type where the algorithm with the lowest rank was the most accurate within each repeated cross-validation (CV) run. CLDA indicates conditional linear discriminant analysis; CLR, conditional logistic regression; CNB, conditional Naive Bayes; CRF, conditional random forests; CSVM, conditional support vector machine; RBF, radial basis function; SNP, single-nucleotide polymorphism.

From Figure 5 , it is clear that the CSVM-RBF classifier is a reliable method to use regardless of data type. Similar to what was concluded from Figure 4 , CLR and CNB are conversely related. That is, CLR is the best method for one type of lipid data (negative), a distant second for the SNPs, and is ranked at least fourth for the remaining two data types. Conversely, CNB is the best method for the other type of lipids (positive), second for metabolites, and at least fourth for the remaining data types. In general, it appears that CLDA and CSVM-Lin are the least favorable for this type of analysis.

To determine which biomolecules were most predictive of the case/control status of each individual, the features were ranked within each source using a variable importance metric appropriate for each classification algorithm ( Table 1 ). A scree plot of the variable importance measures was used to separate the influential from the noninfluential features according to each model. An abridged list of the influential features chosen by the CSVM-RBF and CLR methods along with their ranks, direction of association (+/-), and published literature connecting each biomolecule to IA are given in Table 5 .

An abridged list of the influential metabolites and lipids identified by the CSVM-RBF and CLR methods along with notes and references for biomolecules that have been previously linked to T1D and/or IA.

Abbreviations: CLR, conditional logistic regression; CSVM, conditional support vector machine; IA, islet autoimmunity; RBF, radial basis function; SNP, single-nucleotide polymorphism; T1D, type 1 diabetes.

The direction of the association is indicated by the “+” symbol for factors that are larger in the case group relative to controls and “–” when the opposite is true.

The conditional NB classifier was the most accurate method in the simulation study and was one of the top two methods for 2 of the 4 data types ( Figure 5 ). We hypothesize this is due to the fact that NB is a generative rather than discriminative algorithm. As described in Ng et al, 31 generative algorithms have larger asymptotic classification error limits than discriminative classifiers, but have the potential to reach that error limit sooner than their discriminative counterparts. Thus, the consistent performance of the conditional NB algorithm could be due to the rather limited number of individuals in the TEDDY study relative to the complicated biology associated with IA, as learned from noisy ’omics feature sets. Similarly, even though the difference between cases and controls in the simulation study was rather simple, the small number of important features relative to unimportant ones made the signal difficult to detect.

Further evidence to support this hypothesis is that the CSVM-RBF was the second most accurate method in the simulation study and one of the top two classifiers for all data types. Because the SVM-RBF is a nonlinear classifier, it is able to model the complicated ’omics to disease relationship much better than the discriminative methods that rely on linear separators, ie, CLR, CLDA, and CSVM-Lin. This is particularly true for the metabolites in which the difference between the conditional and standard versions of each linear discriminative method is small even though they exhibit the largest between strata discrepancy ( Table 4 ), which is a counterintuitive result given the strong pairing information.

One of the most important components of paired classification methods is the ability to select biomarkers that can lead to a better understanding of the diseases being studied. Unlike standard machine learning, these methods are not designed to predict the class of a single case or control as the model is dependent on the pairing. In terms of the variables selected by the different methods ( Table 5 ), CLR and CSVM-RBF selected the same 2 metabolites as the 2 most important features: α-tocopherol and adipic acid. α-tocopherol is a vitamer of vitamin E whose exact role in the progression of IA is still being studied, 40 but has been shown to protect against its progression. 38 , 39 Our results similarly indicate that α-tocopherol is negatively associated with IA progression. The role of adipic acid in type 2 diabetes has previously been studied, 41 , 42 but this is the first time it has been shown to play an active role in IA. Our results agree and suggest that large amounts of adipic acid are associated with a higher incidence of IA. That both methods found these 2 metabolites to be highly important and agreed on the direction of association is a testament to the proposed methods’ validity with real data.

Both methods indicate that hydroxybutanoic acid is positively correlated with IA, which agrees with previously published studies that indicate it is an early marker for glucose intolerance. 45 , 46 Creatinine, a waste product created by the breakdown of muscle, was ranked highly by both methods (CSVM-RBF 7; CLR 3) and is a commonly used marker for kidney function in that higher levels of creatinine in the blood or urine indicate decreased kidney function. Islet autoimmunity and hypoglycemia are closely linked with kidney function; therefore, increased creatinine levels would be expected among individuals with IA relative to healthy controls due to compromised kidney function. As such, creatinine could be positively correlated with IA, as our analysis suggests.

In terms of lipids, there is less agreement between the 2 methods, but several results consistent with the literature have been identified. Conditional logistic regression found acylcarnitine, identified among the positively ionized lipids, to be highly important to model performance and that it is positively correlated with IA progression. Previous studies also found that C3 and C4 acylcarnitines were significantly more abundant in patients with IA and T2D relative to their controls. 32 A highly significant biomarker identified among the positively ionized lipids was hexosylceramide (annotated as glccer), which is negatively associated with IA progression according to both our results and a previous study that found the activation of natural killer T cells by a variant of alpha-galactosylceramide prevents the onset and recurrence of autoimmune IA. It is also possible that the hexosylceramide is a glucosylceramide because both molecules are isobaric and indistinguishable by mass spectrometry. Inhibition of glucosylceramide synthesis has been associated with an improvement of insulin tolerance. 57 The CSVM-RBF found ceramide to be negatively associated with IA progression, which is supported by the literature. 34 In general, high levels of some fatty acids, such as FA (16:1), have been found to be risk factors for IA, 36 , 37 a result supported by both methods. Finally, some ω-3 polyunsaturated fatty acids, such as pc_36_5_4_29_824_54, have been found to be negatively associated with IA, which is again confirmed by our results.

The SNPs interrogated in this report are from the ImmunoChip, a custom genotyping array based on robust genome-wide association study (GWAS) results obtained from 12 autoimmune diseases. The SNPs listed in Table 5 consistently point to the insulin component of T1D. The CSVM-RBF method highly ranked rs7158663, a SNP located on the maternal expressed gene 3 (Meg3) on chromosome 14q32.2, while CLR ranked it rather low. You et al 52 showed that the downregulation of Meg3 is associated with impaired glucose tolerance and decreased insulin secretion in mice. Our results are validated in the TEDDY cohort, where decreased Meg3 expression is associated with development of islet autoimmunity or T1D. Conditional logistic regression highly ranked rs17388568 and rs4580644, SNPs that are located in the adenosine deaminase domain containing 1 (ADAD1) gene on chromosome 4q27 and in introns of the cluster of differentiation 38 (CD38) gene on chromosome 4p15.32. The rs17388568 SNP has been identified as a risk factor for T1D in the Wellcome Trust Case Control Consortium 53 as well as in a follow-up study. 54 The rs4580644 SNP is predicted to influence regulation based upon effects on enhancer and histone marks and DNAase hypersensitivity. CD38 plays a key role in insulin secretion and has been shown to differentiate individuals with and without T1D; in particular, anti-CD38 autoantibodies have been suggested as new diagnostic biomarkers in autoimmunity in diabetes. 55 , 56

On the whole, we have demonstrated that a wide range of classification algorithms can be used to correctly analyze and interrogate features of nested case-control studies provided the study design is accounted for with a prior data transformation. Through a simulation study and analysis of the TEDDY data, we have demonstrated that CLR is limited in the types of relationships it can model and can typically be outperformed by more sophisticated classification algorithms like SVM and NB. In particular, CLR and CSVM-RBF agreed on several potential biomarkers for IA, but the CSVM-RBF identified several other potential markers that have not been previously identified. We believe this demonstrates the potential to identify more meaningful biomarkers through the use of analytical methods beyond CLR.

Supplemental Material


The authors would like to thank PNNL scientist Jennifer Kyle for her help with filtering the lipidomic data.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The TEDDY Study is funded by U01 DK63829, U01 DK63861, U01 DK63821, U01 DK63865, U01 DK63863, U01 DK63836, U01 DK63790, UC4 DK63829, UC4 DK63861, UC4 DK63821, UC4 DK63865, UC4 DK63863, UC4 DK63836, UC4 DK95300, UC4 DK100238, UC4 DK106955, UC4 DK112243, UC4 DK117483, and Contract No. HHSN267200700014C from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institute of Allergy and Infectious Diseases (NIAID), National Institute of Child Health and Human Development (NICHD), National Institute of Environmental Health Sciences (NIEHS), Centers for Disease Control and Prevention (CDC), and JDRF. This work was supported in part by the NIH/NCATS Clinical and Translational Science Awards to the University of Florida (UL1 TR000064) and the University of Colorado (UL1 TR001082).

Declaration of conflicting interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

* Members of the TEDDY Study Group are listed in the online supplemental appendix .


Supplemental material: Supplemental material for this article is available online.



Binary Classification

What is binary classification?

In machine learning, binary classification is a supervised learning task in which a model categorizes new observations into one of two classes.

The following are a few binary classification applications, where each observation is assigned one of two possible classes, labeled 0 and 1:

Quick example

In a medical diagnosis, a binary classifier for a specific disease could take a patient's symptoms as input features and predict whether the patient is healthy or has the disease. The possible outcomes of the diagnosis are positive and negative .

Evaluation of binary classifiers

If the model correctly predicts a patient as positive, this case is called a True Positive (TP) . If the model correctly predicts a patient as negative, this is called a True Negative (TN) . The binary classifier may misdiagnose some patients as well. If a diseased patient is classified as healthy by a negative test result, this error is called a False Negative (FN) . Similarly, if a healthy patient is classified as diseased by a positive test result, this error is called a False Positive (FP) .

We can evaluate a binary classifier based on the following parameters:

  • True Positive (TP): The patient is diseased and the model predicts "diseased"
  • False Positive (FP): The patient is healthy but the model predicts "diseased"
  • True Negative (TN): The patient is healthy and the model predicts "healthy"
  • False Negative (FN): The patient is diseased and the model predicts "healthy"

After obtaining these values, we can compute the accuracy score of the binary classifier as follows: $$ accuracy = \frac {TP + TN}{TP+FP+TN+FN} $$
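A minimal sketch of this formula in code (the counts here are made-up examples):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)

print(accuracy(tp=40, tn=45, fp=5, fn=10))  # 0.85
```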

The following is a confusion matrix , which represents the above parameters:


In machine learning, many methods utilize binary classification. The most common are:

  • Support Vector Machines
  • Naive Bayes
  • Nearest Neighbor
  • Decision Trees
  • Logistic Regression
  • Neural Networks

The following Python example will demonstrate using binary classification in a logistic regression problem.

A Python example for binary classification

For our data, we will use the breast cancer dataset from scikit-learn. This dataset contains tumor observations and corresponding labels for whether the tumor was malignant or benign.

First, we'll import a few libraries and then load the data. When loading the data, we'll specify as_frame=True so we can work with pandas objects (see our pandas tutorial for an introduction).
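A sketch of this loading step (variable names such as `dataset`, `X`, and `y` are our choices, not mandated by the tutorial):

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects instead of plain NumPy arrays
dataset = load_breast_cancer(as_frame=True)
X = dataset.data    # DataFrame: 569 observations x 30 features
y = dataset.target  # Series of labels: 0 = malignant, 1 = benign
print(X.shape)      # (569, 30)
```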

The dataset contains a DataFrame for the observation data and a Series for the target data.

Let's see what the first few rows of observations look like:

5 rows × 30 columns

The output shows five observations with a column for each feature we'll use to predict malignancy.

Now, for the targets:

The targets for the first five observations are all zero, meaning the tumors are malignant (in scikit-learn's encoding, 0 is malignant and 1 is benign). Here's how many malignant and benign tumors are in our dataset:
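One way to count the classes (re-loading the target here so the snippet runs on its own):

```python
from sklearn.datasets import load_breast_cancer

y = load_breast_cancer(as_frame=True).target
print(y.value_counts())  # 357 benign (1) and 212 malignant (0)
```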

So we have 212 malignant tumors, denoted as 0, and 357 benign, denoted as 1, giving us a binary classification problem.

To perform binary classification using logistic regression with sklearn, we must accomplish the following steps.

Step 1: Define explanatory and target variables

We'll store the rows of observations in a variable X and the corresponding class of those observations (0 or 1) in a variable y .

Step 2: Split the dataset into training and testing sets

We use 75% of the data for training and 25% for testing. Setting random_state=0 will ensure your results are the same as ours.

Step 3: Normalize the data for numerical stability

Note that we normalize after splitting the data. It's good practice to apply any data transformations to training and testing data separately to prevent data leakage .
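Steps 1 through 3 might look like the following sketch (the variable names are our assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target  # Step 1: explanatory and target variables

# Step 2: 75/25 train/test split, seeded for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 3: fit the scaler on the training data only, then apply it to both,
# which avoids leaking test-set statistics into training
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```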

Step 4: Fit a logistic regression model to the training data

This step effectively trains the model to predict the targets from the data.

Step 5: Make predictions on the testing data

With the model trained, we now ask the model to predict targets based on the test data.
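A sketch of steps 4 and 5, repeating the earlier preparation so it runs on its own:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)

# Step 4: train the model on the normalized training data
model = LogisticRegression(max_iter=1000)
model.fit(scaler.transform(X_train), y_train)

# Step 5: predict targets for the held-out test data
y_pred = model.predict(scaler.transform(X_test))
print(y_pred[:5])
```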

Step 6: Calculate the accuracy score by comparing the actual values and predicted values.

We can now calculate how well the model performed by comparing the model's predictions to the true target values, which we reserved in the y_test variable.

First, we'll calculate the confusion matrix to get the necessary parameters:

With these values, we can now calculate an accuracy score:
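Step 6 might look like this (again repeating the earlier steps so the snippet is self-contained):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels (0, 1)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
acc = (tp + tn) / (tp + fp + tn + fn)
print(acc, accuracy_score(y_test, y_pred))  # the two values agree
```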

Other binary classifiers in the scikit-learn library

Logistic regression is just one of many classification algorithms defined in Scikit-learn. We'll compare several of the most common, but feel free to read more about these algorithms in the sklearn docs here .

We'll also use the sklearn Accuracy, Precision, and Recall metrics for performance evaluation. See the docs here if you'd like to read more about the available metrics.

Initializing each binary classifier

To quickly train each model in a loop, we'll initialize each model and store it by name in a dictionary:
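One possible dictionary of default classifiers (the exact set shown here is an illustrative choice):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Each model is stored by name so we can loop over them uniformly
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "Naive Bayes": GaussianNB(),
    "Nearest Neighbor": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
}
print(list(models))
```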

Performance evaluation of each binary classifier

Now that we've initialized the models, we'll loop over each one, train it by calling .fit() , make predictions, calculate metrics, and store each result in a dictionary.
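The loop might look like this sketch (two models shown for brevity; the split and scaling mirror the earlier steps):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)        # train
    y_pred = model.predict(X_test)     # predict
    results[name] = {                  # evaluate
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
    }
print(results)
```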

With all metrics stored, we can use pandas to view the data as a table:
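For example (with made-up metric values standing in for the `results` dictionary built above):

```python
import pandas as pd

results = {"Logistic Regression": {"Accuracy": 0.96, "Precision": 0.97, "Recall": 0.97}}
table = pd.DataFrame(results).T  # one row per classifier, one column per metric
print(table)
```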

Finally, here's a quick bar chart to compare the classifiers' performance:
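A quick bar-chart sketch (made-up values; the Agg backend is selected so the example also runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # drop this line to display the chart interactively
import matplotlib.pyplot as plt
import pandas as pd

table = pd.DataFrame(
    {"Accuracy": [0.96, 0.94], "Precision": [0.97, 0.95]},
    index=["Logistic Regression", "Naive Bayes"])
ax = table.plot.bar(rot=15, ylim=(0, 1), title="Classifier comparison")
plt.tight_layout()
plt.savefig("classifier_comparison.png")
```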


Since we're only using the default model parameters, we won't know which classifier is better. We should optimize each algorithm's parameters first to know which one has the best performance.


Getting started with Classification


As the name suggests, classification is the task of “classifying things” into sub-categories. It is a form of supervised machine learning in which a model is trained on labeled data.

The article serves as a comprehensive guide to understanding and applying classification techniques, highlighting their significance and practical implications.

What is Supervised Machine Learning?

Supervised Machine Learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output Y = f(X) . The goal is to approximate the mapping function so well that when you have new input data (x) you can predict the output variables (Y) for that data.

Supervised learning problems can be further grouped into Regression and Classification problems.

  • Regression: Regression algorithms are used to predict a continuous numerical output. For example, a regression algorithm could be used to predict the price of a house based on its size, location, and other features.
  • Classification: Classification algorithms are used to predict a categorical output. For example, a classification algorithm could be used to predict whether an email is spam or not.

Machine Learning for classification

Classification is a process of categorizing data or objects into predefined classes or categories based on their features or attributes.

Machine Learning classification is a type of supervised learning technique where an algorithm is trained on a labeled dataset to predict the class or category of new, unseen data.

The main objective of classification machine learning is to build a model that can accurately assign a label or category to a new observation based on its features.

For example, a classification model might be trained on a dataset of images labeled as either dogs or cats and then used to predict the class of new, unseen images of dogs or cats based on their features such as color, texture, and shape.

Classification Types

There are two main classification types in machine learning:

Binary Classification

In binary classification, the goal is to classify the input into one of two classes or categories. Example – On the basis of the given health conditions of a person, we have to determine whether the person has a certain disease or not.

Multiclass Classification

In multi-class classification, the goal is to classify the input into one of several classes or categories. For example, on the basis of data about different species of flowers, we have to determine which species our observation belongs to.

Binary vs multi-class classification

Other categories of classification include:

Multi-Label Classification

In multi-label classification, the goal is to predict which of several labels a new data point belongs to. This is different from multiclass classification, where each data point can only belong to one class. For example, a multi-label classification algorithm could be used to classify images of animals as belonging to one or more of the categories cat, dog, bird, or fish.

Imbalanced Classification

In imbalanced classification, the goal is to predict whether a new data point belongs to a minority class, even though there are many more examples of the majority class. For example, a medical diagnosis algorithm could be used to predict whether a patient has a rare disease, even though there are many more patients with common diseases.

Classification Algorithms

There are various types of classification algorithms. Some of them are:

Linear Classifiers

Linear models create a linear decision boundary between classes. They are simple and computationally efficient. Some of the linear classification models are as follows: 

  • Logistic Regression
  • Support Vector Machines having kernel = ‘linear’
  • Single-layer Perceptron
  • Stochastic Gradient Descent (SGD) Classifier
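All four of these linear models expose the same fit/score interface in scikit-learn. A minimal sketch on synthetic data (the dataset here is illustrative, not from the article):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.svm import SVC

# Synthetic two-class data for illustration
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

linear_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (kernel='linear')": SVC(kernel="linear"),
    "Perceptron": Perceptron(random_state=42),
    "SGD Classifier": SGDClassifier(random_state=42),
}

for name, model in linear_models.items():
    model.fit(X, y)                  # each learns a linear decision boundary
    print(name, round(model.score(X, y), 3))
```

Each model draws a (hyper)plane through feature space; they differ mainly in the loss function used to position it.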

Non-linear Classifiers

Non-linear models create a non-linear decision boundary between classes. They can capture more complex relationships between the input features and the target variable. Some of the non-linear classification models are as follows: 

  • K-Nearest Neighbours
  • Naive Bayes
  • Decision Tree Classification
  • Ensemble learning classifiers: Random Forests, Bagging Classifier, Voting Classifier, ExtraTrees Classifier
  • Multi-layer Artificial Neural Networks

Learners in Classification Algorithms

In machine learning, classification learners can also be classified as either “lazy” or “eager” learners.

  • Lazy Learners: Also known as instance-based learners, lazy learners do not learn a model during the training phase. Instead, they simply store the training data and use it to classify new instances at prediction time. Training is therefore very fast, but prediction can be slow, and these methods are less effective in high-dimensional spaces or when the number of training instances is large. Examples of lazy learners include k-nearest neighbors and case-based reasoning.
  • Eager Learners: Also known as model-based learners, eager learners learn a model from the training data during the training phase and use this model to classify new instances at prediction time. They are more effective in high-dimensional spaces with large training datasets. Examples of eager learners include decision trees, random forests, and support vector machines.

Evaluating Classification Models in Machine Learning

Evaluating a classification model is an important step in machine learning, as it helps to assess the performance and generalization ability of the model on new, unseen data. There are several metrics and techniques that can be used to evaluate a classification model, depending on the specific problem and requirements. Here are some commonly used evaluation metrics:

  • Classification Accuracy: The proportion of correctly classified instances over the total number of instances in the test set. It is a simple and intuitive metric but can be misleading in imbalanced datasets where the majority class dominates the accuracy score.
  • Confusion matrix : A table that shows the number of true positives, true negatives, false positives, and false negatives for each class, which can be used to calculate various evaluation metrics.
  • Precision and Recall: Precision measures the proportion of true positives over the total number of predicted positives, while recall measures the proportion of true positives over the total number of actual positives. These metrics are useful in scenarios where one class is more important than the other, or when there is a trade-off between false positives and false negatives.
  • F1-Score: The harmonic mean of precision and recall, calculated as 2 x (precision x recall) / (precision + recall). It is a useful metric for imbalanced datasets where both precision and recall are important.
  • ROC curve and AUC: The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate (recall) against the false positive rate (1-specificity) for different threshold values of the classifier’s decision function. The Area Under the Curve (AUC) measures the overall performance of the classifier, with values ranging from 0.5 (random guessing) to 1 (perfect classification).
  • Cross-validation : A technique that divides the data into multiple folds and trains the model on each fold while testing on the others, to obtain a more robust estimate of the model’s performance.
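Most of these metrics are available directly in sklearn.metrics. A small worked example on a hypothetical set of binary predictions (3 true positives, 3 true negatives, 1 false positive, 1 false negative):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

# Toy ground truth and predictions for a binary problem
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc  = accuracy_score(y_true, y_pred)
cm   = confusion_matrix(y_true, y_pred)   # rows: actual class, columns: predicted class
prec = precision_score(y_true, y_pred)    # TP / (TP + FP)
rec  = recall_score(y_true, y_pred)       # TP / (TP + FN)
f1   = f1_score(y_true, y_pred)           # 2 * prec * rec / (prec + rec)
```

With TP = 3, TN = 3, FP = 1, FN = 1, accuracy, precision, recall, and F1 all come out to 0.75 here.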

It is important to choose the appropriate evaluation metric(s) based on the specific problem and requirements, and to avoid overfitting by evaluating the model on independent test data.

Characteristics of Classification

Here are the characteristics of the classification:

  • Categorical Target Variable: Classification deals with predicting categorical target variables that represent discrete classes or labels. Examples include classifying emails as spam or not spam, predicting whether a patient has a high risk of heart disease, or identifying image objects.
  • Accuracy and Error Rates: Classification models are evaluated based on their ability to correctly classify data points. Common metrics include accuracy, precision, recall, and F1-score.
  • Model Complexity: Classification models range from simple linear classifiers to more complex nonlinear models. The choice of model complexity depends on the complexity of the relationship between the input features and the target variable.
  • Overfitting and Underfitting: Classification models are susceptible to overfitting and underfitting. Overfitting occurs when the model learns the training data too well and fails to generalize to new data.

How does Classification Machine Learning Work?

The basic idea behind classification is to train a model on a labeled dataset, where the input data is associated with their corresponding output labels, to learn the patterns and relationships between the input data and output labels. Once the model is trained, it can be used to predict the output labels for new unseen data.


Classification Machine Learning

The classification process typically involves the following steps:

Understanding the problem

Before getting started with classification, it is important to understand the problem you are trying to solve. What are the class labels you are trying to predict? What is the relationship between the input data and the class labels?

Suppose we have to predict whether a patient has a certain disease or not, on the basis of 7 independent variables, called features. This means there can be only two possible outcomes:

  • The patient has the disease, which means “True”.
  • The patient has no disease, which means “False”.

This is a binary classification problem.

Data preparation

Once you have a good understanding of the problem, the next step is to prepare your data. This includes collecting and preprocessing the data and splitting it into training, validation, and test sets. In this step, the data is cleaned, preprocessed, and transformed into a format that can be used by the classification algorithm.

  • X: the independent features, in the form of an N×M matrix, where N is the number of observations and M is the number of features.
  • y: an N-dimensional vector containing the class label for each of the N observations.
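In scikit-learn terms, the splitting step is typically done with train_test_split. A minimal sketch with hypothetical N=100, M=7 data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: N=100 observations with M=7 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))          # N x M feature matrix
y = rng.integers(0, 2, size=100)       # N-vector of class labels

# Hold out 20% as a test set (a validation set can be carved out the same way)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```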

Feature Extraction

The relevant features or attributes are extracted from the data that can be used to differentiate between the different classes.

Suppose our input X has 7 independent features, of which only 5 influence the label or target values, while the remaining 2 are negligibly correlated or uncorrelated with it; then we would use only those 5 features for model training.
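One common way to keep only the informative features is univariate selection, for example with scikit-learn's SelectKBest (the 7-feature, 5-informative setup below mirrors the hypothetical example above):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical data: 7 features, only 5 of which are informative
X, y = make_classification(n_samples=200, n_features=7, n_informative=5,
                           n_redundant=0, random_state=42)

# Keep the 5 features with the strongest univariate relationship to y
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
```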

Model Selection

There are many different models that can be used for classification, including logistic regression, decision trees, support vector machines (SVM), or neural networks . It is important to select a model that is appropriate for your problem, taking into account the size and complexity of your data, and the computational resources you have available.

Model Training

Once you have selected a model, the next step is to train it on your training data. This involves adjusting the parameters of the model to minimize the error between the predicted class labels and the actual class labels for the training data.

Model Evaluation

After training the model, it is important to evaluate its performance on a validation set. This gives a good idea of how well the model is likely to perform on new, unseen data.

Log Loss or Cross-Entropy Loss, Confusion Matrix,  Precision, Recall, and AUC-ROC curve are the quality metrics used for measuring the performance of the model.

Fine-tuning the model

If the model’s performance is not satisfactory, you can fine-tune it by adjusting the parameters, or trying a different model.

Deploying the model

Finally, once we are satisfied with the performance of the model, we can deploy it to make predictions on new data and use it for real-world problems.

Examples of Machine Learning Classification in Real Life

Classification algorithms are widely used in many real-world applications across various domains, including:

  • Email spam filtering
  • Credit risk assessment
  • Medical diagnosis
  • Image classification
  • Sentiment analysis
  • Fraud detection
  • Quality control
  • Recommendation systems

Implementation of Classification Model in Machine Learning

Let’s get hands-on experience with how classification works. We are going to study various classifiers and see a rather simple analytical comparison of their performance on a well-known, standard dataset, the Iris dataset.

Requirements for running the given script:

  • Scipy and Numpy
  • Scikit-learn  
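The original script is not reproduced here; the sketch below is an illustrative version of such a comparison using only scikit-learn (the classifier choices and split parameters are assumptions, not the article's exact setup):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "k-Nearest Neighbours": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                 # train on the labeled split
    scores[name] = clf.score(X_test, y_test)  # held-out accuracy
    print(f"{name}: {scores[name]:.3f}")
```

On Iris, all four classifiers typically score well above 0.9 on the held-out split.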

In conclusion, classification is a fundamental task in machine learning, involving the categorization of data into predefined classes or categories based on their features.

Frequently Asked Questions (FAQs)

What is a classification rule in machine learning?

A decision guideline in machine learning determining the class or category of input based on features.

What are classification algorithms?

Methods like decision trees, SVM, and k-NN categorizing data into predefined classes for predictions.

What is learning classification?

Acquiring knowledge to assign labels to input data, distinguishing classes in supervised machine learning.

What is difference between classification and clustering?

Classification: Predicts predefined classes. Clustering: Groups data based on inherent similarities without predefined classes.

What is the difference between classification and regression methods?

Classification: Assigns labels to data classes. Regression: Predicts continuous values for quantitative analysis.

Different Types of Classification Algorithms

  • By Rohit Garg
  • Last Updated on May 20, 2024

The purpose of this research is to put together the 7 most common types of classification algorithms, along with Python code: Logistic Regression, Naïve Bayes, Stochastic Gradient Descent, K-Nearest Neighbours, Decision Tree, Random Forest, and Support Vector Machine.

Structured Data Classification

Classification can be performed on structured or unstructured data. Classification is a technique where we categorize data into a given number of classes. The main goal of a classification problem is to identify the category/class into which new data will fall.

A few of the terminologies encountered in machine learning classification:

  • Classifier: An algorithm that maps the input data to a specific category.
  • Classification model: A classification model tries to draw some conclusion from the input values given for training. It will predict the class labels/categories for the new data.
  • Feature: A feature is an individual measurable property of a phenomenon being observed.
  • Binary Classification: Classification task with two possible outcomes. Eg: Gender classification (Male / Female)
  • Multi-class classification: Classification with more than two classes. In multi class classification each sample is assigned to one and only one target label. Eg: An animal can be cat or dog but not both at the same time
  • Multi-label classification: Classification task where each sample is mapped to a set of target labels (more than one class). Eg: A news article can be about sports, a person, and location at the same time.

The following are the steps involved in building a classification model:

  • Initialize the classifier to be used.
  • Train the classifier: all classifiers in scikit-learn use a fit(X, y) method to fit (train) the model on the given training data X and training labels y.
  • Predict the target: given an unlabeled observation X, predict(X) returns the predicted label y.
  • Evaluate the classifier model
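The four steps above map directly onto scikit-learn's estimator API; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier()          # 1. initialize the classifier
clf.fit(X_train, y_train)             # 2. train on labeled data
y_pred = clf.predict(X_test)          # 3. predict labels for unseen X
acc = accuracy_score(y_test, y_pred)  # 4. evaluate the classifier model
```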

Dataset Source and Contents

The dataset contains salaries. The following is a description of our dataset:

  • No. of Classes: 2 (‘>50K’ and ‘<=50K’)
  • No. of attributes (Columns): 7
  • No. of instances (Rows): 48,842

This data was extracted from the census bureau database found at:

Exploratory Data Analysis

Types of Classification Algorithms

Types of Classification Algorithms with Python

1. Logistic Regression

Definition: Logistic regression is a machine learning algorithm for classification. In this algorithm, the probabilities describing the possible outcomes of a single trial are modelled using a logistic function.

Advantages: Logistic regression is designed for this purpose (classification), and is most useful for understanding the influence of several independent variables on a single outcome variable.

Disadvantages: Works only when the predicted variable is binary, assumes all predictors are independent of each other and assumes data is free of missing values.
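The article's code screenshot was lost in extraction; as a stand-in, here is a minimal scikit-learn sketch on synthetic data (not the census dataset described above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression().fit(X, y)
# Per-class probabilities for a single trial, modelled via the logistic function
proba = clf.predict_proba(X[:1])
```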


2. Naïve Bayes

Definition: The Naive Bayes algorithm is based on Bayes’ theorem with the assumption of independence between every pair of features. Naive Bayes classifiers work well in many real-world situations such as document classification and spam filtering.

Advantages: This algorithm requires a small amount of training data to estimate the necessary parameters. Naive Bayes classifiers are extremely fast compared to more sophisticated methods.

Disadvantages: Naive Bayes is known to be a bad estimator of class probabilities.
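The original code screenshot did not survive extraction; a minimal Gaussian Naive Bayes sketch in scikit-learn on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# GaussianNB assumes each feature is conditionally independent and Gaussian-distributed
clf = GaussianNB().fit(X, y)
pred = clf.predict(X[:5])
```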


3. Stochastic Gradient Descent

Definition: Stochastic gradient descent is a simple and very efficient approach to fit linear models. It is particularly useful when the number of samples is very large. It supports different loss functions and penalties for classification.

Advantages: Efficiency and ease of implementation.

Disadvantages: Requires a number of hyper-parameters and it is sensitive to feature scaling.
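The original code screenshot is missing; a minimal scikit-learn sketch, standardizing features first since SGD is sensitive to feature scaling (synthetic data, not the article's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

# SGD is sensitive to feature scaling, so standardize inside a pipeline
clf = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
clf.fit(X, y)
acc = clf.score(X, y)
```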


4. K-Nearest Neighbours

Definition: Neighbours-based classification is a type of lazy learning, as it does not attempt to construct a general internal model but simply stores instances of the training data. Classification is computed from a simple majority vote of the k nearest neighbours of each point.

Advantages: This algorithm is simple to implement, robust to noisy training data, and effective if training data is large.

Disadvantages: Need to determine the value of K and the computation cost is high as it needs to compute the distance of each instance to all the training samples.
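The original code screenshot is missing; a minimal k-NN sketch in scikit-learn on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)

# k (n_neighbors) must be chosen; prediction is a majority vote among the k nearest points
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
acc = clf.score(X, y)
```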


5. Decision Tree

Definition: Given a data of attributes together with its classes, a decision tree produces a sequence of rules that can be used to classify the data.

Advantages: Decision Tree is simple to understand and visualise, requires little data preparation, and can handle both numerical and categorical data.

Disadvantages: Decision tree can create complex trees that do not generalise well, and decision trees can be unstable because small variations in the data might result in a completely different tree being generated.
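The original code screenshot is missing; a minimal decision-tree sketch in scikit-learn, also printing the learned rule sequence mentioned in the definition (synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Limiting max_depth is one way to avoid the overly complex trees noted above
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
rules = export_text(clf)   # the learned sequence of if/else rules
print(rules)
```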


6. Random Forest

Definition: A random forest classifier is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy of the model and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement.

Advantages: Reduction in over-fitting and random forest classifier is more accurate than decision trees in most cases.

Disadvantages: Slow real-time prediction, difficult to implement, and a complex algorithm.
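The original code screenshot is missing; a minimal random-forest sketch in scikit-learn on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

# 100 trees, each fit on a bootstrap sub-sample drawn with replacement;
# predictions are averaged across the trees
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
acc = clf.score(X, y)
```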


7. Support Vector Machine

Definition: Support vector machine is a representation of the training data as points in space separated into categories by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

Advantages: Effective in high dimensional spaces and uses a subset of training points in the decision function so it is also memory efficient.

Disadvantages: The algorithm does not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
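The original code screenshot is missing; a minimal SVM sketch in scikit-learn on synthetic data, inspecting the support vectors that define the decision boundary:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

clf = SVC().fit(X, y)
# Only a subset of training points (the support vectors) define the boundary,
# which is why SVMs are memory efficient
n_sv = clf.support_vectors_.shape[0]
```

Passing probability=True to SVC enables predict_proba, at the cost of the internal five-fold cross-validation mentioned above.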


Comparison Matrix

  • Accuracy is the ratio of correctly predicted observations to total observations. It is the most intuitive performance measure.
  • True Positive: The number of correct predictions that the occurrence is positive
  • True Negative: The number of correct predictions that the occurrence is negative
  • F1-Score is the weighted average of Precision and Recall used in all types of classification algorithms. Therefore, this score takes both false positives and false negatives into account. F1-Score is usually more useful than accuracy, especially if you have an uneven class distribution.
  • Precision: When a positive value is predicted, how often is the prediction correct?
  • Recall: When the actual value is positive, how often is the prediction correct?

Code location:

Algorithm Selection


© Analytics India Magazine Pvt Ltd & AIM Media House LLC 2024


3. AUC-ROC curve:

  • ROC curve stands for Receiver Operating Characteristics Curve and AUC stands for Area Under the Curve .
  • It is a graph that shows the performance of the classification model at different thresholds.
  • To visualize the performance of the multi-class classification model, we use the AUC-ROC Curve.
  • The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis and FPR (False Positive Rate) on the X-axis.
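These points can be computed directly with scikit-learn's roc_curve and roc_auc_score; a minimal sketch on synthetic data (the classifier here is an illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]      # positive-class probabilities

fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_test, scores)               # 0.5 = random guessing, 1.0 = perfect
```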

Use cases of Classification Algorithms

Classification algorithms can be used in different places. Below are some popular use cases of Classification Algorithms:

  • Email Spam Detection
  • Speech Recognition
  • Identification of cancer tumor cells
  • Drugs Classification
  • Biometric Identification, etc.



Employee turnover in multinational corporations: a supervised machine learning approach

  • Original Paper
  • Open access
  • Published: 21 May 2024


  • Valerio Veglio 1, 4,
  • Rubina Romanello 2 &
  • Torben Pedersen 3

This research explores the potential of supervised machine learning techniques in transforming raw data into strategic knowledge in the context of human resource management. By analyzing a database with over 205 variables and 2,932 observations related to a telco multinational corporation, this study tests the predictive and analytical power of classification decision trees in detecting the determinants of voluntary employee turnover. The results show the determinants of groups of employees who may voluntarily leave the company, highlighting the level of analytical depth of the classification tree. This study contributes to the field of human resource management by highlighting the strategic value of the classification decision tree in identifying the characteristics of groups of employees with a high propensity to voluntarily leave the firm. As a practical implication, our study provides an approach that any organization can use to self-assess its own turnover risk and develop tailored retention practices.


1 Introduction

Data and analytics have captured the attention of human resource management (HRM) scholars, as multinational corporations (MNCs) increasingly have at their disposal large volumes of data and techniques for analyzing large amounts of data that could be used to support decision making related to complex problems, such as task organization, employee turnover, career development, and training design. Extrapolating strategic knowledge from large datasets through supervised Machine Learning (ML) techniques is becoming one of the main challenges for decision-makers in MNCs. In the digital era, techniques based on machine learning algorithms play an increasingly important role in extrapolating strategic knowledge from raw data (Canhoto and Clear 2020 ). Since 1956, when John McCarthy coined the term “Artificial Intelligence” (AI), the interest in this topic has grown exponentially, in line with the increasing number of applications, permeating different disciplines and research areas. In this context, management studies have started to analyze the potential applications of ML, which is a subset of AI and represents a way to achieve AI through the development of algorithms capable of improving themselves with experience (Garg et al. 2022 ). Supervised ML techniques can analyze large volume of data from different sources to discover hidden patterns with high strategic value for organizations (Pereira et al. 2018 ), providing insights for prediction, classification, and decision-making purposes (Cui et al. 2006 ; Naeem et al. 2024 ). These techniques have some specific features, such as scalability, because they can handle and process large amounts of data; interactivity, because they can learn new variables from new data; and dynamism, because they can periodically reassess and reevaluate hypotheses by taking into account incoming data, even without human interaction (Garg et al. 2022 ). 
With these capabilities, managers could make rapid and contextualized decisions based on data-driven evidence (Gupta et al. 2018 ; Wirges and Neyer 2023 ). For instance, supervised machine learning techniques are useful in marketing to estimate the probability of customer churn (Archaux et al. 2004 ; Gordini and Veglio 2017 ; Hung et al. 2006 ; Rosset et al. 2003 ; Wei and Chiu 2002 ), in social and economic analyses (Blazquez and Domenech 2018 ), in smart city applications (Iqbal et al. 2020 ), in finance to predict customer credit risk (Kruppa et al. 2013 ), and in HRM in the field of recruitment, performance management, and team dynamics (Garg et al. 2022 ; Koechling et al. 2023 ). As has happened in marketing and finance over the past decade, the application of ML techniques in HRM is growing rapidly (Garg et al. 2022 ; Yang et al. 2023 ), although only a limited number of studies have aimed at predicting employee turnover (Rombaut and Guerry 2018 ; Saradhi and Palshikar 2011 ). Voluntary employee turnover refers to why people leave an organization (Lee et al. 2017 ) and is considered a serious issue for MNCs (Saradhi and Palshikar 2011 ). High levels of employee turnover generate unexpected costs in terms of hiring, training, search, selection, and replacement (Mobley 1982 ; Price 1977 ; Staw 1980 ). In fact, the cost of hiring new employees is substantially higher than retaining the existing employees, negatively affecting firm performance and competitiveness (Holtom et al. 2005 ; Mitchell et al. 2001 ).

Despite the relevance of this topic, extant research has barely explored the potential, and has not yet investigated the real added value, of applying supervised ML techniques to identify the determinants of voluntary employee turnover (Yang et al. 2023). Particularly in the context of employee turnover, the few existing works employing supervised ML techniques have so far been published only in conference proceedings (Garg et al. 2022).

Responding to the call for new methodological approaches (Hom et al. 2017 ), this study tests the analytical and predictive power of a classification decision tree based on the CHAID (Chi-square Automatic Interaction Detector) algorithm to identify the characteristics of employees who voluntarily leave the company. By analyzing a large dataset of employees in the context of a telco MNC, we apply a CHAID classification decision tree to a sample of 2,932 employees working in the firm’s headquarters in Norway and in a subsidiary located in Denmark. This approach makes it possible to exploit the potential of this technique to identify in advance those groups of employees who have a higher likelihood of leaving the firm and, ultimately, to implement retention practices targeted at these groups.

This study contributes to the turnover literature from a methodological perspective by encouraging researchers to measure and/or analyze employee turnover in nontraditional ways, using supervised ML models to test the validity of well-known historical explanatory constructs in this area of research (Hom et al. 2017). In addition, we contribute to the HRM literature by demonstrating the potential of the classification decision tree as a method for solving complex problems in HRM (Garg et al. 2022), which can be used to complement previous studies and further advance the literature on this topic. Our results suggest the complementary, rather than substitutive, role of supervised ML in assessing the risk of employee turnover. From a practical perspective, our study demonstrates the ability of these techniques to reveal hidden relationships between data, allowing decision-makers and scholars to identify new, previously unknown relationships and evidence. Using this technique, companies could self-assess their own employee turnover risk levels and identify high-risk employees, allowing them to develop timely and effective retention strategies tailored to their needs. In this sense, supervised ML techniques become an additional toolkit that supports, but does not replace, human resource management decisions.

2 Theoretical background

2.1 A brief introduction to employee turnover research

Voluntary employee turnover – employees’ unilateral, unwanted, and often surprising termination of their employment contract – is a phenomenon of practical relevance for all organizations for a variety of reasons (e.g. Lee et al. 2008 ), as a negative economic impact is generally expected, mainly due to the additional recruitment costs or tacit knowledge drains (Glebbeck and Bax 2004 ; Holtom et al. 2005 ; Mitchell et al. 2001 ; Reiche 2008 ).

Reflecting its importance, research on employee turnover can be traced back to 1920 (Hom et al. 2017 ; Lee et al. 2017 ). Since its inception, the literature has grown continually, producing an impressive number of theories and models over the subsequent hundred years, mainly aimed at explaining the motivations, antecedents, processes, and consequences of this organizational phenomenon (Hom et al. 2017 ; Lee et al. 2017 ; Rubenstein et al. 2018 ). Employee turnover is related to a wide range of factors. Researchers agree on the importance of job satisfaction (or dissatisfaction) and of individual perceptions of the desirability and ease of moving to another job, based on the assumption that employees who are satisfied with their job and have no other job options are more likely to stay in the organization (Griffeth et al. 2000 ; Mobley 1977 ). The seminal work of March and Simon ( 1958 ) triggered a stream of studies attempting to explain voluntary employee turnover by focusing on why people leave (Barrick and Zimmerman 2005 ). This initial approach was followed by attempts to identify the primary antecedents of employee turnover (e.g. Lee and Mitchell 1994 ; O'Reilly III et al. 1991 ). With the aim of retaining employees, academics then attempted to build employee turnover models that were as accurate as possible in order to minimize voluntary employee turnover. Over the years, several models ensued, such as Mobley’s ( 1977 ) process model, which explains how dissatisfaction leads employees to leave their jobs (Lee et al. 2017 ). Other scholars focused more on content than process, identifying a variety of determinants of turnover, including factors related to the workplace, labor market, community, and occupational aspects (Hom et al. 2017 ; Price 1977 , 2001 ), emphasizing the importance of both individual and environmental attributes. 
As scholars have criticized extant turnover models for their lack of explanatory and predictive power, turnover research has included variables that are not necessarily related to the employee’s affective state and the decision to quit (Morrell et al. 2001 ). Subsequently, researchers studied the role of shocks and jarring events that drive employees to choose alternative career paths (Hom and Griffeth 1991 ), demonstrating that voluntary leave is not necessarily related to job dissatisfaction alone. The resulting models and analyses of voluntary employee turnover therefore included “shocks” (Lee and Mitchell 1994 ), while others focused on motives for leaving (Maertz and Campion 2004 ).

Another research stream has adopted the opposite perspective (Porter and Steers 1973 ), focusing on why people stay, proposing the “job embeddedness” construct that considers contextual factors both related to the workplaces and off-the-job aspects (Coetzer et al. 2019 ; Mitchell et al. 2001 ). Following this radical paradigm shift, scholars developed further conceptualizations, explanations, and empirical approaches (Lee et al. 2017 ). Along these lines, some researchers more recently proposed integrative frameworks that attempt to explain both why and how employees quit (Maertz and Campion 2004 ).

In relation to the IT context, which is particularly affected by turnover issues (Rode et al. 2007 ), Ghapanchi and Aurum ( 2011 ) developed a systematic literature review of the antecedents of IT employee turnover, showing that the determinants can generally be grouped into five main categories: individual, organizational, job-related, psychological, and environmental. This summarizing perspective is consistent with the more general picture that emerges from the broader turnover literature (Lee et al. 2017 ; Rubenstein et al. 2018 ). The first category includes individual attributes, motivational factors, and professional behavior constructs. For example, motivational factors may be related to mindset types such as self-positiveness or low core self-evaluation (Hom et al. 2012 ). Organizational factors relate to individual perceptions of the organization, such as remuneration, benefits, human resource practices, organizational culture (O'Reilly III et al. 1991 ; Rubenstein et al. 2018 ), and the centrality of the functional department in the intraorganizational network of the MNE (Castellacci et al. 2018 ). For instance, in terms of organizational culture, person-organization fit predicts employee job satisfaction, which in turn affects turnover (O'Reilly III et al. 1991 ). Organizational factors also include knowledge-sharing practices within the department and with colleagues outside the function (e.g., Cabrera and Cabrera 2005 ; Dasí et al. 2017 ; Garg et al. 2022 ) and knowledge flows across firm boundaries (Gupta and Govindarajan 2000 ). Job-related antecedents instead concern the characteristics, support, difficulties, and attractiveness of jobs, whereas psychological factors include individuals’ satisfaction with their jobs, career prospects, and organizational aspects (e.g. commitment). For instance, job design can influence job characteristics such as competence, autonomy, or task identity, which can motivate knowledge sharing among employees (e.g., Foss et al. 2009 ; Ryan and Deci 2000 ), aspects that can also influence employee well-being and turnover. Finally, environmental factors include aspects external to the workplace related to job alternatives, family support, work-family balance, etc. However, Ghapanchi and Aurum ( 2011 ) underline a prevalence of antecedents at the job-related, organizational, and psychological levels, whereas fewer antecedents pertain to the remaining categories. Similar to other widely and long-studied management and organizational phenomena, the study of employee turnover has developed into distinct streams of research. This system-immanent development of research has led to ever more complexity and an abundance of highly specialized empirical findings rather than convergence and consolidation. This can also be concluded from the number and increasing specialization of literature reviews and meta-analyses (briefly described in Table 4 in the appendix).

A closer look at these reviews reveals that research on voluntary employee turnover has not reached maturity and consolidation, in the sense that (i) the results converge, (ii) the most important factors and cause-effect relationships are clearly known, and (iii) only nuances are debated. Quite the opposite: although research on voluntary employee turnover has developed and propagated advanced models and theories, the findings remain inconsistent and at times even conflicting (Hom et al. 2017 ). Partly fueled by the paradigm prevailing in the social sciences and by peer-reviewed journals documenting the novelty of research by virtue of previously unstudied or understudied determinants, the number of variables studied has increased significantly in recent decades. It seems that research on voluntary employee turnover – like many other organizational phenomena – is searching for the famous needle in the haystack or, as renowned scholars in the field put it, is in “Search of the Holy Grail” (Holtom et al. 2008 ). The search for factors that allow forecasting turnover has predominantly adopted regression models (e.g. OLS, logit, and logistic), structural equation modelling (SEM), and other traditional statistical techniques, such as cluster analysis and dimension reduction (Garg et al. 2022 ; Lee et al. 2017 ).

As the numerous and increasingly specialized reviews, and especially the meta-analyses, demonstrate, voluntary employee turnover is a mature and increasingly differentiated field of management research with high scientific and practical relevance. Based on these reviews, different approaches can be adopted for the theoretical and empirical development of the field. On the one hand, replication studies could be conducted for known but possibly inconsistent correlations, or new independent variables, moderators, or mediators could be tested. We refer to this as the ‘more of the same’ approach. On the other hand, the extensive datasets obtained from the increasing digitalization of HRM and from web-based surveys (Holtom et al. 2008 , p. 259) could be used to identify previously unrecognized influencing factors, effect relationships, and patterns. More recent calls in the field of employee turnover also emphasize the need to apply new and innovative methodological approaches, such as machine learning techniques, to predict employee turnover (Choudhury et al. 2021 ; Garg et al. 2022 ; Lee et al. 2017 ; Rombaut and Guerry 2018 ; Yang et al. 2023 ).

2.2 Applications of ML techniques in the HRM context

Although HRM is a somewhat unexplored area with regard to big data analytics (BDA) and supervised ML applications (e.g. Ekawati 2019 ; Sheng et al. 2017 ), interest has grown significantly as a consequence of the ongoing digitalization of firms (Raguseo and Vitari 2018 ; Rombaut and Guerry 2018 ; Saradhi and Palshikar 2011 ; Sexton et al. 2005 ; Shah et al. 2017 ). The comparatively few studies that apply data mining techniques in the HRM field focus on employee selection (Aiolli et al. 2009 ), employee competences (Zhu et al. 2005 ), career planning (Lockamy and Service 2011 ), predicting employee performance and evaluation (Zhao 2008 ), candidates’ preliminary evaluation and training success (Aviad and Roy 2011 ), and employee turnover (Quinn et al. 2002 ; Saradhi and Palshikar 2011 ; Sexton et al. 2005 ). Besides studies adopting other advanced statistical techniques, such as regressions, SEM models, and Bayesian Model Averaging (BMA) (e.g., Coetzer et al. 2019 ; Nandialath et al. 2018 ; Sandhya and Sulphey 2021 ), the handful of studies that have applied supervised ML to employee turnover evaluate or compare neural network solutions (Quinn et al. 2002 ; Sexton et al. 2005 ), random forests, support vector machines and naïve Bayes (Saradhi and Palshikar 2011 ), and classification decision trees (Choudhury et al. 2021 ; Rombaut and Guerry 2018 ; Saradhi and Palshikar 2011 ). Table 1 summarizes the main characteristics and contributions of these studies.

We next summarize the rather diverse and contradictory assessments of these techniques, with a particular focus on their predictive power. While Sexton and colleagues ( 2005 ) emphasize the accuracy of neural network techniques in predicting and solving classification business problems, Quinn et al. ( 2002 ) find that they perform worse than logistic regression. Saradhi and Palshikar ( 2011 ) compare naïve Bayes, support vector machines, decision trees, and random forests, highlighting the superiority of support vector machines. Rombaut and Guerry ( 2018 ) point out the superiority of decision tree techniques compared to logistic regression. Choudhury and colleagues ( 2021 ) partially confirm this finding in their recent comparison of decision trees, random forests, and neural networks, documenting that classification decision trees have high predictive and analytical power in identifying employee turnover probability. However, their study does not address the characteristics of employees highly inclined to voluntarily leave, being limited instead to evaluating the statistical performance of different ML techniques. Although these first attempts provide initial evidence of the potential applications of these techniques in the field of HRM, there is still a lack of research that applies ML techniques to identify the characteristics of employees who are more likely to leave. Our study contributes to filling this gap by identifying the determinants of voluntary employee turnover using a classification decision tree.

3 Research method

Through an illustrative example on data from a telco MNC, this study investigates the root causes of voluntary employee turnover through the CHAID classification decision tree. IBM SPSS Statistics (v.27) was used to run the CHAID classification decision tree.

3.1 Data collection, sample, and measures

We used a database derived from an online survey submitted to the employees of a leading telco MNC in Northern European countries and Asia. The company has a strong position in mobile, broadband, and TV services with 180 million global customers worldwide and annual revenues of approximately USD 12 billion. It is headquartered in Norway with more than 12 subsidiaries in Europe and Asia.

The company’s HRM department provided the dataset. The data was collected in 2016 through an email-based survey sent via the firm’s internal system to all 7,786 employees working in the headquarters and Nordic subsidiaries. Prior to the survey, an invitation letter signed by the CEO was sent out, emphasizing the importance of and reasons for the survey, as well as the fact that there was no obligation to respond. Employees were clearly informed of the mechanism in place to protect their privacy. Employee email addresses were retrieved from the central HRM system. Then, when employees returned the questionnaire, the research department temporarily used their email addresses to retrieve some general (e.g. demographic) information from the HRM system and to link responses to previous surveys. The research department encrypted employee email addresses before any further use of the data, ensuring the anonymity of responses.

The survey was sent out via the head office and each subsidiary’s local intranet. After three weeks, the average response rate was around 56%: 66% in Norway and 48% in Denmark, which is considered acceptable for this type of analysis. The decision to focus on both the headquarters and a subsidiary was prompted by a discussion with the CEO, who stated that voluntary employee turnover was a serious problem in Norway and Denmark, thus confirming the relevance of the setting for our research. Table 2 shows the percentage of the company’s voluntary employee turnover for Norway and Denmark in 2016.

The final database was constructed from a few different datasets. The HR department merged the survey data with internal data on voluntary employee turnover after two years, resulting in a dataset of 2,932 usable responses (834 from Denmark and 2,098 from Norway) and 209 variables. We removed 95 variables from the database because they were unrelated to voluntary employee turnover, and excluded 9 responses due to missing values. We chose to exclude these answers rather than replace the missing values in order to minimize the risk of bias in our results.
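The data-preparation steps described above (removing unrelated variables and excluding responses with missing values rather than imputing) can be sketched in Python with pandas. The dataset below is a tiny synthetic stand-in, since the original database is proprietary; all column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the proprietary survey export
# (the real data had 209 variables and several thousand responses).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "turnover": rng.integers(0, 2, 20),                   # 1 = voluntarily left
    "job_variety": rng.integers(1, 8, 20).astype(float),  # 7-point Likert item
    "job_freedom": rng.integers(1, 8, 20).astype(float),
    "internal_id": np.arange(20),                         # unrelated to turnover
})
df.loc[3, "job_variety"] = np.nan                         # simulate a missing answer

# Step 1: drop variables judged unrelated to voluntary turnover.
df = df.drop(columns=["internal_id"])

# Step 2: exclude responses with missing values rather than imputing,
# to minimize the risk of biased results.
df = df.dropna()
```

Excluding the few incomplete responses, as the study does, avoids the modelling assumptions that imputation would introduce.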

The dependent variable is dichotomous, taking the value 1 if voluntary employee turnover occurred and 0 otherwise. The 113 independent variables are nominal (dichotomous), categorical, and single-item 7-point Likert scales drawn from previous literature. Table 5 in Appendix A shows the variables included in the analysis and their coding. The independent variables pertain to three main categories of determinants: individual attributes, job-related determinants, and organizational determinants, in line with the classification of Ghapanchi and Aurum ( 2011 ).

We employed several procedural remedies to reduce common method bias, including guaranteeing anonymity to respondents, emphasizing the importance of and reasons for the survey, collecting data from different sources and at different points in time, using different datasets to build the final database, and including questions with different formats to reduce the risk of response-set bias (Podsakoff et al. 2012 ). Moreover, data on the dependent variable were collected from an internal database two years after the survey measuring the independent variables; therefore, common method bias is not a relevant concern for the analysis (Kock et al. 2021 ).

3.2 Research methodology

We applied a supervised ML technique, the classification decision tree, based on the CHAID algorithm, to identify the determinants that characterize employees with a high probability of voluntarily leaving the firm. This technique is particularly suitable for discovering patterns of meaningful relationships (both linear and nonlinear) and rules in large databases (Jain et al. 2016 ). It is especially efficient with dichotomous, nominal, and scale-ordinal variables (Ture et al. 2009 ).

This technique has numerous advantages: 1) it is simple to understand and interpret, 2) it requires little data preparation, 3) it can handle both numerical and categorical data, 4) it uses a white box model, 5) it has high explanatory power, 6) it performs well with large data in a short time (Alao and Adeyemo 2013 ; Choudhury et al. 2021 ; Perner et al. 2001 ; Rombaut and Guerry 2018 ), 7) it is more “fair”, showing an improved ability to make unbiased decisions (Garg et al. 2022 ), 8) it provides clear information about the importance of significant factors for prediction or classification (Tso and Yau 2007 ), 9) multicollinearity is not a problem, as attempts to eliminate it have resulted in poor classification performance (Piramuthu 2008 ), thus eliminating the need to apply dimensionality reduction techniques such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) (Reddy et al. 2020 ), and 10) it can be applied to a large sample for data reduction purposes. The technique can also handle missing values, provides stopping rules that account for statistical significance at the 1%, 5%, or 10% level, does not assume a priori any distribution for the independent variables, and bases its probabilistic estimation on the chi-square test (Díaz-Pérez and Bethencourt-Cejas 2016 ; Kass 1980 ; Nisbet et al. 2018 ). Despite these advantages, the classification decision tree is prone to overfitting (Giudici 2010 ), which can lead to biased results even with large databases. However, various remedies, such as cross-validation, can successfully address overfitting concerns. This method divides the sample into several subsamples (typically 10). Tree models are then generated, each excluding the data from one subsample. For each tree, the risk of misclassification is estimated by applying the tree to the subsample excluded during its generation. 
This approach produces a single, final tree model for which the cross-validated risk estimate is calculated as the average of the risks across all trees (Blockeel and Struyf 2002 ).
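The cross-validation procedure just described can be illustrated with scikit-learn. Note that scikit-learn does not implement CHAID, so a CART-based `DecisionTreeClassifier` serves as a stand-in here, and the data is entirely synthetic.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 500 employees, 5 Likert-style predictors.
rng = np.random.default_rng(42)
X = rng.integers(1, 8, size=(500, 5)).astype(float)
y = (X[:, 0] + rng.normal(0, 2, 500) > 6).astype(int)  # 1 = left voluntarily

tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# 10-fold cross-validation: each tree is fit on 9 folds and scored on
# the held-out fold; the cross-validated misclassification risk is one
# minus the mean held-out accuracy across the 10 trees.
scores = cross_val_score(tree, X, y, cv=10, scoring="accuracy")
cv_risk = 1 - scores.mean()
print(f"cross-validated misclassification risk: {cv_risk:.3f}")
```

Averaging the held-out error over all folds, as above, gives the single risk estimate attributed to the final tree in the procedure described by Blockeel and Struyf (2002).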

The CHAID classification decision tree systematically breaks down data to classify patterns found in the dataset and make rule-based predictions (Berry and Linoff 2000 ). It can be seen as a recursive procedure in which a set of n observations is progressively partitioned into groups according to a division rule – based on the \(p\) value derived from multiple chi-square tests – aimed at maximizing a measure of homogeneity or purity of the dependent variable in each of the obtained groups (Giudici 2010 ). In a further step, the best predictor of the dependent variable is identified: a chi-square statistic is calculated for each contingency table derived from the intersection of the dependent variable with each individual predictor (Cerchiello and Giudici 2012 ). The first split falls on the predictor with the highest chi-square value and the lowest \(p\) value. The tree continues to branch into child nodes until it reaches the terminal node of each branch (Jain et al. 2016 ). Each terminal node identifies subgroups defined by different sets of predictors (Tan et al. 2006 ). The procedure stops when the chosen stopping rule is satisfied (Cerchiello and Giudici 2016 ). To obtain a final partition of the observations, it is necessary to specify stopping criteria for the division process. The criteria developed for selecting the best partition are often based on the degree of impurity of the child nodes (Tan et al. 2005 ). The concept of impurity refers to a measure of the variability of the response values of the observations (Giudici 2010 ). The lower the degree of impurity, the more skewed the class distribution. 
Specifically, in a regression tree, a node is pure if it has zero variance (all observations are equal) and impure if the variance of the observations is high, while for classification decision trees, alternative measures such as misclassification impurity, Gini impurity, and entropy impurity should be considered (Tan et al. 2005 ). In this case, the misclassification impurity (the distance between the observed and expected frequencies) was applied to obtain the final partition of the tree. The expected frequencies are calculated under the hypothesis of homogeneity for the observations in the node considered, using the chi-square index (Giudici 2010 ). Assuming that a final partition consisting of \(g\) groups ( \(g<n\) ) has been reached, then for any given response variable observation \({y}_{i}\), a CHAID classification decision tree will produce the fitted value \({\widehat{y}}_{i}\) or the fitted probabilities of belonging to a single group, assuming only two classes (binary classification). The fitted probability of success is then given by the equation:

$${\widehat{y}}_{i}={\widehat{\pi }}_{m}=\frac{1}{{n}_{m}}\sum_{l=1}^{{n}_{m}}{y}_{lm}$$

where the observation \({y}_{lm}\) can take the value 0 or 1, \({n}_{m}\) is the number of observations in group \(m\), and the fitted probability corresponds to the observed proportion of successes in group \(m\), with \({\widehat{y}}_{i}\) constant for all observations in the group (Giudici and Figini 2009 ; Linoff and Berry 2011 ; Tan et al. 2005 ). Figure  1 shows the basic structure of classification decision trees.
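A minimal sketch, on synthetic data, of the two computations just described: choosing the split predictor via chi-square tests on predictor-by-outcome contingency tables (the predictor with the lowest p-value wins), and computing the fitted probability of success for a group as its observed proportion of leavers. All variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
n = 400
country = rng.choice(["Norway", "Denmark"], n)
gender = rng.choice(["F", "M"], n)
# The outcome depends on country but not on gender, so a CHAID-style
# selection should pick country for the first split.
p_leave = np.where(country == "Denmark", 0.35, 0.10)
left = (rng.random(n) < p_leave).astype(int)
df = pd.DataFrame({"left": left, "country": country, "gender": gender})

# Chi-square test of each candidate predictor against the outcome;
# the first split falls on the predictor with the lowest p-value.
p_values = {}
for col in ["country", "gender"]:
    table = pd.crosstab(df[col], df["left"])
    chi2, p, dof, expected = chi2_contingency(table)
    p_values[col] = p
best = min(p_values, key=p_values.get)

# Fitted probability of success in a group: the observed proportion of 1s.
p_hat_denmark = df.loc[df["country"] == "Denmark", "left"].mean()
```

A full CHAID implementation would additionally merge non-significantly different predictor categories and apply Bonferroni-adjusted p-values, which this sketch omits.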

Figure 1

Structure of the classification decision tree model. Source: Adapted from Li et al. ( 2022 )

The statistical performance of the classification decision tree was assessed through the area under the ROC curve (AUC) (Deng et al. 2016 ; Hanley and McNeil 1982 ; Pendharkar 2009 ; Tan et al. 2006 ) and the cross-validation test (Choudhury et al. 2021 ). An AUC value of 0.5 indicates that the test is not informative, values between 0.5 and 0.7 indicate an inaccurate test, and values between 0.7 and 0.9 indicate a moderately accurate test. Values between 0.9 and 1 indicate that the test is highly accurate, or perfect when equal to 1 (Swets 1988 ). However, AUC values in the range between 0.9 and 1 may indicate overfitting and analysis bias (Foucher and Danger 2012 ).
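The AUC assessment can be reproduced with scikit-learn's `roc_auc_score`; the labels and scores below are synthetic, and the interpretation bands simply encode the Swets (1988) thresholds quoted above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic turnover labels and predicted scores (illustrative only).
rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, 200)  # 1 = left voluntarily
y_score = np.clip(0.25 * y_true + rng.normal(0.3, 0.25, 200), 0.0, 1.0)

auc = roc_auc_score(y_true, y_score)

def interpret(auc):
    """AUC bands following Swets (1988), as summarized in the text."""
    if auc <= 0.5:
        return "not informative"
    if auc < 0.7:
        return "inaccurate"
    if auc < 0.9:
        return "moderate"
    return "highly accurate (check for overfitting)"

print(f"AUC = {auc:.3f} ({interpret(auc)})")
```

The study's reported AUC of 74.5% falls in the 0.7-0.9 band and would therefore be classified as moderate by this rule.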

4 Results

Figure  2 shows the predictive and analytical power of the CHAID classification decision tree. The tree starts with the root node (node 0), which indicates that 8.6% of permanent employees have recently left their job voluntarily. The classification tree then identifies three layers of predictors with different predictive power (from highest to lowest), which profile the groups of employees with a high propensity to voluntarily quit the firm.

Figure 2

CHAID classification decision tree

The first layer (node 1 and node 2) pinpoints the most powerful predictor that influences voluntary employee turnover. The country location of employees is the predictor with the greatest discriminatory power. Two different scenarios—in terms of employee turnover determinants—emerge from the analysis.

In Norway, the classification decision tree identifies three groups of employees at risk of voluntary turnover. The first group includes employees with very low variety in their jobs (e.g. in terms of tasks) who also consider social media (e.g. Facebook) an important tool for sharing work-related knowledge in the firm (20.7%). The second group includes employees with low-to-medium job variety who are highly inclined to share work-related knowledge with colleagues in order to get promoted (16.9%). The third group includes female employees with high job variety (12.1%). In Denmark, instead, the classification decision tree identifies six groups of employees at high risk of voluntarily leaving. The first three groups include, respectively, employees with very low, low, and medium job freedom in terms of deciding how to manage their work (33.3%, 13.7%, 34.1%). The fourth group identifies employees with high freedom in managing their jobs who also work in organizational contexts that consider job rotation an important tool for sharing work-related knowledge within the organization (28.3%). The fifth group includes employees with high job freedom working in organizational contexts with a medium propensity to consider job rotation an important activity for sharing work-related knowledge (16.4%). Finally, another important group includes employees with high job freedom working in contexts with a lower propensity to share work-related knowledge through job rotation (8.4%).

Table 3 summarizes the terminal nodes (or leaves) of the CHAID classification decision tree, highlighting the split value used in the development of the predictive classification and the p-value for each predictor. Predictors such as country, job variety, gender, and job freedom have high statistical relevance (p < 0.000) in predicting the determinants of voluntary employee turnover, while “knowledge-sharing with colleagues in order to get promoted”, “importance of knowledge-sharing through social media in the company”, and “importance of job rotation as an activity to share work-related knowledge within the organization” have medium statistical relevance (p < 0.001). The predictor related to knowledge-sharing through social media has low predictive value (p < 0.05).

The goodness of fit of the CHAID classification decision tree is moderate, as indicated by the AUC equal to 74.5%. The cross-validation test confirms the moderate accuracy of the model, excluding overfitting issues. Figure  3 provides a representation of the ROC curve.

Figure 3

ROC (receiver operating characteristic) curve

5 Discussion

The CHAID classification decision tree identified seven predictors of voluntary employee turnover. The country location of employees has the highest predictive power, suggesting that the organizational context plays a crucial role in understanding why permanent employees voluntarily leave the firm. This is consistent with previous research showing the importance of organizational variables in this decision (Rubenstein et al. 2018 ). The model then identified six further predictors, of which four refer to the Norwegian context (job variety, importance of sharing work-related knowledge through social media in the firm, propensity to share work-related knowledge to get promoted, and gender) and two to the Danish one (job freedom, importance of sharing work-related knowledge within the firm through job rotation). The analysis thus identifies predictors that are organizational, job-related, and related to individual attributes, in line with previous studies in high-tech contexts (Ghapanchi and Aurum 2011 ).

The model identifies groups of employees who are internally homogeneous, heterogeneous with respect to each other, and highly inclined to leave voluntarily. This approach allows the profiling of the employees who are more likely to leave voluntarily, highlighting the concurrent effect of predictors for specific groups of employees. For example, our analysis led to the identification of groups of employees who voluntarily leave the company, which we have labelled for ease of discussion. In the Norwegian context, job variety is the most important predictor, and the model identifies three groups of employees. The first group, which we labelled “bored workers”, includes employees who have relatively repetitive job tasks (Foss et al. 2009 ) and who want to use social media to share knowledge within the organization (Cabrera and Cabrera 2005 ). The second group, “ambitious workers”, includes employees with medium job variety (Foss et al. 2009 ) who consider it important to share knowledge in order to get promoted (Ryan and Deci 2000 ). The third group identifies “female workers” who have high job variety (Foss et al. 2009 ). An imposed high level of job variety may demand more effort and resources from employees who need to change tasks frequently. Further studies could explore these relationships and their motivations in more detail. In Denmark, instead, job freedom is the most significant predictor (Foss et al. 2009 ), identifying six groups of employees, which we have labelled as follows. “Rebels I”, “Rebels II”, and “Rebels III” are the employees who are at risk of voluntary turnover because they have limited job freedom (Foss et al. 2009 ). “Impatients I”, “Impatients II”, and “Impatients III” are the groups of employees who have a high degree of job freedom but work in an organizational context that views job rotation as a tool for sharing knowledge within the organization (Cabrera and Cabrera 2005 ). 
Employees who are relatively free in organizing their work tend to be reluctant to engage in job rotation. Job rotation may involve the reorganization of tasks and active confrontation with other employees and departments, thus threatening to restrict freedom in the manner and timing of individual work organization. This suggests that job rotation may have a constraining effect on employees who are free to organize their work. Future studies could explore the interactions between these variables.

In addition, the CHAID classification decision tree identifies the set or bundle of determinants that influence the decision to leave, enabling a predictive profiling of employees. Thus, unlike traditional statistical techniques (Hom et al. 2017 ; Garg et al. 2022 ), this method reveals which combinations of variables influence the decision to voluntarily leave the company by identifying specific groups of employees operating in the organizational context. This method highlights a multiplicity of linear and non-linear effects among different groups of employees, paving the way for new lines of research to explore these aspects.

Our study makes two important theoretical contributions to the field of HRM. First, from a theoretical perspective, our study shows both the application and the benefits of CHAID classification decision trees in selecting the determinants that characterize voluntary employee turnover and in profiling groups of employees at risk of voluntary turnover. As suggested by Hom et al. ( 2017 ), the application of supervised ML techniques could be used to advance the turnover literature. In this sense, the CHAID classification decision tree can be used to uncover linear and nonlinear relationships in large databases. Supervised ML techniques could be used to analyze in depth relationships identified in past research, complementing past evidence and, eventually, leading to the identification of new relationships, especially in the presence of large databases. Second, in line with the considerations of Garg et al. ( 2022 ), the classification decision tree emerges as an effective technique that can be used to solve complex management problems, such as employee turnover. However, this approach could also be combined with other ML tools to support decision making on complex HRM issues that require an understanding of socio-cultural phenomena (Garg et al. 2022 ; Yang et al. 2023 ). This suggestion could pave the way for the birth and growth of a research stream lying at the intersection of the HRM and ML literatures, which could also properly assess the benefits and risks arising from this interaction.

This article also provides relevant managerial implications. By using ML techniques, in particular classification decision trees, MNCs could develop effective ad hoc retention plans that precisely and accurately target the groups of employees who are more likely to leave voluntarily. For instance, in this case, our results suggest that a retention strategy that increases job variety and targets female employees in Norway would not be as effective for Danish employees, whose propensity to leave is more related to job freedom (Foss et al. 2009). This shows that MNCs could use this method to effectively self-assess their employee turnover risk, identify employees at risk, and create tailored retention strategies that address the needs of different employee groups. In doing so, they can improve their chances of reducing voluntary employee turnover rates (e.g., Reiche 2008). In fact, developing effective retention strategies allows for the retention of current employees, thereby avoiding additional costs due to turnover (Mobley 1982; Price 1977; Saradhi and Palshikar 2011; Staw 1980) and negative effects on overall organizational effectiveness and business success (Holtom et al. 2005; Mitchell et al. 2001).

6 Conclusion and limitations

This study shows the predictive and analytical power of ML techniques through the application of the CHAID classification decision tree to predict the determinants of voluntary employee turnover and to profile groups of employees at risk of voluntary turnover. This research goes beyond merely identifying the probability of employees at risk of voluntary turnover, as done in past studies (Choudhury et al. 2021). Indeed, it shows the predictive and analytical power of the CHAID classification decision tree, and more generally of supervised ML techniques, in analyzing large databases. In particular, we highlight the advantages of this technique in the context of HRM and seek to open new avenues for future applications related to classification problems in other management contexts, where supervised ML techniques could be successfully used to support and improve the quality of the decision-making process (Iqbal et al. 2020; Janssen et al. 2017). ML techniques, especially the CHAID classification decision tree, appear to be a realistic way for decision-makers to obtain strategic knowledge from raw data relevant to steering the firm. They could be used not only to solve simple management problems related to recruitment and performance management, but also complex ones such as forecasting employee turnover, eventually in combination with other ML techniques (Garg et al. 2022). However, while the mere implementation of supervised ML techniques to support the decision-making process is a good starting point, when not integrated with strategic thinking it risks producing biased results with a negative impact on the quality of business decisions (Choudhury et al. 2021). In fact, we would like to emphasize that supervised ML techniques should not be seen as a replacement for human resource reasoning.

Our research also has some limitations that highlight opportunities for future research. First, the choice of classification algorithm influences the predictive power of the resulting model. However, to date, no complete theory or conceptual guidelines are available to assist researchers in choosing or developing appropriate classification decision tree algorithms. Thus, more research is needed to assess the selection of these algorithms according to the specific type of classification problem. In addition, we recall the importance of acknowledging that classification decision trees are sensitive to noisy data and may not perform as well as neural networks on non-linear data (Curram and Mingers 1994). Future research should test this technique in different contexts and on different databases, and compare it with other techniques. In addition, future studies could test the application of supervised ML techniques (e.g., decision trees, support vector machines, neural networks) on panel data, which could provide a more robust estimation of employee turnover predictors and represent an effective tool for developing customized retention strategies able to considerably reduce employee turnover risk over time.

To conclude, we hope our study will generate interest and new stimuli in the application of supervised ML techniques, particularly, among management scholars, opening new lines of research not only limited to HRM but also in fields where the use of these techniques is still in its embryonic phase.

Aiolli F, De Filippo M, Sperduti A (2009) Application of the preference learning model to a human resources selection task. IEEE Symposium on Computational Intelligence and Data Mining, Nashville, pp 203–210.

Alao D, Adeyemo AB (2013) Analyzing employee attrition using decision tree algorithms. Comput Inform Syst Dev Inform Allied Res J 4(1):17–28

Archaux C, Martin A, Khenchaf A (2004) An SVM based churn detector in prepaid mobile telephony. International Conference on Information and Communication Technologies: From Theory to Applications, Damascus, pp 459–460.

Aviad B, Roy G (2011) Classification by clustering decision tree-like classifier based on adjusted clusters. Expert Syst Appl 38(7):8220–8228

Barrick MR, Zimmerman RD (2005) Reducing voluntary, avoidable turnover through selection. J Appl Psychol 90(1):159–166

Berry MA, Linoff GS (2000) Mastering data mining: the art and science of customer relationship management. Ind Manag Data Syst 100(5):245–246

Blazquez D, Domenech J (2018) Big data sources and methods for social and economic analyses. Technol Forecast Soc Chang 130:99–113

Blockeel H, Struyf J (2002) Efficient algorithms for decision tree cross-validation. J Mach Learn Res 3(12):621–650

Cabrera EF, Cabrera A (2005) Fostering knowledge sharing through people management practices. Int J Human Resour Manag 16(5):720–735

Canhoto AI, Clear F (2020) Artificial intelligence and machine learning as business tools: a framework for diagnosing value destruction potential. Bus Horiz 63(2):183–193

Castellacci F, Gulbrandsen M, Hildrum J, Martinkenaite I, Simensen E (2018) Functional centrality and innovation intensity: employee-level analysis of the Telenor group. Res Policy 47(9):1674–1687

Cerchiello P, Giudici P (2012) Non parametric statistical models for on-line text classification. Adv Data Anal Classif 6(4):277–288

Cerchiello P, Giudici P (2016) Big data analysis for financial risk management. Journal of Big Data 3(1):18–30

Choudhury P, Allen RT, Endres MG (2021) Machine learning for pattern discovery in management research. Strateg Manag J 42(1):30–57

Coetzer A, Inma C, Poisat P, Redmond J, Standing C (2019) Does job embeddedness predict turnover intentions in SMEs? Int J Product Perform Manag 68(2):340–361

Cui G, Wong ML, Lui HK (2006) Machine learning for direct marketing response models: Bayesian networks with evolutionary programming. Manage Sci 52(4):597–612

Curram SP, Mingers J (1994) Neural networks, decision tree induction and discriminant analysis: an empirical comparison. J Oper Res Soc 45(4):440–450

Dasí À, Pedersen T, Gooderham PN, Elter F, Hildrum J (2017) The effect of organizational separation on individuals’ knowledge sharing in MNCs. J World Bus 52(3):431–446

Deng X, Liu Q, Deng Y, Mahadevan S (2016) An improved method to construct basic probability assignment based on the confusion matrix for classification problem. Inf Sci 340:250–261

Díaz-Pérez FM, Bethencourt-Cejas M (2016) CHAID algorithm as an appropriate analytical method for tourism market segmentation. J Destin Mark Manag 5(3):275–282

Ekawati AD (2019) Predictive analytics in employee churn: a systematic literature review. J Manag Inform Decis Sci 22(4):387–397

Field JG, Bosco FA, Kepes S (2020) How robust is our cumulative knowledge on turnover? J Bus Psychol 36(3):349–365

Foss NJ, Minbaeva DB, Pedersen T, Reinholt M (2009) Encouraging knowledge sharing among employees: how job design matters. Hum Resour Manage 48(6):871–893

Foucher Y, Danger R (2012) Time dependent ROC curves for the estimation of true prognostic capacity of microarray data. Stat Appl Genet Mol Biol 11(6):871

Garg S, Sinha S, Kar AK, Mani M (2022) A review of machine learning applications in human resource management. Int J Product Perform Manag 71(5):1590–1610

Ghapanchi AH, Aurum A (2011) Antecedents to IT personnel's intentions to leave: a systematic literature review. J Syst Softw 84:238–249

Giudici P (2010) Scoring models for operational risk. In: Kenett RS, Raanan Y (eds) Operational risk management: a practical approach to intelligent data analysis. Wiley, pp 125–135

Giudici P, Figini S (2009) Applied data mining for business and industry. Wiley, Chichester

Glebbeck AC, Bax EH (2004) Is high employee turnover harmful? An empirical test using company records. Acad Manag J 47(2):277–286

Gordini N, Veglio V (2017) Customers churn prediction and marketing retention strategies. An application of support vector machines based on the AUC parameter-selection technique in B2B e-commerce industry. Ind Mark Manage 62:100–107

Griffeth RW, Hom PW, Gaertner S (2000) A meta-analysis of antecedents and correlates of employee turnover: update, moderator tests, and research implications for the millennium. J Manag 26(3):463–488

Gupta AK, Govindarajan V (2000) Knowledge flows within multinational corporations. Strateg Manag J 21(4):473–496

Gupta S, Kar AK, Baabdullah A, Al-Khowaiter WA (2018) Big data with cognitive computing: a review for the future. Int J Inf Manage 42:78–89

Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36

Holtom BC, Mitchell TR, Lee TW, Eberly MB (2008) Turnover and retention research. Acad Manag Ann 2(1):231–274

Holtom BC, Mitchell TR, Lee TW, Inderrieden EJ (2005) Shocks as causes of turnover: what they are and how organizations can manage them. Hum Resour Manage 44(3):337–352

Hom PW, Griffeth RW (1991) Structural equations modeling test of a turnover theory: cross-sectional and longitudinal analyses. J Appl Psychol 76(3):350–366

Hom PW, Griffeth RW (1995) Employee turnover. South-Western College Publishing. Cincinnati, OH

Hom PW, Lee TW, Shaw JD, Hausknecht JP (2017) One hundred years of employee turnover theory and research. J Appl Psychol 102(3):530–545

Hom PW, Mitchell TR, Lee TW, Griffeth RW (2012) Reviewing employee turnover: focusing on proximal withdrawal states and an expanded criterion. Psychol Bull 138(5):831–858

Hung SY, Yen DC, Wang HY (2006) Applying data mining to telecom churn management. Expert Syst Appl 31(3):515–524

Iqbal R, Doctor F, More B, Mahmud S, Yousuf U (2020) Big data analytics: computational intelligence techniques and application areas. Technol Forecast Soc Chang 153:119253

Jain RK, Natarajan R, Ghosh A (2016) Decision tree analysis for selection of factors in DEA: an application to banks in India. Glob Bus Rev 17(5):1162–1178

Jiang K, Liu D, McKay PF, Lee TW, Mitchell TR (2012) When and how is job embeddedness predictive of turnover? A meta-analytic investigation. J Appl Psychol 97(5):1077–1096

Janssen M, van der Voort H, Wahyudi A (2017) Factor influencing big data decision-making quality. J Bus Res 70:338–345

Kass GV (1980) An exploratory technique for investigating large quantities of categorical data. J Roy Stat Soc: Ser C (Appl Stat) 29(2):119–127

Kock F, Berbekova A, Assaf AG (2021) Understanding and managing the threat of common method bias: detection, prevention and control. Tour Manage 86:104330

Koechling A, Wehner MC, Warkocz J (2023) Can I show my skills? Affective responses to artificial intelligence in the recruitment process. RMS 17(6):2109–2138

Kruppa J, Schwarz A, Arminger G, Ziegler A (2013) Consumer credit risk: Individual probability estimates using machine learning. Expert Syst Appl 40(13):5125–5131

Lee TH, Gerhart B, Weller I, Trevor CO (2008) Understanding voluntary turnover: path-specific job satisfaction effects and the importance of unsolicited job offers. Acad Manag J 51(4):651–671

Lee TW, Hom PW, Eberly MB, Li J, Mitchell TR (2017) On the next decade of research in voluntary employee turnover. Acad Manag Perspect 31(3):201–221

Lee TW, Mitchell TR (1994) An alternative approach: the unfolding model of voluntary employee turnover. Acad Manag Rev 19(1):51–89

Li X, Yi S, Cundy AB, Chen W (2022) Sustainable decision-making for contaminated site risk management: a decision tree model using machine learning algorithms. J Clean Prod 371:133612

Linoff GS, Berry MJ (2011) Data mining techniques: for marketing, sales, and customer relationship management. Wiley

Lockamy A, Service RW (2011) Modeling managerial promotion decisions using Bayesian networks: an exploratory study. J Manag Dev 30(4):381–401

Maertz CP, Campion MA (2004) Profiles in quitting: Integrating content and process turnover theory. Acad Manag J 47(4):566–582

March J, Simon H (1958) Organizations. Wiley, New York

Mitchell TR, Holtom BC, Lee TW, Sablynski CJ, Erez M (2001) Why people stay: Using job embeddedness to predict voluntary turnover. Acad Manag J 44(6):1102–1121

Mobley WH (1977) Intermediate linkages in the relationship between job satisfaction and employee turnover. J Appl Psychol 62(2):237–240

Mobley WH (1982) Some unanswered questions in turnover and withdrawal research. Acad Manag Rev 7(1):111–116

Morrell K, Loan-Clarke J, Wilkinson A (2001) Unweaving leaving: the use of models in the management of employee turnover. Int J Manag Rev 3(3):219–244

Naeem R, Kohtamäki M, Parida V (2024) Artificial intelligence enabled product–service innovation: past achievements and future directions. Rev Manag Sci 44–1.

Nandialath AM, David E, Das D, Mohan R (2018) Modeling the determinants of turnover intentions: a Bayesian approach. Evid-based HRM: Glob Forum Empir Scholarsh 6(1):2–24

Nisbet R, Miner G, Yale K (2018) Handbook of statistical analysis and data mining applications, 2nd edn. Academic Press, Boston

O’Reilly CA III, Chatman J, Caldwell DF (1991) People and organizational culture: a profile comparison approach to assessing person-organization fit. Acad Manag J 34(3):487–516

Pendharkar PC (2009) Genetic algorithm based neural network approaches for predicting churn in cellular wireless network services. Expert Syst Appl 36(3):6714–6720

Pereira RB, Plastino A, Zadrozny B, Merschmann LH (2018) Categorizing feature selection methods for multi-label classification. Artif Intell Rev 49(1):57–78

Perner P, Zscherpel U, Jacobsen C (2001) A comparison between neural networks and decision trees based on data from industrial radiographic testing. Pattern Recogn Lett 22(1):47–54

Piramuthu S (2008) Input data for decision trees. Expert Syst Appl 34(2):1220–1226

Podsakoff PM, MacKenzie SB, Podsakoff NP (2012) Sources of method bias in social science research and recommendations on how to control it. Annu Rev Psychol 63:539–569

Porter LW, Steers RM (1973) Organizational, work, and personal factors in employee turnover and absenteeism. Psychol Bull 80(2):151–176

Price JL (1977) The study of turnover. Iowa State Press

Price JL (2001) Reflections on the determinants of voluntary turnover. Int J Manpow 22(7):600–624

Quinn A, Rycraft JR, Schoech D (2002) Building a model to predict caseworker and supervisor turnover using a neural network and logistic regression. J Technol Hum Serv 19(4):65–85

Raguseo E, Vitari C (2018) Investments in big data analytics and firm performance: an empirical investigation of direct and mediating effects. Int J Prod Res 56(15):5206–5221

Reddy GT, Reddy MPK, Lakshmanna K, Kaluri R, Rajput DS, Srivastava G, Baker T (2020) Analysis of dimensionality reduction techniques on big data. IEEE Access 8:54776–54788

Reiche BS (2008) The configuration of employee retention practices in multinational corporation’s foreign subsidiaries. Int Bus Rev 17(6):676–687

Rode JC, Rehg MT, Near JP, Underhill JR (2007) The effect of work/family conflict on intention to quit: The mediating roles of job and life satisfaction. Appl Res Qual Life 2(2):65–82

Rombaut E, Guerry MA (2018) Predicting voluntary turnover through human resources database analysis. Manag Res Rev 41(1):96–112

Rosset S, Neumann E, Eick U, Vatnik N (2003) Customer lifetime value models for decision support. Data Min Knowl Disc 7(3):321–339

Rubenstein AL, Eberly MB, Lee TW, Mitchell TR (2018) Surveying the forest: A meta-analysis, moderator investigation, and future-oriented discussion of the antecedents of voluntary employee turnover. Pers Psychol 71(1):23–65

Ryan RM, Deci EL (2000) Self-determination theory and the facilitation of intrinsic motivation, social development, and well-being. Am Psychol 55(1):68

Sandhya S, Sulphey MM (2021) Influence of empowerment, psychological contract and employee engagement on voluntary turnover intentions. Int J Product Perform Manag 70(2):325–349

Saradhi VV, Palshikar GK (2011) Employee churn prediction. Expert Syst Appl 38(3):1999–2006

Sexton RS, McMurtrey S, Michalopoulos JO, Smith AM (2005) Employee turnover: a neural network solution. Comput Oper Res 32(10):2635–2651

Shah N, Irani Z, Sharif AM (2017) Big data in an HR context: exploring organizational change readiness, employee attitudes and behaviors. J Bus Res 70:366–378

Sheng J, Amankwah-Amoah J, Wang X (2017) A multidisciplinary perspective of big data in management research. Int J Prod Econ 191:97–112

Staw BM (1980) The consequences of turnover. J Occup Behav 1(4):253–273

Swets JA (1988) Measuring the accuracy of diagnostic system. Science 240(4857):1285–1293

Tan PN, Steinbach M, Kumar V (2005) Association analysis: basic concepts and algorithms. Introduction to data mining. Addison-Wesley, Boston, pp 71–94

Tan PN, Steinbach M, Kumar V (2006) Classification: basic concepts, decision trees, and model evaluation. Introduction to data mining. Pearson Addison-Wesley, pp 25-44

Tso GK, Yau KK (2007) Predicting electricity energy consumption: a comparison of regression analysis, decision tree and neural networks. Energy 32(9):1761–1768

Ture M, Tokatli F, Kurt I (2009) Using Kaplan-Meier analysis together with decision tree methods (C&RT, CHAID, QUEST, C4.5 and ID3) in determining recurrence-free survival of breast cancer patients. Expert Syst Appl 36(2):2017–2026

Yang Y, Shamim S, Herath DB, Secchi D, Homberg F (2023) The evolution of HRM practices: big data, data analytics, and new forms of work. RMS 17(6):1937–1942

Wei CP, Chiu IT (2002) Turning telecommunications call details to churn prediction: a data mining approach. Expert Syst Appl 23(2):103–112

Wirges F, Neyer AK (2023) Towards a process-oriented understanding of HR analytics: implementation and application. RMS 17(6):2077–2108

Zhao X (2008) An empirical study of data mining in performance evaluation of HRM. Int Symp Intell Inform Technol Appl Workshops, pp 82–85

Zhu J, Gonçalves AL, Uren VS, Motta E, Pacheco R (2005) Mining web data for competency management. The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI'05), Compiegne, pp 94–100

Open access funding provided by Università degli Studi di Trieste within the CRUI-CARE Agreement.

Author information

Authors and Affiliations

Department of Economics and Management, University of Pavia, Pavia, Italy

Valerio Veglio

Department of Economics, Business, Mathematics and Statistics, University of Trieste, Trieste, Italy

Rubina Romanello

Department of Strategy and Innovation, Copenhagen Business School, Copenhagen, Denmark

Torben Pedersen

Institute for Transformative Innovation Research, University of Pavia, Pavia, Italy

Corresponding author

Correspondence to Rubina Romanello .

Ethics declarations

Competing interests and funding

The authors have no competing interests or funding directly or indirectly related to this work to disclose.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

About this article

Veglio, V., Romanello, R. & Pedersen, T. Employee turnover in multinational corporations: a supervised machine learning approach. Rev Manag Sci (2024).

Received: 28 February 2023

Accepted: 08 May 2024

Published: 21 May 2024


Keywords

  • Machine learning
  • Classification decision tree
  • Employee turnover
  • Employee churn predictive model
  • Multinational corporation
  • Employee retention


Supercharging Graph Neural Networks with Large Language Models: The Ultimate Guide



Graphs are data structures that represent complex relationships across a wide range of domains, including social networks, knowledge bases, biological systems, and many more. In these graphs, entities are represented as nodes, and their relationships are depicted as edges.

The ability to effectively represent and reason about these intricate relational structures is crucial for enabling advancements in fields like network science, cheminformatics, and recommender systems.

Graph Neural Networks (GNNs) have emerged as a powerful deep learning framework for graph machine learning tasks. By incorporating the graph topology into the neural network architecture through neighborhood aggregation or graph convolutions, GNNs can learn low-dimensional vector representations that encode both the node features and their structural roles. This allows GNNs to achieve state-of-the-art performance on tasks such as node classification, link prediction, and graph classification across diverse application areas.

While GNNs have driven substantial progress, some key challenges remain. Obtaining high-quality labeled data for training supervised GNN models can be expensive and time-consuming. Additionally, GNNs can struggle with heterogeneous graph structures and situations where the graph distribution at test time differs significantly from the training data (out-of-distribution generalization).

In parallel, Large Language Models (LLMs) like GPT-4 and LLaMA have taken the world by storm with their incredible natural language understanding and generation capabilities. Trained on massive text corpora with billions of parameters, LLMs exhibit remarkable few-shot learning abilities, generalization across tasks, and commonsense reasoning skills that were once thought to be extremely challenging for AI systems.

The tremendous success of LLMs has catalyzed explorations into leveraging their power for graph machine learning tasks. On the one hand, the knowledge and reasoning capabilities of LLMs present opportunities to enhance traditional GNN models. On the other hand, the structured representations and factual knowledge inherent in graphs could be instrumental in addressing some key limitations of LLMs, such as hallucinations and lack of interpretability.

In this article, we will delve into the latest research at the intersection of graph machine learning and large language models. We will explore how LLMs can be used to enhance various aspects of graph ML, review approaches to incorporate graph knowledge into LLMs, and discuss emerging applications and future directions for this exciting field.

Graph Neural Networks and Self-Supervised Learning

To provide the necessary context, we will first briefly review the core concepts and methods in graph neural networks and self-supervised graph representation learning.

Graph Neural Network Architectures

Figure: Graph Neural Network Architecture (source)

The key distinction between traditional deep neural networks and GNNs lies in their ability to operate directly on graph-structured data. GNNs follow a neighborhood aggregation scheme, where each node aggregates feature vectors from its neighbors to compute its own representation.
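As a minimal sketch of this scheme, the snippet below implements one GCN-style propagation step (the familiar rule H' = σ(D^-1/2 Â D^-1/2 H W), with self-loops added) in plain NumPy on a toy four-node path graph. The feature and weight values are random stand-ins; a real model would learn W by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy undirected path graph on 4 nodes: 0-1, 1-2, 2-3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 8))    # node feature matrix (4 nodes, 8 dims)
W = rng.normal(size=(8, 16))   # learnable layer weights

# Add self-loops and symmetrically normalize: D^-1/2 (A + I) D^-1/2
A_hat = A + np.eye(4)
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# One round of neighborhood aggregation + linear transform + ReLU
H = np.maximum(0.0, A_norm @ X @ W)   # new representations, shape (4, 16)
```

Each row of H mixes a node's own features with those of its neighbors; stacking such layers lets information propagate over longer paths in the graph.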

Numerous GNN architectures have been proposed with different instantiations of the message and update functions, such as Graph Convolutional Networks (GCNs), GraphSAGE, Graph Attention Networks (GATs), and Graph Isomorphism Networks (GINs), among others.

More recently, graph transformers have gained popularity by adapting the self-attention mechanism from natural language transformers to operate on graph-structured data. Some examples include Graphormer and GraphFormers. These models are able to capture long-range dependencies across the graph better than purely neighborhood-based GNNs.

Self-Supervised Learning on Graphs

While GNNs are powerful representational models, their performance is often bottlenecked by the lack of large labeled datasets required for supervised training. Self-supervised learning has emerged as a promising paradigm to pre-train GNNs on unlabeled graph data by leveraging pretext tasks that only require the intrinsic graph structure and node features.

Figure: Self-supervised learning on graphs

Some common pretext tasks used for self-supervised GNN pre-training include:

  • Node Property Prediction : Randomly masking or corrupting a portion of the node attributes/features and tasking the GNN to reconstruct them.
  • Edge/Link Prediction : Learning to predict whether an edge exists between a pair of nodes, often based on random edge masking.
  • Contrastive Learning : Maximizing similarities between graph views of the same graph sample while pushing apart views from different graphs.
  • Mutual Information Maximization : Maximizing the mutual information between local node representations and a target representation like the global graph embedding.
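A minimal sketch of the edge/link-prediction pretext task, using random stand-in embeddings and a dot-product scorer. In a real pipeline the embeddings would come from a GNN encoder and this loss would be backpropagated into it; here we only compute the objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in node embeddings (in practice, the output of a GNN encoder)
Z = rng.normal(size=(5, 8))

masked_edges = [(0, 1), (1, 2)]    # true edges hidden during encoding
negative_pairs = [(0, 3), (1, 4)]  # sampled non-edges

def edge_prob(z, u, v):
    """Dot-product link score squashed to a probability."""
    return 1.0 / (1.0 + np.exp(-(z[u] @ z[v])))

# Binary cross-entropy: true (masked) edges should score near 1,
# negative pairs near 0; minimizing this trains the encoder
loss = -np.mean(
    [np.log(edge_prob(Z, u, v)) for u, v in masked_edges]
    + [np.log(1.0 - edge_prob(Z, u, v)) for u, v in negative_pairs]
)
```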

Pretext tasks like these allow the GNN to extract meaningful structural and semantic patterns from the unlabeled graph data during pre-training. The pre-trained GNN can then be fine-tuned on relatively small labeled subsets to excel at various downstream tasks like node classification, link prediction, and graph classification.

By leveraging self-supervision, GNNs pre-trained on large unlabeled datasets exhibit better generalization, robustness to distribution shifts, and efficiency compared to training from scratch. However, some key limitations of traditional GNN-based self-supervised methods remain, which we will explore leveraging LLMs to address next.

Enhancing Graph ML with Large Language Models

Figure: Integration of Graphs and LLMs (source)

The remarkable capabilities of LLMs in understanding natural language, reasoning, and few-shot learning present opportunities to enhance multiple aspects of graph machine learning pipelines. We explore some key research directions in this space:

A key challenge in applying GNNs is obtaining high-quality feature representations for nodes and edges, especially when they contain rich textual attributes like descriptions, titles, or abstracts. Traditionally, simple bag-of-words or pre-trained word embedding models have been used, which often fail to capture the nuanced semantics.

Recent works have demonstrated the power of leveraging large language models as text encoders to construct better node/edge feature representations before passing them to the GNN. For example, Chen et al. utilize LLMs like GPT-3 to encode textual node attributes, showing significant performance gains over traditional word embeddings on node classification tasks.

Beyond better text encoders, LLMs can be used to generate augmented information from the original text attributes in a semi-supervised manner. TAPE generates potential labels/explanations for nodes using an LLM and uses these as additional augmented features. KEA extracts terms from text attributes using an LLM and obtains detailed descriptions for these terms to augment features.

By improving the quality and expressiveness of input features, LLMs can impart their superior natural language understanding capabilities to GNNs, boosting performance on downstream tasks.
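A sketch of the LLMs-as-feature-enhancers idea. `llm_encode` is a hypothetical stand-in for a real text-embedding model call (it hashes text into a deterministic random vector so the snippet runs offline); in practice the LLM-produced embedding would be concatenated with any existing structural features before being passed to the GNN.

```python
import hashlib
import numpy as np

def llm_encode(text: str, dim: int = 16) -> np.ndarray:
    """Hypothetical stand-in for an LLM text encoder. A real pipeline
    would call an embeddings model here; we hash the text into a
    deterministic random vector so the sketch runs offline."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    return np.random.default_rng(seed).normal(size=dim)

# Hypothetical textual node attributes from a citation graph
node_texts = {
    0: "A survey of graph neural networks",
    1: "Attention mechanisms for sequence modeling",
}
# Existing non-textual features (e.g., degree, citation count)
structural = {0: np.array([3.0, 0.1]), 1: np.array([5.0, 0.4])}

# Enhanced node features: structural features + LLM text embedding
node_features = {
    n: np.concatenate([structural[n], llm_encode(t)])
    for n, t in node_texts.items()
}
```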

Alleviating Reliance on Labeled Data

A key advantage of LLMs is their ability to perform reasonably well on new tasks with little to no labeled data, thanks to their pre-training on vast text corpora. This few-shot learning capability can be leveraged to alleviate the reliance of GNNs on large labeled datasets.

One approach is to use LLMs to directly make predictions on graph tasks by describing the graph structure and node information in natural language prompts. Methods like InstructGLM and GPT4Graph fine-tune LLMs like LLaMA and GPT-4 using carefully designed prompts that incorporate graph topology details such as node connections and neighborhoods. The tuned LLMs can then generate predictions for tasks like node classification and link prediction in a zero-shot manner during inference.
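A toy illustration of the prompt-construction step such methods rely on: serializing a small graph's topology and known labels into natural language. The helper and wording below are illustrative, not taken from InstructGLM or GPT4Graph.

```python
def graph_to_prompt(adj, known_labels, target):
    """Serialize a small graph into a natural-language prompt for an
    LLM-as-predictor setup (wording is illustrative, not taken from
    any specific paper)."""
    lines = ["You are given a citation graph."]
    for node, nbrs in sorted(adj.items()):
        lines.append(f"Node {node} is connected to nodes {sorted(nbrs)}.")
    for node, label in sorted(known_labels.items()):
        lines.append(f"Node {node} has label '{label}'.")
    lines.append(f"What is the most likely label of node {target}? "
                 "Answer with a single label.")
    return "\n".join(lines)

adj = {0: {1, 2}, 1: {0}, 2: {0}}
prompt = graph_to_prompt(adj, {1: "machine learning", 2: "machine learning"},
                         target=0)
```

The resulting string would be sent to the LLM, whose free-text answer is then parsed back into a class label.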

While using LLMs as black-box predictors has shown promise, their performance degrades for more complex graph tasks where explicit modeling of the structure is beneficial. Some approaches thus use LLMs in conjunction with GNNs – the GNN encodes the graph structure while the LLM provides enhanced semantic understanding of nodes from their text descriptions.

Figure: Graph Understanding with LLM Framework (source)

GraphLLM explores two strategies: 1) LLMs-as-Enhancers where LLMs encode text node attributes before passing to the GNN, and 2) LLMs-as-Predictors where the LLM takes the GNN's intermediate representations as input to make final predictions.

GLEM goes further by proposing a variational EM algorithm that alternates between updating the LLM and GNN components for mutual enhancement.

By reducing reliance on labeled data through few-shot capabilities and semi-supervised augmentation, LLM-enhanced graph learning methods can unlock new applications and improve data efficiency.

Enhancing LLMs with Graphs

While LLMs have been tremendously successful, they still suffer from key limitations like hallucinations (generating non-factual statements), lack of interpretability in their reasoning process, and inability to maintain consistent factual knowledge.

Graphs, especially knowledge graphs which represent structured factual information from reliable sources, present promising avenues to address these shortcomings. We explore some emerging approaches in this direction:

Knowledge Graph Enhanced LLM Pre-training

Similar to how LLMs are pre-trained on large text corpora, recent works have explored pre-training them on knowledge graphs to imbue better factual awareness and reasoning capabilities.

Some approaches modify the input data by simply concatenating or aligning factual KG triples with natural language text during pre-training. E-BERT aligns KG entity vectors with BERT's wordpiece embeddings, while K-BERT constructs trees containing the original sentence and relevant KG triples.
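The input-modification idea can be illustrated with a naive sketch. Unlike K-BERT, which builds a sentence tree with soft-position embeddings, this simply flattens KG triples into text appended to the input; the `[KG: ...]` marker is an invented convention:

```python
# Sketch: inject knowledge-graph triples into pre-training text by
# flattening them into a natural-language suffix. Purely illustrative.
triples = [("Paris", "capital_of", "France"),
           ("France", "located_in", "Europe")]

def augment(sentence, triples):
    """Append verbalized KG facts to a sentence."""
    facts = "; ".join(f"{h} {r.replace('_', ' ')} {t}" for h, r, t in triples)
    return f"{sentence} [KG: {facts}]"

print(augment("I visited Paris last year.", triples))
# I visited Paris last year. [KG: Paris capital of France; France located in Europe]
```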

The Role of LLMs in Graph Machine Learning

Researchers have explored several ways to integrate LLMs into the graph learning pipeline, each with its unique advantages and applications. Here are some of the prominent roles LLMs can play:

  • LLM as an Enhancer: In this approach, LLMs are used to enrich the textual attributes associated with the nodes in a text-attributed graph (TAG). The LLM's ability to generate explanations, knowledge entities, or pseudo-labels can augment the semantic information available to the GNN, leading to improved node representations and downstream task performance.

For example, the TAPE (Text Augmented Pre-trained Encoders) model leverages ChatGPT to generate explanations and pseudo-labels for citation network papers, which are then used to fine-tune a language model. The resulting embeddings are fed into a GNN for node classification and link prediction tasks, achieving state-of-the-art results.

  • LLM as a Predictor: Rather than enhancing the input features, some approaches directly employ LLMs as the predictor component for graph-related tasks. This involves converting the graph structure into a textual representation that can be processed by the LLM, which then generates the desired output, such as node labels or graph-level predictions.

One notable example is the GPT4Graph model, which represents graphs using the Graph Modelling Language (GML) and leverages the powerful GPT-4 LLM for zero-shot graph reasoning tasks.
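To make the graph-to-text step concrete, here is a small, dependency-free sketch of GML serialization; libraries such as networkx provide `generate_gml` for real use, and the node labels below are made up:

```python
# Sketch: serialize a tiny graph into Graph Modelling Language (GML) text,
# the format GPT4Graph reportedly feeds to the LLM. Hand-rolled for clarity.
def to_gml(nodes, edges):
    lines = ["graph ["]
    for nid, label in nodes:
        lines += ["  node [", f"    id {nid}", f'    label "{label}"', "  ]"]
    for src, dst in edges:
        lines += ["  edge [", f"    source {src}", f"    target {dst}", "  ]"]
    lines.append("]")
    return "\n".join(lines)

gml = to_gml(nodes=[(0, "paper A"), (1, "paper B")], edges=[(0, 1)])
print(gml)
```

The resulting text block can be embedded directly into a prompt, letting the LLM "read" the graph structure.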

  • GNN-LLM Alignment: Another line of research focuses on aligning the embedding spaces of GNNs and LLMs, allowing for a seamless integration of structural and semantic information. These approaches treat the GNN and LLM as separate modalities and employ techniques like contrastive learning or distillation to align their representations.

The MoleculeSTM model, for instance, uses a contrastive objective to align the embeddings of a GNN and an LLM, enabling the LLM to incorporate structural information from the GNN while the GNN benefits from the LLM's semantic knowledge.
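An InfoNCE-style contrastive objective of the kind used for such alignment can be sketched in a few lines of NumPy. The embeddings below are random placeholders for real GNN and LLM outputs, and the implementation is a simplification, not MoleculeSTM's actual training code:

```python
# Sketch: contrastive (InfoNCE-style) alignment of two embedding spaces.
# Row i of each matrix is assumed to describe the same underlying object.
import numpy as np

def info_nce(gnn_emb, llm_emb, temperature=0.1):
    """InfoNCE loss over row-aligned embedding matrices."""
    a = gnn_emb / np.linalg.norm(gnn_emb, axis=1, keepdims=True)
    b = llm_emb / np.linalg.norm(llm_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature                 # scaled cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(a))                        # i-th graph matches i-th text
    return -log_probs[idx, idx].mean()

rng = np.random.default_rng(0)
g, t = rng.standard_normal((4, 16)), rng.standard_normal((4, 16))
print(round(info_nce(g, t), 3))
```

Minimizing this loss pulls matched graph/text pairs together and pushes mismatched pairs apart, which is what ties the two embedding spaces into one.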

Challenges and Solutions

While the integration of LLMs and graph learning holds immense promise, several challenges need to be addressed:

  • Efficiency and Scalability: LLMs are notoriously resource-intensive, often requiring billions of parameters and immense computational power for training and inference. This can be a significant bottleneck for deploying LLM-enhanced graph learning models in real-world applications, especially on resource-constrained devices.

One promising solution is knowledge distillation, where the knowledge from a large LLM (teacher model) is transferred to a smaller, more efficient GNN (student model).
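The core of such distillation is a loss that pushes the student's softened predictions toward the teacher's. Here is a minimal NumPy sketch, with fixed logit arrays standing in for real model outputs:

```python
# Sketch: knowledge distillation via KL divergence on temperature-softened
# output distributions. Teacher/student are fixed logit arrays for clarity.
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, T)          # soft teacher targets
    q = softmax(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)

teacher = np.array([[4.0, 1.0, 0.5]])      # e.g. large LLM's class scores
student = np.array([[2.0, 1.5, 0.5]])      # e.g. compact GNN's class scores
print(round(distill_loss(student, teacher), 4))
```

Training the student to minimize this loss (usually alongside the ordinary supervised loss) transfers the teacher's "dark knowledge" about class similarities.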

  • Data Leakage and Evaluation: LLMs are pre-trained on vast amounts of publicly available data, which may include test sets from common benchmark datasets, leading to potential data leakage and overestimated performance. Researchers have started collecting new datasets or sampling test data from time periods after the LLM's training cut-off to mitigate this issue.

Additionally, establishing fair and comprehensive evaluation benchmarks for LLM-enhanced graph learning models is crucial to measure their true capabilities and enable meaningful comparisons.

  • Transferability and Explainability: While LLMs excel at zero-shot and few-shot learning, their ability to transfer knowledge across diverse graph domains and structures remains an open challenge. Improving the transferability of these models is a critical research direction.

Furthermore, enhancing the explainability of LLM-based graph learning models is essential for building trust and enabling their adoption in high-stakes applications. Leveraging the inherent reasoning capabilities of LLMs through techniques like chain-of-thought prompting can contribute to improved explainability.

  • Multimodal Integration: Graphs often contain more than just textual information, with nodes and edges potentially associated with various modalities, such as images, audio, or numeric data. Extending the integration of LLMs to these multimodal graph settings presents an exciting opportunity for future research.

Real-world Applications and Case Studies

The integration of LLMs and graph machine learning has already shown promising results in various real-world applications:

  • Molecular Property Prediction: In the field of computational chemistry and drug discovery, LLMs have been employed to enhance the prediction of molecular properties by incorporating structural information from molecular graphs. The LLM4Mol model, for instance, leverages ChatGPT to generate explanations for SMILES (Simplified Molecular-Input Line-Entry System) representations of molecules, which are then used to improve the accuracy of property prediction tasks.
  • Knowledge Graph Completion and Reasoning: Knowledge graphs are a special type of graph structure that represents real-world entities and their relationships. LLMs have been explored for tasks like knowledge graph completion and reasoning, where the graph structure and textual information (e.g., entity descriptions) need to be considered jointly.
  • Recommender Systems: In the domain of recommender systems, graph structures are often used to represent user-item interactions, with nodes representing users and items, and edges denoting interactions or similarities. LLMs can be leveraged to enhance these graphs by generating user/item side information or reinforcing interaction edges.

The synergy between Large Language Models and Graph Machine Learning presents an exciting frontier in artificial intelligence research. By combining the structural inductive bias of GNNs with the powerful semantic understanding capabilities of LLMs, we can unlock new possibilities in graph learning tasks, particularly for text-attributed graphs.

While significant progress has been made, challenges remain in areas such as efficiency, scalability, transferability, and explainability. Techniques like knowledge distillation, fair evaluation benchmarks, and multimodal integration are paving the way for practical deployment of LLM-enhanced graph learning models in real-world applications.



Title: A Systematic Review and Meta-Analysis on Sleep Stage Classification and Sleep Disorder Detection Using Artificial Intelligence

Abstract: Sleep is vital for people's physical and mental health, and sound sleep can help them focus on daily activities. A sleep study that includes sleep patterns and disorders is therefore crucial to enhancing our knowledge about individuals' health status. Traditionally, findings on sleep stages and sleep disorders have relied on polysomnography and self-report measures, followed by clinical assessment by expert physicians. However, the evaluation of sleep stage classification and sleep disorder detection has become more convenient with artificial intelligence applications, as numerous investigations on various datasets apply advanced algorithms and techniques that offer improved computational ease and accuracy. This study aims to provide a comprehensive, systematic review and meta-analysis of the recent literature on sleep studies, covering works on sleep stage classification and sleep disorder detection using AI. In this review, 183 articles were initially selected from different journals, among which 80 records, ranging from 2016 to 2023, were enlisted for explicit review. Brain waves were the most commonly employed body parameter for sleep staging and disorder studies. The convolutional neural network was the most widely used of the 34 distinct artificial intelligence models, appearing in 27% of the works; long short-term memory, support vector machine, random forest, and recurrent neural network models followed at 11%, 6%, 6%, and 5%, respectively. Among performance metrics, accuracy was the most widely reported, used in 83.75% of the cases, followed by the F1 score (45%), Kappa (36.25%), Sensitivity (31.25%), and Specificity (30%), along with other metrics. This article should help physicians and researchers get the gist of AI's contribution to sleep studies and the feasibility of their intended work.


The world is getting “smarter” every day, and to keep up with consumer expectations, companies are increasingly using machine learning algorithms to make things easier. You can see them in use in end-user devices (through face recognition for unlocking smartphones) or for detecting credit card fraud (like triggering alerts for unusual purchases).

Within artificial intelligence (AI) and machine learning, there are two basic approaches: supervised learning and unsupervised learning. The main difference is that one uses labeled data to help predict outcomes, while the other does not. However, there are some nuances between the two approaches, and key areas in which one outperforms the other. This post clarifies the differences so you can choose the best approach for your situation.

Supervised learning is a machine learning approach that’s defined by its use of labeled data sets. These data sets are designed to train or “supervise” algorithms into classifying data or predicting outcomes accurately. Using labeled inputs and outputs, the model can measure its accuracy and learn over time.

In data mining, supervised learning can be separated into two types of problems, classification and regression:

  • Classification problems use an algorithm to accurately assign test data into specific categories, such as separating apples from oranges. Or, in the real world, supervised learning algorithms can be used to classify spam in a separate folder from your inbox. Linear classifiers, support vector machines, decision trees and random forest are all common types of classification algorithms.
  • Regression is another type of supervised learning method that uses an algorithm to understand the relationship between dependent and independent variables. Regression models are helpful for predicting numerical values based on different data points, such as sales revenue projections for a given business. Some popular regression algorithms are linear regression, logistic regression, and polynomial regression.
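The apples-versus-oranges idea above can be sketched with scikit-learn; the measurements and feature names below are invented purely for illustration:

```python
# Sketch: supervised classification with a labeled data set.
# A decision tree learns from labeled fruit measurements, then
# predicts the class of new, unseen samples.
from sklearn.tree import DecisionTreeClassifier

# features: [weight_g, surface_roughness]; labels supply the supervision
X = [[150, 0.2], [170, 0.3], [140, 0.1],   # apples
     [130, 0.8], [120, 0.9], [125, 0.7]]   # oranges
y = ["apple", "apple", "apple", "orange", "orange", "orange"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[160, 0.25], [122, 0.85]]))  # ['apple' 'orange']
```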

Unsupervised learning uses machine learning algorithms to analyze and cluster unlabeled data sets. These algorithms discover hidden patterns in data without the need for human intervention (hence, they are “unsupervised”).

Unsupervised learning models are used for three main tasks: clustering, association and dimensionality reduction:

  • Clustering is a data mining technique for grouping unlabeled data based on their similarities or differences. For example, K-means clustering algorithms assign similar data points into groups, where the K value specifies the number of clusters to form. This technique is helpful for market segmentation, image compression, and so on.
  • Association is another type of unsupervised learning method that uses different rules to find relationships between variables in a given data set. These methods are frequently used for market basket analysis and recommendation engines, along the lines of “Customers Who Bought This Item Also Bought” recommendations.
  • Dimensionality reduction is a learning technique that is used when the number of features (or dimensions) in a given data set is too high. It reduces the number of data inputs to a manageable size while also preserving the data integrity. Often, this technique is used in the preprocessing data stage, such as when autoencoders remove noise from visual data to improve picture quality.
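A minimal clustering sketch with scikit-learn's KMeans, run on made-up 2-D points with no labels supplied:

```python
# Sketch: unsupervised clustering. K-means groups unlabeled points into
# k=2 clusters based purely on distance, with no labels involved.
from sklearn.cluster import KMeans

points = [[1, 1], [1.5, 2], [1, 0],       # one dense region
          [10, 10], [10.5, 11], [9, 10]]  # another dense region
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)  # the two regions receive different cluster ids
```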

The main distinction between the two approaches is the use of labeled data sets. To put it simply, supervised learning uses labeled input and output data, while an unsupervised learning algorithm does not.

In supervised learning, the algorithm “learns” from the training data set by iteratively making predictions on the data and adjusting for the correct answer. While supervised learning models tend to be more accurate than unsupervised learning models, they require upfront human intervention to label the data appropriately. For example, a supervised learning model can predict how long your commute will be based on the time of day, weather conditions and so on. But first, you must train it to know that rainy weather extends the driving time.

Unsupervised learning models, in contrast, work on their own to discover the inherent structure of unlabeled data. Note that they still require some human intervention for validating output variables. For example, an unsupervised learning model can identify that online shoppers often purchase groups of products at the same time. However, a data analyst would need to validate that it makes sense for a recommendation engine to group baby clothes with an order of diapers, applesauce, and sippy cups.

  • Goals: In supervised learning, the goal is to predict outcomes for new data. You know up front the type of results to expect. With an unsupervised learning algorithm, the goal is to get insights from large volumes of new data. The machine learning itself determines what is different or interesting from the data set.
  • Applications: Supervised learning models are ideal for spam detection, sentiment analysis, weather forecasting and pricing predictions, among other things. In contrast, unsupervised learning is a great fit for anomaly detection, recommendation engines, customer personas and medical imaging.
  • Complexity: Supervised learning is a simple method for machine learning, typically implemented using languages like R or Python. In unsupervised learning, you need powerful tools for working with large amounts of unclassified data. Unsupervised learning models are computationally complex because they need a large training set to produce intended outcomes.
  • Drawbacks: Supervised learning models can be time-consuming to train, and the labels for input and output variables require expertise. Meanwhile, unsupervised learning methods can have wildly inaccurate results unless you have human intervention to validate the output variables.

Choosing the right approach for your situation depends on how your data scientists assess the structure and volume of your data, as well as the use case. To make your decision, be sure to do the following:

  • Evaluate your input data: Is it labeled or unlabeled data? Do you have experts that can support extra labeling?
  • Define your goals: Do you have a recurring, well-defined problem to solve? Or will the algorithm need to predict new problems?
  • Review your options for algorithms: Are there algorithms with the same dimensionality that you need (number of features, attributes, or characteristics)? Can they support your data volume and structure?

Classifying big data can be a real challenge in supervised learning, but the results are highly accurate and trustworthy. In contrast, unsupervised learning can handle large volumes of data in real time. But, there’s a lack of transparency into how data is clustered and a higher risk of inaccurate results. This is where semi-supervised learning comes in.

Can’t decide whether to use supervised or unsupervised learning? Semi-supervised learning is a happy medium, where you use a training data set with both labeled and unlabeled data. It’s particularly useful when it’s difficult to extract relevant features from data, and when you have a high volume of data.

Semi-supervised learning is ideal for medical images, where a small amount of training data can lead to a significant improvement in accuracy. For example, a radiologist can label a small subset of CT scans for tumors or diseases so the machine can more accurately predict which patients might require more medical attention.
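scikit-learn ships a self-training wrapper that implements this pseudo-labeling idea. In the sketch below the data is synthetic, and samples marked -1 follow the library's convention for "unlabeled":

```python
# Sketch: semi-supervised learning. Unlabeled samples (label -1) receive
# pseudo-labels from a base classifier's confident predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X = np.array([[0.0], [0.2], [0.1], [3.0], [3.2], [2.9], [0.15], [3.1]])
y = np.array([0, 0, -1, 1, 1, -1, -1, -1])  # -1 marks unlabeled samples

model = SelfTrainingClassifier(LogisticRegression()).fit(X, y)
print(model.predict([[0.05], [3.05]]))  # [0 1]
```

The wrapper alternates between fitting the base classifier on the labeled portion and promoting its most confident predictions on the unlabeled portion to training labels, mirroring the radiologist example: a few expert labels seed predictions for the rest.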

Machine learning models are a powerful way to gain the data insights that improve our world. To learn more about the specific algorithms that are used with supervised and unsupervised learning, we encourage you to delve into the Learn Hub articles on these techniques. We also recommend checking out the blog post that goes a step further, with a detailed look at deep learning and neural networks.




  1. How To Solve A Classification Task With Machine Learning

    The case study in this article will go over a popular Machine learning concept called classification. In Machine Learning (ML), classification is a supervised learning concept that groups data into classes. Classification usually refers to any kind of problem where a specific type of class label is the result to be predicted ...

  2. Classification in Machine Learning: A Guide for Beginners

    Classification is a supervised machine learning method where the model tries to predict the correct label of a given input data. In classification, the model is fully trained using the training data, and then it is evaluated on test data before being used to perform prediction on new unseen data. For instance, an algorithm can learn to predict ...

  3. Machine Learning: Classification Course by University of Washington

    Classification is one of the most widely used techniques in machine learning, with a broad array of applications, including sentiment analysis, ad targeting, spam detection, risk assessment, medical diagnosis and image classification. The core goal of classification is to predict a category or class y from some inputs x.

  5. 4 Types of Classification Tasks in Machine Learning

    Examples include: Email spam detection (spam or not). Churn prediction (churn or not). Conversion prediction (buy or not). Typically, binary classification tasks involve one class that is the normal state and another class that is the abnormal state. For example " not spam " is the normal state and " spam " is the abnormal state.

  7. A case study on machine learning and classification

    As a young research field, machine learning has made significant progress and covered a broad spectrum of applications for the last few decades. Classification is an important task of machine learning. Today, the task is used in a vast array of areas. The present article provides a case study on various classification algorithms (under machine learning), their applicability and issues ...

  8. Machine Learning Foundations: A Case Study Approach

    This first course treats the machine learning method as a black box. Using this abstraction, you will focus on understanding tasks of interest, matching these tasks to machine learning tools, and assessing the quality of the output. In subsequent courses, you will delve into the components of this black box by examining models and algorithms.

  9. Machine learning in remote sensing data—a classification case study

    A case study of the burnt fields is done using a possibilistic c-means classifier to understand the application of remote sensing with regards to image classification. The results have been promising as verified with the previously published work cited in the paper. ... Similar work for classification using machine learning algorithms is carried ...

  10. Machine learning in project analytics: a data-driven framework and case

    The final classification is made by counting the most common scenario or votes present within the ... in addition to the 139 instances of the case study, to the machine learning algorithms, then ...

  12. Extending Classification Algorithms to Case-Control Studies

    Classification is a common technique applied to 'omics data to build predictive models and identify potential markers of biomedical outcomes. Despite the prevalence of case-control studies, the number of classification methods available to analyze data generated by such studies is extremely limited. Conditional logistic regression is the most ...

  13. A detailed case study on Multi-Label Classification with Machine

    In the case of multi-label classification tasks, a single instance of data can simultaneously belong to two or more classes of target variables. Hence, we can say that the predicted classes are not ...

  14. Classification in Networked Data: A Toolkit and a Univariate Case Study

    The case study focuses on univariate network classification, for which the only information used is the structure of class linkage in the network (i.e., only links and some class labels). To our knowledge, no work previously has evaluated systematically the power of class-linkage alone for classification in machine learning benchmark data sets.

  15. Binary Classification

    Step 1: Define explanatory and target variables. We'll store the rows of observations in a variable X and the corresponding class of those observations (0 or 1) in a variable y. Step 2: Split the dataset into training and testing sets. We use 75% of data for training and 25% for testing.

  16. Getting started with Classification

    Machine Learning classification is a type of supervised learning technique where an algorithm is trained on a labeled dataset to predict the class or category of new, unseen data. The main objective of classification machine learning is to build a model that can accurately assign a label or category to a new observation based on its features ...

  17. Deep learning in automated text classification: a case study using

    Our case study presents an empirical assessment of a range of deep learning algorithms and architectures compared with a baseline traditional machine learning algorithm. The case study is designed in the context of systematic reviews related to human health risk assessment and is focused on both the performance and the practicalities of these ...

  18. Case Study: Using Machine Learning to Classify Personally ...

    In this case, if we had a bunch of examples of first and last names, phone numbers, ID numbers, DoB, email addresses and VINs, each labelled as such, we could train a multi-class supervised ...

  19. 16 Real World Case Studies of Machine Learning

    6. Machine Learning Case Study on Tesla. Tesla is now a big name in the electric automobile industry and the chances that it will continue to be the trending topic for years to come are really high. It is popular and extensively known for its advanced and futuristic cars and their advanced models.

  20. An Industrial Case study on Deep learning image classification

    A step by step guide to image classification. In this post, I am going to explain an end-to-end use case of deep learning image classification in order to automate the process of classifying ...

  21. Different Types of Classification Models in Machine Learning

    4. K-Nearest Neighbours. Definition: Neighbours based classification is a type of lazy learning as it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the k nearest neighbours of each point. Advantages: This algorithm is simple to implement, robust to noisy training data, and ...

  22. Automatic ceramic identification using machine learning. Lusitanian

    Machine learning is an application of Artificial Intelligence encompassing the learning of patterns within large datasets, useful for making predictions or decisions over new, related data. ... L. M., and C. E. Downum. 2021. "Applications of Deep Learning to Decorated Ceramic Typology and Classification: A Case Study Using Tusayan White Ware ...

  23. Classification Algorithm in Machine Learning

    The Classification algorithm is a Supervised Learning technique that is used to identify the category of new observations on the basis of training data. In Classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam ...

  24. Employee turnover in multinational corporations: a supervised machine

    This research explores the potential of supervised machine learning techniques in transforming raw data into strategic knowledge in the context of human resource management. By analyzing a database with over 205 variables and 2,932 observations related to a telco multinational corporation, this study tests the predictive and analytical power of classification decision trees in detecting the ...

  28. Bagging Vs. Boosting in Ensemble Machine Learning? An Integrated

    The applications of machine learning (ML) in insurance companies have become increasingly popular as a result of technological advancements and the reality of big data in the insurance industry. ... In the case of the classification task, ... Communications in Statistics Case Studies, Data Analysis and Applications 7 :520-35. doi:10.1080 ...