An Introduction to Classification in Machine Learning

Classification is a supervised machine learning process that predicts the class of input data based on the algorithm’s training data. Here’s what you need to know.

Sidath Asiri

Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels or categories. Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y).

What Is Classification in Machine Learning?

For example, spam detection in email services is a classification problem. It is a binary classification because there are only two classes, “spam” and “not spam.” A classifier uses training data to learn how the input variables relate to the class. In this case, known spam and non-spam emails serve as the training data. Once the classifier is trained accurately, it can be used to classify a previously unseen email.

Classification belongs to the category of supervised learning, where the targets are provided along with the input data. It can be applied to a wide variety of tasks, including credit approval, medical diagnosis and target marketing.

Types of Classification in Machine Learning

There are two types of learners in classification — lazy learners and eager learners.

1. Lazy Learners

Lazy learners store the training data and wait until testing data appears. When it does, classification is conducted based on the most related stored training data. Compared to eager learners, lazy learners spend less training time but more time in predicting.

Examples: K-nearest neighbor and case-based reasoning.

2. Eager Learners

Eager learners construct a classification model from the given training data before receiving data to classify. They must be able to commit to a single hypothesis that covers the entire instance space. Because of this, eager learners take a long time to train and less time to predict.

Examples: Decision tree, naive Bayes and artificial neural networks.

Classification Algorithms

There are a lot of classification algorithms to choose from. Picking the right one depends on the application and nature of the available data set. For example, if the classes are linearly separable, linear classifiers like logistic regression and Fisher’s linear discriminant can outperform sophisticated models and vice versa.

Important Classification Algorithms to Know

  • Decision tree
  • Naive Bayes
  • Artificial neural network
  • K-nearest neighbor (KNN)

Decision Tree

A decision tree builds classification or regression models in the form of a tree structure. It uses an “if-then” rule set that is mutually exclusive and exhaustive for classification. The rules are learned sequentially from the training data, one at a time. Each time a rule is learned, the tuples covered by that rule are removed. This process continues until a termination condition is met.

The tree is constructed in a top-down, recursive, divide-and-conquer manner. All attributes should be categorical; otherwise, they should be discretized in advance. Attributes near the top of the tree have more impact on the classification, and they are chosen using the information gain concept.
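To make the information gain idea concrete, here is a minimal sketch in plain Python. The toy rows, the attribute names and the helper functions are all hypothetical and only illustrate the calculation; real decision tree libraries differ in detail.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: each row is a dict of categorical attributes.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

gain_outlook = information_gain(rows, labels, "outlook")
gain_windy = information_gain(rows, labels, "windy")
```

In this toy data, "outlook" separates the two classes completely while "windy" tells us nothing, so a tree built this way would put "outlook" at the root.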

A decision tree can easily be overfitted, generating too many branches that may reflect anomalies due to noise or outliers. An overfitted model performs very poorly on unseen data, even though it performs impressively on the training data. You can avoid this with pre-pruning, which halts tree construction early, or with post-pruning, which removes branches from the fully grown tree.

Naive Bayes

Naive Bayes is a probabilistic classifier based on Bayes’ theorem under the assumption that attributes are conditionally independent.

The classification is conducted by deriving the maximum posterior, which is the maximal P(Ci|X), with the above assumption applied to Bayes’ theorem. This assumption greatly reduces the computational cost because only the class distribution needs to be counted. Even though the assumption isn’t valid in most cases, since the attributes are dependent, naive Bayes performs surprisingly well.

Naive Bayes is a simple algorithm to implement and can yield good results in most cases. It can be easily scaled to larger data sets since it takes linear time, rather than the expensive iterative approximation that other types of classifiers use. 

Naive Bayes can suffer from the zero probability problem. When the conditional probability of a particular attribute value is zero, the whole product becomes zero and the classifier fails to give a valid prediction. This needs to be fixed explicitly, for example with a Laplacian estimator.
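The zero probability problem and its fix can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is made up, the function names are hypothetical, and the smoothing shown is the common add-one (Laplace) estimate.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    distinct_values = defaultdict(set)   # attribute index -> values seen in training
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            distinct_values[i].add(value)
    return class_counts, value_counts, distinct_values


def posterior_scores(model, row, alpha=1.0):
    """Unnormalised P(C) * prod P(x_i | C), with add-alpha (Laplace) smoothing."""
    class_counts, value_counts, distinct_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_class in class_counts.items():
        score = n_class / total
        for i, value in enumerate(row):
            count = value_counts[(label, i)][value]
            score *= (count + alpha) / (n_class + alpha * len(distinct_values[i]))
        scores[label] = score
    return scores


# Hypothetical toy data: (contains_link, all_caps_subject) -> class
rows = [("yes", "yes"), ("yes", "no"), ("no", "no"), ("no", "no")]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(rows, labels)

unsmoothed = posterior_scores(model, ("no", "yes"), alpha=0.0)
smoothed = posterior_scores(model, ("no", "yes"), alpha=1.0)
prediction = max(smoothed, key=smoothed.get)
```

Without smoothing (alpha=0), every class scores zero for this query because each class has one attribute value it never saw, so no decision is possible. With alpha=1 the scores stay positive and the classes can be compared.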

Artificial Neural Networks

An artificial neural network is a set of connected input/output units in which each connection has an associated weight. The idea was originally developed by psychologists and neurobiologists as a way to build and test computational analogs of neurons. During the learning phase, the network learns by adjusting the weights so that it can predict the correct class label of the input tuples.

There are several network architectures available today, including feed-forward, convolutional and recurrent networks. The appropriate architecture depends on the application of the model. For most cases, feed-forward models give reasonably accurate results, but convolutional networks perform better for image processing. 

The model can have multiple hidden layers, depending on the complexity of the function it needs to map. Networks with many hidden layers, known as deep neural networks, can model complex relationships.

However, when there are many hidden layers, it takes a lot of time to train and adjust the weights. The other disadvantage of this is the poor interpretability of the model compared to others like decision trees. This is due to the unknown symbolic meaning behind the learned weights.

But artificial neural networks have performed impressively in most real-world applications. They have a high tolerance for noisy data and are able to classify untrained patterns. Usually, artificial neural networks perform better with continuous-valued inputs and outputs.

All of the above algorithms are eager learners since they train a model in advance to generalize the training data and use it for prediction later.

K-Nearest Neighbor (KNN)

K-nearest neighbor is a lazy learning algorithm that stores all instances corresponding to training data points in n-dimensional space. When an unknown data point with a discrete target arrives, it analyzes the k closest stored instances (the nearest neighbors) and returns the most common class as the prediction. For real-valued targets, it returns the mean of the k nearest neighbors.

The distance-weighted nearest neighbor algorithm weighs the contribution of each of the k neighbors according to its distance from the query point, giving greater weight to the closest neighbors.
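The exact weighting formula is not reproduced in this text. One common choice, assumed here purely for illustration, is to weight each neighbor by the inverse of its squared distance to the query point. A minimal sketch with hypothetical toy data:

```python
from collections import defaultdict
from math import dist


def weighted_knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by a distance-weighted vote of its k nearest neighbors.

    Assumption: weights are 1 / distance**2, one common choice among several.
    """
    neighbors = sorted(
        zip(train_points, train_labels), key=lambda pair: dist(pair[0], query)
    )[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = dist(point, query)
        if d == 0.0:
            return label  # exact match: return its label directly
        votes[label] += 1.0 / d ** 2
    return max(votes, key=votes.get)


# Hypothetical 2-D toy data.
points = [(0.0, 0.0), (3.0, 0.0), (3.0, 1.0)]
labels = ["a", "b", "b"]
prediction = weighted_knn_predict(points, labels, query=(1.0, 0.0), k=3)
```

Here a plain majority vote over the three neighbors would pick "b", but the single "a" point is much closer to the query, so the weighted vote picks "a".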

Usually, KNN is robust to noisy data since it is averaging the k-nearest neighbors.

How to Evaluate a Classifier

After training the model, the most important part is to evaluate the classifier to verify its applicability.

Machine Learning Classifier Evaluation Methods

  • Holdout method.
  • Cross-validation.
  • Precision and recall.
  • Receiver operating characteristics (ROC) curve.

Holdout Method

There are several methods to evaluate a classifier, but the most common is the holdout method. In it, the given data set is divided into two partitions, train and test. A common split uses 80 percent of the data for training and 20 percent for testing. The train set is used to train the model, and the unseen test data is used to measure its predictive power.
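A holdout split can be sketched with nothing more than a shuffle and a slice. The function name and the fixed seed below are illustrative assumptions, not part of any particular library.

```python
import random


def holdout_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and split it into train and test partitions."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))  # stand-in for 10 labelled examples
train, test = holdout_split(data)
```

Shuffling before splitting matters when the data is stored in some order, for example sorted by class; otherwise the test set may not be representative.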

Cross-Validation

Overfitting is a common problem in machine learning and can occur in most models. K-fold cross-validation can be used to check that the model is not overfitted. In this method, the data set is randomly partitioned into k mutually exclusive subsets of approximately equal size. One subset is kept for testing while the others are used for training. This process is repeated k times so that each subset serves as the test set once.
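The partitioning step of k-fold cross-validation can be sketched as follows. This is a minimal illustration with a hypothetical helper; it only produces index lists and leaves model training and scoring to the caller.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal subsets
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


splits = list(k_fold_indices(n_samples=10, k=5))
```

Each example appears in exactly one test fold, so averaging the k test scores uses every example for evaluation once.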

Precision and Recall

Precision is the fraction of relevant instances among the retrieved instances, while recall is the fraction of relevant instances that have been retrieved over the total amount of relevant instances. Precision and recall are used as a measurement of the relevance.
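In classification terms, precision is TP / (TP + FP) and recall is TP / (TP + FN), where TP, FP and FN are the counts of true positives, false positives and false negatives. A minimal sketch with made-up labels:

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Precision and recall for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels and predictions for five emails.
y_true = ["spam", "spam", "spam", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham"]
precision, recall = precision_recall(y_true, y_pred)
```

In this example two of the three emails flagged as spam really are spam, and two of the three actual spam emails were caught, so both values come out at two thirds.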

Receiver Operating Characteristics (ROC) Curve

A ROC curve provides a visual comparison of classification models, showing the trade-off between the true positive rate and the false positive rate. 

The area under the ROC curve is a measure of the accuracy of the model. The closer a model’s curve is to the diagonal, the less accurate the model is. A model with perfect accuracy will have an area of 1.0.
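The area under the ROC curve can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way for a small hypothetical set of scores; it is quadratic in the number of examples and meant for illustration only.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This equals the area under the ROC curve.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


y_true = [1, 1, 0, 0]
perfect = roc_auc(y_true, [0.9, 0.8, 0.2, 0.1])
uninformative = roc_auc(y_true, [0.5, 0.5, 0.5, 0.5])
```

A model that ranks every positive above every negative reaches 1.0, while a model that cannot tell the classes apart lands at 0.5, which corresponds to the diagonal.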

Multi-Class classification with Sci-kit learn & XGBoost: A case study using Brainwave data

by Avishek Nag (Machine Learning expert)

A comparison of different classifiers’ accuracy & performance for high-dimensional data

In machine learning, classification problems with high-dimensional data are really challenging. Sometimes, very simple problems become extremely complex due to this ‘curse of dimensionality’ problem.

In this article, we will see how accuracy and performance vary across different classifiers. We will also see how, when we don’t have the freedom to choose a classifier independently, we can do feature engineering to make a poor classifier perform well.

Understanding the ‘datasource’ & problem formulation

For this article, we will use the “EEG Brainwave Dataset” from Kaggle. This dataset contains electronic brainwave signals from an EEG headset and is in temporal format. At the time of writing this article, nobody has created any ‘Kernel’ on this dataset; that is, as of now, no solution has been published on Kaggle.

So, to start with, let’s first read the data to see what’s there.

There are 2549 columns in the dataset, and ‘label’ is the target column for our classification problem. All other columns, like ‘mean_d_1_a’, ‘mean_d2_a’ etc., describe features of the brainwave signal readings. Columns starting with the ‘fft’ prefix are most probably ‘Fast Fourier transforms’ of the original signals. Our target column ‘label’ describes the degree of emotional sentiment.

As per Kaggle, here is the challenge: “Can we predict emotional sentiment from brainwave readings?”

Let’s first understand class distributions from column ‘label’:

So, there are three classes, ‘POSITIVE’, ‘NEGATIVE’ & ‘NEUTRAL’, for emotional sentiment. From the bar chart, it is clear that class distribution is not skewed and it is a ‘multi-class classification’ problem with target variable ‘label’. We will try with different classifiers and see the accuracy levels.

Before applying any classifier, the column ‘label’ should be separated out from other feature columns (‘mean_d_1_a’, ‘mean_d2_a’ etc are features).

As it is a ‘classification’ problem, we will follow the below conventions for each ‘classifier’ to be tried:

  • We will use a ‘cross validation’ approach (in our case, 10-fold cross validation) over the dataset and take the average accuracy. This will give us a holistic view of the classifier’s accuracy.
  • We will use a ‘Pipeline’ based approach to combine all pre-processing and main classifier computation. An ML ‘Pipeline’ wraps all processing stages in a single unit and acts as a ‘classifier’ itself. This makes all stages re-usable, so they can also be used to form other ‘pipelines’.
  • We will track total time in building & testing for each approach. We will call this ‘time taken’.

For the above, we will primarily use the scikit-learn package from Python. As the number of features here is quite high, we will start with a classifier that works well on high-dimensional data.

RandomForest Classifier

‘RandomForest’ is a tree-based ensemble classifier that uses bagging. Through its probabilistic entropy calculation approach, it effectively relies on only a subset of the features. Let’s see that:

Accuracy is very good at 97.7% and ‘total time taken’ is quite short (3.29 seconds only).

For this classifier, no pre-processing stages like scaling or noise removal are required, because its tree-based splits compare each feature against thresholds and are not affected by scale factors.

Logistic Regression Classifier

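‘Logistic Regression’ is a linear classifier. Like linear regression, it fits a weighted sum of the input variables, but it passes that sum through a logistic function to produce class probabilities.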

We can see accuracy (93.19%) is lower than ‘RandomForest’ and ‘time taken’ is higher (2 min 7s).

‘Logistic Regression’ is heavily affected by different value ranges across the independent variables, and thus requires ‘feature scaling’. That’s why ‘StandardScaler’ from scikit-learn has been added as a preprocessing stage. It rescales each feature to zero mean and unit variance, which puts all variables on a comparable scale (individual values are not confined to the range -1 to +1).
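To see what standardisation does to a single feature, here is a small standard-library sketch. It is not the scikit-learn implementation, only the same idea of subtracting the mean and dividing by the standard deviation; the sample values are made up.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


raw = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical feature with one large value
scaled = standardize(raw)
```

After scaling, the feature has mean 0 and standard deviation 1, but note that an individual value can still be larger than 1 in magnitude, as the outlier in this example is.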

The reason for the high ‘time taken’ is the high dimensionality and the scaling time required. There are 2549 variables in the dataset, and the coefficient of each one has to be optimised by the Logistic Regression process. There is also the question of multicollinearity: linearly correlated variables should be grouped together instead of being considered separately.

The presence of multicollinearity affects accuracy. So now the question becomes, “Can we reduce the number of variables, reduce multicollinearity, and improve ‘time taken’?”

Principal Component Analysis (PCA)

PCA can project the original variables onto a lower-dimensional space of principal components and thus reduce the number of required variables. Collinear variables get combined into the same components. Let’s do a PCA of the data and see what the main PCs are:

We mapped 2549 variables to 20 Principal Components. From the above result, it is clear that the first 10 PCs matter most. The sum of the explained variance ratios of the first 10 PCs is around 0.737 (0.36 + 0.095 + ... + 0.012). In other words, the first 10 PCs explain 73.7% of the variance of the entire dataset.

So, with this we are able to reduce 2549 variables to 10 variables. That’s a dramatic change, isn’t it? In theory, Principal Components are virtual variables generated from mathematical mapping. From a business angle, it is not possible to tell which physical aspect of the data is covered by them. That means, physically, that Principal Components don’t exist. But, we can easily use these PCs as quantitative input variables to any ML algorithm and get very good results.

For visualisation, let’s take the first two PCs and see how we can distinguish different classes of the data using a ‘scatterplot’.

In the above plot, the three classes are shown in different colours. So, if we use the same ‘Logistic Regression’ classifier with these two PCs, then from the above plot we can probably say that the first classifier will separate out ‘NEUTRAL’ cases from the other two, and the second classifier will separate ‘POSITIVE’ from ‘NEGATIVE’ cases (as there will be two internal logistic classifiers for a 3-class problem). Let’s try it and see the accuracy.

Time taken (3.34 s) was reduced but accuracy (77%) decreased.

Now, let’s take all 10 PCs and run:

We see an improvement in accuracy (86%) compared to the 2-PC case, with a marginal increase in ‘time taken’.

So, in both cases we saw low accuracy compared to normal logistic regression, but a significant improvement in ‘time taken’.

Accuracy can be further tested with a different ‘solver’ & ‘max_iter’ parameter. We used ‘saga’ as ‘solver’ with L1 penalty and 200 as ‘max_iter’. These values can be changed to get a variable effect on accuracy.

Though ‘Logistic Regression’ gives lower accuracy here, there are situations where it may be needed, especially together with PCA. In datasets with a very large dimensional space, PCA becomes the obvious choice for ‘linear classifiers’.

In some cases, where a benchmark for ML applications is already defined and only limited choices of some ‘linear classifiers’ are available, this analysis would be helpful. It is very common to see such situations in large organisations where standards are already defined and it is not possible to go beyond them.

Artificial Neural Network Classifier (ANN)

An ANN classifier is non-linear with automatic feature engineering and dimensional reduction techniques. ‘MLPClassifier’ in scikit-learn works as an ANN. But here also, basic scaling is required for the data. Let’s see how it works:

Accuracy (97.5%) is very good, though running time is high (5 min).

The reason for high ‘time taken’ is the rigorous training time required for neural networks, and that too with a high number of dimensions.

It is a general convention to start with a hidden layer size of 50% of the number of input features, with each subsequent layer 50% of the previous one. In our case these are (1275 = 2549 / 2, 637 = 1275 / 2). The number of hidden layers can be treated as a hyper-parameter and tuned for better accuracy. In our case it is 2.

Linear Support Vector Machines Classifier (SVM)

We will now apply ‘Linear SVM’ to the data and see what accuracy it achieves. Here, too, scaling is required as a preprocessing stage.

Accuracy comes in at 96.4%, which is a little less than ‘RandomForest’ or ‘ANN’. ‘Time taken’ is 55 s, which is far better than ‘ANN’.

Extreme Gradient Boosting Classifier (XGBoost)

XGBoost is a boosted tree based ensemble classifier. Like ‘RandomForest’, it will also automatically reduce the feature set. For this we have to use a separate ‘xgboost’ library which does not come with scikit-learn. Let’s see how it works:

Accuracy (99.4%) is exceptionally good, but ‘time taken’ (15 min) is quite high. Nowadays, for complicated problems, XGBoost is becoming a default choice for data scientists because of its accurate results. It has a high running time due to its internal ensemble model structure. However, XGBoost performs well on GPU machines.

From all of the classifiers, it is clear that for accuracy ‘XGBoost’ is the winner. But if we take ‘time taken’ along with ‘accuracy’, then ‘RandomForest’ is a perfect choice. But we also saw how to use a simple linear classifier like ‘logistic regression’ with proper feature engineering to give better accuracy. Other classifiers don’t need that much feature engineering effort.

It depends on the requirements, use case, and data engineering environment available to choose a perfect ‘classifier’.

The entire project on Jupyter Notebook can be found here.

Lesson 9 of 38 By Mayank Banoula

Everything You Need to Know About Classification in Machine Learning

Imagine opening your cupboard to see that everything is jumbled up. You will find it very difficult and time-consuming to take what you need. If everything were grouped, it would be so simple. That is what machine learning classification algorithms do.

What is Supervised Learning?

Before we dive into Classification, let’s take a look at what Supervised Learning is. Suppose you are learning a new concept in maths. After solving a problem, you may refer to the solutions to see whether you were right. Once you are confident in your ability to solve a particular type of problem, you will stop referring to the answers and solve the questions put before you by yourself.

This is also how Supervised Learning works with machine learning models. In Supervised Learning, the model learns by example. Along with our input variable, we also give our model the corresponding correct labels. While training, the model gets to look at which label corresponds to our data and hence can find patterns between our data and those labels.

Some examples of Supervised Learning include:

  • Spam detection, where you teach a model which mail is spam and which is not.
  • Speech recognition, where you teach a machine to recognize your voice.
  • Object recognition, where you show a machine what an object looks like and have it pick that object out from among other objects.

We can further divide Supervised Learning into the following:

  Figure 1: Supervised Learning Subdivisions

What is Classification?

Classification is defined as the process of recognizing, understanding, and grouping objects and ideas into preset categories, a.k.a. “sub-populations.” With the help of pre-categorized training datasets, classification programs in machine learning leverage a wide range of algorithms to classify future datasets into the relevant categories.

Classification algorithms used in machine learning utilize input training data for the purpose of predicting the likelihood or probability that the data that follows will fall into one of the predetermined categories. One of the most common applications of classification is for filtering emails into “spam” or “non-spam”, as used by today’s top email service providers.

In short, classification is a form of “pattern recognition.” Here, classification algorithms applied to the training data find the same patterns (similar number sequences, words or sentiments, and the like) in future data sets.

We will explore classification algorithms in detail and discover how text analysis software can perform tasks like sentiment analysis, which categorizes unstructured text by opinion polarity (positive, negative, neutral, and the like).

                      Figure 2: Classification of vegetables and groceries

What is Classification Algorithm?

A classification algorithm is a Supervised Learning technique that uses training data to categorize new observations. In classification, a program learns from the dataset or observations provided how to categorize new observations into various classes or groups, for instance 0 or 1, red or blue, yes or no, spam or not spam. Classes can also be called targets, labels, or categories. Because it is a supervised learning technique, a classification algorithm uses labeled data that comprises both input and output information. In the classification process, an input variable (x) is mapped to a discrete output (y).

In simple words, classification is a type of pattern recognition in which classification algorithms are performed on training data to discover the same pattern in new data sets.

Learners in Classification Problems

There are two types of learners.

Lazy Learners

A lazy learner first stores the training dataset and then waits for the test dataset to arrive. Classification is carried out using the most closely related data in the stored training dataset. Less time is spent on training, but more time is spent on predictions. Examples include case-based reasoning and the KNN algorithm.

Eager Learners

Eager learners build a classification model from a training dataset before obtaining a test dataset. They spend more time training and less time predicting. Examples include ANN, naive Bayes, and decision trees.

Now, let us discuss four types of Classification Tasks in Machine Learning.

4 Types Of Classification Tasks In Machine Learning

Before diving into the four types of Classification Tasks in Machine Learning, let us first discuss Classification Predictive Modeling.

Classification Predictive Modeling

A classification problem in machine learning is one in which a class label is predicted for a specific example of input data.

Examples of classification problems include the following:

  • Given an example, classify whether it is spam or not.
  • Identify a handwritten character as one of the recognized characters.
  • Determine whether to label the current user behavior as churn.

A training dataset with numerous examples of inputs and outputs is necessary for classification from a modeling standpoint.

A model will determine the optimal way to map samples of input data to certain class labels using the training dataset. The training dataset must therefore contain a large number of samples of each class label and be suitably representative of the problem.

When providing class labels to a modeling algorithm, string values like "spam" or "not spam" must first be converted to numeric values. Label encoding, which is frequently used, assigns a distinct integer to every class label, such as "spam" = 0 and "not spam" = 1.
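Label encoding is simple enough to sketch by hand. The helper below is hypothetical and assigns integers in sorted order of the label strings; which integer each class receives is arbitrary, as long as the mapping is applied consistently and can be reversed.

```python
def fit_label_encoder(labels):
    """Map each distinct class label to an integer, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}


labels = ["spam", "not spam", "spam", "not spam"]
encoder = fit_label_encoder(labels)
encoded = [encoder[label] for label in labels]
decoder = {code: label for label, code in encoder.items()}
```

Keeping the reverse mapping (`decoder` here) lets you turn the model's integer predictions back into readable class names.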

There are many different types of algorithms for classification predictive modeling problems.

Because there is no strong theory on how to map algorithms onto problem types, it is typically advised that a practitioner run controlled experiments to determine which algorithm and algorithm configuration produce the best performance for a given classification task.

Classification predictive modeling algorithms are assessed based on their output. Classification accuracy is a common statistic for assessing a model's performance based on predicted class labels. Although not perfect, classification accuracy is a reasonable place to start for many classification tasks.

Some tasks may call for a class membership probability prediction for each example rather than class labels. This conveys the uncertainty of the prediction, which a user or application can subsequently interpret. The ROC curve is a popular diagnostic for assessing predicted probabilities.

There are four different types of Classification Tasks in Machine Learning:

  • Binary classification
  • Multi-class classification
  • Multi-label classification
  • Imbalanced classification

Now, let us look at each of them in detail.

Binary Classification

Classification tasks with only two class labels are referred to as binary classification.

Examples include:

  • Prediction of conversion (buy or not).
  • Churn forecast (churn or not).
  • Detection of spam email (spam or not).

Binary classification problems often involve two classes, one representing the normal state and the other representing the abnormal state.

For instance, the normal condition is "not spam," while the abnormal state is "spam." Another illustration is when a task involving a medical test has a normal condition of "cancer not identified" and an abnormal state of "cancer detected."

Class label 0 is given to the class in the normal state, whereas class label 1 is given to the class in the abnormal condition.

A binary classification task is frequently modeled with a model that predicts a Bernoulli probability distribution for each example.

The Bernoulli distribution is a discrete probability distribution that covers the situation where an event has a binary outcome of either 0 or 1. In terms of classification, this means that the model predicts the probability that an example belongs to class 1, or the abnormal state.

The following are well-known binary classification algorithms:

  • Logistic Regression
  • Support Vector Machines
  • Naive Bayes
  • Decision Trees

Some algorithms, such as Support Vector Machines and Logistic Regression, were created expressly for binary classification and do not by default support more than two classes.

Let us now discuss Multi-Class Classification.

Multi-Class Classification

Multi-class classification refers to classification tasks that have more than two class labels.

Examples include:

  • Face classification.
  • Plant species classification.
  • Optical character recognition.

Unlike binary classification, multi-class classification does not have the idea of normal and abnormal outcomes. Instead, examples are classified as belonging to one of a range of known classes.

In some cases, the number of class labels could be rather high. In a facial recognition system, for instance, a model might predict that a photo belongs to one of thousands or tens of thousands of faces.

Text translation models and other problems involving word prediction could be categorized as a particular case of multi-class classification. Each word in the sequence of words to be predicted requires a multi-class classification, where the vocabulary size determines the number of possible classes that may be predicted and may range from tens of thousands to hundreds of thousands of words.

Multiclass classification tasks are frequently modeled using a model that forecasts a Multinoulli probability distribution for each example.

The Multinoulli distribution is a discrete probability distribution that covers an event with a categorical outcome, for example k in {1, 2, 3, ..., K}. In terms of classification, this means that the model predicts the probability that a given example belongs to each class label.
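One common way to obtain such a distribution over K classes is the softmax function. The sketch below is illustrative only: the scores are made up, and the classes are indexed from 0 rather than from 1.

```python
import math


def softmax(scores):
    """Convert K real-valued scores into K probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])

# The predicted class is the one with the highest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```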

For multi-class classification, many binary classification techniques are applicable.

The following well-known algorithms can be used for multi-class classification:

  • Gradient Boosting
  • Decision Trees
  • k-Nearest Neighbors
  • Random Forest

Multi-class problems can be solved using algorithms created for binary classification.

To do this, a strategy known as "one-vs-rest" or "one-vs-one" is used, which involves fitting multiple binary classification models: either one model for each class versus all other classes (one-vs-rest) or one model for each pair of classes (one-vs-one).

  • One-vs-Rest: Fit a single binary classification model for each class versus all other classes.
  • One-vs-One: For each pair of classes, fit a single binary classification model.

The following binary classification algorithms can apply these multi-class classification techniques:

  • Logistic Regression
  • Support Vector Machine
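The difference between the two strategies is easiest to see by counting the binary problems each one creates. The sketch below only enumerates those problems and relabels the targets; it does not train any model, and the class names are hypothetical.

```python
from itertools import combinations


def one_vs_rest_tasks(classes):
    """One binary task per class: that class versus all the others."""
    return [(c, [other for other in classes if other != c]) for c in classes]


def one_vs_one_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))


def relabel_one_vs_rest(labels, positive_class):
    """Binary targets for a single one-vs-rest task."""
    return [1 if y == positive_class else 0 for y in labels]


classes = ["cat", "dog", "bird", "fish"]
ovr = one_vs_rest_tasks(classes)  # K tasks
ovo = one_vs_one_tasks(classes)   # K * (K - 1) / 2 tasks
print(len(ovr), len(ovo))
```

With K classes, one-vs-rest needs K models while one-vs-one needs K(K-1)/2, so one-vs-one grows much faster as the number of classes increases.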

Let us now learn about Multi-Label Classification.

Multi-label classification problems are those that feature two or more class labels and allow for the prediction of one or more class labels for each example.

Think about the photo classification example. Here a model can predict the existence of many known things in a photo, such as “person”, “apple”, "bicycle," etc. A particular photo may have multiple objects in the scene.

This contrasts sharply with multi-class and binary classification, which predict a single class label for each example.

Multi-label classification problems are frequently modeled using a model that forecasts many outcomes, with each outcome being forecast as a Bernoulli probability distribution. In essence, this approach predicts several binary classifications for each example.
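A minimal sketch of this "several binary classifications per example" view is shown below. The per-label probabilities are invented for illustration, and the 0.5 threshold is an assumption.

```python
def predict_labels(label_probs, threshold=0.5):
    """Keep every label whose independent probability reaches the threshold."""
    return [name for name, p in label_probs.items() if p >= threshold]


# Hypothetical per-label probabilities for one photo.
# Each value is a separate Bernoulli parameter, so they need not sum to 1.
photo_probs = {"person": 0.92, "apple": 0.10, "bicycle": 0.75}
present = predict_labels(photo_probs)
print(present)
```

Unlike the multi-class case, zero, one or several labels can be returned for the same example.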

Classification algorithms designed for binary or multi-class classification cannot be applied directly to multi-label classification. Specialized versions of the conventional classification algorithms, the so-called multi-label versions, can be used instead. They include:

  • Multi-label Gradient Boosting
  • Multi-label Random Forests
  • Multi-label Decision Trees

Another strategy is to use a separate classification algorithm to predict the labels for each class.

Now, we will look into the Imbalanced Classification Task in detail.

The term "imbalanced classification" describes classification jobs where the distribution of examples within each class is not equal.

Imbalanced classification tasks are typically binary classification tasks in which the majority of the training examples belong to the normal class and a minority belong to the abnormal class. Examples include:

  • Medical diagnostic tests
  • Outlier detection
  • Fraud detection

These problems are modeled as binary classification tasks, although they may require specialized techniques.

Specialized techniques can be used to change the composition of samples in the training dataset by undersampling the majority class or oversampling the minority class. Examples include:

  • SMOTE Oversampling
  • Random Undersampling
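Random undersampling is the simpler of the two techniques to sketch. The version below keeps every minority example and draws an equally sized random subset from each larger class. The function name and the toy data are assumptions, and SMOTE, which synthesizes new minority examples, is not shown.

```python
import random


def random_undersample(rows, labels, seed=0):
    """Shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)

    smallest = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for y, group in by_class.items():
        # Sample without replacement so no example is duplicated.
        for row in rng.sample(group, smallest):
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels


rows = list(range(10))        # toy examples, identified by index
labels = [0] * 8 + [1] * 2    # 8 majority examples, 2 minority examples
bal_rows, bal_labels = random_undersample(rows, labels)
print(bal_rows, bal_labels)
```

Undersampling discards data, so it is usually only attractive when the majority class is large enough to spare examples.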

Specialized modeling algorithms, such as cost-sensitive machine learning algorithms, can be used to pay more attention to the minority class when fitting the model to the training dataset.

Examples include:

  • Cost-sensitive Support Vector Machines
  • Cost-sensitive Decision Trees
  • Cost-sensitive Logistic Regression
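Cost-sensitive variants usually need a weight per class. One common heuristic, assumed here for illustration, sets each weight inversely proportional to the class frequency, so that errors on the rare class cost more during training.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}


# Toy imbalanced data set: 90 normal examples (0) and 10 abnormal ones (1).
weights = balanced_class_weights([0] * 90 + [1] * 10)
print(weights)
```

How the weights enter the model depends on the algorithm; a typical choice is to multiply each example's loss by the weight of its class.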

Because classification accuracy can be misleading on imbalanced data, alternative performance metrics may be necessary.
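A small worked example shows why accuracy alone can mislead. The data below are invented: 10 abnormal cases among 1,000, and a trivial "model" that always predicts the normal class.

```python
# 10 abnormal examples (1) and 990 normal examples (0).
y_true = [1] * 10 + [0] * 990
# A useless classifier that always predicts the normal class.
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Share of the abnormal cases that the classifier actually finds.
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)

print(accuracy, recall)
```

The classifier is 99% accurate yet detects none of the abnormal cases, which is exactly the failure that accuracy hides.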

Now, we will be discussing the types of Machine Learning Classification Algorithms.

Types of Classification Algorithms

You can apply many different classification methods, depending on the dataset you are working with, because the study of classification in statistics is extensive. Six widely used classification algorithms are described below.

1. Logistic Regression

Logistic regression is a supervised learning classification technique that predicts the probability of a target variable. In its basic form the target has only two classes: the data are coded as 1 (yes, success) or 0 (no, failure). It is a good fit when the outcome to be predicted is categorical, such as true or false, yes or no, or 0 or 1. For example, logistic regression can be used to determine whether or not an email is spam.

2. Naive Bayes

Naive Bayes determines whether a data point falls into a particular category. It can be used to classify phrases or words in text analysis as either falling within a predetermined classification or not.

Text | Tag
“A great game” | Sports
“The election is over” | Not Sports
“What a great score” | Sports
“A clean and unforgettable game” | Sports
“The spelling bee winner was a surprise” | Not Sports
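The five tagged sentences above are enough to sketch a tiny Naive Bayes text classifier. The version below is an illustrative assumption rather than a reference implementation: it uses word counts with add-one (Laplace) smoothing, works in log space, and the test sentence "a very close game" is made up.

```python
import math
from collections import Counter

TRAIN = [
    ("a great game", "Sports"),
    ("the election is over", "Not Sports"),
    ("what a great score", "Sports"),
    ("a clean and unforgettable game", "Sports"),
    ("the spelling bee winner was a surprise", "Not Sports"),
]


def train(examples):
    """Count documents per tag, words per tag, and collect the vocabulary."""
    doc_counts = Counter(tag for _, tag in examples)
    word_counts = {tag: Counter() for tag in doc_counts}
    for text, tag in examples:
        word_counts[tag].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return doc_counts, word_counts, vocab


def classify(text, doc_counts, word_counts, vocab):
    """Return the tag with the highest log prior plus log likelihood."""
    total_docs = sum(doc_counts.values())
    scores = {}
    for tag in doc_counts:
        total_words = sum(word_counts[tag].values())
        score = math.log(doc_counts[tag] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing: unseen words get a small non-zero probability.
            count = word_counts[tag][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[tag] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
prediction = classify("a very close game", *model)
print(prediction)
```

The sentence is tagged "Sports" because "a" and "game" occur more often in the Sports examples, while the unseen words "very" and "close" contribute almost equally to both tags.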

3. K-Nearest Neighbors

k-NN predicts the group a data point belongs to based on the groups of the data points closest to it. When using k-NN for classification, a new data point is assigned the class that is most common among its k nearest neighbors.
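A minimal sketch of k-NN classification, assuming Euclidean distance and a simple majority vote, is given below. The points, labels and the choice of k = 3 are illustrative.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Two toy groups in a two-dimensional feature space.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (1.2, 1.4)))
print(knn_predict(points, labels, (6.8, 7.0)))
```

Because no model is built in advance and all the work happens at prediction time, k-NN is a lazy learner.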

4. Decision Tree

A decision tree is an example of supervised learning. Although it can solve regression and classification problems, it excels in classification problems. Similar to a flow chart, it divides data points into two similar groups at a time, starting with the "tree trunk" and moving through the "branches" and "leaves" until the categories are more closely related to one another.

5. Random Forest Algorithm

The random forest algorithm is an extension of the decision tree algorithm. You first build a number of decision trees from the training data, typically each on a random sample of the data, and then pass new data through every tree in the resulting "random forest". The forest combines the trees' outputs, by majority vote for classification or by averaging for regression, to produce the final prediction. These models help reduce a single decision tree's tendency to force data points unnecessarily into a category (overfitting).

6. Support Vector Machine

Support Vector Machine is a popular supervised machine learning technique for classification and regression problems. It classifies data by finding the boundary (hyperplane) that best separates the classes, so that each data point falls on one side of the boundary or the other.

Types of ML Classification Algorithms

1. Supervised Learning

The supervised learning approach explicitly trains algorithms under close human supervision. Both the input and the output data are first provided to the algorithm. The algorithm then develops rules that map the input to the output. The training procedure is repeated until the highest level of performance is attained.

The two types of supervised learning approaches are:

  • Regression
  • Classification

2. Unsupervised Learning

This approach is applied to examine the inherent structure of data and derive insightful information from it. It looks for patterns in unlabeled data that can produce better results.

There are two types of unsupervised learning:

  • Clustering
  • Dimensionality reduction

3. Semi-supervised Learning

Semi-supervised learning lies on the spectrum between unsupervised and supervised learning. It combines aspects of both, typically training on a small amount of labeled data together with a larger amount of unlabeled data.

4. Reinforcement Learning

The goal of reinforcement learning is to create autonomous, self-improving algorithms. The algorithm improves itself through a continual cycle of trial and error, based on the feedback it receives from its interactions with its environment.

Classification Models

  • Naive Bayes : Naive Bayes is a classification algorithm that assumes that predictors in a dataset are independent. This means that it assumes the features are unrelated to each other. For example, if given a banana, the classifier will see that the fruit is of yellow color, oblong-shaped and long and tapered. All of these features will contribute independently to the probability of it being a banana and are not dependent on each other. Naive Bayes is based on Bayes’ theorem, which is given as:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Figure 3 : Bayes’ Theorem

         Where :

         P(A | B) = how often A happens given that B happens

         P(A) = how likely A will happen

         P(B) = how likely B will happen

         P(B | A) = how often B happens given that A happens
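A short worked example may make the theorem concrete. All the numbers below are invented: A is "the email is spam" and B is "the email contains a particular word".

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


p_a = 0.2               # P(A): 20% of emails are spam (assumed)
p_b_given_a = 0.5       # P(B | A): the word appears in half of spam emails (assumed)
p_b_given_not_a = 0.05  # the word appears in 5% of non-spam emails (assumed)

# P(B) by the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = bayes_posterior(p_b_given_a, p_a, p_b)
print(round(p_b, 2), round(posterior, 3))
```

Seeing the word raises the probability of spam from the prior of 0.2 to roughly 0.71.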

  • Decision Trees : A Decision Tree is an algorithm that is used to visually represent decision-making. A Decision Tree can be made by asking a yes/no question and splitting the answer to lead to another decision. The question is at the node and it places the resulting decisions below at the leaves. The tree depicted below is used to decide if we can play tennis.

Figure 4: Decision Tree

In the above figure, depending on the weather conditions and the humidity and wind, we can systematically decide if we should play tennis or not. In decision trees, all the False statements lie on the left of the tree and the True statements branch off to the right. Knowing this, we can make a tree which has the features at the nodes and the resulting classes at the leaves.

  • K-Nearest Neighbors: K-Nearest Neighbors is a classification and prediction algorithm that is used to divide data into classes based on the distance between the data points. It assumes that data points which are close to one another must be similar and hence, the data point to be classified will be grouped with the closest cluster.

Figure 5: Data to be classified

Figure 6: Classification using K-Nearest Neighbours

Evaluating a Classification Model

After our model is finished, whether it is a regression or a classification model, we must assess its performance. We have the following options for assessing a classification model:

1. Confusion Matrix

  • The confusion matrix describes the model performance and gives us a matrix or table as an output.
  • The error matrix is another name for it.
  • The matrix is made up of the results of the forecasts in a condensed manner, together with the total number of right and wrong guesses. 

The matrix appears in the following table:

 | Actual Positive | Actual Negative
Predicted Positive | True Positive | False Positive
Predicted Negative | False Negative | True Negative

Accuracy = (TP+TN)/Total Population
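The four cells of the matrix and the accuracy formula can be computed directly from paired lists of actual and predicted labels. The helper name and the toy labels below are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1   # predicted positive, actually positive
        elif predicted == positive:
            fp += 1   # predicted positive, actually negative
        elif actual == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return tp, fp, fn, tn


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(tp, fp, fn, tn, accuracy)
```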

2. Log Loss or Cross-Entropy Loss 

  • It is used to assess a classifier's performance when the output is a probability value between 0 and 1.
  • A good binary classification model should have a log loss value that is close to 0.
  • The value of log loss increases as the predicted probability diverges from the actual label.
  • A lower log loss indicates a more accurate model.

Cross-entropy for binary classification can be calculated as: 

$$-\left(y \log(p) + (1 - y)\log(1 - p)\right)$$

Where p = Predicted Output, y = Actual output.
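As a rough illustration, the binary cross-entropy can be averaged over a set of examples as follows. The clipping constant eps is an assumption added to avoid taking log(0), and the probabilities are invented.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Average of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Confident and correct predictions give a loss close to 0 ...
confident_right = binary_log_loss([1, 0], [0.9, 0.1])
# ... while confident and wrong predictions are penalized heavily.
confident_wrong = binary_log_loss([1, 0], [0.1, 0.9])
print(round(confident_right, 4), round(confident_wrong, 4))
```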

3. AUC-ROC Curve

  • AUC stands for Area Under the Curve, and ROC stands for Receiver Operating Characteristic.
  • The ROC curve is a graph that displays the classification model's performance at various thresholds.
  • The AUC-ROC curve is used to visualize how well a classification model performs; for multi-class models it is typically drawn for one class versus the rest.
  • The ROC curve is drawn by plotting the True Positive Rate (TPR) on the Y-axis against the False Positive Rate (FPR) on the X-axis.
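The points of a ROC curve can be sketched by sweeping a threshold over the predicted scores and recording the FPR and TPR at each step. The helper names and the four toy scores are assumptions, and the area is approximated with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, starting at (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
print(points, auc(points))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.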

Now, let us discuss the use cases of Classification Algorithms.

Use Cases Of Classification Algorithms

There are many applications for classification algorithms. Here are a few of them:

  • Speech Recognition
  • Detecting Spam Emails
  • Categorization of Drugs
  • Cancer Tumor Cell Identification
  • Biometric Authentication, etc.

Classifier Evaluation

After a classifier has been built, the most crucial step is to evaluate its accuracy and effectiveness. We can evaluate a classifier in a variety of ways. Let's look at the techniques described below, beginning with Cross-Validation.

Cross-Validation

The most prominent issue with most machine learning models is over-fitting. It is possible to check the model's overfitting with K-fold cross-validation.

With this technique, the data set is randomly divided into k equal-sized, mutually exclusive subsets. One is retained for testing, while the others are utilized for training the model. For each of the k folds, the same procedure is followed.
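The splitting step can be sketched without any library support. The generator below is an illustrative assumption: it shuffles the example indices once and yields one (train, test) pair per fold; when the data set size is not divisible by k, fold sizes differ by at most one.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k mutually exclusive subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print(sorted(test), len(train))
```

A model would be trained and scored once per pair, and the k scores averaged to estimate how well it generalizes.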

Holdout Method

This is the approach used most frequently to assess classifiers. In this method, the given data set is split into a train set and a test set, typically comprising 80% and 20% of the total data, respectively.

The model is trained using the train set, and the unseen test set is then used to evaluate its predictive ability.
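A holdout split can be sketched in a few lines. The function name, the fixed seed and the toy data are assumptions, and the 20% test fraction follows the proportions mentioned in the text.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = holdout_split(list(range(100)))
print(len(train_rows), len(test_rows))
```

Shuffling before splitting matters when the rows are stored in a meaningful order, for example sorted by class.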

For a visual comparison of classification models, the ROC (receiver operating characteristic) curve is used. It illustrates the trade-off between the true positive rate and the false positive rate. The area under the ROC curve is a measure of the model's accuracy.

Bias and Variance

Bias is the difference between our actual and predicted values. It arises from the simplifying assumptions that our model makes about our data in order to predict on new data, and it directly reflects how well the model captures the patterns found in our data. When the bias is high, the assumptions made by our model are too basic and the model cannot capture the important features of our data. This is called underfitting.

Figure 7: Bias

We can define variance as the model’s sensitivity to fluctuations in the data. Our model may learn from noise, which causes it to treat trivial features as important. When the variance is high, our model captures all the features of the data given to it, tunes itself to that data and predicts on it very well. New data, however, may not have exactly the same features, and the model will not predict on it as well. We call this overfitting.

Figure 8: Example of Variance

Precision and Recall  

Precision measures how many of the data points the model assigned to a class actually belong to it. It is given by dividing the number of correctly classified data points by the total number of data points classified under that class label:

Precision = TP/(TP+FP)

TP = True Positives, when our model correctly classifies the data point to the class it belongs to.

FP = False Positives, when the model falsely classifies the data point as belonging to the class.

Recall measures the ability of the model to find the positive values. It answers the question, "How often does the model predict the correct positive values?" It is calculated as the ratio of true positives to the total number of actual positive values:

Recall = TP/(TP+FN)

FN = False Negatives, when the model fails to assign a data point to the class it actually belongs to.
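Both measures follow directly from the counts of true positives (TP), false positives (FP) and false negatives (FN, positives the model missed). The sketch below uses invented labels, and returning 0.0 when a denominator is zero is an assumption rather than a universal convention.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # correct among predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0     # found among actual positives
    return precision, recall


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(round(precision, 3), recall)
```

Here the model is right about two of its three positive predictions but finds only two of the four actual positives.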

Now, let us look at Algorithm Selection.

Algorithm Selection

In addition to the strategy described above, we may apply the procedures listed below to choose the optimum algorithm for the model.

  • Read the data.
  • Based on our independent and dependent features, create independent and dependent data sets.
  • Create training and test sets for the data.
  • Train the model with several algorithms, such as SVM, Decision Tree, KNN, etc.
  • Evaluate each classifier.
  • Choose the most accurate classifier.

Although it may take longer than expected to select the optimum algorithm for your model, accuracy is the best way to make your model efficient.

Our Learners Also Asked

1. What is a classification algorithm, with an example?

Classification involves predicting a class label for a specific example of input data. For example, it can identify whether or not an email is spam, or classify a handwritten character as one of the known characters.

2. What is the best classification algorithm?

No single classification algorithm is best for every problem. The Naive Bayes classifier often produces good results compared to other classification algorithms such as Logistic Regression, Support Vector Machines, and Decision Trees, but the best choice depends on the dataset.

3. What is the most straightforward classification algorithm?

One of the most straightforward classification techniques is kNN.

4. Classifier vs. Algorithm in Machine Learning?

The classifier is the technique, or set of rules, that computers use to categorize data. The classification model is the result of the classifier's machine learning: the classifier is used to train the model, which then classifies your data.

5. What are classification and types?

Classification is the process of organizing objects into groups or types. You can encounter the following four categories of classification tasks: Binary, Multi-class, Multi-label, and Imbalanced classification.

6. What is the difference between classification and clustering?

The goal of clustering is to group similar items together, so that items in the same group are more similar to each other than to items in other groups. This differs from classification, where the goal is to predict the target class.

In conclusion, classification can be considered a standard supervised learning activity. It is a valuable strategy that we use while attempting to determine whether a specific example falls into a given category or not.

About the Author

Mayank Banoula

Mayank is a Research Analyst at Simplilearn. He is proficient in machine learning and artificial intelligence with Python.

  • Open access
  • Published: 09 September 2022

Machine learning in project analytics: a data-driven framework and case study

  • Shahadat Uddin 1 ,
  • Stephen Ong 1 &
  • Haohui Lu 1  

Scientific Reports volume 12, Article number: 15252 (2022)

  • Applied mathematics
  • Computational science

The analytic procedures incorporated to facilitate the delivery of projects are often referred to as project analytics. Existing techniques focus on retrospective reporting and understanding the underlying relationships to make informed decisions. Although machine learning algorithms have been widely used in addressing problems within various contexts (e.g., streamlining the design of construction projects), limited studies have evaluated pre-existing machine learning methods within the delivery of construction projects. Due to this, the current research aims to contribute further to this convergence between artificial intelligence and the execution of construction projects through the evaluation of a specific set of machine learning algorithms. This study proposes a machine learning-based data-driven research framework for addressing problems related to project analytics. It then illustrates an example of the application of this framework. In this illustration, existing data from an open-source data repository on construction projects and cost overrun frequencies was studied in which several machine learning models (Python’s Scikit-learn package) were tested and evaluated. The data consisted of 44 independent variables (from materials to labour and contracting) and one dependent variable (project cost overrun frequency), which has been categorised for processing under several machine learning models. These models include support vector machine, logistic regression, k-nearest neighbour, random forest, stacking (ensemble) model and artificial neural network. Feature selection and evaluation methods, including the Univariate feature selection, Recursive feature elimination, SelectFromModel and confusion matrix, were applied to determine the most accurate prediction model. This study also discusses the generalisability of using the proposed research framework in other research contexts within the field of project management.
The proposed framework, its illustration in the context of construction projects and its potential to be adopted in different contexts will significantly contribute to project practitioners, stakeholders and academics in addressing many project-related issues.

Introduction

Successful projects require the presence of appropriate information and technology 1 . Project analytics provides an avenue for informed decisions to be made through the lifecycle of a project. Project analytics applies various statistics (e.g., earned value analysis or Monte Carlo simulation) among other models to make evidence-based decisions. They are used to manage risks as well as project execution 2 . There is a tendency for project analytics to be employed due to other additional benefits, including an ability to forecast and make predictions, benchmark with other projects, and determine trends such as those that are time-dependent 3 , 4 , 5 . There has been increasing interest in project analytics and how current technology applications can be incorporated and utilised 6 . Broadly, project analytics can be understood on five levels 4 . The first is descriptive analytics which incorporates retrospective reporting. The second is known as diagnostic analytics , which aims to understand the interrelationships and underlying causes and effects. The third is predictive analytics which seeks to make predictions. Subsequent to this is prescriptive analytics , which prescribes steps following predictions. Finally, cognitive analytics aims to predict future problems. The first three levels can be applied with ease with the help of technology. The fourth and fifth steps require data that is generally more difficult to obtain as they may be less accessible or unstructured. Further, although project key performance indicators can be challenging to define 2 , identifying common measurable features facilitates this 7 . It is anticipated that project analytics will continue to experience development due to its direct benefits to the major baseline measures focused on productivity, profitability, cost, and time 8 . The nature of project management itself is fluid and flexible, and project analytics allows an avenue for which machine learning algorithms can be applied 9 .

Machine learning within the field of project analytics falls into the category of cognitive analytics, which deals with problem prediction. Generally, machine learning explores the possibilities of computers to improve processes through training or experience 10 . It can also build on the pre-existing capabilities and techniques prevalent within management to accomplish complex tasks 11 . Due to its practical use and broad applicability, recent developments have led to the invention and introduction of newer and more innovative machine learning algorithms and techniques. Artificial intelligence, for instance, allows for software to develop computer vision, speech recognition, natural language processing, robot control, and other applications 10 . Specific to the construction industry, it is now used to monitor construction environments through a virtual reality and building information modelling replication 12 or risk prediction 13 . Within other industries, such as consumer services and transport, machine learning is being applied to improve consumer experiences and satisfaction 10 , 14 and reduce the human errors of traffic controllers 15 . Recent applications and development of machine learning broadly fall into the categories of classification, regression, ranking, clustering, dimensionality reduction and manifold learning 16 . Current learning models include linear predictors, boosting, stochastic gradient descent, kernel methods, and nearest neighbour, among others 11 . Newer and more applications and learning models are continuously being introduced to improve accessibility and effectiveness.

Specific to the management of construction projects, other studies have also been made to understand how copious amounts of project data can be used 17 , the importance of ontology and semantics throughout the nexus between artificial intelligence and construction projects 18 , 19 as well as novel approaches to the challenges within this integration of fields 20 , 21 , 22 . There have been limited applications of pre-existing machine learning models on construction cost overruns. They have predominantly focussed on applications to streamline the design processes within construction 23 , 24 , 25 , 26 , and those which have investigated project profitability have not incorporated the types and combinations of algorithms used within this study 6 , 27 . Furthermore, existing applications have largely been skewed towards one type or another 28 , 29 .

In addition to the frequently used earned value method (EVM), researchers have been applying many other powerful quantitative methods to address a diverse range of project analytics research problems over time. Examples of those methods include time series analysis, fuzzy logic, simulation, network analytics, and network correlation and regression. Time series analysis uses longitudinal data to forecast an underlying project's future needs, such as the time and cost 30 , 31 , 32 . Few other methods are combined with EVM to find a better solution for the underlying research problems. For example, Narbaev and De Marco 33 integrated growth models and EVM for forecasting project cost at completion using data from construction projects. For analysing the ongoing progress of projects having ambiguous or linguistic outcomes, fuzzy logic is often combined with EVM 34 , 35 , 36 . Yu et al. 36 applied fuzzy theory and EVM for schedule management. Ponz-Tienda et al. 35 found that using fuzzy arithmetic on EVM provided more objective results in uncertain environments than the traditional methodology. Bonato et al. 37 integrated EVM with Monte Carlo simulation to predict the final cost of three engineering projects. Batselier and Vanhoucke 38 compared the accuracy of the project time and cost forecasting using EVM and simulation. They found that the simulation results supported findings from the EVM. Network methods are primarily used to analyse project stakeholder networks. Yang and Zou 39 developed a social network theory-based model to explore stakeholder-associated risks and their interactions in complex green building projects. Uddin 40 proposed a social network analytics-based framework for analysing stakeholder networks. Ong and Uddin 41 further applied network correlation and regression to examine the co-evolution of stakeholder networks in collaborative healthcare projects. 
Although many other methods have already been used, as evident in the current literature, machine learning methods or models are yet to be adopted for addressing research problems related to project analytics. The current investigation is derived from the cognitive analytics component of project analytics. It proposes an approach for determining hidden information and patterns to assist with project delivery. Figure  1 illustrates a tree diagram showing different levels of project analytics and their associated methods from the literature. It also illustrates existing methods within the cognitive component of project analytics to where the application of machine learning is situated contextually.

figure 1

A tree diagram of different project analytics methods. It also shows where the current study belongs to. Although earned value analysis is commonly used in project analytics, we do not include it in this figure since it is used in the first three levels of project analytics.

Machine learning models have several notable advantages over traditional statistical methods that play a significant role in project analytics 42 . First, machine learning algorithms can quickly identify trends and patterns by simultaneously analysing a large volume of data. Second, they are more capable of continuous improvement. Machine learning algorithms can improve their accuracy and efficiency for decision-making through subsequent training from potential new data. Third, machine learning algorithms efficiently handle multi-dimensional and multi-variety data in dynamic or uncertain environments. Fourth, they are well suited to automating various decision-making tasks. For example, machine learning-based sentiment analysis can easily identify a negative tweet and can automatically take further necessary steps. Last but not least, machine learning has been helpful across various industries, for example, from defence to education 43 . Current research has seen the development of several different branches of artificial intelligence (including robotics, automated planning and scheduling and optimisation) within safety monitoring, risk prediction, cost estimation and so on 44 . This has progressed from the applications of regression on project cost overruns 45 to the current deep-learning implementations within the construction industry 46 . Despite this, the uses remain largely limited and are still in a developmental state. The benefits of applications are noted, such as optimising and streamlining existing processes; however, high initial costs form a barrier to accessibility 44 .

The primary goal of this study is to demonstrate the applicability of different machine learning algorithms in addressing problems related to project analytics. Limitations in applying machine learning algorithms within the context of construction projects have been explored previously. However, preceding research has mainly been conducted to improve the design processes specific to construction 23, 24, and studies investigating project profitability have not incorporated the types and combinations of algorithms used within this study 6, 27. For instance, preceding research has incorporated a different combination of machine learning algorithms in predicting construction delays 47. This study first proposes a machine learning-based data-driven research framework for project analytics to contribute to the proposed study direction. It then applies this framework to a case study of construction projects. Although there are three different types of machine learning algorithms (supervised, unsupervised and semi-supervised), supervised machine learning models are most commonly used due to their efficiency and effectiveness in addressing many real-world problems 48. Therefore, we will use machine learning to represent supervised machine learning throughout the rest of this article. The contribution of this study is significant in that it considers the applications of machine learning within project management. Project management is often thought of as being very fluid in nature, and because of this, applications of machine learning are often more difficult 9, 49. Further to this, existing implementations have largely been limited to safety monitoring, risk prediction, cost estimation and so on 44. Through the evaluation of machine learning applications, this study further demonstrates a case study in which algorithms can be used to consider and model the relationship between project attributes and a project performance measure (i.e., cost overrun frequency).

Machine learning-based framework for project analytics

When and why machine learning for project analytics.

Machine learning models are typically used for research problems that involve predicting the classification outcome of a categorical dependent variable. Therefore, they can be applied in the context of project analytics if the underlying objective variable is a categorical one. If that objective variable is non-categorical, it must first be converted into a categorical variable. For example, if the objective or target variable is the project cost, we can convert this variable into a categorical variable by taking only two possible values. The first value would be 0 to indicate a low-cost project, and the second could be 1 for showing a high-cost project. The average or median cost value for all projects under consideration can be considered for splitting project costs into low-cost and high-cost categories.
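The conversion of a continuous target into two categories can be sketched in a few lines of Python. This is only an illustration: the cost figures are hypothetical, and the rule that a cost equal to the median counts as low-cost is an assumption, since the text does not specify how ties are handled.

```python
from statistics import median


def binarise_by_median(costs):
    """Label each cost 0 (low-cost) or 1 (high-cost) relative to the median."""
    threshold = median(costs)
    # Assumed tie rule: only costs strictly above the median count as high-cost.
    return [1 if cost > threshold else 0 for cost in costs]


# Hypothetical project costs in arbitrary units (not taken from the case study).
project_costs = [120, 450, 300, 800, 150, 620]
labels = binarise_by_median(project_costs)
print(labels)
```

Replacing `median` with `statistics.mean` gives the average-based split mentioned above.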

For data-driven decision-making, machine learning models are advantageous. This is because traditional statistical methods (e.g., ordinary least square (OLS) regression) make assumptions about the underlying research data to produce explicit formulae for the objective target measures. Unlike these statistical methods, machine learning algorithms figure out patterns on their own directly from the data. For instance, for a non-linear but separable dataset, an OLS regression model will not be the right choice due to its assumption that the underlying data must be linear. However, a machine learning model can easily separate the dataset into the underlying classes. Figure  2 (a) presents a situation where machine learning models perform better than traditional statistical methods.

figure 2

( a ) An illustration showing the superior performance of machine learning models compared with the traditional statistical models using an abstract dataset with two attributes (X 1 and X 2 ). The data points within this abstract dataset consist of two classes: one represented with a transparent circle and the second class illustrated with a black-filled circle. These data points are non-linear but separable. Traditional statistical models (e.g., ordinary least square regression) will not accurately separate these data points. However, any machine learning model can easily separate them without making errors; and ( b ) Traditional programming versus machine learning.

Similarly, machine learning models are compelling if the underlying research dataset has many attributes or independent measures. Such models can identify features that significantly contribute to the corresponding classification performance regardless of their distributions or collinearity. Traditional statistical methods, by contrast, are prone to biased results when the independent variables are correlated. Machine learning-based studies specific to project analytics have so far been largely limited. Despite this, there have been tangential studies on the use of artificial intelligence to improve cost estimations as well as risk prediction 44. Additionally, models have been implemented in the optimisation of existing processes 50.

Machine learning versus traditional programming

Machine learning can be thought of as a process of teaching a machine (i.e., computers) to learn from data and adjust or apply its present knowledge when exposed to new data 42 . It is a type of artificial intelligence that enables computers to learn from examples or experiences. Traditional programming requires some input data and some logic in the form of code (program) to generate the output. Unlike traditional programming, the input data and their corresponding output are fed to an algorithm to create a program in machine learning. This resultant program can capture powerful insights into the data pattern and can be used to predict future outcomes. Figure  2 (b) shows the difference between machine learning and traditional programming.

Proposed machine learning-based framework

Figure  3 illustrates the proposed machine learning-based research framework of this study. The framework starts with breaking the project research dataset into the training and test components. As mentioned in the previous section, the research dataset may have many categorical and/or nominal independent variables, but its single dependent variable must be categorical. Although there is no strict rule for this split, the training data size is generally more than or equal to 50% of the original dataset 48 .
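As a rough sketch of this first step, the split could be done with scikit-learn's `train_test_split`. The data below are synthetic placeholders, and the 70:30 ratio, the stratification by class and the fixed random seed are illustrative assumptions rather than requirements of the framework.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a project dataset: 100 projects, 5 survey-style features.
rng = np.random.default_rng(seed=0)
X = rng.integers(1, 6, size=(100, 5))  # hypothetical ratings from 1 to 5
y = np.array([0, 1] * 50)              # hypothetical binary dependent variable

# Hold out 30% of the rows for testing; stratify keeps the class ratio similar
# in both parts (an assumed choice, not stated in the text).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```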

figure 3

The proposed machine learning-based data-driven framework.

Most machine learning algorithms can handle only variables with numerical outcomes. So, when one or more of the underlying categorical variables have a textual or string outcome, we must first convert them into corresponding numerical values. Suppose a variable can take only three textual outcomes (low, medium and high). In that case, we could use, for example, 1 to represent low, 2 to represent medium and 3 to represent high. Other statistical techniques, such as RIDIT (relative to an identified distribution) scoring 51, can also be used to convert ordered categorical measurements into quantitative ones. RIDIT is a parametric approach that uses probabilistic comparison to determine the statistical differences between ordered categorical groups. The remaining components of the proposed framework are briefly described in the following subsections.
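The low/medium/high example can be written as a simple lookup. This is a minimal sketch: the helper name `encode_ordinal` and the sample ratings are hypothetical, and normalising case and whitespace before the lookup is an added assumption.

```python
# Mapping from the example in the text: low -> 1, medium -> 2, high -> 3.
ORDINAL_CODES = {"low": 1, "medium": 2, "high": 3}


def encode_ordinal(values, codes=ORDINAL_CODES):
    """Replace each textual level with its integer code.

    Unknown levels raise KeyError so that unexpected labels are not silently lost.
    """
    return [codes[value.strip().lower()] for value in values]


ratings = ["Low", "high", "medium", "low"]  # hypothetical survey answers
encoded = encode_ordinal(ratings)
print(encoded)
```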

Model-building procedure

The next step of the framework is to follow the model-building procedure to develop the desired machine learning models using the training data. The first step of this procedure is to select suitable machine learning algorithms or models. Among the available machine learning algorithms, the commonly used ones are support vector machine, logistic regression, k-nearest neighbours, artificial neural network, decision tree and random forest 52. One can also select an ensemble machine learning model as the desired algorithm. An ensemble machine learning method uses multiple algorithms, or the same algorithm multiple times, to achieve better predictive performance than could be obtained from any of the constituent learning models alone 52. Three widely used ensemble approaches are bagging, boosting and stacking. In bagging, several equal-sized subsets are drawn from the research dataset by random sampling with replacement. The underlying machine learning algorithm is then applied to each of these subsets, and the resulting predictions are combined for classification. In boosting, models are fitted and trained sequentially, with each new model compensating for the weaknesses observed in the model used immediately before it. Stacking combines different machine learning models in a heterogeneous way to improve the predictive performance. As an example of an ensemble, the random forest algorithm combines different decision tree models 42.

Second, each selected machine learning model will be processed through the k-fold cross-validation approach to improve predictive efficiency. In k-fold cross-validation, the training data is divided into k folds. In each iteration, (k-1) folds are used to train the selected machine learning models, and the remaining fold is used for validation purposes. This process continues until each of the k folds has had a turn as the validation fold. The final predictive efficiency of the trained models is based on the average of the outcomes from these iterations. In addition to this average value, researchers report the standard deviation of the results from different iterations as a measure of the predictive training efficiency. Supplementary Fig 1 shows an illustration of the k-fold cross-validation.
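The fold rotation described above can be sketched in plain Python. The function names are hypothetical, and the sketch makes simplifying assumptions: the rows are split in their original order (in practice they are usually shuffled first), and the fold scores passed to `mean_and_std` are made-up numbers.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def mean_and_std(scores):
    """Summarise per-fold scores by their mean and (population) standard deviation."""
    mean = sum(scores) / len(scores)
    variance = sum((score - mean) ** 2 for score in scores) / len(scores)
    return mean, variance ** 0.5


folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in folds:
    print("train:", train, "validate:", validation)

# Hypothetical accuracies from three folds, only to show the summary step.
print(mean_and_std([0.7, 0.8, 0.9]))
```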

Third, most machine learning algorithms require pre-defined values for their different parameters, known as hyperparameters; the process of choosing these values is known as hyperparameter tuning. The settings of these parameters play a vital role in the achieved performance of the underlying algorithm. For a given machine learning algorithm, the optimal values for these parameters can differ from one dataset to another. The same algorithm needs to be run multiple times with different parameter values to find its optimal parameter values for a given dataset. Many methods are available in the literature, such as Grid search 53, to find the optimal parameter values. In Grid search, the hyperparameter space is divided into a discrete grid. Each grid point represents a specific combination of the underlying model parameters. The parameter values of the point that results in the best performance are the optimal parameter values 53.
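A grid search of this kind might look as follows with scikit-learn's `GridSearchCV`. The dataset is synthetic, and the choice of random forest, the two hyperparameters and their candidate values are illustrative assumptions, not the settings used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for a project dataset.
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

# Each combination of these values is one grid point (2 x 2 = 4 points).
param_grid = {"n_estimators": [10, 30], "max_depth": [2, 4]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # every grid point is scored with fivefold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

# The grid point with the best mean cross-validated accuracy.
print(search.best_params_, round(search.best_score_, 3))
```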

Testing of the developed models and reporting results

Once the desired machine learning models have been developed using the training data, they need to be tested using the test data. The underlying trained model is then applied to predict its dependent variable for each data instance. Therefore, for each data instance, two categorical outcomes will be available for its dependent variable: one predicted using the underlying trained model, and the other is the actual category. These predicted and actual categorical outcome values are used to report the results of the underlying machine learning model.

The fundamental tool to report results from machine learning models is the confusion matrix, which consists of four integer values 48. The first value represents the number of positive cases correctly identified as positive by the underlying trained model (true-positive). The second value indicates the number of positive instances incorrectly identified as negative (false-negative). The third value represents the number of negative cases incorrectly identified as positive (false-positive). Finally, the fourth value indicates the number of negative instances correctly identified as negative (true-negative). Researchers also use a few performance measures based on the four values of the confusion matrix to report machine learning results. The most used measure is accuracy, which is the ratio of the number of correct predictions (true-positive + true-negative) to the total number of data instances (sum of all four values of the confusion matrix). Other measures commonly used to report machine learning results are precision, recall and F1-score. Precision refers to the ratio between true-positives and the total number of positive predictions (i.e., true-positive + false-positive), and is often used to indicate the quality of a positive prediction made by a model 48. Recall, also known as the true-positive rate, is calculated by dividing true-positive by the number of data instances that should have been predicted as positive (i.e., true-positive + false-negative). F1-score is the harmonic mean of the last two measures, i.e., [(2 × Precision × Recall)/(Precision + Recall)], and the error rate equals (1 − Accuracy).
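The measures defined above follow directly from the four confusion-matrix values. The sketch below uses hypothetical counts and assumes every denominator is non-zero; the function name is illustrative.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the measures defined above from the four confusion-matrix values.

    Assumes tp + fp > 0, tp + fn > 0 and precision + recall > 0.
    """
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1_score,
        "error_rate": 1 - accuracy,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=40, fn=10, fp=10, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```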

Another essential tool for reporting machine learning results is variable or feature importance, which identifies a list of independent variables (features) contributing most to the classification performance. The importance of a variable refers to how much a given machine learning algorithm uses that variable in making accurate predictions 54 . The widely used technique for identifying variable importance is the principal component analysis. It reduces the dimensionality of the data while minimising information loss, which eventually increases the interpretability of the underlying machine learning outcome. It further helps in finding the important features in a dataset as well as plotting them in 2D and 3D 54 .

Ethical approval

Ethical approval is not required for this study since this study used publicly available data for research investigation purposes. All research was performed in accordance with relevant guidelines/regulations.

Informed consent

Due to the nature of the data sources, informed consent was not required for this study.

Case study: an application of the proposed framework

This section illustrates an application of this study’s proposed framework (Fig. 3) in a construction project context. We will apply this framework in classifying projects into two classes based on their cost overrun experience. Projects that rarely experience a cost overrun belong to the first class (Rare class). The second class comprises projects that often experience a cost overrun (Often class). In doing so, we consider a list of independent variables or features.

Data source

The research dataset is taken from an open-source data repository, Kaggle 55 . This survey-based research dataset was collected to explore the causes of the project cost overrun in Indian construction projects 45 , consisting of 44 independent variables or features and one dependent variable. The independent variables cover a wide range of cost overrun factors, from materials and labour to contractual issues and the scope of the work. The dependent variable is the frequency of experiencing project cost overrun (rare or often). The dataset size is 139; 65 belong to the rare class, and the remaining 74 are from the often class. We converted each categorical variable with a textual or string outcome into an appropriate numerical value range to prepare the dataset for machine learning analysis. For example, we used 1 and 2 to represent rare and often class, respectively. The correlation matrix among the 44 features is presented in Supplementary Fig 2 .

Machine learning algorithms

This study considered four machine learning algorithms to explore the causes of project cost overrun using the research dataset mentioned above. They are support vector machine, logistic regression, k- nearest neighbours and random forest.

Support vector machine (SVM) is a process applied to understand data. For instance, if one wants to determine and interpret which projects are classified as programmatically successful through the processing of precedent data, SVM would provide a practical approach for prediction. SVM functions by assigning labels to objects 56. The comparison attributes are used to separate these objects into different groups or classes by maximising their marginal distances and minimising the classification errors. The attributes are plotted multi-dimensionally, allowing a separating boundary, known as a hyperplane (see Supplementary Fig 3 (a)), to distinguish between underlying classes or groups 52. Support vectors are the data points that lie closest to the decision boundary on both sides. In Supplementary Fig 3 (a), they are the circles (both transparent and shaded ones) close to the hyperplane. Support vectors play an essential role in deciding the position and orientation of the hyperplane. Various computational methods, including a kernel function to create more derived attributes, are applied to accommodate this process 56. Support vector machines are not limited to binary classes but can also be generalised to multi-class problems. This is accomplished through the training of separate SVMs 56.

Logistic regression (LR) builds on the linear regression model and predicts the outcome of a dichotomous variable 57; for example, the presence or absence of an event. It uses a scatterplot to understand the connection between a dependent variable and one or more independent variables (see Supplementary Fig 3 (b)). The LR model fits the data to a sigmoidal curve instead of fitting it to a straight line. The natural logarithm is considered when developing the model. It provides a value between 0 and 1 that is interpreted as the probability of class membership. Best estimates are determined by iteratively refining approximate estimates until a level of stability is reached 58. Generally, LR offers a straightforward approach for determining and observing interrelationships. It is more efficient compared to ordinary regressions 59.
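The sigmoidal mapping can be illustrated with a short sketch. The weights, bias and feature values are hypothetical; in a fitted model they would be estimated from the training data.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of class membership from a linear score passed through the sigmoid."""
    linear_score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(linear_score)


# Hypothetical coefficients for two features, not estimated from any real data.
weights = [0.8, -0.5]
bias = -1.0
probability = predict_probability([3, 1], weights, bias)
print(round(probability, 3))
```

A probability above a chosen cut-off (commonly 0.5) would be assigned to the positive class.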

The k-nearest neighbours (KNN) algorithm uses a process that plots prior information and applies a specific sample size (k) to the plot to determine the most likely scenario 52. This method finds the nearest training examples using a distance measure. The final classification is made by taking the most common class, or majority vote, within the specified sample. As illustrated in Supplementary Fig 3 (c), the four nearest neighbours in the small circle are three grey squares and one white square. The majority class is grey. Hence, KNN will predict the instance (i.e., Χ) as grey. On the other hand, if we look at the larger circle of the same figure, the nearest neighbours consist of ten white squares and four grey squares. The majority class is white. Thus, KNN will classify the instance as white. KNN’s advantage lies in its ability to produce a simplified result and handle missing data 60. In summary, KNN utilises similarities (as well as differences) and distances in the process of developing models.
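The majority vote, and the way the prediction can change with k, can be sketched as follows. The points are hypothetical and do not reproduce the supplementary figure; Euclidean distance is an assumed choice of distance measure.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Return the majority label among the k training points nearest to the query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2D points: three grey squares sit closest to the query,
# while white squares dominate the wider neighbourhood.
points = [(0.5, 0), (0, 0.5), (-0.5, 0), (3, 0),
          (0, -0.6), (2, 0), (0, 2), (-2, 0), (0, -2), (2, 2)]
labels = ["grey", "grey", "grey", "grey",
          "white", "white", "white", "white", "white", "white"]
query = (0, 0)

print(knn_predict(points, labels, query, k=4))  # small neighbourhood
print(knn_predict(points, labels, query, k=9))  # larger neighbourhood
```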

Random forest (RF) is a machine learning process that consists of many decision trees. A decision tree is a tree-like structure where each internal node represents a test on an input attribute. It may have multiple internal nodes at different levels, and the leaf or terminal nodes represent the decision outcomes. Each tree produces a classification outcome for a distinct and separate part of the input vector. For numerical outcomes, the forest considers the average value, and for discrete outcomes, it considers the number of votes 52. Supplementary Fig 3 (d) shows three decision trees to illustrate the function of a random forest. The outcomes from trees 1, 2 and 3 are class B, class A and class A, respectively. According to the majority vote, the final prediction will be class A. Because it considers specific attributes, it can have a tendency to emphasise specific attributes over others, which may result in some attributes being unevenly weighted 52. Advantages of the random forest include its ability to handle multidimensionality and multicollinearity in data despite its sensitivity to sampling design.

Artificial neural network (ANN) simulates the way in which human brains work. This is accomplished by modelling logical propositions and incorporating weighted inputs, a transfer and one output 61 (Supplementary Fig 3 (e)). It is advantageous because it can be used to model non-linear relationships and handle multivariate data 62 . ANN learns through three major avenues. These include error-back propagation (supervised), the Kohonen (unsupervised) and the counter-propagation ANN (supervised) 62 . There are two types of ANN—supervised and unsupervised. ANN has been used in a myriad of applications ranging from pharmaceuticals 61 to electronic devices 63 . It also possesses great levels of fault tolerance 64 and learns by example and through self-organisation 65 .

Ensemble techniques are a type of machine learning methodology in which numerous basic classifiers are combined to generate an optimal model 66 . An ensemble technique considers many models and combines them to form a single model, and the final model will eliminate the weaknesses of each individual learner, resulting in a powerful model that will improve model performance. The stacking model is a general architecture comprised of two classifier levels: base classifier and meta-learner 67 . The base classifiers are trained with the training dataset, and a new dataset is constructed for the meta-learner. Afterwards, this new dataset is used to train the meta-classifier. This study uses four models (SVM, LR, KNN and RF) as base classifiers and LR as a meta learner, as illustrated in Supplementary Fig 3 (f).
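A stacking arrangement with the same base classifiers and meta-learner could be sketched with scikit-learn's `StackingClassifier`. This is not the study's code: the data are synthetic, and every hyperparameter value shown is an assumption chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the project dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Level 1: the four base classifiers named in the text (settings are assumed).
base_classifiers = [
    ("svm", SVC(random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Level 2: logistic regression as the meta-learner, trained on the
# cross-validated outputs of the base classifiers.
stack = StackingClassifier(
    estimators=base_classifiers,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
test_accuracy = stack.score(X_test, y_test)
print(round(test_accuracy, 3))
```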

Feature selection

Feature selection is the process of selecting the optimal feature subset that significantly influences the predicted outcomes, which can increase model performance and save running time. This study considers three different feature selection approaches: Univariate feature selection (UFS), Recursive feature elimination (RFE) and the SelectFromModel (SFM) approach. UFS examines each feature separately to determine the strength of its relationship with the response variable 68. This method is straightforward to use and comprehend and helps acquire a deeper understanding of data. In this study, we calculate the chi-square values between features. RFE is a type of backward feature elimination in which the model is first fitted using all features in the given dataset, and the least important features are then removed one by one 69. After each removal, the model is refitted until the desired number of features is left over, which is set by a user-specified parameter. SFM is used to choose effective features based on the feature importance of the best-performing model 70. This approach selects features by establishing a threshold based on feature significance as indicated by the model on the training set. Features whose importance exceeds the threshold are chosen, while those whose importance is below the threshold are discarded. In this study, we apply SFM after we compare the performance of four machine learning methods. Afterwards, we train the best-performing model again using the features from the SFM approach.
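The three approaches map onto scikit-learn's `SelectKBest`, `RFE` and `SelectFromModel` classes. The sketch below runs them on synthetic data; the number of features to keep, the mean-importance threshold and the rescaling step (chi-square scoring requires non-negative inputs) are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: 12 features, of which 4 are informative.
X, y = make_classification(
    n_samples=150, n_features=12, n_informative=4, random_state=0
)
X = MinMaxScaler().fit_transform(X)  # chi2 needs non-negative feature values

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# UFS: score each feature on its own with the chi-square statistic.
ufs = SelectKBest(score_func=chi2, k=5).fit(X, y)
# RFE: refit the model repeatedly, dropping the least important feature each time.
rfe = RFE(estimator=forest, n_features_to_select=5).fit(X, y)
# SFM: keep features whose importance is at least the mean importance.
sfm = SelectFromModel(estimator=forest, threshold="mean").fit(X, y)

for name, selector in [("UFS", ufs), ("RFE", rfe), ("SFM", sfm)]:
    print(name, selector.get_support(indices=True))
```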

Findings from the case study

We split the dataset 70:30 into training and test sets for the selected machine learning algorithms. We used Python’s Scikit-learn package for implementing these algorithms 70. Using the training data, we first developed six models based on these six algorithms. We used fivefold cross-validation, targeting improved accuracy. Then, we applied these models to the test data. We also executed all required hyperparameter tuning for each algorithm to obtain the best possible classification outcome. Table 1 shows the performance outcomes for each algorithm during the training and test phases. The hyperparameter settings for each algorithm are listed in Supplementary Table 1.

As revealed in Table 1, random forest outperformed the other algorithms in terms of accuracy for both the training and test phases. It showed an accuracy of 78.14% and 77.50% for the training and test phases, respectively. The second-best performer in the training phase is k-nearest neighbours (76.98%), and in the test phase, the support vector machine, k-nearest neighbours and artificial neural network tie for second place (72.50%).

Since random forest showed the best performance, we explored further based on this algorithm. We applied the three approaches (UFS, RFE and SFM) for feature optimisation on the random forest. The result is presented in Table 2. SFM shows the best outcome among these three approaches. Its accuracy is 85.00%, whereas the accuracies of UFS and RFE are 77.50% and 72.50%, respectively. As can be seen in Table 2, the accuracy for the testing phase increases from 77.50% in Table 1 (b) to 85.00% with the SFM feature optimisation. Table 3 shows the 19 selected features from the SFM output. Out of 44 features, SFM found that 19 of them play a significant role in predicting the outcomes.

Further, Fig.  4 illustrates the confusion matrix when the random forest model with the SFM feature optimiser was applied to the test data. There are 18 true-positive, five false-negative, one false-positive and 16 true-negative cases. Therefore, the accuracy for the test phase is (18 + 16)/(18 + 5 + 1 + 16) = 85.00%.

figure 4

Confusion matrix results based on the random forest model with the SFM feature optimiser (1 for the rare class and 2 for the often class).

Figure 5 illustrates the top-10 most important features or variables based on the random forest algorithm with the SFM optimiser. We used feature importance based on the mean decrease in impurity to identify this list of important variables. Mean decrease in impurity computes each feature’s importance as the sum over the number of splits that include the feature in proportion to the number of samples it splits 71. According to this figure, the ‘delays in decision making’ attribute contributed most to the classification performance of the random forest algorithm, followed by the cash flow problem and construction cost underestimation attributes. The current construction project literature also highlights these top-10 factors as significant contributors to project cost overrun. For example, using construction project data from Jordan, Al-Hazim et al. 72 ranked 20 causes for cost overrun, including causes similar to these.
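In scikit-learn, impurity-based importances of a fitted random forest are exposed through the `feature_importances_` attribute. The sketch below ranks features on synthetic data; the feature names are hypothetical placeholders rather than the survey's cost overrun factors.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, of which 3 are informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)
feature_names = [f"factor_{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalised so that they sum to one.
importances = forest.feature_importances_
top_indices = np.argsort(importances)[::-1][:5]
for index in top_indices:
    print(f"{feature_names[index]}: {importances[index]:.3f}")
```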

figure 5

Feature importance (top-10 out of 19) based on the random forest model with the SFM feature optimiser.

Further, we conduct a sensitivity analysis of the model’s ten most important features (from Fig. 5) to explore how a change in each feature affects the cost overrun. We utilise the partial dependence plot (PDP), which is a typical visualisation tool for non-parametric models 73, to display this analysis’s outcomes. A PDP can demonstrate whether the relation between the target and a feature is linear, monotonic, or more complicated. The result of the sensitivity analysis is presented in Fig. 6. For the ‘delays in decisions making’ attribute, the PDP shows that the probability is below 0.4 until the rating value is three and increases thereafter. A higher value for this attribute indicates a higher risk of cost overrun. On the other hand, no significant differences can be seen in the remaining nine features as their values change.
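The values behind a partial dependence plot can be computed with scikit-learn's `partial_dependence` function. This sketch uses synthetic data and an arbitrarily chosen feature; the grid resolution is an assumption, and plotting is left out.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

# Synthetic binary classification data standing in for the project dataset.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Average predicted probability of the positive class as feature 0 is varied
# over a 10-point grid while the other features keep their observed values.
result = partial_dependence(
    forest, X, features=[0], kind="average", grid_resolution=10
)
averaged = result["average"][0]
print(averaged.round(3))
```

A flat sequence of values suggests the prediction is insensitive to that feature, whereas a rising or falling sequence indicates a monotonic effect.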

figure 6

The result of the sensitivity analysis from the partial dependency plot tool for the ten most important features.

Summary of the case study

We illustrated an application of the proposed machine learning-based research framework in classifying construction projects. RF showed the highest accuracy in predicting the test dataset. For a new data instance that has information for its 19 features but no information on its classification, RF can identify its class (rare or often) correctly with a probability of 85.00%. If more data is provided to the machine learning algorithms, in addition to the 139 instances of the case study, then their accuracy and efficiency in making project classifications will improve with subsequent training. For example, if we provide 100 more data instances, these algorithms will have an additional 70 instances for training with a 70:30 split. This continuous improvement facility puts the machine learning algorithms in a superior position over other traditional methods. In the current literature, some studies explore the factors contributing to project delay or cost overrun. In most cases, they applied factor analysis or other related statistical methods for research data analysis 72, 74, 75. In addition to identifying important attributes, the proposed machine learning-based framework identified the ranking of factors and how eliminating less important factors affects the prediction accuracy when applied to this case study.

We shared the Python software developed to implement the four machine learning algorithms considered in this case study using GitHub 76, a software hosting internet site. A user-friendly version of this software can be accessed at https://share.streamlit.io/haohuilu/pa/main/app.py . The accuracy findings from this link could be slightly different from one run to another due to the hyperparameter settings of the corresponding machine learning algorithms.

Due to their robust prediction ability, machine learning methods have already gained wide acceptability across a wide range of research domains. On the other hand, EVM is the most commonly used method in project analytics due to its simplicity and ease of interpretability 77. Essential research efforts have been made to improve its generalisability over time. For example, Naeni et al. 34 developed a fuzzy approach for earned value analysis to make it suitable to analyse project scenarios with ambiguous or linguistic outcomes. Acebes 78 integrated Monte Carlo simulation with EVM for project monitoring and control for a similar purpose. Another prominent method frequently used in project analytics is time series analysis, which is compelling for the longitudinal prediction of project time and cost 30. However, as is evident in the current literature, not much effort has been made to bring machine learning into project analytics for addressing project management research problems. This research made a significant attempt to contribute to filling this gap.

Our proposed data-driven framework only includes the fundamental model development and application process components for machine learning algorithms. It does not have a few advanced-level machine learning methods. This study intentionally did not consider them for the proposed model since they are required only in particular designs of machine learning analysis. For example, the framework does not contain any methods or tools to handle the data imbalance issue. Data imbalance refers to a situation when the research dataset has an uneven distribution of the target class 79 . For example, a binary target variable will cause a data imbalance issue if one of its class labels has a very high number of observations compared with the other class. Commonly used techniques to address this issue are undersampling and oversampling. The undersampling technique decreases the size of the majority class. On the other hand, the oversampling technique randomly duplicates the minority class until the class distribution becomes balanced 79 . The class distribution of the case study did not produce any data imbalance issues.
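Random oversampling as described above can be sketched in plain Python. The function name, the sample rows and the fixed seed are hypothetical; the sketch simply duplicates randomly chosen minority-class rows until every class matches the size of the largest one.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, row_label in zip(rows, labels) if row_label == label]
        # Classes already at the target size need no extra copies.
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical imbalanced data: six "often" projects and two "rare" ones.
rows = [[1, 2], [2, 3], [3, 1], [4, 4], [5, 2], [2, 5], [1, 1], [3, 3]]
labels = ["often"] * 6 + ["rare"] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Oversampling is normally applied to the training data only, so that duplicated rows do not leak into the test set.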

This study considered only six fundamental machine learning algorithms for the case study, although many other such algorithms are available in the literature. For example, it did not consider the extreme gradient boosting (XGBoost) algorithm. XGBoost is based on the decision tree algorithm, similar to the random forest algorithm 80. It has become dominant in applied machine learning due to its performance and speed. Naïve Bayes and convolutional neural networks are other popular machine learning algorithms that were not considered when applying the proposed framework to the case study. In addition to the three feature selection methods, multi-view learning can be adopted when applying the proposed framework to the case study. Multi-view learning is another direction in machine learning that considers learning with multiple views of the existing data with the aim of improving predictive performance 81, 82. Similarly, although we considered five performance measures, there are other potential candidates. One such example is the area under the receiver operating characteristic curve, which measures the ability of the underlying classifier to distinguish between classes 48. We leave them as a potential application scope while applying our proposed framework in other project contexts in future studies.

Although this study only used one case study for illustration, our proposed research framework can be used in other project analytics contexts. In such an application context, the underlying research goal should be to predict the outcome classes and find attributes playing a significant role in making correct predictions. For example, by considering two types of projects based on the time required to accomplish (e.g., on-time and delayed ), the proposed framework can develop machine learning models that can predict the class of a new data instance and find out attributes contributing mainly to this prediction performance. This framework can also be used at any stage of the project. For example, the framework’s results allow project stakeholders to screen projects for excessive cost overruns and forecast budget loss at bidding and before contracts are signed. In addition, various factors that contribute to project cost overruns can be figured out at an earlier stage. These elements emerge at each stage of a project’s life cycle. The framework’s feature importance helps project managers locate the critical contributor to cost overrun.

This study has made an important contribution to the current project analytics literature by considering the applications of machine learning within project management. Project management is often thought of as being very fluid in nature, and because of this, applications of machine learning are often more difficult. Further, existing implementations have largely been limited to safety monitoring, risk prediction and cost estimation. Through the evaluation of machine learning applications, this study further demonstrates how algorithms can be used to model the relationship between project attributes and cost overrun frequency.

The applications of machine learning in project analytics are still undergoing constant development. Within construction projects, its applications have been largely limited and focused on profitability or the design of structures themselves. In this regard, our study made a substantial effort by proposing a machine learning-based framework to address research problems related to project analytics. We also illustrated an example of this framework’s application in the context of construction project management.

Like any other research, this study has a few limitations that could provide scope for future research. First, the framework does not include a few advanced machine learning techniques, such as methods for handling data imbalance and kernel density estimation. Second, we considered only one case study to illustrate the application of the proposed framework. Illustrations of this framework using case studies from different project contexts would confirm its robust application. Finally, this study did not consider all machine learning models and performance measures available in the literature for the case study. For example, we did not consider the Naïve Bayes model and the precision measure in applying the proposed research framework to the case study.

Data availability

This study obtained research data from publicly available online repositories, and their sources are cited accordingly. The data are available at https://www.kaggle.com/datasets/amansaxena/survey-on-road-construction-delay.

Venkrbec, V. & Klanšek, U. In: Advances and Trends in Engineering Sciences and Technologies II 685–690 (CRC Press, 2016).

Damnjanovic, I. & Reinschmidt, K. Data Analytics for Engineering and Construction Project Risk Management (Springer, 2020).

Singh, H. Project Management Analytics: A Data-driven Approach to Making Rational and Effective Project Decisions (FT Press, 2015).

Frame, J. D. & Chen, Y. Why Data Analytics in Project Management? (Auerbach Publications, 2018).

Ong, S. & Uddin, S. Data Science and Artificial Intelligence in Project Management: The Past, Present and Future. J. Mod. Proj. Manag. 7 , 26–33 (2020).

Bilal, M. et al. Investigating profitability performance of construction projects using big data: A project analytics approach. J. Build. Eng. 26 , 100850 (2019).

Radziszewska-Zielina, E. & Sroka, B. Planning repetitive construction projects considering technological constraints. Open Eng. 8 , 500–505 (2018).

Neely, A. D., Adams, C. & Kennerley, M. The Performance Prism: The Scorecard for Measuring and Managing Business Success (Prentice Hall Financial Times, 2002).

Kanakaris, N., Karacapilidis, N., Kournetas, G. & Lazanas, A. In: International Conference on Operations Research and Enterprise Systems. 135–155 Springer.

Jordan, M. I. & Mitchell, T. M. Machine learning: Trends, perspectives, and prospects. Science 349 , 255–260 (2015).

Shalev-Shwartz, S. & Ben-David, S. Understanding Machine Learning: From Theory to Algorithms (Cambridge University Press, 2014).

Rahimian, F. P., Seyedzadeh, S., Oliver, S., Rodriguez, S. & Dawood, N. On-demand monitoring of construction projects through a game-like hybrid application of BIM and machine learning. Autom. Constr. 110 , 103012 (2020).

Sanni-Anibire, M. O., Zin, R. M. & Olatunji, S. O. Machine learning model for delay risk assessment in tall building projects. Int. J. Constr. Manag. 22 , 1–10 (2020).

Cong, J. et al. A machine learning-based iterative design approach to automate user satisfaction degree prediction in smart product-service system. Comput. Ind. Eng. 165 , 107939 (2022).

Li, F., Chen, C.-H., Lee, C.-H. & Feng, S. Artificial intelligence-enabled non-intrusive vigilance assessment approach to reducing traffic controller’s human errors. Knowl. Based Syst. 239 , 108047 (2021).

Mohri, M., Rostamizadeh, A. & Talwalkar, A. Foundations of Machine Learning (MIT Press, 2018).

Whyte, J., Stasis, A. & Lindkvist, C. Managing change in the delivery of complex projects: Configuration management, asset information and ‘big data’. Int. J. Proj. Manag. 34 , 339–351 (2016).

Zangeneh, P. & McCabe, B. Ontology-based knowledge representation for industrial megaprojects analytics using linked data and the semantic web. Adv. Eng. Inform. 46 , 101164 (2020).

Akinosho, T. D. et al. Deep learning in the construction industry: A review of present status and future innovations. J. Build. Eng. 32 , 101827 (2020).

Soman, R. K., Molina-Solana, M. & Whyte, J. K. Linked-Data based constraint-checking (LDCC) to support look-ahead planning in construction. Autom. Constr. 120 , 103369 (2020).

Soman, R. K. & Whyte, J. K. Codification challenges for data science in construction. J. Constr. Eng. Manag. 146 , 04020072 (2020).

Soman, R. K. & Molina-Solana, M. Automating look-ahead schedule generation for construction using linked-data based constraint checking and reinforcement learning. Autom. Constr. 134 , 104069 (2022).

Shi, F., Soman, R. K., Han, J. & Whyte, J. K. Addressing adjacency constraints in rectangular floor plans using Monte-Carlo tree search. Autom. Constr. 115 , 103187 (2020).

Chen, L. & Whyte, J. Understanding design change propagation in complex engineering systems using a digital twin and design structure matrix. Eng. Constr. Archit. Manag. (2021).

Allison, J. T. et al. Artificial intelligence and engineering design. J. Mech. Des. 144 , 020301 (2022).

Dutta, D. & Bose, I. Managing a big data project: The case of ramco cements limited. Int. J. Prod. Econ. 165 , 293–306 (2015).

Bilal, M. & Oyedele, L. O. Guidelines for applied machine learning in construction industry—A case of profit margins estimation. Adv. Eng. Inform. 43 , 101013 (2020).

Tayefeh Hashemi, S., Ebadati, O. M. & Kaur, H. Cost estimation and prediction in construction projects: A systematic review on machine learning techniques. SN Appl. Sci. 2 , 1–27 (2020).

Arage, S. S. & Dharwadkar, N. V. In: International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud)(I-SMAC). 594–599 (IEEE, 2017).

Cheng, C.-H., Chang, J.-R. & Yeh, C.-A. Entropy-based and trapezoid fuzzification-based fuzzy time series approaches for forecasting IT project cost. Technol. Forecast. Soc. Chang. 73 , 524–542 (2006).

Joukar, A. & Nahmens, I. Volatility forecast of construction cost index using general autoregressive conditional heteroskedastic method. J. Constr. Eng. Manag. 142 , 04015051 (2016).

Xu, J.-W. & Moon, S. Stochastic forecast of construction cost index using a cointegrated vector autoregression model. J. Manag. Eng. 29 , 10–18 (2013).

Narbaev, T. & De Marco, A. Combination of growth model and earned schedule to forecast project cost at completion. J. Constr. Eng. Manag. 140 , 04013038 (2014).

Naeni, L. M., Shadrokh, S. & Salehipour, A. A fuzzy approach for the earned value management. Int. J. Proj. Manag. 29 , 764–772 (2011).

Ponz-Tienda, J. L., Pellicer, E. & Yepes, V. Complete fuzzy scheduling and fuzzy earned value management in construction projects. J. Zhejiang Univ. Sci. A 13 , 56–68 (2012).

Yu, F., Chen, X., Cory, C. A., Yang, Z. & Hu, Y. An active construction dynamic schedule management model: Using the fuzzy earned value management and BP neural network. KSCE J. Civ. Eng. 25 , 2335–2349 (2021).

Bonato, F. K., Albuquerque, A. A. & Paixão, M. A. S. An application of earned value management (EVM) with Monte Carlo simulation in engineering project management. Gest. Produção 26 , e4641 (2019).

Batselier, J. & Vanhoucke, M. Empirical evaluation of earned value management forecasting accuracy for time and cost. J. Constr. Eng. Manag. 141 , 05015010 (2015).

Yang, R. J. & Zou, P. X. Stakeholder-associated risks and their interactions in complex green building projects: A social network model. Build. Environ. 73 , 208–222 (2014).

Uddin, S. Social network analysis in project management–A case study of analysing stakeholder networks. J. Mod. Proj. Manag. 5 , 106–113 (2017).

Ong, S. & Uddin, S. Co-evolution of project stakeholder networks. J. Mod. Proj. Manag. 8 , 96–115 (2020).

Khanzode, K. C. A. & Sarode, R. D. Advantages and disadvantages of artificial intelligence and machine learning: A literature review. Int. J. Libr. Inf. Sci. (IJLIS) 9 , 30–36 (2020).

Loyola-Gonzalez, O. Black-box vs. white-box: Understanding their advantages and weaknesses from a practical point of view. IEEE Access 7 , 154096–154113 (2019).

Abioye, S. O. et al. Artificial intelligence in the construction industry: A review of present status, opportunities and future challenges. J. Build. Eng. 44 , 103299 (2021).

Doloi, H., Sawhney, A., Iyer, K. & Rentala, S. Analysing factors affecting delays in Indian construction projects. Int. J. Proj. Manag. 30 , 479–489 (2012).

Alkhaddar, R., Wooder, T., Sertyesilisik, B. & Tunstall, A. Deep learning approach’s effectiveness on sustainability improvement in the UK construction industry. Manag. Environ. Qual. Int. J. 23 , 126–139 (2012).

Gondia, A., Siam, A., El-Dakhakhni, W. & Nassar, A. H. Machine learning algorithms for construction projects delay risk prediction. J. Constr. Eng. Manag. 146 , 04019085 (2020).

Witten, I. H. & Frank, E. Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann, 2005).

Kanakaris, N., Karacapilidis, N. I. & Lazanas, A. In: ICORES. 362–369.

Heo, S., Han, S., Shin, Y. & Na, S. Challenges of data refining process during the artificial intelligence development projects in the architecture engineering and construction industry. Appl. Sci. 11 , 10919 (2021).

Bross, I. D. How to use ridit analysis. Biometrics 14 , 18–38 (1958).

Uddin, S., Khan, A., Hossain, M. E. & Moni, M. A. Comparing different supervised machine learning algorithms for disease prediction. BMC Med. Inform. Decis. Mak. 19 , 1–16 (2019).

LaValle, S. M., Branicky, M. S. & Lindemann, S. R. On the relationship between classical grid search and probabilistic roadmaps. Int. J. Robot. Res. 23 , 673–692 (2004).

Abdi, H. & Williams, L. J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2 , 433–459 (2010).

Saxena, A. Survey on Road Construction Delay , https://www.kaggle.com/amansaxena/survey-on-road-construction-delay (2021).

Noble, W. S. What is a support vector machine? Nat. Biotechnol. 24 , 1565–1567 (2006).

Hosmer, D. W. Jr., Lemeshow, S. & Sturdivant, R. X. Applied Logistic Regression Vol. 398 (John Wiley & Sons, 2013).

LaValley, M. P. Logistic regression. Circulation 117 , 2395–2399 (2008).

Menard, S. Applied Logistic Regression Analysis Vol. 106 (Sage, 2002).

Batista, G. E. & Monard, M. C. A study of K-nearest neighbour as an imputation method. His 87 , 48 (2002).

Agatonovic-Kustrin, S. & Beresford, R. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J. Pharm. Biomed. Anal. 22 , 717–727 (2000).

Zupan, J. Introduction to artificial neural network (ANN) methods: What they are and how to use them. Acta Chim. Slov. 41 , 327–327 (1994).

Hopfield, J. J. Artificial neural networks. IEEE Circuits Devices Mag. 4 , 3–10 (1988).

Zou, J., Han, Y. & So, S.-S. Overview of artificial neural networks. Artificial Neural Networks . 14–22 (2008).

Maind, S. B. & Wankar, P. Research paper on basic of artificial neural network. Int. J. Recent Innov. Trends Comput. Commun. 2 , 96–100 (2014).

Wolpert, D. H. Stacked generalization. Neural Netw. 5 , 241–259 (1992).

Pavlyshenko, B. In: IEEE Second International Conference on Data Stream Mining & Processing (DSMP). 255–258 (IEEE).

Jović, A., Brkić, K. & Bogunović, N. In: 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). 1200–1205 (IEEE, 2015).

Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 46 , 389–422 (2002).

Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12 , 2825–2830 (2011).

Louppe, G., Wehenkel, L., Sutera, A. & Geurts, P. Understanding variable importances in forests of randomized trees. Adv. Neural. Inf. Process. Syst. 26 , 431–439 (2013).

Al-Hazim, N., Salem, Z. A. & Ahmad, H. Delay and cost overrun in infrastructure projects in Jordan. Procedia Eng. 182 , 18–24 (2017).

Breiman, L. Random forests. Mach. Learn. 45 , 5–32. https://doi.org/10.1023/A:1010933404324 (2001).

Shehu, Z., Endut, I. R. & Akintoye, A. Factors contributing to project time and hence cost overrun in the Malaysian construction industry. J. Financ. Manag. Prop. Constr. 19 , 55–75 (2014).

Akomah, B. B. & Jackson, E. N. Contractors’ perception of factors contributing to road project delay. Int. J. Constr. Eng. Manag. 5 , 79–85 (2016).

GitHub: Where the world builds software , https://github.com/ .

Anbari, F. T. Earned value project management method and extensions. Proj. Manag. J. 34 , 12–23 (2003).

Acebes, F., Pereda, M., Poza, D., Pajares, J. & Galán, J. M. Stochastic earned value analysis using Monte Carlo simulation and statistical learning techniques. Int. J. Proj. Manag. 33 , 1597–1609 (2015).

Japkowicz, N. & Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 6 , 429–449 (2002).

Chen, T. et al. Xgboost: extreme gradient boosting. R Packag. Version 0.4–2.1 1 , 1–4 (2015).

Guarino, A., Lettieri, N., Malandrino, D., Zaccagnino, R. & Capo, C. Adam or Eve? Automatic users’ gender classification via gestures analysis on touch devices. Neural Comput. Appl. 1–23 (2022).

Zaccagnino, R., Capo, C., Guarino, A., Lettieri, N. & Malandrino, D. Techno-regulation and intelligent safeguards. Multimed. Tools Appl. 80 , 15803–15824 (2021).

Acknowledgements

The authors acknowledge the insightful comments from Prof Jennifer Whyte on an earlier version of this article.

Author information

Authors and affiliations.

School of Project Management, The University of Sydney, Level 2, 21 Ross St, Forest Lodge, NSW, 2037, Australia

Shahadat Uddin, Stephen Ong & Haohui Lu

Contributions

S.U.: Conceptualisation; Data curation; Formal analysis; Methodology; Supervision; and Writing (original draft, review and editing). S.O.: Data curation; and Writing (original draft, review and editing). H.L.: Methodology; and Writing (original draft, review and editing). All authors reviewed the manuscript.

Corresponding author

Correspondence to Shahadat Uddin.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Uddin, S., Ong, S. & Lu, H. Machine learning in project analytics: a data-driven framework and case study. Sci Rep 12 , 15252 (2022). https://doi.org/10.1038/s41598-022-19728-x

Received: 13 April 2022

Accepted: 02 September 2022

Published: 09 September 2022

DOI: https://doi.org/10.1038/s41598-022-19728-x

Getting started with Classification

As the name suggests, classification is the task of “classifying things” into sub-categories. Classification is part of supervised machine learning, in which a model is trained on labeled data.

The article serves as a comprehensive guide to understanding and applying classification techniques, highlighting their significance and practical implications.

What is Supervised Machine Learning?

Supervised machine learning is where you have input variables (X) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output, Y = f(X). The goal is to approximate the mapping function so well that when you have new input data (X) you can predict the output variable (Y) for that data.

Supervised learning problems can be further grouped into Regression and Classification problems.

  • Regression: Regression algorithms are used to predict a continuous numerical output. For example, a regression algorithm could be used to predict the price of a house based on its size, location, and other features.
  • Classification: Classification algorithms are used to predict a categorical output. For example, a classification algorithm could be used to predict whether an email is spam or not.

Machine Learning for classification

Classification is a process of categorizing data or objects into predefined classes or categories based on their features or attributes.

Machine Learning classification is a type of supervised learning technique where an algorithm is trained on a labeled dataset to predict the class or category of new, unseen data.

The main objective of classification machine learning is to build a model that can accurately assign a label or category to a new observation based on its features.

For example, a classification model might be trained on a dataset of images labeled as either dogs or cats and then used to predict the class of new, unseen images of dogs or cats based on their features such as color, texture, and shape.

Classification Types

There are two main classification types in machine learning:

Binary Classification

In binary classification, the goal is to classify the input into one of two classes or categories. Example – On the basis of the given health conditions of a person, we have to determine whether the person has a certain disease or not.

Multiclass Classification

In multiclass classification, the goal is to classify the input into one of several classes or categories. Example – On the basis of data about different species of flowers, we have to determine which species our observation belongs to.

Figure: Binary vs. multiclass classification

Other categories of classification include:

Multi-Label Classification

In multi-label classification, the goal is to predict which of several labels apply to a new data point; a data point can carry more than one label at once. This is different from multiclass classification, where each data point can only belong to one class. For example, a multi-label classification algorithm could be used to classify images of animals as belonging to one or more of the categories cat, dog, bird, or fish.
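One common way to represent multi-label targets is a binary indicator matrix with one column per label. The following is a minimal sketch in plain Python; the label names come from the animal example, while the four image annotations are hypothetical and exist only for illustration.

```python
# Labels from the animal example; the order fixes the column order.
classes = ["bird", "cat", "dog", "fish"]

# Hypothetical annotations: each image may carry zero, one, or several labels.
image_labels = [
    {"cat"},
    {"dog", "bird"},
    set(),
    {"cat", "dog", "fish"},
]

# One row per image, one 0/1 column per label.
indicator = [[1 if c in labels else 0 for c in classes] for labels in image_labels]

for row in indicator:
    print(row)
```

In a multiclass problem every row would contain exactly one 1; here a row may contain several, or none.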

Imbalanced Classification

In imbalanced classification, the class distribution is heavily skewed, and the goal is usually to correctly identify the minority class even though there are many more examples of the majority class. For example, a medical diagnosis algorithm could be used to predict whether a patient has a rare disease, even though the vast majority of patients do not have it.
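A small sketch may make the difficulty concrete. The data below are hypothetical (95 healthy patients and 5 with the rare disease), and the "model" simply predicts the majority class every time. The class weights at the end follow the n_samples / (n_classes * class_count) heuristic, which is the one scikit-learn documents for `class_weight="balanced"`; other weighting schemes are possible.

```python
from collections import Counter

# Hypothetical screening data: 95 healthy patients (0) and 5 with a rare disease (1).
y_true = [0] * 95 + [1] * 5

# A useless "classifier" that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

# "Balanced" class weights: n_samples / (n_classes * class_count).
counts = Counter(y_true)
class_weight = {c: len(y_true) / (len(counts) * n) for c, n in counts.items()}

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
print(class_weight)
```

The accuracy looks high although not a single diseased patient is found, which is why recall and class weighting matter for imbalanced problems.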

Classification Algorithms

There are various types of classification algorithms. Some of them are:

Linear Classifiers

Linear models create a linear decision boundary between classes. They are simple and computationally efficient. Some of the linear classification models are as follows: 

  • Logistic Regression
  • Support Vector Machines with a linear kernel (kernel = ‘linear’)
  • Single-layer Perceptron
  • Stochastic Gradient Descent (SGD) Classifier

Non-linear Classifiers

Non-linear models create a non-linear decision boundary between classes. They can capture more complex relationships between the input features and the target variable. Some of the non-linear classification models are as follows: 

  • K-Nearest Neighbours
  • Naive Bayes
  • Decision Tree Classification
  • Ensemble learning classifiers: Random Forests, Bagging Classifier, Voting Classifier, ExtraTrees Classifier
  • Multi-layer Artificial Neural Networks
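The difference between the two families can be illustrated on the XOR problem, which no straight line can separate. The sketch below assumes scikit-learn is installed and uses `LogisticRegression` as the linear model and a 1-nearest-neighbour classifier as the non-linear one; both choices are illustrative, and the four data points are the standard XOR truth table.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# XOR: the label is 1 when exactly one of the two inputs is 1.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

linear_model = LogisticRegression().fit(X, y)
nonlinear_model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Accuracy on the training points themselves, only to compare boundary shapes.
linear_acc = linear_model.score(X, y)
nonlinear_acc = nonlinear_model.score(X, y)

print(f"logistic regression: {linear_acc:.2f}")
print(f"1-nearest neighbour: {nonlinear_acc:.2f}")
```

A linear boundary can classify at most three of the four XOR points correctly, whereas the non-linear model fits all four. Training accuracy is used here only to show what each boundary can express, not as an estimate of generalization.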

Learners in Classification Algorithms

In machine learning, classification learners can also be classified as either “lazy” or “eager” learners.

  • Lazy Learners: Also known as instance-based learners, lazy learners do not learn a model during the training phase. Instead, they simply store the training data and use it to classify new instances at prediction time. Training is therefore very fast, but prediction is slower because the computation is deferred until a new instance has to be classified. Lazy learners tend to be less effective in high-dimensional spaces or when the number of training instances is large. Examples of lazy learners include k-nearest neighbors and case-based reasoning.
  • Eager Learners: Also known as model-based learners, eager learners learn a model from the training data during the training phase and use this model to classify new instances at prediction time. Training takes longer, but prediction is fast. They are often more effective in high-dimensional spaces with large training datasets. Examples of eager learners include decision trees, random forests, and support vector machines.
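To show where each kind of learner does its work, here is a minimal sketch in plain Python. `LazyNearestNeighbor` and `EagerNearestCentroid` are hypothetical class names written for this illustration (a 1-nearest-neighbour rule and a nearest-centroid rule), and the four training points are made up.

```python
import math

class LazyNearestNeighbor:
    """Lazy: fit() only stores the data; all distance work happens in predict_one()."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict_one(self, x):
        distances = [math.dist(x, xi) for xi in self.X]
        return self.y[distances.index(min(distances))]

class EagerNearestCentroid:
    """Eager: fit() condenses the data into one centroid per class."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for xi, yi in zip(X, y):
            acc = sums.setdefault(yi, [0.0] * len(xi))
            for j, value in enumerate(xi):
                acc[j] += value
            counts[yi] = counts.get(yi, 0) + 1
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict_one(self, x):
        return min(self.centroids, key=lambda c: math.dist(x, self.centroids[c]))

# Hypothetical two-class training data.
X_train = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5)]
y_train = ["a", "a", "b", "b"]

lazy = LazyNearestNeighbor().fit(X_train, y_train)
eager = EagerNearestCentroid().fit(X_train, y_train)

print(lazy.predict_one((1.2, 1.1)), eager.predict_one((1.2, 1.1)))
print(lazy.predict_one((5.5, 5.0)), eager.predict_one((5.5, 5.0)))
```

The lazy learner keeps all four training points and compares against each of them for every prediction; the eager learner keeps only two centroids, so its prediction cost no longer depends on the size of the training set.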

Classification Models in Machine Learning

Evaluating a classification model is an important step in machine learning, as it helps to assess the performance and generalization ability of the model on new, unseen data. There are several metrics and techniques that can be used to evaluate a classification model, depending on the specific problem and requirements. Here are some commonly used evaluation metrics:

  • Classification Accuracy: The proportion of correctly classified instances over the total number of instances in the test set. It is a simple and intuitive metric but can be misleading in imbalanced datasets where the majority class dominates the accuracy score.
  • Confusion matrix : A table that shows the number of true positives, true negatives, false positives, and false negatives for each class, which can be used to calculate various evaluation metrics.
  • Precision and Recall: Precision measures the proportion of true positives over the total number of predicted positives, while recall measures the proportion of true positives over the total number of actual positives. These metrics are useful in scenarios where one class is more important than the other, or when there is a trade-off between false positives and false negatives.
  • F1-Score: The harmonic mean of precision and recall, calculated as 2 x (precision x recall) / (precision + recall). It is a useful metric for imbalanced datasets where both precision and recall are important.
  • ROC curve and AUC: The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate (recall) against the false positive rate (1-specificity) for different threshold values of the classifier’s decision function. The Area Under the Curve (AUC) summarizes the overall performance of the classifier in a single number between 0 and 1, where 0.5 corresponds to random guessing and 1 to perfect classification.
  • Cross-validation: A technique that divides the data into multiple folds, trains the model on all but one fold and tests it on the held-out fold, and repeats this so that each fold serves as the test set once, to obtain a more robust estimate of the model’s performance.
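The count-based metrics in this list can be computed directly from the four confusion-matrix cells. The sketch below uses plain Python and hypothetical labels (1 is the positive class); it is meant to show the formulas, not to replace a library implementation.

```python
# Hypothetical ground truth and predictions; 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these labels the counts are TP=4, TN=3, FP=1, FN=2, giving an accuracy of 0.7, a precision of 0.8, a recall of about 0.667 and an F1-score of about 0.727.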

It is important to choose the appropriate evaluation metric(s) based on the specific problem and requirements, and to detect overfitting by evaluating the model on independent test data that was not used for training.
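K-fold cross-validation is one way to make sure every evaluation uses data the model has not been trained on. The following plain-Python sketch only generates the index splits; `k_fold_indices` is a hypothetical helper written for this illustration, and it does not shuffle, which a real setup usually would.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

In each round the model would be fitted on `train_idx` and scored on `test_idx`, and the k scores averaged.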

Characteristics of Classification

Here are the characteristics of the classification:

  • Categorical Target Variable: Classification deals with predicting categorical target variables that represent discrete classes or labels. Examples include classifying emails as spam or not spam, predicting whether a patient has a high risk of heart disease, or identifying image objects.
  • Accuracy and Error Rates: Classification models are evaluated based on their ability to correctly classify data points. Common metrics include accuracy, precision, recall, and F1-score.
  • Model Complexity: Classification models range from simple linear classifiers to more complex nonlinear models. The choice of model complexity depends on the complexity of the relationship between the input features and the target variable.
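  • Overfitting and Underfitting: Classification models are susceptible to overfitting and underfitting. Overfitting occurs when the model learns the training data too well and fails to generalize to new data. Underfitting occurs when the model is too simple to capture the underlying patterns and performs poorly even on the training data.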

How does Classification Machine Learning Work?

The basic idea behind classification is to train a model on a labeled dataset, where the input data is associated with their corresponding output labels, to learn the patterns and relationships between the input data and output labels. Once the model is trained, it can be used to predict the output labels for new unseen data.

Figure: Classification in machine learning

The classification process typically involves the following steps:

Understanding the problem

Before getting started with classification, it is important to understand the problem you are trying to solve. What are the class labels you are trying to predict? What is the relationship between the input data and the class labels?

Suppose we have to predict whether a patient has a certain disease or not, on the basis of 7 independent variables, called features. This means, there can be only two possible outcomes: 

  • The patient has the disease, which means “True”.
  • The patient has no disease, which means “False”.

This is a binary classification problem.

Data preparation

Once you have a good understanding of the problem, the next step is to prepare your data. This includes collecting and preprocessing the data and splitting it into training, validation, and test sets. In this step, the data is cleaned, preprocessed, and transformed into a format that can be used by the classification algorithm.

  • X: The matrix of independent features, of size N × M, where N is the number of observations and M is the number of features.
  • y: A vector of length N containing the known class label for each of the N observations.
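A minimal sketch of this step, assuming scikit-learn is installed. The data are synthetic: `make_classification` stands in for the 7-feature disease example, and the 60/20/20 split ratios and `random_state` values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the disease example: N = 200 patients, M = 7 features.
X, y = make_classification(
    n_samples=200, n_features=7, n_informative=5, n_redundant=0, random_state=0
)

# First hold out 20% as the test set, then 25% of the remainder as validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(X.shape, y.shape)
print(len(X_train), len(X_val), len(X_test))
```

Passing `stratify` keeps the class proportions roughly equal across the three subsets, which matters most when one class is rare.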

Feature Extraction

The relevant features or attributes are extracted from the data that can be used to differentiate between the different classes.

Suppose our input X has 7 independent features, of which only 5 influence the label or target values, while the remaining 2 are negligibly correlated or uncorrelated with it. We will then use only these 5 features for model training.
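Keeping only the most relevant columns is usually called feature selection. One simple way to do it is to score each feature against the label and keep the top k. The sketch below assumes scikit-learn and uses `SelectKBest` with the ANOVA F-score (`f_classif`); the synthetic data and the choice of scoring function are assumptions for illustration, and univariate scores are only one of several selection strategies.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 7 features, 5 of them informative, 2 pure noise.
X, y = make_classification(
    n_samples=200, n_features=7, n_informative=5, n_redundant=0, random_state=0
)

# Score every feature against the label and keep the 5 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)
print("kept columns:", selector.get_support(indices=True))
```

The selector should be fitted on the training split only and then applied to validation and test data, so that information from held-out data does not leak into the choice of features.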

Model Selection

There are many different models that can be used for classification, including logistic regression, decision trees, support vector machines (SVM), or neural networks . It is important to select a model that is appropriate for your problem, taking into account the size and complexity of your data, and the computational resources you have available.

Model Training

Once you have selected a model, the next step is to train it on your training data. This involves adjusting the parameters of the model to minimize the error between the predicted class labels and the actual class labels for the training data.
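As an illustration of what "adjusting the parameters to minimize the error" means, the sketch below trains a one-feature logistic regression with plain gradient descent on the log loss. The data, learning rate and iteration count are hypothetical choices, and the code is written in plain Python for readability rather than speed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, b, X, y):
    """Average cross-entropy between predicted probabilities and true labels."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(w * xi + b)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)

# Hypothetical one-feature training set: small values are class 0, large are class 1.
X = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
y = [0, 0, 0, 1, 1, 1]

w, b, learning_rate = 0.0, 0.0, 0.1
initial_loss = log_loss(w, b, X, y)

for _ in range(2000):
    # Gradients of the average log loss with respect to w and b.
    errors = [sigmoid(w * xi + b) - yi for xi, yi in zip(X, y)]
    grad_w = sum(e * xi for e, xi in zip(errors, X)) / len(X)
    grad_b = sum(errors) / len(X)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = log_loss(w, b, X, y)
predictions = [1 if sigmoid(w * xi + b) >= 0.5 else 0 for xi in X]

print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
print("predictions:", predictions)
```

Each iteration moves the parameters a small step in the direction that reduces the loss; library implementations use more sophisticated optimizers but follow the same idea.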

Model Evaluation

After training the model, it is important to evaluate its performance on a validation set. This will give you a good idea of how well the model is likely to perform on new, unseen data.

Log loss (cross-entropy loss), the confusion matrix, precision, recall, and the AUC-ROC curve are among the quality metrics used for measuring the performance of the model.

Fine-tuning the model

If the model’s performance is not satisfactory, you can fine-tune it by adjusting its hyperparameters or by trying a different model.
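One systematic way to adjust hyperparameters is a grid search with cross-validation. The sketch below assumes scikit-learn and tunes a k-nearest-neighbours classifier on the Iris data; the model, the grid values and the 5-fold setting are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values to try in every combination.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9], "weights": ["uniform", "distance"]}

# Each combination is scored with 5-fold cross-validated accuracy.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(f"mean cross-validated accuracy: {search.best_score_:.3f}")
```

After the search, `search.best_estimator_` holds a model refitted on all the supplied data with the best combination found.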

Deploying the model

Finally, once we are satisfied with the performance of the model, we can deploy it to make predictions on new data and use it to solve real-world problems.
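At its simplest, deployment means saving what the model has learned and loading it again wherever predictions are needed. The sketch below is a deliberately small illustration in plain Python: the weights, bias and threshold are hypothetical values for an already trained logistic regression, and JSON is used as the storage format only because it is easy to inspect.

```python
import json
import math

# Hypothetical parameters of an already trained logistic regression model.
trained = {"weights": [0.8, -1.2, 0.05], "bias": -0.3, "threshold": 0.5}

# Step 1: serialise the parameters (to a string here; a file or a model
# registry would play the same role in practice).
payload = json.dumps(trained)

# Step 2: the serving code restores the parameters and predicts.
def predict(params, features):
    z = sum(w * x for w, x in zip(params["weights"], features)) + params["bias"]
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability >= params["threshold"]

restored = json.loads(payload)
print(predict(restored, [2.0, 0.5, 1.0]))
print(predict(restored, [0.0, 2.0, 0.0]))
```

Real deployments add input validation, versioning and monitoring on top of this round trip, but the core idea of separating training from serving is the same.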

Examples of Machine Learning Classification in Real Life

Classification algorithms are widely used in many real-world applications across various domains, including:

  • Email spam filtering
  • Credit risk assessment
  • Medical diagnosis
  • Image classification
  • Sentiment analysis
  • Fraud detection
  • Quality control
  • Recommendation systems

Implementation of Classification Model in Machine Learning

Let’s get a hands-on experience with how Classification works. We are going to study various Classifiers and see a rather simple analytical comparison of their performance on a well-known, standard data set, the Iris data set.  

Requirements for running the given script:

  • Scipy and Numpy
  • Scikit-learn  
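A minimal sketch of such a comparison follows. It is not the article's original script; the four classifiers, the 70/30 split and the `random_state` values are assumptions chosen for illustration, and only scikit-learn is imported directly.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Load the Iris data set and hold out 30% of it for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbours": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM (linear kernel)": SVC(kernel="linear"),
}

# Train each classifier on the same split and record its test accuracy.
results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    results[name] = clf.score(X_test, y_test)

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Because Iris is small and well separated, the accuracies are usually close to each other; a single split like this is a rough comparison, and cross-validation would give a more reliable ranking.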

In conclusion, classification is a fundamental task in machine learning, involving the categorization of data into predefined classes or categories based on their features.

Frequently Asked Questions (FAQs)

What is a classification rule in machine learning?

A decision guideline in machine learning determining the class or category of input based on features.

What are classification algorithms?

Methods like decision trees, SVM, and k-NN categorizing data into predefined classes for predictions.

What is learning classification?

Acquiring knowledge to assign labels to input data, distinguishing classes in supervised machine learning.

What is the difference between classification and clustering?

Classification: Predicts predefined classes. Clustering: Groups data based on inherent similarities without predefined classes.

What is the difference between classification and regression methods?

Classification: Assigns labels to data classes. Regression: Predicts continuous values for quantitative analysis.

thecleverprogrammer

Data Science Case Studies on Classification

Aman Kharwal

  • May 22, 2021
  • Machine Learning

Working on case studies is one of the best practices that will help you improve your problem-solving skills as a data scientist. In this article, I’m going to introduce you to some of the best data science case studies based on the problems of classification that will help you understand and solve problems based on classification using machine learning.

Click-Through Rate Prediction:

If your first data science job is at a company that does internet advertising, this case study will help you a lot. By understanding the click-through rate, an ad agency can select and target the potential customers who are most likely to respond to ads.

This data science case study is based on classification because you have to predict whether a person will respond to ads or not. You can find this data science case study solved and explained using Python here.

Mobile Price Classification:

Smartphones are among the best-selling electronic devices because people keep buying new smartphones when they find new features on a new device. In such a market, it is very difficult for someone who is considering starting a new smartphone business to decide on the price of a smartphone.

So in this task, you have to classify smartphones into price ranges based on their features. You can find this data science case study on classification solved and explained using Python here.

Gender Detection:

Predicting a person’s gender is a problem that combines classification and computer vision. If you get your first data science job at a company that is very active in building computer vision applications, this will be one of the most basic classification-based tasks for you.

Here you need to train a model that can detect the gender of a person from an input image or a real-time camera feed. You can find a solution for this data science case study here.

Working on case studies is one of the best practices for improving your problem-solving skills as a data scientist. I hope you liked this article on data science case studies on classification. All three case studies mentioned in this article are solved and explained using Python. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal

Data Strategist at Statso. My aim is to decode data science for the real world in the most simple words.

Machine Learning Case Studies with Powerful Insights

Explore the potential of machine learning through these practical machine learning case studies and success stories in various industries. | ProjectPro

Machine learning is revolutionizing how different industries function, from healthcare to finance to transportation. If you're curious about how this technology is applied in real-world scenarios, look no further. In this blog, we'll explore some exciting machine learning case studies that showcase the potential of this powerful emerging technology.

Machine-learning-based applications have quickly transformed work methods in the technological world. They are changing the way we work, live, and interact with the world around us, from personalized recommendations on streaming platforms to self-driving cars.

But while the technology of artificial intelligence and machine learning may seem abstract or daunting to some, its applications are incredibly tangible and impactful. Data scientists use machine learning algorithms to predict equipment failures in manufacturing, improve cancer diagnoses in healthcare, and even detect fraudulent activity in finance. If you're interested in learning more about how machine learning is applied in real-world scenarios, you are on the right page. This blog will explore in depth how machine learning applications are used for solving real-world problems.

Machine Learning Case Studies

We'll start with a few case studies from GitHub that examine how businesses use machine learning to retain their customers and improve customer satisfaction. We'll also look at how machine learning is used with the help of the Python programming language to detect and prevent fraud in the financial sector, and how it can save companies millions of dollars in losses. Next, we will examine how top companies use machine learning to solve various business problems. Additionally, we'll explore how machine learning is used in the healthcare industry, and how this technology can improve patient outcomes and save lives.

By going through these case studies, you will better understand how machine learning is transforming work across different industries. So, let's get started!

Table of Contents

  • Machine Learning Case Studies on GitHub
  • Machine Learning Case Studies in Python
  • Company-Specific Machine Learning Case Studies
  • Machine Learning Case Studies in Biology and Healthcare
  • AWS Machine Learning Case Studies
  • Azure Machine Learning Case Studies
  • How to Prepare for Machine Learning Case Studies Interview

This section has machine learning case studies along with their GitHub repository that contains the sample code.

1. Customer Churn Prediction

Predicting customer churn is essential for businesses interested in retaining customers and maximizing their profits. By leveraging historical customer data, machine learning algorithms can identify patterns and factors that are correlated with churn, enabling businesses to take proactive steps to prevent it.

In this case study, you will study how a telecom company uses machine learning for customer churn prediction. The available data contains information about the services each customer signed up for, their contact information, monthly charges, and their demographics. The goal is to first analyze the data at hand using Exploratory Data Analysis (EDA) methods, which helps in picking a suitable machine learning algorithm. The five machine learning models used in this case study are AdaBoost, Gradient Boost, Random Forest, Support Vector Machines, and K-Nearest Neighbors. These models are used to determine which customers are at risk of churn.

By using machine learning for churn prediction, businesses can better understand customer behavior, identify areas for improvement, and implement targeted retention strategies. It can result in increased customer loyalty, higher revenue, and a better understanding of customer needs and preferences. This case study example will help you understand how machine learning is a valuable tool for any business looking to improve customer retention and stay ahead of the competition.

GitHub Repository: https://github.com/Pradnya1208/Telecom-Customer-Churn-prediction  

2. Market Basket Analysis

Market basket analysis is a common application of machine learning in retail and e-commerce, where it is used to identify patterns and relationships between products that are frequently purchased together. By leveraging this information, businesses can make informed decisions about product placement, promotions, and pricing strategies.

In this case study, you will utilize the EDA methods to carefully analyze the relationships among different variables in the data. Next, you will study how to use the Apriori algorithm to identify frequent itemsets and association rules, which describe the likelihood of a product being purchased given the presence of another product. These rules can generate recommendations, optimize product placement, and increase sales, and they can also be used for customer segmentation.  
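
To make the terms above concrete, the sketch below computes support, confidence, and lift, the quantities that frequent itemsets and association rules are built on. It is only an illustration: the transactions are made up, and the counting is brute-force enumeration rather than Apriori's level-wise candidate pruning.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def frequent_itemsets(baskets, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items.

    Support is the fraction of baskets that contain the itemset.
    """
    n = len(baskets)
    counts = Counter()
    for basket in baskets:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(basket), size):
                counts[frozenset(combo)] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def rule_metrics(support, antecedent, consequent):
    """Confidence and lift of the rule antecedent -> consequent."""
    both = support[antecedent | consequent]
    confidence = both / support[antecedent]   # P(consequent | antecedent)
    lift = confidence / support[consequent]   # > 1 means positive association
    return confidence, lift


support = frequent_itemsets(transactions, min_support=0.4)
confidence, lift = rule_metrics(
    support, frozenset({"butter"}), frozenset({"bread"})
)
print(f"butter -> bread: confidence={confidence:.2f}, lift={lift:.2f}")
```

In this toy data every basket with butter also contains bread, so the rule has confidence 1.0, and a lift above 1 indicates the two items appear together more often than chance would suggest.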

Using machine learning for market basket analysis allows businesses to understand customer behavior better, identify cross-selling opportunities, and increase customer satisfaction. It has the potential to result in increased revenue, improved customer loyalty, and a better understanding of customer needs and preferences. 

GitHub Repository: https://github.com/kkrusere/Market-Basket-Analysis-on-the-Online-Retail-Data

3. Predicting Prices for Airbnb

Airbnb is a tech company that enables hosts to rent out their homes, apartments, or rooms to guests interested in temporary lodging. One of the key challenges hosts face is optimizing the rent prices for the customers. With the help of machine learning, hosts can have rough estimates of the rental costs based on various factors such as location, property type, amenities, and availability.

The first step, in this case study, is to clean the dataset to handle missing values, duplicates, and outliers. In the same step, the data is transformed, and the data is prepared for modeling with the help of feature engineering methods. The next step is to perform EDA to understand how the rental listings are spread across different cities in the US. Next, you will learn how to visualize how prices change over time, looking at trends for different seasons, months, days of the week, and times of the day.

The final step involves implementing ML models like linear regression (ridge and lasso), Naive Bayes, and Random Forests to produce price estimates for listings. You will learn how to compare the outcome of these models and evaluate their performance.

GitHub Repository: https://github.com/samuelklam/airbnb-pricing-prediction  

4. Titanic Disaster Analysis

The Titanic Machine Learning Case Study is a classic example in the field of data science and machine learning. The study is based on the dataset of passengers aboard the Titanic when it sank in 1912. The study's goal is to predict whether a passenger survived or not based on their demographic and other information.

The dataset contains information on 891 passengers, including their age, gender, ticket class, fare paid, as well as whether or not they survived the disaster. The first step in the analysis is to explore the dataset and identify any missing values or outliers. Once this is done, the data is preprocessed to prepare it for modeling.

The next step is to build a predictive model using various machine learning algorithms, such as logistic regression, decision trees, and random forests. These models are trained on a subset of the data and evaluated on another subset to ensure they can generalize well to new data.
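
Training on one subset and evaluating on another is usually done with a random holdout split. A minimal standard-library sketch follows; the function name `holdout_split`, the 80/20 ratio, and the seed are assumptions for illustration, and passenger IDs stand in for full passenger records.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)          # seeded so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 891 passengers, as in the dataset described above; IDs stand in for rows.
passenger_ids = list(range(891))
train_ids, test_ids = holdout_split(passenger_ids)
print(len(train_ids), len(test_ids))
```

Keeping the test rows completely out of training is what makes the evaluation a fair estimate of how the model generalizes to new data.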

Finally, the model is used to make predictions on a test dataset, and the model performance is measured using various metrics such as accuracy, precision, and recall. The study results can be used to improve safety protocols and inform future disaster response efforts.
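
The metrics named above can be computed directly from the counts of true and false positives and negatives. The sketch below uses made-up labels (1 = survived, 0 = did not survive) purely to illustrate the formulas.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Of the passengers predicted to survive, how many actually did?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the passengers who survived, how many did the model find?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```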

GitHub Repository: https://github.com/ashishpatel26/Titanic-Machine-Learning-from-Disaster  

If you are looking for sample machine learning case studies in Python, keep reading.

5. Loan Application Classification

Financial institutions receive a large volume of loan requests from borrowers, and deciding on each request is a crucial task. Manually processing these requests can be time-consuming and error-prone, so there is increasing demand for machine learning to automate and improve this process.

You can work on this Loan Dataset on Kaggle to get started on one of the most practical case studies in the financial industry. The dataset contains 614 records across 13 columns. Follow the steps below to get started on this case study:

  • Analyze the dataset and explore how factors such as gender, marital status, and employment affect the loan amount and the status of the loan application.
  • Select the features to automate the classification of loan applications.
  • Apply machine learning models such as logistic regression, decision trees, and random forests to the features and compare their performance using statistical metrics.
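
The first step, exploring how a factor relates to loan status, can be sketched as a simple grouped approval rate. The records and field names below (`married`, `self_employed`, `approved`) are hypothetical and are not the Kaggle dataset's actual rows or column names.

```python
from collections import defaultdict

# Hypothetical loan applications, invented for illustration only.
applications = [
    {"married": "Yes", "self_employed": "No", "approved": True},
    {"married": "Yes", "self_employed": "No", "approved": True},
    {"married": "Yes", "self_employed": "Yes", "approved": False},
    {"married": "No", "self_employed": "No", "approved": True},
    {"married": "No", "self_employed": "Yes", "approved": False},
    {"married": "No", "self_employed": "No", "approved": False},
]


def approval_rate_by(records, field):
    """Return {category: share of approved applications} for one field."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for record in records:
        totals[record[field]] += 1
        approved[record[field]] += int(record["approved"])
    return {category: approved[category] / totals[category] for category in totals}


rate_by_marital_status = approval_rate_by(applications, "married")
rate_by_employment = approval_rate_by(applications, "self_employed")
print(rate_by_marital_status)
print(rate_by_employment)
```

Large gaps between categories suggest a feature worth keeping for the classifier, although a gap in six invented rows says nothing about real applicants.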

This case study falls under the umbrella of supervised learning problems in machine learning and demonstrates how ML models are used to automate tasks in the financial industry.

6. Computer Price Estimation

Whenever one thinks of buying a new computer, the first thing that comes to mind is to curate a list of hardware specifications that best suit their needs. The next step is browsing different websites and looking for the cheapest option available. Performing all these processes can be time-consuming and require a lot of effort. But you don’t have to worry as machine learning can help you build a system that can estimate the price of a computer system by taking into account its various features.

This sample basic computer dataset on Kaggle can help you develop a price estimation model that analyzes historical data and identifies patterns and trends in the relationship between computer specifications and prices. By training a machine learning model on this data, the model can learn to predict prices for new or unseen computer configurations. Machine learning algorithms such as K-Nearest Neighbours, Decision Trees, Random Forests, AdaBoost and XGBoost can effectively capture complex relationships between features and prices, leading to more accurate price estimates.
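
As an illustration of the simplest of these algorithms, here is a K-Nearest Neighbours price estimate written with the standard library only. The specifications and prices are invented, and the feature set (RAM, storage, CPU clock) is an assumption rather than the Kaggle dataset's actual columns.

```python
import math

# Hypothetical (ram_gb, storage_gb, cpu_ghz) -> price rows, for illustration.
computers = [
    ((8, 256, 2.4), 600.0),
    ((16, 512, 3.0), 1000.0),
    ((16, 1024, 3.4), 1300.0),
    ((32, 1024, 3.6), 1800.0),
    ((8, 512, 2.6), 700.0),
]


def knn_predict(train, query, k=3):
    """Predict a price as the mean price of the k nearest training rows."""
    features = [row for row, _ in train]
    lows = [min(col) for col in zip(*features)]
    highs = [max(col) for col in zip(*features)]

    def scale(row):
        # Min-max scale each feature so that storage (in GB) does not
        # dominate the distance just because its numbers are larger.
        return [
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, lows, highs)
        ]

    q = scale(query)
    by_distance = sorted(train, key=lambda item: math.dist(scale(item[0]), q))
    nearest = by_distance[:k]
    return sum(price for _, price in nearest) / len(nearest)


estimate = knn_predict(computers, (16, 512, 3.0), k=3)
print(f"Estimated price: {estimate:.2f}")
```

Tree-based models such as Random Forests or XGBoost usually need less feature scaling than this, which is one reason they are popular for tabular price data.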

Besides saving time and effort compared to manual estimation methods, this project also has a business use case as it can provide stakeholders with valuable insights into market trends and consumer preferences.

7. House Price Prediction

Here is a machine learning case study that aims to predict the median value of owner-occupied homes in Boston suburbs based on various features such as crime rate, number of rooms, and pupil-teacher ratio.

Start working on this study by collecting the data from the publicly available UCI Machine Learning Repository, which contains information about 506 neighborhoods in the Boston area. The dataset includes 13 features such as per capita crime rate, average number of rooms per dwelling, and the proportion of owner-occupied units built before 1940. You can gain more insights into this data by using EDA techniques. Then prepare the dataset for implementing ML models by handling missing values, converting categorical features to numerical ones, and scaling the data.
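
Two of the preparation steps mentioned above, converting categorical features to numbers and scaling, can be sketched as follows. The values are invented, and the helper names `one_hot` and `standardize` are illustrative rather than taken from any library.

```python
import statistics


def one_hot(values):
    """Encode categories as 0/1 columns, one column per distinct category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


def standardize(column):
    """Rescale a numeric column to zero mean and unit standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std if std else 0.0 for x in column]


# Hypothetical feature values, for illustration only.
location = ["river", "inland", "river"]
rooms = [5.0, 6.0, 7.0]

encoded_location, categories = one_hot(location)
scaled_rooms = standardize(rooms)
print(categories, encoded_location)
print(scaled_rooms)
```

In practice the mean and standard deviation should be computed on the training split only and then reused for the test split, so that no information leaks from the test data.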

Use machine learning algorithms such as Linear Regression, Lasso Regression, and Random Forest to predict house prices for different neighborhoods in the Boston area. Select the best model by comparing the performance of each one using metrics such as mean squared error, mean absolute error, and R-squared.
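
The three comparison metrics can be computed by hand as shown in this sketch. The true and predicted values are made up (think of them as median home values in thousands of dollars) and only serve to show the formulas.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R-squared) for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    r2 = 1 - ss_res / ss_tot
    return mse, mae, r2


# Hypothetical values, for illustration only.
y_true = [24.0, 21.0, 34.0, 33.0]
y_pred = [25.0, 20.0, 32.0, 35.0]

mse, mae, r2 = regression_metrics(y_true, y_pred)
print(f"MSE={mse:.2f}  MAE={mae:.2f}  R^2={r2:.3f}")
```

Lower MSE and MAE are better, while R-squared closer to 1 is better, so the "best" model should be judged on the same held-out data for all candidates.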

This section has machine learning case studies of different firms across various industries.

8. Machine Learning Case Study on Dell

Dell Technologies is a multinational technology company that designs, develops, and sells computers, servers, data storage devices, network switches, software, and other technology products and services. Dell is one of the world's most prominent PC vendors and serves customers in over 180 countries. Because data is central to Dell's business, its marketing team needed a data-focused solution that would improve response rates and demonstrate why some words and phrases are more effective than others.

Dell partnered with Persado, a firm that uses AI to create marketing content. Persado helped Dell revamp its email marketing strategy and leverage data analytics to capture its audience's attention. The reported statistics show that the partnership produced a noticeable increase in customer engagement: page visits rose by 22% on average, and CTR rose by 50% on average.

Dell currently relies on ML methods to improve their marketing strategy for emails, banners, direct mail, Facebook ads, and radio content.

9. Machine Learning Case Study on Harley Davidson

In the current environment, it is challenging to move beyond traditional marketing. Albert, an artificial intelligence powered robot, is appealing for a business like Harley Davidson. Robots are now directing traffic, creating news stories, working in hotels, and even running McDonald's, thanks to machine learning and artificial intelligence.

Albert can be applied to many marketing channels, including email and social media. It automatically prepares customized creative copy and forecasts which customers are most likely to convert.

Harley Davidson is one of the companies that has made use of Albert. The business examined customer data to identify the behavior of past clients who made purchases and spent more time than usual across different pages on the website. With this knowledge, Albert divided the customer base into groups and adjusted the scale of test campaigns accordingly.

Results reveal that using Albert increased Harley Davidson's sales by 40%. The brand also saw a 2,930% spike in leads, 50% of which came from very effective "lookalikes" found by machine learning and artificial intelligence.

10. Machine Learning Case Study on Zomato

Zomato is a popular online platform that provides restaurant search and discovery services, online ordering and delivery, and customer reviews and ratings. Founded in India in 2008, the company has expanded to over 24 countries and serves millions of users globally. Over the years, it has become a popular choice for consumers to browse the ratings of different restaurants in their area. 

To provide the best restaurant options to its customers, Zomato aims to hand-pick the ones likely to perform well in the future. Machine learning can help Zomato make such decisions by considering different restaurant features. You can work on this sample Zomato Restaurants Data and experiment with how machine learning can be useful to Zomato. The dataset has the details of 9551 restaurants. The first step should involve careful analysis of the data to identify outliers and missing values in the dataset. Treat them using statistical methods and then use regression models to predict the rating of different restaurants.

The Zomato Case study is one of the most popular machine learning startup case studies among data science enthusiasts.

11. Machine Learning Case Study on Tesla

Tesla, Inc. is an American electric vehicle and clean energy company founded in 2003 by Martin Eberhard and Marc Tarpenning; Elon Musk joined as an early investor and chairman in 2004 and later became CEO. The company designs, manufactures, and sells electric cars, battery storage systems, and solar products. Tesla has pioneered the electric vehicle industry and has popularized high-capacity lithium-ion batteries and regenerative braking systems. The company strongly focuses on innovation, sustainability, and reducing the world's dependence on fossil fuels.

Tesla uses machine learning in various ways to enhance the performance and features of its electric vehicles. One of the most notable applications of machine learning at Tesla is in its Autopilot system, which uses a combination of cameras, sensors, and machine learning algorithms to enable advanced driver assistance features such as lane centering, adaptive cruise control, and automatic emergency braking.

Tesla's Autopilot system uses deep neural networks to process large amounts of real-world driving data and accurately predict driving behavior and potential hazards. It enables the system to learn and adapt over time, improving its accuracy and responsiveness.

Additionally, Tesla also uses machine learning in its battery management systems to optimize the performance and longevity of its batteries. Machine learning algorithms are used to model and predict the behavior of the batteries under different conditions, enabling Tesla to optimize charging rates, temperature control, and other factors to maximize the lifespan and performance of its batteries.

12. Machine Learning Case Study on Amazon

Amazon Prime Video uses machine learning to ensure high video quality for its users. The company has developed a system that analyzes video content and applies various techniques to enhance the viewing experience.

The system uses machine learning algorithms to automatically detect and correct issues such as unexpected black frames, blocky frames, and audio noise. For detecting block corruption, residual neural networks are used. After training the algorithm on a large dataset, a threshold of 0.07 was set for the corrupted-area ratio to mark the areas of the frame that have block corruption. For detecting unwanted noise in the audio, a model based on a pre-trained audio neural network is used to classify a one-second audio sample into one of these classes: audio hum, audio distortion, audio hiss, audio clicks, and no defect. Lip sync is handled using the SyncNet architecture.
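
The thresholding step can be illustrated with a small sketch. It assumes that an upstream detector (the residual network, not shown) has already produced a 0/1 mask of corrupted regions, and that a frame is flagged when the corrupted share of its area exceeds 0.07. That reading of the threshold, and the function names, are assumptions; this is not Amazon's implementation.

```python
CORRUPTED_AREA_THRESHOLD = 0.07  # value quoted in the text above


def corrupted_area_ratio(mask):
    """Share of cells marked 1 in a 2D 0/1 mask of detected corruption."""
    total = sum(len(row) for row in mask)
    flagged = sum(sum(row) for row in mask)
    return flagged / total if total else 0.0


def has_block_corruption(mask, threshold=CORRUPTED_AREA_THRESHOLD):
    """Flag a frame whose corrupted-area ratio exceeds the threshold."""
    return corrupted_area_ratio(mask) > threshold


# Hypothetical 10x10 masks: 8 corrupted cells versus 5 corrupted cells.
damaged_frame = [[1] * 8 + [0] * 2] + [[0] * 10 for _ in range(9)]
mostly_clean_frame = [[1] * 5 + [0] * 5] + [[0] * 10 for _ in range(9)]

print(has_block_corruption(damaged_frame))       # ratio 0.08
print(has_block_corruption(mostly_clean_frame))  # ratio 0.05
```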

By using machine learning to optimize video quality, Amazon can deliver a consistent and high-quality viewing experience to its users, regardless of the device or network conditions they are using. This helps maintain customer satisfaction and loyalty and helps Amazon remain a competitive leader in the video streaming market.

Machine learning applications are not limited to financial and tech use cases; they are also used in the healthcare industry. So, here are a few machine learning case studies that showcase the use of this technology in the biology and healthcare domain.

13. Microbiome Therapeutics Development

The development of microbiome therapeutics involves the study of the interactions between the human microbiome and various diseases and identifying specific microbial strains or compositions that can be used to treat or prevent these diseases. Machine learning plays a crucial role in this process by enabling the analysis of large, complex datasets and identifying patterns and correlations that would be difficult or impossible to detect through traditional methods.

Machine learning algorithms can analyze microbiome data at various levels, including taxonomic composition, functional pathways, and gene expression profiles. These algorithms can identify specific microbial strains or communities associated with different diseases or conditions and can be used to develop targeted therapies.

Besides that, machine learning can be used to optimize the design and delivery of microbiome therapeutics. For example, machine learning algorithms can be used to predict the efficacy of different microbial strains or compositions and optimize these therapies' dosage and delivery mechanisms.

14. Mental Illness Diagnosis

Machine learning is increasingly being used to develop predictive models for diagnosing and managing mental illness. One of the critical advantages of machine learning in this context is its ability to analyze large, complex datasets and identify patterns and correlations that would be difficult for human experts to detect.

Machine learning algorithms can be trained on various data sources, including clinical assessments, self-reported symptoms, and physiological measures such as brain imaging or heart rate variability. These algorithms can then be used to develop predictive models to identify individuals at high risk of developing a mental illness or who are likely to experience a particular symptom or condition.

One example of machine learning being used to predict mental illness is in the development of suicide risk assessment tools. These tools use machine learning algorithms to analyze various risk factors, such as demographic information, medical history, and social media activity, to identify individuals at risk of suicide. These tools can be used to guide early intervention and support for individuals struggling with mental health issues.

One can also build a chatbot using machine learning and Natural Language Processing that analyzes the user's responses and recommends steps that they can take immediately.

15. 3D Bioprinting

Another popular subject in the biotechnology industry is bioprinting. Based on a computerized blueprint, the printer prints biological tissues like skin, organs, blood vessels, and bones layer by layer using cells and biomaterials, also known as bioinks.

Printed tissues can be produced more ethically and economically than by relying on organ donations. Additionally, synthetic tissue constructs are used for drug testing instead of testing on animals or people. Due to its tremendous complexity, the entire technology is still in its early stages of maturity. Data science is one of the most essential components for handling the complexity of printing.

Many variables affect the printing process and its quality, including the properties of the bioinks, which have inherent variability, and the many printing parameters. Bayesian optimization, for instance, is used to tune the printing process and improve the likelihood of producing usable output.

A crucial element of the procedure is the printing speed. To estimate the optimal speed, Siamese network models are used. Convolutional neural networks are applied to photographs of the layer-by-layer tissue to detect material or tissue abnormalities.

In this section, you will find a list of machine learning case studies that have utilized Amazon Web Services to create machine learning based solutions.

16. Machine Learning Case Study on Autodesk

Autodesk is a US-based software company that provides solutions for 3D design, engineering, and entertainment industries. The company offers a wide range of software products and services, including computer-aided design (CAD) software, 3D animation software, and other tools used in architecture, construction, engineering, manufacturing, media and entertainment industries.

Autodesk utilizes machine learning (ML) models that are constructed on Amazon SageMaker, a managed ML service provided by Amazon Web Services (AWS), to assist designers in categorizing and sifting through a multitude of versions created by generative design procedures and selecting the most optimal design.  ML techniques built with Amazon SageMaker help Autodesk progress from intuitive design to exploring the boundaries of generative design for their customers to produce innovative products that can even be life-changing. As an example, Edera Safety, a design studio located in Austria, created a superior and more effective spine protector by utilizing Autodesk's generative design process constructed on AWS.

17. Machine Learning Case Study on Capital One

Capital One is a financial services company in the United States that offers a range of financial products and services to consumers, small businesses, and commercial clients. The company provides credit cards, loans, savings and checking accounts, investment services, and other financial products and services.

Capital One leverages AWS to transform data into valuable insights using machine learning, enabling the company to innovate rapidly on behalf of its customers.  To power its machine-learning innovation, Capital One utilizes a range of AWS services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Relational Database Service (Amazon RDS), and AWS Lambda. AWS is enabling Capital One to implement flexible DevOps processes, enabling the company to introduce new products and features to the market in just a few weeks instead of several months or years. Additionally, AWS assists Capital One in providing data to and facilitating the training of sophisticated machine-learning analysis and customer-service solutions. The company also integrates its contact centers with its CRM and other critical systems, while simultaneously attracting promising entry-level and mid-career developers and engineers with the opportunity to gain knowledge and innovate with the most up-to-date cloud technologies.

18. Machine Learning Case Study on BuildFax

In 2008, BuildFax began by collecting widely scattered building permit data from different parts of the United States and distributing it to various businesses, including building inspectors, insurance companies, and economic analysts. Today, it offers custom-made solutions to these professions and several other services. These services comprise indices that monitor trends like commercial construction, and housing remodels.

Source: aws.amazon.com/solutions/case-studies

The primary customer base of BuildFax is insurance companies, which spend billions of dollars on roof losses. BuildFax helps its customers develop policies and premiums by evaluating roof losses for them. Initially, it relied on general data and ZIP codes to build predictive models, but these proved inaccurate and somewhat complex. It therefore needed a solution that could deliver more accurate, property-specific estimates, and it chose Amazon Machine Learning for predictive modeling. With Amazon Machine Learning, the company can offer insurance companies and builders personalized estimates of roof age and job cost that are specific to a particular property, rather than depending on more generalized estimates based on ZIP codes. It now uses customers' data and data from public sources to create predictive models.

This section will present you with a list of machine learning case studies that showcase how companies have leveraged Microsoft Azure Services for completing machine learning tasks in their firm.

19. Machine Learning Case Study for an Enterprise Company

Consider a company (an Azure customer) in the Electronic Design Automation industry that provides software, hardware, and IP for electronic systems and semiconductor companies. Its finance team was struggling to manage accounts receivable efficiently, so it wanted to use machine learning to predict payment outcomes and reduce outstanding receivables. The team faced a major challenge with managing change data capture using Azure Data Factory. A3S provided a solution by automating data migration from SAP ECC to Azure Synapse and offering fully automated analytics as a service, which helped the company streamline its accounts receivable management. The company was able to complete the entire scenario, from data ingestion to analytics, within a week, and it plans to use A3S for other analytics initiatives.

20. Machine Learning Case Study on Shell

Royal Dutch Shell, a global company whose operations range from oil wells to retail petrol stations, is using computer vision to automate safety checks at its service stations. In partnership with Microsoft, it has developed a project called Video Analytics for Downstream Retail (VADR), which uses machine vision and image processing to detect dangerous behavior and alert station staff. It uses OpenCV and Azure Databricks in the background, highlighting how Azure can be used for personalised applications. Once the project shows good results in the countries where it has been deployed (Thailand and Singapore), Shell plans to roll VADR out globally.

21. Machine Learning Case Study on TransLink

TransLink, a transportation company in Vancouver, deployed 18,000 different sets of machine learning models using Azure Machine Learning to predict bus departure times and determine bus crowdedness. The models take into account factors such as traffic, bad weather and at-capacity buses. The deployment led to an improvement in predicted bus departure times of 74%. The company also created a mobile app that allows people to plan their trips based on how at-capacity a bus might be at different times of day.

22. Machine Learning Case Study on Xbox

Microsoft Azure Personalizer is a cloud-based service that uses reinforcement learning to select the best content for customers based on up-to-date information about them, the context, and the application. Custom recommender services can also be created using Azure Machine Learning. The Xbox One group used Cognitive Services Personalizer to find content suited to each user, which resulted in a 40% increase in user engagement compared to a random personalisation policy on the Xbox platform.
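To make the comparison with a random policy concrete, here is a minimal sketch of reinforcement-style content selection using a simple epsilon-greedy bandit. It is not Azure Personalizer's API or its actual algorithm, and it ignores user context entirely; the function name `simulate`, the click probabilities, and every parameter are hypothetical and chosen only for illustration.

```python
import random

def simulate(policy, click_rates, rounds=5000, epsilon=0.1, seed=0):
    """Return the average reward (click rate) a policy earns over `rounds` users."""
    rng = random.Random(seed)
    n_items = len(click_rates)
    shows = [0] * n_items    # how often each item was shown
    clicks = [0] * n_items   # how often each shown item was clicked
    total = 0
    for _ in range(rounds):
        if policy == "random" or rng.random() < epsilon:
            # Explore: show any item with equal probability.
            item = rng.randrange(n_items)
        else:
            # Exploit: show the item with the best observed click rate so far.
            item = max(range(n_items),
                       key=lambda i: clicks[i] / shows[i] if shows[i] else 0.0)
        reward = 1 if rng.random() < click_rates[item] else 0
        shows[item] += 1
        clicks[item] += reward
        total += reward
    return total / rounds

# Hypothetical click probabilities for three pieces of content (assumed values).
rates = [0.05, 0.10, 0.30]
random_avg = simulate("random", rates)
learned_avg = simulate("epsilon_greedy", rates)
```

The learning policy spends most rounds on the content that has performed best so far, which is why its average reward ends up above that of the purely random policy in this toy setting.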

The case studies in this blog will help you explore how machine learning is applied to real problems across different industries. If you are preparing for an interview and want to show that you have mastered implementing ML algorithms, do not stop here; practice more machine learning case studies like these.

1. What is a case study in machine learning?

A case study in machine learning is an in-depth analysis of a real-world problem or scenario, where machine learning techniques are applied to solve the problem or provide insights. Case studies can provide valuable insights into the application of machine learning and can be used as a basis for further research or development.

2. What is a good use case for machine learning?

A good use case for machine learning is any scenario with a large and complex dataset and where there is a need to identify patterns, predict outcomes, or automate decision-making based on that data. It could include fraud detection, predictive maintenance, recommendation systems, and image or speech recognition, among others.

3. What are the 3 basic types of machine learning problems?

The three basic types of machine learning problems are supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, the algorithm is trained on labeled data. In unsupervised learning, the algorithm seeks to identify patterns in unlabeled data. In reinforcement learning, the algorithm learns through trial and error based on feedback from the environment.

4. What are the 4 basics of machine learning?

The four basics of machine learning are data preparation, model selection, model training, and model evaluation. Data preparation involves collecting, cleaning, and preparing data for use in training models. Model selection involves choosing the appropriate algorithm for a given task. Model training involves optimizing the chosen algorithm to achieve the desired outcome. Model evaluation consists of assessing the performance of the trained model on new data.
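The four steps above can be sketched end to end in a few lines. This is a minimal illustration on synthetic data, not a recommended production workflow; the nearest-centroid model and all function names (`make_data`, `train_centroids`, `predict`, `accuracy`) are assumptions made for the example.

```python
import random

def make_data(n=200, seed=0):
    """1. Data preparation: build a small labeled dataset (synthetic, for illustration)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randrange(2)
        # Class 0 is centered near (0, 0); class 1 near (3, 3).
        x = (rng.gauss(3 * label, 1.0), rng.gauss(3 * label, 1.0))
        rows.append((x, label))
    rng.shuffle(rows)
    split = int(0.8 * n)
    return rows[:split], rows[split:]   # training set, held-out test set

def train_centroids(train):
    """3. Model training: fit a nearest-centroid classifier (mean point per class)."""
    sums, counts = {}, {}
    for (x0, x1), label in train:
        s = sums.setdefault(label, [0.0, 0.0])
        s[0] += x0
        s[1] += x1
        counts[label] = counts.get(label, 0) + 1
    return {c: (s[0] / counts[c], s[1] / counts[c]) for c, s in sums.items()}

def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean distance)."""
    return min(centroids,
               key=lambda c: (x[0] - centroids[c][0]) ** 2 + (x[1] - centroids[c][1]) ** 2)

def accuracy(centroids, rows):
    """4. Model evaluation: fraction of held-out rows classified correctly."""
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)

train, test = make_data()
# 2. Model selection: nearest centroid is chosen here only because it is simple.
model = train_centroids(train)
test_accuracy = accuracy(model, test)
```

Evaluating on rows the model never saw during training is the key design choice: accuracy on the training set alone would overstate how well the model handles new data.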

About the Author

Manika Nagpal is a versatile professional with a strong background in both Physics and Data Science. As a Senior Analyst at ProjectPro, she leverages her expertise in data science and writing to create engaging and insightful blogs that help businesses and individuals stay up-to-date with the

© 2024 Iconiq Inc.

Python Geeks

16 Real World Case Studies of Machine Learning

A decade ago, few would have guessed how prominent the term “Machine Learning” would become. From entertainment to basic needs to complex data handling and statistics, Machine Learning now plays a part in all of it, and its reach is not limited to necessities and entertainment.

The technology plays a pivotal role in areas such as data retrieval, database consistency, and spam detection, along with a vast range of other applications. Plenty of articles teach the basic concepts of Machine Learning, but learning becomes more fun when we see it working in practice.

Keeping this in mind, PythonGeeks brings to you, an article that will talk about the real-life case studies of Machine Learning stating its advancement in various fields. We will talk about the merits of Machine Learning in the field of technology as well as in Life Science and Biology. So, without further delay, let us look at these case studies and get to know a bit more about Machine Learning.

Machine Learning Case Studies in Technology

1. Machine Learning Case Study on Dell

Dell is a multinational technology leader that empowers people and communities across the globe by providing software and hardware at affordable prices. Data is central to how Dell operates, and its marketing team needed a data-driven solution that would boost response rates and show why certain words and phrases outperform others.

Dell partnered with Persado, a leading provider of AI- and ML-generated marketing creative, to harness the power of words in its email channel and gather data-driven analytics for each of its key audiences.

As a result of this partnership, Dell saw a 50% average increase in CTR and a 46% average increase in responses from its customer engagement. It also saw a 22% average increase in page visits and a 77% average increase in add-to-cart orders.

Encouraged by this success with email, Dell decided to extend Persado across its entire marketing platform. Dell now uses machine learning algorithms to improve the marketing copy of its promotional and lifecycle emails, and it also deploys Machine Learning models for Facebook ads, display banners, direct mail, and even radio content to reach more of its target audience.

2. Machine Learning Case Study on Sky

Sky UK is a British telecommunications company that uses machine learning and artificial intelligence, through Adobe Sensei, to transform customer experiences.

The Head of Digital Decisioning and Analytics at Sky UK has noted that the company has 22.5 million very diverse customers, and that even dividing people by their favorite television genre results in fairly broad segments.

Using machine learning led to the following outcomes:

  • Creating hyper-focused segments to engage customers.
  • Usage of machine learning to deliver actionable intelligence.
  • Improvement in the relationships with customers.
  • Applying AI learnings across channels to understand what matters to customers.

With machine learning frameworks, the company was able to analyze large volumes of customer information efficiently. The deployed models let it recommend the products and services that resonated most with each customer.

McLaughlin once stated that people think of machine learning as a tool for delivering experiences that are strictly defined and very robotic, but it is actually the other way round. With Adobe Sensei, Sky’s management was drawing a line connecting customer intelligence to personalized experiences that are valuable and appropriate for their customers.

3. Machine Learning Case Study on Trendyol

Trendyol is amongst the leading e-commerce companies based in Turkey. It once faced threats from its global competitors like Adidas and ASOS, particularly for its sportswear sales and audience engagement.

In order to assist the company in gaining customer loyalty and to enhance its emailing system, Trendyol partnered with the vendor Liveclicker, which specializes in real-time personalization for a better user experience for its customers.

Trendyol used machine learning and artificial intelligence algorithms to create several highly personalized marketing campaigns based on the interests of particular target audiences. The aim was not only to give the campaigns a personal touch but also to work out which messages would be most relevant to which set of customers. It also came up with an offer for a football jersey with the recipient’s name printed on the back to ramp up personalization and grab the consumer’s attention.

This one-to-one personalization lifted the retailer’s open rates, click-through rates, and conversions, and pushed its sales to all-time highs. It produced a 30% increase in click-through rates, a 62% growth in response rates, and a striking 130% increase in conversion rates for Trendyol.

4. Machine Learning Case Study On Harley Davidson

In today’s world it is difficult to break through with traditional marketing. For a business like Harley Davidson NYC, Albert (an artificial intelligence-powered robot) has a lot of appeal for growing the company and its popularity. Powered by machine learning and artificial intelligence algorithms, robots are already writing news stories, working in hotels, managing traffic, and even running McDonald’s outlets.

Albert can be used across marketing channels, including social media and email campaigns. The software predicts which consumers are most likely to convert and adjusts the personalized creative copy on its own for the benefit of the campaign.

Harley Davidson is the only brand to date that uses Albert to its advantage. The company analyzed customer data to identify the behavior patterns of previous customers who took positive actions, such as making a purchase or spending more than the average amount of time browsing the website. With this analysis, Albert divides customers into segments and scales up the test campaigns according to their interests and engagement.

After deploying Albert, Harley Davidson saw its sales increase by 40%. The brand also saw a 2,930% increase in leads, with 50% of those coming from high-converting ‘lookalikes’ identified by Albert using artificial intelligence and machine learning.

5. Machine Learning Case Study on Yelp

Yelp may not be the first name that comes to mind as a tech company. However, it makes effective use of machine learning to improve its users’ experience.

Yelp’s machine learning algorithms help the company’s human staff collect, categorize, and label images more efficiently and precisely. Since images are almost as important to Yelp as the user reviews themselves, the company is always trying to improve how it processes images. With this assistance, it now serves millions of users with accurate and satisfactory results.

For an entire generation nowadays, capturing photos of their food has become second nature. Owing to this, Yelp has such a huge database of photos for image processing. Its software makes use of techniques for analysis of the image to identify and classify the extracted features on the basis of color, texture, and shape. It implies that it can recognize the presence of, say, pizzas, or whether a restaurant has outdoor seating by merely analyzing the images that we provide as input data.
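As a rough illustration of what a color-based image feature can look like, the sketch below bins RGB pixels into a coarse, normalized color histogram. This is not Yelp’s pipeline; the function name `color_histogram`, the bin count, and the tiny made-up image are all assumptions for the example, and real systems would combine such features with texture and shape information before classification.

```python
def color_histogram(pixels, bins=4):
    """Count pixels per coarse RGB bin and return normalized frequencies.

    Assumes 8-bit channels (0-255) and a `bins` value that divides 256 evenly.
    """
    step = 256 // bins
    hist = [0] * (bins ** 3)
    for r, g, b in pixels:
        # Map each channel to its bin, then flatten (r_bin, g_bin, b_bin) to one index.
        index = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[index] += 1
    total = len(pixels) or 1   # avoid dividing by zero for an empty image
    return [count / total for count in hist]

# A tiny made-up "image": three reddish pixels and one bluish pixel.
image = [(250, 10, 10)] * 3 + [(5, 5, 240)]
features = color_histogram(image)
```

The resulting fixed-length vector can be fed to any classifier, which is the reason histograms are a common hand-crafted feature: images of different sizes all map to the same number of inputs.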

As a result, the company can now predict attributes like ‘good for kids’ and ‘classy ambiance’ with more than 80% accuracy.

6. Machine Learning Case Study on Tesla

Tesla is now a big name in the electric automobile industry and is likely to remain a trending topic for years to come. It is widely known for its advanced, futuristic cars. The company states that its cars have their own AI hardware, and it is also using AI to build self-driving cars.

At the current state of the technology, cars are not yet completely autonomous and still need some human intervention. The company is working extensively on the decision-making algorithms that would make its cars fully autonomous. It is currently working in partnership with NVIDIA on an unsupervised ML algorithm for this purpose.

This step by Tesla could be a game-changer for automobiles and Machine Learning models for several reasons. The cars feed data directly to Tesla’s cloud storage to avoid data leakage. Each car sends the driver’s seating position, local traffic, and other information to the cloud to predict the car’s next move precisely. The car is equipped with various internal and external sensors that collect this data for processing.

Machine Learning Case Studies in Life Science and Biology

7. Development of Microbiome Therapeutics

With advances in technology, we have identified a vast number of microorganisms in the human body, the so-called microbiota: bacteria, fungi, viruses, and other single-celled organisms. All the genes of the microbiota are collectively known as the microbiome. These genes are enormous in number; for example, the bacteria in the human body have more than 100 times as many unique genes as humans do.

The microbiota have a massive influence on human health, and imbalances in them are linked to disorders such as Parkinson’s disease and inflammatory bowel disease. Such imbalances are also presumed to contribute to several autoimmune diseases. Microbiome research is therefore a very active area, and Machine Learning models can help handle its data effectively.

To influence the microbiota and develop microbiome therapeutics that reverse these diseases, we need to understand the microbiota’s genes and their influence on our body. With today’s gene sequencing capabilities, terabytes of data are available, but it cannot yet be used because it has not been analyzed.

8. Predicting Heart Failure in Mobile Health

Heart failure typically leads to emergency or hospital admission and may even be fatal in some situations. And with the increase in the aging population, the percentage of heart failure in the population is expected to increase.

People who suffer from heart failure usually have pre-existing illnesses that go undiagnosed and lead to fatal ailments. It is therefore common to use telemedicine systems to monitor and consult patients, collecting mobile health data such as blood pressure, body weight, or heart rate and transmitting it effectively.

Most prediction and prevention systems are currently built on fixed rules: when specific vital-sign measurements go beyond a predefined threshold, the patient is alerted, even before any ailment has been diagnosed. Such a system can produce a high number of false alerts, because vital readings fluctuate for reasons that are not serious.

Because of how these systems are designed, alerts mostly lead to hospital admission. Too many false alerts therefore increase health costs and erode the patient’s confidence in the predictions, defeating the purpose of the algorithms. Eventually, the patient stops following the recommendation to seek medical help, even when the algorithm alerts them to a fatal ailment.

To reduce the chance of false positives, a classifier based on Naïve Bayes has been developed. It uses the patient’s baseline data (such as age, gender, smoker or not, pacemaker or not), measurements such as sodium, potassium, or hemoglobin concentrations in the blood, monitored characteristics such as heart rate, body weight, and systolic and diastolic blood pressure, and questionnaire answers about well-being and physical activities.
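The contrast between a fixed-threshold rule and a Naïve Bayes classifier can be sketched as follows. This is a toy illustration, not the classifier from the study: the Gaussian variant of Naïve Bayes, the two features, the heart-rate limit, and every reading below are assumptions made up for the example and carry no clinical meaning.

```python
import math

def threshold_alert(heart_rate, limit=100):
    """Fixed-rule alert: fires whenever one reading exceeds a preset limit."""
    return heart_rate > limit

def fit_gaussian_nb(rows, labels):
    """Estimate per-class priors plus per-feature mean and variance."""
    model = {}
    for c in set(labels):
        members = [r for r, y in zip(rows, labels) if y == c]
        n = len(members)
        means = [sum(col) / n for col in zip(*members)]
        # A small floor keeps the variance from being exactly zero.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-6
                     for col, m in zip(zip(*members), means)]
        model[c] = (n / len(rows), means, variances)
    return model

def predict_nb(model, row):
    """Return the class with the highest log posterior, treating features as independent."""
    def log_posterior(c):
        prior, means, variances = model[c]
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Made-up readings: (heart rate in bpm, weight gain over a week in kg).
# Label 1 means "needs attention", label 0 means "no action needed".
rows = [(72, 0.1), (80, 0.0), (105, 0.2), (110, 0.1),
        (95, 2.1), (102, 2.5), (98, 1.9), (108, 2.2)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_gaussian_nb(rows, labels)

patient = (106, 0.1)   # elevated heart rate but stable weight
rule_fires = threshold_alert(patient[0])
nb_prediction = predict_nb(model, patient)
```

With these made-up numbers the fixed rule raises an alert on heart rate alone, while the classifier, which weighs both features together, assigns the patient to the "no action needed" class. That is the mechanism by which combining several measurements can cut false alerts.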

9. Mental Health Prediction, Diagnosis, and Treatment

An estimated 10% or more of the global population has a mental disorder, so it is high time to take preventive measures in this field. The economic losses due to mental illness add up to nearly $10 trillion.

Mental disorders cover a wide variety of ailments, including anxiety, depression, and substance use disorder. Other prominent examples include opioid use, bipolar disorder, schizophrenia, and eating disorders, all of which pose a high risk to those affected.

Detecting mental disorders and intervening as early as possible is therefore critical to reducing these losses. There are two main approaches to deploying Machine Learning models for detecting mental disorders: consumer apps that detect mental illness, and tools that support psychiatrists in diagnosing their patients.

The consumer apps are typically conversational chatbots enhanced with machine learning algorithms that help users reduce anxiety or panic attacks. The app analyzes behavioral traits such as the user’s spoken language and recommends help accordingly. Because the recommendations must be based strictly on scientific evidence, the responses to the chatbot’s proposals and the individual language patterns of both the chatbot and the user must be predicted as precisely as possible.

10. Research Publication and Database Scanning for Bio-Markers for Stroke

Stroke is one of the major causes of disability and death among older people. An adult’s lifetime risk of having a stroke is about 25%. However, stroke is a very heterogeneous disorder, so individualized pre-stroke and post-stroke care is critical for successful treatment.

To determine this individualized care, the person’s phenotype, that is, their observable characteristics, must be assessed carefully. This is usually done with biomarkers. A biomarker is a measurable data point that can be used to stratify patients. Examples of such biomarkers are disease severity scores, lifestyle characteristics, or genomic properties.

Many recognized biomarkers have already been published or recorded in databases. In addition, hundreds of scientific publications appear daily on the detection of biomarkers for all kinds of diseases.

11. 3D Bioprinting

Bioprinting is yet another trending topic in the domain of biotechnology. It works on the basis of a digital blueprint where the printer uses cells and natural or synthetic biomaterials — also called bio-inks — to print layer-by-layer living tissues like skin, organs, blood vessels, or bones that have exact replication of the real tissues.

As an alternative to depending on organ donations, these tissues can be produced in printers more ethically and cost-effectively. We can also perform drug tests on the synthetic tissue rather than on animals or humans. The technology is still emerging and immature because of its high complexity. Data science is one of the most crucial tools for coping with the complexity of printing.

12. Supply Chain Optimization

The production of drugs takes time, especially for today’s high-tech cures that depend on specific substances and production methods. The whole process also has to be broken down into many different steps, several of which are outsourced to specialist providers.

We currently see this with COVID-19 vaccine production. The vaccine inventors deliver the blueprint for the vaccine. Production then takes place in the plants of companies specialized in sterile production. The producer delivers the vaccine in tanks to companies that fill it into small doses under clinical conditions, and finally another company handles the supply.

The complete planning, from having the right input substances available at the right time, to having adequate production capacity, to storing the exact amount of drugs needed to serve demand, is a highly complicated system. On top of that, it must be managed for hundreds or thousands of therapies, each with its own specific conditions.

13. AES On Google Cloud AutoML Vision

The AES Corporation is a power generation and distribution company. It generates and sells power for utility and industrial use. It relies on Google Cloud in its effort to make renewable energy more efficient. AES uses Google AutoML Vision to review images of wind turbine blades and assess their maintenance needs in advance.

Outcomes of this case study:

  • It reduces image review time by approximately 50%
  • It helps in reducing the prices of renewable energy
  • This results in more time to invest in identifying wind turbine damage and mending it

14. Bayer AG on AWS SageMaker

Bayer AG is a multinational pharmaceutical and life sciences company based in Germany. One of its key businesses is the production of insecticides, fungicides, and herbicides for agricultural purposes.

To help farmers monitor their crops, it built the Digital Yellow Trap: an Internet of Things (IoT) device that uses image recognition to alert farmers to pests on their farmland.

  • It helps in reducing Bayer lab’s architecture costs by 94%
  • We can scale it to accommodate for fluctuating demand
  • It is able to handle tens of thousands of requests per second
  • It helps in Community-based, early warning

15. American Cancer Society on Google Cloud ML Engine

The American Cancer Society is a nonprofit organization for eradicating cancer. They operate in more than 250 regional offices all over America.

They make use of the Google Cloud ML Engine to identify novel patterns in digital pathology images. Their aim is to improve breast cancer detection accuracy and reduce the overall diagnosis timeline as well as ensure effective costing.

Outcomes of this use case:

  • It helps in enhancing the speed and accuracy of image analysis by removing human limitations
  • It even aids in improving patients’ quality of life and life expectancy
  • This helps to protect tissue samples by backing up image data to the cloud

16. Road Safety Commission of Western Australia

The Road Safety Commission of Western Australia operates under the Western Australia Police Force. It takes the responsibility for tracking road accidents and making the roads safer by taking adequate precautions.

In an attempt to achieve its safety strategy “Towards Zero 2008-2020” which aims at reducing road fatalities by 40%, the road safety commission is depending on machine learning, artificial intelligence, and advanced analytics for precise and reliable results.

  • Data engineering and visualization time reduced by 80%
  • An estimated 25% reduction in vehicle crashes
  • Straightforward and efficient data sharing
  • Flexibility to work with data in various coding languages

With this, we have seen the various case studies that are done till now in the field of Machine Learning. PythonGeeks specially curated this list of case studies to help readers to understand the deployment of Machine Learning models in the real world. The article can benefit you in various ways since it delivers accurate studies of the various uses of Machine Learning. You can study these cases to get to know Machine Learning a bit better and even try to find improvements in the existing solution.

Global and local interpretability techniques of supervised machine learning black box models for numerical medical data

Recommendations

Interpretability in the medical field: a systematic mapping and review study

Recently, the machine learning (ML) field has been rapidly growing, mainly owing to the availability of historical datasets and advanced computational power. This growth is still facing a set of challenges, such as ...

  • Literature review of machine Learning interpretability in medicine resulted in 179 papers published between 1994 and 2020.

Learning Locally Interpretable Rule Ensemble

This paper proposes a new framework for learning a rule ensemble model that is both accurate and interpretable. A rule ensemble is an interpretable model based on the linear combination of weighted rules. In practice, we often face the trade-off ...

State of the art of Fairness, Interpretability and Explainability in Machine Learning: Case of PRIM

The adoption of complex machine learning (ML) models in recent years has brought along a new challenge related to how to interpret, understand, and explain the reasoning behind these complex models' predictions. Treating complex ML systems as ...

Information

Published in.

Pergamon Press, Inc.

United States

Publication History

Author tags.

  • Interpretability
  • Explainability
  • Numerical data
  • Research-article

The role of beat-by-beat cardiac features in machine learning classification of ischemic heart disease (IHD) in magnetocardiogram (MCG)

Affiliations.

  • 1 SQUIDs Applications Section, SQUID and Detector Technology Division, Materials Science Group, Indira Gandhi Centre for Atomic Research, Kalpakkam-603 102, Tamil Nadu, India.
  • 2 Centre for Medical Electronics, Department of Electronics and Communication Engineering, Anna University, Chennai-600 025, Tamil Nadu, India.
  • 3 Department of Cardiology, Jawaharlal Institute of Postgraduate Medical Education and Research, Dhanvantri Nagar, Pondicherry-605 006, Puducherry, India.
  • PMID: 38640907
  • DOI: 10.1088/2057-1976/ad40b1

Cardiac electrical changes associated with ischemic heart disease (IHD) are subtle and could be detected even in rest condition in magnetocardiography (MCG), which measures weak cardiac magnetic fields. Cardiac features that are derived from MCG recorded from multiple locations on the chest of subjects and some conventional time domain indices are widely used in machine learning (ML) classifiers to objectively distinguish IHD and control subjects. Most of the earlier studies have employed features that are derived from signal-averaged cardiac beats and have ignored inter-beat information. The present study demonstrates the utility of beat-by-beat features in classifying IHD subjects (n = 23) and healthy controls (n = 75) in 37-channel MCG data taken with subjects at rest. The study reveals the importance of three features (out of eight measured features), namely the field map angle (FMA) computed from the magnetic field map, beat-by-beat variations of alpha angle in the ST-T region, and T wave magnitude variations, in yielding a better classification accuracy (92.7%) than that achieved by conventional features (81%). Further, beat-by-beat features are also found to augment the accuracy in classifying myocardial infarction (MI) versus control subjects in two public ECG databases (to 92% from 88% and to 94% from 77%). These demonstrations suggest the importance of beat-by-beat features in clinical diagnosis of ischemia.

Keywords: beat-by-beat cardiac features; ischemic heart disease; machine learning classifiers; magnetocardiography; myocardial infarction.

© 2024 IOP Publishing Ltd.

Heart disease prediction using machine learning, deep Learning and optimization techniques-A semantic review

  • Published: 05 July 2024

  • Girish Shrikrushnarao Bhavekar 1 ,
  • Agam Das Goswami 2 ,
  • Chafle Pratiksha Vasantrao 3 ,
  • Amit K. Gaikwad 4 ,
  • Amol V. Zade 4 &
  • Harsha Vyawahare 5  

Cardiovascular disease is the leading cause of death worldwide. Heart disease prediction (HDP) is a difficult task that requires advanced knowledge and considerable experience, and it faces numerous significant challenges in clinical data analysis. Although many researchers have focused on predicting heart disease, prediction accuracy remains suboptimal. Accurate HDP can help a person avoid life-threatening events, whereas an inaccurate prediction can prove fatal. To address these issues, this review discusses several deep learning (DL), machine learning (ML), and optimization-based HDP techniques. In recent years, many researchers have used different DL and ML algorithms to help clinicians and the health care industry predict heart disease. The review also discusses various optimization-based algorithms and analyzes their performance. It suggests that optimization-based HDP algorithms could assist doctors in predicting the occurrence of heart disease in advance and in offering suitable treatment.
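Because the abstract treats prediction accuracy as the key metric while also warning that a wrong prediction can be fatal, it can help to see accuracy next to sensitivity and specificity. The following Python sketch is illustrative only: the function names and the ten labels are hypothetical and do not come from the review.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = disease, 0 = no disease)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred):
    """Return accuracy, sensitivity (recall on disease) and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical labels: 3 patients with disease, 7 without.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
report = summarize(y_true, y_pred)
print(report)
```

In this toy case most predictions are correct, yet two of the three patients with disease are missed, which is the kind of error that accuracy alone can hide.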

Data availability.

Data will be made available on reasonable request.

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and affiliations.

Department of Artificial Intelligence & Data Science, CSMSS Chh. Shahu College of Engineering, Aurangabad, 431136, India

Girish Shrikrushnarao Bhavekar

Department of SENSE, VIT-AP University, Andhra Pradesh, 522237, India

Agam Das Goswami

Department of AI&DS, CSMSS Chh. Shahu College of Engineering, Aurangabad, Maharashtra, 431136, India

Chafle Pratiksha Vasantrao

Department of Computer Science & Engineering, GH Raisoni University, Amravati, Maharashtra, 444701, India

Amit K. Gaikwad & Amol V. Zade

Department of Computer Science & Engineering, Sipna College of Engineering & Technology, Amravati, Maharashtra, 444701, India

Harsha Vyawahare

Contributions

All the authors have contributed equally to the work.

Corresponding author

Correspondence to Agam Das Goswami.

Ethics declarations

Ethical approval.

All applicable institutional and/or national guidelines for the care and use of animals were followed.

Informed consent

For this type of analysis formal consent is not needed.

Potential conflict of interest

The authors declare that they have no potential conflict of interest.

About this article

Bhavekar, G.S., Das Goswami, A., Vasantrao, C.P. et al. Heart disease prediction using machine learning, deep Learning and optimization techniques-A semantic review. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19680-0

Received: 26 September 2023

Revised: 24 February 2024

Accepted: 10 June 2024

Published: 05 July 2024

DOI: https://doi.org/10.1007/s11042-024-19680-0

  • Heart disease prediction
  • Deep learning
  • Machine learning
  • Optimization
  • Prediction analysis

Exploring Transfer Learning Using Segment Anything Model in Optical Remote Sensing

  • Albughdadi, Mohanad
  • Baousis, Vasileios
  • Kaprol, Tolga
  • Karatosun, Armagan
  • Pisa, Claudio

In remote sensing, where labeled datasets are scarce, leveraging pre-trained models via transfer learning offers a compelling solution. This study investigates the efficacy of the Segment Anything Model (SAM), a foundational computer vision model, in optical remote sensing tasks, focusing on image classification and semantic segmentation.

The scarcity of labeled data in remote sensing poses a significant challenge for machine learning development. Transfer learning with pre-trained models such as SAM circumvents this challenge by leveraging existing data from related domains. SAM, developed and trained by Meta AI, serves as a foundational model for prompt-based image segmentation. It was trained on over 1 billion masks on 11 million images, which gives it robust zero-shot and few-shot capabilities. SAM's architecture comprises an image encoder, a prompt encoder, and a mask decoder, all geared towards swift and accurate segmentation for various prompts, enabling real-time interactivity and handling of ambiguity.

Two use cases of SAM-based models in optical remote sensing are presented, representing two critical tasks: image classification and semantic segmentation. Through comprehensive analysis and comparative assessments, various model architectures are examined, including linear and convolutional classifiers, SAM-based adaptations, and UNet for semantic segmentation. The experiments contrast model performance across different dataset splits and varying training data sizes. The SAM-based models place a linear, a convolutional, or a ViT decoder classifier on top of the SAM encoder.

Use Case 1: Image Classification with EuroSAT Dataset

The EuroSAT dataset, comprising 27,000 labeled image patches from Sentinel-2 satellite images across ten distinct land cover classes, serves as the testing ground for image classification. SAM-ViT models consistently demonstrate high accuracy, ranging between 89% and 93% across training datasets of various sizes. These models outperform baseline approaches and remain resilient even with limited training data. This use case highlights SAM-ViT's effectiveness in accurately categorizing land cover classes despite data limitations.

Use Case 2: Semantic Segmentation with Road Dataset

For semantic segmentation, the study focuses on the Road dataset, evaluating SAM-based models, particularly SAM-CONV, against the benchmark UNet model. SAM-CONV outperforms the benchmark, achieving F1-scores and Dice coefficients exceeding 0.84 and 0.82, respectively. Its performance in pixel-level labeling shows its robustness in delineating roads from their surroundings and its applicability to fine-grained analysis.

In conclusion, SAM-driven transfer learning methods hold promise for robust remote sensing analysis. SAM-ViT excels in image classification, while SAM-CONV demonstrates superiority in semantic segmentation, paving the way for their practical use in real-world remote sensing applications despite limited labeled data availability.
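The segmentation use case is scored with the Dice coefficient. As a rough illustration of what that number measures, here is a minimal Python sketch for a single pair of binary masks. The function name, the tiny masks, and the convention for two empty masks are assumptions; the abstract does not say how the authors computed or averaged their scores.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |pred AND true| / (|pred| + |true|) for binary masks."""
    pred = [px for row in pred_mask for px in row]
    true = [px for row in true_mask for px in row]
    if len(pred) != len(true):
        raise ValueError("masks must have the same number of pixels")
    overlap = sum(1 for p, t in zip(pred, true) if p == 1 and t == 1)
    total = sum(pred) + sum(true)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2 * overlap / total


# Hypothetical 2x3 road masks (1 = road pixel, 0 = background).
predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]
print(round(dice_coefficient(predicted, reference), 3))
```

A score of 1 means the predicted and reference road pixels coincide exactly, and a score of 0 means they do not overlap at all.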

Refining Long-Time Series of Urban Built-Up-Area Extraction Based on Night-Time Light—A Case Study of the Dongting Lake Area in China

1. Introduction
2. Study Area and Data
2.1. Study Area
2.2. Research Data
2.2.1. NTL Research Data Sources
2.2.2. Other Research Data
3.1. VIIRS-like NTL Dataset Generation
3.1.1. Intercalibration of NTL Data
3.1.2. Conversion of DMSP/OLS NTL
3.2. Calculation of the VANUI Index
3.3. SVM-Based Urban Built-Up-Area Extraction
3.4. Accuracy Assessment
4.1. Assessment of Extraction Results
4.2. Chronological Changes
4.3. Spatial Change
5. Discussion
5.1. Comparisons with Previous Studies
5.2. Limitations of Study
6. Conclusions
Author Contributions
Data Availability Statement
Acknowledgments
Conflicts of Interest


Satellite  Year  a        b       c        R
F10        1992  −2.0570  1.5903  −0.0090  0.9075
F10        1993  −1.0582  1.5983  −0.0093  0.9360
F10        1994  −0.3458  1.4864  −0.0079  0.9243
F12        1994  −0.6890  1.1770  −0.0025  0.9071
F12        1995  −0.0515  1.2293  −0.0038  0.9178
F12        1996  −0.0959  1.2727  −0.0040  0.9319
F12        1997  −0.3321  1.1782  −0.0026  0.9245
F12        1998  −0.0608  1.0648  −0.0013  0.9536
F12        1999   0.0000  1.0000   0.0000  1.0000
F14        1997  −1.1323  1.7696  −0.0122  0.9101
F14        1998  −0.1917  1.6321  −0.0101  0.9723
F14        1999  −0.1557  1.5055  −0.0078  0.9717
F14        2000   1.0988  1.3155  −0.0053  0.9278
F14        2001   0.1943  1.3219  −0.0051  0.9448
F14        2002   1.0517  1.1905  −0.0036  0.9203
F14        2003   0.7390  1.2416  −0.0040  0.9432
F15        2000   0.1254  1.0452  −0.0010  0.9320
F15        2001  −0.7024  1.1081  −0.0012  0.9593
F15        2002   0.0491  0.9568   0.0010  0.9658
F15        2003   0.2217  1.5122  −0.0080  0.9314
F15        2004   0.5751  1.3335  −0.0051  0.9479
F15        2005   0.6367  1.2838  −0.0041  0.9335
F15        2006   0.8261  1.2790  −0.0041  0.9387
F15        2007   1.3606  1.2974  −0.0045  0.9013
F16        2004   0.2853  1.1955  −0.0034  0.9039
F16        2005  −0.0001  1.4159  −0.0063  0.9390
F16        2006   0.1065  1.1371  −0.0016  0.9199
F16        2007   0.6394  0.9114   0.0014  0.9511
F16        2008   0.5564  0.9931   0.0000  0.9450
F16        2009   0.9492  1.0683  −0.0016  0.8918
F18        2010   2.3430  0.5102   0.0065  0.8462
F18        2010   2.3458  0.5100   0.0065  0.8453
F18        2011   1.8956  0.7345   0.0030  0.9095
F18        2012   1.8750  0.6203   0.0052  0.9392
F18        2013   1.8411  0.7049   0.0033  0.9321
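The coefficients a, b and c in the table above can be applied to raw DMSP/OLS digital numbers (DN) as sketched below. The text here does not give the calibration equation, so the second-order form DN_cal = a + b·DN + c·DN² is an assumption; it is at least consistent with the F12 1999 row (a = 0, b = 1, c = 0, R = 1) acting as the reference. The function name, the three-row coefficient subset and the clipping to the 0 to 63 DN range are also illustrative choices, not details taken from the article.

```python
# (satellite, year) -> (a, b, c), copied from three rows of the table above.
COEFFS = {
    ("F10", 1992): (-2.0570, 1.5903, -0.0090),
    ("F12", 1999): (0.0000, 1.0000, 0.0000),   # identity row (assumed reference)
    ("F18", 2013): (1.8411, 0.7049, 0.0033),
}


def intercalibrate(dn, satellite, year, dn_max=63.0):
    """Apply the assumed quadratic model a + b*DN + c*DN**2 and
    clip the result to the valid DMSP/OLS range [0, dn_max]."""
    a, b, c = COEFFS[(satellite, year)]
    value = a + b * dn + c * dn * dn
    return min(max(value, 0.0), dn_max)


# A pixel with raw DN = 10 observed by F10 in 1992.
print(intercalibrate(10, "F10", 1992))
```

With the identity row, `intercalibrate(dn, "F12", 1999)` returns the input unchanged for any DN in range, which is a quick sanity check on the assumed model.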
                                     Actual class
Predicted class      Built-Up Area              Non-Built-Up Area
Built-Up Area        True Built-Up Area         False Built-Up Area
Non-Built-Up Area    False Non-Built-Up Area    True Non-Built-Up Area
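The four cells of the confusion matrix above are enough to compute the usual accuracy measures for a binary built-up / non-built-up map. The sketch below is a generic illustration: the pixel counts are invented, and which of these measures the article reports in its accuracy assessment is not stated in this excerpt.

```python
def binary_metrics(tp, fp, fn, tn):
    """Summary measures from a 2x2 confusion matrix.

    tp: true built-up, fp: false built-up,
    fn: false non-built-up, tn: true non-built-up.
    """
    total = tp + fp + fn + tn
    overall_accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    # Cohen's kappa: agreement beyond what the row/column totals predict by chance.
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (overall_accuracy - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0
    return {
        "overall_accuracy": overall_accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical counts for 1000 validation pixels.
metrics = binary_metrics(tp=80, fp=10, fn=20, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because built-up pixels are usually a small minority, overall accuracy alone can look high even when many built-up pixels are missed, which is why precision, recall and kappa are shown alongside it.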

Citation: Chen, Y.; Ren, F.; Du, Q.; Zhou, P. Refining Long-Time Series of Urban Built-Up-Area Extraction Based on Night-Time Light—A Case Study of the Dongting Lake Area in China. Land 2024, 13, 1006. https://doi.org/10.3390/land13071006



COMMENTS

  1. How To Solve A Classification Task With Machine Learning

    The case study in this article will go over a popular machine learning concept called classification. In machine learning (ML), classification is a supervised learning concept that groups data into classes. Classification usually refers to any kind of problem where a specific type of class label is the result to be predicted ...

  2. Machine Learning with Python: Classification (complete tutorial)

    Up to 300 passengers survived and about 550 didn't; in other words, the survival rate (or the population mean) is 38%. Moreover, a histogram is well suited to giving a rough sense of the density of the underlying distribution of a single numerical variable. I recommend using a box plot to graphically depict data groups through their quartiles. Let's take the Age variable for instance:

  3. Classification in Machine Learning: Algorithms and Techniques

    Classification is a machine learning task that assigns a class label to an input, identifying it as being of one kind or another. The most basic example is an email spam filter, which classifies a message as either "spam" or "not spam". You will encounter multiple types of ...

  4. Machine Learning: Classification

    Classification is one of the most widely used techniques in machine learning, with a broad array of applications, including sentiment analysis, ad targeting, spam detection, risk assessment, medical diagnosis and image classification. The core goal of classification is to predict a category or class y from some inputs x.

  5. 4 Types of Classification Tasks in Machine Learning

    Examples include: Email spam detection (spam or not). Churn prediction (churn or not). Conversion prediction (buy or not). Typically, binary classification tasks involve one class that is the normal state and another class that is the abnormal state. For example, "not spam" is the normal state and "spam" is the abnormal state.

  6. Machine Learning Foundations: A Case Study Approach

    This first course treats the machine learning method as a black box. Using this abstraction, you will focus on understanding tasks of interest, matching these tasks to machine learning tools, and assessing the quality of the output. In subsequent courses, you will delve into the components of this black box by examining models and algorithms.

  7. Classification Case Study

    Exploring Classification: A Deep Dive into Real-World Case Studies. Welcome to our series dedicated to unraveling the intricacies of classification through cap...

  8. What Is Machine Learning Classification?

    Machine learning classification can be used in a variety of day-to-day applications. In the health care industry, researchers can use machine learning classification to predict new future diseases and whether someone might contract an infection. ... Additionally, through a series of case studies, you'll gain hands-on experience in significant ...

  9. Classification in Machine Learning: An Introduction

    Classification is a supervised machine learning process that involves predicting the class of given data points. Those classes can be targets, labels or categories. For example, a spam detection machine learning algorithm would aim to classify emails as either "spam" or "not spam." Common classification algorithms include: K-nearest ...

  10. A case study on machine learning and classification

    Abstract. As a young research field, machine learning has made significant progress and covered a broad spectrum of applications over the last few decades. Classification is an important task ...

  11. Multi-Class classification with Sci-kit learn & XGBoost: A case study

    By Avishek Nag (machine learning expert). Multi-Class classification with Sci-kit learn & XGBoost: A case study using Brainwave data. A comparison of different classifiers' accuracy & performance for high-dimensional data. In machine learning, classification problems with high-dimensional data are really challenging. Sometimes, very simple problems become extremely

  12. What is Classification in Machine Learning?

    A classification problem in machine learning is one in which a class label is predicted for a specific example of input data. Classification problems include the following: given an example, indicate whether it is spam or not; identify a handwritten character as one of the recognized characters.

  13. A detailed case study on Multi-Label Classification with Machine

    In case of multi-label classification tasks, a single instance of data can simultaneously belong to two or more classes of target variables. Hence, we can say that the predicted classes are not ...

  14. A case study on machine learning and classification

    As a young research field, machine learning has made significant progress and covered a broad spectrum of applications over the last few decades. Classification is an important task of machine learning. Today, the task is used in a vast array of areas. The present article provides a case study on various classification algorithms (under machine learning), their applicability and issues ...

  15. Machine learning in project analytics: a data-driven framework and case

    The final classification is made by counting the most common scenario or votes present within the ... in addition to the 139 instances of the case study, to the machine learning algorithms, then ...

  16. 6 Individual Machine Learning Algorithms for Classification Model

    Overall, logistic regression is a powerful machine-learning algorithm that can be used to solve a variety of problems. It is a good choice for beginners because it is relatively simple to ...

  17. Getting started with Classification

    Classification is a process of categorizing data or objects into predefined classes or categories based on their features or attributes. Machine Learning classification is a type of supervised learning technique where an algorithm is trained on a labeled dataset to predict the class or category of new, unseen data.

  18. Case Study: Using Machine Learning to Classify Personally ...

    In this case, if we had a bunch of examples of first and last names, phone numbers, ID numbers, DoB, email addresses and VINs, each labelled as such, we could train a multi-class supervised ...

  19. Data Science Case Studies on Classification

    Working on case studies is one of the best practices that will help you improve your problem-solving skills as a data scientist. In this article, I'm going to introduce you to some of the best data science case studies based on the problems of classification that will help you understand and solve problems based on classification using machine learning.

  20. Binary Classification Machine Learning Case Study

    Explore and run machine learning code with Kaggle Notebooks | Using data from Mines vs Rocks.

  21. Machine Learning Case Studies with Powerful Insights

    The Titanic Machine Learning Case Study is a classic example in the field of data science and machine learning. The study is based on the dataset of passengers aboard the Titanic when it sank in 1912. The study's goal is to predict whether a passenger survived or not based on their demographic and other information.

  22. 16 Real World Case Studies of Machine Learning

    6. Machine Learning Case Study on Tesla. Tesla is now a big name in the electric automobile industry and the chances that it will continue to be the trending topic for years to come are really high. It is popular and extensively known for its advanced and futuristic cars and their advanced models.

  23. Global and local interpretability techniques of supervised machine

    The most effective machine learning classification techniques, such as artificial neural networks, are not easily interpretable, which limits their usefulness in critical areas, such as medicine, where errors can have severe consequences. Researchers have been working to balance the trade-off between the model performance and interpretability.

  24. The role of beat-by-beat cardiac features in machine learning

    The study reveals the importance of three features (out of eight measured features) namely, the field map angle (FMA) computed from magnetic field map, beat-by-beat variations of alpha angle in the ST-T region and T wave magnitude variations in yielding a better classification accuracy (92.7 %) against that achieved by conventional features (81 %).

  25. Heart disease prediction using machine learning, deep Learning and

    Heart disease has been recognized as a deadly and complex human illness worldwide [1,2,3,4,5]. Heart disease disrupts the normal functions of the heart, leading to the blockage of blood vessels [6,7,8,9]. This condition increases the risk of stroke, angina, and heart attack, as well as coronary artery infections, which can weaken the body, especially in elderly individuals and adults ...

  26. Exploring Transfer Learning Using Segment Anything Model in Optical

    This study investigates the efficacy of the Segment Anything Model (SAM), a foundational computer vision model, in the domain of optical remote sensing tasks, specifically focusing on image classification and semantic segmentation. The scarcity of labeled data in remote sensing poses a significant challenge for machine learning development.

  27. Land

    By studying the development law of urbanization, the problems of disorderly expansion and resource wastage in urban built-up areas can be effectively avoided, which is crucial for the long-term sustainable development of cities. This study proposes a high-precision urban built-up-area extraction method for county-level cities for small and medium-sized towns in county-level regions. Our ...
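
Several entries above use spam filtering as the standard binary classification task, and one mentions deciding the final class by counting the most common vote. A minimal sketch of that idea is a k-nearest-neighbour classifier with majority voting. Everything specific in it is hypothetical: the two features (number of links and number of exclamation marks per email), the toy training set and the function name are invented for illustration and do not come from any of the listed articles.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label `query` with the majority class among its k nearest
    training points, using Euclidean distance."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (links, exclamation marks) -> label.
training_data = [
    ((0, 0), "not spam"),
    ((1, 0), "not spam"),
    ((0, 1), "not spam"),
    ((5, 4), "spam"),
    ((6, 5), "spam"),
    ((4, 6), "spam"),
]

print(knn_predict(training_data, (5, 5)))      # nearest points are all labelled "spam"
print(knn_predict(training_data, (0.5, 0.5)))  # nearest points are all labelled "not spam"
```

No model is fitted in advance here: the training points are simply stored and all the work happens at prediction time, which is what makes k-nearest neighbour a lazy learner.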