Classification Algorithm in Machine Learning: A Comprehensive Guide
- 6 minute read
- August 28, 2024
Written by:
I am Julie Bowie, a data scientist with a specialization in machine learning. I have conducted research in the field of language processing and have published several papers in reputable journals.
Summary: This comprehensive guide covers the basics of classification algorithms, key techniques like Logistic Regression and SVM, and advanced topics such as handling imbalanced datasets. It also includes practical implementation steps and discusses the future of classification in Machine Learning.
Introduction
Machine Learning has revolutionised the way we analyse and interpret data, enabling machines to learn from historical data and make predictions or decisions without explicit programming. One of the most fundamental and widely used techniques in Machine Learning is classification.
Classification algorithms are crucial in various industries, from spam detection in emails to medical diagnosis and customer segmentation.
In this blog, we will delve into the world of classification algorithms, exploring their basics, key algorithms, how they work, advanced topics, practical implementation, and the future of classification in Machine Learning.
What is Classification?
Classification is a supervised learning technique where the model predicts the category or class that a new observation belongs to, based on the patterns learned from the training data. Unlike regression, which deals with continuous output variables, classification involves predicting categorical output variables.
Types of Classification Tasks
Explore various classification tasks, including binary, multi-class, multi-label, and imbalanced classification. Understand the unique characteristics and challenges of each type to apply the right approach effectively.
Binary Classification: This involves separating the dataset into two categories. For example, classifying emails as “spam” or “not spam”.
Multi-Class Classification: Here, the model predicts one of multiple classes. For instance, classifying images into different categories like “dog,” “cat,” or “bird”.
Multi-Label Classification: In this scenario, each observation can belong to multiple classes. For example, tagging a piece of text with multiple topics like “sports,” “politics,” and “entertainment”.
Imbalanced Classification: Unequal class representation in the dataset challenges model training and evaluation.
Learners in Classification
Classification algorithms can be categorised into two types of learners: eager learners, which build models from training data before making predictions, and lazy learners, which memorise training data and predict based on nearest neighbours.
Eager Learners
These algorithms build a model from the training data before making predictions. Examples include Logistic Regression, Support Vector Machines (SVM), Decision Trees, and Artificial Neural Networks.
Lazy Learners
These algorithms do not build a model immediately from the training data. Instead, they memorise the training data and make predictions by finding the nearest neighbour. Examples include K-Nearest Neighbors (KNN) and Case-based Reasoning.
Key Classification Algorithms
Several classification algorithms are widely used in Machine Learning, each with its strengths and weaknesses. Delve into prominent classification algorithms and learn their practical applications:
Logistic Regression
Logistic Regression is a popular and explainable algorithm that models the probability of an observation belonging to a particular class using the sigmoid function. It is commonly used for binary classification tasks.
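To make the role of the sigmoid function concrete, here is a minimal sketch of how a trained Logistic Regression model might turn a weighted sum of features into a class probability. The weights, bias, and feature values are made-up illustrative numbers, not values from any real model.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability in (0, 1)."""
    # Two branches avoid overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(features, weights, bias):
    """Probability of the positive class for one observation."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights for a two-feature spam model (illustrative only).
weights = [1.5, -2.0]
bias = 0.25

p = predict_proba([2.0, 0.5], weights, bias)
label = "spam" if p >= 0.5 else "not spam"
```

The 0.5 threshold is the usual default for binary classification; in practice it can be moved to trade precision against recall.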
Decision Trees
Decision Trees are tree-based models that use a hierarchical structure to classify data. They are easy to interpret and can handle both categorical and numerical data. However, they can suffer from overfitting if not regularised.
Random Forests
Random Forests are an ensemble learning method that combines multiple Decision Trees to improve the accuracy and robustness of the model. They are less prone to overfitting compared to single Decision Trees.
Support Vector Machines (SVM)
SVMs are powerful algorithms that learn a hyperplane (decision boundary) by maximising the margin between different classes. They can handle non-linear data using the kernel trick.
K-Nearest Neighbors (KNN)
KNN is a lazy learning algorithm that classifies observations based on their similarity to the nearest neighbours in the training data. It is simple to implement but can be computationally expensive for large datasets.
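The idea behind KNN can be sketched in a few lines of plain Python. This is an illustrative toy implementation with made-up data, assuming Euclidean distance and a simple majority vote; library implementations add indexing structures to reduce the prediction cost.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Computing a distance to every stored point is the work a lazy
    # learner defers until prediction time.
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters (illustrative values only).
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["cat", "cat", "cat", "dog", "dog", "dog"]

prediction = knn_predict(train_X, train_y, query=(1.1, 1.0), k=3)
```

Because every prediction scans the whole training set, the cost grows with the number of stored observations, which is why KNN becomes expensive on large datasets.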
Naive Bayes
Naive Bayes is a family of probabilistic algorithms based on Bayes’ theorem. It is particularly useful for text classification and spam detection due to its simplicity and efficiency.
How Classification Algorithms Work
Understand the step-by-step process of classification algorithms, from data preprocessing and model selection to training and evaluation. Learn how these algorithms learn patterns and make predictions effectively. The process of using a classification algorithm involves several key steps:
Data Preprocessing: This includes encoding categorical variables, handling missing values, and normalising or standardising the data to ensure that all features are on the same scale.
Splitting the Dataset: The dataset is divided into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance.
Model Selection: Choosing the right algorithm based on the problem at hand. This involves considering factors like the type of classification task, the size and complexity of the dataset, and the computational resources available.
Training the Model: The selected algorithm is trained on the training dataset to learn the patterns and relationships between the input features and the output class labels.
Model Evaluation: The trained model is evaluated on the testing dataset to assess its performance using metrics like accuracy, precision, recall, F1-score, and ROC-AUC.
Advanced Topics in Classification
Dive into advanced topics such as handling imbalanced datasets, feature selection and engineering, and ensemble methods. Discover techniques to enhance model performance and address complex classification challenges.
Handling Imbalanced Datasets
Imbalanced datasets can significantly impact the performance of classification models. Techniques to handle imbalanced datasets include oversampling the minority class, undersampling the majority class, using class weights, and employing algorithms like SMOTE (Synthetic Minority Over-sampling Technique).
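Two of these techniques, class weights and oversampling, can be sketched in plain Python. The example below is an illustration on made-up labels. It uses simple random oversampling (duplicating minority examples), which is not SMOTE: SMOTE instead synthesises new points by interpolating between minority neighbours. The weight formula shown is the common "balanced" heuristic, n_samples / (n_classes * class_count).

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, cnt in counts.items():
        pool = [x for x, label in zip(X, y) if label == cls]
        for _ in range(target - cnt):
            X_out.append(rng.choice(pool))
            y_out.append(cls)
    return X_out, y_out

# Toy imbalanced dataset: 8 majority rows, 2 minority rows (illustrative only).
y = ["ok"] * 8 + ["fraud"] * 2
X = [[i] for i in range(10)]

weights = balanced_class_weights(y)
X_bal, y_bal = random_oversample(X, y)
```

Resampling should be applied to the training split only, so that the test set keeps the real class distribution.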
Feature Selection and Engineering
Feature selection involves choosing the most relevant features to enhance model performance and reduce dimensionality. Meanwhile, feature engineering focuses on creating new features from existing ones to capture underlying relationships in the data more effectively.
Ensemble Methods
Ensemble methods combine multiple models to improve overall performance. Techniques like bagging, boosting, and stacking can be used to create robust and accurate classification models.
Practical Implementation
To implement a classification algorithm practically, you can follow these steps:
Choose a Dataset: Select a relevant dataset for your problem. For example, the Iris dataset for multi-class classification or the Spam vs. Ham dataset for binary classification.
Preprocess the Data: Clean and preprocess the data by handling missing values, encoding categorical variables, and normalising the features.
Split the Dataset: Divide the dataset into training and testing sets.
Select and Train a Model: Choose an appropriate classification algorithm and train it on the training dataset.
Evaluate the Model: Evaluate the performance of the model on the testing dataset using relevant metrics.
Tune Hyperparameters: Perform hyperparameter tuning to optimise the model’s performance.
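The steps above can be sketched end to end with scikit-learn on the Iris dataset. This is one possible arrangement rather than a prescribed recipe: the choice of Logistic Regression, the 75/25 split, and the small grid of `C` values are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Choose a dataset.
X, y = load_iris(return_X_y=True)

# 2-3. Split first, so preprocessing is fitted on the training data only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# 4. Preprocessing and model combined in one pipeline.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 6. Tune the regularisation strength with 5-fold cross-validation.
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# 5. Evaluate the tuned model on the held-out test set.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
```

Tuning happens inside the training split through cross-validation, so the test set is touched only once, at the end.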
Future of Classification in Machine Learning
The future of classification in Machine Learning looks promising, with several emerging trends:
Deep Learning
Deep Learning models, particularly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), are becoming increasingly popular for complex classification tasks like image and text classification.
Explainability
There is a growing need for explainable AI, with techniques like SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) gaining traction to provide insights into model decisions.
Transfer Learning
Transfer learning allows models to leverage pre-trained weights from large datasets, enabling faster training and better performance on smaller datasets.
Classification algorithms are a cornerstone of Machine Learning, enabling machines to predict categorical outcomes from input data. By understanding the basics of classification, key algorithms, and practical implementation steps, you can effectively apply these techniques to solve real-world problems.
As Machine Learning continues to evolve, the role of classification algorithms will remain pivotal, driving advancements in various fields and industries.
Frequently Asked Questions
What is the Difference Between Classification and Regression in Machine Learning?
Classification involves predicting categorical output variables, while regression involves predicting continuous output variables.
What are Some Common Metrics Used to Evaluate the Performance of Classification Models?
Common metrics include accuracy, precision, recall, F1-score, and ROC-AUC. These metrics help in assessing the model’s ability to correctly classify observations.
How do you Handle Imbalanced Datasets in Classification Problems?
Techniques to handle imbalanced datasets include oversampling the minority class, undersampling the majority class, using class weights, and employing algorithms like SMOTE.
By mastering classification algorithms and staying updated with the latest trends and techniques, you can unlock the full potential of Machine Learning in solving complex real-world problems.
Multi-Class classification with Sci-kit learn & XGBoost: A case study using Brainwave data
By Avishek Nag (Machine Learning expert)
A comparison of different classifiers’ accuracy & performance for high-dimensional data
In Machine learning, classification problems with high-dimensional data are really challenging. Sometimes, very simple problems become extremely complex due to this ‘curse of dimensionality’ problem.
In this article, we will see how accuracy and performance vary across different classifiers. We will also see how, when we don’t have the freedom to choose a classifier independently, we can do feature engineering to make a poor classifier perform well.
Understanding the ‘datasource’ & problem formulation
For this article, we will use the “EEG Brainwave Dataset” from Kaggle. This dataset contains electronic brainwave signals from an EEG headset and is in temporal format. At the time of writing this article, nobody has created any ‘Kernel’ on this dataset; that is, as of now, no solution has been given in Kaggle.
So, to start with, let’s first read the data to see what’s there.
There are 2549 columns in the dataset and ‘label’ is the target column for our classification problem. All other columns like ‘mean_d_1_a’, ‘mean_d2_a’ etc are describing features of brainwave signal readings. Columns starting with the ‘fft’ prefix are most probably ‘Fast Fourier transforms’ of original signals. Our target column ‘label’ describes the degree of emotional sentiment.
As per Kaggle, here is the challenge: “Can we predict emotional sentiment from brainwave readings?”
Let’s first understand class distributions from column ‘label’:
So, there are three classes, ‘POSITIVE’, ‘NEGATIVE’ & ‘NEUTRAL’, for emotional sentiment. From the bar chart, it is clear that class distribution is not skewed and it is a ‘multi-class classification’ problem with target variable ‘label’. We will try with different classifiers and see the accuracy levels.
Before applying any classifier, the column ‘label’ should be separated out from other feature columns (‘mean_d_1_a’, ‘mean_d2_a’ etc are features).
As it is a ‘classification’ problem, we will follow the below conventions for each ‘classifier’ to be tried:
- We will use a ‘cross validation’ (in our case will use 10 fold cross validation) approach over the dataset and take average accuracy. This will give us a holistic view of the classifier’s accuracy.
- We will use a ‘Pipeline’ based approach to combine all pre-processing and main classifier computation. An ML ‘Pipeline’ wraps all processing stages in a single unit and acts as a ‘classifier’ itself. By this, all stages become re-usable and can be used to form other ‘pipelines’ as well.
- We will track total time in building & testing for each approach. We will call this ‘time taken’.
For the above, we will primarily use the scikit-learn package from Python. As the number of features here is quite high, we will start with a classifier which works well on high-dimensional data.
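The three conventions above (10-fold cross validation, a ‘Pipeline’, and a measured ‘time taken’) might be wired together roughly as follows. This is a sketch, not the article's notebook code: it runs on synthetic data from `make_classification` as a stand-in for the EEG dataset, and the pipeline contents are chosen only for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class stand-in for the real dataset (assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=10, n_classes=3, random_state=0
)

# Pre-processing and classifier wrapped into a single reusable unit.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 10-fold cross validation, timed from start to finish.
start = time.perf_counter()
scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")
time_taken = time.perf_counter() - start

mean_accuracy = scores.mean()
```

Because the scaler lives inside the pipeline, it is re-fitted on the training folds of each split, so no information leaks from the held-out fold.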
RandomForest Classifier
‘RandomForest’ is a tree & bagging approach-based ensemble classifier. Each tree considers a random subset of features at every split and picks the split that most reduces impurity (for example entropy or Gini), so uninformative features end up contributing very little. Let’s see that:
Accuracy is very good at 97.7% and ‘total time taken’ is quite short (3.29 seconds only).
For this classifier, no pre-processing stages like scaling are required, as tree splits depend only on the ordering of values within each feature and are therefore not affected by scale factors.
Logistic Regression Classifier
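‘Logistic Regression’ is a linear classifier: like linear regression it computes a weighted sum of the features, but it passes that sum through the logistic (sigmoid) function to obtain class probabilities.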
We can see accuracy (93.19%) is lower than ‘RandomForest’ and ‘time taken’ is higher (2 min 7s).
‘Logistic Regression’ is heavily affected by different value ranges across independent variables, and thus requires ‘feature scaling’. That’s why ‘StandardScaler’ from scikit-learn has been added as a preprocessing stage. It standardises each feature to zero mean & unit variance. Note that this does not bound the values to a fixed range such as -1 to +1; outliers can still lie several standard deviations from zero.
The reason for the high time taken is the high dimensionality and the scaling time required. There are 2549 variables in the dataset and the coefficient of each one should be optimised as per the Logistic Regression process. Also, there is a question of multicollinearity. This means linearly correlated variables should be grouped together instead of considering them separately.
The presence of multicollinearity affects accuracy. So now the question becomes, “Can we reduce the number of variables, reduce multicollinearity, & improve ‘time taken’?”
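As a small illustration of what standardisation does to a single feature, the sketch below subtracts the mean and divides by the population standard deviation, which is the same transformation ‘StandardScaler’ applies per column. The input values are made up, with one deliberate outlier.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Made-up feature column with one large outlier.
raw = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = standardize(raw)

scaled_mean = statistics.fmean(scaled)
scaled_std = statistics.pstdev(scaled)
largest = max(scaled)
```

The scaled column has mean 0 and standard deviation 1, yet the outlier still lands well above 1, so standardised values are not confined to any fixed interval.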
Principal Component Analysis (PCA)
PCA can transform the original variables into a lower-dimensional space of uncorrelated components and thus reduce the number of required variables. All collinear variables get clubbed together. Let’s do a PCA of the data and see what the main PCs are:
We mapped 2549 variables to 20 Principal Components. From the above result, it is clear that first 10 PCs are a matter of importance. The total percentage of the explained variance ratio by the first 10 PCs is around 0.737 (0.36 + 0.095 + ..+ 0.012). Or it can be said that the first 10 PCs explain 73.7% variance of the entire dataset.
So, with this we are able to reduce 2549 variables to 10 variables. That’s a dramatic change, isn’t it? In theory, Principal Components are virtual variables generated from mathematical mapping. From a business angle, it is not possible to tell which physical aspect of the data is covered by them. That means, physically, that Principal Components don’t exist. But, we can easily use these PCs as quantitative input variables to any ML algorithm and get very good results.
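How the explained variance ratio is read off a fitted PCA can be sketched as below. This is not the article's notebook code: it uses synthetic data built from three hidden factors as a stand-in for the EEG dataset, so the numbers it produces are unrelated to the 73.7% quoted above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 correlated columns driven by 3 hidden factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 30))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 30))

# PCA is sensitive to scale, so standardise the columns first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of total variance captured by each component, largest first.
ratios = pca.explained_variance_ratio_
# Running total: how much variance the first k components explain together.
cumulative = np.cumsum(ratios)
```

Plotting or printing `cumulative` is a common way to decide how many components to keep: one picks the smallest k at which the running total is judged high enough.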
For visualisation, let’s take the first two PCs and see how we can distinguish different classes of the data using a ‘scatterplot’.
In the above plot, three classes are shown in different colours. So, if we use the same ‘Logistic Regression’ classifier with these two PCs, then from the above plot we can probably say that the first classifier will separate out ‘NEUTRAL’ cases from other two cases and the second classifier will separate out ‘POSITIVE’ & ‘NEGATIVE’ cases (as there will be two internal logistic classifiers for 3-class problem). Let’s try and see the accuracy.
Time taken (3.34 s) was reduced but accuracy (77%) decreased.
Now, let’s take all 10 PCs and run:
We see an improvement in accuracy (86%) compared to the 2-PC case, with a marginal increase in ‘time taken’.
So, in both cases we saw low accuracy compared to normal logistic regression, but a significant improvement in ‘time taken’.
Accuracy can be further tested with a different ‘solver’ & ‘max_iter’ parameter. We used ‘saga’ as ‘solver’ with L1 penalty and 200 as ‘max_iter’. These values can be changed to get a variable effect on accuracy.
Though ‘Logistic Regression’ is giving low accuracy, there are situations where it may be needed, especially with PCA. In datasets with a very large dimensional space, PCA becomes the obvious choice for ‘linear classifiers’.
In some cases, where a benchmark for ML applications is already defined and only limited choices of some ‘linear classifiers’ are available, this analysis would be helpful. It is very common to see such situations in large organisations where standards are already defined and it is not possible to go beyond them.
Artificial Neural Network Classifier (ANN)
An ANN classifier is non-linear and learns its own intermediate feature representations in its hidden layers, which acts as a form of automatic feature engineering and dimensional reduction. ‘MLPClassifier’ in scikit-learn works as an ANN. But here also, basic scaling is required for the data. Let’s see how it works:
Accuracy (97.5%) is very good, though running time is high (5 min).
The reason for high ‘time taken’ is the rigorous training time required for neural networks, and that too with a high number of dimensions.
It is a general convention to start with a hidden layer size of 50% of the number of input features, with each subsequent layer 50% of the previous one. In our case these are (1275 = 2549 / 2, 637 = 1275 / 2). The number of hidden layers can be taken as a hyper-parameter and can be tuned for better accuracy. In our case it is 2.
Linear Support Vector Machines Classifier (SVM)
We will now apply ‘Linear SVM’ on the data and see how accuracy is coming along. Here also scaling is required as a preprocessing stage.
Accuracy is coming in at 96.4%, which is a little less than ‘RandomForest’ or ‘ANN’. ‘time taken’ is 55 s, which is far better than ‘ANN’.
Extreme Gradient Boosting Classifier (XGBoost)
XGBoost is a boosted tree based ensemble classifier. Like ‘RandomForest’, it will also automatically reduce the feature set. For this we have to use a separate ‘xgboost’ library which does not come with scikit-learn. Let’s see how it works:
Accuracy (99.4%) is exceptionally good, but ‘time taken’ (15 min) is quite high. Nowadays, for complicated problems, XGBoost is becoming a default choice for Data Scientists for its accurate results. It has a high running time due to its internal ensemble model structure. However, XGBoost performs well on GPU machines.
From all of the classifiers, it is clear that for accuracy ‘XGBoost’ is the winner. But if we take ‘time taken’ along with ‘accuracy’, then ‘RandomForest’ is a perfect choice. But we also saw how to use a simple linear classifier like ‘logistic regression’ with proper feature engineering to give better accuracy. Other classifiers don’t need that much feature engineering effort.
It depends on the requirements, use case, and data engineering environment available to choose a perfect ‘classifier’.
The entire project on Jupyter Notebook can be found here.
Getting started with Classification
As the name suggests, Classification is the task of “classifying things” into sub-categories. Classification is part of supervised machine learning, in which the model is trained on labeled data.
The article serves as a comprehensive guide to understanding and applying classification techniques, highlighting their significance and practical implications.
What is Supervised Machine Learning?
Supervised Machine Learning is where you have input variables (X) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output, Y = f(X). The goal is to approximate the mapping function so well that when you have new input data (X) you can predict the output variable (Y) for that data.
Supervised learning problems can be further grouped into Regression and Classification problems.
- Regression: Regression algorithms are used to predict a continuous numerical output. For example, a regression algorithm could be used to predict the price of a house based on its size, location, and other features.
- Classification: Classification algorithms are used to predict a categorical output. For example, a classification algorithm could be used to predict whether an email is spam or not.
Machine Learning for classification
Classification is a process of categorizing data or objects into predefined classes or categories based on their features or attributes.
Machine Learning classification is a type of supervised learning technique where an algorithm is trained on a labeled dataset to predict the class or category of new, unseen data.
The main objective of classification machine learning is to build a model that can accurately assign a label or category to a new observation based on its features.
For example, a classification model might be trained on a dataset of images labeled as either dogs or cats and then used to predict the class of new, unseen images of dogs or cats based on their features such as color, texture, and shape.
Classification Types
There are two main classification types in machine learning:
Binary Classification
In binary classification, the goal is to classify the input into one of two classes or categories. Example – On the basis of the given health conditions of a person, we have to determine whether the person has a certain disease or not.
Multiclass Classification
In multi-class classification, the goal is to classify the input into one of several classes or categories. For example, on the basis of data about different species of flowers, we have to determine which species our observation belongs to.
Binary vs Multi class classification
Other categories of classification include:
Multi-Label Classification
In multi-label classification, the goal is to predict which of several labels apply to a new data point, where more than one label may apply at once. This is different from multiclass classification, where each data point can only belong to one class. For example, a multi-label classification algorithm could be used to classify images of animals as belonging to one or more of the categories cat, dog, bird, or fish.
Imbalanced Classification
In imbalanced classification, the goal is to predict whether a new data point belongs to a minority class, even though there are many more examples of the majority class. For example, a medical diagnosis algorithm could be used to predict whether a patient has a rare disease, even though there are many more patients with common diseases.
Classification Algorithms
There are various types of classification algorithms. Some of them are:
Linear Classifiers
Linear models create a linear decision boundary between classes. They are simple and computationally efficient. Some of the linear classification models are as follows:
- Logistic Regression
- Support Vector Machines having kernel = ‘linear’
- Single-layer Perceptron
- Stochastic Gradient Descent (SGD) Classifier
Non-linear Classifiers
Non-linear models create a non-linear decision boundary between classes. They can capture more complex relationships between the input features and the target variable. Some of the non-linear classification models are as follows:
- K-Nearest Neighbours
- Naive Bayes
- Decision Tree Classification
- Ensemble learning classifiers:
- Random Forests,
- Bagging Classifier,
- Voting Classifier,
- ExtraTrees Classifier
- Multi-layer Artificial Neural Networks
Learners in Classification Algorithms
In machine learning, classification learners can also be classified as either “lazy” or “eager” learners.
- Lazy Learners: Also known as instance-based learners, lazy learners do not learn a model during the training phase. Instead, they simply store the training data and use it to classify new instances at prediction time. Training is therefore very fast, but prediction is slower because the computation is deferred until a new instance arrives. They are less effective in high-dimensional spaces or when the number of training instances is large. Examples of lazy learners include k-nearest neighbors and case-based reasoning.
- Eager Learners: Also known as model-based learners, eager learners learn a model from the training data during the training phase and use this model to classify new instances at prediction time. They are more effective in high-dimensional spaces with large training datasets. Examples of eager learners include decision trees, random forests, and support vector machines.
Classification Models in Machine Learning
Evaluating a classification model is an important step in machine learning, as it helps to assess the performance and generalization ability of the model on new, unseen data. There are several metrics and techniques that can be used to evaluate a classification model, depending on the specific problem and requirements. Here are some commonly used evaluation metrics:
- Classification Accuracy: The proportion of correctly classified instances over the total number of instances in the test set. It is a simple and intuitive metric but can be misleading in imbalanced datasets where the majority class dominates the accuracy score.
- Confusion matrix : A table that shows the number of true positives, true negatives, false positives, and false negatives for each class, which can be used to calculate various evaluation metrics.
- Precision and Recall: Precision measures the proportion of true positives over the total number of predicted positives, while recall measures the proportion of true positives over the total number of actual positives. These metrics are useful in scenarios where one class is more important than the other, or when there is a trade-off between false positives and false negatives.
- F1-Score: The harmonic mean of precision and recall, calculated as 2 x (precision x recall) / (precision + recall). It is a useful metric for imbalanced datasets where both precision and recall are important.
- ROC curve and AUC: The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate (recall) against the false positive rate (1-specificity) for different threshold values of the classifier’s decision function. The Area Under the Curve (AUC) summarizes the overall performance of the classifier. It ranges from 0 to 1, where 0.5 corresponds to random guessing and 1 to perfect classification.
- Cross-validation: A technique that divides the data into multiple folds, trains the model on all folds but one, and tests it on the held-out fold, repeating this so that each fold serves as the test set once, to obtain a more robust estimate of the model’s performance.
It is important to choose the appropriate evaluation metric(s) based on the specific problem and requirements, and to avoid overfitting by evaluating the model on independent test data.
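The accuracy, precision, recall, and F1 definitions above can be checked by hand on a small example. The sketch below uses made-up true and predicted labels for a binary problem and computes each metric directly from the confusion-matrix counts.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

# Made-up labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall = tp / (tp + fn)      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
```

Here accuracy is 0.8 while precision, recall, and F1 are all 0.75, which shows how accuracy alone can look better than the model's performance on the positive class. Real code should also guard against division by zero when a class is never predicted.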
Characteristics of Classification
Here are the characteristics of the classification:
- Categorical Target Variable: Classification deals with predicting categorical target variables that represent discrete classes or labels. Examples include classifying emails as spam or not spam, predicting whether a patient has a high risk of heart disease, or identifying image objects.
- Accuracy and Error Rates: Classification models are evaluated based on their ability to correctly classify data points. Common metrics include accuracy, precision, recall, and F1-score.
- Model Complexity: Classification models range from simple linear classifiers to more complex nonlinear models. The choice of model complexity depends on the complexity of the relationship between the input features and the target variable.
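- Overfitting and Underfitting: Classification models are susceptible to overfitting and underfitting. Overfitting occurs when the model learns the training data too well and fails to generalize to new data. Underfitting occurs when the model is too simple to capture the underlying patterns, so it performs poorly even on the training data.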
How does Classification Machine Learning Work?
The basic idea behind classification is to train a model on a labeled dataset, where the input data is associated with their corresponding output labels, to learn the patterns and relationships between the input data and output labels. Once the model is trained, it can be used to predict the output labels for new unseen data.
Classification Machine Learning
The classification process typically involves the following steps:
Understanding the problem
Before getting started with classification, it is important to understand the problem you are trying to solve. What are the class labels you are trying to predict? What is the relationship between the input data and the class labels?
Suppose we have to predict whether a patient has a certain disease or not, on the basis of 7 independent variables, called features. This means, there can be only two possible outcomes:
- The patient has the disease, which means “ True ”.
- The patient has no disease, which means “ False ”.
This is a binary classification problem.
Data preparation
Once you have a good understanding of the problem, the next step is to prepare your data. This includes collecting and preprocessing the data and splitting it into training, validation, and test sets. In this step, the data is cleaned, preprocessed, and transformed into a format that can be used by the classification algorithm.
- X: The independent features, in the form of an N*M matrix, where N is the number of observations and M is the number of features.
- y: A vector of length N holding the actual (target) class for each of the N observations.
Feature Extraction
The relevant features or attributes are extracted from the data that can be used to differentiate between the different classes.
Suppose our input X has 7 independent features, of which only 5 influence the label or target values, while the remaining 2 are negligibly correlated or not correlated at all. We will then use only these 5 features for model training.
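One simple way to screen features like this is to measure how strongly each column correlates with the label and drop the weak ones. The sketch below is an illustration on made-up data, assuming Pearson correlation and an arbitrary threshold of 0.3; Pearson correlation only captures linear relationships, so it is a first filter rather than a complete method.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def select_features(X, y, threshold=0.3):
    """Indices of columns whose |correlation| with y meets the threshold."""
    keep = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        if abs(pearson(column, y)) >= threshold:
            keep.append(j)
    return keep

# Toy data: columns 0 and 2 track the label, columns 1 and 3 do not.
X = [
    [1, 3, 9, 1],
    [2, 3, 8, 2],
    [1, 3, 9, 3],
    [5, 3, 2, 1],
    [6, 3, 1, 2],
    [5, 3, 2, 3],
]
y = [0, 0, 0, 1, 1, 1]

selected = select_features(X, y)
X_selected = [[row[j] for j in selected] for row in X]
```

The selection should be computed on the training data only and then applied unchanged to validation and test data.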
Model Selection
There are many different models that can be used for classification, including logistic regression, decision trees, support vector machines (SVM), and neural networks. It is important to select a model that is appropriate for your problem, taking into account the size and complexity of your data, and the computational resources you have available.
Model Training
Once you have selected a model, the next step is to train it on your training data. This involves adjusting the parameters of the model to minimize the error between the predicted class labels and the actual class labels for the training data.
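To make "adjusting the parameters to minimize the error" concrete, here is a minimal sketch of logistic regression trained with plain gradient descent on the log loss. The synthetic data, learning rate, and iteration count are assumptions for illustration. In practice a library implementation would normally be used instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(X, y, w, b):
    # Average cross-entropy between predicted probabilities and true labels.
    p = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def train(X, y, lr=0.1, n_iter=500):
    # Start from zero weights and repeatedly step against the loss gradient.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        residual = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ residual) / len(y)
        b -= lr * residual.mean()
    return w, b

# Synthetic two-feature data with a hypothetical labelling rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

loss_before = log_loss(X, y, np.zeros(2), 0.0)
w, b = train(X, y)
loss_after = log_loss(X, y, w, b)
accuracy = float(np.mean((sigmoid(X @ w + b) >= 0.5) == (y == 1)))

print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
print(f"training accuracy: {accuracy:.3f}")
```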
Model Evaluation
After training the model, it is important to evaluate its performance on a validation set. This will give you a good idea of how well the model is likely to perform on new, unseen data.
Common quality metrics for measuring the performance of the model include log loss (cross-entropy loss), the confusion matrix, precision, recall, and the AUC-ROC curve.
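The metrics named above can all be computed with `sklearn.metrics`. The following sketch uses a tiny hand-written set of labels and predicted probabilities, which are made up for illustration, and a decision threshold of 0.5, which is also an assumption.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    log_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical validation labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.90, 0.30, 0.70])
# Turn probabilities into hard class predictions at a 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)

cm = confusion_matrix(y_true, y_pred)       # rows: actual class, columns: predicted class
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
loss = log_loss(y_true, y_prob)              # uses probabilities, not hard predictions
auc = roc_auc_score(y_true, y_prob)          # threshold-independent ranking quality

print(cm)
print(f"precision={precision:.2f} recall={recall:.2f} log_loss={loss:.3f} auc={auc:.4f}")
```

Precision and recall depend on the chosen threshold, while log loss and AUC are computed from the probabilities directly.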
Fine-tuning the model
If the model’s performance is not satisfactory, you can fine-tune it by adjusting its hyperparameters or by trying a different model.
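One common way to adjust parameters systematically is a grid search with cross-validation. The sketch below tunes the regularization strength `C` of a logistic regression using scikit-learn's `GridSearchCV`. The synthetic dataset and the candidate values are assumptions, and cross-validation on the training data stands in here for a separate validation set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data with 7 features.
X, y = make_classification(
    n_samples=300, n_features=7, n_informative=5, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Candidate values for the inverse regularization strength.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

print("best C:", search.best_params_["C"])
print("cross-validated accuracy:", round(search.best_score_, 3))
print("held-out accuracy:", round(search.score(X_test, y_test), 3))
```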
Deploying the model
Finally, once we are satisfied with the performance of the model, we can deploy it to make predictions on new data and use it to solve real-world problems.
Examples of Machine Learning Classification in Real Life
Classification algorithms are widely used in many real-world applications across various domains, including:
- Email spam filtering
- Credit risk assessment
- Medical diagnosis
- Image classification
- Sentiment analysis
- Fraud detection
- Quality control
- Recommendation systems
Implementation of Classification Model in Machine Learning
Let’s get a hands-on experience with how Classification works. We are going to study various Classifiers and see a rather simple analytical comparison of their performance on a well-known, standard data set, the Iris data set.
Requirements for running the given script:
- Scipy and Numpy
- Scikit-learn
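The original script is not reproduced here. As a stand-in, the following is a minimal sketch of such a comparison: it evaluates four scikit-learn classifiers on the Iris data set with 5-fold cross-validation. The choice of classifiers, their settings, and the use of cross-validation are assumptions rather than the article's own setup.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

classifiers = {
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
}

# Mean accuracy over 5 stratified folds for each classifier.
results = {}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name:20s} mean accuracy = {scores.mean():.3f}")
```

Iris is small and well separated, so differences between classifiers here are slight and should not be read as a general ranking.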
In conclusion, classification is a fundamental task in machine learning, involving the categorization of data into predefined classes or categories based on their features.
Frequently Asked Questions (FAQs)
What is a classification rule in machine learning?
A decision guideline in machine learning that determines the class or category of an input based on its features.
What are classification algorithms?
Methods such as decision trees, SVM, and k-NN that categorize data into predefined classes for predictions.
What is learning classification?
Acquiring knowledge to assign labels to input data, distinguishing classes in supervised machine learning.
What is the difference between classification and clustering?
Classification: Predicts predefined classes. Clustering: Groups data based on inherent similarities without predefined classes.
What is the difference between classification and regression methods?
Classification: Assigns labels to data classes. Regression: Predicts continuous values for quantitative analysis.
15 Top Machine Learning Case Studies to Look Into Right Now
Introduction
Machine learning is one of the most valuable skills that a data science professional can have in 2024.
According to a report from Gartner, as the adoption of machine learning continues to grow across industries, it is evolving from mere predictive algorithms to a more data-centric discipline.
Machine learning case studies are in-depth analyses of real-world business problems in which machine learning techniques are applied to solve the problem or provide insights.
If you’re looking for an updated list of machine learning case studies to explore, you’re in the right place. Read on for our hand-picked case studies and tips on solving them.
Why Should You Explore Machine Learning Case Studies?
Better Job Prospects
Employers are often concerned that their recruits lack business acumen or data-handling skills. Working on real-world case studies and adding them to your resume will showcase your hands-on expertise, thereby bolstering your CV.
We’ve seen numerous examples where adding relevant personal and academic projects to interviewees’ resumes has helped them get their foot in the door.
Helps You Identify In-Demand Skills
Case studies often highlight tools and techniques currently in demand within a particular industry. By studying them, you can tailor your preparation strategy to acquire these skills, aligning your expertise with what leading tech firms are looking for.
This will enhance your prospects in a very competitive job market.
Insight On Industry-Specific Challenges
Industries leverage data science and machine learning in different ways. By examining case studies across healthcare, finance, or retail, you’ll gain insight into how ML solutions are customized to meet industry-specific challenges.
For example, suppose you are planning to interview at a banking organization.
In that case, you can leverage what you learned to discuss industry-relevant ML applications and propose solutions to common banking and financial challenges. This will help you land specialized roles that are much more lucrative than general data roles.
With these benefits in mind, let’s explore the top 15 machine learning case studies that are particularly relevant in 2024.
What Are the Best ML Case Studies Right Now?
We’ve curated examples that highlight the innovative use of AI and ML technologies and reflect common business challenges in today’s job market.
1. Starbucks Customer Loyalty Program
Starbucks aims to enhance customer engagement and loyalty by delivering personalized offers and recommendations.
The goal is to analyze customer data to uncover patterns and preferences for tailoring marketing efforts and increase customer satisfaction by making each customer feel uniquely understood.
- Objective : To increase retention, boost sales, and enhance the customer experience.
- How to build : Cluster customers based on similar behaviors, identify the types of offers most likely to appeal to each group and develop a recommendation engine to generate personalized offers. Key tools include data management systems like SQL databases for structured data storage, Python for data processing and machine learning with libraries such as pandas and scikit-learn, and a platform like TensorFlow to develop and train the recommendation models.
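The clustering step can be sketched with k-means in scikit-learn. Everything about the data below is hypothetical: the two behavioural features, the three synthetic customer groups, and the choice of three clusters are assumptions for illustration and do not describe Starbucks' actual data or system.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical features per customer: visits per month, average spend per visit.
occasional = rng.normal(loc=[2, 5], scale=[0.5, 1.0], size=(50, 2))
regular = rng.normal(loc=[10, 6], scale=[1.0, 1.0], size=(50, 2))
big_spender = rng.normal(loc=[5, 20], scale=[1.0, 2.0], size=(50, 2))
customers = np.vstack([occasional, regular, big_spender])

# Scale features so that neither dominates the distance computation.
scaled = StandardScaler().fit_transform(customers)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = kmeans.fit_predict(scaled)

# Summarize each segment so offers can be matched to its typical behaviour.
for s in range(3):
    members = customers[segments == s]
    print(
        f"segment {s}: {len(members)} customers, "
        f"mean visits={members[:, 0].mean():.1f}, "
        f"mean spend={members[:, 1].mean():.1f}"
    )
```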
2. Amazon Pricing Case Study
Amazon employs a dynamic pricing model to avoid updating prices manually. It uses sophisticated algorithms to adjust prices in real time based on demand, competitor pricing, inventory levels, and customer behavior to achieve maximum profitability.
- Objective : To maximize revenue and market share by implementing optimal pricing across millions of products.
- How to build : First aggregate real-time and historical data. Train regression models and ensemble methods like random forests or boosted trees to predict optimal price points. These models learn from past pricing outcomes and continuously adjust to changing variables. Use core technologies including a big data platform like Apache Hadoop and machine learning frameworks such as TensorFlow or PyTorch.
Here is an interesting pricing problem for calculating electricity consumption.
3. Amazon’s Real-Time Fraud Detection System
Another case study from Amazon—its fraud detection system uses machine learning to identify and prevent fraudulent transactions as they occur.
- Objective : To accurately identify potentially fraudulent transactions in real time without impacting the user experience with false positives.
- How to build : Create relevant features from raw data that help identify suspicious activity, such as unusual transaction sizes or patterns that deviate from the norm. Employ ensemble methods like random forest or gradient boosting machines (GBMs) due to their robustness and ability to handle unbalanced datasets. Tools you’d typically use include Amazon Redshift, frameworks such as TensorFlow, and Apache Kafka for real-time streaming.
Here is our takehome project on a similar business problem: detecting credit card fraud.
4. Netflix’s Recommendation Engine
Netflix’s recommendation engine analyzes individual viewing habits to suggest shows and movies that users are likely to enjoy.
This personalization is critical for enhancing user satisfaction and engagement and driving continued subscription renewals.
- Objective : To maximize the relevance of recommended content, promote a diverse array of content that users might not find on their own, and increase overall viewing time and subscriber retention.
- How to build : Extract useful features from the data, such as time stamps, duration of views, and metadata of the content like genres, actors, and release dates. Netflix uses various methods, including collaborative filtering, matrix factorization, and deep learning techniques, to predict user preferences. Also, test and refine the algorithms using A/B testing and other evaluation metrics to ensure the recommendations are accurate and engaging.
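Matrix factorization, one of the methods mentioned above, can be sketched in a few lines of NumPy. The toy rating matrix, the rank, and the learning settings below are all assumptions for illustration. This is a textbook gradient-descent formulation, not Netflix's production approach.

```python
import numpy as np

# Toy user-by-title rating matrix; 0 marks a title the user has not rated.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
mask = R > 0

def rmse(R, mask, U, V):
    # Error measured only on the ratings that were actually observed.
    err = (R - U @ V.T)[mask]
    return float(np.sqrt(np.mean(err ** 2)))

def factorize(R, mask, rank=2, lr=0.01, reg=0.02, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(R.shape[0], rank))  # user factors
    V = rng.normal(scale=0.1, size=(R.shape[1], rank))  # title factors
    initial = rmse(R, mask, U, V)
    for _ in range(n_iter):
        err = mask * (R - U @ V.T)            # zero out unobserved entries
        U_next = U + lr * (err @ V - reg * U)  # gradient step for user factors
        V = V + lr * (err.T @ U - reg * V)     # gradient step for title factors
        U = U_next
    return U, V, initial

U, V, rmse_before = factorize(R, mask)
rmse_after = rmse(R, mask, U, V)
predicted = U @ V.T  # includes estimates for the unrated entries

print(f"RMSE on observed ratings: {rmse_before:.2f} -> {rmse_after:.2f}")
print(np.round(predicted, 1))
```

The entries of `predicted` where `mask` is false are the model's estimates for unseen titles, which is what a recommender would rank.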
For more practice, the MovieLens dataset is a classic choice for building recommendation systems.
5. Google’s Search Algorithm
Google’s search engine uses complex machine learning algorithms to analyze, interpret, and rank web pages based on their relevance to user queries.
The core of it involves crawling, indexing, and ranking web pages using various signals to deliver the most relevant results.
- Objective : To provide the most accurate and relevant search results based on the user’s query and search intent with speed and efficiency.
- Web crawling : Use crawlers to visit web pages, read the information, and follow links to other pages on the internet.
- Indexing : Organize the content found during crawling into an index. This index needs to be structured so that the system can find data quickly in response to user queries.
- Ranking algorithm : Google uses the PageRank algorithm, which evaluates the quality and quantity of links to a page to determine a rough estimate of the website’s importance.
- Query processing : Develop a system to interpret and process user queries, applying natural language processing techniques to understand context.
- Tools : Open-source web crawlers like Apache Nutch, scalable databases such as Apache Cassandra, and NLTK or spaCy libraries in Python for understanding user queries.
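The ranking step above can be illustrated with the simplified textbook form of PageRank, computed by power iteration. The four-page link graph and the damping factor of 0.85 are assumptions for illustration. Real search ranking combines many more signals than this.

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10, max_iter=200):
    """adj[i, j] = 1 if page i links to page j."""
    n = adj.shape[0]
    out_degree = adj.sum(axis=1)
    # Row-stochastic transition matrix; a page with no outgoing links
    # is treated as linking uniformly to every page.
    transition = np.where(
        out_degree[:, None] > 0,
        adj / np.maximum(out_degree, 1)[:, None],
        1.0 / n,
    )
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * (rank @ transition)
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break
    return rank

# Hypothetical graph: pages 0, 1 and 3 all link to page 2; page 2 links to page 0.
adj = np.array([
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
], dtype=float)

ranks = pagerank(adj)
print(np.round(ranks, 4))
print("highest-ranked page:", int(np.argmax(ranks)))
```

Page 2 ends up with the highest score because it receives the most links, and page 0 scores well because its single inbound link comes from that important page.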
6. Telecom Customer Churn Prediction
In the telecom industry, customer churn prediction models identify customers likely to cancel their services.
This allows companies to address at-risk customers with targeted interventions.
- Objective : To identify customers likely to churn by understanding the factors that lead to customer dissatisfaction.
- How to build : Use ML algorithms such as logistic regression, decision trees, or ensemble methods like random forests or gradient boosting machines to build the model.
The Telco Customer Churn dataset on Kaggle is very popular for customer churn prediction projects.
7. Loan Application Case Study
Machine learning models are increasingly used by financial companies to streamline and improve the decision-making process for loan applications.
These models analyze applicants’ financial data, credit history, and other relevant variables to predict the likelihood of default.
- Objective : To improve the accuracy of loan approval decisions by predicting the risk associated with potential borrowers. Automating this process also reduces the time to approve a loan application.
- How to build : Develop features from raw data that are predictive of loan repayments, such as debt-to-income ratio, credit utilization rate, and past financial behavior. Use supervised learning algorithms like logistic regression, decision trees, or more sophisticated methods such as gradient boosting or neural networks to train the model on historical data.
Here is a list of more fintech projects to try.
8. LinkedIn’s AI-Powered Job Matching System
LinkedIn leverages advanced algorithms to connect job seekers with the most relevant opportunities.
This system analyzes job postings and user profiles to make accurate recommendations that align with the user’s career goals and the employer’s needs.
- Objective : To refine the accuracy of job matches, increase user engagement, and streamline the hiring process for employers.
- How to build : Clean and transform data from profiles, job listings, and user interactions using NLP methods to extract relevant features for job matching. Use collaborative filtering and neural networks on this data to predict user preferences and match jobs.
9. Twitter Contextual Ad Placement Study
Twitter’s contextual ad placement system dynamically serves ads based on real-time analysis of user interactions.
- Objective : To enhance user engagement with ads by making them more relevant and less intrusive. This relevance increases the likelihood of users interacting with the ads, which improves the efficiency of ad campaigns.
- How to build : Extract useful features from the data, such as keywords from tweets, used hashtags, sentiment of the tweets, and user engagement rates with similar content. Models like logistic regression for click prediction or deep learning models for more complex patterns are common choices for the algorithm. Finally, implement the model using a real-time processing framework to allow for dynamic ad placement as user behavior changes.
10. Uber’s Demand Forecasting
Uber’s demand forecasting model leverages machine learning to predict future ride demand in various geographic areas.
This system helps optimize the allocation of drivers while maximizing earnings.
- Objective : To balance supply and demand across Uber’s network. This includes lowering wait times, maximizing earnings for drivers, and optimizing surge pricing by predicting spikes in demand.
- How to build : Employ time series forecasting models like ARIMA or more complex models such as LSTM (long short-term memory) networks, which are capable of handling sequential data and can learn patterns over time.
11. Hotel Recommendation System
These systems analyze vast amounts of data, including previous bookings, user ratings, search queries, and user demographics, to predict hotels that a customer might prefer.
This approach enhances user satisfaction and boosts booking conversion rates for platforms.
- Objective : The system aims to increase the likelihood of bookings by providing relevant recommendations that match user preferences. In the long term, personalized experiences help build customer loyalty, as users are more likely to return to a service that understands their needs.
- How to build : Create features that can help in understanding user preferences, such as preferred locations, amenities, price range, and types of accommodations (e.g., hotels, B&Bs, resorts). Features related to temporal patterns, like booking during a particular season or for specific types of trips (business, leisure), can also influence decisions. Implement machine learning algorithms such as collaborative filtering, which can recommend hotels based on similar user preferences, or content-based filtering, which suggests hotels similar to those the user has liked before. Advanced models also integrate deep learning to handle the complexity of the data.
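Content-based filtering, as described above, can be sketched by comparing hotels through the cosine similarity of their feature vectors. The hotel names and binary amenity features below are invented for illustration.

```python
import numpy as np

# Hypothetical hotels described by binary features:
# [pool, spa, beachfront, city_center, business_facilities]
hotel_names = ["Seaside Resort", "City Business Inn", "Harbour Spa Hotel", "Budget Hostel"]
features = np.array([
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
], dtype=float)

def recommend_similar(liked_name, top_n=1):
    """Return the hotels most similar to one the user already liked."""
    liked = features[hotel_names.index(liked_name)]
    # Cosine similarity between the liked hotel and every hotel.
    norms = np.linalg.norm(features, axis=1) * np.linalg.norm(liked)
    similarity = features @ liked / norms
    ranked = np.argsort(-similarity)
    # Skip the liked hotel itself, which always has similarity 1.
    return [hotel_names[i] for i in ranked if hotel_names[i] != liked_name][:top_n]

print(recommend_similar("Seaside Resort"))
```

Collaborative filtering would instead compare users by their booking or rating histories, so it needs interaction data rather than item features.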
Here is an interesting takehome problem on recommending Airbnb homes to users.
12. IBM’s Weather Prediction
IBM’s The Weather Company harnesses advanced machine learning and artificial intelligence to enhance the accuracy of weather forecasts.
Through these tools, IBM aims to provide precise weather predictions that can inform decisions ranging from agriculture to disaster response.
- Objective : To enhance the precision of weather forecasts to better predict events such as storms, rainfall, and temperature fluctuations. Also, it aims to support decision-making in various sectors for operational decisions and advance climate research.
- How to build : Advanced machine learning models like neural networks and ensemble methods are utilized in order to analyze complex weather data. The models are regularly refined and tested against actual weather outcomes to improve their accuracy over time. High-capacity databases like IBM Db2 or cloud storage solutions are used to handle large datasets.
13. Zillow’s House Price Prediction System
Zillow’s house price prediction, well-known through its “Zestimate” feature, utilizes machine learning to estimate the market value of homes across the US.
This system analyzes data from various sources, including property characteristics, location, market conditions, and historical transaction data to generate a market value in near real time.
- Objective : To provide homeowners and buyers with a reliable estimate of property values, helping them make informed buying, selling, and refinancing decisions.
- How to build : Develop predictive features from the collected data. This involves extracting insights from the raw data, like normalizing prices by square footage or adjusting values based on local real estate market health. Employ advanced regression models and techniques like gradient boosting or neural networks to learn from complex datasets. Libraries such as XGBoost, TensorFlow, or PyTorch are commonly employed.
14. Tesla’s Autopilot System
Tesla’s Autopilot system is a highly advanced driver-assistance system that uses machine learning to enable its vehicles to steer, accelerate, and brake automatically under the driver’s supervision.
The system relies on a combination of sensors, cameras, and algorithms to interpret the vehicle’s surroundings, make real-time driving decisions, and learn from diverse driving conditions.
- Objective : To reduce the likelihood of accidents by assisting drivers with advanced safety features and optimizing driving decisions and to improve the system toward achieving full self-driving capabilities eventually.
- How to build : Key features such as object detection, lane marking recognition, and vehicle trajectory predictions are derived from the raw data to train models. Convolutional neural networks (CNNs) are employed to process and interpret the sensory input. Tesla also uses over-the-air software updates to deploy new features based on aggregated fleet learning.
15. GE Healthcare Image Analysis
GE Healthcare leverages machine learning to enhance the analysis of medical images to improve the accuracy and efficiency of diagnostics across various medical fields.
This technology allows for more precise identification and evaluation of anomalies in medical imaging, such as MRI, CT scans, and X-rays.
- Objective : To detect and classify anomalies in medical images that might be too subtle for human eyes. It also accelerates image analysis and shortens diagnosis time, which is important for providing quick patient care.
- How to build : Collect large sets of annotated medical images, which include a variety of imaging types and conditions. Label these images accurately to serve as a training set. Extract relevant features from the images, such as texture, shape, intensity, and spatial patterns of the imaged tissues or organs. Convolutional neural networks (CNNs) are particularly effective due to their ability to pick up on spatial hierarchies and patterns and should be deployed. Be sure to rigorously test the models against new, unseen images to ensure they generalize well and maintain high accuracy and reliability.
Frequently Asked Questions
What skills can I learn from machine learning case studies that are applicable to data science jobs?
Employers look for candidates with a mix of technical and soft skills.
Some competencies you can develop through exploring and analyzing case studies are problem-solving, critical thinking, better data interpretation, an understanding of commonly used ML algorithms, and coding skills in relevant languages.
We recommend that you work on these problems in addition to reading up on them. You can use public datasets provided by Kaggle or the UCI Machine Learning Repository, or Interview Query’s storehouse of takehome assignments.
To help you get started, we’ve created a comprehensive guide on how to start a data analytics project.
Are there beginner-friendly case studies in machine learning?
The examples we’ve provided in the list above are a mix of beginner-friendly and advanced ML case studies.
There are more beginner-friendly cases you can explore on Kaggle, such as the iris flower classification, Titanic survival predictions, and basic revenue forecasting for e-commerce.
We’ve also compiled a list of data science case studies categorized by difficulty level.
How do I use machine learning case studies to craft a better resume or portfolio?
You can tailor your resume to highlight ML case studies or projects you’ve worked on that match the skills and industry you’re applying to.
For each project, provide a concise title and description of what the project entailed, the tools and techniques used, and its outcomes.
Wherever possible, quantify the impact of the project, for example, the model’s accuracy. Use action verbs like “developed,” “built,” “implemented,” or “analyzed” to make the description more persuasive.
Lastly, rehearse how you would present your project in an interview, an often overlooked step in getting selected. On a related note, you can try a mock interview with us to test your current preparedness for a project presentation.
To wrap up, staying updated on, exploring, and implementing machine learning case studies is a clever strategy to showcase your hands-on experience and set you apart in a competitive job market.
Plan your interview strategy, considering the perspective of your desired future employer and tailoring your project selection to the skills they want to see.
Here at Interview Query, we offer multiple learning paths, interview questions, and both paid and free resources you can use to upskill for your dream role. You can access specific interview questions, participate in mock interviews, and receive expert coaching.
If you have a specific company in mind to apply to, check out our company interview guide section, where we have detailed company and role-specific preparation guides. We have guides for all the companies that are mentioned in our case study list, including Uber, Tesla, Amazon, Google, and Netflix.
We hope this discussion has been helpful. If you have any other questions, don’t hesitate to reach out to us or explore our blog .
- Open access
- Published: 09 September 2022
Machine learning in project analytics: a data-driven framework and case study
- Shahadat Uddin 1 ,
- Stephen Ong 1 &
- Haohui Lu 1
Scientific Reports volume 12, Article number: 15252 (2022)
- Applied mathematics
- Computational science
The analytic procedures incorporated to facilitate the delivery of projects are often referred to as project analytics. Existing techniques focus on retrospective reporting and understanding the underlying relationships to make informed decisions. Although machine learning algorithms have been widely used in addressing problems within various contexts (e.g., streamlining the design of construction projects), limited studies have evaluated pre-existing machine learning methods within the delivery of construction projects. Due to this, the current research aims to contribute further to this convergence between artificial intelligence and the execution of construction projects through the evaluation of a specific set of machine learning algorithms. This study proposes a machine learning-based data-driven research framework for addressing problems related to project analytics. It then illustrates an example of the application of this framework. In this illustration, existing data from an open-source data repository on construction projects and cost overrun frequencies was studied, and several machine learning models (from Python’s Scikit-learn package) were tested and evaluated on it. The data consisted of 44 independent variables (from materials to labour and contracting) and one dependent variable (project cost overrun frequency), which has been categorised for processing under several machine learning models. These models include support vector machine, logistic regression, k-nearest neighbour, random forest, stacking (ensemble) model and artificial neural network. Feature selection and evaluation methods, including the Univariate feature selection, Recursive feature elimination, SelectFromModel and confusion matrix, were applied to determine the most accurate prediction model. This study also discusses the generalisability of using the proposed research framework in other research contexts within the field of project management.
The proposed framework, its illustration in the context of construction projects and its potential to be adopted in different contexts will significantly contribute to project practitioners, stakeholders and academics in addressing many project-related issues.
Introduction
Successful projects require the presence of appropriate information and technology 1 . Project analytics provides an avenue for informed decisions to be made through the lifecycle of a project. Project analytics applies various statistics (e.g., earned value analysis or Monte Carlo simulation) among other models to make evidence-based decisions. They are used to manage risks as well as project execution 2 . There is a tendency for project analytics to be employed due to other additional benefits, including an ability to forecast and make predictions, benchmark with other projects, and determine trends such as those that are time-dependent 3 , 4 , 5 . There has been increasing interest in project analytics and how current technology applications can be incorporated and utilised 6 . Broadly, project analytics can be understood on five levels 4 . The first is descriptive analytics which incorporates retrospective reporting. The second is known as diagnostic analytics , which aims to understand the interrelationships and underlying causes and effects. The third is predictive analytics which seeks to make predictions. Subsequent to this is prescriptive analytics , which prescribes steps following predictions. Finally, cognitive analytics aims to predict future problems. The first three levels can be applied with ease with the help of technology. The fourth and fifth steps require data that is generally more difficult to obtain as they may be less accessible or unstructured. Further, although project key performance indicators can be challenging to define 2 , identifying common measurable features facilitates this 7 . It is anticipated that project analytics will continue to experience development due to its direct benefits to the major baseline measures focused on productivity, profitability, cost, and time 8 . The nature of project management itself is fluid and flexible, and project analytics allows an avenue for which machine learning algorithms can be applied 9 .
Machine learning within the field of project analytics falls into the category of cognitive analytics, which deals with problem prediction. Generally, machine learning explores the possibilities of computers to improve processes through training or experience 10 . It can also build on the pre-existing capabilities and techniques prevalent within management to accomplish complex tasks 11 . Due to its practical use and broad applicability, recent developments have led to the invention and introduction of newer and more innovative machine learning algorithms and techniques. Artificial intelligence, for instance, allows for software to develop computer vision, speech recognition, natural language processing, robot control, and other applications 10 . Specific to the construction industry, it is now used to monitor construction environments through a virtual reality and building information modelling replication 12 or risk prediction 13 . Within other industries, such as consumer services and transport, machine learning is being applied to improve consumer experiences and satisfaction 10 , 14 and reduce the human errors of traffic controllers 15 . Recent applications and development of machine learning broadly fall into the categories of classification, regression, ranking, clustering, dimensionality reduction and manifold learning 16 . Current learning models include linear predictors, boosting, stochastic gradient descent, kernel methods, and nearest neighbour, among others 11 . Newer and more applications and learning models are continuously being introduced to improve accessibility and effectiveness.
Specific to the management of construction projects, other studies have also been made to understand how copious amounts of project data can be used 17 , the importance of ontology and semantics throughout the nexus between artificial intelligence and construction projects 18 , 19 as well as novel approaches to the challenges within this integration of fields 20 , 21 , 22 . There have been limited applications of pre-existing machine learning models on construction cost overruns. They have predominantly focussed on applications to streamline the design processes within construction 23 , 24 , 25 , 26 , and those which have investigated project profitability have not incorporated the types and combinations of algorithms used within this study 6 , 27 . Furthermore, existing applications have largely been skewed towards one type or another 28 , 29 .
In addition to the frequently used earned value method (EVM), researchers have been applying many other powerful quantitative methods to address a diverse range of project analytics research problems over time. Examples of those methods include time series analysis, fuzzy logic, simulation, network analytics, and network correlation and regression. Time series analysis uses longitudinal data to forecast an underlying project's future needs, such as the time and cost 30 , 31 , 32 . A few other methods are combined with EVM to find a better solution for the underlying research problems. For example, Narbaev and De Marco 33 integrated growth models and EVM for forecasting project cost at completion using data from construction projects. For analysing the ongoing progress of projects having ambiguous or linguistic outcomes, fuzzy logic is often combined with EVM 34 , 35 , 36 . Yu et al. 36 applied fuzzy theory and EVM for schedule management. Ponz-Tienda et al. 35 found that using fuzzy arithmetic on EVM provided more objective results in uncertain environments than the traditional methodology. Bonato et al. 37 integrated EVM with Monte Carlo simulation to predict the final cost of three engineering projects. Batselier and Vanhoucke 38 compared the accuracy of the project time and cost forecasting using EVM and simulation. They found that the simulation results supported findings from the EVM. Network methods are primarily used to analyse project stakeholder networks. Yang and Zou 39 developed a social network theory-based model to explore stakeholder-associated risks and their interactions in complex green building projects. Uddin 40 proposed a social network analytics-based framework for analysing stakeholder networks. Ong and Uddin 41 further applied network correlation and regression to examine the co-evolution of stakeholder networks in collaborative healthcare projects.
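For readers unfamiliar with EVM, the standard earned value indicators can be sketched as follows. The formulas are the conventional ones (cost and schedule variance, CPI, SPI, and the estimate at completion under the assumption that current cost performance continues). The function name and the project figures are hypothetical and are not taken from the studies cited above.

```python
def earned_value_metrics(bac, pv, ev, ac):
    """Standard EVM indicators.

    bac: budget at completion, pv: planned value,
    ev: earned value, ac: actual cost.
    """
    cpi = ev / ac  # cost performance index (< 1 means over budget)
    spi = ev / pv  # schedule performance index (< 1 means behind schedule)
    return {
        "cost_variance": ev - ac,
        "schedule_variance": ev - pv,
        "cpi": cpi,
        "spi": spi,
        # One common forecast: assumes the current CPI holds for the rest of the project.
        "estimate_at_completion": bac / cpi,
    }

# Hypothetical project status at a review point.
metrics = earned_value_metrics(bac=1000.0, pv=500.0, ev=400.0, ac=500.0)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```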
Although many other methods have already been used, as evident in the current literature, machine learning methods or models are yet to be adopted for addressing research problems related to project analytics. The current investigation is derived from the cognitive analytics component of project analytics. It proposes an approach for determining hidden information and patterns to assist with project delivery. Figure 1 illustrates a tree diagram showing different levels of project analytics and their associated methods from the literature. It also illustrates existing methods within the cognitive component of project analytics to where the application of machine learning is situated contextually.
A tree diagram of different project analytics methods. It also shows where the current study belongs. Although earned value analysis is commonly used in project analytics, we do not include it in this figure since it is used in the first three levels of project analytics.
Machine learning models have several notable advantages over traditional statistical methods that play a significant role in project analytics 42 . First, machine learning algorithms can quickly identify trends and patterns by simultaneously analysing a large volume of data. Second, they are more capable of continuous improvement: they can improve their accuracy and efficiency for decision-making through subsequent training on new data. Third, machine learning algorithms efficiently handle multi-dimensional and multi-variety data in dynamic or uncertain environments. Fourth, they are well suited to automating various decision-making tasks. For example, machine learning-based sentiment analysis can easily identify a negative tweet and automatically take further necessary steps. Last but not least, machine learning has been helpful across various industries, from defence to education 43 . Current research has seen the development of several different branches of artificial intelligence (including robotics, automated planning and scheduling and optimisation) within safety monitoring, risk prediction, cost estimation and so on 44 . This has progressed from the applications of regression on project cost overruns 45 to the current deep-learning implementations within the construction industry 46 . Despite this, the uses remain largely limited and are still in a developmental state. The benefits of applications are noted, such as optimising and streamlining existing processes; however, high initial costs form a barrier to accessibility 44 .
The primary goal of this study is to demonstrate the applicability of different machine learning algorithms in addressing problems related to project analytics. Limitations in applying machine learning algorithms within the context of construction projects have been explored previously. However, preceding research has mainly been conducted to improve the design processes specific to construction 23 , 24 , and those investigating project profitability have not incorporated the types and combinations of algorithms used within this study 6 , 27 . For instance, preceding research has incorporated a different combination of machine-learning algorithms in research on predicting construction delays 47 . This study first proposed a machine learning-based data-driven research framework for project analytics to contribute to the proposed study direction. It then applied this framework to a case study of construction projects. Although there are three different types of machine learning algorithms (supervised, unsupervised and semi-supervised), the supervised machine learning models are most commonly used due to their efficiency and effectiveness in addressing many real-world problems 48 . Therefore, we will use machine learning to represent supervised machine learning throughout the rest of this article. The contribution of this study is significant in that it considers the applications of machine learning within project management. Project management is often thought of as being very fluid in nature, and because of this, applications of machine learning are often more difficult 9 , 49 . Further to this, existing implementations have largely been limited to safety monitoring, risk prediction, cost estimation and so on 44 .
Through the evaluation of machine-learning applications, this study further demonstrates a case study for which algorithms can be used to consider and model the relationship between project attributes and a project performance measure (i.e., cost overrun frequency).
Machine learning-based framework for project analytics
When and why machine learning for project analytics.
Machine learning models are typically used for research problems that involve predicting the classification outcome of a categorical dependent variable. Therefore, they can be applied in the context of project analytics if the underlying objective variable is a categorical one. If that objective variable is non-categorical, it must first be converted into a categorical variable. For example, if the objective or target variable is the project cost, we can convert this variable into a categorical variable by taking only two possible values. The first value would be 0 to indicate a low-cost project, and the second could be 1 for showing a high-cost project. The average or median cost value for all projects under consideration can be considered for splitting project costs into low-cost and high-cost categories.
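The median-based conversion described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name binarise_by_median and the cost values are hypothetical, and the choice to place values equal to the median in the low-cost class is an assumption of the sketch.

```python
from statistics import median

def binarise_by_median(costs):
    """Label each cost 0 (low-cost) if it is at or below the median, else 1 (high-cost)."""
    threshold = median(costs)
    return [0 if cost <= threshold else 1 for cost in costs]

# Hypothetical project costs in arbitrary units, for illustration only.
project_costs = [120, 450, 300, 980, 210, 640]
labels = binarise_by_median(project_costs)
print(labels)
```

Using the mean instead of the median would only require swapping the threshold function; the median is less sensitive to a few very expensive projects.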
For data-driven decision-making, machine learning models are advantageous. This is because traditional statistical methods (e.g., ordinary least square (OLS) regression) make assumptions about the underlying research data to produce explicit formulae for the objective target measures. Unlike these statistical methods, machine learning algorithms figure out patterns on their own directly from the data. For instance, for a non-linear but separable dataset, an OLS regression model will not be the right choice due to its assumption that the underlying data must be linear. However, a machine learning model can easily separate the dataset into the underlying classes. Figure 2 (a) presents a situation where machine learning models perform better than traditional statistical methods.
( a ) An illustration showing the superior performance of machine learning models compared with the traditional statistical models using an abstract dataset with two attributes (X 1 and X 2 ). The data points within this abstract dataset consist of two classes: one represented with a transparent circle and the second class illustrated with a black-filled circle. These data points are non-linear but separable. Traditional statistical models (e.g., ordinary least square regression) will not accurately separate these data points. However, any machine learning model can easily separate them without making errors; and ( b ) Traditional programming versus machine learning.
Similarly, machine learning models are compelling if the underlying research dataset has many attributes or independent measures. Such models can identify features that significantly contribute to the corresponding classification performance regardless of their distributions or collinearity. Traditional statistical methods are prone to biased results when the independent variables are correlated. Current machine learning-based studies specific to project analytics are limited. Despite this, there have been tangential studies on the use of artificial intelligence to improve cost estimations as well as risk prediction 44 . Additionally, models have been implemented in the optimisation of existing processes 50 .
Machine learning versus traditional programming
Machine learning can be thought of as a process of teaching a machine (i.e., computers) to learn from data and adjust or apply its present knowledge when exposed to new data 42 . It is a type of artificial intelligence that enables computers to learn from examples or experiences. Traditional programming requires some input data and some logic in the form of code (program) to generate the output. Unlike traditional programming, the input data and their corresponding output are fed to an algorithm to create a program in machine learning. This resultant program can capture powerful insights into the data pattern and can be used to predict future outcomes. Figure 2 (b) shows the difference between machine learning and traditional programming.
Proposed machine learning-based framework
Figure 3 illustrates the proposed machine learning-based research framework of this study. The framework starts with breaking the project research dataset into the training and test components. As mentioned in the previous section, the research dataset may have many categorical and/or nominal independent variables, but its single dependent variable must be categorical. Although there is no strict rule for this split, the training data size is generally more than or equal to 50% of the original dataset 48 .
The proposed machine learning-based data-driven framework.
Machine learning algorithms can handle variables that have only numerical outcomes. So, when one or more of the underlying categorical variables have a textual or string outcome, we must first convert them into the corresponding numerical values. Suppose a variable can take only three textual outcomes (low, medium and high). In that case, we could consider, for example, 1 to represent low , 2 to represent medium , and 3 to represent high . Other statistical techniques, such as the RIDIT (relative to an identified distribution) scoring 51 , can also be used to convert ordered categorical measurements into quantitative ones. RIDIT is a parametric approach that uses probabilistic comparison to determine the statistical differences between ordered categorical groups. The remaining components of the proposed framework have been briefly described in the following subsections.
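The simple integer coding of an ordered textual variable described above might look as follows in Python. The mapping mirrors the low, medium and high example in the text; the function name and the sample responses are hypothetical, and RIDIT scoring is not shown.

```python
# Mapping from ordered text levels to integers, following the example in the text.
LEVEL_TO_CODE = {"low": 1, "medium": 2, "high": 3}

def encode_ordinal(values, mapping=LEVEL_TO_CODE):
    """Replace each textual level with its integer code.

    Input is normalised for case and surrounding whitespace;
    a level missing from the mapping raises KeyError.
    """
    return [mapping[value.strip().lower()] for value in values]

# Hypothetical survey responses, for illustration only.
responses = ["Low", "high", "medium", "low"]
encoded = encode_ordinal(responses)
print(encoded)
```

Failing loudly on an unknown level is a deliberate choice in this sketch, since silently assigning a default code would distort the ordering.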
Model-building procedure
The next step of the framework is to follow the model-building procedure to develop the desired machine learning models using the training data. The first step of this procedure is to select suitable machine learning algorithms or models. Among the available machine learning algorithms, the commonly used ones are support vector machine, logistic regression, k -nearest neighbours, artificial neural network, decision tree and random forest 52 . One can also select an ensemble machine learning model as the desired algorithm. An ensemble machine learning method uses multiple algorithms or the same algorithm multiple times to achieve better predictive performance than could be obtained from any of the constituent learning models alone 52 . Three widely used ensemble approaches are bagging, boosting and stacking. In bagging, the research dataset is divided into different equal-sized subsets. The underlying machine learning algorithm is then applied to these subsets for classification. In boosting, a random sample of the dataset is selected and then fitted and trained sequentially with different models to compensate for the weakness observed in the immediately used model. Stacking combines different weak machine learning models in a heterogeneous way to improve the predictive performance. For example, the random forest algorithm is an ensemble of different decision tree models 42 .
Second, each selected machine learning model will be processed through the k -fold cross-validation approach to improve predictive efficiency. In k -fold cross-validation, the training data is divided into k folds. In each iteration, (k-1) folds are used to train the selected machine learning models, and the remaining fold is used for validation purposes. This iteration process continues until each of the k folds has been used once for validation. The final predictive efficiency of the trained models is based on the average values from the outcomes of these iterations. In addition to this average value, researchers use the standard deviation of the results from different iterations as the predictive training efficiency. Supplementary Fig 1 shows an illustration of the k -fold cross-validation.
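The k -fold procedure can be sketched in plain Python as below. This is an illustrative sketch rather than the implementation used in the study: the helper names, the toy labels and the majority-class baseline that stands in for a real model are all assumptions, and the data are not shuffled before splitting.

```python
from collections import Counter
from statistics import mean, pstdev

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs so that each fold is used once for validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_validate(n_samples, k, fit_and_score):
    """Return the mean and standard deviation of the per-fold scores."""
    scores = [fit_and_score(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return mean(scores), pstdev(scores)

# Hypothetical binary labels, for illustration only.
toy_labels = [0, 1, 1, 0, 1, 1, 0, 1, 1, 1]

def majority_baseline(train_idx, val_idx):
    """Stand-in for a real model: predict the majority training class, return accuracy."""
    majority = Counter(toy_labels[i] for i in train_idx).most_common(1)[0][0]
    return sum(toy_labels[i] == majority for i in val_idx) / len(val_idx)

mean_acc, std_acc = cross_validate(len(toy_labels), 5, majority_baseline)
print(mean_acc, std_acc)
```

In practice the callback would train one of the selected algorithms on the training indices and score it on the validation indices.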
Third, most machine learning algorithms require pre-defined values for their different parameters (hyperparameters); the process of choosing these values is known as hyperparameter tuning. The settings of these parameters play a vital role in the achieved performance of the underlying algorithm. For a given machine learning algorithm, the optimal value for these parameters can be different from one dataset to another. The same algorithm needs to run multiple times with different parameter values to find its optimal parameter value for a given dataset. Many algorithms are available in the literature, such as the Grid search 53 , to find the optimal parameter value. In the Grid search, hyperparameters are divided into discrete grids. Each grid point represents a specific combination of the underlying model parameters. The parameter values of the point that results in the best performance are the optimal parameter values 53 .
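The exhaustive search over grid points can be sketched as follows. The function grid_search, the parameter names n_trees and max_depth, and the scoring function toy_evaluate are hypothetical; in a real analysis the scoring function would train the model with the given settings and return its cross-validated performance.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Evaluate every combination in param_grid and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))  # one grid point
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_evaluate(params):
    """Hypothetical stand-in for 'train with these settings and return validation accuracy'."""
    return 1.0 - abs(params["n_trees"] - 100) / 1000 - abs(params["max_depth"] - 5) / 100

# Hypothetical grid: 3 x 3 = 9 grid points.
grid = {"n_trees": [50, 100, 200], "max_depth": [3, 5, 10]}
best_params, best_score = grid_search(grid, toy_evaluate)
print(best_params, best_score)
```

The number of grid points grows multiplicatively with each added hyperparameter, which is why the grid is usually kept coarse.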
Testing of the developed models and reporting results
Once the desired machine learning models have been developed using the training data, they need to be tested using the test data. The underlying trained model is then applied to predict its dependent variable for each data instance. Therefore, for each data instance, two categorical outcomes will be available for its dependent variable: one predicted using the underlying trained model, and the other is the actual category. These predicted and actual categorical outcome values are used to report the results of the underlying machine learning model.
The fundamental tool to report results from machine learning models is the confusion matrix, which consists of four integer values 48 . The first value represents the number of positive cases correctly identified as positive by the underlying trained model (true-positive). The second value indicates the number of positive instances incorrectly identified as negative (false-negative). The third value represents the number of negative cases incorrectly identified as positive (false-positive). Finally, the fourth value indicates the number of negative instances correctly identified as negative (true-negative). Researchers also use a few performance measures based on the four values of the confusion matrix to report machine learning results. The most used measure is accuracy, which is the ratio of the number of correct predictions (true-positive + true-negative) to the total number of data instances (sum of all four values of the confusion matrix). Other measures commonly used to report machine learning results are precision, recall and F1-score. Precision refers to the ratio between true-positives and the total number of positive predictions (i.e., true-positive + false-positive), often used to indicate the quality of a positive prediction made by a model 48 . Recall, also known as the true-positive rate, is calculated by dividing true-positive by the number of data instances that should have been predicted as positive (i.e., true-positive + false-negative). F1-score is the harmonic mean of the last two measures, i.e., [(2 × Precision × Recall)/(Precision + Recall)], and the error-rate equals (1-Accuracy).
Another essential tool for reporting machine learning results is variable or feature importance, which identifies a list of independent variables (features) contributing most to the classification performance. The importance of a variable refers to how much a given machine learning algorithm uses that variable in making accurate predictions 54 . The widely used technique for identifying variable importance is the principal component analysis. It reduces the dimensionality of the data while minimising information loss, which eventually increases the interpretability of the underlying machine learning outcome. It further helps in finding the important features in a dataset as well as plotting them in 2D and 3D 54 .
Ethical approval
Ethical approval is not required for this study since this study used publicly available data for research investigation purposes. All research was performed in accordance with relevant guidelines/regulations.
Informed consent
Due to the nature of the data sources, informed consent was not required for this study.
Case study: an application of the proposed framework
This section illustrates an application of this study’s proposed framework (Fig. 3 ) in a construction project context. We will apply this framework in classifying projects into two classes based on their cost overrun experience. Projects that rarely experience a cost overrun belong to the first class (Rare class). The second class indicates those projects that often experience a cost overrun (Often class). In doing so, we consider a list of independent variables or features.
Data source
The research dataset is taken from an open-source data repository, Kaggle 55 . This survey-based research dataset was collected to explore the causes of the project cost overrun in Indian construction projects 45 , consisting of 44 independent variables or features and one dependent variable. The independent variables cover a wide range of cost overrun factors, from materials and labour to contractual issues and the scope of the work. The dependent variable is the frequency of experiencing project cost overrun (rare or often). The dataset size is 139; 65 belong to the rare class, and the remaining 74 are from the often class. We converted each categorical variable with a textual or string outcome into an appropriate numerical value range to prepare the dataset for machine learning analysis. For example, we used 1 and 2 to represent rare and often class, respectively. The correlation matrix among the 44 features is presented in Supplementary Fig 2 .
Machine learning algorithms
This study considered four machine learning algorithms to explore the causes of project cost overrun using the research dataset mentioned above. They are support vector machine, logistic regression, k- nearest neighbours and random forest.
Support vector machine (SVM) is a process applied to understand data. For instance, if one wants to determine and interpret which projects are classified as programmatically successful through the processing of precedent data information, SVM would provide a practical approach for prediction. SVM functions by assigning labels to objects 56 . The comparison attributes are used to cluster these objects into different groups or classes by maximising their marginal distances and minimising the classification errors. The attributes are plotted multi-dimensionally, allowing a separation line, known as a hyperplane , see supplementary Fig 3 (a), to distinguish between underlying classes or groups 52 . Support vectors are the data points that lie closest to the decision boundary on both sides. In Supplementary Fig 3 (a), they are the circles (both transparent and shaded ones) close to the hyperplane. Support vectors play an essential role in deciding the position and orientation of the hyperplane. Various computational methods, including a kernel function to create more derived attributes, are applied to accommodate this process 56 . Support vector machines are not only limited to binary classes but can also be generalised to a larger variety of classifications. This is accomplished through the training of separate SVMs 56 .
Logistic regression (LR) builds on the linear regression model and predicts the outcome of a dichotomous variable 57 ; for example, the presence or absence of an event. It uses a scatterplot to understand the connection between a dependent variable and one or more independent variables (see Supplementary Fig 3 (b)). The LR model fits the data to a sigmoidal curve instead of fitting it to a straight line. The natural logarithm is considered when developing the model. It provides a value between 0 and 1 that is interpreted as the probability of class membership. Best estimates are determined by developing from approximate estimates until a level of stability is reached 58 . Generally, LR offers a straightforward approach for determining and observing interrelationships. It is more efficient compared to ordinary regressions 59 .
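How the sigmoidal curve turns a linear combination of features into a value between 0 and 1 can be shown with a short sketch. The weights, bias and feature values below are hypothetical and are not fitted to any data; the fitting step itself is omitted.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(features, weights, bias):
    """Probability of the positive class for one instance under a given linear model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients and feature values, for illustration only.
probability = predict_proba(features=[3.0, 1.0], weights=[0.8, -0.5], bias=-1.0)
predicted_class = 1 if probability >= 0.5 else 0
print(probability, predicted_class)
```

The 0.5 cut-off used to turn the probability into a class label is a common default, not a requirement of the method.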
k -nearest neighbours (KNN) algorithm uses a process that plots prior information and applies a specific sample size ( k ) to the plot to determine the most likely scenario 52 . This method finds the nearest training examples using a distance measure. The final classification is made by counting the most common scenario or votes present within the specified sample. As illustrated in Supplementary Fig 3 (c), the four nearest neighbours in the small circle are three grey squares and one white square. The majority class is grey. Hence, KNN will predict the instance (i.e., Χ ) as grey. On the other hand, if we look at the larger circle of the same figure, the nearest neighbours consist of ten white squares and four grey squares. The majority class is white. Thus, KNN will classify the instance as white. KNN’s advantage lies in its ability to produce a simplified result and handle missing data 60 . In summary, KNN utilises similarities (as well as differences) and distances in the process of developing models.
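The voting step can be sketched as below, using Euclidean distance as the distance measure (an assumption; other measures are possible). The points are hypothetical and smaller in number than those in the supplementary figure, but they reproduce the same effect: the predicted class changes from grey to white as k grows.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Return the majority label among the k training points closest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2D points: three grey squares near the query, six white ones further away.
points = [(1, 0), (0, 1), (-1, 0),
          (0, -1.5), (2, 2), (-2, 2), (2, -2), (-2, -2), (3, 0)]
labels = ["grey"] * 3 + ["white"] * 6
query = (0, 0)

small_k = knn_predict(points, labels, query, k=4)  # 3 grey vs 1 white
large_k = knn_predict(points, labels, query, k=9)  # 3 grey vs 6 white
print(small_k, large_k)
```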
Random forest (RF) is a machine learning process that consists of many decision trees. A decision tree is a tree-like structure where each internal node represents a test on the input attribute. It may have multiple internal nodes at different levels, and the leaf or terminal nodes represent the decision outcomes. It produces a classification outcome for a distinctive and separate part of the input vector. For numerical outcomes, it considers the average value, and for discrete outcomes, it considers the number of votes 52 . Supplementary Fig 3 (d) shows three decision trees to illustrate the function of a random forest. The outcomes from trees 1, 2 and 3 are class B, class A and class A, respectively. According to the majority vote, the final prediction will be class A. Because it considers specific attributes, it can have a tendency to emphasise specific attributes over others, which may result in some attributes being unevenly weighted 52 . Advantages of the random forest include its ability to handle multidimensionality and multicollinearity in data despite its sensitivity to sampling design.
Artificial neural network (ANN) simulates the way in which human brains work. This is accomplished by modelling logical propositions and incorporating weighted inputs, a transfer function and one output 61 (Supplementary Fig 3 (e)). It is advantageous because it can be used to model non-linear relationships and handle multivariate data 62 . ANN learns through three major avenues. These include error-back propagation (supervised), the Kohonen (unsupervised) and the counter-propagation ANN (supervised) 62 . There are two types of ANN: supervised and unsupervised. ANN has been used in a myriad of applications ranging from pharmaceuticals 61 to electronic devices 63 . It also possesses great levels of fault tolerance 64 and learns by example and through self-organisation 65 .
Ensemble techniques are a type of machine learning methodology in which numerous basic classifiers are combined to generate an optimal model 66 . An ensemble technique considers many models and combines them to form a single model, and the final model will eliminate the weaknesses of each individual learner, resulting in a powerful model that will improve model performance. The stacking model is a general architecture comprised of two classifier levels: base classifier and meta-learner 67 . The base classifiers are trained with the training dataset, and a new dataset is constructed for the meta-learner. Afterwards, this new dataset is used to train the meta-classifier. This study uses four models (SVM, LR, KNN and RF) as base classifiers and LR as a meta learner, as illustrated in Supplementary Fig 3 (f).
Feature selection
The process of selecting the optimal feature subset that significantly influences the predicted outcomes, which may be efficient to increase model performance and save running time, is known as feature selection. This study considers three different feature selection approaches. They are the Univariate feature selection (UFS), Recursive feature elimination (RFE) and SelectFromModel (SFM) approach. UFS examines each feature separately to determine the strength of its relationship with the response variable 68 . This method is straightforward to use and comprehend and helps acquire a deeper understanding of data. In this study, we calculate the chi-square values between features. RFE is a type of backwards feature elimination in which the model is fit first using all features in the given dataset and then removing the least important features one by one 69 . After that, the model is refit until the desired number of features is left over, which is determined by the parameter. SFM is used to choose effective features based on the feature importance of the best-performing model 70 . This approach selects features by establishing a threshold based on feature significance as indicated by the model on the training set. Those characteristics whose feature importance is more than the threshold are chosen, while those whose feature importance is less than the threshold are deleted. In this study, we apply SFM after we compare the performance of four machine learning methods. Afterwards, we train the best-performing model again using the features from the SFM approach.
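The threshold rule behind the SFM approach can be illustrated with a short sketch. This is not the Scikit-learn implementation: the function name, the feature names and the importance values are hypothetical, and using the mean importance as the default threshold is an assumption of the sketch.

```python
def select_by_importance(importances, threshold=None):
    """Keep the features whose importance is greater than the threshold.

    importances maps feature name -> importance score from a fitted model.
    If no threshold is given, the mean importance is used (an assumption of this sketch).
    """
    if threshold is None:
        threshold = sum(importances.values()) / len(importances)
    return [name for name, score in importances.items() if score > threshold]

# Hypothetical importance scores, for illustration only.
importances = {
    "feature_a": 0.35,
    "feature_b": 0.25,
    "feature_c": 0.18,
    "feature_d": 0.12,
    "feature_e": 0.10,
}
selected = select_by_importance(importances)
print(selected)
```

The model would then be retrained using only the selected features, as described for the best-performing model above.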
Findings from the case study
We split the dataset into 70:30 for training and test purposes of the selected machine learning algorithms. We used Python’s Scikit-learn package for implementing these algorithms 70 . Using the training data, we first developed six models based on these six algorithms. We used fivefold cross-validation, with accuracy as the measure to be improved. Then, we applied these models to the test data. We also executed all required hyperparameter tunings for each algorithm for the best possible classification outcome. Table 1 shows the performance outcomes for each algorithm during the training and test phase. The hyperparameter settings for each algorithm have been listed in Supplementary Table 1 .
As revealed in Table 1 , random forest outperformed the other three algorithms in terms of accuracy for both the training and test phases. It showed an accuracy of 78.14% and 77.50% for the training and test phases, respectively. The second-best performer in the training phase is k- nearest neighbours (76.98%), and for the test phase, it is the support vector machine, k- nearest neighbours and artificial neural network (72.50%).
Since random forest showed the best performance, we explored further based on this algorithm. We applied the three approaches (UFS, RFE and SFM) for feature optimisation on the random forest. The result is presented in Table 2 . SFM shows the best outcome among these three approaches. Its accuracy is 85.00%, whereas the accuracies of UFS and RFE are 77.50% and 72.50%, respectively. As can be seen in Table 2 , the accuracy for the testing phase increases from 77.50% in Table 1 (b) to 85.00% with the SFM feature optimisation. Table 3 shows the 19 selected features from the SFM output. Out of 44 features, SFM found that 19 of them play a significant role in predicting the outcomes.
Further, Fig. 4 illustrates the confusion matrix when the random forest model with the SFM feature optimiser was applied to the test data. There are 18 true-positive, five false-negative, one false-positive and 16 true-negative cases. Therefore, the accuracy for the test phase is (18 + 16)/(18 + 5 + 1 + 16) = 85.00%.
Confusion matrix results based on the random forest model with the SFM feature optimiser (1 for the rare class and 2 for the often class).
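The performance measures defined earlier can be computed directly from the four confusion-matrix counts reported above (18 true-positive, five false-negative, one false-positive and 16 true-negative). The sketch below uses those counts as stated; the function name is hypothetical, and the values reported in the tables may follow a different averaging convention across classes.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute common performance measures from the four confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)  # quality of the positive predictions
    recall = tp / (tp + fn)     # true-positive rate
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "error_rate": 1 - accuracy,
    }

# Counts taken from the confusion matrix of the case study (Fig. 4).
metrics = classification_metrics(tp=18, fn=5, fp=1, tn=16)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sketch does not guard against division by zero, which would occur if a model made no positive predictions or the data contained no positive instances.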
Figure 5 illustrates the top-10 most important features or variables based on the random forest algorithm with the SFM optimiser. We used feature importance based on the mean decrease in impurity in identifying this list of important variables. Mean decrease in impurity computes each feature’s importance as the sum over the number of splits that include the feature in proportion to the number of samples it splits 71 . According to this figure, the delays in decision making attribute contributed most to the classification performance of the random forest algorithm, followed by cash flow problem and construction cost underestimation attributes. The current construction project literature also highlighted these top-10 factors as significant contributors to project cost overrun. For example, using construction project data from Jordan, Al-Hazim et al. 72 ranked 20 causes for cost overrun, including causes similar to these causes.
Feature importance (top-10 out of 19) based on the random forest model with the SFM feature optimiser.
Further, we conduct a sensitivity analysis of the model’s ten most important features (from Fig. 5 ) to explore how a change in each feature affects the cost overrun. We utilise the partial dependence plot (PDP), which is a typical visualisation tool for non-parametric models 73 , to display this analysis’s outcomes. A PDP can demonstrate whether the relation between the target and a feature is linear, monotonic, or more complicated. The result of the sensitivity analysis is presented in Fig. 6 . For the ‘delays in decisions making’ attribute, the PDP shows that the probability is below 0.4 until the rating value is three and increases after. A higher value for this attribute indicates a higher risk of cost overrun. On the other hand, no significant differences can be seen in the remaining nine features when their values change.
The result of the sensitivity analysis from the partial dependency plot tool for the ten most important features.
Summary of the case study
We illustrated an application of the proposed machine learning-based research framework in classifying construction projects. RF showed the highest accuracy in predicting the test dataset. For a new data instance that has values for its 19 features but no known classification, RF can identify its class ( rare or often ) correctly with a probability of 85.00%. If more data is provided to the machine learning algorithms, in addition to the 139 instances of the case study, then their accuracy and efficiency in making project classification will improve with subsequent training. For example, if we provide 100 more data instances, these algorithms will have an additional 70 instances for training with a 70:30 split. This continuous improvement facility puts the machine learning algorithms in a superior position over other traditional methods. In the current literature, some studies explore the factors contributing to project delay or cost overrun. In most cases, they applied factor analysis or other related statistical methods for research data analysis 72 , 74 , 75 . In addition to identifying important attributes, the proposed machine learning-based framework identified the ranking of factors and how eliminating less important factors affects the prediction accuracy when applied to this case study.
We shared the Python software developed to implement the four machine learning algorithms considered in this case study using GitHub 76 , a software hosting internet site. A user-friendly version of this software can be accessed at https://share.streamlit.io/haohuilu/pa/main/app.py . The accuracy findings from this link could be slightly different from one run to another due to the hyperparameter settings of the corresponding machine learning algorithms.
Due to their robust prediction ability, machine learning methods have already gained wide acceptability across a wide range of research domains. On the other hand, EVM is the most commonly used method in project analytics due to its simplicity and ease of interpretability 77 . Essential research efforts have been made to improve its generalisability over time. For example, Naeni et al. 34 developed a fuzzy approach for earned value analysis to make it suitable to analyse project scenarios with ambiguous or linguistic outcomes. Acebes 78 integrated Monte Carlo simulation with EVM for project monitoring and control for a similar purpose. Another prominent method frequently used in project analytics is the time series analysis, which is compelling for the longitudinal prediction of project time and cost 30 . As evident in the current literature, not much effort has been made to bring machine learning into project analytics for addressing project management research problems. This research made a significant attempt to contribute to filling this gap.
Our proposed data-driven framework only includes the fundamental model development and application process components for machine learning algorithms. It omits some advanced machine learning methods. This study intentionally did not consider them for the proposed model since they are required only in particular designs of machine learning analysis. For example, the framework does not contain any methods or tools to handle the data imbalance issue. Data imbalance refers to a situation in which the research dataset has an uneven distribution of the target class 79 . For example, a binary target variable will cause a data imbalance issue if one of its class labels has a very high number of observations compared with the other class. Commonly used techniques to address this issue are undersampling and oversampling. The undersampling technique decreases the size of the majority class. On the other hand, the oversampling technique randomly duplicates the minority class until the class distribution becomes balanced 79 . The class distribution of the case study did not produce any data imbalance issues.
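The oversampling idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `random_oversample`, the toy arrays and the seed are hypothetical and are not part of the study's software.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until all classes have equal size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()  # size of the majority class
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw the missing rows with replacement; the majority class draws none.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Hypothetical toy data: 8 observations of class 0 and 2 of class 1.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))  # class counts after balancing
```

Undersampling would follow the same pattern, except that rows of the majority class are dropped at random instead of minority rows being duplicated.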
This study considered only four fundamental machine learning algorithms for the case study, although many other such algorithms are available in the literature. For example, it did not consider the extreme gradient boosting (XGBoost) algorithm. XGBoost is based on the decision tree algorithm, similar to the random forest algorithm 80 . It has become dominant in applied machine learning due to its performance and speed. Naïve Bayes and convolutional neural networks are other popular machine learning algorithms that were not considered when applying the proposed framework to the case study. In addition to the three feature selection methods, multi-view learning can be adopted when applying the proposed framework to the case study. Multi-view learning is another direction in machine learning that learns from multiple views of the existing data with the aim of improving predictive performance 81 , 82 . Similarly, although we considered five performance measures, there are other potential candidates. One such example is the area under the receiver operating characteristic curve, which measures the ability of the underlying classifier to distinguish between classes 48 . We leave these as potential extensions when applying our proposed framework to other project contexts in future studies.
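As a rough illustration of what the area under the receiver operating characteristic curve measures, the sketch below computes it as the fraction of (positive, negative) pairs that the classifier ranks correctly, with ties counted as half. The function name `auc_score` and the four example scores are hypothetical; the study itself did not compute this measure.

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Pairwise score differences between every positive and every negative.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Hypothetical labels and predicted scores for four observations.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(y_true, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.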
Although this study only used one case study for illustration, our proposed research framework can be used in other project analytics contexts. In such an application context, the underlying research goal should be to predict the outcome classes and find attributes playing a significant role in making correct predictions. For example, by considering two types of projects based on the time required to accomplish (e.g., on-time and delayed ), the proposed framework can develop machine learning models that can predict the class of a new data instance and find out attributes contributing mainly to this prediction performance. This framework can also be used at any stage of the project. For example, the framework’s results allow project stakeholders to screen projects for excessive cost overruns and forecast budget loss at bidding and before contracts are signed. In addition, various factors that contribute to project cost overruns can be figured out at an earlier stage. These elements emerge at each stage of a project’s life cycle. The framework’s feature importance helps project managers locate the critical contributor to cost overrun.
This study has made an important contribution to the current project analytics literature by considering the applications of machine learning within project management. Project management is often thought of as being very fluid in nature, and because of this, applications of machine learning are often more difficult. Further, existing implementations have largely been limited to safety monitoring, risk prediction and cost estimation. Through the evaluation of machine learning applications, this study further demonstrates how algorithms can be used to model the relationship between project attributes and cost overrun frequency.
The applications of machine learning in project analytics are still undergoing constant development. Within construction projects, its applications have been largely limited and focused on profitability or the design of structures themselves. In this regard, our study made a substantial effort by proposing a machine learning-based framework to address research problems related to project analytics. We also illustrated an example of this framework’s application in the context of construction project management.
Like any other research, this study has a few limitations that could provide scope for future research. First, the framework does not include some advanced machine learning techniques, such as methods for handling data imbalance and kernel density estimation. Second, we considered only one case study to illustrate the application of the proposed framework. Illustrations of this framework using case studies from different project contexts would confirm its robust application. Finally, this study did not consider all machine learning models and performance measures available in the literature for the case study. For example, we did not consider the Naïve Bayes model and precision measure in applying the proposed research framework for the case study.
Data availability
This study obtained research data from publicly available online repositories. We mentioned their sources using proper citations. Here is the link to the data https://www.kaggle.com/datasets/amansaxena/survey-on-road-construction-delay .
Venkrbec, V. & Klanšek, U. In: Advances and Trends in Engineering Sciences and Technologies II 685–690 (CRC Press, 2016).
Damnjanovic, I. & Reinschmidt, K. Data Analytics for Engineering and Construction Project Risk Management (Springer, 2020).
Singh, H. Project Management Analytics: A Data-driven Approach to Making Rational and Effective Project Decisions (FT Press, 2015).
Frame, J. D. & Chen, Y. Why Data Analytics in Project Management? (Auerbach Publications, 2018).
Ong, S. & Uddin, S. Data Science and Artificial Intelligence in Project Management: The Past, Present and Future. J. Mod. Proj. Manag. 7 , 26–33 (2020).
Bilal, M. et al. Investigating profitability performance of construction projects using big data: A project analytics approach. J. Build. Eng. 26 , 100850 (2019).
Radziszewska-Zielina, E. & Sroka, B. Planning repetitive construction projects considering technological constraints. Open Eng. 8 , 500–505 (2018).
Neely, A. D., Adams, C. & Kennerley, M. The Performance Prism: The Scorecard for Measuring and Managing Business Success (Prentice Hall Financial Times, 2002).
Kanakaris, N., Karacapilidis, N., Kournetas, G. & Lazanas, A. In: International Conference on Operations Research and Enterprise Systems. 135–155 Springer.
Jordan, M. I. & Mitchell, T. M. Machine learning: Trends, perspectives, and prospects. Science 349 , 255–260 (2015).
Shalev-Shwartz, S. & Ben-David, S. Understanding Machine Learning: From Theory to Algorithms (Cambridge University Press, 2014).
Rahimian, F. P., Seyedzadeh, S., Oliver, S., Rodriguez, S. & Dawood, N. On-demand monitoring of construction projects through a game-like hybrid application of BIM and machine learning. Autom. Constr. 110 , 103012 (2020).
Sanni-Anibire, M. O., Zin, R. M. & Olatunji, S. O. Machine learning model for delay risk assessment in tall building projects. Int. J. Constr. Manag. 22 , 1–10 (2020).
Cong, J. et al. A machine learning-based iterative design approach to automate user satisfaction degree prediction in smart product-service system. Comput. Ind. Eng. 165 , 107939 (2022).
Li, F., Chen, C.-H., Lee, C.-H. & Feng, S. Artificial intelligence-enabled non-intrusive vigilance assessment approach to reducing traffic controller’s human errors. Knowl. Based Syst. 239 , 108047 (2021).
Mohri, M., Rostamizadeh, A. & Talwalkar, A. Foundations of Machine Learning (MIT Press, 2018).
Whyte, J., Stasis, A. & Lindkvist, C. Managing change in the delivery of complex projects: Configuration management, asset information and ‘big data’. Int. J. Proj. Manag. 34 , 339–351 (2016).
Zangeneh, P. & McCabe, B. Ontology-based knowledge representation for industrial megaprojects analytics using linked data and the semantic web. Adv. Eng. Inform. 46 , 101164 (2020).
Akinosho, T. D. et al. Deep learning in the construction industry: A review of present status and future innovations. J. Build. Eng. 32 , 101827 (2020).
Soman, R. K., Molina-Solana, M. & Whyte, J. K. Linked-Data based constraint-checking (LDCC) to support look-ahead planning in construction. Autom. Constr. 120 , 103369 (2020).
Soman, R. K. & Whyte, J. K. Codification challenges for data science in construction. J. Constr. Eng. Manag. 146 , 04020072 (2020).
Soman, R. K. & Molina-Solana, M. Automating look-ahead schedule generation for construction using linked-data based constraint checking and reinforcement learning. Autom. Constr. 134 , 104069 (2022).
Shi, F., Soman, R. K., Han, J. & Whyte, J. K. Addressing adjacency constraints in rectangular floor plans using Monte-Carlo tree search. Autom. Constr. 115 , 103187 (2020).
Chen, L. & Whyte, J. Understanding design change propagation in complex engineering systems using a digital twin and design structure matrix. Eng. Constr. Archit. Manag. (2021).
Allison, J. T. et al. Artificial intelligence and engineering design. J. Mech. Des. 144 , 020301 (2022).
Dutta, D. & Bose, I. Managing a big data project: The case of ramco cements limited. Int. J. Prod. Econ. 165 , 293–306 (2015).
Bilal, M. & Oyedele, L. O. Guidelines for applied machine learning in construction industry—A case of profit margins estimation. Adv. Eng. Inform. 43 , 101013 (2020).
Tayefeh Hashemi, S., Ebadati, O. M. & Kaur, H. Cost estimation and prediction in construction projects: A systematic review on machine learning techniques. SN Appl. Sci. 2 , 1–27 (2020).
Arage, S. S. & Dharwadkar, N. V. In: International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud)(I-SMAC). 594–599 (IEEE, 2017).
Cheng, C.-H., Chang, J.-R. & Yeh, C.-A. Entropy-based and trapezoid fuzzification-based fuzzy time series approaches for forecasting IT project cost. Technol. Forecast. Soc. Chang. 73 , 524–542 (2006).
Joukar, A. & Nahmens, I. Volatility forecast of construction cost index using general autoregressive conditional heteroskedastic method. J. Constr. Eng. Manag. 142 , 04015051 (2016).
Xu, J.-W. & Moon, S. Stochastic forecast of construction cost index using a cointegrated vector autoregression model. J. Manag. Eng. 29 , 10–18 (2013).
Narbaev, T. & De Marco, A. Combination of growth model and earned schedule to forecast project cost at completion. J. Constr. Eng. Manag. 140 , 04013038 (2014).
Naeni, L. M., Shadrokh, S. & Salehipour, A. A fuzzy approach for the earned value management. Int. J. Proj. Manag. 29 , 764–772 (2011).
Ponz-Tienda, J. L., Pellicer, E. & Yepes, V. Complete fuzzy scheduling and fuzzy earned value management in construction projects. J. Zhejiang Univ. Sci. A 13 , 56–68 (2012).
Yu, F., Chen, X., Cory, C. A., Yang, Z. & Hu, Y. An active construction dynamic schedule management model: Using the fuzzy earned value management and BP neural network. KSCE J. Civ. Eng. 25 , 2335–2349 (2021).
Bonato, F. K., Albuquerque, A. A. & Paixão, M. A. S. An application of earned value management (EVM) with Monte Carlo simulation in engineering project management. Gest. Produção 26 , e4641 (2019).
Batselier, J. & Vanhoucke, M. Empirical evaluation of earned value management forecasting accuracy for time and cost. J. Constr. Eng. Manag. 141 , 05015010 (2015).
Yang, R. J. & Zou, P. X. Stakeholder-associated risks and their interactions in complex green building projects: A social network model. Build. Environ. 73 , 208–222 (2014).
Uddin, S. Social network analysis in project management–A case study of analysing stakeholder networks. J. Mod. Proj. Manag. 5 , 106–113 (2017).
Ong, S. & Uddin, S. Co-evolution of project stakeholder networks. J. Mod. Proj. Manag. 8 , 96–115 (2020).
Khanzode, K. C. A. & Sarode, R. D. Advantages and disadvantages of artificial intelligence and machine learning: A literature review. Int. J. Libr. Inf. Sci. (IJLIS) 9 , 30–36 (2020).
Loyola-Gonzalez, O. Black-box vs. white-box: Understanding their advantages and weaknesses from a practical point of view. IEEE Access 7 , 154096–154113 (2019).
Abioye, S. O. et al. Artificial intelligence in the construction industry: A review of present status, opportunities and future challenges. J. Build. Eng. 44 , 103299 (2021).
Doloi, H., Sawhney, A., Iyer, K. & Rentala, S. Analysing factors affecting delays in Indian construction projects. Int. J. Proj. Manag. 30 , 479–489 (2012).
Alkhaddar, R., Wooder, T., Sertyesilisik, B. & Tunstall, A. Deep learning approach’s effectiveness on sustainability improvement in the UK construction industry. Manag. Environ. Qual. Int. J. 23 , 126–139 (2012).
Gondia, A., Siam, A., El-Dakhakhni, W. & Nassar, A. H. Machine learning algorithms for construction projects delay risk prediction. J. Constr. Eng. Manag. 146 , 04019085 (2020).
Witten, I. H. & Frank, E. Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann, 2005).
Kanakaris, N., Karacapilidis, N. I. & Lazanas, A. In: ICORES. 362–369.
Heo, S., Han, S., Shin, Y. & Na, S. Challenges of data refining process during the artificial intelligence development projects in the architecture engineering and construction industry. Appl. Sci. 11 , 10919 (2021).
Bross, I. D. How to use ridit analysis. Biometrics 14 , 18–38 (1958).
Uddin, S., Khan, A., Hossain, M. E. & Moni, M. A. Comparing different supervised machine learning algorithms for disease prediction. BMC Med. Inform. Decis. Mak. 19 , 1–16 (2019).
LaValle, S. M., Branicky, M. S. & Lindemann, S. R. On the relationship between classical grid search and probabilistic roadmaps. Int. J. Robot. Res. 23 , 673–692 (2004).
Abdi, H. & Williams, L. J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2 , 433–459 (2010).
Saxena, A. Survey on Road Construction Delay , https://www.kaggle.com/amansaxena/survey-on-road-construction-delay (2021).
Noble, W. S. What is a support vector machine? Nat. Biotechnol. 24 , 1565–1567 (2006).
Hosmer, D. W. Jr., Lemeshow, S. & Sturdivant, R. X. Applied Logistic Regression Vol. 398 (John Wiley & Sons, 2013).
LaValley, M. P. Logistic regression. Circulation 117 , 2395–2399 (2008).
Menard, S. Applied Logistic Regression Analysis Vol. 106 (Sage, 2002).
Batista, G. E. & Monard, M. C. A study of K-nearest neighbour as an imputation method. His 87 , 48 (2002).
Agatonovic-Kustrin, S. & Beresford, R. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J. Pharm. Biomed. Anal. 22 , 717–727 (2000).
Zupan, J. Introduction to artificial neural network (ANN) methods: What they are and how to use them. Acta Chim. Slov. 41 , 327–327 (1994).
Hopfield, J. J. Artificial neural networks. IEEE Circuits Devices Mag. 4 , 3–10 (1988).
Zou, J., Han, Y. & So, S.-S. Overview of artificial neural networks. Artificial Neural Networks . 14–22 (2008).
Maind, S. B. & Wankar, P. Research paper on basic of artificial neural network. Int. J. Recent Innov. Trends Comput. Commun. 2 , 96–100 (2014).
Wolpert, D. H. Stacked generalization. Neural Netw. 5 , 241–259 (1992).
Pavlyshenko, B. In: IEEE Second International Conference on Data Stream Mining & Processing (DSMP). 255–258 (IEEE).
Jović, A., Brkić, K. & Bogunović, N. In: 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). 1200–1205 (IEEE, 2015).
Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 46 , 389–422 (2002).
Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12 , 2825–2830 (2011).
Louppe, G., Wehenkel, L., Sutera, A. & Geurts, P. Understanding variable importances in forests of randomized trees. Adv. Neural. Inf. Process. Syst. 26 , 431–439 (2013).
Al-Hazim, N., Salem, Z. A. & Ahmad, H. Delay and cost overrun in infrastructure projects in Jordan. Procedia Eng. 182 , 18–24 (2017).
Breiman, L. Random forests. Mach. Learn. 45 , 5–32. https://doi.org/10.1023/A:1010933404324 (2001).
Shehu, Z., Endut, I. R. & Akintoye, A. Factors contributing to project time and hence cost overrun in the Malaysian construction industry. J. Financ. Manag. Prop. Constr. 19 , 55–75 (2014).
Akomah, B. B. & Jackson, E. N. Contractors’ perception of factors contributing to road project delay. Int. J. Constr. Eng. Manag. 5 , 79–85 (2016).
GitHub: Where the world builds software , https://github.com/ .
Anbari, F. T. Earned value project management method and extensions. Proj. Manag. J. 34 , 12–23 (2003).
Acebes, F., Pereda, M., Poza, D., Pajares, J. & Galán, J. M. Stochastic earned value analysis using Monte Carlo simulation and statistical learning techniques. Int. J. Proj. Manag. 33 , 1597–1609 (2015).
Japkowicz, N. & Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 6 , 429–449 (2002).
Chen, T. et al. Xgboost: extreme gradient boosting. R Packag. Version 0.4–2.1 1 , 1–4 (2015).
Guarino, A., Lettieri, N., Malandrino, D., Zaccagnino, R. & Capo, C. Adam or Eve? Automatic users’ gender classification via gestures analysis on touch devices. Neural Comput. Appl. 1–23 (2022).
Zaccagnino, R., Capo, C., Guarino, A., Lettieri, N. & Malandrino, D. Techno-regulation and intelligent safeguards. Multimed. Tools Appl. 80 , 15803–15824 (2021).
Acknowledgements
The authors acknowledge the insightful comments from Prof Jennifer Whyte on an earlier version of this article.
Author information
Authors and affiliations.
School of Project Management, The University of Sydney, Level 2, 21 Ross St, Forest Lodge, NSW, 2037, Australia
Shahadat Uddin, Stephen Ong & Haohui Lu
Contributions
S.U.: Conceptualisation; Data curation; Formal analysis; Methodology; Supervision; and Writing (original draft, review and editing). S.O.: Data curation; and Writing (original draft, review and editing). H.L.: Methodology; and Writing (original draft, review and editing). All authors reviewed the manuscript.
Corresponding author
Correspondence to Shahadat Uddin .
Ethics declarations
Competing interests.
The authors declare no competing interests.
Additional information
Publisher's note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .
About this article
Cite this article.
Uddin, S., Ong, S. & Lu, H. Machine learning in project analytics: a data-driven framework and case study. Sci Rep 12 , 15252 (2022). https://doi.org/10.1038/s41598-022-19728-x
Received : 13 April 2022
Accepted : 02 September 2022
Published : 09 September 2022
DOI : https://doi.org/10.1038/s41598-022-19728-x
Binary Classification
What is Binary Classification?
In machine learning, binary classification is a supervised learning task in which a model categorizes new observations into one of two classes.
The following are a few binary classification applications, where the 0 and 1 columns are two possible classes for each observation:
Application | Observation | 0 | 1 |
---|---|---|---|
Medical Diagnosis | Patient | Healthy | Diseased |
Email Analysis | Email | Not Spam | Spam |
Financial Data Analysis | Transaction | Not Fraud | Fraud |
Marketing | Website visitor | Won't Buy | Will Buy |
Image Classification | Image | Hotdog | Not Hotdog |
Quick example
In a medical diagnosis, a binary classifier for a specific disease could take a patient's symptoms as input features and predict whether the patient is healthy or has the disease. The possible outcomes of the diagnosis are positive and negative.
Evaluation of binary classifiers
If the model correctly predicts a diseased patient as positive, the case is called a True Positive (TP). If the model correctly predicts a healthy patient as negative, it is called a True Negative (TN). The binary classifier may also misdiagnose some patients. If a diseased patient is classified as healthy by a negative test result, this error is called a False Negative (FN). Similarly, if a healthy patient is classified as diseased by a positive test result, this error is called a False Positive (FP).
We can evaluate a binary classifier based on the following parameters:
- True Positive (TP): The patient is diseased and the model predicts "diseased"
- False Positive (FP): The patient is healthy but the model predicts "diseased"
- True Negative (TN): The patient is healthy and the model predicts "healthy"
- False Negative (FN): The patient is diseased and the model predicts "healthy"
After obtaining these values, we can compute the accuracy score of the binary classifier as follows: $$ accuracy = \frac {TP + TN}{TP+FP+TN+FN} $$
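The accuracy formula can be checked with a few lines of Python. This is a minimal sketch; the function name `accuracy` and the four counts are made up purely for illustration.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)

# Hypothetical counts for 100 patients.
acc = accuracy(tp=45, fp=5, tn=40, fn=10)
print(acc)  # (45 + 40) / 100 = 0.85
```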
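The following confusion matrix arranges the above parameters by actual and predicted class:
Actual class | Predicted "healthy" | Predicted "diseased" |
---|---|---|
Healthy | True Negative (TN) | False Positive (FP) |
Diseased | False Negative (FN) | True Positive (TP) |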
In machine learning, many methods utilize binary classification. The most common are:
- Support Vector Machines
- Naive Bayes
- Nearest Neighbor
- Decision Trees
- Logistic Regression
- Neural Networks
The following Python example demonstrates binary classification using logistic regression.
A Python example for binary classification
For our data, we will use the breast cancer dataset from scikit-learn. This dataset contains tumor observations and corresponding labels for whether the tumor was malignant or benign.
First, we'll import a few libraries and then load the data. When loading the data, we'll specify as_frame=True so we can work with pandas objects (see our pandas tutorial for an introduction).
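A minimal sketch of this loading step, assuming scikit-learn and pandas are installed. The variable name `dataset` is an assumption; `load_breast_cancer` and its `as_frame` parameter are part of scikit-learn.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# as_frame=True returns the observations as a DataFrame and the labels as a Series.
dataset = load_breast_cancer(as_frame=True)

print(type(dataset['data']).__name__)    # container type of the observations
print(type(dataset['target']).__name__)  # container type of the labels
print(dataset['data'].shape)             # (number of observations, number of features)
```

Calling `dataset['data'].head()` then displays the first five rows of observations.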
The dataset contains a DataFrame for the observation data and a Series for the target data.
Let's see what the first few rows of observations look like:
index | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
5 rows × 30 columns
The output shows five observations with a column for each feature we'll use to predict malignancy.
Now, for the targets:
The targets for the first five observations are all zero, meaning the tumors are malignant (scikit-learn encodes malignant as 0 and benign as 1). Here's how many malignant and benign tumors are in our dataset:
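One way to obtain these counts is sketched below, assuming the dataset is loaded with `as_frame=True` as before. The variable names are assumptions. scikit-learn lists the class names as `['malignant', 'benign']`, so label 0 corresponds to malignant and label 1 to benign.

```python
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer(as_frame=True)

# First five labels, followed by the class names in label order (index 0, then 1).
print(dataset['target'].head())
print(list(dataset['target_names']))

# Number of observations per label.
counts = dataset['target'].value_counts()
print(counts)
```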
So we have 357 benign tumors, denoted as 1, and 212 malignant tumors, denoted as 0. With only two classes, this is a binary classification problem.
To perform binary classification using logistic regression with sklearn, we must accomplish the following steps.
Step 1: Define explanatory and target variables
We'll store the rows of observations in a variable X and the corresponding class of those observations (0 or 1) in a variable y .
Step 2: Split the dataset into training and testing sets
We use 75% of data for training and 25% for testing. Setting random_state=0 will ensure your results are the same as ours.
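Steps 1 and 2 can be sketched together as follows, assuming the variable names `X`, `y`, `X_train`, `X_test`, `y_train` and `y_test`; `train_test_split` with `test_size` and `random_state` is the scikit-learn API.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer(as_frame=True)
X = dataset['data']    # explanatory variables
y = dataset['target']  # class labels (0 or 1)

# Hold out 25% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```

With 569 observations, this leaves 426 rows for training and 143 for testing.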
Step 3: Normalize the data for numerical stability
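Note that we normalize after splitting the data. Computing the scaling statistics before the split would let information from the test set influence training. It's good practice to apply any data transformations to training and testing data separately to prevent this kind of data leakage.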
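A sketch of the normalization step using `StandardScaler`. One common convention, assumed here and not necessarily the one used for the reported results, is to fit the scaler on the training set only and reuse its statistics for the test set. The names `scaler`, `X_train_scaled` and `X_test_scaled` are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
# Learn each feature's mean and standard deviation from the training rows only.
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same training statistics to the held-out rows.
X_test_scaled = scaler.transform(X_test)
```

After this step every training feature has zero mean and unit variance, which helps the solver used by logistic regression converge.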
Step 4: Fit a logistic regression model to the training data
This step effectively trains the model to predict the targets from the data.
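A sketch of the fitting step with scikit-learn's `LogisticRegression` and its default parameters. The preprocessing repeats the earlier steps so the block runs on its own, and the variable names are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Fit one coefficient per feature plus an intercept on the training data.
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
print(model.coef_.shape)
```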
Step 5: Make predictions on the testing data
With the model trained, we now ask the model to predict targets based on the test data.
Step 6: Calculate the accuracy score by comparing the actual values and predicted values.
We can now calculate how well the model performed by comparing the model's predictions to the true target values, which we reserved in the y_test variable.
First, we'll calculate the confusion matrix to get the necessary parameters:
With these values, we can now calculate an accuracy score:
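Steps 5 and 6 can be sketched as below. The block repeats the earlier steps so that it runs on its own, and all variable names are assumptions. Note that scikit-learn treats label 1 (benign in this dataset) as the positive class when unpacking the confusion matrix.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# Step 5: predict labels for the held-out observations.
predictions = model.predict(scaler.transform(X_test))

# Step 6: for labels 0 and 1, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(tn, fp, fn, tp)
print(accuracy)
```

The same value is returned by `accuracy_score(y_test, predictions)`, which is a convenient cross-check.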
Other binary classifiers in the scikit-learn library
Logistic regression is just one of many classification algorithms defined in Scikit-learn. We'll compare several of the most common, but feel free to read more about these algorithms in the sklearn docs here .
We'll also use the sklearn Accuracy, Precision, and Recall metrics for performance evaluation. See the docs here if you'd like to read more about the available metrics.
Initializing each binary classifier
To quickly train each model in a loop, we'll initialize each model and store it by name in a dictionary:
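A sketch of such a dictionary, keyed by the classifier names used in the results table. Which scikit-learn estimator stands behind each name is an assumption here (for example `SVC` for support vector machines and `GaussianNB` for Naive Bayes); all are created with default parameters.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Map a display name to an untrained estimator with default settings.
models = {
    'Logistic Regression': LogisticRegression(),
    'Support Vector Machines': SVC(),
    'Decision Trees': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'K-Nearest Neighbor': KNeighborsClassifier(),
}
print(list(models))
```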
Performance evaluation of each binary classifier
Now that we've initialized the models, we'll loop over each one, train it by calling .fit() , make predictions, calculate metrics, and store each result in a dictionary.
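The loop could look like the sketch below. It repeats the data preparation so that it runs on its own; the estimator choices and the names `models` and `results` are assumptions. Because some estimators are randomized and library versions differ, the numbers it produces may not match the table exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    'Logistic Regression': LogisticRegression(),
    'Support Vector Machines': SVC(),
    'Decision Trees': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'K-Nearest Neighbor': KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    # Train on the training split, then score on the held-out split.
    model.fit(X_train_scaled, y_train)
    predictions = model.predict(X_test_scaled)
    results[name] = {
        'Accuracy': accuracy_score(y_test, predictions),
        'Precision': precision_score(y_test, predictions),
        'Recall': recall_score(y_test, predictions),
    }

for name, metrics in results.items():
    print(name, metrics)
```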
With all metrics stored, we can use pandas to view the data as a table:
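A sketch of the tabulation step, assuming the metrics are held in a nested dictionary named `results` (a hypothetical name). To keep the block self-contained, it is filled with two rows of values copied from the table in this article.

```python
import pandas as pd

# Nested dict: outer keys become the row index, inner keys become the columns.
results = {
    'Logistic Regression': {'Accuracy': 0.958042, 'Precision': 0.955556, 'Recall': 0.977273},
    'Random Forest': {'Accuracy': 0.972028, 'Precision': 0.966667, 'Recall': 0.988636},
}
df_results = pd.DataFrame.from_dict(results, orient='index')
print(df_results)
```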
Classifier | Accuracy | Precision | Recall |
---|---|---|---|
Logistic Regression | 0.958042 | 0.955556 | 0.977273 |
Support Vector Machines | 0.937063 | 0.933333 | 0.965517 |
Decision Trees | 0.902098 | 0.866667 | 0.975000 |
Random Forest | 0.972028 | 0.966667 | 0.988636 |
Naive Bayes | 0.937063 | 0.955556 | 0.945055 |
K-Nearest Neighbor | 0.951049 | 0.988889 | 0.936842 |
Finally, here's a quick bar chart to compare the classifiers' performance:
Since we're only using the default model parameters, we won't know which classifier is better. We should optimize each algorithm's parameters first to know which one has the best performance.
Meet the Authors
Associate Professor of Computer Engineering. Author/co-author of over 30 journal publications. Instructor of graduate/undergraduate courses. Supervisor of Graduate thesis. Consultant to IT Companies.
Machine Learning Case Studies with Powerful Insights
Explore the potential of machine learning through these practical machine learning case studies and success stories in various industries. | ProjectPro
Machine learning is revolutionizing how different industries function, from healthcare to finance to transportation. If you're curious about how this technology is applied in real-world scenarios, look no further. In this blog, we'll explore some exciting machine learning case studies that showcase the potential of this powerful emerging technology.
Machine-learning-based applications have quickly transformed work methods in the technological world. They are changing the way we work, live, and interact with the world around us. Machine learning is revolutionizing industries, from personalized recommendations on streaming platforms to self-driving cars.
But while the technology of artificial intelligence and machine learning may seem abstract or daunting to some, its applications are incredibly tangible and impactful. Data Scientists use machine learning algorithms to predict equipment failures in manufacturing, improve cancer diagnoses in healthcare, and even detect fraudulent activity in finance. If you're interested in learning more about how machine learning is applied in real-world scenarios, you are on the right page. This blog will explore in depth how machine learning applications are used for solving real-world problems.
We'll start with a few case studies from GitHub that examine how machine learning is being used by businesses to retain their customers and improve customer satisfaction. We'll also look at how machine learning is being used with the help of Python programming language to detect and prevent fraud in the financial sector and how it can save companies millions of dollars in losses. Next, we will examine how top companies use machine learning to solve various business problems. Additionally, we'll explore how machine learning is used in the healthcare industry, and how this technology can improve patient outcomes and save lives.
By going through these case studies, you will better understand how machine learning is transforming work across different industries. So, let's get started!
Table of Contents
- Machine Learning Case Studies on GitHub
- Machine Learning Case Studies in Python
- Company-Specific Machine Learning Case Studies
- Machine Learning Case Studies in Biology and Healthcare
- AWS Machine Learning Case Studies
- Azure Machine Learning Case Studies
- How to Prepare for Machine Learning Case Studies Interview
Machine Learning Case Studies on GitHub
This section presents machine learning case studies along with the GitHub repositories that contain their sample code.
1. Customer Churn Prediction
Predicting customer churn is essential for businesses interested in retaining customers and maximizing their profits. By leveraging historical customer data, machine learning algorithms can identify patterns and factors that are correlated with churn, enabling businesses to take proactive steps to prevent it.
In this case study, you will learn how a telecom company uses machine learning for customer churn prediction. The available data contains information about the services each customer signed up for, their contact information, monthly charges, and their demographics. The goal is to first analyze the data with Exploratory Data Analysis (EDA) methods, which helps in picking a suitable machine learning algorithm. The five machine learning models used in this case study are AdaBoost, Gradient Boost, Random Forest, Support Vector Machines, and K-Nearest Neighbors. These models are used to determine which customers are at risk of churn.
By using machine learning for churn prediction, businesses can better understand customer behavior, identify areas for improvement, and implement targeted retention strategies. It can result in increased customer loyalty, higher revenue, and a better understanding of customer needs and preferences. This case study example will help you understand how machine learning is a valuable tool for any business looking to improve customer retention and stay ahead of the competition.
GitHub Repository: https://github.com/Pradnya1208/Telecom-Customer-Churn-prediction
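As a rough illustration of the model comparison described above, here is a minimal sketch using scikit-learn. It is not the code from the linked repository: the data is synthetic (generated with `make_classification` as a stand-in for the telecom dataset), and the choice of 5-fold cross-validated F1 as the comparison metric is an assumption made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the telecom data (assumption): 500 customers,
# 10 numeric features, roughly 25% of them labeled as churners (class 1).
X, y = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=5,
    weights=[0.75, 0.25],
    random_state=0,
)

churn_models = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Gradient Boost": GradientBoostingClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # SVM and KNN depend on distances, so features are standardized first.
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Mean 5-fold cross-validated F1 score on the churn class for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in churn_models.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} F1 = {score:.3f}")
```

F1 is used here instead of plain accuracy because churners are usually the minority class, so a model that never predicts churn could still look accurate.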
2. Market Basket Analysis
Market basket analysis is a common application of machine learning in retail and e-commerce, where it is used to identify patterns and relationships between products that are frequently purchased together. By leveraging this information, businesses can make informed decisions about product placement, promotions, and pricing strategies.
In this case study, you will utilize the EDA methods to carefully analyze the relationships among different variables in the data. Next, you will study how to use the Apriori algorithm to identify frequent itemsets and association rules, which describe the likelihood of a product being purchased given the presence of another product. These rules can generate recommendations, optimize product placement, and increase sales, and they can also be used for customer segmentation.
Using machine learning for market basket analysis allows businesses to understand customer behavior better, identify cross-selling opportunities, and increase customer satisfaction. It has the potential to result in increased revenue, improved customer loyalty, and a better understanding of customer needs and preferences.
GitHub Repository: https://github.com/kkrusere/Market-Basket-Analysis-on-the-Online-Retail-Data
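To make the idea of frequent itemsets and association rules concrete, the following is a small pure-Python sketch. It is a simplified, pairs-only version of the Apriori idea on a handful of made-up transactions, not the implementation from the linked repository; the item names and the `min_support` value are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions; a real study would use the retail dataset.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
n = len(transactions)
min_support = 0.4  # assumed threshold for this example

# Pass 1: count single items and keep those that meet the support threshold.
item_counts = Counter(item for t in transactions for item in t)
frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}

# Pass 2: candidate pairs are built only from frequent items, which is the
# pruning step at the heart of Apriori.
pair_counts = Counter(
    pair
    for t in transactions
    for pair in combinations(sorted(t & frequent_items), 2)
)
frequent_pairs = {p: c / n for p, c in pair_counts.items() if c / n >= min_support}


def rule_metrics(antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    both = pair_counts[tuple(sorted((antecedent, consequent)))]
    support = both / n
    confidence = both / item_counts[antecedent]
    lift = confidence / (item_counts[consequent] / n)
    return support, confidence, lift


support, confidence, lift = rule_metrics("butter", "bread")
print(f"butter -> bread: support={support:.2f}, "
      f"confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift above 1 suggests the two products are bought together more often than would be expected if purchases were independent, which is the signal used for cross-selling and placement decisions.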
3. Predicting Prices for Airbnb
Airbnb is a tech company that enables hosts to rent out their homes, apartments, or rooms to guests interested in temporary lodging. One of the key challenges hosts face is optimizing the rent prices for the customers. With the help of machine learning, hosts can have rough estimates of the rental costs based on various factors such as location, property type, amenities, and availability.
The first step in this case study is to clean the dataset by handling missing values, duplicates, and outliers. In the same step, the data is transformed and prepared for modeling with feature engineering methods. The next step is to perform EDA to understand how the rental listings are spread across different cities in the US. Next, you will learn how to visualize how prices change over time, looking at trends for different seasons, months, days of the week, and times of the day.
The final step involves implementing ML models like linear regression (ridge and lasso), Naive Bayes, and Random Forests to produce price estimates for listings. You will learn how to compare the outcome of these models and evaluate their performance.
GitHub Repository: https://github.com/samuelklam/airbnb-pricing-prediction
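The time-based price trends mentioned above come down to grouping prices by a calendar attribute. Here is a minimal sketch with the standard library only; the records are invented for illustration and the helper name `mean_price_by` is hypothetical, not something taken from the linked repository.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical (date, nightly price) records standing in for listing data.
records = [
    (date(2024, 7, 5), 180.0),   # Friday
    (date(2024, 7, 6), 200.0),   # Saturday
    (date(2024, 7, 8), 120.0),   # Monday
    (date(2024, 7, 12), 190.0),  # Friday
    (date(2024, 7, 15), 110.0),  # Monday
]


def mean_price_by(records, key):
    """Average price per group, where key maps a date to a group label."""
    totals = defaultdict(lambda: [0.0, 0])  # label -> [sum of prices, count]
    for day, price in records:
        bucket = totals[key(day)]
        bucket[0] += price
        bucket[1] += 1
    return {label: total / count for label, (total, count) in totals.items()}


by_weekday = mean_price_by(records, lambda d: WEEKDAYS[d.weekday()])
by_month = mean_price_by(records, lambda d: d.month)

print(by_weekday)
print(by_month)
```

The same helper can group by season or hour by passing a different `key` function, and the resulting dictionaries can be handed to any plotting library.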
4. Titanic Disaster Analysis
The Titanic Machine Learning Case Study is a classic example in the field of data science and machine learning. The study is based on the dataset of passengers aboard the Titanic when it sank in 1912. The study's goal is to predict whether a passenger survived or not based on their demographic and other information.
The dataset contains information on 891 passengers, including their age, gender, ticket class, fare paid, as well as whether or not they survived the disaster. The first step in the analysis is to explore the dataset and identify any missing values or outliers. Once this is done, the data is preprocessed to prepare it for modeling.
The next step is to build a predictive model using various machine learning algorithms, such as logistic regression, decision trees, and random forests. These models are trained on a subset of the data and evaluated on another subset to ensure they can generalize well to new data.
Finally, the model is used to make predictions on a test dataset, and the model performance is measured using various metrics such as accuracy, precision, and recall. The study results can be used to improve safety protocols and inform future disaster response efforts.
GitHub Repository: https://github.com/ashishpatel26/Titanic-Machine-Learning-from-Disaster
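The evaluation metrics named above can be computed directly from true and predicted labels. The sketch below does this in plain Python for a binary survived / did-not-survive task; the label lists are made up for illustration and do not come from the Titanic dataset.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    return {
        "accuracy": correct / len(pairs),
        # Of the passengers predicted to survive, how many actually did?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the passengers who survived, how many did the model find?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels: 1 = survived, 0 = did not survive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In practice these values would be computed on the held-out subset only, so that they reflect how the model generalizes instead of how well it memorized the training data.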
Machine Learning Case Studies in Python
If you are looking for sample machine learning case studies in Python, keep reading.
5. Loan Application Classification
Financial institutions receive a large volume of loan requests from borrowers, and making a decision on each request is a crucial task. Manually processing these requests is time-consuming and error-prone, so there is increasing demand for machine learning to automate the process.
You can work on this Loan Dataset on Kaggle to get started on one of the most realistic case studies in the financial industry. The dataset contains 614 records across 13 columns. Follow the steps below to get started on this case study.
- Analyze the dataset and explore how various factors such as gender, marital status, and employment affect the loan amount and the status of the loan application.
- Select the features to automate the process of classification of loan applications.
- Apply machine learning models such as logistic regression, decision trees, and random forests to the features and compare their performance using statistical metrics.
This case study falls under the umbrella of supervised learning problems in machine learning and demonstrates how ML models are used to automate tasks in the financial industry.
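The steps above can be sketched roughly as follows with scikit-learn. The records, field names (`married`, `self_employed`, `income_k`, `credit_history`), and labels are hypothetical and are not the schema of the Kaggle dataset; `DictVectorizer` is used here simply as a compact way to one-hot encode the categorical fields.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical applications; field names are illustrative only.
applications = [
    {"married": "Yes", "self_employed": "No", "income_k": 5.8, "credit_history": 1},
    {"married": "No", "self_employed": "No", "income_k": 4.6, "credit_history": 1},
    {"married": "Yes", "self_employed": "Yes", "income_k": 3.0, "credit_history": 0},
    {"married": "No", "self_employed": "Yes", "income_k": 2.5, "credit_history": 0},
    {"married": "Yes", "self_employed": "No", "income_k": 6.0, "credit_history": 1},
    {"married": "No", "self_employed": "No", "income_k": 3.2, "credit_history": 0},
    {"married": "Yes", "self_employed": "Yes", "income_k": 5.4, "credit_history": 1},
    {"married": "No", "self_employed": "Yes", "income_k": 2.1, "credit_history": 0},
]
approved = ["Y", "Y", "N", "N", "Y", "N", "Y", "N"]

# DictVectorizer one-hot encodes string fields and passes numbers through.
loan_models = {
    "logistic regression": make_pipeline(
        DictVectorizer(sparse=False), LogisticRegression(max_iter=1000)
    ),
    "decision tree": make_pipeline(
        DictVectorizer(sparse=False), DecisionTreeClassifier(random_state=0)
    ),
}

new_applicant = {
    "married": "Yes", "self_employed": "No", "income_k": 5.0, "credit_history": 1,
}

predictions = {}
for name, model in loan_models.items():
    model.fit(applications, approved)
    predictions[name] = model.predict([new_applicant])[0]
    print(f"{name}: {predictions[name]}")
```

With only eight rows this shows the mechanics and nothing more; on the real dataset the models would be compared on held-out data before choosing one.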
6. Computer Price Estimation
Whenever one thinks of buying a new computer, the first thing that comes to mind is to curate a list of hardware specifications that best suit their needs. The next step is browsing different websites and looking for the cheapest option available. Performing all these processes can be time-consuming and require a lot of effort. But you don’t have to worry as machine learning can help you build a system that can estimate the price of a computer system by taking into account its various features.
This sample basic computer dataset on Kaggle can help you develop a price estimation model that analyzes historical data and identifies patterns and trends in the relationship between computer specifications and prices. A machine learning model trained on this data can learn to predict prices for new or unseen computer components. Machine learning algorithms such as K-Nearest Neighbours, Decision Trees, Random Forests, AdaBoost, and XGBoost can capture complex relationships between features and prices, leading to more accurate price estimates.
Besides saving time and effort compared to manual estimation methods, this project also has a business use case as it can provide stakeholders with valuable insights into market trends and consumer preferences.
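To show how one of the listed algorithms turns specifications into a price, here is a small K-Nearest Neighbours regressor written with the standard library only. The training rows, the three chosen features, and the function name `knn_predict` are assumptions for illustration and do not reflect the Kaggle dataset.

```python
import math

# Hypothetical rows: (ram_gb, storage_gb, cpu_ghz) -> price in dollars.
train = [
    ((8, 256, 2.4), 600.0),
    ((16, 512, 3.0), 1000.0),
    ((16, 1024, 3.2), 1300.0),
    ((32, 1024, 3.6), 1800.0),
    ((8, 512, 2.6), 700.0),
]


def knn_predict(train, query, k=3):
    """Predict a price as the mean price of the k most similar machines."""
    features = [f for f, _ in train]
    lows = [min(col) for col in zip(*features)]
    highs = [max(col) for col in zip(*features)]

    def scale(row):
        # Min-max scale each feature so that storage (hundreds of GB)
        # does not dominate clock speed (a few GHz) in the distance.
        return [
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ]

    q = scale(query)
    nearest = sorted(train, key=lambda item: math.dist(scale(item[0]), q))[:k]
    return sum(price for _, price in nearest) / k


estimate = knn_predict(train, (16, 768, 3.1), k=3)
print(f"Estimated price: {estimate:.2f}")
```

The scaling step matters for any distance-based model: without it, the feature with the largest numeric range would decide which machines count as neighbours.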
7. House Price Prediction
Here is a machine learning case study that aims to predict the median value of owner-occupied homes in Boston suburbs based on various features such as crime rate, number of rooms, and pupil-teacher ratio.
Start working on this study by collecting the data from the publicly available UCI Machine Learning Repository, which contains information about 506 neighborhoods in the Boston area. The dataset includes 13 features such as per capita crime rate, average number of rooms per dwelling, and the proportion of owner-occupied units built before 1940. You can gain more insights into this data by using EDA techniques. Then prepare the dataset for implementing ML models by handling missing values, converting categorical features to numerical ones, and scaling the data.
Use machine learning algorithms such as Linear Regression, Lasso Regression, and Random Forest to predict house prices for different neighborhoods in the Boston area. Select the best model by comparing the performance of each one using metrics such as mean squared error, mean absolute error, and R-squared.
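The three comparison metrics mentioned above are straightforward to compute by hand. The following sketch defines them in plain Python and uses them to pick between two sets of predictions; the target values and the two prediction lists are invented for illustration and are not outputs of models trained on the Boston data.

```python
def regression_metrics(y_true, y_pred):
    """Mean squared error, mean absolute error, and R-squared."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mse": ss_res / n,
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical median home values (in $1000s) and two models' predictions.
y_true = [20.0, 30.0, 40.0, 50.0]
model_predictions = {
    "model_a": [22.0, 28.0, 41.0, 49.0],
    "model_b": [25.0, 25.0, 45.0, 45.0],
}

results = {
    name: regression_metrics(y_true, preds)
    for name, preds in model_predictions.items()
}
best = min(results, key=lambda name: results[name]["mse"])

for name, m in results.items():
    print(name, m)
print("Best by MSE:", best)
```

Lower MSE and MAE are better, while R-squared is better the closer it is to 1, so it helps to check that the metrics agree before declaring a winner.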
Company-Specific Machine Learning Case Studies
This section presents machine learning case studies from different firms across various industries.
8. Machine Learning Case Study on Dell
Dell Technologies is a multinational technology company that designs, develops, and sells computers, servers, data storage devices, network switches, software, and other technology products and services. Dell is one of the world's most prominent PC vendors and serves customers in over 180 countries. As Data is an integral component of Dell's hard drive, the marketing team of Dell required a data-focused solution that would improve response rates and demonstrate why some words and phrases are more effective than others.
Dell partnered with Persado, a firm that uses AI to create marketing content. Persado helped Dell revamp its email marketing strategy and leverage data analytics to capture its audiences' attention. The statistics revealed that the partnership resulted in a noticeable increase in customer engagement: page visits rose by 22% on average, and click-through rate (CTR) rose by 50% on average.
Dell currently relies on ML methods to improve their marketing strategy for emails, banners, direct mail, Facebook ads, and radio content.
9. Machine Learning Case Study on Harley Davidson
In the current environment, it is challenging to stand out with traditional marketing alone. Albert, an artificial-intelligence-powered robot, is appealing for a business like Harley Davidson. Robots are now directing traffic, creating news stories, working in hotels, and even running McDonald's, thanks to machine learning and artificial intelligence.
There are many marketing channels that Albert can be applied to, including email and social media. It automatically prepares customized creative copy and forecasts which customers are most likely to convert.
One company that makes use of Albert is Harley Davidson. The business examined customer data to ascertain the activities of past clients who successfully made purchases and spent more time than usual across different pages on the website. With this knowledge, Albert divided the customer base into groups and adjusted the scale of test campaigns accordingly.
Results reveal that using Albert increased Harley Davidson's sales by 40%. The brand also saw a 2,930% spike in leads, 50% of which came from very effective "lookalikes" found by machine learning and artificial intelligence.
10. Machine Learning Case Study on Zomato
Zomato is a popular online platform that provides restaurant search and discovery services, online ordering and delivery, and customer reviews and ratings. Founded in India in 2008, the company has expanded to over 24 countries and serves millions of users globally. Over the years, it has become a popular choice for consumers to browse the ratings of different restaurants in their area.
To provide the best restaurant options to its customers, Zomato hand-picks the ones likely to perform well in the future. Machine learning can help Zomato make such decisions by considering different restaurant features. You can work on this sample Zomato Restaurants Data and experiment with how machine learning can be useful to Zomato. The dataset has the details of 9551 restaurants. The first step should involve careful analysis of the data and identifying outliers and missing values in the dataset. Treat them using statistical methods and then use regression models to predict the rating of different restaurants.
The Zomato Case study is one of the most popular machine learning startup case studies among data science enthusiasts.
11. Machine Learning Case Study on Tesla
Tesla, Inc. is an American electric vehicle and clean energy company founded in 2003 by Martin Eberhard and Marc Tarpenning; Elon Musk joined as an early investor and chairman in 2004 and later became CEO. The company designs, manufactures, and sells electric cars, battery storage systems, and solar products. Tesla has pioneered the electric vehicle industry and has popularized high-capacity lithium-ion batteries and regenerative braking systems. The company strongly focuses on innovation, sustainability, and reducing the world's dependence on fossil fuels.
Tesla uses machine learning in various ways to enhance the performance and features of its electric vehicles. One of the most notable applications of machine learning at Tesla is in its Autopilot system, which uses a combination of cameras, sensors, and machine learning algorithms to enable advanced driver assistance features such as lane centering, adaptive cruise control, and automatic emergency braking.
Tesla's Autopilot system uses deep neural networks to process large amounts of real-world driving data and accurately predict driving behavior and potential hazards. It enables the system to learn and adapt over time, improving its accuracy and responsiveness.
Additionally, Tesla also uses machine learning in its battery management systems to optimize the performance and longevity of its batteries. Machine learning algorithms are used to model and predict the behavior of the batteries under different conditions, enabling Tesla to optimize charging rates, temperature control, and other factors to maximize the lifespan and performance of its batteries.
12. Machine Learning Case Study on Amazon
Amazon Prime Video uses machine learning to ensure high video quality for its users. The company has developed a system that analyzes video content and applies various techniques to enhance the viewing experience.
The system uses machine learning algorithms to automatically detect and correct issues such as unexpected black frames, blocky frames, and audio noise. For detecting block corruption, residual neural networks are used. After training the algorithm on a large dataset, a threshold of 0.07 was set for the corrupted-area ratio to mark the areas of the frame that have block corruption. For detecting unwanted noise in the audio, a model based on a pre-trained audio neural network is used to classify a one-second audio sample into one of these classes: audio hum, audio distortion, audio hiss, audio clicks, and no defect. Lip sync is handled using the SyncNet architecture.
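One plausible way to apply the corrupted-area ratio threshold is sketched below. This is only an interpretation, not Amazon's implementation: it assumes a classifier has already labeled each block of a frame as corrupted (1) or clean (0), and that a frame is flagged when the fraction of corrupted blocks exceeds 0.07. The masks and function names are hypothetical.

```python
CORRUPTED_AREA_THRESHOLD = 0.07  # threshold value quoted in the text


def corrupted_area_ratio(block_mask):
    """Fraction of blocks in a frame that are marked as corrupted (1)."""
    total = sum(len(row) for row in block_mask)
    corrupted = sum(sum(row) for row in block_mask)
    return corrupted / total


def frame_has_block_corruption(block_mask, threshold=CORRUPTED_AREA_THRESHOLD):
    # Assumption: a frame is flagged when the corrupted fraction
    # exceeds the threshold.
    return corrupted_area_ratio(block_mask) > threshold


# Hypothetical per-block predictions for two frames (4 rows x 5 blocks).
mostly_clean_frame = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
damaged_frame = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

for name, mask in [("mostly clean", mostly_clean_frame), ("damaged", damaged_frame)]:
    ratio = corrupted_area_ratio(mask)
    print(f"{name}: ratio={ratio:.2f}, flagged={frame_has_block_corruption(mask)}")
```

A ratio-based rule like this tolerates an isolated false positive on a single block while still catching frames where corruption covers a visible area.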
By using machine learning to optimize video quality, Amazon can deliver a consistent and high-quality viewing experience to its users, regardless of the device or network conditions they are using. It helps maintain customer satisfaction and loyalty and ensures that Amazon remains a competitive video streaming market leader.
Machine Learning Case Studies in Biology and Healthcare
Machine learning applications are not limited to financial and tech use cases; they are also used in the healthcare industry. Here are a few machine learning case studies that showcase the use of this technology in the biology and healthcare domain.
13. Microbiome Therapeutics Development
The development of microbiome therapeutics involves the study of the interactions between the human microbiome and various diseases and identifying specific microbial strains or compositions that can be used to treat or prevent these diseases. Machine learning plays a crucial role in this process by enabling the analysis of large, complex datasets and identifying patterns and correlations that would be difficult or impossible to detect through traditional methods.
Machine learning algorithms can analyze microbiome data at various levels, including taxonomic composition, functional pathways, and gene expression profiles. These algorithms can identify specific microbial strains or communities associated with different diseases or conditions and can be used to develop targeted therapies.
Besides that, machine learning can be used to optimize the design and delivery of microbiome therapeutics. For example, machine learning algorithms can be used to predict the efficacy of different microbial strains or compositions and optimize these therapies' dosage and delivery mechanisms.
14. Mental Illness Diagnosis
Machine learning is increasingly being used to develop predictive models for diagnosing and managing mental illness. One of the critical advantages of machine learning in this context is its ability to analyze large, complex datasets and identify patterns and correlations that would be difficult for human experts to detect.
Machine learning algorithms can be trained on various data sources, including clinical assessments, self-reported symptoms, and physiological measures such as brain imaging or heart rate variability. These algorithms can then be used to develop predictive models to identify individuals at high risk of developing a mental illness or who are likely to experience a particular symptom or condition.
One example of machine learning being used to predict mental illness is in the development of suicide risk assessment tools. These tools use machine learning algorithms to analyze various risk factors, such as demographic information, medical history, and social media activity, to identify individuals at risk of suicide. These tools can be used to guide early intervention and support for individuals struggling with mental health issues.
One can also build a chatbot using machine learning and Natural Language Processing that analyzes the user's responses and recommends steps they can take immediately.
15. 3D Bioprinting
Another popular subject in the biotechnology industry is Bioprinting. Based on a computerized blueprint, the printer prints biological tissues like skin, organs, blood arteries, and bones layer by layer using cells and biomaterials, also known as bioinks.
Printed tissues can be produced more ethically and economically than by relying on organ donations. Additionally, synthetic tissue constructs are used for drug testing instead of testing on animals or people. Due to its tremendous complexity, the technology is still in its early stages of maturity. Data science is one of the most essential components for handling this complexity.
Many variables affect the printing process and its quality, including the properties of the bioinks, which have inherent variability, and the many printing parameters. For instance, Bayesian optimization is used to tune the printing process and improve the likelihood of producing usable output.
A crucial element of the procedure is the printing speed. To estimate the optimal speed, Siamese network models are used. Convolutional neural networks are applied to photographs of the layer-by-layer tissue to detect material or tissue abnormalities.
AWS Machine Learning Case Studies
In this section, you will find a list of machine learning case studies that have used Amazon Web Services to create machine-learning-based solutions.
16. Machine Learning Case Study on Autodesk
Autodesk is a US-based software company that provides solutions for 3D design, engineering, and entertainment industries. The company offers a wide range of software products and services, including computer-aided design (CAD) software, 3D animation software, and other tools used in architecture, construction, engineering, manufacturing, media and entertainment industries.
Autodesk utilizes machine learning (ML) models that are constructed on Amazon SageMaker, a managed ML service provided by Amazon Web Services (AWS), to assist designers in categorizing and sifting through a multitude of versions created by generative design procedures and selecting the most optimal design. ML techniques built with Amazon SageMaker help Autodesk progress from intuitive design to exploring the boundaries of generative design for their customers to produce innovative products that can even be life-changing. As an example, Edera Safety, a design studio located in Austria, created a superior and more effective spine protector by utilizing Autodesk's generative design process constructed on AWS.
17. Machine Learning Case Study on Capital One
Capital One is a financial services company in the United States that offers a range of financial products and services to consumers, small businesses, and commercial clients. The company provides credit cards, loans, savings and checking accounts, investment services, and other financial products and services.
Capital One leverages AWS to transform data into valuable insights using machine learning, enabling the company to innovate rapidly on behalf of its customers. To power its machine-learning innovation, Capital One utilizes a range of AWS services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Relational Database Service (Amazon RDS), and AWS Lambda. AWS is enabling Capital One to implement flexible DevOps processes, enabling the company to introduce new products and features to the market in just a few weeks instead of several months or years. Additionally, AWS assists Capital One in providing data to and facilitating the training of sophisticated machine-learning analysis and customer-service solutions. The company also integrates its contact centers with its CRM and other critical systems, while simultaneously attracting promising entry-level and mid-career developers and engineers with the opportunity to gain knowledge and innovate with the most up-to-date cloud technologies.
18. Machine Learning Case Study on BuildFax
In 2008, BuildFax began by collecting widely scattered building permit data from different parts of the United States and distributing it to various businesses, including building inspectors, insurance companies, and economic analysts. Today, it offers custom-made solutions to these professions and several other services. These services comprise indices that monitor trends like commercial construction, and housing remodels.
Source: aws.amazon.com/solutions/case-studies
The primary customer base of BuildFax is insurance companies that spend billions of dollars on roof losses. BuildFax helps its customers develop policies and premiums by evaluating roof losses for them. Initially, it relied on general data and ZIP codes to build predictive models, but these were not accurate enough to be useful and were somewhat complex. It therefore needed a solution that could support more accurate, property-specific estimates, and it chose Amazon Machine Learning for predictive modeling. With Amazon Machine Learning, the company can offer insurance companies and builders personalized estimates of roof age and job cost that are specific to a particular property, instead of depending on generalized estimates based on ZIP codes. It now uses customers' data and data from public sources to create predictive models.
Azure Machine Learning Case Studies
This section presents a list of machine learning case studies that showcase how companies have used Microsoft Azure services for their machine learning tasks.
19. Machine Learning Case Study for an Enterprise Company
Consider a company (an Azure customer) in the Electronic Design Automation industry that provides software, hardware, and IP for electronic systems and semiconductor companies. Its finance team was struggling to manage accounts receivable efficiently, so it wanted to use machine learning to predict payment outcomes and reduce outstanding receivables. The team faced a major challenge with managing change data capture using Azure Data Factory. A3S provided a solution by automating data migration from SAP ECC to Azure Synapse and offering fully automated analytics as a service, which helped the company streamline its accounts receivable management. The company was able to achieve the entire scenario, from data ingestion to analytics, within a week, and it plans to use A3S for other analytics initiatives.
20. Machine Learning Case Study on Shell
Royal Dutch Shell, a global company whose operations range from oil wells to retail petrol stations, is using computer vision technology to automate safety checks at its service stations. In partnership with Microsoft, it has developed a project called Video Analytics for Downstream Retail (VADR) that uses machine vision and image processing to detect dangerous behavior and alert the servicemen. It uses OpenCV and Azure Databricks in the background, highlighting how Azure can be used for personalised applications. Once the project shows decent results in the countries where it has been deployed (Thailand and Singapore), Shell plans to expand VADR globally.
21. Machine Learning Case Study on TransLink
TransLink, a transportation company in Vancouver, deployed 18,000 different sets of machine learning models using Azure Machine Learning to predict bus departure times and determine bus crowdedness. The models take into account factors such as traffic, bad weather and at-capacity buses. The deployment led to an improvement in predicted bus departure times of 74%. The company also created a mobile app that allows people to plan their trips based on how at-capacity a bus might be at different times of day.
22. Machine Learning Case Study on Xbox
Microsoft Azure Personalizer is a cloud-based service that uses reinforcement learning to select the best content for customers based on up-to-date information about them, the context, and the application. Custom recommender services can also be created using Azure Machine Learning. The Xbox One group used Cognitive Services Personalizer to find content suited to each user, which resulted in a 40% increase in user engagement compared to a random personalisation policy on the Xbox platform.
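To give a feel for why a learning policy can outperform a random one, here is a toy epsilon-greedy bandit in plain Python. It is a deliberately simplified, context-free illustration and does not use or imitate the Azure service's API; the click probabilities, step count, and function names are all assumptions made for this sketch.

```python
import random

# Hypothetical probability that a user engages with each content option.
CLICK_PROBS = [0.1, 0.2, 0.8]


def run_epsilon_greedy(click_probs, steps=5000, epsilon=0.1, seed=0):
    """Learn which option performs best while still exploring occasionally."""
    rng = random.Random(seed)
    counts = [0] * len(click_probs)
    values = [0.0] * len(click_probs)  # running mean reward per option
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(click_probs))  # explore
        else:
            arm = max(range(len(click_probs)), key=values.__getitem__)  # exploit
        reward = 1 if rng.random() < click_probs[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return values, total_reward


def run_random_policy(click_probs, steps=5000, seed=0):
    """Baseline: show a uniformly random option every time."""
    rng = random.Random(seed)
    total_reward = 0
    for _ in range(steps):
        arm = rng.randrange(len(click_probs))
        total_reward += 1 if rng.random() < click_probs[arm] else 0
    return total_reward


values, bandit_reward = run_epsilon_greedy(CLICK_PROBS)
random_reward = run_random_policy(CLICK_PROBS)
best_option = max(range(len(values)), key=values.__getitem__)

print("Estimated engagement per option:", [round(v, 2) for v in values])
print("Best option found:", best_option)
print("Total engagement (learned vs random):", bandit_reward, random_reward)
```

A production personalisation system would also condition its choice on user and context features, which this sketch leaves out to keep the feedback loop easy to follow.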
How to Prepare for Machine Learning Case Studies Interview
All the case studies mentioned in this blog will help you explore how machine learning solves real problems across different industries. But if you are preparing for an interview and intend to show that you have mastered the art of implementing ML algorithms, do not stop after working on them; practice more such case studies in machine learning.
And if you have decided to dive deeper into machine learning, data science, and big data, be sure to check out ProjectPro, which offers a repository of solved projects in data science and big data. With a wide range of projects, you can explore different techniques and approaches and build your machine learning and data science skills. Our repository has a project for each one of you, irrespective of your academic and professional background. The customer-specific learning path is likely to help you find your way to making a mark in this newly emerging field. So why wait? Start exploring today and see what you can accomplish with big data and data science!
1. What is a case study in machine learning?
A case study in machine learning is an in-depth analysis of a real-world problem or scenario, where machine learning techniques are applied to solve the problem or provide insights. Case studies can provide valuable insights into the application of machine learning and can be used as a basis for further research or development.
2. What is a good use case for machine learning?
A good use case for machine learning is any scenario with a large and complex dataset and where there is a need to identify patterns, predict outcomes, or automate decision-making based on that data. It could include fraud detection, predictive maintenance, recommendation systems, and image or speech recognition, among others.
3. What are the 3 basic types of machine learning problems?
The three basic types of machine learning problems are supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, the algorithm is trained on labeled data. In unsupervised learning, the algorithm seeks to identify patterns in unlabeled data. In reinforcement learning, the algorithm learns through trial and error based on feedback from the environment.
4. What are the 4 basics of machine learning?
The four basics of machine learning are data preparation, model selection, model training, and model evaluation. Data preparation involves collecting, cleaning, and preparing data for use in training models. Model selection involves choosing the appropriate algorithm for a given task. Model training involves optimizing the chosen algorithm to achieve the desired outcome. Model evaluation consists of assessing the performance of the trained model on new data.
About the Author
Manika Nagpal is a versatile professional with a strong background in both Physics and Data Science. As a Senior Analyst at ProjectPro, she leverages her expertise in data science and writing to create engaging and insightful blogs that help businesses and individuals stay up-to-date with the
© 2024 Iconiq Inc.
Classification in machine learning and statistics is a supervised learning approach in which the computer program learns from the data given to it and makes new observations or classifications. In this article, we will learn about classification in machine learning in detail.
The following topics are covered in this blog:
- What is Classification in Machine Learning?
- Classification Terminologies In Machine Learning
- Logistic Regression
- Naive Bayes
- Stochastic Gradient Descent
- K-Nearest Neighbors
- Decision Tree
- Random Forest
- Artificial Neural Network
- Support Vector Machine
- Classifier Evaluation
- Algorithm Selection
- Use Case – MNIST Digit Classification
What is Classification In Machine Learning
Classification is the process of categorizing a given set of data into classes. It can be performed on both structured and unstructured data. The process starts with predicting the class of given data points. The classes are often referred to as targets, labels, or categories.
The classification predictive modeling is the task of approximating the mapping function from input variables to discrete output variables. The main goal is to identify which class/category the new data will fall into.
Let us try to understand this with a simple example.
Heart disease detection can be framed as a classification problem. It is a binary classification, since there can be only two classes: has heart disease or does not have heart disease. The classifier, in this case, needs training data to understand how the given input variables are related to the class. Once the classifier is trained accurately, it can be used to detect whether heart disease is present in a particular patient.
Since classification is a type of supervised learning, the targets are provided along with the input data. Let us get familiar with the terminology of classification in machine learning.
Classification Terminologies In Machine Learning
Classifier – It is an algorithm that is used to map the input data to a specific category.
Classification Model – The model draws a conclusion from the input data given for training and then predicts the class or category for new data.
Feature – A feature is an individual measurable property of the phenomenon being observed.
Binary Classification – It is a type of classification with two outcomes, e.g. either true or false.
Multi-Class Classification – Classification with more than two classes. In multi-class classification, each sample is assigned to one and only one label or target.
Multi-label Classification – This is a type of classification where each sample is assigned to a set of labels or targets.
Initialize – It is to assign the classifier to be used for the
Train the Classifier – Each classifier in scikit-learn uses the fit(X, y) method to fit the model to the training data X and the training labels y.
Predict the Target – For an unlabeled observation X, the predict(X) method returns the predicted label y.
Evaluate – This means evaluating the model, e.g. with a classification report or an accuracy score.
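The initialize, train, predict, and evaluate steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the article's original code: the bundled iris dataset, the choice of KNeighborsClassifier, and the 80/20 split are assumptions made only to keep the example small and runnable.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: a small bundled dataset stands in for real project data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Initialize: choose the classifier to use.
clf = KNeighborsClassifier(n_neighbors=5)

# Train: fit the model to the training features and labels.
clf.fit(X_train, y_train)

# Predict: obtain labels for observations the model has not seen.
y_pred = clf.predict(X_test)

# Evaluate: compare the predictions with the held-out labels.
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(f"accuracy: {accuracy:.3f}")
print(report)
```

Any other scikit-learn classifier can be swapped in at the initialize step, since they share the same fit and predict methods.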
Types Of Learners In Classification
Lazy Learners – Lazy learners simply store the training data and wait until testing data appears. Classification is done using the most related data in the stored training data. They take more time to predict than eager learners. E.g. k-nearest neighbor, case-based reasoning.
Eager Learners – Eager learners construct a classification model from the given training data before receiving data for predictions. The model must be able to commit to a single hypothesis that will work for the entire space. Because of this, eager learners take a lot of time in training and less time for a prediction. E.g. Decision Tree, Naive Bayes, Artificial Neural Networks.
Classification Algorithms
In machine learning, classification is a supervised learning concept that categorizes a set of data into classes. The most common classification problems are speech recognition, face detection, handwriting recognition, and document classification. A problem can be either binary or multi-class. There are many machine learning algorithms for classification. Let us take a look at them.
Logistic Regression
It is a classification algorithm in machine learning that uses one or more independent variables to determine an outcome. The outcome is measured with a dichotomous variable, meaning it has only two possible outcomes.
The goal of logistic regression is to find the best-fitting relationship between the dependent variable and a set of independent variables. Compared with other binary classification algorithms such as nearest neighbor, it has the advantage of quantitatively explaining the factors that lead to a classification.
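As a rough illustration of how logistic regression turns independent variables into a two-outcome prediction, the sketch below passes a weighted sum of the inputs through the sigmoid function and thresholds the result at 0.5. The weights, bias, and input values are hypothetical numbers chosen for the example, not fitted coefficients.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for a single observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients for two scaled predictors (illustration only).
weights = [0.8, 1.5]
bias = -1.0

p = predict_proba([0.5, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0  # threshold the probability to get a class
print(round(p, 3), label)
```

The sign and size of each weight indicate how strongly that variable pushes the prediction towards one class, which is where the quantitative explanation comes from.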
Advantages and Disadvantages
Logistic regression is specifically meant for classification. It is useful for understanding how a set of independent variables affects the outcome of the dependent variable.
The main disadvantages of the logistic regression algorithm are that, in its basic form, it only works when the predicted variable is binary, it assumes that the data is free of missing values, and it assumes that the predictors are independent of each other.
Use Cases
- Identifying risk factors for diseases
- Word classification
- Weather prediction
- Voting applications
Naive Bayes Classifier
It is a classification algorithm based on Bayes’ theorem that assumes independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
Even if the features depend on each other, each of them is treated as contributing to the probability independently. A Naive Bayes model is easy to build and is particularly useful for comparatively large data sets. Even with its simplistic approach, Naive Bayes often performs surprisingly well and can outperform more sophisticated classification methods. Bayes’ theorem, on which the classifier is built, is:
P(c | x) = P(x | c) · P(c) / P(x)
where P(c | x) is the posterior probability of class c given the features x, P(x | c) is the likelihood of the features given the class, P(c) is the prior probability of the class, and P(x) is the prior probability of the features.
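To make Bayes’ theorem and the independence assumption concrete, here is a small worked calculation. All probabilities are made-up numbers for a hypothetical spam filter; the function names are illustrative and not part of any library.

```python
import math


def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(class | x) = P(x | class) * P(class) / P(x)."""
    return likelihood * prior / evidence


def naive_likelihood(feature_likelihoods):
    """Naive assumption: features are independent given the class,
    so the joint likelihood is the product of per-feature likelihoods."""
    return math.prod(feature_likelihoods)


# Hypothetical numbers, chosen only for illustration.
p_spam = 0.3              # P(spam)
p_word_given_spam = 0.6   # P(word appears | spam)
p_word_given_ham = 0.05   # P(word appears | not spam)

# Total probability of seeing the word in any email.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word)
print(round(p_spam_given_word, 3))
```

With several features, naive_likelihood would supply the likelihood term by multiplying the per-feature values for each class.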
Advantages and Disadvantages
The Naive Bayes classifier requires only a small amount of training data to estimate the necessary parameters. Naive Bayes classifiers are also extremely fast compared to other classifiers.
The main disadvantage is that Naive Bayes is known to be a poor estimator, so its predicted probabilities should not be taken too literally.
Use Cases
- Disease predictions
- Document classification
- Spam filters
- Sentiment analysis
Stochastic Gradient Descent
It is a very effective and simple approach to fitting linear models. Stochastic Gradient Descent is particularly useful when the number of samples is large. It supports different loss functions and penalties for classification.
Stochastic gradient descent refers to calculating the derivative from each training data instance and applying the update immediately.
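The per-instance update described above can be sketched in plain Python. This is a simplified illustration that fits a linear model with a logistic loss; the learning rate, epoch count, and toy data are assumptions, and it omits the penalties and other options a library implementation would offer.

```python
import math
import random


def sgd_logistic(data, labels, lr=0.1, epochs=100, seed=0):
    """Fit weights and bias by updating after every single training instance."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0])
    b = 0.0
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the instances in a new random order
        for i in order:
            x, y = data[i], labels[i]
            z = b + sum(wj * xj for wj, xj in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            error = p - y  # derivative of the log loss with respect to z
            # Update immediately, using only this one instance.
            w = [wj - lr * error * xj for wj, xj in zip(w, x)]
            b -= lr * error
    return w, b


# Toy one-feature data, centred around zero (illustration only).
data = [[-1.5], [-0.5], [0.5], [1.5]]
labels = [0, 0, 1, 1]

w, b = sgd_logistic(data, labels)
preds = [1 if b + w[0] * x[0] >= 0 else 0 for x in data]
print(preds)
```

Because every update uses a single instance, features on very different scales pull the weights unevenly, which is why the method is sensitive to feature scaling.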
Advantages and Disadvantages
The main advantages are ease of implementation and efficiency. A major drawback of stochastic gradient descent is that it requires a number of hyper-parameters and is sensitive to feature scaling.
Use Cases
- Internet of Things
- Updating parameters such as weights in neural networks or coefficients in linear regression
K-Nearest Neighbor
It is a lazy learning algorithm that stores all instances of the training data as points in n-dimensional space. It is called lazy because it does not construct a general internal model; instead, it simply stores the training instances.
Classification is computed from a simple majority vote of the k nearest neighbors of each point. The algorithm is supervised: it takes a set of labeled points and uses them to label other points. To label a new point, it looks at the labeled points closest to that new point, known as its nearest neighbors. Those neighbors vote, and whichever label most of the neighbors have becomes the label for the new point. The “k” is the number of neighbors it checks.
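The majority vote described above fits in a few lines of plain Python. The sketch assumes Euclidean distance and uses a handful of made-up 2-D points; a library implementation would add faster neighbor search and tie handling.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    # Distance from the query to every stored training point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Labels of the k closest points.
    nearest = [label for _, label in distances[:k]]
    # The most frequent label among them wins the vote.
    return Counter(nearest).most_common(1)[0][0]


# Made-up labeled points forming two loose groups (illustration only).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2.0, 2.0)))
print(knn_predict(points, labels, (6.0, 7.0)))
```

Note that all the work happens at prediction time, which is what makes the method lazy and its prediction cost comparatively high.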
Advantages And Disadvantages
This algorithm is quite simple to implement and is robust to noisy training data. It can also be effective when the training data is large. The disadvantages of the KNN algorithm are that the value of K has to be determined and that the computation cost is high compared to other algorithms, because distances to the stored training instances are computed at prediction time.
Use Cases
- Industrial applications to look for similar tasks in comparison to others
- Handwriting detection applications
- Image recognition
- Video recognition
- Stock analysis
Decision Tree
The decision tree algorithm builds the classification model in the form of a tree structure. It uses if-then rules that are mutually exclusive and exhaustive for classification. The process breaks the data down into smaller and smaller subsets while an associated decision tree is incrementally developed. The final structure looks like a tree with nodes and leaves. The rules are learned sequentially, one at a time, from the training data. Each time a rule is learned, the tuples covered by the rule are removed. The process continues on the training set until the termination condition is met.
The tree is constructed in a top-down, recursive, divide-and-conquer manner. A decision node has two or more branches, and a leaf represents a classification or decision. The topmost node in the decision tree, which corresponds to the best predictor, is called the root node. The best thing about a decision tree is that it can handle both categorical and numerical data.
Advantages and Disadvantages
A decision tree has the advantage of being simple to understand and visualize, and it requires very little data preparation. The disadvantage is that the algorithm can create complex trees that may not categorize new data well. Decision trees can also be quite unstable, because even a small change in the data can alter the whole structure of the tree.
Use Cases
- Data exploration
- Pattern recognition
- Option pricing in finance
- Identifying disease and risk threats
Random Forest
Random decision trees, or random forests, are an ensemble learning method for classification, regression, and other tasks. A random forest operates by constructing a multitude of decision trees at training time and outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
A random forest is a meta-estimator that fits a number of trees on various subsamples of the data set and then uses averaging to improve the model’s predictive accuracy. The sub-sample size is the same as the original input size, but the samples are typically drawn with replacement.
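The two ideas above, drawing same-size samples with replacement and taking the mode of the individual predictions, can be sketched as follows. This is a deliberately simplified stand-in: each "tree" is a one-split decision stump on a single feature, and the random feature selection that a real random forest also performs is left out. The data and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (same size as the input)."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows):
    """Fit a one-split rule on (x, label) rows with labels in {0, 1}.

    Every observed x is tried as a threshold, and the split with the
    fewest training errors is kept.
    """
    best = None
    for threshold, _ in rows:
        for left_label in (0, 1):
            errors = sum(
                (left_label if x <= threshold else 1 - left_label) != y
                for x, y in rows
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label)
    _, threshold, left_label = best
    return threshold, left_label


def stump_predict(stump, x):
    threshold, left_label = stump
    return left_label if x <= threshold else 1 - left_label


def forest_predict(stumps, x):
    """Output the mode of the individual predictions."""
    votes = Counter(stump_predict(stump, x) for stump in stumps)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
# Made-up one-feature data: small x values are class 0, large ones class 1.
rows = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]

stumps = [fit_stump(bootstrap_sample(rows, rng)) for _ in range(25)]
print(forest_predict(stumps, 1.2), forest_predict(stumps, 4.2))
```

Individual stumps may place their split differently depending on which rows their sample happened to contain; the vote smooths out those differences.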
Advantages and Disadvantages
The advantage of the random forest is that it is generally more accurate than a single decision tree because it reduces over-fitting. The main disadvantages of random forest classifiers are that they are quite complex to implement and can be slow in real-time prediction.
Use Cases
- Industrial applications such as finding out whether a loan applicant is high-risk or low-risk
- Predicting the failure of mechanical parts in automobile engines
- Predicting social media share scores
- Performance scores
Artificial Neural Networks
A neural network consists of neurons that are arranged in layers. The layers take an input vector and convert it into an output. Each neuron takes its input, applies a function to it, often a non-linear one, and passes the output to the next layer.
In general, the network is feed-forward, meaning that a unit or neuron feeds its output to the next layer, and there is no feedback to the previous layer.
Weights are applied to the signals passing from one layer to the next, and these weights are tuned in the training phase to adapt the neural network to a particular problem statement.
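A single feed-forward pass through a tiny network can be sketched like this. The layer sizes, weights, and the choice of tanh and sigmoid as the non-linear functions are assumptions for illustration; training, which would tune the weights, is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases, activation):
    """One layer: each neuron applies activation(weighted sum of inputs + bias)."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


# Hypothetical, untrained weights: 2 inputs -> 2 hidden neurons -> 1 output.
hidden_w = [[0.5, -0.2], [0.3, 0.8]]
hidden_b = [0.1, -0.1]
output_w = [[1.0, -1.0]]
output_b = [0.0]

x = [1.0, 2.0]
hidden = dense_layer(x, hidden_w, hidden_b, math.tanh)      # input -> hidden
output = dense_layer(hidden, output_w, output_b, sigmoid)   # hidden -> output
print(hidden, output)
```

Each layer only receives values from the layer before it, so information flows in one direction, from input to output.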
Advantages and Disadvantages
A neural network has a high tolerance for noisy data and is able to classify patterns it was not trained on. It performs better with continuous-valued inputs and outputs. The disadvantage of artificial neural networks is that they are harder to interpret than other models.
Use Cases
- Handwriting analysis
- Colorization of black and white images
- Computer vision processes
- Captioning photos based on facial features
Support Vector Machine
The support vector machine is a classifier that represents the training data as points in space, separated into categories by a gap that is as wide as possible. New points are then mapped into the same space and assigned a category according to which side of the gap they fall on.
Advantages and Disadvantages
It uses a subset of the training points in the decision function, which makes it memory efficient, and it is highly effective in high-dimensional spaces. The main disadvantage of the support vector machine is that the algorithm does not directly provide probability estimates.
Use Cases
- Business applications for comparing the performance of a stock over a period of time
- Investment suggestions
- Classification of applications requiring accuracy and efficiency
Classifier Evaluation
The most important step after building any classifier is evaluating it to check its accuracy and efficiency. There are many ways to evaluate a classifier. Let us take a look at the methods listed below.
Holdout Method
This is the most common method to evaluate a classifier. In this method, the given data set is divided into two parts, a train set and a test set, typically 80% and 20% of the data respectively.
The train set is used to train the model and the unseen test set is used to test its predictive power.
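A holdout split can be sketched in a few lines. The helper below is hypothetical and uses only the standard library; the rows are shuffled first so that the split does not depend on the order in which the data was stored.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 labeled observations
train, test = holdout_split(rows)
print(len(train), len(test))
```

The test rows must stay out of training entirely, otherwise the evaluation no longer measures performance on unseen data.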
Cross-Validation
Over-fitting is a common problem in machine learning models. K-fold cross-validation can be used to check whether a model is over-fitted.
In this method, the data set is randomly partitioned into k mutually exclusive subsets of (approximately) equal size. In each round, one subset is kept for testing and the others are used to train the model. The process is repeated until each of the k folds has served as the test set.
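The partitioning described above can be sketched as a generator of train and test index lists. The function name is hypothetical, and when the number of samples is not divisible by k the fold sizes differ by at most one.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k mutually exclusive folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(10, 5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), len(train_idx))
```

A model would be trained and scored once per split, and the k scores compared or averaged; a large gap between training and test scores is a sign of over-fitting.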
Classification Report
A classification report lists the precision, recall, and F1-score for each class. The sample discussed here comes from an SVM classifier trained on a cancer_data dataset.
Accuracy
Accuracy is the ratio of correctly predicted observations to the total number of observations.
True Positive: The number of correct predictions that the occurrence is positive.
True Negative: The number of correct predictions that the occurrence is negative.
F1-Score
It is the harmonic mean of precision and recall.
Precision And Recall
Precision is the fraction of relevant instances among the retrieved instances, while recall is the fraction of relevant instances that have been retrieved out of the total number of relevant instances. They are used as measures of relevance.
ROC Curve
The receiver operating characteristic (ROC) curve is used for visual comparison of classification models. It shows the relationship between the true positive rate and the false positive rate. The area under the ROC curve is a measure of how well the model separates the classes.
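The accuracy, precision, recall, and F1 definitions above can be checked with a short calculation from true and predicted labels. The labels are made up for the example, and the helper names are illustrative rather than taken from a library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, true negatives, false negatives."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 4 positives and 6 negatives (illustration only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here one positive is missed and one negative is wrongly flagged, so precision and recall are both 3 out of 4 while accuracy is 8 out of 10.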
Algorithm Selection
Apart from the above approach, we can follow these steps to choose the best algorithm for the model:
- Read the data
- Create dependent and independent data sets based on our dependent and independent features
- Split the data into training and testing sets
- Train the model using different algorithms such as KNN, Decision Tree, SVM, etc.
- Evaluate each classifier
- Choose the classifier with the highest accuracy
Although comparing several algorithms takes extra time, choosing the one that scores best on the evaluation is a reliable way to arrive at a well-performing model.
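The selection steps above might look like this with scikit-learn. It is a sketch under assumptions: the bundled breast cancer dataset stands in for your own data, default hyper-parameters are used, and feature scaling is added for KNN and SVM because both are distance based.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Read the data and separate the features (X) from the target (y).
X, y = load_breast_cancer(return_X_y=True)

# Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Train the model using different algorithms.
candidates = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

# Evaluate each classifier on the held-out test set.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy

# Choose the classifier with the highest accuracy.
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

On imbalanced data, accuracy alone can be misleading, so the same loop could record precision, recall, or F1 instead.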
Use Case – MNIST Digit Classification
Let us take a look at the MNIST data set and use two different algorithms to check which one suits the model best.
What is MNIST?
It is a set of 70,000 small images of handwritten digits, each labeled with the digit it represents. Each image is 28×28 pixels and therefore has 784 features, where each feature represents the intensity of one pixel.
We will make a digit predictor using the MNIST dataset with the help of different classifiers.
Loading the MNIST dataset
Exploring The Dataset
Splitting the Data
We use the first 6,000 entries as the training set and 1,000 entries as the test set, although the full dataset has 70,000 entries. You can check this using the shape of X and y. Working with a subset keeps the model memory efficient.
Shuffling The Data
To avoid unwanted effects from the order in which the samples are stored, we shuffle the data using NumPy. Shuffling helps ensure that every cross-validation fold contains a similar mix of digits.
Creating A Digit Predictor Using Logistic Regression
Cross-Validation
Creating A Predictor Using Support Vector Machine
In the above example, we were able to make a digit predictor. Since we were predicting whether a digit is a 2, the prediction for the entry we tested came out as False with both classifiers. The cross-validation, however, shows much better accuracy with the logistic regression classifier than with the support vector machine classifier.
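The use case steps above (load, shuffle, split, train an "is it a 2?" predictor with two classifiers, and cross-validate) can be sketched as follows. This is not the article's original code. To stay small and runnable offline it uses scikit-learn's bundled 8×8 digits dataset as a stand-in for MNIST, so the image size, the split sizes, and the resulting scores all differ from the article's.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in data (assumption): 1,797 images of 8x8 pixels, i.e. 64 features each.
X, y = load_digits(return_X_y=True)
print(X.shape, y.shape)  # explore the dataset

# Shuffle so the split does not depend on the stored order.
rng = np.random.default_rng(0)
order = rng.permutation(len(X))
X, y = X[order], y[order]

# Split: the sizes are arbitrary choices for this smaller dataset.
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

# Binary target: is the digit a 2?
y_train_2 = (y_train == 2)
y_test_2 = (y_test == 2)

log_clf = LogisticRegression(max_iter=5000)
svm_clf = SVC()

# Cross-validated accuracy on the training set for both classifiers.
log_scores = cross_val_score(log_clf, X_train, y_train_2, cv=3, scoring="accuracy")
svm_scores = cross_val_score(svm_clf, X_train, y_train_2, cv=3, scoring="accuracy")

# Fit on the full training set and predict a single test image.
log_clf.fit(X_train, y_train_2)
svm_clf.fit(X_train, y_train_2)
print(log_clf.predict(X_test[:1]), svm_clf.predict(X_test[:1]))
print(log_scores.mean(), svm_scores.mean())
```

Because only about one digit in ten is a 2, a classifier that always answers False already reaches roughly 90% accuracy, so the two cross-validation scores should be read against that baseline.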
This brings us to the end of this article where we have learned Classification in Machine Learning. I hope you are clear with all that has been shared with you in this tutorial.
16 Real World Case Studies of Machine Learning
A decade ago, few would have guessed how much attention the term “Machine Learning” would attract in the years to come. From entertainment to everyday needs to complex data handling and statistics, Machine Learning now plays a part in all of it, and its reach is not limited to basic necessities and entertainment.
The technology plays a pivotal role in areas such as data retrieval, database consistency, and spam detection, along with a wide range of other applications. Many articles teach the basic concepts of Machine Learning, but learning becomes more fun when we see it working in practice.
With this in mind, PythonGeeks brings you an article on real-life case studies of Machine Learning that show its advancement in various fields. We will talk about the merits of Machine Learning in the field of technology as well as in Life Science and Biology. So, without further delay, let us look at these case studies and get to know a bit more about Machine Learning.
Machine Learning Case Studies in Technology
1. Machine Learning Case Study on Dell
We are all aware of Dell, the multinational technology leader. The company empowers people and communities across the globe by providing software and hardware at affordable prices. Data is central to Dell’s business, and its marketing team needed a data-driven solution that could boost response rates and show why certain words and phrases outperform others.
Dell partnered with Persado, a leading technology company that uses AI and ML to generate marketing creative, in order to harness the power of words in its email channel and gather data-driven analytics for each of its key audiences.
As a result of this partnership, Dell saw a 50% average increase in CTR and a 46% average increase in responses from its customer engagement. It also saw a 22% average increase in page visits and a 77% average increase in add-to-cart orders.
Encouraged by this success with email, Dell wanted to extend Persado across its entire marketing platform for more profit and audience engagement. Dell now uses machine learning algorithms to improve the marketing copy of its promotional and lifecycle emails. Its management also deploys Machine Learning models for Facebook ads, display banners, direct mail, and even radio content to reach more of its target audience.
2. Machine Learning Case Study on Sky
Sky UK is a British telecommunications company that is transforming customer experiences with machine learning and artificial intelligence algorithms, using Adobe Sensei.
Explaining why the company deployed a Machine Learning model, the Head of Digital Decisioning and Analytics at Sky UK once stated that the company has 22.5 million very diverse customers. Even attempting to divide people by their favorite television genre can result in pretty broad segments.
This will result in the following outcomes:
- Creating hyper-focused segments to engage customers.
- Usage of machine learning to deliver actionable intelligence.
- Improvement in the relationships with customers.
- Applying AI learnings across channels to understand what matters to customers.
With the help of machine learning frameworks, the company was able to analyze large volumes of customer information efficiently. With the deployment of Machine Learning models, it could recommend to each customer the products and services that resonated most with them.
McLaughlin once stated that people think of machine learning as a tool for delivering experiences that are strictly defined and very robotic, but it is actually the other way round. With Adobe Sensei, Sky’s management was drawing a line between customer intelligence and personalized experiences that are valuable and appropriate for its customers.
3. Machine Learning Case Study on Trendyol
Trendyol is one of the leading e-commerce companies based in Turkey. It once faced threats from global competitors like Adidas and ASOS, particularly for its sportswear sales and audience engagement.
To gain customer loyalty and improve its email marketing, Trendyol partnered with the vendor Liveclicker, which specializes in real-time personalization.
Trendyol used machine learning and artificial intelligence algorithms to create several highly personalized marketing campaigns based on the interests of particular target audiences. The aim was not only to give the campaigns a personal touch but also to work out which messages would be most relevant to which set of customers. It also came up with an offer for a football jersey with the recipient’s name printed on the back to raise the level of personalization and grab the consumer’s attention.
Thanks to this one-to-one personalization, the retailer’s open rates, click-through rates, and conversions were high, and its sales reached all-time highs. The campaigns produced a 30% increase in click-through rates for Trendyol, a 62% growth in response rates, and a striking 130% increase in conversion rates.
4. Machine Learning Case Study On Harley Davidson
In today’s world it is difficult to break through with traditional marketing. For an emerging business like Harley Davidson NYC, Albert (an artificial intelligence-powered robot) has a lot of appeal as a way to grow the company and its popularity. Powered by machine learning and artificial intelligence algorithms, robots are writing news stories, opening new dimensions, working in hotels, managing traffic, and even running McDonald’s outlets.
Albert can be used in various marketing channels, including social media and email campaigns. The software predicts which consumers are most likely to convert and adjusts the creative copy on its own for the benefit of the campaign.
Harley Davidson is the only brand to date that uses Albert to its advantage. The company analyzed customer data to find strong patterns in the behavior of previous customers whose actions were positive, such as making a purchase or spending more than the average amount of time browsing the website. With this analysis, Albert divides customers into segments and scales up the test campaigns according to their interests and engagement.
Once the company had deployed Albert, Harley Davidson saw its sales increase by 40%. The brand also saw a 2,930% increase in leads, with 50% of those coming from high-converting ‘lookalikes’ identified by artificial intelligence and machine learning using Albert.
5. Machine Learning Case Study on Yelp
Yelp may not be the first name that comes to mind as a tech company. However, it makes effective use of machine learning to improve its users’ experience.
Yelp’s machine learning algorithms help the company’s human staff collect, categorize, and label images more efficiently and precisely. Since images are almost as important to Yelp as the user reviews themselves, the company is always trying to improve its image processing so that it can analyze customer feedback constructively. With this assistance, the company now serves millions of users accurately and satisfactorily.
For an entire generation, taking photos of food has become second nature, which gives Yelp a huge database of photos for image processing. Its software analyzes each image to identify and classify extracted features based on color, texture, and shape. This means it can recognize the presence of, say, pizza, or whether a restaurant has outdoor seating, merely by analyzing the images provided as input.
As a result, the company can now predict attributes like ‘good for kids’ and ‘classy ambiance’ with more than 80% accuracy.
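To make the idea of classifying images from extracted features more concrete, here is a minimal sketch. It is not Yelp’s actual pipeline: the toy "images" (short lists of RGB pixels), the labels, and the helper names `mean_color`, `fit_centroids`, and `classify` are all assumptions made for illustration, and only a color feature is used (texture and shape are left out).

```python
from math import dist

def mean_color(pixels):
    """Return the average (R, G, B) of an image given as a list of pixel tuples."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def fit_centroids(labeled_images):
    """Average the color features of all training images that share a label."""
    grouped = {}
    for label, pixels in labeled_images:
        grouped.setdefault(label, []).append(mean_color(pixels))
    return {
        label: tuple(sum(f[c] for f in feats) / len(feats) for c in range(3))
        for label, feats in grouped.items()
    }

def classify(pixels, centroids):
    """Assign the label whose centroid is closest to the image's mean color."""
    feature = mean_color(pixels)
    return min(centroids, key=lambda label: dist(feature, centroids[label]))

# Hypothetical training data: reddish pixels for pizza, greenish for salad.
training = [
    ("pizza", [(200, 80, 40), (220, 120, 60)]),
    ("pizza", [(210, 90, 50), (190, 100, 70)]),
    ("salad", [(60, 180, 70), (90, 200, 80)]),
    ("salad", [(70, 170, 60), (80, 190, 90)]),
]
centroids = fit_centroids(training)
prediction = classify([(205, 95, 55), (215, 105, 45)], centroids)
print(prediction)  # pizza
```

A production system would use far richer features and a learned model, but the flow is the same: extract features, then map them to a class.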
6. Machine Learning Case Study on Tesla
Tesla is now a big name in the electric automobile industry, and it is very likely to remain a trending topic for years to come. It is widely known for its advanced, futuristic cars. The company states that its cars have their own AI hardware. Tesla is also using AI to build self-driving cars.
At the current state of the technology, cars are not yet completely autonomous and still need some human intervention. The company is working extensively on the decision-making algorithms that would help its cars become fully autonomous. It is currently working in partnership with NVIDIA on an unsupervised ML algorithm for this purpose.
This step by Tesla could be a game-changer in the field of automobiles and Machine Learning models for many reasons. The cars feed data directly to Tesla’s cloud storage to avoid data leakage. Each car sends the driver’s seating position, local traffic, and other valuable information to the cloud so that the next move of the car can be predicted precisely. The car is equipped with various internal and external sensors that collect this data for processing.
Machine Learning Case Studies in Life Science and Biology
7. Development of Microbiome Therapeutics
With advances in technology, we have studied and identified a vast number of microorganisms in our body, the so-called microbiota, such as bacteria, fungi, viruses, and other single-celled organisms. All the genes of the microbiota are collectively known as the microbiome. These genes number in the trillions; for example, the bacteria in the human body have more than 100 times as many unique genes as humans do.
The microbiota in the human body have a massive influence on human health, and imbalances in them can lead to disorders like Parkinson’s disease or inflammatory bowel disease. It is also presumed that such imbalances may cause several autoimmune diseases if left untreated. Microbiome research is therefore a very active research area, and Machine Learning models can help in handling it effectively.
In order to influence the microbiota and develop microbiome therapeutics that reverse the diseases they cause, we need to understand the microbiota’s genes and their influence on our body. With today’s gene sequencing capabilities, terabytes of data are available; however, we cannot use this data yet because it has not been analyzed.
8. Predicting Heart Failure in Mobile Health
Heart failure typically leads to emergency or hospital admission and may even be fatal in some situations. As the population ages, the share of people with heart failure is expected to increase.
People who suffer from heart failure usually have pre-existing illnesses that go undiagnosed and lead to fatal ailments. It is therefore common to use telemedicine systems to monitor and consult a patient, and to collect and transmit valuable mobile health data such as blood pressure, body weight, or heart rate.
Most prediction and prevention systems are currently built on fixed rules: when specific vital-sign measurements exceed a predefined threshold, the patient is alerted even before any ailment has been diagnosed. Such a system naturally produces a high number of false alerts, because vital readings fluctuate for reasons that are not serious.
Because of the way these algorithms are programmed, alerts mostly lead to hospital admission. Too many false alerts therefore increase health costs and erode the patient’s confidence in the predictions, which defeats the purpose of the algorithms. Eventually, the patient will stop following the recommendation to seek medical help even when the algorithm alerts them to a fatal ailment.
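The fixed-rule approach described above can be sketched in a few lines. The readings, the limit of 110 beats per minute, and the function name `threshold_alerts` are hypothetical values chosen for illustration, not clinical guidance.

```python
def threshold_alerts(readings, limit):
    """Return the indices of readings that exceed a fixed limit."""
    return [i for i, value in enumerate(readings) if value > limit]

# Hypothetical heart-rate series (beats per minute): one brief spike at
# index 2, followed later by a sustained rise at indices 5 to 7.
heart_rate = [72, 75, 118, 74, 73, 121, 125, 130, 76]
alerts = threshold_alerts(heart_rate, limit=110)
print(alerts)  # [2, 5, 6, 7]
```

The rule treats the isolated spike at index 2 exactly like the sustained episode, which is how harmless fluctuations turn into false alerts.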
So, a Naïve Bayes classifier has been developed to reduce the chance of false positives. It uses the patient’s baseline data (age, gender, smoker or not, pacemaker or not), measurements such as sodium, potassium, or hemoglobin concentrations in the blood, monitored characteristics such as heart rate, body weight, and systolic and diastolic blood pressure, and questionnaire answers about well-being or physical activity.
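As a rough illustration of how a Naïve Bayes classifier combines several measurements, here is a minimal Gaussian Naïve Bayes written from scratch. The case study does not say which Naïve Bayes variant was used, so the Gaussian form is an assumption, and the three features, the eight patient rows, and the labels "stable" and "alert" are invented for this sketch.

```python
from math import log, pi
from statistics import mean, pvariance

def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature (mean, variance) for each class."""
    model = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        stats = []
        for column in zip(*members):
            # A small constant keeps the variance strictly positive.
            stats.append((mean(column), pvariance(column) + 1e-9))
        model[cls] = (len(members) / len(rows), stats)
    return model

def predict(model, row):
    """Return the class with the highest log posterior for one patient."""
    def log_posterior(cls):
        prior, stats = model[cls]
        score = log(prior)
        for x, (mu, var) in zip(row, stats):
            score += -0.5 * log(2 * pi * var) - (x - mu) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Hypothetical rows: (heart rate in bpm, systolic pressure in mmHg,
# weight change in kg). None of these values come from the case study.
X = [(70, 120, 0.1), (75, 125, 0.0), (68, 118, -0.2), (72, 122, 0.3),
     (110, 95, 2.5), (118, 90, 3.0), (105, 100, 2.0), (112, 92, 2.8)]
y = ["stable", "stable", "stable", "stable", "alert", "alert", "alert", "alert"]

model = fit_gaussian_nb(X, y)
print(predict(model, (73, 121, 0.2)))   # stable
print(predict(model, (115, 93, 2.7)))   # alert
```

Unlike a single fixed threshold, the classifier weighs all features together, so one unusual reading does not trigger an alert by itself.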
9. Mental Health Prediction, Diagnosis, and Treatment
With an estimated 10% or more of the global population living with a mental disorder, it is high time to take preventive measures in this field. Economic losses due to mental illness add up to nearly $10 trillion.
Mental disorders include a large variety of ailments such as anxiety, depression, and substance use disorder. Other prime examples include opioid addiction, bipolar disorder, schizophrenia, and eating disorders, all of which pose a high risk to human resources.
As a result, detecting mental disorders and intervening as early as possible is critical in order to reduce the loss of precious resources. There are two main approaches to deploying Machine Learning models for detecting mental disorders: apps for consumers that detect mental diseases, and tools for psychiatrists that support the diagnosis of their patients.
The consumer apps are typically conversational chatbots enhanced with machine learning algorithms that help users reduce anxiety or panic attacks. The app analyzes behavioral traits, such as the user’s spoken language, and recommends help accordingly. Because the recommendations must be based strictly on scientific evidence, the interaction and the response to proposals, as well as the individual language patterns of both the chatbot and the user, must be predicted as precisely as possible.
10. Research Publication and Database Scanning for Bio-Markers for Stroke
Stroke is one of the major causes of disability and death among the older generations. An adult’s lifetime risk of having a stroke is about 25%. However, stroke is a very heterogeneous disorder. Individualized pre-stroke and post-stroke care is therefore critical for successful treatment.
To determine this individualized care, the person’s phenotype, that is, the observable characteristics of the person, should be chosen wisely. We usually achieve this with biomarkers. A biomarker is a measurable data point that lets us stratify patients. Examples of biomarkers are disease severity scores, lifestyle characteristics, or genomic properties.
Many recognized biomarkers have already been published or stored in databases. In addition, hundreds of scientific publications appear daily on the detection of biomarkers for all kinds of diseases.
11. 3D Bioprinting
Bioprinting is another trending topic in biotechnology. It works from a digital blueprint: the printer uses cells and natural or synthetic biomaterials (also called bio-inks) to print living tissues layer by layer, such as skin, organs, blood vessels, or bones, that replicate real tissues exactly.
As an alternative to depending on organ donations, we can produce these tissues in printers more ethically and cost-effectively. We can also perform drug tests on the synthetically built tissue rather than on animals or humans. The technology is still emerging and at an early stage of maturity because of its high complexity. Data science is one of the most crucial tools for coping with this complexity.
12. Supply Chain Optimization
Producing drugs takes time, especially for today’s high-tech cures that rely on specific substances and production methods. In addition, the whole process has to be broken down into many different steps, several of which are outsourced to specialist providers.
We currently observe this with COVID-19 vaccine production as well. The vaccine inventors deliver the blueprint for the vaccine. Production then happens in the plants of companies specialized in sterile production. The production unit delivers the vaccine in tanks to companies that fill it into small doses under clinical conditions, and finally another company handles the supply.
The complete planning is a highly complicated system: having the right input substances available at the right time, having adequate production capacity, and finally storing the exact amount of drugs needed to serve demand. Moreover, this must be managed for hundreds or thousands of therapies, each with its specific conditions.
13. AES On Google Cloud AutoML Vision
The AES Corporation is a power generation and distribution company. It generates and sells power that consumers use for utilities and industrial work. It relies on Google Cloud on its way to making renewable energy more efficient. AES uses Google AutoML Vision to review images of wind turbine blades and analyze their maintenance needs in advance.
Outcomes of this case study:
- It reduces image review time by approximately 50%
- It helps in reducing the prices of renewable energy
- This results in more time to invest in identifying wind turbine damage and mending it
14. Bayer AG on AWS SageMaker
Bayer AG is a multinational pharmaceutical and life sciences company based in Germany. One of its key business areas is the production of insecticides, fungicides, and herbicides for agricultural purposes.
To help farmers monitor their crops, the company built the Digital Yellow Trap: an Internet of Things (IoT) device that uses image recognition to alert farmers to pests on their farmland.
Outcomes of this case study:
- It helps in reducing Bayer lab’s architecture costs by 94%
- We can scale it to accommodate for fluctuating demand
- It is able to handle tens of thousands of requests per second
- It supports community-based early warning
15. American Cancer Society on Google Cloud ML Engine
The American Cancer Society is a nonprofit organization dedicated to eradicating cancer. It operates more than 250 regional offices across America.
It uses the Google Cloud ML Engine to identify novel patterns in digital pathology images. The aim is to improve breast cancer detection accuracy, shorten the overall diagnosis timeline, and keep costs under control.
Outcomes of this use case:
- It helps in enhancing the speed and accuracy of image analysis by removing human limitations
- It even aids in improving patients’ quality of life and life expectancy
- This helps to protect tissue samples by backing up image data to the cloud
16. Road Safety Commission of Western Australia
The Road Safety Commission of Western Australia operates under the Western Australia Police Force. It is responsible for tracking road accidents and making the roads safer by taking adequate precautions.
To achieve its safety strategy “Towards Zero 2008-2020”, which aims to reduce road fatalities by 40%, the commission relies on machine learning, artificial intelligence, and advanced analytics for precise and reliable results.
Outcomes of this case study:
- It reduced data engineering and visualization time by 80%
- It has achieved an estimated 25% reduction in vehicle crashes
- It is based on straightforward and efficient data sharing
- It works flexibly with data across various coding languages
With this, we have seen various case studies carried out so far in the field of Machine Learning. PythonGeeks curated this list to help readers understand how Machine Learning models are deployed in the real world. The article covers a range of real uses of Machine Learning. You can study these cases to get to know Machine Learning a bit better and even try to find improvements to the existing solutions.
Comparative study of a newly proposed machine learning classification to detect damage occurrence in structures
Index terms: Applied computing; Physical sciences and engineering; Computing methodologies; Machine learning; Learning paradigms; Supervised learning; Supervised learning by classification; Machine learning approaches; Classification and regression trees; Neural networks
Information
Published in: Pergamon Press, Inc., United States
Author tags:
- Structural health monitoring
- Damage detection
- Hybrid machine learning algorithm
- Stacking method
- Research-article
Identification of Pasture Degradation Using Remote Sensing Data and Machine Learning: A Case Study of Obichnik
1. Introduction
- Soc.: plants grow close to each other, joining with their aboveground parts (90–100%);
- Cop. 3: plants occur in very large numbers (75–90%);
- Cop. 2: plants occur in large numbers (50–75%);
- Cop. 1: plants occur in considerable numbers (25–50%);
- Cop. 1-Sp.: the species are relatively abundant (15–25%);
- Sp.: the species are abundant, but do not form a continuous cover (5–15%);
- Sol.-Sp.: low abundance (1–5%);
- Sol.: the species grow sparsely (<1%);
- Un.: the species occur in single instances.
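The percentage ranges in the scale above can be turned into a simple lookup. This is only a sketch: the function name `abundance_class` is hypothetical, the percentages are interpreted here as cover values, a value that falls exactly on a shared boundary (for example 75%) is assigned to the higher class by assumption, and the "Un." class is left out because it is defined by single occurrences rather than by a percentage.

```python
# Lower bound (in %) and label for each class with a percentage range.
ABUNDANCE_SCALE = [
    (90, "Soc."),
    (75, "Cop. 3"),
    (50, "Cop. 2"),
    (25, "Cop. 1"),
    (15, "Cop. 1-Sp."),
    (5, "Sp."),
    (1, "Sol.-Sp."),
]

def abundance_class(cover_percent):
    """Map a cover percentage (0 to 100) to an abundance label."""
    if not 0 <= cover_percent <= 100:
        raise ValueError("cover must be between 0 and 100")
    for lower_bound, label in ABUNDANCE_SCALE:
        if cover_percent >= lower_bound:
            return label
    return "Sol."  # sparse growth, below 1%

print(abundance_class(60))   # Cop. 2
print(abundance_class(0.5))  # Sol.
```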
2. Materials and Methods
2.1. The Study Area
- Shrubs and trees (Figure 2a): since no grazing system is applied on the pasture, shrubs and trees are steadily increasing their share;
- Stones and rocks (Figure 2b): the pasture is surrounded by abandoned old houses and other constructions, which are the source of this type of degradation;
- Exposed soil and cattle tracks (Figure 2c): they are created mostly by the grazing animals themselves. One of the reasons for this is the slope of the terrain, which is up to 15%.
2.2. Data Acquisition and Data Analysis
2.3. Methodology of the Study
- Grass—this class corresponds to areas with no or insignificant degradation, which are covered by grass vegetation;
- Shrubs—this class corresponds to areas which are covered with bushes and in the long term with trees;
- Soil—this class represents areas with exposed soil and cattle tracks;
- Stones—this class represents areas which are covered with stones and rocks.
- Step 1. A UAV is used to make overlapping images of the investigated pasture;
- Step 2. The overlapping images are combined into one big high-quality (HQ) map, representing the investigated pasture area;
- Step 3. The HQ map is used to select numerous polygons (regions of interest) from each class, to be used for training and validation purposes;
- Step 4. The HQ map is used to generate a segmentation map, by manipulating three parameters of the pixels: spectral detail, spatial detail, and minimum segment size in pixels;
- Step 5. Using the combined HQ map, the training data, and the segmentation map, a classification map is created, using one of the available algorithms;
- Step 6. The accuracy of the classification map is assessed by using randomly selected pixels from the chosen polygons in Step 3 and comparing them with the reference class. Steps 4, 5, and 6 are repeated numerous times until the best-performing models are obtained;
- Step 7. Once the optimal classification models are obtained, they are used to evaluate and analyze the degradation of the pasture.
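The accuracy assessment in Step 6 compares predicted and reference classes for sampled pixels, and the results table reports Cohen's kappa. A minimal sketch of that calculation from a confusion matrix is given below. The matrix values are hypothetical and are not results from this study; rows are taken as reference classes and columns as predicted classes, in the order Grass, Shrubs, Soil, Stones.

```python
def overall_accuracy(matrix):
    """Share of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def cohens_kappa(matrix):
    """Agreement between reference and prediction, corrected for chance."""
    total = sum(sum(row) for row in matrix)
    observed = overall_accuracy(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Chance agreement: product of the marginal proportions, summed over classes.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical confusion matrix with 50 validation pixels per reference class.
confusion = [
    [45, 3, 1, 1],
    [4, 44, 1, 1],
    [2, 1, 46, 1],
    [1, 2, 2, 45],
]
accuracy = overall_accuracy(confusion)
kappa = cohens_kappa(confusion)
print(accuracy)          # 0.9
print(round(kappa, 3))   # 0.867
```

Kappa is lower than the raw accuracy because part of the agreement (25% in this example) would be expected by chance alone.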
3. Results and Discussion
4. Conclusions
Author Contributions, Institutional Review Board Statement, Informed Consent Statement, Data Availability Statement, Conflicts of Interest
Class | Number of ROI | Number of Pixels | Relative Area |
---|---|---|---|
Grass | 94 | 1,220,995 | 30.5% |
Shrubs | 92 | 2,626,642 | 65.6% |
Soil | 32 | 24,452 | 0.6% |
Stones | 62 | 132,413 | 3.3% |
Total | 280 | 4,004,502 | 100.0% |
No. | Classification Algorithm | Cohen’s Kappa |
---|---|---|
1 | Optimal object-based MxL | 0.53 |
2 | Optimal object-based RT | 0.86 |
3 | Optimal object-based SVM | 0.82 |
4 | Pixel-based MxL | 0.54 |
5 | Pixel-based RT | 0.41 |
6 | Pixel-based SVM | 0.43 |
Results (rows) / Reference (columns) | Grass | Stones | Soil | Shrubs | Precision | Recall | F-Score |
---|---|---|---|---|---|---|---|
Grass | 3052 | 0 | 10 | 343 | 0.896 | 0.993 | 0.942 |
Stones | 0 | 251 | 0 | 191 | 0.568 | 0.787 | 0.660 |
Soil | 0 | 0 | 47 | 89 | 0.346 | 0.825 | 0.487 |
Shrubs | 21 | 68 | 0 | 5922 | 0.985 | 0.905 | 0.943 |
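The precision, recall, and F-score columns can be recomputed from the counts in the confusion matrix above. The following is a minimal sketch, assuming rows hold the classification results and columns hold the reference classes; the function name `per_class_scores` is illustrative and not taken from the original study.

```python
classes = ["Grass", "Stones", "Soil", "Shrubs"]

# Counts copied from the confusion matrix above.
# Assumption: rows = classification results, columns = reference classes.
matrix = [
    [3052, 0, 10, 343],
    [0, 251, 0, 191],
    [0, 0, 47, 89],
    [21, 68, 0, 5922],
]


def per_class_scores(matrix):
    """Return a (precision, recall, F-score) tuple for every class."""
    scores = []
    for k in range(len(matrix)):
        true_positives = matrix[k][k]
        row_total = sum(matrix[k])                 # everything classified as class k
        col_total = sum(row[k] for row in matrix)  # everything that is class k in the reference
        precision = true_positives / row_total
        recall = true_positives / col_total
        f_score = 2 * precision * recall / (precision + recall)
        scores.append((precision, recall, f_score))
    return scores


scores = per_class_scores(matrix)
for name, (p, r, f) in zip(classes, scores):
    print(f"{name:<7} precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Under that row/column assumption, the rounded values agree with the three rightmost columns of the table. The low precision for Soil comes from the 89 reference Shrubs pixels that were classified as Soil.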
Results (rows) / Reference (columns) | Grass | Stones | Soil | Shrubs | Precision | Recall | F-Score |
---|---|---|---|---|---|---|---|
Grass | 3020 | 8 | 10 | 413 | 0.875 | 0.983 | 0.926 |
Stones | 0 | 259 | 0 | 387 | 0.401 | 0.812 | 0.537 |
Soil | 0 | 3 | 39 | 0 | 0.929 | 0.684 | 0.788 |
Shrubs | 53 | 49 | 8 | 5744 | 0.981 | 0.878 | 0.927 |
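Cohen's kappa, reported in the algorithm comparison table, can also be derived from a confusion matrix: it compares the observed agreement (the diagonal) with the agreement expected by chance from the row and column totals. The sketch below applies the standard formula to the two matrices above; the helper name `cohen_kappa` and the variable names are illustrative, and nothing here is claimed to be the authors' own code.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: results, columns: reference)."""
    n = sum(sum(row) for row in matrix)
    size = len(matrix)
    observed = sum(matrix[k][k] for k in range(size)) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[k] for row in matrix) for k in range(size)]
    # Chance agreement: product of marginal proportions, summed over classes.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)


# Counts copied from the first and second confusion matrices above.
first_matrix = [
    [3052, 0, 10, 343],
    [0, 251, 0, 191],
    [0, 0, 47, 89],
    [21, 68, 0, 5922],
]
second_matrix = [
    [3020, 8, 10, 413],
    [0, 259, 0, 387],
    [0, 3, 39, 0],
    [53, 49, 8, 5744],
]

kappa_first = cohen_kappa(first_matrix)
kappa_second = cohen_kappa(second_matrix)
print(f"kappa (first matrix):  {kappa_first:.2f}")
print(f"kappa (second matrix): {kappa_second:.2f}")
```

The results come out at roughly 0.86 and 0.82. These match the values listed for the optimal object-based RT and SVM classifiers, which suggests, but does not prove, that the two matrices belong to those two classifiers in that order.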
Class | Area, m² | Relative Area, % |
---|---|---|
Optimal object-based RT classification | ||
Grass | 6065 | 61.50 |
Stones | 102 | 1.03 |
Soil | 290 | 2.94 |
Shrubs | 3405 | 34.53 |
Optimal object-based SVM classification | ||
Grass | 6037 | 61.21 |
Stones | 221 | 2.24 |
Soil | 111 | 1.13 |
Shrubs | 3493 | 35.42 |
Share and Cite
Evstatiev, B.; Valova, I.; Kaneva, T.; Valov, N.; Sevov, A.; Stanchev, G.; Komitov, G.; Zhelyazkova, T.; Gerdzhikova, M.; Todorova, M.; et al. Identification of Pasture Degradation Using Remote Sensing Data and Machine Learning: A Case Study of Obichnik. Appl. Sci. 2024 , 14 , 7599. https://doi.org/10.3390/app14177599
Exploring advanced machine learning techniques for landslide susceptibility mapping in Yanchuan County, China
- Published: 27 August 2024
- Wei Chen (1)
- Chao Guo (1)
- Fanghao Lin (1)
- Ruixin Zhao (2)
- Paraskevas Tsangaratos (4)
- Ioanna Ilia (4)
Many landslides occur every year in China, causing extensive property losses and casualties. Landslide susceptibility mapping is crucial for disaster prevention, helping governments and related organizations protect people's lives and property. This study compared the performance of random forest (RF), classification and regression trees (CART), Bayesian network (BN), and logistic model trees (LMT) methods in generating landslide susceptibility maps for Yanchuan County using an optimization strategy. A field survey was conducted to map 311 landslides. The dataset was divided into a training dataset and a validation dataset at a ratio of 7:3. Sixteen factors influencing landslides were identified based on a geological survey of the study area: elevation, plan curvature, profile curvature, slope aspect, slope angle, slope length, topographic position index (TPI), terrain ruggedness index (TRI), convergence index, normalized difference vegetation index (NDVI), distance to roads, distance to rivers, rainfall, soil type, lithology, and land use. The training dataset was used to train the models in Weka software, and landslide susceptibility maps were generated in GIS software. The performance of the four models was evaluated using receiver operating characteristic (ROC) curves, confusion matrices, the chi-square test, and other statistical analysis methods. The comparison shows that all four machine learning models are suitable for evaluating landslide susceptibility in the study area. The RF and LMT methods perform more stably than the other two models and are therefore well suited to landslide susceptibility mapping.
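Two steps mentioned in the abstract, the 7:3 split and the ROC-based evaluation, can be illustrated in a few lines. The study itself used Weka and GIS software, so the following pure-Python sketch is only an illustration under stated assumptions: the function names `split_70_30` and `roc_auc`, the random seed, and the toy scores are all hypothetical, and only the count of 311 landslides comes from the abstract.

```python
import random


def split_70_30(samples, seed=0):
    """Shuffle the samples and split them into 70% training and 30% validation."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive sample scores higher than a randomly chosen negative one
    (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# 311 mapped landslides, identified here by hypothetical integer IDs.
landslide_ids = range(311)
training, validation = split_70_30(landslide_ids)
print(len(training), len(validation))

# Toy susceptibility scores for four validation points (hypothetical values).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of landslide and non-landslide locations, which is why the ROC curve is a common basis for comparing susceptibility models.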
This study was supported by the Innovation Capability Support Program of Shaanxi (Program No. 2020KJXX-005).
Author information
Authors and affiliations.
College of Geology and Environment, Xi’an University of Science and Technology, Xi’an, 710054, China
Wei Chen, Chao Guo & Fanghao Lin
School of Highway, Chang’an University, Xi’an, 710064, China
Ruixin Zhao
School of Mining & Civil Engineering, Liupanshui Normal University, Liupanshui, 553000, Guizhou, China
Laboratory of Engineering Geology and Hydrogeology, Department of Geological Sciences, School of Mining and Metallurgical Engineering, National Technical University of Athens, 15780, Zografou, Greece
Paraskevas Tsangaratos & Ioanna Ilia
Corresponding author
Correspondence to Wei Chen.
Ethics declarations
Conflict of interest.
The authors declare no competing interests.
Additional information
Communicated by: Hassan Babaie
About this article
Chen, W., Guo, C., Lin, F. et al. Exploring advanced machine learning techniques for landslide susceptibility mapping in Yanchuan County, China. Earth Sci Inform (2024). https://doi.org/10.1007/s12145-024-01455-8
Received: 28 May 2024
Accepted: 14 August 2024
Published: 27 August 2024
DOI: https://doi.org/10.1007/s12145-024-01455-8
- Susceptibility maps
- Machine learning
- Comparative analysis
- Yanchuan County
Computer Science > Machine Learning
Title: Benchmarking Counterfactual Interpretability in Deep Learning Models for Time Series Classification
Abstract: The popularity of deep learning methods in the time series domain boosts interest in interpretability studies, including counterfactual (CF) methods. CF methods identify minimal changes in instances to alter the model predictions. Despite extensive research, no existing work benchmarks CF methods in the time series domain. Additionally, the results reported in the literature are inconclusive due to the limited number of datasets and inadequate metrics. In this work, we redesign quantitative metrics to accurately capture desirable characteristics in CFs. We specifically redesign the metrics for sparsity and plausibility and introduce a new metric for consistency. Combined with validity, generation time, and proximity, we form a comprehensive metric set. We systematically benchmark 6 different CF methods on 20 univariate datasets and 10 multivariate datasets with 3 different classifiers. Results indicate that the performance of CF methods varies across metrics and among different models. Finally, we provide case studies and a guideline for practical usage.
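To make the idea of a counterfactual concrete, the sketch below scores one candidate counterfactual for a toy time series on three of the metric families named in the abstract: validity, proximity, and sparsity. Everything here is an assumption for illustration: the threshold classifier `classify`, the helper `cf_metrics`, and the simple metric definitions (prediction flip, L1 distance, share of unchanged time steps) are common textbook choices, not the redesigned metrics proposed in the paper.

```python
def classify(series, threshold=0.5):
    """Toy classifier (hypothetical): class 1 if the series mean exceeds the threshold."""
    return int(sum(series) / len(series) > threshold)


def cf_metrics(original, counterfactual, model=classify):
    """Score a counterfactual against the instance it was generated from."""
    changed = [i for i, (a, b) in enumerate(zip(original, counterfactual)) if a != b]
    return {
        # Validity: the counterfactual must change the model's prediction.
        "validity": model(original) != model(counterfactual),
        # Proximity: total absolute change (L1 distance); smaller is closer.
        "proximity_l1": sum(abs(a - b) for a, b in zip(original, counterfactual)),
        # Sparsity: share of time steps left untouched; higher is sparser.
        "sparsity": 1 - len(changed) / len(original),
    }


x = [0.2, 0.4, 0.3, 0.5]     # mean 0.35, predicted class 0
x_cf = [0.2, 0.4, 0.3, 1.5]  # one time step changed, mean 0.60, predicted class 1
metrics = cf_metrics(x, x_cf)
print(metrics)
```

Here a single modified time step is enough to flip the prediction, so the counterfactual is valid and sparse, at the cost of a fairly large change in that one value. Trade-offs of this kind are the reason a benchmark reports several metrics side by side.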
Comments: | 15 pages, 27 figures |
Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML) |
The case study in this article will go over a popular Machine learning concept called classification. Classification. In Machine Learning (ML), classification is a supervised learning concept that groups data into classes. Classification usually refers to any kind of problem where a specific type of class label is the result to be predicted ...
Regression. There are four main categories of Machine Learning algorithms: supervised, unsupervised, semi-supervised, and reinforcement learning. Even though classification and regression are both from the category of supervised learning, they are not the same. The prediction task is a classification when the target variable is discrete.
By Avishek Nag (Machine Learning expert). A comparison of different classifiers' accuracy & performance for high-dimensional data. In Machine Learning, classification problems with high-dimensional data are really challenging. Sometimes, very simple problems become extremely complex due to this 'curse of dimensionality ...
Classification is one of the most widely used techniques in machine learning, with a broad array of applications, including sentiment analysis, ad targeting, spam detection, risk assessment, medical diagnosis and image classification. The core goal of classification is to predict a category or class y from some inputs x.
Now comes the cool part, end-to-end application of deep learning to real-world datasets. We will cover the 3 most commonly encountered problems as case studies: binary classification, multiclass classification and regression. Case Study: Binary Classification. 1.1) Data Visualization & Preprocessing. 1.2) Logistic Regression Model. 1.3) ANN Model.
Classification is a Machine Learning task that assigns each input a label from a set of predefined classes, identifying it as being of one kind or another. The most basic example is an email spam filter, which classifies each message as either "spam" or "not spam". You will encounter multiple types of ...
Up to 300 passengers survived and about 550 didn't; in other words, the survival rate (or the population mean) is 38%. Moreover, a histogram is perfect for giving a rough sense of the density of the underlying distribution of a single numerical variable. I recommend using a box plot to graphically depict data groups through their quartiles. Let's take the Age variable for instance:
Abstract. As a young research field, machine learning has made significant progress and covered a broad spectrum of applications over the last few decades. Classification is an important task ...
Examples include: Email spam detection (spam or not). Churn prediction (churn or not). Conversion prediction (buy or not). Typically, binary classification tasks involve one class that is the normal state and another class that is the abnormal state. For example, "not spam" is the normal state and "spam" is the abnormal state.
Classification is a process of categorizing data or objects into predefined classes or categories based on their features or attributes. Machine Learning classification is a type of supervised learning technique where an algorithm is trained on a labeled dataset to predict the class or category of new, unseen data.
Here is an interesting pricing problem for calculating electricity consumption. 3. Amazon's Real-Time Fraud Detection System. Another case study from Amazon—its fraud detection system uses machine learning to identify and prevent fraudulent transactions as they occur.
Using supervised machine learning for large-scale classification in management research: The case for identifying artificial intelligence patents ... recent machine learning (ML) tools for text classification and natural language processing can be used to construct quantitative variables and to classify unstructured text documents. In this ...
The final classification is made by counting the most common scenario or votes present within the ... in addition to the 139 instances of the case study, to the machine learning algorithms, then ...
Step 1: Define explanatory and target variables. We'll store the rows of observations in a variable X and the corresponding class of those observations (0 or 1) in a variable y. Step 2: Split the dataset into training and testing sets. We use 75% of data for training and 25% for testing.
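The two steps above can be sketched in plain Python. This is a minimal illustration and not the original article's code: the helper name `split_train_test`, the toy data, and the fixed seed are all assumptions made for the example.

```python
import random


def split_train_test(X, y, test_fraction=0.25, seed=0):
    """Shuffle row indices, then split X and y into train and test parts.

    Hypothetical helper for illustration; not taken from the source article.
    """
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    indices = list(range(len(X)))
    # Shuffle with a seeded generator so the split is reproducible.
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Step 1: observations in X, their classes (0 or 1) in y (made-up toy data).
X = [[0.1, 1.0], [0.4, 0.9], [0.5, 0.2], [0.9, 0.1],
     [0.2, 0.8], [0.7, 0.3], [0.3, 0.7], [0.8, 0.4]]
y = [0, 0, 1, 1, 0, 1, 0, 1]

# Step 2: 75% of the rows for training, 25% for testing.
X_train, X_test, y_train, y_test = split_train_test(X, y)
print(len(X_train), len(X_test))  # 6 2
```

In practice, scikit-learn's `train_test_split` does the same job, and its default test size is also 25% when neither `test_size` nor `train_size` is given.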
The book uses a hands-on case study-based approach to crack real-world applications to which machine learning concepts can be applied. These smarter machines will enable your business processes to achieve efficiencies on minimal time and resources. Python Machine Learning Case Studies takes you through the steps to improve business processes ...
This first course treats the machine learning method as a black box. Using this abstraction, you will focus on understanding tasks of interest, matching these tasks to machine learning tools, and assessing the quality of the output. In subsequent courses, you will delve into the components of this black box by examining models and algorithms.
The Titanic Machine Learning Case Study is a classic example in the field of data science and machine learning. The study is based on the dataset of passengers aboard the Titanic when it sank in 1912. The study's goal is to predict whether a passenger survived or not based on their demographic and other information.
What is Classification In Machine Learning. Classification is the process of categorizing a given set of data into classes. It can be performed on both structured and unstructured data. The process starts with predicting the class of given data points. The classes are often referred to as target, label or categories.
A step by step guide to image classification. In this post, I am going to explain an end-to-end use case of deep learning image classification in order to automate the process of classifying ...
6. Machine Learning Case Study on Tesla. Tesla is now a big name in the electric automobile industry and the chances that it will continue to be the trending topic for years to come are really high. It is popular and extensively known for its advanced and futuristic cars and their advanced models.
Explore and run machine learning code with Kaggle Notebooks | Using data from Mines vs Rocks
In this case, if we had a bunch of examples of first and last names, phone numbers, ID numbers, DoB, email addresses and VINs, each labelled as such, we could train a multi-class supervised ...
Classification Modeling: An Overview. In the classification problem, we try to build a model that predicts the labels of the target variable using independent variables. As we deal with labeled target data, we'll need supervised machine learning algorithms like Logistic Regression, SVM, Decision Tree, etc.
In recent years, advances in technology and in artificial intelligence methods based on signal processing and machine learning have attracted the attention of researchers. A current challenge in the field of structural health monitoring is identifying and classifying damage with high accuracy within a health-monitoring program.
The degradation of pastures and meadows is a global problem with a wide range of impacts. It affects farmers in different ways, such as decreases in cattle production, milk yield, and forage quality. Still, it also has other side effects, such as loss of biodiversity, loss of resources, etc. In this study, the degradation of a semi-natural pasture near the village of Obichnik, Bulgaria, was ...
Synthetic populations are increasingly required in transportation demand modelling practice to feed the large-scale agent-based microsimulation platforms gaining in popularity. The quality of the synthetic population, i.e., its representativeness of the sociodemographic and the spatial distribution of the real population, is a determinant factor of the reliability of the microsimulation it ...
Many landslides occur every year in China, causing extensive property losses and casualties. Landslide susceptibility mapping is crucial for disaster prevention by the government or related organizations to protect people's lives and property. This study compared the performance of random forest (RF), classification and regression trees (CART), Bayesian network (BN), and logistic model ...
Support vector machines (SVMs) are well-known machine learning algorithms for classification and regression applications. In the healthcare domain, they have been used for a variety of tasks ...