Scientific Reports volume 12, Article number: 15252 (2022)
The analytic procedures incorporated to facilitate the delivery of projects are often referred to as project analytics. Existing techniques focus on retrospective reporting and understanding the underlying relationships to make informed decisions. Although machine learning algorithms have been widely used in addressing problems within various contexts (e.g., streamlining the design of construction projects), few studies have evaluated pre-existing machine learning methods within the delivery of construction projects. Due to this, the current research aims to contribute further to this convergence between artificial intelligence and the execution of construction projects through the evaluation of a specific set of machine learning algorithms. This study proposes a machine learning-based data-driven research framework for addressing problems related to project analytics. It then illustrates an example of the application of this framework. In this illustration, existing data from an open-source data repository on construction projects and cost overrun frequencies was studied, and several machine learning models (from Python's Scikit-learn package) were tested and evaluated. The data consisted of 44 independent variables (from materials to labour and contracting) and one dependent variable (project cost overrun frequency), which has been categorised for processing under several machine learning models. These models include support vector machine, logistic regression, k-nearest neighbour, random forest, stacking (ensemble) model and artificial neural network. Feature selection and evaluation methods, including Univariate feature selection, Recursive feature elimination, SelectFromModel and the confusion matrix, were applied to determine the most accurate prediction model. This study also discusses the generalisability of using the proposed research framework in other research contexts within the field of project management.
The proposed framework, its illustration in the context of construction projects and its potential to be adopted in different contexts will significantly contribute to project practitioners, stakeholders and academics in addressing many project-related issues.
Introduction.
Successful projects require the presence of appropriate information and technology 1 . Project analytics provides an avenue for informed decisions to be made through the lifecycle of a project. Project analytics applies various statistics (e.g., earned value analysis or Monte Carlo simulation) among other models to make evidence-based decisions. They are used to manage risks as well as project execution 2 . There is a tendency for project analytics to be employed due to other additional benefits, including an ability to forecast and make predictions, benchmark with other projects, and determine trends such as those that are time-dependent 3 , 4 , 5 . There has been increasing interest in project analytics and how current technology applications can be incorporated and utilised 6 . Broadly, project analytics can be understood on five levels 4 . The first is descriptive analytics which incorporates retrospective reporting. The second is known as diagnostic analytics , which aims to understand the interrelationships and underlying causes and effects. The third is predictive analytics which seeks to make predictions. Subsequent to this is prescriptive analytics , which prescribes steps following predictions. Finally, cognitive analytics aims to predict future problems. The first three levels can be applied with ease with the help of technology. The fourth and fifth steps require data that is generally more difficult to obtain as they may be less accessible or unstructured. Further, although project key performance indicators can be challenging to define 2 , identifying common measurable features facilitates this 7 . It is anticipated that project analytics will continue to experience development due to its direct benefits to the major baseline measures focused on productivity, profitability, cost, and time 8 . The nature of project management itself is fluid and flexible, and project analytics allows an avenue for which machine learning algorithms can be applied 9 .
Machine learning within the field of project analytics falls into the category of cognitive analytics, which deals with problem prediction. Generally, machine learning explores the possibilities of computers to improve processes through training or experience 10 . It can also build on the pre-existing capabilities and techniques prevalent within management to accomplish complex tasks 11 . Due to its practical use and broad applicability, recent developments have led to the invention and introduction of newer and more innovative machine learning algorithms and techniques. Artificial intelligence, for instance, allows for software to develop computer vision, speech recognition, natural language processing, robot control, and other applications 10 . Specific to the construction industry, it is now used to monitor construction environments through a virtual reality and building information modelling replication 12 or risk prediction 13 . Within other industries, such as consumer services and transport, machine learning is being applied to improve consumer experiences and satisfaction 10 , 14 and reduce the human errors of traffic controllers 15 . Recent applications and development of machine learning broadly fall into the categories of classification, regression, ranking, clustering, dimensionality reduction and manifold learning 16 . Current learning models include linear predictors, boosting, stochastic gradient descent, kernel methods, and nearest neighbour, among others 11 . Newer and more applications and learning models are continuously being introduced to improve accessibility and effectiveness.
Specific to the management of construction projects, other studies have also been made to understand how copious amounts of project data can be used 17 , the importance of ontology and semantics throughout the nexus between artificial intelligence and construction projects 18 , 19 as well as novel approaches to the challenges within this integration of fields 20 , 21 , 22 . There have been limited applications of pre-existing machine learning models on construction cost overruns. They have predominantly focussed on applications to streamline the design processes within construction 23 , 24 , 25 , 26 , and those which have investigated project profitability have not incorporated the types and combinations of algorithms used within this study 6 , 27 . Furthermore, existing applications have largely been skewed towards one type or another 28 , 29 .
In addition to the frequently used earned value method (EVM), researchers have been applying many other powerful quantitative methods to address a diverse range of project analytics research problems over time. Examples of those methods include time series analysis, fuzzy logic, simulation, network analytics, and network correlation and regression. Time series analysis uses longitudinal data to forecast an underlying project's future needs, such as the time and cost 30 , 31 , 32 . Few other methods are combined with EVM to find a better solution for the underlying research problems. For example, Narbaev and De Marco 33 integrated growth models and EVM for forecasting project cost at completion using data from construction projects. For analysing the ongoing progress of projects having ambiguous or linguistic outcomes, fuzzy logic is often combined with EVM 34 , 35 , 36 . Yu et al. 36 applied fuzzy theory and EVM for schedule management. Ponz-Tienda et al. 35 found that using fuzzy arithmetic on EVM provided more objective results in uncertain environments than the traditional methodology. Bonato et al. 37 integrated EVM with Monte Carlo simulation to predict the final cost of three engineering projects. Batselier and Vanhoucke 38 compared the accuracy of the project time and cost forecasting using EVM and simulation. They found that the simulation results supported findings from the EVM. Network methods are primarily used to analyse project stakeholder networks. Yang and Zou 39 developed a social network theory-based model to explore stakeholder-associated risks and their interactions in complex green building projects. Uddin 40 proposed a social network analytics-based framework for analysing stakeholder networks. Ong and Uddin 41 further applied network correlation and regression to examine the co-evolution of stakeholder networks in collaborative healthcare projects. 
Although many other methods have already been used, as evident in the current literature, machine learning methods or models are yet to be adopted for addressing research problems related to project analytics. The current investigation is derived from the cognitive analytics component of project analytics. It proposes an approach for determining hidden information and patterns to assist with project delivery. Figure 1 illustrates a tree diagram showing different levels of project analytics and their associated methods from the literature. It also shows where the application of machine learning sits among the existing methods within the cognitive component of project analytics.
Figure 1. A tree diagram of different project analytics methods. It also shows where the current study belongs. Although earned value analysis is commonly used in project analytics, we do not include it in this figure since it is used in the first three levels of project analytics.
Machine learning models have several notable advantages over traditional statistical methods that play a significant role in project analytics 42 . First, machine learning algorithms can quickly identify trends and patterns by simultaneously analysing a large volume of data. Second, they are more capable of continuous improvement. Machine learning algorithms can improve their accuracy and efficiency for decision-making through subsequent training on new data. Third, machine learning algorithms efficiently handle multi-dimensional and multi-variety data in dynamic or uncertain environments. Fourth, they are well suited to automating various decision-making tasks. For example, machine learning-based sentiment analysis can easily detect a negative tweet and can automatically take further necessary steps. Last but not least, machine learning has been helpful across various industries, from defence to education 43 . Current research has seen the development of several different branches of artificial intelligence (including robotics, automated planning and scheduling and optimisation) within safety monitoring, risk prediction, cost estimation and so on 44 . This has progressed from the applications of regression on project cost overruns 45 to the current deep-learning implementations within the construction industry 46 . Despite this, the uses remain largely limited and are still in a developmental state. The benefits of applications are noted, such as optimising and streamlining existing processes; however, high initial costs form a barrier to accessibility 44 .
The primary goal of this study is to demonstrate the applicability of different machine learning algorithms in addressing problems related to project analytics. Limitations in applying machine learning algorithms within the context of construction projects have been explored previously. However, preceding research has mainly been conducted to improve the design processes specific to construction 23 , 24 , and those investigating project profitability have not incorporated the types and combinations of algorithms used within this study 6 , 27 . For instance, preceding research has incorporated a different combination of machine learning algorithms in predicting construction delays 47 . To contribute to this study direction, this study first proposes a machine learning-based data-driven research framework for project analytics. It then applies this framework to a case study of construction projects. Although there are three types of machine learning algorithms (supervised, unsupervised and semi-supervised), supervised machine learning models are most commonly used due to their efficiency and effectiveness in addressing many real-world problems 48 . Therefore, we will use machine learning to represent supervised machine learning throughout the rest of this article. The contribution of this study is significant in that it considers the applications of machine learning within project management. Project management is often thought of as being very fluid in nature, and because of this, applications of machine learning are often more difficult 9 , 49 . Further to this, existing implementations have largely been limited to safety monitoring, risk prediction, cost estimation and so on 44 .
Through the evaluation of machine-learning applications, this study further demonstrates a case study for which algorithms can be used to consider and model the relationship between project attributes and a project performance measure (i.e., cost overrun frequency).
When and why machine learning for project analytics.
Machine learning models are typically used for research problems that involve predicting the classification outcome of a categorical dependent variable. Therefore, they can be applied in the context of project analytics if the underlying objective variable is a categorical one. If that objective variable is non-categorical, it must first be converted into a categorical variable. For example, if the objective or target variable is the project cost, we can convert this variable into a categorical variable by taking only two possible values. The first value would be 0 to indicate a low-cost project, and the second could be 1 for showing a high-cost project. The average or median cost value for all projects under consideration can be considered for splitting project costs into low-cost and high-cost categories.
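The median-based conversion described above can be sketched in a few lines. This is a minimal illustration only: the cost values and the variable names (`project_costs`, `cost_class`) are hypothetical and do not come from the study's dataset.

```python
from statistics import median

# Hypothetical project costs (illustrative values, not from the study's dataset).
project_costs = [120.0, 450.0, 300.0, 90.0, 800.0, 310.0]

# Use the median cost of all projects as the splitting threshold.
threshold = median(project_costs)

# 0 = low-cost project (at or below the median), 1 = high-cost project.
cost_class = [1 if cost > threshold else 0 for cost in project_costs]

print("threshold:", threshold)
print("classes:", cost_class)
```

Using the median rather than the mean tends to give two classes of similar size, which is one reason it may be preferred when the cost distribution is skewed.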
For data-driven decision-making, machine learning models are advantageous. This is because traditional statistical methods (e.g., ordinary least square (OLS) regression) make assumptions about the underlying research data to produce explicit formulae for the objective target measures. Unlike these statistical methods, machine learning algorithms figure out patterns on their own directly from the data. For instance, for a non-linear but separable dataset, an OLS regression model will not be the right choice due to its assumption that the underlying data must be linear. However, a machine learning model can easily separate the dataset into the underlying classes. Figure 2 (a) presents a situation where machine learning models perform better than traditional statistical methods.
Figure 2. (a) An illustration showing the superior performance of machine learning models compared with the traditional statistical models using an abstract dataset with two attributes (X1 and X2). The data points within this abstract dataset consist of two classes: one represented with a transparent circle and the second class illustrated with a black-filled circle. These data points are non-linear but separable. Traditional statistical models (e.g., ordinary least square regression) will not accurately separate these data points. However, any machine learning model can easily separate them without making errors; and (b) Traditional programming versus machine learning.
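The situation in Figure 2 (a) can be reproduced approximately with synthetic data. The sketch below is an assumption-laden illustration, not the authors' experiment: it uses scikit-learn's `make_circles` to generate two classes that are separable but not linearly separable, fits an OLS regression on the 0/1 labels (thresholded at 0.5), and compares it with a support vector machine using a radial basis function kernel. Accuracies are measured on the same data used for fitting, purely to show the difference in the shape of the decision boundary.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic two-class data: an inner circle surrounded by an outer ring.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# OLS regression of the 0/1 label on X1 and X2 (with an intercept),
# turned into a classifier by thresholding the fitted value at 0.5.
design = np.column_stack([np.ones(len(X)), X])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
ols_predictions = (design @ coefficients > 0.5).astype(int)
ols_accuracy = float(np.mean(ols_predictions == y))

# Support vector machine with a non-linear (RBF) kernel.
svm = SVC(kernel="rbf").fit(X, y)
svm_accuracy = svm.score(X, y)

print(f"OLS accuracy: {ols_accuracy:.2f}")
print(f"SVM accuracy: {svm_accuracy:.2f}")
```

A straight-line boundary cannot enclose the inner class, so the thresholded OLS fit misclassifies a large share of the points, whereas the kernel-based model can draw a closed boundary around them.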
Similarly, machine learning models are compelling if the underlying research dataset has many attributes or independent measures. Such models can identify features that significantly contribute to the corresponding classification performance regardless of their distributions or collinearity. Traditional statistical methods are prone to biased results when the independent variables are correlated. Current machine learning-based studies specific to project analytics are limited. Despite this, there have been tangential studies on the use of artificial intelligence to improve cost estimations as well as risk prediction 44 . Additionally, models have been implemented in the optimisation of existing processes 50 .
Machine learning can be thought of as a process of teaching a machine (i.e., computers) to learn from data and adjust or apply its present knowledge when exposed to new data 42 . It is a type of artificial intelligence that enables computers to learn from examples or experiences. Traditional programming requires some input data and some logic in the form of code (program) to generate the output. Unlike traditional programming, the input data and their corresponding output are fed to an algorithm to create a program in machine learning. This resultant program can capture powerful insights into the data pattern and can be used to predict future outcomes. Figure 2 (b) shows the difference between machine learning and traditional programming.
Figure 3 illustrates the proposed machine learning-based research framework of this study. The framework starts with breaking the project research dataset into the training and test components. As mentioned in the previous section, the research dataset may have many categorical and/or nominal independent variables, but its single dependent variable must be categorical. Although there is no strict rule for this split, the training data size is generally more than or equal to 50% of the original dataset 48 .
Figure 3. The proposed machine learning-based data-driven framework.
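The first step of the framework, splitting the dataset into training and test components, could look like the following sketch. The 70/30 ratio, the random placeholder features and the variable names are assumptions made for illustration; the text only states that the training portion is generally at least 50% of the data. The array shapes mirror the case-study dataset described later (139 instances, 44 features, 65 and 74 instances per class).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the case-study dataset:
# 139 projects, 44 features, 65 "rare" (0) and 74 "often" (1) labels.
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(139, 44))
y = np.array([0] * 65 + [1] * 74)

# Assumed 70/30 split; stratify=y keeps the class proportions similar
# in the training and test components.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
```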
Machine learning algorithms can handle variables that have only numerical outcomes. So, when one or more of the underlying categorical variables have a textual or string outcome, we must first convert them into the corresponding numerical values. Suppose a variable can take only three textual outcomes (low, medium and high). In that case, we could consider, for example, 1 to represent low , 2 to represent medium , and 3 to represent high . Other statistical techniques, such as the RIDIT (relative to an identified distribution) scoring 51 , can also be used to convert ordered categorical measurements into quantitative ones. RIDIT is a parametric approach that uses probabilistic comparison to determine the statistical differences between ordered categorical groups. The remaining components of the proposed framework have been briefly described in the following subsections.
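The low/medium/high conversion mentioned above can be written as a simple lookup. The response values below are hypothetical; only the 1, 2, 3 coding comes from the text.

```python
# Ordinal coding from the text: low -> 1, medium -> 2, high -> 3.
ordinal_codes = {"low": 1, "medium": 2, "high": 3}

# Hypothetical survey responses for one categorical variable.
responses = ["medium", "low", "high", "high", "low"]

# Replace each textual outcome with its numerical code.
encoded = [ordinal_codes[response] for response in responses]

print(encoded)
```

An explicit mapping keeps the order of the categories under the analyst's control, which an automatic alphabetical encoding would not guarantee.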
The next step of the framework is to follow the model-building procedure to develop the desired machine learning models using the training data. The first step of this procedure is to select suitable machine learning algorithms or models. Among the available machine learning algorithms, the commonly used ones are support vector machine, logistic regression, k-nearest neighbours, artificial neural network, decision tree and random forest 52 . One can also select an ensemble machine learning model as the desired algorithm. An ensemble machine learning method uses multiple algorithms or the same algorithm multiple times to achieve better predictive performance than could be obtained from any of the constituent learning models alone 52 . Three widely used ensemble approaches are bagging, boosting and stacking. In bagging, the research dataset is divided into different equal-sized subsets. The underlying machine learning algorithm is then applied to these subsets for classification. In boosting, a random sample of the dataset is selected and then fitted and trained sequentially with different models, each compensating for the weakness observed in the model used immediately before it. Stacking combines different weak machine learning models in a heterogeneous way to improve the predictive performance. The random forest algorithm, for example, is an ensemble of different decision tree models 42 .
Second, each selected machine learning model will be processed through the k-fold cross-validation approach to improve predictive efficiency. In k-fold cross-validation, the training data is divided into k folds. In each iteration, (k-1) folds are used to train the selected machine learning models, and the remaining fold is used for validation purposes. This process continues until each of the k folds has been used once for validation. The final predictive efficiency of the trained models is based on the average values from the outcomes of these iterations. In addition to this average value, researchers report the standard deviation of the results from different iterations to indicate the variability of the predictive training efficiency. Supplementary Fig 1 shows an illustration of the k-fold cross-validation.
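A minimal sketch of k-fold cross-validation with scikit-learn follows. The synthetic data from `make_classification`, the choice of k = 5 and the use of logistic regression as the model are all illustrative assumptions rather than the study's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training data.
X_train, y_train = make_classification(n_samples=100, n_features=10, random_state=0)

# k = 5 folds: each fold is used exactly once for validation while the
# remaining four folds are used for training.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=cv)

# Report the average accuracy and its spread across the folds.
mean_score, std_score = scores.mean(), scores.std()
print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```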
Third, most machine learning algorithms require pre-defined values for their different parameters, known as hyperparameters; the process of choosing these values is known as hyperparameter tuning. The settings of these parameters play a vital role in the achieved performance of the underlying algorithm. For a given machine learning algorithm, the optimal value for these parameters can be different from one dataset to another. The same algorithm needs to run multiple times with different parameter values to find its optimal parameter value for a given dataset. Many algorithms are available in the literature, such as the Grid search 53 , to find the optimal parameter value. In the Grid search, hyperparameters are divided into discrete grids. Each grid point represents a specific combination of the underlying model parameters. The parameter values of the point that results in the best performance are the optimal parameter values 53 .
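Grid search can be sketched with scikit-learn's `GridSearchCV`. The model (a support vector classifier), the candidate values in `param_grid` and the synthetic data are assumptions chosen for illustration; they are not the grid used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the training data.
X_train, y_train = make_classification(n_samples=100, n_features=10, random_state=0)

# Each combination of C and kernel is one grid point (3 x 2 = 6 points).
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Every grid point is scored with 5-fold cross-validation; the point with
# the best mean accuracy supplies the selected hyperparameter values.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_, round(search.best_score_, 3))
```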
Once the desired machine learning models have been developed using the training data, they need to be tested using the test data. The underlying trained model is then applied to predict its dependent variable for each data instance. Therefore, for each data instance, two categorical outcomes will be available for its dependent variable: one predicted using the underlying trained model, and the other is the actual category. These predicted and actual categorical outcome values are used to report the results of the underlying machine learning model.
The fundamental tool to report results from machine learning models is the confusion matrix, which consists of four integer values 48 . The first value represents the number of positive cases correctly identified as positive by the underlying trained model (true-positive). The second value indicates the number of positive instances incorrectly identified as negative (false-negative). The third value represents the number of negative cases incorrectly identified as positive (false-positive). Finally, the fourth value indicates the number of negative instances correctly identified as negative (true-negative). Researchers also use a few performance measures based on the four values of the confusion matrix to report machine learning results. The most used measure is accuracy, which is the ratio of the number of correct predictions (true-positive + true-negative) to the total number of data instances (sum of all four values of the confusion matrix). Other measures commonly used to report machine learning results are precision, recall and F1-score. Precision refers to the ratio between true-positives and the total number of positive predictions (i.e., true-positive + false-positive), often used to indicate the quality of a positive prediction made by a model 48 . Recall, also known as the true-positive rate, is calculated by dividing true-positive by the number of data instances that should have been predicted as positive (i.e., true-positive + false-negative). F1-score is the harmonic mean of precision and recall, i.e., [(2 × Precision × Recall)/(Precision + Recall)], and the error-rate equals (1 - Accuracy).
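The measures defined above follow directly from the four confusion-matrix counts. The counts in this sketch are hypothetical and serve only to show the arithmetic.

```python
# Hypothetical confusion-matrix counts (not results from this study).
tp, fn, fp, tn = 40, 10, 5, 45

total = tp + fn + fp + tn

accuracy = (tp + tn) / total          # correct predictions / all instances
precision = tp / (tp + fp)            # quality of positive predictions
recall = tp / (tp + fn)               # true-positive rate
f1_score = 2 * precision * recall / (precision + recall)
error_rate = 1 - accuracy

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1_score:.3f} error={error_rate:.3f}")
```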
Another essential tool for reporting machine learning results is variable or feature importance, which identifies a list of independent variables (features) contributing most to the classification performance. The importance of a variable refers to how much a given machine learning algorithm uses that variable in making accurate predictions 54 . The widely used technique for identifying variable importance is the principal component analysis. It reduces the dimensionality of the data while minimising information loss, which eventually increases the interpretability of the underlying machine learning outcome. It further helps in finding the important features in a dataset as well as plotting them in 2D and 3D 54 .
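As a rough sketch of the principal component analysis step, the code below projects a feature matrix onto two components (suitable for a 2D plot) and ranks the original features by the absolute size of their loadings on the first component. The random placeholder data, the standardisation step and the loading-based ranking are assumptions made for illustration; the text does not specify how features were ranked.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrix: 139 instances, 44 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(139, 44))

# Standardise the features so that no feature dominates because of its scale.
X_scaled = StandardScaler().fit_transform(X)

# Reduce to two principal components for 2D plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Rank the original features by absolute loading on the first component.
loadings = np.abs(pca.components_[0])
top_features = np.argsort(loadings)[::-1][:5]

print("explained variance ratio:", pca.explained_variance_ratio_)
print("indices of the five largest loadings:", top_features)
```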
Ethical approval is not required for this study since this study used publicly available data for research investigation purposes. All research was performed in accordance with relevant guidelines/regulations.
Due to the nature of the data sources, informed consent was not required for this study.
This section illustrates an application of this study’s proposed framework (Fig. 3 ) in a construction project context. We will apply this framework in classifying projects into two classes based on their cost overrun experience. Projects that rarely experience a cost overrun belong to the first class (Rare class). The second class comprises those projects that often experience a cost overrun (Often class). In doing so, we consider a list of independent variables or features.
The research dataset is taken from an open-source data repository, Kaggle 55 . This survey-based research dataset was collected to explore the causes of the project cost overrun in Indian construction projects 45 , consisting of 44 independent variables or features and one dependent variable. The independent variables cover a wide range of cost overrun factors, from materials and labour to contractual issues and the scope of the work. The dependent variable is the frequency of experiencing project cost overrun (rare or often). The dataset size is 139; 65 belong to the rare class, and the remaining 74 are from the often class. We converted each categorical variable with a textual or string outcome into an appropriate numerical value range to prepare the dataset for machine learning analysis. For example, we used 1 and 2 to represent rare and often class, respectively. The correlation matrix among the 44 features is presented in Supplementary Fig 2 .
This study considered four machine learning algorithms to explore the causes of project cost overrun using the research dataset mentioned above. They are support vector machine, logistic regression, k-nearest neighbours and random forest.
Support vector machine (SVM) is a process applied to understand data. For instance, if one wants to determine and interpret which projects are classified as programmatically successful through the processing of precedent data information, SVM would provide a practical approach for prediction. SVM functions by assigning labels to objects 56 . The comparison attributes are used to separate these objects into different groups or classes by maximising their marginal distances and minimising the classification errors. The attributes are plotted multi-dimensionally, allowing a separation line, known as a hyperplane (see Supplementary Fig 3 (a)), to distinguish between underlying classes or groups 52 . Support vectors are the data points that lie closest to the decision boundary on both sides. In Supplementary Fig 3 (a), they are the circles (both transparent and shaded ones) close to the hyperplane. Support vectors play an essential role in deciding the position and orientation of the hyperplane. Various computational methods, including a kernel function to create more derived attributes, are applied to accommodate this process 56 . Support vector machines are not limited to binary classes but can also be generalised to a larger variety of classifications. This is accomplished through the training of separate SVMs 56 .
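The role of the hyperplane and the support vectors can be seen on a small example. The eight two-dimensional points below are hypothetical; a linear kernel is assumed so that the separating hyperplane is a straight line.

```python
from sklearn.svm import SVC

# Hypothetical, linearly separable data: two clusters of four points each.
X = [[0, 0], [1, 1], [1, 0], [0, 1], [3, 3], [4, 4], [3, 4], [4, 3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Linear-kernel SVM: finds the maximum-margin separating hyperplane.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Support vectors are the training points closest to the hyperplane.
print("support vectors:", clf.support_vectors_)

# Classify two new points, one near each cluster.
predictions = clf.predict([[0.5, 0.5], [3.5, 3.5]])
print("predictions:", predictions)
```

Only the support vectors determine where the hyperplane lies; moving any of the other points (without crossing the margin) would leave the fitted boundary unchanged.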
Logistic regression (LR) builds on the linear regression model and predicts the outcome of a dichotomous variable 57 ; for example, the presence or absence of an event. It uses a scatterplot to understand the connection between a dependent variable and one or more independent variables (see Supplementary Fig 3 (b)). The LR model fits the data to a sigmoidal curve instead of fitting it to a straight line. The natural logarithm is considered when developing the model. It provides a value between 0 and 1 that is interpreted as the probability of class membership. Best estimates are determined by iteratively refining approximate estimates until a level of stability is reached 58 . Generally, LR offers a straightforward approach for determining and observing interrelationships. It is more efficient compared to ordinary regressions 59 .
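The sigmoidal curve can be illustrated with a single independent variable. The intercept and slope below are hypothetical values chosen for the illustration, not fitted coefficients from this study.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients of a one-variable logistic regression model.
intercept, slope = -4.0, 2.0

def predict_probability(x):
    """Probability of class membership for a given value of the variable."""
    return sigmoid(intercept + slope * x)

for x in (0.0, 2.0, 4.0):
    probability = predict_probability(x)
    label = 1 if probability >= 0.5 else 0
    print(f"x={x}: probability={probability:.3f}, predicted class={label}")
```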
k -nearest neighbours (KNN) algorithm uses a process that plots prior information and applies a specific sample size ( k ) to the plot to determine the most likely scenario 52 . This method finds the nearest training examples using a distance measure. The final classification is made by counting the most common scenario or votes present within the specified sample. As illustrated in Supplementary Fig 3 (c), the closest four nearest neighbours in the small circle are three grey squares and one white square. The majority class is grey. Hence, KNN will predict the instance (i.e., Χ ) as grey. On the other hand, if we look at the larger circle of the same figure, the nearest neighbours consist of ten white squares and four grey squares. The majority class is white. Thus, KNN will classify the instance as white. KNN’s advantage lies in its ability to produce a simplified result and handle missing data 60 . In summary, KNN utilises similarities (as well as differences) and distances in the process of developing models.
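The effect of the sample size k on the vote can be sketched without any library support. The coordinates below are hypothetical and were chosen to mimic the situation described above: three grey and one white point among the four nearest neighbours, and ten white and four grey points among the fourteen nearest.

```python
import math
from collections import Counter

# Hypothetical labelled points around a query instance at the origin.
training_points = [
    ((1.0, 0.0), "grey"), ((0.0, 1.0), "grey"), ((-1.0, 0.0), "grey"),
    ((0.0, -2.5), "grey"),
    ((0.0, -1.2), "white"),
    ((2.0, 0.0), "white"), ((0.0, 2.0), "white"), ((-2.0, 0.0), "white"),
    ((2.0, 1.0), "white"), ((1.0, 2.0), "white"), ((-2.0, 1.0), "white"),
    ((-1.0, 2.0), "white"), ((2.0, -1.0), "white"), ((-2.0, -1.0), "white"),
]

def knn_predict(query, points, k):
    """Return the majority label among the k points closest to the query."""
    nearest = sorted(points, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

query = (0.0, 0.0)
print("k=4: ", knn_predict(query, training_points, k=4))
print("k=14:", knn_predict(query, training_points, k=14))
```

The same instance receives a different class depending on k, which is why k is usually treated as a hyperparameter to be tuned.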
Random forest (RF) is a machine learning process that consists of many decision trees. A decision tree is a tree-like structure where each internal node represents a test on the input attribute. It may have multiple internal nodes at different levels, and the leaf or terminal nodes represent the decision outcomes. It produces a classification outcome for a distinctive and separate part of the input vector. For numerical processes, it considers the average value, and for discrete processes, it considers the number of votes 52 . Supplementary Fig 3 (d) shows three decision trees to illustrate the function of a random forest. The outcomes from trees 1, 2 and 3 are class B, class A and class A, respectively. According to the majority vote, the final prediction will be class A. Because it considers specific attributes, it can have a tendency to emphasise specific attributes over others, which may result in some attributes being unevenly weighted 52 . Advantages of the random forest include its ability to handle multidimensionality and multicollinearity in data despite its sensitivity to sampling design.
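The majority vote across trees can be sketched as follows. The three "trees" are hypothetical one-rule functions, and the feature names and thresholds are invented for the illustration; a real random forest learns its trees from bootstrap samples of the training data.

```python
from collections import Counter

# Three hypothetical decision trees, each reduced to a single test
# on one (invented) project attribute scored from 1 to 5.
def tree_1(project):
    return "B" if project["material_delay"] >= 3 else "A"

def tree_2(project):
    return "B" if project["labour_shortage"] >= 4 else "A"

def tree_3(project):
    return "B" if project["scope_change"] >= 4 else "A"

# Hypothetical project to classify.
project = {"material_delay": 4, "labour_shortage": 2, "scope_change": 1}

# Each tree votes; the forest predicts the most common class.
votes = [tree(project) for tree in (tree_1, tree_2, tree_3)]
prediction = Counter(votes).most_common(1)[0][0]

print("votes:", votes)
print("majority vote:", prediction)
```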
Artificial neural network (ANN) simulates the way in which human brains work. This is accomplished by modelling logical propositions and incorporating weighted inputs, a transfer function and one output 61 (Supplementary Fig 3 (e)). It is advantageous because it can be used to model non-linear relationships and handle multivariate data 62 . There are two types of ANN: supervised and unsupervised. ANN learns through three major avenues. These include error-back propagation (supervised), the Kohonen (unsupervised) and the counter-propagation ANN (supervised) 62 . ANN has been used in a myriad of applications ranging from pharmaceuticals 61 to electronic devices 63 . It also possesses great levels of fault tolerance 64 and learns by example and through self-organisation 65 .
Ensemble techniques are a type of machine learning methodology in which numerous basic classifiers are combined to generate an optimal model 66 . An ensemble technique considers many models and combines them to form a single model. The combined model aims to offset the weaknesses of each individual learner, which typically yields better performance than any single learner. The stacking model is a general architecture comprising two classifier levels: base classifiers and a meta-learner 67 . The base classifiers are trained with the training dataset, and a new dataset is constructed from their outputs for the meta-learner. Afterwards, this new dataset is used to train the meta-learner. This study uses four models (SVM, LR, KNN and RF) as base classifiers and LR as the meta-learner, as illustrated in Supplementary Fig 3 (f).
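The two-level architecture can be sketched with Scikit-learn's `StackingClassifier`. This is only an illustration under stated assumptions: the data are synthetic (generated with `make_classification`), and the hyperparameters are library defaults rather than the settings listed in Supplementary Table 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): not the case-study dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Level 1: the four base classifiers named in the text.
base_classifiers = [
    ("svm", SVC(random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]

# Level 2: logistic regression as the meta-learner. The cross-validated
# predictions of the base classifiers form the new dataset it is trained on.
stack = StackingClassifier(
    estimators=base_classifiers,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
test_accuracy = stack.score(X_test, y_test)
print(f"Stacking test accuracy: {test_accuracy:.3f}")
```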
Feature selection is the process of selecting the optimal feature subset that significantly influences the predicted outcomes, which can improve model performance and reduce running time. This study considers three feature selection approaches: Univariate feature selection (UFS), Recursive feature elimination (RFE) and SelectFromModel (SFM). UFS examines each feature separately to determine the strength of its relationship with the response variable 68 . This method is straightforward to use and comprehend and helps acquire a deeper understanding of data. In this study, we calculate the chi-square value between each feature and the response variable. RFE is a type of backward feature elimination in which the model is first fit using all features in the given dataset, and the least important features are then removed one by one 69 . After each removal, the model is refit, until the desired number of features, which is set by a parameter, is left. SFM is used to choose effective features based on the feature importance of the best-performing model 70 . This approach selects features by establishing a threshold based on feature importance as indicated by the model on the training set. Features whose importance exceeds the threshold are kept, while those whose importance falls below it are discarded. In this study, we apply SFM after we compare the performance of four machine learning methods. Afterwards, we train the best-performing model again using the features from the SFM approach.
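The three approaches map onto Scikit-learn classes of the same names. The sketch below is illustrative only: the rating data are randomly generated, and the number of features to keep (five) and the `"mean"` importance threshold are assumptions rather than the study's settings. Note that `chi2` requires non-negative feature values, which rating-scale data satisfy.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, chi2

rng = np.random.RandomState(0)
# Hypothetical survey-style data: 120 samples, 12 features rated 1 to 5.
X = rng.randint(1, 6, size=(120, 12)).astype(float)
# Hypothetical target that depends on the first two features only.
y = (X[:, 0] + X[:, 1] > 6).astype(int)

# UFS: score each feature separately against the target with chi-square.
ufs = SelectKBest(score_func=chi2, k=5).fit(X, y)
X_ufs = ufs.transform(X)

# RFE: fit on all features, drop the least important one, refit, and repeat.
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=5,
    step=1,
).fit(X, y)
X_rfe = rfe.transform(X)

# SFM: keep features whose importance is at least the mean importance.
sfm = SelectFromModel(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    threshold="mean",
).fit(X, y)
X_sfm = sfm.transform(X)

print(X_ufs.shape, X_rfe.shape, X_sfm.shape)
```

Unlike UFS and RFE, SFM does not take a fixed feature count here; the number of retained features follows from the threshold.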
We split the dataset 70:30 into training and test sets for the selected machine learning algorithms. We used Python’s Scikit-learn package for implementing these algorithms 70 . Using the training data, we first developed six models based on these six algorithms. We used fivefold cross-validation, targeting improved accuracy. Then, we applied these models to the test data. We also tuned the hyperparameters of each algorithm to obtain the best possible classification outcome. Table 1 shows the performance outcomes for each algorithm during the training and test phases. The hyperparameter settings for each algorithm are listed in Supplementary Table 1 .
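A workflow of this shape (70:30 split, fivefold cross-validation on the training portion, hyperparameter search scored on accuracy) could look as follows in Scikit-learn. The data are synthetic and the parameter grid is hypothetical; the settings actually used are those in Supplementary Table 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): not the case-study dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 70:30 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

# Hypothetical grid; each candidate is scored by fivefold cross-validated accuracy.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# The best model is refit on the full training set, then applied to the test set.
train_cv_accuracy = search.best_score_
test_accuracy = search.best_estimator_.score(X_test, y_test)
print(search.best_params_, round(train_cv_accuracy, 3), round(test_accuracy, 3))
```

Keeping the test set out of the search means the reported test accuracy is not influenced by the tuning step.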
As revealed in Table 1 , random forest outperformed the other algorithms in terms of accuracy for both the training and test phases. It showed an accuracy of 78.14% and 77.50% for the training and test phases, respectively. The second-best performer in the training phase is k-nearest neighbours (76.98%), and for the test phase, the support vector machine, k-nearest neighbours and artificial neural network tie for second place (72.50%).
Since random forest showed the best performance, we explored further based on this algorithm. We applied the three approaches (UFS, RFE and SFM) for feature optimisation on the random forest. The result is presented in Table 2 . SFM shows the best outcome among these three approaches. Its accuracy is 85.00%, whereas the accuracies of UFS and RFE are 77.50% and 72.50%, respectively. As can be seen in Table 2 , the accuracy for the testing phase increases from 77.50% in Table 1 (b) to 85.00% with the SFM feature optimisation. Table 3 shows the 19 selected features from the SFM output. Out of 44 features, SFM found that 19 of them play a significant role in predicting the outcomes.
Further, Fig. 4 illustrates the confusion matrix when the random forest model with the SFM feature optimiser was applied to the test data. There are 18 true-positive, five false-negative, one false-positive and 16 true-negative cases. Therefore, the accuracy for the test phase is (18 + 16)/(18 + 5 + 1 + 16) = 85.00%.
Confusion matrix results based on the random forest model with the SFM feature optimiser (1 for the rare class and 2 for the often class).
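The accuracy calculation can be reproduced directly from the four reported cell counts. The sensitivity and specificity lines are added here purely for illustration; they are derived from the same counts and are not values reported in the text.

```python
# Cell counts reported for the random forest model with the SFM optimiser.
tp, fn, fp, tn = 18, 5, 1, 16

total = tp + fn + fp + tn
accuracy = (tp + tn) / total      # share of all test cases classified correctly
sensitivity = tp / (tp + fn)      # share of actual positives that were found
specificity = tn / (tn + fp)      # share of actual negatives that were found

print(f"n = {total}, accuracy = {accuracy:.2%}")
print(f"sensitivity = {sensitivity:.2%}, specificity = {specificity:.2%}")
```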
Figure 5 illustrates the top-10 most important features or variables based on the random forest algorithm with the SFM optimiser. We used feature importance based on the mean decrease in impurity to identify this list of important variables. Mean decrease in impurity computes each feature’s importance as the sum over the number of splits that include the feature in proportion to the number of samples it splits 71 . According to this figure, the delays in decision making attribute contributed most to the classification performance of the random forest algorithm, followed by the cash flow problem and construction cost underestimation attributes. The current construction project literature also highlights these top-10 factors as significant contributors to project cost overrun. For example, using construction project data from Jordan, Al-Hazim et al. 72 ranked 20 causes for cost overrun, including causes similar to these.
Feature importance (top-10 out of 19) based on the random forest model with the SFM feature optimiser.
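In Scikit-learn, impurity-based importances of a fitted random forest are exposed through the `feature_importances_` attribute. The sketch below uses randomly generated ratings and hypothetical attribute names; the target is constructed so that a single attribute drives the label, which makes the expected ranking easy to check.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
# Hypothetical data: 150 samples, 8 attributes rated 1 to 5.
X = rng.randint(1, 6, size=(150, 8)).astype(float)
# Hypothetical target driven by attribute 2 only.
y = (X[:, 2] >= 4).astype(int)
feature_names = [f"attribute_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean decrease in impurity, normalised by Scikit-learn to sum to 1.
importances = forest.feature_importances_

# Rank attributes from most to least important.
ranking = sorted(
    zip(feature_names, importances), key=lambda pair: pair[1], reverse=True
)
top_name, top_score = ranking[0]
for name, score in ranking[:3]:
    print(f"{name}: {score:.3f}")
```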
Further, we conduct a sensitivity analysis of the model’s ten most important features (from Fig. 5 ) to explore how a change in each feature affects the cost overrun. We utilise the partial dependence plot (PDP), which is a typical visualisation tool for non-parametric models 73 , to display the outcomes of this analysis. A PDP can demonstrate whether the relation between the target and a feature is linear, monotonic, or more complicated. The result of the sensitivity analysis is presented in Fig. 6 . For the ‘delays in decisions making’ attribute, the PDP shows that the probability is below 0.4 until the rating value is three and increases thereafter. A higher value for this attribute indicates a higher risk of cost overrun. On the other hand, no significant differences can be seen for the remaining nine features when their values change.
The result of the sensitivity analysis from the partial dependency plot tool for the ten most important features.
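The numbers behind such a plot can be computed with `sklearn.inspection.partial_dependence`. This is a sketch on synthetic ratings, not the study's data: the target is built so that the predicted probability rises once the first feature exceeds a rating of three, loosely imitating the pattern described for the leading attribute. Features are stored as floats, since recent Scikit-learn releases discourage integer columns for partial dependence.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

rng = np.random.RandomState(0)
# Hypothetical data: 200 samples, 4 features rated 1 to 5 (stored as floats).
X = rng.randint(1, 6, size=(200, 4)).astype(float)
# Hypothetical target: the positive class occurs when feature 0 is rated above 3.
y = (X[:, 0] > 3).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Average predicted probability of the positive class as feature 0 varies,
# with the other features kept at their observed values.
pd_result = partial_dependence(forest, X, features=[0], kind="average")
curve = pd_result["average"][0]

for rating, probability in zip(sorted(set(X[:, 0])), curve):
    print(f"rating {rating:.0f}: average predicted probability {probability:.2f}")
```

Because the feature takes only five distinct values, the grid consists of those five ratings, so the curve can be read one rating at a time.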
We illustrated an application of the proposed machine learning-based research framework in classifying construction projects. RF showed the highest accuracy in predicting the test dataset. For a new data instance that has information for its 19 features but no information on its classification, RF can identify its class ( rare or often ) correctly with a probability of 85.00%. If more data is provided to the machine learning algorithms, in addition to the 139 instances of the case study, then their accuracy and efficiency in project classification can be expected to improve with subsequent training. For example, if we provide 100 more data instances, these algorithms will have an additional 70 instances for training with a 70:30 split. This capacity for continuous improvement puts machine learning algorithms in a superior position over traditional methods. In the current literature, some studies explore the factors contributing to project delay or cost overrun. In most cases, they applied factor analysis or other related statistical methods for research data analysis 72 , 74 , 75 . In addition to identifying important attributes, the proposed machine learning-based framework identified the ranking of factors and how eliminating less important factors affects the prediction accuracy when applied to this case study.
We shared the Python software developed to implement the four machine learning algorithms considered in this case study using GitHub 76 , a software hosting internet site. A user-friendly version of this software can be accessed at https://share.streamlit.io/haohuilu/pa/main/app.py . The accuracy findings from this link could be slightly different from one run to another due to the hyperparameter settings of the corresponding machine learning algorithms.
Due to their robust prediction ability, machine learning methods have already gained wide acceptance across a broad range of research domains. On the other hand, EVM is the most commonly used method in project analytics due to its simplicity and ease of interpretation 77 . Substantial research efforts have been made to improve its generalisability over time. For example, Naeni et al. 34 developed a fuzzy approach for earned value analysis to make it suitable for analysing project scenarios with ambiguous or linguistic outcomes. Acebes 78 integrated Monte Carlo simulation with EVM for project monitoring and control for a similar purpose. Another prominent method frequently used in project analytics is time series analysis, which is well suited to the longitudinal prediction of project time and cost 30 . As is evident in the current literature, little effort has been made to bring machine learning into project analytics for addressing project management research problems. This research made a significant attempt to fill this gap.
Our proposed data-driven framework only includes the fundamental model development and application process components for machine learning algorithms. It omits some advanced machine learning methods. This study intentionally did not consider them for the proposed model since they are required only in particular designs of machine learning analysis. For example, the framework does not contain any methods or tools to handle the data imbalance issue. Data imbalance refers to a situation in which the research dataset has an uneven distribution of the target class 79 . For example, a binary target variable will cause a data imbalance issue if one of its class labels has a very high number of observations compared with the other class. Commonly used techniques to address this issue are undersampling and oversampling. The undersampling technique decreases the size of the majority class. On the other hand, the oversampling technique randomly duplicates the minority class until the class distribution becomes balanced 79 . The class distribution of the case study did not produce any data imbalance issues.
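Random oversampling as described can be sketched in plain Python. The helper name `random_oversample` and the toy class counts are hypothetical, and this is a minimal sketch, not a substitute for a dedicated library. In practice, resampling is applied to the training portion only, so that duplicated rows do not leak into the test set.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # Rows belonging to this class, used as the pool to draw duplicates from.
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical imbalanced training set: 8 "often" cases and 2 "rare" cases.
rows = [[i] for i in range(10)]
labels = ["often"] * 8 + ["rare"] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
balanced_counts = Counter(balanced_labels)
print(balanced_counts)
```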
This study considered only six fundamental machine learning algorithms for the case study, although many other such algorithms are available in the literature. For example, it did not consider the extreme gradient boosting (XGBoost) algorithm. XGBoost is based on the decision tree algorithm, similar to the random forest algorithm 80 . It has become dominant in applied machine learning due to its performance and speed. Naïve Bayes and convolutional neural networks are other popular machine learning algorithms that were not considered when applying the proposed framework to the case study. In addition to the three feature selection methods, multi-view learning can be adopted when applying the proposed framework to the case study. Multi-view learning is another direction in machine learning that considers learning with multiple views of the existing data with the aim of improving predictive performance 81 , 82 . Similarly, although we considered five performance measures, there are other potential candidates. One such example is the area under the receiver operating characteristic curve, which measures the ability of the underlying classifier to distinguish between classes 48 . We leave them as potential application scope when applying our proposed framework to other project contexts in future studies.
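The area under the receiver operating characteristic curve has a simple pairwise reading: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The plain-Python sketch below computes it that way. The function name `roc_auc` and the four example scores are illustrative assumptions, and the quadratic pairwise loop is intended for small examples only.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present to compute the AUC.")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and classifier scores.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75: three of the four positive-negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.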
Although this study only used one case study for illustration, our proposed research framework can be used in other project analytics contexts. In such an application context, the underlying research goal should be to predict the outcome classes and find attributes playing a significant role in making correct predictions. For example, by considering two types of projects based on the time required to accomplish (e.g., on-time and delayed ), the proposed framework can develop machine learning models that can predict the class of a new data instance and find out attributes contributing mainly to this prediction performance. This framework can also be used at any stage of the project. For example, the framework’s results allow project stakeholders to screen projects for excessive cost overruns and forecast budget loss at bidding and before contracts are signed. In addition, various factors that contribute to project cost overruns can be figured out at an earlier stage. These elements emerge at each stage of a project’s life cycle. The framework’s feature importance helps project managers locate the critical contributor to cost overrun.
This study has made an important contribution to the current project analytics literature by considering the applications of machine learning within project management. Project management is often thought of as being very fluid in nature, and because of this, applications of machine learning are often more difficult. Further, existing implementations have largely been limited to safety monitoring, risk prediction and cost estimation. Through the evaluation of machine learning applications, this study further demonstrates how such algorithms can be used to model the relationship between project attributes and cost overrun frequency.
The applications of machine learning in project analytics are still undergoing constant development. Within construction projects, its applications have been largely limited and focused on profitability or the design of structures themselves. In this regard, our study made a substantial effort by proposing a machine learning-based framework to address research problems related to project analytics. We also illustrated an example of this framework’s application in the context of construction project management.
Like any other research, this study has a few limitations that could provide scope for future research. First, the framework does not include some advanced machine learning techniques, such as methods for handling data imbalance and kernel density estimation. Second, we considered only one case study to illustrate the application of the proposed framework. Illustrations of this framework using case studies from different project contexts would confirm its robust application. Finally, this study did not consider all machine learning models and performance measures available in the literature for the case study. For example, we did not consider the Naïve Bayes model and the precision measure in applying the proposed research framework to the case study.
This study obtained research data from publicly available online repositories. We mentioned their sources using proper citations. Here is the link to the data https://www.kaggle.com/datasets/amansaxena/survey-on-road-construction-delay .
Venkrbec, V. & Klanšek, U. In: Advances and Trends in Engineering Sciences and Technologies II 685–690 (CRC Press, 2016).
Damnjanovic, I. & Reinschmidt, K. Data Analytics for Engineering and Construction Project Risk Management (Springer, 2020).
Singh, H. Project Management Analytics: A Data-driven Approach to Making Rational and Effective Project Decisions (FT Press, 2015).
Frame, J. D. & Chen, Y. Why Data Analytics in Project Management? (Auerbach Publications, 2018).
Ong, S. & Uddin, S. Data Science and Artificial Intelligence in Project Management: The Past, Present and Future. J. Mod. Proj. Manag. 7 , 26–33 (2020).
Bilal, M. et al. Investigating profitability performance of construction projects using big data: A project analytics approach. J. Build. Eng. 26 , 100850 (2019).
Radziszewska-Zielina, E. & Sroka, B. Planning repetitive construction projects considering technological constraints. Open Eng. 8 , 500–505 (2018).
Neely, A. D., Adams, C. & Kennerley, M. The Performance Prism: The Scorecard for Measuring and Managing Business Success (Prentice Hall Financial Times, 2002).
Kanakaris, N., Karacapilidis, N., Kournetas, G. & Lazanas, A. In: International Conference on Operations Research and Enterprise Systems. 135–155 Springer.
Jordan, M. I. & Mitchell, T. M. Machine learning: Trends, perspectives, and prospects. Science 349 , 255–260 (2015).
Shalev-Shwartz, S. & Ben-David, S. Understanding Machine Learning: From Theory to Algorithms (Cambridge University Press, 2014).
Rahimian, F. P., Seyedzadeh, S., Oliver, S., Rodriguez, S. & Dawood, N. On-demand monitoring of construction projects through a game-like hybrid application of BIM and machine learning. Autom. Constr. 110 , 103012 (2020).
Sanni-Anibire, M. O., Zin, R. M. & Olatunji, S. O. Machine learning model for delay risk assessment in tall building projects. Int. J. Constr. Manag. 22 , 1–10 (2020).
Cong, J. et al. A machine learning-based iterative design approach to automate user satisfaction degree prediction in smart product-service system. Comput. Ind. Eng. 165 , 107939 (2022).
Li, F., Chen, C.-H., Lee, C.-H. & Feng, S. Artificial intelligence-enabled non-intrusive vigilance assessment approach to reducing traffic controller’s human errors. Knowl. Based Syst. 239 , 108047 (2021).
Mohri, M., Rostamizadeh, A. & Talwalkar, A. Foundations of Machine Learning (MIT Press, 2018).
Whyte, J., Stasis, A. & Lindkvist, C. Managing change in the delivery of complex projects: Configuration management, asset information and ‘big data’. Int. J. Proj. Manag. 34 , 339–351 (2016).
Zangeneh, P. & McCabe, B. Ontology-based knowledge representation for industrial megaprojects analytics using linked data and the semantic web. Adv. Eng. Inform. 46 , 101164 (2020).
Akinosho, T. D. et al. Deep learning in the construction industry: A review of present status and future innovations. J. Build. Eng. 32 , 101827 (2020).
Soman, R. K., Molina-Solana, M. & Whyte, J. K. Linked-Data based constraint-checking (LDCC) to support look-ahead planning in construction. Autom. Constr. 120 , 103369 (2020).
Soman, R. K. & Whyte, J. K. Codification challenges for data science in construction. J. Constr. Eng. Manag. 146 , 04020072 (2020).
Soman, R. K. & Molina-Solana, M. Automating look-ahead schedule generation for construction using linked-data based constraint checking and reinforcement learning. Autom. Constr. 134 , 104069 (2022).
Shi, F., Soman, R. K., Han, J. & Whyte, J. K. Addressing adjacency constraints in rectangular floor plans using Monte-Carlo tree search. Autom. Constr. 115 , 103187 (2020).
Chen, L. & Whyte, J. Understanding design change propagation in complex engineering systems using a digital twin and design structure matrix. Eng. Constr. Archit. Manag. (2021).
Allison, J. T. et al. Artificial intelligence and engineering design. J. Mech. Des. 144 , 020301 (2022).
Dutta, D. & Bose, I. Managing a big data project: The case of ramco cements limited. Int. J. Prod. Econ. 165 , 293–306 (2015).
Bilal, M. & Oyedele, L. O. Guidelines for applied machine learning in construction industry—A case of profit margins estimation. Adv. Eng. Inform. 43 , 101013 (2020).
Tayefeh Hashemi, S., Ebadati, O. M. & Kaur, H. Cost estimation and prediction in construction projects: A systematic review on machine learning techniques. SN Appl. Sci. 2 , 1–27 (2020).
Arage, S. S. & Dharwadkar, N. V. In: International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud)(I-SMAC). 594–599 (IEEE, 2017).
Cheng, C.-H., Chang, J.-R. & Yeh, C.-A. Entropy-based and trapezoid fuzzification-based fuzzy time series approaches for forecasting IT project cost. Technol. Forecast. Soc. Chang. 73 , 524–542 (2006).
Joukar, A. & Nahmens, I. Volatility forecast of construction cost index using general autoregressive conditional heteroskedastic method. J. Constr. Eng. Manag. 142 , 04015051 (2016).
Xu, J.-W. & Moon, S. Stochastic forecast of construction cost index using a cointegrated vector autoregression model. J. Manag. Eng. 29 , 10–18 (2013).
Narbaev, T. & De Marco, A. Combination of growth model and earned schedule to forecast project cost at completion. J. Constr. Eng. Manag. 140 , 04013038 (2014).
Naeni, L. M., Shadrokh, S. & Salehipour, A. A fuzzy approach for the earned value management. Int. J. Proj. Manag. 29 , 764–772 (2011).
Ponz-Tienda, J. L., Pellicer, E. & Yepes, V. Complete fuzzy scheduling and fuzzy earned value management in construction projects. J. Zhejiang Univ. Sci. A 13 , 56–68 (2012).
Yu, F., Chen, X., Cory, C. A., Yang, Z. & Hu, Y. An active construction dynamic schedule management model: Using the fuzzy earned value management and BP neural network. KSCE J. Civ. Eng. 25 , 2335–2349 (2021).
Bonato, F. K., Albuquerque, A. A. & Paixão, M. A. S. An application of earned value management (EVM) with Monte Carlo simulation in engineering project management. Gest. Produção 26 , e4641 (2019).
Batselier, J. & Vanhoucke, M. Empirical evaluation of earned value management forecasting accuracy for time and cost. J. Constr. Eng. Manag. 141 , 05015010 (2015).
Yang, R. J. & Zou, P. X. Stakeholder-associated risks and their interactions in complex green building projects: A social network model. Build. Environ. 73 , 208–222 (2014).
Uddin, S. Social network analysis in project management–A case study of analysing stakeholder networks. J. Mod. Proj. Manag. 5 , 106–113 (2017).
Ong, S. & Uddin, S. Co-evolution of project stakeholder networks. J. Mod. Proj. Manag. 8 , 96–115 (2020).
Khanzode, K. C. A. & Sarode, R. D. Advantages and disadvantages of artificial intelligence and machine learning: A literature review. Int. J. Libr. Inf. Sci. (IJLIS) 9 , 30–36 (2020).
Loyola-Gonzalez, O. Black-box vs. white-box: Understanding their advantages and weaknesses from a practical point of view. IEEE Access 7 , 154096–154113 (2019).
Abioye, S. O. et al. Artificial intelligence in the construction industry: A review of present status, opportunities and future challenges. J. Build. Eng. 44 , 103299 (2021).
Doloi, H., Sawhney, A., Iyer, K. & Rentala, S. Analysing factors affecting delays in Indian construction projects. Int. J. Proj. Manag. 30 , 479–489 (2012).
Alkhaddar, R., Wooder, T., Sertyesilisik, B. & Tunstall, A. Deep learning approach’s effectiveness on sustainability improvement in the UK construction industry. Manag. Environ. Qual. Int. J. 23 , 126–139 (2012).
Gondia, A., Siam, A., El-Dakhakhni, W. & Nassar, A. H. Machine learning algorithms for construction projects delay risk prediction. J. Constr. Eng. Manag. 146 , 04019085 (2020).
Witten, I. H. & Frank, E. Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann, 2005).
Kanakaris, N., Karacapilidis, N. I. & Lazanas, A. In: ICORES. 362–369.
Heo, S., Han, S., Shin, Y. & Na, S. Challenges of data refining process during the artificial intelligence development projects in the architecture engineering and construction industry. Appl. Sci. 11 , 10919 (2021).
Bross, I. D. How to use ridit analysis. Biometrics 14 , 18–38 (1958).
Uddin, S., Khan, A., Hossain, M. E. & Moni, M. A. Comparing different supervised machine learning algorithms for disease prediction. BMC Med. Inform. Decis. Mak. 19 , 1–16 (2019).
LaValle, S. M., Branicky, M. S. & Lindemann, S. R. On the relationship between classical grid search and probabilistic roadmaps. Int. J. Robot. Res. 23 , 673–692 (2004).
Abdi, H. & Williams, L. J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2 , 433–459 (2010).
Saxena, A. Survey on Road Construction Delay , https://www.kaggle.com/amansaxena/survey-on-road-construction-delay (2021).
Noble, W. S. What is a support vector machine? Nat. Biotechnol. 24 , 1565–1567 (2006).
Hosmer, D. W. Jr., Lemeshow, S. & Sturdivant, R. X. Applied Logistic Regression Vol. 398 (John Wiley & Sons, 2013).
LaValley, M. P. Logistic regression. Circulation 117 , 2395–2399 (2008).
Menard, S. Applied Logistic Regression Analysis Vol. 106 (Sage, 2002).
Batista, G. E. & Monard, M. C. A study of K-nearest neighbour as an imputation method. His 87 , 48 (2002).
Agatonovic-Kustrin, S. & Beresford, R. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J. Pharm. Biomed. Anal. 22 , 717–727 (2000).
Zupan, J. Introduction to artificial neural network (ANN) methods: What they are and how to use them. Acta Chim. Slov. 41 , 327–327 (1994).
Hopfield, J. J. Artificial neural networks. IEEE Circuits Devices Mag. 4 , 3–10 (1988).
Zou, J., Han, Y. & So, S.-S. Overview of artificial neural networks. Artificial Neural Networks . 14–22 (2008).
Maind, S. B. & Wankar, P. Research paper on basic of artificial neural network. Int. J. Recent Innov. Trends Comput. Commun. 2 , 96–100 (2014).
Wolpert, D. H. Stacked generalization. Neural Netw. 5 , 241–259 (1992).
Pavlyshenko, B. In: IEEE Second International Conference on Data Stream Mining & Processing (DSMP). 255–258 (IEEE).
Jović, A., Brkić, K. & Bogunović, N. In: 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). 1200–1205 (IEEE, 2015).
Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 46 , 389–422 (2002).
Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12 , 2825–2830 (2011).
Louppe, G., Wehenkel, L., Sutera, A. & Geurts, P. Understanding variable importances in forests of randomized trees. Adv. Neural. Inf. Process. Syst. 26 , 431–439 (2013).
Al-Hazim, N., Salem, Z. A. & Ahmad, H. Delay and cost overrun in infrastructure projects in Jordan. Procedia Eng. 182 , 18–24 (2017).
Breiman, L. Random forests. Mach. Learn. 45 , 5–32. https://doi.org/10.1023/A:1010933404324 (2001).
Shehu, Z., Endut, I. R. & Akintoye, A. Factors contributing to project time and hence cost overrun in the Malaysian construction industry. J. Financ. Manag. Prop. Constr. 19 , 55–75 (2014).
Akomah, B. B. & Jackson, E. N. Contractors’ perception of factors contributing to road project delay. Int. J. Constr. Eng. Manag. 5 , 79–85 (2016).
GitHub: Where the world builds software , https://github.com/ .
Anbari, F. T. Earned value project management method and extensions. Proj. Manag. J. 34 , 12–23 (2003).
Acebes, F., Pereda, M., Poza, D., Pajares, J. & Galán, J. M. Stochastic earned value analysis using Monte Carlo simulation and statistical learning techniques. Int. J. Proj. Manag. 33 , 1597–1609 (2015).
Japkowicz, N. & Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 6 , 429–449 (2002).
Chen, T. et al. Xgboost: extreme gradient boosting. R Packag. Version 0.4–2.1 1 , 1–4 (2015).
Guarino, A., Lettieri, N., Malandrino, D., Zaccagnino, R. & Capo, C. Adam or Eve? Automatic users’ gender classification via gestures analysis on touch devices. Neural Comput. Appl. 1–23 (2022).
Zaccagnino, R., Capo, C., Guarino, A., Lettieri, N. & Malandrino, D. Techno-regulation and intelligent safeguards. Multimed. Tools Appl. 80 , 15803–15824 (2021).
The authors acknowledge the insightful comments from Prof Jennifer Whyte on an earlier version of this article.
Authors and affiliations.
School of Project Management, The University of Sydney, Level 2, 21 Ross St, Forest Lodge, NSW, 2037, Australia
Shahadat Uddin, Stephen Ong & Haohui Lu
S.U.: Conceptualisation; Data curation; Formal analysis; Methodology; Supervision; and Writing (original draft, review and editing). S.O.: Data curation; and Writing (original draft, review and editing). H.L.: Methodology; and Writing (original draft, review and editing). All authors reviewed the manuscript.
Correspondence to Shahadat Uddin .
Competing interests.
The authors declare no competing interests.
Publisher's note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information.
Rights and permissions.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .
Cite this article.
Uddin, S., Ong, S. & Lu, H. Machine learning in project analytics: a data-driven framework and case study. Sci Rep 12 , 15252 (2022). https://doi.org/10.1038/s41598-022-19728-x
Received : 13 April 2022
Accepted : 02 September 2022
Published : 09 September 2022
DOI : https://doi.org/10.1038/s41598-022-19728-x
As the name suggests, classification is the task of “classifying things” into sub-categories. Classification is part of supervised machine learning, in which we provide labeled data for training.
The article serves as a comprehensive guide to understanding and applying classification techniques, highlighting their significance and practical implications.
Supervised Machine Learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output Y = f(X) . The goal is to approximate the mapping function so well that when you have new input data (x) you can predict the output variables (Y) for that data.
Supervised learning problems can be further grouped into Regression and Classification problems.
Classification is a process of categorizing data or objects into predefined classes or categories based on their features or attributes.
Machine Learning classification is a type of supervised learning technique where an algorithm is trained on a labeled dataset to predict the class or category of new, unseen data.
The main objective of classification machine learning is to build a model that can accurately assign a label or category to a new observation based on its features.
For example, a classification model might be trained on a dataset of images labeled as either dogs or cats and then used to predict the class of new, unseen images of dogs or cats based on their features such as color, texture, and shape.
There are two main classification types in machine learning:
In binary classification, the goal is to classify the input into one of two classes or categories. Example – On the basis of the given health conditions of a person, we have to determine whether the person has a certain disease or not.
In multi-class classification, the goal is to classify the input into one of several classes or categories. For example, on the basis of data about different species of flowers, we have to determine which species our observation belongs to.
Binary vs Multi class classification
Other categories of classification include:
In multi-label classification, the goal is to predict which of several labels apply to a new data point. This differs from multi-class classification, where each data point can belong to only one class. For example, a multi-label classification algorithm could be used to classify images of animals as belonging to one or more of the categories cat, dog, bird, or fish.
In imbalanced classification, the goal is to predict whether a new data point belongs to a minority class, even though there are many more examples of the majority class. For example, a medical diagnosis algorithm could be used to predict whether a patient has a rare disease, even though there are many more patients with common diseases.
There are various types of classification algorithms. Some of them are:
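The difficulty with imbalanced data can be illustrated with a small sketch. The class counts (990 versus 10) and the "balanced" weighting formula n_samples / (n_classes * class_count) are assumptions chosen for illustration; the weighting is one common heuristic, not something prescribed by the text above.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos


# Hypothetical data: 990 healthy patients (0), 10 with a rare disease (1).
y_true = [0] * 990 + [1] * 10
# A useless model that always predicts the majority class.
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_recall = recall(y_true, y_pred)
class_weights = balanced_class_weights(y_true)

print(accuracy, minority_recall, class_weights)
```

The always-majority model looks accurate while finding none of the rare cases, which is why recall on the minority class (or class weighting during training) tends to matter more than raw accuracy here.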
Linear models create a linear decision boundary between classes. They are simple and computationally efficient. Some of the linear classification models are as follows:
Non-linear models create a non-linear decision boundary between classes. They can capture more complex relationships between the input features and the target variable. Some of the non-linear classification models are as follows:
In machine learning, classification learners can also be classified as either “lazy” or “eager” learners.
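As a rough illustration of the distinction, the sketch below contrasts a lazy learner (k-nearest neighbours, which only stores the data at training time) with an eager one (a nearest-centroid classifier, which summarises the data into a model up front). Both classes, their names, and the toy points are assumptions made for this example.

```python
import math
from collections import Counter


class LazyKNN:
    """Lazy learner: fit() only stores the data; all work happens in predict()."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, x):
        # Rank stored points by distance to x and take a majority vote of the k nearest.
        nearest = sorted(zip(self.X, self.y), key=lambda pair: math.dist(x, pair[0]))[: self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]


class EagerCentroid:
    """Eager learner: fit() builds a compact model (one centroid per class)."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            pts = [xi for xi, yi in zip(X, y) if yi == label]
            self.centroids[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
        return self

    def predict(self, x):
        # Only the centroids are consulted; the training data is no longer needed.
        return min(self.centroids, key=lambda label: math.dist(x, self.centroids[label]))


X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
y = ["a", "a", "a", "b", "b", "b"]

lazy = LazyKNN(k=3).fit(X, y)
eager = EagerCentroid().fit(X, y)
lazy_preds = [lazy.predict((1.5, 1.5)), lazy.predict((6.5, 6.5))]
eager_preds = [eager.predict((1.5, 1.5)), eager.predict((6.5, 6.5))]
print(lazy_preds, eager_preds)
```

The trade-off is that the lazy learner is cheap to train but pays at prediction time, while the eager learner does the opposite.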
Evaluating a classification model is an important step in machine learning, as it helps to assess the performance and generalization ability of the model on new, unseen data. There are several metrics and techniques that can be used to evaluate a classification model, depending on the specific problem and requirements. Here are some commonly used evaluation metrics:
It is important to choose the appropriate evaluation metric(s) based on the specific problem and requirements, and to avoid overfitting by evaluating the model on independent test data.
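Several of the most common metrics follow directly from the four cells of a binary confusion matrix. The following is a minimal sketch in plain Python; the label vectors are made-up values for illustration only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy held-out labels and predictions (illustrative values).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(y_true, y_pred)
print(counts, metrics)
```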
Here are the characteristics of the classification:
The basic idea behind classification is to train a model on a labeled dataset, where the input data is associated with their corresponding output labels, to learn the patterns and relationships between the input data and output labels. Once the model is trained, it can be used to predict the output labels for new unseen data.
Classification Machine Learning
The classification process typically involves the following steps:
Before getting started with classification, it is important to understand the problem you are trying to solve. What are the class labels you are trying to predict? What is the relationship between the input data and the class labels?
Suppose we have to predict whether a patient has a certain disease or not, on the basis of 7 independent variables, called features. This means, there can be only two possible outcomes:
This is a binary classification problem.
Once you have a good understanding of the problem, the next step is to prepare your data. This includes collecting and preprocessing the data and splitting it into training, validation, and test sets. In this step, the data is cleaned, preprocessed, and transformed into a format that can be used by the classification algorithm.
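The splitting step can be sketched as follows. The 70/15/15 proportions, the fixed seed, and the function name are assumptions for illustration; real projects often use a library helper and may need stratified splits.

```python
import random


def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle the rows once, then cut them into three disjoint parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


# 100 placeholder samples, identified here by their index.
samples = list(range(100))
train, val, test = train_val_test_split(samples)
print(len(train), len(val), len(test))
```

Keeping the test set untouched until the very end is what makes the final performance estimate trustworthy.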
The relevant features or attributes are extracted from the data that can be used to differentiate between the different classes.
Suppose our input X has 7 independent features, of which only 5 influence the label or target values, while the remaining 2 are negligibly correlated or uncorrelated with it. In that case, we use only those 5 features for model training.
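One simple way to decide which features to keep is to measure each feature's correlation with the target and drop those below a cut-off. The sketch below uses three toy features instead of seven for brevity; the Pearson-correlation criterion and the 0.3 threshold are assumptions, and other selection methods exist.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def select_features(columns, target, min_abs_corr=0.3):
    """Keep indices of features whose |correlation| with the target is large enough."""
    return [i for i, col in enumerate(columns) if abs(pearson(col, target)) >= min_abs_corr]


target = [0, 0, 0, 0, 1, 1, 1, 1]
features = [
    [1, 2, 1, 2, 5, 6, 5, 6],  # rises with the target
    [9, 8, 9, 8, 3, 2, 3, 2],  # falls with the target
    [1, 2, 2, 1, 1, 2, 2, 1],  # unrelated to the target
]

selected = select_features(features, target)
print(selected)
```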
There are many different models that can be used for classification, including logistic regression, decision trees, support vector machines (SVM), or neural networks . It is important to select a model that is appropriate for your problem, taking into account the size and complexity of your data, and the computational resources you have available.
Once you have selected a model, the next step is to train it on your training data. This involves adjusting the parameters of the model to minimize the error between the predicted class labels and the actual class labels for the training data.
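What "adjusting the parameters to minimize the error" means can be shown with a small logistic regression trained by gradient descent. This is a sketch under stated assumptions: one input feature, a fixed learning rate of 0.5, 5000 steps, and toy data; none of these values come from the original text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_log_loss(w, b, xs, ys):
    """Average cross-entropy between predicted probabilities and true labels."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w * x + b), eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


def train(xs, ys, lr=0.5, steps=5000):
    """Gradient descent on the log loss for a one-feature logistic regression."""
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

initial_loss = mean_log_loss(0.0, 0.0, xs, ys)
w, b = train(xs, ys)
final_loss = mean_log_loss(w, b, xs, ys)
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
print(initial_loss, final_loss, predictions)
```

Each step nudges the weight and bias in the direction that lowers the training loss, which is the parameter adjustment described above.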
Evaluating the model: After training the model, it is important to evaluate its performance on a validation set. This will give you a good idea of how well the model is likely to perform on new, unseen data.
Log Loss or Cross-Entropy Loss, Confusion Matrix, Precision, Recall, and AUC-ROC curve are the quality metrics used for measuring the performance of the model.
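Two of these metrics work on predicted probabilities rather than hard labels. The sketch below computes log loss and AUC-ROC by hand; the AUC is computed as the fraction of (positive, negative) pairs that the model ranks correctly, which is an equivalent pairwise formulation. The labels and probabilities are illustrative values.

```python
import math


def log_loss(y_true, probs, eps=1e-12):
    """Mean cross-entropy; probs are predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def auc_roc(y_true, probs):
    """Fraction of positive/negative pairs ranked correctly (ties count as half)."""
    pos = [p for y, p in zip(y_true, probs) if y == 1]
    neg = [p for y, p in zip(y_true, probs) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 0, 1, 0]
probs = [0.9, 0.4, 0.35, 0.2]

loss = log_loss(y_true, probs)
auc = auc_roc(y_true, probs)
print(round(loss, 4), auc)
```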
If the model’s performance is not satisfactory, you can fine-tune it by adjusting the parameters, or trying a different model.
Finally, once we are satisfied with the performance of the model, we can deploy it to make predictions on new data and use it on real-world problems.
Classification algorithms are widely used in many real-world applications across various domains, including:
Let’s get a hands-on experience with how Classification works. We are going to study various Classifiers and see a rather simple analytical comparison of their performance on a well-known, standard data set, the Iris data set.
Requirements for running the given script:
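A minimal sketch of such a comparison on the Iris data set is shown here, assuming Python 3 with scikit-learn installed. The particular classifiers, their hyperparameters, and the use of 5-fold cross-validation are illustrative assumptions rather than the original script.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# 150 flowers, 4 numeric features, 3 species.
X, y = load_iris(return_X_y=True)

classifiers = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "k-Nearest Neighbours": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each classifier.
scores = {
    name: cross_val_score(clf, X, y, cv=5).mean()
    for name, clf in classifiers.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:25s} {score:.3f}")
```

Cross-validation is used instead of a single split because Iris is small, so a single held-out set would give a noisy estimate.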
In conclusion, classification is a fundamental task in machine learning, involving the categorization of data into predefined classes or categories based on their features.
What is a classification rule in machine learning?
A decision guideline in machine learning determining the class or category of input based on features.
Methods like decision trees, SVM, and k-NN categorizing data into predefined classes for predictions.
Acquiring knowledge to assign labels to input data, distinguishing classes in supervised machine learning.
Classification: Predicts predefined classes. Clustering: Groups data based on inherent similarities without predefined classes.
Classification: Assigns labels to data classes. Regression: Predicts continuous values for quantitative analysis.
Working on case studies is one of the best practices that will help you improve your problem-solving skills as a data scientist. In this article, I’m going to introduce you to some of the best data science case studies based on the problems of classification that will help you understand and solve problems based on classification using machine learning.
If you are getting your first data science job at a company that does internet advertising, this case study will help you a lot. By understanding the click-through rate, an ad agency selects and targets the potential customers who are most likely to respond to ads.
This data science case study is based on classification because you have to predict whether a person will respond to ads or not. You can find this data science case study solved and explained using Python from here .
Smartphones are one of the best-selling electronic devices because people keep buying new smartphones when they find new features on a new device. In such a situation, it is very difficult for someone to decide on the price of a smartphone who is considering starting a new smartphone business.
So in this task, you have to categorize the price range of smartphones to give an idea of the price range of a smartphone based on its features. You can find this data science case study on classification solved and explained using Python from here .
The prediction of a person’s gender is based on the problem of classification and computer vision. If you get your first data science job at a company that is very active in building computer vision applications, this will be the most basic classification-based task for you.
Here you need to train a model that can detect the gender of a person by taking an input image or using a real-time camera. You can find a solution for this data science case study from here .
Working on case studies is one of the best practices that will help you improve your problem-solving skills as a data scientist. I hope you liked this article on data science case studies on classification. All three case studies mentioned in this article are solved and explained using Python. Feel free to ask your valuable questions in the comments section below.
Data Strategist at Statso. My aim is to decode data science for the real world in the most simple words.
Machine learning is revolutionizing how different industries function, from healthcare to finance to transportation. If you're curious about how this technology is applied in real-world scenarios, look no further. In this blog, we'll explore some exciting machine learning case studies that showcase the potential of this powerful emerging technology.
Machine-learning-based applications have quickly transformed work methods in the technological world. It is changing the way we work, live, and interact with the world around us. Machine learning is revolutionizing industries, from personalized recommendations on streaming platforms to self-driving cars.
But while the technology of artificial intelligence and machine learning may seem abstract or daunting to some, its applications are incredibly tangible and impactful. Data scientists use machine learning algorithms to predict equipment failures in manufacturing, improve cancer diagnoses in healthcare, and even detect fraudulent activity in finance. If you're interested in learning more about how machine learning is applied in real-world scenarios, you are on the right page. This blog will explore in depth how machine learning applications are used for solving real-world problems.
We'll start with a few case studies from GitHub that examine how machine learning is being used by businesses to retain their customers and improve customer satisfaction. We'll also look at how machine learning is being used with the help of Python programming language to detect and prevent fraud in the financial sector and how it can save companies millions of dollars in losses. Next, we will examine how top companies use machine learning to solve various business problems. Additionally, we'll explore how machine learning is used in the healthcare industry, and how this technology can improve patient outcomes and save lives.
By going through these case studies, you will better understand how machine learning is transforming work across different industries. So, let's get started!
Machine Learning Case Studies on GitHub, Machine Learning Case Studies in Python, Company-Specific Machine Learning Case Studies, Machine Learning Case Studies in Biology and Healthcare, AWS Machine Learning Case Studies, Azure Machine Learning Case Studies, How to Prepare for Machine Learning Case Studies Interview.
This section has machine learning case studies along with their GitHub repository that contains the sample code.
Predicting customer churn is essential for businesses interested in retaining customers and maximizing their profits. By leveraging historical customer data, machine learning algorithms can identify patterns and factors that are correlated with churn, enabling businesses to take proactive steps to prevent it.
In this case study, you will study how a telecom company uses machine learning for customer churn prediction. The available data contains information about the services each customer signed up for, their contact information, monthly charges, and their demographics. The goal is to first analyze the data at hand with the help of methods used in Exploratory Data Analysis . It will assist in picking a suitable machine-learning algorithm. The five machine learning models used in this case-study are AdaBoost, Gradient Boost, Random Forest, Support Vector Machines, and K-Nearest Neighbors. These models are used to determine which customers are at risk of churn.
By using machine learning for churn prediction, businesses can better understand customer behavior, identify areas for improvement, and implement targeted retention strategies. It can result in increased customer loyalty, higher revenue, and a better understanding of customer needs and preferences. This case study example will help you understand how machine learning is a valuable tool for any business looking to improve customer retention and stay ahead of the competition.
GitHub Repository: https://github.com/Pradnya1208/Telecom-Customer-Churn-prediction
Market basket analysis is a common application of machine learning in retail and e-commerce, where it is used to identify patterns and relationships between products that are frequently purchased together. By leveraging this information, businesses can make informed decisions about product placement, promotions, and pricing strategies.
In this case study, you will utilize the EDA methods to carefully analyze the relationships among different variables in the data. Next, you will study how to use the Apriori algorithm to identify frequent itemsets and association rules, which describe the likelihood of a product being purchased given the presence of another product. These rules can generate recommendations, optimize product placement, and increase sales, and they can also be used for customer segmentation.
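The core of Apriori can be sketched in a few lines of plain Python. This is a simplified version under stated assumptions: the transactions are toy data, the minimum support of 0.4 is arbitrary, and the explicit subset-pruning step of the full algorithm is omitted (infrequent candidates are still filtered out by the support count).

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: frequent k-itemsets are joined into (k+1)-candidates."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        level = {}
        for cand in candidates:
            s = support(cand, transactions)
            if s >= min_support:
                level[cand] = s
        frequent.update(level)
        k += 1
        # Join pairs of frequent itemsets whose union has exactly k items.
        candidates = list({a | b for a, b in combinations(level, 2) if len(a | b) == k})
    return frequent


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) = support(A and B) / support(A)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

frequent = frequent_itemsets(transactions, min_support=0.4)
conf_butter_bread = confidence(frozenset({"butter"}), frozenset({"bread"}), transactions)
conf_bread_milk = confidence(frozenset({"bread"}), frozenset({"milk"}), transactions)
print(sorted(map(sorted, frequent)), conf_butter_bread, conf_bread_milk)
```

In this toy basket data every transaction with butter also contains bread, so the rule butter -> bread has the highest possible confidence, which is the kind of pattern used for product placement and recommendations.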
Using machine learning for market basket analysis allows businesses to understand customer behavior better, identify cross-selling opportunities, and increase customer satisfaction. It has the potential to result in increased revenue, improved customer loyalty, and a better understanding of customer needs and preferences.
GitHub Repository: https://github.com/kkrusere/Market-Basket-Analysis-on-the-Online-Retail-Data
Airbnb is a tech company that enables hosts to rent out their homes, apartments, or rooms to guests interested in temporary lodging. One of the key challenges hosts face is optimizing the rent prices for the customers. With the help of machine learning, hosts can have rough estimates of the rental costs based on various factors such as location, property type, amenities, and availability.
The first step, in this case study, is to clean the dataset to handle missing values, duplicates, and outliers. In the same step, the data is transformed, and the data is prepared for modeling with the help of feature engineering methods. The next step is to perform EDA to understand how the rental listings are spread across different cities in the US. Next, you will learn how to visualize how prices change over time, looking at trends for different seasons, months, days of the week, and times of the day.
The final step involves implementing ML models like linear regression (ridge and lasso), Naive Bayes, and Random Forests to produce price estimates for listings. You will learn how to compare the outcome of these models and evaluate their performance.
GitHub Repository: https://github.com/samuelklam/airbnb-pricing-prediction
The Titanic Machine Learning Case Study is a classic example in the field of data science and machine learning. The study is based on the dataset of passengers aboard the Titanic when it sank in 1912. The study's goal is to predict whether a passenger survived or not based on their demographic and other information.
The dataset contains information on 891 passengers, including their age, gender, ticket class, fare paid, as well as whether or not they survived the disaster. The first step in the analysis is to explore the dataset and identify any missing values or outliers. Once this is done, the data is preprocessed to prepare it for modeling.
The next step is to build a predictive model using various machine learning algorithms, such as logistic regression, decision trees, and random forests. These models are trained on a subset of the data and evaluated on another subset to ensure they can generalize well to new data.
Finally, the model is used to make predictions on a test dataset, and the model performance is measured using various metrics such as accuracy, precision, and recall. The study results can be used to improve safety protocols and inform future disaster response efforts.
GitHub Repository: https://github.com/ashishpatel26/Titanic-Machine-Learning-from-Disaster
If you are looking for a sample of machine learning case study in python, then keep reading this space.
Financial institutions receive tons of requests for lending money by borrowers and making decisions for each request is a crucial task. Manually processing these requests can be a time-consuming and error-prone process, so there is an increasing demand for machine learning to improve this process by automation.
You can work on this Loan Dataset on Kaggle to get started on one of the most realistic case studies in the financial industry. The dataset contains 614 rows and 13 columns. Follow the steps below to get started on this case study.
Analyze the dataset and explore how various factors such as gender, marital status, and employment affect the loan amount and the status of the loan application.
Select the features to automate the process of classification of loan applications.
Apply machine learning models such as logistic regression, decision trees, and random forests to the features and compare their performance using statistical metrics.
This case study falls under the umbrella of supervised learning problems in machine learning and demonstrates how ML models are used to automate tasks in the financial industry.
Whenever one thinks of buying a new computer, the first thing that comes to mind is to curate a list of hardware specifications that best suit their needs. The next step is browsing different websites and looking for the cheapest option available. Performing all these processes can be time-consuming and require a lot of effort. But you don’t have to worry as machine learning can help you build a system that can estimate the price of a computer system by taking into account its various features.
This sample basic computer dataset on Kaggle can help you develop a price estimation model that can analyze historical data and identify patterns and trends in the relationship between computer specifications and prices. By training a machine learning model on this data, the model can learn to make accurate predictions of prices for new or unseen computer components. Machine learning algorithms such as K-Nearest Neighbours, Decision Trees, Random Forests, ADA Boost and XGBoost can effectively capture complex relationships between features and prices, leading to more accurate price estimates.
Besides saving time and effort compared to manual estimation methods, this project also has a business use case as it can provide stakeholders with valuable insights into market trends and consumer preferences.
Here is a machine learning case study that aims to predict the median value of owner-occupied homes in Boston suburbs based on various features such as crime rate, number of rooms, and pupil-teacher ratio.
Start working on this study by collecting the data from the publicly available UCI Machine Learning Repository, which contains information about 506 neighborhoods in the Boston area. The dataset includes 13 features such as per capita crime rate, average number of rooms per dwelling, and the proportion of owner-occupied units built before 1940. You can gain more insights into this data by using EDA techniques. Then prepare the dataset for implementing ML models by handling missing values, converting categorical features to numerical ones, and scaling the data.
Use machine learning algorithms such as Linear Regression, Lasso Regression, and Random Forest to predict house prices for different neighborhoods in the Boston area. Select the best model by comparing the performance of each one using metrics such as mean squared error, mean absolute error, and R-squared.
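The three comparison metrics named above can be computed directly from true and predicted values. The following sketch uses made-up numbers rather than actual Boston housing predictions.

```python
def regression_metrics(y_true, y_pred):
    """Return mean squared error, mean absolute error, and R-squared."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r ** 2 for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    mean_true = sum(y_true) / n
    ss_res = sum(r ** 2 for r in residuals)            # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "mae": mae, "r2": r2}


# Illustrative values only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Lower MSE and MAE and a higher R-squared indicate a better fit, so the model that wins on these metrics on held-out data is usually the one to keep.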
This section has machine learning case studies of different firms across various industries.
Dell Technologies is a multinational technology company that designs, develops, and sells computers, servers, data storage devices, network switches, software, and other technology products and services. Dell is one of the world's most prominent PC vendors and serves customers in over 180 countries. As Data is an integral component of Dell's hard drive, the marketing team of Dell required a data-focused solution that would improve response rates and demonstrate why some words and phrases are more effective than others.
Dell partnered with Persado, a firm that uses AI to create marketing content. Persado helped Dell revamp its email marketing strategy and leverage data analytics to capture its audiences' attention. The partnership resulted in a noticeable increase in customer engagement: page visits rose by 22% on average, and CTR increased by 50% on average.
Dell currently relies on ML methods to improve their marketing strategy for emails, banners, direct mail, Facebook ads, and radio content.
In the current environment, it is challenging to move beyond traditional marketing. Albert, an artificial-intelligence-powered robot, is appealing for a business like Harley Davidson. Robots are now directing traffic, creating news stories, working in hotels, and even running McDonald's, thanks to machine learning and artificial intelligence.
There are many marketing channels that Albert can be applied to, including email and social media. It automatically prepares customized creative copy and forecasts which customers are most likely to convert.
One company that makes use of Albert is Harley Davidson. The business examined customer data to ascertain the activities of past clients who successfully made purchases and spent more time than usual across different pages on the website. With this knowledge, Albert divided the customer base into groups and adjusted the scale of test campaigns accordingly.
Results reveal that using Albert increased Harley Davidson's sales by 40%. The brand also saw a 2,930% spike in leads, 50% of which came from very effective "lookalikes" found by machine learning and artificial intelligence.
Zomato is a popular online platform that provides restaurant search and discovery services, online ordering and delivery, and customer reviews and ratings. Founded in India in 2008, the company has expanded to over 24 countries and serves millions of users globally. Over the years, it has become a popular choice for consumers to browse the ratings of different restaurants in their area.
To provide the best restaurant options to its customers, Zomato hand-picks the ones likely to perform well in the future. Machine learning can help Zomato make such decisions by considering the different restaurant features. You can work on this sample Zomato Restaurants Data and experiment with how machine learning can be useful to Zomato. The dataset has the details of 9551 restaurants. The first step should involve careful analysis of the data and identifying outliers and missing values in the dataset. Treat them using statistical methods and then use regression models to predict the rating of different restaurants.
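As one possible way to carry out the cleaning step, the sketch below fills missing values with the median and flags outliers with the 1.5 × IQR rule. The column (a hypothetical "average cost for two"), its values, and the choice of these two statistical methods are assumptions for illustration; the source does not specify which methods to use.

```python
import statistics


def impute_with_median(values):
    """Replace None entries with the median of the known values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical "average cost for two" column with one gap and one extreme value.
cost_for_two = [400, 450, None, 500, 420, 480, 5000, 460]

imputed = impute_with_median(cost_for_two)
outliers = iqr_outliers(imputed)
cleaned = [v for v in imputed if v not in outliers]
print(imputed, outliers)
```

Whether to drop, cap, or keep a flagged value depends on the data; here the extreme entry is simply removed before modelling.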
The Zomato Case study is one of the most popular machine learning startup case studies among data science enthusiasts.
Tesla, Inc. is an American electric vehicle and clean energy company founded in 2003 by Martin Eberhard and Marc Tarpenning; Elon Musk joined as an early investor and chairman in 2004 and later became CEO. The company designs, manufactures, and sells electric cars, battery storage systems, and solar products. Tesla has pioneered the electric vehicle industry and has popularized high-capacity lithium-ion batteries and regenerative braking systems. The company strongly focuses on innovation, sustainability, and reducing the world's dependence on fossil fuels.
Tesla uses machine learning in various ways to enhance the performance and features of its electric vehicles. One of the most notable applications of machine learning at Tesla is in its Autopilot system, which uses a combination of cameras, sensors, and machine learning algorithms to enable advanced driver assistance features such as lane centering, adaptive cruise control, and automatic emergency braking.
Tesla's Autopilot system uses deep neural networks to process large amounts of real-world driving data and accurately predict driving behavior and potential hazards. It enables the system to learn and adapt over time, improving its accuracy and responsiveness.
Additionally, Tesla also uses machine learning in its battery management systems to optimize the performance and longevity of its batteries. Machine learning algorithms are used to model and predict the behavior of the batteries under different conditions, enabling Tesla to optimize charging rates, temperature control, and other factors to maximize the lifespan and performance of its batteries.
Amazon Prime Video uses machine learning to ensure high video quality for its users. The company has developed a system that analyzes video content and applies various techniques to enhance the viewing experience.
The system uses machine learning algorithms to automatically detect and correct issues such as unexpected black frames, blocky frames, and audio noise. For detecting block corruption, residual neural networks are used. After training the algorithm on a large dataset, a threshold of 0.07 was set for the corrupted-area ratio to mark the areas of the frame that have block corruption. For detecting unwanted noise in the audio, a model based on a pre-trained audio neural network is used to classify a one-second audio sample into one of these classes: audio hum, audio distortion, audio hiss, audio clicks, and no defect. Lip-sync defects are handled using the SyncNet architecture.
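The thresholding step can be pictured with a short sketch. Only the 0.07 threshold comes from the text; how the corrupted-area ratio is actually computed is not described, so the per-block boolean mask (standing in for the network's output), the function names, and the strict "greater than" comparison are all assumptions.

```python
CORRUPTION_THRESHOLD = 0.07  # value quoted in the text above


def corrupted_area_ratio(block_mask):
    """Share of blocks flagged as corrupted in a 2D grid of booleans (hypothetical model output)."""
    total = sum(len(row) for row in block_mask)
    corrupted = sum(1 for row in block_mask for flag in row if flag)
    return corrupted / total


def has_block_corruption(block_mask, threshold=CORRUPTION_THRESHOLD):
    """Flag the frame when the corrupted share exceeds the threshold (assumed comparison)."""
    return corrupted_area_ratio(block_mask) > threshold


# Two made-up 4 x 5 frames: one with a single suspicious block, one with three.
mostly_clean = [
    [False, False, False, False, False],
    [False, True, False, False, False],
    [False, False, False, False, False],
    [False, False, False, False, False],
]
damaged = [
    [True, True, False, False, False],
    [True, False, False, False, False],
    [False, False, False, False, False],
    [False, False, False, False, False],
]

clean_flag = has_block_corruption(mostly_clean)
damaged_flag = has_block_corruption(damaged)
print(clean_flag, damaged_flag)
```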
By using machine learning to optimize video quality, Amazon can deliver a consistent and high-quality viewing experience to its users, regardless of the device or network conditions they are using. It helps maintain customer satisfaction and loyalty and ensures that Amazon remains a competitive video streaming market leader.
Machine Learning applications are not only limited to financial and tech use cases. It also finds its use in the Healthcare industry. So, here are a few machine learning case studies that showcase the use of this technology in the Biology and Healthcare domain.
The development of microbiome therapeutics involves the study of the interactions between the human microbiome and various diseases and identifying specific microbial strains or compositions that can be used to treat or prevent these diseases. Machine learning plays a crucial role in this process by enabling the analysis of large, complex datasets and identifying patterns and correlations that would be difficult or impossible to detect through traditional methods.
Machine learning algorithms can analyze microbiome data at various levels, including taxonomic composition, functional pathways, and gene expression profiles. These algorithms can identify specific microbial strains or communities associated with different diseases or conditions and can be used to develop targeted therapies.
Besides that, machine learning can be used to optimize the design and delivery of microbiome therapeutics. For example, machine learning algorithms can be used to predict the efficacy of different microbial strains or compositions and optimize these therapies' dosage and delivery mechanisms.
Machine learning is increasingly being used to develop predictive models for diagnosing and managing mental illness. One of the critical advantages of machine learning in this context is its ability to analyze large, complex datasets and identify patterns and correlations that would be difficult for human experts to detect.
Machine learning algorithms can be trained on various data sources, including clinical assessments, self-reported symptoms, and physiological measures such as brain imaging or heart rate variability. These algorithms can then be used to develop predictive models to identify individuals at high risk of developing a mental illness or who are likely to experience a particular symptom or condition.
One example of machine learning being used to predict mental illness is in the development of suicide risk assessment tools. These tools use machine learning algorithms to analyze various risk factors, such as demographic information, medical history, and social media activity, to identify individuals at risk of suicide. These tools can be used to guide early intervention and support for individuals struggling with mental health issues.
One can also build a chatbot using machine learning and natural language processing that analyzes the user's responses and recommends steps they can take immediately.
Another popular subject in the biotechnology industry is Bioprinting. Based on a computerized blueprint, the printer prints biological tissues like skin, organs, blood arteries, and bones layer by layer using cells and biomaterials, also known as bioinks.
They can be made in printers more ethically and economically than by relying on organ donations. Additionally, synthetic construct tissue is used for drug testing instead of testing on animals or people. Due to its tremendous complexity, the entire technology is still in its early stages of maturity. Data science is one of the most essential components to handle this complexity of printing.
The qualities of the bioinks, which have inherent variability, or the many printing parameters, are just a couple of the many variables that affect the printing process and quality. For instance, Bayesian optimization improves the likelihood of producing useable output and optimizes the printing process.
A crucial element of the procedure is the printing speed. To estimate the optimal speed, siamese network models are used. Convolutional neural networks are applied to photographs of the layer-by-layer tissue to detect material or tissue abnormalities.
In this section, you will find a list of machine learning case studies that have utilized Amazon Web Services to create machine learning based solutions.
Autodesk is a US-based software company that provides solutions for 3D design, engineering, and entertainment industries. The company offers a wide range of software products and services, including computer-aided design (CAD) software, 3D animation software, and other tools used in architecture, construction, engineering, manufacturing, media and entertainment industries.
Autodesk utilizes machine learning (ML) models that are constructed on Amazon SageMaker, a managed ML service provided by Amazon Web Services (AWS), to assist designers in categorizing and sifting through a multitude of versions created by generative design procedures and selecting the most optimal design. ML techniques built with Amazon SageMaker help Autodesk progress from intuitive design to exploring the boundaries of generative design for their customers to produce innovative products that can even be life-changing. As an example, Edera Safety, a design studio located in Austria, created a superior and more effective spine protector by utilizing Autodesk's generative design process constructed on AWS.
Capital One is a financial services company in the United States that offers a range of financial products and services to consumers, small businesses, and commercial clients. The company provides credit cards, loans, savings and checking accounts, investment services, and other financial products and services.
Capital One leverages AWS to transform data into valuable insights using machine learning, enabling the company to innovate rapidly on behalf of its customers. To power its machine-learning innovation, Capital One utilizes a range of AWS services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Relational Database Service (Amazon RDS), and AWS Lambda. AWS is enabling Capital One to implement flexible DevOps processes, enabling the company to introduce new products and features to the market in just a few weeks instead of several months or years. Additionally, AWS assists Capital One in providing data to and facilitating the training of sophisticated machine-learning analysis and customer-service solutions. The company also integrates its contact centers with its CRM and other critical systems, while simultaneously attracting promising entry-level and mid-career developers and engineers with the opportunity to gain knowledge and innovate with the most up-to-date cloud technologies.
In 2008, BuildFax began by collecting widely scattered building permit data from different parts of the United States and distributing it to various businesses, including building inspectors, insurance companies, and economic analysts. Today, it offers custom-made solutions to these professions and several other services. These services comprise indices that monitor trends like commercial construction, and housing remodels.
Source: aws.amazon.com/solutions/case-studies
The primary customer base of BuildFax is insurance companies, which spend billions of dollars on roof losses. BuildFax helps its customers set policies and premiums by estimating roof losses for them. Initially, it built predictive models from general data and ZIP codes, but these models were inaccurate and somewhat complex. The company therefore needed a solution that could produce more accurate, property-specific estimates, and it chose Amazon Machine Learning for predictive modeling. With Amazon Machine Learning, the company can offer insurance companies and builders personalized estimates of roof age and job cost for a particular property, rather than relying on generalized estimates based on ZIP codes. It now uses customers' data and data from public sources to create predictive models.
This section will present you with a list of machine learning case studies that showcase how companies have leveraged Microsoft Azure Services for completing machine learning tasks in their firm.
Consider a company (Azure customer) in the Electronic Design Automation industry that provides software, hardware, and IP for electronic systems and semiconductor companies. Their finance team was struggling to manage accounts receivable efficiently, so they wanted to use machine learning to predict payment outcomes and reduce outstanding receivables. The team faced a major challenge with managing change data capture using Azure Data Factory. A3S provided a solution by automating data migration from SAP ECC to Azure Synapse and offering fully automated analytics as a service, which helped the company streamline its accounts receivable management. The company was able to complete the entire scenario, from data ingestion to analytics, within a week, and it plans to use A3S for other analytics initiatives.
Royal Dutch Shell, a global company whose operations range from oil wells to retail petrol stations, is using computer vision technology to automate safety checks at its service stations. In partnership with Microsoft, it has developed a project called Video Analytics for Downstream Retail (VADR) that uses machine vision and image processing to detect dangerous behavior and alert the servicemen. It uses OpenCV and Azure Databricks in the background, highlighting how Azure can be used for personalised applications. Once the project shows decent results in the countries where it has been deployed (Thailand and Singapore), Shell plans to roll VADR out globally.
TransLink, a transportation company in Vancouver, deployed 18,000 different sets of machine learning models using Azure Machine Learning to predict bus departure times and determine bus crowdedness. The models take into account factors such as traffic, bad weather and at-capacity buses. The deployment led to an improvement in predicted bus departure times of 74%. The company also created a mobile app that allows people to plan their trips based on how at-capacity a bus might be at different times of day.
Microsoft Azure Personaliser is a cloud-based service that uses reinforcement learning to select the best content for customers based on up-to-date information about them, the context, and the application. Custom recommender services can also be created using Azure Machine Learning. The Xbox One group used Cognitive Services Personaliser to find content suited to each user, which resulted in a 40% increase in user engagement compared to a random personalisation policy on the Xbox platform.
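To make the reinforcement learning idea concrete, here is a minimal sketch of an epsilon-greedy bandit that learns which piece of content to show from click feedback. It is a simplified stand-in, not the Azure Personaliser API: the function name `run_epsilon_greedy`, the click probabilities, and the parameters are all hypothetical, and the sketch ignores user context entirely.

```python
import random

def run_epsilon_greedy(click_probs, rounds=5000, epsilon=0.1, seed=0):
    """Simulate showing one content item per round and learning from clicks."""
    rng = random.Random(seed)
    n_items = len(click_probs)
    pulls = [0] * n_items      # how often each item was shown
    value = [0.0] * n_items    # running estimate of each item's click rate

    for _ in range(rounds):
        if rng.random() < epsilon:
            item = rng.randrange(n_items)                        # explore
        else:
            item = max(range(n_items), key=lambda i: value[i])   # exploit
        # Simulated feedback: 1.0 for a click, 0.0 otherwise.
        reward = 1.0 if rng.random() < click_probs[item] else 0.0
        pulls[item] += 1
        value[item] += (reward - value[item]) / pulls[item]      # incremental mean
    return pulls, value

# Hypothetical click-through rates for three pieces of content.
pulls, value = run_epsilon_greedy([0.1, 0.2, 0.8])
best_item = max(range(len(value)), key=lambda i: value[i])
print(best_item, pulls)
```

The policy mostly shows the item with the highest estimated click rate, but it keeps exploring a small fraction of the time so that its estimates for the other items stay current.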
All the case studies mentioned in this blog will help you explore how machine learning is applied to solve real problems across different industries. But if you are preparing for an interview and want to show that you have mastered the art of implementing ML algorithms, you must not stop after working through them; practice more such case studies in machine learning.
And if you have decided to dive deeper into machine learning, data science, and big data, be sure to check out ProjectPro , which offers a repository of solved projects in data science and big data. With a wide range of projects, you can explore different techniques and approaches and build your machine learning and data science skills . Our repository has a project for each one of you, irrespective of your academic and professional background. The customer-specific learning path is likely to help you find your way to making a mark in this newly emerging field. So why wait? Start exploring today and see what you can accomplish with big data and data science !
A case study in machine learning is an in-depth analysis of a real-world problem or scenario, where machine learning techniques are applied to solve the problem or provide insights. Case studies can provide valuable insights into the application of machine learning and can be used as a basis for further research or development.
A good use case for machine learning is any scenario with a large and complex dataset and where there is a need to identify patterns, predict outcomes, or automate decision-making based on that data. It could include fraud detection, predictive maintenance, recommendation systems, and image or speech recognition, among others.
The three basic types of machine learning problems are supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, the algorithm is trained on labeled data. In unsupervised learning, the algorithm seeks to identify patterns in unlabeled data. In reinforcement learning, the algorithm learns through trial and error based on feedback from the environment.
The four basics of machine learning are data preparation, model selection, model training, and model evaluation. Data preparation involves collecting, cleaning, and preparing data for use in training models. Model selection involves choosing the appropriate algorithm for a given task. Model training involves optimizing the chosen algorithm to achieve the desired outcome. Model evaluation consists of assessing the performance of the trained model on new data.
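The four steps above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: the data is synthetic and deliberately well separated, the two candidate models (a nearest-centroid classifier and a majority-class baseline) are chosen for brevity, and every function name here is hypothetical rather than taken from any library.

```python
import random

def make_data(n_per_class=50, seed=0):
    """Synthetic, well-separated two-class data (an assumption for the demo)."""
    rng = random.Random(seed)
    rows = []
    for label, center in ((0, 0.0), (1, 5.0)):
        for _ in range(n_per_class):
            point = [center + rng.uniform(-1, 1), center + rng.uniform(-1, 1)]
            rows.append((point, label))
    rng.shuffle(rows)
    return rows

def fit_scaler(rows):
    """Data preparation: per-feature mean and standard deviation."""
    n, dims = len(rows), len(rows[0][0])
    means = [sum(x[d] for x, _ in rows) / n for d in range(dims)]
    stds = [(sum((x[d] - means[d]) ** 2 for x, _ in rows) / n) ** 0.5
            for d in range(dims)]
    return means, stds

def scale(rows, scaler):
    means, stds = scaler
    return [([(v - m) / s for v, m, s in zip(x, means, stds)], y) for x, y in rows]

def train_majority(rows):
    """Baseline model: always predict the most common training label."""
    labels = [y for _, y in rows]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def train_nearest_centroid(rows):
    """Model training: store one mean point (centroid) per class."""
    centroids = {}
    for label in {y for _, y in rows}:
        pts = [x for x, y in rows if y == label]
        centroids[label] = [sum(col) / len(pts) for col in zip(*pts)]
    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
    return predict

def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

# 1. Data preparation: split, then scale using statistics from the training set only.
data = make_data()
train, valid, test = data[:60], data[60:80], data[80:]
scaler = fit_scaler(train)
train, valid, test = scale(train, scaler), scale(valid, scaler), scale(test, scaler)

# 2 and 3. Model selection and training: fit each candidate, compare on validation data.
candidates = {"nearest_centroid": train_nearest_centroid,
              "majority_class": train_majority}
fitted = {name: trainer(train) for name, trainer in candidates.items()}
best_name = max(fitted, key=lambda name: accuracy(fitted[name], valid))

# 4. Model evaluation: score the chosen model once on held-out test data.
test_accuracy = accuracy(fitted[best_name], test)
print(best_name, test_accuracy)
```

Keeping the test split untouched until the final step is the design choice that matters most here: the validation split is used to pick a model, so only the test split gives an unbiased view of performance on new data.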
Manika Nagpal is a versatile professional with a strong background in both Physics and Data Science. As a Senior Analyst at ProjectPro, she leverages her expertise in data science and writing to create engaging and insightful blogs that help businesses and individuals stay up-to-date with the
© 2024 Iconiq Inc.
A decade ago, no one must have thought that the term “Machine Learning” would be hyped so much in the years to come. Right from our entertainment to our basic needs to complex data handling statistics, Machine Learning takes care of all of this. The clutches of Machine Learning aren’t just limited to the basic necessities and entertainment.
The technology plays a pivotal role in domain areas such as data retrieval, database consistency, and spam detection along with many other vast ranges of applications. We do come across various articles that are ready to teach us about the basic concepts of Machine Learning, however, learning becomes more fun when we actually see it working in practicality.
Keeping this in mind, PythonGeeks brings to you, an article that will talk about the real-life case studies of Machine Learning stating its advancement in various fields. We will talk about the merits of Machine Learning in the field of technology as well as in Life Science and Biology. So, without further delay, let us look at these case studies and get to know a bit more about Machine Learning.
1. Machine Learning Case Study on Dell
We are all aware of Dell, the multinational leader in technology. This tech giant empowers people and communities across the globe by providing software and hardware services at affordable prices. Because data plays a pivotal role in Dell’s business, its marketing team wanted a data-driven solution that would boost response rates and show why certain words and phrases outperform others.
Dell made a partnership with Persado, one of the names amongst the world’s leading technology in AI and ML fabricating marketing creative, in order to harness the power of words in their respective email channel and garner data-driven analytics for each of their key audiences for a better user experience.
As a result of this partnership, Dell saw a 50% average increase in CTR and a 46% average increase in responses from its customer engagement. It also saw a 22% average increase in page visits and a 77% average increase in add-to-cart orders.
Overwhelmed by this success rate and learnings with email, Dell adamantly wanted to elevate their entire marketing platform with Persado for more profit and audience engagement. Dell now makes use of machine learning algorithms to enhance the marketing copy of their promotional and lifecycle emails. Apart from these, their management even deploys Machine Learning models for Facebook ads, display banners, direct mail, and even radio content for a farther reach for the target audience.
Sky UK is a British telecommunications company that transforms customer experiences using machine learning and artificial intelligence algorithms, with the help of Adobe Sensei.
Due to the immense profit that the company gained due to the deployment of the Machine Learning model, the Head of Digital Decisioning and Analytics, Sky UK once stated that they have 22.5 million very diverse customers. Even attempting to divide people by their favorite television genre can result in pretty broad segments for their services.
This will result in the following outcomes:
The company was competent in efficiently analyzing large volumes of customer information with the help of machine learning frameworks. With the deployment of Machine Learning models, the services were able to recommend their target audience with products and services that resonated the most with each of them.
McLaughlin once stated that people think of machine learning as a tool for delivering experiences that are strictly defined and very robotic in their approach, but it’s actually the other way round. With Adobe Sensei, the management of the Sky was drawing a line that connects customer intelligence and personalized experiences that are valuable and appropriate for their customers.
Trendyol is amongst the leading e-commerce companies based in Turkey. It once faced threats from its global competitors like Adidas and ASOS, particularly for its sportswear sales and audience engagement.
In order to assist the company in gaining customer loyalty and to enhance its emailing system, Trendyol partnered with the vendor Liveclicker, which specializes in real-time personalization for a better user experience for its customers.
Trendyol made use of machine learning and artificial intelligence algorithms to create several highly personalized marketing campaigns based on the interests of a particular target audience. It was not only aimed at providing a personalized touch to the campaign, but it also helped to distinguish which messages would be most relevant or draw the attention of which set of customers. It also came up with an offer for a football jersey imposing the recipient’s name on the back of the jersey to ramp up the personalization level and grab the consumer’s attention.
This one-to-one personalization not only raised the retailer’s open rates, click-through rates, and conversions, it also pushed sales to all-time highs. It resulted in a 30% increase in click-through rates for Trendyol, a 62% growth in response rates, and a striking 130% increase in conversion rates.
In today’s world, it is difficult to break through with traditional marketing. For a business like Harley Davidson NYC, Albert (an artificial intelligence-powered robot) has a lot of appeal. Powered by machine learning and artificial intelligence algorithms, robots are writing news stories, working in hotels, managing traffic, and even running McDonald’s outlets.
Albert can be used across various marketing channels, including social media and email campaigns. The software predicts which consumers are most likely to convert and adjusts personalized creative copy on its own for the benefit of the campaign.
Harley Davidson is the only brand to date that uses Albert to its advantage. The company analyzed customer data to identify patterns in the behavior of previous customers who made purchases or spent more than the average amount of time browsing the website. With this data, Albert segments customers and scales up test campaigns according to their interests and engagement.
After deploying Albert, Harley Davidson saw its sales increase by 40%. The brand also saw a 2,930% increase in leads, with 50% of those coming from high-converting ‘lookalikes’ identified by Albert’s artificial intelligence and machine learning.
Most of us would not think of Yelp as a tech company. However, it is taking advantage of machine learning to improve its users’ experience to a great extent.
Yelp’s machine learning algorithms assist the company’s human staff in tasks like collecting, categorizing, and labeling images more efficiently and precisely. Since images are as important to Yelp as the user reviews themselves, the company is always trying to improve how it handles image processing. Through this assistance, the company now serves millions of users with accurate and satisfactory services.
For an entire generation, taking photos of their food has become second nature. Owing to this, Yelp has a huge database of photos for image processing. Its software analyzes each image to extract and classify features on the basis of color, texture, and shape. This means it can recognize the presence of, say, pizzas, or whether a restaurant has outdoor seating, merely by analyzing the images provided as input data.
As a result, the company is now capable of predicting attributes like ‘good for kids’ and ‘classy ambiance’ with more than 80% accuracy.
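To give a feel for what a color feature looks like, here is a minimal sketch that turns an image into a normalized color histogram. It is not Yelp’s system, whose details the source does not describe: the function name `color_histogram`, the bin count, and the tiny sample image are assumptions, and texture and shape features are left out.

```python
def color_histogram(pixels, bins=4):
    """Return a normalized per-channel histogram for a list of (r, g, b) pixels."""
    width = 256 // bins
    hist = [[0.0] * bins for _ in range(3)]   # one row each for R, G, B
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # Clamp to the last bin in case 256 is not divisible by bins.
            hist[channel][min(value // width, bins - 1)] += 1
    return [[count / len(pixels) for count in row] for row in hist]

# A hypothetical 2x2 image that is mostly red, flattened to a list of pixels.
image = [(250, 10, 10), (240, 20, 5), (255, 0, 0), (30, 30, 30)]
features = color_histogram(image)
print(features[0])  # share of pixels in each red-intensity bin
```

A classifier would then take vectors like `features` as input, alongside texture and shape descriptors, to predict labels such as the presence of a dish.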
Tesla is now a big name in the electric automobile industry and the chances that it will continue to be the trending topic for years to come are really high. It is popular and extensively known for its advanced and futuristic cars and their advanced models. The company states that their cars have their own AI hardware for their advancement. Tesla is even making use of AI for fabricating self-driving cars.
At the current rate of technological progress, cars are not yet completely autonomous and still need human intervention to some extent. The company is working extensively on the decision-making algorithms that would help its cars become fully autonomous. It is currently working in partnership with NVIDIA on an unsupervised ML algorithm for this purpose.
This step by Tesla would be a game-changer in the field of automobiles and Machine Learning models for many reasons. The cars feed the data directly to Tesla’s cloud storage to avoid data leakage. The car sends the driver’s seating position, traffic of the area, and other valuable information on the cloud to precisely predict the next move of the car. The car is equipped with various internal and external sensors that detect the above-mentioned data for processing.
7. Development of Microbiome Therapeutics
With advances in technology, we have studied and identified a vast number of microorganisms in our body, the so-called microbiota, such as bacteria, fungi, viruses, and other single-celled organisms. All the genes of the microbiota are collectively known as the microbiome. These genes number in the trillions; for example, the bacteria present in the human body have more than 100 times more unique genes than humans do.
The microbiota in the human body have a massive influence on human health, and imbalances in them are linked to many disorders, such as Parkinson’s disease or inflammatory bowel disease. There is also the presumption that such imbalances may cause several autoimmune diseases. Microbiome research is therefore a very active research area, and Machine Learning models can help in handling the data effectively.
To influence the microbiota and develop microbiome therapeutics that reverse the diseases they cause, we need to understand the microbiota’s genes and their influence on our body. With the gene sequencing possibilities available today, terabytes of data exist, but we cannot yet use them because they have not been analyzed.
Heart failure typically leads to emergency or hospital admission and may even be fatal in some situations. And with the increase in the aging population, the percentage of heart failure in the population is expected to increase.
People who suffer from heart failure usually have pre-existing illnesses that go undiagnosed and can become fatal. It is therefore common to use telemedicine systems to monitor and consult a patient, collecting mobile health data such as blood pressure, body weight, or heart rate and transmitting it for review.
Most prediction and prevention systems are currently built on fixed rules: when a specific vital-sign measurement goes beyond a predefined threshold, the patient is alerted, even before any ailment has been diagnosed. Such a system can produce a high number of false alerts, because vital-sign readings fluctuate for reasons that are not serious.
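A fixed-rule system of this kind can be sketched in a few lines. The thresholds, field names, and readings below are hypothetical and are not clinical guidance; the point is only to show how a single harmless fluctuation triggers an alert.

```python
# Hypothetical thresholds as (low, high); None means no limit on that side.
THRESHOLDS = {
    "heart_rate": (50, 100),
    "systolic_bp": (90, 140),
    "weight_gain_kg": (None, 2.0),
}

def rule_based_alerts(reading, thresholds=THRESHOLDS):
    """Return the names of all vitals that fall outside their fixed range."""
    alerts = []
    for name, (low, high) in thresholds.items():
        value = reading.get(name)
        if value is None:
            continue  # this vital was not measured
        if (low is not None and value < low) or (high is not None and value > high):
            alerts.append(name)
    return alerts

# Daily readings for one made-up patient. The third day has a brief heart-rate
# spike (for example after climbing stairs), not a sustained deterioration.
week = [
    {"heart_rate": 72, "systolic_bp": 120, "weight_gain_kg": 0.1},
    {"heart_rate": 75, "systolic_bp": 122, "weight_gain_kg": 0.0},
    {"heart_rate": 104, "systolic_bp": 125, "weight_gain_kg": 0.2},
    {"heart_rate": 74, "systolic_bp": 121, "weight_gain_kg": 0.1},
]
daily_alerts = [rule_based_alerts(reading) for reading in week]
print(daily_alerts)
```

Because each rule looks at one measurement in isolation, the system has no way to tell a transient spike from a real trend, which is where the false alerts come from.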
Because of how these algorithms are programmed, alerts mostly lead to hospital admission. Too many false alerts therefore increase health costs and erode the patient’s confidence in the predictions, defeating the purpose of the algorithms. Eventually, the patient may stop following the recommendation to seek medical help, even when the algorithm alerts them to a fatal ailment.
To reduce the chance of false positives, a classifier based on Naïve Bayes has been developed. It uses baseline data about the patient (such as age, gender, whether they smoke, and whether they have a pacemaker), measurements such as sodium, potassium, or hemoglobin concentrations in the blood, monitored characteristics such as heart rate, body weight, and systolic and diastolic blood pressure, and questionnaire answers about well-being and physical activity.
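As a rough illustration of how a Naïve Bayes classifier combines several measurements, here is a minimal Gaussian Naïve Bayes written from scratch. The source does not describe the actual model, so everything here is an assumption: the two features, the made-up training rows, the labels, and the function names `fit_gaussian_nb` and `predict`.

```python
import math

def fit_gaussian_nb(X, y, var_smoothing=1e-6):
    """Estimate per-class priors, feature means, and feature variances."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / len(rows) + var_smoothing
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (len(rows) / len(X), means, variances)
    return model

def predict(model, x):
    """Pick the class with the highest log posterior, treating features as independent."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            # Log of the Gaussian density for this feature given the class.
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Made-up training rows: [resting heart rate, weight gain over three days in kg].
X = [[70, 0.1], [72, 0.0], [68, -0.2], [75, 0.2],
     [95, 2.0], [100, 2.5], [98, 1.8], [105, 3.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]   # 0 = stable, 1 = needs attention

model = fit_gaussian_nb(X, y)
stable_prediction = predict(model, [71, 0.1])
risk_prediction = predict(model, [99, 2.2])
print(stable_prediction, risk_prediction)
```

Unlike a fixed threshold on one vital sign, the classifier weighs all features together, so a single unusual reading is less likely to tip the decision on its own.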
With an estimated 10% or more of the global population living with a mental disorder, preventive measures in this field are overdue. Economic losses due to mental illness sum up to nearly $10 trillion.
Mental disorders cover a wide range of conditions, including anxiety, depression, substance use disorders (for example, involving opioids), bipolar disorder, schizophrenia, and eating disorders.
As a result, detecting mental disorders and intervening as early as possible is critical in order to reduce these losses. There are two main approaches to deploying Machine Learning models in detecting mental disorders: apps for consumers that detect mental diseases and tools for psychiatrists to support the diagnosis of their patients.
The apps for consumers are typically conversational chatbots enhanced with machine learning algorithms to help consumers reduce their anxiety or panic attacks. The app analyzes behavioral traits, such as the consumer’s spoken language, and recommends help accordingly. Because the recommendations must be based strictly on scientific evidence, the interaction with and response to these proposals, as well as the individual language patterns of both the chatbot and the consumer, must be predicted as precisely as possible.
Stroke is one of the major causes of disability and death among the elderly. An adult’s lifetime risk of having a stroke is about 25%. However, stroke is a very heterogeneous disorder. Individualized pre-stroke and post-stroke care is therefore critical for successful treatment.
To determine this individualized care, the person’s phenotype, that is, their observable characteristics, must be assessed carefully. We usually do this with biomarkers. A biomarker is a measurable data point that lets us stratify patients. Examples of such biomarkers are disease severity scores, lifestyle characteristics, or genomic properties.
There are many recognized biomarkers already published or in databases. Apart from this, there are hundreds of scientific publications that talk daily about the detection of biomarkers for all the different diseases.
Bioprinting is yet another trending topic in the domain of biotechnology. It works on the basis of a digital blueprint where the printer uses cells and natural or synthetic biomaterials — also called bio-inks — to print layer-by-layer living tissues like skin, organs, blood vessels, or bones that have exact replication of the real tissues.
As an alternative to depending on organ donations, we can produce these tissues in printers more ethically and cost-effectively. We can also perform drug tests on synthetically built tissue rather than on animals or humans. The whole technology is still emerging and is in early maturity due to its high complexity. Data science is one of the most crucial tools for coping with this complexity.
As we might have observed, the production of drugs needs time, especially for today’s high-tech cures based on specific substances and production methods only. Apart from this, we have to break down the whole process into many different steps, and several of them are outsourced to specialist delivery agents.
We observe this currently with the COVID-19 vaccine production as well. The vaccine inventors deliver the blueprint for the vaccine. Then the production happens in plants of companies specialized in sterile production. The production unit then delivers the vaccine in tanks to companies. They do the filling in small doses under clinical conditions, and at last, another company makes the supply for the given blueprint.
The complete planning is a highly complicated system: having the right input substances available at the right time, having adequate production capacity, and finally storing the right amount of drugs to meet demand. This must be managed for hundreds or thousands of therapies, each with its specific conditions.
As we have known, the AES Corporation is a power generation and distribution company. They generate and sell power that the consumers use for utilities and industrial work. They depend on Google Cloud on their road to make renewable energy more efficient. AES makes use of Google AutoML Vision to review images of wind turbine blades and analyze their maintenance needs beforehand.
Bayer AG is an emerging name in multinational pharmaceutical and life sciences companies and it is based in Germany. One of their key highlights is in the production of insecticides, fungicides, and herbicides for agricultural purposes.
To help farmers monitor their crops, they built the Digital Yellow Trap: an Internet of Things (IoT) device that uses image recognition to alert farmers to pests on their farmland.
The American Cancer Society is a nonprofit organization for eradicating cancer. They operate in more than 250 regional offices all over America.
They make use of the Google Cloud ML Engine to identify novel patterns in digital pathology images. Their aim is to improve breast cancer detection accuracy and reduce the overall diagnosis timeline as well as ensure effective costing.
The Road Safety Commission of Western Australia operates under the Western Australia Police Force. It takes the responsibility for tracking road accidents and making the roads safer by taking adequate precautions.
In an attempt to achieve its safety strategy “Towards Zero 2008-2020” which aims at reducing road fatalities by 40%, the road safety commission is depending on machine learning, artificial intelligence, and advanced analytics for precise and reliable results.
With this, we have seen the various case studies that are done till now in the field of Machine Learning. PythonGeeks specially curated this list of case studies to help readers to understand the deployment of Machine Learning models in the real world. The article can benefit you in various ways since it delivers accurate studies of the various uses of Machine Learning. You can study these cases to get to know Machine Learning a bit better and even try to find improvements in the existing solution.
Interpretability in the medical field: a systematic mapping and review study
Recently, the machine learning (ML) field has been rapidly growing, mainly owing to the availability of historical datasets and advanced computational power. This growth is still facing a set of challenges, such as ...
This paper proposes a new framework for learning a rule ensemble model that is both accurate and interpretable. A rule ensemble is an interpretable model based on the linear combination of weighted rules. In practice, we often face the trade-off ...
The adoption of complex machine learning (ML) models in recent years has brought along a new challenge related to how to interpret, understand, and explain the reasoning behind these complex models' predictions. Treating complex ML systems as ...
Published in.
Pergamon Press, Inc.
United States
Cardiac electrical changes associated with ischemic heart disease (IHD) are subtle and could be detected even at rest in magnetocardiography (MCG), which measures weak cardiac magnetic fields. Cardiac features derived from MCG recorded from multiple locations on the chest of subjects, together with some conventional time domain indices, are widely used in machine learning (ML) classifiers to objectively distinguish IHD and control subjects. Most earlier studies have employed features derived from signal-averaged cardiac beats and have ignored inter-beat information. The present study demonstrates the utility of beat-by-beat features in classifying IHD subjects (n = 23) and healthy controls (n = 75) in 37-channel MCG data taken while subjects were at rest. The study reveals the importance of three features (out of eight measured features), namely the field map angle (FMA) computed from the magnetic field map, beat-by-beat variations of the alpha angle in the ST-T region, and T wave magnitude variations, in yielding a better classification accuracy (92.7 %) than that achieved by conventional features (81 %). Further, beat-by-beat features are also found to improve the accuracy in classifying myocardial infarction (MI) versus control subjects in two public ECG databases (92 % from 88 % and 94 % from 77 %). These demonstrations suggest the importance of beat-by-beat features in clinical diagnosis of ischemia.
Keywords: beat-by-beat cardiac features; ischemic heart disease; machine learning classifiers; magnetocardiography; myocardial infarction.
© 2024 IOP Publishing Ltd.
Cardiovascular disease is the leading cause of death worldwide. Heart Disease Prediction (HDP) is a difficult task because it requires advanced knowledge and considerable experience, and it faces numerous significant challenges in clinical data analysis. Although many researchers have focused on predicting heart disease, prediction accuracy remains suboptimal. Accurate HDP can help a person avoid life-threatening events, whereas an inaccurate prediction can prove fatal. To address these issues, this review discusses several Deep Learning (DL), Machine Learning (ML) and optimization-based HDP techniques. In recent times, many researchers have used different DL and ML algorithms to help professionals and the health care industry predict heart disease. The review further discusses various optimization-based algorithms and analyses their performance. It suggests that optimization-based HDP algorithms could assist doctors in predicting the occurrence of heart disease in advance and offering suitable treatment.
Source of reviewed papers
Data availability.
Data will be made available on reasonable request.
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Authors and affiliations.
Department of Artificial Intelligence & Data Science, CSMSS Chh. Shahu College of Engineering, Aurangabad, 431136, India
Girish Shrikrushnarao Bhavekar
Department of SENSE, VIT-AP University, Andhra Pradesh, 522237, India
Agam Das Goswami
Department of AI&DS, CSMSS Chh. Shahu College of Engineering, Aurangabad, Maharashtra, 431136, India
Chafle Pratiksha Vasantrao
Department of Computer Science & Engineering, GH Raisoni University, Amravati, Maharashtra, 444701, India
Amit K. Gaikwad & Amol V. Zade
Department of Computer Science & Engineering, Sipna College of Engineering & Technology, Amravati, Maharashtra, 444701, India
Harsha Vyawahare
All the authors have contributed equally to the work.
Correspondence to Agam Das Goswami .
Ethical approval.
All applicable institutional and/or national guidelines for the care and use of animals were followed.
For this type of analysis formal consent is not needed.
The authors declare that they have no potential conflict of interest.
Publisher's note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Bhavekar, G.S., Das Goswami, A., Vasantrao, C.P. et al. Heart disease prediction using machine learning, deep Learning and optimization techniques-A semantic review. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19680-0
Received : 26 September 2023
Revised : 24 February 2024
Accepted : 10 June 2024
Published : 05 July 2024
DOI : https://doi.org/10.1007/s11042-024-19680-0
In the realm of remote sensing, where labeled datasets are scarce, leveraging pre-trained models via transfer learning offers a compelling solution. This study investigates the efficacy of the Segment Anything Model (SAM), a foundational computer vision model, in the domain of optical remote sensing tasks, specifically focusing on image classification and semantic segmentation.
The scarcity of labeled data in remote sensing poses a significant challenge for machine learning development. Transfer learning, a technique utilizing pre-trained models like SAM, circumvents this challenge by leveraging existing data from related domains. SAM, developed and trained by Meta AI, serves as a foundational model for prompt-based image segmentation. It was trained on over 1 billion masks on 11 million images, facilitating robust zero-shot and few-shot capabilities. SAM's architecture comprises an image encoder, a prompt encoder, and a mask decoder, all geared towards swift and accurate segmentation for various prompts, ensuring real-time interactivity and handling ambiguity.
Two distinct use cases leveraging SAM-based models in the domain of optical remote sensing are presented, representing two critical tasks: image classification and semantic segmentation. Through comprehensive analysis and comparative assessments, various model architectures, including linear and convolutional classifiers, SAM-based adaptations, and UNet for semantic segmentation, are examined. Experiments contrast model performance across different dataset splits and varying training data sizes. The SAM-based models place a linear, a convolutional, or a ViT decoder classifier on top of the SAM encoder.
Use Case 1: Image Classification with EuroSAT Dataset
The EuroSAT dataset, comprising 27,000 labeled image patches from Sentinel-2 satellite images across ten distinct land cover classes, serves as the testing ground for image classification tasks. SAM-ViT models consistently demonstrate high accuracy, ranging between 89% and 93% on various sizes of training datasets. These models outperform baseline approaches, exhibiting resilience even with limited training data. This use case highlights SAM-ViT's effectiveness in accurately categorizing land cover classes despite data limitations.
Use Case 2: Semantic Segmentation with Road Dataset
In the semantic segmentation domain, the study focuses on the Road dataset, evaluating SAM-based models, particularly SAM-CONV, against the benchmark UNet model. SAM-CONV outperforms UNet, achieving F1-scores and Dice coefficients exceeding 0.84 and 0.82, respectively. Its performance in pixel-level labeling emphasizes its robustness in delineating roads from surrounding environments, surpassing established benchmarks and demonstrating its applicability in fine-grained analysis.
In conclusion, SAM-driven transfer learning methods hold promise for robust remote sensing analysis. SAM-ViT excels in image classification, while SAM-CONV demonstrates superiority in semantic segmentation, paving the way for their practical use in real-world remote sensing applications despite limited labeled data availability.
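For readers unfamiliar with the Dice coefficient reported for the road segmentation task, the following is a minimal sketch of how it can be computed for a pair of binary masks. The function name `dice_coefficient` and the toy masks are illustrative assumptions, not taken from the study, and the abstract does not state how the study aggregated scores across images.

```python
def dice_coefficient(pred, truth):
    """Dice = 2 * |P and T| / (|P| + |T|) for two binary masks (lists of rows of 0/1)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of rows")
    intersection = 0
    pred_sum = 0
    truth_sum = 0
    for p_row, t_row in zip(pred, truth):
        if len(p_row) != len(t_row):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(p_row, t_row):
            intersection += 1 if (p and t) else 0
            pred_sum += 1 if p else 0
            truth_sum += 1 if t else 0
    denominator = pred_sum + truth_sum
    if denominator == 0:
        # Both masks are empty: treat them as a perfect match (a convention, not a rule).
        return 1.0
    return 2.0 * intersection / denominator


# Hypothetical 2x3 masks: 1 = road pixel, 0 = background.
predicted_mask = [[0, 1, 1],
                  [0, 1, 0]]
reference_mask = [[0, 1, 0],
                  [0, 1, 1]]
score = dice_coefficient(predicted_mask, reference_mask)
print(round(score, 4))
```

Here two road pixels overlap and each mask contains three road pixels, so the score is 2 * 2 / (3 + 3).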
Refining long-time series of urban built-up-area extraction based on night-time light—a case study of the dongting lake area in china.
Satellite | Year | a | b | c | R | Satellite | Year | a | b | c | R |
---|---|---|---|---|---|---|---|---|---|---|---|
F10 | 1992 | −2.0570 | 1.5903 | −0.0090 | 0.9075 | F15 | 2002 | 0.0491 | 0.9568 | 0.0010 | 0.9658 |
F10 | 1993 | −1.0582 | 1.5983 | −0.0093 | 0.9360 | F15 | 2003 | 0.2217 | 1.5122 | −0.0080 | 0.9314 |
F10 | 1994 | −0.3458 | 1.4864 | −0.0079 | 0.9243 | F15 | 2004 | 0.5751 | 1.3335 | −0.0051 | 0.9479 |
F12 | 1994 | −0.6890 | 1.1770 | −0.0025 | 0.9071 | F15 | 2005 | 0.6367 | 1.2838 | −0.0041 | 0.9335 |
F12 | 1995 | −0.0515 | 1.2293 | −0.0038 | 0.9178 | F15 | 2006 | 0.8261 | 1.2790 | −0.0041 | 0.9387 |
F12 | 1996 | −0.0959 | 1.2727 | −0.0040 | 0.9319 | F15 | 2007 | 1.3606 | 1.2974 | −0.0045 | 0.9013 |
F12 | 1997 | −0.3321 | 1.1782 | −0.0026 | 0.9245 | F16 | 2004 | 0.2853 | 1.1955 | −0.0034 | 0.9039 |
F12 | 1998 | −0.0608 | 1.0648 | −0.0013 | 0.9536 | F16 | 2005 | −0.0001 | 1.4159 | −0.0063 | 0.9390 |
F12 | 1999 | 0.0000 | 1.0000 | 0.0000 | 1.0000 | F16 | 2006 | 0.1065 | 1.1371 | −0.0016 | 0.9199 |
F14 | 1997 | −1.1323 | 1.7696 | −0.0122 | 0.9101 | F16 | 2007 | 0.6394 | 0.9114 | 0.0014 | 0.9511 |
F14 | 1998 | −0.1917 | 1.6321 | −0.0101 | 0.9723 | F16 | 2008 | 0.5564 | 0.9931 | 0.0000 | 0.9450 |
F14 | 1999 | −0.1557 | 1.5055 | −0.0078 | 0.9717 | F16 | 2009 | 0.9492 | 1.0683 | −0.0016 | 0.8918 |
F14 | 2000 | 1.0988 | 1.3155 | −0.0053 | 0.9278 | F18 | 2010 | 2.3430 | 0.5102 | 0.0065 | 0.8462 |
F14 | 2001 | 0.1943 | 1.3219 | −0.0051 | 0.9448 | F18 | 2010 | 2.3458 | 0.5100 | 0.0065 | 0.8453 |
F14 | 2002 | 1.0517 | 1.1905 | −0.0036 | 0.9203 | F18 | 2011 | 1.8956 | 0.7345 | 0.0030 | 0.9095 |
F14 | 2003 | 0.7390 | 1.2416 | −0.0040 | 0.9432 | F18 | 2012 | 1.8750 | 0.6203 | 0.0052 | 0.9392 |
F15 | 2000 | 0.1254 | 1.0452 | −0.0010 | 0.9320 | F18 | 2013 | 1.8411 | 0.7049 | 0.0033 | 0.9321 |
F15 | 2001 | −0.7024 | 1.1081 | −0.0012 | 0.9593 | | | | | | |
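The coefficients a, b and c above can be applied as sketched below. This assumes the common second-order intercalibration form DN_adj = a + b·DN + c·DN², which is consistent with the F12 1999 row (a = 0, b = 1, c = 0) acting as the reference, but the article's own equation is not reproduced in this excerpt, so the form is an assumption. The function name `intercalibrate` and the clipping to the 0 to 63 DMSP/OLS digital-number range are likewise illustrative choices.

```python
def intercalibrate(dn, a, b, c, dn_max=63.0):
    """Apply an assumed quadratic intercalibration, DN_adj = a + b*DN + c*DN^2,
    and clip the result to the valid digital-number range [0, dn_max]."""
    adjusted = a + b * dn + c * dn * dn
    return min(max(adjusted, 0.0), dn_max)


# Two rows copied from the coefficient table: (satellite, year) -> (a, b, c).
COEFFICIENTS = {
    ("F12", 1999): (0.0, 1.0, 0.0),           # reference row: identity mapping
    ("F10", 1992): (-2.0570, 1.5903, -0.0090),
}

raw_dn = 30.0  # hypothetical pixel value
for key, (a, b, c) in COEFFICIENTS.items():
    print(key, round(intercalibrate(raw_dn, a, b, c), 3))
```

For the reference row the value is returned unchanged, while the F10 1992 coefficients map a raw value of 30 to −2.0570 + 47.709 − 8.1 = 37.552.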
Predicted Class | Actual: Built-Up Area | Actual: Non-Built-Up Area |
---|---|---|
Built-Up Area | True Built-Up Area | False Built-Up Area |
Non-Built-Up Area | False Non-Built-Up Area | True Non-Built-Up Area |
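The four cells of this confusion matrix are the usual inputs to accuracy-assessment metrics. The sketch below shows standard definitions of overall accuracy, precision, recall and F1. Which of these the article actually reports is not visible in this excerpt, so the selection, the function name `binary_metrics` and the pixel counts are assumptions for illustration only.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from a 2x2 confusion matrix.

    tp: true built-up, fp: false built-up,
    fn: false non-built-up, tn: true non-built-up.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical pixel counts, not taken from the article.
metrics = binary_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the overall accuracy is 170 / 200 = 0.85 and the F1 score is 160 / 190, about 0.8421.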
Chen, Y.; Ren, F.; Du, Q.; Zhou, P. Refining Long-Time Series of Urban Built-Up-Area Extraction Based on Night-Time Light—A Case Study of the Dongting Lake Area in China. Land 2024 , 13 , 1006. https://doi.org/10.3390/land13071006
The case study in this article will go over a popular machine learning concept called classification. In Machine Learning (ML), classification is a supervised learning concept that groups data into classes. Classification usually refers to any kind of problem where a specific type of class label is the result to be predicted ...
Up to 300 passengers survived and about 550 didn't; in other words, the survival rate (or the population mean) is 38%. Moreover, a histogram is well suited to giving a rough sense of the density of the underlying distribution of a single numerical variable. I recommend using a box plot to graphically depict data groups through their quartiles. Let's take the Age variable for instance:
Classification is a machine learning task that assigns a class label to each input, identifying it as one kind or another. The most basic example is an email spam filter, which classifies each message as either "spam" or "not spam". You will encounter multiple types of ...
Classification is one of the most widely used techniques in machine learning, with a broad array of applications, including sentiment analysis, ad targeting, spam detection, risk assessment, medical diagnosis and image classification. The core goal of classification is to predict a category or class y from some inputs x.
Examples include: Email spam detection (spam or not). Churn prediction (churn or not). Conversion prediction (buy or not). Typically, binary classification tasks involve one class that is the normal state and another class that is the abnormal state. For example, "not spam" is the normal state and "spam" is the abnormal state.
This first course treats the machine learning method as a black box. Using this abstraction, you will focus on understanding tasks of interest, matching these tasks to machine learning tools, and assessing the quality of the output. In subsequent courses, you will delve into the components of this black box by examining models and algorithms.
Exploring Classification: A Deep Dive into Real-World Case Studies. Welcome to our series dedicated to unraveling the intricacies of classification through cap...
Machine learning classification can be used in a variety of day-to-day applications. In the health care industry, researchers can use machine learning classification to predict new future diseases and whether someone might contract an infection. ... Additionally, through a series of case studies, you'll gain hands-on experience in significant ...
Classification is a supervised machine learning process that involves predicting the class of given data points. Those classes can be targets, labels or categories. For example, a spam detection machine learning algorithm would aim to classify emails as either "spam" or "not spam." Common classification algorithms include: K-nearest ...
Multi-Class classification with Sci-kit learn & XGBoost: A case study using Brainwave data, by Avishek Nag (Machine Learning expert). A comparison of different classifiers' accuracy & performance for high-dimensional data. In machine learning, classification problems with high-dimensional data are really challenging. Sometimes, very simple problems become extremely ...
A classification problem in machine learning is one in which a class label is predicted for a specific example of input data. Classification problems include the following: given an example, indicate whether it is spam or not; identify a handwritten character as one of the recognized characters.
In case of multi-label classification tasks, a single instance of data can simultaneously belong to two or more classes of target variables. Hence, we can say that the predicted classes are not ...
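To make the multi-label idea concrete, one common way to encode such targets is a binary indicator row per instance, with one column per class. This encoding is an assumption for illustration and is not stated in the excerpt; the helper name `to_indicator` and the sample tags are hypothetical.

```python
def to_indicator(label_sets, classes):
    """Encode each instance's set of labels as a 0/1 row, one column per class."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_sets]


# Hypothetical article tags: one instance may carry several labels, or none.
classes = ["health", "finance", "sports"]
label_sets = [{"health", "finance"}, {"sports"}, set()]

rows = to_indicator(label_sets, classes)
# Unlike multi-class targets, a row here may contain more than one 1.
labels_per_instance = [sum(row) for row in rows]
```

In a multi-class problem every row would contain exactly one 1; here the row sums can be 0, 1, or more.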
As a young research field, machine learning has made significant progress and covered a broad spectrum of applications over the last few decades. Classification is an important task of machine learning. Today, the task is used in a vast array of areas. The present article provides a case study on various classification algorithms (under machine learning), their applicability and issues ...
The final classification is made by counting the most common scenario or votes present within the ... in addition to the 139 instances of the case study, to the machine learning algorithms, then ...
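The excerpt above is truncated, so it is not clear which algorithm it describes; vote counting of this kind appears, for instance, in k-nearest-neighbour and ensemble classifiers. The following is a generic sketch of majority voting under that assumption, with a hypothetical helper name `majority_vote` and made-up votes.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label that occurs most often among the votes.

    Ties are resolved in favour of the label seen first, which is how
    Counter.most_common orders equal counts.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Hypothetical votes from five neighbours or five base models.
votes = ["spam", "not spam", "spam", "spam", "not spam"]
final_label = majority_vote(votes)
```

Using an odd number of voters in a two-class problem avoids ties altogether.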
Overall, logistic regression is a powerful machine-learning algorithm that can be used to solve a variety of problems. It is a good choice for beginners because it is relatively simple to ...
Classification is a process of categorizing data or objects into predefined classes or categories based on their features or attributes. Machine Learning classification is a type of supervised learning technique where an algorithm is trained on a labeled dataset to predict the class or category of new, unseen data.
In this case, if we had a bunch of examples of first and last names, phone numbers, ID numbers, DoB, email addresses and VINs, each labelled as such, we could train a multi-class supervised ...
Working on case studies is one of the best practices that will help you improve your problem-solving skills as a data scientist. In this article, I'm going to introduce you to some of the best data science case studies based on the problems of classification that will help you understand and solve problems based on classification using machine learning.
Explore and run machine learning code with Kaggle Notebooks | Using data from Mines vs Rocks.
The Titanic Machine Learning Case Study is a classic example in the field of data science and machine learning. The study is based on the dataset of passengers aboard the Titanic when it sank in 1912. The study's goal is to predict whether a passenger survived or not based on their demographic and other information.
6. Machine Learning Case Study on Tesla. Tesla is now a big name in the electric automobile industry and the chances that it will continue to be the trending topic for years to come are really high. It is popular and extensively known for its advanced and futuristic cars and their advanced models.
The most effective machine learning classification techniques, such as artificial neural networks, are not easily interpretable, which limits their usefulness in critical areas, such as medicine, where errors can have severe consequences. Researchers have been working to balance the trade-off between the model performance and interpretability.
The study reveals the importance of three features (out of eight measured features) namely, the field map angle (FMA) computed from magnetic field map, beat-by-beat variations of alpha angle in the ST-T region and T wave magnitude variations in yielding a better classification accuracy (92.7 %) against that achieved by conventional features (81 %).
Heart disease has been recognized worldwide as a deadly and complex human illness [1,2,3,4,5]. Heart disease disrupts the normal functions of the heart, leading to the blockage of blood vessels [6,7,8,9]. This condition increases the risk of stroke, angina, and heart attack, as well as coronary artery infections, which can weaken the body, especially in elderly individuals and adults ...
This study investigates the efficacy of the Segment Anything Model (SAM), a foundational computer vision model, in the domain of optical remote sensing tasks, specifically focusing on image classification and semantic segmentation. The scarcity of labeled data in remote sensing poses a significant challenge for machine learning development.
By studying the development law of urbanization, the problems of disorderly expansion and resource wastage in urban built-up areas can be effectively avoided, which is crucial for the long-term sustainable development of cities. This study proposes a high-precision urban built-up-area extraction method for county-level cities for small and medium-sized towns in county-level regions. Our ...