
  • Open access
  • Published: 13 March 2024

Evaluation metrics and statistical tests for machine learning

  • Oona Rainio   ORCID: orcid.org/0000-0002-7775-7656 1 ,
  • Jarmo Teuho   ORCID: orcid.org/0000-0001-9401-0725 1 &
  • Riku Klén   ORCID: orcid.org/0000-0002-0982-8360 1  

Scientific Reports volume 14, Article number: 6086 (2024)


An Author Correction to this article was published on 08 July 2024

This article has been updated

Research on different machine learning (ML) methods has become incredibly popular during the past few decades. However, for some researchers not familiar with statistics, it might be difficult to understand how to evaluate the performance of ML models and compare them with each other. Here, we introduce the most common evaluation metrics used for the typical supervised ML tasks including binary, multi-class, and multi-label classification, regression, image segmentation, object detection, and information retrieval. We explain how to choose a suitable statistical test for comparing models, how to obtain enough values of the metric for testing, and how to perform the test and interpret its results. We also present a few practical examples of comparing convolutional neural networks used to classify X-rays with different lung infections and to detect cancer tumors in positron emission tomography images.


Introduction

Due to developments in technology and access to huge amounts of digitized data, the number of different applications using machine learning (ML) has increased dramatically during the past few decades 1 . Whereas ML techniques initially included only statistical methods and simple algorithms 2 , ML is currently used for different purposes across the fields of engineering, medicine, public health, finance, politics, and natural sciences, both in academia and industry 3 . However, because of this immense interdisciplinary interest, some new ML researchers might not have a good grasp of basic statistical concepts. This prompts a need for ongoing education about the proper use of statistics and appropriate metrics for evaluating the performance of ML algorithms.

When new ML models are created, it is necessary to compare their performance to the already existing ones 4 . Evaluation serves two purposes: methods that do not perform well can be discarded, and the ones that seem promising can be further optimized. Also, especially in medicine, it is often useful to know whether an ML model outperforms an educated professional or not 5 , 6 , 7 . In supervised ML, we first divide our data into training and test sets, use the training data for training and validation of the model, predict all the instances of the test data, and compare the obtained predictions to the corresponding ground-truth values of the test set 8 . In this way, we can estimate whether the predictions of a new ML model are better than the predictions of a human or existing models on our test set.

Despite the complexity of the final applications, ML models typically consist of relatively simple sub-tasks, such as binary or multi-class classification and regression. In addition, a special image processing ML technique called a convolutional neural network (CNN) can be used to perform image segmentation 9 , and object detectors are used to find desired targets in images or video footage 10 . Depending on the task in question, there are certain choices of evaluation metrics that can be used to assess the performance of supervised ML models 11 . There are also established statistical testing practices, especially for metrics used in binary classification 8 , 12 . Nonetheless, the misuse of certain well-known tests, such as the paired t-test, is common 4 , and the required assumptions of the tests are often ignored 11 .

Our aim here is to introduce the most common metrics for binary and multi-class classification, regression, image segmentation, and object detection. We explain the basics of statistical testing and what tests should be used in different situations related to supervised ML. At the end, we also give three examples of comparing the performance of CNNs for classifying X-rays related to lung infections and performing image segmentation for positron emission tomography (PET) images.

Different machine learning tasks

Binary classification

In a binary classification task, the instances of data are typically predicted to be either positive or negative so that a positive label is interpreted as the presence of illness, abnormality, or some other deviation, while a negative instance does not differ from the baseline in this respect. Each predicted binary label therefore has one of four possible designations: a true positive (TP) is a correctly predicted positive outcome, a true negative (TN) is a correctly predicted negative outcome, a false positive (FP) is a negative instance predicted to be positive, and a false negative (FN) is a positive instance predicted to be negative 13 . A confusion matrix, here a \(2\times 2\) -matrix containing the counts of TP, TN, FP, and FN observations like Table  1 , can be used to compute several metrics for the evaluation of the binary classifier.

The most commonly used evaluation metrics for binary classification are accuracy, sensitivity, specificity, and precision, which express the percentage of correctly classified instances in the set of all the instances, the truly positive instances, the truly negative instances, or the instances classified as positive, respectively. Sensitivity is commonly referred to as recall 14 . They have the formulas

$$\mathrm{Acc.}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}},\quad \mathrm{Sen.}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}},\quad \mathrm{Spe.}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}},\quad \mathrm{Pre.}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},\qquad (1)$$

where TP, TN, FP, and FN refer to the numbers of the predictions with these designations 13 , 14 , 15 , 16 . Especially in diagnostics, sensitivity or recall is also known as the true positive rate 14 , specificity as the true negative rate 16 , and precision as the positive predictive value 17 . With the exception of accuracy, the aforementioned metrics are often used as pairs, such as precision and recall or sensitivity and specificity. It is noteworthy that sensitivity and specificity reveal more about the model than accuracy, especially if the numbers of real positive and negative instances are very imbalanced.

There are also several other evaluation metrics that, like accuracy, depend on all the values of the confusion matrix: Youden's index 18 , defined as \(\mathrm{Sen.}+\mathrm{Spe.}-1\) 15 , gives an equal weight to the accuracies within the positive and the negative instances, regardless of their numbers. The F1-score, defined as

$$F_1=\frac{2\cdot \mathrm{Pre.}\cdot \mathrm{Sen.}}{\mathrm{Pre.}+\mathrm{Sen.}},$$

is a harmonic mean of precision and recall 19 . Cohen's kappa ( \(\kappa\) ), defined as

$$\kappa =\frac{\mathrm{Acc.}-p_e}{1-p_e}\quad \text{with}\quad p_e=\frac{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})+(\mathrm{FN}+\mathrm{TN})(\mathrm{FP}+\mathrm{TN})}{(\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN})^2},\qquad (2)$$

compares how well the binary classifier performs compared to the randomized accuracy \(p_e\) 19 . It was originally introduced as a measurement for the degree of agreement between two observers in psychology 20 but it can be applied to measure the agreement between the predicted and the real classes. Furthermore, Matthews' correlation coefficient (MCC), defined as

$$\mathrm{MCC}=\frac{\mathrm{TP}\cdot \mathrm{TN}-\mathrm{FP}\cdot \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}},\qquad (3)$$
measures the correlation between the real and the predicted values of the instances 21 . This definition of MCC follows directly from that of Pearson’s correlation coefficient 22 .
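To make the definitions above concrete, the following minimal Python sketch computes all of these metrics directly from the four confusion-matrix counts; the counts themselves are hypothetical and not taken from any experiment in this article.

```python
import math

# Hypothetical confusion-matrix counts of a binary classifier on a test set
TP, TN, FP, FN = 45, 40, 10, 5
n = TP + TN + FP + FN

accuracy    = (TP + TN) / n
sensitivity = TP / (TP + FN)          # recall, true positive rate
specificity = TN / (TN + FP)          # true negative rate
precision   = TP / (TP + FP)          # positive predictive value
youden      = sensitivity + specificity - 1
f1          = 2 * precision * sensitivity / (precision + sensitivity)

# Cohen's kappa: observed accuracy compared to the randomized accuracy p_e
p_e   = ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / n ** 2
kappa = (accuracy - p_e) / (1 - p_e)

# Matthews' correlation coefficient
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(f"acc={accuracy:.3f} sen={sensitivity:.3f} spe={specificity:.3f} "
      f"pre={precision:.3f} F1={f1:.3f} kappa={kappa:.3f} MCC={mcc:.3f}")
```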

To compute the values of the metrics above, the predictions of the test set by the model must be converted with some threshold if they are not already binary labels. The value of this threshold is often the default choice of 0.5 or the cut-point that gives the highest accuracy or Youden's index for the predictions of the training set. The threshold should always be chosen based on the predictions of the training set only because using the threshold that maximizes the accuracy of the predictions of the test set produces unrealistically good results.

However, if the numeric predictions before their conversion into binary are available, we can consider the receiver operating characteristic (ROC) curve. It is obtained by plotting sensitivity against the false positive rate (equal to 1 minus specificity) at all possible threshold values. As can be seen from Fig.  1 , a ROC curve is always a monotonically increasing function inside the unit square tied to the points (0, 0) and (1, 1), and the closer the ROC curve is to the point (0, 1), the better the predictions are 23 . The area under the ROC curve (AUC) is another possible evaluation metric with values in [0, 1] but, unlike for the aforementioned metrics, its value does not depend on the choice of the threshold at all.
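If the numeric predictions are available, the ROC curve and its AUC can also be computed with a few lines of plain NumPy. The sketch below assumes distinct prediction scores, and the example labels and scores are made up.

```python
import numpy as np

def roc_points(y_true, y_score):
    """Return FPR and TPR at every threshold, lowering the threshold one score at a time."""
    order = np.argsort(-np.asarray(y_score))           # sort scores in descending order
    y = np.asarray(y_true)[order]
    tpr = np.cumsum(y == 1) / max((y == 1).sum(), 1)   # sensitivity
    fpr = np.cumsum(y == 0) / max((y == 0).sum(), 1)   # 1 - specificity
    return np.r_[0.0, fpr], np.r_[0.0, tpr]            # start the curve at (0, 0)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                     # hypothetical labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]    # hypothetical predictions
fpr, tpr = roc_points(y_true, y_score)
auc = np.trapz(tpr, fpr)                               # area under the ROC curve
print(round(auc, 3))
```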

Figure 1

ROC curves computed from the binary predictions of a test set containing 300 chest X-rays with COVID-19 and 300 X-rays from healthy patients by the modified U-Net (in blue) and InceptionV3 (in gray), accompanied by a straight line equal to the theoretic ROC curve of a random binary classifier. The x -axis here uses specificity instead of the false positive rate but, since its values range from 1 to 0, the end result is a typical plot, not its reflection. The AUC values are 0.845 for the modified U-Net and 0.821 for InceptionV3. The values of other evaluation metrics are in Table  4 .

Alternatively, if we have n predictions \(q_i\in (0,1]\) for binary labels \(p_i\in \{0,1\}\) , we can also compute their cross-entropy loss defined as

$$L=-\frac{1}{n}\sum ^n_{i=1}\left( p_i\log (q_i)+(1-p_i)\log (1-q_i)\right) .$$
The cross-entropy loss is often used for training ML models as its values decrease as the differences between the predictions and the real binary labels diminish 24 .
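As a small illustration, the cross-entropy loss of a handful of hypothetical predictions can be computed as follows; the clipping step only guards against taking the logarithm of zero.

```python
import numpy as np

p = np.array([1, 0, 1, 0, 1])                  # real binary labels
q = np.array([0.9, 0.2, 0.6, 0.4, 0.99])       # predicted probabilities
q = np.clip(q, 1e-12, 1 - 1e-12)               # avoid log(0)

loss = -np.mean(p * np.log(q) + (1 - p) * np.log(1 - q))
print(round(loss, 4))
```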

Multi-class classification

If the classification task is separating n instances between \(k\ge 3\) different classes, we can present the results of the classifier by using a \(k\times k\) confusion matrix as in Table  2 . Its element \(n_{ij}\) at the intersection of the i th row and the j th column for \(i,j=1,\ldots ,k\) is the number of instances from the i th class classified into the j th class. The evaluation of this matrix uses the same metrics that we introduced for binary classification.

Firstly, there are two simple ways to obtain the values for all the evaluation metrics except AUC introduced in the previous section. We need to create a unique \(2\times 2\) confusion matrix for each of the k classes:

$$\textrm{TP}_i=n_{ii},\quad \textrm{FN}_i=\sum _{j\ne i}n_{ij},\quad \textrm{FP}_i=\sum _{j\ne i}n_{ji},\quad \textrm{TN}_i=n-\textrm{TP}_i-\textrm{FN}_i-\textrm{FP}_i,\quad i=1,\ldots ,k,$$

where n is the total number of instances.
In a process called macro-averaging, we calculate the value of the metric separately for each class \(i=1,\ldots ,k\) by using the numbers \(\textrm{TP}_i\) , \(\textrm{TN}_i\) , \(\textrm{FN}_i\) , and \(\textrm{FP}_i\) defined as above and then consider the mean value of the k resulting values of the metric. Alternatively, in micro-averaging, we compute the value of the evaluation metric from the sums \(\sum ^k_{i=1}\textrm{TP}_i\) , \(\sum ^k_{i=1}\textrm{TN}_i\) , \(\sum ^k_{i=1}\textrm{FN}_i\) , and \(\sum ^k_{i=1}\textrm{FP}_i\) . Out of these procedures, macro-averaging gives equal weight to each class regardless of their size whereas micro-averaging gives equal weight to each instance and is therefore easily dominated by larger classes 25 . However, if each class contains equally many instances, as in the situation of Table  2 , both micro- and macro-averaging yield the same values for accuracy, sensitivity, specificity, and Youden's index.
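The two averaging procedures can be sketched as below for a hypothetical 3×3 confusion matrix, using precision as the example metric; the same pattern applies to the other metrics of this section.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows are real classes, columns are predictions
C = np.array([[50,  3,  2],
              [ 4, 45,  6],
              [ 1,  5, 44]])
n = C.sum()

TP = np.diag(C)
FN = C.sum(axis=1) - TP        # row sums minus the diagonal
FP = C.sum(axis=0) - TP        # column sums minus the diagonal
TN = n - TP - FN - FP

macro_precision = np.mean(TP / (TP + FP))          # equal weight to every class
micro_precision = TP.sum() / (TP.sum() + FP.sum()) # equal weight to every instance
print(round(macro_precision, 3), round(micro_precision, 3))
```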

Cohen's \(\kappa\) and MCC also have their own definitions specially designed for multi-class classification: Cohen's \(\kappa\) can be written as

$$\kappa =\frac{n\sum ^k_{i=1}n_{ii}-\sum ^k_{i=1}n_{i\cdot }n_{\cdot i}}{n^2-\sum ^k_{i=1}n_{i\cdot }n_{\cdot i}},$$

where \(n_{i\cdot }=\sum ^k_{j=1}n_{ij}\) , \(n_{\cdot i}=\sum ^k_{j=1}n_{ji}\) , and \(n=\sum ^k_{i=1}\sum ^k_{j=1}n_{ij}\) 26 . Similarly, MCC can be computed from a general \(k\times k\) confusion matrix with the formula 27

$$\mathrm{MCC}=\frac{n\sum ^k_{i=1}n_{ii}-\sum ^k_{i=1}n_{i\cdot }n_{\cdot i}}{\sqrt{\left( n^2-\sum ^k_{i=1}n_{i\cdot }^2\right) \left( n^2-\sum ^k_{i=1}n_{\cdot i}^2\right) }}.$$
In the special case \(k=2\) , we obtain the same formulas for Cohen’s \(\kappa\) and MCC as in ( 2 ) and ( 3 ) 22 .

Multi-label classification

Multi-label classification is a generalized version of multi-class classification with nonexclusive class labels. Instead of dividing the data instances between several classes, the aim is to find all the class labels that apply out of \(k\ge 2\) possible labels. For each of the n instances, the model returns a binary vector \(y^{(i)}\) , \(i=1,\ldots ,n\) , whose j th element is 1 if the j th label is present and otherwise 0 for all \(j=1,\ldots ,k\) . A possible metric for evaluation is the Hamming loss, defined as

$$\mathrm{HL}=\frac{1}{nk}\sum ^n_{i=1}\sum ^k_{j=1}\mathbb {1}(x^{(i)}_j\ne y^{(i)}_j),$$

where \(x^{(i)}_j\) is the real value of the j th element in the binary vector of the i th data instance, \(y^{(i)}_j\) is the corresponding predicted value, and \(\mathbb {1}\) is the indicator function. The smaller the Hamming loss is, the better the model is. Alternatively, we can compute, for instance, the micro- or macro-average accuracy, precision, or recall for the vectors \(y^{(i)}\) , \(i=1,\ldots ,n\) 28 .
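For instance, the Hamming loss of a few hypothetical multi-label predictions is simply the fraction of mismatching label positions:

```python
import numpy as np

X = np.array([[1, 0, 1, 0],        # real binary label vectors
              [0, 1, 1, 0],
              [1, 1, 0, 1]])
Y = np.array([[1, 0, 0, 0],        # predicted binary label vectors
              [0, 1, 1, 1],
              [1, 1, 0, 1]])

hamming_loss = np.mean(X != Y)     # average over all n*k label positions
print(round(hamming_loss, 3))      # 2 mismatches out of 12 positions -> 0.167
```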

Regression

In a regression problem, a model is used to predict instances whose values are real numbers rather than categorical. This is the case when predicting, for instance, height, stock prices, voter turnout, or rainfall amount. Here, we denote the real value of the i th instance in a test set of n instances by \(x_i\) and its predicted value by \(y_i\) for \(i=1,\ldots ,n\) .

One way to evaluate the model is to measure the correlation between the real and the predicted values 12 . The most well-known method for this is Pearson's correlation coefficient, defined as

$$r=\frac{\sum ^n_{i=1}(x_i-\overline{x})(y_i-\overline{y})}{\sqrt{\sum ^n_{i=1}(x_i-\overline{x})^2}\sqrt{\sum ^n_{i=1}(y_i-\overline{y})^2}},$$
where \(\overline{x}\) and \(\overline{y}\) denote the mean values of the vectors \((x_1,\ldots ,x_n)\) and \((y_1,\ldots ,y_n)\) , respectively 29 . However, Pearson’s correlation coefficient is designed for measuring correlation between variables whose marginal distributions are assumed to be normal. Because of this, Spearman’s correlation coefficient \(r_s\) might be a better evaluation metric when the real values \(x_i\) are not even approximately normally distributed. Spearman’s correlation coefficient is obtained by first converting the observations \(x_i\) and \(y_i\) , \(i=1,\ldots ,n\) , into their ranks and then computing Pearson’s correlation coefficient of these ranks 29 .

Another way to evaluate the model is to use some error measurement, such as the mean absolute error (MAE) \(\frac{1}{n}\sum ^n_{i=1}|x_i-y_i|\) or the mean squared error (MSE) \(\frac{1}{n}\sum ^n_{i=1}(x_i-y_i)^2\) 12 . The difference between MSE and MAE is that MSE punishes more for large errors 12 . Naturally, the smaller the error measurement is, the better the model performs.
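The correlation coefficients and error measures above are one-liners with scipy.stats and NumPy; the vectors of real and predicted values below are made up.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.5, 3.0, 4.2, 5.1])    # real values
y = np.array([1.2, 2.3, 3.4, 3.9, 5.5])    # predicted values

r, r_pvalue = stats.pearsonr(x, y)          # Pearson's correlation coefficient
rs, rs_pvalue = stats.spearmanr(x, y)       # Spearman's rank correlation

mae = np.mean(np.abs(x - y))                # mean absolute error
mse = np.mean((x - y) ** 2)                 # mean squared error
print(round(r, 3), round(rs, 3), round(mae, 3), round(mse, 3))
```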

Image segmentation

Image segmentation is the process of dividing images into regions of pixels or, in the case of three-dimensional (3D) images, voxels, so that different objects and their boundaries can be located. In practice, this means converting an image into a segmentation mask of the same size, each point of which tells the class of the corresponding point in the image. In binary image segmentation, the desired output is a binary mask with positive elements coded as 1s and negative elements as 0s, but we can also perform multi-class image segmentation, called semantic segmentation, by using more integers to signify different classes. An example of binary tumor segmentation can be seen in Fig.  2 .

Figure 2

The binary tumor mask predicted by U-Net CNN with maximum dimensionality of 128 (in blue) and the ground-truth tumor mask drawn by a physician (in white) for one transaxial slice from a PET image of a head and neck cancer patient. The image is 128  \(\times\)  128 pixels and the predicted segmentation mask contains 181 TP pixels, 16156 TN pixels, 17 FP pixels, and 30 FN pixels. This gives us Dice of 0.885, IoU of 0.794, and overall pixel accuracy of 0.997.

One possible evaluation metric for image segmentation masks is accuracy. In the case of binary segmentation, we could simply count the numbers of TP, TN, FN, and FP pixels and calculate the accuracy as in ( 1 ). However, the issue with this approach is that the number of positive pixels is typically very small compared to the number of negative pixels: For instance, if we try to perform tumor segmentation for medical images of the body, the positive targets, while incredibly important, have minimal volume compared to the background and they might not even be present in some images. Because of this, the value of accuracy can be very high even in the cases where the model does not find the positive object, as long as the majority of the negative pixels are classified correctly.

Consequently, the results of binary segmentation are often evaluated with a metric that ignores the TN points. Instead, we concentrate on evaluating the similarity of the predicted positive segment given by a CNN and the ground-truth positive segment annotated by a human. For this purpose, we can use the Sørensen–Dice similarity coefficient 30 , 31 , also known as the Dice score, defined for two sets X and Y as

$$D(X,Y)=\frac{2|X\cap Y|}{|X|+|Y|},\qquad (5)$$

where | S | denotes the number of pixels or voxels in the set S 32 . This definition can be equivalently written as

$$D=\frac{2\,\mathrm{TP}}{2\,\mathrm{TP}+\mathrm{FP}+\mathrm{FN}}$$

by using the elements of the confusion matrix from the binary predictions of the points 32 . A very similar alternative to the Dice score is the Jaccard similarity coefficient 33 , which is also known as the Jaccard index or Intersection over Union (IoU), and defined as

$$\mathrm{IoU}(X,Y)=\frac{|X\cap Y|}{|X\cup Y|}\qquad (6)$$

for the sets X and Y , and

$$\mathrm{IoU}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}+\mathrm{FN}}$$

for the elements of the confusion matrix 32 . The equality \(\textrm{IoU}=D/(2-D)\) holds trivially between the IoU and the Dice score 32 .
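For binary masks stored as arrays, the Dice score and IoU follow directly from the overlap of the predicted and ground-truth masks, as in the sketch below with two small hypothetical masks.

```python
import numpy as np

pred = np.array([[0, 1, 1],        # predicted binary mask
                 [0, 1, 0],
                 [0, 0, 0]], dtype=bool)
true = np.array([[0, 1, 1],        # ground-truth binary mask
                 [0, 1, 1],
                 [0, 0, 0]], dtype=bool)

tp = np.logical_and(pred, true).sum()
fp = np.logical_and(pred, ~true).sum()
fn = np.logical_and(~pred, true).sum()

dice = 2 * tp / (2 * tp + fp + fn)
iou  = tp / (tp + fp + fn)
print(round(dice, 3), round(iou, 3))   # IoU = dice / (2 - dice) holds as well
```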

There are also metrics specially designed for 3D segmentation, as this is a common task for medical tomography images. The surface of the point set X , denoted by \(\partial X\) , is the set of all voxels in X for which at least one of the 18 or 26 neighbouring voxels does not belong to X . As an alternative to the typical Dice score, the surface Dice similarity coefficient (SDSC) can be computed by replacing X and Y with their surfaces \(\partial X\) and \(\partial Y\) in ( 5 ). Let d ( x ,  y ) be the Euclidean distance between two voxels x and y , and define \(d(x,Y)=\min _{y\in \partial Y}d(x,y)\) for the set Y . The average symmetric surface distance (ASD) between sets X and Y can now be defined as

$$\mathrm{ASD}(X,Y)=\frac{1}{|\partial X|+|\partial Y|}\left( \sum _{x\in \partial X}d(x,Y)+\sum _{y\in \partial Y}d(y,X)\right) .$$

The Hausdorff distance is \(\textrm{hd}(X,Y)=\max _{x\in X}d(x,Y)\) and its symmetric version, also known as the maximum symmetric surface distance, is \(\textrm{HD}(X,Y)=\max \{\textrm{hd}(X,Y),\textrm{hd}(Y,X)\}\) . The symmetric volume difference (SVD) is a Dice-based error metric defined as \(\textrm{SVD}=1-D\) and the volumetric overlap error (VOE) is the corresponding error measure derived from IoU, \(\textrm{VOE}=1-\textrm{IoU}\) . The model performance is considered better with smaller surface distances and error terms 34 .

The results of multi-class semantic segmentation are typically evaluated by using mean Dice or IoU values, either as the mean of all within-class scores in a single image or the class-specific means of several images. The similarity of two semantic segmentation masks, or any two images, can also be evaluated with the structural similarity index measure (SSIM). If u and v are two image matrices with means \(\overline{u}\) and \(\overline{v}\) , variances \(s_u\) and \(s_v\) , and covariance \(s_{u,v}\) , then we have

$$\mathrm{SSIM}(u,v)=\frac{(2\overline{u}\,\overline{v}+c_1)(2s_{u,v}+c_2)}{(\overline{u}^2+\overline{v}^2+c_1)(s_u+s_v+c_2)}$$

for constants \(c_1\) and \(c_2\) depending on pixel values 35 . The SSIM is typically computed by using the formula above within several kernels or windows of the images. The values of SSIM are interpreted like those of correlation: 1 for perfect similarity, 0 for no association, and \(-1\) for perfect opposites.

Object detection

Another similar task related to image processing is object detection, in which we find bounding boxes around each object in the image and classify them into different classes. A good object detector is capable of finding all the objects in an image without producing any false observations, placing the bounding boxes as close to their correct locations as possible, and also classifying all the found objects correctly. Due to the diversity of these subtasks, the evaluation of object detectors is slightly more complicated than it is for the other models introduced.

To evaluate the results of object detection, we must start by counting how many objects of a specific class were found. This quickly leads to the question of how close a predicted bounding box needs to be to a ground-truth box so that we can interpret the object as found. The common criterion here is the IoU defined as in ( 6 ): The prediction is only considered a match of a ground-truth box if the IoU value of the two boxes exceeds a certain threshold value, often 0.5. If there are several predicted boxes producing a high enough IoU with the same ground-truth box, only the best one in terms of IoU is considered a match to the ground-truth box while all the others are FP observations. Namely, FP is here the number of predicted boxes without a matching ground-truth box, TP is the number of the predictions that match a ground-truth box of the same class, and FN is the number of ground-truth boxes without a matching prediction 10 .

With the TP, FP, and FN numbers of the specific class, we can compute precision and recall as in ( 1 ). Since an object detector outputs a confidence for every bounding box expressing how confident the model is about the prediction, we can remove the predictions below a threshold of confidence. Changing this threshold affects the TP, FP, and FN numbers and therefore also precision and recall. The precision-recall curve (PRC) can be obtained by plotting precision against recall at all possible thresholds of confidence. After that, we can compute the average precision (AP) as the area under the PRC. The whole model is evaluated by computing the mean average precision (mAP) as the mean value of the APs over all the different classes. We often consider mAP@0.5, which is computed by using the IoU threshold 0.5 to define a match, but just as well we could compute, for instance, mAP@0.75 or mAP@0.9, or mAP@[0.5:0.95], which is the mean value of mAP@0.5, mAP@0.55, \(\ldots\) , mAP@0.95. The metric mAP@0.75 is more strict than mAP@0.5 given that it requires greater overlap for the potential matches and is therefore suitable for situations where the predicted bounding box locations need to be very exact 10 .
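The IoU matching criterion for two axis-aligned bounding boxes can be computed from their corner coordinates; the boxes below are hypothetical and given as (x1, y1, x2, y2).

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])     # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

pred_box = (10, 10, 50, 50)                    # predicted bounding box
true_box = (15, 12, 55, 48)                    # ground-truth bounding box
matched = box_iou(pred_box, true_box) >= 0.5   # counted as TP if IoU exceeds 0.5
print(round(box_iou(pred_box, true_box), 3), matched)
```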

Information retrieval

Information search and retrieval is a significant task in ML research. The ability to retrieve only relevant results from large image- or text-based databases is crucial for these databases to be actually useful. Search engines and other information retrieval models can be evaluated by using precision and recall to describe the percentage of relevant retrieved documents among either the search results or all the relevant documents. If we have K results \(d_1,\ldots ,d_K\) ordered by estimated relevance from the database D and each document d is either relevant ( \(\textrm{rel}(d)=1\) ) or not ( \(\textrm{rel}(d)=0\) ), we can compute the precision of the first k retrieved documents as P@ \(k=\sum ^k_{i=1}\textrm{rel}(d_i)/k\) for \(k=1,\ldots ,K\) and then define AP as 36

$$\mathrm{AP}=\frac{\sum ^K_{k=1}P@k\cdot \textrm{rel}(d_k)}{\sum ^K_{k=1}\textrm{rel}(d_k)}.$$
The mAP is obtained as the mean value of AP across different topics or search queries 36 . If the results have more classes than just relevant and non-relevant, the discounted cumulative gain (DCG) of the first k results can be defined as

$$\mathrm{DCG}@k=\sum ^k_{i=1}\frac{G(i)}{\log _2(i+1)},$$

where G ( i ) is a numerical value representing the gain of the i th result 37 . For instance, the values 10, 7, 3, 0.5, and 0 are often used for perfect, excellent, good, fair, and bad results, respectively 37 . If there are several search queries to be evaluated, the mean DCG can be used.
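A small sketch of these retrieval metrics is given below; the relevance labels and gain values are hypothetical, and the DCG discount uses the common log2(i + 1) weighting, which may differ from the exact convention of the cited works.

```python
import numpy as np

rel = np.array([1, 0, 1, 1, 0])                         # rel(d_i) of the K retrieved documents
p_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)    # P@k for k = 1, ..., K

# Average precision: mean of P@k over the positions of the relevant documents
ap = np.sum(p_at_k * rel) / rel.sum()

gains = np.array([10, 3, 7, 0.5, 0])                    # graded gains G(i) of the results
dcg = np.sum(gains / np.log2(np.arange(2, len(gains) + 2)))   # discount log2(i + 1)
print(round(ap, 3), round(dcg, 3))
```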

Statistical tests

The motivation behind statistical tests is often to find out whether there is a significant difference between two different populations with respect to some specific property. We can collect smaller data sets from the populations and use them to compute values of the numeric quantity representing the feature of interest. Since there is nearly always at least a slight difference between these values, the relevant question is whether this difference is great enough to be considered actual evidence of an underlying dissimilarity between the populations or whether it is just a result of random variation.

The process of statistical testing is relatively simple: We formulate a null hypothesis \(H_0\) according to which there is no real difference, choose some level of significance \(\alpha \in (0,1)\) , and define a suitable test statistic Z with a known probability distribution \(P(Z|H_0)\) under the null hypothesis. We then use this distribution to compute the probability of obtaining at least as extreme a value for the statistic Z as the value z already observed. If the resulting probability \(p=2\min \{P(Z\le z|H_0),P(Z\ge z|H_0)\}\) , called the p value, is less than \(\alpha\) , then the null hypothesis is rejected and the difference is considered statistically significant. We make a type I error when rejecting a true null hypothesis, and a type II error when accepting a false null hypothesis. We can control the probability of a type I error as it is equal to \(\alpha\) . We could also use \(\alpha\) to compute the critical values of the statistic for accepting or rejecting the null hypothesis instead of using a p value. However, in this paper, all the mentioned test functions in Python 38 and R 39 return a p value. We use \(\alpha =0.05\) as the level of significance in our examples.

When comparing the performance of two or more models, it is often necessary to evaluate the models multiple times, depending on the evaluation metric and the statistical test used. For instance, while we can compute the Dice score of every predicted segmentation mask in the test set, we only obtain one value of accuracy from the predictions of the whole test set after binary classification, and similarly only one value of MSE after regression. If we want to compare regression models, we can test squared errors instead of their mean and, in the case of binary classification, there are tests that are based on the predictions of a single test set. In other cases, we have to evaluate our models on several test sets to obtain enough values of the other evaluation metrics for statistical testing. The required values of an evaluation metric for a certain statistical test are summarized in the flowchart of Fig.  3 .

While the test sets should ideally come from fully different data sets, sometimes our only option is to use a resampling procedure to create multiple test sets from the same data. In practice, we must re-initialize, train, and test the models several times and save the values of the evaluation metrics from the predictions of the test set on each iteration round. We should use the same training and test sets for all the models on the same iteration round but vary them between the rounds because, otherwise, our conclusions about a potential difference between the models might be misled by some unknown factor in these specific data sets. Researchers commonly use here k -fold cross-validation, in which the data is divided into k similarly sized folds and, during k iteration rounds, each fold is the test set exactly once while the other \(k-1\) folds form the training data 12 . Alternatively, we can perform repeated cross-validation that has a few re-runs of each potential test set 12 . However, it should be taken into account that resampling methods do not produce independent values for the evaluation metrics and might lead to underestimating the variance of the test statistic, causing biased results 12 .

Figure 3

The possible tasks for a model, their evaluation metrics, the values of the evaluation metric that must be computed for each model before statistical testing, the potential questions a statistical test could answer in the situation, and the suitable test.

Testing for a significant difference in any evaluation metric

Regardless of whether the values of the evaluation metric come from a single test set or several test sets on different iteration rounds, the values of the metric for the two models are based on the same instances and are therefore paired. Many researchers therefore check which of the models gives a higher mean and then use a paired t-test to test if the difference in the means is significant 4 . The null hypothesis of the paired t-test is that the mean of the differences in the matched pairs is equal to 0 40 , and this test can be performed with the function ttest_rel in the package scipy.stats 41 in Python or t.test(x,y,paired=TRUE) in the base package stats in R. There are also newer variations of the t-test that are specially designed for repeated cross-validation 11 . However, the t-test is not recommended for this situation because it is strongly affected by outliers 4 and not valid when resampled test sets are used 12 .
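For completeness, and keeping the caveats above in mind, a paired t-test of two models could be run as below; the two vectors stand for hypothetical accuracies of the models on five matched test sets.

```python
from scipy import stats

model_a = [0.91, 0.88, 0.93, 0.90, 0.89]   # e.g. accuracies of model A on 5 test sets
model_b = [0.87, 0.86, 0.90, 0.88, 0.85]   # corresponding accuracies of model B

t_stat, p_value = stats.ttest_rel(model_a, model_b)
print(t_stat, p_value)
```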

Another possible test is the sign test. If two models are evaluated by using N test sets and there is no difference between them, then each of them should produce the better value of the evaluation metric N /2 times 4 . Thus, the number of times the first model is better than the second follows a binomial distribution and, for a greater number N , approximately a normal distribution with mean N /2 and standard deviation \(\sqrt{N}/2\) 11 . We can therefore apply the sign test to test whether one of the models outperforms the other with respect to the chosen evaluation metric in a statistically significant way. However, the sign test has very weak power for detecting significant differences 4 .
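Since the sign test reduces to a binomial test on the number of wins, it can be sketched with scipy's binomtest under the assumption of no ties; the counts are hypothetical.

```python
from scipy.stats import binomtest

wins_a = 18          # hypothetical: model A gives the better metric value on 18 test sets
n_sets = 25          # total number of test sets (assume no ties)

# Under the null hypothesis, the wins follow a binomial distribution with p = 0.5
result = binomtest(wins_a, n_sets, p=0.5)
print(result.pvalue)
```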

The best alternative for this situation is the Wilcoxon signed-rank test 4 . It is a non-parametric test for the null hypothesis that the median of the differences in the matched pairs is equal to 0 42 . This test has the test statistic

$$T=\min \left\{ \sum _{d_i>0}\textrm{rank}(|d_i|),\ \sum _{d_i<0}\textrm{rank}(|d_i|)\right\} ,$$

where \(\textrm{rank}(|d_i|)\) , \(i=1,\ldots ,n\) , denotes the rank of the difference \(d_i\) of the i th matched pair when the n differences are ordered by their absolute values 43 . The T -statistic can be examined directly by using its own critical values or, for large values of n , by utilizing the statistic

$$z=\frac{T-\frac{1}{4}n(n+1)}{\sqrt{\frac{1}{24}n(n+1)(2n+1)}},$$
which follows the normal distribution under the null hypothesis 4 . The Wilcoxon signed-rank test can be performed with wilcoxon in scipy.stats in Python or wilcox.test(x,y, paired=TRUE) in stats in R.
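Using the same kind of paired metric values as above, the Wilcoxon signed-rank test is a one-line call; the values below are again hypothetical.

```python
from scipy import stats

model_a = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.87]   # hypothetical paired metric values
model_b = [0.87, 0.86, 0.90, 0.88, 0.85, 0.93, 0.84]

w_stat, p_value = stats.wilcoxon(model_a, model_b)
print(w_stat, p_value)
```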

Test for comparing several models

As explained above, we can use the Wilcoxon signed-rank test to estimate whether the differences between two models are significant with respect to any evaluation metric, but this test is not ideal when comparing several models. Namely, while we can repeat Wilcoxon tests between each pair of models, the risk of a type I error increases with multiple comparisons. Adjusting the level of significance by the Bonferroni correction has been suggested as a solution 44 but it is overly radical 4 .

Instead, the better approach in a situation where we have K models evaluated on J data sets is to perform Friedman's test 4 . The average rank of the k th model, \(k=1,\ldots ,K\) , is \(\overline{R}_k=\sum ^J_{j=1}r^j_k/J\) where \(r^j_k\) is the rank of the j th value of the evaluation metric for the k th model 4 . The test statistic can now be written as

$$\chi ^2_F=\frac{12J}{K(K+1)}\left( \sum ^K_{k=1}\overline{R}_k^2-\frac{K(K+1)^2}{4}\right)$$

or, as noted by Iman and Davenport 45 , as 4

$$F_{ID}=\frac{(J-1)\chi ^2_F}{J(K-1)-\chi ^2_F}.$$
Out of the two statistics, \(\chi ^2_F\) is overly conservative and \(F_{ID}\) is therefore recommended 4 . Under the null hypothesis, \(\chi ^2_F\) follows the \(\chi ^2\) -distribution with \(K-1\) degrees of freedom and \(F_{ID}\) follows the F -distribution with \(K-1\) and \((K-1)(J-1)\) degrees of freedom 4 . Friedman’s test can be performed with friedmanchisquare in scipy.stats in Python or friedman.test in stats in R, but both of these functions are based on the statistic \(\chi ^2_F\) and therefore are not reliable for small values of J . However, if J is small, we can use a few separate Wilcoxon signed-rank tests instead.
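With, say, three models evaluated on the same data sets, Friedman's test takes one sequence of metric values per model; the values below are made up, and the Python function uses the \(\chi ^2_F\) statistic as noted above.

```python
from scipy import stats

# Hypothetical metric values of three models on the same eight data sets
model_a = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.87, 0.90]
model_b = [0.87, 0.86, 0.90, 0.88, 0.85, 0.93, 0.84, 0.86]
model_c = [0.85, 0.87, 0.88, 0.84, 0.83, 0.90, 0.82, 0.85]

chi2_f, p_value = stats.friedmanchisquare(model_a, model_b, model_c)
print(chi2_f, p_value)
```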

Tests for binary classification of a single test set

There are also tests for the comparison of two classifiers that only require their predictions from a single iteration round. McNemar's test is a common non-parametric test that only requires two numbers and is typically used to compare either the sensitivity or the specificity of two classifiers 46 . To find out whether there is a significant difference in the sensitivity of the classifiers, let b be the number of positive instances in the test set misclassified as FN by the first classifier but not by the second classifier, and c similarly the number of positive instances misclassified as FN by the second classifier but not by the first classifier. To study specificity, count the numbers b and c by using FP misclassifications among the negative instances. Comparing accuracy by counting errors among both positive and negative sets is not recommended 47 . If there is no significant difference in the performance of the two classifiers, the test statistic

$$\chi ^2=\frac{(|b-c|-1)^2}{b+c}$$
follows the \(\chi ^2\) -distribution with 1 degree of freedom for \(b+c\ge 20\) and a binomial distribution otherwise 11 . This test can be performed with mcnemar in statsmodels.stats.contingency_tables 48 in Python or mcnemar.test in stats in R.
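In Python, McNemar's test takes the 2×2 table of agreements and disagreements between the two classifiers on the same instances; the counts below are hypothetical.

```python
from statsmodels.stats.contingency_tables import mcnemar

# Disagreement table among e.g. the positive instances of a single test set:
# rows = classifier 1 (correct, FN), columns = classifier 2 (correct, FN)
table = [[50, 4],     # both correct, only classifier 2 wrong (c = 4)
         [12, 6]]     # only classifier 1 wrong (b = 12), both wrong

result = mcnemar(table, exact=True)   # exact binomial version, suitable for small b + c
print(result.statistic, result.pvalue)
```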

We can also use the DeLong test to see whether there is a statistically significant difference between the AUCs of two binary classifiers. Namely, DeLong et al. 49 noticed that the Mann-Whitney statistic can be used as an estimate of an AUC and the theory of generalized U-statistics can be applied to compare two AUCs. The Mann-Whitney two-sample statistic for the AUC can be written as

$$\hat{\theta }=\frac{1}{mn}\sum ^m_{i=1}\sum ^n_{j=1}\left( \mathbb {1}(Y_{j0}<Y_{i1})+\frac{1}{2}\mathbb {1}(Y_{j0}=Y_{i1})\right) ,$$

where m is the number of truly positive instances, n is the number of truly negative instances, \(Y_{i1}\) is the numeric prediction of the i th positive instance before it was converted into binary and, similarly, \(Y_{j0}\) is the numeric prediction of the j th negative instance 50 . Let \(\hat{\theta }_1\) be the estimate above for the AUC of the first classifier and \(\hat{\theta }_2\) the same for the second classifier. The DeLong test estimates their variance and covariance (see e.g. 51 for the exact formulas) and then uses the statistic

$$z=\frac{\hat{\theta }_1-\hat{\theta }_2}{\sqrt{\widehat{\textrm{Var}}(\hat{\theta }_1)+\widehat{\textrm{Var}}(\hat{\theta }_2)-2\,\widehat{\textrm{Cov}}(\hat{\theta }_1,\hat{\theta }_2)}},$$
which follows the normal distribution under the null hypothesis due to the properties of the known U-statistic 51 . The DeLong test can be performed with roc.test(x,y,method= ’delong’) in the package pROC 52 in R.

Tests for comparing variance

Another important factor when comparing the performance of models is the amount of variance they produce. A model that consistently obtains high values of some evaluation metric is better than a model whose performance varies greatly on different iteration rounds. However, it must be taken into careful consideration how the multiple values of the evaluation metric are obtained before comparing their variance. For instance, if we use repeated cross-validation, we will not obtain a realistic estimate of how the performance of a model would vary over different data sets.

We can use the F-test of equality of variances to test the null hypothesis according to which two populations have equal variances. The test statistic is \(F=S^2_1/S^2_2\) where \(S^2_1\) and \(S^2_2\) are the sample variances of the values produced by the two models for the evaluation metric, and this F-statistic follows the F-distribution with \(n-1\) and \(n-1\) degrees of freedom under the null hypothesis 53 .

However, the use of the F-test is not recommended for non-normally distributed values, and this is often the case when comparing evaluation metrics: For instance, if the model has a median accuracy of 90% but a high amount of variation between different test sets, it is likely that the distribution of accuracy is left-skewed as the accuracy is limited to [0, 1] by its definition. The normality can be tested here with the Shapiro–Wilk test 54 ( shapiro in the package scipy.stats in Python and shapiro.test in the package stats in R). If the data is not normally distributed, the possible alternatives for the F-test include Bartlett's test 55 ( bartlett in scipy.stats in Python and bartlett.test in stats in R) and Levene's test 56 ( levene in scipy.stats in Python and leveneTest in the package car 57 in R).
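The normality check and the variance tests mentioned above are all single calls in scipy.stats; the two samples below stand for hypothetical metric values of two models.

```python
from scipy import stats

model_a = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.87, 0.95, 0.86, 0.90]
model_b = [0.80, 0.93, 0.70, 0.88, 0.95, 0.75, 0.84, 0.91, 0.66, 0.89]

print(stats.shapiro(model_a).pvalue, stats.shapiro(model_b).pvalue)  # normality check
print(stats.bartlett(model_a, model_b).pvalue)                       # Bartlett's test
print(stats.levene(model_a, model_b).pvalue)                         # Levene's test
```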

Comparison to a human

In ML research, it is often of interest whether a specific ML model performs better than a human. Especially in the medical field, it is useful to estimate how much the tumor masks predicted by a CNN differ from those drawn by a physician by taking into account how much difference there would be if the same masks were drawn by two different physicians. For this purpose, we can use statistical testing to compare the results of an ML model and a human in terms of a relevant evaluation metric as we would compare the performance of two models. However, there might be some cases where this comparison is not possible: A human is not able to go through very large amounts of data, at least not fast, and, while we can always re-initialize the model between different rounds of repeated cross-validation, a human will not forget their earlier decisions. Because of this, statistical comparison between an ML model and a human is often limited to using McNemar's test or the DeLong test to compare classifications in a single test set, or the Wilcoxon signed-rank test to compare segmentation masks in terms of Dice and IoU values for a reasonable number of images.

Software requirements

The CNNs were coded in Python (version: 3.9.9) 38 with the packages TensorFlow (version: 2.7.0) 58 and Keras (version: 2.7.0) 59 . Most of the tests were performed in Python with scipy (version: 1.7.3) 41 or statsmodels (version: 0.14.0) 48 . The DeLong test was performed and Fig.  1 was plotted with pROC (version: 1.18.5) 52 in R (version: 3.4.1) 39 . The images of the third data set had been studied with Carimas (version: 2.10) 60 , which was also used to draw their binary masks.

Data sets

We use three data sets consisting of two-dimensional grayscale images converted into the size of 128  \(\times\)  128 pixels. The first data set contains 3000 chest X-rays of COVID-19 patients and 3000 chest X-rays of healthy patients chosen from the COVID-19 Radiography Database 61 , 62 . The second data set has 700 chest X-rays of healthy patients and 700 chest X-rays of COVID-19 patients from the COVID-19 Radiography Database, 700 chest X-rays of patients with pneumonia from Chest X-Ray Images (Pneumonia) 63 , and 700 chest X-rays of tuberculosis patients from the Tuberculosis (TB) Chest X-ray Database 64 . The third data set has a total of 962 two-dimensional transaxial image slices from the PET images of 89 head and neck squamous cell carcinoma patients. The patients were imaged with the \(^{18}\) F-fluorodeoxyglucose tracer in Turku PET Centre, Turku, Finland, during the years 2014–2022. More details about the imaging can be found in 65 , 66 . Each of the slices also has a ground-truth binary segmentation mask showing pixels depicting cancerous tissue as positive and the rest as negative, and the slices were chosen so that they have at least 6 positive pixels. All the cancer patients were at least 18 years of age, gave informed consent to the research use of their data, and the research use of their data was approved by the Ethics Committee of the Hospital District of Southwest Finland. All research was performed in accordance with the Declaration of Helsinki.

Convolutional neural networks

In both binary and multi-class classification, we use a CNN that has the U-Net architecture by Ronneberger et al. 67 modified for classification 65 and a ready-built CNN called InceptionV3 available in Keras. For binary segmentation, we use two U-Nets, the shallower of which has 64 as the maximum dimensionality of a Conv2D layer and the deeper of which has 128. They were also used in 66 , 68 . We use stochastic gradient descent as the optimizer for the classification CNNs and Adam for the segmentation CNNs. The classification CNNs are trained for 10 epochs and the segmentation CNNs for 50. The learning rate is 0.001 and, during training, 30% of the training data is used for validation. After training the CNNs for binary classification, we predict both the training and the test sets and use the threshold giving the maximal Youden's index in the training set as the threshold for converting the numeric predictions of the test set into binary labels. We similarly convert the output after binary segmentation by using the threshold that produces the highest median Dice in the training set. For the multi-class classification, we obtain class labels directly by using the maximum elements of the one-hot encoding.

Our experiments

We first compare the performance of the modified U-Net and InceptionV3 in binary classification by using our first data set of COVID-19 and negative X-rays with fivefold cross-validation. We compute all the possible evaluation metrics from our single test set and use McNemar’s test for sensitivity and specificity and DeLong test for AUC. Then we compare the modified U-Net and InceptionV3 in multi-class classification with repeated fivefold cross-validation (5 re-runs of each test set). We save the values of micro- and macro-average evaluation metrics after each round and use the Wilcoxon signed-rank test to estimate whether the differences in the resulting 25 values of each metric are significant or not. Even though the paired t-test should not be used for this, we perform it to see if its p values would be different from those of the Wilcoxon test. Finally, we divide our third data set patient-wise into train and test sets so that the test set has 191 slices (19.9% of the total data), and compare the two U-Nets for binary segmentation. We use the Shapiro–Wilk test to test the normality of Dice and IoU values of different segmentation masks, t-test and Wilcoxon test to estimate their differences, and F-test, Bartlett’s test and Levene’s test to check if there are significant differences in variances.

The results of the binary classification task are summarized in the contingency table of Table  3 and the resulting values of the evaluation metrics are in Table  4 . According to McNemar's tests computed from Table  3 separately for sensitivity among the COVID-19 patients and specificity among the negative patients, the modified U-Net produced significantly higher sensitivity ( p value < 5.07e−5) but significantly lower specificity ( p value < 0.0207). The ROC curves of the modified U-Net and InceptionV3 can be seen in Fig.  1 and, according to the DeLong test, there is no significant difference in their AUCs ( p value = 0.137).

The median values of the evaluation metrics are in Table  5 for the multi-class classification task. According to t-tests and Wilcoxon tests, the modified U-Net is significantly better than InceptionV3, regardless of which metric is used. The p value of the t-test for macro-average F1-score is 6.47e−4 and less than 2.38e−5 for all the other metrics and, similarly, the p value of the Wilcoxon test for macro-average F1-score is 0.00116 and less than 6.37e−5 for all the other metrics.

The median and standard deviation of Dice and IoU values computed for the two U-Nets in the segmentation task are in Table  6 , as are the p values of Shapiro–Wilk tests, t-tests, Wilcoxon tests, F-tests, Bartlett’s tests, and Levene’s tests. Based on these p values, neither Dice nor IoU values are normally distributed, the deeper U-Net is significantly better in terms of both Dice and IoU values, and, while the deeper U-Net had higher standard deviation, this difference is only significant according to Levene’s test performed for the IoU values.

In our first experiment, we used both McNemar's test and the DeLong test to study two CNNs used for binary classification. Our results show that the choice of the threshold was not ideal for the modified U-Net as we obtained high sensitivity at the cost of specificity. This also reveals one issue with McNemar's test: It does not tell us which classifier is better if one of them has a significantly higher sensitivity but a significantly lower specificity. We would need to use some other thresholds to convert the output of the CNN into binary labels and then repeat McNemar's tests in order to find out whether the significant differences are caused by specific threshold choices or not. In this respect, the DeLong test is more useful as its results do not depend on the threshold choices. However, to obtain more trustworthy results, it would still be necessary to use cross-validation and compare the AUCs of different test sets with the Wilcoxon signed-rank test.

In our second and third experiments, we used the t-test for comparing the values of evaluation metrics, even though it is not recommended for this, especially not when combined with repeated cross-validation. Its p values were relatively close to those of the Wilcoxon tests and, regardless of which test was used, we obtained the same conclusions about the significant differences. Since the misuse of the t-test is rather common, as noted by Demšar 4 , it is good to know that the results obtained in earlier research are not necessarily wrong. Similarly, even though the F-test is not designed for non-normally distributed data, its p values were very close to those of Bartlett's tests. However, both the t-test and the F-test are sensitive to the error caused by potential outliers so their use can lead to incorrect results.

It should be noted here that the aim of our experiments was to give examples of the use of the evaluation metrics and the related tests. To find out how often the t-test or some other test produces false conclusions when improperly used, more research is needed. Similarly, one possible topic for future research is how the number of test sets affects the trustworthiness of the conclusions.

In this paper, we introduced several evaluation metrics for common ML tasks including binary and multi-class classification, regression, image segmentation, and object detection. Statistical testing can be used to estimate whether the different values of these metrics between two or more models are caused by actual differences between the models. The choice of the exact test depends on the task of the models, the evaluation metric used, and the number of test sets available. As some metrics produce only one value from a single test set and there might be only one data set, some type of resampling, such as repeated cross-validation, is often necessary. Because of this, well-known tests such as the paired t-test underestimate variance and do not produce reliable results. Instead, the use of non-parametric tests such as the Wilcoxon signed-rank test or Friedman's test is recommended.

Data availability

The X-ray data sets analyzed during the current study are available in the repositories: COVID-19 Radiography Database 61 , 62 https://www.kaggle.com/datasets/tawsifurrahman/covid19-radiography-database , Chest X-Ray Images (Pneumonia) 63 https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia , and Tuberculosis (TB) Chest X-ray Database 64 https://www.kaggle.com/datasets/tawsifurrahman/tuberculosis-tb-chest-xray-dataset .

Code availability

Available at github.com/rklen/statistical_tests_for_CNNs.

Change history

08 July 2024

A Correction to this paper has been published: https://doi.org/10.1038/s41598-024-66611-y

Jordan, M. I. & Mitchell, T. M. Machine learning: Trends, perspectives, and prospects. Science 349 (6245), 255–260 (2015).


Fradkov, A. L. Early history of machine learning. IFAC-PapersOnLine 53 (2), 1385–1390 (2020).


Bertolini, M., Mezzogori, D., Neroni, M. & Zammori, F. Machine Learning for industrial applications: A comprehensive literature review. Expert Syst. Appl. 175 , 114820 (2021).

Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7 , 1–30 (2006).


Angeline, R., Kanna, S.N., Menon, N.G., Ashwath, B.: Identifying malignancy of lung cancer using deep learning concepts. In Artificial Intelligence in Healthcare (eds. Garg, L., Basterrech, S., Banerjee, C., Sharma, T.K.) 35–46 https://doi.org/10.1007/978-981-16-6265-2_3 (Advanced Technologies and Societal Change, Springer, 2022).

Debats, O. A., Litjens, G. J. & Huisman, H. J. Lymph node detection in MR Lymphography: False positive reduction using multi-view convolutional neural networks. PeerJ 7 , e8052 (2019).


Madabhushi, A., Feldman, M., Metaxas, D., Chute, D., Tomaszeweski, J. Optimal feature combination for automated segmentation of prostatic adenocarcinoma from high resolution MRI. In Proceedings of the 25th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (IEEE Cat. No. 03CH37439) 614–617, Vol. 1. IEEE (2003).

Raschka, S. Model evaluation, model selection, and algorithm selection in machine learning. arXiv:1811.12808 (2018).

Li, Z., Liu, F., Yang, W., Peng, S. & Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. 33 (12), 6999 (2021).


Planche, B. & Andres, E. Hands-On Computer Vision with TensorFlow 2: Leverage Deep Learning to Create Powerful Image Processing Apps with TensorFlow 2.0 and Keras (Packt Publishing, 2019).

Santafe, G., Inza, I. & Lozano, J. A. Dealing with the evaluation of supervised classification algorithms. Artif. Intell. Rev. 44 , 467–508 (2015).

Tohka, J. & Van Gils, M. Evaluation of machine learning algorithms for health and wellness applications: a tutorial. Comput. Biol. Med. 132 , 104324 (2021).


Zhu, W., Zeng, N. & Wang, N. Sensitivity, specificity, accuracy, associated confidence interval and ROC analysis with practical SAS implementations. In NESUG proceedings: health care and life sciences, Baltimore, Maryland 67, vol. 19 (2010).

Dehmer, M. & Basak, S. C. Statistical and Machine Learning Approaches for Network Analysis (Wiley, 2012).

Šimundić, A. M. Measures of diagnostic accuracy: Basic definitions. EJIFCC 19 (4), 203–211 (2009).

Small Casler, K. & Gawlik, K. (eds) Laboratory Screening and Diagnostic Evaluation: An Evidence-Based Approach (Springer, 2022).

Cox, D. J. & Vladescu, J. C. Statistics for Applied Behavior Analysis Practitioners and Researchers (Academic Press, 2023).

Youden, W. J. Index for rating diagnostic tests. Cancer 3 (1), 32–35 (1950).


Emmert-Streib, F., Moutari, S. & Dehmer, M. Elements of Data Science, Machine Learning, and Artificial Intelligence Using R (Springer, 2023).

Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20 (1), 37–46 (1960).

Lantz, B. Machine Learning with R: Learn Techniques for Building and Improving Machine Learning Models, from Data Preparation to Model Tuning, Evaluation, and Working with Big Data (Packt Publishing, 2023).

Boughorbel, S., Jarray, F. & El-Anbari, M. Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE 12 (6), e0177678 (2017).

Pepe, M., Longton, G. & Janes, H. Estimation and comparison of receiver operating characteristic curves. Stata J. 9 , 1 (2009).

Martinez, M., & Stiefelhagen, R. Taming the cross entropy loss. In Pattern Recognition: 40th German Conference, GCPR 2018, Stuttgart, Germany, October 9-12, 2018, Proceedings 628–637, Vol. 40. Springer (2019).

Manning, C. & Schutze, H. Foundations of Statistical Natural Language Processing (MIT Press, 1999).

Tallón-Ballesteros, A. J., Riquelme, J. C. Data mining methods applied to a digital forensics task for supervised machine learning. In Computational Intelligence in Digital Forensics: Forensic Investigation and Applications 413–428 (2014).

Yilmaz, A. E. & Demirhan, H. Weighted kappa measures for ordinal multi-class classification performance. Appl. Soft Comput. 134 , 110020 (2023).

Zhang, M. L. & Zhou, Z. H. A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 26 (8), 1819–1837 (2013).

Xiao, C., Ye, J., Esteves, R. M. & Rong, C. Using Spearman’s correlation coefficients for exploratory data analysis on big dataset. Concurr. Comput. Pract. Exp. 28 , 3866–3878 (2016).

Dice, L. R. Measures of the amount of ecologic association between species. Ecology 26 (3), 297–302 (1945).

Sørensen, T. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. K. Dan. Vidensk. Selsk. 5 (4), 1–34 (1948).

Sarkar, M. & Sahoo, P. K. Intelligent image segmentation methods using deep convolutional neural network. In Biomedical Signal and Image Processing with Artificial Intelligence 309–335 (Springer, 2022).

Jaccard, P. The Distribution of the Flora in the Alpine Zone.1. New Phytol. 11 (2), 37–50 (1912).

Voiculescu, I., & Yeghiazaryan, V. (2015). An Overview of Current Evaluation Methods Used in Medical Image Segmentation .

Brunet, D., Vrscay, E. R. & Wang, Z. On the mathematical properties of the structural similarity index. IEEE Trans. Image Process. 21 (4), 1488–1499 (2011).


Cormack, G. V., & Lynam, T. R. Statistical precision of information retrieval evaluation. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 533–540 (2006).

Dupret, G. & Piwowarski, B. Model based comparison of discounted cumulative gain and average precision. J. Discrete Algorithms 18 , 49–62 (2013).

van Rossum, G. & Drake, F. L. Python 3 Reference Manual (CreateSpace, 2009).

R Core Team. R: A Language and Environment for Statistical Computing (R Foundation of Statistical Computing, 2021).

Jekel, J. F. Epidemiology, Biostatistics, and Preventive Medicine (Elsevier Health Sciences, 2007).

Virtanen, P. et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 17 (3), 261–272 (2020).


Lang, T. A. & Secic, M. How to Report Statistics in Medicine: Annotated Guidelines for Authors, Editors, and Reviewers (ACP Press, Berlin, 2006).

Corder, G. W. & Foreman, D. I. Nonparametric Statistics for Non-statisticians (Wiley, 2009).

Salzberg, S. L. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Min. Knowl. Discov. 1 , 317–328 (1997).

Iman, R. L. & Davenport, J. M. Approximations of the critical region of the Friedman statistic. Commun. Stat. 9 , 571–595 (1980).

Kim, S. & Lee, W. Does McNemar’s test compare the sensitivities and specificities of two diagnostic tests?. Stat. Methods Med. Res. 26 (1), 142–154 (2017).


Trajman, A. & Luiz, R. R. McNemar chi2 test revisited: Comparing sensitivity and specificity of diagnostic examinations. Scand. J. Clin. Lab Invest. 68 (1), 77–80 (2008).

Seabold, S., & Perktold, J. Statsmodels: Econometric and statistical modeling with python. In Proceedings of the 9th Python in Science Conference (2010).

DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 44 (3), 837–45 (1988).

Qin, G. & Hotilovac, L. Comparison of non-parametric confidence intervals for the area under the ROC curve of a continuous-scale diagnostic test. Stat. Methods Med. Res. 17 (2), 207–221 (2008).

Nakas, C. T., Bantis, L. E. & Gatsonis, C. A. ROC Analysis for Classification and Prediction in Practice (CRC Press, 2023).

Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform. 12 , 77 (2011).

Bethea, R. M., Duran, B. S. & Boullion, T. L. Statistical Methods for Engineers and Scientists (Taylor & Francis, 1995).

Shapiro, S. S. & Wilk, M. B. An analysis of variance test for normality (complete samples). Biometrika 52 (3–4), 591–611 (1965).

Bartlett, M. S. Properties of sufficiency and statistical tests. Proc. R. Stat. Soc. Ser. A 160 , 268–282 (1937).


Levene, H. Robust tests for equality of variances. In Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling (eds Olkin, I., Hotelling, H. et al. ) 278–292 (Stanford University Press, 1960).

Fox, J. & Weisberg, S. An R Companion to Applied Regression 3rd edn. (Sage, 2019).

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015).

Keras, C. F. GitHub (2015).

Rainio, O. et al. Carimas: An extensive medical imaging data processing tool for research. J. Digit. Imaging 36 (4), 1885 (2023).

Chowdhury, M. E. H. et al. Can AI help in screening Viral and COVID-19 pneumonia?. IEEE Access 2020 (8), 132665–132676 (2020).

Rahman, T. et al. Exploring the effect of image enhancement techniques on COVID-19 detection using chest X-ray images. Comput. Biol. Med. 132 , 104319 (2021).

Kermany, D. S. et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172 (5), 1122-1131.e9 (2018).

Rahman, T. et al. Reliable tuberculosis detection using chest X-ray with deep learning, segmentation and visualization. IEEE Access 8 , 191586–191601 (2020).

Hellström, H. et al. Classification of head and neck cancer from PET images using convolutional neural networks. Sci. Rep. 13 , 10528 (2023).

ADS   PubMed   PubMed Central   Google Scholar  

Liedes, J. et al. Automatic segmentation of head and neck cancer from PET-MRI data using deep learning. J. Med. Biol. Eng. https://doi.org/10.1007/s40846-023-00818-8 (2023).

Article   Google Scholar  

Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015. MICCAI 2015 Vol. 9351 (eds Navab, N. et al. ) 234–241 (Springer, 2015).

Rainio, O. et al. New method of using a convolutional neural network for 2D intraprostatic tumor segmentation from PET images. Res. Biomed. Eng. https://doi.org/10.1007/s42600-023-00314-7 (2023) ( to appear ).

Download references

Acknowledgements

We are grateful to the referees for their suggestions.

The first author was financially supported by the Finnish Cultural Foundation and Jenny and Antti Wihuri Foundation. The second author was supported by the Finnish Cultural Foundation (Maire and Aimo Mäkinen Foundation).

Author information

Authors and Affiliations

Turku PET Centre, University of Turku and Turku University Hospital, Turku, Finland

Oona Rainio, Jarmo Teuho & Riku Klén


Corresponding author

Correspondence to Oona Rainio.

Ethics declarations

Competing interests

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this Article was revised: The original version of this Article contained an error in an equation in the Different machine learning tasks section, under the subheading ‘Multi-class classification’. Full information regarding the correction made can be found in the correction for this Article.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Rainio, O., Teuho, J. & Klén, R. Evaluation metrics and statistical tests for machine learning. Sci Rep 14 , 6086 (2024). https://doi.org/10.1038/s41598-024-56706-x


Received : 13 December 2023

Accepted : 09 March 2024

Published : 13 March 2024

DOI : https://doi.org/10.1038/s41598-024-56706-x


Keywords

  • Evaluation metrics
  • Machine learning
  • Medical images
  • Statistical testing



Study design and choosing a statistical test


Population vs sample

In most cases, it’s too difficult or expensive to collect data from every member of the population you’re interested in studying. Instead, you’ll collect data from a sample.

Statistical analysis allows you to apply your findings beyond your own sample as long as you use appropriate sampling procedures . You should aim for a sample that is representative of the population.

Sampling for statistical analysis

There are two main approaches to selecting a sample.

  • Probability sampling: every member of the population has a chance of being selected for the study through random selection.
  • Non-probability sampling: some members of the population are more likely than others to be selected for the study because of criteria such as convenience or voluntary self-selection.

In theory, for highly generalizable findings, you should use a probability sampling method. Random selection reduces several types of research bias , like sampling bias , and ensures that data from your sample is actually typical of the population. Parametric tests can be used to make strong statistical inferences when data are collected using probability sampling.

But in practice, it's rarely possible to gather the ideal sample. While non-probability samples are more likely to be at risk of biases like self-selection bias, they are much easier to recruit and collect data from. Non-parametric tests are more appropriate for non-probability samples, but they result in weaker inferences about the population.

If you want to use parametric tests for non-probability samples, you have to make the case that:

  • your sample is representative of the population you’re generalizing your findings to.
  • your sample lacks systematic bias.

Keep in mind that external validity means that you can only generalize your conclusions to others who share the characteristics of your sample. For instance, results from Western, Educated, Industrialized, Rich and Democratic samples (e.g., college students in the US) aren’t automatically applicable to all non-WEIRD populations.

If you apply parametric tests to data from non-probability samples, be sure to elaborate on the limitations of how far your results can be generalized in your discussion section .

Create an appropriate sampling procedure

Based on the resources available for your research, decide on how you’ll recruit participants.

  • Will you have resources to advertise your study widely, including outside of your university setting?
  • Will you have the means to recruit a diverse sample that represents a broad population?
  • Do you have time to contact and follow up with members of hard-to-reach groups?

Your participants are self-selected by their schools. Although you're using a non-probability sample, you aim for a diverse and representative sample.

Example: Sampling (correlational study). Your main population of interest is male college students in the US. Using social media advertising, you recruit senior-year male college students from a smaller subpopulation: seven universities in the Boston area.

Calculate sufficient sample size

Before recruiting participants, decide on your sample size either by looking at other studies in your field or by using statistics. A sample that's too small may be unrepresentative of the population, while a sample that's too large will be more costly than necessary.

There are many sample size calculators online. Different formulas are used depending on whether you have subgroups or how rigorous your study should be (e.g., in clinical research). As a rule of thumb, a minimum of 30 units or more per subgroup is necessary.

To use these calculators, you have to understand and input these key components:

  • Significance level (alpha): the risk of rejecting a true null hypothesis that you are willing to take, usually set at 5%.
  • Statistical power : the probability of your study detecting an effect of a certain size if there is one, usually 80% or higher.
  • Expected effect size : a standardized indication of how large the expected result of your study will be, usually based on other similar studies.
  • Population standard deviation: an estimate of the population parameter based on a previous study or a pilot study of your own.
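Where a quick calculation helps, the sketch below shows how these inputs could feed a sample size calculation in Python using statsmodels' power module; the effect size, alpha, and power values are illustrative assumptions rather than recommendations.

```python
# Hypothetical sketch: a priori sample size for a two-sample t test using
# statsmodels' power analysis. Effect size, alpha, and power are assumed values.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,        # expected standardized effect size (Cohen's d), assumed
    alpha=0.05,             # significance level
    power=0.80,             # desired statistical power
    ratio=1.0,              # equal group sizes
    alternative="two-sided",
)
print(f"Required sample size per group: {n_per_group:.1f}")
```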

Once you’ve collected all of your data, you can inspect them and calculate descriptive statistics that summarize them.

Inspect your data

There are various ways to inspect your data, including the following:

  • Organizing data from each variable in frequency distribution tables .
  • Displaying data from a key variable in a bar chart to view the distribution of responses.
  • Visualizing the relationship between two variables using a scatter plot .

By visualizing your data in tables and graphs, you can assess whether your data follow a skewed or normal distribution and whether there are any outliers or missing data.

A normal distribution means that your data are symmetrically distributed around a center where most values lie, with the values tapering off at the tail ends.

[Figure: mean, median, mode, and standard deviation in a normal distribution]

In contrast, a skewed distribution is asymmetric and has more values on one end than the other. The shape of the distribution is important to keep in mind because only some descriptive statistics should be used with skewed distributions.

Extreme outliers can also produce misleading statistics, so you may need a systematic approach to dealing with these values.

Calculate measures of central tendency

Measures of central tendency describe where most of the values in a data set lie. Three main measures of central tendency are often reported:

  • Mode : the most popular response or value in the data set.
  • Median : the value in the exact middle of the data set when ordered from low to high.
  • Mean : the sum of all values divided by the number of values.

However, depending on the shape of the distribution and level of measurement, only one or two of these measures may be appropriate. For example, many demographic characteristics can only be described using the mode or proportions, while a variable like reaction time may not have a mode at all.

Calculate measures of variability

Measures of variability tell you how spread out the values in a data set are. Four main measures of variability are often reported:

  • Range : the highest value minus the lowest value of the data set.
  • Interquartile range : the range of the middle half of the data set.
  • Standard deviation : the average distance between each value in your data set and the mean.
  • Variance : the square of the standard deviation.

Once again, the shape of the distribution and level of measurement should guide your choice of variability statistics. The interquartile range is the best measure for skewed distributions, while standard deviation and variance provide the best information for normal distributions.
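As a rough illustration, the following Python sketch computes the measures listed above with NumPy and SciPy on made-up values; note that the keepdims argument of scipy.stats.mode assumes a recent SciPy version.

```python
# Illustrative sketch: central tendency and variability for a made-up data set.
import numpy as np
from scipy import stats

scores = np.array([55, 61, 68, 68, 70, 72, 75, 79, 83, 90])

print("mean:", np.mean(scores))
print("median:", np.median(scores))
print("mode:", stats.mode(scores, keepdims=False).mode)   # keepdims needs SciPy >= 1.9
print("range:", np.ptp(scores))                           # max minus min
print("IQR:", stats.iqr(scores))                          # interquartile range
print("sample SD:", np.std(scores, ddof=1))
print("sample variance:", np.var(scores, ddof=1))
```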

Using your table, you should check whether the units of the descriptive statistics are comparable for pretest and posttest scores. For example, are the variance levels similar across the groups? Are there any extreme values? If there are, you may need to identify and remove extreme outliers in your data set or transform your data before performing a statistical test.

                     Pretest scores   Posttest scores
Mean                 68.44            75.25
Standard deviation   9.43             9.88
Variance             88.96            97.96
Range                36.25            45.12
Sample size (n)      30               30

From this table, we can see that the mean score increased after the meditation exercise, and the variances of the two scores are comparable. Next, we can perform a statistical test to find out if this improvement in test scores is statistically significant in the population.

Example: Descriptive statistics (correlational study). After collecting data from 653 students, you tabulate descriptive statistics for annual parental income and GPA.

It’s important to check whether you have a broad range of data points. If you don’t, your data may be skewed towards some groups more than others (e.g., high academic achievers), and only limited inferences can be made about a relationship.

                     Parental income (USD)   GPA
Mean                 62,100                  3.12
Standard deviation   15,000                  0.45
Variance             225,000,000             0.16
Range                8,000–378,000           2.64–4.00
Sample size (n)      653                     653

A number that describes a sample is called a statistic , while a number describing a population is called a parameter . Using inferential statistics , you can make conclusions about population parameters based on sample statistics.

Researchers often use two main methods (simultaneously) to make inferences in statistics.

  • Estimation: calculating population parameters based on sample statistics.
  • Hypothesis testing: a formal process for testing research predictions about the population using samples.

You can make two types of estimates of population parameters from sample statistics:

  • A point estimate : a value that represents your best guess of the exact parameter.
  • An interval estimate : a range of values that represent your best guess of where the parameter lies.

If your aim is to infer and report population characteristics from sample data, it’s best to use both point and interval estimates in your paper.

You can consider a sample statistic a point estimate for the population parameter when you have a representative sample (e.g., in a wide public opinion poll, the proportion of a sample that supports the current government is taken as the population proportion of government supporters).

There’s always error involved in estimation, so you should also provide a confidence interval as an interval estimate to show the variability around a point estimate.

A confidence interval uses the standard error and the z score from the standard normal distribution to convey where you’d generally expect to find the population parameter most of the time.
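A minimal Python sketch of this idea is given below, assuming a made-up sample; it computes a normal-approximation interval from the standard error and, because the sample is small, a t-based interval as well.

```python
# Illustrative sketch: 95% confidence interval for a mean from made-up data.
import numpy as np
from scipy import stats

sample = np.array([72, 75, 69, 80, 77, 74, 71, 78, 76, 73])
mean = sample.mean()
se = stats.sem(sample)                       # standard error of the mean

z = stats.norm.ppf(0.975)                    # z score for a 95% interval
print(f"z-based 95% CI: ({mean - z * se:.2f}, {mean + z * se:.2f})")

# With a small sample, the t distribution is usually preferred:
lo, hi = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=se)
print(f"t-based 95% CI: ({lo:.2f}, {hi:.2f})")
```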

Hypothesis testing

Using data from a sample, you can test hypotheses about relationships between variables in the population. Hypothesis testing starts with the assumption that the null hypothesis is true in the population, and you use statistical tests to assess whether the null hypothesis can be rejected or not.

Statistical tests determine where your sample data would lie on an expected distribution of sample data if the null hypothesis were true. These tests give two main outputs:

  • A test statistic tells you how much your data differs from the null hypothesis of the test.
  • A p value tells you the likelihood of obtaining your results if the null hypothesis is actually true in the population.

Statistical tests come in three main varieties:

  • Comparison tests assess group differences in outcomes.
  • Regression tests assess cause-and-effect relationships between variables.
  • Correlation tests assess relationships between variables without assuming causation.

Your choice of statistical test depends on your research questions, research design, sampling method, and data characteristics.

Parametric tests

Parametric tests make powerful inferences about the population based on sample data. But to use them, some assumptions must be met, and only some types of variables can be used. If your data violate these assumptions, you can perform appropriate data transformations or use alternative non-parametric tests instead.

A regression models the extent to which changes in a predictor variable result in changes in outcome variable(s).

  • A simple linear regression includes one predictor variable and one outcome variable.
  • A multiple linear regression includes two or more predictor variables and one outcome variable.

Comparison tests usually compare the means of groups. These may be the means of different groups within a sample (e.g., a treatment and control group), the means of one sample group taken at different times (e.g., pretest and posttest scores), or a sample mean and a population mean.

  • A t test is for exactly 1 or 2 groups when the sample is small (30 or less).
  • A z test is for exactly 1 or 2 groups when the sample is large.
  • An ANOVA is for 3 or more groups.

The z and t tests have subtypes based on the number and types of samples and the hypotheses:

  • If you have only one sample that you want to compare to a population mean, use a one-sample test .
  • If you have paired measurements (within-subjects design), use a dependent (paired) samples test .
  • If you have completely separate measurements from two unmatched groups (between-subjects design), use an independent (unpaired) samples test .
  • If you expect a difference between groups in a specific direction, use a one-tailed test .
  • If you don’t have any expectations for the direction of a difference between groups, use a two-tailed test .

The only parametric correlation test is Pearson’s r . The correlation coefficient ( r ) tells you the strength of a linear relationship between two quantitative variables.

However, to test whether the correlation in the sample is strong enough to be important in the population, you also need to perform a significance test of the correlation coefficient, usually a t test, to obtain a p value. This test uses your sample size to calculate how much the correlation coefficient differs from zero in the population.

You use a dependent-samples, one-tailed t test to assess whether the meditation exercise significantly improved math test scores. The test gives you:

  • a t value (test statistic) of 3.00
  • a p value of 0.0028
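As an illustration of how such a test could be run, the sketch below uses SciPy's paired t test on invented pretest/posttest scores; the data are not from any actual study, and the one-tailed alternative= option assumes SciPy 1.6 or later.

```python
# Illustrative sketch: dependent-samples (paired), one-tailed t test on made-up scores.
import numpy as np
from scipy import stats

pretest  = np.array([64, 70, 68, 72, 61, 75, 66, 69, 71, 67])
posttest = np.array([70, 76, 75, 78, 68, 80, 73, 74, 79, 72])

# alternative="greater" tests whether posttest scores exceed pretest scores.
t_stat, p_value = stats.ttest_rel(posttest, pretest, alternative="greater")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```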

Although Pearson’s r is a test statistic, it doesn’t tell you anything about how significant the correlation is in the population. You also need to test whether this sample correlation coefficient is large enough to demonstrate a correlation in the population.

A t test can also determine how significantly a correlation coefficient differs from zero based on sample size. Since you expect a positive correlation between parental income and GPA, you use a one-sample, one-tailed t test. The t test gives you:

  • a t value of 3.08
  • a p value of 0.001
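In practice this does not have to be coded by hand: scipy.stats.pearsonr returns the correlation coefficient together with its p value. The sketch below uses invented income and GPA values, and the directional alternative= option assumes SciPy 1.7 or later.

```python
# Illustrative sketch: significance test of Pearson's r on made-up data.
import numpy as np
from scipy import stats

income = np.array([40, 52, 61, 48, 75, 66, 58, 83, 45, 70])           # thousands of USD
gpa    = np.array([2.8, 3.1, 3.3, 2.9, 3.6, 3.4, 3.2, 3.8, 2.7, 3.5])

r, p_value = stats.pearsonr(income, gpa, alternative="greater")        # one-tailed
print(f"r = {r:.2f}, p = {p_value:.4f}")
```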


The final step of statistical analysis is interpreting your results.

Statistical significance

In hypothesis testing, statistical significance is the main criterion for forming conclusions. You compare your p value to a set significance level (usually 0.05) to decide whether your results are statistically significant or non-significant.

Statistically significant results are considered unlikely to have arisen solely due to chance. There is only a very low chance of such a result occurring if the null hypothesis is true in the population.

This means that you believe the meditation intervention, rather than random factors, directly caused the increase in test scores.

Example: Interpret your results (correlational study). You compare your p value of 0.001 to your significance threshold of 0.05. With a p value under this threshold, you can reject the null hypothesis. This indicates a statistically significant correlation between parental income and GPA in male college students.

Note that correlation doesn’t always mean causation, because there are often many underlying factors contributing to a complex variable like GPA. Even if one variable is related to another, this may be because of a third variable influencing both of them, or indirect links between the two variables.

Effect size

A statistically significant result doesn’t necessarily mean that there are important real life applications or clinical outcomes for a finding.

In contrast, the effect size indicates the practical significance of your results. It’s important to report effect sizes along with your inferential statistics for a complete picture of your results. You should also report interval estimates of effect sizes if you’re writing an APA style paper .

With a Cohen’s d of 0.72, there’s medium to high practical significance to your finding that the meditation exercise improved test scores. Example: Effect size (correlational study) To determine the effect size of the correlation coefficient, you compare your Pearson’s r value to Cohen’s effect size criteria.

Decision errors

Type I and Type II errors are mistakes made in research conclusions. A Type I error means rejecting the null hypothesis when it’s actually true, while a Type II error means failing to reject the null hypothesis when it’s false.

You can aim to minimize the risk of these errors by selecting an optimal significance level and ensuring high power . However, there’s a trade-off between the two errors, so a fine balance is necessary.

Frequentist versus Bayesian statistics

Traditionally, frequentist statistics emphasizes null hypothesis significance testing and always starts with the assumption of a true null hypothesis.

However, Bayesian statistics has grown in popularity as an alternative approach in the last few decades. In this approach, you use previous research to continually update your hypotheses based on your expectations and observations.

Bayes factor compares the relative strength of evidence for the null versus the alternative hypothesis rather than making a conclusion about rejecting the null hypothesis or not.



How to choose and interpret a statistical test? An update for budding researchers

Najmi, Ahmad; Sadasivam, Balakrishnan; Ray, Avik

Department of Pharmacology, All India Institute of Medical Sciences Bhopal, Bhopal, Madhya Pradesh, India

Address for correspondence: Dr. Avik Ray, Department of Pharmacology, All India Institute of Medical Sciences Bhopal, Saket Nagar, Bhopal - 462 020, Madhya Pradesh, India. E-mail: [email protected]

Received March 03, 2021

Received in revised form March 29, 2021

Accepted May 12, 2021

Postgraduate medical students are often unable to select appropriate statistical tests or to interpret their findings during their thesis or research projects. To select an appropriate test, researchers need to determine the objectives of the study, the types of variables, the type of analysis and study design, the number of groups and data sets, and the type of distribution. In this review, we summarize and explain various statistical tests to help postgraduate medical students select the most appropriate techniques for their thesis and dissertation.

Introduction

Postgraduate medical students are often confused about the selection and interpretation of statistical tests during their thesis or research projects. Selecting a statistical test is not rocket science; it rests on a few pieces of basic information: the objectives of the study, the type of variables, the type of analysis, the type of study design, the number of groups and data sets, and the type of distribution. In the present article, we discuss the selection and interpretation of statistical tests.

Types of statistical test

Statistical tests can be broadly classified as parametric[ 1 ] and nonparametric tests. A parametric test is applied when the data are normally distributed and not skewed. A normal distribution[ 2 3 ] is characterized by a smooth, bell-shaped, symmetrical curve: ±1 standard deviation (SD) covers about 68% and ±2 SD covers about 95% of the values in the distribution. It is generally preferable to use a parametric test, as these tests are more powerful. Sometimes the data do not follow a normal distribution and are skewed. In such scenarios, a data transformation technique[ 4 ] may be applied to convert skewed data into approximately normal data. Only when such a transformation is not possible should nonparametric tests be used. Parametric tests use parameters such as the mean, SD, and standard error of the mean for analysis. Lists of various parametric and nonparametric tests are given in Table 1.


Parametric tests

Student's t-test

This is a parametric test described by W. S. Gosset.[ 5 ] He chose the pseudonym “Student” because his employer did not allow its scientists to publish confidential data; the test is therefore known as Student's t-test. It is used to compare two means and is suitable for small samples (n < 30). The paired t-test is used when one group serves as its own control, e.g., to compare blood sugar before and after the administration of a drug. The unpaired t-test is used to compare the means of two independent groups, e.g., to compare the blood sugar of two separate groups of patients. The data should be quantitative and normally distributed, and the group SDs[ 6 ] should be similar, i.e., the SD of one group should not be more than twice that of the other.
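A short sketch of both forms of the test in Python is given below; the blood sugar values are invented for illustration.

```python
# Illustrative sketch: paired and unpaired t tests with SciPy on made-up values (mg/dL).
import numpy as np
from scipy import stats

# Paired t test: the same volunteers before and after a drug.
before = np.array([145, 160, 152, 138, 170, 149, 156, 161, 143, 158])
after  = np.array([132, 151, 140, 130, 158, 141, 147, 150, 135, 149])
print(stats.ttest_rel(before, after))

# Unpaired t test: two independent groups.
group_a = np.array([150, 162, 148, 171, 155, 160, 144, 166])
group_b = np.array([138, 149, 142, 151, 136, 147, 140, 145])
print(stats.ttest_ind(group_a, group_b))   # equal variances assumed by default
```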

Analysis of variance (ANOVA) test

This test is used to compare the means of three or more groups.[ 7 ] The data should be normally distributed. One-way ANOVA is used when the groups to be compared are defined by a single factor, whereas repeated-measures ANOVA is used when the same subjects are measured under several conditions or at several time points. For example, if we want to evaluate the effect of three different antihypertensive drugs in three different groups of human volunteers, we use an ANOVA to test for any significant difference between the groups. ANOVA does not indicate which group differs significantly from the others; post hoc tests,[ 8 ] such as the Bonferroni, Dunnett's, or Tukey's test, should be used to examine individual group differences.
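The sketch below illustrates this workflow in Python with invented data: a one-way ANOVA with SciPy followed by Tukey's post hoc comparisons from statsmodels.

```python
# Illustrative sketch: one-way ANOVA across three made-up groups, then Tukey's HSD.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

drug_a = np.array([12, 15, 14, 10, 13, 16])
drug_b = np.array([18, 20, 17, 19, 21, 18])
drug_c = np.array([11, 9, 12, 10, 13, 11])

f_stat, p_value = stats.f_oneway(drug_a, drug_b, drug_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# Post hoc test to locate which pairs of groups differ.
values = np.concatenate([drug_a, drug_b, drug_c])
groups = ["A"] * len(drug_a) + ["B"] * len(drug_b) + ["C"] * len(drug_c)
print(pairwise_tukeyhsd(values, groups))
```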

Correlation coefficient test

This parametric test is used to assess the linear relationship[ 9 ] between two variables. For example, if we want to know whether body weight and blood pressure are linearly related, a correlation test is used. Correlation only shows an association between two variables; it does not show causation. A scatter plot can be used to visualize the correlation between two variables. Pearson's correlation coefficient is used for continuous, normally distributed variables, and Spearman's rank correlation coefficient is used for ordinal or non-normally distributed variables.
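As a rough illustration, the sketch below computes both coefficients with SciPy on invented data: Pearson's r for two continuous measurements and Spearman's rho for an ordinal score.

```python
# Illustrative sketch: Pearson's and Spearman's correlation on made-up data.
import numpy as np
from scipy import stats

weight      = np.array([58, 64, 70, 75, 80, 85, 90, 95])              # kg
systolic_bp = np.array([110, 115, 118, 122, 127, 130, 134, 139])      # mmHg
print(stats.pearsonr(weight, systolic_bp))     # continuous, roughly normal data

pain_score = np.array([2, 3, 5, 4, 7, 6, 8, 9])                       # ordinal 0-10 scale
dose_level = np.array([1, 1, 2, 2, 3, 3, 4, 4])
print(stats.spearmanr(pain_score, dose_level)) # ranked/ordinal data
```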

Regression test

This parametric test is used to model the dependence[ 10 ] of one variable on another: we can predict the value of the dependent variable from the value of the independent variable. For example, if we fit a curve relating time and the plasma concentration of a drug, we can predict the drug concentration at a particular time from the time-concentration curve. Here, time is the independent variable and plasma concentration is the dependent variable; the dependent variable is plotted on the y-axis and the independent variable on the x-axis.
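A minimal sketch of a simple linear regression in Python follows; the time-concentration values are invented, and the relationship is treated as linear purely for illustration (real concentration-time curves are usually nonlinear).

```python
# Illustrative sketch: simple linear regression with scipy.stats.linregress.
import numpy as np
from scipy import stats

time_h = np.array([0.5, 1, 2, 4, 6, 8, 12])                # independent variable (x)
conc   = np.array([8.2, 7.1, 5.6, 3.4, 2.1, 1.3, 0.5])     # dependent variable (y)

fit = stats.linregress(time_h, conc)
print(f"slope = {fit.slope:.3f}, intercept = {fit.intercept:.3f}, r^2 = {fit.rvalue**2:.3f}")

# Predict the dependent variable at a new value of the independent variable.
print("predicted concentration at 3 h:", fit.intercept + fit.slope * 3)
```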

Nonparametric test

These tests are used when the data are not normally distributed (skewed).[ 11 ] Such data are usually summarized as medians; ranks and scores (e.g., the Apgar score and the visual analogue scale) do not follow a normal distribution and are therefore summarized as medians. The main nonparametric tests are listed below, followed by a short code sketch.

  • Wilcoxon signed-rank test and Mann–Whitney U test: the nonparametric counterparts of the paired and unpaired t-test, respectively.
  • Kruskal–Wallis test: the nonparametric counterpart of one-way ANOVA.
  • Friedman's test: the nonparametric counterpart of repeated-measures ANOVA.
  • Spearman's rank correlation: the nonparametric counterpart of Pearson's correlation test.
  • Chi-square test: used for binomial or dichotomous data summarized as percentages or proportions, e.g., to compare the proportions of death and survival in vaccinated and non-vaccinated children with respiratory tract infections. There is no parametric counterpart of the chi-square test.
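The sketch below shows how each of these tests could be called in SciPy on invented data; it is only a usage illustration, not an analysis plan.

```python
# Illustrative sketch: common nonparametric tests in SciPy on made-up scores.
import numpy as np
from scipy import stats

g1 = np.array([3, 5, 4, 6, 2, 5, 4])
g2 = np.array([6, 7, 5, 8, 7, 6, 9])
g3 = np.array([2, 3, 1, 4, 2, 3, 2])

print(stats.mannwhitneyu(g1, g2))            # unpaired two-group comparison
print(stats.wilcoxon(g1, g2))                # paired two-group comparison
print(stats.kruskal(g1, g2, g3))             # three or more independent groups
print(stats.friedmanchisquare(g1, g2, g3))   # three or more related measurements
print(stats.spearmanr(g1, g2))               # rank correlation

# Chi-square test on a 2x2 table of counts (e.g., death/survival by vaccination).
table = np.array([[12, 88],
                  [30, 70]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```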

Type of variable/data

Variables or data may be numerical or categorical.[ 12 13 ] Numerical data may be continuous or discrete: examples of continuous data are blood sugar, blood pressure, weight, and height, while examples of discrete data are the number of members in a family, the number of persons who attended the outpatient department, or the number of persons experiencing nausea. Categorical or qualitative data may be nominal or ordinal: nominal data are identified by attributes or names, such as eye colour or religion, whereas ordinal data can be arranged in a meaningful order, such as the stage of cancer or disease severity graded as mild, moderate, or severe. Data can be summarized as a mean, a median, or a proportion. Continuous numerical data often follow a normal distribution and can be summarized as means; discrete numerical data often follow a nonnormal distribution and can be summarized as medians. Dichotomous or binomial data[ 14 ] have only two outcomes, such as yes or no, or male or female, and can be summarized as proportions.

Types of analysis

In statistical terms, analysis may be a comparative analysis, a correlation analysis, or a regression analysis.[ 15 ] Comparative analysis is characterized by comparison of mean or median between groups. Suppose we want to know the relation between two variables, for example, body weight and blood sugar. In such a case, correlation analysis will be used. If we want to predict the value of a second variable based on information about a first variable, regression analysis will be used. For example, if we know the values of body weight and we want to predict the blood sugar of a patient, regression analysis will be used.

Types of study design

In epidemiological studies, there are various types of study design, such as case-control, cohort, and cross-sectional designs. In statistics, however, there are only two: the paired or matched[ 16 ] design and the unpaired or independent design. In a paired design, the same group serves as its own control. For example, suppose we want to evaluate the effect of a new drug on blood pressure in a group of 10 healthy volunteers: if we compare blood pressure in the same 10 individuals before and after the intervention, this is a paired or matched design, whereas comparing blood pressure in two entirely different groups is an unpaired or independent design.

Number of groups and data sets

There may be a single group but multiple data sets.[ 17 ] For example, if we want to evaluate the effect of a new drug on the heart rate of a single group of individuals, there may be multiple data sets if we record the heart rate at various time intervals. There may be two groups or two data sets, or more than two groups and more than two data sets. A different statistical test is applied in each situation.

Types of distribution

Data can be summarized as means if the variable follows normal distribution. Most of the bodily parameters[ 8 ] like heart rate, blood pressure, blood sugar, serum cholesterol, height, and weight follow normal distribution. Numerical continuous data follows normal distribution and can be summarized as means. Numerical discrete data often follows nonnormal distribution and can be summarized as median. Ranks or scores do not follow normal distribution and can be summarized as median.[ 18 ] Examples are Apgar score and visual analogue scale for pain measurement. Dichotomous data can be summarized as proportions.[ 17 ] There are many statistical tests which are based on the assumption that the data follows normal distribution.

For example, suppose that, as an investigator, you want to evaluate the melanizing action of three different topical preparations in three different groups of vitiligo patients (10 in each group). The three groups of patients each apply one of the topical preparations and the effect is measured as a score (0–5, where 0 = no melanizing action and 5 = excellent melanizing action). What is the most appropriate statistical test?

From the above study, following points can be noted:

Objective: Evaluation of melanizing action of three different topical preparations in three different groups of vitiligo patients.

Type of data—scores summarized as median

Type of distribution—Nonnormal

No. of groups—3

Study design—Unpaired

Type of analysis—Comparison

According to Table 2 , the row that matches criteria no. 1 and 2 is row number 2 and the column that matches the criteria no. 3, 4, and 5 is column number 4. The cell where column 4 and row 2 meet indicates Kruskal–Wallis test.


Please note that the list of tests is not comprehensive. It is a simplified table only to crudely demonstrate how to select a test for statistical analysis of data.
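As a rough illustration of the example above, the sketch below applies the Kruskal–Wallis test to three invented sets of melanizing-action scores.

```python
# Illustrative sketch: Kruskal-Wallis test on made-up scores (0-5) from three groups.
from scipy import stats

prep_a = [2, 3, 3, 4, 2, 3, 4, 3, 2, 3]
prep_b = [4, 5, 4, 4, 5, 3, 4, 5, 4, 4]
prep_c = [1, 2, 2, 1, 3, 2, 1, 2, 2, 1]

h_stat, p_value = stats.kruskal(prep_a, prep_b, prep_c)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```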

Interpretation of P value

The results provided by inferential statistics are valid only if the selection of subjects, the methods, and the data collection are correct. For example, if we use a t-test on highly skewed data, the results will be invalid. During their thesis work, postgraduate medical students are often most concerned about the P (probability) value.[ 19 ] By convention, when the P value is <0.01, the difference between groups is considered highly significant; when the P value is >0.01 but <0.05, the difference is considered just significant. If the P value is >0.05, one should not immediately declare the result NOT significant; one should first consider the power of the study. The power of a study is its ability to detect a difference when one truly exists, so power should be calculated especially when the P value is >0.05. Power may be low for various reasons, such as a small sample size, high dropout rates, or noncompliance. If the power is <80% and P > 0.05, it is more accurate to conclude that the study did not have enough power to detect the difference.

It is not an absolute rule that a difference between two groups is considered significant only when the P value is <0.05; the 5% level is merely a convention and can be set at 1, 2, or 10% depending on the study. The P value also depends on the variance of the data: for a given difference and sample size, a smaller variance yields a smaller P value.

There are some situations in which clinical significance overrides statistical significance. Suppose an investigator wants to evaluate a new drug for rabies, he administers the new drug in 10 patients of rabies. In the second group of rabies patients ( n = 10), standard treatment was given. Two patients survived in the first group and none survived in the second group. The statistical test showed that difference was not significant. What will you do in such situation? Will you dump the study as the results were not significant or evaluate this drug further? We all know that rabies is 100% fatal disease. It would be a miracle even if a single patient survived by a new drug. Therefore, the conclusion should be based on clinical knowledge and experience rather than statistics alone.

Suppose a new antidiabetic drug lowers mean fasting blood sugar by 2 mg% and the statistical test concludes that the result is highly significant (P < 0.01). This raises an important question: should any physician recommend to patients with diabetes a new drug that lowers mean fasting blood glucose by just 2 mg%? The difference between groups is statistically significant, but it is not at all clinically significant, so in practice this new drug adds nothing meaningful to the armoury of medicine against diabetes.

Interpretations of confidence interval

Suppose the mean systolic blood pressure in a sample population is 110 mmHg, and we want to know the population systolic blood pressure mean. Although the exact value cannot be obtained, a range can be calculated within which the true population mean lies. This range is called confidence interval[ 20 ] and is calculated using the sample mean and the standard error (SE). The mean ±1SE and mean ±2 SE will give approximately 68 and 95% confidence interval, respectively. The endpoints of the confidence interval are known as confidence limits. Confidence interval is always mentioned with a particular degree of certainty, e.g. 95%. This is called confidence level and is expressed as percentage. The confidence level which is commonly used is 95%, but 90 and 99% confidence levels can also be calculated.

Confidence intervals should also be mentioned along with P value, especially in case of nonsignificant results. Confidence interval indicates the range of likely values of sample means in a population. When two groups are compared, the likely values of difference in means of two population under study can be calculated. For example, the difference in the height of two groups (Asian and European) can be found out and confidence interval for the difference in height can be calculated. The 95% confidence interval for the difference can be calculated using the formula (if t -test has been chosen):

Upper limit = mean difference + (t0.05 × SEdiff)

Lower limit = mean difference − (t0.05 × SEdiff)

Note that the above formula is used to calculate the confidence interval for the difference between group means and not for the individual means. If the 95% confidence interval includes zero, the difference is not statistically significant at the 5% significance level. If the P value tells us about a statistically significant difference, why do we also need to mention the confidence interval? Because the confidence interval indicates the precision of the estimate through its range: a narrow range means a more precise estimate, and a broad range a less precise one.
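A small sketch of this calculation in Python is shown below, assuming invented height data and equal variances; it reproduces the pooled-variance formula for the standard error of the difference.

```python
# Illustrative sketch: 95% CI for the difference between two group means (made-up data).
import numpy as np
from scipy import stats

group1 = np.array([168, 172, 165, 170, 174, 169, 171, 167])
group2 = np.array([175, 178, 172, 180, 176, 174, 179, 177])

n1, n2 = len(group1), len(group2)
diff = group2.mean() - group1.mean()

pooled_var = ((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2)
se_diff = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)          # two-sided 5% critical value

print(f"difference = {diff:.2f}")
print(f"95% CI: ({diff - t_crit * se_diff:.2f}, {diff + t_crit * se_diff:.2f})")
```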

Relevance for primary care physicians

In this era of evidence-based medicine, having an in-depth knowledge of biostatistics to analyze health and biomedical research data is of utmost importance. The practice of primary care comes with the privilege of encountering a variety of diseases, both acute and chronic, which comes with their own unique set of statistical parameters, interpretations, and challenges. Choosing a statistical test for significance testing becomes critical if someone wants to analyze and compare the patient characteristics and relevant variables for both internal reporting in institutional assessments and for disseminating their findings to the world in the form of publications. This review would be a quick guide for all primary care physicians to choose the most appropriate statistical test pertaining to their data set and come up with important inferences and propositions.

Although it is difficult to know the details of every statistical test, a biomedical researcher must have a basic knowledge of inferential statistics. Selecting the wrong statistical test can lead to false conclusions and compromise the quality of the research, and a wrong interpretation will likewise lead to a wrong conclusion. Researchers should have a clear idea of the variable types they are dealing with, their respective distributions, and the tests they need to apply to analyze the data. Both the P value and the confidence interval should be reported for precise results. One may consult standard statistics textbooks and software tools[ 21 ] for statistical analysis; various online and offline packages such as SPSS, Minitab, RStudio, and GraphPad Prism ease the process of data analysis.

Financial support and sponsorship

Conflicts of interest

There are no conflicts of interest.


Keywords: Clinical research; biostatistics; primary care physicians; statistical test



Statistical notes for clinical researchers: the independent samples t -test

Hae-Young Kim

Department of Health Policy and Management, College of Health Science, and Department of Public Health Science, Graduate School, Korea University, Seoul, Korea.

The t-test is frequently used to compare 2 group means. The compared groups may be independent of each other, such as men and women, or the compared data may be correlated, as when blood pressure levels from the same person are compared before and after medication (Figure 1). In this section we focus on the independent t-test only. There are 2 kinds of independent t-test, depending on whether the 2 group variances can be assumed equal or not. The t-test is based on inference using the t-distribution.

[Figure 1]

T-DISTRIBUTION

The t-distribution was introduced in 1908 by William Sealy Gosset, who was working for the Guinness brewery in Dublin, Ireland. As the brewery did not permit its employees to publish research results related to their work, Gosset published his findings under the pseudonym “Student,” and the distribution he suggested is therefore called Student's t-distribution. The t-distribution is similar to the standard normal distribution, the z-distribution, but has a lower peak and heavier tails (Figure 2).

[Figure 2]

According to sampling theory, when samples are drawn from a normally distributed population, the distribution of sample means is expected to be a normal distribution. When we know the population variance, σ², we can define the distribution of sample means as a normal distribution and adopt the z-distribution in statistical inference. In reality, however, we generally never know σ², so we use the sample variance, s², instead. Although s² is the best estimator of σ², its accuracy depends on the sample size. When the sample size is large enough (e.g., n = 300), we expect the sample variance to be very similar to the population variance. However, when the sample size is small, such as n = 10, the accuracy of the sample variance may not be that high. The t-distribution reflects this difference in uncertainty according to sample size. Therefore, the shape of the t-distribution changes with the degrees of freedom (df), which is the sample size minus one (n − 1) when one sample mean is tested.

The t-distribution is a family of distributions whose shape varies according to the df (Figure 2). When the df is smaller, the t-distribution has a lower peak and heavier tails than those with a higher df. The shape of the t-distribution approaches the z-distribution as the df increases; when the df is large enough, e.g., n = 300, the t-distribution is almost identical to the z-distribution. For inference about means using small samples, it is necessary to apply the t-distribution, while a similar inference can be obtained from either the t-distribution or the z-distribution with a large sample. For inference about 2 means, we generally use the t-test based on the t-distribution regardless of the sample sizes, because it is always safe, not only for a test with a small df but also for one with a large df.
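As a minimal illustration of this convergence (SciPy is assumed; the sample sizes are arbitrary examples), the following sketch compares two-sided 5% critical values of the t-distribution with that of the z-distribution:

    from scipy.stats import t, norm

    # Two-sided 5% critical values: the t critical value shrinks toward
    # the z critical value as the degrees of freedom increase.
    for n in (10, 30, 300):
        df = n - 1
        print(f"n = {n:3d}, df = {df:3d}, t critical value = {t.ppf(0.975, df):.3f}")
    print(f"z critical value = {norm.ppf(0.975):.3f}")  # approximately 1.960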

INDEPENDENT SAMPLES T-TEST

To adopt the z- or t-distribution for inference using small samples, a basic assumption is that the population distribution is not significantly different from a normal distribution. As seen in Appendix 1, the normality assumption needs to be tested in advance. If the normality assumption cannot be met and we have a small sample (n < 25), then we are not permitted to use the 'parametric' t-test. Instead, a non-parametric analysis such as the Mann-Whitney U test should be selected.
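A rough sketch of this pre-check is given below (the sample values and the 0.05 cut-off are hypothetical; SciPy is assumed): normality is tested with the Shapiro-Wilk test, and the Mann-Whitney U test is used as the fallback for small, non-normal samples.

    from scipy import stats

    # Hypothetical small samples (n < 25 in each group)
    group1 = [10.1, 10.4, 9.8, 10.6, 10.2, 10.0, 10.3, 9.9, 10.5, 10.2]
    group2 = [11.0, 11.3, 10.8, 11.2, 10.9, 11.1, 11.4, 10.7, 11.2, 11.0]

    # Shapiro-Wilk test of normality for each group
    _, p1 = stats.shapiro(group1)
    _, p2 = stats.shapiro(group2)

    if min(p1, p2) < 0.05 and min(len(group1), len(group2)) < 25:
        # Normality rejected in a small sample: use a non-parametric test
        stat, p = stats.mannwhitneyu(group1, group2, alternative="two-sided")
        print(f"Mann-Whitney U = {stat:.1f}, p = {p:.4f}")
    else:
        # Normality not rejected: the parametric t-test is permissible
        stat, p = stats.ttest_ind(group1, group2)
        print(f"t = {stat:.3f}, p = {p:.4f}")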

For a comparison of 2 independent group means, we can use a z-statistic to test the hypothesis of equal population means only if we know the population variances of the 2 groups, σ₁² and σ₂², as follows:

z = (X̄₁ − X̄₂) / √(σ₁²/n₁ + σ₂²/n₂),   (Eq. 1)

where X̄₁ and X̄₂, σ₁² and σ₂², and n₁ and n₂ are the sample means, population variances, and sizes of the 2 groups, respectively.
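A minimal sketch of Eq. 1 in Python is shown below; the means, variances, and group sizes are hypothetical, and in practice the population variances are rarely known, which is why the t-test is needed.

    from math import sqrt

    # Hypothetical example with *known* population variances
    mean1, mean2 = 10.3, 11.1   # sample means
    var1, var2 = 0.36, 0.25     # known population variances
    n1, n2 = 12, 15             # group sizes

    z = (mean1 - mean2) / sqrt(var1 / n1 + var2 / n2)
    print(f"z = {z:.3f}")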

Again, as we never know the population variances, we need to use the sample variances as their estimates. There are 2 methods, depending on whether the 2 population variances can be assumed equal or not. Under the assumption of equal variances, the t-test devised by Gosset in 1908, Student's t-test, can be applied. The other version is Welch's t-test, introduced in 1947, for cases where the assumption of equal variances cannot be accepted because a considerable difference is observed between the 2 sample variances.

1. Student's t-test

In Student's t-test, the population variances are assumed equal, so we need only one common variance estimate for the 2 groups. The common variance estimate is calculated as a pooled variance, a weighted average of the 2 sample variances, as follows:

sₚ² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2),   (Eq. 2)

where s₁² and s₂² are the sample variances.

The resulting t-test statistic has a form in which both population variances, σ₁² and σ₂², are replaced with the common variance estimate, sₚ²:

t = (X̄₁ − X̄₂) / √(sₚ² (1/n₁ + 1/n₂)),   (Eq. 3)

and the df is given as n₁ + n₂ − 2 for the t-test statistic.
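A small Python helper implementing Eqs. 2 and 3 from summary statistics is sketched below; the input values in the example call are hypothetical.

    from math import sqrt

    def student_t(mean1, s1, n1, mean2, s2, n2):
        """Student's t statistic and df from summary statistics (Eqs. 2 and 3)."""
        sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance
        t = (mean1 - mean2) / sqrt(sp2 * (1 / n1 + 1 / n2))
        df = n1 + n2 - 2
        return t, df

    # Hypothetical summary statistics
    t_stat, df = student_t(10.3, 0.60, 12, 11.1, 0.50, 15)
    print(f"t = {t_stat:.3f}, df = {df}")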

In Appendix 1, '(E-1) Levene's test for equality of variances' shows that the null hypothesis of equal variances was not rejected, given the high p value of 0.334 (under the heading Sig.). In '(E-2) t-test for equality of means', the upper line shows the result of Student's t-test. The t-value and df are shown as −3.357 and 18. We can get the same figures using the formulas Eq. 2 and Eq. 3 and the descriptive statistics in Table 1, as follows.

Table 1. Descriptive statistics of the 2 groups

Group   No.   Mean    Standard deviation   p value
1       10    10.28   0.5978               0.004
2       10    11.08   0.4590

The result of this calculation is slightly different from the SPSS (IBM Corp., Armonk, NY, USA) output in Appendix 1, probably because of rounding errors.
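The same figures can also be reproduced outside SPSS. The sketch below (assuming SciPy is available) feeds the Table 1 summary statistics into scipy.stats.ttest_ind_from_stats with equal variances assumed:

    from scipy.stats import ttest_ind_from_stats

    # Summary statistics from Table 1, equal variances assumed (Student's t-test)
    result = ttest_ind_from_stats(mean1=10.28, std1=0.5978, nobs1=10,
                                  mean2=11.08, std2=0.4590, nobs2=10,
                                  equal_var=True)
    print(result)  # t approximately -3.36 with df = 18, p approximately 0.004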

2. Welch's t-test

In practice, there are many cases where equal variances cannot be assumed. Even when it is unreasonable to assume equal variances, we can still compare 2 independent group means by performing Welch's t-test. Welch's t-test is more reliable when the 2 samples have unequal variances and/or unequal sample sizes, but the assumption of normality still needs to be maintained.

Because the population variances are not assumed equal, we have to estimate them separately with the 2 sample variances, s₁² and s₂². As a result, the t-test statistic takes the following form:

t = (X̄₁ − X̄₂) / √(s₁²/n₁ + s₂²/n₂),   (Eq. 4)

where the df is given by ν, the Satterthwaite degrees of freedom:

ν = (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1) ].
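A short sketch of Eq. 4 and the Satterthwaite df, using the Table 1 summary statistics, is given below (Python, standard library only):

    from math import sqrt

    # Summary statistics from Table 1
    mean1, s1, n1 = 10.28, 0.5978, 10
    mean2, s2, n2 = 11.08, 0.4590, 10

    v1, v2 = s1**2 / n1, s2**2 / n2      # per-group variance of the sample mean
    t = (mean1 - mean2) / sqrt(v1 + v2)  # Welch's t statistic (Eq. 4)
    nu = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))  # Satterthwaite df
    print(f"t = {t:.3f}, df = {nu:.3f}")  # approximately -3.357 and 16.875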

In Appendix 1, '(E-1) Levene's test for equality of variances' shows that equal variances can be assumed (p = 0.334). Therefore, Welch's t-test is not needed for these data. Purely as an exercise, we can interpret the results of Welch's t-test shown in the lower line of '(E-2) t-test for equality of means'. The t-value and df are shown as −3.357 and 16.875.

We have confirmed nearly the same results by calculation using the formulas and by SPSS software.

The t-test is one of the most frequently used methods for comparing 2 group means. However, we sometimes forget the underlying assumptions, such as the normality assumption, or miss the meaning of the equal variance assumption. Especially when we have a small sample, we need to check the normality assumption first and decide between the parametric t-test and the nonparametric Mann-Whitney U test. We also need to assess the assumption of equal variances and select either Student's t-test or Welch's t-test; this decision flow is sketched in the example below.
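The following Python sketch (hypothetical data and conventional 0.05 cut-offs; SciPy assumed) strings these decisions together: the Shapiro-Wilk test for normality, Levene's test for equal variances, and then Student's t-test, Welch's t-test, or the Mann-Whitney U test accordingly.

    from scipy import stats

    def compare_two_groups(x, y, alpha=0.05, small_n=25):
        """Choose and run a two-group comparison following the decision flow above."""
        # 1. Normality check (most relevant for small samples)
        normal = stats.shapiro(x).pvalue >= alpha and stats.shapiro(y).pvalue >= alpha
        if not normal and min(len(x), len(y)) < small_n:
            return "Mann-Whitney U", stats.mannwhitneyu(x, y, alternative="two-sided")
        # 2. Equal-variance check with Levene's test
        equal_var = stats.levene(x, y).pvalue >= alpha
        # 3. Student's t-test if equal variances are plausible, otherwise Welch's t-test
        name = "Student's t" if equal_var else "Welch's t"
        return name, stats.ttest_ind(x, y, equal_var=equal_var)

    # Hypothetical data
    group1 = [10.1, 10.4, 9.8, 10.6, 10.2, 10.0, 10.3, 9.9, 10.5, 10.2]
    group2 = [11.0, 11.3, 10.8, 11.2, 10.9, 11.1, 11.4, 10.7, 11.2, 11.0]
    print(compare_two_groups(group1, group2))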

Procedure of t-test analysis using IBM SPSS

The procedure of t-test analysis using IBM SPSS Statistics for Windows Version 23.0 (IBM Corp., Armonk, NY, USA) is as follows.

[Appendix 1: SPSS screenshots of the analysis procedure and output, including '(E-1) Levene's test for equality of variances' and '(E-2) t-test for equality of means']
