
Multivariate analysis: an overview

Posted on 9th September 2021 by Vighnesh D

""

Data analysis is one of the most useful tools for making sense of the vast amount of information presented to us and for synthesising evidence from it. There are usually multiple factors influencing a phenomenon.

Of these, some can be observed, documented and interpreted thoroughly, while others cannot. For example, in order to estimate the burden of a disease in society, there may be many factors which can be readily recorded, and a whole lot of others which are unreliable and therefore require proper scrutiny. Factors like incidence, age distribution, sex distribution and financial loss owing to the disease can be accounted for more easily than contact tracing, prevalence and institutional support. It is therefore of paramount importance that data are collected and interpreted thoroughly, in order to avoid common pitfalls.

Image: an xkcd comic in two panels. The first panel shows a scatter plot with a nearly horizontal red line through it and the label "R squared = 0.06". The second panel shows the same scatter plot with the points joined up by red lines to resemble a person holding a dog, labelled "Rexthor, the Dog-Bearer". Below the panels is the caption: "I don't trust linear regressions when it's harder to guess the direction of the correlation from the scatter plot than to find new constellations on it."

Image from: https://imgs.xkcd.com/comics/useful_geometry_formulas.png under Creative Commons License 2.5. Randall Munroe, xkcd.com.

Why does it sound so important?

Data collection and analysis are emphasised in academia because research findings inform the policies of governing bodies; the implications that follow are therefore a direct product of the information fed into the system.

Introduction

This blog post discusses types of data analysis in general and multivariate analysis in particular. It aims to introduce the concept to investigators inclined towards this discipline by reducing some of the complexity around the subject.

Analysis of data based on the types of variables in consideration is broadly divided into three categories:

  • Univariate analysis: The simplest of all data analysis models, univariate analysis considers only one variable at a time. Although simple to apply, it has limited use in analysing big data, e.g. the incidence of a disease.
  • Bivariate analysis: As the name suggests, bivariate analysis takes two variables into consideration. It has a slightly wider area of application but is still limited for large data sets, e.g. the incidence of a disease and the season of the year.
  • Multivariate analysis: Multivariate analysis takes a whole host of variables into consideration, which makes it a complicated as well as essential tool. Its greatest virtue is that it considers as many factors as possible, greatly reducing bias and giving results that are closest to reality. For example, refer to the factors discussed in the overview above.

Multivariate analysis is defined as:

The statistical study of data where multiple measurements are made on each experimental unit and where the relationships among multivariate measurements and their structure are important

Multivariate statistical methods incorporate several techniques depending on the situation and the question in focus. Some of these methods are listed below:

  • Regression analysis: Used to determine the relationship between a dependent variable and one or more independent variables.
  • Analysis of Variance (ANOVA) : Used to determine the relationship between collections of data by analyzing the difference in the means.
  • Interdependent analysis: Used to determine the relationship between a set of variables among themselves.
  • Discriminant analysis: Used to classify observations into two or more distinct sets of categories.
  • Classification and cluster analysis: Used to find similarity in a group of observations.
  • Principal component analysis: Used to interpret data in its simplest form by introducing new uncorrelated variables.
  • Factor analysis: Similar to principal component analysis, this too is used to crunch big data into small, interpretable forms.
  • Canonical correlation analysis: Perhaps one of the most complex models among all of the above, canonical correlation attempts to interpret data by analysing relationships between cross-covariance matrices.

ANOVA remains one of the most widely used statistical models in academia. Among its several variants, one subtype is frequently used in studies that involve multiple outcome variables, traditionally in behavioural research, i.e. psychology, psychiatry and allied disciplines. This model, the Multivariate Analysis of Variance (MANOVA), is widely described as the multivariate analogue of ANOVA, which is used for univariate data.

Image: an xkcd comic in four panels. A stick figure sits at a desk fitting a bell-shaped curve labelled "Student's T Distribution" onto a piece of paper, saying "Hmm", then "....Nope". He carries the curve away and replaces it with a new one that has many more peaks and troughs, labelled "Teachers' T Distribution".

Image from: https://imgs.xkcd.com/comics/t_distribution.png under Creative Commons License 2.5. Randall Munroe, xkcd.com.

Interpretation of results

Interpreting the results is probably the most difficult part of the technique. The relevant results are generally summarised in a table with accompanying text. Appropriate information must be highlighted regarding:

  • Multivariate test statistics used
  • Degrees of freedom
  • Appropriate test statistics used
  • Calculated p-value (p < x)

Reliability and validity of the test are the most important determining factors in such techniques.

Applications

Multivariate analysis is used in several disciplines. One of its most distinguishing features is that it can be used in parametric as well as non-parametric tests.

Quick question: What are parametric and non-parametric tests?

  • Parametric tests: Tests which make certain assumptions about the distribution of the data, i.e. that it lies within fixed parameters.
  • Non-parametric tests: Tests which make no assumptions about the distribution of the data; they are "distribution-free".

Parametric tests: based on an interval/ratio scale; outliers absent; uniformly distributed data; equal variance; sample size usually large.
Non-parametric tests: based on a nominal/ordinal scale; outliers present; non-uniform data; unequal variance; small sample size.

Uses of multivariate analysis: Multivariate analyses are used principally for four reasons: to detect patterns in data, to make clear comparisons, to discard unwanted information, and to study multiple factors at once. Applications of multivariate analysis are found in almost all the disciplines that make up the bulk of policy-making, e.g. economics, healthcare, pharmaceutical industries, applied sciences and sociology. Multivariate analysis has enjoyed a particularly strong tradition in behavioural sciences such as psychology, psychiatry and allied fields because of the complex nature of those disciplines.

Multivariate analysis is one of the most useful methods for determining relationships and analysing patterns in large data sets. It is particularly effective in minimising bias when a structured study design is employed. However, its complexity makes it less approachable for novice researchers. Although designing the study and interpreting the results can be tedious, the technique stands out in uncovering relationships in complex situations.

References (pdf)

Comments


I got good information on multivariate data analysis and on the advantages and patterns of using multivariate analysis.


Great summary. I found this very useful for starters


Thank you so much for the discussion on multivariate design in research. However, I want to know more about multiple regression analysis. Hoping to gain more learning from you.


Thank you for letting the author know this was useful, and I will see if there are any students wanting to blog about multiple regression analysis next!


When you want to know what contributed to an outcome, what type of study is done?


Dear Philip, Thank you for bringing this to our notice. Your input regarding the discussion is highly appreciated. However, since this particular blog was meant to be an overview, I consciously avoided the nuances to prevent complicated explanations at an early stage. I am planning to expand on the matter in subsequent blogs and will keep your suggestion in mind while drafting for the same. Many thanks, Vighnesh.


Sorry, I don’t want to be pedantic, but shouldn’t we differentiate between ‘multivariate’ and ‘multivariable’ regression? https://stats.stackexchange.com/questions/447455/multivariable-vs-multivariate-regression https://www.ajgponline.org/article/S1064-7481(18)30579-7/fulltext


Lesson 8: Multivariate Analysis of Variance (MANOVA)

The Multivariate Analysis of Variance (MANOVA) is the multivariate analog of the Analysis of Variance (ANOVA) procedure used for univariate data. We will introduce the Multivariate Analysis of Variance with the Romano-British Pottery data example.

Pottery shards are collected from four sites in the British Isles:

  • L : Llanedyrn
  • C : Caldicot
  • I : Isle Thorns
  • A : Ashley Rails

Subsequently, we will use the first letter of the name to distinguish between the sites.

Each pottery sample was returned to the laboratory for chemical assay. In these assays the concentrations of five different chemicals were determined:

  • Al: Aluminum
  • Fe: Iron
  • Mg: Magnesium
  • Ca: Calcium
  • Na: Sodium

We will abbreviate the chemical constituents with the chemical symbol in the examples that follow.

MANOVA will allow us to determine whether the chemical content of the pottery depends on the site where the pottery was obtained.  If this is the case, then in Lesson 10 , we will learn how to use the chemical content of a pottery sample of unknown origin to hopefully determine which site the sample came from.

Upon completing this lesson, you should be able to:

  • Use SAS/Minitab to perform a multivariate analysis of variance;
  • Draw appropriate conclusions from the results of a multivariate analysis of variance;
  • Understand the Bonferroni method for assessing the significance of individual variables;
  • Understand how to construct and interpret orthogonal contrasts among groups (treatments)

8.1 - The Univariate Approach: Analysis of Variance (ANOVA)

In the univariate case, the data can often be arranged in a table as shown below:

Treatment

  1 2 \(\dots\) g
Subjects 1 \(Y_{11}\) \(Y_{21}\) \(\dots\) \(Y_{g1}\)
2 \(Y_{12}\) \(Y_{22}\) \(\dots\) \(Y_{g2}\)
\(\vdots\) \(\vdots\) \(\vdots\)   \(\vdots\)
\(n_i\) \(Y_{1n_1}\) \(Y_{2n_2}\) \(\dots\) \(Y_{gn_g}\)

The columns correspond to the responses to g different treatments or from g different populations. And, the rows correspond to the subjects in each of these treatments or populations.

  • \(Y_{ij}\) = Observation from subject j in group i
  • \(n_{i}\) = Number of subjects in group i
  • \(N = n_{1} + n_{2} + \dots + n_{g}\) = Total sample size.

Assumptions for the Analysis of Variance are the same as for a two-sample t -test except that there are more than two groups:

  • The data from group i has common mean = \(\mu_{i}\); i.e., \(E\left(Y_{ij}\right) = \mu_{i}\) . This means that there are no sub-populations with different means.
  • Homoskedasticity : The data from all groups have common variance \(\sigma^2\); i.e., \(var(Y_{ij}) = \sigma^{2}\). That is, the variability in the data does not depend on group membership.
  • Independence: The subjects are independently sampled.
  • Normality : The data are normally distributed.

The hypothesis of interest is that all of the means are equal. Mathematically we write this as:

\(H_0\colon \mu_1 = \mu_2 = \dots = \mu_g\)

The alternative is expressed as:

\(H_a\colon \mu_i \ne \mu_j \) for at least one \(i \ne j\).

i.e., there is a difference between at least one pair of group population means. The following notation should be considered:

  • \(\bar{y}_{i.} = \frac{1}{n_i}\sum_{j=1}^{n_i}Y_{ij}\) = Sample mean for group i. This involves taking the average of all the observations for j = 1 to \(n_{i}\) belonging to the i-th group; the dot in the second subscript means that the average involves summing over the second subscript of y.
  • \(\bar{y}_{..} = \frac{1}{N}\sum_{i=1}^{g}\sum_{j=1}^{n_i}Y_{ij}\) = Grand mean. This involves taking the average of all the observations within each group and over the groups, dividing by the total sample size; the double dots indicate that we are summing over both subscripts of y.

The total sum of squares, \(SS_{total} = \sum_{i=1}^{g}\sum_{j=1}^{n_i}\left(Y_{ij}-\bar{y}_{..}\right)^2\), looks at the squared difference between each observation and the grand mean. Note that if the observations tend to be far away from the grand mean, this will take a large value; conversely, if all of the observations tend to be close to the grand mean, it will take a small value. Thus, the total sum of squares measures the variation of the data about the grand mean.

An Analysis of Variance (ANOVA) is a partitioning of the total sum of squares. In the second line of the expression below, we are adding and subtracting the sample mean for the i th group. In the third line, we can divide this out into two terms, the first term involves the differences between the observations and the group means, \(\bar{y}_i\), while the second term involves the differences between the group means and the grand mean.

\(\begin{array}{lll} SS_{total} & = & \sum_{i=1}^{g}\sum_{j=1}^{n_i}\left(Y_{ij}-\bar{y}_{..}\right)^2 \\ & = & \sum_{i=1}^{g}\sum_{j=1}^{n_i}\left((Y_{ij}-\bar{y}_{i.})+(\bar{y}_{i.}-\bar{y}_{..})\right)^2 \\ & = &\underset{SS_{error}}{\underbrace{\sum_{i=1}^{g}\sum_{j=1}^{n_i}(Y_{ij}-\bar{y}_{i.})^2}}+\underset{SS_{treat}}{\underbrace{\sum_{i=1}^{g}n_i(\bar{y}_{i.}-\bar{y}_{..})^2}} \end{array}\)

The first term is called the error sum of squares and measures the variation in the data about their group means.

Note that if the observations tend to be close to their group means, then this value will tend to be small. On the other hand, if the observations tend to be far away from their group means, then the value will be larger. The second term is called the treatment sum of squares and involves the differences between the group means and the Grand mean. Here, if group means are close to the Grand mean, then this value will be small. While, if the group means tend to be far away from the Grand mean, this will take a large value. This second term is called the Treatment Sum of Squares and measures the variation of the group means about the Grand mean.

The Analysis of Variance results are summarized in the ANOVA table below:


Source d.f. SS MS F
Treatments \(g-1\) \(SS_{treat} = \sum_{i=1}^{g} n_i \left( \bar{y}_{i.} - \bar{y}_{..} \right)^2\) \(MS_{treat} = \dfrac{SS_{treat}}{g-1}\) \(F = \dfrac{MS_{treat}}{MS_{error}}\)
Error \(N-g\) \(SS_{error} = \sum_{i=1}^{g} \sum_{j=1}^{n_i} \left( Y_{ij} - \bar{y}_{i.} \right)^2\) \(MS_{error} = \dfrac{SS_{error}}{N-g}\)
Total \(N-1\) \(SS_{total} = \sum_{i=1}^{g} \sum_{j=1}^{n_i} \left( Y_{ij} - \bar{y}_{..} \right)^2\)

The ANOVA table contains columns for Source, Degrees of Freedom, Sum of Squares, Mean Square and F . Sources include Treatment and Error which together add up to the Total.

The degrees of freedom for treatment in the first row of the table are calculated by taking the number of groups or treatments minus 1. The total degree of freedom is the total sample size minus 1.  The Error degrees of freedom are obtained by subtracting the treatment degrees of freedom from the total degrees of freedom to obtain N - g .

The formulae for the Sum of Squares are given in the SS column. The Mean Square terms are obtained by taking the Sums of Squares terms and dividing them by the corresponding degrees of freedom.

The final column contains the F statistic which is obtained by taking the MS for treatment and dividing it by the MS for Error.

Under the null hypothesis that the group means are all equal, that is, \(H_{0} \colon \mu_{1} = \mu_{2} = \dots = \mu_{g} \), this F statistic is F-distributed with g - 1 and N - g degrees of freedom:

\(F \sim F_{g-1, N-g}\)

The numerator degrees of freedom g - 1 comes from the degrees of freedom for treatments in the ANOVA table. This is referred to as the numerator degrees of freedom since the formula for the F -statistic involves the Mean Square for Treatment in the numerator. The denominator degrees of freedom N - g is equal to the degrees of freedom for error in the ANOVA table. This is referred to as the denominator degrees of freedom because the formula for the F -statistic involves the Mean Square Error in the denominator.

We reject \(H_{0}\) at level \(\alpha\) if the F statistic is greater than the critical value of the F -table, with g - 1 and N - g degrees of freedom, and evaluated at level \(\alpha\).

\(F > F_{g-1, N-g, \alpha}\)
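To make the partition and the F test concrete, here is a minimal Python sketch (an illustration only, not part of the lesson's SAS or Minitab materials) that computes the sums of squares, mean squares, and F statistic for a small made-up data set; all numbers are hypothetical.

```python
import numpy as np
from scipy.stats import f

# Hypothetical data: g = 3 groups (treatments) with unequal sample sizes.
groups = [
    np.array([4.1, 5.0, 4.6, 4.8]),
    np.array([6.2, 5.9, 6.5]),
    np.array([5.1, 5.4, 4.9, 5.3, 5.0]),
]

g = len(groups)
N = sum(len(y) for y in groups)
grand_mean = np.concatenate(groups).mean()

# Partition of the total sum of squares: SS_total = SS_error + SS_treat.
ss_treat = sum(len(y) * (y.mean() - grand_mean) ** 2 for y in groups)
ss_error = sum(((y - y.mean()) ** 2).sum() for y in groups)

ms_treat = ss_treat / (g - 1)
ms_error = ss_error / (N - g)
F_stat = ms_treat / ms_error

alpha = 0.05
F_crit = f.ppf(1 - alpha, g - 1, N - g)   # critical value F_{g-1, N-g, alpha}
p_value = f.sf(F_stat, g - 1, N - g)

print(f"F = {F_stat:.3f}, F_crit = {F_crit:.3f}, p = {p_value:.4f}")
```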

8.2 - The Multivariate Approach: One-way Multivariate Analysis of Variance (One-way MANOVA)

Now we will consider the multivariate analog, the Multivariate Analysis of Variance, often abbreviated as MANOVA.

Suppose that we have data on p variables which we can arrange in a table such as the one below:

Table of one-way MANOVA data
Treatment
  1 2 \(\cdots\) g
Subject
1 \(\mathbf{Y_{11}} = \begin{pmatrix} Y_{111} \\ Y_{112} \\ \vdots \\ Y_{11p} \end{pmatrix}\) \(\mathbf{Y_{21}} = \begin{pmatrix} Y_{211} \\ Y_{212} \\ \vdots \\ Y_{21p} \end{pmatrix}\) \(\cdots\) \(\mathbf{Y_{g1}} = \begin{pmatrix} Y_{g11} \\ Y_{g12} \\ \vdots \\ Y_{g1p} \end{pmatrix}\)
2 \(\mathbf{Y_{12}} = \begin{pmatrix} Y_{121} \\ Y_{122} \\ \vdots \\ Y_{12p} \end{pmatrix}\) \(\mathbf{Y_{22}} = \begin{pmatrix} Y_{221} \\ Y_{222} \\ \vdots \\ Y_{22p} \end{pmatrix}\) \(\cdots\) \(\mathbf{Y_{g2}} = \begin{pmatrix} Y_{g21} \\ Y_{g22} \\ \vdots \\ Y_{g2p} \end{pmatrix}\)
\(\vdots\) \(\vdots\) \(\vdots\)   \(\vdots\)
\(n_i\) \(\mathbf{Y_{1n_1}} = \begin{pmatrix} Y_{1n_{1}1} \\ Y_{1n_{1}2} \\ \vdots \\ Y_{1n_{1}p} \end{pmatrix}\) \(\mathbf{Y_{2n_2}} = \begin{pmatrix} Y_{2n_{2}1} \\ Y_{2n_{2}2} \\ \vdots \\ Y_{2n_{2}p} \end{pmatrix}\) \(\cdots\) \(\mathbf{Y_{gn_g}} = \begin{pmatrix} Y_{gn_{g}1} \\ Y_{gn_{g}2} \\ \vdots \\ Y_{gn_{g}p} \end{pmatrix}\)

In this multivariate case the scalar quantities, \(Y_{ij}\), of the corresponding table in ANOVA, are replaced by vectors having p observations.

\(n_{i}\)  = the number of subjects in group i

 \(N = n _ { 1 } + n _ { 2 } + \ldots + n _ { g }\) = Total sample size.

Assumptions

The assumptions here are essentially the same as the assumptions in Hotelling's \(T^{2}\) test, only here they apply to groups:

  • The data from group i has common mean vector \(\boldsymbol{\mu}_i = \left(\begin{array}{c}\mu_{i1}\\\mu_{i2}\\\vdots\\\mu_{ip}\end{array}\right)\)
  • The data from all groups have a common variance-covariance matrix \(\Sigma\).
  • Independence : The subjects are independently sampled.
  • Normality : The data are multivariate normally distributed.

Here we are interested in testing the null hypothesis that the group mean vectors are all equal to one another. Mathematically this is expressed as:

\(H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \dots = \boldsymbol{\mu}_g\)

The alternative hypothesis is:

\(H_a \colon \mu_{ik} \ne \mu_{jk}\) for at least one \(i \ne j\) and at least one variable \(k\)

This says that the null hypothesis is false if at least one pair of treatments is different on at least one variable.

The scalar quantities used in the univariate setting are replaced by vectors in the multivariate setting:

Sample Mean Vector

\(\bar{\mathbf{y}}_{i.} = \frac{1}{n_i}\sum_{j=1}^{n_i}\mathbf{Y}_{ij} = \left(\begin{array}{c}\bar{y}_{i.1}\\ \bar{y}_{i.2} \\ \vdots \\ \bar{y}_{i.p}\end{array}\right)\) = sample mean vector for group i. This sample mean vector is comprised of the group means for each of the p variables. Thus, \(\bar{y}_{i.k} = \frac{1}{n_i}\sum_{j=1}^{n_i}Y_{ijk}\) = sample mean for variable k in group i.

Grand Mean Vector

\(\bar{\mathbf{y}}_{..} = \frac{1}{N}\sum_{i=1}^{g}\sum_{j=1}^{n_i}\mathbf{Y}_{ij} = \left(\begin{array}{c}\bar{y}_{..1}\\ \bar{y}_{..2} \\ \vdots \\ \bar{y}_{..p}\end{array}\right)\) = grand mean vector. This grand mean vector is comprised of the grand means for each of the p variables. Thus, \(\bar{y}_{..k} = \frac{1}{N}\sum_{i=1}^{g}\sum_{j=1}^{n_i}Y_{ijk}\) = grand mean for variable k .

Total Sum of Squares and Cross Products

In the univariate Analysis of Variance, we defined the Total Sums of Squares, a scalar quantity. The multivariate analog is the Total Sum of Squares and Cross Products matrix, a p x p matrix of numbers. The total sum of squares is a cross products matrix defined by the expression below:

\(\mathbf{T = \sum\limits_{i=1}^{g}\sum\limits_{j=1}^{n_i}(Y_{ij}-\bar{y}_{..})(Y_{ij}-\bar{y}_{..})'}\)

Here we are looking at the differences between the vectors of observations \(Y_{ij}\) and the Grand mean vector. These differences form a vector which is then multiplied by its transpose.

Here, the \(\left (k, l \right )^{th}\) element of T is

\(\sum\limits_{i=1}^{g}\sum\limits_{j=1}^{n_i} (Y_{ijk}-\bar{y}_{..k})(Y_{ijl}-\bar{y}_{..l})\)

For k = l , this is the total sum of squares for variable k and measures the total variation in the \(k^{th}\) variable. For \(k ≠ l\), this measures the dependence between variables k and l across all of the observations.

We may partition the total sum of squares and cross products as follows:

\(\begin{array}{lll}\mathbf{T} & = & \mathbf{\sum_{i=1}^{g}\sum_{j=1}^{n_i}(Y_{ij}-\bar{y}_{..})(Y_{ij}-\bar{y}_{..})'} \\ & = & \mathbf{\sum_{i=1}^{g}\sum_{j=1}^{n_i}\{(Y_{ij}-\bar{y}_i)+(\bar{y}_i-\bar{y}_{..})\}\{(Y_{ij}-\bar{y}_i)+(\bar{y}_i-\bar{y}_{..})\}'} \\ & = & \mathbf{\underset{E}{\underbrace{\sum_{i=1}^{g}\sum_{j=1}^{n_i}(Y_{ij}-\bar{y}_{i.})(Y_{ij}-\bar{y}_{i.})'}}+\underset{H}{\underbrace{\sum_{i=1}^{g}n_i(\bar{y}_{i.}-\bar{y}_{..})(\bar{y}_{i.}-\bar{y}_{..})'}}}\end{array}\)

where E is the Error Sum of Squares and Cross Products , and H is the Hypothesis Sum of Squares and Cross Products .

The \(\left (k, l \right )^{th}\) element of the error sum of squares and cross products matrix E is:

\(\sum\limits_{i=1}^{g}\sum\limits_{j=1}^{n_i}(Y_{ijk}-\bar{y}_{i.k})(Y_{ijl}-\bar{y}_{i.l})\)

For k = l , this is the error sum of squares for variable k , and measures the within treatment variation for the \(k^{th}\) variable. For \(k ≠ l\), this measures the dependence between variables k and l after taking into account the treatment.

The \(\left (k, l \right )^{th}\) element of the hypothesis sum of squares and cross products matrix H is

\(\sum\limits_{i=1}^{g}n_i(\bar{y}_{i.k}-\bar{y}_{..k})(\bar{y}_{i.l}-\bar{y}_{..l})\)

For k = l , this is the treatment sum of squares for variable k and measures the between-treatment variation for the \(k^{th}\) variable. For \(k ≠ l\), this measures the dependence of variables k and l across treatments.

The partitioning of the total sum of squares and cross-products matrix may be summarized in the multivariate analysis of the variance table:

MANOVA

Source d.f. SSP
Treatments \(g-1\) \(\mathbf{H}\)
Error \(N-g\) \(\mathbf{E}\)
Total \(N-1\) \(\mathbf{T} = \mathbf{H} + \mathbf{E}\)

We wish to reject

\(H_0\colon \boldsymbol{\mu_1 = \mu_2 = \dots =\mu_g}\)

if the hypothesis sum of squares and cross products matrix H is large relative to the error sum of squares and cross products matrix E .
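As an illustration (not the lesson's SAS program), the following Python sketch computes E, H, and T for a small made-up data set with p = 2 variables and g = 3 groups, and verifies the partition T = H + E.

```python
import numpy as np

# Hypothetical data: one (n_i x p) array per group, with p = 2 response variables.
groups = [
    np.array([[4.1, 1.2], [5.0, 1.5], [4.6, 1.1], [4.8, 1.4]]),
    np.array([[6.2, 2.1], [5.9, 1.9], [6.5, 2.3]]),
    np.array([[5.1, 1.6], [5.4, 1.8], [4.9, 1.5], [5.3, 1.7]]),
]

all_obs = np.vstack(groups)
grand_mean = all_obs.mean(axis=0)                 # grand mean vector
p = all_obs.shape[1]

E = np.zeros((p, p))                              # error SSCP matrix
H = np.zeros((p, p))                              # hypothesis (treatment) SSCP matrix
for y in groups:
    ybar = y.mean(axis=0)                         # group mean vector
    centered = y - ybar
    E += centered.T @ centered                    # within-group variation
    d = (ybar - grand_mean).reshape(-1, 1)
    H += len(y) * (d @ d.T)                       # between-group variation

T = H + E                                         # total SSCP matrix
centered_all = all_obs - grand_mean
assert np.allclose(T, centered_all.T @ centered_all)   # T computed directly from its definition
print("H =\n", H, "\nE =\n", E)
```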

8.3 - Test Statistics for MANOVA

SAS uses four different test statistics based on the MANOVA table:

Wilks Lambda

\(\Lambda^* = \dfrac{|\mathbf{E}|}{|\mathbf{H+E}|}\)

Here, the determinant of the error sums of squares and cross-products matrix E is divided by the determinant of the total sum of squares and cross-products matrix T = H + E. If H is large relative to E, then |H + E| will be large relative to |E|. Thus, we will reject the null hypothesis if Wilks lambda is small (close to zero).

Hotelling-Lawley Trace

\(T^2_0 = trace(\mathbf{HE}^{-1})\)

Here, we are multiplying H by the inverse of E; then we take the trace of the resulting matrix. If H is large relative to E, then the Hotelling-Lawley trace will take a large value. Thus, we will reject the null hypothesis if this test statistic is large.

Pillai Trace

\(V = trace(\mathbf{H(H+E)^{-1}})\)

Here, we are multiplying H by the inverse of the total sum of squares and cross products matrix T = H + E. If H is large relative to E, then the Pillai trace will take a large value. Thus, we will reject the null hypothesis if this test statistic is large.

Roy's Maximum Root

\(\lambda_{max}(\mathbf{HE}^{-1})\), the largest eigenvalue of \(\mathbf{HE}^{-1}\)

Here, we multiply H by the inverse of E and then compute the largest eigenvalue of the resulting matrix. If H is large relative to E, then Roy's root will take a large value. Thus, we will reject the null hypothesis if this test statistic is large.

Recall: The trace of a p x p matrix

\(\mathbf{A} = \left(\begin{array}{cccc}a_{11} & a_{12} & \dots & a_{1p}\\ a_{21} & a_{22} & \dots & a_{2p} \\ \vdots & \vdots & & \vdots \\ a_{p1} & a_{p2} & \dots & a_{pp}\end{array}\right)\)

is equal to

\(trace(\mathbf{A}) = \sum_{i=1}^{p}a_{ii}\)

Statistical tables are not available for the above test statistics. However, each of the above test statistics has an F approximation: The following details the F approximations for Wilks lambda. Details for all four F approximations can be found on the SAS website .

1. Wilks Lambda

\begin{align} \text{Starting with } \Lambda^* &= \dfrac{|\mathbf{E}|}{|\mathbf{H+E}|},\\ \text{let } a &= N-g - \dfrac{p-g+2}{2},\\ b &= \left\{\begin{array}{ll} \sqrt{\dfrac{p^2(g-1)^2-4}{p^2+(g-1)^2-5}}; &\text{if } p^2 + (g-1)^2-5 > 0\\ 1; & \text{if } p^2 + (g-1)^2-5 \le 0 \end{array}\right. \\ \text{and } c &= \dfrac{p(g-1)-2}{2}. \\ \text{Then, under } H_0,\quad F &= \left(\dfrac{1-(\Lambda^*)^{1/b}}{(\Lambda^*)^{1/b}}\right)\left(\dfrac{ab-c}{p(g-1)}\right) \overset{\cdot}{\sim} F_{p(g-1),\, ab-c} \end{align}
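Continuing the earlier sketch (again a hypothetical illustration, not the SAS output), Wilks' Λ* and this F approximation can be computed directly from H and E:

```python
import numpy as np
from scipy.stats import f

def wilks_lambda_F(H, E, N, g, p):
    """Wilks' Lambda for a one-way MANOVA and its F approximation."""
    Lam = np.linalg.det(E) / np.linalg.det(H + E)
    a = N - g - (p - g + 2) / 2.0
    denom = p**2 + (g - 1)**2 - 5
    b = np.sqrt((p**2 * (g - 1)**2 - 4) / denom) if denom > 0 else 1.0
    c = (p * (g - 1) - 2) / 2.0
    df1, df2 = p * (g - 1), a * b - c
    F_stat = ((1 - Lam**(1 / b)) / Lam**(1 / b)) * (df2 / df1)
    return Lam, F_stat, (df1, df2), f.sf(F_stat, df1, df2)

# With H and E from the sketch in the previous section (N = 11, g = 3, p = 2):
# Lam, F_stat, dfs, p_value = wilks_lambda_F(H, E, N=11, g=3, p=2)
```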

8.4 - Example: Pottery Data - Checking Model Assumptions

Example 8-1: Pottery Data (MANOVA)

Before carrying out a MANOVA, first check the model assumptions:

  • The data from group i have common mean vector \(\boldsymbol{\mu}_{i}\)
  • The data from all groups have a common variance-covariance matrix \(\Sigma\)
  • Independence: The subjects are independently sampled.
  • Normality: The data are multivariate normally distributed.

Assumption 1 : The data from group i have a common mean vector \(\boldsymbol{\mu}_{i}\)

This assumption says that there are no subpopulations with different mean vectors. Here, this assumption might be violated if the pottery collected from a single site actually came from distinct subpopulations, so that the samples from that site do not share a common mean vector.

Assumption 3 : Independence : The subjects are independently sampled. This assumption is satisfied if the assayed pottery is obtained by randomly sampling the pottery collected from each site. This assumption would be violated if, for example, pottery samples were collected in clusters. In other applications, this assumption may be violated if the data were collected over time or space.

Assumption 4 : Normality : The data are multivariate normally distributed.

  • For large samples, the Central Limit Theorem says that the sample mean vectors are approximately multivariate normally distributed, even if the individual observations are not.
  • For the pottery data, however, we have a total of only N = 26 observations, including only two samples from Caldicot. With a small N , we cannot rely on the Central Limit Theorem.

Diagnostic procedures are based on the residuals, computed by taking the differences between the individual observations and the group means for each variable:

\(\hat{\epsilon}_{ijk} = Y_{ijk}-\bar{Y}_{i.k}\)

Thus, for each subject (or pottery sample in this case), residuals are defined for each of the p variables. Then, to assess normality, we apply the following graphical procedures:

  • Plot the histograms of the residuals for each variable. Look for a symmetric distribution.
  • Plot a matrix of scatter plots. Look for elliptical distributions and outliers.
  • Plot three-dimensional scatter plots. Look for elliptical distributions and outliers.

If the histograms are not symmetric or the scatter plots are not elliptical, this would be evidence that the data are not sampled from a multivariate normal distribution in violation of Assumption 4. In this case, a normalizing transformation should be considered.
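A minimal Python sketch of these residual diagnostics (an alternative to the SAS/Minitab steps below, assuming the pottery data are read into a pandas DataFrame with the column names used in the Minitab instructions: site, al, fe, mg, ca, na):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed layout: pottery.csv with columns site, al, fe, mg, ca, na.
pottery = pd.read_csv("pottery.csv")
chem = ["al", "fe", "mg", "ca", "na"]

# Residuals: each observation minus its group (site) mean, variable by variable.
group_means = pottery.groupby("site")[chem].transform("mean")
residuals = pottery[chem] - group_means

# Histograms of the residuals for each variable; look for roughly symmetric shapes.
residuals.hist(bins=10, figsize=(10, 6))
plt.suptitle("Residuals by variable")
plt.show()

# Scatter plot matrix of the residuals; look for elliptical scatter and outliers.
pd.plotting.scatter_matrix(residuals, figsize=(10, 10))
plt.show()
```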

Download the text file containing the data here: pottery.csv


The SAS program below will help us check this assumption.

Download the SAS Program here: potterya.sas


MANOVA normality assumption

To fit the MANOVA model and assess the normality of residuals in Minitab:

  • Open the ‘pottery’ data set in a new worksheet
  • For convenience, rename the columns: site, al, fe, mg, ca, and na from left to right.
  • Stat > ANOVA > General MANOVA
  • Highlight and select all five variables (al through na) to move them to the Responses window.
  • Highlight and select 'site' to move it to the Model window.
  • Graphs > Individual plots , check Histogram and Normal plot, then 'OK' .
  • Choose 'OK' again. The MANOVA table, along with the residual plots are displayed in the results area.
  • Histograms suggest that, except for sodium, the distributions are relatively symmetric. However, the histogram for sodium suggests that there are two outliers in the data. Both of these outliers are in Llanedyrn.
  • Two outliers can also be identified from the matrix of scatter plots.
  • Removal of the two outliers results in a more symmetric distribution for sodium.

The results of MANOVA can be sensitive to the presence of outliers. One approach to assessing this would be to analyze the data twice, once with the outliers and once without them. The results may then be compared for consistency. The following analyses use all of the data, including the two outliers.

Assumption 2 : The data from all groups have a common variance-covariance matrix \(\Sigma\).

This assumption can be checked using Box's test for homogeneity of variance-covariance matrices. To obtain Box's test, let \(\Sigma_{i}\) denote the population variance-covariance matrix for group i . Consider testing:

\(H_0\colon \Sigma_1 = \Sigma_2 = \dots = \Sigma_g\)

\(H_a\colon \Sigma_i \ne \Sigma_j\) for at least one \(i \ne j\)

Under the alternative hypothesis, at least two of the variance-covariance matrices differ on at least one of their elements. Let:

\(\mathbf{S}_i = \dfrac{1}{n_i-1}\sum\limits_{j=1}^{n_i}\mathbf{(Y_{ij}-\bar{y}_{i.})(Y_{ij}-\bar{y}_{i.})'}\)

denote the sample variance-covariance matrix for group i . Compute the pooled variance-covariance matrix

\(\mathbf{S}_p = \dfrac{\sum_{i=1}^{g}(n_i-1)\mathbf{S}_i}{\sum_{i=1}^{g}(n_i-1)}= \dfrac{\mathbf{E}}{N-g}\)

Box's test is based on the following test statistic:

\(L' = c\left\{(N-g)\log |\mathbf{S}_p| - \sum_{i=1}^{g}(n_i-1)\log|\mathbf{S}_i|\right\}\)

where the correction factor is

\(c = 1-\dfrac{2p^2+3p-1}{6(p+1)(g-1)}\left\{\sum\limits_{i=1}^{g}\dfrac{1}{n_i-1}-\dfrac{1}{N-g}\right\}\)

The version of Box's test considered in the lesson on the two-sample Hotelling's T-square is a special case where g = 2. Under the null hypothesis of homogeneous variance-covariance matrices, L' is approximately chi-square distributed with

\(\dfrac{1}{2}p(p+1)(g-1)\)

degrees of freedom. Reject \(H_0\) at level \(\alpha\) if

\(L' > \chi^2_{\frac{1}{2}p(p+1)(g-1),\alpha}\)
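Box's test is simple enough to compute directly from these formulas. A hedged Python sketch (groups supplied as a list of n_i × p arrays; note that each S_i must be nonsingular, i.e. n_i - 1 ≥ p, for the log-determinants to exist):

```python
import numpy as np
from scipy.stats import chi2

def box_m_test(groups):
    """Box's test (L') for homogeneity of variance-covariance matrices."""
    g = len(groups)
    p = groups[0].shape[1]
    n = np.array([y.shape[0] for y in groups])
    N = n.sum()

    S = [np.cov(y, rowvar=False) for y in groups]           # sample covariance matrices S_i
    E = sum((ni - 1) * Si for ni, Si in zip(n, S))
    S_p = E / (N - g)                                        # pooled covariance matrix

    c = 1 - (2 * p**2 + 3 * p - 1) / (6 * (p + 1) * (g - 1)) * \
        (np.sum(1.0 / (n - 1)) - 1.0 / (N - g))
    L = c * ((N - g) * np.log(np.linalg.det(S_p))
             - sum((ni - 1) * np.log(np.linalg.det(Si)) for ni, Si in zip(n, S)))

    df = 0.5 * p * (p + 1) * (g - 1)
    return L, df, chi2.sf(L, df)                             # reject H0 for large L'
```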

Example 8-2: Pottery Data

Here we will use the Pottery SAS program.

Download the SAS Program here: pottery2.sas

Minitab does not perform this function at this time.

We find no statistically significant evidence against the null hypothesis that the variance-covariance matrices are homogeneous ( L' = 27.58; d.f. = 45; p = 0.98).

  • If we were to reject the null hypothesis of homogeneity of variance-covariance matrices, then we would conclude that assumption 2 is violated.
  • MANOVA is not robust to violations of the assumption of homogeneous variance-covariance matrices.
  • Note that the assumptions of homogeneous variance-covariance matrices and multivariate normality are often violated together.
  • Therefore, a normalizing transformation may also be a variance-stabilizing transformation.

8.5 - Example: MANOVA of Pottery Data

Example 8-3: Pottery Data (MANOVA)

After we have assessed the assumptions, our next step is to proceed with the MANOVA.

This may be carried out using the Pottery SAS Program below.

Download the SAS Program here: pottery.sas

Performing a MANOVA

To carry out the MANOVA test in Minitab:

  • Stat > ANOVA > General MANOVA
  • Highlight and select all five variables (al through na) to move them to the Responses window.
  • Highlight and select 'site' to move it to the Model window. Choose 'OK'.
  • Choose ' OK' again. The MANOVA table is displayed in the results area.

The concentrations of the chemical elements depend on the site where the pottery sample was obtained \(\left( \Lambda^{\star} = 0.0123; F = 13.09; \mathrm{d.f.} = 15, 50; p < 0.0001 \right)\). We therefore conclude that the concentration of at least one element differs between at least one pair of sites.

  Question : How do the chemical constituents differ among sites?

A profile plot may be used to explore how the chemical constituents differ among the four sites. In a profile plot, the group means are plotted on the Y-axis against the variable names on the X-axis, connecting the dots for all means within each group. A profile plot for the pottery data is obtained using the SAS program below

Download the SAS Program here: pottery1.sas

MANOVA profile plot

To create a profile plot in Minitab:

  • Graph > Line plots > Multiple Y’s
  • Highlight and select all five variables (al through na) to move them to Graph variables.
  • Highlight and select 'site' to move it to the Categorical variable for grouping.
  • Choose 'OK' . The profile plot is displayed in the results area. Note that the order of the variables on the plot corresponds to the original order in the data set.

gplot for pottery data
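A rough Python sketch of a comparable profile plot (assuming the pandas DataFrame and column names from the residual-check sketch above):

```python
import pandas as pd
import matplotlib.pyplot as plt

pottery = pd.read_csv("pottery.csv")                 # assumed columns: site, al, fe, mg, ca, na
chem = ["al", "fe", "mg", "ca", "na"]

# Plot the group means on the Y-axis against the variable names on the X-axis,
# connecting the dots within each site.
means = pottery.groupby("site")[chem].mean()
for site, row in means.iterrows():
    plt.plot(chem, row.values, marker="o", label=site)

plt.xlabel("Chemical constituent")
plt.ylabel("Group mean concentration")
plt.legend(title="Site")
plt.title("Profile plot of site means")
plt.show()
```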

Results from the profile plots are summarized as follows:

  • The sample sites appear to be paired: Ashley Rails with Isle Thorns and Caldicot with Llanedyrn.
  • Ashley Rails and Isle Thorns appear to have higher aluminum concentrations than Caldicot and Llanedyrn.
  • Caldicot and Llanedyrn appear to have higher iron and magnesium concentrations than Ashley Rails and Isle Thorns.
  • Calcium and sodium concentrations do not appear to vary much among the sites.

Note: These results are not backed up by appropriate hypothesis tests. Hypotheses need to be formulated to answer specific questions about the data, and they should be considered only if significant differences among group mean vectors are detected in the MANOVA.

Specific Questions

  • Which chemical elements vary significantly across sites?
  • Is the mean chemical constituency of pottery from Ashley Rails and Isle Thorns different from that of Llanedyrn and Caldicot?
  • Is the mean chemical constituency of pottery from Ashley Rails equal to that of Isle Thorns?
  • Is the mean chemical constituency of pottery from Llanedyrn equal to that of Caldicot?

Analysis of Individual Chemical Elements

A naive approach to assessing the significance of individual variables (chemical elements) would be to carry out individual ANOVAs to test:

\(H_0\colon \mu_{1k} = \mu_{2k} = \dots = \mu_{gk}\)

for chemical k. Reject \(H_0 \) at level \(\alpha\) if

\(F > F_{g-1, N-g, \alpha}\)

  Problem: If we repeat this analysis for each of the p variables, this does not control the experiment-wise error rate.

Just as we can apply a Bonferroni correction to obtain confidence intervals, we can also apply a Bonferroni correction to assess the effects of group membership on the population means of the individual variables.

Bonferroni Correction: Reject \(H_0 \) at level \(\alpha\) if

\(F > F_{g-1, N-g, \alpha/p}\)

or, equivalently, if the p -value is less than \(α/p\).
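As a sketch of how these Bonferroni-corrected per-variable ANOVAs might be run (again assuming the pottery DataFrame and column names introduced in the earlier residual-check sketch):

```python
import pandas as pd
from scipy.stats import f_oneway

pottery = pd.read_csv("pottery.csv")                 # assumed columns: site, al, fe, mg, ca, na
chem = ["al", "fe", "mg", "ca", "na"]

alpha = 0.05
threshold = alpha / len(chem)                        # Bonferroni-adjusted per-variable level, alpha/p

for var in chem:
    samples = [grp[var].values for _, grp in pottery.groupby("site")]
    F_stat, p_value = f_oneway(*samples)             # one-way ANOVA for this variable across sites
    verdict = "significant" if p_value < threshold else "not significant"
    print(f"{var}: F = {F_stat:.2f}, p = {p_value:.4g} -> {verdict} at alpha/p = {threshold:.3f}")
```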

Example 8-4: Pottery Data (ANOVA)

The individual ANOVA results are output by the SAS program below.

Download the SAS program here: pottery.sas

MANOVA follow-up ANOVAs

To carry out multiple ANOVA tests in Minitab:

  • Highlight and select 'site' to move it to the Factors window.
  • Choose 'OK' . Each of the ANOVA tables is displayed separately in the results area.

Here, p = 5 variables, g = 4 groups, and a total of N = 26 observations. So, for an \(\alpha = 0.05\) level test, we reject \(H_0\) if

\(F > F_{3,22,0.01} = 4.82\)

or equivalently, if the p-value reported by SAS is less than 0.05/5 = 0.01. The results of the individual ANOVAs are summarized in the following table. All tests are carried out with 3 and 22 degrees of freedom (the d.f. should always be noted when reporting these results).

Element F SAS p-value
Al 26.67 < 0.0001
Fe 89.88 < 0.0001
Mg 49.12 < 0.0001
Ca 29.16 < 0.0001
Na 9.50 0.0003

Because all of the F -statistics exceed the critical value of 4.82, or equivalently, because the SAS p -values all fall below 0.01, we can see that all tests are significant at the 0.05 level under the Bonferroni correction.

Conclusion : The means for all chemical elements differ significantly among the sites. For each element, the means for that element are different for at least one pair of sites.

8.6 - Orthogonal Contrasts

Differences among treatments can be explored through pre-planned orthogonal contrasts. Contrasts involve linear combinations of group mean vectors instead of linear combinations of the variables.

The linear combination of group mean vectors

\(\mathbf{\Psi} = \sum\limits_{i=1}^{g}c_i\mathbf{\mu}_i\)

is a (treatment) contrast if

\(\sum\limits_{i=1}^{g}c_i = 0\)

Contrasts are defined with respect to specific questions we might wish to ask of the data. Here, we shall consider testing hypotheses of the form

\(H_0\colon \mathbf{\Psi = 0}\)

Example 8-5: Drug Trial

Suppose that we have a drug trial with the following 3 treatments:

  • Treatment 1: Placebo
  • Treatment 2: Brand-name drug
  • Treatment 3: Generic drug

Consider the following questions:

  Question 1: Is there a difference between the Brand Name drug and the Generic drug?

\begin{align} \text{That is, consider testing:}&& &H_0\colon \mathbf{\mu_2 = \mu_3}\\ \text{This is equivalent to testing,}&&  &H_0\colon \mathbf{\Psi = 0}\\ \text{where,}&&  &\mathbf{\Psi = \mu_2 - \mu_3} \\ \text{with}&&  &c_1 = 0, c_2 = 1, c_3 = -1 \end{align}

  Question 2: Are the drug treatments effective?

\begin{align} \text{That is, consider testing:}&& &H_0\colon \mathbf{\mu_1} = \frac{\mathbf{\mu_2+\mu_3}}{2}\\ \text{This is equivalent to testing,}&&  &H_0\colon \mathbf{\Psi = 0}\\ \text{where,}&&  &\mathbf{\Psi} = \mathbf{\mu}_1 - \frac{1}{2}\mathbf{\mu}_2 - \frac{1}{2}\mathbf{\mu}_3 \\ \text{with}&&  &c_1 = 1, c_2 = c_3 = -\frac{1}{2}\end{align}

The contrast

\(\mathbf{\Psi} = \sum_{i=1}^{g}c_i \mu_i\)

is estimated by replacing the population mean vectors with the corresponding sample mean vectors:

\(\mathbf{\hat{\Psi}} = \sum_{i=1}^{g}c_i\mathbf{\bar{Y}}_i.\)

Because the estimated contrast is a function of random data, the estimated contrast is also a random vector. So the estimated contrast has a population mean vector and population variance-covariance matrix. The population mean of the estimated contrast is \(\mathbf{\Psi}\). The variance-covariance matrix of \(\hat{\mathbf{\Psi}}\) is:

\(\left(\sum\limits_{i=1}^{g}\frac{c^2_i}{n_i}\right)\Sigma\)

which is estimated by substituting the pooled variance-covariance matrix for the population variance-covariance matrix

\(\left(\sum\limits_{i=1}^{g}\frac{c^2_i}{n_i}\right)\mathbf{S}_p = \left(\sum\limits_{i=1}^{g}\frac{c^2_i}{n_i}\right) \dfrac{\mathbf{E}}{N-g}\)

Two contrasts

\(\Psi_1 = \sum_{i=1}^{g}c_i\mathbf{\mu}_i\) and \(\Psi_2 = \sum_{i=1}^{g}d_i\mathbf{\mu}_i\)

are orthogonal if

\(\sum\limits_{i=1}^{g}\frac{c_id_i}{n_i}=0\)

The importance of orthogonal contrasts can be illustrated by considering the following paired comparisons:

\(H^{(1)}_0\colon \mu_1 = \mu_2\)

\(H^{(2)}_0\colon \mu_1 = \mu_3\)

\(H^{(3)}_0\colon \mu_2 = \mu_3\)

We might reject \(H^{(3)}_0\) but fail to reject \(H^{(1)}_0\) and \(H^{(2)}_0\). However, if \(H^{(3)}_0\) is false, then \(H^{(1)}_0\) and \(H^{(2)}_0\) cannot both be true.

  • For balanced data (i.e., \(n _ { 1 } = n _ { 2 } = \ldots = n _ { g }\) ), \(\mathbf{\Psi}_1\) and \(\mathbf{\Psi}_2\) are orthogonal contrasts if \(\sum_{i=1}^{g}c_id_i = 0\)
  • If \(\mathbf{\Psi}_1\) and \(\mathbf{\Psi}_2\) are orthogonal contrasts, then the elements of \(\hat{\mathbf{\Psi}}_1\) and \(\hat{\mathbf{\Psi}}_2\) are uncorrelated
  • If \(\mathbf{\Psi}_1\) and \(\mathbf{\Psi}_2\) are orthogonal contrasts, then the tests for \(H_{0} \colon \mathbf{\Psi}_1= 0\) and \(H_{0} \colon \mathbf{\Psi}_2= 0\) are independent of one another. That is, the results of the test have no impact on the results of the other test.
  • For g groups, it is always possible to construct g - 1 mutually orthogonal contrasts.
  • If \(\mathbf{\Psi}_1, \mathbf{\Psi}_2, \dots, \mathbf{\Psi}_{g-1}\) are orthogonal contrasts, then for each ANOVA table, the treatment sum of squares can be partitioned into \(SS_{treat} = SS_{\Psi_1}+SS_{\Psi_2}+\dots + SS_{\Psi_{g-1}} \)
  • Similarly, the hypothesis sum of squares and cross-products matrix may be partitioned: \(\mathbf{H} = \mathbf{H}_{\Psi_1}+\mathbf{H}_{\Psi_2}+\dots+\mathbf{H}_{\Psi_{g-1}}\)

8.7 - Constructing Orthogonal Contrasts

The following shows two examples to construct orthogonal contrasts. In each example, we consider balanced data; that is, there are equal numbers of observations in each group.

Example 8-6:

In some cases, it is possible to draw a tree diagram illustrating the hypothesized relationships among the treatments. In the following tree, we wish to compare 5 different populations of subjects. Prior to collecting the data, we may have reason to believe that populations 2 and 3 are most closely related. Populations 4 and 5 are also closely related, but not as closely as populations 2 and 3. Population 1 is more closely related to populations 2 and 3 than to populations 4 and 5.

Each branch (denoted by the letters A, B, C, and D) corresponds to a hypothesis we may wish to test. This yields the contrast coefficients as shown in each row of the following table:

Treatments

Contrasts 1 2 3 4 5
A \(\dfrac{1}{3}\) \(\dfrac{1}{3}\) \(\dfrac{1}{3}\) \(- \dfrac { 1 } { 2 }\) \(- \dfrac { 1 } { 2 }\)
B 1 \(- \dfrac { 1 } { 2 }\) \(- \dfrac { 1 } { 2 }\) 0 0
C 0 0 0 1 -1
D 0 1 -1 0 0

Consider Contrast A. Here, we are comparing the mean of all subjects in populations 1,2, and 3 to the mean of all subjects in populations 4 and 5.

For Contrast B, we compare population 1 (receiving a coefficient of +1) with the mean of populations 2 and 3 (each receiving a coefficient of -1/2). Multiplying the corresponding coefficients of contrasts A and B, we obtain:

(1/3) × 1 + (1/3) × (-1/2) + (1/3) × (-1/2) + (-1/2) × 0 + (-1/2) × 0 = 1/3 - 1/6 - 1/6 + 0 + 0 = 0

So contrasts A and B are orthogonal. Similar computations can be carried out to confirm that all remaining pairs of contrasts are orthogonal to one another.
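Those "similar computations" are quick to verify in a few lines of Python (a sketch; the coefficients are exactly those in the table above):

```python
import numpy as np
from itertools import combinations

# Contrast coefficients from the table (balanced data, so plain dot products suffice).
contrasts = {
    "A": np.array([1/3, 1/3, 1/3, -1/2, -1/2]),
    "B": np.array([1, -1/2, -1/2, 0, 0]),
    "C": np.array([0, 0, 0, 1, -1]),
    "D": np.array([0, 1, -1, 0, 0]),
}

for name, c in contrasts.items():
    assert np.isclose(c.sum(), 0)                    # each row is a valid contrast
for (n1, c1), (n2, c2) in combinations(contrasts.items(), 2):
    status = "orthogonal" if np.isclose(c1 @ c2, 0) else "NOT orthogonal"
    print(f"{n1} vs {n2}: {status}")
```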

Example 8-7:

Consider the factorial arrangement of drug type and drug dose treatments:

Dose

Drug Low High
A 1 2
B 3 4

Here, treatment 1 is equivalent to a low dose of drug A, treatment 2 is equivalent to a high dose of drug A, etc. For this factorial arrangement of drug type and drug dose treatments, we can form the orthogonal contrasts:

Treatments

Contrasts A, Low A, High B, Low B, High
Drug \(- \dfrac{1}{2}\) \(- \dfrac{1}{2}\) \( \dfrac{1}{2}\) \( \dfrac{1}{2}\)
Dose \(- \dfrac{1}{2}\) \( \dfrac{1}{2}\) \(- \dfrac { 1 } { 2 }\) \( \dfrac{1}{2}\)
Interaction \( \dfrac{1}{2}\) \(- \dfrac{1}{2}\) \(- \dfrac{1}{2}\) \( \dfrac{1}{2}\)

To test for the effects of drug type, we give coefficients with a negative sign for drug A, and positive signs for drug B. Because there are two doses within each drug type, the coefficients take values of plus or minus 1/2.

Similarly, to test for the effects of drug dose, we give coefficients with negative signs for the low dose, and positive signs for the high dose. Because there are two drugs for each dose, the coefficients take values of plus or minus 1/2.

The final test considers the null hypothesis that the effect of the drug does not depend on the dose, or conversely, the effect of the dose does not depend on the drug. In either case, we are testing the null hypothesis that there is no interaction between the drug and dose. The coefficients for this interaction are obtained by multiplying the signs of the coefficients for drug and dose. Thus, for drug A at the low dose, we multiply "-" (for the drug effect) times "-" (for the dose-effect) to obtain "+" (for the interaction). Similarly, for drug A at the high dose, we multiply "-" (for the drug effect) times "+" (for the dose-effect) to obtain "-" (for the interaction). The remaining coefficients are obtained similarly.
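The interaction row can be obtained mechanically as the (rescaled) elementwise product of the drug and dose rows; a small sketch:

```python
import numpy as np

# Treatment order: (A, Low), (A, High), (B, Low), (B, High)
drug = np.array([-0.5, -0.5, 0.5, 0.5])     # drug A vs. drug B
dose = np.array([-0.5, 0.5, -0.5, 0.5])     # low vs. high dose

# Interaction: multiply the signs of the drug and dose coefficients
# (the factor of 2 just rescales the entries back to +/- 1/2).
interaction = 2 * drug * dose
print(interaction)                          # [ 0.5 -0.5 -0.5  0.5]

# All three rows sum to zero and are mutually orthogonal.
for c in (drug, dose, interaction):
    assert np.isclose(c.sum(), 0)
assert np.isclose(drug @ dose, 0)
assert np.isclose(drug @ interaction, 0)
assert np.isclose(dose @ interaction, 0)
```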

Example 8-8: Pottery Data

Recall the specific questions:

  • Does the mean chemical content of pottery from Ashley Rails and Isle Thorns equal that of pottery from Caldicot and Llanedyrn?
  • Does the mean chemical content of pottery from Ashley Rails equal that of pottery from Isle Thorns?
  • Does the mean chemical content of pottery from Caldicot equal that of pottery from Llanedyrn?

These questions correspond to the theoretical relationships among the sites noted earlier (Ashley Rails paired with Isle Thorns, and Caldicot paired with Llanedyrn), which suggest the following contrasts:

Sites

Contrasts Ashley Rails Caldicot Isle Thorns Llanedyrn
1 \( \dfrac{1}{2}\) \(- \dfrac{1}{2}\) \( \dfrac{1}{2}\) \(- \dfrac{1}{2}\)
2 1 0 -1 0
3 0 1 0 -1
\(n_i\) 5 2 5 14

Contrasts 1 and 2 are orthogonal:

\[\sum_{i=1}^{g}  \frac{c_id_i}{n_i} = \frac{0.5 \times 1}{5} + \frac{(-0.5)\times 0}{2}+\frac{0.5 \times (-1)}{5} +\frac{(-0.5)\times 0}{14} = 0\]

However, contrasts 1 and 3 are not orthogonal:

\[\sum_{i=1}^{g} \frac{c_id_i}{n_i} = \frac{0.5 \times 0}{5} + \frac{(-0.5)\times 1}{2}+\frac{0.5 \times 0}{5} +\frac{(-0.5)\times (-1) }{14} = -\frac{6}{28} \ne 0\]

Solution: Instead of estimating the mean of pottery collected from Caldicot and Llanedyrn by

\[\frac{\mathbf{\bar{y}_2+\bar{y}_4}}{2}\]

we can weight by sample size:

\[\frac{n_2\mathbf{\bar{y}_2}+n_4\mathbf{\bar{y}_4}}{n_2+n_4} = \frac{2\mathbf{\bar{y}}_2+14\bar{\mathbf{y}}_4}{16}\]

Similarly, the mean of pottery collected from Ashley Rails and Isle Thorns may be estimated by

\[\frac{n_1\mathbf{\bar{y}_1}+n_3\mathbf{\bar{y}_3}}{n_1+n_3} = \frac{5\mathbf{\bar{y}}_1+5\bar{\mathbf{y}}_3}{10} = \frac{8\mathbf{\bar{y}}_1+8\bar{\mathbf{y}}_3}{16}\]

This yields the Orthogonal Contrast Coefficients:

Sites

Contrasts Ashley Rails Caldicot Isle Thorns Llanedyrn
1 \( \dfrac{8}{16}\) \(- \dfrac{2}{16}\) \( \dfrac{8}{16}\) \(- \dfrac{14}{16}\)
2 1 0 -1 0
3 0 1 0 -1
\(n_i\) 5 2 5 14
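A quick sketch checking these coefficients against the sample-size-weighted orthogonality condition (site order and n_i as in the table):

```python
import numpy as np

# Site order: Ashley Rails, Caldicot, Isle Thorns, Llanedyrn
n = np.array([5, 2, 5, 14])
c1_naive    = np.array([1/2, -1/2, 1/2, -1/2])         # contrast 1 built from unweighted means
c1_weighted = np.array([8/16, -2/16, 8/16, -14/16])    # contrast 1 weighted by sample size
c2 = np.array([1, 0, -1, 0])                           # contrast 2
c3 = np.array([0, 1, 0, -1])                           # contrast 3

def weighted_inner(c, d):
    """Sum of c_i * d_i / n_i; zero means the two contrasts are orthogonal."""
    return np.sum(c * d / n)

print(weighted_inner(c1_naive, c3))      # -6/28, not zero -> not orthogonal
print(weighted_inner(c1_weighted, c2))   # 0 -> orthogonal
print(weighted_inner(c1_weighted, c3))   # 0 -> orthogonal
```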

These contrasts are implemented in the SAS program referenced above (pottery.sas).

Orthogonal contrast for MANOVA is not available in Minitab at this time.

The following table of estimated contrasts is obtained

Contrasts

Element \(\widehat { \Psi } _ { 1 }\) \(\widehat { \Psi } _ { 2 }\) \(\widehat { \Psi } _ { 3 }\)
Al 5.29 -0.86 -0.86
Fe -4.64 -0.20 -0.96
Mg -4.06 -0.07 -0.97
Ca -0.17 0.03 0.09
Na -0.17 -0.01 -0.20

These results suggest:

  • Pottery from Ashley Rails and Isle Thorns has higher aluminum and lower iron, magnesium, calcium, and sodium concentrations than pottery from Caldicot and Llanedyrn.

  • Pottery from Ashley Rails has higher calcium and lower aluminum, iron, magnesium, and sodium concentrations than pottery from Isle Thorns.
  • Pottery from Caldicot has higher calcium and lower aluminum, iron, magnesium, and sodium concentrations than pottery from Llanedyrn.

8.8 - Hypothesis Tests

The suggestions dealt with in the previous page are not backed up by appropriate hypothesis tests. Consider hypothesis tests of the form:

\(H_0\colon \Psi = 0\) against \(H_a\colon \Psi \ne 0\)

Univariate Case:

For the univariate case, we may compute the sums of squares for the contrast:

\(SS_{\Psi} = \frac{\hat{\Psi}^2}{\sum_{i=1}^{g}\frac{c^2_i}{n_i}}\)

This sum of squares has only 1 d.f. so that the mean square for the contrast is

\(MS_{\Psi}= SS_{\Psi}\)

Then compute the F -ratio:

\(F = \frac{MS_{\Psi}}{MS_{error}}\)

Reject \(H_{0} \colon \Psi = 0\) at level \(\alpha\) if

\(F > F_{1, N-g, \alpha}\)

Multivariate Case

For the multivariate case, the sums of squares for the contrast is replaced by the hypothesis sum of squares and cross-products matrix for the contrast:

\(\mathbf{H}_{\mathbf{\Psi}} = \dfrac{\mathbf{\hat{\Psi}\hat{\Psi}'}}{\sum_{i=1}^{g}\frac{c^2_i}{n_i}}\)

Compute Wilks Lambda

\(\Lambda^*_{\mathbf{\Psi}} = \dfrac{|\mathbf{E}|}{\mathbf{|H_{\Psi}+E|}}\)

Compute the F -statistic

\(F = \left(\dfrac{1-\Lambda^*_{\mathbf{\Psi}}}{\Lambda^*_{\mathbf{\Psi}}}\right)\left(\dfrac{N-g-p+1}{p}\right)\)

Reject \(H_0\colon \mathbf{\Psi = 0} \) at level \(\alpha\) if

\(F > F_{p, N-g-p+1, \alpha}\)
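A sketch of this multivariate contrast test in Python, assuming the group sample mean vectors, the group sizes, and the error SSCP matrix E have already been computed (for example, with the MANOVA sketch from Section 8.2):

```python
import numpy as np
from scipy.stats import f

def test_contrast(c, ybars, n, E, N, g, p):
    """Wilks' Lambda test of H0: Psi = sum_i c_i * mu_i = 0 for one contrast."""
    c = np.asarray(c, dtype=float)
    psi_hat = sum(ci * ybar for ci, ybar in zip(c, ybars))   # estimated contrast vector
    scale = np.sum(c**2 / np.asarray(n, dtype=float))        # sum of c_i^2 / n_i
    H_psi = np.outer(psi_hat, psi_hat) / scale               # hypothesis SSCP matrix for the contrast
    Lam = np.linalg.det(E) / np.linalg.det(H_psi + E)
    F_stat = ((1 - Lam) / Lam) * (N - g - p + 1) / p
    p_value = f.sf(F_stat, p, N - g - p + 1)
    return Lam, F_stat, p_value
```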

Example 8-9: Pottery Data

The following table gives the results of testing the null hypothesis that each of the contrasts is equal to zero. You should be able to find these numbers in the output by downloading the SAS program here: pottery.sas .

Contrast \(\Lambda _ { \Psi } ^ { * }\) F d.f. p-value
1 0.0284 122.81 5, 18 <0.0001
2 0.9126 0.34 5, 18 0.8788
3 0.4487 4.42 5, 18 0.0084

Conclusions

  • The mean chemical content of pottery from Ashley Rails and Isle Thorns differs in at least one element from that of Caldicot and Llanedyrn \(\left( \Lambda _ { \Psi } ^ { * } = 0.0284; F = 122.81; d.f. = 5, 18; p < 0.0001 \right) \).
  • There is no significant difference in the mean chemical contents between Ashley Rails and Isle Thorns \(\left( \Lambda _ { \Psi } ^ { * } = 0.9126; F = 0.34; d.f. = 5, 18; p = 0.8788 \right) \).
  • The mean chemical content of pottery from Caldicot differs in at least one element from that of Llanedyrn \(\left( \Lambda _ { \Psi } ^ { * } = 0.4487; F = 4.42; d.f. = 5, 18; p = 0.0084 \right) \).

Once we have rejected the null hypothesis that contrast is equal to zero, we can compute simultaneous or Bonferroni confidence intervals for the contrast:

Simultaneous Confidence Intervals

Simultaneous \((1 - α) × 100\%\) Confidence Intervals for the Elements of \(\Psi\) are obtained as follows:

\(\hat{\Psi}_j \pm \sqrt{\dfrac{p(N-g)}{N-g-p+1}F_{p, N-g-p+1, \alpha}}\; SE(\hat{\Psi}_j)\)

\(SE(\hat{\Psi}_j) = \sqrt{\left(\sum\limits_{i=1}^{g}\dfrac{c^2_i}{n_i}\right)\dfrac{e_{jj}}{N-g}}\)

where \(e_{jj}\) is the \( \left( j, j \right)^{th}\) element of the error sum of squares and cross products matrix, and is equal to the error sums of squares for the analysis of variance of variable j .

Recall that we have p = 5 chemical constituents, g = 4 sites, and a total of N = 26 observations. From the F-table, we have \(F_{5,18,0.05} = 2.77\). Then our multiplier is

\begin{align} M &= \sqrt{\frac{p(N-g)}{N-g-p+1}F_{5,18,0.05}}\\[10pt] &= \sqrt{\frac{5(26-4)}{26-4-5+1}\times 2.77}\\[10pt] &= 4.114 \end{align}

Simultaneous 95% Confidence Intervals are computed in the following table. The elements of the estimated contrast together with their standard errors are found at the bottom of each page, giving the results of the individual ANOVAs. For example, the estimated contrast from aluminum is 5.294 with a standard error of 0.5972. The fourth column is obtained by multiplying the standard errors by M = 4.114. So, for example, 0.5972 × 4.114 = 2.457. Finally, the confidence interval for aluminum is 5.294 plus/minus 2.457:

Element \(\widehat { \Psi }\) SE\(\left(\widehat { \Psi }\right)\) \(M \times SE \left(\widehat { \Psi }\right)\) Confidence Interval
Al 5.294 0.5972 2.457 2.837, 7.751
Fe -4.640 0.2844 1.170 -5.810, -3.470
Mg -4.065 0.3376 1.389 -5.454, -2.676
Ca -0.175 0.0195 0.080 -0.255, -0.095
Na -0.175 0.0384 0.158 -0.333, -0.017
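The multiplier M and the aluminum interval in this table can be reproduced with a few lines of Python (the estimated contrast and its standard error are taken from the text above):

```python
import numpy as np
from scipy.stats import f

p, g, N, alpha = 5, 4, 26, 0.05
F_crit = f.ppf(1 - alpha, p, N - g - p + 1)            # F_{5,18,0.05} ~ 2.77
M = np.sqrt(p * (N - g) / (N - g - p + 1) * F_crit)    # ~ 4.11

psi_hat, se = 5.294, 0.5972                            # aluminum, contrast 1 (from the text)
lower, upper = psi_hat - M * se, psi_hat + M * se
print(f"M = {M:.3f}; simultaneous 95% CI for Al: ({lower:.3f}, {upper:.3f})")
```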

Simultaneous 95% Confidence Intervals for Contrast 3 are obtained similarly to those for Contrast 1.

Element \(\widehat{\Psi}\) SE( \(\widehat{\Psi}\) ) \(M\times SE(\widehat{\Psi})\) Confidence Interval
Al -0.864 1.1199 4.608 -5.472, 3.744
Fe -0.957 0.5333 2.194 -3.151, 1.237
Mg -0.971 0.6331 2.605 -3.576, 1.634
Ca  0.093 0.0366 0.150 -0.057, 0.243
Na -0.201 0.0719 0.296 -0.497, 0.095

All of the above confidence intervals cover zero. Therefore, the significant difference between Caldicot and Llanedyrn appears to be due to the combined contributions of the various variables.

Bonferroni Confidence Intervals

Bonferroni \((1 - α) × 100\%\) Confidence Intervals for the Elements of Ψ are obtained as follows:

\(\hat{\Psi}_j \pm t_{N-g, \frac{\alpha}{2p}}SE(\hat{\Psi}_j)\)

where

\(SE(\hat{\Psi}_j) = \sqrt{\left(\sum\limits_{i=1}^{g}\dfrac{c^2_i}{n_i}\right)\dfrac{e_{jj}}{N-g}}\)

and \(e_{jj}\) is the \( \left( j, j \right)^{th}\) element of the error sum of squares and cross products matrix and is equal to the error sums of squares for the analysis of variance of variable j .

Here, with \(\alpha/(2p) = 0.05/(2 \times 5) = 0.005\), we have \(t_{22,0.005} = 2.819\). The Bonferroni 95% Confidence Intervals for Contrast 1 are:

Element \(\widehat{\Psi}\) SE( \(\widehat{\Psi}\) ) \(t\times SE(\widehat{\Psi})\) Confidence Interval
Al 5.294 0.5972 1.684 3.610, 6.978
Fe -4.640 0.2844 0.802 -5.442, -3.838
Mg -4.065 0.3376 0.952 -5.017, -3.113
Ca -0.175 0.0195 0.055 -0.230, -0.120
Na -0.175 0.0384 0.108 -0.283, -0.067

Bonferroni 95% Confidence Intervals for Contrast 3 (using the same multiplier, t = 2.819):

Element \(\widehat{\Psi}\) SE( \(\widehat{\Psi}\) ) \(M\times SE(\widehat{\Psi})\) Confidence Interval
Al -0.864 1.1199 3.157 -4.021, 2.293
Fe -0.957 0.5333 1.503 -2.460, 0.546
Mg -0.971 0.6331 1.785 -2.756, 0.814
Ca 0.093 0.0366 0.103 -0.010, 0.196
Na -0.201 0.0719 0.203 -0.404, 0.001

All resulting intervals cover 0 so there are no significant results.

8.9 - Randomized Block Design: Two-way MANOVA

Within randomized block designs, we have two factors:

  • Blocks, and

A randomized complete block design with a treatments and b blocks is constructed in two steps:

  • The experimental units (the units to which our treatments are going to be applied) are partitioned into b blocks, each comprised of a units.
  • Treatments are randomly assigned to the experimental units in such a way that each treatment appears once in each block.

Randomized block designs are often applied in agricultural settings. The example below will make this clearer.

In general, the blocks should be partitioned so that:

  • Units within blocks are as uniform as possible.
  • Differences between blocks are as large as possible.

These conditions will generally give you the most powerful results.
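To make the construction concrete, here is a minimal R sketch of randomly assigning treatments within blocks; the four treatment labels and five blocks anticipate the rice example below, and the layout produced is of course just one random draw.

```r
# Randomly assign each of 4 treatments once within each of 5 blocks
set.seed(1)                     # for a reproducible layout
treatments <- c("A", "B", "C", "D")
layout <- sapply(1:5, function(b) sample(treatments))   # one random order per block
colnames(layout) <- paste("Block", 1:5)
rownames(layout) <- paste("Plot", 1:4)
layout
```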

Example 8-10: Rice Data (Experimental Design)

Let us look at an example of such a design involving rice.

We have four different varieties of rice; varieties A, B, C, and D. And, we have five different blocks in our study. So, imagine each of these blocks as a rice field or paddy on a farm somewhere. These blocks are just different patches of land, and each block is partitioned into four plots. Then we randomly assign which variety goes into which plot in each block. You will note that variety A appears once in each block, as does each of the other varieties. This is how the randomized block design experiment is set up.

The layout used to compare the 4 varieties of rice in the 5 blocks is shown below (each block is divided into four plots, two per row):

Block 1  Block 2  Block 3  Block 4  Block 5
D  C     A  D     B  D     D  B     A  C
B  A     B  C     C  A     A  C     D  B

This type of experimental design is also used in medical trials where people with similar characteristics are in each block. This may be people who weigh about the same, are of the same sex, same age, or whatever factor is deemed important for that particular experiment. So generally, what you want is for people within each of the blocks to be similar to one another.

Back to the rice data... In each of the partitions within each of the five blocks, one of the four varieties of rice would be planted. In this experiment, the height of the plant and the number of tillers per plant were measured six weeks after transplanting. Both of these measurements are indicators of how vigorous the growth is. The taller the plant and the greater number of tillers, the healthier the plant is, which should lead to a higher rice yield.

In general, randomized block design data should look like this:

Table of randomized block design data (treatments in rows, blocks in columns)
Treatment Block 1 Block 2 \(\cdots\) Block b
1 \(\mathbf{Y_{11}} = \begin{pmatrix} Y_{111} \\ Y_{112} \\ \vdots \\ Y_{11p} \end{pmatrix}\) \(\mathbf{Y_{12}} = \begin{pmatrix} Y_{121} \\ Y_{122} \\ \vdots \\ Y_{12p} \end{pmatrix}\) \(\cdots\) \(\mathbf{Y_{1b}} = \begin{pmatrix} Y_{1b1} \\ Y_{1b2} \\ \vdots \\ Y_{1bp} \end{pmatrix}\)
2 \(\mathbf{Y_{21}} = \begin{pmatrix} Y_{211} \\ Y_{212} \\ \vdots \\ Y_{21p} \end{pmatrix}\) \(\mathbf{Y_{22}} = \begin{pmatrix} Y_{221} \\ Y_{222} \\ \vdots \\ Y_{22p} \end{pmatrix}\) \(\cdots\) \(\mathbf{Y_{2b}} = \begin{pmatrix} Y_{2b1} \\ Y_{2b2} \\ \vdots \\ Y_{2bp} \end{pmatrix}\)
\(\vdots\)
a \(\mathbf{Y_{a1}} = \begin{pmatrix} Y_{a11} \\ Y_{a12} \\ \vdots \\ Y_{a1p} \end{pmatrix}\) \(\mathbf{Y_{a2}} = \begin{pmatrix} Y_{a21} \\ Y_{a22} \\ \vdots \\ Y_{a2p} \end{pmatrix}\) \(\cdots\) \(\mathbf{Y_{ab}} = \begin{pmatrix} Y_{ab1} \\ Y_{ab2} \\ \vdots \\ Y_{abp} \end{pmatrix}\)

We have a rows for the a treatments. In this case, we would have four rows, one for each of the four varieties of rice. We also set up b columns for b blocks. In this case, we have five columns, one for each of the five blocks. In each block, for each treatment, we are going to observe a vector of variables.

Our notation is as follows:

  • Let \(Y_{ijk}\) = observation for variable k from block j in treatment i

\(\mathbf{Y_{ij}} = \left(\begin{array}{c}Y_{ij1}\\Y_{ij2}\\\vdots \\ Y_{ijp}\end{array}\right)\)

  • a = Number of Treatments
  • b = Number of Blocks

8.10 - Two-way MANOVA Additive Model and Assumptions

A model is formed for a two-way multivariate analysis of variance.

Two-way MANOVA Additive Model

\(\underset{\mathbf{Y}_{ij}}{\underbrace{\left(\begin{array}{c}Y_{ij1}\\Y_{ij2}\\ \vdots \\ Y_{ijp}\end{array}\right)}} = \underset{\mathbf{\nu}}{\underbrace{\left(\begin{array}{c}\nu_1 \\ \nu_2 \\ \vdots \\ \nu_p \end{array}\right)}}+\underset{\mathbf{\alpha}_{i}}{\underbrace{\left(\begin{array}{c} \alpha_{i1} \\ \alpha_{i2} \\ \vdots \\ \alpha_{ip}\end{array}\right)}}+\underset{\mathbf{\beta}_{j}}{\underbrace{\left(\begin{array}{c}\beta_{j1} \\ \beta_{j2} \\ \vdots \\ \beta_{jp}\end{array}\right)}} + \underset{\mathbf{\epsilon}_{ij}}{\underbrace{\left(\begin{array}{c}\epsilon_{ij1} \\ \epsilon_{ij2} \\ \vdots \\ \epsilon_{ijp}\end{array}\right)}}\)

In this model:

  • \(\mathbf{Y}_{ij}\) is the p × 1 vector of observations for treatment i in block j;

This vector of observations is written as a function of the following:

  • \(\nu_{k}\) is the overall mean for variable k ; these are collected into the overall mean vector \(\boldsymbol{\nu}\)
  • \(\alpha_{ik}\) is the effect of treatment i on variable k ; these are collected into the treatment effect vector \(\boldsymbol{\alpha}_{i}\)
  • \(\beta_{jk}\) is the effect of block j on variable k ; these are collected in the block effect vector \(\boldsymbol{\beta}_{j}\)
  • \(\varepsilon_{ijk}\) is the experimental error for treatment i , block j , and variable k ; these are collected into the error vector \(\boldsymbol{\varepsilon}_{ij}\)

These are fairly standard assumptions with one extra one added.

  • The error vectors \(\varepsilon_{ij}\) have zero population mean;
  • The error vectors \(\varepsilon_{ij}\) have a common variance-covariance matrix \(\Sigma\) — (the usual assumption of a homogeneous variance-covariance matrix)
  • The error vectors \(\varepsilon_{ij}\) are independently sampled;
  • The error vectors \(\varepsilon_{ij}\) are sampled from a multivariate normal distribution;
  • There is no block-by-treatment interaction. This means that the effect of the treatment is not affected by, or does not depend on the block.

The first four are the usual MANOVA assumptions; the absence of a block-by-treatment interaction (additivity) is the extra assumption required by this model.

We could define the treatment mean vector for treatment i such that:

\(\mu_i = \nu +\alpha_i\)

Here we could consider testing the null hypothesis that all of the treatment mean vectors are identical,

\(H_0\colon \boldsymbol{\mu_1 = \mu_2 = \dots = \mu_a}\)

or equivalently, the null hypothesis that there is no treatment effect:

\(H_0\colon \boldsymbol{\alpha_1 = \alpha_2 = \dots = \alpha_a = 0}\)

This is the same null hypothesis that we tested in the One-way MANOVA.

We would test this against the alternative hypothesis that there is a difference between at least one pair of treatments on at least one variable, or:

\(H_a\colon \mu_{ik} \ne \mu_{jk}\) for at least one \(i \ne j\) and at least one variable \(k\)

We will use standard dot notation to define mean vectors for treatments, mean vectors for blocks, and a grand mean vector.

We define a mean vector for treatment i :

\(\mathbf{\bar{y}}_{i.} = \frac{1}{b}\sum_{j=1}^{b}\mathbf{Y}_{ij} = \left(\begin{array}{c}\bar{y}_{i.1}\\ \bar{y}_{i.2} \\ \vdots \\ \bar{y}_{i.p}\end{array}\right)\) = Sample mean vector for treatment i .

In this case, it is comprised of the sample means for the i th treatment on each of the p variables, and it is obtained by summing over the blocks and then dividing by the number of blocks. The dot appears in the second position, indicating that we sum over the second subscript, the position assigned to the blocks.

For example, \(\bar{y}_{i.k} = \frac{1}{b}\sum_{j=1}^{b}Y_{ijk}\) = Sample mean for variable k and treatment i .

We  define a mean vector for block j :

\(\mathbf{\bar{y}}_{.j} = \frac{1}{a}\sum_{i=1}^{a}\mathbf{Y}_{ij} = \left(\begin{array}{c}\bar{y}_{.j1}\\ \bar{y}_{.j2} \\ \vdots \\ \bar{y}_{.jp}\end{array}\right)\) = Sample mean vector for block j .

Here we will sum over the treatments in each of the blocks so the dot appears in the first position. Therefore, this is essentially the block means for each of our variables.

For example, \(\bar{y}_{.jk} = \frac{1}{a}\sum_{i=1}^{a}Y_{ijk}\) = Sample mean for variable k and block j .

Finally, we define the Grand mean vector by summing all of the observation vectors over the treatments and the blocks. So you will see the double dots appearing in this case:

\(\mathbf{\bar{y}}_{..} = \frac{1}{ab}\sum_{i=1}^{a}\sum_{j=1}^{b}\mathbf{Y}_{ij} = \left(\begin{array}{c}\bar{y}_{..1}\\ \bar{y}_{..2} \\ \vdots \\ \bar{y}_{..p}\end{array}\right)\) = Grand mean vector.

This involves dividing by a × b , which is the sample size in this case.

For example, \(\bar{y}_{..k}=\frac{1}{ab}\sum_{i=1}^{a}\sum_{j=1}^{b}Y_{ijk}\) = Grand mean for variable k .

As before, we will define the Total Sum of Squares and Cross Products Matrix . This is the same definition that we used in the One-way MANOVA. It involves comparing the observation vectors for the individual subjects to the grand mean vector.

\(\mathbf{T = \sum_{i=1}^{a}\sum_{j=1}^{b}(Y_{ij}-\bar{y}_{..})(Y_{ij}-\bar{y}_{..})'}\)

Here, the \( \left(k, l \right)^{th}\) element of T is

\(\sum_{i=1}^{a}\sum_{j=1}^{b}(Y_{ijk}-\bar{y}_{..k})(Y_{ijl}-\bar{y}_{..l}).\)

  • For \( k = l \), this is the total sum of squares for variable k and measures the total variation in variable k.
  • For \( k ≠ l \), this measures the association or dependency between variables k and l across all observations.

In this case, the total sum of squares and cross products matrix may be partitioned into three matrices, three different sums of squares cross-product matrices:

\begin{align} \mathbf{T} &= \underset{\mathbf{H}}{\underbrace{b\sum_{i=1}^{a}\mathbf{(\bar{y}_{i.}-\bar{y}_{..})(\bar{y}_{i.}-\bar{y}_{..})'}}}\\&+\underset{\mathbf{B}}{\underbrace{a\sum_{j=1}^{b}\mathbf{(\bar{y}_{.j}-\bar{y}_{..})(\bar{y}_{.j}-\bar{y}_{..})'}}}\\ &+\underset{\mathbf{E}}{\underbrace{\sum_{i=1}^{a}\sum_{j=1}^{b}\mathbf{(Y_{ij}-\bar{y}_{i.}-\bar{y}_{.j}+\bar{y}_{..})(Y_{ij}-\bar{y}_{i.}-\bar{y}_{.j}+\bar{y}_{..})'}}} \end{align}

As shown above:

  • H is the Treatment Sum of Squares and Cross Products matrix;
  • B is the Block Sum of Squares and Cross Products matrix;
  • E is the Error Sum of Squares and Cross Products matrix.

The \( \left(k, l \right)^{th}\) element of the Treatment Sum of Squares and Cross Products matrix H is

\(b\sum_{i=1}^{a}(\bar{y}_{i.k}-\bar{y}_{..k})(\bar{y}_{i.l}-\bar{y}_{..l})\)

  • If \(k = l\), this is the treatment sum of squares for variable k, and measures variation between treatments.
  • If \( k ≠ l \), this measures how variables k and l vary together across treatments.

The \( \left(k, l \right)^{th}\) element of the Block Sum of Squares and Cross Products matrix B is

\(a\sum_{j=1}^{b}(\bar{y}_{.jk}-\bar{y}_{..k})(\bar{y}_{.jl}-\bar{y}_{..l})\)

  • For \( k = l \), this is the block sum of squares for variable k, and measures variation between or among blocks.
  • For \( k ≠ l \), this measures how variables k and l vary together across blocks (not usually of much interest).

The \( \left(k, l \right)^{th}\) element of the Error Sum of Squares and Cross Products matrix E is

\(\sum_{i=1}^{a}\sum_{j=1}^{b}(Y_{ijk}-\bar{y}_{i.k}-\bar{y}_{.jk}+\bar{y}_{..k})(Y_{ijl}-\bar{y}_{i.l}-\bar{y}_{.jl}+\bar{y}_{..l})\)

  • For \( k = l \), this is the error sum of squares for variable k, and measures the variability within treatment and block combinations for variable k.
  • For \( k ≠ l \), this measures the association or dependence between variables k and l after you take into account treatment and block.
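These matrices are straightforward to compute directly. The sketch below is not from the source; it assumes the responses are held in a numeric matrix Y (with ab rows and p columns) accompanied by treatment and block labels, and simply translates the formulas above into R.

```r
# Compute the SSCP matrices T, H, B, and E for a randomized block design
sscp_partition <- function(Y, treat, block) {
  Y    <- as.matrix(Y)
  ybar <- colMeans(Y)                                            # grand mean vector
  ti   <- apply(Y, 2, function(col) tapply(col, treat, mean))    # treatment means (a x p)
  bj   <- apply(Y, 2, function(col) tapply(col, block, mean))    # block means (b x p)
  a <- nrow(ti); b <- nrow(bj)

  Tmat <- crossprod(sweep(Y,  2, ybar))      # total SSCP
  H    <- b * crossprod(sweep(ti, 2, ybar))  # treatment SSCP
  B    <- a * crossprod(sweep(bj, 2, ybar))  # block SSCP
  E    <- Tmat - H - B                       # error SSCP, by the partition T = H + B + E
  list(T = Tmat, H = H, B = B, E = E)
}
```

With the rice data read into a data frame, a call such as sscp_partition(rice[, c("height", "tillers")], rice$variety, rice$block) would produce the matrices summarized in the MANOVA table of the next section (the column names here are assumptions about how the file is imported).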

8.11 - Forming a MANOVA table

The partitioning of the total sum of squares and cross products matrix may be summarized in the multivariate analysis of variance table as shown below:

MANOVA
Source      SSP   d.f.
Blocks      B     b - 1
Treatments  H     a - 1
Error       E     (a - 1)(b - 1)
Total       T     ab - 1

SSP stands for the sum of squares and cross products discussed above.

To test the null hypothesis that the treatment mean vectors are equal, compute Wilks' Lambda using the following expression:

\(\Lambda^* = \dfrac{|\mathbf{E}|}{|\mathbf{H}+\mathbf{E}|}\)

This is the determinant of the error sum of squares and cross-products matrix divided by the determinant of the sum of the treatment sum of squares and cross-products matrix and the error sum of squares and cross-products matrix.

Under the null hypothesis, this has an F -approximation. The approximation is quite involved and will not be reviewed here. Instead, let's take a look at our example where we will implement these concepts.

Example 8-11: Rice Data

Rice data can be downloaded here: rice.csv

The program below shows the analysis of the rice data.

Download the SAS Program here: rice.sas

Performing a two-way MANOVA

To carry out the two-way MANOVA test in Minitab:

  • Open the ‘rice’ data set in a new worksheet.
  • For convenience, rename the columns : block, variety, height, and tillers, from left to right.
  • Highlight and select height and tillers to move them to the Responses window.
  • Highlight and select block and variety to move them to the Model window.
  • Choose 'OK' . The MANOVA results for the tests for block and variety are separately displayed in the results area.
  • We reject the null hypothesis that the variety mean vectors are identical \((\Lambda = 0.342; F = 2.60; df = 6, 22; p = 0.0463)\). At least two varieties differ in means for height and/or number of tillers.
Variable F SAS p-value Bonferroni p-value
Height 4.19 0.030 0.061
Tillers 1.27 0.327 0.654

Each test is carried out with 3 and 12 d.f. Because we have only 2 response variables, a 0.05 level test would be rejected if the p -value is less than 0.025 under a Bonferroni correction. Thus, if a strict \(α = 0.05\) level is adhered to, then neither variable shows a significant variety effect. However, if a 0.1 level test is considered, we see that there is weak evidence that the mean heights vary among the varieties ( F = 4.19; d. f. = 3, 12).

Variety Mean Height Standard Error
A 58.4 1.62
B 50.6 1.62
C 55.2 1.62
D 53.0 1.62

Variety A is the tallest, while variety B is the shortest. The standard error is obtained from:

\(SE(\bar{y}_{i.k}) = \sqrt{\dfrac{MS_{error}}{b}} = \sqrt{\dfrac{13.125}{5}} = 1.62\)

  • Looking at the partial correlation (found below the error sum of squares and cross-products matrix in the output), we see that height is not significantly correlated with the number of tillers within varieties \(( r = - 0.278 ; p = 0.3572 )\).
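For readers working in R rather than SAS or Minitab, a minimal sketch of the same two-way additive MANOVA is shown below; it assumes rice.csv has been read in with columns named (or renamed to) block, variety, height, and tillers, as in the Minitab steps above.

```r
# Two-way additive MANOVA for the rice data (blocks + varieties, no interaction)
rice <- read.csv("rice.csv")
rice$block   <- factor(rice$block)
rice$variety <- factor(rice$variety)

fit <- manova(cbind(height, tillers) ~ block + variety, data = rice)
summary(fit, test = "Wilks")   # Wilks' Lambda tests for the block and variety effects
summary.aov(fit)               # follow-up univariate ANOVAs for height and tillers
```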

8.12 - Summary

In this lesson we learned about:

  • The 1-way MANOVA for testing the null hypothesis of equality of group mean vectors;
  • Methods for diagnosing the assumptions of the 1-way MANOVA;
  • Bonferroni corrected ANOVAs to assess the significance of individual variables;
  • Construction and interpretation of orthogonal contrasts;
  • Wilks lambda for testing the significance of contrasts among group mean vectors; and
  • Simultaneous and Bonferroni confidence intervals for the elements of contrast.

In general, a thorough analysis of data would be comprised of the following steps:

1. Perform appropriate diagnostic tests for the assumptions of the MANOVA, and carry out appropriate normalizing and variance-stabilizing transformations of the variables.

2. Perform a one-way MANOVA to test for equality of group mean vectors. If this test is not significant, conclude that there is no statistically significant evidence against the null hypothesis that the group mean vectors are equal to one another and stop. If the test is significant, conclude that at least one pair of group mean vectors differ on at least one element and go on to Step 3.

3. Perform Bonferroni-corrected ANOVAs on the individual variables to determine which variables are significantly different among groups.

4. Construct up to g - 1 orthogonal contrasts based on specific scientific questions regarding the relationships among the groups.

5. Use Wilks' lambda to test the significance of each contrast defined in Step 4.

6. For the significant contrasts only, construct simultaneous or Bonferroni confidence intervals for the elements of those contrasts. Draw appropriate conclusions from these confidence intervals, making sure that you note the directions of all effects (that is, which treatments or groups of treatments have the greater means for each variable).

In this lesson we also learned about:

  • How to perform multiple factor MANOVAs;
  • What conclusions may be drawn from the results of a multiple-factor MANOVA;
  • The Bonferroni corrected ANOVAs for the individual variables.

Just as in the one-way MANOVA, we carried out orthogonal contrasts among the four varieties of rice. However, in this case, it is not clear from the data description just what contrasts should be considered. If a phylogenetic tree were available for these varieties, then appropriate contrasts may be constructed.


What is Multivariate Statistical Analysis?


Multivariate statistical analysis refers to multiple advanced techniques for examining relationships among multiple variables at the same time. Researchers use multivariate procedures in studies that involve more than one dependent variable (also known as the outcome or phenomenon of interest), more than one independent variable (also known as a predictor) or both. Upper-level undergraduate courses and graduate courses in statistics teach multivariate statistical analysis. This type of analysis is desirable because researchers often hypothesize that a given outcome of interest is affected or influenced by more than one thing.

Types of Analysis

There are many statistical techniques for conducting multivariate analysis, and the most appropriate technique for a given study varies with the type of study and the key research questions. Four of the most common multivariate techniques are multiple regression analysis, factor analysis, path analysis and multiple analysis of variance, or MANOVA.

Multiple Regression

Multiple regression analysis, often referred to simply as regression analysis, examines the effects of multiple independent variables (predictors) on the value of a dependent variable, or outcome. Regression calculates a coefficient for each independent variable, as well as its statistical significance, to estimate the effect of each predictor on the dependent variable, with other predictors held constant. Researchers in economics and other social sciences often use regression analysis to study social and economic phenomena. An example of a regression study is to examine the effect of education, experience, gender, and ethnicity on income.

Factor Analysis

Factor analysis is a data reduction technique in which a researcher reduces a large number of variables to a smaller, more manageable, number of factors. Factor analysis uncovers patterns among variables and then clusters highly interrelated variables into factors. Factor analysis has many applications, but a common use is in survey research, where researchers use the technique to see if lengthy series of questions can be grouped into shorter sets.

Path Analysis

This is a graphical form of multivariate statistical analysis in which graphs known as path diagrams depict the correlations among variables, as well as the directions of those correlations and the "paths" along which these relationships travel. Statistical software programs calculate path coefficients, the values of which estimate the strength of relationships among the variables in a researcher's hypothesized model.

Multiple Analysis of Variance

Multiple Analysis of Variance, or MANOVA, is an advanced form of the more basic analysis of variance, or ANOVA. MANOVA extends the technique to studies with two or more related dependent variables while controlling for the correlations among them. An example of a study for which MANOVA would be an appropriate technique is a study of health among three groups of teens: those who exercise regularly, those who exercise on occasion, and those who never exercise. A MANOVA for this study would allow multiple health-related outcome measures such as weight, heart rate, and respiratory rates.

Benefits in Social Science

Multivariate statistical analysis is especially important in social science research because researchers in these fields are often unable to use randomized laboratory experiments that their counterparts in medicine and natural sciences often use. Instead, many social scientists must rely on quasi-experimental designs in which the experimental and control groups may have initial differences that could affect or bias the outcome of the study. Multivariate techniques try to statistically account for these differences and adjust outcome measures to control for the portion that can be attributed to the differences.

Statistical Calculations

Statistical software programs such as SAS, Stata, and SPSS can perform multivariate statistical analyses. These programs are frequently used by university researchers and other research professionals. Spreadsheet programs can perform some multivariate analyses, but they are intended for more general use and have more limited capabilities than a specialized statistical software package.




R (BGU course)

Chapter 9 multivariate data analysis.

The term “multivariate data analysis” is so broad and so overloaded, that we start by clarifying what is discussed and what is not discussed in this chapter. Broadly speaking, we will discuss statistical inference , and leave more “exploratory flavored” matters like clustering, and visualization, to the Unsupervised Learning Chapter 11 .

We start with an example.

Formally, let \(y\) be a single (random) measurement of a \(p\)-variate random vector. Denote \(\mu:=E[y]\). Here is the set of problems we will discuss, in order of their statistical difficulty.

Signal Detection : a.k.a. multivariate test , or global test , or omnibus test . Where we test whether \(\mu\) differs from some \(\mu_0\) .

Signal Counting : a.k.a. prevalence estimation , or \(\pi_0\) estimation . Where we count the number of entries in \(\mu\) that differ from \(\mu_0\) .

Signal Identification : a.k.a. selection , or multiple testing . Where we infer which of the entries in \(\mu\) differ from \(\mu_0\) . In the ANOVA literature, this is known as a post-hoc analysis, which follows an omnibus test .

Estimation : Estimating the magnitudes of entries in \(\mu\) , and their departure from \(\mu_0\) . If estimation follows a signal detection or signal identification stage, this is known as selective estimation .

9.1 Signal Detection

Signal detection deals with the detection of the departure of \(\mu\) from some \(\mu_0\) , and especially, \(\mu_0=0\) . This problem can be thought of as the multivariate counterpart of the univariate hypothesis t-test.

9.1.1 Hotelling’s T2 Test

The most fundamental approach to signal detection is a mere generalization of the t-test, known as Hotelling’s \(T^2\) test .

Recall the univariate t-statistic of a data vector \(x\) of length \(n\) : \[\begin{align} t^2(x):= \frac{(\bar{x}-\mu_0)^2}{Var[\bar{x}]}= (\bar{x}-\mu_0)Var[\bar{x}]^{-1}(\bar{x}-\mu_0), \tag{9.1} \end{align}\] where \(Var[\bar{x}]=S^2(x)/n\) , and \(S^2(x)\) is the unbiased variance estimator \(S^2(x):=(n-1)^{-1}\sum (x_i-\bar x)^2\) .

Generalizing Eq (9.1) to the multivariate case: \(\mu_0\) is a \(p\) -vector, \(\bar x\) is a \(p\) -vector, and \(Var[\bar x]\) is a \(p \times p\) matrix of the covariance between the \(p\) coordinates of \(\bar x\) . When operating with vectors, the squaring becomes a quadratic form, and the division becomes a matrix inverse. We thus have \[\begin{align} T^2(x):= (\bar{x}-\mu_0)' Var[\bar{x}]^{-1} (\bar{x}-\mu_0), \tag{9.2} \end{align}\] which is the definition of Hotelling’s \(T^2\) one-sample test statistic. We typically denote the covariance between coordinates in \(x\) with \(\hat \Sigma(x)\) , so that \(\widehat \Sigma_{k,l}:=\widehat {Cov}[x_k,x_l]=(n-1)^{-1} \sum (x_{k,i}-\bar x_k)(x_{l,i}-\bar x_l)\) . Using the \(\Sigma\) notation, Eq. (9.2) becomes \[\begin{align} T^2(x):= n (\bar{x}-\mu_0)' \hat \Sigma(x)^{-1} (\bar{x}-\mu_0), \end{align}\] which is the standard notation of Hotelling’s test statistic.

For inference, we need the null distribution of Hotelling’s test statistic. For this we introduce some vocabulary 17 :

  • Low Dimension: We call a problem low dimensional if \(n \gg p\), i.e. \(p/n \approx 0\). This means there are many observations per estimated parameter.
  • High Dimension: We call a problem high dimensional if \(p/n \to c\), where \(c\in (0,1)\). This means there are more observations than parameters, but not many more.
  • Very High Dimension: We call a problem very high dimensional if \(p/n \to c\), where \(1<c<\infty\). This means there are fewer observations than parameters.

Hotelling’s \(T^2\) test can only be used in the low dimensional regime. For some intuition on this statement, think of taking \(n=20\) measurements of \(p=100\) physiological variables. We seemingly have \(20\) observations, but there are \(100\) unknown quantities in \(\mu\) . Say you decide that \(\mu\) differs from \(\mu_0\) based on the coordinate with maximal difference between your data and \(\mu_0\) . Do you know how much variability to expect of this maximum? Try comparing your intuition with a quick simulation. Did the variability of the maximum surprise you? Hotelling’s \(T^2\) is not the same as the maximum, but the same intuition applies. This criticism is formalized in Bai and Saranadasa ( 1996 ) . In modern applications, Hotelling’s \(T^2\) is rarely recommended. Luckily, many modern alternatives are available. See J. Rosenblatt, Gilron, and Mukamel ( 2016 ) for a review.

9.1.2 Various Types of Signal to Detect

In the previous section, we assumed that the signal is a departure of \(\mu\) from some \(\mu_0\) . For vector-valued data \(y\) , distributed \(\mathcal F\) , we may define “signal” as any departure from some \(\mathcal F_0\) . This is the multivariate counterpart of goodness-of-fit (GOF) tests.

Even when restricting “signal” to departures of \(\mu\) from \(\mu_0\) , “signal” may come in various forms:

  • Dense Signal : when the departure is in a large number of coordinates of \(\mu\) .
  • Sparse Signal : when the departure is in a small number of coordinates of \(\mu\) .

Process control in a manufacturing plant, for instance, is consistent with a dense signal: if a manufacturing process has failed, we expect a change in many measurements (i.e. coordinates of \(\mu\) ). Detection of activation in brain imaging is also consistent with a dense signal: if a region encodes a cognitive function, we expect a change in many brain locations (i.e. coordinates of \(\mu\) ). Detection of disease-encoding regions in the genome is consistent with a sparse signal: if susceptibility to a disease is genetic, only a small subset of locations in the genome will encode it.

Hotelling’s \(T^2\) statistic is best suited to dense signal. The next test is a simple (and often forgotten) test that is best suited to sparse signal.

9.1.3 Simes’ Test

Hotelling’s \(T^2\) statistic has two limitations: it is designed for dense signals, and it requires estimating the covariance, which is a very difficult problem.

An algorithm that is sensitive to sparse signal and allows statistically valid detection under a wide range of covariances (even if we don’t know the covariance) is known as Simes’ Test . The statistic is defined via the following algorithm:

  • Compute \(p\) variable-wise p-values: \(p_1,\dots,p_p\) .
  • Denote \(p_{(1)},\dots,p_{(p)}\) the sorted p-values.
  • Simes’ statistic is \(p_{Simes}:=min_j\{p_{(j)} \times p/j\}\) .
  • Reject the “no signal” null hypothesis at significance \(\alpha\) if \(p_{Simes}<\alpha\) .

9.1.4 Signal Detection with R

We start by simulating some data with no signal. We will convince ourselves that Hotelling’s and Simes’ tests detect nothing when nothing is present. We will then generate new data after injecting some signal, i.e., making \(\mu\) depart from \(\mu_0=0\) . We will then convince ourselves that both Hotelling’s and Simes’ tests are indeed capable of detecting signal when present.

Generating null data:
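Here is a minimal sketch of generating null data; the dimensions n and p are illustrative choices, and the later snippets in this section reuse these objects.

```r
# Null data: n observations on p variables, all with mean zero (no signal)
set.seed(1)
n <- 100
p <- 5
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
```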

Now making our own Hotelling one-sample \(T^2\) test using Eq. (9.2).
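A sketch of such a function, consistent with the notes that follow (the \(\chi^2\) approximation to the null distribution is only appropriate in the low dimensional regime), might look like this:

```r
hotellingOneSample <- function(x, mu0 = rep(0, ncol(x))) {
  n <- nrow(x); p <- ncol(x)
  stopifnot(n > 5 * p)                     # verify the problem is indeed low dimensional
  bar.x <- colMeans(x)                     # sample mean vector
  Sigma.inv <- solve(var(x))               # inverse of the estimated covariance matrix
  T2 <- as.numeric(n * (bar.x - mu0) %*% Sigma.inv %*% (bar.x - mu0))   # Eq. (9.2)
  p.value <- pchisq(T2, df = p, lower.tail = FALSE)                     # chi-square approximation
  list(statistic = T2, pvalue = p.value)   # a function returns a single object, so wrap both in a list
}
hotellingOneSample(x)   # a large p-value is expected: the data were generated with no signal
```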

Things to note:

  • stopifnot(n > 5 * p) is a little verification to check that the problem is indeed low dimensional. Otherwise, the \(\chi^2\) approximation cannot be trusted.
  • solve returns a matrix inverse.
  • %*% is the matrix product operator (see also crossprod() ).
  • A function may return only a single object, so we wrap the statistic and its p-value in a list object.

Just for verification, we compare our home-made Hotelling’s test to the implementation in the rrcov package. The statistic is clearly OK, but our \(\chi^2\) approximation of the distribution leaves something to be desired. Personally, I would never trust a Hotelling test if \(n\) is not much greater than \(p\) , in which case I would use a high-dimensional adaptation (see Bibliography).

Let’s do the same with Simes’:
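A sketch of Simes’ test, using variable-wise t-test p-values, might be:

```r
simes <- function(x) {
  p.vals <- apply(x, 2, function(z) t.test(z)$p.value)   # variable-wise p-values
  p <- length(p.vals)
  min(sort(p.vals) * p / seq_len(p))                     # Simes' statistic
}
simes(x)   # again, a large value is expected for the null data
```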

And now we verify that both tests can indeed detect signal when present. Are p-values small enough to reject the “no signal” null hypothesis?
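A minimal sketch, assuming an illustrative mean shift of 1 in every coordinate:

```r
x1 <- x + 1                     # inject signal: every coordinate of mu now equals 1
hotellingOneSample(x1)$pvalue   # tiny p-value: signal detected
simes(x1)                       # tiny as well
```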

… yes. All p-values are very small, so that all statistics can detect the non-null distribution.

9.2 Signal Counting

There are many ways to approach the signal counting problem. For the purposes of this book, however, we will not discuss them directly, and solve the signal counting problem as a signal identification problem: if we know where \(\mu\) departs from \(\mu_0\) , we only need to count coordinates to solve the signal counting problem.

9.3 Signal Identification

The problem of signal identification is also known as selective testing , or more commonly as multiple testing .

In the ANOVA literature, an identification stage will typically follow a detection stage. These are known as the omnibus F test , and post-hoc tests, respectively. In the multiple testing literature there will typically be no preliminary detection stage. It is typically assumed that signal is present, and the only question is “where?”

The first question when approaching a multiple testing problem is “what is an error”? Is an error declaring a coordinate in \(\mu\) to be different than \(\mu_0\) when it is actually not? Is an error an overly high proportion of falsely identified coordinates? The former is known as the family wise error rate (FWER), and the latter as the false discovery rate (FDR).

9.3.1 Signal Identification in R

One (of many) ways to do signal identification involves the stats::p.adjust function. The function takes as input a \(p\) -vector of variable-wise p-values . Why do we start with variable-wise p-values, and not the full data set?

  • Because we want to make inference variable-wise, so it is natural to start with variable-wise statistics.
  • Because we want to avoid dealing with covariances if possible. Computing variable-wise p-values does not require estimating covariances.
  • So that the identification problem is decoupled from the variable-wise inference problem, and may be applied much more generally than in the setup we presented.

We start by generating some high-dimensional multivariate data and computing the coordinate-wise (i.e. hypothesis-wise) p-values.

We now compute the p-values of each coordinate. We use a coordinate-wise t-test. Why a t-test? Because for the purpose of demonstration we want a simple test. In reality, you may use any test that returns valid p-values.
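A sketch consistent with the notes below; the number of observations is an illustrative choice, and p = 100 so that p.values has 100 entries:

```r
set.seed(1)
n <- 1e3
p <- 1e2
x <- matrix(rnorm(n * p), nrow = n, ncol = p)   # null data: every coordinate of mu is 0

t.pval <- function(z) t.test(z)$p.value              # returns the p-value of a t.test
p.values <- apply(X = x, MARGIN = 2, FUN = t.pval)   # one p-value per column (variable)
length(p.values)                                     # 100
```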

  • t.pval is a function that merely returns the p-value of a t.test.
  • We used the apply function to apply the same function to each column of x .
  • MARGIN=2 tells apply to compute over columns and not rows.
  • The output, p.values , is a vector of 100 p-values.

We are now ready to do the identification, i.e., find which coordinates of \(\mu\) differ from \(\mu_0=0\) . The workflow for identification has the same structure, regardless of the desired error guarantees:

  • Compute an adjusted p-value .
  • Compare the adjusted p-value to the desired error level.

If we want \(FWER \leq 0.05\) , meaning that we allow a \(5\%\) probability of making any mistake, we will use the method="holm" argument of p.adjust .

If we want \(FDR \leq 0.05\) , meaning that we allow the proportion of false discoveries to be no larger than \(5\%\) , we use the method="BH" argument of p.adjust .
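In code, the two error guarantees differ only in the method argument; a minimal sketch on the null p-values from above:

```r
alpha <- 0.05
which(p.adjust(p.values, method = "holm") < alpha)   # FWER control: expect no identifications
which(p.adjust(p.values, method = "BH")   < alpha)   # FDR control: expect no identifications
```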

We now inject some strong signal in \(\mu\) just to see that the process works. We will artificially inject signal in the first 10 coordinates.
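A minimal sketch, where the size of the mean shift (+2) is an illustrative choice:

```r
x[, 1:10] <- x[, 1:10] + 2                          # signal in the first 10 coordinates only
p.values <- apply(x, 2, t.pval)                     # recompute the variable-wise p-values
which(p.adjust(p.values, method = "holm") < 0.05)   # identified coordinates under FWER control
which(p.adjust(p.values, method = "BH")   < 0.05)   # identified coordinates under FDR control
```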

Indeed, we are now able to detect that the first coordinates carry signal, because their respective coordinate-wise null hypotheses have been rejected.

9.4 Signal Estimation (*)

The estimation of the elements of \(\mu\) is a seemingly straightforward task. This is not the case, however, if we estimate only the elements that were selected because they were significant (or any other data-dependent criterion). Clearly, estimating only significant entries will introduce a bias in the estimation. In the statistical literature, this is known as selection bias . Selection bias also occurs when you perform inference on regression coefficients after some model selection, say, with a lasso, or a forward search 18 .

Selective inference is a complicated and active research topic so we will not offer any off-the-shelf solution to the matter. The curious reader is invited to read Rosenblatt and Benjamini ( 2014 ) , Javanmard and Montanari ( 2014 ) , or Will Fithian’s PhD thesis (Fithian 2015 ) for more on the topic.

9.5 Bibliographic Notes

For a general introduction to multivariate data analysis see Anderson-Cook ( 2004 ) . For an R oriented introduction, see Everitt and Hothorn ( 2011 ) . For more on the difficulties with high dimensional problems, see Bai and Saranadasa ( 1996 ) . For some cutting edge solutions for testing in high-dimension, see J. Rosenblatt, Gilron, and Mukamel ( 2016 ) and references therein. Simes’ test is not very well known. It is introduced in Simes ( 1986 ) , and proven to control the type I error of detection under a PRDS type of dependence in Benjamini and Yekutieli ( 2001 ) . For more on multiple testing, and signal identification, see Efron ( 2012 ) . For more on the choice of your error rate see Rosenblatt ( 2013 ) . For an excellent review on graphical models see Kalisch and Bühlmann ( 2014 ) . Everything you need on graphical models, Bayesian belief networks, and structure learning in R, is collected in the Task View .

9.6 Practice Yourself

Generate multivariate data with:
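The original data-generating code is not shown here; a minimal sketch, assuming a multivariate normal sample is acceptable (the mvtnorm package and all parameter values below are assumptions), is:

```r
library(mvtnorm)                 # assumed to be installed
set.seed(1)
n <- 100
p <- 10
mu <- rep(0, p)                  # set some entries to nonzero values to inject signal
x <- rmvnorm(n = n, mean = mu)   # n observations on p variables (identity covariance)
```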

  • Use Hotelling’s test to determine if \(\mu\) equals \(\mu_0=0\) . Can you detect the signal?
  • Perform t.test on each variable and extract the p-value. Try to identify visually the variables which depart from \(\mu_0\) .
  • Use p.adjust to identify in which variables there are any departures from \(\mu_0=0\) . Allow 5% probability of making any false identification.
  • Use p.adjust to identify in which variables there are any departures from \(\mu_0=0\) . Allow a 5% proportion of errors within identifications.
  • Do we agree the groups differ?
  • Implement the two-group Hotelling test described in Wikipedia: ( https://en.wikipedia.org/wiki/Hotelling%27s_T-squared_distribution#Two-sample_statistic ).
  • Verify that you are able to detect that the groups differ.
  • Perform a two-group t-test on each coordinate. On which coordinates can you detect signal while controlling the FWER? On which while controlling the FDR? Use p.adjust .

Return to the previous problem, but set n=9 . Verify that you cannot compute your Hotelling statistic.

Anderson-Cook, Christine M. 2004. “An Introduction to Multivariate Statistical Analysis.” Journal of the American Statistical Association 99 (467). American Statistical Association: 907–9.

Bai, Zhidong, and Hewa Saranadasa. 1996. “Effect of High Dimension: By an Example of a Two Sample Problem.” Statistica Sinica . JSTOR, 311–29.

Benjamini, Yoav, and Daniel Yekutieli. 2001. “The Control of the False Discovery Rate in Multiple Testing Under Dependency.” Annals of Statistics . JSTOR, 1165–88.

Efron, Bradley. 2012. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction . Vol. 1. Cambridge University Press.

Everitt, Brian, and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R . Springer Science & Business Media.

Fithian, William. 2015. “Topics in Adaptive Inference.” PhD thesis, STANFORD UNIVERSITY.

Javanmard, Adel, and Andrea Montanari. 2014. “Confidence Intervals and Hypothesis Testing for High-Dimensional Regression.” Journal of Machine Learning Research 15 (1): 2869–2909.

Kalisch, Markus, and Peter Bühlmann. 2014. “Causal Structure Learning and Inference: A Selective Review.” Quality Technology & Quantitative Management 11 (1). Taylor & Francis: 3–21.

Rosenblatt, Jonathan. 2013. “A Practitioner’s Guide to Multiple Testing Error Rates.” arXiv Preprint arXiv:1304.4920 .

Rosenblatt, Jonathan D, and Yoav Benjamini. 2014. “Selective Correlations; Not Voodoo.” NeuroImage 103. Elsevier: 401–10.

Rosenblatt, Jonathan, Roee Gilron, and Roy Mukamel. 2016. “Better-Than-Chance Classification for Signal Detection.” arXiv Preprint arXiv:1608.08873 .

Simes, R John. 1986. “An Improved Bonferroni Procedure for Multiple Tests of Significance.” Biometrika 73 (3). Oxford University Press: 751–54.

This vocabulary is not standard in the literature, so when you read a text, you will need to verify yourself what the author means.

You might find this shocking, but it does mean that you cannot trust the summary table of a model that was selected from a multitude of models.

An Introduction to Multivariate Analysis

Data analytics is all about looking at various factors to see how they impact certain situations and outcomes. When dealing with data that contains more than two variables, you’ll use multivariate analysis.

Multivariate analysis isn’t just one specific method—rather, it encompasses a whole range of statistical techniques. These techniques allow you to gain a deeper understanding of your data in relation to specific business or real-world scenarios.

So, if you’re an aspiring data analyst or data scientist, multivariate analysis is an important concept to get to grips with.

In this post, we’ll provide a complete introduction to multivariate analysis. We’ll delve deeper into defining what multivariate analysis actually is, and we’ll introduce some key techniques you can use when analyzing your data. We’ll also give some examples of multivariate analysis in action.

Want to skip ahead to a particular section? Here's what we'll cover:

  • What is multivariate analysis?
  • Multivariate data analysis techniques (with examples)
  • What are the advantages of multivariate analysis?
  • Key takeaways and further reading

Ready to demystify multivariate analysis? Let’s do it.

1. What is multivariate analysis?

In data analytics, we look at different variables (or factors) and how they might impact certain situations or outcomes.

For example, in marketing, you might look at how the variable “money spent on advertising” impacts the variable “number of sales.” In the healthcare sector, you might want to explore whether there’s a correlation between “weekly hours of exercise” and “cholesterol level.” This helps us to understand why certain outcomes occur, which in turn allows us to make informed predictions and decisions for the future.

There are three categories of analysis to be aware of:

  • Univariate analysis , which looks at just one variable
  • Bivariate analysis , which analyzes two variables
  • Multivariate analysis , which looks at more than two variables

As you can see, multivariate analysis encompasses all statistical techniques that are used to analyze more than two variables at once. The aim is to find patterns and correlations between several variables simultaneously—allowing for a much deeper, more complex understanding of a given scenario than you’ll get with bivariate analysis.

An example of multivariate analysis

Let’s imagine you’re interested in the relationship between a person’s social media habits and their self-esteem. You could carry out a bivariate analysis, comparing the following two variables:

  • How many hours a day a person spends on Instagram
  • Their self-esteem score (measured using a self-esteem scale)

You may or may not find a relationship between the two variables; however, you know that, in reality, self-esteem is a complex concept. It’s likely impacted by many different factors—not just how many hours a person spends on Instagram. You might also want to consider factors such as age, employment status, how often a person exercises, and relationship status (for example). In order to deduce the extent to which each of these variables correlates with self-esteem, and with each other, you’d need to run a multivariate analysis.

So we know that multivariate analysis is used when you want to explore more than two variables at once. Now let’s consider some of the different techniques you might use to do this.

2. Multivariate data analysis techniques and examples

There are many different techniques for multivariate analysis, and they can be divided into two categories:

  • Dependence techniques
  • Interdependence techniques

So what’s the difference? Let’s take a look.

Multivariate analysis techniques: Dependence vs. interdependence

When we use the terms “dependence” and “interdependence,” we’re referring to different types of relationships within the data. To give a brief explanation:

Dependence methods

Dependence methods are used when one or some of the variables are dependent on others. Dependence looks at cause and effect; in other words, can the values of two or more independent variables be used to explain, describe, or predict the value of another, dependent variable? To give a simple example, the dependent variable of “weight” might be predicted by independent variables such as “height” and “age.”

In machine learning, dependence techniques are used to build predictive models. The analyst enters input data into the model, specifying which variables are independent and which ones are dependent—in other words, which variables they want the model to predict, and which variables they want the model to use to make those predictions.

Interdependence methods

Interdependence methods are used to understand the structural makeup and underlying patterns within a dataset. In this case, no variables are dependent on others, so you’re not looking for causal relationships. Rather, interdependence methods seek to give meaning to a set of variables or to group them together in meaningful ways.

In short: dependence methods are about the effect of certain variables on others, while interdependence methods are about the structure of the dataset.

With that in mind, let’s consider some useful multivariate analysis techniques. We’ll look at:

  • Multiple linear regression
  • Multiple logistic regression
  • Multivariate analysis of variance (MANOVA)
  • Factor analysis
  • Cluster analysis

Multiple linear regression

Multiple linear regression is a dependence method which looks at the relationship between one dependent variable and two or more independent variables. A multiple regression model will tell you the extent to which each independent variable has a linear relationship with the dependent variable. This is useful as it helps you to understand which factors are likely to influence a certain outcome, allowing you to estimate future outcomes.

Example of multiple regression:

As a data analyst, you could use multiple regression to predict crop growth. In this example, crop growth is your dependent variable and you want to see how different factors affect it. Your independent variables could be rainfall, temperature, amount of sunlight, and amount of fertilizer added to the soil. A multiple regression model would show you the proportion of variance in crop growth that each independent variable accounts for.
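A hedged sketch of what such a model might look like in R follows; the data frame and its columns are invented purely for illustration.

```r
# Simulated crop data: growth predicted from rainfall, temperature, sunlight, and fertilizer
set.seed(42)
n <- 200
crops <- data.frame(
  rainfall    = rnorm(n, mean = 100, sd = 20),
  temperature = rnorm(n, mean = 22,  sd = 3),
  sunlight    = rnorm(n, mean = 8,   sd = 2),
  fertilizer  = rnorm(n, mean = 50,  sd = 10)
)
crops$growth <- 0.3 * crops$rainfall + 1.5 * crops$temperature +
                2.0 * crops$sunlight + 0.5 * crops$fertilizer + rnorm(n, sd = 5)

fit <- lm(growth ~ rainfall + temperature + sunlight + fertilizer, data = crops)
summary(fit)   # one coefficient per predictor, with significance tests and R-squared
```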


Multiple logistic regression

Logistic regression analysis is used to calculate (and predict) the probability of a binary event occurring. A binary outcome is one where there are only two possible outcomes; either the event occurs (1) or it doesn’t (0). So, based on a set of independent variables, logistic regression can predict how likely it is that a certain scenario will arise. It is also used for classification. You can learn about the difference between regression and classification here.

Example of logistic regression:

Let’s imagine you work as an analyst within the insurance sector and you need to predict how likely it is that each potential customer will make a claim. You might enter a range of independent variables into your model, such as age, whether or not they have a serious health condition, their occupation, and so on. Using these variables, a logistic regression analysis will calculate the probability of the event (making a claim) occurring. Another oft-cited example is the filters used to classify email as “spam” or “not spam.” You’ll find a more detailed explanation in this complete guide to logistic regression .
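Here's a hedged sketch of the insurance example in R; the data are simulated and the variable names are invented for illustration.

```r
# Simulated insurance data: predict the probability that a customer makes a claim
set.seed(7)
n <- 500
claims <- data.frame(
  age              = sample(18:80, n, replace = TRUE),
  health_condition = rbinom(n, 1, 0.2),   # 1 = has a serious health condition
  risky_occupation = rbinom(n, 1, 0.3)    # 1 = works in a higher-risk occupation
)
logit <- -4 + 0.03 * claims$age + 1.2 * claims$health_condition + 0.8 * claims$risky_occupation
claims$made_claim <- rbinom(n, 1, plogis(logit))   # simulate the binary outcome

fit <- glm(made_claim ~ age + health_condition + risky_occupation,
           data = claims, family = binomial)
head(predict(fit, type = "response"))   # predicted claim probabilities for the first customers
```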

Multivariate analysis of variance (MANOVA)

Multivariate analysis of variance (MANOVA) is used to measure the effect of multiple independent variables on two or more dependent variables. With MANOVA, it’s important to note that the independent variables are categorical, while the dependent variables are metric in nature. A categorical variable is a variable that belongs to a distinct category—for example, the variable “employment status” could be categorized into certain units, such as “employed full-time,” “employed part-time,” “unemployed,” and so on. A metric variable is measured quantitatively and takes on a numerical value.

In MANOVA analysis, you’re looking at various combinations of the independent variables to compare how they differ in their effects on the dependent variable.

Example of MANOVA:

Let’s imagine you work for an engineering company that is on a mission to build a super-fast, eco-friendly rocket. You could use MANOVA to measure the effect that various design combinations have on both the speed of the rocket and the amount of carbon dioxide it emits. In this scenario, your categorical independent variables could be:

  • Engine type, categorized as E1, E2, or E3
  • Material used for the rocket exterior, categorized as M1, M2, or M3
  • Type of fuel used to power the rocket, categorized as F1, F2, or F3

Your metric dependent variables are speed in kilometers per hour, and carbon dioxide measured in parts per million. Using MANOVA, you’d test different combinations (e.g. E1, M1, and F1 vs. E1, M2, and F1, vs. E1, M3, and F1, and so on) to calculate the effect of all the independent variables. This should help you to find the optimal design solution for your rocket.
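A hedged sketch of how this could be set up in R is given below; the design and the response values are simulated, so the numbers themselves mean nothing.

```r
# Simulated rocket trials: three categorical factors, two metric outcomes
set.seed(3)
rockets <- expand.grid(engine   = factor(c("E1", "E2", "E3")),
                       material = factor(c("M1", "M2", "M3")),
                       fuel     = factor(c("F1", "F2", "F3")),
                       run      = 1:3)
rockets$speed <- rnorm(nrow(rockets), mean = 28000, sd = 1500)   # km/h
rockets$co2   <- rnorm(nrow(rockets), mean = 400,   sd = 30)     # parts per million

fit <- manova(cbind(speed, co2) ~ engine + material + fuel, data = rockets)
summary(fit, test = "Wilks")   # joint test of each factor's effect on speed and CO2
```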

Factor analysis

Factor analysis is an interdependence technique which seeks to reduce the number of variables in a dataset. If you have too many variables, it can be difficult to find patterns in your data. At the same time, models created using datasets with too many variables are susceptible to overfitting. Overfitting is a modeling error that occurs when a model fits too closely and specifically to a certain dataset, making it less generalizable to future datasets, and thus potentially less accurate in the predictions it makes.

Factor analysis works by detecting sets of variables which correlate highly with each other. These variables may then be condensed into a single variable. Data analysts will often carry out factor analysis to prepare the data for subsequent analyses.

Factor analysis example:

Let’s imagine you have a dataset containing data pertaining to a person’s income, education level, and occupation. You might find a high degree of correlation among each of these variables, and thus reduce them to the single factor “socioeconomic status.” You might also have data on how happy they were with customer service, how much they like a certain product, and how likely they are to recommend the product to a friend. Each of these variables could be grouped into the single factor “customer satisfaction” (as long as they are found to correlate strongly with one another). Even though you’ve reduced several data points to just one factor, you’re not really losing any information—these factors adequately capture and represent the individual variables concerned. With your “streamlined” dataset, you’re now ready to carry out further analyses.
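The following is a hedged sketch of factor analysis in R on simulated survey-style data, where two latent factors drive six observed variables (all names and values are invented).

```r
# Simulated data: "socioeconomic status" and "customer satisfaction" as latent factors
set.seed(11)
n <- 300
ses  <- rnorm(n)   # latent socioeconomic status
csat <- rnorm(n)   # latent customer satisfaction
survey <- data.frame(
  income     = ses  + rnorm(n, sd = 0.4),
  education  = ses  + rnorm(n, sd = 0.4),
  occupation = ses  + rnorm(n, sd = 0.4),
  service    = csat + rnorm(n, sd = 0.4),
  liking     = csat + rnorm(n, sd = 0.4),
  recommend  = csat + rnorm(n, sd = 0.4)
)
fa <- factanal(survey, factors = 2)   # maximum-likelihood factor analysis
fa$loadings                           # each variable should load mainly on one factor
```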

Cluster analysis

Another interdependence technique, cluster analysis is used to group similar items within a dataset into clusters.

When grouping data into clusters, the aim is for the variables in one cluster to be more similar to each other than they are to variables in other clusters. This is measured in terms of intracluster and intercluster distance. Intracluster distance looks at the distance between data points within one cluster. This should be small. Intercluster distance looks at the distance between data points in different clusters. This should ideally be large. Cluster analysis helps you to understand how data in your sample is distributed, and to find patterns.

Learn more: What is Cluster Analysis? A Complete Beginner’s Guide

Cluster analysis example:

A prime example of cluster analysis is audience segmentation. If you were working in marketing, you might use cluster analysis to define different customer groups which could benefit from more targeted campaigns. As a healthcare analyst , you might use cluster analysis to explore whether certain lifestyle factors or geographical locations are associated with higher or lower cases of certain illnesses. Because it’s an interdependence technique, cluster analysis is often carried out in the early stages of data analysis.
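A hedged sketch of a simple segmentation with k-means in R; the two customer groups below are simulated so that the clusters are easy to see.

```r
# Simulated customer data with two obvious segments
set.seed(5)
customers <- rbind(
  cbind(age = rnorm(50, mean = 25, sd = 3), spend = rnorm(50, mean = 200, sd = 30)),
  cbind(age = rnorm(50, mean = 45, sd = 4), spend = rnorm(50, mean = 600, sd = 50))
)
km <- kmeans(scale(customers), centers = 2)   # scale so both variables count equally
table(km$cluster)                             # sizes of the two discovered segments
```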


More multivariate analysis techniques

This is just a handful of multivariate analysis techniques used by data analysts and data scientists to understand complex datasets. If you’re keen to explore further, check out discriminant analysis, conjoint analysis, canonical correlation analysis, structural equation modeling, and multidimensional scaling.

3. What are the advantages of multivariate analysis?

The one major advantage of multivariate analysis is the depth of insight it provides. In exploring multiple variables, you’re painting a much more detailed picture of what’s occurring—and, as a result, the insights you uncover are much more applicable to the real world.

Remember our self-esteem example back in section one? We could carry out a bivariate analysis, looking at the relationship between self-esteem and just one other factor; and, if we found a strong correlation between the two variables, we might be inclined to conclude that this particular variable is a strong determinant of self-esteem. However, in reality, we know that self-esteem can’t be attributed to one single factor. It’s a complex concept; in order to create a model that we could really trust to be accurate, we’d need to take many more factors into account. That’s where multivariate analysis really shines; it allows us to analyze many different factors and get closer to the reality of a given situation.

4. Key takeaways and further reading

In this post, we’ve learned that multivariate analysis is used to analyze data containing more than two variables. To recap, here are some key takeaways:

  • The aim of multivariate analysis is to find patterns and correlations between several variables simultaneously
  • Multivariate analysis is especially useful for analyzing complex datasets, allowing you to gain a deeper understanding of your data and how it relates to real-world scenarios
  • There are two types of multivariate analysis techniques: Dependence techniques, which look at cause-and-effect relationships between variables, and interdependence techniques, which explore the structure of a dataset
  • Key multivariate analysis techniques include multiple linear regression, multiple logistic regression, MANOVA, factor analysis, and cluster analysis—to name just a few

So what now? For a hands-on introduction to data analytics, try this free five-day data analytics short course . And, if you’d like to learn more about the different methods used by data analysts, check out the following:

  • What is data cleaning and why does it matter?
  • SQL cheatsheet: Learn your first 8 commands
  • A step-by-step guide to the data analysis process

Multivariate Testing

What is multivariate testing?

Multivariate testing (MVT) is a form of A/B testing wherein a combination of multiple page elements are modified and tested against the original version (called the control) to determine which permutation leaves the highest impact on the business metrics you’re tracking. This form of testing is recommended if you want to test the impact of radical changes on a webpage as compared to analyzing the impact of one particular element. 


Unlike a traditional A/B test, MVT is more complex and best suited for advanced marketing, product, and development professionals. Let’s consider an example to give you a more comprehensive explanation of this testing methodology and see how it aids in conversion rate optimization .

Let’s say you have an online business selling homemade chocolates. Your product landing page typically has three important elements to attract visitors and push them down the conversion funnel – product images, call-to-action button color, and product headline. You decide to test 2 versions of each of the 3 elements to understand which combination performs the best and increases your conversion rate. This would make your test a multivariate test (MVT).

A set of 2 variations for 3 page elements means a total of 8 variation combinations. The formula to calculate the total number of versions in MVT is as follows:

[No. of variations of element A] x [No. of variations of element B] x [No. of variations of element C]… = [Total No. of variations]

Now, variation combinations would be:

2 (Product image) x 2 (CTA button color) x 2 (Product headline) = 8 


Each of these combinations will now be concurrently tested to analyze which combination helps get maximum conversions on your product landing page . 

Note that a multivariate test eliminates the need to run a sequence of separate A/B tests to find the winning variation. Running concurrent tests with a greater number of variation combinations not only saves you time, money, and effort, but also lets you draw conclusions in the shortest possible time.

Here’s a real-life example of multivariate testing that illustrates the benefits of this experimentation methodology.

Hyundai.io found that while the traffic on its car model landing pages was significant, not many people were requesting a test drive or downloading car brochures. Using VWO’s qualitative and quantitative tools, they found that each of their landing pages had many different elements, including the car headline, car visuals, car specifications, testimonials, and so on, any of which might be causing friction. They decided to run a multivariate test to understand which elements were influencing a visitor’s decision to request a test drive or download a car brochure.

They created variations of the following sections of the car landing page:

  • New SEO friendly text vs old text: They hypothesized that by making the text more SEO friendly, they could reap more SEO benefits.
  • Extra CTA buttons vs no extra CTA buttons: They hypothesized that by adding extra and more prominent CTA buttons on the page, they’ll be able to nudge visitors in the right direction.
  • Large photo of the car versus thumbnails: They hypothesized that it’s better to have larger photographs on the page than thumbnails to create better visitor traction

Hyundai.io tested a total of 8 combinations (3 sections, 2 variations each = 2*2*2) on their website. 

Here’s a screenshot of the original page and the winning variation:

Screenshot: Hyundai.io multivariate test control page and winning variation

The variation with more SEO-friendly text, extra CTA buttons, and larger images increased Hyundai.io's conversion rates for both requesting a test drive and downloading a brochure by a total of 62%. They also saw a 208% increase in their click-through rate.

MVT is not just restricted to testing the performance of your webpages. You can use it across a range of fields. For instance, you can test your PPC ads, server-side code, and so on. But MVT should only be used on sufficiently sized visitor segments; more in-depth analysis means a longer completion time. You must also not test too many variables at once, as the test will take longer to complete and may never achieve statistical significance.


Understanding some basic multivariate testing terminology

Although multivariate testing builds on A/B testing, there are a few terms specific to it that anyone getting into this experimentation arena should know.

  • Combination: It refers to a number of possible arrangements or unions you can create to test a collection of variable options in multiple locations. The order of selection or permutation does not matter. For instance, if you’re testing three elements on your home page, each with three variable options, then there are a total of 27 possible combinations (3x3x3) that you’ll test. When a visitor becomes a part of your test, they’ll see one combination, also referred to as an experience, when they visit your website. 
  • Content: Text, an image, or any element that becomes a part of an experiment. In multivariate testing, several content options spread across a web page are compared in tandem to analyze which combination shows the best results. Content is also sometimes referred to as a level in MVT.
  • Location: Typically, a location refers to a page on the website, or a specific area of it, where you run optimizations. It's where you run website activities and experiences, display content to visitors, and track visitor behavior.
  • Control: The original page, element, or content against which you're planning to run a test. It also represents the “A” in the A/B testing scenario. For instance, if you want to test the performance of your homepage's banner image, the original or existing banner image will be the “control.” Control is often also referred to as the “Champion” by many seasoned optimizers.
  • Goal: An event or a combination of events that help measure the success of a test or an experiment. For instance, a content writer’s goal is to increase visitor engagement on their content pieces and even generate content leads. 
  • Confidence Level: A measure of how certain you can be that the observed result of your experiment is not due to chance.
  • Conversion Rate: The percentage of unique visitors entering your conversion funnel who convert into paying customers.
  • Element: A discrete page component such as a form, block of text, an image, call-to-action button, etc.  
  • Experiment: It’s another way of assessing or evaluating the performance of one or more page elements. 
  • Hypothesis: A tentative assumption made to draft and test a logical or empirical consequence of a particular problem. An example of a hypothesis could be: Based on the previously run experiments and qualitative data gathered through heatmap and scrollmap analysis, I expect that adding banner and text CTAs on the guide page at regular intervals will help generate more content leads and MQLs.
  • Non-Conclusive Results: When no solid conclusion can be drawn from the experiment(s) you've run. Non-conclusive results do not point to a test's failure, only to a failure to derive a clear learning from it.
  • Qualitative Research: A technique of gathering and analyzing non-numerical data from existing and potential customers to understand a concept, opinion, or experience. 
  • Quantitative Research: Digging through numerical data derived from analytics to find insights around the behavior of your website visitors and draw statistical analysis.   
  • Visitors: A person or a unique entity visiting your site or landing pages. They’re termed unique because no matter how many times a person visits your site or page, they’re counted only once.


What are the different types of multivariate testing methods?

MVT is in itself an umbrella methodology. There are several different types of multivariate tests that you can choose to run. We’ve defined each of these in detail below.

1. Full factorial testing

This is the most basic and frequently used MVT method. Using this testing methodology, you basically distribute your website traffic equally among all testing combinations. For instance, if you’re planning to test 8 different types of combinations on your homepage, each testing variation will receive one-eighth of all the website traffic. 

Since each variation gets the same amount of traffic, the full factorial testing method offers all the necessary data you’d need to determine which testing variation and page elements perform the best. You’ll be able to discover which element had no effect on your targeted business metrics and which ones influenced them the most. 
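To make the equal split concrete, here is a small illustrative Python sketch (the element variations are hypothetical) that enumerates every full factorial combination and prints the share of traffic each one would receive:

```python
from itertools import product

# Hypothetical variations for three page elements.
images = ["image_A", "image_B"]
cta_colors = ["green", "orange"]
headlines = ["headline_A", "headline_B"]

# Full factorial: every combination of every element variation.
combinations = list(product(images, cta_colors, headlines))

# Each combination receives an equal share of the incoming traffic.
traffic_share = 1 / len(combinations)
for combo in combinations:
    print(combo, f"gets {traffic_share:.1%} of traffic")
```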

Because this MVT methodology makes no assumptions with respect to statistics or testing mathematics used in the background, our seasoned optimizers highly recommend it for people running or planning to run multivariate tests.


2. Partial or fractional factorial testing

Partial or fractional factorial MVT methodology exposes only a fraction of all testing variations to the website’s total traffic. The conversion rate of the unexposed testing variations is interpreted from the ones that were included in the test. 

Say you want to test 16 variations or combinations of your website’s homepage. In a regular test (or full factorial test), traffic is split equally between all variations. However, in the case of fractional factorial testing, traffic is divided between only 8 variations. The conversion rate of the remaining 8 variations is calculated or statistically deduced based on those actually tested. 

This method involves the use of advanced mathematical techniques and multiple assumptions to gather insights, and it has several disadvantages. One advantage of this MVT methodology is that it requires less traffic, which makes it a reasonable option for websites or pages with low traffic.

However, regardless of how advanced the mathematical techniques used to draw statistically significant results from fractional factorial testing are, hard data is always better than speculation.

3. Taguchi testing

This is an old and esoteric MVT method. If you run a Google search, you’ll find that most tools on the market today claim to cut down on your testing time and traffic requirement by using the Taguchi testing technique. It’s more of an “off-line quality control” technique as it helps test and ensure good performance of products or processes in their design stage.  

While some optimizers consider this a good MVT methodology, we at VWO believe that this is an old-school practice which is not theoretically sound. It was initially used in the manufacturing industry to reduce the number of combinations required to be tested for QA and other experiments. 

Taguchi testing is not applicable or suitable for online testing and hence, not recommended. Use the full factorial or partial factorial MVT approach.


How is multivariate testing different?

A/B testing vs multivariate testing

Ask an experience optimizer and they’ll say the ideal use of A/B testing is to analyze the performance of two or more radically different website elements. Meanwhile, MVT is a perfect technique to test which combination of page element(s) gets maximum conversions. 

In testing terms, it’s often recommended to use A/B testing to find what’s called the “global maximum,” and MVT to refine your way towards the “local maximum.”

Let’s take an example to understand the concept of global maximum and local maximum.

Imagine for a second that you’ve never tasted even a single piece of chocolate in your life, and you’re standing in a chocolate shop looking at 25 different types of chocolates, confused about which one to purchase.

There are probably 5 different kinds of caramel chocolates, 10 different varieties of truffles, 6 different variations of lollipops, and 4 different types of exotic fruit chocolates. Are you going to taste all these 25 flavors before deciding which one to buy?

You may try one kind of chocolate from each of the above-mentioned categories, but surely not all. If you find that you like truffles the most over lollipops, caramel chocolates and exotic fruit chocolates, you’ll start tasting more truffle flavors like “coconut truffles,” “Oreo truffles,” “chocolate-fudge truffles,” and so on to decide which among the truffle flavors you like the most.

In statistical terms, we’d say that the category of chocolates you like the most will become the global maximum. This is the type of chocolate that spoke to your taste buds and tasted the best among the lot. When you get down to the specific flavors of truffles, i.e., coconut truffle, Oreo truffle, chocolate fudge, and more, you’ll discover the local maximum – the best version of the variety that you chose.

As an experience optimizer, you must approach testing in a similar manner. Find the webpage that gives you maximum conversions (global maximum), and then test combinations of specific elements on that webpage to understand which one improves your page's performance and makes the highest-converting page (local maximum). Whether you're looking for the global maximum or the local maximum will define which testing methodology you must use.

Here's a list of pros and cons of using A/B testing and multivariate testing.

A/B testing

Pros:
1. A comparatively simple method to design and execute.
2. Helps conclude debates around campaign tactics when there's one hypothesis in question.
3. Helps generate statistically significant results even with smaller traffic samples.
4. Provides clear and detailed result reports which are easy for even non-technical teams to interpret and implement.

Cons:
1. Limited to testing a single element with a few variations, typically 2 to 3.
2. Not possible to analyze the interaction between various page elements within the same testing campaign.

Multivariate testing

Pros:
1. Gives insights regarding the interaction between multiple page elements.
2. Provides a granular picture of which elements have an impact on the performance of a page.
3. Enables optimizers to compare many versions of a campaign and conclude which one has the maximum impact.

Cons:
1. A comparatively complex experimentation methodology to design and execute.
2. Requires more traffic than an A/B test to show statistically significant results.
3. Too many combinations make result interpretation difficult.
4. Can be overkill when an A/B test would have been sufficient to show results.


Split URL testing vs multivariate testing

Assuming you're fairly clear on the definition and concept of MVT, we'll begin by breaking down the concept of Split URL testing. Rather than testing page elements at a granular level as in MVT, a split URL test runs at the page level: the variations are dramatically different and hosted on separate URLs, but they have the same end goal.

Let's continue with the example of running an online business of homemade chocolates. Imagine your current homepage has a banner that shows different offers running on your website, along with a section displaying your featured products, another section highlighting different chocolate categories, your brand story, and related recipes. According to your gut feeling, the page looks attractive and has the potential to convert.

However, after looking at the qualitative results, viewing heatmaps, session recordings, etc., you find that many elements on your homepage are not showing the results they should.

If you decide to run a Split URL test, you can create an entirely new page design with elements placed in a different manner and compare the performance of this variation with the control to analyze which one’s generating more conversions. 

Meanwhile, if you decide to run a multivariate test, you can create permutations of the different page elements that you want to examine, maybe testing different colors of your homepage's CTA button, the banner image, subheadings, and so on, and check which combination generates maximum conversions. There can be 'n' number of permutations that you can test with MVT.


One of the primary reasons MVT is better than split URL testing is that the latter demands a lot of design and development bandwidth and is a lengthy process. MVT, on the other hand, is comparatively less complex to run and demands less bandwidth as well.

Here’s a comparison table between Split URL testing and MVT:

Split URL testing

Pros:
1. Sizeable changes such as completely new page designs are tested to check which gets maximum traction and conversions.
2. Variations are created on separate URLs to maintain distinction and clarity.
3. Helps examine different page flows and complex changes such as a complete redesign of your website's homepage, product page, etc.
4. With Split URL testing, you test a completely new webpage.

Cons:
1. A comparatively complex test to design and execute.
2. Requires a lot of design and development team bandwidth.
3. Assesses the performance of a website as a whole while ignoring the performance of individual page elements.

Multivariate testing

Pros:
1. A combination of web page elements is modified and tested to check which permutation gets maximum conversions.
2. Runs on a granular level to understand the performance of each page element.
3. Comparatively more variations can be tested.
4. Requires fewer changes in terms of design and layout.

Cons:
1. Usually requires more traffic to reach statistical significance.
2. Demands more variable combinations to run and show results.
3. The traffic spread across variations is too thin, which sometimes makes the test results unreliable.
4. Since more subtle changes are tested, the impact on conversion rate may not be significant or dramatic.

Multipage testing vs multivariate testing

As the name suggests, multipage testing is a form of experimentation wherein changes to particular elements are tested across multiple pages. For instance, you can modify your homemade chocolate eCommerce website's primary CTA buttons (Add to Cart and Become a Member) on the homepage, replicate the change across the entire site, and run a multipage test to analyze the results.

Compared to MVT, optimizers suggest it's best to use multipage testing when you want to provide a consistent experience to your traffic, when you're redesigning your website, or when you want to improve your conversion funnel. Meanwhile, if you want to map the performance of certain elements on one particular web page, go with A/B testing or MVT.

Here’s a clear distinction between multipage testing and multivariate testing. 

Multipage testing

Pros:
1. Create one test to map the performance of a particular element, say site-wide navigation, across the entire website.
2. Run funnel tests to examine different voices and tones on web pages.
3. Experiment with different design theories and analyze which one is the best.
4. Helps map site-wide conversion rate.

Cons:
1. Requires huge traffic to show statistically significant results.
2. Can take longer than usual to conclude.
3. Gaining results from this form of experimentation can be tricky.

Multivariate testing

Pros:
1. You can validate even the minutest of choices, such as the color of a CTA button.
2. Gives in-depth insights about how different page elements play together.
3. Determines the contribution of individual page elements.
4. Eliminates the need to run multiple A/B tests.

Cons:
1. More permutations or variations means a longer time for a test to reach statistical significance.
2. Unlike multipage testing, you can test changes only on one particular page at a time in MVT.

How to run a multivariate test?

The process of setting up and running MVT is not very different from a regular A/B test, except for a couple of steps in between. But we'll start from the beginning so that the whole process stays fresh in your mind. Let's dive in.

1. Identify a problem

The first step to running MVT and improving the performance of your web page is to dig into your data and identify all the friction points causing visitors to drop off. For instance, the link attached to your “download guide” button may be broken, or the form on your product page may be asking for more information than necessary. To spot these problem areas, take the following steps.

  • Conduct informal research: Take a look at customer support feedback and examine product reviews to understand how people are reacting to your products and services. Speak to your sales, support, and design teams to get honest feedback about your website from a customer’s point of view.   
  • Use quantitative tools such as Google Analytics to analyze bounce rate, time spent on page, exit percentage, and similar metrics.
  • Use qualitative tools such as heatmaps to see where the majority of your website visitors are concentrating their attention, scrollmaps to analyze how far they're scrolling down the page, and session recordings to visually see their entire journey.
  • Explore the option of usability testing: This tool offers an insight into how people are actually using or navigating through your website. With usability testing you can gather direct feedback about visitor issues and draft necessary solutions. 

VWO Insights offers you a full suite of qualitative and quantitative tools such as heatmaps, scroll maps, click maps, session recordings, form analytics, etc. for quick and thorough analysis.

2. Define your goals and formulate a hypothesis

Successful experimentation begins by clearly defining goals. It is these goals or metrics that help prepare smart testing strategies around element selection and their variations. For example, if you're trying to increase your average order value (AOV) or revenue per visitor (RPV), you may select elements that directly aid these metrics and create different variations.

Once you've defined your goals and selected the page elements to test, it's time to formulate a hypothesis. A hypothesis is basically a suggested solution to a problem. For instance, if after looking at the heatmap of your product page you find that your “Add to Cart” button is not prominently visible, you might form a hypothesis such as: “Based on observations gathered through heatmap and other qualitative tools, I expect that if we put the 'Add to Cart' button in a colored box, we will see higher visitor interaction with the button and more conversions.”

Here’s a hypothesis format that we at VWO use:

Screenshot: creating a hypothesis in VWO and rating it

If you don't have a good heatmap tool in your arsenal, use VWO's AI-powered free heatmap generator to see how visitors are likely to engage with your webpage. You can also invest in VWO Insights to generate actual visitor behavior data and use it to your advantage.

3. Create variations

After forming a hypothesis and getting a good idea of which page elements you want to test, the next step is to create variations. Infuse your site with the new web elements or variations, such as clearer and more prominently visible call-to-action buttons, enhanced images, text that's more factual and resonates with what visitors expect, and so on.

VWO provides an excellent and highly user-friendly platform to run a multivariate test. Using VWO's Visual Editor, you can play with your website and its elements and create as many variations as you want without the help of a developer. Do watch this video on ‘Introduction to VWO Visual Editor’ to learn more.

4. Determine your sample size

As stated in the sections above, it's best to run MVT on a page with a high traffic volume. If you choose to run it on a small traffic volume, your test is likely to fail or take longer than usual to reach statistical significance. The higher the traffic volume, the better the traffic split between variations and, hence, the better the test results.

You can use our A/B and multivariate test duration calculator to find out how much traffic you need and how long you need to run MVT, based on your website's current traffic volume, the number of variations (including the control), and your desired statistical significance.
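The calculator itself is a VWO tool, but the rough arithmetic behind such an estimate can be sketched as below, assuming a standard two-proportion sample-size formula; the daily traffic, baseline conversion rate, and minimum detectable effect are made-up inputs, not defaults of the calculator:

```python
from math import ceil
from scipy.stats import norm

def mvt_duration_days(daily_visitors, n_variations, baseline_cr, mde,
                      alpha=0.05, power=0.8):
    """Rough estimate of how many days an MVT needs to run.

    baseline_cr: current conversion rate of the control (e.g. 0.05 for 5%).
    mde: minimum detectable effect, relative (e.g. 0.10 for a 10% lift).
    """
    p1 = baseline_cr
    p2 = baseline_cr * (1 + mde)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    # Classic two-proportion sample size per variation.
    n_per_variation = ((z_alpha + z_beta) ** 2 *
                       (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)
    total_visitors = n_per_variation * n_variations
    return ceil(total_visitors / daily_visitors)

# Example: 2,000 visitors/day, 8 combinations, 5% baseline, 10% relative lift.
print(mvt_duration_days(daily_visitors=2000, n_variations=8,
                        baseline_cr=0.05, mde=0.10))
```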

5. Review your test setup

The next step to running an effective and successful MVT is to review your test setup in your testing app. You may be confident that you've taken all the necessary steps and added the variations correctly in your testing app, but there's no harm in thoroughly reviewing it once or twice more.

One of the clearest advantages of conducting a review is that it gives you an opportunity to ensure every element has been added correctly and all the necessary test selections have been made. Taking the time to quality check your test is a critical step to ensure its success. 

6. Start driving traffic

If you think your webpage doesn't have enough traffic to support the MVT experiment, it's time to look for ways to drive more. Your job is to ensure your page has as much traffic as possible so that your testing efforts don't fail. Use paid advertising, social media promotion, and other traffic generation methods to prove or disprove the hypothesis you're currently playing with.

7. Analyze your results

Once your test has run its due course and reached statistical significance, it's time to assess its results and see whether your hypothesis was right or wrong.

Since you've tested multiple page elements at once, take some time to interpret your results. Examine each variation and analyze its performance. Note that the variation that won is not necessarily the best one to implement on your website permanently; sometimes these results can be inconclusive. Use qualitative tools such as heatmaps, clickmaps, session recordings, form analytics, etc. to examine the performance of each variation and draw a conclusion.

After all, it’s important to ensure the validity of your test and implement changes on your webpage as per the preference of your audience and deliver the experience they want. 

How to run a multivariate test on a low-traffic website?

We’ve concluded time and again that MVT requires more traffic than an A/B test to show statistically significant results. But, does that mean websites with low traffic volume cannot do multivariate testing? Absolutely not! 

The reason MVT asks for high traffic is pretty obvious: the higher the number of variations, the thinner the traffic split between the variations and, hence, the longer it will take to draw conclusive results. If you're planning to run MVT on a website with low traffic volume, all you need to do is make some modifications.

1. Only test high-impact changes

Let’s say, there are 6 elements on your product page which you believe have the potential to improve the performance of your page and even increase conversions. Do all of these have equal potential? Probably not. Some may be more impactful and have more noticeable effects than others.  

When you’re planning to run MVT on a low traffic website, focus your energies on testing those site elements that can have a significant impact on your page’s performance and goals rather than testing small modifications with low impact intensity. 

Although it may be intimidating to test radical changes, when you do, the likelihood of them showing a dramatic difference in conversion rate is also high. No matter the outcome, the learnings and valuable insights about your customers' behavior and their perception of your brand can help you run better-informed future tests and make better business decisions.

2. Use fewer variations

Needless to say, a low traffic volume means testing fewer variations. We understand that it's tempting to test different optimization ideas to solve visitor problems. But with every added variation in your test, the time to achieve statistical significance also increases. Don't take that risk. Go small and go slow. This may cost you some extra effort and resources, but it will surely save you much time compared to running MVT with a larger number of variations.

Use tools like heatmaps, scrollmaps, session recordings, usability testing, etc. to find high-impact page elements. Use the ICE model (impact, confidence, ease) to create a testing pipeline and follow it.

3. Focus on micro-conversions

Your primary goal may be to increase page sign-ups, click-through rate or overall conversions, but does it make sense to use these as your primary metrics when you know it would take you much longer to gather enough conversions and even verify the test results? Surely not.

The better thing to do is to test conversions at a micro level, one at which conversions are plentiful and can help you optimize your page quickly. For instance, focus your efforts on increasing your page engagement rate, clicks on the add to cart button, clicks on images, etc.

Other goals could be setting up a conversion goal that fires when a visitor fills up an exit-intent pop-up form, stays on your website for more than 30 minutes, or scrolls down a certain depth/folds through your long-copy page. You can also use a quantitative tool like Google Analytics to analyze which conversions to map or use as goals to optimize your website.

4. Consider lowering your statistical significance setting

When you don't have the leverage to run a test on a large sample size, resort to other methods of measuring the performance of your control and variations. You don't have to wait for your test to reach a very high significance level; if your testing tool allows, you can lower your statistical significance setting. For instance, if you set your significance level to 70%, any version that reaches this mark will become the winner. In this case, you would also require a much smaller sample size than if you went for 99% significance.
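To see how the significance setting changes the required sample, here is an illustrative Python sketch based on the standard two-proportion sample-size formula; the conversion rates are assumptions for the example, and this is not VWO's internal calculation:

```python
from scipy.stats import norm

def visitors_per_variation(p1, p2, alpha, power=0.8):
    """Approximate visitors needed per variation for a two-proportion test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ((z_alpha + z_beta) ** 2 *
            (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)

baseline, variation = 0.05, 0.06  # assumed 5% control vs 6% variation rates

# 99% significance (alpha = 0.01) vs 70% significance (alpha = 0.30).
for confidence in (0.99, 0.70):
    n = visitors_per_variation(baseline, variation, alpha=1 - confidence)
    print(f"{confidence:.0%} significance -> ~{n:,.0f} visitors per variation")
```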

Optimizers across the industry recommend many ways to measure the performance of a test version, but the ones we recommend are as follows (a small illustrative sketch in code follows the list):

  • The 2-sample t-test: Also known as the independent samples t-test, it's a method used to examine whether the means of two independent samples are equal or not. If the two samples have unequal variances, you can use a different estimate of the standard deviation (Welch's version of the test) to check the results.
  • The Chi-Squared Test: The primary objective of this test is to examine the statistical significance of the observed relationship between two variants with respect to the expected outcome. In other words, it helps you analyze whether the difference between your test versions is likely to be real rather than due to chance.
  • Confidence Interval: This method measures the degree of certainty or uncertainty around a variation's observed performance, given the data at hand.
  • Measuring sessions: This is another way to test the statistical significance of a variation. Rather than measuring your test's performance by counting unique users, take sessions into account as the unit of analysis.
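For illustration only, here is a small Python sketch (with made-up visitor and conversion counts) showing how the first three checks above could be run with standard scipy routines rather than any particular testing tool:

```python
import numpy as np
from scipy import stats

# Made-up results: visitors and conversions for control (A) and variation (B).
visitors_a, conversions_a = 4000, 200   # 5.0% conversion rate
visitors_b, conversions_b = 4000, 248   # 6.2% conversion rate

# 1. Two-sample t-test on the per-visitor conversion outcomes (0/1).
outcomes_a = np.concatenate([np.ones(conversions_a), np.zeros(visitors_a - conversions_a)])
outcomes_b = np.concatenate([np.ones(conversions_b), np.zeros(visitors_b - conversions_b)])
t_stat, t_p = stats.ttest_ind(outcomes_a, outcomes_b, equal_var=False)  # Welch's t-test

# 2. Chi-squared test on the 2x2 table of converted vs not converted.
table = [[conversions_a, visitors_a - conversions_a],
         [conversions_b, visitors_b - conversions_b]]
chi2, chi_p, _, _ = stats.chi2_contingency(table)

# 3. Approximate 95% confidence interval for the difference in conversion rates.
p_a, p_b = conversions_a / visitors_a, conversions_b / visitors_b
se = np.sqrt(p_a * (1 - p_a) / visitors_a + p_b * (1 - p_b) / visitors_b)
diff = p_b - p_a
ci = (diff - 1.96 * se, diff + 1.96 * se)

print(f"t-test p={t_p:.4f}, chi-squared p={chi_p:.4f}, 95% CI for lift={ci}")
```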

5. Avoid niche testing

Avoid testing those sections or elements of your site that get very few hits. Instead, target page elements that get more traction. Site-wide CTA tests, landing page tests, and the like will help you take advantage of your site's incoming traffic. Such tests are also likely to show statistically significant results in a shorter time span.

What are the advantages and disadvantages of multivariate testing?

Instances in which multivariate testing is valuable

1. MVT helps measure the interaction between multiple page elements

Let's come back to evaluating the performance of your homemade chocolate eCommerce website. You're confident that two sequential A/B tests can produce the same results as an MVT. So, you decide to first run an A/B test comparing a static banner image with a video banner, and the latter wins. Next, on the winning variation, i.e., the video banner, you run another A/B test between two possible CTAs, ‘Explore More’ and ‘Buy Now,’ and the former proves better. Don't you think you could've come to the same conclusion with a single multivariate test?

Unlike an A/B test, a multivariate test gives you the leverage to test and measure how multiple page elements interact with each other. Just by testing a combination of various elements (let’s say your product page’s hero image, a CTA, and headline) you can not only figure out which variation performs better than others and helps increase conversions but also discover the best possible combinations of the tested page elements.


2. Provides an in-depth visitor behavior analysis

MVT enables you to conduct an in-depth analysis of visitor behavior and preference patterns. It provides you with statistics on how each variation performs against your conversion goals. This helps you re-orient your website around visitor intent, and the better that re-orientation, the higher the chances of conversion.

3. Sheds light on dead or least contributing page elements. 

A multivariate test doesn't just help you find the right combination of page elements to increase your conversions. It also sheds light on those elements that are contributing little or nothing to your site's conversions while occupying significant page space. These elements can be anything: textual content, images, banners, etc.

Relocate or replace them with elements that catch the attention of your visitors and channel some conversions as well. It's always better to have your page elements contribute something to your goals than absolutely nothing.

4. Guides you through page structuring

The importance of placing elements at the right location can be understood from the fact that visitors today have a short attention span. They devote most of their time to reading and taking in the information in the first fold of your web page. So, if you're not placing the relevant content at the top, you're greatly reducing your chances of getting conversions. MVT allows you to study the placement of various page elements and locate them in the right place in order to facilitate conversions for your business and make it easy for visitors to find what they came looking for on your page.

5. Test a wide range of combinations

Unlike a regular A/B test, which lets you test one or more variations of a particular element in isolation, MVT allows you to test multiple elements at the same time by applying the concept of permutations and combinations. Such an experimentation method not only increases the testing options you can use to tap into conversions but also helps save time by avoiding sequential A/B tests.

Instances where multivariate testing is not valuable

MVT is a highly sophisticated testing methodology. A single MVT test helps answer multiple questions at once. But, just because it's a complicated testing technique doesn't mean it's better than other techniques or that the data it generates is more useful. Every coin has two sides. We listed the pros of using a multivariate test in the above section. It's time to know the cons as well.

1. Requires more traffic to show statistically significant results

Unlike a traditional A/B test, MVT demands a high traffic inflow. This means that it only works, or shows statistically significant results, for sites which have a ton of traffic. Even if you do run it on a site with low traffic, you'll have to compromise on one thing or another, such as testing fewer combinations or using other methods to calculate a winner. Read the section above on ‘How to run a multivariate test on a low-traffic website’ for better clarity.

2. It's comparatively tricky to set up

One thing that makes people opt for an A/B test over MVT is that the former is comparatively very simple to set up. After all, you just need to change one or two elements and add variations while keeping the rest of the page design the same. Anyone who has an understanding of web design can easily set up an A/B test, and even complex A/B tests today rarely require more than a couple of minutes of a developer's time. Tools like VWO enable even non-technical folks to set up an A/B test within a matter of minutes.

On the contrary, MVT requires more effort even if you're creating a basic one, and it's also very easy for it to go off the rails. A minor mistake in the design or while creating the variations can hamper the test results. MVT is a good option for optimizers who have a lot of experience in the arena of experimentation.

3. There’s a hidden opportunity cost

Time is always of the essence and a valuable commodity for any business. When you run a test on your website, you're investing time and playing with your conversion rate. There's a hidden cost that you put on the line. Multivariate tests are comparatively complex and slow to set up, and slower to run. All the time lost during the setup phase and the course of the run creates an opportunity cost.

Amidst the time MVT takes to show meaningful results, you could have run dozens of A/B tests and drawn conclusions. A/B tests are quick to set up and also provide definite answers to many specific problems.  

4. The chances of failure are comparatively high

Needless to say, testing allows you to move fast and break things to optimize your website and make it more user friendly. You get the leverage to try crazy ideas and even fail spectacularly without facing any real risk or consequences. While this approach seems effective in the case of an A/B test, we can't say the same for multivariate testing.

While each A/B test, whether it succeeds or fails, provides a series of learning points to refer back to, the same cannot be said for a multivariate test. It's comparatively difficult to draw meaningful learnings when you play with a lot of elements, and in combination at that. Moreover, MVT is slow and tedious enough that it often doesn't make sense to take such a risk in the first place only to fail in the end.

5. MVT is biased towards design

Another MVT con that most optimizers have realized over the years is that the testing method often provides answers to problems related to design. Some of the strongest advocates or supporters of MVT are also UI and UX professionals. 

Design is obviously important, but it's surely not everything. UI and UX elements represent only a small part of the total set of variables you can use to enhance the performance of your website. Copy, promotional offers, and even site functionality are essential to ensure your website is liked by your target audience. These elements are often underestimated and overlooked in the case of MVT, despite the fact that they have a huge impact on the conversion rate of your site.

Machine learning and multivariate testing

Advances in technology, especially in artificial intelligence, are now prominently visible in the testing arena as well. For many years, programs called neural networks have enabled computers to learn as they gather data and to take actions that can be more accurate than a human's while using less data and time. However, neural networks were only able to help solve certain specific problems.

Looking at the capabilities of these neural networks, many software companies sought to use their potential to develop solutions which could enhance the entire multivariate testing process. This solution is called the evolutionary neural network, or genetic neural network.

The approach uses the capabilities of machine learning to select which website elements to test and creates all possible variations on its own. It removes the need to test all combinations and enables optimizers to test only those which have the potential to show the highest conversions.

The algorithms behind the solution prune the poor performing combinations to pave the way for more likely winners to participate in the test. Over time, the highest performing combination emerges as the winner and then becomes the new control. 

Once that happens, these algorithms then introduce variables called mutations in the test. Variants that were previously pruned are reintroduced as new combinations to analyze whether any one of these might turn successful amidst the better-performing combinations. 

This approach has proved to show better and faster results even with less traffic.

Evolutionary neural networks enable testing tools to learn which set of combinations will show positive results without testing all possible multivariate combinations. With machine learning, websites that have too little traffic to opt for MVT can also consider this option now without making compromises. 

Best practices to follow when running a multivariate test

MVT has the ability to empower optimizers to discover effective content that helps drive KPIs and enhance the performance of a website, but only when they follow best practices. 

1. Create a learning agenda

Before getting started, be sure whether MVT is the best testing approach for your identified problems or whether a simple A/B test would better suit your needs. We, at VWO, believe that it's important to first draft a learning agenda. This will help you define what exactly you want to test and what you hope to learn from the experiment.

The agenda basically acts as a blueprint, helping you establish a hypothesis, define the page elements you want to test and for which audience segment, and prioritize learning objectives accordingly. It also comes in handy when you begin to set up your test and ensures that all adjustments have been made correctly. 

2. Avoid testing all possible combinations 

For most people, running a multivariate test means testing everything that comes onto their radar. That should not be the case. Restrict yourself and test only those variables that you believe can have a high impact on your conversion rate. Also, the more elements you test, the more permutations there are, and the longer it will take to run the test and gauge the final results.

Say, for example, you want to test the performance of your display ad. You decide to test four possible images, two possible CTA button colors, and three possible headlines. This totals up to 24 variations of your display ad. When you test all the variations at once, each gets 1/24th of the total incoming traffic.

With such a thin traffic split, the chances of any variation reaching statistical significance are quite low. And even if one or some of them do, the time they take may render the test impractical.

Furthermore, not all combinations may make sense from a design standpoint. For instance, an image with a blue background and a blue CTA button will make it hard for visitors to identify the CTA, especially on a mobile screen. Use good judgment and select only those variations which can show some results.

3. Generate ideas from data for greater experimental variety and relevancy

While it's a great practice to stop yourself from testing every possible idea that pops into your head, it's also important not to ignore possibilities that could impact your conversion rate. To know whether a variation is worth sampling, generate ideas from various data sources. These could include:

  • First-hand data collected on the basis of visitor behavior, segment demographics, and interests.
  • Third-party data extracted from multiple data providers for additional visitor information such as purchase behavior, transactional data, or industry-specific data.
  • Historical performance data extracted from previously run campaigns targeting similar traffic.

4. Start eliminating low performers once your test reaches minimum sample size

It's not necessary to end your multivariate test the moment it achieves adequate sample size. Rather, you should begin eliminating the non-performing variations. Shut down variations that have negligible movement compared to control once they've achieved the needed representative sample size. This means that more traffic will flow towards variations that are performing well, enabling you to optimize your test for higher quality results, faster.

5. Use high-performing combinations for further testing

Once you’ve discovered a few potential variations, restructure your test and fine-tune the variable elements. If a certain headline on a product page is outperforming others, come up with a fresh set of variations around that headline.  

You can even opt to run a discrete A/B/n test with limited experimental groups to analyze the performance of the new variations in a shorter time. Most experience optimizers suggest that when you learn something from an experiment, you should use that knowledge to enhance the performance of other page elements. Testing is not just about increasing revenue, but about understanding visitor behavior and serving their needs.

Best multivariate testing tools

The market today is swamped with A/B testing tools, but not all of them have the capabilities required to help you run successful experiments without hassle. Nor can you take the risk of developing an in-house experimentation suite. Here's a list of the top 5 multivariate testing tools for experience optimizers who have a passion for testing:

1. VWO

VWO is an all-in-one, cloud-based experimentation tool that helps optimizers run multiple tests on their website and optimize it to ensure better conversions. Besides laying out an easy-to-use experimentation platform, the tool allows you to conduct qualitative and quantitative research, create test variations, and even analyze the performance of your tests via its robust dashboard.


VWO also offers the SmartStats feature that runs on Bayesian statistics. It gives you more control of your tests and helps derive conclusions sooner. Sign up today for a free trial and get into the habit of experimentation.

2. Optimizely 

The Optimizely platform offers a comprehensive suite of CRO tools and generally caters to enterprise-level customers only.


Essentially, Optimizely primarily provides web experimentation and personalization services. However, you can use its capabilities to run experiments on mobile apps and messaging platforms as well. You can even opt to run multiple tests on the same page and rest assured of accurate results.  

3. A/B Tasty

A/B Tasty is another experimentation platform that offers a holistic range of testing features. Besides the usual A/B and Split URL testing option, it also allows you to run multivariate tests. It has an integrated platform that’s easy-to-use and offers a real-time view of your tests and their respective confidence levels. 


4. Oracle Maxymiser

An advanced A/B testing and personalization tool, Oracle Maxymiser enables you to design and run sophisticated experiments. The platform offers many powerful features and capabilities such as multivariate testing, funnel optimization, advanced targeting and segmentation, and predictive analytics. Such features make it a perfect match for data-driven optimizers with an in-house IT support team.


5. Google Optimize 360

Google Optimize 360 is a Google product that offers a broad range of services besides experimentation. Some of these include native Google Analytics integration, URL targeting, and Geo-targeting. If you opt for Google Optimize 360’s premium version, you get to explore the tool in-depth. With Google Optimize 360 you can:

  • test up to 36 combinations when running MVT
  • run 100+ tests simultaneously
  • make 100+ personalizations at a time
  • get access to Google Analytics 360 Suite administration


Multivariate testing is an arm of A/B testing that uses the same experimentation mechanics but compares more than one variable on a website in a live environment. It departs from the traditional change-one-variable-at-a-time notion and enables you, in a way, to run multiple A/B/n tests on the same page simultaneously. At its core, it's a highly complex process that requires more time and effort, but it provides comprehensive information on how different page elements interact with each other and which combinations work best on your site.

If you still have questions around what multivariate testing is or how it can benefit your website, request a demo today! Or get into the habit of experimentation and start A/B testing today! Sign up for VWO's free trial.


What is Multivariate Data Analysis?

  • Bhumika Dutta
  • Aug 23, 2021


Introduction

We have access to huge amounts of data in today's world, and it is essential to analyze and manage that data in order to put it to meaningful use. Data and analysis go hand in hand, as each depends on the other.

Data analysis and research are also related as they both involve several tools and techniques that are used to predict the outcome of specific tasks for the benefit of any company. The majority of business issues include several factors. 

When making choices, managers use a variety of performance indicators and associated metrics. When selecting which items or services to buy, consumers consider a variety of factors. The equities that a broker suggests are influenced by a variety of variables. 

When choosing a restaurant, diners evaluate a variety of things. More elements affect managers' and customers' decisions as the world grows more complicated. As a result, business researchers, managers, and consumers must increasingly rely on more sophisticated techniques for data analysis and comprehension. 

One of the analytical techniques used to read huge amounts of data is known as Multivariate Data Analysis.


In statistics, one might have heard of variates, a variate being a particular combination of different variables. Two common variate analysis approaches are the univariate and bivariate approaches.

A single variable is statistically tested in univariate analysis, whereas two variables are statistically tested in bivariate analysis. When three or more variables are involved, the problem is intrinsically multidimensional, necessitating the use of multivariate data analysis. In this article, we are going to discuss:

  • What is multivariate data analysis?
  • Objectives of multivariate data analysis
  • Types of multivariate data analysis
  • Advantages of multivariate data analysis
  • Disadvantages of multivariate data analysis


Multivariate data analysis

Multivariate data analysis is a type of statistical analysis that involves more than two variables considered together in relation to an outcome. Many problems in the world are practical examples of multivariate relationships, since whatever happens in the world happens due to multiple reasons.

One such real-world example is the weather. The weather at any particular place does not depend solely on the ongoing season; many other factors play their specific roles, like humidity, pollution, etc. In the same way, the variables in an analysis stand in for real situations, products, services, or decisions that involve multiple factors.

Wishart presented the first article on multivariate data analysis (MVA) in 1928. The topic of the study was the distribution of the covariance matrix of a normal population with numerous variables.

Hotelling, R. A. Fisher, and others published theoretical work on MVA in the 1930s. Multivariate data analysis was widely used in the disciplines of education, psychology, and biology at the time.

As time advanced, MVA was extended to the fields of meteorology, geology, science, and medicine in the mid-1950s. Today, it draws on two types of statistics: descriptive and inferential. On the descriptive side, we frequently look for the best linear combination of variables that is mathematically tractable, while inference provides informed estimates that save analysts from having to dive too deeply into the data.

Till now we have talked about the definition and history of multivariate data analysis. Let us learn about the objectives as well.

Objectives of multivariate data analysis:

  • Multivariate data analysis helps in the reduction and simplification of data as much as possible without losing any important details.
  • As MVA has multiple variables, the variables are grouped and sorted on the basis of their unique features.
  • The variables in multivariate data analysis could be dependent or independent. It is important to verify the collected data and analyze the state of the variables.
  • In multivariate data analysis, it is very important to understand the relationship between all the variables and predict the behavior of the variables based on observations.
  • Statistical hypotheses are created based on the parameters of the multivariate data and then tested to determine whether or not the assumptions are true.


Advantages of multivariate data analysis:

The following are the advantages of multivariate data analysis:

  • As multivariate data analysis deals with multiple variables, all the variables can either be independent or dependent on each other. This helps the analysis to search for factors that can help in drawing accurate conclusions.
  • Since the analysis is tested, the drawn conclusions are closer to real-life situations.

Disadvantages of multivariate data analysis:

The following are the disadvantages of multivariate data analysis:

  • Multivariate data analysis includes many complex computations and hence can be laborious.
  • The analysis necessitates the collection and tabulation of a large number of observations for various variables. This process of observation takes a long time.


7 Types of Multivariate Data Analysis

According to this source, the following types of multivariate data analysis are used in research:

Structural Equation Modelling:

SEM, or Structural Equation Modelling, is a statistical multivariate data analysis technique that analyzes the structural relationships between variables. It is a versatile and extensive data analysis framework.

SEM evaluates both dependent and independent variables. In addition, latent variable metrics are obtained and the measurement model is verified. SEM is a hybrid of measurement analysis and structural modeling.

For multivariate data analysis, SEM takes into account measurement errors and observed factors. The factors are evaluated using multivariate analytic techniques, and this is an important component of the SEM model.


Interdependence technique:

The relationships between the variables are studied in this approach to have a better understanding of them. This aids in determining the data's pattern and the variables' assumptions.

Canonical Correlation Analysis:

Canonical correlation analysis deals with linear relationships between two sets of variables. It has two main purposes: reduction of data and interpretation of data. All possible correlations between the two sets of variables are calculated.

When the two sets contain many variables, interpreting the correlations might be difficult, but canonical correlation analysis can help highlight the links between the two sets.
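As a minimal illustration (not part of the original article), canonical correlation analysis can be run in Python with scikit-learn's CCA; the data here is randomly generated:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

# Two sets of variables measured on the same 100 observations (synthetic data).
X = rng.normal(size=(100, 3))                                       # e.g. three attitude scores
Y = X @ rng.normal(size=(3, 2)) + 0.5 * rng.normal(size=(100, 2))   # two related outcomes

cca = CCA(n_components=2)
X_c, Y_c = cca.fit_transform(X, Y)

# Correlation between each pair of canonical variates.
for i in range(2):
    r = np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1]
    print(f"Canonical correlation {i + 1}: {r:.2f}")
```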

Factor Analysis:

Factor analysis reduces data from a large number of variables to a small number of variables; it is also known as dimension reduction. This approach is used to reduce the data before proceeding with further analysis. Once factor analysis is completed, the patterns become apparent and much easier to examine.
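A minimal sketch of factor analysis as dimension reduction, using scikit-learn on synthetic data (the numbers of variables and factors are arbitrary assumptions):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)

# Synthetic data: 200 observations of 10 correlated variables driven by 2 factors.
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 10))
X = latent @ loadings + 0.3 * rng.normal(size=(200, 10))

fa = FactorAnalysis(n_components=2)   # reduce 10 observed variables to 2 factors
scores = fa.fit_transform(X)          # factor scores for each observation

print(scores.shape)            # (200, 2)
print(fa.components_.shape)    # (2, 10) estimated loadings
```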

Cluster Analysis:

Cluster analysis is a collection of approaches for categorizing instances or objects into groupings called clusters. The data is divided based on similarity and then each group is labeled during the analysis. This is a data mining function that allows analysts to gain insight into the distribution of the data based on each group's distinct characteristics.
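A small illustrative clustering sketch using k-means from scikit-learn on synthetic data (the choice of three clusters is an assumption for the example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 observations grouped around 3 centres.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # cluster label assigned to each observation

print(labels[:10])               # first ten cluster assignments
print(kmeans.cluster_centers_)   # coordinates of the three cluster centres
```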

Correspondence Analysis:

A table with a two-way array of non-negative values is used in a correspondence analysis approach. This array represents the relationship between the table's row and column entries. A table of contingency, in which the column and row entries relate to the two variables and the numbers in the table cells refer to frequencies, is a popular multivariate data analysis example.

Multidimensional Scaling:

MDS, or multidimensional scaling, is a technique that involves creating a map with the locations of the variables in a table, as well as the distances between them. There can be one or more dimensions to the map. 

The software can provide a metric or non-metric solution. The proximity matrix is a table that shows the distances in tabular form; it is populated from the findings of the experiments or from a correlation matrix.
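A brief sketch of metric multidimensional scaling with scikit-learn, starting from a synthetic proximity (distance) matrix; all data here is made up for illustration:

```python
import numpy as np
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(2)

# Synthetic points in 5 dimensions, from which we build a proximity (distance) matrix.
points = rng.normal(size=(20, 5))
proximity = squareform(pdist(points))        # 20 x 20 pairwise distances

# Map the 20 items into 2 dimensions while preserving the distances as well as possible.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(proximity)

print(coords.shape)   # (20, 2) coordinates for plotting a 2-D map
```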

Multivariate data analysis may also be used to read and analyze data contained in various databases, turning the rows and columns of a database table into meaningful information. This approach, also known as factor analysis, is used to gain an overview of a table in a database by reading strong patterns in the data such as trends, groupings, outliers, and their repetitions. This is used by huge organizations and companies.


The output of such applied multivariate statistical analysis can form the basis of, for example, a sales plan. Multivariate data analysis approaches are often utilized in companies to define objectives.



22 Multivariate Methods

\(y_1,...,y_p\) are possibly correlated random variables with means \(\mu_1,...,\mu_p\)

\[ \mathbf{y} = \left( \begin{array} {c} y_1 \\ . \\ y_p \\ \end{array} \right) \]

\[ E(\mathbf{y}) = \left( \begin{array} {c} \mu_1 \\ . \\ \mu_p \\ \end{array} \right) \]

Let \(\sigma_{ij} = cov(y_i, y_j)\) for \(i,j = 1,…,p\)

\[ \mathbf{\Sigma} = (\sigma_{ij}) = \left( \begin{array} {cccc} \sigma_{11} & \sigma_{12} & ... & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & ... & \sigma_{2p} \\ . & . & . & . \\ \sigma_{p1} & \sigma_{p2} & ... & \sigma_{pp} \end{array} \right) \]

where \(\mathbf{\Sigma}\) (symmetric) is the variance-covariance or dispersion matrix

Let \(\mathbf{u}_{p \times 1}\) and \(\mathbf{v}_{q \times 1}\) be random vectors with means \(\mu_u\) and \(\mu_v\) . Then

\[ \mathbf{\Sigma}_{uv} = cov(\mathbf{u,v}) = E[(\mathbf{u} - \mu_u)(\mathbf{v} - \mu_v)'] \]

in which \(\mathbf{\Sigma}_{uv} \neq \mathbf{\Sigma}_{vu}\) and \(\mathbf{\Sigma}_{uv} = \mathbf{\Sigma}_{vu}'\)

Properties of Covariance Matrices

  • Symmetric \(\mathbf{\Sigma}' = \mathbf{\Sigma}\)
  • Non-negative definite \(\mathbf{a'\Sigma a} \ge 0\) for any \(\mathbf{a} \in R^p\) , which is equivalent to eigenvalues of \(\mathbf{\Sigma}\) , \(\lambda_1 \ge \lambda_2 \ge ... \ge \lambda_p \ge 0\)
  • \(|\mathbf{\Sigma}| = \lambda_1 \lambda_2 ... \lambda_p \ge 0\) ( generalized variance ; the bigger this number is, the more variation there is)
  • \(trace(\mathbf{\Sigma}) = tr(\mathbf{\Sigma}) = \lambda_1 + ... + \lambda_p = \sigma_{11} + ... + \sigma_{pp} =\) sum of variance ( total variance )
  • \(\mathbf{\Sigma}\) is typically required to be positive definite, which means all eigenvalues are positive, and \(\mathbf{\Sigma}\) has an inverse \(\mathbf{\Sigma}^{-1}\) such that \(\mathbf{\Sigma}^{-1}\mathbf{\Sigma} = \mathbf{I}_{p \times p} = \mathbf{\Sigma \Sigma}^{-1}\)

Correlation Matrices

\[ \rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii} \sigma_{jj}}} \]

\[ \mathbf{R} = \left( \begin{array} {cccc} \rho_{11} & \rho_{12} & ... & \rho_{1p} \\ \rho_{21} & \rho_{22} & ... & \rho_{2p} \\ . & . & . &. \\ \rho_{p1} & \rho_{p2} & ... & \rho_{pp} \\ \end{array} \right) \]

where \(\rho_{ij}\) is the correlation, and \(\rho_{ii} = 1\) for all i

Alternatively,

\[ \mathbf{R} = [diag(\mathbf{\Sigma})]^{-1/2}\mathbf{\Sigma}[diag(\mathbf{\Sigma})]^{-1/2} \]

where \(diag(\mathbf{\Sigma})\) is the matrix which has the \(\sigma_{ii}\) ’s on the diagonal and 0’s elsewhere

and \(\mathbf{A}^{1/2}\) (the square root of a symmetric matrix) is a symmetric matrix such that \(\mathbf{A} = \mathbf{A}^{1/2}\mathbf{A}^{1/2}\)
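The relationship \(\mathbf{R} = [diag(\mathbf{\Sigma})]^{-1/2}\mathbf{\Sigma}[diag(\mathbf{\Sigma})]^{-1/2}\) is easy to verify numerically; a small base-R check, with an arbitrary 2 x 2 covariance matrix, is sketched below.

```r
# Converting a covariance matrix to a correlation matrix (base R)
Sigma <- matrix(c(4, 1,
                  1, 2), nrow = 2, byrow = TRUE)

D_inv_sqrt <- diag(1 / sqrt(diag(Sigma)))    # [diag(Sigma)]^(-1/2)
R <- D_inv_sqrt %*% Sigma %*% D_inv_sqrt     # R = D^(-1/2) Sigma D^(-1/2)

all.equal(R, cov2cor(Sigma))                 # matches the built-in cov2cor()
```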

Let \(\mathbf{x}\) and \(\mathbf{y}\) be random vectors with means \(\mu_x\) and \(\mu_y\) and variance-covariance matrices \(\mathbf{\Sigma}_x\) and \(\mathbf{\Sigma}_y\) .

Let \(\mathbf{A}\) and \(\mathbf{B}\) be matrices of constants and \(\mathbf{c}\) and \(\mathbf{d}\) be vectors of constants. Then

\(E(\mathbf{Ay + c} ) = \mathbf{A} \mu_y + \mathbf{c}\)

\(var(\mathbf{Ay + c}) = \mathbf{A} var(\mathbf{y})\mathbf{A}' = \mathbf{A \Sigma_y A}'\)

\(cov(\mathbf{Ay + c, By+ d}) = \mathbf{A\Sigma_y B}'\)

\(E(\mathbf{Ay + Bx + c}) = \mathbf{A \mu_y + B \mu_x + c}\)

\(var(\mathbf{Ay + Bx + c}) = \mathbf{A \Sigma_y A' + B \Sigma_x B' + A \Sigma_{yx}B' + B\Sigma'_{yx}A'}\)

Multivariate Normal Distribution

Let \(\mathbf{y}\) be a multivariate normal (MVN) random variable with mean \(\mu\) and variance \(\mathbf{\Sigma}\) . Then the density of \(\mathbf{y}\) is

\[ f(\mathbf{y}) = \frac{1}{(2\pi)^{p/2}|\mathbf{\Sigma}|^{1/2}} \exp(-\frac{1}{2} \mathbf{(y-\mu)'\Sigma^{-1}(y-\mu)} ) \]

\(\mathbf{y} \sim N_p(\mu, \mathbf{\Sigma})\)

22.0.1 Properties of MVN

Let \(\mathbf{A}_{r \times p}\) be a fixed matrix. Then \(\mathbf{Ay} \sim N_r (\mathbf{A \mu, A \Sigma A'})\) . \(r \le p\) and all rows of \(\mathbf{A}\) must be linearly independent to guarantee that \(\mathbf{A \Sigma A}'\) is non-singular.

Let \(\mathbf{G}\) be a matrix such that \(\mathbf{\Sigma}^{-1} = \mathbf{GG}'\) . Then \(\mathbf{G'y} \sim N_p(\mathbf{G' \mu, I})\) and \(\mathbf{G'(y-\mu)} \sim N_p (0,\mathbf{I})\)

Any fixed linear combination of \(y_1,...,y_p\) (say \(\mathbf{c'y}\) ) follows \(\mathbf{c'y} \sim N_1 (\mathbf{c' \mu, c' \Sigma c})\)

Define a partition, \([\mathbf{y}'_1,\mathbf{y}_2']'\) where

\(\mathbf{y}_1\) is \(p_1 \times 1\)

\(\mathbf{y}_2\) is \(p_2 \times 1\) ,

\(p_1 + p_2 = p\)

\(p_1,p_2 \ge 1\) Then

\[ \left( \begin{array} {c} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \end{array} \right) \sim N \left( \left( \begin{array} {c} \mu_1 \\ \mu_2 \\ \end{array} \right), \left( \begin{array} {cc} \mathbf{\Sigma}_{11} & \mathbf{\Sigma}_{12} \\ \mathbf{\Sigma}_{21} & \mathbf{\Sigma}_{22}\\ \end{array} \right) \right) \]

The marginal distributions of \(\mathbf{y}_1\) and \(\mathbf{y}_2\) are \(\mathbf{y}_1 \sim N_{p1}(\mathbf{\mu_1, \Sigma_{11}})\) and \(\mathbf{y}_2 \sim N_{p2}(\mathbf{\mu_2, \Sigma_{22}})\)

Individual components \(y_1,...,y_p\) are all normally distributed \(y_i \sim N_1(\mu_i, \sigma_{ii})\)

The conditional distribution of \(\mathbf{y}_1\) and \(\mathbf{y}_2\) is normal

\(\mathbf{y}_1 | \mathbf{y}_2 \sim N_{p1}(\mathbf{\mu_1 + \Sigma_{12} \Sigma_{22}^{-1}(y_2 - \mu_2),\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}})\)

  • In this formula, we see if we know (have info about) \(\mathbf{y}_2\) , we can re-weight \(\mathbf{y}_1\) ’s mean, and the variance is reduced because we know more about \(\mathbf{y}_1\) because we know \(\mathbf{y}_2\)

which is analogous to \(\mathbf{y}_2 | \mathbf{y}_1\) . And \(\mathbf{y}_1\) and \(\mathbf{y}_2\) are independently distributed only if \(\mathbf{\Sigma}_{12} = 0\)

If \(\mathbf{y} \sim N(\mathbf{\mu, \Sigma})\) and \(\mathbf{\Sigma}\) is positive definite, then \(\mathbf{(y-\mu)' \Sigma^{-1} (y - \mu)} \sim \chi^2_{(p)}\)

If \(\mathbf{y}_i\) are independent \(N_p (\mathbf{\mu}_i , \mathbf{\Sigma}_i)\) random variables, then for fixed matrices \(\mathbf{A}_{i(m \times p)}\) , \(\sum_{i=1}^k \mathbf{A}_i \mathbf{y}_i \sim N_m (\sum_{i=1}^{k} \mathbf{A}_i \mathbf{\mu}_i, \sum_{i=1}^k \mathbf{A}_i \mathbf{\Sigma}_i \mathbf{A}_i')\)
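To make the conditional-distribution formula concrete, here is a small base-R helper that computes the conditional mean and variance for a partitioned MVN; the mean vector and covariance matrix used below are made up purely for illustration.

```r
# Conditional distribution of the first block given the second for a partitioned MVN
# (base-R sketch; the numbers below are made-up illustrations)
cond_mvn <- function(mu, Sigma, idx1, idx2, y2) {
  S11 <- Sigma[idx1, idx1, drop = FALSE]
  S12 <- Sigma[idx1, idx2, drop = FALSE]
  S22 <- Sigma[idx2, idx2, drop = FALSE]
  mu_c  <- mu[idx1] + S12 %*% solve(S22, y2 - mu[idx2])   # re-weighted mean
  Sig_c <- S11 - S12 %*% solve(S22, t(S12))               # reduced variance
  list(mean = drop(mu_c), var = Sig_c)
}

mu    <- c(1, 2, 3)
Sigma <- matrix(c(3,   1,   1,
                  1,   2,   0.5,
                  1,   0.5, 1), nrow = 3, byrow = TRUE)

# conditional distribution of components 1 and 2 given component 3 = 2.5
cond_mvn(mu, Sigma, idx1 = 1:2, idx2 = 3, y2 = 2.5)
```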

Multiple Regression

\[ \left( \begin{array} {c} Y \\ \mathbf{x} \end{array} \right) \sim N_{p+1} \left( \left[ \begin{array} {c} \mu_y \\ \mathbf{\mu}_x \end{array} \right] , \left[ \begin{array} {cc} \sigma^2_Y & \mathbf{\Sigma}_{yx} \\ \mathbf{\Sigma}_{yx}' & \mathbf{\Sigma}_{xx} \end{array} \right] \right) \]

The conditional distribution of Y given x follows a univariate normal distribution with

\[ \begin{aligned} E(Y| \mathbf{x}) &= \mu_y + \mathbf{\Sigma}_{yx} \Sigma_{xx}^{-1} (\mathbf{x}- \mu_x) \\ &= \mu_y - \Sigma_{yx} \Sigma_{xx}^{-1}\mu_x + \Sigma_{yx} \Sigma_{xx}^{-1}\mathbf{x} \\ &= \beta_0 + \mathbf{\beta'x} \end{aligned} \]

where \(\mathbf{\beta} = (\beta_1,...,\beta_p)' = \mathbf{\Sigma}_{xx}^{-1} \mathbf{\Sigma}_{yx}'\) (analogous to \(\mathbf{(x'x)^{-1}x'y}\) , but not the same if we consider \(Y_i\) and \(\mathbf{x}_i\) , \(i = 1,..,n\) , and use the empirical covariances), and the conditional variance is \(var(Y|\mathbf{x}) = \sigma^2_Y - \mathbf{\Sigma_{yx}\Sigma^{-1}_{xx} \Sigma'_{yx}}\)

Samples from Multivariate Normal Populations

A random sample of size n, \(\mathbf{y}_1,.., \mathbf{y}_n\) from \(N_p (\mathbf{\mu}, \mathbf{\Sigma})\) . Then

Since \(\mathbf{y}_1,..., \mathbf{y}_n\) are iid, their sample mean, \(\bar{\mathbf{y}} = \sum_{i=1}^n \mathbf{y}_i/n \sim N_p (\mathbf{\mu}, \mathbf{\Sigma}/n)\) . that is, \(\bar{\mathbf{y}}\) is an unbiased estimator of \(\mathbf{\mu}\)

The \(p \times p\) sample variance-covariance matrix, \(\mathbf{S}\) is \(\mathbf{S} = \frac{1}{n-1}\sum_{i=1}^n (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})' = \frac{1}{n-1} (\sum_{i=1}^n \mathbf{y}_i \mathbf{y}_i' - n \bar{\mathbf{y}}\bar{\mathbf{y}}')\)

  • where \(\mathbf{S}\) is symmetric, unbiased estimator of \(\mathbf{\Sigma}\) and has \(p(p+1)/2\) random variables.

\((n-1)\mathbf{S} \sim W_p (n-1, \mathbf{\Sigma})\) is a Wishart distribution with n-1 degrees of freedom and expectation \((n-1) \mathbf{\Sigma}\) . The Wishart distribution is a multivariate extension of the Chi-squared distribution.

\(\bar{\mathbf{y}}\) and \(\mathbf{S}\) are independent

\(\bar{\mathbf{y}}\) and \(\mathbf{S}\) are sufficient statistics. (All of the info in the data about \(\mathbf{\mu}\) and \(\mathbf{\Sigma}\) is contained in \(\bar{\mathbf{y}}\) and \(\mathbf{S}\) , regardless of sample size).
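In R, \(\bar{\mathbf{y}}\) and \(\mathbf{S}\) are one line each; the sketch below uses the iris measurements purely as example data.

```r
# Sample mean vector and sample variance-covariance matrix (base R)
Y <- as.matrix(iris[, 1:4])     # n x p data matrix (p = 4 traits)
n <- nrow(Y)

ybar <- colMeans(Y)             # estimator of mu
S    <- cov(Y)                  # unbiased estimator of Sigma (divides by n - 1)

# the same S "by hand", following the definition above
Yc <- sweep(Y, 2, ybar)         # center each column at its mean
all.equal(S, t(Yc) %*% Yc / (n - 1), check.attributes = FALSE)
```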

Large Sample Properties

\(\mathbf{y}_1,..., \mathbf{y}_n\) are a random sample from some population with mean \(\mathbf{\mu}\) and variance-covariance matrix \(\mathbf{\Sigma}\)

\(\bar{\mathbf{y}}\) is a consistent estimator for \(\mu\)

\(\mathbf{S}\) is a consistent estimator for \(\mathbf{\Sigma}\)

Multivariate Central Limit Theorem : Similar to the univariate case, \(\sqrt{n}(\bar{\mathbf{y}} - \mu) \dot{\sim} N_p (\mathbf{0,\Sigma})\) where n is large relative to p ( \(n \ge 25p\) ), which is equivalent to \(\bar{\mathbf{y}} \dot{\sim} N_p (\mu, \mathbf{\Sigma}/n)\)

Wald’s Theorem : \(n(\bar{\mathbf{y}} - \mu)' \mathbf{S}^{-1} (\bar{\mathbf{y}} - \mu) \dot{\sim} \chi^2_{(p)}\) when n is large relative to p.

Maximum Likelihood Estimation for MVN

Suppose iid \(\mathbf{y}_1 ,... \mathbf{y}_n \sim N_p (\mu, \mathbf{\Sigma})\) , the likelihood function for the data is

\[ \begin{aligned} L(\mu, \mathbf{\Sigma}) &= \prod_{j=1}^n \frac{1}{(2\pi)^{p/2}|\mathbf{\Sigma}|^{1/2}} \exp(-\frac{1}{2}(\mathbf{y}_j -\mu)'\mathbf{\Sigma}^{-1}(\mathbf{y}_j -\mu)) \\ &= \frac{1}{(2\pi)^{np/2}|\mathbf{\Sigma}|^{n/2}} \exp(-\frac{1}{2} \sum_{j=1}^n(\mathbf{y}_j -\mu)'\mathbf{\Sigma}^{-1}(\mathbf{y}_j -\mu)) \end{aligned} \]

Then, the MLEs are

\[ \hat{\mu} = \bar{\mathbf{y}} \]

\[ \hat{\mathbf{\Sigma}} = \frac{n-1}{n} \mathbf{S} \]

using derivatives of the log of the likelihood function with respect to \(\mu\) and \(\mathbf{\Sigma}\)

Properties of MLEs

Invariance: If \(\hat{\theta}\) is the MLE of \(\theta\) , then the MLE of \(h(\theta)\) is \(h(\hat{\theta})\) for any function h(.)

Consistency: MLEs are consistent estimators, but they are usually biased

Efficiency: MLEs are efficient estimators (no other estimator has a smaller variance for large samples)

Asymptotic normality: Suppose that \(\hat{\theta}_n\) is the MLE for \(\theta\) based upon n independent observations. Then \(\hat{\theta}_n \dot{\sim} N(\theta, \mathbf{H}^{-1})\)

\(\mathbf{H}\) is the Fisher Information Matrix, which contains the expected values of the second partial derivatives of the log-likelihood function. The (i,j)-th element of \(\mathbf{H}\) is \(-E(\frac{\partial^2 l(\mathbf{\theta})}{\partial \theta_i \partial \theta_j})\)

we can estimate \(\mathbf{H}\) by finding the form determined above, and evaluate it at \(\theta = \hat{\theta}_n\)

Likelihood ratio testing: for some null hypothesis, \(H_0\) we can form a likelihood ratio test

The statistic is: \(\Lambda = \frac{\max_{H_0}l(\mathbf{\mu}, \mathbf{\Sigma|Y})}{\max l(\mu, \mathbf{\Sigma | Y})}\)

For large n, \(-2 \log \Lambda \sim \chi^2_{(v)}\) where v is the number of parameters in the unrestricted space minus the number of parameters under \(H_0\)

Test of Multivariate Normality

Check univariate normality for each trait (X) separately

We can check univariate normality with standard normality assessments (e.g., Q-Q plots or univariate normality tests).

A useful fact: if any of the univariate traits is not normal, then the joint distribution is not normal, because if a joint multivariate distribution is normal, then every marginal distribution has to be normal.

However, marginal normality of all traits does not imply joint MVN

Easily rule out multivariate normality, but not easy to prove it

Mardia’s tests for multivariate normality

Multivariate skewness is \[ \beta_{1,p} = E[(\mathbf{y}- \mathbf{\mu})' \mathbf{\Sigma}^{-1} (\mathbf{x} - \mathbf{\mu})]^3 \]

where \(\mathbf{x}\) and \(\mathbf{y}\) are independent, but have the same distribution (note: \(\beta\) here is not regression coefficient)

Multivariate kurtosis is defined as

\[ \beta_{2,p} = E[(\mathbf{y}- \mathbf{\mu})' \mathbf{\Sigma}^{-1} (\mathbf{y} - \mathbf{\mu})]^2 \]

For the MVN distribution, we have \(\beta_{1,p} = 0\) and \(\beta_{2,p} = p(p+2)\)

For a sample of size n, we can estimate

\[ \hat{\beta}_{1,p} = \frac{1}{n^2}\sum_{i=1}^n \sum_{j=1}^n g^3_{ij} \]

\[ \hat{\beta}_{2,p} = \frac{1}{n} \sum_{i=1}^n g^2_{ii} \]

  • where \(g_{ij} = (\mathbf{y}_i - \bar{\mathbf{y}})' \mathbf{S}^{-1} (\mathbf{y}_j - \bar{\mathbf{y}})\) . Note: \(g_{ii} = d^2_i\) where \(d^2_i\) is the Mahalanobis distance

( Mardia 1970 ) shows for large n

\[ \kappa_1 = \frac{n \hat{\beta}_{1,p}}{6} \dot{\sim} \chi^2_{p(p+1)(p+2)/6} \]

\[ \kappa_2 = \frac{\hat{\beta}_{2,p} - p(p+2)}{\sqrt{8p(p+2)/n}} \sim N(0,1) \]

Hence, we can use \(\kappa_1\) and \(\kappa_2\) to test the null hypothesis of MVN.
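Mardia's statistics are simple to compute directly; below is a base-R sketch of the estimators and the \(\kappa_1\), \(\kappa_2\) tests above. Note that this version uses the maximum-likelihood covariance with divisor \(n\), as in Mardia's original formulation; using \(\mathbf{S}\) instead changes the values slightly.

```r
# Mardia's multivariate skewness and kurtosis (base-R sketch of the formulas above)
mardia <- function(Y) {
  Y  <- as.matrix(Y); n <- nrow(Y); p <- ncol(Y)
  Yc <- sweep(Y, 2, colMeans(Y))                 # centered data
  Sinv <- solve(cov(Y) * (n - 1) / n)            # ML covariance (divisor n)
  G  <- Yc %*% Sinv %*% t(Yc)                    # matrix of g_ij values
  b1 <- mean(G^3)                                # (1/n^2) sum_i sum_j g_ij^3
  b2 <- mean(diag(G)^2)                          # (1/n)   sum_i g_ii^2
  k1 <- n * b1 / 6
  k2 <- (b2 - p * (p + 2)) / sqrt(8 * p * (p + 2) / n)
  c(b1 = b1, b2 = b2,
    p.skew = pchisq(k1, df = p * (p + 1) * (p + 2) / 6, lower.tail = FALSE),
    p.kurt = 2 * pnorm(abs(k2), lower.tail = FALSE))
}

mardia(iris[iris$Species == "setosa", 1:4])      # example data
```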

When the data are non-normal, normal theory tests on the mean are sensitive to \(\beta_{1,p}\) , while tests on the covariance are sensitive to \(\beta_{2,p}\)

Alternatively, Doornik-Hansen test for multivariate normality ( Doornik and Hansen 2008 )

Chi-square Q-Q plot

Let \(\mathbf{y}_i, i = 1,...,n\) be a random sample from \(N_p(\mathbf{\mu}, \mathbf{\Sigma})\)

Then \(\mathbf{z}_i = \mathbf{\Sigma}^{-1/2}(\mathbf{y}_i - \mathbf{\mu}), i = 1,...,n\) are iid \(N_p (\mathbf{0}, \mathbf{I})\) . Thus, \(d_i^2 = \mathbf{z}_i' \mathbf{z}_i \sim \chi^2_p , i = 1,...,n\)

Plot the ordered \(d_i^2\) values against the quantiles of the \(\chi^2_p\) distribution. When normality holds, the plot should approximately resemble a straight line through the origin with slope 1 (a 45-degree line).

This requires a large sample size (i.e., it is sensitive to sample size). Even if we generate data from a MVN distribution, the tail of the chi-square Q-Q plot can still be out of line.
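A chi-square Q-Q plot takes only a few lines in base R, using mahalanobis() for the squared distances; the iris subset is just example data.

```r
# Chi-square Q-Q plot of squared Mahalanobis distances (base R)
Y <- as.matrix(iris[iris$Species == "versicolor", 1:4])   # example data
n <- nrow(Y); p <- ncol(Y)

d2   <- mahalanobis(Y, center = colMeans(Y), cov = cov(Y))  # d_i^2
qchi <- qchisq((1:n - 0.5) / n, df = p)                     # chi-square_p quantiles

plot(qchi, sort(d2),
     xlab = "Chi-square quantiles", ylab = "Ordered squared distances")
abline(0, 1)    # under MVN the points should lie near this 45-degree line
```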

If the data are not normal, we can

  • use nonparametric methods
  • use models based upon an approximate distribution (e.g., GLMM)
  • try performing a transformation


22.0.2 Mean Vector Inference

In the univariate normal distribution, we test \(H_0: \mu =\mu_0\) by using

\[ T = \frac{\bar{y}- \mu_0}{s/\sqrt{n}} \sim t_{n-1} \]

under the null hypothesis. And reject the null if \(|T|\) is large relative to \(t_{(1-\alpha/2,n-1)}\) because it means that seeing a value as large as what we observed is rare if the null is true

Equivalently,

\[ T^2 = \frac{(\bar{y}- \mu_0)^2}{s^2/n} = n(\bar{y}- \mu_0)(s^2)^{-1}(\bar{y}- \mu_0) \sim f_{(1,n-1)} \]

22.0.2.1 Natural Multivariate Generalization

\[ \begin{aligned} &H_0: \mathbf{\mu} = \mathbf{\mu}_0 \\ &H_a: \mathbf{\mu} \neq \mathbf{\mu}_0 \end{aligned} \]

Define Hotelling’s \(T^2\) by

\[ T^2 = n(\bar{\mathbf{y}} - \mathbf{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \mathbf{\mu}_0) \]

which can be viewed as a generalized distance between \(\bar{\mathbf{y}}\) and \(\mathbf{\mu}_0\)

Under the assumption of normality,

\[ F = \frac{n-p}{(n-1)p} T^2 \sim f_{(p,n-p)} \]

and reject the null hypothesis when \(F > f_{(1-\alpha, p, n-p)}\)

The \(T^2\) test is invariant to changes in measurement units.

  • If \(\mathbf{z = Cy + d}\) where \(\mathbf{C}\) and \(\mathbf{d}\) do not depend on \(\mathbf{y}\) , then \(T^2(\mathbf{z}) = T^2(\mathbf{y})\)

The \(T^2\) test can be derived as a likelihood ratio test of \(H_0: \mu = \mu_0\)
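The one-sample test is straightforward to compute by hand in base R; the hypothesized mean vector below is purely illustrative.

```r
# One-sample Hotelling's T^2 test of H0: mu = mu0 (base R)
Y <- as.matrix(iris[iris$Species == "setosa", 1:4])   # example data
n <- nrow(Y); p <- ncol(Y)
mu0 <- c(5.0, 3.4, 1.5, 0.2)                          # illustrative hypothesized mean

ybar <- colMeans(Y); S <- cov(Y)
T2    <- n * t(ybar - mu0) %*% solve(S) %*% (ybar - mu0)
Fstat <- (n - p) / ((n - 1) * p) * T2
pval  <- pf(Fstat, df1 = p, df2 = n - p, lower.tail = FALSE)

c(T2 = T2, F = Fstat, p.value = pval)
```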

22.0.2.2 Confidence Intervals

22.0.2.2.1 Confidence Region

An “exact” \(100(1-\alpha)\%\) confidence region for \(\mathbf{\mu}\) is the set of all vectors, \(\mathbf{v}\) , which are “close enough” to the observed mean vector, \(\bar{\mathbf{y}}\) , to satisfy

\[ n(\bar{\mathbf{y}} - \mathbf{v})'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \mathbf{v}) \le \frac{(n-1)p}{n-p} f_{(1-\alpha, p, n-p)} \]

  • \(\mathbf{v}\) are just the mean vectors that are not rejected by the \(T^2\) test when \(\mathbf{\bar{y}}\) is observed.

In general the confidence region is a “hyper-ellipsoid” (an ellipse when there are 2 parameters).

In this region, it consists of all \(\mathbf{\mu}_0\) vectors for which the \(T^2\) test would not reject \(H_0\) at significance level \(\alpha\)

Even though the confidence region better assesses the joint knowledge concerning plausible values of \(\mathbf{\mu}\) , people typically include confidence statements about the individual component means. We’d like all of the separate confidence statements to hold simultaneously with a specified high probability. Simultaneous confidence intervals protect against any of the statements being incorrect.

22.0.2.2.1.1 Simultaneous Confidence Statements

  • Intervals based on a rectangular confidence region by projecting the previous region onto the coordinate axes:

\[ \bar{y}_{i} \pm \sqrt{\frac{(n-1)p}{n-p}f_{(1-\alpha, p,n-p)}\frac{s_{ii}}{n}} \]

for all \(i = 1,..,p\)

The implied confidence region is conservative; its coverage probability is at least \(100(1- \alpha)\%\)

Generally, simultaneous \(100(1-\alpha) \%\) confidence intervals for all linear combinations , \(\mathbf{a}\) of the elements of the mean vector are given by

\[ \mathbf{a'\bar{y}} \pm \sqrt{\frac{(n-1)p}{n-p}f_{(1-\alpha, p,n-p)}\frac{\mathbf{a'Sa}}{n}} \]

works for any arbitrary linear combination \(\mathbf{a'\mu} = a_1 \mu_1 + ... + a_p \mu_p\) , which is a projection onto the axis in the direction of \(\mathbf{a}\)

These intervals have the property that the probability that at least one such interval does not contain the appropriate \(\mathbf{a' \mu}\) is no more than \(\alpha\)

These types of intervals can be used for “data snooping” (like \[Scheffe\] )

22.0.2.2.1.2 One \(\mu\) at a time

  • One at a time confidence intervals:

\[ \bar{y}_i \pm t_{(1 - \alpha/2, n-1)} \sqrt{\frac{s_{ii}}{n}} \]

Each of these intervals has a probability of \(1-\alpha\) of covering the appropriate \(\mu_i\)

But they ignore the covariance structure of the \(p\) variables

If we only care about \(k\) simultaneous intervals, we can use “one at a time” method with the \[Bonferroni\] correction.

This method gets more conservative as the number of intervals \(k\) increases.
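The sketch below contrasts the \(T^2\)-based simultaneous intervals with the Bonferroni one-at-a-time intervals for the \(p\) component means, again using the iris measurements as example data.

```r
# T^2-based simultaneous intervals vs Bonferroni intervals for the p means (base R)
Y <- as.matrix(iris[iris$Species == "setosa", 1:4])   # example data
n <- nrow(Y); p <- ncol(Y); alpha <- 0.05

ybar <- colMeans(Y)
s2   <- diag(cov(Y))          # the s_ii

# simultaneous T^2 intervals (valid for all linear combinations a'mu)
crit.T2 <- sqrt((n - 1) * p / (n - p) * qf(1 - alpha, p, n - p))
simult  <- cbind(lower = ybar - crit.T2 * sqrt(s2 / n),
                 upper = ybar + crit.T2 * sqrt(s2 / n))

# Bonferroni intervals for the p means only (k = p)
crit.bon <- qt(1 - alpha / (2 * p), df = n - 1)
bonf     <- cbind(lower = ybar - crit.bon * sqrt(s2 / n),
                  upper = ybar + crit.bon * sqrt(s2 / n))

simult
bonf        # narrower here, but they only cover the k chosen comparisons
```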

22.0.3 General Hypothesis Testing

22.0.3.1 One-Sample Tests

\[ H_0: \mathbf{C \mu= 0} \]

  • \(\mathbf{C}\) is a \(c \times p\) matrix of rank c where \(c \le p\)

We can test this hypothesis using the following statistic

\[ F = \frac{n - c}{(n-1)c} T^2 \]

where \(T^2 = n(\mathbf{C\bar{y}})' (\mathbf{CSC'})^{-1} (\mathbf{C\bar{y}})\)

\[ H_0: \mu_1 = \mu_2 = ... = \mu_p \]

\[ \begin{aligned} \mu_1 - \mu_2 &= 0 \\ &\vdots \\ \mu_{p-1} - \mu_p &= 0 \end{aligned} \]

a total of \(p-1\) tests. Hence, we have \(\mathbf{C}\) as the \((p - 1) \times p\) matrix

\[ \mathbf{C} = \left( \begin{array} {ccccc} 1 & -1 & 0 & \ldots & 0 \\ 0 & 1 & -1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & 1 & -1 \end{array} \right) \]

number of rows = \(c = p -1\)

Equivalently, we can also compare all of the other means to the first mean. Then, we test \(\mu_1 - \mu_2 = 0, \mu_1 - \mu_3 = 0,..., \mu_1 - \mu_p = 0\) , the \((p-1) \times p\) matrix \(\mathbf{C}\) is

\[ \mathbf{C} = \left( \begin{array} {ccccc} -1 & 1 & 0 & \ldots & 0 \\ -1 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -1 & 0 & \ldots & 0 & 1 \end{array} \right) \]

The value of \(T^2\) is invariant to these equivalent choices of \(\mathbf{C}\)

This is often used for repeated measures designs , where each subject receives each treatment once over successive periods of time (all treatments are administered to each unit).

Let \(y_{ij}\) be the response from subject i at time j for \(i = 1,..,n, j = 1,...,T\) . In this case, \(\mathbf{y}_i = (y_{i1}, ..., y_{iT})', i = 1,...,n\) are a random sample from \(N_T (\mathbf{\mu}, \mathbf{\Sigma})\)

Let \(n=8\) subjects, \(T = 6\) . We are interested in \(\mu_1, .., \mu_6\)

\[ H_0: \mu_1 = \mu_2 = ... = \mu_6 \]

\[ \begin{aligned} \mu_1 - \mu_2 &= 0 \\ \mu_2 - \mu_3 &= 0 \\ &... \\ \mu_5 - \mu_6 &= 0 \end{aligned} \]

We can also test orthogonal polynomial contrasts. For example, with 4 equally spaced time points, to test the null hypothesis that the quadratic and cubic effects are jointly equal to 0, we would define \(\mathbf{C}\) as

\[ \mathbf{C} = \left( \begin{array} {cccc} 1 & -1 & -1 & 1 \\ -1 & 3 & -3 & 1 \end{array} \right) \]

22.0.3.2 Two-Sample Tests

Consider the analogous two sample multivariate tests.

Example: we have data on two independent random samples, one sample from each of two populations

\[ \begin{aligned} \mathbf{y}_{1i} &\sim N_p (\mathbf{\mu_1, \Sigma}) \\ \mathbf{y}_{2j} &\sim N_p (\mathbf{\mu_2, \Sigma}) \end{aligned} \]

equal variance-covariance matrices

independent random samples

We can summarize our data using the sufficient statistics \(\mathbf{\bar{y}}_1, \mathbf{S}_1, \mathbf{\bar{y}}_2, \mathbf{S}_2\) with respective sample sizes, \(n_1,n_2\)

Since we assume that \(\mathbf{\Sigma}_1 = \mathbf{\Sigma}_2 = \mathbf{\Sigma}\) , compute a pooled estimate of the variance-covariance matrix on \(n_1 + n_2 - 2\) df

\[ \mathbf{S} = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2-1) \mathbf{S}_2}{(n_1 -1) + (n_2 - 1)} \]

\[ \begin{aligned} &H_0: \mathbf{\mu}_1 = \mathbf{\mu}_2 \\ &H_a: \mathbf{\mu}_1 \neq \mathbf{\mu}_2 \end{aligned} \]

At least one element of the mean vectors is different

\(\mathbf{\bar{y}}_1 - \mathbf{\bar{y}}_2\) to estimate \(\mu_1 - \mu_2\)

\(\mathbf{S}\) to estimate \(\mathbf{\Sigma}\)

Note: because we assume the two populations are independent, there is no covariance

\(cov(\mathbf{\bar{y}}_1 - \mathbf{\bar{y}}_2) = var(\mathbf{\bar{y}}_1) + var(\mathbf{\bar{y}}_2) = \frac{\mathbf{\Sigma_1}}{n_1} + \frac{\mathbf{\Sigma_2}}{n_2} = \mathbf{\Sigma}(\frac{1}{n_1} + \frac{1}{n_2})\)

Reject \(H_0\) if

\[ \begin{aligned} T^2 &= (\mathbf{\bar{y}}_1 - \mathbf{\bar{y}}_2)'\{ \mathbf{S} (\frac{1}{n_1} + \frac{1}{n_2})\}^{-1} (\mathbf{\bar{y}}_1 - \mathbf{\bar{y}}_2)\\ &= \frac{n_1 n_2}{n_1 +n_2} (\mathbf{\bar{y}}_1 - \mathbf{\bar{y}}_2)'\{ \mathbf{S} \}^{-1} (\mathbf{\bar{y}}_1 - \mathbf{\bar{y}}_2)\\ & \ge \frac{(n_1 + n_2 -2)p}{n_1 + n_2 - p - 1} f_{(1- \alpha, p, n_1 + n_2 - p -1)} \end{aligned} \]

or equivalently, if

\[ F = \frac{n_1 + n_2 - p -1}{(n_1 + n_2 -2)p} T^2 \ge f_{(1- \alpha, p , n_1 + n_2 -p -1)} \]

A \(100(1-\alpha) \%\) confidence region for \(\mu_1 - \mu_2\) consists of all vector \(\delta\) which satisfy

\[ \frac{n_1 n_2}{n_1 + n_2} (\mathbf{\bar{y}}_1 - \mathbf{\bar{y}}_2 - \mathbf{\delta})' \mathbf{S}^{-1}(\mathbf{\bar{y}}_1 - \mathbf{\bar{y}}_2 - \mathbf{\delta}) \le \frac{(n_1 + n_2 - 2)p}{n_1 + n_2 -p - 1}f_{(1-\alpha, p , n_1 + n_2 - p -1)} \]

The simultaneous confidence intervals for all linear combinations of \(\mu_1 - \mu_2\) have the form

\[ \mathbf{a'}(\mathbf{\bar{y}}_1 - \mathbf{\bar{y}}_2) \pm \sqrt{\frac{(n_1 + n_2 -2)p}{n_1 + n_2 - p -1} f_{(1-\alpha, p, n_1 + n_2 -p -1)}} \times \sqrt{\mathbf{a'Sa}(\frac{1}{n_1} + \frac{1}{n_2})} \]

Bonferroni intervals, for k combinations

\[ (\bar{y}_{1i} - \bar{y}_{2i}) \pm t_{(1-\alpha/2k, n_1 + n_2 - 2)}\sqrt{(\frac{1}{n_1} + \frac{1}{n_2})s_{ii}} \]
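A base-R computation of the two-sample test, with two iris species standing in for the two populations:

```r
# Two-sample Hotelling's T^2 test of H0: mu1 = mu2 (base R)
Y1 <- as.matrix(iris[iris$Species == "versicolor", 1:4])   # example data
Y2 <- as.matrix(iris[iris$Species == "virginica",  1:4])
n1 <- nrow(Y1); n2 <- nrow(Y2); p <- ncol(Y1)

d  <- colMeans(Y1) - colMeans(Y2)
Sp <- ((n1 - 1) * cov(Y1) + (n2 - 1) * cov(Y2)) / (n1 + n2 - 2)   # pooled S

T2    <- (n1 * n2 / (n1 + n2)) * t(d) %*% solve(Sp) %*% d
Fstat <- (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * T2
pval  <- pf(Fstat, df1 = p, df2 = n1 + n2 - p - 1, lower.tail = FALSE)

c(T2 = T2, F = Fstat, p.value = pval)
```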

22.0.3.3 Model Assumptions

If the model assumptions are not met:

Unequal Covariance Matrices

If \(n_1 = n_2\) (large samples), there is little effect on the Type I error rate and power of the two-sample test

If \(n_1 > n_2\) and the eigenvalues of \(\mathbf{\Sigma}_1 \mathbf{\Sigma}^{-1}_2\) are less than 1, the Type I error level is inflated

If \(n_1 > n_2\) and some eigenvalues of \(\mathbf{\Sigma}_1 \mathbf{\Sigma}_2^{-1}\) are greater than 1, the Type I error rate is too small, leading to a reduction in power

Sample Not Normal

The Type I error level of the two-sample \(T^2\) test isn’t much affected by moderate departures from normality if the two populations being sampled have similar distributions

One sample \(T^2\) test is much more sensitive to lack of normality, especially when the distribution is skewed.

Intuitively, you can think that in one sample your distribution will be sensitive, but the distribution of the difference between two similar distributions will not be as sensitive.

Transform to make the data more normal

For large samples, use the \(\chi^2\) (Wald) test, which requires neither normal populations, nor equal sample sizes, nor equal variance-covariance matrices

  • \(H_0: \mu_1 - \mu_2 =0\) use \((\mathbf{\bar{y}}_1 - \mathbf{\bar{y}}_2)'( \frac{1}{n_1} \mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2)^{-1}(\mathbf{\bar{y}}_1 - \mathbf{\bar{y}}_2) \dot{\sim} \chi^2_{(p)}\)

22.0.3.3.1 Equal Covariance Matrices Tests

With independent random samples from k populations of \(p\) -dimensional vectors, we compute the sample covariance matrix for each population, \(\mathbf{S}_i\) , where \(i = 1,...,k\)

\[ \begin{aligned} &H_0: \mathbf{\Sigma}_1 = \mathbf{\Sigma}_2 = \ldots = \mathbf{\Sigma}_k = \mathbf{\Sigma} \\ &H_a: \text{at least 2 are different} \end{aligned} \]

Assume \(H_0\) is true, we would use a pooled estimate of the common covariance matrix, \(\mathbf{\Sigma}\)

\[ \mathbf{S} = \frac{\sum_{i=1}^k (n_i -1)\mathbf{S}_i}{\sum_{i=1}^k (n_i - 1)} \]

with \(\sum_{i=1}^k (n_i -1)\) degrees of freedom

22.0.3.3.1.1 Bartlett’s Test

(a modification of the likelihood ratio test). Define

\[ N = \sum_{i=1}^k n_i \]

and (note: \(| |\) are determinants here, not absolute value)

\[ M = (N - k) \log|\mathbf{S}| - \sum_{i=1}^k (n_i - 1) \log|\mathbf{S}_i| \]

\[ C^{-1} = 1 - \frac{2p^2 + 3p - 1}{6(p+1)(k-1)} \{\sum_{i=1}^k (\frac{1}{n_i - 1}) - \frac{1}{N-k} \} \]

Reject \(H_0\) when \(MC^{-1} > \chi^2_{1- \alpha, (k-1)p(p+1)/2}\)

If not all samples are from normal populations, \(MC^{-1}\) has a distribution which is often shifted to the right of the nominal \(\chi^2\) distribution, which means \(H_0\) is often rejected even when it is true (the Type I error level is inflated). Hence, it is better to test univariate normality (and then multivariate normality) before doing Bartlett’s test.
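Bartlett's test statistic is easy to assemble from the group covariance matrices; below is a base-R sketch of the computation above, using the three iris species as the k groups.

```r
# Bartlett's test of equal covariance matrices across k groups (base-R sketch)
bartlett_cov <- function(Ylist) {
  k  <- length(Ylist)
  p  <- ncol(Ylist[[1]])
  ni <- sapply(Ylist, nrow)
  N  <- sum(ni)
  Si <- lapply(Ylist, cov)
  Sp <- Reduce(`+`, Map(function(S, n) (n - 1) * S, Si, ni)) / (N - k)   # pooled S

  M    <- (N - k) * log(det(Sp)) - sum((ni - 1) * sapply(Si, function(S) log(det(S))))
  Cinv <- 1 - (2 * p^2 + 3 * p - 1) / (6 * (p + 1) * (k - 1)) *
              (sum(1 / (ni - 1)) - 1 / (N - k))
  df <- (k - 1) * p * (p + 1) / 2
  c(statistic = M * Cinv, df = df,
    p.value = pchisq(M * Cinv, df, lower.tail = FALSE))
}

bartlett_cov(split(iris[, 1:4], iris$Species))   # three groups, p = 4 traits
```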

22.0.3.4 Two-Sample Repeated Measurements

Define \(\mathbf{y}_{hi} = (y_{hi1}, ..., y_{hit})'\) to be the observations from the i-th subject in the h-th group for times 1 through T

Assume that \(\mathbf{y}_{11}, ..., \mathbf{y}_{1n_1}\) are iid \(N_t(\mathbf{\mu}_1, \mathbf{\Sigma})\) and that \(\mathbf{y}_{21},...,\mathbf{y}_{2n_2}\) are iid \(N_t(\mathbf{\mu}_2, \mathbf{\Sigma})\)

\(H_0: \mathbf{C}(\mathbf{\mu}_1 - \mathbf{\mu}_2) = \mathbf{0}_c\) where \(\mathbf{C}\) is a \(c \times t\) matrix of rank \(c\) where \(c \le t\)

The test statistic has the form

\[ T^2 = \frac{n_1 n_2}{n_1 + n_2} (\mathbf{\bar{y}}_1 - \mathbf{\bar{y}}_2)' \mathbf{C}'(\mathbf{CSC}')^{-1}\mathbf{C} (\mathbf{\bar{y}}_1 - \mathbf{\bar{y}}_2) \]

where \(\mathbf{S}\) is the pooled covariance estimate. Then,

\[ F = \frac{n_1 + n_2 - c -1}{(n_1 + n_2-2)c} T^2 \sim f_{(c, n_1 + n_2 - c-1)} \]

when \(H_0\) is true

Suppose the null hypothesis \(H_0: \mu_1 = \mu_2\) is rejected. A weaker hypothesis is that the profiles for the two groups are parallel.

\[ \begin{aligned} \mu_{11} - \mu_{21} &= \mu_{12} - \mu_{22} \\ &\vdots \\ \mu_{1t-1} - \mu_{2t-1} &= \mu_{1t} - \mu_{2t} \end{aligned} \]

The null hypothesis matrix term is then

\(H_0: \mathbf{C}(\mu_1 - \mu_2) = \mathbf{0}_c\) , where \(c = t - 1\) and

\[ \mathbf{C} = \left( \begin{array} {ccccc} 1 & -1 & 0 & \ldots & 0 \\ 0 & 1 & -1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & 1 & -1 \end{array} \right)_{(t-1) \times t} \]


22.1 MANOVA

Multivariate Analysis of Variance

One-way MANOVA

Compare treatment means for h different populations

Population 1: \(\mathbf{y}_{11}, \mathbf{y}_{12}, \dots, \mathbf{y}_{1n_1} \sim iid \; N_p (\mathbf{\mu}_1, \mathbf{\Sigma})\)

Population h: \(\mathbf{y}_{h1}, \mathbf{y}_{h2}, \dots, \mathbf{y}_{hn_h} \sim iid \; N_p (\mathbf{\mu}_h, \mathbf{\Sigma})\)

Assumptions

  • Independent random samples from \(h\) different populations
  • Common covariance matrices
  • Each population is multivariate normal

Calculate the summary statistics \(\mathbf{\bar{y}}_i\) and \(\mathbf{S}_i\) for each population, and the pooled estimate of the covariance matrix, \(\mathbf{S}\)

Similar to the univariate one-way ANOVA, we can use the effects model formulation \(\mathbf{\mu}_i = \mathbf{\mu} + \mathbf{\tau}_i\) , where

\(\mathbf{\mu}_i\) is the population mean for population i

\(\mathbf{\mu}\) is the overall mean effect

\(\mathbf{\tau}_i\) is the treatment effect of the i-th treatment.

For the one-way model: \(\mathbf{y}_{ij} = \mu + \tau_i + \epsilon_{ij}\) for \(i = 1,..,h; j = 1,..., n_i\) and \(\epsilon_{ij} \sim N_p(\mathbf{0, \Sigma})\)

However, the above model is over-parameterized (i.e., there is an infinite number of ways to define \(\mathbf{\mu}\) and the \(\mathbf{\tau}_i\) ’s such that they add up to \(\mu_i\) ). Thus, we can constrain the parameters by requiring either

\[ \sum_{i=1}^h n_i \tau_i = 0 \]

or

\[ \mathbf{\tau}_h = 0 \]

The observational equivalent of the effects model is

\[ \begin{aligned} \mathbf{y}_{ij} &= \mathbf{\bar{y}} + (\mathbf{\bar{y}}_i - \mathbf{\bar{y}}) + (\mathbf{y}_{ij} - \mathbf{\bar{y}}_i) \\ &= \text{overall sample mean} + \text{treatment effect} + \text{residual (as in univariate ANOVA)} \end{aligned} \]

After manipulation

\[ \sum_{i = 1}^h \sum_{j = 1}^{n_i} (\mathbf{y}_{ij} - \mathbf{\bar{y}})(\mathbf{y}_{ij} - \mathbf{\bar{y}})' = \sum_{i = 1}^h n_i (\mathbf{\bar{y}}_i - \mathbf{\bar{y}})(\mathbf{\bar{y}}_i - \mathbf{\bar{y}})' + \sum_{i=1}^h \sum_{j = 1}^{n_i} (\mathbf{y}_{ij} - \mathbf{\bar{y}}_i)(\mathbf{y}_{ij} - \mathbf{\bar{y}}_i)' \]

LHS = Total corrected sums of squares and cross products (SSCP) matrix

1st term = treatment (or between-subjects) sums of squares and cross products matrix, denoted \(\mathbf{H}\) (or \(\mathbf{B}\))

2nd term = residual (or within-subject) SSCP matrix, denoted \(\mathbf{E}\) (or \(\mathbf{W}\))

\[ \mathbf{E} = (n_1 - 1)\mathbf{S}_1 + ... + (n_h -1) \mathbf{S}_h = (n-h) \mathbf{S} \]

MANOVA table

Source               SSCP                  df
Treatment            \(\mathbf{H}\)            \(h - 1\)
Residual (error)     \(\mathbf{E}\)            \(\sum_{i= 1}^h n_i - h\)
Total corrected      \(\mathbf{H + E}\)        \(\sum_{i=1}^h n_i - 1\)

\[ H_0: \tau_1 = \tau_2 = \dots = \tau_h = \mathbf{0} \]

We consider the relative “sizes” of \(\mathbf{E}\) and \(\mathbf{H+E}\)

Wilk’s Lambda

Define Wilk’s Lambda

\[ \Lambda^* = \frac{|\mathbf{E}|}{|\mathbf{H+E}|} \]

Properties:

Wilk’s Lambda is equivalent to the F-statistic in the univariate case

The exact distribution of \(\Lambda^*\) can be determined for special cases.

For large sample sizes, reject \(H_0\) if

\[ -(\sum_{i=1}^h n_i - 1 - \frac{p+h}{2}) \log(\Lambda^*) > \chi^2_{(1-\alpha, p(h-1))} \]
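In R, a one-way MANOVA and the Wilks (and other) criteria are available through the base manova() function; the iris data serve as a stand-in example.

```r
# One-way MANOVA with Wilks' Lambda (base R; iris is example data)
fit <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
              ~ Species, data = iris)

summary(fit, test = "Wilks")    # Wilks' Lambda and its F approximation
summary(fit, test = "Pillai")   # alternatives: "Pillai", "Hotelling-Lawley", "Roy"

# the treatment (H) and residual (E) SSCP matrices can be inspected via
summary(fit)$SS
```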

22.1.1 Testing General Hypotheses

\(h\) different treatments

with the i-th treatment

applied to \(n_i\) subjects that

are observed for \(p\) repeated measures.

Consider this a \(p\) dimensional obs on a random sample from each of \(h\) different treatment populations.

\[ \mathbf{y}_{ij} = \mathbf{\mu} + \mathbf{\tau}_i + \mathbf{\epsilon}_{ij} \]

for \(i = 1,..,h\) and \(j = 1,..,n_i\)

\[ \mathbf{Y} = \mathbf{XB} + \mathbf{\epsilon} \]

where \(n = \sum_{i = 1}^h n_i\) and with restriction \(\mathbf{\tau}_h = 0\)

\[ \mathbf{Y}_{(n \times p)} = \left[ \begin{array} {c} \mathbf{y}_{11}' \\ \vdots \\ \mathbf{y}_{1n_1}' \\ \vdots \\ \mathbf{y}_{hn_h}' \end{array} \right], \mathbf{B}_{(h \times p)} = \left[ \begin{array} {c} \mathbf{\mu}' \\ \mathbf{\tau}_1' \\ \vdots \\ \mathbf{\tau}_{h-1}' \end{array} \right], \mathbf{\epsilon}_{(n \times p)} = \left[ \begin{array} {c} \epsilon_{11}' \\ \vdots \\ \epsilon_{1n_1}' \\ \vdots \\ \epsilon_{hn_h}' \end{array} \right] \]

\[ \mathbf{X}_{(n \times h)} = \left[ \begin{array} {ccccc} 1 & 1 & 0 & \ldots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & 1 & 0 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ldots & \vdots \\ 1 & 0 & 0 & \ldots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & 0 & 0 & \ldots & 0 \end{array} \right] \]

\[ \mathbf{\hat{B}} = (\mathbf{X'X})^{-1} \mathbf{X'Y} \]

Rows of \(\mathbf{Y}\) are independent (i.e., \(var(\mathbf{Y}) = \mathbf{I}_n \otimes \mathbf{\Sigma}\) , an \(np \times np\) matrix, where \(\otimes\) is the Kronecker product).

\[ \begin{aligned} &H_0: \mathbf{LBM} = 0 \\ &H_a: \mathbf{LBM} \neq 0 \end{aligned} \]

\(\mathbf{L}\) is a \(g \times h\) matrix of full row rank ( \(g \le h\) ) = comparisons across groups

\(\mathbf{M}\) is a \(p \times u\) matrix of full column rank ( \(u \le p\) ) = comparisons across traits

The general treatment corrected sums of squares and cross product is

\[ \mathbf{H} = \mathbf{M'Y'X(X'X)^{-1}L'[L(X'X)^{-1}L']^{-1}L(X'X)^{-1}X'YM} \]

or for the null hypothesis \(H_0: \mathbf{LBM} = \mathbf{D}\)

\[ \mathbf{H} = (\mathbf{\hat{LBM}} - \mathbf{D})'[\mathbf{L(X'X)^{-1}L'}]^{-1}(\mathbf{\hat{LBM}} - \mathbf{D}) \]

The general matrix of residual sums of squares and cross product

\[ \mathbf{E} = \mathbf{M'Y'[I-X(X'X)^{-1}X']YM} = \mathbf{M'[Y'Y - \hat{B}'(X'X)\hat{B}]M} \]

We can compute the following test statistics, which are functions of the eigenvalues of \(\mathbf{HE}^{-1}\) :

Wilk’s Criterion: \(\Lambda^* = \frac{|\mathbf{E}|}{|\mathbf{H} + \mathbf{E}|}\) . The df depend on the rank of \(\mathbf{L}, \mathbf{M}, \mathbf{X}\)

Lawley-Hotelling Trace: \(U = tr(\mathbf{HE}^{-1})\)

Pillai Trace: \(V = tr(\mathbf{H}(\mathbf{H}+ \mathbf{E})^{-1})\)

Roy’s Maximum Root: largest eigenvalue of \(\mathbf{HE}^{-1}\)

If \(H_0\) is true and n is large, \(-(n-1- \frac{p+h}{2})\ln \Lambda^* \sim \chi^2_{p(h-1)}\) . Some special values of p and h can give exact F-dist under \(H_0\)
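The \(\mathbf{H}\) and \(\mathbf{E}\) matrices for a hypothesis \(\mathbf{LBM} = \mathbf{0}\) can be computed directly with matrix algebra; the base-R sketch below uses R's default treatment coding for the design matrix (so \(\mathbf{B}\) is parameterized differently from the text) and an illustrative choice of \(\mathbf{L}\) and \(\mathbf{M}\).

```r
# H and E for a general hypothesis L B M = 0 (base-R sketch; L and M are illustrative)
Y <- as.matrix(iris[, 1:4])                    # n x p responses (example data)
X <- model.matrix(~ Species, data = iris)      # n x h design matrix (treatment coding)
B <- solve(t(X) %*% X, t(X) %*% Y)             # h x p estimated coefficients

L <- rbind(c(0, 1, 0),                         # comparisons across groups (rows of B)
           c(0, 0, 1))
M <- diag(4)                                   # keep all p traits

XtXi <- solve(t(X) %*% X)
LBM  <- L %*% B %*% M
H <- t(LBM) %*% solve(L %*% XtXi %*% t(L)) %*% LBM
E <- t(M) %*% (t(Y) %*% Y - t(B) %*% (t(X) %*% X) %*% B) %*% M

det(E) / det(H + E)                            # Wilks' criterion Lambda*
```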


  • If the independent variable is time with 3 levels -> univariate repeated-measures ANOVA (requires the sphericity assumption, i.e., the variances of all pairwise differences are equal)
  • If each level of time is treated as a separate variable -> MANOVA (does not require the sphericity assumption)


22.1.2 Profile Analysis

Examine similarities between the treatment effects (between subjects), which is useful for longitudinal analysis. Null is that all treatments have the same average effect.

\[ H_0: \mu_1 = \mu_2 = \dots = \mu_h \]

\[ H_0: \tau_1 = \tau_2 = \dots = \tau_h \]

The exact nature of the similarities and differences between the treatments can be examined under this analysis.

Sequential steps in profile analysis:

  • Are the profiles parallel ? (i.e., is there no interaction between treatment and time)
  • Are the profiles coincidental ? (i.e., are the profiles identical?)
  • Are the profiles horizontal ? (i.e., are there no differences between any time points?)

If we reject the null hypothesis that the profiles are parallel, we can test

Are there differences among groups within some subset of the total time points?

Are there differences among time points in a particular group (or groups)?

Are there differences within some subset of the total time points in a particular group (or groups)?

4 times (p = 4)

3 treatments (h=3)

22.1.2.1 Parallel Profile

Are the profiles for each population identical except for a mean shift?

\[ \begin{aligned} H_0: \mu_{11} - \mu_{21} &= \mu_{12} - \mu_{22} = \dots = \mu_{1t} - \mu_{2t} \\ \mu_{11} - \mu_{31} &= \mu_{12} - \mu_{32} = \dots = \mu_{1t} - \mu_{3t} \\ &\dots \end{aligned} \]

for \(h-1\) equations

\[ H_0: \mathbf{LBM = 0} \]

\[ \mathbf{LBM} = \left[ \begin{array} {ccc} 1 & -1 & 0 \\ 1 & 0 & -1 \end{array} \right] \left[ \begin{array} {ccc} \mu_{11} & \dots & \mu_{14} \\ \mu_{21} & \dots & \mu_{24} \\ \mu_{31} & \dots & \mu_{34} \end{array} \right] \left[ \begin{array} {ccc} 1 & 1 & 1 \\ -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -1 \end{array} \right] = \mathbf{0} \]

where this is the cell means parameterization of \(\mathbf{B}\)

The multiplication of the first 2 matrices \(\mathbf{LB}\) is

\[ \left[ \begin{array} {cccc} \mu_{11} - \mu_{21} & \mu_{12} - \mu_{22} & \mu_{13} - \mu_{23} & \mu_{14} - \mu_{24}\\ \mu_{11} - \mu_{31} & \mu_{12} - \mu_{32} & \mu_{13} - \mu_{33} & \mu_{14} - \mu_{34} \end{array} \right] \]

which is the differences in treatment means at the same time

Multiplying by \(\mathbf{M}\) , we get the comparison across time

\[ \left[ \begin{array} {ccc} (\mu_{11} - \mu_{21}) - (\mu_{12} - \mu_{22}) & (\mu_{11} - \mu_{21}) -(\mu_{13} - \mu_{23}) & (\mu_{11} - \mu_{21}) - (\mu_{14} - \mu_{24}) \\ (\mu_{11} - \mu_{31}) - (\mu_{12} - \mu_{32}) & (\mu_{11} - \mu_{31}) - (\mu_{13} - \mu_{33}) & (\mu_{11} - \mu_{31}) -(\mu_{14} - \mu_{34}) \end{array} \right] \]

Alternatively, we can also use the effects parameterization

\[ \mathbf{LBM} = \left[ \begin{array} {cccc} 0 & 1 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{array} \right] \left[ \begin{array} {c} \mu' \\ \tau'_1 \\ \tau_2' \\ \tau_3' \end{array} \right] \left[ \begin{array} {ccc} 1 & 1 & 1 \\ -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -1 \end{array} \right] = \mathbf{0} \]

In both parameterizations, \(rank(\mathbf{L}) = h-1\) and \(rank(\mathbf{M}) = p-1\)

We could also choose \(\mathbf{L}\) and \(\mathbf{M}\) in other forms

\[ \mathbf{L} = \left[ \begin{array} {cccc} 0 & 1 & 0 & -1 \\ 0 & 0 & 1 & -1 \end{array} \right] \]

\[ \mathbf{M} = \left[ \begin{array} {ccc} 1 & 0 & 0 \\ -1 & 1 & 0 \\ 0 & -1 & 1 \\ 0 & 0 & -1 \end{array} \right] \]

and still obtain the same result.

22.1.2.2 Coincidental Profiles

After we have evidence that the profiles are parallel (i.e., fail to reject the parallel profile test), we can ask whether they are identical?

Given profiles are parallel , then if the sums of the components of \(\mu_i\) are identical for all the treatments, then the profiles are identical .

\[ H_0: \mathbf{1'}_p \mu_1 = \mathbf{1'}_p \mu_2 = \dots = \mathbf{1'}_p \mu_h \]

\[ H_0: \mathbf{LBM} = \mathbf{0} \]

where for the cell means parameterization

\[ \mathbf{L} = \left[ \begin{array} {ccc} 1 & 0 & -1 \\ 0 & 1 & -1 \end{array} \right] \]

\[ \mathbf{M} = \left[ \begin{array} {cccc} 1 & 1 & 1 & 1 \end{array} \right]' \]

multiplication yields

\[ \left[ \begin{array} {c} (\mu_{11} + \mu_{12} + \mu_{13} + \mu_{14}) - (\mu_{31} + \mu_{32} + \mu_{33} + \mu_{34}) \\ (\mu_{21} + \mu_{22} + \mu_{23} + \mu_{24}) - (\mu_{31} + \mu_{32} + \mu_{33} + \mu_{34}) \end{array} \right] = \left[ \begin{array} {c} 0 \\ 0 \end{array} \right] \]

Different choices of \(\mathbf{L}\) and \(\mathbf{M}\) can yield the same result

22.1.2.3 Horizontal Profiles

Given that we can’t reject the null hypothesis that all \(h\) profiles are the same, we can ask whether all of the elements of the common profile are equal (i.e., whether the profile is horizontal).

\[ \mathbf{L} = \left[ \begin{array} {ccc} 1 & 0 & 0 \end{array} \right] \]

\[ \left[ \begin{array} {ccc} (\mu_{11} - \mu_{12}) & (\mu_{12} - \mu_{13}) & (\mu_{13} - \mu_{14}) \end{array} \right] = \left[ \begin{array} {ccc} 0 & 0 & 0 \end{array} \right] \]

  • If we fail to reject all 3 hypotheses, then we fail to reject the null hypotheses of both no difference between treatments and no differences between traits.

Test                    Equivalent test for
Parallel profile        Interaction
Coincidental profile    Main effect of the between-subjects factor
Horizontal profile      Main effect of the repeated-measures factor

22.1.3 Summary


22.2 Principal Components

  • Unsupervised learning
  • find important features
  • reduce the dimensions of the data set
  • “decorrelate” multivariate vectors that have dependence.
  • uses eigenvector/eigenvalue decomposition of covariance (correlation) matrices.

According to the “spectral decomposition theorem”, if \(\mathbf{\Sigma}_{p \times p}\) is a positive semi-definite, symmetric, real matrix, then there exists an orthogonal matrix \(\mathbf{A}\) such that \(\mathbf{A'\Sigma A} = \Lambda\) where \(\Lambda\) is a diagonal matrix containing the eigenvalues of \(\mathbf{\Sigma}\)

\[ \mathbf{\Lambda} = \left( \begin{array} {cccc} \lambda_1 & 0 & \ldots & 0 \\ 0 & \lambda_2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \lambda_p \end{array} \right) \]

\[ \mathbf{A} = \left( \begin{array} {cccc} \mathbf{a}_1 & \mathbf{a}_2 & \ldots & \mathbf{a}_p \end{array} \right) \]

the i-th column of \(\mathbf{A}\) , \(\mathbf{a}_i\) , is the i-th \(p \times 1\) eigenvector of \(\mathbf{\Sigma}\) that corresponds to the eigenvalue, \(\lambda_i\) , where \(\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p\) . Alternatively, express in matrix decomposition:

\[ \mathbf{\Sigma} = \mathbf{A \Lambda A}' \]

\[ \mathbf{\Sigma} = \mathbf{A} \left( \begin{array} {cccc} \lambda_1 & 0 & \ldots & 0 \\ 0 & \lambda_2 & \ldots & 0 \\ \vdots & \vdots& \ddots & \vdots \\ 0 & 0 & \ldots & \lambda_p \end{array} \right) \mathbf{A}' = \sum_{i=1}^p \lambda_i \mathbf{a}_i \mathbf{a}_i' \]

where the outer product \(\mathbf{a}_i \mathbf{a}_i'\) is a \(p \times p\) matrix of rank 1.

For example,

\(\mathbf{x} \sim N_2(\mathbf{\mu}, \mathbf{\Sigma})\)

\[ \mathbf{\mu} = \left( \begin{array} {c} 5 \\ 12 \end{array} \right); \mathbf{\Sigma} = \left( \begin{array} {cc} 4 & 1 \\ 1 & 2 \end{array} \right) \]


\[ \mathbf{A} = \left( \begin{array} {cc} 0.9239 & -0.3827 \\ 0.3827 & 0.9239 \\ \end{array} \right) \]

Columns of \(\mathbf{A}\) are the eigenvectors for the decomposition

Under matrix multiplication ( \(\mathbf{A'\Sigma A}\) or \(\mathbf{A'A}\) ), the off-diagonal elements equal to 0

Multiplying the data by this matrix (i.e., projecting the data onto the orthogonal axes) gives transformed data (“scores”) whose distribution is

\[ N_2 (\mathbf{A'\mu,A'\Sigma A}) = N_2 (\mathbf{A'\mu, \Lambda}) \]

\[ \mathbf{y} = \mathbf{A'x} \sim N \left[ \left( \begin{array} {c} 9.2119 \\ 9.1733 \end{array} \right), \left( \begin{array} {cc} 4.4144 & 0 \\ 0 & 1.5859 \end{array} \right) \right] \]
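These numbers can be checked in a few lines of base R with eigen(); note that the signs of the eigenvectors (and hence of \(\mathbf{A}'\mu\)) are arbitrary and may be flipped by the software.

```r
# Verifying the 2 x 2 example with base R
Sigma <- matrix(c(4, 1,
                  1, 2), nrow = 2, byrow = TRUE)
mu <- c(5, 12)

e <- eigen(Sigma)
e$values                   # 4.4142 and 1.5858: the diagonal of Lambda
e$vectors                  # columns of A (possibly with signs flipped)

t(e$vectors) %*% mu        # A'mu: approximately (9.21, 9.17), up to sign
round(t(e$vectors) %*% Sigma %*% e$vectors, 6)   # A' Sigma A = Lambda (off-diagonals 0)
```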


There is no more dependence in the transformed data structure; the scores are uncorrelated.

The i-th eigenvalue is the variance of a linear combination of the elements of \(\mathbf{x}\) ; \(var(y_i) = var(\mathbf{a'_i x}) = \lambda_i\)

The values on the transformed set of axes (i.e., the \(y_i\) ’s) are called the scores. These are the orthogonal projections of the data onto the “new” principal component axes.

Variances of \(y_1\) are greater than those for any other possible projection

Covariance matrix decomposition and projection onto orthogonal axes = PCA

22.2.1 Population Principal Components

\(p \times 1\) vectors \(\mathbf{x}_1, \dots , \mathbf{x}_n\) which are iid with \(var(\mathbf{x}_i) = \mathbf{\Sigma}\)

The first PC is the linear combination \(y_1 = \mathbf{a}_1' \mathbf{x} = a_{11}x_1 + \dots + a_{1p}x_p\) with \(\mathbf{a}_1' \mathbf{a}_1 = 1\) such that \(var(y_1)\) is the maximum of all linear combinations of \(\mathbf{x}\) which have unit length

The second PC is the linear combination \(y_2 = \mathbf{a}_2' \mathbf{x} = a_{21}x_1 + \dots + a_{2p}x_p\) with \(\mathbf{a}_2' \mathbf{a}_2 = 1\) such that \(var(y_2)\) is the maximum of all linear combinations of \(\mathbf{x}\) which have unit length and are uncorrelated with \(y_1\) (i.e., \(cov(\mathbf{a}_1' \mathbf{x}, \mathbf{a}'_2 \mathbf{x}) = 0\) )

continues for all \(y_i\) to \(y_p\)

\(\mathbf{a}_i\) ’s are those that make up the matrix \(\mathbf{A}\) in the symmetric decomposition \(\mathbf{A'\Sigma A} = \mathbf{\Lambda}\) , where \(var(y_1) = \lambda_1, \dots , var(y_p) = \lambda_p\) And the total variance of \(\mathbf{x}\) is

\[ \begin{aligned} var(x_1) + \dots + var(x_p) &= tr(\Sigma) = \lambda_1 + \dots + \lambda_p \\ &= var(y_1) + \dots + var(y_p) \end{aligned} \]

Data Reduction

To reduce the dimension of data from p (original) to k dimensions without much “loss of information”, we can use properties of the population principal components

Suppose \(\mathbf{\Sigma} \approx \sum_{i=1}^k \lambda_i \mathbf{a}_i \mathbf{a}_i'\) . Even though the true variance-covariance matrix has rank \(p\) , it can be well approximated by a matrix of rank k (k < p)

New “traits” are linear combinations of the measured traits. We can attempt to make meaningful interpretations of the combinations (with orthogonality constraints).

The proportion of the total variance accounted for by the j-th principal component is

\[ \frac{var(y_j)}{\sum_{i=1}^p var(y_i)} = \frac{\lambda_j}{\sum_{i=1}^p \lambda_i} \]

The proportion of the total variation accounted for by the first k principal components is \(\frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^p \lambda_i}\)

In the example above, \(4.4144/(4+2) \approx 0.736\) of the total variability can be explained by the first principal component

22.2.2 Sample Principal Components

Since \(\mathbf{\Sigma}\) is unknown, we use

\[ \mathbf{S} = \frac{1}{n-1}\sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})' \]

Let \(\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \dots \ge \hat{\lambda}_p \ge 0\) be the eigenvalues of \(\mathbf{S}\) and \(\hat{\mathbf{a}}_1, \hat{\mathbf{a}}_2, \dots, \hat{\mathbf{a}}_p\) denote the eigenvectors of \(\mathbf{S}\)

Then, the i-th sample principal component score (or principal component or score) is

\[ \hat{y}_{ij} = \sum_{k=1}^p \hat{a}_{ik}x_{kj} = \hat{\mathbf{a}}_i'\mathbf{x}_j \]

Properties of Sample Principal Components

The estimated variance of \(y_i = \hat{\mathbf{a}}_i'\mathbf{x}_j\) is \(\hat{\lambda}_i\)

The sample covariance between \(\hat{y}_i\) and \(\hat{y}_{i'}\) is 0 when \(i \neq i'\)

The proportion of the total sample variance accounted for by the i-th sample principal component is \(\frac{\hat{\lambda}_i}{\sum_{k=1}^p \hat{\lambda}_k}\)

The estimated correlation between the \(i\) -th principal component score and the \(l\) -th attribute of \(\mathbf{x}\) is

\[ r_{x_l , \hat{y}_i} = \frac{\hat{a}_{il}\sqrt{\lambda_i}}{\sqrt{s_{ll}}} \]

The correlation coefficient is typically used to interpret the components (i.e., if this correlation is high then it suggests that the l-th original trait is important in the i-th principal component). According to R. A. Johnson, Wichern, et al. ( 2002 ) , pp.433-434, \(r_{x_l, \hat{y}_i}\) only measures the univariate contribution of an individual X to a component Y without taking into account the presence of the other X’s. Hence, some prefer the \(\hat{a}_{il}\) coefficient to interpret the principal component.

\(r_{x_l, \hat{y}_i} ; \hat{a}_{il}\) are referred to as “loadings”

To use k principal components, we must calculate the scores for each data vector in the sample

\[ \mathbf{y}_j = \left( \begin{array} {c} y_{1j} \\ y_{2j} \\ \vdots \\ y_{kj} \end{array} \right) = \left( \begin{array} {c} \hat{\mathbf{a}}_1' \mathbf{x}_j \\ \hat{\mathbf{a}}_2' \mathbf{x}_j \\ \vdots \\ \hat{\mathbf{a}}_k' \mathbf{x}_j \end{array} \right) = \left( \begin{array} {c} \hat{\mathbf{a}}_1' \\ \hat{\mathbf{a}}_2' \\ \vdots \\ \hat{\mathbf{a}}_k' \end{array} \right) \mathbf{x}_j \]

Large sample theory exists for eigenvalues and eigenvectors of sample covariance matrices if inference is necessary. But we do not do inference with PCA, we only use it as exploratory or descriptive analysis.

PCA is not invariant to changes in scale (exception: if all traits are rescaled by multiplying by the same constant, such as feet to inches).

PCA based on the correlation matrix \(\mathbf{R}\) is different than that based on the covariance matrix \(\mathbf{\Sigma}\)

PCA for the correlation matrix is just rescaling each trait to have unit variance

Transform \(\mathbf{x}\) to \(\mathbf{z}\) where \(z_{ij} = (x_{ij} - \bar{x}_i)/\sqrt{s_{ii}}\) where the denominator affects the PCA

After transformation, \(cov(\mathbf{z}) = \mathbf{R}\)

PCA on \(\mathbf{R}\) is calculated in the same way as that on \(\mathbf{S}\) (where \(\hat{\lambda}{}_1 + \dots + \hat{\lambda}{}_p = p\) )

The use of \(\mathbf{R}, \mathbf{S}\) depends on the purpose of PCA.

  • If the scales of the observations are not too different, the covariance matrix may be preferable; but if they are dramatically different, the analysis can be dominated by the large-variance traits, and the correlation matrix is the safer choice.

How many PCs to use can be guided by:

  • Scree graphs: plot the eigenvalues against their indices and look for the “elbow” where the steep decline suddenly flattens out, or for big gaps.
  • A minimum percentage of total variation explained (e.g., choose enough components to account for 50% or 90%); this can also be used for interpretation.
  • Kaiser’s rule: use only those PCs with eigenvalues larger than 1 (applied to PCA on the correlation matrix) - ad hoc.
  • Compare the eigenvalue scree plot of the data to the scree plot obtained when the data are randomized.
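In base R, prcomp() carries out the sample PCA and screeplot() draws the scree graph; the sketch below runs PCA on the correlation scale (scale. = TRUE) with the iris measurements as example data.

```r
# Sample PCA on the correlation matrix with base R (iris is example data)
pc <- prcomp(iris[, 1:4], scale. = TRUE)   # scale. = TRUE standardizes each trait

summary(pc)        # standard deviations and proportion of variance explained
pc$rotation        # loadings: the eigenvectors a_i
head(pc$x)         # scores y_ij for the first few observations

screeplot(pc, type = "lines")              # scree graph: look for the "elbow"

# equivalently, eigen-decompose the correlation matrix directly
eigen(cor(iris[, 1:4]))$values             # these eigenvalues sum to p = 4
```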

22.2.3 Application

PCA on the covariance matrix is usually not preferred, because PCA is not invariant to changes in scale; hence, PCA on the correlation matrix is generally preferred

This also addresses the problem of multicollinearity

The eigenvectors may differ by a multiplication by -1 across implementations, but the interpretation is the same.

Covid Example

To reduce the collinearity problem in this dataset, we can use principal components as regressors.


The MSE for the PC-based model is larger than for the regular regression, because models with a large degree of collinearity can still perform well.

The pcr function in the pls package can be used to fit principal component regression (with cross-validation to help choose the number of components in the model).
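The idea behind principal component regression can also be shown directly in base R (prcomp() for the components, then an ordinary lm() on the scores); the predictors and the choice of two components below are purely illustrative.

```r
# Principal component regression sketch in base R (pls::pcr automates this;
# the predictors and the choice k = 2 below are illustrative)
X <- as.matrix(mtcars[, c("disp", "hp", "wt", "drat")])   # correlated predictors
y <- mtcars$mpg

pc <- prcomp(X, scale. = TRUE)
k  <- 2                                      # number of components to keep
dat <- data.frame(y = y, pc$x[, 1:k])        # y plus the first k scores

fit_pcr <- lm(y ~ ., data = dat)             # regress y on PC1, ..., PCk
summary(fit_pcr)

round(cor(pc$x[, 1:k]), 10)                  # the new regressors are uncorrelated
```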

22.3 Factor Analysis

Using a few linear combinations of underlying unobservable (latent) traits, we try to describe the covariance relationship among a large number of measured traits

Similar to PCA , but factor analysis is model based

More details can be found on PSU stat or UMN stat

Let \(\mathbf{y}\) be the set of \(p\) measured variables

\(E(\mathbf{y}) = \mathbf{\mu}\)

\(var(\mathbf{y}) = \mathbf{\Sigma}\)

\[ \begin{aligned} \mathbf{y} - \mathbf{\mu} &= \mathbf{Lf} + \epsilon \\ &= \left( \begin{array} {c} l_{11}f_1 + l_{12}f_2 + \dots + l_{1m}f_m \\ \vdots \\ l_{p1}f_1 + l_{p2}f_2 + \dots + l_{pm} f_m \end{array} \right) + \left( \begin{array} {c} \epsilon_1 \\ \vdots \\ \epsilon_p \end{array} \right) \end{aligned} \]

\(\mathbf{y} - \mathbf{\mu}\) = the p centered measurements

\(\mathbf{L}\) = \(p \times m\) matrix of factor loadings

\(\mathbf{f}\) = unobserved common factors for the population

\(\mathbf{\epsilon}\) = random errors (i.e., variation that is not accounted for by the common factors).

We want \(m\) (the number of factors) to be much smaller than \(p\) (the number of measured attributes)

Restrictions on the model

\(E(\epsilon) = \mathbf{0}\)

\(var(\epsilon) = \Psi_{p \times p} = diag( \psi_1, \dots, \psi_p)\)

\(\mathbf{\epsilon}, \mathbf{f}\) are independent

Additional assumption could be \(E(\mathbf{f}) = \mathbf{0}, var(\mathbf{f}) = \mathbf{I}_{m \times m}\) (known as the orthogonal factor model) , which imposes the following covariance structure on \(\mathbf{y}\)

\[ \begin{aligned} var(\mathbf{y}) = \mathbf{\Sigma} &= var(\mathbf{Lf} + \mathbf{\epsilon}) \\ &= var(\mathbf{Lf}) + var(\epsilon) \\ &= \mathbf{L} var(\mathbf{f}) \mathbf{L}' + \mathbf{\Psi} \\ &= \mathbf{LIL}' + \mathbf{\Psi} \\ &= \mathbf{LL}' + \mathbf{\Psi} \end{aligned} \]

Since \(\mathbf{\Psi}\) is diagonal, the off-diagonal elements of \(\mathbf{LL}'\) are \(\sigma_{ij}\) , the covariances in \(\mathbf{\Sigma}\) , which means \(cov(y_i, y_j) = \sum_{k=1}^m l_{ik}l_{jk}\) and the covariance structure of \(\mathbf{y}\) is completely determined by the m factors ( \(m << p\) )

\(var(y_i) = \sum_{k=1}^m l_{ik}^2 + \psi_i\) where \(\psi_i\) is the specific variance and the summation term is the i-th communality (i.e., portion of the variance of the i-th variable contributed by the \(m\) common factors ( \(h_i^2 = \sum_{k=1}^m l_{ik}^2\) )

The factor model is only uniquely determined up to an orthogonal transformation of the factors.

Let \(\mathbf{T}_{m \times m}\) be an orthogonal matrix \(\mathbf{TT}' = \mathbf{T'T} = \mathbf{I}\) then

\[ \begin{aligned} \mathbf{y} - \mathbf{\mu} &= \mathbf{Lf} + \epsilon \\ &= \mathbf{LTT'f} + \epsilon \\ &= \mathbf{L}^*(\mathbf{T'f}) + \epsilon & \text{where } \mathbf{L}^* = \mathbf{LT} \end{aligned} \]

\[ \begin{aligned} \mathbf{\Sigma} &= \mathbf{LL}' + \mathbf{\Psi} \\ &= \mathbf{LTT'L} + \mathbf{\Psi} \\ &= (\mathbf{L}^*)(\mathbf{L}^*)' + \mathbf{\Psi} \end{aligned} \]

Hence, any orthogonal transformation of the factors is an equally good description of the correlations among the observed traits.

Let \(\mathbf{y} = \mathbf{Cx}\) , where \(\mathbf{C}\) is any diagonal matrix, then \(\mathbf{L}_y = \mathbf{CL}_x\) and \(\mathbf{\Psi}_y = \mathbf{C\Psi}_x\mathbf{C}\)

Hence, we can see that factor analysis is also invariant to changes in scale

22.3.1 Methods of Estimation

To estimate \(\mathbf{L}\)

  • Principal Component Method
  • Principal Factor Method

22.3.1.1 Principal Component Method

Spectral decomposition

\[ \begin{aligned} \mathbf{\Sigma} &= \lambda_1 \mathbf{a}_1 \mathbf{a}_1' + \dots + \lambda_p \mathbf{a}_p \mathbf{a}_p' \\ &= \mathbf{A\Lambda A}' \\ &= \sum_{k=1}^m \lambda_k \mathbf{a}_k \mathbf{a}_k' + \sum_{k= m+1}^p \lambda_k \mathbf{a}_k \mathbf{a}_k' \\ &= \sum_{k=1}^m \mathbf{l}_k \mathbf{l}_k' + \sum_{k=m+1}^p \lambda_k \mathbf{a}_k \mathbf{a}_k' \end{aligned} \]

where \(l_k = \mathbf{a}_k \sqrt{\lambda_k}\) and the second term is not diagonal in general.

\[ \psi_i = \sigma_{ii} - \sum_{k=1}^m l_{ik}^2 = \sigma_{ii} - \sum_{k=1}^m \lambda_i a_{ik}^2 \]

\[ \mathbf{\Sigma} \approx \mathbf{LL}' + \mathbf{\Psi} \]

To estimate \(\mathbf{L}\) and \(\mathbf{\Psi}\) , we use the estimated eigenvalues and eigenvectors from \(\mathbf{S}\) or \(\mathbf{R}\)

The estimated factor loadings don’t change as the number of factors increases

The diagonal elements of \(\hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\mathbf{\Psi}}\) are equal to the diagonal elements of \(\mathbf{S}\) and \(\mathbf{R}\) , but the covariances may not be exactly reproduced

We select \(m\) so that the off-diagonal elements of \(\hat{\mathbf{L}} \hat{\mathbf{L}}' + \hat{\mathbf{\Psi}}\) are close to the corresponding values in \(\mathbf{S}\) (equivalently, so that the off-diagonal elements of \(\mathbf{S} - (\hat{\mathbf{L}} \hat{\mathbf{L}}' + \hat{\mathbf{\Psi}})\) are small)
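A base-R sketch of the principal component method applied to a correlation matrix; the data and the choice of m = 2 factors are purely illustrative.

```r
# Principal component method for factor loadings (base-R sketch on a correlation matrix)
R <- cor(iris[, 1:4])     # example correlation matrix
m <- 2                    # number of factors retained (illustrative)

e   <- eigen(R)
L   <- e$vectors[, 1:m] %*% diag(sqrt(e$values[1:m]))   # l_k = sqrt(lambda_k) * a_k
Psi <- diag(R) - rowSums(L^2)                           # specific variances psi_i

round(L, 3)                               # estimated loadings
round(R - (L %*% t(L) + diag(Psi)), 3)    # residual matrix: off-diagonals should be small
```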

22.3.1.2 Principal Factor Method

Consider modeling the correlation matrix, \(\mathbf{R} = \mathbf{L} \mathbf{L}' + \mathbf{\Psi}\) . Then

\[ \mathbf{L} \mathbf{L}' = \mathbf{R} - \mathbf{\Psi} = \left( \begin{array} {cccc} h_1^2 & r_{12} & \dots & r_{1p} \\ r_{21} & h_2^2 & \dots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \dots & h_p^2 \end{array} \right) \]

where \(h_i^2 = 1- \psi_i\) (the communality)

Suppose that initial estimates are available for the communalities, \((h_1^*)^2,(h_2^*)^2, \dots , (h_p^*)^2\) , then we can regress each trait on all the others, and then use the \(r^2\) as \(h^2\)

The estimate of \(\mathbf{R} - \mathbf{\Psi}\) at step k is

\[ (\mathbf{R} - \mathbf{\Psi})_k = \left( \begin{array} {cccc} (h_1^*)^2 & r_{12} & \dots & r_{1p} \\ r_{21} & (h_2^*)^2 & \dots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \dots & (h_p^*)^2 \end{array} \right) = \mathbf{L}_k^*(\mathbf{L}_k^*)' \]

\[ \mathbf{L}_k^* = (\sqrt{\hat{\lambda}_1^*}\hat{\mathbf{a}}_1^* , \dots, \sqrt{\hat{\lambda}_m^*}\hat{\mathbf{a}}_m^*) \]

\[ \hat{\psi}_{i,k}^* = 1 - \sum_{j=1}^m \hat{\lambda}_j^* (\hat{a}_{ij}^*)^2 \]

we used the spectral decomposition on the estimated matrix \((\mathbf{R}- \mathbf{\Psi})\) to calculate the \(\hat{\lambda}_i^* s\) and the \(\mathbf{\hat{a}}_i^* s\)

After updating the values \((\hat{h}_i^*)^2 = 1 - \hat{\psi}_{i,k}^*\), we use them to form a new \(\mathbf{L}_{k+1}^*\) via another spectral decomposition, and repeat the process.

The matrix \((\mathbf{R} - \mathbf{\Psi})_k\) is not necessarily positive definite.

The principal component method corresponds to the principal factor method with initial communalities \(h^2 = 1\).

If \(m\) is too large, some communalities may become larger than 1, causing the iterations to terminate. To combat this, we can either

  • fix any communality that is greater than 1 at 1 and then continue, or
  • continue the iterations regardless of the size of the communalities, accepting that the results can fall outside of the parameter space.

A small iterative sketch follows.
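Below is a minimal sketch of the iterated principal factor method described above, assuming a correlation matrix `R` and a chosen number of factors `m` (both placeholders); the initial communalities are taken as squared multiple correlations.

```r
# Iterated principal factor method (illustrative sketch).
# R: correlation matrix; m: number of factors.
principal_factor <- function(R, m, n_iter = 50) {
  h2 <- 1 - 1 / diag(solve(R))        # initial communalities = squared multiple correlations
  for (k in seq_len(n_iter)) {
    Rr <- R
    diag(Rr) <- h2                    # reduced correlation matrix (R - Psi)_k
    eig <- eigen(Rr)
    L <- eig$vectors[, 1:m] %*% diag(sqrt(pmax(eig$values[1:m], 0)))
    h2 <- pmin(rowSums(L^2), 1)       # update communalities, capping at 1
  }
  list(loadings = L, communalities = h2, uniquenesses = 1 - h2)
}
```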

22.3.1.3 Maximum Likelihood Method

Since we need the likelihood function, we make the additional (critical) assumption that

\(\mathbf{y}_j \sim N(\mathbf{\mu},\mathbf{\Sigma})\) for \(j = 1,..,n\)

\(\mathbf{f} \sim N(\mathbf{0}, \mathbf{I})\)

\(\epsilon_j \sim N(\mathbf{0}, \mathbf{\Psi})\)

and restriction

  • \(\mathbf{L}' \mathbf{\Psi}^{-1}\mathbf{L} = \mathbf{\Delta}\) where \(\mathbf{\Delta}\) is a diagonal matrix. (since the factor loading matrix is not unique, we need this restriction).

Finding the MLE can be computationally expensive, so we typically use other methods for exploratory data analysis.

Likelihood ratio tests can be used for testing hypotheses in this framework (i.e., confirmatory factor analysis). A short maximum likelihood sketch follows.
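As one hedged illustration, base R's `factanal()` fits the orthogonal factor model by maximum likelihood; the data frame `dat` and the choice of two factors are placeholders.

```r
# Maximum likelihood factor analysis (illustrative sketch).
# `dat` is a hypothetical data frame of numeric traits.
fit <- factanal(dat, factors = 2, rotation = "none")
fit$loadings      # estimated L
fit$uniquenesses  # estimated diagonal of Psi
# The printed output includes a likelihood ratio test that m factors are sufficient.
```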

22.3.2 Factor Rotation

\(\mathbf{T}_{m \times m}\) is an orthogonal matrix that has the property that

\[ \hat{\mathbf{L}} \hat{\mathbf{L}}' + \hat{\mathbf{\Psi}} = \hat{\mathbf{L}}^*(\hat{\mathbf{L}}^*)' + \hat{\mathbf{\Psi}} \]

where \(\mathbf{L}^* = \mathbf{LT}\)

This means that estimated specific variances and communalities are not altered by the orthogonal transformation.

Since there are an infinite number of choices for \(\mathbf{T}\) , some selection criterion is necessary

For example, we can find the orthogonal transformation that maximizes the objective function

\[ \sum_{j = 1}^m \left[ \frac{1}{p}\sum_{i=1}^p \left(\frac{l_{ij}^{*}}{h_i}\right)^4 - \frac{\gamma}{p^2} \left\{ \sum_{i=1}^p \left(\frac{l_{ij}^{*}}{h_i}\right)^2 \right\}^2 \right] \]

where the \(l_{ij}^{*}/h_i\) are the "scaled loadings"; dividing by the square root of the communality gives variables with small communalities more influence.

Different choices of \(\gamma\) in the objective function correspond to different orthogonal rotations found in the literature (a small rotation sketch follows this list):

  • Varimax: \(\gamma = 1\) (rotate the factors so that each of the \(p\) variables has a high loading on only one factor, although this is not always achievable)
  • Quartimax: \(\gamma = 0\)
  • Equimax: \(\gamma = m/2\)
  • Parsimax: \(\gamma = \frac{p(m-1)}{p+m-2}\)
  • Promax: non-orthogonal (oblique) transformations
  • Harris-Kaiser (HK): non-orthogonal (oblique) transformations
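A minimal sketch of an orthogonal rotation using base R's `varimax()`; `L_hat` is a hypothetical \(p \times m\) loading matrix, for example from the principal component sketch earlier.

```r
# Varimax rotation of an estimated loading matrix (illustrative sketch).
rot <- varimax(L_hat)       # returns rotated loadings and the orthogonal matrix T
L_star <- rot$loadings      # L* = L T
T_mat <- rot$rotmat         # the orthogonal rotation matrix
# Communalities are unchanged by the rotation:
all.equal(rowSums(unclass(L_star)^2), rowSums(L_hat^2))
```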

22.3.3 Estimation of Factor Scores

\[ (\mathbf{y}_j - \mathbf{\mu}) = \mathbf{L}_{p \times m}\mathbf{f}_j + \epsilon_j \]

If the factor model is correct then

\[ var(\epsilon_j) = \mathbf{\Psi} = diag (\psi_1, \dots , \psi_p) \]

Thus we could consider using weighted least squares to estimate \(\mathbf{f}_j\) , the vector of factor scores for the j-th sampled unit by

\[ \begin{aligned} \hat{\mathbf{f}} &= (\mathbf{L}'\mathbf{\Psi}^{-1} \mathbf{L})^{-1} \mathbf{L}' \mathbf{\Psi}^{-1}(\mathbf{y}_j - \mathbf{\mu}) \\ & \approx (\mathbf{L}'\mathbf{\Psi}^{-1} \mathbf{L})^{-1} \mathbf{L}' \mathbf{\Psi}^{-1}(\mathbf{y}_j - \mathbf{\bar{y}}) \end{aligned} \]

22.3.3.1 The Regression Method

Alternatively, we can use the regression method to estimate the factor scores

Consider the joint distribution of \((\mathbf{y}_j - \mathbf{\mu})\) and \(\mathbf{f}_j\), assuming multivariate normality as in the maximum likelihood approach. Then,

\[ \left( \begin{array} {c} \mathbf{y}_j - \mathbf{\mu} \\ \mathbf{f}_j \end{array} \right) \sim N_{p + m} \left( \mathbf{0}, \left[ \begin{array} {cc} \mathbf{LL}' + \mathbf{\Psi} & \mathbf{L} \\ \mathbf{L}' & \mathbf{I}_{m\times m} \end{array} \right] \right) \]

when the \(m\) factor model is correct

\[ E(\mathbf{f}_j | \mathbf{y}_j - \mathbf{\mu}) = \mathbf{L}' (\mathbf{LL}' + \mathbf{\Psi})^{-1}(\mathbf{y}_j - \mathbf{\mu}) \]

notice that \(\mathbf{L}' (\mathbf{LL}' + \mathbf{\Psi})^{-1}\) is an \(m \times p\) matrix of regression coefficients

Then, we use the estimated conditional mean vector to estimate the factor scores

\[ \mathbf{\hat{f}}_j = \mathbf{\hat{L}}'(\mathbf{\hat{L}}\mathbf{\hat{L}}' + \mathbf{\hat{\Psi}})^{-1}(\mathbf{y}_j - \mathbf{\bar{y}}) \]

Alternatively, we could reduce the effect of a possibly incorrect determination of the number of factors \(m\) by using \(\mathbf{S}\) as a substitute for \(\mathbf{\hat{L}}\mathbf{\hat{L}}' + \mathbf{\hat{\Psi}}\); then

\[ \mathbf{\hat{f}}_j = \mathbf{\hat{L}}'\mathbf{S}^{-1}(\mathbf{y}_j - \mathbf{\bar{y}}) \]

where \(j = 1,\dots,n\)
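A minimal sketch of regression-method factor scores using the \(\mathbf{S}\)-substitution form above; `dat` and `L_hat` are hypothetical objects carried over from the earlier sketches.

```r
# Regression-method factor scores (illustrative sketch).
# dat: hypothetical data frame; L_hat: estimated p x m loading matrix.
Y <- scale(as.matrix(dat), center = TRUE, scale = FALSE)  # rows are (y_j - y_bar)'
S <- cov(as.matrix(dat))
scores <- Y %*% solve(S) %*% L_hat    # row j gives f_hat_j' = (y_j - y_bar)' S^{-1} L_hat
head(scores)
# factanal(dat, factors = 2, scores = "regression") computes comparable scores.
```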

22.3.4 Model Diagnostic

Check for outliers (recall that \(\mathbf{f}_j \sim iid N(\mathbf{0}, \mathbf{I}_{m \times m})\) )

Check for multivariate normality assumption

Use univariate tests for normality to check the factor scores

Confirmatory Factor Analysis : formal testing of hypotheses about loadings, use MLE and full/reduced model testing paradigm and measures of model fit

22.3.5 Application

In the psych package:

  • h2 = the communalities
  • u2 = the uniqueness
  • com = the complexity

The output information for the null hypothesis of no common factors appears in the statement "The degrees of freedom for the null model ...".

The output information for the null hypothesis that the chosen number of factors is sufficient appears in the statement "The total number of observations was ...".

In this example, one factor is not enough, two factors are sufficient, and there are not enough data for three factors (df of -2 and NA for the p-value). Hence, we should use the 2-factor model. A hedged sketch of the call is below.
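Since the original code chunk is not reproduced here, the following is only a hedged sketch of the kind of psych call that produces the h2, u2, and com columns and the fit statements quoted above; the data frame `dat` and the settings are placeholders.

```r
# Exploratory factor analysis with the psych package (illustrative sketch).
library(psych)
fit2 <- fa(dat, nfactors = 2, rotate = "varimax", fm = "ml")
print(fit2)        # loadings plus h2, u2, and com columns, and the fit statements quoted above
fa.parallel(dat)   # one common aid for choosing the number of factors
```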

22.4 Discriminant Analysis

Suppose we have two or more different populations from which observations could come. Discriminant analysis seeks to determine which of the possible populations an observation comes from while making as few mistakes as possible.

This is an alternative to logistic approaches with the following advantages:

  • When there is clear separation between classes, the parameter estimates for the logistic regression model can be surprisingly unstable, while discriminant approaches do not suffer from this problem.
  • If \(X\) is normal in each of the classes and the sample size is small, then discriminant approaches can be more accurate than logistic regression.

Similar to MANOVA, let \(\mathbf{y}_{j1},\mathbf{y}_{j2},\dots, \mathbf{y}_{jn_j} \sim iid \; f_j (\mathbf{y})\) for \(j = 1,\dots, h\)

Let \(f_j(\mathbf{y})\) be the density function for population j. Note that each vector \(\mathbf{y}\) contains measurements on all \(p\) traits.

  • Assume that each observation is from one of \(h\) possible populations.
  • We want to form a discriminant rule that will allocate an observation \(\mathbf{y}\) to population j when \(\mathbf{y}\) is in fact from this population

22.4.1 Known Populations

The maximum likelihood discriminant rule for assigning an observation \(\mathbf{y}\) to one of the \(h\) populations allocates \(\mathbf{y}\) to the population that gives the largest likelihood to \(\mathbf{y}\)

Consider the likelihood for a single observation \(\mathbf{y}\) , which has the form \(f_j (\mathbf{y})\) where j is the true population.

Since \(j\) is unknown, to make the likelihood as large as possible, we should choose the value j which causes \(f_j (\mathbf{y})\) to be as large as possible

Consider a simple univariate example. Suppose we have data from one of two binomial populations.

The first population has \(n= 10\) trials with success probability \(p = .5\)

The second population has \(n= 10\) trials with success probability \(p = .7\)

To which population would we assign an observation of \(y = 7\)?

\(f(y = 7|n = 10, p = .5) = .117\)

\(f(y = 7|n = 10, p = .7) = .267\) where \(f(.)\) is the binomial likelihood.

Hence, we choose the second population
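The two binomial likelihoods in this example can be checked directly in R; this sketch simply reproduces the arithmetic above.

```r
# Binomial likelihoods for y = 7 successes out of n = 10 trials.
dbinom(7, size = 10, prob = 0.5)  # about 0.117
dbinom(7, size = 10, prob = 0.7)  # about 0.267 -> assign to the second population
```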

Another example

We have 2 populations, where

First population: \(N(\mu_1, \sigma^2_1)\)

Second population: \(N(\mu_2, \sigma^2_2)\)

The likelihood for a single observation is

\[ f_j (y) = (2\pi \sigma^2_j)^{-1/2} \exp\{ -\frac{1}{2}(\frac{y - \mu_j}{\sigma_j})^2\} \]

Consider a likelihood ratio rule

\[ \begin{aligned} \Lambda &= \frac{\text{likelihood of y from pop 1}}{\text{likelihood of y from pop 2}} \\ &= \frac{f_1(y)}{f_2(y)} \\ &= \frac{\sigma_2}{\sigma_1} \exp\{-\frac{1}{2}[(\frac{y - \mu_1}{\sigma_1})^2- (\frac{y - \mu_2}{\sigma_2})^2] \} \end{aligned} \]

Hence, we classify into

pop 1 if \(\Lambda >1\)

pop 2 if \(\Lambda <1\)

for ties, flip a coin

Another way to think:

We classify into population 1 if the "standardized distance" of \(y\) from \(\mu_1\) is less than the "standardized distance" of \(y\) from \(\mu_2\). This is referred to as a quadratic discriminant rule.

(Significant simplification occurs in the special case where \(\sigma_1^2 = \sigma_2^2 = \sigma^2\).)

Thus, we classify into population 1 if

\[ (y - \mu_2)^2 > (y - \mu_1)^2 \]

\[ |y- \mu_2| > |y - \mu_1| \]

\[ -2 \log (\Lambda) = -2y \frac{(\mu_1 - \mu_2)}{\sigma^2} + \frac{(\mu_1^2 - \mu_2^2)}{\sigma^2} = \beta y + \alpha \]

Thus, we classify into population 1 if this is less than 0.

Discriminant classification rule is linear in y in this case.

22.4.1.1 Multivariate Expansion

Suppose that there are 2 populations

\(N_p(\mathbf{\mu}_1, \mathbf{\Sigma}_1)\)

\(N_p(\mathbf{\mu}_2, \mathbf{\Sigma}_2)\)

\[ \begin{aligned} -2 \log(\frac{f_1 (\mathbf{x})}{f_2 (\mathbf{x})}) &= \log|\mathbf{\Sigma}_1| + (\mathbf{x} - \mathbf{\mu}_1)' \mathbf{\Sigma}^{-1}_1 (\mathbf{x} - \mathbf{\mu}_1) \\ &- [\log|\mathbf{\Sigma}_2|+ (\mathbf{x} - \mathbf{\mu}_2)' \mathbf{\Sigma}^{-1}_2 (\mathbf{x} - \mathbf{\mu}_2) ] \end{aligned} \]

Again, we classify into population 1 if this is less than 0, otherwise, population 2. And like the univariate case with non-equal variances, this is a quadratic discriminant rule.

And if the covariance matrices are equal, \(\mathbf{\Sigma}_1 = \mathbf{\Sigma}_2 = \mathbf{\Sigma}\), classify into population 1 if

\[ (\mathbf{\mu}_1 - \mathbf{\mu}_2)' \mathbf{\Sigma}^{-1}\mathbf{x} - \frac{1}{2} (\mathbf{\mu}_1 - \mathbf{\mu}_2)' \mathbf{\Sigma}^{-1} (\mathbf{\mu}_1 + \mathbf{\mu}_2) \ge 0 \]

This linear discriminant rule is also referred to as Fisher’s linear discriminant function

By assuming the covariance matrices are equal, we assume that the shape and orientation of the two populations are the same (which can be a strong restriction).

In other words, each variable can have a different mean in the two populations, but the variances and covariances are assumed to be the same.

  • Note: the LDA Bayes decision boundary is linear, so a quadratic decision boundary might lead to better classification in some problems. The assumption of the same variance/covariance matrix across all classes for Gaussian densities is what imposes the linear rule; if we allow the predictors in each class to follow a MVN distribution with class-specific mean vectors and variance/covariance matrices, we obtain Quadratic Discriminant Analysis. But then there are more parameters to estimate, which gives more flexibility than LDA at the cost of more variance (the bias-variance tradeoff).

When \(\mathbf{\mu}_1, \mathbf{\mu}_2, \mathbf{\Sigma}\) are known, the probability of misclassification can be determined:

\[ \begin{aligned} P(2|1) &= P(\text{classify into pop 2} \mid \mathbf{x} \text{ is from pop 1}) \\ &= P\left((\mathbf{\mu}_1 - \mathbf{\mu}_2)' \mathbf{\Sigma}^{-1} \mathbf{x} \le \frac{1}{2} (\mathbf{\mu}_1 - \mathbf{\mu}_2)' \mathbf{\Sigma}^{-1} (\mathbf{\mu}_1 + \mathbf{\mu}_2) \mid \mathbf{x} \sim N(\mathbf{\mu}_1, \mathbf{\Sigma})\right) \\ &= \Phi(-\frac{1}{2} \delta) \end{aligned} \]

\(\delta^2 = (\mathbf{\mu}_1 - \mathbf{\mu}_2)' \mathbf{\Sigma}^{-1} (\mathbf{\mu}_1 - \mathbf{\mu}_2)\)

\(\Phi\) is the standard normal CDF

Suppose there are \(h\) possible populations, distributed as \(N_p (\mathbf{\mu}_j, \mathbf{\Sigma})\) for \(j = 1, \dots, h\). Then the maximum likelihood (linear) discriminant rule allocates \(\mathbf{y}\) to the population \(j\) that minimizes the squared Mahalanobis distance

\[ (\mathbf{y} - \mathbf{\mu}_j)' \mathbf{\Sigma}^{-1} (\mathbf{y} - \mathbf{\mu}_j) \]
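A minimal sketch of this minimum-Mahalanobis-distance rule using base R's `mahalanobis()`, with sample means and a crude pooled covariance standing in for the unknown population quantities; the object names (`X`, `groups`, `newdata`) are placeholders.

```r
# Linear (Mahalanobis-distance) classification rule, sketched with sample estimates.
# X: numeric matrix of observations; groups: factor of known labels; newdata: matrix of new observations.
classify_mahalanobis <- function(X, groups, newdata) {
  groups <- factor(groups)
  Xs  <- split(as.data.frame(X), groups)
  mus <- lapply(Xs, colMeans)                                   # estimated group means
  Sp  <- Reduce(`+`, lapply(Xs, cov)) / length(Xs)              # crude (unweighted) pooled covariance
  d2  <- sapply(mus, function(mu) mahalanobis(newdata, center = mu, cov = Sp))
  d2  <- matrix(d2, nrow = nrow(as.matrix(newdata)))            # rows: new observations, cols: groups
  levels(groups)[apply(d2, 1, which.min)]                       # allocate to the nearest centroid
}
```

In practice one would usually call a packaged routine such as MASS::lda, sketched in the application section below.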

22.4.1.2 Bayes Discriminant Rules

If we know that population j has prior probabilities \(\pi_j\) (assume \(\pi_j >0\) ) we can form the Bayes discriminant rule.

This rule allocates an observation \(\mathbf{y}\) to the population for which \(\pi_j f_j (\mathbf{y})\) is maximized.

  • Maximum likelihood discriminant rule is a special case of the Bayes discriminant rule , where it sets all the \(\pi_j = 1/h\)

Optimal Properties of Bayes Discriminant Rules

let \(p_{ii}\) be the probability of correctly assigning an observation from population i

then one rule (with probabilities \(p_{ii}\) ) is as good as another rule (with probabilities \(p_{ii}'\) ) if \(p_{ii} \ge p_{ii}'\) for all \(i = 1,\dots, h\)

The first rule is better than the alternative if \(p_{ii} > p_{ii}'\) for at least one i.

A rule for which there is no better alternative is called admissible

Bayes Discriminant Rules are admissible

If we utilize prior probabilities, then we can form the posterior probability of a correct allocation, \(\sum_{i=1}^h \pi_i p_{ii}\).

Bayes Discriminant Rules have the largest possible posterior probability of correct allocation with respect to the prior

These properties show that Bayes Discriminant rule is our best approach .

Unequal Cost

We want to take the cost of misallocation into account.

  • Define \(c_{ij}\) to be the cost associated with allocating a member of population j to population i.

Assume that

\(c_{ij} >0\) for all \(i \neq j\)

\(c_{ij} = 0\) if \(i = j\)

We could determine the expected amount of loss for an observation allocated to population i as \(\sum_j c_{ij} p_{ij}\) where the \(p_{ij}s\) are the probabilities of allocating an observation from population j into population i

We want to minimize the expected loss for our rule. Using Bayes discrimination, allocate \(\mathbf{y}\) to the population j which minimizes \(\sum_{k \neq j} c_{jk} \pi_k f_k(\mathbf{y})\)

We could instead assign equal prior probabilities to each group and get a maximum-likelihood-type rule; here, we would allocate \(\mathbf{y}\) to the population j which minimizes \(\sum_{k \neq j}c_{jk} f_k(\mathbf{y})\)

Example: consider again two binomial populations, each based on \(n = 10\) trials, with success probabilities \(p_1 = .5\) and \(p_2 = .7\), and suppose the prior probability of being in the first population is .9.

However, suppose the cost of inappropriately allocating into the first population is 1 and the cost of incorrectly allocating into the second population is 5.

In this case, we pick population 1 over population 2 (a numerical sketch follows the region definitions below).

In general, we consider two regions, \(R_1\) and \(R_2\) associated with population 1 and 2:

\[ R_1: \frac{f_1 (\mathbf{x})}{f_2 (\mathbf{x})} \ge \frac{c_{12} \pi_2}{c_{21} \pi_1} \]

\[ R_2: \frac{f_1 (\mathbf{x})}{f_2 (\mathbf{x})} < \frac{c_{12} \pi_2}{c_{21} \pi_1} \]

where \(c_{12}\) is the cost of assigning a member of population 2 to population 1.
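A sketch of the cost- and prior-weighted region check for the binomial example above, using the observation \(y = 7\) from the earlier example together with the stated priors and costs.

```r
# Cost- and prior-weighted Bayes rule for the binomial example (illustrative sketch).
y <- 7; n <- 10
lik_ratio <- dbinom(y, n, 0.5) / dbinom(y, n, 0.7)   # f1(y) / f2(y), about 0.44
threshold <- (1 * 0.1) / (5 * 0.9)                   # c12 * pi2 / (c21 * pi1), about 0.022
if (lik_ratio >= threshold) "population 1" else "population 2"   # falls in R1 -> population 1
```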

22.4.1.3 Discrimination Under Estimation

Suppose we know the form of the distributions for the populations of interest, but we still have to estimate the parameters.

we know the distributions are multivariate normal, but we have to estimate the means and variances

The maximum likelihood discriminant rule allocates an observation \(\mathbf{y}\) to population j when j maximizes the function

\[ f_j (\mathbf{y} |\hat{\theta}) \]

where \(\hat{\theta}\) are the maximum likelihood estimates of the unknown parameters

For instance, suppose we have 2 multivariate normal populations with distinct means but a common variance-covariance matrix.

The MLEs for \(\mathbf{\mu}_1\) and \(\mathbf{\mu}_2\) are \(\mathbf{\bar{y}}_1\) and \(\mathbf{\bar{y}}_2\), and the common \(\mathbf{\Sigma}\) is estimated by \(\mathbf{S}\).

Thus, an estimated discriminant rule could be formed by substituting these sample values for the population values

22.4.1.4 Naive Bayes

The challenge with classification using Bayes’ is that we don’t know the (true) densities, \(f_k, k = 1, \dots, K\) , while LDA and QDA make strong multivariate normality assumptions to deal with this.

Naive Bayes makes only one assumption: within the k-th class, the \(p\) predictors are independent (i.e., for \(k = 1,\dots, K\)),

\[ f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \dots \times f_{kp}(x_p) \]

where \(f_{kj}\) is the density function of the j-th predictor among observation in the k-th class.

This assumption lets us work with the joint distribution without having to model the dependence among the predictors. The (naive) assumption can be unrealistic, but the method still works well in cases where the sample size \(n\) is not large relative to the number of features \(p\).

With this assumption, we have

\[ P(Y=k|X=x) = \frac{\pi_k \times f_{k1}(x_1) \times \dots \times f_{kp}(x_p)}{\sum_{l=1}^K \pi_l \times f_{l1}(x_1)\times \dots f_{lp}(x_p)} \]

We only need to estimate the one-dimensional density functions \(f_{kj}\), with any of these approaches:

  • When \(X_j\) is quantitative, assume it has a univariate normal distribution (with independence): \(X_j | Y = k \sim N(\mu_{jk}, \sigma^2_{jk})\). This is more restrictive than QDA because it assumes the predictors are independent (i.e., a diagonal covariance matrix).
  • When \(X_j\) is quantitative, use a kernel density estimator (see Kernel Methods), which is essentially a smoothed histogram.
  • When \(X_j\) is qualitative, count the proportion of training observations for the j-th predictor corresponding to each class.

A short sketch of a naive Bayes fit follows.
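As a hedged illustration only, the e1071 package provides one common naive Bayes implementation in R; the iris data and settings below are placeholders rather than anything used in the original text.

```r
# Naive Bayes classification (illustrative sketch with the e1071 package).
library(e1071)
fit_nb <- naiveBayes(Species ~ ., data = iris)   # Gaussian densities within each class
pred   <- predict(fit_nb, newdata = iris)
table(predicted = pred, truth = iris$Species)    # resubstitution confusion matrix
```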

22.4.1.5 Comparison of Classification Methods

Assume we have \(K\) classes, with class \(K\) as the baseline (following the James, Witten, Hastie, and Tibshirani book).

Each method below is expressed as the log odds of class \(k\) relative to the baseline class \(K\).

22.4.1.5.1 Logistic Regression

\[ \log(\frac{P(Y=k|X = x)}{P(Y = K| X = x)}) = \beta_{k0} + \sum_{j=1}^p \beta_{kj}x_j \]

22.4.1.5.2 LDA

\[ \log(\frac{P(Y = k | X = x)}{P(Y = K | X = x)}) = a_k + \sum_{j=1}^p b_{kj} x_j \]

where \(a_k\) and \(b_{kj}\) are functions of \(\pi_k, \pi_K, \mu_k , \mu_K, \mathbf{\Sigma}\)

Similar to logistic regression, LDA assumes the log odds is linear in \(x\)

Even though they appear to have the same form, the parameters in logistic regression are estimated by maximum likelihood, whereas the LDA linear parameters are obtained from the estimated priors and normal distributions.

We expect LDA to outperform logistic regression when the normality assumption (approximately) holds, and logistic regression to perform better when it does not

22.4.1.5.3 QDA

\[ \log(\frac{P(Y=k|X=x)}{P(Y=K | X = x)}) = a_k + \sum_{j=1}^{p}b_{kj}x_{j} + \sum_{j=1}^p \sum_{l=1}^p c_{kjl}x_j x_l \]

where \(a_k, b_{kj}, c_{kjl}\) are functions of \(\pi_k , \pi_K, \mu_k, \mu_K ,\mathbf{\Sigma}_k, \mathbf{\Sigma}_K\)

22.4.1.5.4 Naive Bayes

\[ \log (\frac{P(Y = k | X = x)}{P(Y = K | X = x)}) = a_k + \sum_{j=1}^p g_{kj} (x_j) \]

where \(a_k = \log (\pi_k / \pi_K)\) and \(g_{kj}(x_j) = \log(\frac{f_{kj}(x_j)}{f_{Kj}(x_j)})\), which has the form of a generalized additive model.

22.4.1.5.5 Summary

LDA is a special case of QDA

LDA tends to be more robust than QDA in high dimensions, since it estimates far fewer covariance parameters.

Any classifier with a linear decision boundary is a special case of naive Bayes with \(g_{kj}(x_j) = b_{kj} x_j\) , which means LDA is a special case of naive Bayes. LDA assumes that the features are normally distributed with a common within-class covariance matrix, and naive Bayes assumes independence of the features.

Naive Bayes is also a special case of LDA with \(\mathbf{\Sigma}\) restricted to a diagonal matrix with diagonal entries \(\sigma_j^2\) (another notation: \(diag (\mathbf{\Sigma})\)), assuming \(f_{kj}(x_j) = N(\mu_{kj}, \sigma^2_j)\).

QDA and naive Bayes are not special cases of each other. In principle, naive Bayes can produce a more flexible fit through the choice of \(g_{kj}(x_j)\), but it is restricted to a purely additive fit, whereas QDA includes multiplicative terms of the form \(c_{kjl}x_j x_l\).

None of these methods uniformly dominates the others: the choice of method depends on the true distribution of the predictors in each of the K classes, n and p (i.e., related to the bias-variance tradeoff).

Compare to the non-parametric method (KNN)

KNN would outperform both LDA and logistic regression when the decision boundary is highly nonlinear, but it cannot tell us which predictors are most important, and it requires many observations.

KNN is also limited in high-dimensions due to the curse of dimensionality

Since QDA produces a particular type of nonlinear decision boundary (quadratic), it can be considered a compromise between the linear methods and KNN classification. QDA requires fewer training observations than KNN but is not as flexible.

From simulation:

  • Linear true decision boundary: LDA and logistic regression perform best
  • Moderately nonlinear: QDA and naive Bayes
  • Highly nonlinear (many training observations, \(p\) not large): KNN
  • like linear regression, we can also introduce flexibility by including transformed features \(\sqrt{X}, X^2, X^3\)

22.4.2 Probabilities of Misclassification

When the distributions are exactly known, we can determine the misclassification probabilities exactly. However, when we need to estimate the population parameters, we also have to estimate the probability of misclassification.

Naive method

Plug the parameter estimates into the formulas for the misclassification probabilities to obtain estimated misclassification probabilities.

However, this will tend to be optimistic when the number of samples in one or more populations is small.

Resubstitution method

Use the proportion of the samples from population i that would be allocated to another population as an estimate of the misclassification probability

But also optimistic when the number of samples is small

Jack-knife estimates:

The above two methods use each observation both to estimate the parameters and to estimate the misclassification probabilities based upon the resulting discriminant rule.

Alternatively, we can determine the discriminant rule based upon all of the data except the k-th observation from the j-th population, and then determine whether the k-th observation would be misclassified under this rule.

Performing this process for all \(n_j\) observations in population j, an estimate of the misclassification probability is the fraction of the \(n_j\) observations which were misclassified. Repeat the process for the other \(i \neq j\) populations.

This method is more reliable than the others, but it is also computationally intensive. A leave-one-out sketch follows.
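A minimal sketch of a leave-one-out (jack-knife style) error estimate using the built-in cross-validation option of MASS::lda; the iris data are a placeholder.

```r
# Leave-one-out estimate of the misclassification rate (illustrative sketch).
library(MASS)
cv_fit <- lda(Species ~ ., data = iris, CV = TRUE)     # CV = TRUE returns leave-one-out predictions
mean(cv_fit$class != iris$Species)                     # estimated misclassification probability
table(predicted = cv_fit$class, truth = iris$Species)
```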

Cross-Validation

Consider the group-specific densities \(f_j (\mathbf{x})\) for multivariate vector \(\mathbf{x}\) .

Assuming equal misclassification costs, the Bayes posterior probability that \(\mathbf{x}\) belongs to the j-th population is

\[ p(j |\mathbf{x}) = \frac{\pi_j f_j (\mathbf{x})}{\sum_{k=1}^h \pi_k f_k (\mathbf{x})} \]

\(j = 1,\dots, h\)

where there are \(h\) possible groups.

We then classify into the group for which this probability of membership is largest

Alternatively, we can write this in terms of a generalized squared distance formation

\[ D_j^2 (\mathbf{x}) = d_j^2 (\mathbf{x})+ g_1(j) + g_2 (j) \]

\(d_j^2(\mathbf{x}) = (\mathbf{x} - \mathbf{\mu}_j)' \mathbf{V}_j^{-1} (\mathbf{x} - \mathbf{\mu}_j)\) is the squared Mahalanobis distance from \(\mathbf{x}\) to the centroid of group j, and

\(\mathbf{V}_j = \mathbf{S}_j\) if the within group covariance matrices are not equal

\(\mathbf{V}_j = \mathbf{S}_p\) if a pooled covariance estimate is appropriate

\[ g_1(j) = \begin{cases} \ln |\mathbf{S}_j| & \text{within group covariances are not equal} \\ 0 & \text{pooled covariance} \end{cases} \]

\[ g_2(j) = \begin{cases} -2 \ln \pi_j & \text{prior probabilities are not equal} \\ 0 & \text{prior probabilities are equal} \end{cases} \]

then, the posterior probability of belonging to group j is

\[ p(j| \mathbf{x}) = \frac{\exp(-.5 D_j^2(\mathbf{x}))}{\sum_{k=1}^h \exp(-.5 D^2_k (\mathbf{x}))} \]

where \(j = 1,\dots , h\)

and \(\mathbf{x}\) is classified into group j if \(p(j | \mathbf{x})\) is largest for \(j = 1,\dots,h\) (or, \(D_j^2(\mathbf{x})\) is smallest).

22.4.2.1 Assessing Classification Performance

For binary classification, the confusion matrix is:

                            Predicted: - or Null    Predicted: + or Non-null    Total
True class: - or Null       True Neg. (TN)          False Pos. (FP)             N
True class: + or Non-null   False Neg. (FN)         True Pos. (TP)              P
Total                       N*                      P*

and Table 4.6 from (James et al. 2013) gives the standard rates:

Name               Definition    Synonyms
False Pos. rate    FP/N          Type I error, 1 - specificity
True Pos. rate     TP/P          1 - Type II error, power, sensitivity, recall
Pos. Pred. value   TP/P*         Precision, 1 - false discovery proportion
Neg. Pred. value   TN/N*

The ROC curve (Receiver Operating Characteristic) is a graphical comparison between sensitivity (the true positive rate) and specificity (= 1 - the false positive rate):

  • y-axis = true positive rate
  • x-axis = false positive rate

as we vary the threshold for classifying an observation from 0 to 1.

The AUC (area under the ROC curve) would ideally equal 1; a classifier no better than chance has AUC = 0.5. A small sketch follows.
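A hedged sketch of computing an ROC curve and AUC, using the pROC package as one option; `labels` and `scores` are hypothetical vectors of true 0/1 classes and predicted probabilities.

```r
# ROC curve and AUC (illustrative sketch with the pROC package).
library(pROC)
roc_obj <- roc(response = labels, predictor = scores)  # labels: 0/1 truth; scores: predicted probabilities
plot(roc_obj)     # sensitivity vs. specificity across thresholds
auc(roc_obj)      # area under the curve
```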

22.4.3 Unknown Populations/ Nonparametric Discrimination

When the multivariate data are not Gaussian, or the distributional form is not known at all, we can use the following methods.

22.4.3.1 Kernel Methods

We approximate \(f_j (\mathbf{x})\) by a kernel density estimate

\[ \hat{f}_j(\mathbf{x}) = \frac{1}{n_j} \sum_{i = 1}^{n_j} K_j (\mathbf{x} - \mathbf{x}_i) \]

\(K_j (.)\) is a kernel function satisfying \(\int K_j(\mathbf{z})d\mathbf{z} =1\)

\(\mathbf{x}_i\) , \(i = 1,\dots , n_j\) is a random sample from the j-th population.

Thus, after finding \(\hat{f}_j (\mathbf{x})\) for each of the \(h\) populations, the posterior probability of group membership is

\[ p(j |\mathbf{x}) = \frac{\pi_j \hat{f}_j (\mathbf{x})}{\sum_{k=1}^h \pi_k \hat{f}_k (\mathbf{x})} \]

where \(j = 1,\dots, h\)

There are different choices for the kernel function; the Epanechnikov kernel is one example.

With these kernels, we have to pick the "radius" (or variance, width, window width, bandwidth) of the kernel, which is a smoothing parameter (the larger the radius, the smoother the kernel estimate of the density).

To select the smoothing parameter, we can use the following approach: if we believe the populations are close to multivariate normal, then

\[ R = \left(\frac{4/(2p+1)}{n_j}\right)^{1/(p+4)} \]

But since we do not know this for sure, we might choose several different values and select the one that gives the best out-of-sample or cross-validated discrimination.

Moreover, you also have to decide whether to use different kernel smoothness for different populations, which is similar to the individual and pooled covariances in the classical methodology.

22.4.3.2 Nearest Neighbor Methods

The nearest neighbor (also known as k-nearest neighbor) method performs the classification of a new observation vector based on the group membership of its nearest neighbors. In practice, we find

\[ d_{ij}^2 (\mathbf{x}, \mathbf{x}_i) = (\mathbf{x} - \mathbf{x}_i)' \mathbf{V}_j^{-1}(\mathbf{x} - \mathbf{x}_i) \]

which is the distance between the vector \(\mathbf{x}\) and the \(i\) -th observation in group \(j\)

We consider different choices for \(\mathbf{V}_j\)

\[ \begin{aligned} \mathbf{V}_j &= \mathbf{S}_p \\ \mathbf{V}_j &= \mathbf{S}_j \\ \mathbf{V}_j &= \mathbf{I} \\ \mathbf{V}_j &= diag (\mathbf{S}_p) \end{aligned} \]

We find the \(k\) observations that are closest to \(\mathbf{x}\) (where the user picks \(k\)), and then classify \(\mathbf{x}\) into the most common population among those neighbors, weighted by the prior. A small sketch follows.
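A minimal k-nearest-neighbor sketch using the class package, with Euclidean distance on standardized features (i.e., \(\mathbf{V}_j = \mathbf{I}\) after scaling); the iris split is a placeholder.

```r
# k-nearest neighbor classification (illustrative sketch with the class package).
library(class)
train_idx <- sample(nrow(iris), 100)
X <- scale(iris[, 1:4])                          # standardize so Euclidean distance is sensible
pred <- knn(train = X[train_idx, ], test = X[-train_idx, ],
            cl = iris$Species[train_idx], k = 5)
table(predicted = pred, truth = iris$Species[-train_idx])
```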

22.4.3.3 Modern Discriminant Methods

Logistic regression (with or without random effects) is a flexible model-based procedure for classification between two populations.

The extension of logistic regression to the multi-group setting is polychotomous logistic regression (or multinomial regression).

The machine learning and pattern recognition literatures are growing rapidly, with a strong focus on nonlinear discriminant analysis methods such as:

radial basis function networks

support vector machines

multilayer perceptrons (neural networks)

The general framework

\[ g_j (\mathbf{x}) = \sum_{l = 1}^m w_{jl}\phi_l (\mathbf{x}; \mathbf{\theta}_l) + w_{j0} \]

\(m\) nonlinear basis functions \(\phi_l\) , each of which has \(n_m\) parameters given by \(\theta_l = \{ \theta_{lk}: k = 1, \dots , n_m \}\)

We assign \(\mathbf{x}\) to the \(j\) -th population if \(g_j(\mathbf{x})\) is the maximum for all \(j = 1,\dots, h\)

Development usually focuses on the choice and estimation of the basis functions, \(\phi_l\) and the estimation of the weights \(w_{jl}\)

More details can be found in (Webb, Copsey, and Cawley 2011).

22.4.4 Application

22.4.4.1 lda.

The default prior is proportional to the sample sizes, and lda and qda do not fit a constant or intercept term.

In this example, LDA did not do well on either the within-sample or the out-of-sample data. A hedged sketch of the call is below.
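Since the original code chunk is not reproduced here, this is only a hedged sketch of a typical MASS::lda fit with within-sample and out-of-sample error rates; the data split is a placeholder.

```r
# Linear discriminant analysis (illustrative sketch with the MASS package).
library(MASS)
idx <- sample(nrow(iris), 100)
fit_lda <- lda(Species ~ ., data = iris[idx, ])          # prior defaults to the class proportions
in_pred  <- predict(fit_lda, iris[idx, ])$class          # within-sample predictions
out_pred <- predict(fit_lda, iris[-idx, ])$class         # out-of-sample predictions
mean(in_pred  != iris$Species[idx])                      # within-sample error rate
mean(out_pred != iris$Species[-idx])                     # out-of-sample error rate
```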

22.4.4.2 QDA
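Similarly, a hedged sketch of quadratic discriminant analysis with MASS::qda, reusing the placeholder split from the LDA sketch above.

```r
# Quadratic discriminant analysis (illustrative sketch).
fit_qda <- qda(Species ~ ., data = iris[idx, ])
mean(predict(fit_qda, iris[-idx, ])$class != iris$Species[-idx])  # out-of-sample error rate
```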

22.4.4.3 knn.

knn uses design matrices of the features.

22.4.4.4 Stepwise

Stepwise discriminant analysis can be done using the stepclass function in the klaR package. A hedged sketch is below.
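A hedged sketch of stepwise variable selection for discriminant analysis with klaR::stepclass; the data and settings are placeholders, and argument defaults may differ across package versions.

```r
# Stepwise discriminant analysis (illustrative sketch with the klaR package).
library(klaR)
step_fit <- stepclass(Species ~ ., data = iris, method = "lda",
                      direction = "both")   # adds/drops variables by cross-validated correctness
step_fit                                    # printing shows the selected variables and performance
```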

22.4.4.5 PCA with Discriminant Analysis

We can use PCA for dimension reduction before applying discriminant analysis. A hedged sketch is below.
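A minimal sketch of the combination: run PCA on the predictors, keep the leading components, and fit the discriminant rule on the component scores (all settings are placeholders).

```r
# PCA followed by discriminant analysis (illustrative sketch).
pca <- prcomp(iris[, 1:4], scale. = TRUE)
scores <- as.data.frame(pca$x[, 1:2])            # keep the first two principal components
scores$Species <- iris$Species
fit_pca_lda <- MASS::lda(Species ~ PC1 + PC2, data = scores)
mean(predict(fit_pca_lda)$class != scores$Species)   # resubstitution error rate
```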

Indian J Dermatol, v. 62(4), Jul-Aug 2017

Biostatistics Series Module 10: Brief Overview of Multivariate Methods

Avijit Hazra

From the Department of Pharmacology, Institute of Postgraduate Medical Education and Research, Kolkata, West Bengal, India

Nithya Gogtay

Department of Clinical Pharmacology, Seth GS Medical College and KEM Hospital, Mumbai, Maharashtra, India

Multivariate analysis refers to statistical techniques that simultaneously look at three or more variables in relation to the subjects under investigation with the aim of identifying or clarifying the relationships between them. These techniques have been broadly classified as dependence techniques, which explore the relationship between one or more dependent variables and their independent predictors, and interdependence techniques, that make no such distinction but treat all variables equally in a search for underlying relationships. Multiple linear regression models a situation where a single numerical dependent variable is to be predicted from multiple numerical independent variables. Logistic regression is used when the outcome variable is dichotomous in nature. The log-linear technique models count type of data and can be used to analyze cross-tabulations where more than two variables are included. Analysis of covariance is an extension of analysis of variance (ANOVA), in which an additional independent variable of interest, the covariate, is brought into the analysis. It tries to examine whether a difference persists after “controlling” for the effect of the covariate that can impact the numerical dependent variable of interest. Multivariate analysis of variance (MANOVA) is a multivariate extension of ANOVA used when multiple numerical dependent variables have to be incorporated in the analysis. Interdependence techniques are more commonly applied to psychometrics, social sciences and market research. Exploratory factor analysis and principal component analysis are related techniques that seek to extract from a larger number of metric variables, a smaller number of composite factors or components, which are linearly related to the original variables. Cluster analysis aims to identify, in a large number of cases, relatively homogeneous groups called clusters, without prior information about the groups. The calculation intensive nature of multivariate analysis has so far precluded most researchers from using these techniques routinely. The situation is now changing with wider availability, and increasing sophistication of statistical software and researchers should no longer shy away from exploring the applications of multivariate methods to real-life data sets.

Introduction

Multivariate analysis refers to statistical techniques that simultaneously look at three or more variables in relation to the subject under investigation with the aim of identifying or clarifying the relationships between them. The real world is always multivariate. Anything happening is the result of many different inputs and influences. However, multivariate methods are calculation intensive and hence have not been applied to research problems with the frequency that they should have been. Fortunately, in recent years, increasing computing power and the availability and user-friendliness of statistical software are leading to increased interest in and use of multivariate techniques. The aim of this module is to demystify multivariate methods, many of which are the basis for statistical modeling, and take a closer look at some of these methods. However, it is to be borne in mind that merely familiarizing oneself with the meaning of a technique does not mean that one has gained an understanding of what the analytic procedure does, what are its limitations, how to interpret the output that is generated and what the results signify. Mastery of a technique can only come through actually working with it a number of times using real data sets.

Classification of Multivariate Methods

This in itself is a complex issue. Most often, multivariate methods are classified as dependence or interdependence techniques and selection of the appropriate technique hinges on an understanding of this distinction. Table 1 lists techniques on the basis of this categorization. If in the context of the research question of interest, we can identify dependent or response variables whose value is influenced by independent explanatory or predictor variables, then we are dealing with a dependency situation. On the other hand, if variables are interrelated without a clear distinction of dependent and interdependent variables, or without the need for such a distinction, then we are dealing with an interdependency situation. Thus in the latter case, there are no dependent and independent variable designations, all variables are treated equally in a search for underlying patterns of relationships.

Table 1: Classification of multivariate statistical techniques [image not reproduced]

Multiple Linear Regression

Multiple linear regression attempts to model the relationship between two or more metric explanatory variables and a single metric response variable by fitting a linear equation to observed data.

Thus if we have a question like “do age, height and weight explain the variation in fasting blood glucose level?” or can “fasting blood glucose level be predicted from age, height and weight?,” we may use multiple regression after capturing data on all four variables for a series of subjects. Actually, there are three uses for multiple linear regression analysis. First, it may be used to assess the strength of the influence that individual predictors have on the single dependent variable. Second, it helps us to understand how much the dependent variable will change when we vary the predictors. Finally, a good model may be used to predict trends and future values.

Mathematically the model for multiple regression, given n predictors, is denoted as

\[ y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_n X_n + \epsilon \]

where \(\beta_0\) represents a constant and \(\epsilon\) represents a random error term, while \(\beta_1, \beta_2, \beta_3\), etc., denote the regression coefficients (called partial regression coefficients in the multiple regression situation) associated with each predictor. A partial regression coefficient denotes the amount by which Y changes when the particular X value changes by one unit, given that all other X values remain constant. Calculation of the regression coefficients makes use of the method of least squares, as in simple linear regression.

We can refine and simplify a multiple linear regression model by judging which predictors substantially influence the dependent variable, and then excluding those that have only a trivial effect. Software packages implement this as a series of hypothesis tests for the predictors. For each predictor variable \(X_i\), we may test the null hypothesis \(\beta_i = 0\) against the alternative \(\beta_i \neq 0\), and obtain a series of p values. Predictors returning statistically significant p values may be retained in the model, and the others discarded. The software also offers various diagnostic tools to judge model fit and adequacy.

Multiple linear regression analysis makes certain assumptions. First and foremost, a linear relationship is assumed between each independent variable and the dependent variable; this can be verified through inspection of scatter plots. The regression residuals are assumed to be normally distributed and homoscedastic, that is, to show homogeneity of variance. The absence of multicollinearity is assumed in the model, meaning that the independent variables are not too highly correlated with one another. Finally, when modeling a dependent variable by the multiple linear regression procedure, an important consideration is the model fit. Adding independent variables to the model will always increase the amount of explained variance in the dependent variable (typically expressed as \(R^2\), in contrast to the \(r^2\) used in the context of simple linear regression). However, adding too many independent variables without any theoretical justification may result in an overfit model.
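As a brief, hedged illustration of the model just described, the sketch below fits a multiple linear regression in R; the data frame `dat` and the variable names echo the article's fasting-glucose example and are placeholders.

```r
# Multiple linear regression (illustrative sketch).
# dat: hypothetical data frame with columns glucose, age, height, weight.
fit <- lm(glucose ~ age + height + weight, data = dat)
summary(fit)       # partial regression coefficients, p values, and R-squared
plot(fit)          # diagnostic plots: residual normality, homoscedasticity, outliers
```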

Logistic Regression

Logistic regression analysis, in the simplest variant called binary logistic regression, models the value of one dependent variable, that is binary in nature, from two or more independent predictor variables.

It can be used to model any situation when the relationship between multiple predictor variables and one dependent variable is to be explored, provided the latter is dichotomous or binary in nature. The predictor variables can be any mix of numerical, nominal and ordinal variables. The technique addresses the same questions that discriminant function analysis or multiple linear regression does but with no distributional assumptions on the predictors (the predictors do not have to be normally distributed, linearly related or have equal variance in each group). There are more complex forms of logistic regression, called polychotomous logistic regression (also called polytomous or multinomial logistic regression), that can handle dependent variables that are polychotomous, rather than binary, in nature. If the outcome categories are ordered, the analysis is referred to as ordinal logistic regression. If predictors are all continuous, normally distributed and show homogeneity of variance, one may use discriminant analysis instead of logistic regression. If predictors are all categorical, one may use logit analysis instead.

Many questions may be addressed through logistic regression

  • What is the relative importance of each predictor or, in other words, what is the strength of association between the outcome variable and a predictor?
  • Can the outcome variable be correctly predicted given a set of predictors?
  • Can the solution generalize to predicting new cases?
  • How does each variable affect the outcome? Does a predictor make the solution better or worse or have no effect?
  • Are there interactions among predictors, that is whether the effect of one predictor differs according to the level of another?
  • Does adding interactions among predictors (continuous or categorical) improve the model?
  • How good is the model at classifying cases for which the outcome is known?
  • Can prediction models be tested for relative fit to the data? (the “goodness of fit” estimation).

Logistic regression utilizes the logit function and one of its few assumptions is that of linearity in the logit – the regression equation should have a linear relationship with the logit form of the predicted variable (there is no assumption about the predictors being linearly related to each other). The problem with probabilities is that they are nonlinear. Thus moving from 0.10 to 0.20 doubles the probability, but going from 0.80 to 0.90 barely increases the probability. Odds, we know, express the ratio of the probability of occurrence of an event to the probability of nonoccurrence of the event. Logit or log odds is the natural logarithm (log to base e or 2.71828) of a proportion expressed as the odds. The logit scale is linear. The unique impact of each predictor can be expressed as an odds ratio that can be tested for statistical significance, against the null hypothesis that the ratio is 1 (meaning no influence of a predictor on the predicted variable). This is done by using a form of Chi-square test that gives a Wald statistic (also called Wald coefficient) and by looking at the 95% confidence limits of the odds ratio. The odds ratio of an individual predictor in a logistic regression output is called the adjusted odds ratio since it is adjusted against the values of other predictors.

The regression equation, in logistic regression, with n predictors, takes the form:

\[ \log_e \left[ \frac{p}{1-p} \right] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n \]

where \(\log_e[p/(1-p)]\) denotes logit(\(p\)), \(p\) being the probability of occurrence of the event in question; \(X_1, X_2, \dots, X_n\) are the values of the explanatory variables; \(\beta_0\) is the constant in the equation; and the beta values are the regression coefficients.

Using the logistic transformation in this way overcomes problems that might arise if P was modeled directly as a linear function of the explanatory variables. The constant and the regression coefficients for the predictors are usually displayed in table form in most computer outputs. A Wald statistic is provided for each predictor in the regression model, with its corresponding significance value.

There are three strategies for executing a logistic regression analysis – direct, sequential and stepwise. In the direct or “enter” method, all the independent variables are forced into the regression equation. This method is generally used when there is no specific hypotheses regarding the relative importance of the predictors. In sequential logistic regression, the investigator decides the order in which the independent variables are entered into the model to get an idea of the relative importance of the predictors. The predictor that is entered first is given priority, and all subsequent independent variables are assessed to see if they add significantly to the predictive value of the first variable. Stepwise logistic regression is an exploratory method concerned with hypothesis generation. The inclusion or exclusion of independent variables in the regression equation is decided on statistical grounds as the model is run. With some statistical packages, it is possible to set the significance level that will decide when to exclude a variable from the model. In a stepwise approach, high correlations between independent variables may mean that a predictor may be discarded, even if it does significantly predict the dependent variable because predictors entered earlier have already accounted for the prediction. Therefore, it is important to choose independent variables that are likely to have a high impact on the predicted variable but have little or no correlations between themselves (avoiding multicollinearity). This is a concern in all regression techniques but is particularly important in stepwise logistic regression because of the manner in which the analysis is run. Therefore, it helps greatly to precede a logistic regression analysis by univariate analysis on individual predictors and correlation analysis for various predictors. Discarding predictors with minimal influence as indicated by univariate analysis, will also simplify the model without decreasing its predictive value. The regression analysis will then indicate the best set of predictors, from those included, and one can use these to predict the outcome for new cases.

The exponential beta value in the logistic regression output denotes the odds ratio of the dependent variable, from which the probability of the dependent variable can be found. If the exponential beta value is greater than one, then the probability of the higher category (i.e., usually the occurrence of the event) increases, and if the exponential beta value is less than one, then the probability of the higher category decreases. For categorical variables, the exponential beta value is interpreted against the reference category, relative to which the probability of the dependent variable will increase or decrease. For numerical predictors, it is interpreted as the change in the odds of the outcome corresponding to a one unit increase in the independent variable.

In classical regression analysis, the \(R^2\) (coefficient of determination) value is used to express the proportion of the variability in the model that can be explained by one or more predictors that are included in the regression equation. This measure of effect size is not acceptable in logistic regression; instead, parameters such as Cox and Snell's \(R^2\), Nagelkerke's \(R^2\), and McFadden's −2 log-likelihood (−2LL) statistic are considered more appropriate. The \(R^2\) values are more readily understood; if multiplied by 100, they indicate the percentage of variability accounted for by the model. Again, a "most likely" fit means the −2LL value is minimized.

The overall significance of the regression model can also be tested by a goodness-of-fit test such as the likelihood ratio Chi-square test or the Hosmer–Lemeshow Chi-square test for comparing the goodness-of-fit of the model (including the constant and any number of predictors) with that of the constant-only model. If the difference is nonsignificant, it indicates that there is no difference in predictive power between the model that includes all the predictors and the model that includes none of the predictors. Another way of assessing the validity of the model is to look at the proportion of cases that are correctly classified by the model on the dependent variable. This is presented in a classification table.

Finally, it is important to be aware of the limitations or restrictions of logistic regression analysis. The dependent variable must be binary in nature. It is essentially a large sample analysis, and a satisfactory model cannot be found with too many predictors. One convenient thumb rule of sample size for logistic regression analysis is that there should be at least ten times as many cases as predictors. If not, the least influential predictors should be dropped from the analysis. Outliers will also influence the results of the regression analysis, and, therefore, it is good practice to examine the distribution of the predictor variables and exclude outliers from the data set. For this purpose, the values of a particular predictor may be transformed into standardized z scores and values with a z score of ± 3.29 or more are to be regarded as outliers. Finally, even if an independent variable successfully predicts the outcome, this does not imply a causal relationship. With large samples, even trivial correlations can become significant, and it is to be always borne in mind that statistical significance does not necessarily imply clinical significance.
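A minimal, hedged sketch of a binary logistic regression fit in R matching the description above; the data frame `dat`, outcome `y`, and predictors `x1`, `x2` are placeholders.

```r
# Binary logistic regression (illustrative sketch).
fit <- glm(y ~ x1 + x2, data = dat, family = binomial)
summary(fit)             # coefficients on the logit scale with Wald tests
exp(coef(fit))           # adjusted odds ratios
```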

Discriminant Function Analysis

Discriminant function analysis or discriminant analysis refers to a set of statistical techniques for generating rules for classifying cases (subjects or objects) into groups defined a priori , on the basis of observed variable values for each case. The aim is to assess whether or not a set of variables distinguish or discriminate between two (or more) groups of cases.

In a two group situation, the most commonly used method of discriminant analysis is Fisher's linear discriminant function, in which a linear function of the variables giving maximal separation between the groups is determined. This results in a classification rule or allocation rule that may be used to assign a new case to one of the two groups. The sample of observations from which the discriminant function is derived is often known as the training set. A classification matrix is often used in the discriminant analysis for summarizing the results obtained from the derived classification rule and obtained by cross-tabulating observed against predicted group membership. In addition to finding a classification rule or discriminant model, it is important to assess its performance (prediction accuracy), by finding the error rate, that is the proportion of cases incorrectly classified.

Discriminant analysis makes several assumptions. Logistic regression is the alternative technique that can be used in place of discriminant analysis when data do not meet these assumptions.

Some Other Dependence Techniques

In the analysis of variance (ANOVA) we see whether a metric variable mean is significantly different with respect to a factor which is often the treatment grouping. Analysis of covariance (ANCOVA) is an extension of ANOVA, in which an additional independent metric variable of interest, the covariate, that can influence the metric dependent variable is brought into the analysis. It tries to examine whether a difference persists after “controlling” for the effect of the covariate. For instance, ANCOVA may be used to analyze if different blood pressure lowering drugs show a statistically significant difference in the extent of lowering of systolic or diastolic blood pressure, after adjusting for baseline blood pressure of the subject as covariate.

Multivariate analysis of variance (MANOVA) is an extension of univariate ANOVA. In ANOVA, we examine the relationship of one numerical dependent variable with the independent variable that determines the grouping. However, ANOVA cannot compare groups when more than one numerical dependent variable has to be considered. To account for multiple dependent variables, MANOVA bundles them together into a weighted linear combination or composite variable. These linear combinations are known variously as canonical variates, roots, eigenvalues, vectors, or discriminant functions, although the terms canonical variates or roots are more commonly used. Once the dependent variables combine into a canonical variate, MANOVA explores whether or not the independent variable group differs from the newly created group. In this way, MANOVA essentially tests whether or not the independent grouping variable explains a significant amount of variance in the canonical variate.

Log-linear analysis models count type of data and can be used to analyze contingency tables where more than two variables are included. It can look at three or more categorical variables at the same time to determine associations between them and also show just where these associations lie. The technique is so named because the logarithm of the expected value of a count variable is modeled as a linear function of parameters – the latter represent associations between pairs of variables and higher order interactions between more than two variables.

Probit analysis technique is most commonly employed in toxicological experiments in which sets of animals are subjected to known levels of a toxin, and a model is required to relate the proportion surviving at particular doses, to the dose. In this type of analysis, a special kind of data transformation of proportions, the probit transformation, is modeled as a linear function of the dose, or more commonly, the logarithm of the dose. Estimates of the parameters in the model are found by maximum likelihood estimation.

Closely related to the probit function (and probit analysis) are the logit function and logit analysis. The logit function is defined as

\[ \text{logit}(p) = \log_e \left( \frac{p}{1-p} \right) \]

Analogously to the probit model, we may assume that such a transformed quantity is related linearly to a set of predictors, resulting in the logit model, which is the basis in particular of logistic regression, the most prevalent form of regression analysis for categorical response data.

Box 1 provides examples of the application of dependence multivariate methods in dermatology from published literature.

Box 1: Examples of dependence multivariate analysis from the literature [image not reproduced]

Factor Analysis and Principal Components Analysis

Factor analysis refers to techniques that seek to reduce a large number of variables to a smaller number of composite factors or components, which are linearly related to the original variables. It is regarded as a data or variable reduction technique that attempts to partition a given set of variables into groups of maximally correlated variables. In the process, meaningful new constructs may be deciphered for the underlying latent variable that is being studied. The variables to which factor analysis is applied are metric in nature. Factor analysis is commonly used in psychometrics, social science studies, and market research.

The term exploratory factor analysis (EFA) is used if the researcher does not have any idea of how many dimensions of a latent variable that is being studied is represented by the manifest variables that have been measured. For instance, a researcher wants to study patient satisfaction with doctors (a latent variable since it is not easily measured) and to do so measures a number of items (e.g., how many times doctor visited in last 1 year, how many other patients referred to the doctor by each patient, how long patient is willing to wait at each clinic visit to see the doctor, and so on) using a suitably framed questionnaire. The researcher may then apply EFA to assess if the different manifest variables group together into a set of composite variables (factor) that may be logically interpreted as representing different dimensions of patient satisfaction.

On the other hand, confirmatory factor analysis (CFA) is used for verification when the researcher already has a priori idea of the different dimensions of a latent variable that may be represented in a set of measured variables. For instance, a researcher may have grouped items in a fatigue rating scale as those representing physical fatigue and mental fatigue, and now wants to verify if this grouping is acceptable, through CFA. It is noteworthy that EFA is used to identify the hypothetical constructs in a set of data, while CFA may be used to confirm the existence of these hypothetical constructs in a fresh set of data. However, CFA is less often used, is mathematically more demanding and has a strong similarity to techniques of structural equation modeling.

Principal components analysis (PCA) is a related technique that transforms the original variables into new variables that are uncorrelated and account for decreasing proportion of the variance in the data. The new variables are called principal components and are linear functions of the original variables. The first principal component, or first factor, is comprised of the best linear function of the original variables so as to maximize the amount of the total variance that can be explained. The second principal component is the best linear combination of variables for explaining the variance not accounted for by the first factor. There may be a third, fourth, fifth principal component and so on, each representing the best linear combination of variables not accounted for by the previous factors. The process continues until all the variance is accounted for, but in practice, is usually stopped after a small number of factors have been extracted. The eigenvalue is a measure of the extent of the variation explained by a particular principal component. A plot of eigenvalue versus component number, the scree plot [ Figure 1 ], is used to decide on the number of principal components to retain.

Figure 1: Scree plot, the signature plot of exploratory factor analysis and principal components analysis. The plot (named after scree, the debris that piles up at the bottom of a cliff) displays the eigenvalues associated with the extracted factors or components in descending order. This helps to visually assess which factors or components account for most of the variability in the data.

EFA explores the correlation matrix between a set of observed variables to determine whether the correlations arise from the relationship of these observed or manifest variables to a small number of underlying latent variables, called common factors. The regression coefficients of the observed variables on the common factors are known in this context as factor loadings. After the initial estimation phase, an attempt is generally made to simplify the often difficult task of interpreting the derived factors using a process known as factor rotation. The aim is to produce a solution having what is known as simple structure, where each common factor affects only a small number of the observed variables.
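As a hedged sketch of what an EFA with rotation might look like in R (again using the hypothetical `ratings` data frame; the choice of two factors is arbitrary):

```r
# Exploratory factor analysis with a varimax rotation, aiming for simple structure.
efa_fit <- factanal(ratings, factors = 2, rotation = "varimax")
print(efa_fit$loadings, cutoff = 0.3)   # factor loadings; small loadings suppressed
```

Items that load strongly on only one factor are the easiest to interpret as belonging to a single dimension of the latent variable.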

EFA and PCA may appear to be synonymous techniques, and are often implemented together by statistical software packages, but they are not identical. Although they may give identical information, PCA may be regarded as a more basic version of EFA and was developed earlier (it was first proposed in 1901 by Karl Pearson). Factor analysis differs from PCA in technical respects. It is now generally recommended that factor analysis be used when some theoretical ideas about relationships between the variables exist, whereas PCA should be used if the goal is to explore patterns in the data.

Cluster Analysis

Cluster analysis refers to a set of statistical techniques for informative classification of an initially unclassified set of cases (subjects or objects) using observed variable values for each case. The aim is to identify relatively homogeneous groups (clusters) without prior information about the groups or group membership for any of the cases. It is also called classification analysis or numerical taxonomy. Unlike discriminant function analysis, which works on the basis of decided groups, cluster analysis tries to identify unknown groups.

Cluster analysis involves formulating a problem, selecting a distance measure, selecting a clustering procedure, deciding the number of clusters, interpreting the cluster profiles and finally, assessing the validity of clustering. The variables on which the cluster analysis is to be done have to be selected carefully on the basis of a hypothesis, past experience and judgment of the researcher. An appropriate measure of distance or similarity between the clusters needs to be selected – the most commonly used one is the Euclidean distance or its square.

The clustering procedure in cluster analysis may be hierarchical, nonhierarchical, or a two-step one. Hierarchical clustering methods do not require preset knowledge of the number of groups and can be agglomerative or divisive in nature. In either case, the results are best described using a diagram with tree-like structure, called tree diagram or dendrogram [ Figure 2 ]. The objects are represented as nodes in the dendrogram, and the branches illustrate how they relate as subgroups or clusters. The length of the branch indicates the distance between the subgroups when they are joined. The nonhierarchical methods in cluster analysis are frequently referred to as K means clustering. The choice of clustering procedure and the choice of distance measure are interrelated. The relative sizes of clusters in cluster analysis should be meaningful.
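A minimal sketch of both approaches in R, assuming a hypothetical matrix `measurements` with cases in rows and variables in columns:

```r
d  <- dist(scale(measurements), method = "euclidean")  # distance measure between cases
hc <- hclust(d, method = "ward.D2")                    # agglomerative hierarchical clustering
plot(hc)                                               # dendrogram: branch length = joining distance
groups_hier <- cutree(hc, k = 3)                       # cut the tree into, say, three clusters

km <- kmeans(scale(measurements), centers = 3)         # nonhierarchical (K-means) clustering
table(groups_hier, km$cluster)                         # compare the two solutions
```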

Figure 2: Appearance of dendrograms from cluster analysis. Note that both dendrograms A and B appear to identify three clusters, but the distances are much closer in dendrogram B and therefore it has to be interpreted with great caution.

Cluster analysis has been criticized because a software package will always produce some clustering whether or not such clusters exist in reality. There are many options one may select when doing a cluster analysis using a statistical package. Thus, a researcher may mine the data trying different methods of linking groups until the researcher “discovers” the structure that he or she originally believed was contained in the data. Therefore, the output has to be interpreted with great caution.

Box 2 provides examples of the application of interdependence multivariate methods in dermatology from published literature.

Box 2: Examples of interdependence multivariate analysis from published literature


Sample size in Multivariate Analysis

We will bring this overview of multivariate methods to a close with the vexing question of what constitutes an appropriate sample size for multivariate analysis. Unfortunately, there is no ready answer. It is important to remember that most multivariate analyses are essentially large-sample methods, and applying these techniques to small datasets may yield unreliable results. There are no easy procedures or algorithms to calculate sample sizes a priori for the majority of these techniques.

For regression analyses, a general rule is that 10 events of the outcome of interest are required for each variable in the model, including the exposures of interest (Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996;49:1373-9). Even this rule may be inadequate in the presence of categorical covariates. Therefore, even when the sample size seems sufficient, model quality checks should be done wherever possible to ensure that the results are reliable enough to base clinical judgments upon.
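As a back-of-the-envelope illustration of this rule of thumb (the numbers below are hypothetical):

```r
n_predictors  <- 6                      # exposures of interest plus other covariates
events_needed <- 10 * n_predictors      # at least 60 outcome events by the 10-per-variable rule
events_needed
ceiling(events_needed / 0.20)           # if ~20% of subjects experience the event: ~300 subjects
```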


Further Reading

9   Multivariate methods for heterogeneous data


In Chapter 7 , we saw how to summarize rectangular matrices whose columns were continuous variables. The maps we made used unsupervised dimensionality reduction techniques such as principal component analysis aimed at isolating the most important signal component in a matrix \(X\) when all the columns have meaningful variances.

Here we extend these ideas to more complex heterogeneous data where continuous and categorical variables are combined. Indeed, sometimes our observations cannot be easily described by sets of individual variables or coordinates – but it is possible to determine distances or (dis)similarities between them, or to describe relationships between them using a graph or a tree. Examples include species in a species tree or biological sequences. Outside of biology, examples include text documents or movie files, where we may have a reasonable method to determine (dis)similarity between them, but no obvious variables or coordinates.

This chapter contains more advanced techniques, for which we often omit technical details. Having come this far, we hope that hands-on experience with examples, together with extensive references, will enable you to understand and use some of the more 'cutting edge' techniques in nonlinear multivariate analysis.

9.1 Goals for this chapter

In this chapter, we will:

Extend linear dimension reduction methods to cases when the distances between observations are available, known as multidimensional scaling (MDS) or principal coordinates analysis.

Find modifications of MDS that are nonlinear and robust to outliers.

Encode combinations of categorical data and continuous data as well as so-called 'supplementary' information. We will see that this enables us to deal with batch effects.

Use chi-square distances and correspondence analysis (CA) to see where categorical data (contingency tables) contain notable dependencies.

Generalize clustering methods that can uncover latent variables that are not categorical. This will allow us to detect gradients, “pseudotime” and hidden nonlinear effects in our data.

Generalize the notion of variance and covariance to the study of tables of data from multiple different data domains.

9.2 Multidimensional scaling and ordination

Sometimes, data are not represented as points in a feature space. This can occur when we are provided with (dis)similarity matrices between objects such as drugs, images, trees or other complex objects, which have no obvious coordinates in \({\mathbb R}^n\) .

In Chapter 5 we saw how to produce clusters from distances. Here our goal is to visualize the data in maps in low dimensional spaces (e.g., planes) reminiscent of the ones we make from the first few principal axes in PCA.

We start with an intuitive example using geography data. In Figure  9.1 , a heatmap and clustering of the distances between cities and places in Ukraine 1 are shown.

1  The provenance of these data are described in the script ukraine-dists.R in the data folder.


Besides ukraine_dists , which contains the pairwise distances, the RData file that we loaded above also contains the dataframe ukraine_coords with the longitudes and latitudes; we will use this later as a ground truth. Given the distances, multidimensional scaling (MDS) provides a “map” of their relative locations. It will not be possible to arrange the cities such that their Euclidean distances on a 2D plane exactly reproduce the given distance matrix: the cities lie on the curved surface of the Earth rather than in a plane. Nevertheless, we can expect to find a two dimensional embedding that represents the data well. With biological data, our 2D embeddings are likely to be much less clearcut. We call the function with:
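A sketch of what such a call could look like (the exact code in the source differs); `eig = TRUE` keeps the eigenvalues for the screeplot below:

```r
mds_ukr <- cmdscale(ukraine_dists, k = 2, eig = TRUE)
```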

We make a function that we will reuse several times in this chapter to make a screeplot from the result of a call to the cmdscale function:
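A possible version of such a helper, assuming ggplot2 is available (the function name `plotscree` and its details are our own sketch):

```r
library(ggplot2)
plotscree <- function(mds_result, m = 9) {
  # plot the first m eigenvalues returned by cmdscale(..., eig = TRUE)
  df <- data.frame(k = seq_len(m), eigenvalue = mds_result$eig[seq_len(m)])
  ggplot(df, aes(x = k, y = eigenvalue)) + geom_col() +
    labs(x = "Component", y = "Eigenvalue")
}
plotscree(mds_ukr)
```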


Question 9.1 Look at all the eigenvalues output by the cmdscale function: what do you notice?

If you execute:
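for instance the following sketch, which inspects the eigenvalues stored in the `eig` component of the `mds_ukr` object from the earlier sketch,

```r
round(mds_ukr$eig, 1)
```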


you will note that, unlike in PCA, there are some negative eigenvalues. These arise because the distances are not exactly Euclidean (they were measured on the curved surface of the Earth), so the doubly centered matrix that cmdscale decomposes is not positive semi-definite.

The main output from the cmdscale function are the coordinates of the two-dimensional embedding, which we show in Figure  9.4 (we will discuss how the algorithm works in the next section).


Note that while relative positions are correct, the orientation of the map is unconventional: Crimea is at the top. This is a common phenomenon with methods that reconstruct planar embeddings from distances. Since the distances between the points are invariant under rotations and reflections (axis flips), any solution is as good as any other solution that relates to it via rotation or reflection. Functions like cmdscale will pick one of the equally optimal solutions, and the particular choice can depend on minute details of the data or the computing platform being used. Here, we can transform our result into a more conventional orientation by reversing the sign of the \(y\) -axis. We redraw the map in Figure  9.5 and compare this to the true longitudes and latitudes from the ukraine_coords dataframe ( Figure  9.6 ).


Question 9.2 We drew the longitudes and latitudes in the right panel of Figure  9.6 without attention to aspect ratio. What is the right aspect ratio for this plot?

There is no simple relationship between the distances that correspond to a 1 degree change in longitude and a 1 degree change in latitude, so the choice is difficult to make. Even under the simplifying assumption that the Earth is spherical with a radius of 6371 km, it's complicated: one degree of latitude always corresponds to a distance of about 111 km (\(6371\times2\pi/360\)), as does one degree of longitude at the equator. However, when you move away from the equator, a degree of longitude corresponds to shorter and shorter distances (and to no distance at all at the poles). Pragmatically, for displays such as Figure  9.6, we could choose a value for the aspect ratio that lies somewhere between those of the northernmost and southernmost points, say, the cosine of 48 degrees.
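A small numerical sketch of that suggestion (the use of `coord_fixed` is just one way to apply it in a ggplot2 display):

```r
cos(48 * pi / 180)   # ~0.67: km covered by one degree of longitude, relative to one degree of latitude
# e.g. ggplot(...) + coord_fixed(ratio = 1 / cos(48 * pi / 180))  # with x = longitude, y = latitude
```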

Question 9.3 Add international borders and geographic features such as rivers to Figure  9.6 .

A start point is provided by the code below, which adds international borders ( Figure  9.7 ).
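The original code is not reproduced here; a hedged sketch using the rnaturalearth and ggplot2 packages (and assuming `ukraine_coords` has columns `lon`, `lat` and `city`, which are guesses) might look like this:

```r
library(ggplot2)
library(rnaturalearth)   # provides country polygons as sf objects
ukraine_border <- ne_countries(scale = "medium", country = "Ukraine", returnclass = "sf")
ggplot() +
  geom_sf(data = ukraine_border, fill = NA) +
  geom_point(aes(x = lon, y = lat), data = ukraine_coords) +
  geom_text(aes(x = lon, y = lat, label = city), data = ukraine_coords, vjust = -0.7)
```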


Note: MDS creates output similar to that of PCA; however, there is only one 'dimension' to the data (the sample points). There is no 'dual' dimension, there are no biplots and no loading vectors. This is a drawback when it comes to interpreting the maps. Interpretation can be facilitated by examining carefully the extreme points and their differences.

9.2.1 How does the method work?

Let’s take a look at what would happen if we really started with points whose coordinates were known 2 . We put these coordinates into the two columns of a matrix with as many rows as there are points. Now we compute the distances between points based on these coordinates. To go from the coordinates \(X\) to distances, we write \[d^2_{i,j} = (x_i^1 - x_j^1)^2 + \dots + (x_i^p - x_j^p)^2.\] We will call the matrix of squared distances DdotD in R and \(D\bullet D\) in the text . We want to find points such that the square of their distances is as close as possible to the \(D\bullet D\) observed. 3

2  Here we commit a slight 'abuse' by using the longitude and latitude of our cities as Cartesian coordinates and ignoring the curvature of the earth's surface. Check out the internet for information on the Haversine formula.

3  Here we commit a slight ‘abuse’ by using the longitudes and latitudes of our cities as Cartesian coordinates and ignoring the fact that they are curvilinear coordinates on a sphere-like surface.

The relative distances do not depend on the point of origin of the data. We center the data by using the centering matrix \(H\) defined as \(H=I-\frac{1}{n}{\mathbf{11}}^t\) . Let’s check the centering property of \(H\) using:
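A small sketch of that check: applying \(H\) removes the column means.

```r
n <- 4
H <- diag(n) - (1 / n) * matrix(1, n, n)   # centering matrix H = I - (1/n) 11'
x <- matrix(rnorm(n * 2), n, 2)            # arbitrary coordinates for n points in 2D
colMeans(H %*% x)                          # essentially zero: the columns are centered
```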

Question 9.4 Call B0 the matrix obtained by applying the centering matrix both to the right and to the left of DdotD. Consider the points centered at the origin given by the \(HX\) matrix and compute its cross product; we'll call this B2. What do you have to do to B0 to make it equal to B2?

Therefore, given the squared distances between rows ( \(D\bullet D\) ) and the cross product of the centered matrix \(B=(HX)(HX)^t\) , we have shown:

\[ -\frac{1}{2} H(D\bullet D) H=B \tag{9.1}\]

This is always true, and we use it to reverse-engineer an \(X\) which satisfies Equation  9.1 when we are given \(D\bullet D\) to start with.

From \(D\bullet D\) to \(X\) using singular vectors.

We can go backwards from a matrix \(D\bullet D\) to \(X\) by taking the eigen-decomposition of \(B\) as defined in Equation  9.1 . This also enables us to choose how many coordinates, or columns, we want for the \(X\) matrix. This is very similar to how PCA provides the best rank \(r\) approximation. Note : As in PCA, we can write this using the singular value decomposition of \(HX\) (or the eigen decomposition of \(HX(HX)^t\) ):

\[ HX^{(r)} = US^{(r)}V^t \mbox{ with } S^{(r)} \mbox{ the diagonal matrix of the first } r \mbox{ singular values}, \]

This provides the best approximate representation in a Euclidean space of dimension \(r\). The algorithm gives us the coordinates of points that have approximately the same distances as those provided by the \(D\) matrix.

Classical MDS Algorithm.

In summary, given an \(n \times n\) matrix of squared interpoint distances \(D\bullet D\) , we can find points and their coordinates \(\tilde{X}\) by the following operations:

Double center the interpoint distance squared and multiply it by \(-\frac{1}{2}\) : \(B = -\frac{1}{2}H D\bullet D H\) .

Diagonalize \(B\) : \(\quad B = U \Lambda U^t\) .

Extract \(\tilde{X}\) : \(\quad \tilde{X} = U \Lambda^{1/2}\) .
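A from-scratch sketch of these three steps (for illustration only; applied to the squared distances it should agree with cmdscale up to sign flips of the axes):

```r
classical_mds <- function(DdotD, k = 2) {
  n <- nrow(DdotD)
  H <- diag(n) - (1 / n) * matrix(1, n, n)          # centering matrix
  B <- -0.5 * H %*% DdotD %*% H                     # step 1: double centering
  e <- eigen(B, symmetric = TRUE)                   # step 2: diagonalize B
  e$vectors[, 1:k, drop = FALSE] %*%                # step 3: first k eigenvectors ...
    diag(sqrt(pmax(e$values[1:k], 0)), nrow = k)    # ... scaled by sqrt(eigenvalues)
}
# e.g. classical_mds(as.matrix(ukraine_dists)^2, k = 2)
```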

Finding the right underlying dimensionality.

As an example, let's take objects for which we have similarities (surrogates for distances) but for which there is no natural underlying Euclidean space.

In a psychology experiment from the 1950s, Ekman ( 1954 ) asked 31 subjects to rank the similarities of 14 different colors. His goal was to understand the underlying dimensionality of color perception. The similarity or confusion matrix was scaled to have values between 0 and 1. The colors that were often confused had similarities close to 1. We transform the data into a dissimilarity by subtracting the values from 1:

We compute the MDS coordinates and eigenvalues. We combine the eigenvalues in the screeplot shown in Figure  9.8 :
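A sketch of these two steps, assuming `ekman` holds the 14 x 14 similarity matrix and reusing the `plotscree` helper sketched earlier:

```r
ekman_d   <- as.dist(1 - as.matrix(ekman))       # similarities -> dissimilarities
ekman_mds <- cmdscale(ekman_d, k = 2, eig = TRUE)
plotscree(ekman_mds)                             # eigenvalues, as in Figure 9.8
plot(ekman_mds$points, xlab = "MDS1", ylab = "MDS2")
text(ekman_mds$points, labels = labels(ekman_d), pos = 3)
```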


We plot the different colors using the first two principal coordinates as follows:


Figure  9.9 shows the Ekman data in the new coordinates. There is a striking pattern that calls for explanation. This horseshoe or arch structure in the points is often an indicator of a sequential latent ordering or gradient in the data ( Diaconis, Goel, and Holmes 2008 ) . We will revisit this in Section 9.5 .

9.2.2 Robust versions of MDS

Multidimensional scaling aims to minimize the difference between the squared distances as given by \(D\bullet D\) and the squared distances between the points with their new coordinates. Unfortunately, this objective tends to be sensitive to outliers: one single data point with large distances to everyone else can dominate, and thus skew, the whole analysis. Often, we like to use something that is more robust, and one way to achieve this is to disregard the actual values of the distances and only ask that the relative rankings of the original and the new distances are as similar as possible. Such a rank based approach is robust: its sensitivity to outliers is reduced.

We will use the Ekman data to show how useful robust methods are when we are not quite sure about the ‘scale’ of our measurements. Robust ordination, called non metric multidimensional scaling (NMDS for short) only attempts to embed the points in a new space such that the order of the reconstructed distances in the new map is the same as the ordering of the original distance matrix.

Non metric MDS looks for a transformation \(f\) of the given dissimilarities in the matrix \(d\) and a set of coordinates in a low dimensional space ( the map ) such that the distance in this new map is \(\tilde{d}\) and \(f(d)\thickapprox \tilde{d}\) . The quality of the approximation can be measured by the standardized residual sum of squares ( stress ) function:

\[ \text{stress}^2=\frac{\sum(f(d)-\tilde{d})^2}{\sum d^2}. \]

NMDS is not sequential in the sense that we have to specify the underlying dimensionality at the outset and the optimization is run to maximize the reconstruction of the distances according to that number. There is no notion of percentage of variation explained by individual axes as provided in PCA. However, we can make a simili-screeplot by running the program for all the successive values of \(k\) ( \(k=1, 2, 3, ...\) ) and looking at how well the stress drops. Here is an example of looking at these successive approximations and their goodness of fit. As in the case of diagnostics for clustering, we will take the number of axes after the stress has a steep drop.

Because each calculation of a NMDS result requires a new optimization that is both random and dependent on the \(k\) value, we use a similar procedure to what we did for clustering in Chapter 4 . We execute the metaMDS function, say, 100 times for each of the four possible values of \(k\) and record the stress values.
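A sketch of that procedure with vegan (reusing `ekman_d`; 100 repetitions per value of \(k\) can take a little while):

```r
library(vegan)
stress_df <- expand.grid(k = 1:4, rep = 1:100)
stress_df$stress <- mapply(function(k, rep) {
  metaMDS(ekman_d, k = k, autotransform = FALSE, trace = FALSE)$stress
}, stress_df$k, stress_df$rep)
boxplot(stress ~ k, data = stress_df,
        xlab = "Number of dimensions k", ylab = "Stress")   # cf. Figure 9.10
```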

Let’s look at the boxplots of the results. This can be a useful diagnostic plot for choosing \(k\) ( Figure  9.10 ).


We can also compare the distances and their approximations using what is known as a Shepard plot for \(k=2\) for instance, computed with:
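One way to obtain such a plot is vegan's stressplot function applied to a \(k = 2\) fit (a sketch; the book's exact code may differ):

```r
nmds_k2 <- metaMDS(ekman_d, k = 2, autotransform = FALSE, trace = FALSE)
stressplot(nmds_k2)   # observed dissimilarities versus fitted distances (Shepard diagram)
```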


Both the Shepard’s plot in Figure  9.11 and the screeplot in Figure  9.10 point to a two-dimensional solution for Ekman’s color confusion study. Let us compare the output of the two different MDS programs, the classical metric least squares approximation and the nonmetric rank approximation method. The right panel of Figure  9.12 shows the result from the nonmetric rank approximation, the left panel is the same as Figure  9.9 . The projections are almost identical in both cases. For these data, it makes little difference whether we use a Euclidean or nonmetric multidimensional scaling method.


9.3 Contiguous or supplementary information


In Chapter 3 we introduced the R data.frame class that enables us to combine heterogeneous data types: categorical factors, text and continuous values. Each row of a dataframe corresponds to an object, or a record, and the columns are the different variables, or features.

Extra information about sample batches, dates of measurement, or different protocols is often called metadata; this can be a misnomer if it is taken to imply that metadata are somehow less important. Such information is real data that needs to be integrated into the analyses. We typically store it in a data.frame or a similar R class and tightly link it to the primary assay data.

9.3.1 Known batches in data

Here we show an example of an analysis that was done by Holmes et al. ( 2011 ) on bacterial abundance data from Phylochip ( Brodie et al. 2006 ) microarrays. The experiment was designed to detect differences between a group of healthy rats and a group who had Irritable Bowel Disease ( Nelson et al. 2010 ) . This example shows a case where the nuisance batch effects become apparent in the analysis of experimental data. It is an illustration of the fact that best practices in data analyses are sequential and that it is better to analyse data as they are collected to adjust for severe problems in the experimental design as they occur , instead of having to deal with them post mortem 4 .

4  Fisher’s terminology, see Chapter 13 .

When data collection started on this project, days 1 and 2 were delivered and we made the plot that appears in Figure  9.14. This showed a definite day effect. When investigating the source of this effect, we found that both the protocol and the array were different on days 1 and 2. This leads to uncertainty in the source of variation; we call this confounding of effects.

We load the data and the packages we use for this section:

Question 9.5 What class is the IBDchip ? Look at the last row of the matrix, what do you notice?

The data are normalized abundance measurements of 8634 taxa measured on 28 samples. We use a rank-threshold transformation, giving the top 3000 most abundant taxa scores from 3000 to 1, and letting the remaining (low abundant) ones all have a score of 1. We also separate out the proper assay data from the (awkwardly placed) day variable, which should be considered a factor 5 :

5  Below, we show how to arrange these data into a Bioconductor SummarizedExperiment , which is a much more sane way of storing such data.

Instead of using the continuous, normalized data directly, we perform a more robust analysis by replacing the values with their ranks. The lower values are treated as ties, encoded with a threshold chosen to reflect the number of taxa expected to actually be present:
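A sketch of such a transformation (the function name and the exact handling of the ties are our own; the text above only fixes the general idea):

```r
rank_thresh <- function(assay_matrix, n_keep = 3000) {
  ranked <- apply(assay_matrix, 2, rank)        # rank taxa within each sample
  cutoff <- nrow(assay_matrix) - n_keep
  ranked[ranked < cutoff] <- cutoff             # tie all the low-abundance taxa together
  ranked - cutoff + 1                           # rescale so the ties sit at 1
}
assayIBD_ranks <- rank_thresh(assayIBD)
```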


Question 9.6 Why do we use a threshold for the ranks?

Low abundances, at noise level, occur for species that are not really present; these account for more than half of the taxa. A large jump in rank for these observations could easily occur without any meaningful reason. Thus we create a large number of ties at low abundance.

Figure  9.14 shows that the samples arrange themselves naturally into two different groups according to the day on which they were processed. After discovering this effect, we delved into the differences that could explain these distinct clusters. There were two different protocols used (protocol 1 on day 1, protocol 2 on day 2) and, unfortunately, two different provenances for the arrays used on those two days (array 1 on day 1, array 2 on day 2).

A third set of data, of four samples, had to be collected to deconvolve the confounding effect. On day 3, the arrays from day 2 were used with the protocol from day 1. Figure  9.15 shows the new PCA plot with all the samples, created by the following:
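The original plotting code is not shown here; a hedged sketch of a PCA colored by day, using ade4 and the rank-transformed matrix from above, could be:

```r
library(ade4)
pca_ibd <- dudi.pca(as.data.frame(t(assayIBD_ranks)), scannf = FALSE, nf = 2)  # samples as rows
s.class(pca_ibd$li, fac = factor(day), col = c("green4", "brown", "purple"))   # group samples by day
```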


Question 9.7 In which situation would it be preferable to make confidence ellipses around the group means using the following code?


Through this visualization we were able to uncover a flaw in the original experimental design. The first two batches, shown in green and brown, were both balanced with regard to IBS and healthy rats. They nevertheless show very different levels of variability and overall multivariate coordinates. In fact, two effects are confounded: both the arrays and the protocols were different on those two days. We had to run a third batch of experiments on day 3, represented in purple; this used the protocol from day 1 and the arrays from day 2. The third group faithfully overlaps with batch 1, telling us that the change in protocol was responsible for the variability.

9.3.2 Removing batch effects

Through the combination of the continuous measurements from assayIBD and the supplementary batch number as a factor, the PCA map has provided an invaluable investigation tool. This is a good example of the use of supplementary points 6 . The mean-barycenter points are created by using the group-means of points in each of the three groups and serve as extra markers on the plot.

6  This is called a supplementary point because the new observation-point is not used in the matrix decomposition.

We can decide to re-align the three groups by subtracting the group means so that all the batches are centered on the origin. A slightly more effective way is to use the ComBat function available in the sva package. This function uses a similar, but slightly more sophisticated method (Empirical Bayes mixture approach ( Leek et al. 2010 ) ). We can see its effect on the data by redoing our robust PCA (see the result in Figure  9.17 ):
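A sketch of the batch correction step (the `day` factor is the batch variable; plotting as above):

```r
library(sva)
corrected <- ComBat(dat = assayIBD_ranks, batch = day)       # empirical Bayes batch adjustment
pca_corr  <- dudi.pca(as.data.frame(t(corrected)), scannf = FALSE, nf = 2)
s.class(pca_corr$li, fac = factor(day), col = c("green4", "brown", "purple"))
```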


9.3.3 Hybrid data and Bioconductor containers

A more rational way of combining the batch and treatment information into compartments of a composite object is to use the SummarizedExperiment class. It includes special slots for the assay(s) where rows represent features of interest (e.g., genes, transcripts, exons, etc.) and columns represent samples. Supplementary information about the features can be stored in a DataFrame object, accessible using the function rowData . Each row of the DataFrame provides information on the feature in the corresponding row of the SummarizedExperiment object.

Here we insert the two covariates day and treatment in the colData object and combine it with assay data in a new SummarizedExperiment object.
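A sketch of that construction (here `treatment` is assumed to be a factor of IBS/healthy labels aligned with the columns of `assayIBD`):

```r
library(SummarizedExperiment)
se_ibd <- SummarizedExperiment(
  assays  = list(abundance = assayIBD),                  # features in rows, samples in columns
  colData = DataFrame(day = day, treatment = treatment)  # sample-level covariates
)
se_ibd
```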

This is the best way to keep all the relevant data together; it will also enable you to quickly filter the data while keeping all the information aligned properly.

Question 9.8 Make a new SummarizedExperiment object by choosing the subset of the samples that were created on day 2.

Columns of the DataFrame represent different attributes of the features of interest, e.g., gene or transcript IDs, etc. Here is an example of a hybrid data container from single-cell experiments (see the Bioconductor workflow in Perraudeau et al. ( 2017 ) for more details).

After the pre-processing and normalization steps prescribed in the workflow, we retain the 1000 most variable genes measured on 747 cells.

Question 9.9 How many different batches do the cells belong to ?

We can look at a PCA of the normalized values and check graphically that the batch effect has been removed:


Since the screeplot in Figure  9.18 shows us that we must not dissociate axes 2 and 3, we will make a three dimensional plot with the rgl package. We use the following interactive code:
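A sketch of such an interactive call, assuming a matrix `pca_scores` of component scores and a vector `batch_colors` of per-cell colors (both names are placeholders):

```r
library(rgl)
plot3d(pca_scores[, 1:3], col = batch_colors, size = 5,
       xlab = "PC1", ylab = "PC2", zlab = "PC3")   # rotate the cloud interactively
```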


Note: Of course, the book medium is limiting here, as we are showing two static projections that do not do justice to the depth available when looking at the interactive, dynamic plots produced by the plot3d function. We encourage the reader to experiment extensively with these and other interactive packages, as they provide a much more intuitive experience of the data.

9.4 Correspondence analysis for contingency tables

9.4.1 Cross-tabulation and contingency tables

Categorical data abound in biological settings: sequence status (CpG/non-CpG), phenotypes, taxa are often coded as factors as we saw in Chapter 2. Cross-tabulation of two such variables gives us a contingency table ; the result of counting the co-occurrence of two phenotypes (sex and colorblindness was such an example). We saw that the first step is to look at the independence of the two categorical variables; the standard statistical measure of independence uses the chisquare distance . This quantity will replace the variance we used for continuous measurements.

The columns and rows of the table have the same 'status' and we are not in a supervised/regression-type setting. There is no sample/variable divide; as a consequence, the rows and columns have the same status and we will 'center' both the rows and the columns. This symmetry will also translate into our use of biplots, where both dimensions appear on the same plot.

Table 9.1: Sample by mutation matrix.
Patient Mut1 Mut2 Mut3 ...
AHX112 0 0 0
AHX717 1 0 1
AHX543 1 0 0

Transforming the data to tabular form.

If the data are collected as long lists with each subject (or sample) associated to its levels of the categorical variables, we may want to transform them into a contingency table. Here is an example. In Table  9.1 HIV mutations are tabulated as indicator (0/1) binary variables. These data are then transformed into a mutation co-occurrence matrix shown in Table  9.2 .

Table 9.2: Cross-tabulation of the HIV mutations showing two-way co-occurrences.
Mutation Mut1 Mut2 Mut3 ...
Mut1 853 29 10
Mut2 29 853 52
Mut3 10 52 853

Question 9.10 What information is lost in this cross-tabulation ? When will this matter?

Here are some co-occurrence data from the HIV database ( Rhee et al. 2003 ) . Some of these mutations have a tendency to co-occur.

Question 9.11 Test the hypothesis of independence of the mutations.

Before explaining the details of how correspondence analysis works, let’s look at the output of one of many correspondence analysis functions. We use dudi.coa from the ade4 package to plot the mutations in a lower dimensional projection; the procedure follows what we did for PCA.
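A sketch of that call, assuming the mutation co-occurrence table is stored in a data frame called `hiv_cooc` (a placeholder name):

```r
library(ade4)
coa_hiv <- dudi.coa(as.data.frame(hiv_cooc), scannf = FALSE, nf = 3)
barplot(coa_hiv$eig, main = "Screeplot")   # how many dimensions carry the dependencies?
scatter(coa_hiv)                           # default two-dimensional display of the CA
```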


After looking at a screeplot, we see that the underlying variation is essentially three-dimensional, so we plot these three dimensions. Ideally this would be done with an interactive three-dimensional plotting function such as that provided by the rgl package, as shown in Figure  9.21.

Question 9.12 Using the car and rgl packages, make a 3D scatterplot similar to Figure  9.21. Compare it to the plot obtained using aspect=FALSE with the plot3d function from rgl. What structure do you notice when rotating the cloud of points?


Question 9.13 Show the code for plotting the plane defined by axes 1 and 3 of the correspondence analysis respecting the scaling of the vertical axis as shown in the bottom figure of Figure  9.22 .

This first example showed how to map all the different levels of one categorical variable (the mutations) in a similar way to how PCA projects continuous variables. We will now explore how this can be extended to two or more categorical variables.

9.4.2 Hair color, eye color and phenotype co-occurrence

We will consider a small table, so we can follow the analysis in detail. The data are a contingency table of hair-color and eye-color phenotypic co-occurrence from students as shown in Table  9.3 . In Chapter 2 , we used a \(\chi^2\) test to detect possible dependencies:

Table 9.3: Cross tabulation of students hair and eye color.
Brown Blue Hazel Green
Black 36 9 5 2
Brown 66 34 29 14
Red 16 7 7 7
Blond 4 64 5 8

However, stating non independence between hair and eye color is not enough. We need a more detailed explanation of where the dependencies occur: which hair color occurs more often with green eyes ? Are some of the variable levels independent? In fact we can study the departure from independence using a special weighted version of SVD. This method can be understood as a simple extension of PCA and MDS to contingency tables.

Independence: computationally and visually.

We start by computing the row and column sums; we use these to build the table that would be expected if the two phenotypes were independent. We call this expected table HCexp .

Now we compute the \(\chi^2\) (chi-squared) statistic, which is the sum of the scaled residuals for each of the cells of the table:

We can study these residuals from the expected table, first numerically then in Figure  9.23 .
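Pulling these steps together, a sketch in R using the counts of Table 9.3 (the object name `HC` is our own):

```r
HC <- matrix(c(36,  9,  5,  2,
               66, 34, 29, 14,
               16,  7,  7,  7,
                4, 64,  5,  8),
             nrow = 4, byrow = TRUE,
             dimnames = list(Hair = c("Black", "Brown", "Red", "Blond"),
                             Eye  = c("Brown", "Blue", "Hazel", "Green")))
n     <- sum(HC)
HCexp <- outer(rowSums(HC), colSums(HC)) / n      # expected table under independence
sum((HC - HCexp)^2 / HCexp)                       # the chi-squared statistic
(HC - HCexp) / sqrt(HCexp)                        # scaled residuals, cell by cell
chisq.test(HC)$statistic                          # the built-in test gives the same value
```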


Mathematical Formulation.

Here are the computations we just did in R, in a more mathematical form. Consider a general contingency table \({\mathbf N}\) with \(I\) rows, \(J\) columns and a total sample size of \(n=\sum_{i=1}^I \sum_{j=1}^J n_{ij}= n_{\cdot \cdot}\). If the two categorical variables were independent, each cell frequency would be approximately equal to

\[ n_{ij} = \frac{n_{i \cdot}}{n} \frac{n_{\cdot j}}{n} \times n. \]

This can also be written as:

\[ {\mathbf N} = n\,{\mathbf c}\,{\mathbf r'}, \qquad \mbox{where } {\mathbf c}= \frac{1}{n} {\mathbf N} {\mathbb 1}_J \mbox{ (the relative row sums)} \;\mbox{ and }\; {\mathbf r}=\frac{1}{n} {\mathbf N}' {\mathbb 1}_I \mbox{ (the relative column sums).} \]

The departure from independence is measured by the \(\chi^2\) statistic

\[ {\cal X}^2=\sum_{i,j} \frac{\left(n_{ij}-\frac{n_{i\cdot}}{n}\frac{n_{\cdot j}}{n}n\right)^2} {\frac{n_{i\cdot}n_{\cdot j}}{n^2}n} \]

Once we have ascertained that the two variables are not independent, we use a weighted multidimensional scaling using \(\chi^2\) distances to visualize the associations.

The method is called Correspondence Analysis (CA) or Dual Scaling and there are multiple R packages that implement it.

Here we make a simple biplot of the Hair and Eye colors.
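A sketch, reusing the `HC` matrix defined in the previous code sketch:

```r
coa_hc <- dudi.coa(as.data.frame(HC), scannf = FALSE, nf = 2)
scatter(coa_hc)   # rows (hair colors) and columns (eye colors) on the same map
```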


Question 9.14 What percentage of the Chisquare statistic is explained by the first two axes of the Correspondence Analysis?

Question 9.15 Compare the results with those obtained by using CCA in the vegan package with the appropriate value for the scaling parameter.

Interpreting the biplots

CA has a special barycentric property: the biplot scaling is chosen so that the row points are placed at the center of gravity of the column levels with their respective weights. For instance, the Blue eyes column point is at the center of gravity of the row points (Black, Brown, Red, Blond) with weights proportional to (9, 34, 7, 64). The Blond row point is very heavily weighted; this is why Figure  9.24 shows Blond and Blue quite close together.

9.5 Finding time…and other important gradients.

All the methods we have studied in the last sections are commonly known as ordination methods. In the same way clustering allowed us to detect and interpret a hidden factor/categorical variable, ordination enables us to detect and interpret a hidden ordering, gradient or latent variable in the data.


Ecologists have a long history of interpreting the arches formed by observations points in correspondence analysis and principal components as ecological gradients ( Prentice 1977 ) . Let’s illustrate this first with a very simple data set on which we perform a correspondence analysis.

We plot both the row-location points ( Figure  9.25 (a)) and the biplot of both location and plant species in the lower part of Figure  9.25 (b); this plot was made with:


Question 9.16 Looking back at the raw matrix lakes as it appears, do you see a pattern in its entries? What would happen if the plants had been ordered by actual taxa names for instance?

9.5.1 Dynamics of cell development

We will now analyse a more interesting data set that was published by Moignard et al. ( 2015 ) . This paper describes the dynamics of blood cell development. The data are single cell gene expression measurements of 3,934 cells with blood and endothelial potential from five populations from between embryonic days E7.0 and E8.25.


Remember from Chapter 4 that several different distances are available for comparing our cells. Here, we start by computing both an \(L_2\) distance and the \(\ell_1\) distance between the 3,934 cells.

The classical multidimensional scaling on these two distances matrices can be carried out using:
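A sketch of these computations, assuming the expression values sit in a matrix `cells` with one row per cell (a placeholder name):

```r
dist_l2 <- dist(cells, method = "euclidean")   # L2 distances between cells
dist_l1 <- dist(cells, method = "manhattan")   # l1 distances between cells
mds_l2  <- cmdscale(dist_l2, k = 2, eig = TRUE)
mds_l1  <- cmdscale(dist_l1, k = 2, eig = TRUE)
```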

We look at the underlying dimension and see in Figure  9.27 that two dimensions can provide a substantial fraction of the variance.


The first 2 coordinates account for 78 % of the variability when the \(\ell_1\) distance is used between cells, and 57% when the \(L^2\) distance is used. We see in Figure  9.28 (a) the first plane for the MDS on the \(\ell_1\) distances between cells:


Figure  9.28 (b) is created in the same way and shows the two-dimensional projection created by using MDS on the L2 distances.

Figure  9.28 shows that both distances (L1 and L2) give the same first plane for the MDS with very similar representations of the underlying gradient followed by the cells.

We can see from Figure  9.28 that the cells are not distributed uniformly in the lower dimensions we have been looking at; we see a definite organization of the points. All the cells of type 4SG, represented in red, form an elongated cluster that is much less mixed with the other cell types.

9.5.2 Local, nonlinear methods

Multidimensional scaling and nonmetric multidimensional scaling aim to represent all distances as precisely as possible, and the large distances between faraway points can skew the representations. When looking for gradients or low dimensional manifolds, it can be beneficial to restrict ourselves to approximating points that are close together. This calls for methods that try to represent local (small) distances well and do not try to approximate distances between faraway points with too much accuracy.

There has been substantial progress in such methods in recent years. The use of kernels computed using the calculated interpoint distances allows us to decrease the importance of points that are far apart. A radial basis kernel is of the form

\[ 1-\exp\left(-\frac{d(x,y)^2}{\sigma^2}\right), \quad\mbox{where } \sigma^2 \mbox{ is fixed.} \]

It has the effect of heavily discounting large distances. This can be very useful as the precision of interpoint distances is often better at smaller ranges; several examples of such methods are covered in Exercise  9.6 at the end of this chapter.

Question 9.17 Why do we take the difference between the 1 and the exponential? What happens when the distance between \(x\) and \(y\) is very big?

One widely used method, t-SNE (t-distributed stochastic neighbor embedding), adds flexibility to the kernel defined above and allows the \(\sigma^2\) parameter to vary locally (there is a normalization step so that it averages to one). The t-SNE method starts out from the positions of the points in the high dimensional space and derives a probability distribution on the set of pairs of points, such that the probabilities are proportional to the points' proximities or similarities. It then uses this distribution to construct a representation of the dataset in low dimensions. The method is not robust and has the property of separating clusters of points artificially; however, this property can also help clarify a complex situation. One can think of it as a method akin to graph (or network) layout algorithms. They stretch the data to clarify relations between the very close (in the network: connected) points, but the distances between more distal (in the network: unconnected) points cannot be interpreted as being on the same scale in different regions of the plot. In particular, these distances will depend on the local point densities. Here is an example of the output of t-SNE on the cell data:
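A sketch of such a run with the Rtsne package, reusing the \(\ell_1\) distances from above (the perplexity value is arbitrary):

```r
library(Rtsne)
set.seed(1)
tsne_cells <- Rtsne(as.matrix(dist_l1), is_distance = TRUE, dims = 3, perplexity = 30)
head(tsne_cells$Y)   # the new 3D coordinates, e.g. for rgl::plot3d(tsne_cells$Y, col = cell_colors)
```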


In this case in order to see the subtle differences between MDS and t-SNE, it is really necessary to use 3d plotting.

Use the rgl package to look at the three t-SNE dimensions and add the correct cell type colors to the display.

Two of these 3d snapshots are shown in Figure  9.30 , we see a much stronger grouping of the purple points than in the MDS plots.

Note: A site worth visiting in order to appreciate more about the sensitivity of the t-SNE method to the complexity and \(\sigma\) parameters can be found at http://distill.pub/2016/misread-tsne .


Question 9.18 Visualize a two-dimensional t-SNE embedding of the Ukraine distances from Section 9.2 .


There are several other nonlinear methods for estimating nonlinear trajectories followed by points in the relevant state spaces. Here are a few examples.

RDRToolbox – local linear embedding (LLE) and isomap.

diffusionMap – models connections between points as a Markovian kernel.

kernlab – kernel methods.

LPCM-package – local principal curves.

9.6 Multitable techniques

Current studies often attempt to quantify variation in the microbial, genomic, and metabolic measurements across different experimental conditions. As a result, it is common to perform multiple assays on the same biological samples and ask what features – microbes, genes, or metabolites, for example – are associated with different sample conditions. There are many ways to approach these questions. Which to apply depends on the study’s focus.

9.6.1 Co-variation, inertia, co-inertia and the RV coefficient

As in physics, we define inertia as a sum of distances with ‘weighted’ points. This enables us to compute the inertia of counts in a contingency table as the weighted sum of the squares of distances between observed and expected frequencies (as in the chisquare statistic).

If we want to study two standardized variables measured at the same 10 locations together, we use their covariance. If \(x\) represents the standardized pH and \(y\) the standardized humidity, we measure their covariation using the mean

\[ \text{cov}(x,y) = \text{mean}(x_1 y_1 + x_2 y_2 + x_3 y_3 + \cdots + x_{10} y_{10}). \tag{9.2}\]

If \(x\) and \(y\) co-vary in the same direction, this will be big. We saw how useful the correlation coefficient we defined in Chapter 8 was to our multivariate analyses. Multitable generalizations will be just as useful.

9.6.2 Mantel coefficient and a test of distance correlation

The Mantel coefficient, one of the earliest versions of a matrix association measure, developed and used by Henry Daniels, F.N. David and co-authors ( Josse and Holmes 2016 ), is very popular, especially in ecology.

Given two dissimilarity matrices \({\mathbf D^X}\) and \({\mathbf D^Y}\) , make these matrices into vectors the way the R dist function does, and compute their correlation. This is defined mathematically as:

\[ \begin{aligned} \mbox{r}_{\rm m}(\mathbf{X},\mathbf{Y})= \frac{\sum_{i=1}^n \sum_{j=1,j \neq i}^n (d_{ij}^{\mathbf X}- \bar{d}^{\mathbf X})(d_{ij}^{\mathbf Y}-\bar{d}^{\mathbf Y})}{\sqrt{\sum_{i,j,j \neq i}(d_{ij}^{\mathbf X}- \bar{d}^{\mathbf X})^2 \sum_{i,j,j \neq i}(d_{ij}^{\mathbf Y}-\bar{d}^{\mathbf Y})^2}}, \end{aligned} \]

with \(\bar{d}^{\mathbf X}\) (resp. \(\bar{d}^{\mathbf Y}\)) the mean of the lower diagonal terms of the dissimilarity matrix associated to \(\bf X\) (resp. to \(\bf Y\)). This formulation shows us that it is a measure of linear correlation between distances. It has been widely used for testing two sets of distances: for instance, one distance \({\mathbf D^X}\) could be computed using the soil chemistry at 17 different locations, while the other distance \({\mathbf D^Y}\) could record plant abundance dissimilarities, using a Jaccard index, between the same 17 locations.

The correlation’s significance is often assessed via a simple permutation test, see Josse and Holmes ( 2016 ) for a review with its historical background and modern incarnations. The coefficient and associated tests are implemented in several R packages such as ade4 ( Chessel, Dufour, and Thioulouse 2004 ) , vegan and ecodist ( Goslee, Urban, et al. 2007 ) .

RV coefficient. A global measure of similarity between two data tables, as opposed to two vectors, can be obtained by generalizing covariance to an inner product between tables; this gives the RV coefficient, a number between 0 and 1 that plays the role of a correlation coefficient for tables.

\[ RV(A,B)=\frac{Tr(A'B)}{\sqrt{Tr(A'A)}\sqrt{Tr(B'B)}} \]

There are several other measures of matrix correlation available in the package MatrixCorrelation .

If we do ascertain a link between two matrices, we then need to find a way to understand that link. One such method is explained in the next section.

9.6.3 Canonical correlation analysis (CCA)

CCA is a method similar in spirit to PCA; it was developed by Hotelling in the 1930s to search for associations between two sets of continuous variables \(X\) and \(Y\). Its goal is to find a linear projection of the first set of variables that maximally correlates with a linear projection of the second set of variables.

Finding correlated functions (covariates) of the two views of the same phenomenon by discarding the representation-specific details (noise) is expected to reveal the underlying hidden yet influential factors responsible for the correlation.

Canonical Correlation Algorithm.

Let us consider two matrices \(X\) and \(Y\) of order \(n \times p\) and \(n \times q\) respectively. The columns of \(X\) and \(Y\) correspond to variables and the rows correspond to the same \(n\) experimental units. The jth column of the matrix \(X\) is denoted by \(X_j\); likewise, the kth column of \(Y\) is denoted by \(Y_k\). Without loss of generality it will be assumed that the columns of \(X\) and \(Y\) are standardized (mean 0 and variance 1).

We denote by \(S_{XX}\) and \(S_{YY}\) the sample covariance matrices for variable sets \(X\) and \(Y\) respectively, and by \(S_{XY} = S_{YX}^t\) the sample cross-covariance matrix between \(X\) and \(Y\) .

Classical CCA assumes first \(p \leq n\) and \(q \leq n\) , then that matrices \(X\) and \(Y\) are of full column rank p and q respectively. In the following, the principle of CCA is presented as a problem solved through an iterative algorithm. The first stage of CCA consists in finding two vectors \(a =(a_1,...,a_p)^t\) and \(b =(b_1,...,b_q)^t\) that maximize the correlation between the linear combinations \(U\) and \(V\) defined as

\[ \begin{aligned} U=Xa&=a_1 X_1 +a_2 X_2 +\cdots+ a_p X_p\\ V=Yb&=b_1 Y_1 +b_2 Y_2 +\cdots+ b_q Y_q \end{aligned} \]

and assuming that the vectors \(a\) and \(b\) are normalized so that \(\text{var}(U) = \text{var}(V) = 1\). In other words, the problem consists in finding \(a\) and \(b\) such that

\[ \rho_1 = \text{cor}(U, V) = \max_{a,b} \text{cor} (Xa, Yb)\quad \text{subject to}\quad \text{var}(Xa)=\text{var}(Yb) = 1. \tag{9.3}\]

The resulting variables \(U\) and \(V\) are called the first canonical variates and \(\rho_1\) is referred to as the first canonical correlation.

Note: Higher order canonical variates and canonical correlations can be found as a stepwise problem. For \(s = 1,...,p\) , we can successively find positive correlations \(\rho_1 \geq \rho_2 \geq ... \geq \rho_p\) with corresponding vectors \((a^1, b^1), ..., (a^p, b^p)\) , by maximizing

\[ \rho_s = \text{cor}(U^s,V^s) = \max_{a^s,b^s} \text{cor} (Xa^s,Yb^s)\quad \text{subject to}\quad \text{var}(Xa^s) = \text{var}(Yb^s)=1 \tag{9.4}\]

under the additional restrictions

\[ \text{cor}(U^s,U^t) = \text{cor}(V^s, V^t)=0 \quad\text{for}\quad 1 \leq t < s \leq p. \tag{9.5}\]

We can think of CCA as a generalization of PCA where the variance we maximize is the `covariance’ between the two matrices (see Holmes ( 2006 ) for more details).

9.6.4 Sparse canonical correlation analysis (sCCA)

When the number of variables in each table is very large finding two very correlated vectors can be too easy and unstable: we have too many degrees of freedom.

It is then beneficial to add a penalty that keeps the number of non-zero coefficients to a minimum. This approach is called sparse canonical correlation analysis (sparse CCA or sCCA), a method well-suited both to exploratory comparisons between samples and to the identification of features with interesting covariation. We will use an implementation from the PMA package.

Here we study a dataset collected by Kashyap et al. ( 2013 ) with two tables. One is a contingency table of bacterial abundances and the other an abundance table of metabolites. There are 12 samples, so \(n = 12\). The metabolite table has measurements on \(p = 637\) features, and the bacterial abundance table had a total of \(q = 20{,}609\) OTUs, which we will filter down to around 200. We start by loading the data.

We first filter down to bacteria and metabolites of interest, removing (“by hand”) those that are zero across many samples and giving an upper threshold of 50 to the large values. We transform the data to weaken the heavy tails.

A second step in our preliminary analysis is to check whether there is any association between the two matrices, using the RV.test from the ade4 package:

We can now apply sparse CCA. This method compares sets of features across high-dimensional data tables, where there may be more measured features than samples. In the process, it chooses a subset of available features that capture the most covariance – these are the features that reflect signals present across multiple tables. We then apply PCA to this selected subset of features. In this sense, we use sparse CCA as a screening procedure, rather than as an ordination method.

The implementation is below. The parameters penaltyx and penaltyz are sparsity penalties. Smaller values of penaltyx will result in fewer selected microbes, similarly penaltyz modulates the number of selected metabolites. We tune them manually to facilitate subsequent interpretation – we generally prefer more sparsity than the default parameters would provide.
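A sketch of that call with the PMA package; `metabolites` and `microbes` stand for the two filtered matrices (samples in rows), and the penalty values are only indicative:

```r
library(PMA)
scca <- CCA(x = metabolites, z = microbes,
            penaltyx = 0.15, penaltyz = 0.15,      # sparsity penalties, tuned by hand
            typex = "standard", typez = "standard")
sum(scca$u != 0)   # number of selected metabolites
sum(scca$v != 0)   # number of selected microbes
```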

With these parameters, 5 bacteria and 16 metabolites were selected based on their ability to explain covariation between tables. Further, these features result in a correlation of 0.99 between the two tables. We interpret this to mean that the microbial and metabolomic data reflect similar underlying signals, and that these signals can be approximated well by the selected features. Be wary of the correlation value, however, since the scores are far from the usual bivariate normal cloud. Further, note that it is possible that other subsets of features could explain the data just as well – sparse CCA has minimized redundancy across features, but makes no guarantee that these are the “true” features in any sense.

Nonetheless, we can still use these 21 features to compress information from the two tables without much loss. To relate the recovered metabolites and OTUs to characteristics of the samples on which they were measured, we use them as input to an ordinary PCA. We have omitted the code we used to generate Figure 9.32 , we refer the reader to the online material accompanying the book or the workflow published in Callahan et al. ( 2016 ) .

Figure  9.32 displays the PCA triplot, where we show the different types of samples and the multidomain features (metabolites and OTUs). This allows comparison across the measured samples – triangles for knockout and circles for wild type – and characterizes the influence of the different features – diamonds with text labels. For example, we see that the main variation in the data is across PD and ST samples, which correspond to the different diets. Further, large values of 15 of the features are associated with ST status, while small values for 5 of them indicate PD status.


The advantage of the sparse CCA screening is now clear – we can display most of the variation across samples using a relatively simple plot, and can avoid plotting the hundreds of additional points that would be needed to display all of the features.

9.6.5 Canonical (or constrained) correspondence analysis (CCpnA)

The term constrained correspondence analysis reflects the fact that this method is similar to a constrained regression. The method attempts to force the latent variables to be correlated with the environmental variables provided as 'explanatory' variables.

CCpnA creates biplots where the positions of samples are determined by similarity in both species signatures and environmental characteristics. In contrast, principal components analysis or correspondence analysis only look at species signatures. More formally, CCpnA ensures that the resulting ordination directions lie in the span of the environmental variables. For thorough explanations, see Braak ( 1985 ) and Greenacre ( 2007 ).

This method can be run using the function ordinate in phyloseq . In order to use the covariates from the sample data, we provide an extra argument, specifying which of the features to consider.

Here, we take the data we denoised using dada2 in Chapter 4 . We will see more details about creating the phyloseq object in Chapter 10 . For the time being, we use the otu_table component containing a contingency table of counts for different taxa. We would like to compute the constrained correspondence analyses that explain the taxa abundances by the age and family relationship (both variables are contained in the sample_data slot of the ps1 object).
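A sketch of the ordination call (the covariate names `age_binned` and `family_relationship` are assumptions based on the description above):

```r
ps_ccpna <- ordinate(ps1, method = "CCA",
                     formula = ps1 ~ age_binned + family_relationship)
```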

We would like to make two-dimensional plots showing only the four most abundant taxa (making the biplot easier to read):

To access the positions for the biplot, we can use the scores function in the vegan package. Further, to facilitate figure annotation, we also join the site scores with the environmental data in the sample_data slot. Of the 23 total taxonomic orders, we only explicitly annotate the four most abundant – this makes the biplot easier to read.


Question 9.19 Look up the extra code for creating the tax and species objects in the online resources accompanying the book. Then make the analogue of Figure  9.33 but using litter as the faceting variable.


Figures 9.33 and 9.34 show the plots of these annotated scores, splitting sites by their age bin and litter membership, respectively. Note that to keep the appropriate aspect ratio in the presence of faceting, we have taken the vertical axis as our first canonical component. We have labeled individual bacteria that are outliers along the second CCpnA direction.

Evidently, the first CCpnA direction distinguishes between mice in the two main age bins. Circles on the left and right of the biplot represent bacteria that are characteristic of younger and older mice, respectively. The second CCpnA direction splits off the few mice in the oldest age group; it also partially distinguishes between the two litters. These samples low in the second CCpnA direction have more of the outlier bacteria than the others.

This CCpnA analysis supports the conclusion that the main difference between the microbiome communities of the different mice lies along the age axis. However, in situations where the influence of environmental variables is not so strong, CCA can have more power in detecting such associations. In general, it can be applied whenever it is desirable to incorporate supplemental data, but in a way that (1) is less aggressive than supervised methods, and (2) can use several environmental variables at once.

9.7 Summary of this chapter

Heterogeneous data A mixture of many continuous and a few categorical variables can be handled by adding the categorical variables as supplementary information to the PCA. This is done by projecting the mean of all points in a group onto the map.

Using distances Relations between data objects can often be summarized as interpoint distances (whether distances between trees, images, graphs, or other complex objects).

Ordination A useful representation of these distances is available through a method similar to PCA called multidimensional scaling (MDS), otherwise known as PCoA (principal coordinate analysis). It can be helpful to think of the outcome of these analyses as uncovering latent variables. In the case of clustering, the latent variables are categorical; in ordination, they are ordered variables such as time, or environmental gradients such as distance to the water. This is why these methods are often called ordination.

Robust versions can be used when the interpoint distances are on wildly different scales. NMDS (nonmetric multidimensional scaling) aims to produce coordinates such that the order of the interpoint distances is respected as closely as possible.
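
A sketch of the rank-based version using vegan (the counts matrix is a placeholder):
```r
library(vegan)
# Nonmetric MDS: only the rank order of the interpoint distances is preserved
nmds <- metaMDS(counts, distance = "bray", k = 2, trace = FALSE)
nmds$stress            # how faithfully the ordering of the distances is respected
plot(nmds, type = "t")
```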

Correspondence analysis: a method for computing low dimensional projections that explain dependencies in categorical data. It decomposes chisquare distance in much the same way that PCA decomposes variance. Correspondence analysis is usually the best way to follow up on a significant chisquare test. Once we have ascertained there are significant dependencies between different levels of categories, we can map them and interpret proximities on this map using plots and biplots.
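
As an illustration (the ca package is one of several possible implementations), the follow-up to a significant chisquare test on a contingency table tab might look like:
```r
library(ca)
chisq.test(tab)     # first check that there are significant dependencies
res.ca <- ca(tab)   # decompose the chisquare distances
summary(res.ca)     # inertia explained by each axis
plot(res.ca)        # biplot: interpret proximities of row and column categories
```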

Permutation test for distances Given two sets of distances between the same points, we can measure whether they are related using the Mantel permutation test.
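
A sketch with vegan, assuming d1 and d2 are two dist objects on the same set of points:
```r
library(vegan)
# Mantel permutation test of the association between two distance matrices
mantel(d1, d2, method = "pearson", permutations = 999)
```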

Generalizations of variance and covariance When dealing with more than one matrix of measurements on the same data, we can generalize the notion of covariance and correlations to vectorial measurements of co-inertia.
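
For instance, the co-inertia of two tables measured on the same samples can be computed with ade4 (a sketch; tab1 and tab2 are placeholders):
```r
library(ade4)
pca1 <- dudi.pca(tab1, scannf = FALSE, nf = 2)
pca2 <- dudi.pca(tab2, scannf = FALSE, nf = 2)
coin <- coinertia(pca1, pca2, scannf = FALSE, nf = 2)
coin$RV   # RV coefficient: a generalization of correlation to pairs of matrices
```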

Canonical correlation is a method for finding a few linear combinations of variables from each table that are as correlated as possible. When using this method on matrices with large numbers of variables, we use a regularized version with an L1 penalty that reduces the number of non-zero coefficients.
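
A sketch of the regularized (sparse) version using the PMA package; x and z stand for the two measurement matrices and the penalty values are arbitrary illustrations:
```r
library(PMA)
# L1-penalized canonical correlation: many coefficients are shrunk to zero
scca <- CCA(x, z, typex = "standard", typez = "standard",
            penaltyx = 0.15, penaltyz = 0.15, K = 2)
scca$u   # sparse weights for the columns of x
scca$v   # sparse weights for the columns of z
```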

9.8 Further reading

Interpretation of PCoA maps and nonlinear embeddings can also be enhanced in the way we did for PCA, using generalizations of the supplementary point method; see Trosset and Priebe (2008) or Bengio et al. (2004). We saw in Chapter 7 how we can project one categorical variable onto a PCA. The correspondence analysis framework actually allows us to mix several categorical variables in with any number of continuous variables. This is done through an extension called multiple correspondence analysis (MCA), whereby we can do the same analysis on a large number of binary categorical variables and obtain useful maps. The trick here will be to turn the continuous variables into categorical variables first. For extensive examples using R, see for instance the book by Pagès (2016).
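
An illustrative sketch with FactoMineR (not taken from the text): continuous variables are first cut into categories, and MCA is then run on the resulting factors; the data frame dat and its column names are assumptions:
```r
library(FactoMineR)
dat$age_cat <- cut(dat$age, breaks = 3)   # turn a continuous variable into categories
facs <- dat[, sapply(dat, is.factor)]     # keep only the categorical columns
res.mca <- MCA(facs, graph = FALSE)
plot(res.mca, invisible = "ind")          # map of the category levels
```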

A simple extension to PCA that allows for nonlinear principal curve estimates instead of principal directions defined by eigenvectors was proposed in Hastie and Stuetzle ( 1989 ) and is available in the package princurve .

Finding curved subspaces containing a high density of data for dimensions higher than \(1\) is now called manifold embedding and can be done through Laplacian eigenmaps (Belkin and Niyogi 2003), local linear embedding as in Roweis and Saul (2000), or using the isomap method (Tenenbaum, De Silva, and Langford 2000). For textbooks covering nonlinear unsupervised learning methods, see Hastie, Tibshirani, and Friedman (2008, chap. 14) or Izenman (2008).

A review of many multitable correlation coefficients, and analysis of applications can be found in Josse and Holmes ( 2016 ) .

9.9 Exercises

Exercise 9.1  

We are going to take another look at the Phylochip data, replacing the original expression values by presence/absence. We threshold the data to retain only those taxa that have a value of at least 8.633 in at least 8 samples 7 .

7  These values were chosen to retain about 3,000 taxa, similar to our previous choice of threshold.

Perform a correspondence analysis on these binary data and compare the plot you obtain to what we saw in Figure  9.15 .

See Figure  9.37 .

Exercise 9.2  

Correspondence Analysis on color association tables: Here is an example of data collected by looking at the number of Google hits resulting from queries of pairs of words. The numbers in Table  9.4 are to be multiplied by 1000. For instance, the combination of the words “quiet” and “blue” returned 2,150,000 hits.

Table 9.4: Contingency table of co-occurring terms from search engine results.
black blue green grey orange purple white
quiet 2770 2150 2140 875 1220 821 2510
angry 2970 1530 1740 752 1040 710 1730
clever 1650 1270 1320 495 693 416 1420
depressed 1480 957 983 147 330 102 1270
happy 19300 8310 8730 1920 4220 2610 9150
lively 1840 1250 1350 659 621 488 1480
perplexed 110 71 80 19 23 15 109
virtuous 179 80 102 20 25 17 165

Perform a correspondence analysis of these data. What do you notice when you look at the two-dimensional biplot?

See Figure  9.38 . The code is not rendered here, but is shown in the document’s source file.
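
Since all the numbers of Table 9.4 are given above, one possible way to reproduce the analysis (here with the ca package) is:
```r
library(ca)
hits <- matrix(c(
   2770, 2150, 2140,  875, 1220,  821, 2510,
   2970, 1530, 1740,  752, 1040,  710, 1730,
   1650, 1270, 1320,  495,  693,  416, 1420,
   1480,  957,  983,  147,  330,  102, 1270,
  19300, 8310, 8730, 1920, 4220, 2610, 9150,
   1840, 1250, 1350,  659,  621,  488, 1480,
    110,   71,   80,   19,   23,   15,  109,
    179,   80,  102,   20,   25,   17,  165),
  byrow = TRUE, ncol = 7,
  dimnames = list(
    c("quiet", "angry", "clever", "depressed", "happy", "lively", "perplexed", "virtuous"),
    c("black", "blue", "green", "grey", "orange", "purple", "white")))
plot(ca(hits))   # two-dimensional correspondence analysis biplot of words and colors
```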

Exercise 9.3  

The dates at which Plato wrote his various books are not known. We take the sentence endings and use the frequencies of these patterns as the data.

From the biplot in Figure  9.39 can you guess at the chronological order of Plato’s works? Hint: the first (earliest) is known to be Republica . The last (latest) is known to be Laws .

Which sentence ending did Plato use more frequently early in his life?

What percentage of the inertia ( \(\chi^2\) -distance) is explained by the map in Figure  9.39 ?

To compute the percentage of inertia explained by the first two axes we take the cumulative sum of the eigenvalues at the value 2:
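
For example, if the correspondence analysis of the Plato data was done with ade4 and stored in res.plato (the object name is an assumption), this would read:
```r
# cumulative proportion of the total inertia captured by the first two axes
cumsum(res.plato$eig)[2] / sum(res.plato$eig)
```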

Exercise 9.4  

We are going to look at two datasets, one is a perturbed version of the other, and they both present gradients as often seen in ecological data. Read in the two species count matrices lakelike and lakelikeh, which are stored in the file lakes.RData. Compare the output of correspondence analysis and principal component analysis on each of the two data sets; restrict yourself to two dimensions. In the plots and the eigenvalues, what do you notice?

Comparison (output not shown):
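
A sketch of such a comparison (the choice of ade4 here is illustrative):
```r
load("lakes.RData")   # provides the count matrices lakelike and lakelikeh
library(ade4)
pca.lake  <- dudi.pca(lakelike,  scannf = FALSE, nf = 2)
coa.lake  <- dudi.coa(lakelike,  scannf = FALSE, nf = 2)
pca.lakeh <- dudi.pca(lakelikeh, scannf = FALSE, nf = 2)
coa.lakeh <- dudi.coa(lakelikeh, scannf = FALSE, nf = 2)
scatter(coa.lake);  scatter(pca.lake)    # compare the maps for the first data set
scatter(coa.lakeh); scatter(pca.lakeh)   # and for the perturbed one
rbind(pca = pca.lake$eig[1:2], coa = coa.lake$eig[1:2])   # compare the eigenvalues
```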

Exercise 9.5  

We analyzed the normalized Moignard data in Section 9.5.1 . Now redo the analysis with the raw data (in file nbt.3154-S3-raw.csv ) and compare the output with that obtained using the normalized values.

Exercise 9.6  

We are going to explore the use of kernel methods.

Compute kernelized distances using the kernlab package for the Moignard data, using various values for the sigma tuning parameter in the definition of the kernels. Then perform MDS on these kernelized distances. What difference is there in the variability explained by the first four components of kernel multidimensional scaling?
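
One way to sketch this; the object holding the normalized Moignard expression matrix is assumed to be called blom, and the sigma values are arbitrary:
```r
library(kernlab)
x <- as.matrix(blom)
for (sig in c(0.001, 0.01, 0.1)) {
  K  <- as.matrix(kernelMatrix(rbfdot(sigma = sig), x))
  # squared kernelized distance: d^2(i, j) = K[i, i] + K[j, j] - 2 K[i, j]
  D2 <- outer(diag(K), diag(K), "+") - 2 * K
  mds <- cmdscale(as.dist(sqrt(pmax(D2, 0))), k = 4, eig = TRUE)
  cat("sigma =", sig, ": % variability of first four components =",
      round(100 * mds$eig[1:4] / sum(pmax(mds$eig, 0)), 1), "\n")
}
```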

Make interactive three dimensional representations of the components: is there a projection where you see a branch for the purple points?

  • kernelized distances

Kernelized distances protect against outliers and allow the discovery of non-linear components.

  • using a 3d scatterplot interactively:
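
For example (a sketch; mds and the vector of cell colors cell_colors are assumed to come from the previous step):
```r
library(rgl)
# Interactive, rotatable 3D view of the first three components
plot3d(mds$points[, 1:3], col = cell_colors, size = 5,
       xlab = "Axis 1", ylab = "Axis 2", zlab = "Axis 3")
```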

Exercise 9.7  

Higher resolution study of cell data. Take the original expression data blom we generated in Section 9.5.1. Map the intensity of expression of each of the top 10 most variable genes onto the 3d plot made with the diffusion mapping. Which one of the principal coordinates (1, 2, 3, 4) can be seen as the one that clusters the 4SG (red) points the most?

An implementation in the package LPCM provides the function lpc , which estimates principal curves. Here we constrain ourselves to three dimensions chosen from the output of the diffusion map and create smoothed curves.

To get a feel for what the smoothed data are showing us, we take a look at the interactive graphics using the function plot3d from the rgl package.

One way of plotting both the smoothed line and the data points is to add the line using the plot3d function.
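
Putting these steps together in a rough sketch (the diffusion map object dmap and the color vector cell_colors are assumptions carried over from earlier code):
```r
library(LPCM)
library(rgl)
dm3 <- dmap$X[, 1:3]              # three diffusion map coordinates
fit <- lpc(dm3, scaled = FALSE)   # local principal curve, kept in the original coordinates
plot3d(dm3, col = cell_colors, size = 4)
lines3d(fit$LPC, lwd = 3)         # overlay the smoothed curve on the same plot
```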

Exercise 9.8  

Here we explore more refined distances and diffusion maps that can show cell development trajectories as in Figure  9.45 .

The diffusion map method restricts the estimation of distances to local points, further pursuing the idea that often only local distances should be represented precisely, since points that lie far apart are not measured with the same ‘reference’. This method also uses the distances as input, but then creates local probabilistic transitions as indicators of similarity; these are combined into an affinity matrix whose eigenvalues and eigenvectors are computed, much as in standard MDS.

Compare the output of the diffuse function from the diffusionMap package on both the l1 and l2 distances computed between the cells available in the dist2n.euclid and dist1n.l1 objects from Section 9.5.1 .
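
A sketch of this comparison, applying diffuse to the two distance objects named in the text:
```r
library(diffusionMap)
dmap_l2 <- diffuse(as.matrix(dist2n.euclid), neigen = 10)   # Euclidean (l2) distances
dmap_l1 <- diffuse(as.matrix(dist1n.l1),     neigen = 10)   # l1 distances
par(mfrow = c(1, 2))
plot(dmap_l2$X[, 1:2], main = "diffusion map, l2")
plot(dmap_l1$X[, 1:2], main = "diffusion map, l1")
```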

Notice that the vanilla plot for a dmap object does not allow the use of colors. As color is essential to our understanding of cell development, we add the colors by hand. Of course, here we use static 3d plots, but these should be supplemented by the plot3d examples we give in the code.

We use a tailored wrapper function scp3d , so that we can easily insert relevant parameters:

The best way of visualizing the data is to make a rotatable interactive plot using the rgl package.


Chapter 13: Multivariate Analysis of Variation

  • Open Access
  • First Online: 22 February 2022


  • Arak Mathai, Serge Provost & Hans Haubold

Given multivariate random samples originating from several Gaussian populations sharing the same covariance matrix, the one-way multivariate analysis of variation (also known as multivariate analysis of variance) technique enables one to test whether or not the population mean value vectors are equal. When a second set of treatments is also considered, the technique is referred to as MANOVA for the two-way classification. Test statistics are derived for the one- and two-way layouts. Additionally, a multivariate extension of the two-way classification is discussed.

13.1. Introduction

We will employ the same notations as in the previous chapters. Lower-case letters x , y , … will denote real scalar variables, whether mathematical or random. Capital letters X , Y , … will be used to denote real matrix-variate mathematical or random variables, whether square or rectangular matrices are involved. A tilde will be placed on top of letters such as \(\tilde {x},\tilde {y},\tilde {X},\tilde {Y}\) to denote variables in the complex domain. Constant matrices will for instance be denoted by A , B , C . A tilde will not be used on constant matrices unless the point is to be stressed that the matrix is in the complex domain. The determinant of a square matrix A will be denoted by | A | or det( A ) and, in the complex case, the absolute value or modulus of the determinant of A will be denoted as |det( A )|. When matrices are square, their order will be taken as p  ×  p , unless specified otherwise. When A is a full rank matrix in the complex domain, then AA ∗ is Hermitian positive definite where an asterisk designates the complex conjugate transpose of a matrix. Additionally, d X will indicate the wedge product of all the distinct differentials of the elements of the matrix X . Thus, letting the p  ×  q matrix X  = ( x ij ) where the x ij ’s are distinct real scalar variables, \(\text{d}X=\wedge _{i=1}^p\wedge _{j=1}^q\text{d}x_{ij}\) . For the complex matrix \(\tilde {X}=X_1+iX_2,\ i=\sqrt {(-1)}\) , where X 1 and X 2 are real, \(\text{d}\tilde {X}=\text{d}X_1\wedge \text{d}X_2\) .

In this chapter, we only consider analysis of variance (ANOVA) and multivariate analysis of variance (MANOVA) problems involving real populations. Even though all the steps involved in the following discussion focusing on the real variable case can readily be extended to the complex domain, it does not appear that a parallel development of analysis of variance methodologies in the complex domain has yet been considered. In order to elucidate the various steps in the procedures, we will first review the univariate case. For a detailed exposition of the analysis of variance technique in the scalar variable case, the reader may refer to Mathai and Haubold (2017). We will consider the cases of one-way classification or completely randomized design as well as two-way classification without and with interaction or randomized block design. With this groundwork in place, the derivations of the results in the multivariate setting ought to prove easier to follow.

In the early nineteenth century, Gauss and Laplace utilized methodologies that may be regarded as forerunners to ANOVA in their analyses of astronomical data. However, this technique came to full fruition in Ronald Fisher’s classic book titled “Statistical Methods for Research Workers” , which was initially published in 1925. The principle behind ANOVA consists of partitioning the total variation present in the data into variations attributable to different sources. It is actually the total variation that is split rather than the total variance, the latter being a fraction of the former. Accordingly, the procedure could be more appropriately referred to as “analysis of variation”. As has already been mentioned, we will initially consider the one-way classification model, which will then be extended to the multivariate situation.

Let us first focus on an experimental design called a completely randomized experiment. In this setting, the subject matter was originally developed for agricultural experiments, which influenced its terminology. For example, the basic experimental unit is referred to as a “plot”, which is a piece of land in an agricultural context. When an experiment is performed on human beings, a plot translates into an individual. If the experiment is carried out on some machinery, then a machine corresponds to a plot. In a completely randomized experiment, a set of n 1  +  n 2  + ⋯ +  n k plots, which are homogeneous with respect to all factors of variation, are selected. Then, k treatments are applied at random to these plots, the first treatment to n 1 plots, the second treatment to n 2 plots, up to the k -th treatment being applied to n k plots. For instance, if the effects of k different fertilizers on the yield of a certain crop are to be studied, then the treatments consist of these k fertilizers, the first treatment meaning one of the fertilizers, the second treatment, another one and so on, with the k -th treatment corresponding to the last fertilizer. If the experiment involves studying the yield of corn among k different varieties of corn, then a treatment coincides with a particular variety. If an experiment consists of comparing k teaching methods, then a treatment refers to a method of teaching and a plot corresponds to a student. When an experiment compares the effect of k different medications in curing a certain ailment, then a treatment is a medication, and so on. If the treatments are denoted by t 1 , …, t k , then treatment t j is applied at random to n j homogeneous plots or n j homogeneous plots are selected at random and treatment t j is applied to them, for j  = 1, …, k . Random assignment is done to avoid possible biases or the influence of confounding factors, if any. Then, observations measuring the effect of these treatments on the experimental units are made. For example, in the case of various methods of teaching, the observation x ij could be the final grade obtained by the j -th student who was subjected to the i -th teaching method. In the case of comparing k different varieties of corn, the observation x ij could consist of the yield of corn observed at harvest time in the j -th plot which received the i -th variety of corn. Thus, in this instance, i stands for the treatment number and j represents the serial number of the plot receiving the i -th treatment, x ij being the final observation. Then, the corresponding linear additive fixed effect model is the following:
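
\(x_{ij}=\mu +\alpha _i+e_{ij},\ \ j=1,\ldots ,n_i,\ i=1,\ldots ,k,\qquad (13.1.1)\)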

where μ is a general effect, α i is the deviation from the general effect due to treatment t i and e ij is the random component, which includes the sum total contributions originating from unknown or uncontrolled factors. When the experiment is designed, the plots are selected so that they be homogeneous with respect to all possible factors of variation. The general effect μ can be interpreted as the grand average or the expected value of x ij when α i is not present or treatment t i is not applied or has no effect. The simplest assumption that we will make is that E [ e ij ] = 0 for all i and j and Var( e ij ) =  σ 2  > 0 for all i and j and for some positive quantity σ 2 , where E [  ⋅ ] denotes the expected value of [  ⋅ ]. It is further assumed that μ , α 1 , …, α k are all unknown constants. When α 1 , …, α k are assumed to be random variables, model ( 13.1.1 ) is referred to as a“random effect model”. In the following discussion, we will solely consider fixed effect models. The first step consists of estimating the unknown quantities from the data. Since no distribution is assumed on the e ij ’s, and thereby on the x ij ’s, we will employ the method of least squares for estimating the parameters. In that case, one has to minimize the error sum of squares which is
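
\(\sum _{ij}e_{ij}^2=\sum _{i=1}^k\sum _{j=1}^{n_i}(x_{ij}-\mu -\alpha _i)^2.\)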

Applying calculus principles, we equate the partial derivatives of \(\sum _{ij}e_{ij}^2\) with respect to μ to zero and then, equate the partial derivatives of \(\sum _{ij}e_{ij}^2\) with respect to α 1 , …, α k to zero and solve these equations. A convenient notation in this area is to represent a summation by a dot. As an example, if the subscript j is summed up, it is replaced by a dot, so that ∑ j x ij  ≡  x i . ; similarly, ∑ ij x ij  ≡  x .. . Thus,

and since we have taken α i as a deviation from the general effect due to treatment t i , we can let ∑ i n i α i  = 0 without any loss of generality. Then, x .. ∕ n . is an estimate of μ , and denoting estimates/estimators by a hat, we write \(\hat {\mu }={x_{..}}/{n_{.}}\) . Now, note that for example α 1 appears in the terms \((x_{11}-\mu -\alpha _1)^2+\cdots +(x_{1n_1}-\mu -\alpha _1)^2=\sum _j(x_{1j}-\mu -\alpha _1)^2\) but does not appear in the other terms in the error sum of squares. Accordingly, for a specific i ,

that is, \(\hat {\alpha }_i =\frac {x_{i.}}{n_i}-\hat {\mu }\,\) . Thus,

The least squares minimum is obtained by substituting the least squares estimates of μ and α i , i  = 1, …, k , in the error sum of squares. Denoting the least squares minimum by s 2 ,

When the square is expanded, the middle term will become \(-2\sum _{ij}(\frac {x_{i.}}{n_i}-\frac {x_{..}}{n_{.}})^2\) , thus yielding the expression given in ( 13.1.3 ). As well, we have the following identity:

Now, let us consider the hypothesis H o  :  α 1  =  α 2  = ⋯ =  α k , which is equivalent to the hypothesis α 1  =  α 2  = ⋯ =  α k  = 0 since, by assumption, ∑ i n i α i  = 0. Proceeding as before, the least squares minimum, under the null hypothesis H o , denoted by \(s_0^2\) , is the following:

and hence the sum of squares due to the hypothesis or due to the presence of the α j ’s, is given by \(s_0^2-s^2=\sum _{ij}(\frac {x_{i.}}{n_i}-\frac {x_{..}}{n_{.}})^2\) . Thus, the total variation is partitioned as follows:

which is the analysis of variation principle. If \(e_{ij}\overset {iid}{\sim } N_1(0,\sigma ^2)\) for all i and j where σ 2  > 0 is a constant, it follows from the chisquaredness and independence of quadratic forms, as discussed in Chaps. 2 and 3 , that \(\frac {s_0^2}{\sigma ^2}\sim \chi ^2_{n_{.}-1}\) , a real chisquare variable having n .  − 1 degrees of freedom, \(\frac {[s_0^2-s^2]}{\sigma ^2}\sim \chi ^2_{k-1}\) under the hypothesis H o and \(\frac {s^2}{\sigma ^2}\sim \chi ^2_{n_{.}-k},\) where the sum of squares due to the α j ’s, namely \(s_0^2-s^2,\) and the residual sum of squares, namely s 2 , are independently distributed under the hypothesis. Usually, these findings are put into a tabular form known as the analysis of variation table or ANOVA table. The usual format is as follows:

ANOVA Table for the One-Way Classification

Variation due to | df | SS | MS = SS/df
treatments | \(k-1\) | \(\sum _in_i(\frac {x_{i.}}{n_i}-\frac {x_{..}}{n_.})^2\) | \((s_0^2-s^2)/(k-1)\)
residuals | \(n_{.}-k\) | \(\sum _{ij}(x_{ij}-\frac {x_{i.}}{n_i})^2\) | \(s^2/(n_{.}-k)\)
total | \(n_{.}-1\) | \(\sum _{ij}(x_{ij}-\frac {x_{..}}{n_{.}})^2\) |

where df denotes the number of degrees of freedom, SS means sum of squares and MS stands for mean squares or the average of the squares. There is usually a last column which contains the F-ratio, that is, the ratio of the treatments MS to the residuals MS, and enables one to determine whether to reject the null hypothesis, in which case the test statistic is said to be “significant”, or not to reject the null hypothesis, when the test statistic is “not significant”. Further details on the real scalar variable case are available from Mathai and Haubold ( 2017 ).
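
As a quick illustration outside the text, the same table is produced in R by a one-way linear model (the data frame and column names are assumptions):
```r
# d: data frame with a numeric response x and a factor treatment
fit <- lm(x ~ treatment, data = d)
anova(fit)   # df, SS and MS for treatments and residuals, plus the F-ratio
```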

In light of this brief review of the scalar variable case of one-way classification data or univariate data secured from a completely randomized design, the concepts will now be extended to the multivariate setting.

13.2. Multivariate Case of One-Way Classification Data Analysis

Extension of the results to the multivariate case is parallel to the scalar variable case. Consider a model of the type
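
\(X_{ij}=M+A_i+E_{ij},\ \ j=1,\ldots ,n_i,\ i=1,\ldots ,k,\qquad (13.2.1)\)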

with X ij , M , A i and E ij all being p  × 1 real vectors where X ij denotes the j -th observation vector in the i -th group or the observed vectors obtained from the n i plots receiving the i -th vector of treatments, M is a general effect vector, A i is a vector of deviations from M due to the i -th treatment vector so that ∑ i n i A i  =  O since we are taking deviations from the general effect M , and E ij is a vector of random components assumed to be normally distributed as follows: \(E_{ij}\overset {iid}{\sim } N_p(O,\varSigma ),\ \varSigma >O,\) for all i and j where Σ is a positive definite covariance matrix, that is,

where E [  ⋅ ] denotes the expected value operator. This normality assumption will be needed for testing hypotheses and developing certain distributional aspects. However, the multivariate analysis of variation can be set up without having to resort to any distributional assumption. In the real scalar variable case, we minimized the sum of the squares of the errors since the variations only involved single scalar variables. In the vector case, if we take the sum of squares of the elements in E ij , that is, \(E_{ij}^{\prime }E_{ij}\) and its sum over all i and j , then we are only considering the variations in the individual elements of E ij ’s; however, in the vector case, there is joint variation among the elements of the vector and that is also to be taken into account. Hence, we should be considering all squared terms and cross product terms or the whole matrix of squared and cross product terms. This is given by \(E_{ij}E_{ij}^{\prime }\) and so, we should consider this matrix and carry out some type of minimization. Consider

For obtaining estimates of M and A i , i  = 1, …, k , we will minimize the trace of \(\sum _{ij}E_{ij}E_{ij}^{\prime }\) as a criterion. There are terms of the type [ X ij  −  M  −  A i ] ′ [ X ij  −  M  −  A i ] in this trace. Thus,

noting that we assumed that ∑ i n i A i  =  O . Now, on differentiating the trace of \(E_{ij}E_{ij}^{\prime }\) with respect to A i for a specific i , we have

Observe that there is only one critical vector for \(\hat {M}\) and for \(\hat {A}_i,\ i=1,\ldots ,k\) . Accordingly, the critical point will either correspond to a minimum or a maximum of the trace. But for arbitrary M and A i , the maximum occurs at plus infinity and hence, the critical point \((\hat {M},\hat {A}_i,\ i=1,\ldots ,k)\) corresponds to a minimum. Once evaluated at these estimates, the sum of squares and cross products matrix, denoted by S , is the following:

Note that as in the scalar case, the middle terms and the last term will combine into the second term above. Now, let us impose the hypothesis H o  :  A 1  =  A 2  = ⋯ =  A k  =  O . Note that equality of the A j ’s will automatically imply that each one is null because the weighted sum is null as per our initial assumption in the model ( 13.2.1 ). Under this hypothesis, the model will be X ij  =  M  +  E ij , and then proceeding as in the univariate case, we end up with the following sum of squares and cross products matrix, denoted by S 0 :

so that the sum of squares and cross products matrix due to the A i ’s is the difference

Thus, the following partitioning of the total variation in the multivariate data:

Under the normality assumption for the random component \(E_{ij}\overset {iid}{\sim } N_p(O,\varSigma ),\ \varSigma >O,\) we have the following properties, which follow from results derived in Chap. 5 , the notation W p ( ν , Σ ) standing for a Wishart distribution having ν degrees of freedom and parameter matrix Σ :

We can summarize these findings in a tabular form known as the multivariate analysis of variation table or MANOVA table, where df means degrees of freedom in the corresponding Wishart distribution, and SSP represents the sum of squares and cross products matrix.

Multivariate Analysis of Variation (MANOVA) Table

Variation due to | df | SSP
treatments | \(k-1\) | \(\sum _{ij}[\frac {1}{n_i}X_{i.}-\frac {1}{n_{.}}X_{..}][\frac {1}{n_i}X_{i.}-\frac {1}{n_{.}}X_{..}]'\)
residuals | \(n_{.}-k\) | \(\sum _{ij}[X_{ij}-\frac {1}{n_i}X_{i.}][X_{ij}-\frac {1}{n_i}X_{i.}]'\)
total | \(n_{.}-1\) | \(\sum _{ij}[X_{ij}-\frac {1}{n_{.}}X_{..}][X_{ij}-\frac {1}{n_{.}}X_{..}]'\)

13.2.1. Some properties

The sample values from the i -th sample or the i -th group or the plots receiving the i -th treatment are \(X_{i1},X_{i2},\ldots ,X_{in_i}\) . In this case, the average is \(\sum _{j=1}^{n_i}\frac {X_{ij}}{n_i}=\frac {X_{i.}}{n_i}\) and the i -th sample sum of squares and products matrix is

As well, it follows from Chap. 5 that S i  ∼  W p ( n i  − 1, Σ ) when \(E_{ij}\overset {iid}{\sim } N_p(O,\varSigma ),\ \varSigma >O\) . Then, the residual sum of squares and products matrix can be written as follows, denoting it by the matrix V  :

where S i  ∼  W p ( n i  − 1, Σ ), i  = 1, …, k , and the S i ’s are independently distributed since the sample values from the k groups are independently distributed among themselves (within the group) and between groups. Hence, S  ∼  W p ( ν , Σ ), ν  = ( n 1  − 1) + ( n 2  − 1) + ⋯ + ( n k  − 1) =  n .  −  k . Note that \(\bar {X}_i=\frac {X_{i.}}{n_i}\) has \(\text{Cov}(\bar {X}_i)=\frac {1}{n_i}\varSigma \) , so that \(\sqrt {n_i}(\bar {X}_i-\bar {X})\) are iid N p ( O , Σ ) where \(\bar {X}={X_{..}}/{n_{.}}\) . Then, the sum of squares and products matrix due to the treatments or due to the A i ’s is the following, denoting it by U :

under the null hypothesis; when the hypothesis is violated, it is a noncentral Wishart distribution. Further, the sum of squares and products matrix due to the treatments and the residual sum of squares and products matrix are independently distributed. Thus, by comparing U and V  , we should be able to reach a decision regarding the hypothesis. One procedure that is followed is to take the determinants of U and V   and compare them. This does not have much of a basis and determinants should not be called “generalized variance” as previously explained since the basic condition of a norm is violated by the determinant. The basis for comparing determinants will become clear from the point of view of testing hypotheses by applying the likelihood ratio criterion, which is discussed next.

13.3. The Likelihood Ratio Criterion

Let \(E_{ij}\overset {iid}{\sim } N_p(O,\varSigma ),\ \varSigma >O\) , and suppose that we have simple random samples of sizes n 1 , …, n k from the k groups relating to the k treatments. Then, the likelihood function, denoted by L , is the following:

The maximum likelihood estimator/estimate (MLE) of M is \(\hat {M}=\frac {X_{..}}{n_{.}}=\bar {X}\) and that of A i is \(\hat {A}_i=\frac {X_{i.}}{n_i}-\hat {M}\) . With a view to obtaining the MLE of Σ , we first note that the exponent is a real scalar quantity which is thus equal to its trace, so that we can express the exponent as follows, after substituting the MLE's of M and A i :

Now, following through the estimation procedure of the MLE included in Chap. 3 , we obtain the MLE of Σ as

After substituting \(\hat {M},\ \hat {A}_i\) and \(\hat {\varSigma }\) , the exponent in the likelihood ratio criterion λ becomes \(-\frac {1}{2}n_{.}\text{ tr}(I_p)=-\frac {1}{2}n_{.}p\) . Hence, the maximum value of the likelihood function L under the general model becomes

Under the hypothesis H o  :  A 1  =  A 2  = ⋯ =  A k  =  O , the model is X ij  =  M  +  E ij and the MLE of M under H o is still \(\frac {1}{n_{.}}X_{..}\) and \(\hat {\varSigma }\) under H o is \(\frac {1}{n_{.}}\sum _{ij}(X_{ij}-\frac {1}{n_{.}}X_{..})(X_{ij}-\frac {1}{n_{.}}X_{..})^{\prime }\) , so that \(\max L\) under H o , denoted by \(\max L_o\) , is

Therefore, the λ -criterion is the following:

and U  ∼  W p ( k  − 1, Σ ) under H o is the sum of squares and cross products matrix due to the A i ’s and V  ∼  W p ( n .  −  k , Σ ) is the residual sum of squares and cross products matrix. It has already been shown that U and V   are independently distributed. Then \(W_1=(U+V)^{-\frac {1}{2}}V(U+V)^{-\frac {1}{2}}\) , with the determinant \(\frac {|V|}{|U+V|}\) , is a real matrix-variate type-1 beta with parameters \((\frac {n_{.}-k}{2},\frac {k-1}{2})\) , as defined in Chap. 5 , and \(W_2=V^{-\frac {1}{2}}UV^{-\frac {1}{2}}\) is a real matrix-variate type-2 beta with the parameters \((\frac {k-1}{2},\frac {n_{.}-k}{2})\) . Moreover, \(Y_1=I-W_1=(U+V)^{-\frac {1}{2}}U(U+V)^{-\frac {1}{2}}\) with \(\frac {|U|}{|U+V|}\) is a real matrix-variate type-1 beta random variables with parameters \((\frac {k-1}{2},\frac {n_{.}-k}{2})\) . Given the properties of independent real matrix-variate gamma random variables, we have seen in Chap. 5 that W 1 and Y 2  =  U  +  V   are independently distributed. Similarly, Y 1  =  I  −  W 1 and Y 2 are independently distributed. Further, \(W_2^{-1}=V^{\frac {1}{2}}U^{-1}V^{\frac {1}{2}}\) is a real matrix-variate type-2 beta random variable with the parameters \((\frac {n_{.}-k}{2},\frac {k-1}{2})\) . Observe that

A one-to-one function of λ is

13.3.1. Arbitrary moments of the likelihood ratio criterion

For an arbitrary h , the h -th moment of w as well as that of λ can be obtained from the normalizing constant of a real matrix-variate type-1 beta density with the parameters \((\frac {n_{.}-k}{2},\frac {k-1}{2})\) . That is,

As \(E[\lambda ^h]=E[w^{\frac {n_{.}}{2}}]^h=E[w^{(\frac {n_{.}}{2})h}]\) , the h -th moment of λ is obtained by replacing h by \((\frac {n_{.}}{2})h\) in ( 13.3.7 ). That is,

13.3.2. Structural representation of the likelihood ratio criterion

It can readily be seen from ( 13.3.7 ) that the h -th moment of w is of the form of the h -th moment of a product of independently distributed real scalar type-1 beta random variables. That is,

where w 1 , …, w p are independently distributed and w j is a real scalar type-1 beta random variable with the parameters \((\frac {n_{.}-k}{2}-\frac {j-1}{2},\frac {k-1}{2}),\ j=1,\ldots ,p,\) for n .  −  k  >  p  − 1 and n .  >  k  +  p  − 1. Hence the exact density of w is available by constructing the density of a product of independently distributed real scalar type-1 beta random variables. For special values of p and k , one can obtain the exact densities in the forms of elementary functions. However, for the general case, the exact density corresponding to E [ w h ] as specified in ( 13.3.7 ) can be expressed in terms of a G-function and, in the case of E [ λ h ] as given in ( 13.3.8 ), the exact density can be represented in terms of an H-function. These representations are as follows, denoting the densities of w and λ as f w ( w ) and f λ ( λ ), respectively:

for n .  >  p  +  k  − 1, p  ≥ 1 and f w ( w ) = 0, f λ ( λ ) = 0, elsewhere. The evaluation of G and H-functions can be carried out with the help of symbolic computing packages such as Mathematica and MAPLE. Theoretical considerations, applications and several special cases of the G and H-functions are, for instance, available from Mathai ( 1993 ) and Mathai, Saxena and Haubold ( 2010 ). The special cases listed therein can also be utilized to work out the densities for particular cases of ( 13.3.10 ) and ( 13.3.11 ). Explicit structures of the densities for certain special cases are listed in the next section.

13.3.3. Some special cases

Several particular cases can be worked out by examining the moment expressions in ( 13.3.7 ) and ( 13.3.8 ). The h -th moment of the \(w=\lambda ^{\frac {2}{n_{.}}}\) , where λ is the likelihood ratio criterion, is available from ( 13.3.7 ) as

Case (1): p  = 1

In this case, from (i) ,

which is the h -th moment of a real scalar type-1 beta random variable with the parameters \((\frac {n_{.}-k}{2},\frac {k-1}{2})\) and, in this case, w is simply a real scalar type-1 beta random variable with the parameters \((\frac {n_{.}-k}{2},\frac {k-1}{2})\) . We reject the null hypothesis H o  :  A 1  =  A 2  = ⋯ =  A k  =  O for small values of the λ -criterion and, accordingly, we reject H o for small values of w or the hypothesis is rejected when the observed value of w  ≤  w α where w α is such that \(\int _0^{w_{\alpha }}f_w(w)\text{d}w=\alpha \) for the preassigned size α of the critical region, f w ( w ) denoting the density of w for p  = 1, n .  >  k .

Case (2): p  = 2

From (i) , we have

and therefore

The gamma functions in (ii) can be combined by making use of a duplication formula for gamma functions, namely,

Take \(z=\frac {n_{.}-k}{2}-\frac {1}{2}+\frac {h}{2}\) and \(z=\frac {n_{.}-1}{2}-\frac {1}{2}+\frac {h}{2}\) in the part containing h and in the constant part wherein h  = 0, and then apply formula ( 13.3.12 ) to obtain

which is, for an arbitrary h , the h -th moment of a real scalar type-1 beta random variable with parameters ( n .  −  k  − 1, k  − 1) for n .  −  k  − 1 > 0, k  > 1. Thus, \(y=w^{\frac {1}{2}}\) is a real scalar type-1 beta random variable with the parameters ( n .  −  k  − 1, k  − 1). We would then reject H o for small values of w , that is, for small values of y or when the observed value of y  ≤  y α with y α such that \(\int _0^{y_{\alpha }}f_y(y)\text{d}y=\alpha \) for a preassigned probability of type-I error which is the error of rejecting H o when H o is true, where f y ( y ) is the density of y for p  = 2 whenever n .  >  k  + 1.

Case (3): k  = 2, p  ≥ 1, n .  >  p  + 1

In this case, the h -th moment of w as specified in (13.3.7) is the following:

since the numerator gamma functions, except the last one, cancel with the denominator gamma functions except the first one. This expression happens to be the h -th moment of a real scalar type-1 beta random variable with the parameters \((\frac {n_{.}-1-p}{2},\frac {p}{2})\) and hence, for k  = 2, n .  − 1 −  p  > 0 and p  ≥ 1, w is a real scalar type-1 beta random variable. Then, we reject the null hypothesis H o for small values of w or when the observed value of w  ≤  w α , with w α such that \(\int _0^{w_{\alpha }}f_w(w)\text{d}w=\alpha \) for a preassigned significance level α , f w ( w ) being the density of w for this case. We will use the same notation f w ( w ) for the density of w in all the special cases.

Case (4): k  = 3, p  ≥ 1

Proceeding as in Case (3), we see that all the gammas in the h -th moment of w cancel out except the last two in the numerator and the first two in the denominator. Thus,

After combining the gammas in \(y=w^{\frac {1}{2}}\) with the help of the duplication formula ( 13.3.12 ), we have the following:

Therefore, \(y=w^{\frac {1}{2}}\) is a real scalar type-1 beta random variable with the parameters ( n .  −  p  − 2, p ). We reject the null hypothesis for small values of y or when the observed value of y  ≤  y α , with y α such that \(\int _0^{y_{\alpha }}f_y(y)\text{d}y=\alpha \) for a preassigned significance level α . We will use the same notation f y ( y ) for the density of y in all special cases.

We can also obtain some special cases for \(t_1=\frac {1-w}{w}\) and \(t_2=\frac {1-y}{y},\) with \( y=\sqrt {w}\) . With this transformation, t 1 and t 2 will be available in terms of type-2 beta variables in the real scalar case, which conveniently enables us to relate this distribution to real scalar F random variables so that an F table can be used for testing the null hypothesis and reaching a decision. We have noted that

where W 1 is a real matrix-variate type-1 beta random variable with the parameters \((\frac {n_{.}-k}{2},\frac {k-1}{2})\) and W 2 is a real matrix-variate type-2 beta random variable with the parameters \((\frac {k-1}{2},\frac {n_{.}-k}{2})\) . Then, when p  = 1, W 1 and W 2 are real scalar variables, denoted by w 1 and w 2 , respectively. Then for p  = 1, we have one gamma ratio with h in the general h -th moment ( 13.3.7 ) and then,

where w 2 is a real scalar type-2 beta random variable with the parameters \((\frac {k-1}{2}, \frac {n_{.}-k}{2})\) . As well, in general, for a real matrix-variate type-2 beta matrix W 2 with the parameters \((\frac {\nu _1}{2},\frac {\nu _2}{2}),\) we have \(\frac {\nu _2}{\nu _1}W_2=F_{\nu _1,\nu _2}\) where \(F_{\nu _1,\nu _2}\) is a real matrix-variate F matrix random variable with degrees of freedom ν 1 and ν 2 . When p  = 1 or in the real scalar case \(\frac {\nu _2}{\nu _1}W_2=F_{\nu _1,\nu _2}\) where, in this case, F is a real scalar F random variable with ν 1 and ν 2 degrees of freedom. We have used F for the scalar and matrix-variate case in order to avoid too many symbols. For p  = 2, we combine the gamma functions in the numerator and denominator by applying the duplication formula for gamma functions ( 13.3.12 ); then, for \(t_2=\frac {1-y}{y}\) the situation turns out to be the same as in the case of t 1 , the only difference being that in the real scalar type-2 beta w 2 , the parameters are ( k  − 1, n .  −  k  − 1). Note that the original \(\frac {k-1}{2}\) has become k  − 1 and the original \(\frac {n_{.}-k}{2}\) has become n .  −  k  − 1. Thus, we can state the following two special cases.

Case (5): \(p=1,\ t_1=\frac {1-w}{w}\)

As was explained, t 1 is a real type-2 beta random variable with the parameters \((\frac {k-1}{2},\frac {n_{.}-k}{2})\) , so that

which is a real scalar F random variable with k  − 1 and n .  −  k degrees of freedom. Accordingly, we reject H o for small values of w and y , which corresponds to large values of F . Thus, we reject the null hypothesis H o whenever the observed value of \(F_{k-1,n_{.}-k}\ge F_{k-1,n_{.}-k,\alpha }\) where \(F_{k-1,n_{.}-k,\alpha }\) is the upper 100 α% percentage point of the F distribution or \(\int _{a}^{\infty }g(F)\text{d}F=\alpha \) where \(a=F_{k-1,n_{.}-k,\alpha }\) and g ( F ) is the density of F in this case.

Case (6): \(p=2,\ t_2=\frac {1-y}{y},\ y=\sqrt {w}\)

As previously explained, t 2 is a real scalar type-2 beta random variable with the parameters ( k  − 1, n .  −  k  − 1) or

which is a real scalar F random variable having 2( k  − 1) and 2( n .  −  k  − 1) degrees of freedom. We reject the null hypothesis for large values of t 2 or when the observed value of \([\frac {n_{.}-k-1}{k-1}]t_2\ge b\) with b such that \(\int _{b}^{\infty }g(F){\mathrm{d}}F=\alpha \) , g ( F ) denoting in this case the density of a real scalar random variable F with degrees of freedoms 2( k  − 1) and 2( n .  −  k  − 1), and \(b=F_{2(k-1),2(n_{.}-k-1),\alpha }\) .

Case (7): \(k=2,\ p\ge 1,\ t_1=\frac {1-w}{w}\)

For the case k  = 2, we have already seen that the gamma functions with h in their arguments cancel out, leaving only one gamma in the numerator and one gamma in the denominator, so that w is distributed as a real scalar type-1 beta random variable with the parameters \((\frac {n_{.}-1-p}{2}, \frac {p}{2})\) . Thus, \(t_1=\frac {1-w}{w}\) is a real scalar type-2 beta with the parameters \((\frac {p}{2},\frac {n_.-p-1}{2})\) , and

which is a real scalar F random variable having p and n .  − 1 −  p degrees of freedom. We reject H o for large values of t 1 or when the observed value of \([\frac {n_{.}-1-p}{p}]t_1\ge b\) where b is such that \(\int _b^{\infty }g(F){\mathrm{d}}F=\alpha \) with g ( F ) being the density of an F random variable with degrees of freedoms p and n .  − 1 −  p in this special case.

Case (8): \(k=3,p\ge 1,\ t_2=\frac {1-y}{y},\ y=\sqrt {w}\)

On combining Cases (4) and (6), it is seen that t 2 is a real scalar type-2 beta random variable with the parameters ( p , n .  −  p  − 1), so that

which is a real scalar F random variable with the degrees of freedoms (2 p , 2( n .  −  p  − 1)). Thus, we reject the hypothesis for large values of this F random variable. For a test at significance level α or with α as the size of its critical region, the hypothesis H o  :  A 1  =  A 2  = ⋯ =  A k  =  O is rejected when the observed value of this \(F\ge F_{2p,2(n_{.}-1-p),\alpha }\) where \(F_{2p,2(n_{.}-1-p),\alpha }\) is the upper 100 α% percentage point of the F distribution.

Example 13.3.1

In a dieting experiment, three different diets D 1 , D 2 and D 3 are tried for a period of one month. The variables monitored are weight in kilograms (kg), waist circumference in centimeters (cm) and right mid-thigh circumference in centimeters. The measurements are x 1  =  final weight minus initial weight, x 2  =  final waist circumference minus initial waist reading and x 3  =  final minus initial thigh circumference. Diet D 1 is administered to a group of 5 randomly selected individuals ( n 1  = 5), D 2 , to 4 randomly selected persons ( n 2  = 4), and 6 randomly selected individuals ( n 3  = 6) are subjected to D 3 . Since three variables are monitored, p  = 3. As well, there are three treatments or three diets, so that k  = 3. In our notation,

where i corresponds to the diet number and j stands for the sample serial number. For example, the observation vector on individual # 3 within the group subjected to diet D 2 is denoted by X 23 . The following are the data on x 1 , x 2 , x 3 :

Diet D 1  :  X 1 j , j  = 1, 2, 3, 4, 5 : 

Diet D 2  :  X 2 j , j  = 1, 2, 3, 4 : 

Diet D 3  :  X 3 j , j  = 1, 2, 3, 4, 5, 6 : 

(1): Perform an ANOVA test on the first component consisting of weight measurements; (2): Carry out a MANOVA test on the first two components, weight and waist measurements; (3): Do a MANOVA test on all the three variables, weight, waist and thigh measurements.

Solution 13.3.1

We first compute the vectors \(X_{1.},\bar {X}_1,X_{2.},\bar {X}_2,X_{3.},\bar {X}_3,X_{..}\) and \( \bar {X}\) :

Problem (1): ANOVA on the first component x 1 . The first components of the observations are x 1 ij . The first components under diet D 1 are

the first components of observations under diet D 2 are

and the first components under diet D 3 are

Hence, the total on the first component x 1..  = 15, and \(\bar {x}_1=\frac {x_{1..}}{n_{.}}=\frac {15}{15}=1\) . The first component model is the following:

Note again that estimators and estimates will be denoted by a hat. As previously mentioned, the same symbols will be used for the variables and the observations on those variables in order to avoid using too many symbols; however, the notations will be clear from the context. If the discussion pertains to distributions, then variables are involved, and if we are referring to numbers, then we are dealing with observations.

The least squares estimates are \(\hat {\mu }=\frac {x_{1..}}{n_{.}}=1,\ \hat {\alpha }_1=\frac {x_{11.}}{5}-\hat {\mu }=\frac {5}{5}-1=0\) , \(\hat {\alpha }_2=\frac {x_{12.}}{4}-\hat {\mu }=\frac {4}{4}-1=0\) , \(\hat {\alpha }_3=\frac {x_{13.}}{6}-\hat {\mu }=\frac {6}{6}-1=0\) . The first component hypothesis is α 1  =  α 2  =  α 3  = 0. The total sum of squares is

The sum of squares due to the α i ’s is available from

Hence the following table:

ANOVA Table

Variation due to | df | SS | MS | F-ratio
diets | 2 | 0 | 0 | 0
residuals | 12 | 34 | 34/12 |
total | 14 | 34 | |

Since the sum of squares due to the α i ’s is null, the hypothesis is not rejected at any level.

Problem (2): MANOVA on the first two components. We are still using the notation X ij for the two and three-component cases since our general notation does not depend on the number of components in the vector concerned; as well, we can make use of the computations pertaining to the first component in Problem (1). The relevant quantities computed from the data on the first two components are the following:

In this case, the grand total, denoted by X .. , and the grand average, denoted by \(\bar {X}\) , are the following:

Note that the total sum of squares and cross products matrix can be written as follows:

Now, consider the residual sum of squares and cross products matrix:

Therefore, the observed w is given by

This is the case p  = 2, k  = 3, that is, our special Case (8). Then, the observed value of

Our F-statistic is \(F_{2p,2(n_{.}-p-1)}=F_{4,24}\) . Let us test the hypothesis A 1  =  A 2  =  A 3  =  O at the 5% significance level or α  = 0.05. Since the observed value 0.9244 < 5.77 =  F 4,24,0.05 which is available from F-tables, we do not reject the hypothesis.
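
The arithmetic of this step can be retraced as in the following sketch, where Sr and St stand for the residual and total sum of squares and cross products matrices computed above:
```r
w     <- det(Sr) / det(St)              # w = |V| / |U + V|
t2    <- (1 - sqrt(w)) / sqrt(w)
n_dot <- 15; p <- 2                     # total sample size and dimension in this problem
F_obs <- ((n_dot - p - 1) / p) * t2     # compare with F_{2p, 2(n_dot - p - 1)} = F_{4, 24}
pf(F_obs, 2 * p, 2 * (n_dot - p - 1), lower.tail = FALSE)   # corresponding p-value
```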

Verification of the calculations

Denoting the total sum of squares and cross products matrix by S t , the residual sum of squares and cross products matrix by S r and the sum of squares and cross products matrix due to the hypothesis or due to the effects A i ’s by S h , we should have S t  =  S r  +  S h where

as previously determined. Let us compute

For the first two components, we already have the following:

As \(34+\frac {164}{15}=\frac {674}{15}\) , S t  =  S r  +  S h , that is,

Thus, the result is verified.

Problem (3): Data on all the three variables. In this case, we have p  = 3, k  = 3. We will first use \(X_{1.},\bar {X}_1,X_{2.},\bar {X}_2,X_{3.},\bar {X}_3,X_{..}\) and \(\bar {X}\) which have already been evaluated, to compute the residual sum of squares and cross product matrix. Since all the matrices are symmetric, for convenience, we will only display the diagonal elements and those above the diagonal. As in the case of two components, we compute the following, making use of the calculations already done for the 2-component case (the notations remaining the same since our general notation does not involve p ):

whose determinant is equal to

The total sum of squares and cross products matrix is the following:

Hence the total sum of squares and cross products matrix is

Then, the observed value of

Since p  = 3 and k  = 3, an exact distribution is available from our special Case (8) for \(t_2=\frac {1-\sqrt {w}}{\sqrt {w}}\) and an observed value of t 2  = 0.1989. Then,

The critical value obtained from an F-table at the 5% significance level is F 6,22,.05  ≈ 3.85. Since the observed value of F 6,22 is \(\frac {11}{3}(0.1989)=0.7293<3.85\) , the hypothesis A 1  =  A 2  =  A 3  =  O is not rejected. It can also be verified that S t  =  S r  +  S h .

13.3.4. Asymptotic distribution of the λ -criterion

We can obtain an asymptotic real chisquare distribution for n .  → ∞ . To this end, consider the general h -th moments of λ or E [ λ h ] from ( 13.3.8 ), that is,

Let us expand all the gamma functions in E [ λ h ] by using the first term in the asymptotic expansion of a gamma function or by making use of Stirling’s approximation formula, namely,

for | z |→ ∞ when δ is a bounded quantity. Taking \(\frac {n_{.}}{2}\to \infty \) in the constant part and \(\frac {n_{.}}{2}(1+h)\to \infty \) in the part containing h , we have

The factor \((\tfrac {n_{.}}{2})^{-(\frac {k-1}{2})}\) is canceled from the expression coming from the constant part. Then, taking the product over j  = 1, …, p , we have

which is the moment generating function (mgf) of a real scalar chisquare with p ( k  − 1) degrees of freedom. Hence, we have the following result:

Theorem 13.3.1

Letting λ be the likelihood ratio criterion for testing the hypothesis H o  :  A 1  =  A 2  = ⋯ =  A k  =  O, the asymptotic distribution of \(-2\ln \lambda \) is a real chisquare random variable having p ( k  − 1) degrees of freedom as n .  → ∞, that is,

Observe that we only require the sum of the sample sizes n 1  + ⋯ +  n k  =  n . to go to infinity, and not that the individual n j ’s be large. This chisquare approximation can be utilized for testing the hypothesis for large values of n . , and we then reject H o for small values of λ , which means for large values of \(-2\ln \lambda \) or large values of \(\chi ^2_{p(k-1)}\) , that is, when the observed \(-2\ln \lambda \ge \chi ^2_{p(k-1),\alpha }\) where \(\chi ^2_{p(k-1),\alpha }\) denotes the upper 100 α% percentage point of the chisquare distribution.
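
In practice, the approximate test of Theorem 13.3.1 can be carried out as in this brief sketch (lambda, p and k are assumed to be available):
```r
stat <- -2 * log(lambda)                              # -2 ln(lambda)
pchisq(stat, df = p * (k - 1), lower.tail = FALSE)    # approximate p-value
```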

13.3.5. MANOVA and testing the equality of population mean values

In a one-way classification model, we have the following for the p -variate case:

for j  = 1, …, n i , i  = 1, …, k . When the error vector is assumed to have a null expected value, that is, E [ E ij ] =  O , for all i and j , we have E [ X ij ] =  M i for all i and j . Thus, this assumption, in conjunction with the hypothesis A 1  =  A 2  = ⋯ =  A k  =  O , implies that M 1  =  M 2  = ⋯ =  M k , that is, the hypothesis of equality of the population mean value vectors or the test is equivalent to testing the equality of population mean value vectors in k independent populations with common covariance matrix Σ  >  O . We have already tackled this problem in Chap. 6 under both assumptions that Σ is known and unknown, when the populations are Gaussian, that is, X ij  ∼  N p ( M i , Σ ), Σ  >  O . Thus, the hypothesis made in a one-way classification MANOVA setting and the hypothesis of testing the equality of mean value vectors in MANOVA are one and the same. In the scalar case too, the ANOVA in a one-way classification data coincides with testing the equality of population mean values in k independent univariate populations. In the ANOVA case, we are comparing the sum of squares attributable to the hypothesis to the residual sum of squares. If the hypothesis really holds true, then the sum of squares due to the hypothesis or to the α j ’s (deviations from the general effect due to the j -th treatment) must be zero and hence for large values of the sum of squares due to the presence of the α j ’s, as compared to the residual sum of squares, we reject the hypothesis. In MANOVA, we are comparing two sums of squares and cross product matrices, namely,

We have the following distributional properties:

The likelihood ratio criterion is

where the η j ’s are the eigenvalues of T 3 . We reject H o for small values of λ which means for large values of \(\prod _{j=1}^p[1+\eta _j]\) . The basic objective in MANOVA consists of comparing U and V  , the matrices due to the presence of treatment effects and due to the residuals, respectively. We can carry out this comparison by using the type-1 beta matrices T 1 and T 2 or the type-2 beta matrices T 3 and T 4 or by making use of the eigenvalues of these matrices. In the type-1 beta case, the eigenvalues will be between 0 and 1, whereas in the type-2 beta case, the eigenvalues will be real positive or simply positive. We may also note that the eigenvalues of T 1 and its nonsymmetric forms U ( U + V ) −1 or ( U + V ) −1 U are identical. Similarly, the eigenvalues of the symmetric form T 2 and V  ( U + V ) −1 or ( U + V ) −1 V   are one and the same. As well, the eigenvalues of the symmetric form T 3 and the nonsymmetric forms UV −1 or V −1 U are the same. Again, the eigenvalues of the symmetric form T 4 and its nonsymmetric forms U −1 V   or V   U −1 are the same. Several researchers have constructed tests based on the matrices T 1 , T 2 , T 3 , T 4 or their nonsymmetric forms or their eigenvalues. Some of the well-known test statistics are the following:

For example, when the hypothesis is true, we expect the eigenvalues of T 3 to be small and hence we may reject the hypothesis when its smallest eigenvalue is large or the trace of T 3 is large. If we are using T 4 , then when the hypothesis is true, we expect T 4 to be large in the sense that the eigenvalues will be large, and therefore we may reject the hypothesis for small values of its largest eigenvalue or its trace. If we are utilizing T 1 , we are actually comparing the contribution attributable to the treatments to the total variation. We expect this to be small under the hypothesis and hence, we may reject the hypothesis for large values of its smallest eigenvalue or its trace. If we are using T 2 , we are comparing the residual part to the total variation. If the hypothesis is true, then we can expect a substantial contribution from the residual part so that we may reject the hypothesis for small values of the largest eigenvalue or the trace in this case. These are the main ideas in connection with constructing statistics for testing the hypothesis on the basis of the eigenvalues of the matrices T 1 , T 2 , T 3 and T 4 .
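
For reference, these classical statistics are available through R’s manova() function; a sketch, assuming a numeric response matrix Y (one row per plot) and a factor treatment:
```r
fit <- manova(Y ~ treatment)
summary(fit, test = "Wilks")              # likelihood ratio criterion
summary(fit, test = "Pillai")             # trace of U(U + V)^(-1)
summary(fit, test = "Hotelling-Lawley")   # trace of U V^(-1)
summary(fit, test = "Roy")                # largest eigenvalue of U V^(-1)
```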

13.3.6. When H o is rejected

When H o  :  A 1  = ⋯ =  A k  =  O is rejected, it is plausible that some of the differences may be non-null, that is, A i  −  A j ≠ O for some i and j , i ≠ j . We may then test individual hypotheses of the type H o 1  :  A i  =  A j for i ≠ j . There are k ( k  − 1)∕2 such differences. This type of test is equivalent to testing the equality of the mean value vectors in two independent p -variate Gaussian populations with the same covariance matrix Σ  >  O . This has already been discussed in Chap. 6 for the cases Σ known and Σ unknown. In this instance, we can use the special Case (7) where for k  = 2, and the statistic t 1 is real scalar type-2 beta distributed with the parameters \((\frac {p}{2}, \frac {n_{.}-1-p}{2})\) , so that

where n .  =  n i  +  n j for some specific i and j . We can make use of ( 13.3.18 ) for testing individual hypotheses. By utilizing Special Case (8) for k  = 3, we can also test a hypothesis of the type A i  =  A j  =  A m for different i , j , m . Instead of comparing the results of all the k ( k  − 1)∕2 individual hypotheses, we may examine the estimates of A i , namely, \(\hat {A}_i=\frac {X_{i.}}{n_i}-\frac {X_{..}}{n_{.}},\ i=1,\ldots ,k\) . Consider the norms \(\Vert \frac {X_{i.}}{n_i}-\frac {X_{j.}}{n_j}\Vert ,\ i\ne j\) (the Euclidean norm may be taken for convenience). Start with the individual test corresponding to the maximum value of these norms. If this test is not rejected, it is likely that tests on all other differences will not be rejected either. If it is rejected, we then take the next largest difference and continue testing.

Note 13.3.1

Usually, before initiating a MANOVA, the assumption that the covariance matrices associated with the k populations or treatments are equal is tested. It may happen that the error variable E 1 j , j  = 1, …, n 1 , may have the common covariance matrix Σ 1 , E 2 j , j  = 1, …, n 2 , may have the common covariance matrix Σ 2 , and so on, where not all the Σ j ’s equal. In this instance, we may first test the hypothesis H o  :  Σ 1  =  Σ 2  = ⋯ =  Σ k . This test is already described in Chap. 6 . If this hypothesis is not rejected, we may carry out the MANOVA analysis of the data. If this hypothesis is rejected, then some of the Σ j ’s may not be equal. In this case, we test individual hypotheses of the type Σ i  =  Σ j for some specific i and j , i ≠ j . Include all treatments for which the individual hypotheses are not rejected by the tests and exclude the data on the treatments whose Σ j ’s may be different, but distinct from those already selected. Continue with the MANOVA analysis of the data on the treatments which are retained, that is, those for which the Σ j ’s are equal in the sense that the corresponding tests of equality of covariance matrices did not reject the hypotheses.

Example 13.3.2

For the sake of illustration, test the hypothesis H o  :  A 1  =  A 2 with the data provided in Example 13.3.1 .

Solution 13.3.2

We can utilize some of the computations done in the solution to Example 13.3.1 . Here, n 1  = 5, n 2  = 4 and n .  =  n 1  +  n 2  = 9. We disregard the third sample. The residual sum of squares and cross products matrix in the present case is available from the Solution 13.3.1 by omitting the matrix corresponding to the third sample. Then,

whose determinant is


Let us compute \(\sum _{i=1}^2\sum _{j=1}^{n_i}(X_{ij}-\frac {X_{..}}{n_{.}})(X_{ij}-\frac {X_{..}}{n_{.}})^{\prime }\) :


Hence the sum


So, the observed values are as follows:

and \(F_{p,n_{.}-1-p}=F_{3,5}\) . Let us test the hypothesis at the 5% significance level. The critical value obtained from F-tables is F 3,5,0.05  = 9.01. But since the observed value of F is 0.9022 < 9.01, the hypothesis is not rejected. We expected this result because the hypothesis A 1  =  A 2  =  A 3 was not rejected. This example was mainly presented to illustrate the steps.

13.4. MANOVA for Two-Way Classification Data

As was done previously for the one-way classification, we will revisit the real scalar variable case first. Thus, we consider the case of two sets of treatments, instead of the single set analyzed in Sect. 13.3 . In an agricultural experiment, suppose that we are considering r fertilizers as the first set of treatments, say F 1 , …, F r , along with a set of s different varieties of corn, V 1 , …, V s , as the second set of treatments. A randomized block experiment belongs to this category. In this case, r blocks of land, which are homogeneous with respect to all factors that may affect the yield of corn, such as precipitation, fertility of the soil, exposure to sunlight, drainage, and so on, are selected. Fertilizers F 1 , …, F r are applied to these r blocks at random, the first block receiving any one of F 1 , …, F r , and so on. Each block is divided into s equivalent plots, all the plots being of the same size, shape, and so on. Then, the s varieties of corn are applied to each block at random, with one variety to each plot. Such an experiment is called a randomized block experiment. This experiment is then replicated t times. This replication is done so that possible interaction between fertilizers and varieties of corn could be tested. If the randomized block experiment is carried out only once, no interaction can be tested from such data because each plot will have only one observation. Interaction between the i -th fertilizer and j -th variety is a joint effect for the ( F i , V j ) combination, that is, the effect of F i on the yield varies with the variety of corn. For instance, an interaction will be present if the effect of F 1 is different when combined with V 1 or V 2 . In other words, there are individual effects and joint effects, a joint effect being referred to as an interaction between the two sets of treatments. As an example, consider one set of treatments consisting of r different methods of teaching and a second set of treatments that could be s levels of previous exposure of the students to the subject matter.

13.4.1. The model in a two-way classification

The additive, fixed effect, two-way classification or two-way layout model with interaction is the following:

where μ is a general effect, α i is the deviation from the general effect due to the i -th treatment of the first set, β j is the deviation from the general effect due to the j -th treatment of the second set, and γ ij is the effect due to the interaction term, that is, the joint effect of the first and second sets of treatments. In a randomized block experiment, the treatments belonging to the first set are called “blocks” or “rows” and the treatments belonging to the second set are called “treatments” or “columns”; thus, the two sets correspond to rows, say R 1 , …, R r , and columns, say C 1 , …, C s . Then, γ ij is the deviation from the general effect due to the combination ( R i , C j ). The random component e ijk is the sum total of the contributions coming from all unknown factors, and x ijk is the observation resulting from the effect of the combination of treatments ( R i , C j ) at the k -th replication or k -th identical repetition of the experiment. In an agricultural setting, the observation may be the yield of corn whereas, in a teaching experiment, the observation may be the grade obtained by the “( i , j , k )”-th student. In a fixed effect model, all parameters μ , α 1 , …, α r , β 1 , …, β s are assumed to be unknown constants. In a random effect model, α 1 , …, α r or β 1 , …, β s or both sets are assumed to be random variables. We assume that E [ e ijk ] = 0 and Var( e ijk ) =  σ 2  > 0 for all i , j , k , where E (⋅) denotes the expected value of (⋅). In the present discussion, we will only consider the fixed effect model. Under this model, the data are called two-way classification data or two-way layout data because they can be classified according to the two sets of treatments, “rows” and “columns”. Since we are not making any assumption about the distribution of e ijk , and thereby that of x ijk , we will apply the method of least squares to estimate the parameters.
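As a concrete, hypothetical illustration of this model, the following Python sketch simulates observations x ijk  =  μ  +  α i  +  β j  +  γ ij  +  e ijk under the usual sum-to-zero constraints (imposed in the next subsection); every parameter value is invented purely for the example.

```python
import numpy as np

# Minimal simulation of the fixed effect two-way model with interaction,
#   x_ijk = mu + alpha_i + beta_j + gamma_ij + e_ijk,
# under sum-to-zero constraints. All values are hypothetical.
rng = np.random.default_rng(3)
r, s, t = 2, 3, 4                                    # rows, columns, replicates
mu = 10.0
alpha = np.array([1.0, -1.0])                        # sums to zero
beta = np.array([0.5, 0.0, -0.5])                    # sums to zero
gamma = np.array([[0.3, -0.1, -0.2],
                  [-0.3, 0.1, 0.2]])                 # rows and columns sum to zero
sigma = 1.0
x = (mu
     + alpha[:, None, None]
     + beta[None, :, None]
     + gamma[:, :, None]
     + rng.normal(0.0, sigma, size=(r, s, t)))
print(x.shape)   # (r, s, t) array of observations x_ijk
```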

13.4.2. Estimation of parameters in a two-way classification

The error sum of squares is

Our first objective consists of isolating the sum of squares due to interaction and testing the hypothesis of no interaction, that is, H o  :  γ ij  = 0 for all i and j . If γ ij ≠ 0, part of the effect of the i -th row R i is mixed up with the interaction and, similarly, part of the effect of the j -th column, C j , is intermingled with γ ij , so that no hypothesis can be tested on the α i ’s and β j ’s unless γ ij is zero, negligibly small, or the hypothesis γ ij  = 0 is not rejected. As well, noting that in [ μ  +  α i  +  β j  +  γ ij ] the subscripts appear either none at a time, one at a time or both at a time, we may write μ  +  α i  +  β j  +  γ ij  =  m ij . Thus,

We employ the standard notation in this area, namely that a summation over a subscript is denoted by a dot. Then, the least squares minimum under the general model or the residual sum of squares, denoted by s 2 , is given by

Now, consider the hypothesis H o  :  γ ij  = 0 for all i and j . Under this H o , the model becomes

We differentiate this partially with respect to μ and α i for a specific i , and to β j for a specific j , and then equate the results to zero and solve to obtain estimates for μ , α i and β j . Since we have taken α i , β j and γ ij as deviations from the general effect μ , we may let α .  =  α 1  + ⋯ +  α r  = 0, β .  =  β 1  + ⋯ +  β s  = 0 and γ i .  = 0, for each i and γ . j  = 0 for each j , without any loss of generality. Then,

Hence, the least squares minimum under the hypothesis H o , denoted by \(s_0^2\) , is

the simplifications resulting from properties of summations with respect to subscripts. Thus, the sum of squares due to the hypothesis H o  :  γ ij  = 0 for all i and j or the interaction sum of squares, denoted by \(s^2_{\gamma }\) is the following:

the sum of squares due to the hypothesis or attributable to the γ ij ’s, that is, due to interaction is

If the hypothesis γ ij  = 0 is not rejected, the effects of the γ ij ’s are deemed insignificant; then, setting γ ij  = 0 and α i  = 0, i  = 1, …, r , in the hypothesis, we obtain the sum of squares due to the α i ’s, or sum of squares due to the rows, denoted by \(s^2_r \) , which is

Similarly, the sum of squares attributable to the β j ’s or due to the columns, denoted as \(s^2_c\) , is

Observe that the sum of squares due to rows plus the sum of squares due to columns, once added to the interaction sum of squares, is the subtotal sum of squares, denoted by \(s^2_{rc} =t\sum _{ij}\big (\frac {x_{ij.}}{t}-\frac {x_{...}}{rst}\big )^2\) or this subtotal sum of squares is partitioned into the sum of squares due to the rows, due to the columns and due to interaction. This is equivalent to an ANOVA on the subtotals ∑ k x ijk or an ANOVA on a two-way classification with a single observation per cell. As has been pointed out, in that case, we cannot test for interaction, and moreover, this subtotal sum of squares plus the residual sum of squares is the grand total sum of squares. If we assume a normal distribution for the error terms, that is, \(e_{ijk}\overset {iid}{\sim }N_1(0,\sigma ^2),\ \sigma ^2>0\) , for all i , j , k , then under the hypothesis H o  :  γ ij  = 0, it can be shown that

and the residual variation s 2 has the following distribution whether H o holds or not:

where \(s^2_{\gamma }\) and s 2 are independently distributed. Then, under the hypothesis γ ij  = 0 for all i and j or when this hypothesis is not rejected, it can be established that

and \(s_r^2\) and s 2 as well as \(s_c^2\) and s 2 are independently distributed whenever H o  :  γ ij  = 0 is not rejected. Hence, under the hypothesis,

The total sum of squares is \(\sum _{ijk}\big (x_{ijk}-\frac {x_{...}}{rst}\big )^2\) . Thus, the first decomposition and the first part of ANOVA in this two-way classification scheme is the following:

the second stage being

and the resulting ANOVA table is the following:

ANOVA Table for the Two-Way Classification

Variation due to | df (1) | SS (2) | MS (3) = (2)/(1)
rows | r − 1 | \(s^2_r=st\sum _{i=1}^r(\frac {x_{i..}}{st}-\frac {x_{...}}{rst})^2\) | \(s_r^2/(r-1)=D_1\)
columns | s − 1 | \(s^2_c=rt\sum _{j=1}^s(\frac {x_{.j.}}{rt}-\frac {x_{...}}{rst})^2\) | \(s^2_c/(s-1)=D_2\)
interaction | (r − 1)(s − 1) | \(s^2_{\gamma }\) | \(s_{\gamma }^2/(r-1)(s-1)=D_3\)
subtotal | rs − 1 | \(t\sum _{ij}(\frac {x_{ij.}}{t}-\frac {x_{...}}{rst})^2\) |
residuals | rs(t − 1) | \(s^2\) | \(s^2/[rs(t-1)]\)
total | rst − 1 | \(\sum _{ijk}(x_{ijk}-\frac {x_{...}}{rst})^2\) |

where df designates the number of degrees of freedom, SS means sum of squares and MS stands for mean squares; the expression for the residual sum of squares is given in ( 13.4.2 ), that for the interaction in ( 13.4.3 ), that for the rows in ( 13.4.4 ) and that for the columns in ( 13.4.5 ). Note that we test the hypotheses on the α i ’s and β j ’s, or row effects and column effects, only if the hypothesis γ ij  = 0 is not rejected; otherwise there is no point in testing hypotheses on the α i ’s and β j ’s because they are confounded with the γ ij ’s.
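As a computational check on this table, the sums of squares can be obtained directly from an (r, s, t) array of observations. The sketch below is hedged: the displayed equations (13.4.2)–(13.4.5) are not reproduced in this excerpt, so the residual and interaction expressions used here are the standard ones for a balanced two-way layout.

```python
import numpy as np

def two_way_anova_ss(x):
    """x has shape (r, s, t): replicate k of cell (i, j).
    Returns the sums of squares of the two-way ANOVA table, using the
    standard balanced-design decomposition."""
    r, s, t = x.shape
    grand = x.mean()
    row_means = x.mean(axis=(1, 2))        # x_i.. / (st)
    col_means = x.mean(axis=(0, 2))        # x_.j. / (rt)
    cell_means = x.mean(axis=2)            # x_ij. / t
    ss_rows = s * t * np.sum((row_means - grand) ** 2)
    ss_cols = r * t * np.sum((col_means - grand) ** 2)
    ss_int = t * np.sum((cell_means
                         - row_means[:, None]
                         - col_means[None, :]
                         + grand) ** 2)
    ss_res = np.sum((x - cell_means[..., None]) ** 2)
    ss_tot = np.sum((x - grand) ** 2)
    return {"rows": ss_rows, "columns": ss_cols, "interaction": ss_int,
            "residual": ss_res, "total": ss_tot}

# quick check on simulated data: the components add up to the total
rng = np.random.default_rng(4)
ss = two_way_anova_ss(rng.normal(size=(2, 3, 4)))
print(ss)
print(np.isclose(sum(v for k, v in ss.items() if k != "total"), ss["total"]))
```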

13.5. Multivariate Extension of the Two-Way Layout

Instead of studying a single real scalar variable, we consider a p  × 1 vector of real scalar variables. The multivariate two-way classification fixed effect model is the following:

for i  = 1, …, r , j  = 1, …, s , k  = 1, …, t , where M , A i , B j , Γ ij and E ijk are all p  × 1 vectors. In this case, M is a general effect, A i is the deviation from the general effect due to the i -th row, B j is the deviation from the general effect due to the j -th column, Γ ij is the deviation from the general effect due to interaction between the rows and the columns and E ijk is the vector of the random or error component. For convenience, the two sets of treatments are referred to as rows and columns, the first set as rows and the second, as columns. In a two-way layout, two sets of treatments are tested. As in the scalar case of Sect. 13.4 , we can assume, without any loss of generality, that \(\sum _iA_i=A_1+\cdots +A_r=A_{.}=O,\ B_{.}=O,\ \sum _{i=1}^r\varGamma _{ij}=\varGamma _{.j}=O \) and \( \sum _{j=1}^s\varGamma _{ij}=\varGamma _{i.}=O\) . At this juncture, the procedures are parallel to those developed in Sect. 13.4 for the real scalar variable case. Instead of sums of squares, we now have sums of squares and cross products matrices. As before, we may write M ij  =  M  +  A i  +  B j  +  Γ ij . Then, the trace of the sum of squares and cross products error matrix \(E_{ijk}E_{ijk}^{\prime }\) is minimized. Using the vector derivative operator, we have

so that the residual sum of squares and cross products matrix, denoted by S res , is

All other derivations are analogous to those provided in the real scalar case. The sum of squares and cross products matrix due to interaction, denoted by S int is the following:

The sum of squares and cross products matrices due to the rows and columns are respectively given by

The sum of squares and cross products matrix for the subtotal is denoted by S sub  =  S row  +  S col  +  S int . The total sum of squares and cross products matrix, denoted by S tot , is the following:

We may now construct the MANOVA table. The following abbreviations are used: df stands for degrees of freedom of the corresponding Wishart matrix, SSP means the sum of squares and cross products matrix, MS stands for mean squares and is equal to SSP/df, and S row , S col , S res and S tot are respectively specified in ( 13.5.4 ), ( 13.5.5 ), ( 13.5.2 ) and ( 13.5.6 ).
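Before presenting the table, here is a minimal numpy sketch (a hedged companion, not the text's own code) that assembles these SSP matrices from a data array of shape (r, s, t, p). Since the displayed equations (13.5.2)–(13.5.6) are not reproduced in this excerpt, the expressions below are the standard multivariate analogues of the scalar sums of squares of Sect. 13.4 and should be read as an assumption.

```python
import numpy as np

def two_way_manova_ssp(X):
    """X has shape (r, s, t, p): the p-vector observed for replicate k of
    cell (i, j). Returns (S_row, S_col, S_int, S_res, S_tot), with outer
    products replacing the squares of the scalar case."""
    r, s, t, p = X.shape
    grand = X.mean(axis=(0, 1, 2))
    row_m = X.mean(axis=(1, 2))            # (r, p)
    col_m = X.mean(axis=(0, 2))            # (s, p)
    cell_m = X.mean(axis=2)                # (r, s, p)

    def outer_sum(D):                      # sum of d d' over all leading axes
        D2 = D.reshape(-1, p)
        return D2.T @ D2

    S_row = s * t * outer_sum(row_m - grand)
    S_col = r * t * outer_sum(col_m - grand)
    S_int = t * outer_sum(cell_m - row_m[:, None, :] - col_m[None, :, :] + grand)
    S_res = outer_sum(X - cell_m[:, :, None, :])
    S_tot = outer_sum(X - grand)
    return S_row, S_col, S_int, S_res, S_tot

# sanity check: S_row + S_col + S_int + S_res should equal S_tot
rng = np.random.default_rng(5)
S_row, S_col, S_int, S_res, S_tot = two_way_manova_ssp(rng.normal(size=(2, 3, 4, 3)))
print(np.allclose(S_row + S_col + S_int + S_res, S_tot))
```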

MANOVA Table for a Two-Way Layout

Variation due to | df (1) | SSP (2) | MS (3) = (2)/(1)
rows | r − 1 | \(S_{row}\) | \(S_{row}/(r-1)\)
columns | s − 1 | \(S_{col}\) | \(S_{col}/(s-1)\)
interaction | (r − 1)(s − 1) | \(S_{int}\) | \(S_{int}/[(r-1)(s-1)]\)
subtotal | rs − 1 | \(S_{sub}\) |
residuals | rs(t − 1) | \(S_{res}\) | \(S_{res}/[rs(t-1)]\)
total | rst − 1 | \(S_{tot}\) |

13.5.1. Likelihood ratio test for multivariate two-way layout

Under the assumption that the error or random components \(E_{ijk}\overset {iid}{\sim } N_p(O,\varSigma ),\ \varSigma >O\) for all i , j and k , the exponential part of the multivariate normal density excluding \(-\frac {1}{2}\) is obtained as follows:

Thus, the joint density of all the X ijk ’s, denoted by L , is

The maximum likelihood estimates of M , A i , B j and Γ ij are the same as the least squares estimates and hence, the maximum likelihood estimator (MLE) of Σ is the least squares minimum which is the residual sum of squares and cross products matrix S or S res (in the present notation), where

This is the sample sum of squares and the cross products matrix under the general model and its determinant raised to the power of \(\frac {rst}{2}\) is the quantity appearing in the numerator of the likelihood ratio criterion λ . Consider the hypothesis H o  :  Γ ij  =  O for all i and j . Then, under this hypothesis, the estimator of Σ is S 0 , where

and \(|S_0|{ }^{\frac {rst}{2}}\) is the quantity appearing in the denominator of λ . However, S 0  −  S res  =  S int is the sum of squares and cross products matrix due to the interaction terms Γ ij ’s or to the hypothesis, so that S 0  =  S res  +  S int . Therefore, λ is given by

Letting \(w=\lambda ^{\frac {2}{rst}}\) ,

It follows from results derived in Chap. 5 that S res  ∼  W p ( rs ( t  − 1), Σ ), S int  ∼  W p (( r  − 1)( s  − 1), Σ ) under the hypothesis and S res and S int are independently distributed and hence, under H o ,

with the parameters \((\frac {rs(t-1)}{2}, \frac {(r-1)(s-1)}{2})\) . As well,

with the parameters \((\frac {(r-1)(s-1)}{2},\frac {rs(t-1)}{2})\) . Under H o , the h -th arbitrary moments of w and λ , which are readily obtained from those of a real matrix-variate type-1 beta variable, are

where \(\nu _1=\frac {rs(t-1)}{2} \) and \( \nu _2=\frac {(r-1)(s-1)}{2}\) . Note that we reject the null hypothesis H o  :  Γ ij  =  O , i  = 1, …, r , j  = 1, …, s , for small values of w and λ . As explained in Sect. 13.3 , the exact general density of w in ( 13.5.10 ) can be expressed in terms of a G-function and the exact general density of λ in ( 13.5.11 ) can be written in terms of an H-function. For the theory and applications of the G-function and the H-function, the reader may respectively refer to Mathai ( 1993 ) and Mathai et al. ( 2010 ).

13.5.2. Asymptotic distribution of λ in the MANOVA two-way layout

Consider the arbitrary h -th moment specified in ( 13.5.11 ). On expanding all the gamma functions for large values of rst in the constant part and for large values of rst (1 +  h ) in the functional part by applying Stirling’s formula or using the first term in the asymptotic expansion of a gamma function referring to ( 13.3.13 ), it can be verified that the h -th moment of λ behaves asymptotically as follows:

Thus, for large values of rst , one can utilize this real scalar chisquare approximation for testing the hypothesis H o  :  Γ ij  =  O for all i and j . We can work out a large number of exact distributions of w of ( 13.5.10 ) for special values of r , s , t , p . Observe that

where C is the normalizing constant such that when h  = 0, E [ w h ] = 1. Thus, when ( r  − 1)( s  − 1)∕2 is a positive integer, that is, when r or s is odd, the gamma functions cancel out, leaving a number of factors in the denominator which can be written as a sum by applying the partial fractions technique. For small values of p , the exact density will then be expressible as a sum involving only a few terms. For larger values of p , there will be repeated factors in the denominator, which complicates matters.
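In practice, the test of H o  :  Γ ij  =  O based on the asymptotic result of Sect. 13.5.2 reduces to a determinant ratio and a chi-square comparison: w = |S res |∕|S res  + S int |, λ = w^{rst∕2}, and −2 ln λ = −rst ln w is compared with a chi-square point with p(r − 1)(s − 1) degrees of freedom. The sketch below assumes that S res and S int are already available (for instance, from the SSP sketch given earlier); the example matrices are simulated.

```python
import numpy as np
from scipy.stats import chi2

def interaction_lrt(S_res, S_int, r, s, t, p, alpha=0.05):
    """Likelihood ratio test of H_o: Gamma_ij = O via the asymptotic
    chi-square approximation: w = |S_res| / |S_res + S_int|,
    lambda = w**(rst/2), and -2 ln(lambda) ~ chi-square with
    p(r-1)(s-1) degrees of freedom for large rst."""
    w = np.linalg.det(S_res) / np.linalg.det(S_res + S_int)
    minus_2_log_lambda = -r * s * t * np.log(w)
    df = p * (r - 1) * (s - 1)
    crit = chi2.ppf(1 - alpha, df)
    return w, minus_2_log_lambda, crit, minus_2_log_lambda >= crit

# hypothetical SSP matrices (p = 3), purely for illustration
rng = np.random.default_rng(6)
E1 = rng.standard_normal((3, 18)); E2 = rng.standard_normal((3, 5))
print(interaction_lrt(E1 @ E1.T, E2 @ E2.T, r=2, s=3, t=4, p=3))
```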

13.5.3. Exact densities of w in some special cases

We will consider several special cases of the h -th moment of w as given in ( 13.5.13 ).

Case (1): p  = 1. In this case, the h -th moment becomes

where C 1 is the associated normalizing constant. This is the h -th moment of a real scalar type-1 beta random variable with the parameters \((\frac {rs(t-1)}{2},\frac {(r-1)(s-1)}{2})\) . Hence \(y=\frac {1-w}{w}\) is a real scalar type-2 beta with parameters \((\frac {(r-1)(s-1)}{2}, \frac {rs(t-1)}{2})\) , and

Accordingly, the test can be carried out by using this F -statistic. One would reject the null hypothesis H o  :  Γ ij  =  O if the observed F  ≥  F ( r −1)( s −1), rs ( t −1), α where F ( r −1)( s −1), rs ( t −1), α is the upper 100 α% percentile of this F -density. For example, for r  = 2, s  = 3, t  = 3 and α  = 0.05, we have F 2,12,0.05  = 19.4 from F-tables so that H o would be rejected if the observed value of F 2,12  ≥ 19.4 at the specified significance level.

Case (2): p  = 2. In this case, we have a ratio of two gamma functions differing by \(\frac {1}{2}\) . Combining the gamma functions in the numerator and in the denominator by using the duplication formula and proceeding as in Sect. 13.3 for the one-way layout, the statistic \(t_1=\frac {1-\sqrt {w}}{\sqrt {w}}\) , and we have

so that the decision can be made as in Case (1).

Case (3): ( r  − 1)( s  − 1) = 1 ⇒  r  = 2, s  = 2. In this case, all the gamma functions in ( 13.5.13 ) cancel out except the last one in the numerator and the first one in the denominator. This gamma ratio is that of a real scalar type-1 beta random variable with the parameters \((\frac {rs(t-1)+1-p}{2},\frac {p}{2})\) , and hence \(y=\frac {1-w}{w}\) is a real scalar type-2 beta so that

and a decision can be made by making use of this F -distribution as in Case (1).

Case (4): ( r  − 1)( s  − 1) = 2. In this case,

with the corresponding normalizing constant C 1 . This product of p factors can be expressed as a sum by using partial fractions. That is,

Thus, the density of w , denoted by f w ( w ), which is available from (i) and (ii) , is the following:

and zero elsewhere. Some additional special cases could be worked out but the expressions would become complicated. For large values of rst , one can apply the asymptotic chisquare result given in ( 13.5.12 ) for testing the hypothesis H o  :  Γ ij  =  O .

Example 13.5.1

An experiment is conducted among heart patients to stabilize their systolic pressure, diastolic pressure and heart rate or pulse around the standard numbers which are 120, 80 and 60, respectively. A random sample of 24 patients who may be considered homogeneous with respect to all factors of variation, such as age, weight group, race, gender, dietary habits, and so on, is selected. These 24 individuals are randomly divided into two groups of equal size. One group of 12 subjects is given the medication combination Med-1 and the other 12 are administered the medication combination Med-2. Then, the Med-1 group is randomly divided into three subgroups of 4 subjects. These subgroups are assigned exercise routines Ex-1, Ex-2, Ex-3. Similarly, the Med-2 group is also divided at random into 3 subgroups of 4 individuals who are respectively subjected to exercise routines Ex-1, Ex-2, Ex-3. After one week, the following observations are made: x 1  =  current reading on systolic pressure minus 120, x 2  =  current reading on diastolic pressure minus 80, x 3  =  current reading on heart rate minus 60. The structure of the two-way data layout is as follows:


Let X ijk be the k -th vector in the i -th row ( i -th medication) and j -th column ( j -th exercise routine). For convenience, the data are presented in matrix form:

(1) Perform a two-way ANOVA on the first component, namely, x 1 , the current reading minus 120; (2) Carry out a MANOVA on the full data.

Solution 13.5.1

We need the following quantities:


By using the first elements in all these vectors, we will carry out a two-way ANOVA and answer the first question. Since these are all observations on real scalar variables, we will utilize lower-case letters to indicate scalar quantities. Thus, we have the following values:

All quantities have been calculated separately in order to verify the computations. We could have obtained the interaction sum of squares from the subtotal sum of squares minus the sum of squares due to rows and columns. Similarly, we could have obtained the residual sum of squares from the total sum of squares minus the subtotal sum of squares. We will set up the ANOVA table, where, as usual, df stands for degrees of freedom, SS means sum of squares and MS denotes mean squares:

ANOVA Table for a Two-Way Layout with Interaction

Variation due to | df (1) | SS (2) | MS (3) = (2)/(1) | F-ratio
rows | 1 | 24 | 24 | 24/5.78
columns | 2 | 4 | 2 | 2/5.78
interaction | 2 | 4 | 2 | 2/5.78
subtotal | 5 | 32 | |
residuals | 18 | 104 | 5.78 |
total | 23 | 136 | |

For testing the hypothesis of no interaction, the F -value at the 5% significance level is F 2,18,0.05  ≈ 19. The observed value of this F 2,18 being \(\frac {2}{5.78}\approx 0.35<19\) , the hypothesis of no interaction is not rejected. Thus, we can test for the significance of the row and column effects. Consider the hypothesis α 1  =  α 2  = 0. Then under this hypothesis and no interaction hypothesis, the F -ratio for the row sum of squares is 24∕5.78 ≈ 4.15 < 240 =  F 1,18,0.05 , the tabulated value of F 1,18 at α  = 0.05. Therefore, this hypothesis is not rejected. Now, consider the hypothesis β 1  =  β 2  =  β 3  = 0. Since under this hypothesis and the hypothesis of no interaction, the F -ratio for the column sum of squares is \(\frac {2}{5.78}=0.35<19=F_{2,18,0.05}\) , it is not rejected either. Thus, the data show no significant interaction between exercise routine and medication, and no significant effect of the exercise routines or the two combinations of medications in bringing the systolic pressures closer to the standard value of 120.
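The univariate part of this analysis can also be reproduced with standard software. The sketch below uses statsmodels on a simulated data frame that merely mimics the 2 × 3 × 4 structure of Example 13.5.1; the example's actual observations are not reproduced in this excerpt, so the numbers (and the resulting table) are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical layout mimicking Example 13.5.1: 2 medications x 3 exercise
# routines x 4 replicates; 'x1' stands for the systolic reading minus 120.
rng = np.random.default_rng(7)
rows = []
for med in ("Med-1", "Med-2"):
    for ex in ("Ex-1", "Ex-2", "Ex-3"):
        for _ in range(4):
            rows.append({"med": med, "ex": ex, "x1": rng.normal(5, 2)})
df = pd.DataFrame(rows)

model = ols("x1 ~ C(med) * C(ex)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # SS, df, F and p-values for rows, columns, interaction
```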

We now carry out the computations needed to perform a MANOVA on the full data. We employ our standard notation by denoting vectors and matrices by capital letters. The sum of squares and cross products matrices for the rows and columns are the following, respectively denoted by S row and S col :


We can verify the computations done so far as follows. The sum of squares and cross product matrices ought to be such that S row  +  S col  +  S int  =  S sub . These are


Hence the result is verified. Now, the total and residual sums of squares and cross product matrices are


The above results are included in the following MANOVA table where df means degrees of freedom, SSP denotes a sum of squares and cross products matrix and MS is equal to SSP divided by the corresponding degrees of freedom:

MANOVA Table for a Two-Way Layout with Interaction

Variation due to | df (1) | SSP (2) | MS (3) = (2)/(1)
rows | 1 | \(S_{row}\) | \(S_{row}\)
columns | 2 | \(S_{col}\) | \(\frac {1}{2}S_{col}\)
interaction | 2 | \(S_{int}\) | \( \frac {1}{2}S_{int}\)
subtotal | 5 | \(S_{sub}\) |
residuals | 18 | \(S_{res}\) | \(\frac {1}{18}S_{res}\)
total | 23 | \(S_{tot}\) |

Then, the λ -criterion is

The determinants are as follows:


We have explicit simple representations of the exact densities for the special cases p  = 1, p  = 2, t  = 2, t  = 3. However, our situation being p  = 3, t  = 4, they do not apply. A chisquare approximation is available for large values of rst , but our rst is only equal to 24. In this instance, \(-2\ln \lambda \to \chi ^2_{p(r-1)(s-1)}\simeq \,\chi ^2_{6}\) as rst  → ∞ . However, since the observed value of \(-2\ln \lambda =2.8508\) happens to be much smaller than the critical value resulting from the asymptotic distribution, which is \(\chi ^2_{6,0.05}=12.59\) in this case, we can still safely decide not to reject the hypothesis H o  :  Γ ij  =  O for all i and j , and go ahead and test for the main row and column effects, that is, the main effects of medical combinations Med-1 and Med-2 and the main effects of exercise routines Ex-1, Ex-2 and Ex-3. For testing the row effect, our hypothesis is A 1  =  A 2  =  O and for testing the column effect, it is B 1  =  B 2  =  B 3  =  O , given that Γ ij  =  O for all i and j . The corresponding likelihood ratio criteria are respectively,

and we may utilize \(w_j=\lambda _j^{\frac {2}{rst}},\ j=1,2\) . From previous calculations, we have


The required determinants are as follows:

When \(rst\to \infty ,-2\ln \lambda _1\to \chi ^2_{p(r-1)}\) and \( -2\ln \lambda _2\to \chi ^2_{p(s-1)}\) , referring to Exercises 13.5.9 and 13.5.10, respectively. These results follow from the asymptotic expansion provided in Sect. 12.5.2 . Even though rst  = 24 is not that large, we may use these chisquare approximations for making decisions as the exact densities of w 1 and w 2 do not fall into the special cases previously discussed. When making use of the likelihood ratio criterion, we reject the hypotheses A 1  =  A 2  =  O and B 1  =  B 2  =  B 3  =  O for small values of λ 1 and λ 2 , respectively, which translates into large values of the approximate chisquare values. It is seen from (i) that the observed value \(-2\ln \lambda _1\) is larger than the tabulated critical value and hence we reject the hypothesis A 1  =  A 2  =  O at the 5% level. However, the hypothesis B 1  =  B 2  =  B 3  =  O is not rejected since the observed value is less than the critical value. We may conclude that the present data does not show any evidence of interaction between the exercise routines and medication combinations, that the exercise routine does not contribute significantly to bringing the subjects’ initial readings closer to the standard values (120, 80, 60), whereas there is a possibility that the medical combinations Med-1 and Med-2 are effective in significantly causing the subjects’ initial readings to approach standard values.
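For the multivariate part of such an analysis, statsmodels also provides a MANOVA class that reports Wilks' lambda and the related criteria for the interaction and the main effects. As before, the data frame below only mimics the structure of Example 13.5.1 with simulated values; it is a hedged sketch, not a reproduction of the example's computations.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Hypothetical data frame with the same structure as Example 13.5.1
# (three responses x1, x2, x3; factors 'med' and 'ex'); values are simulated.
rng = np.random.default_rng(8)
rows = []
for med in ("Med-1", "Med-2"):
    for ex in ("Ex-1", "Ex-2", "Ex-3"):
        for _ in range(4):
            x = rng.normal([5.0, 3.0, 2.0], 2.0)
            rows.append({"med": med, "ex": ex, "x1": x[0], "x2": x[1], "x3": x[2]})
df = pd.DataFrame(rows)

# Wilks' lambda (and related criteria) for the interaction and main effects
fit = MANOVA.from_formula("x1 + x2 + x3 ~ C(med) * C(ex)", data=df)
print(fit.mv_test())
```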

Note 13.5.1

It may be noticed from the MANOVA table that the second stage analysis will involve one observation per cell in a two-way layout, that is, the ( i , j )-th cell will contain only one observation vector X ij . for the second stage analysis. Thus, S sub  =  S int  +  S row  +  S col (the corresponding sum of squares in the real scalar case), and in this analysis with a single observation per cell, S int acts as the residual sum of squares and cross products matrix (the residual sum of squares in the real scalar case). Accordingly, “interaction” cannot be tested when there is only a single observation per cell.

In the ANOVA table obtained in Example 13.5.1 , prove that (1) the sum of squares due to interaction and the residual sum of squares, (2) the sum of squares due to rows and residual sum of squares, (3) the sum of squares due to columns and residual sum of squares, are independently distributed under the normality assumption for the error variables, that is, \(e_{ijk}\overset {iid}{\sim } N_1(0,\sigma ^2),\ \sigma ^2>0\) .

In the MANOVA table obtained in Example 13.5.1 , prove that (1) S int and S res , (2) S row and S res , (3) S col and S res , are independently distributed Wishart matrices when \(E_{ijk}\overset {iid}{\sim } N_p(O,\varSigma ),\ \varSigma >O\) .

In a one-way layout, the following are the data on four treatments. (1) Carry out a complete ANOVA on the first component (including individual comparisons if the hypothesis of no interaction is not rejected). (2) Perform a full MANOVA on the full data.


Carry out a full one-way MANOVA on the following data:


The following are the data on a two-way layout where A ij denotes the data on the i -th row and j -th column cell. (1) Perform a complete ANOVA on the first component. (2) Carry out a full MANOVA on the full data. (3) Verify that S row  +  S col  +  S int  =  S sub and S sub  +  S res  =  S tot , (4) Evaluate the exact density of w .

Carry out a complete MANOVA on the following data where A ij indicates the data in the i -th row and j -th column cell.

Under the hypothesis A 1  = ⋯ =  A r  =  O , prove that \(U_1=(S_{res}+S_{row})^{-\frac {1}{2}}S_{res}\) \((S_{res}+S_{row})^{-\frac {1}{2}}\) , is a real matrix-variate type-1 beta with the parameters \((\frac {rs(t-1)}{2},\frac {r-1}{2})\) for r  ≥  p , rs ( t  − 1) ≥  p , when the hypothesis Γ ij  =  O for all i and j is not rejected or assuming that Γ ij  =  O . The determinant of U 1 appears in the likelihood ratio criterion in this case.

Under the hypothesis B 1  = ⋯ =  B s  =  O when the hypothesis Γ ij  =  O is not rejected, or assuming that Γ ij  =  O , prove that \(U_2=(S_{res}+S_{col})^{-\frac {1}{2}}S_{res}(S_{res}+S_{col})^{-\frac {1}{2}}\) is a real matrix-variate type-1 beta random variable with the parameters \((\frac {rs(t-1)}{2},\ \frac {s-1}{2})\) for s  ≥  p , rs ( t  − 1) ≥  p . The determinant of U 2 appears in the likelihood ratio criterion for testing the main effect B j  =  O , j  = 1, …, s .

Show that when \(rst\to \infty , -2\ln \lambda _1\to \chi ^2_{p(r-1)}\) , that is, \( -2\ln \lambda _1\) asymptotically tends to a real scalar chisquare having p ( r  − 1) degrees of freedom, where λ 1  = | U 1 | and U 1 is as defined in Exercise 13.7 . [Hint: Look into the general h -th moment of λ 1 in this case, which can be evaluated by using the density of U 1 ]. Hence for large values of rst , one can use this approximate chisquare distribution for testing the hypothesis A 1  = ⋯ =  A r  =  O .

Show that when \(rst\to \infty , -2\ln \lambda _2\to \chi ^2_{p(s-1)},\) that is, \(-2\ln \lambda _2\) asymptotically converges to a real scalar chisquare having p ( s  − 1) degrees of freedom, where λ 2  = | U 2 | with U 2 as defined in Exercise 13.8 . [Hint: Look at the h -th moment of λ 2 ]. For large values of rst , one can utilize this approximate chisquare distribution for testing the hypothesis B 1  = ⋯ =  B s  =  O .

References

A.M. Mathai (1993): A Handbook of Generalized Special Functions for Statistical and Physical Sciences, Oxford University Press, Oxford.

A.M. Mathai and H.J. Haubold (2017): Probability and Statistics: A Course for Physicists and Engineers, De Gruyter, Germany.

A.M. Mathai, R.K. Saxena and H.J. Haubold (2010): The H-function: Theory and Applications, Springer, New York.


Author information

Authors and Affiliations

Mathematics and Statistics, McGill University, Montreal, Canada

Arak Mathai

Statistical and Actuarial Sciences, The University of Western Ontario, London, ON, Canada

Serge Provost

Vienna International Centre, UN Office for Outer Space Affairs, Vienna, Austria

Hans Haubold


Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


Copyright information

© 2022 The Author(s)

About this chapter

Mathai, A., Provost, S., Haubold, H. (2022). Chapter 13: Multivariate Analysis of Variation. In: Multivariate Statistical Analysis in the Real and Complex Domains. Springer, Cham. https://doi.org/10.1007/978-3-030-95864-0_13


Published: 22 February 2022

Publisher Name: Springer, Cham

Print ISBN: 978-3-030-95863-3

Online ISBN: 978-3-030-95864-0


A/B Testing Vs. Multivariate Testing: Which One Is Better

March 5, 2024

Khalid Saleh

Khalid Saleh is CEO and co-founder of Invesp. He is the co-author of Amazon.com bestselling book: "Conversion Optimization: The Art and Science of...



A/B testing vs. multivariate testing? This question plagues every CRO professional every once in a while. 

When optimizing your digital assets, knowing whether to use A/B or multivariate testing is critical. 

Are you looking to quickly determine the superior version of a webpage for a low-traffic site? A/B testing is your go-to. 

Or do you aim to dissect complex interactions between various elements on a high-traffic page? Then multivariate testing will provide the in-depth analysis you need. 

This guide breaks down each method and offers strategic insights into deploying them for maximum conversion optimization.

TL;DR? Here are some quick takeaways: 

  • A/B vs. Multivariate: A Quick Comparison: A/B testing is ideal for testing two versions of a single variable and requires less traffic. Conversely, multivariate testing involves testing multiple variables and their interactions but needs a higher traffic volume to provide significant results.
  • Formulating a SMART Hypothesis: Both methods require a clear, evidence-based hypothesis following the SMART framework to predict the outcome and define the changes, expected impact, and metrics for measurement.
  • Analyzing Test Results for Actionable Insights: Analyzing results involves tools like heat maps and session recordings. A/B testing emphasizes statistical significance, while multivariate testing focuses on element interactions.

Decoding A/B and Multivariate Testing: The Essentials

A/B Testing: also known as split testing, this method compares two versions of a digital element to determine which performs better with the target audience.

How A/B testing works

It effectively optimizes various marketing efforts, including emails, newsletters, ads, and website elements. A/B testing is particularly useful when you need quick feedback on two distinct designs or for websites with lower traffic.

Key aspects of A/B testing: 

  • Controlled Comparison: Craft two different versions and evaluate them side by side while keeping all other variables constant.
  • Sample Size: Utilizing an adequate sample size to ensure reliable and accurate findings.
  • Qualitative Evaluation: Use tools like heat maps and session recordings to gain insights into user interactions with different variations.

Multivariate Testing: 

Multivariate testing takes it up a notch by evaluating multiple page elements simultaneously to uncover the most effective combination that maximizes conversion rates.

How multivariate testing works

By using multivariate testing, you can gain valuable insights into how different elements or variables impact user experience and optimize your website or product accordingly.

Key aspects of multivariate testing: 

  • Multiple Element Testing: Running tests to evaluate different combinations of elements.
  • Interaction Analysis: Understanding how variables interact with each other.
  • Comprehensive View: Providing insights into visitor behavior and preference patterns.
  • High Traffic Requirement: Demanding substantial web traffic due to increased variations.
  • Potential Bias: Focusing excessively on design-related problems and underestimating UI/UX elements’ impact.

Unlike A/B testing, which compares two variations, MVT changes more than one variable to test all resulting combinations simultaneously. It provides a comprehensive view of visitor behavior and preference patterns, making it ideal for testing different combinations of elements or variables.

A/B Testing vs. Multivariate Testing: Choosing the Right Method

Deciding between multivariate and A/B testing depends on the complexity of the tested elements and the ease of implementation. 

A/B testing is more straightforward and suitable for quick comparisons, while multivariate testing offers more comprehensive insights but requires more traffic and careful consideration of potential biases.

Designing Your Experiment: A/B vs. Multivariate

Choosing between A/B and multivariate testing depends on traffic, complexity, and goals. 

A/B testing is ideal for limited traffic due to its simplicity and clear outcomes. Multivariate testing offers detailed insights but requires more effort and time. 

However, before you set up either of the testing types, you’ll have to form a hypothesis. In the case of multivariate testing, you’ll also need to identify a number of variables you intend to test.

Crafting a Hypothesis for Effective Testing

Prior to commencing your A/B or multivariate testing, it’s imperative to construct a hypothesis. This conjecture about the potential influence of alterations on user behavior is crucial for executing substantive tests. 

An articulate hypothesis will include:

  • The specific modification under examination
  • The anticipated effect of this modification
  • The measurement that will be employed to evaluate said effect
  • The evidence and justification on which the prediction is based.

A compelling hypothesis also embraces the SMART criteria: Specificity, Measurability, Actionability, Relevance, and Testability.

It integrates quantitative data and qualitative insights to guarantee that the supposition is grounded in reality, predicated upon hard facts, and pertinent to the variables being examined.

A/B testing vs. Multivariate testing hypothesis example: 

For example, if you’re running an A/B test, your hypothesis could be: 

Changing the CTA button of the existing landing page from blue to orange will increase the click-through rate by 10% within one month, based on previous test results and user feedback favoring brighter colors.

If you’re running a multivariate test, your hypothesis could be:

Testing different combinations of headline, hero image, and CTA button style on the homepage will result in a winning combination that increases the conversion rate by 15% within two weeks, supported by prior test results and user preferences.

Identifying Variables for Your Test

Selecting the correct multiple variables to assess in a multivariate experiment is crucial. Each variable should have solid backing based on business objectives and expected influence on outcomes. When testing involving multiple variables, it’s essential to rigorously evaluate their possible effect and likelihood of affecting targeted results.

Variation ideas for inclusion in multivariate testing ought to stem from an analysis grounded in data, which bolsters their potential ability to positively affect conversion rates. Adopting this strategy ensures that the selected variables are significant and poised to yield insightful findings.

Setting Up A/B Tests

To implement an A/B testing protocol, one must:

  • Formulate a Hypothesis: Clearly define the problem you want to address and create a testable hypothesis (we’ve already done it in the above section).
  • Identify the Variable: Select the single element you want to test. This could be a headline, button color, image placement, or any other modifiable aspect.
  • Create Variations: Develop two versions of the element: the control (original) and the variant (modified). Ensure the change is significant enough to measure a potential impact.
  • Random Assignment: Distribute your sample randomly into two segments to assess the performance of the control version relative to that of its counterpart. By doing so, you minimize any distortion in outcomes due to external influences.
  • Determine Sample Size: Calculate the required sample size to achieve statistical significance. This depends on factors like desired confidence level, expected effect size, and existing conversion rate.
  • Run the Test: Finally, implement the test and allow it to run for a predetermined duration or until the desired sample size is reached.
  • Analyze Results: Collect and analyze data on relevant metrics (click-through rates, conversions, etc.). Use statistical analysis to determine if the observed differences are significant.

For a more detailed overview of how to run and set up A/B tests, check out our ultimate guide to A/B testing . 
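As a concrete illustration of the "Determine Sample Size" step above, the following Python sketch uses statsmodels to estimate the number of visitors needed per variant. The baseline rate, expected uplift, significance level and power are hypothetical figures chosen only for the example.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical figures: baseline conversion rate of 4%, and we want to detect
# an uplift to 5% at a 5% significance level with 80% power.
baseline, target = 0.04, 0.05
effect = proportion_effectsize(target, baseline)
n_per_variant = NormalIndPower().solve_power(effect_size=effect,
                                             alpha=0.05, power=0.8, ratio=1.0)
print(f"Visitors needed per variant: {n_per_variant:.0f}")
```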

Setting up Multivariate Tests

To set up multivariate tests: 

  • Identify Multiple Variables: Select multiple elements you want to test simultaneously. This could involve testing variations of headlines, images, button colors, and other factors.
  • Create Combinations: Generate all possible combinations of the selected elements. For example, if you’re testing two headlines and two button colors, you’ll have four combinations to test.

After this, all the steps remain the same as in the A/B test implementation, including randomly assigning the audience to the different combinations, determining the sample size, and finally running the test. 
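To illustrate the "Create Combinations" step, a few lines of Python with itertools.product enumerate every variation; the element names below are hypothetical.

```python
from itertools import product

# Hypothetical elements for a multivariate test: 2 headlines x 2 hero images
# x 2 CTA button styles = 8 combinations to serve.
headlines = ["Save time today", "Built for busy teams"]
hero_images = ["photo", "illustration"]
cta_styles = ["solid orange", "outlined blue"]

combinations = list(product(headlines, hero_images, cta_styles))
for i, combo in enumerate(combinations, 1):
    print(i, combo)
print(f"{len(combinations)} variations in total")
```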

Pro Tip: Implement trigger settings to specify when variations appear to users, and use fractional factorial testing to manage traffic distribution among variations. During the multivariate test, systematically evaluate the impact of variations and consider eliminating low-performing ones after reaching the minimum sample size.

Analyzing Test Outcomes for Data-Driven Decisions

Finally, it’s time to analyze your results. 

For a thorough assessment of user interactions after A/B and multivariate testing sessions, two tools are particularly useful:

  • Session recordings
  • Form Analytics

They serve as indispensable tools, allowing you to observe real-time engagement metrics and to dissect and interpret findings once an A/B test has reached statistical significance.
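Once the test has run, the headline A/B comparison is typically a two-proportion z-test. The sketch below uses statsmodels with hypothetical conversion counts.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: the variant converted 250 of 5,000 visitors,
# the control converted 200 of 5,000 visitors.
conversions = [250, 200]
visitors = [5000, 5000]
z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# A p-value below the chosen significance level (e.g. 0.05) indicates the
# difference in conversion rates is statistically significant.
```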

Making Sense of Multivariate Test Data

Interpreting multivariate test data calls for a distinct methodology. In multivariate testing, it is essential to evaluate the collective impact of various landing page elements on user behavior and conversion rates rather than examining aspects in isolation. 

This testing method provides comprehensive insights into how different elements interact, allowing teams to discover effects between variables that could lead to further optimization.

When assessing multivariate test data, it’s necessary to:

  • Identify the combinations of page elements that lead to the highest conversions
  • Recognize the elements that contribute least to the site’s conversions
  • Discover the best-performing combination of tested page elements so that conversions can be increased

This process helps optimize your website’s performance and improve your conversion rate.

Common Pitfalls in A/B and Multivariate Testing

Both testing methods offer valuable insights, but they also share some pitfalls to avoid. 

Here are some common mistakes to avoid when setting up your A/B or multivariate tests:

  • Insufficient Traffic: Not gathering enough traffic can lead to statistically insignificant results and unreliable conclusions.
  • Ignoring External Factors: Overlooking seasonal trends, market shifts, or other external influences can skew results and lead to inaccurate interpretations.
  • Technical Issues: Testing tools can sometimes impact website speed, affecting user behavior and compromising test results. Ensure your tools don’t interfere with the natural user experience.

A/B Testing vs. Multivariate Testing: Final Verdict 

A/B and multivariate testing are potent methods that can transform how you approach digital marketing. By comparing different variations, whether it’s two in A/B testing or multiple in multivariate testing, you can gain valuable insights into what resonates with your audience.

The key is to embrace a culture of experimentation, value data over opinions, and constantly learn from your tests. This approach can optimize your strategy, boost your results, and ultimately drive your business forward.

Frequently Asked Questions

What is the main difference between A/B and multivariate testing?

Multivariate testing evaluates several elements at the same time in order to determine which combination yields the most favorable results, whereas A/B testing contrasts only two variations.

Recognizing this distinction will assist you in determining the appropriate method for your particular experimentation requirements.

When should I use A/B testing over multivariate testing?

When swift outcomes are needed from evaluating two distinct designs, or when your website experiences low traffic volumes, A/B testing is the method to employ.

On the other hand, if your intention is to examine several variations at once, multivariate testing could be a better fit for such purposes.

What factors should I consider when setting up an A/B test?

When setting up an A/B test, it’s crucial to consider the sample size for reliable results and precision, control the testing environment, and use tools for qualitative insights like session recordings. These factors will ensure the accuracy and effectiveness of your test.

How can I effectively analyze multivariate test data?

To thoroughly assess data from multivariate tests, consider how different combinations of page elements together influence user behavior and ultimately conversion rates. Determine which specific sets of page elements result in the most significant increase in conversions, while also noting which individual components contribute the least to overall site conversions.

What common mistakes should I avoid when conducting A/B and multivariate tests?

Ensure that you allow sufficient traffic to accumulate in order to reach statistical significance. It’s important to factor in external variables such as seasonal variations or shifts in the marketplace, and also be mindful of technical elements like how testing instruments might affect website performance. Overlooking these considerations may result in deceptive test outcomes and false interpretations, which could squander both time and investment.
