If I were to remark to you that “the weather is very nice today” or “I didn’t like that person”, it is unlikely that I would have made such statements based on a single variable. It is more likely that a combination of variables was evaluated to arrive at these statements. When we analyse datasets with multiple variables, we are undertaking Multivariate Statistical Analysis.
Multivariate Analysis comes in two flavours:
- Analysis of Correlations between Multiple Variables – Known as R-Analysis – Informally known as reducing the dimensionality of your dataset.
- Analysis of Distance between Many Objects – Known as Q-Analysis – Informally known as mapping, clustering or segmentation of your dataset.
At bottom, all multivariate analysis is exploratory analysis of high-dimensional data. There is very little in the way of formal modelling or hypothesis testing. What we are trying to do is find a suitable way to visualise and make sense of data that exists in multi-dimensional space. Often, we seek to reduce (or project) our n-dimensional data into a 2-dimensional chart or table that can be displayed on a screen, and a variety of methods exist to help us do this. This is why I like to say that multivariate analysis is more art than science, but if you have strong statistical thinking skills, you will be able to be more scientific in your interpretation of the data.
I have written the following blog posts using multivariate analysis methods and I hope you find these useful resources for learning more.
A. Principal Components Analysis (PCA)
PCA is a mainstay of R-Analysis, the task of reducing your n-dimensional dataset to fewer dimensions (ideally 2). As I said before, our weather is an inherently multivariate dataset and over the course of these 4 posts, I use UK seasonal weather data from the Met Office to explain how PCA works. Please note that each post opens with some seasonal trend analysis, so skip past this to get to the PCA part, which starts in the section headed “How many dimensions …”.
- What is a component?
- What are the advantages of principal components?
- Can PCA predict our summer weather?
- Sometimes PCA tells us nothing
Note that all of these posts also explain the concept of Standardisation (aka Z-Scores). This is an important concept to know about in multivariate analysis, so it is worth reading the start of these posts to find out more.
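Standardisation and the first principal component can be sketched in a few lines of plain Python. This is an illustration only: the figures are made up and are not Met Office data, the function names are hypothetical, and power iteration stands in for the full eigen-decomposition that a statistics package would normally perform.

```python
import math
import statistics


def standardise(column):
    """Convert raw values to z-scores: (value - mean) / standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample standard deviation
    return [(x - mean) / sd for x in column]


def correlation_matrix(z_columns):
    """Correlation between every pair of standardised columns."""
    n = len(z_columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in z_columns]
            for ci in z_columns]


def first_component(matrix, iterations=200):
    """Power iteration: largest eigenvalue and its unit-length eigenvector."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        new = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    eigenvalue = sum(v * sum(m * w for m, w in zip(row, vec))
                     for row, v in zip(matrix, vec))
    return eigenvalue, vec


# Made-up seasonal figures (assumption, for illustration only).
temperature = [14.1, 15.3, 16.0, 14.8, 15.9, 13.9]
sunshine = [480, 560, 610, 500, 590, 450]
rainfall = [260, 210, 180, 250, 190, 280]

z = [standardise(col) for col in (temperature, sunshine, rainfall)]
corr = correlation_matrix(z)
eigenvalue, loadings = first_component(corr)

# The trace of a correlation matrix equals the number of variables,
# so eigenvalue / number of variables is the share of variance explained.
share = eigenvalue / len(corr)
print("loadings:", [round(w, 2) for w in loadings])
print("share of variance:", round(share, 2))
```

Because the three made-up variables move together (warm seasons are sunny and dry), a single component carries most of the variation. Standardising first puts temperature, sunshine hours and rainfall on the same scale, so no variable dominates simply because of its units.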
Finally, there is a related method to PCA known as Factor Analysis. I am not a big fan of Factor Analysis but it has a lot of similarities to PCA. Please note that software packages will sometimes use Factor Analysis as a generic heading that includes PCA.
B. Multiple Correspondence Analysis (MCA)
PCA can only be used if your variables are all numerical. When we have categorical variables and we want to reduce the dimensionality of a categorical dataset, we need to use MCA instead. This method is not widely available in software packages but it produces similar outputs to PCA.
As yet I have not written any posts using MCA.
C. Multi-Dimensional Scaling (MDS)
The most basic method of mapping data so as to assess how far apart objects are from each other is MDS. The idea is relatively simple to explain and the following posts made use of MDS.
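To make the idea concrete, here is a minimal sketch of classical (Torgerson) MDS in plain Python. It starts from nothing but a table of distances between objects and recovers 2-dimensional coordinates that reproduce those distances. The function name and the example distances are assumptions for illustration, and the simple power iteration used here is a stand-in for the eigen-decomposition routine a statistics package would use.

```python
import math


def classical_mds(dist, dims=2, iterations=500):
    """Classical MDS: find coordinates whose distances match `dist`."""
    n = len(dist)
    squared = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in squared]
    grand_mean = sum(row_mean) / n
    # Double-centre the squared distances to obtain an inner-product matrix.
    b = [[-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[] for _ in range(n)]
    for _axis in range(dims):
        # Power iteration from a fixed start vector finds the largest eigenpair.
        vec = [float(i + 1) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _ in range(iterations):
            new = [sum(m * v for m, v in zip(row, vec)) for row in b]
            norm = math.sqrt(sum(x * x for x in new))
            if norm < 1e-12:  # nothing left to explain on this axis
                break
            vec = [x / norm for x in new]
        eigenvalue = sum(v * sum(m * w for m, w in zip(row, vec))
                         for row, v in zip(b, vec))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i].append(scale * vec[i])
        # Deflate: remove this axis so the next pass finds the next one.
        b = [[b[i][j] - eigenvalue * vec[i] * vec[j] for j in range(n)]
             for i in range(n)]
    return coords


# Made-up example: four objects whose distances come from a 4 x 1 rectangle.
true_points = [(0, 0), (4, 0), (0, 1), (4, 1)]
distances = [[math.dist(p, q) for q in true_points] for p in true_points]

layout = classical_mds(distances)
for point in layout:
    print([round(c, 2) for c in point])
```

The recovered map may be shifted, rotated or mirrored relative to the original layout, because only the distances between objects are preserved. That is why an MDS chart is read in terms of which objects sit close together, not where the axes point.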
D. Cluster Analysis (K-Means and AHC)
Cluster analysis is the mainstay of segmentation, the task of splitting a sample of objects into distinct groups. The two main methods are Agglomerative Hierarchical Clustering (AHC) and K-Means Clustering. Both methods can only be used with numerical datasets.
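As a rough illustration of K-Means, the sketch below alternates two steps until nothing changes: assign each object to its nearest cluster centre, then move each centre to the average of its members. The customer figures are made up, the function name is hypothetical, and starting from the first k objects is a simplifying assumption (real software usually picks starting centres more carefully).

```python
import math


def k_means(points, k, iterations=100):
    """Basic K-Means: returns a cluster label per point and the final centres."""
    centroids = [list(p) for p in points[:k]]  # naive start: first k points
    labels = None
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centre.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so stop
            break
        labels = new_labels
        # Step 2: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up customers: (visits per month, average spend in tens of pounds).
customers = [(1.0, 1.2), (5.0, 5.1), (1.1, 0.9), (0.8, 1.0), (5.2, 4.8), (4.9, 5.3)]

labels, centres = k_means(customers, k=2)
print("labels:", labels)
print("centres:", [[round(v, 2) for v in c] for c in centres])
```

Because the method relies on distances and averages, it needs numerical variables, and variables measured on very different scales should normally be standardised first.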
I have published the following on Cluster Analysis:
- A presentation given to EXISTA about how to segment customer databases.
E. Manual Clustering
It might seem that when you have complex data, you need complex statistical methods to make sense of its multivariate nature. Sometimes though, clusters can be found with a bit of statistical thinking and common sense. It can be a good idea to start a cluster analysis using manual methods before proceeding to the more complex methods. This is especially the case when your data consists of a lot of binary variables.
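One simple way to approach binary data by hand is to tabulate the distinct yes/no patterns and see how many objects share each one. The sketch below does this for made-up survey answers. It is an illustration under that assumption and not necessarily the approach taken in the posts listed here.

```python
from collections import defaultdict

# Made-up answers to three yes/no questions (1 = yes, 0 = no).
responses = {
    "Respondent 1": (1, 0, 1),
    "Respondent 2": (0, 1, 0),
    "Respondent 3": (1, 1, 1),
    "Respondent 4": (1, 0, 1),
    "Respondent 5": (0, 1, 0),
    "Respondent 6": (0, 1, 0),
}

# Group respondents who gave exactly the same pattern of answers.
groups = defaultdict(list)
for name, pattern in responses.items():
    groups[pattern].append(name)

# List the patterns from most to least common.
for pattern, members in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    print(pattern, len(members), members)
```

With only a handful of binary variables there are few possible patterns, so the largest groups often suggest natural clusters before any formal method is applied.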
I have written the following posts which use manual clustering.
- Find your way out of the Brexit Maze in 9 days!
- How to predict by-elections in the Brexit era – repeated in this YouTube clip
The first post was referred to in the Robert Peston show on ITV on 20th March 2019 (clip starts 34:40 in). Apparently it made me “geek of the week!”
F. Classification Modelling
Classification often makes use of clustering methods but the goal of classification is usually prediction, i.e. given what we know about a certain object, can we predict which category it will end up in? A method not mentioned so far that plays a large part in classification modelling is Discriminant Analysis, which includes an important method of multivariate analysis known as Canonical Variate Analysis (CVA).
As yet, I have not written any posts on Classification Models.
If you would like to book a training course in Multivariate Analysis, then please contact me.
I can recommend the following book: “Multivariate Statistical Analysis – A Conceptual Introduction” by Sam Kachigan. This does a very good job of keeping the maths to a minimum and focusing on the key concepts instead.
For more information about my other training courses in statistics, please visit my Statistical Training homepage.