Example: Inside Out
In the movie, the main character, Riley, has five archetypal emotions: Joy, Sadness, Fear, Anger, and Disgust.
![Archetypal emotions in the Inside Out movie]()
New emotions
However, as seen in Inside Out 2, not only do these emotions exist, but new emotions can also emerge by combining them:
![Mixed emotions in Inside Out 2]()
Combining archetypal emotions
Therefore, a wide range of new emotions can be obtained by combining the archetypal emotions:
![New emotions obtained from combining the archetypal emotions]()
Archetypal Analysis in Machine Learning
Motivation
![Number of publications containing the “archetypal analysis” keyword over the years (data collected from Google Scholar)]()
Machine Learning
![Machine learning overview]()
Archetypal Analysis (AA)
- Unsupervised machine learning technique introduced by Cutler and Breiman in 1994.
- Identifies extreme points that are representative of the underlying patterns or structures within the data set.
- Represents each data point as a mixture of these archetypes.
Let \mathfrak{X}=\{\bm{x}^{(1)}, \bm{x}^{(2)}, \ldots, \bm{x}^{(N)}\} be a dataset where each \bm{x}^{(n)} \in \mathbb{R}^M.
Objective
Find a set of archetypes \mathfrak{Z} = \{\bm{z}^{(1)}, \bm{z}^{(2)}, \ldots, \bm{z}^{(K)}\}, where each \bm{z}^{(k)} \in \mathbb{R}^M, which are convex combinations of the data points in \mathfrak{X}.
Simultaneously, each data point \bm{x}^{(n)} is also approximated as a convex combination of these archetypes \mathfrak{Z}.
Archetypes as convex combinations of the data points
Expanding the previous definition, each archetype \bm{Z}_k, i.e. \bm{z}^{(k)}, can be represented as a convex combination of the data points: \bm{Z}_k = \sum_{n=1}^{N} \bm{C}_{k,n} \bm{X}_n
subject to \|{\bm{c}_k}\|_1 = 1 and \bm{C}_{k,n} \in [0, 1], where \bm{c}_k denotes the k-th row of \bm{C}.
![Archetypes as convex combinations of the data points]()
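As a minimal NumPy illustration (toy data and a random, purely hypothetical coefficient matrix): since each row of \bm{C} lies on the probability simplex, every archetype is a weighted average of data points.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 200, 5, 3                 # data points, features, archetypes
X = rng.normal(size=(N, M))         # toy dataset
C = rng.random((K, N))
C /= C.sum(axis=1, keepdims=True)   # rows sum to 1, entries in [0, 1]
Z = C @ X                           # archetypes as convex combinations of X
```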
Data points as convex combinations of the archetypes
Simultaneously, each data point \bm{X}_n, i.e. \bm{x}^{(n)}, is approximated as a convex combination of these archetypes: \bm{X}_n \approx \sum_{k=1}^{K} \bm{S}_{n,k} \bm{Z}_k, similarly subject to \|{\bm{s}_n}\|_1 = 1 and \bm{S}_{n,k} \in [0,1].
![Data points as convex combinations of the archetypes]()
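Conversely, continuing the toy example above, a mixture matrix \bm{S} with simplex rows reconstructs each data point from the archetypes:

```python
S = rng.random((N, K))
S /= S.sum(axis=1, keepdims=True)   # convex weights per data point
X_hat = S @ Z                       # each row of X approximated as a mixture
print(np.linalg.norm(X - X_hat))    # large for random S and C; AA minimizes this
```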
Matrix notation
This formulation thus involves two matrices of coefficients: \bm{{C}} which defines the archetypes as combinations of the data points, and \bm{{S}} which represents each data point as a mixture of archetypes.
In matrix notation, archetypal analysis can be expressed as: \bm{X} \approx \bm{S} \bm{Z} = \bm{S} (\bm{C} \bm{X})
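For the dimensions to be consistent, with N data points, M features, and K archetypes, the factors satisfy \mathbf{X} \in \mathbb{R}^{N \times M}, \mathbf{S} \in \mathbb{R}^{N \times K}, \mathbf{C} \in \mathbb{R}^{K \times N}, and hence \mathbf{Z} = \mathbf{C}\mathbf{X} \in \mathbb{R}^{K \times M}.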
Computation
To compute the archetypes, we need to minimize the following objective function: \mathop{\mathrm{arg\,min}}_{\mathbf{S}, \mathbf{C}} \|\mathbf{X} - \mathbf{SCX}\|_F^2 subject to \|{\mathbf{s}_n}\|_1 = 1, \mathbf{s}_n \geq 0, \|{\mathbf{c}_k}\|_1 = 1 and \mathbf{c}_k \geq 0.
As proposed by Mørup & Hansen in 2012, the objective function can be rewritten in terms of the trace operator as E = \|{\mathbf{X} - \mathbf{SCX}}\|_F^2 = \text{tr} \left( \mathbf{X} \mathbf{X}^{\mkern-1.5mu\mathsf{T}}- 2 \mathbf{SCX}\mathbf{X}^{\mkern-1.5mu\mathsf{T}}+ \mathbf{SCX}\mathbf{X}^{\mkern-1.5mu\mathsf{T}}\mathbf{C}^{\mkern-1.5mu\mathsf{T}}\mathbf{S}^{\mkern-1.5mu\mathsf{T}}\right), which allows us to derive the gradients with respect to \mathbf{S} and \mathbf{C}: \nabla_{\mathbf{S}} E = 2(\mathbf{SCX} \mathbf{X}^{\mkern-1.5mu\mathsf{T}}\mathbf{C}^{\mkern-1.5mu\mathsf{T}}-\mathbf{X} \mathbf{X}^{\mkern-1.5mu\mathsf{T}}\mathbf{C}^{\mkern-1.5mu\mathsf{T}}) and \nabla_{\mathbf{C}} E = 2(\mathbf{S}^{\mkern-1.5mu\mathsf{T}}\mathbf{S}\mathbf{C}\mathbf{X}\mathbf{X}^{\mkern-1.5mu\mathsf{T}}- \mathbf{S}^{\mkern-1.5mu\mathsf{T}}\mathbf{X}\mathbf{X}^{\mkern-1.5mu\mathsf{T}}).
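These gradients lend themselves to a simple alternating projected-gradient scheme. Below is a minimal NumPy sketch under assumed settings (fixed step size, and a crude clip-and-renormalize in place of an exact simplex projection); it illustrates the update structure rather than reproducing Mørup & Hansen's exact algorithm:

```python
import numpy as np

def project_rows_to_simplex(A):
    # Crude surrogate for simplex projection: clip negatives,
    # then renormalize each row to sum to 1.
    A = np.clip(A, 0.0, None)
    return A / (A.sum(axis=1, keepdims=True) + 1e-12)

def archetypal_analysis(X, K, n_iter=500, lr=1e-3, rng=None):
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    S = project_rows_to_simplex(rng.random((N, K)))  # mixtures, N x K
    C = project_rows_to_simplex(rng.random((K, N)))  # archetype weights, K x N
    XXt = X @ X.T  # Gram matrix, precomputed once
    for _ in range(n_iter):
        grad_S = 2 * (S @ C @ XXt @ C.T - XXt @ C.T)
        S = project_rows_to_simplex(S - lr * grad_S)
        grad_C = 2 * (S.T @ S @ C @ XXt - S.T @ XXt)
        C = project_rows_to_simplex(C - lr * grad_C)
    return S, C
```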
Selecting the optimal number of archetypes
- Achieving a balance between complexity and elucidation of data patterns is crucial.
- Archetypes are not necessarily nested; they can change as K increases to better capture data structure.
- The Elbow Method is a preferred approach due to its simplicity and intuitive visual clarity (a sketch follows the figure below).
![Archetypal analysis with different numbers of archetypes]()
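A hypothetical sweep for the Elbow Method, reusing the archetypal_analysis sketch above on toy data; in practice one plots the residual sum of squares against K and picks the K at the bend:

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(200, 5))  # toy data
rss = {}
for K in range(1, 9):
    S, C = archetypal_analysis(X, K, rng=0)
    rss[K] = np.linalg.norm(X - S @ C @ X) ** 2  # residual sum of squares
# Plot K against rss[K]; the "elbow" is where adding archetypes
# stops reducing the error substantially.
```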
Fair Archetypal Analysis (FairAA)
Definition
Following the work of Kleindessner et al. in 2023 on fair PCA, our objective in fair archetypal analysis (FairAA) is to obscure critical information when projecting the dataset onto the archetypal space.
Let z_i \in \{0, 1\} represent a critical attribute of the data point x_i, indicating membership in one of two groups. Ideally, no classifier should be able to infer z_i from the projection of x_i onto the K-archetypal space.
Analogously to the definition of AA, we can define FairAA as follows:
\mathop{\mathrm{arg\,min}}_{\mathbf{S}, \mathbf{C}} \|{\mathbf{X} - \mathbf{SCX}}\|_F^2
subject to \|{\mathbf{s}_n}\|_1 = 1, \mathbf{s}_n \geq 0, \|{\mathbf{c}_k}\|_1 = 1, \mathbf{c}_k \geq 0 and \forall h:\mathbb{R}^k\rightarrow\mathbb{R}, h(\mathbf{s}_i) and z_i are statistically independent.
If we define \bar{z} = \frac{1}{n}\sum_{i=1}^n z_i and \mathbf{z} = (z_1 - \bar{z}, \dots, z_n - \bar{z}) \in \mathbb{R}^n and consider only linear classifiers, we can reformulate the fairness requirement as:
\begin{gather*}
\forall \mathbf{w} \in \mathbb{R}^k, b\in \mathbb{R}: \mathbf{w} \mathbf{s}_i^{\mkern-1.5mu\mathsf{T}}+ b \text{ and } z_i \text{ are uncorrelated} \Leftrightarrow \\
\forall \mathbf{w} \in \mathbb{R}^k, b\in \mathbb{R}: \sum_{i=1}^n (z_i - \bar{z})(\mathbf{w} \mathbf{s}_i^{\mkern-1.5mu\mathsf{T}}+ b) = 0 \Leftrightarrow \\
\forall \mathbf{w} \in \mathbb{R}^k: \mathbf{z}\mathbf{S}\mathbf{w}^{\mkern-1.5mu\mathsf{T}} = 0 \Leftrightarrow \\
\mathbf{z}\mathbf{S} = 0
\end{gather*}
Note that the b term drops out in the second step because the centered z_i sum to zero.
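A quick numerical check of this equivalence (ignoring the simplex constraints on \mathbf{S}, which play no role in the algebra): if \mathbf{z}\mathbf{S} = 0 for the centered \mathbf{z}, then any linear score of \mathbf{s}_i is empirically uncorrelated with z_i.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 4
z = rng.integers(0, 2, N).astype(float)       # binary group labels
zc = (z - z.mean()).reshape(1, N)             # centered z as a row vector
S = rng.random((N, K))
S -= zc.T @ (zc @ S) / (zc @ zc.T)            # force zc @ S == 0
assert np.allclose(zc @ S, 0)
w, b = rng.normal(size=K), rng.normal()       # arbitrary linear classifier
scores = S @ w + b
print(np.cov(z, scores)[0, 1])                # ~0: label and score uncorrelated
```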
Thus, we can define FairAA by introducing an additional term in the optimization problem:
\mathop{\mathrm{arg\,min}}_{\mathbf{S}, \mathbf{C}} \|{\mathbf{X} - \mathbf{SCX}}\|_F^2 + \lambda \|{\mathbf{z}\mathbf{S}}\|_F^2
subject to \|{\mathbf{s}_n}\|_1 = 1, \mathbf{s}_n \geq 0, \|{\mathbf{c}_k}\|_1 = 1, \mathbf{c}_k \geq 0, and \lambda \geq 0, which acts as a regularization parameter.
Computation
In this case, the objective function can also be written in terms of the trace operator as: E = \text{tr} \left( \mathbf{X} \mathbf{X}^{\mkern-1.5mu\mathsf{T}}- 2 \mathbf{SCX}\mathbf{X}^{\mkern-1.5mu\mathsf{T}}+ \mathbf{SCX}\mathbf{X}^{\mkern-1.5mu\mathsf{T}}\mathbf{C}^{\mkern-1.5mu\mathsf{T}}\mathbf{S}^{\mkern-1.5mu\mathsf{T}}\right) + \lambda \text{tr}\left( \mathbf{z}\mathbf{S}\mathbf{S}^{\mkern-1.5mu\mathsf{T}}\mathbf{z}^{\mkern-1.5mu\mathsf{T}}\right)
and the resulting gradients are: \nabla_{\mathbf{S}} E = 2(\mathbf{SCX} \mathbf{X}^{\mkern-1.5mu\mathsf{T}}\mathbf{C}^{\mkern-1.5mu\mathsf{T}}-\mathbf{X} \mathbf{X}^{\mkern-1.5mu\mathsf{T}}\mathbf{C}^{\mkern-1.5mu\mathsf{T}}+ \lambda\mathbf{z}^{\mkern-1.5mu\mathsf{T}}\mathbf{z} \mathbf{S}), \nabla_{\mathbf{C}} E = 2(\mathbf{S}^{\mkern-1.5mu\mathsf{T}}\mathbf{S}\mathbf{C}\mathbf{X}\mathbf{X}^{\mkern-1.5mu\mathsf{T}}- \mathbf{S}^{\mkern-1.5mu\mathsf{T}}\mathbf{X}\mathbf{X}^{\mkern-1.5mu\mathsf{T}}).
To optimize the objective function, we can use the same alternating optimization scheme as in AA.
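A minimal sketch of this alternating scheme, extending the AA sketch above with the \lambda \mathbf{z}^{\mkern-1.5mu\mathsf{T}}\mathbf{z}\mathbf{S} term (same caveats: fixed step size and a crude simplex projection):

```python
import numpy as np

def fair_archetypal_analysis(X, z, K, lam=1.0, n_iter=500, lr=1e-3, rng=None):
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    zc = (z - z.mean()).reshape(1, N)     # centered critical attribute
    S = project_rows_to_simplex(rng.random((N, K)))
    C = project_rows_to_simplex(rng.random((K, N)))
    XXt = X @ X.T
    ztz = lam * (zc.T @ zc)               # precomputed penalty matrix
    for _ in range(n_iter):
        grad_S = 2 * (S @ C @ XXt @ C.T - XXt @ C.T + ztz @ S)
        S = project_rows_to_simplex(S - lr * grad_S)
        grad_C = 2 * (S.T @ S @ C @ XXt - S.T @ XXt)
        C = project_rows_to_simplex(C - lr * grad_C)
    return S, C
```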
Complexity order
Regarding the complexity order, the matrix \lambda \mathbf{z}^{\mkern-1.5mu\mathsf{T}}\mathbf{z} can be precomputed, so the complexity order remains the same as that of AA.
Kernelizing fair AA
The computed gradients depend solely on the pairwise relationships, which are represented by the kernel matrix of inner products, \mathbf{K} = \mathbf{X}\mathbf{X}^{\mkern-1.5mu\mathsf{T}}.
Therefore, FairAA can be easily extended to kernel-based representations that rely on other pairwise data point relationships (FairKernelAA).
In this new framework, the gradients are given by the following expressions: \nabla_{\mathbf{S}} E = 2(\mathbf{SCK}\mathbf{C}^{\mkern-1.5mu\mathsf{T}}-\mathbf{K}\mathbf{C}^{\mkern-1.5mu\mathsf{T}}+ \lambda\mathbf{z}^{\mkern-1.5mu\mathsf{T}}\mathbf{z} \mathbf{S}), \nabla_{\mathbf{C}} E = 2(\mathbf{S}^{\mkern-1.5mu\mathsf{T}}\mathbf{S}\mathbf{C}\mathbf{K} - \mathbf{S}^{\mkern-1.5mu\mathsf{T}}\mathbf{K}),
where \mathbf{K} = \mathbf{K}(\mathbf{X}) \in \mathbb{R}^{n \times n} represents the kernel matrix that encodes the pairwise relationships between the elements of \mathbf{X}.
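Concretely, the sketches above touch \mathbf{X} only through \mathbf{X}\mathbf{X}^{\mkern-1.5mu\mathsf{T}}, so FairKernelAA amounts to substituting a precomputed kernel matrix for the Gram matrix; the RBF kernel below is one assumed choice, not one prescribed here:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Gaussian (RBF) kernel matrix: one possible pairwise relationship
    sq = np.sum(X**2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

# In fair_archetypal_analysis above, replace
#     XXt = X @ X.T
# with
#     XXt = rbf_kernel(X, gamma=0.5)
# The gradient expressions are otherwise unchanged.
```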
Hiding multiple groups
To extend FairAA to cases where the critical attribute consists of m disjoint groups, we adopt the one-vs-all approach.
We define m one-hot encoded critical attributes \mathbf{z}^{(1)}, \dots, \mathbf{z}^{(m)}, where z_i^{(l)} = 1 if \mathbf{x}_i belongs to group l, and z_i^{(l)} = 0 if it does not.
The fairness requirement then becomes \mathbf{ZS} = 0, where the l-th row of \mathbf{Z} \in \mathbb{R}^{m \times n} is given by (z_1^{(l)} - \bar{z}^{(l)}, \dots, z_n^{(l)} - \bar{z}^{(l)}), with \bar{z}^{(l)} being the mean of the l-th critical attribute across all data points.
As a result, the optimization problem can be solved in the same manner as the two-group FairAA approach:
\mathop{\mathrm{arg\,min}}_{\mathbf{S}, \mathbf{C}} \|{\mathbf{X} - \mathbf{SCX}}\|_F^2 + \lambda \|{\mathbf{Z}\mathbf{S}}\|_F^2
subject to \|{\mathbf{s}_n}\|_1 = 1, \mathbf{s}_n \geq 0, \|{\mathbf{c}_k}\|_1 = 1, \mathbf{c}_k \geq 0, and \lambda \geq 0, which acts as a regularization parameter.
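A small helper sketch (hypothetical name) for building the centered matrix \mathbf{Z} from integer group labels; since the gradient of \lambda\|\mathbf{Z}\mathbf{S}\|_F^2 with respect to \mathbf{S} is 2\lambda\mathbf{Z}^{\mkern-1.5mu\mathsf{T}}\mathbf{Z}\mathbf{S}, the matrix \mathbf{Z}^{\mkern-1.5mu\mathsf{T}}\mathbf{Z} can again be precomputed exactly as in the two-group case:

```python
import numpy as np

def centered_group_matrix(labels, m):
    # labels: integer group ids in {0, ..., m-1}, one per data point
    N = labels.shape[0]
    Z = np.zeros((m, N))
    Z[labels, np.arange(N)] = 1.0               # one-hot row per group
    return Z - Z.mean(axis=1, keepdims=True)    # center each row

# In fair_archetypal_analysis above, replace zc.T @ zc with Z.T @ Z, where
# Z = centered_group_matrix(labels, m).
```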