Twelve numbers describe a country. It turns out you need barely one. A working note on principal components, factors, and clusters.
Feed twelve development indicators into a machine that asks only “where does the variation live?” and it hands back a single line. Each tick is a nation; colour marks the four worlds the data falls into.
Pick a country and write down twelve facts about it: how long its people live, how many years they spend in school, what they earn, whether they have internet, how many infants die, how old the median citizen is. You now hold a point in twelve-dimensional space. With 178 countries you have a cloud of 178 points in that space — impossible to picture, impossible to put on a slide.
The instinct of a manager, an economist, or a machine-learning model is the same: compress it. Find the few directions that carry most of what is going on and throw away the rest. That instinct has three classic incarnations, and this note walks through all three on one dataset so the differences become concrete rather than abstract.
Principal Component Analysis rotates the cloud to find the directions of greatest spread. Factor Analysis asks instead what unobserved forces could have generated the correlations we see. Clustering ignores directions entirely and asks which countries simply belong together. They answer different questions — and on real data, the way they disagree is where the teaching is.
A note on honesty before we start. The headline result below is almost embarrassingly clean. That cleanliness is itself a lesson: when every variable measures a facet of the same underlying thing, dimension reduction looks like magic. The harder, more interesting questions live in the small print — the 23% of variance the first component throws away, and the cluster boundaries the data does not actually want.
Before reducing anything, look at why reduction is even possible. The matrix below is every indicator correlated with every other. If the variables were independent, it would be mostly pale. It is not pale.
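To make that check concrete, here is a minimal sketch of the same kind of correlation matrix computed with NumPy. The data are synthetic: twelve noisy copies of one hidden score, an assumption chosen to mimic the situation described above. Nothing it prints is a statement about the real indicators.

```python
import numpy as np

# Synthetic stand-in for the 178 x 12 indicator table (NOT the note's data):
# every column is a noisy copy of one hidden score.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(178, 1))
X = hidden @ np.ones((1, 12)) + 0.6 * rng.normal(size=(178, 12))

# Pearson correlation between every pair of columns
# (rowvar=False tells NumPy that columns, not rows, are the variables).
R = np.corrcoef(X, rowvar=False)

# One-number summary of "how pale is the matrix": the mean absolute
# off-diagonal correlation. It sits near 0 when columns are independent.
off_diagonal = R[~np.eye(12, dtype=bool)]
mean_abs_r = float(np.abs(off_diagonal).mean())
print(R.shape, round(mean_abs_r, 2))
```

A heat map of `R` is the picture; `mean_abs_r` is the same idea collapsed to a single figure you can track as indicators are added or removed.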
PCA performs one geometric trick. It re-aims the coordinate axes so that the first new axis points along the direction in which the data spreads out most, the second points along the direction of greatest remaining spread that is perpendicular to the first, and so on. No information is lost (twelve indicators still give twelve components), but the variance is now front-loaded. The question is how front-loaded.
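A sketch of that rotation, assuming the usual route of eigendecomposing the correlation matrix of standardised data. The input is synthetic (one hidden score plus noise), so the printed shares illustrate the mechanics only and are not the note's results.

```python
import numpy as np

# Synthetic 178 x 12 table: one hidden score plus independent noise (assumption).
rng = np.random.default_rng(1)
hidden = rng.normal(size=(178, 1))
X = hidden @ np.ones((1, 12)) + 0.6 * rng.normal(size=(178, 12))

# Standardise each column, then form the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = (Z.T @ Z) / (len(Z) - 1)

# eigh is for symmetric matrices and returns eigenvalues in ascending order,
# so re-sort to put the largest-variance direction first.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each eigenvalue is the variance carried by one component.
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)
for i, (share, total) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i:<2d} {share:6.1%}  cumulative {total:6.1%}")
```

The eigenvalues always sum to the number of indicators, which is the "no information is lost" statement in numeric form; only the distribution across components changes.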
A component is a weighted recipe of the original variables. The weights — called loadings — tell us what the new axis means. For the first component the recipe is unambiguous.
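The recipe can be read straight off the eigenvectors. The sketch below uses one common convention, scaling each eigenvector by the square root of its eigenvalue so that a loading equals the correlation between an indicator and the component. The indicator names are hypothetical placeholders and the data are synthetic, with three columns flipped to behave like "bad" indicators such as mortality.

```python
import numpy as np

rng = np.random.default_rng(2)
names = [f"indicator_{i + 1}" for i in range(12)]  # hypothetical labels
hidden = rng.normal(size=(178, 1))
X = hidden @ np.ones((1, 12)) + 0.6 * rng.normal(size=(178, 12))
X[:, :3] *= -1  # first three columns fall as the hidden score rises

R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: eigenvector weights scaled by sqrt(eigenvalue).
loadings = eigvecs * np.sqrt(eigvals)

# The sign of a component is arbitrary; orient PC1 so that the
# unflipped indicators load positively.
pc1 = loadings[:, 0]
if pc1[3:].sum() < 0:
    pc1 = -pc1

for name, weight in sorted(zip(names, pc1), key=lambda pair: pair[1]):
    print(f"{name:>13s} {weight:+.2f}")
```

Because the sign is arbitrary, "high on PC1 means developed" is a convention the analyst fixes, not something the algorithm decides.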
Project all 178 nations onto the first two components and the abstract becomes a map. Left-to-right is development; bottom-to-top is insecurity. Recolour the same points three ways to see what each lens reveals.
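The projection itself is a single matrix product. A minimal sketch on synthetic data with two hidden scores, labelled `development` and `insecurity` purely for illustration (only the last two columns respond to the second score, which is an assumption of this toy setup):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 178
development = rng.normal(size=n)
insecurity = rng.normal(size=n)

w_dev = np.ones(12)          # every indicator responds to development
w_ins = np.zeros(12)
w_ins[-2:] = 1.0             # only the last two respond to insecurity
X = (np.outer(development, w_dev)
     + np.outer(insecurity, w_ins)
     + 0.6 * rng.normal(size=(n, 12)))

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores: each row's coordinates on the first two components.
scores = Z @ eigvecs[:, :2]
print(scores[:5].round(2))
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives the map; the two columns are uncorrelated by construction, and each has variance equal to its eigenvalue.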
PCA is a description of the data’s shape. Factor analysis makes a stronger, more scientific claim: that a handful of latent variables we cannot observe (call them development, demographic maturity, insecurity) are the real drivers, and that each measured indicator is just a noisy reflection of them. Mathematically the two start in the same place, the correlation matrix of the indicators. The difference is what we do next.
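The "noisy reflection" claim can be written as a generative recipe: indicators = loadings × latent scores + indicator-specific noise. The sketch below simulates that recipe with a made-up loading matrix `Lambda` (hypothetical values, not estimates from the note) and checks that the covariance it implies, `Lambda @ Lambda.T + diag(psi)`, matches what the simulated data show. A large sample is used so that the comparison is not swamped by sampling noise.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical loadings: 12 indicators on 3 uncorrelated latent variables.
Lambda = np.zeros((12, 3))
Lambda[:5, 0] = 0.8    # indicators 1-5 reflect latent variable 1
Lambda[5:9, 1] = 0.8   # indicators 6-9 reflect latent variable 2
Lambda[9:, 2] = 0.8    # indicators 10-12 reflect latent variable 3
Lambda[5:, 0] = 0.3    # cross-loadings on latent variable 1

# Unique (noise) variance chosen so every indicator has unit variance.
psi = 1.0 - (Lambda ** 2).sum(axis=1)

n = 20000
F = rng.normal(size=(n, 3))                    # latent scores, never observed
E = rng.normal(size=(n, 12)) * np.sqrt(psi)    # indicator-specific noise
X = F @ Lambda.T + E                           # what we actually measure

implied = Lambda @ Lambda.T + np.diag(psi)     # covariance the model predicts
empirical = np.cov(X, rowvar=False)
max_gap = float(np.abs(empirical - implied).max())
print(round(max_gap, 3))
```

Fitting a factor model runs this logic in reverse: given the observed correlations, find loadings and noise variances that reproduce them. PCA has no noise term at all.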
Factor analysis rotates the extracted axes to make them interpretable, deliberately trading the “first axis hogs all the variance” property for axes that each map cleanly onto a few indicators. The giant first component splits apart.
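A sketch of the rotation step, using a standard SVD-based varimax iteration. Two assumptions to flag: the loadings here are extracted by principal components for brevity (the note's own extraction method may differ), and the data are synthetic, with three correlated hidden scores each driving a block of four indicators.

```python
import numpy as np

def varimax(L, max_iter=200, tol=1e-10):
    """Orthogonally rotate loadings L (indicators x factors) towards simple structure."""
    p, k = L.shape
    T = np.eye(k)
    previous = 0.0
    for _ in range(max_iter):
        B = L @ T
        G = L.T @ (B ** 3 - B @ np.diag((B ** 2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt                      # nearest orthogonal matrix to G
        if previous > 0 and s.sum() < previous * (1 + tol):
            break
        previous = s.sum()
    return L @ T, T

def simplicity(L):
    """Varimax criterion: variance of squared loadings, summed over factors."""
    return float((L ** 2).var(axis=0).sum())

rng = np.random.default_rng(9)
n = 178
common = rng.normal(size=(n, 1))
F = np.sqrt(0.5) * common + np.sqrt(0.5) * rng.normal(size=(n, 3))
block = np.repeat(np.arange(3), 4)      # indicator -> hidden score: 0,0,0,0,1,1,1,1,2,2,2,2
X = 0.9 * F[:, block] + np.sqrt(1 - 0.81) * rng.normal(size=(n, 12))

R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

unrotated = eigvecs[:, :3] * np.sqrt(eigvals[:3])
rotated, T = varimax(unrotated)

print("variance per axis before:", (unrotated ** 2).sum(axis=0).round(2))
print("variance per axis after: ", (rotated ** 2).sum(axis=0).round(2))
print("simplicity before/after: ", round(simplicity(unrotated), 3), round(simplicity(rotated), 3))
```

The rotation leaves each indicator's communality untouched and only redistributes variance among the retained axes, which is why the dominant first axis shrinks while the others grow.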
The first two methods give every country a position on continuous scales. Clustering does something categorical: it partitions the 178 nations into groups that are internally similar and externally distinct. The manager’s version of the question is “how many meaningfully different kinds of country are there, and which is which?”
k-means needs to be told how many groups to find, so the first task is choosing k honestly — and the honest answer is uncomfortable.
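One way to put the choice of k to the data is to run k-means for several values and compare average silhouette widths. The sketch below is a bare-bones NumPy version; `kmeans` and `silhouette` are illustrative helpers written for this note, not library functions, and the input is a synthetic continuum with no built-in groups.

```python
import numpy as np

def kmeans(X, k, n_init=10, n_iter=100, seed=0):
    """Lloyd's algorithm with random restarts; returns (labels, inertia) of the best run."""
    rng = np.random.default_rng(seed)
    best_labels, best_inertia = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # Recompute centroids; an empty cluster keeps its old centre.
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels, inertia = d.argmin(axis=1), d.min(axis=1).sum()
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels, best_inertia

def silhouette(X, labels):
    """Mean silhouette width using Euclidean distance (needs at least two clusters)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    groups = np.unique(labels)
    widths = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # a singleton's width is defined as 0
        a = D[i, own].sum() / (own.sum() - 1)   # mean distance within own cluster
        b = min(D[i, labels == g].mean() for g in groups if g != labels[i])
        widths[i] = (b - a) / max(a, b)
    return float(widths.mean())

# Synthetic continuum: one hidden score, no true groups (assumption).
rng = np.random.default_rng(8)
hidden = rng.normal(size=178)
X = np.outer(hidden, np.ones(12)) + 0.6 * rng.normal(size=(178, 12))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

scores = {k: silhouette(Z, kmeans(Z, k)[0]) for k in range(2, 7)}
for k, s in scores.items():
    print(f"k={k}  mean silhouette={s:.3f}")
```

The silhouette ranks candidate values of k by geometric separation alone. It has no view on whether two groups or four are more useful for the decision at hand.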
Here are the k-means clusters drawn back onto the principal-component plane from Figure 4. Because development is the dominant axis, the clusters line up along it like a thermometer — but watch the vertical scatter.
A cluster is only useful if you can say what it is. The profile below shows how each group sits, indicator by indicator, against the global average.
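Computing such a profile is just a group mean on standardised data, where the global average of every indicator is zero by construction. In the sketch below the four labels are stand-ins built from quartiles of a synthetic hidden score, an assumption made to keep the example short; in the analysis described here they would be the k-means assignments. Indicator names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 178
names = [f"indicator_{i + 1}" for i in range(12)]  # hypothetical labels
hidden = rng.normal(size=n)
X = np.outer(hidden, np.ones(12)) + 0.6 * rng.normal(size=(n, 12))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Stand-in cluster labels 0..3 from quartiles of the hidden score (ASSUMPTION).
cuts = np.quantile(hidden, [0.25, 0.5, 0.75])
labels = np.digitize(hidden, cuts)

# Profile: mean z-score of each indicator within each cluster.
# Positive means above the global average, negative means below.
profiles = np.vstack([Z[labels == k].mean(axis=0) for k in range(4)])
sizes = np.bincount(labels, minlength=4)

for k in range(4):
    print(f"cluster {k} (n={sizes[k]}):", profiles[k].round(2))
```

Reading along a row of `profiles` describes one group; reading down a column shows how strongly a single indicator separates the groups.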
k-means imposes a flat partition. Hierarchical clustering instead merges nations pair by pair, nearest first, recording the whole genealogy as a tree. Cutting the tree at the right height recovers the four tiers — and the branch heights show how reluctantly some groups join.
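A compact sketch of the merging process under Ward's criterion: at each step, join the two clusters whose union adds the least to the total within-cluster sum of squares. `ward_tree` is an illustrative helper written for this note, the data are a synthetic continuum, and the recorded values are raw merge costs, which are related to the heights a plotting library would draw but not identical to them.

```python
import numpy as np

def ward_tree(X, n_clusters):
    """Agglomerate down to one cluster; return labels at the n_clusters cut and all merge costs."""
    labels = np.arange(len(X))          # every point starts as its own cluster
    cut, costs = None, []
    while True:
        ids = np.unique(labels)
        if len(ids) == n_clusters:
            cut = labels.copy()         # snapshot of the partition at the requested cut
        if len(ids) == 1:
            break
        sizes = np.array([(labels == c).sum() for c in ids], dtype=float)
        cents = np.array([X[labels == c].mean(axis=0) for c in ids])
        sq = ((cents[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
        # Ward cost of merging a and b: n_a * n_b / (n_a + n_b) * ||c_a - c_b||^2
        cost = sizes[:, None] * sizes[None, :] / (sizes[:, None] + sizes[None, :]) * sq
        np.fill_diagonal(cost, np.inf)
        i, j = np.unravel_index(cost.argmin(), cost.shape)
        costs.append(cost[i, j])
        labels[labels == ids[j]] = ids[i]
    return cut, np.array(costs)

rng = np.random.default_rng(10)
hidden = rng.normal(size=178)
X = np.outer(hidden, np.ones(12)) + 0.6 * rng.normal(size=(178, 12))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cut, costs = ward_tree(Z, n_clusters=4)
tiers = np.unique(cut, return_inverse=True)[1]   # relabel the four groups as 0..3

print("tier sizes:", np.bincount(tiers))
print("last five merge costs:", costs[-5:].round(1))
```

The final few merge costs are the ones to look at: a large jump means two groups that were expensive to join, and a smooth ramp means the tree is slicing a continuum.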
It is tempting to treat these as competing algorithms and ask which “wins.” They do not compete; they answer different questions, and a good analyst reaches for the one whose question matches the decision at hand.
PCA. Best when you want a compact, faithful summary: a development index, a risk score, inputs to another model. It is descriptive and assumption-light, but its axes chase variance, not meaning, and can blur distinct ideas together.
Factor analysis. Best when you believe in latent constructs and want interpretable, named dimensions. Rotation buys legibility at the cost of the “one dominant axis” simplicity, and it leans on stronger modelling assumptions.
Clustering. Best when a decision needs discrete segments: tiers, personas, policy groups. Powerful for action, but it will happily invent borders inside a continuum, so the number of groups is your responsibility, not the algorithm’s.
On this dataset the three converge on one story told three ways. PCA found that development is overwhelmingly one-dimensional and quietly flagged violence as a second, smaller theme. Factor analysis unpacked that first dimension into living standards and age structure, and confirmed insecurity as its own latent force. Clustering carved the development continuum into four working tiers and, without being asked, reproduced the violence signal as a property of the lower-middle group.
The convergence is reassuring; the disagreements are the education. PCA’s discarded 23%, factor analysis’s rotated axes, and clustering’s arbitrary tier boundaries are not flaws to be hidden. They are the places where a thoughtful analyst earns their keep, deciding how much structure the data truly supports and how much is being imposed for the sake of a cleaner slide.
Source. Country-level indicators compiled from World Bank, UNDP Human Development Reports, and related cross-sectional collections (a single recent snapshot, not a time series). Forty-one World Bank regional and income aggregates (“Sub-Saharan Africa”, “High income”, etc.) were removed so that only sovereign units enter the analysis.
Variables. Twelve indicators spanning health, education, income, demography, connectivity and safety. Three skewed monetary or rate variables — GNI per capita, the maternal-mortality ratio and the homicide rate — were log-transformed before analysis. The Human Development Index itself was deliberately excluded from the inputs (it is built from several of them) and held back only to validate the first component.
Complete cases. Of 217 sovereign units, 178 had non-missing values on all twelve indicators and form the analysis sample. Countries dropped for missingness are disproportionately very small states and a few conflict-affected nations; conclusions about those groups should be read with that caveat.
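The two preparation steps described above, log-transforming the skewed variables and keeping only complete cases, can be sketched as follows. Everything concrete in the example is an assumption for illustration: the table is synthetic, the missingness rate is arbitrary, and `skewed_cols` holds hypothetical column positions for the three log-transformed indicators.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 217, 12

# Synthetic, strictly positive raw values with a few missing cells.
raw = rng.lognormal(mean=1.0, sigma=1.0, size=(n, p))
raw[rng.random((n, p)) < 0.02] = np.nan

# Hypothetical positions of GNI per capita, maternal mortality and homicide rate.
skewed_cols = [2, 7, 10]

data = raw.copy()
# Natural log; assumes strictly positive values (missing cells stay missing).
data[:, skewed_cols] = np.log(data[:, skewed_cols])

# Complete cases: keep rows with no missing value on any indicator.
complete = ~np.isnan(data).any(axis=1)
sample = data[complete]

print(f"{complete.sum()} of {n} rows kept")
```

Listwise deletion is the simplest rule and the one described here, but the rows it drops are rarely a random subset, which is exactly the caveat stated above.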
Estimation. All indicators were standardised to zero mean and unit variance, so no variable dominates by virtue of its units. PCA was run on the correlation matrix. Factor analysis used a three-factor, varimax-rotated solution (sampling adequacy KMO = 0.89; Bartlett’s test of sphericity p < 0.001; both indicate the data are well suited to factoring). Clusters come from k-means on the standardised indicators, with k set to 4 for interpretability rather than the silhouette-optimal k = 2; the dendrogram uses Ward linkage on Euclidean distance.
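For readers who want to see what sits behind the two suitability checks, here is a sketch of their textbook formulas on synthetic one-factor data. KMO compares squared correlations with squared partial correlations. Bartlett's statistic measures how far the correlation matrix is from the identity. The values printed belong to the synthetic data, not to the figures quoted above.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 178, 12
hidden = rng.normal(size=(n, 1))
X = hidden @ np.ones((1, p)) + 0.6 * rng.normal(size=(n, p))

# Standardise to zero mean and unit variance, then correlate.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(Z, rowvar=False)

# Partial correlations (each pair, controlling for all other indicators)
# come from the inverse correlation matrix.
inv = np.linalg.inv(R)
scale = np.sqrt(np.diag(inv))
partial = -inv / np.outer(scale, scale)

# KMO: share of squared correlation not accounted for by partial correlation.
off = ~np.eye(p, dtype=bool)
r2 = (R[off] ** 2).sum()
q2 = (partial[off] ** 2).sum()
kmo = float(r2 / (r2 + q2))

# Bartlett's test of sphericity: chi-square statistic and degrees of freedom.
_, logdet = np.linalg.slogdet(R)
chi2 = float(-(n - 1 - (2 * p + 5) / 6) * logdet)
dof = p * (p - 1) // 2

print(f"KMO = {kmo:.2f}, Bartlett chi2 = {chi2:.1f} on {dof} df")
```

The p-value is the upper tail of a chi-square distribution with `dof` degrees of freedom, for example `scipy.stats.chi2.sf(chi2, dof)`; it is left out here to keep the sketch NumPy-only.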
Honest limits. This is one cross-section: it shows how countries differ today, not how any country changes over time. The strong one-dimensionality is partly real and partly an artefact of indicators that are definitionally entangled. And cluster boundaries, as Figure 6 stresses, are imposed on a continuum — convenient fictions, not natural kinds.