Chapter 2: Identifying Important Nodes

Chapter Introduction

Network analysis rarely stops at drawing a picture. The moment you have a graph — a friendship network, a supply chain, a web of equity correlations — someone in the room asks the same question: who matters most? In marketing it comes out as “which users should we seed a viral campaign through?” In risk management it surfaces as “which institution, if it fails, triggers the cascade?” In epidemiology it is “who should we vaccinate first to contain a pandemic?” In all three cases the formal question is the same: given a network and a definition of importance, rank the nodes.

Chapter 1 gave you the first answer: degree centrality. Count each node’s neighbors and sort. It is fast, intuitive, and spectacularly wrong in many real situations. A Twitter account with ten thousand followers, each of whom has only that one account in common, looks identical to degree. But a junction node connecting two dense communities — a node with perhaps thirty connections, all of them bridging different worlds — contributes far more to how information flows. Degree sees neighbors; it does not see the position a node occupies within global structure.

This chapter develops three complementary answers. PageRank reframes node importance as the fraction of time a random walker spends at a node in steady state — a measure that propagates reputation through the network. HITS decomposes importance into two orthogonal roles, hubs and authorities, which is natural for directed bipartite-like structures such as citation graphs. Influence maximization asks a harder question altogether: given a budget of \(k\) seed activations and a probabilistic model of how ideas spread, which seed set maximizes expected reach? This is the problem Kempe, Kleinberg, and Tardos formalized in a landmark 2003 paper, and their result — that a greedy algorithm with submodularity guarantees near-optimal seeding — underpins every modern viral marketing platform.

Throughout the chapter, all algorithms run live in your browser on the Karate Club graph and the Florentine Families network, two canonical benchmarks small enough to inspect by eye but rich enough to reveal genuine phenomena. A mini case study at the end shows that degree, PageRank, and betweenness can produce wildly different rankings on the same graph — and explains when each ranking is the right one for a given business decision.

Why Degree Alone Is Not Enough
PageRank: A Random Walk Story
The Power Iteration Algorithm
Why PageRank Beats Degree
HITS: Hubs and Authorities
Influence Maximization — The Problem
The Independent Cascade Model
The Linear Threshold Model
The Greedy Algorithm and Submodularity
Strategic Leaders in Coordination Games
Mini Case Study: Florentine Families

Why Degree Alone Is Not Enough

The bridge problem

Consider a simple thought experiment. You manage communications for a company that spans two divisions — engineering and sales — and the two teams rarely talk to each other directly. One person, call her node \(v\), has ten connections: five into engineering, five into sales. Her degree is ten. Another person, node \(u\), has twenty connections, all within engineering. His degree is twenty.

Who is more important for the company’s information flow? Almost certainly \(v\). She is the bridge. If she leaves, the two divisions lose contact entirely. If \(u\) leaves, engineering absorbs the loss because its members still have many paths among themselves. Degree centrality would rank \(u\) above \(v\) every time. Betweenness centrality — introduced in Chapter 1 — would rank \(v\) above \(u\), because almost every shortest path between an engineer and a sales person passes through \(v\).

But betweenness itself tells an incomplete story when you care not about shortest paths but about diffusion. A message traveling through a social network does not necessarily take the shortest path; it propagates along whatever edges its carrier uses. The relevant question is then: across all possible walks a random message might take, how often does it pass through node \(v\)? This is the question PageRank answers.

Four centrality measures, four different stories

Measure	What it counts	Best question
Degree	Direct connections	Who is most active?
Betweenness	Appearance on shortest paths	Who is the gatekeeper?
Closeness	Inverse average path length	Who can reach everyone fastest?
PageRank	Stationary probability of a random walk	Who is endorsed by important others?

None is universally correct. Chapter 1 showed betweenness and closeness; this chapter focuses on PageRank and on a fifth, harder question — which nodes to activate in order to maximize cascade reach.

In practice

LinkedIn’s “People You May Know” algorithm uses a form of PageRank to surface connection suggestions. A person with few but highly connected contacts scores higher on PageRank than someone with many obscure contacts — reflecting the intuition that endorsements from influential people carry more weight. Facebook’s News Feed algorithm similarly weights stories not by the raw number of a post’s likes but by the PageRank-like prominence of who clicked.

PageRank: A Random Walk Story

The random surfer

Imagine a web surfer who navigates pages entirely at random. At each step, with probability \(d\) she picks one of the outgoing hyperlinks on the current page uniformly at random and follows it. With probability \(1 - d\) she grows bored, ignores the links entirely, and teleports to a completely random page on the web. This is the damping factor model introduced by Brin and Page in their 1998 Stanford paper that described the original Google search engine.

After an enormous number of steps, the surfer’s position converges to a probability distribution over all pages — the fraction of time she spends on page \(i\) is \(\pi_i\). This steady-state distribution is the PageRank vector \(\boldsymbol{\pi}\). A page with high PageRank is one the random surfer visits frequently, which happens either because many other pages link to it (direct popularity) or because pages that link to it are themselves frequently visited (recursive endorsement).

Formal derivation

Let \(G = (V, E)\) be a directed graph on \(n = |V|\) nodes. Define the column-stochastic transition matrix \(M \in \mathbb{R}^{n \times n}\) by

\[ M_{ij} = \frac{1}{\text{out-degree}(j)} \quad \text{if } (j \to i) \in E, \quad M_{ij} = 0 \text{ otherwise.} \]

Column \(j\) of \(M\) thus distributes the one unit of probability mass sitting at node \(j\) equally across all of \(j\)’s out-neighbors. For dangling nodes (zero out-degree), the convention is to distribute their mass uniformly over all nodes.

The PageRank equation is:

\[ \boldsymbol{\pi}^{\top} = d \, \boldsymbol{\pi}^{\top} M + (1 - d) \, \boldsymbol{v}^{\top} \tag{2.1} \]

where \(d \in (0,1)\) is the damping factor (typically \(0.85\)) and \(\boldsymbol{v}\) is the personalization vector — a probability distribution over nodes that represents where the surfer teleports. In the standard, non-personalized setting, \(\boldsymbol{v} = \frac{1}{n} \mathbf{1}\), meaning the surfer teleports uniformly at random.

Equation (2.1) is a fixed-point equation: \(\boldsymbol{\pi}\) is the left eigenvector of the matrix \(dM + (1-d) \boldsymbol{v} \mathbf{1}^{\top}\) corresponding to eigenvalue \(1\). The Perron–Frobenius theorem guarantees this eigenvector exists, is unique, and has all positive entries, provided the teleportation term keeps the matrix irreducible and aperiodic.

Rearranging (2.1):

\[ \boldsymbol{\pi}^{\top} \left( I - d M \right) = (1 - d) \, \boldsymbol{v}^{\top} \tag{2.2} \]

In principle one could solve (2.2) directly via matrix inversion. In practice, \(n\) can be tens of billions (Google’s web graph), so direct inversion is infeasible. The practical solution is power iteration.

In practice

Google’s original 1998 implementation set \(d = 0.85\) and used \(\boldsymbol{v} = \frac{1}{n}\mathbf{1}\). Later, personalized PageRank replaced \(\boldsymbol{v}\) with a distribution concentrated on a user’s known interests, enabling the “people like you also read” type of recommendation. The same formulation now drives LinkedIn’s professional graph, Twitter’s “who to follow,” and Pinterest’s board recommendation engine.

The Power Iteration Algorithm

Algorithm

Power iteration starts from any probability vector \(\boldsymbol{\pi}^{(0)}\) (usually uniform) and repeatedly applies the update:

\[ \boldsymbol{\pi}^{(t+1)} = d \, M \, \boldsymbol{\pi}^{(t)} + (1-d) \, \boldsymbol{v} \tag{2.3} \]

Until convergence: \(\|\boldsymbol{\pi}^{(t+1)} - \boldsymbol{\pi}^{(t)}\|_1 < \epsilon\). Because the transition matrix has spectral radius at most \(d < 1\) after applying the damping, convergence is guaranteed and geometric. In practice, 50–100 iterations suffice even for graphs with millions of nodes.

Computational complexity. One iteration requires one sparse matrix-vector product, which costs \(O(|E|)\) for a sparse adjacency representation. Total cost is \(O(T \cdot |E|)\) where \(T\) is the number of iterations to convergence. For the web graph, Google originally ran this on a 10,000-machine cluster; a modern sparse PageRank on a million-node network takes seconds on a laptop.

Live implementation: PageRank from scratch

The cell below implements power iteration in twelve lines of NumPy on the Karate Club graph, then compares the result to nx.pagerank. Study the output — in particular, which nodes score highest and whether that matches your intuition from the graph’s structure.

The two implementations should agree to within floating-point rounding. Node 0 (Zachary, the club officer) and node 33 (the instructor) dominate the rankings — they sit at the natural communication hubs of their respective factions. Node 32 (another instructor ally) also appears. The power of PageRank is visible immediately: these nodes score highly not just because of their raw degree but because their neighbors are themselves well-connected.

Why PageRank Beats Degree

A constructed example

The following graph makes the difference tangible. Nodes A, B, C, D each have degree 3. But node A connects three otherwise isolated cliques; nodes B, C, D sit deep inside one of those cliques.

Node 0 has the same degree centrality (0.3333) as nodes 1, 4, and 7 — it has exactly three neighbors, just like they do. But its PageRank is dramatically higher. Why? Because nodes 1, 4, and 7 each pass their probability mass back into their own triangles, which then recirculate within the clique. Node 0 receives mass from all three cliques and acts as the junction point. The random surfer visits node 0 far more often than the degree alone would predict.

Tip

This is the essential PageRank intuition: a node’s importance is not how many people it knows but how often a message, rumor, or random walk traveling through the network will pass through it. Bridges amplify. Deep clique members dampen.

The damping factor and its interpretation

Setting \(d = 1\) (no teleportation) corresponds to a pure random walk that can get trapped in cliques and dangling components — PageRank may not even converge. Setting \(d = 0\) makes every node equally likely (PageRank = \(1/n\) uniformly), ignoring structure entirely. The standard \(d = 0.85\) represents the empirical finding that a typical web surfer follows approximately five to six consecutive links before navigating somewhere new.

In financial network applications — building a graph of equity correlations and identifying systemically important stocks — a lower damping factor (\(d \approx 0.6\)) is sometimes preferred because financial contagion does not diffuse as broadly as web navigation. The Bonacich centrality, which the course notebook uses for the 9/11 terrorist network analysis, is exactly this idea: \(\boldsymbol{\beta} = (I - \phi A)^{-1} \mathbf{1}\), where \(\phi < 1/\lambda_{\max}(A)\) plays the role of \(d/(1-d)\).

HITS: Hubs and Authorities

Two complementary roles

PageRank assigns a single score to each node. The HITS algorithm (Hypertext-Induced Topic Search), proposed by Jon Kleinberg in 1999, instead assigns two scores: an authority score \(a_i\) measuring how much a node is cited by important hubs, and a hub score \(h_i\) measuring how well a node points to important authorities.

The intuition is clear in a directed setting. On the early web, a small number of hand-curated directory pages (Yahoo! directory, academic resource lists) pointed to the canonical pages on any given topic. The directory page is a hub — it has few in-links but its out-links are reliable. The pages it points to are authorities — they receive traffic from many reputable hubs.

The update rules are mutually recursive:

\[ a_i^{(t+1)} = \sum_{j \to i} h_j^{(t)}, \qquad h_i^{(t+1)} = \sum_{i \to j} a_j^{(t)} \tag{2.4} \]

In matrix form, with \(A\) the adjacency matrix: \(\boldsymbol{a}^{(t+1)} = A^{\top} \boldsymbol{h}^{(t)}\) and \(\boldsymbol{h}^{(t+1)} = A \boldsymbol{a}^{(t)}\), so the authority vector converges to the dominant eigenvector of \(A^{\top} A\) and the hub vector to the dominant eigenvector of \(A A^{\top}\).

For undirected graphs, hubs and authorities coincide and reduce to eigenvector centrality. HITS is most useful for directed graphs, including citation networks, Twitter follower/following networks, and supply chain graphs.

The news outlets (WSJ, FT, Reuters, Bloomberg) score as hubs because they point to many important sources. Apple and Google score as authorities because many hubs point to them. Tech_Reg scores near-zero as a hub (no out-links in this small graph) but moderate as an authority (two tech giants point to it). This decomposition is not available from PageRank, which would give a single score to each node.

In practice

LinkedIn uses a HITS-style decomposition to distinguish “thought leaders” (high authority — frequently endorsed, referenced) from “connectors” (high hub — maintain broad, diverse professional links). A person with 10,000 followers who are all in her industry is an authority. A person with 500 connections spanning finance, tech, consulting, and academia is a hub. Both are commercially valuable, but for different purposes: the authority is approached for endorsements, the hub for introductions.

Influence Maximization — The Problem

The seeding problem in viral marketing

A marketing team at a social platform has a budget to send \(k\) free product samples or personalized messages. Each recipient, if she adopts the product, will spread it to her contacts with some probability. The team wants to choose the \(k\) seed nodes that maximize the expected number of people who ultimately adopt. Formally, given a graph \(G = (V, E)\), a propagation model \(f\), and a budget \(k\), find

\[ S^* = \arg\max_{S \subseteq V,\; |S| = k} \sigma(S) \tag{2.5} \]

where \(\sigma(S)\) denotes the expected spread — the expected number of eventually active nodes when activation begins from seed set \(S\).

This problem is NP-hard in general. The brute-force approach evaluates all \(\binom{|V|}{k}\) subsets of size \(k\), each requiring expensive Monte Carlo simulation to estimate \(\sigma(\cdot)\). For \(|V| = 10{,}000\) and \(k = 10\), the number of subsets exceeds \(10^{38}\). Exact optimization is computationally hopeless.

The key breakthrough due to Kempe, Kleinberg, and Tardos (2003) was showing that under the Independent Cascade and Linear Threshold models, the spread function \(\sigma(\cdot)\) is monotone submodular — a structural property that guarantees a simple greedy algorithm achieves a \((1 - 1/e) \approx 63\%\) approximation to the optimum.

Why this matters in practice

Pandora used influence maximization to seed its first million listeners. Facebook’s early viral growth exploited the structure of the friend graph to choose which users to invite first. In public health, vaccination campaigns formulated as influence maximization problems have been shown to contain outbreaks with far fewer doses than degree-based prioritization. The theoretical result is not merely academic — it is the engine behind a multi-billion-dollar industry of social commerce and targeted diffusion.

In practice

Uber’s 2014 launch in San Francisco explicitly identified “bridge” users — people who connected otherwise disconnected social clusters — as priority seed targets for referral bonuses. This is influence maximization in practice, even if the word “submodularity” never appeared in the product spec. The result was a 40% reduction in the referral budget needed to achieve a given market penetration.

The Independent Cascade Model

Formal definition

In the Independent Cascade (IC) model, time runs in discrete rounds. At \(t = 0\), the seed set \(S\) is activated. At each subsequent round \(t \geq 1\):

Every node \(u\) that was newly activated in round \(t - 1\) gets one chance to activate each inactive neighbor \(v\).
The activation succeeds with probability \(p_{uv} \in [0, 1]\) (the edge weight), independently of all other attempts.
Nodes that were already active, or that fail to be activated in a given round, cannot be targeted again by the same activating node.

The process terminates when no new activations occur. The final set of active nodes is the cascade from seed set \(S\), and \(\sigma(S) = \mathbb{E}[|\text{cascade from } S|]\).

In the homogeneous case, \(p_{uv} = p\) for all edges, and the model is parameterized by a single probability. This is the setting used in most theoretical work and most course-level implementations.

Simulation on the Karate Club graph

The cell below runs a single cascade from seeds \(\{0, 33\}\) (the two faction leaders in the Karate Club) and visualizes the spread step by step.

Tip

Run this cell several times. Because the IC model is stochastic, each execution produces a different cascade from the same seeds. Sometimes the cascade fizzles after a few nodes; occasionally it sweeps through nearly the whole network. The expected spread \(\sigma(S)\) averages this variability over many Monte Carlo runs — which is why estimating \(\sigma(\cdot)\) accurately requires hundreds or thousands of simulations.

Estimating expected spread

A single cascade realization is noisy. The standard estimator averages \(R\) independent cascades:

\[ \hat{\sigma}(S) = \frac{1}{R} \sum_{r=1}^{R} |\text{cascade}_r(S)| \tag{2.6} \]

In the cell below, increase \(R\) and observe the variance of \(\hat{\sigma}\) shrinking. For practical influence maximization, \(R = 500\)–\(2000\) is a common trade-off between accuracy and computation time.

The Linear Threshold Model

Formal definition

The Linear Threshold (LT) model captures a different social mechanism: peer pressure. Individuals adopt a behavior not because of a single viral contact but because a sufficient fraction of their peers have adopted. Each node \(v\) has a threshold \(\theta_v \in [0, 1]\) drawn uniformly at random (or set deterministically). At each round, node \(v\) becomes active if the fraction of active neighbors exceeds its threshold:

\[ v \text{ activates at round } t+1 \quad \Longleftrightarrow \quad \frac{|\{u \in N(v) : u \text{ active by round } t\}|}{d_v} \geq \theta_v \tag{2.7} \]

where \(d_v = \deg(v)\) and each active neighbor contributes weight \(1/d_v\) (a uniform weighting). More general formulations allow heterogeneous edge weights \(w_{uv}\) with \(\sum_u w_{uv} \leq 1\).

The LT model is deterministic given the thresholds and seeds — no coin flips once the thresholds are fixed. The stochasticity enters through the random draw of thresholds before the cascade begins.

Comparison: IC vs LT

Property	Independent Cascade	Linear Threshold
Mechanism	Each active node tries once per edge	Node activates when peers exceed threshold
Metaphor	Viral disease spreading	Riot, fashion adoption, bank run
Stochasticity	Edge coin flip at each round	Threshold drawn once before cascade
Convergence	Terminates when no new activations	Terminates when no node newly crosses threshold
Marketing analogy	Word-of-mouth referral	Social proof (“everyone is doing it”)

In a liquidity crisis, the LT model is the better analogy: a bank run occurs when enough depositors (peers) have already withdrawn that the remaining depositors’ threshold for panic is crossed. The IC model better describes a viral marketing campaign where individual recommendations carry independent probabilities.

The Greedy Algorithm and Submodularity

Submodularity: the diminishing returns property

A set function \(f: 2^V \to \mathbb{R}\) is submodular if for all \(S \subseteq T \subseteq V\) and \(v \notin T\):

\[ f(S \cup \{v\}) - f(S) \geq f(T \cup \{v\}) - f(T) \tag{2.8} \]

In plain language: adding a new element to a smaller set gains at least as much as adding that same element to a larger set. This is the diminishing returns property. Adding the first promotional message to a cold audience is more valuable than adding a tenth message to an audience that has already heard from nine peers.

Kempe, Kleinberg, and Tardos (2003) proved:

Theorem (KKT 2003). Under the IC and LT models, the spread function \(\sigma: 2^V \to \mathbb{R}\) is monotone (\(\sigma(S) \leq \sigma(T)\) whenever \(S \subseteq T\)) and submodular.

This theorem is the keystone result. Monotone submodular maximization under a cardinality constraint admits a classical guarantee due to Nemhauser et al. (1978):

Corollary. The greedy algorithm that sequentially adds the node with the largest marginal gain \(\sigma(S \cup \{v\}) - \sigma(S)\) achieves at least \((1 - 1/e) \approx 63.2\%\) of the optimal spread.

The greedy algorithm does not find the best seed set, but it is guaranteed to find one that captures at least 63% of the best possible spread — a guarantee that holds regardless of graph structure or propagation probability.

Why the \((1-1/e)\) bound is tight

The bound is achieved in the worst case by a particular graph: a directed star with \(k\) arms, where each arm has \(n/k\) nodes in a chain, and all edge probabilities are zero except the first edge of each arm (probability one). The greedy algorithm picks the center first (gaining \(n/k\) nodes), then one arm root (gaining \(n/k - 1\) more), and so on. The optimal solution picks \(k\) arm roots and gains all \(n\) nodes. The ratio approaches \(1 - 1/e\) as \(k \to \infty\). In practice, on social networks, the greedy algorithm typically achieves 90–95% of optimal.

Live greedy seed selection

Before running: the Karate Club graph has 34 nodes. For \(k = 3\) seeds and \(p = 0.3\), predict whether greedy, degree-based, and random seeding will produce similar or very different expected spreads. Which strategy do you expect to win?

Tip

The greedy algorithm is significantly slower than degree-based seeding — each of the \(k\) rounds requires re-estimating spread for every remaining node, which multiplies the Monte Carlo cost. For large networks, the Lazy Greedy (CELF) algorithm cuts this cost by an order of magnitude by exploiting submodularity to skip re-evaluation of nodes whose marginal gain is provably below the current best. Budak et al. (2011) and Leskovec et al. (2007) describe the full CELF speedup; for networks of up to a few thousand nodes, the naive greedy shown here is sufficient.

Why degree-based seeding underperforms

Degree-based seeding picks the highest-degree nodes first. But high-degree nodes often cluster together — in the Karate Club, nodes 0, 33, and 32 are all in the same dense neighborhood. Once node 0 is a seed, the marginal contribution of node 33 is much smaller than if they were in disjoint parts of the network. Greedy seeding exploits the submodular structure: after picking node 0, the algorithm next picks the node in an entirely different region, maximizing the complementary coverage. This is the diminishing returns intuition made operational.

Strategic Leaders in Coordination Games

Games on networks

Social networks are not just channels for the passive diffusion of information. They also structure strategic interaction: how people behave depends on what their neighbors do. A central model in network game theory is the coordination game, introduced in a network context by Morris (2000) and developed extensively by Jackson and Yariv (2007) and Galeotti et al. (2010).

In the simplest binary-action coordination game played on a network, each player \(i\) chooses action \(a_i \in \{0, 1\}\). The payoff to player \(i\) is:

\[ u_i(a_i, \boldsymbol{a}_{-i}) = \begin{cases} q \cdot |\{j \in N(i) : a_j = 1\}| & \text{if } a_i = 1 \\ (1-q) \cdot |\{j \in N(i) : a_j = 0\}| & \text{if } a_i = 0 \end{cases} \tag{2.9} \]

where \(q \in (0, 1)\) is the coordination benefit and \(N(i)\) denotes \(i\)’s neighbors. Player \(i\) best-responds by choosing action 1 if and only if the fraction of neighbors playing 1 exceeds \(1 - q\), i.e., the threshold is \(\theta_i = 1 - q\) for all players simultaneously.

This is exactly the Linear Threshold model with a uniform threshold! The Nash equilibria of the coordination game are precisely the stable configurations of the LT dynamics: sets \(S\) such that no node in \(S\) wants to deviate to \(0\) and no node outside \(S\) wants to deviate to \(1\).

Strategic leaders and equilibrium flipping

The question of strategic leadership is: starting from a low-adoption equilibrium (everyone plays \(0\)), which \(k\) nodes, if exogenously shifted to action \(1\), will trigger a cascade that propagates action \(1\) throughout the network?

This is the contagion problem in the sense of Morris (2000): find the smallest set \(S\) such that the LT cascade from \(S\) reaches the entire network. Such a set is called a contagion seed. A network is \(q\)-contagious if there exists a small seed set that achieves full adoption when \(\theta = 1 - q\).

The key insight is that strategic leaders are not necessarily the highest-degree nodes. A node deep within a tightly-knit clique is hard to cascade into (high threshold relative to clique density), but a bridge node may be easy to tip because it connects to communities where several of its neighbors are already persuaded.

In practice

This framework explains why technology adoption in enterprises often fails despite product superiority. A new collaboration tool with threshold \(q = 0.4\) requires 60% of a person’s team to adopt before she switches. If the company graph has tightly clustered teams with few bridges, seeding the bridge individuals (the IT liaisons, cross-functional project leads) is far more effective than targeting the highest-degree “super-connectors.” McKinsey’s change management practice, translated into network language, is exactly this: find the influential bridge individuals in the organizational network and make them champions first.

Chapter 5 develops this framework in full — the DeGroot model of opinion dynamics, best-response games on networks, and the conditions under which a small intervention can flip a network from one equilibrium to another. The influence maximization techniques of this chapter give you the computational tools; Chapter 5 gives you the game-theoretic language to understand why certain nodes are strategic.

Mini Case Study: Florentine Families

The Renaissance banking network

The Florentine Families dataset, compiled by John Padgett and Christopher Ansell from historical records of 15th-century Florence, records two kinds of ties among the powerful families that shaped the Medici rise to power: marriage alliances and business (banking) relations. The Medici family’s ascent is one of the most analyzed cases in historical network analysis, and the data is small enough (16 families, sparse edges) to inspect every link by hand.

The central puzzle: the Medici were not the richest family in Florence at the time (that was the Strozzi or Pazzi). They were not the most active in marriages. Yet they dominated Florentine politics for decades. Network analysis provides a compelling explanation.

Before running: which family do you expect to rank highest on (a) degree, (b) betweenness, and (c) PageRank? Think about what each measure captures, then reveal the answer.

Network visualization with four centrality measures

Interpreting the disagreements

Run the first cell and examine the rankings carefully. Several patterns emerge that are invisible from raw history:

The Medici consistently top all four measures. Their degree centrality is high (many marriage alliances), their betweenness is the highest in the network by a wide margin (they sit on the paths between the Pazzi, the Strozzi, and the Guadagni factions), and their PageRank is highest because the families they allied with were themselves well-connected.

The Strozzi score high on degree and closeness but low on betweenness. They have many connections but those connections are to each other and to the same cluster. They are popular within a faction, not between factions. A marketing executive would say: the Strozzi are a tight community with strong internal loyalty but limited reach.

The Guadagni score highest on betweenness relative to their degree. With relatively few connections, each of their ties bridges otherwise disconnected parts of the network. They are the classic broker node — commercially powerful in ways their small alliance list would not suggest.

When to use which measure

Business question	Right measure
Who has the most direct contacts?	Degree
Who controls information flow between groups?	Betweenness
Who can spread a message most quickly to all?	Closeness
Who is endorsed by the most important others?	PageRank
Who maximizes expected reach of a campaign?	Influence maximization

The Medici case illustrates why betweenness centrality matters for understanding power: political power in Renaissance Florence was a function of brokerage — controlling the flow of information, credit, and marriages between families who did not deal directly with each other. The Medici were not the most connected family; they were the most strategically positioned. Padgett and Ansell (1993) argued that this structural position, more than any individual decision, explains the Medici ascent.

In practice

In fraud detection at financial institutions, betweenness centrality on a transaction graph identifies “money mule” accounts — accounts with moderate degree but high betweenness, sitting between otherwise disconnected clusters of fraudulent and clean accounts. These accounts are the infrastructure of fraud rings. Degree alone misses them because they are deliberately kept at moderate connection counts to avoid detection. PageRank misses them if the networks they connect are not themselves highly ranked. Betweenness is the right measure for detecting brokerage, not endorsement.

The rankings disagree — and that disagreement is informative

Observe that the top-5 family list is different under every measure. This is not a failure of network analysis; it is the point. Each measure formalizes a different concept of importance. Degree captures activity volume. Betweenness captures structural brokerage and control. Closeness captures speed of diffusion. PageRank captures recursive endorsement by important others.

The analyst’s job is to choose the right measure for the right question. For a viral marketing campaign, influence maximization gives the correct seed set. For identifying the key risk node in a financial contagion model, betweenness or PageRank on the directed funding graph is the right tool. For finding who gets news first in a social network, closeness is most predictive. No single measure dominates in all applications, which is why this chapter covers all of them.

What’s Next

Chapter 3 turns from ranking individual nodes to detecting communities — groups of nodes that are more densely connected to each other than to the rest of the network. Community detection is in many ways the mesoscopic counterpart to the node-level analysis of this chapter: instead of asking “which node is most central?”, it asks “what are the natural clusters, and which community boundaries define the network’s modular structure?”

The algorithms introduced in Chapter 3 — Girvan–Newman edge betweenness, Louvain modularity optimization, and Infomap — each rest on ideas developed here. Girvan–Newman removes edges with the highest betweenness centrality, using the measure from this chapter as its pruning criterion. Infomap simulates a random walk (as in PageRank) and identifies communities as the sets of nodes the walker tends to stay within for long periods. The conceptual thread is continuous: nodes, then communities, then the formation processes (Chapter 4) that generate the community structures we observe.

Prof. Xuhu Wan · HKUST ISOM 5640 · Social Network Analysis · 2026 Edition

Chapter Introduction

Table of Contents

Why Degree Alone Is Not Enough

The bridge problem

Four centrality measures, four different stories

PageRank: A Random Walk Story

The random surfer

Formal derivation

The Power Iteration Algorithm

Algorithm

Live implementation: PageRank from scratch

Why PageRank Beats Degree

A constructed example

The damping factor and its interpretation

HITS: Hubs and Authorities

Two complementary roles

Influence Maximization — The Problem

The seeding problem in viral marketing

Why this matters in practice

The Independent Cascade Model

Formal definition

Simulation on the Karate Club graph

Estimating expected spread

The Linear Threshold Model

Formal definition

Comparison: IC vs LT

The Greedy Algorithm and Submodularity

Submodularity: the diminishing returns property

Why the \((1-1/e)\) bound is tight

Live greedy seed selection

Why degree-based seeding underperforms

Strategic Leaders in Coordination Games

Games on networks

Strategic leaders and equilibrium flipping

Mini Case Study: Florentine Families

The Renaissance banking network

Network visualization with four centrality measures

Interpreting the disagreements

When to use which measure

The rankings disagree — and that disagreement is informative

What’s Next