Chapter 1: Basics of Networks
Chapter Introduction
The world is held together by connections. A bank lends to another bank, which lends to a hedge fund, which holds bonds issued by a sovereign that owes money to the first bank. A tweet is retweeted by someone with two million followers, igniting a brand crisis that takes six months and a hundred million dollars to extinguish. A traveller boards a flight in Wuhan, changes planes in Frankfurt, and three weeks later a respiratory virus is circulating in twelve countries. A new smartphone app grows from zero to fifty million users in eighteen months because every user automatically invites five contacts.
None of these phenomena can be understood by looking at the actors in isolation. The bank’s solvency depends not on its own balance sheet alone but on the network of obligations in which it is embedded. The tweet’s reach depends on where in the social graph the original poster sits. The pathogen’s spread is shaped by the topology of the airline network. The app’s explosive growth is a function of the referral graph and the strength of social ties. In every case, the structure of the network is the explanation.
This course — ISOM 5640, Social Network Analysis — is about building the mathematical and computational tools to study those structures rigorously. The reference text is Matthew O. Jackson’s Social and Economic Networks: Models and Analysis (Stanford / NBER, 2014), which provides the theoretical backbone. These notes complement Jackson with Python implementations using NetworkX, the standard library for network analysis, so that every concept you encounter can be computed, visualised, and experimented with immediately.
What you will learn in this chapter
Chapter 1 covers the foundational vocabulary. By the end you will be able to formalise a real-world relationship system as a mathematical graph; represent that graph inside Python as a NetworkX object, an adjacency matrix, or an edge list; compute the basic structural statistics — degree distribution, shortest paths, connected components; and quantify the “importance” of individual nodes using four centrality measures. Every subsequent chapter builds directly on these primitives.
The coding in this chapter uses only networkx, numpy, pandas, and matplotlib. You need basic Python fluency — functions, lists, dictionaries, for loops — but no prior graph theory or linear algebra is required. Mathematical notation is introduced gradually, always after the intuition has been established.
Why networks matter: four motivating examples
Financial contagion. During the 2008 global financial crisis, the failure of Lehman Brothers triggered losses at institutions that had never lent Lehman a single dollar, because they held derivatives whose counterparties had, and those counterparties had counterparties in turn. The Federal Reserve and the Bank of England later commissioned network analyses of the interbank market to understand why the contagion was so severe. A naive model that treated each bank independently would have predicted the system was safe; the network model revealed it was fragile.
Viral marketing. Dropbox grew from 100,000 to 4 million users in fifteen months without a dollar of paid advertising, by offering extra storage space to both the referrer and the new user. The mechanism only works because social networks have degree heterogeneity: a small fraction of users have enormous numbers of connections. Seed the right nodes — the high-degree “hubs” — and the cascade takes care of itself.
Terrorist cells. After the September 11 attacks, intelligence analysts reconstructed the communication network of the 19 hijackers. They found that the network had a characteristic structure: two or three individuals with high betweenness centrality — nodes that lay on the shortest paths between many pairs — had effectively coordinated the cell. Removing those nodes would have fragmented the conspiracy. The same structural logic guides modern counter-terrorism network analysis.
Supply chain resilience. When a single semiconductor fabrication plant in Taiwan halted production for two weeks in 2021, automobile manufacturers across three continents halted assembly lines. The auto industry had mapped its supply chain as a list of direct suppliers, not as a network — and was blind to the fact that many seemingly independent suppliers sourced the same critical components from the same single node. Network analysis would have revealed the fragility before the crisis.
Table of Contents
- Graphs as Models of Relationships
- Representing Graphs in Python
- Loading Real Network Data
- Degree, Paths, and Components
- Centrality Measures
- PageRank: A Natural Extension
- Mini Case Study: The Florentine Families
Graphs as Models of Relationships
The formal definition
A graph \(G\) is a pair \((V, E)\) where \(V\) is a set of nodes (also called vertices) and \(E\) is a set of edges (also called links or ties). Each edge connects two nodes and represents a relationship between them.
\[G = (V, E), \quad E \subseteq V \times V\]
The number of nodes \(|V|\) is called the order of the graph; the number of edges \(|E|\) is called the size. These two numbers alone — order and size — already constrain a great deal about what is structurally possible in the network.
A graph is the right abstraction whenever your data can be described as a set of entities and a set of pairwise relationships between them. The art of network analysis lies partly in choosing what counts as a node and what counts as an edge — decisions that shape every subsequent conclusion.
Directed versus undirected graphs
In an undirected graph, edges have no orientation: if node \(i\) is connected to node \(j\), then \(j\) is equally connected to \(i\). Friendship networks, co-authorship networks, and trade networks between countries where the relationship is symmetric are naturally modelled as undirected graphs.
In a directed graph (or digraph), every edge is an ordered pair \((i, j)\) meaning “\(i\) points to \(j\)” or “\(i\) has a directed relationship towards \(j\).” The Twitter follow graph is directed: user A can follow user B without B following back. Web hyperlinks are directed: page A linking to page B does not imply the reverse. Financial transaction networks are directed: the flow of money runs one way.
Formally, in a directed graph we distinguish between the in-degree of a node \(i\), written \(d^{\text{in}}(i)\), and the out-degree, \(d^{\text{out}}(i)\):
\[d^{\text{in}}(i) = |\{j : (j, i) \in E\}|, \qquad d^{\text{out}}(i) = |\{j : (i, j) \in E\}|\]
On Twitter, in-degree is the follower count (how many people follow you) and out-degree is the following count (how many people you follow). The two can differ wildly: a celebrity may have ten million followers and follow only two hundred accounts.
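A minimal sketch makes the distinction concrete (the handles are invented for illustration):

```python
import networkx as nx

# A tiny follow graph: an edge (u, v) means "u follows v".
D = nx.DiGraph()
D.add_edges_from([
    ("alice", "celebrity"),
    ("bob", "celebrity"),
    ("carol", "celebrity"),
    ("celebrity", "alice"),  # the celebrity follows back only one account
])

for user in D.nodes():
    print(f"{user}: followers (in-degree) = {D.in_degree(user)}, "
          f"following (out-degree) = {D.out_degree(user)}")
```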
Weighted versus unweighted graphs
An unweighted graph treats all edges as equivalent: either two nodes are connected or they are not. A weighted graph assigns a numerical value \(w_{ij}\) to each edge, capturing the strength or intensity of the relationship.
In a stock-return correlation network, the edge weight between two equities is their Pearson correlation coefficient — weights near 1 indicate stocks that move together, weights near 0 indicate near-independence. In a communication network of terrorists, the edge weight might be the frequency of phone calls between two individuals. In a supply chain, it might be the dollar volume of goods flowing through each relationship.
When edge weights are available, ignoring them — treating the graph as unweighted — discards information. On the other hand, the choice of threshold above which to include an edge (and thereby convert a dense weighted network into a sparser unweighted one) is itself an analytical decision that can dramatically change the network’s apparent structure.
Real-world examples
| Domain | Nodes | Edges | Directed? | Weighted? |
|---|---|---|---|---|
| Twitter / X | Users | Follow relationships | Yes | No |
| FX trading | Currencies | Trade volume | Yes | Yes |
| Facebook ego graph | Users | Friendships | No | No |
| Terrorist cell | Individuals | Communications | No | Yes (frequency) |
| Board interlocks | Companies | Shared board member | No | Yes (# shared) |
| Supply chain | Firms | Supplier–buyer link | Yes | Yes ($ volume) |
| Interbank market | Banks | Loans / credit lines | Yes | Yes ($ amount) |
| Citation network | Papers | Cites relationship | Yes | No |
Before writing any code, answer three questions: (1) What is a node in your system? (2) What is an edge — what relationship are you representing? (3) Is that relationship symmetric? The answers determine whether you need a Graph or a DiGraph, and whether to store edge weights. Getting the abstraction right takes five minutes but saves hours of debugging and misinterpretation later.
Representing Graphs in Python
The NetworkX Graph object
NetworkX is the standard Python library for network analysis. Its central object is the Graph (undirected) or DiGraph (directed), which stores nodes and edges together with any associated attributes. Installing and importing it takes one line; the library ships with dozens of built-in graph generators and algorithms that we will use throughout this course.
The cell below constructs a small friendship network from scratch: five people, four friendships. Every subsequent concept in this chapter is demonstrated on graphs built in exactly this way.
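One way to write that cell (the names are illustrative):

```python
import networkx as nx
import matplotlib.pyplot as plt

# Five people, four friendships.
G = nx.Graph()
G.add_nodes_from(["Ana", "Ben", "Cleo", "Dev", "Elif"])
G.add_edges_from([("Ana", "Ben"), ("Ben", "Cleo"),
                  ("Cleo", "Dev"), ("Ben", "Elif")])

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
print("Ben's neighbours:", list(G.neighbors("Ben")))

nx.draw(G, with_labels=True, node_color="lightblue")
plt.show()
```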
The adjacency matrix
Every graph on \(n\) nodes can be represented as an \(n \times n\) adjacency matrix \(A\) where
\[A_{ij} = \begin{cases} 1 & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}\]
For weighted graphs, \(A_{ij} = w_{ij}\) replaces the binary entry. For undirected graphs, \(A\) is symmetric: \(A_{ij} = A_{ji}\).
The adjacency matrix is the natural object of linear algebra. The product \(A^2\) counts the number of length-2 walks between every pair of nodes; \(A^k\) counts length-\(k\) walks. (A walk, unlike a path, may revisit nodes, which is why \((A^2)_{ii} = d_i\): going out along an edge and straight back is a length-2 walk.) This observation connects graph theory to matrix algebra and underpins eigenvector centrality and PageRank, both introduced later in this chapter.
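You can verify the walk-counting property directly with numpy. A quick sketch, reusing the five-person friendship graph from above:

```python
import networkx as nx
import numpy as np

G = nx.Graph([("Ana", "Ben"), ("Ben", "Cleo"),
              ("Cleo", "Dev"), ("Ben", "Elif")])
nodes = sorted(G.nodes())

A = nx.to_numpy_array(G, nodelist=nodes)  # n x n adjacency matrix
A2 = A @ A                                # (A^2)[i, j] = # length-2 walks i -> j

i, j = nodes.index("Ana"), nodes.index("Cleo")
print("length-2 walks Ana -> Cleo:", int(A2[i, j]))  # 1, via Ben
print("diagonal of A^2 equals the degrees:", np.diag(A2))
```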
The drawback of the adjacency matrix is density. A network of \(n = 10{,}000\) nodes requires storing \(10{,}000^2 = 100\) million entries even if only a few thousand edges exist. Real social networks are sparse — each node has far fewer than \(n\) neighbours — and adjacency matrices waste memory on zeros. For sparse graphs, the edge list (simply a list of pairs \((i, j)\)) is far more efficient.
The Facebook social graph has approximately 3 billion nodes and roughly 150 billion edges. Its adjacency matrix would require \(9 \times 10^{18}\) entries — far beyond any computer’s memory. In production, Facebook stores the graph as a compressed edge list in a distributed graph database (TAO, their internal system), and network algorithms run on distributed compute clusters. NetworkX is appropriate for graphs up to a few million edges on a laptop; for larger graphs, tools such as GraphX (Apache Spark) or specialized graph databases take over.
Edge lists and NetworkX internals
NetworkX stores graphs internally as nested dictionaries, not as matrices. This means that looking up whether an edge \((i, j)\) exists is \(O(1)\) on average, iterating over neighbours of node \(i\) costs \(O(d_i)\) where \(d_i\) is the degree of \(i\), and the total memory usage scales with \(|E|\) rather than \(|V|^2\). For the sparse networks that dominate real applications — social graphs, citation networks, supply chains — this representation is dramatically more efficient.
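A short demonstration of those access patterns:

```python
import networkx as nx

G = nx.Graph([("Ana", "Ben"), ("Ben", "Cleo"),
              ("Cleo", "Dev"), ("Ben", "Elif")])

# Edge existence is an average-O(1) dictionary lookup:
print(G.has_edge("Ana", "Ben"))   # True
print("Cleo" in G["Ben"])         # True; same lookup via the adjacency dict

# Iterating over a node's neighbours costs O(d_i):
print(dict(G.adj["Ben"]))         # {'Ana': {}, 'Cleo': {}, 'Elif': {}}
```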
Loading Real Network Data
Working with built-in graph datasets
The most reliable way to run network code in a browser-based Python environment is to use graphs that ship directly with NetworkX. Several canonical datasets are bundled with the library and load without any file access or internet connection. We will use them throughout this chapter and the ones that follow.
The four most useful for teaching purposes are:
- `nx.karate_club_graph()` — Zachary’s Karate Club (1977): 34 members of a university karate club, edges representing social interactions outside class. A landmark dataset in community detection.
- `nx.florentine_families_graph()` — 15th-century Florentine elite families, edges representing marriage alliances. The canonical illustration of network power (the Medici’s rise to dominance).
- `nx.les_miserables_graph()` — 77 characters from Victor Hugo’s novel, weighted edges representing co-appearance frequency in chapters.
- `nx.davis_southern_women_graph()` — a bipartite graph of 18 women and 14 social events from Davis et al.’s 1941 community study of Natchez, Mississippi.
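A quick sanity check that loads each dataset and prints its order and size:

```python
import networkx as nx

datasets = {
    "Karate Club": nx.karate_club_graph(),
    "Florentine Families": nx.florentine_families_graph(),
    "Les Miserables": nx.les_miserables_graph(),
    "Davis Southern Women": nx.davis_southern_women_graph(),
}
for name, G in datasets.items():
    print(f"{name}: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
```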
Constructing a network from inline data
In practice, you will often load network data from CSV files or text edge lists. The pattern below shows how to construct a graph from tabular data embedded directly in the notebook — the same approach works when pd.read_csv() loads data from disk or from a URL. We use a small synthetic social-media retweet network to illustrate.
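Here is one version of that pattern; the handles and retweet counts are invented:

```python
import pandas as pd
import networkx as nx

# An edge (source, target) means "source retweeted target".
edges = pd.DataFrame({
    "source":   ["@ana", "@ben", "@ana", "@cleo", "@dev"],
    "target":   ["@news", "@news", "@ben", "@news", "@cleo"],
    "retweets": [3, 1, 2, 5, 1],
})

G = nx.from_pandas_edgelist(edges, source="source", target="target",
                            edge_attr="retweets", create_using=nx.DiGraph())
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
print(G["@ana"]["@news"])   # {'retweets': 3}
```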
When you have an edge list stored as a text file — one edge per line, separated by spaces or tabs, as in the Facebook facebook_combined.txt dataset shipped with this course — NetworkX loads it in a single call:
G = nx.read_edgelist("facebook_combined.txt", create_using=nx.Graph())For weighted networks stored as three-column edge lists (source, target, weight):
G = nx.read_edgelist("network.txt", data=[("weight", float)])The Facebook combined dataset contains 4,039 users and 88,234 friendship edges — large enough that nx.draw() will produce an unintelligible hairball. For graphs of that scale, always compute summary statistics first and visualise only subgraphs or aggregate properties.
Degree, Paths, and Components
Degree and degree distribution
The degree of a node \(i\) in an undirected graph is the number of edges incident to it:
\[d_i = |\{j \in V : (i,j) \in E\}|\]
The handshaking lemma states that the sum of all degrees equals twice the number of edges:
\[\sum_{i \in V} d_i = 2|E|\]
This is easy to see: every edge contributes one to the degree of each of its two endpoints.
The degree distribution \(P(k)\) is the fraction of nodes with degree \(k\). This distribution is one of the most informative fingerprints of a network’s structure. Random graphs (Erdős–Rényi) have a Poisson degree distribution. Real social networks — Facebook, Twitter, the WWW — follow a power law: \(P(k) \sim k^{-\gamma}\) for some exponent \(\gamma\) typically in \([2, 3]\). This “scale-free” property means that a small number of nodes have extremely high degree (hubs), while the vast majority have very low degree. The implications for viral marketing, epidemic spreading, and network resilience are profound and form a central theme of Chapter 4.
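The sketch below computes \(P(k)\) for the Karate Club graph and checks the handshaking lemma along the way:

```python
import networkx as nx
from collections import Counter

G = nx.karate_club_graph()
degrees = [d for _, d in G.degree()]

# Handshaking lemma: the degrees sum to twice the edge count.
assert sum(degrees) == 2 * G.number_of_edges()

# P(k): fraction of nodes with degree k.
n = G.number_of_nodes()
for k, count in sorted(Counter(degrees).items()):
    print(f"P({k}) = {count / n:.3f}")
```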
Paths and shortest paths
A path from node \(u\) to node \(v\) is a sequence of distinct nodes \(u = v_0, v_1, \ldots, v_k = v\) such that \((v_{i-1}, v_i) \in E\) for all \(i\). The length of the path is \(k\) (the number of edges traversed). The shortest path (or geodesic) between \(u\) and \(v\) is the path of minimum length; its length is the geodesic distance \(d(u, v)\).
The average shortest path length of a graph is
\[\bar{L} = \frac{1}{n(n-1)} \sum_{i \neq j} d(i, j)\]
and it measures how “efficiently” information or goods can travel across the network. The famous “six degrees of separation” conjecture — that any two people on Earth are connected by at most six friendship hops — corresponds, loosely, to the claim that the human social graph has \(\bar{L} \lesssim 6\) (strictly, “at most six hops” is a statement about the maximum distance, the diameter, rather than the average).
The diameter of a graph is the maximum geodesic distance over all node pairs: \(\text{diam}(G) = \max_{i,j} d(i,j)\).
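All three quantities are one-liners in NetworkX. A sketch on the Karate Club graph, which is connected, so every distance is finite:

```python
import networkx as nx

G = nx.karate_club_graph()

print("distance 15 -> 16:", nx.shortest_path_length(G, 15, 16))
print("one geodesic:     ", nx.shortest_path(G, 15, 16))

# Both of the following require a connected graph.
print("average shortest path length:",
      round(nx.average_shortest_path_length(G), 3))
print("diameter:", nx.diameter(G))
```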
Connected components
A connected component is a maximal set of nodes such that there is a path between every pair. In an undirected graph, the components partition \(V\) into disjoint subsets. In a directed graph, we distinguish strongly connected components (directed path from every node to every other) from weakly connected components (connected when edge directions are ignored).
Most real social networks have one giant component containing nearly all nodes, plus a collection of small isolated fragments. The Erdős–Rényi random graph model (Chapter 4) gives a precise phase transition: there exists a critical edge density below which the largest component is of order \(\log n\) and above which a giant component of order \(n\) emerges suddenly.
When a real-world network has many small components, the analyst must decide whether to work on the full graph or only the largest connected component (LCC). Many centrality algorithms are undefined for disconnected graphs (closeness centrality requires finite path lengths between all pairs). The standard practice is to extract the LCC, justify the choice in the methodology section, and report the fraction of nodes retained. The Facebook combined dataset is in fact a single connected component, so its LCC retains 100% of nodes; fractions above 99% are typical for mature social networks.
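A sketch of that LCC workflow, using an artificially added isolate so the graph has more than one component:

```python
import networkx as nx

G = nx.karate_club_graph()
G.add_node("isolate")   # second, trivial component for illustration

components = list(nx.connected_components(G))
print("components:", len(components))

# Extract the largest connected component and report the retained fraction.
lcc = G.subgraph(max(components, key=len)).copy()
print(f"LCC retains {lcc.number_of_nodes() / G.number_of_nodes():.1%} of nodes")
```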
Centrality Measures
What does “important” mean?
Before defining any formula, it is worth asking what “importance” means in your specific context. The answer differs dramatically depending on the problem.
- Who should receive a vaccine first to minimise epidemic spread? The answer depends on who is most connected — high-degree nodes.
- Which airport, if shut down, would cause the most travel delays? The answer depends on which airport lies on the most paths between cities — betweenness centrality.
- Which Twitter user can reach the largest audience in three steps? The answer depends on closeness centrality — average distance to all others.
- Whose opinion matters most in a board deliberation? The answer may depend on whose connections themselves carry influence — eigenvector centrality.
Each question calls for a different measure. The skill of a network analyst lies not in computing all four and reporting them all, but in identifying which structural property aligns with the substantive question.
Degree centrality
The simplest measure of importance is the number of direct connections a node has. Normalised so that the maximum possible value is 1:
\[C_D(i) = \frac{d_i}{n - 1}\]
where \(d_i\) is the degree of node \(i\) and \(n\) is the total number of nodes. Degree centrality captures local influence: how many people does this node directly touch? In a disease transmission network, a high-degree node is a superspreader. In an influencer marketing context, it is the creator with the largest direct audience.
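In NetworkX this is `nx.degree_centrality`, which applies exactly the \(d_i / (n-1)\) normalisation. A quick check on the Karate Club graph:

```python
import networkx as nx

G = nx.karate_club_graph()
n = G.number_of_nodes()

dc = nx.degree_centrality(G)
assert abs(dc[33] - G.degree(33) / (n - 1)) < 1e-12   # matches the formula

top3 = sorted(dc, key=dc.get, reverse=True)[:3]
print("top 3 by degree centrality:", [(v, round(dc[v], 3)) for v in top3])
```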
Closeness centrality
Closeness centrality measures how quickly a node can reach all others. Formally, it is the reciprocal of the average geodesic distance from node \(i\) to all reachable nodes:
\[C_C(i) = \frac{n - 1}{\sum_{j \neq i} d(i, j)}\]
A node with high closeness sits near the “centre” of the network; it can disseminate information efficiently because it is, on average, few hops away from everyone else. High closeness does not require many direct neighbours — a node on a path between two densely connected clusters may have low degree but high closeness.
Betweenness centrality
Betweenness centrality measures how often a node lies on the shortest path between two other nodes:
\[C_B(i) = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}}\]
where \(\sigma_{st}\) is the total number of shortest paths from node \(s\) to node \(t\), and \(\sigma_{st}(i)\) is the number of those paths that pass through \(i\). The normalised version divides by \((n-1)(n-2)\) (or \((n-1)(n-2)/2\) for undirected graphs).
Betweenness centrality identifies brokers and gatekeepers — nodes that control the flow of information, goods, or influence between otherwise separate groups. Removing a high-betweenness node tends to fragment the network; adding connections that reduce the betweenness of a broker undermines its structural power. In Ronald Burt’s theory of structural holes, high-betweenness individuals occupy advantageous positions precisely because they bridge groups that do not talk to each other directly.
Eigenvector centrality
Eigenvector centrality extends degree centrality by accounting for the quality, not just the quantity, of a node’s connections. The intuition: being connected to influential nodes should count more than being connected to peripheral ones. Formally, eigenvector centrality \(x_i\) satisfies the self-referential equation
\[x_i = \frac{1}{\lambda} \sum_{j \in N(i)} x_j\]
or in matrix form, \(\mathbf{x} = \frac{1}{\lambda} A\mathbf{x}\), meaning \(\mathbf{x}\) is an eigenvector of the adjacency matrix \(A\) corresponding to eigenvalue \(\lambda\). By the Perron–Frobenius theorem, for a connected graph with non-negative adjacency matrix the leading eigenvector (corresponding to the largest eigenvalue) has all positive entries and provides a well-defined centrality measure.
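To connect the fixed-point definition to the linear algebra, the sketch below runs a bare-bones power iteration on the adjacency matrix and compares it with NetworkX's built-in routine:

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)

# Power iteration: repeatedly apply A and renormalise. For a connected
# graph this converges to the leading (Perron-Frobenius) eigenvector.
x = np.ones(A.shape[0])
for _ in range(200):
    x = A @ x
    x = x / np.linalg.norm(x)

ec = nx.eigenvector_centrality(G, max_iter=1000)
print("node 33, power iteration:", round(x[33], 4))
print("node 33, NetworkX:       ", round(ec[33], 4))
```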
Eigenvector centrality is the mathematical ancestor of PageRank (Chapter 2): Google’s original algorithm is eigenvector centrality applied to the web graph, with modifications to handle dangling nodes and ensure convergence.
Betweenness centrality requires computing shortest paths between all pairs of nodes. The naive algorithm runs in \(O(n^3)\) time; the best known algorithm (Brandes, 2001) reduces this to \(O(nm)\) for unweighted graphs and \(O(nm + n^2 \log n)\) for weighted graphs, where \(n\) is the number of nodes and \(m\) is the number of edges. For the Facebook social graph (\(n = 3\text{B}\), \(m = 150\text{B}\)), exact betweenness is computationally infeasible and must be approximated using random sampling. For graphs up to a few hundred thousand nodes, NetworkX’s implementation finishes in seconds on a modern laptop.
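NetworkX exposes the sampling approximation through the `k` parameter of `nx.betweenness_centrality`. On a 34-node graph the exact computation is instant, so the sketch below uses sampling purely to show the interface:

```python
import networkx as nx

G = nx.karate_club_graph()

exact = nx.betweenness_centrality(G)                   # Brandes, all sources
approx = nx.betweenness_centrality(G, k=10, seed=42)   # 10 sampled sources

for node in (0, 33):
    print(f"node {node}: exact = {exact[node]:.3f}, sampled = {approx[node]:.3f}")
```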
PageRank: A Natural Extension
From eigenvectors to random walks
Eigenvector centrality has a clean intuition and works well on undirected, connected graphs. When the graph is directed — as the web is, as Twitter is — eigenvector centrality can fail because some nodes may have zero in-degree (no pages link to them) and the Perron–Frobenius theorem no longer guarantees a positive leading eigenvector.
PageRank, introduced by Brin and Page (1998) in their original Google paper, resolves this by modelling a random surfer: a user who, at each step, either follows a randomly chosen outgoing link (with probability \(d\), the damping factor, typically 0.85) or jumps to a uniformly random page (with probability \(1 - d\)). The stationary distribution of this random walk gives the PageRank score.
The stationary distribution satisfies
\[\mathbf{r} = d \cdot M \mathbf{r} + \frac{1-d}{n} \mathbf{1}\]
where \(M\) is the column-stochastic version of the adjacency matrix (each column normalised by the out-degree). The teleportation term \(\frac{1-d}{n} \mathbf{1}\) ensures the Markov chain is ergodic (irreducible and aperiodic), guaranteeing convergence to a unique stationary distribution regardless of graph structure.
PageRank is the subject of Chapter 2. The cell below runs it on a small directed graph so you can see the output format and compare it to degree.
The network below has one “authoritative” hub (@news_outlet) pointed to by many users. Predict which node will have the highest PageRank score before running the cell.
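A possible version of that cell, with invented handles around the @news_outlet hub:

```python
import networkx as nx

# An edge (u, v) means "u points to v" (e.g., u retweets or links to v).
D = nx.DiGraph()
D.add_edges_from([
    ("@ana", "@news_outlet"), ("@ben", "@news_outlet"),
    ("@cleo", "@news_outlet"), ("@dev", "@news_outlet"),
    ("@news_outlet", "@ana"),
    ("@ben", "@cleo"), ("@dev", "@cleo"),
])

pr = nx.pagerank(D, alpha=0.85)   # alpha is the damping factor d
for node, score in sorted(pr.items(), key=lambda kv: -kv[1]):
    print(f"{node:13s} in-degree = {D.in_degree(node)}  PageRank = {score:.3f}")
```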
Twitter’s (now X’s) “Who to Follow” recommendations are partly derived from a variant of PageRank applied to the follower graph. Google Scholar’s “Cited by” counts are a simpler in-degree measure; the full PageRank of academic papers would weight citations from high-impact journals more heavily than citations from obscure conference proceedings. LinkedIn’s “How you’re connected” uses shortest-path algorithms in its people graph. Every major platform runs graph algorithms at production scale — what you are learning here is the mathematical core of those systems.
Mini Case Study: The Florentine Families
Background
In fifteenth-century Florence, sixteen elite families competed for political and economic dominance through a web of marriage alliances. In 1993, John Padgett and Christopher Ansell published a landmark study — “Robust Action and the Rise of the Medici, 1400–1434” (American Journal of Sociology, 98(6)) — arguing that the Medici family’s rise to power was not merely a consequence of their wealth but of their structural position in the marriage network. The Medici sat at the centre of the network as brokers, connecting families that had no direct ties to each other.
This dataset has become one of the most widely analysed networks in social science, partly because it is small enough to inspect directly and partly because the structural explanation matches the historical record so precisely.
Computing all four centralities
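A sketch of that computation. Note that NetworkX's version of the dataset contains 15 families; the sixteenth, the Pucci, had no marriage ties and is omitted:

```python
import networkx as nx
import pandas as pd

G = nx.florentine_families_graph()   # 15 families, marriage ties

df = pd.DataFrame({
    "degree":      nx.degree_centrality(G),
    "closeness":   nx.closeness_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
})

# Rank families by each measure; show the top five brokers.
print(df.sort_values("betweenness", ascending=False).head().round(3))
```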
Interpreting the rankings
Discussion: what each measure captures — and why the Medici won
The four centrality measures tell consistent but distinct stories about power in the Florentine network.
Degree centrality shows that the Medici had the most marriage alliances of any family — six direct ties. This reflects raw connection count and is the most immediately visible form of network power. But it is not the complete picture: the Strozzi had four connections of their own, and they did not rise to dominance.
Closeness centrality shows the Medici sitting closest to all other families on average — they could propagate information, coordinate agreements, or respond to threats with fewer intermediaries than any rival. Closeness is the measure of speed and reach: in a world where communication was slow, being close to everyone in the network was a genuine strategic asset.
Betweenness centrality is where the Padgett–Ansell explanation becomes most sharp. The Medici score dramatically higher than any other family on betweenness. Many of the other families’ shortest paths to each other passed through the Medici — meaning that the Medici were structurally positioned as brokers. They could facilitate or impede the information flow and coalition-formation of their rivals. The Strozzi, Guadagni, and Albizzi, despite their wealth and their own connections, were embedded in parts of the network that were less well-connected to the rest. Their alliances were, in a sense, local.
Eigenvector centrality tells a slightly different story: who is connected to influential families? Here the Medici again score highly, but the Guadagni and Strozzi are comparatively stronger than they look on betweenness. This reflects the fact that eigenvector centrality rewards being embedded in a locally dense cluster of well-connected families, not merely sitting at a bridge.
The historical outcome — the Medici’s rise to dominance after the Ciompi Revolt of 1378, their patronage network, and their eventual control of Florentine government — maps onto the betweenness advantage more cleanly than any other single measure. Padgett and Ansell’s argument is that the Medici practised “robust action”: by maintaining ties across factional lines, they accumulated structural power without committing to any single faction, leaving rivals unable to coordinate effectively against them.
There is no universally “correct” centrality measure. Degree centrality answers “who has the most direct connections?” Closeness answers “who can reach everyone most quickly?” Betweenness answers “who controls the flow between others?” Eigenvector answers “who is connected to the most influential nodes?” Always start by asking what kind of importance your problem requires, then choose the measure that operationalises that concept. In the Florentine families case, the question was “who has structural power as a broker?” — and betweenness centrality was the right answer.
The Karate Club: a second illustration
To reinforce the intuition with a different dataset, the cell below runs the same four-measure analysis on Zachary’s Karate Club — a network where the ground truth (which members sided with which faction after the club split) is known. Nodes 0 and 33 are the two leaders of the eventual split.
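A possible version of that cell:

```python
import networkx as nx
import pandas as pd

G = nx.karate_club_graph()

df = pd.DataFrame({
    "degree":      nx.degree_centrality(G),
    "closeness":   nx.closeness_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
})

# The two faction leaders are nodes 0 (instructor) and 33 (president).
print(df.loc[[0, 33]].round(3))
print("\nTop 5 by betweenness:")
print(df.sort_values("betweenness", ascending=False).head().round(3))
```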
The two faction leaders — node 0 (the instructor) and node 33 (the club president) — consistently rank near the top across all four measures. This is not surprising: the eventual split reflects the fact that the network was already structured around two high-centrality poles, with other members gravitating towards one or the other. The network’s structure predicted the social fracture before it happened.
What’s Next
This chapter has given you the foundational vocabulary of network analysis: the mathematical definition of a graph, the Python tools to build and manipulate one, the key structural statistics (degree distribution, shortest paths, connected components), and four distinct measures of node importance. Every subsequent chapter applies and extends these primitives.
Chapter 2 — Identifying Important Nodes goes deeper into the question of influence. PageRank is treated in full: we derive the random-walk model, implement the power iteration algorithm, and compare it to the centrality measures introduced here. We then turn to the harder problem of influence maximization — given a budget of \(k\) seed nodes, which \(k\) should you activate to maximise the expected spread of a behaviour through the network? The greedy algorithm, the Independent Cascade model, and the submodularity property that makes the greedy approach near-optimal are all covered with live demonstrations.
The mathematical tools become more demanding in Chapter 2 onwards, but the conceptual structure is the same: start with a marketing or social question, formalise it as a network problem, implement it in Python, and interpret the output in terms of the original question.
Prof. Xuhu Wan · HKUST ISOM 5640 · Social Network Analysis · 2026 Edition