Networks

Author

Jeremy Van Cleve

Published

April 2, 2024

library(tidyverse)
theme_set(theme_classic())

Outline for today

  • What is a network?
  • Networks in igraph
  • Reading networks into igraph
  • Plotting networks
  • Descriptive statistics of networks

What is a network?

Networks are increasingly useful tools for analyzing how the pieces of a system interact. Given the importance of both quantitative and visualization tools for analyzing networks, R is a natural environment for network analysis. Before describing what a network is more technically, it is important to give an intuitive definition and some examples.

A network is simply a way to describe how things are connected. More specifically, a network is a collection of objects (“nodes” or “vertices”) where lines (“edges”) are drawn between nodes that are connected or that interact. Nodes can be of different types (e.g., males and females in a network of people) and edges can be “undirected” where connections are symmetric (e.g., people on the same phone call) or “directed” where connections start at one node and stop at another (e.g., caller and receiver in a phone call). Finally, edges can be “weighted” instead of either being present or absent (e.g, the number of calls between the caller and receiver).

Examples of networks

Many kinds of data you have seen before are actually networks. Two of the most famous examples are of course the world wide web and the internet where the former consists of websites as nodes connected by links (directed edges) and the latter of computers connected by wifi, cables, and routers.

A classic movie about networks. Networks are EXCITING!

Since its so large, the internet is difficult to visualize, but the figure below attempts to do so.

The internet circa summer 2022 from http://www.opte.org/the-internet/

Biologists have found tremendous use for networks from gene regulatory and co-expression networks

library(sand)
data(Ecoli.data)

g.regDB <- graph_from_adjacency_matrix(regDB.adj, "undirected")
Warning: The `adjmatrix` argument of `graph_from_adjacency_matrix()` must be symmetric
with mode = "undirected" as of igraph 1.6.0.
ℹ Use mode = "max" to achieve the original behavior.
plot(g.regDB, vertex.size=5, vertex.label=NA)

Regulatory relationships in E. coli extracted from the RegulonDB database, see Kolaczyk and Csárdi, 2020, p. 121 1

to social networks in animals such as dolphins where edges represent individuals who spend time together.

dolphin = read_graph("dolphins.tsv", format = "edgelist", directed = FALSE)

plot(dolphin, vertex.size=5, vertex.label=NA)

Bottlenose dolphin community of Doubtful Sound, New Zealand 2

There are many of other kinds of biological networks that one might analyze such as protein interaction networks, epidemiological contact networks for disease transmission, and metabolic networks. Though the kinds of analyses one might perform on a network often and should depend on the biology of the network in question, there are some common properties and statistics that are often used with networks. These common analyses are part of the field of “network science”.

Networks in igraph

There are a number of packages in R that are useful for network visualization and analysis. One commonly used package is called igraph and is described nicely in the authors of the package Kolaczyk and Csárdi in the book “Statistical Analysis of Network Data with R” 3. igraph is actually a C-language library but it has both python and R interfaces, so it’s one of the most commonly used libraries for graphs. You will be briefly introduced to using igraph in R here.

Creating a network

Networks are also called “graphs” (i.e., “graph theory” in mathematics) and creating a small graph in igraph simply entails listing all of the edges where an edge is denoted by two nodes and their undirected, directed, and/or weighted connection.

library(igraph)

g1 = graph(edges = c(1,2, 2,3, 3,4, 4,2), directed = FALSE)
plot(g1)

g2 = graph.formula(1-2, 2-3, 3-4, 4-2)
plot(g2)

Above, you can see how to create the same graph in two different ways, one with the graph function that takes a list of edges and one with the graph.formula function that takes each edge as an argument. Below, you can use the structure function print_all from igraph to get a summary of each graph,

print_all(g1)
IGRAPH 2d8f734 U--- 4 4 -- 
+ edges from 2d8f734:
[1] 1--2 2--3 3--4 2--4
print_all(g2)
IGRAPH 4098985 UN-- 4 4 -- 
+ attr: name (v/c)
+ edges from 4098985 (vertex names):
[1] 1--2 2--3 2--4 3--4

which shows that the graphs have the same list of edges. You can also create graphs with strings as node names instead of numbers

g = graph(c("catherine", "ann", "ann", "vinnie", 
             "vinnie", "jeramiah", "julie", "jessica",
             "jessica", "vinnie", "jeramiah", "julie",
             "robbie", "catherine", "robbie", "rosana"), 
           isolates = c("ashely", "jeremy"))

plot(g)

where the isolates arguments allows you to have unconnected nodes.

Edge, vertex, and network attributes

Each graph object has a list of edges,

E(g)
+ 8/8 edges from 235518b (vertex names):
[1] catherine->ann       ann      ->vinnie    vinnie   ->jeramiah 
[4] julie    ->jessica   jessica  ->vinnie    jeramiah ->julie    
[7] robbie   ->catherine robbie   ->rosana   

vertices,

V(g)
+ 10/10 vertices, named, from 235518b:
 [1] catherine ann       vinnie    jeramiah  julie     jessica   robbie   
 [8] rosana    ashely    jeremy   

and “adjacency matrix”,

g[] # or get.adjacency(g) or as_adjacency_matrix(g)
10 x 10 sparse Matrix of class "dgCMatrix"
  [[ suppressing 10 column names 'catherine', 'ann', 'vinnie' ... ]]
                             
catherine . 1 . . . . . . . .
ann       . . 1 . . . . . . .
vinnie    . . . 1 . . . . . .
jeramiah  . . . . 1 . . . . .
julie     . . . . . 1 . . . .
jessica   . . 1 . . . . . . .
robbie    1 . . . . . . 1 . .
rosana    . . . . . . . . . .
ashely    . . . . . . . . . .
jeremy    . . . . . . . . . .

which is a matrix whose \((i,j)^{th}\) element is the strength of the edge between node \(i\) and node \(j\) and where zero strength is a lack of an edge.

as.matrix(g[]) # as a normal matrix
          catherine ann vinnie jeramiah julie jessica robbie rosana ashely
catherine         0   1      0        0     0       0      0      0      0
ann               0   0      1        0     0       0      0      0      0
vinnie            0   0      0        1     0       0      0      0      0
jeramiah          0   0      0        0     1       0      0      0      0
julie             0   0      0        0     0       1      0      0      0
jessica           0   0      1        0     0       0      0      0      0
robbie            1   0      0        0     0       0      0      1      0
rosana            0   0      0        0     0       0      0      0      0
ashely            0   0      0        0     0       0      0      0      0
jeremy            0   0      0        0     0       0      0      0      0
          jeremy
catherine      0
ann            0
vinnie         0
jeramiah       0
julie          0
jessica        0
robbie         0
rosana         0
ashely         0
jeremy         0

Thus, undirected graphs have symmetric adjacency matrices. You can store networks as a list of edges, an adjacency matrix, or as an “adjacency list”, which is a list of nodes where each element of the list is a list of other nodes that node is connected to:

as_adj_list(g)
$catherine
+ 2/10 vertices, named, from 235518b:
[1] ann    robbie

$ann
+ 2/10 vertices, named, from 235518b:
[1] catherine vinnie   

$vinnie
+ 3/10 vertices, named, from 235518b:
[1] ann      jeramiah jessica 

$jeramiah
+ 2/10 vertices, named, from 235518b:
[1] vinnie julie 

$julie
+ 2/10 vertices, named, from 235518b:
[1] jeramiah jessica 

$jessica
+ 2/10 vertices, named, from 235518b:
[1] vinnie julie 

$robbie
+ 2/10 vertices, named, from 235518b:
[1] catherine rosana   

$rosana
+ 1/10 vertex, named, from 235518b:
[1] robbie

$ashely
+ 0/10 vertices, named, from 235518b:

$jeremy
+ 0/10 vertices, named, from 235518b:

It is also simple to add attributes to the network, its edges, or its nodes. This is accomplished with the dollar sign “$” after the network object. For example, you can get the vertex names using

V(g)$name
 [1] "catherine" "ann"       "vinnie"    "jeramiah"  "julie"     "jessica"  
 [7] "robbie"    "rosana"    "ashely"    "jeremy"   

and set some attributes like “position”

V(g)$position = c("assoc", "full", "full", "full", "assoc", "assoc", "assist", "assist", "assoc", "assoc")
vertex_attr(g)
$name
 [1] "catherine" "ann"       "vinnie"    "jeramiah"  "julie"     "jessica"  
 [7] "robbie"    "rosana"    "ashely"    "jeremy"   

$position
 [1] "assoc"  "full"   "full"   "full"   "assoc"  "assoc"  "assist" "assist"
 [9] "assoc"  "assoc" 

Reading networks into igraph

Creating networks explicitly is practical for small networks, but large networks will likely come from large data files that list edges or the adjacency matrix. The dolphin social network was simply an edge list

dlist = read_tsv("dolphins.tsv", col_names = c("From", "To"))
Rows: 159 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (2): From, To

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dlist)
# A tibble: 6 × 2
   From    To
  <dbl> <dbl>
1     9     4
2    10     6
3    10     7
4    11     1
5    11     3
6    14     6

Converting this edge list into a graph involves calling the graph_from_data_frame() function where d is the list of edges, which is actually a data frame:

dg = graph_from_data_frame(d = dlist, directed = FALSE)

V(dg)
+ 62/62 vertices, named, from db2152e:
 [1] 9  10 11 14 15 16 17 18 19 20 21 22 23 25 26 27 28 29 30 31 32 33 34 35 36
[26] 37 38 39 40 41 42 43 44 45 46 47 48 50 51 52 53 54 55 56 57 58 59 60 61 62
[51] 4  6  7  1  3  2  8  13 24 5  12 49
plot(dg, vertex.size=6, vertex.label=NA)

The E. coli dataset was an adjacency matrix

str(regDB.adj)
 num [1:153, 1:153] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:153] "acrR_b0464_at" "ada_b2213_at" "adiY_b4116_at" "alpA_b2624_at" ...
  ..$ : chr [1:153] "acrR_b0464_at" "ada_b2213_at" "adiY_b4116_at" "alpA_b2624_at" ...

and it can be turned into a graph using the graph_from_adjacency_matrix() function:

eg = graph_from_adjacency_matrix(regDB.adj, mode = "undirected")

plot(eg, vertex.size=6, vertex.label=NA)

To use an adjacency matrix with weights, add the weighted = TRUE argument to graph_from_adjacency_matrix.

Plotting networks

There are a number of other parameters that one can use in igraph to alter how plot displays the network. To see some of these, we’ll load the social network derived from dominance behavior in a colony of 62 adult female Japanese macaques (Macaca fuscata fuscata)4. The edges are weighted since and show the number of dominance encounters between two individuals.

macaques = read_csv("moreno_macaques.csv")

mg = graph_from_data_frame(macaques)
mg
IGRAPH 00bb449 DNW- 62 1187 -- 
+ attr: name (v/c), weight (e/n)
+ edges from 00bb449 (vertex names):
 [1] 1->2  1->3  1->4  1->5  1->6  1->7  1->8  1->9  1->10 1->11 1->12 1->13
[13] 1->14 1->15 1->16 1->17 1->18 1->19 1->20 1->21 1->22 1->23 1->24 1->25
[25] 1->26 1->27 1->28 1->29 1->30 1->31 1->32 1->33 1->34 1->35 2->36 2->4 
[37] 2->5  2->37 2->6  2->7  2->8  2->9  2->11 2->12 2->13 2->14 2->15 2->16
[49] 2->18 2->38 2->19 2->39 2->40 2->41 2->20 2->21 2->22 2->23 2->24 2->42
[61] 2->31 2->43 2->44 2->45 2->46 2->47 2->34 3->36 3->4  3->5  3->37 3->6 
[73] 3->7  3->8  3->11 3->12 3->14 3->16 3->17 3->18 3->38 3->40 3->21 3->22
[85] 3->23 3->25 3->27 3->28 3->29 3->30 3->48 3->49 3->46 3->33 3->47 3->34
+ ... omitted several edges

From the description above, you can see that the macaques network has a weight attribute for the edges. You can use this to make the edges thicker and thinner. The color of the edges, the size of the nodes, and many other properties can be set directly to the graph; before, you altered the plot by adding additional parameters to the plot function itself. The plot function will automatically use attributes of the graph when plotting.

# set edge color, curve, width, and arrow size
E(mg)$color = "blue"
E(mg)$curved = 0.15
E(mg)$width = E(mg)$weight/2
E(mg)$arrow.size = 0.4

# set color of nodes
V(mg)$color = "red"

# set size of the node to its degree
V(mg)$size = degree(mg)/6

# get rid of vertex labels
V(mg)$label = NA

plot(mg)

Network layouts

If you have tried plotting the same graph with exactly the same options, you might have noticed that the layout of the nodes and edges changes slightly every time. The reason for this is that igraph uses an algorithm to figure out how to layout the nodes and edges and that algorithm has some randomness built in. The layout argument to plot tells igraph which algorithm to use.

plot(mg, layout = layout_in_circle(mg), edge.width = 0.5)

The argument to layout just a list of coordinates for each node,

head(layout_in_circle(mg))
          [,1]      [,2]
[1,] 1.0000000 0.0000000
[2,] 0.9948693 0.1011683
[3,] 0.9795299 0.2012985
[4,] 0.9541393 0.2993631
[5,] 0.9189578 0.3943559
[6,] 0.8743466 0.4853020

which means that you can technically place the nodes however you like.

The fancy algorithms for layout typically work by treating the nodes as balls and the edges as springs and trying to calculate the positions of the nodes such that nodes with more edges or connections are pulled in more different directions. A couple popular algorithms are the “Fruchterman–Reingold” (layout_with_fr) and “Kamada and Kawai” (layout_with_kk) methods.

plot(mg, layout = layout_with_fr(mg))

For larger graphs, the layout_with_lgl is supposed to work a little better, but you can see the results below aren’t great (YMMV).

plot(mg, layout = layout_with_lgl(mg))

Descriptive statistics of networks

Degree distribution

The most common quantitative measure of a network is its degree distribution where the degree of a node is how many edge connect to it. Much of “network science” begins with studying how different simple models that generate networks can also generate different classes of degree distributions. To say more about this is quite beyond the scope of this course, but for more see the book by Kolaczyk and Csárdi referenced above. Calculating and plotting the degree distribution is simple using degree() to calculate the degrees of every node but then plotting it with ggplot means you have to convert it to a data.frame:

degree(mg)
 1  2  3 36  4  5 37  6  7  8  9 10 11 12 13 14 15 16 17 18 38 19 50 56 39 40 
34 34 31 36 40 55 54 35 31 47 35 40 47 40 37 47 45 57 34 42 35 43 44 39 36 44 
41 20 21 22 51 59 52 23 24 25 26 27 28 42 29 30 31 43 60 48 49 53 44 45 32 61 
35 30 49 45 46 43 35 31 42 49 47 47 36 24 42 31 40 29 34 29 36 38 34 35 33 17 
46 33 47 54 34 35 55 57 58 62 
39 37 38 37 43 45 32 37 27 20 
data.frame(degree = degree(mg)) %>% 
  ggplot() + 
  geom_histogram(aes(x = degree)) +
  geom_vline(xintercept = mean(degree(mg))) + 
  geom_text(data=tibble(degree=mean(degree(mg))+2, count=9, label="mean"), mapping = aes(x=degree, y=count, label=label))

Centrality

One way to measure how important different nodes are is to calculate some measure of their “centrality”. The degree of a node is the simplest measure of its centrality. Another measure is the “closeness centrality”, which measures how close (in terms of number of hops across other nodes) a node is to all other nodes.

data.frame(closeness = closeness(mg)) %>%
  ggplot() + geom_histogram(aes(x = closeness))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

“Betweenness” centrality measures the extent to which a vertex is located ‘between’ other pairs of vertices.

data.frame(betweenness = betweenness(mg)) %>%
  ggplot() + geom_histogram(aes(x = betweenness))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Finally, “eigenvector” centrality measures how central a node is by how central the nodes are that are connected to it. This measure of centrality is very closely related to the original Google “PageRank” measure that Google used to rank websites for search.

data.frame(eigen_centrality = eigen_centrality(mg)$vector) %>%
  ggplot() + geom_histogram(aes(x = eigen_centrality))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## What about graphs and ggplot? Enter: ggraph

According to the package author: > ggraph is an extension of ggplot2 aimed at supporting relational data structures such as networks, graphs, and trees.

A lot of its functionality comes from igraph in the background but it wraps that in an interface and plotting style that is more familar to users of ggplot. Similarly, its companion package tidygraph > provides a way to switch between the two tables and provides dplyr verbs for manipulating them.

For example, for the E. coli network above, which had an adjacency matrix, we can convert it to graph with the function as_tbl_graph and then plot with ggraph:

library(ggraph)
library(tidygraph)

Attaching package: 'tidygraph'
The following object is masked from 'package:igraph':

    groups
The following object is masked from 'package:stats':

    filter
ecoli = as_tbl_graph(regDB.adj, directed = FALSE) 

ecoli %>%
  ggraph(layout = "fr") +
  geom_edge_link(color = "red") +
  geom_node_circle(aes(r=0.1)) + 
  coord_fixed()

The layout argument to ggraph is where we can give it the layout algorithm we want it to use. Likewise, we can make graphs from edge lists such as the macaque data:

macaque_graph = tbl_graph(edges = macaques)

macaque_graph %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_width=weight), color="red") +
  scale_edge_width(range = c(0.1,1)) +
  geom_node_circle(aes(r=0.02)) + 
  coord_fixed()

A cool thing about ggraph though is that it has a few interesting layouts. For example, to do a circular graph, we say we want a linear layout with circular=TRUE with edges that are arcs using geom_edge_arc:

macaque_graph %>%
  ggraph(layout = "linear", circular = TRUE) +
  geom_edge_arc(aes(edge_width=weight, color=weight>5)) +
  scale_edge_width(range = c(0.1,1)) +
  scale_edge_color_manual(values=c("TRUE" = "red", "FALSE" = "black")) +
  geom_node_circle(aes(r=0.02)) + 
  coord_fixed()

Network from bike share data

Let’s create a network from the bike share data that we saw earlier in the course. Such a network could consist of the stations as the nodes and the number of trips as the weights of edges between the nodes. So we’ll have to create a list of edges where each edge connects two stations with the appropriate weight.

To do this, we first load in the data from the database.

library(DBI)
library(dbplyr)

Attaching package: 'dbplyr'
The following objects are masked from 'package:dplyr':

    ident, sql
dbcon = dbConnect(RSQLite::SQLite(), "bikedb.sqlite")
dbcon
<SQLiteConnection>
  Path: /Users/vancleve/science/teaching/UK/BIO-580/2024-Spring/bikedb.sqlite
  Extensions: TRUE

We’ll filter the trips to only those that occur on July 4th just as a way to keep the dataset a manageable size.

july4th = dbcon |>
  tbl("trips") |>
  filter(day(start_time) == 4, month(start_time) == 7) |> as_tibble()

Now, we need to create the edge list for each pair of stations. We do this by looping over all the starting stations and ending stations and counting the number of trips in the table.

starts = july4th |> distinct(start_station_id) |> pull()
ends = july4th |> distinct(end_station_id) |> pull()

edges = tibble(From = character(), To = character(), Weight = numeric())

for (start in starts) {
  print(start)
  for (end in ends) {
    weight = july4th |> filter(start_station_id == start, end_station_id == end) |> nrow()
    edges = edges |> add_row(From = start, To = end, Weight = weight)
  }
}
edges = edges |> filter(Weight > 0)

That loop takes a bit of time so I’ve saved a previous run of the loop to a file and we’ll read that in and plot the graph

bike_edges = read_csv("sf_bike_july4_station_edges.csv")
Rows: 2321 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): From, To
dbl (1): Weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
bike_graph = tbl_graph(edges = bike_edges)

bike_graph %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_width=Weight), color="red") +
  scale_edge_width(range = c(0.1,10)) +
  geom_node_circle(aes(r=0.02)) + 
  coord_fixed()

Lab

Problems

  1. Use the data file rugosa_social_network.csv, which describes the social network of the sleepy lizard (Tiliqua rugosa)5 as an adjacency matrix. You may plot using the basic igraph functions or the ggraph ones. Use column_to_rownames(var = "ID") before converting the data frame to a matrix since the graph functions like matrices with rownames.
    • Read in the file (hint: you may need to be careful with which function and/or parameters you use here to get the matrix right, which includes denoting missing values in the matrix).
    • Plot the social network of the sleepy lizard
    • Make the node size proportional to the eigenvalue centrality of the node
    • What is the degree of the most highly connected lizard? Are there multiple lizards with this max degree?
  2. Identify the dataset that you would like to use for your project talk
    • Briefly describe the data
    • Briefly describe each of the figures you would like to make with these data
    • Identify any major hurdles to wrangling or plotting these data
    • You may change alter your dataset and figures for the final project talk, but use this as an opportunity to nail serious candidate dataset and figures.

Footnotes

  1. Kolaczyk, Eric D. and Csárdi, Gábor. 2020. Statistical Analysis of Network Data with R. Springer, New York, NY. http://doi.org/10.1007/978-3-030-44129-6. Github: https://github.com/kolaczyk/sand.↩︎

  2. Lusseau, David, Schneider, Karsten, Boisseau, Oliver J., Haase, Patti, Slooten, Elisabeth, and Dawson, Steve M.. 2003. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behav Ecol Sociobiol 54:396–405. http://dx.doi.org/10.1007/s00265-003-0651-y. Data at http://konect.uni-koblenz.de/networks/dolphins.↩︎

  3. Kolaczyk, Eric D. and Csárdi, Gábor. 2020. Statistical Analysis of Network Data with R. Springer, New York, NY. http://doi.org/10.1007/978-3-030-44129-6. Github: https://github.com/kolaczyk/sand.↩︎

  4. Takahata, Yukio. 1991. Diachronic changes in the dominance relations of adult female Japanese monkeys of the Arashiyama B group. The Monkeys of Arashiyama. State University of New York Press, Albany, pp 123–139. Data from http://konect.cc/networks/moreno_mac/.↩︎

  5. Bull, C. M., Godfrey, S. S., and Gordon, D. M.. 2012. Social networks and the spread of Salmonella in a sleepy lizard population. Mol Ecol 21:4386–4392. http://dx.doi.org/10.1111/j.1365-294X.2012.05653.x↩︎