Networks

Author

Jeremy Van Cleve

Published

November 28, 2022

library(tidyverse)
theme_set(theme_classic())

Outline for today

  • What is a network?
  • Networks in igraph
  • Reading networks into igraph
  • Plotting networks
  • Descriptive statistics of networks

What is a network?

Networks are increasingly useful tools for analyzing how the pieces of a system interact. Given the importance of both quantitative and visualization tools for analyzing networks, R is a natural environment for network analysis. Before describing what a network is more technically, it is important to give an intuitive definition and some examples.

A network is simply a way to describe how things are connected. More specifically, a network is a collection of objects (“nodes” or “vertices”) where lines (“edges”) are drawn between nodes that are connected or that interact. Nodes can be of different types (e.g., males and females in a network of people) and edges can be “undirected” where connections are symmetric (e.g., people on the same phone call) or “directed” where connections start at one node and stop at another (e.g., caller and receiver in a phone call). Finally, edges can be “weighted” instead of either being present or absent (e.g, the number of calls between the caller and receiver).

Examples of networks

Many kinds of data you have seen before are actually networks. Two of the most famous examples are of course the world wide web and the internet where the former consists of websites as nodes connected by links (directed edges) and the latter of computers connected by wifi, cables, and routers.

A classic movie about networks. Networks are EXCITING!

Since its so large, the internet is difficult to visualize, but the figure below attempts to do so.

The internet circa summer 2022 from http://www.opte.org/the-internet/

Biologists have found tremendous use for networks from gene regulatory and co-expression networks

library(sand)
data(Ecoli.data)

g.regDB <- graph_from_adjacency_matrix(regDB.adj, "undirected")
plot(g.regDB, vertex.size=5, vertex.label=NA)

Regulatory relationships in E. coli extracted from the RegulonDB database, see Kolaczyk and Csárdi, 2020, p. 121 1

to social networks in animals such as dolphins where edges represent individuals who spend time together.

dolphin = read_graph("dolphins.tsv", format = "edgelist", directed = FALSE)

plot(dolphin, vertex.size=5, vertex.label=NA)

Bottlenose dolphin community of Doubtful Sound, New Zealand 2

There are many of other kinds of biological networks that one might analyze such as protein interaction networks, epidemiological contact networks for disease transmission, and metabolic networks. Though the kinds of analyses one might perform on a network often and should depend on the biology of the network in question, there are some common properties and statistics that are often used with networks. These common analyses are part of the field of “network science”.

Networks in igraph

There are a number of packages in R that are useful for network visualization and analysis. One commonly used package is called igraph and is described nicely in the authors of the package Kolaczyk and Csárdi in the book “Statistical Analysis of Network Data with R” 3. igraph is actually a C-language library but it has both python and R interfaces, so it’s one of the most commonly used libraries for graphs. You will be briefly introduced to using igraph in R here.

Creating a network

Networks are also called “graphs” (i.e., “graph theory” in mathematics) and creating a small graph in igraph simply entails listing all of the edges where an edge is denoted by two nodes and their undirected, directed, and/or weighted connection.

library(igraph)

g1 = graph( edges = c(1,2, 2,3, 3,4, 4,2), directed = FALSE)
plot(g1)

g2 = graph.formula(1-2, 2-3, 3-4, 4-2)
plot(g2)

Above, you can see how to create the same graph in two different ways, one with the graph function that takes a list of edges and one with the graph.formula function that takes each edge as an argument. Below, you can use the structure function print_all from igraph to get a summary of each graph,

print_all(g1)
IGRAPH c406477 U--- 4 4 -- 
+ edges from c406477:
[1] 1--2 2--3 3--4 2--4
print_all(g2)
IGRAPH 299093c UN-- 4 4 -- 
+ attr: name (v/c)
+ edges from 299093c (vertex names):
[1] 1--2 2--3 2--4 3--4

which shows that the graphs have the same list of edges. You can also create graphs with strings as node names instead of numbers

g = graph( c("catherine", "david weis", "david weis", "vinnie", 
             "vinnie", "jeramiah", "julie", "jessica",
             "jessica", "vinnie", "jeramiah", "julie",
             "robbie", "catherine", "robbie", "rosana"), 
           isolates = c("ashely", "jeremy"))

plot(g)

where the isolates arguments allows you to have unconnected nodes.

Edge, vertex, and network attributes

Each graph object has a list of edges,

E(g)
+ 8/8 edges from 7021d55 (vertex names):
[1] catherine ->david weis david weis->vinnie     vinnie    ->jeramiah  
[4] julie     ->jessica    jessica   ->vinnie     jeramiah  ->julie     
[7] robbie    ->catherine  robbie    ->rosana    

vertices,

V(g)
+ 10/10 vertices, named, from 7021d55:
 [1] catherine  david weis vinnie     jeramiah   julie      jessica   
 [7] robbie     rosana     ashely     jeremy    

and “adjacency matrix”,

g[] # or get.adjacency(g) or as_adjacency_matrix(g)
10 x 10 sparse Matrix of class "dgCMatrix"
   [[ suppressing 10 column names 'catherine', 'david weis', 'vinnie' ... ]]
                              
catherine  . 1 . . . . . . . .
david weis . . 1 . . . . . . .
vinnie     . . . 1 . . . . . .
jeramiah   . . . . 1 . . . . .
julie      . . . . . 1 . . . .
jessica    . . 1 . . . . . . .
robbie     1 . . . . . . 1 . .
rosana     . . . . . . . . . .
ashely     . . . . . . . . . .
jeremy     . . . . . . . . . .

which is a matrix whose \((i,j)^{th}\) element is the strength of the edge between node \(i\) and node \(j\) and where zero strength is a lack of an edge.

as.matrix(g[]) # as a normal matrix
           catherine david weis vinnie jeramiah julie jessica robbie rosana
catherine          0          1      0        0     0       0      0      0
david weis         0          0      1        0     0       0      0      0
vinnie             0          0      0        1     0       0      0      0
jeramiah           0          0      0        0     1       0      0      0
julie              0          0      0        0     0       1      0      0
jessica            0          0      1        0     0       0      0      0
robbie             1          0      0        0     0       0      0      1
rosana             0          0      0        0     0       0      0      0
ashely             0          0      0        0     0       0      0      0
jeremy             0          0      0        0     0       0      0      0
           ashely jeremy
catherine       0      0
david weis      0      0
vinnie          0      0
jeramiah        0      0
julie           0      0
jessica         0      0
robbie          0      0
rosana          0      0
ashely          0      0
jeremy          0      0

Thus, undirected graphs have symmetric adjacency matrices. You can store networks as a list of edges, an adjacency matrix, or as an “adjacency list”, which is a list of nodes where each element of the list is a list of other nodes that node is connected to:

get.adjlist(g)
$catherine
+ 2/10 vertices, named, from 7021d55:
[1] david weis robbie    

$`david weis`
+ 2/10 vertices, named, from 7021d55:
[1] catherine vinnie   

$vinnie
+ 3/10 vertices, named, from 7021d55:
[1] david weis jeramiah   jessica   

$jeramiah
+ 2/10 vertices, named, from 7021d55:
[1] vinnie julie 

$julie
+ 2/10 vertices, named, from 7021d55:
[1] jeramiah jessica 

$jessica
+ 2/10 vertices, named, from 7021d55:
[1] vinnie julie 

$robbie
+ 2/10 vertices, named, from 7021d55:
[1] catherine rosana   

$rosana
+ 1/10 vertex, named, from 7021d55:
[1] robbie

$ashely
+ 0/10 vertices, named, from 7021d55:

$jeremy
+ 0/10 vertices, named, from 7021d55:

It is also simple to add attributes to the network, its edges, or its nodes. This is accomplished with the dollar sign “$” after the network object. For example, you can get the vertex names using

V(g)$name
 [1] "catherine"  "david weis" "vinnie"     "jeramiah"   "julie"     
 [6] "jessica"    "robbie"     "rosana"     "ashely"     "jeremy"    

and set some attributes like “position”

V(g)$position = c("assoc", "full", "full", "full", "assoc", "assoc", "assist", "assist", "assoc", "assoc")
vertex_attr(g)
$name
 [1] "catherine"  "david weis" "vinnie"     "jeramiah"   "julie"     
 [6] "jessica"    "robbie"     "rosana"     "ashely"     "jeremy"    

$position
 [1] "assoc"  "full"   "full"   "full"   "assoc"  "assoc"  "assist" "assist"
 [9] "assoc"  "assoc" 

Reading networks into igraph

Creating networks explicitly is practical for small networks, but large networks will likely come from large data files that list edges or the adjacency matrix. The dolphin social network was simply an edge list

dlist = read_tsv("dolphins.tsv", col_names = c("From", "To"))
Rows: 159 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (2): From, To

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dlist)
# A tibble: 6 × 2
   From    To
  <dbl> <dbl>
1     9     4
2    10     6
3    10     7
4    11     1
5    11     3
6    14     6

Converting this edge list into a graph involves calling the graph_from_data_frame() function where d is the list of edges, which is actually a data frame:

dg = graph_from_data_frame(d = dlist, directed = FALSE)

V(dg)
+ 62/62 vertices, named, from f55adcc:
 [1] 9  10 11 14 15 16 17 18 19 20 21 22 23 25 26 27 28 29 30 31 32 33 34 35 36
[26] 37 38 39 40 41 42 43 44 45 46 47 48 50 51 52 53 54 55 56 57 58 59 60 61 62
[51] 4  6  7  1  3  2  8  13 24 5  12 49
plot(dg, vertex.size=6, vertex.label=NA)

The E. coli dataset was an adjacency matrix

str(regDB.adj)
 num [1:153, 1:153] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:153] "acrR_b0464_at" "ada_b2213_at" "adiY_b4116_at" "alpA_b2624_at" ...
  ..$ : chr [1:153] "acrR_b0464_at" "ada_b2213_at" "adiY_b4116_at" "alpA_b2624_at" ...

and it can be turned into a graph using the graph_from_adjacency_matrix() function:

eg = graph_from_adjacency_matrix(regDB.adj, mode = "undirected")

plot(eg, vertex.size=6, vertex.label=NA)

To use an adjacency matrix with weights, add the weighted = TRUE argument to graph_from_adjacency_matrix.

Plotting networks

There are a number of other parameters that one can use in igraph to alter how plot displays the network. To see some of these, we’ll load the social network derived from dominance behavior in a colony of 62 adult female Japanese macaques (Macaca fuscata fuscata)4. The edges are weighted since and show the number of dominance encounters between two individuals.

macaques = read_csv("moreno_macaques.csv")

mg = graph_from_data_frame(macaques)
mg
IGRAPH 877773b DNW- 62 1187 -- 
+ attr: name (v/c), weight (e/n)
+ edges from 877773b (vertex names):
 [1] 1->2  1->3  1->4  1->5  1->6  1->7  1->8  1->9  1->10 1->11 1->12 1->13
[13] 1->14 1->15 1->16 1->17 1->18 1->19 1->20 1->21 1->22 1->23 1->24 1->25
[25] 1->26 1->27 1->28 1->29 1->30 1->31 1->32 1->33 1->34 1->35 2->36 2->4 
[37] 2->5  2->37 2->6  2->7  2->8  2->9  2->11 2->12 2->13 2->14 2->15 2->16
[49] 2->18 2->38 2->19 2->39 2->40 2->41 2->20 2->21 2->22 2->23 2->24 2->42
[61] 2->31 2->43 2->44 2->45 2->46 2->47 2->34 3->36 3->4  3->5  3->37 3->6 
[73] 3->7  3->8  3->11 3->12 3->14 3->16 3->17 3->18 3->38 3->40 3->21 3->22
[85] 3->23 3->25 3->27 3->28 3->29 3->30 3->48 3->49 3->46 3->33 3->47 3->34
+ ... omitted several edges

From the description above, you can see that the macaques network has a weight attribute for the edges. You can use this to make the edges thicker and thinner. The color of the edges, the size of the nodes, and many other properties can be set directly to the graph; before, you altered the plot by adding additional parameters to the plot function itself. The plot function will automatically use attributes of the graph when plotting.

# set edge color, curve, width, and arrow size
E(mg)$color = "blue"
E(mg)$curved = 0.15
E(mg)$width = E(mg)$weight/2
E(mg)$arrow.size = 0.4

# set color of nodes
V(mg)$color = "red"

# set size of the node to its degree
V(mg)$size = degree(mg)/6

# get rid of vertex labels
V(mg)$label = NA

plot(mg)

Network layouts

If you have tried plotting the same graph with exactly the same options, you might have noticed that the layout of the nodes and edges changes slightly every time. The reason for this is that igraph uses an algorithm to figure out how to layout the nodes and edges and that algorithm has some randomness built in. The layout argument to plot tells igraph which algorithm to use.

plot(mg, layout = layout_in_circle(mg), edge.width = 0.5)

The argument to layout just a list of coordinates for each node,

head(layout_in_circle(mg))
          [,1]      [,2]
[1,] 1.0000000 0.0000000
[2,] 0.9948693 0.1011683
[3,] 0.9795299 0.2012985
[4,] 0.9541393 0.2993631
[5,] 0.9189578 0.3943559
[6,] 0.8743466 0.4853020

which means that you can technically place the nodes however you like.

The fancy algorithms for layout typically work by treating the nodes as balls and the edges as springs and trying to calculate the positions of the nodes such that nodes with more edges or connections are pulled in more different directions. A couple popular algorithms are the “Fruchterman–Reingold” (layout_with_fr) and “Kamada and Kawai” (layout_with_kk) methods.

plot(mg, layout = layout_with_fr(mg))

For larger graphs, the layout_with_lgl apparently works a little better.

plot(mg, layout = layout_with_lgl(mg))

Descriptive statistics of networks

Degree distribution

The most common quantitative measure of a network is its degree distribution where the degree of a node is how many edge connect to it. Much of “network science” begins with studying how different simple models that generate networks can also generate different classes of degree distributions. To say more about this is quite beyond the scope of this course, but for more see the book by Kolaczyk and Csárdi referenced above. Calculating and plotting the degree distribution is simple using degree() to calculate the degrees of every node but then plotting it with ggplot means you have to convert it to a data.frame:

degree(mg)
 1  2  3 36  4  5 37  6  7  8  9 10 11 12 13 14 15 16 17 18 38 19 50 56 39 40 
34 34 31 36 40 55 54 35 31 47 35 40 47 40 37 47 45 57 34 42 35 43 44 39 36 44 
41 20 21 22 51 59 52 23 24 25 26 27 28 42 29 30 31 43 60 48 49 53 44 45 32 61 
35 30 49 45 46 43 35 31 42 49 47 47 36 24 42 31 40 29 34 29 36 38 34 35 33 17 
46 33 47 54 34 35 55 57 58 62 
39 37 38 37 43 45 32 37 27 20 
data.frame(degree = degree(mg)) %>% 
  ggplot() + 
  geom_histogram(aes(x = degree)) +
  geom_vline(xintercept = mean(degree(mg))) + 
  geom_text(data=tibble(degree=mean(degree(mg))+2, count=9, label="mean"), mapping = aes(x=degree, y=count, label=label))

Centrality

One way to measure how important different nodes are is to calculate some measure of their “centrality”. The degree of a node is the simplest measure of its centrality. Another measure is the “closeness centrality”, which measures how close (in terms of number of hops across other nodes) a node is to all other nodes.

data.frame(closeness = closeness(mg)) %>%
  ggplot() + geom_histogram(aes(x = closeness))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite values (stat_bin).

“Betweenness” centrality measures the extent to which a vertex is located ‘between’ other pairs of vertices.

data.frame(betweenness = betweenness(mg)) %>%
  ggplot() + geom_histogram(aes(x = betweenness))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Finally, “eigenvector” centrality measures how central a node is by how central the nodes are that are connected to it. This measure of centrality is very closely related to the original Google “PageRank” measure that Google used to rank websites for search.

data.frame(eigen_centrality = eigen_centrality(mg)$vector) %>%
  ggplot() + geom_histogram(aes(x = eigen_centrality))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## What about graphs and ggplot? Enter: ggraph

According to the package author: > ggraph is an extension of ggplot2 aimed at supporting relational data structures such as networks, graphs, and trees.

A lot of its functionality comes from igraph in the background but it wraps that in an interface and plotting style that is more familar to users of ggplot. Similarly, its companion package tidygraph > provides a way to switch between the two tables and provides dplyr verbs for manipulating them.

For example, for the E. coli network above, which had an adjacency matrix, we can convert it to graph with the function as_tbl_graph and then plot with ggraph:

library(ggraph)
library(tidygraph)

Attaching package: 'tidygraph'
The following object is masked from 'package:igraph':

    groups
The following object is masked from 'package:stats':

    filter
ecoli = as_tbl_graph(regDB.adj, directed = FALSE) 
Warning in (is.null(rownames(x)) && is.null(colnames(x))) || colnames(x) == :
'length(x) = 153 > 1' in coercion to 'logical(1)'
ecoli %>%
  ggraph(layout = "fr") +
  geom_edge_link(color = "red") +
  geom_node_circle(aes(r=0.1)) + 
  coord_fixed()

The layout argument to ggraph is where we can give it the layout algorithm we want it to use. Likewise, we can make graphs from edge lists such as the macaque data:

macaque_graph = tbl_graph(edges = macaques)

macaque_graph %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_width=weight), color="red") +
  scale_edge_width(range = c(0.1,1)) +
  geom_node_circle(aes(r=0.02)) + 
  coord_fixed()

A cool thing about ggraph though is that it has a few interesting layouts. For example, to do a circular graph, we say we want a linear layout with circular=TRUE with edges that are arcs using geom_edge_arc:

macaque_graph %>%
  ggraph(layout = "linear", circular = TRUE) +
  geom_edge_arc(aes(edge_width=weight, color=weight>5)) +
  scale_edge_width(range = c(0.1,1)) +
  scale_edge_color_manual(values=c("TRUE" = "red", "FALSE" = "black")) +
  geom_node_circle(aes(r=0.02)) + 
  coord_fixed()

Lab

Problems

  1. We’re going to create our own social network plot based on favorite TV shows. I’ll create a speadsheet and gather our top 5 TV shows and count the overlap. This will be our edge weights.
    • Plot this adjacency matrix as a network
    • Make the edge widths proportional to the overlap in TV favorites
  2. Use the data file rugosa_social_network.csv, which describes the social network of the sleepy lizard (Tiliqua rugosa)5 as an adjacency matrix. You may plot using the basic igraph functions or the ggraph ones. Use column_to_rownames(var = "ID") before converting the data frame to a matrix since the graph functions like matrices with rownames.
    • Read in the file (hint: you may need to be careful with which function and/or parameters you use here to get the matrix right, which includes denoting missing values in the matrix).
    • Plot the social network of the sleepy lizard
    • Make the node size proportional to the eigenvalue centrality of the node
    • What is the degree of the most highly connected lizard? Are there multiple lizards with this max degree?
  3. Identify the dataset that you would like to use for your project talk
    • Briefly describe the data
    • Briefly describe each of the figures you would like to make with these data
    • Identify any major hurdles to wrangling or plotting these data
    • You may change alter your dataset and figures for the final project talk, but use this as an opportunity to nail serious candidate dataset and figures.

Footnotes

  1. Kolaczyk, Eric D. and Csárdi, Gábor. 2020. Statistical Analysis of Network Data with R. Springer, New York, NY. http://doi.org/10.1007/978-3-030-44129-6. Github: https://github.com/kolaczyk/sand.↩︎

  2. Lusseau, David, Schneider, Karsten, Boisseau, Oliver J., Haase, Patti, Slooten, Elisabeth, and Dawson, Steve M.. 2003. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behav Ecol Sociobiol 54:396–405. http://dx.doi.org/10.1007/s00265-003-0651-y. Data at http://konect.uni-koblenz.de/networks/dolphins.↩︎

  3. Kolaczyk, Eric D. and Csárdi, Gábor. 2020. Statistical Analysis of Network Data with R. Springer, New York, NY. http://doi.org/10.1007/978-3-030-44129-6. Github: https://github.com/kolaczyk/sand.↩︎

  4. Takahata, Yukio. 1991. Diachronic changes in the dominance relations of adult female Japanese monkeys of the Arashiyama B group. The Monkeys of Arashiyama. State University of New York Press, Albany, pp 123–139. Data from http://konect.cc/networks/moreno_mac/.↩︎

  5. Bull, C. M., Godfrey, S. S., and Gordon, D. M.. 2012. Social networks and the spread of Salmonella in a sleepy lizard population. Mol Ecol 21:4386–4392. http://dx.doi.org/10.1111/j.1365-294X.2012.05653.x↩︎