01 Feb Tensor Factorization for Graph Analysis in Python
We already know, machine learning is an amazing field and a large number of techniques exist in order to solve problems. Unfortunately, just a few subsets of algorithms are very popular and largely used in Kaggle competitions. But, since the world is a large place, a lot of powerful, and not largely used algorithms/techniques, exist.
In this article, we will show a simple application of the Tensor Decomposition (TD) algorithm on graph data. More in details, we will show how to apply TF to analyze time changing graphs. The goal of this analysis is to extract latent information from graphs in order to find hidden information.
This article is divided into two main parts. In the first, we will give a quick overview of the tensor factorization algorithm. In the second, we will show how to use this technique on real data with Python.
We start by the simple definition of a tensor. In this article, we will use different definitions extracted from the well-known paper on tensor decomposition: Kolda et. al, Tensor Decompositions and Applications, SIAM Review, 2009, Vol. 51(3): pp. 455–500.
“A tensor is a multidimensional array. More formally, an N-way or Nth-order tensor is an element of the tensor product of N vector spaces, each of which has its own coordinate system. A third-order tensor has three indices, as shown in Figure 1. A first-order tensor is a vector, a second-order tensor is a matrix, and tensors of order three or higher are called higher-order tensors.” (Kolda et. al, Tensor Decompositions and Applications)
The goal of tensor decomposition is to obtain a compact representation of a given tensor. One of the first types of tensor factorization is the canonical polyadic decomposition (CPD). This decomposition factorizes a tensor into a sum of component rank-one tensors as described in Figure 2.
In Figure 2, we introduced a particular type of tensor decomposition, the Non-Negative Canonical Polyadic Decomposition (NCPD). The only difference with “classical” CPD is related to the non-negative constraints for the extracted factors.
One of the most important point to take into account from Figure 2 is the Rank (R) of the factorization.
“The rank of a tensor T, denoted rank(T), is defined as the smallest number of rank-one tensors that generate T as their sum. In other words, this is the smallest number of components in an exact CP decomposition, where “exact” means that there is equality in the factorization and the original tensor. An exact CP decomposition with R = rank(T) components is called the rank decomposition.” (Kolda et. al, Tensor Decompositions and Applications).
Tensorize and Factorize Graphs with Python
The data used in this article were selected among a large series of network datasets available at the Index of Complex Networks website. We selected the Golden Age of Hollywood dataset, here its description extracted from the website:
“A network of all collaborations among 55 actors who were prominent during the Golden Age of Hollywood (1930–1959). Nodes are actors and each edge represents those actors appearing together in a movie. Edge direction points toward the actor with “better” billing status. Collaborations were extracted from the Internet Movie Database and span 1909–2009. Data available as both a temporal network (sequence of graphs, each spanning 10 years) and as single aggregated network.”
All graphs, except the aggregated network, were used during the analysis. As you already know a graph can be represented with an adjacency matrix. Using its matrix representation we can then built a tensor (3D matrix) where we stacked all the matrix along the 3rd mode as described in Figure 3.
The first step is to load all the adjacency matrices in a 3D matrix using the following python code.
Before stacking the matrices we remove the self-loop in the graph simply putting zero on the diagonal. In the last step, in order to simplify the visualization, we loaded the file containing the names of all the Hollywood actors available in the dataset.
We are now able to start using tensor factorization. Tensorly is the best choice to perform tensor factorization in Python since it offers a wide range of functions. Another incredible tool is Tensorlab, unfortunately, it is only available for MATLAB and will not be used in this article. Before running the factorization algorithm we need to transform the 3D numpy matrix in a tensor that Tensorly can process using the following code.
Now we can perform the non-negative tensor factorization. In this article, we decided to extract 20 components.
In the factors array, we have all the factors extracted from the factorization. Let’s start an exploratory analysis in order to check, what we can see in the different components. We will just focus on a few components in order to describe a general way to perform the analysis.
In order to perform the plot, we used matplotlib and NetworkX. More in detail, we plot two different information: 1) The vector product between A and B. This information shows a subgraph of the original graph describing the sub connections more relevant for this specific source. 2) The factor C. This factor describes which temporal period a specific component contains.
The main idea behind these plots is to understand which subset of relevant information can be extracted from the original dataset. The python code to perform the visualization is available in the following image.
Here the plots
Now, let’s try to understand the meaning of those plots. Apparently, component #0 contains information about the filmography of Marx brothers. Indeed, if we focus our analysis on the plot of the product of factor A and B (Botton of Figure 8) we can see that the obtained graph contains the connection between the 4 Marx brothers. Moreover, if we analyze the component C we can see values different from zero in the range 1920–1959 with a spike in 1930–1939. This plot perfectly overlaps with their filmography as you can see in Figure 9. From 17 movies, 2 (11.8%) belong to 1920–1929, 9 (52.9%) belong to 1930–1939, 4 (23.5%) belong to 1940–1949 and 2 (11.8%) belong to 1950–1959.
Another interesting component is the #4. Plots are visible in the following image.
This component is also significant since represents one of the most celebrated and successful collaborations of any actor/director, the one between Alfred Hitchcock and Cary Grant, visible in bottom part of Figure 10. If we also look at the Top part of the figure, we can see the same value in the range 1940–1949 and 1950–1959. This is a really amazing fact since this duo made 4 movies, 2 in 1940–1949 (“Suspicion” — 1941 and “Notorious” — 1949) and 2 in 1950–1959 (“To Catch a Thief” — 1941 and “North by Northwest” — 1949).
In this article, we presented a simple approach to analyzing longitudinal changing networks using non-negative tensor factorization. We started with a simple introduction to tensors and tensor factorization. More in details, we used a particular type of tensor factorization, namely Non-Negative Canonical Polyadic Decomposition (NCPD). The decomposition was applied to analyze time changing graphs obtained from the Index of Complex Networks website. Analyzing different components, we found interesting hidden information available in the dataset.