Sunday, March 3, 2013

Anonymized Social Network Data

Social network analysis is a popular research area these days thanks to the rapid growth of networks like Facebook, Google+, and Twitter. Data for subsets of these networks is readily available, e.g. here, for researchers to analyze and discover interesting properties. Sometimes, these datasets are anonymized to protect the privacy of users while preserving the structure of the network; for example, instead of nodes being labelled with names, they are often replaced with a random ID. An interesting question that arises is whether this is secure, that is, can an attacker extract information from data anonymized in this way? Several years ago, it was shown that individuals could be identified from the anonymized Netflix Prize dataset, so this question is important for the research community to consider. In this paper, it is shown that an active attacker can quite easily defeat the naive random ID anonymization.

Let $G$ be the anonymized network with $n$ nodes that an attacker has access to. In this context, an active attacker is someone who has "user-level access" to the network prior to the creation of the anonymized dataset. For example, if $G$ is the anonymized Facebook network, an attacker can create accounts and friend people arbitrarily before $G$ is created such that those nodes and edges will be reflected in $G$. The resulting attack is quite simple. Let $H = \{ x_1, x_2, \ldots, x_k \}$ be the set of "fake" nodes created by the attacker and $W = \{ w_1, w_2, \ldots, w_b \}$ be the set of existing nodes that the attacker wants to compromise (in particular, learn about edges between $w_i$ and $w_j$). It is shown that with just $k = \Theta( \log{n} )$ attacker nodes, you can, with high probability, perform an attack efficiently on $b = \Theta( \log^2{n} )$ of the "real" nodes in the network. The attack is based on generating $H$ in such a way that it can easily be recovered from $G$ and connecting $H$ to $W$ so that each node in $W$ can be uniquely identified.

There are a few properties of $H$ that need to be guaranteed. There cannot be another induced subgraph in $G$ which is isomorphic to $H$, so that we can identify $H$, and there must be an efficient algorithm to find $H$ within $G$. To generate $H$, first include the edges $(x_i, x_{i+1})$ for $i = 1, 2, \ldots, k-1$ and then include every other edge with probability $1/2$. Then each $w_j$ is connected to a small, unique subset of nodes in $H$ so that we can recover all of the $w_j$'s once we find $H$. Furthermore, each $x_i$ has edges added randomly to nodes in $G - H$ until it reaches degree $\Delta_i = O(\log{n})$. Based on the structure of $H$ and the degrees, we can now recover it from $G$ using a search over short walks in $G$. Starting from each node, consider it as a potential $x_1$ and then, if that passes the degree requirements, consider all of its neighbors as potential $x_2$'s. For each neighbor that passes the degree requirements and the internal structure constraints of $H$, continue the walk, and repeat this until a walk with $k$ nodes is found. Through some delicate analysis, the authors show that, with high probability, there will be a unique walk that satisfies all the constraints (corresponding to $H$ itself) and that the search tree will have size $O(n^{1+\epsilon})$ so the algorithm is tractable on large datasets.

The paper describes experiments run on a LiveJournal dataset with over 4 million nodes and 70 million edges. With just $k = 7$ attacker nodes, they are able to successfully recover $H$ with roughly 90% probability, and it only goes up as $k$ increases. This means that it is simple to perform such attacks in practice and, perhaps more importantly, that the attacks have a sufficiently small effect in the network to avoid detection. The end result of the work is captured quite nicely by a quote from the paper, highlighting the observation that simple anonymization is not enough: "true safeguarding of privacy requires mathematical rigor, beginning with a clear description of what it means to compromise privacy, what are the computational and behavioral capabilities of the adversary, and to what information does it have access."