Sunday, April 7, 2013

Encrypted VoIP

Voice over IP (VoIP) technology is extremely popular these days and as a result it is important to study the security of the protocols being used. While most such technologies use strong encryption to protect the data, there are often side channel attacks that can be exploited. This paper from a couple of years ago presents an avenue of attack that assumes very little about the adversary while still achieving nontrivial results. The authors describe a process that uses a machine learning techniques combined with computational linguistic knowledge and specific audio codec properties to reconstruct potential transcripts of encrypted conversations. It's particularly impressive because they are able to combine the research advancements of different fields and apply it to a problem that has serious real-world implications.

In their model, the adversary needs only access to the packet lengths of the VoIP call (easily attainable by sniffing the network, e.g. over wireless) and knowledge of the language of the call (assumed to be English). Amazingly, with just these two pieces of information, it is possible to perform a limited reconstruction of the transcript of the call; not enough to deem the protocols complete broken, but enough to warrant a much closer look at how voice data should be encrypted. The attack begins by training on a large corpus of voice data that is passed through the known speech codec to build a model of phoneme boundaries based on packet lengths. This is the most novel part as they use observations about the codec itself, in particular variable bitrate encoding, as the basis for the model. By considering the length of each packet in relation to the surrounding packets and using dynamic programming, they construct the highest-probability phoneme boundaries for an entire sequence of packets.

Following the identification of the boundaries is the hardest task: classifying a sequence of packet lengths as an actual phoneme. Again, the authors use a combination of the training data and context to determine the most likely sequence of phonemes. For example, having a full dictionary of possible phoneme bigrams and trigrams in the English language will greatly limit the space of possible outputs; using a frequency distribution can further aid in choosing certain phonemes over others. Once the phonemes are identified (usually with error), the attack identifies word boundaries and finally maps the phoneme sequences into actual English words. There is some novelty in their approach to these latter two stages as well, but as I understand it those are well-known problems in the speech recognition field. The result of all this is some transcriptions that looks like this (the best ones, naturally):
  • "Change involves the displacement of form." => "Codes involves the displacement of aim."
  • "The two artists exchanged autographs." => "The two artists instance attendants."
  • "Artificial intelligence is for real." => "Artificial intelligence is carry all."
Pretty impressive, as these sentences are spoken, encoded, encrypted, and then sent over the network, with the only observable factor being the resulting packet lengths. Being able to pick out words should be very worrisome for privacy, and even though the environment is controlled and some assumptions are made, the techniques presented are legitimately close to "breaking" the security of speech data encoded and sent over the network in this way.

This paper reinforces my thoughts on the sometimes misleading nature of "theoretical" security. Even using unbreakable encryption algorithms is not enough to provide security in the real world, as the possibility of side channel attacks is always present. Moreover, since every application has a different set of such attacks it is crucial to carefully design protocols to not be vulnerable, which is a significant burden. This topic of attacks based on packet lengths was also the basis of the StegoTorus project which I was a part of; we showed that the website you are visiting can often be identified based on packet lengths (when you cannot see the IP itself, e.g. over a proxy). Especially as Internet traffic becomes more dynamic and differentiated, patterns are emerging which can render encryption much less effective and may demonstrate the need for something more to protect privacy.

No comments:

Post a Comment