Protocol reverse engineering is the process of extracting
application-level specifications for network protocols. Such
specifications are very useful in a number of security-related
contexts, for example, to perform deep packet inspection and
black-box fuzzing, or to quickly understand custom botnet
command and control (C&C) channels. Since manual reverse
engineering is a time-consuming and tedious process,
a number of systems have been proposed that aim to automate
this task. These systems either analyze network traffic
directly or monitor the execution of the application that
receives the protocol messages. While previous systems show
that precise message formats can be extracted automatically,
they do not provide a protocol specification. The reason is
that they do not reverse engineer the protocol state machine.
In this paper, we focus on closing this gap by presenting
a system that is capable of automatically inferring state
machines. This greatly enhances the results of automatic
protocol reverse engineering, while further reducing the
need for human interaction. We extend previous work that
focuses on behavior-based message format extraction, and
introduce techniques for identifying and clustering different
types of messages not only based on their structure, but also
according to the impact of each message on server behavior.
Moreover, we present an algorithm for extracting the state
machine. We have applied our techniques to a number of
real-world protocols, including the command and control
protocol used by a malicious bot. Our results demonstrate
that we are able to extract format specifications for different
types of messages and meaningful protocol state machines.
We use these protocol specifications to automatically generate
input for a stateful fuzzer, allowing us to discover
security vulnerabilities in real-world applications.