COZZI Emanuele | EURECOM

Person has left EURECOM

Name: COZZI Emanuele

Thesis

Binary Analysis for Linux and IoT Malware

Objective

Security companies collect millions of malware samples every day. This big-data aspect is a new concept in malware analysis, but it is certainly here to stay. On top of traditional samples, the upcoming Internet of Things (IoT) revolution will inevitably increase both the amount and the diversity of the collected artifacts.

However, despite its promises, big data collection has so far brought our field more challenges than advantages, mainly resulting in a burden for researchers and malware analysts. On the one hand, more samples mean less time to analyze each of them and larger infrastructures to store the files and execute them in dynamic analysis sandboxes. On the other hand, security companies are clearly struggling to sift through this increasing amount of data in an attempt to extract actionable intelligence to better protect their customers and improve their services. As a result, while there is a clear global trend towards collecting more and more data, most of this data sits on some server, taking up terabytes of storage space without being used, exploited, or often even properly understood by the companies that collect it.

On top of this poor understanding of large malware datasets, new advanced techniques are making the analysis of individual samples more complex and more time-consuming. For instance, ROP-only malware, disk-less samples, and advanced obfuscations are reducing our ability to automatically process and understand new malicious files.

The goal of this thesis is to harness the information stored in large malware datasets to improve sample analysis, provide intelligence information, detect correlations, or simply study the trends and evolution of the techniques used by malware writers. In this challenging context, this dissertation will also explore new techniques to extend current static and dynamic analysis approaches to novel and sophisticated malware samples. This can involve heavily obfuscated and packed binaries or new forms of malicious code, and will rely on existing large-scale malware collection systems to provide the data required to conduct experiments.

Research Overview

The first objective of this thesis is to advance the state of the art in binary and malware analysis. For example, recent efforts have been made to better understand packed samples [7], to better analyze their behavior [3, 8], and to reverse new forms of advanced rootkits [5]. However, these works have only scratched the surface of the techniques needed to analyze complex malware samples, both in a fully automated fashion and as tools to support manual reverse engineering.

A second objective of this thesis is the investigation of new forms of malware, starting from malware running on other operating systems or platforms. Only recently have researchers started looking at more “exotic” forms of malware [4], and there is still a lot to explore in this area. For example, as a starting point we plan to develop an open source infrastructure to analyze Linux-based malware samples. Internet routers and IoT devices are rapidly becoming prime targets for malicious code, ranging from simple botnets to more sophisticated targeted attacks. Unfortunately, the security industry is still largely unprepared for this threat. Most of the tools and the knowledge about the behavior and characteristics of malware derive from a decade of research on Windows binaries. However, Linux samples have their own unique set of characteristics, including the widespread use of static linking, a broad set of CPU architectures, their own packing ecosystem, and completely different techniques to achieve persistence and process infection. This task includes the development of dedicated tools, as well as their application to tens of thousands of Linux malware samples, with the goal of extracting and measuring the prevalence of different techniques and the characteristics of this rapidly growing form of malware. As a result, this part of the project would not only produce a usable platform, but also a valuable knowledge base about the behavior and key indicators of Linux malware, which can be extremely useful for malware analysts, to improve the detection of these samples, and to guide incident response on infected devices.
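To make two of the Linux-specific characteristics mentioned above more concrete (the variety of CPU architectures and the use of static linking), the following is a minimal sketch of how they could be read from an ELF header. It is an illustration only, not the tooling developed in the thesis: the function name `inspect_elf`, the `MACHINES` table, and the "no PT_INTERP and no PT_DYNAMIC segment" heuristic for static linking are assumptions made for this example. Field offsets and constants follow the ELF specification.

```python
import struct

# A few e_machine values from the ELF specification (deliberately not exhaustive).
MACHINES = {3: "x86", 8: "MIPS", 20: "PowerPC", 40: "ARM", 62: "x86-64", 183: "AArch64"}
PT_DYNAMIC, PT_INTERP = 2, 3


def inspect_elf(data: bytes) -> dict:
    """Return architecture and linking hints for a raw ELF image (hypothetical helper)."""
    if len(data) < 16 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is64 = data[4] == 2                      # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"    # EI_DATA: 1 = little, 2 = big endian
    if len(data) < (64 if is64 else 52):
        raise ValueError("truncated ELF header")

    (e_machine,) = struct.unpack_from(endian + "H", data, 18)
    if is64:
        (e_phoff,) = struct.unpack_from(endian + "Q", data, 32)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 54)
    else:
        (e_phoff,) = struct.unpack_from(endian + "I", data, 28)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 42)

    # p_type is the first 4-byte field of a program header in both ELF classes.
    segment_types = set()
    for i in range(e_phnum):
        offset = e_phoff + i * e_phentsize
        if offset + 4 > len(data):
            break  # malformed or truncated table: stop instead of failing
        (p_type,) = struct.unpack_from(endian + "I", data, offset)
        segment_types.add(p_type)

    return {
        "bits": 64 if is64 else 32,
        "endianness": "little" if endian == "<" else "big",
        "machine": MACHINES.get(e_machine, f"unknown ({e_machine})"),
        # Heuristic (assumption): no interpreter and no dynamic section.
        # Static-PIE binaries carry PT_DYNAMIC and would be reported as not static.
        "looks_static": PT_INTERP not in segment_types
        and PT_DYNAMIC not in segment_types,
    }
```

Calling `inspect_elf(open(path, "rb").read())` on a sample returns a small dictionary of hints. Malware headers are often deliberately malformed, so a real analysis pipeline would need far more defensive parsing than this sketch provides.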

Finally, part of the research in this area will also focus on the problem of cyber-attribution [1, 2], proposing new techniques to identify reused components and detect malware samples likely developed by the same group. As pointed out by Graziano et al. [6], the current malware collection infrastructure is very efficient, but the sheer number of samples analyzed every day in dynamic analysis sandboxes makes it impossible to tell the interesting malware apart from the surrounding noise of less relevant samples.