There are many ways to analyze malware. In this blog post, we illustrate a typical analysis method: comparing an unknown sample with a known sample to determine whether the unknown sample is malicious.

During one of our engagements, we came across a PDF document that triggered our anti-virus. What intrigued us was that the document had the same title and almost the same size as another document we knew to be benign. Did our anti-virus find a trojanized document? Let’s find out!

Usually, PDF document analysis starts with PDFiD: it gives us an idea of what we can expect to find inside the document. We start the same way here, but we are not going to look at the report produced by PDFiD in detail.

First, we compare the PDFiD reports for the known and unknown samples:

《Differential Malware Analysis: An Example》

《Differential Malware Analysis: An Example》

Comparing the two reports with diffdump.py tells us that they are identical (except for the filename, which is included in the report). We can go one step further by using option -a to generate a report for all names found in the PDF, instead of only the names that might indicate malicious behavior.

《Differential Malware Analysis: An Example》

《Differential Malware Analysis: An Example》

And we get the same result: the reports are identical. From a lexical PDF language point of view, these documents are identical. But we know they are not: they have different cryptographic hashes, and one triggers our anti-virus while the other does not.
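The report comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not a reimplementation of diffdump.py: the function name `diff_reports`, the `<FILE>` mask, and the toy report strings are assumptions made for this example and do not reproduce real PDFiD output.

```python
import difflib


def diff_reports(report_a, report_b, name_a, name_b):
    """Return unified-diff lines for two text reports.

    Each report mentions its own file name, so both names are masked
    with the same placeholder before the reports are compared.
    """
    lines_a = report_a.replace(name_a, "<FILE>").splitlines()
    lines_b = report_b.replace(name_b, "<FILE>").splitlines()
    return list(difflib.unified_diff(lines_a, lines_b, lineterm=""))


# Toy reports (hypothetical content, for illustration only).
original_report = "Report for original.pdf\n/Page 3\n/JS 0\n"
sample_report = "Report for sample.pdf\n/Page 3\n/JS 0\n"

differences = diff_reports(original_report, sample_report,
                           "original.pdf", "sample.pdf")
print("identical" if not differences else "\n".join(differences))
```

An empty result means the two reports differ only in the file name, which is the situation we are in at this point of the analysis.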

Time to dig a bit deeper into the syntax and semantics of these documents with the help of pdf-parser. pdf-parser has a little-known option (-a) to calculate statistics of the elements and objects found inside a PDF document:

《Differential Malware Analysis: An Example》

Comparing the statistics for our two samples reveals that they are identical:

《Differential Malware Analysis: An Example》

This is important information: the fact that our sample and our original document have an identical number and type of elements and objects is a strong indication that they are related.

To try to learn more about the differences, we let pdf-parser produce a full report for both documents:

《Differential Malware Analysis: An Example》

《Differential Malware Analysis: An Example》

Here again, the reports are practically identical, except for some characters at the beginning of the reports. This is our first important clue about the difference between these documents. When we compare the comments at the beginning of these files, we notice they are identical except for the end-of-line characters: \n for our sample and \r\n for our original:

《Differential Malware Analysis: An Example》

On Windows, the end-of-line is defined with two characters: carriage return + newline (0x0D 0x0A, or \r\n). On Linux, it is a single character: newline (0x0A, or \n).
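To get a quick impression of which end-of-line convention a file uses, we can count the byte sequences directly. The following is a minimal sketch; the helper name `count_line_endings` and the toy header bytes are assumptions for this example. Keep in mind that binary streams inside a PDF can contain 0x0D and 0x0A bytes by coincidence, so the counts are an indication and not proof.

```python
def count_line_endings(data):
    """Count CRLF pairs, bare LF bytes and bare CR bytes in a byte string."""
    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf   # LF bytes not preceded by CR
    cr_only = data.count(b"\r") - crlf   # CR bytes not followed by LF
    return {"crlf": crlf, "lf": lf_only, "cr": cr_only}


# Toy PDF headers (hypothetical content, for illustration only).
windows_style = b"%PDF-1.4\r\n%comment\r\n"
linux_style = b"%PDF-1.4\n%comment\n"

print(count_line_endings(windows_style))
print(count_line_endings(linux_style))
```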

Maybe our sample is a version of the original document that was somehow processed on a machine with a different end-of-line convention, resulting in a change of end-of-line characters. Time to define and test a hypothesis: the documents are identical, except for the end-of-line characters.

We use the stream editor sed to replace all \r\n instances in our original document with \n, and then we compare the sample with our transformed original:

《Differential Malware Analysis: An Example》

The files are identical!
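The same hypothesis test can also be written in Python. This is a sketch under assumptions, not the exact command we ran: the function names `normalize_eol` and `same_after_eol_conversion` are made up for this example, and the byte strings are toy data. As with the sed approach, every \r\n pair in the file is replaced, including any that happen to occur inside binary streams.

```python
import hashlib


def normalize_eol(data):
    """Replace every CRLF pair with a single LF."""
    return data.replace(b"\r\n", b"\n")


def same_after_eol_conversion(original, sample):
    """Check whether the EOL-converted original hashes to the same value as the sample."""
    converted = hashlib.sha256(normalize_eol(original)).hexdigest()
    return converted == hashlib.sha256(sample).hexdigest()


# Toy data (hypothetical content, for illustration only).
original = b"%PDF-1.4\r\n1 0 obj\r\nendobj\r\n"
sample = b"%PDF-1.4\n1 0 obj\nendobj\n"

print(same_after_eol_conversion(original, sample))
```

For real files, the two byte strings would be read with `open(path, "rb").read()` so that no end-of-line translation takes place while reading.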

Conclusion

With this analysis, we show that both documents are identical, except for the end-of-line character(s). We trust our original not to be malicious, and since we can convert our original to the sample with a simple end-of-line conversion, we can conclude that the sample is not malicious. Often, a differential analysis will not be so clear-cut. Nevertheless, it is an important method in the arsenal of the reverse engineer.

We have reported this false positive to our anti-virus vendor. The original document is private, and we will not share it. We don’t know exactly why the sample triggered a false positive.

Want to learn more? Please do join us at the upcoming BruCON training on malicious documents, which was authored by NVISO’s experts!

About the author
Didier Stevens is a malware expert working for NVISO. Didier is a SANS Internet Storm Center senior handler and MVP, and has developed numerous popular tools to assist with malware analysis. You can find Didier on Twitter and LinkedIn.
