CTPH Clustering Analysis in a Big Data Environment
Context-triggered piecewise hashing (CTPH) has been used for similarity analysis in the information security industry for more than 10 years and has become one of the most popular standard fuzzy hashing algorithms. On the Microsoft Defender team, we have used CTPH for sample clustering analysis and malware classification since 2014. Along with that success came a major problem: finding clusters at a given similarity level among a huge number of samples became very expensive. To find similar samples quickly, we proposed and implemented a simplified CTPH similarity analysis solution that reduces the cost while still meeting our business requirements.
This presentation introduces how we conduct CTPH similarity analysis internally and explores our clustered data. We analyzed over 100B samples, including PE and non-PE samples, and will examine how many of them could be clustered, how many clustered samples could contribute to our automation solutions, how many samples remain unclustered, and what we can learn from these lonely samples. In addition, we will present case studies that use CTPH data to better understand real-world malware attacks.