Besides a new project of mine that is investigating the use of neural networks to carry out phishing campaigns (more on this at a later date!) and the upcoming Christchurch Hacker Conference, my main project remains my malware tracker (2017-12-01: decommissioned). So I thought I’d take the chance to run through some of the features and data it offers. But first, I’ll briefly cover perhaps the most common query I receive about it: where does the data come from?
At present the data is obtained from:
- Honeypots: URLs are extracted from my Dionaea honeypots and dumped into a ‘fetch queue’ that is processed once per day. I tend to ignore IP data for this purpose because a huge percentage of it relates to compromised personal devices which do not host any malware.
- Spamtraps: Malicious attachments are automatically unpacked and submitted to a Cuckoo sandbox, and the data from this is sent to the same queue as the honeypot data. The C2 portion of this is still somewhat manual.
- OSINT: For the moment this mostly consists of a Payload Security feed, Shodan, and malware IP and domain blacklists. One of the core rules I set for myself when developing the tracker was – where possible – not to double up on data that is made available by similar services such as the Malc0de and VX Vault databases. The approach I have taken is to stretch a single indicator as far as possible by pivoting between multiple data sources, so a run could, for example, follow a process such as: C2 check-in or beacon alert in Payload Security feed > extract domains (a blacklist-based crawl would start at this point) > use passive DNS service to identify recently utilised IPs > search on VirusTotal for samples that have recently communicated with the IPs > submit C2 data to the tracker > search on VirusTotal for URLs associated with the IPs > fetch samples > Yara scan files > submit sample data to tracker. The next day the engine would run through the same C2 data to try to identify any new, related samples that have surfaced in the past 24 hours.
- Manual Submission: e.g. I find something mentioned on Twitter and throw it in the queue.
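To make the pivoting idea a little more concrete, here is a minimal sketch of the domain > IP > samples/URLs portion of that chain. It is not the tracker’s actual code: the function name `pivot` and the three lookup callables are hypothetical stand-ins for a passive DNS service and VirusTotal searches, and the record layout is an assumption made purely for illustration.

```python
def pivot(domains, resolve_recent_ips, samples_for_ip, urls_for_ip):
    """Stretch C2 domains into C2 records plus a de-duplicated URL fetch queue.

    All three callables are hypothetical stand-ins:
      resolve_recent_ips(domain) -> iterable of IPs   (e.g. passive DNS)
      samples_for_ip(ip)         -> iterable of hashes (e.g. a VirusTotal search)
      urls_for_ip(ip)            -> iterable of URLs   (e.g. a VirusTotal search)
    """
    c2_records = []
    fetch_queue = []
    seen_urls = set()

    for domain in domains:
        for ip in resolve_recent_ips(domain):
            # One C2 record per domain/IP pair, with the samples seen talking to it.
            c2_records.append({
                "domain": domain,
                "ip": ip,
                "samples": sorted(samples_for_ip(ip)),
            })
            # URLs associated with the IP go into the fetch queue exactly once.
            for url in urls_for_ip(ip):
                if url not in seen_urls:
                    seen_urls.add(url)
                    fetch_queue.append(url)

    return c2_records, fetch_queue
```

Passing the lookups in as callables keeps the sketch independent of any particular service, and re-running it the next day over the same domains is what would surface newly related samples.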
Hopefully that clears things up. It’s not overly technical, but has served as an excellent exercise in maximising the value of data.
static analysis changes
Last month I began to run into some pretty horrific performance woes using a pure Python static analysis library. While it produced a nicely detailed summary of a file, it had several drawbacks:
- The data was in the form of massive JSON objects, in some cases over 1MB of pure text, so database bloat was a reality and requests took too long. Some of the regexes were also very broad in their intent and prone to catastrophic backtracking.
- It was too verbose for what it was trying to communicate, and most of the PE information was available on VirusTotal anyway.
- I was confined to only presenting analysis data on PE files, leaving the likes of PDF, DOC and XLS files with none.
Over the past month I’ve been rolling Loki into the investigative toolset of the first-line support team at my workplace, so I’ve been doing a lot of work with Yara – which, one day, sparked the idea that it could be a perfect replacement for the library I was currently using to perform the static analysis of files prior to their submission to the tracker. This would not only condense the capabilities and attributes of a file into a very simple list, but would also allow me to include documents and apply the same level of analysis to them as for PE files.
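As a rough idea of what “a very simple list” might mean in practice, the sketch below collapses rule matches into a flat, sorted set of labels. The function name `summarise`, the `(rule_name, tags)` input shape and the example rule names are all assumptions for illustration, not the tracker’s real schema.

```python
def summarise(matches):
    """Collapse (rule_name, tags) pairs into a sorted, de-duplicated label list.

    `matches` is assumed to be an iterable of (rule_name, tags) pairs; mixing
    rule names and tags into one list is a design choice of this sketch.
    """
    labels = set()
    for rule_name, tags in matches:
        labels.add(rule_name)
        labels.update(tags)
    return sorted(labels)
```

Because the output depends only on which rules fired, the same summary format would work for a PE file, a PDF or an Office document.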
The only real issue I’ve run into is catastrophic backtracking with a couple of rules (e.g. domain.yar), which can stall a processing run for upwards of an hour in some cases, where it should really only take a matter of seconds. So, as a workaround, I’ve simply accepted that I cannot have full rule coverage across every single file. I have set a limit on file size and enforced a per-file TTL on Yara processing; if the TTL is exceeded, the known troublesome rules are dropped and the scan is retried. To date I haven’t had a single issue, and I only have to retry with perhaps 1% of files.
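The workaround can be sketched roughly as follows. This is an illustration under stated assumptions rather than the tracker’s real code: `scan_with_retry` and the `scan` callable are hypothetical names, and the 10 MiB size limit and 60 second timeout are placeholder values, not the limits actually used.

```python
import os


def scan_with_retry(path, scan, rules, troublesome,
                    max_bytes=10 * 1024 * 1024, timeout=60):
    """Scan a file with a time limit, retrying without troublesome rules.

    `scan(path, rules, timeout)` is a hypothetical callable that returns the
    matches or raises the built-in TimeoutError when the limit is exceeded.
    Returns (matches, full_coverage); matches is None if the file was skipped
    or both attempts timed out.
    """
    # Oversized files are skipped outright rather than risking a long run.
    if os.path.getsize(path) > max_bytes:
        return None, False

    try:
        return scan(path, rules, timeout), True
    except TimeoutError:
        # Drop the rules known to backtrack badly and try once more.
        reduced = [r for r in rules if r not in troublesome]
        try:
            return scan(path, reduced, timeout), False
        except TimeoutError:
            return None, False
```

With yara-python, `scan` could wrap `yara.compile(filepaths=...)` and `rules.match(path, timeout=timeout)`, translating `yara.TimeoutError` into the built-in `TimeoutError`. Returning the `full_coverage` flag alongside the matches makes it easy to record which files were scanned with a reduced rule set.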
As well as malware hosting data, I am now also including C2 data. As outlined in the ‘data sources’ portion of this post, this data is mostly obtained from my Cuckoo sandbox, Payload Security and Shodan (courtesy of their ‘malware’ tagging), and I pivot between these sources and VirusTotal to identify samples that have communicated with the specific C2 domain and its current and recent IPs. Data from here is also passed into subsequent stages of processing where malware URLs are identified.
Unlikely. This is my pet project and has sucked a lot of my time, but I am certainly looking at rolling some of the capabilities into ph0neutria.
2017-12-01: I am currently in the process of merging ph0neutria and the malware tracker.