Every day, OpenDNS discovers thousands of websites serving malicious content, by harnessing massive amounts of DNS data.
Besides what DNS-level data can tell us, examining the type of server software cybercriminals use also helps increase the accuracy of our algorithms.
In this experiment, we collected 50,000 domain names that were actively serving malware between March 6th and June 6th, and 50,000 popular domain names that we never saw involved in malicious activities.
Web server software
As of today, Apache remains the most popular web server software, though Nginx is clearly on the rise.
That said, malicious domains run Apache more often (62.88%) than benign domains do (41.64%). For Nginx, the opposite holds: 10.87% of malicious domains versus 26% of benign ones.
Another interesting observation is that, compared to malicious domains, benign domains clearly tend to obfuscate or hide the server software they are running. Our data show that malicious domains typically use one of nine different “Server:” header signatures. A staggering 95.27% of domains serving malware match these signatures, whereas benign domains match the same signatures only 17.23% of the time.
Some websites are also taking advantage of Content Delivery Networks (CDNs). However, we couldn’t find any domains currently serving malware using Akamai, Bitgravity, Cachefly, Chinacache, or Limelight.
Although it’s not inconceivable for a malicious site to sit behind one of these CDNs, one can assume that websites using them are much less likely to be malicious.
However, 0.2% of malicious domains were using Cloudflare, and 0.1% of them were using Microsoft Azure.
The next thing we examined was the “X-Powered-By” header, which is also an identifier for the software running a web application or site.
Although the difference is not significant, Plesk is found more often on compromised websites than on benign ones (5.67% vs 1.49%). But perhaps the most important thing to note here is the presence of an “X-Powered-By” header that doesn’t indicate Plesk, ASP, or PHP.
Web servers running Ruby (Rack), NodeJS (Express), Mono, and Java-based application servers (Jboss/Tomcat) are clearly less used for malware distribution than other software stacks.
Cookies are a good indication of whether a website needs to somehow track a user, and also a good indication of what framework or application is running.
In order to ignore cookies sent by third-party services, like ad servers, we only analyzed the home page of each website, and discarded cross-domain content.
Approximately half of the visited benign websites don’t serve any cookies. Compare that to 77.58% of malicious websites that don’t serve cookies.
Benign websites also tend to have a higher diversity of cookie names than malicious websites.
This can be partly explained by the fact that cybercriminals will often target applications that are easier to compromise, and hosting services that are malware-friendly often offer similar operating systems and software stacks.
Not all WordPress instances are sending cookies at the first visit.
A more reliable way to detect sites powered by WordPress than inspecting cookies is to look for specific files. The one tested here is /wp-includes/wlwmanifest.xml.
According to this test, no less than 19.50% of malicious/compromised sites are running WordPress.
But WordPress is also omnipresent on sites that haven’t been compromised (yet): the file was also found on 13.92% of the benign websites from our training set.
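The file probe described above can be sketched as follows. This is a minimal illustration, not the code used for the experiment: the function names are hypothetical, and treating an HTTP 200 response as "file present" is an assumption (a site that redirects missing pages to its home page would produce a false positive).

```python
import urllib.error
import urllib.request
from typing import Callable
from urllib.parse import urlunsplit

# Path tested in the article; its presence suggests a WordPress install.
WLW_MANIFEST_PATH = "/wp-includes/wlwmanifest.xml"


def manifest_url(domain: str, scheme: str = "http") -> str:
    """Build the URL of the WordPress manifest file for a domain."""
    return urlunsplit((scheme, domain, WLW_MANIFEST_PATH, "", ""))


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a HEAD request, or 0 on network failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as exceptions but still carry a code.
        return error.code
    except OSError:
        # Covers URLError, timeouts, and connection errors.
        return 0


def looks_like_wordpress(domain: str,
                         fetch_status: Callable[[str], int] = http_status) -> bool:
    """Assumption of this sketch: a 200 on the manifest path means WordPress."""
    return fetch_status(manifest_url(domain)) == 200
```

Passing the fetcher as a parameter keeps the check testable offline, for example `looks_like_wordpress("example.com", lambda url: 404)`.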
Looking at the “Last-Modified” header when requesting the home page is a good way to see whether a website is regularly updated.
Plotting the CDF of both classes of domains shows that sites whose home page hasn’t been recently updated are more likely to be malicious or compromised than sites containing more dynamic content.
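As a rough sketch of how such a comparison could be computed, the snippet below turns a “Last-Modified” header into an age in days and evaluates an empirical CDF at a given threshold. The function names are hypothetical, and treating a header without a time zone as UTC is an assumption of this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional, Sequence


def last_modified_age_days(header_value: str,
                           now: Optional[datetime] = None) -> float:
    """Age of a page in days, computed from its Last-Modified header."""
    modified = parsedate_to_datetime(header_value)
    if modified.tzinfo is None:
        # Assumption: a date without zone information is treated as UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - modified).total_seconds() / 86400.0


def empirical_cdf(ages: Sequence[float], threshold: float) -> float:
    """Fraction of pages whose age is less than or equal to the threshold."""
    if not ages:
        return 0.0
    return sum(1 for age in ages if age <= threshold) / len(ages)
```

Evaluating `empirical_cdf` over a range of thresholds, once for each class of domains, gives the two curves to compare.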
The length of the content is also a useful feature. We examined the HTML code of the home page only for each of these sites.
In this training set, none of the benign examples served HTML code larger than 2 MB on the home page, at least according to the Content-Length header.
A few examples of malicious sites with larger home pages, as of today:
- hxxp://portail-bassin-arcachon.com 11,255,479 bytes
- hxxp://portail-cote-azur.com 9,542,640 bytes
- hxxp://location-mer.eu 8,437,761 bytes
- hxxp://portail-cote-vendeenne.com 6,934,555 bytes
- hxxp://portail-toulousain.com 6,914,263 bytes
- hxxp://grupokarion.com 6,272,172 bytes
- hxxp://portail-sologne.com 6,079,756 bytes
- hxxp://unoshn.com 4,373,545 bytes
- hxxp://portail-vallee-des-rois.com 4,355,293 bytes
- hxxp://lacajareiki.com 2,854,292 bytes
More than 20% of web servers are also running an SSH server on the same IP address. This holds true both for benign and malicious servers.
The figures are quite different when it comes to FTP servers.
No less than 36.65% of web servers serving malicious content are running an FTP server. That’s nearly twice the rate of servers for which we didn’t observe any malicious activity (18.57%).
In both cases, Pure-FTPd is the most popular FTP server software, with a 46.5% share, mainly due to it being shipped with cPanel.
A POP server usually doesn’t share the same IP as a benign web server. Only 13.3% of benign web servers are also listening on port 110.
However, a POP server runs alongside the web server on 23.15% of malicious websites.
The distribution of the POP server software is similar in both benign and malicious cases, with Dovecot being by far the most popular option.
As expected, SMTP servers also tend to be more frequently found on web servers hosting malicious content than on benign ones: 25.03% vs 17.49%.
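The service checks above amount to testing whether a handful of TCP ports accept connections on the web server’s IP address. A minimal sketch follows. Only port 110 is named in the text; the other port numbers are the standard ones for each protocol, and the function names are hypothetical.

```python
import socket
from typing import Callable, Dict, Mapping

# Port 110 comes from the article; 21, 22, and 25 are the standard
# FTP, SSH, and SMTP ports (an assumption about what was probed).
SERVICE_PORTS = {"ftp": 21, "ssh": 22, "smtp": 25, "pop": 110}


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out, or name resolution failed.
        return False


def service_features(host: str,
                     ports: Mapping[str, int] = SERVICE_PORTS,
                     probe: Callable[[str, int], bool] = port_is_open
                     ) -> Dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: probe(host, port) for name, port in ports.items()}
```

An open port only shows that something is listening. Identifying the software behind it (Pure-FTPd, Dovecot, and so on) would additionally require reading the service banner, which this sketch does not do.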
Using this data for classification
After this analysis, we used the data to extract simple binary features:
- Server: *Apache*
- Server: *nginx*
- Server: !*(IIS or Apache or Nginx or Litespeed or Oversee or Lighttpd or ATS or Varnish or Tengine)*
- Server: *Akamai*
- X-Powered-By: *(Plesk or ASP or PHP)*
- The presence of cookies
- Set-Cookie: *(wordpress or ci_session or uid or PHPSESSID or PHP_SESSION_ID or virtuemart or VisitorID)*
- Last-Modified date older than 1 day
- Content-Length >= 2,000,000
- The presence of an FTP server
- The presence of an SSH server
- The presence of an SMTP server
- The presence of a POP server
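The feature list above could be turned into a feature vector along the following lines. This is an illustrative sketch rather than the code used in the experiment: the feature names, the case-insensitive substring matching, and the choice to count a missing “Server:” header as matching none of the nine signatures are all assumptions.

```python
from typing import Dict, Iterable, Mapping, Optional

KNOWN_SERVERS = ("iis", "apache", "nginx", "litespeed", "oversee",
                 "lighttpd", "ats", "varnish", "tengine")
POWERED_BY = ("plesk", "asp", "php")
COOKIE_NAMES = ("wordpress", "ci_session", "uid", "phpsessid",
                "php_session_id", "virtuemart", "visitorid")


def _contains_any(value: str, needles: Iterable[str]) -> bool:
    """Case-insensitive substring match against any of the needles."""
    lowered = value.lower()
    return any(needle in lowered for needle in needles)


def extract_features(headers: Mapping[str, str],
                     age_days: Optional[float] = None,
                     open_services: Iterable[str] = ()) -> Dict[str, bool]:
    """Build the binary features from response headers and probe results."""
    h = {name.lower(): value for name, value in headers.items()}
    server = h.get("server", "")
    cookies = h.get("set-cookie", "")
    try:
        length = int(h.get("content-length", "0"))
    except ValueError:
        length = 0  # Malformed header: treat as unknown size.
    services = set(open_services)
    return {
        "server_apache": _contains_any(server, ("apache",)),
        "server_nginx": _contains_any(server, ("nginx",)),
        "server_unknown": not _contains_any(server, KNOWN_SERVERS),
        "server_akamai": _contains_any(server, ("akamai",)),
        "powered_by_common": _contains_any(h.get("x-powered-by", ""), POWERED_BY),
        "has_cookies": bool(cookies),
        "known_cookie": _contains_any(cookies, COOKIE_NAMES),
        "stale_home_page": age_days is not None and age_days > 1,
        "large_content": length >= 2_000_000,
        "has_ftp": "ftp" in services,
        "has_ssh": "ssh" in services,
        "has_smtp": "smtp" in services,
        "has_pop": "pop" in services,
    }
```

For example, `extract_features({"Server": "nginx"}, age_days=0.2)` yields a dictionary in which only `server_nginx` is true.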
A decision tree trained with these features on 2/3 of our examples leads to the following ROC curve:
This classifier is simple and extremely fast, but it clearly doesn’t perform well enough on its own for our security needs. Furthermore, collecting test data is a network-intensive operation.
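For readers who want to evaluate a similar classifier, a ROC curve can be computed from the held-out labels and the scores the model assigns (for a decision tree, typically the fraction of malicious training examples in the leaf). The sketch below is a generic, dependency-free implementation, not the code behind the curve in this post, and its function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int],
               scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points.

    Labels are 1 for malicious and 0 for benign; higher scores mean
    "more likely malicious". One point is emitted per distinct score.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are required")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for index, (score, label) in enumerate(ranked):
        if label:
            true_pos += 1
        else:
            false_pos += 1
        # Emit a point only after the last example of a tied score group.
        last_of_group = (index + 1 == len(ranked)
                         or ranked[index + 1][0] != score)
        if last_of_group:
            points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

An area of 1.0 corresponds to a perfect ranking and 0.5 to random guessing.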
However, we have many models currently tagging domain names as suspicious or not according to different algorithms.
Some of these models have very high precision, and the domains they flag are added to our block list after a quick manual review. For instance, newly registered domains acting as fast-flux fall into this category.
The output of other models needs extra votes before we are confident enough to blacklist the domains and thereby protect our customers. This new classifier is going to play a significant role in this regard.