Harvesting phishing sites for filtering has always been an uphill battle. Many phishing sites are designed to look as close as possible to the legitimate webpages they imitate. The more genuine a page looks, the greater the chance that someone will hand over sensitive personal information, and the harder it is to tell whether the site is legitimate or a phish. Phishing pages also have a high turnover rate: a site is often live for only a day or two, sometimes just hours, before it is discovered and taken down or moved to a different URL. Reasons like these have made tackling phishing sites a tedious chase.
New Recommender System on PhishTank To Automate Submission Verification
In search of a solution, OpenDNS Labs came up with the idea of using its phishing detection model, NLPRank, as a recommender system on our community-based phishing verification system, PhishTank, to improve phishing verification time.
This new, automated approach to the verification process is outlined as follows:
- The algorithm takes a submission (a domain or URL) from the submitter/community as input and checks it against existing OpenDNS whitelists and ASN filters. This initial step filters out the false positives and spammy submissions that are often sent to PhishTank.
- If the URL makes it past these first checkpoints, the system fetches the source code/content from the submission URL for review.
- That source code is then analyzed by our machine learning system, which, in a nutshell, compares the submission content to a curated corpus of content from commonly spoofed brands and returns a similarity score. If the similarity score is above the predetermined threshold, the URL is labeled as a phish and sent to our Proxy for auto-blocking.
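The steps above can be sketched roughly in Python. This is only an illustration under stated assumptions, not the NLPRank model itself: the whitelist, the brand corpus, the threshold value, and the `review` function are all hypothetical, a plain bag-of-words cosine similarity stands in for the actual machine learning system, and the ASN filter is left out for brevity. The page content is supplied through a caller-provided `fetch` function so the sketch runs without network access.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical stand-ins: the real whitelists, corpus, and threshold are not published in this post.
WHITELIST = {"apple.com", "google.com"}
BRAND_CORPUS = {
    "apple": "sign in with your apple id to manage your icloud account",
    "google": "sign in to continue to gmail with your google account",
}
THRESHOLD = 0.6


def tokens(text):
    """Lowercased bag of words for a piece of page content."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bags of words (a simple stand-in for the real scoring)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hostname(submission):
    """Accept either a bare domain or a full URL and return the host part."""
    parsed = urlparse(submission if "://" in submission else "http://" + submission)
    return (parsed.hostname or "").lower()


def is_whitelisted(host):
    """True if the host is a whitelisted domain or one of its subdomains."""
    return any(host == d or host.endswith("." + d) for d in WHITELIST)


def review(submission, fetch):
    """Run one submission through the whitelist check, content fetch, and similarity scoring."""
    if is_whitelisted(hostname(submission)):
        return {"verdict": "skipped", "brand": None, "score": 0.0}
    page = tokens(fetch(submission))
    # Compare the page against every brand in the corpus and keep the best match.
    brand, score = max(
        ((name, cosine(page, tokens(text))) for name, text in BRAND_CORPUS.items()),
        key=lambda pair: pair[1],
    )
    verdict = "phish" if score > THRESHOLD else "unverified"
    return {"verdict": verdict, "brand": brand, "score": score}


if __name__ == "__main__":
    # Hypothetical page content; a real deployment would fetch the live page.
    sample = lambda url: "Sign in with your Apple ID to manage your iCloud account"
    print(review("http://appleid-login.example/signin", sample))
```

Passing `fetch` in as a parameter keeps the scoring logic testable offline; in practice it would wrap whatever HTTP client fetches the submitted page.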
Figure 1 shows a diagram of how the system works:
The eventual plan is to integrate the results of our recommender system back into PhishTank and share them with the community. PhishTank’s current approach of “submit a domain and wait for the community to verify” has served well so far, and PhishTank remains “best in class” as one of the largest sources of human-curated data on phishing sites. The drawback is that this manual process looks increasingly dated as time goes on, and as the Time to Verify measurement grows, the efficacy of the feed suffers. The new recommender system increases the effectiveness of PhishTank and improves the overall experience for the thousands of users who rely on PhishTank’s feed of verified phishes.
Rogue Infrastructure Detection
This method works well for real-time blocking of active phishes, but it is reactive, and here at OpenDNS Labs we are all about pushing the limits to develop predictive models. So how do we evolve this system to be truly predictive and block these phishing sites and their hosting infrastructures before they’re even created? Consider the notion that phishes can be a bit like cockroaches. If you see one marching around your house, chances are there are a bunch more hanging out somewhere close by, out of sight. By taking the verified results of the NLPRank process and pivoting through their server IPs using Investigate, we are able to uncover handfuls of other registered phishing domains that belong to the very phishing campaign that was initially discovered. Additionally, when we dig deeper through adversaries’ WHOIS records, specifically the registrant email address, we uncover even more domains set up with the same tactics. By adding these IPs and registrants to our blacklists, we are able to stay ahead of the curve and greatly increase the chances of our users being protected from widespread phishing campaigns. Figure 2 shows a diagram of the Rogue Infrastructure Classification system.
- First we take dedicated phishing domains that we have caught with the recommender system.
- We then query OpenDNS Investigate for domains associated with the hosting IP and the registrant email address from their WHOIS records.
- We then built a classifier that uses different features from the domains on these infrastructures to detect rogue registrants and IP addresses, and in turn pushes them to our blacklist.
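As a rough illustration of the pivoting steps above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `RECORDS` table stands in for lookups against Investigate (no real API is called), and a simple keyword ratio with an arbitrary cutoff stands in for the classifier, whose actual features are not described in this post.

```python
from collections import defaultdict

# Hypothetical records standing in for Investigate lookups:
# (domain, hosting IP, registrant email from WHOIS).
RECORDS = [
    ("appleid-verify.example", "192.0.2.10", "actor@mail.example"),
    ("icloud-secure-login.example", "192.0.2.10", "actor@mail.example"),
    ("paypal-account-update.example", "192.0.2.10", "other@mail.example"),
    ("apple-support-id.example", "198.51.100.7", "actor@mail.example"),
    ("family-photos.example", "203.0.113.5", "owner@mail.example"),
    ("knitting-club.example", "203.0.113.5", "owner@mail.example"),
]

# Hypothetical feature and cutoff; the real classifier's features are not given here.
BRAND_KEYWORDS = ("apple", "icloud", "paypal", "google")
ROGUE_RATIO = 0.5


def build_indexes(records):
    """Index domains by hosting IP and by registrant email."""
    by_ip, by_email, by_domain = defaultdict(set), defaultdict(set), {}
    for domain, ip, email in records:
        by_ip[ip].add(domain)
        by_email[email].add(domain)
        by_domain[domain] = (ip, email)
    return by_ip, by_email, by_domain


def suspicious_ratio(domains):
    """Fraction of domains whose name contains a commonly spoofed brand keyword."""
    if not domains:
        return 0.0
    hits = sum(any(keyword in d for keyword in BRAND_KEYWORDS) for d in domains)
    return hits / len(domains)


def pivot(seed_domain, records):
    """From one caught phishing domain, find related domains and flag rogue IPs and registrants."""
    by_ip, by_email, by_domain = build_indexes(records)
    ip, email = by_domain[seed_domain]
    blacklist = set()
    if suspicious_ratio(by_ip[ip]) >= ROGUE_RATIO:
        blacklist.add(ip)
    if suspicious_ratio(by_email[email]) >= ROGUE_RATIO:
        blacklist.add(email)
    related = (by_ip[ip] | by_email[email]) - {seed_domain}
    return related, blacklist


if __name__ == "__main__":
    related, blacklist = pivot("appleid-verify.example", RECORDS)
    print(sorted(related))
    print(sorted(blacklist))
```

The sketch scores the whole infrastructure (everything on an IP or under a registrant) and not just the seed domain, which is what lets a blacklist entry cover domains that have not served phishing content yet.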
In this sense we can predict the infrastructures phishers will use as they are being set up, and we are now blocking phishing sites even before they go live with spoofed content.
Our phishing recommender system is still in its very early stages of production; however, the results and accuracy of its output thus far have been exceptional. As we push forward and continue to mature our recommender system with PhishTank and OpenDNS data, we only stand to increase the level of security that is delivered to the thousands of OpenDNS users worldwide.
Here is an example of a dedicated phishing site we found, appleid-apple-icloud-safe-link[.]com, where we pivoted on the email address and also the server IP address, and were able to uncover this rogue hosting infrastructure.
Figure 3 displays some domains, IPs, and email registrants associated with a recent Apple phishing campaign that we have been tracking for a while now and that we visualized with OpenGraphiti. We were able to discover and block these actors with the new rogue hosting infrastructure detection system we have created.
Figures 4, 5, and 6 (below) show a specific example of the type of infrastructures we are catching and predictively blocking with these new techniques. We have had the hosting IP address for httpsaccounts-gooogle[.]cf on our blacklist since July, but we are still seeing new phishing pages come online on this IP, and we are blocking them before they are even up and running. This demonstrates the power of our new rogue hosting classification system.
Figure 5 below shows the landing page serving the phishing content spoofing Gmail.
Figure 6 below shows the IP 18.104.22.168 polluted with phishing content:
As the results above show, we are uncovering new rogue hosting providers daily that are leveraged for phishing campaigns and other toxic content. The system is still evolving; however, our findings are very promising, and we look forward to sharing more of them with the community in the future.