At OpenDNS our resolvers are flooded with massive amounts of Chinese domains on a daily basis, many of which security researchers are unfamiliar with. One of the projects our team was initially tasked with was to come up with a method to filter these Chinese domains out from the rest of the traffic in order to reduce the false positive rate for our classifier algorithms and to potentially detect IPs exhibiting spamming or search engine optimization (SEO) behavior. Pinyin is the official phonetic system for transcribing Mandarin pronunciations into the Latin alphabet; it is one of the ways to represent Mandarin or Cantonese on the Internet, specifically in DNS.
In certain cases it is very hard to detect Chinese or Pinyin domains, and most language identification tools are unable to solve this problem effectively. In order to tackle this problem we used the “bag of words” approach, and also used machine learning techniques such as N-gram modeling and Naive Bayes Probability to build an algorithm to classify these domains as Pinyin.
Pinyin, or Hanyu Pinyin, is the official phonetic system for transcribing the Mandarin pronunciations of Chinese characters into the Latin alphabet in the People’s Republic of China, Republic of China (Taiwan), and Singapore. More information about Pinyin can be found on Wikipedia.
Parallels can be drawn between programmatic language detection and the way a human would recognize a language. Before describing the algorithm we designed, let’s discuss a scenario that will build intuition about how language detection algorithms work. Imagine you are walking down the street and there is a person walking in front of you talking on a cell phone in a different language. From the first few phrases out of that person’s mouth, you begin to recognize the language but are not exactly sure what it is.
At what point are you certain about what language the person is speaking? In theory, the way a human recognizes language is the exact same way you would program a machine to do it. For example, saying “how are you?” in Spanish is “como estas?”, but in Portuguese it’s “como vai” and in French it is “comment ca va?”. Since the first word in each phrase sounds the same, you wouldn’t be able to really discern the difference until you hear the next word. This is very similar to the way a computer processes languages: it will have to identify the words character by character, breaking down prefixes, suffixes, and words and match them to its own “memory bank” (corpus).
Background/Problem:
N-gram modeling is a machine learning technique widely used for natural language processing—some examples include spelling correction and searching. Most recently, it has become popular in building security incident detection and monitoring systems. The reason it’s called N-gram is that the algorithm works on N sized character blocks; a 1-token sequence would be a unigram, 2-token sequence, bigram, and an n-token sequence, an n-gram. Typically, when doing language classification, you are classifying a text (documents, webpage, book, article, etc.), and you are training your algorithm on multiple texts written in a specific language that are usually very long in length (e.g. Moby Dick, Paradise Lost, etc.).
Identifying the language of a specific domain presents a harder problem to solve, because it’s much shorter in length than having a whole document full of characters of a certain language. Relating it back to the cell-phone example above; if you were only able to hear a 10 words out of the conversation it would be much harder to accurately identify the language than if you heard 200 words. Also, domains are written in a sort of “Internet Language”, and often contain a lot of numbers, so another thing to take into account was to craft our own version of Pinyin, which is Pinyin text from articles/books combined with domain names.
Corpus Generation:
One of the most crucial aspects of building classifier algorithms is coming up with a solid corpus to train your function on. A corpus is essentially the algorithm’s past experience with the language, and the training stage is where you teach your algorithm that language. It would be a similar comparison to the cell-phone example above. Say that the reason you are able to recognize the language is because you spent a few years in a foreign country. You may be able recognize different dialects, and be able to discern between Spanish as spoken in Spain from Spanish spoken in Latin America, or the differences between Brazilian Portuguese and European Portuguese. Since this was not the traditional method of language detection, we had to define our own language model, a combination of Pinyin domains and Pinyin language found in books or articles written in Pinyin to add as a supplement.
As part of this research we have 3 different types of corpora:
-Plain Pinyin text
-Known Pinyin Domains
-Chinese Language Domains not necessarily Pinyin (mostly comprised of domains with a lot of numbers)
It is very important when building your corpus to craft it very precisely, and not allow for any deviations from what you’re trying to identify. We had to search far and wide on the web for Pinyin texts. Luckily, many of my classmates are from China, or are Chinese-Americans, and were able to direct me to some great resources. Currently we have 3 corpora. One is just a text corpus which is what we might train on for language classifiers. The other was more of an Internet language corpus comprising of Chinese domains. The only problem is some of the Chinese/Pinyin domains just comprise of numbers. The third is a specialized corpus of Chinese domains that consist mostly of numbers, and very few alphabetical characters (ex. 58493.com.cn).
Additional Feature Detection:
We added some supplemental feature-detection on top of our classifier to improve the total score for domains where the language is harder to identify. These features were based off geo-location “hints” extracted from the DNS log data. Here is where we took into account certain features the domain exhibited, for example: .cn in the TLD, or if the country the IP of the domain resolved to the countries China, Hong Kong, or Taiwan. I used PyGeoIP/MaxMind library to do the country lookups. In addition, I also filtered out puny code domains for future analysis, where the SLD start with “xn—“.
Another feature I am starting to design is what I call “giveaway” words, for example, “zhuong”, “xiang”, “zheng” etc. These substrings carry a higher weight and are more unique to Pinyin than other languages, increasing the probability that the domain is Pinyin. The intuition here (going back to the cellphone example), these would be words or sounds you would hear in the conversation when as soon as you heard them, you would instantly recognize the language. Usually they’re very unique to the language, not many other languages would have “zhang”, “xiong”, etc. Scanning through domains for additional words will effect performance, a better alternative would be to assign higher weights to certain trigrams and bigrams. This will require a more in depth analysis of the Pinyin language and the way it’s constructed.
Building the Classifier:
Step 1: Cleaning the data
One of the first things to do when building a text classifier algorithm is to “clean” the data as best as possible. Traditionally we would be working with large texts and, in the preprocessing stage, we would first filter out “stop words” (ex. the, a, than, etc.). Since we are working with domain data, we decided to treat the TLDs as “stop words” and filter those out, as well as all the periods (“.”) for classification. Depending on what type of Chinese domains we are looking for we can strip out the numbers and the dashes. We then break up the domain into bigrams, trigrams, and quad grams and add those into separate dictionaries. Most of the algorithm’s text analysis will be done on the SLD (second-level domain) and the other subdomains attached to that. We then go through and divide.
Step 2: Calculating the probabilities
The next step is to go through and check the if the bigram, trigram, quad gram exists within the corpora. The following calculations are then employed to compute the probabilities for all the grams:
As you can see from the formulas above, the quad grams have a higher weight, being multiplied by 3, trigrams are multiplied by 2, and no weight attached to bigrams. This make sense because the longer the string, especially if it’s more unique, the higher the probability it is a part of a specific language.
Step 3: Adding in features calculating total score
Finally, we went through and summed up all of the probabilities of the all the grams, per domain, and factored in the scores for the additional features to compute the total score per domain.
Sample Output for 10,000 domains
Domain | Pinyin Probability Score |
---|---|
vwudz.enshi0.cn. |
0.005836117 |
files-webcars-com-cn.powercdn.cn. |
0.005729309 |
elvshangjun.cn. |
0.005141224 |
anshanbanxueliwenping.gov.cn.xuspnx.com. |
0.005133575 |
7az0e.fuzhuang278.cn. |
0.005082791 |
t.hefei.cc. |
0.005081406 |
shexiang9.cn. |
0.00506026 |
592.33qyi.fuzhuang206.cn. |
0.005012791 |
huishui.novadigital.cn. |
0.004977142 |
vasba.edu.cn.dkcciau.com. |
0.004937082 |
www.qingbiji.cn. |
0.00485009 |
talk.weibo.10086.cn. |
0.004801779 |
873.41699.win2016.cn. |
0.004795111 |
fcxlb.dianziyouxi11886.org. |
0.004777545 |
3h48.news.qqparty.com.cn. |
0.004717036 |
ezdvv.dianziyouxi11886.org. |
0.004713618 |
egpfm.dianziyouxi13886.org. |
0.004713618 |
dpmyt.dianziyouxi13886.org. |
0.004713618 |
abbhqyt.huangguantouzhudailiwang.cn. |
0.004693812 |
fulltech.com.tw. |
0.004587503 |
317.57836.fuzhuang128.cn. |
0.004560346 |
51121.fuzhuang186.cn. |
0.004560346 |
www.cn-dajiang.com. |
0.00449607 |
roll.caijing.com.cn. |
0.0044836 |
866.4w6uo.tianlisujiao.com. |
0.004483468 |
shhongzhuang.com. |
0.004377925 |
s73q9.huihuangcaxie.cn. |
0.004355986 |
www.cnkingtone.com. |
0.004335397 |
www.02328.cn. |
0.004333471 |
326.6d3ih.fuzhuang376.cn. |
0.004316031 |
www.xinlvxing.com.cn. |
0.004270074 |
1vp6s.beiwei39du.cn. |
0.004250109 |
jinlongqipaiwohaoxiangzhidao.flxc.net. |
0.00424021 |
www.cnsyhz.com. |
0.004237153 |
hugeman.ekymnt.cn. |
0.004232029 |
jianfei21.com. |
0.004216755 |
henanyongtanduojinzhibo.131uu.cn. |
0.004182583 |
www.bdmedia.cn. |
0.004104624 |
tianzhi.com. |
0.004097529 |
www.022w.cn. |
0.004021225 |
dvlnb.whwxbj.cn. |
0.00400433 |
t.mala.cn. |
0.004002096 |
acbyqtj3h5.l20.yunpan.cn. |
0.003999642 |
yutai.0535rc.com. |
0.003986877 |
www.hljzp.net. |
0.003984501 |
684.sa9c0.fgtolu.cn. |
0.003970125 |
8371.n4o6j.huangmayulecheng1.com. |
0.003961966 |
emogo.cn. |
0.00395232 |
bj43b2b.dns4.cn. |
0.003913787 |
qiche2010.com. |
0.003907618 |
szkanne.com. |
0.003906845 |
31688.fpbmkb4.cn. |
0.003899726 |
www.thefox.cn. |
0.00389916 |
www.adminsl.cn. |
0.003898344 |
ptyyyssc.sdtjzk.com. |
0.003848159 |
acvqh.gxuro.com.cn. |
0.003845223 |
155.91178.chenghaijinguangwanju.qdsrrh.cn. |
0.00383204 |
www.eetop.cn. |
0.003821803 |
fengxipingzhaigongsi.cpnys.cc. |
0.003819816 |
mi.cn. |
0.003816321 |
www.s-zone.cn. |
0.003815913 |
bbs.yzg.ely.cn. |
0.003810671 |
78061.eiwqutrancz.cn. |
0.00381022 |
huikangsc.com. |
0.003810175 |
www.xnw5.cn. |
0.003808215 |
profdcb48.websitecname.cn. |
0.003799702 |
56api8h64.dfvfdsk.cn. |
0.00378904 |
19078.meijianail.com. |
0.003786749 |
zhsm12198.com. |
0.003782171 |
www.zymjr.com.cn. |
0.003775527 |
7ch.cnjc56.com. |
0.00377101 |
795.39680.01tch.cn. |
0.003769193 |
i.wo.com.cn. |
0.003768649 |
7doe.cnjc56.com. |
0.003768238 |
www.vmarketing.cn. |
0.003768144 |
blog.libruce.cn. |
0.003767259 |
host1.ynicp.cn. |
0.003764843 |
jw.cicc.com.cn. |
0.00376439 |
d0j3eku.dfupcun.cn. |
0.003764302 |
sun01.f5.sinosure.com.cn. |
0.003763847 |
id.ekymnt.cn. |
0.003762709 |
freesimplehandmade.com. |
0.00376204 |
kis74.8d35.cn. |
0.003761579 |
t.rednet.cn. |
0.003760458 |
mail.sz2g.cn. |
0.003758049 |
2bw7ok.betaclub.cn. |
0.003757989 |
56591.nnoxxv.cn. |
0.003757969 |
bytgdcfsqrj.adaxnw.com. |
0.00375755 |
t.sz.net.cn. |
0.003756152 |
beidougpsweixingdingwei.870118.com. |
0.003755757 |
kxovz.sxjlb.cn. |
0.003755744 |
t.jatxh.cn. |
0.003755336 |
6og6.502550.cn. |
0.003755097 |
lwmcs.cnsh123.com. |
0.00375438 |
www.fgyt.cn. |
0.003754163 |
zcjsjrj.com. |
0.003753777 |
bcaxzqy.nopevcd.cn. |
0.003753239 |
www.bhgmag.com.cn. |
0.003752705 |
1001040177149.027jd.cn. |
0.003752521 |
zhld.com. |
0.003752209 |
51110.nphjw.cn. |
0.003752068 |
bbs.lcxw.cn. |
0.003752 |
dbfhutx.eofvfr.cn. |
0.003751464 |
szgdb.cn. |
0.003751159 |
cd.gccdn.cn. |
0.003750377 |
iphone4dingweizhuizongwangzhi.sjk138.com. |
0.003715294 |
linyibanyingyusiliujichengjidan.h4dzsv.com. |
0.003573521 |
jzlejia.com. |
0.003478708 |
zongtongyulechengbaoma.taijichan.com. |
0.003470566 |
www.win-in-shanghai.com. |
0.003400833 |
huangguantaobaowanganquanma.2014sk.com. |
0.003398596 |
kid.tcdn.qq.com. |
0.003394156 |
jj7lr.shhaifeng.com. |
0.00332892 |
huangguanxianjinwangh.73212.48973.com. |
0.003242515 |
yulinbanwangshangkechawenping.tggomsl.com. |
0.003201924 |
286.putong.zhiwen79.in. |
0.003172524 |
xm-yuanyang.com. |
0.003152139 |
www.uralhelicom.com. |
0.003145724 |
chongqingbaoyang.com. |
0.003139086 |
jrjiaomu.com. |
0.003103858 |
sanlichen.com. |
0.003095998 |
compuhom.com. |
0.00305815 |
yangshengw.net. |
0.003053929 |
ejiacheng.com. |
0.003036892 |
wwwppnbacom.bkk456.com. |
0.003010301 |
dyn-dsl-pt-98-124-47-5.nexicom.net. |
0.003007465 |
nujiangbanbenkebiyezheng.xuspnx.com. |
0.002981089 |
spiritcommunicator.com. |
0.002965862 |
themealmobile.com. |
0.002919905 |
hejianbanjiajiehunzheng.bk6zs.com. |
0.002896148 |
petunione.com. |
0.002880016 |
hot.xinggan.com. |
0.00274529 |
ka5f9.4er17.tianlisujiao.com. |
0.002741001 |
bbs.57xizang.com. |
0.002728393 |
hakkeka.com. |
0.002725519 |
taiziyulecheng18.bbs.227623.com. |
0.002669892 |
drneilmd.com. |
0.002666576 |
invisiblefence-com.webmail.emailsrvr.com. |
0.00260573 |
82725.eufhhr.com. |
0.002603327 |
awanggo.com. |
0.002599609 |
shssl30fkj.shhxjf.com. |
0.002598443 |
zhongguozuqiuduijinqibisai.57138.in. |
0.002582069 |
shenyangbanjiashizheng.pgetkm.com. |
0.002531125 |
vice.duoshuo.com. |
0.002498526 |
yunguichuantiantianlezoushitu.maximschina.cn. |
0.002484479 |
amarpai.com. |
0.002435753 |
shlanyuan.com. |
0.002433683 |
www.mchenryhd.com. |
0.002432575 |
res.mashangju.com.w.alikunlun.com. |
0.002427826 |
fudun.com.tw. |
0.002426019 |
kuaicaile.aomendubojiqiao128168.com. |
0.002419035 |
wwwzd699com.ejer3.com. |
0.002392149 |
nissan-huasheng.net. |
0.002379231 |
kuchetabg.com. |
0.00234786 |
leifenggaoshoutanxinshuiluntan.70539.in. |
0.00234699 |
danyangglassesline.net. |
0.002330363 |
shcpdz.com. |
0.002292495 |
lazxyl.game722.net. |
0.002288849 |
www.ruidi.net. |
0.002284677 |
gregoryaugustine.com. |
0.002279901 |
www.renmaiku.com. |
0.002269117 |
www.saintaugustinehyundai.com. |
0.002247514 |
www.shaolindizi.org.cn. |
0.002247355 |
6796.huhwa.com. |
0.002246984 |
www.jkhyy.com. |
0.002224344 |
sendai.rumotan.com. |
0.002222603 |
club.in2underwear.com. |
0.002214117 |
gztica.com. |
0.002199464 |
securec28.ezhostingserver.com. |
0.002191123 |
bjftzz.com. |
0.002191085 |
bake-line.com. |
0.002186827 |
jameshandlon.com. |
0.002175567 |
www.hwz9.com. |
0.002175143 |
www.chongshengtz.com. |
0.00216471 |
deewallacestone.com. |
0.002154474 |
mail.amazproduct.com. |
0.002148831 |
vod2.igoldengate.com. |
0.00213926 |
emilyratajkowski.org. |
0.002139205 |
www.bahar-narenj.com. |
0.002130782 |
programinvestasisedekah.com. |
0.002129508 |
njqrky.com. |
0.002126455 |
qe3ri.ggdsaeff.com. |
0.002118777 |
baijialeyingqianjueqiao.jychenlong.com. |
0.002116435 |
tongji.wrating.com. |
0.00211354 |
www.footwearjapan.com. |
0.002108889 |
staugustineinvestmentmanagement.com. |
0.002108422 |
dww.xiagc.com.cn. |
0.002106417 |
xbojzk.shemeshop.net. |
0.002105211 |
realgecko.com. |
0.002096769 |
85609.gpxuu.com. |
0.002095043 |
marketdigitalproducts.com. |
0.002093682 |
www.tagless.hk. |
0.002088241 |
cha8i.ifuhxcn.com. |
0.002087579 |
bilishi.2723397.biz. |
0.0020853 |
fltportal.gefleet.com.gtm.ge.com. |
0.002081435 |
mytamilchannel.com. |
0.002081435 |
tu-demounstable-fe.transformersuniverse.com. |
0.002081435 |
xifusheng.com. |
0.002065535 |
13988880001.diwudai.com. |
0.002059709 |
fucai3dlecaiwangzhai.gdlshb.com. |
0.002059691 |
hutongyouwu.com. |
0.002059615 |
gledainajivo.com. |
0.002057547 |
asiri.blogfa.com. |
0.002049428 |
www.yijee.com. |
0.002047796 |
qiahe.net. |
0.002042158 |
marikanasettlement.net. |
0.00203916 |
cd581.gotoip.net. |
0.002036992 |
sparktheevent.com. |
0.002029098 |
www.shhweijia.com. |
0.00202101 |
algerie360.goodbarber.com. |
0.00201951 |
ridethebattle.com. |
0.002017048 |
www.shhsjzcl.com. |
0.002006529 |
www.htcaijing.com. |
0.00199344 |
liuhecaitemacaituwangzhi.d3-w.com. |
0.001992571 |
nj005.zapto.org. |
0.001992072 |
www.tecnostamp-usa.com. |
0.001985682 |
www.bjnahan.net. |
0.001984731 |
qilei.org. |
0.001984712 |
sundragonpress.com. |
0.001981217 |
richmobi.com. |
0.001978547 |
hfyyx.com. |
0.001977105 |
www.concretehr.com. |
0.00197698 |
www.touziqun.com. |
0.001975886 |
wr2um.dycz123.com. |
0.001974865 |
sdykpx.com. |
0.001973644 |
www.bjxcyangdianfeng.net. |
0.00197189 |
07esf.ewzmzgo.com. |
0.001966409 |
www.qihuatong.org. |
0.001965642 |
portsideview.com. |
0.001964235 |
www.cncb.org. |
0.001962655 |
g8ozi.nccpj4.org. |
0.001961915 |
flashvid.dtiblog.com. |
0.001959971 |
www.usapolomalls.com. |
0.001959511 |
22qjz.ejcsjp.com. |
0.001958841 |
rugseattle.com. |
0.001957529 |
nataliakhodakova.com. |
0.00195524 |
en.ex-silver.com. |
0.001954879 |
macerc.org. |
0.001954365 |
fs2.catr.uuzuonline.net. |
0.001953869 |
antennasbest.net. |
0.001953253 |
duncancomics.com. |
0.00195054 |
929.78pfy.cnironfx.com. |
0.001947731 |
mx01.deutsche-annington.com. |
0.001944194 |
mx02.deutsche-annington.com. |
0.001944194 |
www.drhouseitalia.altervista.org. |
0.001943243 |
pastariagranditalia.com. |
0.001943243 |
photourl.carbase.com. |
0.00194233 |
f6byi5.ufc155.org. |
0.001938265 |
hongjiu.ytredwine.com. |
0.00192949 |
prestigegoodyearandautomotive.com. |
0.00192883 |
kedimama.com. |
0.001927289 |
74081.jyijfm.com. |
0.001926884 |
wsdbdszmdd.bsjhjj.com. |
0.001922105 |
as5400-s01ss7a-188.cnt.entelchile.net. |
0.001920853 |
www.deertex.com.tw. |
0.001919265 |
aip2.charolaisusa.com. |
0.001919146 |
s2103.wartune.r2games.com. |
0.00191781 |
www.phonerator.com. |
0.001916443 |
ep.geely.com. |
0.001913873 |
e51cv.ekiwi1.cn. |
0.001913536 |
3g.rsdlyj.com. |
0.001913271 |
628jr41e452.ipcheker.com. |
0.001912738 |
inter-hosfair.com. |
0.001910439 |
lokjv.sweatwerks.com. |
0.001910439 |
killer.51netu.com. |
0.001910439 |
jrtrohmregister.com. |
0.001910439 |
deervalleypress.com. |
0.001910423 |
s162-237-30-96.ssvec.az.wi-power.com. |
0.001910423 |
pictures.comunpoisson.net. |
0.001909499 |
t.jschina.com.cn. |
0.001908993 |
www.asaska.com. |
0.001908888 |
5402.grwjm.com. |
0.001908511 |
terinamg2272.edublogs.org. |
0.001907519 |
www.zdmoz.com. |
0.001906594 |
www.2012synchro.com. |
0.001903092 |
mchenrychamberofcomm.chambermaster.com. |
0.001903076 |
baycity.infellowship.com. |
0.001902937 |
friarsclubinc.org. |
0.001901911 |
lgoc66.hnja.in. |
0.001900581 |
www.gzcyts.com. |
0.001900071 |
www.diweiylc.com. |
0.001899941 |
helpcenterofaustin.org. |
0.001896461 |
www.muziu.com.tw. |
0.001895981 |
mail.pdsdallas.com. |
0.001895747 |
www.qileke.com. |
0.001895471 |
huutokaupat.com. |
0.001894685 |
woodysoutdoorpower.com. |
0.001894518 |
jrtcgb.webs.com. |
0.001894026 |
846.guzcz.com. |
0.001893962 |
yhylc.qvpzyjp.com. |
0.001893881 |
diginyomda.com. |
0.001892797 |
baexxxtu.97cr.cc. |
0.001892075 |
idiyhandmade.com. |
0.001891516 |
voyeur-reviews.info. |
0.001891107 |
dns.ausnutria.com. |
0.001891087 |
amfriendsaugustine.org. |
0.001891071 |
ymgf1.yimoe.com. |
0.001890908 |
nmd54093f.nike-hi.net. |
0.001890762 |
www.yeyouwo.com. |
0.001890102 |
pifaweb.com. |
0.001889254 |
www.techtwomd.com. |
0.001888793 |
special.bydauto.com.cn. |
0.0018877 |
xn--gmil-1na.com. |
0.001887693 |
bestsupply.info. |
0.001887686 |
grandhotelpylypets.com. |
0.001887686 |
603.lfegg.gbfgh.com. |
0.00188764 |
sn-zc.com. |
0.001887495 |
mdiwestziyu1.com. |
0.001887482 |
pt.invoicexpress.com. |
0.001887466 |
i1kjj.kdrnwj.com. |
0.001887241 |
medals4mettle.org. |
0.001886969 |
yierbokaihu.sdqdyt.com. |
0.001886396 |
sdhxhjt.com. |
0.001886379 |
dyn-dsl-mb-98-124-25-213.nexicom.net. |
0.001886222 |
dyn-dsl-mb-98-124-28-108.nexicom.net. |
0.001886222 |
dyn-dsl-mb-98-124-28-62.nexicom.net. |
0.001886222 |
dyn-dsl-mb-98-124-28-231.nexicom.net. |
0.001886222 |
jmyd0.86fashion.net. |
0.001885849 |
pupupuooj.dtiblog.com. |
0.001885539 |
weuee.com. |
0.001884683 |
t080.ltkmoijl.com. |
0.001884524 |
doomedtoexist.com. |
0.001884069 |
ovzxmpeh.seedy123.com. |
0.00188292 |
zerko6.edublogs.org. |
0.001882789 |
pro-dvizh.com. |
0.001882402 |
xntzdb.com. |
0.001882102 |
nzw.3721job.net. |
0.001881729 |
bazartdugrandjas.com. |
0.00188146 |
2012napabasuperregional.apalanjevents.com. |
0.001881036 |
jinkadaishan.flxc.net. |
0.001880306 |
blog.webnots.com. |
0.001880198 |
tjfate.com. |
0.001880169 |
rutrackercat.org. |
0.001880153 |
dqzmyq55.proveke.com. |
0.001879243 |
selectionat.com. |
0.001879227 |
a5sfp.13813.ejewxzg.com. |
0.001879131 |
sbcz.net. |
0.00187871 |
www.sdyjsw.com. |
0.001878593 |
190.70k3l.njrcrx.com. |
0.001878218 |
cxddz.com. |
0.00187805 |
kdeopen.com. |
0.001878039 |
emmtx.cn. |
0.001877706 |
17yy.org. |
0.001877573 |
deehtya.w4fa.com. |
0.001877527 |
xeaa5.shgkv.com. |
0.001877521 |
www.shgzbb.com. |
0.001877521 |
bzlhg.com. |
0.001877505 |
shjtjd.com. |
0.001877362 |
4017.pjbct.com. |
0.001877235 |
www.flurkapelle-boedigheim.com. |
0.001877145 |
www.bestbestinmarket.com. |
0.001877129 |
9dmz0.dwwmswu.com. |
0.001877124 |
hmztv.com. |
0.001877108 |
3ps6i.cj9.in. |
0.001876959 |
815.olukq.qlk668.com. |
0.001876933 |
qlyewu.com. |
0.001876917 |
www.mamaspeaks.com. |
0.001876836 |
zgt6w.ambjlqxw.com. |
0.001876816 |
1mx9a.gvlzei.com. |
0.001876629 |
58349.czxzdt.com. |
0.001876173 |
418.bsjsvq.com. |
0.00187617 |
mnjjr.com. |
0.001876132 |
142.tellht.com. |
0.001876081 |
tpdbv.ebboedmre.com. |
0.001875687 |
sztlqm.com. |
0.001875611 |
xplr-ts-t11-208-114-155-51.barrettxplore.com. |
0.00187541 |
www.xn--mgbebn2h.com. |
0.001875212 |
021vod.com. |
0.001875187 |
xn--q9js9lqa9fj4fn90ata.com. |
0.001875184 |
xn--cckl0itdpc9763ahlyc.cc. |
0.001875184 |
Conclusion:
Overall, the algorithm was successful in being able to identify Pinyin domains in our DNS query traffic. For testing, we ran the filter continuously on traffic samples from our resolvers and were able to come back with successful results. In addition, we used the cosine distance algorithm to test the accuracy of the algorithm. When testing against a few different domain corpuses (French, English, Spanish, Russian, German), the Pinyin one came back with the closest match. Overall this helped the Security Research team sift through domains faster – and in some cases be able to identify new malicious Chinese domains. Some additional features we’d like to add include improving smoothing (for grams with 0 probability) and weighting (features, and possibly grams). We also want to try and detect different types of anomalies that deviate from the norm, for example, Pinyin language in a domain that ends in .eu.
Some future ideas for this project would be to expand the corpora to support multiple languages in addition to Pinyin. As a part of this research we have decided to publish the code for the Pinyin Language Detector on our public GitHub page at https://github.com/opendns/PinyinDetector.