pinyin

At OpenDNS our resolvers are flooded with massive amounts of Chinese domains on a daily basis, many of which security researchers are unfamiliar with. One of the projects our team was initially tasked with was to come up with a method to filter these Chinese domains out from the rest of the traffic in order to reduce the false positive rate for our classifier algorithms and to potentially detect IPs exhibiting spamming or search engine optimization (SEO) behavior. Pinyin is the official phonetic system for transcribing Mandarin pronunciations into the Latin alphabet; it is one of the ways to represent Mandarin or Cantonese on the Internet, specifically in DNS.

In certain cases it is very hard to detect Chinese or Pinyin domains, and most language identification tools are unable to solve this problem effectively. In order to tackle this problem we used the “bag of words” approach, and also used machine learning techniques such as N-gram modeling and Naive Bayes Probability to build an algorithm to classify these domains as Pinyin.

Pinyin, or Hanyu Pinyin, is the official phonetic system for transcribing the Mandarin pronunciations of Chinese characters into the Latin alphabet in the People’s Republic of China, Republic of China (Taiwan), and Singapore. More information about Pinyin can be found on Wikipedia.

Parallels can be drawn between programmatic language detection and the way a human would recognize a language. Before describing the algorithm we designed, let’s discuss a scenario that will build intuition about how language detection algorithms work. Imagine you are walking down the street and there is a person walking in front of you talking on a cell phone in a different language. From the first few phrases out of that person’s mouth, you begin to recognize the language but are not exactly sure what it is.

At what point are you certain about what language the person is speaking? In theory, the way a human recognizes language is the exact same way you would program a machine to do it. For example, saying “how are you?” in Spanish is “como estas?”, but in Portuguese it’s “como vai” and in French it is “comment ca va?”. Since the first word in each phrase sounds the same, you wouldn’t be able to really discern the difference until you hear the next word. This is very similar to the way a computer processes languages: it will have to identify the words character by character, breaking down prefixes, suffixes, and words and match them to its own “memory bank” (corpus).

 

Background/Problem:

N-gram modeling is a machine learning technique widely used for natural language processing—some examples include spelling correction and searching. Most recently, it has become popular in building security incident detection and monitoring systems. The reason it’s called N-gram is that the algorithm works on N sized character blocks; a 1-token sequence would be a unigram, 2-token sequence, bigram, and an n-token sequence, an n-gram. Typically, when doing language classification, you are classifying a text (documents, webpage, book, article, etc.), and you are training your algorithm on multiple texts written in a specific language that are usually very long in length (e.g. Moby Dick, Paradise Lost, etc.).

Identifying the language of a specific domain presents a harder problem to solve, because it’s much shorter in length than having a whole document full of characters of a certain language. Relating it back to the cell-phone example above; if you were only able to hear a 10 words out of the conversation it would be much harder to accurately identify the language than if you heard 200 words. Also,  domains are written in a sort of “Internet Language”, and often contain a lot of numbers, so another thing to take into account was to craft our own version of Pinyin, which is Pinyin text from articles/books combined with domain names.

Corpus Generation:

One of the most crucial aspects of building classifier algorithms is coming up with a solid corpus to train your function on. A corpus is essentially the algorithm’s past experience with the language, and the training stage is where you teach your algorithm that language. It would be a similar comparison to the cell-phone example above. Say that the reason you are able to recognize the language is because you spent a few years in a foreign country. You may be able recognize different dialects, and be able to discern between Spanish as spoken in Spain from Spanish spoken in Latin America, or the differences between Brazilian Portuguese and European Portuguese. Since this was not the traditional method of language detection, we had to define our own language model, a combination of Pinyin domains and Pinyin language found in books or articles written in Pinyin to add as a supplement.

As part of this research we have 3 different types of corpora:

-Plain Pinyin text (this was a great resource: http://en.wikibooks.org/wiki/Category:Pinyin)
-Known Pinyin Domains
-Chinese Language Domains not necessarily Pinyin (mostly comprised of domains with a lot of numbers)

It is very important when building your corpus to craft it very precisely, and not allow for any deviations from what you’re trying to identify. We had to search far and wide on the web for Pinyin texts. Luckily, many of my classmates are from China, or are Chinese-Americans, and were able to direct me to some great resources. Currently we have 3 corpora. One is just a text corpus which is what we might train on for language classifiers. The other was more of an Internet language corpus comprising of Chinese domains. The only problem is some of the Chinese/Pinyin domains just comprise of numbers. The third is a specialized corpus of Chinese domains that consist mostly of numbers, and very few alphabetical characters (ex. 58493.com.cn).

 

Additional Feature Detection:

We added some supplemental feature-detection on top of our classifier to improve the total score for domains where the language is harder to identify. These features were based off geo-location “hints” extracted from the DNS log data. Here is where we took into account certain features the domain exhibited, for example: .cn in the TLD, or if the country the IP of the domain resolved to the countries China, Hong Kong, or Taiwan. I used PyGeoIP/MaxMind library to do the country lookups. In addition, I also filtered out puny code domains for future analysis, where the SLD start with “xn—“.

Another feature I am starting to design is what I call “giveaway” words, for example, “zhuong”, “xiang”, “zheng” etc. These substrings carry a higher weight and are more unique to Pinyin than other languages, increasing the probability that the domain is Pinyin. The intuition here (going back to the cellphone example), these would be words or sounds you would hear in the conversation when as soon as you heard them, you would instantly recognize the language. Usually they’re very unique to the language, not many other languages would have “zhang”, “xiong”, etc. Scanning through domains for additional words will effect performance, a better alternative would be to assign higher weights to certain trigrams and bigrams. This will require a more in depth analysis of the Pinyin language and the way it’s constructed.

 

Building the Classifier:

Step 1: Cleaning the data

One of the first things to do when building a text classifier algorithm is to “clean” the data as best as possible. Traditionally we would be working with large texts and, in the preprocessing stage, we would first filter out “stop words” (ex. the, a, than, etc.). Since we are working with domain data, we decided to treat the TLDs as “stop words” and filter those out, as well as all the periods (“.”) for classification. Depending on what type of Chinese domains we are looking for we can strip out the numbers and the dashes. We then break up the domain into bigrams, trigrams, and quad grams and add those into separate dictionaries. Most of the algorithm’s text analysis will be done on the SLD (second-level domain) and the other subdomains attached to that. We then go through and divide.

Step 2: Calculating the probabilities

The next step is to go through and check the if the bigram, trigram, quad gram exists within the corpora. The following calculations are then employed to compute the probabilities for all the grams:

Screenshot 2014-10-16 11.24.11

As you can see from the formulas above, the quad grams have a higher weight, being multiplied by 3, trigrams are multiplied by 2, and no weight attached to bigrams. This make sense because the longer the string, especially if it’s more unique, the higher the probability it is a part of a specific language.

Step 3: Adding in features calculating total score

Finally, we went through and summed up all of the probabilities of the all the grams, per domain, and factored in the scores for the additional features to compute the total score per domain.

 

Sample Output for 10,000 domains

Domain Pinyin Probability Score
vwudz.enshi0.cn.
0.005836117
files-webcars-com-cn.powercdn.cn.
0.005729309
elvshangjun.cn.
0.005141224
anshanbanxueliwenping.gov.cn.xuspnx.com.
0.005133575
7az0e.fuzhuang278.cn.
0.005082791
t.hefei.cc.
0.005081406
shexiang9.cn.
0.00506026
592.33qyi.fuzhuang206.cn.
0.005012791
huishui.novadigital.cn.
0.004977142
vasba.edu.cn.dkcciau.com.
0.004937082
www.qingbiji.cn.
0.00485009
talk.weibo.10086.cn.
0.004801779
873.41699.win2016.cn.
0.004795111
fcxlb.dianziyouxi11886.org.
0.004777545
3h48.news.qqparty.com.cn.
0.004717036
ezdvv.dianziyouxi11886.org.
0.004713618
egpfm.dianziyouxi13886.org.
0.004713618
dpmyt.dianziyouxi13886.org.
0.004713618
abbhqyt.huangguantouzhudailiwang.cn.
0.004693812
fulltech.com.tw.
0.004587503
317.57836.fuzhuang128.cn.
0.004560346
51121.fuzhuang186.cn.
0.004560346
www.cn-dajiang.com.
0.00449607
roll.caijing.com.cn.
0.0044836
866.4w6uo.tianlisujiao.com.
0.004483468
shhongzhuang.com.
0.004377925
s73q9.huihuangcaxie.cn.
0.004355986
www.cnkingtone.com.
0.004335397
www.02328.cn.
0.004333471
326.6d3ih.fuzhuang376.cn.
0.004316031
www.xinlvxing.com.cn.
0.004270074
1vp6s.beiwei39du.cn.
0.004250109
jinlongqipaiwohaoxiangzhidao.flxc.net.
0.00424021
www.cnsyhz.com.
0.004237153
hugeman.ekymnt.cn.
0.004232029
jianfei21.com.
0.004216755
henanyongtanduojinzhibo.131uu.cn.
0.004182583
www.bdmedia.cn.
0.004104624
tianzhi.com.
0.004097529
www.022w.cn.
0.004021225
dvlnb.whwxbj.cn.
0.00400433
t.mala.cn.
0.004002096
acbyqtj3h5.l20.yunpan.cn.
0.003999642
yutai.0535rc.com.
0.003986877
www.hljzp.net.
0.003984501
684.sa9c0.fgtolu.cn.
0.003970125
8371.n4o6j.huangmayulecheng1.com.
0.003961966
emogo.cn.
0.00395232
bj43b2b.dns4.cn.
0.003913787
qiche2010.com.
0.003907618
szkanne.com.
0.003906845
31688.fpbmkb4.cn.
0.003899726
www.thefox.cn.
0.00389916
www.adminsl.cn.
0.003898344
ptyyyssc.sdtjzk.com.
0.003848159
acvqh.gxuro.com.cn.
0.003845223
155.91178.chenghaijinguangwanju.qdsrrh.cn.
0.00383204
www.eetop.cn.
0.003821803
fengxipingzhaigongsi.cpnys.cc.
0.003819816
mi.cn.
0.003816321
www.s-zone.cn.
0.003815913
bbs.yzg.ely.cn.
0.003810671
78061.eiwqutrancz.cn.
0.00381022
huikangsc.com.
0.003810175
www.xnw5.cn.
0.003808215
profdcb48.websitecname.cn.
0.003799702
56api8h64.dfvfdsk.cn.
0.00378904
19078.meijianail.com.
0.003786749
zhsm12198.com.
0.003782171
www.zymjr.com.cn.
0.003775527
7ch.cnjc56.com.
0.00377101
795.39680.01tch.cn.
0.003769193
i.wo.com.cn.
0.003768649
7doe.cnjc56.com.
0.003768238
www.vmarketing.cn.
0.003768144
blog.libruce.cn.
0.003767259
host1.ynicp.cn.
0.003764843
jw.cicc.com.cn.
0.00376439
d0j3eku.dfupcun.cn.
0.003764302
sun01.f5.sinosure.com.cn.
0.003763847
id.ekymnt.cn.
0.003762709
freesimplehandmade.com.
0.00376204
kis74.8d35.cn.
0.003761579
t.rednet.cn.
0.003760458
mail.sz2g.cn.
0.003758049
2bw7ok.betaclub.cn.
0.003757989
56591.nnoxxv.cn.
0.003757969
bytgdcfsqrj.adaxnw.com.
0.00375755
t.sz.net.cn.
0.003756152
beidougpsweixingdingwei.870118.com.
0.003755757
kxovz.sxjlb.cn.
0.003755744
t.jatxh.cn.
0.003755336
6og6.502550.cn.
0.003755097
lwmcs.cnsh123.com.
0.00375438
www.fgyt.cn.
0.003754163
zcjsjrj.com.
0.003753777
bcaxzqy.nopevcd.cn.
0.003753239
www.bhgmag.com.cn.
0.003752705
1001040177149.027jd.cn.
0.003752521
zhld.com.
0.003752209
51110.nphjw.cn.
0.003752068
bbs.lcxw.cn.
0.003752
dbfhutx.eofvfr.cn.
0.003751464
szgdb.cn.
0.003751159
cd.gccdn.cn.
0.003750377
iphone4dingweizhuizongwangzhi.sjk138.com.
0.003715294
linyibanyingyusiliujichengjidan.h4dzsv.com.
0.003573521
jzlejia.com.
0.003478708
zongtongyulechengbaoma.taijichan.com.
0.003470566
www.win-in-shanghai.com.
0.003400833
huangguantaobaowanganquanma.2014sk.com.
0.003398596
kid.tcdn.qq.com.
0.003394156
jj7lr.shhaifeng.com.
0.00332892
huangguanxianjinwangh.73212.48973.com.
0.003242515
yulinbanwangshangkechawenping.tggomsl.com.
0.003201924
286.putong.zhiwen79.in.
0.003172524
xm-yuanyang.com.
0.003152139
www.uralhelicom.com.
0.003145724
chongqingbaoyang.com.
0.003139086
jrjiaomu.com.
0.003103858
sanlichen.com.
0.003095998
compuhom.com.
0.00305815
yangshengw.net.
0.003053929
ejiacheng.com.
0.003036892
wwwppnbacom.bkk456.com.
0.003010301
dyn-dsl-pt-98-124-47-5.nexicom.net.
0.003007465
nujiangbanbenkebiyezheng.xuspnx.com.
0.002981089
spiritcommunicator.com.
0.002965862
themealmobile.com.
0.002919905
hejianbanjiajiehunzheng.bk6zs.com.
0.002896148
petunione.com.
0.002880016
hot.xinggan.com.
0.00274529
ka5f9.4er17.tianlisujiao.com.
0.002741001
bbs.57xizang.com.
0.002728393
hakkeka.com.
0.002725519
taiziyulecheng18.bbs.227623.com.
0.002669892
drneilmd.com.
0.002666576
invisiblefence-com.webmail.emailsrvr.com.
0.00260573
82725.eufhhr.com.
0.002603327
awanggo.com.
0.002599609
shssl30fkj.shhxjf.com.
0.002598443
zhongguozuqiuduijinqibisai.57138.in.
0.002582069
shenyangbanjiashizheng.pgetkm.com.
0.002531125
vice.duoshuo.com.
0.002498526
yunguichuantiantianlezoushitu.maximschina.cn.
0.002484479
amarpai.com.
0.002435753
shlanyuan.com.
0.002433683
www.mchenryhd.com.
0.002432575
res.mashangju.com.w.alikunlun.com.
0.002427826
fudun.com.tw.
0.002426019
kuaicaile.aomendubojiqiao128168.com.
0.002419035
wwwzd699com.ejer3.com.
0.002392149
nissan-huasheng.net.
0.002379231
kuchetabg.com.
0.00234786
leifenggaoshoutanxinshuiluntan.70539.in.
0.00234699
danyangglassesline.net.
0.002330363
shcpdz.com.
0.002292495
lazxyl.game722.net.
0.002288849
www.ruidi.net.
0.002284677
gregoryaugustine.com.
0.002279901
www.renmaiku.com.
0.002269117
www.saintaugustinehyundai.com.
0.002247514
www.shaolindizi.org.cn.
0.002247355
6796.huhwa.com.
0.002246984
www.jkhyy.com.
0.002224344
sendai.rumotan.com.
0.002222603
club.in2underwear.com.
0.002214117
gztica.com.
0.002199464
securec28.ezhostingserver.com.
0.002191123
bjftzz.com.
0.002191085
bake-line.com.
0.002186827
jameshandlon.com.
0.002175567
www.hwz9.com.
0.002175143
www.chongshengtz.com.
0.00216471
deewallacestone.com.
0.002154474
mail.amazproduct.com.
0.002148831
vod2.igoldengate.com.
0.00213926
emilyratajkowski.org.
0.002139205
www.bahar-narenj.com.
0.002130782
programinvestasisedekah.com.
0.002129508
njqrky.com.
0.002126455
qe3ri.ggdsaeff.com.
0.002118777
baijialeyingqianjueqiao.jychenlong.com.
0.002116435
tongji.wrating.com.
0.00211354
www.footwearjapan.com.
0.002108889
staugustineinvestmentmanagement.com.
0.002108422
dww.xiagc.com.cn.
0.002106417
xbojzk.shemeshop.net.
0.002105211
realgecko.com.
0.002096769
85609.gpxuu.com.
0.002095043
marketdigitalproducts.com.
0.002093682
www.tagless.hk.
0.002088241
cha8i.ifuhxcn.com.
0.002087579
bilishi.2723397.biz.
0.0020853
fltportal.gefleet.com.gtm.ge.com.
0.002081435
mytamilchannel.com.
0.002081435
tu-demounstable-fe.transformersuniverse.com.
0.002081435
xifusheng.com.
0.002065535
13988880001.diwudai.com.
0.002059709
fucai3dlecaiwangzhai.gdlshb.com.
0.002059691
hutongyouwu.com.
0.002059615
gledainajivo.com.
0.002057547
asiri.blogfa.com.
0.002049428
www.yijee.com.
0.002047796
qiahe.net.
0.002042158
marikanasettlement.net.
0.00203916
cd581.gotoip.net.
0.002036992
sparktheevent.com.
0.002029098
www.shhweijia.com.
0.00202101
algerie360.goodbarber.com.
0.00201951
ridethebattle.com.
0.002017048
www.shhsjzcl.com.
0.002006529
www.htcaijing.com.
0.00199344
liuhecaitemacaituwangzhi.d3-w.com.
0.001992571
nj005.zapto.org.
0.001992072
www.tecnostamp-usa.com.
0.001985682
www.bjnahan.net.
0.001984731
qilei.org.
0.001984712
sundragonpress.com.
0.001981217
richmobi.com.
0.001978547
hfyyx.com.
0.001977105
www.concretehr.com.
0.00197698
www.touziqun.com.
0.001975886
wr2um.dycz123.com.
0.001974865
sdykpx.com.
0.001973644
www.bjxcyangdianfeng.net.
0.00197189
07esf.ewzmzgo.com.
0.001966409
www.qihuatong.org.
0.001965642
portsideview.com.
0.001964235
www.cncb.org.
0.001962655
g8ozi.nccpj4.org.
0.001961915
flashvid.dtiblog.com.
0.001959971
www.usapolomalls.com.
0.001959511
22qjz.ejcsjp.com.
0.001958841
rugseattle.com.
0.001957529
nataliakhodakova.com.
0.00195524
en.ex-silver.com.
0.001954879
macerc.org.
0.001954365
fs2.catr.uuzuonline.net.
0.001953869
antennasbest.net.
0.001953253
duncancomics.com.
0.00195054
929.78pfy.cnironfx.com.
0.001947731
mx01.deutsche-annington.com.
0.001944194
mx02.deutsche-annington.com.
0.001944194
www.drhouseitalia.altervista.org.
0.001943243
pastariagranditalia.com.
0.001943243
photourl.carbase.com.
0.00194233
f6byi5.ufc155.org.
0.001938265
hongjiu.ytredwine.com.
0.00192949
prestigegoodyearandautomotive.com.
0.00192883
kedimama.com.
0.001927289
74081.jyijfm.com.
0.001926884
wsdbdszmdd.bsjhjj.com.
0.001922105
as5400-s01ss7a-188.cnt.entelchile.net.
0.001920853
www.deertex.com.tw.
0.001919265
aip2.charolaisusa.com.
0.001919146
s2103.wartune.r2games.com.
0.00191781
www.phonerator.com.
0.001916443
ep.geely.com.
0.001913873
e51cv.ekiwi1.cn.
0.001913536
3g.rsdlyj.com.
0.001913271
628jr41e452.ipcheker.com.
0.001912738
inter-hosfair.com.
0.001910439
lokjv.sweatwerks.com.
0.001910439
killer.51netu.com.
0.001910439
jrtrohmregister.com.
0.001910439
deervalleypress.com.
0.001910423
s162-237-30-96.ssvec.az.wi-power.com.
0.001910423
pictures.comunpoisson.net.
0.001909499
t.jschina.com.cn.
0.001908993
www.asaska.com.
0.001908888
5402.grwjm.com.
0.001908511
terinamg2272.edublogs.org.
0.001907519
www.zdmoz.com.
0.001906594
www.2012synchro.com.
0.001903092
mchenrychamberofcomm.chambermaster.com.
0.001903076
baycity.infellowship.com.
0.001902937
friarsclubinc.org.
0.001901911
lgoc66.hnja.in.
0.001900581
www.gzcyts.com.
0.001900071
www.diweiylc.com.
0.001899941
helpcenterofaustin.org.
0.001896461
www.muziu.com.tw.
0.001895981
mail.pdsdallas.com.
0.001895747
www.qileke.com.
0.001895471
huutokaupat.com.
0.001894685
woodysoutdoorpower.com.
0.001894518
jrtcgb.webs.com.
0.001894026
846.guzcz.com.
0.001893962
yhylc.qvpzyjp.com.
0.001893881
diginyomda.com.
0.001892797
baexxxtu.97cr.cc.
0.001892075
idiyhandmade.com.
0.001891516
voyeur-reviews.info.
0.001891107
dns.ausnutria.com.
0.001891087
amfriendsaugustine.org.
0.001891071
ymgf1.yimoe.com.
0.001890908
nmd54093f.nike-hi.net.
0.001890762
www.yeyouwo.com.
0.001890102
pifaweb.com.
0.001889254
www.techtwomd.com.
0.001888793
special.bydauto.com.cn.
0.0018877
xn--gmil-1na.com.
0.001887693
bestsupply.info.
0.001887686
grandhotelpylypets.com.
0.001887686
603.lfegg.gbfgh.com.
0.00188764
sn-zc.com.
0.001887495
mdiwestziyu1.com.
0.001887482
pt.invoicexpress.com.
0.001887466
i1kjj.kdrnwj.com.
0.001887241
medals4mettle.org.
0.001886969
yierbokaihu.sdqdyt.com.
0.001886396
sdhxhjt.com.
0.001886379
dyn-dsl-mb-98-124-25-213.nexicom.net.
0.001886222
dyn-dsl-mb-98-124-28-108.nexicom.net.
0.001886222
dyn-dsl-mb-98-124-28-62.nexicom.net.
0.001886222
dyn-dsl-mb-98-124-28-231.nexicom.net.
0.001886222
jmyd0.86fashion.net.
0.001885849
pupupuooj.dtiblog.com.
0.001885539
weuee.com.
0.001884683
t080.ltkmoijl.com.
0.001884524
doomedtoexist.com.
0.001884069
ovzxmpeh.seedy123.com.
0.00188292
zerko6.edublogs.org.
0.001882789
pro-dvizh.com.
0.001882402
xntzdb.com.
0.001882102
nzw.3721job.net.
0.001881729
bazartdugrandjas.com.
0.00188146
2012napabasuperregional.apalanjevents.com.
0.001881036
jinkadaishan.flxc.net.
0.001880306
blog.webnots.com.
0.001880198
tjfate.com.
0.001880169
rutrackercat.org.
0.001880153
dqzmyq55.proveke.com.
0.001879243
selectionat.com.
0.001879227
a5sfp.13813.ejewxzg.com.
0.001879131
sbcz.net.
0.00187871
www.sdyjsw.com.
0.001878593
190.70k3l.njrcrx.com.
0.001878218
cxddz.com.
0.00187805
kdeopen.com.
0.001878039
emmtx.cn.
0.001877706
17yy.org.
0.001877573
deehtya.w4fa.com.
0.001877527
xeaa5.shgkv.com.
0.001877521
www.shgzbb.com.
0.001877521
bzlhg.com.
0.001877505
shjtjd.com.
0.001877362
4017.pjbct.com.
0.001877235
www.flurkapelle-boedigheim.com.
0.001877145
www.bestbestinmarket.com.
0.001877129
9dmz0.dwwmswu.com.
0.001877124
hmztv.com.
0.001877108
3ps6i.cj9.in.
0.001876959
815.olukq.qlk668.com.
0.001876933
qlyewu.com.
0.001876917
www.mamaspeaks.com.
0.001876836
zgt6w.ambjlqxw.com.
0.001876816
1mx9a.gvlzei.com.
0.001876629
58349.czxzdt.com.
0.001876173
418.bsjsvq.com.
0.00187617
mnjjr.com.
0.001876132
142.tellht.com.
0.001876081
tpdbv.ebboedmre.com.
0.001875687
sztlqm.com.
0.001875611
xplr-ts-t11-208-114-155-51.barrettxplore.com.
0.00187541
www.xn--mgbebn2h.com.
0.001875212
021vod.com.
0.001875187
xn--q9js9lqa9fj4fn90ata.com.
0.001875184
xn--cckl0itdpc9763ahlyc.cc.
0.001875184

 

Conclusion:

Overall, the algorithm was successful in being able to identify Pinyin domains in our DNS query traffic. For testing, we ran the filter continuously on traffic samples from our resolvers and were able to come back with successful results. In addition, we used the cosine distance algorithm to test the accuracy of the algorithm. When testing against a few different domain corpuses (French, English, Spanish, Russian, German), the Pinyin one came back with the closest match. Overall this helped the Security Research team sift through domains faster – and in some cases be able to identify new malicious Chinese domains. Some additional features we’d like to add include improving smoothing (for grams with 0 probability) and weighting (features, and possibly grams). We also want to try and detect different types of anomalies that deviate from the norm, for example, Pinyin language in a domain that ends in .eu.

Some future ideas for this project would be to expand the corpora to support multiple languages in addition to Pinyin. As a part of this research we have decided to publish the code for the Pinyin Language Detector on our public GitHub page at https://github.com/opendns/PinyinDetector.

This post is categorized in: