LangID (langid.py) is a Python library for language identification. Given a string as input, it reports which language the string is written in.
Basic usage is as follows.
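# -*- coding: utf-8 -*-
import langid

# classify() returns a (language code, score) tuple.
# The input is a Japanese sentence meaning "This is Japanese".
result = langid.classify('これは日本語です')
print(result)  #=> ('ja', -197.7628321647644)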
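The second element of the tuple is an unnormalized log-probability, which is why it is a large negative number and hard to read as a confidence. The following is a minimal sketch, assuming langid.py's `LanguageIdentifier.from_modelstring(model, norm_probs=True)` interface as described in the project README, that rescales the score into the 0 to 1 range. The helper `describe` and the sample sentence are hypothetical names added for illustration, not part of the library.

```python
def describe(identifier, text):
    """Format a classify() result as 'lang (confidence%)'.

    `identifier` is any object whose classify(text) returns a
    (language code, probability in 0..1) tuple.
    """
    lang, prob = identifier.classify(text)
    return "%s (%.1f%%)" % (lang, prob * 100)


try:
    from langid.langid import LanguageIdentifier, model
except ImportError:  # langid.py is not installed; skip the demo below
    LanguageIdentifier = None

if LanguageIdentifier is not None:
    # norm_probs=True asks langid.py to normalize the score to a probability.
    identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
    print(describe(identifier, "This is an English sentence."))
```

Building the identifier once and reusing it avoids paying the model setup cost on every call.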
The algorithms in this library are based on published research, and references can be found here.
One concern is speed. The simple test above takes nearly 3 seconds, so the library seems hard to use as-is in web applications, where real-time response matters.
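To see where the time goes, it helps to time each call separately instead of the whole script. One possible explanation, which the measurement above does not confirm, is that langid.py loads its model on the first `classify` call, so later calls in the same process may be much faster. The sketch below is only a measuring aid under that assumption. The helper `time_calls` and the sample sentences are hypothetical, and no particular timings are claimed.

```python
import time


def time_calls(classify, texts):
    """Call classify(text) for each text and return (result, seconds) pairs in order."""
    timings = []
    for text in texts:
        start = time.perf_counter()
        result = classify(text)
        timings.append((result, time.perf_counter() - start))
    return timings


try:
    import langid
except ImportError:  # langid.py is not installed; skip the demo below
    langid = None

if langid is not None:
    samples = ["This is English.", "Ceci est une phrase.", "Dies ist ein Satz."]
    # The first row includes any one-time setup cost; later rows do not.
    for (lang, score), seconds in time_calls(langid.classify, samples):
        print("%s  %.4f s" % (lang, seconds))
```

If the first call dominates, a long-running web process would pay that cost once at startup and not on every request.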