Frequency Analysis of Unicode Blocks with PHP

Dear readers, how are you today? I hope you are having a great day. Today I want to share that I’ve been working on Babylon, a language detector for PHP that has been implemented with a machine learning technique other than n-grams.

Babylon aims to be a simple, easy-to-use language detector.

The Documentation describes how the library can learn new languages. If you’d like to help add more, that would be great. Feel free to add a new language and send a PR.

The cli/prepare.php command calculates some statistics, namely the fingerprint of the language families: a set of the most frequent words expected to be found in each family of languages, with the sets kept roughly disjoint from one another.

Example: babylon/dataset/output/latin-fingerprint.csv
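To make the fingerprint idea more concrete, here is a minimal sketch of one way such sets could be built: take the most frequent words of each sample text and drop every word that shows up in more than one list. This is not the code in cli/prepare.php. The function names topWords and fingerprints, the sample texts, and the 'en' and 'es' labels are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: the $limit most frequent lowercase words of $text.
function topWords(string $text, int $limit): array
{
    // \p{L}+ matches runs of letters in any script.
    preg_match_all('/\p{L}+/u', mb_strtolower($text, 'UTF-8'), $matches);
    $counts = array_count_values($matches[0]);
    arsort($counts); // most frequent first
    return array_slice(array_keys($counts), 0, $limit);
}

// Hypothetical helper: for each label, keep only the frequent words
// that do not appear in any other label's list.
function fingerprints(array $samples, int $limit): array
{
    $top = [];
    foreach ($samples as $label => $text) {
        $top[$label] = topWords($text, $limit);
    }
    // Number of lists in which each word appears.
    $seen = array_count_values(array_merge(...array_values($top)));
    $result = [];
    foreach ($top as $label => $words) {
        $result[$label] = array_values(
            array_filter($words, fn($word) => $seen[$word] === 1)
        );
    }
    return $result;
}

// Made-up sample texts; "no" is frequent in both, so it is dropped from both.
$samples = [
    'en' => 'the cat saw the dog no the cat',
    'es' => 'el gato no vio el perro no el gato',
];
print_r(fingerprints($samples, 10));
```

Removing shared words is what makes the resulting sets disjoint in this sketch; a real fingerprint would need far larger samples than these two sentences.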

Now, going back to the title of this post, have you ever needed to count the number of times that a particular Unicode block is used in a string or text for frequency analysis purposes? Babylon can detect alphabets because it relies on Unicode Ranges, which is a PHP tool to work with Unicode blocks in a friendly, object-oriented way.

Unicode Ranges might help you achieve your goal as well; it is useful when you need to know whether a given character belongs to this or that Unicode block.

Char  Belongs To
a     BasicLatin
Ξ     GreekAndCoptic
Ӝ     Cyrillic
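The lookup behind this table can be sketched in plain PHP without the library. The snippet below does not use the Unicode Ranges API; the BLOCKS constant and the blockOf function are hypothetical names, and only the three blocks from the table are listed, with boundaries taken from the Unicode standard. It assumes the mbstring extension is available.

```php
<?php
// Hypothetical lookup table: block name => [first code point, last code point].
// Only three blocks are listed here; Unicode defines many more.
const BLOCKS = [
    'BasicLatin'     => [0x0000, 0x007F],
    'GreekAndCoptic' => [0x0370, 0x03FF],
    'Cyrillic'       => [0x0400, 0x04FF],
];

// Returns the name of the block containing $char,
// or null if the character is invalid or not in the table.
function blockOf(string $char): ?string
{
    $codePoint = mb_ord($char, 'UTF-8');
    if ($codePoint === false) {
        return null;
    }
    foreach (BLOCKS as $name => [$first, $last]) {
        if ($codePoint >= $first && $codePoint <= $last) {
            return $name;
        }
    }
    return null;
}

foreach (['a', 'Ξ', 'Ӝ'] as $char) {
    echo $char, ' ', blockOf($char), PHP_EOL;
}
// Prints:
// a BasicLatin
// Ξ GreekAndCoptic
// Ӝ Cyrillic
```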

More specifically, the babylon/src/Unicode.php file counts how many times each Unicode range appears in a text.

Converter::unicode2range($char) is the method that converts each character into its object-oriented Unicode block counterpart.
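The counting step can be sketched roughly as follows. This is not the code in babylon/src/Unicode.php, and it does not call Converter::unicode2range(); a small hand-written table stands in for the library. The names BLOCK_RANGES and countBlocks are hypothetical, and the sketch assumes PHP 7.4 or later with the mbstring extension.

```php
<?php
// Hypothetical stand-in for the library: block name => [first, last] code point.
const BLOCK_RANGES = [
    'BasicLatin'     => [0x0000, 0x007F],
    'GreekAndCoptic' => [0x0370, 0x03FF],
    'Cyrillic'       => [0x0400, 0x04FF],
];

// Counts how many characters of $text fall into each block,
// most frequent block first. Unlisted characters go under 'Other'.
function countBlocks(string $text): array
{
    $counts = [];
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        $codePoint = mb_ord($char, 'UTF-8');
        if ($codePoint === false) {
            continue; // skip invalid byte sequences
        }
        $block = 'Other';
        foreach (BLOCK_RANGES as $name => [$first, $last]) {
            if ($codePoint >= $first && $codePoint <= $last) {
                $block = $name;
                break;
            }
        }
        $counts[$block] = ($counts[$block] ?? 0) + 1;
    }
    arsort($counts);
    return $counts;
}

print_r(countBlocks('Привет, world'));
// BasicLatin => 7 (the comma, the space and "world"), Cyrillic => 6
```

Note that punctuation and whitespace count towards BasicLatin in this sketch; for alphabet detection you might prefer to filter them out before counting.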

And here is a test to show how to get information about the characters used in a string.

That’s it for now. I hope you’ll find the code self-explanatory. Thank you for reading this post; I hope you liked it.