Frequency Analysis of Unicode Blocks with PHP

Dear readers, how are you today? I hope you are having a great day. Today I just want to share with you that I’ve been working on Babylon, a language detector for PHP that has been implemented with a machine learning technique other than n-grams.

Babylon is the simplest, easiest-to-use language detector.

use Babylon\LanguageDetector;

$text = 'You will have your data soon, I remarked, pointing with my finger;
        this is the Brixton Road, and that is the house, if I am not very much
        mistaken.';

$isoCode = (new LanguageDetector($text))->detect();

The Documentation describes how the library can learn new languages. If you’d like to help add more, that would be great: feel free to add a new language and send a PR.

php cli/prepare.php
This will create a CSV with the most frequent words in all of the files in the dataset/input folder.
The operation may take a few seconds to be completed.
Do you want to proceed? (Y/N): y
OK! The most frequent words in ceb.txt were transformed into CSV format...
OK! The most frequent words in tgl.txt were transformed into CSV format...
The austronesian language family has been updated.
OK! The most frequent words in cym.txt were transformed into CSV format...
OK! The most frequent words in gla.txt were transformed into CSV format...
OK! The most frequent words in gle.txt were transformed into CSV format...
The gaelic language family has been updated.
OK! The most frequent words in dan.txt were transformed into CSV format...
OK! The most frequent words in deu.txt were transformed into CSV format...
OK! The most frequent words in eng.txt were transformed into CSV format...
OK! The most frequent words in isl.txt were transformed into CSV format...
OK! The most frequent words in nld.txt were transformed into CSV format...
OK! The most frequent words in nob.txt were transformed into CSV format...
OK! The most frequent words in swe.txt were transformed into CSV format...
The germanic language family has been updated.
OK! The most frequent words in fra.txt were transformed into CSV format...
OK! The most frequent words in ita.txt were transformed into CSV format...
OK! The most frequent words in por.txt were transformed into CSV format...
OK! The most frequent words in ron.txt were transformed into CSV format...
OK! The most frequent words in spa.txt were transformed into CSV format...
The romance language family has been updated.
OK! The most frequent words in ces.txt were transformed into CSV format...
OK! The most frequent words in pol.txt were transformed into CSV format...
The slavic language family has been updated.
OK! The most frequent words in bul.txt were transformed into CSV format...
OK! The most frequent words in hrv.txt were transformed into CSV format...
OK! The most frequent words in rus.txt were transformed into CSV format...
The slavic language family has been updated.
OK! The most frequent words in fin.txt were transformed into CSV format...
OK! The most frequent words in hun.txt were transformed into CSV format...
The uralic language family has been updated.
OK! The words in slavic.csv were successfully read...
OK! cyrillic-fingerprint.csv was successfully written...
Operation completed.
OK! The words in austronesian.csv were successfully read...
OK! The words in gaelic.csv were successfully read...
OK! The words in germanic.csv were successfully read...
OK! The words in romance.csv were successfully read...
OK! The words in slavic.csv were successfully read...
OK! The words in uralic.csv were successfully read...
OK! latin-fingerprint.csv was successfully written...
Operation completed.

The cli/prepare.php command calculates some statistics, namely the fingerprint of the language families: a roughly disjoint set containing the most frequent words expected to be found in each family of languages.

Example: babylon/dataset/output/latin-fingerprint.csv

romance,i de la que a el no l un una en d per tom amb els les à et le je il une du des nous e di che non si in è era ma con o da se um os uma do em as com não na cu pe ca sa n ce i din nu y los su las por del
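To picture how a fingerprint like this could be built, here is a minimal sketch. It is not Babylon's actual code: the function names topWords and fingerprints, the toy corpora and the cut-off of 3 are all made up for illustration. It takes the most frequent words of each family and then drops every word that another family also uses, so the result is strictly disjoint, whereas the real fingerprint is only roughly so.

```php
<?php

/**
 * Returns the $limit most frequent words in $text (hypothetical helper).
 */
function topWords(string $text, int $limit): array
{
    // Lowercase the text and split it on anything that is not a letter.
    $words = preg_split('/[^\p{L}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words);
    arsort($counts);

    return array_slice(array_keys($counts), 0, $limit);
}

/**
 * Keeps, for each family, only the frequent words that no other family shares.
 * Hypothetical helper: $corpora maps a family name to a sample text.
 */
function fingerprints(array $corpora, int $limit = 3): array
{
    $top = [];
    foreach ($corpora as $family => $text) {
        $top[$family] = topWords($text, $limit);
    }

    $result = [];
    foreach ($top as $family => $words) {
        $others = $top;
        unset($others[$family]);
        // Flatten the word lists of all the other families into one list.
        $shared = $others ? array_merge(...array_values($others)) : [];
        $result[$family] = array_values(array_diff($words, $shared));
    }

    return $result;
}

// Toy corpora, invented for this example only.
$corpora = [
    'romance'  => 'de de de de la la la no no que',
    'germanic' => 'the the the the no no no of of and',
];

// "no" is frequent in both families, so it is dropped from both:
// romance => [de, la], germanic => [the, of]
print_r(fingerprints($corpora, 3));
```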

Now, going back to the title of this post: have you ever needed to count how many times a particular Unicode block is used in a string or text for frequency analysis purposes? Babylon can detect alphabets because it relies on Unicode Ranges, a PHP library for working with Unicode blocks in a friendly, object-oriented way.

Unicode Ranges might help you achieve your goal as well; it is useful whenever you need to know which Unicode block a given character belongs to.

Char Belongs To
a BasicLatin
Ξ GreekAndCoptic
Ӝ Cyrillic
CJKRadicalsSupplement
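To see what such a lookup involves without any library, here is a minimal sketch. It is not how Unicode Ranges is implemented: the BLOCKS table and the blockOf function are hypothetical, only the four blocks from the table above are listed (with their boundaries as defined by the Unicode standard), and the sample character ⺀ (U+2E80) is picked here just for illustration.

```php
<?php

// Four Unicode blocks with their first and last code points.
const BLOCKS = [
    'BasicLatin'            => [0x0000, 0x007F],
    'GreekAndCoptic'        => [0x0370, 0x03FF],
    'Cyrillic'              => [0x0400, 0x04FF],
    'CJKRadicalsSupplement' => [0x2E80, 0x2EFF],
];

/**
 * Returns the name of the block that contains $char, or null when the
 * character is empty or falls outside the blocks listed above
 * (hypothetical helper).
 */
function blockOf(string $char): ?string
{
    if ($char === '') {
        return null;
    }

    // mb_ord() returns the code point of the first character.
    $codePoint = mb_ord($char, 'UTF-8');
    if ($codePoint === false) {
        return null;
    }

    foreach (BLOCKS as $name => [$first, $last]) {
        if ($codePoint >= $first && $codePoint <= $last) {
            return $name;
        }
    }

    return null;
}

foreach (['a', 'Ξ', 'Ӝ', '⺀'] as $char) {
    echo $char, ' ', blockOf($char) ?? 'unknown', PHP_EOL;
}
```

A real library has to cover every block in the standard, which is exactly the bookkeeping that Unicode Ranges takes off your hands.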

More specifically, this is the babylon/src/Unicode.php file, which counts how many times each Unicode range appears in a text.

namespace Babylon;

use Babylon;
use UnicodeRanges\Converter;

/**
 * Unicode class.
 *
 * @author Jordi Bassagañas <info@programarivm.com>
 * @link https://programarivm.com
 * @license MIT
 */
class Unicode
{
    const N_FREQ_UNICODE_RANGES = 10;

    /**
     * Text to be analyzed.
     *
     * @var string
     */
    protected $text;

    /**
     * Unicode ranges frequency -- number of times that the unicode ranges appear in the text.
     *
     * Example:
     *
     *      Array
     *      (
     *         [Basic Latin] => 25
     *         [Cyrillic] => 14
     *         [CJK Unified Ideographs] => 12
     *         [Arabic] => 9
     *         [Hangul Syllables] => 5
     *         [Hiragana] => 3
     *          ...
     *      )
     *
     * @var array
     */
    protected $freq = [];

    /**
     * Constructor.
     *
     * @param string $text
     */
    public function __construct(string $text)
    {
        $this->text = $text;
    }

    /**
     * The most frequent unicode ranges in the text.
     *
     * @return array
     * @throws \InvalidArgumentException
     */
    public function freq(): array
    {
        // Start from an empty tally so that repeated calls do not double-count.
        $this->freq = [];
        foreach ($this->mbStrSplit($this->text) as $char) {
            $name = Converter::unicode2range($char)->name();
            $this->freq[$name] = ($this->freq[$name] ?? 0) + 1;
        }
        // Sort by count, highest first, keeping the range names as keys.
        arsort($this->freq);

        return array_slice($this->freq, 0, self::N_FREQ_UNICODE_RANGES);
    }

    /**
     * The name of the most frequent unicode range in the text.
     *
     * @return string
     * @throws \InvalidArgumentException
     */
    public function mostFreq(): string
    {
        return key(array_slice($this->freq(), 0, 1));
    }

    /**
     * Converts a multibyte string into an array of chars, ignoring whitespace.
     *
     * @param string $text
     * @return array
     */
    private function mbStrSplit(string $text): array
    {
        // Remove all whitespace so that it is not counted as Basic Latin.
        $text = preg_replace('/\s+/', '', $text);

        // Split at every position that is neither the start nor the end of the string.
        return preg_split('/(?<!^)(?!$)/u', $text);
    }
}
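The mbStrSplit method above splits the text with a lookaround pattern. As a side note, the same idea can be sketched with an empty pattern and the PREG_SPLIT_NO_EMPTY flag, which also returns an empty array for an empty string. The chars function below is a hypothetical stand-alone helper, not part of Babylon, and it uses the /u modifier when stripping whitespace, so Unicode whitespace is removed too.

```php
<?php

/**
 * Splits a UTF-8 string into characters, skipping whitespace
 * (hypothetical helper, not part of Babylon).
 */
function chars(string $text): array
{
    $text = preg_replace('/\s+/u', '', $text);

    // An empty pattern matches between every pair of characters;
    // PREG_SPLIT_NO_EMPTY drops the empty pieces at both ends.
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Three characters, even though the string is longer than three bytes.
print_r(chars("a б\t宇"));
```

On PHP 7.4 or later, the built-in mb_str_split function is another option for the splitting step.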

Converter::unicode2range($char) is the call that converts each character into its object-oriented Unicode block counterpart.
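If pulling in a library is not an option, a rougher tally can be sketched with PCRE's Unicode script properties. Bear in mind that this swaps blocks for scripts, which are related but not the same thing (a comma, for instance, sits in the Basic Latin block but is not part of the Latin script), and that the scriptFreq function is a hypothetical helper rather than anything Babylon or Unicode Ranges provides.

```php
<?php

/**
 * Counts how many characters of $text belong to each of the given scripts
 * and returns the $limit most frequent ones (hypothetical helper).
 */
function scriptFreq(string $text, array $scripts, int $limit = 10): array
{
    $freq = [];
    foreach ($scripts as $script) {
        // With the /u modifier, \p{Name} matches one character of that script.
        $count = preg_match_all('/\p{' . $script . '}/u', $text);
        if ($count > 0) {
            $freq[$script] = $count;
        }
    }
    // Sort by count, highest first, keeping the script names as keys.
    arsort($freq);

    return array_slice($freq, 0, $limit, true);
}

// Latin => 4, Cyrillic => 3, Han => 2 (Arabic is absent, so it is left out).
print_r(scriptFreq('hola мир 宇宙', ['Latin', 'Cyrillic', 'Han', 'Arabic']));
```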

And here is a test to show how to get information about the characters used in a string.

namespace Babylon\Tests\Unit\Unit;

use Babylon\Unicode;
use PHPUnit\Framework\TestCase;

class UnicodeTest extends TestCase
{
    /**
     * @test
     */
    public function freq()
    {
        $text = '律絕諸篇俱宇宙古今مليارات في мале,тъйжалнопе hola que tal como 토마토쥬스 estas tu hoy この平安朝の';
        $expected = [
            'Basic Latin' => 25,
            'Cyrillic' => 14,
            'CJK Unified Ideographs' => 12,
            'Arabic' => 9,
            'Hangul Syllables' => 5,
            'Hiragana' => 3,
        ];

        $this->assertEquals($expected, (new Unicode($text))->freq());
    }

    /**
     * @test
     */
    public function most_freq()
    {
        $text = '律絕諸篇俱宇宙古今مليارات في мале,тъйжалнопе hola que tal como 토마토쥬스 estas tu hoy この平安朝の';

        $this->assertEquals('Basic Latin', (new Unicode($text))->mostFreq());
    }
}

That’s it for now. I hope you’ll find the code self-explanatory. Thank you for reading this post, I hope you liked it.