str_word_count() function doesn't display Arabic language properly
I've written the following function to return a specific number of words from a text:
function brief_text($text, $num_words = 50) {
    $words = str_word_count($text, 1);
    $required_words = array_slice($words, 0, $num_words);
    return implode(" ", $required_words);
}
It works well with English text, but with Arabic text it fails and doesn't return the words as expected. For example:
$text_en = "Cairo is the capital of Egypt and Paris is the capital of France";
echo brief_text($text_en, 10);
will output Cairo is the capital of Egypt and Paris is the
while
$text_ar = "القاهرة هى عاصمة مصر وباريس هى عاصمة فرنسا";
echo brief_text($text_ar, 10);
will output � � � � � � � � � �
I know that the problem is with the str_word_count() function, but I don't know how to fix it.
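Some background on why this happens: the PHP manual describes str_word_count() as locale dependent and notes that multibyte locales are not supported, so it looks at the text one byte at a time. In UTF-8 each Arabic letter takes two bytes, which is why the pieces it returns are not valid characters. A minimal sketch of the byte/character difference, assuming the file is saved as UTF-8 and the mbstring extension is loaded:

```php
<?php
// Assumption: this source file is UTF-8 encoded and mbstring is available.
$word = "مصر"; // three Arabic letters

$byte_count = strlen($word);             // counts bytes
$char_count = mb_strlen($word, 'UTF-8'); // counts characters

echo $byte_count, PHP_EOL; // 6
echo $char_count, PHP_EOL; // 3
```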
UPDATE
I have already written another function that works well with both English and Arabic, but I was looking for a fix for the problem that str_word_count() causes with Arabic text. Anyway, here is my other function:
function brief_text($string, $number_of_required_words = 50) {
    $string = trim(preg_replace('/\s+/', ' ', $string));
    $words = explode(" ", $string);
    $required_words = array_slice($words, 0, $number_of_required_words); // get a specific number of elements from the array
    return implode(" ", $required_words);
}
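A slightly hardened variant of the same idea, as a sketch only: it splits with preg_split() and the u modifier, so the pattern and the text are handled as UTF-8 rather than as raw bytes, and it skips empty pieces. The name brief_text_unicode is hypothetical, chosen here so it doesn't clash with brief_text().

```php
<?php
// Hypothetical helper name; not part of the original question.
function brief_text_unicode(string $text, int $num_words = 50): string
{
    // "u" makes PCRE treat the pattern and the subject as UTF-8;
    // PREG_SPLIT_NO_EMPTY drops empty pieces caused by leading/trailing whitespace.
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        // preg_split() returns false on error, e.g. when $text is not valid UTF-8
        return '';
    }
    return implode(' ', array_slice($words, 0, $num_words));
}

echo brief_text_unicode("القاهرة هى عاصمة مصر وباريس هى عاصمة فرنسا", 3), PHP_EOL;
// القاهرة هى عاصمة
```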
Solution 1:[1]
Try this function for counting words:
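// You can name the function whatever you like
if (!function_exists('mb_str_word_count')) {
    function mb_str_word_count($string, $format = 0, $charlist = '[]') {
        // $charlist is kept for signature compatibility with str_word_count() but is not used
        mb_internal_encoding('UTF-8');
        mb_regex_encoding('UTF-8');
        // Split on runs of characters outside the Arabic block (U+0600 to U+06FF)
        $words = mb_split('[^\x{0600}-\x{06FF}]+', $string) ?: array();
        // Drop empty pieces left by leading or trailing separators
        $words = array_values(array_filter($words, 'strlen'));
        switch ($format) {
            case 0:
                return count($words);
            default:
                // Formats 1 and 2 both return the list of words
                return $words;
        }
    }
}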
echo mb_str_word_count("القاهرة هى عاصمة مصر وباريس هى عاصمة فرنسا") . PHP_EOL;
Resources
- Unicode list for Arabic
- A Rule-Based Arabic Stemming Algorithm
- A Rule and Template Based Stemming Algorithm for Arabic Language (seems more complete)
Recommendations
- Use the tag <meta charset="UTF-8"/> in HTML files
- Always add Content-type: text/html; charset=utf-8 headers when serving pages
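On the PHP side, the header recommendation could look roughly like this. It is a sketch that assumes the page is generated by PHP and that the mbstring extension is loaded; the headers_sent() guard is only there so the snippet doesn't raise a warning if output has already started.

```php
<?php
// Tell the browser the response is UTF-8. This has to run before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Make the mbstring functions default to UTF-8 as well.
mb_internal_encoding('UTF-8');
```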
Solution 2:[2]
To accept ASCII characters as well (letters and digits from any script):
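if (!function_exists('mb_str_word_count')) {
    function mb_str_word_count($string, $format = 0, $charlist = '[]') {
        // $charlist is kept for signature compatibility with str_word_count() but is not used
        // Split on runs of anything that is not a letter, a digit or an apostrophe;
        // PREG_SPLIT_NO_EMPTY drops empty pieces left by leading or trailing punctuation
        $words = preg_split('~[^\p{L}\p{N}\']+~u', trim($string), -1, PREG_SPLIT_NO_EMPTY);
        if ($words === false) {
            $words = array(); // e.g. $string is not valid UTF-8
        }
        switch ($format) {
            case 0:
                return count($words);
            default:
                // Formats 1 and 2 both return the list of words
                return $words;
        }
    }
}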
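One caveat with \p{L}: Arabic vowel marks (fatha, sukun and so on) are combining marks, which Unicode classifies under \p{M} rather than \p{L}, so fully vocalized text would be cut at every mark. A sketch of a variant that also keeps \p{M}, under the assumption of PHP 7.0 or later for the "\u{...}" escapes; split_words_unicode is a hypothetical name used only for this illustration.

```php
<?php
// Hypothetical helper: split on anything that is not a letter, mark, digit or apostrophe.
function split_words_unicode(string $text): array
{
    $words = preg_split('~[^\p{L}\p{M}\p{N}\']+~u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $words === false ? [] : $words; // false means the input was not valid UTF-8
}

// "مصر" written with a fatha (U+064E) and a sukun (U+0652)
$vocalized = "\u{0645}\u{064E}\u{0635}\u{0652}\u{0631}";

print_r(split_words_unicode("Cairo, $vocalized 2024!"));
// Three elements: "Cairo", the vocalized Arabic word, and "2024"
```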
Solution 3:[3]
A while ago I wanted to calculate the reading time of a paragraph and ran into the same issue, so I simply counted the spaces in the paragraph :) (note that it won't be very accurate, but it suits me)
Like this:
substr_count($text, ' ') + 1;
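The space count drifts when the text has double spaces, leading or trailing spaces, or line breaks. If that matters, counting runs of non-whitespace characters is a small step up. A sketch, where count_words_by_whitespace is a hypothetical name:

```php
<?php
// Hypothetical helper: count runs of non-whitespace characters.
function count_words_by_whitespace(string $text): int
{
    // preg_match_all() returns the number of matches, or false on error
    $count = preg_match_all('/\S+/u', $text);
    return $count === false ? 0 : $count;
}

// Two spaces after the first word, one after the second, one trailing space
$text = "القاهرة  هى عاصمة ";

echo substr_count($text, ' ') + 1, PHP_EOL;      // 5
echo count_words_by_whitespace($text), PHP_EOL;  // 3
```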
Solution 4:[4]
If you want to count the words in Farsi or Arabic text, you can use the code below:
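// Written as a class method; drop "public" to use it as a plain function
public function customWordCount($content_text)
{
    $resultArray = explode(' ', trim($content_text));
    foreach ($resultArray as $key => $item)
    {
        // Blank out tokens that are only a punctuation mark
        if (in_array($item, ["|", ";", ".", "-", "=", ":", "{", "}", "[", "]", "(", ")"]))
        {
            $resultArray[$key] = '';
        }
    }
    // Remove the empty entries before counting
    $resultArray = array_filter($resultArray);
    return count($resultArray);
}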
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | ahoo |
Solution 3 | Erfan Paslar |
Solution 4 | Alireza Salehi |