How to split a Thai sentence, which does not use spaces, into words?
How can I split a Thai sentence into words? In English we can split words by spaces.
Example: I go to school
split = ['I', 'go', 'to', 'school']
We split just by looking at the spaces.
But Thai has no spaces between words, so I don't know how to do this. For example, I want to read ฉันจะไปโรงเรียน from a txt file, split it into ['ฉัน', 'จะ', 'ไป', 'โรง', 'เรียน'], and write the result to another txt file.
Are there any programs or libraries that identify Thai word boundaries and split?
Solution 1:[1]
In 2006, someone contributed code to the Apache Lucene project to make this work.
Their approach (written in Java) was to use the BreakIterator class, calling getWordInstance()
to get a dictionary-based word iterator for the Thai language. Note also that there is a stated dependency on the ICU4J project. I have pasted the relevant section of their code below:
private BreakIterator breaker = null;
private Token thaiToken = null;

public ThaiWordFilter(TokenStream input) {
    super(input);
    // Dictionary-based word iterator for Thai (backed by ICU4J).
    breaker = BreakIterator.getWordInstance(new Locale("th"));
}

public Token next() throws IOException {
    // If we are in the middle of a Thai token, emit its next word.
    if (thaiToken != null) {
        String text = thaiToken.termText();
        int start = breaker.current();
        int end = breaker.next();
        if (end != BreakIterator.DONE) {
            return new Token(text.substring(start, end),
                thaiToken.startOffset() + start,
                thaiToken.startOffset() + end, thaiToken.type());
        }
        thaiToken = null;
    }
    // Otherwise pull the next token from the upstream stream.
    Token tk = input.next();
    if (tk == null) {
        return null;
    }
    String text = tk.termText();
    // Non-Thai tokens are passed through (lowercased) without re-segmentation.
    if (UnicodeBlock.of(text.charAt(0)) != UnicodeBlock.THAI) {
        return new Token(text.toLowerCase(),
            tk.startOffset(),
            tk.endOffset(),
            tk.type());
    }
    // Thai token: hand it to the break iterator and emit its first word.
    thaiToken = tk;
    breaker.setText(text);
    int end = breaker.next();
    if (end != BreakIterator.DONE) {
        return new Token(text.substring(0, end),
            thaiToken.startOffset(),
            thaiToken.startOffset() + end,
            thaiToken.type());
    }
    return null;
}
Solution 2:[2]
There are multiple ways to do Thai word tokenization. One way is dictionary-based or pattern-based: the algorithm scans through the characters and, whenever a sequence of them appears in the dictionary, counts it as a word.
There are also more recent libraries that tokenize Thai text with deep-learning models trained on the BEST corpus, including rkcosmos/deepcut, pucktada/cutkum, and more.
Example usage of deepcut:
import deepcut
deepcut.tokenize('ฉันจะไปโรงเรียน')
# output as ['ฉัน', 'จะ', 'ไป', 'โรง', 'เรียน']
Solution 3:[3]
The simplest segmenter for Chinese and Japanese is a greedy dictionary-based scheme. This should work just as well for Thai: get a dictionary of Thai words, and at the current character, match the longest string starting from that character that exists in the dictionary. This gets you a pretty decent segmenter, at least for Chinese and Japanese.
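As an illustration only, here is a minimal sketch of that greedy longest-match idea in Python. The names greedy_segment and THAI_WORDS are made up for this example, and the toy dictionary happens to contain only the words from the question; a real segmenter would need a full Thai lexicon and a better policy for unknown sequences.
# Toy dictionary for illustration; a real Thai lexicon has tens of thousands of entries.
THAI_WORDS = {"ฉัน", "จะ", "ไป", "โรง", "เรียน"}

def greedy_segment(text, dictionary=THAI_WORDS, max_word_len=20):
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate starting at `pos` first, then shrink it.
        for length in range(min(max_word_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in dictionary:
                words.append(candidate)
                pos += length
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            words.append(text[pos])
            pos += 1
    return words

print(greedy_segment("ฉันจะไปโรงเรียน"))
# ['ฉัน', 'จะ', 'ไป', 'โรง', 'เรียน']
With this toy dictionary the greedy match reproduces the output from the question; with a dictionary that also contained the compound โรงเรียน, the longest match would keep it as a single word.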
Solution 4:[4]
Here's how to split Thai text into words using Kotlin and ICU4J. ICU4J is a better choice than the Lucene version above (last updated June 2011), because ICU4J is constantly updated and has additional related tools. Search for icu4j at mvnrepository.com to see them all.
import com.ibm.icu.text.BreakIterator
import java.util.Locale

fun splitIntoWords(s: String): List<String> {
    // ICU4J's word BreakIterator uses a dictionary for Thai.
    val wordBreaker = BreakIterator.getWordInstance(Locale("th"))
    wordBreaker.setText(s)
    var startPos = wordBreaker.first()
    var endPos = wordBreaker.next()
    val words = mutableListOf<String>()
    // Walk the break positions and collect each span between them.
    while (endPos != BreakIterator.DONE) {
        words.add(s.substring(startPos, endPos))
        startPos = endPos
        endPos = wordBreaker.next()
    }
    return words
}
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | mpontillo |
| Solution 2 | |
| Solution 3 | Ben Allison |
| Solution 4 | devdanke |