How to index a text field or make it searchable in MySQL?
I have stored about 7 million biological protein sequences in a TEXT field of a MySQL table (using the InnoDB storage engine and the latin1_swedish_ci collation).
The sequences are simple strings of uppercase English letters, like this:
MSTWQVYRLLMEYCSCLDNKTPNAFAKWCSSRKIKFLQADYFRKRPKHCDEGTGRYRSIYVMKKEYLGDIVRKITN
A TEXT field looks essential in MySQL because sequence lengths range from 1 byte to an unknown maximum (the longest of the 7 million stored records is 23089 bytes, but future records will likely exceed that).
The maximum key size for a VARCHAR or TEXT column in MySQL is 767 bytes, which means that only the leftmost 767 bytes can be indexed. The LIKE operator cannot use this index to efficiently retrieve a substring from the entire TEXT field.
So, is there any way to index the entire TEXT field to efficiently search for substrings inside it?
Solution 1:[1]
You're hoping to store alphabetic protein sequences and then search them for substrings.
MySQL / MariaDB's search capabilities, both LIKE '%CSCLDNKTPNAFAKW%'
and FULLTEXT, are not suitable for this application, sorry to say.
Why not?
Searches with LIKE '%CSCLDN%' will be absurdly slow, because the leading wildcard forces MySQL to scan every row. (% in LIKE strings is the wildcard character.) And a prefix index on your column won't help make things faster.
FULLTEXT works on natural-language sequences of words, not the long strings of characters used to represent protein sequences.
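As a rough illustration of why an ordered index helps with a known prefix but not with a leading wildcard, here is a minimal Python sketch. It is not how MySQL is implemented; the sorted list merely stands in for an ordered index, and the data and the function names `prefix_search` and `substring_search` are made up for this example.

```python
import bisect

# Toy sequences standing in for the protein column (hypothetical data).
# Sorting them mimics the ordering an index maintains.
sequences = sorted([
    "MSTWQVYRLLMEYCSCLDNKTPNAFAKW",
    "AKWCSSRKIKFLQADYFRKRPKHC",
    "GTGRYRSIYVMKKEYLGDIVRKITN",
])

def prefix_search(sorted_rows, prefix):
    """Like LIKE 'prefix%': binary search jumps straight to the matching range."""
    lo = bisect.bisect_left(sorted_rows, prefix)
    hi = bisect.bisect_left(sorted_rows, prefix + "\uffff")
    return sorted_rows[lo:hi]

def substring_search(rows, needle):
    """Like LIKE '%needle%': the sort order is no help, so every row is scanned."""
    return [row for row in rows if needle in row]

print(prefix_search(sequences, "AKW"))
print(substring_search(sequences, "AKW"))
```

The prefix lookup touches only a logarithmic number of entries, while the substring lookup has to read every sequence in full, which is what makes it so slow on 7 million rows.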
The PostgreSQL RDBMS has a feature called trigram indexes, provided by its pg_trgm extension. When you use it you can search long TEXT objects with LIKE '%ACTG%'-style filters with decent performance. You can declare your trigram index using something like this.
-- The gin_trgm_ops operator class comes from the pg_trgm extension.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX CONCURRENTLY table1_text ON table1
USING GIN (textcolumn gin_trgm_ops);
But before you do that you'll have to switch over to using PostgreSQL.
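To give a feel for the idea behind a trigram index, here is a simplified Python sketch. It is not pg_trgm's actual implementation (which, among other things, normalizes and pads the text before extracting trigrams); the class `TrigramIndex` and its methods are hypothetical names invented for this illustration.

```python
from collections import defaultdict

def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    """Toy inverted index: maps each trigram to the ids of rows containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> set of row ids
        self.rows = {}                    # row id -> full sequence

    def add(self, row_id, sequence):
        self.rows[row_id] = sequence
        for gram in trigrams(sequence):
            self.postings[gram].add(row_id)

    def search(self, needle):
        grams = trigrams(needle)
        if not grams:
            # Needle shorter than 3 characters: the index cannot narrow
            # anything down, so fall back to checking every row.
            candidates = set(self.rows)
        else:
            # A matching row must contain every trigram of the needle.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        # Re-check the candidates, since sharing all trigrams does not
        # guarantee that the needle appears as one contiguous substring.
        return sorted(rid for rid in candidates if needle in self.rows[rid])

index = TrigramIndex()
index.add(1, "MSTWQVYRLLMEYCSCLDNKTPNAFAKW")
index.add(2, "CSSRKIKFLQADYFRKRPKHCDEGTGRY")
print(index.search("CSCLDN"))
```

Only the rows that share every trigram with the search string are read in full, so most of the table is never touched. The final re-check step is why such an index can return candidates cheaply and still give exact results.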
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |