How to index a text field or make it searchable in MySQL?

I have stored about 7 million biological protein sequences in a TEXT field of a MySQL table (using the InnoDB storage engine and the latin1_swedish_ci collation).
The sequences stored in MySQL are simple combinations of uppercase English letters, like this:

MSTWQVYRLLMEYCSCLDNKTPNAFAKWCSSRKIKFLQADYFRKRPKHCDEGTGRYRSIYVMKKEYLGDIVRKITN

A TEXT field looks necessary because sequence lengths range from 1 byte up to an unknown maximum (the longest of the 7 million stored records is 23089 bytes, and future records will likely exceed that).
The maximum key size for a VARCHAR or TEXT column in MySQL is 767 bytes, which means that only the leftmost 767 bytes can be indexed.
The LIKE operator cannot use such an index to efficiently find a substring anywhere in the text field.

So, is there any way to index entire text field to efficiently search substrings inside it?



Solution 1:[1]

You're hoping to store alphabetic protein sequences and then search them for substrings.

MySQL / MariaDB's search capabilities, both LIKE '%CSCLDNKTPNAFAKW%' and FULLTEXT, are not suitable for this application, sorry to say.

Why not?

  • Searches with LIKE '%CSCLDN%' will be absurdly slow. (% in LIKE strings is the wildcard operator.) A pattern that starts with a wildcard forces MySQL to examine every row, and a prefix index on your column won't help make things faster.

  • FULLTEXT works on natural-language sequences of words, not the long unbroken strings of characters used to represent protein sequences.
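
To make the first point concrete, here is a minimal MySQL sketch. The table name proteins and the columns id and sequence are hypothetical, chosen only for illustration; they are not from the question. The idea is that a prefix index can serve a left-anchored pattern, but a pattern with a leading % gives the optimizer nothing to seek on.

```sql
-- Hypothetical table and column names, for illustration only.
CREATE TABLE proteins (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    sequence TEXT NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- Prefix index on the first 767 bytes (the limit mentioned in the question).
CREATE INDEX proteins_seq_prefix ON proteins (sequence(767));

-- Left-anchored pattern: the optimizer can use the prefix index here.
EXPLAIN SELECT id FROM proteins WHERE sequence LIKE 'MSTWQVYRLL%';

-- Leading wildcard: the prefix index cannot narrow the search,
-- so on a populated table this typically shows a full scan (type = ALL).
EXPLAIN SELECT id FROM proteins WHERE sequence LIKE '%CSCLDN%';
```

Comparing the two EXPLAIN outputs on your own data should show the difference; the exact plan depends on table size and MySQL version.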

The PostgreSQL RDBMS has a feature called trigram indexes, provided by the pg_trgm extension. When you use it you can search long TEXT objects with LIKE '%ACTG%' - style filters with decent performance. You can declare your trigram index using something like this.

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX CONCURRENTLY table1_text ON table1
       USING GIN (textcolumn gin_trgm_ops);

But before you do that you'll have to switch over to using PostgreSQL.
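
Once you are on PostgreSQL, a substring search is an ordinary LIKE query. The sketch below assumes the pg_trgm extension is installed and that a GIN trigram index exists on textcolumn; table1 and textcolumn are the placeholder names from the index declaration above, not real names from the question.

```sql
-- Assumes pg_trgm is installed and a GIN trigram index exists on textcolumn.
-- table1 and textcolumn are placeholder names.
EXPLAIN
SELECT *
  FROM table1
 WHERE textcolumn LIKE '%CSCLDNKTPNAFAKW%';
```

On a populated table the plan should show a bitmap scan on the trigram index rather than a sequential scan. Search patterns shorter than three characters contain no complete trigram, so they get little or no benefit from the index.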

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1