Android custom keyboard suggestions

I am building a custom keyboard for Android, one that at least supports autocomplete suggestions. To achieve this, I am storing every word that the user types (excluding password fields) in a Room database table with a simple model: the word and its frequency. For showing the suggestions, I am using a Trie that is populated with words from this database table. My query basically orders the table by word frequency and limits the results to 5K (I do not want to overpopulate the Trie; these 5K words can be considered the user's favourite words, the ones he uses often and needs suggestions for). My actual problem is the ORDER BY clause: this is a rapidly growing data set, and sorting, say, 0.1M words just to get 5K seems like overkill. How can I rework this approach to improve the efficiency of the entire suggestions logic?



Solution 1:[1]

If not already implemented, add an index on the frequency column: @ColumnInfo(index = true).

Another option could be to add a table that maintains the highest 5K. Support it with yet another table (the support table) that has a single row, with columns for: the highest frequency (not strictly required), the lowest frequency currently in the 5K, and a third column for the number of rows currently held. After adding or updating a word, you could then determine whether the new/updated word should be added to the 5K table (perhaps a fourth column for the primary key of the lowest-frequency row, to facilitate efficient deletion).

So

  1. If the number currently held is less than 5K, insert into (or update) the 5K table and increment the number currently held in the support table.
  2. Otherwise, if the word's frequency is lower than the lowest, skip it.
  3. Otherwise, update the 5K table entry if the word already exists in it.
  4. Otherwise, delete the lowest-frequency row, insert the replacement, and then update the lowest accordingly in the support table.
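Outside Room, the four rules can be sketched in plain Java (the class and member names here are illustrative, not the entities defined below; the real answer keeps this state in the 5K table plus the support table):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory sketch of the subset-maintenance rules above.
class TopWords {
    static final int MAX_SIZE = 5; // 5000 in the real table; small for the sketch
    private final Map<String, Long> subset = new HashMap<>();

    void addWord(String word, long frequency) {
        // 1. fewer than MAX_SIZE rows held: insert (or update) directly
        if (subset.size() < MAX_SIZE) { subset.put(word, frequency); return; }
        // the support table caches this minimum so SQL never has to scan for it
        long lowest = Collections.min(subset.values());
        // 2. lower than the current lowest: skip
        if (frequency < lowest) return;
        // 3. already present: just update the frequency
        if (subset.containsKey(word)) { subset.put(word, frequency); return; }
        // 4. evict the lowest-frequency word, insert the replacement
        String evict = null;
        for (Map.Entry<String, Long> e : subset.entrySet()) {
            if (e.getValue() == lowest) { evict = e.getKey(); break; }
        }
        subset.remove(evict);
        subset.put(word, frequency);
    }

    Map<String, Long> snapshot() { return new HashMap<>(subset); }
}
```

The point of the cached "lowest" is that most insertions hit rule 2 and cost a single comparison, rather than a sort of the whole word table.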

Note that the 5K table would probably only need to store the rowid as a pointer/reference/map to the core table.

  • rowid is a column that virtually all tables will have in Room (virtual tables are an exception, as are tables with the WITHOUT ROWID attribute, but Room does not, as far as I am aware, facilitate WITHOUT ROWID tables).

  • Access via the rowid can be up to twice as fast as via other indexes. I would suggest using @PrimaryKey Long id=null; (Java) or @PrimaryKey var id: Long?=null (Kotlin) and NOT using @PrimaryKey(autoGenerate = true).

    • autoGenerate = true equates to SQLite's AUTOINCREMENT, about which the SQLite documentation says "The AUTOINCREMENT keyword imposes extra CPU, memory, disk space, and disk I/O overhead and should be avoided if not strictly needed. It is usually not needed."
    • see https://www.sqlite.org/rowidtable.html, and also https://sqlite.org/autoinc.html
    • curiously/funnily, the support table mentioned isn't that far away from what AUTOINCREMENT does under the hood.
    • SQLite maintains a table (sqlite_sequence), with a row for each table that has AUTOINCREMENT, storing the table name and the highest rowid ever allocated.
      • Without AUTOINCREMENT, but with <column_name> INTEGER PRIMARY KEY and no value (or null) supplied for the primary key column, SQLite generates a value that is 1 greater than max(rowid).
      • With AUTOINCREMENT/autoGenerate = true, the generated value is the greater of max(rowid) and the value stored for that table in the sqlite_sequence table (hence the overheads).
      • Of course, those overheads will very likely be insignificant in comparison to sorting 0.1M rows.
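As a plain-SQLite illustration (hand-written DDL with hypothetical table names, not the schema Room generates), the two forms differ only by the keyword:

```sql
-- rowid alias: the next generated id is simply max(rowid) + 1
CREATE TABLE word_plain (wordId INTEGER PRIMARY KEY, word TEXT, frequency INTEGER);

-- AUTOINCREMENT: SQLite additionally reads and updates sqlite_sequence
CREATE TABLE word_auto (wordId INTEGER PRIMARY KEY AUTOINCREMENT, word TEXT, frequency INTEGER);
```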

Demonstration

The following is a demonstration albeit just using a basic Word table as the source.

First the 2 tables (@Entity annotated classes)

Word

@Entity (
        indices = {@Index(value = {"word"},unique = true)}
)
class Word {
    @PrimaryKey Long wordId=null;
    @NonNull
    String word;
    @ColumnInfo(index = true)
    long frequency;

    Word(){}

    @Ignore
    Word(String word, long frequency) {
        this.word = word;
        this.frequency = frequency;
    }
}

WordSubset, aka the table with the 5000 highest-occurring frequencies; it simply holds a reference/map/link to the underlying/actual word :-

@Entity(
        foreignKeys = {
                @ForeignKey(
                        entity = Word.class,
                        parentColumns = {"wordId"},
                        childColumns = {"wordIdMap"},
                        onDelete = ForeignKey.CASCADE,
                        onUpdate = ForeignKey.CASCADE
                )
        }
)
class WordSubset {
    public static final long SUBSET_MAX_SIZE = 5000;
    @PrimaryKey
    long wordIdMap;

    WordSubset(){}

    @Ignore
    WordSubset(long wordIdMap) {
        this.wordIdMap = wordIdMap;
    }
}
  • note the constant SUBSET_MAX_SIZE, hard-coded just once so a single change adjusts it (lowering it after rows have been added may cause issues)

WordSubsetSupport, a single-row table that contains the highest and lowest frequencies (the highest is not really needed), the number of rows in the WordSubset table, and a reference/map to the word with the lowest frequency.

@Entity(
        foreignKeys = {
                @ForeignKey(
                        entity = Word.class,
                        parentColumns = {"wordId"},
                        childColumns = {"lowestWordIdMap"}
                )
        }
)
class WordSubsetSupport {
    @PrimaryKey
    Long wordSubsetSupportId=null;
    long highestFrequency;
    long lowestFrequency;
    long countOfRowsInSubsetTable;
    @ColumnInfo(index = true)
    long lowestWordIdMap;

    WordSubsetSupport(){}
    @Ignore
    WordSubsetSupport(long highestFrequency, long lowestFrequency, long countOfRowsInSubsetTable, long lowestWordIdMap) {
        this.highestFrequency = highestFrequency;
        this.lowestFrequency = lowestFrequency;
        this.countOfRowsInSubsetTable = countOfRowsInSubsetTable;
        this.lowestWordIdMap = lowestWordIdMap;
        this.wordSubsetSupportId = 1L;
    }
}

For access, an abstract class is used rather than an interface, as in Java this allows methods/functions with a body (a Kotlin interface also allows these). CombinedDao :-

@Dao
abstract class CombinedDao {
    @Insert(onConflict = OnConflictStrategy.IGNORE)
    abstract long insert(Word word);
    @Insert(onConflict = OnConflictStrategy.IGNORE)
    abstract long insert(WordSubset wordSubset);
    @Insert(onConflict = OnConflictStrategy.IGNORE)
    abstract long insert(WordSubsetSupport wordSubsetSupport);

    @Query("SELECT * FROM wordsubsetsupport LIMIT 1")
    abstract WordSubsetSupport getWordSubsetSupport();
    @Query("SELECT count(*) FROM wordsubsetsupport")
    abstract long getWordSubsetSupportCount();
    @Query("SELECT countOfRowsInSubsetTable FROM wordsubsetsupport")
    abstract long getCountOfRowsInSubsetTable();
    @Query("UPDATE wordsubsetsupport SET countOfRowsInSubsetTable=:updatedCount")
    abstract void updateCountOfRowsInSubsetTable(long updatedCount);
    @Query("UPDATE wordsubsetsupport " +
            "SET countOfRowsInSubsetTable = (SELECT count(*) FROM wordsubset), " +
            "lowestWordIdMap = (SELECT word.wordId FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId ORDER BY frequency ASC LIMIT 1)," +
            "lowestFrequency = (SELECT coalesce(min(frequency),0) FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId)," +
            "highestFrequency = (SELECT coalesce(max(frequency),0) FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId)")
    abstract void autoUpdateWordSupportTable();
    @Query("DELETE FROM wordsubset WHERE wordIdMap= (SELECT wordsubset.wordIdMap FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId ORDER BY frequency ASC LIMIT 1)")
    abstract void deleteLowestFrequency();

    @Transaction
    @Query("")
    int addWord(Word word) {
        /* try to add the word, setting the wordId value according to the result.
            The result will be the wordId generated (1 or greater), or -1 if the word already exists.
        */
        word.wordId = insert(word);
        /* If the word was added and not rejected as a duplicate, then it may need to be added to the WordSubset table */
        if (word.wordId > 0) {
            /* Are there any rows in the support table? if not then add the very first entry/row */
            if (getWordSubsetSupportCount() < 1) {
                /* Need to add the word to the subset */
                insert(new WordSubset(word.wordId));
                /* Can now add the first (and only) row to the support table */
                insert(new WordSubsetSupport(word.frequency,word.frequency,1,word.wordId));
                autoUpdateWordSupportTable();
                return 1;
            }
            /* If there are less than the maximum number of rows in the subset table then
                1) insert the new subset row, and
                2) update the support table accordingly
            */
            if (getCountOfRowsInSubsetTable() < WordSubset.SUBSET_MAX_SIZE) {
                insert(new WordSubset(word.wordId));
                autoUpdateWordSupportTable();
                return 2;
            }
            /*
                Last case is that the subset table is at the maximum number of rows and
                the frequency of the added word is greater than the lowest frequency in the
                subset, so
                1) the row with the lowest frequency is removed from the subset table and
                2) the added word is added to the subset
                3) the support table is updated accordingly
            */
            if (getCountOfRowsInSubsetTable() >= WordSubset.SUBSET_MAX_SIZE) {
                WordSubsetSupport currentWordSubsetSupport = getWordSubsetSupport();
                if (word.frequency > currentWordSubsetSupport.lowestFrequency) {
                    deleteLowestFrequency();
                    insert(new WordSubset(word.wordId));
                    autoUpdateWordSupportTable();
                    return 3;
                }
            }
            return 4; /* indicates word added but does not qualify for addition to the subset */
        }
        return -1;
    }
}
  • The addWord method/function is the only method that needs to be used, as it automatically maintains the WordSubset and WordSubsetSupport tables.

TheDatabase is a pretty standard @Database annotated class, other than that it allows use of the main thread for the sake of convenience and brevity of the demo :-

@Database( entities = {Word.class,WordSubset.class,WordSubsetSupport.class}, version = TheDatabase.DATABASE_VERSION, exportSchema = false)
abstract class TheDatabase extends RoomDatabase {
    abstract CombinedDao getCombinedDao();

    private static volatile TheDatabase instance = null;
    public static TheDatabase getInstance(Context context) {
        if (instance == null) {
            instance = Room.databaseBuilder(context,TheDatabase.class,DATABASE_NAME)
                    .addCallback(cb)
                    .allowMainThreadQueries()
                    .build();
        }
        return instance;
    }

    private static Callback cb = new Callback() {
        @Override
        public void onCreate(@NonNull SupportSQLiteDatabase db) {
            super.onCreate(db);
        }

        @Override
        public void onOpen(@NonNull SupportSQLiteDatabase db) {
            super.onOpen(db);
        }
    };



    public static final String DATABASE_NAME = "the_database.db";
    public static final int DATABASE_VERSION = 1;
}

Finally, activity code that randomly generates and adds 10,000 words (or thereabouts, as some could be duplicate words), each word having a frequency that is also randomly generated (between 0 and 9999) :-

public class MainActivity extends AppCompatActivity {
    
    TheDatabase db;
    CombinedDao dao;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        
        db = TheDatabase.getInstance(this);
        dao = db.getCombinedDao();
        
        for (int i=0; i < 10000; i++) {
            Word currentWord = generateRandomWord();
            Log.d("ADDINGWORD","Adding word " + currentWord.word + " frequency is " + currentWord.frequency);
            dao.addWord(currentWord);
        }
    }
    public static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";
    private Word generateRandomWord() {
        Random r = new Random();
        int wordLength = r.nextInt(24) + 1; // 1 to 24 letters
        int frequency = r.nextInt(10000);   // 0 to 9999
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < wordLength; i++) {
            sb.append(ALPHABET.charAt(r.nextInt(ALPHABET.length())));
        }
        return new Word(sb.toString(), frequency);
    }
}

Obviously the results will differ per run; also, the demo is only really designed to be run once (although it could be run more than once).

After running, viewed via App Inspection, the support table (in this instance) is :-

[screenshot: the WordSubsetSupport table in App Inspection]

  • So, as countOfRowsInSubsetTable is 5000, the subset table has been filled to its capacity/limit.
  • The highest frequency encountered was 9999 (as could well be expected).
  • The lowest frequency in the subset is 4690, and that is for the word whose wordId is 7412.

The subset table on its own means little, as it just contains a map to the actual word. So it is more informative to use a query to look at what it contains.

e.g.

[screenshot: query results listing the subset's words with their frequencies]
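The query behind that capture isn't included in the answer; a query along these lines (using the entity/column names above) would produce such a listing:

```sql
SELECT word.wordId, word.word, word.frequency
FROM wordsubset JOIN word ON wordsubset.wordIdMap = word.wordId
ORDER BY word.frequency ASC;
```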

As can be seen, the query shows that the word whose wordId is 7412 is the one with the lowest frequency, 4690 (as expected according to the support table).

Going to the last page shows:-

[screenshot: the last page of the query results]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
