'Similar UTF-8 strings for autocomplete field
Background
Users can type in a name and the system should match the text, even if the either the user input or the database field contains accented (UTF-8) characters. This is using the pg_trgm
module.
Problem
The code resembles the following:
SELECT
t.label
FROM
the_table t
WHERE
label % 'fil'
ORDER BY
similarity( t.label, 'fil' ) DESC
When the user types fil
, the query matches filbert
but not filé powder
. (Because of the accented character?)
Failed Solution #1
I tried to implement an unaccent function and rewrite the query as:
SELECT
t.label
FROM
the_table t
WHERE
unaccent( label ) % unaccent( 'fil' )
ORDER BY
similarity( unaccent( t.label ), unaccent( 'fil' ) ) DESC
This returns only filbert
.
Failed Solution #2
As suggested:
CREATE EXTENSION pg_trgm;
CREATE EXTENSION unaccent;
CREATE OR REPLACE FUNCTION unaccent_text(text)
RETURNS text AS
$BODY$
SELECT unaccent($1);
$BODY$
LANGUAGE sql IMMUTABLE
COST 1;
All other indexes on the table have been dropped. Then:
CREATE INDEX label_unaccent_idx
ON the_table( lower( unaccent_text( label ) ) );
This returns only one result:
SELECT
t.label
FROM
the_table t
WHERE
label % 'fil'
ORDER BY
similarity( t.label, 'fil' ) DESC
Question
What is the best way to rewrite the query to ensure that both results are returned?
Thank you!
Related
http://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.0#Unaccent_filtering_dictionary
http://postgresql.1045698.n5.nabble.com/index-refuses-to-build-td5108810.html
Solution 1:[1]
You are not using the operator class provided by the pg_trgm
module. Create an index like this:
CREATE INDEX label_Lower_unaccent_trgm_idx
ON test_trgm USING gist (lower(unaccent_text(label)) gist_trgm_ops);
Originally, I had a GIN index here, but a GiST is typically better suited for this kind of query because it can return values sorted by similarity. See:
Your query has to match the index expression to be able to use it.
SELECT label
FROM the_table
WHERE lower(unaccent_text(label)) % 'fil'
ORDER BY similarity(label, 'fil') DESC; -- ok to use original string here
However, "filbert" and "filé powder" are not actually very similar to "fil" according to the %
operator. I suspect you really want:
SELECT label
FROM the_table
WHERE lower(unaccent_text(label)) LIKE 'fil%' -- !
ORDER BY similarity(label, 'fil') DESC; -- ok to use original string here
This finds all strings starting with the search string, and sorts the best matches according to the %
operator first.
The expression can use a GIN or GiST index since PostgreSQL 9.1! The manual:
Beginning in PostgreSQL 9.1, these index types also support index searches for
LIKE
andILIKE
, for example
If you actually meant to use the %
operator:
Try adapting the threshold for the similarity operator %
:
SET pg_trgm.similarity_threshold = 0.1; -- Postgres 9.6 or later
SELECT set_limit(0.1); -- Postgres 9.5 or older
Or even lower? The default is 0.3. Just to see whether the threshold filters additional matches.
Solution 2:[2]
A solution for PostgreSQL 9.1:
-- Install the requisite extensions.
CREATE EXTENSION pg_trgm;
CREATE EXTENSION unaccent;
-- Function fixes STABLE vs. IMMUTABLE problem of the unaccent function.
CREATE OR REPLACE FUNCTION unaccent_text(text)
RETURNS text AS
$BODY$
-- unaccent is STABLE, but indexes must use IMMUTABLE functions.
SELECT unaccent($1);
$BODY$
LANGUAGE sql IMMUTABLE
COST 1;
-- Create an unaccented index.
CREATE INDEX the_table_label_unaccent_idx
ON the_table USING gin (lower(unaccent_text(label)) gin_trgm_ops);
-- Define the matching threshold.
SELECT set_limit(0.175);
-- Test the query (matching against the index expression).
SELECT
label
FROM
the_table
WHERE
lower(unaccent_text(label)) % 'fil'
ORDER BY
similarity(label, 'fil') DESC
Returns "filbert", "fish fillet", and "filé powder".
Without calling SELECT set_limit(0.175);
, you can use the double tilde (~~
) operator:
-- Test the query (matching against the index expression).
SELECT
label
FROM
the_table
WHERE
lower(unaccent_text(label)) ~~ 'fil'
ORDER BY
similarity(label, 'fil') DESC
Also returns "filbert", "fish fillet", and "filé powder".
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 |