'Saving Greek characters in varchar column using SQL Server 2019 _UTF8 collations

Background: I'm doing some proofing of SQL Server 2019's _UTF8 collations (ref. https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver15).

Use case: We sometimes have a need to store scientific names featuring Greek letters, such as DL-α-Tocopherol Acetate or Δ8-THC-Naphthoylester, etc.. Although this isn't common (most of our customers do not need this feature) but when the feature is needed, it almost non-negotiable. Since we normally don't need this feature we usually default the columns to varchar. I don't really want to get into a debate on nvarchar vs. varchar here, but suffice to say we only use nvarchar when needed, so when saving characters like α and Δ we'll change the data type and move on.

But the _UTF8 collations appear to allow these values to be saved in a varchar column without needing to change the data type to nvarchar:

UTF-8 support SQL Server 2019 (15.x) introduces full support for the widely used UTF-8 character encoding as an import or export encoding, and as database-level or column-level collation for string data. UTF-8 is allowed in the char and varchar data types, and it's enabled when you create or change an object's collation to a collation that has a UTF8 suffix. One example is changing LATIN1_GENERAL_100_CI_AS_SC to LATIN1_GENERAL_100_CI_AS_SC_UTF8.

UTF-8 is available only to Windows collations that support supplementary characters, as introduced in SQL Server 2012 (11.x). The nchar and nvarchar data types allow UCS-2 or UTF-16 encoding only, and they remain unchanged.

Azure SQL Database and Azure SQL Managed Instance also support UTF-8 on database and column level, while Managed Instance supports this on a server level as well.

My problem:

I tried to save α and Δ with a simple example using SQL Server 2019 as follows:

CREATE TABLE dbo.SCI_DATA 
(
    id int NOT NULL,
    name varchar(75) COLLATE Latin1_General_100_CI_AS_SC_UTF8
)

INSERT INTO dbo.SCI_DATA (id, name) 
VALUES (1, 'DL-α-Tocopherol Acetate');

INSERT INTO dbo.SCI_DATA (id, name) 
VALUES (2, 'Δ8-THC-Naphthoylester');

INSERT INTO dbo.SCI_DATA (id, name) 
VALUES (3, 'Slainté');

SELECT * 
FROM dbo.SCI_DATA;

DROP TABLE IF EXISTS dbo.SCI_DATA;

this returns

id  name
---------------------------
1   DL-a-Tocopherol Acetate
2   ?8-THC-Naphthoylester
3   Slainté

The accented é is fine but Alpha α and Delta Δ are returned as ?. I'm wondering what I'm missing.



Solution 1:[1]

Even though the column is varchar with a UTF8 collation, it appears you still need to indicate that the data is Unicode with N'.

This modified example works:

CREATE TABLE dbo.SCI_DATA (
 id int NOT NULL
  , name varchar(75) COLLATE Latin1_General_100_CI_AS_SC_UTF8
)
INSERT INTO dbo.SCI_DATA (id, name) VALUES (1, N'DL-?-Tocopherol Acetate');
INSERT INTO dbo.SCI_DATA (id, name) VALUES (2, N'?8-THC-Naphthoylester');
INSERT INTO dbo.SCI_DATA (id, name) VALUES (3, 'Slainté');

SELECT * FROM dbo.SCI_DATA;
DROP TABLE IF EXISTS dbo.SCI_DATA;

Returns:

id  name
1   DL-?-Tocopherol Acetate
2   ?8-THC-Naphthoylester
3   Slainté

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Dale K