'Saving Greek characters in varchar column using SQL Server 2019 _UTF8 collations
Background: I'm doing some proofing of SQL Server 2019's _UTF8 collations (ref. https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver15).
Use case: We sometimes have a need to store scientific names featuring Greek letters, such as DL-α-Tocopherol Acetate
or Δ8-THC-Naphthoylester
, etc.. Although this isn't common (most of our customers do not need this feature) but when the feature is needed, it almost non-negotiable. Since we normally don't need this feature we usually default the columns to varchar. I don't really want to get into a debate on nvarchar vs. varchar here, but suffice to say we only use nvarchar when needed, so when saving characters like α and Δ we'll change the data type and move on.
But the _UTF8 collations appear to allow these values to be saved in a varchar
column without needing to change the data type to nvarchar
:
UTF-8 support SQL Server 2019 (15.x) introduces full support for the widely used UTF-8 character encoding as an import or export encoding, and as database-level or column-level collation for string data. UTF-8 is allowed in the char and varchar data types, and it's enabled when you create or change an object's collation to a collation that has a UTF8 suffix. One example is changing LATIN1_GENERAL_100_CI_AS_SC to LATIN1_GENERAL_100_CI_AS_SC_UTF8.
UTF-8 is available only to Windows collations that support supplementary characters, as introduced in SQL Server 2012 (11.x). The nchar and nvarchar data types allow UCS-2 or UTF-16 encoding only, and they remain unchanged.
Azure SQL Database and Azure SQL Managed Instance also support UTF-8 on database and column level, while Managed Instance supports this on a server level as well.
My problem:
I tried to save α and Δ with a simple example using SQL Server 2019 as follows:
CREATE TABLE dbo.SCI_DATA
(
id int NOT NULL,
name varchar(75) COLLATE Latin1_General_100_CI_AS_SC_UTF8
)
INSERT INTO dbo.SCI_DATA (id, name)
VALUES (1, 'DL-α-Tocopherol Acetate');
INSERT INTO dbo.SCI_DATA (id, name)
VALUES (2, 'Δ8-THC-Naphthoylester');
INSERT INTO dbo.SCI_DATA (id, name)
VALUES (3, 'Slainté');
SELECT *
FROM dbo.SCI_DATA;
DROP TABLE IF EXISTS dbo.SCI_DATA;
this returns
id name
---------------------------
1 DL-a-Tocopherol Acetate
2 ?8-THC-Naphthoylester
3 Slainté
The accented é
is fine but Alpha α and Delta Δ are returned as ?
. I'm wondering what I'm missing.
Solution 1:[1]
Even though the column is varchar with a UTF8 collation, it appears you still need to indicate that the data is Unicode with N'.
This modified example works:
CREATE TABLE dbo.SCI_DATA (
id int NOT NULL
, name varchar(75) COLLATE Latin1_General_100_CI_AS_SC_UTF8
)
INSERT INTO dbo.SCI_DATA (id, name) VALUES (1, N'DL-?-Tocopherol Acetate');
INSERT INTO dbo.SCI_DATA (id, name) VALUES (2, N'?8-THC-Naphthoylester');
INSERT INTO dbo.SCI_DATA (id, name) VALUES (3, 'Slainté');
SELECT * FROM dbo.SCI_DATA;
DROP TABLE IF EXISTS dbo.SCI_DATA;
Returns:
id name
1 DL-?-Tocopherol Acetate
2 ?8-THC-Naphthoylester
3 Slainté
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Dale K |