Does <meta charset="utf-8"> mean that JavaScript is using UTF-8 encoding instead of UTF-16?
I have been trying to understand why encoding/decoding to UTF-8 is needed all over the place in JavaScript land, and learned from the post "Let's talk about Javascript string encoding" that JavaScript uses UTF-16 encoding.
So I'm assuming that's why a library such as utf8.js exists: to convert between UTF-16 and UTF-8.
At the end of that post, the author provides some insights:
Encoding in Node is extremely confusing, and difficult to get right. It helps, though, when you realize that Javascript string types will always be encoded as UTF-16, and most of the other places strings in RAM interact with sockets, files, or byte arrays, the string gets re-encoded as UTF-8.
This is all massively inefficient, of course. Most strings are representable as UTF-8, and using two bytes to represent their characters means you are using more memory than you need to, as well as paying an O(n) tax to re-encode the string any time you encounter a HTTP or filesystem boundary.
That reminded me of the <meta charset="utf-8"> tag in the HTML <head>, which I never really thought much about, other than "you need this to get text working properly".
Now I'm wondering, and this is what the question is about, whether that <meta charset="utf-8"> tag tells JavaScript to use UTF-8 encoding. That would mean that when you create strings in JavaScript, they would be UTF-8 encoded rather than UTF-16. Or, if I'm wrong there, what exactly is it doing? If it is telling JavaScript to use UTF-8 encoding instead of UTF-16 (which I guess would be considered the "default"), then you wouldn't need to pay that O(n) tax for converting between UTF-8 and UTF-16, which would mean a performance improvement. Am I understanding this correctly, and if not, what am I missing?
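As a concrete illustration of the tax being asked about, here is a minimal sketch (browser console or Node 18+; the example string is arbitrary) showing that strings are handled as UTF-16 code units inside JavaScript and only get re-encoded to UTF-8 at byte boundaries such as TextEncoder/TextDecoder:

```js
// Minimal sketch: JS strings expose UTF-16 code units no matter what the
// page's <meta charset> says; UTF-8 only appears when bytes cross a boundary.
const s = "héllo";
console.log(s.length);          // 5 UTF-16 code units
console.log(s.charCodeAt(1));   // 233 -- "é" is a single code unit in memory

// Re-encoding to UTF-8 happens at I/O boundaries (sockets, files, fetch bodies):
const utf8 = new TextEncoder().encode(s);   // TextEncoder always emits UTF-8
console.log(utf8.length);       // 6 bytes -- "é" needs two bytes in UTF-8

// ...and decoding UTF-8 bytes back into a (UTF-16) JS string:
console.log(new TextDecoder("utf-8").decode(utf8) === s);   // true
```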
Solution 1:[1]
Charset in meta
The <meta charset="utf-8"> tag tells HTML (less sloppily: the HTML parser) that the encoding of the page is UTF-8.
JS does not have a built-in facility to switch between different encodings of strings - it is always UTF-16.
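A quick sketch of that distinction, assuming a page declared with <meta charset="utf-8">: the charset is only visible to scripts as document metadata, while strings remain UTF-16 code units either way:

```js
// The meta charset is reported as page metadata, not as a property of strings:
console.log(document.characterSet);   // "UTF-8" -- what the HTML parser used

// String APIs still work in UTF-16 code units regardless of the page charset:
const pi = "π";                       // U+03C0
console.log(pi.length);               // 1 code unit
console.log(pi.charCodeAt(0));        // 960
```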
Asymptotic bounds
I do not think that there is an O(n) penalty for encoding conversions. Whenever this kind of encoding change is due, there is already an O(n) operation in play: reading/writing the data stream. So any fixed number of operations on each octet is still O(n). An encoding change requires only local knowledge, i.e. a look-ahead window of fixed length, and can thus be incorporated into the stream read/write code at a cost of O(1) per octet.
You could argue that the space penalty is O(n), though if there is a need to store the string in a standard encoding (i.e. without compression), the move to UTF-16 means a factor of 2 at most, thus staying within the O(n) bound.
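As a sketch of how the conversion can ride along inside the read that is already O(n), here is a streaming decode loop using the standard fetch and TextDecoder APIs (the URL is a placeholder):

```js
// Hypothetical endpoint; the point is the shape of the loop, not the URL.
async function readBodyAsText(url) {
  const response = await fetch(url);
  const reader = response.body.getReader();
  const decoder = new TextDecoder("utf-8");        // stateful across chunks
  let text = "";
  for (;;) {
    const { value, done } = await reader.read();   // the O(n) read we pay anyway
    if (done) break;
    // { stream: true } keeps a code point split across chunk boundaries pending,
    // so the conversion only ever needs a fixed-size look-ahead window.
    text += decoder.decode(value, { stream: true });
  }
  return text + decoder.decode();                  // flush any buffered bytes
}
```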
Constant factors
Even if the concern is minimizing the constant factors hidden in the O(n) notation, encoding changes have a modest impact, in the time domain at least. Writing/reading a UTF-16 stream as UTF-8 means, for most (Western) textual data, skipping every second octet / inserting null octets. That performance hit pales in comparison with the overhead and latency of interfacing with a socket or the file system.
Storage is different, of course, though storage is comparatively cheap today and the upper bound of 2 still holds. The move from 32 to 64 bits has a higher memory impact with respect to number representations and pointers.
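For a rough feel of that upper bound of 2, here is a small Node sketch comparing the stored size of the same (arbitrary) string in both encodings:

```js
// Node: byte sizes of one string under the two encodings.
const text = "hello, world";
console.log(Buffer.byteLength(text, "utf8"));      // 12 bytes (all ASCII)
console.log(Buffer.byteLength(text, "utf16le"));   // 24 bytes -- the factor of 2
```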
Solution 2:[2]
JavaScript uses UTF-16
HTML5 uses UTF-8
Your <meta charset="utf-8"> setting applies to your HTML5 web page's encoding, and is optional, as most modern browsers assume HTML5 pages are encoded and decoded as UTF-8. This meta tag has nothing to do with JavaScript encoding, however, and does not change or affect JavaScript, other than telling the browser to decode your page using UTF-8, which all newer browsers do by default anyway.
There was a deprecated, optional meta tag that allowed you to control how external or internal <script> scripts and files were encoded. But it is no longer supported in HTML5 and would not change how JavaScript engines decode these files.
This old meta tag is shown below but should NOT be used:
```html
<meta http-equiv="Content-Script-Type" content="text/javascript; charset=UTF-8;">
```
HOW JAVASCRIPT ENGINES WORK
The way most modern JavaScript engines work is that, yes, they read and decode web files, script files, HTML markup, and page text into UTF-16. That means when they read basic English or ASCII characters and numbers, they store them as two bytes each, even though one byte would usually suffice for most English-based websites. However, this UTF-16 representation also allows JavaScript to store the higher-plane Unicode glyphs and characters, which need 2-4 bytes, alongside the lower-plane English ASCII ones.
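A small sketch of what those UTF-16 units look like from script: an astral-plane emoji takes two code units (a surrogate pair), while the code-point-aware APIs still see one character:

```js
const face = "😀";                               // U+1F600, outside the BMP
console.log(face.length);                        // 2 UTF-16 code units
console.log(face.charCodeAt(0).toString(16));    // "d83d" -- high surrogate
console.log(face.codePointAt(0).toString(16));   // "1f600" -- the real code point
console.log([...face].length);                   // 1 -- iteration is by code point
```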
Some answers below argue about parsing, memory, or storage speed and space savings, but it's largely a moot point, because these script engines have been refined over 20+ years and are designed to maximize efficiency. More on that below...
Most web pages, script files, and external text on the Web in 2022 are stored as UTF-8 by default (or in ASCII in some cases, the older model). Most UTF-8 and ASCII characters can be safely stored in one byte, so UTF-8 is the default today and is cross-compatible with old and new web page encodings and decodings. That is why HTML5 is UTF-8 and works so well. But JavaScript long ago planned for issues around higher-order Unicode languages and glyphs, so it was decided to store everything in UTF-16. For speed and other reasons, though, JavaScript engines still often store the basic ASCII set (English characters and numbers) in its native one-byte form, just as UTF-8 does, or in the same encoding your HTML5 web page uses by default. It's not a hard-and-fast rule. So HTML tags read and stored by, say, Chrome's V8 JavaScript engine might still be stored as 1-byte UTF-8-style strings, not 2-byte UTF-16. It is one more reason you do not need HTML to tell JavaScript how to encode or decode web pages or script files: the engines handle all that for you, and many end up storing things as one-byte strings natively anyway for increased speed. Again, it is when you start playing around with exotic languages, glyphs, font sets, and emoticons that the very large code point numbers need more memory and can cause issues if not encoded and decoded correctly on the server or in files sent to browsers to interpret. (One more reason I still think JavaScript is Evil!)
What is happening under the covers of these scripting engines in terms of UTF-8 or ASCII encoding and how strings get stored in memory isn't something you should worry about. You only run into issues when streaming more complex upper "planes" of Unicode characters. The UTF-16 characteristics of JavaScript storage and encoding are variable, from what I have read. It's not something most web developers need to worry about, in my opinion, until you get into upper-level Unicode languages and character-set manipulation in JavaScript. That is where Node and many open-source engines have struggled in terms of decoding and encoding UTF-8 and UTF-16, because of their reliance on JavaScript engines.
Again, because everything is moving towards UTF-8 encoding now (where 1-4 bytes are used to encode the complete Unicode character set, versus UTF-16, which starts at 2-byte units and goes up), you will see JavaScript handle all that decoding of UTF-8 into UTF-16 and back out as a pretty seamless process, with lots of contingency in place.
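To make the 1-4 byte range of UTF-8 concrete, here is a small sketch using the standard TextEncoder:

```js
const utf8Bytes = (ch) => new TextEncoder().encode(ch).length;
console.log(utf8Bytes("A"));    // 1 byte  (ASCII)
console.log(utf8Bytes("é"));    // 2 bytes (U+00E9)
console.log(utf8Bytes("€"));    // 3 bytes (U+20AC)
console.log(utf8Bytes("😀"));   // 4 bytes (U+1F600)
```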
BTW, scripting engines read most HTML5 web pages as UTF-8, including their own external JavaScript files. They then translate or "encode" that back into UTF-16 in memory. As mentioned above, however, because ASCII English characters make up 99% of most content, these engines rarely store them as two-byte UTF-16 internally; it's a waste of memory. But JavaScript also has to parse and store its own external JavaScript files from the server, and those are more often encoded in UTF-8 (the default) or ASCII, NOT UTF-16. Most browsers by default, without extra charset instructions, follow the web server's "content-type" header and assume those are all UTF-8 or ASCII, rarely UTF-16. Most developers just save their JavaScript as UTF-8, unknowingly in almost all cases, which works fine.
But JavaScript has to "decode" those from UTF-8 into UTF-16 for its own internal use, especially if your scripts contain upper-plane Unicode characters.
As mentioned, that's rarely needed for most script characters UNLESS some very large upper-plane Unicode is found in the file. If you want to help browser JavaScript engines with script files that contain lots of complex Unicode, then in that one case you might consider encoding your script files as UTF-16 and then setting your server or your HTML5 meta tags to instruct the script engines to decode your external script files as UTF-16.
That is the only case where it might be critical. Browser JavaScript engines will first look at the MIME type or "content-type" and charset in the HTTP header coming from the server to see what all the web page files should be decoded from, before the HTML meta tags. As mentioned, that's almost always UTF-8 now in HTML5. If the engine cannot determine the encoding from the HTTP server header, it next checks your HTML5 web page's <script> tags and their type and/or charset attributes to see what encoding your JavaScript source file has declared. You can set it to UTF-16 if you've encoded those files in UTF-16. Otherwise it assumes UTF-8 or ASCII, which work the same as far as encoding basic characters into numbers from bits. In most cases these settings are missing from websites, which is OK, as modern scripting engines have lots of fallback checks and assume UTF-8.
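A hedged sketch of what such a declaration looks like in markup; note that the charset attribute on <script> is obsolete in HTML5 and modern engines prefer the HTTP Content-Type header or simply UTF-8 files, so this is a legacy illustration only (the file names are hypothetical):

```html
<!-- Preferred today: the server sends
     Content-Type: text/javascript; charset=utf-8
     and the file is saved as UTF-8, so no attribute is needed. -->
<script src="app.js"></script>

<!-- Legacy per-script override (obsolete in HTML5, honored by older engines): -->
<script src="legacy-lib.js" charset="utf-16"></script>
```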
If the JavaScript engine still has trouble, it will check the web page's "charset" meta tag. For the HTML5 page itself, that is either explicitly UTF-8 or, because HTML5 is in use, assumed to be UTF-8. For scripts, you can set that charset to UTF-16 if you've encoded those files that way (which isn't common).
Lastly, there is also the "byte order mark", or BOM, on the script file, which is most likely UTF-8. Microsoft products are notorious for adding BOMs to files, which in some cases can cause problems. It was a way for them to self-assign an encoding in the first few bytes of a file, which is much faster than trying to parse and sniff complete files. But sometimes it causes problems in browsers.
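A common concrete case, as a Node sketch (the file name is hypothetical): a UTF-8 BOM decodes to a leading U+FEFF character that Node does not strip, and which will break JSON.parse unless you remove it:

```js
const fs = require("fs");

// Read a file that a Windows editor may have saved with a UTF-8 BOM.
let raw = fs.readFileSync("settings.json", "utf8");  // hypothetical file
if (raw.charCodeAt(0) === 0xFEFF) {
  // The BOM bytes EF BB BF decode to the single code point U+FEFF.
  raw = raw.slice(1);
}
const settings = JSON.parse(raw);
```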
Even if your web files, like HTML and JavaScript, are encoded in ASCII or, say, Latin-1, that still translates directly into UTF-8 anyway. Only ANSI from old Windows machines assigned numbers to some characters that cannot be cross-translated back to Unicode. That is why you occasionally see unrecognized gibberish in web pages: most of it is higher-level characters that could not be mapped from ANSI to UTF encodings and so are lost.
But once the encoding of all the web files is known to the browser's JavaScript engine, it can decode the bits, extract the character numbers, and re-encode them into its own two-byte UTF-16 memory representation, as mentioned above.
At the end of the day the engines do a good job of negotiating all this for you. :)
Solution 3:[3]
Re "meta charset=“utf-8”"... another sign about how sloppy the standards bodies that build the web can be. This has nothing whatsoever to do with character sets. It's encodings of glyphs. A character set is more closely related to an alphabet or a language than to an encoding. HTML got it as wrong as you can get.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | kwatson |