
Aw man. I was using "WTF-8" to mean "Double UTF-8", as I described most recently at [1]. Double UTF-8 is that unintentionally popular encoding where someone takes UTF-8, accidentally decodes it as their favorite single-byte encoding such as Windows-1252, then encodes those characters as UTF-8.

[1] http://blog.luminoso.com/2015/05/21/ftfy-fixes-text-for-you-...
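
For concreteness, here is a minimal sketch of how "Double UTF-8" arises, in plain Python (nothing ftfy-specific, just the encode/decode round trip described above):

    >>> s = "é"
    >>> once = s.encode("utf-8")                        # b'\xc3\xa9'
    >>> twice = once.decode("windows-1252").encode("utf-8")
    >>> twice                                           # the mojibake bytes
    b'\xc3\x83\xc2\xa9'
    >>> twice.decode("utf-8")                           # what the reader then sees
    'Ã©'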

It was such a perfect abbreviation, but now I probably shouldn't use it, as it would be confused with Simon Sapin's WTF-8, which people would actually use on purpose.


> ÃÆ'ÂÆ'‚ÃÆ'‚ the future of publishing at W3C

That is an amazing example.

It's not even "double UTF-8", it's UTF-8 six times (including the one to get it on the Web), it's been decoded as Latin-1 twice and Windows-1252 three times, and at the end there's a non-breaking space that's been converted to a space. All to represent what originated as a single non-breaking space anyway.

Which makes me happy that my module solves it.

    >>> from ftfy.fixes import fix_encoding_and_explain
    >>> fix_encoding_and_explain("ÃÆ'ÂÆ'‚ÃÆ'‚ the future of publishing at W3C")
    ('\xa0the future of publishing at W3C',
     [('encode', 'sloppy-windows-1252', 0),
      ('transcode', 'restore_byte_a0', 2),
      ('decode', 'utf-8-variants', 0),
      ('encode', 'sloppy-windows-1252', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'latin-1', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'sloppy-windows-1252', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'latin-1', 0),
      ('decode', 'utf-8', 0)])


Neato! I wrote a shitty version of 50% of that 2 years ago, when I was tasked with uncooking a bunch of data in a MySQL database as part of a larger migration to UTF-8. I hadn't done that much pencil-and-paper bit manipulation since I was thirteen.


Awesome module! I wonder if anyone else had ever managed to reverse-engineer that tweet before.


I love this.

    The key words "WHAT", "DAMNIT", "GOOD GRIEF", "FOR HEAVEN'S SAKE",
    "RIDICULOUS", "BLOODY HELL", and "DIE IN A GREAT BIG CHEMICAL FIRE"
    in this memo are to be interpreted as described in [RFC2119].


You really want to call this WTF (8)? Is it April 1st today? Am I the only one that thought this article is about a new funny project that is called "what the fuck" encoding, like when somebody announced he had written a to_nil gem https://github.com/mrThe/to_nil ;) Sorry but I can't stop laughing.


This is intentional. I wish we didn't have to do stuff like this, but we do and that's the "what the fuck". All because the Unicode Committee in 1989 really wanted 16 bits to be enough for everybody, and of course it wasn't.


The mistake is older than that. Wide character encodings in general are just hopelessly flawed.

WinNT, Java and a lot more software use the wide character encodings UCS2/UTF-16(/UTF-32?). And it was added to C89/C++ (wchar_t). WinNT actually predates the Unicode standard by a year or so. http://en.wikipedia.org/wiki/Wide_character , http://en.wikipedia.org/wiki/Windows_NT#Development

Converting between UTF-8 and UTF-16 is wasteful, though often necessary.

> wide characters are a hugely flawed idea [parent post]

I know. Back in the early nineties they thought otherwise and, in retrospect, were proud that they used it. But nowadays UTF-8 is usually the better choice (except for possibly some Asian and exotic later-added languages that may require more space with UTF-8) - I am not saying UTF-16 would be a better choice then, there are certain other encodings for special cases.

And as the linked article explains, UTF-16 is a huge mess of complexity with back-dated validation rules that had to be added because it stopped being a wide-character encoding when the new code points were added. UTF-16, when implemented correctly, is actually significantly more complicated to get right than UTF-8.

UTF-32/UCS-4 is quite simple, though obviously it imposes a 4x penalty on bytes used. I don't know anything that uses it in practice, though surely something does.

Again: wide characters are a hugely flawed idea.

Sure, go to 32 bits per character. But it cannot be said to be "simple" and will not let you make the assumption that 1 integer = 1 glyph.

Namely it won't save you from the following problems:

    * Precomposed vs multi-codepoint diacritics (Do you write á with
      one 32 bit char or with two? If it's Unicode the answer is both)

    * Variation selectors (see also Han unification)

    * Bidi, RTL and LTR embedding chars

And possibly others I don't know about. I feel like I am learning of these dragons all the time.

I almost like that UTF-16 and even more so UTF-8 break the "1 character, 1 glyph" rule, because it gets you in the mindset that this is bogus. Because in Unicode it is most decidedly artificial, even if you switch to UCS-4 in a vain attempt to avoid such problems. Unicode just isn't simple any way you slice it, so you might as well shove the complexity in everybody's face and have them confront it early.
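
A concrete look at the first bullet above, sketched in Python (the same ambiguity exists no matter how wide your code units are):

    >>> import unicodedata
    >>> one = "\u00e1"                            # á as a single precomposed code point
    >>> two = unicodedata.normalize("NFD", one)   # á as 'a' plus a combining acute
    >>> len(one), len(two)
    (1, 2)
    >>> one == two, unicodedata.normalize("NFC", two) == one
    (False, True)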

You can't use that for storage.

> The mapping between negative numbers and graphemes in this form is not guaranteed constant, even between strings in the same process.


What's your storage requirement that's not adequately solved by the existing encoding schemes?


What are you suggesting, store strings in UTF8 and then "normalize" them into this baroque format whenever you load/save them purely so that offsets correspond to grapheme clusters? Doesn't seem worth the overhead to my eyes.

In-memory string representation rarely corresponds to on-disk representation.

Various programming languages (Java, C#, Objective-C, JavaScript, ...) as well as some well-known libraries (ICU, Windows API, Qt) use UTF-16 internally. How much data do you have lying around that's UTF-16?

Sure, more recently, Go and Rust have decided to go with UTF-8, but that's far from common, and it does have some drawbacks compared to the Perl6 (NFG) or Python3 (latin-1, UCS-2, UCS-4 as appropriate) model if you have to do actual processing instead of just passing opaque strings around.

Also note that you have to go through a normalization step anyway if you don't want to be tripped up by having multiple ways to represent a single character.

NFG enables O(N) algorithms for character level operations.

The overhead is entirely wasted on code that does no character level operations.

For code that does do some character level operations, avoiding quadratic behavior may pay off handsomely.

i think linux/mac systems default to UCS-4, certainly the libc implementations of wcs* do.

i agree its a flawed idea though. 4 billion characters seems like enough for now, but i'd guess UTF-32 will need extending to 64 too... and really how about decoupling the size from the data entirely? it works well enough in the general case of /every type of data we know about/ that i'm pretty sure this specialised use case is not very special.

The Unixish C runtimes of the world use a 4-byte wchar_t. I'm not aware of anything in "Linux" that actually stores or operates on 4-byte character strings. Obviously some software somewhere must, but the overwhelming majority of text processing on your linux box is done in UTF-8.

That's not remotely comparable to the situation in Windows, where file names are stored on disk in a 16 bit not-quite-wide-character encoding, etc... And it's leaked into firmware. GPT partition names and UEFI variables are 16 bit despite never once being used to store anything but ASCII, etc... All that software is, broadly, incompatible and buggy (and of questionable security) when faced with new code points.

We don't even have 4 billion characters possible now. The Unicode range is only 0-10FFFF, and UTF-16 can't represent any more than that. So UTF-32 is restricted to that range too, despite what 32 bits would allow, never mind 64.

But we don't seem to be running out -- planes 3-13 are completely unassigned so far, covering 30000-DFFFF. That's nearly 65% of the Unicode range completely untouched, and planes 1, 2, and 14 still have big gaps too.

> But we don't seem to be running out

The issue isn't the quantity of unassigned codepoints, it's how many private use ones are available, only 137,000 of them. Publicly available private use schemes such as ConScript are fast filling up this space, mainly by encoding block characters in the same fashion Unicode encodes Korean Hangul, i.e. by using a formula over a small set of base components to generate all the block characters.

My own surrogation scheme, UTF-88, implemented in Go at https://github.com/gavingroovygrover/utf88 , expands the number of UTF-8 codepoints to 2 billion as originally specified by using the top 75% of the private use codepoints as 2nd tier surrogates. This scheme can easily be fitted on top of UTF-16 instead. I've taken the liberty in this scheme of making 16 planes (0x10 to 0x1F) available as private use; the rest are unassigned.

I created this scheme to help in using a formulaic method to generate a commonly used subset of the CJK characters, perhaps in the codepoints which would be 6 bytes under UTF-8. It would be more difficult than the Hangul scheme because CJK characters are built recursively. If successful, I'd look at pitching the UTF-88 surrogation scheme for UTF-16 and having UTF-8 and UTF-32 officially extended to 2 billion characters.


NFG uses the negative numbers down to about -2 billion as an implementation-internal private use area to temporarily store graphemes. Enables fast grapheme-based manipulation of strings in Perl 6. Though such negative-numbered codepoints could only be used for private use in data interchange between 3rd parties if UTF-32 was used, because neither UTF-8 (even pre-2003) nor UTF-16 could encode them.


Yes. sizeof(wchar_t) is 2 on Windows and 4 on Unix-like systems, so wchar_t is pretty much useless. That's why C11 added char16_t and char32_t.


I'm wondering how common the "mistake" of storing UTF-16 values in wchar_t on Unix-like systems is. I know I thought I had my code carefully basing whether it was UTF-16 or UTF-32 on the size of wchar_t, only to find that one of the supposedly portable libraries I used had UTF-16 no matter how big wchar_t was.


Oh ok it's intentional. Thx for explaining the choice of the name. Not only because of the name itself but also by explaining the reason behind the choice, you managed to get my attention. I will try to find out more about this problem, because I guess that as a developer this might have some impact on my work sooner or later and therefore I should at least be aware of it.

to_nil is actually a pretty important function! Completely trivial, obviously, but it demonstrates that there's a canonical way to map every value in Ruby to nil. This is essentially the defining feature of nil, in a sense.

With typing the interest here would be clearer, of course, since it would be more apparent that nil inhabits every type.


The main motivator for this was Servo's DOM, although it ended up getting deployed first in Rust to deal with Windows paths. We haven't determined whether we'll need to use WTF-8 throughout Servo—it may depend on how document.write() is used in the wild.

So we're going to see this on web sites. Oh, joy.

It's time for browsers to start saying no to really bad HTML. When a browser detects a major error, it should put an error bar across the top of the page, with something like "This page may display improperly due to errors in the page source (click for details)". Start doing that for serious errors such as Javascript code aborts, security errors, and malformed UTF-8. Then extend that to pages where the character encoding is ambiguous, and stop trying to guess character encoding.

The HTML5 spec formally defines consistent handling for many errors. That's OK, there's a spec. Stop there. Don't try to outguess new kinds of errors.

No. This is an internal implementation detail, not to be used on the Web.

As to draconian error handling, that's what XHTML is about and why it failed. Just define a somewhat sensible behavior for every input, no matter how ugly.


Yes, that bug is the best place to start. We've future proofed the architecture for Windows, but there is no direct work on it that I'm aware of.


What does the DOM do when it receives a surrogate half from Javascript? I thought that the DOM APIs (e.g. createTextNode, innerHTML setter, setAttribute, HTMLInputElement.value setter, document.write) would all strip out the lone surrogate code units?


In current browsers they'll happily pass around lone surrogates. Nothing special happens to them (vs. any other UTF-16 code unit) till they reach the layout layer (where they obviously cannot be drawn).


I found this through https://news.ycombinator.com/item?id=9609955 -- I find it fascinating the solutions that people come up with to deal with other people's problems without breaking correct code. Rust uses WTF-8 to interact with Windows' UCS2/UTF-16 hybrid, and from a quick look I'm hopeful that Rust's story around handling Unicode should be much nicer than (say) Python's or Java's.


Have you looked at Python 3 yet? I'm using Python 3 in production for an internationalized website and my experience has been that it handles Unicode pretty well.

Not that great of a read. Stuff like:

> I have been told multiple times now that my point of view is wrong and I don't understand beginners, or that the "text model" has been changed and my request makes no sense.

"The text model has changed" is a perfectly legitimate reason to turn down ideas consistent with the previous text model and inconsistent with the current model. Keeping a coherent, consistent model of your text is a pretty important part of curating a language. One of Python's greatest strengths is that they don't just pile on random features, and keeping old crufty features from previous versions would amount to the same thing. To dismiss this reasoning is extremely shortsighted.


Many people who prefer Python3's way of handling Unicode are aware of these arguments. It isn't a position based on ignorance.


Hey, never meant to imply otherwise. In fact, even people who have issues with the py3 way often agree that it's still better than 2's.


Python 3 doesn't handle Unicode any better than Python 2, it just made it the default string. In all other aspects the situation has stayed as bad as it was in Python 2 or has gotten significantly worse. Good examples for that are paths and anything that relates to local IO when your locale is C.

> Python 3 doesn't handle Unicode any better than Python 2, it just made it the default string. In all other aspects the situation has stayed as bad as it was in Python 2 or has gotten significantly worse.

Maybe this has been your experience, but it hasn't been mine. Using Python 3 was the single best decision I've made in developing a multilingual website (we support English/German/Spanish). There's not a ton of local IO, but I've upgraded all my personal projects to Python 3.

Your complaint, and the complaint of the OP, seems to be basically, "It's different and I have to change my code, therefore it's bad."

My complaint is not that I have to change my code. My complaint is that Python 3 is an attempt at breaking as little compatibility with Python 2 as possible while making Unicode "easy" to use. They failed to achieve both goals.

Now we have a Python 3 that's incompatible with Python 2 but provides almost no significant benefit, solves none of the big well known problems and introduces quite a few new problems.


I have to disagree, I think using Unicode in Python 3 is currently easier than in any language I've used. It certainly isn't perfect, but it's better than the alternatives. I certainly have spent very little time struggling with it.


That is not quite true, in the sense that more of the standard library has been made unicode-aware, and implicit conversions between unicode and bytestrings have been removed. So if you're working in either domain you get a coherent view, the problem being when you're interacting with systems or concepts which straddle the divide or (even worse) may be in either domain depending on the platform. Filesystem paths are the latter: it's text on OSX and Windows — although possibly ill-formed on Windows — but it's bag-o-bytes on most unices. There Python 2 is just "better" in that issues will probably fly under the radar if you don't prod things too much.

There is no coherent view at all. Bytes still have methods like .upper() that make no sense at all in that context, while unicode strings with these methods are broken because these are locale dependent operations and there is no appropriate API. You can also index, slice and iterate over strings, all operations that you really shouldn't do unless you really know what you are doing. The API in no way indicates that doing any of these things is a problem.

Python 2 handling of paths is not good because there is no good abstraction over different operating systems, treating them as byte strings is a sane lowest common denominator though.

Python 3 pretends that paths can be represented as unicode strings on all OSes, but that's not true. That is held up with a very leaky abstraction and means that Python code that treats paths as unicode strings and not as paths-that-happen-to-be-unicode-but-really-arent is broken. Most people aren't aware of that at all and it's definitely surprising.

On top of that, implicit coercions have been replaced with implicit broken guessing of encodings, for example when opening files.
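
A small sketch of the leaky path abstraction being complained about, assuming a Unix system whose filesystem encoding is UTF-8: Python 3 smuggles undecodable filename bytes into str via the surrogateescape handler (PEP 383), so the result looks like text but can't be encoded as real UTF-8.

    >>> import os
    >>> raw = b"caf\xe9.txt"         # a Latin-1 filename: fine on disk, not valid UTF-8
    >>> name = os.fsdecode(raw)      # undecodable bytes become lone surrogates
    >>> name
    'caf\udce9.txt'
    >>> os.fsencode(name) == raw     # the bytes round-trip...
    True
    >>> name.encode("utf-8")         # ...but the "text" is not actually encodable
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 3: surrogates not allowed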

When you say "strings" are you referring to strings or bytes? Why shouldn't you slice or index them? It seems like those operations make sense in either case but I'm sure I'm missing something.

On the guessing of encodings when opening files, that's not really a problem. The caller should ideally specify the encoding manually. If you don't know the encoding of the file, how can you decode it? You could still open it as raw bytes if required.

I used strings to mean both. Byte strings can be sliced and indexed without issue because a byte as such is something you may actually want to deal with.

Slicing or indexing into unicode strings is a problem because it's not clear what unicode strings are strings of. You can look at unicode strings from different perspectives and see a sequence of codepoints or a sequence of characters, both can be reasonable depending on what you want to do. Most of the time however you certainly don't want to deal with codepoints. Python however only gives you a codepoint-level perspective.

Guessing encodings when opening files is a problem precisely because - as you mentioned - the caller should specify the encoding, not just sometimes but always. Guessing an encoding based on the locale or the content of the file should be the exception and something the caller does explicitly.

It slices by codepoints? That's just silly; we've gone through this whole unicode-everywhere process so we can stop thinking about the underlying implementation details, but the API forces you to deal with them anyway.

Fortunately it's not something I deal with often, but thanks for the info, it will stop me getting caught out later.


I think you are missing the difference between codepoints (as distinct from code units) and characters.

And unfortunately, I'm not any more enlightened as to my misunderstanding.

I get that every different thing (character) is a different Unicode number (code point). To store / transmit these you need some standard (encoding) for writing them down as a sequence of bytes (code units; depending on the encoding each code unit is made up of a different number of bytes).

How is any of that in conflict with my original points? Or is some of my above understanding incorrect?

I know you have a policy of not replying to people so maybe someone else could step in and clear up my confusion.


Codepoints and characters are not equivalent. A character can consist of one or more codepoints. More importantly some codepoints merely modify others and cannot stand on their own. That means if you slice or index into a unicode string, you might get an "invalid" unicode string back. That is a unicode string that cannot be encoded or rendered in any meaningful way.
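
A quick Python illustration of that: slicing by code point can strand a combining mark (the string below is just an example).

    >>> s = "cafe\u0301"         # 'café' spelled with a combining acute accent
    >>> len(s)                   # five code points, four user-perceived characters
    5
    >>> s[:4]                    # the slice cuts the accent off the 'e'
    'cafe'
    >>> s[4:] == "\u0301"        # what's left is a lone combining mark
    True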

Right, ok. I recall something about this - ü can be represented either by a single code point or by the letter 'u' preceded by the modifier.

As the user of unicode I don't really care about that. If I slice characters I expect a slice of characters. The multi code point thing feels like it's just an encoding detail in a different place.

I guess you need some operations to get to those details if you need them. Man, what was the drive behind adding that extra complication to life?!

Thanks for explaining. That was the piece I was missing.

bytes.upper is the Right Thing when you are dealing with ASCII-based formats. It also has the advantage of breaking in less random ways than unicode.upper.

And I mean, I can't really think of any cross-locale requirements fulfilled by unicode.upper (maybe case-insensitive matching, but then you also want to do lots of other filtering).

> There Python 2 is just "better" in that issues will probably fly under the radar if you don't prod things too much.

Ah yeah, the JavaScript solution.


Well, Python 3's unicode support is much more complete. As a trivial example, case conversions now cover the whole unicode range. This holds pretty consistently - Python 2's `unicode` was incomplete.

> It is unclear whether unpaired surrogate byte sequences are supposed to be well-formed in CESU-8.

According to the Unicode Technical Report #26 that defines CESU-8 [1], CESU-8 is a Compatibility Encoding Scheme for UTF-16 ("CESU"). In fact, the way the encoding is defined, the source data must be represented in UTF-16 prior to converting to CESU-8. Since UTF-16 cannot represent unpaired surrogates, I think it's safe to say that CESU-8 cannot represent them either.

[1] http://www.unicode.org/reports/tr26/

From the article:

> UTF-16 is designed to represent any Unicode text, but it can not represent a surrogate code point pair since the corresponding surrogate 16-bit code unit pairs would instead represent a supplementary code point. Therefore, the concept of Unicode scalar value was introduced and Unicode text was restricted to not contain any surrogate code point. (This was presumably deemed simpler than only restricting pairs.)

This is all gibberish to me. Can someone explain this in layman's terms?

People used to think 16 bits would be enough for anyone. It wasn't, so UTF-16 was designed as a variable-length, backwards-compatible replacement for UCS-2.

Characters outside the Basic Multilingual Plane (BMP) are encoded as a pair of 16-bit code units. The numeric values of these code units denote codepoints that lie themselves within the BMP. While these values can be represented in UTF-8 and UTF-32, they cannot be represented in UTF-16. Because we want our encoding schemes to be equivalent, the Unicode code space contains a hole where these so-called surrogates lie.

Because not everyone gets Unicode right, real-world data may contain unpaired surrogates, and WTF-8 is an extension of UTF-8 that handles such data gracefully.
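
To make the surrogate mechanism concrete, here is the standard UTF-16 pairing arithmetic, sketched in Python (the code point is just an example):

    cp = 0x1F4A9                        # a code point outside the BMP
    offset = cp - 0x10000               # 20 bits left to split in half
    high = 0xD800 + (offset >> 10)      # lead surrogate
    low = 0xDC00 + (offset & 0x3FF)     # trail surrogate
    print(hex(high), hex(low))          # 0xd83d 0xdca9
    print(chr(cp).encode("utf-16-le").hex())   # '3dd8a9dc' -- the same pair, little-endian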

I understand that for efficiency we want this to be as fast as possible. Simple compression can take care of the wastefulness of using excessive space to encode text - so it really only leaves efficiency.

If I was to make a first attempt at a variable length, but well defined, backwards compatible encoding scheme, I would use something like the number of bits up to (and including) the first 0 bit as defining the number of bytes used for this character. So,

> 0xxxxxxx, 1 byte
> 10xxxxxx, 2 bytes
> 110xxxxx, 3 bytes.

We would never run out of codepoints, and legacy applications can simply ignore codepoints they don't understand. We would only waste 1 bit per byte, which seems reasonable given just how many problems encodings usually represent. Why wouldn't this work, apart from already existing applications that do not know how to do this?

That'south roughly how UTF-eight works, with some tweaks to arrive self-synchronizing. (That is, yous tin jump to the center of a stream and discover the next code signal by looking at no more iv bytes.)

As to running out of code points, we're limited past UTF-16 (up to U+10FFFF). Both UTF-32 and UTF-8 unchanged could go up to 32 bits.
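
For comparison with the scheme proposed above, a short Python loop that prints the leading-bit patterns UTF-8 actually uses (continuation bytes all start with 10):

    for ch in ["A", "é", "€", "\U0001F4A9"]:      # 1-, 2-, 3- and 4-byte cases
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X}", " ".join(f"{byte:08b}" for byte in encoded))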


Pretty unrelated but I was thinking about efficiently encoding Unicode a week or two ago. I think there might be some value in a fixed length encoding but UTF-32 seems a bit wasteful. With Unicode requiring 21 (20.09) bits per code point, packing three code points into 64 bits seems an obvious idea. But would it be worth the hassle, for example as an internal encoding in an operating system? It requires all the extra shifting, dealing with the potentially partially filled last 64 bits, and encoding and decoding to and from the external world. Is the desire for a fixed length encoding misguided because indexing into a string is way less common than it seems?
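
Purely as an illustration of the idea in this comment (it is not an existing encoding), a sketch of packing three 21-bit code points per 64-bit word:

    # Illustrative only: three 21-bit code points per 64-bit word.
    def pack3(cps):
        assert len(cps) == 3 and all(cp < (1 << 21) for cp in cps)
        return cps[0] | (cps[1] << 21) | (cps[2] << 42)

    def unpack3(word):
        mask = (1 << 21) - 1
        return [word & mask, (word >> 21) & mask, (word >> 42) & mask]

    word = pack3([ord("a"), 0x1F4A9, ord("z")])
    print(hex(word), [hex(cp) for cp in unpack3(word)])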

When you use an encoding based on integral bytes, you can use the hardware-accelerated and often parallelized "memcpy" bulk byte moving hardware features to manipulate your strings.

But inserting a codepoint with your approach would require all downstream bits to be shifted within and across bytes, something that would be a much bigger computational burden. It's unlikely that anyone would consider saddling themselves with that for a mere 25% space savings over the dead-simple and memcpy-able UTF-32.

I think you'd lose half of the already-small benefits of fixed indexing, and there would be enough extra complexity to leave you worse off.

In addition, there's a 95% chance you're not dealing with enough text for UTF-32 to hurt. If you're in the other 5%, then a packing scheme that's 1/3 more efficient is still going to hurt. There's no good use case.

Coding for variable-width takes more effort, but it gives you a better result. You can split strings appropriately for the use. Sometimes that's code points, but more often it's probably characters or bytes.

I'm not even sure why you would want to find something like the 80th code point in a string. It's rare enough not to be a top priority.


Yes. For instance, this allows the Rust standard library to convert &str (UTF-8) to &std::ffi::OsStr (WTF-8 on Windows) without converting or even copying data.


An interesting possible application for this is JSON parsers. If JSON strings contain unpaired surrogate code points, they could either throw an error or encode as WTF-8. I bet some JSON parsers think they are converting to UTF-8, but are really converting to GUTF-8.
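
A hedged illustration of that point with Python's json module (other parsers may behave differently): a lone surrogate escape parses without complaint, and the only way to get the result into UTF-8-like bytes is the generalized/WTF-8-style encoding.

    >>> import json
    >>> json.loads('"\\ud83d\\ude00"')          # a proper surrogate pair becomes one character
    '😀'
    >>> lone = json.loads('"\\ud800"')          # an unpaired surrogate parses fine
    >>> lone.encode("utf-8", "surrogatepass")   # GUTF-8/WTF-8-style bytes
    b'\xed\xa0\x80'
    >>> # a strict lone.encode("utf-8") would raise UnicodeEncodeError instead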

The name is unserious but the project is very serious; its author has responded to a few comments and linked to a presentation of his on the subject [0]. It's an extension of UTF-8 used to bridge UTF-8 and UCS2-plus-surrogates: while UTF8 is the modern encoding, you have to interact with legacy systems; for UNIX's bags of bytes you may be able to assume UTF8 (possibly ill-formed), but a number of other legacy systems used UCS2 and added visible surrogates (rather than proper UTF-16) afterwards.

Windows and NTFS, Java, UEFI, Javascript all work with UCS2-plus-surrogates. Having to interact with those systems from a UTF8-encoded world is an issue because they don't guarantee well-formed UTF-16; they might contain unpaired surrogates which can't be decoded to a codepoint allowed in UTF-8 or UTF-32 (neither allows unpaired surrogates, for obvious reasons).

WTF8 extends UTF8 with unpaired surrogates (and unpaired surrogates only, paired surrogates from valid UTF16 are decoded and re-encoded to a proper UTF8-valid codepoint) which allows interaction with legacy UCS2 systems.

WTF8 exists solely as an internal encoding (in-memory representation), but it's very useful there. It was initially created for Servo (which may need it to have a UTF8 internal representation yet properly interact with javascript), but it turned out to first be a boon to Rust's OS/filesystem APIs on Windows.

[0] http://exyr.org/2015/!!Con_WTF-8/slides.pdf

> WTF8 exists solely as an internal encoding (in-memory representation)

Today.

Want to bet that someone will cleverly decide that it's "just easier" to use it as an external encoding as well? This kind of cat always gets out of the bag eventually.


Better WTF8 than invalid UCS2-plus-surrogates. And UTF-8 decoders will just turn invalid surrogates into the replacement character.
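
A quick check of that behaviour with Python's strict UTF-8 decoder (the exact number of U+FFFD characters emitted can vary between decoders):

    >>> wtf8 = b"ab" + b"\xed\xa0\x80" + b"cd"   # WTF-8 bytes for a lone surrogate
    >>> wtf8.decode("utf-8", errors="replace")
    'ab���cd'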


I thought he was tackling the other problem, which is that you frequently find web pages that have both UTF-8 codepoints and single bytes encoded as ISO-latin-1 or Windows-1252.

The nature of unicode is that there's always a problem you didn't (but should) know existed.

And because of this global confusion, everyone important ends up implementing something that somehow does something moronic - so then everyone else has yet another problem they didn't know existed and they all fall into a self-harming spiral of depravity.


Some time ago, I made some ASCII art to illustrate the various steps where things can go wrong:

    [user-perceived characters]
                ^
                |
                v
       [grapheme clusters] <-> [characters]
                ^                   ^
                |                   |
                v                   v
            [glyphs]           [codepoints] <-> [code units] <-> [bytes]


And then basically information technology goes wrong when someone assumes that any two of the in a higher place is "the same matter". It'due south often implicit.

That's certainly one important source of errors. An obvious example would be treating UTF-32 as a stock-still-width encoding, which is bad considering you might cease upwardly cutting grapheme clusters in half, and you tin can easily forget about normalization if you think about information technology that way.

Then, information technology's possible to make mistakes when converting between representations, eg getting endianness incorrect.

Some issues are more subtle: In principle, the decision what should be considered a single character may depend on the language, nevermind the debate well-nigh Han unification - but every bit far as I'm concerned, that's a WONTFIX.

Let me see if I have this straight. My understanding is that WTF-8 is identical to UTF-8 for all valid UTF-16 input, but it can also round-trip invalid UTF-16. That is the ultimate goal.

Below is all the background I had to learn about to understand the motivation/details.

UCS-2 was designed as a 16-bit fixed-width encoding. When it became clear that 64k code points wasn't enough for Unicode, UTF-16 was invented to deal with the fact that UCS-2 was assumed to be fixed-width, but no longer could be.

The solution they settled on is weird, but has some useful properties. Basically they took a couple of code point ranges that hadn't been assigned yet and allocated them to a "Unicode within Unicode" coding scheme. This scheme encodes (1 large code point) -> (2 small code points). The small code points will fit in UTF-16 "code units" (this is our name for each two-byte unit in UTF-16). And for some more terminology, "large code points" are called "supplementary code points", and "small code points" are called "BMP code points."

The weird thing about this scheme is that we bothered to make the "2 small code points" (known as a "surrogate" pair) into real Unicode code points. A more normal thing would be to say that UTF-16 code units are totally separate from Unicode code points, and that UTF-16 code units have no meaning outside of UTF-16. A number like 0xd801 could have a code unit meaning as part of a UTF-16 surrogate pair, and also be a totally unrelated Unicode code point.

But the one nice property of the way they did this is that they didn't break existing software. Existing software assumed that every UCS-2 character was also a code point. These systems could be updated to UTF-16 while preserving this assumption.

Unfortunately it made everything else more complicated. Because now:

- UTF-16 can be ill-formed if it has any surrogate code units that don't pair properly.

- we have to figure out what to do when these surrogate code points — code points whose only purpose is to help UTF-16 break out of its 64k limit — occur outside of UTF-16.

This becomes especially complicated when converting UTF-16 -> UTF-8. UTF-8 has a native representation for large code points that encodes each in 4 bytes. But since surrogate code points are real code points, you could imagine an alternative UTF-8 encoding for large code points: make a UTF-16 surrogate pair, then UTF-8 encode the two code points of the surrogate pair (hey, they are real code points!) into UTF-8. But UTF-8 disallows this and only allows the canonical, four-byte encoding.

If you feel this is unjust and UTF-8 should be allowed to encode surrogate code points if it feels like it, then you might like Generalized UTF-8, which is exactly like UTF-8 except this is allowed. It's easier to convert from UTF-16, because you don't need any specialized logic to recognize and handle surrogate pairs. You still need this logic to go in the other direction though (GUTF-8 -> UTF-16), since GUTF-8 can have large code points that you'd need to encode into surrogate pairs for UTF-16.

If you like Generalized UTF-8, except that you always want to use surrogate pairs for large code points, and you want to totally disallow the UTF-8-native four-byte sequence for them, you might like CESU-8, which does this. This makes both directions of CESU-8 <-> UTF-16 easy, because neither conversion requires special handling of surrogate pairs.

A nice property of GUTF-8 is that it can round-trip any UTF-16 sequence, even if it's ill-formed (has unpaired surrogate code points). It's pretty easy to get ill-formed UTF-16, because many UTF-16-based APIs don't enforce well-formedness.

But both GUTF-8 and CESU-8 have the drawback that they are not UTF-8 compatible. UTF-8-based software isn't generally expected to decode surrogate pairs — surrogates are supposed to be a UTF-16-only peculiarity. Most UTF-8-based software expects that once it performs UTF-8 decoding, the resulting code points are real code points ("Unicode scalar values", which make up "Unicode text"), not surrogate code points.

So basically what WTF-8 says is: encode all code points as their real code point, never as a surrogate pair (like UTF-8, unlike GUTF-8 and CESU-8). However, if the input UTF-16 was ill-formed and contained an unpaired surrogate code point, then you may encode that code point directly with UTF-8 (like GUTF-8, not allowed in UTF-8).

So WTF-8 is identical to UTF-8 for all valid UTF-16 input, but it can also round-trip invalid UTF-16. That is the ultimate goal.
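
To pin down those distinctions, a small demonstration using Python's surrogatepass error handler (only as a stand-in to show the byte-level differences; Python does not literally implement CESU-8, GUTF-8 or WTF-8):

    >>> big = "\U00010000"                         # a supplementary code point
    >>> big.encode("utf-8")                        # UTF-8 and WTF-8: the canonical 4-byte form
    b'\xf0\x90\x80\x80'
    >>> pair = "\ud800\udc00"                      # the same character as a surrogate pair
    >>> pair.encode("utf-8", "surrogatepass")      # the CESU-8/GUTF-8-style 3+3 byte form
    b'\xed\xa0\x80\xed\xb0\x80'
    >>> "\ud800".encode("utf-8", "surrogatepass")  # WTF-8's one extension: a lone surrogate
    b'\xed\xa0\x80'
    >>> # strict "\ud800".encode("utf-8") raises UnicodeEncodeError: surrogates not allowed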

By the way, one thing that was slightly unclear to me in the doc. In section 4.2 (https://simonsapin.github.io/wtf-8/#encoding-ill-formed-utf-...):

> If, on the other hand, the input contains a surrogate code point pair, the conversion will be incorrect and the resulting sequence will not represent the original code points.

It might be clearer to say: "the resulting sequence will not represent the surrogate code points." It might be by some fluke that the user actually intends the UTF-16 to interpret the surrogate sequence that was in the input. And this isn't really lossy, since (AFAIK) the surrogate code points exist for the sole purpose of representing surrogate pairs.

The more interesting case here, which isn't mentioned at all, is that the input contains unpaired surrogate code points. That is the case where the UTF-16 will actually end up being ill-formed.


The encoding that was designed to be fixed-width is called UCS-2. UTF-16 is its variable-length successor.

hmmm... wait... UCS-2 is just a broken UTF-16?!?!

I thought it was a distinct encoding and all related issues were largely imaginary provided you /only/ handle things right...

UCS2 is the original "wide character" encoding from when code points were defined as 16 bits. When codepoints were extended to 21 bits, UTF-16 was created as a variable-width encoding compatible with UCS2 (so UCS2-encoded data is valid UTF-16).

Sadly, systems which had previously opted for fixed-width UCS2, exposed that detail as part of a binary layer and wouldn't break compatibility, couldn't keep their internal storage at 16 bit code units and move the external API to 32.

What they did instead was keep their API exposing 16 bit code units and declare it was UTF16, except most of them didn't bother validating anything, so they're really exposing UCS2-with-surrogates (not even surrogate pairs, since they don't validate the data). And that's how you find lone surrogates traveling through the stars without their mate and shit's all fucked up.

The given history of UTF-16 and UTF-8 is a bit muddled.

> UTF-16 was redefined to be ill-formed if it contains unpaired surrogate 16-bit code units.

This is incorrect. UTF-16 did not exist until Unicode 2.0, which was the version of the standard that introduced surrogate code points. UCS-2 was the 16-bit encoding that predated it, and UTF-16 was designed as a replacement for UCS-2 in order to handle supplementary characters properly.

> UTF-8 was similarly redefined to be ill-formed if it contains surrogate byte sequences.

Not really true either. UTF-8 became part of the Unicode standard with Unicode 2.0, and so incorporated surrogate code point handling. UTF-8 was originally created in 1992, long before Unicode 2.0, and at the time was based on UCS. I'm not really sure it's relevant to talk about UTF-8 prior to its inclusion in the Unicode standard, but even so, encoding the code point range D800-DFFF was not allowed, for the same reason it was not actually allowed in UCS-2, which is that this code point range was unallocated (it was in fact part of the Special Zone, which I am unable to find an actual definition for in the scanned dead-tree Unicode 1.0 book, but I haven't read it cover-to-cover). The distinction is that it was not considered "ill-formed" to encode those code points, so it was perfectly legal to receive UCS-2 that encoded those values, process it, and re-transmit it (as it's legal to process and retransmit text streams that represent characters unknown to the process; the assumption is that the process that originally encoded them understood the characters). So technically yes, UTF-8 changed from its original definition based on UCS to one that explicitly considers encoding D800-DFFF as ill-formed, but UTF-8 as it has existed in the Unicode Standard has always considered it ill-formed.

> Unicode text was restricted to not contain any surrogate code point. (This was presumably deemed simpler than only restricting pairs.)

This is a bit of an odd parenthetical. Regardless of encoding, it's never legal to emit a text stream that contains surrogate code points, as these points have been explicitly reserved for the use of UTF-16. The UTF-8 and UTF-32 encodings explicitly consider attempts to encode these code points as ill-formed, but there's no reason to ever allow it in the first place, as it's a violation of the Unicode conformance rules to do so. Because there is no process that can possibly have encoded those code points in the first place while conforming to the Unicode standard, there is no reason for any process to attempt to interpret those code points when consuming a Unicode encoding. Allowing them would just be a potential security risk (which is the same rationale for treating non-shortest-form UTF-8 encodings as ill-formed). It has nothing to do with simplicity.


Source: https://news.ycombinator.com/item?id=9611710
