Merge pull request #1416 from alexcrichton/js-string-valid-utf16

Add warnings about UTF-16 vs UTF-8 strings
Alex Crichton
2019-04-05 10:12:32 -05:00
committed by GitHub
6 changed files with 89 additions and 1 deletions


@@ -20,3 +20,30 @@ with handles to JavaScript string values, use the `js_sys::JsString` type.
```js
{{#include ../../../../examples/guide-supported-types-examples/str.js}}
```
## UTF-16 vs UTF-8
Strings in JavaScript are encoded as UTF-16, but with one major exception: they
can contain unpaired surrogates. For some Unicode characters UTF-16 uses two
16-bit values. These are called "surrogate pairs" because they always come in
pairs. In JavaScript, it is possible for these surrogate pairs to be missing the
other half, creating an "unpaired surrogate".
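To make the idea of surrogate pairs concrete, here is a minimal sketch using only the Rust standard library (no `js_sys`). The helper names `utf16_units` and `is_surrogate` are hypothetical and exist only for illustration.

```rust
/// Encode a single `char` as UTF-16 code units (one or two `u16` values).
/// Hypothetical helper, for illustration only.
fn utf16_units(c: char) -> Vec<u16> {
    let mut buf = [0u16; 2];
    c.encode_utf16(&mut buf).to_vec()
}

/// True if `unit` lies in the surrogate range 0xD800..=0xDFFF.
/// Hypothetical helper, for illustration only.
fn is_surrogate(unit: u16) -> bool {
    (0xD800..=0xDFFF).contains(&unit)
}
```

For example, `utf16_units('a')` yields a single unit, while `utf16_units('\u{1F600}')` yields the pair `0xD83D, 0xDE00`. Either half on its own is an unpaired surrogate, which is why `String::from_utf16(&[0xD83D])` returns an error in Rust.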
When passing a string from JavaScript to Rust, it uses the `TextEncoder` API to
convert from UTF-16 to UTF-8. This is normally perfectly fine... unless there
are unpaired surrogates. In that case it will replace the unpaired surrogates
with U+FFFD (<28>, the replacement character). That means the string in Rust is
now different from the string in JavaScript!
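The effect of this replacement can be reproduced in plain Rust. The sketch below is not the code path the bindings take (that goes through `TextEncoder` on the JavaScript side); it assumes that the standard library's `String::from_utf16_lossy` is a reasonable stand-in, since it also substitutes U+FFFD for each unpaired surrogate. The function name `lossy_to_string` is hypothetical.

```rust
/// Convert UTF-16 code units to a Rust `String`, replacing every unpaired
/// surrogate with U+FFFD. This is a stand-in that imitates the observable
/// result of the JS-to-Rust conversion, not the actual implementation.
fn lossy_to_string(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}

/// Re-encode a Rust string as UTF-16 so it can be compared with the input.
fn reencode(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}
```

Given the units `[0x0061, 0xD83D, 0x0062]` (an `a`, a lone high surrogate, and a `b`), `lossy_to_string` produces `"a\u{FFFD}b"`. Re-encoding that string gives `[0x0061, 0xFFFD, 0x0062]`, which no longer matches the original units.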
If you want to guarantee that the Rust string is the same as the JavaScript
string, you should instead use `js_sys::JsString` (which keeps the string in
JavaScript and doesn't copy it into Rust).
If you want to access the raw value of a JS string, you can use `JsString::iter`,
which returns an `Iterator<Item = u16>`. This perfectly preserves everything
(including unpaired surrogates), but it does not do any encoding (so you
have to do that yourself!).
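As a rough sketch of what "doing the encoding yourself" might look like, the function below decodes any source of `u16` code units with `std::char::decode_utf16`. It is written against a generic `IntoIterator<Item = u16>` so that it compiles without `js_sys`; the assumption is that the iterator returned by `JsString::iter` could be passed in the same way. The name `decode_units` is hypothetical.

```rust
use std::char::{decode_utf16, DecodeUtf16Error};

/// Decode UTF-16 code units into a `String`, failing on the first unpaired
/// surrogate instead of silently replacing it.
/// Hypothetical helper; in a wasm-bindgen project the `units` argument is
/// assumed to come from something like `JsString::iter`.
fn decode_units<I>(units: I) -> Result<String, DecodeUtf16Error>
where
    I: IntoIterator<Item = u16>,
{
    // Each item is `Ok(char)` or `Err(DecodeUtf16Error)`; collecting into a
    // `Result` stops at the first error.
    decode_utf16(units).collect()
}
```

On failure, `DecodeUtf16Error::unpaired_surrogate` reports the offending code unit, so the caller can decide whether to reject the string, substitute a character, or keep the raw units.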
If you simply want to ignore strings which contain unpaired surrogates, you can
use `JsString::is_valid_utf16` to test whether the string contains unpaired
surrogates or not.
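The "ignore invalid strings" strategy can be sketched in plain Rust as a filter over raw code units. This is only an approximation of the check described above, written with the standard library so it stands alone; `has_unpaired_surrogate` and `valid_strings` are hypothetical names, and in real code the test would presumably be `JsString::is_valid_utf16` on the JavaScript string itself.

```rust
/// True if the code units contain at least one unpaired surrogate.
/// Hypothetical stand-in for the negation of `JsString::is_valid_utf16`.
fn has_unpaired_surrogate(units: &[u16]) -> bool {
    std::char::decode_utf16(units.iter().copied()).any(|r| r.is_err())
}

/// Keep only the inputs that are valid UTF-16 and convert them to `String`.
/// Inputs with unpaired surrogates are skipped rather than altered.
fn valid_strings(inputs: &[Vec<u16>]) -> Vec<String> {
    inputs
        .iter()
        .filter(|units| !has_unpaired_surrogate(units.as_slice()))
        // The lossy conversion is exact here because invalid inputs were
        // already filtered out.
        .map(|units| String::from_utf16_lossy(units.as_slice()))
        .collect()
}
```

Filtering before converting means every string that does reach Rust is identical to its JavaScript counterpart.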


@@ -8,6 +8,9 @@ Copies the string's contents back and forth between the JavaScript
garbage-collected heap and the Wasm linear memory with `TextDecoder` and
`TextEncoder`.
> **Note**: Be sure to check out the [documentation for `str`](str.html) to
> learn about some caveats when working with strings between JS and Rust.
## Example Rust Usage
```rust