Please, please, never blindly delete chunks of text, especially not just because you can't see or understand them; it destroys information. Someone put them there for a reason; tweets containing emoji often don't make any sense without the emoji.
For what it's worth, what you're seeing isn't really "binary"; it's most likely a small square with the Unicode codepoint spelled out in hex. For 💜, that's U+1F49C, so you probably see a box containing 01F49C. This is how Unicode characters are rendered when none of your installed fonts have glyphs for them.
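If you want to check which codepoint one of those boxes stands for, you can ask JavaScript. This is only a rough sketch: it assumes an ES2015+ engine (for `codePointAt`), and the helper name `describeCodepoint` is made up for illustration.

```javascript
// Hypothetical helper: format the first codepoint of a string as "U+XXXX".
function describeCodepoint(ch) {
    var hex = ch.codePointAt(0).toString(16).toUpperCase();
    // Pad to at least four digits, as is conventional for U+ notation
    while (hex.length < 4) {
        hex = "0" + hex;
    }
    return "U+" + hex;
}

// "\uD83D\uDC9C" is the purple heart written as escapes
console.log(describeCodepoint("\uD83D\uDC9C"));  // U+1F49C
```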
To actually see the characters, you have several options.
1. Get the Symbola font and install it. Now you can see emoji. But no one else can.

2. Get Symbola, and add it to your website with a @font-face block like this:

       @font-face {
           font-family: Symbola;
           src: url('Symbola.ttf') format('truetype');
           unicode-range: U+1F???;
       }

   Then set your page's font with a declaration like this: font-family: Symbola, "your preferred font", sans-serif;

   The downside is that unicode-range is the only thing restricting Symbola to emoji. Symbola also has glyphs for ordinary text, so in browsers that don't support unicode-range (older versions of Firefox, for example), listing it first will render your entire page in the not particularly beautiful Symbola.

   You could sorta hack around this by finding all the emoji and wrapping them in a <span class="emoji">, then only using Symbola for .emoji elements.

3. Find all the emoji and replace them with <img> tags, like Twitter does. Twitter's images are all at URLs containing the codepoint, e.g. https://abs.twimg.com/emoji/v1/72x72/1f43e.png, so just reusing those is easy enough. (I'm a little surprised the Twitter API won't do this for you, actually.)
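The image URL pattern mentioned above is simple enough to wrap in a tiny helper. This is a sketch only: the function name `emojiImageUrl` is hypothetical, and it assumes every emoji image follows the same pattern as the one example URL, which I haven't verified.

```javascript
// Hypothetical helper: build an image URL from a numeric codepoint,
// assuming the lowercase-hex pattern seen in the example URL holds in general.
function emojiImageUrl(codepoint) {
    return "https://abs.twimg.com/emoji/v1/72x72/" + codepoint.toString(16) + ".png";
}

console.log(emojiImageUrl(0x1F43E));
// https://abs.twimg.com/emoji/v1/72x72/1f43e.png
```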
If you want to find and replace all emoji, you probably want to just look for all astral-plane characters — i.e., those not in the Basic Multilingual Plane where modern human languages live. These are all characters with codepoints of U+10000 and above.
In JavaScript, strings aren't really strings; they're arrays of 16-bit numbers. 16 bits is four hex digits, so Unicode codepoints that have five hex digits won't fit in a single 16-bit number. Instead, JavaScript encodes them with the terrible UTF-16 encoding, which uses two 16-bit numbers: one in the range 0xD800 to 0xDBFF and one in the range 0xDC00 to 0xDFFF. The two numbers together are called a "surrogate pair". None of these numbers will ever be real Unicode codepoints; the entire block is reserved for this encoding.
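To make the encoding concrete, here's a minimal sketch of going from a codepoint to a surrogate pair and back. The function names are made up for illustration, and it assumes the input really is an astral codepoint (U+10000 or above); there's no validation.

```javascript
// Hypothetical helper: split an astral codepoint into [high, low] surrogates.
function toSurrogatePair(codepoint) {
    var offset = codepoint - 0x10000;          // 20 bits remain
    var high = 0xD800 + (offset >> 10);        // top 10 bits
    var low = 0xDC00 + (offset & 0x3FF);       // bottom 10 bits
    return [high, low];
}

// Hypothetical helper: recombine two surrogates into one codepoint.
function fromSurrogatePair(high, low) {
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

var pair = toSurrogatePair(0x1F49C);
console.log(pair[0].toString(16) + " " + pair[1].toString(16));  // d83d dc9c
console.log(fromSurrogatePair(pair[0], pair[1]).toString(16));   // 1f49c
```

Note the decode adds 0x10000 rather than ORing it in; the two only agree when bit 16 of the offset happens to be clear, which is true for the emoji block but not for, say, U+20000.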
To find all the astral plane characters, you actually want to find all the surrogate pairs:
    /[\uD800-\uDBFF][\uDC00-\uDFFF]/
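As a quick sanity check, here's a small sketch that uses that pattern (with `\u` escapes and the `g` flag) to list every astral character in a string. The name `findAstral` is hypothetical, and it assumes the string is well-formed UTF-16, i.e. no lone surrogates.

```javascript
// Hypothetical helper: return every surrogate pair found in the text.
function findAstral(text) {
    // match() with a global regex returns all matches, or null if none
    return text.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
}

// Two astral characters (U+1F49C and U+1F43E), written as escapes
console.log(findAstral("a\uD83D\uDC9Cb\uD83D\uDC3E").length);  // 2
console.log(findAstral("plain old BMP text").length);          // 0
```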
So an implementation of Twitter's image replacement might look like this:
    var text = "hey babe 💜 how you doin";

    // Split on surrogate pairs, and preserve the surrogates; this will give
    // you an array that alternates between BMP text and a single surrogate
    // pair: [text, emoji, text, emoji, text...]
    var chunks = text.split(/([\uD800-\uDBFF][\uDC00-\uDFFF])/);

    // A DocumentFragment is a DOM tree that can be manipulated freely without
    // causing a reflow, so it's more performant for heavy tree-building and a
    // good habit to get into
    var frag = document.createDocumentFragment();
    for (var i = 0, l = chunks.length; i < l; i++) {
        if (i % 2 === 0) {
            // Even-numbered chunks are plain text
            frag.appendChild(document.createTextNode(chunks[i]));
        }
        else {
            // Odd-numbered chunks are surrogate pairs
            // We have TWO characters, but we want one codepoint; this is how
            // you decode UTF-16 :(
            var pair = chunks[i];
            // Add 0x10000 rather than ORing it in: the shifted high
            // surrogate can itself have bit 16 set (e.g. for U+20000)
            var codepoint = 0x10000
                + ((pair.charCodeAt(0) - 0xD800) << 10)
                + (pair.charCodeAt(1) - 0xDC00);
            var hex = codepoint.toString(16);  // now it's in hex

            var img = document.createElement('img');
            img.src = "https://abs.twimg.com/emoji/v1/72x72/" + hex + ".png";
            // Twitter uses pretty big images and just scales them down
            // clientside; you could change these to whatever you want, or add
            // a class here and use CSS to set the width/height to 1em to
            // match the current font size
            img.height = 16;
            img.width = 16;
            frag.appendChild(img);
        }
    }

    // Now just stick it into the page somewhere
    var el = document.createElement('p');
    el.appendChild(frag);
    document.body.appendChild(el);
This creates an <img> as per option 3, but you could also easily add a <span class="emoji"> and go with option 2. Or do whatever else you want, like replace emoji with their Unicode names. (Twitter has the Unicode names as title on each image, but that's not done here because it requires including a huge list mapping codepoints to names.)
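For the `<span class="emoji">` route, a rough sketch that works on an HTML string instead of building DOM nodes might look like this. Both helper names are made up, and it assumes you're going to assign the result to something like `innerHTML`, hence the escaping step.

```javascript
// Hypothetical helper: escape text so it's safe to drop into HTML.
function escapeHtml(s) {
    return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical helper: wrap every surrogate pair in <span class="emoji">.
function wrapEmoji(text) {
    return escapeHtml(text).replace(
        /[\uD800-\uDBFF][\uDC00-\uDFFF]/g,
        '<span class="emoji">$&</span>'   // $& is the matched pair itself
    );
}

console.log(wrapEmoji("hey babe \uD83D\uDC9C how you doin"));
```

Pair it with a CSS rule along the lines of `.emoji { font-family: Symbola; }` so that only the wrapped characters use Symbola and the rest of the page keeps your preferred font.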