Graphemes: A Better Way of Counting Characters

At the risk of stating the painfully obvious, core web technologies were not built with emojis in mind.

😲

This is no surprise to you if you’ve ever been in the business of counting characters. Take, for instance, the maxlength attribute on an input tag.

<input id="my-input" type="text" maxlength=16 />

Input this family of four emoji:

👨‍👨‍👧‍👦

Guess how many characters that is?

1? 4? 8?

Try 11.

11 characters. No foolin’. You can’t even enter that emoji twice in that maxlength 16 input tag.

To be fair, this problem predates the de-evolution of the human language heralded by the emoji. For anyone who knows a non-Latin based language, this is probably not news. Here, for example, is the Tibetan symbol for “Om”:

ཨོཾ

That’s 3 characters.
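
Both counts can be checked in a browser console or in Node. This is only a small sketch with made-up variable names, and the emoji are written as escape sequences so the snippet survives copy and paste. JavaScript's `length` counts UTF-16 code units, which, as far as I can tell, is the same unit `maxlength` counts in.

```javascript
// Family of four: man, joiner, man, joiner, girl, joiner, boy.
const family = "\u{1F468}\u{200D}\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F466}";
// Tibetan "Om": a base letter followed by two combining marks.
const om = "\u{0F68}\u{0F7C}\u{0F7E}";

// String.prototype.length counts UTF-16 code units.
const familyUnits = family.length;          // 11
const omUnits = om.length;                  // 3

// Two families need 22 units, which is why they cannot fit in maxlength 16.
const twoFamilies = family.repeat(2).length; // 22

// Spreading a string walks it by code point instead of by code unit.
const familyCodePoints = [...family].length; // 7

console.log({ familyUnits, omUnits, twoFamilies, familyCodePoints });
```

Neither number is the "1" a person would say when looking at the screen.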

This is minimally interesting, but why should I care?

If you’re a developer, you’re going to trip up on this at one point or another—probably on a password input field.

Password fields enforce character limits. Otherwise, you could theoretically set your password to the King James Bible. The whole thing.

But what is a character? That family of four, which renders as a single graphic—that’s 11 characters? By what Seussian logic is that counted? It doesn’t make any sense.

Well, on a purely technical level, it does. If you keep reading—god help you—you’ll find out why. But on a much more basic, logical level, it’s kind of bonkers. And largely to avoid this question of what a character even is, a great majority of password fields only accept Latin characters.

But wouldn’t it be great if all characters were allowed in our input fields, including passwords? If you had the option, wouldn’t you want to create a password that was a lion, a frog, and a donut, with a few mathematical operators thrown in just for good measure?

๐Ÿฆ+๐Ÿธ=๐Ÿฉ

It’s a delightful, strong password—easy to remember, but extraordinarily difficult to guess.

Here in the West, the emoji has merely made acute what has been a long-standing issue for much of the world’s non-Latin based language users. This is bigger than the emoji: it’s about language equity.

To support non-Latin characters as a developer, however, takes a bit of work. You have to understand—if only a little bit—some of the fundamentals.

Lay a foundation if you must

You’re not going to like this section, but here it is anyway.

A byte is a unit of digital information, 8 bits on any machine you are likely to meet. One byte is enough to encode the basic Latin alphabet. To expand beyond that, to characters from other languages, requires multiple bytes.

Most emojis, for the record, take 4 bytes in UTF-8.

This is why you may have had to make some MySQL character set changes at one point or another. The utf8mb4 character set supports 4-byte characters like emojis. MySQL’s older utf8 character set, which stops at 3 bytes per character, does not.

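So, if you’re wondering why the letter “J” is a single character but 😎 is two, here’s your dry and very technical answer: JavaScript and the maxlength attribute count in 16-bit units, “J” fits in one of them, and 😎 is too big and needs a pair. You probably shouldn’t bust out this information at a dinner party. You won’t make any friends.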
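
To put numbers on that, here is a rough sketch, assuming a runtime with the standard `TextEncoder` (browsers and current Node both have it). `utf8Bytes` is a made-up helper name. It prints the UTF-8 byte count next to the `length` JavaScript reports, which counts 16-bit UTF-16 units rather than bytes.

```javascript
// Hypothetical helper: how many bytes a string occupies when encoded as UTF-8.
const utf8Bytes = (s) => new TextEncoder().encode(s).length;

const letterJ = "J";
const sunglasses = "\u{1F60E}"; // smiling face with sunglasses

console.log(utf8Bytes(letterJ), letterJ.length);       // 1 1
console.log(utf8Bytes(sunglasses), sunglasses.length); // 4 2
```

The 4 is the number a database column cares about; the 2 is the number a character counter in the browser reports.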

Back to the 11 character family emoji, let’s take a look under the hood. UniView is a fascinating web app that will break it down for us.

👨‍👨‍👧‍👦 is actually four distinct two-character emojis stitched together using three one-character zero width joiners. Add it all up and you get 11 characters.
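
The same breakdown can be reproduced in a console. A minimal sketch, where `breakDown` is a hypothetical helper name and the emoji is spelled out as escape sequences:

```javascript
const family = "\u{1F468}\u{200D}\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F466}";

// Hypothetical helper: list each code point with the UTF-16 units it occupies.
function breakDown(str) {
  // Array.from walks a string by code point, so surrogate pairs stay together.
  return Array.from(str, (ch) => ({
    codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase(),
    units: ch.length,
  }));
}

const parts = breakDown(family);
const total = parts.reduce((sum, p) => sum + p.units, 0);

console.log(parts.map((p) => `${p.codePoint} (${p.units})`).join(" "));
// U+1F468 (2) U+200D (1) U+1F468 (2) U+200D (1) U+1F467 (2) U+200D (1) U+1F466 (2)
console.log(total); // 11
```

U+200D is the zero width joiner; the other four entries are the people.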

I could go on. We could talk about code points and really dig into different languages. But Manish Goregaokar has already written about all of that, and he is several orders of magnitude smarter than I am. Truthfully, I’m out of my depth beyond this point, and you’re probably intensely bored besides.

So what’s graspable and relevant for me?

Let’s step back from the cliff and try to solve an eminently solvable problem—how to count characters in a smarter way.

To begin with, we have to drop the word “character”. New term: grapheme.

Merriam-Webster provides a lovely definition using other words you’ve also never heard of, but in short, a grapheme is what you thought a character was. It’s one collection of chicken scratches that’s segmentable (often by a small amount of empty space) from the next collection of chicken scratches. Or in the case of emojis, one picture of a wizard separated from another picture of a barber pole.

🧙‍♂️💈

All of this is meant to say, don’t count characters; count graphemes.

Believe it or not, PHP has provided native grapheme functions for over 10 years, since version 5.3. They live in the intl extension, so make sure it’s enabled. grapheme_strlen will get the job done nicely.

Here’s a snippet of PHP server side validation that ensures a description being committed to a MySQL table never exceeds a 256 grapheme limit:

// Count graphemes, not bytes or code points, and trim anything past 256.
if (grapheme_strlen($desc) > 256) {
	$desc = grapheme_substr($desc, 0, 256);
}

JavaScript is another beast. For years there were no native functions for graphemes, and depending on the browsers you support, you may still need a library. I’m using Grapheme Splitter on one of my web apps. It’s been terrific so far. Here’s how I’m enforcing a 16 grapheme limit on one particular input field (jQuery in use):

var npnameMaxChar = 16; // grapheme limit
var charSplitter = new GraphemeSplitter(); // initialize the Grapheme Splitter library
$('#notepad-input').on({
	'input': function() {
		var value = $(this).val();
		// only rewrite the field once it holds more graphemes than the limit
		if (charSplitter.countGraphemes(value) > npnameMaxChar) {
			$(this).val(charSplitter.splitGraphemes(value).slice(0, npnameMaxChar).join(''));
		}
	}
});
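
Newer JavaScript engines do ship a built-in segmenter, `Intl.Segmenter`, which can stand in for the library where it is available. The following is a minimal sketch under that assumption, not a drop-in replacement for the code above; `toGraphemes` and `truncateGraphemes` are hypothetical helper names.

```javascript
// Assumes the runtime provides Intl.Segmenter.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Hypothetical helper: split a string into an array of grapheme clusters.
function toGraphemes(str) {
  return Array.from(segmenter.segment(str), (s) => s.segment);
}

// Hypothetical helper: keep at most `limit` graphemes, never cutting one in half.
function truncateGraphemes(str, limit) {
  const graphemes = toGraphemes(str);
  return graphemes.length > limit ? graphemes.slice(0, limit).join("") : str;
}

const family = "\u{1F468}\u{200D}\u{1F468}\u{200D}\u{1F467}\u{200D}\u{1F466}";
const wizardAndPole = "\u{1F9D9}\u{200D}\u{2642}\u{FE0F}\u{1F488}";

console.log(toGraphemes(family).length);        // 1
console.log(toGraphemes(wizardAndPole).length); // 2
console.log(truncateGraphemes("ab" + family + "cd", 3) === "ab" + family); // true
```

Whether this can replace the library depends on the browsers you need to support, so treat it as an option to evaluate rather than a recommendation.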

Is this a perfect way of going about it? No. Perfection is not attainable here.

Again, I’m going to refer back to Manish’s excellent article—specifically the section on grapheme clusters. I can’t possibly state the problem better than he has. The concept of a “user-perceived character” is indeed a “nebulous” one. But I believe the grapheme is a more perfect method of counting characters than any other.

Yes, it takes more code. The HTML maxlength attribute—such an easy solution for an emoji-less, Latin-centric, insular world—is not viable any longer, if it ever was. It was never inclusive of other languages, and it never dreamed of an age when someone might want to pictorially represent a mosquito eating a moon cake.

🥮🦟

But here we are. We can do better. We should do better.

Viva la revolución!
