[CIMC 2015 Part 3] Monsters and Pandas and Tigers, Oh My!

My inner perfectionist is crying that I have to post this, in particular over my pathetic snowclone title, but my inner pragmatist knows that, judging by my old blogging patterns, it’s now or never.

18.06: 56%, haven’t touched it in a while, but I think I can do lots more on the plane.

As a non-contestant, I confess I feel totally uninvested in the results and find the Closing Ceremony boring. All contestants go up, country by country, and have their awards read off. No effort is made to make any sort of buildup to a climax. But maybe this is for the best; we don’t want anybody feeling shafted or discouraged from continuing to do math due to a mere elementary-/middle-school competition. Meanwhile, though, I’m browsing reddit on my phone.

After this ceremony, the entire Taiwan delegation spends some time walking around outside while the guides make confused phone calls trying to decide where we eat lunch. My parents offer me some potato chips they bought somewhere, which are (as the label is really eager to point out) baked, not fried. Some time passes this way; eventually, the guides figure it out and we go through amazingly long queues to eat at the cafeteria, as usual. Then we are sent to a massive shopping mall for the afternoon, a place so large that its exits have number labels that go up into the double digits so that people don’t get lost.

I take trippy failed panorama photos from the bus windows.

[trippy panorama of a shopping mall]

Translation Party

Just a short anecdote for the streak today. Hmm, I guess this developed beyond being just another filler post, which is good.

In addition to preparing my presentation, the other job I have to do for the math competition I’m attending in a week or so (not as a participant, okay?) is translating various guests’ speeches between English and Chinese.

The speeches’ length and formulaicness really get on my nerves, but then again my standards for speeches were skewed upward by Richard Forster’s speeches during the opening and closing ceremony of IOI 2014, but on the gripping hand I don’t think it’s that hard to at least try not to be formulaic and I really can’t see any effort on their part whatsoever. Off the top of my head, pretty much all the speeches tend to go like this:

  1. Welcome!
  2. Math is great!
  3. This competition is great!
  4. The city hosting this competition is great!
  5. The college hosting this competition is great!
  6. You contestants are great!
  7. Good luck!

Except each bullet point is a paragraph that lasts a minute.

(Ninja edit: Which is not to say they didn’t put any effort into their speeches at all, but that much of the effort seems misguided to me. I don’t see how anybody who has been in the audience for one of these speeches can overlook the same flaws in their own. Unless it’s like, at some point in the natural life cycle of the human brain, people spontaneously start enjoying these safe and repetitive speech topics instead of some earnest and maybe lighthearted advice and anecdotes and jokes? Like how people somehow start enjoying spicy stuff, or the bitter flavor of beer and wine, or writing teenage-angsty ranty posts complaining to nobody in particular like this one? Tough questions.)

Anyway. My mom actually does most of the translation but I am the grammar stickler post-processor and we work together on the hard parts. The second hardest things to translate are idioms. The hardest things to translate are quotes. It turns out that lots of people find quotes already translated into Chinese, and it can be incredibly difficult to reconstruct their English versions. Here is the quote that today’s story is about. We were tasked with providing its English translation (or original), and the speech attributed it to 克莱因 (trad.: 克萊因).


Bilingualism II


If you came to this blog or this post hoping to read English, sorry not sorry. It’s only fair, really, given how many people on Facebook can’t read the massive English textwall posts I’ve spammed them with for so long.









Adventures in Unicode Forensics

What do you do when you get a bunch of files like this from a zipfile?

I’ve blurred the messed-up file names because I’m not convinced it’s impossible to reconstruct the Chinese names of people from them and I’d rather err towards being paranoid about privacy. Except for the one file name whose author’s identity I’m OK with disclosing.

Back story: I have been tasked with collecting everybody’s Chinese assignments for this semester. I didn’t even do that part yet (I really should); first the girl who handled last semester’s assignments had to pass on the files she collected to me, and I unzipped the attachment to reveal these file names.

Well, I thought, no big deal, I understand Unicode (somewhat), I’ll just do something like decode them as latin1 and re-encode them as UTF-8, right? Or maybe Big5? In fact, maybe python-ftfy will just automagically fix it for me; I’ve been waiting for a chance to use that, like the dozens of other things on my GitHub star list…

I wish.

ftfy did nothing to the text. Also, UTF-8 was not high on my list because the filenames strongly suggested a double-byte encoding. By eyeballing the filenames and comparing them to the organized naming scheme of some other files that had, thankfully, survived the .zip intact, I noted that every Chinese character had been replaced with two characters, while underscores had survived unscathed. It was a simple substitution cipher with a lot of crib text available. So that rules out UTF-8, the most common encoding, but it gives me a lot of information, so this can’t be hard, right?
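To make the two-characters-for-one reasoning concrete, here is a minimal Python 3 sketch (the code I actually ran back then was Python 2). The sample file name and the `bytes_per_char` helper are made up for illustration; the point is only that a double-byte encoding like Big5 spends two bytes on each Chinese character and one on an underscore, while UTF-8 spends three on the same characters.

```python
# Python 3 sketch; the sample name is invented, not one of the real files.
sample = "邊點_作業.pdf"

def bytes_per_char(text, encoding):
    """Pair each character with the number of bytes it occupies in `encoding`."""
    return [(ch, len(ch.encode(encoding))) for ch in text]

big5_widths = bytes_per_char(sample, "big5")
utf8_widths = bytes_per_char(sample, "utf_8")

print(big5_widths)  # Chinese characters take 2 bytes each, ASCII takes 1
print(utf8_widths)  # the same Chinese characters take 3 bytes each
```

If each byte then gets misread as one character of a single-byte encoding, a Big5 or Shift-JIS name shows exactly two junk characters per original character, which is the pattern in the file names.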

One of the files that I knew belonged to me came out as such, according to Python’s os.listdir:


After codecs.decode(_, 'utf_8') we get:

Attempt at UTF-8 rendition: í¬îåâ∂_èWñn.pdf

That’s really weird; those codepoints are kinda large, and there’s a literal i at the start, and it doesn’t look at all like the two-characters-for-one pattern we noticed from staring at the plaintext. Oh, they’re using combining characters. How do we fix that?

Fumble, mumble, search. Oh, I want unicodedata.normalize('NFC', _).
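For anyone who hasn't met combining characters before, a small Python 3 sketch of what that call does. The string here is a stand-in I picked, just the "literal i plus accent" part, not the real file name.

```python
import unicodedata

# "i" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one visible glyph
decomposed = "i\u0301"

# NFC composes the pair into the single precomposed code point U+00ED
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print([hex(ord(ch)) for ch in decomposed], [hex(ord(ch)) for ch in composed])
```

Going the other way, `unicodedata.normalize("NFD", composed)` splits it back into the two-code-point form.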

Attempt at UTF-8 rendition: í¬îåâ∂_èWñn.pdf

Although the byte sequences are totally different, they look the same, which is the point. *holds up fist* This… is… UNICODE!!!

Anyway, the code points in the normalized version make more sense. Except for that conspicuous \u2202 or ∂, of course. Indeed, although it looks promising, if we now try e.g. codecs.encode(_, 'windows-1252') we get:

UnicodeEncodeError: 'charmap' codec can't encode character u'\u2202' in position 5: character maps to <undefined>

One can get around this by passing a third argument to encode to make it ignore or replace the invalid parts, but the result of the first couple of codepoints, after further pseudomagical decoding and encoding, is still nonsense. Alas.
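A quick Python 3 sketch of that third argument, using just two characters of the name as a stand-in. The variable names are mine; `"replace"` and `"ignore"` are the standard error handlers.

```python
import codecs

# "â" exists in windows-1252, U+2202 (the partial-derivative sign) does not
text = "\u00e2\u2202"

try:
    codecs.encode(text, "windows-1252")
except UnicodeEncodeError as err:
    print("strict mode failed:", err.reason)

# The third argument picks an error handler instead of raising
replaced = codecs.encode(text, "windows-1252", "replace")  # unencodable -> b"?"
ignored = codecs.encode(text, "windows-1252", "ignore")    # unencodable -> dropped

print(replaced, ignored)
```

Either way the byte that the ∂ stood for is gone for good, which is part of why the output downstream was still nonsense.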

I continued to try passing the filenames in and out of codecs.encode and codecs.decode with various combinations of utf_8 and latin_1 and big5 and windows-1252, to no avail.
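In hindsight the trial and error could have been a loop. Here is a minimal Python 3 sketch of that idea, not the code I actually ran: `try_recodings` and the demo file name are hypothetical, and the codec list is just the ones mentioned in this post.

```python
from itertools import product

CODECS = ["utf_8", "latin_1", "big5", "windows-1252", "shift_jis", "euc_jp"]

def try_recodings(garbled, codecs_to_try=CODECS):
    """Return (encode_with, decode_with, result) for every pair that raises no error."""
    hits = []
    for enc, dec in product(codecs_to_try, repeat=2):
        if enc == dec:
            continue  # same codec both ways just hands back the input
        try:
            hits.append((enc, dec, garbled.encode(enc).decode(dec)))
        except UnicodeError:
            pass  # covers both UnicodeEncodeError and UnicodeDecodeError
    return hits

# Demo on mojibake manufactured on purpose: Big5 bytes misread as latin_1
original = "邊點_作業.pdf"
garbled = original.encode("big5").decode("latin_1")

for enc, dec, result in try_recodings(garbled):
    print(enc, "->", dec, repr(result))
```

A human still has to eyeball which candidate is actually Chinese, and the loop only helps if the right pair is in the list, which for my files it evidently was not.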

Then I did other stuff. Homework, college forms, writing a bilingual graduation song, talking to other dragons about 1994 video games, and did I mention I started taking driving lessons now? Yeah. That sort of thing.

Around that last thing, I asked the previous caretaker and learned that the computer she collected these files on had a Japanese locale. So I added shift_jis and euc_jp to the mix, but still nothing.

I later tried unzipping it with unzip from the command-line — the results were even worse — as well as unzipping it from a Windows computer — even worse than the command line.

So the problem remained, until…

The resolution is very anticlimactic; it took me half a week to think of getting at the files programmatically straight from the zip archive, instead of unzipping first. Python spat out half the file names from the zip file as unicode and the other half as str. From there it was easy guessing.
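For the curious, here is roughly what that looks like, as a Python 3 sketch rather than my original script. Python 3's `zipfile` behaves differently from what I described: names flagged as UTF-8 come back decoded, and everything else is decoded as cp437, so the raw bytes can be recovered by encoding back to cp437. The helper names and the `legacy_encoding` parameter are my own, and which encoding to pass is exactly the part you still have to guess.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the entry name is stored as UTF-8

def recover_name(name, flag_bits, legacy_encoding):
    """Undo zipfile's cp437 fallback and decode the raw bytes with `legacy_encoding`."""
    if flag_bits & UTF8_FLAG:
        return name  # zipfile already decoded this one as UTF-8
    return name.encode("cp437").decode(legacy_encoding)

def list_names(zip_path, legacy_encoding):
    """Read entry names straight from the archive without extracting anything."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            recover_name(info.filename, info.flag_bits, legacy_encoding)
            for info in archive.infolist()
        ]
```

Something like `list_names("assignments.zip", "shift_jis")` (hypothetical file name) would then print candidate names, and a wrong guess for the encoding either raises `UnicodeDecodeError` or gives visibly wrong characters.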

There were still three file names that failed, but they were easy to fix manually.

I’m honestly not sure anymore why I eventually decided to blog about this. Especially compared to my 12 other drafts. Oops.

ETA: I thought this script would be a one-trick pony but, amazingly, I ended up using it again the day after, this time with Big5 after I copied some files to a Windows laptop, converted them to .pdf, and sent them back in a .zip.

ETA2: This script came in handy again after I copied a zip file to an Ubuntu desktop computer with an actual CD drive so I could burn everything to a disc!

Pronunciation Stereotypes and the Uncrackable IPA Code

Disclaimer: just because a significant number of people in group A (esp. of a certain race/ethnicity) also have quality B does not mean that (i) all or most people of group A have quality B or (ii) people of group A who do not have quality B are in any way strange or inferior.

In other words, stereotypes are stupid; don’t apply them to real people.

The stereotypical “Asian” (a person from “Asia”, a mythical faraway continent consisting of two countries, China and Japan) is too hard-working, gets disowned for any grade below an A, has infinitesimally thin eyeslits, and pronounces L’s and R’s identically.

*jumps at opportunity to find and use .gif seen on Reddit without understanding any context*

The internet says the L/R thing is mostly due to Japanese having only a single sound somewhere in between those two. Wikipedia has a page on Japanese phonology which seems to support this. Still, Wikipedia articles on phonology all consist of giving every sound a long incomprehensible name, such as the “apical postalveolar flap undefined for laterality” for the Japanese sound discussed above, and I’m not Japanese, so don’t take my word for it.

Mandarin Chinese (blatantly ignoring the myriad dialect variations) has a perfect L sound (ㄌ) and an R sound (ㄖ) that is only a little different. Of course, there are people who still pronounce them identically, but it’s not common — generally, the language teaches L’s and R’s well. Right?


So, as triggered by my confrontation with the Chinese book report (remember? whatever the answer is, it’s okay): a reflection on my incompetence at dealing with two languages, and why this matters, or not.

I can think in both languages. It’s a natural product of our school environment. The two languages often have to complement each other; most of the nerdy terms or globally relevant allusions are English-exclusive (I couldn’t talk coherently about SOPA in any language other than English!), but a lot of cultural and geographical staples around here are Chinese only. And sometimes there are unexpected holes where an innocuous-looking phrase simply has a few too many connotations to translate perfectly (the example I always get stuck on, and have yet to solve satisfactorily with anything short of a full sentence recasting, is “appreciate”.)

Also for some reason when I consciously make myself cross the language rift my internal subvocalization process gains this elaborate mainland accent with extra effort to make the retro-alveo-something sounds emphasized. That is something I can’t talk about coherently in either language alone. Tada.

But all things considered, my thought and writing process seems more optimized for English. I think the dramatic difference in the subjects of study in our education is a big source of the problem: namely, too much Classical Chinese stuff. I concede, learning it for the heritage and historical background is important and completely justified, and there are a lot of big-picture literary techniques that can be applied no matter which dialect of language one decides to use. Still, being able to apply them in a language-independent manner is far from trivial. And everybody is so serious in these passages, just going on about how to be wise, use money and time well, or how beautiful the frozen lake is in winter (so the consensus is I’m severely deficient in aesthetic percepts as well, but that’s a topic for another post). Only the best of the best parts of the important wise people’s writings made it into these books, but we’re not all important wise people and you can’t expect us to write in this manner all the time! Nothing even vaguely outlandish or imaginative like (pulling something out of a hat here) “Harrison Bergeron”. This term’s Chinese textbook has just seven passages, five classical and two vernacular. And I simply don’t believe anybody is actually expecting us to learn the ins and outs of Classical Chinese to write it! There aren’t any authors publishing books in the dialect. It’s important and memorable and significant to our heritage, all undeniable points that I concede, but it’s dead, as harsh as it feels saying that.

I don’t know how much of this is my own fault for not reading as much Chinese “extracurriculars”. I wouldn’t be surprised if I’m already stuck in a confirmation bias feedback loop. There is definitely much more hype and many more options if one is looking for a foreign-language bestseller (even after translation) than for a native one. And just maybe, there really aren’t enough cool or attractive authors for me, because my jargon-infested computer-reliant hyperlinked niche is already too firmly wedged on the other side of the cultural barrier. Being a technologically up-to-date nerd is just so much easier following the Western world where everybody else is.

And, while we’re on the technological bits (no pun intended): the Internet is still not quite free of its roots in seven-bit ASCII. Of course we’re moving away, forced by the waves of globalization (Google says 60% of the web uses Unicode), but it’s still far from a completely idiot-proof system. Pentadactyl isn’t playing nice with me over here with two foreign characters. I should fix this except I don’t think I’d make any progress.

On ease of writing and file size: because of less redundancy and more versatile combination, I’m pretty sure Chinese is a little bit more compact if considered as just a sequence of bytes, even under reasonable UTF encoding. But when writing words out, there’s a lot of room for contention, and as a math nerd nothing makes me more frustrated than graph theory. Take a look: 邊 (“edge”/“side”) has 19 zarking strokes and takes me three times as long to write out as either the English word or the simplified character (6 strokes). 點 (naturally “point”/“vertex”), 17 versus 9. And when I have to write an entire proof with these characters again and again, well... use your imagination. I still feel some loyalty to preserving tradition and keeping the writing in the traditional format, but considerations like these make me admit that the whole simplification thing has some very good points. (Then, who knows how much physical writing I should expect to need to perform in the rest of my life, as the computers take over?)

But on the flip side, English is really a clusterfudge of a language, too! How can anybody tolerate pronouncing “colonel” the way we do? Seriously? It’s idiotic and that’s a profound understatement. Okay, this post is basically nothing but understatements, or maybe I’m just always a hyperbolic writer. In any case, I need to stop overcorrecting this post like I always do.

Yes, that’s all. Everybody should switch to Esperanto or something, just like the Dvorak keyboard layout or base-6 number system.