[CIMC 2015 Part 3] Monsters and Pandas and Tigers, Oh My!

My inner perfectionist is crying that I have to post this, in particular over my pathetic snowclone title, but my inner pragmatist knows that, judging by my old blogging patterns, it’s now or never.

18.06: 56%, haven’t touched it in a while, but I think I can do lots more on the plane.


As a non-contestant, I confess I feel totally uninvested in the results and find the Closing Ceremony boring. All contestants go up, country by country, and have their awards read off. No effort is made to make any sort of buildup to a climax. But maybe this is for the best; we don’t want anybody feeling shafted or discouraged from continuing to do math due to a mere elementary-/middle-school competition. Meanwhile, though, I’m browsing reddit on my phone.

After this ceremony, the entire Taiwan delegation spends some time walking around outside while the guides make confused phone calls trying to decide where we eat lunch. My parents offer me some potato chips they bought somewhere, which are (as the label is really eager to point out) baked, not fried. Some time passes this way; eventually, the guides figure it out and we go through amazingly long queues to eat at the cafeteria, as usual. Then we are sent to a massive shopping mall for the afternoon, a place so large that its exits have number labels that go up into the double digits so that people don’t get lost.

I take trippy failed panorama photos from the bus windows.

[trippy panorama of a shopping mall]

Translation Party

Just a short anecdote for the streak today. Hmm, I guess this developed beyond being just another filler post, which is good.

In addition to preparing my presentation, the other job I have to do for the math competition I’m attending in a week or so (not as a participant, okay?) is translating various guests’ speeches between English and Chinese.

The speeches’ length and formulaicness really get on my nerves, but then again my standards for speeches were skewed upward by Richard Forster’s speeches during the opening and closing ceremony of IOI 2014, but on the gripping hand I don’t think it’s that hard to at least try not to be formulaic and I really can’t see any effort on their part whatsoever. Off the top of my head, pretty much all the speeches tend to go like this:

  1. Welcome!
  2. Math is great!
  3. This competition is great!
  4. The city hosting this competition is great!
  5. The college hosting this competition is great!
  6. You contestants are great!
  7. Good luck!

Except each bullet point is a paragraph that lasts a minute.

(Ninja edit: Which is not to say they didn’t put any effort into their speeches at all, but that much of the effort seems misguided to me. I don’t see how anybody who has been in the audience for one of these speeches can overlook the same flaws in their own. Unless it’s like, at some point in the natural life cycle of the human brain, people spontaneously start enjoying these safe and repetitive speech topics instead of some earnest and maybe lighthearted advice and anecdotes and jokes? Like how people somehow start enjoying spicy stuff, or the bitter flavor of beer and wine, or writing teenage-angsty ranty posts complaining to nobody in particular like this one? Tough questions.)

Anyway. My mom actually does most of the translation, but I am the grammar-stickler post-processor and we work together on the hard parts. The second hardest things to translate are idioms. The hardest things to translate are quotes. It turns out that lots of people find quotes already translated into Chinese, and it can be incredibly difficult to reconstruct their English versions. Here is the quote that today’s story is about, which the speech attributed to 克莱因 (trad.: 克萊因) and for which we were tasked with providing the English translation (or original).

Bilingualism II

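I originally wanted to type this post out once in English and once in Chinese to actually compare the two, but it has been stuck in my drafts folder long enough. Making repetitive, nitpicky, inefficient revisions is already my bad habit when writing personal posts; if I also had to keep two language versions in sync while revising, I would probably never finish. (If you haven’t read the posts I’ve been churning out lately, the story goes like this: I felt I had accumulated too many half-written posts in my drafts folder, so after graduating, on June 11, I decided to post every day from then until I go abroad. The goal is to force myself to finish those drafts, and along the way I’ve also posted a lot of other things I probably wouldn’t otherwise have written down. It’s actually been pretty satisfying. Also, if your Chinese isn’t terrible and you think my Chinese is unnatural anywhere or could be written better, please comment and criticize mercilessly. This is another kind of experience I’ve found I sorely lack in writing Chinese, and I would be very grateful. Of course, judging by past experience, begging readers to comment has little chance of success.)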

If you came to this blog or this post hoping to read English, sorry not sorry. It’s only fair, really, given how many people on Facebook can’t read the massive English textwall posts I’ve spammed them with for so long.

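People are often surprised that my English is this good. Sometimes they even ask how I learned it, and I never know how to answer. Oh, it’s simple: just choose to be born in a mainly English-speaking country, then spend twelve straight years at a school where 80% of the classes are taught in English, alongside classmates who have come back from abroad.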

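Occasionally there are also people who know I’m in the bilingual department, or who meet me under different circumstances, and who instead find my Chinese better than they expected, such as my driving school instructor. Actually, I still took twelve years of Chinese class in the bilingual department, in theory at the same pace as other schools (emphasis on “in theory”), and I’ve been communicating in Chinese with many relatives and friends around me for even longer. My olympiad competition experience has probably also meant that I know more students from ordinary Mandarin-medium schools than my other bilingual-department classmates do. So this shouldn’t be strange either.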

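But. My proficiency in the two languages is actually still very different. Everyday conversation is no problem, but I know far, far more about the fine details of English than of Chinese. We have more than one “common English errors” type of book at home, which I read like novels as a kid (I am a very strange person). In school, editing my own or other people’s English writing is a constant occurrence; I’ve encountered, thought about, and repaired all kinds of strange sentences and constructions. And sometimes I will quite naturally write an English word and then realize I don’t know why I know what it means, but a feeling just tells me that yes, “subsist” means “to survive in a state where only the most basic needs are met”. Compare this with Chinese: sometimes intuition also tells me that some four-character idiom would fit, but I can only clearly recall two or three of its characters. It sounds “right”, but I just can’t think of the fourth character, or I’m afraid I’ll write a wrong character among the first three. On top of that, I suspect the idiom doesn’t mean what is surfacing in my fuzzy brain at all, because I recognize the literal meanings of the characters but can’t convince myself why they would add up to that meaning. In the end I give up and just use elementary-school vernacular wording.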

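「總覺得,我在用任何偏離國小程度的白話文的詞彙的時候都是假裝的。」 (“I always feel that whenever I use any vocabulary that strays from elementary-school-level vernacular, I’m faking it.”)

I wrote this sentence in an older draft, but that string of 的 particles felt off to me. Thinking about it now, nobody has ever discussed with me what to do when I write this or other strange sentences. Which 的 can be dropped? Is there a way to recast the sentence to avoid them? Or should I just leave it, since my sense that it sounds strange is purely an illusion, and reading a bit more Chinese would show me it’s no big deal?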

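The way I use the two languages is also very different. Just look at this blog. I believe most of my Taiwanese friends outside the school’s bilingual department won’t try to understand my long-winded English essays about myself, yet those long-winded essays are still where I express myself most seriously. In contrast, my short Chinese posts are usually the joking, like-fishing, playing-weak kind. (No, I really am weak.) From time to time I find myself stumbling upon a stranger’s personal post online and reading it with great relish. Our relationship is at most friend of a friend of a friend, and all I know is that he and I probably both like math, but because of language, in those moments I feel I understand him better than I understand almost all of the Chinese-speaking friends around me. However I look at it, this seems unreasonable to me, and I’m a bit ashamed.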

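I went back and read my first post about bilingualism, and it may have gone a bit overboard: it later turned out that among my friends there really do seem to be people who write seven-character regulated verse purely to express their feelings. Besides, in my circle of Chinese-speaking friends, literary types naturally make up a much smaller proportion than they do in society as a whole. At school and in life, absorbing more English than Chinese (and more math than English) is a choice I made; I just don’t know whether the midpoint I chose is a good one. Have I perhaps not put enough effort into interacting with my Chinese-speaking relatives and friends?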

Adventures in Unicode Forensics

What do you do when you get a bunch of files like this from a zipfile?

I’ve blurred the messed-up file names because I’m not convinced it’s impossible to reconstruct the Chinese names of people from them and I’d rather err towards being paranoid about privacy. Except for the one file name whose author’s identity I’m OK with disclosing.

Back story: I have been tasked with collecting everybody’s Chinese assignments for this semester. I didn’t even do that part yet (I really should); first the girl who handled last semester’s assignments had to pass on the files she collected to me, and I unzipped the attachment to reveal these file names.

Well, I thought, no big deal, I understand Unicode (somewhat), I’ll just do something like decode them as latin1 and re-encode them as UTF-8, right? Or maybe Big5? In fact, maybe python-ftfy will just automagically fix it for me; I’ve been waiting for a chance to use that, like the dozens of other things on my GitHub star list…

I wish.

ftfy did nothing to the text. Also, UTF-8 was not high on my list because the filenames strongly suggested a double-byte encoding. By eyeballing the filenames and comparing them to the organized naming scheme of some other files that had, thankfully, survived the .zip intact, I noted that every Chinese character had been replaced with two characters, while underscores had survived unscathed. It was a simple substitution cipher with a lot of crib text available. So that rules out the most common encoding, UTF-8, but it gives me a lot of information, so this can’t be hard, right?

One of the files that I knew belonged to me came out as such, according to Python’s os.listdir:

'i\xcc\x81\xc2\xaci\xcc\x82a\xcc\x8aa\xcc\x82\xe2\x88\x82_e\xcc\x80Wn\xcc\x83n.pdf'

After codecs.decode(_, 'utf_8') we get:

u'i\u0301\xaci\u0302a\u030aa\u0302\u2202_e\u0300Wn\u0303n.pdf'
Attempt at UTF-8 rendition: í¬îåâ∂_èWñn.pdf

That’s really weird; those codepoints are kinda large, and there’s a literal i at the start, and it doesn’t look at all like the two-characters-for-one pattern we noticed from staring at the plaintext. Oh, they’re using combining characters. How do we fix that?

Fumble, mumble, search. Oh, I want unicodedata.normalize('NFC', _).

u'\xed\xac\xee\xe5\xe2\u2202_\xe8W\xf1n.pdf'
Attempt at UTF-8 rendition: í¬îåâ∂_èWñn.pdf

Although the byte sequences are totally different, they look the same, which is the point. *holds up fist* This… is… UNICODE!!!
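For anyone following along in Python 3 (the transcripts in this post are Python 2), here is a minimal sketch of the same two steps, using the byte string quoted above. The variable names are mine, not part of the original session.

```python
import unicodedata

# The raw file name bytes as reported by os.listdir in the post.
raw = b'i\xcc\x81\xc2\xaci\xcc\x82a\xcc\x8aa\xcc\x82\xe2\x88\x82_e\xcc\x80Wn\xcc\x83n.pdf'

# Step 1: decode as UTF-8. Accents come out as separate combining
# characters (e.g. 'i' followed by U+0301).
decomposed = raw.decode('utf_8')

# Step 2: NFC normalization folds each base letter plus combining mark
# into a single precomposed code point (e.g. U+00ED).
composed = unicodedata.normalize('NFC', decomposed)

print(ascii(decomposed))
print(ascii(composed))
print(decomposed == composed)  # the strings render alike but are not equal
```

The two strings display identically, yet they differ in both length and content, which is why the two-characters-per-Chinese-character pattern only shows up after normalizing.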

Anyway, the code points in the normalized version make more sense. Except for that conspicuous \u2202 or ∂, of course. Indeed, although it looks promising, if we now try e.g. codecs.encode(_, 'windows-1252') we get:

UnicodeEncodeError: 'charmap' codec can't encode character u'\u2202' in position 5: character maps to <undefined>

One can get around this by passing a third argument to encode to make it ignore or replace the invalid parts, but the result of the first couple of codepoints, after further pseudomagical decoding and encoding, is still nonsense. Alas.
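A small Python 3 sketch of that third argument, applied to the normalized name from above. It only shows what the error handlers do to the one unencodable character; it does not recover anything.

```python
# The NFC-normalized file name from the post.
name = '\xed\xac\xee\xe5\xe2\u2202_\xe8W\xf1n.pdf'

# Default (strict) handling raises, because U+2202 has no slot in cp1252.
try:
    name.encode('cp1252')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'replace' substitutes a question mark; 'ignore' silently drops the character.
replaced = name.encode('cp1252', errors='replace')
ignored = name.encode('cp1252', errors='ignore')

print(strict_ok)
print(replaced)
print(ignored)
```

Either way a byte is lost or altered, so any later decoding step is working from damaged input.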

I continued to try passing the filenames in and out of codecs.encode and codecs.decode with various combinations of utf_8 and latin_1 and big5 and windows-1252, to no avail.
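That trial-and-error loop can be automated. The sketch below is Python 3 and runs on a synthetic example, not on the real file names, for which this approach did not work. The function name `try_round_trips`, the candidate list, and the sample string are all my own assumptions for illustration.

```python
from itertools import permutations

# Codecs mentioned in the post; extend as needed.
CANDIDATES = ['utf_8', 'latin_1', 'cp1252', 'big5', 'shift_jis', 'euc_jp']

def try_round_trips(garbled, codecs_to_try=CANDIDATES):
    """Re-encode with one codec and decode with another, for every ordered pair.

    Returns (encode_codec, decode_codec, result) for each pair that
    does not raise.
    """
    results = []
    for wrong, right in permutations(codecs_to_try, 2):
        try:
            fixed = garbled.encode(wrong).decode(right)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
        results.append((wrong, right, fixed))
    return results

# Synthetic mojibake: Big5 bytes misread as Latin-1.
original = '中文作業'
garbled = original.encode('big5').decode('latin_1')

hits = [(w, r) for w, r, fixed in try_round_trips(garbled) if fixed == original]
print(hits)
```

This only helps when the damage is a single wrong decode and no bytes were lost along the way. More than one pair can succeed, so the candidates still have to be checked by eye.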

Then I did other stuff. Homework, college forms, writing a bilingual graduation song, talking to other dragons about 1994 video games, and did I mention I started taking driving lessons now? Yeah. That sort of thing.

Around that last thing, I asked the previous caretaker and learned that the computer she collected these files on had a Japanese locale. So I added shift_jis and euc_jp to the mix, but still nothing.

I later tried unzipping it with unzip from the command-line — the results were even worse — as well as unzipping it from a Windows computer — even worse than the command line.

So the problem remained, until…


The resolution is very anticlimactic; it took me half a week to think of getting at the files programmatically straight from the zip archive, instead of unzipping first. Python spat out half the file names from the zip file as unicode and the other half as str. From there it was easy guessing.

There were still three file names that failed, but they were easy to fix manually.
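For the curious, here is a rough Python 3 sketch of the idea, not the script I actually used. In Python 3, zipfile hands back every name as str: names flagged as UTF-8 are decoded as such, and the rest are decoded as cp437, so the original bytes can be recovered by encoding back to cp437. The helper names and the default of shift_jis are assumptions for illustration; which legacy encoding to pass is still guesswork.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8

def recover_name(filename, flag_bits, legacy_encoding):
    """Undo zipfile's cp437 fallback for names not flagged as UTF-8."""
    if flag_bits & UTF8_FLAG:
        return filename  # already decoded correctly
    # cp437 maps every byte to a character, so this restores the raw bytes.
    return filename.encode('cp437').decode(legacy_encoding)

def list_names(archive, legacy_encoding='shift_jis'):
    """List member names straight from the archive, without extracting."""
    with zipfile.ZipFile(archive) as zf:
        return [recover_name(info.filename, info.flag_bits, legacy_encoding)
                for info in zf.infolist()]

# Demo on an in-memory archive (Python itself writes UTF-8-flagged names).
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('日本語.txt', b'x')
    zf.writestr('plain.txt', b'y')
buf.seek(0)
print(list_names(buf))
```

A name that fails to decode with the guessed encoding raises UnicodeDecodeError here, which is roughly where fixing things by hand comes in.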

Honestly, I’m not sure anymore why I eventually decided to blog about this. Especially compared to my 12 other drafts. Oops.

ETA: I thought this script would be a one-trick pony but, amazingly, I ended up using it again the day after, this time with Big5 after I copied some files to a Windows laptop, converted them to .pdf, and sent them back in a .zip.

ETA2: This script came in handy again after I copied a zip file to an Ubuntu desktop computer with an actual CD drive so I could burn everything to a disc!

Pronunciation Stereotypes and the Uncrackable IPA Code

Disclaimer: just because a significant number of people in group A (esp. of a certain race/ethnicity) also have quality B does not mean that (i) all or most people of group A have quality B or (ii) people of group A who do not have quality B are in any way strange or inferior.

In other words, stereotypes are stupid; don’t apply them to real people.

The stereotypical “Asian” (a person from “Asia”, a mythical faraway continent consisting of two countries, China and Japan) is too hard-working, gets disowned for any grade below an A, has infinitesimally thin eyeslits, and pronounces L’s and R’s identically.

*jumps at opportunity to find and use .gif seen on Reddit without understanding any context*

The internet says the L/R thing is mostly due to Japanese having only a single sound somewhere in between those two. Wikipedia has a page on Japanese phonology which seems to support this. Still, Wikipedia articles on phonology all consist of giving every sound a long incomprehensible name, such as the “apical postalveolar flap undefined for laterality” for the Japanese sound discussed above, and I’m not Japanese, so don’t take my word for it.

Mandarin Chinese (blatantly ignoring the myriad dialect variations) has a perfect L sound (ㄌ) and an R sound (ㄖ) that is only a little different. Of course, there are people who still pronounce them identically, but it’s not common — generally, the language teaches L’s and R’s well. Right?

Bilingualism

So, as triggered by my confrontation with the Chinese book report (remember? whatever the answer is, it’s okay): a reflection on my incompetence at dealing with two languages, and why this matters, or not.

I can think in both languages. It’s a natural product of our school environment. The two languages often have to complement each other; most of the nerdy terms or globally relevant allusions are English-exclusive (I couldn’t talk coherently about SOPA in any language other than English!), but a lot of cultural and geographical staples around here are Chinese only. And sometimes there are unexpected holes where an innocuous-looking phrase simply has a few too many connotations to translate perfectly (the example I always get stuck on, and have yet to solve satisfactorily with anything short of a full sentence recasting, is “appreciate”.)

Also for some reason when I consciously make myself cross the language rift my internal subvocalization process gains this elaborate mainland accent with extra effort to make the retro-alveo-something sounds emphasized. That is something I can’t talk about coherently in either language alone. Tada.

But all things considered, my thought and writing process seems more optimized for English. I think the dramatic difference in the subjects of study in our education is a big source of the problem: namely, too much Classical Chinese stuff. I concede, learning it for the heritage and historical background is important and completely justified, and there are a lot of big-picture literary techniques that can be applied no matter which dialect of language one decides to use. Still, being able to apply them in a language-independent manner is far from trivial. And everybody is so serious in these passages, just going on about how to be wise, use money and time well, or how beautiful the frozen lake is in winter (so the consensus is I’m severely deficient in aesthetic percepts as well, but that’s a topic for another post). Only the best of the best parts of the important wise people’s writings made it into these books, but we’re not all important wise people and you can’t expect us to write in this manner all the time! Nothing even vaguely outlandish or imaginative like (pulling something out of a hat here) “Harrison Bergeron”. This term’s Chinese textbook has just seven passages, five classical and two vernacular. And I simply don’t believe anybody is actually expecting us to learn the ins and outs of Classical Chinese to write it! There aren’t any authors publishing books in the dialect. It’s important and memorable and significant to our heritage, all undeniable points that I concede, but it’s dead, as harsh as it feels saying that.

I don’t know how much of this is my own fault for not reading as much Chinese “extracurriculars”. I wouldn’t be surprised if I’m already stuck in a confirmation bias feedback loop. There is definitely much more hype and many more options if one is looking for a foreign-language bestseller (even after translation) than for a native one. And just maybe, there really aren’t enough cool or attractive authors for me, because my jargon-infested computer-reliant hyperlinked niche is already too firmly wedged on the other side of the cultural barrier. Being a technologically up-to-date nerd is just so much easier following the Western world where everybody else is.

And, while we’re on the technological bits (no pun intended): the Internet is still not quite free of its roots in seven-bit ASCII. Of course we’re moving away, forced by the waves of globalization (Google says 60% of the web uses Unicode), but it’s still far from a completely idiot-proof system. Pentadactyl isn’t playing nice with me over here with two foreign characters. I should fix this except I don’t think I’d make any progress.

On ease of writing and file size: because of less redundancy and more versatile combination, I’m pretty sure Chinese is a little bit more compact if considered as just a sequence of bytes, even under reasonable UTF encoding. But when writing words out, there’s a lot of room for contention, and as a math nerd nothing makes me more frustrated than graph theory. Take a look: 邊 (“edge”/“side”) has 19 zarking strokes and takes me three times as long to write out as either the English word or the simplified character (6 strokes). 點 (naturally “point”/“vertex”), 17 versus 9. And when I have to write an entire proof with these characters again and again, well... use your imagination. I still feel some loyalty to preserving tradition and keeping the writing in the traditional format, but considerations like these make me admit that the whole simplification thing has some very good points. (Then, who knows how much physical writing I should expect to need to perform in the rest of my life, as the computers take over?)
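A quick way to poke at the byte-count claim, as a sketch in Python 3. The helper `byte_sizes` is my own name, and two word pairs are nowhere near a real corpus comparison, so treat this as a sanity check and not as evidence.

```python
def byte_sizes(text):
    """Return (UTF-8 size, UTF-16 size without BOM) of text, in bytes."""
    return len(text.encode('utf_8')), len(text.encode('utf_16_le'))

# The two graph theory terms from above, paired with their English words.
for english, chinese in [('edge', '邊'), ('point', '點')]:
    print(english, byte_sizes(english), chinese, byte_sizes(chinese))
```

For these single words the Chinese character comes out smaller in both encodings. Whether that holds for running text would need whole passages in both languages to test.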

But on the flip side, English is really a clusterfudge of a language, too! How can anybody tolerate pronouncing “colonel” the way we do? Seriously? It’s idiotic and that’s a profound understatement. Okay, this post is basically nothing but understatements, or maybe I’m just always a hyperbolic writer. In any case, I need to stop overcorrecting this post like I always do.

Yes, that’s all. Everybody should switch to Esperanto or something, just like the Dvorak keyboard layout or base-6 number system.