[CIMC 2015 Part 1] Rainy Days in July (and Other Months)

We get up at 3:40 AM. By 4 AM we have left our house, speeding like a bullet into the dark.


(Ohai. Somehow it slipped my mind that I was ending my streak by leaving the country for a competition that would likely be highly bloggable, like my last two international olympiads, both of which led to notable post sequences on this blog. (Admittedly, the first one was never really completed…) My only excuse was that I was worried I might not be able to access my blog from inside the Great Firewall, but I did (via vpn.mit.edu) and even if I hadn’t, I could still have drafted posts locally in Markdown as I usually do, so I don’t know what I was thinking.)

(Also: because, as I’ve said way too many times recently, I need to do linear algebra homework, these posts aren’t going to be as complete or as perfect as I’d like them to be. Although I’m probably just saying this to persuade myself; I tend to include many of the boring parts as well as the interesting parts of the trip, which maybe benefits my future self at the expense of other readers. I probably need to get out of this habit more if I want to blog for a wider audience, though. Oh well.)

Backstory

The International Mathematics Competition (IMC) is, as it says, an international mathematics competition. But I should add that it is for elementary and middle-school students (in other words, I am not competing, okay??). (edit: Also, one or two letters are often prefixed to indicate the host country, for whatever reason. This year it would be CIMC, C for China.) I am tagging along because I am a student of Dr. Sun, one of the chief organizers, and have been slotted to give a talk and possibly help with grading the papers and translating. My father is coming to help arrange a side event, a domino puzzle game competition, which he programmed the system for; and my mom and sister are also coming to help with translation and other duties. Other people in our group: Dr. Sun himself, his longtime assistant slash fellow teacher Mr. Li (wow I’m sorry I forgot you while first writing this), my friend and fellow math student Hsin-Po, who is an expert at making polyhedra from origami or binder clips (and at Deemo); Chin-Ling, my father’s student/employee who also programmed lots of the domino puzzle server and possesses a professional camera; and, of course, all the elementary- and middle-school contestants, as well as most of their parents.

I don’t think I’ve ever given this amount of background exposition about any event I’ve attended to my not-so-imaginary audience before. It feels weird. Some part of me is worried about breaking these people’s privacy by posting this, which makes a little bit of sense but not enough for me to think that it’s actually a valid reason to avoid or procrastinate blogging. I think it’s a rationalization.

Here we go.

Day 1

The only interesting thing that happens at the airport is a short loud argument in the queues for luggage check-in, perhaps partly fueled by our high number of people and of heavy boxes (gifts for other countries and raw materials for Hsin-Po’s polyhedra). I don’t know whose fault it is.

In case I fail to scale the firewall, I attempt to download Facebook on my phone for one last look before boarding, but it fails during installation twice and I give up.

Our plane is not fancy enough to offer personal screens and entertainment centers for everybody, but thankfully the ride lasts only three hours, so this is tolerable. Instead, the plane plays the second Divergent movie on overhead screens, which I watch half-heartedly. The plot setup seems interesting but the ending seems to me to involve two Ass Pulls™, although since I haven’t been paying much attention I can’t tell whether I just missed some foreshadowing or character development. On the flight, I also read the proof of the irrationality of powers of e in Proofs from THE BOOK and leaf through the magazines.

I don’t hear any good music on-board, except maybe “Space Oddity”, which is a little freaky to be listening to while cruising at so many kilometers in the sky. Perhaps because of this, I find myself singing and humming “Space Oddity” unexpectedly often over the next few days.

Arrival

The very first sign we see after alighting the plane consists entirely of characters that are the same in Simplified and Traditional Chinese — if I remember correctly, 「前有坡道,小心慢走」1. The Changchun airport looks like any other airport, coolly blue-themed with moving platforms. The restrooms have fancy bright purple soap. Even though I consciously think about how I have suddenly arrived in a country that places notable restrictions on freedom of speech and Internet access, I don’t feel it. Eep, what an anticlimax.

[People dragging luggage boxes over gravelly ground outdoors.]

Adventures in Unicode Forensics

What do you do when you get a bunch of files like this from a zipfile?

I’ve blurred the messed-up file names because I’m not convinced it’s impossible to reconstruct the Chinese names of people from them and I’d rather err towards being paranoid about privacy. Except for the one file name whose author’s identity I’m OK with disclosing.

Back story: I have been tasked with collecting everybody’s Chinese assignments for this semester. I didn’t even do that part yet (I really should); first the girl who handled last semester’s assignments had to pass on the files she collected to me, and I unzipped the attachment to reveal these file names.

Well, I thought, no big deal, I understand Unicode (somewhat), I’ll just do something like decode them as latin1 and re-encode them as UTF-8, right? Or maybe Big5? In fact, maybe python-ftfy will just automagically fix it for me; I’ve been waiting for a chance to use that, like the dozens of other things on my GitHub star list…

I wish.

ftfy did nothing to the text. Also, UTF-8 was not high on my list because the filenames strongly suggested a double-byte encoding. By eyeballing the filenames and comparing them to the organized naming scheme of some other files that had, thankfully, survived the .zip intact, I noted that every Chinese character had been replaced with two characters, while underscores had survived unscathed. It was a simple substitution cipher with a lot of crib text available. So that rules out the most common UTF-8, but it gives me a lot of information, so this can’t be hard, right?

One of the files that I knew belonged to me came out as such, according to Python’s os.listdir:

'i\xcc\x81\xc2\xaci\xcc\x82a\xcc\x8aa\xcc\x82\xe2\x88\x82_e\xcc\x80Wn\xcc\x83n.pdf'

After codecs.decode(_, 'utf_8') we get:

u'i\u0301\xaci\u0302a\u030aa\u0302\u2202_e\u0300Wn\u0303n.pdf'
Attempt at UTF-8 rendition: í¬îåâ∂_èWñn.pdf

That’s really weird; those codepoints are kinda large, and there’s a literal i at the start, and it doesn’t look at all like the two-characters-for-one pattern we noticed from staring at the plaintext. Oh, they’re using combining characters. How do we fix that?

Fumble, mumble, search. Oh, I want unicodedata.normalize('NFC', _).

u'\xed\xac\xee\xe5\xe2\u2202_\xe8W\xf1n.pdf'
Attempt at UTF-8 rendition: í¬îåâ∂_èWñn.pdf

Although the byte sequences are totally different, they look the same, which is the point. *holds up fist* This… is… UNICODE!!!
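For anyone who wants to poke at this themselves, here is a small sketch of the same normalization step. It is written for Python 3 rather than the Python 2 session shown above, so the strings have no u prefix; the variable names are just mine.

```python
import unicodedata

# The decomposed file name from above, as a Python 3 string.
decomposed = 'i\u0301\xaci\u0302a\u030aa\u0302\u2202_e\u0300Wn\u0303n.pdf'

# NFC folds each base letter plus combining mark into one precomposed code point.
composed = unicodedata.normalize('NFC', decomposed)

# The two strings render identically but are not equal as code point sequences.
print(len(decomposed), len(composed))
print(decomposed == composed)

# NFD goes the other way and splits the accents back out.
print(unicodedata.normalize('NFD', composed) == decomposed)
```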

Anyway, the code points in the normalized version make more sense. Except for that conspicuous \u2202 or ∂, of course. Indeed, although it looks promising, if we now try e.g. codecs.encode(_, 'windows-1252') we get,

UnicodeEncodeError: 'charmap' codec can't encode character u'\u2202' in position 5: character maps to <undefined>

One can get around this by passing a third argument to encode to make it ignore or replace the invalid parts, but the result of the first couple of codepoints, after further pseudomagical decoding and encoding, is still nonsense. Alas.
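A rough Python 3 sketch of what that third argument does, using the normalized name from above. The helper try_encode is a made-up name for illustration; the error handler names ('strict', 'ignore', 'replace') are the standard ones.

```python
# The NFC-normalized file name from above, as a Python 3 string.
name = '\xed\xac\xee\xe5\xe2\u2202_\xe8W\xf1n.pdf'

def try_encode(text, encoding, errors):
    """Encode text, or report where the codec gave up (hypothetical helper)."""
    try:
        return text.encode(encoding, errors=errors)
    except UnicodeEncodeError as exc:
        return 'failed at position {}'.format(exc.start)

# 'strict' raises on the first unencodable character, 'ignore' drops it,
# and 'replace' substitutes a question mark.
for errors in ('strict', 'ignore', 'replace'):
    print(errors, try_encode(name, 'windows-1252', errors))
```

Either way the offending character is simply lost, which is why this does not rescue the file name.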

I continued to try passing the filenames in and out of codecs.encode and codecs.decode with various combinations of utf_8 and latin_1 and big5 and windows-1252, to no avail.
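In hindsight the poking around could have been automated. Here is a minimal Python 3 sketch of that kind of brute force, assuming the damage is a single wrong decode; candidate_decodings is a name I made up, and the demo string is invented rather than one of the real file names.

```python
import itertools

# The encodings tried by hand above; the list is easy to extend.
ENCODINGS = ['utf_8', 'latin_1', 'big5', 'windows-1252']

def candidate_decodings(mangled, encodings=ENCODINGS):
    """Map (encode_with, decode_with) to the text of each round trip that works."""
    results = {}
    for enc_out, enc_in in itertools.permutations(encodings, 2):
        try:
            results[(enc_out, enc_in)] = mangled.encode(enc_out).decode(enc_in)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot represent the text at all
    return results

# Invented demo: Big5 bytes that were wrongly read as Latin-1.
demo = '中文'.encode('big5').decode('latin_1')
for pair, text in candidate_decodings(demo).items():
    print(pair, repr(text))
```

This only helps when the original bytes are still recoverable from the mangled string, which was not the case for my file names.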

Then I did other stuff. Homework, college forms, writing a bilingual graduation song, talking to other dragons about 1994 video games, and did I mention I started taking driving lessons now? Yeah. That sort of thing.

Around that last thing, I asked the previous caretaker and learned that the computer she collected these files on had a Japanese locale. So I added shift_jis and euc_jp to the mix, but still nothing.

I later tried unzipping it with unzip from the command-line — the results were even worse — as well as unzipping it from a Windows computer — even worse than the command line.

So the problem remained, until…


The resolution is very anticlimactic; it took me half a week to think of getting at the files programmatically straight from the zip archive, instead of unzipping first. Python spat out half the file names from the zip file as unicode and the other half as str. From there it was easy guessing.

There were still three file names that failed, but they were easy to fix manually.
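For the curious, here is roughly what reading names straight from the archive looks like. This is a sketch in Python 3, not the script I actually used, and Python 3 behaves differently from the str/unicode split described above: it decodes every name that is not flagged as UTF-8 with cp437, so the raw bytes can be recovered by encoding back to cp437. The function names and the guess parameter are my own placeholders, and which encoding to guess depends on where the zip came from.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8

def recover_name(filename, flag_bits, guess):
    """Re-decode one archive member name using a guessed legacy encoding."""
    if flag_bits & UTF8_FLAG:
        return filename  # already decoded correctly as UTF-8
    try:
        # Undo the cp437 decode to get the raw bytes, then decode with the guess.
        return filename.encode('cp437').decode(guess)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return filename  # leave it alone and fix by hand later

def list_recovered_names(zip_file, guess):
    """Return every member name in the archive, re-decoded where needed."""
    with zipfile.ZipFile(zip_file) as archive:
        return [recover_name(info.filename, info.flag_bits, guess)
                for info in archive.infolist()]
```

Calling list_recovered_names on the archive path with a guess such as 'big5' gives the names to rename to; anything that fails to decode comes back unchanged.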

I’m not sure why I eventually decided to blog about this anymore, honestly. Especially compared to my 12 other drafts. Oops.

ETA: I thought this script would be a one-trick pony but, amazingly, I ended up using it again the day after, this time with Big5 after I copied some files to a Windows laptop, converted them to .pdf, and sent them back in a .zip.

ETA2: This script came in handy again after I copied a zip file to an Ubuntu desktop computer with an actual CD drive so I could burn everything to a disc!

Technological Fails Continue

Hardware:

The laptop I’m typing this on is over two years old. This is not a lot by some measures, but weird spontaneous glitches are starting to accumulate to the point where they’re getting on my nerves. The internet card still needs an extra reset to start working half the time, and occasionally warrants a full reboot, which costs five minutes. The USB ports are loopy, some windows just show up black when they feel like it, and there’s a steadily climbing whir in the background. I’m kind of anticipating the moment the whole thing just drops dead.

Well, I’m not about to run out of computers to use (there’s a noisy XP desktop that also barely works despite handling all our print jobs, but also one spanking new eight-core CPU laptop, which Dad considered a valuable enough investment (?)) but such a loss is still not something to be dismissed lightly. And the externalized cost is far more important and chilling. Who knows how many kids in the Congo had to mine coltan, or how much conflict has occurred over the crude oil, or what awful conditions those sweatshop-assembly workers are going through? Annie Leonard’s words still resonate with me from when we were first shown the video a year ago. Which is more recent than this laptop, so that doesn’t mean that much. I think a couple months ago I would have absolutely no second thoughts about getting a new one, though. Yup, I’m in a quandary (ha ha vocabulary) on the balance between desensitization and compulsive hoarding of stuff.

I mean, I hoarded the sleeve from my 2011 planner in a drawer somewhere, and it turned out to serve nicely as a pencil box divider. So hoarding is cool, old stuff is useful. But so is a new laptop!

Flashbacks. The Andreas guy from the book report book was examining his own consumption habits and going “I am a complete a()hole”. Les Mis lyrics.
See our children fed! Help us in our shame!
Something for a crust of bread in Holy Jesus’ name!
In the Lord’s holy name! In his name, in his name, in his name (fade)

Hmm, right, back to musical drama for the first time this semester tomorrow along with our terrifically haphazard “play”, and I could even launch into a wild rambling on capitalism and socialism and how cool it is to finally kind of understand what everybody is talking about in those serious Round Table political threads, but… that’s far enough off-topic.

Software:

Another day working on the IMC site. Bugs bugs buuuuugs.

First, for a couple eternities, we agonized over a recursive loop that defied all reason and attempts at diagnosis, which Dad tracked down as reportedly deriving from the difference between WEB-INF and WEB_INF. Then something like three more bugs appeared, each more difficult to explain than the last. Completely amorphous artifacts of compilation and building, placement of initialization hooks that basically boil down to wherever looks faintly reasonable.

Slowly, loose ends kind of resolved themselves and there are beautiful extensionless URLs, but getting all the old files into the new framework is still going to be rather tedious. I’m pretty sure we’ve reinvented the wheel a lot here, with a totally custom server-side dynamic system for displaying the menus (although it’s not really complicated either), but I’m not sure how severe it is. But I still feel more black magic in this system than the last one. I can’t even hand-wave an explanation for putting <% %> here and <%! %> there.

On my own time I was working on refactoring my grid editor thingamajig that, as of now, still lacks a snappy name (some pun on “gridlock”?). The “main class” has upwards of 600 lines, the kind of ridiculously ugly interdependent mass I abhor because (it seems) everything needs references and callbacks to everything else. I have thought and thought, and can’t decouple anything from anything else sensibly.

You know what they say, always code as if the next person to maintain your code will be a psychopathic killer who knows where you live? All through yesterday I’ve been continually thinking of my future self climbing out of a time machine like that. That’s how bad it is.