Adventures in Unicode Forensics

What do you do when you get a bunch of files like this from a zipfile?

I’ve blurred the messed-up file names because I’m not convinced it’s impossible to reconstruct the Chinese names of people from them and I’d rather err towards being paranoid about privacy. Except for the one file name whose author’s identity I’m OK with disclosing.

Back story: I have been tasked with collecting everybody’s Chinese assignments for this semester. I didn’t even do that part yet (I really should); first the girl who handled last semester’s assignments had to pass on the files she collected to me, and I unzipped the attachment to reveal these file names.

Well, I thought, no big deal, I understand Unicode (somewhat), I’ll just do something like decode them as latin1 and re-encode them as UTF-8, right? Or maybe Big5? In fact, maybe python-ftfy will just automagically fix it for me; I’ve been waiting for a chance to use that, like the dozens of other things on my GitHub star list…

I wish.

ftfy did nothing to the text. Also, UTF-8 was not high on my list because the filenames strongly suggested a double-byte encoding. By eyeballing the filenames and comparing them to the organized naming scheme of some other files that had, thankfully, survived the .zip intact, I noted that every Chinese character had been replaced with two characters, while underscores had survived unscathed. It was a simple substitution cipher with a lot of crib text available. So that rules out the most common UTF-8, but it gives me a lot of information, so this can’t be hard, right?

One of the files that I knew belonged to me came out as such, according to Python’s os.listdir:

'i\xcc\x81\xc2\xaci\xcc\x82a\xcc\x8aa\xcc\x82\xe2\x88\x82_e\xcc\x80Wn\xcc\x83n.pdf'

After codecs.decode(_, 'utf_8') we get:

u'i\u0301\xaci\u0302a\u030aa\u0302\u2202_e\u0300Wn\u0303n.pdf'
Attempt at UTF-8 rendition: í¬îåâ∂_èWñn.pdf

That’s really weird; those codepoints are kinda large, and there’s a literal i at the start, and it doesn’t look at all like the two-characters-for-one pattern we noticed from staring at the plaintext. Oh, they’re using combining characters. How do we fix that?

Fumble, mumble, search. Oh, I want unicodedata.normalize('NFC', _).

u'\xed\xac\xee\xe5\xe2\u2202_\xe8W\xf1n.pdf'
Attempt at UTF-8 rendition: í¬îåâ∂_èWñn.pdf

Although the byte sequences are totally different, they look the same, which is the point. *holds up fist* This… is… UNICODE!!!
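For anyone following along in Python 3 (the session above is Python 2), the normalization step can be reproduced roughly like this. The variable names are made up for illustration; the string is the decoded file name shown above.

```python
import unicodedata

# The decoded file name from above: plain base letters, each followed by
# a separate combining mark (this is the decomposed, NFD-style form).
decomposed = 'i\u0301\xaci\u0302a\u030aa\u0302\u2202_e\u0300Wn\u0303n.pdf'

# NFC composes each base letter + combining mark pair into a single
# precomposed code point wherever one exists.
composed = unicodedata.normalize('NFC', decomposed)

print(ascii(decomposed))
print(ascii(composed))
# The two strings render identically but have different lengths.
print(len(decomposed), len(composed))
```

Going the other way with unicodedata.normalize('NFD', composed) should give back the decomposed form.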

Anyway, the code points in the normalized version make more sense. Except for that conspicuous \u2202 or ∂, of course. Indeed, although it looks promising, if we now try e.g. codecs.encode(_, 'windows-1252') we get,

UnicodeEncodeError: 'charmap' codec can't encode character u'\u2202' in position 5: character maps to <undefined>

One can get around this by passing a third argument to encode to make it ignore or replace the invalid parts, but the result of the first couple of codepoints, after further pseudomagical decoding and encoding, is still nonsense. Alas.
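A small Python 3 sketch of that failure and the workaround, using the normalized file name from above. It uses str.encode rather than codecs.encode, and the variable names are mine; the error handlers 'replace' and 'ignore' are the standard ones.

```python
# NFC-normalized file name from above, as a Python 3 str.
name = '\xed\xac\xee\xe5\xe2\u2202_\xe8W\xf1n.pdf'

try:
    name.encode('windows-1252')
    strict_failed = False
except UnicodeEncodeError as err:
    # U+2202 has no windows-1252 equivalent, so strict encoding fails.
    strict_failed = True
    print(err)

# errors='replace' substitutes '?' for anything unencodable;
# errors='ignore' silently drops it instead.
replaced = name.encode('windows-1252', errors='replace')
ignored = name.encode('windows-1252', errors='ignore')
print(replaced)
print(ignored)
```

Either way the offending character is lost, so whatever bytes it stood for cannot be recovered by a later decode.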

I continued to try passing the filenames in and out of codecs.encode and codecs.decode with various combinations of utf_8 and latin_1 and big5 and windows-1252, to no avail.

Then I did other stuff. Homework, college forms, writing a bilingual graduation song, talking to other dragons about 1994 video games, and did I mention I started taking driving lessons now? Yeah. That sort of thing.

Around that last thing, I asked the previous caretaker and learned that the computer she collected these files on had a Japanese locale. So I added shift_jis and euc_jp to the mix, but still nothing.

I later tried unzipping it with unzip from the command-line — the results were even worse — as well as unzipping it from a Windows computer — even worse than the command line.

So the problem remained, until…


The resolution is very anticlimactic; it took me half a week to think of getting at the files programmatically straight from the zip archive, instead of unzipping first. Python spat out half the file names from the zip file as unicode and the other half as str. From there it was easy guessing.

There were still three file names that failed, but they were easy to fix manually.
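The script itself isn't reproduced here, so what follows is only a rough Python 3 sketch of the same idea, not the original code. In Python 3, zipfile decodes any name that isn't flagged as UTF-8 using cp437, so the stored bytes can be recovered by encoding back to cp437 and then decoding with a guessed encoding. The function names and the default guess of shift_jis (suggested by the Japanese locale) are assumptions for illustration.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def recover_name(filename, flag_bits, guess='shift_jis'):
    """Undo zipfile's cp437 decoding and retry with a guessed encoding."""
    if flag_bits & UTF8_FLAG:
        return filename  # zipfile already decoded this one as UTF-8
    raw = filename.encode('cp437')  # back to the bytes stored in the archive
    try:
        return raw.decode(guess)
    except UnicodeDecodeError:
        return filename  # wrong guess for this name; fix it by hand


def list_names(path, guess='shift_jis'):
    """Return (stored name, recovered name) pairs without extracting."""
    with zipfile.ZipFile(path) as zf:
        return [(info.filename,
                 recover_name(info.filename, info.flag_bits, guess))
                for info in zf.infolist()]
```

Calling list_names('assignments.zip') (a hypothetical path) would give the pairs to eyeball, and passing guess='big5' covers the other case. Names that fail to decode come back unchanged, which matches the handful that needed manual fixing.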

I’m not sure why I eventually decided to blog about this anymore, honestly. Especially compared to my 12 other drafts. Oops.

ETA: I thought this script would be a one-trick pony but, amazingly, I ended up using it again the day after, this time with Big5 after I copied some files to a Windows laptop, converted them to .pdf, and sent them back in a .zip.

Three Standard Deviations

A PSYCHOLOGICAL TIP

Whenever you’re called on to make up your mind,
and you’re hampered by not having any,
the best way to solve the dilemma, you’ll find,
is simply by spinning a penny.
No — not so that chance shall decide the affair
while you’re passively standing there moping;
but the moment the penny is up in the air,
you suddenly know what you’re hoping.

— Piet Hein

(By the way, apparently spinning a penny is a terrible randomization process; studies have shown they come up tails 80% of the time. Tossing or flipping is better but there’s still a faintly biased 51% chance it lands with the same face it started with (PDF link). Entirely irrelevantly, is the meter amphibrachic? Nice. I’m sorry, but the impenetrable English names they give to metrical feet just sound so cool.)

As May 1 has been coming up, I’ve been half-seriously giving this advice to others who still haven’t decided. But I knew this wouldn’t work for me. I knew where I intuitively wanted to go all along.

The reasons holding me back were more… reasonable. Mostly the money. Call it an id-superego conflict.

I don’t know if the difference between my choices would mean I’d have to take out loans, or work a lot during college, or both. I don’t think either of those things would be difficult. I think tech internships over the summer could just cover the parts assigned to parental contribution (which I’m not going to let my parents pay, unless they start earning a lot more money than expected) and I think I have the skills to get those internships. But of course that’s a tradeoff. Maybe there will be something more self-actualizing or more helpful to my future career that I could do during the summer. I’m not so sure that I’ll find the same drive to program for a job instead of for a personal project I really want to use myself, or for putting off something more boring. I don’t know yet.

(Get it? Drive? Program? Um, never mind, I guess that’s a hardware problem.)

Puzzle 47 / Fillomino [LITS + Walls]

CLICKBAIT PERSONALITY TEST THAT YOU CAN DO WITHOUT SOLVING THE PUZZLE: What do you see in the puzzle image below? I have my own thoughts but I won’t bias you by posting them yet. Sound out your thoughts in the comments below! (I don’t expect this to work but I’d love to be proven wrong)

Okay so apparently how puzzles work is I go nearly a year without posting one and then when I post a terrible one, I feel guilty and obligated to post a legitimate one soon after. Testsolved by chaotic_iak.

This is a Fillomino (write a number in every empty cell so that every group of cells with the same number that is connected through its edges has that number of cells) where each tetromino has had its 4s replaced by one of L, I, T, or S describing its shape, and the tetrominoes obey the rules of LITS: they can touch if they are not congruent, they must all be connected, and their squares cannot form a 2×2 block. In addition, cells separated by a thick border may not contain the same number or letter.

See my Puzzle 36 or chao’s Puzzle 36 (I only noticed this coincidence today, it’s quite amazing) or FFF 6 for prior LITS Fillominoes and links to more, and my puzzle 43 for a prior Fillomino Walls mutant with links to other Walls.

fillomino-lits-walls

Bugs

First there was the form that silently exploded when opened in two tabs, then there was the guestbook that stayed XSS-vulnerable for I don’t even know how many weeks, then there was the form submission system whose form submission system for the user to report errors kept reporting an error itself (and still isn’t done with its job), then there was the pointed refusal of a certain API to serve up any polls, then there was the sneaky two-factor authentication that pretended to work until the last moment when it died in a sea of redirects.

(also I was sick the past few days, if that counts)

And now apparently WordPress’s link with Facebook secretly breaks without telling you in the new interface?

Edit: Yay it works

Edit 2: And now yet another site fails to give me a secure connection, leaving me unable to do anything.

Puzzle 46 / Fillomino [LITS + Extra Region + Walls + Anti-Walls + Inequality + Tapa + Masyu]

5:27 PM phenomist: do you use gridderface to make pretty puzzles?

5:52 PM phenomist: actually nvm excel is probably easier lol

Okay I’m sorry this is a horrible puzzle where the rules don’t make sense and I didn’t even get it testsolved. I just wanted an image to concisely demonstrate the capabilities of gridderface, my puzzle marking and creation program, for the project homepage, after somebody expressed interest in using the program to write a puzzle. Then I got tremendously carried away.

fillomino-gf

Hints:

  • L, I, T, S are comparable, but not in that order.
  • The Tapa clue isn’t part of any polyomino.
  • The Masyu clue restricts the borders around three intersections, including itself.
  • (Is there a name for Fillominoes with the opposite of walls?)

Other than that, you’ll have to figure out the rules yourself (partly because I am lazy and partly because if I listed the rules it would probably take longer to read and understand them than to apply them to the puzzle…)

Anyway, this post is actually to say that it is now possible to use the Scala reboot of gridderface without spending forever to attempt to install Scala, because I figured out how to make .jar file releases. It’s probably still hard to use but who knows?

Because it’s a full rewrite, the version is back to 0.2 unlike the old 0.5. As I noted earlier, the code base was horrible.

By the way here is the Heyawake phenomist eventually made.

Okay back to homework (and agonizing over which college to go to).

X + Y (movie)

On Wednesday I got to see a special screening of the film X + Y. You know, the one about the autistic boy who goes to compete in the IMO. You can watch the trailer if you haven’t already.

Disclosure: the ticket was free, courtesy of my math teacher (who appears at 1:06–1:07 in the trailer) having helped the filming process. (I visited once and got to look at some of the cool equipment. Also, far away from everything, one of the director assistants sort of interviewed me. That is the full extent of my contribution, okay?) Except I was also sick with a cold so I might have been kind of miserable. Also I didn’t really have dinner that day, and we got home really late so I had to stay up even later doing homework. So those are the extent of my biases.

I guess this is a review of sorts.

The most important thing I have to say is this: X + Y is not a film primarily about math competitions or the IMO. It is a film about love, about autism, about accepting people who are different, about conquering your own psychological demons, about gender and family and cultural roles. But mostly about love. The film gets big novelty bonus points for a reasonably authentic look at the high school mathematics olympiad scene, but if you go watch this as a former contestant looking to relive some vicarious moments of glory and triumph through hard mathematical work and thinking, I’m pretty sure you’ll be disappointed. None of the main character’s important character development moments are related to becoming better at math. The IMO is largely a well-researched and extensively utilized plot device. (This is one of the acceptable usages of the word “utilize”, okay? Dear classmates: please stop using it as a seven-letter synonym for “use”.)

Thoughts at Midnight

These are the thoughts that sometimes keep me awake at night.

These are things I don’t want to think about. These are things I’ve spent hours thinking about, never productively. They are worrying, but unlike typical worries in my life, it is fundamentally impractical to take steps to resolve or mitigate them, after which I may rest assured that I’ve done my best. The reason is that they also happen to be either untestable/unfalsifiable or only testable if one incurs absurd and irreversible costs, mainly dying.

Sometimes I explain them away to myself successfully and move on. Sometimes I read what I’ve written and think about these thoughts and do the cognitive equivalent of looking at them funny, as I’m expecting most readers to feel if they get that far — why would anybody be bothered, or afraid, or soul-crushingly panicked about these things? Life is so busy, there are literally more than sixty-four items on my HabitRPG to-do list, and besides, there are so many serious global issues humanity is actually facing right now, and people who are actually deprived of basic rights and resources and have to struggle to stay alive. How can I possibly be bothered by these absurd remote thoughts?

But I know that other times I do feel those emotions exactly. And if I stare just right, I can feel those emotions bubbling beneath the surface in me. Sometimes I can’t explain the issues away to myself, and a deep soul-sucking pang grows in my stomach. I’m irrational — I’m afraid of some of these thoughts — and I have submitted to the fact that there are some edges of my irrationality that would not be worth the effort to fix if just not thinking about them is better.

Sometimes these thoughts make me wish I were not so rational. Sometimes they even make me wish I were religious; it would be easier if (I believed) consciousness were, somehow, special. I suspect if I tried really hard, I could make myself believe something like that sincerely. But I think that’s a betrayal of myself I’m not willing to take. I think there are better ways to remain happy.

I want to maximize happiness. Thinking about more general moral principles will help with that, but the remoteness of these particular thoughts is such that I doubt I’ll ever have to make a choice that would benefit from me having thought about them. At least, I think the chance is small enough to not be worth the negative utility spent thinking about them.

So: “There is nothing to fear but fear itself.”

But I feel frustrated: not thinking about something just doesn’t seem like a solution. I don’t know how to come to terms with just how irrational happiness fundamentally is. And I still can’t resist thinking about them sometimes…