Adventures in Unicode Forensics

What do you do when you get a bunch of files like this from a zipfile?

I've blurred the messed-up file names because I'm not sure it's impossible to reconstruct the Chinese names of people from them and I'd rather err towards being paranoid about privacy. Except for the one file name whose author's identity I'm OK with disclosing.

I’ve blurred the messed-up file names because I’m not convinced it’s impossible to reconstruct the Chinese names of people from them and I’d rather err towards being paranoid about privacy. Except for the one file name whose author’s identity I’m OK with disclosing.

Back story: I have been tasked with collecting everybody’s Chinese assignments for this semester. I didn’t even do that part yet (I really should); first the girl who handled last semester’s assignments had to pass on the files she collected to me, and I unzipped the attachment to reveal these file names.

Well, I thought, no big deal, I understand Unicode (somewhat), I’ll just do something like decode them as latin1 and re-encode them as UTF-8, right? Or maybe Big5? In fact, maybe python-ftfy will just automagically fix it for me; I’ve been waiting for a chance to use that, like the dozens of other things on my GitHub star list…

I wish.

ftfy did nothing to the text. Also, UTF-8 was not high on my list because the filenames strongly suggested a double-byte encoding. By eyeballing the filenames and comparing them to the organized naming scheme of some other files that had, thankfully, survived the .zip intact, I noted that every Chinese character had replaced with two characters, while underscores had survived unscathed. It was a simple substitution cipher with a lot of crib text available. So that rules out the most common UTF-8, but it gives me a lot of information, so this can’t be hard, right?

One of the files that I knew belonged to me came out as such, according to Python’s os.listdir:

'i\xcc\x81\xc2\xaci\xcc\x82a\xcc\x8aa\xcc\x82\xe2\x88\x82_e\xcc\x80Wn\xcc\x83n.pdf'

After codecs.decode(_, 'utf_8') we get:

u'i\u0301\xaci\u0302a\u030aa\u0302\u2202_e\u0300Wn\u0303n.pdf'
Attempt at UTF-8 rendition: í¬îåâ∂_èWñn.pdf

That’s really weird; those codepoints are kinda large, and there’s a literal i at the start, and it doesn’t look at all like the two-characters-for-one pattern we noticed from staring at the plaintext. Oh, they’re using combining characters. How do we fix that?

Fumble, mumble, search. Oh, I want unicodedata.normalize('NFC', _).

u'\xed\xac\xee\xe5\xe2\u2202_\xe8W\xf1n.pdf'
Attempt at UTF-8 rendition: í¬îåâ∂_èWñn.pdf

Although the byte sequences are totally different, they look the same, which is the point. *holds up fist* This… is… UNICODE!!!

Anyway, the code points in the normalized version make more sense. Except for that conspicuous \u2202 or , of course. Indeed, although it looks promising, if we now try e.g. codecs.encode(_, 'windows-1252') we get,

UnicodeEncodeError: 'charmap' codec can't encode character u'\u2202' in position 5: character maps to

One can get around this by passing a third argument to encode to make it ignore or replace the invalid parts, but the result of the first couple of codepoints, after further pseudomagical decoding and encoding, is still nonsense. Alas.

I continued to try passing the filenames in and out of codecs.encode and codecs.decode with various combinations of utf_8 and latin_1 and big5 and windows-1252, to no avail.

Then I did other stuff. Homework, college forms, writing a bilingual graduation song, talking to other dragons about 1994 video games, and did I mention I started taking driving lessons now? Yeah. That sort of thing.

Around that last thing, I asked the previous caretaker and learned that the computer she collected these files had a Japanese locale. So I added shift_jis and euc_jp to the mix, but still nothing.

I later tried unzipping it with unzip from the command-line — the results were even worse — as well as unzipping it from a Windows computer — even worse than the command line.

So the problem remained, until…


The resolution is very anticlimactic; it took me half a week to think of getting at the files programmatically straight from the zip archive, instead of unzipping first. Python spat out half the file names from the zip file as unicode and the other half as str. From there it was easy guessing.

There were still three file names that failed, but they were easy to fix manually.

I’m not sure why I eventually decided to blog about this anymore, honestly. Especially compared to my 12 other drafts. Oops.

ETA: I thought this script would be a one-trick pony but, amazingly, I ended up using it again the day after, this time with Big5 after I copied some files to a Windows laptop, converted them to .pdf, and sent them back in a .zip.

ETA2: This script came in handy again after I copied a zip file to an Ubuntu desktop computer with an actual CD drive so I could burn everything to a disc!

#pyconapac2014

Late post. As usual.

It started with an online competition — write programs, solve problems, get points. I wouldn’t call the problems easy, but they weren’t hard either. So I solved all of them. To make it even less impressive, only about twenty people submitted anything at all.

But the result was just what it was: I ended up with a free ticket to PyCon APAC 2014.

I’d prefer a conference about a more functional programming language, but I’ll take what I get. Another adventure!

Continue reading

Haskell and Primes

“I have been told that any encryption becomes safer if the underlying algorithm is maximally obscured, what is most conveniently done by coding it in Haskell.” — rankk

Functional programming is terribly addicting! Partly I think the completely different way of thinking makes it feel like learning programming, and falling in love with it, all over again. Partly there’s this evil sense of satisfaction from using $s (and later <$>s and =<<s and &&&s) to improve readability for initiated Haskellers and worsen it for everybody else. Partly it’s because Learn You a Haskell for Great Good! is such a fun read — there are too many funny bits to list but my favorite so far is when the author analyzes the first verse of Avril Lavigne’s Girlfriend.

Although I think my code in Haskell tends to be more readable than in other languages, code obfuscation in Haskell is almost natural: all you have to do is refactor the wrong function to be “pointfree”, which means that even though it’s a function that takes arguments, you define it without parameters by manipulating and joining a bunch of other functions. Example (plus a few other tiny obfuscations):

isPrime = liftA2 (&&) (liftA2 ($) (all . ((.) (0 /=)) . rem) (flip
    takeWhile [2..] . (flip (.) $ liftA2 (*) id id) . (>=))) ((<) 1)

QQ wordpress why no Haskell highlighting

Also, for some reason, you can do this in Haskell:

ghci> let 2 + 2 = 5 in 2 + 2
5

(via Haskell for the Evil Genius)


Okay, but seriously now. I wrote this about my journey to learn functional programming in the programming babble post half a year ago:

The main obstacle I have is that it’s hard to optimize or get asymptotics when computation is expensive (a big problem if you’re trying to learn through Project Euler problems, particularly ones with lots of primes).

Continue reading

Example “BetaWorldProblems”

because my title needs to mean something.

  • Too many green coins!
  • <i> or <em> tags!? How to decide?
  • Why is the menstrual cycle so complicated!?
  • QQ https://github.com/mxcl/homebrew/issues/17992
  • No, 94Seconds, I refuse to recognize that raisins are a kind of fruit.
  • The award for the most error-prone arithmetical calculations I’ve done in years goes to stoichiometry.
  • What do you mean, normal!? There’s nothing normal about an extension field where every irreducible is a product of distinct linear factors.

Are you weird like me and want to do chemistry homework with Python as your calculator?