crazycmd: enable utf8 for offlineimap

Get offlineimap working with non ASCII characters.
Introduction

I recently started using offlineimap [1] to manage my GMail accounts locally on my machine.

With offlineimap all my email is local so the navigation is very fast and also I am a little more relieved that I have a local copy of all my email.

The only problem is that offlineimap does not work well with non english characters. This makes it difficult to use for those with labels in Russian, Chinese, Japanese etc [2].

Fortunately offlineimap offers some nice options that allows us to modify it's behavior. Using the "pythonfile" and "nametrans" options I was able to get Japanese working flawlessly.

Problem Description

The problem is that IMAP4 uses a modified UTF-7 coding for all folder names. A good IMAP client like mutt understands this coding and decodes it before presenting the folder name on screen. Unfortunately offlineimap does not convert the folder names before creating local repositories. This results in very cryptic names like "&MMYwuTDI-" that makes it difficult to know what folder is what.

Some research (googling) on the topic resulted in a very nice code to add IMAP4 UTF-7 encoding capabilities to Python [3]. So with a little time and some copy/paste skills I added this code to offlineimap and ban! I got Japanese working correctly in my folders.

Fixing unicode issues in offlineimap

First we must add the code to convert from IMAP4 UTF7 encoding support to offlineimap. Thanks to Dominic LoBue (offlineimap developer) I learned that we can add our own python code to offlineimap. To do this we created a simple python script (e.g. ~/.utf7.py) that contains the following code (copy/paste from [3]:

# -*- coding: latin-1 -*-

"""
Imap folder names are encoded using a special version of utf-7 as defined in RFC
2060 section 5.1.3.

5.1.3. Mailbox International Naming Convention

By convention, international mailbox names are specified using a
modified version of the UTF-7 encoding described in [UTF-7]. The
purpose of these modifications is to correct the following problems
with UTF-7:

1) UTF-7 uses the "+" character for shifting; this conflicts with
the common use of "+" in mailbox names, in particular USENET
newsgroup names.

2) UTF-7's encoding is BASE64 which uses the "/" character; this
conflicts with the use of "/" as a popular hierarchy delimiter.

3) UTF-7 prohibits the unencoded usage of "\"; this conflicts with
the use of "\" as a popular hierarchy delimiter.

4) UTF-7 prohibits the unencoded usage of "~"; this conflicts with
the use of "~" in some servers as a home directory indicator.

5) UTF-7 permits multiple alternate forms to represent the same
string; in particular, printable US-ASCII chararacters can be
represented in encoded form.

In modified UTF-7, printable US-ASCII characters except for "&"
represent themselves; that is, characters with octet values 0x20-0x25
and 0x27-0x7e. The character "&" (0x26) is represented by the two-
octet sequence "&-".

All other characters (octet values 0x00-0x1f, 0x7f-0xff, and all
Unicode 16-bit octets) are represented in modified BASE64, with a
further modification from [UTF-7] that "," is used instead of "/".
Modified BASE64 MUST NOT be used to represent any printing US-ASCII
character which can represent itself.

"&" is used to shift to modified BASE64 and "-" to shift back to US-
ASCII. All names start in US-ASCII, and MUST end in US-ASCII (that
is, a name that ends with a Unicode 16-bit octet MUST end with a "-
").

For example, here is a mailbox name which mixes English, Japanese,
and Chinese text: ~peter/mail/&ZeVnLIqe-/&U,BTFw-

"""
import binascii
import codecs

# encoding

def modified_base64(s):
s = s.encode('utf-16be')
return binascii.b2a_base64(s).rstrip('\n=').replace('/', ',')

def doB64(_in, r):
if _in:
r.append('&%s-' % modified_base64(''.join(_in)))
del _in[:]

def encoder(s):
r = []
_in = []
for c in s:
ordC = ord(c)
if 0x20 <= ordC <= 0x25 or 0x27 <= ordC <= 0x7e:
doB64(_in, r)
r.append(c)
elif c == '&':
doB64(_in, r)
r.append('&-')
else:
_in.append(c)
doB64(_in, r)
return (str(''.join(r)), len(s))

# decoding

def modified_unbase64(s):
b = binascii.a2b_base64(s.replace(',', '/') + '===')
return unicode(b, 'utf-16be')

def decoder(s):
r = []
decode = []
for c in s:
if c == '&' and not decode:
decode.append('&')
elif c == '-' and decode:
if len(decode) == 1:
r.append('&')
else:
r.append(modified_unbase64(''.join(decode[1:])))
decode = []
elif decode:
decode.append(c)
else:
r.append(c)
if decode:
r.append(modified_unbase64(''.join(decode[1:])))
bin_str = ''.join(r)
return (bin_str, len(s))

class StreamReader(codecs.StreamReader):
def decode(self, s, errors='strict'):
return decoder(s)

class StreamWriter(codecs.StreamWriter):
def decode(self, s, errors='strict'):
return encoder(s)

def imap4_utf_7(name):
if name == 'imap4-utf-7':
return (encoder, decoder, StreamReader, StreamWriter)
codecs.register(imap4_utf_7)

Then in our offlineimap configuration file, in the "[general]" section we add a line to load this python script like:

pythonfile = ~/.utf7.py

Then in our remote repository configuration add a nametrans option to convert all foldernames from "imap4-utf-7" encoding to your encoding of preference. My Ubuntu installation is all UTF-8 so I convert all folder names to this encoding:

nametrans = lambda foldername: foldername.decode('imap4-utf-7').encode('utf-8')

The "imap4-utf-7" encoding is added by our "utf7.py" script. With this configuration I can now see the correct Japanese names for all the folders (labels) in my GMail accounts inside Mutt [4].

crazycmd

November 2, 2011

enable utf8 for offlineimap

No comments:

Post a Comment

How to type letters with accent

Followers

Labels