[Chandler-dev] PyICU upgraded with charset detection

Andi Vajda vajda at osafoundation.org
Fri Jan 26 16:43:41 PST 2007


I added support for the new character set detection APIs in ICU 3.6.
These APIs are documented here:
     http://icu.sourceforge.net/apiref/icu4c/ucsdet_8h.html

The PyICU wrappers include two Python classes, CharsetDetector and
CharsetMatch, which wrap the ICU APIs.

   detector = CharsetDetector()
   detector = CharsetDetector(string)   # string must be a Python str
   detector = CharsetDetector(string, declaredEncoding)

   match = detector.detect()
   matches = detector.detectAll()       # returns a tuple of all matches

   detector.setText(string)             # string must be a Python str
   detector.setDeclaredEncoding(encoding)

   detector.enableInputFilter(bool)
   bool = detector.isInputFilterEnabled()

   stringEnumeration = detector.getAllDetectableCharsets()

   string = match.getName()
   number = match.getConfidence()
   string = match.getLanguage()

   string = unicode(match)              # returns the decoded text as unicode

In other words, a simple way to convert an attachment or feed data to
Unicode is:

   >>> unicode(CharsetDetector(data).detect())
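
For data of uncertain quality, the one-liner can be wrapped with a
confidence check and a fallback. What follows is only a minimal sketch, not
part of PyICU: the function name to_unicode and the min_confidence and
fallback parameters are hypothetical, and the threshold of 50 is arbitrary.
It assumes Python 3 (where data is bytes and str(match) plays the role of
unicode(match)), that detect() may return None when nothing matches, and
that getConfidence() uses the 0 to 100 range described in the ICU
documentation. The detector class is passed in by the caller so the helper
itself does not import anything.

```python
def to_unicode(data, detector_cls, min_confidence=50, fallback="utf-8"):
    """Decode bytes using a CharsetDetector-like class, with a fallback.

    detector_cls is expected to behave like PyICU's CharsetDetector:
    detector_cls(data).detect() returns a match object offering
    getConfidence() and a str() conversion to the decoded text, or None
    when no charset could be determined (an assumption of this sketch).
    """
    match = detector_cls(data).detect()
    if match is None or match.getConfidence() < min_confidence:
        # No trustworthy guess: decode with the fallback encoding and
        # substitute U+FFFD for any bytes that do not decode.
        return data.decode(fallback, errors="replace")
    return str(match)
```

With PyICU installed, this would presumably be called as
to_unicode(data, CharsetDetector).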

Andi..

