[pyicu-dev] Bug in Python 4-byte to ICU UChar?

Jim Fulton jim at zope.com
Wed Nov 30 11:05:29 PST 2005


In common.cpp, there is a definition for PyObject_AsUnicodeString:

EXPORT UnicodeString &PyObject_AsUnicodeString(PyObject *object,
                                                char *encoding, char *mode,
                                                UnicodeString &string)
{
     if (PyUnicode_CheckExact(object))
     {
         if (sizeof(Py_UNICODE) == sizeof(UChar))
             string.setTo((const UChar *) PyUnicode_AS_UNICODE(object),
                          (int32_t) PyUnicode_GET_SIZE(object));
         else
         {
             int len = PyUnicode_GET_SIZE(object);
             Py_UNICODE *pchars = PyUnicode_AS_UNICODE(object);
             UChar *chars = new UChar[len];

             for (int i = 0; i < len; i++)
                 chars[i] = pchars[i];

             string.setTo((const UChar *) chars, (int32_t) len);
             delete chars;
         }
     }
     else if (PyString_CheckExact(object))
         PyString_AsUnicodeString(object, encoding, mode, string);
     else
     {
         PyErr_SetObject(PyExc_TypeError, object);
         throw ICUException();
     }

     return string;
}

(I have no idea where the actual source for this is.)

The inner else branch, taken when sizeof(Py_UNICODE) != sizeof(UChar),
looks wrong to me, but I am far from a Unicode expert.  It looks like
it will truncate any 4-byte Unicode character above U+FFFF (that is,
with the high-order bits set), as 4-byte values are being assigned
to 2-byte variables.  Is there some sort of C++ magic that
I'm missing?  Shouldn't u_strFromUTF32 be used here?
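
To make the concern concrete, here is a minimal standalone sketch of
what I think a correct conversion has to do.  It is not PyICU or ICU
code: the helper names utf32_to_utf16 and truncate_to_16 are
hypothetical, and I am assuming the 4-byte Py_UNICODE values are plain
UTF-32 code points.  Anything above U+FFFF needs two UTF-16 code units
(a surrogate pair), whereas the element-by-element copy keeps only the
low 16 bits.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of PyICU or ICU): convert UTF-32 code
// points to UTF-16 code units, emitting a surrogate pair for every
// code point above U+FFFF.
std::u16string utf32_to_utf16(const std::vector<std::uint32_t> &src)
{
    std::u16string out;
    out.reserve(src.size());

    for (std::uint32_t cp : src)
    {
        // Out-of-range values and lone surrogates are not valid scalar
        // values; this sketch substitutes U+FFFD for them.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = 0xFFFD;

        if (cp <= 0xFFFF)
            out.push_back(static_cast<char16_t>(cp));
        else
        {
            cp -= 0x10000;  // leaves a 20-bit value
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // lead
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // trail
        }
    }

    return out;
}

// Hypothetical helper mirroring the loop in the else branch: each
// 4-byte value is narrowed to 2 bytes, so the high 16 bits are lost.
std::u16string truncate_to_16(const std::vector<std::uint32_t> &src)
{
    std::u16string out;

    for (std::uint32_t cp : src)
        out.push_back(static_cast<char16_t>(cp));

    return out;
}
```

For example, U+1D11E should become the pair 0xD834 0xDD1E, but the
truncating copy yields the single unit 0xD11E, which is a different
character altogether.  Note also that the output can be longer than
the input, so a buffer of len UChars would not always be big enough.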

Jim

-- 
Jim Fulton           mailto:jim at zope.com       Python Powered!
CTO                  (540) 361-1714            http://www.python.org
Zope Corporation     http://www.zope.com       http://www.zope.org

