[pyicu-dev] Bug in Python 4-byte to ICU UChar?

Andi Vajda vajda at osafoundation.org
Wed Nov 30 11:31:41 PST 2005


On Wed, 30 Nov 2005, Jim Fulton wrote:

>
> In common.cpp, there is a definition for PyObject_AsUnicodeString:
>
> EXPORT UnicodeString &PyObject_AsUnicodeString(PyObject *object,
>                                               char *encoding, char *mode,
>                                               UnicodeString &string)
> {
>    if (PyUnicode_CheckExact(object))
>    {
>        if (sizeof(Py_UNICODE) == sizeof(UChar))
>            string.setTo((const UChar *) PyUnicode_AS_UNICODE(object),
>                         (int32_t) PyUnicode_GET_SIZE(object));
>        else
>        {
>            int len = PyUnicode_GET_SIZE(object);
>            Py_UNICODE *pchars = PyUnicode_AS_UNICODE(object);
>            UChar *chars = new UChar[len];
>
>            for (int i = 0; i < len; i++)
>                chars[i] = pchars[i];
>
>            string.setTo((const UChar *) chars, (int32_t) len);
>            delete chars;
>        }
>    }
>    else if (PyString_CheckExact(object))
>        PyString_AsUnicodeString(object, encoding, mode, string);
>    else
>    {
>        PyErr_SetObject(PyExc_TypeError, object);
>        throw ICUException();
>    }
>
>    return string;
> }
>
> (I have no idea where the actual source for this is.)
>
> The else branch, taken when sizeof(Py_UNICODE) == sizeof(UChar) is false,
> looks wrong to me, but I am far from a Unicode expert.  It looks like
> it will truncate any 4-byte Unicode character that has the high-order
> bits set, as 4-byte values are being assigned to 2-byte variables.
> Is there some sort of C++ magic that I'm missing?  Shouldn't
> u_strFromUTF32 be used here?

Indeed, this is a bug. Is u_strFromUTF32() compatible with Python's 4-byte
Unicode? If so, the fix should be simple. If not, what are the differences and
how are they bridged?
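
For reference, here is a minimal sketch of the conversion the else branch
would need in place of the plain chars[i] = pchars[i] assignment. It assumes
that a 4-byte Py_UNICODE holds one code point per element, which I have not
verified against the Python sources. The name utf32ToUtf16 is hypothetical
and is not part of PyICU or ICU; it uses only standard C++ types so it can be
tried without either library. u_strFromUTF32() in ICU's unicode/ustring.h
performs this kind of UTF-32 to UTF-16 conversion, so the real fix would
presumably call that instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a PyICU or ICU function): converts an array of
// UTF-32 code points to UTF-16, emitting a surrogate pair for every code
// point above U+FFFF instead of truncating it to 16 bits.
// Returns false on the first value that is not a valid Unicode scalar value;
// in that case 'out' holds only the units converted so far.
bool utf32ToUtf16(const char32_t *src, std::size_t len, std::u16string &out)
{
    out.clear();
    out.reserve(len);

    for (std::size_t i = 0; i < len; i++)
    {
        char32_t c = src[i];

        // Reject values beyond U+10FFFF and lone surrogate code points.
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return false;

        if (c < 0x10000)
            out.push_back(static_cast<char16_t>(c));
        else
        {
            // Split the 20 bits above 0x10000 into high and low surrogates.
            c -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (c >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (c & 0x3FF)));
        }
    }

    return true;
}
```

Note that the output can be up to twice as long as the input, so a buffer
sized to len UChars, as in the current code, would not be enough for strings
containing characters outside the Basic Multilingual Plane.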

Andi..

