[pyicu-dev] Bug in Python 4-byte to ICU UChar?
Andi Vajda
vajda at osafoundation.org
Wed Nov 30 11:31:41 PST 2005
On Wed, 30 Nov 2005, Jim Fulton wrote:
>
> In common.cpp, there is a definition for PyObject_AsUnicodeString:
>
> EXPORT UnicodeString &PyObject_AsUnicodeString(PyObject *object,
>                                                char *encoding, char *mode,
>                                                UnicodeString &string)
> {
>     if (PyUnicode_CheckExact(object))
>     {
>         if (sizeof(Py_UNICODE) == sizeof(UChar))
>             string.setTo((const UChar *) PyUnicode_AS_UNICODE(object),
>                          (int32_t) PyUnicode_GET_SIZE(object));
>         else
>         {
>             int len = PyUnicode_GET_SIZE(object);
>             Py_UNICODE *pchars = PyUnicode_AS_UNICODE(object);
>             UChar *chars = new UChar[len];
>
>             for (int i = 0; i < len; i++)
>                 chars[i] = pchars[i];
>
>             string.setTo((const UChar *) chars, (int32_t) len);
>             delete[] chars;
>         }
>     }
>     else if (PyString_CheckExact(object))
>         PyString_AsUnicodeString(object, encoding, mode, string);
>     else
>     {
>         PyErr_SetObject(PyExc_TypeError, object);
>         throw ICUException();
>     }
>
>     return string;
> }
>
> (I have no idea where the actual source for this is.)
>
> The inner else case, where sizeof(Py_UNICODE) == sizeof(UChar) is false,
> looks wrong to me, but I am far from a Unicode expert. It looks like
> it will silently truncate any 4-byte Unicode characters that have
> high-order bits set, as 4-byte values are being assigned to 2-byte
> variables. Is there some sort of C++ magic that I'm missing?
> Shouldn't u_strFromUTF32 be used here?
Indeed, this is a bug. Is u_strFromUTF32() compatible with Python's 4-byte
Unicode? If so, the fix should be simple. If not, what are the differences and
how are they bridged?
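
For reference, here is a rough sketch of what the widening loop would have
to do in order to be correct: any code point above U+FFFF has to become a
UTF-16 surrogate pair rather than being cut down to its low 16 bits. The
helper name utf32ToUtf16 is hypothetical and is not part of PyICU or ICU, and
replacing lone surrogates with U+FFFD is an assumption, not something this
thread settles. ICU's u_strFromUTF32() (declared in unicode/ustring.h)
performs this kind of conversion; how it treats every value a 4-byte Python
build can store is exactly the open question above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not PyICU or ICU code): converts UTF-32 code points
// to UTF-16 code units, emitting a surrogate pair for values above U+FFFF.
std::vector<char16_t> utf32ToUtf16(const char32_t *src, std::size_t len)
{
    std::vector<char16_t> out;
    out.reserve(len);  // every code point needs at least one code unit

    for (std::size_t i = 0; i < len; i++) {
        char32_t c = src[i];

        if ((c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF) {
            // Assumption: lone surrogates and out-of-range values are
            // replaced with U+FFFD instead of being passed through.
            out.push_back(0xFFFD);
        } else if (c < 0x10000) {
            // BMP code point: fits in a single 16-bit code unit.
            out.push_back(static_cast<char16_t>(c));
        } else {
            // Supplementary code point: split into high and low surrogates.
            c -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (c >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (c & 0x3FF)));
        }
    }

    return out;
}
```

With this, U+1D11E comes out as the pair 0xD834 0xDD1E, whereas the plain
assignment chars[i] = pchars[i] would keep only 0xD11E. Note that the output
can be longer than the input, so a buffer of len UChars is not always enough.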
Andi..