[pyicu-dev] Bug in Python 4-byte to ICU UChar?
Jim Fulton
jim at zope.com
Wed Nov 30 11:05:29 PST 2005
In common.cpp, there is a definition for PyObject_AsUnicodeString:
EXPORT UnicodeString &PyObject_AsUnicodeString(PyObject *object,
                                               char *encoding, char *mode,
                                               UnicodeString &string)
{
    if (PyUnicode_CheckExact(object))
    {
        if (sizeof(Py_UNICODE) == sizeof(UChar))
            string.setTo((const UChar *) PyUnicode_AS_UNICODE(object),
                         (int32_t) PyUnicode_GET_SIZE(object));
        else
        {
            int len = PyUnicode_GET_SIZE(object);
            Py_UNICODE *pchars = PyUnicode_AS_UNICODE(object);
            UChar *chars = new UChar[len];

            for (int i = 0; i < len; i++)
                chars[i] = pchars[i];

            string.setTo((const UChar *) chars, (int32_t) len);
            delete chars;
        }
    }
    else if (PyString_CheckExact(object))
        PyString_AsUnicodeString(object, encoding, mode, string);
    else
    {
        PyErr_SetObject(PyExc_TypeError, object);
        throw ICUException();
    }

    return string;
}
(I have no idea where the actual source for this is.)
The inner else branch, taken when sizeof(Py_UNICODE) == sizeof(UChar)
is false, looks wrong to me, but I am far from a Unicode expert. It
looks like it will truncate any 4-byte Unicode character that has its
high-order bits set (anything above 0xFFFF), as 4-byte values are
being assigned to 2-byte variables. Is there some sort of C++ magic
that I'm missing? Shouldn't u_strFromUTF32 be used here?
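
To make the concern concrete, here is a minimal sketch that does not use ICU or Python at all. The function names narrow_by_truncation and utf32_to_utf16 are made up for illustration and are not part of PyICU; the first mimics the copy loop above, the second shows the surrogate-pair encoding I would expect a real UTF-32 to UTF-16 conversion (such as u_strFromUTF32) to produce. It assumes the input is well-formed UTF-32 and does no validation.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Mimics the element-by-element copy above: each 4-byte value is
// narrowed to 2 bytes, so only the low 16 bits survive.
std::u16string narrow_by_truncation(const std::vector<char32_t> &src)
{
    std::u16string out;
    for (char32_t c : src)
        out.push_back(static_cast<char16_t>(c));
    return out;
}

// Hypothetical helper: encodes each code point as UTF-16.
// Code points above U+FFFF become a high/low surrogate pair.
// Assumes well-formed input (no lone surrogates, nothing above U+10FFFF).
std::u16string utf32_to_utf16(const std::vector<char32_t> &src)
{
    std::u16string out;
    for (char32_t c : src)
    {
        if (c <= 0xFFFF)
            out.push_back(static_cast<char16_t>(c));
        else
        {
            char32_t v = c - 0x10000;  // 20 bits remain
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}
```

For U+1D11E, for example, the truncating copy yields the single unit 0xD11E (an unrelated character), while the surrogate-pair encoding yields 0xD834 0xDD1E. The output can therefore be longer than the input, so a buffer of len UChars is not always enough.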
Jim
--
Jim Fulton mailto:jim at zope.com Python Powered!
CTO (540) 361-1714 http://www.python.org
Zope Corporation http://www.zope.com http://www.zope.org