Ticket #1053 (new defect)

Opened 5 years ago

Last modified 5 years ago

Encoding/Conversion fails when non-ASCII characters are used in any field of contacts, todos or events

Reported by: ThoMaus Owned by: dgollub
Priority: high Milestone:
Component: OpenSync: Format Conversion Version: 0.22
Severity: critical Keywords: encoding, conversion, synce
Cc: ThoMaus, Graham, Cobb

Description

Environment in use

  • OpenSync 0.22 (plugins: kdepim, synce-legacy)
  • SynCE -- issue occurs with 0.12 and 0.13
  • Kontact 3.5.10
  • OpenSuSE 11.1
  • PDA with WinCE aka WM2003

Observations

  • Sync between KDE-PIM and PDA is running fine, as long as no characters outside the ASCII set are used (on the PDA).
  • As soon as the PDA data contains any non-ASCII characters, e.g. umlauts and the like, the conversion engine fails with 'invalid utf8 passed to VFormat. Limping along.' The resulting XML data field is cut off at the position of the non-ASCII character -- the XML structure itself is undamaged. (A sketch of how this truncation can arise follows this list.)
  • The entries ending up in Kontact's databases (std.ics and std.vcf) are encoded as UTF8 and the data is cut off where the intermediary XML data was cut off.
  • Trace output from msynctool suggests that the intermediary vcal or vcard data is in Windows code page 1250 encoding (which is very similar to ISO-latin-1).
  • A pcap traffic capture on the ppp0 interface shows that the data from the PDA is encoded as indicated by the msynctool trace, with one notable exception: all characters are encoded as 2-byte entities except in the note or comment fields of contacts, todos or events, which use single-byte characters (e.g. umlauts have the identical binary representation apart from the 1- vs. 2-byte width).
  • Non-ASCII characters traveling in the opposite direction, i.e. from KDE PIM to the PDA, are:
    • UTF-8-encoded in the PIM databases
    • UTF-8-encoded in the intermediary vcal (for VEVENT or VTODO) or vcard (contacts)
    • HTML-encoded Unicode in the intermediary XML (i.e. an a-umlaut becomes a numeric character reference such as &#228;)
    • junk on the PDA (i.e. an a-umlaut is displayed as two characters, capital A with tilde followed by a currency sign -- the UTF-8 byte pair read as Latin-1)
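
For illustration -- this is my assumption of the mechanism, not the actual vformat.c code -- validating legacy single-byte data as UTF-8 with GLib produces exactly this kind of cut-off:

    #include <glib.h>
    #include <stdio.h>

    int main(void)
    {
        /* "Tätigkeit" with a CP1250/Latin-1 a-umlaut (0xE4): not valid UTF-8 */
        const char raw[] = "T\xE4tigkeit";
        const char *end = raw;

        if (!g_utf8_validate(raw, -1, &end)) {
            /* 'end' points at the first invalid byte; keeping only the
             * bytes before it reproduces the observed truncation ("T") */
            printf("invalid utf8 -- kept: '%.*s'\n", (int)(end - raw), raw);
        }
        return 0;
    }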

Request

This conversion problem renders OpenSync useless for users relying on characters outside the ASCII set -- which should constitute a significant percentage of the potential user base. The problem exists at least for WM2003 devices, but as it seems to be located in the central conversion routines, it might impair other sync plugins that do not natively use UTF-8, too.

Therefore I consider this a critical defect.

Given some guidance, I'm willing to investigate this more deeply and try to provide (part of) a solution (see below).

Attachments

msynctool_excerpt.log (81.0 KB) - added by ThoMaus 5 years ago.
annotated msynctool log file for syncing todos with different kinds of non-ASCII content
examples.tgz (40.9 KB) - added by ThoMaus 5 years ago.
Raw and transformed tasks exercising varying charsets and triggering the defect, together with screenshots showing how the data was originally displayed on the PDA

Change History

comment:1 Changed 5 years ago by ThoMaus

Further Investigations

  • A test with OpenSync 0.38 is not viable, as I could not find a synce-legacy plugin (and the synce plugin shipped with synce 0.13 lives up to expectations by not working with a WM2003 device ... ;-) Alas, it probably would make more sense to test whether the problem persists in the new code base. Hopefully you plan to support WM2003 devices in the new code base, too?
  • The error message 'invalid utf8 passed to VFormat. Limping along.' is thrown in function _parse in file vformat.c, line 690.
  • _parse is called only from vformat_construct (line 811)
  • which is called only from vformat_new_from_string (line 826)
  • which is called only from vformat_new (line 833)
  • which is not called from within file format.c -- is there any documentation shedding light on the architecture and interactions at some higher level than mere function calls?
  • function _read_attribute_value_add (vformat.c, line 217) contains provisions for converting from parametrised charsets to UTF-8, but these obviously are not used in our case!? (see the sketch below)
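
For reference, a sketch of the kind of parametrised-charset conversion these provisions suggest (the helper name and fallback behaviour are my illustration, not the actual code) -- note it can only fire when the attribute actually carries a CHARSET= parameter:

    #include <glib.h>

    /* Illustrative helper: convert an attribute value declared as
     * CHARSET=<charset> into UTF-8, falling back to the raw bytes. */
    static char *value_to_utf8(const char *value, const char *charset)
    {
        GError *err = NULL;
        char *utf8 = g_convert(value, -1, "UTF-8", charset,
                               NULL, NULL, &err);

        if (!utf8) {
            g_clear_error(&err);
            return g_strdup(value);  /* conversion failed: keep raw bytes */
        }
        return utf8;
    }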

Any hints?

comment:2 follow-up: ↓ 3 Changed 5 years ago by Graham Cobb

  • Cc Graham, Cobb added

I have also seen error messages about non-ASCII characters in 0.3x; however, I haven't yet had time to investigate or reproduce them.

I really think it is worth trying to reproduce the problems with 0.3x. If you have some time to spend on this you might want to try to reproduce it in 0.38 with a file-to-file sync. One option is to export data from your PDA to create a file (e.g. export one contact to create a file containing a vCard). Then try syncing that to another directory (using two instances of file-sync) and see if you can reproduce the problem.

One problem is that in the most recent versions, file-to-file probably doesn't go through vformat processing any more. I am not sure how to force that -- but I think in 0.38 it did still go through vformat.

Graham

comment:3 in reply to: ↑ 2 ; follow-up: ↓ 5 Changed 5 years ago by ThoMaus

Replying to Graham Cobb:

...

I really think it is worth trying to reproduce the problems with 0.3x. If you have some time to spend on this you might want to try to reproduce it in 0.38 with a file-to-file sync. ...

I really would love to test this, but 0.38 is not working at all for me, as I have not been able to find a plugin for synce-legacy, i.e. WM2003 aka WinCE devices, which is my platform. If you could point me to a plugin that enables testing with WM2003, that would be very welcome.

(Actually I'm wondering: WM2003 has an airsync.dll -- is there any way to operate it in "non-legacy" synce mode?)

comment:4 Changed 5 years ago by ThoMaus

Investigations into and questions about the 0.22 code base

I took a deeper look into the code base of the synce plugin (synce.c and friends).

My rationale was that I expected to find all sync-group-member-specific, i.e. device-specific, stuff concentrated here. I hoped for some (undocumented ;-) config options where a sync group member can be configured on a per-device basis as using Windows Code Page XYZZY -- but I did not discover anything like this. Along the same lines of thought I conjectured that your architecture uses some kind of "data bus": an exchange data format powerful enough to contain all potential input formats, where every plugin is expected to do its own conversions, i.e. delivering output and taking input in this exchange format.
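
Something along these lines (a purely illustrative pseudo-interface, not OpenSync's actual API):

    #include <glib.h>

    /* Conjectured "data bus": one canonical exchange format, always
     * UTF-8, with each plugin doing its own charset/format conversions. */
    typedef struct {
        char  *data;   /* canonical exchange representation, UTF-8 */
        gsize  len;
    } ExchangeItem;

    typedef struct {
        ExchangeItem *(*to_exchange)(const guchar *raw, gsize len);
        guchar       *(*from_exchange)(const ExchangeItem *item, gsize *len);
    } PluginConverter;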

Your architecture seems to be different, but as of yet I neither understand your architecture nor your (probably good!) reasons for it.

I found the plugin using opensync_convert.c, opensync_convreg.c and friends, but there seems to be no mechanism in place for a plugin to communicate the charset it is operating in, nor do the conversion functions provided take such charsets into account.

I'm baffled:

  • What is a plugin expected to provide and accept in syncing events, todos, contacts?
  • Which entity is considered to be in charge of triggering/performing charset and format conversions?
  • How is this entity provided with the necessary information, e.g. about charsets to be used?

(For the moment I suspect that either the plugin is not fulfilling its obligations (e.g. converting to UTF-8), or the conversion mechanism is either not called at all or called with insufficient parameters ...)

comment:5 in reply to: ↑ 3 ; follow-up: ↓ 6 Changed 5 years ago by Graham Cobb

Replying to ThoMaus:

I really would love to test this, but 0.38 is not working at all for me, as I have not been able to find a plugin for synce-legacy, i.e. WM2003 aka WinCE devices, which is my platform.

My suggestion was that you try to reproduce it in the latest version without using synce (or any device at all). Just use files containing the data and see if there are any vformat errors reported when synchronising those to other files (using the file-sync plugin in each case).

comment:6 in reply to: ↑ 5 Changed 5 years ago by ThoMaus

Replying to Graham Cobb:

Replying to ThoMaus:

I really would love to test this, but 0.38 is not working at all for me, as I have not been able to find a plugin for synce-legacy, i.e. WM2003 aka WinCE devices, which is my platform.

My suggestion was that you try to reproduce it in the latest version without using synce (or any device at all). Just use files containing the data and see if there are any vformat errors reported when synchronising those to other files (using the file-sync plugin in each case).

I'm completely at a loss as to what insight we would gain from this test.

The error occurs during the conversion of the event/todo/contact coming from the WinCE device to XML. This will not happen when syncing it as a file!? I'd be syncing from one Linux system to another, both running UTF-8 locales.

Even if there were a conversion -- because I synthetically crafted an .ics with Windows Code Page 1250 chars in it -- it would happen in a completely different context (especially not prompting conversion to XML events/todos in any way), and it actually should be a null transformation, because there is no reasonable source of information which could warrant a binary mangling of file contents.

(Imagine the following: in the same directory I put a JPG, MP3 or other binary media file -- it would be wrong, IMHO, for the file-sync plugin to manipulate these binary representations during sync, wouldn't it? Why then should it handle a file seemingly containing VCARD data differently, and how could it be sure?)

Changed 5 years ago by ThoMaus

annotated msynctool log file for syncing todos with different kinds of non-ASCII content

comment:7 Changed 5 years ago by ThoMaus

Further Investigations

  • When using the evolution plugin instead of the kdepim plugin, exactly the same faulty behavior shows up in the logs. Conclusion: the problem is not located in the kdepim plugin.
  • I crafted some todos, events and contacts as test cases: some containing only ASCII chars on the one hand, some with Middle European diacriticals (mostly umlauts) encoded in Windows Code Page 1250 on the other hand, and finally some containing a mixture of latin and cyrillic characters (to the best of my knowledge encoded in Unicode) on the gripping hand. (Just google for 'gripping hand' and find some of the finest SF classics ;-) All these non-ASCII entries were displayed and handled correctly on the PDA (aka WinCE aka WM2003 device).

I performed a sync test with these -- find attached an annotated msynctool log (comments start with hash mark at the beginning of the line).

My conclusion from the results: The initial conversion from the raw PDA data into the VCALENDAR and VCARD formats is (identically) broken, causing -- of course -- several aftereffects.

Questions

  • Which components are considered responsible for performing this initial transformation into VCALENDAR and VCARD?
  • What format(s) in what charsets are they expected to deliver to which components?

comment:8 Changed 5 years ago by ThoMaus

Errata

First I have to correct myself: the intermediate VCALENDAR and VCARD data is in the native representation of the device, which is much more complex than I had guessed so far. In particular, the encoding is not Windows Code Page 1250.

I attach a tarball with a few crafted examples covering different non-ASCII char classes in different fields: in raw format (as delivered by rra-get-data -- you should be able to upload these to a PDA with rra-put-data and make your own tests with them), their transformations into VCALENDAR, and screenshots from the PDA showing the actual glyphs used on the device.

Working hypothesis

Entries for todos (and events and contacts) are stored as wide chars (2-byte chars) in UTF-16-LE on the device -- at least the character codings of all non-ASCII chars used in the examples support this. The notable exception are the notes in these entries (which become the DESCRIPTION fields in VCALENDAR and VCARD): these are stored as blobs and have a different encoding depending on the content.

If (and only if) all non-ASCII chars can be represented in Windows Code Page 1252, this encoding is used. (It's the M$ mockup of ISO-latin-1 -- mostly identical, but with a few additional chars, notably the EURO sign in a different position. In this respect it is identical to cp 1250 -- thus I was misled ...)

If there are any chars outside Windows Code Page 1252, a completely different encoding is used (even for the chars representable within cp 1252!). This encoding is very similar to UTF-8 but slightly different. To the best of my knowledge it is not CESU-8 either. I'm clueless ... Any hint is appreciated.
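
If this hypothesis holds, decoding the two identified cases would look roughly like this (an illustrative helper built on GLib's g_convert; the third, unidentified encoding is deliberately left out):

    #include <glib.h>

    /* Illustrative decoder for the two identified cases: ordinary
     * fields as UTF-16LE wide chars, note blobs whose characters all
     * fit into Windows Code Page 1252 as CP1252. Returns NULL on
     * failure. */
    static char *device_field_to_utf8(const guchar *raw, gsize len,
                                      gboolean is_note_blob)
    {
        const char *from = is_note_blob ? "CP1252" : "UTF-16LE";
        return g_convert((const char *)raw, len, "UTF-8", from,
                         NULL, NULL, NULL);
    }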

Request

Those in the know: please give an indication of where the root cause of this defect is probably located. Without understanding the architecture you've chosen and your rationale behind it, I don't want to mess with your code on the one hand, and can waste arbitrary amounts of time and brainpower in defect-irrelevant places on the other ...

Changed 5 years ago by ThoMaus

Raw and transformed tasks exercising varying charsets and triggering the defect, together with screenshots showing how the data was originally displayed on the PDA

comment:9 follow-up: ↓ 10 Changed 5 years ago by cstender

We try to convert a given vcard/vevent/vtodo into utf-8 in vformat.c [0] with a dirty hack. Please look at _read_attribute_value_add. Sadly, I've no time at the moment to do further debugging.

[0] http://www.opensync.org/browser/branches/branch-0.2X/opensync/formats/vformats-xml/vformat.c

comment:10 in reply to: ↑ 9 Changed 5 years ago by Graham Cobb

Replying to cstender:

We try to convert a given vcard/vevent/vtodo into utf-8 in vformat.c [0] with a dirty hack. Please look at _read_attribute_value_add. Sadly, I've no time at the moment to do further debugging.

By the way, we do a similar hack when reading vformat files for import into the GPE applications, although the logic is slightly different: we first try reading it as UTF-8; then, if that doesn't work, we try reading it using the user's current locale; only if that fails do we try reading it as ISO-8859-15. At least that means the user has a chance to help us with the problem by defining his locale. You might want to try adding that option to vformat.
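
In rough C the chain looks like this (a sketch of the idea using GLib, not the actual GPE code):

    #include <glib.h>

    /* Fallback chain: UTF-8, then the user's current locale, then
     * ISO-8859-15. Returns NULL only if all three attempts fail. */
    static char *guess_to_utf8(const char *data, gssize len)
    {
        char *out;

        if (g_utf8_validate(data, len, NULL))
            return len < 0 ? g_strdup(data) : g_strndup(data, (gsize)len);

        out = g_locale_to_utf8(data, len, NULL, NULL, NULL);
        if (out)
            return out;

        return g_convert(data, len, "UTF-8", "ISO-8859-15",
                         NULL, NULL, NULL);
    }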

And there is another subtlety: if an attribute value is encoded in QP or Base64 then the decoded value must be converted again (a second time) using the current charset after decoding (i.e. if the file was originally in ISO-8859-15 then we assume the QP or Base64 decode will give us a value also in ISO-8859-15 and it will need to be converted to UTF-8 before we can use it).
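
For the Base64 case the ordering would look like this (a sketch; GLib has no quoted-printable decoder, so only Base64 is shown):

    #include <glib.h>

    /* Decode first, then convert: the decoded bytes are still in the
     * file's original charset and need their own pass to UTF-8. */
    static char *b64_value_to_utf8(const char *b64, const char *file_charset)
    {
        gsize   rawlen;
        guchar *raw  = g_base64_decode(b64, &rawlen);
        char   *utf8 = g_convert((const char *)raw, rawlen, "UTF-8",
                                 file_charset, NULL, NULL, NULL);
        g_free(raw);
        return utf8;  /* NULL if the charset conversion failed */
    }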

Another subtlety is that in the GPE logic we make the conversion decision on the whole file at once, not on individual attribute values. This is because it is quite common to have naked 8-bit values in, say, the N: or FN: fields of a card, but QP-encoded values (where each character may just be 7-bit ASCII) in the LABEL: field. It is really useful to look at the whole data, not the individual attribute value, in order to make a good guess.

I have also been wondering whether we should add an advanced option to file-sync to allow the user to specify the charset for the files, which could then be passed with the data. Of course, file-sync would also have to use that when creating output files so that the charset remained consistent.
