Apache OpenOffice (AOO) Bugzilla – Issue 8810
Enhance XMerge to allow access to embedded objects in OpenOffice XML files.
Last modified: 2002-11-06 11:29:42 UTC
The XMerge API needs to be enhanced to provide read/write access to embedded objects within an OpenOffice.org XML file. This will allow XMerge to be used for conversions of richer documents than a PDA can handle. For further information, see the thread on dev@xml.openoffice.org started by Henrik Just on 18th October 2002, entitled "Using xmerge to convert rich document formats." The changes will take the form of adding an abstract EmbeddedObject class to the org.openoffice.xmerge.converter.xml package. There will be two concrete classes, EmbeddedBinaryObject and EmbeddedXMLObject, to represent the two types of embedded object allowed in an OpenOffice.org XML file (as of XML File Format Specification 1.0).
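A rough sketch of the class layout proposed above may help readers picture it. This is not the committed XMerge code: the class names carry a "Sketch" suffix, and the constructors, the getBinaryData/setBinaryData and getContentDOM/setContentDOM accessors, and the single-Document simplification for XML objects are all assumptions made for illustration.

```java
import org.w3c.dom.Document;

// Hypothetical sketch of the proposed hierarchy; not the actual XMerge source.
abstract class EmbeddedObjectSketch {
    private final String name; // name/path of the object within the package
    private final String type; // MIME type of the object

    protected EmbeddedObjectSketch(String name, String type) {
        this.name = name;
        this.type = type;
    }

    public final String getName() { return name; }
    public final String getType() { return type; }
}

// A binary embedded object (for example an image) held as raw bytes.
final class EmbeddedBinaryObjectSketch extends EmbeddedObjectSketch {
    private byte[] data;

    EmbeddedBinaryObjectSketch(String name, String type, byte[] data) {
        super(name, type);
        setBinaryData(data);
    }

    // Defensive copies keep callers from mutating the stored bytes.
    public byte[] getBinaryData() { return data.clone(); }
    public void setBinaryData(byte[] data) { this.data = data.clone(); }
}

// An XML embedded object, simplified here to a single DOM tree (assumption).
final class EmbeddedXMLObjectSketch extends EmbeddedObjectSketch {
    private Document content;

    EmbeddedXMLObjectSketch(String name, String type, Document content) {
        super(name, type);
        this.content = content;
    }

    public Document getContentDOM() { return content; }
    public void setContentDOM(Document content) { this.content = content; }
}
```

Keeping the name and MIME type in the abstract base lets code that only lists objects ignore whether a given object is binary or XML.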
Changes are mostly complete. Will use this bug to track changes made to the XMerge API.
EmbeddedObject defines accessor methods for the data of the embedded object as well as the name/path (within the manifest.xml file) and MIME type of the object. A number of package-private methods also exist to interact with the OfficeZip and OfficeDocument classes for storage purposes. Note that flat OpenOffice.org XML files store embedded objects as inline tags/data within the document structure. The EmbeddedObject class and its subclasses are intended to represent embedded objects only as stored in the zipped OpenOffice.org file format.
Retrieval of both EmbeddedObject information and the data for each EmbeddedObject is deferred until specifically requested via the provided methods. This incurs a performance penalty when first accessing data, but ensures that no performance degradation occurs where embedded object data is not a concern. To support the retrieval of data, two new public methods have been added to OfficeDocument. The first returns an Iterator of all the embedded objects in the document. The second returns a specific EmbeddedObject instance representing a named object. An object name can be found from the xlink:href attribute for an embedded object in a document's content tree.
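The deferred retrieval described above could look roughly like the following. This is a sketch under assumptions, not the OfficeDocument implementation: LazyEmbeddedObject, LazyDocumentSketch, register and isLoaded are hypothetical names, and a Supplier stands in for whatever actually reads the bytes from the zip package.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for an embedded object whose data loads on demand.
final class LazyEmbeddedObject {
    private final String name;
    private final Supplier<byte[]> loader; // assumed hook that reads from the package
    private byte[] data;                   // stays null until first requested

    LazyEmbeddedObject(String name, Supplier<byte[]> loader) {
        this.name = name;
        this.loader = loader;
    }

    String getName() { return name; }

    // The first call pays the loading cost; later calls reuse the cached bytes.
    byte[] getData() {
        if (data == null) {
            data = loader.get();
        }
        return data;
    }

    boolean isLoaded() { return data != null; }
}

// Hypothetical stand-in for the two lookup methods on the document.
final class LazyDocumentSketch {
    private final Map<String, LazyEmbeddedObject> objects = new LinkedHashMap<>();

    void register(String name, Supplier<byte[]> loader) {
        objects.put(name, new LazyEmbeddedObject(name, loader));
    }

    // All embedded objects, in registration order.
    Iterator<LazyEmbeddedObject> getEmbeddedObjects() {
        return objects.values().iterator();
    }

    // A single named object, or null if the name is unknown.
    LazyEmbeddedObject getEmbeddedObject(String name) {
        return objects.get(name);
    }
}
```

A converter that never calls getData() on any object never triggers a load, which is the trade-off the comment describes.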
Tested read and write functionality. Can successfully read and write embedded objects when converting. Tests on existing plugins show no impact on existing XMerge functionality. All changes now committed.
There is a small issue: the code to disable processing of the DTD doesn't work with Crimson as the parser. Here is a simple fix. In the method "getNamedDOM" in EmbeddedXMLDocument, the line
    return builder.parse(domData);
can be replaced with
    InputSource is = new InputSource(domData);
    is.setSystemId("");
    return builder.parse(is);
Also, OfficeDocument uses another trick to avoid reading the DTD (the method "hack"). That code doesn't work with non-ASCII characters because it doesn't decode the bytes as UTF-8; to fix that, it should be replaced by the same code as in EmbeddedXMLDocument.
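For context, the suggested replacement can be placed in a self-contained method as sketched here. NamedDomSketch and parse are hypothetical names, and the factory setup is an assumption, since the surrounding getNamedDOM code is not shown in this report. The sketch only illustrates the call sequence; whether the empty system ID actually suppresses the DTD lookup depends on the parser and on any entity resolver installed elsewhere.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical wrapper around the three-line replacement suggested above.
final class NamedDomSketch {
    static Document parse(InputStream domData) throws Exception {
        // Assumed setup: a default, non-validating JAXP builder.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();

        // Wrap the stream in an InputSource so a system ID can be set explicitly.
        InputSource is = new InputSource(domData);
        is.setSystemId("");
        return builder.parse(is);
    }
}
```

Because the parser receives the raw byte stream, it decodes the document according to its XML declaration, so non-ASCII content survives.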
Another detail: There is some confusion with trailing "/" for embedded objects: In manifest.xml an XML object is named with a trailing "/" (because it is a directory in the zip file). A binary object does not have a trailing "/" (since it is a file in the zip file). The method getEmbeddedObject(String name) in OfficeDocument uses the name from manifest.xml. But in the xlink:href attributes as well as in EmbeddedObject objects, there is never a trailing "/" in the name. So I think the most consistent solution would be not to require the trailing "/" in getEmbeddedObject.
The trailing '/' character should not be required for getEmbeddedObject. When the objects are being read in, any trailing '/' character is chopped off. See getEmbeddedObjects(). The documentation for getEmbeddedObject also states that any '/' or '#' characters should be stripped. These are the extras that appear in the xlink:href entry.
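A caller-side helper for the stripping discussed above might look like this. It is a sketch, not part of XMerge: EmbeddedNameSketch and normalize are hypothetical, and the assumption that hrefs may start with "#" followed by "./" is based on how OpenOffice.org typically writes these attributes rather than on anything stated in this report. Slashes inside the name (as in a Pictures/ path) are deliberately left alone.

```java
// Hypothetical helper that turns an xlink:href or manifest entry into a lookup name.
final class EmbeddedNameSketch {
    static String normalize(String raw) {
        String name = raw;
        // Drop the leading fragment marker used in xlink:href values.
        if (name.startsWith("#")) {
            name = name.substring(1);
        }
        // Drop a leading "./" (assumed form of package-relative hrefs).
        if (name.startsWith("./")) {
            name = name.substring(2);
        }
        // Drop the trailing "/" that manifest.xml uses for directory entries.
        while (name.endsWith("/")) {
            name = name.substring(0, name.length() - 1);
        }
        return name;
    }
}
```

With this, a manifest entry and the matching xlink:href value reduce to the same string before being passed to getEmbeddedObject.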
Fixed the problem with the trailing '/' character. Also amended the hack() method of OfficeDocument to read the byte stream as UTF-8. This resolves the issue of searching for a DTD. The previous approach, to use an EntityResolver, did not work consistently on all parsers.
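To illustrate the UTF-8 point only, here is a minimal sketch of reading a byte stream into a String with an explicit charset. Utf8ReadSketch and readUtf8 are hypothetical names; what hack() does with the text afterwards is not described in this report and is not reproduced here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decode a whole stream as UTF-8 text.
final class Utf8ReadSketch {
    static String readUtf8(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        // Naming the charset avoids treating each byte as one character.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }
}
```

Mapping bytes to characters one by one would turn each multi-byte UTF-8 sequence into several wrong characters, which matches the non-ASCII problem reported earlier.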
Henrik's development and testing indicate that the changes work as they should. Internal testing shows no regressions. Henrik's e-mail:
    Hi Mark
    I've tested the latest version of OfficeDocument and EmbeddedXMLDocument. Everything seems to be perfect! - I have no trouble extracting formulas and graphics from a Writer document.
    Thanks again!
    Henrik
Closing this bug.