128019 – Replace OpenOffice string implementation with Standard Library string implementation


Status:	CONFIRMED

Alias:	None

Product:	General
Classification:	Code
Component:	code
Version:	4.2.0-dev
Hardware:	All All

Importance:	P5 (lowest) Normal with 1 vote
Target Milestone:	---
Assignee:	AOO issues mailing list
QA Contact:

URL:
Keywords:

Depends on:
Blocks:	67649

Reported:	2019-01-27 14:46 UTC by Peter
Modified:	2019-03-26 08:20 UTC
CC List:	2 users

See Also:
Issue Type:	DEFECT
Latest Confirmation in:	---
Developer Difficulty:	---


Description Peter 2019-01-27 14:46:49 UTC

The goal is to use the standard implementation of the string template instead of our own implementation.

Comment 1 damjan 2019-01-27 15:43:14 UTC

Which one? We have 6 string implementations:
https://wiki.openoffice.org/wiki/Hacking#Can_I_get_a_char_.2A.2C_please.3F

Comment 2 Peter 2019-01-27 15:47:17 UTC

Do we need 6? Shouldn't one be enough?

Comment 3 damjan 2019-01-27 21:53:14 UTC

We have C string structs and C++ string wrapper classes around those, found in main/sal, in ASCII and "Unicode" (UTF-16) versions, with 2^32 chars max length.

Another 2 are in main/tools, 2^16 chars max length, used by Calc, StarBasic, possibly more. Keeping the max string length in a 16-bit instead of a 32-bit length field probably saves a lot of space in spreadsheets with lots of cells; Excel also does this.

Apart from being based on sal_Char / sal_Unicode instead of native C++ types, they contain many functions not found in C++ standard library strings, e.g. conversion to/from integer and double, string tokenization, interning, comparison of Unicode strings against ASCII, etc.
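
To illustrate the kind of helper that would have to be rebuilt on top of standard library strings, here is a minimal sketch of comparing a UTF-16 string against an ASCII literal. The free function name equalsAscii and its signature are hypothetical choices for this example, not the existing code in main/sal, and the sketch assumes the C string really contains only ASCII.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper, for illustration only (not the main/sal implementation).
// Returns true if the UTF-16 string consists of exactly the characters in the
// NUL-terminated ASCII string. Assumes `ascii` holds only 7-bit characters.
bool equalsAscii(const std::u16string& s, const char* ascii)
{
    const std::size_t len = std::strlen(ascii);
    if (s.size() != len)
        return false;
    for (std::size_t i = 0; i < len; ++i)
    {
        // Widen through unsigned char so the comparison never sees a negative value.
        if (s[i] != static_cast<char16_t>(static_cast<unsigned char>(ascii[i])))
            return false;
    }
    return true;
}
```

Every such helper (tokenization, number conversion, interning) would need a similar standalone replacement, or a wrapper class that carries them.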

Given the move to UTF8-only languages lately (Go, Rust), and the UTF-8 everywhere manifesto (https://utf8everywhere.org), we could consider eliminating the UTF-16 strings, and using the ASCII strings as UTF-8. That would however require fixing all code to traverse code points instead of code units, something it probably does wrong already.
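
As a rough illustration of the code unit versus code point difference, the sketch below counts code points in a UTF-8 std::string by skipping continuation bytes. The function name countCodePoints is made up for this example, and it assumes well-formed UTF-8 input without validating it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, for illustration only.
// Counts Unicode code points in a UTF-8 string by counting every byte that is
// not a continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumes well-formed UTF-8; no validation is performed.
std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For a string containing one non-ASCII character, size() reports more code units than there are code points, which is exactly the mismatch that code indexing by position would have to account for.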

Comment 4 Peter 2019-02-01 06:07:14 UTC

I added issue 67649 as a reference, because a string overhaul has a high chance of fixing that other bug (IMHO).

Comment 5 Peter 2019-02-01 06:26:32 UTC

I like the UTF-8 approach described at https://utf8everywhere.org/, but I do not have much insight into the alternatives.

I think we should decouple the string implementation from OpenOffice. This would make this part easier to change and maintain.

We would also need valid converters for the other UTF encodings. I think it may make sense to base the string implementation on the STL, and then have our own string class that adds the features we need, hidden behind an interface.
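
A converter of the kind mentioned here could be sketched as follows, going from UTF-16 to UTF-8 using only standard library types. The function name utf16ToUtf8 is hypothetical, and replacing unpaired surrogates with U+FFFD is an assumption of this sketch, not an established OpenOffice policy.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical converter, for illustration only (not existing OpenOffice code).
// Converts UTF-16 to UTF-8. Unpaired surrogates are replaced with U+FFFD,
// which is an assumption made for this sketch.
std::string utf16ToUtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF)
        {
            // High surrogate: combine with the following low surrogate if present.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            }
            else
            {
                cp = 0xFFFD;
            }
        }
        else if (cp >= 0xDC00 && cp <= 0xDFFF)
        {
            // Low surrogate without a preceding high surrogate.
            cp = 0xFFFD;
        }

        // Encode the code point as 1 to 4 UTF-8 bytes.
        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Hiding a function like this behind the proposed interface would let callers keep passing UTF-16 at the API boundary while the internal representation moves to UTF-8.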

We also have to check the API. I think this is the most hideous part, based on the FOSDEM presentation: https://ftp.fau.de/fosdem/2018/AW1.120/ode_uri.mp4
The part concerning UTF starts at around 10 minutes. Very interesting talk, thanks to Stephan Bergmann.