8364007: Add no-argument codePointCount method to CharSequence and String #26461

Open
wants to merge 10 commits into
base: master

Conversation

tats-u
Contributor

@tats-u tats-u commented Jul 24, 2025

Adds codePointCount() overloads to String, Character, (Abstract)StringBuilder, and StringBuffer to make it possible to conveniently retrieve the length of a string as code points without extra boundary checks.

if (superTremendouslyLongExpressionYieldingAString().codePointCount() > limit) {
    throw new Exception("exceeding length");
}

Is a CSR required for this change?
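For context, the value the proposed no-argument method is meant to return can already be obtained with the existing two-argument overload. The following is a minimal sketch, not code from this PR; the class name and the `codePoints` helper are made up for illustration.

```java
public class CodePointLengthDemo {
    // Hypothetical helper: the value a no-argument codePointCount()
    // would be expected to return, using only the existing API.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "OK" followed by U+1F44D, a supplementary character stored as a surrogate pair.
        String s = "OK\uD83D\uDC4D";
        System.out.println(s.length());      // 4 UTF-16 code units
        System.out.println(codePoints(s));   // 3 Unicode code points
    }
}
```

The gap between the two numbers is why `length()` alone is not suitable for a limit expressed in code points.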


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8364007: Add no-argument codePointCount method to CharSequence and String (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26461/head:pull/26461
$ git checkout pull/26461

Update a local copy of the PR:
$ git checkout pull/26461
$ git pull https://git.openjdk.org/jdk.git pull/26461/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26461

View PR using the GUI difftool:
$ git pr show -t 26461

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26461.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jul 24, 2025

👋 Welcome back tats-u! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jul 24, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk

openjdk bot commented Jul 24, 2025

@tats-u The following labels will be automatically applied to this pull request:

  • compiler
  • core-libs
  • i18n

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 24, 2025
@mlbridge

mlbridge bot commented Jul 24, 2025

Webrevs

@RogerRiggs
Contributor

The recommended process for proposing new APIs is to put the proposal to the OpenJDK core-libs-dev mail alias.
Putting the effort into a PR before there is some agreement on the value is premature.
And yes, every change to the spec needs a CSR.

@RogerRiggs
Contributor

To keep the proposal focused on the APIs, please drop the changes to modules other than java.base.

Member

@liach liach left a comment

Also, do we need codePointCount on CharSequence?

int count = this.count;
byte[] value = this.value;
if (isLatin1(coder)) {
return value.length;
Member

Suggested change
- return value.length;
+ return count;

Contributor Author

I see, I fixed the argument passed to StringUTF16.codePointCount too.
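The distinction behind this suggestion is that a builder's backing array can be larger than the number of characters actually stored in it. A minimal sketch of that distinction using only the public API follows; it assumes `capacity()` reflects the size of the backing storage, and the class and helper names are made up for illustration.

```java
public class BuilderCapacityDemo {
    // Hypothetical helper: code points in the used part of the builder only.
    static int codePoints(StringBuilder sb) {
        return sb.codePointCount(0, sb.length());
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder(16); // room for 16 chars
        sb.append("abc");                         // only 3 are in use

        System.out.println(sb.capacity());   // 16: size of the storage
        System.out.println(sb.length());     // 3: characters actually stored
        System.out.println(codePoints(sb));  // 3: must follow length, not capacity
    }
}
```

Returning the storage size instead of the used length would report 16 here, which is the kind of error the suggested change avoids.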

Member

@myankelev myankelev left a comment

Could you please add a bug number under @bug?

final int length = seq.length();
int n = length;
for (int i = 0; i < length; ) {
if (isHighSurrogate(seq.charAt(i++)) && i < length &&
Member

@myankelev myankelev Jul 24, 2025

Imo this is quite hard to read, especially with i++ inside of the if statement. What do you think about changing it to this?

for (int i = 1; i < length-1; i++) {
    if (isHighSurrogate(seq.charAt(i)) &&
        isLowSurrogate(seq.charAt(i + 1))) {
        n--;
        i++;
    }
}

edit: fixed a typo in my example

Contributor Author

In the first place it yields an incorrect result for sequences whose first character is a supplementary character.

jshell> int len(CharSequence seq) {
   ...>     final int length = seq.length();
   ...>     int n = length;
   ...>     for (int i = 1; i < length-1; i++) {
   ...>             if (isHighSurrogate(seq.charAt(i)) &&
   ...>                 isLowSurrogate(seq.charAt(i + 1))) {
   ...>                     n--;
   ...>                     i++;
   ...>             }
   ...>     }
   ...>     return n;
   ...> }
|  created method len(CharSequence), however, it cannot be invoked until method isHighSurrogate(char), and method isLowSurrogate(char) are declared

jshell> boolean isHighSurrogate(char ch) {
   ...>     return 0xd800 <= ch && ch <= 0xdbff;
   ...> }
|  created method isHighSurrogate(char)

jshell> boolean isLowSurrogate(char ch) {
   ...>     return 0xdc00 <= ch && ch <= 0xdfff;
   ...> }
|  created method isLowSurrogate(char)

jshell> len("𠮷");
$5 ==> 2

jshell> len("OK👍");
$6 ==> 3

jshell> len("👍👍");
$7 ==> 3

I will not change it alone unless the existing overload int codePointCount(CharSequence seq, int beginIndex, int endIndex) is also planned to be changed.
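For comparison, a variant of the more readable loop that starts at index 0 gives the expected counts for the three inputs above. This is only a sketch of one possible alternative, not the code in this PR or in the JDK; the class and method names are made up for illustration.

```java
public class SurrogatePairCount {
    // Hypothetical alternative loop: start from the number of chars and
    // subtract one for every well-formed surrogate pair.
    static int count(CharSequence seq) {
        final int length = seq.length();
        int n = length;
        for (int i = 0; i < length - 1; i++) {
            if (Character.isHighSurrogate(seq.charAt(i))
                    && Character.isLowSurrogate(seq.charAt(i + 1))) {
                n--; // the pair is a single code point
                i++; // skip the low surrogate just consumed
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String kanji = "\uD842\uDFB7";   // U+20BB7
        String thumb = "\uD83D\uDC4D";   // U+1F44D
        System.out.println(count(kanji));          // 1
        System.out.println(count("OK" + thumb));   // 3
        System.out.println(count(thumb + thumb));  // 2
    }
}
```

Unpaired surrogates are counted as one code point each, which matches the documented behavior of the existing `Character.codePointCount(CharSequence, int, int)`.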

@tats-u
Contributor Author

tats-u commented Jul 26, 2025

The recommended process for proposing new APIs is to put the proposal to the OpenJDK core-libs-dev mail alias.

I glanced over https://mail.openjdk.org/pipermail/core-libs-dev/2025-July/thread.html and the archives for the past few months, but I could not figure out how to send one.

According to https://mail.openjdk.org/pipermail/core-libs-dev/2025-July/149338.html and its replies, the content of this PR seems to have been forwarded to the mailing list.

Also, do we need codePointCount on CharSequence?

I did not add it because it does not have an existing overload and has a simple (but not efficient) workaround (codePoints().count()), but it would be nice if it exists.
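The workaround mentioned here works for any `CharSequence`, because `codePoints()` is a default method of that interface. A minimal sketch follows, with the existing static `Character.codePointCount` overload shown alongside for comparison; the class and method names are made up for illustration.

```java
import java.nio.CharBuffer;

public class CharSequenceWorkaround {
    // Stream-based workaround: simple, but goes through an IntStream.
    static long viaStream(CharSequence cs) {
        return cs.codePoints().count();
    }

    // Existing static overload with explicit begin and end indices.
    static int viaCharacter(CharSequence cs) {
        return Character.codePointCount(cs, 0, cs.length());
    }

    public static void main(String[] args) {
        // A CharSequence that is not a String: "a", U+1F44D, "b".
        CharSequence cs = CharBuffer.wrap("a\uD83D\uDC4Db");
        System.out.println(cs.length());       // 4 UTF-16 code units
        System.out.println(viaStream(cs));     // 3 code points
        System.out.println(viaCharacter(cs));  // 3 code points
    }
}
```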

Could you please add a bug number under @bug?

Which doc comments should I add it to?

P.S. Only the test classes (those containing the test-running methods), including Supplementary?

And yes, every change to the spec needs a CSR.

Understood, but do you know how non-Authors like me can create one?

@tats-u
Contributor Author

tats-u commented Jul 26, 2025

How and where can I add tests for default methods implemented in CharSequence?

@jaikiran
Member

Hello @tats-u,

The recommended process for proposing new APIs is to put the proposal to the OpenJDK core-libs-dev mail alias.

I glanced over https://mail.openjdk.org/pipermail/core-libs-dev/2025-July/thread.html and those for some past months, but I did not get how to send one.

The OpenJDK contribution guide has the necessary details on how to contribute to the project. Specifically this section https://openjdk.org/guide/#socialize-your-change is of relevance. In order to send a mail to the core-libs-dev mailing list, please first subscribe to that mailing list https://mail.openjdk.org/mailman/listinfo/core-libs-dev and initiate a discussion explaining the need and motivation for this new API. After there's some agreement about this proposal, the implementation changes in this PR can be pursued further.

@AlanBateman
Contributor

The addition to CharSequence will require static analysis to check for conflicts with implementation. It will also likely impact the CharBuffer spec.

@tats-u tats-u changed the title 8364007: Add overload without arguments to codePointCount in String etc. 8364007: Add no-argument codePointCount method to CharSequence and String Jul 27, 2025
@tats-u
Contributor Author

tats-u commented Jul 27, 2025

please first subscribe to that mailing list https://mail.openjdk.org/mailman/listinfo/core-libs-dev

Does this mailing list system require us to subscribe to the list in order to post to it? If so, I would like to unsubscribe, at least once this PR is merged, because I do not want my mailbox cluttered with emails unrelated to this change.

The addition to CharSequence will require static analysis to check for conflicts with implementation. It will also likely impact the CharBuffer spec.

The title of the JBS issue seems to have been changed by you, but given your concerns it looks like the default method for CharSequence should be dropped for now. No codePointCount methods have been added to CharSequence so it may be too early for us to add one to CharSequence. Do you think that you should replace CharSequence in the title with another class name?

@AlanBateman
Contributor

No codePointCount methods have been added to CharSequence so it may be too early for us to add one to CharSequence. Do you think that you should replace CharSequence in the title with another class name?

Can you clarify what you mean? Right now your PR is proposing to add a default method named codePointCount to CharSequence.

@tats-u
Contributor Author

tats-u commented Jul 27, 2025

Right now your PR is proposing to add a default method named codePointCount to CharSequence.

If it should be excluded for this time, I will push an additional commit to remove it from the content in this PR.

@AlanBateman
Contributor

Right now your PR is proposing to add a default method named codePointCount to CharSequence.

If it should be excluded for this time, I will push an additional commit to remove it from the content in this PR.

I think we should mull over the addition of CharSequence::codePointCount. On the surface it looks like it fits but we can't rush it (CharSequence is widely implemented and additions to this interface have a history of disruption in the ecosystem).

What is the reason for proposing Character.codePointCount(CharSequence) as well?
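One way such disruption can arise is when an existing implementation already declares a method with the same name and signature as a newly added default method. The sketch below uses a hypothetical `Seq` interface as a stand-in (it is not `java.lang.CharSequence`) and a hypothetical third-party `Token` class to show the silent-override case.

```java
public class DefaultMethodClash {
    // Hypothetical stand-in for a widely implemented interface.
    interface Seq {
        int length();
        char charAt(int index);

        // Imagine this default method being added in a later release.
        default int codePointCount() {
            int n = 0;
            for (int i = 0; i < length(); i++) {
                n++;
                if (Character.isHighSurrogate(charAt(i)) && i + 1 < length()
                        && Character.isLowSurrogate(charAt(i + 1))) {
                    i++; // consume the low surrogate of the pair
                }
            }
            return n;
        }
    }

    // Hypothetical third-party class written before the default existed.
    static final class Token implements Seq {
        private final String text;
        Token(String text) { this.text = text; }
        public int length() { return text.length(); }
        public char charAt(int index) { return text.charAt(index); }

        // Pre-existing method with a different meaning: non-ASCII code points only.
        public int codePointCount() {
            return (int) text.codePoints().filter(cp -> cp > 0x7F).count();
        }
    }

    static int viaInterface(Seq s) {
        return s.codePointCount(); // dispatches to Token's own method
    }

    public static void main(String[] args) {
        System.out.println(viaInterface(new Token("OK\uD83D\uDC4D"))); // 1, not 3
    }
}
```

Here the class method silently overrides the new default, so callers going through the interface get a different meaning than the interface documents. Had the existing method declared an incompatible return type such as `long`, the class would no longer compile against the updated interface.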

@tats-u
Contributor Author

tats-u commented Jul 28, 2025

I think we should mull over the addition of CharSequence::codePointCount. On the surface it looks like it fits but we can't rush it (CharSequence is widely implemented and additions to this interface have a history of disruption in the ecosystem).

We might as well defer it to a separate JBS issue if it is too difficult to decide whether it should be included in this PR.

What is the reason for proposing Character.codePointCount(CharSequence) as well?

  1. Unlike CharSequence, Character already has an overload taking start and end indices, as do String and AbstractStringBuilder.
  2. It is less harmful than CharSequence::codePointCount because it is just a static method.
  3. There are already the (CharSequence, int, int) and (char[], int, int) overloads, and the (char[], int, int) overload is used in the test for String::codePointCount(int, int). We should add the (char[]) overload for the tests and also add the (CharSequence) overload for consistency.
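The two existing static overloads mentioned in the list take their ranges differently: the `CharSequence` overload takes a begin index and an end index, while the `char[]` overload takes an offset and a count. A minimal sketch using only the existing API follows; the class and wrapper method names are made up for illustration.

```java
public class CharacterOverloads {
    // Existing overload: (CharSequence seq, int beginIndex, int endIndex).
    static int inSequence(CharSequence seq, int beginIndex, int endIndex) {
        return Character.codePointCount(seq, beginIndex, endIndex);
    }

    // Existing overload: (char[] a, int offset, int count).
    static int inArray(char[] a, int offset, int count) {
        return Character.codePointCount(a, offset, count);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDC4Db";   // "a", U+1F44D, "b"
        char[] chars = s.toCharArray();

        // The same sub-range (the surrogate pair), expressed both ways.
        System.out.println(inSequence(s, 1, 3));  // begin 1, end 3    -> 1
        System.out.println(inArray(chars, 1, 2)); // offset 1, count 2 -> 1

        // Whole input: the case a single-argument overload would cover.
        System.out.println(inSequence(s, 0, s.length()));    // 3
        System.out.println(inArray(chars, 0, chars.length)); // 3
    }
}
```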

@naotoj
Member

naotoj commented Jul 28, 2025

The addition to CharSequence will require static analysis to check for conflicts with implementation. It will also likely impact the CharBuffer spec.

Looking at the original JSR 204 issue: https://bugs.openjdk.org/browse/JDK-4985217, it is interesting that the problem description included CharSequence but the proposed API did not. I tried to find the reason behind this, but could not find any relevant information so far.
As to the general comment, I am not so sure about adding the no-arg overloads, as they would simply be convenience methods for codePointCount(0, length()), which to me does not add a significant benefit. My $0.02

@tats-u
Contributor Author

tats-u commented Aug 3, 2025

Its author may have prioritized the versatility of the APIs.

codePointCount(0, length())

This workaround is only effective if the instance expression is sufficiently short or can be stored in a new temporary variable first. Otherwise it is a pain to have to write the expression twice just to get the number of code points in the entire string.

P.S. I subscribed to the mailing list (but changed the settings so that I do not receive any emails).
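The cost of writing the expression twice is not only verbosity: the expression is also evaluated twice. The sketch below contrasts the inline form with the temporary-variable form, using a hypothetical `describe()` method and a call counter as stand-ins for a long expression; none of these names come from the PR.

```java
public class CountOnce {
    static int calls = 0;

    // Hypothetical stand-in for a long expression that yields a String.
    static String describe() {
        calls++;
        return "OK\uD83D\uDC4D";
    }

    // Inline workaround: the expression appears, and runs, twice.
    static int inline() {
        return describe().codePointCount(0, describe().length());
    }

    // Temporary-variable workaround: the expression runs once.
    static int withTemp() {
        String s = describe();
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        calls = 0;
        System.out.println(inline() + " code points, " + calls + " evaluations");
        calls = 0;
        System.out.println(withTemp() + " code points, " + calls + " evaluations");
    }
}
```

Both forms return the same count; they differ in how often the expression is evaluated and in whether an extra local variable is needed.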

@vicente-romero-oracle
Contributor

/label remove compiler

@openjdk

openjdk bot commented Aug 11, 2025

@vicente-romero-oracle
The compiler label was successfully removed.

8 participants