cssSelector doesn't handle combining characters correctly #1984

Closed
samshutchins opened this issue Jul 25, 2023 · 1 comment · Fixed by #2305
Labels
fixed An {bug|improvement} that has been {fixed|implemented}
Milestone

Comments

@samshutchins

    @Test
    void combiningCharactersInIdentifier()
    {
        final String html = """
            <html>
            <head>
            <meta charset="utf-8">
            </head>
                        
            <body>
            <img class="e\u0301" src="/corner.jpg">
            </body>
                        
            </html>""";

        final Document document = Jsoup.parse(html);
        final Elements images = document.getElementsByTag("img");

        final Element img = images.get(0);
        final String cssSelector = img.cssSelector();

        assertEquals("html > body > img.e\u0301", cssSelector);
    }

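The example above uses a combining character (U+0301, the combining acute accent) to create an é. Emoji rely heavily on similar multi-character sequences: 👨‍👨‍👧‍👧 is made up of 11 UTF-16 code units (\uD83D\uDC68\u200D\uD83D\uDC68\u200D\uD83D\uDC67\u200D\uD83D\uDC67), that is, four emoji joined by three zero-width joiners.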
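To make the count above concrete, here is a minimal sketch that measures the same strings in Java. The class and method names (`EmojiLength`, `codeUnits`, `codePoints`) are made up for illustration and are not part of jsoup. `String.length()` counts UTF-16 code units, while `codePointCount` counts Unicode code points.

```java
class EmojiLength {
    // The family emoji from the issue, written as UTF-16 escapes.
    static final String FAMILY =
        "\uD83D\uDC68\u200D\uD83D\uDC68\u200D\uD83D\uDC67\u200D\uD83D\uDC67";

    // Number of UTF-16 code units (Java char values).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(FAMILY));    // prints 11
        System.out.println(codePoints(FAMILY));   // prints 7
        System.out.println(codeUnits("e\u0301")); // prints 2
    }
}
```

Any code that walks a class name one `char` at a time therefore sees 11 separate units for what a reader perceives as a single emoji.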

I have seen emoji used as CSS class names in the wild, and I think the character-escaping code does the wrong thing when cssSelector is called: it appears to escape every character individually, which breaks these combining sequences.
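As a rough illustration of the suspected behaviour, the sketch below uses a hypothetical escaper (`NaiveEscape.escapePerChar`, not jsoup's actual code) that prefixes a backslash to every `char` that is not an ASCII letter, digit, hyphen, or underscore. It is an assumption about the general shape of per-character escaping, not a description of jsoup's implementation.

```java
class NaiveEscape {
    // Hypothetical per-char escaper, for illustration only.
    static String escapePerChar(String ident) {
        StringBuilder out = new StringBuilder();
        for (char c : ident.toCharArray()) {
            boolean plain = c < 128
                && (Character.isLetterOrDigit(c) || c == '-' || c == '_');
            if (!plain) {
                out.append('\\'); // backslash before each non-plain code unit
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The combining accent gets its own backslash, separate from the base letter.
        System.out.println(escapePerChar("e\u0301").length());      // prints 3
        // A surrogate pair is split by a backslash between its two halves.
        System.out.println(escapePerChar("\uD83D\uDC68").length()); // prints 4
    }
}
```

Under this assumption, the combining accent is escaped separately from its base letter, and an emoji outside the Basic Multilingual Plane ends up with a backslash between the two halves of its surrogate pair.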

@samshutchins samshutchins changed the title cssSelector doesn't handle combinding characters correctly cssSelector doesn't handle combining characters correctly Aug 4, 2023
@jhy
Owner

jhy commented Oct 20, 2023

Current jsoup: html > body > img.e\́
Chrome: body > p.e\\u0301

I don't think it's incorrect to emit it as a run of characters, and the selector does work in jsoup. We could improve this by escaping the combining form as a \u escape, as Chrome does.
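One way the suggested improvement could look is sketched below. This is not the change merged in #2305; `CssEscapeSketch.escapeCombining` is a hypothetical helper, and the choice to escape only Unicode mark categories is an assumption. CSS identifiers use a backslash followed by hex digits and an optional terminating space rather than a Java-style \uXXXX escape, so the sketch emits that form.

```java
class CssEscapeSketch {
    // Hypothetical helper, not part of jsoup's API.
    static String escapeCombining(String ident) {
        StringBuilder out = new StringBuilder();
        // Walk by code point so surrogate pairs stay intact.
        for (int i = 0; i < ident.length(); ) {
            int cp = ident.codePointAt(i);
            i += Character.charCount(cp);
            int type = Character.getType(cp);
            boolean mark = type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
            if (mark) {
                // CSS hex escape: backslash, hex digits, terminating space.
                out.append('\\').append(Integer.toHexString(cp)).append(' ');
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Brackets make the terminating space visible in the output.
        System.out.println("[" + escapeCombining("e\u0301") + "]"); // prints [e\301 ]
    }
}
```

Iterating by code point rather than by `char` is the design choice that matters here: it keeps each emoji's surrogate pair together while still allowing individual combining marks to be escaped.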

@jhy jhy closed this as completed in #2305 Apr 22, 2025
@jhy jhy added the fixed An {bug|improvement} that has been {fixed|implemented} label Apr 22, 2025
@jhy jhy added this to the 1.20.1 milestone Apr 22, 2025
jhy added a commit that referenced this issue Apr 22, 2025