Data islands vs. RDFa for a human- and machine-readable format #51

jacoscaz · 2024-01-13T16:00:47Z

jacoscaz
Jan 13, 2024
Collaborator

/chair hat off

Hi everyone. Often, particularly when it comes to formats, the discussion touches upon whether RDF data islands can be a valid alternative to RDFa when picking a format readable by both humans and machines. Let's leave aside, for a moment, the fact that data islands are not a W3C REC and let's focus on the technical side of this issue.

Now, an obligatory disclaimer: nothing in this issue is an attempt at forcing such a format upon the WebID Spec, whatever form that takes. I am, however, interested in your opinion as to the pros and cons of each.

In my humble opinion, data islands are, indeed, much friendlier than RDFa but only insofar as they can be parsed out of HTML without a full-blown DOM/HTML5 parser. To that end, the following code demonstrates a way to do so:

const html_string = `
  <html>
  <body>
  <script type="application/ld+json">
    {
      "@id": "some document"
    }
  </script>
  </body>
  </html>
`;

for (const match of html_string.matchAll(/<script[^>]*?type="application\/ld\+json"[^>]*?>(.*?)<\/script>/sig)) {
  console.log(match[1]);
}

Granted, the above is a crude, inefficient quick hack and it is incapable of supporting edge cases such as a data island that contains a </script> within a JSON-LD string literal. Nonetheless, at least in my case, the above would be more than enough functionally to consider using JSON-LD data islands rather than RDFa.

I think a state machine could be made that would be capable of quickly getting to data islands while discarding everything else and still be orders of magnitude less complex than full DOM/HTML parsing.
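To make the state-machine idea a bit more concrete, here is a minimal sketch of a two-state scanner: it alternates between "looking for an opening script tag" and "looking for the closing script tag", and keeps the body only when the opening tag declares the JSON-LD media type. The function name `extractDataIslands` is hypothetical, and the scanner is deliberately simplified: it assumes attribute values never contain `>`, ignores HTML comments, and does not handle media type parameters such as `;profile=...`.

```javascript
// Hypothetical helper, not a spec-compliant HTML tokenizer.
function extractDataIslands(html) {
  const openTag = /<script\b([^>]*)>/gi;   // state 1: find the next opening tag
  const closeTag = /<\/script\s*>/gi;      // state 2: find the matching closing tag
  // Matches type=application/ld+json, quoted or unquoted, case-insensitively.
  const isJsonLd = /(?:^|\s)type\s*=\s*(["']?)application\/ld\+json\1(?:\s|$)/i;

  const islands = [];
  let pos = 0;
  for (;;) {
    openTag.lastIndex = pos;
    const open = openTag.exec(html);
    if (open === null) break;              // no more script elements

    closeTag.lastIndex = openTag.lastIndex;
    const close = closeTag.exec(html);
    if (close === null) break;             // unterminated script element

    if (isJsonLd.test(open[1])) {
      // Keep the raw text between the opening and closing tags.
      islands.push(html.slice(openTag.lastIndex, close.index));
    }
    pos = closeTag.lastIndex;              // resume scanning after the element
  }
  return islands;
}

// Usage: each island is a raw string, still to be handed to a JSON-LD parser.
for (const island of extractDataIslands('<script type="application/ld+json">{"@id": "some document"}</script>')) {
  console.log(JSON.parse(island)["@id"]);
}
```

Unlike the single lazy regex, this skips over the bodies of unrelated script elements instead of scanning inside them, so inline JavaScript that happens to mention the media type is not picked up by mistake.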

Thoughts?

melvincarvalho · 2024-01-13T17:16:32Z

melvincarvalho
Jan 13, 2024

Data Islands very much are a W3C REC. Not only that, they represent the de facto semantic web in 2024, via schema.org.

https://www.w3.org/TR/json-ld11/#embedding-json-ld-in-html-documents

I have a stub of a similar library, getj here:

https://github.com/spux/getj

Demo:

https://spux.org/getj/test.html

0 replies

melvincarvalho · 2024-01-13T17:27:32Z

melvincarvalho
Jan 13, 2024

IMHO, RDFa (and XHTML) are technical debt that holds back projects that need to support these old, less popular formats. A good example being Solid. RDFa holds it back, developers don't want to join, and those that joined before walked away, because modern web devs want to use JSON.

0 replies

VirginiaBalseiro · 2024-01-13T17:41:31Z

VirginiaBalseiro
Jan 13, 2024

A good example being Solid. RDFa holds it back, developers don't want to join, and those that joined before walked away, because modern web devs want to use JSON.

Do you have data to back up this claim or is this just your opinion?

0 replies

webr3 · 2024-01-13T22:21:40Z

webr3
Jan 13, 2024

If it helps any, I (heavily involved in the RDFa WG, and RDFa API author) ripped out RDFa from many pages (100 million +) and moved our setups to JSON-LD in data islands (billions of pages).

For interest, they all utilize the data islands as data in JS also.

0 replies

webr3 · 2024-01-13T22:34:29Z

webr3
Jan 13, 2024

for (const match of html_string.matchAll(/<script[^>]*?type="application\/ld\+json"[^>]*?>(.*?)<\/script>/sig)) {
  console.log(match[1]);
}

    globalThis.di = Array.from(document.querySelectorAll('[type="application/ld+json"]')).map(function(island){ return [island.id, JSON.parse(island.text)]}).reduce(function(obj, item) {
      obj[item[0]] = item[1]
      return obj
    }, {});

0 replies

melvincarvalho · 2024-01-13T23:04:33Z

melvincarvalho
Jan 13, 2024

A good example being Solid. RDFa holds it back, developers don't want to join, and those that joined before walked away, because modern web devs want to use JSON.

Do you have data to back up this claim or is this just your opinion?

A bit of both. I founded the Solid Community Group and am in touch with many people there, and before it. I also have traffic statistics from Reddit. I created the biggest and most popular Solid Pod, and ran it for 1/4 of a decade until I got sick. I also follow the GitHub interest in Solid. While the project is extremely well funded, developer interest has waned from its peak. RDFa is hard to work with, and web developers like JSON. RDFa is also enormously buggy. Compare the triples on your own WebID in the RDFa with those in the Turtle. They are not the same, last I checked. I'm sure it will all get fixed eventually given the long runway that Solid has, but working with JSON allows other projects in the open (social) web to progress enormously fast. I helped onboard 1000s of developers onto the open (social) web, and JSON is one of the big sellers. People will look at Solid and say "interesting" but then go and work on a JSON project.

0 replies

melvincarvalho · 2024-01-14T10:10:11Z

melvincarvalho
Jan 14, 2024

Try this:

npx getj <uri_with_data_island>

for example

npx getj https://spux.org/getj/test.html

gives

{
  "@context": "http://schema.org",
  "@type": "WebPage",
  "url": "https://example.com",
  "name": "Example Web Page"
}

If there's interest, I can donate this npm library to the CG and we can collaborate on a function that will extract data islands from the command line, browser, or server.

0 replies

jacoscaz · 2024-01-14T10:33:39Z

jacoscaz
Jan 14, 2024
Collaborator Author

@webr3 @melvincarvalho both of your implementations rely on a full-blown DOM/HTML5 parser, though, as provided by either the browser or by dependencies. Ugly and hack-ish as it is, my code doesn't rely on anything but the obvious JSON-LD parser one would need anyway.

IMHO, compared to RDFa, which has its own media type and doesn't force a client to rely on heuristics, Data Islands (or Blocks, according to the JSON-LD spec) make sense only if they allow devs to dispense with the complexity of parsing HTML5 or, worse, of an in-memory DOM representation. Otherwise one would already be 80% there to RDFa support.

0 replies

melvincarvalho · 2024-01-14T12:01:28Z

melvincarvalho
Jan 14, 2024

Ugly and hack-ish as it is, my code doesn't rely on anything but the obvious JSON-LD parser one would need anyway.

Mine was indeed an ugly hack too. But we could make a half decent library if we work together, I suspect.

0 replies

jacoscaz · 2024-01-20T10:20:08Z

jacoscaz
Jan 20, 2024
Collaborator Author

Would anyone object to converting this issue into a discussion?

1 reply

TallTed Jan 23, 2024

Regrettably, I didn't see this sooner. GitHub's "discussions" aren't discussions. They're Q + timestamp-ordered As (+ unthreaded timestamp-ordered comments on each A), roughly comparable to an early version of StackOverflow and its kin.

GitHub's "discussions" make it difficult if not impossible to track conversations, as each new comment on an answer goes to the bottom of that answer's comments (not to the bottom of the page), and there's no "show me all comments since I last read this page" feature. Manually scrolling to the last remembered comment (which is a functional if imperfect way to see all new comments on an Issue) doesn't work either, and there is no useful "show me all comments in timestamp order" button or similar.

(Just imagine what that 255 comment thread would have looked like!)

I strongly advise and request minimal use of GitHub's "discussions" (which I will always put in quotes) going forward.

TallTed · 2024-01-24T00:02:37Z

TallTed
Jan 24, 2024

Solid folks have largely rejected RDFa because they want to support SPARQL and/or N3 Update, and handling such against RDFa requires diving further into (X)HTML than they've expressed any interest or even willingness to do; quite the opposite.

So far as I'm aware, Solid folks have not evidenced objections of similar strength to Data Islands, because SPARQL and/or N3 Updates require far less understanding of the enclosing (X)HTML, just an ability to replace strings found within <script> tags.
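As a rough illustration of "replacing strings found within <script> tags", the sketch below swaps the body of the first JSON-LD data island in an HTML string and leaves the rest of the document untouched. The function name `replaceFirstDataIsland` is hypothetical, the pattern assumes attribute values contain no `>`, and this is not how any Solid server is known to implement updates; it only shows that the operation needs no DOM.

```javascript
// Hypothetical helper: rewrite the first JSON-LD data island in place.
function replaceFirstDataIsland(html, data) {
  // Groups: (opening tag)(old body)(closing tag).
  const island = /(<script\b[^>]*\btype\s*=\s*["']?application\/ld\+json["']?[^>]*>)([\s\S]*?)(<\/script\s*>)/i;

  // Escape "<" so that a literal "</script>" inside a string value
  // cannot terminate the element early; JSON parsers decode \u003c back to "<".
  const body = JSON.stringify(data, null, 2).replace(/</g, "\\u003c");

  // A replacer function avoids special handling of "$" in the new body.
  return html.replace(island, (_all, open, _old, close) => `${open}\n${body}\n${close}`);
}

// Usage: only the island body changes; surrounding markup is preserved.
const updated = replaceFirstDataIsland(
  '<p>Profile</p><script type="application/ld+json">{"@id": "old"}</script>',
  { "@id": "new" }
);
console.log(updated);
```

The escaping step also sidesteps the `</script>`-inside-a-string-literal edge case on the writing side, since the serialized island never contains that sequence.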

That said, RDFa is likely to have a disk consumption advantage when compared to Data Islands, especially for larger RDF datasets when the user chooses to have multiple islands (Turtle, JSON-LD, Microdata) in a single document. (How much advantage will depend on how much volume is consumed by the RDFa markup within the HTML, among other considerations.)

My usual preference is to embrace the power of AND, allowing users to use whichever they prefer, but this does require much more effort on the part of implementation developers, and possibly on our part as the specification authors.

(EDIT: The opinions of Solid folks are relevant because we're currently expecting Solid WG to adopt the WebID spec.)

0 replies

melvincarvalho · 2024-02-02T12:48:01Z

melvincarvalho
Feb 2, 2024

Solid folks have largely rejected RDFa

Very glad to hear this and fully agree.

0 replies
