Aggregating results from SPARQL query

Question

Welcome To Ask or Share your Answers For Others

Aggregating results from SPARQL query

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

Aggregating results from SPARQL query

I'm querying a dataset of tweets:

SELECT * WHERE {
  ?tweet smo:tweeted_at ?date ;
         smo:has_hashtag ?hashtag ;
         smo:tweeted_by ?account ;
         smo:english_tweet true .
  FILTER ( ?date >= "20130722"^^xsd:date && ?date < "20130723"^^xsd:date )
}

If a tweet has multiple hashtags, there is one row in the result set per hashtag. Is there any way for me to aggregate the hashtags into an array instead?

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-16T22:16:36+0000

You can GROUP BY by the variables that identify the tweet and then use GROUP_CONCAT to concatenate the hashtags into something like an array, but it will still be a string that you'll need to parse afterward. For instance, given data like

@prefix smo: <http://example.org/> .
@prefix : <http://example.org/> .

:tweet1 smo:tweeted_at "1" ;
        smo:has_hashtag "tag1", "tag2", "tag3" ;
        smo:tweeted_by "user1" ;
        smo:english_tweet true .

:tweet2 smo:tweeted_at "2" ;
        smo:has_hashtag "tag2", "tag3", "tag4" ;
        smo:tweeted_by "user2" ;
        smo:english_tweet true .

you can use a query like

prefix smo: <http://example.org/>

select ?tweet ?date ?account (group_concat(?hashtag) as ?hashtags) where {
  ?tweet smo:tweeted_at ?date ;
         smo:has_hashtag ?hashtag ;
         smo:tweeted_by ?account ;
         smo:english_tweet true .
}
group by ?tweet ?date ?account

to get results like:

--------------------------------------------------
| tweet      | date | account | hashtags         |
==================================================
| smo:tweet2 | "2"  | "user2" | "tag4 tag3 tag2" |
| smo:tweet1 | "1"  | "user1" | "tag3 tag2 tag1" |
--------------------------------------------------

You can specify the delimiter used in the group concatenation, so if there is some character that cannot appear in hashtags, you can use it as a delimiter. For instance, supposing that | can't appear in hashtags, you can use:

(group_concat(?hashtag;separator="|") as ?hashtags)

instead and get

--------------------------------------------------
| tweet      | date | account | hashtags         |
==================================================
| smo:tweet2 | "2"  | "user2" | "tag4|tag3|tag2" |
| smo:tweet1 | "1"  | "user1" | "tag3|tag2|tag1" |
--------------------------------------------------

If you're working in a language that has some literal array syntax, you might even be able to replicate that:

(concat('[',group_concat(?hashtag;separator=","),']') as ?hashtags)

----------------------------------------------------
| tweet      | date | account | hashtags           |
====================================================
| smo:tweet2 | "2"  | "user2" | "[tag4,tag3,tag2]" |
| smo:tweet1 | "1"  | "user1" | "[tag3,tag2,tag1]" |
----------------------------------------------------

Now, it doesn't affect the data here, but group_concat will actually include duplicates in the concatenation if they're present in the data. E.g., from the following (where I'm just providing data with values for the sake of the example):

prefix : <http://example.org/>

select ?tweet (concat('[',group_concat(?hashtag;separator=','),']') as ?hashtags)
where {
  values (?tweet ?hashtag) { 
    (:tweet1 "tag1") (:tweet1 "tag1") (:tweet1 "tag2") (:tweet1 "tag3")
    (:tweet2 "tag2") (:tweet2 "tag3") (:tweet2 "tag4")
  }
}
group by ?tweet

we get results including [tag1,tag1,tag2,tag3], i.e., the duplicate value of ?hashtag is included:

-------------------------------------
| tweet   | hashtags                |
=====================================
| :tweet2 | "[tag2,tag3,tag4]"      |
| :tweet1 | "[tag1,tag1,tag2,tag3]" |
-------------------------------------

We can avoid this by using group_concat(distinct ?hashtag;...):

prefix : <http://example.org/>

select ?tweet (concat('[',group_concat(distinct ?hashtag;separator=','),']') as ?hashtags)
where {
  values (?tweet ?hashtag) { 
    (:tweet1 "tag1") (:tweet1 "tag1") (:tweet1 "tag2") (:tweet1 "tag3")
    (:tweet2 "tag2") (:tweet2 "tag3") (:tweet2 "tag4")
  }
}
group by ?tweet

--------------------------------
| tweet   | hashtags           |
================================
| :tweet2 | "[tag2,tag3,tag4]" |
| :tweet1 | "[tag1,tag2,tag3]" |
--------------------------------

Categories

Aggregating results from SPARQL query

Aggregating results from SPARQL query

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags