[Java] Build a serverless SPARQL endpoint on AWS (Apache Jena)


This is my second attempt at building a SPARQL endpoint on AWS without running a server. The first attempt is described here:

"I tried to create a SPARQL endpoint in the AWS serverless environment, but it did not work" https://qiita.com/uedayou/items/bdf7a802e27fe330044e

Last time, the library I used made searches slow, and I felt the endpoint could only be used for limited purposes. This time I tried Apache Jena, which has a proven track record as an RDF store.

Environment

As before, the setup is AWS Lambda + API Gateway, this time using Apache Jena. In addition to using TDB, Jena can load RDF files directly, but based on the results below I recommend converting them to TDB.

The source code is published below. https://github.com/uedayou/jena-sparql-server-aws-serverless

Query time measurement

I measured search time for two approaches:

  • Using TDB
  • Using RDF files directly

The TDB is deployed ZIP-compressed and decompressed when the container is created.
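The post does not say how the times were taken, so the following is only a minimal sketch of one way to time a single query call in Java. QueryTimer, Timed, and time are hypothetical names introduced for illustration, and the task passed in stands for whatever code actually runs the query.

```java
import java.util.function.Supplier;

/** Hypothetical helper: runs one task and reports elapsed wall-clock time. */
final class QueryTimer {
    /** The task's result together with the elapsed time in milliseconds. */
    static final class Timed<T> {
        final T value;
        final long millis;

        Timed(T value, long millis) {
            this.value = value;
            this.millis = millis;
        }
    }

    static <T> Timed<T> time(Supplier<T> task) {
        // nanoTime is monotonic, so it is safer for intervals than currentTimeMillis.
        long start = System.nanoTime();
        T value = task.get();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Timed<>(value, elapsedMillis);
    }
}
```

Timing the same call twice in a row, once right after deployment and once immediately afterwards, would separate the container creation cost from the query cost itself.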

SPARQL query

The queries are the same as last time, and so is the dataset: the trial LOD of the "International Standard Identifier for Libraries and Related Organizations (ISIL)". The dataset is split into several Turtle files, and I measured each dataset obtained by adding those files one at a time. For the RDF file approach, I measured only the combination of all files.

(1) Get 100 triples

select * where {?s ?p ?o} limit 100

(2) Count all triples

select (count(*) as ?count) where {?s ?p ?o}

(3) Filter by string using FILTER

prefix schema: <http://schema.org/>
prefix org: <http://www.w3.org/ns/org#>
prefix dbpedia: <http://dbpedia.org/ontology/>

select * where {
  ?uri dbpedia:originalName ?name;
  org:hasSite/org:siteAddress/schema:addressRegion ?pref.
  filter( regex(?pref, "Tokyo"))
}
limit 10
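To try queries like these from Java, a client can send them to the endpoint over HTTP. The following is a minimal sketch, assuming the endpoint accepts SPARQL 1.1 Protocol GET requests with the query in a "query" parameter and can return JSON results; the post does not confirm either. SparqlRequests and selectRequest are hypothetical names, and https://example.com/sparql is a placeholder, not the author's endpoint. It needs Java 11 or later for java.net.http.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical client-side helper for building SPARQL SELECT requests. */
final class SparqlRequests {
    static HttpRequest selectRequest(String endpoint, String sparql) {
        // SPARQL 1.1 Protocol: a GET request carries the query in the "query" parameter.
        String encoded = URLEncoder.encode(sparql, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(endpoint + "?query=" + encoded))
                // Ask for the standard SPARQL JSON results format.
                .header("Accept", "application/sparql-results+json")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = selectRequest(
                "https://example.com/sparql", // placeholder endpoint (assumption)
                "select (count(*) as ?count) where {?s ?p ?o}");
        System.out.println(request.uri());
    }
}
```

The request can then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). The sketch stops before that step so that it runs without network access.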

TDB results

With TDB, searches were quite fast. However, when AWS Lambda creates a new container, it needs extra time to initialize the container and decompress the ZIP-compressed TDB. This happens at first startup, or after the container has been discarded because it was not invoked for a while; once created, a container is reused for some time. For the files used here, that extra processing took about 4 seconds.

The times below were measured with the container already created. If a container has to be created, add about 4 seconds to each. Last time, some queries took 10 seconds or longer, and in some cases even simple queries timed out and returned no results. This time, even when a container has to be created, results come back within 5 seconds, so I think queries will not time out unless they are very complicated.

Triples    (1)      (2)      (3)
 21,788    242ms    494ms    159ms
 42,585    254ms    531ms    102ms
 63,448    148ms    502ms     67ms
 84,587    166ms    504ms    100ms
104,826    154ms    572ms     85ms
124,718    176ms    367ms    112ms
144,669    153ms    583ms     80ms
160,491    141ms    579ms    104ms
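The cold start cost described above comes from work that happens once per container. The following is a minimal sketch of that pattern, not the code from the author's repository: it unzips a bundled TDB archive on the first call and reuses the extracted directory on later calls in the same container. TdbUnzipper and ensureExtracted are hypothetical names, and the archive and target paths are assumptions supplied by the caller (on AWS Lambda the writable location is /tmp).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: unzips a bundled TDB directory once per container. */
final class TdbUnzipper {
    // Static state survives between invocations while Lambda reuses the container.
    private static Path extractedDir;

    static synchronized Path ensureExtracted(Path zipFile, Path targetDir) {
        if (extractedDir != null) {
            return extractedDir; // warm start: skip decompression entirely
        }
        try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(zipFile))) {
            Path root = targetDir.toAbsolutePath().normalize();
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Reject entries such as "../x" that would land outside the target.
                if (!out.startsWith(root)) {
                    throw new IOException("Entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Copies only the current entry; the stream reports EOF at its end.
                    Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                }
            }
            extractedDir = root;
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned directory could then be handed to Jena when opening the TDB dataset. Because the cache is a single static field, the sketch assumes one dataset per function; serving several datasets would need a map keyed by archive.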

RDF file results

Using RDF files directly took longer than using TDB. As with TDB, the times below were measured with the container already created; initialization itself also took longer than with TDB (about 7 seconds). I do not know why initialization with RDF files takes longer even though the TDB case also includes ZIP decompression, but even leaving initialization aside, I think it is better to use TDB.

Triples    (1)       (2)       (3)
160,491    1587ms    1664ms    1215ms

Summary

  • I was able to build a SPARQL endpoint in a serverless environment with search speed that is, for my purposes, practical
  • Creating a container takes a few seconds (the AWS Lambda cold start problem?)
  • It is better to convert RDF files to TDB than to use them directly

I am personally satisfied with the performance of the AWS serverless SPARQL endpoint built on Apache Jena, so I would like to use it in various ways in the future.

I have already switched the SPARQL endpoint of the railway station LOD, a site that provides railway open data, to the Apache Jena version.

If you want to give it a try, see the article below.

Experimentally released SPARQL endpoint of railway station LOD https://qiita.com/uedayou/items/3ba823c5d3bede12af9c