Building a SPARQL endpoint on AWS serverless (Apache Jena edition)

This is my second attempt at building a SPARQL endpoint on AWS serverless. The first attempt is described here:

I tried to create a SPARQL endpoint in an AWS serverless environment, but it didn't work. https://qiita.com/uedayou/items/bdf7a802e27fe330044e

Last time, the library I used made searches slow, and I concluded that the endpoint was only usable for limited purposes. This time, I used Apache Jena, which has a proven track record as an RDF store.

Environment

The configuration is AWS Lambda + API Gateway as before, this time using Apache Jena. Jena can read an RDF file directly as well as read from a TDB store, but based on the results below I recommend converting the data to TDB.

The source code is available here: https://github.com/uedayou/jena-sparql-server-aws-serverless

Query search time measurement

I measured how long searches take in two configurations:

- Using TDB
- Using an RDF file directly

The TDB store is ZIP-compressed before it is deployed.

SPARQL query

The queries are the same as last time. The dataset is also the same as last time: ["International Standard Identifier for Libraries and Related Organizations (ISIL)" Trial Version LOD](https://www.ndl.go.jp/jp/dlib/standards/opendataset/index.html). For TDB, I measured each dataset created by adding the divided Turtle files one at a time. For the RDF file, I measured only the file that merges all of them.

(1) Get 100 triples

select * where {?s ?p ?o} limit 100

(2) Count all triples

select (count(*) as ?count) where {?s ?p ?o}

(3) Use `filter` to narrow down results by string match

prefix schema: <http://schema.org/>
prefix org:   <http://www.w3.org/ns/org#>
prefix dbpedia: <http://dbpedia.org/ontology/>

select * where {
  ?uri dbpedia:originalName ?name;
  org:hasSite/org:siteAddress/schema:addressRegion ?pref.
  filter( regex(?pref, "Tokyo") )
}
limit 10
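
The article does not say how the times were taken, so the following is only a minimal sketch of one way to make a comparable client-side measurement. It assumes the endpoint accepts a GET request with a `query` parameter, as in the SPARQL 1.1 Protocol. The `ENDPOINT` URL and the helper names `build_query_url`, `timed` and `run_query` are hypothetical and not part of the published repository.

```python
import time
import urllib.parse
import urllib.request

# Hypothetical endpoint URL; replace it with the URL of your own API Gateway stage.
ENDPOINT = "https://example.execute-api.ap-northeast-1.amazonaws.com/prod/sparql"


def build_query_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query into a GET URL using the `query` parameter."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


def run_query(endpoint: str, query: str) -> bytes:
    """Send the query over HTTP and return the raw response body."""
    request = urllib.request.Request(
        build_query_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

For example, `timed(run_query, ENDPOINT, "select * where {?s ?p ?o} limit 100")` would return the response body together with the elapsed milliseconds. A client-side figure like this includes network latency, so it is not directly comparable to a time measured inside Lambda.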

TDB results

Searches with TDB were fairly fast. However, when AWS Lambda creates a new container, it needs extra time to initialize and to decompress the ZIP-compressed TDB. Once created, the container is reused for a while, so this cost applies only on the first invocation or after the container has been discarded following a period of inactivity. For the files used this time, it added about 4 seconds.

The times below were measured with the container already created. When a container has to be created, add about 4 seconds to them. Last time, some queries took more than 10 seconds, and even simple queries sometimes timed out without returning results. With TDB, results come back within 5 seconds even when a container has to be created, so I don't expect timeouts except for very complex queries.

Number of triples	(1)	(2)	(3)
21,788	242ms	494ms	159ms
42,585	254ms	531ms	102ms
63,448	148ms	502ms	67ms
84,587	166ms	504ms	100ms
104,826	154ms	572ms	85ms
124,718	176ms	367ms	112ms
144,669	153ms	583ms	80ms
160,491	141ms	579ms	104ms
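
The cold-start cost described in this section comes from unpacking the store once per container. The following is a minimal sketch of that pattern, not the code in the repository (which uses Apache Jena, a Java library). It is written in Python purely for illustration, and the names `TDB_ZIP`, `EXTRACT_DIR`, `get_tdb_path` and `handler` are hypothetical. It assumes the ZIP is bundled with the deployment package and that the temporary directory is writable.

```python
import os
import tempfile
import zipfile

# Hypothetical locations: the ZIP shipped with the function, and a writable
# scratch directory (on AWS Lambda the temporary directory is /tmp).
TDB_ZIP = os.environ.get("TDB_ZIP", "tdb.zip")
EXTRACT_DIR = os.path.join(tempfile.gettempdir(), "tdb")

# Module-level state survives between invocations while the container is warm.
_tdb_path = None


def get_tdb_path(zip_path: str = TDB_ZIP, extract_dir: str = EXTRACT_DIR) -> str:
    """Return the directory of the unpacked store, extracting it only once."""
    global _tdb_path
    if _tdb_path is None:
        # Cold start: the decompression cost is paid here, on the first call.
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(extract_dir)
        _tdb_path = extract_dir
    return _tdb_path


def handler(event, context):
    """Lambda-style entry point; running the actual query is omitted."""
    return {"tdb": get_tdb_path()}
```

Only the first call in a container unpacks the archive; later calls return the cached path immediately, which matches the difference between the cold and warm timings reported above.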

RDF file results

Using the RDF file directly was slower than TDB. As with TDB, the times below were measured with the container already created. Initialization also took longer than with TDB (about 7 seconds). I would need to investigate why initialization is slower with an RDF file even though the TDB path also includes ZIP decompression, but even setting initialization time aside, I think TDB is the better choice.

Number of triples	(1)	(2)	(3)
160,491	1587ms	1664ms	1215ms
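
To put the two tables side by side, the short calculation below divides the RDF-file times by the TDB times for the same 160,491-triple dataset. The numbers are copied from the tables; the `slowdown` helper is only an illustrative name.

```python
# Timings in milliseconds for the 160,491-triple dataset, copied from the two tables.
tdb_ms = {"(1)": 141, "(2)": 579, "(3)": 104}
rdf_ms = {"(1)": 1587, "(2)": 1664, "(3)": 1215}


def slowdown(rdf: dict, tdb: dict) -> dict:
    """Return how many times slower the RDF-file run was for each query."""
    return {query: rdf[query] / tdb[query] for query in tdb}


ratios = slowdown(rdf_ms, tdb_ms)
for query, ratio in ratios.items():
    print(f"{query}: {ratio:.1f}x")
```

With these single measurements, the RDF file is roughly 11 times slower for queries (1) and (3) and roughly 3 times slower for query (2). Each cell is one run, so the ratios are only indicative.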

Summary

- (Personally) I was able to build a SPARQL endpoint with practical search speed in a serverless environment.
- Creating the container takes a few seconds (the AWS Lambda cold start problem?).
- It is better to convert the RDF file to TDB than to use it directly.

Personally, I find the performance of the Apache Jena based SPARQL endpoint on AWS serverless satisfactory, so I would like to use it in various ways in the future.

As a first step, I have switched the SPARQL endpoint of Railway station LOD, a site that provides open railway data, to the Apache Jena version.

If you want to try it out, please refer to the following article.

Experimentally released the SPARQL endpoint of the railway station LOD https://qiita.com/uedayou/items/3ba823c5d3bede12af9c