lucene-s3directory

This is a Lucene Directory implementation for AWS S3. It stores indices in S3 buckets instead of the local file system. The project is still considered experimental, but it is now stable enough to be used in production.

There is an open pull request to merge this project into Lucene: apache/lucene#13949.

Also, there's a similar project called Nixiesearch with a broader scope, aiming to implement a full-featured, cloud-native search engine on top of S3. Check out the slides by Roman Grebennikov for an introduction to Nixiesearch.

Motivation

The project was inspired by Shay Banon (kimchy), creator of Elasticsearch and Compass. It is a direct fork of his JdbcDirectory, which is part of Compass.

Back in 2007, Shay wrote about the idea of Lucene-to-S3 integration in his blog post:

I spent some time trying to have the ability to store Lucene index on Amazon S3 service. Amazon S3 is a really cool idea, and having the ability to store Lucene index on top of it will provide a simple way to allow storing Lucene index in a distributed environment supporting HA. It will also make a lot of sense for applications deployed on Amazon EC2, since working with S3 from EC2 is free.

But back then S3 did not support locking, so he scrapped the implementation:

It would be great if the good people at Amazon would allow for simple locking support. I understand that this is not simple to do in a distributed environment, but it must be there in some form, it will make S3 much a more attractive offer.

S3 has supported object locking since late 2018. S3Directory uses legal hold locks on write.lock files.
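To illustrate the idea of using a legal hold as a lock marker, here is a minimal in-memory sketch. It does not call S3 and is not the library's code: `FakeLegalHoldBucket` and `LegalHoldLock` are hypothetical names, and the fake bucket only tracks one on/off legal-hold flag per key. In real S3 the check and the set are separate requests, so the actual implementation may look quite different.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a bucket that tracks a legal-hold flag per key. */
final class FakeLegalHoldBucket {
    private final Map<String, Boolean> holds = new HashMap<>();

    /** Creates the object if it does not exist yet, with the legal hold off. */
    synchronized void putObject(String key) {
        holds.putIfAbsent(key, false);
    }

    synchronized boolean isLegalHoldOn(String key) {
        return holds.getOrDefault(key, false);
    }

    synchronized void setLegalHold(String key, boolean on) {
        if (!holds.containsKey(key)) {
            throw new IllegalStateException("no such object: " + key);
        }
        holds.put(key, on);
    }
}

/** Sketch of a write lock expressed as a legal hold on a lock file (assumption, not the library's API). */
final class LegalHoldLock {
    private final FakeLegalHoldBucket bucket;
    private final String key;
    private boolean held;

    LegalHoldLock(FakeLegalHoldBucket bucket, String key) {
        this.bucket = bucket;
        this.key = key;
    }

    /** Returns true if the lock was acquired, false if the hold is already on. */
    boolean tryObtain() {
        synchronized (bucket) { // atomic here only because the bucket is in-memory
            bucket.putObject(key);
            if (bucket.isLegalHoldOn(key)) {
                return false;
            }
            bucket.setLegalHold(key, true);
            held = true;
            return true;
        }
    }

    /** Turns the hold off, but only if this instance was the one that set it. */
    void release() {
        synchronized (bucket) {
            if (held) {
                bucket.setLegalHold(key, false);
                held = false;
            }
        }
    }
}
```

With two `LegalHoldLock` instances on the same `write.lock` key, the second `tryObtain()` fails until the first holder calls `release()`.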

Getting started

The package is available on Maven Central:

<dependency>
  <groupId>com.erudika</groupId>
  <artifactId>lucene-s3directory</artifactId>
  <version>1.0.0</version>
</dependency>

Requirements:

Java 17+
Lucene 10+ compatible

To build the project:

mvn -DskipTests=true clean install

Usage:

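// Lucene imports; S3Directory itself must also be imported from this library.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.StoredFields;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

final S3Directory s3dir = new S3Directory("my-lucene-index");
s3dir.create();

IndexWriterConfig config = new IndexWriterConfig();
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
config.setUseCompoundFile(false);
// try-with-resources closes the writer first, then the directory.
try (s3dir; IndexWriter writer = new IndexWriter(s3dir, config)) {
  Document doc = new Document();
  String word = "lorem ipsum dolor";
  doc.add(new StringField("keyword", word, Field.Store.YES));
  doc.add(new StringField("unindexed", word, Field.Store.YES));
  doc.add(new StringField("unstored", word, Field.Store.NO));
  // TextField is tokenized, so the single-term query below can match "ipsum".
  doc.add(new TextField("text", word, Field.Store.YES));
  writer.addDocument(doc);
  // Commit so that a reader opened on the directory can see the document.
  writer.commit();

  final Query query = new TermQuery(new Term("text", "ipsum"));
  try (DirectoryReader ireader = DirectoryReader.open(s3dir)) {
    final IndexSearcher isearcher = new IndexSearcher(ireader);
    final TopDocs topDocs = isearcher.search(query, 1000);
    final StoredFields storedFields = isearcher.storedFields();
    final ScoreDoc[] hits = topDocs.scoreDocs;
    // Iterate through the results:
    for (final ScoreDoc hit : hits) {
      final Document hitDoc = storedFields.document(hit.doc);
      System.out.println("This is the text found: " + hitDoc.get("text"));
    }
  } catch (Exception e) {
    e.printStackTrace();
  }
}

// The try-with-resources block above already closes s3dir.
// Optionally, close or delete manually, if you manage the directory yourself.
s3dir.close();
s3dir.delete();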

The integration tests use the adobe/S3Mock library for local testing and require neither access to the real S3 service nor an AWS account.

Dependencies

The project initially used the official AWS Java SDK v2, but that dependency was later removed in favor of the excellent and lightweight AWS Lightweight Java Client by @davidmoten.

There are 3 dependencies in total:

AWS Lightweight Java Client
Lucene Core
SLF4J API

Performance

Performance is not great. Each request to AWS is slow because of the TLS handshake, signature calculation, etc. I did my best to optimize the code, but I'm sure it can be optimized further. Contributions are welcome.

Results from S3DirectoryBenchmarkITest.java:

RAMDirectory Time: 225 ms
FSDirectory Time : 62 ms
S3Directory Time : 16859 ms

Contributions & Goals

Contributions and PRs are welcome, especially those that aim to improve performance. The feature I would most like to see implemented is some sort of block caching for reads, backed by an MMapDirectory. The idea for this feature was presented at Haystack EU '24 by Roman Grebennikov as part of his Nixiesearch presentation.
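As a starting point for discussion, here is a minimal sketch of block caching for reads. It is a simplification under stated assumptions: the cache is a plain in-memory LRU map rather than something backed by an MMapDirectory, and `BlockSource`, `ArrayBlockSource`, and `BlockCache` are hypothetical names that do not exist in this project. The point is only that many small reads can be served from a few fixed-size blocks, so fewer slow requests reach the remote store.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical byte source standing in for ranged reads of one remote object. */
interface BlockSource {
    long length();

    /** Returns exactly {@code length} bytes starting at {@code offset}. */
    byte[] read(long offset, int length);
}

/** In-memory source used only for the demo. */
final class ArrayBlockSource implements BlockSource {
    private final byte[] data;

    ArrayBlockSource(byte[] data) {
        this.data = data;
    }

    @Override
    public long length() {
        return data.length;
    }

    @Override
    public byte[] read(long offset, int length) {
        byte[] out = new byte[length];
        System.arraycopy(data, (int) offset, out, 0, length);
        return out;
    }
}

/** LRU cache of fixed-size blocks in front of a slow BlockSource (single-threaded sketch). */
final class BlockCache {
    private final BlockSource source;
    private final int blockSize;
    private final Map<Long, byte[]> blocks;
    private int misses = 0;

    BlockCache(BlockSource source, int blockSize, int maxBlocks) {
        this.source = source;
        this.blockSize = blockSize;
        // An access-ordered LinkedHashMap drops the least recently used block first.
        this.blocks = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > maxBlocks;
            }
        };
    }

    byte readByte(long pos) {
        if (pos < 0 || pos >= source.length()) {
            throw new IndexOutOfBoundsException("position out of range: " + pos);
        }
        long blockIndex = pos / blockSize;
        byte[] block = blocks.get(blockIndex);
        if (block == null) {
            // Cache miss: fetch the whole block with one request to the source.
            misses++;
            long start = blockIndex * blockSize;
            int len = (int) Math.min(blockSize, source.length() - start);
            block = source.read(start, len);
            blocks.put(blockIndex, block);
        }
        return block[(int) (pos % blockSize)];
    }

    /** Number of reads that had to go to the underlying source. */
    int misses() {
        return misses;
    }
}

final class BlockCacheDemo {
    public static void main(String[] args) {
        byte[] data = new byte[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i % 251);
        }
        BlockCache cache = new BlockCache(new ArrayBlockSource(data), 4096, 2);
        // Read the first block byte by byte; only the first read should miss.
        for (int i = 0; i < 4096; i++) {
            cache.readByte(i);
        }
        System.out.println("source requests: " + cache.misses());
    }
}
```

A real implementation would also need thread safety, invalidation when index files are deleted or replaced, and a sensible block size; those choices are left open here.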

License

Apache 2.0
