Lucene.NET is a port of the popular Java search library Lucene (http://lucenenet.apache.org/). And how very awesome it is.

Recently I was porting a SQL Server database to SQL Azure, but the original database contained a full-text catalog for the website's search functionality, something SQL Azure does not support. You can get around this by using LIKE, but depending on how your index is built the performance can be terrible. LIKE '%{word}%' does NOT use the column's index; it does a table scan, which in SQL Azure is very costly (slow). LIKE '{word}%' will use the column's index though, so if your data is structured so that matches only need to hit the start of the column's contents you're in luck (a leading wildcard, as in LIKE '%{word}', still prevents an index seek). Because of the drastic performance hit of doing a table scan on every search, it was apparent we needed a new way to handle search. Enter Lucene.
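Just to illustrate the difference, here's roughly what the index-friendly LIKE fallback looks like in ADO.NET (a minimal sketch; the table and column names are hypothetical, and it assumes a connectionString and searchTerm from context):

using (var conn = new System.Data.SqlClient.SqlConnection(connectionString))
using (var cmd = new System.Data.SqlClient.SqlCommand(
    "SELECT SearchName, SearchURL FROM SearchData WHERE SearchName LIKE @term + '%'", conn))
{
    // A prefix-only predicate like this can seek an index on SearchName;
    // a leading wildcard ('%' + @term + '%') forces a full table scan instead.
    cmd.Parameters.AddWithValue("@term", searchTerm);
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // ... build results from reader["SearchName"], reader["SearchURL"] ...
        }
    }
}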

Lucene works by building its own index from the data you provide and then querying against that index to return results. Because our data isn't very big (56k rows totaling ~8 MB) we opted to simply put everything into the index; the benefit of that route is that a query result doesn't need to load anything from the database, so performance is pretty awesome. Because this powers a "live" search text box, performance is a major concern.

Lucene.NET does not really run on Azure out of the box. The issue is that an Azure cloud service doesn't have much access to the underlying file system, so the built-in methods Lucene uses to store its index are not available. You can use the built-in RAMDirectory to get around this, but we didn't have any success using it as the primary index location. We instead use AzureDirectory. Do not install the NuGet version; it's outdated and very poorly documented. Instead, download the AzureDirectory repo as a zip and add the entire project to your current solution. We tried adding it as a .dll but it would not cooperate for some unknown reason. Once this is added to your project it's time to make an index!

First you need to create an Azure storage account object:

Microsoft.WindowsAzure.Storage.CloudStorageAccount cloudAccount = Microsoft.WindowsAzure.Storage.CloudStorageAccount.Parse("DefaultEndpointsProtocol=https;AccountName={Name};AccountKey={Key}");

Now with this created you can make an AzureDirectory object with a RAMDirectory object for local cache. The second parameter is the name of the catalog you want to create (this is what the blob storage container will be named).

Lucene.Net.Store.RAMDirectory ramDIR = new Lucene.Net.Store.RAMDirectory();
var dir = new AzureDirectory(cloudAccount, "SearchCatalog", ramDIR);

Before we get any further we need to jump back a few steps. For performance you need to wrap the Lucene IndexSearcher object in a static class. The reason is that each time an IndexSearcher is initialized there is a delay while the index is loaded back into memory; couple that with the delay of downloading the index from blob storage and the timings get really slow. Passing true to the IndexSearcher constructor opens the index in read-only mode, making it impossible to write back to the index through this IndexSearcher.

public static class IndexSearch
{
    public static IndexSearcher indexSearcher;

    static IndexSearch()
    {
        // The "true" argument opens the index read-only, so this shared
        // searcher can never write back to the index.
        indexSearcher = new IndexSearcher(
            new AzureDirectory(
                Microsoft.WindowsAzure.Storage.CloudStorageAccount.Parse("DefaultEndpointsProtocol=https;AccountName={Name};AccountKey={Key}"),
                "SearchCatalog",
                new RAMDirectory()),
            true);
    }
}

Now for the code that took me entirely too long to work out (seriously guys, update some documentation!).

Declare a null IndexWriter and the type of analyzer you want to use; for most cases the StandardAnalyzer works just fine, so that's what we'll use.

IndexWriter writer = null;
StandardAnalyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

Now we need to get a lock on the index and clear it if we're doing a clean build (not calling DeleteAll will just cause new records to be appended onto the existing index).

while (writer == null)
{
    try
    {
        // Create the index if it doesn't exist yet, otherwise open the existing one.
        writer = new IndexWriter(dir, analyzer, !IndexReader.IndexExists(dir), IndexWriter.MaxFieldLength.UNLIMITED);

        // Clean build: wipe the old documents before re-indexing.
        writer.DeleteAll();
        writer.Commit();
    }
    catch (LockObtainFailedException)
    {
        // A stale write.lock was left in blob storage; clear it and retry.
        dir.ClearLock("write.lock");
    }
}

Now that we have a non-null writer it's time to put some documents into the index. Earlier I loaded all of the search data into a DataTable named searchData, so adding records just means iterating through its rows.

foreach (DataRow dr in searchData.Rows)
{
    // Create the Document object 
    Document doc = new Document();

    doc.Add(new Field("SearchName", dr["SearchName"].ToString(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
    doc.Add(new Field("SearchData", dr["SearchData"].ToString(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
    doc.Add(new Field("SearchURL", dr["SearchURL"].ToString(), Field.Store.YES, Field.Index.NO));
    doc.Add(new Field("SearchRank", dr["SearchRank"].ToString(), Field.Store.YES, Field.Index.NO));
    doc.Add(new Field("SearchPopularity", "", Field.Store.YES, Field.Index.NO));

    writer.AddDocument(doc);
}

As I said earlier we don't have a massive amount of data, so instead of needing to do a query against SQL Azure before loading the search results we simply store all of the necessary data directly in Lucene's index. This balloons out the index a bit, but since we query against everything but the URL it's not really wasted space, and it's only 15 MB.

Before moving on I want to cover what the different settings mean:

Field.Store: Whether to store the original contents of the field before the analyzer runs across it. If this is set to NO you will not be able to pull the original data back out of the index, but it will save you some space.

Field.Index: Whether to index this field. If set to NO the field will not be searchable.

Field.TermVector (optional): Whether this field should be analyzed for related terms in other documents. This lets you build out a graph-type dataset using similar terms, which is great for indexing documents that you want to show related documents for later.
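To make those trade-offs concrete, here's a quick sketch (the field names here are hypothetical, not from the catalog above):

// Indexed but not stored: searchable, but doc.Get("BodyText") will return null.
doc.Add(new Field("BodyText", bodyText, Field.Store.NO, Field.Index.ANALYZED));

// Stored but not indexed: retrievable with doc.Get("ThumbnailURL"), but no query will ever match on it.
doc.Add(new Field("ThumbnailURL", thumbnailUrl, Field.Store.YES, Field.Index.NO));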

Now call Optimize on the writer to have it merge and compact the index for faster reads, close it to release the blob lock, and then force the static IndexSearcher to load the new index (if you don't, you'll continue to search against the old index until it's unloaded from memory and otherwise forced to reload).

writer.Optimize();
writer.Close();

IndexSearch.indexSearcher = new IndexSearcher(new AzureDirectory(Microsoft.WindowsAzure.Storage.CloudStorageAccount.Parse("DefaultEndpointsProtocol=https;AccountName={Name};AccountKey={Key}"), "SearchCatalog", new RAMDirectory()), true);

Now that the hard part is done it's time to do a search and test it out. Below is the search we use in production at the moment. We are still tweaking it, but using the two different query types seems to give us the best balance of performance and results.

var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "SearchName", analyzer);
// "SearchName" is the field in the documents declared above to search.
// Alternatively this can be "all" to search all indexed fields.

Query query = parser.Parse(SearchString); //SearchString is a string that contains the search term from the website's text box.

TopDocs hits = IndexSearch.indexSearcher.Search(query, 20);
// 20 is the max number of documents to return; use IndexSearch.indexSearcher.MaxDoc to get all results.

TopDocs morehits = null;
if (hits.TotalHits < 20)
{
    // If we don't have 20 results yet, fill in the rest using a wildcard search.
    WildcardQuery wildcard = new Lucene.Net.Search.WildcardQuery(new Term("SearchName", SearchString + "*"));
    morehits = IndexSearch.indexSearcher.Search(wildcard, 20 - hits.TotalHits);
}

// Create a list of SearchItem objects for the view.
List<SearchItem> results = new List<SearchItem>();

// Iterate through the results; the hits don't contain the documents themselves,
// just each matching document's ID and score.
for (int i = 0; i < hits.ScoreDocs.Length; i++)
{
    int docID = hits.ScoreDocs[i].Doc;
    Document d = IndexSearch.indexSearcher.Doc(docID);
    results.Add(new SearchItem
    {
        // Use .Get({FieldName}) to load the string stored in the document.
        SearchName = d.Get("SearchName"),
        SearchURL = d.Get("SearchURL")
    });
}

// If the wildcard search has results, add those to the results list.
if (morehits != null)
{
    for (int i = 0; i < morehits.ScoreDocs.Length; i++)
    {
        int docID = morehits.ScoreDocs[i].Doc;
        Document d = IndexSearch.indexSearcher.Doc(docID);
        results.Add(new SearchItem
        {
            SearchName = d.Get("SearchName"),
            SearchURL = d.Get("SearchURL")
        });
    }
}
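For reference, SearchItem isn't a Lucene type; it's just the little view model the results get bound to. A minimal sketch, assuming only the two properties used above:

public class SearchItem
{
    public string SearchName { get; set; }
    public string SearchURL { get; set; }
}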

And that's it! Full-blown text search without a database, handled all in memory, for some stupid fast performance. Want to see it in action? Check out the live site I made this for at POSGuys.com. For a first time getting my feet wet with a library like this it seems to work well, though I expect as time goes on and we learn more about the power of Lucene we'll be able to put it to better use in other projects.

Lessons Learned

The biggest issue I ran into with this project was the outdated examples and documentation. Nowhere, for example, did the samples show that you need to pass the Lucene version number to the constructors.

Getting AzureDirectory running was very difficult, which is why it's just sitting as a project in the main solution for the site now. Not the best solution, but at least we have control of the source code if changes need to be made. Its getting-started documentation never covers how to get the lock on the blob, and frankly we never got the lock until the writer creation was put into that while() loop. That was maddening, because the error you get from Azure says nothing about locks; it just says the operation can't be done.

AzureDirectory also didn't state which CloudStorageAccount type it needed; it took some trial and error before realizing that there are two of them now.
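For anyone hitting the same confusion, these are the two types; the newer Storage client one is what worked for us:

// Older SDK (Microsoft.WindowsAzure.StorageClient assembly):
//   Microsoft.WindowsAzure.CloudStorageAccount
// Newer storage client library (Microsoft.WindowsAzure.Storage assembly) - the one used throughout this post:
var cloudAccount = Microsoft.WindowsAzure.Storage.CloudStorageAccount.Parse("DefaultEndpointsProtocol=https;AccountName={Name};AccountKey={Key}");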
