A custom-built, fully self-contained search engine for the Nuance.js docs

NuQuery

2024

What is it?

As Nuance.js has grown, I realized it would be useful to have a self-contained search engine for the docs to make navigation easier. This is where NuQuery comes in. If you'd like to test it out, click the search bar at the top right of the page and search for something, like "randomInRange" or "library".

How it works

Currently, the search engine works by using a simple but effective algorithm to search through the docs' data. To keep the search fast and efficient, it uses JS workers (Web Workers) to run the search algorithm in the background and then sends the results back to the main thread. This allows the search engine to work asynchronously, preventing the UI from being blocked while the search is being performed.
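
To give a sense of the wiring on the main-thread side, here's a minimal sketch. The worker file name, the message shape, and the displayResults helper are illustrative assumptions rather than the actual NuQuery code.

  // Main-thread sketch: spawn the worker, send it the search request,
  // and render whatever it posts back. The worker file name, message shape,
  // and render helper below are assumptions, not the actual NuQuery source.
  const searchWorker = new Worker("search-worker.js");

  // Hypothetical render helper for the grouped results
  function displayResults(results: unknown): void {
    console.log(results);
  }

  // Results arrive asynchronously, so the UI thread is never blocked
  searchWorker.onmessage = (event: MessageEvent) => {
    displayResults(event.data);
  };

  // Kick off a search by handing the worker the docs data and the query
  function runSearch(docsData: unknown[], searchValue: string): void {
    searchWorker.postMessage({ data: docsData, searchValue });
  }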

Let's take a look at the algorithm in more detail (a consolidated code sketch of the worker side follows the list):

  1. Storing Data:

    • First, we need to define and store some data about the docs. This data includes the title, description, and content of each doc. We also need to store the explicitly searchable terms of each page.
    • This data is stored in a JSON file, which is loaded into memory on page load.
    • Each piece of data contains:
      • The parent section of the doc (e.g. "Getting Started", "Core", "String", etc.)
      • The child section of the parent section (e.g. "Introduction", "Installation", "randomInRange()", etc.)
      • The child ID (this is typically the id of the header element on the page, e.g. "Arguments", "Returns", etc.)
      • The searchable text under that childId. For example, the searchable text for the "returns" childId of the "randomInRange()" function would be "(number): Randomly generated number within specified randomInRange".
      • The page associated with the child section, which is used later to link to the page from the search results.
  2. Giving the Worker the Context:

    • Next, the worker needs to receive two parameters: the data and the search value. Remember, the worker runs in its own thread and doesn't have access to the page's data, so we need to pass the data to it.
  3. The Search:

    • The worker then uses the search value to search through the data. It does this by iterating through each piece of data and checking whether the search value is contained in that entry's searchableText.
  4. Extracting Context:

    • When the search value is found within the searchableText, the algorithm extracts a context snippet around the search term.
    • The extractContext function is used to find the index of the search term in the searchableText.
    • It then splits the text into words and identifies the word index where the search term is found.
    • It extracts a snippet that includes up to 3 words before and up to 10 words after the search term, ensuring the snippet is concise and relevant.
    • // Find the word index where searchValMin is found
      const wordIndex = words.findIndex(...);
      const start = Math.max(0, wordIndex - 3);
      const end = Math.min(words.length, wordIndex + 11); // +11 to include the target word and 10 words after
      
      return words.slice(start, end).join(" ");
      
  5. Grouping Results by Parent Section:

    • After identifying all matching search results, the algorithm groups them by their parentSection.
    • The groupByParentSection function iterates through the found results and groups them into an array of objects, each containing a parentSection and an array of content items.
    • If a group for a parentSection already exists, the content is added to that group; otherwise, a new group is created.
  6. Filtering Search Results & Removing Duplicate Entries:
    • The algorithm has logic to avoid adding certain common headers ("Arguments", "Note", "Returns", "Example") directly to the search results unless they were found through the generic header search.
    • If these headers contain the search term, they are marked with foundByGenericHeader: true.
    • When the search term is found within the searchableText, the algorithm constructs a textExcerpt using the context extracted earlier.
    • const seenChildIds = new Set<string>();
      const filteredContent = section.content.filter((item) => {
        // Check if the item was found by a generic header and if the childId has already been seen
        const shouldFilterOut =
          item.foundByGenericHeader && seenChildIds.has(item.childId);
        if (!shouldFilterOut) {
          // If not filtering out, add the childId to seenChildIds (if found by a generic header)
          if (item.foundByGenericHeader) {
            seenChildIds.add(item.childId);
          }
          return true;
        }
        return false;
      });
      
  7. Returning the Results:
    • After processing, the algorithm returns the filtered and grouped search results.
    • The searchBySearchableText function compiles the results by calling the grouping and filtering functions and then returns the final structured results.
    • The worker sends the results back to the main thread using postMessage.
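
To make the steps above concrete, here's a consolidated sketch of the worker side. The function names (extractContext, groupByParentSection, searchBySearchableText) and field names come from the description above, but the types and implementations are my own reconstruction, so treat this as an approximation rather than the actual NuQuery source.

  // Worker-side sketch of the pipeline described above (types are assumptions).

  interface DocEntry {
    parentSection: string;   // e.g. "Core"
    childSection: string;    // e.g. "randomInRange()"
    childId: string;         // e.g. "Returns"
    searchableText: string;  // text associated with the childId
    page: string;            // page to link to from the results
  }

  interface FoundItem extends DocEntry {
    textExcerpt: string;
    foundByGenericHeader?: boolean;
  }

  interface GroupedResult {
    parentSection: string;
    content: FoundItem[];
  }

  // Step 4: pull a short snippet (3 words before, 10 after) around the match
  function extractContext(searchableText: string, searchValMin: string): string {
    const words = searchableText.split(" ");
    const wordIndex = words.findIndex((word) =>
      word.toLowerCase().includes(searchValMin)
    );
    if (wordIndex === -1) return "";
    const start = Math.max(0, wordIndex - 3);
    const end = Math.min(words.length, wordIndex + 11); // target word + 10 after
    return words.slice(start, end).join(" ");
  }

  // Step 5: group flat matches into one bucket per parentSection
  function groupByParentSection(found: FoundItem[]): GroupedResult[] {
    const groups: GroupedResult[] = [];
    for (const item of found) {
      const existing = groups.find((g) => g.parentSection === item.parentSection);
      if (existing) {
        existing.content.push(item);
      } else {
        groups.push({ parentSection: item.parentSection, content: [item] });
      }
    }
    return groups;
  }

  // Steps 3-5: scan the data, build excerpts, and group the matches
  function searchBySearchableText(data: DocEntry[], searchValue: string): GroupedResult[] {
    const searchValMin = searchValue.toLowerCase();
    const found: FoundItem[] = data
      .filter((entry) => entry.searchableText.toLowerCase().includes(searchValMin))
      .map((entry) => ({
        ...entry,
        textExcerpt: extractContext(entry.searchableText, searchValMin),
      }));
    // (The duplicate filtering from step 6 would run on each group before returning.)
    return groupByParentSection(found);
  }

  // Steps 2 and 7: receive { data, searchValue } and post the results back
  self.onmessage = (event: MessageEvent<{ data: DocEntry[]; searchValue: string }>) => {
    const { data, searchValue } = event.data;
    self.postMessage(searchBySearchableText(data, searchValue));
  };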

This algorithm efficiently searches through the provided content, extracts relevant context, filters and groups the results, and ensures that the results are unique and well-organized, making it easier to display and navigate the search results on the front end.

Challenges

This was the first time I've ever made something like this, and I often like to do things on my own the first time because I treat it as a learning experience. Some of the problems with this approach are that it's very manual and that it could use some refactoring and automation. Adding abstractions to the code would also make it easier to maintain and expand on. I plan on remaking an engine for QuébecNatif, and I'm hoping to make it more modular and easier to maintain.