Drupal 8 as a Static Site: Elasticsearch Query

Submitted by nigel on Wednesday 28th November 2018

The most fundamental part of the search is the query - this determines which documents are retrieved for a given search. It must obviously reflect the search phrase entered into the search form, and the results should be presented in order of relevance, most relevant first.

Elasticsearch is excellent at this - but it's crucial that the query reflects exactly what I want to achieve and searches the correct fields. No two sites are the same, and therefore no two search requirements are identical. I have decided that I want to search on the following:

  1. Blog title
  2. Blog body text (a custom defined Paragraph type)
  3. Body text (only used with the book content type and not my regular blogs)
  4. Technology vocabulary which I use to tag my blog

Each of these will be assigned a score during the search process and the results delivered back in high to low score order.

Initial basic query
Unless you're an expert in Elasticsearch queries (I am definitely not that), you will need a useful tool with automatic syntax suggestions, and lo and behold Kibana is excellent at this. My first stab at the query is shown below, and it behaved pretty much to my expectations.
GET /myindex/_search
{
  "query": {
    "query_string": {
      "query": "Search phrase here",
      "fields": [
        "title",
        "*body",
        "term"
      ],
      "default_operator": "OR"
    }
  },
  "highlight": {
    "number_of_fragments": 1,
    "pre_tags": [ "<strong>" ],
    "post_tags": [ "</strong>" ],
    "fragment_size": 400,
    "no_match_size": 400,
    "phrase_limit": 1,
    "fields": {
      "*body": {},
      "title": {},
      "term": {}
    }
  },
  "size": 10
}
Some of this is self-explanatory, some less so. Let's cherry-pick our explanations.
*body - I have two fields which contain the word body, and these can be searched using a wildcard.
highlight - This is where I specify what I actually want to display back to the end user. The following parameters define this.
number_of_fragments - The maximum number of fragments to return.
pre_tags and post_tags - When the search term is found, we can specify what tags we want before and after it. I chose strong instead of the default em.
fragment_size - The maximum number of characters the snippet of the find will contain, although it could be less depending upon how Elasticsearch's paragraph and line break algorithm interprets the result.
no_match_size - As above, but for when there are no matches in a particular field. I still want to show a snippet so the user has some context.
phrase_limit - Controls the number of matching phrases in a document that are considered.
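For reference, the same request can be issued from code rather than Kibana. Below is a minimal sketch of a function that builds the request body above; the commented line shows how it might be sent with the official elasticsearch-py client, whose connection details (and the index name myindex) are assumptions here.

```python
def build_search_query(phrase, size=10):
    """Build the search request body shown above.

    The field list and highlight settings mirror the Kibana query;
    adjust them to match your own index mapping.
    """
    return {
        "query": {
            "query_string": {
                "query": phrase,
                "fields": ["title", "*body", "term"],
                "default_operator": "OR",
            }
        },
        "highlight": {
            "number_of_fragments": 1,
            "pre_tags": ["<strong>"],
            "post_tags": ["</strong>"],
            "fragment_size": 400,
            "no_match_size": 400,
            "phrase_limit": 1,
            "fields": {"*body": {}, "title": {}, "term": {}},
        },
        "size": size,
    }

# With an elasticsearch-py client `es`, this would be sent as, e.g.:
#   es.search(index="myindex", body=build_search_query("drupal static"))
```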

The major shortcoming of this query is that it shows no favouritism towards the date the blog is published. Whether it was published yesterday or ten years ago, the score will be the same.
Adding a date decay to favour more recent blogs
Elasticsearch date decay

Technical blogs are ephemeral and transitory. They tend to have a shelf life when they are very popular, but then their popularity wanes as time goes by. This can be due to a number of factors - such as the popularity of the subject of the blog, the technology written about evolving and thus obsoleting the original blog, and the subject of the blog becoming commonplace or covered more extensively in official documentation.

Therefore we need a mechanism to favour newer blogs in our search that will be more relevant to our users. Elasticsearch has a decay feature, and it supports three different mathematical formulas: gauss, linear and exp. I had a think here - I could use the default Gaussian approach, and with some subtle parameter settings I think it would have done a good job. I could also have used exponential, with its relatively shallow curve to represent the passing of time. In the end I decided I liked a linear representation.

My rationale is this: A blog will be totally relevant for about a year, then it will decay at a linear rate for a number of years, then it will hit the bottom. Sheer speculation drove me to use a total decay period of eight years although I concede this is somewhat arbitrary and probably sub-optimal. 
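To sanity-check that choice, the three curve shapes can be reproduced outside Elasticsearch. This is a hand-rolled sketch following the decay formulas in the function_score documentation, not the engine itself; all three are constructed so the score equals decay at a distance of scale from the origin.

```python
import math

def gauss_decay(distance, scale, decay):
    # sigma^2 is chosen so the score equals `decay` at `distance == scale`
    sigma2 = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-distance ** 2 / (2.0 * sigma2))

def exp_decay(distance, scale, decay):
    # Shallow exponential drop-off, again hitting `decay` at `scale`
    lam = math.log(decay) / scale
    return math.exp(lam * distance)

def linear_decay(distance, scale, decay):
    # Straight line down, clamped at zero once it hits the bottom
    sigma = scale / (1.0 - decay)
    return max(0.0, (sigma - distance) / sigma)

# All three curves agree at the scale point: the score equals decay (0.5)
for fn in (gauss_decay, exp_decay, linear_decay):
    print(fn.__name__, round(fn(1460, 1460, 0.5), 3))
```

What differs is the behaviour either side of that point - gauss stays flatter near the origin, exp never quite reaches zero, and linear hits zero and stays there.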


The syntax for representing this is a little arcane. Here we go:
"function_score": {
  "functions": [
    {
      "linear": {
        "created": {
          "origin": "now",
          "offset": "365d", 
          "scale": "1460d",
          "decay": 0.5
          }
        }
      }
    ]
  }
Here is the commentary:
created - The name of the date field holding the date the content was published.
origin - The start point for our time span, which will always be anchored at the current date.
offset - This is the initial flat spot at the top of the graph above. I am saying here that all documents in their first year will be scored equally from a date-created perspective.
scale - Defines the distance from origin + offset at which the computed score will equal the decay parameter.
decay - Defines how documents are scored at the distance given by scale. With these values the score halves 1,460 days past the offset, and, continuing linearly, reaches zero 2,920 days (eight years) after it.
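Putting offset, scale and decay together, the effective date score for a document of a given age can be sketched as follows. This is a hand-rolled approximation of Elasticsearch's linear decay for illustration, not the engine's own code.

```python
def date_score(age_days, offset=365, scale=1460, decay=0.5):
    """Approximate the linear date-decay score for a document's age."""
    # Inside the offset the score is flat at 1.0
    distance = max(0.0, age_days - offset)
    # For linear decay the score reaches zero at scale / (1 - decay)
    # days past the offset: 1460 / 0.5 = 2920 days, i.e. eight years
    zero_at = scale / (1.0 - decay)
    return max(0.0, (zero_at - distance) / zero_at)

print(date_score(100))          # inside the first year -> 1.0
print(date_score(365 + 1460))   # offset + scale -> 0.5 (the decay value)
print(date_score(365 + 2920))   # eight years past the offset -> 0.0
```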
The complete query
Adding the decay function to the initial query means the two scores are multiplied together (the default behaviour when no combination mode is specified).
GET /myindex/_search
{
  "query": {
    "function_score": {
      "query": {
        "query_string": {
          "query": "Search phrase here",
          "fields": [
            "title",
            "*body",
            "term"
          ],
          "default_operator": "OR"
        }
      },
      "functions": [
        {
          "linear": {
            "created": {
              "origin": "now",
              "offset": "365d",
              "scale": "1460d",
              "decay": 0.5
            }
          }
        }
      ]
    }
  },
  "highlight": {
    "number_of_fragments": 1,
    "pre_tags": [ "<strong>" ],
    "post_tags": [ "</strong>" ],
    "fragment_size": 400,
    "no_match_size": 400,
    "phrase_limit": 1,
    "fields": {
      "*body": {},
      "title": {},
      "term": {}
    }
  },
  "size": 10
}
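Once the query returns, the hits still need unpacking before they can be rendered on the static site. Below is a small sketch of such a helper; the response shape is the standard Elasticsearch hits envelope, and the assumption is that we want the title plus the first highlighted fragment from whichever field matched.

```python
def format_hits(response):
    """Flatten an Elasticsearch search response into (title, snippet) pairs.

    Prefers a highlighted fragment when one exists (the highlight keys
    are the concrete field names the *body wildcard expanded to),
    falling back to an empty snippet so every hit still renders.
    """
    results = []
    for hit in response["hits"]["hits"]:
        title = hit.get("_source", {}).get("title", "")
        fragments = hit.get("highlight", {})
        snippet = next(
            (frags[0] for frags in fragments.values() if frags), ""
        )
        results.append((title, snippet))
    return results
```

In the site's JavaScript the equivalent loop runs client-side, but the unpacking logic is the same.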