Tuesday, January 9, 2018

An introduction to machine-learned ranking in Apache Solr

https://opensource.com/article/17/11/learning-rank-apache-solr

Learn how to train a machine learning model to rank documents retrieved in the Solr enterprise search platform.

This tutorial describes how to implement a modern learning to rank (LTR, also called machine-learned ranking) system in Apache Solr. It's intended for people who have zero Solr experience, but who are comfortable with machine learning and information retrieval concepts. I was one of those people only a couple of months ago, and I found it extremely challenging to get up and running with the Solr materials I found online. This is my attempt at writing the tutorial I wish I had when I was getting started.

Table of contents

Setting up Solr
Solr basics
Defining features
Learning to rank
RankNet

Setting up Solr

Firing up a vanilla Solr instance on Linux (Fedora, in my case) is actually pretty straightforward. First, download the Solr source tarball (i.e., the file with "src" in its name) and extract it to a reasonable location. Next, cd into the Solr directory:
cd /path/to/solr-<version>/solr
Building Solr requires Apache Ant and Apache Ivy, so install those:
sudo dnf install ant ivy
And now build Solr:
ant server
You can confirm Solr is working by running:
bin/solr start
and making sure you see the Solr Admin interface at http://localhost:8983/solr/. You can stop Solr (but don't stop it now) with:
bin/solr stop

Solr basics

Solr is a search platform, so you really only need to know how to do two things to be productive: index data and define a ranking model. Solr has a REST-like API, which means you'll make most changes with the curl command. To get going, create a core named test:
bin/solr create -c test
This seemingly simple command actually did a lot of stuff behind the scenes. Specifically, it defined a schema that tells Solr how documents should be processed (think tokenization, stemming, etc.) and searched (e.g., using the tf-idf vector space model), and it set up a configuration file that specifies what libraries and handlers Solr will use. A core can be deleted with:
bin/solr delete -c test
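By the way, if you're curious what the create command actually set up, you can inspect the generated schema through Solr's Schema API. Here's a quick sketch using Python's requests library (an optional aside; the same Schema API endpoint shows up again with curl later in this tutorial):

import requests

# Fetch the schema that "bin/solr create -c test" generated for the test core.
response = requests.get("http://localhost:8983/solr/test/schema")
schema = response.json()["schema"]
print(schema["name"])
print([field["name"] for field in schema["fields"]])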
OK, let's add some documents. First, download the XML file of tweets provided in the Solr in Action GitHub repository. Take a look inside the XML file. Notice how it uses an <add> tag to tell Solr to add several documents (each enclosed in <doc> tags) to the index. To index the tweets, run:
bin/post -c test /path/to/tweets.xml
If you go to http://localhost:8983/solr/ now (you might have to refresh) and click on the "Core Selector" dropdown on the left-hand side, you can select the test core. If you then click on the "Query" tab, the query interface will appear. If you click on the blue "Execute Query" button at the bottom, a JSON document containing information regarding the tweets that were just indexed will be displayed. Congratulations, you just ran your first successful query! Specifically, you used the /select RequestHandler to execute the query *:*. The *:* is a special syntax that tells Solr to return everything. The Solr query syntax is not very intuitive, in my opinion, so it's something you'll just have to get used to.
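You can run the same query outside the Admin UI, too. Here's a minimal sketch using Python's requests library against the /select RequestHandler (equivalent to clicking "Execute Query" with the default settings):

import requests

# Run the match-everything query (*:*) against the /select RequestHandler.
params = {"q": "*:*", "wt": "json"}
response = requests.get("http://localhost:8983/solr/test/select", params=params)
results = response.json()["response"]
print(results["numFound"])  # Number of indexed tweets.
print(results["docs"][0])   # The first document returned.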

Defining features

Now that you have a basic Solr instance up and running, define features for your LTR system. Like all machine learning problems, effective feature engineering is critical to success. Standard features in modern LTR models include using multiple similarity measures (e.g., cosine similarity of tf-idf vectors or BM25) to compare multiple text fields (e.g., body, title), in addition to other text characteristics (e.g., length) and document characteristics (e.g., age, PageRank). A good starting point is Microsoft Research's list of features for an academic data set. A list of some other commonly used features can be found on slide 32 of University of Massachusetts Amherst researcher Jiepu Jiang's lecture notes.
To start, modify /path/to/solr-<version>/solr/server/solr/test/conf/managed-schema so it includes the text fields that you'll need for your model. First, change the text field so that it is of the text_general type (which is already defined inside managed-schema). The text_general type will allow you to calculate BM25 similarities. Because the text field already exists (it was automatically created when you indexed the tweets), you need to use the replace-field command like so:
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "replace-field" : {
     "name":"text",
     "type":"text_general",
     "indexed":"true",
     "stored":"true",
     "multiValued":"true"}
}' http://localhost:8983/solr/test/schema
I encourage you to take a look inside managed-schema following each change so you can get a sense for what's happening. Next, specify a text_tfidf type, which will allow you to calculate tf-idf cosine similarities:
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type" : {
     "name":"text_tfidf",
     "class":"solr.TextField",
     "positionIncrementGap":"100",
     "indexAnalyzer":{
        "tokenizer":{
           "class":"solr.StandardTokenizerFactory"},
        "filter":{
           "class":"solr.StopFilterFactory",
           "ignoreCase":"true",
           "words":"stopwords.txt"},
        "filter":{
           "class":"solr.LowerCaseFilterFactory"}},
     "queryAnalyzer":{
        "tokenizer":{
           "class":"solr.StandardTokenizerFactory"},
        "filter":{
           "class":"solr.StopFilterFactory",
           "ignoreCase":"true",
           "words":"stopwords.txt"},
        "filter":{
           "class":"solr.SynonymGraphFilterFactory",
           "ignoreCase":"true",
           "synonyms":"synonyms.txt"},
        "filter":{
           "class":"solr.LowerCaseFilterFactory"}},
     "similarity":{
           "class":"solr.ClassicSimilarityFactory"}}
}' http://localhost:8983/solr/test/schema
Now add a text_tfidf field that will be of the text_tfidf type you just defined:
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field" : {
     "name":"text_tfidf",
     "type":"text_tfidf",
     "indexed":"true",
     "stored":"false",
     "multiValued":"true"}
}' http://localhost:8983/solr/test/schema
Because the contents of the text field and the text_tfidf field are the same (they're just being handled differently), tell Solr to copy the contents from text to text_tfidf:
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-copy-field" : {
     "source":"text",
     "dest":"text_tfidf"}
}' http://localhost:8983/solr/test/schema
You're now ready to re-index your data:
bin/post -c test /path/to/tweets.xml

Learning to rank

Now that your documents are properly indexed, build an LTR model. If you're new to LTR, I recommend checking out Tie-Yan Liu's (long) paper and textbook. If you're familiar with machine learning, the ideas shouldn't be too difficult to grasp. I also recommend checking out the Solr documentation on LTR, which I'll be linking to throughout this section. Enabling LTR in Solr first requires making some changes to /path/to/solr-<version>/solr/server/solr/test/conf/solrconfig.xml. Copy and paste the following text anywhere between the <config> and </config> tags (at the top and bottom of the file, respectively):

<lib dir="${solr.install.dir:../../../..}/contrib/ltr/lib/" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-ltr-\d.*\.jar" />

<queryParser name="ltr" class="org.apache.solr.ltr.search.LTRQParserPlugin"/>

<cache name="QUERY_DOC_FV"
       class="solr.search.LRUCache"
       size="4096"
       initialSize="2048"
       autowarmCount="4096"
       regenerator="solr.search.NoOpRegenerator" />

<transformer name="features" class="org.apache.solr.ltr.response.transform.LTRFeatureLoggerTransformerFactory">
  <str name="fvCacheName">QUERY_DOC_FV</str>
</transformer>

You're now ready to run Solr with LTR enabled. First, stop Solr:
bin/solr stop
then restart it with the LTR plugin enabled:
bin/solr start -Dsolr.ltr.enabled=true
Next, push the model features and the model specification to Solr. In Solr, LTR features are defined using a JSON-formatted file. For this model, save the following features in my_efi_features.json:
[
  {
    "store" : "my_efi_feature_store",
    "name" : "tfidf_sim_a",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf=text_tfidf}${text_a}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "tfidf_sim_b",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf=text_tfidf}${text_b}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "bm25_sim_a",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf=text}${text_a}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "bm25_sim_b",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf=text}${text_b}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "max_sim",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf='text text_tfidf'}${text}" }
  }
]
Here, store tells Solr where to store the feature; name is the name of the feature; class specifies which Java class will handle the feature; and params provides additional information about the feature required by its Java class. In the case of a SolrFeature, you need to provide the query. {!dismax qf=text_tfidf}${text_a} tells Solr to search the text_tfidf field with the contents of text_a using the DisMaxQParser. The reason to use the DisMax parser instead of the seemingly more obvious FieldQParser (e.g., {!field f=text_tfidf}${text_a}) is that the FieldQParser automatically converts multi-term queries into "phrases" (i.e., it treats something like "the cat in the hat" as, effectively, "thecatinthehat" rather than "the", "cat", "in", "the", "hat"). This FieldQParser behavior (which seems like a rather strange default to me) ended up giving me quite a headache, but I eventually found a solution with the DisMaxQParser.
{!dismax qf='text text_tfidf'}${text} tells Solr to search both the text and text_tfidf fields with the contents of text and then take the max of those two scores. While this feature doesn't really make sense in this context because similarities from both fields are already being used as features, it demonstrates how such a feature could be implemented. For example, imagine that the documents in your corpus are linked to, at most, five other sources of text data. It might make sense to incorporate that information during a search, and taking the max over multiple similarity scores is one way of doing that.
To push the features to Solr, run the following command:
curl -XPUT 'http://localhost:8983/solr/test/schema/feature-store' --data-binary "@/path/to/my_efi_features.json" -H 'Content-type:application/json'
If you want to upload new features, you first have to delete the old features with:
curl -XDELETE 'http://localhost:8983/solr/test/schema/feature-store/my_efi_feature_store'
Next, save the following model specification in my_efi_model.json:
{   "store" : "my_efi_feature_store",   "name" : "my_efi_model",   "class" : "org.apache.solr.ltr.model.LinearModel",   "features" : [     { "name" : "tfidf_sim_a" },     { "name" : "tfidf_sim_b" },     { "name" : "bm25_sim_a" },     { "name" : "bm25_sim_b" },     { "name" : "max_sim" }   ],   "params" : {     "weights" : {       "tfidf_sim_a" : 1.0,       "tfidf_sim_b" : 1.0,       "bm25_sim_a" : 1.0,       "bm25_sim_b" : 1.0,       "max_sim" : 0.5     }   } }
In this case, store specifies where the features used by the model are stored; name is the name of the model; class specifies which Java class will implement the model; features is a list of the model's features; and params provides additional information required by the model's Java class. Start by using the LinearModel, which simply takes a weighted sum of the feature values to generate a score. Obviously, the provided weights are arbitrary. To find better weights, you'll need to extract training data from Solr. I'll go over this topic in more depth in the RankNet section.
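To make the scoring concrete, here's a minimal sketch in plain Python of the weighted sum the LinearModel computes for a single document (the feature values are made up for illustration):

# LinearModel scoring: score = sum of weight_i * feature_value_i.
weights = {
    "tfidf_sim_a": 1.0,
    "tfidf_sim_b": 1.0,
    "bm25_sim_a": 1.0,
    "bm25_sim_b": 1.0,
    "max_sim": 0.5,
}
# Hypothetical feature values for one document.
features = {
    "tfidf_sim_a": 0.54,
    "tfidf_sim_b": 0.0,
    "bm25_sim_a": 0.84,
    "bm25_sim_b": 0.84,
    "max_sim": 1.69,
}
score = sum(weights[name] * value for name, value in features.items())
print(score)  # The re-ranking score these weights would produce.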
You can push the model to Solr with:
curl -XPUT 'http://localhost:8983/solr/test/schema/model-store' --data-binary "@/path/to/my_efi_model.json" -H 'Content-type:application/json'
And now you're ready to run your first LTR query:
http://localhost:8983/solr/test/query?q=historic north&rq={!ltr model=my_efi_model efi.text_a=historic efi.text_b=north efi.text='historic north'}&fl=id,score,[features]
You should see something like:
{   "responseHeader":{     "status":0,     "QTime":101,     "params":{       "q":"historic north",       "fl":"id,score,[features]",       "rq":"{!ltr model=my_efi_model efi.text_a=historic efi.text_b=north efi.text='historic north'}"}},   "response":{"numFound":1,"start":0,"maxScore":3.0671878,"docs":[       {         "id":"1",         "score":3.0671878,         "[features]":"tfidf_sim_a=0.53751516,tfidf_sim_b=0.0,bm25_sim_a=0.84322417,bm25_sim_b=0.84322417,max_sim=1.6864483"}]   } }
Referring to the request, q=historic north is the query used to fetch the initial results (using BM25 in this case), which are then re-ranked with the LTR model. rq is where all the LTR parameters are provided, and efi stands for "external feature information," which allows you to specify features at query time. In this case, you're populating the text_a argument with the term historic, the text_b argument with the term north, and the text argument with the multi-term query 'historic north' (note that this is not being treated as a "phrase"). fl=id,score,[features] tells Solr to include the id, score, and model features in the results. You can verify that the feature values are correct by performing the associated search in the "Query" interface of the Solr Admin UI. For example, typing text_tfidf:historic in the q text box and typing score in the fl text box, and then clicking the "Execute Query" button should return a value of 0.53751516.
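If you'd rather issue the LTR request programmatically than paste the URL into a browser, here's a small sketch using Python's requests library (the parameters mirror the request above, and requests takes care of the URL encoding):

import requests

# Fetch results with BM25, then re-rank them with my_efi_model.
params = {
    "q": "historic north",
    "rq": "{!ltr model=my_efi_model efi.text_a=historic efi.text_b=north "
          "efi.text='historic north'}",
    "fl": "id,score,[features]",
}
response = requests.get("http://localhost:8983/solr/test/query", params=params)
for doc in response.json()["response"]["docs"]:
    print(doc["id"], doc["score"], doc["[features]"])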

RankNet

For LTR systems, linear models are generally trained using what's called a "pointwise" approach, which is where documents are considered individually (i.e., the model asks, "Is this document relevant to the query or not?"); however, pointwise approaches are generally not well-suited for LTR problems. RankNet is a neural network that uses a "pairwise" approach, which is where documents with a known relative preference are considered in pairs (i.e., the model asks, "Is document A more relevant than document B for the query or not?"). RankNet is not supported by Solr out of the box, but I've implemented RankNet in Solr and Keras. It's worth noting that LambdaMART might be more appropriate for your search application. However, RankNet can be trained quickly on a GPU using my Keras implementation, which makes it a good solution for search problems where only one document is relevant to any given query. For a nice (technical) overview of RankNet, LambdaRank, and LambdaMART, see Chris Burges' paper written when he was at Microsoft Research.
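To give a flavor of the idea before diving into the Solr and Keras implementations below, here's a tiny sketch (toy numbers, not the actual implementation) of RankNet's pairwise objective: both documents are scored by the same network, and the probability that the relevant document should outrank the irrelevant one is a sigmoid of the score difference, trained with cross-entropy:

import numpy as np

def pairwise_prob(score_rel, score_irr):
    # RankNet models P(relevant ranks above irrelevant) = sigmoid(s_rel - s_irr).
    return 1.0 / (1.0 + np.exp(-(score_rel - score_irr)))

# Toy scores for a (relevant, irrelevant) document pair.
p = pairwise_prob(2.3, 1.1)
loss = -np.log(p)  # Binary cross-entropy with target 1 (relevant should win).
print(p, loss)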
To enable RankNet in Solr, you'll have to add RankNet.java to /path/to/solr-<version>/solr/contrib/ltr/src/java/org/apache/solr/ltr/model and then re-build Solr (reminder: build from /path/to/solr-<version>/solr):
ant server
Now if you inspect /path/to/solr-<version>/solr/dist/solr-ltr-<version>-SNAPSHOT.jar, you should see RankNet.class under /org/apache/solr/ltr/model/.
Unfortunately, the suggested method of feature extraction in Solr is painfully slow (other Solr users seem to agree it could be faster). Even when making requests in parallel, it took me almost three days to extract features for ~200,000 queries. I think a better approach might be to index the queries and then calculate the similarities between the "documents" (which consist of the true documents and queries), but this is really something that should be baked into Solr. Anyway, here is some example Python code for extracting features from Solr using queries:
import numpy as np
import requests
import simplejson

# Number of documents to be re-ranked.
RERANK = 50
with open("RERANK.int", "w") as f:
    f.write(str(RERANK))


# Runs once per query. row is a record containing the query's id and text
# fields, and relevant maps document ids to the query ids they are relevant to.
def extract_features(row, relevant):
    # Build query URL.
    q_id = row["id"]
    q_field_a = row["field_a"].strip().lower()
    q_field_b = row["field_b"].strip().lower()
    q_field_c = row["field_c"].strip().lower()
    q_field_d = row["field_d"].strip().lower()
    all_text = " ".join([q_field_a, q_field_b, q_field_c, q_field_d])

    url = "http://localhost:8983/solr/test/query"
    # We only re-rank one document when extracting features because we want to be
    # able to compare the LTR model to the BM25 ranking. Setting reRankDocs=1
    # ensures the original ranking is maintained.
    url += "?q={0}&rq={{!ltr model=my_efi_model reRankDocs=1 ".format(all_text)
    url += "efi.field_a='{0}' efi.field_b='{1}' efi.field_c='{2}' efi.field_d='{3}' ".format(
        q_field_a, q_field_b, q_field_c, q_field_d)
    url += "efi.all_text='{0}'}}&fl=id,score,[features]&rows={1}".format(all_text, RERANK)

    # Get response and check for errors.
    response = requests.request("GET", url)
    try:
        json = simplejson.loads(response.text)
    except simplejson.JSONDecodeError:
        print(q_id)
        return
    if "error" in json:
        print(q_id)
        return

    # Extract the features.
    results_features = []
    results_targets = []
    results_ranks = []
    add_data = False
    for (rank, document) in enumerate(json["response"]["docs"]):
        features = document["[features]"].split(",")
        feature_array = []
        for feature in features:
            feature_array.append(feature.split("=")[1])
        feature_array = np.array(feature_array, dtype = "float32")
        results_features.append(feature_array)
        doc_id = document["id"]
        # Check if document is relevant to query.
        if q_id in relevant.get(doc_id, {}):
            results_ranks.append(rank + 1)
            results_targets.append(1)
            add_data = True
        else:
            results_targets.append(0)

    if add_data:
        np.save("{0}_X.npy".format(q_id), np.array(results_features))
        np.save("{0}_y.npy".format(q_id), np.array(results_targets))
        np.save("{0}_rank.npy".format(q_id), np.array(results_ranks))
Now you're ready to train some models. To start, pull in the data and evaluate the BM25 rankings on the entire data set.
import glob
import numpy as np

rank_files = glob.glob("*_rank.npy")
suffix_len = len("_rank.npy")
RERANK = int(open("RERANK.int").read())

ranks = []
casenumbers = []
Xs = []
ys = []
for rank_file in rank_files:
    X = np.load(rank_file[:-suffix_len] + "_X.npy")
    casenumbers.append(rank_file[:-suffix_len])
    if X.shape[0] != RERANK:
        print(rank_file[:-suffix_len])
        continue
    rank = np.load(rank_file)[0]
    ranks.append(rank)
    y = np.load(rank_file[:-suffix_len] + "_y.npy")
    Xs.append(X)
    ys.append(y)

ranks = np.array(ranks)
total_docs = len(ranks)
print("Total Documents: {0}".format(total_docs))
print("Top 1: {0}".format((ranks == 1).sum() / total_docs))
print("Top 3: {0}".format((ranks <= 3).sum() / total_docs))
print("Top 5: {0}".format((ranks <= 5).sum() / total_docs))
print("Top 10: {0}".format((ranks <= 10).sum() / total_docs))
Next, build and evaluate a (pointwise) linear support vector machine.
from scipy.stats import rankdata
from sklearn.svm import LinearSVC

X = np.concatenate(Xs, 0)
y = np.concatenate(ys)

train_per = 0.8
train_cutoff = int(train_per * len(ranks)) * RERANK

train_X = X[:train_cutoff]
train_y = y[:train_cutoff]
test_X = X[train_cutoff:]
test_y = y[train_cutoff:]

model = LinearSVC()
model.fit(train_X, train_y)
preds = model._predict_proba_lr(test_X)

n_test = int(len(test_y) / RERANK)
new_ranks = []
for i in range(n_test):
    start = i * RERANK
    end = start + RERANK
    scores = preds[start:end, 1]
    score_ranks = rankdata(-scores)
    old_rank = np.argmax(test_y[start:end])
    new_rank = score_ranks[old_rank]
    new_ranks.append(new_rank)

new_ranks = np.array(new_ranks)
print("Total Documents: {0}".format(n_test))
print("Top 1: {0}".format((new_ranks == 1).sum() / n_test))
print("Top 3: {0}".format((new_ranks <= 3).sum() / n_test))
print("Top 5: {0}".format((new_ranks <= 5).sum() / n_test))
print("Top 10: {0}".format((new_ranks <= 10).sum() / n_test))
Now you can try out RankNet. First, assemble the training data so that each row consists of a relevant document vector concatenated with an irrelevant document vector (for a given query). Because 50 rows were returned in the feature extraction phase, each query will have 49 document pairs in the dataset.
Xs = []
for rank_file in rank_files:
    X = np.load(rank_file[:-suffix_len] + "_X.npy")
    if X.shape[0] != RERANK:
        print(rank_file[:-suffix_len])
        continue
    rank = np.load(rank_file)[0]
    pos_example = X[rank - 1]
    for (i, neg_example) in enumerate(X):
        if i == rank - 1:
            continue
        Xs.append(np.concatenate((pos_example, neg_example)))

X = np.stack(Xs)
dim = int(X.shape[1] / 2)

train_per = 0.8
train_cutoff = int(train_per * len(ranks)) * (RERANK - 1)

train_X = X[:train_cutoff]
test_X = X[train_cutoff:]
Build the model in Keras:
from keras import backend
from keras.callbacks import ModelCheckpoint
from keras.layers import Activation, Add, Dense, Input, Lambda
from keras.models import Model

y = np.ones((train_X.shape[0], 1))

INPUT_DIM = dim
h_1_dim = 64
h_2_dim = h_1_dim // 2
h_3_dim = h_2_dim // 2

# Model.
h_1 = Dense(h_1_dim, activation = "relu")
h_2 = Dense(h_2_dim, activation = "relu")
h_3 = Dense(h_3_dim, activation = "relu")
s = Dense(1)

# Relevant document score.
rel_doc = Input(shape = (INPUT_DIM, ), dtype = "float32")
h_1_rel = h_1(rel_doc)
h_2_rel = h_2(h_1_rel)
h_3_rel = h_3(h_2_rel)
rel_score = s(h_3_rel)

# Irrelevant document score.
irr_doc = Input(shape = (INPUT_DIM, ), dtype = "float32")
h_1_irr = h_1(irr_doc)
h_2_irr = h_2(h_1_irr)
h_3_irr = h_3(h_2_irr)
irr_score = s(h_3_irr)

# Subtract scores.
negated_irr_score = Lambda(lambda x: -1 * x, output_shape = (1, ))(irr_score)
diff = Add()([rel_score, negated_irr_score])

# Pass difference through sigmoid function.
prob = Activation("sigmoid")(diff)

# Build model.
model = Model(inputs = [rel_doc, irr_doc], outputs = prob)
model.compile(optimizer = "adagrad", loss = "binary_crossentropy")
Now train and test the model:
NUM_EPOCHS = 30
BATCH_SIZE = 32

checkpointer = ModelCheckpoint(filepath = "valid_params.h5", verbose = 1, save_best_only = True)
history = model.fit([train_X[:, :dim], train_X[:, dim:]], y,
                    epochs = NUM_EPOCHS, batch_size = BATCH_SIZE, validation_split = 0.05,
                    callbacks = [checkpointer], verbose = 2)

model.load_weights("valid_params.h5")
get_score = backend.function([rel_doc], [rel_score])

n_test = int(test_X.shape[0] / (RERANK - 1))
new_ranks = []
for i in range(n_test):
    start = i * (RERANK - 1)
    end = start + (RERANK - 1)
    pos_score = get_score([test_X[start, :dim].reshape(1, dim)])[0]
    neg_scores = get_score([test_X[start:end, dim:]])[0]
    scores = np.concatenate((pos_score, neg_scores))
    score_ranks = rankdata(-scores)
    new_rank = score_ranks[0]
    new_ranks.append(new_rank)

new_ranks = np.array(new_ranks)
print("Total Documents: {0}".format(n_test))
print("Top 1: {0}".format((new_ranks == 1).sum() / n_test))
print("Top 3: {0}".format((new_ranks <= 3).sum() / n_test))
print("Top 5: {0}".format((new_ranks <= 5).sum() / n_test))
print("Top 10: {0}".format((new_ranks <= 10).sum() / n_test))

# Compare to BM25.
old_ranks = ranks[-n_test:]
print("Total Documents: {0}".format(n_test))
print("Top 1: {0}".format((old_ranks == 1).sum() / n_test))
print("Top 3: {0}".format((old_ranks <= 3).sum() / n_test))
print("Top 5: {0}".format((old_ranks <= 5).sum() / n_test))
print("Top 10: {0}".format((old_ranks <= 10).sum() / n_test))
If the model's results are satisfactory, save the parameters to a JSON file to be pushed to Solr:
import json

weights = model.get_weights()
solr_model = json.load(open("my_efi_model.json"))
solr_model["class"] = "org.apache.solr.ltr.model.RankNet"
solr_model["params"]["weights"] = []
for i in range(len(weights) // 2):
    matrix = weights[2 * i].T
    bias = weights[2 * i + 1]
    bias = bias.reshape(bias.shape[0], 1)
    out_matrix = np.hstack((matrix, bias))
    np.savetxt("layer_{0}.csv".format(i), out_matrix, delimiter = ",")
    matrix_str = open("layer_{0}.csv".format(i)).read().strip()
    solr_model["params"]["weights"].append(matrix_str)

solr_model["params"]["nonlinearity"] = "relu"
with open("my_efi_model.json", "w") as out:
    json.dump(solr_model, out, indent = 4)
and push it the same as before (following a delete):
curl -XDELETE 'http://localhost:8983/solr/test/schema/model-store/my_efi_model'
curl -XPUT 'http://localhost:8983/solr/test/schema/model-store' --data-binary "@/path/to/my_efi_model.json" -H 'Content-type:application/json'
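If you want to confirm the new model was registered, you can fetch the model store directly (a quick check with Python's requests library; this is the same endpoint the curl commands above push to):

import requests

# List the models currently registered in the test core's model store.
response = requests.get("http://localhost:8983/solr/test/schema/model-store")
print(response.json())  # Should include a model named my_efi_model.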
And there you have it — a modern learning-to-rank setup in Apache Solr.
