https://opensource.com/article/17/11/learning-rank-apache-solr
Learn how to train a machine learning model to rank documents retrieved in the Solr enterprise search platform.
This tutorial describes how to implement a modern learning to rank (LTR, also called machine-learned ranking) system in Apache Solr.
It's intended for people who have zero Solr experience, but who are
comfortable with machine learning and information retrieval concepts. I
was one of those people only a couple of months ago, and I found it
extremely challenging to get up and running with the Solr materials I
found online. This is my attempt at writing the tutorial I wish I had
when I was getting started.
Setting up Solr
Firing up a vanilla Solr instance on Linux (Fedora, in my case) is actually pretty straightforward. On Fedora, first download the Solr source tarball (i.e., a file containing "src") and extract it to a reasonable location. Next, cd into the Solr directory:
cd /path/to/solr-<version>/solr
Building Solr requires Apache Ant and Apache Ivy, so install those:
sudo dnf install ant ivy
And now build Solr:
ant server
You can confirm Solr is working by running:
bin/solr start
and making sure you see the Solr Admin interface at http://localhost:8983/solr/. You can stop Solr (but don't stop it now) with:
bin/solr stop
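If you'd rather check from a script than from a browser, a minimal sketch like the following requests the Admin URL and reports whether anything answered. The helper name is_solr_up is hypothetical, and the check only confirms that an HTTP server responds at that address, not that Solr itself is healthy.

```python
import urllib.request


def is_solr_up(url="http://localhost:8983/solr/", timeout=5):
    """Return True if an HTTP server answers at the given URL (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib.error.URLError is a subclass of OSError).
        return False
```

Calling is_solr_up() right after bin/solr start should return True once Solr has finished starting, and False again after bin/solr stop.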
Solr basics
Solr is a search platform, so you only need to know how to do two things to function: index data and define a ranking model. Solr has a REST-like API, which means changes will be made with the curl command. To get going, create a core named test:
bin/solr create -c test
This seemingly simple command actually did a lot behind the scenes. Specifically, it defined a schema that tells Solr how documents should be processed (think tokenization, stemming, etc.) and searched (e.g., using the tf-idf vector space model), and it set up a configuration file that specifies what libraries and handlers Solr will use. A core can be deleted with:
bin/solr delete -c test
OK, let's add some documents. First download this XML file of tweets provided on the Solr in Action GitHub. Take a look inside the XML file. Notice how it's using an <add> tag to tell Solr to add several documents (denoted with <doc> tags) to the index. To index the tweets, run:
bin/post -c test /path/to/tweets.xml
If you go to http://localhost:8983/solr/ now (you might have to refresh) and click on the "Core Selector" dropdown on the left-hand side, you can select the test core. If you then click on the "Query" tab, the query interface will appear. If you click on the blue "Execute Query" button at the bottom, a JSON document containing information regarding the tweets that were just indexed will be displayed. Congratulations, you just ran your first successful query! Specifically, you used the /select RequestHandler to execute the query *:*. The *:* is a special syntax that tells Solr to return everything. The Solr query syntax is not very intuitive, in my opinion, so it's something you'll just have to get used to.
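The same *:* query can be issued outside the Admin UI. The sketch below only builds the /select URL and reads ids out of a decoded response; the helper names and the sample response are made up for illustration, and the response is trimmed to the overall shape Solr returns.

```python
from urllib.parse import urlencode


def select_url(core, q="*:*", rows=10, base="http://localhost:8983/solr"):
    """Build a /select URL; urlencode escapes characters such as '*' and ':'."""
    return "{0}/{1}/select?{2}".format(base, core, urlencode({"q": q, "rows": rows}))


def doc_ids(response):
    """Pull the document ids out of a decoded /select JSON response."""
    return [doc["id"] for doc in response["response"]["docs"]]


# A trimmed-down, made-up response (not real output from the tweets index).
sample = {"response": {"numFound": 2, "start": 0, "docs": [{"id": "1"}, {"id": "2"}]}}
print(select_url("test"))
print(doc_ids(sample))
```

Fetching the printed URL with curl or a browser returns the same JSON that the "Execute Query" button displays.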
Defining features
Now that you have a basic Solr instance up and running, define features for your LTR system. As in all machine learning problems, effective feature engineering is critical to success. Standard features in modern LTR models include using multiple similarity measures (e.g., cosine similarity of tf-idf vectors or BM25) to compare multiple text fields (e.g., body, title), in addition to other text characteristics (e.g., length) and document characteristics (e.g., age, PageRank). A good starting point is Microsoft Research's list of features for an academic data set. A list of some other commonly used features can be found on slide 32 of University of Massachusetts Amherst researcher Jiepu Jiang's lecture notes.
To start, modify /path/to/solr-<version>/solr/server/solr/test/conf/managed-schema so it includes the text fields that you'll need for your model. First, change the text field so that it is of the text_general type (which is already defined inside managed-schema). The text_general type will allow you to calculate BM25 similarities. Because the text field already exists (it was automatically created when you indexed the tweets), you need to use the replace-field command like so:
curl -X POST -H 'Content-type:application/json' --data-binary '{ "replace-field" : { "name":"text", "type":"text_general", "indexed":"true", "stored":"true", "multiValued":"true"} }' http://localhost:8983/solr/test/schema
I encourage you to take a look inside managed-schema following each change so you can get a sense for what's happening. Next, specify a text_tfidf type, which will allow you to calculate tf-idf cosine similarities:
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type" : {
    "name":"text_tfidf",
    "class":"solr.TextField",
    "positionIncrementGap":"100",
    "indexAnalyzer":{
      "tokenizer":{ "class":"solr.StandardTokenizerFactory" },
      "filters":[
        { "class":"solr.StopFilterFactory", "ignoreCase":"true", "words":"stopwords.txt" },
        { "class":"solr.LowerCaseFilterFactory" } ] },
    "queryAnalyzer":{
      "tokenizer":{ "class":"solr.StandardTokenizerFactory" },
      "filters":[
        { "class":"solr.StopFilterFactory", "ignoreCase":"true", "words":"stopwords.txt" },
        { "class":"solr.SynonymGraphFilterFactory", "ignoreCase":"true", "synonyms":"synonyms.txt" },
        { "class":"solr.LowerCaseFilterFactory" } ] },
    "similarity":{ "class":"solr.ClassicSimilarityFactory" } }
}' http://localhost:8983/solr/test/schema
Now add a text_tfidf field that will be of the text_tfidf type you just defined:
curl -X POST -H 'Content-type:application/json' --data-binary '{ "add-field" : { "name":"text_tfidf", "type":"text_tfidf", "indexed":"true", "stored":"false", "multiValued":"true"} }' http://localhost:8983/solr/test/schema
Because the contents of the text field and the text_tfidf field are the same (they're just being handled differently), tell Solr to copy the contents from text to text_tfidf:
curl -X POST -H 'Content-type:application/json' --data-binary '{ "add-copy-field" : { "source":"text", "dest":"text_tfidf"} }' http://localhost:8983/solr/test/schema
You're now ready to re-index your data:
bin/post -c test /path/to/tweets.xml
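For intuition about what a BM25 similarity rewards, here is a minimal sketch of the textbook formula over pre-tokenized documents. It is not Lucene's implementation (which differs in details such as how document lengths are stored), and the function name, the toy corpus, and the k1 and b defaults are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Textbook BM25 score of one tokenized document for a list of query terms."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    term_freqs = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        doc_freq = sum(1 for doc in corpus if term in doc)
        if doc_freq == 0:
            continue  # Term never occurs in the corpus, so it contributes nothing.
        idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        freq = term_freqs[term]
        # Longer-than-average documents are penalized through the b term.
        length_norm = k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * freq * (k1 + 1) / (freq + length_norm)
    return score


# Toy corpus, made up for this example.
corpus = [["historic", "north", "bridge"], ["north", "station"], ["a", "b", "c", "d"]]
for doc in corpus:
    print(doc, bm25_score(["north"], doc, corpus))
```

Both of the first two documents contain "north" once, but the shorter one scores higher, which is the length normalization at work.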
Learning to rank
Now that your documents are properly indexed, build an LTR model. If you're new to LTR, I recommend checking out Tie-Yan Liu's (long) paper and textbook. If you're familiar with machine learning, the ideas shouldn't be too difficult to grasp. I also recommend checking out the Solr documentation on LTR, which I'll be linking to throughout this section. Enabling LTR in Solr first requires making some changes to /path/to/solr-<version>/solr/server/solr/test/conf/solrconfig.xml. Copy and paste the following text anywhere between the <config> and </config> tags (at the top and bottom of the file, respectively):
<lib dir="${solr.install.dir:../../../..}/contrib/ltr/lib/" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-ltr-\d.*\.jar" />
<queryParser name="ltr" class="org.apache.solr.ltr.search.LTRQParserPlugin"/>
<cache name="QUERY_DOC_FV"
       class="solr.search.LRUCache"
       size="4096"
       initialSize="2048"
       autowarmCount="4096"
       regenerator="solr.search.NoOpRegenerator" />
<transformer name="features" class="org.apache.solr.ltr.response.transform.LTRFeatureLoggerTransformerFactory">
  <str name="fvCacheName">QUERY_DOC_FV</str>
</transformer>
You're now ready to run Solr with LTR enabled. First, stop Solr:
bin/solr stop
then restart it with the LTR plugin enabled:
bin/solr start -Dsolr.ltr.enabled=true
Next, push the model features and the model specification to Solr. In Solr, LTR features are defined using a JSON-formatted file. For this model, save the following features in my_efi_features.json:
[
  {
    "store" : "my_efi_feature_store",
    "name" : "tfidf_sim_a",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf=text_tfidf}${text_a}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "tfidf_sim_b",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf=text_tfidf}${text_b}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "bm25_sim_a",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf=text}${text_a}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "bm25_sim_b",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf=text}${text_b}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "max_sim",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf='text text_tfidf'}${text}" }
  }
]
Here, store tells Solr where to store the feature; name is the name of the feature; class specifies which Java class will handle the feature; and params provides additional information about the feature required by its Java class. In the case of a SolrFeature, you need to provide the query. {!dismax qf=text_tfidf}${text_a} tells Solr to search the text_tfidf field with the contents of text_a using the DisMaxQParser. The reason to use the DisMax parser instead of the seemingly more obvious FieldQParser (e.g., {!field f=text_tfidf}${text_a}) is that the FieldQParser automatically converts multi-term queries to "phrases" (i.e., it converts something like "the cat in the hat" into, effectively, "thecatinthehat," rather than "the," "cat," "in," "the," "hat"). This FieldQParser behavior (which seems like a rather strange default to me) ended up giving me quite a headache, but I eventually found a solution with DisMaxQParser.
{!dismax qf='text text_tfidf'}${text} tells Solr to search both the text and text_tfidf fields with the contents of text and then take the max of those two scores. While this feature doesn't really make sense in this context because similarities from both fields are already being used as features, it demonstrates how such a feature could be implemented. For example, imagine that the documents in your corpus are linked to, at most, five other sources of text data. It might make sense to incorporate that information during a search, and taking the max over multiple similarity scores is one way of doing that.
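The "take the max" behavior can be sketched in a few lines. This is a simplification, assuming one score per field, of how a disjunction-max combination works: the best field score wins, and an optional tie parameter (0.0 by default in the DisMax parser) lets the remaining fields contribute a little. The function name and the example scores are made up.

```python
def dismax_score(field_scores, tie=0.0):
    """Best per-field score, plus tie times the sum of the other field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Illustrative scores for the same document from two fields (not real Solr output).
print(dismax_score([1.5, 0.5]))           # Pure max.
print(dismax_score([1.5, 0.5], tie=0.1))  # Max plus a small share of the other field.
```

With tie=1.0 the combination degenerates into a plain sum of the field scores.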
To push the features to Solr, run the following command:
curl -XPUT 'http://localhost:8983/solr/test/schema/feature-store' --data-binary "@/path/to/my_efi_features.json" -H 'Content-type:application/json'
If you want to upload new features, you first have to delete the old features with:
curl -XDELETE 'http://localhost:8983/solr/test/schema/feature-store/my_efi_feature_store'
Next, save the following model specification in my_efi_model.json:
{
  "store" : "my_efi_feature_store",
  "name" : "my_efi_model",
  "class" : "org.apache.solr.ltr.model.LinearModel",
  "features" : [
    { "name" : "tfidf_sim_a" },
    { "name" : "tfidf_sim_b" },
    { "name" : "bm25_sim_a" },
    { "name" : "bm25_sim_b" },
    { "name" : "max_sim" }
  ],
  "params" : {
    "weights" : {
      "tfidf_sim_a" : 1.0,
      "tfidf_sim_b" : 1.0,
      "bm25_sim_a" : 1.0,
      "bm25_sim_b" : 1.0,
      "max_sim" : 0.5
    }
  }
}
In this case, store specifies where the features the model is using are stored; name is the name of the model; class specifies which Java class will implement the model; features is a list of the model features; and params provides additional information required by the model's Java class. Start by using the LinearModel, which simply takes a weighted sum of the feature values to generate a score. Obviously, the provided weights are arbitrary. To find better weights, you'll need to extract training data from Solr. I'll go over this topic in more depth in the RankNet section.
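The weighted sum is easy to reproduce by hand. The sketch below applies the weights from my_efi_model.json to the feature values that appear in the example response later in this section; linear_score is a hypothetical helper, not part of Solr.

```python
# Weights copied from my_efi_model.json.
weights = {"tfidf_sim_a": 1.0, "tfidf_sim_b": 1.0, "bm25_sim_a": 1.0,
           "bm25_sim_b": 1.0, "max_sim": 0.5}

# Feature values copied from the example LTR response shown later in this section.
features = {"tfidf_sim_a": 0.53751516, "tfidf_sim_b": 0.0, "bm25_sim_a": 0.84322417,
            "bm25_sim_b": 0.84322417, "max_sim": 1.6864483}


def linear_score(weights, features):
    """Weighted sum of feature values, which is what a linear model computes."""
    return sum(weights[name] * features[name] for name in weights)


score = linear_score(weights, features)
print(score)
```

The result agrees with the 3.0671878 score in that response up to floating-point rounding.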
You can push the model to Solr with:
curl -XPUT 'http://localhost:8983/solr/test/schema/model-store' --data-binary "@/path/to/my_efi_model.json" -H 'Content-type:application/json'
And now you're ready to run your first LTR query:
http://localhost:8983/solr/test/query?q=historic north&rq={!ltr model=my_efi_model efi.text_a=historic efi.text_b=north efi.text='historic north'}&fl=id,score,[features]
You should see something like:
{
  "responseHeader":{
    "status":0,
    "QTime":101,
    "params":{
      "q":"historic north",
      "fl":"id,score,[features]",
      "rq":"{!ltr model=my_efi_model efi.text_a=historic efi.text_b=north efi.text='historic north'}"}},
  "response":{"numFound":1,"start":0,"maxScore":3.0671878,"docs":[
    {
      "id":"1",
      "score":3.0671878,
      "[features]":"tfidf_sim_a=0.53751516,tfidf_sim_b=0.0,bm25_sim_a=0.84322417,bm25_sim_b=0.84322417,max_sim=1.6864483"}]
  }
}
Referring to the request, q=historic north is the query used to fetch the initial results (using BM25 in this case), which are then re-ranked with the LTR model. rq is where all the LTR parameters are provided, and efi stands for "external feature information," which allows you to specify features at query time. In this case, you're populating the text_a argument with the term historic, the text_b argument with the term north, and the text argument with the multi-term query 'historic north' (note that this is not being treated as a "phrase"). fl=id,score,[features] tells Solr to include the id, score, and model features in the results. You can verify that the feature values are correct by performing the associated search in the "Query" interface of the Solr Admin UI. For example, typing text_tfidf:historic in the q text box and typing score in the fl text box, and then clicking the "Execute Query" button should return a value of 0.53751516.
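When you move from the browser to scripts, it helps to let a library handle the escaping of the braces, spaces, and quotes in rq. The following is a minimal sketch, with hypothetical helper names, that builds the request URL and turns the [features] string into a dictionary. It quotes every efi value with single quotes and assumes the values contain no single quotes themselves.

```python
from urllib.parse import urlencode


def ltr_query_url(core, q, model, efi, fl="id,score,[features]",
                  base="http://localhost:8983/solr"):
    """Build an LTR re-ranking request; efi maps argument names to their values."""
    efi_part = " ".join("efi.{0}='{1}'".format(name, value) for name, value in efi.items())
    rq = "{{!ltr model={0} {1}}}".format(model, efi_part)
    return "{0}/{1}/query?{2}".format(base, core, urlencode({"q": q, "rq": rq, "fl": fl}))


def parse_features(feature_string):
    """Turn 'name=value,name=value' from the [features] field into a dict of floats."""
    pairs = (pair.split("=") for pair in feature_string.split(","))
    return {name: float(value) for name, value in pairs}


url = ltr_query_url("test", "historic north", "my_efi_model",
                    {"text_a": "historic", "text_b": "north", "text": "historic north"})
print(url)
print(parse_features("tfidf_sim_a=0.53751516,tfidf_sim_b=0.0,bm25_sim_a=0.84322417"))
```

The printed URL can be passed to curl or requests as-is.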
RankNet
For LTR systems, linear models are generally trained using what's called a "pointwise" approach, which is where documents are considered individually (i.e., the model asks, "Is this document relevant to the query or not?"); however, pointwise approaches are generally not well-suited for LTR problems. RankNet is a neural network that uses a "pairwise" approach, which is where documents with a known relative preference are considered in pairs (i.e., the model asks, "Is document A more relevant than document B for the query or not?"). RankNet is not supported by Solr out of the box, but I've implemented RankNet in Solr and Keras. It's worth noting that LambdaMART might be more appropriate for your search application. However, RankNet can be trained quickly on a GPU using my Keras implementation, which makes it a good solution for search problems where only one document is relevant to any given query. For a nice (technical) overview of RankNet, LambdaRank, and LambdaMART, see Chris Burges' paper written when he was at Microsoft Research.
To enable RankNet in Solr, you'll have to add RankNet.java to /path/to/solr-
ant server
Now if you inspect /path/to/solr-
Unfortunately, the suggested method of feature extraction in Solr is painfully slow (other Solr users seem to agree it could be faster). Even when making requests in parallel, it took me almost three days to extract features for ~200,000 queries. I think a better approach might be to index the queries and then calculate the similarities between the "documents" (which consist of the true documents and queries), but this is really something that should be baked into Solr. Anyway, here is some example Python code for extracting features from Solr using queries:
import numpy as np
import requests
import simplejson

# Number of documents to be re-ranked.
RERANK = 50
with open("RERANK.int", "w") as f:
    f.write(str(RERANK))


# Hypothetical wrapper: row holds one query, and relevant maps a document id
# to the ids of the queries that document is relevant to.
def extract_features(row, relevant):
    # Build query URL.
    q_id = row["id"]
    q_field_a = row["field_a"].strip().lower()
    q_field_b = row["field_b"].strip().lower()
    q_field_c = row["field_c"].strip().lower()
    q_field_d = row["field_d"].strip().lower()
    all_text = " ".join([q_field_a, q_field_b, q_field_c, q_field_d])

    url = "http://localhost:8983/solr/test/query"
    # We only re-rank one document when extracting features because we want to be
    # able to compare the LTR model to the BM25 ranking. Setting reRankDocs=1
    # ensures the original ranking is maintained.
    url += "?q={0}&rq={{!ltr model=my_efi_model reRankDocs=1 ".format(all_text)
    url += "efi.field_a='{0}' efi.field_b='{1}' efi.field_c='{2}' efi.field_d='{3}' ".format(
        q_field_a, q_field_b, q_field_c, q_field_d)
    url += "efi.all_text='{0}'}}&fl=id,score,[features]&rows={1}".format(all_text, RERANK)

    # Get response and check for errors.
    response = requests.request("GET", url)
    try:
        data = simplejson.loads(response.text)
    except simplejson.JSONDecodeError:
        print(q_id)
        return
    if "error" in data:
        print(q_id)
        return

    # Extract the features.
    results_features = []
    results_targets = []
    results_ranks = []
    add_data = False
    for (rank, document) in enumerate(data["response"]["docs"]):
        features = document["[features]"].split(",")
        feature_array = []
        for feature in features:
            feature_array.append(feature.split("=")[1])
        feature_array = np.array(feature_array, dtype = "float32")
        results_features.append(feature_array)
        doc_id = document["id"]
        # Check if document is relevant to query.
        if q_id in relevant.get(doc_id, {}):
            results_ranks.append(rank + 1)
            results_targets.append(1)
            add_data = True
        else:
            results_targets.append(0)

    # Only keep queries whose relevant document was actually retrieved.
    if add_data:
        np.save("{0}_X.npy".format(q_id), np.array(results_features))
        np.save("{0}_y.npy".format(q_id), np.array(results_targets))
        np.save("{0}_rank.npy".format(q_id), np.array(results_ranks))
Now you're ready to train some models. To start, pull in the data and evaluate the BM25 rankings on the entire data set.
import glob
import numpy as np

rank_files = glob.glob("*_rank.npy")
suffix_len = len("_rank.npy")
with open("RERANK.int") as f:
    RERANK = int(f.read())

ranks = []
casenumbers = []
Xs = []
ys = []
for rank_file in rank_files:
    X = np.load(rank_file[:-suffix_len] + "_X.npy")
    # Skip queries that returned fewer than RERANK documents.
    if X.shape[0] != RERANK:
        print(rank_file[:-suffix_len])
        continue
    casenumbers.append(rank_file[:-suffix_len])
    rank = np.load(rank_file)[0]
    ranks.append(rank)
    y = np.load(rank_file[:-suffix_len] + "_y.npy")
    Xs.append(X)
    ys.append(y)

ranks = np.array(ranks)
total_docs = len(ranks)
print("Total Documents: {0}".format(total_docs))
print("Top 1: {0}".format((ranks == 1).sum() / total_docs))
print("Top 3: {0}".format((ranks <= 3).sum() / total_docs))
print("Top 5: {0}".format((ranks <= 5).sum() / total_docs))
print("Top 10: {0}".format((ranks <= 10).sum() / total_docs))
Next, build and evaluate a (pointwise) linear support vector machine.
from scipy.stats import rankdata
from sklearn.svm import LinearSVC

X = np.concatenate(Xs, 0)
y = np.concatenate(ys)

# Split by query: each query contributes RERANK consecutive rows.
train_per = 0.8
train_cutoff = int(train_per * len(ranks)) * RERANK
train_X = X[:train_cutoff]
train_y = y[:train_cutoff]
test_X = X[train_cutoff:]
test_y = y[train_cutoff:]

model = LinearSVC()
model.fit(train_X, train_y)
# Signed distance to the separating hyperplane; larger means "more relevant".
scores_all = model.decision_function(test_X)

n_test = int(len(test_y) / RERANK)
new_ranks = []
for i in range(n_test):
    start = i * RERANK
    end = start + RERANK
    scores = scores_all[start:end]
    score_ranks = rankdata(-scores)
    # Position of the relevant document in the original (BM25) ordering.
    old_rank = np.argmax(test_y[start:end])
    new_rank = score_ranks[old_rank]
    new_ranks.append(new_rank)

new_ranks = np.array(new_ranks)
print("Total Documents: {0}".format(n_test))
print("Top 1: {0}".format((new_ranks == 1).sum() / n_test))
print("Top 3: {0}".format((new_ranks <= 3).sum() / n_test))
print("Top 5: {0}".format((new_ranks <= 5).sum() / n_test))
print("Top 10: {0}".format((new_ranks <= 10).sum() / n_test))
Now you can try out RankNet. First, assemble the training data so that each row consists of a relevant document vector concatenated with an irrelevant document vector (for a given query). Because 50 rows were returned in the feature extraction phase, each query will have 49 document pairs in the dataset.
Xs = []
for rank_file in rank_files:
    X = np.load(rank_file[:-suffix_len] + "_X.npy")
    if X.shape[0] != RERANK:
        print(rank_file[:-suffix_len])
        continue
    rank = np.load(rank_file)[0]
    pos_example = X[rank - 1]
    # Pair the relevant document with every other retrieved document.
    for (i, neg_example) in enumerate(X):
        if i == rank - 1:
            continue
        Xs.append(np.concatenate((pos_example, neg_example)))

X = np.stack(Xs)
dim = int(X.shape[1] / 2)

train_per = 0.8
train_cutoff = int(train_per * len(ranks)) * (RERANK - 1)
train_X = X[:train_cutoff]
test_X = X[train_cutoff:]
Build the model in Keras:
from keras import backend
from keras.callbacks import ModelCheckpoint
from keras.layers import Activation, Add, Dense, Input, Lambda
from keras.models import Model

# Every pair is ordered (relevant, irrelevant), so every target is 1.
y = np.ones((train_X.shape[0], 1))

INPUT_DIM = dim
h_1_dim = 64
h_2_dim = h_1_dim // 2
h_3_dim = h_2_dim // 2

# Model. The same layers score both documents, so the weights are shared.
h_1 = Dense(h_1_dim, activation = "relu")
h_2 = Dense(h_2_dim, activation = "relu")
h_3 = Dense(h_3_dim, activation = "relu")
s = Dense(1)

# Relevant document score.
rel_doc = Input(shape = (INPUT_DIM, ), dtype = "float32")
h_1_rel = h_1(rel_doc)
h_2_rel = h_2(h_1_rel)
h_3_rel = h_3(h_2_rel)
rel_score = s(h_3_rel)

# Irrelevant document score.
irr_doc = Input(shape = (INPUT_DIM, ), dtype = "float32")
h_1_irr = h_1(irr_doc)
h_2_irr = h_2(h_1_irr)
h_3_irr = h_3(h_2_irr)
irr_score = s(h_3_irr)

# Subtract scores.
negated_irr_score = Lambda(lambda x: -1 * x, output_shape = (1, ))(irr_score)
diff = Add()([rel_score, negated_irr_score])

# Pass difference through sigmoid function.
prob = Activation("sigmoid")(diff)

# Build model.
model = Model(inputs = [rel_doc, irr_doc], outputs = prob)
model.compile(optimizer = "adagrad", loss = "binary_crossentropy")
Now train and test the model:
NUM_EPOCHS = 30
BATCH_SIZE = 32
checkpointer = ModelCheckpoint(filepath = "valid_params.h5", verbose = 1, save_best_only = True)
history = model.fit([train_X[:, :dim], train_X[:, dim:]], y,
                    epochs = NUM_EPOCHS, batch_size = BATCH_SIZE, validation_split = 0.05,
                    callbacks = [checkpointer], verbose = 2)

model.load_weights("valid_params.h5")
# Score a single document with the shared layers.
get_score = backend.function([rel_doc], [rel_score])
n_test = int(test_X.shape[0] / (RERANK - 1))
new_ranks = []
for i in range(n_test):
    start = i * (RERANK - 1)
    end = start + (RERANK - 1)
    pos_score = get_score([test_X[start, :dim].reshape(1, dim)])[0]
    neg_scores = get_score([test_X[start:end, dim:]])[0]

    scores = np.concatenate((pos_score, neg_scores))
    score_ranks = rankdata(-scores)
    # The relevant document is first in scores, so its rank is at index 0.
    new_rank = score_ranks[0]
    new_ranks.append(new_rank)

new_ranks = np.array(new_ranks)
print("Total Documents: {0}".format(n_test))
print("Top 1: {0}".format((new_ranks == 1).sum() / n_test))
print("Top 3: {0}".format((new_ranks <= 3).sum() / n_test))
print("Top 5: {0}".format((new_ranks <= 5).sum() / n_test))
print("Top 10: {0}".format((new_ranks <= 10).sum() / n_test))

# Compare to BM25.
old_ranks = ranks[-n_test:]
print("Total Documents: {0}".format(n_test))
print("Top 1: {0}".format((old_ranks == 1).sum() / n_test))
print("Top 3: {0}".format((old_ranks <= 3).sum() / n_test))
print("Top 5: {0}".format((old_ranks <= 5).sum() / n_test))
print("Top 10: {0}".format((old_ranks <= 10).sum() / n_test))
If the model's results are satisfactory, save the parameters to a JSON file to be pushed to Solr:
import json

weights = model.get_weights()
with open("my_efi_model.json") as f:
    solr_model = json.load(f)
solr_model["class"] = "org.apache.solr.ltr.model.RankNet"
solr_model["params"]["weights"] = []
# get_weights() alternates kernel and bias arrays, one pair per Dense layer.
for i in range(len(weights) // 2):
    matrix = weights[2 * i].T
    bias = weights[2 * i + 1]
    bias = bias.reshape(bias.shape[0], 1)
    # One row per output unit, with the bias appended as the last column.
    out_matrix = np.hstack((matrix, bias))
    np.savetxt("layer_{0}.csv".format(i), out_matrix, delimiter = ",")
    with open("layer_{0}.csv".format(i)) as f:
        matrix_str = f.read().strip()
    solr_model["params"]["weights"].append(matrix_str)

solr_model["params"]["nonlinearity"] = "relu"
with open("my_efi_model.json", "w") as out:
    json.dump(solr_model, out, indent = 4)
and push it the same as before (following a delete):
curl -XDELETE 'http://localhost:8983/solr/test/schema/model-store/my_efi_model'
curl -XPUT 'http://localhost:8983/solr/test/schema/model-store' --data-binary "@/path/to/my_efi_model.json" -H 'Content-type:application/json'
And there you have it: a modern learning-to-rank setup in Apache Solr.
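As a sanity check on the exported weights, you can recompute a document's score outside of both Keras and Solr. The sketch below assumes the layout written by the export step (one CSV row per output unit, with the bias as the last column) and mirrors the Keras model: ReLU on the hidden layers and a linear output. How RankNet.java consumes these strings isn't shown here, so treat the function names and the tiny hand-written weights as illustrative only.

```python
import math


def parse_layer(matrix_str):
    """Parse one exported layer: a row per output unit, the last column is the bias."""
    return [[float(v) for v in row.split(",")] for row in matrix_str.strip().splitlines()]


def ranknet_score(layers, features):
    """Feed a feature vector through the layers; ReLU everywhere but the last layer."""
    activations = list(features)
    for i, layer in enumerate(layers):
        outputs = [sum(w * a for w, a in zip(row[:-1], activations)) + row[-1]
                   for row in layer]
        if i < len(layers) - 1:
            outputs = [max(0.0, o) for o in outputs]
        activations = outputs
    return activations[0]


def preference_probability(score_a, score_b):
    """Pairwise output: sigmoid of the score difference between two documents."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


# Tiny hand-written weights (two inputs, two hidden units, one output).
layers = [parse_layer("1.0,-1.0,0.0\n0.5,0.5,-1.0"), parse_layer("2.0,1.0,0.5")]
score = ranknet_score(layers, [3.0, 1.0])
print(score)
print(preference_probability(score, 0.0))
```

Only the single-document score is needed at query time; the sigmoid of the score difference matters during training.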