How to use gensim BM25 ranking in Python
I found that gensim has a BM25 ranking function, but I cannot find a tutorial on how to use it.
In my case, I have one query and a few documents that were retrieved from a search engine. How do I use gensim's BM25 ranking to compare the query with the documents and find the most similar one?
I am new to gensim. Thanks.
query:
"experimental studies of creep buckling ."
document 1:
" the 7 x 7 in . hypersonic wind tunnel at rae farnborough, part 1, design, instrumentation and flow visualization techniques . this is the first of three parts of the calibration report on the r.a.e. some details of the design and lay-out of the plant are given, together with the calculated performance figures, and the major components of the facility are briefly described . the instrumentation provided for the wind-tunnel is described in some detail, including the optical and other methods of flow visualization used in the tunnel . later parts will describe the calibration of the flow in the working-section, including temperature measurements . a discussion of the heater performance will also be included as well as the results of tests to determine starting and running pressure ratios, blockage effects, model starting loads, and humidity of the air flow ."
document 2:
" the 7 in. x 7 in. hypersonic wind tunnel at r.a.e. farnborough part ii. heater performance . tests on the storage heater, which is cylindrical in form and mounted horizontally, show that its performance is adequate for operation at m=6.8 and probably adequate for flows at m=8.2 with the existing nozzles . in its present state, the maximum design temperature of 680 degrees centigrade for operation at m=9 cannot be realised in the tunnel because of heat loss to the outlet attachments of the heater and quick-acting valve which form, in effect, a large heat sink . because of this heat loss there is rather poor response of stagnation temperature in the working section at the start of a run . it is hoped to cure this by preheating the heater outlet cone and the quick-acting valve . at pressures greater than about 100 p.s.i.g. free convection through the fibrous thermal insulation surrounding the heated core causes the top of the heater shell to become somewhat hotter than the bottom, which results in /hogging/ distortion of the shell . this free convection cools the heater core and a vertical temperature gradient is set up across it after only a few minutes at high pressure . modifications to be incorporated in the heater to improve its performance are described ."
document 3:
" supersonic flow at the surface of a circular cone at angle of attack . formulas for the inviscid flow properties on the surface of a cone at angle of attack are derived for use in conjunction with the m.i.t. cone tables . these formulas are based upon an entropy distribution on the cone surface which is uniform and equal to that of the shocked fluid in the windward meridian plane . they predict values for the flow variables which may differ significantly from the corresponding values obtained directly from the cone tables . the differences in the magnitudes of the flow variables computed by the two methods tend to increase with increasing free-stream mach number, cone angle and angle of attack ."
document 4:
" theory of aircraft structural models subjected to aerodynamic heating and external loads . the problem of investigating the simultaneous effects of transient aerodynamic heating and external loads on aircraft structures for the purpose of determining the ability of the structure to withstand flight to supersonic speeds is studied . by dimensional analyses it is shown that .. constructed of the same materials as the aircraft will be thermally similar to the aircraft with respect to the flow of heat through the structure will be similar to those of the aircraft when the structural model is constructed at the same temperature as the aircraft . external loads will be similar to those of the aircraft . subjected to heating and cooling that correctly simulate the aerodynamic heating of the aircraft, except with respect to angular velocities and angular accelerations, without requiring determination of the heat flux at each point on the surface and its variation with time . acting on the aerodynamically heated structural model to those acting on the aircraft is determined for the case of zero angular velocity and zero angular acceleration, so that the structural model may be subjected to the external loads required for simultaneous simulation of stresses and deformations due to external loads ."
Solution 1:[1]
Full disclosure: I don't have any experience using the BM25 ranking. However, I do have quite a bit of experience with gensim's TF-IDF and LSI distributed models, along with gensim's similarity index.
The author does a really good job of keeping the codebase readable, so if you ever have trouble with anything like this again, I recommend just jumping into the source code.
Looking at the source code: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/summarization/bm25.py
So I initialized a BM25() object with the documents you pasted above.
It looks like our good old friend Radim didn't include a function to calculate average_idf for us. No biggie, we can just borrow line 65 for our cause:
average_idf = sum(map(lambda k: float(bm25.idf[k]), bm25.idf.keys())) / len(bm25.idf.keys())
Then, if I understand the original intention of get_scores correctly, you should get each BM25 score with respect to your original query simply by doing
scores = bm25.get_scores(query_doc, average_idf)
This returns the scores for every document. Then, if I understand the BM25 ranking based on what I read on this Wikipedia page: https://en.wikipedia.org/wiki/Okapi_BM25
you should be able to just pick the document with the highest score as follows:
best_result = docs[scores.index(max(scores))]
So the first document should be the most relevant to your query? I hope that's what you were expecting anyways, and I hope that this helped in some capacity. Good luck!
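To make the scoring step above more concrete, here is a minimal pure-Python sketch of the Okapi BM25 formula from the Wikipedia page linked above. It does not use gensim and is not gensim's implementation (the `gensim.summarization` module was removed in gensim 4.0). The function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the shortened toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            # Assumption: the "+ 1" IDF variant, which never goes negative.
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores


# Toy stand-ins for the question's documents (hypothetical, shortened).
docs = [
    "hypersonic wind tunnel heater performance".split(),
    "experimental studies of creep buckling in columns".split(),
    "supersonic flow over a circular cone".split(),
]
query = "experimental studies of creep buckling".split()

scores = bm25_scores(query, docs)
best_index = max(range(len(scores)), key=lambda i: scores[i])
print(best_index, scores)
```

Documents that share no term with the query get a score of exactly 0, so a score list of all zeros means nothing matched rather than that the first document is the best one.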
Solution 2:[2]
Since @mkerrig's answer is now outdated (2020), here is a way to use BM25 with gensim 3.8.3, assuming you have a list docs of documents. This code returns the indices of the best 10 matching documents.
from gensim import corpora
from gensim.summarization import bm25
texts = [doc.split() for doc in docs]  # optionally preprocess here, e.g. remove stopwords
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
bm25_obj = bm25.BM25(corpus)
query_doc = dictionary.doc2bow(query.split())
scores = bm25_obj.get_scores(query_doc)
best_docs = sorted(range(len(scores)), key=lambda i: scores[i])[-10:]
Notice that you do not need the average_idf parameter any more.
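One detail of the snippet above: `sorted(...)[-10:]` returns the indices in ascending score order, so the best match comes last. If best-first order is more convenient, a small helper like the following could be used. This is only a sketch; `top_k_indices` and `example_scores` are hypothetical names, and the scores are made-up numbers rather than real BM25 output.

```python
import heapq


def top_k_indices(scores, k=10):
    """Return the indices of the k highest scores, best match first."""
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])


# Made-up scores standing in for the output of get_scores.
example_scores = [0.2, 1.7, 0.0, 0.9]
ranked = top_k_indices(example_scores, k=3)
print(ranked)  # [1, 3, 0]
```

`heapq.nlargest` avoids sorting the whole list when only a few results are needed, which mostly matters for large corpora.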
Solution 3:[3]
I acknowledge that the answer above is correct. However, I'll go ahead and add my two bits for other community members who land here. :)
The following 4 links are quite useful and cover the question comprehensively.
https://github.com/nhirakawa/BM25 A Python implementation of the BM25 ranking function. Extremely easy to use, I used it for my project too. Works great! I think, this is the system that will work for your problem.
https://sajalsharma.com/portfolio/cross_language_information_retrieval Shows very detailed and step by step use of Okapi BM25 in a system that can be used for drawing references for current system design tasks.
http://lixinzhang.github.io/implementation-of-okapi-bm25-on-python.html Further code for only Okapi BM25.
https://github.com/thunlp/EntityDuetNeuralRanking Entity-Duet Neural Ranking Model. Great for research and academic work.
Peace!
Addition: https://github.com/airalcorn2/RankNet RankNet and LambdaRank
Solution 4:[4]
The answer given by @fonfonx would work, but it is not the natural way to use BM25.
The BM25 constructor requires a List[List[str]], which means it expects a tokenized corpus.
I feel a better example would look like this:
from gensim.summarization.bm25 import BM25
corpus = ["The little fox ran home",
"dogs are the best ",
"Yet another doc ",
"I see a little fox with another small fox",
"last doc without animals"]
def simple_tok(sent: str):
    return sent.split()
tok_corpus = [simple_tok(s) for s in corpus]
bm25 = BM25(tok_corpus)
query = simple_tok("a little fox")
scores = bm25.get_scores(query)
best_docs = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]
for i, b in enumerate(best_docs):
    print(f"rank {i+1}: {corpus[b]}")
Output:
>> rank 1: I see a little fox with another small fox
>> rank 2: The little fox ran home
>> rank 3: dogs are the best
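Note that `simple_tok` splits on whitespace only, so "The" and "the" count as different terms and punctuation stays attached to words. A slightly more forgiving tokenizer could look like the sketch below. The name `normalizing_tok` and the tiny `STOPWORDS` set are hypothetical and chosen only for illustration; a real project would likely use a fuller stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "of", "are", "i"}


def normalizing_tok(sent: str):
    """Lowercase the text, keep alphanumeric runs, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", sent.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


tokens = normalizing_tok("The little fox ran home.")
print(tokens)  # ['little', 'fox', 'ran', 'home']
```

Whichever tokenizer is used, the same one should be applied to both the corpus and the query, otherwise the terms will not line up.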
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | mkerrig |
Solution 2 | fonfonx |
Solution 3 | Varpie |
Solution 4 |