On IndexSearcher, the core class for search.
IndexSearcher is the core implementation class of Lucene's search machinery. It extends the abstract class Searcher, which implements a number of the core methods used for search. Searcher in turn implements the Searchable interface, which serves as an abstract network protocol for search: on top of this protocol, an index directory on a remote server can be accessed. This can be seen from the fact that the Searchable interface extends java.rmi.Remote.
The java.rmi.Remote interface is described in the JDK documentation, as follows:
In other words, an interface that extends java.rmi.Remote has these properties:
1. A remote interface serves to identify interfaces whose methods may be invoked from a non-local virtual machine;
2. an interface that extends java.rmi.Remote is available remotely;
3. a class that implements a sub-interface of java.rmi.Remote can be managed as a remote object.
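To make these properties concrete, here is a minimal sketch of a remote search interface in the spirit of Searchable. The name NetworkSearchable and its methods are illustrative only (not Lucene's API); the java.rmi.Remote and RemoteException conventions are standard JDK:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

// A hypothetical remote search interface in the spirit of Searchable.
// Extending java.rmi.Remote marks its methods as callable from a
// non-local virtual machine; every remote method must declare
// RemoteException so that network failures surface to the caller.
interface NetworkSearchable extends Remote {
    int docFreq(String term) throws RemoteException;
    void close() throws RemoteException;
}
```

Any object exported through such an interface is remotely invocable; the interface itself carries no implementation, only the remote contract.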
Below is an overview of the search-related interfaces and abstract classes; it will help when we later study their concrete implementation classes:
The Searchable interface
The Searchable interface is defined as follows:
package org.apache.lucene.search;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.CorruptIndexException;
import java.io.IOException;
public interface Searchable extends java.rmi.Remote {
/* The core search method, taking a Weight and a Filter. Because the return
   type is void, the matching documents are gathered in the HitCollector,
   which collects every document whose score is greater than 0. */
void search(Weight weight, Filter filter, HitCollector results)
throws IOException;
// Frees the resources associated with this searcher
void close() throws IOException;
// Returns the number of documents containing the given term
int docFreq(Term term) throws IOException;
// Returns an array holding the document frequency of each term in the given array
int[] docFreqs(Term[] terms) throws IOException;
// Returns one greater than the largest possible document number
int maxDoc() throws IOException;
// Search method returning the top n scoring documents
TopDocs search(Weight weight, Filter filter, int n) throws IOException;
// Returns the document numbered i (note: this is the internal number; running System.out.println(searcher.doc(24)); in the earlier test program prints Document<stored/uncompressed,indexed<path:E:\Lucene\txt1\mytxt\FAQ.txt> stored/uncompressed,indexed<modified:200604130754>>)
Document doc(int i) throws CorruptIndexException, IOException;
// Returns document n, loading only the fields accepted by the given FieldSelector; FieldSelector acts like a field filter, with the single method FieldSelectorResult accept(String fieldName)
Document doc(int n, FieldSelector fieldSelector) throws CorruptIndexException, IOException;
// Rewrites the given Query into its primitive form
Query rewrite(Query query) throws IOException;
// Returns an Explanation describing how the score of the given document was computed
Explanation explain(Weight weight, int doc) throws IOException;
// Returns the top n documents under the given sort criterion
TopFieldDocs search(Weight weight, Filter filter, int n, Sort sort)
throws IOException;
}
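The contracts of docFreq() and maxDoc() can be illustrated with a toy, stdlib-only stand-in (no Lucene types involved): a tiny inverted index where docFreq(term) is the size of the term's posting list, and maxDoc() is one greater than the largest document number ever used. The class and its names are purely illustrative:

```java
import java.util.*;

// A toy inverted index illustrating docFreq() and maxDoc() semantics.
class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int maxDocNumber = -1;

    void add(int doc, String... terms) {
        maxDocNumber = Math.max(maxDocNumber, doc);
        for (String t : terms)
            postings.computeIfAbsent(t, k -> new TreeSet<>()).add(doc);
    }

    // Number of documents that contain the term
    int docFreq(String term) {
        return postings.getOrDefault(term, Collections.emptySet()).size();
    }

    // One greater than the largest document number used
    int maxDoc() {
        return maxDocNumber + 1;
    }
}
```

With documents 0 and 1 containing "lucene" and document 5 containing "index", docFreq("lucene") is 2 and maxDoc() is 6 — note that maxDoc() is an upper bound on document numbers, not a count of live documents.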
The Searcher abstract class
package org.apache.lucene.search;
import java.io.IOException;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.Term;
import org.apache.lucene.document.Document;
// This abstract class implements the Searchable interface
public abstract class Searcher implements Searchable {
// Finds the documents matching the given Query and returns them as a Hits instance
public final Hits search(Query query) throws IOException {
    return search(query, (Filter)null);   // delegates to the search() method below
}
public Hits search(Query query, Filter filter) throws IOException {
    return new Hits(this, query, filter);
}
// With a Sort specified
public Hits search(Query query, Sort sort)
    throws IOException {
    return new Hits(this, query, null, sort);
}
// With both a Filter and a Sort
public Hits search(Query query, Filter filter, Sort sort)
    throws IOException {
    return new Hits(this, query, filter, sort);
}
// Implements the Searchable method: returns the top n documents under the given sort criterion
public TopFieldDocs search(Query query, Filter filter, int n,
                           Sort sort) throws IOException {
    return search(createWeight(query), filter, n, sort);
    // delegates to abstract public TopFieldDocs search(Weight weight, Filter filter, int n, Sort sort) throws IOException;
}
public void search(Query query, HitCollector results)
    throws IOException {
    search(query, (Filter)null, results);
}
public void search(Query query, Filter filter, HitCollector results)
    throws IOException {
    search(createWeight(query), filter, results);
}
public TopDocs search(Query query, Filter filter, int n)
    throws IOException {
    return search(createWeight(query), filter, n);
}
public Explanation explain(Query query, int doc) throws IOException {
    return explain(createWeight(query), doc);
}
// Sets the Similarity used by this Searcher
public void setSimilarity(Similarity similarity) {
    this.similarity = similarity;
}
public Similarity getSimilarity() {
    return this.similarity;
}
// Creates a Weight that records the searcher-dependent state of the given Query
protected Weight createWeight(Query query) throws IOException {
    return query.weight(this);
}
// Implements the docFreqs() method of the Searchable interface
public int[] docFreqs(Term[] terms) throws IOException {
    int[] result = new int[terms.length];
    for (int i = 0; i < terms.length; i++) {
      result[i] = docFreq(terms[i]);
    }
    return result;
}
// Abstract methods, already listed in the Searchable interface
abstract public void search(Weight weight, Filter filter, HitCollector results) throws IOException;
abstract public void close() throws IOException;
abstract public int docFreq(Term term) throws IOException;
abstract public int maxDoc() throws IOException;
abstract public TopDocs search(Weight weight, Filter filter, int n) throws IOException;
abstract public Document doc(int i) throws CorruptIndexException, IOException;
abstract public Query rewrite(Query query) throws IOException;
abstract public Explanation explain(Weight weight, int doc) throws IOException;
abstract public TopFieldDocs search(Weight weight, Filter filter, int n, Sort sort) throws IOException;
}
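Searcher is a textbook template-method class: the convenience overloads all funnel into the abstract Weight-based methods, and docFreqs() is written once on top of the abstract docFreq(). A stdlib-only sketch of that pattern (the class names here are illustrative, not Lucene's):

```java
// Template-method sketch: the base class implements the array variant
// once in terms of the abstract single-term primitive, exactly as
// Searcher.docFreqs() delegates to docFreq().
abstract class FreqSource {
    abstract int docFreq(String term);

    final int[] docFreqs(String[] terms) {
        int[] result = new int[terms.length];
        for (int i = 0; i < terms.length; i++)
            result[i] = docFreq(terms[i]);
        return result;
    }
}

// A concrete source backed by a fixed map, standing in for a real index.
class FixedFreqSource extends FreqSource {
    private final java.util.Map<String, Integer> freqs;
    FixedFreqSource(java.util.Map<String, Integer> freqs) { this.freqs = freqs; }
    int docFreq(String term) { return freqs.getOrDefault(term, 0); }
}
```

Subclasses such as IndexSearcher only need to supply the primitives; every derived operation comes for free from the base class.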
The Weight interface
The purpose of creating a Weight is to ensure that an already-constructed Query instance is not modified during search, so that the same Query can be reused instead of being re-created.
A Query instance is independent of any particular IndexSearcher; the searcher-dependent state of the Query should be recorded in a Weight.
The source code of the Weight interface is shown below:
package org.apache.lucene.search;
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
public interface Weight extends java.io.Serializable {
// Returns the Query this Weight is associated with
Query getQuery();
// Returns the weight value for the Query
float getValue();
/** The sum of squared weights of contained query clauses. */
float sumOfSquaredWeights() throws IOException;
// Assigns the query normalization factor to this Weight
void normalize(float norm);
// Constructs a Scorer for this Weight (a Scorer computes document scores)
Scorer scorer(IndexReader reader) throws IOException;
// Computes the score of the given document; the returned Explanation records how the score was obtained
Explanation explain(IndexReader reader, int doc) throws IOException;
}
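The interplay of sumOfSquaredWeights(), queryNorm, and normalize() can be sketched with a toy single-term weight, modeled loosely on how Lucene's TermQuery weight behaves (the class below is an illustration, not Lucene code): the searcher asks each clause for its squared weight, derives one queryNorm for the whole query, and passes it back through normalize().

```java
// Toy single-term weight, modeled loosely on TermQuery's Weight:
// the raw clause weight is idf * boost; the query-wide norm is derived
// from the sum of squared clause weights, then folded back in.
class ToyTermWeight {
    final float idf;
    final float boost;
    float queryWeight;   // set by normalize()

    ToyTermWeight(float idf, float boost) { this.idf = idf; this.boost = boost; }

    float sumOfSquaredWeights() {
        float w = idf * boost;
        return w * w;
    }

    void normalize(float queryNorm) {
        queryWeight = idf * boost * queryNorm;
    }

    // The default query normalization: 1 / sqrt(sum of squared weights)
    static float queryNorm(float sumOfSquaredWeights) {
        return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
    }
}
```

For a query with a single clause, normalize(queryNorm(sumOfSquaredWeights())) drives the normalized weight to 1.0; with several clauses, the same factor scales them all, keeping their relative contributions intact.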
The HitCollector abstract class
package org.apache.lucene.search;
// An abstract class used to collect the documents found by a search
public abstract class HitCollector {
// Called for each matching document with its number and score; implementations select the documents that meet their criteria
public abstract void collect(int doc, float score);
}
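collect() is a pure callback: the scorer pushes (doc, score) pairs at the collector, and the collector alone decides what to keep. A self-contained sketch of a top-N collector in that style (plain Java, not Lucene's TopDocCollector):

```java
import java.util.PriorityQueue;

// A HitCollector-style callback that keeps only the n highest-scoring hits.
class TopNCollector {
    private final int n;
    // min-heap ordered by score, so the weakest retained hit sits on top;
    // each entry is {doc, score} (doc numbers fit exactly in a float here)
    private final PriorityQueue<float[]> heap =
        new PriorityQueue<>((a, b) -> Float.compare(a[1], b[1]));

    TopNCollector(int n) { this.n = n; }

    void collect(int doc, float score) {
        heap.offer(new float[]{doc, score});
        if (heap.size() > n)
            heap.poll();   // evict the current lowest score
    }

    int[] topDocs() {   // best first
        int[] docs = new int[heap.size()];
        for (int i = docs.length - 1; i >= 0; i--)
            docs[i] = (int) heap.poll()[0];
        return docs;
    }
}
```

Because the heap never holds more than n entries, memory stays bounded no matter how many documents match.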
The Scorer abstract class
package org.apache.lucene.search;
import java.io.IOException;
// Manages the scores of the documents that match a Query
public abstract class Scorer {
private Similarity similarity;
// Constructs a Scorer.
protected Scorer(Similarity similarity) {
??? this.similarity = similarity;
}
public Similarity getSimilarity() {
??? return this.similarity;
}
// Iterates over all matching documents, passing each one to the HitCollector
public void score(HitCollector hc) throws IOException {
    while (next()) {
      hc.collect(doc(), score());
    }
}
// Collects matching documents whose number is less than max
protected boolean score(HitCollector hc, int max) throws IOException {
    while (doc() < max) {
      hc.collect(doc(), score());
      if (!next())
        return false;
    }
    return true;
}
/** Advances to the next document matching the query. */
public abstract boolean next() throws IOException;
// Returns the number of the current document
public abstract int doc();
// Returns the score of the current matching document
public abstract float score() throws IOException;
/** Skips to the first match beyond the current whose document number is
 * greater than or equal to a given target.
 * <br>When this method is used the {@link #explain(int)} method should not be used.
 * @param target The target document number.
 * @return true iff there is such a match.
 * <p>Behaves as if written: <pre>
 *   boolean skipTo(int target) {
 *     do {
 *       if (!next())
 *         return false;
 *     } while (target > doc());
 *     return true;
 *   }
 * </pre>Most implementations are considerably more efficient than that.
 */
public abstract boolean skipTo(int target) throws IOException;
public abstract Explanation explain(int doc) throws IOException;
}
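The skipTo() contract can be exercised with a minimal Scorer-like cursor over a sorted array of matching document numbers (a stand-alone sketch, not a real Lucene scorer); its skipTo() is exactly the "behaves as if written" pseudocode from the javadoc:

```java
// Minimal scorer-like cursor over a sorted list of matching doc numbers,
// implementing skipTo() exactly as the javadoc pseudocode specifies.
class ArrayScorer {
    private final int[] docs;   // matching doc numbers, sorted ascending
    private int pos = -1;

    ArrayScorer(int[] docs) { this.docs = docs; }

    boolean next() { return ++pos < docs.length; }
    int doc() { return docs[pos]; }

    boolean skipTo(int target) {
        do {
            if (!next())
                return false;
        } while (target > doc());
        return true;
    }
}
```

Note that skipTo() always advances at least one match, matching the "beyond the current" wording; real scorers replace the linear loop with skip lists or binary search.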
The Similarity abstract class
For an explanation of this abstract class, see the documentation in its source code, reproduced below:
org.apache.lucene.search.Similarity
Expert: Scoring API.
Subclasses implement search scoring.
The score of query q for document d correlates to the cosine-distance or dot-product between document and query vectors in a Vector Space Model (VSM) of Information Retrieval. A document whose vector is closer to the query vector in that model is scored higher. The score is computed as follows:

    score(q,d) = coord(q,d) · queryNorm(q) · Σ (over t in q) [ tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ]

where
- tf(t in d) correlates to the term's frequency, defined as the number of times term t appears in the currently scored document d. Documents that have more occurrences of a given term receive a higher score. The default computation for tf(t in d) in DefaultSimilarity is:

      tf(t in d) = frequency½

- idf(t) stands for Inverse Document Frequency. This value correlates to the inverse of docFreq (the number of documents in which the term t appears). This means rarer terms give a higher contribution to the total score. The default computation for idf(t) in DefaultSimilarity is:

      idf(t) = 1 + log( numDocs / (docFreq + 1) )

- coord(q,d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search-time factor computed in coord(q,d) by the Similarity in effect at search time.
- queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable. This is a search-time factor computed by the Similarity in effect at search time. The default computation in DefaultSimilarity is:

      queryNorm(q) = queryNorm(sumOfSquaredWeights) = 1 / sumOfSquaredWeights½

  The sum of squared weights (of the query terms) is computed by the query's org.apache.lucene.search.Weight object. For example, a boolean query computes this value as:

      sumOfSquaredWeights = q.getBoost()² · Σ (over t in q) ( idf(t) · t.getBoost() )²

- t.getBoost() is a search-time boost of term t in the query q, as specified in the query text (see query syntax), or as set by application calls to setBoost(). Notice that there is really no direct API for accessing the boost of one term in a multi-term query; rather, multiple terms are represented in a query as multiple TermQuery objects, and so the boost of a term in the query is accessible by calling getBoost() on that sub-query.
- norm(t,d) encapsulates a few (indexing-time) boost and length factors:
  - Document boost - set by calling doc.setBoost() before adding the document to the index.
  - Field boost - set by calling field.setBoost() before adding the field to a document.
  - lengthNorm(field) - computed when the document is added to the index, in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. lengthNorm is computed by the Similarity class in effect at indexing time.

  When a document is added to the index, all the above factors are multiplied. If the document has multiple fields with the same name, all their boosts are multiplied together:

      norm(t,d) = doc.getBoost() · lengthNorm(field) · Π (over field f in d named as t) f.getBoost()

  However, the resulting norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75. Also notice that search time is too late to modify this norm part of scoring, e.g. by using a different Similarity for search.
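Plugging numbers into the DefaultSimilarity formulas makes them concrete: with numDocs = 1000 and docFreq = 9, idf(t) = 1 + ln(1000/10) ≈ 5.605, and a term occurring 4 times in a document gets tf = √4 = 2. The sketch below writes those default formulas out directly in plain Java (note that the log here is the natural logarithm):

```java
// The DefaultSimilarity formulas described above, written out directly.
class DefaultFormulas {
    // tf(t in d) = frequency^0.5
    static float tf(float freq) { return (float)Math.sqrt(freq); }

    // idf(t) = 1 + ln(numDocs / (docFreq + 1))
    static float idf(int docFreq, int numDocs) {
        return (float)(Math.log(numDocs / (double)(docFreq + 1)) + 1.0);
    }

    // queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)
    static float queryNorm(float sumOfSquaredWeights) {
        return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        System.out.println(tf(4));           // 2.0
        System.out.println(idf(9, 1000));    // ≈ 5.605
        System.out.println(queryNorm(4.0f)); // 0.5
    }
}
```

The idf curve is the reason rare terms dominate a score: halving docFreq adds a constant ln 2 to the factor, which is then squared in the overall score formula.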
The source code of the abstract class is shown below:
package org.apache.lucene.search;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.util.SmallFloat;
import java.io.IOException;
import java.io.Serializable;
import java.util.Collection;
import java.util.Iterator;
public abstract class Similarity implements Serializable {
// DefaultSimilarity is the concrete subclass of Similarity used by default
private static Similarity defaultImpl = new DefaultSimilarity();
public static void setDefault(Similarity similarity) {
??? Similarity.defaultImpl = similarity;
}
public static Similarity getDefault() {
??? return Similarity.defaultImpl;
}
// Table of decoded normalization factors
private static final float[] NORM_TABLE = new float[256];
static {
    // statically initialized: converts each possible byte value to its float norm
    for (int i = 0; i < 256; i++)
      NORM_TABLE[i] = SmallFloat.byte315ToFloat((byte)i);
}
// Decodes a normalization factor (from byte to float)
public static float decodeNorm(byte b) {
    return NORM_TABLE[b & 0xFF];   // & 0xFF maps negative bytes to positive above 127
}
// Returns the norm decoding table
public static float[] getNormDecoder() {
    return NORM_TABLE;
}
// Computes the length normalization factor for the Field named fieldName, given the number of tokens (numTokens) that the Field contains
public abstract float lengthNorm(String fieldName, int numTokens);
// Computes the normalization factor for a query, given the sum of the squared weights of the query's terms
public abstract float queryNorm(float sumOfSquaredWeights);
// Encodes a normalization factor for storage in the index (from float to byte)
public static byte encodeNorm(float f) {
    return SmallFloat.floatToByte315(f);
}
// Computes a score factor for a term's frequency within a document
public float tf(int freq) {
    return tf((float)freq);
}
/** Computes the amount of a sloppy phrase match, based on an edit distance.
 * This value is summed for each sloppy phrase match in a document to form
 * the frequency that is passed to {@link #tf(float)}.
 *
 * <p>A phrase match with a small edit distance to a document passage more
 * closely matches the document, so implementations of this method usually
 * return larger values when the edit distance is small and smaller values
 * when it is large.
 *
 * @see PhraseQuery#setSlop(int)
 * @param distance the edit distance of this sloppy phrase match
 * @return the frequency increment for this match
 */
public abstract float sloppyFreq(int distance);
/** Computes a score factor based on a term or phrase's frequency in a
 * document. This value is multiplied by the {@link #idf(Term, Searcher)}
 * factor for each term in the query and these products are then summed to
 * form the initial score for a document.
 *
 * <p>Terms and phrases repeated in a document indicate the topic of the
 * document, so implementations of this method usually return larger values
 * when <code>freq</code> is large, and smaller values when <code>freq</code>
 * is small.
 *
 * @param freq the frequency of a term within a document
 * @return a score factor based on a term's within-document frequency
 */
public abstract float tf(float freq);
/** Computes a score factor for a simple term.
 *
 * <p>The default implementation is:<pre>
 *   return idf(searcher.docFreq(term), searcher.maxDoc());
 * </pre>
 *
 * Note that {@link Searcher#maxDoc()} is used instead of
 * {@link org.apache.lucene.index.IndexReader#numDocs()} because it is proportional to
 * {@link Searcher#docFreq(Term)}, i.e., when one is inaccurate,
 * so is the other, and in the same direction.
 *
 * @param term the term in question
 * @param searcher the document collection being searched
 * @return a score factor for the term
 */
public float idf(Term term, Searcher searcher) throws IOException {
    return idf(searcher.docFreq(term), searcher.maxDoc());
}
// Computes a score factor for a phrase: the sum of the idf factors of its terms
public float idf(Collection terms, Searcher searcher) throws IOException {
    float idf = 0.0f;
    Iterator i = terms.iterator();
    while (i.hasNext()) {
      idf += idf((Term)i.next(), searcher);
    }
    return idf;
}
/** Computes a score factor based on a term's document frequency (the number
 * of documents which contain the term). This value is multiplied by the
 * {@link #tf(int)} factor for each term in the query and these products are
 * then summed to form the initial score for a document.
 */
public abstract float idf(int docFreq, int numDocs);
/** Computes a score factor based on the fraction of all query terms that a
 * document contains. This value is multiplied into scores.
 */
public abstract float coord(int overlap, int maxOverlap);
/**
 * Calculates a scoring factor based on the data in the payload. Overriding implementations
 * are responsible for interpreting what is in the payload. Lucene makes no assumptions about
 * what is in the byte array.
 */
public float scorePayload(byte [] payload, int offset, int length)
{
    // Do nothing
    return 1;
}
}
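The precision loss mentioned in the class documentation is easy to observe. Below is a plain-Java sketch of the 3-mantissa-bit "byte315" encoding that SmallFloat implements (reproduced here so the example runs without Lucene on the classpath; see SmallFloat for the real thing): a float is squeezed into one byte by keeping 3 mantissa bits and a 5-bit exponent with a zero point of 15, so most values do not round-trip exactly.

```java
// A stand-alone sketch of SmallFloat's byte315 norm encoding:
// 3 mantissa bits, 5 exponent bits, exponent zero point at 15.
class NormCodec {
    static byte encode(float f) {
        int fzero = (63 - 15) << 3;                 // smallfloat representation of zero
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);          // keep sign + exponent + 3 mantissa bits
        if (smallfloat < fzero)
            return (bits <= 0) ? (byte)0 : (byte)1; // underflow: zero or smallest positive
        if (smallfloat >= fzero + 0x100)
            return -1;                              // overflow: largest representable value
        return (byte)(smallfloat - fzero);
    }

    static float decode(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;                    // restore the exponent bias
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(decode(encode(1.0f)));  // 1.0 survives exactly
        System.out.println(decode(encode(0.89f))); // rounds down to a coarser value
    }
}
```

Powers of two and a few values in between (like 1.0 or 0.875) round-trip exactly; everything else is truncated to the nearest representable coarse value, which is the precision loss the norm table trades for a 4x smaller index footprint per field norm.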