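I have recently been building a knowledge-base question-answering project, which is the RAG application pattern that is popular in the current wave of large language models. LangChain is arguably the most popular tool for RAG, so it was my first choice for getting the application built quickly. To be honest, LangChain's own set of component definitions already felt complicated to me: why must work that an f-string or string.format can handle be wrapped in such an elaborate object?

Of course, those complaints may just mean that I do not understand the Zen of LangChain's design. The pitfall below, however, is what truly left me disappointed with LangChain.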

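Background

It started simply. I quickly put together a minimal RAG application consisting of just three steps:

  1. The user enters a query.
  2. The query is embedded and used for a similarity search, and texts with low similarity are filtered out by a threshold.
  3. The retrieved texts are assembled into a fixed format and sent to the large language model.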
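The three steps above can be sketched in a few lines. This is a toy illustration only: embed() is a hypothetical character-count stand-in for a real embedding model, and retrieve() and build_prompt() are illustrative names, not LangChain APIs.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding (assumption): character counts instead of a real model."""
    return Counter(text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[ch] * b[ch] for ch in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: List[str], threshold: float) -> List[str]:
    """Step 2: score every document, drop low scores, sort best first."""
    q = embed(query)
    scored = [(doc, cosine(q, embed(doc))) for doc in corpus]
    kept = [(doc, s) for doc, s in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept]


def build_prompt(query: str, context: List[str]) -> str:
    """Step 3: put the retrieved texts into a fixed prompt format."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {query}"


corpus = ["apple pie recipe", "zzz"]
docs = retrieve("apple pie", corpus, threshold=0.5)  # step 1 is the query string
prompt = build_prompt("apple pie", docs)
```

Here a higher score means more similar, so raising the threshold tightens the filter. The rest of this post is about what happens when that assumption does not hold.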

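The problem appeared in step 2. During testing I kept recalling a lot of irrelevant text, and raising the similarity threshold still did not filter it out. That confused me a great deal, until I read through the LangChain code being called and it suddenly clicked: there is a pitfall here!

Recap

LangChain has a class for text retrieval called BaseRetriever. At first I only used the vector database for the simplest kind of retrieval, but since I planned to add more retrieval methods later, I used VectorStoreRetriever to make them easier to combine. The basic code looked like this:

# Loading the db is omitted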
retriever = db.as_retriever()
docs = retriever.get_relevant_documents(query, score_threshold=threshold)

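That was all, yet raising threshold did not filter out the obviously irrelevant text. So I wanted to see how LangChain makes the call.

Investigation

First, look at the call flow of get_relevant_documents(). This is how BaseRetriever defines it. Warning: a wall of source code follows!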

def get_relevant_documents(
    self,
    query: str,
    *,
    callbacks: Callbacks = None,
    tags: Optional[List[str]] = None,
    metadata: Optional[Dict[str, Any]] = None,
    run_name: Optional[str] = None,
    **kwargs: Any,
) -> List[Document]:
    """Retrieve documents relevant to a query.
    Users should favor using `.invoke` or `.batch` rather than
    `get_relevant_documents directly`.
    Args:
        query: string to find relevant documents for
        callbacks: Callback manager or list of callbacks
        tags: Optional list of tags associated with the retriever. Defaults to None
            These tags will be associated with each call to this retriever,
            and passed as arguments to the handlers defined in `callbacks`.
        metadata: Optional metadata associated with the retriever. Defaults to None
            This metadata will be associated with each call to this retriever,
            and passed as arguments to the handlers defined in `callbacks`.
        run_name: Optional name for the run.
    Returns:
        List of relevant documents
    """
    from langchain_core.callbacks.manager import CallbackManager
    callback_manager = CallbackManager.configure(
        callbacks,
        None,
        verbose=kwargs.get("verbose", False),
        inheritable_tags=tags,
        local_tags=self.tags,
        inheritable_metadata=metadata,
        local_metadata=self.metadata,
    )
    run_manager = callback_manager.on_retriever_start(
        dumpd(self),
        query,
        name=run_name,
        run_id=kwargs.pop("run_id", None),
    )
    try:
        _kwargs = kwargs if self._expects_other_args else {}
        if self._new_arg_supported:
            result = self._get_relevant_documents(
                query, run_manager=run_manager, **_kwargs
            )
        else:
            result = self._get_relevant_documents(query, **_kwargs)
    except Exception as e:
        run_manager.on_retriever_error(e)
        raise e
    else:
        run_manager.on_retriever_end(
            result,
        )
        return result

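The docstring recommends using .invoke() rather than calling this function directly, but .invoke() ends up calling this function anyway. The flow is fairly clear: it sets up some callbacks and then calls _get_relevant_documents(), which each subclass implements itself. Here is the VectorStoreRetriever implementation: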

def _get_relevant_documents(
    self, query: str, *, run_manager: CallbackManagerForRetrieverRun
) -> List[Document]:
    if self.search_type == "similarity":
        docs = self.vectorstore.similarity_search(query, **self.search_kwargs)
    elif self.search_type == "similarity_score_threshold":
        docs_and_similarities = (
            self.vectorstore.similarity_search_with_relevance_scores(
                query, **self.search_kwargs
            )
        )
        docs = [doc for doc, _ in docs_and_similarities]
    elif self.search_type == "mmr":
        docs = self.vectorstore.max_marginal_relevance_search(
            query, **self.search_kwargs
        )
    else:
        raise ValueError(f"search_type of {self.search_type} not allowed.")
    return docs

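The logic is not hard either: depending on search_type, it calls a different vectorstore method. So VectorStoreRetriever is really just another wrapper around the vectorstore, and the core work is still done by vectorstore methods. Note also that this implementation declares no extra keyword arguments and reads its options only from self.search_kwargs. Judging from the line _kwargs = kwargs if self._expects_other_args else {} in get_relevant_documents(), a score_threshold passed directly to get_relevant_documents() appears to be discarded rather than forwarded, so the threshold has to go in through search_kwargs.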
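To make the keyword-argument handling concrete, here is a minimal sketch. ToyBaseRetriever and ToyVectorRetriever are hypothetical stand-ins written for this illustration, not LangChain classes; the way _expects_other_args is derived from the method signature is an assumption modeled on the source quoted above.

```python
import inspect
from typing import Any, Dict, List, Optional


class ToyBaseRetriever:
    """Hypothetical stand-in that mimics the kwargs handling shown above."""

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        params = inspect.signature(cls._get_relevant_documents).parameters
        # Assumption: extra kwargs are forwarded only if the subclass
        # declares parameters beyond self, query and run_manager.
        extra = set(params) - {"self", "query", "run_manager"}
        cls._expects_other_args = len(extra) > 0

    def get_relevant_documents(self, query: str, **kwargs: Any) -> List[str]:
        _kwargs = kwargs if self._expects_other_args else {}
        return self._get_relevant_documents(query, **_kwargs)


class ToyVectorRetriever(ToyBaseRetriever):
    def __init__(
        self,
        scores: Dict[str, float],
        search_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.scores = scores  # document -> precomputed relevance score
        self.search_kwargs = search_kwargs or {}

    def _get_relevant_documents(self, query: str) -> List[str]:
        # Only search_kwargs is consulted, never per-call kwargs.
        threshold = self.search_kwargs.get("score_threshold")
        return [
            doc
            for doc, score in self.scores.items()
            if threshold is None or score >= threshold
        ]


scores = {"a": 0.95, "b": 0.40}

# The per-call threshold is silently dropped: both documents come back.
dropped = ToyVectorRetriever(scores).get_relevant_documents(
    "q", score_threshold=0.9
)

# The threshold supplied through search_kwargs is applied.
applied = ToyVectorRetriever(
    scores, search_kwargs={"score_threshold": 0.9}
).get_relevant_documents("q")
```

In this sketch the per-call score_threshold never reaches the search, which is one more reason to pass it through search_kwargs, as the experiments later in this post do.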

Back to the function itself. A new variable, search_type, shows up here. It is declared in VectorStoreRetriever:

class VectorStoreRetriever(BaseRetriever):
    """Base Retriever class for VectorStore."""
    vectorstore: VectorStore
    """VectorStore to use for retrieval."""
    search_type: str = "similarity"
    """Type of search to perform. Defaults to "similarity"."""
    search_kwargs: dict = Field(default_factory=dict)
    """Keyword arguments to pass to the search function."""
    allowed_search_types: ClassVar[Collection[str]] = (
        "similarity",
        "similarity_score_threshold",
        "mmr",
    )

It can also be specified when calling vectorstore.as_retriever(). Here is the implementation of as_retriever():

def as_retriever(self, **kwargs: Any) -> VectorStoreRetriever:
    """Return VectorStoreRetriever initialized from this VectorStore.
    Args:
        search_type (Optional[str]): Defines the type of search that
            the Retriever should perform.
            Can be "similarity" (default), "mmr", or
            "similarity_score_threshold".
        search_kwargs (Optional[Dict]): Keyword arguments to pass to the
            search function. Can include things like:
                k: Amount of documents to return (Default: 4)
                score_threshold: Minimum relevance threshold
                    for similarity_score_threshold
                fetch_k: Amount of documents to pass to MMR algorithm (Default: 20)
                lambda_mult: Diversity of results returned by MMR;
                    1 for minimum diversity and 0 for maximum. (Default: 0.5)
                filter: Filter by document metadata
    Returns:
        VectorStoreRetriever: Retriever class for VectorStore.
    Examples:
    .. code-block:: python
        # Retrieve more documents with higher diversity
        # Useful if your dataset has many similar documents
        docsearch.as_retriever(
            search_type="mmr",
            search_kwargs={'k': 6, 'lambda_mult': 0.25}
        )
        # Fetch more documents for the MMR algorithm to consider
        # But only return the top 5
        docsearch.as_retriever(
            search_type="mmr",
            search_kwargs={'k': 5, 'fetch_k': 50}
        )
        # Only retrieve documents that have a relevance score
        # Above a certain threshold
        docsearch.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={'score_threshold': 0.8}
        )
        # Only get the single most similar document from the dataset
        docsearch.as_retriever(search_kwargs={'k': 1})
        # Use a filter to only retrieve documents from a specific paper
        docsearch.as_retriever(
            search_kwargs={'filter': {'paper_title':'GPT-4 Technical Report'}}
        )
    """
    tags = kwargs.pop("tags", None) or []
    tags.extend(self._get_retriever_tags())
    return VectorStoreRetriever(vectorstore=self, **kwargs, tags=tags)

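As you can see, search_type supports three values: similarity, mmr, and similarity_score_threshold, with similarity as the default. This raised my first question: what is the difference between similarity and similarity_score_threshold?

Below I analyze the two paths separately, following each call chain to see what they actually mean.

Branch 1: similarity

Branch 1 calls vectorstore.similarity_search(). This is an abstract method of VectorStore that subclasses must implement, so let's look at how FAISS implements it.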

def similarity_search(
    self,
    query: str,
    k: int = 4,
    filter: Optional[Union[Callable, Dict[str, Any]]] = None,
    fetch_k: int = 20,
    **kwargs: Any,
) -> List[Document]:
    """Return docs most similar to query.
    Args:
        query: Text to look up documents similar to.
        k: Number of Documents to return. Defaults to 4.
        filter: (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.
        fetch_k: (Optional[int]) Number of Documents to fetch before filtering.
                  Defaults to 20.
    Returns:
        List of Documents most similar to the query.
    """
    docs_and_scores = self.similarity_search_with_score(
        query, k, filter=filter, fetch_k=fetch_k, **kwargs
    )
    return [doc for doc, _ in docs_and_scores]

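It calls similarity_search_with_score() and then drops the score from the results. I cannot help complaining that this looks like a pointless extra step: a single method with a flag for whether to return scores would do, yet it is wrapped as two methods. Complaint over; on to similarity_search_with_score().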

def similarity_search_with_score(
    self,
    query: str,
    k: int = 4,
    filter: Optional[Union[Callable, Dict[str, Any]]] = None,
    fetch_k: int = 20,
    **kwargs: Any,
) -> List[Tuple[Document, float]]:
    """Return docs most similar to query.
    Args:
        query: Text to look up documents similar to.
        k: Number of Documents to return. Defaults to 4.
        filter (Optional[Dict[str, str]]): Filter by metadata.
            Defaults to None. If a callable, it must take as input the
            metadata dict of Document and return a bool.
        fetch_k: (Optional[int]) Number of Documents to fetch before filtering.
                  Defaults to 20.
    Returns:
        List of documents most similar to the query text with
        L2 distance in float. Lower score represents more similarity.
    """
    embedding = self._embed_query(query)
    docs = self.similarity_search_with_score_by_vector(
        embedding,
        k,
        filter=filter,
        fetch_k=fetch_k,
        **kwargs,
    )
    return docs

This method embeds the query and then searches by vector, calling similarity_search_with_score_by_vector(). Let's keep following.

def similarity_search_with_score_by_vector(
    self,
    embedding: List[float],
    k: int = 4,
    filter: Optional[Union[Callable, Dict[str, Any]]] = None,
    fetch_k: int = 20,
    **kwargs: Any,
) -> List[Tuple[Document, float]]:
    """Return docs most similar to query.
    Args:
        embedding: Embedding vector to look up documents similar to.
        k: Number of Documents to return. Defaults to 4.
        filter (Optional[Union[Callable, Dict[str, Any]]]): Filter by metadata.
            Defaults to None. If a callable, it must take as input the
            metadata dict of Document and return a bool.
        fetch_k: (Optional[int]) Number of Documents to fetch before filtering.
                  Defaults to 20.
        **kwargs: kwargs to be passed to similarity search. Can include:
            score_threshold: Optional, a floating point value between 0 to 1 to
                filter the resulting set of retrieved docs
    Returns:
        List of documents most similar to the query text and L2 distance
        in float for each. Lower score represents more similarity.
    """
    faiss = dependable_faiss_import()
    vector = np.array([embedding], dtype=np.float32)
    if self._normalize_L2:
        faiss.normalize_L2(vector)
    scores, indices = self.index.search(vector, k if filter is None else fetch_k)
    docs = []
    if filter is not None:
        filter_func = self._create_filter_func(filter)
    for j, i in enumerate(indices[0]):
        if i == -1:
            # This happens when not enough docs are returned.
            continue
        _id = self.index_to_docstore_id[i]
        doc = self.docstore.search(_id)
        if not isinstance(doc, Document):
            raise ValueError(f"Could not find document for id {_id}, got {doc}")
        if filter is not None:
            if filter_func(doc.metadata):
                docs.append((doc, scores[0][j]))
        else:
            docs.append((doc, scores[0][j]))
    score_threshold = kwargs.get("score_threshold")
    if score_threshold is not None:
        cmp = (
            operator.ge
            if self.distance_strategy
            in (DistanceStrategy.MAX_INNER_PRODUCT, DistanceStrategy.JACCARD)
            else operator.le
        )
        docs = [
            (doc, similarity)
            for doc, similarity in docs
            if cmp(similarity, score_threshold)
        ]
    return docs[:k]

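This uses the index built when the FAISS store was created to run the similarity search. After the search, it checks whether the keyword arguments contain score_threshold; if a threshold was passed in by an earlier call, it filters on it. Since my problem was precisely that irrelevant content was not being filtered, this filter caught my attention.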

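Let's analyze the filtering code:

  1. Choose the comparison operator: if the distance strategy is maximum inner product or Jaccard, use greater-than-or-equal (operator.ge); otherwise use less-than-or-equal (operator.le).
  2. Apply that operator to each score and the threshold to filter the results.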
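The two steps above can be reproduced in isolation. In this sketch, Strategy is a hypothetical stand-in for DistanceStrategy and the scores are made-up numbers chosen to resemble typical L2 and inner-product results; only the operator selection mirrors the quoted source.

```python
import operator
from enum import Enum
from typing import List, Tuple


class Strategy(Enum):
    """Hypothetical stand-in for LangChain's DistanceStrategy."""

    EUCLIDEAN_DISTANCE = "euclidean"
    MAX_INNER_PRODUCT = "max_inner_product"
    JACCARD = "jaccard"


def filter_by_threshold(
    scored: List[Tuple[str, float]], strategy: Strategy, score_threshold: float
) -> List[str]:
    """Keep documents whose raw score passes the strategy-dependent test."""
    cmp = (
        operator.ge
        if strategy in (Strategy.MAX_INNER_PRODUCT, Strategy.JACCARD)
        else operator.le
    )
    return [doc for doc, score in scored if cmp(score, score_threshold)]


# Made-up raw scores: distances for L2, similarities for inner product.
l2_scores = [("same", 0.0), ("apple", 0.40), ("sort", 0.50)]
ip_scores = [("same", 1.0), ("apple", 0.79), ("sort", 0.75)]

# Euclidean: a threshold of 0.8 keeps everything, 0.3 keeps only the match.
l2_at_08 = filter_by_threshold(l2_scores, Strategy.EUCLIDEAN_DISTANCE, 0.8)
l2_at_03 = filter_by_threshold(l2_scores, Strategy.EUCLIDEAN_DISTANCE, 0.3)

# Inner product: a threshold of 0.8 keeps only the exact match.
ip_at_08 = filter_by_threshold(ip_scores, Strategy.MAX_INNER_PRODUCT, 0.8)
```

With a distance-based strategy, raising the threshold widens the set of kept documents, which is the opposite of what the name score_threshold suggests.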

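That was the moment it clicked. I quickly checked which distance strategy I was using, and the source shows that FAISS defaults to DistanceStrategy.EUCLIDEAN_DISTANCE, i.e. Euclidean distance. So the operator is less-than-or-equal, which means documents whose score is at or below the threshold are kept.

This makes sense: with Euclidean distance as the score, a smaller value does mean more similar. So it is expected that raising the threshold filtered nothing: the higher the threshold, the weaker the filter!

This is very counterintuitive. Had I used inner product as the distance strategy, what I did would have been correct. LangChain does not handle this case sensibly, and I did not even see a warning about it.

That ends branch 1. Although it already explains my original problem, let's continue with branch 2.

Branch 2: similarity_score_threshold

In branch 2, VectorStoreRetriever calls vectorstore.similarity_search_with_relevance_scores(). This introduces a new concept, relevance_scores, which I will call the relevance score for now. How does it relate to the similarity score from before? Before answering, let's see what the function does.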

def similarity_search_with_relevance_scores(
    self,
    query: str,
    k: int = 4,
    **kwargs: Any,
) -> List[Tuple[Document, float]]:
    """Return docs and relevance scores in the range [0, 1].
    0 is dissimilar, 1 is most similar.
    Args:
        query: input text
        k: Number of Documents to return. Defaults to 4.
        **kwargs: kwargs to be passed to similarity search. Should include:
            score_threshold: Optional, a floating point value between 0 to 1 to
                filter the resulting set of retrieved docs
    Returns:
        List of Tuples of (doc, similarity_score)
    """
    score_threshold = kwargs.pop("score_threshold", None)
    docs_and_similarities = self._similarity_search_with_relevance_scores(
        query, k=k, **kwargs
    )
    if any(
        similarity < 0.0 or similarity > 1.0
        for _, similarity in docs_and_similarities
    ):
        warnings.warn(
            "Relevance scores must be between"
            f" 0 and 1, got {docs_and_similarities}"
        )
    if score_threshold is not None:
        docs_and_similarities = [
            (doc, similarity)
            for doc, similarity in docs_and_similarities
            if similarity >= score_threshold
        ]
        if len(docs_and_similarities) == 0:
            warnings.warn(
                "No relevant docs were retrieved using the relevance score"
                f" threshold {score_threshold}"
            )
    return docs_and_similarities

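The docstring says it returns documents with their relevance scores, which lie between 0 and 1, where 0 means dissimilar and 1 means most similar. The flow is not complicated, but it is worth walking through:

  1. It pops score_threshold out of the keyword arguments, so the keyword arguments passed on afterwards no longer contain score_threshold. (Another thing worth complaining about; more on that later.)
  2. It calls _similarity_search_with_relevance_scores(). (A small complaint: the function name says relevance_scores, yet the receiving variable is docs_and_similarities. Why so many confusing names?)
  3. If the score_threshold obtained in step 1 is not None, it filters, keeping documents whose score is greater than or equal to the threshold. Note that there is no operator selection here like the one at the end of branch 1.

At this point I was a bit lost. A relevance_scores concept has been introduced, but it seems to mean much the same as similarity, and the docstrings and function bodies use the two interchangeably, so I was curious why a new concept was needed. One thing is clear about the intent, though: a higher relevance score is supposed to mean more similar text, whatever distance strategy you use.

Let's continue down the call chain and look at the function from step 2: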

def _similarity_search_with_relevance_scores(
    self,
    query: str,
    k: int = 4,
    **kwargs: Any,
) -> List[Tuple[Document, float]]:
    """
    Default similarity search with relevance scores. Modify if necessary
    in subclass.
    Return docs and relevance scores in the range [0, 1].
    0 is dissimilar, 1 is most similar.
    Args:
        query: input text
        k: Number of Documents to return. Defaults to 4.
        **kwargs: kwargs to be passed to similarity search. Should include:
            score_threshold: Optional, a floating point value between 0 to 1 to
                filter the resulting set of retrieved docs
    Returns:
        List of Tuples of (doc, similarity_score)
    """
    relevance_score_fn = self._select_relevance_score_fn()
    docs_and_scores = self.similarity_search_with_score(query, k, **kwargs)
    return [(doc, relevance_score_fn(score)) for doc, score in docs_and_scores]

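The docstring again states that it returns documents with their relevance scores, between 0 and 1, where 0 means dissimilar and 1 means most similar. The function is simple: it first selects a relevance score function, then calls similarity_search_with_score() to get documents and their raw scores, and finally converts each raw score with the relevance score function. This is where the two branches meet: both ultimately call similarity_search_with_score().

This also answers why the threshold was popped from the keyword arguments earlier. If score_threshold were still in the keyword arguments, similarity_search_with_score() would filter on it as well, but that function picks its operator by distance strategy, whereas branch 2 always filters with greater-than-or-equal.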
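A small simulation shows why the pop matters. Everything here is a toy under stated assumptions: raw_search() imitates the Euclidean less-than-or-equal filter from branch 1, relevance() copies the Euclidean conversion formula quoted later in this post, and the distances are made up.

```python
import math
import operator
from typing import List, Optional


def raw_search(
    distances: List[float], score_threshold: Optional[float] = None
) -> List[float]:
    """Toy Euclidean search: keep distances <= threshold, as in branch 1."""
    if score_threshold is None:
        return list(distances)
    return [d for d in distances if operator.le(d, score_threshold)]


def relevance(distance: float) -> float:
    """Euclidean distance to relevance conversion."""
    return 1.0 - distance / math.sqrt(2)


def relevance_search(
    distances: List[float], score_threshold: float, pop_threshold: bool = True
) -> List[float]:
    """Branch 2: convert raw scores, then keep relevance >= threshold."""
    inner_threshold = None if pop_threshold else score_threshold
    scored = [relevance(d) for d in raw_search(distances, inner_threshold)]
    return [s for s in scored if s >= score_threshold]


distances = [0.0, 0.4, 1.2]  # made-up raw distances

# Threshold popped: only the relevance filter runs, two documents survive.
with_pop = relevance_search(distances, 0.2)

# Threshold leaked into the raw search: 0.2 is first applied as a maximum
# distance, which wrongly removes the second document as well.
without_pop = relevance_search(distances, 0.2, pop_threshold=False)
```

The same number means two different things in the two filters, so letting it reach both would filter twice on incompatible scales.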

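By now a first picture emerges from the tangle of concepts, and three observations follow:

  1. Similarity and relevance are different things, at least as LangChain defines them. The two variable names are mixed freely inside the functions, but in behavior they really are two separate definitions.
  2. A larger relevance score means more relevant text. The similarity score depends on the distance strategy; for Euclidean distance, a smaller score means more relevant text.
  3. The relevance score is computed from the similarity score, and the conversion function is chosen by _select_relevance_score_fn().

I could see light at the end of the tunnel: once I knew exactly what _select_relevance_score_fn() does, I would know how the two definitions are related.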

def _select_relevance_score_fn(self) -> Callable[[float], float]:
    """
    The 'correct' relevance function
    may differ depending on a few things, including:
    - the distance / similarity metric used by the VectorStore
    - the scale of your embeddings (OpenAI's are unit normed. Many others are not!)
    - embedding dimensionality
    - etc.
    Vectorstores should define their own selection based method of relevance.
    """
    raise NotImplementedError

As you can see, each vectorstore provides its own implementation. I am of course discussing FAISS here, so let's look at how LangChain defines it for FAISS.

def _select_relevance_score_fn(self) -> Callable[[float], float]:
    """
    The 'correct' relevance function
    may differ depending on a few things, including:
    - the distance / similarity metric used by the VectorStore
    - the scale of your embeddings (OpenAI's are unit normed. Many others are not!)
    - embedding dimensionality
    - etc.
    """
    if self.override_relevance_score_fn is not None:
        return self.override_relevance_score_fn
    # Default strategy is to rely on distance strategy provided in
    # vectorstore constructor
    if self.distance_strategy == DistanceStrategy.MAX_INNER_PRODUCT:
        return self._max_inner_product_relevance_score_fn
    elif self.distance_strategy == DistanceStrategy.EUCLIDEAN_DISTANCE:
        # Default behavior is to use euclidean distance relevancy
        return self._euclidean_relevance_score_fn
    elif self.distance_strategy == DistanceStrategy.COSINE:
        return self._cosine_relevance_score_fn
    else:
        raise ValueError(
            "Unknown distance strategy, must be cosine, max_inner_product,"
            " or euclidean"
        )

Here LangChain supports three distance strategies for FAISS, each with its own formula. I will paste the three formulas directly:

@staticmethod
def _max_inner_product_relevance_score_fn(distance: float) -> float:
    """Normalize the distance to a score on a scale [0, 1]."""
    if distance > 0:
        return 1.0 - distance
    return -1.0 * distance
@staticmethod
def _euclidean_relevance_score_fn(distance: float) -> float:
    """Return a similarity score on a scale [0, 1]."""
    # The 'correct' relevance function
    # may differ depending on a few things, including:
    # - the distance / similarity metric used by the VectorStore
    # - the scale of your embeddings (OpenAI's are unit normed. Many
    #  others are not!)
    # - embedding dimensionality
    # - etc.
    # This function converts the euclidean norm of normalized embeddings
    # (0 is most similar, sqrt(2) most dissimilar)
    # to a similarity function (0 to 1)
    return 1.0 - distance / math.sqrt(2)
@staticmethod
def _cosine_relevance_score_fn(distance: float) -> float:
    """Normalize the distance to a score on a scale [0, 1]."""
    return 1.0 - distance

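Assume throughout that the embedding vectors are L2-normalized, so inner product and cosine similarity should give the same value. In practice, the inner product case has a problem.

When the inner product is negative, taking its negation is fine, since a negative correlation is still a correlation. The problem is with positive values. For example, if inner product gives a similarity of 0.7, the two texts should be fairly related, yet this relevance function returns only 0.3, which makes them look unrelated. As I see it, of the three formulas only the Euclidean one is correct.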
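The 0.7 to 0.3 example can be checked by copying the three formulas into standalone functions. The function bodies follow the source quoted above; the standalone names and the sample inputs are mine.

```python
import math


def max_inner_product_relevance(distance: float) -> float:
    """Copy of the inner-product formula shown above."""
    if distance > 0:
        return 1.0 - distance
    return -1.0 * distance


def euclidean_relevance(distance: float) -> float:
    """Copy of the Euclidean formula shown above."""
    return 1.0 - distance / math.sqrt(2)


def cosine_relevance(distance: float) -> float:
    """Copy of the cosine formula shown above."""
    return 1.0 - distance


# Inner product: a more similar pair (0.7) scores LOWER than a less
# similar pair (0.1), so the ordering is inverted.
ip_close = max_inner_product_relevance(0.7)
ip_far = max_inner_product_relevance(0.1)

# Euclidean: distance 0 maps to 1 and distance sqrt(2) maps to 0,
# so smaller distance means higher relevance, as intended.
l2_identical = euclidean_relevance(0.0)
l2_far = euclidean_relevance(math.sqrt(2))
```

For positive inner products the conversion reverses the ranking, which is why a greater-than-or-equal filter on the converted score rejects exactly the documents it should keep.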

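Experiment

The above shows that LangChain does not filter correctly for every distance strategy, and that in computing relevance it gets the relationship between semantic similarity and relevance backwards.

For a VectorStore using Euclidean distance, only similarity_search_with_relevance_scores() filters documents correctly by similarity, so the search_type of the corresponding VectorStoreRetriever should be similarity_score_threshold.

With maximum inner product, only similarity_search_with_score() retrieves documents correctly, so the search_type of the corresponding VectorStoreRetriever should be similarity.

No other combination retrieves documents as expected.

To verify this hypothesis, here comes the experiment.

Version information

The LangChain version I used is as follows: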

pip show langchain
Name: langchain
Version: 0.1.16
Summary: Building applications with LLMs through composability
Home-page: https://github.com/langchain-ai/langchain
Author: 
Author-email: 
License: MIT
Location: D:\miniconda3\envs\new\Lib\site-packages
Requires: aiohttp, dataclasses-json, jsonpatch, langchain-community, langchain-core, langchain-text-splitters, langsmith, numpy, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by: 

Experiment procedure

Imports

import numpy as np
from langchain_community.vectorstores.faiss import FAISS, DistanceStrategy
from langchain_openai import OpenAIEmbeddings

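I use the following three completely unrelated sentences as documents and build three vector stores with different distance strategies.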

text_list = ["今天天气真好", "我喜欢吃苹果", "山公排序很不牢靠"]
embeddings = OpenAIEmbeddings(
    openai_api_base="xxx",
    openai_api_key="xxx"
)
embedding_list = [embeddings.embed_query(text) for text in text_list]

OpenAIEmbeddings returns L2-normalized vectors, as the norms below confirm.

for embedding in embedding_list:
    print(np.linalg.norm(embedding))
0.9999999999999989
1.0000000000000002
1.0000000000000002

Build the following three vector stores:

vs1 = FAISS.from_embeddings(zip(text_list, embedding_list), embeddings, normalize_L2=True, distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE)
vs2 = FAISS.from_embeddings(zip(text_list, embedding_list), embeddings, normalize_L2=True, distance_strategy=DistanceStrategy.MAX_INNER_PRODUCT)
vs3 = FAISS.from_embeddings(zip(text_list, embedding_list), embeddings, normalize_L2=True, distance_strategy=DistanceStrategy.COSINE)

First, run a search on each to make sure all three stores contain the content.

print(vs1.similarity_search_with_score("今天天气真好"))
print(vs2.similarity_search_with_score("今天天气真好"))
print(vs3.similarity_search_with_score("今天天气真好"))
[(Document(page_content='今天天气真好'), 0.0), (Document(page_content='我喜欢吃苹果'), 0.40074897), (Document(page_content='山公排序很不牢靠'), 0.5013859)]
[(Document(page_content='今天天气真好'), 0.9999843), (Document(page_content='我喜欢吃苹果'), 0.7995081), (Document(page_content='山公排序很不牢靠'), 0.74908566)] 
[(Document(page_content='今天天气真好'), 0.0), (Document(page_content='我喜欢吃苹果'), 0.40074897), (Document(page_content='山公排序很不牢靠'), 0.5013859)]

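Here the store that uses cosine similarity as its distance strategy returns the same scores as the Euclidean one. My guess is that FAISS supports Euclidean distance and inner product, and although inner product and cosine similarity are equivalent after normalization, FAISS does not support cosine similarity when building the index, so the index was built with Euclidean distance. This is a guess that I have not verified.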
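One way to sanity-check the scores above is the identity for unit vectors, ||a - b||^2 = 2 - 2(a · b). FAISS's flat L2 index reports squared Euclidean distances, so if the Euclidean and cosine stores are both backed by such an index (an assumption, consistent with the guess above), their scores should be roughly 2 minus twice the inner-product scores. The sketch below verifies the identity on random unit vectors and then applies it to the inner-product score printed above; the helper names are mine.

```python
import math
import random
from typing import List


def unit(v: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


random.seed(0)
a = unit([random.gauss(0, 1) for _ in range(8)])
b = unit([random.gauss(0, 1) for _ in range(8)])

# For unit vectors the two sides agree up to floating-point error.
lhs = squared_l2(a, b)
rhs = 2 - 2 * dot(a, b)

# Apply the identity to the inner-product score printed above (0.7995081).
implied_l2 = 2 - 2 * 0.7995081
```

The implied value is about 0.401, close to the 0.40074897 reported by the Euclidean and cosine stores, which fits the guess that the cosine store is served by an L2 index. The small gap is plausibly due to embedding and float32 noise; I have not confirmed that.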

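According to the hypothesis above, at the VectorStore level, when a score threshold is passed to similarity_search_with_score(), only the inner product store filters documents correctly.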

print(vs1.similarity_search_with_score("今天天气真好", score_threshold=0.8))
print(vs2.similarity_search_with_score("今天天气真好", score_threshold=0.8))
print(vs3.similarity_search_with_score("今天天气真好", score_threshold=0.8))
[(Document(page_content='今天天气真好'), 0.0), (Document(page_content='我喜欢吃苹果'), 0.40074897), (Document(page_content='山公排序很不牢靠'), 0.5011895)]
[(Document(page_content='今天天气真好'), 0.9999846)]
[(Document(page_content='今天天气真好'), 0.0), (Document(page_content='我喜欢吃苹果'), 0.40074897), (Document(page_content='山公排序很不牢靠'), 0.5011895)]

And so it does. When the threshold is passed to similarity_search_with_relevance_scores(), only the Euclidean store filters documents correctly.

print(vs1.similarity_search_with_relevance_scores("今天天气真好", score_threshold=0.8))
print(vs2.similarity_search_with_relevance_scores("今天天气真好", score_threshold=0.8))
print(vs3.similarity_search_with_relevance_scores("今天天气真好", score_threshold=0.8))
[(Document(page_content='今天天气真好'), 0.999978158576509)]
d:\miniconda3\envs\new\Lib\site-packages\langchain_core\vectorstores.py:342: UserWarning: No relevant docs were retrieved using the relevance score threshold 0.8
  warnings.warn(
[]
[(Document(page_content='今天天气真好'), 1.0)]

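The result matches again. You may wonder why cosine similarity also produces the right output. The reason is that the distance was actually computed as Euclidean distance, and the relevance score was then computed with the cosine formula, which is also wrong; the two errors happen to leave the relationship between semantics and relevance correct. But good programs should not survive on bugs!

This confirms my conclusion at the VectorStore level. By the call chain, VectorStoreRetriever should satisfy it as well, but let's continue the experiment anyway.

When search_type is similarity, only inner product recalls correctly.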

search_type = "similarity"
search_kwargs = {
    "score_threshold": 0.8
}
re1 = vs1.as_retriever(search_type=search_type, search_kwargs=search_kwargs)
re2 = vs2.as_retriever(search_type=search_type, search_kwargs=search_kwargs)
re3 = vs3.as_retriever(search_type=search_type, search_kwargs=search_kwargs)
print(re1.get_relevant_documents("今天天气真好"))
print(re2.get_relevant_documents("今天天气真好"))
print(re3.get_relevant_documents("今天天气真好"))
[Document(page_content='今天天气真好'), Document(page_content='我喜欢吃苹果'), Document(page_content='山公排序很不牢靠')] 
[Document(page_content='今天天气真好')] 
[Document(page_content='今天天气真好'), Document(page_content='我喜欢吃苹果'), Document(page_content='山公排序很不牢靠')]

When search_type is similarity_score_threshold, only Euclidean distance recalls correctly.

search_type = "similarity_score_threshold"
search_kwargs = {
    "score_threshold": 0.8
}
re1 = vs1.as_retriever(search_type=search_type, search_kwargs=search_kwargs)
re2 = vs2.as_retriever(search_type=search_type, search_kwargs=search_kwargs)
re3 = vs3.as_retriever(search_type=search_type, search_kwargs=search_kwargs)
print(re1.get_relevant_documents("今天天气真好"))
print(re2.get_relevant_documents("今天天气真好"))
print(re3.get_relevant_documents("今天天气真好"))
[Document(page_content='今天天气真好')]
d:\miniconda3\envs\zhiguo\lib\site-packages\langchain_core\vectorstores.py:323: UserWarning: No relevant docs were retrieved using the relevance score threshold 0.8
  warnings.warn(
[]
[Document(page_content='今天天气真好')]

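Cosine similarity recalls correctly here for the same reason as above: it survives on a bug.

To close the experiment, let me restate my conclusion:

For a VectorStore using Euclidean distance, only similarity_search_with_relevance_scores() filters documents correctly by similarity, so the search_type of the corresponding VectorStoreRetriever should be similarity_score_threshold.

With maximum inner product, only similarity_search_with_score() retrieves documents correctly, so the search_type of the corresponding VectorStoreRetriever should be similarity.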

Note: this experiment only covers LangChain's FAISS wrapper; it says nothing about other vector stores.
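The conclusion can be captured in a small lookup so the pairing is not left to memory. This is a hypothetical helper of my own, not part of LangChain, and it encodes only the two combinations validated above for the FAISS wrapper in this version.

```python
from typing import Dict

# Pairings taken from the experiment above (FAISS wrapper only).
RECOMMENDED_SEARCH_TYPE: Dict[str, str] = {
    "EUCLIDEAN_DISTANCE": "similarity_score_threshold",
    "MAX_INNER_PRODUCT": "similarity",
}


def recommended_search_type(strategy_name: str) -> str:
    """Return the search_type that filtered as expected in the experiment.

    strategy_name is the DistanceStrategy member name as a plain string.
    Strategies not covered by the experiment raise instead of guessing.
    """
    try:
        return RECOMMENDED_SEARCH_TYPE[strategy_name]
    except KeyError:
        raise ValueError(
            f"No validated search_type for strategy {strategy_name!r}"
        ) from None


euclidean_choice = recommended_search_type("EUCLIDEAN_DISTANCE")
inner_product_choice = recommended_search_type("MAX_INNER_PRODUCT")

# Possible usage with a real store (not executed here):
# retriever = vs1.as_retriever(
#     search_type=euclidean_choice,
#     search_kwargs={"score_threshold": 0.8},
# )
```

Cosine is deliberately left out: it happened to work in the experiment, but only through the two offsetting errors described above.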

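Afterword

Tracing this one problem to its source made me realize that popular open-source libraries are not beyond criticism and have plenty of problems of their own: things that a single flag could distinguish get wrapped in new functions, too many concepts are introduced, the code becomes confusing, and so on.

Later, when building an Agent with LangChain, I noticed that it seems to have the model output action and action_input in a certain JSON format and then parse that JSON to decide the next step. If the model does not follow the format strictly (for example, by emitting extra text), parsing fails, and this approach does not seem to use the model's own function-calling ability. I have not looked into this carefully, so corrections are welcome.

Over this period of using LangChain, I felt that only two parts, text splitting and the vector retrieval integrations, were really practical, and now I find that retrieval has problems too. Its complicated design makes me feel I would be better off writing my own reusable library for my needs. Or perhaps I simply have not grasped the Zen of LangChain's design.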