Reproducing ANCE-Tele Results in OpenMatch

OpenMatch now supports the ANCE-Tele method for training dense retrieval models.

This package provides the tools needed to integrate ANCE-Tele into your negative-mining loops, along with a script that reproduces the results in the paper.

The Key Concepts

ANCE-Tele builds on the idea of using train-positive negatives, i.e. negative documents that lie near positive samples, as negatives for training. During training, train-positive negatives can be mixed with normal negative samples to achieve better stability and efficiency. Negatives from previous episodes are also kept in later episodes to avoid catastrophic forgetting.

Train-Positive Negatives

To build train-positive negatives, first associate each query with its positive passage. In general, you can read the ID of the first positive passage of each query from the qrels file, look it up in the corpus, and write the content of that passage to a file with its document_id replaced by the query’s query_id.

You can use build_train_positive.py in scripts/ANCE-Tele to do this automatically. The usage is:

python build_train_positive.py --qrel_file <QREL_FILE> --corpus_file <CORPUS_FILE> --save_to <SAVE_FILE_PATH> [--save_name <SAVE_FILE_NAME>]

This script reads the tab-separated <QREL_FILE>, treating the first column as query_id and the third column as the document_id of the query’s first positive passage. It then reads the tab-separated <CORPUS_FILE>, treating the first column as document_id and the rest of the line as contents. For each query read, it writes <query_id>\t<contents>\n to <SAVE_FILE_PATH>/<SAVE_FILE_NAME> (<SAVE_FILE_NAME> defaults to train.positive.txt for compatibility).
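The behaviour described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of build_train_positive.py: the function name and the sample rows are hypothetical, and the inputs are passed as iterables of lines so that the sketch runs without any files.

```python
def build_train_positive(qrel_lines, corpus_lines):
    """Return '<query_id>\\t<contents>' rows for the first positive of each query."""
    # Column 1 is query_id, column 3 is document_id; keep only the first positive.
    first_positive = {}
    for line in qrel_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            first_positive.setdefault(cols[0], cols[2])

    # Keep the contents (everything after the first column) of needed passages only.
    needed = set(first_positive.values())
    contents = {}
    for line in corpus_lines:
        doc_id, _, rest = line.rstrip("\n").partition("\t")
        if doc_id in needed:
            contents[doc_id] = rest

    # Re-key each positive passage by its query_id.
    return [
        f"{query_id}\t{contents[doc_id]}"
        for query_id, doc_id in first_positive.items()
        if doc_id in contents
    ]


# Hypothetical sample data: q1 has two positives, only the first (d2) is used.
qrels = ["q1\t0\td2\t1\n", "q1\t0\td3\t1\n", "q2\t0\td1\t1\n"]
corpus = ["d1\tTitle A\tText A\n", "d2\tTitle B\tText B\n", "d3\tTitle C\tText C\n"]
for row in build_train_positive(qrels, corpus):
    print(row)
```

Each output row keeps the title and text columns of the passage, so the resulting file can be read with the same column names as the corpus.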

Then, with the processed train-positives file (and the corpus embeddings already built; see dr-msmarco-passage.md or the scripts for how), you can use a modified version of retrieve to retrieve the negative documents near these positives:

# --output_dir: path to the corpus embeddings of the dataset in question
# --model_name_or_path: model that does the dense encoding; should be the same model that created the corpus embeddings
# --per_device_eval_batch_size: batch size for encoding the positive documents
# --corpus_path: path to the train-positives file created by the process above
# --encode_query_as_passage: IMPORTANT!!
# --doc_template: template to prompt the document; must be the same template used when encoding the corpus
# --doc_column_names: name of each column of the "corpus" file, used for substitution in templates
# --p_max_len: set this if you used a different max length for documents when encoding the corpus
# --retrieve_depth: number of documents returned per query in the TREC file
# --fp16: keep this setting in sync with the training and other encoding settings
# --use_gpu: use the GPU for retrieval
# --trec_save_path: path to the saved TREC file; its directory should exist before calling the retriever
CUDA_VISIBLE_DEVICES=${CUDA_TO_USE} python -m openmatch.driver.retrieve \
    --output_dir ${EMBEDDING_SAVE_DIR} \
    --model_name_or_path ${MODEL_DIR} \
    --per_device_eval_batch_size 256 \
    --corpus_path ${PATH_TO_POSITIVE_TXT} \
    --encode_query_as_passage \
    --doc_template "${PASSAGE_TEMPLATE}" \
    --doc_column_names id,title,text \
    --p_max_len 128 \
    --retrieve_depth 200 \
    --fp16 \
    --use_gpu \
    --trec_save_path ${RETRIEVE_SAVE_DIR}/train.positive.trec \
    --dataloader_num_workers 1

The --encode_query_as_passage option makes the retriever treat the contents of the train-positives file as documents and pass them through the document encoding pipeline before searching.

This still produces a TREC file in the same format as the one used for hard negatives; processing it with build_hn.py (in scripts/ANCE-Tele) yields a negatives file just as it does for hard negatives.

If you want to customize the settings, keep in mind that with --encode_query_as_passage the contents of the file are treated as passages, so --corpus_path, --doc_template, --doc_column_names and --p_max_len are used to preprocess the input.
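For readers who want to inspect such a TREC file, here is a minimal parsing sketch. It assumes the standard six-column TREC run layout (query_id, Q0, doc_id, rank, score, run tag, separated by whitespace); that the retriever writes exactly this layout is an assumption, and read_trec_run and the sample lines are hypothetical.

```python
from collections import defaultdict


def read_trec_run(lines, depth=None):
    """Parse TREC run lines into {query_id: [(doc_id, score), ...]} ordered by rank."""
    by_query = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        query_id, _q0, doc_id, rank, score, _run_tag = parts
        by_query[query_id].append((int(rank), doc_id, float(score)))

    run = {}
    for query_id, rows in by_query.items():
        rows.sort()  # ascending rank
        run[query_id] = [(doc_id, score) for _rank, doc_id, score in rows[:depth]]
    return run


# Hypothetical run lines; with --encode_query_as_passage the "query" ids are
# the query_ids that were attached to the positive passages.
sample = [
    "q1 Q0 d7 2 11.5 dense\n",
    "q1 Q0 d3 1 12.0 dense\n",
    "q2 Q0 d9 1 9.25 dense\n",
]
run = read_trec_run(sample, depth=200)
print(run["q1"])
```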

Merging Negatives

This package provides a script for merging different negative files into one file:

python combine_negative.py --input_folder_1 <INPUT_1> --input_folder_2 <INPUT_2> --output_folder <OUTPUT_FOLDER>

With <INPUT_1> and <INPUT_2> being folders of training negative files generated by OpenMatch (splitxx.jsonl), this script reads all negatives in folder 1 and appends them to the corresponding query in folder 2. The positives in folder 1 are discarded.
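The merge rule can be sketched as follows. This is an illustration only, not the script's code: it assumes each .jsonl record is an object with "query", "positives" and "negatives" fields holding token-id lists (the field names are an assumption), and it works on in-memory records rather than folders.

```python
import json


def merge_negatives(records_1, records_2):
    """Append the negatives of records_1 to the matching queries of records_2."""
    # Index folder-1 negatives by query; folder-1 positives are ignored.
    extra = {}
    for rec in records_1:
        key = json.dumps(rec["query"])
        extra.setdefault(key, []).extend(rec["negatives"])

    merged = []
    for rec in records_2:
        out = dict(rec)
        out["negatives"] = list(rec["negatives"]) + extra.get(json.dumps(rec["query"]), [])
        merged.append(out)
    return merged


# Hypothetical records: only the first query of folder 2 has a match in folder 1.
folder_1 = [{"query": [101, 7], "positives": [[1]], "negatives": [[5], [6]]}]
folder_2 = [
    {"query": [101, 7], "positives": [[2]], "negatives": [[8]]},
    {"query": [101, 9], "positives": [[3]], "negatives": [[4]]},
]
for rec in merge_negatives(folder_1, folder_2):
    print(json.dumps(rec))
```

Queries that appear only in folder 2 keep their negatives unchanged, and the positives always come from folder 2.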

Reproducing the Results

This package provides the scripts needed to reproduce the ANCE-Tele results on MS MARCO. The settings are tuned to match the original method in the paper. The parameters are listed below.

| Parameter | Value |
| --- | --- |
| Starting Model | coCondenser |
| Dataset | MS MARCO |
| Episode | 3 |
| Mix Ratio | 0.5 |
| Total Negative Retrieved | 200 |
| Per-Episode Train Method | From Beginning |
| Learning Rate | $5\times 10^{-5}$ |
| Warm-up Ratio | 0.1 |

If you’re simply trying to reproduce the results in the paper, you may use the shell script to help you: see Usage of Script for details.

If you’re trying to do more experiments with the method in the paper, you may refer to Manual Operation for a step-by-step guide.

Usage of Script

Important: Please install pytrec_eval into your environment (pip install pytrec_eval) before running the script.

A series of bash scripts is provided to help reproduce the results. They are located in scripts/ANCE-Tele/shells.

Switch to that directory, and configure the variables in openmatch-ANCE-Tele.sh:

# Directory of the starting model.
# If no model is found at this path, the script downloads it automatically.
PLM_DIR=~/datas/OpenMatch-New/models/co-condenser-marco
# Path to the OpenMatch scripts.
OPENMATCH_SCRIPTS_DIR=../..
# Name of the model, used for naming directories.
PLM_NAME=co-condenser
# Path to the datasets.
# By default, the MS MARCO dataset is downloaded to this directory, then preprocessed for training.
COLLECTION_DIR=~/datas/OpenMatch-New/datasets
# Path to store the embedded corpus.
EMBEDDING_DIR=~/datas/OpenMatch-New/embeddings
# Path to store retrieved documents (hard negatives or dev set).
RESULT_DIR=~/datas/OpenMatch-New/retrieved
# Path to store training datasets (data loaded by the trainer).
PROCESSED_DIR=~/datas/OpenMatch-New/train_data
# Path to store the models and checkpoints during training.
MODEL_DIR=~/datas/OpenMatch-New/models
# Path to store training logs (for TensorBoard).
LOG_DIR=~/datas/OpenMatch-New/logs
# Path to store the logs of this script (because training runs in the background).
SCRIPT_LOG_DIR=~/datas/OpenMatch-New/script_log
# Indices of the CUDA devices to use for each episode.
# The length of this list determines the number of episodes to train.
# For each episode, negatives are first mined on the listed devices, then the model is trained on them.
# If multiple devices are listed, distributed processing is used automatically.
CUDA_LIST=("4,7" "4,7" "4,7")

Then run the script with . ./openmatch-ANCE-Tele.sh.

Notes:

  • openmatch-ANCE-Tele.sh calls build_negative.sh, train.sh and evaluation.sh to carry out the negative-building, training and evaluation sections. If you need only one of the sections (e.g. you’re exploring the effect of different negative-building parameters), refer to the comments in the corresponding shell script to see how to call it.

  • The script stops the training procedure once the required checkpoint (step 20000) has been saved. You can change this setting in train.sh.

  • The scripts do no error checking; if any module fails, the script keeps running despite the error. Check the script logs frequently to avoid running on corrupted data.

  • The script redirects all output to the corresponding log in ${SCRIPT_LOG_DIR}; use tail -f ${LOG} in another terminal to watch the log update in real time.

Manual Operation

We may separate the training loop into three phases: Negative Building, Training and Evaluation.

The Negative Building stage uses a trained model and the previous stage’s negatives to generate negatives for the next training stage.

The Training stage trains the starting model on the negatives produced in the previous stage.

Three training stages are performed in this implementation. The detailed sequence is as follows:

| # | Starting Model | Training Data | Stage Name | Result |
| --- | --- | --- | --- | --- |
| 1 | coCondenser | N/A | Epi-0 Mining | Epi-1 Data |
| 2 | coCondenser | Epi-1 Data | Epi-1 Training | Epi-1 Model (20000-step checkpoint) |
| 3 | Epi-1 Model (20000-step checkpoint) | Epi-1 Training | Epi-1 Mining | Epi-2 Data |
| 4 | coCondenser | Epi-2 Data | Epi-2 Training | Epi-2 Model (20000-step checkpoint) |
| 5 | Epi-2 Model (20000-step checkpoint) | Epi-1/2 Training | Epi-2 Mining | Epi-3 Data |
| 6 | coCondenser | Epi-3 Data | Epi-3 Training | Epi-3 Model (complete training) |
| 7 | Epi-3 Model | N/A | Epi-3 Evaluation | Evaluation Result |
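The sequence above follows a simple pattern: each episode mines negatives with the most recent model, then restarts training from the starting model. The sketch below only generates the stage names to make that pattern explicit; build_schedule is a hypothetical helper and runs no mining or training.

```python
def build_schedule(num_episodes=3, plm="coCondenser"):
    """Return (stage_name, starting_model, result) tuples for the whole run."""
    stages = []
    miner = plm  # model used to mine negatives for the next episode
    for episode in range(1, num_episodes + 1):
        stages.append((f"Epi-{episode - 1} Mining", miner, f"Epi-{episode} Data"))
        # Every episode restarts training from the original starting model.
        stages.append((f"Epi-{episode} Training", plm, f"Epi-{episode} Model"))
        miner = f"Epi-{episode} Model"
    stages.append((f"Epi-{num_episodes} Evaluation", miner, "Evaluation Result"))
    return stages


for number, (stage, model, result) in enumerate(build_schedule(), start=1):
    print(number, stage, model, result, sep=" | ")
```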

The following operations assume that you have downloaded coCondenser into path ${PLM_DIR}.

0. Downloading and Preprocessing the Dataset

First, download and extract the RocketQA-processed MS MARCO dataset into a folder (referred to as $COLLECTION_DIR):

wget  https://rocketqa.bj.bcebos.com/corpus/marco.tar.gz
tar -zxf marco.tar.gz
rm -rf marco.tar.gz
mv -v ./marco/* ./

Download the official train qrels into this folder to overwrite the RocketQA-processed one:

wget https://msmarco.blob.core.windows.net/msmarcoranking/qrels.train.tsv -O qrels.train.tsv

Merge the title and contents of the passages together into one file:

join  -t "$(echo -en '\t')"  -e '' -a 1  -o 1.1 2.2 1.2  <(sort -k1,1 para.txt) <(sort -k1,1 para.title.txt) | sort -k1,1 -n > corpus.tsv
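For readers less familiar with join, the same merge can be sketched in Python. This is an equivalent illustration, not part of the pipeline; it assumes that para.txt and para.title.txt each have exactly two tab-separated columns (id, then text or title) and that ids are integers. merge_title_and_text and the sample rows are hypothetical.

```python
def merge_title_and_text(para_lines, title_lines):
    """Return 'id<TAB>title<TAB>text' rows sorted numerically by id."""
    titles = {}
    for line in title_lines:
        pid, _, title = line.rstrip("\n").partition("\t")
        titles[pid] = title

    rows = []
    for line in para_lines:
        pid, _, text = line.rstrip("\n").partition("\t")
        # Passages without a title get an empty title column (join -a 1 -e '').
        rows.append((pid, titles.get(pid, ""), text))

    rows.sort(key=lambda row: int(row[0]))  # mirrors the final `sort -k1,1 -n`
    return ["\t".join(row) for row in rows]


para = ["10\ttext ten\n", "2\ttext two\n"]
titles = ["2\tTitle two\n"]
for row in merge_title_and_text(para, titles):
    print(row)
```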

Preprocess the positive samples to bind them with queries (see Key Concepts for details):

python $OPENMATCH_SCRIPTS_DIR/ANCE-Tele/build_train_positive.py \
--qrel_file $COLLECTION_DIR/qrels.train.tsv \
--corpus_file $COLLECTION_DIR/corpus.tsv \
--save_to $COLLECTION_DIR

Finally, add two columns to the dev set qrel file to make it compatible with qrel-processing scripts:

awk '{printf "%s 0 %s 1\n",$1,$2}' $COLLECTION_DIR/qrels.dev.tsv | sed "s/ /\t/g" > $COLLECTION_DIR/qrels.dev.restructured.tsv
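The awk and sed pipeline above turns each two-column line into four tab-separated columns. A Python sketch of the same transformation follows; restructure_qrels and the sample ids are hypothetical.

```python
def restructure_qrels(lines):
    """Turn 'qid pid' lines into 'qid<TAB>0<TAB>pid<TAB>1' lines."""
    restructured = []
    for line in lines:
        fields = line.split()  # like awk, split on any whitespace
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        restructured.append(f"{fields[0]}\t0\t{fields[1]}\t1")
    return restructured


for row in restructure_qrels(["1001\t7001\n", "1002 7002\n"]):
    print(row)
```

The added second column is the unused iteration field and the added fourth column is the relevance label, which is the four-column qrels layout read by TREC-style evaluation tools.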

1. Negative Building

First, encode the passages with the model trained in the previous episode (the starting model for the first episode):

# --output_dir: path to save the embedded passages
# --model_name_or_path: path to the model used to encode
CUDA_VISIBLE_DEVICES=${CUDA} python -m openmatch.driver.build_index \
    --output_dir ${EMBEDDING_SAVE_DIR} \
    --model_name_or_path ${CURRENT_MODEL_DIR} \
    --per_device_eval_batch_size 1024 \
    --corpus_path ${COLLECTION_DIR}/corpus.tsv \
    --doc_template "<title>[SEP]<text>" \
    --doc_column_names id,title,text \
    --q_max_len 32 \
    --p_max_len 128 \
    --fp16 \
    --dataloader_num_workers 1

Note:

  • per_device_eval_batch_size is the number of passages processed in one batch. It affects the GPU memory consumed in this stage but not the resulting embeddings, so feel free to change it to balance speed against GPU memory consumption.

  • If you want to use multiple GPUs for this step, use torch.distributed.launch like this:

    CUDAs_TO_USE=0,1
    PORT=19041
    CUDA_VISIBLE_DEVICES=${CUDAs_TO_USE} python -m torch.distributed.launch --nproc_per_node=2 --master_port ${PORT} \
    -m openmatch.driver.build_index \
    # Other arguments here
    

    The script uses only one GPU for this step because of a bug in multi-GPU processing; that bug should be fixed by now, so feel free to use multiple GPUs.

  • The doc_template, q_max_len and p_max_len arguments are set to match the settings in the original paper.

Then, retrieve passages for train queries to obtain hard negatives:

# --output_dir: path to the passages encoded in the last step
# --model_name_or_path: path to the model used to encode; must be the same as in the last step
# --trec_save_path: create the target directory first
CUDA_VISIBLE_DEVICES=$CUDA_TO_USE python -m openmatch.driver.retrieve \
    --output_dir ${EMBEDDING_SAVE_DIR} \
    --model_name_or_path $CURRENT_MODEL_DIR \
    --per_device_eval_batch_size 256 \
    --query_path $COLLECTION_DIR/train.query.txt \
    --query_template "<text>" \
    --query_column_names id,text \
    --q_max_len 32 \
    --retrieve_depth 200 \
    --fp16 \
    --use_gpu \
    --trec_save_path ${RETRIEVE_SAVE_DIR}/train.trec \
    --dataloader_num_workers 1

This creates a TREC-style file train.trec in ${RETRIEVE_SAVE_DIR}.

Notes:

  • You can use multiple GPUs for this stage: just set CUDA_TO_USE to something like CUDA_TO_USE=0,1.

    torch.distributed.launch is not required; the index is automatically split across all the devices listed.

  • ${RETRIEVE_SAVE_DIR} must exist before running this command; the retriever may not create this directory automatically.

  • If you are short on GPU memory (about 31 GiB is required in total) or have not installed faiss-gpu, you can remove --use_gpu from the arguments; the nearest-neighbor search then runs on the CPU, which is significantly slower.

    • You can also use ${OPENMATCH_SCRIPTS_DIR}/split_embeddings.py to split the embeddings into several splits, and then call successive_retrieve instead of retrieve; this searches the queries on each split of the corpus embeddings in turn instead of on the whole corpus at once, reducing the GPU memory needed for each round of search.
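The idea behind searching split by split can be illustrated with a small pure-Python sketch: each split is searched on its own and only a running top-k is kept, which gives the same result as searching the whole corpus at once. This is a conceptual illustration under the assumption of inner-product scoring, not the code of successive_retrieve; all names and vectors below are hypothetical.

```python
import heapq


def search_split(query_vec, split, k):
    """Return the k best (score, doc_id) pairs of one split by inner product."""
    scored = (
        (sum(q * d for q, d in zip(query_vec, doc_vec)), doc_id)
        for doc_id, doc_vec in split
    )
    return heapq.nlargest(k, scored)


def successive_search(query_vec, splits, k):
    """Search the splits one at a time, keeping a running top-k."""
    best = []
    for split in splits:  # only one split has to be held in memory at a time
        best = heapq.nlargest(k, best + search_split(query_vec, split, k))
    return [(doc_id, score) for score, doc_id in best]


corpus = [("d1", [1.0, 0.0]), ("d2", [0.5, 0.5]), ("d3", [0.0, 1.0]), ("d4", [0.9, 0.2])]
splits = [corpus[:2], corpus[2:]]
print(successive_search([1.0, 0.1], splits, k=2))
```

The trade-off is memory against time: each round holds a smaller index, but every query is searched once per split.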

You also need to retrieve the neighbors of the positive passages to obtain train-positive negatives:

# --encode_query_as_passage: be sure to add this option
CUDA_VISIBLE_DEVICES=$CUDA_TO_USE python -m openmatch.driver.retrieve \
    --output_dir ${EMBEDDING_SAVE_DIR} \
    --model_name_or_path $CURRENT_MODEL_DIR \
    --per_device_eval_batch_size 256 \
    --corpus_path $COLLECTION_DIR/train.positive.txt \
    --encode_query_as_passage \
    --doc_template "<title>[SEP]<text>" \
    --doc_column_names id,title,text \
    --p_max_len 128 \
    --retrieve_depth 200 \
    --fp16 \
    --use_gpu \
    --trec_save_path ${RETRIEVE_SAVE_DIR}/train.positive.trec \
    --dataloader_num_workers 1

This creates a TREC-style file train.positive.trec in ${RETRIEVE_SAVE_DIR}.

Notes:

  • Be sure to pass --corpus_path, --doc_template and --p_max_len; they are used instead of the query-related arguments, since the inputs are passages. See The Key Concepts for details.

Next, we build negative files from retrieved TREC files:

# Hard negatives
# --tokenizer_name: path to the model used; only its tokenizer is used
# --hn_file: TREC file for building negatives
# --save_to: path to save the .jsonl files
# --depth: same as in the original paper
python $OPENMATCH_SCRIPTS_DIR/ANCE-Tele/build_hn.py \
    --tokenizer_name $CURRENT_MODEL_DIR \
    --hn_file ${RETRIEVE_SAVE_DIR}/train.trec \
    --qrels $COLLECTION_DIR/qrels.train.tsv \
    --queries $COLLECTION_DIR/train.query.txt \
    --collection $COLLECTION_DIR/corpus.tsv \
    --save_to ${TRAIN_SAVE_DIR}/hard_neg \
    --depth 200 \
    --n_sample 30
# Train-positive negatives
python $OPENMATCH_SCRIPTS_DIR/ANCE-Tele/build_hn.py \
    --tokenizer_name $CURRENT_MODEL_DIR \
    --hn_file ${RETRIEVE_SAVE_DIR}/train.positive.trec \
    --qrels $COLLECTION_DIR/qrels.train.tsv \
    --queries $COLLECTION_DIR/train.query.txt \
    --collection $COLLECTION_DIR/corpus.tsv \
    --save_to ${TRAIN_SAVE_DIR}/train_pos \
    --depth 200 \
    --n_sample 30

This samples 30 negatives per query from each TREC file and groups them with the positives to form trainer-compatible .jsonl training data files.

Note that both train.trec and train.positive.trec should be used.
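The sampling step can be sketched as below. This is an assumption about the general technique, not the code of build_hn.py: it takes the top --depth retrieved passages of a query, drops any passage marked positive in the qrels, and samples --n_sample of the rest. The function name, the fixed seed and the sample ids are hypothetical.

```python
import random


def sample_negatives(ranked_doc_ids, positive_ids, depth=200, n_sample=30, seed=0):
    """Sample n_sample negatives from the top-`depth` results, skipping positives."""
    candidates = [d for d in ranked_doc_ids[:depth] if d not in positive_ids]
    if len(candidates) <= n_sample:
        return candidates  # not enough candidates: keep them all
    return random.Random(seed).sample(candidates, n_sample)


# Hypothetical ranking of 1000 passages in which d0 is the query's positive.
ranked = [f"d{i}" for i in range(1000)]
negatives = sample_negatives(ranked, positive_ids={"d0"})
print(len(negatives), negatives[:5])
```

Filtering out positives matters in particular for train.positive.trec, where the positive passage itself is typically among the nearest neighbors of its own embedding.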

Then, merge the two sets of training data into one:

python $OPENMATCH_SCRIPTS_DIR/ANCE-Tele/combine_negative.py \
    --input_folder_1 ${TRAIN_SAVE_DIR}/train_pos \
    --input_folder_2 ${TRAIN_SAVE_DIR}/hard_neg \
    --output_folder ${TRAIN_SAVE_DIR}/cur_mixed

This appends the train-positive negatives to the corresponding queries in the hard-negative files and saves the result to a new folder.

Because the trainer randomly selects negatives from the training data, and both sources contribute the same number of negatives per query (30 each), this ensures that the mixing ratio in the paper (0.5) is satisfied in expectation.

If training files from the previous episode are available, they have to be merged as well:
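The arithmetic behind the mixing ratio can be checked with a short sketch. It assumes 30 train-positive negatives and 30 hard negatives per query (the --n_sample values used above) and uniform random selection by the trainer; the number of negatives drawn per step (15) is only an illustrative value, and both function names are hypothetical.

```python
import random


def expected_mix_ratio(n_train_pos=30, n_hard=30):
    """Expected share of train-positive negatives under uniform sampling."""
    return n_train_pos / (n_train_pos + n_hard)


def simulated_mix_ratio(n_train_pos=30, n_hard=30, n_draw=15, trials=2000, seed=0):
    """Estimate the same share by repeatedly drawing n_draw negatives."""
    pool = ["train_pos"] * n_train_pos + ["hard"] * n_hard
    rng = random.Random(seed)
    hits = sum(rng.sample(pool, n_draw).count("train_pos") for _ in range(trials))
    return hits / (n_draw * trials)


print(expected_mix_ratio())
print(round(simulated_mix_ratio(), 3))
```

Changing --n_sample for only one of the two sources would shift the ratio accordingly, e.g. expected_mix_ratio(30, 90) gives 0.25.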

# --input_folder_2: path to the training files used in the last episode
python $OPENMATCH_SCRIPTS_DIR/ANCE-Tele/combine_negative.py \
    --input_folder_1 ${TRAIN_SAVE_DIR}/cur_mixed \
    --input_folder_2 ${LAST_TRAIN_SAVE_DIR} \
    --output_folder ${TRAIN_SAVE_DIR}

If there is no previous episode, you can simply move the .jsonl files out of the cur_mixed directory into ${TRAIN_SAVE_DIR}.

Because the trainer loads the .jsonl files using glob, they are loaded in arbitrary order. You may want to merge the training files into one file to remove this source of nondeterminism:

cat ${TRAIN_SAVE_DIR}/*.hn.jsonl > ${TRAIN_SAVE_DIR}/train.hn.jsonl-temp
rm -v ${TRAIN_SAVE_DIR}/*.hn.jsonl
mv -v ${TRAIN_SAVE_DIR}/train.hn.jsonl-temp ${TRAIN_SAVE_DIR}/train.hn.jsonl
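The same merge can be done from Python, where sorting the glob result makes the concatenation order explicit. This is a sketch mirroring the three shell commands above; merge_jsonl_files is a hypothetical helper, and the demo runs in a temporary directory rather than in ${TRAIN_SAVE_DIR}.

```python
import glob
import os
import tempfile


def merge_jsonl_files(folder, pattern="*.hn.jsonl", output_name="train.hn.jsonl"):
    """Concatenate matching files in sorted order, then replace them with one file."""
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    temp_path = os.path.join(folder, output_name + "-temp")
    with open(temp_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    out.write(line if line.endswith("\n") else line + "\n")
    for path in paths:
        os.remove(path)
    final_path = os.path.join(folder, output_name)
    os.replace(temp_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as folder:
    for name, text in [("split01.hn.jsonl", '{"id": 2}\n'), ("split00.hn.jsonl", '{"id": 1}\n')]:
        with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    with open(merge_jsonl_files(folder), encoding="utf-8") as handle:
        print(handle.read(), end="")
```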

2. Training

Call OpenMatch’s trainer to train the model on the negatives mined:

# --output_dir: path to save the trained model and checkpoints
# --model_name_or_path: starting model; this is $PLM_DIR for every episode in this implementation
# --train_path: Path to merged training file; Must not be a single file
# --per_device_train_batch_size: the total batch size is 8
# --train_n_passages: 16 for the first episode; 32 for later episodes
# --logging_dir: path to save TensorBoard data
# --use_mapping_dataset: important!
# --dataloader_drop_last: drop the last incomplete batch
CUDA_VISIBLE_DEVICES=$CUDA_TO_USE python -m openmatch.driver.train_dr \
    --output_dir $CURRENT_MODEL_DIR \
    --model_name_or_path $PREVIOUS_MODEL_DIR \
    --do_train \
    --save_steps 20000 \
    --train_path ${TRAIN_FILE} \
    --fp16 \
    --per_device_train_batch_size 8 \
    --train_n_passages $N_PASSAGES \
    --learning_rate 5e-6 \
    --q_max_len 32 \
    --p_max_len 128 \
    --num_train_epochs 3 \
    --logging_dir ${LOG_PATH} \
    --use_mapping_dataset \
    --dataloader_drop_last

Notes:

  • In the current version of OpenMatch, using iterable datasets in training may introduce a bias in negative selection and thus affect the performance of the trained models. Please use --use_mapping_dataset, as this problem does not occur with mapping datasets.

    • However, Hugging Face datasets then generates a cache in ~/.cache of the same size as the training data. If you are running low on disk space there, you can specify --cache_dir ${CACHE_DIR} to store the cache somewhere else.

  • Be aware that training uses a linear warm-up stage whose length is 10% of all training steps; thus the number of training epochs and --dataloader_drop_last both affect every checkpoint generated.

  • Since this implementation uses the 20000-step checkpoint to mine the next episode’s negatives, you may want to stop training as early as step 20000. However, because of the warm-up stage, you should still set the number of training epochs to 3 to avoid a different learning-rate schedule.

  • Note that this implementation trains on 16 passages per query in the first training stage, and on 32 passages in later stages.

  • You can also use multiple GPUs for training. However, you need to modify the command:

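# CUDA_NUM is the number of GPUs listed in CUDA_TO_USE
# --per_device_train_batch_size: the total batch size should be 8, so divide it accordingly
# --negatives_x_device: set this so that the gradients are computed correctly
CUDA_VISIBLE_DEVICES=$CUDA_TO_USE python -m torch.distributed.launch --nproc_per_node=${CUDA_NUM} --master_port ${PORT} \
    -m openmatch.driver.train_dr \
    --per_device_train_batch_size $((8 / CUDA_NUM)) \
    --negatives_x_device \
    # Other arguments could be left as-is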
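Why the number of epochs matters even when training stops at step 20000 can be seen from the learning-rate schedule. The sketch below assumes linear warm-up over the first 10% of steps followed by linear decay to zero (the decay part is an assumption about the scheduler), uses the --learning_rate value from the command above, and uses a purely hypothetical number of steps per epoch.

```python
import math


def linear_warmup_lr(step, total_steps, base_lr=5e-6, warmup_ratio=0.1):
    """Learning rate at `step` for linear warm-up followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


steps_per_epoch = 50_000  # hypothetical; depends on dataset size and batch size
for epochs in (1, 3):
    total = steps_per_epoch * epochs
    print(epochs, linear_warmup_lr(20_000, total))
```

With fewer epochs the total step count shrinks, so both the warm-up length and the learning rate reached at step 20000 change, which is why the epoch count should stay at 3 even if training is cut short.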

3. Evaluation

First, encode the passages using build_index.

# --output_dir: path to save the embedded passages
# --model_name_or_path: path to the model used to encode
CUDA_VISIBLE_DEVICES=${CUDA} python -m openmatch.driver.build_index \
    --output_dir ${EMBEDDING_SAVE_DIR} \
    --model_name_or_path ${CURRENT_MODEL_DIR} \
    --per_device_eval_batch_size 1024 \
    --corpus_path ${COLLECTION_DIR}/corpus.tsv \
    --doc_template "<title>[SEP]<text>" \
    --doc_column_names id,title,text \
    --q_max_len 32 \
    --p_max_len 128 \
    --fp16 \
    --dataloader_num_workers 1

Then, retrieve passages for queries. This time, use dev set queries:

# --trec_save_path: remember to create the target directory first
CUDA_VISIBLE_DEVICES=$CUDA_TO_USE python -m openmatch.driver.retrieve \
    --output_dir ${EMBEDDING_SAVE_DIR} \
    --model_name_or_path $CURRENT_MODEL_DIR \
    --per_device_eval_batch_size 256 \
    --query_path $COLLECTION_DIR/dev.query.txt \
    --query_template "<text>" \
    --query_column_names id,text \
    --q_max_len 32 \
    --fp16 \
    --trec_save_path ${RETRIEVE_SAVE_DIR}/dev.trec \
    --dataloader_num_workers 1 \
    --use_gpu

These two steps are identical to the corresponding steps in 1. Negative Building; only the queries used are different.

Once the dev-set passages have been retrieved (in standard TREC format), you can use an evaluation script to evaluate the model’s performance.

OpenMatch ships with an evaluation script (which requires the pytrec_eval package): scripts/evaluate.py.

You can use this script to compute MRR@10:

# First positional argument: the restructured qrels file
# Second positional argument: the retrieved TREC file for the dev set
python $OPENMATCH_SCRIPTS_DIR/evaluate.py -m mrr_cut.10 \
    ${COLLECTION_DIR}/qrels.dev.restructured.tsv \
    ${RETRIEVE_SAVE_DIR}/dev.trec
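As a sanity check on the reported number, MRR@10 can also be computed by hand. The sketch below is a plain-Python illustration of the metric, not the implementation used by evaluate.py or pytrec_eval; it averages over the queries in the qrels, and mrr_at_k and the sample data are hypothetical.

```python
def mrr_at_k(run, qrels, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k."""
    if not qrels:
        return 0.0
    total = 0.0
    for query_id, relevant in qrels.items():
        for rank, doc_id in enumerate(run.get(query_id, [])[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels)


# q1 finds its positive at rank 2; q2 retrieves nothing relevant.
run = {"q1": ["d3", "d1"], "q2": ["d5"]}
qrels = {"q1": {"d1"}, "q2": {"d9"}}
print(mrr_at_k(run, qrels))
```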