# Reproducing ANCE-Tele Results in OpenMatch
OpenMatch now supports ANCE-Tele, a method for training dense retrieval models.
This package provides the tools needed to integrate ANCE-Tele into your negative mining loops, along with a script that reproduces the results in the paper.
## The Key Concepts
ANCE-Tele builds on the idea of using train-positive negatives, i.e. negative documents near positive samples, as negatives for training.
During training, train-positive negatives can be mixed with normal negative samples to achieve better stability and efficiency.
Negatives from previous episodes are also kept in later stages to avoid catastrophic forgetting.
### Train-Positive Negatives
To build train-positive negative samples, first associate each query with its positive passage.
In general, you can read the first positive passage of each query from `qrels.txt`, look it up in the corpus, and write the content of that passage to a file with its `document_id` replaced by the query's `query_id`.
The `build_train_positive.py` script in `scripts/ANCE-Tele` does this automatically. The usage is:
`python build_train_positive.py --qrel_file --corpus_file --save_to [--save_name ]`
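This script reads the tab-separated file given by `--qrel_file`, treating the first column as `query_id` and the third column as the first positive `document_id`.
Then, it reads the tab-separated file given by `--corpus_file`, treating the first column as `document_id` and everything after it as the contents.
For each query read, it writes one line, consisting of the `query_id`, a tab and the contents of the positive passage, to the file named by `--save_name` in the `--save_to` directory (the name defaults to `train.positive.txt` for compatibility).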
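The steps above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the code of `build_train_positive.py`; the function name `build_train_positive` and the toy data are hypothetical, and keeping only the first positive per query is an assumption based on the description.

```python
def build_train_positive(qrel_lines, corpus_lines):
    """Return 'query_id<TAB>passage contents' lines, one per query."""
    # Map each document_id to everything after the first tab on its line.
    corpus = {}
    for line in corpus_lines:
        doc_id, _, contents = line.rstrip("\n").partition("\t")
        corpus[doc_id] = contents

    first_positive = {}
    for line in qrel_lines:
        cols = line.rstrip("\n").split("\t")
        query_id, doc_id = cols[0], cols[2]
        # Assumption: only the first positive listed for each query is kept.
        first_positive.setdefault(query_id, doc_id)

    return [f"{qid}\t{corpus[did]}\n" for qid, did in first_positive.items()]


# Toy data (hypothetical): qrels columns are query_id, 0, document_id, relevance.
qrels = ["q1\t0\td2\t1\n", "q1\t0\td3\t1\n", "q2\t0\td1\t1\n"]
corpus = ["d1\tTitle A\tText A\n", "d2\tTitle B\tText B\n", "d3\tTitle C\tText C\n"]
print("".join(build_train_positive(qrels, corpus)), end="")
```

Each output line keeps the passage's title and text columns but carries the query's ID in the first column, which is what lets the retriever later treat it as a "query" with document-shaped content.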
Then, with the processed train-positives file (and embeddings for corpus built; see `dr-msmarco-passage.md` or the scripts to see how), you may use a modified version of `retrieve` to retrieve negative documents near these positives:
```bash
# The arguments are collected in an array so that each one can carry a comment.
RETRIEVE_ARGS=(
    --output_dir "${EMBEDDING_SAVE_DIR}"    # Path to corpus embeddings of the dataset in question
    --model_name_or_path "${MODEL_DIR}"     # Path to the model that does the dense encoding; should be the same model that created the corpus embeddings
    --per_device_eval_batch_size 256        # The batch size for encoding positive documents
    --corpus_path "${PATH_TO_POSITIVE_TXT}" # Path to the train-positives file created by the process above
    --encode_query_as_passage               # IMPORTANT!!
    --doc_template "${PASSAGE_TEMPLATE}"    # Template to prompt the document; must be the same template used when encoding the corpus
    --doc_column_names id,title,text        # Name of each column of the "corpus" file, used for substitution in templates
    --p_max_len 128                         # Use this if you used a different max len for documents when encoding the corpus
    --retrieve_depth 200                    # Number of documents presented per query in the TREC file
    --fp16                                  # Keep this setting synced with the training and other encoding settings
    --use_gpu                               # Use the GPU for retrieval
    --trec_save_path "${RETRIEVE_SAVE_DIR}/train.positive.trec" # Path to the saved TREC file; the directory should exist before calling the retriever
    --dataloader_num_workers 1
)
CUDA_VISIBLE_DEVICES=${CUDA_TO_USE} python -m openmatch.driver.retrieve "${RETRIEVE_ARGS[@]}"
```
The `--encode_query_as_passage` option makes the retriever treat the contents of the train-positives file as documents and process them through the document encoding pipeline before searching.
The result is still a TREC file in the same format as the one produced for hard negatives; processing it with `build_hn.py` (in `scripts/ANCE-Tele`) yields a negatives file just as it does for hard negatives.
If you want to customize the settings, keep in mind that, with `--encode_query_as_passage`, the contents of the file are treated as passages, so `--corpus_path`, `--doc_template`, `--doc_column_names` and `--p_max_len` are used to preprocess the input.
### Merging Negatives
This package provides a script for merging different negative files into one file:
`combine_negative.py --input_folder_1 --input_folder_2 --output_folder `
Here `--input_folder_1` and `--input_folder_2` point to folders of training negative files generated by OpenMatch (`splitxx.jsonl`). The script reads all negatives in folder 1 and appends them to the corresponding query in folder 2. The positives in folder 1 are discarded.
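The merging rule can be illustrated with a minimal Python sketch. It is not the code of the merging script: the function name `combine_negatives` is hypothetical, and it assumes that every JSONL record has `query`, `positives` and `negatives` keys and that records belong to the same query when their `query` values are equal.

```python
import json


def combine_negatives(lines_1, lines_2):
    """Append negatives from folder-1 records to matching folder-2 records."""
    # Assumption: records are matched by the value of their "query" field.
    extra = {}
    for line in lines_1:
        rec = json.loads(line)
        extra.setdefault(json.dumps(rec["query"]), []).extend(rec["negatives"])

    merged = []
    for line in lines_2:
        rec = json.loads(line)
        # Positives come from folder 2 only; folder-1 positives are dropped.
        rec["negatives"] = rec["negatives"] + extra.get(json.dumps(rec["query"]), [])
        merged.append(json.dumps(rec))
    return merged


# Toy records (hypothetical token-id lists).
folder_1 = [json.dumps({"query": [1, 2], "positives": [[9]], "negatives": [[3], [4]]})]
folder_2 = [json.dumps({"query": [1, 2], "positives": [[7]], "negatives": [[5]]})]
result = [json.loads(r) for r in combine_negatives(folder_1, folder_2)]
print(result)
```

The merged record keeps the positives of folder 2 and holds the negatives of both folders, so a trainer that samples negatives uniformly draws from both sources.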
## Reproducing the Results
This package includes the scripts needed to reproduce the ANCE-Tele results on MS MARCO. The settings are tuned to match the original method in the paper; the parameters are listed below.
| Parameter | Value |
|-----------|-------|
| Starting Model | coCondenser |
| Dataset | MS MARCO |
| Episode | 3 |
| Mix Ratio | 0.5 |
| Total Negative Retrieved | 200 |
| Per-Episode Train Method | From Beginning |
| Learning Rate | $5\times 10^{-6}$ |
| Warm-up Ratio | 0.1 |
If you're simply trying to reproduce the results in the paper, you may use the shell script to help you: see [Usage of Script](#usage-of-script) for details.
If you're trying to do more experiments with the method in the paper, you may refer to [Manual Operation](#manual-operation) for a step-by-step guide.
### Usage of Script
**Important: Please install pytrec_eval into your environment (`pip install pytrec_eval`) before running the script.**
A series of bash scripts is provided in `scripts/ANCE-Tele/shells` to help reproduce the results.
Switch to that directory, and configure the variables in `openmatch-ANCE-Tele.sh`:
```bash
# The directory of the starting model.
# If no model is found at this path, the script downloads it automatically.
PLM_DIR=~/datas/OpenMatch-New/models/co-condenser-marco
# Path to openmatch scripts.
OPENMATCH_SCRIPTS_DIR=../..
# The name of model, used for naming directories.
PLM_NAME=co-condenser
# The path to the datasets.
# By default, MS MARCO dataset would be downloaded to this directory, then preprocessed for further training.
COLLECTION_DIR=~/datas/OpenMatch-New/datasets
# Path to store embedded corpus.
EMBEDDING_DIR=~/datas/OpenMatch-New/embeddings
# Path to store retrieved documents (hard-negative or dev set)
RESULT_DIR=~/datas/OpenMatch-New/retrieved
# Path to store training datasets (datas loaded for trainer)
PROCESSED_DIR=~/datas/OpenMatch-New/train_data
# Path to store the models and checkpoints during training.
MODEL_DIR=~/datas/OpenMatch-New/models
# Path to store logs of training (for tensorboard to recall)
LOG_DIR=~/datas/OpenMatch-New/logs
# Path to store logs of this script (because training runs in the background)
SCRIPT_LOG_DIR=~/datas/OpenMatch-New/script_log
# Indices of the CUDA devices to use for each episode.
# The length of this list determines the number of episodes to train.
# In each episode, negatives are first mined on the listed devices, and the model is then trained on the same devices.
# If multiple devices are listed, distributed processing is used automatically.
CUDA_LIST=("4,7" "4,7" "4,7")
```
Then run the script with `. ./openmatch-ANCE-Tele.sh`.
Notes:
* `openmatch-ANCE-Tele.sh` calls `build_negative.sh`, `train.sh` and `evaluation.sh` to carry out the negative-building, training and evaluation sections.
If only one of the sections is required (e.g. you're exploring the effect of different negative-building parameters), refer to the comments in the corresponding shell script for how to call it.
* The script stops the training procedure once the required checkpoint (step 20000) has been saved. Change this setting in `train.sh`.
* The scripts have no error checking; if any module fails, the script keeps running despite the error. Please check the script logs frequently to avoid continuing with corrupted data.
* The script redirects all output to the corresponding log in `${SCRIPT_LOG_DIR}`; use `tail -f ${LOG}` in another terminal to follow the log in real time.
### Manual Operation
The training loop can be separated into three phases: Negative Building, Training and Evaluation.
The Negative Building stage uses a trained model and the previous stage's negatives to generate negatives for the next training stage.
The Training stage trains the starting model on the negatives produced in the previous stage.
Three Training stages are performed in this implementation. The detailed sequence is as follows:
|# |Starting Model|Training Data|Stage Name|Result|
|--|--------------|-------------|-----|------|
|1 |coCondenser|N/A|Epi-0 Mining|Epi-1 Data|
|2 |coCondenser|Epi-1 Data|Epi-1 Training | Epi-1 Model (20000-step checkpoint)|
|3 |Epi-1 Model (20000-step checkpoint) |Epi-1 Training|Epi-1 Mining|Epi-2 Data|
|4 |coCondenser|Epi-2 Data|Epi-2 Training | Epi-2 Model (20000-step checkpoint)|
|5 |Epi-2 Model (20000-step checkpoint) |Epi-1/2 Training|Epi-2 Mining|Epi-3 Data|
|6 |coCondenser|Epi-3 Data|Epi-3 Training | Epi-3 Model(Complete Training)|
|7 |Epi-3 Model|N/A|Epi-3 Evaluation|Evaluation Result|
The following operations assume that you have downloaded [coCondenser](https://huggingface.co/Luyu/co-condenser-marco) into path `${PLM_DIR}`.
#### 0. Downloading and Preprocessing the Dataset
First, download and extract RocketQA processed MS MARCO dataset into a folder (named `$COLLECTION_DIR`):
```bash
wget https://rocketqa.bj.bcebos.com/corpus/marco.tar.gz
tar -zxf marco.tar.gz
rm -rf marco.tar.gz
mv -v ./marco/* ./
```
Download the official train qrel into this folder to overwrite the RocketQA processed one:
```bash
wget https://msmarco.blob.core.windows.net/msmarcoranking/qrels.train.tsv -O qrels.train.tsv
```
Merge the title and contents of the passages together into one file:
```bash
join -t "$(echo -en '\t')" -e '' -a 1 -o 1.1 2.2 1.2 <(sort -k1,1 para.txt) <(sort -k1,1 para.title.txt) | sort -k1,1 -n > corpus.tsv
```
Preprocess the positive samples to bind them with queries (see **Key Concepts** for details):
```bash
python $OPENMATCH_SCRIPTS_DIR/ANCE-Tele/build_train_positive.py \
--qrel_file $COLLECTION_DIR/qrels.train.tsv \
--corpus_file $COLLECTION_DIR/corpus.tsv \
--save_to $COLLECTION_DIR
```
Finally, add two columns to the dev set qrel file to make it compatible with qrel-processing scripts:
```bash
awk '{printf "%s 0 %s 1\n",$1,$2}' $COLLECTION_DIR/qrels.dev.tsv | sed "s/ /\t/g" > $COLLECTION_DIR/qrels.dev.restructured.tsv
```
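For readers less familiar with `awk`, the same restructuring can be sketched in Python. The function name `restructure_qrel` is hypothetical; the sketch assumes each input line holds a query ID and a document ID separated by whitespace, as the `awk` command above does.

```python
def restructure_qrel(line):
    """Turn 'query_id doc_id' into 'query_id<TAB>0<TAB>doc_id<TAB>1'."""
    fields = line.split()  # whitespace split, like awk's default field splitting
    return "\t".join([fields[0], "0", fields[1], "1"])


# Toy line (hypothetical IDs).
print(restructure_qrel("1048585\t7187158"))
```

The inserted `0` and `1` columns give the four-column layout (query ID, iteration, document ID, relevance) that qrel-processing tools generally expect.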
#### 1. Negative Building
First, encode the passages with the model trained in the last episode:
```bash
# ${EMBEDDING_SAVE_DIR}: path to save the embedded passages
# ${CURRENT_MODEL_DIR}: path to the model used to encode
CUDA_VISIBLE_DEVICES=${CUDA} python -m openmatch.driver.build_index \
    --output_dir ${EMBEDDING_SAVE_DIR} \
    --model_name_or_path ${CURRENT_MODEL_DIR} \
    --per_device_eval_batch_size 1024 \
    --corpus_path ${COLLECTION_DIR}/corpus.tsv \
    --doc_template "[SEP]" \
    --doc_column_names id,title,text \
    --q_max_len 32 \
    --p_max_len 128 \
    --fp16 \
    --dataloader_num_workers 1
```
Note:
* `per_device_eval_batch_size` is the number of passages processed in one batch. It affects the VMEM consumed in this stage but not the resulting embeddings, so feel free to change it to balance speed against VMEM consumption.
* If you want to use multiple GPUs for this work, use `torch.distributed.launch` like this:
```bash
CUDAs_TO_USE=0,1
PORT=19041
CUDA_VISIBLE_DEVICES=${CUDAs_TO_USE} python -m torch.distributed.launch --nproc_per_node=2 --master_port ${PORT} \
-m openmatch.driver.build_index \
# Other arguments here
```
The script uses only 1 GPU for this work because of a bug with multi-GPU processing; this bug should have been fixed by now, so feel free to use multiple GPUs.
* The `doc_template`, `q_max_len` and `p_max_len` arguments are set to match the settings in the original paper.
Then, retrieve passages for train queries to obtain hard negatives:
```bash
# ${EMBEDDING_SAVE_DIR}: path to the passages encoded in the last step
# ${CURRENT_MODEL_DIR}: path to the model used to encode; must be the same as in the last step
# ${RETRIEVE_SAVE_DIR}: create this directory first
CUDA_VISIBLE_DEVICES=$CUDA_TO_USE python -m openmatch.driver.retrieve \
    --output_dir ${EMBEDDING_SAVE_DIR} \
    --model_name_or_path $CURRENT_MODEL_DIR \
    --per_device_eval_batch_size 256 \
    --query_path $COLLECTION_DIR/train.query.txt \
    --query_template "" \
    --query_column_names id,text \
    --q_max_len 32 \
    --retrieve_depth 200 \
    --fp16 \
    --use_gpu \
    --trec_save_path ${RETRIEVE_SAVE_DIR}/train.trec \
    --dataloader_num_workers 1
```
This creates a TREC-style file `train.trec` in `${RETRIEVE_SAVE_DIR}`.
Notes:
* Feel free to use multiple GPUs for this stage: just set `CUDA_TO_USE` to something like `CUDA_TO_USE=0,1`.
`torch.distributed.launch` is not required; the index is automatically split across all listed devices.
* `${RETRIEVE_SAVE_DIR}` should exist before running this command; the retriever may not create this directory automatically.
* If you are short on VMEM (about 31 GiB is required in total) **or have not installed faiss-gpu**, you may remove `--use_gpu` from the arguments; the nearest-neighbor search then runs on the CPU, which is significantly slower.
* You can also use `${OPENMATCH_SCRIPTS_DIR}/split_embeddings.py` to split the embeddings into several splits, and then call `successive_retrieve` instead of `retrieve`; this searches the queries on each split of the corpus embeddings instead of the whole corpus, reducing the VMEM needed for each round of search.
You also need to retrieve passages for positive passages to obtain positive neighbor negatives:
```bash
# Be sure to add the --encode_query_as_passage option
CUDA_VISIBLE_DEVICES=$CUDA_TO_USE python -m openmatch.driver.retrieve \
    --output_dir ${EMBEDDING_SAVE_DIR} \
    --model_name_or_path $CURRENT_MODEL_DIR \
    --per_device_eval_batch_size 256 \
    --corpus_path $COLLECTION_DIR/train.positive.txt \
    --encode_query_as_passage \
    --doc_template "[SEP]" \
    --doc_column_names id,title,text \
    --p_max_len 128 \
    --retrieve_depth 200 \
    --fp16 \
    --use_gpu \
    --trec_save_path ${RETRIEVE_SAVE_DIR}/train.positive.trec \
    --dataloader_num_workers 1
```
This creates a TREC-style file `train.positive.trec` in `${RETRIEVE_SAVE_DIR}`.
Notes:
* Be sure to pass `--corpus_path`, `--doc_template` and `--p_max_len`; they are used instead of the query-related arguments, since we're dealing with passages.
See [The Key Concepts](#the-key-concepts) for details.
Next, we build negative files from retrieved TREC files:
```bash
# Hard negatives
# --tokenizer_name: path to the model used; only the tokenizer is used
# --hn_file: TREC file for building negatives
# --save_to: path to save the .jsonl files
# --depth 200: same as in the original paper
python $OPENMATCH_SCRIPTS_DIR/ANCE-Tele/build_hn.py \
    --tokenizer_name $CURRENT_MODEL_DIR \
    --hn_file ${RETRIEVE_SAVE_DIR}/train.trec \
    --qrels $COLLECTION_DIR/qrels.train.tsv \
    --queries $COLLECTION_DIR/train.query.txt \
    --collection $COLLECTION_DIR/corpus.tsv \
    --save_to ${TRAIN_SAVE_DIR}/hard_neg \
    --depth 200 \
    --n_sample 30
# Positive negatives
python $OPENMATCH_SCRIPTS_DIR/ANCE-Tele/build_hn.py \
    --tokenizer_name $CURRENT_MODEL_DIR \
    --hn_file ${RETRIEVE_SAVE_DIR}/train.positive.trec \
    --qrels $COLLECTION_DIR/qrels.train.tsv \
    --queries $COLLECTION_DIR/train.query.txt \
    --collection $COLLECTION_DIR/corpus.tsv \
    --save_to ${TRAIN_SAVE_DIR}/train_pos \
    --depth 200 \
    --n_sample 30
```
This samples 30 negatives per query from each TREC file and groups them with the positives to form trainer-compatible `.jsonl` training data files.
Note that both `train.trec` and `train.positive.trec` should be used.
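The sampling step can be pictured with the following Python sketch. It is not the code of `build_hn.py`: the function name `sample_negatives`, the fixed seed and the toy run are hypothetical, and it assumes the standard six-column TREC run format with ranks starting at 1.

```python
import random
from collections import defaultdict


def sample_negatives(trec_lines, qrels, depth=200, n_sample=30, seed=0):
    """Pick up to n_sample non-positive documents from each query's top `depth`."""
    rng = random.Random(seed)
    candidates = defaultdict(list)
    for line in trec_lines:
        # Assumed TREC run format: query_id Q0 doc_id rank score run_tag
        qid, _, doc_id, rank, _score, _tag = line.split()
        # Skip documents below the depth cut-off and known positives.
        if int(rank) <= depth and doc_id not in qrels.get(qid, set()):
            candidates[qid].append(doc_id)
    return {qid: rng.sample(docs, min(n_sample, len(docs)))
            for qid, docs in candidates.items()}


# Toy run (hypothetical): d2 is the positive for q1, d4 falls below depth=3.
run = ["q1 Q0 d1 1 0.9 run", "q1 Q0 d2 2 0.8 run",
       "q1 Q0 d3 3 0.7 run", "q1 Q0 d4 4 0.6 run"]
negatives = sample_negatives(run, {"q1": {"d2"}}, depth=3, n_sample=5)
print(sorted(negatives["q1"]))
```

Filtering out the known positives before sampling is what keeps a query's own relevant passage from being used as its negative.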
Then, merge the two sets of training data into one:
```bash
python $OPENMATCH_SCRIPTS_DIR/ANCE-Tele/combine_negative.py \
    --input_folder_1 ${TRAIN_SAVE_DIR}/train_pos \
    --input_folder_2 ${TRAIN_SAVE_DIR}/hard_neg \
    --output_folder ${TRAIN_SAVE_DIR}/cur_mixed
```
This appends the negatives in the positive-negative directory to the corresponding queries in the hard-negative files and saves the result to new files.
Because the trainer randomly selects negatives from the training data, this ensures that the mixing ratio in the paper is satisfied.
If previous episode train files are available, they have to be merged as well:
```bash
# ${LAST_TRAIN_SAVE_DIR}: path to the training files used in the last episode
python $OPENMATCH_SCRIPTS_DIR/ANCE-Tele/combine_negative.py \
    --input_folder_1 ${TRAIN_SAVE_DIR}/cur_mixed \
    --input_folder_2 ${LAST_TRAIN_SAVE_DIR} \
    --output_folder ${TRAIN_SAVE_DIR}
```
If there is no previous episode, you can simply move the `.jsonl` files out of the `/cur_mixed` directory.
Because the trainer loads the `.jsonl` files using `glob`, they are loaded in arbitrary order. You may want to merge the training files into one file to reduce this uncertainty:
```bash
cat ${TRAIN_SAVE_DIR}/*.hn.jsonl > ${TRAIN_SAVE_DIR}/train.hn.jsonl-temp
rm -v ${TRAIN_SAVE_DIR}/*.hn.jsonl
mv -v ${TRAIN_SAVE_DIR}/train.hn.jsonl-temp ${TRAIN_SAVE_DIR}/train.hn.jsonl
```
#### 2. Training
Call OpenMatch's trainer to train the model on the negatives mined:
```bash
# The arguments are collected in an array so that each one can carry a comment.
TRAIN_ARGS=(
    --output_dir "$CURRENT_MODEL_DIR"          # Path to save the trained model and checkpoints
    --model_name_or_path "$PREVIOUS_MODEL_DIR" # Starting model; this is $PLM_DIR for each episode in this implementation
    --do_train
    --save_steps 20000
    --train_path "${TRAIN_FILE}"               # Path to merged training file; Must not be a single file
    --fp16
    --per_device_train_batch_size 8            # Total data per batch is 8
    --train_n_passages "$N_PASSAGES"           # 16 for first episode; 32 for later episodes
    --learning_rate 5e-6
    --q_max_len 32
    --p_max_len 128
    --num_train_epochs 3
    --logging_dir "${LOG_PATH}"                # Path to save tensorboard data
    --use_mapping_dataset                      # Important!
    --dataloader_drop_last                     # Drop the last incomplete batch
)
CUDA_VISIBLE_DEVICES=$CUDA_TO_USE python -m openmatch.driver.train_dr "${TRAIN_ARGS[@]}"
```
Notes:
* In the current version of OpenMatch, using iterable datasets in training may introduce a bias in negative selection and thus affect the performance of the trained models. Please use `--use_mapping_dataset`, as this problem does not occur with mapping datasets.
* However, Hugging Face datasets generates a cache in `~/.cache` of the same size as the training data. If you're running low on disk space there, you may specify `--cache_dir ${CACHE_DIR}` to save the cache somewhere else.
* Be aware that a linear warm-up stage covering 10% of all training steps is used; thus, the number of training epochs and `--dataloader_drop_last` may affect every checkpoint generated.
* Since this implementation uses the 20000-step checkpoint to mine the next episode's negatives, you may want to stop training as early as step 20000. However, because of the warm-up stage, you should still set the number of training epochs to 3 to avoid a difference in learning rates.
* Note that this implementation trains on 16 passages per training example in the first training stage, and 32 passages in later stages.
* You can also use multiple GPUs for training. However, you should modify the command:
```bash
# The total batch size should stay 8, so divide it by the number of GPUs ($CUDA_NUM).
# --negatives_x_device is needed to set the gradients correctly.
# Other arguments can be left as-is and appended after --negatives_x_device.
CUDA_VISIBLE_DEVICES=$CUDA_TO_USE python -m torch.distributed.launch --nproc_per_node=2 --master_port ${PORT} \
    -m openmatch.driver.train_dr \
    --per_device_train_batch_size $((8 / CUDA_NUM)) \
    --negatives_x_device
```
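To see why the number of epochs matters even when training is stopped at step 20000, the following sketch computes the learning rate under a linear warm-up over 10% of the steps. The linear decay to zero after warm-up is an assumption (it is the default scheduler of the Hugging Face trainer), the function name `linear_warmup_lr` is hypothetical, and the total step counts are made-up numbers for illustration only. The peak value is the `--learning_rate 5e-6` from the training command.

```python
def linear_warmup_lr(step, total_steps, peak_lr=5e-6, warmup_ratio=0.1):
    """Learning rate under linear warm-up followed by linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warm-up: rise linearly from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Assumed decay: fall linearly from peak_lr to 0 at total_steps.
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Hypothetical totals: the same step lands on a different learning rate
# depending on how many steps the whole run was configured for.
for total in (60000, 20000):
    print(total, linear_warmup_lr(20000, total))
```

Because both the warm-up length and the decay slope depend on the total number of steps, a run configured for fewer epochs reaches step 20000 with a different learning-rate history, and its checkpoint is not comparable.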
#### 3. Evaluation
First, encode the passages using `build_index`.
```bash
# ${EMBEDDING_SAVE_DIR}: path to save the embedded passages
# ${CURRENT_MODEL_DIR}: path to the model used to encode
CUDA_VISIBLE_DEVICES=${CUDA} python -m openmatch.driver.build_index \
    --output_dir ${EMBEDDING_SAVE_DIR} \
    --model_name_or_path ${CURRENT_MODEL_DIR} \
    --per_device_eval_batch_size 1024 \
    --corpus_path ${COLLECTION_DIR}/corpus.tsv \
    --doc_template "[SEP]" \
    --doc_column_names id,title,text \
    --q_max_len 32 \
    --p_max_len 128 \
    --fp16 \
    --dataloader_num_workers 1
```
Then, retrieve passages for queries. This time, use dev set queries:
```bash
# Remember to create ${RETRIEVE_SAVE_DIR} first!
CUDA_VISIBLE_DEVICES=$CUDA_TO_USE python -m openmatch.driver.retrieve \
    --output_dir ${EMBEDDING_SAVE_DIR} \
    --model_name_or_path $CURRENT_MODEL_DIR \
    --per_device_eval_batch_size 256 \
    --query_path $COLLECTION_DIR/dev.query.txt \
    --query_template "" \
    --query_column_names id,text \
    --q_max_len 32 \
    --fp16 \
    --trec_save_path ${RETRIEVE_SAVE_DIR}/dev.trec \
    --dataloader_num_workers 1 \
    --use_gpu
```
These two steps are identical to the corresponding steps in [1. Negative Building](#1-negative-building); only the queries used are different.
Once the dev-set passages have been retrieved (in standard TREC format), you can use an evaluation script to measure the model's performance.
OpenMatch ships with an evaluation script, `scripts/evaluate.py`, which requires the `pytrec_eval` package.
You can use this script to compute MRR@10:
```bash
# First file: the processed qrels file; second file: the retrieved TREC file for the dev set
python $OPENMATCH_SCRIPTS_DIR/evaluate.py -m mrr_cut.10 \
    ${COLLECTION_DIR}/qrels.dev.restructured.tsv \
    ${RETRIEVE_SAVE_DIR}/dev.trec
```
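For reference, MRR@10 itself is simple to compute by hand. The sketch below is a plain-Python illustration of the metric, not the code of `scripts/evaluate.py` or `pytrec_eval`; the function name `mrr_at_k` and the toy data are hypothetical, and it averages over the queries present in the run, which may differ from how the evaluation script handles queries missing from either file.

```python
def mrr_at_k(run, qrels, k=10):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for qid, ranked_docs in run.items():
        relevant = qrels.get(qid, set())
        for rank, doc_id in enumerate(ranked_docs[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(run) if run else 0.0


# Toy data (hypothetical): q1 finds its positive at rank 2, q2 finds none.
run = {"q1": ["d3", "d1"], "q2": ["d9"]}
qrels = {"q1": {"d1"}, "q2": {"d2"}}
print(mrr_at_k(run, qrels))
```

A query whose relevant passage does not appear in the top 10 contributes zero, so the score rewards placing a relevant passage as close to rank 1 as possible.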