TaskTracker is a novel approach to detect task drift in large language models (LLMs) by analyzing their internal activations. It is based on the research described in this SaTML’25 paper Get my drift? Catching LLM Task Drift with Activation Deltas.
Key features:
The repo includes:
TaskTracker enables more secure use of LLMs in retrieval-augmented applications by catching unwanted deviations from user instructions. It also opens up new directions for LLM interpretability and control.
To request access to the activation data we generated for simulating/evaluating task drift, please fill out this form and we will respond with a time-restricted download link (coming soon, we will send download links as soon as they are available).
brew install azcopy.azcopy copy 'https://tasktrackeropensource.blob.core.windows.net/activations/{MODEL_NAME}/{DATA_DISTRIBUTION}?{SAS_TOKEN}' <LOCAL_PATH> --recursive
[3, BATCH, LAYERS, DIM]:Training data was constructed using the same text examples as pairs of clean vs poisoned ones
[2, BATCH, LAYERS, DIM]:
import torch
# Load the activation data for validation/test
clean_activations = torch.load('activations/activations_0.pt')
poisoned_activations = torch.load('activations/activations_1.pt')
# Shape: (2, 1000, 32, 4096). For training files, this would be (3, 1000, 32, 4096)
# Subtract the first dimension of the activations (to remove the instruction and only compare data blocks)
clean_activations = clean_activations[1] - clean_activations[0]
poisoned_activations = poisoned_activations[1] - poisoned_activations[0]
# Shape: (1000, 32, 4096)
conda env create -f environment.yml
conda activate tasktracker
cd TaskTracker
pip install -e .
1- Check quick_start for a simple way to run on new data
2- Edit quick_start/config.yaml for configurations of classifier path, which LLM, parameters of layers and thresholds, etc.
3- Check the structure of data in quick_start/mock_data.json. You can prepare your data as
[
{
"user_prompt": "primary task",
"text": "paragraph, can be clean or poisoned",
"label": 1 (poisoned), 0 (clean)
}
]
4- run quick_start/main_quick_test.py. According to which LLM/Task Tracker you are using, change torch_type when loading the LLM (check TaskTracker/task_tracker/config/models.py for the precision we used for each LLM).
We provide pre-sampled dataset examples for training and evaluation (see option 1 for regenerating our exact data which you probably may need to do if you are using our precomputed activations).
task_tracker/dataset_creation/recreate_dataset which will automatically download the relevant resources and build the dataset. No change is required.task_tracker/config/models.py to point to your created files.To create your own dataset:
task_tracker/dataset_creation/ to prepare training, validation, and test datasets.task_tracker/config/models.py to point to your newly generated files.Note: Each notebook contains detailed instructions and customization options. Adjust parameters as needed for your specific use case.
prepare_training_dataset.ipynb: Samples training data from SQuAD training split
args.orig_task, args.emb_task, and args.embed_loctraining_dataset_combinations.ipynb for combination examplesprepare_datasets_clean_val.ipynb: Samples clean validation data
prepare_datasets_clean_test.ipynb: Samples clean test data
prepare_datasets_poisoned_val.ipynb: Samples poisoned validation data
prepare_datasets_poisoned_test.ipynb: Samples poisoned test data
prepare_datasets_poisoned_test_other_variations.ipynb: Generates variations of poisoned injections (trigger variations)
prepare_datasets_clean_test_spotlight.ipynb: Constructs clean examples with spotlighting prompts
prepare_datasets_poisoned_test_translation_WildChat.ipynb: Constructs WildChat examples (clean examples with instructions) and poisoned examples with translated instructionsAfter generating or downloading the dataset:
task_tracker/config/models.pyDATA_LISTS path in task_tracker/training/utils/constants.py to point to your downloaded files.To generate activations:
task_tracker/config/models.py:
```python
cache_dir = “/path/to/hf/cache/” os.environ[“TRANSFORMERS_CACHE”] = cache_dir os.environ[“HF_HOME”] = cache_dir
activation_parent_dir = “/path/to/store/activations/”
text_dataset_parent_dir = “/path/to/dataset/text/files/”
2. Customize activation generation in `task_tracker/activations/generate.py`:
```python
model_name: str = "mistral" # Choose from models in task_tracker.config.models
with_priming: bool = True # Set to False if no priming prompt is needed
(Optional) Modify the priming prompt in task_tracker/utils/data.py if needed.
Generate activations:
python task_tracker/activations/generate.py
After generating or downloading activations:
DATA_LISTS path in task_tracker/training/utils/constants.py:
DATA_LISTS = "/path/to/activation/file/lists/"
Note: Ensure that dataset file paths in task_tracker/config/models.py are correct before generating activations.
We provide pre-trained probes in the repository. However, if you wish to train your own probes, follow these steps:
Ensure you have:
task_tracker/config/models.py (from the dataset creation step)DATA_LISTS) in task_tracker/training/utils/constants.py (from the activation generation step)task_tracker/training/utils/constants.py:
```python
MODEL_OUTPUT_DIR = ‘/path/to/output/directory’task_tracker/training/linear_probe/train_linear_model.py:MODEL = "llama3_70b" # Choose from models in task_tracker.training.utils.constants
python task_tracker/training/linear_probe/train_linear_model.py
task_tracker/training/triplet_probe/train_per_layer.py:MODEL = 'mistral' # Choose from models in task_tracker.training.utils.constants
config = {
'model': MODEL,
'activations': ACTIVATIONS_DIR,
'activations_ood': ACTIVATIONS_VAL_DIR,
'ood_poisoned_file': OOD_POISONED_FILE,
'exp_name': 'mistral_test', # Update with your experiment name
'margin': 0.3,
'epochs': 6,
'num_layers': (0,5), # Start to end layer (both inclusive)
'files_chunk': 10,
'batch_size': 2500, # Batch size for triplet mining
'learning_rate': 0.0005,
'restart': False, # Set to True if restarting from a checkpoint
'feature_dim': 275,
'pool_first_layer': 5 if MODEL == 'llama3_70b' else 3,
'dropout': 0.5,
'check_each': 50,
'conv': True,
'layer_norm': False,
'delay_lr_factor': 0.95,
'delay_lr_step': 800
}
python task_tracker/training/triplet_probe/train_per_layer.py
After training or downloading pre-trained models:
task_tracker/experiments_outputs.py:linear_probe_out_parent_dir = "/path/to/linear/probes"
triplet_probe_out_parent_dir = "/path/to/triplet/probes"
Note: Adjust hyperparameters and configuration settings as needed for your specific use case.
This section provides scripts to evaluate models and reproduce our experiments.
Ensure you have:
task_tracker/config/models.pyDATA_LISTS) in task_tracker/training/utils/constants.pytask_tracker/experiments_outputs.pyUse task_tracker/evaluation/visualizations/tsne_raw_activations.ipynb to visualize task activation residuals:
from task_tracker.training.dataset import ActivationsDatasetDynamicPrimaryText
from task_tracker.training.utils.constants import (
TEST_ACTIVATIONS_DIR_PER_MODEL, TEST_CLEAN_FILES_PER_MODEL,
TEST_POISONED_FILES_PER_MODEL)
MODEL = 'mistral'
BATCH_SIZE = 256
TEST_ACTIVATIONS_DIR = TEST_ACTIVATIONS_DIR_PER_MODEL[MODEL]
FILES_CHUNCK = 10
LAYERS = 80 if MODEL == 'llama3_70b' else 32
To simulate attacks and verify model responses:
python task_tracker/evaluation/verifier/get_model_responses.py
POISONED_TEST_DATASET_FILENAME = data[‘test_poisoned’] CLEAN_TEST_DATASET_FILENAME = data[‘test_clean’] TOKEN = ‘’ # Add HF token MODEL = ‘mistral’ # Change as needed
2. Run the verifier:
```bash
python task_tracker/evaluation/verifier/gpt4_judge_parallel_calls.py
Note: This script uses parallel API calls. Be mindful of costs when processing large datasets.
from task_tracker.config.models import data
from task_tracker.experiments_outputs import (
MODELS_RESPONSE_OUT_FILENAME_PER_MODEL,
VERIFIER_RESPONSE_OUT_FILENAME_PER_MODEL)
MODEL = 'llama3_8b'
MAX_THREADS = 60
JUDGE_PROMPT_FILE = 'judge_prompt.txt'
JUDGE_MODEL = 'gpt-4-no-filter'
AZURE_OPENAI_KEY = '' # Add credentials
AZURE_OPENAI_ENDPOINT = ''
Use task_tracker/evaluation/linear_probe/evaluate_linear_models.ipynb:
from task_tracker.experiments_outputs import LINEAR_PROBES_PATHS_PER_MODEL
from task_tracker.training.dataset import ActivationsDatasetDynamicPrimaryText
from task_tracker.training.utils.constants import (
TEST_ACTIVATIONS_DIR_PER_MODEL, TEST_CLEAN_FILES_PER_MODEL,
TEST_POISONED_FILES_PER_MODEL)
FILES = 'test'
MODEL = 'llama3_70b'
Use task_tracker/evaluation/triplet_probe/evaluate_triplet_models_test_data.ipynb:
from task_tracker.experiments_outputs import TRIPLET_PROBES_PATHS_PER_MODEL
from task_tracker.training.utils.constants import (
TEST_ACTIVATIONS_DIR_PER_MODEL, TEST_CLEAN_FILES_PER_MODEL,
TEST_POISONED_FILES_PER_MODEL)
MODEL = 'llama3_70b'
task_tracker/experiments_outputs.py:
TRIPLET_PROBES_PATHS_PER_MODEL = {
'mistral': {
'path': triplet_probe_out_parent_dir + '/mistral_best',
'num_layers': (17,31),
'saved_embs_clean': 'clean_embeddings_20240429-133151.json',
'saved_embs_poisoned': 'poisoned_embeddings_20240429-134637.json'
}
}
Use task_tracker/evaluation/triplet_probe/distances_per_conditions.ipynb:
from task_tracker.config.models import data
from task_tracker.experiments_outputs import (
TRIPLET_PROBES_PATHS_PER_MODEL, VERIFIER_RESPONSE_OUT_FILENAME_PER_MODEL)
POISONED_TEST_DATASET_FILENAME = data['test_poisoned']
CLEAN_TEST_DATASET_FILENAME = data['test_clean']
Use task_tracker/evaluation/triplet_probe/temporal_distances_per_tokens.ipynb:
from task_tracker.config.models import cache_dir, data, models
from task_tracker.experiments_outputs import TRIPLET_PROBES_PATHS_PER_MODEL
os.environ["TRANSFORMERS_CACHE"] = cache_dir
os.environ["HF_HOME"] = cache_dir
POISONED_TEST_DATASET_FILENAME = data['test_poisoned']
CLEAN_TEST_DATASET_FILENAME = data['test_clean']
MODEL = 'mistral' # Change as needed
Note: Adjust model names and file paths as necessary for your specific setup and experiments.
If you find our paper, dataset, or this repo helpful, please cite our paper:
@inproceedings{abdelnabi2025getmydrift,
title={Get my drift? Catching LLM Task Drift with Activation Deltas},
author={Sahar Abdelnabi and Aideen Fay and Giovanni Cherubin and Ahmed Salem and Mario Fritz and Andrew Paverd},
year={2025},
booktitle={SaTML (To Appear)}
}