ledidi

class ledidi.ledidi.Ledidi(model, shape, target=None, input_loss=L1Loss(), output_loss=MSELoss(), tau=1, l=0.1, batch_size=16, max_iter=1000, early_stopping_iter=100, report_iter=100, lr=1.0, input_mask=None, initial_weights=None, eps=0.0001, return_history=False, verbose=True)

Ledidi is a method for editing sequences to exhibit desired properties.

Ledidi is a method for editing categorical sequences, such as those composed of nucleotides or amino acids, to exhibit desired properties in a small number of edits. It does so through the use of an oracle model, which is a differentiable model that accepts a categorical sequence as input and makes relevant predictions. For instance, the model might take in a one-hot encoded nucleotide sequence and predict the strength of binding for a particular transcription factor.

Given a sequence and a desired output, Ledidi uses gradient descent to design edits that bring the predicted output from the model closer to the desired output. Because the sequences that predictions are made for must be categorical, this involves using the Gumbel-softmax reparameterization trick.

Parameters

model: torch.nn.Module

A model to use as an oracle that will be frozen as a part of the Ledidi procedure.

shape: tuple of two integers

The number of categories and the number of positions, respectively, in the sequence to be edited. For nucleotides this might be (4, 1000).

target: int or None

When given a multi-task model, the target to slice out and feed into output_loss when calculating the gradient. If None, perform no slicing. Default is None.

input_loss: torch.nn.Loss, optional

A loss to apply to the input space. By default this is the L1 loss which corresponds to the number of positions that have been edited. This loss is also divided by 2 to account for each edit changing two values within that position. Default is torch.nn.L1Loss.

output_loss: torch.nn.Loss, optional

A loss to apply to the output space. By default this is the L2 loss, which corresponds to the mean squared error between the predicted values and the desired values. Default is torch.nn.MSELoss.

tau: float, positive, optional

The sharpness of the sampled values from the Gumbel distribution used to generate the one-hot encodings at each step. Higher values mean sharper, i.e., more closely match the argmax of each position. Default is 1.

l: float, positive, optional

The mixing weight parameter between the input loss and the output loss, applied to the input loss. The smaller this value is, the more important it is that the output loss is minimized. Default is 0.1.

batch_size: int, optional

The number of sequences to generate at each step and average loss over. Default is 16.

max_iter: int, optional

The maximum number of iterations to continue generating samples. Default is 1000.

report_iter: int, optional

The number of iterations to perform before reporting results of the optimization. Default is 100.

lr: float, optional

The learning rate of the procedure. Default is 1.0.

input_mask: torch.Tensor or None, shape=(shape[-1],)

A mask where 1 marks the positions that cannot be edited. This will set the initial weights mask to -inf at those positions. If None, no positions are masked out. Default is None.

initial_weights: torch.Tensor or None, shape=(1, shape[0], shape[1])

Initial weights to use in the weight matrix to specify priors on the composition of edits that can be made. Positive values make certain edits more likely to be proposed, and negative values make them less likely to be proposed. Default is None.

eps: float, optional

The epsilon to add to the one-hot encoding. Because the first step of the procedure is to take log(X + eps) the smaller eps is the higher a value in the design weight needs to be achieved before an edit can be induced. Default is 1e-4.

random_state: int or None, optional

Whether to force determinism.

verbose: bool, optional

Whether to print the loss during design. Default is True.
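As a small worked illustration of the input_loss description above, the sketch below shows in plain Python why a halved L1 distance between two one-hot encodings equals the number of edited positions. It assumes the absolute differences are summed over all entries (a mean reduction would rescale the value), and the helper names one_hot and count_edits are hypothetical, not part of ledidi.

```python
def one_hot(sequence, alphabet="ACGT"):
    """Encode a string as a (n_categories, length) nested list of 0/1 values."""
    return [[1 if char == letter else 0 for char in sequence] for letter in alphabet]


def count_edits(original, edited):
    """Summed absolute difference between two one-hot encodings, divided by 2."""
    total = sum(
        abs(a - b)
        for row_a, row_b in zip(original, edited)
        for a, b in zip(row_a, row_b)
    )
    # Each substitution turns one entry off and another on, so it contributes 2.
    return total / 2


original = one_hot("ACGTACGT")
edited = one_hot("ACGAACTT")  # substitutions at index 3 (T to A) and index 6 (G to T)
n_edits = count_edits(original, edited)
print(n_edits)
```

Dividing by 2 is what turns the raw distance into a count of positions rather than a count of changed matrix entries.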

fit_transform(X, y_bar)

Apply the Ledidi procedure to design edits for a sequence.

This procedure takes in a single sequence and a desired output from the model and designs edits that cause the model to predict the desired output. This is done primarily by learning a weight matrix of logits that are added to the log-transformed one-hot encoded sequence. These weights are the only weights learned during the procedure.

Parameters

X: torch.Tensor, shape=(1, n_channels, length)

A tensor containing a single one-hot encoded sequence to propose edits for. This sequence is then expanded out to the desired batch size to generate a batch of edits.

y_bar: torch.Tensor, shape=(1, *)

The desired output from the model. Any shape for this tensor is permissible so long as the output_loss function can handle comparing it to the output from the given model.

Returns

y: torch.Tensor, shape=(batch_size, n_channels, length)

A tensor containing a batch of one-hot encoded sequences which may contain one or more edits compared to the sequence that was passed in.
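The procedure above adds learned weights to log(X + eps). As a rough illustration of how eps sets the bar for an edit: before any weights are learned, the original category at a position has logit log(1 + eps) and every other category has logit log(eps). The sketch below assumes tau = 1 and that samples follow the softmax of the logits (the Gumbel-max property). The helper names are hypothetical and not part of ledidi.

```python
import math


def edit_probability(weight, eps=1e-4, n_categories=4):
    """Probability of sampling one alternative category that received `weight`."""
    original = 1 + eps                         # exp(log(1 + eps)), the original category
    edited = math.exp(math.log(eps) + weight)  # the alternative with a learned weight
    others = (n_categories - 2) * eps          # remaining alternatives with zero weight
    return edited / (original + edited + others)


def break_even_weight(eps=1e-4):
    """Weight at which the edited category ties with the original category."""
    return math.log(1 + eps) - math.log(eps)


for eps in (1e-2, 1e-4):
    print(eps, round(break_even_weight(eps), 3), edit_probability(5.0, eps=eps))
```

Under these assumptions, a smaller eps raises the break-even weight, which matches the statement that a higher design weight is needed before an edit can be induced.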

forward(X)

Generate a set of edits given a sequence.

This method will take in the one-hot encoded sequence and the current learned weight filter and propose edits based on the Gumbel-softmax distribution.

Parameters

X: torch.Tensor, shape=(1, n_channels, length)

A tensor containing a single one-hot encoded sequence to propose edits for. This sequence is then expanded out to the desired batch size to generate a batch of edits.

Returns

y: torch.Tensor, shape=(batch_size, n_channels, length)

A tensor containing a batch of one-hot encoded sequences which may contain one or more edits compared to the sequence that was passed in.
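To illustrate the sampling step that forward describes, here is a minimal plain-Python sketch that adds Gumbel noise to log(X + eps) + weights and takes the argmax at each position. It covers only the forward sampling; the gradient path used during optimization is omitted, and sample_one_hot is a hypothetical helper rather than the library's implementation.

```python
import math
import random


def sample_one_hot(X, weights, eps=1e-4, rng=random):
    """Draw one sequence via argmax of log(X + eps) + weights + Gumbel noise."""
    n_categories, length = len(X), len(X[0])
    sample = [[0] * length for _ in range(n_categories)]
    for j in range(length):
        noisy = []
        for i in range(n_categories):
            logit = math.log(X[i][j] + eps) + weights[i][j]
            u = max(rng.random(), 1e-12)  # avoid log(0) when the draw is exactly 0
            noisy.append(logit - math.log(-math.log(u)))  # add Gumbel(0, 1) noise
        sample[noisy.index(max(noisy))][j] = 1
    return sample


rng = random.Random(0)
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]  # "ACG" over the alphabet ACGT
weights = [[0.0] * 3 for _ in range(4)]
weights[3][1] = 100.0  # illustrative weight that strongly favors T at the middle position
batch = [sample_one_hot(X, weights, rng=rng) for _ in range(16)]
```

Every sample stays one-hot at each position, and the large weight makes the middle position switch to T in every draw, while positions with zero weight almost always keep their original category.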

ledidi.ledidi.ledidi(model, X, y_bar, n_repeats=1, n_samples=None, return_designer=False, return_history=False, device='cuda', **kwargs)

Ledidi is a method for editing sequences to exhibit desired properties.

Ledidi is a method for designing compact sets of edits to categorical sequences, such as DNA, to make them exhibit desired characteristics as predicted by an oracle model. This is done by rephrasing the edit design task as a continuous optimization problem that can be solved using off-the-shelf optimizers and strategies. In this problem, Ledidi is trying to minimize an objective function composed of an output loss, which measures how far away predictions on the edited sequence are from the target predictions, and an input loss, which measures the number of edits.
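Read together with the description of the mixing weight l, the objective above has the form output_loss + l * input_loss. The sketch below evaluates that expression in plain Python for two made-up candidates; the function name and the numbers are illustrative assumptions, and the exact reductions used inside Ledidi are not shown here.

```python
def total_loss(output_loss, input_loss, l):
    """Combine the two losses, with the mixing weight applied to the input loss."""
    return output_loss + l * input_loss


# Hypothetical candidates as (output loss, number of edits).
candidates = {"few_edits": (0.50, 2), "many_edits": (0.10, 12)}

preferred = {}
for l in (0.01, 0.1):
    scores = {name: total_loss(out, n, l) for name, (out, n) in candidates.items()}
    preferred[l] = min(scores, key=scores.get)  # candidate with the lowest objective

print(preferred)
```

With the smaller l the candidate that matches the target more closely is preferred despite its extra edits, while the larger l favors the more compact set of edits.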

Because gradients cannot be directly applied to one-hot encoded categorical sequences, Ledidi learns an underlying continuous weight matrix from which categorical sequences are sampled. The distribution used for sampling is the Gumbel-softmax distribution, and passing gradients through the discrete samples as though they were continuous is referred to as the straight-through estimator. Essentially, the process allows us the flexibility of a simple continuous optimization problem and the consistency of still sampling one-hot encoded sequences to run through the oracle model.

The oracle model can be a single task from a single model, multiple tasks from the same model, or even multiple tasks from multiple models. All that matters is that the entire thing is differentiable. PyTorch makes the use of multiple tasks/models easy through the use of wrappers. See

for more information on how to construct wrappers that may be helpful.

This function is a wrapper that constructs a Ledidi object and calls Ledidi.fit_transform on it. Ledidi is implemented as a class that inherits from torch.nn.Module because a torch.nn.Parameter must be initialized for the weight matrix. Basically, this wrapper turns the two-line implementation into a one-line one that is consistent with other design implementations.

Additionally, one can design an affinity catalog by passing in a list of target values in y_bar instead of a single value. When a list is provided, an additional dimension is added to the front of the returned tensor of designed sequences.

If one wants to perform design multiple times they can set n_repeats to a value above 1. The initial weight matrix will be zero but different samples will be drawn from the Gumbel-softmax distribution, potentially leading to different outcomes.

Finally, by default one batch of designed sequences is returned. If you would like more than one batch of samples returned, you can specify the number of samples drawn from Ledidi’s learned distributions.

Parameters

model: torch.nn.Module

A model to use as an oracle that will be frozen as a part of the Ledidi procedure.

X: torch.Tensor, shape=(1, n_channels, length)

A tensor containing a single one-hot encoded sequence to propose edits for. This sequence is then expanded out to the desired batch size to generate a batch of edits.

y_bar: torch.Tensor or list, shape=(1, *)

The desired output from the model. Any shape for this tensor is permissible so long as the output_loss function can handle comparing it to the output from the given model. If a list is provided then each item in the list must have those properties.

n_repeats: int, optional

The number of times to run the Ledidi procedure. If 1, do not include this as a dimension in the returned tensor of sequences. If above 1, include this as the first or second dimension, depending on whether an affinity catalog is being designed (second if so, first if not). Default is 1.

n_samples: int or None, optional

The number of samples to draw from Ledidi after the optimization process. If None, draw one batch as defined by batch_size. Otherwise, draw the number of sequences specified.

return_designer: bool, optional

Whether to return the designers for each design setting. If multiple repeats are done, each designer will be returned. Orthogonally, if an affinity catalog is being designed, return designers for each step. Default is False.

return_history: bool, optional

Whether to return a history for each run of Ledidi. This history includes each loss and other statistics. Default is False.

device: str or torch.device, optional

The device to move all the tensors and models to as a convenience. Default is 'cuda'.

**kwargs

Any additional arguments to be passed into the Ledidi object.

Returns

y: torch.Tensor, shape=(*ny, *n_repeats, n_samples, n_channels, length)

A tensor containing a batch of one-hot encoded sequences which may contain one or more edits compared to the sequence that was passed in. If a list of y_bar values has been passed in, indicating that one would like to design an affinity catalog, the number of targets becomes the first dimension. If multiple repeats are being done, the repeats become the first dimension when no affinity catalog is being designed, or the second dimension when one is.
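The shape rules above can be summarized in a short sketch. The helper expected_shape below is hypothetical and not part of ledidi; it only encodes the documented rules, and it assumes a batch_size of 16 is used when n_samples is None.

```python
def expected_shape(y_bar, n_repeats=1, n_samples=None, batch_size=16,
                   n_channels=4, length=1000):
    """Sketch of the documented output shape; not a function provided by ledidi."""
    shape = []
    if isinstance(y_bar, list):  # affinity catalog: one entry per target value
        shape.append(len(y_bar))
    if n_repeats > 1:            # repeats add a dimension only when above 1
        shape.append(n_repeats)
    shape.append(batch_size if n_samples is None else n_samples)
    shape.extend([n_channels, length])
    return tuple(shape)


single = expected_shape(0.5)
repeated = expected_shape(0.5, n_repeats=5)
catalog = expected_shape([0.1, 0.5, 0.9], n_repeats=3, n_samples=100)
print(single, repeated, catalog)
```

Here y_bar is a plain float or a list of floats purely for illustration; in practice it would be a tensor or a list of tensors.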