### Overview of TACCO framework

TACCO is based on three guiding principles: modularity, interpretability and efficiency. TACCO's compositional annotation algorithm is built from a single fast core method, which is supplemented by a diverse set of wrappers and ‘boosters’, each providing additional functionality and features. The framework is completed by a set of downstream analysis tools, some of which are especially optimized for analyses involving compositional annotations. TACCO relies on AnnData and seamlessly integrates with the Scanpy^{46} ecosystem.

### Compositional annotation

TACCO aims to annotate single cells and cell-like objects, like Slide-seq beads or Visium spots, with compositional annotations. In cases with discrete ground truth (for example, a B cell versus a fibroblast), we interpret the compositional annotation as probabilities of belonging to a category. Both objects *b* and categories *t* are in a common high-dimensional data space, for example, expression space. Generally, objects that are close to a category in that space should have a high contribution of that category in the compositional annotation.

The goal of the annotation is to find a matrix *ρ*_{tb} that gives the distribution of annotation over objects *b* and categories *t*. Some applications imply certain natural choices for the ‘units’ of *ρ*_{tb}. For example, in the case of count data, the natural units are counts as well, such that *ρ*_{tb} gives the counts in object *b* that are attributed to category *t*. The marginal distribution over categories is just the total number of counts Σ_{t} *ρ*_{tb} = *ρ*_{b} per object *b* and is known. The marginal distribution over objects Σ_{b} *ρ*_{tb} = *ρ*_{t} is not known beforehand and is an output of the annotation process. As the per-object marginals are generally known, it is equivalent to report only the normalized distributions *ρ*′_{tb} = *ρ*_{tb}/*ρ*_{b} with Σ_{t} *ρ*′_{tb} = 1 as the annotation result.
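As an illustration, this bookkeeping can be sketched in a few lines of NumPy (toy shapes and random data; not TACCO's API):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy annotation matrix rho[t, b]: counts in object b attributed to category t.
rho = rng.integers(1, 10, size=(3, 5)).astype(float)

rho_b = rho.sum(axis=0)   # known marginal: total counts per object b
rho_t = rho.sum(axis=1)   # marginal over objects, an output of the annotation
rho_norm = rho / rho_b    # normalized annotation rho'_{tb}; columns sum to 1
```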

### Core annotation method by OT

TACCO’s core method is entropically regularized, partially balanced OT (an intrinsically fast variant of OT). Balanced OT solves the optimization problem *γ*_{tb} = argmin_{γ} Σ_{tb} *γ*_{tb} *M*_{tb} under the positivity constraint *γ*_{tb} ≥ 0 and the marginal constraints Σ_{t} *γ*_{tb} = *c*_{b} and Σ_{b} *γ*_{tb} = *c*_{t}, with the transport cost Σ_{tb} *γ*_{tb} *M*_{tb} = ⟨*γ*, *M*⟩_{F} given by the Frobenius inner product of a mapping matrix *γ*_{tb} and a constant cost matrix *M*_{tb}. *M*_{tb} encodes the cost of ‘transporting’, or mapping, an object *b* to an annotation *t* and must be chosen sensibly to yield a ‘good’ mapping *γ*_{tb}. The annotation problem is solved if the marginals *c*_{b} and *c*_{t} and the cost *M*_{tb} can be tuned such that *γ*_{tb} = *ρ*_{tb}.

The marginal over objects is known from the data, *c*_{b} = *ρ*_{b}, while the marginal over categories, *c*_{t} = *ρ*_{t}, is not. The latter can be used to encode prior knowledge about the data, for example, from a reference distribution. To support the general case when such a reference distribution is not available, partially balanced OT is used: instead of fixing the category marginal exactly, a Kullback–Leibler divergence penalty scaled with a parameter *λ* is imposed. This parameter tunes the amount of trust put in the prior distribution.

Choosing a well-performing cost function is more ambiguous. The cost function uses the information in data space and assigns a dissimilarity to each object–category combination. A straightforward choice is the cosine distance or, for expression count data, the cosine distance on normalized and log1p-transformed data. In our benchmarks, the cosine distance on transformed data led to better results (Extended Data Fig. 1), but the transformation is rather specific to count data. Inspired by the overlap of states in quantum mechanics, a different measure, the Bhattacharyya coefficient, is generally used, with similar performance and without direct reference to counts. The Bhattacharyya coefficient is a general measure of the overlap of probability distributions and is defined as BC(*p*,*q*) = Σ_{g}√(*p*_{g}*q*_{g}) for two probability distributions *p* and *q* in the data space. To allow users to adapt the method to the needs of their particular application, TACCO implements several other metrics and also makes all scipy metrics available.
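The coefficient is a one-liner; a minimal sketch (function name and example values are illustrative):

```python
import numpy as np

def bhattacharyya_coefficient(p, q):
    """Overlap of two probability distributions: BC(p, q) = sum_g sqrt(p_g * q_g)."""
    return float(np.sum(np.sqrt(p * q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
bc = bhattacharyya_coefficient(p, q)  # 1 for identical distributions, 0 for disjoint supports
```

A dissimilarity suitable as an OT cost can then be derived from the overlap, for example, as 1 − BC.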

OT’s optimization problem is, in general, not strictly convex and numerically expensive. However, entropic regularization makes it strictly convex and efficiently solvable with the Sinkhorn–Knopp matrix scaling algorithm^{47}. For entropic regularization, the tunable entropy regularization term *ε*Ω(*γ*_{tb}) = *ε* Σ_{tb} *γ*_{tb} log(*γ*_{tb}) is added to the objective function. This term favors mappings that do not map a given object to a single annotation.
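The Sinkhorn–Knopp iteration for the balanced, entropically regularized problem can be sketched in a few lines of NumPy (an illustrative implementation with a large epsilon for quick convergence, not TACCO's optimized solver):

```python
import numpy as np

def sinkhorn(M, c_t, c_b, eps=0.5, n_iter=200):
    """Entropically regularized balanced OT via Sinkhorn-Knopp scaling.

    M[t, b] is the cost matrix; c_t and c_b are the target marginals
    (equal total mass). Returns the transport plan gamma[t, b]."""
    K = np.exp(-M / eps)                 # Gibbs kernel of the regularized problem
    u = np.ones_like(c_t)
    for _ in range(n_iter):
        v = c_b / (K.T @ u)              # scale to match the object marginal c_b
        u = c_t / (K @ v)                # scale to match the category marginal c_t
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
M = rng.random((3, 5))                   # e.g. 1 - Bhattacharyya coefficient
gamma = sinkhorn(M, np.full(3, 1 / 3), np.full(5, 1 / 5))
```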

### Alternative annotation methods

The core, OT-based annotation method can be swapped for a number of other built-in methods, including nonnegative least squares or support vector machines (SVMs), or for wrapped external methods from both Python and R, including NMFreg^{5} and RCTD^{8}. Custom core methods can also be added through a functional interface. The wrapped external and custom functions may already include ‘booster’ functionality (below); for example, RCTD already contains platform normalization. Many of the simple built-in methods are amenable to hand optimization, for example, by supporting data sparsity consistently, which makes them even faster. All built-in methods are optimized for standard x86 hardware, but wrapped external methods may use GPU acceleration (for example, Tangram^{12}).

### Overview of boosters

Boosters can improve the performance of the core method or provide support for ‘missing features’, for example, platform normalization, using subtype variability to enhance the type representation in single-cell data, or turning a categorical annotation method into a deconvolution method and vice versa (Extended Data Fig. 1). They can be combined flexibly to adapt to the special requirements of specific applications. The modularity introduced by boosters makes it straightforward to ‘unbox’ TACCO and understand what each part does. While most boosters have some runtime overhead, in the end they all delegate the heavy lifting to the fast core method, calling it one or several times and transforming its inputs and/or outputs.

### Platform normalization

Dramatic differences in datasets can arise solely from differences in the experimental technique that impact the profiled cellular compartments (for example, single cell versus single nucleus), cellular compositions (due to differential capture of cells of different types) or gene biases. Annotation of data from one platform using a reference of another platform is much more difficult without accounting for these platform-dependent biases^{8}. This booster can be safely disabled if no platform effects are expected.

A platform normalization step mitigates platform-specific effects by rescaling data from one platform to make it comparable to data from a different platform, separately in each dimension *g* of the data space, for example, separately per gene. The necessary rescaling factors are estimated from the data. A related but simpler approach than that of ref. ^{8} is pursued, making fewer assumptions and therefore being usable across a broader range of applications.

The representations of the categories *t* as sets of vectors in data space \(\pi ^A_{gt}\,{{{\mathrm{and}}}}\,\pi ^B_{gt}\) on the two platforms *A* and *B* are linked to each other via the platform normalization factors \(f^{AB}_g\,{{{\mathrm{as}}}}\,\pi ^A_{gt} = f^{AB}_g\pi ^B_{gt}\). If the category representations \(\pi ^B_{gt}\) for platform *B* and the (pseudo-bulk) category marginals \(\rho ^A_t\) for platform *A* are known, the data space marginals \(\rho ^A_g\) can be written as \({\rho^A_{g}} = {\mathop {\sum}\nolimits_{{{\mathrm{t}}}} {\pi ^A_{gt}\rho ^A_t}} = {f^{AB}_g}{\mathop {\sum}\nolimits_{{{\mathrm{t}}}} {\pi ^B_{gt}\rho ^A_t}}\), and therefore \(f^{AB}_g = \rho ^A_g/\mathop {\sum}\nolimits_{{{\mathrm{t}}}} {\pi ^B_{gt}\rho ^A_t}\). The category marginals \(\rho ^A_t\) are themselves usually an output of the annotation procedure, yet are needed here as input to one of its preprocessing steps. An iterative scheme resolves this circularity: starting with the assumption that the annotation marginals are identical to the reference composition (which is reasonable, for example, if matched spatial and single-cell data are available), most of the gene-wise platform effect can already be captured. Platform normalization is then rerun after a first round of cell typing using the improved type fractions, and the process is iterated until the normalization factors are stable. In practice, this procedure converges very rapidly, with the initial step being by far the most relevant one.
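The closed-form factor estimate can be sketched on simulated data (toy shapes, variable names and the lognormal bias model are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_types = 100, 4

# Assumed toy inputs: reference profiles pi_B[g, t] on platform B (columns
# normalized) and a pseudo-bulk gene marginal rho_A_g measured on platform A.
pi_B = rng.random((n_genes, n_types))
pi_B /= pi_B.sum(axis=0)
rho_A_t = np.full(n_types, 1 / n_types)   # initial guess: reference composition
true_f = rng.lognormal(sigma=0.5, size=n_genes)
rho_A_g = true_f * (pi_B @ rho_A_t)       # simulate platform-biased bulk data

# f^{AB}_g = rho^A_g / sum_t pi^B_{gt} rho^A_t
f = rho_A_g / (pi_B @ rho_A_t)
```

Because the toy bulk data were generated with the same composition used in the estimate, the factors are recovered exactly; with real data, the iteration over improved type fractions refines them.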

After determining the gene-wise platform normalization factors \(f^{AB}_g\), they are used to rescale the data space in either the reference or the test data or equivalently to work in the units of the test or the reference data. As the probabilities and the resulting annotation distribution are given in terms of these units, choosing test data units or reference data units can lead to quite different results. Which option should be preferred depends on the use case and downstream analyses. Here, test data units are chosen to retain integer count data for object splitting.

### Multicenter

In many cases, there are multiple observations per annotation category in the reference dataset, as in single-cell data. To integrate the variability of these observations while maintaining speed, the observations per category are subclustered with *k*-means clustering to obtain multiple mean profiles per category. The annotation function is then called with this subannotation, and its result is summed over the subannotations to give an annotation in terms of the original categories. When the reference is already finely annotated, no improvement is expected from this booster; it could then even degrade performance, because less data are available per subannotation to average out noise and the reference gets overfitted.
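The subclustering step can be sketched as follows (a simplified illustration of the idea using scipy's k-means, not TACCO's implementation; the annotation would then be run against `profiles` and summed back over `parents`):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def multicenter_profiles(X, labels, k=3, seed=0):
    """Subcluster each category's reference observations with k-means and
    return one mean profile per subcluster plus each profile's parent category."""
    profiles, parents = [], []
    for cat in np.unique(labels):
        Xc = X[labels == cat]
        n = min(k, len(Xc))
        centers, _ = kmeans2(Xc, n, minit='++', seed=seed)
        profiles.append(centers)
        parents += [cat] * n
    return np.vstack(profiles), np.array(parents)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 8)), rng.normal(5, 1, (30, 8))])
labels = np.array(['A'] * 30 + ['B'] * 30)
profiles, parents = multicenter_profiles(X, labels, k=3)
```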

### Bisectioning

Some annotation methods can be biased toward low-entropy annotations such that they may overweight dominant categories in mixtures (the extreme case is a categorical classifier). Such methods can be adapted to compositional annotation by running the annotation method iteratively. In each iteration, the annotation is not used as the final result; instead, a certain fraction of it is used to subtract a reconstructed approximation of the data from the original data. The residual is again subjected to the annotation method, while the kept fraction of the annotation result is added to the final annotation result. This procedure is very similar to gradient boosting. Bisectioning is useful when the objects are additive mixtures of the annotations in the data space. If the data consist (mainly) of objects that are best described by a single annotation, this booster can decrease performance, for example, by resolving the ambient contribution in single-cell data.
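The iteration can be sketched with a deliberately low-entropy stand-in core (a nearest-profile classifier) wrapped by the bisection loop; function names and the fraction schedule are illustrative assumptions, not TACCO's implementation:

```python
import numpy as np

def nearest_profile(X, profiles):
    # Stand-in categorical core: assign each object fully to its closest profile.
    d = ((X[:, None, :] - profiles[None, :, :]) ** 2).sum(axis=-1)
    w = np.zeros((X.shape[0], profiles.shape[0]))
    w[np.arange(X.shape[0]), d.argmin(axis=1)] = 1.0
    return w

def bisection_annotate(X, profiles, n_iter=4, divisor=3):
    """Repeatedly annotate, keep a fraction of the result, subtract the
    corresponding reconstruction and re-annotate the residual."""
    residual = X.astype(float).copy()
    total = np.zeros((X.shape[0], profiles.shape[0]))
    frac = 1.0
    for i in range(n_iter):
        w = nearest_profile(residual, profiles)
        keep = frac if i == n_iter - 1 else frac * (1 - 1 / divisor)
        total += keep * w
        residual -= keep * (w @ profiles)   # subtract reconstructed signal
        frac -= keep                        # kept fractions sum to 1 over the loop
    return total

# A 50/50 mixture of two pure profiles gets a compositional annotation
# even though the wrapped core is strictly categorical.
annotation = bisection_annotate(np.array([[0.5, 0.5]]), np.eye(2))
```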

### Choosing parameters

While a single configuration of TACCO generally gives decent performance for a wide variety of use cases (Extended Data Fig. 8), it can be tuned to perform particularly well for certain use cases. Relevant parameters for the tuning include the following components.

#### Boosting/bisectioning

TACCO’s default depends on the core annotation method. As OT itself lacks support for additive compositional annotations, TACCO activates this booster for OT by default. Higher values for the number of bisections and the bisection divisor generally lead to more precise compositional annotations, but at the cost of increased runtime, which scales roughly linearly with the sum of the number of bisections and the divisor. A divisor smaller than 3 can lead to sizable artifacts, so a divisor of 3 or greater is recommended.

#### Multicenter

By default, TACCO does not use the multicenter booster (which sets a number of representative means for each annotation category), as it is useful only in cases where the annotation categories have large within-category variation. However, as many reference datasets come with coarse annotation categories, the multicenter booster is activated for most analyses in this study. This booster is not recommended if the categories already cover the full biologically significant heterogeneity in the data or if the dataset has only a single representative observation per category. A value of 5 or 10 is recommended in other cases, as this generally captures the heterogeneity in many datasets. Larger values are generally not recommended, as they increase runtime and risk overfitting by using irrelevant variability in the data for the annotation process.

#### Platform_iterations

TACCO’s default depends on the core annotation method. As OT itself does not have built-in platform normalization and most pairs of reference and target data have nontrivial platform effects, basic platform normalization is the default for OT. It is generally recommended not to disable platform normalization, except for specific cases, for example, if part of the dataset of a single batch was manually annotated and the annotation should be transferred to the rest of the data.

#### Max_annotation

TACCO’s default is a compositional annotation with nonzero values simultaneously possible for all annotation categories per observation. Max_annotation can be used to set the number of nonzero values per observation. For categorical annotation, one might want to restrict this to a single nonzero value per observation, corresponding to setting max_annotation to 1, while for data with very few contributing cells per observation (for example, Slide-seq or scRNA-Seq data with doublets) a suitable value could be 2 or 3.

#### Marginal relaxation parameter lambda (OT-specific)

TACCO’s default value is chosen as a compromise between using prior information from the reference pseudobulk annotation composition (for example, cell-type proportions) and annotating in a fully data-driven manner. While larger values (larger cost of having a pseudobulk annotation composition different from the reference) can stabilize the annotation in cases of low correspondence between reference and target data, smaller values can give more precise pseudobulk annotation compositions if the two datasets are highly corresponding and do not exhibit batch effects that are not accounted for by platform normalization.

#### Entropy regularization parameter epsilon (OT-specific)

TACCO’s default is chosen such that the regularization’s effect on the results is minimized: it biases the results toward high entropy configurations, that is, every observation (for example, spatial observation) is assigned a compositional annotation with large contributions from different categories (for example, cell types). This parameter must be nonzero, as the efficient regularized OT solver breaks down at zero epsilon (and numerically already at epsilon much smaller than 0.005). Except for specific settings (that is, prior information external to the dataset itself) in which it is beneficial to bias the result toward significant contributions of many annotation categories in the result, this parameter should only be changed in rare cases of numerical instability—which we did not observe with TACCO’s default.

### Object splitting

The annotation method generates an assignment of a composition of annotation categories for every observation. Generally, each observation is associated with more than one category with nonzero contribution. When the categories are cell types, this can be problematic for downstream applications that require the expression profiles of single cells as input, that is, pure profiles that can be attributed to a single cell. For example, as cell-type-related expression constitutes a strong signal, analysis of expression programs is easier across cells of shared type. Thus, it is desirable to derive several ‘virtual’ observations for every real observation, which correspond to the pure contribution of each annotation category.

A similar idea was proposed in ref. ^{8}, where the expected contribution of each cell type to the expression of one Slide-seq bead is approximated. In that approach, cell-type fractions that were reconstructed for every bead are taken as input in a Bayesian analysis, along with type-specific expression profiles and the bead-count matrix to yield expected reads per gene, bead and cell type. The result, however, lacks consistency with the annotation and the measured count matrix in the sense that the marginal over genes is not recovered. Here, we follow a similar strategy, but we directly integrate marginals as constraints of a matrix equivalence scaling problem in a data-driven frequentist approach.

In this approach, data generation is modeled as drawing single molecules labeled with gene *g*, cell type *t* and observation/bead *b* from the sample, which constitutes the central object—the joint probability *p*(*gtb*) for a given molecule to be of gene *g*, cell type *t* and bead *b*. In the actual experiment, only *g* and *b* are measured, yielding *p*(*gb*) *=* Σ_{t} *p*(*gtb*), the ‘t-marginal’. From the reference data, the reference profiles *p*(*g*|*t*) *=* Σ_{b} *p*(*gtb*)/*p*(*t*) are available, and from the annotation process the annotation result *p*(*tb*) = Σ_{g} *p*(*gtb*), the ‘g-marginal’ is obtained. Further, *p*(*gtb*) is modeled with a Bayesian-inspired product ansatz: *p*(*gtb*) = *p*(*gt*) *n*(*gb*) *n*(*tb*), with free parameters *n*(*gb*) and *n*(*tb*). These are fixed by enforcing the *t*– and *g*-marginals. The ansatz can be interpreted as adjusting the cell-type profiles and pseudobulk annotation with object-wise scaling factors per data dimension *g* and annotation category *t* such that the measurement and the annotations are reproduced exactly. To determine the parameters from the marginals, we have to solve a separate matrix equivalence scaling problem per object *b*.

Matrix equivalence scaling problems are guaranteed to have a solution if the matrix to be scaled, that is, *p*(*gt*), has only positive entries. Therefore, a small positive number is added to all elements of *p*(*gt*) to make the problem well defined. These problems can be solved by iterative normalization of the columns and rows of *p*(*gt*). This simple algorithm is known under many names, for example, as the RAS algorithm^{48} or, for doubly stochastic matrices, as the Sinkhorn–Knopp algorithm^{47}; it is also the algorithm used to solve OT efficiently^{49}. In contrast to OT, where there is a single matrix scaling problem, here a separate problem is solved for every object. Although this initially seems like a practical performance problem, these problems can be solved in parallel and use the same data, leading to speedups by reducing memory accesses. Moreover, we use the sparsity of the count frequency matrix *p*(*gb*) to implement the RAS algorithm very efficiently for the problem at hand.
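For a single object *b*, the scaling problem can be sketched as follows (a dense NumPy illustration; the actual implementation is sparse and parallel over objects):

```python
import numpy as np

def ras_scale(P, row_marg, col_marg, n_iter=500, delta=1e-12):
    """Matrix equivalence scaling (RAS / Sinkhorn-Knopp): find row and column
    factors such that the scaled matrix reproduces both given marginals."""
    P = P + delta                      # strictly positive entries guarantee a solution
    r = np.ones(P.shape[0])
    for _ in range(n_iter):
        c = col_marg / (P.T @ r)       # normalize columns to the type marginal
        r = row_marg / (P @ c)         # normalize rows to the gene marginal
    return r[:, None] * P * c[None, :]

# One object b: scale p(gt) so that its rows sum to the measured p(gb)
# and its columns to the annotated p(tb).
rng = np.random.default_rng(0)
p_gt = rng.random((6, 3))
p_gb = rng.random(6); p_gb /= p_gb.sum()
p_tb = rng.random(3); p_tb /= p_tb.sum()
p_gtb = ras_scale(p_gt, p_gb, p_tb)
```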

By rescaling the resulting *p*(*gtb*) with the total weight per object (for example, total counts per cell for sequencing data), a consistent annotation-resolved split of the measurement is obtained, consisting of floating-point numbers. For expression count data, this split generally includes many values much smaller than 1. To optimize sparsity and obtain integer counts, an option is available to round this result by flooring and redistributing the remaining reads (via multinomial sampling from the remainders). The resulting split count matrix retains biological signal and can be used in standard downstream analyses (Extended Data Fig. 2).
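The optional rounding step can be sketched as follows (function name and example values are illustrative):

```python
import numpy as np

def round_split(x, seed=0):
    """Floor the fractional split counts and redistribute the remaining reads
    by multinomial sampling from the fractional remainders, preserving totals."""
    rng = np.random.default_rng(seed)
    base = np.floor(x).astype(int)
    remainder = x - base
    n_left = int(round(x.sum() - base.sum()))
    if n_left > 0:
        base += rng.multinomial(n_left, remainder / remainder.sum())
    return base

counts = round_split(np.array([0.2, 1.7, 2.1]))  # 4 reads in total, now integer
```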

### Single-molecule annotation

To annotate single molecules by cell-type assignment, TACCO first bins the single-molecule data in space to generate aggregate cell-like objects. TACCO then annotates them using either internal or wrapped external methods. Subsequently, TACCO maps the resulting compositional annotation back to the single molecules, as follows. First, object splitting is used to distribute molecules to annotations, resulting in a probability for every molecule species in the bin to have a certain annotation. This annotation is then distributed randomly among the molecules of that molecular species such that each molecule has only a single annotation and the population of molecules reproduces the annotation probability. Because spatial binning can introduce arbitrary boundary artifacts in the definition of local neighborhoods for molecules, TACCO repeats the binning and annotation for *N* (usually 2–3) spatial shifts of the Cartesian grid in steps of 1/*N* of the grid spacing per spatial dimension. For *d* spatial dimensions, this results in *N*^{d} annotations per molecule. The final annotation of each molecule is then determined by a majority vote. This simple single-molecule annotation is only feasible with fast annotation methods, which can be run multiple times on differently binned data in a reasonable amount of time.
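The shifted-binning step can be sketched as follows (illustrative names; each shifted grid would be annotated separately and the per-molecule labels combined by majority vote):

```python
import numpy as np

def shifted_bin_ids(coords, spacing, n_shifts=2):
    """For n_shifts shifts per spatial dimension (steps of 1/n_shifts of the
    grid spacing), return one integer bin index array per shifted grid."""
    d = coords.shape[1]
    ids = []
    for shift in np.ndindex(*([n_shifts] * d)):
        offset = np.array(shift) * spacing / n_shifts
        ids.append(np.floor((coords + offset) / spacing).astype(int))
    return ids   # n_shifts**d arrays of per-molecule bin indices

rng = np.random.default_rng(0)
coords = rng.random((1000, 2)) * 100     # toy molecule positions
grids = shifted_bin_ids(coords, spacing=10.0, n_shifts=2)
```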

In addition to the parameters for compositional annotation, the single-molecule annotation introduces extra parameters. We evaluated the annotation's sensitivity to parameter changes with respect to the baseline single-molecule TACCO configuration (Extended Data Fig. 6b). For sizable changes in most parameters, about 80% of the single-molecule annotations were conserved relative to the baseline, and consistency with the reference annotation of the osmFISH-recovered cells stayed at about 0.6, with one notable exception: increasing OT's entropic regularization parameter epsilon leads to significant reductions in accuracy and stability, as it drives the annotation to be homogeneous regardless of the data. As for compositional annotation, we therefore recommend keeping the small default value, which is just large enough to make OT numerically stable in practice.

### Image-free segmentation

An image-free, density-based cell segmentation approach uses the per-molecule annotation to determine molecule assignments at critical cross-category boundaries. We assume that the exact assignment of boundaries within a category is not very important as long as it gives rise to a reasonable size of the segmented objects. The segmentation is implemented as graph-based hierarchical spectral clustering.

As the number of single molecules can easily be of the order of 10^{9}, an efficient Euclidean sparse distance matrix computation is required. As scipy's sparse distance matrix calculation is serial and too slow for this application, a custom, fast, parallel and still generally applicable algorithm was developed and implemented. After calculating the distances \(d_{ij}^{{\mathrm{spatial}}}\) between molecules *i* and *j* in position space, a distance contribution \(d_{ij}^{{\mathrm{annotation}}}\) derived from the single-molecule annotation is added in quadrature to get a combined distance \(d_{ij}^{{\mathrm{total}}} = \sqrt{ \left( {d_{ij}^{{\mathrm{spatial}}}} \right)^2 + \left( {d_{ij}^{{\mathrm{annotation}}}} \right)^2 }\). The annotation contribution can be zero within an annotation and infinity between annotations, automatically derived from distances between the expression profiles of the annotations, or manually specified. As the clustering works on affinities rather than distances, the distances are transformed into affinities with a Gaussian kernel at a subcellular distance scale.
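For a single pair of molecules, the combination and kernel read as follows (illustrative values; the scale would be chosen below the typical cell diameter):

```python
import numpy as np

# Spatial and annotation-derived distances combined in quadrature,
# then converted to an affinity with a Gaussian kernel.
d_spatial = 1.5        # distance in position space
d_annotation = 2.0     # e.g. penalty for crossing a cell-type boundary
d_total = np.sqrt(d_spatial**2 + d_annotation**2)

scale = 3.0            # subcellular distance scale of the kernel
affinity = np.exp(-(d_total / scale) ** 2)
```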

As the number of molecules is so large, the clustering algorithm has to be fast and scalable, work entirely on sparse matrices and cut the neighborhood graph at sensible places to generate reasonable cell-like objects. For this, a hierarchical clustering scheme was developed on top of the spectral clustering implementation of scikit-learn, which can handle sparse matrices, can use algebraic multigrid solvers for speed and, for only two clusters, solves the normalized graph cut problem^{50}. The best cuts are found iteratively in top-down fashion, splitting the data in two at each iteration. To keep the dataset tractable for the initial cuts, the spatial structure of the data is used to generate supernodes in the graph before clustering. When the clustering reaches the size regime of a few single cells, several heuristics are employed to decide whether to accept a proposed cut, based on the shape and size of the clusters and on comparing the affinity loss of the proposed cut to the expected cost for cutting a homogeneous bulk graph of corresponding size and dimension. The final result is another column of annotation for the molecules containing unique object IDs.

For visualization, these objects are filtered to include at least 20 molecules, and the associated expression profiles are normalized as in ref. ^{27} and embedded with t-SNE. To evaluate the self-consistency of annotation and segmentation, the silhouette score of the cell-type annotations is evaluated on identically filtered and normalized profiles using the silhouette_score function from scikit-learn.

### Region definition

For spatially convolved expression data, as produced by spatial transcriptomics methods such as Slide-seq and Visium, clustering (of beads or spots) in expression space is less meaningful, as individual cells are mixed. ‘Regions’ defined both in position and in expression or annotation space can be a meaningful alternative. TACCO implements a method to define such regions (Extended Data Fig. 3a) in the following two steps: (1) construction of one *k*-nearest neighbors graph based on expression (or annotation) distances and another *k*-nearest neighbors graph based on physical distances; and (2) combination of the two graphs as a weighted sum of their adjacencies. Putting more weight on the position space adjacency gives contiguous regions in position space, while more weight on the expression adjacency leads to separated islands of similar expression being annotated with the same region and can connect regions across different spatial samples (for example, Slide-seq pucks). To account for the missing links in the position graph between samples, the cross-sample adjacency is scaled up by a balancing weight factor. The combined adjacency matrix is then subjected to standard Leiden clustering^{51}, which assigns consistent region annotation across samples.
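The two-step graph construction can be sketched as follows (toy data and an assumed weight; scikit-learn's `kneighbors_graph` stands in for TACCO's graph construction, and the combined adjacency would then go to Leiden clustering, for example, via leidenalg/scanpy):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
pos = rng.random((50, 2))        # physical coordinates of beads/spots
expr = rng.random((50, 10))      # expression (or annotation) values

# Step 1: one kNN graph per space.
A_pos = kneighbors_graph(pos, n_neighbors=5)
A_expr = kneighbors_graph(expr, n_neighbors=5)

# Step 2: weighted sum of the adjacencies; more weight on position
# favors spatially contiguous regions.
w = 0.7
A = w * A_pos + (1 - w) * A_expr
```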

### Colocalization and neighborhood analysis

To score colocalization or co-occurrence of annotations^{26}, TACCO calculates *p*(anno|center;*x*)/*p*(anno), the probability to find an annotation ‘anno’ at a distance *x* from a center annotation ‘center’, normalized by the probability to find ‘anno’ regardless of a ‘center’. This is well defined also for noncategorical annotations, which are commonplace for the compositional annotations created with TACCO, and for pairs of unrelated annotations.
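For the categorical case, the score can be sketched with a dense distance matrix (a minimal illustration for small data, not TACCO's optimized implementation, which also handles compositional annotations):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def cooccurrence(coords, labels, anno, center, bins):
    """p(anno | center; x) / p(anno) for categorical labels in distance bins."""
    D = squareform(pdist(coords))
    p_anno = np.mean(labels == anno)
    is_center = labels == center
    scores = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (D >= lo) & (D < hi) & is_center[:, None]
        np.fill_diagonal(in_bin, False)          # exclude self-pairs
        neigh = labels[np.where(in_bin)[1]]      # annotations found near centers
        scores.append(np.mean(neigh == anno) / p_anno if len(neigh) else np.nan)
    return np.array(scores)

rng = np.random.default_rng(0)
coords = rng.random((100, 2))
labels = np.array(['A'] * 100)                   # homogeneous toy labels
score = cooccurrence(coords, labels, 'A', 'A', bins=np.array([0.0, 0.5, 1.5]))
```

For homogeneous labels, the score is 1 in every bin, the value indicating neither enrichment nor depletion.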

As the most time-consuming computations for co-occurrence are the same as for neighborhood-enrichment analyses, TACCO also calculates neighborhood-enrichment *z*-scores^{52} for a set of distance bins (as opposed to the set of direct neighbors on a graph) and again supports noncategorical annotation, pairs of unrelated annotations for ‘anno’ and ‘center’, and multiple samples while being competitive performance-wise (Extended Data Fig. 9).

### Annotation coordinates

To analyze a given annotation (for example, cell-type composition) with respect to its spatial distance from a reference annotation (for example, histological annotation; Extended Data Fig. 3b,c), TACCO implements an algorithm that determines a stable ‘annotation coordinate’ in position space by regularizing the minimum distance by a critical neighborhood size determined at the weights level and afterward correcting for the regularization bias. (This is required because for noisy spatial data and/or noncategorical reference annotations, simply taking the minimal distance between objects of certain annotations is unstable and/or not well defined.)

Specifically, TACCO first generates a matrix *n*_{A,x}(*d*) of occurrence histograms versus spatial distance *d* for every annotation category *A* and spatial position *x*, counting fractional annotations as fractional occurrence counts. The distance \(d_{A,x}^l\) at which its cumulative sum over distance \(N_{A,x}\left( d \right) = \mathop {\sum}\nolimits_{{{{\mathrm{d}}}}^{\prime} = 0}^{{{\mathrm{d}}}} {n_{A,x}\left( {d^{\prime} } \right)}\) exceeds a certain threshold *N*_{l} is a robust measure for the radius of the sphere centered at *x* and containing a total of *N*_{l} occurrences of annotation *A*. Even for data homogeneously annotated with *A*, however, the minimal possible value of \(d_{A,x}^l\) is larger than 0, as the threshold *N*_{l} will in general be larger than 1 to have a stabilizing effect. To allow for a minimum value of 0 and obtain a measure for the distance from *x* to a significant amount of category *A*, \(d_{A,x}^l\) is corrected using a fictitious homogeneous annotation category *H*: the value of *N*_{H,x}(\(d_{A,x}^l\)) is found, followed by solving for the distance \(d_{A,x}^0\) at which *N*_{l} fewer occurrences have appeared, *N*_{H,x}(\(d_{A,x}^0\)) = *N*_{H,x}(\(d_{A,x}^l\)) – *N*_{l}. \(d_{A,x}^0\) is now consistent with the regular minimum distance for noise-free categorical annotations, stable against noise and well defined for compositional annotations.

### Enrichments

TACCO supports various approaches to visualize compositional differences in the spatial structure of a sample and estimate the statistical significance of these differences/enrichments (Extended Data Fig. 3d–g). In particular, TACCO uses sample information to calculate enrichments not across observations (single cells/spatial beads, which are not independent observations and can lead to *P* value inflation), but across multiple samples, which gives meaningful and reasonable *P* values.

When there is just a single (or few) spatial sample(s) of considerable size, different parts of that sample are treated as semi-independent biological replicates by splitting the sample along a selection of coordinate axes to create a statistical ensemble for enrichment analysis. Increasing the number of splits beyond what can be regarded as independent replicates interpolates smoothly toward the limit where every observation is considered independent.

### Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.