Library Analysis#

The Library Analysis module provides tools for analyzing Thompson Sampling results, identifying top building blocks, and creating visualizations for benchmarking studies.

Module Architecture#

Quick Start#

Analyze results and find top building blocks:

Basic analysis workflow#

import polars as pl
from TACTICS.library_analysis import LibraryAnalysis, LibraryVisualization

# Load results
results_df = pl.read_csv("results.csv")

# Create analysis with SMILES files for building block lookup
analysis = LibraryAnalysis(
    df=results_df,
    smiles_files=["acids.smi", "amines.smi"],
    product_code_column="Product_Code",
    score_column="Scores"
)

# Find top building blocks in top 100 compounds
counters, total = analysis.find_top_building_blocks(cutoff=100)

# Visualize top building blocks
viz = LibraryVisualization(analysis)
viz.visualize_top_building_blocks(top_n=20)

Benchmark Thompson Sampling methods:

Benchmarking multiple methods#

from TACTICS.library_analysis.visualization import TS_Benchmarks

# TS_Benchmarks auto-generates all data during initialization
benchmarks = TS_Benchmarks(
    no_of_cycles=10,
    methods_list=["roulette_wheel", "bayes_ucb", "greedy"],
    TS_runs_data={
        "roulette_wheel": [cycle1_df, cycle2_df, ...],
        "bayes_ucb": [cycle1_df, cycle2_df, ...],
        "greedy": [cycle1_df, cycle2_df, ...],
    },
    reference_data=reference_df,
    top_n=100,
    sort_type="minimize"
)

# All data pre-calculated - ready to plot immediately
benchmarks.stripplot_TS_results()
benchmarks.plot_barplot_TS_results()
benchmarks.plot_line_performance_with_error_bars()

LibraryAnalysis#

Analyzes chemical libraries to identify top building blocks and their overlap.

Dependencies

Polars or Pandas DataFrame with product codes and scores
SMILES files (.smi) containing building block structures
Used by LibraryVisualization for visualizations

Constructor#

Parameters#
Parameter	Type	Required	Description
`df`	`DataFrame`	Yes	Polars or Pandas DataFrame with product codes and scores.
`smiles_files`	`str \| list[str]`	Yes	Path(s) to .smi file(s) containing SMILES and building block codes.
`product_code_column`	`str`	No	Name of product code column. Default: `"Product_Code"`.
`score_column`	`str`	No	Name of score column. Default: `"Scores"`.

Methods#

find_top_building_blocks#

Identify and count building blocks in top products.

Parameters#
Parameter	Type	Required	Description
`cutoff`	`int`	Yes	Number of top products to analyze.
`sort_scores_by`	`str`	No	`"ascending"` or `"descending"`. Default: `"ascending"`.
`top_n`	`int`	No	Number of top building blocks per position. Default: 100.

Returns

Type	Description
`Tuple[List[Counter], int]`	(position_counters, total_molecules)

Example

from TACTICS.library_analysis import LibraryAnalysis

analysis = LibraryAnalysis(
    df=results_df,
    smiles_files=["acids.smi", "amines.smi"],
    product_code_column="Product_Code",
    score_column="Scores"
)
counters, total = analysis.find_top_building_blocks(cutoff=100)

check_overlap#

Check overlap between current cutoff and a new cutoff value.

Parameters#
Parameter	Type	Required	Description
`new_cutoff`	`int`	Yes	New cutoff value to compare against.

Returns

Type	Description
`list[tuple]`	Overlap information for each position.

compare_analysis_overlap#

Compare building block overlap with another LibraryAnalysis instance.

Parameters#
Parameter	Type	Required	Description
`other_analysis`	`LibraryAnalysis`	Yes	Another analysis to compare with.
`top_n`	`int`	No	Number of top building blocks. Default: 100.

Returns

Type	Description
`list`	Overlap results for each position.

LibraryVisualization#

Creates visualizations from LibraryAnalysis instances.

Dependencies

Requires LibraryAnalysis instance
RDKit for molecular structure rendering

Depends on: LibraryAnalysis

Constructor#

Parameters#
Parameter	Type	Required	Description
`analysis`	`LibraryAnalysis`	Yes	Analysis instance containing data to visualize.

Methods#

visualize_top_building_blocks#

Display top building blocks using RDKit molecular drawings.

Parameters#
Parameter	Type	Required	Description
`show_overlap`	`bool`	No	Show overlapping building blocks. Default: False.
`mols_per_row`	`int`	No	Molecules per row in grid. Default: 5.
`sub_img_size`	`tuple`	No	Size of each molecule image. Default: (300, 300).
`comparison_analysis`	`LibraryAnalysis`	No	Another analysis for overlap comparison.
`top_n`	`int`	No	Number of top building blocks. Default: 20.

plot_top_products_comparison#

Generate subplots comparing overlap between analysis instances and a reference.

Parameters#
Parameter	Type	Required	Description
`analysis_instances`	`list`	Yes	List of LibraryAnalysis instances to compare.
`reference_instance`	`LibraryAnalysis`	Yes	Reference analysis instance.
`top_n`	`int`	No	Top products to consider. Default: 100.
`title`	`str`	No	Plot title.
`figsize`	`tuple`	No	Figure size (width, height). Default: (15, 10).
`analysis_labels`	`list`	No	Labels for each analysis group.
`save_path`	`str`	No	Path to save the plot.

TS_Benchmarks#

Comprehensive benchmarking and visualization for Thompson Sampling results across multiple cycles and search strategies.

Key Features:

Automatic data generation during initialization
Consistent color schemes across all plots
Multi-cycle analysis with statistics
Reference comparison with performance metrics
Multiple visualization types (strip, bar, line plots)

Constructor#

All required data is automatically generated during initialization.

Parameters#
Parameter	Type	Required	Description
`no_of_cycles`	`int`	Yes	Number of cycles to analyze.
`methods_list`	`list[str]`	Yes	List of method names (search strategies).
`TS_runs_data`	`dict`	Yes	Maps method names to lists of DataFrames (one per cycle).
`reference_data`	`DataFrame`	No	Ground truth reference data for comparison.
`top_n`	`int`	No	Top products for bar plot. Default: 100.
`sort_type`	`str`	No	`"minimize"` or `"maximize"`. Default: `"minimize"`.
`top_ns`	`list[int]`	No	Top-N values for line plot. Default: [50, 100, 200, 300, 400, 500].

Automatic Data Storage:

Attribute	Description
`combined_df_top_n`	Top N compounds from each method/cycle (for stripplot).
`combined_df_all`	All compounds from each method/cycle.
`bar_plot_df`	Hit recovery data for bar plots.
`line_plot_df`	Raw performance data across cycles.
`grouped_stats`	Statistical summaries with mean, std, error bounds.
`actual_methods`	Methods found in data (for validation).

Visualization Methods#

stripplot_TS_results#

Generate strip plot showing score distributions across cycles and methods.

Parameters#
Parameter	Type	Required	Description
`width`	`int`	No	Plot width in pixels (auto-calculated if None).
`height`	`int`	No	Plot height in pixels (auto-calculated if None).
`save_path`	`str`	No	Path to save (.html, .png, .svg).
`show_plot`	`bool`	No	Display in Jupyter. Default: True.
`legend_position`	`str`	No	Position of legend: `"right"` (default) or `"bottom"` for horizontal legend below plot.

Returns

Type	Description
`altair.Chart`	Altair chart object (or None if saved).

plot_barplot_TS_results#

Create grouped bar plot showing reference hit recovery by method and cycle.

Parameters#
Parameter	Type	Required	Description
`width`	`int`	No	Plot width in pixels (auto-calculated if None).
`height`	`int`	No	Plot height in pixels. Default: 400.
`save_path`	`str`	No	Path to save (.html, .png, .svg).
`show_plot`	`bool`	No	Display in Jupyter. Default: True.
`legend_position`	`str`	No	Position of legend: `"right"` (default) or `"bottom"` for horizontal legend below plot.
`dark_mode`	`bool`	No	Use white text for bar labels (for dark backgrounds). Default: False.

plot_line_performance_with_error_bars#

Create line plot with error bars showing mean performance across top-N cutoffs.

Parameters#
Parameter	Type	Required	Description
`width`	`int`	No	Plot width in pixels. Default: 800.
`height`	`int`	No	Plot height in pixels. Default: 500.
`save_path`	`str`	No	Path to save (.html, .png, .svg).
`show_plot`	`bool`	No	Display in Jupyter. Default: True.
`legend_position`	`str`	No	Position of legend: `"right"` (default) or `"bottom"` for horizontal legend below plot.

Other Methods#

Method	Description
`get_performance_summary()`	Get dict containing all stored data and charts.
`gen_TS_runs_data(top_n, sort_type)`	Internal: Generate combined datasets.
`get_barplot_TS_results_data(top_n)`	Internal: Generate bar plot data.
`gen_line_plot_performance_data(top_ns)`	Internal: Generate line plot data.

Complete Example#

Full benchmarking workflow#

import polars as pl
from TACTICS.library_analysis.visualization import TS_Benchmarks

# Load results from multiple runs
rw_runs = [pl.read_csv(f"rw_cycle_{i}.csv") for i in range(10)]
ucb_runs = [pl.read_csv(f"ucb_cycle_{i}.csv") for i in range(10)]
greedy_runs = [pl.read_csv(f"greedy_cycle_{i}.csv") for i in range(10)]

# Load reference data
reference = pl.read_csv("reference_scores.csv")

# Create benchmarks (all data generated automatically)
benchmarks = TS_Benchmarks(
    no_of_cycles=10,
    methods_list=["RouletteWheel", "BayesUCB", "Greedy"],
    TS_runs_data={
        "RouletteWheel": rw_runs,
        "BayesUCB": ucb_runs,
        "Greedy": greedy_runs,
    },
    reference_data=reference,
    top_n=100,
    sort_type="minimize",
    top_ns=[25, 50, 100, 200, 300]
)

# Generate all visualizations
strip_chart = benchmarks.stripplot_TS_results(
    width=800, height=500, save_path="strip_plot.html",
    legend_position="right"  # or "bottom" for horizontal legend
)

bar_chart = benchmarks.plot_barplot_TS_results(
    width=700, height=400, save_path="bar_plot.html",
    legend_position="bottom",  # horizontal legend below plot
    dark_mode=False  # set True for white text on dark backgrounds
)

line_chart = benchmarks.plot_line_performance_with_error_bars(
    width=900, height=600, save_path="line_plot.html",
    legend_position="right"
)

# Access computed statistics
print(f"Methods analyzed: {benchmarks.actual_methods}")
summary = benchmarks.get_performance_summary()

Workflow Overview#

Typical analysis workflow:

Run Thompson Sampling - Generate results DataFrames
LibraryAnalysis - Identify top building blocks and their frequencies
LibraryVisualization - Create molecular structure grids
TS_Benchmarks - Compare multiple methods with statistical analysis

Diagnostic Plots#

The diagnostic_plots module provides visualization functions for analyzing CATS diagnostics data collected during Thompson Sampling runs.

Requires: track_diagnostics=True in ThompsonSamplingConfig and the diagnostics DataFrame from sampler.get_diagnostics().

Available Functions#

Function	Description
`plot_criticality_trajectories(diag_df)`	Plot per-component criticality over cycles.
`plot_snr_trajectories(diag_df)`	Plot signal-to-noise ratio over cycles.
`plot_temperature_trajectories(diag_df)`	Plot effective temperature over cycles.
`plot_multiplier_trajectories(diag_df)`	Plot CATS multiplier over cycles.
`plot_diagnostics_dashboard(diag_df)`	Create a multi-panel dashboard with all diagnostic plots.

Example

from TACTICS.library_analysis.diagnostic_plots import plot_diagnostics_dashboard

# After running search with track_diagnostics=True
diag_df = sampler.get_diagnostics()
fig = plot_diagnostics_dashboard(diag_df)
fig.savefig("diagnostics.png", dpi=150, bbox_inches="tight")