Extracting Population Simulation Data

This chapter introduces how to extract and analyze data from NATAL Core simulations, including observation rules, history records, and output formats. These features are key components for data analysis, visualization, and statistical inference.

Data Extraction Overview

NATAL Core provides three main methods for data extraction:

Observation Rules

Observation rules are used to extract specific subpopulations from the complete population state, supporting flexible grouping and aggregation, suitable for real-time monitoring and statistical analysis.

History Records

The history recording feature saves state snapshots during the simulation, supporting time series analysis, with configurable recording frequency and storage format.

Output Formats

The output format system provides multiple data export formats, supporting integration with external tools for subsequent analysis and visualization.

Observation Rule System

Core Objects

Object	Purpose
Population.create_observation(...)	Recommended public entry point for building reusable observation objects
output_current_state / output_history	Recommended public output API with built-in observation integration
ObservationFilter	Advanced compilation helper class (only needed for low-level customization workflows)

Recommended Workflow

Direct instantiation of Observation in business code is discouraged; prefer using the population-layer public methods.

# Create observation rules
observation = pop.create_observation(
    groups={
        "adult_wt": {"genotype": ["WT|WT"], "age": [1]},
        "drive_carriers": {"genotype": ["WT|Drive", "Drive|Drive"]}
    },
    collapse_age=False,
)

# Get current state
current = pop.output_current_state(observation=observation)
print("Current observation data:", current["observed"])

# Get historical data
history = pop.output_history(observation=observation)
print("Historical observation data:", history["observed"])

This approach allows reusable observation definitions and defers dimension validity checking to the application/output stage.

Observation-Based History Recording (Compression Mode)

In large simulations (many genotypes, many demes), the storage cost of full raw history records is extremely high -- each snapshot contains counts for all genotypes. The Observation system can project the genotype dimension onto user-defined groups, performing aggregation directly during the recording stage and only recording the aggregated results, significantly reducing memory usage.

Comparison of Two Modes

Mode	Recorded Content	Size per Row	Re-parsing Needed on Export
Raw	`[tick, ind.ravel(), sperm.ravel()]`	`1 + n_sexes x n_ages x n_geno + ...`	Yes, expand by genotype
Observation	`[tick, observed.ravel()]`	`1 + n_groups x n_sexes x n_ages`	No, expand directly by group name

When n_groups << n_genotypes (common scenario), the compression ratio is approximately n_genotypes / n_groups.

Configuration Methods

Both methods can activate observation mode:

Method 1: Create Observation first, then set record_observation

obs = pop.create_observation(
    groups={
        "wt": {"genotype": ["WT|WT"]},
        "het": {"genotype": ["WT|Dr"]},
        "hom": {"genotype": ["Dr|Dr"]},
    },
    collapse_age=True,
)
pop.record_observation = obs  # Activate observation mode
pop.run(n_steps=100, record_every=10)

Method 2: Directly use set_observations()

pop.set_observations(
    groups={
        "wt": {"genotype": ["WT|WT"]},
        "het": {"genotype": ["WT|Dr"]},
        "hom": {"genotype": ["Dr|Dr"]},
    },
    collapse_age=True,
)
pop.run(n_steps=100, record_every=10)

Once record_observation is set, the kernel automatically uses observation aggregation mode during recording. output_history() automatically detects and selects the correct export path:

history = pop.output_history()
# Automatically exports in observation mode, each snapshot containing:
# - tick
# - labels: ["wt", "het", "hom"]
# - observed: { "wt": { "female": ..., "male": ... }, ... }

Panmictic Example

import natal as nt

species = nt.Species.from_dict(
    name="demo",
    structure={"chr1": {"loc": ["WT", "Dr"]}},
)

pop = (
    nt.DiscreteGenerationPopulation
    .setup(species=species, name="obs_demo", stochastic=False)
    .initial_state(individual_count={
        "female": {"WT|WT": 500, "Dr|WT": 50},
        "male": {"WT|WT": 500, "Dr|WT": 50},
    })
    .reproduction(eggs_per_female=50)
    .competition(carrying_capacity=10000)
    .build()
)

# Activate observation mode
pop.set_observations(
    groups={
        "wildtype": {"genotype": ["WT|WT"]},
        "drive_het": {"genotype": ["WT|Dr"]},
        "drive_hom": {"genotype": ["Dr|Dr"]},
    },
    collapse_age=True,
)
pop.run(n_steps=100, record_every=10)

# Export -- automatically uses observation mode
history = pop.output_history()
for snap in history["snapshots"]:
    print(f"tick {snap['tick']}: {snap['observed']}")

# You can switch back to raw mode at any time for inspection
pop.record_observation = None  # Disable observation mode
# Subsequent run() calls will use raw recording

Spatial Example

from natal import SpatialPopulation, HexGrid
import numpy as np

species = nt.Species.from_dict(
    name="spatial_obs",
    structure={"chr1": {"loc": ["WT", "Dr"]}},
)

kernel = np.array([
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
], dtype=np.float64)

spatial = (
    SpatialPopulation.builder(species, n_demes=9, topology=HexGrid(3, 3))
    .setup(name="spatial_obs_demo", stochastic=False)
    .initial_state(individual_count={
        "female": {"WT|WT": 500}, "male": {"WT|WT": 500},
    })
    .reproduction(eggs_per_female=50)
    .competition(carrying_capacity=10000)
    .migration(kernel=kernel, migration_rate=0.2)
    .build()
)

# Activate observation mode
spatial.set_observations(
    groups={
        "wt": {"genotype": ["WT|WT"]},
        "dr": {"genotype": ["WT|Dr", "Dr|Dr"]},
    },
    collapse_age=True,
)
spatial.run(n_steps=50, record_every=5)

# Export -- expanded per deme, with cross-deme aggregation
history = spatial.output_history()
for snap in history["snapshots"]:
    print(f"tick {snap['tick']}")
    for deme_key, deme_obs in snap["demes"].items():
        print(f"  {deme_key}: {deme_obs}")
    print(f"  aggregate: {snap['aggregate']}")

Post-hoc Observation (Without Modifying Recording Mode)

If you don't want to change the recording mode but need to view history in observation format, you can pass the observation parameter:

obs = pop.create_observation(groups={
    "females": {"sex": "female"},
    "males": {"sex": "male"},
})

# Post-hoc observation on already recorded raw history
history = pop.output_history(observation=obs)
# Note: If history is in raw mode (record_observation not set),
# observations are recomputed per snapshot on export (slower but does not require re-running the simulation).
# If history is in observation mode, the compressed data is read directly.

Spatial Deme Selector

In Spatial mode, group specs support the "deme" key to control which demes are included in the group:

spatial.set_observations(
    groups={
        "center_release": {
            "genotype": ["Dr|Dr"],
            "deme": [(1, 1)],          # Observe only the center deme
        },
        "all_wt": {
            "genotype": ["WT|WT"],
            "deme": "all",             # All demes (default)
        },
    },
)

"deme" supports: - "all" or omitted: all demes - Integer list: flat indices, e.g., [0, 2, 4] - Coordinate list: (row, col) tuples, automatically resolved through topology

When to Use Observation Mode vs Post-hoc

Scenario	Recommended Approach
Detailed analysis requiring all genotype data	Raw history (default)
Only care about time series of a few groups	`record_observation` observation mode
Need to inspect history repeatedly with different groupings	Raw history + post-hoc `output_history(observation=obs)`
Large-scale spatial (thousands of demes)	`record_observation` observation mode
Memory-constrained environments	`record_observation` observation mode

History Recording System

History Recording Configuration

The population object has built-in history recording functionality with configurable recording frequency and storage format:

# Configure history recording
pop.record_every = 10  # Record every 10 steps
pop.max_history = 1000  # Maximum of 1000 snapshots

# Run simulation and record history
results = pop.run(n_steps=500, record_every=5)

# Get history data
history_data = pop.output_history()
print("Number of history records:", len(history_data["snapshots"]))
print("Last step data:", history_data["snapshots"][-1])

Accessing History Data

# Get full history
full_history = pop.output_history()

# Get history at a specific time step
history_at_tick_100 = pop.output_history(tick=100)

# Get list of time steps in history
ticks = [snapshot["tick"] for snapshot in full_history["snapshots"]]
print("Recorded time steps:", ticks)

# Clear history to save memory
pop.clear_history()

History Data Analysis

# Analyze allele frequency changes over time
allele_freq_history = []
for snapshot in full_history["snapshots"]:
    # Recompute allele frequency for each time step
    # This needs to be adjusted based on the actual data structure
    freq = calculate_allele_frequency(snapshot)
    allele_freq_history.append(freq)

# Plot time series
import matplotlib.pyplot as plt
plt.plot(ticks, allele_freq_history)
plt.xlabel("Time Step")
plt.ylabel("Allele Frequency")
plt.show()

Output Format System

Current State Output

# Get detailed snapshot of current state
current_state = pop.output_current_state()
print("Current state:", current_state)

# Get readable dictionary format
readable_state = pop.output_current_state(as_dict=True)
print("Readable state:", readable_state)

# Get JSON format (suitable for transmission and storage)
json_state = pop.output_current_state(as_json=True)
print("JSON state:", json_state[:200])  # Show first 200 characters

Data Export

# Export as dictionary format
data_dict = pop.output_current_state(as_dict=True)

# Export as JSON format
json_data = pop.output_current_state(as_json=True, indent=2)

# Save to file
import json
with open("population_state.json", "w") as f:
    json.dump(data_dict, f, indent=2)

# Export observation data
observation_data = pop.output_current_state(
    observation=observation,
    as_dict=True
)

Integration with External Tools

import pandas as pd
import numpy as np

# Convert history data to pandas DataFrame
def history_to_dataframe(history_data):
    """Convert history records to DataFrame"""
    data = []
    for snapshot in history_data["snapshots"]:
        row = {
            "tick": snapshot["tick"],
            "total_population": snapshot["total_count"],
            "females": snapshot["female_count"],
            "males": snapshot["male_count"]
        }
        # Add observation data
        if "observed" in snapshot:
            for group_name, count in snapshot["observed"].items():
                row[f"observed_{group_name}"] = count
        data.append(row)

    return pd.DataFrame(data)

# Usage example
history_df = history_to_dataframe(full_history)
print(history_df.head())

Observation Rules in Detail

Group Format

Observation rules support multiple group formats:

1. Dictionary Format (Named Groups)

groups = {
    "all_females": {"sex": "female"},
    "adults": {"age": [2, 3, 4, 5, 6, 7]},
    "drive_carriers": {"genotype": ["WT|Drive", "Drive|Drive"]},
    "juvenile_drive": {
        "genotype": ["WT|Drive"],
        "age": [0, 1],
        "sex": "female"
    },
}

2. Pattern Matching (Recommended)

groups = {
    "target_female": {
        # Ordered matching Maternal|Paternal
        "genotype": "A1/B1|A2/B2; C1/D1|C2/D2",
        "sex": "female",
    },
    "target_female_unordered": {
        # Unordered matching (two homologous chromosome copies can be swapped)
        "genotype": "A1/B1::A2/B2; C1/D1::C2/D2",
        "sex": "female",
    }
}

Selector Formats

Genotype Selector

# String (comma-separated)
{"genotype": "WT|WT"}

# Pattern string (recommended)
{"genotype": "A1/B1|A2/B2; C1/D1::C2/D2"}

# String list
{"genotype": ["WT|WT", "WT|Drive", "Drive|Drive"]}

# Wildcard (all genotypes)
{"genotype": "*"}

Sex Selector

# String
{"sex": "female"}  # or {"sex": "male"}

# Integer
{"sex": 0}  # Female, {"sex": 1} Male

# List
{"sex": ["female", "male"]}  # Both sexes

Age Selector

# Explicit list
{"age": [2, 3, 4]}

# Closed interval tuple
{"age": [2, 7]}  # ages 2,3,4,5,6,7

# Interval list (union)
{"age": [[0, 1], [4, 6]]}  # ages 0,1,4,5,6

# Callable
{"age": lambda a: a >= 2}  # age greater than or equal to 2

Practical Examples

Monitoring Gene Drive Spread

# Create observation rules specifically for monitoring gene drive
drive_observation = pop.create_observation(
    groups={
        "wild_type": {"genotype": ["WT|WT"]},
        "heterozygotes": {"genotype": ["WT|Drive"]},
        "homozygotes": {"genotype": ["Drive|Drive"]},
        "total_drive_carriers": {"genotype": ["WT|Drive", "Drive|Drive"]}
    }
)

# Run simulation and monitor in real time
for step in range(100):
    pop.run_tick()

    if step % 10 == 0:
        current = pop.output_current_state(observation=drive_observation)
        observed = current["observed"]
        print(f"Step {step}: WT={observed['wild_type']}, "
              f"Het={observed['heterozygotes']}, "
              f"Hom={observed['homozygotes']}")

Age Structure Analysis

# Analyze population dynamics across different age groups
age_observation = pop.create_observation(
    groups={
        "juveniles": {"age": [0, 1]},
        "young_adults": {"age": [2, 3]},
        "mature_adults": {"age": [4, 5]},
        "old_adults": {"age": [6, 7]}
    }
)

# Get history data and analyze
history = pop.output_history(observation=age_observation)

# Analyze age structure changes
for snapshot in history["snapshots"]:
    total = sum(snapshot["observed"].values())
    if total > 0:
        juvenile_ratio = snapshot["observed"]["juveniles"] / total
        print(f"Juvenile ratio: {juvenile_ratio:.3f}")

Best Practices

Observation Rule Design

Use meaningful group names for easier subsequent analysis
Keep groups mutually exclusive to avoid double counting
Prefer pattern matching over hardcoded genotype lists

History Record Management

Set an appropriate record_every parameter to balance precision and performance
Use clear_history() to manage memory usage
Regularly export history data to avoid data loss

Data Export

Use standard formats (JSON, dictionary) for easier tool integration
Include sufficient metadata (time steps, parameter settings, etc.)
Consider data compression and storage efficiency

FAQ

What is the difference between observation rules and history records?

Observation rules define how to extract data from the state; history records save the time series of states. Observation rules can be applied to the current state or to history records.

How to optimize history records for large datasets?

Increase the record_every interval, use clear_history() for periodic cleanup, or export to external storage.

Do observation rules affect simulation performance?

Observation rules themselves do not affect simulation performance, but frequent data extraction and storage may impact overall performance.

This chapter introduced how to extract and analyze data from NATAL Core simulations. In real projects, it is recommended to first design appropriate observation rules, then choose the suitable data extraction method based on your needs.