dataframe

End-to-end dataframe-based operations.

Modules

Functions

surface_postprocess_trie(df, *[, ...])

Postprocess raw phylogenetic tree reconstruction output data to create finalized estimate of phylogenetic history.

surface_test_drive(df, *, dstream_algo, ...)

Reads alife standard phylogeny dataframe to create a population of hstrat surface annotations corresponding to the phylogeny tips, "as-if" they had evolved according to the provided phylogeny history.

surface_validate_trie(df[, max_num_checks, ...])

Validate trie reconstruction output data.

surface_postprocess_trie(df: pl.DataFrame, *, drop_dstream_metadata: bool | None = None, trie_postprocessor: Callable = hstrat.NopTriePostprocessor(), delete_trunk: bool = True) → pl.DataFrame

Postprocess raw phylogenetic tree reconstruction output data to create finalized estimate of phylogenetic history.

Performs the following operations:

  • Delete trunk nodes with rank less than dstream_S (see delete_trunk).
  • Collapse unifurcations.
  • Assign contiguous IDs to nodes.
  • Apply supplied trie_postprocessor functor.
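The "collapse unifurcations" and "assign contiguous IDs" steps can be illustrated on a plain list of rows. This is a minimal sketch, not the library's implementation: the helper names collapse_unifurcations and assign_contiguous_ids are hypothetical, a root is assumed to be a row whose ancestor_id equals its own id, and roots are assumed to be kept even when they have a single child.

```python
def collapse_unifurcations(rows):
    """Drop non-root nodes that have exactly one child.

    `rows` is a list of {"id": int, "ancestor_id": int} dicts.
    Hypothetical helper for illustration only.
    """
    parent = {r["id"]: r["ancestor_id"] for r in rows}
    child_count = {}
    for node, anc in parent.items():
        if anc != node:  # roots point at themselves; do not count that
            child_count[anc] = child_count.get(anc, 0) + 1

    def is_unifurcation(node):
        return child_count.get(node, 0) == 1 and parent[node] != node

    def nearest_kept_ancestor(node):
        anc = parent[node]
        while is_unifurcation(anc):
            anc = parent[anc]
        return anc

    kept = []
    for r in rows:
        node = r["id"]
        if is_unifurcation(node):
            continue  # spliced out; its child reattaches further up
        if parent[node] == node:
            kept.append(dict(r))
        else:
            kept.append({**r, "ancestor_id": nearest_kept_ancestor(node)})
    return kept


def assign_contiguous_ids(rows):
    """Renumber ids to 0, 1, ..., n-1 in row order."""
    new_id = {r["id"]: i for i, r in enumerate(rows)}
    return [
        {**r, "id": new_id[r["id"]], "ancestor_id": new_id[r["ancestor_id"]]}
        for r in rows
    ]


# Node 3 has a single child (5), so it is collapsed away.
rows = [
    {"id": 0, "ancestor_id": 0},
    {"id": 3, "ancestor_id": 0},
    {"id": 5, "ancestor_id": 3},
    {"id": 7, "ancestor_id": 5},
    {"id": 9, "ancestor_id": 5},
]
result = assign_contiguous_ids(collapse_unifurcations(rows))
```

Renumbering after collapsing keeps ids aligned with row positions.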

Parameters

df : pl.DataFrame

The input DataFrame containing packed data with required columns, one row per genome.

Required schema:
  • ‘id’ : pl.UInt64
    • Unique identifier for each taxon (RE alife standard format).

  • ‘ancestor_id’ : pl.UInt64
    • Unique identifier for ancestor taxon (RE alife standard format).

  • ‘dstream_rank’ : pl.UInt64
    • Num generations elapsed for ancestral differentia.

    • Corresponds to dstream_Tbar for inner nodes.

    • Corresponds to dstream_T - 1 for leaf nodes.

  • ‘hstrat_differentia_bitwidth’ : pl.UInt32
    • Size of annotation differentiae, in bits.

    • Corresponds to dstream_value_bitwidth.

  • ‘dstream_S’ : pl.UInt32
    • Capacity of dstream buffer used for hstrat surface, in number of data items (i.e., differentia values).

Optional schema:
  • ‘dstream_data_id’ : pl.UInt64
    • Unique identifier for each genome in source genome dataframe.

delete_trunk : bool, default True

Should trunk nodes with rank less than dstream_S be deleted?

Trunk deletion accounts for “dummy” strata added to fill hstrat surface for founding ancestor(s), by segregating subtrees with distinct founding strata into independent trees.

trie_postprocessor : Callable, default hstrat.NopTriePostprocessor()

Tree postprocess functor.

Must take trie of type pandas.DataFrame, p_differentia_collision of type float, mutate of type bool, and progress_wrap of type Callable params. Must return postprocessed trie (type pl.DataFrame).

To apply multiple postprocessors, use hstrat.CompoundTriePostprocessor.
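As a rough illustration of the call shape described above, the following sketch defines a do-nothing functor. The class name PassthroughTriePostprocessor is hypothetical, and passing the arguments by keyword is an assumption. The sketch only relies on the trie object having a copy() method, so a plain list stands in for the DataFrame here.

```python
class PassthroughTriePostprocessor:
    """Hypothetical functor accepting the documented parameters.

    Returns the trie unchanged: the same object when `mutate` is True,
    otherwise a copy so the caller's data is left untouched.
    """

    def __call__(
        self,
        trie,
        p_differentia_collision,
        mutate=False,
        progress_wrap=lambda x: x,
    ):
        if not 0.0 <= p_differentia_collision <= 1.0:
            raise ValueError("p_differentia_collision must be a probability")
        return trie if mutate else trie.copy()


# Stand-in data; a real caller would pass a DataFrame.
stand_in_trie = [1, 2, 3]
postprocessor = PassthroughTriePostprocessor()
copied = postprocessor(stand_in_trie, p_differentia_collision=0.5)
same = postprocessor(stand_in_trie, p_differentia_collision=0.5, mutate=True)
```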

Returns

pl.DataFrame

The output DataFrame containing the estimated phylogenetic tree in alife standard format, with the following columns:

Required schema:
  • ‘id’ : pl.UInt64
    • Unique identifier for each taxon (RE alife standard format).

  • ‘ancestor_id’ : pl.UInt64
    • Unique identifier for ancestor taxon (RE alife standard format).

  • ‘hstrat_rank’ : pl.Int64
    • Num generations elapsed for ancestral differentia.

    • Corresponds to dstream_Tbar - dstream_S for inner nodes.

    • Corresponds to dstream_T - 1 - dstream_S for leaf nodes.

Optional schema:
  • ‘origin_time’ : pl.Int64
    • Estimated origin time for phylogeny nodes, in generations elapsed since founding ancestor.

    • Value depends on the trie postprocessor used.

Additional user-defined columns will be forwarded from the input DataFrame. Any columns created by the trie postprocessor will also be included.

Note that the alife-standard ancestor_list column is not included in the output.
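For readers unfamiliar with the omitted column, the sketch below shows how an ancestor_list value could be derived from id and ancestor_id. It assumes the common alife-standard convention of the string "[none]" for roots and "[<ancestor id>]" otherwise. The function make_ancestor_list is a hypothetical stand-in and is not the helper listed under See Also.

```python
def make_ancestor_list(ids, ancestor_ids):
    """Build ancestor_list strings from parallel id / ancestor_id lists.

    Assumes a root is a taxon whose ancestor_id equals its own id.
    """
    return [
        "[none]" if anc == node else f"[{anc}]"
        for node, anc in zip(ids, ancestor_ids)
    ]


ancestor_list = make_ancestor_list([0, 1, 2], [0, 0, 1])
```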

Notes

Collapsing trunk nodes with rank less than dstream_S assumes that S “dummy” strata were added to fill hstrat surface for founding ancestor(s).
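The relationship between the input dstream_rank and the output hstrat_rank can be sketched as below. The helper names are hypothetical. The arithmetic follows the schema descriptions: nodes with dstream_rank below dstream_S are treated as trunk, and the remaining ranks are shifted down by dstream_S.

```python
def is_trunk(dstream_rank, dstream_S):
    """True for nodes whose rank falls among the S dummy strata."""
    return dstream_rank < dstream_S


def to_hstrat_rank(dstream_rank, dstream_S):
    """Shift a raw dstream rank into hstrat rank space."""
    return dstream_rank - dstream_S


dstream_S = 8
ranks = [0, 7, 8, 20]
trunk_flags = [is_trunk(r, dstream_S) for r in ranks]
kept_hstrat_ranks = [
    to_hstrat_rank(r, dstream_S) for r in ranks if not is_trunk(r, dstream_S)
]
```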

Currently, data is converted to Pandas for processing, then back to Polars.

See Also

surface_unpack_reconstruct :

Creates raw reconstruction data postprocessed here.

alifestd_try_add_ancestor_list_col :

Adds alife-standard ancestor_list column to phylogeny data.

surface_test_drive(df: pl.DataFrame, *, dstream_algo: str, dstream_S: int, dstream_T_bitwidth: int = 32, progress_wrap: Callable = <function <lambda>>, stratum_differentia_bit_width: int) → pl.DataFrame

Reads alife standard phylogeny dataframe to create a population of hstrat surface annotations corresponding to the phylogeny tips, “as-if” they had evolved according to the provided phylogeny history.

Parameters

df : pl.DataFrame

The input DataFrame containing alife standard phylogeny with required columns, one row per taxon.

Note that the alife-standard ancestor_list column is not required.

Required schema:
  • ‘id’ : pl.UInt64
    • Taxon identifier.

  • ‘ancestor_id’ : pl.UInt64
    • Taxon identifier of ancestor.

    • Own ‘id’ if root.

Optional schema:
  • ‘origin_time’ : pl.UInt64
    • Number of generations elapsed from ancestor.

    • Determines branch lengths.

    • Otherwise, all branches are assumed to be length 1.

  • ‘extant’ : pl.Boolean
    • Should an entry corresponding to this phylogeny taxon be included in the output population?

    • Otherwise, all tips are considered extant and all inner nodes are not.

  • Additional user-defined columns will be forwarded to the output DataFrame.
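The default treatment of a missing ‘extant’ column (tips extant, inner nodes not) can be sketched with plain lists. The function default_extant is hypothetical and only illustrates the rule stated above. A root is assumed to carry its own id as ancestor_id.

```python
def default_extant(ids, ancestor_ids):
    """Mark tips (taxa with no children) as extant, inner nodes as not."""
    has_child = {
        anc for node, anc in zip(ids, ancestor_ids) if anc != node
    }
    return [node not in has_child for node in ids]


# Taxon 0 is the root, 1 is an inner node, 2 and 3 are tips.
extant = default_extant([0, 1, 2, 3], [0, 0, 0, 1])
```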

dstream_algo : str

Name of downstream curation algorithm to use.

dstream_S : int

Capacity of annotation dstream buffer, in number of data items.

progress_wrap : Callable, optional

Pass tqdm or equivalent to display a progress bar.

stratum_differentia_bit_width : int

The bit width of the generated differentia.

Returns

pl.DataFrame

The output DataFrame containing generated hstrat surface annotations.

Required schema:
  • ‘data_hex’ : pl.String
    • Raw genome data, with serialized dstream buffer and counter.

    • Represented as a hexadecimal string.

  • ‘downstream_version’ : pl.Categorical
    • Version of downstream library used.

  • ‘dstream_algo’ : pl.Categorical
    • Name of downstream curation algorithm used.

    • e.g., ‘dstream.steady_algo’

  • ‘dstream_storage_bitoffset’ : pl.UInt32
    • Position of dstream buffer field in ‘data_hex’.

  • ‘dstream_storage_bitwidth’ : pl.UInt32
    • Size of dstream buffer field in ‘data_hex’.

  • ‘dstream_T_bitoffset’ : pl.UInt32
    • Position of dstream counter field in ‘data_hex’.

  • ‘dstream_T_bitwidth’ : pl.UInt32
    • Size of dstream counter field in ‘data_hex’.

  • ‘dstream_S’ : pl.UInt32
    • Capacity of dstream buffer, in number of data items.

  • ‘origin_time’ : pl.UInt64
    • Number of generations elapsed since the founding ancestor.

  • ‘td_source_id’ : pl.UInt64
    • Corresponding taxon identifier in source phylogeny.

Additional user-defined columns will be forwarded from the input DataFrame.
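To make the bitoffset and bitwidth columns concrete, the sketch below slices fields out of a hexadecimal string. It rests on several assumptions that are not confirmed by this page: offsets are counted in bits from the start of the string, fields are aligned to whole hex characters (multiples of 4 bits), and the counter is read as a big-endian integer. The example layout and the extract_field helper are invented for illustration.

```python
def extract_field(data_hex, bitoffset, bitwidth):
    """Return the hex substring covering a bit field.

    Assumes offsets count from the start of the string and that the
    field is aligned to whole hex characters (4 bits each).
    """
    if bitoffset % 4 or bitwidth % 4:
        raise ValueError("sketch only supports hex-aligned fields")
    start = bitoffset // 4
    return data_hex[start : start + bitwidth // 4]


# Invented layout: a 32-bit counter followed by a 32-bit buffer.
data_hex = "0000002a" + "deadbeef"
counter_hex = extract_field(data_hex, bitoffset=0, bitwidth=32)
storage_hex = extract_field(data_hex, bitoffset=32, bitwidth=32)
counter_value = int(counter_hex, 16)  # assumes big-endian encoding
```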

Notes

  • Input columns “id”, “ancestor_id”, and “ancestor_list” are not forwarded to output, to avoid conflicts with the output schema for subsequent phylogeny reconstruction.

surface_validate_trie(df: pl.DataFrame, max_num_checks: int = 1000, max_violations: int = 0, progress_wrap: Callable = <function <lambda>>, seed: int | None = None) → int

Validate trie reconstruction output data.

Performs structural checks and pairwise leaf-node validation to confirm that reconstructed trie correctly reflects common differentia among source hereditary stratigraphic surfaces.

Checks performed:

  1. Required dstream/downstream columns for surface deserialization from data_hex are present.

  2. The id and ancestor_id columns are present.

  3. Taxon ids are contiguous (i.e., match row indices 0, 1, …, n-1).

  4. Data is topologically sorted (each ancestor appears before all its descendants).

  5. Samples random leaf-node pairs and compares each pair’s first retained disparity rank (computed from deserialized surfaces) to the MRCA node’s dstream_rank - dstream_S in the trie (converting from raw dstream T space to hstrat rank space). A violation occurs when first_disparity_rank < mrca_rank: the surfaces prove divergence earlier than the trie records.
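Checks 3 and 4 and the violation criterion from check 5 can be sketched on plain lists as follows. This is an illustrative sketch under stated assumptions rather than the function's actual code: the helper names are hypothetical, and a root is assumed to carry its own id as ancestor_id.

```python
def check_structure(ids, ancestor_ids):
    """Raise ValueError if ids are not 0..n-1 or rows are not sorted."""
    if list(ids) != list(range(len(ids))):
        raise ValueError("taxon ids are not contiguous")
    # With ids equal to row indices, an ancestor precedes its
    # descendants exactly when ancestor_id <= id for every row.
    for node, anc in zip(ids, ancestor_ids):
        if anc > node:
            raise ValueError("data is not topologically sorted")


def is_violation(first_disparity_rank, mrca_dstream_rank, dstream_S):
    """Check 5: surfaces diverge earlier than the trie records."""
    mrca_rank = mrca_dstream_rank - dstream_S  # convert to hstrat rank space
    return first_disparity_rank < mrca_rank


check_structure([0, 1, 2], [0, 0, 1])  # passes silently
```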

Parameters

df : pl.DataFrame

Trie reconstruction output, as produced by surface_unpack_reconstruct with --no-drop-dstream-metadata.

Required schema:
  • ‘id’ : integer

    Unique identifier for each taxon (RE alife standard).

  • ‘ancestor_id’ : integer

    Unique identifier for ancestor taxon (RE alife standard).

  • ‘dstream_rank’ : integer

    Rank stored at this node (generation count).

  • ‘data_hex’ : string

    Raw genome data as a hexadecimal string.

  • ‘dstream_algo’ : string or categorical

    Name of downstream curation algorithm (e.g., 'dstream.steady_algo').

  • ‘dstream_storage_bitoffset’ : integer

    Bit offset of the dstream buffer field in data_hex.

  • ‘dstream_storage_bitwidth’ : integer

    Bit width of the dstream buffer field in data_hex.

  • ‘dstream_T_bitoffset’ : integer

    Bit offset of the dstream counter (“rank”) field in data_hex.

  • ‘dstream_T_bitwidth’ : integer

    Bit width of the dstream counter field in data_hex.

  • ‘dstream_S’ : integer

    Capacity of the dstream buffer (number of differentia stored per annotation).

max_num_checks : int, default 1_000

Maximum number of leaf-pair comparisons to perform. Pairs are sampled randomly without replacement from all possible pairs.

max_violations : int, default 0

Maximum number of MRCA-rank violations tolerated before returning early. Callers should treat a return value exceeding this threshold as a validation failure.

progress_wrap : Callable, optional

Wrapper applied to the pair-check iterator, e.g., tqdm.tqdm for a progress bar. Must accept and return an iterable. Default is the identity function (no wrapping).

seed : int, default None

Random seed used when sampling leaf pairs.
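Seeded sampling of leaf pairs without replacement, capped at max_num_checks, might look like the sketch below. It uses only the Python standard library, and the function sample_leaf_pairs is hypothetical. Enumerating every pair is quadratic in the number of leaves, so this form suits small illustrations only and is not a claim about how the function samples internally.

```python
import itertools
import random


def sample_leaf_pairs(leaf_ids, max_num_checks, seed=None):
    """Draw up to max_num_checks distinct unordered leaf pairs."""
    all_pairs = list(itertools.combinations(leaf_ids, 2))
    rng = random.Random(seed)  # a fixed seed gives a reproducible draw
    num_checks = min(max_num_checks, len(all_pairs))
    return rng.sample(all_pairs, num_checks)


leaves = [2, 3, 5, 7]
pairs = sample_leaf_pairs(leaves, max_num_checks=4, seed=1)
```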

Returns

int

Number of leaf-pair violations detected. Returns early (possibly before all max_num_checks pairs have been checked) once max_violations is exceeded.
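The early-return behaviour and the role of progress_wrap can be sketched as a simple counting loop. The function count_violations and its two rank callbacks are hypothetical stand-ins. Only the comparison first_disparity_rank < mrca_rank and the stop-once-exceeded rule are taken from the description above.

```python
def count_violations(
    pairs, first_disparity_rank, mrca_rank, max_violations,
    progress_wrap=lambda x: x,
):
    """Count violating pairs, stopping once max_violations is exceeded."""
    violations = 0
    for a, b in progress_wrap(pairs):
        if first_disparity_rank(a, b) < mrca_rank(a, b):
            violations += 1
            if violations > max_violations:
                break  # early return: remaining pairs are not checked
    return violations


# Invented example data: every MRCA sits at rank 5.
disparity = {(0, 1): 3, (0, 2): 9, (1, 2): 1}
pairs = list(disparity)
strict = count_violations(
    pairs, lambda a, b: disparity[(a, b)], lambda a, b: 5, max_violations=0
)
lenient = count_violations(
    pairs, lambda a, b: disparity[(a, b)], lambda a, b: 5, max_violations=5
)
```

A caller would compare the returned count against max_violations to decide whether validation failed.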

Raises

ValueError

If any required column is missing, ids are not contiguous, or data is not topologically sorted.

See Also

surface_unpack_reconstruct :

Produces trie reconstruction data to be validated here.