Quickstart

The purpose of hstrat is to enable approximate inference of the phylogenetic history of a distributed digital population solely through analysis of HereditaryStratigraphicColumn genome annotations. Each annotation object attaches to a particular genome, and is inherited by all offspring genomes. (In most use cases, each genome within a simulation will have exactly one annotation.)

The hereditary stratigraphy methodology implemented by HereditaryStratigraphicColumn enables generations elapsed since the most recent common ancestor (MRCA) of two genomes to be bounded within an estimated range. This is done by means of data local to the HereditaryStratigraphicColumn object — no centralized tracking system required.

This capability has direct applications in digital evolution research (e.g., artificial life, genetic programming, genetic algorithms), and also may prove useful for other distributed systems applications where it is desired to track the copy-tree provenance of digital artifacts.

The following provides a practical, code-first introduction to creating, propagating, and analyzing hstrat instrumentation to perform distributed phylogenetic tracking. See “Under the Hood” for more technical discussion of the underlying hereditary stratigraphy methodology.

Installation

python3 -m pip install hstrat

Initializing a `HereditaryStratigraphicColumn`

The hstrat library provides a genome annotation called a hereditary stratigraphic column to track phylogenetic relatedness within a simulation. Here, we initialize two columns — representing two fully-independent evolutionary origins.

[1]:

from hstrat import hstrat
from hstrat._auxiliary_lib import seed_random

seed_random(1)  # reproducibility

founder1 = hstrat.HereditaryStratigraphicColumn(
    stratum_retention_policy=hstrat.fixed_resolution_algo.Policy(1),
    stratum_differentia_bit_width=16,
)
founder2 = hstrat.HereditaryStratigraphicColumn(
    stratum_retention_policy=hstrat.fixed_resolution_algo.Policy(3),
    stratum_differentia_bit_width=16,
)

Note the configuration of stratum retention policy for each annotation above.

This policy governs the trade-off between space usage of HereditaryStratigraphicColumn objects and inferential power. Simply put, these are various methods use follow a specific algorithm to “prune”, or systematically delete, data in the column as generations elapse. (Hereditary stratigraphy works by appending data to annotations each generation.) Pruning avoids linear space complexity as generations elapse, but comes at the expense of introducing uncertainty into estimates of relatedness between columns.

Above, the algorithm fixed_resolution_algo was chosen. A stratum retention algorithm denotes a general class of pruning strategies. Creating a specific policy instance is supplied an integer that dictates the amount of pruning. Here, smaller values result in denser stratum retention. A policy using the fixed resolution algorithm supplied with the value 3 retains strata from every third generation, while 1 would retain every strata.

Below is a quick overview of the five available algorithms.

Stratum Retention Algorithm	Space Complexity	MRCA Gen Uncertainty
Fixed Resolution Algorithm	`n/k`	`k`
Recency-proportional Resolution Algorithm	`k * log(n)`	`m/k`
Depth-proportional Resolution Algorithm	`k`	`n/k`
Geometric Sequence Nth Root Algorithm	`k`	`m * n^(1/k)`
Curbed Recency-proportional Resolution Algorithm	`k`	`m / k` -> `m * n^(1/k)`

For most purposes, the curbed recency-proportional algorithm represents a good choice. To find a more in depth overview, visit Choosing a Retention Policy.

Differentia bit width was also configured above. This specifies the number of bits used to store lineage-identification data within an annotation each generation.

Identifying Independent Origins

To check for any common ancestors between two columns (versus independent founder geneses):

[2]:

"any common ancestry?", hstrat.does_have_any_common_ancestor(
    founder1, founder2
)

[2]:

('any common ancestry?', False)

[3]:

"any common ancestry?", hstrat.does_have_any_common_ancestor(
    founder1, founder1
)

[3]:

('any common ancestry?', True)

Replicating Annotations & Elapsing Generations

Generational transmission of HereditaryStratigraphicColumn’s involves two operations: 1. cloning — generation of a fully-independent annotation object (i.e., a “deep” copy), 2. deposition — appending annotation data to register passage of a generation.

These operations can be performed independently, via .Clone() and .DepositStratum(). (To elapse n generations at once, call .DepositStrata(n).)

[4]:

descendant2a = founder2.Clone()
for __ in range(5):
    descendant2a.DepositStratum()

descendant2a.DepositStrata(5)

assert hstrat.does_have_any_common_ancestor(founder2, descendant2a)
f"{descendant2a.GetNumStrataDeposited()} generations since founder genesis"

[4]:

'11 generations since founder genesis'

Alternately, call .CloneDescendant() to perform both operations — cloning followed by deposition.

[5]:

descendant2b = descendant2a.CloneDescendant()

assert hstrat.does_have_any_common_ancestor(founder2, descendant2b)
f"{descendant2b.GetNumStrataDeposited()} generations since founder genesis"

[5]:

'12 generations since founder genesis'

The member function .CloneNthDescendant(n) operates equivalently to .Clone() followed by DepositStrata(n), taking in an integer to specify the number of generations elapsed.

[6]:

descendant2c = descendant2a.CloneNthDescendant(20)

assert hstrat.does_have_any_common_ancestor(founder2, descendant2a)
f"{descendant2c.GetNumStrataDeposited()} generations since founder genesis"

[6]:

'31 generations since founder genesis'

Most use cases will primarily use .CloneDescendant().

Mesuring Relatedness

If you want to find the number of generations elapsed since the MRCA (most recent common ancestor), calc_ranks_since_mrca_bounds_with will return a tuple with the estimated lower and upper bound. (In hstrat parlance, the term “rank” corresponds to generations elapsed.)

[7]:

hstrat.calc_ranks_since_mrca_bounds_with(
    descendant2b,
    descendant2c,
    prior="arbitrary",
)

[7]:

(0, 3)

Uncertainty in phylogenetic inference is introduced due to the stratum pruning procedure mentioned above.

Note that generations since MRCA will not be symmetrical if uneven branch lengths are at play.

[8]:

hstrat.calc_ranks_since_mrca_bounds_with(
    descendant2c,
    descendant2b,
    prior="arbitrary",
)

[8]:

(19, 22)

Generation of MRCA (i.e., MRCA generational depth from genesis) can also be estimated.

[9]:

hstrat.calc_rank_of_mrca_bounds_between(
    descendant2b, descendant2c, prior="arbitrary"
)

[9]:

(9, 12)

This measure is symmetric.

[10]:

hstrat.calc_rank_of_mrca_bounds_between(
    descendant2c, descendant2b, prior="arbitrary"
)

[10]:

(9, 12)

Generation of MRCA can be estimated over an entire population.

[11]:

hstrat.calc_rank_of_mrca_bounds_among(
    [descendant2a, descendant2c, descendant2b], prior="arbitrary"
)

[11]:

(9, 11)

The argument prior used above denotes the prior probability density distribution over possible generations of the MRCA. For most cases, "arbitrary" will suffice.

User-defined Genomes

In most cases, columns will need to be associated with genetic content relevant to the simulation at hand. Python’s object-oriented programming model provides a convenient way to achieve this.

Most user-defined Genome’s will look something like this. Note the coupling of content mutation and hstrat copy/deposit in Genome.CloneDescendant.

[12]:

import typing


class Genome:
    column: hstrat.HereditaryStratigraphicColumn
    content: typing.Any

    def __init__(self, column=None, content=None):
        if column is None:
            column = hstrat.HereditaryStratigraphicColumn(
                # configure policy, differentia bit width, etc. here
            )
        if content is None:
            self.content = (
                "placeholder"  # initialize functional genome content
            )

        self.column = column
        self.content = content

    def CloneDescendant(self):
        mutated_content = self.content  # put mutation operator here
        return Genome(self.column.CloneDescendant(), mutated_content)

Collections of Genome objects would then make up the simulated population.

See here for a more detailed example of this pattern in action.

Phylogenetic Reconstruction

A major use case of hstrat is reconstruction of phylogenetic history over entire populations. We’ll walk through that now.

Begin by retrieving a ground-truth phylogeny. We will use this as a point of comparison.

[13]:

import random

random.seed(1)
import dendropy as dp

tree_url = "https://raw.githubusercontent.com/mmore500/hstrat/5069db7c358ac6949ceda5fe8cc9989d5d7139f9/examples/assets/example.newick"
template_tree = dp.Tree.get(url=tree_url, schema="newick")
for node in template_tree:
    node.edge_length = random.randint(1, 10)  # random branch lengths
template_tree.is_rooted = True
template_tree.ladderize()
print(template_tree.as_ascii_plot(plot_metric="length", width=50))

       /----------------------- ant
       |
       |------------------ dog
       |
       +    /----------- bat
       |----+
       |    \--- cow
       |
       |                  /------------------- elk
       \------------------+
                          \---------------- fox

hstrat provides descend_template_phylogeny_* functions to generate end-state HereditaryStratigraphicColumn’s “as if” they had evolved from seed_column with a provided phylogenetic history. Here, we generate a population of HereditaryStratigraphicColumn’s corresponding to the leaves of template_tree.

[14]:

extant_population = hstrat.descend_template_phylogeny_dendropy(
    template_tree,
    seed_column=hstrat.HereditaryStratigraphicColumn(
        hstrat.recency_proportional_resolution_algo.Policy(5)
    ),
    extant_nodes=template_tree.leaf_node_iter(),
)

Most use cases will not involve a template phylogeny approach. Rather, HereditaryStratigraphicColumn’s will evolve according to the simulation dynamics. However, it is convenient for this demonstration of phylogenetic reconstruction.

The function build_tree takes in a population of columns and returns a tree constructed in alife standard format. The version_pin parameter dictates how calls to the function should resolve in future releases, with hstrat.__version__ automatically tracking new updates.

[15]:

reconstructed_phylogeny = hstrat.build_tree(
    extant_population,
    version_pin=hstrat.__version__,
    taxon_labels=map(lambda n: n.taxon.label, template_tree.leaf_node_iter()),
)
reconstructed_phylogeny

[15]:

	id	ancestor_list	origin_time	taxon_label	ancestor_id
0	0	[none]	0.0	Root	0
5	5	[0]	2.0	Inner+r=2+d=GSsXbnXBxB-+uid=1v5ktyenGFmEudpamzswb	0
19	19	[5]	4.0	cow	5
27	27	[0]	8.0	Inner+r=8+d=MFuIoTBD6pA+uid=Be2pLYZKxdudcHEH6F...	0
33	33	[5]	7.0	bat	5
38	38	[0]	8.0	dog	0
45	45	[0]	10.0	ant	0
52	52	[27]	15.0	fox	27
54	54	[27]	16.0	elk	27

Reconstruction would be performed the same way on a population of HereditaryStratigraphicColumn’s resulting from an actual evolutionary simulation rather than “as if” template descent.

Use dendropy to visualize the reconstruction, and compare to ground truth.

[16]:

import alifedata_phyloinformatics_convert as apc

dendropy_tree = apc.alife_dataframe_to_dendropy_tree(
    reconstructed_phylogeny,
    setup_edge_lengths=True,
)
dendropy_tree.is_rooted = True
dendropy_tree.ladderize()

print("=============== reconstructed ===============")
print(dendropy_tree.as_ascii_plot(plot_metric="length", width=50))

print("=================== true ====================")
print(template_tree.as_ascii_plot(plot_metric="length", width=50))

=============== reconstructed ===============
/---------------------- dog
|
|--------------------------- ant
|
+    /----- cow
|----+
|    \-------------- bat
|
|                      /------------------- fox
\----------------------+
                       \---------------------- elk


=================== true ====================
       /----------------------- ant
       |
       |------------------ dog
       |
       +    /----------- bat
       |----+
       |    \--- cow
       |
       |                  /------------------- elk
       \------------------+
                          \---------------- fox

Serialization

Saving and restoring HereditaryStratigraphicColumn’s is important in order to be able to transmit hstrat annotations between processes during a simulation or save state after a simulation. The library provides mechanisms to support both binary and text formats.

Begin by creating an example HereditaryStratigraphicColumn to play with.

[17]:

diffwidth = 8
policy = hstrat.recency_proportional_resolution_curbed_algo.Policy(10)
column = hstrat.HereditaryStratigraphicColumn(
    stratum_retention_policy=policy,
    stratum_differentia_bit_width=diffwidth,
    always_store_rank_in_stratum=False,
)
column.DepositStrata(20)

For binary formats, use col_to_packet and col_from_packet. (If the packet is contained within a larger buffer, you’ll want to use col_from_packet_buffer instead.)

[18]:

packet = hstrat.col_to_packet(column)
reconst = hstrat.col_from_packet(
    packet,
    differentia_bit_width=diffwidth,
    stratum_retention_policy=policy,
)
assert reconst == column

packet

[18]:

bytearray(b'\x00\x00\x00\x15\xa4\xda\x1e\x98')

Note that differentia_bit_width and stratum_retention_policy must be passed — this information is not stored within serialized packets.

The first step for plain-text serialization is conversion between column objects and representation via builtin Python types.

[19]:

records = hstrat.col_to_records(column)
reconst = hstrat.col_from_records(records)
assert reconst == column

records

[19]:

{'policy_algo': 'recency_proportional_resolution_curbed_algo',
 'policy_spec': {'size_curb': 10},
 'policy': 'hstrat.recency_proportional_resolution_curbed_algo.Policy(policy_spec=hstrat.recency_proportional_resolution_curbed_algo.PolicySpec(size_curb=10))',
 'hstrat_version': '1.11.7',
 'num_strata_deposited': 21,
 'differentiae': 'pNoemA==',
 'differentia_bit_width': 8}

Unlike packet serialization, policy information is stored in the generated records — convenient for data persistence beyond simulation runtime.

Conversion between records and plain text can then be handled by a number of tools. Python provides builtin support for JSON.

[20]:

import json

as_json = json.dumps(records, indent=2)  # indent kwarg pretty prints
reconst = json.loads(as_json)
assert records == reconst

print(as_json)

{
  "policy_algo": "recency_proportional_resolution_curbed_algo",
  "policy_spec": {
    "size_curb": 10
  },
  "policy": "hstrat.recency_proportional_resolution_curbed_algo.Policy(policy_spec=hstrat.recency_proportional_resolution_curbed_algo.PolicySpec(size_curb=10))",
  "hstrat_version": "1.11.7",
  "num_strata_deposited": 21,
  "differentiae": "pNoemA==",
  "differentia_bit_width": 8
}

Perhaps you prefer yaml?

[21]:

import yaml

as_yaml = yaml.safe_dump(records)
reconst = yaml.safe_load(as_yaml)
assert records == reconst

print(as_yaml)

differentia_bit_width: 8
differentiae: pNoemA==
hstrat_version: 1.11.7
num_strata_deposited: 21
policy: hstrat.recency_proportional_resolution_curbed_algo.Policy(policy_spec=hstrat.recency_proportional_resolution_curbed_algo.PolicySpec(size_curb=10))
policy_algo: recency_proportional_resolution_curbed_algo
policy_spec:
  size_curb: 10

The pop_to_records and pop_to_records operate similarly, except on entire collections of HereditaryStratigraphicColumns rather than individual columns.

[22]:

population = [column, column.CloneDescendant()]
records = hstrat.pop_to_records(population)
reconst = hstrat.pop_from_records(records)
assert reconst == population

records

[22]:

{'policy': 'hstrat.recency_proportional_resolution_curbed_algo.Policy(policy_spec=hstrat.recency_proportional_resolution_curbed_algo.PolicySpec(size_curb=10))',
 'policy_algo': 'recency_proportional_resolution_curbed_algo',
 'policy_spec': {'size_curb': 10},
 'differentia_bit_width': 8,
 'hstrat_version': '1.11.7',
 'columns': [{'num_strata_deposited': 21, 'differentiae': 'pNoemA=='},
  {'num_strata_deposited': 22, 'differentiae': 'pNoeGA=='}]}

For very large populations, consider using gzip or lzma/xz to counteract file size overhead from plain-text format.

Viewing Column Internals

An ascii representation of a hereditary stratigraph column can be printed using the hstrat.col_to_ascii() function, passing in the specified column as a parameter.

[23]:

col = hstrat.HereditaryStratigraphicColumn(
    stratum_retention_policy=hstrat.fixed_resolution_algo.Policy(3),
)
col.DepositStrata(5)

print(hstrat.col_to_ascii(col))

+----------------------------------------------------+
|                    MOST ANCIENT                    |
+--------------+---------------+---------------------+
| stratum rank | stratum index | stratum differentia |
+--------------+---------------+---------------------+
|      0       |       0       |  1391f9b9dbc799b0*  |
+--------------+---------------+---------------------+
|      1       | ░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░ |
+--------------+---------------+---------------------+
|      2       | ░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░ |
+--------------+---------------+---------------------+
|      3       |       1       |  -f8acb011533eef2*  |
+--------------+---------------+---------------------+
|      4       | ░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░ |
+--------------+---------------+---------------------+
|      5       |       2       |  28804790be6c6fe9*  |
+--------------+---------------+---------------------+
|                    MOST RECENT                     |
+----------------------------------------------------+
*: stored in stratum (otherwise calculated)
░: discarded stratum

A column can also be exported to pandas dataframe.

[24]:

hstrat.col_to_dataframe(col)

[24]:

	index	rank	differentia
0	0	0	1410182734995233200
1	1	3	-1119930662866120434
2	2	5	2918411245531721705