Quickstart
The purpose of hstrat is to enable approximate inference of the phylogenetic history of a distributed digital population solely through analysis of HereditaryStratigraphicColumn
genome annotations. Each annotation object attaches to a particular genome, and is inherited by all offspring genomes. (In most use cases, each genome within a simulation will have exactly one annotation.)
The hereditary stratigraphy methodology implemented by HereditaryStratigraphicColumn
enables generations elapsed since the most recent common ancestor (MRCA) of two genomes to be bounded within an estimated range. This is done by means of data local to the HereditaryStratigraphicColumn
object — no centralized tracking system required.
This capability has direct applications in digital evolution research (e.g., artificial life, genetic programming, genetic algorithms), and also may prove useful for other distributed systems applications where it is desired to track the copy-tree provenance of digital artifacts.
The following provides a practical, code-first introduction to creating, propagating, and analyzing hstrat instrumentation to perform distributed phylogenetic tracking. See “Under the Hood” for more technical discussion of the underlying hereditary stratigraphy methodology.
Installation
python3 -m pip install hstrat
Initializing a HereditaryStratigraphicColumn
The hstrat library provides a genome annotation called a hereditary stratigraphic column to track phylogenetic relatedness within a simulation. Here, we initialize two columns — representing two fully-independent evolutionary origins.
[1]:
from hstrat import hstrat
from hstrat._auxiliary_lib import seed_random
seed_random(1) # reproducibility
founder1 = hstrat.HereditaryStratigraphicColumn(
stratum_retention_policy=hstrat.fixed_resolution_algo.Policy(1),
stratum_differentia_bit_width=16,
)
founder2 = hstrat.HereditaryStratigraphicColumn(
stratum_retention_policy=hstrat.fixed_resolution_algo.Policy(3),
stratum_differentia_bit_width=16,
)
Note the configuration of stratum retention policy for each annotation above.
This policy governs the trade-off between space usage of HereditaryStratigraphicColumn
objects and inferential power. Simply put, these are various methods use follow a specific algorithm to “prune”, or systematically delete, data in the column as generations elapse. (Hereditary stratigraphy works by appending data to annotations each generation.) Pruning avoids linear space complexity as generations elapse, but comes at the expense of introducing uncertainty into estimates of relatedness
between columns.
Above, the algorithm fixed_resolution_algo
was chosen. A stratum retention algorithm denotes a general class of pruning strategies. Creating a specific policy instance is supplied an integer that dictates the amount of pruning. Here, smaller values result in denser stratum retention. A policy using the fixed resolution algorithm supplied with the value 3 retains strata from every third generation, while 1 would retain every strata.
Below is a quick overview of the five available algorithms.
Stratum Retention Algorithm |
Space Complexity |
MRCA Gen Uncertainty |
---|---|---|
Fixed Resolution Algorithm |
|
|
Recency-proportional Resolution Algorithm |
|
|
Depth-proportional Resolution Algorithm |
|
|
Geometric Sequence Nth Root Algorithm |
|
|
Curbed Recency-proportional Resolution Algorithm |
|
|
For most purposes, the curbed recency-proportional algorithm represents a good choice. To find a more in depth overview, visit Choosing a Retention Policy.
Differentia bit width was also configured above. This specifies the number of bits used to store lineage-identification data within an annotation each generation.
Identifying Independent Origins
To check for any common ancestors between two columns (versus independent founder geneses):
[2]:
"any common ancestry?", hstrat.does_have_any_common_ancestor(
founder1, founder2
)
[2]:
('any common ancestry?', False)
[3]:
"any common ancestry?", hstrat.does_have_any_common_ancestor(
founder1, founder1
)
[3]:
('any common ancestry?', True)
Replicating Annotations & Elapsing Generations
Generational transmission of HereditaryStratigraphicColumn
’s involves two operations: 1. cloning — generation of a fully-independent annotation object (i.e., a “deep” copy), 2. deposition — appending annotation data to register passage of a generation.
These operations can be performed independently, via .Clone()
and .DepositStratum()
. (To elapse n
generations at once, call .DepositStrata(n)
.)
[4]:
descendant2a = founder2.Clone()
for __ in range(5):
descendant2a.DepositStratum()
descendant2a.DepositStrata(5)
assert hstrat.does_have_any_common_ancestor(founder2, descendant2a)
f"{descendant2a.GetNumStrataDeposited()} generations since founder genesis"
[4]:
'11 generations since founder genesis'
Alternately, call .CloneDescendant()
to perform both operations — cloning followed by deposition.
[5]:
descendant2b = descendant2a.CloneDescendant()
assert hstrat.does_have_any_common_ancestor(founder2, descendant2b)
f"{descendant2b.GetNumStrataDeposited()} generations since founder genesis"
[5]:
'12 generations since founder genesis'
The member function .CloneNthDescendant(n)
operates equivalently to .Clone()
followed by DepositStrata(n)
, taking in an integer to specify the number of generations elapsed.
[6]:
descendant2c = descendant2a.CloneNthDescendant(20)
assert hstrat.does_have_any_common_ancestor(founder2, descendant2a)
f"{descendant2c.GetNumStrataDeposited()} generations since founder genesis"
[6]:
'31 generations since founder genesis'
Most use cases will primarily use .CloneDescendant()
.
Mesuring Relatedness
If you want to find the number of generations elapsed since the MRCA (most recent common ancestor), calc_ranks_since_mrca_bounds_with
will return a tuple with the estimated lower and upper bound. (In hstrat
parlance, the term “rank” corresponds to generations elapsed.)
[7]:
hstrat.calc_ranks_since_mrca_bounds_with(
descendant2b,
descendant2c,
prior="arbitrary",
)
[7]:
(0, 3)
Uncertainty in phylogenetic inference is introduced due to the stratum pruning procedure mentioned above.
Note that generations since MRCA will not be symmetrical if uneven branch lengths are at play.
[8]:
hstrat.calc_ranks_since_mrca_bounds_with(
descendant2c,
descendant2b,
prior="arbitrary",
)
[8]:
(19, 22)
Generation of MRCA (i.e., MRCA generational depth from genesis) can also be estimated.
[9]:
hstrat.calc_rank_of_mrca_bounds_between(
descendant2b, descendant2c, prior="arbitrary"
)
[9]:
(9, 12)
This measure is symmetric.
[10]:
hstrat.calc_rank_of_mrca_bounds_between(
descendant2c, descendant2b, prior="arbitrary"
)
[10]:
(9, 12)
Generation of MRCA can be estimated over an entire population.
[11]:
hstrat.calc_rank_of_mrca_bounds_among(
[descendant2a, descendant2c, descendant2b], prior="arbitrary"
)
[11]:
(9, 11)
The argument prior
used above denotes the prior probability density distribution over possible generations of the MRCA. For most cases, "arbitrary"
will suffice.
User-defined Genomes
In most cases, columns will need to be associated with genetic content relevant to the simulation at hand. Python’s object-oriented programming model provides a convenient way to achieve this.
Most user-defined Genome
’s will look something like this. Note the coupling of content mutation and hstrat copy/deposit in Genome.CloneDescendant
.
[12]:
import typing
class Genome:
column: hstrat.HereditaryStratigraphicColumn
content: typing.Any
def __init__(self, column=None, content=None):
if column is None:
column = hstrat.HereditaryStratigraphicColumn(
# configure policy, differentia bit width, etc. here
)
if content is None:
self.content = (
"placeholder" # initialize functional genome content
)
self.column = column
self.content = content
def CloneDescendant(self):
mutated_content = self.content # put mutation operator here
return Genome(self.column.CloneDescendant(), mutated_content)
Collections of Genome
objects would then make up the simulated population.
See here for a more detailed example of this pattern in action.
Phylogenetic Reconstruction
A major use case of hstrat is reconstruction of phylogenetic history over entire populations. We’ll walk through that now.
Begin by retrieving a ground-truth phylogeny. We will use this as a point of comparison.
[13]:
import random
random.seed(1)
import dendropy as dp
tree_url = "https://raw.githubusercontent.com/mmore500/hstrat/5069db7c358ac6949ceda5fe8cc9989d5d7139f9/examples/assets/example.newick"
template_tree = dp.Tree.get(url=tree_url, schema="newick")
for node in template_tree:
node.edge_length = random.randint(1, 10) # random branch lengths
template_tree.is_rooted = True
template_tree.ladderize()
print(template_tree.as_ascii_plot(plot_metric="length", width=50))
/----------------------- ant
|
|------------------ dog
|
+ /----------- bat
|----+
| \--- cow
|
| /------------------- elk
\------------------+
\---------------- fox
hstrat provides descend_template_phylogeny_*
functions to generate end-state HereditaryStratigraphicColumn
’s “as if” they had evolved from seed_column
with a provided phylogenetic history. Here, we generate a population of HereditaryStratigraphicColumn
’s corresponding to the leaves of template_tree
.
[14]:
extant_population = hstrat.descend_template_phylogeny_dendropy(
template_tree,
seed_column=hstrat.HereditaryStratigraphicColumn(
hstrat.recency_proportional_resolution_algo.Policy(5)
),
extant_nodes=template_tree.leaf_node_iter(),
)
Most use cases will not involve a template phylogeny approach. Rather, HereditaryStratigraphicColumn
’s will evolve according to the simulation dynamics. However, it is convenient for this demonstration of phylogenetic reconstruction.
The function build_tree
takes in a population of columns and returns a tree constructed in alife standard format. The version_pin
parameter dictates how calls to the function should resolve in future releases, with hstrat.__version__
automatically tracking new updates.
[15]:
reconstructed_phylogeny = hstrat.build_tree(
extant_population,
version_pin=hstrat.__version__,
taxon_labels=map(lambda n: n.taxon.label, template_tree.leaf_node_iter()),
)
reconstructed_phylogeny
[15]:
id | ancestor_list | origin_time | taxon_label | ancestor_id | |
---|---|---|---|---|---|
0 | 0 | [none] | 0.0 | Root | 0 |
5 | 5 | [0] | 2.0 | Inner+r=2+d=GSsXbnXBxB-+uid=1v5ktyenGFmEudpamzswb | 0 |
19 | 19 | [5] | 4.0 | cow | 5 |
27 | 27 | [0] | 8.0 | Inner+r=8+d=MFuIoTBD6pA+uid=Be2pLYZKxdudcHEH6F... | 0 |
33 | 33 | [5] | 7.0 | bat | 5 |
38 | 38 | [0] | 8.0 | dog | 0 |
45 | 45 | [0] | 10.0 | ant | 0 |
52 | 52 | [27] | 15.0 | fox | 27 |
54 | 54 | [27] | 16.0 | elk | 27 |
Reconstruction would be performed the same way on a population of HereditaryStratigraphicColumn
’s resulting from an actual evolutionary simulation rather than “as if” template descent.
Use dendropy to visualize the reconstruction, and compare to ground truth.
[16]:
import alifedata_phyloinformatics_convert as apc
dendropy_tree = apc.alife_dataframe_to_dendropy_tree(
reconstructed_phylogeny,
setup_edge_lengths=True,
)
dendropy_tree.is_rooted = True
dendropy_tree.ladderize()
print("=============== reconstructed ===============")
print(dendropy_tree.as_ascii_plot(plot_metric="length", width=50))
print("=================== true ====================")
print(template_tree.as_ascii_plot(plot_metric="length", width=50))
=============== reconstructed ===============
/---------------------- dog
|
|--------------------------- ant
|
+ /----- cow
|----+
| \-------------- bat
|
| /------------------- fox
\----------------------+
\---------------------- elk
=================== true ====================
/----------------------- ant
|
|------------------ dog
|
+ /----------- bat
|----+
| \--- cow
|
| /------------------- elk
\------------------+
\---------------- fox
Serialization
Saving and restoring HereditaryStratigraphicColumn
’s is important in order to be able to transmit hstrat annotations between processes during a simulation or save state after a simulation. The library provides mechanisms to support both binary and text formats.
Begin by creating an example HereditaryStratigraphicColumn
to play with.
[17]:
diffwidth = 8
policy = hstrat.recency_proportional_resolution_curbed_algo.Policy(10)
column = hstrat.HereditaryStratigraphicColumn(
stratum_retention_policy=policy,
stratum_differentia_bit_width=diffwidth,
always_store_rank_in_stratum=False,
)
column.DepositStrata(20)
For binary formats, use col_to_packet
and col_from_packet
. (If the packet is contained within a larger buffer, you’ll want to use col_from_packet_buffer
instead.)
[18]:
packet = hstrat.col_to_packet(column)
reconst = hstrat.col_from_packet(
packet,
differentia_bit_width=diffwidth,
stratum_retention_policy=policy,
)
assert reconst == column
packet
[18]:
bytearray(b'\x00\x00\x00\x15\xa4\xda\x1e\x98')
Note that differentia_bit_width
and stratum_retention_policy
must be passed — this information is not stored within serialized packets.
The first step for plain-text serialization is conversion between column objects and representation via builtin Python types.
[19]:
records = hstrat.col_to_records(column)
reconst = hstrat.col_from_records(records)
assert reconst == column
records
[19]:
{'policy_algo': 'recency_proportional_resolution_curbed_algo',
'policy_spec': {'size_curb': 10},
'policy': 'hstrat.recency_proportional_resolution_curbed_algo.Policy(policy_spec=hstrat.recency_proportional_resolution_curbed_algo.PolicySpec(size_curb=10))',
'hstrat_version': '1.11.7',
'num_strata_deposited': 21,
'differentiae': 'pNoemA==',
'differentia_bit_width': 8}
Unlike packet serialization, policy information is stored in the generated records — convenient for data persistence beyond simulation runtime.
Conversion between records and plain text can then be handled by a number of tools. Python provides builtin support for JSON.
[20]:
import json
as_json = json.dumps(records, indent=2) # indent kwarg pretty prints
reconst = json.loads(as_json)
assert records == reconst
print(as_json)
{
"policy_algo": "recency_proportional_resolution_curbed_algo",
"policy_spec": {
"size_curb": 10
},
"policy": "hstrat.recency_proportional_resolution_curbed_algo.Policy(policy_spec=hstrat.recency_proportional_resolution_curbed_algo.PolicySpec(size_curb=10))",
"hstrat_version": "1.11.7",
"num_strata_deposited": 21,
"differentiae": "pNoemA==",
"differentia_bit_width": 8
}
Perhaps you prefer yaml?
[21]:
import yaml
as_yaml = yaml.safe_dump(records)
reconst = yaml.safe_load(as_yaml)
assert records == reconst
print(as_yaml)
differentia_bit_width: 8
differentiae: pNoemA==
hstrat_version: 1.11.7
num_strata_deposited: 21
policy: hstrat.recency_proportional_resolution_curbed_algo.Policy(policy_spec=hstrat.recency_proportional_resolution_curbed_algo.PolicySpec(size_curb=10))
policy_algo: recency_proportional_resolution_curbed_algo
policy_spec:
size_curb: 10
The pop_to_records
and pop_to_records
operate similarly, except on entire collections of HereditaryStratigraphicColumns
rather than individual columns.
[22]:
population = [column, column.CloneDescendant()]
records = hstrat.pop_to_records(population)
reconst = hstrat.pop_from_records(records)
assert reconst == population
records
[22]:
{'policy': 'hstrat.recency_proportional_resolution_curbed_algo.Policy(policy_spec=hstrat.recency_proportional_resolution_curbed_algo.PolicySpec(size_curb=10))',
'policy_algo': 'recency_proportional_resolution_curbed_algo',
'policy_spec': {'size_curb': 10},
'differentia_bit_width': 8,
'hstrat_version': '1.11.7',
'columns': [{'num_strata_deposited': 21, 'differentiae': 'pNoemA=='},
{'num_strata_deposited': 22, 'differentiae': 'pNoeGA=='}]}
For very large populations, consider using gzip
or lzma
/xz
to counteract file size overhead from plain-text format.
Viewing Column Internals
An ascii representation of a hereditary stratigraph column can be printed using the hstrat.col_to_ascii()
function, passing in the specified column as a parameter.
[23]:
col = hstrat.HereditaryStratigraphicColumn(
stratum_retention_policy=hstrat.fixed_resolution_algo.Policy(3),
)
col.DepositStrata(5)
print(hstrat.col_to_ascii(col))
+----------------------------------------------------+
| MOST ANCIENT |
+--------------+---------------+---------------------+
| stratum rank | stratum index | stratum differentia |
+--------------+---------------+---------------------+
| 0 | 0 | 1391f9b9dbc799b0* |
+--------------+---------------+---------------------+
| 1 | ░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░ |
+--------------+---------------+---------------------+
| 2 | ░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░ |
+--------------+---------------+---------------------+
| 3 | 1 | -f8acb011533eef2* |
+--------------+---------------+---------------------+
| 4 | ░░░░░░░░░░░░░ | ░░░░░░░░░░░░░░░░░░░ |
+--------------+---------------+---------------------+
| 5 | 2 | 28804790be6c6fe9* |
+--------------+---------------+---------------------+
| MOST RECENT |
+----------------------------------------------------+
*: stored in stratum (otherwise calculated)
░: discarded stratum
A column can also be exported to pandas dataframe.
[24]:
hstrat.col_to_dataframe(col)
[24]:
index | rank | differentia | |
---|---|---|---|
0 | 0 | 0 | 1410182734995233200 |
1 | 1 | 3 | -1119930662866120434 |
2 | 2 | 5 | 2918411245531721705 |
Further Reading
Additional usage examples can be found within the
`examples/
directory <https://github.com/mmore500/hstrat/tree/master/examples>`__.All library components can be accessed directly from the convenience flat namespace
hstrat.hstrat
(e.g.,from hstrat import hstrat
). A full API listing is provided here.Information and advice on selecting a stratum retention policy is here.
Preliminary prototype work on improved methodology for fixed-size hstrat annotations is available here
Preliminary prototype work on extensions for systems with sexual recombination is available here