Tutorial 1: Benchmark Generation

This tutorial provides a detailed guide for creating customized benchmark datasets with the dyn-benchmark package, focusing on configuring all available parameters to control community evolution and network structure.

Prerequisites

# Install the package if you haven't already
# pip install dyn-benchmark

# Import required modules
from dyn.benchmark.generator.groundtruth_generator import GroundtruthGenerator
from dyn.benchmark.generator.communities_generator import CommunitiesGenerator, RelativeOverlap, Match, Overlap
from dyn.benchmark.generator.edges_generator import SBM, BPAM, FastBPAM, PAM
from dyn.benchmark.generator.nodes_generator import RandomMemberGenerator
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt

1. Community Generation Parameters

The CommunitiesGenerator is responsible for creating evolving communities and operates on two distinct levels:

Global generation parameters: These are set as attributes in the constructor and control the overall properties of the generation process.
Evolving Communities shape parameters: These are controlled by overriding specific methods that define probability distributions for community attributes like size, lifetime, and growth patterns.

Global generation parameters

The following parameters can be set when instantiating a CommunitiesGenerator:

# Create a communities generator with fully customized parameters
community_generator = CommunitiesGenerator(
    # Basic parameters
    community_count=15,          # Number of evolving communities to generate
    snapshot_count=10,           # Number of snapshots in the temporal network
    community_size_min=5,        # Minimum size of any static community
    core_nodes_ratio=0.7,        # Ratio of members that stay in their community between snapshots
    matching_metric_type=RelativeOverlap,  # Algorithm for matching communities across snapshots
    seed=42                      # Seed for reproducibility
)
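To make core_nodes_ratio more concrete, the sketch below shows one plausible reading of it using only the standard library: a fixed share of a community's members is retained from one snapshot to the next, and the rest may be replaced. The helper split_core_members is hypothetical and not part of dyn-benchmark; the way the package actually selects core nodes may differ.

```python
import random

def split_core_members(members, core_nodes_ratio, seed=None):
    """Hypothetical helper: split members into (core, non_core) sets."""
    rng = random.Random(seed)
    ordered = sorted(members)  # fixed order so the result depends only on the seed
    core_size = round(core_nodes_ratio * len(ordered))
    core = set(rng.sample(ordered, core_size))
    return core, set(ordered) - core

members = set(range(20))
core, non_core = split_core_members(members, core_nodes_ratio=0.7, seed=42)
core_again, _ = split_core_members(members, core_nodes_ratio=0.7, seed=42)
print(f"{len(core)} core members, {len(non_core)} replaceable members")
```

With a ratio of 0.7 and 20 members, 14 members form the core and 6 are candidates for turnover.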

Matching metrics help identify when a community at time t corresponds to a community at time t+1. As defined in Aynaud et al. [AFGW13], matching metrics compare the intersection of nodes between communities to determine their continuity over time. The package implements three matching metrics:

# Match: min(|C0 ∩ C1| / |C0|, |C0 ∩ C1| / |C1|)
# Compares the relative size of the intersection to both community sizes
# Used when community sizes vary significantly
match_generator = CommunitiesGenerator(matching_metric_type=Match)

# RelativeOverlap: |C0 ∩ C1| / |C0 ∪ C1|
# Compares the intersection to the union (Jaccard coefficient)
# Balanced approach (default)
relative_overlap_generator = CommunitiesGenerator(matching_metric_type=RelativeOverlap)

# Overlap: |C0 ∩ C1|
# Simply uses the raw intersection size
# Used when community sizes are stable
overlap_generator = CommunitiesGenerator(matching_metric_type=Overlap)

These matching metrics have different properties suitable for different community evolution patterns. For example, Match is appropriate when communities can significantly vary in size, as it takes the minimum of the relative intersections. RelativeOverlap (the default) uses the Jaccard coefficient to balance the comparison, while Overlap simply looks at the raw number of shared nodes.
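The three formulas can be checked by hand on plain Python sets. The functions below are illustrative stand-ins that follow the formulas quoted above; they are not the package's Match, RelativeOverlap, and Overlap classes, and they assume both communities are non-empty.

```python
def match(c0, c1):
    """min(|C0 ∩ C1| / |C0|, |C0 ∩ C1| / |C1|)"""
    shared = len(c0 & c1)
    return min(shared / len(c0), shared / len(c1))

def relative_overlap(c0, c1):
    """|C0 ∩ C1| / |C0 ∪ C1| (Jaccard coefficient)"""
    return len(c0 & c1) / len(c0 | c1)

def overlap(c0, c1):
    """|C0 ∩ C1|"""
    return len(c0 & c1)

community_t = {1, 2, 3, 4, 5, 6}                      # 6 members at time t
community_next = {4, 5, 6, 7, 8, 9, 10, 11, 12}       # 9 members at time t+1

match_score = match(community_t, community_next)                  # min(3/6, 3/9)
jaccard_score = relative_overlap(community_t, community_next)     # 3/12
overlap_score = overlap(community_t, community_next)              # 3
print(match_score, jaccard_score, overlap_score)
```

In this example the community grows from 6 to 9 members while keeping 3 of them, so Match is limited by the larger community (3/9), RelativeOverlap is 3/12, and Overlap is simply 3.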

Evolving Communities shape parameters

The generator creates communities based on several probability distributions that you can customize by subclassing CommunitiesGenerator and overriding its methods:

class CustomCommunitiesGenerator(CommunitiesGenerator):
    """Custom generator with specific community evolution patterns"""

    def draw_community_size(self, *args, **kwargs):
        """Controls the initial size of communities
        Returns a float (will be rounded to integer)"""
        return self.rng.normal(loc=50, scale=20)  # Normal distribution centered at 50

    def draw_community_lifetime(self, *args, **kwargs):
        """Controls how long communities exist
        Returns a float (will be rounded to integer)"""
        # Normal distribution clipped to the range of 3 to 7 snapshots
        # (out-of-range draws are moved onto the nearest bound)
        return np.clip(self.rng.normal(loc=5, scale=2), 3, 7)

    def draw_community_start(self, *args, **kwargs):
        """Controls when communities are born
        Returns a float between 0-1 (scaled to valid snapshot range)"""
        return self.rng.random()  # Uniform distribution

    def draw_change_ratio(self, *args, **kwargs):
        """Controls community size changes over time
        Returns a float (negative = shrink, positive = grow)"""
        return self.rng.normal(loc=0, scale=0.2)  # Normal distribution centered at 0
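Note that clamping a normal draw to a range, as draw_community_lifetime does above, is not the same as sampling from a truncated normal distribution: every out-of-range draw lands exactly on a bound. The sketch below contrasts the two approaches. It uses the standard library's random module rather than the generator's own self.rng, and both helper functions are hypothetical.

```python
import random

def clipped_normal(rng, loc, scale, low, high):
    """Clamp a normal draw to [low, high]; probability mass piles up on the bounds."""
    return max(low, min(high, rng.gauss(loc, scale)))

def truncated_normal(rng, loc, scale, low, high):
    """Redraw until the value falls inside [low, high] (rejection sampling)."""
    while True:
        value = rng.gauss(loc, scale)
        if low <= value <= high:
            return value

rng = random.Random(42)
clipped = [clipped_normal(rng, 5, 2, 3, 7) for _ in range(10_000)]
truncated = [truncated_normal(rng, 5, 2, 3, 7) for _ in range(10_000)]

# Share of draws that sit exactly on one of the two bounds
on_bounds_clipped = sum(v in (3, 7) for v in clipped) / len(clipped)
on_bounds_truncated = sum(v in (3, 7) for v in truncated) / len(truncated)
print(f"clipped: {on_bounds_clipped:.2f}, truncated: {on_bounds_truncated:.2f}")
```

With a mean of 5 and a standard deviation of 2, roughly a third of the clamped draws end up at exactly 3 or 7 snapshots, so rejection sampling may be preferable when a smooth lifetime distribution matters.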

2. Network Structure Parameters

The package offers several graph generation models to create the underlying network structure at a given snapshot.

Stochastic Block Model (SBM)

The Stochastic Block Model (SBM) is a classic approach for generating networks with community structure. It creates edges between nodes based on their community membership, with higher probability for intra-community connections than inter-community connections.

# SBM parameters (classic community-based model)
sbm_generator = SBM(
    p_in=0.7,      # Probability of edge between nodes in same community
    p_out=0.05,    # Probability of edge between nodes in different communities
    max_iter=10,   # Maximum number of attempts to generate a connected graph
    seed=42        # Seed for reproducibility
)

The SBM was originally introduced by Holland et al. [HLL83] as a statistical model for social networks with block structures, making it particularly suitable for generating synthetic networks with well-defined community structures.
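The role of p_in and p_out can be illustrated with a minimal, standard-library sketch of the basic model: every pair of nodes is connected independently, with a probability that depends only on whether the two nodes share a community. The function sample_sbm_edges is hypothetical and is not the package's SBM class; in particular, it does not retry to obtain a connected graph as the max_iter parameter suggests the package does.

```python
import itertools
import random

def sample_sbm_edges(communities, p_in, p_out, seed=None):
    """communities maps node -> community label; returns a set of undirected edges."""
    rng = random.Random(seed)
    edges = set()
    for u, v in itertools.combinations(sorted(communities), 2):
        probability = p_in if communities[u] == communities[v] else p_out
        if rng.random() < probability:
            edges.add((u, v))
    return edges

communities = {node: node // 20 for node in range(60)}  # three blocks of 20 nodes
edges = sample_sbm_edges(communities, p_in=0.7, p_out=0.05, seed=42)

intra_edges = sum(communities[u] == communities[v] for u, v in edges)
inter_edges = len(edges) - intra_edges
print(f"{intra_edges} intra-community edges, {inter_edges} inter-community edges")
```

Three blocks of 20 nodes give 570 possible intra-community pairs and 1200 possible inter-community pairs, so the observed edge densities should land close to p_in and p_out respectively.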

For more details check the SBM class.

Preferential Attachment Model (PAM)

The Preferential Attachment Model (PAM) generates networks with a power-law degree distribution, simulating the “rich get richer” phenomenon observed in many real-world networks.

# PAM parameters (scale-free networks)
pam_generator = PAM(
    m=5,           # Number of edges to add for each new node
    self_loop=False,  # Whether self-loops are allowed
    seed=42        # Seed for reproducibility
)

This implementation is based on the efficient algorithms described by Tonelli et al. [TCL10] for implementing preferential attachment mechanisms in evolving networks.
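As a rough illustration of the mechanism, the sketch below grows a graph in which each new node attaches to m distinct existing nodes chosen with probability proportional to their degree. It uses the common "list of edge endpoints" trick and is only a minimal stand-in: the function name is hypothetical, self-loops are never produced, and the package's PAM class may differ in its details.

```python
import random
from collections import Counter

def preferential_attachment_edges(n, m, seed=None):
    """Grow a graph to n nodes; each new node links to m distinct existing nodes."""
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))  # the first new node connects to the m initial nodes
    endpoints = []            # each node appears once per incident edge
    for new_node in range(m, n):
        edges.extend((new_node, target) for target in targets)
        endpoints.extend(targets)
        endpoints.extend([new_node] * m)
        # A uniform draw from `endpoints` is a degree-proportional draw over nodes.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        targets = sorted(chosen)
    return edges

edges = preferential_attachment_edges(n=200, m=5, seed=42)
degree = Counter(node for edge in edges for node in edge)
print(f"{len(edges)} edges, max degree {max(degree.values())}, min degree {min(degree.values())}")
```

Early nodes accumulate many more links than late arrivals, which is the "rich get richer" effect the model is meant to reproduce.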

For more details check the PAM class.

Block Preferential Attachment Model (BPAM)

The Block Preferential Attachment Model (BPAM) combines the community-aware structure of SBM with the scale-free properties of PAM, creating networks with both community structure and realistic degree distributions.

# BPAM parameters (community-aware preferential attachment)
bpam_generator = BPAM(
    gamma_in=0.8,  # Intra-community interaction strength
    gamma_out=0.1, # Inter-community interaction strength
    m=5,           # Number of edges to add for each new node
    self_loop=False,  # Whether self-loops are allowed
    seed=42        # Seed for reproducibility
)

BPAM was introduced by Tang et al. [TGT20] as an extension of the Buckley-Osthus model, incorporating block structures to create networks with both community organization and power-law degree distributions.
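To give an intuition for how gamma_in and gamma_out might interact with preferential attachment, the sketch below weights each candidate node by its degree and by an affinity that depends on whether it shares a community with the new node. This weighting rule, the +1 offset on degrees, and the function name are all assumptions made for illustration; this is neither the formulation of Tang et al. nor the package's BPAM implementation. Both gamma values must be strictly positive for the sampling loop to terminate.

```python
import random

def block_preferential_attachment_edges(communities, m, gamma_in, gamma_out, seed=None):
    """communities[i] is the community label of node i; nodes arrive in index order."""
    rng = random.Random(seed)
    degree = [0] * len(communities)
    edges = []
    for new_node in range(1, len(communities)):
        existing = list(range(new_node))
        # Assumed weight: (degree + 1) scaled by the community affinity.
        weights = [
            (degree[old] + 1)
            * (gamma_in if communities[old] == communities[new_node] else gamma_out)
            for old in existing
        ]
        chosen = set()
        while len(chosen) < min(m, new_node):
            chosen.add(rng.choices(existing, weights=weights, k=1)[0])
        for old in sorted(chosen):
            edges.append((new_node, old))
            degree[new_node] += 1
            degree[old] += 1
    return edges

communities = [node % 3 for node in range(150)]  # three interleaved communities
edges = block_preferential_attachment_edges(
    communities, m=3, gamma_in=0.8, gamma_out=0.1, seed=42
)
intra_share = sum(communities[u] == communities[v] for u, v in edges) / len(edges)
print(f"{len(edges)} edges, {intra_share:.0%} inside a community")
```

Under these assumptions, raising gamma_in relative to gamma_out concentrates edges inside communities while well-connected nodes still attract more links.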

For more details check the BPAM class.

Fast Block Preferential Attachment Model (FastBPAM)

The Fast Block Preferential Attachment Model (FastBPAM) builds upon the original PAM by incorporating features from BPAM and optimizing performance. This model generates networks with both community-aware structures and realistic degree distributions, while being computationally efficient.

# FastBPAM - optimized version of BPAM with the same parameters
fast_bpam = FastBPAM(
    gamma_in=0.8,
    gamma_out=0.1,
    m=5,
    self_loop=False,
    seed=42
)

For more details check the FastBPAM class.

3. Ensuring Reproducibility

To generate identical benchmarks, it’s essential to use consistent seed values across all components of the generation process:

# First generator with seed 42
generator1 = GroundtruthGenerator(seed=42)
benchmark1 = generator1.generate()

# Second generator with seed 42
generator2 = GroundtruthGenerator(seed=42)
benchmark2 = generator2.generate()

# Verify that both benchmarks contain the same nodes in every snapshot
same_nodes = all(
    set(benchmark1.graphs[t].nodes) == set(benchmark2.graphs[t].nodes)
    for t in benchmark1.graphs.keys()
)
print(f"Benchmarks have identical nodes: {same_nodes}")

When creating a custom generator with multiple components, ensure that each component receives a consistent seed value:

# Create a reproducible custom generator
custom_generator = GroundtruthGenerator(
    community_generator=CustomCommunitiesGenerator(seed=42),
    node_generator=RandomMemberGenerator(seed=42),
    edge_generator=FastBPAM(seed=42),
    seed=42  # Master seed for the entire generation process
)

The GroundtruthGenerator distributes child seeds to its components, so providing a master seed is usually sufficient, but explicitly setting component seeds ensures maximum control over reproducibility.
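The idea of deriving child seeds from a master seed can be sketched with the standard library alone. The helper spawn_child_seeds and the component names are hypothetical, and the scheme GroundtruthGenerator actually uses may differ; NumPy's SeedSequence.spawn serves a similar purpose for NumPy generators.

```python
import random

def spawn_child_seeds(master_seed, names):
    """Derive one 32-bit seed per named component from a single master seed."""
    master = random.Random(master_seed)
    return {name: master.getrandbits(32) for name in names}

components = ["community_generator", "node_generator", "edge_generator"]
seeds_a = spawn_child_seeds(42, components)
seeds_b = spawn_child_seeds(42, components)
seeds_c = spawn_child_seeds(43, components)

# Same master seed -> same child seeds; a different master seed -> different ones
print(seeds_a == seeds_b, seeds_a == seeds_c)
```

Because every component receives its own derived seed, the components draw from distinct random streams while the whole run remains reproducible from one number.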

4. Exploring the Generated Benchmark

Once you have generated a benchmark, you can explore its properties through various attributes and methods:

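# Generate a benchmark with the custom generator defined above
groundtruth = custom_generator.generate()

# Access different components of the benchmark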
print(f"Generated {len(groundtruth.tcommlist)} member assignments")
print(f"Across {len(groundtruth.graphs)} snapshots")
print(f"With {len(groundtruth.events)} community events")

# Inspect the first snapshot
first_snapshot = min(groundtruth.graphs.keys())
graph = groundtruth.graphs[first_snapshot]
print(f"First snapshot has {len(graph.nodes)} nodes and {len(graph.edges)} edges")

# List community events
event_types = set(event.label for event in groundtruth.events)
print(f"Event types: {event_types}")

For deeper analysis and visualization of your generated benchmark, refer to the following tutorials:

Tutorial 3: Comprehensive Metrics Computation and Analysis explains how to compute various metrics and analyze events for evolving communities and temporal networks.
Tutorial 4: Visualizing Evolving Communities provides comprehensive methods for visualizing community evolution, including Sankey diagrams, network snapshots, and animated visualizations.

5. Complete Example

Here’s a complete example that demonstrates how to generate a customized benchmark. This implementation:

Creates a custom communities generator with specific distributions
Sets up a network structure generator (FastBPAM in this example)
Ensures reproducibility with consistent seeds
Explores the generated benchmark properties

"""
Complete example of benchmark generation using dyn-benchmark
"""
import numpy as np
import networkx as nx
import pandas as pd

from dyn.benchmark.generator.groundtruth_generator import GroundtruthGenerator
from dyn.benchmark.generator.communities_generator import CommunitiesGenerator, RelativeOverlap, Match, Overlap
from dyn.benchmark.generator.edges_generator import SBM, BPAM, FastBPAM, PAM
from dyn.benchmark.generator.nodes_generator import RandomMemberGenerator
from dyn.core.communities import Membership
from dyn.drawing.sankey_drawing import plot_sankey

# 1. Define a custom communities generator with specific distributions
class CustomCommunitiesGenerator(CommunitiesGenerator):
    """Custom generator with specific community evolution patterns"""

    def draw_community_size(self, *args, **kwargs):
        """Controls the initial size of communities"""
        return self.rng.normal(loc=50, scale=20)  # Normal distribution centered at 50

    def draw_community_lifetime(self, *args, **kwargs):
        """Controls how long communities exist"""
        # Normal distribution clipped to the range of 3 to 7 snapshots
        # (out-of-range draws are moved onto the nearest bound)
        return np.clip(self.rng.normal(loc=5, scale=2), 3, 7)

    def draw_community_start(self, *args, **kwargs):
        """Controls when communities are born"""
        return self.rng.random()  # Uniform distribution

    def draw_change_ratio(self, *args, **kwargs):
        """Controls community size changes over time"""
        return self.rng.normal(loc=0, scale=0.2)  # Normal distribution centered at 0

# 2. Set up a custom generator with all components
community_generator = CustomCommunitiesGenerator(
    community_count=15,          # Number of evolving communities
    snapshot_count=10,           # Number of snapshots in the temporal network
    community_size_min=5,        # Minimum size of any static community
    core_nodes_ratio=0.7,        # Ratio of members that stay in their community
    matching_metric_type=RelativeOverlap,  # Algorithm for matching communities
    seed=42                      # Seed for reproducibility
)

# 3. Create the network structure generator using the Fast Block Preferential Attachment Model (FastBPAM)
fast_bpam_generator = FastBPAM(
    gamma_in=0.8,  # Intra-community interaction strength
    gamma_out=0.1, # Inter-community interaction strength
    m=5,           # Number of edges to add for each new node
    self_loop=False,  # Whether self-loops are allowed
    seed=42        # Seed for reproducibility
)

# 4. Create a node generator
node_generator = RandomMemberGenerator(seed=42)

# 5. Assemble the final generator with all components
custom_generator = GroundtruthGenerator(
    community_generator=community_generator,
    node_generator=node_generator,
    edge_generator=fast_bpam_generator,  # Using FastBPAM for this example
    seed=42  # Master seed for the entire generation process
)

# 6. Generate the benchmark
print("Generating benchmark...")
groundtruth = custom_generator.generate()

# 7. Explore the generated benchmark properties
print(f"\nGenerated benchmark with:")
#print(f"- {len(groundtruth.tcommlist)} member assignments")
print(f"- {len(groundtruth.graphs)} snapshots")
print(f"- {len(groundtruth.events)} community events")

# 8. Inspect the first snapshot
first_snapshot = min(groundtruth.graphs.keys())
graph = groundtruth.graphs[first_snapshot]
print(f"\nFirst snapshot (t={first_snapshot}):")
print(f"- {len(graph.nodes)} nodes")
print(f"- {len(graph.edges)} edges")
print(f"- Density: {2*len(graph.edges)/(len(graph.nodes)*(len(graph.nodes)-1)):.4f}")

# 9. Extract membership information
membership = Membership.from_tcommlist(groundtruth.tcommlist)
print(f"\nMembership details:")
print(f"- {len(membership.members)} unique members")
print(f"- {len(membership.evolving_communities)} evolving communities")
print(f"- {len(membership.static_communities)} static communities")

# 10. List community events
event_types = {}
for event in groundtruth.events:
    event_types[event.label] = event_types.get(event.label, 0) + 1

print("\nCommunity events:")
for event_type, count in event_types.items():
    print(f"- {event_type}: {count} occurrences")

# 11. Create a visualization of community evolution using a Sankey diagram
plot_sankey(membership.community_graph)

print("\nBenchmark generation and analysis complete!")

This tutorial has demonstrated the complete process of generating customized benchmarks using dyn-benchmark, explaining all parameters and their effects on community evolution and network structure. For more advanced analyses and visualizations, refer to the other tutorials in this documentation.

References

[AFGW13]

Thomas Aynaud, Eric Fleury, Jean-Loup Guillaume, and Qinna Wang. Communities in Evolving Networks: Definitions, Detection, and Analysis Techniques, pages 159–200. Springer New York, New York, NY, 2013. doi:10.1007/978-1-4614-6729-8_9.

[HLL83]

Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: first steps. Social Networks, 5(2):109–137, 1983. doi:10.1016/0378-8733(83)90021-7.

[TGT20]

Wenpin Tang, Xin Guo, and Fengmin Tang. The Buckley-Osthus model and the block preferential attachment model: statistical analysis and application. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 9377–9386. PMLR, 13–18 Jul 2020. URL: https://proceedings.mlr.press/v119/tang20b.html.

[TCL10]

Roberto Tonelli, Giulio Concas, and Mario Locci. Three efficient algorithms for implementing the preferential attachment mechanism in Yule-Simon stochastic process. WSEAS Trans. Info. Sci. and App., 7(2):176–185, February 2010. URL: https://dl.acm.org/doi/10.5555/1852489.1852492.