K-Core Decomposition for Ultramarathon Data
The Sparsity Problem
Ultramarathon race results present a fascinating modeling challenge. With thousands of races and hundreds of thousands of runners, you might think we'd have plenty of data—but the reality is far more complex. Most runners complete only a handful of races, and most races see only a fraction of the broader running population. This creates an extremely sparse dataset where direct comparisons between runners or courses are often impossible.
Imagine trying to compare a runner who only does Western States with someone who only does Leadville. Without shared race participation, we have no common reference point. This sparsity becomes a critical problem when we want to build statistical models that estimate runner abilities or course difficulties.
The solution? Rather than wrestling with the full sparse dataset, we can identify a densely-connected subset where runners and courses have enough overlap for meaningful inference. This is where k-core decomposition comes in—a graph-theoretic technique that iteratively prunes nodes until every remaining node meets minimum connectivity requirements.
In this notebook, we'll:
- Explore the space of possible k-core configurations
- Visualize the tradeoffs between subset size and density
- Select and examine a specific k-core for downstream modeling
Setup
We begin by loading race results from UltraSignup, a comprehensive database of ultramarathon events. The preprocessing pipeline filters to running events (excluding cycling, swimming, etc.), extracts distance information, and identifies races that report DNFs (Did Not Finish)—a key signal for our later analysis.
We focus on marathon-distance and longer events (≥26.2 miles) since these are the races where DNF patterns become most meaningful and where the ultramarathon community truly begins.
```python
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import igraph as ig

from utils.data_processing import load_results, process_results, filter_races_with_dnfs
from utils.kcore_subsetting import evaluate_all_kcore_candidates, subset_kcore_data
```
```python
# preprocess data

results = load_results()
results = process_results(results)
results = filter_races_with_dnfs(results)
results = results[results['distance_miles'] >= 26.2]
```
Exploring the K-Core Parameter Space
What is an (α, β)-Core?
Think of ultramarathon data as a bipartite graph: one set of nodes represents runners, another represents courses (race-distance combinations), and edges connect runners to courses they've completed. An (α, β)-core is the largest subgraph where:
- Every runner has completed at least α different courses
- Every course has at least β different runners
The parameters α and β control the density-size tradeoff:
- Higher α means runners must have more race history → fewer runners qualify
- Higher β means courses need more participants → fewer courses qualify
Finding the right balance is crucial. Too loose, and we retain too much sparsity for effective modeling. Too strict, and we throw away valuable data.
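The iterative pruning just described can be sketched in a few lines of pandas. This is a toy illustration with made-up `runner`/`course` columns, not the project's `subset_kcore_data` utility:

```python
import pandas as pd

def alpha_beta_core(results, alpha, beta):
    """Iteratively drop runners with < alpha courses and courses with < beta runners."""
    core = results.drop_duplicates(['runner', 'course'])
    while True:
        runner_deg = core.groupby('runner')['course'].nunique()
        course_deg = core.groupby('course')['runner'].nunique()
        pruned = core[core['runner'].isin(runner_deg[runner_deg >= alpha].index)
                      & core['course'].isin(course_deg[course_deg >= beta].index)]
        if len(pruned) == len(core):  # fixed point: every node meets both thresholds
            return pruned
        core = pruned

# Toy data: course W has only one runner, so it is pruned; that drops runner D
# below α=2, so D is pruned on the next pass — pruning cascades.
toy = pd.DataFrame({
    'runner': ['A', 'A', 'B', 'B', 'D', 'D'],
    'course': ['X', 'Y', 'X', 'Y', 'X', 'W'],
})
core = alpha_beta_core(toy, alpha=2, beta=2)
print(sorted(core['runner'].unique()), sorted(core['course'].unique()))
```

The cascade is why the loop runs to a fixed point rather than making a single pass: removing an under-subscribed course can push a runner below α, and vice versa.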
Breadth-First Exploration
Rather than testing every possible (α, β) combination, we use an intelligent breadth-first search. Starting from minimal requirements, we incrementally tighten constraints and track how the subset evolves—measuring not just size, but also completeness metrics that indicate how representative the k-core remains of each runner's and course's full history.
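One property that makes this kind of sweep tractable is that (α, β)-cores are nested: tightening a constraint can start from the previous core instead of the full dataset and reach the same answer. A toy demonstration under assumed `runner`/`course` column names (whether `evaluate_all_kcore_candidates` exploits this internally is an assumption):

```python
import pandas as pd

def prune(df, alpha, beta):
    # Repeatedly drop under-connected runners/courses until a fixed point (illustrative pruner)
    while True:
        r = df.groupby('runner')['course'].nunique()
        c = df.groupby('course')['runner'].nunique()
        kept = df[df['runner'].isin(r[r >= alpha].index)
                  & df['course'].isin(c[c >= beta].index)]
        if len(kept) == len(df):
            return kept
        df = kept

toy = pd.DataFrame({
    'runner': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D'],
    'course': ['X', 'Y', 'Z', 'X', 'Y', 'X', 'Y', 'Z'],
})

# Tightening beta from 2 to 3: pruning the previous core gives the same
# result as pruning the full dataset from scratch.
core_b2 = prune(toy, alpha=2, beta=2)
from_scratch = prune(toy, alpha=2, beta=3)
incremental = prune(core_b2, alpha=2, beta=3)
print(from_scratch.equals(incremental))
```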
```python
# search k-core space

candidates_df = evaluate_all_kcore_candidates(
    results=results,
    max_entities=None,  # explore full parameter space
    min_alpha=2,        # courses per runner
    min_beta=30,        # runners per course
    beta_step=2,
    verbose=True,
    use_cache=True
)
```
✅ Loading cached k-core candidates from candidates_maxentNone_alpha2_beta30_step2.pkl
Visualizing the Parameter Landscape
The 3D visualization below reveals the fundamental tradeoffs in k-core selection. Each point on the surface represents a different (α, β) configuration:
- X-axis (Course Completeness): On average, what percentage of each course's total runners are retained in the k-core?
- Y-axis (Runner Completeness): On average, what percentage of each runner's race history is retained?
- Z-axis: Configurable—toggle between course count, total entities, or total results
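The two completeness axes can be computed directly from the full dataset and a candidate core. A minimal sketch with assumed `runner`/`course` column names (the notebook's metrics come from `evaluate_all_kcore_candidates`; this is an illustrative reconstruction):

```python
import pandas as pd

def completeness(full, core):
    """Mean % of each retained runner's history (and course's field) kept in the core."""
    full_runner = full.groupby('runner')['course'].nunique()
    core_runner = core.groupby('runner')['course'].nunique()
    runner_pct = (core_runner / full_runner.loc[core_runner.index] * 100).mean()

    full_course = full.groupby('course')['runner'].nunique()
    core_course = core.groupby('course')['runner'].nunique()
    course_pct = (core_course / full_course.loc[core_course.index] * 100).mean()
    return course_pct, runner_pct

# Toy data: the core drops runner C and course Z
full = pd.DataFrame({'runner': list('AABBCC'), 'course': list('XYXYXZ')})
core = full[(full['runner'] != 'C') & (full['course'] != 'Z')]
course_pct, runner_pct = completeness(full, core)
print(round(course_pct, 1), runner_pct)  # course X keeps only 2 of its 3 runners
```

Note the averages are taken over entities *retained* in the core, which is why both axes can stay high even when many peripheral runners and courses are discarded.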
Key patterns to observe:
- The surface drops off sharply as we move toward higher completeness values—achieving both high course and runner completeness simultaneously requires sacrificing dataset size
- The "ridge" of the surface shows the Pareto frontier: configurations where you can't improve one metric without sacrificing another
- Hover over points to see the exact (α, β) values and detailed statistics
Use the toggle buttons to explore different perspectives on the data.
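The Pareto frontier mentioned above can also be extracted directly from the candidates table rather than read off the surface. A minimal sketch (the candidate values below are hypothetical; only the column names mirror the notebook's):

```python
import pandas as pd

def pareto_frontier(df, x, y):
    """Keep configurations not dominated on the (x, y) axes."""
    pts = df.sort_values([x, y], ascending=False).reset_index(drop=True)
    best_y = float('-inf')
    keep = []
    for i, row in pts.iterrows():
        # A point survives only if no point with higher x also had at least its y
        if row[y] > best_y:
            keep.append(i)
            best_y = row[y]
    return pts.loc[keep]

# Hypothetical candidate configurations
cands = pd.DataFrame({
    'alpha': [2, 2, 3, 3, 4],
    'beta': [30, 60, 60, 90, 90],
    'avg_course_completeness': [10.0, 20.0, 40.0, 30.0, 35.0],
    'avg_runner_completeness': [80.0, 70.0, 50.0, 55.0, 60.0],
})
front = pareto_frontier(cands, 'avg_course_completeness', 'avg_runner_completeness')
print(front[['alpha', 'beta']].values.tolist())  # (3, 90) is dominated by (4, 90)
```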
```python
def plot_3d_kcore_space(candidates_df: pd.DataFrame, full_results: pd.DataFrame = None):
    """
    Create interactive 3D wireframe visualization of k-core parameter space.

    Displays parameter space as a grid wireframe, connecting discrete (α, β) points
    from breadth-first search exploration.

    Args:
        candidates_df: All k-core candidates with metrics
        full_results: Full dataset for calculating baseline density reference planes
    """
    df = candidates_df.copy()

    # Calculate total entities
    df['n_entities'] = df['n_courses'] + df['n_participants']

    # Get unique sorted alpha and beta values
    alpha_values = sorted(df['alpha'].unique())
    beta_values = sorted(df['beta'].unique())

    # Create grid lookup with all z metrics
    grid_data = {}
    for _, row in df.iterrows():
        key = (row['alpha'], row['beta'])
        grid_data[key] = {
            'x': row['avg_course_completeness'],
            'y': row['avg_runner_completeness'],
            'z_courses': row['n_courses'],
            'z_entities': row['n_entities'],
            'z_total_results': row['total_results'],
            'n_courses': row['n_courses'],
            'n_entities': row['n_entities'],
            'total_results': row['total_results'],
            'hover': (
                f"α={int(row['alpha'])}, β={int(row['beta'])}<br>"
                f"Courses: {int(row['n_courses']):,}<br>"
                f"Runners: {int(row['n_participants']):,}<br>"
                f"Total Entities: {int(row['n_entities']):,}<br>"
                f"Total Results: {int(row['total_results']):,}<br>"
                f"Overall Density: {row['density']:.1f}<br>"
                f"Runner completeness: {row['avg_runner_completeness']:.1f}%<br>"
                f"Course completeness: {row['avg_course_completeness']:.1f}%"
            )
        }

    # Determine color scale ranges for each metric
    min_courses = df['n_courses'].min()
    max_courses = df['n_courses'].max()
    min_entities = df['n_entities'].min()
    max_entities = df['n_entities'].max()
    min_results = df['total_results'].min()
    max_results = df['total_results'].max()

    fig = go.Figure()

    # Helper function to create traces for a given z metric and color metric
    def create_traces(z_key, color_key):
        traces = []

        # Get color metric min/max
        if color_key == 'n_courses':
            color_min, color_max = min_courses, max_courses
        elif color_key == 'n_entities':
            color_min, color_max = min_entities, max_entities
        else:  # total_results
            color_min, color_max = min_results, max_results

        # Draw horizontal grid lines (constant alpha, varying beta)
        for alpha in alpha_values:
            x_line, y_line, z_line, color_line, hover_line = [], [], [], [], []
            for beta in beta_values:
                if (alpha, beta) in grid_data:
                    point = grid_data[(alpha, beta)]
                    x_line.append(point['x'])
                    y_line.append(point['y'])
                    z_line.append(point[z_key])
                    color_line.append(point[color_key])
                    hover_line.append(point['hover'])

            if len(x_line) > 1:  # Only draw if we have multiple points
                # Use average color for the line segment (very light to very dark blue)
                avg_color = sum(color_line) / len(color_line)
                normalized_color = (avg_color - color_min) / (color_max - color_min) if color_max > color_min else 0

                # Very light blue (224, 242, 255) to very dark blue (0, 30, 80)
                r = int(224 + normalized_color * (0 - 224))
                g = int(242 + normalized_color * (30 - 242))
                b = int(255 + normalized_color * (80 - 255))

                traces.append(go.Scatter3d(
                    x=x_line,
                    y=y_line,
                    z=z_line,
                    mode='lines',
                    line=dict(
                        color=f'rgb({r}, {g}, {b})',
                        width=2
                    ),
                    text=hover_line,
                    hoverinfo='text',
                    showlegend=False,
                    visible=False  # Will be set to True for initial z metric
                ))

        # Draw vertical grid lines (constant beta, varying alpha)
        for beta in beta_values:
            x_line, y_line, z_line, color_line, hover_line = [], [], [], [], []
            for alpha in alpha_values:
                if (alpha, beta) in grid_data:
                    point = grid_data[(alpha, beta)]
                    x_line.append(point['x'])
                    y_line.append(point['y'])
                    z_line.append(point[z_key])
                    color_line.append(point[color_key])
                    hover_line.append(point['hover'])

            if len(x_line) > 1:  # Only draw if we have multiple points
                # Use average color for the line segment (very light to very dark blue)
                avg_color = sum(color_line) / len(color_line)
                normalized_color = (avg_color - color_min) / (color_max - color_min) if color_max > color_min else 0

                # Very light blue (224, 242, 255) to very dark blue (0, 30, 80)
                r = int(224 + normalized_color * (0 - 224))
                g = int(242 + normalized_color * (30 - 242))
                b = int(255 + normalized_color * (80 - 255))

                traces.append(go.Scatter3d(
                    x=x_line,
                    y=y_line,
                    z=z_line,
                    mode='lines',
                    line=dict(
                        color=f'rgb({r}, {g}, {b})',
                        width=2
                    ),
                    text=hover_line,
                    hoverinfo='text',
                    showlegend=False,
                    visible=False  # Will be set to True for initial z metric
                ))

        return traces

    # Create traces for all 9 combinations (3 z-metrics × 3 color-metrics)
    trace_sets = {}
    for z_key in ['z_courses', 'z_entities', 'z_total_results']:
        for color_key in ['n_courses', 'n_entities', 'total_results']:
            trace_sets[(z_key, color_key)] = create_traces(z_key, color_key)

    # Add all traces (start with courses z-axis, total_results coloring)
    for (z_key, color_key), traces in trace_sets.items():
        for trace in traces:
            if z_key == 'z_courses' and color_key == 'total_results':
                trace.visible = True
            fig.add_trace(trace)

    # Add a single invisible trace for the colorbar
    fig.add_trace(go.Scatter3d(
        x=[df['avg_course_completeness'].iloc[0]],
        y=[df['avg_runner_completeness'].iloc[0]],
        z=[df['n_courses'].iloc[0]],
        mode='markers',
        marker=dict(
            size=0.1,
            color=[min_results],
            colorscale=[[0, 'rgb(224, 242, 255)'], [1, 'rgb(0, 30, 80)']],
            showscale=True,
            colorbar=dict(
                title="Total Results",
                x=1.1,
                tickformat=','
            ),
            cmin=min_results,
            cmax=max_results
        ),
        showlegend=False,
        hoverinfo='skip',
        visible=True
    ))

    # Calculate number of traces per combination
    n_traces_per_combo = len(trace_sets[('z_courses', 'n_courses')])

    # Create visibility arrays for toggling
    def make_visible_array(z_metric, color_metric):
        """Create visibility array for all traces."""
        visible = []
        for (z_key, color_key) in trace_sets.keys():
            z_match = (z_key == f'z_{z_metric}')
            color_match = (color_key == color_metric)
            visible.extend([z_match and color_match] * n_traces_per_combo)
        # Colorbar trace
        visible.append(True)
        return visible

    # Get current color metric name for colorbar updates
    def get_color_title(color_metric):
        if color_metric == 'n_courses':
            return "Course Count"
        elif color_metric == 'n_entities':
            return "Total Entities"
        else:
            return "Total Results"

    def get_color_range(color_metric):
        if color_metric == 'n_courses':
            return [min_courses, max_courses]
        elif color_metric == 'n_entities':
            return [min_entities, max_entities]
        else:
            return [min_results, max_results]

    # Update layout
    fig.update_layout(
        title=dict(
            text='Interactive 3D K-Core Parameter Space Wireframe',
            x=0.5,
            xanchor='center',
            font=dict(size=18)
        ),
        scene=dict(
            xaxis=dict(
                title='Course Completeness',
                backgroundcolor="rgb(230, 230, 230)",
                gridcolor="white",
                ticksuffix='%'
            ),
            yaxis=dict(
                title='Runner Completeness',
                backgroundcolor="rgb(230, 230, 230)",
                gridcolor="white",
                ticksuffix='%'
            ),
            zaxis=dict(
                title='Course Count',
                backgroundcolor="rgb(230, 230, 230)",
                gridcolor="white",
                separatethousands=True
            ),
            camera=dict(
                eye=dict(x=1.5, y=-1.5, z=1.3)  # Runner completeness on right, course on left
            )
        ),
        width=1000,
        height=850,
        hovermode='closest',
        updatemenus=[
            # Z-axis toggle buttons
            dict(
                type="buttons",
                direction="left",
                buttons=[
                    dict(
                        label="Z-axis: Courses",
                        method="update",
                        args=[
                            {"visible": make_visible_array('courses', 'total_results')},
                            {
                                "scene.zaxis.title": "Course Count",
                                "scene.zaxis.range": [0, df['n_courses'].max() * 1.1],
                                "scene.zaxis.separatethousands": True
                            }
                        ]
                    ),
                    dict(
                        label="Z-axis: Total Entities",
                        method="update",
                        args=[
                            {"visible": make_visible_array('entities', 'total_results')},
                            {
                                "scene.zaxis.title": "Total Entities (Courses + Runners)",
                                "scene.zaxis.range": [0, df['n_entities'].max() * 1.1],
                                "scene.zaxis.separatethousands": True
                            }
                        ]
                    ),
                    dict(
                        label="Z-axis: Total Results",
                        method="update",
                        args=[
                            {"visible": make_visible_array('total_results', 'total_results')},
                            {
                                "scene.zaxis.title": "Total Results",
                                "scene.zaxis.range": [0, df['total_results'].max() * 1.1],
                                "scene.zaxis.separatethousands": True
                            }
                        ]
                    ),
                ],
                x=0.5,
                xanchor="center",
                y=1.08,
                yanchor="top",
                bgcolor="rgba(255, 255, 255, 0.8)",
                bordercolor="rgba(100, 100, 100, 0.5)",
                borderwidth=1
            ),
            # Color mapping toggle buttons
            dict(
                type="buttons",
                direction="left",
                buttons=[
                    dict(
                        label="Color: Courses",
                        method="update",
                        args=[
                            {"visible": make_visible_array('courses', 'n_courses')},
                            {
                                "marker.colorbar.title": "Course Count",
                                "marker.cmin": min_courses,
                                "marker.cmax": max_courses
                            }
                        ]
                    ),
                    dict(
                        label="Color: Total Entities",
                        method="update",
                        args=[
                            {"visible": make_visible_array('courses', 'n_entities')},
                            {
                                "marker.colorbar.title": "Total Entities",
                                "marker.cmin": min_entities,
                                "marker.cmax": max_entities
                            }
                        ]
                    ),
                    dict(
                        label="Color: Total Results",
                        method="update",
                        args=[
                            {"visible": make_visible_array('courses', 'total_results')},
                            {
                                "marker.colorbar.title": "Total Results",
                                "marker.cmin": min_results,
                                "marker.cmax": max_results
                            }
                        ]
                    ),
                ],
                x=0.5,
                xanchor="center",
                y=1.03,
                yanchor="top",
                bgcolor="rgba(255, 255, 255, 0.8)",
                bordercolor="rgba(100, 100, 100, 0.5)",
                borderwidth=1
            )
        ]
    )

    fig.show()

# Visualize the parameter space with interactive 3D plot
plot_3d_kcore_space(candidates_df=candidates_df, full_results=results)
```
Examining the (3, 840) K-Core
Based on our exploration, we select the (α=3, β=840) configuration for downstream modeling. This choice balances several considerations:
- α=3: Runners must have completed at least 3 different course types. This ensures we have enough repeated observations per runner to estimate individual ability, while not being so restrictive that we lose casual participants.
- β=840: Courses must have at least 840 unique runners. This focuses on well-established events with substantial participation—races where we can reliably estimate course difficulty and where DNF reporting is likely to be consistent.
The resulting subset represents the "core" ultramarathon community: dedicated runners who return to the sport repeatedly, competing in popular events that form the backbone of the ultra calendar.
Visualizing the Bipartite Structure
With our k-core selected, we can now visualize the runner-course relationships. These visualizations reveal the community structure hidden within the data—clusters of runners who share similar race portfolios, courses that attract overlapping populations, and the overall connectivity patterns that make comparative inference possible.
Building the Subset
We extract all results matching our k-core criteria and compute basic network statistics. The degree of a runner is the number of unique courses they've completed; the degree of a course is its number of unique finishers. These degree distributions tell us about the heterogeneity within our subset—are there a few "super-connectors" or is participation relatively uniform?
```python
# Create (3, 840) k-core subset
alpha, beta = 3, 840

# Ensure fresh copy without any previous course_id columns
results_clean = results.drop(columns=['course_id'], errors='ignore')
kcore_subset = subset_kcore_data(results_clean, alpha=alpha, beta=beta)
kcore_data = kcore_subset[kcore_subset['in_kcore']].copy()

# Get degree statistics: unique courses per runner and unique runners per course
runner_degrees = kcore_data.groupby('participant_id')['event_distance_id'].nunique()
course_degrees = kcore_data.groupby('event_distance_id')['participant_id'].nunique()

# Use ALL runners in k-core (no sampling)
kcore_sample = kcore_data.copy()

# Count edges (participation frequency per runner-course pair)
edge_counts = (
    kcore_sample
    .groupby(['participant_id', 'event_distance_id'])
    .size()
    .reset_index(name='edge_weight')
)

print(f"K-core (α={alpha}, β={beta}) statistics:")
print(f"  Total courses: {kcore_data['event_distance_id'].nunique()}")
print(f"  Total runners: {kcore_data['participant_id'].nunique()}")
print(f"  Total results: {len(kcore_data):,}")
print(f"  Unique edges: {len(edge_counts)}")
print(f"  Max edge weight: {edge_counts['edge_weight'].max()}")
```
K-core (α=3, β=840) statistics:
Total courses: 367
Total runners: 3061
Total results: 31,490
Unique edges: 31490
Max edge weight: 1
Adjacency Matrix Heatmap
The adjacency matrix provides a bird's-eye view of runner-course relationships. Each row is a runner, each column is a course, and cells are colored by percentile finish time (darker = faster relative to that course's field).
We apply hierarchical clustering to both rows and columns, reordering them to reveal latent structure. Look for:
- Block patterns: Groups of runners who share similar race portfolios (they cluster together vertically) and tend to perform similarly across those races
- Color gradients: Runners who consistently finish in similar percentiles across different courses (horizontal color consistency indicates stable relative ability)
- Sparse regions: The gaps remind us that even within our dense k-core, not every runner-course combination is observed
This matrix shows only the top 100 runners by degree for readability, but the patterns extend throughout the full dataset.
```python
import plotly.express as px
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Sample top 100 runners for adjacency matrix visualization
top_runners = runner_degrees.nlargest(100).index
kcore_matrix_sample = kcore_sample[kcore_sample['participant_id'].isin(top_runners)]

# Create adjacency matrix with percentile finish times
matrix_data = kcore_matrix_sample.groupby(['participant_id', 'event_distance_id'])['time_ms'].mean().reset_index()

# Calculate percentile for each course
percentile_data = []
for event_id in matrix_data['event_distance_id'].unique():
    # Get all finish times for this course
    course_times = kcore_matrix_sample[kcore_matrix_sample['event_distance_id'] == event_id]['time_ms']

    # Get participant times for this course
    participant_times = matrix_data[matrix_data['event_distance_id'] == event_id].copy()

    # Calculate percentile rank (0-100)
    participant_times['percentile'] = participant_times['time_ms'].rank(pct=True) * 100

    percentile_data.append(participant_times)

# Combine all percentiles
percentile_df = pd.concat(percentile_data, ignore_index=True)

# Create matrix with percentiles
matrix = percentile_df.pivot(index='participant_id', columns='event_distance_id', values='percentile')

# Fill NaN values with median for clustering (represents missing races)
matrix_filled = matrix.fillna(matrix.median().median())

# Perform hierarchical clustering on rows (runners)
row_linkage = linkage(pdist(matrix_filled, metric='euclidean'), method='ward')
row_dendrogram = dendrogram(row_linkage, no_plot=True)
row_order = row_dendrogram['leaves']

# Perform hierarchical clustering on columns (courses)
col_linkage = linkage(pdist(matrix_filled.T, metric='euclidean'), method='ward')
col_dendrogram = dendrogram(col_linkage, no_plot=True)
col_order = col_dendrogram['leaves']

# Reorder matrix using clustering results
matrix_clustered = matrix.iloc[row_order, col_order]

# Get runner names for hover text
runner_names = kcore_matrix_sample[['participant_id', 'firstname', 'lastname']].drop_duplicates('participant_id').set_index('participant_id')
runner_names['full_name'] = runner_names['firstname'] + ' ' + runner_names['lastname']

# Create custom hover text with runner names
matrix_clustered_with_names = matrix_clustered.copy()
matrix_clustered_with_names.index = [
    runner_names.loc[pid, 'full_name'] if pid in runner_names.index else str(pid)
    for pid in matrix_clustered.index
]

# Get race names for column labels - keep event_distance_id to ensure uniqueness
race_info = kcore_matrix_sample[['event_distance_id', 'name']].drop_duplicates('event_distance_id').set_index('event_distance_id')
# Create unique labels by combining name with event_distance_id
matrix_clustered_with_names.columns = [
    f"{race_info.loc[col, 'name'][:25]}... (ID:{col})" if len(race_info.loc[col, 'name']) > 25
    else f"{race_info.loc[col, 'name']} (ID:{col})"
    for col in matrix_clustered_with_names.columns
]

fig = px.imshow(
    matrix_clustered_with_names,
    aspect='auto',
    color_continuous_scale='Viridis',
    labels=dict(color="Percentile", x="Course", y="Runner"),
    title=f"Runner × Course Adjacency Matrix (α={alpha}, β={beta}, Top 100 Runners)<br><sub>Color = Percentile Finish Time (higher = slower) | Hierarchically Clustered</sub>"
)

fig.update_layout(
    width=1200,
    height=800,
    xaxis=dict(tickangle=-45, tickfont=dict(size=8), showgrid=False, showline=False, showticklabels=False, title=''),
    yaxis=dict(tickfont=dict(size=6), showgrid=False, showline=False, showticklabels=False, title=''),
    plot_bgcolor='white',
    paper_bgcolor='white'
)

fig.show()
```
Force-Directed Network
The final visualization presents the k-core as a force-directed network graph. This physics-inspired layout treats edges as springs and nodes as mutually-repelling particles, letting the graph "relax" into a natural arrangement where connected nodes cluster together.
Super-Node Compression
With thousands of runners, a naive visualization would be overwhelming. We address this by creating super-nodes: runners with identical race participation patterns are merged into a single node. This is semantically meaningful—runners who've done exactly the same set of races are essentially interchangeable from a network topology perspective.
The resulting graph reveals:
- Central hubs: Courses that attract runners from across the community (high connectivity)
- Peripheral clusters: Regional or niche race circuits with dedicated but isolated participant pools
- Bridge nodes: Runners or courses that connect otherwise-separate sub-communities
Interactive features: Hover over any node to highlight its connections. Course nodes appear in coral; runner groups in the Viridis colorscale (colored by degree).
```python
# Create super-nodes: group runners with identical race participation patterns
# This reduces graph complexity while preserving structure

# Build race participation signatures for each runner
runner_signatures = {}
for runner_id in kcore_sample['participant_id'].unique():
    races = frozenset(kcore_sample[kcore_sample['participant_id'] == runner_id]['event_distance_id'])
    runner_signatures[runner_id] = races

# Invert to create super-nodes: signature -> set of runners
unique_signatures = {}
for runner_id, sig in runner_signatures.items():
    if sig not in unique_signatures:
        unique_signatures[sig] = set()
    unique_signatures[sig].add(runner_id)

# Get unique courses
unique_courses = sorted(kcore_sample['event_distance_id'].unique())

# Get runner and race names for hover text
runner_names = kcore_sample[['participant_id', 'firstname', 'lastname']].drop_duplicates('participant_id')
runner_names = {row['participant_id']: f"{row['firstname']} {row['lastname']}" for _, row in runner_names.iterrows()}

race_names = kcore_sample[['event_distance_id', 'name']].drop_duplicates('event_distance_id')
race_names = {row['event_distance_id']: row['name'] for _, row in race_names.iterrows()}

print(f"Created {len(unique_signatures):,} super-nodes from {len(runner_signatures):,} runners")
print(f"Average group size: {len(runner_signatures) / len(unique_signatures):.1f} runners per super-node")
print(f"Largest group: {max(len(runners) for runners in unique_signatures.values())} runners")
print(f"Total courses: {len(unique_courses)}")
```
Created 2,952 super-nodes from 3,061 runners
Average group size: 1.0 runners per super-node
Largest group: 9 runners
Total courses: 367
```python
import igraph as ig
import plotly.graph_objects as go
import numpy as np
import time
import json

start_time = time.time()

# Create bipartite graph from super-nodes
edges = []
edge_weights = {}

# Build runner index to super-node mapping
runner_to_super = {}
for super_idx, (sig, runners) in enumerate(unique_signatures.items()):
    for runner_id in runners:
        runner_to_super[runner_id] = super_idx

# Use dict for O(1) course lookups instead of list.index()
course_to_idx = {course_id: idx for idx, course_id in enumerate(unique_courses)}

# Build edges from edge counts
for _, row in edge_counts.iterrows():
    runner_id = row['participant_id']
    course_id = row['event_distance_id']

    if runner_id in runner_to_super:
        super_idx = runner_to_super[runner_id]
        # CRITICAL: Offset course index by number of super-nodes for bipartite graph
        course_idx = len(unique_signatures) + course_to_idx[course_id]
        edge_key = (super_idx, course_idx)
        edge_weights[edge_key] = edge_weights.get(edge_key, 0) + row['edge_weight']

# Create edges list
edges = list(edge_weights.keys())

# Create bipartite graph
layout_start = time.time()
g = ig.Graph.Bipartite(
    [0] * len(unique_signatures) + [1] * len(unique_courses),
    edges
)

# Compute layouts
layouts = {}

# Fruchterman-Reingold with maximum overlap prevention
# The 'grid' parameter is the key to preventing overlap
fr_start = time.time()
base_layout = g.layout_fruchterman_reingold(
    niter=10000,  # Many iterations for convergence
    grid='auto'   # Spatial grid prevents overlap - this is the critical parameter
)

layouts['Fruchterman-Reingold'] = base_layout

# Get node degrees
degrees = g.degree()

# Prepare visualization data
runner_names_list = []
for sig in unique_signatures.keys():
    runners = unique_signatures[sig]
    if len(runners) == 1:
        runner_id = list(runners)[0]
        runner_names_list.append(runner_names.get(runner_id, f"Runner {runner_id}"))
    else:
        runner_names_list.append(f"{len(runners)} runners")

course_names_list = [race_names.get(c, f"Course {c}") for c in unique_courses]

# Create figure with optimized dual-trace edge rendering
fig = go.Figure()

def add_optimized_layout(fig, layout):
    """Add layout with optimized edge rendering using dual traces."""
    pos_x = [coord[0] for coord in layout.coords]
    pos_y = [coord[1] for coord in layout.coords]

    # Build all edge coordinates and connectivity mapping
    edge_x = []
    edge_y = []
    node_to_segments = {}  # Maps node index -> list of segment indices
    node_neighbors = {}    # Maps node index -> list of connected node indices

    segment_idx = 0
    for edge_idx, edge in enumerate(g.es):
        source, target = edge.tuple

        # Track which segments connect to each node
        if source not in node_to_segments:
            node_to_segments[source] = []
            node_neighbors[source] = []
        if target not in node_to_segments:
            node_to_segments[target] = []
            node_neighbors[target] = []

        node_to_segments[source].append(segment_idx)
        node_to_segments[target].append(segment_idx)
        node_neighbors[source].append(target)
        node_neighbors[target].append(source)

        # Add edge coordinates (x1, x2, None for line break)
        edge_x.extend([pos_x[source], pos_x[target], None])
        edge_y.extend([pos_y[source], pos_y[target], None])

        segment_idx += 1

    # Add base edges trace (always visible, low opacity)
    fig.add_trace(go.Scattergl(  # WebGL acceleration!
        x=edge_x,
        y=edge_y,
        mode='lines',
        line=dict(width=0.5, color='rgba(200,200,200,0.3)'),
        hoverinfo='skip',  # Prevent hover on edges
        name='base_edges',
        showlegend=False
    ))

    # Add highlight edges trace (initially empty, updated by JavaScript)
    fig.add_trace(go.Scattergl(  # WebGL acceleration!
        x=[],
        y=[],
        mode='lines',
        line=dict(width=2, color='rgba(255,50,50,0.9)'),
        hoverinfo='skip',
        name='highlight_edges',
        showlegend=False
    ))

    # Prepare node data
    super_node_indices = list(range(len(unique_signatures)))
    course_indices = list(range(len(unique_signatures), len(unique_signatures) + len(unique_courses)))

    # Calculate super-node sizes based on group size (even smaller)
    super_node_sizes = []
    super_node_degrees = []
    super_node_hover_text = []

    for idx, (sig, runners) in enumerate(unique_signatures.items()):
        group_size = len(runners)
        # Reduced base size and scaling factor for smaller nodes
        super_node_sizes.append(3 + np.log1p(group_size) * 2)
        super_node_degrees.append(degrees[idx])

        if group_size == 1:
            runner_id = list(runners)[0]
            name = runner_names.get(runner_id, f"Runner {runner_id}")
            super_node_hover_text.append(f"{name}<br>Degree: {degrees[idx]}")
        else:
            sample_names = [runner_names.get(r, f"Runner {r}") for r in list(runners)[:3]]
            hover = f"Group of {group_size} runners<br>" + "<br>".join(sample_names)
            if group_size > 3:
                hover += f"<br>... and {group_size - 3} more"
            hover += f"<br>Degree: {degrees[idx]}"
            super_node_hover_text.append(hover)

    # Course hover text
    course_hover_text = [
        f"{course_names_list[i]}<br>Degree: {degrees[len(unique_signatures) + i]}"
        for i in range(len(unique_courses))
    ]

    # Calculate course sizes with even more aggressive sub-linear scaling
    course_sizes = [2 + np.log1p(degrees[i]) * 1.5 for i in course_indices]

    # Add super-node trace
    fig.add_trace(go.Scattergl(  # WebGL acceleration!
        x=[pos_x[i] for i in super_node_indices],
        y=[pos_y[i] for i in super_node_indices],
        mode='markers',
        marker=dict(
            size=super_node_sizes,
            color=super_node_degrees,
            colorscale='Viridis',
            showscale=False,
            symbol='circle',
            opacity=0.6,
            line=dict(width=0)
        ),
        text=super_node_hover_text,
        hoverinfo='text',
        name='Runner Groups',
        showlegend=False,
        customdata=[[i] for i in super_node_indices]
    ))

    # Add course node trace with sub-linear size scaling
    fig.add_trace(go.Scattergl(  # WebGL acceleration!
        x=[pos_x[i] for i in course_indices],
        y=[pos_y[i] for i in course_indices],
        mode='markers',
        marker=dict(
            size=course_sizes,
            color='coral',
            symbol='circle',
            opacity=0.6,
            line=dict(width=0)
        ),
        text=course_hover_text,
        hoverinfo='text',
        name='Courses',
        showlegend=False,
        customdata=[[i] for i in course_indices]
    ))

    # Add highlight runner trace (initially empty, will show highlighted runners on top)
    fig.add_trace(go.Scattergl(
        x=[],
        y=[],
        mode='markers',
        marker=dict(
            size=[],
            color=[],
            colorscale='Viridis',
            symbol='circle',
            opacity=1.0,
            line=dict(width=2, color='white')
        ),
        text=[],
        hoverinfo='text',
        name='highlight_runners',
        showlegend=False
    ))

    # Add highlight course trace (initially empty, will show highlighted courses on top)
    fig.add_trace(go.Scattergl(
        x=[],
        y=[],
        mode='markers',
        marker=dict(
            size=[],
            color='coral',
            symbol='circle',
            opacity=1.0,
            line=dict(width=2, color='white')
        ),
        text=[],
        hoverinfo='text',
        name='highlight_courses',
        showlegend=False
    ))

    # Return connectivity mapping, edge coordinates, neighbor mapping, and all positions
    return node_to_segments, edge_x, edge_y, node_neighbors, pos_x, pos_y, super_node_sizes, super_node_degrees, super_node_hover_text, course_hover_text, course_sizes

# Add visualization
node_to_segs, base_edge_x, base_edge_y, node_neighbors, pos_x, pos_y, runner_sizes, runner_degrees, runner_hover, course_hover, course_sizes = add_optimized_layout(fig, layouts['Fruchterman-Reingold'])
connectivity_data = {
    'node_to_segments': {str(k): v for k, v in node_to_segs.items()},
    'node_neighbors': {str(k): v for k, v in node_neighbors.items()},
    'base_edge_x': base_edge_x,
    'base_edge_y': base_edge_y,
    'pos_x': pos_x,
    'pos_y': pos_y,
    'runner_sizes': runner_sizes,
    'runner_degrees': runner_degrees,
    'runner_hover': runner_hover,
    'course_hover': course_hover,
    'course_sizes': course_sizes
}

# Update layout
fig.update_layout(
    title=f"Bipartite Network: Runner Groups × Courses (α={alpha}, β={beta})<br><sub>10k iterations with grid='auto' for overlap prevention + smaller nodes</sub>",
    showlegend=False,
    hovermode='closest',
    width=1200,
    height=700,
    plot_bgcolor='white',
    uirevision='constant'
)

# Remove axes
fig.update_xaxes(showgrid=False, zeroline=False, showticklabels=False)
fig.update_yaxes(showgrid=False, zeroline=False, showticklabels=False)

# Generate HTML with Plotly.update() for notebook compatibility
import plotly.io as pio
from IPython.display import HTML, display

html_str = pio.to_html(fig, include_plotlyjs='cdn', full_html=False)

# Enhanced: Bring connected nodes to top with borders
optimized_js = f"""
<div id="connectivity-data" style="display:none;">{json.dumps(connectivity_data)}</div>
<script>
(function() {{
    function initOptimizedHover() {{
        var allPlotlyDivs = document.querySelectorAll('.plotly-graph-div');
        var gd = allPlotlyDivs[allPlotlyDivs.length - 1];

        if (!gd || !gd.data) {{
            setTimeout(initOptimizedHover, 100);
            return;
        }}

        // Load connectivity data
        var connectivityData = JSON.parse(document.getElementById('connectivity-data').textContent);
        var nodeToSegments = connectivityData.node_to_segments;
        var nodeNeighbors = connectivityData.node_neighbors;
        var cachedBaseX = connectivityData.base_edge_x;
        var cachedBaseY = connectivityData.base_edge_y;
        var posX = connectivityData.pos_x;
        var posY = connectivityData.pos_y;
        var runnerSizes = connectivityData.runner_sizes;
        var runnerDegrees = connectivityData.runner_degrees;
        var runnerHover = connectivityData.runner_hover;
        var courseHover = connectivityData.course_hover;
        var courseSizes = connectivityData.course_sizes;

        // Find trace indices
        var baseTraceIdx = -1;
        var highlightTraceIdx = -1;
        var runnerTraceIdx = -1;
        var courseTraceIdx = -1;
        var highlightRunnerTraceIdx = -1;
        var highlightCourseTraceIdx = -1;

        gd.data.forEach(function(trace, i) {{
            if (trace.name === 'base_edges') baseTraceIdx = i;
            else if (trace.name === 'highlight_edges') highlightTraceIdx = i;
            else if (trace.name === 'Runner Groups') runnerTraceIdx = i;
            else if (trace.name === 'Courses') courseTraceIdx = i;
            else if (trace.name === 'highlight_runners') highlightRunnerTraceIdx = i;
            else if (trace.name === 'highlight_courses') highlightCourseTraceIdx = i;
        }});

        if (baseTraceIdx === -1 || highlightTraceIdx === -1) return;

        // Store original node data
        var originalRunnerData = {{
            marker: {{
                opacity: gd.data[runnerTraceIdx].marker.opacity
            }}
        }};

        var originalCourseData = {{
            marker: {{
                opacity: gd.data[courseTraceIdx].marker.opacity
            }}
        }};

        // Recursive handler attachment function
        function attachHoverHandlers(gd, highlightTraceIdx) {{
            gd.removeAllListeners('plotly_hover');
            gd.removeAllListeners('plotly_unhover');

            // Hover handler
            gd.on('plotly_hover', function(data) {{
                if (!data.points || data.points.length === 0) return;

                var point
```
= data.points[0]; 354 if (point.data.mode !== 'markers' || !point.data.customdata) return; 355 356 var nodeIdx = point.data.customdata[point.pointIndex][0]; 357 var connectedSegments = nodeToSegments[nodeIdx.toString()] || []; 358 var neighbors = nodeNeighbors[nodeIdx.toString()] || []; 359 360 if (connectedSegments.length === 0) return; 361 362 // Build highlight coordinates 363 var highlightX = []; 364 var highlightY = []; 365 for (var i = 0; i < connectedSegments.length; i++) {{ 366 var segIdx = connectedSegments[i]; 367 var startIdx = segIdx * 3; 368 highlightX.push(cachedBaseX[startIdx], cachedBaseX[startIdx + 1], null); 369 highlightY.push(cachedBaseY[startIdx], cachedBaseY[startIdx + 1], null); 370 }} 371 372 // Dim original nodes 373 var runnerOpacity = new Array(gd.data[runnerTraceIdx].x.length).fill(0.2); 374 var courseOpacity = new Array(gd.data[courseTraceIdx].x.length).fill(0.2); 375 376 // Build highlighted node data (these will appear on top with borders) 377 var highlightRunnerX = []; 378 var highlightRunnerY = []; 379 var highlightRunnerSizes = []; 380 var highlightRunnerColors = []; 381 var highlightRunnerText = []; 382 383 var highlightCourseX = []; 384 var highlightCourseY = []; 385 var highlightCourseSizes = []; 386 var highlightCourseText = []; 387 388 neighbors.forEach(function(neighborIdx) {{ 389 if (neighborIdx < gd.data[runnerTraceIdx].x.length) {{ 390 // It's a runner - add to highlight trace 391 highlightRunnerX.push(posX[neighborIdx]); 392 highlightRunnerY.push(posY[neighborIdx]); 393 highlightRunnerSizes.push(runnerSizes[neighborIdx]); 394 highlightRunnerColors.push(runnerDegrees[neighborIdx]); 395 highlightRunnerText.push(runnerHover[neighborIdx]); 396 }} else {{ 397 // It's a course - add to highlight trace 398 var courseArrayIdx = neighborIdx - gd.data[runnerTraceIdx].x.length; 399 highlightCourseX.push(posX[neighborIdx]); 400 highlightCourseY.push(posY[neighborIdx]); 401 highlightCourseSizes.push(courseSizes[courseArrayIdx]); 402 
highlightCourseText.push(courseHover[courseArrayIdx]); 403 }} 404 }}); 405 406 // Update all traces 407 Plotly.update(gd, 408 {{ 409 x: [highlightX], 410 y: [highlightY] 411 }}, 412 {{}}, 413 [highlightTraceIdx] 414 ).then(function() {{ 415 return Plotly.restyle(gd, {{'marker.opacity': runnerOpacity}}, [runnerTraceIdx]); 416 }}).then(function() {{ 417 return Plotly.restyle(gd, {{'marker.opacity': courseOpacity}}, [courseTraceIdx]); 418 }}).then(function() {{ 419 return Plotly.restyle(gd, {{ 420 x: [highlightRunnerX], 421 y: [highlightRunnerY], 422 'marker.size': [highlightRunnerSizes], 423 'marker.color': [highlightRunnerColors], 424 text: [highlightRunnerText] 425 }}, [highlightRunnerTraceIdx]); 426 }}).then(function() {{ 427 return Plotly.restyle(gd, {{ 428 x: [highlightCourseX], 429 y: [highlightCourseY], 430 'marker.size': [highlightCourseSizes], 431 text: [highlightCourseText] 432 }}, [highlightCourseTraceIdx]); 433 }}).then(function() {{ 434 attachHoverHandlers(gd, highlightTraceIdx); 435 }}); 436 }}); 437 438 // Unhover handler 439 gd.on('plotly_unhover', function() {{ 440 // Reset everything 441 Plotly.update(gd, 442 {{ 443 x: [[]], 444 y: [[]] 445 }}, 446 {{}}, 447 [highlightTraceIdx] 448 ).then(function() {{ 449 return Plotly.restyle(gd, {{ 450 'marker.opacity': originalRunnerData.marker.opacity 451 }}, [runnerTraceIdx]); 452 }}).then(function() {{ 453 return Plotly.restyle(gd, {{ 454 'marker.opacity': originalCourseData.marker.opacity 455 }}, [courseTraceIdx]); 456 }}).then(function() {{ 457 return Plotly.restyle(gd, {{ 458 x: [[]], 459 y: [[]], 460 'marker.size': [[]], 461 'marker.color': [[]], 462 text: [[]] 463 }}, [highlightRunnerTraceIdx]); 464 }}).then(function() {{ 465 return Plotly.restyle(gd, {{ 466 x: [[]], 467 y: [[]], 468 'marker.size': [[]], 469 text: [[]] 470 }}, [highlightCourseTraceIdx]); 471 }}).then(function() {{ 472 attachHoverHandlers(gd, highlightTraceIdx); 473 }}); 474 }}); 475 }} 476 477 // Initial handler attachment 478 
attachHoverHandlers(gd, highlightTraceIdx); 479 }} 480 481 if (document.readyState === 'loading') {{ 482 document.addEventListener('DOMContentLoaded', initOptimizedHover); 483 }} else {{ 484 initOptimizedHover(); 485 }} 486 }})(); 487 </script> 488 """ 489 490 # Combine and display 491 full_html = f""" 492 <div> 493 {html_str} 494 {optimized_js} 495 </div> 496 """ 497 498 display(HTML(full_html))
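One detail in the hover handler is worth calling out: the `segIdx * 3` indexing. Plotly can render many disconnected edges as a single line trace when consecutive segments are separated by `None`, so each edge occupies exactly three slots (`x0`, `x1`, `None`) in the flattened coordinate arrays. A minimal sketch of that layout, with toy coordinates and a hypothetical helper name:

```python
# Each edge contributes (x0, x1, None) to the flattened arrays, so edge i
# starts at index i * 3 -- the same indexing the hover handler relies on.
edges = [((0.0, 0.0), (1.0, 1.0)), ((2.0, 0.0), (2.0, 3.0))]  # toy edge list

edge_x, edge_y = [], []
for (x0, y0), (x1, y1) in edges:
    edge_x.extend([x0, x1, None])  # None breaks the line between segments
    edge_y.extend([y0, y1, None])

def segment_coords(seg_idx, xs, ys):
    """Recover one edge's endpoints from the flattened coordinate arrays."""
    start = seg_idx * 3
    return (xs[start], ys[start]), (xs[start + 1], ys[start + 1])

print(segment_coords(1, edge_x, edge_y))  # ((2.0, 0.0), (2.0, 3.0))
```

Precomputing these arrays once and slicing them per hover is what keeps highlighting fast even with thousands of edges: no graph traversal happens in the browser, only array lookups.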
Summary & Next Steps
We've successfully identified a densely connected subset of the ultramarathon universe using k-core decomposition. The (3, 840)-core balances data retention with the connectivity needed for comparative inference, giving us a foundation for the modeling work ahead.
Key takeaways:
- The full ultramarathon dataset is extremely sparse—most runners and courses share few direct connections
- K-core decomposition provides a principled way to extract a dense, well-connected subset
- The selected subset represents the "core" ultramarathon community: repeat participants at established events
- Visualizations reveal meaningful structure: runner clusters, hub courses, and community topology
What's next? In the following notebooks, we'll use this k-core subset to:
- Model finish times: Estimate runner abilities and course difficulties using Bayesian hierarchical models
- Analyze DNF patterns: Understand which factors predict whether a runner finishes vs. drops out
- Identify reporting artifacts: Distinguish true DNFs from data collection issues
The dense connectivity of our k-core ensures that runner and course effects can be properly identified—the foundation for all downstream inference.
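That identifiability claim can be checked directly. In an additive model where log finish time is runner ability plus course difficulty, the design matrix of a connected runner-course graph loses exactly one rank (the arbitrary split of a constant between abilities and difficulties), and each additional disconnected component costs one more. A small numpy sketch with toy indices and a hypothetical helper:

```python
import numpy as np

def design_matrix(pairs, n_runners, n_courses):
    """One row per result: indicator for the runner plus indicator for the course."""
    X = np.zeros((len(pairs), n_runners + n_courses))
    for row, (r, c) in enumerate(pairs):
        X[row, r] = 1.0
        X[row, n_runners + c] = 1.0
    return X

connected = [(0, 0), (0, 1), (1, 1), (2, 0)]  # one connected component
split = [(0, 0), (1, 0), (2, 1)]              # runner 2 shares no course with runners 0, 1

Xc = design_matrix(connected, 3, 2)
Xs = design_matrix(split, 3, 2)

# One null direction (shift a constant from abilities to difficulties) is
# unavoidable; every extra component adds another.
print(np.linalg.matrix_rank(Xc))  # 4 = (3 + 2) - 1: identifiable up to one constant
print(np.linalg.matrix_rank(Xs))  # 3 = (3 + 2) - 2: an extra unidentifiable constant
```

This is why the k-core step precedes the modeling: a dense, connected subset keeps the rank deficiency to the single benign constant that a prior or a reference level can pin down.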