Project Summary
Designed and developed a high-performance network monitoring system that tracks availability
of distributed field devices across a state-wide fiber infrastructure. The system monitors
thousands of network endpoints organized by subnet, supporting configurable polling intervals
per device type and real-time status updates to operational databases.
Served as sole developer responsible for full-stack implementation including asynchronous
network checking infrastructure, intelligent batch scheduling, database integration layer,
containerized deployment configuration, and operational monitoring tools.
Key Metrics
- 90% Cycle Time Reduction (54 min to 5-7 min)
- 168K+ Device Checks/Day
- Zero Errors in 15-Hour Validation
System Architecture
High-Level Design: A scheduler triggers per-device-type check cycles at
configurable intervals. The batch generator queries device configurations and organizes them
by subnet for efficient network scanning. A concurrent worker pool executes ICMP checks with
streaming database writes for real-time visibility.
flowchart LR
subgraph SCHEDULER["APScheduler"]
JOBS["Per-Type
Jobs"]
end
subgraph ORCHESTRATOR["Cycle Orchestrator"]
ORCH["Execution
Controller"]
TRACKER["Device Type
Tracker"]
end
subgraph BATCH["Batch Generator"]
GEN["Subnet
Batching"]
FILTER["Type
Filtering"]
end
subgraph WORKERS["Worker Pool"]
POOL["Async
Workers"]
PING["ICMP
Checker"]
end
subgraph DATABASE["Database Layer"]
API["PostgREST
API"]
STAGE["Staging
Table"]
PROD["Production
Tables"]
end
JOBS --> ORCH
ORCH --> TRACKER
TRACKER --> GEN
GEN --> FILTER
FILTER --> POOL
POOL --> PING
PING --> API
API --> STAGE
STAGE -->|"Trigger"| PROD
style JOBS fill:#667eea,color:#fff
style ORCH fill:#667eea,color:#fff
style TRACKER fill:#667eea,color:#fff
style GEN fill:#fd7e14,color:#fff
style FILTER fill:#fd7e14,color:#fff
style POOL fill:#28a745,color:#fff
style PING fill:#28a745,color:#fff
style API fill:#3182ce,color:#fff
style STAGE fill:#d4edda
style PROD fill:#d4edda
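The flow above can be read as one cycle skeleton. The sketch below is a minimal, hypothetical outline of a single check cycle; every helper is an illustrative stand-in for the real batch generator, worker pool, and database layer described in the following sections.
# Hypothetical skeleton of one check cycle (all helpers are illustrative stubs).
import asyncio

async def fetch_devices(device_type: str) -> list[str]:
    """Stand-in for the device-configuration query."""
    return ["10.0.1.10", "10.0.1.20"]

def batch_by_subnet(ips: list[str]) -> dict[str, list[str]]:
    """Stand-in for subnet-based batch generation."""
    return {"10.0.1.0/24": ips}

async def check_subnet(ips: list[str]) -> list[tuple[str, bool]]:
    """Stand-in for the concurrent ICMP worker pool."""
    return [(ip, True) for ip in ips]

async def flush_results(rows: list[tuple[str, bool]]) -> None:
    """Stand-in for the streaming write to the staging table."""
    print(f"wrote {len(rows)} rows")

async def run_cycle(device_type: str) -> None:
    devices = await fetch_devices(device_type)
    for subnet, ips in batch_by_subnet(devices).items():
        await flush_results(await check_subnet(ips))

asyncio.run(run_cycle("field_device"))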
Technologies Used
Backend
Python 3.x
asyncio
aiohttp
APScheduler
structlog
Database
PostgreSQL
PostgREST
Flyway Migrations
Stored Procedures
Database Triggers
Infrastructure
Docker
Docker Compose
AWS ECR
AWS Secrets Manager
GitHub Actions
Protocols & Tools
ICMP (Ping)
REST API
Bash Scripting
IP Aliasing
Skills Demonstrated
High-Concurrency Architecture
Designed asyncio-based worker pool supporting 1,000+ parallel subnet operations with
30 workers per subnet for efficient network scanning.
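A minimal sketch of the per-subnet worker pool, assuming a hypothetical check_device coroutine; the semaphore caps concurrency at 30 checks per subnet, matching MAX_WORKERS_PER_SUBNET.
import asyncio

MAX_WORKERS_PER_SUBNET = 30

async def check_device(ip: str) -> tuple[str, bool]:
    """Placeholder for the real ICMP check; returns (ip, reachable)."""
    await asyncio.sleep(0)
    return ip, True

async def check_subnet(ips: list[str]) -> list[tuple[str, bool]]:
    sem = asyncio.Semaphore(MAX_WORKERS_PER_SUBNET)

    async def bounded(ip: str) -> tuple[str, bool]:
        async with sem:
            return await check_device(ip)

    # Fan out every device in the subnet, bounded by the semaphore.
    return await asyncio.gather(*(bounded(ip) for ip in ips))

print(asyncio.run(check_subnet(["10.0.1.10", "10.0.1.20"])))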
Intelligent Scheduling
Implemented per-device-type scheduling with APScheduler, allowing different polling
intervals (15, 30, 60 minutes) based on device criticality.
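A minimal sketch of the per-device-type scheduling, assuming a hypothetical run_cycle coroutine and illustrative type names; the interval tiers follow the 15/30/60-minute values above.
import asyncio
from apscheduler.schedulers.asyncio import AsyncIOScheduler

POLL_INTERVALS_MIN = {"critical": 15, "standard": 30, "low_priority": 60}  # illustrative type names

async def run_cycle(device_type: str) -> None:
    """Placeholder for the cycle orchestrator entry point."""
    print(f"starting check cycle for {device_type}")

async def main() -> None:
    scheduler = AsyncIOScheduler()
    for device_type, minutes in POLL_INTERVALS_MIN.items():
        # One job per device type so each type keeps its own polling cadence.
        scheduler.add_job(run_cycle, "interval", minutes=minutes,
                          args=[device_type], id=f"cycle-{device_type}",
                          max_instances=1, coalesce=True)
    scheduler.start()
    await asyncio.Event().wait()  # keep the event loop alive for the scheduler

if __name__ == "__main__":
    asyncio.run(main())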
Subnet-Based Batching
Developed efficient batch generation algorithm organizing devices by subnet for
network-aware scanning with configurable concurrency limits.
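A minimal sketch of subnet-based batch generation, assuming devices arrive as (device_id, ip) rows and a /24 prefix; the real grouping and concurrency rules are not shown here.
import ipaddress
from collections import defaultdict

def batch_by_subnet(devices: list[tuple[str, str]],
                    prefix: int = 24) -> dict[str, list[tuple[str, str]]]:
    """Group devices by their containing subnet so each batch stays network-local."""
    batches: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for device_id, ip in devices:
        subnet = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        batches[str(subnet)].append((device_id, ip))
    return dict(batches)

# Example: two devices on the same /24 land in one batch.
print(batch_by_subnet([("dev-1", "10.0.1.10"), ("dev-2", "10.0.1.20")]))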
Streaming Database Writes
Implemented real-time status updates via PostgREST API with configurable batch
intervals, replacing end-of-cycle batch processing.
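A minimal sketch of the streaming-write pattern, assuming a hypothetical PostgREST endpoint and staging table name; results are flushed every DB_WRITE_INTERVAL_BATCHES batches instead of once at cycle end.
import aiohttp

POSTGREST_URL = "http://postgrest:3000/device_status_staging"  # hypothetical endpoint and table
DB_WRITE_INTERVAL_BATCHES = 50

async def flush_results(session: aiohttp.ClientSession, rows: list[dict]) -> None:
    """Bulk-insert accumulated check results into the staging table."""
    async with session.post(POSTGREST_URL, json=rows,
                            headers={"Prefer": "return=minimal"}) as resp:
        resp.raise_for_status()

async def stream_writes(batches: list[list[dict]]) -> None:
    buffer: list[dict] = []
    async with aiohttp.ClientSession() as session:
        for i, batch_results in enumerate(batches, start=1):
            buffer.extend(batch_results)
            if i % DB_WRITE_INTERVAL_BATCHES == 0:
                await flush_results(session, buffer)
                buffer.clear()
        if buffer:  # final partial flush at the end of the cycle
            await flush_results(session, buffer)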
Defense-in-Depth Error Handling
Multi-layer timeout protection at the ICMP and process levels (asyncio.wait_for),
combined with diagnostic logging for production troubleshooting.
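A minimal sketch of the layered timeouts, assuming Linux iputils ping: the -W flag bounds the ICMP wait while asyncio.wait_for adds a process-level guard so a hung subprocess cannot stall a worker. The 200 ms value matches the tuned timeout described under Performance Optimization.
import asyncio

ICMP_TIMEOUT_S = 0.2     # per-packet ICMP timeout (200 ms)
PROCESS_TIMEOUT_S = 2.0  # outer guard around the whole subprocess (illustrative)

async def ping_once(ip: str) -> bool:
    proc = await asyncio.create_subprocess_exec(
        "ping", "-c", "1", "-W", str(ICMP_TIMEOUT_S), ip,
        stdout=asyncio.subprocess.DEVNULL, stderr=asyncio.subprocess.DEVNULL)
    try:
        # Process-level timeout: kill and reap the subprocess if it exceeds the guard.
        return await asyncio.wait_for(proc.wait(), timeout=PROCESS_TIMEOUT_S) == 0
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        return False

print(asyncio.run(ping_once("127.0.0.1")))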
Containerized Deployment
Docker-based deployment with environment-specific configuration, AWS Secrets Manager
integration, and automated CI/CD via GitHub Actions.
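A minimal sketch of the Secrets Manager integration, assuming a hypothetical secret name that stores the database connection settings as JSON; the Docker and CI/CD wiring itself is configuration rather than code and is not shown.
import json
import boto3

def load_db_credentials(secret_name: str = "network-monitor/db") -> dict:
    """Fetch and parse a JSON secret holding database connection settings."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])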
Database Architecture
Staging table pattern with trigger-based propagation to production tables, supporting
read-write separation and audit capabilities.
Operational Tooling
Created monitoring scripts for schedule adherence tracking, diagnostic tools for
subprocess hang analysis, and network usage monitoring.
Concurrency Architecture
Three-Tier Concurrency Model: The system achieves high throughput through
layered parallelism: scheduler-level (multiple device types), batch-level (parallel subnets),
and worker-level (parallel devices within each subnet).
flowchart TB
subgraph TIER1["Tier 1: Scheduler"]
T1["Per-Type Jobs
(3 concurrent types)"]
end
subgraph TIER2["Tier 2: Batch Execution"]
T2A["Subnet A"]
T2B["Subnet B"]
T2C["Subnet ..."]
T2D["Subnet N"]
end
subgraph TIER3["Tier 3: Worker Pool"]
T3A["Worker 1-30"]
T3B["Worker 1-30"]
T3C["Worker 1-30"]
T3D["Worker 1-30"]
end
T1 --> T2A
T1 --> T2B
T1 --> T2C
T1 --> T2D
T2A --> T3A
T2B --> T3B
T2C --> T3C
T2D --> T3D
style T1 fill:#667eea,color:#fff
style T2A fill:#fd7e14,color:#fff
style T2B fill:#fd7e14,color:#fff
style T2C fill:#fd7e14,color:#fff
style T2D fill:#fd7e14,color:#fff
style T3A fill:#28a745,color:#fff
style T3B fill:#28a745,color:#fff
style T3C fill:#28a745,color:#fff
style T3D fill:#28a745,color:#fff
Concurrency Parameters
- MAX_PARALLEL_SUBNETS: Up to 1,000 subnets processed simultaneously
- MAX_WORKERS_PER_SUBNET: 30 concurrent ICMP checks per subnet
- DB_WRITE_INTERVAL_BATCHES: Streaming writes every 50 batches
- File Descriptors: System tuned to 65,535 for high-concurrency operations
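A minimal sketch of how the first two parameters above compose, with a placeholder check_subnet standing in for the per-subnet worker pool sketched earlier; the outer semaphore bounds parallel subnets while each subnet runs its own bounded workers.
import asyncio

MAX_PARALLEL_SUBNETS = 1000

async def check_subnet(ips: list[str]) -> None:
    """Placeholder for the per-subnet worker pool."""
    await asyncio.sleep(0)

async def run_all_subnets(batches: dict[str, list[str]]) -> None:
    sem = asyncio.Semaphore(MAX_PARALLEL_SUBNETS)

    async def bounded(ips: list[str]) -> None:
        async with sem:  # Tier 2: cap the number of subnets in flight
            await check_subnet(ips)  # Tier 3: bounded workers inside the subnet

    await asyncio.gather(*(bounded(ips) for ips in batches.values()))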
Performance Optimization
Optimization Journey: Systematic analysis of bottlenecks reduced cycle time
from 54-60 minutes to 5-7 minutes through targeted improvements at multiple levels.
Key Optimizations
- Timeout Reduction: ICMP timeout reduced from 2000ms to 200ms based on
network latency analysis
- Concurrency Scaling: Increased parallel subnets from 10 to 1,000,
workers per subnet from 5 to 30
- Architecture Migration: Converted from synchronous to asyncio-based
processing for non-blocking I/O
- Batch Processing: Subnet-based organization eliminates redundant
network hops within broadcast domains
- Streaming Writes: Real-time database updates replace end-of-cycle
batch commits
Validation Results
- Duration: 15+ hours continuous operation (908 minutes uptime)
- Schedule Adherence: 0-minute drift across all device types
- Device Checks: 168,867 successful checks across 130 cycles
- Error Rate: Zero errors or warnings during validation period
Development Approach
Testing Strategy: Four-stage validation pipeline from unit tests to
pre-production, including loopback network simulation for safe testing of production
IP ranges without network impact.
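A minimal sketch of the loopback-alias approach used for the scale test: each production address is attached to the loopback interface so ICMP checks run against realistic IPs without touching the live network. Assumes Linux ip addr add and root privileges; the real create_mock_network.py reads its addresses from the database configuration.
import subprocess

def add_loopback_alias(ip: str, dev: str = "lo") -> None:
    """Attach one /32 alias to the loopback interface; ignore aliases that already exist."""
    result = subprocess.run(["ip", "addr", "add", f"{ip}/32", "dev", dev],
                            capture_output=True, text=True)
    if result.returncode != 0 and "File exists" not in result.stderr:
        result.check_returncode()

if __name__ == "__main__":
    for ip in ["10.0.1.10", "10.0.1.20"]:  # illustrative addresses
        add_loopback_alias(ip)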
Validation Pipeline
- Stage 1 - Unit Tests: Mock-based orchestration logic validation
- Stage 2 - Integration: Docker container-based ICMP validation
- Stage 3 - Scale Test: Loopback network with 17,000+ IP aliases
simulating production topology
- Stage 4 - Pre-Production: Real network infrastructure with
production database connection
Operational Tools Created
- monitorscheduler.sh: Real-time schedule adherence tracking with
interval drift calculation
- diagnose_ping_hang.sh: Subprocess diagnostic tool with kernel
stack trace analysis
- scheduler_analytics.sh: Cycle completion statistics and error
aggregation
- create_mock_network.py: Automated loopback network generation
from database configuration