Project Summary
Designed and developed a high-performance network monitoring system that tracks availability
of distributed field devices across a state-wide fiber infrastructure. The system monitors
thousands of network endpoints organized by subnet, supporting configurable polling intervals
per device type and real-time status updates to operational databases.
Served as sole developer responsible for full-stack implementation including asynchronous
network checking infrastructure, intelligent batch scheduling, database integration layer,
containerized deployment configuration, and operational monitoring tools.
Key Metrics
- 90% Cycle Time Reduction (54 min to 5-7 min)
- 168K+ Device Checks/Day
- Zero Errors in 15-Hour Validation
System Architecture
High-Level Design: A scheduler triggers per-device-type check cycles at
configurable intervals. The batch generator queries device configurations and organizes them
by subnet for efficient network scanning. A concurrent worker pool executes ICMP checks with
streaming database writes for real-time visibility.
flowchart LR
subgraph SCHEDULER["APScheduler"]
JOBS["Per-Type
Jobs"]
end
subgraph ORCHESTRATOR["Cycle Orchestrator"]
ORCH["Execution
Controller"]
TRACKER["Device Type
Tracker"]
end
subgraph BATCH["Batch Generator"]
GEN["Subnet
Batching"]
FILTER["Type
Filtering"]
end
subgraph WORKERS["Worker Pool"]
POOL["Async
Workers"]
PING["ICMP
Checker"]
end
subgraph DATABASE["Database Layer"]
API["PostgREST
API"]
STAGE["Staging
Table"]
PROD["Production
Tables"]
end
JOBS --> ORCH
ORCH --> TRACKER
TRACKER --> GEN
GEN --> FILTER
FILTER --> POOL
POOL --> PING
PING --> API
API --> STAGE
STAGE -->|"Trigger"| PROD
style JOBS fill:#667eea,color:#fff
style ORCH fill:#667eea,color:#fff
style TRACKER fill:#667eea,color:#fff
style GEN fill:#fd7e14,color:#fff
style FILTER fill:#fd7e14,color:#fff
style POOL fill:#28a745,color:#fff
style PING fill:#28a745,color:#fff
style API fill:#3182ce,color:#fff
style STAGE fill:#d4edda
style PROD fill:#d4edda
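The flow above can be read as one cycle skeleton. The sketch below is a minimal, hypothetical outline of a single check cycle; every helper is an illustrative stand-in for the real batch generator, worker pool, and database layer described in the following sections.
# Hypothetical skeleton of one check cycle (all helpers are illustrative stubs).
import asyncio

async def fetch_devices(device_type: str) -> list[str]:
    """Stand-in for the device-configuration query."""
    return ["10.0.1.10", "10.0.1.20"]

def batch_by_subnet(ips: list[str]) -> dict[str, list[str]]:
    """Stand-in for subnet-based batch generation."""
    return {"10.0.1.0/24": ips}

async def check_subnet(ips: list[str]) -> list[tuple[str, bool]]:
    """Stand-in for the concurrent ICMP worker pool."""
    return [(ip, True) for ip in ips]

async def flush_results(rows: list[tuple[str, bool]]) -> None:
    """Stand-in for the streaming write to the staging table."""
    print(f"wrote {len(rows)} rows")

async def run_cycle(device_type: str) -> None:
    devices = await fetch_devices(device_type)
    for subnet, ips in batch_by_subnet(devices).items():
        await flush_results(await check_subnet(ips))

asyncio.run(run_cycle("field_device"))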
Technologies Used
Backend
Python 3.x
asyncio
aiohttp
APScheduler
structlog
Database
PostgreSQL
PostgREST
Flyway Migrations
Stored Procedures
Database Triggers
Infrastructure
Docker
Docker Compose
AWS ECR
AWS Secrets Manager
GitHub Actions
Protocols & Tools
ICMP (Ping)
REST API
Bash Scripting
IP Aliasing
Skills Demonstrated
High-Concurrency Architecture
Designed asyncio-based worker pool supporting 1,000+ parallel subnet operations with
30 workers per subnet for efficient network scanning.
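A minimal sketch of the per-subnet worker pool, assuming a hypothetical check_device coroutine; the semaphore caps concurrency at 30 checks per subnet, matching MAX_WORKERS_PER_SUBNET.
import asyncio

MAX_WORKERS_PER_SUBNET = 30

async def check_device(ip: str) -> tuple[str, bool]:
    """Placeholder for the real ICMP check; returns (ip, reachable)."""
    await asyncio.sleep(0)
    return ip, True

async def check_subnet(ips: list[str]) -> list[tuple[str, bool]]:
    sem = asyncio.Semaphore(MAX_WORKERS_PER_SUBNET)

    async def bounded(ip: str) -> tuple[str, bool]:
        async with sem:
            return await check_device(ip)

    # Fan out every device in the subnet, bounded by the semaphore.
    return await asyncio.gather(*(bounded(ip) for ip in ips))

print(asyncio.run(check_subnet(["10.0.1.10", "10.0.1.20"])))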
Intelligent Scheduling
Implemented per-device-type scheduling with APScheduler, allowing different polling
intervals (15, 30, 60 minutes) based on device criticality.
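A minimal sketch of the per-device-type scheduling, assuming a hypothetical run_cycle coroutine and illustrative type names; the interval tiers follow the 15/30/60-minute values above.
import asyncio
from apscheduler.schedulers.asyncio import AsyncIOScheduler

POLL_INTERVALS_MIN = {"critical": 15, "standard": 30, "low_priority": 60}  # illustrative type names

async def run_cycle(device_type: str) -> None:
    """Placeholder for the cycle orchestrator entry point."""
    print(f"starting check cycle for {device_type}")

async def main() -> None:
    scheduler = AsyncIOScheduler()
    for device_type, minutes in POLL_INTERVALS_MIN.items():
        # One job per device type so each type keeps its own polling cadence.
        scheduler.add_job(run_cycle, "interval", minutes=minutes,
                          args=[device_type], id=f"cycle-{device_type}",
                          max_instances=1, coalesce=True)
    scheduler.start()
    await asyncio.Event().wait()  # keep the event loop alive for the scheduler

if __name__ == "__main__":
    asyncio.run(main())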
Subnet-Based Batching
Developed efficient batch generation algorithm organizing devices by subnet for
network-aware scanning with configurable concurrency limits.
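A minimal sketch of subnet-based batch generation, assuming devices arrive as (device_id, ip) rows and a /24 prefix; the real grouping and concurrency rules are not shown here.
import ipaddress
from collections import defaultdict

def batch_by_subnet(devices: list[tuple[str, str]],
                    prefix: int = 24) -> dict[str, list[tuple[str, str]]]:
    """Group devices by their containing subnet so each batch stays network-local."""
    batches: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for device_id, ip in devices:
        subnet = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        batches[str(subnet)].append((device_id, ip))
    return dict(batches)

# Example: two devices on the same /24 land in one batch.
print(batch_by_subnet([("dev-1", "10.0.1.10"), ("dev-2", "10.0.1.20")]))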
Streaming Database Writes
Implemented real-time status updates via PostgREST API with configurable batch
intervals, replacing end-of-cycle batch processing.
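A minimal sketch of the streaming-write pattern, assuming a hypothetical PostgREST endpoint and staging table name; results are flushed every DB_WRITE_INTERVAL_BATCHES batches instead of once at cycle end.
import aiohttp

POSTGREST_URL = "http://postgrest:3000/device_status_staging"  # hypothetical endpoint and table
DB_WRITE_INTERVAL_BATCHES = 50

async def flush_results(session: aiohttp.ClientSession, rows: list[dict]) -> None:
    """Bulk-insert accumulated check results into the staging table."""
    async with session.post(POSTGREST_URL, json=rows,
                            headers={"Prefer": "return=minimal"}) as resp:
        resp.raise_for_status()

async def stream_writes(batches: list[list[dict]]) -> None:
    buffer: list[dict] = []
    async with aiohttp.ClientSession() as session:
        for i, batch_results in enumerate(batches, start=1):
            buffer.extend(batch_results)
            if i % DB_WRITE_INTERVAL_BATCHES == 0:
                await flush_results(session, buffer)
                buffer.clear()
        if buffer:  # final partial flush at the end of the cycle
            await flush_results(session, buffer)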
Defense-in-Depth Error Handling
Multi-layer timeout protection at the ICMP and process levels (asyncio.wait_for),
combined with diagnostic logging for production troubleshooting.
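A minimal sketch of the layered timeouts, assuming Linux iputils ping: the -W flag bounds the ICMP wait while asyncio.wait_for adds a process-level guard so a hung subprocess cannot stall a worker. The 200 ms value matches the tuned timeout described under Performance Optimization.
import asyncio

ICMP_TIMEOUT_S = 0.2     # per-packet ICMP timeout (200 ms)
PROCESS_TIMEOUT_S = 2.0  # outer guard around the whole subprocess (illustrative)

async def ping_once(ip: str) -> bool:
    proc = await asyncio.create_subprocess_exec(
        "ping", "-c", "1", "-W", str(ICMP_TIMEOUT_S), ip,
        stdout=asyncio.subprocess.DEVNULL, stderr=asyncio.subprocess.DEVNULL)
    try:
        # Process-level timeout: kill and reap the subprocess if it exceeds the guard.
        return await asyncio.wait_for(proc.wait(), timeout=PROCESS_TIMEOUT_S) == 0
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        return False

print(asyncio.run(ping_once("127.0.0.1")))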
Containerized Deployment
Docker-based deployment with environment-specific configuration, AWS Secrets Manager
integration, and automated CI/CD via GitHub Actions.
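A minimal sketch of the Secrets Manager integration, assuming a hypothetical secret name that stores the database connection settings as JSON; the Docker and CI/CD wiring itself is configuration rather than code and is not shown.
import json
import boto3

def load_db_credentials(secret_name: str = "network-monitor/db") -> dict:
    """Fetch and parse a JSON secret holding database connection settings."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])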
Database Architecture
Staging table pattern with trigger-based propagation to production tables, supporting
read-write separation and audit capabilities.
Operational Tooling
Created monitoring scripts for schedule adherence tracking, diagnostic tools for
subprocess hang analysis, and network usage monitoring.
Concurrency Architecture
Three-Tier Concurrency Model: The system achieves high throughput through
layered parallelism: scheduler-level (multiple device types), batch-level (parallel subnets),
and worker-level (parallel devices within each subnet).
flowchart TB
subgraph TIER1["Tier 1: Scheduler"]
T1["Per-Type Jobs
(3 concurrent types)"]
end
subgraph TIER2["Tier 2: Batch Execution"]
T2A["Subnet A"]
T2B["Subnet B"]
T2C["Subnet ..."]
T2D["Subnet N"]
end
subgraph TIER3["Tier 3: Worker Pool"]
T3A["Worker 1-30"]
T3B["Worker 1-30"]
T3C["Worker 1-30"]
T3D["Worker 1-30"]
end
T1 --> T2A
T1 --> T2B
T1 --> T2C
T1 --> T2D
T2A --> T3A
T2B --> T3B
T2C --> T3C
T2D --> T3D
style T1 fill:#667eea,color:#fff
style T2A fill:#fd7e14,color:#fff
style T2B fill:#fd7e14,color:#fff
style T2C fill:#fd7e14,color:#fff
style T2D fill:#fd7e14,color:#fff
style T3A fill:#28a745,color:#fff
style T3B fill:#28a745,color:#fff
style T3C fill:#28a745,color:#fff
style T3D fill:#28a745,color:#fff
Concurrency Parameters
- MAX_PARALLEL_SUBNETS: Up to 1,000 subnets processed simultaneously
- MAX_WORKERS_PER_SUBNET: 30 concurrent ICMP checks per subnet
- DB_WRITE_INTERVAL_BATCHES: Streaming writes every 50 batches
- File Descriptors: System tuned to 65,535 for high-concurrency operations
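A minimal sketch of how the first two parameters above compose, with a placeholder check_subnet standing in for the per-subnet worker pool sketched earlier; the outer semaphore bounds parallel subnets while each subnet runs its own bounded workers.
import asyncio

MAX_PARALLEL_SUBNETS = 1000

async def check_subnet(ips: list[str]) -> None:
    """Placeholder for the per-subnet worker pool."""
    await asyncio.sleep(0)

async def run_all_subnets(batches: dict[str, list[str]]) -> None:
    sem = asyncio.Semaphore(MAX_PARALLEL_SUBNETS)

    async def bounded(ips: list[str]) -> None:
        async with sem:  # Tier 2: cap the number of subnets in flight
            await check_subnet(ips)  # Tier 3: bounded workers inside the subnet

    await asyncio.gather(*(bounded(ips) for ips in batches.values()))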
Performance Optimization
Optimization Journey: Systematic analysis of bottlenecks reduced cycle time
from 54-60 minutes to 5-7 minutes through targeted improvements at multiple levels.
Key Optimizations
- Timeout Reduction: ICMP timeout reduced from 2000ms to 200ms based on
network latency analysis
- Concurrency Scaling: Increased parallel subnets from 10 to 1,000,
workers per subnet from 5 to 30
- Architecture Migration: Converted from synchronous to asyncio-based
processing for non-blocking I/O
- Batch Processing: Subnet-based organization eliminates redundant
network hops within broadcast domains
- Streaming Writes: Real-time database updates replace end-of-cycle
batch commits
Validation Results
- Duration: 15+ hours continuous operation (908 minutes uptime)
- Schedule Adherence: 0-minute drift across all device types
- Device Checks: 168,867 successful checks across 130 cycles
- Error Rate: Zero errors or warnings during validation period
Development Approach
Testing Strategy: Four-stage validation pipeline from unit tests to
pre-production, including loopback network simulation for safe testing of production
IP ranges without network impact.
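A minimal sketch of the loopback-alias approach used for the scale test: each production address is attached to the loopback interface so ICMP checks run against realistic IPs without touching the live network. Assumes Linux ip addr add and root privileges; the real create_mock_network.py reads its addresses from the database configuration.
import subprocess

def add_loopback_alias(ip: str, dev: str = "lo") -> None:
    """Attach one /32 alias to the loopback interface; ignore aliases that already exist."""
    result = subprocess.run(["ip", "addr", "add", f"{ip}/32", "dev", dev],
                            capture_output=True, text=True)
    if result.returncode != 0 and "File exists" not in result.stderr:
        result.check_returncode()

if __name__ == "__main__":
    for ip in ["10.0.1.10", "10.0.1.20"]:  # illustrative addresses
        add_loopback_alias(ip)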
Validation Pipeline
- Stage 1 - Unit Tests: Mock-based orchestration logic validation
- Stage 2 - Integration: Docker container-based ICMP validation
- Stage 3 - Scale Test: Loopback network with 17,000+ IP aliases
simulating production topology
- Stage 4 - Pre-Production: Real network infrastructure with
production database connection
Operational Tools Created
- monitorscheduler.sh: Real-time schedule adherence tracking with
interval drift calculation
- diagnose_ping_hang.sh: Subprocess diagnostic tool with kernel
stack trace analysis
- scheduler_analytics.sh: Cycle completion statistics and error
aggregation
- create_mock_network.py: Automated loopback network generation
from database configuration