Rules

blackbox-alerts

16.623s ago

4.633ms

Rule State Error Last Evaluation Evaluation Time
alert: EndpointDown expr: probe_success == 0 for: 1m labels: category: availability severity: critical team: apps annotations: action: 1) Check the application status | 2) Check the load balancer | 3) Verify the SSL certificate | 4) Review the application logs | 5) Verify network connectivity description: The endpoint {{ $labels.instance }} is not responding. The monitoring probe is not receiving an HTTP 2xx status code. runbook_url: https://wiki.celuwebcloud.com/runbooks/endpoint-down summary: "\U0001F310 SERVICE DOWN: {{ $labels.instance }}" ok 16.623s ago 3.933ms
alert: HighLatency expr: probe_duration_seconds > 5 for: 3m labels: category: performance severity: warning team: apps annotations: action: Check server load, the database, and slow queries. description: The response time is {{ $value | humanizeDuration }}. summary: High latency on {{ $labels.instance }} ok 16.619s ago 250.7us
alert: SSLCertificateExpiringSoon expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30 for: 1m labels: category: security severity: warning team: infra annotations: action: '1) Renew the certificate with certbot: certbot renew | 2) Or contact the SSL provider | 3) Verify that auto-renewal is configured | 4) Restart the web service after renewing' days_remaining: '{{ $value }}' description: The SSL certificate for {{ $labels.instance }} expires in {{ $value | humanize }} days. Users will see a security warning. summary: "\U0001F512 SSL certificate expires in {{ $value | humanize }} days" ok 16.619s ago 233.2us
alert: SSLCertificateExpired expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 0 labels: category: security severity: critical team: infra annotations: action: 'URGENT: 1) Renew the certificate: certbot renew --force-renewal | 2) Restart nginx/apache | 3) Verify with: curl -v https://{{ $labels.instance }} | 4) If a CDN is in use, invalidate its cache' description: The SSL certificate for {{ $labels.instance }} has EXPIRED. Users are seeing security errors NOW. summary: "\U0001F534 SSL CERTIFICATE EXPIRED - {{ $labels.instance }}" ok 16.619s ago 190.5us
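The two certificate rules share one piece of arithmetic: the expiry timestamp minus the current time, divided by 86400, gives the remaining days. The following is a minimal Python sketch of that calculation and of the two thresholds (30 and 0 days). The function names `days_until_expiry` and `classify` are hypothetical and exist only for illustration; they are not part of Prometheus or the blackbox exporter.

```python
SECONDS_PER_DAY = 86400


def days_until_expiry(expiry_ts: float, now: float) -> float:
    """Mirror (probe_ssl_earliest_cert_expiry - time()) / 86400."""
    return (expiry_ts - now) / SECONDS_PER_DAY


def classify(days: float) -> str:
    """Return the most severe condition that the remaining days satisfy."""
    if days < 0:
        return "expired"        # SSLCertificateExpired threshold
    if days < 30:
        return "expiring_soon"  # SSLCertificateExpiringSoon threshold
    return "ok"


if __name__ == "__main__":
    now = 1_700_000_000.0  # arbitrary example timestamp
    for offset_days in (45, 10, -2):
        days = days_until_expiry(now + offset_days * SECONDS_PER_DAY, now)
        print(offset_days, classify(days))
```

Because a negative value is also below 30, an expired certificate satisfies both expressions; the sketch reports only the more severe condition.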

hardware-alerts

40.912s ago

760.5us

Rule State Error Last Evaluation Evaluation Time
alert: HighTemperature expr: node_hwmon_temp_celsius > 80 for: 5m labels: category: hardware severity: critical team: infra annotations: action: 'URGENT: Check the fans and clean dust from the server. Shut down if it exceeds 85°C.' description: The temperature of sensor {{ $labels.sensor }} is {{ $value }}°C summary: High temperature on {{ $labels.instance }} ok 40.912s ago 385.1us
alert: MediumTemperature expr: node_hwmon_temp_celsius > 70 for: 10m labels: category: hardware severity: warning team: infra annotations: action: Check the ventilation of the server/data center. description: The temperature is {{ $value }}°C. Monitor closely. summary: Elevated temperature on {{ $labels.instance }} ok 40.912s ago 196.8us
alert: RAIDDegraded expr: node_md_state{state="degraded"} == 1 labels: category: hardware severity: critical team: infra annotations: action: 'URGENT: Replace the failed disk immediately. Check: cat /proc/mdstat' description: The RAID array {{ $labels.md_device }} is in a degraded state. summary: RAID degraded on {{ $labels.instance }} ok 40.912s ago 135us

info-alerts

2m37.399s ago

749.4us

Rule State Error Last Evaluation Evaluation Time
alert: NodeRebooted expr: (time() - node_boot_time_seconds) < 300 labels: category: maintenance severity: info team: infra annotations: action: Verify whether the reboot was planned or unexpected. description: The system rebooted {{ $value | humanizeDuration }} ago. summary: Reboot detected on {{ $labels.instance }} ok 2m37.4s ago 490.6us
alert: RebootRequired expr: node_reboot_required > 0 labels: category: maintenance severity: info team: infra annotations: action: Schedule a maintenance window to reboot and apply security updates. description: Pending kernel updates require a reboot. summary: Reboot required on {{ $labels.instance }} ok 2m37.399s ago 238.6us

linux-critical

18.144s ago

10.25ms

Rule State Error Last Evaluation Evaluation Time
alert: HostDown expr: up{job=~"linux-servers|windows-servers"} == 0 for: 2m labels: category: availability severity: critical team: infra annotations: action: 1) Verify that the server is powered on | 2) Check network connectivity (ping) | 3) Check the node_exporter/windows_exporter service | 4) Review the firewall description: The server {{ $labels.instance }} has not responded to monitoring scrapes for the last 2 minutes. runbook_url: https://wiki.celuwebcloud.com/runbooks/host-down summary: "\U0001F6A8 SERVER DOWN: {{ $labels.instance }}" ok 18.144s ago 8.932ms
alert: HighCPUUsage expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90 for: 5m labels: category: performance severity: critical team: infra annotations: action: '1) Connect via SSH and run: top or htop | 2) Identify high-consumption processes | 3) If it is a specific service: sudo systemctl restart <service> | 4) Consider scaling resources if the load is legitimate' cpu_percent: '{{ $value }}' description: 'CPU usage has been above 90% for more than 5 minutes. Current value: {{ $value | printf "%.1f" }}%' runbook_url: https://wiki.celuwebcloud.com/runbooks/high-cpu summary: "\U0001F525 CRITICAL CPU on {{ $labels.instance }}" ok 18.135s ago 534.7us
alert: MemoryCritical expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 95 for: 3m labels: category: performance severity: critical team: infra annotations: action: '1) Identify processes: ps aux --sort=-%mem | head -10 | 2) Drop caches: sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches | 3) Consider restarting high-consumption services | 4) Check applications for memory leaks' description: 'Memory usage is above 95%. Current value: {{ $value | printf "%.1f" }}%. Risk of swapping or of the OOM killer terminating processes.' memory_percent: '{{ $value }}' runbook_url: https://wiki.celuwebcloud.com/runbooks/memory-critical summary: ⚠️ CRITICAL MEMORY on {{ $labels.instance }} ok 18.135s ago 375us
alert: DiskFull expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 95 for: 2m labels: category: storage os: linux severity: critical team: infra annotations: action: '1) Clean logs: sudo find /var/log -type f -name "*.log" -mtime +7 -delete | 2) Clean packages: sudo apt autoremove && sudo apt autoclean | 3) Check docker: docker system prune -a' description: The root disk is {{ $value | printf "%.1f" }}% full. Free space is critically low. runbook_url: https://wiki.celuwebcloud.com/runbooks/disk-full summary: Root disk almost full on {{ $labels.instance }} usage_percent: '{{ $value }}' ok 18.134s ago 376.1us

linux-warning

5.633s ago

4.402ms

Rule State Error Last Evaluation Evaluation Time
alert: HighCPUUsageWarning expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75 for: 10m labels: category: performance severity: warning team: infra annotations: action: Monitor the trend. Determine whether this is normal load or an anomaly. description: 'CPU usage has been above 75% for more than 10 minutes. Value: {{ $value | printf "%.1f" }}%' summary: Elevated CPU usage on {{ $labels.instance }} ok 5.633s ago 774.5us
alert: MemoryHigh expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85 for: 5m labels: category: performance severity: warning team: infra annotations: action: Review processes with high memory consumption. Check applications for memory leaks. description: 'Memory usage is above 85%. Value: {{ $value | printf "%.1f" }}%' summary: Elevated memory usage on {{ $labels.instance }} ok 5.632s ago 512.4us
alert: DiskSpaceLow expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 80 for: 5m labels: category: storage severity: warning team: infra annotations: action: Plan a disk cleanup. Review logs, temporary files, and old backups. description: The root disk is {{ $value | printf "%.1f" }}% full. summary: Low disk space on {{ $labels.instance }} ok 5.632s ago 1.037ms
alert: DiskWillFillIn4Hours expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4 * 3600) < 0 for: 10m labels: category: storage severity: warning team: infra annotations: action: Urgent preventive action. Investigate what is consuming space so quickly. description: Based on the current trend, the disk will fill up within the next 4 hours. summary: Disk will fill up soon on {{ $labels.instance }} ok 5.631s ago 279.8us
alert: HighSwapUsage expr: (node_memory_SwapTotal_bytes > 0) and (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 80 for: 5m labels: category: performance severity: warning team: infra annotations: action: The system is swapping heavily. Consider adding RAM or investigating memory leaks. description: Swap usage is above 80%. This indicates memory pressure. summary: Elevated swap usage on {{ $labels.instance }} ok 5.631s ago 974.1us
alert: HighLoadAverage expr: node_load1 > on(instance) (count by(instance) (node_cpu_seconds_total{mode="idle"}) * 2) for: 10m labels: category: performance severity: warning team: infra annotations: action: Many processes are queued. Check disk I/O or blocked processes. description: The 1m load average is {{ $value }}, more than twice the number of cores. summary: High system load on {{ $labels.instance }} ok 5.63s ago 478.2us
alert: SlowDiskIO expr: rate(node_disk_io_time_seconds_total[5m]) > 0.8 for: 5m labels: category: storage severity: warning team: infra annotations: action: Check which processes are doing intensive I/O with 'iotop'. Consider SSD or RAID. description: Disk I/O time is above 80%. Possible bottleneck. summary: Slow disk I/O on {{ $labels.instance }} ok 5.63s ago 263.8us
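The DiskWillFillIn4Hours rule relies on `predict_linear`, which fits a least-squares line through the samples in the range and extrapolates it a given number of seconds past the evaluation time. The Python sketch below illustrates that idea under simplifying assumptions; it is not Prometheus's implementation, and the function name and sample layout are hypothetical.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, value)


def predict_linear(samples: List[Sample], eval_time: float,
                   horizon: float) -> Optional[float]:
    """Least-squares extrapolation `horizon` seconds past `eval_time`."""
    n = len(samples)
    if n < 2:
        return None  # a line needs at least two points
    # Express time relative to the evaluation instant, so the intercept
    # is the fitted value "now".
    xs = [t - eval_time for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return None  # all samples share one timestamp
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept + slope * horizon


if __name__ == "__main__":
    # Free bytes dropping by 1 GB per hour, observed over the last hour.
    history = [(0.0, 3e9), (1800.0, 2.5e9), (3600.0, 2e9)]
    predicted = predict_linear(history, eval_time=3600.0, horizon=4 * 3600)
    print(predicted, predicted < 0)
```

A negative prediction means the fitted line crosses zero free bytes before the horizon, which is the condition the alert expression checks.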

network-alerts

16.679s ago

2.897ms

Rule State Error Last Evaluation Evaluation Time
alert: TooManyNetworkConnections expr: node_netstat_Tcp_CurrEstab > 10000 for: 5m labels: category: network severity: warning team: infra annotations: action: Check connections with 'ss -s' or 'netstat -an'. Look for unusual patterns. description: There are {{ $value }} established TCP connections. This may indicate an attack or a connection leak. summary: Many TCP connections on {{ $labels.instance }} ok 16.679s ago 428.4us
alert: NetworkErrors expr: rate(node_network_receive_errs_total[5m]) > 10 or rate(node_network_transmit_errs_total[5m]) > 10 for: 5m labels: category: network severity: warning team: infra annotations: action: Check cables, network interfaces, and the switch. Use 'ethtool' for diagnosis. description: Network transmit/receive errors detected. summary: Network errors on {{ $labels.instance }} ok 16.679s ago 1.203ms
alert: HighNetworkTrafficRX expr: rate(node_network_receive_bytes_total[5m]) > 1e+08 for: 10m labels: category: network severity: info team: infra annotations: action: Verify whether this is legitimate traffic or a possible DDoS attack. description: Inbound network traffic is above 100 MB/s. summary: High inbound traffic on {{ $labels.instance }} ok 16.678s ago 697.4us
alert: HighNetworkTrafficTX expr: rate(node_network_transmit_bytes_total[5m]) > 1e+08 for: 10m labels: category: network severity: info team: infra annotations: action: Verify whether this is legitimate traffic or possible data exfiltration. description: Outbound network traffic is above 100 MB/s. summary: High outbound traffic on {{ $labels.instance }} ok 16.677s ago 540us

process-alerts

33.617s ago

761.9us

Rule State Error Last Evaluation Evaluation Time
alert: ZombieProcesses expr: node_processes_state{state="Z"} > 0 for: 5m labels: category: processes severity: warning team: infra annotations: action: 'Identify the parent process that has not reaped its children: ps aux | grep "Z"' description: There are {{ $value }} zombie processes on the system. summary: Zombie processes on {{ $labels.instance }} ok 33.617s ago 365.1us
alert: TooManyProcesses expr: node_processes_max_processes - node_processes_pids < 100 for: 5m labels: category: processes severity: warning team: infra annotations: action: Check for runaway process creation (fork bombs, applications leaking processes). description: Fewer than 100 processes remain before the system limit is reached. summary: Process limit nearly reached on {{ $labels.instance }} ok 33.617s ago 379.9us

windows-alerts

56.393s ago

10.28ms

Rule State Error Last Evaluation Evaluation Time
alert: WindowsServerDown expr: up{job="windows-servers"} == 0 for: 2m labels: category: availability severity: critical team: infra annotations: action: Check the windows_exporter service, the firewall, and the server status. description: The Windows server is not responding to exporter scrapes. summary: Windows server {{ $labels.instance }} is down ok 56.393s ago 1.668ms
alert: WindowsHighCPU expr: 100 - (avg by(instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) * 100) > 85 for: 10m labels: category: performance severity: warning team: infra annotations: action: Check processes in Task Manager or with Get-Process in PowerShell. description: 'CPU usage: {{ $value | printf "%.1f" }}%' summary: Elevated CPU on Windows {{ $labels.instance }} ok 56.391s ago 1.051ms
alert: WindowsHighMemory expr: (windows_cs_physical_memory_bytes - windows_os_physical_memory_free_bytes) / windows_cs_physical_memory_bytes * 100 > 90 for: 5m labels: category: performance severity: critical team: infra annotations: action: Restart high-consumption services. Consider adding RAM. description: 'Memory usage: {{ $value | printf "%.1f" }}%' summary: Critical memory on Windows {{ $labels.instance }} ok 56.39s ago 2.884ms
alert: WindowsDiskFull expr: (windows_logical_disk_size_bytes - windows_logical_disk_free_bytes) / windows_logical_disk_size_bytes * 100 > 90 for: 5m labels: category: storage os: windows severity: critical team: infra annotations: action: '1) Run cleanmgr as administrator | 2) Empty the Recycle Bin | 3) Clean IIS logs: C:/inetpub/logs | 4) Clean Event Viewer logs | 5) Uninstall unused programs' description: 'Volume {{ $labels.volume }} on {{ $labels.instance }} is at {{ $value | printf "%.1f" }}% usage (threshold: 90%). Immediate action is required.' runbook_url: https://wiki.celuwebcloud.com/runbooks/disk-full summary: Disk full on {{ $labels.instance }} - Volume {{ $labels.volume }} usage_percent: '{{ $value }}' ok 56.388s ago 3.911ms
alert: WindowsServiceStopped expr: windows_service_state{name=~"MSSQLSERVER|W3SVC|ADWS|NTDS|DNS",state="running"} == 0 for: 1m labels: category: services severity: critical team: infra annotations: action: 'Start the service immediately: net start {{ $labels.name }}' description: The service {{ $labels.name }} is stopped. summary: Critical service stopped on {{ $labels.instance }} ok 56.384s ago 242.8us
alert: WindowsUnexpectedReboot expr: (time() - windows_system_system_up_time) < 300 and (time() - windows_system_system_up_time) > 0 labels: category: availability severity: warning team: infra annotations: action: Check the system logs to determine the cause of the reboot. Event Viewer > System. description: The Windows server rebooted less than 5 minutes ago. summary: Recent reboot detected on {{ $labels.instance }} ok 56.384s ago 486us

aggregated-team-metrics

3m7.647s ago

2.487ms

Rule State Error Last Evaluation Evaluation Time
record: team:linux_servers_up:total expr: count(up{job="linux-servers"} == 1) ok 3m7.648s ago 776.5us
record: team:linux_servers_down:total expr: count(up{job="linux-servers"} == 0) ok 3m7.647s ago 353.1us
record: team:windows_servers_up:total expr: count(up{job="windows-servers"} == 1) ok 3m7.647s ago 326us
record: team:windows_servers_down:total expr: count(up{job="windows-servers"} == 0) ok 3m7.646s ago 258us
record: team:infrastructure_availability:ratio expr: (count(up{job=~"linux-servers|windows-servers"} == 1) / count(up{job=~"linux-servers|windows-servers"})) ok 3m7.646s ago 742.9us
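The rules in this group count `up` samples by value and divide the two counts. One detail worth keeping in mind is that PromQL's `count()` over an empty vector returns no sample rather than 0, so a "down" count has no value while every server is up. The Python sketch below models that behaviour with `None`; the helper names are hypothetical and the dictionary stands in for the `up` series of the selected jobs.

```python
from typing import Callable, Dict, Optional


def count_where(up: Dict[str, int],
                predicate: Callable[[int], bool]) -> Optional[int]:
    """Count matching samples; None models PromQL's empty result."""
    matches = [v for v in up.values() if predicate(v)]
    return len(matches) if matches else None


def availability_ratio(up: Dict[str, int]) -> Optional[float]:
    """Mirror count(up == 1) / count(up) for the selected jobs."""
    up_count = count_where(up, lambda v: v == 1)
    total = count_where(up, lambda v: True)
    if up_count is None or total is None:
        return None  # a missing operand yields no result in PromQL
    return up_count / total


if __name__ == "__main__":
    fleet = {"srv-a": 1, "srv-b": 1, "srv-c": 0}  # example instances
    print(count_where(fleet, lambda v: v == 0))   # servers down
    print(availability_ratio(fleet))
```

If a dashboard needs an explicit zero for the empty case, a common PromQL idiom is to append `or vector(0)` to the count expression.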

availability-metrics

3m40.938s ago

916.2us

Rule State Error Last Evaluation Evaluation Time
record: instance:node_uptime:days expr: (time() - node_boot_time_seconds) / 86400 ok 3m40.938s ago 622.1us
record: instance:node_time_since_boot:hours expr: (time() - node_boot_time_seconds) / 3600 ok 3m40.938s ago 276.5us

blackbox-metrics

17.474s ago

1.221ms

Rule State Error Last Evaluation Evaluation Time
record: probe:latency:avg5m expr: avg_over_time(probe_duration_seconds[5m]) ok 17.474s ago 543.4us
record: probe:success_rate:ratio5m expr: avg_over_time(probe_success[5m]) ok 17.474s ago 299.4us
record: probe:ssl_expiry:days expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 ok 17.473s ago 351.3us

capacity-planning

2m13.364s ago

1.809ms

Rule State Error Last Evaluation Evaluation Time
record: instance_mount:node_disk_fill_prediction_24h:bytes expr: predict_linear(node_filesystem_free_bytes[1h], 24 * 3600) ok 2m13.364s ago 982.3us
record: instance_mount:node_disk_growth_rate:bytes_per_hour expr: (deriv(node_filesystem_free_bytes[1h]) * -3600) ok 2m13.363s ago 182.8us
record: instance:node_cpu_trend:avg1h expr: avg_over_time(instance:node_cpu_utilisation:rate5m[1h]) ok 2m13.363s ago 294.8us
record: instance:node_memory_trend:avg1h expr: avg_over_time(instance:node_memory_utilisation:ratio[1h]) ok 2m13.363s ago 323.6us

cpu-metrics

4.974s ago

2.555ms

Rule State Error Last Evaluation Evaluation Time
record: instance:node_cpu_utilisation:rate5m expr: 1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) ok 4.974s ago 713.8us
record: instance:node_cpu_usage_by_mode:rate5m expr: rate(node_cpu_seconds_total{mode=~"user|system|iowait"}[5m]) ok 4.974s ago 1.264ms
record: instance:node_load_normalised:avg5m expr: node_load1 / on(instance) count by(instance) (node_cpu_seconds_total{mode="idle"}) ok 4.972s ago 552.6us
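The `instance:node_cpu_utilisation:rate5m` rule turns the per-core idle counter into a utilisation ratio: take the per-second rate of idle time for each core, average across cores, and subtract from 1. The Python sketch below walks through that arithmetic with two samples per core. It is a simplification: Prometheus's `rate()` also handles counter resets and extrapolates to the window boundaries, which this sketch does not, and the function names are hypothetical.

```python
from typing import Dict, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, idle seconds)


def counter_rate(first: Sample, last: Sample) -> float:
    """Per-second increase between two counter samples (no reset handling)."""
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)


def cpu_utilisation(idle: Dict[str, Tuple[Sample, Sample]]) -> float:
    """Mirror 1 - avg(rate(idle counter)) across the cores of one instance."""
    rates = [counter_rate(first, last) for first, last in idle.values()]
    return 1 - sum(rates) / len(rates)


if __name__ == "__main__":
    # Two cores observed over a 300 s window.
    window = {
        "cpu0": ((0.0, 1000.0), (300.0, 1270.0)),  # idle 90% of the time
        "cpu1": ((0.0, 2000.0), (300.0, 2150.0)),  # idle 50% of the time
    }
    print(round(cpu_utilisation(window), 3))
```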

disk-metrics

20.882s ago

2.083ms

Rule State Error Last Evaluation Evaluation Time
record: instance_mount:node_disk_utilisation:ratio expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes ok 20.882s ago 965.9us
record: instance_device:node_disk_io_rate:bytes5m expr: rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m]) ok 20.881s ago 569us
record: instance_device:node_disk_io_utilisation:rate5m expr: rate(node_disk_io_time_seconds_total[5m]) ok 20.881s ago 251.1us
record: instance_device:node_disk_io_latency:avg5m expr: rate(node_disk_io_time_weighted_seconds_total[5m]) / (rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])) ok 20.881s ago 270.6us

memory-metrics

8.194s ago

1.645ms

Rule State Error Last Evaluation Evaluation Time
record: instance:node_memory_utilisation:ratio expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes ok 8.194s ago 706.8us
record: instance:node_memory_used_bytes:calc expr: node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes ok 8.194s ago 397.8us
record: instance:node_swap_utilisation:ratio expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes ok 8.193s ago 517.6us
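The recording rules in this group produce ratios between 0 and 1. That matters when a value is rendered in an alert annotation: Prometheus's `humanizePercentage` template function expects a ratio and multiplies it by 100 itself, so a value that was already scaled to 0..100 comes out a hundred times too large. The Python sketch below imitates the formatting to show the difference. The `%.4g` format reflects my understanding of the upstream implementation and should be treated as an assumption.

```python
def humanize_percentage(value: float) -> str:
    """Approximate Prometheus's humanizePercentage: ratio in, percent out."""
    return f"{value * 100:.4g}%"


def memory_utilisation_ratio(total_bytes: float,
                             available_bytes: float) -> float:
    """Mirror (MemTotal - MemAvailable) / MemTotal."""
    return (total_bytes - available_bytes) / total_bytes


if __name__ == "__main__":
    ratio = memory_utilisation_ratio(16e9, 0.8e9)
    print(humanize_percentage(ratio))        # ratio input, as intended
    print(humanize_percentage(ratio * 100))  # pre-scaled input, misleading
```

For values that are already percentages, a plain numeric format such as `printf "%.1f"` followed by a literal `%` avoids the double scaling.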

network-metrics

37.628s ago

5.169ms

Rule State Error Last Evaluation Evaluation Time
record: instance_interface:node_network_traffic_rate:bytes5m expr: rate(node_network_receive_bytes_total[5m]) + rate(node_network_transmit_bytes_total[5m]) ok 37.628s ago 1.9ms
record: instance_interface:node_network_packets_rate:packets5m expr: rate(node_network_receive_packets_total[5m]) + rate(node_network_transmit_packets_total[5m]) ok 37.626s ago 1.503ms
record: instance_interface:node_network_errors_rate:errors5m expr: rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m]) ok 37.624s ago 1.486ms
record: instance:node_network_tcp_connections:total expr: node_netstat_Tcp_CurrEstab ok 37.623s ago 249us

process-metrics

8.887s ago

1.191ms

Rule State Error Last Evaluation Evaluation Time
record: instance:node_processes_running:total expr: node_processes_state{state="R"} ok 8.887s ago 449.2us
record: instance:node_processes_sleeping:total expr: node_processes_state{state="S"} ok 8.886s ago 200.3us
record: instance:node_processes_zombie:total expr: node_processes_state{state="Z"} ok 8.886s ago 102us
record: instance:node_threads:total expr: node_procs_running + node_procs_blocked ok 8.886s ago 411.9us

windows-metrics

16.813s ago

4.783ms

Rule State Error Last Evaluation Evaluation Time
record: instance:windows_cpu_utilisation:rate5m expr: 1 - avg by(instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) ok 16.813s ago 1.704ms
record: instance:windows_memory_utilisation:ratio expr: (windows_cs_physical_memory_bytes - windows_os_physical_memory_free_bytes) / windows_cs_physical_memory_bytes ok 16.811s ago 708.8us
record: instance_volume:windows_disk_utilisation:ratio expr: (windows_logical_disk_size_bytes - windows_logical_disk_free_bytes) / windows_logical_disk_size_bytes ok 16.81s ago 2.351ms