Rules

blackbox-alerts

26.945s ago

5.312ms

Rule State Error Last Evaluation Evaluation Time
alert: EndpointDown expr: probe_success == 0 for: 1m labels: category: availability severity: critical team: apps annotations: action: 1) Check the application status | 2) Check the load balancer | 3) Verify the SSL certificate | 4) Review the application logs | 5) Verify network connectivity description: The endpoint {{ $labels.instance }} is not responding. The monitoring probe is not receiving an HTTP 2xx status code. runbook_url: https://wiki.celuwebcloud.com/runbooks/endpoint-down summary: "\U0001F310 SERVICE DOWN: {{ $labels.instance }}" ok 26.945s ago 4.044ms
alert: HighLatency expr: probe_duration_seconds > 5 for: 3m labels: category: performance severity: warning team: apps annotations: action: Check server load, the database, and slow queries. description: The response time is {{ $value | humanizeDuration }}. summary: High latency on {{ $labels.instance }} ok 26.941s ago 762.4us
alert: SSLCertificateExpiringSoon expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30 for: 1m labels: category: security severity: warning team: infra annotations: action: '1) Renew the certificate with certbot: certbot renew | 2) Or contact the SSL provider | 3) Verify that auto-renewal is configured | 4) Restart the web service after renewing' days_remaining: '{{ $value }}' description: The SSL certificate for {{ $labels.instance }} expires in {{ $value | printf "%.1f" }} days. Users will see a security warning once it expires. summary: "\U0001F512 SSL certificate expires in {{ $value | printf \"%.1f\" }} days" ok 26.94s ago 247.5us
alert: SSLCertificateExpired expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 0 labels: category: security severity: critical team: infra annotations: action: 'URGENT: 1) Renew the certificate: certbot renew --force-renewal | 2) Restart nginx/apache | 3) Verify with: curl -v https://{{ $labels.instance }} | 4) If a CDN is used, invalidate the cache' description: The SSL certificate for {{ $labels.instance }} has EXPIRED. Users are seeing security errors NOW. summary: "\U0001F534 SSL CERTIFICATE EXPIRED - {{ $labels.instance }}" ok 26.94s ago 227us
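
A note on units for the certificate alerts: the expression divides by 86400, so `$value` is a number of days, whereas the `humanizeDuration` template function interprets its input as seconds. The fragment below is a minimal sketch of how the expiring-soon rule might look in a rule file with the value formatted as days. The file layout and the abridged annotations are assumptions for illustration, not the deployed configuration.

```yaml
# Sketch only: abridged version of the SSLCertificateExpiringSoon rule.
groups:
  - name: blackbox-alerts
    rules:
      - alert: SSLCertificateExpiringSoon
        # Seconds until expiry divided by 86400 gives days until expiry.
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        for: 1m
        labels:
          category: security
          severity: warning
          team: infra
        annotations:
          # $value is already in days, so print it as a plain number.
          summary: 'SSL certificate expires in {{ $value | printf "%.1f" }} days'
          days_remaining: '{{ $value }}'
```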

hardware-alerts

21.233s ago

962.6us

Rule State Error Last Evaluation Evaluation Time
alert: HighTemperature expr: node_hwmon_temp_celsius > 80 for: 5m labels: category: hardware severity: critical team: infra annotations: action: 'URGENT: Check the fans and clean dust from the server. Shut down if it exceeds 85°C.' description: The temperature of sensor {{ $labels.sensor }} is {{ $value }}°C summary: High temperature on {{ $labels.instance }} ok 21.233s ago 549.1us
alert: MediumTemperature expr: node_hwmon_temp_celsius > 70 for: 10m labels: category: hardware severity: warning team: infra annotations: action: Check the ventilation of the server/data center. description: The temperature is {{ $value }}°C. Monitor closely. summary: Elevated temperature on {{ $labels.instance }} ok 21.232s ago 270.4us
alert: RAIDDegraded expr: node_md_state{state="degraded"} == 1 labels: category: hardware severity: critical team: infra annotations: action: 'URGENT: Replace the failed disk immediately. Check: cat /proc/mdstat' description: The RAID array {{ $labels.md_device }} is in a degraded state. summary: RAID degraded on {{ $labels.instance }} ok 21.232s ago 121.9us
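
Both temperature rules match the same series once a sensor stays above 80°C, so a host can raise the warning and the critical alert together. One common way to keep only the critical notification is an Alertmanager inhibition rule. The fragment below is a sketch for an `alertmanager.yml` that this page does not show. It assumes both alerts carry the `instance` and `sensor` labels and that the Alertmanager version in use supports the `source_matchers`/`target_matchers` syntax.

```yaml
# Sketch only: fragment of a hypothetical alertmanager.yml.
inhibit_rules:
  # While HighTemperature fires, mute MediumTemperature for the same sensor.
  - source_matchers:
      - alertname = HighTemperature
    target_matchers:
      - alertname = MediumTemperature
    # Only inhibit when both alerts refer to the same host and sensor.
    equal: [instance, sensor]
```

The same pattern could apply to the other warning/critical pairs on this page, such as the CPU, memory, and disk rules.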

info-alerts

1m17.72s ago

730.4us

Rule State Error Last Evaluation Evaluation Time
alert: NodeRebooted expr: (time() - node_boot_time_seconds) < 300 labels: category: maintenance severity: info team: infra annotations: action: Verify whether the reboot was planned or unexpected. description: The system rebooted {{ $value | humanizeDuration }} ago. summary: Reboot detected on {{ $labels.instance }} ok 1m17.72s ago 585.6us
alert: RebootRequired expr: node_reboot_required > 0 labels: category: maintenance severity: info team: infra annotations: action: Schedule a maintenance window to reboot and apply security updates. description: Pending kernel updates require a reboot. summary: Reboot required on {{ $labels.instance }} ok 1m17.719s ago 127.1us

linux-critical

28.464s ago

18.75ms

Rule State Error Last Evaluation Evaluation Time
alert: HostDown expr: up{job=~"linux-servers|windows-servers"} == 0 for: 2m labels: category: availability severity: critical team: infra annotations: action: 1) Verify that the server is powered on | 2) Check network connectivity (ping) | 3) Check the node_exporter/windows_exporter service | 4) Review the firewall description: The server {{ $labels.instance }} has not responded to monitoring scrapes for the last 2 minutes. runbook_url: https://wiki.celuwebcloud.com/runbooks/host-down summary: "\U0001F6A8 SERVER DOWN: {{ $labels.instance }}" ok 28.464s ago 15.98ms
alert: HighCPUUsage expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90 for: 5m labels: category: performance severity: critical team: infra annotations: action: '1) Connect via SSH and run: top or htop | 2) Identify high-consumption processes | 3) If it is a specific service: sudo systemctl restart <service> | 4) Consider scaling resources if the load is legitimate' cpu_percent: '{{ $value }}' description: 'CPU usage has been above 90% for more than 5 minutes. Current value: {{ $value | printf "%.1f" }}%' runbook_url: https://wiki.celuwebcloud.com/runbooks/high-cpu summary: "\U0001F525 CPU CRITICAL on {{ $labels.instance }}" ok 28.448s ago 1.552ms
alert: MemoryCritical expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 95 for: 3m labels: category: performance severity: critical team: infra annotations: action: '1) Identify processes: ps aux --sort=-%mem | head -10 | 2) Free the cache: sudo sync && sudo sysctl -w vm.drop_caches=3 | 3) Consider restarting high-consumption services | 4) Check for memory leaks in applications' description: 'Memory usage is above 95%. Current value: {{ $value | printf "%.1f" }}%. Risk of swapping or of the OOM killer terminating processes.' memory_percent: '{{ $value }}' runbook_url: https://wiki.celuwebcloud.com/runbooks/memory-critical summary: ⚠️ CRITICAL MEMORY on {{ $labels.instance }} ok 28.447s ago 630.2us
alert: DiskFull expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 95 for: 2m labels: category: storage os: linux severity: critical team: infra annotations: action: "1) Clean logs: sudo find /var/log -type f -name '*.log' -mtime +7 -delete | 2) Clean packages: sudo apt autoremove && sudo apt autoclean | 3) Check Docker: docker system prune -a" description: 'The root disk is {{ $value | printf "%.1f" }}% full. Free space is critically low.' runbook_url: https://wiki.celuwebcloud.com/runbooks/disk-full summary: Root disk almost full on {{ $labels.instance }} usage_percent: '{{ $value }}' ok 28.447s ago 560.8us
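
The `humanizePercentage` template function expects a ratio between 0 and 1, while the expressions in this group yield values between 0 and 100, so the two should not be combined directly. One possible alternative, sketched below, is to alert on the ratio recorded by `instance:node_memory_utilisation:ratio` (listed under memory-metrics) and let `humanizePercentage` do the conversion. The abridged annotations and the choice to reuse the recording rule are assumptions, not the deployed configuration.

```yaml
# Sketch only: ratio-based variant of the MemoryCritical rule.
groups:
  - name: linux-critical
    rules:
      - alert: MemoryCritical
        # The recorded series is a 0-1 ratio, so the threshold is 0.95, not 95.
        expr: instance:node_memory_utilisation:ratio > 0.95
        for: 3m
        labels:
          category: performance
          severity: critical
          team: infra
        annotations:
          # humanizePercentage turns a 0-1 ratio into a percentage string.
          description: 'Memory usage is above 95%. Current value: {{ $value | humanizePercentage }}'
```

A trade-off of this design is that the alert then depends on the recording rule being evaluated first, which adds up to one evaluation interval of delay.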

linux-warning

45.953s ago

7.318ms

Rule State Error Last Evaluation Evaluation Time
alert: HighCPUUsageWarning expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 75 for: 10m labels: category: performance severity: warning team: infra annotations: action: Monitor the trend. Determine whether this is normal load or an anomaly. description: 'CPU usage has been above 75% for more than 10 minutes. Value: {{ $value | printf "%.1f" }}%' summary: Elevated CPU usage on {{ $labels.instance }} ok 45.953s ago 1.76ms
alert: MemoryHigh expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85 for: 5m labels: category: performance severity: warning team: infra annotations: action: Review processes with high memory consumption. Check for memory leaks in applications. description: 'Memory usage is above 85%. Value: {{ $value | printf "%.1f" }}%' summary: Elevated memory usage on {{ $labels.instance }} ok 45.951s ago 740.1us
alert: DiskSpaceLow expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 80 for: 5m labels: category: storage severity: warning team: infra annotations: action: Plan a disk cleanup. Review logs, temporary files, and old backups. description: The root disk is {{ $value | printf "%.1f" }}% full. summary: Low disk space on {{ $labels.instance }} ok 45.951s ago 698.7us
alert: DiskWillFillIn4Hours expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4 * 3600) < 0 for: 10m labels: category: storage severity: warning team: infra annotations: action: Urgent preventive action. Investigate what is consuming space so quickly. description: Based on the current trend, the disk will fill up within the next 4 hours. summary: The disk will fill up soon on {{ $labels.instance }} ok 45.95s ago 435.2us
alert: HighSwapUsage expr: (node_memory_SwapTotal_bytes > 0) and (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 80 for: 5m labels: category: performance severity: warning team: infra annotations: action: The system is swapping heavily. Consider adding RAM or investigating memory leaks. description: Swap usage is above 80%. This indicates memory pressure. summary: Elevated swap usage on {{ $labels.instance }} ok 45.95s ago 1.401ms
alert: HighLoadAverage expr: node_load1 > (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"}) * 2) for: 10m labels: category: performance severity: warning team: infra annotations: action: Many processes are queued. Check disk I/O or blocked processes. description: The load average (1m) is {{ $value }}, more than twice the number of cores. summary: High system load on {{ $labels.instance }} ok 45.949s ago 1.304ms
alert: SlowDiskIO expr: rate(node_disk_io_time_seconds_total[5m]) > 0.8 for: 5m labels: category: storage severity: warning team: infra annotations: action: Check which processes are doing heavy I/O with 'iotop'. Consider SSDs or RAID. description: Disk I/O time is above 80%. Possible bottleneck. summary: Slow disk I/O on {{ $labels.instance }} ok 45.947s ago 937.1us

network-alerts

56.999s ago

10.95ms

Rule State Error Last Evaluation Evaluation Time
alert: TooManyNetworkConnections expr: node_netstat_Tcp_CurrEstab > 10000 for: 5m labels: category: network severity: warning team: infra annotations: action: Check connections with 'ss -s' or 'netstat -an'. Look for unusual patterns. description: There are {{ $value }} established TCP connections. This may indicate an attack or a connection leak. summary: Many TCP connections on {{ $labels.instance }} ok 56.999s ago 526.1us
alert: NetworkErrors expr: rate(node_network_receive_errs_total[5m]) > 10 or rate(node_network_transmit_errs_total[5m]) > 10 for: 5m labels: category: network severity: warning team: infra annotations: action: Check cables, network interfaces, and the switch. Use 'ethtool' for diagnostics. description: Network transmit/receive errors detected. summary: Network errors on {{ $labels.instance }} ok 56.999s ago 5.049ms
alert: HighNetworkTrafficRX expr: rate(node_network_receive_bytes_total[5m]) > 1e+08 for: 10m labels: category: network severity: info team: infra annotations: action: Verify whether this is legitimate traffic or a possible DDoS attack. description: Inbound network traffic is above 100 MB/s. summary: High inbound traffic on {{ $labels.instance }} ok 56.994s ago 2.647ms
alert: HighNetworkTrafficTX expr: rate(node_network_transmit_bytes_total[5m]) > 1e+08 for: 10m labels: category: network severity: info team: infra annotations: action: Verify whether this is legitimate traffic or possible data exfiltration. description: Outbound network traffic is above 100 MB/s. summary: High outbound traffic on {{ $labels.instance }} ok 56.991s ago 2.7ms

process-alerts

13.937s ago

1.85ms

Rule State Error Last Evaluation Evaluation Time
alert: ZombieProcesses expr: node_processes_state{state="Z"} > 0 for: 5m labels: category: processes severity: warning team: infra annotations: action: "Identify the parent process that left them behind: ps aux | grep 'Z'" description: There are {{ $value }} zombie processes on the system. summary: Zombie processes on {{ $labels.instance }} ok 13.938s ago 1.556ms
alert: TooManyProcesses expr: node_processes_max_processes - sum without(state) (node_processes_state) < 100 for: 5m labels: category: processes severity: warning team: infra annotations: action: Check for runaway process creation (fork bombs, applications leaking processes). description: Fewer than 100 processes remain available before the system limit. summary: Process limit nearly reached on {{ $labels.instance }} ok 13.936s ago 268.6us

windows-alerts

36.712s ago

10.58ms

Rule State Error Last Evaluation Evaluation Time
alert: WindowsServerDown expr: up{job="windows-servers"} == 0 for: 2m labels: category: availability severity: critical team: infra annotations: action: Check the windows_exporter service, the firewall, and the server status. description: The Windows server is not responding to exporter scrapes. summary: Windows server {{ $labels.instance }} is down ok 36.712s ago 1.765ms
alert: WindowsHighCPU expr: 100 - (avg by(instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) * 100) > 85 for: 10m labels: category: performance severity: warning team: infra annotations: action: Check processes in Task Manager or with Get-Process in PowerShell. description: 'CPU usage: {{ $value | printf "%.1f" }}%' summary: Elevated CPU on Windows {{ $labels.instance }} ok 36.71s ago 1.494ms
alert: WindowsHighMemory expr: (windows_cs_physical_memory_bytes - windows_os_physical_memory_free_bytes) / windows_cs_physical_memory_bytes * 100 > 90 for: 5m labels: category: performance severity: critical team: infra annotations: action: Restart high-consumption services. Consider adding RAM. description: 'Memory usage: {{ $value | printf "%.1f" }}%' summary: Critical memory on Windows {{ $labels.instance }} ok 36.709s ago 2.005ms
alert: WindowsDiskFull expr: (windows_logical_disk_size_bytes - windows_logical_disk_free_bytes) / windows_logical_disk_size_bytes * 100 > 90 for: 5m labels: category: storage os: windows severity: critical team: infra annotations: action: '1) Run cleanmgr as administrator | 2) Empty the Recycle Bin | 3) Clean IIS logs: C:/inetpub/logs | 4) Clear Event Viewer logs | 5) Uninstall unused programs' description: 'Volume {{ $labels.volume }} on {{ $labels.instance }} is at {{ $value | printf "%.1f" }}% usage (threshold: 90%). Immediate action is required.' runbook_url: https://wiki.celuwebcloud.com/runbooks/disk-full summary: Disk full on {{ $labels.instance }} - Volume {{ $labels.volume }} usage_percent: '{{ $value }}' ok 36.707s ago 4.607ms
alert: WindowsServiceStopped expr: windows_service_state{name=~"MSSQLSERVER|W3SVC|ADWS|NTDS|DNS",state="running"} == 0 for: 1m labels: category: services severity: critical team: infra annotations: action: 'Start the service immediately: net start {{ $labels.name }}' description: The service {{ $labels.name }} is stopped. summary: Critical service stopped on {{ $labels.instance }} ok 36.703s ago 308.3us
alert: WindowsUnexpectedReboot expr: (time() - windows_system_system_up_time) < 300 and (time() - windows_system_system_up_time) > 0 labels: category: availability severity: warning team: infra annotations: action: Check the system logs to determine the cause of the reboot. Event Viewer > System. description: The Windows server rebooted less than 5 minutes ago. summary: Recent reboot detected on {{ $labels.instance }} ok 36.702s ago 367.5us

aggregated-team-metrics

1m47.968s ago

3.391ms

Rule State Error Last Evaluation Evaluation Time
record: team:linux_servers_up:total expr: count(up{job="linux-servers"} == 1) ok 1m47.968s ago 866.4us
record: team:linux_servers_down:total expr: count(up{job="linux-servers"} == 0) ok 1m47.968s ago 635.1us
record: team:windows_servers_up:total expr: count(up{job="windows-servers"} == 1) ok 1m47.967s ago 338.5us
record: team:windows_servers_down:total expr: count(up{job="windows-servers"} == 0) ok 1m47.967s ago 315.7us
record: team:infrastructure_availability:ratio expr: (count(up{job=~"linux-servers|windows-servers"} == 1) / count(up{job=~"linux-servers|windows-servers"})) ok 1m47.967s ago 1.197ms
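
One behaviour of the `*_down:total` rules is worth noting: `count()` over an empty instant vector returns no sample at all, so when every server is up these rules record nothing rather than 0, and dashboards show "no data". A common workaround is to fall back to `vector(0)`. The fragment below is a sketch of that variant, not the deployed configuration.

```yaml
# Sketch only: down-counters that record 0 when no server is down.
groups:
  - name: aggregated-team-metrics
    rules:
      - record: team:linux_servers_down:total
        # count() yields an empty result when nothing matches; "or vector(0)"
        # supplies a label-less 0 sample in that case.
        expr: count(up{job="linux-servers"} == 0) or vector(0)
      - record: team:windows_servers_down:total
        expr: count(up{job="windows-servers"} == 0) or vector(0)
```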

availability-metrics

2m21.26s ago

1.342ms

Rule State Error Last Evaluation Evaluation Time
record: instance:node_uptime:days expr: (time() - node_boot_time_seconds) / 86400 ok 2m21.26s ago 845.8us
record: instance:node_time_since_boot:hours expr: (time() - node_boot_time_seconds) / 3600 ok 2m21.259s ago 475.7us

blackbox-metrics

27.794s ago

1.297ms

Rule State Error Last Evaluation Evaluation Time
record: probe:latency:avg5m expr: avg_over_time(probe_duration_seconds[5m]) ok 27.794s ago 584.6us
record: probe:success_rate:ratio5m expr: avg_over_time(probe_success[5m]) ok 27.794s ago 323.5us
record: probe:ssl_expiry:days expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 ok 27.793s ago 364.5us

capacity-planning

53.684s ago

3.524ms

Rule State Error Last Evaluation Evaluation Time
record: instance_mount:node_disk_fill_prediction_24h:bytes expr: predict_linear(node_filesystem_free_bytes[1h], 24 * 3600) ok 53.684s ago 2.298ms
record: instance_mount:node_disk_growth_rate:bytes_per_hour expr: (deriv(node_filesystem_used_bytes[1h]) * 3600) ok 53.682s ago 197.9us
record: instance:node_cpu_trend:avg1h expr: avg_over_time(instance:node_cpu_utilisation:rate5m[1h]) ok 53.682s ago 535.7us
record: instance:node_memory_trend:avg1h expr: avg_over_time(instance:node_memory_utilisation:ratio[1h]) ok 53.681s ago 463.3us

cpu-metrics

15.294s ago

5.347ms

Rule State Error Last Evaluation Evaluation Time
record: instance:node_cpu_utilisation:rate5m expr: 1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) ok 15.294s ago 1.261ms
record: instance:node_cpu_usage_by_mode:rate5m expr: rate(node_cpu_seconds_total{mode=~"user|system|iowait"}[5m]) ok 15.293s ago 3.337ms
record: instance:node_load_normalised:avg5m expr: node_load1 / count without(cpu, mode) (node_cpu_seconds_total{mode="idle"}) ok 15.289s ago 727.6us

disk-metrics

1.202s ago

3.744ms

Rule State Error Last Evaluation Evaluation Time
record: instance_mount:node_disk_utilisation:ratio expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes ok 1.203s ago 2.26ms
record: instance_device:node_disk_io_rate:bytes5m expr: rate(node_disk_read_bytes_total[5m]) + rate(node_disk_written_bytes_total[5m]) ok 1.2s ago 735.4us
record: instance_device:node_disk_io_utilisation:rate5m expr: rate(node_disk_io_time_seconds_total[5m]) ok 1.2s ago 365.6us
record: instance_device:node_disk_io_latency:avg5m expr: rate(node_disk_io_time_weighted_seconds_total[5m]) / rate(node_disk_ios_completed_total[5m]) ok 1.199s ago 353.7us

memory-metrics

18.515s ago

2.138ms

Rule State Error Last Evaluation Evaluation Time
record: instance:node_memory_utilisation:ratio expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes ok 18.515s ago 1.086ms
record: instance:node_memory_used_bytes:calc expr: node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes ok 18.514s ago 426.5us
record: instance:node_swap_utilisation:ratio expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes ok 18.513s ago 603.3us
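
Recording rules such as these can be checked offline with `promtool test rules`. The file below is a minimal sketch of a unit test for `instance:node_memory_utilisation:ratio`. The rule file name `memory-metrics.yml`, the instance label value, and the synthetic sample values are assumptions chosen so the expected ratio is exactly 0.75.

```yaml
# Sketch only: hypothetical test file, e.g. memory-metrics.test.yml.
rule_files:
  - memory-metrics.yml   # assumed name of the file holding the memory-metrics group

evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic input: 1000 bytes total, 250 bytes available.
    input_series:
      - series: 'node_memory_MemTotal_bytes{instance="srv1:9100"}'
        values: '1000 1000 1000'
      - series: 'node_memory_MemAvailable_bytes{instance="srv1:9100"}'
        values: '250 250 250'
    promql_expr_test:
      # (1000 - 250) / 1000 = 0.75
      - expr: instance:node_memory_utilisation:ratio
        eval_time: 2m
        exp_samples:
          - labels: 'instance:node_memory_utilisation:ratio{instance="srv1:9100"}'
            value: 0.75
```

Under these assumptions the test would be run with `promtool test rules memory-metrics.test.yml`.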

network-metrics

17.948s ago

16.85ms

Rule State Error Last Evaluation Evaluation Time
record: instance_interface:node_network_traffic_rate:bytes5m expr: rate(node_network_receive_bytes_total[5m]) + rate(node_network_transmit_bytes_total[5m]) ok 17.948s ago 6.047ms
record: instance_interface:node_network_packets_rate:packets5m expr: rate(node_network_receive_packets_total[5m]) + rate(node_network_transmit_packets_total[5m]) ok 17.942s ago 5.321ms
record: instance_interface:node_network_errors_rate:errors5m expr: rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m]) ok 17.936s ago 5.193ms
record: instance:node_network_tcp_connections:total expr: node_netstat_Tcp_CurrEstab ok 17.931s ago 256.9us

process-metrics

49.207s ago

1.383ms

Rule State Error Last Evaluation Evaluation Time
record: instance:node_processes_running:total expr: node_processes_state{state="R"} ok 49.207s ago 474.7us
record: instance:node_processes_sleeping:total expr: node_processes_state{state="S"} ok 49.206s ago 228.1us
record: instance:node_processes_zombie:total expr: node_processes_state{state="Z"} ok 49.206s ago 204.8us
record: instance:node_threads:total expr: node_procs_running + node_procs_blocked ok 49.206s ago 449us

windows-metrics

57.133s ago

5.374ms

Rule State Error Last Evaluation Evaluation Time
record: instance:windows_cpu_utilisation:rate5m expr: 1 - avg by(instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) ok 57.133s ago 2.04ms
record: instance:windows_memory_utilisation:ratio expr: (windows_cs_physical_memory_bytes - windows_os_physical_memory_free_bytes) / windows_cs_physical_memory_bytes ok 57.131s ago 885.6us
record: instance_volume:windows_disk_utilisation:ratio expr: (windows_logical_disk_size_bytes - windows_logical_disk_free_bytes) / windows_logical_disk_size_bytes ok 57.13s ago 2.425ms