
🔧 Troubleshooting Guide

A quick guide to solving common problems.


🚨 COMMON PROBLEMS

1. Cannot access the cluster

Symptoms: kubectl does not connect

Solution:

# Check the kubeconfig
export KUBECONFIG=~/.kube/aiworker-config
kubectl cluster-info

# If that fails, download it again
ssh root@108.165.47.233 "cat /etc/rancher/k3s/k3s.yaml" | \
  sed 's/127.0.0.1/108.165.47.233/g' > ~/.kube/aiworker-config

2. Pod in CrashLoopBackOff

Symptoms: the pod restarts constantly

Diagnosis:

# View logs
kubectl logs <pod-name> -n <namespace>

# View logs from the previous container
kubectl logs <pod-name> -n <namespace> --previous

# Describe the pod
kubectl describe pod <pod-name> -n <namespace>

Common causes (quick checks sketched below):

  • Missing environment variable
  • Secret does not exist
  • Cannot connect to the DB
  • Port already in use
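
A minimal set of checks for these causes; <secret-name>, <deployment>, <pod-name> and <namespace> are placeholders for the real resources:

# Does the referenced Secret exist?
kubectl get secret <secret-name> -n <namespace>

# What env vars and secret references does the Deployment declare?
kubectl get deployment <deployment> -n <namespace> \
  -o jsonpath='{.spec.template.spec.containers[0].env}'

# Env actually injected into the container (only works while it is up)
kubectl exec -n <namespace> <pod-name> -- env | sort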

3. Ingress does not resolve (502/503/504)

Symptoms: the URL returns a gateway error

Diagnosis:

# Check the Ingress
kubectl get ingress -n <namespace>
kubectl describe ingress <name> -n <namespace>

# Check the Service and its Endpoints
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>

# Nginx Ingress logs
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller --tail=100 | grep <domain>

Check (commands sketched below):

  • Service selector is correct
  • Pod is Running and Ready
  • Correct port on the Service
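
A quick way to confirm those three points (name/namespace are placeholders):

# Selector declared by the Service
kubectl get svc <name> -n <namespace> -o jsonpath='{.spec.selector}'

# Labels on the pods — they must match the selector above
kubectl get pods -n <namespace> --show-labels

# Ports declared on the Service vs. the container port
kubectl get svc <name> -n <namespace> -o jsonpath='{.spec.ports}'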

4. TLS Certificate is not issued

Symptoms: Certificate stuck with Ready=False

Diagnosis:

# View the certificate
kubectl get certificate -n <namespace>
kubectl describe certificate <name> -n <namespace>

# View the CertificateRequest
kubectl get certificaterequest -n <namespace>

# View the Challenge (HTTP-01)
kubectl get challenge -n <namespace>
kubectl describe challenge <name> -n <namespace>

# cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager --tail=50

Common causes:

  • DNS does not point to the LBs
  • Firewall blocks port 80
  • Ingress is missing the cert-manager annotation (check sketched after the Fix below)

Fix:

# Check DNS
dig <domain> +short
# Should show: 108.165.47.221, 108.165.47.203

# Delete and recreate the certificate
kubectl delete certificate <name> -n <namespace>
kubectl delete secret <name> -n <namespace>
# Recreate the Ingress
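
If the missing cert-manager annotation is the cause, it can be checked and added directly on the Ingress; the issuer name letsencrypt-prod below is an assumption — use whatever ClusterIssuer is actually deployed:

# Check the current annotations
kubectl get ingress <name> -n <namespace> -o jsonpath='{.metadata.annotations}'

# Add the annotation (issuer name is an assumption)
kubectl annotate ingress <name> -n <namespace> cert-manager.io/cluster-issuer=letsencrypt-prod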

5. PVC stuck in Pending

Symptoms: PVC does not bind

Diagnosis:

# View the PVC
kubectl get pvc -n <namespace>
kubectl describe pvc <name> -n <namespace>

# View available PVs
kubectl get pv

# View Longhorn volumes
kubectl get volumes.longhorn.io -n longhorn-system

Fix:

# Open the Longhorn UI
open https://longhorn.fuq.tv

# Longhorn logs
kubectl logs -n longhorn-system daemonset/longhorn-manager --tail=50

6. Gitea Actions does not run

Symptoms: the workflow is not triggered

Diagnosis:

# Check the runner
kubectl get pods -n gitea-actions
kubectl logs -n gitea-actions deployment/gitea-runner -c runner --tail=100

# Check in the Gitea UI
open https://git.fuq.tv/admin/aiworker-backend/actions

Fix:

# Restart the runner
kubectl rollout restart deployment/gitea-runner -n gitea-actions

# Verify the runner is registered
kubectl logs -n gitea-actions deployment/gitea-runner -c runner | grep "registered"

# Push again to trigger the workflow
git commit --allow-empty -m "Trigger workflow"
git push

7. MariaDB does not connect

Symptoms: Connection refused or Access denied

Diagnosis:

# Check the pod
kubectl get pods -n control-plane mariadb-0

# View logs
kubectl logs -n control-plane mariadb-0

# Connection test
kubectl exec -n control-plane mariadb-0 -- \
  mariadb -uaiworker -pAiWorker2026_UserPass! -e "SELECT 1"

Correct credentials:

Host: mariadb.control-plane.svc.cluster.local
Port: 3306
User: aiworker
Pass: AiWorker2026_UserPass!
DB:   aiworker
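
To test these credentials from a local machine without exec'ing into the pod, a port-forward sketch; it assumes the Service is named mariadb in control-plane (as the host above suggests) and that a local mariadb client is installed:

# Forward the in-cluster Service to localhost
kubectl port-forward -n control-plane svc/mariadb 3306:3306

# In another terminal
mariadb -h 127.0.0.1 -P 3306 -uaiworker -p'AiWorker2026_UserPass!' aiworker -e "SELECT 1"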

8. Load Balancer does not respond

Symptoms: curl https://<domain> times out

Diagnosis:

# Check HAProxy
ssh root@108.165.47.221 "systemctl status haproxy"
ssh root@108.165.47.203 "systemctl status haproxy"

# View stats
open http://108.165.47.221:8404/stats
# User: admin / aiworker2026

# Test a worker node directly
curl http://108.165.47.225:32388  # Ingress NodePort

Fix:

# Restart HAProxy
ssh root@108.165.47.221 "systemctl restart haproxy"
ssh root@108.165.47.203 "systemctl restart haproxy"

# Check the config
ssh root@108.165.47.221 "cat /etc/haproxy/haproxy.cfg"

🔍 GENERAL DIAGNOSTIC COMMANDS

Cluster Status

# Nodes
kubectl get nodes -o wide

# Resources
kubectl top nodes
kubectl top pods -A

# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# Pods with problems
kubectl get pods -A | grep -v Running

Check Connectivity

# From one pod to another service
kubectl run -it --rm debug --image=busybox --restart=Never -- sh
# Inside the pod:
wget -O- http://mariadb.control-plane.svc.cluster.local:3306
# MariaDB is not HTTP, so a protocol error here still proves the port is
# reachable; "connection refused" means it is not

Clean Up Resources

# Completed/failed pods
kubectl delete pods --field-selector=status.phase=Failed -A
kubectl delete pods --field-selector=status.phase=Succeeded -A

# Old preview namespaces
kubectl get ns -l environment=preview
kubectl delete ns <preview-namespace>
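
If every preview namespace can go at once, the same label selector also works for bulk deletion — review the list from the command above first:

# Delete all preview namespaces matching the label (destructive; verify first)
kubectl delete ns -l environment=preview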

📞 CONTACTS AND RESOURCES

Support

Central Logs

# All recent errors
kubectl get events -A --sort-by='.lastTimestamp' | grep -i error | tail -20

Quick Backup

# Export the entire configuration
kubectl get all,ingress,certificate,pvc,secret -A -o yaml > cluster-backup.yaml

# MariaDB backup
kubectl exec -n control-plane mariadb-0 -- \
  mariadb-dump -uroot -pAiWorker2026_RootPass! --all-databases > backup-$(date +%Y%m%d).sql
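
A minimal restore sketch, assuming the dump was taken with the command above:

# Stream the dump back into MariaDB
kubectl exec -i -n control-plane mariadb-0 -- \
  mariadb -uroot -p'AiWorker2026_RootPass!' < backup-<date>.sql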

🆘 EMERGENCY PROCEDURES

Cluster does not respond

# SSH into the control plane
ssh root@108.165.47.233

# Check K3s
systemctl status k3s
journalctl -u k3s -n 100

# Restart K3s (last resort)
systemctl restart k3s

Node down

# Cordon (prevent new scheduling)
kubectl cordon <node-name>

# Drain (move pods off the node)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Investigate on the node
ssh root@<node-ip>
systemctl status k3s-agent
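
Once the node is healthy again, re-enable scheduling on it:

# Allow pods to be scheduled on the node again
kubectl uncordon <node-name>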

Storage corruption

# Open the Longhorn UI
open https://longhorn.fuq.tv

# View replicas
kubectl get replicas.longhorn.io -n longhorn-system

# Restore from a snapshot (if one exists)
# Via Longhorn UI → Volume → Create from Snapshot

💡 TIPS

Fast Development

# Auto-reload for the backend
bun --watch src/index.ts

# Stream logs in real time
kubectl logs -f deployment/backend -n control-plane

# Port-forward for testing
kubectl port-forward svc/backend 3000:3000 -n control-plane

Networking Debug

# Test from outside the cluster
curl -v https://api.fuq.tv

# Test from inside the cluster
kubectl run curl --image=curlimages/curl -it --rm -- sh
# Inside the pod:
curl http://backend.control-plane.svc.cluster.local:3000/api/health

Performance

# View resource usage
kubectl top pods -n control-plane
kubectl top nodes

# Pods consuming the most
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu

🔗 QUICK LINKS

  • Cluster Info: CLUSTER-READY.md
  • Credentials: CLUSTER-CREDENTIALS.md
  • Roadmap: ROADMAP.md
  • Next session: NEXT-SESSION.md
  • Agent guide: AGENT-GUIDE.md
  • Container Registry: docs/CONTAINER-REGISTRY.md

If none of this works, check the full docs in /docs or contact the team.