# 🔧 Troubleshooting Guide

A quick guide to common problems and their fixes.

---

## 🚨 COMMON PROBLEMS

### 1. Can't access the cluster

**Symptoms**: `kubectl` won't connect

**Solution**:
```bash
# Check the kubeconfig
export KUBECONFIG=~/.kube/aiworker-config
kubectl cluster-info

# If that fails, re-download it
ssh root@108.165.47.233 "cat /etc/rancher/k3s/k3s.yaml" | \
  sed 's/127.0.0.1/108.165.47.233/g' > ~/.kube/aiworker-config
```

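If the kubeconfig looks right but `kubectl` still times out, it can help to confirm the API server port is reachable at all. A minimal sketch, assuming `nc` and `curl` are installed locally (6443 is the default K3s API port):

```bash
# Raw TCP reachability of the K3s API server
nc -zv 108.165.47.233 6443

# Or let curl confirm the TLS endpoint answers — an unauthorized
# JSON error here still proves the API server is up
curl -k https://108.165.47.233:6443/version
```
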
### 2. Pod in CrashLoopBackOff

**Symptoms**: the pod restarts constantly

**Diagnosis**:
```bash
# View the logs
kubectl logs <pod-name> -n <namespace>

# Logs from the previous container
kubectl logs <pod-name> -n <namespace> --previous

# Describe the pod
kubectl describe pod <pod-name> -n <namespace>
```

**Common causes** (see the checks below):
- Missing environment variable
- Secret doesn't exist
- Can't connect to the DB
- Port already in use

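To rule out the first two causes quickly, a sketch like this confirms the Secret exists and shows what env the container was actually given (all names are placeholders):

```bash
# Does the referenced Secret exist in the right namespace?
kubectl get secret <secret-name> -n <namespace>

# What env vars (and valueFrom references) does the pod spec declare?
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.spec.containers[*].env}'
```
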
### 3. Ingress doesn't resolve (502/503/504)

**Symptoms**: the URL returns a gateway error

**Diagnosis**:
```bash
# Check the Ingress
kubectl get ingress -n <namespace>
kubectl describe ingress <name> -n <namespace>

# Check the Service
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>

# Nginx Ingress logs
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller --tail=100 | grep <domain>
```

**Check** (see the sketch below):
- The Service selector is correct
- The pod is Running and Ready
- The Service targets the right port

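A quick way to verify the first item: the Service selector must match the pod labels, otherwise the endpoints list stays empty and Nginx has nowhere to route. A minimal sketch with placeholder names:

```bash
# Compare the Service selector against the pod labels
kubectl get svc <name> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> --show-labels

# No addresses listed here means the selector matches nothing
kubectl get endpoints <name> -n <namespace>
```
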
### 4. TLS certificate isn't issued

**Symptoms**: the Certificate stays in `False` state

**Diagnosis**:
```bash
# Inspect the certificate
kubectl get certificate -n <namespace>
kubectl describe certificate <name> -n <namespace>

# Inspect the CertificateRequest
kubectl get certificaterequest -n <namespace>

# Inspect the Challenge (HTTP-01)
kubectl get challenge -n <namespace>
kubectl describe challenge <name> -n <namespace>

# cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager --tail=50
```

**Common causes**:
- DNS doesn't point to the LBs
- A firewall blocks port 80
- The Ingress is missing the cert-manager annotation (see the sketch below)

**Fix**:
```bash
# Check DNS
dig <domain> +short
# Should show: 108.165.47.221, 108.165.47.203

# Delete and recreate the certificate
kubectl delete certificate <name> -n <namespace>
kubectl delete secret <name> -n <namespace>
# Recreate the ingress
```

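For the missing annotation, it can be added in place. A sketch — `letsencrypt-prod` is an assumed issuer name here; check what the cluster actually defines with `kubectl get clusterissuer`:

```bash
# Attach the cert-manager issuer annotation to an existing Ingress
# (assumes a ClusterIssuer named "letsencrypt-prod" exists)
kubectl annotate ingress <name> -n <namespace> \
  cert-manager.io/cluster-issuer=letsencrypt-prod --overwrite
```
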
### 5. PVC stuck in Pending

**Symptoms**: the PVC never binds

**Diagnosis**:
```bash
# Inspect the PVC
kubectl get pvc -n <namespace>
kubectl describe pvc <name> -n <namespace>

# List available PVs
kubectl get pv

# List Longhorn volumes
kubectl get volumes.longhorn.io -n longhorn-system
```

**Fix**:
```bash
# Open the Longhorn UI
open https://longhorn.fuq.tv

# Longhorn logs
kubectl logs -n longhorn-system daemonset/longhorn-manager --tail=50
```

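A Pending PVC with no useful events is often a StorageClass problem — the class named in the claim doesn't exist or isn't the default. A quick check (placeholder names):

```bash
# Is the requested StorageClass actually defined?
kubectl get storageclass

# Which class does the claim ask for?
kubectl get pvc <name> -n <namespace> -o jsonpath='{.spec.storageClassName}'
```
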
### 6. Gitea Actions doesn't run

**Symptoms**: the workflow never triggers

**Diagnosis**:
```bash
# Check the runner
kubectl get pods -n gitea-actions
kubectl logs -n gitea-actions deployment/gitea-runner -c runner --tail=100

# Check in the Gitea UI
open https://git.fuq.tv/admin/aiworker-backend/actions
```

**Fix**:
```bash
# Restart the runner
kubectl rollout restart deployment/gitea-runner -n gitea-actions

# Verify the runner is registered
kubectl logs -n gitea-actions deployment/gitea-runner -c runner | grep "registered"

# Push again to re-trigger
git commit --allow-empty -m "Trigger workflow"
git push
```

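If the runner is healthy but nothing triggers, it's worth confirming the workflow file lives where Gitea looks for it — Gitea Actions reads workflows from `.gitea/workflows/` (it also accepts `.github/workflows/`), and Actions must be enabled in the repository settings:

```bash
# The workflow file must sit in a directory Gitea actually scans
ls .gitea/workflows/
```
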
### 7. MariaDB won't connect

**Symptoms**: `Connection refused` or `Access denied`

**Diagnosis**:
```bash
# Check the pod
kubectl get pods -n control-plane mariadb-0

# View the logs
kubectl logs -n control-plane mariadb-0

# Connection test
kubectl exec -n control-plane mariadb-0 -- \
  mariadb -uaiworker -p'AiWorker2026_UserPass!' -e "SELECT 1"
```

**Correct credentials**:
```
Host: mariadb.control-plane.svc.cluster.local
Port: 3306
User: aiworker
Pass: AiWorker2026_UserPass!
DB: aiworker
```

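To reproduce a client's view from your workstation, a port-forward plus a local client works too. A sketch, assuming a `mariadb` client is installed locally:

```bash
# Forward the in-cluster Service to localhost, then connect through it
kubectl port-forward -n control-plane svc/mariadb 3306:3306 &
mariadb -h 127.0.0.1 -P 3306 -uaiworker -p'AiWorker2026_UserPass!' -e "SELECT 1"
```
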
### 8. Load balancer not responding

**Symptoms**: `curl https://<domain>` times out

**Diagnosis**:
```bash
# Check HAProxy
ssh root@108.165.47.221 "systemctl status haproxy"
ssh root@108.165.47.203 "systemctl status haproxy"

# View the stats
open http://108.165.47.221:8404/stats
# User: admin / aiworker2026

# Test a worker directly
curl http://108.165.47.225:32388  # Ingress NodePort
```

**Fix**:
```bash
# Restart HAProxy
ssh root@108.165.47.221 "systemctl restart haproxy"
ssh root@108.165.47.203 "systemctl restart haproxy"

# Check the config
ssh root@108.165.47.221 "cat /etc/haproxy/haproxy.cfg"
```

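When only one of the two LBs is sick, DNS round-robin can make the symptoms intermittent. curl's `--resolve` flag pins the request to one LB at a time (placeholder domain):

```bash
# Test each load balancer individually, bypassing DNS round-robin
curl -sv -o /dev/null https://<domain> --resolve <domain>:443:108.165.47.221
curl -sv -o /dev/null https://<domain> --resolve <domain>:443:108.165.47.203
```
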
---

## 🔍 GENERAL DIAGNOSTIC COMMANDS

### Cluster Status
```bash
# Nodes
kubectl get nodes -o wide

# Resource usage
kubectl top nodes
kubectl top pods -A

# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# Pods with problems
kubectl get pods -A | grep -v Running
```

### Check Connectivity

```bash
# From a pod to another service
kubectl run -it --rm debug --image=busybox --restart=Never -- sh
# Inside the pod:
# (MariaDB doesn't speak HTTP, so wget will error out — what matters
# is "connection refused" vs. any answer from an open port)
wget -O- http://mariadb.control-plane.svc.cluster.local:3306
```

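Inside the same busybox pod, DNS resolution is often the real culprit; `nslookup` ships with busybox. The `nc -z` port probe is a sketch — busybox builds vary, and if `-z` isn't supported, a plain `nc <host> <port>` that connects (then Ctrl-C) proves the same thing:

```bash
# Does cluster DNS resolve the service name?
nslookup mariadb.control-plane.svc.cluster.local

# Is the port open? (-z: just probe, don't send data)
nc -z mariadb.control-plane.svc.cluster.local 3306 && echo open
```
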
### Clean Up Resources

```bash
# Completed/failed pods
kubectl delete pods --field-selector=status.phase=Failed -A
kubectl delete pods --field-selector=status.phase=Succeeded -A

# Old preview namespaces (see the bulk-delete sketch below)
kubectl get ns -l environment=preview
kubectl delete ns <preview-namespace>
```

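To clear every preview namespace in one pass instead of one by one — review the `get` output first, since this deletes everything the label selector matches:

```bash
# Bulk-delete all namespaces labeled environment=preview
kubectl get ns -l environment=preview -o name | xargs -r kubectl delete
```
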
---

## 📞 CONTACTS AND RESOURCES

### Support
- **CubePath**: https://cubepath.com/support
- **K3s Issues**: https://github.com/k3s-io/k3s/issues
- **Gitea**: https://discourse.gitea.io

### Central Logs
```bash
# All recent errors
kubectl get events -A --sort-by='.lastTimestamp' | grep -i error | tail -20
```

### Quick Backup
```bash
# Export the whole configuration
kubectl get all,ingress,certificate,pvc,secret -A -o yaml > cluster-backup.yaml

# Back up MariaDB
kubectl exec -n control-plane mariadb-0 -- \
  mariadb-dump -uroot -p'AiWorker2026_RootPass!' --all-databases > backup-$(date +%Y%m%d).sql
```

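Restoring that dump goes back through `kubectl exec -i`, which streams stdin into the pod. A sketch, using the dump file naming from above:

```bash
# Stream a dump back into the MariaDB pod
kubectl exec -i -n control-plane mariadb-0 -- \
  mariadb -uroot -p'AiWorker2026_RootPass!' < backup-YYYYMMDD.sql
```
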
---

## 🆘 EMERGENCY PROCEDURES

### Cluster not responding
```bash
# SSH to the control plane
ssh root@108.165.47.233

# Check K3s
systemctl status k3s
journalctl -u k3s -n 100

# Restart K3s (last resort)
systemctl restart k3s
```

### Node down
```bash
# Cordon (prevent new scheduling)
kubectl cordon <node-name>

# Drain (evict pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Investigate on the node itself
ssh root@<node-ip>
systemctl status k3s-agent
```

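Once the node is healthy again, don't forget to let the scheduler use it:

```bash
# Re-enable scheduling on the recovered node
kubectl uncordon <node-name>
```
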
### Storage corruption
```bash
# Open the Longhorn UI
open https://longhorn.fuq.tv

# List replicas
kubectl get replicas.longhorn.io -n longhorn-system

# Restore from a snapshot (if one exists)
# Via Longhorn UI → Volume → Create from Snapshot
```

---

## 💡 TIPS

### Fast Development
```bash
# Auto-reload for the backend
bun --watch src/index.ts

# Tail logs in real time
kubectl logs -f deployment/backend -n control-plane

# Port-forward for testing (see below)
kubectl port-forward svc/backend 3000:3000 -n control-plane
```

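With that port-forward running, the backend's health endpoint (the same path used for the in-cluster check below) answers locally:

```bash
# Hit the forwarded backend from your workstation
curl http://localhost:3000/api/health
```
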
### Networking Debug
```bash
# Test from outside the cluster
curl -v https://api.fuq.tv

# Test from inside the cluster
kubectl run curl --image=curlimages/curl -it --rm -- sh
curl http://backend.control-plane.svc.cluster.local:3000/api/health
```

### Performance
```bash
# Resource usage
kubectl top pods -n control-plane
kubectl top nodes

# Biggest consumers
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu
```

---

## 🔗 QUICK LINKS

- **Cluster info**: `CLUSTER-READY.md`
- **Credentials**: `CLUSTER-CREDENTIALS.md`
- **Roadmap**: `ROADMAP.md`
- **Next session**: `NEXT-SESSION.md`
- **Agent guide**: `AGENT-GUIDE.md`
- **Container Registry**: `docs/CONTAINER-REGISTRY.md`

---

**If none of this works, check the full docs in `/docs` or contact the team.**