# 🔧 Troubleshooting Guide

A quick guide to common problems and their fixes.

---

## 🚨 COMMON PROBLEMS

### 1. Can't access the cluster

**Symptoms**: `kubectl` won't connect

**Solution**:
```bash
# Check the kubeconfig
export KUBECONFIG=~/.kube/aiworker-config
kubectl cluster-info

# If that fails, re-download it
ssh root@108.165.47.233 "cat /etc/rancher/k3s/k3s.yaml" | \
  sed 's/127.0.0.1/108.165.47.233/g' > ~/.kube/aiworker-config
```

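If the kubeconfig looks right but `kubectl` still times out, it can help to confirm the API server port is reachable at all. A minimal sketch, assuming `nc` and `curl` are installed locally (6443 is the default K3s API port):

```bash
# Raw TCP reachability of the K3s API server
nc -zv 108.165.47.233 6443

# Or let curl confirm the TLS endpoint answers — an unauthorized
# JSON error here still proves the API server is up
curl -k https://108.165.47.233:6443/version
```
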
### 2. Pod in CrashLoopBackOff

**Symptoms**: the pod restarts constantly

**Diagnosis**:
```bash
# View the logs
kubectl logs <pod-name> -n <namespace>

# Logs from the previous container
kubectl logs <pod-name> -n <namespace> --previous

# Describe the pod
kubectl describe pod <pod-name> -n <namespace>
```

**Common causes** (see the checks below):
- Missing environment variable
- Secret doesn't exist
- Can't connect to the DB
- Port already in use

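To rule out the first two causes quickly, a sketch like this confirms the Secret exists and shows what env the container was actually given (all names are placeholders):

```bash
# Does the referenced Secret exist in the right namespace?
kubectl get secret <secret-name> -n <namespace>

# What env vars (and valueFrom references) does the pod spec declare?
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.spec.containers[*].env}'
```
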
### 3. Ingress doesn't resolve (502/503/504)

**Symptoms**: the URL returns a gateway error

**Diagnosis**:
```bash
# Check the Ingress
kubectl get ingress -n <namespace>
kubectl describe ingress <name> -n <namespace>

# Check the Service
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>

# Nginx Ingress logs
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller --tail=100 | grep <domain>
```

**Check** (see the sketch below):
- The Service selector is correct
- The pod is Running and Ready
- The Service targets the right port

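A quick way to verify the first item: the Service selector must match the pod labels, otherwise the endpoints list stays empty and Nginx has nowhere to route. A minimal sketch with placeholder names:

```bash
# Compare the Service selector against the pod labels
kubectl get svc <name> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> --show-labels

# No addresses listed here means the selector matches nothing
kubectl get endpoints <name> -n <namespace>
```
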
### 4. TLS certificate isn't issued

**Symptoms**: the Certificate stays in `False` state

**Diagnosis**:
```bash
# Inspect the certificate
kubectl get certificate -n <namespace>
kubectl describe certificate <name> -n <namespace>

# Inspect the CertificateRequest
kubectl get certificaterequest -n <namespace>

# Inspect the Challenge (HTTP-01)
kubectl get challenge -n <namespace>
kubectl describe challenge <name> -n <namespace>

# cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager --tail=50
```

**Common causes**:
- DNS doesn't point to the LBs
- A firewall blocks port 80
- The Ingress is missing the cert-manager annotation (see the sketch below)

**Fix**:
```bash
# Check DNS
dig <domain> +short
# Should show: 108.165.47.221, 108.165.47.203

# Delete and recreate the certificate
kubectl delete certificate <name> -n <namespace>
kubectl delete secret <name> -n <namespace>
# Recreate the ingress
```

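For the missing annotation, it can be added in place. A sketch — `letsencrypt-prod` is an assumed issuer name here; check what the cluster actually defines with `kubectl get clusterissuer`:

```bash
# Attach the cert-manager issuer annotation to an existing Ingress
# (assumes a ClusterIssuer named "letsencrypt-prod" exists)
kubectl annotate ingress <name> -n <namespace> \
  cert-manager.io/cluster-issuer=letsencrypt-prod --overwrite
```
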
### 5. PVC stuck in Pending

**Symptoms**: the PVC never binds

**Diagnosis**:
```bash
# Inspect the PVC
kubectl get pvc -n <namespace>
kubectl describe pvc <name> -n <namespace>

# List available PVs
kubectl get pv

# List Longhorn volumes
kubectl get volumes.longhorn.io -n longhorn-system
```

**Fix**:
```bash
# Open the Longhorn UI
open https://longhorn.fuq.tv

# Longhorn logs
kubectl logs -n longhorn-system daemonset/longhorn-manager --tail=50
```

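A Pending PVC with no useful events is often a StorageClass problem — the class named in the claim doesn't exist or isn't the default. A quick check (placeholder names):

```bash
# Is the requested StorageClass actually defined?
kubectl get storageclass

# Which class does the claim ask for?
kubectl get pvc <name> -n <namespace> -o jsonpath='{.spec.storageClassName}'
```
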
### 6. Gitea Actions doesn't run

**Symptoms**: the workflow never triggers

**Diagnosis**:
```bash
# Check the runner
kubectl get pods -n gitea-actions
kubectl logs -n gitea-actions deployment/gitea-runner -c runner --tail=100

# Check in the Gitea UI
open https://git.fuq.tv/admin/aiworker-backend/actions
```

**Fix**:
```bash
# Restart the runner
kubectl rollout restart deployment/gitea-runner -n gitea-actions

# Verify the runner is registered
kubectl logs -n gitea-actions deployment/gitea-runner -c runner | grep "registered"

# Push again to re-trigger
git commit --allow-empty -m "Trigger workflow"
git push
```

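If the runner is healthy but nothing triggers, it's worth confirming the workflow file lives where Gitea looks for it — Gitea Actions reads workflows from `.gitea/workflows/` (it also accepts `.github/workflows/`), and Actions must be enabled in the repository settings:

```bash
# The workflow file must sit in a directory Gitea actually scans
ls .gitea/workflows/
```
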
### 7. MariaDB won't connect

**Symptoms**: `Connection refused` or `Access denied`

**Diagnosis**:
```bash
# Check the pod
kubectl get pods -n control-plane mariadb-0

# View the logs
kubectl logs -n control-plane mariadb-0

# Connection test
kubectl exec -n control-plane mariadb-0 -- \
  mariadb -uaiworker -p'AiWorker2026_UserPass!' -e "SELECT 1"
```

**Correct credentials**:
```
Host: mariadb.control-plane.svc.cluster.local
Port: 3306
User: aiworker
Pass: AiWorker2026_UserPass!
DB: aiworker
```

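To reproduce a client's view from your workstation, a port-forward plus a local client works too. A sketch, assuming a `mariadb` client is installed locally:

```bash
# Forward the in-cluster Service to localhost, then connect through it
kubectl port-forward -n control-plane svc/mariadb 3306:3306 &
mariadb -h 127.0.0.1 -P 3306 -uaiworker -p'AiWorker2026_UserPass!' -e "SELECT 1"
```
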
### 8. Load balancer not responding

**Symptoms**: `curl https://<domain>` times out

**Diagnosis**:
```bash
# Check HAProxy
ssh root@108.165.47.221 "systemctl status haproxy"
ssh root@108.165.47.203 "systemctl status haproxy"

# View the stats
open http://108.165.47.221:8404/stats
# User: admin / aiworker2026

# Test a worker directly
curl http://108.165.47.225:32388  # Ingress NodePort
```

**Fix**:
```bash
# Restart HAProxy
ssh root@108.165.47.221 "systemctl restart haproxy"
ssh root@108.165.47.203 "systemctl restart haproxy"

# Check the config
ssh root@108.165.47.221 "cat /etc/haproxy/haproxy.cfg"
```

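When only one of the two LBs is sick, DNS round-robin can make the symptoms intermittent. curl's `--resolve` flag pins the request to one LB at a time (placeholder domain):

```bash
# Test each load balancer individually, bypassing DNS round-robin
curl -sv -o /dev/null https://<domain> --resolve <domain>:443:108.165.47.221
curl -sv -o /dev/null https://<domain> --resolve <domain>:443:108.165.47.203
```
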
---

## 🔍 GENERAL DIAGNOSTIC COMMANDS

### Cluster Status
```bash
# Nodes
kubectl get nodes -o wide

# Resource usage
kubectl top nodes
kubectl top pods -A

# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# Pods with problems
kubectl get pods -A | grep -v Running
```

### Check Connectivity

```bash
# From a pod to another service
kubectl run -it --rm debug --image=busybox --restart=Never -- sh
# Inside the pod:
# (MariaDB doesn't speak HTTP, so wget will error out — what matters
# is "connection refused" vs. any answer from an open port)
wget -O- http://mariadb.control-plane.svc.cluster.local:3306
```

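Inside the same busybox pod, DNS resolution is often the real culprit; `nslookup` ships with busybox. The `nc -z` port probe is a sketch — busybox builds vary, and if `-z` isn't supported, a plain `nc <host> <port>` that connects (then Ctrl-C) proves the same thing:

```bash
# Does cluster DNS resolve the service name?
nslookup mariadb.control-plane.svc.cluster.local

# Is the port open? (-z: just probe, don't send data)
nc -z mariadb.control-plane.svc.cluster.local 3306 && echo open
```
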
### Clean Up Resources

```bash
# Completed/failed pods
kubectl delete pods --field-selector=status.phase=Failed -A
kubectl delete pods --field-selector=status.phase=Succeeded -A

# Old preview namespaces (see the bulk-delete sketch below)
kubectl get ns -l environment=preview
kubectl delete ns <preview-namespace>
```

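To clear every preview namespace in one pass instead of one by one — review the `get` output first, since this deletes everything the label selector matches:

```bash
# Bulk-delete all namespaces labeled environment=preview
kubectl get ns -l environment=preview -o name | xargs -r kubectl delete
```
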
---

## 📞 CONTACTS AND RESOURCES

### Support
- **CubePath**: https://cubepath.com/support
- **K3s Issues**: https://github.com/k3s-io/k3s/issues
- **Gitea**: https://discourse.gitea.io

### Central Logs
```bash
# All recent errors
kubectl get events -A --sort-by='.lastTimestamp' | grep -i error | tail -20
```

### Quick Backup
```bash
# Export the whole configuration
kubectl get all,ingress,certificate,pvc,secret -A -o yaml > cluster-backup.yaml

# Back up MariaDB
kubectl exec -n control-plane mariadb-0 -- \
  mariadb-dump -uroot -p'AiWorker2026_RootPass!' --all-databases > backup-$(date +%Y%m%d).sql
```

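Restoring that dump goes back through `kubectl exec -i`, which streams stdin into the pod. A sketch, using the dump file naming from above:

```bash
# Stream a dump back into the MariaDB pod
kubectl exec -i -n control-plane mariadb-0 -- \
  mariadb -uroot -p'AiWorker2026_RootPass!' < backup-YYYYMMDD.sql
```
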
---

## 🆘 EMERGENCY PROCEDURES

### Cluster not responding
```bash
# SSH to the control plane
ssh root@108.165.47.233

# Check K3s
systemctl status k3s
journalctl -u k3s -n 100

# Restart K3s (last resort)
systemctl restart k3s
```

### Node down
```bash
# Cordon (prevent new scheduling)
kubectl cordon <node-name>

# Drain (evict pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Investigate on the node itself
ssh root@<node-ip>
systemctl status k3s-agent
```

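Once the node is healthy again, don't forget to let the scheduler use it:

```bash
# Re-enable scheduling on the recovered node
kubectl uncordon <node-name>
```
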
### Storage corruption
```bash
# Open the Longhorn UI
open https://longhorn.fuq.tv

# List replicas
kubectl get replicas.longhorn.io -n longhorn-system

# Restore from a snapshot (if one exists)
# Via Longhorn UI → Volume → Create from Snapshot
```

---

## 💡 TIPS

### Fast Development
```bash
# Auto-reload for the backend
bun --watch src/index.ts

# Tail logs in real time
kubectl logs -f deployment/backend -n control-plane

# Port-forward for testing (see below)
kubectl port-forward svc/backend 3000:3000 -n control-plane
```

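With that port-forward running, the backend's health endpoint (the same path used for the in-cluster check below) answers locally:

```bash
# Hit the forwarded backend from your workstation
curl http://localhost:3000/api/health
```
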
### Networking Debug
```bash
# Test from outside the cluster
curl -v https://api.fuq.tv

# Test from inside the cluster
kubectl run curl --image=curlimages/curl -it --rm -- sh
curl http://backend.control-plane.svc.cluster.local:3000/api/health
```

### Performance
```bash
# Resource usage
kubectl top pods -n control-plane
kubectl top nodes

# Biggest consumers
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu
```

---

## 🔗 QUICK LINKS

- **Cluster info**: `CLUSTER-READY.md`
- **Credentials**: `CLUSTER-CREDENTIALS.md`
- **Roadmap**: `ROADMAP.md`
- **Next session**: `NEXT-SESSION.md`
- **Agent guide**: `AGENT-GUIDE.md`
- **Container Registry**: `docs/CONTAINER-REGISTRY.md`

---

**If none of this works, check the full docs in `/docs` or contact the team.**