🔧 Troubleshooting Guide
Quick guide to diagnosing and fixing common problems.
🚨 COMMON PROBLEMS
1. Can't access the cluster
Symptoms: kubectl cannot connect
Solution:
# Verify the kubeconfig
export KUBECONFIG=~/.kube/aiworker-config
kubectl cluster-info
# If that fails, re-download it
ssh root@108.165.47.233 "cat /etc/rancher/k3s/k3s.yaml" | \
sed 's/127.0.0.1/108.165.47.233/g' > ~/.kube/aiworker-config
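If the download succeeds, a quick sanity check confirms the refreshed kubeconfig actually reaches the cluster:
# Should list all nodes as Ready
kubectl --kubeconfig ~/.kube/aiworker-config get nodes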
2. Pod in CrashLoopBackOff
Symptoms: Pod restarts constantly
Diagnosis:
# View the logs
kubectl logs <pod-name> -n <namespace>
# View logs from the previous container
kubectl logs <pod-name> -n <namespace> --previous
# Describe the pod
kubectl describe pod <pod-name> -n <namespace>
Common causes (see the check below):
- Missing environment variable
- Secret does not exist
- Cannot connect to the DB
- Port already in use
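A minimal sketch for ruling out the first two causes; <deployment> and <secret-name> are placeholders for the workload's real names:
# Show the env vars and secret references the container expects
kubectl get deployment <deployment> -n <namespace> -o jsonpath='{.spec.template.spec.containers[*].env}'
kubectl get deployment <deployment> -n <namespace> -o jsonpath='{.spec.template.spec.containers[*].envFrom}'
# Confirm the referenced Secret actually exists
kubectl get secret <secret-name> -n <namespace>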
3. Ingress not resolving (502/503/504)
Symptoms: URL returns a gateway error
Diagnosis:
# Check the Ingress
kubectl get ingress -n <namespace>
kubectl describe ingress <name> -n <namespace>
# Check the Service
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>
# Nginx Ingress controller logs
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller --tail=100 | grep <domain>
Check that (see the sketch below):
- The Service selector is correct
- The Pod is Running and Ready
- The Service targets the correct port
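A minimal sketch for the selector check; the selector shown must match the labels on the running pods, otherwise the Service has no endpoints:
# Show the Service selector
kubectl get svc <name> -n <namespace> -o jsonpath='{.spec.selector}'
# Show pod labels to compare against
kubectl get pods -n <namespace> --show-labels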
4. TLS Certificate not being issued
Symptoms: Certificate stuck with Ready=False
Diagnosis:
# View the certificate
kubectl get certificate -n <namespace>
kubectl describe certificate <name> -n <namespace>
# View the CertificateRequest
kubectl get certificaterequest -n <namespace>
# View the Challenge (HTTP-01)
kubectl get challenge -n <namespace>
kubectl describe challenge <name> -n <namespace>
# cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager --tail=50
Common causes:
- DNS does not point to the LBs
- Firewall blocks port 80
- Ingress is missing the cert-manager annotation
Fix:
# Verify DNS
dig <domain> +short
# Should return: 108.165.47.221, 108.165.47.203
# Delete and recreate the certificate
kubectl delete certificate <name> -n <namespace>
kubectl delete secret <name> -n <namespace>
# Recreate ingress
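After recreating, it helps to confirm the Ingress carries a cert-manager annotation (assuming this cluster issues certificates via a ClusterIssuer referenced from the Ingress) and to watch the new Certificate until READY flips to True:
# Annotations should include a cert-manager issuer reference
kubectl get ingress <name> -n <namespace> -o jsonpath='{.metadata.annotations}'
# Watch the re-issued certificate
kubectl get certificate -n <namespace> -w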
5. PVC stuck in Pending
Symptoms: PVC does not bind
Diagnosis:
# View the PVC
kubectl get pvc -n <namespace>
kubectl describe pvc <name> -n <namespace>
# View available PVs
kubectl get pv
# View Longhorn volumes
kubectl get volumes.longhorn.io -n longhorn-system
Fix:
# Open the Longhorn UI
open https://longhorn.fuq.tv
# Longhorn logs
kubectl logs -n longhorn-system daemonset/longhorn-manager --tail=50
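A cause not covered above is a missing or misspelled StorageClass; this check assumes the cluster's default class is Longhorn:
# List available StorageClasses (the default is marked)
kubectl get storageclass
# Compare against what the PVC requests
kubectl get pvc <name> -n <namespace> -o jsonpath='{.spec.storageClassName}'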
6. Gitea Actions not running workflows
Symptoms: Workflow does not trigger
Diagnosis:
# Check the runner
kubectl get pods -n gitea-actions
kubectl logs -n gitea-actions deployment/gitea-runner -c runner --tail=100
# Check in the Gitea UI
open https://git.fuq.tv/admin/aiworker-backend/actions
Fix:
# Restart runner
kubectl rollout restart deployment/gitea-runner -n gitea-actions
# Verify the runner is registered
kubectl logs -n gitea-actions deployment/gitea-runner -c runner | grep "registered"
# Push again to trigger the workflow
git commit --allow-empty -m "Trigger workflow"
git push
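If the runner looks healthy but nothing triggers, it is worth confirming the workflow file sits where Gitea Actions expects it; this assumes workflows live under .gitea/workflows/ in the repo:
# Workflow YAML files must be in this directory
ls .gitea/workflows/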
7. MariaDB not connecting
Symptoms: Connection refused or Access denied
Diagnosis:
# Check the pod
kubectl get pods -n control-plane mariadb-0
# View logs
kubectl logs -n control-plane mariadb-0
# Connection test
kubectl exec -n control-plane mariadb-0 -- \
mariadb -uaiworker -pAiWorker2026_UserPass! -e "SELECT 1"
Correct credentials:
Host: mariadb.control-plane.svc.cluster.local
Port: 3306
User: aiworker
Pass: AiWorker2026_UserPass!
DB: aiworker
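To test the same credentials from inside the cluster without exec'ing into the DB pod, a throwaway client pod works; the mariadb:11 image tag is an assumption, any recent MariaDB client image will do:
# One-off client pod, removed automatically on exit
kubectl run mariadb-client --rm -it --restart=Never --image=mariadb:11 -n control-plane -- \
  mariadb -h mariadb.control-plane.svc.cluster.local -uaiworker -pAiWorker2026_UserPass! aiworker -e "SELECT 1"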
8. Load Balancer not responding
Symptoms: curl https://<domain> times out
Diagnosis:
# Check HAProxy
ssh root@108.165.47.221 "systemctl status haproxy"
ssh root@108.165.47.203 "systemctl status haproxy"
# View stats
open http://108.165.47.221:8404/stats
# User: admin / aiworker2026
# Direct test against a worker
curl http://108.165.47.225:32388  # Ingress NodePort
Fix:
# Restart HAProxy
ssh root@108.165.47.221 "systemctl restart haproxy"
ssh root@108.165.47.203 "systemctl restart haproxy"
# Verify the config
ssh root@108.165.47.221 "cat /etc/haproxy/haproxy.cfg"
🔍 GENERAL DIAGNOSTIC COMMANDS
Cluster Status
# Nodes
kubectl get nodes -o wide
# Resource usage
kubectl top nodes
kubectl top pods -A
# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
# Pods with problems
kubectl get pods -A | grep -v Running
Verify Connectivity
# From one pod to another service
kubectl run -it --rm debug --image=busybox --restart=Never -- sh
# Inside the pod:
wget -O- http://mariadb.control-plane.svc.cluster.local:3306
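Note that MariaDB does not speak HTTP, so the wget above only proves the port is reachable before erroring out; if the busybox build includes nc with -z support, this is a cleaner check:
# -z: scan only, -v: report the result
nc -zv mariadb.control-plane.svc.cluster.local 3306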
Clean Up Resources
# Completed/failed pods
kubectl delete pods --field-selector=status.phase=Failed -A
kubectl delete pods --field-selector=status.phase=Succeeded -A
# Old preview namespaces
kubectl get ns -l environment=preview
kubectl delete ns <preview-namespace>
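To clear every preview namespace at once instead of one by one, deleting by label also works; review the list from the previous command first, this is destructive:
# Bulk-delete all preview namespaces
kubectl delete ns -l environment=preview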
📞 CONTACTS AND RESOURCES
Support
- CubePath: https://cubepath.com/support
- K3s Issues: https://github.com/k3s-io/k3s/issues
- Gitea: https://discourse.gitea.io
Central Logs
# All recent errors
kubectl get events -A --sort-by='.lastTimestamp' | grep -i error | tail -20
Quick Backup
# Export the entire configuration
kubectl get all,ingress,certificate,pvc,secret -A -o yaml > cluster-backup.yaml
# Backup MariaDB
kubectl exec -n control-plane mariadb-0 -- \
mariadb-dump -uroot -pAiWorker2026_RootPass! --all-databases > backup-$(date +%Y%m%d).sql
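A matching restore sketch, assuming a dump produced by the command above (replace the date in the filename):
# Stream the dump back into the MariaDB pod
kubectl exec -i -n control-plane mariadb-0 -- \
  mariadb -uroot -pAiWorker2026_RootPass! < backup-YYYYMMDD.sql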
🆘 EMERGENCY PROCEDURES
Cluster not responding
# SSH into the control plane
ssh root@108.165.47.233
# Check K3s
systemctl status k3s
journalctl -u k3s -n 100
# Restart K3s (last resort)
systemctl restart k3s
Node down
# Cordon (prevent scheduling)
kubectl cordon <node-name>
# Drain (move pods off the node)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Investigate on the node
ssh root@<node-ip>
systemctl status k3s-agent
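Once the node is healthy again, it must be uncordoned before pods will schedule back onto it:
# Re-enable scheduling on the recovered node
kubectl uncordon <node-name>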
Storage corruption
# Open the Longhorn UI
open https://longhorn.fuq.tv
# View replicas
kubectl get replicas.longhorn.io -n longhorn-system
# Restore from a snapshot (if one exists)
# Via Longhorn UI → Volume → Create from Snapshot
💡 TIPS
Rapid Development
# Auto-reload the backend
bun --watch src/index.ts
# Stream logs in real time
kubectl logs -f deployment/backend -n control-plane
# Port-forward for testing
kubectl port-forward svc/backend 3000:3000 -n control-plane
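With the port-forward active, the backend's health endpoint (used elsewhere in this guide) can be hit locally:
# Should return the health payload
curl http://localhost:3000/api/health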
Networking Debug
# Test from outside the cluster
curl -v https://api.fuq.tv
# Test from inside the cluster
kubectl run curl --image=curlimages/curl -it --rm -- sh
curl http://backend.control-plane.svc.cluster.local:3000/api/health
Performance
# View resource usage
kubectl top pods -n control-plane
kubectl top nodes
# Top-consuming pods
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu
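For node-level detail on requested versus allocatable resources, describing the node shows the Allocated resources summary:
# Per-node requests/limits summary
kubectl describe node <node-name> | grep -A 8 "Allocated resources"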
🔗 QUICK LINKS
- Cluster Info: CLUSTER-READY.md
- Credentials: CLUSTER-CREDENTIALS.md
- Roadmap: ROADMAP.md
- Next session: NEXT-SESSION.md
- Agent guide: AGENT-GUIDE.md
- Container Registry: docs/CONTAINER-REGISTRY.md
If none of this works, check the full docs in /docs or contact the team.