# 🔧 Troubleshooting Guide

Quick-reference guide to common problems.

---

## 🚨 COMMON PROBLEMS

### 1. Can't Access the Cluster

**Symptoms**: `kubectl` doesn't connect

**Fix**:

```bash
# Check the kubeconfig
export KUBECONFIG=~/.kube/aiworker-config
kubectl cluster-info

# If that fails, re-download it
ssh root@108.165.47.233 "cat /etc/rancher/k3s/k3s.yaml" | \
  sed 's/127.0.0.1/108.165.47.233/g' > ~/.kube/aiworker-config
```

### 2. Pod in CrashLoopBackOff

**Symptoms**: Pod restarts constantly

**Diagnosis**:

```bash
# View logs
kubectl logs <pod> -n <namespace>

# View the previous container's logs
kubectl logs <pod> -n <namespace> --previous

# Describe the pod
kubectl describe pod <pod> -n <namespace>
```

**Common causes**:
- Missing environment variable
- Secret doesn't exist
- Can't connect to the DB
- Port already in use

### 3. Ingress Not Resolving (502/503/504)

**Symptoms**: URL returns a gateway error

**Diagnosis**:

```bash
# Check the Ingress
kubectl get ingress -n <namespace>
kubectl describe ingress <name> -n <namespace>

# Check the Service
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>

# Nginx Ingress logs
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller --tail=100 | grep <domain>
```

**Check that**:
- The Service selector is correct
- The Pod is Running and Ready
- The Service uses the right port

### 4. TLS Certificate Not Issued

**Symptoms**: Certificate stuck in `False` state

**Diagnosis**:

```bash
# View the certificate
kubectl get certificate -n <namespace>
kubectl describe certificate <name> -n <namespace>

# View the CertificateRequest
kubectl get certificaterequest -n <namespace>

# View the Challenge (HTTP-01)
kubectl get challenge -n <namespace>
kubectl describe challenge <name> -n <namespace>

# cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager --tail=50
```

**Common causes**:
- DNS doesn't point at the LBs
- Firewall blocks port 80
- Ingress lacks the cert-manager annotation

**Fix**:

```bash
# Check DNS
dig +short <domain>
# Should show: 108.165.47.221, 108.165.47.203

# Delete and recreate the certificate
kubectl delete certificate <name> -n <namespace>
kubectl delete secret <tls-secret> -n <namespace>
# Recreate the ingress
```
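The DNS check from the TLS section can be scripted so it fails loudly when an A record for one of the two LB IPs is missing. This is a minimal sketch: the LB IPs come from this guide, but `check_lb_ips` is a hypothetical helper and the resolved list is stubbed here for illustration (against a live domain you would feed it the output of `dig +short <domain>`).

```shell
#!/bin/sh
# Sketch: verify that a domain's resolved A records include both LB IPs.
# IPs are the load balancers documented in this guide.
EXPECTED="108.165.47.221 108.165.47.203"

check_lb_ips() {
    resolved="$1"   # whitespace-separated list of resolved A records
    missing=""
    for ip in $EXPECTED; do
        case " $resolved " in
            *" $ip "*) ;;                  # this LB IP is present
            *) missing="$missing $ip" ;;   # this LB IP was not resolved
        esac
    done
    if [ -n "$missing" ]; then
        echo "MISSING:$missing"
        return 1
    fi
    echo "OK"
}

# Stubbed examples instead of: check_lb_ips "$(dig +short <domain> | tr '\n' ' ')"
check_lb_ips "108.165.47.221 108.165.47.203"     # prints: OK
check_lb_ips "108.165.47.221" || true            # prints: MISSING: 108.165.47.203
```

A non-zero return lets you chain this into a health check (`check_lb_ips ... || alert`).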
### 5. PVC Stuck in Pending

**Symptoms**: PVC won't bind

**Diagnosis**:

```bash
# View the PVC
kubectl get pvc -n <namespace>
kubectl describe pvc <name> -n <namespace>

# View available PVs
kubectl get pv

# View Longhorn volumes
kubectl get volumes.longhorn.io -n longhorn-system
```

**Fix**:

```bash
# Check the Longhorn UI
open https://longhorn.fuq.tv

# Longhorn logs
kubectl logs -n longhorn-system daemonset/longhorn-manager --tail=50
```

### 6. Gitea Actions Not Running

**Symptoms**: Workflow doesn't trigger

**Diagnosis**:

```bash
# Check the runner
kubectl get pods -n gitea-actions
kubectl logs -n gitea-actions deployment/gitea-runner -c runner --tail=100

# Check in the Gitea UI
open https://git.fuq.tv/admin/aiworker-backend/actions
```

**Fix**:

```bash
# Restart the runner
kubectl rollout restart deployment/gitea-runner -n gitea-actions

# Verify the runner is registered
kubectl logs -n gitea-actions deployment/gitea-runner -c runner | grep "registered"

# Push again to trigger the workflow
git commit --allow-empty -m "Trigger workflow"
git push
```

### 7. MariaDB Not Connecting

**Symptoms**: `Connection refused` or `Access denied`

**Diagnosis**:

```bash
# Check the pod
kubectl get pods -n control-plane mariadb-0

# View logs
kubectl logs -n control-plane mariadb-0

# Connection test
kubectl exec -n control-plane mariadb-0 -- \
  mariadb -uaiworker -pAiWorker2026_UserPass! -e "SELECT 1"
```

**Correct credentials**:

```
Host: mariadb.control-plane.svc.cluster.local
Port: 3306
User: aiworker
Pass: AiWorker2026_UserPass!
DB:   aiworker
```
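During rollouts the MariaDB connection test often fails a few times before the pod is ready, so it helps to retry with a bounded number of attempts before concluding the DB is down. A minimal sketch: `wait_for` and `probe_cmd` are hypothetical helpers; against the cluster the probe would be the `kubectl exec ... -e "SELECT 1"` test above, stubbed here so the example runs anywhere.

```shell
#!/bin/sh
# Sketch: retry a probe command up to N times, one second apart.
wait_for() {
    attempts="$1"; shift
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            echo "ready after $i attempt(s)"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "gave up after $attempts attempts"
    return 1
}

# Stub probe that succeeds on the 3rd call (simulates a DB coming up).
COUNTER_FILE=$(mktemp)
echo 0 > "$COUNTER_FILE"
probe_cmd() {
    n=$(($(cat "$COUNTER_FILE") + 1))
    echo "$n" > "$COUNTER_FILE"
    [ "$n" -ge 3 ]
}

wait_for 5 probe_cmd   # prints: ready after 3 attempt(s)
rm -f "$COUNTER_FILE"
```

Passing the probe as `"$@"` keeps its arguments intact, so a real call could look like `wait_for 10 kubectl exec -n control-plane mariadb-0 -- mariadb -uaiworker -p... -e "SELECT 1"`.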
### 8. Load Balancer Not Responding

**Symptoms**: `curl https://<domain>` times out

**Diagnosis**:

```bash
# Check HAProxy
ssh root@108.165.47.221 "systemctl status haproxy"
ssh root@108.165.47.203 "systemctl status haproxy"

# View stats
open http://108.165.47.221:8404/stats
# User: admin / aiworker2026

# Test a worker directly
curl http://108.165.47.225:32388  # Ingress NodePort
```

**Fix**:

```bash
# Restart HAProxy
ssh root@108.165.47.221 "systemctl restart haproxy"
ssh root@108.165.47.203 "systemctl restart haproxy"

# Check the config
ssh root@108.165.47.221 "cat /etc/haproxy/haproxy.cfg"
```

---

## 🔍 GENERAL DIAGNOSTIC COMMANDS

### Cluster Status

```bash
# Nodes
kubectl get nodes -o wide

# Resources
kubectl top nodes
kubectl top pods -A

# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# Problem pods
kubectl get pods -A | grep -v Running
```

### Check Connectivity

```bash
# From one pod to another service
kubectl run -it --rm debug --image=busybox --restart=Never -- sh
# Inside the pod:
wget -O- http://mariadb.control-plane.svc.cluster.local:3306
```

### Clean Up Resources

```bash
# Completed/failed pods
kubectl delete pods --field-selector=status.phase=Failed -A
kubectl delete pods --field-selector=status.phase=Succeeded -A

# Old preview namespaces
kubectl get ns -l environment=preview
kubectl delete ns <namespace>
```

---

## 📞 CONTACTS AND RESOURCES

### Support

- **CubePath**: https://cubepath.com/support
- **K3s Issues**: https://github.com/k3s-io/k3s/issues
- **Gitea**: https://discourse.gitea.io

### Central Logs

```bash
# All recent errors
kubectl get events -A --sort-by='.lastTimestamp' | grep -i error | tail -20
```

### Quick Backup

```bash
# Export the entire configuration
kubectl get all,ingress,certificate,pvc,secret -A -o yaml > cluster-backup.yaml

# Back up MariaDB
kubectl exec -n control-plane mariadb-0 -- \
  mariadb-dump -uroot -pAiWorker2026_RootPass! \
  --all-databases > backup-$(date +%Y%m%d).sql
```

---

## 🆘 EMERGENCY PROCEDURES

### Cluster Not Responding

```bash
# SSH into the control plane
ssh root@108.165.47.233

# Check K3s
systemctl status k3s
journalctl -u k3s -n 100

# Restart K3s (last resort)
systemctl restart k3s
```

### Node Down

```bash
# Cordon (stop scheduling onto it)
kubectl cordon <node>

# Drain (move pods off it)
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# Investigate on the node
ssh root@<node-ip>
systemctl status k3s-agent
```

### Storage Corruption

```bash
# Check the Longhorn UI
open https://longhorn.fuq.tv

# View replicas
kubectl get replicas.longhorn.io -n longhorn-system

# Restore from a snapshot (if one exists)
# Via Longhorn UI → Volume → Create from Snapshot
```

---

## 💡 TIPS

### Fast Development

```bash
# Auto-reload for the backend
bun --watch src/index.ts

# Tail logs in real time
kubectl logs -f deployment/backend -n control-plane

# Port-forward for testing
kubectl port-forward svc/backend 3000:3000 -n control-plane
```

### Networking Debug

```bash
# Test from outside the cluster
curl -v https://api.fuq.tv

# Test from inside the cluster
kubectl run curl --image=curlimages/curl -it --rm -- sh
curl http://backend.control-plane.svc.cluster.local:3000/api/health
```

### Performance

```bash
# View resource usage
kubectl top pods -n control-plane
kubectl top nodes

# Find the heaviest pods
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu
```

---

## 🔗 QUICK LINKS

- **Cluster Info**: `CLUSTER-READY.md`
- **Credentials**: `CLUSTER-CREDENTIALS.md`
- **Roadmap**: `ROADMAP.md`
- **Next session**: `NEXT-SESSION.md`
- **Agent guide**: `AGENT-GUIDE.md`
- **Container Registry**: `docs/CONTAINER-REGISTRY.md`

---

**If none of this works, check the full docs in `/docs` or contact the team.**
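As a closing sketch, the "problem pods" one-liner (`kubectl get pods -A | grep -v Running`) can be made a little more precise with `awk`, since `grep -v Running` also hides `Completed` jobs you may want listed separately and matches "Running" anywhere in the line. The sample below is fabricated illustrative output, not from the real cluster; live usage would pipe `kubectl get pods -A --no-headers` into the same `awk` filter.

```shell
#!/bin/sh
# Sketch: list pods whose STATUS column ($4) is neither Running nor Completed.
# Sample data mimics `kubectl get pods -A --no-headers` columns:
# NAMESPACE NAME READY STATUS RESTARTS AGE
sample='control-plane   backend-7d9f   1/1   Running            0    2d
control-plane   worker-5c2a    0/1   CrashLoopBackOff   12   3h
gitea-actions   runner-9b1e    1/1   Running            0    5d
preview-42      app-1f0c       0/1   ImagePullBackOff   4    1h'

echo "$sample" | awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
# prints:
#   control-plane/worker-5c2a: CrashLoopBackOff
#   preview-42/app-1f0c: ImagePullBackOff
```

Filtering on the exact column avoids false negatives such as a pod named `running-jobs-x` slipping past `grep -v Running`.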