Complete documentation for future sessions

- CLAUDE.md for AI agents to understand the codebase
- GITEA-GUIDE.md centralizes all Gitea operations (API, Registry, Auth)
- DEVELOPMENT-WORKFLOW.md explains complete dev process
- ROADMAP.md, NEXT-SESSION.md for planning
- QUICK-REFERENCE.md, TROUBLESHOOTING.md for daily use
- 40+ detailed docs in /docs folder
- Backend as submodule from Gitea

Everything documented for autonomous operation.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
Author: Hector Ros
Date: 2026-01-20 00:36:53 +01:00
Commit: db71705842
49 changed files with 19162 additions and 0 deletions

TROUBLESHOOTING.md (new file, 372 lines)
# 🔧 Troubleshooting Guide
Quick guide for resolving common problems.
---
## 🚨 COMMON PROBLEMS
### 1. Cannot access the cluster
**Symptoms**: `kubectl` cannot connect
**Solution**:
```bash
# Check the kubeconfig
export KUBECONFIG=~/.kube/aiworker-config
kubectl cluster-info
# If that fails, re-download it
ssh root@108.165.47.233 "cat /etc/rancher/k3s/k3s.yaml" | \
  sed 's/127.0.0.1/108.165.47.233/g' > ~/.kube/aiworker-config
```
### 2. Pod in CrashLoopBackOff
**Symptoms**: the pod restarts constantly
**Diagnosis**:
```bash
# View logs
kubectl logs <pod-name> -n <namespace>
# View logs of the previous container
kubectl logs <pod-name> -n <namespace> --previous
# Describe the pod
kubectl describe pod <pod-name> -n <namespace>
```
**Common causes** (see the sketch below):
- Missing environment variable
- Secret does not exist
- Cannot connect to the DB
- Port already in use
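A minimal sketch for ruling out the first two causes, using the same placeholder names as above; because the container is crash-looping, it reads the pod spec instead of exec'ing into it:
```bash
# Env vars the container is supposed to receive (and any envFrom references)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].env}'
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].envFrom}'
# Confirm that every referenced Secret / ConfigMap actually exists
kubectl get secret -n <namespace>
kubectl get configmap -n <namespace>
```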
### 3. Ingress not resolving (502/503/504)
**Symptoms**: the URL returns a gateway error
**Diagnosis**:
```bash
# Check the Ingress
kubectl get ingress -n <namespace>
kubectl describe ingress <name> -n <namespace>
# Check the Service
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>
# Nginx Ingress logs
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller --tail=100 | grep <domain>
```
**Check that** (see the sketch below):
- The Service selector is correct
- The Pod is Running and Ready
- The Service targets the correct port
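One way to work through those three checks, sketched with the same placeholder names:
```bash
# Service selector vs. the labels on the pods
kubectl get svc <name> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> --show-labels
# Empty ENDPOINTS means the selector or readiness does not match any pod
kubectl get endpoints <name> -n <namespace>
# Service port/targetPort vs. the port the container actually listens on
kubectl get svc <name> -n <namespace> -o jsonpath='{.spec.ports}'
```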
### 4. TLS certificate is not issued
**Symptoms**: the Certificate is stuck in `False` state
**Diagnosis**:
```bash
# Check the Certificate
kubectl get certificate -n <namespace>
kubectl describe certificate <name> -n <namespace>
# Check the CertificateRequest
kubectl get certificaterequest -n <namespace>
# Check the Challenge (HTTP-01)
kubectl get challenge -n <namespace>
kubectl describe challenge <name> -n <namespace>
# cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager --tail=50
```
**Common causes**:
- DNS does not point to the LBs
- Firewall blocks port 80
- Ingress is missing the cert-manager annotation (see the sketch below)
**Fix**:
```bash
# Verify DNS
dig <domain> +short
# Should return: 108.165.47.221, 108.165.47.203
# Delete and recreate the certificate
kubectl delete certificate <name> -n <namespace>
kubectl delete secret <name> -n <namespace>
# Recreate the ingress
```
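If the missing annotation is the culprit, a sketch of how to check and add it; the issuer name `letsencrypt-prod` is only a placeholder, use whatever `kubectl get clusterissuer` reports for this cluster:
```bash
# List the ClusterIssuers that actually exist
kubectl get clusterissuer
# Add (or fix) the cert-manager annotation on the Ingress
kubectl annotate ingress <name> -n <namespace> \
  cert-manager.io/cluster-issuer=letsencrypt-prod --overwrite
# The Ingress also needs a tls: section whose secretName matches the Certificate's secret
kubectl get ingress <name> -n <namespace> -o jsonpath='{.spec.tls}'
```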
### 5. PVC stuck in Pending
**Symptoms**: the PVC does not bind
**Diagnosis**:
```bash
# Check the PVC
kubectl get pvc -n <namespace>
kubectl describe pvc <name> -n <namespace>
# List available PVs
kubectl get pv
# List Longhorn volumes
kubectl get volumes.longhorn.io -n longhorn-system
```
**Fix**:
```bash
# Open the Longhorn UI
open https://longhorn.fuq.tv
# Longhorn logs
kubectl logs -n longhorn-system daemonset/longhorn-manager --tail=50
```
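A PVC also stays Pending when it requests a StorageClass that does not exist or when no default class is set; a quick check (sketch, with placeholder names):
```bash
# List StorageClasses and check which one is marked (default)
kubectl get storageclass
# Which class the PVC is requesting
kubectl get pvc <name> -n <namespace> -o jsonpath='{.spec.storageClassName}'
```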
### 6. Gitea Actions does not run
**Symptoms**: the workflow is not triggered
**Diagnosis**:
```bash
# Check the runner
kubectl get pods -n gitea-actions
kubectl logs -n gitea-actions deployment/gitea-runner -c runner --tail=100
# Check in the Gitea UI
open https://git.fuq.tv/admin/aiworker-backend/actions
```
**Fix**:
```bash
# Restart the runner
kubectl rollout restart deployment/gitea-runner -n gitea-actions
# Confirm the runner is registered
kubectl logs -n gitea-actions deployment/gitea-runner -c runner | grep "registered"
# Push again to trigger the workflow
git commit --allow-empty -m "Trigger workflow"
git push
```
### 7. MariaDB does not connect
**Symptoms**: `Connection refused` or `Access denied`
**Diagnosis**:
```bash
# Check the pod
kubectl get pods -n control-plane mariadb-0
# Check logs
kubectl logs -n control-plane mariadb-0
# Connection test
kubectl exec -n control-plane mariadb-0 -- \
  mariadb -uaiworker -pAiWorker2026_UserPass! -e "SELECT 1"
```
**Correct credentials**:
```
Host: mariadb.control-plane.svc.cluster.local
Port: 3306
User: aiworker
Pass: AiWorker2026_UserPass!
DB: aiworker
```
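To test those credentials from a local machine rather than inside the pod, a port-forward works; this sketch assumes the Service is named `mariadb` and a local `mariadb` client is installed:
```bash
# Forward the in-cluster Service to localhost
kubectl port-forward -n control-plane svc/mariadb 3306:3306 &
# Connect with the credentials above
mariadb -h 127.0.0.1 -P 3306 -uaiworker -p'AiWorker2026_UserPass!' aiworker -e "SHOW TABLES"
```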
### 8. Load Balancer does not respond
**Symptoms**: `curl https://<domain>` times out
**Diagnosis**:
```bash
# Check HAProxy
ssh root@108.165.47.221 "systemctl status haproxy"
ssh root@108.165.47.203 "systemctl status haproxy"
# Check the stats page
open http://108.165.47.221:8404/stats
# User: admin / aiworker2026
# Direct test against a worker
curl http://108.165.47.225:32388  # Ingress NodePort
```
**Fix**:
```bash
# Restart HAProxy
ssh root@108.165.47.221 "systemctl restart haproxy"
ssh root@108.165.47.203 "systemctl restart haproxy"
# Check the config
ssh root@108.165.47.221 "cat /etc/haproxy/haproxy.cfg"
```
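Before (or instead of) a blind restart, the config can be validated with HAProxy's check mode, sketched here for both LBs:
```bash
# Validate the HAProxy config on each LB; exits non-zero on syntax errors
ssh root@108.165.47.221 "haproxy -c -f /etc/haproxy/haproxy.cfg"
ssh root@108.165.47.203 "haproxy -c -f /etc/haproxy/haproxy.cfg"
```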
---
## 🔍 GENERAL DIAGNOSTIC COMMANDS
### Cluster Status
```bash
# Nodes
kubectl get nodes -o wide
# Resource usage
kubectl top nodes
kubectl top pods -A
# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
# Pods with problems
kubectl get pods -A | grep -v Running
```
### Checking Connectivity
```bash
# From a throwaway pod to another service
kubectl run -it --rm debug --image=busybox --restart=Never -- sh
# Inside the pod (MariaDB is not HTTP, so wget will report a protocol error,
# but getting any response proves the port is reachable; a timeout means it is not):
wget -O- http://mariadb.control-plane.svc.cluster.local:3306
```
### Cleaning Up Resources
```bash
# Completed/failed pods
kubectl delete pods --field-selector=status.phase=Failed -A
kubectl delete pods --field-selector=status.phase=Succeeded -A
# Old preview namespaces
kubectl get ns -l environment=preview
kubectl delete ns <preview-namespace>
```
---
## 📞 CONTACTS AND RESOURCES
### Support
- **CubePath**: https://cubepath.com/support
- **K3s Issues**: https://github.com/k3s-io/k3s/issues
- **Gitea**: https://discourse.gitea.io
### Central Logs
```bash
# All recent errors
kubectl get events -A --sort-by='.lastTimestamp' | grep -i error | tail -20
```
### Quick Backup
```bash
# Export the entire configuration
kubectl get all,ingress,certificate,pvc,secret -A -o yaml > cluster-backup.yaml
# Back up MariaDB
kubectl exec -n control-plane mariadb-0 -- \
  mariadb-dump -uroot -pAiWorker2026_RootPass! --all-databases > backup-$(date +%Y%m%d).sql
```
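Restoring that dump is roughly the reverse, piping the file into the client inside the pod (a sketch; `backup-<date>.sql` is whatever file the command above produced, and a restore should be tested somewhere non-critical first):
```bash
# Restore a dump by streaming it into the mariadb client in the pod
kubectl exec -i -n control-plane mariadb-0 -- \
  mariadb -uroot -p'AiWorker2026_RootPass!' < backup-<date>.sql
```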
---
## 🆘 EMERGENCY PROCEDURES
### Cluster not responding
```bash
# SSH to the control plane
ssh root@108.165.47.233
# Check K3s
systemctl status k3s
journalctl -u k3s -n 100
# Restart K3s (last resort)
systemctl restart k3s
```
### Node down
```bash
# Cordon (prevent new scheduling)
kubectl cordon <node-name>
# Drain (evict pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Investigate on the node
ssh root@<node-ip>
systemctl status k3s-agent
```
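Once the node is healthy again it stays unschedulable until it is uncordoned, so the recovery path ends with:
```bash
# Re-enable scheduling on the recovered node and confirm it is Ready
kubectl uncordon <node-name>
kubectl get nodes
```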
### Storage corruption
```bash
# Open the Longhorn UI
open https://longhorn.fuq.tv
# Check replicas
kubectl get replicas.longhorn.io -n longhorn-system
# Restore from snapshot (if one exists)
# Via Longhorn UI → Volume → Create from Snapshot
```
---
## 💡 TIPS
### Fast Development
```bash
# Auto-reload the backend
bun --watch src/index.ts
# Tail logs in real time
kubectl logs -f deployment/backend -n control-plane
# Port-forward for testing
kubectl port-forward svc/backend 3000:3000 -n control-plane
```
### Networking Debug
```bash
# Test from outside the cluster
curl -v https://api.fuq.tv
# Test from inside the cluster
kubectl run curl --image=curlimages/curl -it --rm -- sh
# Inside that pod:
curl http://backend.control-plane.svc.cluster.local:3000/api/health
```
### Performance
```bash
# Resource usage
kubectl top pods -n control-plane
kubectl top nodes
# Top consumers
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu
```
---
## 🔗 QUICK LINKS
- **Cluster Info**: `CLUSTER-READY.md`
- **Credentials**: `CLUSTER-CREDENTIALS.md`
- **Roadmap**: `ROADMAP.md`
- **Next session**: `NEXT-SESSION.md`
- **Agent guide**: `AGENT-GUIDE.md`
- **Container Registry**: `docs/CONTAINER-REGISTRY.md`
---
**If none of this works, check the full docs in `/docs` or contact the team.**