Complete documentation for future sessions

- CLAUDE.md for AI agents to understand the codebase
- GITEA-GUIDE.md centralizes all Gitea operations (API, Registry, Auth)
- DEVELOPMENT-WORKFLOW.md explains complete dev process
- ROADMAP.md, NEXT-SESSION.md for planning
- QUICK-REFERENCE.md, TROUBLESHOOTING.md for daily use
- 40+ detailed docs in /docs folder
- Backend as submodule from Gitea

Everything documented for autonomous operation.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
Author: Hector Ros
Date: 2026-01-20 00:36:53 +01:00
Commit: db71705842
49 changed files with 19162 additions and 0 deletions

TROUBLESHOOTING.md (new file, 372 lines)
# 🔧 Troubleshooting Guide
Quick guide for resolving common problems.
---
## 🚨 COMMON PROBLEMS
### 1. Cannot access the cluster
**Symptoms**: `kubectl` cannot connect
**Solution**:
```bash
# Check the kubeconfig
export KUBECONFIG=~/.kube/aiworker-config
kubectl cluster-info
# If that fails, re-download it
ssh root@108.165.47.233 "cat /etc/rancher/k3s/k3s.yaml" | \
  sed 's/127.0.0.1/108.165.47.233/g' > ~/.kube/aiworker-config
```
### 2. Pod in CrashLoopBackOff
**Symptoms**: the pod restarts constantly
**Diagnosis**:
```bash
# View logs
kubectl logs <pod-name> -n <namespace>
# View logs of the previous container
kubectl logs <pod-name> -n <namespace> --previous
# Describe the pod
kubectl describe pod <pod-name> -n <namespace>
```
**Common causes** (see the sketch below):
- Missing environment variable
- Secret does not exist
- Cannot connect to the DB
- Port already in use
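A minimal sketch for ruling out the first two causes, using the same placeholder names as above; because the container is crash-looping, it reads the pod spec instead of exec'ing into it:
```bash
# Env vars the container is supposed to receive (and any envFrom references)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].env}'
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].envFrom}'
# Confirm that every referenced Secret / ConfigMap actually exists
kubectl get secret -n <namespace>
kubectl get configmap -n <namespace>
```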
### 3. Ingress not resolving (502/503/504)
**Symptoms**: the URL returns a gateway error
**Diagnosis**:
```bash
# Check the Ingress
kubectl get ingress -n <namespace>
kubectl describe ingress <name> -n <namespace>
# Check the Service
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>
# Nginx Ingress logs
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller --tail=100 | grep <domain>
```
**Check that** (see the sketch below):
- The Service selector is correct
- The Pod is Running and Ready
- The Service targets the correct port
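One way to work through those three checks, sketched with the same placeholder names:
```bash
# Service selector vs. the labels on the pods
kubectl get svc <name> -n <namespace> -o jsonpath='{.spec.selector}'
kubectl get pods -n <namespace> --show-labels
# Empty ENDPOINTS means the selector or readiness does not match any pod
kubectl get endpoints <name> -n <namespace>
# Service port/targetPort vs. the port the container actually listens on
kubectl get svc <name> -n <namespace> -o jsonpath='{.spec.ports}'
```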
### 4. TLS certificate is not issued
**Symptoms**: the Certificate is stuck in `False` state
**Diagnosis**:
```bash
# Check the Certificate
kubectl get certificate -n <namespace>
kubectl describe certificate <name> -n <namespace>
# Check the CertificateRequest
kubectl get certificaterequest -n <namespace>
# Check the Challenge (HTTP-01)
kubectl get challenge -n <namespace>
kubectl describe challenge <name> -n <namespace>
# cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager --tail=50
```
**Common causes**:
- DNS does not point to the LBs
- Firewall blocks port 80
- Ingress is missing the cert-manager annotation (see the sketch below)
**Fix**:
```bash
# Verify DNS
dig <domain> +short
# Should return: 108.165.47.221, 108.165.47.203
# Delete and recreate the certificate
kubectl delete certificate <name> -n <namespace>
kubectl delete secret <name> -n <namespace>
# Recreate the ingress
```
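If the missing annotation is the culprit, a sketch of how to check and add it; the issuer name `letsencrypt-prod` is only a placeholder, use whatever `kubectl get clusterissuer` reports for this cluster:
```bash
# List the ClusterIssuers that actually exist
kubectl get clusterissuer
# Add (or fix) the cert-manager annotation on the Ingress
kubectl annotate ingress <name> -n <namespace> \
  cert-manager.io/cluster-issuer=letsencrypt-prod --overwrite
# The Ingress also needs a tls: section whose secretName matches the Certificate's secret
kubectl get ingress <name> -n <namespace> -o jsonpath='{.spec.tls}'
```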
### 5. PVC stuck in Pending
**Symptoms**: the PVC does not bind
**Diagnosis**:
```bash
# Check the PVC
kubectl get pvc -n <namespace>
kubectl describe pvc <name> -n <namespace>
# List available PVs
kubectl get pv
# List Longhorn volumes
kubectl get volumes.longhorn.io -n longhorn-system
```
**Fix**:
```bash
# Open the Longhorn UI
open https://longhorn.fuq.tv
# Longhorn logs
kubectl logs -n longhorn-system daemonset/longhorn-manager --tail=50
```
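A PVC also stays Pending when it requests a StorageClass that does not exist or when no default class is set; a quick check (sketch, with placeholder names):
```bash
# List StorageClasses and check which one is marked (default)
kubectl get storageclass
# Which class the PVC is requesting
kubectl get pvc <name> -n <namespace> -o jsonpath='{.spec.storageClassName}'
```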
### 6. Gitea Actions does not run
**Symptoms**: the workflow is not triggered
**Diagnosis**:
```bash
# Check the runner
kubectl get pods -n gitea-actions
kubectl logs -n gitea-actions deployment/gitea-runner -c runner --tail=100
# Check in the Gitea UI
open https://git.fuq.tv/admin/aiworker-backend/actions
```
**Fix**:
```bash
# Restart the runner
kubectl rollout restart deployment/gitea-runner -n gitea-actions
# Confirm the runner is registered
kubectl logs -n gitea-actions deployment/gitea-runner -c runner | grep "registered"
# Push again to trigger the workflow
git commit --allow-empty -m "Trigger workflow"
git push
```
### 7. MariaDB does not connect
**Symptoms**: `Connection refused` or `Access denied`
**Diagnosis**:
```bash
# Check the pod
kubectl get pods -n control-plane mariadb-0
# Check logs
kubectl logs -n control-plane mariadb-0
# Connection test
kubectl exec -n control-plane mariadb-0 -- \
  mariadb -uaiworker -pAiWorker2026_UserPass! -e "SELECT 1"
```
**Correct credentials**:
```
Host: mariadb.control-plane.svc.cluster.local
Port: 3306
User: aiworker
Pass: AiWorker2026_UserPass!
DB: aiworker
```
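To test those credentials from a local machine rather than inside the pod, a port-forward works; this sketch assumes the Service is named `mariadb` and a local `mariadb` client is installed:
```bash
# Forward the in-cluster Service to localhost
kubectl port-forward -n control-plane svc/mariadb 3306:3306 &
# Connect with the credentials above
mariadb -h 127.0.0.1 -P 3306 -uaiworker -p'AiWorker2026_UserPass!' aiworker -e "SHOW TABLES"
```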
### 8. Load Balancer does not respond
**Symptoms**: `curl https://<domain>` times out
**Diagnosis**:
```bash
# Check HAProxy
ssh root@108.165.47.221 "systemctl status haproxy"
ssh root@108.165.47.203 "systemctl status haproxy"
# Check the stats page
open http://108.165.47.221:8404/stats
# User: admin / aiworker2026
# Direct test against a worker
curl http://108.165.47.225:32388  # Ingress NodePort
```
**Fix**:
```bash
# Restart HAProxy
ssh root@108.165.47.221 "systemctl restart haproxy"
ssh root@108.165.47.203 "systemctl restart haproxy"
# Check the config
ssh root@108.165.47.221 "cat /etc/haproxy/haproxy.cfg"
```
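Before (or instead of) a blind restart, the config can be validated with HAProxy's check mode, sketched here for both LBs:
```bash
# Validate the HAProxy config on each LB; exits non-zero on syntax errors
ssh root@108.165.47.221 "haproxy -c -f /etc/haproxy/haproxy.cfg"
ssh root@108.165.47.203 "haproxy -c -f /etc/haproxy/haproxy.cfg"
```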
---
## 🔍 GENERAL DIAGNOSTIC COMMANDS
### Cluster Status
```bash
# Nodes
kubectl get nodes -o wide
# Resource usage
kubectl top nodes
kubectl top pods -A
# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
# Pods with problems
kubectl get pods -A | grep -v Running
```
### Checking Connectivity
```bash
# From a throwaway pod to another service
kubectl run -it --rm debug --image=busybox --restart=Never -- sh
# Inside the pod (MariaDB is not HTTP, so wget will report a protocol error,
# but getting any response proves the port is reachable; a timeout means it is not):
wget -O- http://mariadb.control-plane.svc.cluster.local:3306
```
### Cleaning Up Resources
```bash
# Completed/failed pods
kubectl delete pods --field-selector=status.phase=Failed -A
kubectl delete pods --field-selector=status.phase=Succeeded -A
# Old preview namespaces
kubectl get ns -l environment=preview
kubectl delete ns <preview-namespace>
```
---
## 📞 CONTACTS AND RESOURCES
### Support
- **CubePath**: https://cubepath.com/support
- **K3s Issues**: https://github.com/k3s-io/k3s/issues
- **Gitea**: https://discourse.gitea.io
### Central Logs
```bash
# All recent errors
kubectl get events -A --sort-by='.lastTimestamp' | grep -i error | tail -20
```
### Quick Backup
```bash
# Export the entire configuration
kubectl get all,ingress,certificate,pvc,secret -A -o yaml > cluster-backup.yaml
# Back up MariaDB
kubectl exec -n control-plane mariadb-0 -- \
  mariadb-dump -uroot -pAiWorker2026_RootPass! --all-databases > backup-$(date +%Y%m%d).sql
```
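Restoring that dump is roughly the reverse, piping the file into the client inside the pod (a sketch; `backup-<date>.sql` is whatever file the command above produced, and a restore should be tested somewhere non-critical first):
```bash
# Restore a dump by streaming it into the mariadb client in the pod
kubectl exec -i -n control-plane mariadb-0 -- \
  mariadb -uroot -p'AiWorker2026_RootPass!' < backup-<date>.sql
```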
---
## 🆘 EMERGENCY PROCEDURES
### Cluster not responding
```bash
# SSH to the control plane
ssh root@108.165.47.233
# Check K3s
systemctl status k3s
journalctl -u k3s -n 100
# Restart K3s (last resort)
systemctl restart k3s
```
### Node down
```bash
# Cordon (prevent new scheduling)
kubectl cordon <node-name>
# Drain (evict pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Investigate on the node
ssh root@<node-ip>
systemctl status k3s-agent
```
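Once the node is healthy again it stays unschedulable until it is uncordoned, so the recovery path ends with:
```bash
# Re-enable scheduling on the recovered node and confirm it is Ready
kubectl uncordon <node-name>
kubectl get nodes
```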
### Storage corruption
```bash
# Open the Longhorn UI
open https://longhorn.fuq.tv
# Check replicas
kubectl get replicas.longhorn.io -n longhorn-system
# Restore from snapshot (if one exists)
# Via Longhorn UI → Volume → Create from Snapshot
```
---
## 💡 TIPS
### Fast Development
```bash
# Auto-reload the backend
bun --watch src/index.ts
# Tail logs in real time
kubectl logs -f deployment/backend -n control-plane
# Port-forward for testing
kubectl port-forward svc/backend 3000:3000 -n control-plane
```
### Networking Debug
```bash
# Test from outside the cluster
curl -v https://api.fuq.tv
# Test from inside the cluster
kubectl run curl --image=curlimages/curl -it --rm -- sh
# Inside that pod:
curl http://backend.control-plane.svc.cluster.local:3000/api/health
```
### Performance
```bash
# Resource usage
kubectl top pods -n control-plane
kubectl top nodes
# Top consumers
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu
```
---
## 🔗 QUICK LINKS
- **Cluster Info**: `CLUSTER-READY.md`
- **Credentials**: `CLUSTER-CREDENTIALS.md`
- **Roadmap**: `ROADMAP.md`
- **Next session**: `NEXT-SESSION.md`
- **Agent guide**: `AGENT-GUIDE.md`
- **Container Registry**: `docs/CONTAINER-REGISTRY.md`
---
**If none of this works, check the full docs in `/docs` or contact the team.**