👋 Hi, I'm Gonzalo Moreno

Senior QA Automation Engineer · SDET · AI Quality Engineer

I build automation frameworks from scratch and design test suites for LLMs, chatbots and RAG systems, validating quality, safety, robustness and regression. 10+ years across financial services, telecom, retail and cybersecurity.


10+ Years of QA experience

40 % test coverage gain

70 % less test definition effort

0 % faster deployments with tooling

0 % regression time reduction

Selected companies I’ve helped improve quality at

Get to Know Me

About Me

ISTQB CTFL, CT-TAS, CT-MAT, CT-GenAI

Sectors

Financial Services Telecom Retail Cybersecurity

Languages

Spanish — Native English — Intermediate

Senior QA Automation Engineer | QA Architect

Senior QA Automation Engineer with 10+ years of experience across the financial services, telecommunications, retail and cybersecurity sectors. ISTQB CTFL, CT-TAS, CT-MAT and CT-GenAI certified, specializing in designing and building automation frameworks from scratch, including test architecture, CI/CD integration (GitLab, Jenkins, Azure DevOps) and automated reporting.

Hands-on expertise with Playwright, Selenium, Robot Framework, Cucumber, REST Assured, end-to-end and integration testing, plus performance testing with JMeter. Strong background in REST API testing, Python and Java development, and the creation of AI/LLM-based testing tools. Successfully led Agile adoption and BDD/TDD practices, improving delivery workflows and overall product reliability.

Test Automation

Expert in Playwright, Selenium, Robot Framework, REST Assured

CI/CD Integration

Jenkins, GitLab CI, Azure DevOps, GitHub Actions

AI-Powered Testing

LLMs, test generation, failure analysis automation

View GitHub

Career

Experience

10+ years building QA strategy, automation frameworks and AI-powered tooling across finance, telecom, retail and cybersecurity.

Currently

Senior QA Automation Engineer

MTP — Client: DIA Group (Retail) · Sep 2025 – Present

Frontend and backend automated tests with Playwright and Karate DSL. Reducing regression execution time through parallelisation, locator optimisation and selective execution strategies.
Building AI agents and skills for the Playwright automation project and QA tasks in general, plus internal tooling for test data generation.
Member of the company's AI Working Group — contributing solutions and sharing knowledge around applied AI for QA.
Data-driven testing integrated with Google Cloud (BigQuery + Pub/Sub) for business-logic validation and async event flows.

Playwright Karate DSL Generative AI BigQuery Pub/Sub

Sep 2022 – Aug 2025

Senior QA Automation Engineer

Devo · Cybersecurity SaaS

Designed E2E and integration testing with Cucumber + REST Assured + Java. Built LLM-powered QA tooling integrated into GitLab CI pipelines — +40% test coverage and −70% test definition effort. Performance testing with JMeter and security testing with OWASP ZAP.

Cucumber REST Assured Java LLMs GitLab CI JMeter OWASP ZAP
Jul 2019 – Sep 2022

QA Architect & QA Lead

VASS · Clients: Prosegur, Redsys, Bankinter

Built a complete QA framework from scratch (Robot Framework, Jenkins + Azure DevOps, JMeter for HTTP / Azure Service Bus / MQTT). Performance monitoring with Azure Application Insights and Grafana. Internal product QAWAT.

Robot Framework Jenkins Azure DevOps JMeter Grafana Katalon
Dec 2018 – Jul 2019

QA Automation Engineer

MásMóvil (via MTP) · Telecom

API and web automation (Postman/Newman, Katalon, Selenium) and Jenkins CI pipelines. API-based test data generators and JMeter load testing. E2E and integration testing in Agile.

Selenium Postman Katalon Jenkins
Apr 2016 – Dec 2018

QA Analyst

GFI Informática · Clients: BBVA & DGT

Functional and automated testing for BBVA with an internal Java/Selenium framework and BDD (Cucumber/Gherkin) in Agile squads. Earlier at DGT: manual + UFT automation with TestLink/Mantis.

Selenium Java Cucumber UFT
Jul 2014 – Mar 2016

IT Analyst & QA Junior

Cetelem (BNP Paribas) · Financial Services

Project office, IT & product coordination, and compliance. First step into IT before specialising in QA.

PMO Compliance

Publication

QA Manual for AI Systems

LLMs · RAG · Chatbots · Agents — a professional reference guide (2026 Edition).

Cover of the QA Manual for AI Systems (Manual de QA para Sistemas de Inteligencia Artificial)

114 pages
33 chapters
4 appendices
2026 edition

RAGAS LLM-as-Judge OWASP LLM Top 10 Golden Datasets CI/CD Quality Gates Robustness Agents

Manual content in Spanish

Get it on Amazon Read preview

Manual table of contents 33 chapters · 4 appendices

01. Introduction to generative AI
02. Technical foundations of LLMs
03. Risks, quality and regulatory framework
04. Why QA AI is not traditional QA
05. Taxonomy of conversational systems
06. RAG: Retrieval-Augmented Generation
07. RAGAS metrics and RAG evaluation
08. LLM-as-Judge: model-based evaluation
09. Golden datasets and reference data
10. Chatbot testing: manual and automated
11. Semantic evaluation and similarity
12. Robustness and perturbation testing
13. Semantic drift and monitoring
14. Security: OWASP LLM Top 10 (2025)
15. Prompt injection and adversarial attacks
16. Multi-turn and context evaluation
17. Hallucinations: types and detection
18. CI/CD and quality gates in AI pipelines
19. Observability and traceability
20. Tools: RAGAS, TruLens, DeepEval
21. Testing agents and multi-agent systems
22. Antipatterns and errors in LLM evaluation
23. End-to-end QA AI strategy
24. Prompt Regression Testing
25. Bias, Toxicity and Safety Testing
26. Operational antipatterns and red flags
27. Cost-aware QA and cost observability
28. Privacy and PII leakage testing
29. Advanced retrieval and testing
30. Function calling and tool use testing
31. Human-in-the-loop and inter-annotator agreement
32. Reproducibility and determinism
33. Consolidated glossary

Appendix A: Technical consolidation questions (45 questions)
Appendix B: References and recommended reading
Appendix C: Alphabetical index
Appendix D: End-to-end case study (regulated RAG chatbot)

Preview · Chapter 1 of 33

01. Introduction to generative AI

1.1 What we mean by AI in this manual

Before discussing how to test AI, we need to agree on what we are talking about. Artificial intelligence is the branch of computer science that builds systems able to perform tasks that previously only people could do: understanding language, recognizing objects, planning actions or generating new content. Russell & Norvig (2021), in the reference textbook Artificial Intelligence: A Modern Approach (4th ed.), build the canonical definition of the field around the idea of an agent that acts rationally to maximize the expected value of a given objective.

That definition is useful but abstract. QA benefits from an operational definition, one that helps decide whether what we have in front of us is "traditional software" or "AI software":

AI is any system whose behavior is learned from data rather than programmed line by line with rules.

That difference, learned behavior instead of programmed behavior, is what breaks the pillars of classic QA. In traditional software, deterministic code returns the same output for the same input, so it makes sense to talk about branch coverage, test oracles based on exact equality, and byte-for-byte regression comparison. In AI, the same input can produce different responses, a "correct" oracle does not always exist, and regression is measured over distributions. The rest of the manual explains how QA adapts to this new reality.
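The contrast between an exact-equality oracle and a distribution-level check can be sketched as follows. This is a minimal illustration, not the manual's own method: `fake_model`, `is_acceptable` and the 0.6 threshold are hypothetical stand-ins for a sampled LLM call, an acceptance criterion and a release gate.

```python
import random

def classic_add(a: int, b: int) -> int:
    """Deterministic function: same input, same output."""
    return a + b

def fake_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a sampled LLM call; the wording varies per call."""
    return rng.choice([
        "Madrid is the capital of Spain.",
        "The capital of Spain is Madrid.",
        "Spain's capital city is Madrid.",
        "I am not sure.",
    ])

def is_acceptable(answer: str) -> bool:
    """Hypothetical acceptance criterion: the answer must mention Madrid."""
    return "madrid" in answer.lower()

def pass_rate(prompt: str, n: int, seed: int) -> float:
    """Fraction of n sampled answers that satisfy the criterion."""
    rng = random.Random(seed)
    hits = sum(is_acceptable(fake_model(prompt, rng)) for _ in range(n))
    return hits / n

# Classic oracle: exact equality on a single run.
assert classic_add(2, 3) == 5

# Distribution-level oracle: a threshold on the pass rate over many samples.
THRESHOLD = 0.6  # assumed gate, chosen only for this example
rate = pass_rate("What is the capital of Spain?", n=200, seed=7)
regression_ok = rate >= THRESHOLD
print(f"pass rate: {rate:.2f}, gate passed: {regression_ok}")
```

A single failing sample does not fail the check; only a drop in the aggregate rate does, which is the sense in which regression is measured over distributions.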

When the press talks about "AI", it mixes three very different levels. It helps to keep them apart:

Weak AI or narrow AI. Systems that solve a single task or a bounded domain. All AI in production in 2026 is narrow.
Artificial general intelligence (AGI). A hypothetical system with cognitive abilities comparable to human ones in any domain. It does not exist.
Superhuman AI. A hypothetical system that surpasses humans at any cognitive task. Out of scope.

1.2 The AI spectrum

Not all AI is generative AI. Four families or paradigms coexist in a real system, each with its own way of working and its own implications for QA. The following table places them from lowest to highest statistical complexity:

Table 1.1: Spectrum of AI paradigms and their footprint on QA.
Paradigm	How it works	Example	QA implication
Symbolic AI	Explicit logical rules	Declarative safety filter, tool allowlist	Rule-based tests (classic software)
Classic ML	Learning from engineered features	Intent classifier, PII detector	Statistical metrics (accuracy, precision, recall)
Deep learning	Neural networks that learn representations	CNN for images, RNN for text	Statistical metrics + subgroup analysis
Generative AI	Models that generate new content	LLM, diffusion model, multimodal	Semantic metrics, faithfulness, robustness

In 2026, most conversational systems in production are generative AI built on transformer architectures (the engine behind ChatGPT, Claude or Gemini). Symbolic AI has not disappeared: it lives on in declarative guardrails, JSON schema validation and business rule engines. A mature QA system tests both paradigms with different metrics; the following chapters develop this.
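As an illustration of the symbolic side, a declarative guardrail can be tested like classic software, with exact expected results. The sketch below assumes a model that proposes tool calls as JSON with `tool` and `args` fields; that payload shape and the tool names are hypothetical.

```python
import json

# Hypothetical allowlist; a real system would load it from configuration.
ALLOWED_TOOLS = {"search_docs", "get_order_status"}

def validate_tool_call(raw: str) -> tuple[bool, str]:
    """Rule-based check of a model-proposed tool call encoded as JSON."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(call, dict):
        return False, "payload is not an object"
    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        return False, "tool not in allowlist"
    if not isinstance(call.get("args"), dict):
        return False, "args must be an object"
    return True, "ok"

# Deterministic rules allow exact-equality assertions.
print(validate_tool_call('{"tool": "search_docs", "args": {"q": "refund"}}'))
print(validate_tool_call('{"tool": "delete_db", "args": {}}'))
```

Because the guardrail is deterministic, each rule maps to one test case with a fixed expected outcome, unlike the generative component it protects.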

1.3 Types of machine learning

Within machine learning (ML) there are several ways of learning, and each is measured with different metrics. Knowing which type of learning the subsystem under test uses determines whether precision/recall, perplexity or win-rate is the better measure. The five canonical types are:

Table 1.2: Types of machine learning and their relevance to QA AI.
Type	What it learns	Example in QA AI	Typical metric
Supervised	Input → label mapping from labeled data	Intent classifier; PII detector	Precision, recall, F1
Unsupervised	Latent structure without labels	Query clustering	Silhouette, ARI
Semi-supervised	Scarce labels + unlabeled pools	Bootstrapping a golden dataset	Supervised metric on a validation set
Self-supervised	Labels derived from the input itself	LLM pre-training (next-token prediction)	Perplexity, downstream eval
Reinforcement	Policy learned from reward	RLHF for alignment	Reward model score, win-rate

A key detail: a modern LLM does not use a single type of learning but chains three of them. First it learns language in a self-supervised way (pre-training); then it is taught to follow instructions in a supervised way (instruction tuning); finally it is aligned with preference feedback, through reinforcement learning or a related preference-optimization method (RLHF, DPO or RLAIF). §2.5 of Ch. 2 develops each stage and explains why each one opens different failure modes.
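Two of the metrics named in Table 1.2 can be computed in a few lines. The sketch below uses the standard definitions of precision, recall and F1 for a binary detector, and perplexity as the exponential of the mean negative log-likelihood per token; the labels and log-probabilities are made-up illustration data.

```python
import math

def precision_recall_f1(y_true, y_pred, positive="pii"):
    """Precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Supervised subsystem (for example, a PII detector): made-up labels.
y_true = ["pii", "pii", "ok", "ok", "pii"]
y_pred = ["pii", "ok", "ok", "pii", "pii"]
print(precision_recall_f1(y_true, y_pred))

# Self-supervised subsystem: made-up per-token log-probabilities.
# A model that assigns probability 0.5 to every token has perplexity 2.
print(perplexity([math.log(0.5)] * 4))
```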

1.4 Discriminative vs generative models

Within AI there are two broad ways to frame a problem: discriminating (choosing among known options) and generating (producing new content). A spam classifier discriminates between "spam / not spam"; an LLM such as GPT-4 generates text that did not exist before. The difference looks subtle, but it completely changes how the system is tested, because it changes the oracle: a discriminative task has one "correct" answer; a generative task has a practically unlimited number of valid answers.

Table 1.3: Discriminative vs generative models.
Aspect	Discriminative	Generative
What it models	P(label | input)	P(output | input), by sampling
Typical output	Class, score, numeric value	Text, image, audio, sequence
Determinism	High with a fixed threshold	Low by nature
Test oracle	Expected label	Semantic criterion (similarity, faithfulness)
Metrics	Confusion matrix, ROC/AUC, F1	RAGAS, LLM-as-Judge, embedding similarity
Cost per inference	Low	Medium to high
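A semantic oracle replaces "equal to the expected string" with "close enough to the expected meaning". The sketch below shows the mechanics with cosine similarity over word-count vectors, which is only a toy stand-in: a real suite would use vectors from an embedding model, and the 0.6 threshold is an assumption for this example.

```python
import math
import re
from collections import Counter

def bow_vector(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_match(expected: str, actual: str, threshold: float = 0.6) -> bool:
    """Oracle: pass when similarity reaches the (assumed) threshold."""
    sim = cosine_similarity(bow_vector(expected), bow_vector(actual))
    return sim >= threshold

expected = "The refund is processed within 5 business days."
actual = "Your refund is processed within 5 business days."
unrelated = "Our office is closed on public holidays."

print(expected == actual)                   # exact-equality oracle fails
print(semantic_match(expected, actual))     # semantic oracle passes
print(semantic_match(expected, unrelated))  # unrelated answer is rejected
```

Word overlap misses paraphrases that share no vocabulary, which is one reason embedding-based similarity or an LLM judge is used in practice.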

In practice, real systems combine both approaches. A typical RAG chatbot chains four components: an intent classifier (discriminative, decides what the question is about), a safety filter (discriminative, decides whether the question is legitimate), a retriever that ranks the most relevant documents (discriminative ranking) and, finally, a generative LLM that produces the answer. Each one is tested with its own metric.
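The four-stage chain can be sketched with stubs to show how each stage gets its own kind of assertion. Everything here is hypothetical: the keyword rules, the two documents and the echo-style generator stand in for a trained classifier, a safety model, a vector retriever and an LLM.

```python
# Hypothetical knowledge base for the example.
DOCS = {
    "d1": "Refunds are processed within 5 business days.",
    "d2": "Our stores open at 9:00 on weekdays.",
}

def classify_intent(question: str) -> str:
    """Discriminative stub: keyword rule standing in for an intent classifier."""
    return "refund" if "refund" in question.lower() else "other"

def is_safe(question: str) -> bool:
    """Discriminative stub: blocks one known injection phrase."""
    return "ignore previous instructions" not in question.lower()

def retrieve(question: str, k: int = 1) -> list[str]:
    """Ranking stub: orders documents by word overlap with the question."""
    q = set(question.lower().split())
    overlap = lambda d: len(q & set(DOCS[d].lower().split()))
    return sorted(DOCS, key=overlap, reverse=True)[:k]

def generate(question: str, doc_ids: list[str]) -> str:
    """Generative stub: echoes the retrieved context instead of calling an LLM."""
    return " ".join(DOCS[d] for d in doc_ids)

question = "How long do refunds take?"

# One assertion style per component.
assert classify_intent(question) == "refund"   # expected label
assert is_safe(question)                       # expected label
assert retrieve(question) == ["d1"]            # ranking: hit at position 1
answer = generate(question, retrieve(question))
assert "5 business days" in answer             # grounded in retrieved context
print(answer)
```

Testing the stages separately makes it possible to tell a retrieval miss from a generation error, which a single end-to-end assertion cannot do.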

1.5 The generative AI landscape in 2026

Generative AI is not just "text chatbots". Each modality (text, code, image, audio, video, multimodal) has its typical use cases, its reference models and, above all, its specific tests. This manual mainly covers text, but knowing the rest of the landscape helps you place the projects you will work on:

Table 1.4: Generative AI landscape by modality (revised 2026-04).
Modality	Tasks	Examples	Specific tests
Text	Q&A, summarization, drafting, code	GPT-4o, Claude Sonnet 4.x, Llama 3/4, Mistral, Qwen	Faithfulness, factual accuracy, robustness
Code	Autocomplete, refactoring	Copilot, Cursor, Codeium	Compiles, tests pass, no vulnerabilities
Image	Text-to-image, editing	DALL·E 3, Stable Diffusion, Midjourney	Prompt adherence, NSFW, watermarking
Audio	TTS, STT, generation	Whisper, ElevenLabs, OpenVoice	WER, MOS, TTFT latency
Video	Generation, editing	Sora, Veo, Runway	Temporal coherence, fidelity
Multimodal	Text + image + audio	GPT-4o, Claude Sonnet 4.x, Gemini 2	Cross-modal grounding, cross-modal hallucination
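Of the audio tests in Table 1.4, WER (word error rate) is the easiest to pin down: the number of word substitutions, deletions and insertions needed to turn the transcript into the reference, divided by the number of reference words. A minimal sketch using word-level edit distance, with made-up example sentences:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

# One deletion ("the") and one substitution ("lights" -> "light")
# over five reference words.
print(word_error_rate("turn on the kitchen lights", "turn on kitchen light"))
```

Because insertions count as errors, WER can exceed 1.0 when the transcript is much longer than the reference.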

This manual focuses on the text / conversational modality, which has seen the widest adoption in enterprise production. Multimodality is introduced in §2.6 of Ch. 2, just enough to place it; a dedicated chapter is left for future editions.

1.6 Positioning of the manual

If you come from the ISTQB world, it is important to know where this manual fits relative to the existing certifications. ISTQB has two AI-related syllabi (CT-AI and CT-GenAI), and this manual aligns mostly with the first:

Table 1.5: Positioning of the manual.
Resource	Audience	Focus
ISTQB Foundation Level	Generalist QA	Classic software testing
ISTQB CT-AI v1.0	QA transitioning to AI	AI fundamentals; testing AI systems
ISTQB CT-GenAI v1.0	QA with a CT-AI background	Using generative AI to test software (AI-augmented testing)
Vendor manuals	Teams on a specific stack	Vendor-specific
This manual	QA AI Engineer / SDET / AI Quality Engineer	Practical; vertical focus on LLMs / RAG / chatbots / agents

1.7 Map of the manual

The body of the manual runs through thirty-three chapters in sequential order. Seven cross-cutting thematic groupings are layered over that sequence to help locate material by functional area. The grouping is conceptual, not structural: the chapters follow a linear reading order and the parts only indicate "what each block is about".

Part I: Foundations (Ch. 1-3). Groundwork for a QA Engineer in transition.
Part II: From traditional QA to QA AI (Ch. 4-5). Paradigm shift and taxonomy of systems.
Part III: Systems and architectures (Ch. 6, 21, 29-30). RAG, advanced retrieval, agents, function calling.
Part IV: Evaluation (Ch. 7-9, 11). RAGAS, LLM-as-Judge, semantic evaluation, golden datasets.
Part V: Testing domains (Ch. 10, 12, 16-17). Chatbots, robustness, multi-turn, hallucinations.
Part VI: Security, privacy and risk (Ch. 14-15, 25, 28). OWASP LLM, prompt injection, bias, PII.
Part VII: Operations and maturity (Ch. 13, 18-20, 22-24, 26-27, 31-33). CI/CD, observability, drift, reproducibility, costs, HITL, regression, tools, antipatterns, strategy, glossary.

Appendices A-D provide 45 consolidation questions, bibliographic references, an alphabetical index and an end-to-end case study.

1.8 Minimum reading path

114 pages cover a lot of ground, but you do not need to read them in strict order. If you come from classic QA with no prior experience in ML or LLMs, the most efficient path to becoming productive quickly is:

Ch. 1 (this one): overview and initial vocabulary.
Ch. 2: how an LLM works inside (tokens, embeddings, inference parameters, types of LLM).
Ch. 3: risks, quality and regulatory framework.
Ch. 4: why QA AI is not traditional QA (gateway to the main body).
Continue with the part that covers your use case (RAG → Ch. 6; safety → Ch. 14-15, 25; CI/CD → Ch. 18).

This is just chapter 1 of 33

The full manual is available on Amazon.

Get it on Amazon

My Work

Featured QA Projects

Explore my automation projects, testing frameworks and QA tools

Explore all my repos on GitHub

Technical Experience

Skills & Technologies

AI Quality Engineering

LLM Evaluation RAG Testing Prompt Engineering Prompt Regression Hallucination Testing OWASP LLM Top 10 LLM Observability Agentic AI Testing Evaluation Pipelines

AI Tooling

RAGAS DeepEval promptfoo BERTScore LangSmith Ollama ChromaDB Claude Code Gemini CLI Codex Skills HuggingFace LangChain LangGraph MCP Embeddings

Testing & QA

Playwright REST Assured Robot Framework Selenium Karate DSL Cucumber / BDD pytest TestNG JUnit Postman / Newman Appium Pact

Programming Languages

Python Java JavaScript TypeScript SQL Bash

CI/CD & DevOps

GitLab CI Jenkins GitHub Actions Azure DevOps Allure Reports SonarQube

Performance & Observability

JMeter Locust Azure Application Insights Grafana OWASP ZAP k6

Cloud & Infrastructure

Google Cloud Platform Azure Docker AWS Kubernetes

Data & Event-Driven

SQL BigQuery Pub/Sub Apache Kafka PostgreSQL MySQL Oracle SQL RabbitMQ

Let's Talk

Contact

Interested in collaborating or have a job opportunity? Contact me!

Partnering with remote EU teams

Contact Information

Choose the channel that suits you best and I will reply within one business day.

Response time

Replies within 24h (CET)

Preferred channels

Start with email or LinkedIn

Location

Madrid, Spain

gonzalomorenocominero@gmail.com

Send email

linkedin.com/in/gonzalomorenoc

Visit LinkedIn

GitHub

github.com/gonzaloMorenoc

View GitHub