Análisis del rendimiento de modelos de aprendizaje automático sobre datos anonimizados

Marcos Sánchez de la Blanca, Carmen

dc.contributor.advisor	Sáinz-Pardo Díaz, Judith
dc.contributor.advisor	López García, Álvaro
dc.contributor.author	Marcos Sánchez de la Blanca, Carmen
dc.contributor.other	Universidad de Cantabria	es_ES
dc.date.accessioned	2023-10-23T18:41:00Z
dc.date.available	2023-10-23T18:41:00Z
dc.date.issued	2023-07
dc.identifier.uri	https://hdl.handle.net/10902/30293
dc.description.abstract	The large amount of open data available makes it necessary to study and develop techniques that guarantee the security of these data for subsequent processing and analysis. In particular, the study of anonymization techniques focuses on analyzing the distribution of quasi-identifiers and sensitive attributes in a database. Many techniques can be applied, each of which prevents different types of attacks. This study explores three classical anonymization techniques, their theoretical basis and the types of attacks they prevent: k-anonymity, ℓ-diversity and t-closeness. In addition, different tools are used to ensure the reliability of these techniques, which are applied at various levels to two open-access datasets after pre-defining different hierarchies for the quasi-identifiers. Next, the performance of a battery of machine learning models applied to the anonymized data is studied. A wide range of experimental results is generated by varying the anonymization technique employed as well as the level established. All the code is written in Python and distributed through an open-source repository. The datasets were anonymized using the ARX software.	es_ES
dc.format.extent	62 p.	es_ES
dc.language.iso	eng	es_ES
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	es_ES
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	*
dc.subject.other	Anonymization	es_ES
dc.subject.other	Machine learning	es_ES
dc.subject.other	Performance analysis	es_ES
dc.subject.other	Privacy	es_ES
dc.subject.other	k-anonymity	es_ES
dc.title	Análisis del rendimiento de modelos de aprendizaje automático sobre datos anonimizados	es_ES
dc.title.alternative	Analyzing the performance of machine learning models on anonymized data	es_ES
dc.type	info:eu-repo/semantics/masterThesis	es_ES
dc.rights.accessRights	openAccess	es_ES
dc.description.degree	Máster en Ciencia de Datos	es_ES
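
The abstract names k-anonymity and ℓ-diversity as properties of the equivalence classes induced by the quasi-identifiers. The sketch below is only an illustration of those two definitions in plain Python. It is not the thesis code and does not use ARX; the toy table, the column names (`age`, `zip`, `diagnosis`) and the helper function names are all assumptions made up for this example, and only the "distinct" variant of ℓ-diversity is shown. t-closeness is omitted because it requires choosing a distance between distributions.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in records:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row)
    return groups


def k_anonymity_level(records, quasi_identifiers):
    """Largest k for which the table is k-anonymous (smallest class size)."""
    groups = equivalence_classes(records, quasi_identifiers)
    return min(len(rows) for rows in groups.values())


def distinct_l_diversity_level(records, quasi_identifiers, sensitive):
    """Largest l for distinct l-diversity: the fewest distinct
    sensitive values found in any equivalence class."""
    groups = equivalence_classes(records, quasi_identifiers)
    return min(len({row[sensitive] for row in rows}) for rows in groups.values())


# Hypothetical, already-generalized toy table (not data from the thesis).
toy = [
    {"age": "30-39", "zip": "390**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "390**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "391**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "391**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "391**", "diagnosis": "diabetes"},
]

k = k_anonymity_level(toy, ["age", "zip"])
l = distinct_l_diversity_level(toy, ["age", "zip"], "diagnosis")
print(f"k = {k}, l = {l}")
```

In this toy table the two equivalence classes have 2 and 3 records, and each contains 2 distinct diagnoses, so the table is 2-anonymous and distinct 2-diverse with respect to the assumed quasi-identifiers.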