Optimización de tiempo de ejecución con PySpark de Hadoop de un análisis de sentimientos de tweets

Escobar Galaburda, Daria Angélica

dc.contributor.advisor	Palacio Hoz, Aida
dc.contributor.author	Escobar Galaburda, Daria Angélica
dc.contributor.other	Universidad de Cantabria	es_ES
dc.date.accessioned	2022-12-14T14:12:15Z
dc.date.available	2022-12-14T14:12:15Z
dc.date.issued	2022-09-16
dc.identifier.uri	https://hdl.handle.net/10902/26886
dc.description.abstract	RESUMEN: Cuando se trata del análisis y procesamiento de volúmenes grandes de datos del orden de los GB o superiores, las herramientas utilizadas usualmente como Python nativo no suelen ser suficientes si se requiere obtener resultados en un tiempo reducido. Por ello, el presente proyecto pretende demostrar la eficiencia del uso de herramientas que hacen uso de los sistemas distribuidos, en este caso PySpark. Se realiza un análisis de sentimientos junto a su respectivo preprocesado de datos de un conjunto de tweets cuyo tamaño supera los 20 GB utilizando la librería TextBlob utilizando tanto Python nativo como PySpark, midiendo en ambos casos el tiempo de ejecución de operaciones como descompresión, lectura y escritura de datos a un archivo csv, modificaciones al conjunto de datos, limpieza de texto y la clasificación de sentimientos reflejados en los tweets en negativo, positivo o neutro. Se realiza una comparación de los tiempos de ejecución obtenidos logrando demostrar que al adaptar el código utilizado en Python a PySpark se reduce la operación más costosa, la clasificación de sentimientos con TextBlob, de horas a menos de un segundo. Como conclusión en los resultados se obtiene una reducción final del 96% del tiempo empleado con Python, pasando de invertir casi 16 horas a tan solo 38 minutos con la ayuda de PySpark.	es_ES
dc.description.abstract	ABSTRACT: When analyzing and processing large volumes of data, on the order of gigabytes or more, commonly used tools such as native Python are often insufficient if results are needed quickly. This project therefore aims to demonstrate the efficiency of tools built on distributed systems, in this case PySpark. A sentiment analysis, together with the corresponding data preprocessing, is performed on a set of tweets larger than 20 GB using the TextBlob library, with both native Python and PySpark. In both cases the execution time is measured for operations such as decompression, reading and writing data to a CSV file, modifications to the data set, text cleaning, and the classification of the sentiment expressed in each tweet as negative, positive, or neutral. Comparing the measured execution times shows that adapting the Python code to PySpark reduces the most expensive operation, sentiment classification with TextBlob, from hours to less than a second. Overall, the results show a 96% reduction in the time required with Python, from almost 16 hours to only 38 minutes with PySpark.	es_ES
dc.format.extent	44 p.	es_ES
dc.language.iso	spa	es_ES
dc.rights	Atribución-NoComercial-SinDerivadas 3.0 España	es_ES
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/es/	*
dc.subject.other	Computación distribuida	es_ES
dc.subject.other	Hadoop	es_ES
dc.subject.other	Apache Spark	es_ES
dc.subject.other	PySpark	es_ES
dc.subject.other	Python	es_ES
dc.subject.other	Análisis de sentimientos	es_ES
dc.subject.other	NLP	es_ES
dc.subject.other	TextBlob	es_ES
dc.subject.other	Distributed Computing	es_ES
dc.subject.other	Sentiment Analysis	es_ES
dc.title	Optimización de tiempo de ejecución con PySpark de Hadoop de un análisis de sentimientos de tweets	es_ES
dc.title.alternative	Optimization of the runtime of tweet sentiment analysis with Hadoop’s PySpark	es_ES
dc.type	info:eu-repo/semantics/masterThesis	es_ES
dc.rights.accessRights	openAccess	es_ES
dc.description.degree	Master's Degree in Data Science	es_ES
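
The pipeline summarized in the abstract (text cleaning, three-way sentiment classification with TextBlob, and the same step adapted to PySpark) might look roughly like the sketch below. This is a minimal illustration, not the thesis code: the function names, the cleaning rules, the `text` and `sentiment` column names, and the polarity threshold of 0 are all assumptions made here. The last helper reproduces the arithmetic behind the reported 96% reduction, taking "almost 16 hours" as 960 minutes.

```python
import re


def clean_tweet(text: str) -> str:
    """Hypothetical cleaning step: drop URLs, @mentions and '#' signs."""
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"@\w+", " ", text)           # remove user mentions
    text = text.replace("#", " ")               # keep hashtag words, drop the sign
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace


def label_from_polarity(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to a class (threshold 0 is an assumption)."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


def classify(text: str) -> str:
    """Classify one tweet with TextBlob's polarity score."""
    # Imported lazily so the helpers above can be used without TextBlob installed.
    from textblob import TextBlob
    return label_from_polarity(TextBlob(clean_tweet(text)).sentiment.polarity)


def add_sentiment_column(df, text_col: str = "text"):
    """Apply classify() to a Spark DataFrame column through a Python UDF."""
    # Imported lazily so this module loads without PySpark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType
    classify_udf = F.udf(classify, StringType())
    return df.withColumn("sentiment", classify_udf(F.col(text_col)))


def percent_reduction(before_minutes: float, after_minutes: float) -> float:
    """Relative time saved, as a percentage of the original duration."""
    return 100.0 * (before_minutes - after_minutes) / before_minutes


if __name__ == "__main__":
    print(clean_tweet("Hi @bob check https://t.co/x #wow"))
    print(label_from_polarity(0.3), label_from_polarity(-0.1), label_from_polarity(0.0))
    # "Almost 16 hours" taken as 16 * 60 = 960 minutes, against 38 minutes.
    print(round(percent_reduction(16 * 60, 38)))
```

With PySpark and TextBlob installed, `add_sentiment_column(df)` would be called on a DataFrame that has a `text` column. When timing such a step, keep in mind that `withColumn` is a lazy transformation: Spark defers the actual computation until an action such as a write or `count()` is executed.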