Statistical Depth for Text Data: An Application to the Classification of Healthcare Data

Bolívar Gómez, Sergio; Nieto Reyes, Alicia; Rogers, Heather L.

doi:10.3390/math11010228

dc.contributor.author	Bolívar Gómez, Sergio
dc.contributor.author	Nieto Reyes, Alicia
dc.contributor.author	Rogers, Heather L.
dc.contributor.other	Universidad de Cantabria	es_ES
dc.date.accessioned	2023-02-20T14:03:32Z
dc.date.available	2023-02-20T14:03:32Z
dc.date.issued	2023
dc.identifier.issn	2227-7390
dc.identifier.uri	https://hdl.handle.net/10902/27736
dc.description.abstract	This manuscript introduces a new concept of statistical depth function: the compositional D-depth. It is the first data depth developed exclusively for text data, in particular for data vectorized according to a frequency-based criterion, such as the tf-idf (term frequency–inverse document frequency) statistic, which results in most vector entries taking a value of zero. The proposed data depth takes the inverse discrete Fourier transform of the vectorized text fragments and then applies a statistical depth for functional data, D. This depth is intended to address the sparsity of the numerical features that results from transforming qualitative text data into quantitative data, a common procedure in most natural language processing frameworks. This sparsity hinders the use of traditional statistical depths and machine learning techniques for classification purposes. To demonstrate the potential value of the proposal, it is applied to a real-world case study that involves mapping Consolidated Framework for Implementation Research (CFIR) constructs to qualitative healthcare data. It is shown that the DD^G-classifier, when used in combination with the newly defined compositional D-depth, yields competitive results and outperforms all the traditional machine learning techniques studied (logistic regression with LASSO regularization, artificial neural networks, decision trees, and support vector machines).	es_ES
dc.description.sponsorship	Funding: A.N.-R. is supported by Grant 21.VP67.64662 funded by “Proyectos Puente 2022” from the Spanish Government of Cantabria. For H.L.R., the qualitative data used in this study were funded by Instituto de Salud Carlos III through the project “PI17/02070” (co-funded by the European Regional Development Fund/European Social Fund “A way to make Europe”/“Investing in your future”) and the Basque Government Department of Health project “2017111086”. The funding bodies had no role in the design of the study, the collection, analysis, or interpretation of data, or the writing of the manuscript.	es_ES
dc.format.extent	20 p.	es_ES
dc.language.iso	eng	es_ES
dc.publisher	MDPI	es_ES
dc.rights	© 2023 by the authors.	es_ES
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	*
dc.source	Mathematics, 2023, 11(1), 228	es_ES
dc.subject.other	Compositional Depth	es_ES
dc.subject.other	Multivariate Data	es_ES
dc.subject.other	Natural Language Processing	es_ES
dc.subject.other	Qualitative Data	es_ES
dc.subject.other	Statistical Depth	es_ES
dc.subject.other	Supervised Classification	es_ES
dc.subject.other	Text Mining	es_ES
dc.title	Statistical Depth for Text Data: An Application to the Classification of Healthcare Data	es_ES
dc.type	info:eu-repo/semantics/article	es_ES
dc.relation.publisherVersion	https://doi.org/10.3390/math11010228	es_ES
dc.rights.accessRights	openAccess	es_ES
dc.identifier.DOI	10.3390/math11010228
dc.type.version	publishedVersion	es_ES
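
Illustrative sketch (not part of the original record). The pipeline summarized in the abstract (tf-idf vectorization, inverse discrete Fourier transform, then a functional depth D) can be outlined roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the toy documents are invented, the tf-idf weighting variant is one of several in common use, only the real part of the inverse transform is kept, and the modified band depth stands in for the generic functional depth D. The helper names `tfidf`, `to_curves`, and `modified_band_depth` are hypothetical.

```python
import itertools

import numpy as np


def tfidf(docs):
    """Plain tf-idf matrix: rows are documents, columns are vocabulary terms."""
    tokens = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokens for t in doc})
    index = {t: j for j, t in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(tokens):
        for t in doc:
            tf[i, index[t]] += 1
    df = (tf > 0).sum(axis=0)              # document frequency of each term
    idf = np.log(len(docs) / df) + 1.0     # assumption: one common idf variant
    return tf * idf


def to_curves(X):
    """Inverse DFT of each row, viewed as a discretized curve.

    Assumption: only the real part is kept so that a real-valued
    functional depth can be applied afterwards.
    """
    return np.fft.ifft(X, axis=1).real


def modified_band_depth(curves, sample):
    """Modified band depth (bands of two curves) of each row of `curves`
    with respect to `sample`. Used here as a stand-in for the depth D."""
    pairs = list(itertools.combinations(range(sample.shape[0]), 2))
    depth = np.zeros(curves.shape[0])
    for a, b in pairs:
        lo = np.minimum(sample[a], sample[b])
        hi = np.maximum(sample[a], sample[b])
        # Fraction of grid points at which each curve lies inside the band.
        depth += ((curves >= lo) & (curves <= hi)).mean(axis=1)
    return depth / len(pairs)


# Invented toy fragments, for illustration only.
docs = [
    "patients need clear information about the program",
    "staff lack time and resources for the program",
    "leadership supports the new care program",
    "patients and staff value clear communication",
]

X = tfidf(docs)                         # sparse-looking matrix, many zeros
curves = to_curves(X)                   # dense curves after the inverse DFT
depths = modified_band_depth(curves, curves)
print(np.round(depths, 3))
```

In a classification setting, the depth of a new fragment would be computed with respect to each class sample separately, and those per-class depths would then feed a depth-based classifier such as the DD^G-classifier mentioned in the abstract; that step is not shown here.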