dc.contributor.author: Bolívar Gómez, Sergio
dc.contributor.author: Nieto Reyes, Alicia
dc.contributor.author: Rogers, Heather L.
dc.contributor.other: Universidad de Cantabria [es_ES]
dc.date.accessioned: 2023-02-20T14:03:32Z
dc.date.available: 2023-02-20T14:03:32Z
dc.date.issued: 2023
dc.identifier.issn: 2227-7390
dc.identifier.uri: https://hdl.handle.net/10902/27736
dc.description.abstract: This manuscript introduces a new concept of statistical depth function: the compositional D-depth. It is the first data depth developed exclusively for text data, in particular for data vectorized according to a frequency-based criterion, such as the tf-idf (term frequency–inverse document frequency) statistic, which results in most vector entries taking a value of zero. The proposed data depth consists of taking the inverse discrete Fourier transform of the vectorized text fragments and then applying a statistical depth for functional data, D. This depth is intended to address the sparsity of the numerical features that results from transforming qualitative text data into quantitative data, a common procedure in most natural language processing frameworks. Indeed, this sparsity hinders the use of traditional statistical depths and machine learning techniques for classification purposes. To demonstrate the potential value of this new proposal, it is applied to a real-world case study that involves mapping Consolidated Framework for Implementation and Research (CFIR) constructs to qualitative healthcare data. It is shown that the DD^G-classifier yields competitive results and outperforms all studied traditional machine learning techniques (logistic regression with LASSO regularization, artificial neural networks, decision trees, and support vector machines) when used in combination with the newly defined compositional D-depth. [es_ES]
dc.description.sponsorship: Funding: A.N.-R. is supported by Grant 21.VP67.64662 funded by “Proyectos Puente 2022” from the Spanish Government of Cantabria. For H.L.R., the qualitative data used in the study were funded by Instituto de Salud Carlos III through the project “PI17/02070” (co-funded by the European Regional Development Fund/European Social Fund “A way to make Europe”/“Investing in your future”) and the Basque Government Department of Health project “2017111086”. The funding bodies had no role in the design of the study; the collection, analysis, or interpretation of data; or the writing of the manuscript. [es_ES]
dc.format.extent: 20 p. [es_ES]
dc.language.iso: eng [es_ES]
dc.publisher: MDPI [es_ES]
dc.rights: © 2023 by the authors. [es_ES]
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/ [*]
dc.source: Mathematics, 2023, 11(1), 228 [es_ES]
dc.subject.other: Compositional Depth [es_ES]
dc.subject.other: Multivariate Data [es_ES]
dc.subject.other: Natural Language Processing [es_ES]
dc.subject.other: Qualitative Data [es_ES]
dc.subject.other: Statistical Depth [es_ES]
dc.subject.other: Supervised Classification [es_ES]
dc.subject.other: Text Mining [es_ES]
dc.title: Statistical Depth for Text Data: An Application to the Classification of Healthcare Data [es_ES]
dc.type: info:eu-repo/semantics/article [es_ES]
dc.relation.publisherVersion: https://doi.org/10.3390/math11010228 [es_ES]
dc.rights.accessRights: openAccess [es_ES]
dc.identifier.DOI: 10.3390/math11010228
dc.type.version: publishedVersion [es_ES]



© 2023 by the authors. Except where otherwise noted, this item's license is described as © 2023 by the authors.