JOURNAL ARTICLE

Comparing programming languages for data analytics: Accuracy of estimation in Python and R.

Published In: WIREs: Data Mining & Knowledge Discovery, 2024, v. 14, n. 3, p. 1
Database: Academic Search Ultimate
Authored By: Hill, Chelsey; Du, Lanqing; Johnson, Marina; McCullough, B. D.

Abstract

Several open‐source programming languages, particularly R and Python, are used in industry and academia for statistical data analysis, data mining, and machine learning. While most commercial software programs and programming languages provide a single way to carry out a statistical procedure, open‐source programming languages have multiple libraries and packages offering many ways to complete the same analysis, often with varying results. Applying the same statistical method across these libraries and packages can lead to entirely different solutions because their implementations differ. Reliability and accuracy should therefore be essential considerations when choosing libraries and packages for statistical analysis in open‐source programming languages. Most users, however, take both for granted, assuming that their chosen libraries and packages produce accurate results. To this end, this study assesses the estimation accuracy and reliability of various Python libraries and R packages by evaluating univariate summary statistics, analysis of variance (ANOVA), and linear regression procedures using benchmarking data from the National Institute of Standards and Technology (NIST). Further, experimental results are presented comparing machine learning methods for classification and regression. The libraries and packages assessed in this study include the stats package in R and Pandas, Statistics, NumPy, statsmodels, SciPy, scikit‐learn, and pingouin in Python. The results show that the stats package in R and the statsmodels library in Python are reliable for univariate summary statistics. In contrast, Python's scikit‐learn library produces the most accurate results and is recommended for ANOVA. Among the libraries and packages assessed for linear regression, the stats package in R is more reliable, accurate, and flexible than the others; thus, it is recommended for linear regression analysis. Further, we present results and recommendations for machine learning using R and Python. This article is categorized under: Algorithmic Development > Statistics; Application Areas > Data Mining Software Tools [ABSTRACT FROM AUTHOR]
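
As a minimal sketch of why two implementations of the same statistic can disagree, the Python snippet below computes the sample variance of a small, made-up dataset with two different formulas and compares each result with the known exact value. The dataset is illustrative and is not NIST benchmark data. The function names and the digits-of-agreement measure are assumptions introduced here, not taken from the article.

```python
import math
import statistics


def one_pass_variance(xs):
    """Textbook one-pass formula: (sum(x^2) - (sum(x))^2 / n) / (n - 1).

    Prone to catastrophic cancellation when the mean is large
    relative to the spread of the data.
    """
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)


def two_pass_variance(xs):
    """Two-pass formula: compute the mean first, then sum squared deviations."""
    n = len(xs)
    mean = math.fsum(xs) / n
    return math.fsum((x - mean) ** 2 for x in xs) / (n - 1)


def correct_digits(estimate, certified):
    """Rough count of agreeing significant digits, clamped to [0, 15].

    Hypothetical helper for this sketch: -log10 of the relative error.
    """
    if estimate == certified:
        return 15.0
    rel_err = abs(estimate - certified) / abs(certified)
    return max(0.0, min(15.0, -math.log10(rel_err)))


# Illustrative data: the values 4, 7, 13, 16 shifted by 1e9.
# The exact sample variance is 30, regardless of the shift.
data = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]
exact = 30.0

results = {
    "one-pass formula": one_pass_variance(data),
    "two-pass formula": two_pass_variance(data),
    "statistics.variance": statistics.variance(data),
}

for name, value in results.items():
    print(f"{name:>20}: {value!r}  (~{correct_digits(value, exact):.1f} digits)")
```

On this dataset the two-pass formula and `statistics.variance` recover the exact value, while the one-pass formula loses it to rounding in the squared terms. The wrong value it prints can vary with the Python version, so none is quoted here.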

Additional Information

Source: WIREs: Data Mining & Knowledge Discovery. 2024/05, Vol. 14, Issue 3, p1
Document Type: Article
Subject Area: Computer Science
Publication Date: 2024
ISSN: 1942-4787
DOI: 10.1002/widm.1531
Accession Number: 177146373
Copyright Statement: Copyright of WIREs: Data Mining & Knowledge Discovery is the property of Wiley-Blackwell and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)