“Understanding Data Science Software: Exploring Differences and Solutions”

[This article was first published on R-posts.com, and kindly contributed to R-bloggers].

Actually, both are possible

This article was originally published in Korean on YOZM-IT.

Various ways of doing data science

There are many programming languages in the world, and many software packages built on them. They all play an important role in data science.

For example, if you’re using funnel analysis to improve your product, you might want to

Compare the bounce rates of funnel stages before and after an event, and
Perform a proportion test to assess whether the difference is statistically significant.

Meanwhile, data scientists come from varied career backgrounds and experiences, so they tend to use the tools they’re comfortable with, including Python, R, SAS and more.

This is common, because in most cases the choice of software doesn’t make much of a difference at the business level.

But what happens if the results differ depending on the software used?

The following image shows the results of running a proportion test in R, Python, and STATA for the example mentioned above.

You can see that even though we used the same values of 1000 and 123, the p-value, which indicates the significance of the proportion test, is slightly different for each method.

There are many reasons why the calculated value can differ depending on the method used, such as

Different algorithms in the core logic of the programming language
Different default values of the parameters used in the function

In the example above, R applies a continuity correction by default. If you turn it off by setting the parameter correct to FALSE (“correct = F”), you can see that the result is the same as in STATA.
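The effect of such a parameter can be sketched in Python. This is only a minimal illustration, not the code behind the image: the function name proportion_test_p_value and the null proportion of 0.1 are assumptions made for the example (the article does not state the null value). The test is written as a one-sample chi-square test with an optional continuity correction of 0.5, using only the standard library.

```python
import math


def proportion_test_p_value(successes, trials, null_proportion, correct=True):
    """Two-sided p-value of a one-sample proportion test (chi-square form).

    With correct=True, a continuity correction of up to 0.5 is subtracted
    from |observed - expected| before squaring.
    """
    expected = trials * null_proportion
    deviation = abs(successes - expected)
    if correct:
        # Never correct past zero when the deviation is smaller than 0.5.
        deviation -= min(0.5, deviation)
    variance = trials * null_proportion * (1 - null_proportion)
    statistic = deviation ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(statistic / 2))


# 123 successes out of 1000 trials, tested against a hypothetical null of 0.1.
p_corrected = proportion_test_p_value(123, 1000, 0.1, correct=True)
p_uncorrected = proportion_test_p_value(123, 1000, 0.1, correct=False)
print(f"with continuity correction:    {p_corrected:.5f}")
print(f"without continuity correction: {p_uncorrected:.5f}")
```

The same inputs give two slightly different p-values. The corrected one is the larger of the two, because the correction shrinks the test statistic.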

Rounding

Next, I’ll introduce rounding, which matters for more general data analysis.

Similarly, you can see that the result of round depends on the software.

If the fee in some large financial transaction is “0.5 billion”, the rounded cost could be zero or 1 billion, depending on how the rounding is calculated.

Another case is logistic regression, where a predicted probability of exactly 0.5 can fall into either class depending on the rounding rule, which reverses the prediction.
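To make the logistic regression case concrete, here is a small Python sketch. The function names and the probability value are hypothetical. The point is only that a predicted probability of exactly 0.5 lands in a different class depending on the tie-breaking rule: Python's built-in round sends ties to the nearest even integer, while the second function sends them upward.

```python
import math


def classify_half_even(probability):
    # Python's built-in round() sends ties to the nearest even integer.
    return round(probability)


def classify_half_up(probability):
    # Ties are sent upward, so 0.5 becomes 1.
    return math.floor(probability + 0.5)


predicted_probability = 0.5  # hypothetical model output, exactly on the boundary
print(classify_half_even(predicted_probability))  # 0
print(classify_half_up(predicted_probability))    # 1
```

Away from the boundary the two functions agree. Only the exact tie is classified differently.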

Image from Wikipedia, edited by the author

Why is round different?

Let’s talk a little more about why round behaves differently.

Rounding as we usually perceive it means changing 0 ~ 4 to 0, and 5 ~ 9 to 10, as shown in the image below.

For decimals, it means rounding to the nearest whole number by changing .0 ~ .4999.. to 0 and .5 ~ .9999.. to 1.

However, there are several mathematical interpretations of what to do when the fraction is exactly 0.5, and when the number is negative.

For example, should round(-23.5) produce -23 or -24?

Both are possible, depending on the mathematical interpretation. They are called rounding half up and rounding half down, respectively. We can take this a step further and round the halves of both positive and negative numbers toward zero, or away from it.

This means that round(-23.5) gives -23 and round(23.5) gives 23, or that they give -24 and 24. These are called Rounding half toward zero and Rounding half away from zero, respectively.

Finally, there are methods called Rounding half to even and Rounding half to odd, which break a tie by choosing the nearest even or odd integer, respectively.

In particular, the Rounding half to even method also goes by the names Convergent rounding, Statistician’s rounding, Dutch rounding, Gaussian rounding, and Bankers’ rounding, and is one of the official standard methods according to IEEE 754.
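The six tie-breaking rules can be compared side by side in a short Python sketch. The function names are made up for this illustration, and the arithmetic assumes inputs that are exactly representable as binary floats (such as 23.5). It is not meant as a general-purpose rounding library.

```python
import math


def round_half_up(x):
    return math.floor(x + 0.5)            # ties go toward +infinity


def round_half_down(x):
    return math.ceil(x - 0.5)             # ties go toward -infinity


def round_half_toward_zero(x):
    return int(math.copysign(math.ceil(abs(x) - 0.5), x))


def round_half_away_from_zero(x):
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def round_half_to_even(x):
    return round(x)                       # Python's built-in rule


def round_half_to_odd(x):
    lower = math.floor(x)
    if x - lower == 0.5:                  # exact tie: pick the odd neighbour
        return lower if lower % 2 == 1 else lower + 1
    return math.floor(x + 0.5)            # no tie: plain nearest integer


rules = [
    round_half_up,
    round_half_down,
    round_half_toward_zero,
    round_half_away_from_zero,
    round_half_to_even,
    round_half_to_odd,
]
for rule in rules:
    print(f"{rule.__name__:28s} {rule(-23.5):4d} {rule(23.5):4d}")
```

Each row prints the result for -23.5 and 23.5, so the rules can be checked against the descriptions above. All six agree whenever the input is not an exact tie.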

Bankers’ rounding

Bankers’ rounding is the default method in R, so let’s go over it in a little more detail.

The image below shows the result of conventional (half up) rounding for values from 0.0 to 2.0.

While this may seem reasonable, there is actually a problem. Because .5 is always rounded up to the next integer, the results are systematically biased upward.

I don’t know the exact origin, but one theory is that the US IRS used to round this way when collecting taxes and was sued for unfairly profiting from people whose amounts ended in .5. According to this story, the IRS lost the case and switched to rounding .5 to the nearest even (or odd) number.

By rounding halves to the nearest even number instead, as shown below, we can avoid the bias that was previously occurring.
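The bias can be checked numerically with a small Python sketch. It assumes a grid of values from 0.0 to 2.0 in steps of 0.1, chosen here to mirror the range in the image, and uses the standard decimal module so that 0.5 and 1.5 are exact ties. The helper name rounded_total is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# Exact decimal values 0.0, 0.1, ..., 2.0 (Decimal avoids binary float error).
values = [Decimal(k) / Decimal(10) for k in range(21)]


def rounded_total(numbers, rounding):
    """Sum the numbers after rounding each one to a whole number."""
    return sum(x.quantize(Decimal("1"), rounding=rounding) for x in numbers)


# Note: decimal's ROUND_HALF_UP sends ties away from zero, which is the
# same as "half up" for the non-negative values used here.
true_total = sum(values)
half_up_total = rounded_total(values, ROUND_HALF_UP)
half_even_total = rounded_total(values, ROUND_HALF_EVEN)

print("true total:        ", true_total)
print("half up total:     ", half_up_total)
print("half to even total:", half_even_total)
```

For these 21 values the true total is 21, but the half-up total is 22, because both 0.5 and 1.5 move upward. Half to even sends 0.5 down to 0 and 1.5 up to 2, so the two ties cancel and the total stays at 21.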

The problem with different results

In recent years, industries in various domains, including pharmaceuticals and finance, have been trying to switch from “commercial” software such as SPSS, SAS and STATA to “open source” software such as Python, R and Julia.

As with the rounding example above, the issue of different software producing different results has also been raised, which can create problems in terms of reproducibility, uncertainty, accuracy, and traceability.

So if you’re using multiple software packages, you should be aware of why they produce different results, and how to use them properly.

CAMIS project

CAMIS stands for Comparing Analysis Method Implementations in Software.

This project compares the differences between software packages (or programming languages) and works toward standards for producing the same results.

The core area of the project is “statistical computation”, so most contributions come from data science leaders who have a strong understanding of it.

But CAMIS is also an open source project: participation is not restricted, and it is maintained by various people through regular discussions, collaboration, and sharing of project progress.

Below is one of the comparisons published on the CAMIS project’s webpage, which reviews how a one sample t-test is run with each software, what the results are, and how the results are compatible with each other.

The CAMIS project was started by members interested in “SAS to R” in the medical and pharmaceutical industry, so it mainly focuses on R and SAS for major statistical analyses. Recently it has also been working on how to use Python for data science in broader domains of industry.

Not only classical methods such as hypothesis tests and regression analysis, but also modern data science methods such as Bayesian statistics and causal inference, and novel implementations of existing methods (e.g. MMRM), are topics of interest in the project.

Sessions on CAMIS are increasingly appearing at data science conferences, where researchers and contributors are encouraged to promote it, contribute to it, and use it as a reference.

Finally, the CAMIS project is also collaborating with academia beyond the data science industry, as similar topics have been published in The American Statistician and Drug Information Association, among others.

The project is also currently working with students on a thesis entitled “A comparison of MMRM methodology in SAS and R software” and is open to collaborations and suggestions on other topics.

Summary

Various software is used in data science. Depending on the domain, the libraries or software an organization uses may be tied to a particular language, and these are sometimes mixed with individuals’ preferred methods. (In many cases, this doesn’t matter much at the business level.)

However, if you’re not careful, the methods you use can lead to different results.

In this article, I’ve given you some examples of and reasons for differences in the methods used by different software for calculations, and introduced the CAMIS project, a research project that aims to minimize them to ensure consistency in data analysis.

If you use different software in your data analytics work, it’s a good idea to look at these differences, understand them, and try to find the optimal method for your purposes.

And if you work in data science, I highly recommend that you take an interest in or contribute to the CAMIS project for a global collaborative experience.

Is round(0.5) 0 or 1? was first posted on October 5, 2024 at 7:03 am.


Potential Future Insights and Developments of Data Science Software Variance

The article discusses the significant role of various programming languages in data science and how different software can yield different results. We learn that even when the same input values are used in R, Python, and STATA, the results can vary. For instance, a proportion test comparing bounce rates before and after an event can return slightly different p-values on different platforms, despite using the same data. Crucially, the article underscores the importance of the Comparing Analysis Method Implementations in Software (CAMIS) project, which aims to standardize results across software packages.

Implications of Software Differences in Data Science

Today, industries including pharmaceuticals and finance are transitioning from commercial software to open source software such as Python, R and Julia. However, the fact that different software can produce different results raises concerns about reproducibility, uncertainty, accuracy, and traceability. This variance could cause significant divergences in forecast modeling and data interpretation within a single organization or between competitors in an industry. Resolving the discrepancy requires understanding why different software produces varying results and how to use each tool correctly and consistently.

Potential of ‘Rounding’ in Data Science

The article discusses the role and definition of rounding in data science. We learn that rounding can differ depending on the mathematical interpretation used, which in turn can yield different results across platforms and software. The article introduces ‘Rounding half toward zero’, ‘Rounding half away from zero’, ‘Rounding half to even’ and ‘Rounding half to odd’ for both positive and negative numbers. Programming languages are therefore more than tools for implementation: their defaults reflect different approaches to problem-solving in data science.

The Role of the CAMIS Project

The Comparing Analysis Method Implementations in Software (CAMIS) project is an initiative aimed at addressing differences in software used in data science. By comparing diverse software and programming languages, the project seeks to develop a standard that achieves consistent results, thereby assisting industries in confidently transitioning from commercial software to open source software. The project itself is open source: participation is not restricted, and it is maintained collaboratively by various contributors. Its primary focus is on R and SAS for major statistical analyses, and it also explores the use of Python for data science across wider industry domains.

Actionable Advice

If your work involves using different software for data analytics, it is advisable to review and understand the differences and nuances of your selected tools. Moreover, strive to find the optimal methods that align with your specific industry requirements.
If you work in data science, participating in or contributing to the CAMIS project is highly beneficial for both personal growth and collaborative knowledge sharing. Apart from staying updated with the latest developments, you can also lend your expertise to this significant cause.
Utilizing rounding correctly is crucial in data science. Awareness of the different types of rounding and how different software handle this can ensure the accuracy and reliability of your results.
The more well-versed you are with your chosen programming language and software, the more effectively you can minimize and address discrepancies in your work.
