Sunday, May 12, 2019

Problems with Keras and RStudio

Every once in a while, students approach us with problems that arise when they try to run Keras from R. Keras is a well-developed and popular interface for deep learning / neural networks. Apparently we're not the only ones bumping into problems here; fortunately, one student here in Tübingen documented an approach that worked, pulled from a GitHub discussion. The (or at least one) solution is here:
1.) Install R 3.5.2 (the installation directory does not matter, standard or any other)
2.) Install RStudio (again, any directory)
3.) In RStudio: Tools -> Global Options -> Packages -> disable both "Use secure download method for HTTP" and "Use Internet Explorer library/proxy for HTTP"
4.) Install Miniconda3 version 4.5.11 into the standard directory (this is required; it did not work with the newest version)
5.) In RStudio, run:
install.packages("tensorflow")
install.packages("keras")
library("keras")
install_keras()
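After install_keras() has finished, a quick sanity check along these lines should confirm that R can see the Python backend (a minimal sketch; the layer sizes and loss are arbitrary choices for a smoke test):

library(keras)
is_keras_available()               # should return TRUE if the Python-side installation was found
# build and compile a tiny model as a smoke test
model <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "relu", input_shape = c(4)) %>%
  layer_dense(units = 1)
model %>% compile(optimizer = "adam", loss = "mse")
summary(model)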
 I am posting this as a reminder to ourselves, and hopefully this is also useful to other people running into similar issues.

Friday, May 10, 2019

Things I like to read (1)

As a researcher (and teacher), I spend a substantial part of my time reading (what a privilege), although - admittedly - I should read more than I do. Clearly, my reading involves many journal articles, to keep up with relevant developments in the literature.
But over the last few years, an increasingly important part of communication about research has moved online. Here, Twitter is an important channel, but I don't enjoy reading Twitter because my impression is that it encourages simplification even in situations where in-depth explanations are required or helpful. That's why I prefer blogs. I read several blogs, almost on a daily basis. I've learned a lot by reading blogs, and I keep learning a lot by reading blogs.

In no particular order, I will provide links to the blogs (and articles) that I particularly like, and I will update this list in new posts over the coming weeks.

Today: Simply Statistics. This is a blog I've been following for the last few years. The authors add new posts every few days, and the posts are a healthy mix of facts (about statistics, applications, data, developments in the way we think about statistics) and (well-founded) opinions. This blog has informed my way of thinking about Data Science quite a lot. Recent posts that touch on many important aspects of what Data Science is about are this one and this one.
But they also share fun stuff like this list of 10 things R can do that might surprise you. Or (half-fun, half-serious, I guess) this one about bar plots, and why they must die.
In any case, their blog is highly recommended.

I sometimes hear the opinion that blogs (and similar sources) will replace the academic literature (e.g., journals) in the not-too-far future. I disagree. Why? Here are two reasons.

(1) The brand. Blogs work for people (authors) who have a brand, a reputation. If a person unknown in a given academic community starts a blog, the insights and opinions shared on that blog will have only little impact, because readers have no reason to believe that this source is better, better informed, or more relevant than other web pages or blogs. Journals, in contrast, have a reputation, an impact factor. Of course, we all know that this is a bad measure of quality, but I am convinced that readers will associate higher credibility with a claim brought forward in, say, Nature than with one made in some blog. This is because there are gatekeepers at journals - as imperfect as they may be - which are absent elsewhere. Some blogs do have active comment sections that serve a similar corrective function, but this is restricted to large, popular blogs like Andrew Gelman's.

(2) Passage of time. Blogs are nice for rapid communication about timely topics. But nobody can keep the author of some academic blog from turning off that blog. Then the content is gone, invisible to the public, and all references to this content are useless. With journals, this is different. Journals have some sort of institutional commitment to keep the content available, and even if a journal disappears at some point, the content will still be available in many libraries.
Hence, blogs serve an important function: they ignite discussion and disseminate knowledge, opinions, and insights rapidly. But journals, I believe, are here to stay. Let's revisit this prediction 10 years from now.

Friday, May 3, 2019

Another big debate about p-values - should we care?

The last few weeks have seen a pretty hot academic debate. About the proper use of p-values. Really, can you have a hot debate about p-values?!? Yes, you can! I regularly read Andrew Gelman's blog, and three posts on that topic (this, this, and this) that appeared in March and April 2019 attracted more than 800 (!) comments. And by comments, I mean real text, often quite long, not just a thumbs up or thumbs down. So, apparently, the topic does attract a lot of attention, and it seems almost impossible to keep track of all these comments and contributions.

A brief and incomplete timeline

So, in case you have not heard about it, what is going on here? Clearly, the topic of how to use p-values properly is not new; see here for an example from 1994. In 2015, the debate received renewed attention after the journal "Basic and Applied Social Psychology" banned the use of p-values (as well as t-values, confidence intervals, and the like). The editors explained their reasoning in an editorial, and it seems this editorial is by far the most frequently read article in that journal.

Then, in 2016, the American Statistical Association published a statement on p-values. It's not quite a "product recall", but it's close; maybe a product safety alert. In pretty clear words, the statement cautions against the typical use of p-values. Then, in March 2019, the journal "The American Statistician" published a special issue, "Statistical Inference in the 21st Century: A World Beyond p < 0.05", with dozens of articles dealing with this topic. I have read about a third of these articles by now, and I can highly recommend taking the time to read them!

Parallel to the publication of this special issue, a group of researchers (Valentin Amrhein, Sander Greenland, Blake McShane) wrote a short piece to be published as a commentary in "Nature". The main point of the article was the plea to "retire statistical significance". The authors invited researchers worldwide to sign this "petition" if they agreed. I do agree with most of what they wrote, so I signed, along with more than 800 other researchers. Nature then published this piece under the slightly attention-grabbing headline "Scientists rise up against statistical significance". And, again, this created a lot of attention. As far as I know, this article has the highest Altmetric score of all articles that have been tracked so far. As I write this, it has a score of 12795. Not bad. For a piece about p-values.

Why does this topic attract so much attention and discussion? 

There are probably many reasons, but I want to mention three. (1) Statistical inference matters. Millions of researchers around the globe collect and analyze data, and their goal should be to draw valid conclusions. Whether we are using the right tools to do that is important. (2) Most empirical researchers have relied on p-values in the past; this is the dominant paradigm. So when it is knocked off its pedestal, this concerns many researchers. (3) (Applied) statisticians are supposed to be able to make sense of numbers, right? At least that is, I would argue, the layman's perspective. Statisticians apply their tools to extract the truth from the data, don't they? And analyzing data will bring certainty where uncertainty would otherwise prevail. But apparently there is a lot of uncertainty about how to properly analyze data, or at least a lot of uncertainty about how to interpret the uncertainty in the data.

Why should we retire statistical significance? 

Clearly, the comment by Amrhein et al. does not advocate abandoning statistical analysis. Clearly, it does not call for ignoring the uncertainty associated with an estimate. To me, the most important part of the comment is the point about not dichotomizing the evidence. An estimate with a p-value of .04 is not fundamentally or qualitatively different from an estimate with a p-value of .06. Concluding that the estimate with p = .04 reflects "an effect" while the estimate with p = .06 reflects "no effect" is wrong. As Gelman has written a zillion times, the difference between significant and not significant is not itself significant. Amrhein's comment and the surrounding publicity put a spotlight on this debate. There is also valid and relevant criticism that people have brought forward against the comment, but I will save that for another day.
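A quick back-of-the-envelope calculation in R makes this concrete (a sketch with made-up numbers: two hypothetical, independent estimates, each with a standard error of 1):

se <- 1
est1 <- qnorm(1 - 0.04 / 2) * se   # estimate whose two-sided p-value is exactly .04
est2 <- qnorm(1 - 0.06 / 2) * se   # estimate whose two-sided p-value is exactly .06
c(est1, est2)                      # about 2.05 and 1.88: very similar estimates
z_diff <- (est1 - est2) / sqrt(se^2 + se^2)
2 * (1 - pnorm(z_diff))            # p-value of the difference: about .90

The two estimates are nearly identical, and the difference between the "significant" and the "non-significant" one is itself nowhere near significant.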