Wednesday, July 31, 2019

Special Interest Group Meeting on Quantitative Marketing

At this year's Annual Conference of the European Marketing Academy in Hamburg, we had a very nice Special Interest Group (SIG) Meeting for the SIG in Quantitative Marketing. Actually, it was more of a special session than a typical SIG meeting, but it was a really great session.
We had three contributors: Klaus Miller from Frankfurt gave a hands-on talk / tutorial on how to deal with the empirical analysis of really big data sets. A key ingredient of the recipe that he presented was sparklyr, and I think this is really helpful for other researchers dealing with large data sets. In any case, it was uncharted territory for many in the audience.
The second presentation was by Stefan Mayer from Tübingen, and it dealt with how we can use Deep Learning in marketing applications. Again, this was very, very applied.
The third talk was more of a methodological contribution by Max Pachali, also from Frankfurt, dealing with sign and order constraints on priors in Bayesian analyses that are used for counterfactual predictions.

There were three aspects I particularly liked about this session.
(1) The talks were not typical research papers, but rather small tutorials. You could view this as an attempt to develop skills in these areas within the community. I think our conferences would benefit from more of these tutorial-style talks.
(2) The room was packed; some in the audience had to stand or sit on the floor. Apparently, people found these topics valuable, which supports point (1) above.
(3) All slides and all code used and presented in this session were uploaded to a Git repository. Clearly, this is extra work, and in some cases this may be too much to ask of presenters, but I believe it is very helpful in making the material stick, creating impact, and allowing people in the audience to actually try things out directly.

So thanks to the three presenters for providing this service to the community!

P.S.: This is the abstract of the session:
Quantitative marketing research has seen substantial advances in recent years. These advances concern both the way the research process is organized as well as the methods that quantitative marketing researchers use for analyzing data. This session will cover examples of advances in both areas to showcase important recent developments and potentially rewarding applications in quantitative marketing research.
While it is easy to recognize the advantages that are associated with large data sets, it is less straightforward to actually perform successful work with really large data sets. One important challenge that arises is producing reproducible code that handles and analyzes large, complex, and unstructured data sets. Another issue is scaling this code to process these data sets in a cloud computing environment. The first presentation “Managing and Analyzing Big Data” will therefore discuss and showcase several approaches to deal with these issues and provide a hands-on introduction to working with large-scale data in quantitative marketing research.
Machine learning is one development in the methodological domain that has the potential to substantially impact the way the marketing discipline will analyze data in the future. The specific value, however, of machine learning methods for solving marketing problems is often unclear. The presentation “What is deep learning, and why should I care?” will therefore showcase one particular method (i.e., deep learning) and describe potential applications for the marketing discipline (e.g., the use of transfer learning).
While the focus of machine learning methods is traditionally on making predictions, causal inference that is valid beyond the specific context being analyzed is still the goal of many applications in quantitative marketing. For example, many standard approaches in the literature perform well in predicting consumer choices locally but violate basic economic principles and thus do not extrapolate well in counterfactual simulations (such as optimal pricing or product design). The third talk will therefore discuss the pitfalls of relying on standard unconstrained models and propose a practical way of specifying more economically faithful hierarchical prior distributions. It will also document that this approach improves counterfactual predictions substantially.

Sunday, May 12, 2019

Problems with Keras and RStudio

Students once in a while approach us with problems that arise when they try to run Keras from R. Keras is an interface for deep learning / neural networks that is very well developed and popular. Apparently we're not the only ones bumping into problems here, but one student here in Tübingen documented an approach that worked, pulled from a GitHub discussion. The (or, at least one) solution is here:
1.) Install R 3.5.2 (the installation directory does not matter; standard or any other).
2.) Install RStudio (again, the directory does not matter).
3.) In RStudio, go to Tools -> Global Options -> Packages and disable both "Use secure download method for HTTP" and "Use Internet Explorer library/proxy for HTTP".
4.) Install Miniconda3, version 4.5.11 (it did not work with the newest version at the time). This one has to go into the standard directory!
5.) In RStudio, run:
# R packages that interface with TensorFlow and Keras
install.packages("tensorflow")
install.packages("keras")
library("keras")
# sets up the Python side (Keras and its TensorFlow backend)
install_keras()
I am posting this as a reminder to ourselves, and hopefully this is also useful to other people running into similar issues.
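Once the steps above have run through, a quick check along the following lines can confirm that R actually finds the Keras installation. This is only a minimal sketch, not part of the documented solution: it assumes the keras R package from step 5 and its is_keras_available() helper, the variable name keras_ok is made up for illustration, and the check is guarded so that it reports FALSE rather than stopping with an error if something is missing.

```r
# Sanity check after installation (sketch; `keras_ok` is a hypothetical name).
# requireNamespace() returns FALSE instead of raising an error if the keras
# R package is not installed; tryCatch() turns any error from the
# availability check into FALSE as well.
keras_ok <- requireNamespace("keras", quietly = TRUE) &&
  isTRUE(tryCatch(keras::is_keras_available(), error = function(e) FALSE))

if (keras_ok) {
  message("Keras is available from R.")
} else {
  message("Keras was not found; re-check steps 4 and 5.")
}
```

If the second message appears, the Miniconda step is the first thing I would look at again, since it was the version-sensitive one.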

Friday, May 10, 2019

Things I like to read (1)

As a researcher (and teacher), I spend a substantial part of my time reading (what a privilege), although - admittedly - I should read more than I do. Clearly, my reading involves many journal articles, to keep up with relevant developments in the literature.
But in recent years, an increasingly important part of the communication about research issues has moved online. Twitter is an important channel here, but I don't enjoy reading Twitter because my impression is that it encourages simplifications even in situations where in-depth explanations are required or helpful. That's the reason why I prefer blogs. I read several blogs, almost on a daily basis. I've learned a lot by reading blogs, and I keep learning a lot from them.

In no particular order, I will provide links to the blogs (and articles) that I particularly like, and I will update this through new posts over the next weeks.

Today: Simply Statistics. This is a blog I've been following for the last few years. The authors add new posts every few days, and the posts are a healthy mix of facts (about statistics, applications, data, developments in the way we think about statistics) and (well-founded) opinions. This blog informed my way of thinking about Data Science quite a lot. Recent posts that touch on many important aspects of what Data Science is about are this one or this one.
But they also share fun stuff like this list of 10 things R can do that might surprise you. Or (half-fun, half-serious, I guess) this one about bar plots, and why they must die.
In any case, their blog is highly recommended.

I sometimes hear the opinion that blogs (and similar sources) will replace the academic literature (e.g., journals) in the not-too-distant future. I disagree. Why? Here are two reasons.
(1) The brand. Blogs work for people (authors) who have a brand, a reputation. If a person unknown in a given academic community starts a blog, the insights and opinions shared on that blog will have little impact, because readers have no reason to believe that this source is better, better informed, or more relevant than other web pages or blogs. Journals, in contrast, have a reputation, an impact factor. Of course, we all know that this is a bad measure of quality, but I am convinced that readers will attach more credibility to a claim brought forward in, say, Nature than to one in some blog, because journals have gatekeepers, as imperfect as they may be, which are absent elsewhere. Granted, some blogs have active comment sections that provide a form of scrutiny, but this is restricted to large, popular blogs like Andrew Gelman's.
(2) Passage of time. Blogs are nice for rapid communication about timely topics. But nobody can keep the author of some academic blog from turning off that blog. Then the content is gone, invisible to the public, and all references to this content are useless. With journals, this is different. Journals have some sort of institutional commitment to keep the content available, and even if a journal at some point disappears, the content will still be available in many libraries.
Hence, I believe that blogs serve an important function in igniting discussion and in disseminating knowledge, opinions, and insights rapidly, but journals are here to stay. Let's revisit this prediction 10 years from now.

Friday, May 3, 2019

Another big debate about p-values - should we care?

The last weeks have seen a pretty hot academic debate. About the proper use of p-values. Really, can you have a hot debate about p-values?!? Yes, you can! I regularly read Andrew Gelman's blog, and three posts on that topic (this, this, and this) that appeared in March and April 2019 attracted more than 800 (!) comments. And by comment, I mean real text, often quite long, and not just a thumbs up or thumbs down. So, apparently, the topic does attract a lot of attention, and it seems almost impossible to keep track of all these comments and contributions.

A brief and incomplete timeline

So, in case you have not heard about it, what is going on here? Clearly, the topic of how to make use of p-values in a proper way is not new, see here for an example from 1994. In 2015, the debate received renewed attention after the journal "Basic and Applied Social Psychology" banned the use of p-values (or t-values, confidence intervals, and the like). They explained their reasoning in an editorial, and it seems this editorial is by far the most frequently read article in this journal.

Then, in 2016, the American Statistical Association published a statement on p-values. It's not quite a "product recall", but it's close, maybe a product safety alert. In pretty clear words, the statement cautions against the typical use of p-values. Then, in March 2019, the journal "The American Statistician" published a special issue "Statistical Inference in the 21st Century: A World Beyond p < 0.05", with dozens of articles dealing with this topic. I have read about a third of these articles by now, and I can highly recommend taking the time to read them!

Parallel to the publication of this special issue, a group of researchers (Valentin Amrhein, Sander Greenland, Blake McShane) wrote a short piece to be published as a commentary in "Nature". The main point of the article was the call to "retire statistical significance". The authors invited researchers worldwide to sign this "petition" if they agreed. I do agree with most of what they wrote, so I signed, along with more than 800 other researchers. Nature then published this piece under the slightly attention-grabbing headline of "Scientists rise up against statistical significance". And, again, this attracted a lot of attention. As far as I know, this article is the one with the highest Altmetric score of all articles that have been tracked so far. As I write this, it has a score of 12795. Not bad. For a piece about p-values.

Why does this topic attract so much attention and discussion? 

There are probably many reasons, but I want to mention three. (1) Statistical inference matters. Millions of researchers around the globe collect and analyze data, and their goal should be to draw valid conclusions. Whether we are using the right tools to do that is important. (2) Most empirical researchers have so far relied on p-values; this is the dominant paradigm. So when it is knocked off its pedestal, it concerns many researchers. (3) (Applied) statisticians are supposed to be able to make sense of numbers, right? At least that is, I would argue, the layperson's perspective. Statisticians apply their tools to extract the truth from the data, don't they? And analyzing data will bring certainty where uncertainty would prevail otherwise. But apparently, there is a lot of uncertainty as to how to properly analyze the data, or at least, there is a lot of uncertainty on how to interpret the uncertainty in the data.

Why should we retire statistical significance? 

Clearly, the comment by Amrhein et al. does not advocate abandoning statistical analysis, nor does it call for ignoring the uncertainty associated with an estimate. To me, the most important part of the comment is the point about not dichotomizing the evidence. An estimate that has a p-value of .04 is not fundamentally or qualitatively different from an estimate with a p-value of .06. Concluding that the estimate with p = .04 "has an effect" while the estimate with p = .06 "has no effect" is wrong. As Gelman has written a zillion times, the difference between "significant" and "not significant" is not itself statistically significant. The comment by Amrhein et al. and the surrounding publicity put a spotlight on this debate. There is also valid and relevant criticism that people bring forward against the comment, but I will save that for another day.
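The point about the .04 and .06 estimates can be made concrete with a little arithmetic. The following is a rough sketch under assumptions that are mine, not taken from the comment: two independent estimates, two-sided tests, a normal approximation, and equal standard errors (normalized to 1). The variable names are purely illustrative.

```r
# Two hypothetical, independent estimates with equal standard errors
# (normalized to 1), two-sided tests, normal approximation.
p1 <- 0.04
p2 <- 0.06

# z-statistics implied by the two p-values
z1 <- qnorm(1 - p1 / 2)
z2 <- qnorm(1 - p2 / 2)

# Test whether the two estimates differ from each other:
# the difference has standard error sqrt(1^2 + 1^2) = sqrt(2)
z_diff <- (z1 - z2) / sqrt(2)
p_diff <- 2 * (1 - pnorm(abs(z_diff)))

round(c(z1 = z1, z2 = z2, z_diff = z_diff, p_diff = p_diff), 3)
```

Under these assumptions, the p-value for the difference between the two estimates comes out at roughly 0.9. One estimate lands on the "significant" side of .05 and the other does not, yet the data give essentially no reason to believe the two differ.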

Tuesday, April 30, 2019

We just launched a new M.Sc.-program: Data Science in Business and Economics

After close to two years of pretty intense work, many discussions, and a lot of reading, I'm quite happy that we're finally there: we launched the new Master's program "Data Science in Business and Economics" at the School of Business and Economics at the University of Tübingen.
This program rests on three main pillars. Of course, data science does not work without econometrics, which is the first pillar. Second, we believe that data scientists must have sound theoretical knowledge of the content domain that they are active in. Hence, students in this program will have to spend considerable time improving their understanding of how markets and consumers work and think. The third pillar, and the really new part, is that we will equip students with a new skill set that allows them to deal with large and unstructured data sets and apply new methods. This includes, of course, coding in R and Python, but students can also benefit from our University's excellent Department of Computer Science, one of the leading places for Machine Learning. It is going to be very interesting to see how this combination works out. I will provide a bit more information on our definition of and view on data science in one of the next posts.