Summary

Grocery shopping is an integral part of our lives. We rely on grocery stores to meet our nutrition, health, and other needs. Because this activity is so ubiquitous, we wondered what we could learn from people's shopping habits. We used the Dunnhumby datasets to look for interesting shopping patterns. We mainly used the "Complete Journey" dataset, which covers the shopping habits of over 2,500 households over 2 years. It records when transactions took place, which items were bought, household demographics, and more. We also used Dunnhumby's "Breakfast at the Frat" dataset, which contains information about discounts for a smaller sample of products.

Research Questions

Dataset Limitations

It is important to realize that we only have sales data for a specific set of stores. When we see an increase in spending over time, what we really see is an increase in those particular stores' sales. This may reflect a broader trend in society, but it is also possible that people are spending the same amount overall and simply choosing the stores in this dataset more often than stores that are not in it. In fact, the large rise in the first half of the year (see spending trends section) could be explained by the stores starting a marketing campaign. The low sales at the beginning could reflect new customers who only occasionally visited these stores, and the rapid increase could be the result of those customers choosing these stores more and more often over others because of the marketing. The slowdown afterwards may be due to saturation: most interested customers already shop at these stores regularly, and their habits no longer change much.

Furthermore, this dataset is particularly limited in terms of demographic data. While the "Complete Journey" dataset contains transaction data for over 2,500 households, it only contains demographic data for 801 of those households. Because households are spread very unevenly across demographic groups, some groups have very small sample sizes. For example, the dataset is split into 12 income groups, and the "200-249k" group has only 5 households in it. This is part of the reason we decided not to include some of the demographic analysis we had done.
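
A check like the one described above can be sketched in a few lines of Python. This is only an illustration, not the code we used: the column names household_key and income_desc, the helper small_groups, and the cutoff of 30 households are all assumptions, and the rows are toy data built to mimic one large group and one group of 5.

```python
from collections import Counter


def small_groups(households, column, min_size=30):
    """Return {group: count} for groups with fewer than min_size households.

    `households` is a list of dicts, one per household. Rows with a missing
    or empty value in `column` are skipped. The default cutoff of 30 is an
    arbitrary assumption, not a value taken from our analysis.
    """
    counts = Counter(row[column] for row in households if row.get(column))
    return {group: n for group, n in counts.items() if n < min_size}


# Toy rows for illustration only; they are NOT taken from the dataset.
toy = (
    [{"household_key": i, "income_desc": "50-74K"} for i in range(40)]
    + [{"household_key": 100 + i, "income_desc": "200-249K"} for i in range(5)]
)

print(small_groups(toy, "income_desc"))  # {'200-249K': 5}
```

Running a check of this kind on each demographic column before an analysis makes it easy to see which group comparisons rest on too few households to be trusted.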