Goodreads Users Find Missouri Authors ‘Slightly Below Average’
By: Erik Shively
The data comes from Kaggle. It is a dataset of review data for 10,000 books, scraped from Goodreads.com.
Many prolific writers are from Missouri, and they’ve produced a lot of memorable books. Perhaps the most notable is Samuel Clemens, better known by his pen name, Mark Twain. How well, though, does Missouri stack up against the rest of the world? One easily quantifiable measure of a book’s quality, whatever its accuracy, is what people think of the book in online reviews. The dataset I worked with sources its opinions from Goodreads.com and includes fields such as Average Review Score, Review Count and Book Length (Pages).
The first thing I did was compile a list of notable Missouri authors. I aggregated a set of 141 Missouri authors from various lists, sourced from Goodreads.com itself and from literature blogs. With that list, I wrote a script that iterates through the primary dataset and builds a new dataset representing Missouri’s literary output.
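The filtering step might look something like the sketch below. It is not the script used for this article: the column names (`title`, `authors`, `average_rating`), the "/" separator between co-authors, and the sample rows are all assumptions made for illustration.

```python
import csv
import io

def filter_by_authors(rows, authors):
    """Keep rows whose author field names at least one author in `authors`."""
    wanted = {name.strip().lower() for name in authors}
    kept = []
    for row in rows:
        # Assumption: co-authors share one field, separated by "/".
        names = [n.strip().lower() for n in row["authors"].split("/")]
        if any(n in wanted for n in names):
            kept.append(row)
    return kept

# Tiny made-up sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "title,authors,average_rating\n"
    "Book A,Mark Twain,3.9\n"
    "Book B,Some Other Author,4.1\n"
    "Book C,Mark Twain/Second Author,3.7\n"
)
rows = list(csv.DictReader(sample_csv))
missouri_rows = filter_by_authors(rows, ["Mark Twain"])
print([row["title"] for row in missouri_rows])
```

Matching on lowercased, stripped names keeps the comparison forgiving, but it would still miss spelling variants, so a real author list would need some manual checking.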
Across the 109 entries in the Missouri dataset, the average score was actually pretty close to that of the whole dataset. I figured Mark Twain would skew the average higher, so I made another Missouri set that excludes him. Again, the results are close. To exaggerate the differences, I subtracted 3.5 from the average scores. Here’s the result:
Missouri scores slightly lower than the average of the total dataset.
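For readers who want to try the comparison themselves, here is a minimal sketch of the averaging and the 3.5 offset. The rows are invented placeholders, not values from the dataset, and the `offset_mean` helper is a hypothetical name.

```python
from statistics import mean

BASELINE = 3.5  # the offset subtracted in the article before plotting

def offset_mean(scores, baseline=BASELINE):
    """Mean score minus a fixed baseline, to magnify small gaps on a chart."""
    return mean(scores) - baseline

# Made-up (author, score, is_missouri) rows; not the article's data.
books = [
    ("Mark Twain", 3.9, True),
    ("Another Missouri Author", 3.7, True),
    ("Non-Missouri Author 1", 4.1, False),
    ("Non-Missouri Author 2", 4.3, False),
]

groups = {
    "All books": [s for _, s, _ in books],
    "Missouri": [s for _, s, mo in books if mo],
    "Missouri without Twain": [s for a, s, mo in books if mo and a != "Mark Twain"],
}
for label, scores in groups.items():
    print(f"{label}: {offset_mean(scores):+.2f}")
```

Subtracting a constant does not change the gaps between groups. It only moves the bars closer to zero, so the same differences take up more of the chart.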
Mark Twain’s popularity made me wonder whether average review score correlates with other factors. I started by plotting review score against book length. Here’s the result:
The spread of average scores tends to narrow as book length increases, and the scores tend to be high. Beyond 2000 pages, no book comes close to going below an average review score of 3.0. More on that later.
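One way to check this kind of narrowing without a plot is to group books into page-count bins and look at the lowest and highest score in each. The sketch below assumes books are given as (pages, score) pairs; the bin size of 500 and the sample values are arbitrary choices for illustration.

```python
from collections import defaultdict

def score_range_by_length(books, bin_size=500):
    """Map each page-count bin's lower edge to (min score, max score, count)."""
    bins = defaultdict(list)
    for pages, score in books:
        bins[(pages // bin_size) * bin_size].append(score)
    return {start: (min(s), max(s), len(s)) for start, s in sorted(bins.items())}

# Made-up (pages, average score) pairs; not the article's data.
sample = [(120, 2.8), (340, 4.4), (480, 3.6), (900, 3.9), (950, 4.2), (2100, 4.3)]
bin_size = 500
for start, (low, high, count) in score_range_by_length(sample, bin_size).items():
    end = start + bin_size - 1
    print(f"{start}-{end} pages: {count} books, scores {low} to {high}")
```

Keeping the count next to each range matters: a bin with only a handful of books will look stable simply because there are few scores in it.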
Another correlation I tested was between author popularity (by the number of ratings) and average review score. Here’s the result:
Interestingly, this figure suggests that review score tends to be unaffected by popularity. The scores do not stabilize as the number of ratings increases.
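A scatter plot can be backed up with a single number. The sketch below computes the Pearson correlation coefficient by hand, where a value near 0 indicates no linear relationship. The `pearson` helper and the sample numbers are illustrative assumptions, not part of the original analysis.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up ratings counts and average scores; not the article's data.
ratings_counts = [150, 2_000, 45_000, 300_000, 1_200_000]
average_scores = [4.1, 3.6, 4.0, 3.8, 3.9]
print(f"r = {pearson(ratings_counts, average_scores):.2f}")
```

Because ratings counts span several orders of magnitude, it may be worth repeating the calculation on the logarithm of the counts so that a few very popular books do not dominate the result.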
Finally, let me offer a conjecture as to why review scores stabilize as the length of the book increases. I plotted the number of reviews against book length. Here’s the result:
Longer books tend to get fewer reviews. Assuming people generally wait until they’ve read the whole book to leave a review, this suggests that the people who stick with a long book all the way to the end tend to enjoy it enough to leave a positive review.
The world of online reviews is vast, and making sense of the data can lead to some interesting results. I encourage you to explore reviews and share what you find!