In this post:
- Total Number of Books by Year (2008-2018)
- Top 25 Tags Overall (2008-2018)
- Top 10 Tags by Year (2011-2018)
It is indeed still Christmastide! So I have a gift for everyone who has already blown through the books they found under the tree (or knows they will in this glorious lull between Christmas and New Year’s).
I web-scraped the NPR Book Concierge and Best Books of the Year pages from 2008 to 2018. And then I compiled them into a nifty spreadsheet that is searchable by tag and author or filterable by any category.
The spreadsheet contains 2323 books from the past 10 years of NPR’s “best of” lists. Search the tags to find new favorites in your most-read genres, or take a chance on the “no-tag” books to discover something new.
For more on the process, keep reading the post. Though I understand if you go off and read the 2000+ awesome books from NPR’s Concierge instead :).
This year’s Book Concierge marks the 10th anniversary of NPR’s end-of-year recommendations, so I thought it’d be a fun side project to compile the lists from 2008 to 2018. This also gave me a chance to practice three of the digital humanities skills I’ve been using for the dissertation: web scraping, tidy data, and visualizations in R.
First, I scraped the pages using OutWit Hub. Web scraping is a method of harvesting data from a web page by asking the scraping program to look for specific patterns in the HTML code behind it.
Scraping the 2013 to 2018 pages was a breeze. The web app format NPR has used for the last five years is tagged well. For instance, <author> precedes the author’s name in the source code. The tags made it easy to identify common, unique patterns to plug into OutWit Hub’s desktop app.
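To make the "look for a pattern" idea concrete, here is a minimal sketch in R rather than OutWit Hub (which is a point-and-click app, so this is not what I actually ran). The HTML snippet, the <title> tag, and the placeholder names are all made up for illustration; only the <author> tag comes from the real pages. It pulls out whatever sits between an opening and a closing marker.

```r
library(stringr)

# Hypothetical page source; the real NPR markup is more complicated.
html <- paste0(
  '<div class="book"><title>Example Title One</title>',
  '<author>Example Author One</author></div>',
  '<div class="book"><title>Example Title Two</title>',
  '<author>Example Author Two</author></div>'
)

# Capture the text between a "marker before" and a "marker after".
# str_match_all() returns a list of matrices; column 2 holds the captured group.
authors <- str_match_all(html, "<author>(.*?)</author>")[[1]][, 2]
titles  <- str_match_all(html, "<title>(.*?)</title>")[[1]][, 2]

# One row per book, ready for a spreadsheet.
books <- data.frame(title = titles, author = authors)
print(books)
```

For anything messier than well-tagged pages, a proper HTML parser is probably a safer bet than regular expressions.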
Scraping the pages for 2008 to 2012, on the other hand, was a pain. (I’m still fairly new to web scraping and it stumped me.) The flow of the pages is inconsistent, and I found it tricky to scrape just some links instead of all of them. Honestly, I ended up doing a lot of manual cleanup, especially for 2008, 2009, and 2010. But okay. Worth it.
I also did some data tidying using the tidyverse and tidytext packages in RStudio. (I used Text Mining with R by Julia Silge and David Robinson as my bible for this work, as always.) Essentially, tidy data means that every instance of an observation has its own separate row in a table. For example, a book from the Concierge with two tags appears in two rows rather than one. (One for each tag.) To give a super simple example of untidy vs tidy data:
Figure 1: Untidy Data
| Author | Title | Tags |
|---|---|---|
| Ann Leckie | Ancillary Justice | science-fiction-and-fantasy, it’s-all-geek-to-me |
Figure 2: Tidy Data
| Author | Title | Tag |
|---|---|---|
| Ann Leckie | Ancillary Justice | science-fiction-and-fantasy |
| Ann Leckie | Ancillary Justice | it’s-all-geek-to-me |
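The jump from the untidy to the tidy shape can be sketched in a few lines of R. This is a minimal example, not my exact script: the column names (author, title, tags) are assumptions, and it uses tidyr's separate_rows() to split a comma-separated cell into one row per tag.

```r
library(tibble)
library(tidyr)

# The "untidy" shape: both tags share a single cell.
# Column names are assumptions made for this example.
untidy <- tibble(
  author = "Ann Leckie",
  title  = "Ancillary Justice",
  tags   = "science-fiction-and-fantasy, it's-all-geek-to-me"
)

# separate_rows() splits the tags cell on the comma (plus any trailing
# whitespace) and repeats author and title for each resulting tag.
tidy <- separate_rows(untidy, tags, sep = ",\\s*")

print(tidy)
```

The result has two rows, one per tag, which is the shape that counting and plotting functions in the tidyverse expect.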
I mainly tidied the data so I could show off some visualizations. 🙂 Voila – total counts for year and tag plus most used tags by year.
Graph: Total books per year
Interestingly, there’s a bit of a dip in the number of books in the Concierge this year (2018). Otherwise the count has gone steadily up, so my guess is we’ll see upwards of 320 in 2019.
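One wrinkle with tidy data is that a book with several tags appears in several rows, so counting rows per year would overcount. A rough sketch of how the per-year totals can be computed, using made-up placeholder rows and assumed column names (year, title, tag), is to drop duplicate books first and then count:

```r
library(dplyr)
library(tibble)

# Hypothetical tidy data: one row per book-tag pair (placeholder values).
books_tidy <- tribble(
  ~year, ~title,   ~tag,
  2017,  "Book A", "staff-picks",
  2017,  "Book A", "science-fiction-and-fantasy",
  2017,  "Book B", "staff-picks",
  2018,  "Book C", "staff-picks"
)

books_per_year <- books_tidy %>%
  distinct(year, title) %>%       # collapse back to one row per book
  count(year, name = "n_books")   # then count books within each year

print(books_per_year)
```

In this toy data, "Book A" has two tags but is only counted once for 2017.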
Over the last 10 years, the “best of” lists utilized 74 distinct tags. This gets visually messy, so I’ve opted to show just the top 25 here. I removed the “no-tag” label, which applied to all books in
Please also note there’s some overlap. “Staff-picks” and “
Graph: Top 25 Tags (2008-2018)
The most popular tags have shifted from year to year, with some tags fading out altogether over time. The facets below show the most popular tags per year from 2011 to 2018 (since no tags are available for 2008-2010). My apologies for the messy visualizations. I’m still working on how to sort the darn things properly…
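On the sorting problem: one option that may help is the reorder_within() and scale_y_reordered() pair from tidytext, which orders bars separately inside each facet. Below is a minimal sketch with made-up counts and assumed column names (year, tag), not the code behind the graphs in this post.

```r
library(dplyr)
library(ggplot2)
library(tibble)
library(tidytext)

# Hypothetical tidy data: one row per book-tag pair (counts are made up).
books_tidy <- tibble(
  year = c(2017, 2017, 2017, 2018, 2018, 2018),
  tag  = c("staff-picks", "staff-picks", "science-fiction-and-fantasy",
           "science-fiction-and-fantasy", "science-fiction-and-fantasy",
           "staff-picks")
)

top_tags <- books_tidy %>%
  count(year, tag, name = "n_books") %>%
  group_by(year) %>%
  slice_max(n_books, n = 10, with_ties = FALSE) %>%  # top 10 tags per year
  ungroup() %>%
  # Build a factor whose ordering is computed within each year.
  mutate(tag_ordered = reorder_within(tag, n_books, year))

p <- ggplot(top_tags, aes(x = n_books, y = tag_ordered)) +
  geom_col() +
  scale_y_reordered() +                     # strips the helper suffix from labels
  facet_wrap(~ year, scales = "free_y") +   # each facet keeps its own tag order
  labs(x = "Number of books", y = NULL)

print(p)
```

The scales = "free_y" argument matters here: without it, every facet shares one axis and the per-year ordering is lost.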
This was a fun project for my winter break and definitely something I’ll use going forward. My next step is to start tracking which books I’ve read and locate the ones I most want to read in my local library.
Suggestions welcome for other improvements. Otherwise, happy reading!
Table 1: Total books per year
Table 2: Top 25 Tags (2008-2018)
Table 3: Top 10 Tags by Year