My digital dissertation on historical thinking, social media, and the digital age primarily uses Twitter data to answer questions about students’ understandings of the significance of the practice and content of history. Working with Twitter data is new territory for me, but I have a few thoughts on the process of cleaning and organizing the data thus far.
1. Twitter is a Hydra
Hydra: A mythological beastie with many heads. When someone lops off one head, two more appear. Also apparently exhales poisonous fumes. Heracles (Hercules) was only able to destroy the monster with the help of his nephew, who cauterized the stump of each head to prevent new ones growing.
I didn’t know Twitter was going to be a Hydra, partly because the initial collection of tweets was super easy thanks to Ian Milligan, who generously set up and hosted a dnflow server for me this semester.1 (Because digital humanities people are awesome about helping new-to-DH scholars realize their projects.)
My students and I used a class hashtag (#hwc111) to organize our tweets and, once a week or so, I entered the hashtag into a search box on dnflow. The program created analytics regarding the most popular tweets, common images, and the number of tweets collected, and I downloaded this data into a neat Google Drive folder.
The original data set comprised 10,486 tweets – but I knew that wasn’t all of them. Dnflow had trouble collecting retweets 2 and quoted retweets 3. Plus no one (myself included) tags their tweets perfectly all the time.
My initial, optimistic workflow looked a bit like this:
- Compile all tweets from dnflow requests into a single spreadsheet.
- Review individual feeds.
- Add missing tweets.
- Categorize tweets based on which question in class, if any, the tweet responded to.
Simple, yes? Hilarious is more like it.
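As a rough illustration, the first item on that list might look like the Python sketch below. It assumes each weekly download is a CSV export with an `id` column that uniquely identifies a tweet; the column names and the sample rows are hypothetical, not dnflow's documented format. Because the same hashtag search was repeated every week, the sketch drops any tweet that has already been seen.

```python
import csv
import io


def merge_tweet_exports(csv_texts, id_field="id"):
    """Merge several CSV exports, keeping the first copy of each tweet.

    `id_field` is an assumed column name; adjust it to match the real export.
    """
    seen = set()
    merged = []
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            tweet_id = row[id_field]
            if tweet_id not in seen:  # skip tweets captured in an earlier week
                seen.add(tweet_id)
                merged.append(row)
    return merged


# Hypothetical sample data standing in for two weekly downloads.
week1 = "id,user,text\n1,alice,Herodotus is biased #hwc111\n2,bob,Is he though? #hwc111\n"
week2 = "id,user,text\n2,bob,Is he though? #hwc111\n3,carol,Sources matter #hwc111\n"

merged = merge_tweet_exports([week1, week2])
print(len(merged), "unique tweets")
```

With real files, the same function could be fed the contents of each downloaded CSV in turn. Tweets that never carried the hashtag would still have to be found by hand.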
The second task, “review individual feeds,” became an additional four sub-tasks. “Add missing tweets” turned into adding not only un-tagged tweets but also all replies by students, because I decided halfway through that I wanted to explore whether a network existed among students and, if so, what it looked like. I also added new tasks as I started reviewing the data, such as creating a column to describe the media (GIFs, images, quotes from class readings) attached to tweets.
For this sort of work, the experience and assistance of other people are clearly beneficial. Something like Jessica Otis’s workflow for examining a conference network with Gephi would have been exceptionally helpful at the start of this process (and certainly will be helpful as I experiment with Gephi). To that end, I’ve documented the workflow that emerged for me (really, it’s more of a task list), available via Google Drive. Ideally, this will help other Twitter-data newcomers avoid similar pitfalls in the future.
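For the network question, one possible starting point is to turn student replies into a weighted edge list. This is only a sketch under stated assumptions: the `user` and `in_reply_to` column names are hypothetical, the sample rows are invented, and the `Source`, `Target`, and `Weight` headers follow the convention Gephi's spreadsheet import expects for edge tables, as far as I understand it.

```python
import csv
import io
from collections import Counter


def reply_edges(rows, author_field="user", reply_field="in_reply_to"):
    """Count how often each author replied to each other account.

    Both field names are assumptions about how the spreadsheet is laid out.
    """
    edges = Counter()
    for row in rows:
        source = row.get(author_field, "").strip().lower()
        target = row.get(reply_field, "").strip().lower()
        # Skip tweets that are not replies, and replies to oneself.
        if not source or not target or source == target:
            continue
        edges[(source, target)] += 1
    return edges


def edges_to_csv(edges):
    """Render the edge counts as CSV text with Source/Target/Weight columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()


# Hypothetical rows standing in for the cleaned spreadsheet.
rows = [
    {"user": "alice", "in_reply_to": "bob"},
    {"user": "alice", "in_reply_to": "bob"},
    {"user": "bob", "in_reply_to": "alice"},
    {"user": "carol", "in_reply_to": ""},
]

edges = reply_edges(rows)
edge_csv = edges_to_csv(edges)
print(edge_csv)
```

Treating replies as directed edges keeps the distinction between who spoke and who was spoken to. Whether that distinction matters for this project is still an open question.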
2. How to be a historian who thinks with machines?
I ultimately added 1,671 tweets to the original 10,486 – about a 16% increase in the data set. I’m not sure yet whether or not this is a significant amount. (Though I’m sure some students or colleagues with working knowledge of statistics can tell me…)
I’m used to thinking like a historian in an archive, where documents are rare and every particular piece of evidence matters. This isn’t that kind of project, though. Instead, I’ll be visualizing and analyzing the contents of hundreds of tweets at a time. Will 16 extra tweets make a difference when analyzing a batch of 100?
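The arithmetic behind that question can be checked in a few lines. The sketch uses only the two counts reported above; the variable names are mine.

```python
original_tweets = 10_486  # tweets collected through dnflow
added_tweets = 1_671      # tweets added by hand afterwards

total_tweets = original_tweets + added_tweets

# Growth relative to the original collection.
increase = added_tweets / original_tweets

# Share of the final data set made up of added tweets.
share_of_total = added_tweets / total_tweets

print(f"Total tweets: {total_tweets}")
print(f"Increase over original: {increase:.1%}")
print(f"Share of final data set: {share_of_total:.1%}")
```

On these figures the added tweets are about 16% of the original collection but closer to 14% of the final one, so a batch of 100 tweets drawn from the full data set would contain roughly 14 additions rather than 16.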
My guess is that the additions may not make much of a difference to any text analysis involving large segments of the data set. Added tweets might, however, make a difference in the composition of the network of students. The new tweets might also contain some zingy and insightful quotes that allow me to make a point with a bit more panache. Like this one from a student processing the perspective and bias of the Greek historian, Herodotus.
— Xiaoling Tan (@xiaoling_tan) February 23, 2017
I suspect I’ll return to the question of how much data is worth saving, adding, and exploring. This is an important question in the broader practice of digital history. How should digital historians balance a disciplinary preference for the particularities of individual documents with a methodology that requires setting aside the particular, at least initially, in order to extract generalizations from massive sets of evidence? What will that look like for this particular project?
3. Backtracking is disheartening, but necessary.
While reviewing the individual Twitter feeds created by my students, I came up with a clever idea I believed would expedite the review process: save the missing tweets to a Twitter Moment as a place to store them, then return to record the retweets after completing the initial review of the feeds.4
Paige Morgan and Yvonne Lam, who led the Intro to Data Wrangling workshop at the Digital Humanities Summer Institute, warned participants that starting over is always a possibility. Paige also noted in a recent talk/blog post: “I say that I work with data, but in some ways, it feels more accurate to say that I work with various types of mess.”5
I wholeheartedly agree that starting over happens and that data is usually one type of mess or another, and I suspect this won’t be the last time I have to backtrack. Backtracking is disheartening and time-consuming, and that emotional toll perhaps could be better acknowledged in DH work – even if it’s a necessary part of the messy digital process.
4. It’s okay to leave some things for later.
In a recent chat with Veronica Armour, an Instructional Designer at Seton Hall University, I asked her what project management training she acquired before moving into her current field of work.6 Her answer was “not much” (which seems quite common), but she did recommend some online courses – particularly those that favor “agile management” over “waterfall management.”
My understanding of these models is quite basic, but here’s the heart of it. Agile models make it possible and even desirable to move forward even if all the pieces aren’t yet in place; the object is to continuously work toward goals by testing, seeking feedback, and testing again as new information or materials become available. Waterfall models, by contrast, require everything from one step of the project to be completed before moving on to the next.
I’m definitely a “waterfall” person when it comes to my own projects. I don’t like the feeling of incompleteness and I prefer to explore every possibility before moving on.
But that is shaping up to be an ineffective way to work with data – especially data that acts like a Hydra. With the next few stages of the project, then, I’m hoping to become more okay with leaving things for later.
1. Dnflow is a web-scraping program – an automated way to capture content from the internet. It is an early iteration of the more robust DocNow, “a tool and a community developed around supporting the ethical collection, use, and preservation of social media content.”
2. A retweet is when you share another tweet, without added commentary.
3. A quoted retweet is when you choose the retweet option on Twitter, but add content of your own to accompany the tweet. Sort of like a Facebook share if you added commentary above the link or video shared.
4. I was under the impression that a Moment was a bit like a Collection on Twitter or like Storify, a related app. This is not at all the case. Twitter instead has some very specific guidelines for how they’d like Moments to be used.
5. Emphasis mine.
6. Project management seems to be an ongoing concern in the digital humanities field – or at least this is my impression based on conversations and panels at the DH sessions at the American Historical Association’s meeting in January and at DHSI in June.